in ,

Distinct on: a confusing, unique and useful feature in Postgres, Hacker News

Distinct on: a confusing, unique and useful feature in Postgres, Hacker News

We saw how DISTINCT works at the end of this post: Select Statement In Postgres With Examples

Let’s explore it further.

When I saw DISTINCT ON, I was like, there must not be anything new about it, you know, just another similar kind of feature with a different name. But I was wrong! It seems very powerful feature to me at least!

What I did was nothing special. Googled it and went through tons of articles. Most of them are filled with the official documentation but not the simple explanation and some of them are with nice decent explanation. I couldn’t understand most of it, to be frank. I was wondering why the feature is so hard to get hold of! So, I tried to play with it and find out more about it. Before I show some of my findings let’s just go through some crappy theory first!

As per the official documentation, SELECT DISTINCT ON (expression [, …]) keeps only the first row of each set of rows where the given expressions evaluate to equal.

In simple terms, if we sue DISTINCT ON then it will give us the first result from the set of results which are grouped together.

Now the question is, when did we group them? We just simply use DISTINCT ON. Well, with DISTINCT ON, we just want PostgreSQL to return a single row for each distinct group defined by the ON clause.

The DISTINCT ON clause will only return the first row based on the DISTINCT ON (column) and ORDER BY clause provided in the query. For other columns, it will return the corresponding values. Basically, it LIMITs 1 by default when we use DISTINCT ON.

The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY. That means we can decide what row we want in our results-but by only ascending and descending the columns.

Note that the “first row” of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first. That means the results will be different if we don’t use ORDER BY as Postgres is not smart enough to know our minds! Basically, the results will not be in order. For example,

  SELECT DISTINCT ON (location) location, time, report  FROM weather_reports; 

Above example is from official documentation. Now, we are not adding ORDER BY clause in it and that’s why we don’t know which of the rows will be selected. If we add, ORDER BY like the following example, we can be sure of a specific row.

  SELECT DISTINCT ON (location) location, time, report  FROM weather_reports  ORDER BY location, time DESC; 

The query retrieves the most recent weather report for each location.

Now, let’s just take a few more examples.

Is there any difference between the results of DISTINCT and DISTINCT ON queries?

Yes, there is. That’s why they have both! Take a look at the queries and results below.

I am using this database. https://github.com/ydchauh/yogeshchauhan.com-public

Simple DISTINCT query:

  SELECT DISTINCT country, contact_name, company_name FROM customers; 

Output:

distinct-on-query-output-yogesh-chauhan

DISTINCT ON query:

  SELECT DISTINCT ON (country) contact_name, country, company_name FROM customers; 

Output:

distinct-on-query-output-yogesh-chauhan

What about adding ORDER BY?

  SELECT DISTINCT ON (country) contact_name, country, company_name FROM customers  ORDER BY country; 

Output:

distinct-on-query-output-yogesh-chauhan

There is no difference between the outputs from the previous and the current query. That’s because Postgres is showing results in ASC (Ascending) order by default.

Let’s try DSC (Descending) order.

  SELECT DISTINCT ON (country) contact_name, country, company_name FROM customers  ORDER BY country DESC; 

Output:

As we can see, the results are completely different. So, when we use thie DISTINCT ON, we need to be careful what we want in our results.

Now, by this point, we know that it acts like GROUP BY.

Then, why do we need GROUP BY in Postgres?

Well, let’s understand that by queries.

  SELECT country, contact_name, company_name FROM customers group by country; 

The query above will raise an error as we are not using COUNT or any other aggregate functions as well as we are not adding all the columns to the GROUP BY clause. So, one of those options we need to choose. Either add aggregate function or add all columns to the GROUP BY clause.

The screenshot of the error:

distinct-on-query-output-yogesh-chauhan

So, if we add all the columns to the GROUP BY clause then we will basically get the same results as just simple DISTINCT query.

  SELECT country, contact_name, company_name FROM customers  GROUP BY country, contact_name, company_name; 

Output:

We can not add an aggregate function to just one column. Take a look at the query and results below.

  SELECT country, COUNT (contact_name), company_name FROM customers group by country;     

distinct-on-query-output-yogesh-chauhan

So, we need to write down query like this:

SELECT country, COUNT (contact_name), COUNT (company_name) FROM customers group by country;

  SELECT country, COUNT (contact_name), COUNT (company_name) FROM customers GROUP BY country; 

Output:

So, as we can see in the screenshot above, we are getting the number of rows using GROUP BY but if we want the first result from all those groups, we need to use DISTINCT ON and that’s why I consider it a powerful feature of Postgres.

Sources: https://www.postgresql.org/docs/9.5/sql-select.html distinct-on-query-output-yogesh-chauhandistinct-on-query-output-yogesh-chauhan (Read More) distinct-on-query-output-yogesh-chauhan

What do you think?

Leave a Reply

Your email address will not be published. Required fields are marked *

GIPHY App Key not set. Please check settings

A special announcement about SAT Subject Tests | MIT Admissions, Hacker News

A special announcement about SAT Subject Tests | MIT Admissions, Hacker News

Tax Day moved: You now have until July 15 to file your 2019 tax return, Recode

Tax Day moved: You now have until July 15 to file your 2019 tax return, Recode