As the title suggests, I’d like to select the first row of each set of rows grouped with a GROUP BY.
Specifically, if I’ve got a purchases table that looks like this:
SELECT * FROM purchases;
I’d like to query for the id of the largest purchase (total) made by each customer. Something like this:
SELECT FIRST(id), customer, FIRST(total)
FROM   purchases
GROUP  BY customer
ORDER  BY total DESC;
DISTINCT ON is typically simplest and fastest for this in PostgreSQL.
(For performance optimization for certain workloads see below.)
SELECT DISTINCT ON (customer)
       id, customer, total
FROM   purchases
ORDER  BY customer, total DESC, id;
Or shorter (if not as clear) with ordinal numbers of output columns:
SELECT DISTINCT ON (2)
       id, customer, total
FROM   purchases
ORDER  BY 2, 3 DESC, 1;
If total can be NULL (won’t hurt either way, but you’ll want to match existing indexes):
... ORDER BY customer, total DESC NULLS LAST, id;
DISTINCT ON is a PostgreSQL extension of the standard (where only DISTINCT on the whole SELECT list is defined).
List any number of expressions in the DISTINCT ON clause; the combined row value defines duplicates. The manual:

    Obviously, two rows are considered distinct if they differ in at least one column value. Null values are considered equal in this comparison.
DISTINCT ON can be combined with ORDER BY. Leading expressions in ORDER BY must be in the set of expressions in DISTINCT ON, but you can rearrange the order among those freely.
You can add additional expressions to ORDER BY to pick a particular row from each group of peers. Or, as the manual puts it:

    The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group.

I added id as the last item to break ties: “Pick the row with the smallest id from each group sharing the highest total.”
To order results in a way that disagrees with the sort order determining the first per group, you can nest the above query in an outer query with another ORDER BY.
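A minimal sketch of that nesting (assuming you want the final output sorted by total instead):

SELECT *
FROM  (
   SELECT DISTINCT ON (customer)
          id, customer, total
   FROM   purchases
   ORDER  BY customer, total DESC, id  -- determines the row per group
   ) sub
ORDER  BY total DESC;                  -- final output order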
If total can be NULL, you most probably want the row with the greatest non-null value. Add NULLS LAST as demonstrated above.
The SELECT list is not constrained by expressions in DISTINCT ON or ORDER BY in any way (not needed in the simple case above):

- You don’t have to include any of the expressions in DISTINCT ON or ORDER BY.
- You can include any other expression in the SELECT list (see the sketch below). This is instrumental for replacing much more complex queries with subqueries and aggregate / window functions.
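For illustration, a sketch with extra expressions in the SELECT list that appear in neither DISTINCT ON nor ORDER BY (the expressions themselves are arbitrary):

SELECT DISTINCT ON (customer)
       id, customer, total
     , total / 100.0    AS total_dollars   -- arbitrary extra expression
     , upper(customer)  AS customer_upper  -- not in DISTINCT ON / ORDER BY
FROM   purchases
ORDER  BY customer, total DESC, id;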
I tested with Postgres versions 8.3 – 13. But the feature has been there at least since version 7.1, so basically always.
The perfect index for the above query would be a multi-column index spanning all three columns in matching sequence and with matching sort order:
CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);
May be too specialized. But use it if read performance for the particular query is crucial. If you have DESC NULLS LAST in the query, use the same in the index so that sort order matches and the index is applicable.
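For instance, a sketch of the matching index (the index name is my choice):

CREATE INDEX purchases_3c_nl_idx ON purchases (customer, total DESC NULLS LAST, id);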
Effectiveness / Performance optimization
Weigh cost and benefit before creating tailored indexes for each query. The potential of the above index largely depends on data distribution.
The index is used because it delivers pre-sorted data. In Postgres 9.2 or later the query can also benefit from an index only scan if the index is smaller than the underlying table. The index has to be scanned in its entirety, though.
For few rows per customer (high cardinality in column customer), this is very efficient. Even more so if you need sorted output anyway. The benefit shrinks with a growing number of rows per customer.
Ideally, you have enough work_mem to process the involved sort step in RAM and not spill to disk. But generally setting work_mem too high can have adverse effects. Consider SET LOCAL for exceptionally big queries. Find how much you need with EXPLAIN ANALYZE. Mention of “Disk:” in the sort step indicates the need for more (see the sketch after the links):
- Configuration parameter work_mem in PostgreSQL on Linux
- Optimize simple query using ORDER BY date and text
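A sketch of the SET LOCAL approach mentioned above (the work_mem value is a placeholder; size it to what EXPLAIN ANALYZE reports for your sort):

BEGIN;
SET LOCAL work_mem = '128MB';  -- placeholder value; only lasts for this transaction
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (customer)
       id, customer, total
FROM   purchases
ORDER  BY customer, total DESC, id;
COMMIT;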
For many rows per customer (low cardinality in column customer), a loose index scan (a.k.a. “skip scan”) would be (much) more efficient, but that’s not implemented up to Postgres 14. (An implementation for index-only scans is in development for Postgres 15.)
For now, there are faster query techniques to substitute for this. In particular if you have a separate table holding unique customers, which is the typical use case. But also if you don’t:
- SELECT DISTINCT is slower than expected on my table in PostgreSQL
- Optimize GROUP BY query to retrieve latest row per user
- Optimize groupwise maximum query
- Query last N related rows per row
On databases that support CTE and windowing functions:
WITH summary AS (
    SELECT p.id,
           p.customer,
           p.total,
           ROW_NUMBER() OVER (PARTITION BY p.customer
                              ORDER BY p.total DESC) AS rank
    FROM   PURCHASES p
)
SELECT *
FROM   summary
WHERE  rank = 1
Supported by any database, but you need to add logic to break ties:
SELECT MIN(x.id),  -- change to MAX if you want the highest
       x.customer,
       x.total
FROM   PURCHASES x
JOIN  (SELECT p.customer,
              MAX(total) AS max_total
       FROM   PURCHASES p
       GROUP  BY p.customer) y
  ON  y.customer  = x.customer
  AND y.max_total = x.total
GROUP  BY x.customer, x.total
Testing the most interesting candidates with Postgres 9.4 and 9.5 with a halfway realistic table of 200k rows in purchases and 10k distinct customer_id (avg. 20 rows per customer).
For Postgres 9.5 I ran a 2nd test with effectively 86446 distinct customers. See below (avg. 2.3 rows per customer).
Added an accented test for Postgres 13 below.
CREATE TABLE purchases (
  id          serial
, customer_id int   -- REFERENCES customer
, total       int   -- could be amount of money in Cent
, some_column text  -- to make the row bigger, more realistic
);
I use a serial (PK constraint added below) and an integer customer_id since that’s a more typical setup. Also added some_column to make up for typically more columns.
Dummy data, PK, index – a typical table also has some dead tuples:
INSERT INTO purchases (customer_id, total, some_column)  -- insert 200k rows
SELECT (random() * 10000)::int             AS customer_id  -- 10k customers
     , (random() * random() * 100000)::int AS total
     , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM   generate_series(1,200000) g;

ALTER TABLE purchases ADD CONSTRAINT purchases_id_pkey PRIMARY KEY (id);

DELETE FROM purchases WHERE random() > 0.9;  -- some dead rows

INSERT INTO purchases (customer_id, total, some_column)
SELECT (random() * 10000)::int             AS customer_id  -- 10k customers
     , (random() * random() * 100000)::int AS total
     , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM   generate_series(1,20000) g;  -- add 20k to make it ~ 200k

CREATE INDEX purchases_3c_idx ON purchases (customer_id, total DESC, id);

VACUUM ANALYZE purchases;
customer table – for the superior query:
CREATE TABLE customer AS
SELECT customer_id, 'customer_' || customer_id AS customer
FROM   purchases
GROUP  BY 1
ORDER  BY 1;

ALTER TABLE customer ADD CONSTRAINT customer_customer_id_pkey PRIMARY KEY (customer_id);

VACUUM ANALYZE customer;
In my second test for 9.5 I used the same setup, but with random() * 100000 to generate customer_id to get only few rows per customer_id.
Object sizes for table purchases, generated with a query taken from this related answer:
what                              | bytes/ct | bytes_pretty | bytes_per_row
----------------------------------+----------+--------------+---------------
core_relation_size                | 20496384 | 20 MB        | 102
visibility_map                    |        0 | 0 bytes      |   0
free_space_map                    |    24576 | 24 kB        |   0
table_size_incl_toast             | 20529152 | 20 MB        | 102
indexes_size                      | 10977280 | 10 MB        |  54
total_size_incl_toast_and_indexes | 31506432 | 30 MB        | 157
live_rows_in_text_representation  | 13729802 | 13 MB        |  68
----------------------------------+----------+--------------+---------------
row_count                         |   200045 |              |
live_tuples                       |   200045 |              |
dead_tuples                       |    19955 |              |
1. row_number() in CTE (see other answer)

WITH cte AS (
   SELECT id, customer_id, total
        , row_number() OVER (PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   )
SELECT id, customer_id, total
FROM   cte
WHERE  rn = 1;
2. row_number() in subquery (my optimization)

SELECT id, customer_id, total
FROM  (
   SELECT id, customer_id, total
        , row_number() OVER (PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   ) sub
WHERE  rn = 1;
3. DISTINCT ON (see other answer)

SELECT DISTINCT ON (customer_id)
       id, customer_id, total
FROM   purchases
ORDER  BY customer_id, total DESC, id;
4. rCTE with LATERAL subquery (see here)

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT id, customer_id, total
   FROM   purchases
   ORDER  BY customer_id, total DESC
   LIMIT  1
   )
   UNION ALL
   SELECT u.*
   FROM   cte c
   ,      LATERAL (
      SELECT id, customer_id, total
      FROM   purchases
      WHERE  customer_id > c.customer_id  -- lateral reference
      ORDER  BY customer_id, total DESC
      LIMIT  1
      ) u
   )
SELECT id, customer_id, total
FROM   cte
ORDER  BY customer_id;
5. customer table with LATERAL (see here)

SELECT l.*
FROM   customer c
,      LATERAL (
   SELECT id, customer_id, total
   FROM   purchases
   WHERE  customer_id = c.customer_id  -- lateral reference
   ORDER  BY total DESC
   LIMIT  1
   ) l;
6. array_agg() with ORDER BY (see other answer)

SELECT (array_agg(id ORDER BY total DESC))[1] AS id
     , customer_id
     , max(total) AS total
FROM   purchases
GROUP  BY customer_id;
Execution time for the above queries with EXPLAIN ANALYZE (and all options off), best of 5 runs.
All queries used an Index Only Scan on purchases_3c_idx (among other steps). Some of them just profited from the smaller size of the index, others more effectively.
A. Postgres 9.4 with 200k rows and ~ 20 per customer_id

1. 273.274 ms
2. 194.572 ms
3. 111.067 ms
4.  92.922 ms  -- !
5.  37.679 ms  -- winner
6. 189.495 ms
B. Same as A. with Postgres 9.5
1. 288.006 ms
2. 223.032 ms
3. 107.074 ms
4.  78.032 ms  -- !
5.  33.944 ms  -- winner
6. 211.540 ms
C. Same as B., but with ~ 2.3 rows per customer_id

1. 381.573 ms
2. 311.976 ms
3. 124.074 ms  -- winner
4. 710.631 ms
5. 311.976 ms
6. 421.679 ms
Retest with Postgres 13 on 2021-08-11
Simplified test setup: not deleting rows, because VACUUM ANALYZE cleans the table completely for the simple case.
- General performance improvements.
- CTEs can be inlined since Postgres 12, so queries 1. and 2. now perform mostly identically (same query plan); see the sketch below.
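If you ever need the old behavior back (for this query you normally don’t), a sketch using the MATERIALIZED keyword available since Postgres 12:

WITH cte AS MATERIALIZED (  -- forces the pre-v12 behavior: CTE is not inlined
   SELECT id, customer_id, total
        , row_number() OVER (PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   )
SELECT id, customer_id, total
FROM   cte
WHERE  rn = 1;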
D. Like B. ~ 20 rows per customer_id
1. 103 ms
2. 103 ms
3.  23 ms  -- winner
4.  71 ms
5.  22 ms  -- winner
6.  81 ms
E. Like C. ~ 2.3 rows per customer_id
1. 127 ms
2. 126 ms
3.  36 ms  -- winner
4. 620 ms
5. 145 ms
6. 203 ms
Accented tests with Postgres 13
1M rows, 10.000 vs. 100 vs. 1.6 rows per customer.
F. with ~ 10.000 rows per customer
1. 526 ms
2. 527 ms
3. 127 ms
4.   2 ms  -- winner !
5.   1 ms  -- winner !
6. 356 ms
G. with ~ 100 rows per customer
1. 535 ms
2. 529 ms
3. 132 ms
4. 108 ms  -- !
5.  71 ms  -- winner
6. 376 ms
H. with ~ 1.6 rows per customer
1.  691 ms
2.  684 ms
3.  234 ms  -- winner
4. 4669 ms
5. 1089 ms
6. 1264 ms
DISTINCT ON uses the index effectively and typically performs best for few rows per group. And it performs decently even with many rows per group.
For many rows per group, emulating an index skip scan with an rCTE performs best – second only to the query technique with a separate lookup table (if that’s available).
row_number() (technique of the currently accepted answer) never wins any performance test. Not then, not now. It never comes even close to DISTINCT ON, not even when the data distribution is unfavorable for the latter. The only good thing about row_number(): it does not scale terribly, just mediocre.
Here is a newer benchmark by “ogr” testing with 10M rows and 60k unique “customers” on Postgres 11.5 (current as of Sep. 2019). Results are still in line with what we have seen so far:
Original (outdated) benchmark from 2011
I ran three tests with PostgreSQL 9.1 on a real life table of 65579 rows and single-column btree indexes on each of the three columns involved and took the best execution time of 5 runs.
Comparing @OMGPonies’ first query (A) to the above DISTINCT ON solution (B):
1. Select the whole table. Results in 5958 rows in this case.

   A: 567.218 ms
   B: 386.673 ms

2. Use condition WHERE customer BETWEEN x AND y, resulting in 1000 rows.

   A: 249.136 ms
   B:  55.111 ms

3. Select a single customer with WHERE customer = x.

   A: 0.143 ms
   B: 0.072 ms
Same test repeated with the index described in the other answer:
CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);
1A: 277.953 ms
1B: 193.547 ms

2A: 249.796 ms  -- special index not used
2B:  28.679 ms

3A: 0.120 ms
3B: 0.048 ms
This is the common greatest-n-per-group problem, which already has well-tested and highly optimized solutions. Personally, I prefer the left join solution by Bill Karwin (the original post with lots of other solutions).

Note that a bunch of solutions to this common problem can surprisingly be found in one of the most official sources, the MySQL manual! See Examples of Common Queries :: The Rows Holding the Group-wise Maximum of a Certain Column.
In Postgres you can use array_agg like this:

SELECT customer
     , (array_agg(id ORDER BY total DESC))[1] AS id
     , max(total) AS total
FROM   purchases
GROUP  BY customer;
This will give you the id of each customer’s largest purchase.
Some things to note:
- array_agg is an aggregate function, so it works with GROUP BY.
- array_agg lets you specify an ordering scoped to just itself, so it doesn’t constrain the structure of the whole query. There is also syntax for how you sort NULLs, if you need to do something different from the default (see the sketch below).
- Once we build the array, we take the first element. (Postgres arrays are 1-indexed, not 0-indexed.)
- You could use array_agg in a similar way for your third output column, but max(total) is simpler.
- Unlike DISTINCT ON, using array_agg lets you keep your GROUP BY, in case you want that for other reasons.
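Putting the notes together, a sketch that also makes the NULL ordering explicit (the NULLS LAST choice is mine):

SELECT customer
     , (array_agg(id ORDER BY total DESC NULLS LAST))[1] AS id  -- first element of ordered array
     , max(total) AS total
FROM   purchases
GROUP  BY customer;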
This solution is not very efficient (as pointed out by Erwin) because of the presence of subqueries:

SELECT *
FROM   purchases p1
WHERE  total IN (SELECT max(total)
                 FROM   purchases
                 WHERE  p1.customer = customer)
ORDER  BY total DESC;
SELECT purchases.*
FROM   purchases
LEFT   JOIN purchases AS p
       ON  p.customer = purchases.customer
       AND purchases.total < p.total
WHERE  p.total IS NULL
HOW DOES THAT WORK? (I’ve been there.)

We want to make sure that we only have the highest total for each customer.
Some Theoretical Stuff (skip this part if you only want to understand the query)
Let Total be a function T(customer, id) which returns the total for a given customer and id. To prove that a given total T(customer, id) is the highest, we want to prove either:

- ∀x T(customer, id) > T(customer, x) (this total is higher than all other totals for that customer), or
- ¬∃x T(customer, id) < T(customer, x) (there exists no higher total for that customer)

The first approach would require us to fetch all the records for that customer, which I do not really like. The second one needs a smart way to say there can be no record higher than this one.
Back to SQL
If we left join the table to itself on the customer and on the total being less than the joined table’s:

LEFT JOIN purchases AS p
      ON  p.customer = purchases.customer
      AND purchases.total < p.total

we make sure that every record that has another record with a higher total for the same customer gets joined:

+--------------+--------------------+-----------------+------+------------+---------+
| purchases.id | purchases.customer | purchases.total | p.id | p.customer | p.total |
+--------------+--------------------+-----------------+------+------------+---------+
|            1 | Tom                |             200 |    2 | Tom        |     300 |
|            2 | Tom                |             300 |      |            |         |
|            3 | Bob                |             400 |    4 | Bob        |     500 |
|            4 | Bob                |             500 |      |            |         |
|            5 | Alice              |             600 |    6 | Alice      |     700 |
|            6 | Alice              |             700 |      |            |         |
+--------------+--------------------+-----------------+------+------------+---------+
That helps us filter for the highest total of each customer with no grouping needed:

WHERE p.total IS NULL

+--------------+--------------------+-----------------+------+------------+---------+
| purchases.id | purchases.customer | purchases.total | p.id | p.customer | p.total |
+--------------+--------------------+-----------------+------+------------+---------+
|            2 | Tom                |             300 |      |            |         |
|            4 | Bob                |             500 |      |            |         |
|            6 | Alice              |             700 |      |            |         |
+--------------+--------------------+-----------------+------+------------+---------+
And that’s the answer we need.
I use this approach (PostgreSQL only): https://wiki.postgresql.org/wiki/First/last_%28aggregate%29
-- Create a function that always returns the first non-NULL item
CREATE OR REPLACE FUNCTION public.first_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
        SELECT $1;
$$;

-- And then wrap an aggregate around it
CREATE AGGREGATE public.first (
        sfunc    = public.first_agg,
        basetype = anyelement,
        stype    = anyelement
);

-- Create a function that always returns the last non-NULL item
CREATE OR REPLACE FUNCTION public.last_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
        SELECT $2;
$$;

-- And then wrap an aggregate around it
CREATE AGGREGATE public.last (
        sfunc    = public.last_agg,
        basetype = anyelement,
        stype    = anyelement
);
Then your example should work almost as is:
SELECT FIRST(id), customer, FIRST(total)
FROM   purchases
GROUP  BY customer
ORDER  BY FIRST(total) DESC;
CAVEAT: It ignores NULL values.
Edit 1 – Use the postgres extension instead
Now I use this way: http://pgxn.org/dist/first_last_agg/
To install on Ubuntu 14.04:

apt-get install postgresql-server-dev-9.3 git build-essential -y
git clone git://github.com/wulczer/first_last_agg.git
cd first_last_agg
make && sudo make install
psql -c 'create extension first_last_agg'
It’s a postgres extension that gives you first and last functions; apparently faster than the above way.
Edit 2 – Ordering and filtering
If you use aggregate functions (like these), you can order the results without the need to have the data already ordered. So the equivalent example, with ordering, would be something like:
SELECT first(id ORDER BY id), customer, first(total ORDER BY id)
FROM   purchases
GROUP  BY customer
ORDER  BY first(total);
Of course you can order and filter as you deem fit within the aggregate; it’s very powerful syntax.
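For example, a sketch combining an in-aggregate ORDER BY with a FILTER clause (FILTER requires Postgres 9.4+; first is the custom aggregate from above):

SELECT first(id ORDER BY total DESC) FILTER (WHERE total IS NOT NULL) AS id
     , customer
     , first(total ORDER BY total DESC) FILTER (WHERE total IS NOT NULL) AS total
FROM   purchases
GROUP  BY customer;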
In SQL Server you can do this:
SELECT *
FROM  (
    SELECT ROW_NUMBER() OVER (PARTITION BY customer ORDER BY total DESC) AS StRank, *
    FROM   Purchases
    ) n
WHERE  StRank = 1
Explanation: the rows are partitioned by customer and ordered by total descending; each row within a partition is given a serial number as StRank, and we keep only the first row of each partition, i.e. the one whose StRank is 1.
Very fast solution
SELECT a.*
FROM   purchases a
JOIN  (
    SELECT customer, min(id) AS id
    FROM   purchases
    GROUP  BY customer
    ) b USING (id);
and really very fast if the table is indexed by id:
create index purchases_id on purchases (id);
Snowflake/Teradata support the QUALIFY clause, which works like HAVING for window functions:

SELECT id, customer, total
FROM   PURCHASES
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer ORDER BY total DESC) = 1
The accepted OMG Ponies’ “Supported by any database” solution has good speed from my test.

Here I provide a same-approach but more complete and clean any-database solution. Ties are considered (assume the desire to get only one row for each customer, even if there are multiple records for the max total per customer), and other purchase fields (e.g. purchase_payment_id) will be selected for the real matching rows in the purchase table.
Supported by any database:
select *
from   purchase
join  (
    select min(id) as id
    from   purchase
    join  (
        select customer, max(total) as total
        from   purchase
        group  by customer
        ) t1 using (customer, total)
    group  by customer
    ) t2 using (id)
order  by customer
This query is reasonably fast especially when there is a composite index like (customer, total) on the purchase table.
t1, t2 are subquery aliases which could be removed depending on the database.
The using (...) clause is not supported in MS-SQL and Oracle as of this edit in Jan 2017. You have to expand it yourself to e.g. on t2.id = purchase.id etc. (see the sketch below). The USING syntax works in SQLite, MySQL and PostgreSQL.
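A sketch of the expanded version with explicit ON clauses (for databases without USING support; note that GROUP BY and ORDER BY now need qualified column names):

select purchase.*
from   purchase
join  (
    select min(purchase.id) as id
    from   purchase
    join  (
        select customer, max(total) as total
        from   purchase
        group  by customer
        ) t1 on  t1.customer = purchase.customer
             and t1.total    = purchase.total
    group  by purchase.customer
    ) t2 on t2.id = purchase.id
order  by purchase.customer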
In PostgreSQL, another possibility is to use the first_value window function in combination with SELECT DISTINCT:

SELECT DISTINCT customer_id
     , first_value(row(id, total)) OVER (PARTITION BY customer_id ORDER BY total DESC, id)
FROM   purchases;
I created a composite (id, total), so both values are returned by the same aggregate. You can of course always apply first_value() twice, once per column.
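If you need the two values back as separate columns, one way (a sketch; the composite type name and the int column types are my assumptions) is to cast the row to a named composite type and decompose it in an outer query:

CREATE TYPE id_total AS (id int, total int);  -- hypothetical helper type

SELECT customer_id, (f).id, (f).total  -- field access on the composite value
FROM  (
    SELECT DISTINCT customer_id
         , first_value((id, total)::id_total)
             OVER (PARTITION BY customer_id ORDER BY total DESC, id) AS f
    FROM   purchases
    ) sub;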
This works for me:

SELECT article, dealer, price
FROM   shop s1
WHERE  price = (SELECT MAX(s2.price)
                FROM   shop s2
                WHERE  s1.article = s2.article
                GROUP  BY s2.article)
ORDER  BY article;
This selects the highest price for each article.
If you want to select any row (by some specific condition of yours) from the set of aggregated rows, or if you want to use another aggregation function (sum/avg) in addition to max/min (so you cannot use the DISTINCT ON trick), you can use the following subquery:
SELECT ( SELECT id
         FROM   t2
         WHERE  id = ANY ( ARRAY_AGG( tf.id ) )
         AND    amount = MAX( tf.amount )
       ) AS id
     , name
     , MAX( amount ) AS ma
     , SUM( ratio )
FROM   t2 tf
GROUP  BY name
You can replace amount = MAX( tf.amount ) with any condition you want, with one restriction: this subquery must not return more than one row.
But if you want to do such things, you are probably looking for window functions.
For SQL Server the most efficient way is:

with ids as (  -- condition for splitting the table into groups
    select i
    from  (values (9),(12),(17),(18),(19),(20),(22),(21),(23),(10)) as v(i)
)
, src as (
    select *
    from   yourTable
    where  <condition>  -- use this as a filter for other conditions
)
, joined as (
    select tops.*
    from   ids
    cross  apply  -- it's like "for each row"
    (
        select top(1) *
        from   src
        where  CommodityId = ids.i
    ) as tops
)
select * from joined
and don’t forget to create a clustered index for the columns used (see the sketch below).
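A sketch of such an index (names are placeholders; note that a table can have only one clustered index, so if the primary key is already clustered, use a nonclustered index instead):

CREATE CLUSTERED INDEX CIX_yourTable_CommodityId
    ON yourTable (CommodityId);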
My approach via window functions (dbfiddle):

- Assign row_number() within each group:
  row_number() over (partition by agreement_id, order_id) as nrow
- Take only the first row of each group:
  filter (where nrow = 1)
with intermediate as (
    select *
         , row_number() over (partition by agreement_id, order_id) as nrow
         , (sum(suma) over (partition by agreement_id, order_id))::numeric(10,2) as order_suma
    from   <your table>
)
select *
     , sum(order_suma) filter (where nrow = 1) over (partition by agreement_id)
from   intermediate
This can be achieved easily with the MAX function on total and GROUP BY id and customer:

SELECT id, customer, MAX(total)
FROM   purchases
GROUP  BY id, customer
ORDER  BY total DESC;