SQL Top n Rows

From Exterior Memory
Jump to: navigation, search

Let's assume the following table:

Sales:

employee product price
2 Printer 50.00
2 Mouse 10.00
2 Keyboard 15.00
3 Laptop 1500.00
NULL Desk 200.00

If you are interested in testing the queries below, here is the SQL to create these tables:

CREATE TABLE sales (
  employee int,
  product char(63),
  price float,
  KEY (employee)
);
INSERT INTO sales VALUES ('2',  'Keyboard', 15.00);
INSERT INTO sales VALUES ('2',  'Printer',50.00);
INSERT INTO sales VALUES ('2',  'Mouse',  10.00);
INSERT INTO sales VALUES ('3',  'Laptop', 1500.00);
INSERT INTO sales VALUES (NULL, 'Desk',   200.00);

Select Overall Top Rows

If you want to select the overal top n rows from a table, you can simply order the results and use LIMIT.

The top-2 sales:

(for MySQL and PostgreSQL:)
SELECT employee, product, price FROM sales ORDER BY price DESC LIMIT 2
(for MS SQL:)
SELECT TOP 2 employee, product, price FROM sales ORDER BY price DESC
(for Oracle:)
SELECT employee, product, price FROM sales ORDER BY price WHERE ROWNUM <= 2

gives:

employee product price
3 Laptop 1500.00
NULL Desk 200.00

As you can see, the query differs per SQL implementation. I will use the MySQL/PostgreSQL syntax in the rest of this article.

Selecting a Single Top Row

Let's assume that you want to select the best sale for each employee id, rather than the overall best sales. This requires more complex queries. Let's first examine the case where you only want a single best sales for each employee (rather than the 2 or 3 best sales).

There are a few options here, each with different performance characteristics and requirements for your SQL database.

Group By Having construct

SELECT s1.employee, s1.product, s1.price FROM sales s1 
  LEFT JOIN sales s2 ON s1.employee=s2.employee 
  GROUP BY s1.employee,s1.product,s1.price 
  HAVING max(s2.price)=s1.price

gives:

employee product price
2 Printer 50.00
3 Laptop 1500.00

Double Table Comparison Construct

SELECT s1.employee, s1.product, s1.price FROM sales s1 
  LEFT JOIN sales s2 ON s1.employee=s2.employee AND s1.price < s2.price 
  WHERE s2.price IS NULL

which gives:

employee product price
2 Printer 50.00
3 Laptop 1500.00
NULL Desk 200.00

You may wonder why both queries have a LEFT JOIN with themselves, even though we don't SELECT any field from the joined table s2. Good question. What we basically do is, for each row in the sales table (s1), compare to all other rows in the sales table (s2) with the same employee id. Like this:

SELECT s1.employee, s1.product, s1.price, s2.employee, s2.product, s2.price 
  FROM sales s1 LEFT JOIN sales s2 ON s1.employee=s2.employee
s1.employee s1.product s1.price s2.employee s2.product s2.price
2 Keyboard 15.00 2 Keyboard 15.00
2 Keyboard 15.00 2 Printer 50.00
2 Keyboard 15.00 2 Mouse 10.00
2 Printer 50.00 2 Keyboard 15.00
2 Printer 50.00 2 Printer 50.00
2 Printer 50.00 2 Mouse 10.00
2 Mouse 10.00 2 Keyboard 15.00
2 Mouse 10.00 2 Printer 50.00
2 Mouse 10.00 2 Mouse 10.00
3 Laptop 1500.00 3 Laptop 1500.00
NULL Desk 200.00 NULL NULL NULL

Since employee 2 made three sales, there are 3*3 = 9 rows for him. Of these 9 rows, we select the one with the highest value. For the first query, we GROUP BY a single row (GROUP BY all properties of a row; it would have been more efficient if we had a unique key for the sales table!), and select the value HAVING the maximum price for all entries in the group. The second query is an alternative for those database servers that do not understand the HAVING clause. There, we simply select all rows from s2 whose prices is higher then s1.price. We find one or more for each row, except if there is no higher price: then s2.price will be NULL. That is exactly the value we want (we want the highest price), so we select only those.

Subquery Construct

SELECT s1.employee, s1.product, s1.price FROM sales s1 WHERE price=
  (SELECT MAX(price) FROM sales s2 WHERE s1.employee=s2.employee)

gives:

employee product price
2 Printer 50.00
3 Laptop 1500.00

Top-2 rows

An alternative to the subquery SELECT MAX(price) FROM sales s2 WHERE s1.employee=s2.employee is: SELECT price FROM sales s2 WHERE s1.employee=s2.employee ORDER BY price DESC LIMIT 1.

The advantage of the later query is that we can decide not to use only the top-1 (LIMIT 1), but also the top 2 (LIMIT 2) rows. This gives the query:

SELECT s1.employee, s1.product, s1.price FROM sales s1 WHERE price >= ANY 
  (SELECT price FROM sales s2 WHERE s1.employee=s2.employee ORDER BY price DESC LIMIT 2)

gives:

employee product price
2 Printer 50.00
2 Keyboard 15.00
3 Laptop 1500.00

The above query does not work for all SQL databases (MySQL 5 can not deal with ANY and LIMIT combined in a subquery). However, there is another trick. We can add a "rank" for each sale, using a subquery in the SELECT statement rather than in the WHERE statement:

SELECT employee, product, price, 
  (SELECT COUNT(*) FROM sales s2 WHERE s2.price>=s1.price AND s2.employee=s1.employee) AS rank 
  FROM sales s1

which results in:

employee product price rank
2 Keyboard 15.00 2
2 Printer 50.00 1
2 Mouse 10.00 3
3 Laptop 1500.00 1
NULL Desk 200.00 0

With a WHERE clause, we can only select those rows whose rank <= 2. However, since rank is not defined in the scope of the WHERE clause, we must write:

SELECT employee, product, price, 
  (SELECT COUNT(*) FROM sales s2 WHERE s2.price>=s1.price AND s2.employee=s1.employee) AS rank 
  FROM sales s1
  WHERE s2.price>=s1.price AND s2.employee=s1.employee) <= 2

However, if we know that the following to statements are equivalent:

SELECT employee, product, price FROM sales s1</tt> and 
SELECT * FROM (SELECT employee, product, price FROM sales s1) AS x

We can use the shorter SQL query:

SELECT * FROM (SELECT employee, product, price, 
  (SELECT COUNT(*) FROM sales s2 WHERE s2.price>=s1.price AND s2.employee=s1.employee) AS rank 
  FROM sales s1) as x
  WHERE rank <= 2

which results in:

employee product price rank
2 Keyboard 15.00 2
2 Printer 50.00 1
3 Laptop 1500.00 1
NULL Desk 200.00 0

Joining with Other Tables

For examples how to join this with other tables, see SQL Joins