SQL Joins

From Exterior Memory
Revision as of 13:43, 8 July 2007 by MacFreek (Talk | contribs) (Top-2 sales)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

SQL joins can be complex. This page refreshes your knowledge by giving a few examples.

Let's assume the following two tables:

Employee:

id first name last name
1 John Smith
2 James Williams
3 David Taylor

Sales:

employee product price
2 Printer 50.00
2 Mouse 10.00
3 Laptop 1500.00
NULL Desk 200.00

If you are interested in testing the queries below, here is the SQL to create these tables:

CREATE TABLE employee (
  id int PRIMARY KEY,
  firstname char(63),
  lastname char(63)
);
INSERT INTO employee VALUES ('1', 'John','Smith');
INSERT INTO employee VALUES ('2', 'James','Williams');
INSERT INTO employee VALUES ('3', 'David','Taylor');
CREATE TABLE sales (
  employee int,
  product char(63),
  price float,
  KEY (employee)
);
INSERT INTO sales VALUES ('2',  'Printer',50.00);
INSERT INTO sales VALUES ('2',  'Mouse',  10.00);
INSERT INTO sales VALUES ('3',  'Laptop', 1500.00);
INSERT INTO sales VALUES (NULL, 'Desk',   200.00);

Join

The following three statements are equivalent:

SELECT firstname, lastname, product, price FROM employee, sales WHERE employee.id=sales.employee
SELECT firstname, lastname, product, price FROM employee JOIN sales ON employee.id=sales.employee
SELECT firstname, lastname, product, price FROM employee INNER JOIN sales ON employee.id=sales.employee

All result in:

First name Last name product price
James Williams Printer 50.00
James Williams Mouse 10.00
David Taylor Laptop 1500.00

Note that neither John Smith, nor the Desk is listed, since they do not have an equivalent in the other table. Also, note that James Williams occurs twice.

Left Join and Right Join

To get at least one listing of each employee, use a left join:

SELECT firstname, lastname, product, price FROM employee LEFT JOIN sales ON employee.id=sales.employee

This results in:

First name Last name product price
John Smith NULL NULL
James Williams Printer 50.00
James Williams Mouse 10.00
David Taylor Laptop 1500.00

To get at least one listing of each sale, use a right join:

SELECT firstname, lastname, product, price FROM employee RIGHT JOIN sales ON employee.id=sales.employee

This results in:

First name Last name product price
James Williams Printer 50.00
James Williams Mouse 10.00
David Taylor Laptop 1500.00
NULL NULL Desk 200.00

Grouping

The left join above did list all employees, but some employees had duplicate records. That may not be what you want. You can solve this by a GROUP BY:

(warning: bad example!)
SELECT firstname, lastname, product, price FROM employee 
  LEFT JOIN sales ON employee.id=sales.employee GROUP BY employee.id

Which results in:

First name Last name product price
John Smith NULL NULL
James Williams Printer 50.00
David Taylor Laptop 1500.00

However, a big warning is in place: There is no way to know which of the two sales of James Williams will appear in the result!

If you are only interested in the highest sale made, this is simply:

SELECT firstname, lastname, MAX(price) AS "highest sale" FROM employee 
  LEFT JOIN sales ON employee.id=sales.employee GROUP BY employee.id
First name Last name highest sale
John Smith NULL
James Williams 50.00
David Taylor 1500.00

Do not make the mistake to also select the product:

(warning: bad example!)
SELECT firstname, lastname, product, MAX(price) AS "highest sale" FROM employee 
  LEFT JOIN sales ON employee.id=sales.employee GROUP BY employee.id

As the product and price entry may not match (note that the mouse cost €10 and the printer €50, not the other way around):

First name Last name product highest sale
John Smith NULL NULL
James Williams Mouse 50.00
David Taylor Laptop 1500.00

In fact, some databases like PostgreSQL won't even let you execute the above SQL statement, because the result is undefined.

To find the correct way to query the highest sale of each employee, we need to dive into queries that select the top n rows in a table.

Selecting Top Rows

The above problems boil down to the problem that you want to select a set of specific rows in the sales table. In this case, you want, for each employee, the record with the highest price. This problem is generally know as "selecting top n rows".

We can either use a query with HAVING, or a slightly less efficient query without a HAVING clause:

SELECT firstname, lastname, s1.product AS "best sale", s1.price AS "highest price"
  FROM employee
  LEFT JOIN sales s1 ON employee.id=s1.employee 
  LEFT JOIN sales s2 ON s1.employee=s2.employee 
  GROUP BY employee.id
  HAVING max(s2.price)=s1.price

with result:

First name Last name product highest sale
James Williams Printer 50.00
David Taylor Laptop 1500.00

and:

SELECT firstname, lastname, s1.product AS "best sale", s1.price AS "highest price"
  FROM employee
  LEFT JOIN sales s1 ON employee.id=s1.employee 
  LEFT JOIN sales s2 ON s1.employee=s2.employee AND s1.price < s2.price 
  WHERE s2.price IS NULL

with result:

First name Last name best sale highest price
John Smith NULL NULL
James Williams Printer 50.00
David Taylor Laptop 1500.00

Note that these are relatively complex queries, because you LEFT JOIN the sales table twice: once to select an entry, and once to "loop" through all rows in the sales table to make sure the current row has the highest value. The details are explained on the page SQL Top n Rows.

Subselections

Smart readers will have noticed that the top-n queries are very inefficient. For each employee with n sales, there are n*n records int the LEFT JOIN queries, and only 1 is filtered out. If you have a lot of records, this may become slow.

A better solution is to use subqueries, as they support the "ANY", "ALL" and "SOME" comparison.

In this case, what we really like is to LEFT JOIN with a WHERE clause determined by a subquery:

SELECT firstname, lastname, product AS "best sale", price AS "highest price" FROM employee 
  LEFT JOIN sales ON 
  price=(SELECT MAX(price) FROM sales WHERE employee.id=sales.employee)

which results in:

First name Last name best sale highest price
John Smith NULL NULL
James Williams Printer 50.00
David Taylor Laptop 1500.00

This is without doubt the most elegant and probably fastest way to make this query. However, not all database servers may support subqueries.

Top-2 sales

Using the technique explained in SQL Top n Rows, we can select the top-n (here top 2) sales for each employee. We first select the two highest prices for that employee with the subquery SELECT price FROM sales WHERE employee.id=sales.employee ORDER BY price DESC LIMIT 2, and then use that result to select only rows whose price is higher than at least one of those values:

SELECT firstname, lastname, product AS "best sale", price AS "highest price" FROM employee 
  LEFT JOIN sales ON employee.id=sales.employee AND
  price>= ANY (SELECT price FROM sales WHERE employee.id=sales.employee ORDER BY price DESC LIMIT 2)

results in:

First name Last name best sale highest price
John Smith NULL NULL
James Williams Keyboard 15.00
James Williams Printer 50.00
David Taylor Laptop 1500.00