Understanding LEFT JOINS

Carlton Patterson 741 Reputation points
2021-01-05T13:00:42.073+00:00

Hello Forum,

I would like to get a practical understanding of when to use Left Joins. I understand Left Joins from the point where we use Left Joins to find missing data. However, the following example uses Left Joins to provide the results needed for the the query.

The logic is as follows:

Find the number of orders, the number of customers and the total cost of orders per each city that has made at least 5 orders.
Output each calculation along with the corresponding city name.

The tables used for the query are as follows:

CREATE TABLE customers (
    id int,
    first_name varchar(50),
    last_name varchar(50),
    city varchar(50),
    address varchar(50),
    phone_number varchar(50))

INSERT customers VALUES
(8,'John','Joseph','Arizona','','928-386-8164'),
(7,'Jill','Michael','Florida','','813-297-0692'),
(4,'William','Daniel','Colorado','','813-368-1200'),
(5,'Henry','Jackson','Hawaii','','808-601-7513'),
(13,'Emma','Isaac','Hawaii','','808-690-5201'),
(14,'Liam','Samuel','Hawaii','','808-555-5201'),
(15,'Mia','Owen','Hawaii','','808-640-5201'),
(1,'Mark','Thomas','Arizona','4476 Parkway Drive','602-993-5916'),
(12,'Eva','Lucas','Arizona','4379 Skips Lane','301-509-8805'),
(6,'Jack','Aiden','Arizona','4833 Coplin Avenue','480-303-1527'),
(2,'Mona','Adrian','California','1958 Peck Court','714-409-9432'),
(10,'Lili','Oliver','California','3832 Euclid Avenue','530-695-1180'),
(3,'Farida','Joseph','Florida','3153 Rhapsody Street','813-368-1200'),
(9,'Justin','Alexander','Colorado','4470 McKinley Avenue','970-433-7589'),
(11,'Frank','Jacob','Hawaii','1299 Randall Drive','808-590-5201')

SELECT * FROM customers

CREATE TABLE orders (
id int,
cust_id int,
order_date date,
order_quantity int,
order_details varchar(50),
order_cost int)

INSERT orders VALUES
(1,3,CONVERT(DATETIME, '2019-03-04', 120),4,'Coat',100),
(2,3,CONVERT(DATETIME, '2019-03-01', 120),1,'Shoes',80),
(3,3,CONVERT(DATETIME, '2019-03-04', 120),1,'Skirt',30),
(4,7,CONVERT(DATETIME, '2019-02-01', 120),1,'Coat',100),
(5,7,CONVERT(DATETIME, '2019-03-10', 120),1,'Shoes',80),
(6,15,CONVERT(DATETIME, '2019-02-01', 120),2,'Boats',100),
(7,15,CONVERT(DATETIME, '2019-01-11', 120),3,'Shirts',60),
(8,15,CONVERT(DATETIME, '2019-03-11', 120),1,'Slipper',20),
(9,15,CONVERT(DATETIME, '2019-03-01', 120),2,'Jeans',80),
(10,15,CONVERT(DATETIME, '2019-03-09', 120),3,'Shirts',50),
(11,5,CONVERT(DATETIME, '2019-02-01', 120),1,'Shoes',80),
(12,12,CONVERT(DATETIME, '2019-01-11', 120),3,'Shirts',60),
(13,12,CONVERT(DATETIME, '2019-03-11', 120),1,'Slipper',20),
(14,4,CONVERT(DATETIME, '2019-02-01', 120),1,'Shoes',80),
(15,4,CONVERT(DATETIME, '2019-01-11', 120),3,'Shirts',60),
(16,3,CONVERT(DATETIME, '2019-04-19', 120),3,'Shirts',50),
(17,7,CONVERT(DATETIME, '2019-04-19', 120),1,'Suit',150),
(18,15,CONVERT(DATETIME, '2019-04-19', 120),1,'Skirt',30),
(19,15,CONVERT(DATETIME, '2019-04-19', 120),2,'Dresses',200),
(20,12,CONVERT(DATETIME, '2019-01-11', 120),5,'Coat',20),
(21,7,CONVERT(DATETIME, '2019-04-01', 120),1,'Suit',50),
(22,7,CONVERT(DATETIME, '2019-04-02', 120),1,'Skirt',30),
(23,7,CONVERT(DATETIME, '2019-04-03', 120),1,'Dresses',50),
(24,7,CONVERT(DATETIME, '2019-04-04', 120),1,'Coat',25)

SELECT * FROM orders

I know the correct query results involve using
LEFT JOIN orders ON customers.id = orders.cust_id

However, I need to understand why the best approach is to use Left Join for this solution.

Please note, I have posted similar questions in the past asking for help and get very negative or not very constructive feedback. Therefore, if you can't help without providing nasty comments, then don't comment at all!

Developer technologies Transact-SQL
{count} votes

Accepted answer
  1. Tom Cooper 8,481 Reputation points
    2021-01-05T16:11:11.53+00:00

    For your query (cities with at least 5 orders), you don't need a left join. An inner join will work fine. But if you want to find all cities even if they don't have any orders, you need a left outer join. See below for the right way to find cities with at least 5 orders, the wrong way to find all cities, and the right way to find all cities.

    -- Get cities with at least 5 orders
    Select c.city, 
       Count(Distinct c.id) As NbrOfCustomers, 
       Count(o.id) As NbrOfOrders,
       Sum(o.order_cost) As TotalCost
    From customers c
    Inner Join orders o On c.id = o.cust_id
    Group By city
    Having Count(o.id) >= 5;
    
    -- If you want all cities even they don't have any orders
    -- the following is wrong.  It won't find cities with no orders
    -- California is missing - that's because there are no orders for
    -- California, so the Inner Join removes California
    Select c.city, 
       Count(Distinct c.id) As NbrOfCustomers, 
       Count(o.id) As NbrOfOrders,
       Sum(o.order_cost) As TotalCost
    From customers c
    Inner Join orders o On c.id = o.cust_id
    Group By city;
    
    -- To include cities with no orders, use a left outer join
    -- Then California is included in the result
    Select c.city, 
       Count(Distinct c.id) As NbrOfCustomers, 
       Count(o.id) As NbrOfOrders,
       IsNull(Sum(o.order_cost), 0) As TotalCost
    From customers c
    Left Join orders o On c.id = o.cust_id
    Group By city;
    

    Tom

    0 comments No comments

9 additional answers

Sort by: Most helpful
  1. Tom Cooper 8,481 Reputation points
    2021-01-05T20:28:51.22+00:00

    There are a couple of problems with the first query you suggested (the one with the LEFT JOIN).

    To see the first problem, suppose you have a city with 10 orders all for the same customer. You query above would return 10 orders_per_city (which is correct) and 10 customers_per_city. Which if I understand what you want is incorrect, Since there is only one customer for that city, I assume you want to return 1 in the customers_per_city column. The way to do that is to add the keyword DISTINCT to that count - so it would be count(distinct customers.id) as customers_per_city. Of course, if my assumption is wrong that you would want 1 in this case instead of 10, then you don't want the distinct keyword. Which brings up another point, you did a great job of providing sample data and an explanation of what you wanted the query to do. But for questions like this, you should also provide the result you would want given the sample data you provided. That is a big help for us to understand exactly what you want.

    The second problem is that you are using a LEFT JOIN instead of an INNER JOIN. Now a LEFT JOIN will return the correct result, but it may be less efficient than the INNER JOIN. This is because the HAVING count(orders.id) >= 5 effectively changes the LEFT JOIN to an INNER JOIN. It does this because the only difference between INNER JOIN and LEFT JOIN in this query is that cities with no orders will be returned from the JOIN. But then the HAVING count(orders.id) >= 5 removes all cities with fewer than 5 orders from the result. Which means, of course, that the cities that were added because you used LEFT JOIN (that is, those cities with 0 orders) are promptly dropped from the result since any city with less than 5 orders is not included in the result. So that LEFT JOIN is effectively an INNER JOIN. And since it is now an INNER JOIN, you help the Query Optimizer produce the most efficient plan for you query. Using LEFT JOIN is correct in the sense that it will return the correct result, but INNER JOIN may produce a more efficient plan. Also, since it effectively is an inner join calling it a left join makes it more difficult for any future person who wants to maintain the query to understand exactly what your query is doing.

    Your second question was about what does the HAVING clause do? Since you said you want the result for "each city that has made at least 5 orders", the HAVING clause counts the number of orders for each city only only keeps those cities which have at least 5 orders.

    Your third question was about your suggested alternative solution and which is better. First, with one exception they return the same result. The exception is that I have the DISTINCT keyword is the count for customers per city and you don't. So that makes a difference in the result. For example in your sample data for the city of Florida there are 11 orders, 4 orders for customer number 3 and 7 orders for customer number 7. My query calls that 2 customers for the city of Florida and yours returns 11. If you want the value returned to be 11, just remove DISTINCT from my query and if you want the value to be 2 add DISTINCT to your query. Having done that they are the same in that they both return the same result.

    But your question was about efficiency. The only real test for efficiency is to run both queries with your actual data. But my guess would be that unless you have millions of rows, you won't see any significant difference. And if you are having efficiency problems, getting the correct indexes is often more important than exactly which way did you write the query.

    But when making the choice between two different ways of writing the same query, going for the most efficient one is often not what you should. Of course, you don't want to write bad queries (like using a cursor when a set based solution is many times faster), but often times the difference in performance is nowhere near as important as the maintainability of your queries. Over time, it is likely that there will be requirements to update the query and the simpler and more straight forward your query is the easier that process will be. But to some extent what is easiest to understand and maintain is a matter of opinion. I think my query is easier to understand and maintain and I think most SQL experts would agree with me. But maybe for you and your shop your query is more maintainable, in which case you might will be best off to use it.

    On the other hand, I suspect that if you like your query better, it is because you are not familiar with the HAVING clause. So you might be best off in the long run to become familiar with it and use it. All of which leaves us with the two best answers to many SQL questions, "It depends" and "Your mileage may vary".

    Tom


  2. Tom Cooper 8,481 Reputation points
    2021-01-05T22:55:02.307+00:00

    (I know this is really a comment and not an answer, but comments are limited to 1000 characters and I want to write more than that)

    I know you are using WHERE and not HAVING. Basically, you have used a subquery to use a WHERE to do what a HAVING would do. And like I say, it is fine - it returns the correct result and it's equally efficient. It's just more complex than I would want. It's what you would have to do if HAVING didn't exist.

    If we take what you have in your subquery and call that "main query", then your query is

    SELECT
    SubQuery.city
    ,SubQuery.OrdersPerCity
    ,SubQuery.CustomersPerCity
    ,SubQuery.TotalCostOfOrders
    FROM (
    "main query"
    ) SubQuery
    WHERE SubQuery.OrdersPerCity >= 5
    GROUP BY SubQuery.city
    ,SubQuery.OrdersPerCity
    ,SubQuery.CustomersPerCity
    ,SubQuery.TotalCostOfOrders

    My query would be just

    "main query"
    HAVING orders.order_quantity >= 5

    Which is simpler and easier to understand (to me, maybe not to everyone).

    Incidentally, you can simplify your query by removing the final GROUP BY. Since you subquery guarantees you only have one row per city, grouping again by city doesn't do anything - that is, there is only one row per city before you do the second GROUP BY so there will be one row per city after you do the second GROUP BY.

    Tom

    0 comments No comments

  3. Carlton Patterson 741 Reputation points
    2021-01-05T23:49:26.757+00:00

    Tom

    Fantastic.

    I just couldn’t see how HAVING orders.order_quantity >= 5 selects the orders related to my OrdersPerCity.

    Thank you very much

    0 comments No comments

  4. EchoLiu-MSFT 14,621 Reputation points
    2021-01-06T08:24:17.51+00:00

    Hi @Carlton Patterson

    53966-image.png
    If you want to know the query efficiency, you can open the execution plan and you will find the difference in efficiency. Of course, as Tom said, if it is not a million pieces of data, the difference in efficiency will not be obvious:
    53919-image.png

    Regards
    Echo


    If the answer is helpful, please click "Accept Answer" and upvote it.
    Note: Please follow the steps in our documentation to enable e-mail notifications if you want to receive the related email notification for this thread. 2: /api/attachments/53829-image.png?platform=QnA

    0 comments No comments

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.