Use Subqueries - they are fast!

September 05, 2008

I was looking through some legacy code to track down why a product export script was timing out and came across this (I've formatted it so that it fits into a blog post a bit better ).


<!--- get product information --->
<cfquery name="qryProductExport" datasource="#dsn#">
SELECT
product_id
, product_title
, product_description
, product_costexvat
, product_imagemain
, option_stockcode
, option_id
, option_title
, category_title
FROM [Products]
INNER JOIN [Options]
ON ( product_id = option_product_id )
INNER JOIN [Categories]
ON ( category_id = product_category_id )
WHERE product_active = 1
ORDER BY product_title
</cfquery>

<!--- get minimum product quantity that can be ordered --->
<cfquery name="qryMinQty" datasource="#dsn#">
SELECT
MIN(pricebreak_quantityMin) AS pricebreak_quantityMin
, pricebreak_product_id
FROM [PriceBreaks]
WHERE pricebreak_product_id in (
SELECT product_id
FROM [Products]
WHERE product_active = 1
)
GROUP BY pricebreak_product_id
</cfquery>

<!--- add new column to store min quantity in --->
<cfset QueryAddColumn(
qryProductExport
, "min_quantity"
, ArrayNew(1) )
/
>


<cfloop query="qryProductExport">

<!--- do this with QofQ to reduce database I/O... --->
<cfquery name="qryProductMinQty" dbtype="query">
SELECT
pricebreak_quantityMin
FROM qryMinQty
WHERE pricebreak_product_id =
<cfqueryparam
value="#qryProductExport.product_id#"
cfsqltype="cf_sql_integer"
/>

</cfquery>

<!--- update min quantity column for this product... --->
<cfset QuerySetCell(
qryProductExport
, "min_quantity"
, qryProductMinQty.pricebreak_quantityMin
, qryProductExport.CurrentRow )
/
>


</cfloop>

The code above was actually timing out - the loop is where it grinds to a halt. It probably worked fine when there were a hundred product options. A while back I posted about how the performance gain with subquery instead of group. In this instance a subquery can be used to get the data that the section query was pulling out of the database. By rewriting the SQL statement to use a subquery to get the minimum quantity, instead of two seperate queries which are then combined, I've got the execution time down to a few seconds.

Here is the new code using a subquery.


<cfquery name="qryProductExport" datasource="#dsn#">
SELECT
product_id
, product_title
, product_description
, product_costexvat
, product_imagemain
, option_stockcode
, option_id
, option_title
, category_title
, (
SELECT MIN(pricebreak_quantityMin)
FROM PriceBreaks
WHERE Products.product_id = PriceBreaks.pricebreak_product_id
) AS min_quantity
FROM [Products]
INNER JOIN [Options]
ON ( product_id = option_product_id )
INNER JOIN [Categories]
ON ( category_id = product_category_id )
WHERE product_active = 1
ORDER BY product_title
</cfquery>

The point of this post? Often when we need to get a specific recordset from a database we tend to use our favourite language to manipulate the data, when in fact by learning a bit more SQL you can make a huge difference to performance and write cleaner code. I'm not an SQL guru so if I can do it so can you!

After some really good feedback from Ben Nadel and Brian Kotek (two guys who really know their stuff!) You would probably be better off using this version of the query as it should cope better with large tables. You can see the speeds of the two queries in the comments below.


<cfquery name="qryProductExport" datasource="#dsn#">
SELECT product_id
, product_title
, product_description
, product_updated
, product_costexvat
, product_imagemain
, option_stockcode
, option_id
, option_title
, category_title
, min_quantity
FROM [Products]
INNER JOIN [Options]
ON ( product_id = option_product_id )
INNER JOIN [Categories]
ON ( category_id = product_category_id )
INNER JOIN (
SELECT
pricebreak_product_id
, MIN(pricebreak_quantityMin) AS min_quantity
FROM PriceBreaks
GROUP BY pricebreak_product_id
) AS TempRecordSet
ON ( Products.product_id = TempRecordSet.pricebreak_product_id )
WHERE product_active = 1
ORDER BY product_title
</cfquery>

 


18 comments

  1. That's a really useful tip John. I had forgotten that it was possible to do that. :o)

    Comment by Simon Bingham – September 05, 2008
  2. Good stuff. Another thing that I have found to help in similar situations is to actually make a derived table and join to it. Something like:

    SELECT

    FROM
    . . . [Products]
    INNER JOIN
    (
    . . . SELECT
    . . . . . pricebreak_product_id,
    . . . . . MIN(pricebreak_quantityMin)
    . . . FROM PriceBreaks
    . . . GROUP BY pricebreak_product_id
    ) AS temp
    ON
    . . . Products.product_id = temp.pricebreak_product_id

    I find that depending on the amount of data, this can perform pretty fast because it's only doing the derived table once rather than the sub-select for each record.

    Comment by Ben Nadel – September 05, 2008
  3. @Simon, Yeah - SQL can be really powerful.

    @Ben, Really good point. I'm going to try your suggestion that to see which is faster. When I changed the code I decided not to do it like that as I have a lot of products that are inactive which would be included in the subquery recordset so figured it would be faster but there is only one way to find out...!

    Comment by John Whish – September 05, 2008
  4. @Ben - congratulations you won :)

    The results are (on SQL Server 2000):
    john : 2750ms
    ben : 2547ms

    Comment by John Whish – September 05, 2008
  5. Ha ha, pretty close :)

    Comment by Ben Nadel – September 05, 2008
  6. Update, I've just realised that I'd forgotten to return the "min_quantity" column in my query. When I added that in I got:

    john : 2656ms
    ben : 2719ms

    For reference the other SQL statement is:

    SELECT product_id
    , product_title
    , product_description
    , product_updated
    , product_costexvat
    , product_imagemain
    , option_stockcode
    , option_id
    , option_title
    , category_title
    , min_quantity
    FROM [Products]
    INNER JOIN [Options] ON ( product_id = option_product_id )
    INNER JOIN [Categories] ON ( category_id = product_category_id )
    INNER JOIN (
    SELECT
    pricebreak_product_id
    , MIN(pricebreak_quantityMin) as min_quantity
    FROM PriceBreaks
    GROUP BY pricebreak_product_id
    ) as TempRecordSet
    ON ( Products.product_id = TempRecordSet.pricebreak_product_id )
    WHERE product_active = 1
    ORDER BY product_title

    Comment by John Whish – September 05, 2008
  7. Best of 3 Ben? :)

    Comment by John Whish – September 05, 2008
  8. I demand satisfaction! :) Ok, you win.

    Comment by Ben Nadel – September 05, 2008
  9. He, he :)

    Actually, it's been interesting to compare techniques; I'm tempted to test further when I have the time to see which is best under different conditions. Thanks for your input Ben.

    Comment by John Whish – September 05, 2008
  10. I think it depends on the number of records returned in teh overall query. If you have 3 records returned, I bet the sub-select will do better. If you have 1000 records returned, I bet the derived table would do better.

    I think it has to do with the ratio of record to sub-queries.

    Comment by Ben Nadel – September 05, 2008
  11. How things happen under the hood depends on the database and how it optimizes queries, but the main difference is that the subquery in the SELECT clause is run for EVERY record in the result set, while the derived table (aka inline view) is run only once. So Ben is almost certainly correct in that the performance of the subquery will decrease as the number of records in the result set increases. In general, I would advise people to use derived tables whenever possible for this reason. It doesn't have much downside: it performs very well for small queries as well as large ones. You might as well "bake in" the scalability up front. :-)

    Comment by Brian Kotek – September 05, 2008
  12. Hi Brian, I think you and Ben are probably right that the subquery will get slower with more records. So it makes a lot of sense to build it in from the start. Thanks for your thoughts :)

    Comment by John Whish – September 06, 2008
  13. I think the discussion of performance is interesting and has value.

    Still, I think the readability of the code is more important (up until you run into a performance problem). When I look at your subquery example and the derived table example, one looks much clearer to me.

    To me, the subquery accurately depicts what you are doing. The other approach is a bunch of code to try to get the same result, but without the same clarity.

    Next, imagine if you want a different subquery from another table. What will the code look like then? Will the purpose be clear when you return to it in a year?

    Comment by Steve Bryant – September 08, 2008
  14. Hmm I suppose I'd disagree, Steve, but that may just be because I've gone down both of these paths and the derived table has almost always ended up being a better option in my use cases. I don't see it as a "bunch of code", it's almost the exact same code, especially if the subselect used the INNER JOIN syntax (which is the ANSI SQL-compliant way to do this). As long as the derived table has a well thought alias, the intent is very clear.

    Normally I'm not a fan of premature optimization, but this is such a minor difference that personally I wouldn't see any good reason to use the subquery when it is almost guaranteed to be an issue in the future, since databases only get bigger in nearly all cases. Again though, this may just be my own personal bias and experience. I learned the hard way many years ago about the problems that subqueries can cause over time.

    Comment by Brian Kotek – September 08, 2008
  15. Hi Steve, thanks for your perspective. It's a personal preference, but comparing the two I find the derived table method easier to read. I can just look at the list of columns to see what I'm going to get back. If I want to go into the data relationships then I can look at the 'from' part of the query. I haven't tried writing a monster query like this so can't be sure, but as the derived table is in affect acting like a physical table adding another one should still make sense.... famous last words? :)

    Comment by John Whish – September 08, 2008
  16. I guess I am in the minority on this one then.

    Oh well, not the first time and probably won't be the last.

    Comment by Steve Bryant – September 08, 2008
  17. John, that's exactly how it is seen by the database: as far as this query is concerned, that derived table is just another table like any other. And you're right, adding more just means adding another derived table and the appropriate join clause(s).

    The other reason I might advise people to look at derived tables if this is new to them, is that these let you do things that would be impossible with a normal query, or that would require multiple separate queries to achieve. Derived tables are quite possibly the most powerful and yet most underused feature of SQL.

    Comment by Brian Kotek – September 08, 2008
  18. I have to agree with Brian. Derived tables are pretty awesome. I only rencently started to think in terms of derived tables; it wasn't an easy mindset to get into because with a sub-query, you think of one match per row, but with a derived table, you have to think in a whole record set that then can be joined to something else.... a bit strange to think that way when you are not used to it - especially when the derived table uses a GROUP BY (which, IMO is one of the most powerful features).

    Comment by Ben Nadel – September 09, 2008

Leave a comment

If you found this post useful, interesting or just plain wrong, let me know - I like feedback :)

Please note: If you haven't commented before, then your comments will be moderated before they are displayed.