Sunday, October 23, 2005

Correlated Subqueries

Just last week we faced a SQL performance issue in one of our applications during ongoing development work. Reason was a correlated subquery - that is, a query that can't be evaluated independently, but depends on the outer query for its results, something like:

select A1,
       (select min(B1)
        from TableB
        where TableB.B2 = TableA.A2) as MinB1
from TableA


or

select A1
from TableA
where exists (select 1
              from TableB
              where TableB.B2 = TableA.A2)


Please note that those are just examples for better understanding, and that database optimizers can do a better job on simple subqueries like that. Our real-life SQL included several inserts-by-select and a deletes-by-select with inverse exists / not exists operations on complex subqueries.

I know correlated subqueries are unavoidable sometimes, but often there are alternatives. In many cases they can be replaced by joins (which might be or not be faster - e.g. there are situations when exists/not exists-operators are superior to joins, because the database will stop looping as soon as one row is found - but that of course depends on the underlying data). Incoherent subqueries though (in opposite to correlated subqueries) should also outperform joins under normal circumstances.

SQL Server Pro Robert Vieira writes in "Professional SQL Server 2000 Programming": "Internally, a correlated subquery is going to create a nested loop situation. This can create quite a bit of overhead. Subqueries are substantially faster than cursors in most instances, but slower than other options that might be available."

When joining is no option, it might be a good idea to take a step back and do some redesign in a larger context. We managed to achieve a many-fold performance boost on a typical workload by replacing the correlated subqueries with a different approach that produced the same results at the end.

Do not expect the database engine to optimize away all tradeoffs that might be caused by a certain query design. The most-advanced optimizer can't take over complete responsibility from the developer. It's important to be conscious about which kind of tuning your database can do during execution, and which not. Creating an index on the correlated subquery's attribute that connects it to the embedding query is a good starting point.