Ian's DBC SF Firecrackers Blog

Running complex SQL queries in large DB's

In my professional experience as a software quality assurance analyst, I've been in a lot of situations where complex queries had to be executed on SQL databases that had upwards of 3 million records on each table that only grew in size as a lot of the tables needing validation had one-to-many and many-to-many relationships. As a result, a lot of the test scripts that I was writing would take upwards of 16 hours to execute on each of the data migration mappings that I was tasked to validate. In the beginning, I would write and immediately run the tests, but as complexities grew, I found myself sitting at my desk for hours at a time completely idle waiting for my machine to finish executing the latest draft of the script before I could do anything else.

Over time I managed to successfully lobby for a second workstation to split writing and basic executions from the larger complex query executions as well as execute most of the complex queries as I was leaving for the day so that the results would be ready the following morning. Also key to work optimization was to shift a lot of the mindset away from complete validation to validation of boundary conditions for each field for daytime testing.

One thing that I never leveraged was the ability for SQL to create temporary tables. This was largely due to server space issues during part of this project, but since some queries I was running had to look up the same relationships between table records again and again and again, a lot of time could have been saved by creating the temp tables to prevent repeating lookups had the space been available.

One of the key ways to optimize performance from a developer standpoint is to determine which queries will be executed frequently and design around minimizing performance expense of those queries. This requires monitoring the frequnecy that the queries that are run so that adjustments could be made, particularly with upgrades and data migration, to re-optimize performance if common queries start to become expensive.