Free Trial

Safari Books Online is a digital library providing on-demand subscription access to thousands of learning resources.

  • Create BookmarkCreate Bookmark
  • Create Note or TagCreate Note or Tag
  • PrintPrint
Share this Page URL
Help

Chapter 1: Execution Plan Basics > What Happens when a Query is Submitted?

What Happens When a Query is Submitted?

When you submit a query to SQL Server, a number of processes on the server go to work on that query. The purpose of all these processes is to manage the system such that it will SELECT, INSERT, UPDATE or DELETE the data.

These processes kick into action every time we submit a query to the system. While there are many different actions occurring simultaneously within SQL Server, we're going to focus on the processes around queries. The processes for meeting the requirements of queries break down roughly into two stages:

  1. Processes that occur in the relational engine.
  2. Processes that occur in the storage engine.

In the relational engine, the query is parsed and then processed by the query optimizer, which generates an execution plan. The plan is sent (in a binary format) to the storage engine, which then uses that plan as a basis to retrieve or modify the underlying data. The storage engine is where processes such as locking, index maintenance, and transactions occur. Since execution plans are created in the relational engine, that's where we'll be focusing the majority of our attention.

Query parsing

When we pass a T-SQL query to the SQL Server system, the first place it goes to is the relational engine.1 As the T-SQL arrives, it passes through a process that checks that the T-SQL is written correctly, that it's well formed. This process is query parsing. If a query fails to parse correctly, for example, if you type SELETC instead of SELECT, then parsing stops and SQL Server returns an error to the query source. The output of the Parser process is a parse tree, or query tree (or it's even called a sequence tree). The parse tree represents the logical steps necessary to execute the requested query.

If the T-SQL string is not a data manipulation language (DML) statement, but instead is a data definition language (DDL) query, it will be not be optimized because, for example, there is only one “right way” for the SQL Server system to create a table; therefore, there are no opportunities for improving the performance of that type of statement.

Algebrizer

If the T-SQL string is a DML statement and it has parsed correctly, the parse tree is passed to a process called the algebrizer. The algebrizer resolves all the names of the various objects, tables and columns, referred to within the query string. The algebrizer identifies, at the individual column level, all the data types (varchar(50) versus datetime and so on) for the objects being accessed. It also determines the location of aggregates (such as GROUP BY, and MAX) within the query, a process called aggregate binding. This algebrizer process is important because the query may have aliases or synonyms, names that don't exist in the database, that need to be resolved, or the query may refer to objects not in the database. When objects don't exist in the database, SQL Server returns an error from this step, defining the invalid object name. As an example, the algebrizer would quickly find the table Person.Person in the AdventureWorks2008R2 database. However, the Product.Person table, which doesn't exist, would cause an error and the whole optimization process would stop.

The algebrizer outputs a binary called the query processor tree, which is then passed on to the query optimizer. The algebrizer's output includes a hash, a coded value representing the query. The optimizer uses the hash to determine whether there is already a plan generated and stored in the plan cache. If there is a plan there, the process stops here and that plan is used. This reduces all the overhead required by the query optimizer to generate a new plan.

The query optimizer

The query optimizer is essentially a piece of software that “models” the way in which the database relational engine works. The most important pieces of data used by the optimizer are statistics, which SQL Server generates and maintains against indexes and columns, explicitly for use by the optimizer. Using the query processor tree and the statistics it has about the data, the optimizer applies the model in order to work out what it thinks will be the optimal way to execute the query – that is, it generates an execution plan.

In other words, the optimizer figures out how best to implement the request represented by the T-SQL query you submitted. It decides if it can access the data through indexes, what types of joins to use and much more. The decisions made by the optimizer are based on what it calculates to be the cost of a given execution plan, in terms of the required CPU processing and I/O. Hence, this is a cost-based plan.

The optimizer will generate and evaluate many plans (unless there is already a cached plan) and, generally speaking, will choose the lowest-cost plan, that is, the plan it thinks will execute the query as fast as possible and use the least amount of resources, CPU and I/O. The calculation of the execution cost is the most important calculation, and the optimizer will use a process that is more CPU-intensive if it returns results that much faster. Sometimes, the optimizer will settle for a less efficient plan if it thinks it will take more time to evaluate many plans than to run a less efficient plan. The optimizer doesn't find the best possible plan. The optimizer finds the plan with the least cost in the shortest possible number of iterations, meaning the least amount of time within the processor.

If you submit a very simple query – for example, a SELECT statement against a single table with no aggregates or calculations within the query – then, rather than spend time trying to calculate the absolute optimal plan, the optimizer will simply apply a trivial plan to these types of queries. For example, a query like the one in Listing 1.1 would create a trivial plan.

SELECT  d.Name
FROM    HumanResources.Department AS d 
WHERE   d.DepartmentID = 42

Listing 1.1

 

Adding even one more table, with a JOIN, would make the plan non-trivial. If the query is non-trivial, the optimizer will perform a cost-based calculation to select a plan. In order to do this, it relies on the statistics that by SQL Server maintains.

Statistics are collected on columns and indexes within the database, and describe the data distribution and the uniqueness, or selectivity, of the data. We don't want the optimizer to read all the data in all the tables referenced in a query each time it tries to generate a plan, so it relies on statistics, a sample of the data that provides a mathematical construct of the data used by the optimizer to represent the entire collection of data. The reliance the optimizer has on statistics means that these things need to be as accurate as possible or the optimizer could make poor choices for the execution plans it creates.

The information that makes up statistics is represented by a histogram, a tabulation of counts of the occurrence of a particular value, taken from 200 data points evenly distributed across the data. It's this “data about the data” that provides the information necessary for the optimizer to make its calculations.

If statistics exist for a relevant column or index, then the optimizer will use them in its calculations. The optimizer will examine the statistics to determine if the index supplies a sufficient level of selectivity to act as assistance for the query in question. Selectivity is how unique the data is across the whole set of the data. The level of selectivity required to be of assistance for an index is quite high, usually with x% of unique values required in most instances.

Statistics, by default, are created and updated automatically within the system for all indexes or for any column used as a predicate, as part of a WHERE clause or JOIN ON clause. Table variables do not ever have statistics generated on them, so the optimizer always assumes they contain a single row, regardless of their actual size.

Temporary tables do have statistics generated on them and their statistics are stored in the same type of histogram as permanent tables, and the optimizer can use these statistics.

The optimizer takes these statistics, along with the query processor tree, and heuristically determines the best plan. This means that it works through a series of plans, testing different methods of accessing data, attempting different types of join, rearranging the join order, trying different indexes, and so on, until it arrives at what it thinks will be the least cost plan. During these calculations, the optimizer assigns a number to each of the steps within the plan, representing its estimation of the combined amount of CPU and disk I/O time it thinks each step will take. This number is the estimated cost for that step. The accumulation of costs for each step is the estimated cost for the execution plan itself.

It's important to note that the estimated cost is just that – an estimate. Given an infinite amount of time and complete, up-to-date statistics, the optimizer would find the perfect plan for executing the query. However, it attempts to calculate the best plan it can in the least amount of time possible, and is limited by the quality of the statistics it has available. Therefore, these cost estimations are very useful as measures, but are unlikely to reflect reality precisely.

Once the optimizer arrives at an execution plan, the estimated plan is created and stored in a memory space known as the plan cache – although this is all different if a plan already exists in cache (more on this shortly, in the section on Execution Plan Reuse). As stated earlier, if the optimizer finds a plan in the cache that matches the currently executing query, this whole process is short-circuited.

Query execution

Once the optimizer has generated an execution plan, or retrieved one from cache, the action switches to the storage engine, which usually executes the query according to the plan.

We will not go into detail here, except to note that the carefully generated execution plan may be subject to change during the actual execution process. For example, this might happen if:

  • SQL Server determines that the plan exceeds the threshold for a parallel execution (an execution that takes advantage of multiple processors on the machine – more on parallel execution in Chapter 3)
  • the statistics used to generate the plan were out of date, or have changed since the original execution plan was created
  • processes or objects within the query, such as data inserts to a temporary table, result in a recompilation of the execution plan.

Any one of these could change the estimated execution plan.

SQL Server returns the results of the query after the relational engine changes the format to match that requested in the submitted T-SQL statement, assuming it was a SELECT.

  • Safari Books Online
  • Create BookmarkCreate Bookmark
  • Create Note or TagCreate Note or Tag
  • PrintPrint