Benjamin Nevarez

Statistics

SQL Server 2014 Incremental Statistics

A major problem with updating statistics in large tables in SQL Server is that the entire table always has to be scanned, for example when using the WITH FULLSCAN option, even if only recent data has changed. This is also true when using partitioning: even if only the newest partition has changed since the last time statistics were updated, updating statistics again requires scanning the entire table, including all the partitions that didn’t change. Incremental statistics, a new SQL Server 2014 feature, can help with this problem.

Using incremental statistics you can update only the partition or partitions that you need, and the information on these partitions will be merged with the existing information to create the final statistics object. Another advantage of incremental statistics is that the percentage of data changes required to trigger an automatic update of statistics now works at the partition level, which means that changes to only 20% of the rows of a partition (changes on the leading statistics column) are now enough to trigger the update for that partition. Unfortunately, the histogram is still limited to 200 steps for the entire statistics object in this version of SQL Server.

Let us look at an example of how you can update statistics at the partition level to explore its behavior as of SQL Server 2014. First we need to create a partitioned table using the AdventureWorks2012 database:

CREATE PARTITION FUNCTION TransactionRangePF1 (DATETIME)
AS RANGE RIGHT FOR VALUES 
(
   '20071001', '20071101', '20071201', '20080101', 
   '20080201', '20080301', '20080401', '20080501', 
   '20080601', '20080701', '20080801'
);
GO

CREATE PARTITION SCHEME TransactionsPS1 AS PARTITION TransactionRangePF1 TO 
(
  [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], 
  [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], 
  [PRIMARY], [PRIMARY], [PRIMARY]
);
GO

CREATE TABLE dbo.TransactionHistory 
(
  TransactionID        INT      NOT NULL, -- not bothering with IDENTITY here
  ProductID            INT      NOT NULL,
  ReferenceOrderID     INT      NOT NULL,
  ReferenceOrderLineID INT      NOT NULL DEFAULT (0),
  TransactionDate      DATETIME NOT NULL DEFAULT (GETDATE()),
  TransactionType      NCHAR(1) NOT NULL,
  Quantity             INT      NOT NULL,
  ActualCost           MONEY    NOT NULL,
  ModifiedDate         DATETIME NOT NULL DEFAULT (GETDATE()),
  CONSTRAINT CK_TransactionType 
    CHECK (UPPER(TransactionType) IN (N'W', N'S', N'P'))
) 
ON TransactionsPS1 (TransactionDate);
GO

Note: For details about partitioning and the CREATE PARTITION FUNCTION / SCHEME statements please refer to Partitioned Tables and Indexes in Books Online.

We currently have data to populate 12 partitions. Let us start by populating only the first 11.

INSERT INTO dbo.TransactionHistory
SELECT * FROM Production.TransactionHistory
WHERE TransactionDate < '2008-08-01';

If required, you can use the following statement to inspect the contents of the partitions:

SELECT * FROM sys.partitions
  WHERE object_id = OBJECT_ID('dbo.TransactionHistory');
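
Alternatively, a quick sketch using the $PARTITION function against the partition function created earlier shows how many rows were loaded into each partition:

SELECT $PARTITION.TransactionRangePF1(TransactionDate) AS partition_number,
  COUNT(*) AS row_count
FROM dbo.TransactionHistory
GROUP BY $PARTITION.TransactionRangePF1(TransactionDate)
ORDER BY partition_number;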

Let us create an incremental statistics object using the CREATE STATISTICS statement with the new INCREMENTAL clause set to ON (OFF is the default):

CREATE STATISTICS incrstats ON dbo.TransactionHistory(TransactionDate) 
  WITH FULLSCAN, INCREMENTAL = ON;

You can also create incremental statistics while creating an index using the new STATISTICS_INCREMENTAL clause of the CREATE INDEX statement.
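
For example, something along these lines creates an index whose statistics are incremental (the index name is just for illustration):

CREATE INDEX IX_TransactionHistory_TransactionDate
ON dbo.TransactionHistory(TransactionDate)
WITH (STATISTICS_INCREMENTAL = ON)
ON TransactionsPS1 (TransactionDate);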

You can inspect the created statistics object using DBCC:

DBCC SHOW_STATISTICS('dbo.TransactionHistory', incrstats);

Among other things, you will notice that the histogram has 200 steps (only the last 3 shown here):

     RANGE_HI_KEY             RANGE_ROWS  EQ_ROWS  DISTINCT_RANGE_ROWS
198  2008-07-25 00:00:00.000  187         100      2
199  2008-07-27 00:00:00.000  103         101      1
200  2008-07-31 00:00:00.000  281         131      3

Initial DBCC results

So we already have the maximum number of steps in a statistics object. What would happen if we add data to a new partition? Let us add data to partition 12:

INSERT INTO dbo.TransactionHistory 
SELECT * FROM Production.TransactionHistory 
WHERE TransactionDate >= '2008-08-01';

Now, we update the statistics object using the following statement:

UPDATE STATISTICS dbo.TransactionHistory(incrstats) 
  WITH RESAMPLE ON PARTITIONS(12);

Note the new syntax specifying the partitions to update, where you can list multiple partitions separated by commas (an example with more than one partition is shown after the histogram below). The UPDATE STATISTICS statement reads the specified partitions and then merges their results with the existing statistics object to build the global statistics. Note the RESAMPLE clause; this is required, as partition statistics need to have the same sample rates to be merged to build the global statistics. Although only the specified partition was scanned, you can see that SQL Server has rearranged the histogram. The last three steps now show data for the added partition. You can also compare the original with the new histogram for other minor differences:

     RANGE_HI_KEY             RANGE_ROWS  EQ_ROWS  DISTINCT_RANGE_ROWS
197  2008-07-31 00:00:00.000  150         131      2
198  2008-08-12 00:00:00.000  300         36       9
199  2008-08-22 00:00:00.000  229         43       7
200  2008-09-03 00:00:00.000  363         37       11

DBCC results after the incremental update
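
As mentioned above, the ON PARTITIONS clause accepts a comma-separated list of partition numbers, so more than one partition can be updated in a single statement; a quick sketch, assuming partitions 11 and 12 had both changed, would be:

UPDATE STATISTICS dbo.TransactionHistory(incrstats) 
  WITH RESAMPLE ON PARTITIONS(11, 12);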

If for any reason you want to disable incremental statistics, you can use the following statement to go back to the original behavior (or optionally just drop the statistics object and create a new one):

UPDATE STATISTICS dbo.TransactionHistory(incrstats) 
  WITH FULLSCAN, INCREMENTAL = OFF;

After disabling incremental statistics, trying to update a partition as shown previously will return the following error message:

Msg 9111, Level 16, State 1

UPDATE STATISTICS ON PARTITIONS syntax is not supported for non-incremental statistics.
 

Finally, you can also enable incremental statistics for your automatic statistics at the database level, if needed. This requires the INCREMENTAL = ON clause of the ALTER DATABASE statement and obviously also requires AUTO_CREATE_STATISTICS to be set to ON.
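
A sketch of the required syntax, using the same database as in the examples, would look like this:

ALTER DATABASE AdventureWorks2012
SET AUTO_CREATE_STATISTICS ON (INCREMENTAL = ON);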

Note: Article originally posted on http://sqlperformance.com

Statistics on Ascending Keys

One query processor problem I’ve been trying to research for some time is that of statistics on ascending keys. The traditional recommendation from Microsoft to fix this problem is to manually update statistics after loading data, as explained here. The document describes the problem in the following way:

“Statistics on ascending or descending key columns, such as IDENTITY or real-time timestamp columns, might require more frequent statistics updates than the query optimizer performs. Insert operations append new values to ascending or descending columns. The number of rows added might be too small to trigger a statistics update. If statistics are not up-to-date and queries select from the most recently added rows, the current statistics will not have cardinality estimates for these new values. This can result in inaccurate cardinality estimates and slow query performance. For example, a query that selects from the most recent sales order dates will have inaccurate cardinality estimates if the statistics are not updated to include cardinality estimates for the most recent sales order dates.”

Trace flags 2389 and 2390, which were first published by Ian Jose in his article Ascending Keys and Auto Quick Corrected Statistics, can help to address this problem. Ian explains that when data typically ascends, most new insertions are outside the previously found range; my own testing confirms that these new values fall outside the range of values of the existing statistics histogram. This can lead to poorly performing plans, as filters selecting recent data seem to exclude the entire relation when in fact a significant number of rows may be included.

To show you what the problem is and how the mysterious trace flag 2389 works, let us start by creating a table in AdventureWorks2012.

CREATE TABLE dbo.SalesOrderHeader (
    SalesOrderID int NOT NULL,
    RevisionNumber tinyint NOT NULL,
    OrderDate datetime NOT NULL,
    DueDate datetime NOT NULL,
    ShipDate datetime NULL,
    Status tinyint NOT NULL,
    OnlineOrderFlag dbo.Flag NOT NULL,
    SalesOrderNumber nvarchar(25) NOT NULL,
    PurchaseOrderNumber dbo.OrderNumber NULL,
    AccountNumber dbo.AccountNumber NULL,
    CustomerID int NOT NULL,
    SalesPersonID int NULL,
    TerritoryID int NULL,
    BillToAddressID int NOT NULL,
    ShipToAddressID int NOT NULL,
    ShipMethodID int NOT NULL,
    CreditCardID int NULL,
    CreditCardApprovalCode varchar(15) NULL,
    CurrencyRateID int NULL,
    SubTotal money NOT NULL,
    TaxAmt money NOT NULL,
    Freight money NOT NULL,
    TotalDue money NOT NULL,
    Comment nvarchar(128) NULL,
    rowguid uniqueidentifier NOT NULL,
    ModifiedDate datetime NOT NULL
)

Populate the table with some initial data and create an index on it (notice that both tables have the same name, but in the dbo and Sales schemas):

INSERT INTO dbo.SalesOrderHeader SELECT * FROM Sales.SalesOrderHeader 
WHERE OrderDate < '2008-07-20 00:00:00.000'
CREATE INDEX IX_OrderDate ON SalesOrderHeader(OrderDate)

The problem

After creating the index, SQL Server will also create a statistics object for it, so a query like the following will have a good cardinality estimate, as shown next (there is data for July 19 and it is captured in the last step of the statistics histogram, which you can verify by using the DBCC SHOW_STATISTICS statement).

SELECT * FROM dbo.SalesOrderHeader WHERE OrderDate = '2008-07-19 00:00:00.000'

[Execution plan showing an accurate cardinality estimate for July 19]

Now, let us suppose we add new data for July 20th.

INSERT INTO dbo.SalesOrderHeader SELECT * FROM Sales.SalesOrderHeader 
WHERE OrderDate = '2008-07-20 00:00:00.000'

Changing the query to look for records for July 20

SELECT * FROM dbo.SalesOrderHeader WHERE OrderDate = '2008-07-20 00:00:00.000'

Since the number of rows added is too small, it does not trigger an automatic update of statistics. And since the value July 20 is not represented in the histogram, SQL Server will use an estimate of 1, as shown in the following plan:

[Execution plan showing an estimate of 1 row]

Although both plans shown in this example are very similar, a bad cardinality estimate may produce bad plans in more realistic scenarios and queries.

Using trace flag 2389

Now let us see how trace flag 2389 helps with this problem. Run the next statements (notice that trace flag 2388 has not been mentioned before and will be explained shortly):

DBCC TRACEON (2388)
DBCC TRACEON (2389)

Trace flag 2389, which was introduced with SQL Server 2005 Service Pack 1, begins to track the nature of columns via subsequent statistics update operations. When the statistics are seen to increase three times in a row, the column is branded ascending.

Trace flag 2388 is not required to enable the behavior described in this article, but we can use it to show how trace flag 2389 works and to determine whether a column has been branded ascending. The trace flag changes the output of the DBCC SHOW_STATISTICS statement to show a historical look at the most recent statistics update operations.

Trace flag 2390 enables a similar behavior to 2389 even if the ascending nature of the column is not known, but I will not cover it here.

Run DBCC SHOW_STATISTICS:

DBCC SHOW_STATISTICS ('dbo.SalesOrderHeader', 'IX_OrderDate')

The statement shows the following output

Updated              Table Cardinality      Snapshot Ctr         Steps  Density                Rows Above             Rows Below             Squared Variance Error Inserts Since Last Update Deletes Since Last Update Leading column Type
-------------------- ---------------------- -------------------- ------ ---------------------- ---------------------- ---------------------- ---------------------- ------------------------- ------------------------- -------------------
Feb 26 2013  2:31AM  31095                  31095                199    0.000899280596058816   NULL                   NULL                   NULL                   NULL                      NULL                      Unknown

Not much data for now. But I’ll show you this output after three consecutive batches of inserting data and updating statistics. Run the following statement to update statistics, including the data you just added for July 20th.

UPDATE STATISTICS dbo.SalesOrderHeader WITH FULLSCAN

DBCC SHOW_STATISTICS now shows

Updated              Table Cardinality      Snapshot Ctr         Steps  Density                Rows Above             Rows Below             Squared Variance Error Inserts Since Last Update Deletes Since Last Update Leading column Type
-------------------- ---------------------- -------------------- ------ ---------------------- ---------------------- ---------------------- ---------------------- ------------------------- ------------------------- -------------------
Feb 26 2013  2:40AM  31125                  31125                199    0.000898472615517676   30                     0                      0.0135968539563045     30                        0                         Unknown
Feb 26 2013  2:31AM  31095                  31095                199    0.000899280596058816   NULL                   NULL                   NULL                   NULL                      NULL                      NULL

where ‘Rows Above’ and ‘Inserts Since Last Update’ account for the 30 rows added previously (you may need to scroll to the right). Now run the second batch:

INSERT INTO dbo.SalesOrderHeader SELECT * FROM Sales.SalesOrderHeader 
WHERE OrderDate = '2008-07-21 00:00:00.000'

Running this query again will verify the one-row estimate in the plan:

SELECT * FROM dbo.SalesOrderHeader WHERE OrderDate = '2008-07-21 00:00:00.000'

Update statistics again

UPDATE STATISTICS dbo.SalesOrderHeader WITH FULLSCAN

DBCC SHOW_STATISTICS now shows the following. Notice a new record with ‘Inserts Since Last Update’ and ‘Rows Above’ showing a value of 27. ‘Leading column Type’ still shows ‘Unknown’.

Updated              Table Cardinality      Snapshot Ctr         Steps  Density                Rows Above             Rows Below             Squared Variance Error Inserts Since Last Update Deletes Since Last Update Leading column Type
-------------------- ---------------------- -------------------- ------ ---------------------- ---------------------- ---------------------- ---------------------- ------------------------- ------------------------- -------------------
Feb 26 2013  2:44AM  31152                  31152                199    0.000897666090168059   27                     0                      0.0122265623860741     27                        0                         Unknown
Feb 26 2013  2:40AM  31125                  31125                199    0.000898472615517676   30                     0                      0.0135968539563045     30                        0                         NULL
Feb 26 2013  2:31AM  31095                  31095                199    0.000899280596058816   NULL                   NULL                   NULL                   NULL                      NULL                      NULL

A third batch

INSERT INTO dbo.SalesOrderHeader SELECT * FROM Sales.SalesOrderHeader 
WHERE OrderDate = '2008-07-22 00:00:00.000'

Update statistics one last time

UPDATE STATISTICS dbo.SalesOrderHeader WITH FULLSCAN

DBCC SHOW_STATISTICS now shows

Updated              Table Cardinality      Snapshot Ctr         Steps  Density                Rows Above             Rows Below             Squared Variance Error Inserts Since Last Update Deletes Since Last Update Leading column Type
-------------------- ---------------------- -------------------- ------ ---------------------- ---------------------- ---------------------- ---------------------- ------------------------- ------------------------- -------------------
Feb 26 2013  2:47AM  31184                  31184                199    0.000896860961802304   32                     0                      0.0144758706820584     32                        0                         Ascending
Feb 26 2013  2:44AM  31152                  31152                199    0.000897666090168059   27                     0                      0.0122265623860741     27                        0                         NULL
Feb 26 2013  2:40AM  31125                  31125                199    0.000898472615517676   30                     0                      0.0135968539563045     30                        0                         NULL
Feb 26 2013  2:31AM  31095                  31095                199    0.000899280596058816   NULL                   NULL                   NULL                   NULL                      NULL                      NULL

In addition to the new record accounting for the 32 rows added, you can now notice that the branding was changed to ‘Ascending’. Once the column is branded ‘Ascending’, SQL Server will be able to give you a better cardinality estimate without the need to manually update statistics.

Now try this batch

INSERT INTO dbo.SalesOrderHeader SELECT * FROM Sales.SalesOrderHeader 
WHERE OrderDate = '2008-07-23 00:00:00.000'

And run the following query

SELECT * FROM dbo.SalesOrderHeader WHERE OrderDate = '2008-07-23 00:00:00.000'

This time we get a better cardinality estimate. Notice that no UPDATE STATISTICS was required this time.

[Execution plan showing the improved cardinality estimate]

Instead of the estimate of one row, we now get 27.9677. But where is this value coming from? The query optimizer is now using the density information of the statistics object. The definition of density is 1 / number of distinct values, and the estimated number of rows is obtained by multiplying the density by the number of records in the table, which in this case is 0.000896861 * 31,184, or 27.967713424, as shown in the plan. Also notice that density information is only used for values not covered by the histogram. (You can see the density information using the same DBCC SHOW_STATISTICS statement, but in another session where trace flag 2388 is not enabled.)
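
You can verify the density and reproduce the arithmetic yourself; the following is just a quick check using the names from this example (run the DBCC statement in a session where trace flag 2388 is not enabled):

-- Density vector for the statistics on OrderDate
DBCC SHOW_STATISTICS ('dbo.SalesOrderHeader', 'IX_OrderDate') WITH DENSITY_VECTOR;
-- Estimated number of rows = density * table cardinality
SELECT 0.000896861 * 31184 AS estimated_rows; -- 27.967713424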

In addition, if we look for data that does not exist we still get the one row estimate which is always adequate since it will return 0 records.

SELECT * FROM dbo.SalesOrderHeader WHERE OrderDate = '2008-07-25 00:00:00.000'

Notice that branding a column ascending requires statistics to increase three times in a row. If we later insert older data, breaking the ascending sequence, the ‘Leading column Type’ column will show ‘Stationary’ and the query processor will go back to the original cardinality estimation behavior. Three additional updates in a row with increasing values can brand it as Ascending again.

Finally, at this moment I am not able to verify whether these trace flags are “officially” documented and supported by Microsoft. Interestingly, they are documented in the article FIX: You may notice a large increase in compile time when you enable trace flags 2389 and 2390 in SQL Server 2005 Service Pack 1 (look at the ‘More Information’ section). Another fix is documented here. In any case, talk to Microsoft and obviously test your application carefully if you think that these trace flags can improve its performance.

DBCC SHOW_STATISTICS Works With SELECT Permission

I’ve been wanting to blog about several topics in the last few months, and now that I’ve finished a couple of projects, including working on the SQL Server 2012 Internals book, I think I will have some time to do just that. The first of these topics takes me back to the first day of the PASS Summit 2012 keynote, where Ted Kummert, Microsoft Corporate Vice President, Data Platform group, made several announcements, including the release of SQL Server 2012 Service Pack 1. After looking at this service pack’s documentation I noticed that one of its enhancements is that the DBCC SHOW_STATISTICS statement now works with SELECT permission. This fix was requested a while ago and is related to a problem with statistics and linked servers, something I blogged about previously in my post Optimizer Statistics on Linked Servers. In summary, the problem was that the query processor was not able to get the required optimizer statistics from a remote SQL Server instance due to the permissions of the user used by the linked server. Prior to SQL Server 2012 Service Pack 1, in order to obtain all available statistics, the user must own the table or be a member of the sysadmin fixed server role, the db_owner fixed database role, or the db_ddladmin fixed database role on the linked server.

So let us see how it works, using an example similar to the one in my previous post.

SELECT l.SalesOrderID, l.CustomerID 
FROM AdventureWorks2012.Sales.SalesOrderHeader l 
JOIN [remote].AdventureWorks2012.Sales.SalesOrderHeader r 
ON l.SalesOrderID = r.SalesOrderID 
WHERE r.CustomerID = 11000

My test configuration uses a linked server relying on a login with read-only permissions on the data (for example, db_datareader). Running the previous query against a remote SQL Server instance without Service Pack 1 installed returns a bad cardinality estimate and the following plan using a hash join. Since the local query processor does not have access to the statistics on the remote server, it has to rely on a guess, in this case estimating 2,362.49 records. You can notice a big difference between the actual and estimated number of rows.

[Execution plan using a Hash Join with the 2,362.49 estimate]

After applying Service Pack 1 on the remote instance, I can run the same query; this time we get a better cardinality estimate and the query optimizer is able to make a better decision. In this case we get a new plan using a nested loops join, which is more appropriate for a small number of records.

[Execution plan using a Nested Loops Join]

Although the new behavior is enabled by default after Service Pack 1 is installed, you also have the choice of disabling it by using trace flag 9485, which you can use in case of regressions in scenarios where the performance of some queries may be getting worse. Enabling trace flag 9485 reverts the new permission check to the original behavior.

Run the following statement to enable trace flag 9485.

DBCC TRACEON (9485, -1)

Running the query again will produce the original plan with the hash join and the 2,362.49 cardinality estimate. You may also need to use a statement to clear the current plan or force a new optimization during your testing, for example DBCC FREEPROCCACHE. Apart from the change in permissions mentioned in this post, the behavior of using statistics with linked servers is the same as the one described in my previous post, so you may refer to it to understand how it works. Finally, Microsoft has not yet mentioned whether this fix will be implemented in SQL Server 2008 R2 or other earlier supported releases.
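
For example, a rough testing sequence, assuming you have the required permissions, could look like this:

DBCC TRACEON (9485, -1);  -- revert to the pre-Service Pack 1 permission check
DBCC FREEPROCCACHE;       -- clears the entire plan cache, so use only on a test system
-- run the linked server query and examine the plan
DBCC TRACEOFF (9485, -1); -- back to the new behavior
DBCC FREEPROCCACHE;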

Statistics on Computed Columns

Another interesting topic that I usually talk about in my presentations is statistics on computed columns, so I will use this post to show you how they work and how they can help to improve the performance of your queries.

A problem faced by some queries using scalar expressions is that they usually cannot benefit from statistics and, without them, the Query Optimizer will use the 30% selectivity guess on inequality comparisons. A solution to this problem can be the use of computed columns, as SQL Server can automatically create and update statistics on these columns which can help the Query Optimizer to create better execution plans. An additional benefit of this solution is that you don’t need to specify the name of the computed column in your queries for SQL Server to use its statistics. The Query Optimizer automatically matches the computed column definition to an existing scalar expression in a query, so your applications do not need to be changed. Although computed columns have been available in previous versions of SQL Server, the automatic matching feature was only introduced with SQL Server 2005.

To see an example, run this query, which creates the plan shown next:

SELECT * FROM Sales.SalesOrderDetail
WHERE OrderQty * UnitPrice > 25000

[Execution plan using the 30% selectivity guess]

The estimated number of rows is 36,395.1, which is 30% of the total number of rows, 121,317, although the query returns only 5 records. SQL Server is obviously using a selectivity guess, as it cannot estimate the selectivity of the expression OrderQty * UnitPrice > 25000.

Now create a computed column:

ALTER TABLE Sales.SalesOrderDetail
ADD cc AS OrderQty * UnitPrice

Run the previous SELECT statement again and note that, this time, the estimated number of rows has changed to 84.3101 which is very close to the actual number of rows returned by the query, as shown in the following plan:

[Execution plan showing the 84.3101 estimate]

You can optionally test replacing the 25,000 in the query with some other values, like 1,000, 10,000, or 20,000, and verify that the estimate will again be close to the actual number of rows returned.

Note that creating the computed column does not create statistics; these statistics are created the first time that the query is optimized, and you can run the next query to display the information about the statistics objects for the Sales.SalesOrderDetail table:

SELECT * FROM sys.stats
WHERE object_id = object_id('Sales.SalesOrderDetail')

The newly created statistics object will most likely be at the end of the list. Copy its name and use the following command to display the details about the statistics object (I’ve used the name of my local object, but you should replace that as appropriate). You can also use "cc" as the name of the object to get the same results. In both cases, the "cc" column should be shown on the Columns field in the density section.

DBCC SHOW_STATISTICS ('Sales.SalesOrderDetail', _WA_Sys_0000000C_2645B050)
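
If you prefer not to hunt for the auto-generated name, a query along these lines (just a sketch using the catalog views) returns the statistics objects that include the cc column:

SELECT s.name, s.auto_created
FROM sys.stats s
JOIN sys.stats_columns sc ON s.object_id = sc.object_id AND s.stats_id = sc.stats_id
JOIN sys.columns c ON sc.object_id = c.object_id AND sc.column_id = c.column_id
WHERE s.object_id = OBJECT_ID('Sales.SalesOrderDetail') AND c.name = 'cc';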

Unfortunately, for the automatic matching feature to work, the expression must be exactly the same as the computed column definition. So, if I change the query to UnitPrice * OrderQty, instead of OrderQty * UnitPrice, the execution plan will show an estimated number of rows of 30% again, as this query will demonstrate:

SELECT * FROM Sales.SalesOrderDetail
WHERE UnitPrice * OrderQty > 25000

As mentioned, the computed column provides statistics, so the Query Optimizer can try to get you a better execution plan. In addition, you can create an index on the computed column to provide a better navigational alternative. Create the following index:

CREATE INDEX IX_cc on Sales.SalesOrderDetail(cc)

If you run the original SELECT statement again, the Query Optimizer will now choose the newly created index and produce a more efficient plan using an Index Seek/Key Lookup instead of a Clustered Index Scan, as shown next.

[Execution plan using an Index Seek on the new index]

Finally, drop the index and computed column you’ve just created:

DROP INDEX Sales.SalesOrderDetail.IX_cc
GO
ALTER TABLE Sales.SalesOrderDetail
DROP COLUMN cc

Database Engine Tuning Advisor and the Query Optimizer – Part 2

One of the most interesting and perhaps not well known features of the Database Engine Tuning Advisor (DTA) is that you can use it with a test server to tune the workload of a production server. As I mentioned in the first part of this post, the DTA relies on the Query Optimizer to make its tuning recommendations, and you can have it make these optimizer calls against a test server instance without impacting the performance of the production server.

Information Required by the Query Optimizer

To better understand how this works, let us first review what kind of information the Query Optimizer needs to tune a workload. Basically, the most important information it needs to perform an optimization is:

1) The database metadata (i.e. table and column definitions, indexes, constraints, etc.)

2) Optimizer statistics (index and column statistics)

3) Table size (number of rows and pages)

4) Available memory and number of processors

The DTA can gather the database metadata and statistics from the production server and use them to create a similar database, with no data, on a different server. This is called a shell database. The DTA can also obtain the available memory and number of processors of the production server, by using the extended stored procedure xp_msver, and use this information for the optimization process. It is important to remember that no data is needed for the optimization process. This process is summarized in a figure in Books Online.
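
If you are curious about the values the DTA obtains through xp_msver, you can query them yourself; for example:

EXEC master.dbo.xp_msver 'ProcessorCount';
EXEC master.dbo.xp_msver 'PhysicalMemory';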

This process provides the following benefits:

1) There is no need to run an expensive optimization on the production server, which could impact its resource usage. The production server is only used to gather the initial metadata and the required statistics.

2) There is no need to copy the entire database to a test server either, which is especially important for big databases, saving disk space and the time needed to copy the database.

3) There are no problems when the test server is not as powerful as the production server, as the DTA tuning session will consider the available memory and number of processors of the production server.

Running a Tuning Session

Now I am going to show an example of how to run a tuning session. First of all, the use of a test server is not supported by the DTA graphical user interface, so the dta utility, the command prompt version of the DTA, is required. Configuring a test server also requires an XML input file containing the dta input information. I am using the following input file for this example:

<?xml version="1.0" encoding="utf-16" ?>
<DTAXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
        xmlns="http://schemas.microsoft.com/sqlserver/2004/07/dta">
  <DTAInput>
    <Server>
      <Name>production_instance</Name>
      <Database>
        <Name>AdventureWorks</Name>
      </Database>
    </Server>
    <Workload>
      <File>workload.sql</File>
    </Workload>
    <TuningOptions>
      <TestServer>test_instance</TestServer>
      <FeatureSet>IDX</FeatureSet>
      <Partitioning>NONE</Partitioning>
      <KeepExisting>NONE</KeepExisting>
    </TuningOptions>
  </DTAInput>
</DTAXML>

The Server and Database elements of the XML file include the production SQL Server instance and database. The Workload element includes the definition of a script containing the workload to tune. TuningOptions includes the TestServer subelement which is used to include the name of the test SQL Server instance.

Create the workload.sql file containing a simple query like this

SELECT * FROM AdventureWorks.Sales.SalesOrderDetail
WHERE ProductID = 898

Run the following command

dta -ix input.xml -S production_instance -s session1

A successful execution will show an output similar to this

Microsoft (R) SQL Server Microsoft SQL Server Database Engine Tuning Advisor com
mand line utility
Version 9.00.5000.00
Copyright (c) Microsoft Corporation. All rights reserved.

Tuning session successfully created. Session ID is 26.

Total time used: 00:00:03
Workload consumed: 100%, Estimated improvement: 96%

Tuning process finished.

This example creates an entire copy of AdventureWorks (with no data) and performs the requested optimization. The shell database is automatically deleted after the tuning session is completed. Optionally, you can keep the shell database, for example if you want to use it again in another tuning exercise, by using the RetainShellDB element inside the TuningOptions element, as in the following XML fragment.

<TuningOptions>
  <TestServer>test_instance</TestServer>
  <FeatureSet>IDX</FeatureSet>
  <Partitioning>NONE</Partitioning>
  <KeepExisting>NONE</KeepExisting>
  <RetainShellDB>1</RetainShellDB>
</TuningOptions>

If the shell database already exists when you request a tuning session, the database creation process will be skipped. However, you will have to manually delete this database when it is no longer needed.

Once the tuning session is completed you can use the DTA graphical user interface as usual to see the recommendations. To do this, open the DTA, open the session you used by double-clicking its session name (session1 in our example), and choose the Recommendations tab if it is not already selected.

Scripting Statistics

Although the DTA automatically gathers the metadata and statistics to build the shell database, I am going to show you how to script the required objects and statistics to tune a simple query. This can be helpful in cases where you don’t want to script the entire database. Scripting database objects is a fairly simple process well known by SQL Server professionals. Something that may be new for many, though, is how to script the statistics. The created scripts make use of the undocumented STATS_STREAM, ROWCOUNT and PAGECOUNT options of the CREATE/UPDATE STATISTICS statements.

As an example, to script what is needed to optimize the simple query shown previously, try the following in Management Studio: select Databases, right-click the AdventureWorks database, select Tasks, Generate Scripts …, click Next, select “Select specific database objects”, expand Tables, select Sales.SalesOrderDetail, click Next, click Advanced, look for the “Script Statistics” choice and select “Script statistics and histograms”. Finally, choose True for “Script Indexes”. Your Advanced Scripting Options window should look similar to this:

[Advanced Scripting Options window]

Click OK and finish the wizard to generate the scripts. You will get a script with a few UPDATE STATISTICS statements similar to this (with the STATS_STREAM value shortened to fit on this page).

UPDATE STATISTICS [Sales].[SalesOrderDetail]([IX_SalesOrderDetail_ProductID]) 
WITH STATS_STREAM = 0x010000000300000000000000000000004036000 , 
ROWCOUNT = 121317, PAGECOUNT = 227

These UPDATE STATISTICS statements are used to update the statistics of existing indexes (obviously the related CREATE INDEX statements were scripted as well). If the table also has column statistics, the script will include CREATE STATISTICS statements instead.

Testing Scripted Statistics

Finally, I will show you an example of how to use the scripted statistics to obtain plans and cost estimates on an empty table. Running the following query on the regular AdventureWorks database creates the following plan with an estimated number of rows of 9 and a cost of 0.0296835.

SELECT * FROM Sales.SalesOrderDetail
WHERE ProductID = 898

[Execution plan with an estimated number of rows of 9 and a cost of 0.0296835]

Let us produce the same plan on an empty database. Following the procedure described before, you can script the Sales.SalesOrderDetail table. You will end up with multiple statements, including the following (again shortened to fit in this post).

CREATE TABLE [Sales].[SalesOrderDetail](
    [SalesOrderID] [int] NOT NULL -- remaining columns omitted
) ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX [IX_SalesOrderDetail_ProductID] ON 
[Sales].[SalesOrderDetail] 
(
    [ProductID] ASC
)
GO
UPDATE STATISTICS [Sales].[SalesOrderDetail]([IX_SalesOrderDetail_ProductID]) 
WITH STATS_STREAM = 0x010000000300000000000, ROWCOUNT = 121317, PAGECOUNT = 227
GO
UPDATE STATISTICS [Sales].[SalesOrderDetail]
([PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID]) 
WITH STATS_STREAM = 0x010000000200000000000000000000003C2F68F6, ROWCOUNT = 121317, 
PAGECOUNT = 1237

Create a new database and run at least the previous four statements using the scripts you got in the previous step (or you can use the script attached to this post, which contains the statements needed to reproduce the example). After implementing the script on an empty database and running the sample query, you will again get the plan with a cost of 0.0296835 and an estimated number of rows of 9.
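
To confirm that the scripted statistics were applied to the empty table, you can look at the statistics header; the Rows column should report the scripted value of 121,317 even though the table contains no data:

DBCC SHOW_STATISTICS ('Sales.SalesOrderDetail', IX_SalesOrderDetail_ProductID)
WITH STAT_HEADER;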

Optimizer Statistics on Linked Servers

Recently I was asked to troubleshoot a performance problem with a query using linked servers. The problem was related to a well-known issue where the query processor is not able to get the required optimizer statistics from the remote server due to the permissions of the user used by the linked server. This behavior is documented in the Books Online entry Guidelines for Using Distributed Queries, as shown next:

 “To create the best query plans when you are using a table on a linked server, the query processor must have data distribution statistics from the linked server. Users that have limited permissions on any columns of the table might not have sufficient permissions to obtain all the useful statistics, and might receive a less efficient query plan and experience poor performance. If the linked server is an instance of SQL Server, to obtain all available statistics, the user must own the table or be a member of the sysadmin fixed server role, the db_owner fixed database role, or the db_ddladmin fixed database role on the linked server.”

Basically the problem is that if the user used by the linked server does not have the permissions described in the previous Books Online entry, SQL Server will not be able to execute the DBCC SHOW_STATISTICS statement on the remote server to obtain the required statistics. It is unfortunate, however, that having access to the data does not also give you access to its statistics, as having to provide higher-level permissions could be a security concern. Let me show you the problem with an example.

Plan with read-only permissions

I’ve created a linked server between two SQL Server instances, each one hosting a copy of the AdventureWorks database. First, I grant read-only permissions to the user used by the linked server and run the following artificial query just to demonstrate this behavior.

SELECT l.* FROM AdventureWorks.Sales.SalesOrderHeader l
JOIN remote.AdventureWorks.Sales.SalesOrderHeader r
ON l.SalesOrderID = r.SalesOrderID
WHERE r.CustomerID = 666

Running the previous query gets me the following plan:

[Execution plan using a Merge Join and a Clustered Index Scan]

In this case the Query Optimizer could benefit from knowing the cardinality estimate of the query executed on the remote server, that is, knowing how many orders were placed by customer 666, but this information is not available for this plan. With a guessed estimate of 2,362.49 rows, the Query Optimizer decides to use a Merge Join plus a Clustered Index Scan when in fact the query returns only 8 records. You can run Profiler against the remote server to learn what kind of information the local query processor is requesting from it. I’ve noticed that it executes the following five system stored procedures, which obtain information about the tables, columns, indexes, check constraints and statistics involved in the remote query.

exec [AdventureWorks].[sys].sp_tables_info_90_rowset N'SalesOrderHeader',N'Sales',NULL
exec [AdventureWorks].[sys].sp_columns_100_rowset N'SalesOrderHeader',N'Sales',NULL
exec [AdventureWorks].[sys].sp_indexes_100_rowset N'SalesOrderHeader',NULL,N'Sales'
exec [AdventureWorks].[sys].sp_check_constbytable_rowset N'SalesOrderHeader',N'Sales',NULL,NULL
exec [AdventureWorks].[sys].sp_table_statistics2_rowset N'SalesOrderHeader',N'Sales',N'AdventureWorks',NULL,NULL,NULL

The last stored procedure, sp_table_statistics2_rowset, tries to obtain the header and density information of the statistics object using the following statement but it fails because of the lack of permissions mentioned earlier.

dbcc show_statistics(@qtbl, @statname) with stat_header join density_vector

Plan with db_owner permissions

Now grant db_owner permissions to the user used by the linked server and run the query again while forcing a new optimization (for example, using DBCC FREEPROCCACHE to clear the plan cache on the local server). Profiler will show that the previous five system stored procedures were executed again, but this time the sp_table_statistics2_rowset procedure is able to successfully get the requested header and density information. In addition, the following statement is also issued to obtain the histogram information from the statistics object:

DBCC SHOW_STATISTICS(N'"AdventureWorks"."Sales"."SalesOrderHeader"',
"IX_SalesOrderHeader_CustomerID") WITH HISTOGRAM_STEPS

This time the following execution plan is returned

[Execution plan using a Nested Loops Join and a Clustered Index Seek]

By using the histogram from the remote server, the local query processor is able to get an estimated number of rows of 6.65385 and decides to use a Nested Loops Join and a Clustered Index Seek instead, which is a better and more appropriate plan, as the actual number of records returned is only 8.

If you manually run the previous DBCC SHOW_STATISTICS statement on the remote server you can see the histogram, an extract of which is shown next. The step corresponding to the value CustomerID 666 shows an estimate of 6.653846 records in the AVG_RANGE_ROWS column, the same value shown in the previous execution plan.

[Histogram extract showing AVG_RANGE_ROWS of 6.653846 for the step covering CustomerID 666]

Remote plan

It is also worth mentioning that the query executed on the remote server is practically the same in both cases, except that an ORDER BY is needed in the first example, as the Merge Join on the local plan requires sorted data; the plan is still the same, as shown next (both the query and the plan can be captured using Profiler).

SELECT "Tbl1003"."SalesOrderID" "Col1011"
FROM "AdventureWorks"."Sales"."SalesOrderHeader" "Tbl1003"
WHERE "Tbl1003"."CustomerID"=(666)
ORDER BY "Col1011" ASC

Again, notice that the estimated number of rows is 6.65385.

[Remote execution plan with an estimated number of rows of 6.65385]

Conclusion

So we’ve seen how using a user with limited permissions to run a query through a linked server can, in some specific cases, be a performance problem because of the lack of access to optimizer statistics. It seems inappropriate, however, that having access to the data does not also give you access to its statistics, as having to provide higher-level permissions could be a security concern. In fact, there is a related Connect entry by Erland Sommarskog discussing the problem and suggesting that permission to access the data should also give access to its statistics. So let us hope a better solution for the use of linked servers is provided in the near future.

My Book, “Inside the Query Optimizer”, available at the PASS Summit

My book, “Inside the SQL Server Query Optimizer”, is almost finished and we will have a conference edition of it available at the PASS Summit. The final version of the book, published by Red Gate books, will be available on Amazon by Christmas.

For more details on the contents, I am including the Preface of the book next.


Preface

The Query Optimizer has always been one of my favorite SQL Server topics, which is why I started blogging about it and submitting related presentations to PASS. And so it would have continued, except that, after several blog posts discussing the Query Optimizer, Red Gate invited me to write a book about it. This is that book.

I started learning about the Query Optimizer by reading the very few SQL Server books which discussed the topic, and most of them covered it only very briefly. Yet I pressed on, and later, while trying to learn more about the topic, I found an extremely rich source of information in the form of the many available research papers. It was hard to fully grasp them at the beginning, as academic papers can be difficult to read and understand, but soon I got used to them, and was all the more knowledgeable for it.

Having said that, I feel that I’m in a bit of a minority, and that many people still see the Query Optimizer just as a black box where a query is submitted and an amazing execution plan is returned. It is also seen as a very complex component, and rightly so. It definitely is a very complex component, perhaps the most complex in database management software, but there is still a lot of great information about the Query Optimizer that SQL Server professionals can benefit from.  

The Query Optimizer is the SQL Server component that tries to give you an optimal execution plan for your queries and, just as importantly, tries to find that execution plan as quickly as possible. A better understanding of what the Query Optimizer does behind the scenes can help you to improve the performance of your databases and applications, and this book explains the core concepts behind how the SQL Server Query Optimizer works. With this knowledge, you’ll be able to write better queries, provide the Query Optimizer with the information it needs to produce efficient execution plans, and troubleshoot the cases when the Query Optimizer is not giving you a good plan.

With that in mind, and in case it’s not obvious, the content of this book is intended for SQL Server professionals: database developers and administrators, data architects, and basically anybody who submits more than just trivial queries to SQL Server. Here’s a quick overview of what the book covers:

The first chapter, Introduction to Query Optimization, starts with an overview on how the SQL Server Query Optimizer works and introduces the concepts that will be covered in more detail in the rest of the book. A look into some of the challenges query optimizers still face today is covered next, along with a section on how to read and understand execution plans. The Chapter closes with a discussion of join ordering, traditionally one of the most complex problems in query optimization.

The second chapter talks about the Execution Engine, and describes it as a collection of physical operators that perform the functions of the query processor. It emphasizes how these operations, implemented by the Execution Engine, define the choices available to the Query Optimizer when building execution plans. This Chapter includes sections on data access operations, the concepts of sorting and hashing, aggregations, and joins, to conclude with a brief introduction to parallelism.

Chapter 3, Statistics and Cost Estimation, shows how the quality of the execution plans generated by the Query Optimizer is directly related to the accuracy of its cardinality and cost estimations. The Chapter describes Statistics objects in detail, and includes some sections on how statistics are created and maintained, as well as how they are used by the Query Optimizer. We’ll also take a look at how to detect cardinality estimation errors, which may cause the Query Optimizer to choose inefficient plans, together with some recommendations on how to avoid and fix these problems. Just to round off the subject, the chapter ends with an introduction to cost estimation.

Chapter 4, Index selection, shows how SQL Server can speed up your queries and dramatically improve the performance of your applications just by using the right indexes. The Chapter shows how SQL Server selects indexes, how you can provide better indexes, and how you can verify your execution plans to make sure these indexes are correctly used. We’ll talk about the Database Engine Tuning Advisor and the Missing Indexes feature, which will show how the Query Optimizer itself can provide you with index tuning recommendations.

Chapter 5, The Optimization Process, is the Chapter that goes right into the internals of the Query Optimizer and introduces the steps that it performs without you ever knowing. This covers everything from the moment a query is submitted to SQL Server until an execution plan is generated and is ready to be executed, including steps like parsing, binding, simplification, trivial plan and full optimization. Important components which are part of the Query Optimizer architecture, such as transformation rules and the memo structure, are also introduced.

Chapter 6, Additional Topics, includes a variety of subjects, starting with the basics of update operations and how they also need to be optimized just like any other query, so that they can be performed as quickly as possible. We’ll have an introduction to Data Warehousing and how SQL Server optimizes star queries, before launching into a detailed explanation of Parameter sniffing, along with some recommendations on how to avoid some problems presented by this behavior. Continuing with the topic of parameters, the Chapter concludes by discussing auto-parameterization and forced parameterization.

Chapter 7 describes Hints, and warns that, although hints are a powerful tool which allows you to take explicit control over the execution plan of a query, they need to be used with caution and only as a last resort when no other option is available. The chapter covers the most-used hints, and ends with a couple of sections on plan guides and the USE PLAN query hint.

Before we get started, please bear in mind that this book contains many undocumented SQL Server statements. These statements are provided only as a way to explore and understand the Query Optimizer and, as such, should not be used on a production environment. Use them wisely, and I hope you enjoy learning about this topic as much as I do.

Benjamin Nevarez

Rebuilding Indexes vs. Updating Statistics

One of the questions I was asked recently while speaking at the Los Angeles SQL Server Professionals Group and the Orange County SQL Server User Group (*) was regarding the order in which jobs like rebuilding indexes or updating statistics should be executed as part of database maintenance activities. In general you should consider the following important points, keeping in mind that there are two kinds of statistics: index and column statistics.

1) By default, the UPDATE STATISTICS statement updates both index and column statistics. Using the COLUMNS option of this statement will update column statistics only. Using the INDEX option will update index statistics only.

2) By default, the UPDATE STATISTICS statement uses only a sample of records of the table. Using UPDATE STATISTICS WITH FULLSCAN will scan the entire table.

3) Rebuilding an index, for example by using the ALTER INDEX … REBUILD statement, will update only index statistics with the equivalent of using WITH FULLSCAN. Rebuilding indexes does not update any column statistics.

4) Reorganizing an index, for example using the ALTER INDEX … REORGANIZE statement, does not update any statistics.

So depending on your maintenance jobs and scripts several scenarios can exist. The simplest scenario is if you want to rebuild all the indexes and update all the statistics. As mentioned before, if you rebuild all your indexes then all the index statistics will also be updated by scanning all the rows on the table. Then you just need to update your column statistics by running an UPDATE STATISTICS WITH FULLSCAN, COLUMNS statement. Since the first job only updates index statistics and the second one only updates column statistics, it does not matter which one you execute first.
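
A minimal sketch of that simple scenario, using the dbo.SalesOrderDetail table created in the example below, could be:

-- Rebuilds all indexes and, as a side effect, updates all index statistics with a full scan
ALTER INDEX ALL ON dbo.SalesOrderDetail REBUILD;
-- Updates the remaining column statistics, also with a full scan
UPDATE STATISTICS dbo.SalesOrderDetail WITH FULLSCAN, COLUMNS;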

Some other more complicated scenarios exist when you have jobs that rebuild your indexes based on their fragmentation level. Of course, the worst-case scenario would be if you first rebuild your indexes, which also updates the index statistics by scanning the entire table, and later run UPDATE STATISTICS using the default values, which again updates the index statistics but this time with only a default, smaller sample. Not only are you updating your index statistics twice, but you are also overwriting the better of the two choices.

Let me show you how these commands work with some examples using the AdventureWorks database. Create a new table dbo.SalesOrderDetail:

SELECT * INTO dbo.SalesOrderDetail
FROM Sales.SalesOrderDetail

The next query uses the sys.stats catalog view and shows that initially there are no statistics objects for the new table.

SELECT name, auto_created, stats_date(object_id, stats_id) AS update_date FROM sys.stats
WHERE object_id = object_id('dbo.SalesOrderDetail')

Use this sys.stats query again whenever you are asked to inspect the status of the statistics after each of the following commands. Now run the following query:

SELECT * FROM dbo.SalesOrderDetail
WHERE SalesOrderID = 43670 AND OrderQty = 1

Use the previous sys.stats query to verify that two statistics objects were created, one for the SalesOrderID column and another one for the OrderQty column (they both have names starting with _WA_Sys as shown later). Run the following statement to create an index on the ProductID column.

CREATE INDEX ix_ProductID ON dbo.SalesOrderDetail(ProductID)

Run the sys.stats query again to verify that a new statistics object for the ProductID column has been created. Notice the value of the auto_created column, which tells whether the statistics object was automatically created by the query processor.

name                          auto_created      update_date
_WA_Sys_00000004_7F80E8EA     1                 2010-07-01 23:27:45.953
_WA_Sys_00000001_7F80E8EA     1                 2010-07-01 23:27:46.117
ix_ProductID                  0                 2010-07-01 23:29:41.110

Run the next statement to update the column statistics only. You can validate that only the column statistics were updated by looking at the update_date column, which uses the STATS_DATE function to display the last date on which the statistics were updated.

UPDATE STATISTICS dbo.SalesOrderDetail WITH FULLSCAN, COLUMNS

name                          auto_created      update_date
_WA_Sys_00000004_7F80E8EA     1                 2010-07-01 23:34:13.753
_WA_Sys_00000001_7F80E8EA     1                 2010-07-01 23:34:13.850
ix_ProductID                  0                 2010-07-01 23:29:41.110

This statement will do the same for the index statistics

UPDATE STATISTICS dbo.SalesOrderDetail WITH FULLSCAN, INDEX

The next two statements will update both index and column statistics

UPDATE STATISTICS dbo.SalesOrderDetail WITH FULLSCAN

UPDATE STATISTICS dbo.SalesOrderDetail WITH FULLSCAN, ALL

See how an index rebuild only updates index statistics

ALTER INDEX ix_ProductID ON dbo.SalesOrderDetail REBUILD

Here you can verify that reorganizing an index does not update any statistics

ALTER INDEX ix_ProductID ON dbo.SalesOrderDetail REORGANIZE

Finally, remove the table you have just created

DROP TABLE dbo.SalesOrderDetail

* Updated from an article originally written in October, 2009

How OPTIMIZE FOR UNKNOWN Works

One of the questions I have been asked several times is how OPTIMIZE FOR UNKNOWN works. OPTIMIZE FOR is a query hint introduced with SQL Server 2005 that can help with the parameter sniffing problem and requires you to specify a value for a parameter. For an introduction to the parameter sniffing problem you can see my previous post here. On the other hand, OPTIMIZE FOR UNKNOWN, which was introduced in SQL Server 2008, does not require you to specify a value for a parameter.
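
For reference, the original OPTIMIZE FOR hint with an explicit value, used inside a stored procedure that takes a @pid parameter, would look something like this (the value is just for illustration):

SELECT * FROM Sales.SalesOrderDetail
WHERE ProductID = @pid
OPTION (OPTIMIZE FOR (@pid = 709))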

A traditional way to avoid the parameter sniffing problem, especially in previous versions of SQL Server, was the use of local variables. But entirely avoiding parameter sniffing does not mean that it is always a good solution. As I mentioned in my previous article, from the point of view of query optimization, parameter sniffing is a good thing. When the Query Optimizer knows the value of a parameter, it can use the statistics histogram to estimate the number of records that can be returned by a query. Using the histogram will give you the best estimate possible. But when you use local variables SQL Server is not able to use the histogram anymore; instead it uses the information in the density vector of the statistics object. OPTIMIZE FOR UNKNOWN works pretty much the same way.

To better understand how OPTIMIZE FOR UNKNOWN works, let us first see the case when a parameter value is known. Create the following stored procedure:

CREATE PROCEDURE test (@pid int)
AS
SELECT * FROM Sales.SalesOrderDetail
WHERE ProductID = @pid

Running this stored procedure and requesting a plan shows 188 estimated records, which can be seen in the following execution plan, which uses both an Index Seek and a Key Lookup operator.

EXEC test @pid = 709

[Execution plan using an Index Seek and a Key Lookup]

In this case SQL Server is able to use the histogram and estimate that 188 records would be returned. The Query Optimizer uses that estimate to make a decision about the plan to generate. Use the following statement to inspect the statistics object used by this stored procedure.

DBCC SHOW_STATISTICS('Sales.SalesOrderDetail', IX_SalesOrderDetail_ProductID)

Running the statement shows the following information (for space limitations, only the information needed for this post is displayed, including the first line of the density vector and the first three lines of the histogram).

All density   Average Length Columns
0.003759399   4              ProductID

RANGE_HI_KEY RANGE_ROWS    EQ_ROWS       DISTINCT_RANGE_ROWS
707          0             3083          0
708          0             3007          0
709          0             188           0

In this case SQL Server used the histogram to find the value of ProductID 709, defined as RANGE_HI_KEY, and finds the estimated number of rows 188, defined as EQ_ROWS.

Let us now change the stored procedure to use local variables

ALTER PROCEDURE test (@pid int)
AS
DECLARE @lpid int
SET @lpid = @pid
SELECT * FROM Sales.SalesOrderDetail
WHERE ProductID = @lpid

Run the procedure again using the same value

EXEC test @pid = 709

This time we get a different plan, using a clustered index scan, as shown next

[Execution plan using a Clustered Index Scan]

Local variables are not known at optimization time, so the Query Optimizer is not able to use the value 709 for the optimization, as it did before. Actually, this time it does not matter which value you use; you will always get the same estimated number of rows and the same plan. The estimated number of rows is 456.079.

Now use the version of the stored procedure with OPTIMIZE FOR UNKNOWN

ALTER PROCEDURE test (@pid int)
AS
SELECT * FROM Sales.SalesOrderDetail
WHERE ProductID = @pid
OPTION (OPTIMIZE FOR UNKNOWN)

You will notice that it behaves the same as with local variables: it gets an estimated number of rows of 456.079 and the same plan, using a Clustered Index Scan.

But now let us see how SQL Server is obtaining the value 456.079 and what the reasoning behind this is.

Density is defined as 1 / number of distinct values. The SalesOrderDetail table has 266 distinct values for ProductID, so the density is calculated as 1 / 266, or 0.003759399, as shown earlier in the statistics object. One assumption in the statistics mathematical model used by SQL Server is the uniformity assumption. Since in this case SQL Server cannot use the histogram, the uniformity assumption says that for any given value the data distribution is the same. To obtain the estimated number of records, SQL Server multiplies the density by the current total number of records: 0.003759399 * 121,317, or 456.079, as shown in the plan. This is also the same as dividing the total number of records by the number of distinct values: 121,317 / 266, which also gives 456.079.
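
The following quick check, run against the same AdventureWorks database, reproduces the calculation:

-- 1 / number of distinct ProductID values = density
SELECT 1.0 / COUNT(DISTINCT ProductID) AS density FROM Sales.SalesOrderDetail; -- approximately 0.003759399
-- density * table cardinality = estimated number of rows
SELECT 0.003759399 * 121317 AS estimated_rows; -- approximately 456.079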

Finally, I would conclude by saying that the benefit of using OPTIMIZE FOR UNKNOWN is that you always get the same execution plan; just make sure that the chosen plan benefits most of the instances of your query.