Benjamin Nevarez Rotating Header Image

Statistics

Optimizer Statistics on Linked Servers

Recently I was asked to troubleshoot a performance problem with a query using linked servers. The problem was related to a well known issue where the query processor is not able to get the required optimizer statistics from the remote server due to permissions of the user used on the linked server. This behavior is documented on the Books Online entry Guidelines for Using Distributed Queries as shown next:

 “To create the best query plans when you are using a table on a linked server, the query processor must have data distribution statistics from the linked server. Users that have limited permissions on any columns of the table might not have sufficient permissions to obtain all the useful statistics, and might receive a less efficient query plan and experience poor performance. If the linked server is an instance of SQL Server, to obtain all available statistics, the user must own the table or be a member of the sysadmin fixed server role, the db_owner fixed database role, or the db_ddladmin fixed database role on the linked server.”

Basically the problem is that if the user used by the linked server does not have the permissions described in the previous Books Online entry, SQL Server will not be able to execute the DBCC SHOW_STATISTICS statement on the remote server to obtain the required statistics. It is however unfortunate that having access to the data does not also give you access to its statistics as having to provide higher level permissions could be a security concern. Let me show you the problem with an example.

Plan with read-only permissions

I’ve created a linked server between two SQL Server instances each one hosting a copy of AdventureWorks database. First, I grant read-only permissions to the user used by the linked server and run the following artificial query just to demonstrate this behavior.

SELECT l.* FROM AdventureWorks.Sales.SalesOrderHeader l

JOIN remote.AdventureWorks.Sales.SalesOrderHeader r

ON l.SalesOrderID = r.SalesOrderID

WHERE r.CustomerID = 666

Running the previous query gets me the following plan:

clip_image002

In this case the Query Optimizer could benefit of knowing the cardinality estimate of the query executed on the remote server, that is, to know how many orders were placed by customer 666, but this information is not available for this plan. With an estimated guess of 2,362.49 rows the Query Optimizer is deciding to use a Merge Join plus a Clustered Index Scan when in fact the query is returning only 8 records. You can run Profiler against the remote server to learn what kind of information the local query processor is requesting from it. I’ve noticed that it executes the following five system stored procedures which obtain information about the tables, columns, indexes, check constraints and statistics involved on the remote query.

exec [AdventureWorks].[sys].sp_tables_info_90_rowset N’SalesOrderHeader’,N’Sales’,NULL

exec [AdventureWorks].[sys].sp_columns_100_rowset N’SalesOrderHeader’,N’Sales’,NULL

exec [AdventureWorks].[sys].sp_indexes_100_rowset N’SalesOrderHeader’,NULL,N’Sales’

exec [AdventureWorks].[sys].sp_check_constbytable_rowset N’SalesOrderHeader’,N’Sales’,NULL,NULL

exec [AdventureWorks].[sys].sp_table_statistics2_rowset N’SalesOrderHeader’,N’Sales’,N’AdventureWorks’,NULL,NULL,NULL

The last stored procedure, sp_table_statistics2_rowset, tries to obtain the header and density information of the statistics object using the following statement but it fails because of the lack of permissions mentioned earlier.

dbcc show_statistics(@qtbl, @statname) with stat_header join density_vector

Plan with db_owner permissions

Now grant db_owner permissions to the user used on the liked server and run the query again while forcing a new optimization (for example using DBCC FREEPROCCACHE to clear the plan cache on the local server). Profiler will show that the previous five system stored procedures were again executed but this time the sp_table_statistics2_rowset procedure is able to successfully get the requested header and density information. In addition, the following statement is also issued to obtain the histogram information from the statistics object

DBCC SHOW_STATISTICS(N’"AdventureWorks"."Sales"."SalesOrderHeader"’,

"IX_SalesOrderHeader_CustomerID") WITH HISTOGRAM_STEPS

This time the following execution plan is returned

clip_image004

By using the histogram from the remote server, the local query processor is able to get an estimated number of rows of 6.65385 and decides to use a Nested Loops Join and a Clustered Index Seek instead which is a better plan and more appropriate as the actual number of records returned is only 8.

If you manually run the previous DBCC SHOW_STATISTICS statement on the remote server you can see the histogram, an extract of which is shown next, showing the step corresponding for the value for CustomerID 666 which in this case shows an estimated of 6.653846 records on the AVG_RANGE_ROWS column, which is also shown on the previous execution plan.

clip_image006

Remote plan

It is also worth mentioning that the query executed on the remote servers is practically the same in both cases, except that ORDER BY is needed on the first example as the Merge Join on the local plan requires sorted data, but the plan is still the same as shown next (both the query and the plan can be captured using Profiler).

SELECT "Tbl1003"."SalesOrderID" "Col1011"

FROM "AdventureWorks"."Sales"."SalesOrderHeader" "Tbl1003"

WHERE "Tbl1003"."CustomerID"=(666)

ORDER BY "Col1011" ASC

Again notice that the estimated number of rows is 6.65385

clip_image008

Conclusion

So we’ve seen how using a user with limited permissions to run a query through a linked server can in some specific cases be a performance problem because of the lack of access to optimizer statistics. However, it seems inappropriate that having access to the data does not also give you access to its statistics as having to provide higher level permissions could be a security concern. In fact, there is a related connect entry by Erland Sommarskog discussing the problem and suggesting that permissions to access data should also give access to its statistics. So let us hope a better solution to the use of linked servers is provided in the near future.

Rebuilding Indexes vs. Updating Statistics

One of the questions I was asked recently while speaking at the Los Angeles SQL Server Professionals Group and the Orange County SQL Server User Group (*) was regarding the order in which jobs like rebuilding indexes or updating statistics should be executed as part of the database maintenance activities. In general you should consider the following important points, focusing in the fact that there are two kinds of statistics: index and column statistics.

1) By default, the UPDATE STATISTICS statement updates both index and column statistics. Using the COLUMNS option of this statement will update column statistics only. Using the INDEX option will update index statistics only.

2) By default, the UPDATE STATISTICS statement uses only a sample of records of the table. Using UPDATE STATISTICS WITH FULLSCAN will scan the entire table.

3) Rebuilding an index, for example by using the ALTER INDEX … REBUILD statement, will update only index statistics with the equivalent of using WITH FULLSCAN. Rebuilding indexes does not update any column statistics.

4) Reorganizing an index, for example using the ALTER INDEX … REORGANIZE statement, does not update any statistics.

So depending on your maintenance jobs and scripts several scenarios can exist. The simplest scenario is if you want to rebuild all the indexes and update all the statistics. As mentioned before, if you rebuild all your indexes then all the index statistics will also be updated by scanning all the rows on the table. Then you just need to update your column statistics by running an UPDATE STATISTICS WITH FULLSCAN, COLUMNS statement. Since the first job only updates index statistics and the second one only updates column statistics, it does not matter which one you execute first.

Some other more complicated scenarios include when you have jobs which rebuild your indexes based on their fragmentation level. Of course, the worst case scenario would be if you first rebuild your indexes, which also updates the index statistics by scanning the entire table, and later you run UPDATE STATISTICS using the default values, which again updates the index statistics but this time only with a default and smaller sample. Not only you are updating your index statistics twice but you are overwriting the better of the two choices.

Let me show you how these commands work with some examples using the AdventureWorks database. Create a new table dbo.SalesOrderDetail

SELECT * INTO dbo.SalesOrderDetail

FROM sales.SalesOrderDetail

The next query uses the sys.stats catalog view and shows that initially there are no statistics objects for the new table.

SELECT name, auto_created, stats_date(object_id, stats_id) AS update_date FROM sys.stats

WHERE object_id = object_id(‘dbo.SalesOrderDetail’)

Use this query using the sys.stats catalog again when asked to inspect the status of the statistics after each of the following commands. Now run the following query

SELECT * FROM dbo.SalesOrderDetail

WHERE SalesOrderID = 43670 AND OrderQty = 1

Use the previous sys.stats query to verify that two statistics objects were created, one for the SalesOrderID column and another one for the OrderQty column (they both have names starting with _WA_Sys as shown later). Run the following statement to create an index on the ProductID column.

CREATE INDEX ix_ProductID ON dbo.SalesOrderDetail(ProductID)

Run again the query to verify that a new statistics object for the ProductID column has been created. Notice the value of the auto_created column which tells if the statistics were automatically created by the query processor.

name                          auto_created      update_date

_WA_Sys_00000004_7F80E8EA     1                 2010-07-01 23:27:45.953

_WA_Sys_00000001_7F80E8EA     1                 2010-07-01 23:27:46.117

ix_ProductID                  0                 2010-07-01 23:29:41.110

Run the next statement to update the column statistics only. You can validate that only the column statistics were updated by looking at the update_date column which uses the STATS_DATE function to display the last date the statistics were updated.

UPDATE STATISTICS dbo.SalesOrderDetail WITH FULLSCAN, COLUMNS

name                          auto_created      update_date

_WA_Sys_00000004_7F80E8EA     1                 2010-07-01 23:34:13.753

_WA_Sys_00000001_7F80E8EA     1                 2010-07-01 23:34:13.850

ix_ProductID                  0                 2010-07-01 23:29:41.110

This statement will do the same for the index statistics

UPDATE STATISTICS dbo.SalesOrderDetail WITH FULLSCAN, INDEX

The next two statements will update both index and column statistics

UPDATE STATISTICS dbo.SalesOrderDetail WITH FULLSCAN

UPDATE STATISTICS dbo.SalesOrderDetail WITH FULLSCAN, ALL

See how an index rebuild only updates index statistics

ALTER INDEX ix_ProductID ON dbo.SalesOrderDetail REBUILD

Here you can verify that reorganizing an index does not update any statistics

ALTER INDEX ix_ProductID ON dbo.SalesOrderDetail REORGANIZE

Finally, remove the table you have just created

DROP TABLE dbo.SalesOrderDetail

* Updated from an article originally written in October, 2009