Benjamin Nevarez

Query Optimizer

Optimizer Statistics on Linked Servers

Recently I was asked to troubleshoot a performance problem with a query using linked servers. The problem was related to a well-known issue where the query processor cannot get the required optimizer statistics from the remote server because of the permissions of the user configured on the linked server. This behavior is documented in the Books Online entry Guidelines for Using Distributed Queries, as shown next:

 “To create the best query plans when you are using a table on a linked server, the query processor must have data distribution statistics from the linked server. Users that have limited permissions on any columns of the table might not have sufficient permissions to obtain all the useful statistics, and might receive a less efficient query plan and experience poor performance. If the linked server is an instance of SQL Server, to obtain all available statistics, the user must own the table or be a member of the sysadmin fixed server role, the db_owner fixed database role, or the db_ddladmin fixed database role on the linked server.”

Basically, if the user used by the linked server does not have the permissions described in the previous Books Online entry, SQL Server will not be able to execute the DBCC SHOW_STATISTICS statement on the remote server to obtain the required statistics. It is unfortunate, however, that having access to the data does not also give you access to its statistics, as having to grant higher-level permissions could be a security concern. Let me show you the problem with an example.

Plan with read-only permissions

I’ve created a linked server between two SQL Server instances, each hosting a copy of the AdventureWorks database. First, I grant read-only permissions to the user used by the linked server and run the following artificial query just to demonstrate this behavior.

SELECT l.* FROM AdventureWorks.Sales.SalesOrderHeader l
JOIN remote.AdventureWorks.Sales.SalesOrderHeader r
ON l.SalesOrderID = r.SalesOrderID
WHERE r.CustomerID = 666

Running the previous query gets me the following plan:

[Figure: execution plan using a Merge Join and a Clustered Index Scan]

In this case the Query Optimizer would benefit from knowing the cardinality estimate of the query executed on the remote server, that is, how many orders were placed by customer 666, but this information is not available for this plan. With a guess of 2,362.49 rows, the Query Optimizer decides to use a Merge Join plus a Clustered Index Scan when in fact the query returns only 8 records. You can run Profiler against the remote server to learn what kind of information the local query processor is requesting from it. I’ve noticed that it executes the following five system stored procedures, which obtain information about the tables, columns, indexes, check constraints and statistics involved in the remote query.

exec [AdventureWorks].[sys].sp_tables_info_90_rowset N'SalesOrderHeader',N'Sales',NULL
exec [AdventureWorks].[sys].sp_columns_100_rowset N'SalesOrderHeader',N'Sales',NULL
exec [AdventureWorks].[sys].sp_indexes_100_rowset N'SalesOrderHeader',NULL,N'Sales'
exec [AdventureWorks].[sys].sp_check_constbytable_rowset N'SalesOrderHeader',N'Sales',NULL,NULL
exec [AdventureWorks].[sys].sp_table_statistics2_rowset N'SalesOrderHeader',N'Sales',N'AdventureWorks',NULL,NULL,NULL

The last stored procedure, sp_table_statistics2_rowset, tries to obtain the header and density information of the statistics object using the following statement, but it fails because of the lack of permissions mentioned earlier.

dbcc show_statistics(@qtbl, @statname) with stat_header join density_vector
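To see the same failure outside of a distributed query, one option is to impersonate the linked-server user on the remote server and request the header and density information with the documented DBCC syntax. This is only a sketch: linked_reader is a hypothetical user name, and the statistics name IX_SalesOrderHeader_CustomerID is the one that appears in the histogram request later in this post.

```sql
-- Run on the remote server, in the AdventureWorks database.
-- linked_reader is a hypothetical database user with read-only access.
USE AdventureWorks;
GO
EXECUTE AS USER = 'linked_reader';

-- On the SQL Server versions discussed in this post, this is expected to
-- fail with a permissions error for a user that can only read the data.
DBCC SHOW_STATISTICS ('Sales.SalesOrderHeader', IX_SalesOrderHeader_CustomerID)
    WITH STAT_HEADER, DENSITY_VECTOR;

-- Return to the original security context.
REVERT;
GO
```

Running the same batch as a member of db_owner or db_ddladmin should return the two result sets instead of an error.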

Plan with db_owner permissions

Now grant db_owner permissions to the user used on the linked server and run the query again while forcing a new optimization (for example, using DBCC FREEPROCCACHE to clear the plan cache on the local server). Profiler will show that the previous five system stored procedures were executed again, but this time the sp_table_statistics2_rowset procedure successfully gets the requested header and density information. In addition, the following statement is also issued to obtain the histogram information from the statistics object:

DBCC SHOW_STATISTICS(N'"AdventureWorks"."Sales"."SalesOrderHeader"',
"IX_SalesOrderHeader_CustomerID") WITH HISTOGRAM_STEPS

This time the following execution plan is returned:

[Figure: execution plan using a Nested Loops Join and a Clustered Index Seek]

By using the histogram from the remote server, the local query processor gets an estimated number of rows of 6.65385 and decides to use a Nested Loops Join and a Clustered Index Seek instead. This is a better plan, and more appropriate, as the actual number of records returned is only 8.

If you manually run the previous DBCC SHOW_STATISTICS statement on the remote server you can see the histogram, an extract of which is shown next. The step corresponding to the value 666 for CustomerID shows an estimate of 6.653846 records in the AVG_RANGE_ROWS column, which is the same value shown on the previous execution plan.

[Figure: extract of the histogram for IX_SalesOrderHeader_CustomerID]
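The permission change and forced reoptimization described at the start of this section can be sketched as follows. The user name linked_reader is hypothetical, and sp_addrolemember is used here only because it is available on SQL Server 2008. According to the Books Online quote above, membership in db_ddladmin or ownership of the table would also satisfy the statistics requirement.

```sql
-- On the remote server: add the (hypothetical) linked-server user to db_owner.
USE AdventureWorks;
GO
EXEC sp_addrolemember N'db_owner', N'linked_reader';
GO

-- On the local server: clear the plan cache so that the next execution
-- of the distributed query is optimized again.
DBCC FREEPROCCACHE;
GO
```

Keep in mind that DBCC FREEPROCCACHE without arguments clears the entire plan cache of the instance, so it is best kept to test systems.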

Remote plan

It is also worth mentioning that the query executed on the remote server is practically the same in both cases, except that ORDER BY is needed in the first example because the Merge Join on the local plan requires sorted data. The remote plan is still the same, as shown next (both the query and the plan can be captured using Profiler).

SELECT "Tbl1003"."SalesOrderID" "Col1011"
FROM "AdventureWorks"."Sales"."SalesOrderHeader" "Tbl1003"
WHERE "Tbl1003"."CustomerID"=(666)
ORDER BY "Col1011" ASC

Again, notice that the estimated number of rows is 6.65385.

[Figure: execution plan of the query on the remote server]

Conclusion

So we’ve seen how using a user with limited permissions to run a query through a linked server can, in some specific cases, be a performance problem because of the lack of access to optimizer statistics. However, it seems inappropriate that having access to the data does not also give you access to its statistics, as having to grant higher-level permissions could be a security concern. In fact, there is a related Connect entry by Erland Sommarskog discussing the problem and suggesting that permissions to access data should also give access to its statistics. So let us hope a better solution to the use of linked servers is provided in the near future.

My book “Inside the SQL Server Query Optimizer” Now Available on Amazon

I just found out today that my book, “Inside the SQL Server Query Optimizer”, is finally available on Amazon. You can find the description of the book on the Amazon page or a more detailed chapter-by-chapter description in one of my previous posts here. So I would like to use this post to thank the people who helped me or in some way influenced the content of the book.

Writing this book was a lot of fun, but also a lot of work; actually a lot more work than I originally expected. Fortunately I got help from several people. First of all, I would like to thank Chris Massey. Chris helped me as the technical editor of the book, and guided me through most of the writing process, doing absolutely outstanding work. Very special thanks also go out to Grant Fritchey who helped us as the technical reviewer, and went very much beyond just reviewing, as his comments provided invaluable feedback to improve the quality of this book. Thanks also go to Tony Davis for offering me this opportunity in the first place, and helping to get the project started.

Outside the Red Gate team, my deepest gratitude goes to Cesar Galindo-Legaria, Manager of the SQL Server Query Optimization team at Microsoft, for answering my endless list of questions about the Query Optimizer, and educating me through some of the information he has published, including numerous research papers and an excellent chapter of a SQL Server book. I had the opportunity to meet Tony, Cesar, and Grant at the PASS Summit back in 2009, when I first started this journey, and I hope to finally meet Chris at the same conference in October 2011.

Although Chris, Cesar, Grant and Tony have directly helped me to shape the content of this book, there’s also a host of other people who have indirectly influenced the book through their work, which helped me to learn about and better understand the SQL Server query processor. With that in mind, I would like to thank the authors who have discussed the query processor in some of the available SQL Server books, Microsoft white papers and blogs, including Kalen Delaney, Ken Henderson, Lubor Kollar, Craig Freedman, Conor Cunningham and Eric Hanson.

Research papers have provided me with an unlimited source of information, and helped me to understand the Query Optimizer at a significantly deeper level than ever before. So, thanks to all that amazing work by the many people in the research community including, in addition to Cesar, Goetz Graefe, Surajit Chaudhuri, Yannis Ioannidis, Vivek Narasayya, Pat Selinger, Florian Waas, and many, many more.

Finally, on the personal side, I would like to thank my parents, Guadalupe and Humberto, and my family: my wife Rocio, and my three sons, Diego, Benjamin and David. Thanks all for your unconditional support and patience.

Slide Deck and Demo Files for Recent Presentations

Attached are WinZip files containing the PowerPoint presentation, as well as the scripts used in the demos for my sessions “Top 10 Query Optimizer Topics for Better Performance” and “Inside the SQL Server 2008 Data Collector” which I presented recently at the SoCal Code Camp, the Los Angeles SQL Server Professionals Group and the PASS Summit 2010.

Inside the SQL Server 2008 Data Collector - TheDataCollector.zip
Top 10 Query Optimizer Topics for Better Performance - QueryOptimizerTopics.zip

Presenting at the SoCal Code Camp

I am speaking again this month, this time at the SoCal Code Camp at Cal State Fullerton. SoCal Code Camp is a community-driven event where developers come and learn from their peers, and it is scheduled for Saturday, January 29th and Sunday, January 30th. I will be presenting two sessions on Saturday: “Inside the SQL Server 2008 Data Collector” at 8:45 am, and “Top 10 SQL Server Query Optimizer Topics for Better Performance” at 4:00 pm, both in room UH-335.

I will also be participating in the SQL Server Q&A session along with Denny Cherry, Lynn Langit, Bret Stateham, Ben Aminnia, Ike Ellis, Andrew Karcher and Thomas Mueller. This session will be held in room UH-335 at 2:45 pm.

For more information regarding sessions, schedule and directions, visit the SoCal Code Camp website. I hope to see you there.


Speaking at the Los Angeles SQL Server Professionals User Group

I haven’t updated this blog in a long time so I wanted to put in a quick post about a session that I will be presenting at the Los Angeles SQL Server Professionals User Group this Thursday, January 20th. The session, “Top 10 Query Optimizer Topics for Better Performance”, is the same topic I presented a couple of months ago at the PASS Summit in Seattle.

The meeting will be hosted at the UCLA campus and will start at 6:30 PM with Allen Berezovsky who will talk about File Stream in SQL Server. My session will follow next. More details and directions can be found at the Los Angeles SQL Server Professionals Group website.

I hope to see you there,

Benjamin


My Book, “Inside the Query Optimizer”, available at the PASS Summit

My book, “Inside the SQL Server Query Optimizer”, is almost finished and we will have a conference edition of it available at the PASS Summit. The final version of the book, published by Red Gate books, will be available on Amazon by Christmas.

For more details on the contents, I am including the Preface of the book next.


Preface

The Query Optimizer has always been one of my favorite SQL Server topics, which is why I started blogging about it and submitting related presentations to PASS. And so it would have continued, except that, after several blog posts discussing the Query Optimizer, Red Gate invited me to write a book about it. This is that book.

I started learning about the Query Optimizer by reading the very few SQL Server books which discussed the topic, and most of them covered it only very briefly. Yet I pressed on, and later, while trying to learn more about the topic, I found an extremely rich source of information in the form of the many available research papers. It was hard to fully grasp them at the beginning, as academic papers can be difficult to read and understand, but soon I got used to them, and was all the more knowledgeable for it.

Having said that, I feel that I’m in a bit of a minority, and that many people still see the Query Optimizer just as a black box where a query is submitted and an amazing execution plan is returned. It is also seen as a very complex component, and rightly so. It definitely is a very complex component, perhaps the most complex in database management software, but there is still a lot of great information about the Query Optimizer that SQL Server professionals can benefit from.  

The Query Optimizer is the SQL Server component that tries to give you an optimal execution plan for your queries and, just as importantly, tries to find that execution plan as quickly as possible. A better understanding of what the Query Optimizer does behind the scenes can help you to improve the performance of your databases and applications, and this book explains the core concepts behind how the SQL Server Query Optimizer works. With this knowledge, you’ll be able to write better queries, provide the Query Optimizer with the information it needs to produce efficient execution plans, and troubleshoot the cases when the Query Optimizer is not giving you a good plan.

With that in mind, and in case it’s not obvious, the content of this book is intended for SQL Server professionals: database developers and administrators, data architects, and basically anybody who submits more than just trivial queries to SQL Server. Here’s a quick overview of what the book covers:

The first chapter, Introduction to Query Optimization, starts with an overview on how the SQL Server Query Optimizer works and introduces the concepts that will be covered in more detail in the rest of the book. A look into some of the challenges query optimizers still face today is covered next, along with a section on how to read and understand execution plans. The Chapter closes with a discussion of join ordering, traditionally one of the most complex problems in query optimization.

The second chapter talks about the Execution Engine, and describes it as a collection of physical operators that perform the functions of the query processor. It emphasizes how these operations, implemented by the Execution Engine, define the choices available to the Query Optimizer when building execution plans. This Chapter includes sections on data access operations, the concepts of sorting and hashing, aggregations, and joins, to conclude with a brief introduction to parallelism.

Chapter 3, Statistics and Cost Estimation, shows how the quality of the execution plans generated by the Query Optimizer is directly related to the accuracy of its cardinality and cost estimations. The Chapter describes Statistics objects in detail, and includes some sections on how statistics are created and maintained, as well as how they are used by the Query Optimizer. We’ll also take a look at how to detect cardinality estimation errors, which may cause the Query Optimizer to choose inefficient plans, together with some recommendations on how to avoid and fix these problems. Just to round off the subject, the chapter ends with an introduction to cost estimation.

Chapter 4, Index Selection, shows how SQL Server can speed up your queries and dramatically improve the performance of your applications just by using the right indexes. The Chapter shows how SQL Server selects indexes, how you can provide better indexes, and how you can verify your execution plans to make sure these indexes are correctly used. We’ll talk about the Database Engine Tuning Advisor and the Missing Indexes feature, which will show how the Query Optimizer itself can provide you with index tuning recommendations.

Chapter 5, The Optimization Process, is the Chapter that goes right into the internals of the Query Optimizer and introduces the steps that it performs without you ever knowing. This covers everything from the moment a query is submitted to SQL Server until an execution plan is generated and is ready to be executed, including steps like parsing, binding, simplification, trivial plan and full optimization. Important components which are part of the Query Optimizer architecture, such as transformation rules and the memo structure, are also introduced.

Chapter 6, Additional Topics, includes a variety of subjects, starting with the basics of update operations and how they also need to be optimized just like any other query, so that they can be performed as quickly as possible. We’ll have an introduction to Data Warehousing and how SQL Server optimizes star queries, before launching into a detailed explanation of Parameter sniffing, along with some recommendations on how to avoid some problems presented by this behavior. Continuing with the topic of parameters, the Chapter concludes by discussing auto-parameterization and forced parameterization.

Chapter 7 describes Hints, and warns that, although hints are a powerful tool which allows you to take explicit control over the execution plan of a query, they need to be used with caution and only as a last resort when no other option is available. The chapter covers the most-used hints, and ends with a couple of sections on plan guides and the USE PLAN query hint.

Before we get started, please bear in mind that this book contains many undocumented SQL Server statements. These statements are provided only as a way to explore and understand the Query Optimizer and, as such, should not be used in a production environment. Use them wisely, and I hope you enjoy learning about this topic as much as I do.

Benjamin Nevarez

Presenting at the SoCal Rock & Roll Code Camp

I will be presenting two sessions at the SoCal Rock & Roll Code Camp this Saturday. This is a community-driven event with over 100 sessions, hosted at the University of Southern California (USC) on both Saturday, October 23rd and Sunday, October 24th. My sessions will be “Inside the SQL Server 2008 Data Collector” at 12:15 pm, and “Top 10 SQL Server Query Optimizer Topics for Better Performance” at 1:30 pm, both in room VKC-105.

For more information regarding sessions, schedule and directions visit the SoCal Rock & Roll Code Camp website.


Speaking at the Orange County SQL Server Professionals User Group

I will be speaking at the Orange County SQL Server Professionals User Group this Thursday, October 7th, 2010. The topic is “Top 10 Query Optimizer Topics for Better Performance”. So if you are in the Orange County or Los Angeles area please stop by and say hello. 

The meeting starts at 6:30 PM. More details and directions can be found here 

Orange County SQL Server Professionals User Group

http://www.sqloc.com

An Introduction to Cost Estimation

Last year, when I presented my session regarding the Query Optimizer at the PASS Summit, I was asked how the estimated CPU and I/O costs in an execution plan are calculated, that is, where a specific value like 1.13256 comes from. All I was able to say at the time was that Microsoft does not publish how these costs are calculated.

Since this time I am working on a related project, I thought that perhaps I could look into this question again and show one example. But since there are dozens of operators, I decided to take a look at a simple one: the Clustered Index Scan operator. So I captured dozens of XML plans, used XQuery to extract their cost information and after some analysis I was able to obtain a basic formula for this specific operator.
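For readers who want to try a similar analysis, here is a minimal sketch of how cost attributes could be pulled out of XML plans with XQuery. It is not the script used for this post; it assumes the plans of interest are still in the plan cache and reads them through sys.dm_exec_cached_plans and sys.dm_exec_query_plan, and the column aliases are only illustrative.

```sql
-- The showplan namespace must be declared so the XPath expressions match.
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT
    op.value('@PhysicalOp', 'nvarchar(60)')            AS physical_op,
    op.value('@EstimateRows', 'float')                 AS estimate_rows,
    op.value('@EstimateCPU', 'float')                  AS estimate_cpu,
    op.value('@EstimateIO', 'float')                   AS estimate_io,
    op.value('@EstimatedTotalSubtreeCost', 'float')    AS subtree_cost
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
-- One output row per Clustered Index Scan operator found in each cached plan.
CROSS APPLY qp.query_plan.nodes('//RelOp[@PhysicalOp="Clustered Index Scan"]') AS t(op);
```

Comparing estimate_rows against estimate_cpu across many plans is one way to look for a per-row increment. Shredding every cached plan can be expensive, so this is better suited to a test instance.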

But first, a quick introduction to cost estimation: the cost of each operator depends on its algorithm. Each operator is associated with a CPU cost, and some of them will also have an I/O cost. The total cost of the operator is the sum of these two costs. An operator like a Clustered Index Scan has both CPU and I/O costs. Some other operators, like a Stream Aggregate, will have only a CPU cost. It is interesting to note that this cost used to mean the estimated time in seconds that a query or operator would take to execute on a particular reference machine. In recent versions of SQL Server this cost should no longer be interpreted as seconds, milliseconds, or any other unit.

To show the example, let us look at the largest table in AdventureWorks, Sales.SalesOrderDetail. Run the following query and look at the estimated CPU and I/O costs for the Clustered Index Scan operator, as shown in the next figure.

SELECT * FROM Sales.SalesOrderDetail
WHERE LineTotal = 35

[Figure: properties of the Clustered Index Scan operator, showing its estimated CPU and I/O costs]

For a Clustered Index Scan operator, I observed that the CPU cost is 0.0001581 for the first record, plus 0.0000011 for each additional record. In this specific case we have an estimated number of records of 121,317, as shown in the picture above, so we can use 0.0001581 + 0.0000011 * (121317 - 1), or 0.133606, which is the value shown as Estimated CPU Cost. In a similar way, I noticed that the minimum I/O cost is 0.003125 for the first database page and that it then grows in increments of 0.00074074 for every additional page. Since the Clustered Index Scan operator scans the entire table, I can use the following query to find the number of database pages, which returns 1,234.

SELECT in_row_data_page_count, row_count
FROM sys.dm_db_partition_stats
WHERE object_id = object_id('Sales.SalesOrderDetail')
AND index_id = 1

In this case we have 0.003125 + 0.00074074 * (1234 - 1), or 0.916458, which is the value shown as Estimated I/O Cost.

Finally, we add both costs, 0.133606 + 0.916458 to get 1.05006 which is the total estimated cost of the operator. In the same way, adding the cost of all the operators will give the total cost of the plan. In this case, the cost of the Clustered Index Scan, 1.05006, plus the cost of the first Compute Scalar operator, 0.01214, the second Compute Scalar operator, 0.01213, and the cost of the Filter operator, 0.0582322, will give the total cost of the plan, 1.13256, as shown next.

[Figure: execution plan showing a total estimated cost of 1.13256]
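The arithmetic for the Clustered Index Scan can be checked with a short batch. This is only a worked example of the numbers in this post: the constants are the values I observed, not anything documented by Microsoft, and the variable names are just for illustration.

```sql
-- Inputs taken from the example in this post.
DECLARE @rows  float = 121317;  -- estimated number of rows
DECLARE @pages float = 1234;    -- in_row_data_page_count

-- First row/page constant plus an increment for each additional row/page.
DECLARE @cpu float = 0.0001581 + 0.0000011  * (@rows  - 1);
DECLARE @io  float = 0.003125  + 0.00074074 * (@pages - 1);

SELECT @cpu       AS estimated_cpu_cost,       -- approximately 0.1336
       @io        AS estimated_io_cost,        -- approximately 0.9165
       @cpu + @io AS estimated_operator_cost;  -- approximately 1.0501
```

The last digits can differ slightly from the plan because the constants shown here are rounded.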

Presenting at the PASS Summit 2010

I am honored to be selected to present at the PASS Summit for the third time. This November in Seattle I will be presenting the following two sessions:

Top 10 Query Optimizer Topics for Better Query Performance

This session will show you how a better understanding of how the Query Optimizer works can help you to improve the performance of your queries. I will show you the top 10 Query Optimizer topics that can give you the most benefit by focusing both on the concepts and practical solutions. The SQL Server Query Optimizer is a cost-based optimizer whose job is to analyze the possible execution plans for a query, estimate the cost of these plans and select the one with the lowest cost. So a better knowledge of how the Query Optimizer works can help both database developers and administrators to get better performance from their databases. Several areas of the query processor will be covered, everything from troubleshooting query performance problems and identifying what information the Query Optimizer needs to do a better job, to the extreme cases where, because of its limitations, the Query Optimizer may not give you a good plan and you may need to take a different approach.

Inside the SQL Server 2008 Data Collector

The SQL Server 2008 Data Collector provides low-overhead data collection functionality to store historical performance and diagnostic information for your SQL Server instances. See how you can use this information to troubleshoot problems and to provide trend analysis for the performance of your SQL Server instance. In addition to showing the basics and architecture of the new Data Collector, this session focuses on the predefined system data collection sets provided by SQL Server 2008, which automatically collect data about disk usage, instance activity, and query statistics. You will learn about the Disk Usage collection set, which gathers statistics regarding the growth of the data and transaction log database files; explore the Server Activity collection set, which focuses on server activity and resource utilization; and learn about the Query Statistics collection set, which collects data regarding the queries running in your instance.

See you in Seattle!
