One of the most exciting new features that will be available on the next version of SQL Server, code-named Denali, are the columnstore indexes and some related new optimizations for data warehouse queries. Most of us learned about this project, code-named Apollo, at the PASS Summit 2010 last November when it was demonstrated on Tuesday’s keynote and explained on some sessions by Eric Hanson later that day, and by Susan Price the following day. Eric has also written the white paper “Columnstore Indexes for Fast Data Warehouse Query Processing in SQL Server 11.0” which you can read here. With this new technology Microsoft promises to improve the performance of star join queries, used in data warehousing scenarios, by several orders of magnitude.
As mentioned, project Apollo, consists of the addition of a new type of index, called columnstore, plus the addition of new sophisticated query processing algorithms. Microsoft has released some information regarding how the columnstore indexes work but has indicated that the new optimization and execution algorithms will be kept “top secret” and will “remain a mystery”. The column-based storage approach is not new and has been used before on some other databases, although Microsoft claims SQL Server is the first major database vendor to offer a pure column-based approach. Columnstore indexes are based on Microsoft’s VertiPaq technology which is also used on Analysis Services and PowerPivot for Excel and SharePoint.
Let me briefly explain how the columnstore indexes work. The traditional approach, used by SQL Server and most other databases, is to store rows on a data page, and it is now called a row store. On the other hand, columnstore indexes will instead dedicate an entire page to a column, that is, pages are defined to contain only data for a specific column. For example in the following figure, taken from the previously mentioned white paper, several pages are assigned to contain data for column c1 only, and the same is true for c2, c3, and the other columns.
One of the problems that columnstore indexes are trying to address is that with a row-based storage SQL Server always reads all the columns of the record even when not all of them are needed by the query. In fact, some of the performance benefits of the columnstore index are related to the fact that most queries only use less than 15% of the columns of the table. Not having to read all the columns of a table provides significant savings in disk I/O. An additional benefit of this technology is that the data is heavily compressed again needing fewer disk I/O and at the same time being able to fit more data in memory. Having enough memory will also play an important role on the technology as well. The only thing you need to do to take benefit of this performance improvement is to define a columnstore index; there is no need to change your queries, use any specific syntax or to do anything else.
Same as with previous versions of SQL Server, the query optimizer will have the choice between using a columnstore index or any of the other available access methods and, as always, this will be a cost-based decision. And same as before, you will still have the choice to force any index using a hint if the query optimizer is not giving you a good execution plan. This can happen for example when the query optimizer is choosing a columnstore index when it shouldn’t or when you want to force a columnstore index when it is not being selected. Something which is also new is the concept of an execution mode: the new query processing algorithms mentioned earlier will run in what is called a batch execution mode, which is different from the traditional processing mode, now called row mode. This is a different approach that the query optimizer will have to choose and this information will be available on the execution plans. So I suppose we will need to get used to some new terms from now on: column store and row store from the storage point of view, and batch mode and row mode from the query processing point of view.
One of the limitations at the moment is that data using columnstore indexes will not be updatable (no INSERT, DELETE, UPDATE, MERGE or other update operations will be allowed). Not being able to update data may look like a big disadvantage but this may not be a big problem as the target for this technology are data warehouses which are usually read-only most of the day and may just require updates once a day. A few possible solutions to update data using columnstore indexes are provided and I will refer you to the mentioned white paper for the details. Two additional limitations are that at the moment only a single nonclustered columnstore index can be created per table and that no clustered columnstore indexes are available yet. Both limitations may go away in a future release.
Unfortunately the columnstore indexes and related query processing algorithms were not included on the first and currently only publicly available Denali CTP. Eric Hanson mentioned back in December on his twitter account, @ENH_SQLServer, that this feature would be included on CTP2 although the software was not made publicly available. He mentioned more recently that the columnstore indexes will finally make its public debut on Denali CTP3.