Benjamin Nevarez Rotating Header Image

Column store indexes

Query Optimization with Denali Columnstore Indexes

In a previous post I talked about the new columnstore indexes and their related processing algorithms which are available in SQL Server code-named Denali. In this post I will cover the query processing part of the technology in more detail and will show you some examples that you can test on the recently released CTP3 (Community Technology Preview) of the product.

Same as with previous versions of SQL Server, in Denali the query optimizer can choose between the available access methods, which now also include columnstore indexes, and as always, this will be a cost-based decision. A new choice the query optimizer will have to make is the selection of an execution mode. The new query processing algorithms mentioned in my previous post will run in what is called a batch execution mode, which is different from the traditional processing mode, now called row mode.

In the row execution mode operators process data one row at a time. The new batch execution mode process data in batches which is more efficient for large amounts of data, like the workloads present on data warehouse queries. Each operator in an execution plan can use the row execution mode and, when columnstore indexes are available, some operators can also use the batch mode. There is both an estimated and an actual execution mode and this information is displayed on the query execution plan as I will show later. It is also worth mentioning that, although columnstore indexes can speed up the performance of data warehouse queries, they are not a good choice for very selective queries returning only a few records. In this case the query optimizer may have to rely on row stores, like clustered or regular nonclustered indexes, to find those records quickly. There are no seeks on columnstore indexes.

Same as with previous versions of SQL Server, you still have the choice to use a hint to force any index in the cases where the query optimizer is not giving you a good execution plan. This can happen for example when the query optimizer is choosing a columnstore index when it shouldn’t or when you want to force a columnstore index when it is not being selected. You can also use the new IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX hint to ask the query optimizer to avoid using any columnstore index.

Let me show you an example which you can test on SQL Server Denali CTP3, currently available for download here. To follow this example you will also need the AdventureWorksDWDenali database, available at CodePlex and I will use the same example on BOL to skip the basics and go directly to analyze the batch processing mode (By the way the BOL example didn’t work directly with the AdventureWorksDWDenali database so I had to add a few more columns at the end of the CREATE TABLE statement.)

First, use the following BOL code to create a partition function, a partition scheme and a new partitioned table with a columnstore index

USE AdventureWorksDWDenali;
GO

CREATE PARTITION FUNCTION [ByOrderDateMonthPF](int) AS RANGE RIGHT 
FOR VALUES (
    20050701, 20050801, 20050901, 20051001, 20051101, 20051201, 
    20060101, 20060201, 20060301, 20060401, 20060501, 20060601, 
    20060701, 20060801, 20060901, 20061001, 20061101, 20061201, 
    20070101, 20070201, 20070301, 20070401, 20070501, 20070601, 
    20070701, 20070801, 20070901, 20071001, 20071101, 20071201, 
    20080101, 20080201, 20080301, 20080401, 20080501, 20080601, 
    20080701, 20080801, 20080901, 20081001, 20081101, 20081201
) 
GO

CREATE PARTITION SCHEME [ByOrderDateMonthRange] 
AS PARTITION [ByOrderDateMonthPF] 
ALL TO ([PRIMARY]) 
GO

-- Create a partitioned version of the FactResellerSales table
CREATE TABLE [dbo].[FactResellerSalesPtnd]( 
    [ProductKey] [int] NOT NULL, 
    [OrderDateKey] [int] NOT NULL, 
    [DueDateKey] [int] NOT NULL, 
    [ShipDateKey] [int] NOT NULL, 
    [ResellerKey] [int] NOT NULL, 
    [EmployeeKey] [int] NOT NULL, 
    [PromotionKey] [int] NOT NULL, 
    [CurrencyKey] [int] NOT NULL, 
    [SalesTerritoryKey] [int] NOT NULL, 
    [SalesOrderNumber] [nvarchar](20) NOT NULL, 
    [SalesOrderLineNumber] [tinyint] NOT NULL, 
    [RevisionNumber] [tinyint] NULL, 
    [OrderQuantity] [smallint] NULL, 
    [UnitPrice] [money] NULL, 
    [ExtendedAmount] [money] NULL, 
    [UnitPriceDiscountPct] [float] NULL, 
    [DiscountAmount] [float] NULL, 
    [ProductStandardCost] [money] NULL, 
    [TotalProductCost] [money] NULL, 
    [SalesAmount] [money] NULL, 
    [TaxAmt] [money] NULL, 
    [Freight] [money] NULL, 
    [CarrierTrackingNumber] [nvarchar](25) NULL, 
    [CustomerPONumber] [nvarchar](25) NULL,
    [OrderDate] datetime NULL,
    [DueDate] datetime NULL,
    [ShipDate] datetime NULL
) ON ByOrderDateMonthRange(OrderDateKey);
GO

-- Copy the data from the FactResellerSales into the new table
INSERT INTO dbo.FactResellerSalesPtnd WITH(TABLOCK)
SELECT * FROM dbo.FactResellerSales;
GO

-- Create the columnstore index
CREATE NONCLUSTERED COLUMNSTORE INDEX [csindx_FactResellerSalesPtnd]
ON [FactResellerSalesPtnd]
( 
    [ProductKey], 
    [OrderDateKey], 
    [DueDateKey], 
    [ShipDateKey], 
    [ResellerKey], 
    [EmployeeKey], 
    [PromotionKey], 
    [CurrencyKey], 
    [SalesTerritoryKey], 
    [SalesOrderNumber], 
    [SalesOrderLineNumber], 
    [RevisionNumber], 
    [OrderQuantity], 
    [UnitPrice], 
    [ExtendedAmount], 
    [UnitPriceDiscountPct], 
    [DiscountAmount], 
    [ProductStandardCost], 
    [TotalProductCost], 
    [SalesAmount], 
    [TaxAmt], 
    [Freight], 
    [CarrierTrackingNumber], 
    [CustomerPONumber] 
);

Now run the following query

SELECT SalesTerritoryKey, SUM(ExtendedAmount) AS SalesByTerritory
FROM FactResellerSalesPtnd
GROUP BY SalesTerritoryKey;

This will create the following plan where you can see the new Columnstore Index Scan operator

clip_image002

The properties of the Columnstore Index Scan operator are shown next

clip_image003

You may notice that the actual and estimated execution mode is Row (lines 3 and 4 on the list of properties). Row execution mode was selected because the table is not large enough to require the batch execution mode. We can use the undocumented ROWCOUNT and PAGECOUNT options of the UPDATE STATISTICS statement to simulate a larger table as shown next (for more information about how this works see my post about the DTA here)

UPDATE STATISTICS FactResellerSalesPtnd WITH ROWCOUNT = 10000000, PAGECOUNT = 1000000

Removing the existing plan (using for example DBCC FREEPROCCACHE) and running the same query again will now show the following plan (only part is shown), which this time is using parallelism.

clip_image005

In addition, by looking at the properties of the Columnstore Index Scan you can notice that this time it is using the batch execution mode

clip_image006

You can also use the new IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX hint to disallow the use of a columnstore index. Run the following code

SELECT SalesTerritoryKey, SUM(ExtendedAmount) AS SalesByTerritory
FROM FactResellerSalesPtnd
GROUP BY SalesTerritoryKey
OPTION (IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX);

This will show you the following plan which as you can see it is now directly using the FactResellerSalesPtnd table, without using the columnstore index.

clip_image008

Finally, since the number of records and pages of the FactResellerSalesPtnd table was altered for this test, perhaps you want drop it and create a new copy if you need to do some additional testing

DROP TABLE FactResellerSalesPtnd

Columnstore Indexes and other new Optimizations in Denali

One of the most exciting new features that will be available on the next version of SQL Server, code-named Denali, are the columnstore indexes and some related new optimizations for data warehouse queries. Most of us learned about this project, code-named Apollo, at the PASS Summit 2010 last November when it was demonstrated on Tuesday’s keynote and explained on some sessions by Eric Hanson later that day, and by Susan Price the following day. Eric has also written the white paper “Columnstore Indexes for Fast Data Warehouse Query Processing in SQL Server 11.0” which you can read here. With this new technology Microsoft promises to improve the performance of star join queries, used in data warehousing scenarios, by several orders of magnitude.

As mentioned, project Apollo, consists of the addition of a new type of index, called columnstore, plus the addition of new sophisticated query processing algorithms. Microsoft has released some information regarding how the columnstore indexes work but has indicated that the new optimization and execution algorithms will be kept “top secret” and will “remain a mystery”. The column-based storage approach is not new and has been used before on some other databases, although Microsoft claims SQL Server is the first major database vendor to offer a pure column-based approach. Columnstore indexes are based on Microsoft’s VertiPaq technology which is also used on Analysis Services and PowerPivot for Excel and SharePoint.

Let me briefly explain how the columnstore indexes work. The traditional approach, used by SQL Server and most other databases, is to store rows on a data page, and it is now called a row store. On the other hand, columnstore indexes will instead dedicate an entire page to a column, that is, pages are defined to contain only data for a specific column. For example in the following figure, taken from the previously mentioned white paper, several pages are assigned to contain data for column c1 only, and the same is true for c2, c3, and the other columns.

clip_image002

One of the problems that columnstore indexes are trying to address is that with a row-based storage SQL Server always reads all the columns of the record even when not all of them are needed by the query. In fact, some of the performance benefits of the columnstore index are related to the fact that most queries only use less than 15% of the columns of the table. Not having to read all the columns of a table provides significant savings in disk I/O. An additional benefit of this technology is that the data is heavily compressed again needing fewer disk I/O and at the same time being able to fit more data in memory. Having enough memory will also play an important role on the technology as well. The only thing you need to do to take benefit of this performance improvement is to define a columnstore index; there is no need to change your queries, use any specific syntax or to do anything else.

Same as with previous versions of SQL Server, the query optimizer will have the choice between using a columnstore index or any of the other available access methods and, as always, this will be a cost-based decision. And same as before, you will still have the choice to force any index using a hint if the query optimizer is not giving you a good execution plan. This can happen for example when the query optimizer is choosing a columnstore index when it shouldn’t or when you want to force a columnstore index when it is not being selected. Something which is also new is the concept of an execution mode: the new query processing algorithms mentioned earlier will run in what is called a batch execution mode, which is different from the traditional processing mode, now called row mode. This is a different approach that the query optimizer will have to choose and this information will be available on the execution plans. So I suppose we will need to get used to some new terms from now on: column store and row store from the storage point of view, and batch mode and row mode from the query processing point of view.

One of the limitations at the moment is that data using columnstore indexes will not be updatable (no INSERT, DELETE, UPDATE, MERGE or other update operations will be allowed). Not being able to update data may look like a big disadvantage but this may not be a big problem as the target for this technology are data warehouses which are usually read-only most of the day and may just require updates once a day. A few possible solutions to update data using columnstore indexes are provided and I will refer you to the mentioned white paper for the details. Two additional limitations are that at the moment only a single nonclustered columnstore index can be created per table and that no clustered columnstore indexes are available yet. Both limitations may go away in a future release.

Unfortunately the columnstore indexes and related query processing algorithms were not included on the first and currently only publicly available Denali CTP. Eric Hanson mentioned back in December on his twitter account, @ENH_SQLServer, that this feature would be included on CTP2 although the software was not made publicly available. He mentioned more recently that the columnstore indexes will finally make its public debut on Denali CTP3.