Snowflake Cluster Keys and Micro Partition Elimination: Best Practices
Photo by Riccardo Pierri on Unsplash
Last Updated: 30th July 2024
Having worked as a solution architect advising over 100 Snowflake customers, I firmly believe Snowflake clustering keys and partition elimination are among the most misunderstood and misused tools for maximizing query performance.
In this article, I will explain:
How Snowflake Cluster Keys can help speed up SQL query performance
What statistics (metadata) are captured as Snowflake loads data
How Snowflake uses that metadata to tune queries, in one case improving performance 741 times, from 21 minutes down to two seconds
How Snowflake Cluster Keys maximize query performance
How to measure the existing table clustering and when to create a cluster key
Best practices to deploy Snowflake clustering and cluster keys
After reading this article, you'll fully appreciate this challenging but hugely rewarding technique to maximize query performance.
What are Snowflake Micro Partitions?
Before discussing Snowflake Clustering, it’s essential to understand how Snowflake physically stores data in micro partitions.
Unlike many databases (e.g., Oracle, SQL Server, PostgreSQL), which store data in row format in relatively small blocks, Snowflake stores data in columnar format in large 16MB compressed blocks called micro-partitions.
Micro-partitions in Snowflake are the physical units of storage, locking, and I/O, and Snowflake typically stores up to 500MB of raw data in a single highly compressed micro-partition, which can include thousands of rows.
The two data storage methods are illustrated in the diagram below:
Snowflake stores data in micro-partitions because research has demonstrated that columnar-based storage is around 50 times faster than row-stores for analytic-type queries. It also has the considerable advantage of maximizing data compression and, therefore, helps to eliminate the I/O bottleneck in reading large data volumes from relatively slow disk storage.
The diagram above illustrates how data is loaded and stored in Snowflake. It shows that data files holding data in row format are automatically converted to columnar format and compressed into micro-partitions as they are loaded.
“Column stores are around 50 times faster than row stores in data warehouse queries” — Professor Mike Stonebraker, MIT
The process of loading micro partitions in Snowflake is much more complex, but the diagram below illustrates the overall steps taken.
The process of loading data into Snowflake includes:
As data is loaded into Snowflake, the Cloud Services layer automatically captures and stores metadata statistics about the data being loaded. These include (for reasons which will become clear later) the minimum and maximum value of every column in every micro-partition. As statistics are automatically captured, subsequent queries can immediately use the data to maximize performance.
The virtual warehouse physically constructs the micro-partitions, compressing every column using a potentially different algorithm and recording the statistics in the cloud services layer, which will be used to maximize query performance.
When the data is committed, it is written to slower disk storage on the cloud provider. This is S3 on AWS, Blob Storage on Azure, or Cloud Storage on Google Cloud, all of which are much slower than the super-fast SSD storage in the virtual warehouse layer.
The diagram below illustrates the overall process by which Snowflake stores data. Because the virtual warehouse can be suspended, it does not store permanent data. Still, the Cloud Services layer records every micro-partition, its physical location on Cloud Storage (disk), and the statistics needed to help tune query performance.
Snowflake Micro Partition Pruning or Partition Elimination
The diagram below illustrates a deeper dive into Snowflake's data and metadata. If we assume data is bulk-loaded at the end of every month, the chart below illustrates the situation by the end of March.
The diagram above shows how the Cloud Services layer stores tiny volumes of metadata (typically kilobytes to megabytes).
In contrast, the Storage Layer stores the physical data in 16MB micro-partitions. This often amounts to gigabytes or terabytes in contiguous micro-partitions which are loaded in the sequence in which they arrive.
Let's assume a user executes the following query:
select *
from sales
where sale_date = '14-Feb-22';
Snowflake quickly checks the metadata statistics when the query is executed to identify the exact micro partitions to search. The diagram below illustrates the situation whereby Snowflake confirms the data for 14-Feb-22 could only be in micro-partition 3.
It’s worth noting the above partition pruning approach is entirely automatic and does not use indexes - which are only supported in Hybrid Tables. A database index records where the data is located, whereas in this case, the Cloud Services metadata records where the data might be.
For example, in the above case, Snowflake knows that micro-partition 3 holds data between 1-Feb and 29-Feb. It does not know whether it holds data for 14-Feb, but it knows it might, and that the data cannot be in the other micro-partitions.
This information allows Snowflake to prune all the other micro-partitions from the scan, leading to remarkable query performance improvements on huge tables. In many cases, this can improve query performance from hours to seconds.
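You can see how much pruning a particular query achieved in its Query Profile, or pull the same numbers out with SQL. The sketch below is one way to do that, run immediately after the query in question; the JSON paths inside OPERATOR_STATISTICS are assumptions based on the pruning figures shown in the Query Profile, so treat them as a starting point.
-- Inspect partition pruning for the most recent query in this session.
-- GET_QUERY_OPERATOR_STATS returns one row per operator in the query plan.
select operator_type,
       operator_statistics:pruning:partitions_scanned as partitions_scanned,
       operator_statistics:pruning:partitions_total   as partitions_total
from table(get_query_operator_stats(last_query_id()))
where operator_type = 'TableScan';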
The diagram below illustrates how Snowflake uses the metadata statistics captured in the Cloud Services layer to tune queries and automatically maximize query performance.
The diagram above illustrates the two-stage tuning process used by Snowflake:
The cloud services layer accepts and parses the SQL query, checks you have the correct access privileges to view the data, and then uses the metadata to perform Static micro-partition elimination, which we'll explain below. This produces a Query Plan, which indicates which tables should be scanned first, the join order, and the steps needed to deliver the results.
The Virtual Warehouse executes the query using the Query Plan, but has additional tricks. In addition to the Static Elimination designed to reduce the number of micro-partitions scanned at execution time, Snowflake employs Dynamic Run-time Optimization to further speed query performance. You can read more about this feature in my article on the Snowflake Query Profile.
It's worth pointing out that the entire process, including query compilation and static and run-time partition elimination, is entirely automated and needs no query hints or indexes. Snowflake automatically applies this run-time performance tuning to every table.
However, deploying a Snowflake Cluster Key can improve performance further.
Benchmarking Snowflake Cluster Key Performance
To give some idea of the potential impact of using a Snowflake cluster key on query performance, consider the following SQL statement:
select *
from web_sales
where ws_sold_date_sk = 2451161
and ws_item_sk = 463520;
This query was executed against a table with 4.7 TB of data containing around 72 billion rows of data, and the screenshot below demonstrates the performance gains:
The query on the left scanned 300,112 micro-partitions and took nearly 21 minutes to complete, whereas the same query on the right used micro-partition pruning to produce the same results in under two seconds.
Notice that the query with partition pruning scanned just four micro-partitions. This simple step eliminated 99.86% of the data and delivered results 741 times faster.
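For context, the fast query on the right ran against a copy of the table with a cluster key on the two filter columns. A sketch of how such a copy might be built, assuming the TPC-DS sample data; note that copying several terabytes is an expensive one-off operation, so this is purely illustrative.
-- Illustrative only: build a clustered copy of WEB_SALES keyed on the two
-- columns used in the WHERE clause above (sample schema name assumed).
create or replace table web_sales_clustered
  cluster by (ws_sold_date_sk, ws_item_sk)
as
select *
from snowflake_sample_data.tpcds_sf100tcl.web_sales;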
While the above may be very impressive, it raises a question: Why was the first execution so slow?
What Does Clustering Mean?
Data Clustering refers to how data with similar values is physically grouped or clustered in storage. For example, when storing orders, the entries tend to arrive on a given date; therefore, all entries for a given day tend to be physically stored together in the same micro-partition.
This is often called "Natural Clustering," as there's no need to sort the data before loading; it simply arrives in ORDER DATE sequence.
Snowflake uses natural data clustering to eliminate micro-partitions, thereby automatically improving query performance.
Snowflake Clustering: Good Vs. Bad
There are many reasons why data may be stored with poor data clustering. If you do accidentally have natural clustering, you're pretty lucky. For example, loading data using Snowpipe tends to disrupt natural data clustering, which can lead to poor query performance.
Indeed, I worked with a customer who switched from batch loading using COPY to Snowpipe, and they found query performance was severely impacted.
Let's consider the previous situation where we load data at the end of each month. Assuming we are migrating from an on-premises Oracle data warehouse and must migrate ten years of history, one sensible way to achieve this would be to unload all ten years to flat files and use COPY or Snowpipe to load the data into Snowflake.
Unlike regular monthly loading, this method risks loading the data in a random sequence. Instead of loading one month at a time, the data will be extracted from Oracle to data files in random sequence and loaded to Snowflake.
The diagram below illustrates how the same three months of data might look:
The above example differs from the previous case, as every micro-partition holds a random set of dates spanning the entire range from January to March.
If we execute a query to find all data for 14-Feb-22 against this data, it would need to scan the entire table. In this example, there are just four micro-partitions, but in a large table, this could be many thousands of entries, which could significantly impact performance.
How to Measure Clustering in Snowflake
While we've illustrated the extremes from good to poor clustering in Snowflake, we need a method to identify the extent to which data is clustered. To achieve this, we need to understand the Clustering Depth.
The diagram below illustrates what we mean by a Poor Clustering Depth. Assuming we queried poorly clustered data, we'd need to scan every micro-partition to find whether it included data for 21st January.
Compare the situation above to the Good Clustering Depth illustrated in the diagram below. This shows the same query against a table where the data is highly clustered.
In this case, Snowflake eliminates all but one micro-partition, as it's the only possible entry with data for 21st January.
The above examples show that tables with a Low Clustering Depth are highly clustered, whereas tables with a High Clustering Depth are very poorly clustered.
Measuring Clustering in Snowflake
Snowflake provides an SQL function called SYSTEM$CLUSTERING_INFORMATION to provide an objective clustering measure. Unfortunately, it isn't easy to understand, but we'll explain the output here and how to read the runes.
The SQL Snippet below shows how to execute this function:
select system$clustering_information('ORDERS','(o_orderdate)');
The parameters are:
Table Name
Column Name(s). Notice these must be enclosed in brackets
The diagram below highlights the key points:
The example above shows the typical output, but you must focus on just a few statistics.
Total Partition Count: This is important because unless you have at least 1,000 micro-partitions, you shouldn't be concerned with clustering.
Average Depth: This estimates the average number of micro-partitions you'd need to read to find all entries for a given value; in the above example, to find entries for a given O_ORDERDATE.
Maximum Depth: This is an estimate of the worst-case number of reads. In the above example, you'd need to read just six micro-partitions; there are only four entries at this depth.
Hopefully, you'll conclude the above indicates this table is highly clustered.
It indicates that despite the table having 3,242 micro-partitions, around a third of the entries can be found with a single read. To put this in context, a query to retrieve entries:
WHERE O_ORDERDATE = '01-JAN-22'
would most likely return within seconds, despite the table holding around 54GB of data.
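If you'd rather not read the raw JSON by eye, you can parse the output instead. A minimal sketch, assuming the JSON key names match those shown in the raw output above (check your own results, as the keys may differ):
-- Extract the headline clustering figures from the JSON output.
select parse_json(system$clustering_information('ORDERS','(o_orderdate)')) as ci,
       ci:total_partition_count::number          as total_partition_count,
       ci:total_constant_partition_count::number as constant_partition_count,
       ci:average_depth::float                   as average_depth;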
Unlike a CREATE INDEX statement, Snowflake clustering is a background task that will take some time to complete. I re-clustered a 1.3TB table in benchmark tests, which took over four hours for the background re-clustering to finish.
The automatic clustering service doesn't use a virtual warehouse but uses Snowflake-supplied compute resources billed separately. Clustering a terabyte of compressed data costs around 69 credits (around $200, assuming $3 per credit).
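Because this billing is separate from virtual warehouse usage, it's worth keeping an eye on it. The sketch below uses the SNOWFLAKE.ACCOUNT_USAGE.AUTOMATIC_CLUSTERING_HISTORY view to show which tables consumed the most re-clustering credits over the last 30 days.
-- Credits consumed by the automatic clustering service, per table, last 30 days.
select table_name,
       sum(credits_used)         as credits_used,
       sum(num_rows_reclustered) as rows_reclustered
from snowflake.account_usage.automatic_clustering_history
where start_time > dateadd(day, -30, current_timestamp())
group by table_name
order by credits_used desc;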
Effect of Updates on Snowflake Clustering
When a Snowflake cluster key is created on the table, Snowflake will automatically maintain the clustering sequence by re-sorting (clustering) the data as needed. The diagram below illustrates how Snowflake deals with update operations:
In summary:
Snowflake Micro-partitions are immutable and, once written, cannot be modified.
This means an update against one or more rows in a given micro-partition creates a new version of the entire micro-partition, including the updated rows.
The prior micro-partition version is kept for Time Travel or immediately removed as required.
Over time, a series of updates will gradually upset the clustering, and the diagram below illustrates the effect of updates against a table over time:
The diagram above shows how, as micro-partitions are updated, the new versions are written to the end of the table, disrupting the sort order.
This process slowly turns good clustering into poor clustering. Snowflake identifies this situation, and an automatic clustering process re-sorts the data in the background.
It's important to note that the number of updated rows is not the critical factor. A single UPDATE statement can modify one row in every micro-partition and disrupt clustering across the entire table. The number of micro-partitions changed and the frequency of updates are far more important than the number of rows changed each time.
Identifying Updates on Clustered Tables
Although the number of rows changed is less important than the partitions modified, it's still a good indicator of a poor clustering candidate. The query below returns the number of UPDATE and DELETE operations per day against tables with cluster keys to help identify poor clustering candidates.
select to_char(h.start_time,'DD-Mon-YY') as day, h.table_name, sum(h.rows_removed) + sum(h.rows_updated) as total_changed,
round(avg(t.row_count)) as row_count, t.clustering_key
from snowflake.account_usage.table_dml_history h,
snowflake.account_usage.tables t
where true
and h.table_id = t.table_id
and h.start_time > dateadd(day, -30, current_timestamp())
and (h.rows_removed > 0 or h.rows_updated > 0)
and t.clustering_key is not null
and t.auto_clustering_on = 'YES'
group by all
order by total_changed desc, table_name, day
limit 200;
The screenshot above shows the daily changes (updates and deletes) to clustered tables. It's a good practice to avoid a cluster key on a table with frequent updates, and the SALE_TRANSACTIONS table above illustrates the case.
Another notable case is the SALE_TRANSACTIONS_STAGING table, which loads over 5 billion rows daily, yet the table row count is zero. This indicates a staging or landing table that is loaded, processed, and deleted, which is also a poor candidate for a Snowflake cluster key.
Pausing and Resuming Snowflake Clustering
Because the automatic re-clustering operation will be billed, the option exists to suspend or resume clustering on a table. Using the following SQL commands, we can temporarily halt the re-clustering operation:
alter table sales suspend recluster;
alter table sales resume recluster;
Creating a cluster key on a frequently updated table is sometimes beneficial. However, it's best practice to suspend reclustering during busy periods, and the above SQL can be used to achieve this.
This technique works because, during relatively quiet times (for example, at the weekend), the automatic clustering process doesn't compete with update operations on the same table.
Likewise, you can verify whether a table has automatic clustering enabled using the following SQL:
show tables;
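The AUTOMATIC_CLUSTERING column in the SHOW TABLES output reports ON or OFF for each table. If you have many tables, one option is to filter the SHOW output with RESULT_SCAN, a sketch of which is shown below; the quoted, lower-case column names come from the SHOW TABLES result set.
show tables like 'SALES%';

-- Filter the SHOW output to see the clustering settings per table.
select "name", "cluster_by", "automatic_clustering"
from table(result_scan(last_query_id()));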
What does Clustering in Snowflake do?
Clustering in Snowflake physically re-sorts the data and stores it in the sequence of the Snowflake cluster key. However, it's not quite as simple as that.
A sort operation would physically sort every row in the table, which can be incredibly expensive. Instead, Snowflake performs an approximate sort that groups (or clusters) data with the same key value. This means (for example) that if data is clustered by DATE, it tries to keep all values for a given date in the same micro-partition.
However, clustering will not sort data within the micro-partition as there's no benefit, making clustering far more efficient than data sorting.
Why not Sort Data after Loading?
There's no technical reason why you cannot sort data after it's been loaded into Snowflake. Indeed, the following command will re-sort the sales table:
insert overwrite into sales
select *
from sales
order by sale_date;
Large sort operations can, however, lead to memory spillage, extending the elapsed time and increasing query cost. Assuming the SALES table is large, this operation is best executed on a sizable virtual warehouse, as it will re-sort the entire table.
The advantage of using automatic Snowflake clustering, however, is it's incremental. Once micro partitions are well clustered, the automatic clustering service ignores these.
We can identify these in the output of SYSTEM$CLUSTERING_INFORMATION above as the constant_partition_count.
Because automatic clustering is incremental, reclustering data can be significantly less expensive than performing a complete sort each time. In addition, unlike a sort operation, the clustering operation runs as a background task without locking the table and, therefore, has no impact on other operations.
Snowflake Add Cluster Key
While it's possible to create a table with a cluster key, it's more sensible to add the key once the processes that update or query the table have been operational for a while. This gives you time to understand the query patterns against the table and the frequency and scale of updates.
The SQL below shows the syntax to add a cluster key to an existing table:
alter table sales
cluster by (sale_date);
How to Remove a Clustering Key
The SQL below shows the syntax to drop a clustering key:
alter table sales
drop clustering key;
Snowflake Clustering Best Practices
If you've read this far, you'll know how clustering works, how to create, suspend, or resume Snowflake clustering, and the alternative sort operation. However, I would advise caution when deploying clustering. Experience has demonstrated that around 80% of customers deploy cluster keys very poorly, which can lead to significant costs with little benefit.
It’s sensible to follow the Snowflake clustering best practices to avoid mistakes.
Best Practices for Snowflake Cluster Keys
Snowflake clustering is only sensible in a relatively small number of situations, including:
Large Tables: Unless you have at least 1,000+ micro-partitions in the table, you won't benefit from clustering. Snowflake recommends cluster keys on tables over 1TB in size. In reality, it depends. As clustering eliminates entire micro-partitions, it is only sensible on relatively large tables.
Frequent Queries: Snowflake Cluster Keys are best deployed on tables with many SELECT queries relative to the number of UPDATE or DELETE operations. Remember, the purpose of cluster keys is to maximize query performance. There's little potential benefit if there are few queries against the table. Likewise, update and delete operations disrupt the clustering, which adds to the overall cost.
Partitions Scanned: The objective of clustering is to reduce the number of micro-partitions scanned by queries. Therefore, you should check that the queries you intend to improve scan a high proportion of the data. You can check this using the QUERY_HISTORY columns PARTITIONS_SCANNED and PARTITIONS_TOTAL, as sketched in the example after this list.
Query Elapsed Time: It sounds obvious, but the objective of clustering is to reduce query elapsed times. You should, therefore, target queries with poor elapsed times. To state the obvious, if your queries return within an acceptable time frame, why bother tuning them?
Few Updates: As indicated above, because of the impact of UPDATE and DELETE operations on clustering, it's seldom cost-effective to deploy a clustering key on a table with frequent updates or deletes. There are mitigating factors, but keep this in mind. I did (for example) have a customer who created cluster keys on a table that was frequently deleted and re-loaded. They abandoned clustering on the entire system because they saw no performance benefits.
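To make the Partitions Scanned and Query Elapsed Time checks concrete, the sketch below pulls both from SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY; the text filter on 'SALES' is purely illustrative, so adapt it to the table you're considering.
-- Recent queries that scan a high proportion of a large table's micro-partitions.
select query_id,
       round(total_elapsed_time / 1000, 1)                               as elapsed_seconds,
       partitions_scanned,
       partitions_total,
       round(100 * partitions_scanned / nullif(partitions_total, 0), 1)  as pct_scanned
from snowflake.account_usage.query_history
where start_time > dateadd(day, -7, current_timestamp())
  and partitions_total > 1000
  and contains(upper(query_text), 'SALES')
order by pct_scanned desc, elapsed_seconds desc
limit 50;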
Best Practices to Identify the Right Cluster Key
Choosing inappropriate cluster columns will add to re-clustering costs for little or no performance gains. The best practices include:
The Where Clause: Cluster key columns must frequently appear in the WHERE clause of queries, ideally using equality or range comparisons.
Wrapping columns with functions: I frequently see queries that convert the data type from DATE to VARCHAR to compare against a fixed string, as illustrated in the example below. This leads to poor query performance, as Snowflake cannot fully eliminate partitions.
-- Avoid the following:
select *
from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1000.ORDERS
where to_char(o_orderdate,'DD-MM-YY') = '01-04-92';
-- Instead use:
select *
from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1000.ORDERS
where o_orderdate = to_date('01-04-92','DD-MM-YY');
Snowflake Clustering Key Cardinality: Best Practices
A common mistake customers make is creating a cluster key that includes a TIMESTAMP column. Snowflake clustering aims to bucket rows with the same value together in micro-partitions, and clustering by a near-unique key works against this, as it produces a clustering key with very high cardinality.
You can create a cluster key on a unique key, and the query performance may be excellent, but the cost of reclustering from update operations often outweighs the benefits.
Remember, Snowflake Clustering aims to keep data with the same values (e.g., the exact date) together in one or more micro-partitions. That's impossible to achieve with a unique key, and any updates will likely have a significant impact.
Instead of creating the cluster key directly on a TIMESTAMP, modify the key expression to reduce its cardinality. For example:
-- Instead of
alter table mobile_calls
cluster by (event_timestamp);
-- Cluster by the DAY
alter table mobile_calls
cluster by (date_trunc('DAY', event_timestamp));
Using the above technique means any query fetching data for a given DATE or TIMESTAMP still performs partition elimination by day. This massively reduces the clustering cost while retaining the performance benefits.
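For example, with the DAY-level cluster key above, a query such as the following should still prune down to the micro-partitions holding that day's data, because each micro-partition now covers a narrow range of EVENT_TIMESTAMP values (the dates are illustrative):
select *
from mobile_calls
where event_timestamp >= '2024-07-01'
  and event_timestamp <  '2024-07-02';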
Best Practices: Cluster Key Data Types
Sometimes, data may be delivered in VARCHAR format instead of DATE or NUMBER formats. It's highly advisable to store data in its actual native data type.
Instead of storing dates in VARCHAR format such as 'YYYYMMDD', you should store them as DATE. Likewise, numbers should be stored as the NUMBER data type.
It's a little-known fact that clustering and partition elimination are only performed on the first five characters of the key. This means if you cluster by a VARCHAR column in the format YYYYMMDD, you'll effectively be clustering by YYYYM, and most data will be clustered by year.
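If the source system can only deliver strings, convert them to the native type as early as possible. A minimal sketch, assuming a hypothetical SALES_RAW staging table with a SALE_DATE_STR column holding 'YYYYMMDD' strings:
-- Convert the 'YYYYMMDD' string to a native DATE on the way into the
-- reporting table (table and column names are illustrative).
create or replace table sales as
select to_date(sale_date_str, 'YYYYMMDD') as sale_date,
       store_id,
       amount
from sales_raw;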
Best Practices: Initial Snowflake Clustering
As indicated above, Snowflake performs automatic clustering on an incremental basis, which can be highly efficient for ongoing data clustering. However, manually sorting the data is almost certainly cheaper and faster for the initial clustering effort.
Consider the following SQL, which sorts a 1.3TB table:
alter warehouse demo_wh set warehouse_size = 'X2LARGE';
-- Copy 1.3TB table STORE_SALES and sort by customer
create or replace table store_sales_sort_by_customer as
select *
from snowflake_sample_data.tpcds_sf10tcl.store_sales
order by ss_customer_sk;
alter warehouse demo_wh suspend;
This took 23 minutes on an X2LARGE warehouse, costing 12 credits (about $36), whereas achieving the same result using background clustering cost over five times as much (69 credits) and took over four hours.
Therefore, the initial clustering (where possible) should use an INSERT OVERWRITE operation, with subsequent maintenance left to automatic clustering.
Best Practices: Frequent Updates
As mentioned above, clustering is typically unsuited for tables with frequent UPDATEs as Snowflake records data modification events (Inserts, Updates, or Deletes) and automatically triggers re-clustering as required.
However, customers sometimes have frequent UPDATE or DELETE operations against clustered tables, which risks a very high re-clustering cost.
In extreme cases, Snowflake may constantly re-cluster data, which is then disrupted by subsequent update operations, leading to further costly reclustering.
One way around this is to suspend clustering during peak update times manually and then resume clustering during quiet times.
Many customers find that suspending clustering during the week and resuming it at the weekend (when there are fewer updates) restores the balance between keeping tables well clustered and the cost of reclustering. For example:
alter table sales suspend recluster;
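If you want to automate this weekly rhythm, one option is a pair of scheduled tasks, sketched below; the task names, warehouse, and CRON times are assumptions, so adjust them to your own quiet periods.
-- Pause reclustering on Monday morning, resume it on Saturday morning.
create or replace task suspend_sales_recluster
  warehouse = demo_wh
  schedule  = 'USING CRON 0 6 * * 1 UTC'
as
  alter table sales suspend recluster;

create or replace task resume_sales_recluster
  warehouse = demo_wh
  schedule  = 'USING CRON 0 6 * * 6 UTC'
as
  alter table sales resume recluster;

-- Newly created tasks start suspended, so enable them:
alter task suspend_sales_recluster resume;
alter task resume_sales_recluster resume;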
In many cases, update operations marginally impact query performance, leading to graceful performance degradation over time. Re-clustering during quiet times can restore query performance without the cost of frequent clustering operations.
In many ways, this is like trying to clear a snow path while it's still snowing. It's far easier to wait until it's all settled and then clean the path.
Conclusion
Hopefully, by now, you'll appreciate why Snowflake Cluster Keys and Clustering are considered more an art than a science. However, many Snowflake customers have proven a number of good practices successful (and, likewise, identified many pitfalls to avoid).
Bookmark this article and return to it the next time you consider deploying clustering on a table. You will find it worthwhile.