Snowflake Cache: How It Works and Why It Matters


This article explains how the Snowflake cache works using an innovative three-layer hardware architecture. Unlike most other database platforms, which allow the DBA to tune the cache sizes, the process is entirely automatic on Snowflake, raising the question:

What’s the best practice to maximize cache usage and performance?

What is Snowflake Caching?

Caching is a technique for maximizing query performance by maintaining frequently accessed data in faster memory or SSD rather than fetching it from slow remote storage (disk).

It is widely used by database systems and can dramatically improve query response times, especially for short-running end-user queries where fast response times are vital.

Snowflake Database Architecture

Before starting, it’s worth considering the Snowflake architecture and explaining where Snowflake caches data. The diagram below shows the three-layer hardware architecture Snowflake deployed.

Snowflake 3 Tier Hardware Architecture

Each layer in the above diagram represents independently scalable hardware, and each layer has dedicated storage and cache.

  1. Cloud Services Layer: This layer accepts connections and SQL queries, manages transactions, and returns results. Its storage holds the Snowflake query result cache and the Snowflake metadata cache.

  2. Compute Layer: This is where most SQL queries are executed on one or more virtual warehouses. Storage at this layer is deployed using fast SSD, sometimes called the Snowflake Data Cache or Virtual Warehouse Cache.

  3. Cloud Storage Layer: This provides long-term data storage. It is often called Remote Storage and is deployed on cloud storage, such as Amazon S3 or Azure Blob Storage, which typically runs on much slower hard disk drives. This is the slowest component in the stack, and reducing remote storage I/O can dramatically improve query performance.


Want to learn more from a Snowflake Expert? Click on the image below for training from Analytics Today.

Snowflake Training from Analytics Today


Snowflake Cache Layers

The diagram below illustrates the levels at which data and results are cached for subsequent use, including:

  1. Metadata and Result Cache: The Result Cache holds the result set of every query executed in the past 24 hours. These are available across all virtual warehouses, and results returned to one user may be reused by any other user on the system, provided they execute the same query text and the underlying data remains unchanged.

  2. Local Storage Cache: This is used to cache raw data from SQL queries. Data fetched from slower remote storage is cached in faster SSD and memory, which virtual warehouse queries can use.

  3. Remote Storage: Provides long-term data storage and resilience. On Amazon Web Services, this means 99.999999999% durability, even in the event of an entire data center failure.

How Snowflake stores data in Cloud Services, Warehouse and Cloud Storage Layers

What is the Snowflake Metadata Cache?

The Snowflake metadata cache is held in the cloud services layer and holds metadata (data about data). For example, it holds the minimum and maximum value of each column in every micro-partition of each table, along with the number of rows and distinct values.

Metadata statistics are automatically captured as data is loaded into Snowflake, which means queries like the following return within milliseconds:

select count(*)
from sales;
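Similar metadata-only aggregates behave the same way. As a sketch, assuming the same sales table has an order_date column (a hypothetical column name), the following min/max query can also be answered from metadata statistics alone, without resuming a virtual warehouse:

```sql
-- Served from micro-partition metadata: no virtual warehouse needed
select min(order_date)   -- order_date is a hypothetical column
,      max(order_date)
from   sales;
```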

What is the Snowflake Results Cache?

Whenever Snowflake executes a SELECT query, the results returned to the end user are stored and can be returned immediately by subsequent identical queries. While this may seem pointless, it significantly improves query performance, particularly when queries run from dashboards or business intelligence applications where the query text is unchanged.

This means (for example) that when a complex query is executed, which can take minutes to complete, subsequent executions return results within milliseconds.

Like other Snowflake features, this cache is managed automatically and is cleared whenever the underlying data changes, so users are always given a correct, consistent result.

One of the benefits of fetching data from the results cache is that the executing query doesn't need a virtual warehouse. In addition to improving query performance, this can also reduce costs.

What is the Virtual Warehouse Cache?

Assuming the query cannot be satisfied by the Results Cache or Metadata Cache, Snowflake needs to execute it using a Virtual Warehouse. Each node in the warehouse has a dedicated area of memory and SSD to cache data.

This means that whenever data is fetched from remote storage, it's held in the virtual warehouse cache as long as the virtual warehouse is not suspended.

Unlike the other cache layers, data in this area is automatically aged out on a least recently used (LRU) basis. This means frequently accessed data (like reference data) tends to remain in this cache.

Diagram illustrating how Snowflake stores and ages out data in the Virtual Warehouse Cache

The diagram above illustrates how the Snowflake Data Cache works. As new data is fetched from slower cloud storage, it's stored at the Recently Used end. As queries fetch data from the cache, the data is moved up the cache, meaning infrequently used data is gradually aged out.
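One way to see how effective the warehouse cache is for your workload is the PERCENTAGE_SCANNED_FROM_CACHE column in the ACCOUNT_USAGE.QUERY_HISTORY view. A minimal sketch (the warehouse name is illustrative, and ACCOUNT_USAGE views can lag behind real time):

```sql
-- What fraction of scanned data came from the warehouse cache?
select query_id
,      percentage_scanned_from_cache
from   snowflake.account_usage.query_history
where  warehouse_name = 'PROD_REPORTING_VWH'   -- hypothetical warehouse
order by start_time desc
limit  20;
```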

Can you Query the Snowflake Result Cache?

The short answer is yes. Whenever a query is executed, the results are cached in the Snowflake Query Result Cache. Provided you have not disabled the result cache, you can re-query the results or even execute queries directly against them, for example, to return a subset of columns or rows.

Consider the following SQL statements:

-- List all warehouses
show warehouses;

-- Following query will execute on Result Set
select *
from table(RESULT_SCAN(LAST_QUERY_ID()));

The first statement lists the available virtual warehouses. The second statement uses the RESULT_SCAN function to return the result set of the previous query.

Using this technique, you can execute queries against the result set, which means these subsequent queries don’t need a virtual warehouse.
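For example, the SHOW WAREHOUSES output above can itself be filtered without using a warehouse. This is a sketch, assuming the lowercase, double-quoted column names that SHOW commands return (the exact columns may vary by Snowflake version):

```sql
-- Filter the previous SHOW WAREHOUSES result set
select "name"
,      "state"
,      "size"
from   table(result_scan(last_query_id()))
where  "state" = 'SUSPENDED';
```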

I've seen cases where customers execute a sequence of queries at the start of each day to populate the results cache and maximize query performance. However, it's a rather specialized use case.

How to clear the Data Cache

Snowflake caching is entirely automatic. Unlike other database systems, you cannot adjust the amount of memory available in any cache; however, (as we’ll see below), you can clear the warehouse (data) cache by suspending the warehouse.

The following SQL demonstrates how to suspend a virtual warehouse, which will (normally) clear the data cache.

        alter warehouse prod_reporting_vwh suspend;

When a subsequent SQL statement is executed, the warehouse is automatically resumed, starting with a clean cache. Data will be read from remote storage, which is slower than reading it from local SSD storage.

However, if you immediately restart the virtual warehouse, Snowflake will try to recover the same compute nodes and restore the cache, so it may be sensible to wait a few minutes if you're performing benchmark testing.

Because the data cache is cleared when the virtual warehouse is suspended, Snowflake sets the AUTO_SUSPEND time to 600 seconds (ten minutes) by default. This keeps the cache warm and improves query performance.

However, experience demonstrates this is way too long for most applications, and it's best practice to set the AUTO_SUSPEND time to 60 seconds in most cases. This should dramatically reduce the system's cost without significantly impacting overall performance.
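Assuming the same prod_reporting_vwh warehouse as in the earlier example, the auto-suspend time is changed with a single statement:

```sql
-- Suspend automatically after 60 seconds of inactivity
alter warehouse prod_reporting_vwh set auto_suspend = 60;
```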

How to Disable the Snowflake Results Cache

You can temporarily suspend the Results Cache, although this only makes sense when benchmarking query performance. Use the following SQL statement:

        alter session set use_cached_result = false;

Setting the parameter USE_CACHED_RESULT = FALSE temporarily suspends the Snowflake results cache, purely for performance testing benchmarks. Executing the following command resets the parameter:

alter session set use_cached_result = true;

Snowflake Cache Query Benchmark

Every Snowflake account is delivered with a pre-built and populated set of Transaction Processing Performance Council (TPC) benchmark tables. To benchmark the effect of Snowflake caching, I set up a series of test queries against a small subset of the data, illustrated below.

Diagram showing Snowflake Tables used to Benchmark Cache Performance

All the queries were executed on a Medium-sized warehouse (4 nodes) and joined the tables shown. The tables were queried precisely as delivered, without any performance tuning.

The following query was executed multiple times, and the elapsed time and query plan were recorded each time.


        -- Summarize order totals by region and nation
        select  r_name
        ,       n_name
        ,       sum(o_totalprice)   as total_price
        ,       count(*)            as count_rows
        from    orders o
        join    customer c
                on c_custkey = o_custkey
        join    nation n
                on n_nationkey = c_nationkey
        join    region r
                on r_regionkey = n_regionkey
        where   o_orderdate = '1994-04-01'
        group by r_name
        ,        n_name
        order by r_name
        ,        n_name;

The screenshot below illustrates the query results, summarizing the data by Region and Country. In total, the SQL query summarized and counted over 1.5 billion rows.

The screenshot shows the first few lines returned.

Screenshot of Data Queried from the Snowflake database




Benchmark Test Sequence

The test sequence was as follows:

  1. Run from cold: This meant starting a new virtual warehouse (with no data in the virtual warehouse cache) and executing the query.

  2. Run from warm: This meant disabling result caching and repeating the query. This makes use of the local disk cache but not the result cache.

  3. Run from hot: This repeated the query again, but with result caching switched on.

Each query ran against 60GB of data, although as Snowflake returned only the columns queried and automatically compressed the data, the actual data transferred was around 12GB.

As Snowflake is a columnar data warehouse, it automatically returns only the columns needed rather than the entire row, which helps maximize query performance.

Query Performance with no Cache

This query returned in around 20 seconds, and the query profile showed it scanned around 12GB of compressed data, with 0% from the local disk cache. This means it had no benefit from disk caching.

Snowflake Query Profile showing query performance without using the virtual warehouse cache

The bar chart above demonstrates that around 50% of the time was spent on local or remote disk I/O and only 2% on actually processing the data. Clearly, any design changes we can make to reduce the disk I/O will help this query.

The results also demonstrate that the queries could not benefit from partition pruning, which might otherwise improve query performance. My article on partition pruning and cluster keys explains how clustering can dramatically improve query performance.

Query using Virtual Warehouse Data Cache

This query was executed immediately afterwards, but with the result cache disabled, and it completed in 1.2 seconds – around 16 times faster. In this case, the Virtual Warehouse cache (SSD on Amazon Web Services) was used to return results, and remote disk I/O is no longer a concern.

Snowflake Query Profile showing Query profile when the Virtual Warehouse cache is used

In the above case, the disk I/O has been reduced to around 11% of the total elapsed time, and 99% of the data came from the (local disk) cache. This is an excellent result for querying 1.5 billion rows.

Query using the Results Cache

This query involved re-executing the same query, but this time with the result cache enabled, and it returned results in milliseconds. Usually, this is the default situation, but it was disabled purely for testing purposes.

Snowflake Query Profile when the Results Cache is used

The above profile indicates the entire query was served directly from the result cache (taking around two milliseconds).

Finally, results are usually retained for 24 hours. However, the clock is reset every time the query is re-executed, up to a limit of 31 days, after which the results are purged and the query must be executed against the underlying data again.

Snowflake Performance Summary

The test sequence was designed to illustrate the effect of data caching on Snowflake. The tests included:

  • Raw Data: Over 1.5 billion rows of TPC-generated data, totalling over 60GB of raw data

  • Initial Query: It took 20 seconds and ran entirely from the remote disk. Quite impressive.

  • Second Query: It was 16 times faster at 1.2 seconds and used the Local Disk (SSD) cache.

  • Result Set Query: Returned results in 130 milliseconds from the result cache (which was intentionally disabled on the prior query).

To put the above results in context, I repeatedly ran the same query on an Oracle 11g production database server for a tier-one investment bank, and it took over 22 minutes to complete.

Snowflake Cache Best Practices

Data caching makes a massive difference to Snowflake query performance, but what can you do to ensure maximum efficiency when you cannot adjust the cache?

Here are a few best practice tips:

  1. AUTO_RESUME: This parameter controls whether the warehouse should automatically start when a query is executed. It’s usually good practice to leave this as TRUE unless you want to control when the warehouse starts and is suspended. This may be useful if you have a warehouse for a dedicated task, such as executing a specific batch job.

  2. AUTO_SUSPEND: This is the number of seconds the warehouse should remain active (and therefore continue to bill credits) when no queries are being executed. Experience shows that, in most cases, this should be set to 60 seconds, which gives the best balance of query performance and reduced cost.

  3. AUTO_SUSPEND = 0: Never, ever set AUTO_SUSPEND to zero unless you are confident you have steps in place to manually suspend the warehouse when it's no longer needed. Setting this to zero means the warehouse will remain running even when there are no queries being executed.

  4. Scale up for large data volumes: If you have a sequence of large queries to perform against massive (multi-terabyte) size data volumes, you can (in many cases) improve workload performance by scaling up. Execute a SQL statement to increase the virtual warehouse size, and new queries will start on the larger (faster) cluster. While this will start with a clean (empty) cache, you should generally find performance doubles at each size increase, and this extra performance boost will more than outweigh the cost of refreshing the cache. Be aware, however, the data cache is cleared if you scale up (or down).

  5. Scale down - but not too soon: Once your enormous task has been completed, you could reduce costs by scaling down or suspending the virtual warehouse. Be aware, however, that the cache will start clean again on the smaller cluster. By all means, tune the warehouse size dynamically, but don't keep adjusting it, or you'll lose the benefit.

  6. Scale up for large SORT operations: Each time you increase the warehouse size you add additional nodes and therefore double the memory and SSD available to cache data. If you identify spilling to storage, it's a best practice to move the offending queries to a larger warehouse as this often delivers improved query performance and potentially reduced cost.
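The scaling advice in points 4 to 6 can be sketched as follows (the warehouse name is hypothetical):

```sql
-- Scale up before a large batch workload; new queries use the larger size
alter warehouse batch_vwh set warehouse_size = 'X-LARGE';

-- Scale back down once the workload completes - but not too soon
alter warehouse batch_vwh set warehouse_size = 'SMALL';
```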

A Word of Warning

I worked with one customer who accidentally spent over $120,000 before they realized and suspended a warehouse. They made the mistake of executing the following query:

create warehouse TEST_WH with
   warehouse_size = '2X-LARGE'
   min_cluster_count = 8
   max_cluster_count = 10
   auto_suspend = 0;

Once a query was executed, the warehouse remained active until it was manually suspended. However, because it was also sized as 2X-LARGE with a MIN_CLUSTER_COUNT of 8, it immediately started charging at 256 credits per hour, or 6,144 credits ($12,288) per day.

This particular customer had no Resource Monitors in place, and the offending person no longer works there.
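A resource monitor would have capped this spend. As a sketch (the quota and names are illustrative), the following creates a monitor that suspends the warehouse once 100 credits have been consumed:

```sql
-- Suspend the warehouse once 100 credits have been consumed
create resource monitor test_rm with
   credit_quota = 100
   triggers on 100 percent do suspend;

-- Attach the monitor to the warehouse
alter warehouse TEST_WH set resource_monitor = test_rm;
```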

Conclusion

Although the benchmark queries above show impressive performance improvements, you shouldn't worry too much about cache usage. Since Snowflake automatically manages caching, there's little you can do to influence it.

In my experience, the most performance-critical queries are either sub-second (where compilation time is a bigger concern) or take 15 minutes or more (where other Snowflake performance tuning techniques are more important).

Therefore, while understanding how Snowflake caches data is useful, it's not crucial. Please read my Snowflake Query Performance Tuning article for more advice on performance tuning.

