Snowflake Best Practices: Time Travel, Fail-safe and Data Retention

First Published: August 5th 2024

In my previous article on Snowflake Time Travel, I explained the basics of how Snowflake deploys data versioning at the micro-partition level to support data recovery. In this article, I'll discuss some of the best practices for time travel, fail-safe, and data retention. In particular, I'll explain some common pitfalls and show how I helped customers save over $50,000 annually with simple changes.

Available Options for Snowflake Data Recovery

It's worth noting that Time Travel (which tracks up to 90 days of changes) is just one of the features used to avoid data loss. When data is aged out of Time Travel, it's not immediately deleted but may be kept in Failsafe.

The diagram below illustrates the range of data, which includes:

  • Active Data: The current version of every micro-partition in the database; this is what queries read by default.

  • Time Travel: This holds prior versions of changed micro-partitions. The data can be queried using Time Travel queries and is retained for between zero and 90 days.

  • Fail-safe: This holds expired micro-partition versions for a further seven days, after which the data is permanently removed. It is accessible only via Snowflake Support.

As data is modified, the changed micro-partitions move from Active storage into Time Travel and finally Fail-safe. By default, permanent tables have one day of Time Travel and seven days of Fail-safe.

Options to recover data include:

  • Time Travel queries: using the AT or BEFORE clause to query a table as it existed at an earlier point in time.

  • UNDROP: restoring a dropped table, schema or database within its Time Travel retention period.

  • Cloning: restoring data from a previously created zero-copy clone.

  • Fail-safe: recovery of expired Time Travel data on permanent tables, available only by contacting Snowflake Support.
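
For example, here is a minimal sketch of the first two options (the table name below is hypothetical and the query ID is a placeholder):

-- Query a table as it looked one hour ago
select *
from sales.public.orders at(offset => -3600);

-- Query a table as it looked immediately before a specific statement ran
select *
from sales.public.orders before(statement => '<query_id>');

-- Restore an accidentally dropped table (within its Time Travel retention period)
undrop table sales.public.orders;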

Do we need to Backup Snowflake Data?

While you may be forgiven for assuming Time Travel provides a comprehensive backup and recovery solution, you will often need to recover data from more than 90 days ago. You'll also find that relying upon 90 days of Snowflake Time Travel quickly adds to storage costs.

It's, therefore, a recommended best practice to take periodic clones of databases or schemas. For example:

create database edw_backup_2024_07_27
   clone edw;
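
To automate these backups, one option is a scheduled task. The sketch below (task and database names are illustrative) refreshes a weekly backup clone; note that CREATE OR REPLACE overwrites the previous copy, so adjust the naming if several generations must be kept:

-- Serverless task that refreshes a weekly backup clone every Sunday at 02:00 UTC
create or replace task edw_weekly_backup
   schedule = 'USING CRON 0 2 * * 0 UTC'
as
   create or replace database edw_backup_weekly clone edw;

-- Tasks are created suspended and must be resumed to start running
alter task edw_weekly_backup resume;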

One mistake made by Snowflake customers is to assume they need to unload data using a COPY INTO <location> command.

In addition to the compute cost of physically unloading data from Snowflake, this adds to the overall storage cost, since you now hold a second copy of data that Snowflake already stores in compressed, column-level format. Finally, this approach provides no additional safety unless the data is moved off the cloud storage provider.

Snowflake Data Retention and Data Storage

As discussed in the Time Travel article, every UPDATE statement creates a new version of changed micro partitions, which can lead to significant storage costs.

The Time Travel data volume is controlled using DATA_RETENTION_TIME_IN_DAYS, a parameter that defaults to one day. However, on Snowflake Enterprise Edition or above, this can be raised to as many as 90 days.
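
The setting can also be inspected and overridden per object. A quick sketch, using a hypothetical table name:

-- Check the effective retention period and the level at which it was set
show parameters like 'DATA_RETENTION_TIME_IN_DAYS' in table edw.sales.transactions;

-- Override the retention period for a single high-churn table
alter table edw.sales.transactions set data_retention_time_in_days = 7;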

During a visit to one Snowflake customer, I found they were spending over $400,000 per year on storage; this was quickly reduced by over $300,000 simply by lowering DATA_RETENTION_TIME_IN_DAYS from 90 days to seven.

In addition to using the standard Snowflake dashboard, you can track storage costs using the following query:

-- Data Storage by Month and Type

select to_char(usage_date,'YYYYMM') as sort_month 
,      to_char(usage_date,'Mon-YYYY') as month 
,      avg(storage_bytes) as storage
,      avg(stage_bytes) as stage
,      avg(failsafe_bytes) as failsafe
from snowflake.account_usage.storage_usage
group by month, sort_month
order by sort_month;

Tracking Snowflake Storage Cost

The graph below illustrates a typical scenario whereby storage costs (at $29 per terabyte per month) gradually increase over time as more data is loaded. Notice that Stage and Fail-safe storage are a small percentage of the total.

Data Storage over Time
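
As a rough illustration, using the $29 per terabyte rate above (substitute your own contracted rate), the monthly figures can be converted into an estimated bill:

-- Estimated monthly storage cost, assuming $29 per terabyte per month
select to_char(usage_date,'YYYYMM') as sort_month
,      to_char(usage_date,'Mon-YYYY') as month
,      round(avg(storage_bytes + stage_bytes + failsafe_bytes) / power(2,40), 2) as total_tb
,      round(avg(storage_bytes + stage_bytes + failsafe_bytes) / power(2,40) * 29, 2) as est_cost_usd
from snowflake.account_usage.storage_usage
group by month, sort_month
order by sort_month;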

Reducing Snowflake Stage Cost

The graph below illustrates a common Snowflake storage issue whereby data files were regularly uploaded to a Snowflake internal stage before being loaded into a table.

The graph shows, however, that the files were not removed once loaded, which meant data storage costs were double what was expected. This example was one of several insights described in my article on monitoring Snowflake costs.
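
If internal stage files are left in place after loading, the fix is usually straightforward. A sketch, using a hypothetical stage and table name:

-- Review which files are still sitting in the internal stage
list @edw.staging.load_stage;

-- Option 1: have COPY INTO delete files automatically after a successful load
copy into edw.staging.sales_stg
  from @edw.staging.load_stage
  file_format = (type = csv)
  purge = true;

-- Option 2: remove already-loaded files explicitly
remove @edw.staging.load_stage pattern = '.*[.]csv[.]gz';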


Want to sharpen your Snowflake skills? Consider on-demand training by Analytics Today and receive instruction from a Snowflake Superhero.

Snowflake training by Analytics Today


Reporting Snowflake Data Storage Usage

The SQL statement below shows a snapshot breakdown of current storage by type, including Active, Time-Travel, and Fail-Safe storage. This can quickly highlight some frequent mistakes in managing data retention and storage.

-- Snapshot of current storage by type: Active, Time Travel and Fail-safe
with stats as (
    select sum(active_bytes)        as total_active_bytes,
           sum(time_travel_bytes)   as total_time_travel_bytes,
           sum(failsafe_bytes)      as total_failsafe_bytes
    from snowflake.account_usage.table_storage_metrics
    where active_bytes > 0
)
select
       case
           when total_active_bytes >= power(2, 40) then to_char(round(total_active_bytes / power(2, 40), 1)) || 'TB'
           when total_active_bytes >= power(2, 30) then to_char(round(total_active_bytes / power(2, 30), 1)) || 'GB'
           when total_active_bytes >= power(2, 20) then to_char(round(total_active_bytes / power(2, 20), 1)) || 'MB'
           when total_active_bytes >= power(2, 10) then to_char(round(total_active_bytes / power(2, 10), 1)) || 'KB'
           else to_char(total_active_bytes)
        end as "Total Active Bytes",
       case
           when total_time_travel_bytes >= power(2, 40) then to_char(round(total_time_travel_bytes / power(2, 40), 1)) || 'TB'
           when total_time_travel_bytes >= power(2, 30) then to_char(round(total_time_travel_bytes / power(2, 30), 1)) || 'GB'
           when total_time_travel_bytes >= power(2, 20) then to_char(round(total_time_travel_bytes / power(2, 20), 1)) || 'MB'
           when total_time_travel_bytes >= power(2, 10) then to_char(round(total_time_travel_bytes / power(2, 10), 1)) || 'KB'
           else to_char(total_time_travel_bytes)
        end as "Total Time Travel Bytes",
       case
           when total_failsafe_bytes >= power(2, 40) then to_char(round(total_failsafe_bytes / power(2, 40), 1)) || 'TB'
           when total_failsafe_bytes >= power(2, 30) then to_char(round(total_failsafe_bytes / power(2, 30), 1)) || 'GB'
           when total_failsafe_bytes >= power(2, 20) then to_char(round(total_failsafe_bytes / power(2, 20), 1)) || 'MB'
           when total_failsafe_bytes >= power(2, 10) then to_char(round(total_failsafe_bytes / power(2, 10), 1)) || 'KB'
           else to_char(total_failsafe_bytes)
        end as "Total Failsafe Bytes",
        round((total_time_travel_bytes / (total_time_travel_bytes + total_active_bytes + total_failsafe_bytes)) * 100, 1) as "Time Travel %",
        round((total_failsafe_bytes    / (total_time_travel_bytes + total_active_bytes + total_failsafe_bytes)) * 100, 1) as "Fail-Safe %"
from stats;

The graph below illustrates the data distribution above.

Typically, we'd expect the fail-safe and time-travel bytes to account for around 2-4% of the total storage. The above indicates potential issues as follows:

  • Fail-Safe Bytes: When Fail-safe bytes exceed 4% of total storage, this indicates excessive data churn, which can often be reduced.

  • Time Travel Bytes: When this exceeds 4% of total storage, it suggests DATA_RETENTION_TIME_IN_DAYS is set too high, and storage costs could be reduced by lowering the retention period.

Reducing Fail-Safe Storage Cost

By default, changed versions of micro-partitions are written to Fail-safe once they leave Time Travel and are held for seven days before being permanently deleted. This area, therefore, represents the volume of data updated, deleted, or dropped over roughly the past seven days.

When Fail-safe storage exceeds 2-4% of total storage, it indicates micro-partitions are being frequently created and removed. This is often the result of one of the following:

  1. Data Staging: Typically, data is loaded into a staging table before being merged into the master tables and then deleted or truncated. If these staging tables are created as default permanent tables, every micro-partition loaded is retained for seven days. These tables should instead be created as transient tables, as shown in the sketch after this list.

  2. Intermediate Tables: As with staging tables, when intermediate results are written to permanent tables, these should be replaced with transient or temporary tables to reduce Fail-safe storage.

  3. Other Data Changes: These can be challenging to track down, but they often result from poorly deployed Data Clustering or frequent changes to Materialized Views. In the case of Automatic Clustering, frequent updates to clustered data often trigger re-clustering, which on permanent tables produces large volumes of Fail-safe storage. Likewise, frequent updates against the tables underlying Materialized Views can lead to high Fail-safe storage.
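
As a sketch of the first two points (table and column names are illustrative), staging and intermediate tables can be created as transient with Time Travel disabled; an existing permanent table cannot be altered to transient in place, but it can be recreated as a transient clone:

-- Transient staging table: no Fail-safe storage and Time Travel disabled
create transient table edw.staging.sales_stg (
    sale_id    number,
    sale_date  date,
    amount     number(10,2)
) data_retention_time_in_days = 0;

-- Recreate an existing permanent staging table as a transient clone
create transient table edw.staging.sales_stg_transient clone edw.staging.sales_stg;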


Snowflake Training by an Expert.

Click on the image below for more detail.


Reducing Time Travel Storage Cost

While it's relatively easy to identify Fail-safe storage savings, Time Travel savings can be more challenging to find.

The most common cause is simply setting a high value for DATA_RETENTION_TIME_IN_DAYS at the account or database level. This can be adjusted by a user with the ACCOUNTADMIN role using the following SQL:

-- Setting at the ACCOUNT level
alter account 
   set data_retention_time_in_days = 1;

-- Setting at the DATABASE level
alter database edw
   set data_retention_time_in_days = 1;
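
To see which tables are actually driving Time Travel storage, and therefore where a lower retention setting will have the greatest impact, a query along these lines can help:

-- Tables holding the most Time Travel storage
select table_catalog, table_schema, table_name
,      round(time_travel_bytes / power(2,30), 1) as time_travel_gb
from snowflake.account_usage.table_storage_metrics
where time_travel_bytes > 0
order by time_travel_bytes desc
limit 20;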

While the above SQL addresses the retention setting, it may be less obvious that database or schema clones can also lead to unexpectedly high storage costs. While working with another Snowflake customer, I reduced storage costs by 50% by dropping redundant clones, as explained in my earlier article.
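
To help spot redundant clones, the same TABLE_STORAGE_METRICS view records bytes that are only being retained because a clone still references them:

-- Storage retained solely because one or more clones still reference it
select table_catalog, table_schema, table_name
,      round(retained_for_clone_bytes / power(2,30), 1) as retained_for_clone_gb
from snowflake.account_usage.table_storage_metrics
where retained_for_clone_bytes > 0
order by retained_for_clone_bytes desc
limit 20;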

Time Travel, Fail-safe, and Data Retention: Best Practices

To summarize the above recommendations as a series of Snowflake best practices:

  1. Be Aware: Snowflake includes built-in data recovery options, including Time Travel, Fail-safe, and cloning.

  2. Deploy a Backup Strategy: This involves taking a periodic zero-copy clone at the database or schema level.

  3. Track Storage Costs: Monitor storage to help identify and eliminate issues, including excessive Fail-safe and internal stage costs.

  4. Set Data Retention Sensibly: Set DATA_RETENTION_TIME_IN_DAYS at the account or database level to a sensible value to avoid excessive Time Travel storage.

  5. Identify Redundant Clones: These are often used to build test environments but can lead to unexpectedly high Time Travel costs, which are otherwise challenging to identify.

  6. Use Transient Tables: These can deliver significant storage savings, especially for staging tables, which are frequently loaded and then deleted or truncated.

Conclusion

In my experience with over 50 Snowflake customers across Europe and the Middle East, administrators are often unaware of the impact of data storage on the overall cost of operating Snowflake. However, knowing how Snowflake handles data recovery and the pitfalls of Time-Travel and Fail-Safe can pay dividends.

Snowflake Training from an Expert

We now provide on-demand Snowflake training. Click on the image below for more information.

Analytics Today Snowflake Training