Snowflake Data Lake: A Comprehensive Introduction


This is the first in a series of articles about Snowflake Data Lake Architecture. In this article, I'll describe some background including:

  • What is a Data Lake and why the architecture was developed.

  • How Snowflake supports Data Lake (or raw history) both internally and externally.

  • How to integrate a cloud based Data Lake with Snowflake while also maximizing query performance.

In my next article, I'll build upon this and describe how Snowflake integrates with a Data Lake built upon the Apache Iceberg table format.

What is a Data Lake?

A Data Lake is an architectural pattern that provides a central store of raw data. Unlike a data warehouse, which tends to clean, restructure, and transform data for analysis in structured tables, data lakes hold data in its raw (unprocessed) form and open-source (e.g., CSV, Parquet, or JSON) format rather than proprietary formats.

The greatest advantage of a data lake over a data warehouse is vendor independence. Therefore, a data lake tends to be a central source of enterprise data that is often used as the source for other analytic platforms, including the data warehouse.

Why was a Data Lake Developed?

The Data Lake architecture was developed out of the limitations of "old school" relational database technology, including Oracle, SQL Server, and Postgres. Traditional relational databases were designed to query data structured in fixed-format rows and columns, which made them slow to handle the growing demands of semi-structured data (e.g., JSON or Parquet). They were also built upon expensive, highly resilient hardware, which made multi-terabyte storage impractical.

The diagram below shows the overall architecture of a typical data warehouse before the data lake.

Traditional Data Warehouse Architecture

The diagram above illustrates the method I've used for 30 years to deploy data warehouses, including at several London-based investment banks. It shows data fed from database systems as CSV files loaded into a Staging Area in the database.

Once loaded, a set of ETL routines read and transform the data, storing intermediate results in the Integration Area before delivering them to Kimball-style data marts, where they're used for reporting and analysis.

💡
Notice that data in both the Staging and Integration areas is purely temporary. Because on-premises data storage was so expensive, it was not feasible to store the raw data, which was discarded once processed.

One of the most significant drawbacks of the above architecture is the lack of raw history (a record of the raw data as it arrived), a limitation driven by the considerable cost of data storage.

As with any data warehouse, we retained years of transactional history but only loaded attributes that were known, understood, and needed for analysis. When an additional source attribute was required later, its history was unavailable, which was a huge source of frustration for the business.

Furthermore, the cleaned and refined data was inappropriate for data scientists who needed the raw data, and the time and effort required to load additional data sources meant there was a constant backlog of data requirements for ad-hoc analysis.

Ideally, users wanted a dumping ground for raw data in structured, semi-structured, or even unstructured format, where they could perform ad-hoc analysis to see whether the data was actually useful.

Data Warehouse and Data Lake Architecture

The diagram above shows how the Data Lake was developed to fill the gap left by traditional relational databases. Unlike Oracle 11g, the data lake (in this case on Hadoop) can handle:

  • Data Volumes: From a performance and cost viewpoint, the data warehouse couldn't compete with a Data Lake, which used inexpensive hardware to store multiple terabytes of data in its raw form.

  • Semi-Structured Data: As Hadoop stores data in flat data files, it's better placed to handle semi-structured data (e.g., JSON, ORC, or Parquet format), which was either not supported or suffered from poor performance on traditional relational databases.

  • Streaming Sources: Users increasingly needed real-time data sources, and traditional database technology simply could not handle high-velocity data loading.


Want to sharpen your Snowflake Skills? Consider On-Demand training by Analytics Today and receive instruction from a Snowflake Superhero.

On-Demand Snowflake Training from Analytics Today


Is Snowflake a Data Lake?

A Data Lake is an architectural pattern rather than a hardware solution (e.g., Hadoop) or a commercial product (e.g., Databricks Delta Lake), so the short answer is that Snowflake is not a data lake. However, it does have a huge range of features to support a Data Lake architecture with data stored internally within Snowflake or integrated into a cloud-based, vendor-independent external data store.

The key Data Lake features supported by Snowflake include:

  • Inexpensive Storage: As Snowflake charges just $23 per terabyte per month, it's an inexpensive store for terabytes to petabytes of data. Furthermore, because Snowflake includes automatic column-level data compression, it removes the need to archive data, for example, in AWS Glacier storage.

  • Data Volumes: As the Snowflake underlying data store is AWS, Azure, or Google cloud, data storage is inexpensive and virtually unlimited.

  • Semi-Structured Data: Data in JSON, Avro, Parquet, ORC, and XML formats is natively stored in Snowflake using the VARIANT data type. This provides many of the compression and query performance benefits of structured data, with the added ability to query semi-structured data directly using SQL (see the sketch after this list).

  • Unstructured Data: Snowflake also supports unstructured data; data files can be registered with Snowflake and then accessed and processed using Java or Python.

  • Streaming Data Sources: Snowflake provides an excellent toolkit for loading, transforming, and presenting real-time data, offering Snowpipe for micro-batch loading and Snowpipe Streaming for near real-time ingestion, alongside Dynamic Tables and Snowflake Streams for transformation.
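
As a quick illustration of the VARIANT approach, the sketch below stores hypothetical JSON events in a VARIANT column and queries individual attributes using SQL path notation; the table and attribute names are illustrative assumptions only.

-- Hypothetical transient table holding raw JSON events in a VARIANT column
create or replace transient table raw_events (
   load_ts  timestamp_ltz default current_timestamp(),
   payload  variant
);

-- Query semi-structured attributes directly using SQL path notation
select payload:customer.id::number        as customer_id,
       payload:order.total::number(10,2)  as order_total,
       load_ts
from raw_events
where payload:order.status::string = 'SHIPPED';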

The Snowflake Data Lakehouse Architecture

The diagram below shows a modern data warehouse architecture that combines the benefits of a Data Warehouse and a Data Lake to provide a single, secure repository of data for analysis and machine learning applications.

Snowflake with an integrated Data Lake

As Snowflake delivers the benefits of a Data Lake with unlimited, inexpensive storage, it's now feasible to store the entire history of raw data within the data warehouse. This is a significant advantage over the traditional architecture described above. The new data loading process includes:

  1. Ingest Data: Into Staging Area tables. These should be Snowflake Transient tables to avoid Fail-safe storage costs; they are used to stage each batch of data and are cleared once the batch has been processed (see the sketch after this list).

  2. Raw History: Stores the raw history of each table loaded. Semi-structured data is held in VARIANT format, while structured data is held in standard tables with a load timestamp as part of the unique key. Raw history is used to source the data warehouse and is accessible to data scientists and analysts.

  3. The Integration Area: Stores data in 3rd Normal Form and represents the standard Inmon data warehouse approach. Unlike in the original architecture, data can be reproduced or recalculated from the Raw History.

  4. Data Marts: Represent the standard Kimball Dimensional Designs with Fact and Dimension tables to support Business Intelligence, data analysis, and reporting requirements.
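
To make steps 1 and 2 concrete, the hedged sketch below stages a daily batch into a transient table and appends it to a permanent raw history table with a load timestamp. The stage, table, and column names are illustrative assumptions only.

-- 1. Ingest: a transient staging table avoids Fail-safe storage costs
create or replace transient table stg_sales (record variant);

copy into stg_sales
   from @sales_stage/daily                 -- hypothetical external stage
   file_format = (type = parquet);

-- 2. Raw History: append each batch with a load timestamp
create table if not exists raw_sales_history (
   load_ts  timestamp_ltz,
   record   variant
);

insert into raw_sales_history
select current_timestamp(), record
from stg_sales;

-- Staging data is temporary and is cleared once the batch is processed
truncate table stg_sales;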

The one drawback of Snowflake as a Data Lake is that the data is held in a proprietary rather than an open-source format such as Apache Parquet or Iceberg. However, Snowflake is accessible from all three major cloud providers and supports hundreds of data analysis and machine learning tools.

This leads, however, to an alternative architecture, one where Snowflake works alongside a cloud-based data lake.

Snowflake and a Data Lake: External Tables

The diagram below illustrates an architecture deployed by many large enterprises with an existing cloud-based data lake.

Snowflake alongside an existing Data Lake

This approach provides many benefits, retaining the raw data in an open, vendor-independent platform while integrating it with the Data Warehouse. Notice that we don't need to load data from the Data Lake into Snowflake: using External Tables, Snowflake can support read-only queries directly against the data lake.

The diagram below illustrates how External Tables allow Snowflake users to query data from the Data Lake as if it were loaded in Snowflake. Queries can combine external data with existing tables transparently with excellent query performance.

Snowflake External Tables over an external Data Lake

The SQL statement below shows how to create an external table over data in an external stage.

create external table sales
   location = @sales/daily
   auto_refresh = true
   aws_sns_topic = 'arn:aws:sns:us-west-2:00123450:s3_sales'
   file_format = (type = parquet);

The parameter auto_refresh = true automatically notifies Snowflake when new data files arrive, making them immediately available for end users to query. In the above example, Amazon S3 notifies Snowflake of new file arrivals via the AWS_SNS_TOPIC parameter.
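
The example above also assumes an external stage named @sales already exists over the cloud storage location. A minimal sketch of creating one is shown below; the storage integration name is a hypothetical placeholder for one configured by your cloud administrator.

create stage sales
   url = 's3://sales/'
   storage_integration = s3_sales_integration
   file_format = (type = parquet);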

As native Snowflake tables automatically support partition elimination and cluster keys, their query performance will nearly always be faster than that of external tables. However, the option exists to perform file-level partition elimination against data in an external table. For example, filtering out data for a given month is possible, assuming the sales data includes the year and month in the file path.

The example below shows April, May, and June 2024 data files.

  s3://sales/2024/04/data_0_0_0.parquet
  s3://sales/2024/05/data_0_0_1.parquet
  s3://sales/2024/06/data_0_0_2.parquet

The following SQL creates a partitioned external table.

create external table sales (
   year  number as to_number(split_part(metadata$filename,'/',3)),
   month number as to_number(split_part(metadata$filename,'/',4))
)
   partition by (year, month)
   location = @sales/daily
   auto_refresh = true
   file_format = (type = parquet);

The PARTITION BY (YEAR, MONTH) clause means any query that filters by YEAR and MONTH eliminates data files, which helps maximize query performance. For example, the following query reads data from just one data file.

select *
from sales
where year  = 2024
and   month = 05;
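
Because Snowflake treats the external table like any other table, it can also be joined directly with native Snowflake tables. The sketch below is illustrative only: dim_product is a hypothetical native dimension table, and amount and product_id are assumed attributes exposed through the external table's VALUE variant column.

select p.product_name,
       sum(s.value:amount::number(10,2)) as total_sales
from sales s
join dim_product p
  on p.product_id = s.value:product_id::number
where s.year  = 2024
and   s.month = 5
group by p.product_name;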

Snowflake Training by an Expert.

Click on the image below for more detail


Maximizing Snowflake Data Lake Performance

The diagram below illustrates a simple approach to maximize query performance when querying large external tables over a data lake.

Snowflake External Tables and Materialized Views over a Data Lake

In the above example, we create a Materialized View over an external table. In many cases, the majority of end-user queries are against the most recent data, and the materialized view (MView) effectively deploys a cache of recent data while automatically supporting queries against historical data.

The SQL statement below shows how to deploy this solution.

create materialized view sales_this_year as
select *
from sales
where year > year(current_date())-1;

The query above creates a materialized view that holds the data for the current year in a standard Snowflake table. However, because the underlying external table is automatically refreshed as additional data arrives, the MView is automatically refreshed in turn.

As a result, queries against the most recent data are quickly delivered from the Snowflake table, which can support data clustering and Time Travel, while queries against historical data are automatically served from the Data Lake.
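
For example, end users who only need recent data can query the materialized view directly, and Snowflake's optimizer can also transparently rewrite suitable queries against the SALES external table to use it. The query below is illustrative; amount is an assumed attribute within the Parquet files.

select month,
       sum(value:amount::number(10,2)) as total_sales
from sales_this_year
where year = year(current_date())
group by month
order by month;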

Conclusion

The short-lived success of Hadoop led to the growth of the Data Lake as a repository of raw structured, semi-structured, and unstructured data. Like most database deployments, the Data Lake is now firmly established as a cloud-based solution, with a data catalog addressing many of the early Data Swamp challenges.

In this article, we demonstrated how Snowflake can support a Data Lake held internally and can also work alongside an external cloud-based Data Lake using Snowflake External Tables. Concerns around query performance can be easily addressed using file-level partition elimination and Materialized Views over external tables where appropriate.

In conclusion, using the simple architectural patterns described, we can quickly gain the benefits of a Data Lake architecture using Snowflake. Just be aware that query performance against Snowflake tables will nearly always exceed that of external data files, but the option still exists to load data into Snowflake from the Data Lake.

Snowflake Training from an Expert


Click on the image below for more information.

Analytics Today Snowflake Training