From Snowflake Database to AI Data Cloud
Launched in October 2014 after two years in stealth mode, the Snowflake database was initially designed to disrupt the data warehouse marketplace.
Over time, the cloud data platform was extended to handle a range of workloads, including data warehousing, data engineering, data lakes, and data science, and, following more recent announcements, application development and transaction processing.
The diagram below illustrates the key components of the Snowflake Data Cloud, which is deployed across all three major cloud providers: AWS, Google, and Azure.
The unique Snowflake architecture separates storage from compute and delivers instant elastic scalability, data sharing, and cloning, while supporting a huge third-party ecosystem through an extensive network of connectors, drivers, programming languages, and utilities.
What Does Snowflake Do?
Traditional on-premises data processing solutions have led to a hugely complex and expensive set of data silos, where IT spends more time managing the infrastructure than extracting value from the data. Hadoop's attempt to deliver a data platform was an extreme example and proved enormously expensive to maintain.
The Snowflake Data Cloud provides a single cloud-based solution for analytic (OLAP) queries and, with the announcement of Unistore and Hybrid Tables, will also support transactional (OLTP) workloads, effectively providing a single cloud data platform for data-intensive applications.
Snowflake Database Workloads
Initially deployed as a data warehouse application, the Snowflake database provides a single unified platform to manage all of your data, including the following workloads:
Data Engineering: Snowflake supports many data integration and processing tools. This, combined with the ability to deploy virtual warehouses within milliseconds and instantly scale the computing power, makes for a very compelling solution for ELT-style data engineering. With few performance tuning options, it's a remarkably low-maintenance platform, charged entirely on a pay-as-you-use basis.
Snowflake Data Warehouse: Snowflake led the delivery of a cloud-based data warehouse platform, and Gartner has recognized it as a Leader in the space for the third consecutive year. Low-maintenance administration and the ability to ingest, transform, and query data in near real time make this a class-leading database solution.
Data Lake: Although initially launched as the Snowflake Data Warehouse, the ability to store semi-structured data in the database, together with External Tables, helps integrate Snowflake into a cloud-based Data Lake (a minimal example is sketched after this list). The combination of near-unlimited compute power and inexpensive data storage with the Snowflake database's unique ability to query semi-structured data using SQL makes this an ideal platform for a Data Lake.
Data Science: The ability to scale up the virtual warehouse and process terabytes of data quickly makes the Snowflake database a compelling platform for data science. This, combined with the deep integration with machine learning and an extensive list of data science partners, eases the task of delivering machine learning solutions.
Data Applications: One of the greatest challenges solution architects face in delivering data-intensive applications is the ability to handle massive concurrency and scale gracefully. Snowflake’s unique Multi-cluster Warehouses solve this problem and provide excellent performance despite thousands of concurrent queries.
Data Exchange: Refers to the ability to share and exchange data directly in the Snowflake database with subsidiaries, partners, or third parties across the Data Cloud. The Snowflake Data Marketplace provides live access to ready-to-query data within a few clicks. As Forbes Magazine puts it, "Data is the New Oil", and Snowflake makes it simple to access and monetize data globally, as demonstrated by the leading UK supermarket Sainsbury's.
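To make the Data Lake workload concrete, the following is a minimal sketch of an External Table over files already held in cloud storage. The stage name, file layout, and attribute names are assumptions for illustration, not taken from the article.

```sql
-- Hypothetical External Table over Parquet files held in cloud storage
-- and referenced by the (assumed) stage @datalake_stage.
CREATE EXTERNAL TABLE sales_ext
  WITH LOCATION = @datalake_stage/sales/
  FILE_FORMAT = (TYPE = PARQUET);

-- Each row is exposed as a VARIANT column named VALUE, queryable with
-- standard SQL path notation and cast operators.
SELECT value:order_id::NUMBER       AS order_id,
       value:amount::NUMBER(10, 2)  AS amount
FROM   sales_ext;
```

Because the files remain in the customer's own storage, the same data can be queried in place by Snowflake and by other tools.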
Requirements of Snowflake Data Cloud
Workload Separation: One of the most significant challenges facing data warehouse architects today is balancing compute resources across several competing user groups. The most obvious split is between the ELT/ETL load processes, which need to extract, transform, clean, and aggregate the data, and the end users who want to analyze the results to extract value. Who should be given priority? The diagram below illustrates the massively different workloads of these two competing groups: the ELT processes run regular batch loads with multiple parallel processes driving CPU usage to 100%, while the analyst workload is far more irregular. This leads to the need to separate these workloads and eliminate the contention between user groups. Snowflake achieves this by supporting multiple, independently sized virtual warehouses.
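As a minimal sketch of that separation (the warehouse names and sizes below are illustrative assumptions), each user group simply runs against its own virtual warehouse, so a saturated batch load never slows down interactive queries:

```sql
-- Hypothetical warehouses: one sized for heavy ELT batch loads,
-- another sized for ad-hoc analyst queries. Each is billed separately.
CREATE WAREHOUSE etl_wh     WITH WAREHOUSE_SIZE = 'LARGE';
CREATE WAREHOUSE analyst_wh WITH WAREHOUSE_SIZE = 'SMALL';

USE WAREHOUSE etl_wh;      -- run in the batch-load session
USE WAREHOUSE analyst_wh;  -- run in the reporting session
```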
Maximize Data Loading Throughput: As indicated above, we need to extract, load, and transform data rapidly within the Snowflake database, which means maximizing throughput — the total amount of work completed rather than the response time of any single query. To achieve this, we typically need to run multiple parallel load streams with CPU usage approaching 100%, which is difficult to balance against the need for a high level of end-user concurrency. Again, Snowflake achieves this by supporting multiple virtual warehouses.
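To illustrate (the stage, table, and file format below are assumptions), a bulk load is typically expressed as a single COPY statement, which Snowflake parallelizes across the files in the stage and the cores of the warehouse executing it:

```sql
-- Hypothetical bulk load running on the dedicated load warehouse.
USE WAREHOUSE etl_wh;

COPY INTO staging.sales_raw
FROM @landing_stage/sales/
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
ON_ERROR = 'ABORT_STATEMENT';  -- stop the load on the first bad file
```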
Maximize Concurrency: A typical Snowflake data warehouse has many busy users who want to get their jobs done. They want their results as quickly as possible but often fight for machine resources with everyone else. In summary, we need to maximize concurrency — the ability to handle many queries from multiple users simultaneously. Almost every data warehouse, both on-premises and cloud-based, is built upon a single principle: size for the biggest workload and hope for the best.
In an ideal world, the data warehouse would automatically scale out, adding compute resources on the fly as needed. The hardware resources would simply grow (and shrink) to match demand, and users would be billed for the actual compute time they used, rather than making a monolithic investment every five years on the promise of superb performance, at least for a while. The Snowflake Data Warehouse meets this need with multi-cluster warehouses that automatically spin up (and shut down) virtual warehouses as concurrency demands change.
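A minimal sketch of that behavior (the name, size, and cluster limits are illustrative assumptions): a multi-cluster warehouse declares a minimum and maximum number of clusters, and Snowflake adds or removes clusters as query queues build up or drain.

```sql
-- Hypothetical multi-cluster warehouse: starts with a single cluster and
-- automatically scales out to as many as five when queries begin to queue,
-- then shrinks back as demand falls away.
CREATE WAREHOUSE reporting_wh WITH
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 5
  SCALING_POLICY    = 'STANDARD';
```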
Minimize Latency – Maximum Speed: C-Suite executives and front-office traders want sub-second response times on their dashboards. They are not concerned with the performance of ETL throughput or batch reports; they want fast response times on dashboard queries. The Snowflake database has multiple levels of caching, including result set caching, to deliver sub-second performance on executive dashboards while segmenting workloads so that large, complex reports do not degrade performance.
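The result cache needs no tuning; the sketch below (the table and query are illustrative assumptions) simply shows the session parameter that controls it and the effect of re-running an identical query:

```sql
-- Result caching is enabled by default; this parameter makes that explicit.
ALTER SESSION SET USE_CACHED_RESULT = TRUE;

SELECT region, SUM(amount) AS total_sales
FROM   sales
GROUP BY region;  -- first run: executed on the virtual warehouse

SELECT region, SUM(amount) AS total_sales
FROM   sales
GROUP BY region;  -- identical re-run on unchanged data: served from the result cache
```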
Fast Time to Value: Since Ralph Kimball and Bill Inmon first proposed data warehouses in the 1980s, the typical data loading architecture has remained unchanged. Data is extracted from the source systems overnight, and the results are transformed and loaded to the warehouse in time for analysis at the start of the next working day. In an increasingly global 24x7 economy, overnight batch processing is no longer an option. With globally based supply chains and customers, the systems that feed the warehouse no longer pause, and data must be fed constantly, in near-real time.
To put this in context, back in 1995, a leading UK mobile phone provider took 30 days to capture and analyze retail cellphone sales, and the marketing director was thrilled when we delivered a warehouse solution that captured results from the previous day. Today, systems must combine operational data (the customer's cellphone number) with real-time usage patterns by location and volume to identify and protect against fraudulent use.
The Snowflake database satisfies this requirement with Snowpipe and Snowpipe Streaming, which provide near-real-time ingestion and automatically scale compute resources up or down to match demand.
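A minimal Snowpipe sketch (the pipe, stage, and table names are assumptions for illustration): the pipe continuously copies newly arrived files into a table using Snowflake-managed, serverless compute.

```sql
-- Hypothetical pipe: each new JSON file landing in the stage is loaded
-- automatically; the target table is assumed to hold a single VARIANT column.
CREATE PIPE sales_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw.sales_events
  FROM @raw.landing_stage/events/
  FILE_FORMAT = (TYPE = JSON);
```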
Need to handle Semi-Structured Data: The rapid rise of Hadoop and NoSQL solutions (for example, MongoDB and Couchbase) was primarily driven by the need to process semi-structured data, particularly in JSON format. Unlike traditional structured data, which comes in a predefined form (like a spreadsheet with rows and columns), JSON data includes repeating groups of values, and the structure may change over time. Initially used by websites as a standard data transfer format, JSON is now the de facto standard for a huge volume of web-based traffic.
The Snowflake data warehouse natively handles both structured and semi-structured data, and JSON or Parquet data can be loaded directly into the Snowflake database. Snowflake has also extended the SQL language to query semi-structured data easily, as shown below.
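For example (the table name and JSON attributes below are assumptions for illustration), JSON documents are loaded into a VARIANT column and queried with ordinary SQL plus path notation:

```sql
-- Hypothetical table holding raw JSON documents in a VARIANT column.
CREATE TABLE raw_orders (doc VARIANT);

-- Path notation reads individual attributes; LATERAL FLATTEN unnests the
-- repeating group of line items into one row per item.
SELECT doc:customer.name::STRING AS customer_name,
       item.value:sku::STRING    AS sku,
       item.value:qty::NUMBER    AS quantity
FROM   raw_orders,
       LATERAL FLATTEN(input => doc:line_items) item;
```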
Business Intelligence Performance: Related to data loading throughput, this requirement refers to the business intelligence community, which often needs to run large and complex reports to deliver business insight. Often working to demanding deadlines, they need the maximum available compute performance, especially for end-of-month or end-of-year reports.
Independently Sized: One size fits all is no longer a viable approach. Any business has multiple independent groups of users, each with different processing requirements. It should be possible to run various independent analytic workloads on independently deployed compute resources, each sized to the needs and budget of the team, as illustrated in the diagram below.
Traditionally, the only way to guarantee performance for a departmental group of users was to invest in and maintain their own hardware, calling it a Data Mart to hide the fact it was just another Data Silo. This leads to inconsistencies, as data is summarized and reported from multiple sources, and no two reports agree on the same result.
An ideal solution would be capable of running extensive compute-intensive ELT/ETL processes on huge, powerful machines without impacting smaller departmental teams of business intelligence analysts running queries against the same data. Each user group would have their own independently sized machines, and each could be sized appropriately.
Of course, every data mart (data silo) can, in theory, be sized independently, but ideally, we want all the users to have access to all the data. Why extract, transform, and copy the data to another location? Every department should be able to transparently access all the data across the entire enterprise (subject to authority), but each independently sized to the task at hand.
“Once duplicated, data content will diverge”
— The Law of Data Duplication
Low Entry Point: Every multi-terabyte data warehouse starts with a single requirement, a single fact table, and perhaps a single report. If successful, the platform grows, but it's important (especially to small- to medium-sized start-ups) to have a low entry point. Data Warehouses used to be the province of large multinationals, but business insight is a critical requirement for any business, small or large.
Quickly Scalable: The system must be incrementally scalable, perhaps from gigabytes to terabytes or even multiple petabytes. It must therefore be possible to add compute and storage as needed, ideally without downtime, whether to increase concurrency (more users) or processing power (faster delivery of huge workloads). The Snowflake database supports this: virtual warehouses can be resized quickly with zero downtime, as the sketch below shows.
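For instance (the warehouse name and sizes are illustrative assumptions), scaling up or back down is a single online statement:

```sql
-- Hypothetical resize: running queries finish on the old size, and new
-- queries immediately use the larger cluster. No downtime is required.
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XLARGE';

-- Scale back down once the heavy workload has completed.
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'MEDIUM';
```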
Inexpensive: The solution should involve no upfront capital expenditure or commitment and should be inexpensive to operate, with costs in line with usage. Until recently, the only options available to build an analytics platform involved a huge capital outlay on hardware and database licenses, and once analytic query demands and data volumes exceeded the multi-terabyte level, a lengthy and expensive migration project was typically needed to move to a larger system. This is no longer a viable strategy; the ideal solution uses a flexible pay-as-you-use model.
Elasticity: Closely related to the above requirement, it should be possible to rapidly increase the available resources to match an unexpected workload, and just as easy to scale the processing resources back, with costs in line with usage. Ideally, the entire process should be transparent to users. Finally, when not needed, it should be possible to suspend compute resources to control costs, with the option to automatically resume processing within seconds when required.
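A minimal sketch (the warehouse name and timings are assumptions): suspension and resumption can be automatic, and a suspended warehouse accrues no compute charges.

```sql
-- Hypothetical settings: suspend after 60 seconds of inactivity and resume
-- automatically the moment the next query arrives.
ALTER WAREHOUSE etl_wh SET AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;

-- Compute can also be suspended and resumed explicitly to control costs.
ALTER WAREHOUSE etl_wh SUSPEND;
ALTER WAREHOUSE etl_wh RESUME;
```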
Consolidated: Many data warehouse initiatives lead to a disparate set of independently loaded data marts. Aside from the risk of inconsistencies, there is also a timing issue, as results are seldom delivered to all marts simultaneously. In one project, users frequently complained of inconsistent sales figures from two systems, which were (ironically) sourced from the same raw data. This requirement means all the data should be consolidated into a single data store and accessible to all users. Likewise, the solution should support queries against structured and semi-structured data to avoid a spread of technologies, including relational, NoSQL, and Hadoop data stores.
Low Administration Burden: Oracle supports 14 types of indexes, including B-Tree, Bitmap, Partitioned, Clustered, and Index Organized Tables, and there are numerous situations in which indexes are disabled, potentially leading to an exception or failure. Equally, Oracle 11 has over 450 independently adjustable parameters to tune the database, and most systems need a highly skilled DBA. The Snowflake database is provided as a service with no software to install or knobs to tune. In short, it just works.
Accessible for Data Sharing: As Bill Schmarzo (CTO at EMC Global Services) has indicated, "Data monetization is the holy grail of Big Data". In his book "Big Data: Understanding How Data Powers Big Business", he describes the five stages of maturity, including Data Monetization, illustrated below.
This requirement refers to the ability to securely share access to data with business partners, suppliers or subsidiaries. The traditional method involves building an expensive ETL pipeline to extract and deliver the data to partners or providing a pre-defined analytics dashboard. The Snowflake database can be accessed by any authorized client globally, supporting data replication and sharing.
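As a sketch of the mechanics (the share, database, and account names are assumptions for illustration), the provider grants read access to selected objects through a share, and the consumer account then queries the data in place, with no ETL pipeline and no copies:

```sql
-- Hypothetical secure share created in the data provider's account.
CREATE SHARE partner_share;

GRANT USAGE  ON DATABASE sales_db               TO SHARE partner_share;
GRANT USAGE  ON SCHEMA   sales_db.public        TO SHARE partner_share;
GRANT SELECT ON TABLE    sales_db.public.orders TO SHARE partner_share;

-- Make the share visible to a specific consumer account.
ALTER SHARE partner_share ADD ACCOUNTS = partner_org.partner_account;
```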
Technically Accessible: As NoSQL database vendors have realized, solutions accessible only to Java programmers are of limited use; systems must be accessible to business analysts and users. This requirement means the solution must support industry-standard SQL rather than only an application programming interface (API).
Hold all the Data: As the voracious appetite for data grows, scaling up to a larger hardware platform every few years is no longer an option. The ideal solution would provide an infinite data storage capacity with potentially infinite computing resources. It should have already proven itself with some of the world's largest data consumers and support systems with petabytes of data in a single table.
Simplicity: Having worked in IT since 1984 as a developer, designer, and architect, the biggest lesson I've learned is to Keep it Simple. As I have indicated before, I find designers sometimes deliver over-complex, generic solutions that could (in theory) do anything but, in reality, are remarkably difficult to operate and often misunderstood. The solution should be elegant and, above all, simple.
“Everything should be made as simple as possible, but not simpler”
— Albert Einstein
Summary
I’d expect many people reading this article to be highly skeptical that it's even remotely possible to deliver a solution satisfying all the above requirements, and just a few years ago, I would have agreed. However, the Snowflake database has achieved just that.
Snowflake genuinely is The Cloud Data Platform. From a humble but massively ambitious starting point, it has developed from an incredible Data Warehouse platform into a data cloud with the potential to serve all your analytic and transactional needs.