Snowflake is now capable of near real-time data ingestion, data integration, and data queries at an incredible scale. This article explains how Snowflake uses Kafka to capture real-time data, with results available on Tableau dashboards within minutes.
It summarizes the challenges, the components needed, and why the traditional approach (the Lambda Architecture) is no longer a sensible strategy for delivering real-time data queries at scale.
Real-Time Data Requirements
The requirements include capturing, transforming, and analyzing data at potentially massive velocity in near real-time. This involves capturing data from millions of electronic sensors, then transforming and storing the results for real-time analysis on dashboards. The solution must minimize latency (the delay between a real-world event and its impact on dashboards) to just a few minutes.
Typical applications include:
Monitoring Machine Sensors: Using embedded sensors in industrial machines or vehicles. For example, Progressive Insurance uses real-time speed data to help analyze customer behavior and deliver appropriate discounts. Logistics giant FedEx uses similar technology, SenseAware, to provide near real-time parcel tracking.
Fraud Detection: To assess the risk of credit card fraud before authorizing or declining the transaction. This can be based upon a simple report of a lost or stolen card or, more likely, an analysis of aggregate spending behavior combined with machine learning techniques.
Customer Sentiment Analysis: Used by many retail operations, this involves capturing and analyzing social media feeds, including Twitter and Facebook. In addition to the velocity challenge, the data is provided in JSON format, and the structure is likely to change over time.
What's the Problem?
The primary challenge for systems architects is the potentially massive throughput required, which could exceed a million transactions per second. NoSQL databases can handle the data velocity but come with significant disadvantages: a lack of SQL access, no transaction support, and eventual consistency. They also lack flexible join operations, and analytic query options are limited or non-existent. This means you can quickly retrieve a key-value pair for an event, but analyzing the data is a severe challenge.
However, it doesn’t stop there.
Real-time Components
The diagram above illustrates the main architectural components needed to solve this problem. These include:
Data Capture
High-Velocity Data Capture: The ability to capture high-velocity message streams at hundreds of megabytes per second from multiple data sources.
Message Queuing: We can expect short-term spikes in data volume, which implies a message-queuing solution that buffers bursts rather than forcing the entire solution to be scaled for the worst-case scenario.
Guaranteed Message Delivery: A scale-out, fault-tolerant, highly available solution that gracefully handles individual node failures without losing messages.
Architectural Separation: Decouple the source systems from the messaging, transformation, and data storage components. Ideally, the solution should allow independent scaling of each component in the stack.
Transformation
Data Integration: The transformation process must almost certainly combine real-time transaction streams with existing reference data from databases and other data sources. The solution must therefore provide excellent connectivity and seamless integration with other sources.
Guaranteed Once-Only Processing: The transformation process must be resilient to failures and restarts, effectively guaranteeing that every message is processed exactly once.
Massively Scalable: While the data capture component will help smooth out massive spikes in velocity, the solution must transparently scale to deal with regular and unexpected workloads.
Storage and Analytics
Unlimited Data Storage: The data storage solution must be capable of accepting, processing, and storing millions of transactions, ideally in a single repository. This implies an almost unlimited data storage capacity, combining Data Lake and analytics capability on a single platform.
Dashboard Connectivity: The solution must support open connectivity standards, including JDBC and ODBC, to support Business Intelligence and dashboards. Capturing data is of little value if it cannot be analyzed.
The Traditional Solution
The diagram above illustrates a typical architecture referred to as the Lambda Architecture, which includes a speed layer to process data in real time and a batch layer to produce an accurate historical record. In essence, this splits the problem into two distinct components, and the results are combined at query time in the Serving Layer to deliver results.
“Keeping code written in two different systems perfectly in sync was really, really hard”
- Jay Kreps on the Lambda Architecture (LinkedIn)
While the Lambda Architecture has many advantages, including decoupling and separation of responsibility, it also has the following disadvantages:
Logic Duplication: Much of the logic to transform the data is duplicated in both the Speed and Batch layers. This adds to system complexity and creates maintenance challenges, as code must be maintained in two places, often using two completely different technologies. As Jay Kreps observed while at LinkedIn, keeping code written in two different systems perfectly in sync was really, really hard.
Batch Processing Effort: The batch processing layer assumes all input data is re-processed every time. This has the advantage of guaranteeing accuracy, since any code changes are reapplied to the full history, but it places a huge batch processing burden on the system.
Serving Layer Complexity: As the Batch and Speed layers independently process data, the Serving Layer must execute queries against two data sources and combine real-time and historical results into a single query. This adds complexity to the solution and may rule out direct access from some dashboard tools or require additional development effort.
NoSQL Data Storage: While batch processing typically uses Hadoop/HDFS for data storage, the Speed Layer needs fast random access to data and typically uses a NoSQL database, such as HBase. This comes with huge disadvantages, including no industry-standard SQL interface, a lack of join operations, and no support for ad-hoc analytic queries.
When the only transformation tool available was MapReduce, with NoSQL for data storage, the Lambda Architecture was a sensible solution, and it has been successfully deployed at scale at Twitter and LinkedIn. However, simpler and more advanced alternatives are now available.
The Snowflake Solution
The diagram above illustrates a simple alternative solution with a single real-time data flow from source to dashboard. The critical component that makes this possible is the Snowflake data warehouse, which now includes a native Kafka connector and Streams and Tasks to seamlessly capture, transform, and analyze data in near real-time.
The components in the above solution are:
Apache Kafka: A fault-tolerant, scalable message queuing and broadcast system.
Snowflake Streams & Tasks: To receive the data, perform change data capture, and transform and store the data ready for analysis and presentation (a minimal SQL sketch follows this list).
Snowflake Multi-cluster Architecture: To seamlessly handle thousands of concurrent online users and analyze results.
Tableau: For analytic presentation and dashboards.
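To make the Streams and Tasks step concrete, the sketch below shows one way the pipeline might be wired up in SQL. All table, column, and warehouse names (sensor_landing, sensor_readings, transform_wh) are hypothetical illustrations rather than a prescribed implementation: the Kafka connector typically lands each message into a table of VARIANT columns, a stream tracks the newly arrived rows, and a scheduled task transforms them into a reporting table.

```sql
-- Hypothetical landing table in the shape the Kafka connector produces:
-- each message arrives as raw VARIANT (JSON) columns.
CREATE TABLE IF NOT EXISTS sensor_landing (
    record_metadata VARIANT,
    record_content  VARIANT
);

-- Structured target table for dashboard queries.
CREATE TABLE IF NOT EXISTS sensor_readings (
    sensor_id     STRING,
    reading_time  TIMESTAMP_NTZ,
    temperature   FLOAT
);

-- Stream performs change data capture on the landing table.
CREATE OR REPLACE STREAM sensor_landing_stream ON TABLE sensor_landing;

-- Task wakes every minute, but only runs when the stream has new rows,
-- and transforms the raw JSON into structured rows.
CREATE OR REPLACE TASK transform_sensor_data
    WAREHOUSE = transform_wh
    SCHEDULE  = '1 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('SENSOR_LANDING_STREAM')
AS
    INSERT INTO sensor_readings
    SELECT record_content:sensor_id::STRING,
           record_content:reading_time::TIMESTAMP_NTZ,
           record_content:temperature::FLOAT
    FROM   sensor_landing_stream;

-- Tasks are created suspended and must be explicitly resumed.
ALTER TASK transform_sensor_data RESUME;
```

Because the INSERT consumes the stream, each captured change is processed exactly once, which is precisely the guarantee the transformation layer requires.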
The advantages of this architecture include:
Absolute Simplicity: The entire pipeline, from change data capture through transformation to storage, can be implemented with just a handful of SQL statements.
SQL Transformations: With all data transformation logic in the Snowflake transformation component (using industry-standard SQL), there's no code duplication or multiple technologies to cause maintenance issues.
Real-Time Accuracy: As the database provides full relational support and ACID compliance, there is none of the eventual-consistency uncertainty associated with NoSQL solutions.
The diagram below illustrates one of Snowflake's unique benefits: it delivers unlimited cloud-based computing resources, both serverless and virtual warehouses, over a shared storage system.
This means providing separate speed and batch processing layers is no longer necessary: data can be continually streamed into the warehouse using Snowpipe, transformed in one virtual warehouse, and the results analyzed in another.
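As an illustration of this separation of compute from storage, the sketch below (with hypothetical stage, table, and warehouse names) shows a Snowpipe continuously loading staged JSON alongside two independent warehouses: one sized for the transformation tasks and a multi-cluster warehouse dedicated to dashboard users.

```sql
-- Hypothetical single-column landing table for raw JSON documents.
CREATE TABLE IF NOT EXISTS sensor_json (raw VARIANT);

-- Snowpipe continuously loads files as they arrive on the (assumed) external stage.
CREATE OR REPLACE PIPE sensor_pipe
    AUTO_INGEST = TRUE
AS
    COPY INTO sensor_json
    FROM @sensor_stage
    FILE_FORMAT = (TYPE = 'JSON');

-- One warehouse dedicated to the transformation tasks ...
CREATE WAREHOUSE IF NOT EXISTS transform_wh
    WAREHOUSE_SIZE = 'MEDIUM'
    AUTO_SUSPEND   = 60
    AUTO_RESUME    = TRUE;

-- ... and a separate multi-cluster warehouse for dashboard users,
-- so heavy ingestion and transformation never slow down Tableau queries.
CREATE WAREHOUSE IF NOT EXISTS dashboard_wh
    WAREHOUSE_SIZE    = 'LARGE'
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 4
    SCALING_POLICY    = STANDARD
    AUTO_SUSPEND      = 60
    AUTO_RESUME       = TRUE;
```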
Conclusion
Snowflake is uniquely positioned to provide a single platform for all your data warehouse storage, processing, and analysis needs. This includes:
Near Real-Time Streaming: Using Snowpipe and the native Kafka connector to capture and ingest streaming data without contention.
Semi-structured Processing: A simple SQL interface provides a real-time schema-on-read view over JSON, Avro, Parquet, and XML. Unlike other databases that store semi-structured data as simple text, Snowflake parses it and seamlessly stores it in a columnar data structure for fast retrieval (see the example after this list).
A Data Lake: Combining semi-structured JSON and structured CSV formats in a single repository, helping to eliminate the separate data silos of Data Lake and Warehouse.
A Data Warehouse: With full ACID compliance and an industry-standard SQL interface over an unlimited data store.
Integration with Transformation and Analytical Tools: This includes native connectors for Spark, Talend, Informatica, Looker, Power BI, and Tableau.
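To round off the semi-structured point above, here is a sketch of a schema-on-read query against the hypothetical raw JSON table used earlier; the attribute names (device, readings, and so on) are purely illustrative. Snowflake resolves the JSON paths at query time while still reading from its columnar storage of VARIANT data.

```sql
-- Schema-on-read: navigate raw JSON with path notation and cast to SQL types,
-- then explode the nested readings array with LATERAL FLATTEN.
SELECT raw:device.id::STRING              AS sensor_id,
       raw:reading_time::TIMESTAMP_NTZ    AS reading_time,
       r.value:name::STRING               AS metric_name,
       r.value:value::FLOAT               AS metric_value
FROM   sensor_json,
       LATERAL FLATTEN(input => raw:readings) r
WHERE  raw:reading_time::TIMESTAMP_NTZ >= DATEADD(hour, -1, CURRENT_TIMESTAMP());
```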