The SQL Data Lakehouse

 Intro

Data lake. Data mesh. Data fabric. So many terms have popped up in recent years to describe the way organizations should work with modern data. Why do we need another one — namely, the data lakehouse?

A New Stack is Needed

Not so long ago, the enterprise data warehouse was seen as a foundational component in any stack: a centralized repository, consolidating all the smaller data marts across the organization, and capable of delivering fast response times for complex SQL queries. But today you’d be hard pressed to find a CTO or CIO who believes that a single data warehouse will be the solution to all of their data challenges. What happened in between?

  • New types of data: Connected devices and online behavior tracking generate a new type of data that’s faster, more abundant, and less structured. Streaming sources such as app user tracking, sensor data, telemetry or machine logs output a continuous stream of event-based, semi-structured data with ever-changing schema — a far cry from the tabular data of years past.
  • New use cases: Data was previously used for reporting on pre-defined KPIs and metrics — running the same query on the same table to aggregate the latest records into a summary table or visualization. However, newer use cases such as log analytics or ad-hoc exploratory analysis often require access to granular records rather than summaries.
  • Lost data: Due to the coupling of storage and compute, businesses looking to cut these costs need to store less data. This could mean pruning, setting strict retention policies, or storing aggregations rather than raw records.
  • Lock-in: Data warehouses rely on proprietary file systems to provide fast query performance, creating vendor lock-in and closed architectures. This becomes problematic when the business wants to migrate its data, or starts gravitating towards non-SQL use cases such as data science using TensorFlow. Data needs to be pulled out of the warehouse and duplicated, further driving up compute costs .
Big data analytics
The data platform

Challenges

Engineering bottlenecks abound: The data platform is managed by a single, centralized engineering team that needs to cater to requests from multiple technical and business data consumers. Every new data source that needs to be integrated, or any change to the data model — such as to add a new filter capability to a BI dashboard — goes into the backlog of a team that’s overwhelmed with similar requests. This leads to inflexibility and slow turnaround times and frustrates everyone involved.

Open, Flexible, Self-Service Data Stack

If the data warehouse is no longer fit for purpose, and the data lake has failed to deliver on its promise, is the modern data stack destined to remain an unsolved problem? We believe that the answer is no, and the solution lies in a new approach, driven by recent developments in cloud, open source, and serverless technologies. This is the open SQL data lakehouse.

data lake house
Open Data Lakehouse

Openness manifests across four key pillars:

Open formats: To prevent vendor lock-in, data is stored in non-proprietary formats that are queryable by multiple engines — Apache Parquet or Apache ORC are popular options.

The SQL Data Lakehouse stack

What is a data lake house
Building a lakehouse stack

Let’s briefly look at each of the components of the open lakehouse stack:

Storage

In an open lakehouse, data is stored as files on a distributed object store such as Amazon S3, Google Cloud Storage, or Azure Data Lake Storage. Replicating the same architecture in-house, this layer would be a Hadoop cluster.

The data catalog or metadata store

Data lakes can easily become data swamps when businesses thoughtlessly pour data into them with no consideration for how to later retrieve datasets and make them usable. The preventive for this is the data catalog.

Transaction management

Due to the append-only nature of object storage, it can be challenging to update or delete records in a data lake. This is a problem when records need to be removed for security or compliance reasons (such as a GDPR request or a fraudulent transaction), as well as when replicating OLTP databases into the lake.

Security and Governance

Using technologies such as AWS Lake Formation or Apache Ranger, businesses can solve the inherent security challenges that come with unstructured data lake storage. IT teams can define access control policies for specific records or databases, based on the metadata catalog, to prevent unauthorized access to data.

Query Engine

In an open SQL data lakehouse, storage is decoupled from compute in order to provide scale and flexibility; as a bonus, it also helps prevent vendor lock-in. SQL query engines, which can return the results of an SQL query directly from the lake, are key. Popular query engines include Presto and SparkSQL, to name a few. Visualization and dashboard tools, as well as in-house data products, can consume data from these engines via ODBC or JDBC.

How the SQL Data Lakehouse improves upon the hybrid data lake

The data Lakehouse solves many of the problems associated with the previous generation of enterprise data platforms — rigidness, lack of governance, vendor lock-in, and lack of flexibility. The components we’ve covered in the previous section provide a new set of capabilities including:

  • Single source of truth: By providing SQL-based access directly on the data lake, the lakehouse removes the need to create multiple copies of the same data for different use cases. Consistency improves and analytics sprawl is reduced. Data scientists, business teams, and developers can work with the same data using the same SQL queries, rather than creating bespoke extracts.
  • Governance and transaction layers: Data catalog and transaction management support remove much of the chaos associated with unstructured storage and prevent the data lake from ‘swampifying’.
  • Flexibility at every layer: The use of open standards throughout the data lakehouse enables businesses to pick and choose the technology that works best for each workload or use case, without being tied down to a particular vendor.

Open-Source Presto — Unlocking Lakehouse Analytics

Presto is a parallel distributed SQL query engine for the data lake, originally developed at Facebook. It enables interactive, ad-hoc analytics on large amounts of data stored on object stores.

  • Flexibility to support a wide range of SQL use cases
  • A highly pluggable architecture makes Presto easy to extend with custom integrations, such as for security or event listeners.
  • Federation of data sources via Presto connectors.
  • Seamless connectivity to existing systems using standard ANSI-SQL.

Faster and Simpler Lakehouse Analytics

While Presto can be transformative, implementing it is not always simple:

  • Presto needs a lot of configurations: Coming from the Hadoop world, Presto has hundreds of configuration parameters
  • Presto needs a catalog and storage: Presto doesn’t include a metadata catalog — as the user, you’ll need to manage one for your data lakes
  • Presto clusters need a lot of management: Whether it is to add a data source or update a config parameter, Presto clusters need to be managed
Data Lakehouse architecture
Data Lakehouse architecture

0 Comments