Intro

Data lake. Data mesh. Data fabric. So many terms have popped up in recent years to describe the way organizations should work with modern data. Why do we need another one — namely, the data lakehouse?

Technical people know that definitions and specifications matter, and there are nuances between the terms and the approaches they represent. But all are attempts to encapsulate a new vision for how organizations work with data; and the crux of the issue is that such a vision is in dire need because current solutions haven’t evolved to meet current needs.

How to build the new data stack remains a question that almost every business grapples with, but few have managed to answer convincingly. However, it’s the exceptions — businesses that have managed to reap the full benefits of the data they generate and collect — which have grown to dominate their respective fields.

If we think of the handful of companies that have emerged as clear winners in recent years — Google, Amazon, Facebook, Uber, and Tesla spring to mind — it’s easy to find the common denominator: data. All these companies built their innovation on collecting unprecedented amounts of data, as well as developing new techniques to store, access, and process it in order to inform decisions (both human and algorithmic).

At its heart, the new data stack is the attempt to emulate these success stories: to create an organization that can handle intense data loads, and that becomes stronger and more sophisticated as it collects more data, rather than getting bogged down with the engineering work required to make that data usable.

We believe that the open lakehouse — predicated on open foundations, ubiquitous access to data, and the ability to store unlimited amounts of data in the cloud — presents an opportunity to solve the problem of the new data stack, and to break down the barriers to wider adoption of advanced analytics and machine learning.

In the next chapters, we will present the challenge that modern data poses for organizations with ambitious data strategies, and the drawbacks of current data lake and data warehouse solutions. We’ll then present our vision of the open lakehouse, powered by open-source technologies such as Presto and Apache Hudi, and how it can combine the reliability and performance of the data warehouse together with the flexibility and better price performance of the data lake.

A New Stack is Needed

Not so long ago, the enterprise data warehouse was seen as a foundational component in any stack: a centralized repository, consolidating all the smaller data marts across the organization, and capable of delivering fast response times for complex SQL queries. But today you’d be hard pressed to find a CTO or CIO who believes that a single data warehouse will be the solution to all of their data challenges. What happened in between?

The need to reimagine the data stack was born out of the basic, undeniable fact that today’s business data looks nothing like the data that organizations were working with a few years ago:

More data: Digital transformation processes lead to a proliferation of new digital systems as companies adopt technologies to replace manual processes. Even a very small company is likely to be using separate systems for CRM, online advertising, web analytics, product analytics, marketing automation, finance, and HR — each measuring customer and employee interactions across dozens of dimensions and generating thousands of records.
New types of data: Connected devices and online behavior tracking generate a new type of data that’s faster, more abundant, and less structured. Streaming sources such as app user tracking, sensor data, telemetry or machine logs output a continuous stream of event-based, semi-structured data with ever-changing schema — a far cry from the tabular data of years past.
New use cases: Data was previously used for reporting on pre-defined KPIs and metrics — running the same query on the same table to aggregate the latest records into a summary table or visualization. However, newer use cases such as log analytics or ad-hoc exploratory analysis often require access to granular records rather than summaries.

These changes have made the continued reliance on data warehouses problematic:

Cost: As data volumes increase, the cost of storing all data in the warehouse becomes prohibitive. The rise of semi-structured and unstructured data further drives up these costs due to the need to perform complex ETL or ELT in order to convert data into a tabular format and ensure a consistent data model.
Lost data: Due to the coupling of storage and compute, businesses looking to cut these costs need to store less data. This could mean pruning, setting strict retention policies, or storing aggregations rather than raw records.
Lock-in: Data warehouses rely on proprietary file systems to provide fast query performance, creating vendor lock-in and closed architectures. This becomes problematic when the business wants to migrate its data, or starts gravitating towards non-SQL use cases such as data science using TensorFlow. Data needs to be pulled out of the warehouse and duplicated, further driving up compute costs .

And we’ve not even discussed the elephant in the room — artificial intelligence and machine learning. While not many organizations have wide-scale deployments of AI in production today, most companies expect it to play a much more central role in the future. In preparation, organizations are looking to future-proof their architecture and adopt models that will reliably scale to a more data-intensive reality.

Data Lakes, Hybrid Approaches, and Why They Fail

Today, the question of ‘data warehouse vs data lake’ is largely moot. Few organizations will try to store all their log, text, image and audio data in a data warehouse; whereas a data lake that is merely unstructured object storage provides no ability to analyze data, necessitating a separate compute layer for querying.

What we are seeing instead is that organizations that have high volumes of data adopt a hybrid approach, often referred to as the enterprise data platform. To simplify an implementation that is often incredibly complex, it entails raw data being ingested and stored in an object store such as Amazon S3, and subsets of the data being moved into warehouses to power analytics and data applications.

The enterprise data platform is meant to overcome the inherent limitations of a coupled architecture by separating storage from compute. The organization can retain much higher volumes of data, including semi-structured and unstructured data, without blowing up its cloud bill.

To power specific use cases, platform engineers write code to pull data into the warehouse, and from there analysts and developers can query the data using SQL for ad-hoc analysis; apps and dashboards can do the same programmatically via ODBC or JDBC.

However, this type of enterprise data platform has rarely lived up to its promise. Instead, it’s often become an unwieldy, unmanageable project that swallows endless IT resources to deliver limited value. The reasons for this are numerous:

Challenges

Engineering bottlenecks abound: The data platform is managed by a single, centralized engineering team that needs to cater to requests from multiple technical and business data consumers. Every new data source that needs to be integrated, or any change to the data model — such as to add a new filter capability to a BI dashboard — goes into the backlog of a team that’s overwhelmed with similar requests. This leads to inflexibility and slow turnaround times and frustrates everyone involved.

Limited scope of useful data: Since the data warehouse remains the exclusive pathway to access query-able datasets, most of the challenges of the past remain unsolved. Businesses still need to choose between paying exorbitant amounts to store high volumes of data in the warehouse or working with smaller subsets of the data — aggregations, smaller time windows, or summary tables — with all the limitations that entails. The raw data might be available, but it’s still locked behind a cost-benefit analysis and limited engineering resources.

Reluctance to use the data lake leads to ‘shadow analytics’: As business units become frustrated by waiting for their data to become available through the centralized data platform, they start bypassing the data lake altogether and running their own analytics — either using the capabilities provided by the business apps they work with, or in specialized data marts. This creates challenges for data governance and reliability.

Multiple data copies and data models undermine single source of truth: Each use case requires data to be pulled from the lake and modeled into a table. Many such tables are created, and data is further manipulated inside the target dashboards or applications. The result is inconsistency and difficulty agreeing on the baseline numbers that represent the reality of the business.

Vendor lock-in: Data warehouses are relational databases that store files in proprietary formats to enable large scale analytical querying. Many of the ELT tools used within the data warehouse ecosystem also rely on proprietary technology. When all usable data is locked behind these closed gardens, any new use case or tool is dependent on the ability to integrate with proprietary systems. Migration to other systems grows increasingly challenging over time, even as vendor licensing costs become untenable.

Compliance and security: Data access management is another line item that requires constant involvement from IT and engineering teams. Businesses struggle to strike a balance between broad access to data on the one hand, and governance, regulatory compliance, and security requirements on the other.

Open, Flexible, Self-Service Data Stack

If the data warehouse is no longer fit for purpose, and the data lake has failed to deliver on its promise, is the modern data stack destined to remain an unsolved problem? We believe that the answer is no, and the solution lies in a new approach, driven by recent developments in cloud, open source, and serverless technologies. This is the open SQL data lakehouse.

Principles of the SQL data lakehouse

The open SQL data lakehouse builds on the previous hybrid approach but improves on it by introducing the concept of bringing warehouse workloads to the data lake. Businesses can run analytics on technology and tools that do not require any vendor lock-in including licensing, data formats, interfaces, and infrastructure.

Openness manifests across four key pillars:

Open formats: To prevent vendor lock-in, data is stored in non-proprietary formats that are queryable by multiple engines — Apache Parquet or Apache ORC are popular options.

Open interfaces: The interfaces to access data are reliant on industry-standard, ANSI-SQL, while JDBC or ODBC drivers can be used to connect to any reporting, dashboarding, or notebook tool. And because the stack is based on open-source technology, industry standard language clauses continue to be added in and expanded on.

Open cloud: The stack is cloud agnostic; aside from storage, it natively aligns with containers and can be run on any cloud.

Open source: While proprietary tools can be part of the open lake data stack, the bias is towards open-source technologies that allow businesses to benefit from the best innovations, not just from one vendor but from the entire community.

The SQL Data Lakehouse stack

What is a data lake house — Building a lakehouse stack

Rather than relying exclusively on the data warehouse for analytics, the lakehouse distributes workloads between various engines and platforms based on a ‘best of breed’ approach. A central layer of governance including shared catalogs ensures the security and reliability of data. Businesses can leverage the full power of the data lake as a virtually unlimited storage layer, without the complex toolchain and engineering-centric processes that plagued previous hybrid approaches.

Let’s briefly look at each of the components of the open lakehouse stack:

Storage

In an open lakehouse, data is stored as files on a distributed object store such as Amazon S3, Google Cloud Storage, or Azure Data Lake Storage. Replicating the same architecture in-house, this layer would be a Hadoop cluster.

The storage layer often includes a landing zone where raw data is ingested as-is, including event data which lands as a continuous stream of small JSON or CSV files. This is the raw material that will be used to build future analytic use cases, as well as to recreate a historical snapshot of the data. Analytics-ready datasets are stored separately in columnar formats such as Apache Parquet.

The data catalog or metadata store

Data lakes can easily become data swamps when businesses thoughtlessly pour data into them with no consideration for how to later retrieve datasets and make them usable. The preventive for this is the data catalog.

The catalog describes all the data that’s stored in your system to make it usable, so that query engines will know where to find the records required for a specific query. Examples of lakehouse data catalogs are AWS Glue, Hive Metastore, and Amundsen.

Transaction management

Due to the append-only nature of object storage, it can be challenging to update or delete records in a data lake. This is a problem when records need to be removed for security or compliance reasons (such as a GDPR request or a fraudulent transaction), as well as when replicating OLTP databases into the lake.

Transaction management can be handled with technologies like Apache Hudi, Apache Iceberg, or Delta Lake to enable incremental data ingestion, managing data capture for inserts and deletions, and ACID transactions.

Security and Governance

Using technologies such as AWS Lake Formation or Apache Ranger, businesses can solve the inherent security challenges that come with unstructured data lake storage. IT teams can define access control policies for specific records or databases, based on the metadata catalog, to prevent unauthorized access to data.

Query Engine

In an open SQL data lakehouse, storage is decoupled from compute in order to provide scale and flexibility; as a bonus, it also helps prevent vendor lock-in. SQL query engines, which can return the results of an SQL query directly from the lake, are key. Popular query engines include Presto and SparkSQL, to name a few. Visualization and dashboard tools, as well as in-house data products, can consume data from these engines via ODBC or JDBC.

In the next section, we will expand on why we believe Presto is key to enabling self-service analytics in an open lake architecture, and how it can be extended to replace most data warehouse workloads at a fraction of the cost.

How the SQL Data Lakehouse improves upon the hybrid data lake

The data Lakehouse solves many of the problems associated with the previous generation of enterprise data platforms — rigidness, lack of governance, vendor lock-in, and lack of flexibility. The components we’ve covered in the previous section provide a new set of capabilities including:

Stronger decoupling of compute and storage: In hybrid approaches (using both the warehouse and lake), the data warehouse and its related ELT ecosystem still played a major role; the open lakehouse, which is built on open-source storage and query engines reading directly from the lake, further separates compute resources from storage. Businesses can store more data and increase retention periods without worrying about unreasonable costs.
Single source of truth: By providing SQL-based access directly on the data lake, the lakehouse removes the need to create multiple copies of the same data for different use cases. Consistency improves and analytics sprawl is reduced. Data scientists, business teams, and developers can work with the same data using the same SQL queries, rather than creating bespoke extracts.
Governance and transaction layers: Data catalog and transaction management support remove much of the chaos associated with unstructured storage and prevent the data lake from ‘swampifying’.
Flexibility at every layer: The use of open standards throughout the data lakehouse enables businesses to pick and choose the technology that works best for each workload or use case, without being tied down to a particular vendor.

Open-Source Presto — Unlocking Lakehouse Analytics

Presto is a parallel distributed SQL query engine for the data lake, originally developed at Facebook. It enables interactive, ad-hoc analytics on large amounts of data stored on object stores.

While Presto is not the only SQL query engine, it has become the de-facto standard for cloud data lakes — either in its open-source distribution or via commercial SaaS products such as Ahana or Amazon Athena.

Using Presto, businesses can run SQL queries directly on data lake storage (such as Amazon S3) to retrieve tables whose metadata is stored in the metadata catalog. Other key characteristics include:

High scalability from 1 to 1000s of workers
Flexibility to support a wide range of SQL use cases
A highly pluggable architecture makes Presto easy to extend with custom integrations, such as for security or event listeners.
Federation of data sources via Presto connectors.
Seamless connectivity to existing systems using standard ANSI-SQL.

While Presto was originally developed as an ad-hoc querying tool, its highly extensible nature allows it to be used as the query layer for various analytics use cases.

Interactive querying use cases include BI dashboards, exploratory data science using SQL, federated queries across data sources, and customer-facing analytic applications. In data lakehouses, when combined with tools such as Hudi or Delta Lake, Presto can also power analytics on transactional sources.

Presto’s highly flexible, extensible, and connected design make it a perfect candidate to replace many of the workloads that organizations would usually run in-warehouse. Presto provides similarly reliable and consistent performance and requires only SQL knowledge from the people writing the queries.

By moving compute workloads from closed-source data warehouses to open-source Presto, businesses can better isolate workloads, reduce costly cloud data warehouse storage, and keep their architecture open and migrate-able. This lake-first approach also eliminates much of the engineering overload required for data lake operations, including the need to copy and remodel data per use case.

Faster and Simpler Lakehouse Analytics

While Presto can be transformative, implementing it is not always simple:

Presto needs a lot of configurations: Coming from the Hadoop world, Presto has hundreds of configuration parameters
Presto needs a catalog and storage: Presto doesn’t include a metadata catalog — as the user, you’ll need to manage one for your data lakes
Presto clusters need a lot of management: Whether it is to add a data source or update a config parameter, Presto clusters need to be managed

Ahana Cloud is a managed service for Presto that abstracts these complexities into a self-service SaaS product. With Ahana, you can deploy an Open SQL Lakehouse Stack that includes built-in AWS S3, Hive metadata store, and Apache Superset for BI use in under an hour.

Deploy this SQL lakehouse stack with Ahana Cloud for Presto

Unlike many cloud tools, Ahana doesn’t have a steep learning curve. In fact, Ahana has a free forever Community Edition to help get started with learning and validating the value of Presto for the SQL Data Lakehouse in AWS.

Organizations such as Carbon and Blinkit, India’s largest delivery app, use Ahana Cloud to power SQL-based data lake analytics, generate insights, and get faster, more iterative, interactive data discovery across all their data — without the complexity.

The SQL Data Lakehouse

A New Stack is Needed

Challenges

Open, Flexible, Self-Service Data Stack

Openness manifests across four key pillars:

The SQL Data Lakehouse stack

Let’s briefly look at each of the components of the open lakehouse stack:

Storage

The data catalog or metadata store

Transaction management

Security and Governance

Query Engine

How the SQL Data Lakehouse improves upon the hybrid data lake

Open-Source Presto — Unlocking Lakehouse Analytics

Faster and Simpler Lakehouse Analytics

While Presto can be transformative, implementing it is not always simple:

0 Comments

Popular Posts

Python IDEs and Code Editors | infoStud blogs

How To Start Your Coding Journey | infoStud blogs

Is the online/offline internship really helpful for you? | infoStud blogs

Technology

Subscribe Us

Categories

Tags

Categories

Advertisement

Subscribe Us

Facebook

Popular Posts

Big Data

Programming Languages

SAP

Artificial Intelligence

Big Data

Contact form

The SQL Data Lakehouse

A New Stack is Needed

Challenges

Open, Flexible, Self-Service Data Stack

Openness manifests across four key pillars:

The SQL Data Lakehouse stack

Let’s briefly look at each of the components of the open lakehouse stack:

Storage

The data catalog or metadata store

Transaction management

Security and Governance

Query Engine

How the SQL Data Lakehouse improves upon the hybrid data lake

Open-Source Presto — Unlocking Lakehouse Analytics

Faster and Simpler Lakehouse Analytics

While Presto can be transformative, implementing it is not always simple:

You may like these posts

0 Comments

Popular Posts

Python IDEs and Code Editors | infoStud blogs

How To Start Your Coding Journey | infoStud blogs

Is the online/offline internship really helpful for you? | infoStud blogs

Technology

Subscribe Us

Categories

Tags

Categories

Advertisement

Subscribe Us

Facebook

Popular Posts

Big Data

Programming Languages

SAP

Artificial Intelligence

Big Data

Contact form