Data Lake Architecture: A Comprehensive Guide

Jul 13, 2020

By Mensur Hajdarbegovic, Data Architect

Mensur has extensive experience in all aspects of corporate IT, from networking and VoIP, to infrastructure planning and hardware deployment. He's also an avid tech enthusiast and reviewer.

What is a Data Lake?

The amount of data being gathered and generated by enterprises and organizations is constantly increasing. Data Lakes evolved as a means to handle and store this ever-growing deluge of data. In enterprises and large organizations, data can easily become fragmented between various departments and teams. A Data Lake is a storage repository that can centralize and store vast amounts of raw data in its native format. The data can be structured, semi-structured, or unstructured. The data structure and requirements are not defined until the data is needed, at read-time.

This essentially means that Data Lakes create a future-proof environment for raw data, unconstrained and unfiltered by traditional, strict database rules and relations at write-time. The ingested raw data is always there, and can be re-interpreted and analyzed as needed.
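
As a minimal sketch of this schema-on-read idea, the Python example below keeps a few raw events untouched and applies two different interpretations only when the data is read. The event fields, device names, and function names are hypothetical and chosen purely for illustration.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (one JSON document per line).
RAW_EVENTS = """\
{"device": "thermo-1", "ts": "2020-07-13T10:00:00", "reading": "21.5", "unit": "C"}
{"device": "thermo-2", "ts": "2020-07-13T10:00:05", "reading": "70.7", "unit": "F"}
{"device": "thermo-1", "ts": "2020-07-13T10:01:00", "reading": "21.9", "unit": "C"}
"""


def read_temperatures_celsius(raw_text):
    """Interpretation 1: apply a numeric schema and normalize units at read time."""
    rows = []
    for line in raw_text.splitlines():
        event = json.loads(line)
        value = float(event["reading"])
        if event["unit"] == "F":
            value = (value - 32) * 5 / 9
        rows.append({"device": event["device"], "celsius": round(value, 1)})
    return rows


def count_events_per_device(raw_text):
    """Interpretation 2: ignore the readings and count events per device."""
    counts = {}
    for line in raw_text.splitlines():
        device = json.loads(line)["device"]
        counts[device] = counts.get(device, 0) + 1
    return counts


# The same raw text serves both analyses; nothing was decided at write time.
print(read_temperatures_celsius(RAW_EVENTS))
print(count_events_per_device(RAW_EVENTS))
```

Neither function changes the stored text, so a third interpretation can be added later without re-ingesting anything.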

Some experts think of Data Lakes as a replacement for Data Warehouses, while others see them as a staging area for filtering and feeding data into existing Data Warehouse solutions, or as a place to store data backups from Data Warehouses and databases.

It’s important to note that Data Lake architecture varies widely from application to application and architectural considerations are always subject to technical and business requirements. The Data Lake Architecture presented in this article is meant to demonstrate a common-case prototype but is far from comprehensive enough to cover the multitude of applications of modern Data Lakes.

Data Lake Architecture Layers

Data processing in Data Lakes can be loosely organized in the following conceptual model:

Ingestion Layer

The Ingestion Layer is tasked with ingesting raw data into the Data Lake. Modification of raw data is prohibited. Raw data can be ingested in batches or in real-time, and is organized in a logical folder structure. The Ingestion Layer can accommodate data from different external sources, such as:

  • Social networks
  • IoT devices
  • Wearable devices
  • Data streaming devices

One advantage of the Ingestion Layer is that it can quickly ingest almost any type of data from almost any system, including (but not limited to):

  • Real-time data from connected health monitoring devices
  • Video streams from security cameras
  • Videos, photographs or geolocation data from mobile phones
  • All types of telemetry data
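
The following Python sketch illustrates one way an Ingestion Layer could land raw data in a logical folder structure without modifying it. The `raw/<source>/<year>/<month>/<day>/` layout and the `ingest_raw` function are assumptions made for this example, not a prescribed convention.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def ingest_raw(lake_root, source, filename, payload, received_at):
    """Write payload bytes unchanged under raw/<source>/<yyyy>/<mm>/<dd>/."""
    target_dir = (
        Path(lake_root) / "raw" / source
        / f"{received_at:%Y}" / f"{received_at:%m}" / f"{received_at:%d}"
    )
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    if target.exists():
        # Raw data must never be modified, so refuse to overwrite it.
        raise FileExistsError(f"raw object already exists: {target}")
    target.write_bytes(payload)
    return target


# Usage: a temporary directory stands in for the real storage backend.
with tempfile.TemporaryDirectory() as lake_root:
    stored = ingest_raw(
        lake_root, "iot", "thermo-1.json", b'{"reading": "21.5"}',
        datetime(2020, 7, 13, tzinfo=timezone.utc),
    )
    print(stored.relative_to(lake_root).as_posix())
```

Partitioning by source and arrival date is only one possible layout; the point is that the payload is stored byte for byte and never rewritten.
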

Distillation Layer

The Distillation Layer converts the data stored by the Ingestion Layer to structured data for further analysis. In this layer, raw data is interpreted and transformed into structured data sets and subsequently stored as files or tables. The data is cleansed, denormalized, and derived at this stage, and then becomes uniform in terms of encoding, format, and data type.
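
A rough Python sketch of the cleansing and type-unification part of this step is shown below. The field names, the epoch-seconds timestamp, and the `distill` function are hypothetical; a real Distillation Layer would typically run on a distributed engine rather than in a single process.

```python
import json
from datetime import datetime, timezone


def distill(raw_line):
    """Turn one raw JSON line into a flat record with uniform types, or None if unusable."""
    try:
        event = json.loads(raw_line)
        return {
            # Uniform identifiers: trimmed and lower-cased.
            "device": str(event["device"]).strip().lower(),
            # Uniform timestamp format: ISO 8601 in UTC.
            "ts": datetime.fromtimestamp(int(event["ts"]), tz=timezone.utc).isoformat(),
            # Uniform data type: readings are always floats.
            "reading": float(event["reading"]),
        }
    except (ValueError, KeyError, TypeError):
        # Malformed or incomplete input is skipped here; the raw copy stays intact.
        return None


raw_lines = [
    '{"device": " Thermo-1 ", "ts": "1594634400", "reading": "21.5"}',
    '{"device": "thermo-2", "ts": 1594634405, "reading": 22}',
    "not json at all",
    '{"device": "thermo-3", "reading": "n/a"}',
]
records = [r for r in map(distill, raw_lines) if r is not None]
print(records)
```

Only the first two lines survive as structured records. The rejected lines remain available in the raw zone and can be re-interpreted later with different rules.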

Processing Layer

The Processing Layer runs user queries and advanced analytical tools on structured data. Processes can be run in real-time, as a batch, or interactively. Business logic is applied in this layer and data is consumed by analytical applications. This layer is also known as trusted, gold, or production-ready.

Insights Layer

The Insights Layer is the output interface, or the query interface, of the Data Lake. It uses SQL or non-SQL queries to request and output data in reports or dashboards.
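
To give a feel for the SQL side of this layer, the sketch below uses Python's built-in `sqlite3` module as a stand-in for a real query engine. The `readings` table and its contents are hypothetical sample data.

```python
import sqlite3

# Hypothetical structured data produced by the earlier layers.
readings = [
    ("thermo-1", "2020-07-13", 21.5),
    ("thermo-1", "2020-07-13", 22.5),
    ("thermo-2", "2020-07-13", 19.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device TEXT, day TEXT, celsius REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", readings)

# A small report query of the kind a dashboard might issue.
report = conn.execute(
    """
    SELECT device, COUNT(*) AS samples, AVG(celsius) AS avg_celsius
    FROM readings
    GROUP BY device
    ORDER BY device
    """
).fetchall()
conn.close()

for device, samples, avg_celsius in report:
    print(f"{device}: {samples} sample(s), average {avg_celsius:.1f} C")
```

In a production Data Lake the same kind of query would normally run on a distributed SQL engine over files in the lake, but the shape of the request and of the tabular result is similar.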


Unified Operations Layer

The Unified Operations Layer performs system monitoring and manages the system using workflow management, auditing, and proficiency management.

In some Data Lake implementations, a Sandbox Layer is included as well. As the name suggests, this layer is a place for data exploration by data scientists and advanced analysts. The Sandbox Layer is also referred to as the Exploration Layer or Data Science Layer.

The Pros and Cons of Data Lakes

Data Lake Pros:
  • “Schema on read” rather than “schema on write” enables greater flexibility.
  • The integration of differently structured data is much easier.
  • Raw data is always kept intact.
  • Data analysis can be done later, whenever it is required and repeated if necessary. The same source data can be interpreted in different ways for different needs.
  • Data Lakes are far more scalable than traditional data warehouses.

Data Lake Cons:
  • Unstructured data and lack of metadata can lead to a Data Lake becoming a Data "Swamp", where it is hard to find useful data.
  • Data scientists may require additional training to successfully mine data from a Data Lake.
  • Inexperienced users can start dumping data into a Data Lake without a viable strategy or plan to extract valuable insight.

Data Lake File Storage

Data Lakes rely on big data storage and take advantage of its high reliability, scalability, and uptime. The main requirement for Data Lake storage is the ability to store vast amounts of data at a low cost. 

Cloud storage offers scalability at a comparatively low cost. On-premises Data Lake implementations can also be used, especially if the required big data hardware infrastructure is already in place.

Modern Data Lake architecture separates the physical storage layer from the computing layer, making them independently scalable to meet individual needs. Data Lakes traditionally relied on the Hadoop Distributed File System (HDFS) with the Apache ORC or Parquet columnar file formats. Generally, we are seeing a migration towards cloud-native storage such as Amazon S3 and Azure Data Lake Storage.

IBM offers the helpful “5 V’s of Big Data” to demonstrate the most important dimensions of stored data. They are commonly listed as volume, velocity, variety, veracity, and value.

Data Lake Architecture Business Considerations

Data Lake Security

Security should be implemented in all layers of the Data Lake, with the traditional intent of restricting access to data. Only authorized users and services are permitted. Data Lake security is accomplished by employing the following methods:

  • Network Level Security controls access to data using network security policies such as firewalls and IP address ranges.
  • Access Control permits access to authorized users. Different user roles and permissions can also be set.
  • Encryption is used extensively in Data Lakes. All data stored is encrypted, and only decrypted at read-time. In data transit, end-to-end encryption is used as well.
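
The access control method above can be sketched as a deny-by-default permission check. The role names, zone names, and the `is_allowed` function below are hypothetical; real deployments usually rely on the identity and access management features of the storage platform instead of hand-written checks.

```python
# Hypothetical mapping of user roles to the actions they may perform per zone.
ROLE_PERMISSIONS = {
    "data_engineer": {"raw": {"read", "write"}, "structured": {"read", "write"}},
    "data_scientist": {
        "raw": {"read"},
        "structured": {"read"},
        "sandbox": {"read", "write"},
    },
    "business_analyst": {"structured": {"read"}},
}


def is_allowed(role, zone, action):
    """Deny by default: only explicitly granted (role, zone, action) triples pass."""
    return action in ROLE_PERMISSIONS.get(role, {}).get(zone, set())


print(is_allowed("data_scientist", "raw", "read"))      # True
print(is_allowed("data_scientist", "raw", "write"))     # False
print(is_allowed("business_analyst", "sandbox", "read"))  # False
```

Unknown roles and unknown zones fall through to an empty permission set, so anything not explicitly granted is refused.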

One of the challenges in Data Lake security is handling sensitive or confidential personal data and adhering to legal requirements regarding the way this data can be collected, stored, and used. In global enterprises, this is even more challenging because they must comply with different regulatory frameworks, such as HIPAA in the US, GDPR in the EU, or the global PCI security standard.

Data Lake Architecture vs. Traditional Databases and Warehouses

The data analysis paradigm in Data Lakes is described as a top-down approach in comparison to traditional database systems:

  • Data Lake
    • Ingest Data
    • Analyze
    • Define Data Structure
  • Application Database
    • Relational Data Structuring
    • Ingest Data
    • Analyze
  • Data Warehouse
    • Report Data Structuring
    • Ingest Data
    • Analyze

This approach to data analysis in Data Lakes saves much of the upfront work that usually goes into defining the data structure, which allows fast ingestion and storage of data. Deferring data structuring to the last step is helpful when the structure itself is hard to define and subject to change or to different interpretations.

Data Management and Governance

Data Lake management deals with the challenges of monitoring and logging the transformations of data as it moves through different layers of the Data Lake. All actions performed on the data are logged, as well as all user actions that led up to them.
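
One simple way to picture this logging is an append-only audit log from which the lineage of any data set can be reconstructed. The sketch below is an illustration under assumed names (`log_transformation`, `lineage`, and the example paths are all hypothetical), not a description of any particular governance tool.

```python
import hashlib
import json
from datetime import datetime, timezone


def log_transformation(audit_log, user, action, source, target, content):
    """Append one entry describing who produced `target` from `source`, and how."""
    entry = {
        "at": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "source": source,
        "target": target,
        # A content hash lets later audits detect silent changes.
        "sha256": hashlib.sha256(content).hexdigest(),
    }
    audit_log.append(json.dumps(entry, sort_keys=True))
    return entry


def lineage(audit_log, target):
    """Walk the log backwards from `target` to the raw data it was derived from."""
    by_target = {e["target"]: e for e in map(json.loads, audit_log)}
    chain, seen = [], set()
    while target in by_target and target not in seen:
        seen.add(target)  # guards against accidental cycles in the log
        entry = by_target[target]
        chain.append(f'{entry["source"]} -> {entry["target"]} ({entry["action"]})')
        target = entry["source"]
    chain.reverse()
    return chain


audit_log = []
log_transformation(audit_log, "etl-service", "distill",
                   "raw/iot/a.json", "structured/iot/a.csv", b"device,reading\n")
log_transformation(audit_log, "etl-service", "aggregate",
                   "structured/iot/a.csv", "gold/iot/daily.csv", b"device,avg\n")
print(lineage(audit_log, "gold/iot/daily.csv"))
```

Each layer appends an entry when it writes output, so the path from a report back to the raw object can be traced without inspecting the data itself.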

Metadata

Metadata is data describing data. Ingestion of raw data without applying detailed metadata should not be allowed. A Data Lake can quickly turn into a Data Swamp when you are unable to locate data. On the other hand, being too strict with metadata can result in no ingested data at all, so you end up with a data desert.
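
The balance described above can be sketched as a small catalog that requires a short, fixed set of metadata fields before a data set is registered. The required fields and the `register` and `find` helpers are hypothetical choices for this example; keeping the required set small is what avoids the data desert.

```python
# Hypothetical minimum metadata every ingested data set must carry.
REQUIRED_METADATA = {"source", "owner", "format", "ingested_at"}


def missing_metadata(metadata):
    """Return the required keys that are absent or empty, sorted for stable output."""
    return sorted(k for k in REQUIRED_METADATA if not metadata.get(k))


def register(catalog, path, metadata):
    """Add a data set to the catalog only if its metadata is complete."""
    missing = missing_metadata(metadata)
    if missing:
        raise ValueError(f"refusing to ingest {path}: missing metadata {missing}")
    catalog[path] = dict(metadata)


def find(catalog, **criteria):
    """Locate data sets whose metadata matches every given key/value pair."""
    return sorted(
        path for path, meta in catalog.items()
        if all(meta.get(k) == v for k, v in criteria.items())
    )


catalog = {}
register(catalog, "raw/iot/2020/07/13/a.json",
         {"source": "iot", "owner": "ops", "format": "json", "ingested_at": "2020-07-13"})
try:
    register(catalog, "raw/unknown/blob.bin", {"source": "unknown"})
except ValueError as error:
    print(error)
print(find(catalog, source="iot"))
```

Because every registered object is searchable by its metadata, data can still be located long after the person who ingested it has moved on.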

Building a Data Lake Team - Roles and Responsibilities

A Data Lake team is essentially a Data Science (DS) team. Depending on the size of the company and the volume of big data, DS teams are custom-built for specific business tasks. In general, DS team roles and responsibilities in a Data Lake architecture would be similar to the following:

  • Chief Analytics Officer or Chief Data Officer - The team lead, tasked with communicating and translating the business needs to data science. Preferable skills include data science and analytics, programming, and business expertise.
  • Data Analyst - Deals with data collection quality, data interpretation, and analytics. Preferable skills include R, Python, JavaScript, C/C++, SQL.
  • Business Analyst - Converts business expectations into data analysis tasks. Preferable skills include data visualization, business intelligence, SQL.
  • Data Scientist - Solves business tasks using data mining and machine learning techniques. Preferable skills include R, SAS, Python, SQL, NoSQL, Hive, Pig, Hadoop, Spark.
  • Data Architect or Data Engineer - Designs and implements the Data Lake architecture, manages storage and performance, and ensures the integrity of data from different sources. Preferable skills include SQL, NoSQL, Spark.

Bear in mind that many of these required skills intersect, so an individual could combine multiple roles in a functional team.

Stages of Data Lake Implementation

Most businesses go through the following stages of development when building and integrating Data Lakes within their existing business architecture:

  • Gathering raw data - A Data Lake is built separately from the core business IT systems and it becomes a landing zone for gathering raw data from all sources. At this stage, a Data Lake serves only as a data capture environment.
  • Environment for Data Science - Data Scientists analyze gathered data and test different analytics prototypes. The Data Lake becomes a platform for analytics and machine learning experimentation.
  • Integration with core business IT systems - Large, structured data sets from the core business IT systems are loaded to the Data Lake and linked with data gathered from other sources. High-intensity everyday business tasks remain in the core business IT systems.
  • The Data Lake becomes a core part of the data infrastructure - Computing intensive tasks like machine learning utilize the Data Lake and reduce the load on core business IT systems. Data-intensive applications can be built on top of the Data Lake. Streams of raw data from various sources are quickly ingested and stored by the Data Lake.

Data Lake Best Practices

For the sake of brevity, we will limit our list of best practices to the bare essentials:

  • Data Lake implementation should be customized to support the specific needs of the enterprise or the industry that will use it.
  • Existing data management policies of the enterprise should be supported by the Data Lake implementation.
  • Try to automate adding metadata during data ingestion as much as possible.
  • A single Data Lake should be able to perform different architectural roles. For some users, a Data Lake will only serve as a digital archive, while other users perform advanced analytics of that data and find new ways to utilize it.
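
As an illustration of automating metadata during ingestion, the sketch below derives technical metadata from the object itself, so that nobody has to enter it by hand. The field names and the `auto_metadata` function are assumptions for this example.

```python
import hashlib
import mimetypes
from datetime import datetime, timezone


def auto_metadata(filename, payload, source):
    """Derive technical metadata from the object itself, without user input."""
    guessed_type, _ = mimetypes.guess_type(filename)
    return {
        "source": source,
        "filename": filename,
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
        # Fall back to a generic type when the extension is not recognized.
        "media_type": guessed_type or "application/octet-stream",
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }


print(auto_metadata("thermo-1.json", b'{"reading": "21.5"}', source="iot"))
```

Business metadata, such as the data owner or the sensitivity level, still has to come from people or from source-system configuration, but automating the technical fields lowers the barrier to ingesting well-described data.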

Software Stacks Used in Data Lakes

Software and cloud vendors have developed several software stacks for Data Lake implementation. We will list a few of the more popular ones for reference.

  • Microsoft Azure
  • Amazon AWS
  • Google Cloud Platform

The Future of Data Lakes

A Data Lake is a secure, robust, and centralized storage platform that lets you ingest, store, and process structured and unstructured data. Raw data assets are kept intact, while data exploration, analytics, machine learning, reporting, and visualization are performed on the data and tweaked as needed. This means raw data can be reused and repurposed at a later date, without much hassle.

Although many proponents or vendors may make bold promises, Data Lake architecture will never remove the need for traditional databases, nor replace them. It is simply not envisioned or designed to do that. Most daily business operations will continue to rely on traditional database systems. Repetitive and strictly defined tasks—such as sales, invoicing, inventory, banking transactions—are perfectly implemented in traditional databases. Data Lakes work in conjunction with traditional databases to generate more value from data already available to an organization, gaining new insights and discovering new information from existing data.

Early implementations of Data Lakes were plagued by the fact that the architecture was designed by data scientists for data scientists. Setting up all the different components and tools required highly qualified data engineers. Mining and analyzing data from Data Lakes faced the same challenge, as it was mostly code-based and required specialized talent. Of course, this was not an issue for the huge tech companies that dominate the big data space, thanks to their large pool of skilled software engineers and data scientists. However, new solutions like integrated, turnkey Data Lake platforms and GUI-based user interfaces instead of code-based control could make it much easier for companies to implement and use Data Lakes.

In the future, Data Lake architecture and logic could be used and integrated with large document management systems, various digital archives, public records, health care records, scientific research datasets, and so on.

Top Data Lake Learning Resources

If you are interested in learning more about Data Lake architecture, you should find some of the following resources helpful:

Books
PDFs and Whitepapers
Videos

FAQ

What is Data Lake architecture?

Data Lake architecture describes the way data is organized and handled in Data Lakes. The main component is data storage, while data processing is organized into five distinct layers: Ingestion Layer, Distillation Layer, Processing Layer, Insights Layer, and Unified Operations Layer.

What is the difference between Data Warehouse and Data Lake?

Data Warehouses and Data Lakes are both used for storing big data. A Data Lake is a storage pool of raw data structured at read-time, while a Data Warehouse is a traditional data repository holding strictly structured and filtered data defined at write-time.

What is the need for Data Lake architecture?

The volume of data being gathered by enterprises and organizations is constantly increasing. Data Lakes evolved as a means to handle and store this flood of data. The concept of storing unstructured data in a Data Lake and running different analyses and interpretations of the data later, whenever it is needed, is a very interesting tool for modern, information-driven businesses.

How is data stored in a Data Lake?

Data in a Data Lake is stored in its raw format, usually object blobs or files. The raw data itself is always kept unmodified, while separate copies are created for holding processed data for reporting, analytics, and visualization.

How do you build a Data Lake?

Using cloud storage from AWS, Azure, or Google, you can get started on the storage aspect of your Data Lake with relative ease. The next part is building the data processing, analysis, and reporting logic, and this is where things usually get complicated, as you need to use a patchwork of different tools. Using a preset Data Lake Platform for this part could be easier if it fits your needs.

Is Snowflake a Data Lake?

Yes, Snowflake can be used as a Data Lake solution, although it is best known as a cloud data warehouse. It is sometimes called a Data Ocean as it enables the use of multiple cloud solutions and providers globally with failover and sync functionality.

Is BigQuery a Data Lake?

No, Google BigQuery is a cloud-based data warehouse with rapid SQL query support.

Is Databricks a Data Lake?

No, Databricks is a company involved in the development of cloud data processing, analysis, security, administration, and reporting tools used in Data Lakes.

Is a Data Lake a Database?

A Data Lake is not a database in the traditional sense. A Data Lake is a repository for raw data with or without any structure, while a database represents a strictly structured and defined set of data.

Is Hadoop HDFS a Data Lake?

Hadoop Distributed File System (HDFS) is a distributed file system that handles large data sets and runs on commodity hardware. It is not a Data Lake; it is a file system that can provide the storage for a Data Lake.

Is S3 a Data Lake?

The Amazon Simple Storage Service (S3) is Amazon’s cloud storage platform. It is not a complete Data Lake solution, but it is often used as cloud storage in Data Lake implementations.

Why choose Data Lakes?

Traditional database systems are not really open to experimenting with data. Data Lakes enable agile data analysis and experimentation, and they are a prime choice for big data analytics.
