12 min read • Jul 13, 2020
Mensur has extensive experience in all aspects of corporate IT, from networking and VoIP, to infrastructure planning and hardware deployment. He's also an avid tech enthusiast and reviewer.
The amount of data being gathered and generated by enterprises and organizations is constantly increasing. Data Lakes evolved as a means to handle and store this ever-growing deluge of data. In enterprises and large organizations, data can easily become fragmented between various departments and teams. A Data Lake is a storage repository that can centralize and store vast amounts of raw data in its native format. The data can be structured, semi-structured, or unstructured. The data structure and requirements are not defined until the data is needed, at read-time.
This essentially means that Data Lakes create a future-proof environment for raw data, unconstrained and unfiltered by traditional, strict database rules and relations at write-time. The ingested raw data is always there, and can be re-interpreted and analyzed as needed.
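The read-time idea can be illustrated with a minimal sketch. It assumes raw events are kept as untouched JSON lines, and the field names (`user`, `amount`, `coupon`) and the helper `read_with_schema` are hypothetical, chosen only for illustration. Two different "schemas" are applied to the same raw data when it is read, not when it is written.

```python
import json

# Raw events are stored exactly as received; no schema is enforced at write-time.
RAW_EVENTS = [
    '{"user": "alice", "amount": "19.90", "ts": "2020-07-01T10:00:00"}',
    '{"user": "bob", "amount": "5.00", "ts": "2020-07-01T11:30:00", "coupon": "JULY"}',
    '{"user": "alice", "amount": "7.10", "ts": "2020-07-02T09:15:00"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema (field name -> converter) to raw lines at read-time."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        # Fields missing from a record become None instead of failing the read.
        rows.append({
            field: convert(record[field]) if field in record else None
            for field, convert in schema.items()
        })
    return rows


# Two interpretations of the same untouched raw data.
revenue_view = read_with_schema(RAW_EVENTS, {"user": str, "amount": float})
coupon_view = read_with_schema(RAW_EVENTS, {"user": str, "coupon": str})

total_revenue = round(sum(row["amount"] for row in revenue_view), 2)
coupons_used = [row["coupon"] for row in coupon_view if row["coupon"] is not None]
```

Because the raw lines are never rewritten, a third interpretation can be added later without re-ingesting anything.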
Some experts think of Data Lakes as a replacement for Data Warehouses, while others see them as a staging area for filtering and feeding data into existing Data Warehouse solutions, or as a place to store data backups from Data Warehouses and databases.
It’s important to note that Data Lake architecture varies widely from application to application and architectural considerations are always subject to technical and business requirements. The Data Lake Architecture presented in this article is meant to demonstrate a common-case prototype but is far from comprehensive enough to cover the multitude of applications of modern Data Lakes.
Data processing in Data Lakes can be loosely organized in the following conceptual model:
The Ingestion Layer is tasked with ingesting raw data into the Data Lake. Modification of raw data is prohibited. Raw data can be ingested in batches or in real-time, and is organized in a logical folder structure. The Ingestion layer can accommodate data from different external sources, such as:
One of the advantages is that it can quickly ingest almost any type of data covering any system, including (but not limited to):
The Distillation Layer converts the data stored by the Ingestion Layer to structured data for further analysis. In this layer, raw data is interpreted and transformed into structured data sets and subsequently stored as files or tables. The data is cleansed, denormalized, and derived at this stage, and then becomes uniform in terms of encoding, format, and data type.
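As a rough sketch of what "uniform in terms of encoding, format, and data type" could look like in practice, the function below normalizes one raw record. The record layout (`customer`, `amount`, `ts`) and the choice of UTF-8 and UTC are assumptions made for this example, not something prescribed by the architecture.

```python
from datetime import datetime, timezone


def distill(raw_record):
    """Turn one raw record into a uniform, typed row."""

    def text(value):
        # Uniform encoding: bytes are decoded as UTF-8 (an assumption of this sketch).
        if isinstance(value, bytes):
            value = value.decode("utf-8")
        return str(value).strip()

    # Uniform format: accept epoch seconds or ISO 8601 text, emit ISO 8601 in UTC.
    raw_ts = raw_record["ts"]
    if isinstance(raw_ts, (int, float)):
        ts = datetime.fromtimestamp(raw_ts, tz=timezone.utc)
    else:
        ts = datetime.fromisoformat(text(raw_ts))
        if ts.tzinfo is None:
            # Naive timestamps are treated as UTC in this sketch.
            ts = ts.replace(tzinfo=timezone.utc)

    return {
        "customer": text(raw_record["customer"]).lower(),
        # Uniform data type: amounts become floats; a decimal comma is tolerated.
        "amount": float(text(raw_record["amount"]).replace(",", ".")),
        "ts": ts.astimezone(timezone.utc).isoformat(),
    }


row_a = distill({"customer": b" Alice ", "amount": "12,50", "ts": 0})
row_b = distill({"customer": "BOB", "amount": 3, "ts": "2020-07-13T08:00:00"})
```

The raw record passed in is not modified; the distilled row is a separate copy, which matches the rule that raw data stays untouched.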
The Processing Layer runs user queries and advanced analytical tools on structured data. Processes can be run in real-time, as a batch, or interactively. Business logic is applied in this layer and data is consumed by analytical applications. This layer is also known as trusted, gold, or production-ready.
The Insights Layer is the output interface, or the query interface, of the Data Lake. It uses SQL or non-SQL queries to request and output data in reports or dashboards.
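To make the query interface concrete, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for whatever query engine a real Data Lake would expose. The `sales` table and its rows are invented for illustration; the point is only that a report is the result of a SQL query over already structured data.

```python
import sqlite3

# An in-memory database stands in for the structured data produced upstream.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EU", 120.0), ("US", 80.0), ("EU", 30.0)],
)

# A report is simply the result of a query against the structured data.
report = conn.execute(
    "SELECT region, SUM(amount) AS total "
    "FROM sales GROUP BY region ORDER BY total DESC"
).fetchall()
conn.close()
```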
The Unified Operations Layer performs system monitoring and manages the system using workflow management, auditing, and proficiency management.
In some Data Lake implementations, a Sandbox Layer is included as well. As the name suggests, this layer is a place for data exploration by data scientists and advanced analysts. The sandbox layer is also referred to as the Exploration Layer or Data Science Layer.
Data Lakes rely on big data storage and take advantage of its high reliability, scalability, and uptime. The main requirement for Data Lake storage is the ability to store vast amounts of data at a low cost.
Using cloud storage has the advantage of scalability while being comparatively lower in cost. On-premises Data Lake implementations can also be used, especially if the required big data hardware infrastructure is already in place.
Modern Data Lake architecture separates the physical storage layer from the computing layer, making them independently scalable to meet individual needs. Data Lakes traditionally relied on the Hadoop Distributed File System (HDFS) with Apache ORC or Parquet columnar file formats. Generally, we are seeing a migration towards cloud-native storage such as Amazon S3 and Azure Data Lake Storage.
IBM offers the helpful “5 V’s of Big Data” to demonstrate the most important dimensions of stored data:
Security should be implemented in all layers of the Data Lake, with the traditional intent of restricting access to data. Only authorized users and services are permitted. Data Lake security is accomplished by employing the following methods:
One of the challenges in Data Lake security is handling sensitive or confidential personal data and adhering to legal requirements regarding the way this data can be collected, stored, and used. In global enterprises, this is even more challenging because they must comply with regulatory frameworks in different jurisdictions, such as HIPAA in the US and GDPR in the EU, as well as the global PCI security standard.
The data analysis paradigm in Data Lakes is described as a top-down approach in comparison to traditional database systems:
This approach to data analysis in Data Lakes saves a lot of the upfront work that usually goes into creating the data structure, thus allowing fast ingestion and storage of data. Deferring the structuring of data to the last step is helpful in situations where the structure itself is hard to define and is subject to change or to different interpretations.
Data Lake management deals with the challenges of monitoring and logging the transformations of data as it moves through different layers of the Data Lake. All actions performed on the data are logged, as well as all user actions that led up to them.
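One simple way to picture this kind of logging is a wrapper that records every transformation together with the layer and the user that ran it. The sketch below is only an illustration: the decorator `logged_step`, the in-memory `audit_log` list, and the user names are all hypothetical, and a real deployment would write to durable, tamper-resistant storage instead.

```python
import functools
from datetime import datetime, timezone

# In-memory audit trail; a stand-in for a durable log store.
audit_log = []


def logged_step(layer, user):
    """Decorator factory: record who ran which transformation in which layer."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(records):
            result = func(records)
            audit_log.append({
                "layer": layer,
                "step": func.__name__,
                "user": user,
                "rows_in": len(records),
                "rows_out": len(result),
                "at": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return wrapper
    return decorator


@logged_step(layer="distillation", user="etl-service")
def drop_empty(records):
    # Remove records that carry no fields at all.
    return [r for r in records if r]


@logged_step(layer="processing", user="analyst-1")
def total_amount(records):
    # Collapse all records into a single summary row.
    return [{"total": sum(r["amount"] for r in records)}]


result = total_amount(drop_empty([{"amount": 2}, {}, {"amount": 3}]))
```

After the two calls, `audit_log` holds one entry per step in execution order, so the path of the data through the layers can be reconstructed.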
Metadata is data describing data. Ingestion of raw data without applying detailed metadata should not be allowed. A Data Lake can quickly turn into a Data Swamp when you are unable to locate data. On the other hand, being too strict with metadata can result in no ingested data at all, so you end up with a data desert.
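The balance described above can be sketched as an ingestion gate that insists on a small set of metadata fields and nothing more. The field names in `REQUIRED_METADATA`, the `ingest` function, and the dictionary used as a catalog are assumptions made for this example; which fields are mandatory is a decision each organization has to make for itself.

```python
# A deliberately short list: enough to find the data later, not enough to block ingestion.
REQUIRED_METADATA = ("source", "owner", "ingested_at", "format")


class MissingMetadataError(ValueError):
    """Raised when raw data arrives without the minimum metadata."""


def ingest(catalog, path, payload, metadata):
    """Register raw bytes in the catalog only if the required metadata is present."""
    missing = [key for key in REQUIRED_METADATA if not metadata.get(key)]
    if missing:
        raise MissingMetadataError(f"missing metadata: {', '.join(missing)}")
    # The payload itself is not inspected or altered; only its description is stored.
    catalog[path] = {"metadata": dict(metadata), "size_bytes": len(payload)}
    return catalog[path]


catalog = {}
entry = ingest(
    catalog,
    "raw/crm/2020/07/13/contacts.csv",
    b"id,name\n1,Ann\n",
    {
        "source": "crm",
        "owner": "sales-ops",
        "ingested_at": "2020-07-13T08:00:00+00:00",
        "format": "csv",
    },
)
```

Calling `ingest` with only `{"source": "crm"}` would raise `MissingMetadataError` and leave the catalog unchanged.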
A Data Lake team is essentially a Data Science (DS) team. Depending on the size of the company and the volume of big data, DS teams are custom-built for specific business tasks. In general, DS team roles and responsibilities in a Data Lake architecture would be similar to the following:
Bear in mind that many of these required skills intersect, so an individual could combine multiple roles in a functional team.
Most businesses go through the following stages of development when building and integrating Data Lakes within their existing business architecture:
For the sake of brevity, we will limit our list of best practices to the bare essentials:
Software and cloud vendors have developed several software stacks for Data Lake implementation. We will list a few of the more popular ones for reference.
A Data Lake is a secure, robust, and centralized storage platform that lets you ingest, store, and process structured and unstructured data. Raw data assets are kept intact, while data exploration, analytics, machine learning, reporting, and visualization are performed on the data and tweaked as needed. This means raw data can be reused and repurposed at a later date, without much hassle.
Although many proponents or vendors may make bold promises, Data Lake architecture will never remove the need for traditional databases, nor replace them. It is simply not envisioned or designed to do that. Most daily business operations will continue to rely on traditional database systems. Repetitive and strictly defined tasks—such as sales, invoicing, inventory, banking transactions—are perfectly implemented in traditional databases. Data Lakes work in conjunction with traditional databases to generate more value from data already available to an organization, gaining new insights and discovering new information from existing data.
Early implementations of Data Lakes were plagued by the fact that the architecture was designed by data scientists for data scientists. Setting up all the different components and tools required highly qualified data engineers. Mining and analyzing data from Data Lakes faced the same challenge, as it was mostly code-based and required specialized talent. Of course, this was not an issue for the huge tech companies that dominate the big data space, thanks to their large pools of skilled software engineers and data scientists. However, new solutions such as integrated, turnkey Data Lake platforms and GUI-based interfaces in place of code-based control could make it much easier for companies to implement and use Data Lakes.
In the future, Data Lake architecture and logic could be used and integrated with large document management systems, various digital archives, public records, health care records, scientific research datasets, and so on.
If you are interested in learning more about Data Lake architecture, you should find some of the following resources helpful:
What is Data Lake architecture?
Data Lake architecture describes the way data is organized and handled in Data Lakes. The main component is data storage, and all data transformations are organized into five distinct layers: Ingestion Layer, Distillation Layer, Processing Layer, Insights Layer, and Unified Operations Layer.
What is the difference between Data Warehouse and Data Lake?
Data Warehouses and Data Lakes are both used for storing big data. A Data Lake is a storage pool of raw data structured at read-time, while a Data Warehouse is a traditional data repository holding strictly structured and filtered data defined at write-time.
What is the need for Data Lake architecture?
The volume of data being gathered by enterprises and organizations is constantly increasing. Data Lakes evolved as a means to handle and store this flood of data. The concept of storing unstructured data in a Data Lake and running different analyses and interpretations of the data later, whenever it is needed, is a very interesting tool for modern, information-driven businesses.
How is data stored in a Data Lake?
Data in a Data Lake is stored in its raw format, usually object blobs or files. The raw data itself is always kept unmodified, while separate copies are created for holding processed data for reporting, analytics, and visualization.
How do you build a Data Lake?
Using cloud storage from AWS, Azure, or Google, you can get started on the storage aspect of your Data Lake with relative ease. The next part is building the data processing, analysis, and reporting logic, and this is where things usually get complicated, as you need to use a patchwork of different tools. Using a preset Data Lake Platform for this part could be easier if it fits your needs.
Is Snowflake a Data Lake?
Yes, Snowflake is a Data Lake solution. It is sometimes called a Data Ocean as it enables the use of multiple cloud solutions and providers globally with failover and sync functionality.
Is BigQuery a Data Lake?
No, Google BigQuery is a cloud-based data warehouse with rapid SQL query support.
Is Databricks a Data Lake?
No, Databricks is a company involved in the development of cloud data processing, analysis, security, administration, and reporting tools used in Data Lakes.
Is a Data Lake a Database?
A Data Lake is not a database in the traditional sense. A Data Lake is a repository for raw data with or without any structure, while a database represents a strictly structured and defined set of data.
Is Hadoop HDFS a Data Lake?
Hadoop Distributed File System (HDFS) is a distributed file system that handles large data sets running on commodity hardware. It is not a Data Lake; it is a file system that can handle the storage of a Data Lake.
Is S3 a Data Lake?
The Amazon Simple Storage Service (S3) is Amazon’s cloud storage platform. It is not a complete Data Lake solution, but it is often used as cloud storage in Data Lake implementations.
Why choose Data Lakes?
Traditional database systems are not really open to experimenting with data. Data Lakes enable agile data analysis and experimentation, and they are a prime choice for big data analytics.