The amount of data gathered and generated by enterprises and organizations is constantly increasing, and Data Lakes evolved as a means to handle and store this ever-growing deluge. In large organizations, data can easily become fragmented across various departments and teams.
What is a Data Lake?
A Data Lake is a storage repository that can centralize and store vast amounts of raw data in its native format. The data can be structured, semi-structured, or unstructured. The data structure and requirements are not defined until the data is needed, at read-time.
This essentially means that Data Lakes create a future-proof environment for raw data, unconstrained and unfiltered by traditional, strict database rules and relations at write-time. The ingested raw data is always there, and can be re-interpreted and analyzed as needed.
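To make schema-on-read concrete, here is a minimal sketch using PySpark. The bucket path, file layout, and field names are assumptions made for the example, not part of any fixed convention; the key point is that the same raw files can be given two different structures at read-time:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Interpretation 1: read the raw events as user-activity records
activity_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
])

# Interpretation 2: read the very same files as a time series
timeline_schema = StructType([
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

# The structure is imposed only now, at read-time; the raw files stay untouched
activity = spark.read.schema(activity_schema).json("s3a://example-data-lake/raw/events/")
timeline = spark.read.schema(timeline_schema).json("s3a://example-data-lake/raw/events/")
```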
Some experts think of Data Lakes as a replacement for Data Warehouses, while others see them as a staging area for filtering and feeding data into existing Data Warehouse solutions, or as a place to store backups of data from Data Warehouses and databases.
It’s important to note that Data Lake architecture varies widely from application to application, and architectural considerations are always subject to technical and business requirements. The Data Lake architecture presented in this article demonstrates a common-case prototype, but it cannot cover the full multitude of applications of modern Data Lakes.
Data Lake Architecture Layers
Data processing in Data Lakes can be loosely organized in the following conceptual model:
Ingestion Layer
The Ingestion Layer is tasked with ingesting raw data into the Data Lake. Modification of raw data is prohibited. Raw data can be ingested in batches or in real-time, and is organized in a logical folder structure. The Ingestion layer can accommodate data from different external sources, such as:
- Social networks
- IoT devices
- Wearable devices
- Data streaming devices
One of the advantages of this layer is that it can quickly ingest almost any type of data from almost any system, including (but not limited to) the following; a short ingestion sketch follows the list:
- Real-time data from connected health monitoring devices
- Video streams from security cameras
- Videos, photographs or geolocation data from mobile phones
- All types of telemetry data
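As a rough sketch of batch ingestion, the following Python snippet stores a raw payload unmodified in a date-partitioned folder structure on Amazon S3. The bucket name, key layout, and payload are purely illustrative assumptions:

```python
import json
import uuid
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def ingest_raw(payload: dict, source: str) -> str:
    """Store the payload unmodified; structure exists only in the key layout."""
    now = datetime.now(timezone.utc)
    key = (
        f"raw/{source}/"
        f"year={now:%Y}/month={now:%m}/day={now:%d}/"
        f"{uuid.uuid4()}.json"
    )
    s3.put_object(
        Bucket="example-data-lake",  # hypothetical bucket
        Key=key,
        Body=json.dumps(payload).encode("utf-8"),
    )
    return key

# e.g. a reading from a connected health monitoring device:
# ingest_raw({"device": "hr-monitor-17", "bpm": 72}, source="iot")
```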
Distillation Layer
The Distillation Layer converts the data stored by the Ingestion Layer into structured data for further analysis. In this layer, raw data is interpreted and transformed into structured data sets and subsequently stored as files or tables. The data is cleansed, denormalized, and enriched with derived values at this stage, and becomes uniform in terms of encoding, format, and data type.
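A hedged sketch of a distillation step, again using PySpark: raw JSON events are cleansed, given uniform types, and written to a columnar (Parquet) table. The paths and column names continue the hypothetical IoT example above:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("distillation").getOrCreate()

raw = spark.read.json("s3a://example-data-lake/raw/iot/")

structured = (
    raw
    .dropna(subset=["device"])                    # cleanse: drop unusable rows
    .withColumn("bpm", F.col("bpm").cast("int"))  # enforce a uniform data type
    .withColumn("ingest_date", F.current_date())  # add a derived value
)

# Store the structured data set as partitioned columnar (Parquet) files
(structured.write.mode("append")
           .partitionBy("ingest_date")
           .parquet("s3a://example-data-lake/structured/iot_events/"))
```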
Processing Layer
The Processing Layer runs user queries and advanced analytical tools on structured data. Processes can be run in real-time, as a batch, or interactively. Business logic is applied in this layer and data is consumed by analytical applications. This layer is also known as trusted, gold, or production-ready.
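Continuing the same hypothetical example, the Processing Layer might apply a business rule as a SQL query over the structured zone and write the result to a production-ready ("gold") location:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("processing").getOrCreate()

events = spark.read.parquet("s3a://example-data-lake/structured/iot_events/")
events.createOrReplaceTempView("iot_events")

# Business rule: flag devices whose average reading exceeds a threshold
alerts = spark.sql("""
    SELECT device, AVG(bpm) AS avg_bpm
    FROM iot_events
    GROUP BY device
    HAVING AVG(bpm) > 100
""")

alerts.write.mode("overwrite").parquet("s3a://example-data-lake/gold/alerts/")
```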
Insights Layer
The Insights Layer is the output interface, or query interface, of the Data Lake. It uses SQL or NoSQL queries to request data and output it in reports or dashboards.
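For instance, with Amazon Athena (one of the query services listed later in this article), an Insights Layer request could look like the following sketch; the database, table, and output location are illustrative assumptions:

```python
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT device, avg_bpm FROM alerts ORDER BY avg_bpm DESC",
    QueryExecutionContext={"Database": "data_lake"},  # hypothetical database
    ResultConfiguration={
        "OutputLocation": "s3://example-data-lake/query-results/"
    },
)

# Athena runs asynchronously; poll get_query_execution() for completion
print(response["QueryExecutionId"])
```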
Unified Operations Layer
The Unified Operations Layer performs system monitoring and manages the system using workflow management, auditing, and proficiency management.
In some Data Lake implementations, a Sandbox Layer is included as well. As the name suggests, this layer is a place for data exploration by data scientists and advanced analysts. The sandbox layer is also referred to as the Exploration Layer or Data Science Layer.
The Pros and Cons of Data Lakes
Data Lake Pros:
- "Schema on read” rather than “schema on write” enables greater flexibility.
- The integration of differently structured data is much easier.
- Raw data is always kept intact.
- Data analysis can be done later, whenever it is required and repeated if necessary. The same source data can be interpreted in different ways for different needs.
- Data Lakes are far more scalable than traditional data warehouses.
Data Lake Cons:
- Unstructured data and lack of metadata can lead to a Data Lake becoming a Data "Swamp", where it is hard to find useful data.
- Data scientists may require additional training to successfully mine data from a Data Lake.
- Inexperienced users can start dumping data into a Data Lake without a viable strategy or plan to extract valuable insight.
Data Lake File Storage
Data Lakes rely on big data storage and take advantage of its high reliability, scalability, and uptime. The main requirement for Data Lake storage is the ability to store vast amounts of data at a low cost.
Using cloud storage has the advantages of scalability and comparatively low cost. On-premise Data Lake implementations can also be used, especially if the required big data hardware infrastructure is already in place.
Modern Data Lake architecture separates the physical storage layer from the computing layer, making them independently scalable to meet individual needs. Data Lakes traditionally relied on the Hadoop Distributed File System (HDFS) with Apache ORC or Parquet columnar file formats. More recently, the industry has been migrating towards cloud-native storage such as Amazon S3 and Azure Data Lake Storage.
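The storage/compute split means any engine can read the same columnar files directly from object storage. As a small illustration (assuming the hypothetical structured zone from earlier and valid AWS credentials), pyarrow can query the files without any cluster attached to the data:

```python
import pyarrow.dataset as ds

# Point any compute engine at the same files; no cluster "owns" the storage
dataset = ds.dataset(
    "s3://example-data-lake/structured/iot_events/",
    format="parquet",
)

# Read only the columns needed; Parquet's columnar layout makes this cheap
table = dataset.to_table(columns=["device", "bpm"])
print(table.num_rows)
```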
IBM offers the helpful “5 V’s of Big Data” to capture the most important dimensions of stored data: Volume, Velocity, Variety, Veracity, and Value.
Data Lake Architecture Business Considerations
Data Lake Security
Security should be implemented in all layers of the Data Lake, with the traditional intent of restricting access to data so that only authorized users and services are permitted. Data Lake security is accomplished by employing the following methods (a brief encryption sketch follows the list):
- Network Level Security controls access to data using network security policies such as firewalls and IP address ranges.
- Access Control permits access to authorized users. Different user roles and permissions can also be set.
- Encryption is used extensively in Data Lakes. All stored data is encrypted and only decrypted at read-time, and data in transit is protected with end-to-end encryption as well.
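As a minimal sketch of encryption at rest during ingestion, an object can be stored with server-side encryption so that it is only decrypted when read. The bucket and key names are illustrative:

```python
import boto3

s3 = boto3.client("s3")

s3.put_object(
    Bucket="example-data-lake",       # hypothetical bucket
    Key="raw/iot/device-17.json",
    Body=b'{"device": "hr-monitor-17", "bpm": 72}',
    ServerSideEncryption="aws:kms",   # encrypt at rest with a KMS-managed key
)
```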
One of the challenges in Data Lake security is handling sensitive or confidential personal data and adhering to legal requirements regarding the way this data can be collected, stored, and used. In global enterprises, this is even more challenging due to the necessity of complying with regulatory frameworks in different countries, such as HIPAA in the US, GDPR in the EU, or the global PCI DSS security standard.
Data Lake Architecture vs. Traditional Databases and Warehouses
Compared to traditional database systems, the data analysis paradigm in Data Lakes is described as a top-down approach:
- Data Lake: Ingest Data → Analyze → Define Data Structure
- Application Database: Relational Data Structuring → Ingest Data → Analyze
- Data Warehouse: Report Data Structuring → Ingest Data → Analyze
This approach to data analysis in Data Lakes saves a lot of the upfront work that usually goes into creating the data structure, allowing fast ingestion and storage of data. Moving data structuring to the last step is helpful in situations where the structure itself is hard to define and subject to change or differing interpretations.
Data Management and Governance
Data Lake management deals with the challenges of monitoring and logging the transformations of data as it moves through different layers of the Data Lake. All actions performed on the data are logged, as well as all user actions that led up to them.
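A minimal illustration of such transformation logging, assuming a simple append-only lineage log; a production deployment would more likely rely on a dedicated auditing or lineage tool:

```python
import json
import time

def log_transformation(action: str, source: str, target: str, actor: str) -> None:
    """Append one lineage record per action performed on the data."""
    record = {
        "ts": time.time(),
        "action": action,
        "source": source,
        "target": target,
        "actor": actor,
    }
    with open("lineage.log", "a") as f:
        f.write(json.dumps(record) + "\n")

# e.g. the distillation job records what it read, what it wrote, and who ran it
log_transformation("distill", "raw/iot/", "structured/iot_events/", "etl-job-7")
```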
Metadata
Metadata is data describing data. Ingestion of raw data without applying detailed metadata should not be allowed. A Data Lake can quickly turn into a Data Swamp when you are unable to locate data. On the other hand, being too strict with metadata can result in no ingested data at all, so you end up with a data desert.
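One lightweight way to enforce metadata at ingestion is to attach descriptors to every stored object, as in the hypothetical sketch below; a real deployment would more likely register entries in a dedicated catalog (such as AWS Glue, listed later in this article):

```python
import boto3

s3 = boto3.client("s3")

s3.put_object(
    Bucket="example-data-lake",   # hypothetical bucket
    Key="raw/iot/device-17.json",
    Body=b'{"device": "hr-monitor-17", "bpm": 72}',
    Metadata={                    # descriptors stored with the object
        "source": "iot",
        "schema-hint": "device,bpm",
        "ingested-by": "ingest-service-v2",
    },
)
```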
Building a Data Lake Team - Roles and Responsibilities
A Data Lake team is essentially a Data Science (DS) team. Depending on the size of the company and the volume of big data, DS teams are custom-built for specific business tasks. In general, DS team roles and responsibilities in a Data Lake architecture would be similar to the following:
- Chief Analytics Officer or Chief Data Officer - The team lead, tasked with communicating and translating the business needs to data science. Preferable skills include data science and analytics, programming, and business expertise.
- Data Analyst - Deals with data collection and quality, data interpretation, and analytics. Preferable skills include R, Python, JavaScript, C/C++, SQL.
- Business Analyst - Converts business expectations into data analysis tasks. Preferable skills include data visualization, business intelligence, SQL.
- Data Scientist - Solves business tasks using data mining and machine learning techniques. Preferable skills include R, SAS, Python, SQL, NoSQL, Hive, Pig, Hadoop, Spark.
- Data Architect or Data Engineer - Designs and implements Data Lake architecture, manages storage and performance, and ensures the integrity of data from different sources. Preferable skills include SQL, NoSQL, Spark.
Bear in mind that many of these required skills intersect, so an individual could combine multiple roles in a functional team.
Stages of Data Lake Implementation
Most businesses go through the following stages of development when building and integrating Data Lakes within their existing business architecture:
- Gathering raw data - A Data Lake is built separately from the core business IT systems and it becomes a landing zone for gathering raw data from all sources. At this stage, a Data Lake serves only as a data capture environment.
- Environment for Data Science - Data Scientists analyze gathered data and test different analytics prototypes. The Data Lake becomes a platform for analytics and machine learning experimentation.
- Integration with core business IT systems - Large, structured data sets from the core business IT systems are loaded to the Data Lake and linked with data gathered from other sources. High-intensity everyday business tasks remain in the core business IT systems.
- The Data Lake becomes a core part of the data infrastructure - Computing intensive tasks like machine learning utilize the Data Lake and reduce the load on core business IT systems. Data-intensive applications can be built on top of the Data Lake. Streams of raw data from various sources are quickly ingested and stored by the Data Lake.
Data Lake Best Practices
For the sake of brevity, this list is limited to the bare essentials:
- Data Lake implementation should be customized to support the specific needs of the enterprise or the industry that will use it.
- Existing data management policies of the enterprise should be supported by the Data Lake implementation.
- Try to automate adding metadata during data ingestion as much as possible.
- A single Data Lake should be able to perform different architectural roles. For some users, a Data Lake will only serve as a digital archive, while other users perform advanced analytics of that data and find new ways to utilize it.
Software Stacks Used in Data Lakes
Software and cloud vendors have developed several software stacks for Data Lake implementation. We will list a few of the more popular ones for reference.
Microsoft Azure
- Azure Data Lake - Integrated Data Lake Solution
- Azure Data Lake Storage - Data Lake Cloud Storage
- Data Lake Analytics - On-demand Analytics Job Service
- HDInsight - Managed Hadoop for Open Source Analytics
Amazon AWS
- Amazon S3 - Data Lake Cloud Storage
- Amazon S3 Glacier - Archive Cloud Storage
- AWS Glue - Prepare, Load, and Catalog Data
- AWS Data Exchange - Find and Use Third-Party Data
- Amazon Athena - Interactive Data Lake Analytics
- Amazon EMR - Managed Hadoop Service for Open Source Analytics
- Amazon Kinesis - Real-time Analytics
- Amazon Elasticsearch Service - Operational Analytics and Indexing
- Amazon Quicksight - Dashboards and Visualization
Google Cloud Platform
- Google Cloud Storage - Cloud Storage
- Dataproc - Managed Hadoop Service for Open Source Analytics
- BigQuery - Interactive Data Lake Analytics
The Future of Data Lakes
A Data Lake is a secure, robust, and centralized storage platform that lets you ingest, store, and process structured and unstructured data. Raw data assets are kept intact, while data exploration, analytics, machine learning, reporting, and visualization are performed on the data and tweaked as needed. This means raw data can be reused and repurposed at a later date, without much hassle.
Although many proponents or vendors may make bold promises, Data Lake architecture will never remove the need for traditional databases, nor replace them. It is simply not envisioned or designed to do that. Most daily business operations will continue to rely on traditional database systems. Repetitive and strictly defined tasks, such as sales, invoicing, inventory, and banking transactions, are perfectly implemented in traditional databases. Data Lakes work in conjunction with traditional databases to generate more value from data already available to an organization, yielding new insights and uncovering new information in existing data.
Early implementations of Data Lakes were plagued by the fact that the architecture was designed by data scientists for data scientists. Setting up all the different components and tools required highly qualified data engineers. Mining and analyzing data from Data Lakes faced the same challenge, as it was mostly code-based and required specialized talent. Of course, this was not an issue for the huge tech companies that dominate the big data space, thanks to their large pools of skilled software engineers and data scientists. However, new solutions such as integrated, turnkey Data Lake platforms and GUI-based user interfaces instead of code-based control could make it much easier for companies to implement and use Data Lakes.
In the future, Data Lake architecture and logic could be used and integrated with large document management systems, various digital archives, public records, health care records, scientific research datasets, and so on.