Cloud
|
11 min read
|
July 31, 2020

Data Lakes vs. Data Warehouses: Key Differences

by

Gareth Dwyer

Data warehouses store processed, structured data in a consistent format while data lakes store and process structured and unstructured data in multiple formats.

The difference between Data Lakes and Data Warehouses in one sentence

Data warehouses store processed, structured data in a consistent format while data lakes store and process structured and unstructured data in multiple formats.

Why do we need data warehouses or data lakes?

Many businesses start by storing nearly all of their data in a single database. Users interact with a web application and the data about users, sales, products, and anything else is stored in the database.

A data analyst employed by the business can easily find any required data in the database and generate useful reports and analytics.

Diagram of Data Analyst using a single database
A data analyst employed by the business can easily find any required data in the database and generate useful reports and analytics.

When there’s only one database, it’s easy to do analysis.

As the business grows, more services are added. Each of these stores its own data, and not all of it is relevant to the web application. Suddenly there is useful data in separate places, and the data analyst has to try to keep track of where this data is, how it is stored, and how it can be accessed. It becomes difficult to match users across different sources and some analysis is now just too difficult to do.

Diagram of Data Analyst unhappy with how a complex database works
When there’s only one database, it’s easy to do analysis.

With many sources, analysis becomes difficult.

The solution to this problem is to add a centralized store for all business data. Each service, including the original database, can write data to this centralized location, and analysis can easily be carried out in one place.

Data Analyst happy with complex data gathered in one place
With many sources, analysis becomes difficult.

Carrying out analysis is easy again if data from all services is written to a single location.

This “single source of truth” isn’t just another database though. To deal with all the data from different sources in different formats, it needs to take some more complexity into account, and it can do this in different ways. The two main choices are a data warehouse or a data lake, and each of these has different strengths.

Understanding Data Warehouses

A data warehouse is a centralized data store for structured business data used to generate reports and conduct analysis.

When your data is small and simple, you can analyze it directly at its source, using SQL or similar data tools. Once it grows more complex though you’ll almost always need to add several more steps between sourcing the data and analyzing it. It’s common to think of these steps as a data pipeline. Data is:

  • Extracted: pulled from different sources
  • Transformed: cleaned, linked, and processed into a consistent format
  • Loaded: saved to a new centralized source

This process is referred to as Extract-Transform-Load, often abbreviated to ETL.Going back to our earlier idea of many sources feeding into a single centralized location, we can add more detail to this now. Data is extracted from many sources, transformed into neat and organized chunks, and loaded into a data warehouse, as shown below.

The structure of the data warehouse has to be defined up-front, and is usually designed for a specific business or reporting domain (for example, you will find pre-built data warehouses for sales, accounting, or real estate).

The data analyst can easily generate reports by looking at the required data in the warehouse.

Data Analyst happy with another form of complex data organization
Understanding Data Warehouses

Using a data warehouse architecture, data is extracted from various sources, cleaned (transformed), and then stored neatly in a data warehouse where a data analyst can use it.

This generally works well, but it places a high barrier to entry to get data into the warehouse. If a new source is added or if a source changes significantly, it might take weeks of engineering time to create the necessary transform steps to start loading the data into the warehouse.

Understanding Data Lakes

A data lake is a centralized data store for both structured and unstructured business data used for exploratory data analysis and business reporting.

Data lakes are a newer concept than data warehouses. As with a data warehouse, a data lake is a centralized store of both connected and disconnected data. The steps to move data into a data lake and then use it look similar to the ones we saw for a data warehouse, but with the second and third steps swapped. Instead of ETL, we use ELT.

This means that data is stored in the data lake in its raw form, almost identically to how it appeared in the various sources. The “transform” step is done later, often at the same time as the final analysis.

Data Analyst using transformed complex data
Understanding Data Lakes

Using a data lake architecture, data is extracted from various sources, and immediately stored in a centralized location. It is then cleaned (transformed) as necessary for specific kinds of analysis.

The barrier to entry for the data is now far lower. It is easy to add new data sources to the lake and ensure that all data is stored in a centralized location. The difficult part (the cleaning and transforming step) can be safely delayed and carried out when it becomes necessary.

Comparing Data Warehouses and Data Lakes

The data lake architecture was created to address problems with the data warehouse. If the barrier to entry for data to enter the warehouse is too high, important data might be lost or remain unused until the necessary pipeline is built.

This does not mean that data lakes are always the better choice though. As with most things in life, the answer to the question “which should I use?” is “it depends”. Data warehouses and data lakes are also not direct alternatives, although their use cases strongly overlap. Many companies use both in conjunction.

Data lakes can usually handle data on a much larger scale. This doesn’t just mean in terms of the number of petabytes stored, but also in terms of the other Vs of big data: for example, how fast the data flows into systems and how much variety it has.

Summary of advantages and disadvantages of Data Warehouses vs Data Lakes

Data Warehouse

  • Rigid structure can make data more predictable and reliable
  • Rigid structure makes it difficult to add new sources
  • Predefined structure means you may not have to "reinvent the wheel"
  • May be missing data from some sources

Data Lake

  • Flexible structure increases risk of it turning into a "data swamp"
  • Flexible structure makes it easy to add new data sources
  • No predefined structure means you may have to custom-build common functionality
  • Usually stores a complete view of data from all sources and gets new data quickly

Data warehouses are structured, making it easier to rely on and use the resulting data. Data lakes are unstructured, making it easier to add data from different sources.

When deciding between the two architectures, you’ll need to consider how important flexibility is and also the quantity of data that you need to store. Data lakes are typically used to store far more data than data warehouses.

Example: Pain points of using Data Warehouses and Data Lakes

No solution is perfect and each option brings its own set of tradeoffs. Let’s look at an example experience of using a data warehouse and a data lake.

Example: SaaS Provider with a Data Lake

You are a senior manager at a company that provides Software as a Service. Data from your product, sales, marketing, and customer support teams all feed into a data lake. You need insights from this data to generate an annual report and make key decisions for the upcoming year, and you are working with a data analyst.

You are delighted by

The fact that you have access to all the data you need in one place and how you can draw connections between different pieces of data to gain new insights. You want to compare your NPS score from your customer support team with the upgrade conversion rate to your premium package from your sales team with the month on month user growth rate from your marketing team, and see how this relates to the A/B test data that your product team recently ran on a new feature. The data analyst has all of this data at her fingertips and can quickly answer your questions.

You are frustrated by

Some of the data doesn’t line up and some of it is directly contradictory. There are more user sign-ups recorded in the data from the marketing team than in the user records from the product team. The data analyst warns you that the analysis might not be reliable because there were different versions of the same data and she had to guess which one to use, and she is still waiting for information from the marketing team about which version is correct.

Example: SaaS provider with a Data Warehouse

You are a senior manager at a company that provides Software as a Service. Data from your product, sales, marketing, and customer support teams all feed into a data warehouse. You need insights from this data to generate an annual report and make key decisions for the upcoming year, and you are working with a data analyst.

You are delighted by

The fact that you can do a lot of the analysis yourself without relying on help from the data analyst. Everything is well structured and easy to understand and many of the insights that you need can be generated at the click of a button. The formats remain consistent across time and you also can compare the insights that you generate now with insights going years back.

You are frustrated by

Some data is missing completely or does not contain all of the attributes that you expect. Your data team tells you that it’s on their backlog to write the transforms to get the new data feeding into the warehouse, but they are hedgy about providing a time estimate for when this will be done. You have to find credentials for different services and manually export the missing data to piece together your analysis. You may not have full access to historical data if, for example, your data warehouse solution cannot store more than 12 months’ worth of data.

Combining Data Warehouses and Data Lakes

You might be wondering if you can get the advantages of both data warehouses and data lakes. Good news: you can, and many people do. Consider the following set up.

Data Analyst pulling from data lake and warehouse

Data lakes and data warehouses don’t have to be alternatives. They can also be used in conjunction.

Here data is loaded in its raw format into one centralized location (the data lake) and only subsequently processed and loaded into the data warehouse. Data scientists can access the data lake directly, analyzing data in its raw form, while data analysts and executives can benefit from the additional structure added by the warehouse.

Of course, this setup is not without its own disadvantages. Mainly, costs are significantly higher. The data lake and data warehouse each have their own hardware costs, maintenance costs, and set up costs. Adding more pieces always adds more complexity too: now you have to make sure that the data lake and data warehouse stay in sync. But at scale, you’ll almost always need the benefits of both so starting with this architecture is often a good way to future proof your setup.

Maturity and trends of Data Warehouses and Data Lakes

So far, we have considered the two solutions at face value without looking at historical context. Another important point to consider though is the maturity and future of data warehouses and data lakes.

The Google Trends graph below (calculated by the number of Google searches made for each term) shows that interest in data warehouses is still overall higher than in data lakes, but that interest in data warehouses is declining while that in data lakes is increasing.

It is likely that data lakes will become more popular than data warehouses within a few years. As hardware costs become cheaper, it makes sense to store more data even if the business doesn’t currently have a well-defined use for it.

That said, data warehouses are a far older concept than data lakes, and the technology and best practices for building data warehouses are far more mature. If you’re looking for something tried and tested, data warehouses are likely going to be more attractive. If you’re looking for something newer, more exciting, and with a brighter future, data lakes are probably for you.

Chart showing trends in search results for data lakes and data warehouses
Data warehouses are older and declining in popularity while the interest in data lakes is increasing.

Conclusion

Data warehouses and data lakes are deceptively simple concepts, but there is a lot to consider before choosing one or the other (or deciding to use both). The main considerations are:

  • How central data is to your organization: if data is one of your key focus areas, storing as much of it as possible in a lake is likely to your advantage.
  • How mature and well-understood your use case is: there’s no need to reinvent the wheel if you are working on a well-understood problem and your data fits neatly into predefined schemas.

Both solutions are useful in different scenarios and neither is going to disappear any time soon, but data warehouses remain tried and tested while data lakes are new and upcoming.

X

Get Your PDF Copy

Get a full, printable copy of the AWS Show conversation on the various approaches a FinOps program could take to address the #1 FinOps challenge from a practitioner's viewpoint.









Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.