11 min read • Jul 31, 2020
Gareth is a South African software engineer and writer who served as CTO of an EdTech startup before founding Ritza. He now produces technical articles and tutorials, supporting companies in their education-based marketing efforts.
Data warehouses store processed, structured data in a consistent format while data lakes store and process structured and unstructured data in multiple formats.
Many businesses start by storing nearly all of their data in a single database. Users interact with a web application and the data about users, sales, products, and anything else is stored in the database.
A data analyst employed by the business can easily find any required data in the database and generate useful reports and analytics.
When there’s only one database, it’s easy to do analysis.
As the business grows, more services are added. Each of these stores its own data, and not all of it is relevant to the web application. Suddenly there is useful data in separate places, and the data analyst has to try to keep track of where this data is, how it is stored, and how it can be accessed. It becomes difficult to match users across different sources and some analysis is now just too difficult to do.
With many sources, analysis becomes difficult.
The solution to this problem is to add a centralized store for all business data. Each service, including the original database, can write data to this centralized location, and analysis can easily be carried out in one place.
Carrying out analysis is easy again if data from all services is written to a single location.
This “single source of truth” isn’t just another database though. To deal with all the data from different sources in different formats, it needs to take some more complexity into account, and it can do this in different ways. The two main choices are a data warehouse or a data lake, and each of these has different strengths.
A data warehouse is a centralized data store for structured business data used to generate reports and conduct analysis.
When your data is small and simple, you can analyze it directly at its source, using SQL or similar data tools. Once it grows more complex though you’ll almost always need to add several more steps between sourcing the data and analyzing it. It’s common to think of these steps as a data pipeline. Data is:
This process is referred to as Extract-Transform-Load, often abbreviated to ETL.Going back to our earlier idea of many sources feeding into a single centralized location, we can add more detail to this now. Data is extracted from many sources, transformed into neat and organized chunks, and loaded into a data warehouse, as shown below.
The structure of the data warehouse has to be defined up-front, and is usually designed for a specific business or reporting domain (for example, you will find pre-built data warehouses for sales, accounting, or real estate).
The data analyst can easily generate reports by looking at the required data in the warehouse.
Using a data warehouse architecture, data is extracted from various sources, cleaned (transformed), and then stored neatly in a data warehouse where a data analyst can use it.
This generally works well, but it places a high barrier to entry to get data into the warehouse. If a new source is added or if a source changes significantly, it might take weeks of engineering time to create the necessary transform steps to start loading the data into the warehouse.
A data lake is a centralized data store for both structured and unstructured business data used for exploratory data analysis and business reporting.
Data lakes are a newer concept than data warehouses. As with a data warehouse, a data lake is a centralized store of both connected and disconnected data. The steps to move data into a data lake and then use it look similar to the ones we saw for a data warehouse, but with the second and third steps swapped. Instead of ETL, we use ELT.
This means that data is stored in the data lake in its raw form, almost identically to how it appeared in the various sources. The “transform” step is done later, often at the same time as the final analysis.
Using a data lake architecture, data is extracted from various sources, and immediately stored in a centralized location. It is then cleaned (transformed) as necessary for specific kinds of analysis.
The barrier to entry for the data is now far lower. It is easy to add new data sources to the lake and ensure that all data is stored in a centralized location. The difficult part (the cleaning and transforming step) can be safely delayed and carried out when it becomes necessary.
The data lake architecture was created to address problems with the data warehouse. If the barrier to entry for data to enter the warehouse is too high, important data might be lost or remain unused until the necessary pipeline is built.
This does not mean that data lakes are always the better choice though. As with most things in life, the answer to the question “which should I use?” is “it depends”. Data warehouses and data lakes are also not direct alternatives, although their use cases strongly overlap. Many companies use both in conjunction.
Data lakes can usually handle data on a much larger scale. This doesn’t just mean in terms of the number of petabytes stored, but also in terms of the other Vs of big data: for example, how fast the data flows into systems and how much variety it has.
Data warehouses are structured, making it easier to rely on and use the resulting data. Data lakes are unstructured, making it easier to add data from different sources.
When deciding between the two architectures, you’ll need to consider how important flexibility is and also the quantity of data that you need to store. Data lakes are typically used to store far more data than data warehouses.
No solution is perfect and each option brings its own set of tradeoffs. Let’s look at an example experience of using a data warehouse and a data lake.
You are a senior manager at a company that provides Software as a Service. Data from your product, sales, marketing, and customer support teams all feed into a data lake. You need insights from this data to generate an annual report and make key decisions for the upcoming year, and you are working with a data analyst.
The fact that you have access to all the data you need in one place and how you can draw connections between different pieces of data to gain new insights. You want to compare your NPS score from your customer support team with the upgrade conversion rate to your premium package from your sales team with the month on month user growth rate from your marketing team, and see how this relates to the A/B test data that your product team recently ran on a new feature. The data analyst has all of this data at her fingertips and can quickly answer your questions.
Some of the data doesn’t line up and some of it is directly contradictory. There are more user sign-ups recorded in the data from the marketing team than in the user records from the product team. The data analyst warns you that the analysis might not be reliable because there were different versions of the same data and she had to guess which one to use, and she is still waiting for information from the marketing team about which version is correct.
You are a senior manager at a company that provides Software as a Service. Data from your product, sales, marketing, and customer support teams all feed into a data warehouse. You need insights from this data to generate an annual report and make key decisions for the upcoming year, and you are working with a data analyst.
The fact that you can do a lot of the analysis yourself without relying on help from the data analyst. Everything is well structured and easy to understand and many of the insights that you need can be generated at the click of a button. The formats remain consistent across time and you also can compare the insights that you generate now with insights going years back.
Some data is missing completely or does not contain all of the attributes that you expect. Your data team tells you that it’s on their backlog to write the transforms to get the new data feeding into the warehouse, but they are hedgy about providing a time estimate for when this will be done. You have to find credentials for different services and manually export the missing data to piece together your analysis. You may not have full access to historical data if, for example, your data warehouse solution cannot store more than 12 months’ worth of data.
You might be wondering if you can get the advantages of both data warehouses and data lakes. Good news: you can, and many people do. Consider the following set up.
Data lakes and data warehouses don’t have to be alternatives. They can also be used in conjunction.
Here data is loaded in its raw format into one centralized location (the data lake) and only subsequently processed and loaded into the data warehouse. Data scientists can access the data lake directly, analyzing data in its raw form, while data analysts and executives can benefit from the additional structure added by the warehouse.
Of course, this setup is not without its own disadvantages. Mainly, costs are significantly higher. The data lake and data warehouse each have their own hardware costs, maintenance costs, and set up costs. Adding more pieces always adds more complexity too: now you have to make sure that the data lake and data warehouse stay in sync. But at scale, you’ll almost always need the benefits of both so starting with this architecture is often a good way to future proof your setup.
So far, we have considered the two solutions at face value without looking at historical context. Another important point to consider though is the maturity and future of data warehouses and data lakes.
The Google Trends graph below (calculated by the number of Google searches made for each term) shows that interest in data warehouses is still overall higher than in data lakes, but that interest in data warehouses is declining while that in data lakes is increasing.
It is likely that data lakes will become more popular than data warehouses within a few years. As hardware costs become cheaper, it makes sense to store more data even if the business doesn’t currently have a well-defined use for it.
That said, data warehouses are a far older concept than data lakes, and the technology and best practices for building data warehouses are far more mature. If you’re looking for something tried and tested, data warehouses are likely going to be more attractive. If you’re looking for something newer, more exciting, and with a brighter future, data lakes are probably for you.
Data warehouses are older and declining in popularity while the interest in data lakes is increasing.
Data warehouses and data lakes are deceptively simple concepts, but there is a lot to consider before choosing one or the other (or deciding to use both). The main considerations are:
Both solutions are useful in different scenarios and neither is going to disappear any time soon, but data warehouses remain tried and tested while data lakes are new and upcoming.
Data Lakes vs. Data Warehouse
Can a Data Lake replace a Data Warehouse?
Yes. If you are struggling with the limits of your data warehouse, it might be time to replace it with a data lake. Using both in conjunction is also a common pattern though.
Are Data Warehouses obsolete?
No. While data lakes are becoming more popular, many businesses still rely heavily on data warehouses.
When are Data Lakes required?
If your data does not fit well into a predefined schema, or you need to store large amounts of data in different formats with flexible schemas, you might need a data lake instead of a data warehouse.
What is Data Lake Architecture?
A data lake stores structured and unstructured data in different formats in a centralized location. These can be architected in different ways: for example, you might choose to ‘layer’ your data lake into different kinds of data.
By Gareth Dwyer