A Pragmatic Approach to Data Lakes

6 min read • Jun 12, 2020

By James Cross, Delivery Leader

James is a Big Data Engineer and certified AWS Solutions Architect with a passion for data-driven applications. He's spent the last 7 years helping his clients design and implement huge-scale streaming Big Data platforms, cloud-based analytics stacks, and serverless architectures.

Many companies start their Data Lake journey on the first page of a Forbes or Times article. It begins with an exec-level desire to have the latest technical toy in the toolkit and the newest buzzwords in buzzword bingo: Big Data, Artificial Intelligence, and, to make it all happen, a Data Lake.

This desire feeds down to senior IT management, who, when tasked with delivering a solution that centralizes all of the company’s data in one place, start shaping a 3 to 5-year change program.

We recently spoke to a European company that was exactly on this track. They laid the extensive and detailed groundwork to deliver a Serverless, Event Sourced, Event-Driven, Machine Learning Powered Data Lake, using Cloud Native technologies in three and a half years.

Our first question was: what use cases are you designing this system for? This prompted some soul searching and some chatter, followed by the very vague response, “to improve our operations.”

This is the big Data Lake trap: delivering an enormous change program without grounding your thinking. Let’s break down how to avoid getting stuck like this.

Start by asking the right question

To be successful, start with the question of why, rather than what. That means identifying the business value, which should be the starting point for every engineering project. Create a list of the problems you would like to solve by creating the Data Lake.

The key to a good-quality use case list is ensuring that each use case is definitive, closed-ended, achievable, and self-describing. Doing this means you can take a use case and start to work on it almost immediately without spinning your wheels on scope definition.

Conversely, an ineffective use case would be something open-ended, weakly defined, and requiring scoping effort. For example: “I want to reduce the time it takes to gather insights from my data.” This is an admirable goal and a worthwhile strategic objective, but it’s not a use case because it is framed as an outcome and is not actionable.

A good use case is the opposite of this: it is tied to a specific business goal and defined business data, and it has an actionable objective. For example: “I want to correlate the number of times a customer purchases a particular product category with the meta-customer information I have (like their age, demographics, nationality, income, etc.).”

There’s a remarkable difference between these two. The first is open-ended and weakly defined. The second is something that we can start work on right away because we know the data sources we need to access to begin work.

Once you’ve compiled a list of 5 to 10 such use cases, the next step is to prioritize them. When doing so, it’s essential to consider a few variables:

  • The difficulty of implementation (e.g., time and effort)
  • Probability of success (e.g., business intelligence dashboards have a far higher chance of success than attempting to predict a time-to-event)
  • The business value that a successful use case will deliver
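
One way to make this prioritization concrete is to score each use case on the three variables above. The sketch below is only an illustration: the scoring formula (business value times probability of success, divided by effort), the `UseCase` class, and all of the numbers are assumptions made up for this example, not a method prescribed here. Your own weighting may well differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    name: str
    effort_weeks: float          # difficulty of implementation (time and effort)
    success_probability: float   # estimated chance of success, between 0 and 1
    business_value: float        # relative value if it succeeds, e.g. 1 to 10

    def priority(self) -> float:
        # Hypothetical heuristic: expected value per week of effort.
        return self.business_value * self.success_probability / self.effort_weeks


def prioritize(use_cases):
    """Return use cases ordered from highest to lowest priority score."""
    return sorted(use_cases, key=lambda uc: uc.priority(), reverse=True)


# Illustrative backlog; every estimate here is invented for the example.
backlog = [
    UseCase("Sales BI dashboard", effort_weeks=4, success_probability=0.9, business_value=6),
    UseCase("Time-to-event churn prediction", effort_weeks=12, success_probability=0.3, business_value=9),
    UseCase("Purchase category vs. customer demographics", effort_weeks=3, success_probability=0.7, business_value=5),
]

for uc in prioritize(backlog):
    print(f"{uc.priority():.2f}  {uc.name}")
```

With these made-up estimates, the dashboard ranks first and the time-to-event prediction last, even though the prediction has the highest nominal business value. The point of the exercise is the conversation about the estimates, not the exact score.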

The key difference with this approach is that we have started by considering why we are adopting Data Lakes and used this to drive a selection of use cases that drive business value. This ensures that our plan will deliver something useful to the business, rather than something that serves the purpose of simply designing and building a Data Lake.

What comes next?

Hopefully, the use cases give you some focus and direction, but they should also start to provide you with clarity around the tech stack. Inevitably, we will need some fundamental things:

  • A flexible place to store data at rest, the “Data Lake”
  • A method of ingesting data into the lake
  • A common approach for processing and transforming data, both ad-hoc and on a schedule
  • A sandbox for your data scientists and engineers to play around in

The last point is perhaps the most crucial. Inevitably some of your use cases will have an element of uncertainty. Is the data there to support the use case? If it is, is it clean enough? Can you accurately join different data sources together? Perhaps there isn’t a correlation that you thought there was.
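
A first sandbox experiment can answer these questions with very little code. The following is a minimal sketch using only the Python standard library and made-up sample records; the field names (`customer_id`, `age`, `category_purchases`) and the helper functions are hypothetical, chosen to mirror the purchase-versus-demographics use case described earlier.

```python
from math import sqrt

# Made-up sample data standing in for two source systems.
customers = [
    {"customer_id": 1, "age": 25},
    {"customer_id": 2, "age": 40},
    {"customer_id": 3, "age": None},   # missing attribute
    {"customer_id": 4, "age": 60},
]
purchases = [
    {"customer_id": 1, "category_purchases": 2},
    {"customer_id": 2, "category_purchases": 5},
    {"customer_id": 3, "category_purchases": 1},
    {"customer_id": 4, "category_purchases": 8},
    {"customer_id": 9, "category_purchases": 3},   # no matching customer
]


def completeness(rows, field):
    """Is it clean enough? Fraction of rows where `field` is not None."""
    return sum(r.get(field) is not None for r in rows) / len(rows)


def join_match_rate(left, right, key):
    """Can we join the sources? Fraction of `left` rows with a match in `right`."""
    right_keys = {r[key] for r in right}
    return sum(r[key] in right_keys for r in left) / len(left)


def pearson(xs, ys):
    """Pearson correlation coefficient (assumes neither input is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Join purchases to customers, keeping only rows with a known age.
age_by_id = {c["customer_id"]: c["age"] for c in customers}
pairs = [
    (age_by_id[p["customer_id"]], p["category_purchases"])
    for p in purchases
    if age_by_id.get(p["customer_id"]) is not None
]
ages, counts = zip(*pairs)

print(f"age completeness: {completeness(customers, 'age'):.0%}")
print(f"purchases matched to a customer: {join_match_rate(purchases, customers, 'customer_id'):.0%}")
print(f"correlation(age, purchases): {pearson(ages, counts):.2f}")
```

A handful of rows like this proves nothing statistically, of course. The value is that the same three checks, run against real extracts, tell you early whether a use case is worth a full time-box.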

Build a minimum viable tech stack and start iterating on your use cases quickly. Time-box your data experiments to 2 to 4 weeks and aggressively discard things that aren’t working. We call this process the Exploratory Data Analysis (EDA) phase.

You will quickly figure out what works, what doesn’t, and what can be thrown away. Inevitably, you will uncover some things you didn’t expect. This is the beauty of the time-boxed EDA phase: you don’t spend 6 months developing a use case that was a dead horse from the start.

By this point, we’ve answered the important why question described above and understand how viable each of our use cases is. From here, we are better able to estimate the cost and value proposition of each one.

This gives us the information we need to answer the how and what questions with far more accuracy than we could when senior executives handed us a Data Lake Mandate.

Finish by solving a business problem

Instead of falling into a Data Lake buzzword bingo trap, which starts with the latest, greatest, best and brightest technologies, focus on solving the problem you set out to address in the first place.

As with so many technology change programs, the key to success lies at the very start of the program, in asking the right questions. As IT professionals, we’re tempted to dive into the what question and immediately start solutionizing. However, we should be asking ourselves why. This is certainly true with the adoption of Data Lakes.

If we start by asking why we are adopting Data Lakes, and why we need one, we invariably look at the business problems we’re trying to solve by utilizing this new technology. This forces us to focus on use cases and better understand the business requirements behind this technology’s adoption. 

Ultimately, this enables us to design a better technology solution that fits the business goals.
