Pragmatic Data Lakes

By James Cross, Delivery Leader

James is a Big Data Engineer and certified AWS Solutions Architect with a passion for data-driven applications. He's spent the last seven years helping his clients design and implement large-scale streaming Big Data platforms, cloud-based analytics stacks, and serverless architectures.


Many companies start their Data Lake journey on the front page of a Forbes or Times article. It begins with an exec-level desire to have the latest technical toy in the toolkit and the latest buzzwords in buzzword bingo, Big Data and Artificial Intelligence, and, to make it all happen, a Data Lake.

This desire feeds down to senior IT management, who, when tasked with delivering a solution that centralizes all of the company’s data in one place, start shaping a 3 to 5-year change program.

We recently spoke to a European company that was exactly on this track. They had laid extensive and detailed groundwork to deliver a Serverless, Event-Sourced, Event-Driven, Machine-Learning-Powered Data Lake, built on Cloud Native technologies, over 3.5 years.

Our first question to them was: what use cases are you designing this system for? This prompted some soul-searching and murmured chatter, with people glancing at each other, before a very vague “to improve our operations” emerged.

This is the big Data Lake trap: delivering an enormous change program without grounding your thinking in valuable business use cases for this technology.

Start by asking the right question

To be successful, start with the question why rather than what. That means starting from business value, which should be the foundation of every engineering project. Create a list of the problems you would like to solve by creating a Data Lake.

The key to a good-quality use case list is ensuring that each use case is definitive, close-ended, achievable, and self-describing. Doing this means you can pick one up and start work on it almost immediately, without spinning your wheels on scope definition.

Conversely, a bad use case is open-ended, weakly defined, and in need of significant scoping effort. For example: “I want to reduce the time it takes to gather insights from my data.” This is an admirable goal and a worthwhile strategic objective, but it’s not a use case, because it is framed as an outcome and is not actionable.

A good use case is the opposite: tied to a specific business goal and specific business data, with an actionable objective. For example: “I want to correlate the number of times a customer purchases a particular category of product with the customer metadata I hold (such as age, demographics, nationality, and income).”

There’s a remarkable difference between these two. The first is open-ended and weakly defined. The second is something we can start on right away, because we already know which data sources we need to access.
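
To make the contrast concrete, here is a minimal sketch of the purchase-category use case. It assumes two illustrative extracts, purchases.csv and customers.csv, with hypothetical column names; the point is not the code itself, but how quickly a well-scoped use case turns into working analysis.

```python
import pandas as pd

# Hypothetical extracts: purchases (customer_id, category, ...) and
# customers (customer_id, age, income, ...). All names are illustrative.
purchases = pd.read_csv("purchases.csv")
customers = pd.read_csv("customers.csv")

# Count how often each customer bought from one category of interest.
category_counts = (
    purchases[purchases["category"] == "electronics"]
    .groupby("customer_id")
    .size()
    .reset_index(name="purchase_count")
)

# Join the counts onto the customer metadata and inspect the correlations.
joined = customers.merge(category_counts, on="customer_id", how="left")
joined["purchase_count"] = joined["purchase_count"].fillna(0)
print(joined[["age", "income", "purchase_count"]].corr())
```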

Once you’ve compiled a list of these 5–10 use cases, the next step is to prioritize them. When doing so, it’s essential to consider a few variables (a rough scoring sketch follows the list):

  • The difficulty of implementation (e.g., time and effort)
  • Probability of success (e.g., business intelligence dashboards have a far higher chance of success than attempting to predict a time-to-event)
  • Business value a successful use case will deliver
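
As a purely illustrative sketch, a lightweight scoring function can make the trade-off between these three variables explicit. The use cases, scores, and formula below are hypothetical; the exact numbers matter far less than forcing the conversation.

```python
# Illustrative only: rank use cases by expected value per unit of effort.
use_cases = [
    # (name, difficulty 1-5 (higher = harder), probability of success 0-1, business value 1-5)
    ("BI dashboard for category sales", 2, 0.9, 3),
    ("Time-to-event churn prediction", 5, 0.4, 5),
    ("Customer/purchase correlation study", 3, 0.7, 4),
]

def priority(difficulty: int, p_success: float, value: int) -> float:
    # Business value weighted by the chance of success, divided by the effort.
    return (value * p_success) / difficulty

for name, d, p, v in sorted(use_cases, key=lambda uc: priority(*uc[1:]), reverse=True):
    print(f"{priority(d, p, v):5.2f}  {name}")
```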

The key difference with this approach is that we’ve started by considering why we’re adopting Data Lakes and used that to select use cases that deliver business value. This ensures that our plan delivers something useful to the business, rather than something whose only purpose is to design and build a Data Lake.

What comes next?

Hopefully, the use cases give you some focus and direction, but they should also start to provide clarity around the tech stack. Inevitably, we will need a few common things (a minimal sketch of the first two follows the list):

  • A flexible place to store data at rest, the “Data Lake”
  • A method of ingesting data into the lake
  • A common approach for processing and transforming data, both ad-hoc and on a schedule
  • A sandbox for your data scientists and engineers to play around in
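
As one possible sketch of the first two items, assuming an S3-backed lake and the pandas/pyarrow stack: the bucket, prefix, and partition column are hypothetical placeholders, and writing to an s3:// path also assumes s3fs is installed.

```python
import pandas as pd

# Hypothetical source extract with an order_date column to partition on.
raw = pd.read_csv("daily_purchases.csv", parse_dates=["order_date"])
raw["order_day"] = raw["order_date"].dt.date.astype(str)

# Land the data as partitioned Parquet so that ad-hoc queries and scheduled
# jobs can both read it efficiently. Bucket and prefix are placeholders.
raw.to_parquet(
    "s3://my-data-lake/raw/purchases/",
    partition_cols=["order_day"],
    index=False,
)
```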

The last point is perhaps the most crucial. Inevitably, some of your use cases will carry an element of uncertainty. Is the data there to support the use case? If it is, is it clean enough? Can you accurately join the different data sources together? Perhaps the correlation you expected isn’t there at all.

Build a minimum viable tech stack and start iterating on your use cases quickly. Time-box your data experiments to 2–4 weeks and aggressively discard things that aren’t working. We call this process the Exploratory Data Analysis (EDA) phase.
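
A typical artefact of such a time-boxed experiment is a throwaway check like the sketch below, which asks the questions above directly: is the data populated, and do the sources actually join? Paths and column names are again hypothetical, reusing the illustrative lake layout from earlier.

```python
import pandas as pd

# Hypothetical locations reusing the earlier illustrative lake layout.
purchases = pd.read_parquet("s3://my-data-lake/raw/purchases/")
customers = pd.read_csv("customers.csv")

# Is the data there, and how clean is it? Share of missing values per column.
print(purchases[["customer_id", "category", "order_date"]].isna().mean())

# Can the sources be joined? Share of purchases with a known customer.
match_rate = purchases["customer_id"].isin(customers["customer_id"]).mean()
print(f"Join coverage: {match_rate:.1%}")
```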

You will quickly figure out what works, what doesn’t, and what can be thrown away. Inevitably, you will uncover some things you didn’t expect, and that is the beauty of the time-boxed EDA phase: you don’t spend six months developing a use case that was dead on arrival.

By this point, we’ve answered the important why question posed above and understand how tangible each of our use cases really is. From here, we are far better placed to estimate the cost and value proposition of each one.

This gives us the information we need to answer the how and what questions with far more accuracy than we could when senior executives handed us a Data Lake mandate.

Finish by solving a business problem

Instead of falling into the Data Lake buzzword-bingo trap of starting with the latest, greatest, best and brightest technologies, focus on solving the problem you set out to solve in the first place.

As with so many technology change programs, the key to success lies at the very start, in asking the right questions. As IT professionals, we’re tempted to dive straight into the what question and start solutionizing. Instead, we should be asking ourselves why. This is certainly true of the adoption of Data Lakes.

If we start by asking why we are adopting Data Lakes and why we need one, we invariably look at the business problems we’re trying to solve with this new technology. This forces us to focus on use cases and helps us better understand the business requirements behind the adoption.

Ultimately, this enables us to design a better technology solution that fits the business goals.
