How to Operate a Data Platform: Challenges and Solutions

Why is operating a data platform hard? Because of the significant variability in the day-to-day.

A good data user experience is like running a Starbucks store: you walk in expecting a consistent-tasting drink at any location in the world. An analyst using a reporting dashboard or a data scientist using a machine learning model is no different; they should expect consistent, reliable data.

Now what happens when an ingredient is out of stock? Your drink can’t be made. There could be many reasons why it’s out of stock - maybe the forecast for ingredients was too low, maybe some stores were sent more than others, maybe there was a storm that created supply issues, etc. There are more ways for things to go wrong than we could ever list.

A data platform is the same - it collects data from a variety of sources, carefully mixes in other data, and ships the result to users via a web application, reporting platform, marketing platform, etc. There are more ways for things to go wrong than we could ever list.

This article covers how to operate a Data Platform: managing process bottlenecks, data quality, and data monitoring systems.

From Manufacturing to DevOps

A Starbucks drink and data differ in that the feedback loop is much faster for data. Moving physical goods takes days or hours; data can change within minutes or seconds. Humans process and ship physical goods, whereas machines complete these tasks for data.

Back in college, we read a book about plant operations called ‘The Goal’. I never thought I’d have a career in plant manufacturing, yet here I am reminiscing about it. The one thing I remember is the importance of managing bottlenecks: bottlenecks should run at maximum capacity to ensure high throughput.

Pushed to the extreme, this concept meant that someone’s sole responsibility was monitoring bottlenecks to minimize downtime.

The spiritual child of ‘The Goal’ is a book focused on IT called ‘The Phoenix Project’. It applied the same manufacturing principles to software development, which inspired the DevOps movement. The principles were maximizing throughput through bottlenecks, owning quality assurance to minimize rework, and examining software systems holistically across departments.

Managing Bottlenecks

Data comes in, data gets processed, data gets shipped. 

Running a data platform requires constantly monitoring bottlenecks, which come in two forms: people and technical.

People bottlenecks happen because people juggle multiple projects, don’t communicate regularly across departments, or can’t see the downstream impact of their work. They context-switch across projects and tasks - meetings, emails, and tickets. Teams optimize for their own productivity (which is a good thing), but delaying one task might create the bottleneck for the entire Data Platform.

Regular communication with other teams to align on prioritization is the solution, and a large part of managing a Data Platform. Hard problems take focus, and major projects have endless hard problems that won’t get solved without prioritization.

Technical bottlenecks are more straightforward because, hypothetically, you have more influence over the solution: ensuring that the machines processing the data have minimal downtime. Straightforward, however, does not mean easy.

A Data Platform can process billions of rows of data a day. Consider a scenario where a bug fix needs to be applied to all the historical data. This means reprocessing everything with the required change. Machines take a fixed amount of time to complete this work, which makes the bottleneck straightforward to calculate and manage.
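The calculation is back-of-envelope arithmetic. Here's a minimal sketch with hypothetical numbers; substitute your own platform's volumes and measured throughput.

```python
# Back-of-envelope estimate for reprocessing historical data after a bug fix.
# All numbers below are hypothetical examples, not benchmarks.

ROWS_PER_DAY = 2_000_000_000          # daily ingest volume
HISTORY_DAYS = 365                    # how far back the fix reaches
ROWS_PER_MACHINE_HOUR = 500_000_000   # measured throughput of one worker
MACHINES = 20                         # workers dedicated to the backfill

total_rows = ROWS_PER_DAY * HISTORY_DAYS
machine_hours = total_rows / ROWS_PER_MACHINE_HOUR
wall_clock_hours = machine_hours / MACHINES

print(f"{total_rows:,} rows -> {machine_hours:,.0f} machine-hours "
      f"-> ~{wall_clock_hours / 24:.1f} days on {MACHINES} machines")
# -> 730,000,000,000 rows -> 1,460 machine-hours -> ~3.0 days on 20 machines
```

With these numbers, the backfill is a known three-day bottleneck: you can schedule around it, and you know exactly what adding ten more machines buys you.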

What’s not easy are the new bottlenecks that appear as more data and users come onboard. Engineers have a variety of methods to increase platform capacity for higher volume or faster processing. Sometimes the solution is clever engineering; other times it’s unglamorous: a data engineer rolling up their sleeves and getting things done manually.

Data Rework: Differences Between Software and Data Engineering

If software development is about managing libraries (open-source or 3rd-party) and logic, then data is about managing data sources and processing logic. There are some major differences between software and data engineering. 

Software can be more predictable. In software, you select the library versions, and breaking changes, hopefully, arrive in a new major version. In data, you’re relying on the data source being consistent. Unfortunately, that’s not always the case.

Software compiles faster. In software, debugging can happen quickly because a program compiles in seconds. In data, you may start by debugging against a small dataset, but then query progressively larger and more complex data. Both are equally complex; data debugging just has longer feedback loops because of compilation and query time.

Software bug fixes are usually a few snippets of code. In software, the complexity of debugging is finding the issue across libraries, data, and servers, but the actual fix is usually fast. In data, billions of rows get backed up when things go down. Clearing the backlog is resource-intensive, manual work that requires pushing technical bottlenecks to their maximum capacity. This compounding rework in data is why automated monitoring is so important.

Monitoring Data Quality to Minimize Rework

Data monitoring systems are important because new bottlenecks can appear at any time, and clearing out traffic jams is expensive in time and effort.

A common operating mistake is assuming that the organization producing the source data holds high data quality standards. That’s not always the case, and investing in a monitoring system is the best way to be proactive. Building a system in-house first lets you nail down your use cases and edge cases before deciding to outsource to a vendor.

A sample checklist of items to monitor includes:

  • Timing - data arriving on time consistently
  • Schema - data schema & data structure consistency
  • Volume - data size consistency
  • DateTime - date & timezone consistency 
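The checklist above can be sketched as a single batch-level check. This is a minimal illustration, not a production system: the field names, SLA deadline, and volume tolerance are all hypothetical stand-ins for whatever your data contracts actually specify.

```python
# A minimal sketch of batch-level data quality checks, assuming each batch
# arrives as a list of dict records plus arrival metadata. Field names and
# thresholds are hypothetical; a real platform would load these from config.
from datetime import datetime

EXPECTED_SCHEMA = {"user_id", "event", "occurred_at"}  # assumed data contract

def check_batch(records, arrived_at, deadline, expected_rows, tolerance=0.2):
    """Return a list of human-readable alerts; an empty list means healthy."""
    alerts = []

    # Timing: did the batch arrive before its SLA deadline?
    if arrived_at > deadline:
        alerts.append(f"late: arrived {arrived_at - deadline} past deadline")

    # Schema: does every record carry the agreed-upon fields?
    for i, rec in enumerate(records):
        missing = EXPECTED_SCHEMA - rec.keys()
        if missing:
            alerts.append(f"schema: record {i} missing {sorted(missing)}")
            break  # one example is enough to page someone

    # Volume: is the row count within tolerance of the expected size?
    if abs(len(records) - expected_rows) > expected_rows * tolerance:
        alerts.append(f"volume: got {len(records)}, expected ~{expected_rows}")

    # DateTime: timestamps must be timezone-aware (no naive datetimes)
    for i, rec in enumerate(records):
        ts = rec.get("occurred_at")
        if isinstance(ts, datetime) and ts.tzinfo is None:
            alerts.append(f"datetime: record {i} has a naive timestamp")
            break

    return alerts
```

Running a function like this on every incoming batch turns silent upstream changes into pageable alerts, which is the whole point: you find out before your users do.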

Setting up monitoring is an investment in providing reliable, consistent data to users, and peace of mind to the Data Platform teams. Over time, more edge cases can be added to the monitoring as they come up.

Summary

Data is everywhere in today’s world. Any digital system that you touch is powered by Data Platforms. Wrangling a consistent, reliable data experience is an exercise in smoothing out complexities. Running a Data Platform is an operational balancing act of being proactive with monitoring at a higher level, while managing people and technical bottlenecks at a daily level.


*Title image credit: https://ralphammer.com/make-me-think. I chose this image because it reflects how users should have simple, intuitive data experiences. Those of us building Data Platforms should manage the complexity in the background.

My Product Manager Survival Handbook


Developing software products is a messy process. It’s messy because software is so flexible and fairly new - only about 60 years old. It’s also abstract, not limited by the physical world, only by our creativity.

If you’re building with more concrete mediums, whether furniture or a cake, there are well-documented rules about how the parts work together and how the result should look. In this sense, software engineers more closely resemble artists such as writers or musicians, where the rules are flexible and the result isn’t how it looks, but how it makes us feel. Great technology delights - think of the raw excitement the first time Steve Jobs presented the iPhone in 2007. Or the sheer joy of finishing tedious paperwork in minutes instead of hours because a computer automated it.

But building delightful products is a process littered with worries. Work experience helps because it closes the gap between the execution required and the final product vision. Unfortunately, ‘experience’ is a nice way of saying ‘learning from mistakes’, and there are always new ways to make mistakes. Experience isn’t always enough.

So, what are my worries? 

The One Skill That Data Scientists Are Not Being Trained On

At the Toronto Machine Learning Micro-Summit this past week, one theme came up repeatedly during the presentations: communicate with the business team early, and often, or you’ll need to go back and redo your work.

There was the story of an insurance company that created a model recommending whether to replace or fix a car after a damage claim. It sounded great - the Data Scientists got a prototype up and running and had business team buy-in. But the models weren’t very accurate. Usually, that means the data is noisy or the algorithm isn’t powerful enough. They went back to their business team and learned they had missed two key features: the age of the vehicle and whether it’s a luxury model.

Another example was a telecom that built a model to optimize call center efficiency. The data science team spent a month building the model, and everyone was excited to get it into production. Then they were told that the call center runs on an outdated application, and integrating with it would cost more than the ROI of the project.