Why is operating a data platform hard? Because of the significant variability in the day-to-day.
A good data user experience is like running a Starbucks store - you walk in expecting a consistent-tasting drink at any location in the world. An analyst using a reporting dashboard or a data scientist using a machine learning model is no different: they should expect consistent, reliable data.
Now what happens when an ingredient is out of stock? Your drink can’t be made. There could be many reasons why it’s out of stock - maybe the forecast for ingredients was too low, maybe some stores were sent more than others, maybe there was a storm that created supply issues, etc. There are more ways for things to go wrong than we could ever list.
A data platform is the same - it collects data from a variety of sources, carefully mixes in other quantities of data, and then gets shipped to a user via a web application, reporting platform, marketing platform, etc. There are more ways for things to go wrong than we could ever list.
This article covers how to operate a Data Platform: managing process bottlenecks, minimizing data rework, and monitoring data quality.
From Manufacturing to DevOps
A Starbucks drink and data differ in that the feedback loop is much faster for data. Moving physical goods takes hours or days; data can change within minutes or seconds. Humans process and ship physical goods, whereas machines complete these tasks for data.
Back in college, we read a book about plant operations called ‘The Goal’. I never thought I’d have a career in plant manufacturing, yet here I am reminiscing about it. The one thing I remember is the importance of managing bottlenecks: bottlenecks should run at maximum capacity to ensure high throughput.
Pushed to the extreme, this concept meant that someone’s sole responsibility was monitoring bottlenecks to minimize downtime.
The spiritual child of ‘The Goal’ is a book focused on IT called ‘The Phoenix Project’. It applied the same manufacturing principles to software development, which inspired the DevOps movement. The principles were the same: maximize throughput at bottlenecks, own quality assurance to minimize rework, and examine software systems holistically across departments.
Managing Bottlenecks
Data comes in, data gets processed, data gets shipped.
Running a data platform requires constantly monitoring bottlenecks, which come in two forms: people and technical.
People bottlenecks happen because people are juggling multiple projects, don’t communicate regularly across departments, or don’t see the downstream impact of their work. They context switch across projects and tasks - meetings, emails, or tickets. Teams optimize for their own productivity (which is a good thing), but delaying one task can bottleneck the entire Data Platform.
The solution is regular communication with other teams to align on prioritization, and that is a large part of managing a Data Platform. Hard problems take focus, and major projects have endless hard problems that require prioritization or they won’t get solved.
Technical bottlenecks are more straightforward because, hypothetically, you have more influence over the solution. It’s ensuring that the machines processing the data have minimal downtime. Straightforward, however, does not mean easy.
A Data Platform can process billions of rows of data a day. Consider a scenario where a bug fix needs to be applied to all the historical data. This change means reprocessing all the data with the required fix. Machines take a fixed amount of time to complete this work, which makes the bottleneck straightforward to calculate and manage.
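That fixed-cost calculation can be sketched with a back-of-the-envelope estimate. All the numbers below are illustrative assumptions, not figures from any real platform:

```python
# Back-of-the-envelope backfill estimate (all numbers are made up for illustration).
rows_total = 500e9              # historical rows that need reprocessing
rows_per_machine_per_hr = 2e9   # measured throughput of a single worker
machines = 10                   # parallel workers available for the backfill

# With throughput and capacity fixed, the bottleneck is simple arithmetic.
hours = rows_total / (rows_per_machine_per_hr * machines)
print(f"Estimated backfill time: {hours:.1f} hours")  # 25.0 hours
```

Estimates like this also make the trade-off explicit: adding machines shortens the backfill, but only up to whatever the next bottleneck (storage, network, a rate-limited source) allows.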
What’s not easy are the new bottlenecks that appear as more data and users come onboard. Engineers have a variety of methods to increase platform capacity for higher volume or faster processing. Sometimes, solutions may be clever engineering; other times, it’s unglamorous. It’s just a data engineer rolling up their sleeves and getting things done manually.
Data Rework: Differences Between Software vs. Data Engineering
If software development is about managing libraries (open-source or 3rd-party) and logic, then data is about managing data sources and processing logic. There are some major differences between software and data engineering.
Software can be more predictable. In software, you select the library versions, and breaking changes, hopefully, arrive in a new major version. In data, you’re relying on the data source staying consistent. Unfortunately, that’s not always the case.
Software compiles faster. In software, debugging can happen quickly because a program compiles in seconds. In data, you may start by debugging against a small dataset, but then query progressively larger and more complex data. Both are equally complex; the difference is that data debugging has longer feedback loops because of compilation and query time.
Software bug fixes are usually a few snippets of code. In software, the complexity in debugging is finding the issue across libraries, data, and servers, but the actual fix is usually fast. In data, billions of rows get backed up when things go down. It’s resource-intensive, manual work that requires pushing technical bottlenecks to their maximum capacity to clear things out. This compounding rework in data is why automated monitoring is so important.
Monitoring Data Quality to Minimize Rework
Data monitoring systems are important because new bottlenecks can appear at any time and clearing out traffic jams is expensive in time and effort.
The operating challenge is that you can’t assume the organization producing the source data holds high data quality standards. That’s not always the case, and investing in a monitoring system is the best way to be proactive. Building a system in-house lets you nail down your use cases and edge cases before deciding to outsource to a vendor.
A sample checklist of items to monitor includes:
- Timing - data arriving on time consistently
- Schema - data schema & data structure consistency
- Volume - data size consistency
- DateTime - date & timezone consistency
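The four checks above can be sketched as simple functions. This is a minimal sketch, not a production system; the thresholds, column names, and SLA deadline are illustrative assumptions:

```python
# Minimal sketch of the four monitoring checks; all field names and
# thresholds are hypothetical, chosen only to illustrate the idea.
from datetime import datetime, timezone

EXPECTED_COLUMNS = {"event_id", "user_id", "event_time"}  # assumed schema

def check_timing(arrived_at: datetime, deadline: datetime) -> bool:
    """Timing: did the batch land before its SLA deadline?"""
    return arrived_at <= deadline

def check_schema(columns: set) -> bool:
    """Schema: do incoming columns match what downstream jobs expect?"""
    return columns == EXPECTED_COLUMNS

def check_volume(row_count: int, baseline: int, tolerance: float = 0.5) -> bool:
    """Volume: is today's row count within +/-50% of a recent baseline?"""
    return abs(row_count - baseline) <= tolerance * baseline

def check_datetime(ts: datetime) -> bool:
    """DateTime: are timestamps timezone-aware (e.g. normalized to UTC)?"""
    return ts.tzinfo is not None

# Example run against one made-up batch
batch_ok = all([
    check_timing(datetime(2024, 1, 1, 6, tzinfo=timezone.utc),
                 datetime(2024, 1, 1, 7, tzinfo=timezone.utc)),
    check_schema({"event_id", "user_id", "event_time"}),
    check_volume(row_count=9_000_000, baseline=10_000_000),
    check_datetime(datetime(2024, 1, 1, tzinfo=timezone.utc)),
])
print("batch passes checks:", batch_ok)  # True
```

In practice each failed check would page an on-call engineer or halt downstream jobs rather than just print a boolean, but the shape of the checks stays the same.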
Setting up monitoring is an investment in providing reliable, consistent data to users, and peace of mind to the Data Platform teams. Over time, more edge cases can be added to the monitoring as they come up.
Summary
Data is everywhere in today’s world. Any digital system that you touch is powered by Data Platforms. Wrangling a consistent, reliable data experience is an exercise in smoothing out complexities. Running a Data Platform is an operational balancing act of being proactive with monitoring at a higher level, while managing people and technical bottlenecks at a daily level.