Essays on tech, data, and product strategy

How to Operate a Data Platform: Challenges and Solutions

Why is operating a data platform hard? Because of the significant variability in the day-to-day.

A good data user experience is like running a Starbucks store - you walk in expecting a consistent-tasting drink from any location in the world. An analyst using a reporting dashboard or a data scientist using a machine learning model is no different: they should expect consistent, reliable data.

Now what happens when an ingredient is out of stock? Your drink can’t be made. There could be many reasons why it’s out of stock - maybe the forecast for ingredients was too low, maybe some stores were sent more than others, maybe there was a storm that created supply issues, etc. There are more ways for things to go wrong than we could ever list.

A data platform is the same - it collects data from a variety of sources, carefully mixes in other quantities of data, and then ships the result to a user via a web application, reporting platform, marketing platform, etc. There are more ways for things to go wrong than we could ever list.

This article will cover topics on how to operate a Data Platform, such as: process bottlenecks, data quality, and data monitoring systems.

From Manufacturing to DevOps

A Starbucks drink and data differ in that the feedback loop is much faster for data. Moving physical goods takes days or hours; data can change within minutes or seconds. Humans process and ship physical goods, whereas machines complete these tasks for data.

Back in college, we read a book about plant operations called ‘The Goal’. I never thought I’d have a career in plant manufacturing, yet here I am reminiscing about it. The only thing I remember was the importance of managing bottlenecks. Bottlenecks should be used at maximum capacity to ensure high throughput.

Pushed to the extreme, this concept meant that someone’s sole responsibility was monitoring bottlenecks to minimize downtime.

The spiritual child of ‘The Goal’ is an IT-focused book called ‘The Phoenix Project’. It applied the same manufacturing principles to software development and helped popularize the DevOps movement. The principles were maximizing throughput through bottlenecks, owning quality assurance to minimize rework, and examining software systems holistically across departments.

Managing Bottlenecks

Data comes in, data gets processed, data gets shipped.

Running a data platform requires constantly monitoring bottlenecks, which come in two forms: people and technical.

People bottlenecks happen because people are juggling multiple projects, don’t communicate regularly across departments, or don’t see the downstream impact of their work. They context switch across projects and tasks - meetings, emails, or tickets. Teams optimize to improve their own productivity (which is a good thing), but one delayed task might be the bottleneck for the entire Data Platform.

Regular communication to align on prioritization with other teams is the solution and a large part of managing a Data Platform. Hard problems take focus, and major projects have endless hard problems that require prioritization or they won’t get solved.

Technical bottlenecks are more straightforward because, hypothetically, you have more influence over the solution. It’s ensuring that the machines processing the data have minimal downtime. Straightforward, however, does not mean easy.

A Data Platform can process billions of rows of data a day. Consider a scenario where a bug fix needs to be applied on all the historical data. This change means reprocessing all the data with the required change. Machines take a fixed amount of time to complete this work, which is a bottleneck that’s straightforward to calculate and manage.
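As a rough illustration of why this kind of bottleneck is straightforward to calculate, here is a minimal sketch of the arithmetic. The function names, the per-worker throughput, and the worker count are all hypothetical numbers chosen for the example, and the sketch assumes throughput scales linearly with workers, which real systems only approximate.

```python
import math


def estimate_backfill_hours(total_rows: int,
                            rows_per_second_per_worker: float,
                            workers: int) -> float:
    """Estimate wall-clock hours to reprocess total_rows.

    Assumes (hypothetically) that throughput scales linearly with workers.
    """
    if rows_per_second_per_worker <= 0 or workers <= 0:
        raise ValueError("throughput and worker count must be positive")
    seconds = total_rows / (rows_per_second_per_worker * workers)
    return seconds / 3600


def workers_needed(total_rows: int,
                   rows_per_second_per_worker: float,
                   deadline_hours: float) -> int:
    """Smallest number of workers that finishes within deadline_hours."""
    if rows_per_second_per_worker <= 0 or deadline_hours <= 0:
        raise ValueError("throughput and deadline must be positive")
    rows_per_worker = rows_per_second_per_worker * deadline_hours * 3600
    return math.ceil(total_rows / rows_per_worker)


# Illustrative numbers only: one year of history at 1 billion rows a day.
history_rows = 1_000_000_000 * 365
hours = estimate_backfill_hours(history_rows, 50_000, 20)
print(f"Backfill with 20 workers: about {hours:.1f} hours")
print(f"Workers to finish in 24 hours: {workers_needed(history_rows, 50_000, 24)}")
```

With these made-up inputs the backfill takes a little over four days on 20 workers, which is the kind of number you want before committing to a fix rather than after.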

What’s not easy are the new bottlenecks that appear as more data and users come onboard. Engineers have a variety of methods to increase platform capacity for higher volume or faster processing. Sometimes the solution is clever engineering; other times it’s unglamorous: just a data engineer rolling up their sleeves and getting things done manually.

Data Rework: Differences Between Software and Data Engineering

If software development is about managing libraries (open-source or 3rd-party) and logic, then data is about managing data sources and processing logic. There are some major differences between software and data engineering.

Software can be more predictable. In software, you select the library versions. Major changes, hopefully, are in a different version. In data, you’re relying on the data source being consistent. Unfortunately, that’s not always the case.

Software compiling is faster. In software, debugging can happen faster because a program compiles in seconds. In data, you may start by testing against a small dataset, but then query progressively larger and more complex data. Both are equally complex; data debugging simply has longer feedback loops because of processing and query time.

Software bug fixes are usually a few snippets of code. In software, the complexity in debugging is finding the issue across libraries, data, and servers, but the actual fix is usually fast. In data, billions of rows get backed up when things go down. It’s resource-intensive, manual work that requires pushing technical bottlenecks to their maximum capacity to clear things out. This compounding rework in data is why automated monitoring is so important.

Monitoring Data Quality to Minimize Rework

Data monitoring systems are important because new bottlenecks can appear at any time and clearing out traffic jams is expensive in time and effort.

The operating challenge is the assumption that the source data’s organization has high data quality standards. That’s not always the case, and investing in a monitoring system is the best way to be proactive. A system can be built in-house that lets you nail down your use and edge cases, before deciding to outsource to a vendor.

A sample checklist of items to monitor includes:

Timing - data arriving on time consistently
Schema - data schema & data structure consistency
Volume - data size consistency
DateTime - date & timezone consistency
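To make the checklist concrete, here is a minimal sketch of what an in-house check over one incoming batch might look like. Everything in it is an assumption made for illustration: the `Batch` and `Expectations` shapes, the 50% volume tolerance, and the rule that timestamps must be timezone-aware UTC are hypothetical choices, not a prescribed design.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Batch:
    arrived_at: datetime         # when the batch landed (timezone-aware)
    columns: dict[str, str]      # column name -> type name
    row_count: int
    event_times: list[datetime]  # sample of timestamps taken from the data


@dataclass
class Expectations:
    deadline: datetime             # latest acceptable arrival time
    columns: dict[str, str]        # expected schema
    typical_row_count: int         # baseline volume, assumed positive
    volume_tolerance: float = 0.5  # allowed fractional deviation (assumed)


def check_batch(batch: Batch, expected: Expectations) -> list[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []

    # Timing: did the data arrive on time?
    if batch.arrived_at > expected.deadline:
        problems.append(f"timing: arrived {batch.arrived_at - expected.deadline} late")

    # Schema: any column added, removed, or changed type?
    diff = set(batch.columns.items()) ^ set(expected.columns.items())
    if diff:
        names = sorted({name for name, _ in diff})
        problems.append(f"schema: columns differ from expected: {names}")

    # Volume: is the row count within tolerance of the baseline?
    baseline = max(expected.typical_row_count, 1)
    deviation = abs(batch.row_count - baseline) / baseline
    if deviation > expected.volume_tolerance:
        problems.append(f"volume: {batch.row_count} rows, {deviation:.0%} off baseline")

    # DateTime: every sampled timestamp should be timezone-aware UTC.
    # utcoffset() is None for naive datetimes, so those are flagged too.
    bad = [t for t in batch.event_times if t.utcoffset() != timedelta(0)]
    if bad:
        problems.append(f"datetime: {len(bad)} timestamp(s) not in UTC")

    return problems
```

A scheduler could call `check_batch` after each load and alert on a non-empty result. New edge cases then become extra checks in the same function as they come up.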

Setting up monitoring is an investment in being able to provide reliable, consistent data to users, and peace of mind to the Data Platform teams. Over time, more edge cases can be added to the monitoring as they come up.

Summary

Data is everywhere in today’s world. Any digital system that you touch is powered by Data Platforms. Wrangling a consistent, reliable data experience is an exercise in smoothing out complexities. Running a Data Platform is an operational balancing act of being proactive with monitoring at a higher level, while managing people and technical bottlenecks at a daily level.

*Title image credit: https://ralphammer.com/make-me-think. I chose this image because it reflects how users should have simple, intuitive, data experiences. Those of us building Data Platforms should manage the complexity in the background.

Thoughts from Toronto Machine Learning Summit 2019

I just attended the 2019 Toronto Machine Learning Summit (TMLS) last week - it was a great experience. The community was welcoming, the content was relevant, and it was well organized.

One thing that I appreciated about the event is how down-to-earth it felt. Many conversations started with the person that I sat beside during a session, which doesn't happen often at conferences. People had a range of experience with Machine Learning, but everyone came from a genuine place to learn more about the topic. It's quite different from other conferences that I've attended where the marketing glamour is cranked up - leading to a mass-produced approach, hit-or-miss content, and participants feeling more distant.

Are Data Scientists Actually Surveillance Scientists? - Part 1

There was of course no way of knowing whether you were being watched at any given moment. How often, or on what system, the Thought Police plugged in on any individual wire was guesswork. It was even conceivable that they watched everybody all the time. But at any rate they could plug in your wire whenever they wanted to. You had to live—did live, from habit that became instinct—in the assumption that every sound you made was overheard, and, except in darkness, every movement scrutinized. -1984, George Orwell

Last summer I had a conversation with an acquaintance who had recently visited China. There was discussion about China's Social Credit System (SCS) and its impact on people's daily lives. The Social Credit System is a system that assigns scores to citizens based on their reputation, and that score can impact someone's ability to be outside in the evening, their eligibility to book a travel ticket or their suitability for a loan. It's similar to a Credit Score that North Americans are more familiar with but more encompassing as the SCS takes non-financial data into account. My acquaintance said that the initial feedback was positive - her friends and family felt safer walking the streets at night knowing that people deemed dangerous wouldn't be allowed outside.

How studying data science lets me design better customer solutions

If data is the new oil, then data science is the new refinery.

I was recently asked whether studying Data Science has helped me in my day-to-day job. My response was yes, but not in an obvious way - it's resulted in better designed customer solutions by improving my empathy.

Let me take a step back. For the past few years, I've been leading Software-as-a-Service (SaaS) platform integrations for enterprise clients. I often describe the work as similar to being a clothing tailor. If a software consultancy is a bespoke tailor that customizes every detail at a premium price, then a SaaS platform is a made-to-measure tailor who cuts from an existing pattern at an economical price. Over time, I've learned how to measure and cut software for customers of all shapes, sizes, and sophistication.

Why soft skills will win in the age of machine learning

Back in college, I had a summer job completing research for a clinical health professor. She was a leading expert in diagnosing and treating open human wounds. My job was to survey other experts, get them to examine photos of open wounds, and then recommend a treatment.

A few months ago, I discovered a smartphone app which replaces this work.* You take a photo of an open wound and upload it to the cloud. I suspect that the photo is run through an image recognition model, called a Convolutional Neural Network (CNN), that identifies specific features of the wound for treatment. Current machine learning is very good at completing narrowly defined tasks, such as analyzing a specific type of medical image, because it has millions of previous examples to train from. It is not good at handling non-standard cases.

How Data Scientists Are Controlling Your Life

My daily experience with recommendation systems is seamless. They recommend what to read on Apple News, listen to on Spotify, eat on Uber Eats, purchase on Amazon, and watch on Netflix. These software programs take millions of data points, clean and segment the data, weigh different variables, and output recommendations that ensure we stay engaged with the platform for the next selection. As much as we want to believe that machines make all these decisions, data scientists are the ones deciding the inputs for these models. Ultimately, these choices introduce bias.

What if I'm missing out on an incredible book or song because the inputs don't capture interests of mine that I didn't even know existed?