How to Operate a Data Platform: Challenges and Solutions

Why is operating a data platform hard? Because of the significant variability in the day-to-day.

A good data user experience is like running a Starbucks store - you walk in expecting a consistent-tasting drink at any location in the world. An analyst using a reporting dashboard or a data scientist using a machine learning model is no different: they expect consistent, reliable data.

Now what happens when an ingredient is out of stock? Your drink can’t be made. There could be many reasons why it’s out of stock - maybe the forecast for ingredients was too low, maybe some stores were sent more than others, maybe there was a storm that created supply issues, etc. There are more ways for things to go wrong than we could ever list.

A data platform is the same - it collects data from a variety of sources, carefully mixes in other data, and then ships the result to users via a web application, reporting platform, marketing platform, etc. There are more ways for things to go wrong than we could ever list.

This article covers how to operate a Data Platform: managing process bottlenecks, data quality, and data monitoring systems.

From Manufacturing to DevOps

A Starbucks drink and data differ in that the feedback loop is much faster for data. Moving physical goods takes days or hours; data can change within minutes or seconds. Humans process and ship physical goods, whereas machines complete these tasks for data.

Back in college, we read a book about plant operations called ‘The Goal’. I never thought I’d have a career in plant manufacturing, yet here I am reminiscing about it. The only thing I remember is the importance of managing bottlenecks: they should run at maximum capacity to ensure high throughput.

Pushed to the extreme, this concept meant that someone’s sole responsibility was monitoring bottlenecks to minimize downtime.

The spiritual child of ‘The Goal’ is a book focused on IT called ‘The Phoenix Project’. It applied the same manufacturing principles to software development and inspired the DevOps movement. The principles: maximize throughput through bottlenecks, own quality assurance to minimize rework, and examine software systems holistically across departments.

Managing Bottlenecks

Data comes in, data gets processed, data gets shipped. 

Running a data platform requires constantly monitoring bottlenecks, which come in two forms: people and technical.

People bottlenecks happen because people juggle multiple projects, don’t communicate regularly across departments, or don’t see the downstream impact of their work. They context switch across projects and tasks - meetings, emails, or tickets. Teams optimize to improve their own productivity (which is a good thing), but delaying one task might become the bottleneck for the entire Data Platform.

Regular communication to align on prioritization with other teams is the solution and a large part of managing a Data Platform. Hard problems take focus, and major projects have endless hard problems that require prioritization or they won’t get solved. 

Technical bottlenecks are more straightforward because, in theory, you have more influence over the solution: ensure that the machines processing the data have minimal downtime. Straightforward, however, does not mean easy.

A Data Platform can process billions of rows of data a day. Consider a scenario where a bug fix needs to be applied to all the historical data. That means reprocessing every historical record. Machines take a roughly fixed amount of time to complete this work, which makes the bottleneck straightforward to calculate and manage.
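
To make that calculation concrete, here is a minimal back-of-the-envelope sketch in Python. Every number in it (rows per day, days of history, cluster throughput, spare capacity) is an illustrative assumption, not a figure from a real platform.

    # Back-of-the-envelope estimate for a full historical reprocessing job.
    # All figures are illustrative assumptions, not real platform numbers.
    rows_per_day = 2_000_000_000           # new rows ingested each day
    days_of_history = 365                  # how far back the bug fix applies
    rows_to_reprocess = rows_per_day * days_of_history

    throughput_per_hour = 10_000_000_000   # rows the cluster can reprocess per hour
    spare_capacity = 0.5                   # fraction of the cluster free for backfill work

    hours_needed = rows_to_reprocess / (throughput_per_hour * spare_capacity)
    print(f"Estimated backfill: {hours_needed:.0f} hours (~{hours_needed / 24:.1f} days)")

Even a rough estimate like this tells you whether a backfill finishes over a weekend or consumes a week of capacity, and how much it will compete with the day-to-day processing.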

What’s not easy are the new bottlenecks that appear as more data and users come on board. Engineers have a variety of methods to increase platform capacity for higher volume or faster processing. Sometimes the solution is clever engineering; other times it’s unglamorous: a data engineer rolling up their sleeves and getting things done manually.

Data Rework: Differences Between Software and Data Engineering

If software development is about managing libraries (open-source or 3rd-party) and logic, then data engineering is about managing data sources and processing logic. There are some major differences between the two.

Software can be more predictable. In software, you select the library versions, and breaking changes are, hopefully, confined to a new version. In data, you’re relying on the data source staying consistent. Unfortunately, that’s not always the case.

Software compiles faster. In software, debugging can happen quickly because a program compiles in seconds. In data, you may start by debugging against a small dataset, but then query progressively larger and more complex data. Both are equally complex; data debugging just has longer feedback loops because of compilation and query time.
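
As a small illustration of that feedback loop, here is a hedged sketch of the pattern: iterate on a cheap, sampled query until the logic is right, then pay for the full run once. The run_query helper and the table and column names are hypothetical stand-ins, not any specific engine's API.

    # Hypothetical debugging workflow: validate transformation logic on a small,
    # fast sample before paying for the full-table run.
    def run_query(sql: str):
        """Placeholder for a call to whatever query engine the platform uses."""
        raise NotImplementedError

    # Fast feedback loop (seconds): one day of data, capped at 1,000 rows.
    debug_sql = """
        SELECT customer_id, SUM(amount) AS total_spend
        FROM orders
        WHERE order_date = DATE '2019-11-01'
        GROUP BY customer_id
        LIMIT 1000
    """

    # Slow feedback loop (minutes to hours): the same logic over all history.
    full_sql = """
        SELECT customer_id, SUM(amount) AS total_spend
        FROM orders
        GROUP BY customer_id
    """

    # run_query(debug_sql)   # iterate here until the logic is right
    # run_query(full_sql)    # then run the expensive query once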

Software bug fixes are usually a few snippets of code. In software, the complexity of debugging is finding the issue across libraries, data, and servers, but the actual fix is usually fast. In data, billions of rows get backed up when things go down. Clearing the backlog is resource-intensive, manual work that requires pushing technical bottlenecks to their maximum capacity. This compounding rework is why automated monitoring is so important.

Monitoring Data Quality to Minimize Rework

Data monitoring systems are important because new bottlenecks can appear at any time and clearing out traffic jams is expensive in time and effort. 

The operating challenge is that it’s tempting to assume the organization supplying the source data holds high data quality standards. That’s not always the case, and investing in a monitoring system is the best way to be proactive. A system can be built in-house to nail down your use cases and edge cases before deciding to outsource to a vendor.

A sample checklist of items to monitor (a minimal sketch of these checks follows the list):

  • Timing - data arriving on time consistently
  • Schema - data schema & data structure consistency
  • Volume - data size consistency
  • DateTime - date & timezone consistency 
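
Here is a minimal sketch in Python of what those checks might look like against a single incoming batch. The batch-metadata fields and thresholds are assumptions made for illustration, not the API of any specific monitoring tool.

    from datetime import datetime, timedelta, timezone

    # Minimal sketch of the checklist above, applied to an illustrative
    # batch-metadata dict. Field names and thresholds are assumptions,
    # not a specific tool's API.
    EXPECTED_COLUMNS = {"order_id", "customer_id", "amount", "order_ts"}

    def check_batch(batch: dict) -> list:
        """Return human-readable alerts for a single incoming batch."""
        alerts = []
        now = datetime.now(timezone.utc)

        # Timing - data arriving on time consistently
        if now - batch["arrived_at"] > timedelta(hours=2):
            alerts.append("Late arrival: batch is more than 2 hours behind schedule")

        # Schema - data schema & data structure consistency
        missing = EXPECTED_COLUMNS - set(batch["columns"])
        if missing:
            alerts.append(f"Schema drift: missing columns {sorted(missing)}")

        # Volume - data size consistency (compared to a trailing average)
        if batch["row_count"] < 0.5 * batch["avg_row_count_last_7_days"]:
            alerts.append(f"Volume drop: {batch['row_count']} rows vs. recent average")

        # DateTime - date & timezone consistency
        if batch["max_event_ts"] > now + timedelta(minutes=5):
            alerts.append("Timestamps in the future: possible timezone mix-up")

        return alerts

Each alert can then be routed to whoever owns that data source, so issues are caught before users notice them.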

Setting up monitoring is an investment in providing reliable, consistent data to users and peace of mind to the Data Platform team. Over time, more edge cases can be added to the monitoring as they come up.

Summary

Data is everywhere in today’s world. Any digital system that you touch is powered by Data Platforms. Wrangling a consistent, reliable data experience is an exercise in smoothing out complexities. Running a Data Platform is an operational balancing act of being proactive with monitoring at a higher level, while managing people and technical bottlenecks at a daily level.


*Title image credit: https://ralphammer.com/make-me-think. I chose this image because it reflects how users should have simple, intuitive data experiences. Those of us building Data Platforms should manage the complexity in the background.

Thoughts from Toronto Machine Learning Summit 2019

I just attended the 2019 Toronto Machine Learning Summit (TMLS) last week - it was a great experience. The community was welcoming, the content was relevant, and it was well organized.

One thing that I appreciated about the event is how down-to-earth it felt. Many conversations started with the person that I sat beside during a session, which doesn't happen often at conferences. People had a range of experience with Machine Learning, but everyone came from a genuine place to learn more about the topic. It's quite different from other conferences that I've attended where the marketing glamour is cranked up - leading to a mass-produced approach, hit-or-miss content, and participants feeling more distant. 

How Amazon Uses Machine Learning to Drive the Customer Experience

I read all 22 Amazon Shareholder letters. 

I wanted to understand how Amazon used machine learning to drive the customer experience. Here's what I learned.

Amazon is a Platform-as-a-Service (PaaS) company that just happens to be a retailer

Jeff Bezos has been consistent about Amazon's goal since the beginning - deliver an amazing customer experience. That means providing vast selection, fast convenience, and price reductions. Amazon has invested in building massive platforms, whether it's fulfillment or cloud computing centers, to support the pillars of selection and convenience. The last pillar, price reductions, is a result of the efficiencies from scaling. 

Here's their playbook: (a) create a platform for their own business needs, (b) formalize this platform into an ecosystem by opening it up to 3rd parties, (c) refine this platform into an easy-to-use, self-service option with an interface. Lather, rinse, repeat.

Mythbusters: Were Overzealous Algorithms Responsible for Slow Sales at Loblaw Companies Ltd?

Well, this is new.

Retailers have blamed bad weather for poor sales before, but I don't think I've ever seen a retailer blame bad algorithms.

Loblaw Companies Ltd, Canada's largest grocery and pharmacy chain, which owns the mainstream brand Loblaws and discount brand No Frills, had a soft Q2 performance with same-store revenue growing 0.6%. Their President, Sarah Davis, blamed the performance on algorithms that prioritized increasing profit margins instead of promotional pricing to attract foot traffic. That is, Loblaw chose to increase the revenue from each customer instead of focusing on increasing the number of customers. She says:

We know exactly what we did and what we did was we focused on going for margin improvements...And in the excitement of seeing margin improvements in certain categories as we started to implement some of the algorithms, people were overzealous...You end up with fewer items on promotion in your flyer. 

The One Skill That Data Scientists Are Not Being Trained On

After attending the Toronto Machine Learning Micro-Summit this past week, one theme came up repeatedly during the presentations - communicate with the business team early and often, or you'll need to go back and redo your work.

There was the story of an insurance company that created a model to recommend whether to replace or fix a car after a damage claim. It sounded great - the Data Scientists got a prototype up and running and had business team buy-in. But the problem was that their models weren't very accurate. Usually when that happens, it means the data is noisy or the algorithm isn't powerful enough. They went back to their business team, and it turned out they had missed 2 key features: the age of the vehicle and whether it was a luxury model.

Another example was a telecom that built a model to optimize call center efficiency. The data science team spent a month building the model, and everyone was excited to get it into production. Then they were told that the call center runs on an outdated application. It turned out that integrating with the application would cost more than the ROI of the project.

Are Data Scientists Actually Surveillance Scientists? - Part 1

There was of course no way of knowing whether you were being watched at any given moment. How often, or on what system, the Thought Police plugged in on any individual wire was guesswork. It was even conceivable that they watched everybody all the time. But at any rate they could plug in your wire whenever they wanted to. You had to live—did live, from habit that became instinct—in the assumption that every sound you made was overheard, and, except in darkness, every movement scrutinized. -1984, George Orwell

Last summer I had a conversation with an acquaintance who had recently visited China. There was discussion about China's Social Credit System (SCS) and its impact on people's daily lives. The Social Credit System is a system that assigns scores to citizens based on their reputation, and that score can impact someone's ability to be outside in the evening, their eligibility to book a travel ticket or their suitability for a loan. It's similar to a Credit Score that North Americans are more familiar with but more encompassing as the SCS takes non-financial data into account. My acquaintance said that the initial feedback was positive - her friends and family felt safer walking the streets at night knowing that people deemed dangerous wouldn't be allowed outside. 

Lessons from University of Toronto's Data Science Statistics Course (3251)

I recently completed a Statistics for Data Science course at the University of Toronto for Continuing Studies and I wanted to share my reflections about the experience. 

Overall, the course was mostly interesting, some parts boring, and always challenging. 

I should begin by admitting that I managed to avoid any heavy maths or statistics classes in university even though I completed science and business degrees. I certainly felt behind when math notations and equations started popping up in class. But, where there's a will, there's a way (mostly).

Most lectures were challenging to follow because of the pace of learning. I even tried to read ahead for some lectures to be more prepared, but that only marginally helped. As a result, I often sat in class with material that was way over my head, questioning whether spending that weekday night on campus was a better use of my time than self-studying at home. A younger me might have panicked, wondered what I had gotten myself into, or worried that I would be outed as a fake. And I don't think I was the only one, as I saw classmates drop out in the initial few weeks.

How studying data science lets me design better customer solutions

If data is the new oil, then data science is the new refinery.

I was recently asked whether studying Data Science has helped me in my day-to-day job. My response was yes, but not in an obvious way - it's resulted in better designed customer solutions by improving my empathy.

Let me take a step back. For the past few years, I've been leading Software-as-a-Service (SaaS) platform integrations for enterprise clients. I often describe the work as similar to being a clothing tailor: if a software consultancy is a bespoke tailor that customizes every detail at a premium price, then a SaaS platform is a made-to-measure tailor who cuts from an existing pattern at an economical price. Over time, I've learned how to measure and cut software for customers of all shapes, sizes, and sophistication.

Why soft skills will win in the age of machine learning

Back in college, I had a summer job completing research for a clinical health professor. She was a leading expert in diagnosing and treating open human wounds. My job was to survey other experts, get them to examine photos of open wounds, and then recommend a treatment.

A few months ago, I discovered a smartphone app which replaces this work.* You take a photo of an open wound and upload it to the cloud. I suspect that the photo is run through an image recognition model, called a Convolutional Neural Network (CNN), that identifies specific features of the wound for treatment. Current machine learning is very good at completing narrowly defined tasks, such as analyzing a specific type of medical image, because the models have millions of previous examples to train from. It is not good at handling non-standard cases.

How Data Scientists Are Controlling Your Life

My daily experience with recommendation systems is seamless. They recommend what to read on Apple News, listen to on Spotify, eat from Uber Eats, purchase on Amazon, and watch on Netflix. These software programs take millions of data points, clean and segment the data, weigh different variables, and output recommendations that keep us engaged with the platform for the next selection. As much as we want to believe that machines make all these decisions, data scientists are the ones deciding the inputs for these models. Ultimately, these choices introduce bias.

What if I'm missing out on an incredible book or song because the inputs don't capture interests of mine that I didn't even know existed?