Are Data Scientists Actually Surveillance Scientists? - Part 1

There was of course no way of knowing whether you were being watched at any given moment. How often, or on what system, the Thought Police plugged in on any individual wire was guesswork. It was even conceivable that they watched everybody all the time. But at any rate they could plug in your wire whenever they wanted to. You had to live—did live, from habit that became instinct—in the assumption that every sound you made was overheard, and, except in darkness, every movement scrutinized. -1984, George Orwell

Last summer I had a conversation with an acquaintance who had recently visited China. We discussed China's Social Credit System (SCS) and its impact on people's daily lives. The SCS assigns scores to citizens based on their reputation, and that score can affect whether someone is allowed outside in the evening, their eligibility to book a travel ticket, or their suitability for a loan. It's similar to the credit score North Americans are familiar with, but more encompassing, as the SCS takes non-financial data into account. My acquaintance said the initial feedback was positive - her friends and family felt safer walking the streets at night knowing that people deemed dangerous wouldn't be allowed outside. 

At roughly the same time I had this conversation, I was reading a great technical book on how to build big data systems: Designing Data-Intensive Applications. The last chapter, surprisingly, was not technical - rather, it was a commentary on data and society:  

As a thought experiment, try replacing the word data with surveillance, and observe if common phrases still sound so good. How about this: “In our surveillance-driven organization we collect real-time surveillance streams and store them in our surveillance warehouse. Our surveillance scientists use advanced analytics and surveillance processing in order to derive new insights.” [1]

Now, I work in technology and advertising, so it's not lost on me that the industry is built around collecting data on users to provide a more customized experience. Big data and machine learning systems are tools, and their application rests in the hands of whoever wields them. I think we need to ask ourselves: at what point does data collection turn into surveillance, and what are the implications?

Taking a step back might help answer these questions. One of the big breakthroughs fueling the current artificial intelligence gold rush is deep learning. In 2012, Canadian researchers discovered a way to significantly reduce computers' error rates in classifying images. Applications of deep learning keep popping up - from self-driving cars, to identifying diseases from medical images, to reading legal documents faster than humans ever could.

Another use, however, is authenticating people's faces through video cameras. 

Depending on your perspective, it's either a convenient way to unlock your phone or a convenient way to monitor humans. I think many people believe they have nothing to hide, so it doesn't register as a concern. However, machine learning systems are still built by humans, which means errors still happen.

Let's walk through an example: you buy a subway ticket to a certain neighbourhood, but you have to submit a face scan. The system decides your Social Credit Score is too low to be trustworthy, and you're denied. You have no transparency into why your score is low, so you try to think of any possible reasons. What about that bill you paid late a few years ago? Could that be affecting you? Even worse, what if someone else's mistake is attributed to you? In machine learning systems, there's always a tradeoff between prioritizing false positives and false negatives. 
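That tradeoff can be made concrete with a toy example. The sketch below uses made-up scores and labels (nothing from any real system) to show how moving a single decision threshold trades one kind of error for the other:

```python
# Illustrative sketch: how a decision threshold trades false positives
# against false negatives. All scores and labels here are invented.

def confusion_counts(scores, labels, threshold):
    """Count false positives and false negatives at a given threshold.
    label 1 = genuinely trustworthy; prediction 1 = system approves."""
    fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 0:
            fp += 1  # untrustworthy person wrongly approved
        elif predicted == 0 and label == 1:
            fn += 1  # trustworthy person wrongly denied
    return fp, fn

scores = [0.2, 0.4, 0.55, 0.6, 0.8, 0.9]
labels = [0,   0,   1,    0,   1,   1]

# A strict threshold denies more trustworthy people (false negatives);
# a lenient one approves more untrustworthy people (false positives).
print(confusion_counts(scores, labels, 0.9))  # strict: (0, 2)
print(confusion_counts(scores, labels, 0.3))  # lenient: (2, 0)
```

Neither setting is "correct"; someone chooses which error the system tolerates, and the person denied at the turnstile never sees that choice.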

Another challenge is that these systems "merely extrapolate from the past; if the past is discriminatory, they codify that discrimination." [2] Are the data scientists building these systems removing the bias in their datasets? I worry about these downstream effects and their impact on our daily lives. 

One promising area is an increased focus by organizations on privacy and on how they collect and use your data. We're seeing countries in the EU and US legislate privacy in favor of the user. Where there's a gap, though, is in the governance of the data systems themselves. In the future, privacy policies could evolve so that an organization's values are applied to its data systems too. I'll discuss what a framework for privacy-focused data systems could look like in a follow-up Part 2.


[1] Designing Data-Intensive Applications, Martin Kleppmann

[2] Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, Cathy O'Neil

How studying data science lets me design better customer solutions

If data is the new oil, then data science is the new refinery.

I was recently asked whether studying Data Science has helped me in my day-to-day job. My response was yes, but not in an obvious way - it has resulted in better-designed customer solutions by improving my empathy.

Let me take a step back. For the past few years, I've been leading Software-as-a-Service (SaaS) platform integrations for enterprise clients. I often describe the work as similar to being a clothing tailor. If a software consultancy is a bespoke tailor that customizes every detail at a premium price, then a SaaS platform is a made-to-measure tailor who cuts from an existing pattern at an economical price. Over time, I've learned how to measure and cut software for customers of all shapes, sizes, and sophistication.

However, where the analogy ends is that cloth is something we can touch and see, so we naturally understand its limitations; software architecture, on the other hand, exists in our minds, and most of us aren't able to judge its quality. With fabric, we don't question why it can't be made from a liquid, but I often find myself explaining to customers why our platform can't do what they want because of how data is stored. 

So how does studying data science fit into all of this?

A large part of pragmatic data science involves Extracting, Transforming, and Loading (ETL) data: Extract data from a source (e.g., a database, API, or CSV file); Transform the data by cleaning it up, such as removing outliers and incomplete records; and Load the new data into your machine learning training pipeline. Lather, rinse, and repeat for every project. 
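Sketched in code, a toy ETL pipeline might look like the following. The CSV content, field names, and cleaning rules are all invented for illustration:

```python
import csv
import io

# A minimal, self-contained ETL sketch. The data and rules are made up.
RAW_CSV = """age,income
34,72000
29,
41,65000
120,50000
"""

def extract(source):
    """Extract: read rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(source)))

def transform(rows):
    """Transform: drop incomplete records and implausible outliers."""
    cleaned = []
    for row in rows:
        if not row["income"]:      # incomplete record
            continue
        if int(row["age"]) > 110:  # implausible outlier
            continue
        cleaned.append({"age": int(row["age"]), "income": int(row["income"])})
    return cleaned

def load(rows, store):
    """Load: append cleaned rows to a destination (here, just a list)."""
    store.extend(rows)
    return store

warehouse = []
load(transform(extract(RAW_CSV)), warehouse)
print(warehouse)  # only the complete, plausible rows survive
```

Real pipelines swap the list for a database or training job, but the shape of the work - and the judgment calls in the transform step - stay the same.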

Let me give some examples. If I have a task that requires bulk automation, a developer will likely prefer a well-formatted CSV from which they can easily extract the information. If I have a task that requires a computer to read thousands of records, a developer will likely prefer an input with standardized punctuation and identifiers (e.g., JSON). If I have a task that requires loading data into a new table or database, a developer will likely prefer working with someone who weighs any risks to the existing data. 

All the hands-on ETL practice I've done over the past year has honed my compass for working with data - whether it's a better grasp of what's a reasonable request of a developer, or being more articulate with a customer about what's possible with data. It leads to improved communication, credibility, faster decision-making, and ultimately, a timely, well-designed solution. 

Why soft skills will win in the age of machine learning

Back in college, I had a summer job doing research for a clinical health professor. She was a leading expert in diagnosing and treating open human wounds. My job was to survey other experts: have them examine photos of open wounds and then recommend a treatment.

A few months ago, I discovered a smartphone app that replaces this work.* You take a photo of an open wound and upload it to the cloud. I suspect the photo is run through an image recognition model, called a Convolutional Neural Network (CNN), that identifies specific features of the wound for treatment. Current machine learning is very good at completing narrowly defined tasks, such as analyzing a specific type of medical image, because the models have millions of previous examples to train from. It is not good at handling non-standard cases. 

Jobs that require experts will increasingly be impacted as cloud storage and machine learning services come down in cost and become more accessible.

For example, we have seen generalists such as nurse practitioners, paralegals, and dental hygienists offer services that used to be available only through doctors, lawyers, and dentists. Machine learning allows these generalists to offer even more of these services. As a result, specialists will be left handling the non-standard cases. 

Making specialist services more affordable means that those who were under-served before now have access. The total market size grows and consumers benefit from this outcome. 

Reflecting on my daily work, I can see the hard skills such as reporting and analysis being automated. However, the softer skills of project management, architecture design, and relationship building are tasks that I just can't see being automated anytime soon. Context for how to operate in a certain industry won't be easily captured either. These soft skills are the ones that will win in the age of machine learning.

*Company is called:

How Data Scientists Are Controlling Your Life

My daily experience with recommendation systems is seamless. They recommend what to read on Apple News, listen to on Spotify, order on Uber Eats, purchase on Amazon, and watch on Netflix. These software programs take millions of data points, clean and segment the data, weigh different variables, and output recommendations that keep us engaged with the platform for the next selection. As much as we want to believe that machines make all these decisions, data scientists are the ones deciding the inputs for these models. Ultimately, those choices introduce bias. 
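A toy sketch shows how much those human-chosen inputs matter. The items, features, and weights below are all invented, but the point holds: change the weights a data scientist picked, and the "objective" recommendation flips.

```python
# Illustrative sketch of how hand-chosen feature weights bias a
# recommender's output. Items, features, and weights are all made up.

items = {
    "bestseller_thriller": {"popularity": 0.9, "similarity_to_history": 0.8, "novelty": 0.1},
    "indie_poetry":        {"popularity": 0.2, "similarity_to_history": 0.1, "novelty": 0.9},
}

def rank(items, weights):
    """Score each item as a weighted sum of its features, highest first."""
    scores = {
        name: sum(weights[feature] * value for feature, value in feats.items())
        for name, feats in items.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Weighting novelty at zero guarantees the familiar choice always wins...
engagement_weights = {"popularity": 0.5, "similarity_to_history": 0.5, "novelty": 0.0}
print(rank(items, engagement_weights))  # thriller first

# ...while rewarding novelty surfaces the book you didn't know you wanted.
discovery_weights = {"popularity": 0.1, "similarity_to_history": 0.1, "novelty": 0.8}
print(rank(items, discovery_weights))  # poetry first
```

The model never chose anything; a person chose the weights, and the ranking followed.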

What if I'm missing out on an incredible book or song because the inputs don't capture interests of mine that I didn't even know existed?

Six years ago, I switched my book purchases to Google Play Books. I loved the convenience of having my book highlights stored online and available for quick future reference. Google's recommendation system has an endless list of books for me to discover, and for many years, I happily obliged and purchased its recommendations. What I've noticed recently, however, is that the recommendations just aren't interesting anymore. I had a narrowly defined set of book topics I was looking for, and I've read them all. 

Lately, I've been visiting independent bookstores and discovering many new, interesting books I had never heard of - books on topics I found interesting but didn't know I wanted. One bookstore has a quote that's stuck with me:

On the internet you can find what you're looking for; in our store you can find what you are not looking for. -Ben McNally Books

When you walk into a store, you're not browsing the store's products - you're browsing the owner's taste. A store owner has carefully curated their selection based on deep expertise, and can filter top selections from many categories. I may discover an interesting book not because I'm interested in its topic, but because an expert recognizes quality regardless of topic.

As our lives become increasingly focused around digital platforms and their recommendation systems, we all start consuming similar lifestyles. We may not be discovering parts of ourselves because we didn't know they existed. So get out into the real world, and don't let data scientists control your life.