Are Data Scientists Actually Surveillance Scientists? - Part 1

There was of course no way of knowing whether you were being watched at any given moment. How often, or on what system, the Thought Police plugged in on any individual wire was guesswork. It was even conceivable that they watched everybody all the time. But at any rate they could plug in your wire whenever they wanted to. You had to live—did live, from habit that became instinct—in the assumption that every sound you made was overheard, and, except in darkness, every movement scrutinized. - George Orwell, 1984

Last summer I had a conversation with an acquaintance who had recently visited China. We discussed China's Social Credit System (SCS) and its impact on people's daily lives. The SCS assigns each citizen a score based on their reputation, and that score can affect whether they're allowed out in the evening, whether they can book a travel ticket, and whether they qualify for a loan. It's similar to the credit scores North Americans are familiar with, but more encompassing, since the SCS takes non-financial data into account. My acquaintance said the initial feedback was positive - her friends and family felt safer walking the streets at night knowing that people deemed dangerous wouldn't be allowed outside.

At roughly the same time I had this conversation, I was reading a great technical book on how to build big data systems: Designing Data-Intensive Applications. The last chapter, surprisingly, was not technical - rather, it was a commentary on data and society:  

As a thought experiment, try replacing the word data with surveillance, and observe if common phrases still sound so good. How about this: “In our surveillance-driven organization we collect real-time surveillance streams and store them in our surveillance warehouse. Our surveillance scientists use advanced analytics and surveillance processing in order to derive new insights.” [1]

Now, I work in technology and advertising, so it's not lost on me that the industry is built around collecting data on users to provide a more customized experience. Big data and machine learning systems are tools, and their application is really in the hands of whoever wields them. I think we need to ask ourselves at what point data collection turns into surveillance, and what the implications are.

Taking a step back might help in answering these questions. One of the big breakthroughs fueling the current artificial intelligence gold rush is something called deep learning. In 2012, Canadian researchers found a way to dramatically cut the error rate at which computers classify images. We're now seeing applications of deep learning pop up everywhere - from self-driving cars, to identifying diseases in medical images, to reading legal documents faster than any human could.

Another use, however, is authenticating people's faces through video cameras. 

Depending on your perspective, it's a way to conveniently unlock your phone, or a way to conveniently monitor people. I think many people believe they have nothing to hide, so it doesn't register as a concern. However, machine learning systems are still built by humans, which means errors still happen.

Let's walk through an example: you buy a subway ticket to a certain neighbourhood, but you have to submit a face scan. The system decides your Social Credit Score is too low to be trustworthy, and you're denied. You have no visibility into why your score is low, so you grasp for possible reasons. What about that bill you paid late a few years ago? Could that be dragging the score down? Even worse, what if someone else's mistake has been attributed to you? Every machine learning system makes a tradeoff between prioritizing false positives or false negatives, and whoever tunes that tradeoff decides how many trustworthy people get wrongly turned away.
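To make that tradeoff concrete, here's a minimal sketch in Python. The labels, scores, and thresholds are all hypothetical - the point is only that moving a single decision threshold converts one kind of error into the other:

```python
import numpy as np

# Hypothetical ground truth: 1 = actually trustworthy, 0 = actually not.
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])

# Scores a hypothetical "trust" model might output (higher = more trustworthy).
scores = np.array([0.90, 0.80, 0.75, 0.60, 0.55, 0.40,
                   0.65, 0.30, 0.20, 0.10])

def error_counts(threshold):
    admitted = scores >= threshold
    # False positive: an untrustworthy person waved through.
    false_positives = int(np.sum(admitted & (y_true == 0)))
    # False negative: a trustworthy person denied.
    false_negatives = int(np.sum(~admitted & (y_true == 1)))
    return false_positives, false_negatives

for t in (0.3, 0.5, 0.7):
    fp, fn = error_counts(t)
    print(f"threshold {t}: {fp} wrongly admitted, {fn} wrongly denied")
```

Raising the threshold denies more genuinely trustworthy people; lowering it admits more people the system was meant to flag. Neither setting is neutral, and the person being denied at the turnstile never sees which choice was made.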

Another challenge is that these systems "merely extrapolate from the past; if the past is discriminatory, they codify that discrimination." [2] Are the data scientists building these systems checking their datasets for that bias, let alone removing it? I worry about these downstream effects and their impact on our daily lives.
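One of the simplest checks a data scientist could run is to compare outcome rates across groups in the historical data before training on it. The sketch below uses made-up loan decisions and a basic demographic-parity comparison; real fairness audits go much further than this:

```python
import numpy as np

# Hypothetical historical loan decisions a model would be trained on.
# group: a protected attribute (0 or 1); approved: the past human decision.
group    = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
approved = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 0])

# A model fit to these labels learns to imitate them - including whatever
# bias produced them. A first sanity check: approval rates per group.
for g in (0, 1):
    rate = approved[group == g].mean()
    print(f"group {g}: historical approval rate = {rate:.0%}")

# Here group 0 was approved 80% of the time and group 1 only 20%. If that
# gap isn't explained by legitimate factors, a model trained on this data
# will codify the discrimination.
```

Removing the bias is far harder than detecting it, but even a check like this makes the question visible instead of leaving it buried in the training set.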

One area that's promising is the increased focus organizations are putting on privacy and on how they collect and use your data. We're also seeing privacy legislated in favor of the user in the EU and the US. Where there's still a gap, though, is in the governance of the data systems themselves. In the future, privacy policies could evolve so that an organization's values are applied to its data systems too. I'll discuss what a framework for privacy-focused data systems could look like in a follow-up Part 2.


[1] Designing Data-Intensive Applications, Martin Kleppmann

[2] Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, Cathy O'Neil