But what even is AI? This should be a simple question, but honest answers are surprisingly hard to find. Mystification and misinformation abound, amplified by a media that’s typically far too deferential to industry hype. We’re told that AI is about to revolutionize everything—among other things, by throwing millions of people out of work by automating away their jobs.
This didn’t sound quite right to us, so we sat down with a veteran data scientist to learn more. The data scientist helped us sort the fact from the fiction, and obtain a clearer view of Silicon Valley’s next next big thing. When you strip away all the nonsense, what’s actually going on?
All right, let’s get started with the basics. What is a data scientist? Do you self-identify as one?
I would say the people who are the most confident about self-identifying as data scientists are almost universally frauds. They are not people you would voluntarily spend a lot of time with.
There are a lot of people in this category who have only been exposed to a little bit of real stuff—they’re sort of peripheral. You actually see a lot of this with these strong AI companies: companies that claim to be able to build human intelligence using some inventive “Neural Pathway Connector Machine System,” or something.1 You can look at the profiles of every single one of these companies. There are always people who have strong technical credentials, and they are in a field that is just slightly adjacent to AI, like physics or electrical engineering.
And that’s close, but the issue is that no person with a Ph.D. in AI starts one of these companies, because if you get a Ph.D. in AI, you’ve spent years building a bunch of really shitty models, or you see robots fall over again and again and again. You become so acutely aware of the limitations of what you’re doing that the interest just gets beaten out of you. You would never go and say, “Oh, yeah, I know the secret to building human-level AI.”
In a way it’s sort of like my dad, who has a Ph.D. in biology and is a researcher back east, and I told him a little bit about the Theranos story.2 I told him their shtick: “Okay, you remove this small amount of blood, and run these tests…” He asked me what the credentials were of the person starting it, and I was like, “She dropped out of Stanford undergrad.” And he was like, “Yeah, I was wondering, since the science is just not there.” Only somebody who never actually killed hundreds of mice and looked at their blood—like my dad did—would ever be crazy enough to think that was a viable idea.
So I think a lot of the strong AI stuff is like that. A lot of data science is like that, too. Another way of looking at data science is that it’s a bunch of people who got Ph.D.s in the wrong thing, and realized they wanted to have a job. Another way of looking at it—I think the most positive way, which is maybe a bit contrarian—is that it’s really, really good marketing.
As someone who tries not to sell fraudulent solutions to people, it actually has made my life significantly better because you can say “big data machine learning,” and people will be like, “Oh, I’ve heard of that, I want that.” It makes it way easier to sell them something than having to explain this complex series of mathematical operations. The hype around it—and that there’s so much hype—has made the actual sales process so much easier. The fact that there is a thing with a label is really good for me professionally.
But that doesn’t mean there’s not a lot of ridiculous hype around the discipline.
I’m curious about the origins of the term “data science”: do you think that it came internally from people marketing themselves, or that it was a random job title used to describe someone, or what?
As far as I know, the term “data science” was invented by Jeff Hammerbacher at Facebook.
The Cloudera guy?3
Yeah, the Cloudera guy. As I understand it, “data science” originally came from the gathering of data on his team at Facebook.
If there was no hype and no money to make, essentially, what I would say data science is, is the fact that the data sets have gotten large enough where you can start to consider variable interactions in a way that’s becoming increasingly predictive. And there are a number of problems where the actual individual variables themselves don’t have a lot of meaning, or they are kind of ambiguous, or they are only very weak signals. There’s information in the correlation structure of the variables that can be revealed, but only through really huge amounts of data.
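To make the point about correlation structure concrete, here is a minimal sketch in Python. It is an illustration added for this text, not something the interviewee wrote: the data is synthetic, and the names `interaction_counts` and `correlation` are hypothetical helpers. It builds two variables that each carry no signal on their own, while their product predicts the target perfectly, and it counts how many candidate pairs and triples exist for a given number of variables (these counts grow roughly as n squared and n cubed).

```python
import math
import random


def interaction_counts(n):
    """Return the number of distinct variable pairs and triples among n variables."""
    return math.comb(n, 2), math.comb(n, 3)


def correlation(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic data (an assumption for illustration): two independent coin-flip
# variables taking values -1 or +1, and a target that depends only on their product.
rng = random.Random(0)
a = [rng.choice([-1, 1]) for _ in range(10_000)]
b = [rng.choice([-1, 1]) for _ in range(10_000)]
target = [x * y for x, y in zip(a, b)]
product = [x * y for x, y in zip(a, b)]

# Each variable alone is close to uncorrelated with the target...
corr_a = correlation(a, target)
corr_b = correlation(b, target)
# ...but the interaction term recovers it completely.
corr_product = correlation(product, target)

pairs, triples = interaction_counts(100)
print(f"a vs target: {corr_a:.3f}, b vs target: {corr_b:.3f}")
print(f"a*b vs target: {corr_product:.3f}")
print(f"100 variables: {pairs} pairs, {triples} triples")
```

With only 100 variables there are already 4,950 pairs and 161,700 triples to check, which is why separating real interactions from chance ones takes so much data.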
So essentially, there are n variables, right? So there’s n-squared potential correlations, and n-cubed potential cubic interactions or whatever. Right? There’s a ton of interactions. The only way you can solve that is by having massive amounts of data.
So the data scientist role emphasizes the data part first. It’s like, we have so much data, and so this new role arises using previous disciplines or skills applied to a new context?
You can start to see new things emerge that would not emerge from more standard ways of looking at problems. That’s probably the most charitable way of putting it without any hype. But I should also say that the hype is just ferocious.
And even up until recently, there’s just massive bugs in the machine-learning libraries that come bundled with Spark.4 It’s so bizarre, because you