At Amazon, the push for more user data is never-ending: When you read books on your Kindle, the data about which phrases you highlight, which pages you turn, and whether you read straight through or skip around are all fed back into Amazon’s servers and can be used to indicate what books you might like next. When you log in after a day reading Kindle e-books at the beach, Amazon is able to subtly customize its site to appeal to what you’ve read: If you’ve spent a lot of time with the latest James Patterson, but only glanced at that new diet guide, you might see more commercial thrillers and fewer health books.
Amazon users have gotten so used to personalization that the site now uses a reverse trick to make some additional cash. Publishers pay for placement in physical bookstores, but they can’t buy the opinions of the clerks. But as Lanier predicted, buying off algorithms is easy: Pay enough to Amazon, and your book can be promoted as if by an “objective” recommendation by Amazon’s software. For most customers, it’s impossible to tell which is which.
Amazon proved that relevance could lead to industry dominance. But it would take two Stanford graduate students to apply the principles of machine learning to the whole world of online information.
Click Signals
As Jeff Bezos’s new company was getting off the ground, Larry Page and Sergey Brin, the founders of Google, were busy doing their doctoral research at Stanford. They were aware of Amazon’s success—in 1997, the dot-com bubble was in full swing, and Amazon, on paper at least, was worth billions. Page and Brin were math whizzes; Page, especially, was obsessed with AI. But they were interested in a different problem. Instead of using algorithms to figure out how to sell products more effectively, what if you could use them to sort through sites on the Web?
Page had come up with a novel approach, and with a geeky predilection for puns, he called it PageRank. Most Web search companies at the time sorted pages using keywords and were very poor at figuring out which page for a given word was the most relevant. In a 1997 paper, Brin and Page dryly pointed out that three of the four major search engines couldn’t find themselves. “We want our notion of ‘relevant’ to only include the very best documents,” they wrote, “since there may be tens of thousands of slightly relevant documents.”
Page had realized that packed into the linked structure of the Web was a lot more data than most search engines made use of. The fact that a Web page linked to another page could be considered a “vote” for that page. At Stanford, Page had seen professors count how many times their papers had been cited as a rough index of how important they were. Like academic papers, he realized, the pages that a lot of other pages cite—say, the front page of Yahoo—could be assumed to be more “important,” and the pages that those pages voted for would matter more. The process, Page argued, “utilized the uniquely democratic structure of the web.”
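The voting scheme Page describes can be sketched as a toy Python program. This is only an illustration of the idea, not Google's actual code: the three-page "web," the damping factor, and the iteration count are all invented for the example. Each page splits its score evenly among the pages it links to, and the process repeats until the scores settle.

```python
# Toy PageRank: a page's importance is the sum of the "votes" it receives,
# where each page splits its own score evenly among its outgoing links.
# (Illustrative only; real PageRank handles dangling pages, spam, and more.)

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}  # start with equal scores
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            share = rank[page] / len(outlinks)  # split this page's vote
            for target in outlinks:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

# A tiny invented web: "yahoo" is linked to by both other pages,
# so it accumulates the most votes.
web = {
    "yahoo": ["news"],
    "blog":  ["yahoo"],
    "news":  ["yahoo"],
}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))  # the most-linked-to page wins
```

The page every other page points to ends up with the highest score, which is exactly the "front page of Yahoo" intuition from the text: being cited widely is treated as evidence of importance.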
In those early days, Google lived at google.stanford.edu, and Brin and Page were convinced it should be nonprofit and advertising free. “We expect that advertising funded search engines will be inherently biased towards the advertisers and away from the needs of the consumers,” they wrote. “The better the search engine is, the fewer advertisements will be needed for the consumer to find what they want…. We believe the issue of advertising causes enough mixed incentives that it is crucial to have a competitive search engine that is transparent and in the academic realm.”
But when they released the beta site into the wild, the traffic chart went vertical. Google worked—out of the box, it was the best search site on the Internet. Soon, the temptation to spin it off as a business was too great for the twenty-something cofounders to bear.
In the Google mythology, it is PageRank that drove the company to worldwide dominance. I suspect the company likes it that way—it’s a simple, clear story that hangs the search giant’s success on a single ingenious breakthrough by one of its founders. But from the beginning, PageRank was just a small part of the Google project. What Brin and Page had really figured out was this: The key to relevance, the solution to sorting through the mass of data on the Web was… more data.
It wasn’t just which pages linked to which that Brin and Page were interested in. The position of a link on the page, the size of the link, the age of the page—all of these factors mattered. Over the years, Google has come to call these clues embedded in the data signals.
From the beginning, Page and Brin realized that some of the most important signals would come from the search engine’s users. If someone searches for “Larry Page,” say, and clicks on the second link, that’s another kind of vote: It suggests that the second link is more relevant to that searcher than the first one. They called this a click signal.
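A crude version of that click-signal logic is easy to sketch. The snippet below is a hypothetical illustration, not how Google actually ranks: the URLs and click counts are invented, and real systems correct for the bias that higher-placed results get clicked more simply because they are seen first.

```python
# Hypothetical "click signal" sketch: if searchers consistently skip the
# first result and click the second, the second is probably more relevant,
# so re-order results by observed click-through rate. (Invented data.)

def rerank_by_clicks(results, clicks, impressions):
    """Sort result URLs by click-through rate, highest first."""
    def ctr(url):
        shown = impressions.get(url, 0)
        return clicks.get(url, 0) / shown if shown else 0.0
    return sorted(results, key=ctr, reverse=True)

results = ["larrypage.com", "en.wikipedia.org/wiki/Larry_Page"]
impressions = {"larrypage.com": 1000,
               "en.wikipedia.org/wiki/Larry_Page": 1000}
clicks = {"larrypage.com": 120,
          "en.wikipedia.org/wiki/Larry_Page": 640}

print(rerank_by_clicks(results, clicks, impressions)[0])
# the heavily clicked second result moves to the top
```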
Where data was concerned, Google was voracious. Brin and Page were determined to keep everything: every Web page the search engine had ever landed on, every click every user ever made. Soon its servers contained a nearly real-time copy of most of the Web. By sifting through this data, they were certain they’d find more clues, more signals, that could be used to tweak results. The search-quality division at the company acquired a black-ops kind of feel: few visitors and absolute secrecy were the rule.
“The ultimate search engine,” Page was fond of saying, “would understand exactly what you mean and give back exactly what you want.” Google didn’t want to return thousands of pages of links—it wanted to return one, the one you wanted. But the perfect answer for one person isn’t perfect for another. When I search for “panthers,” what I probably mean are the large wild cats, whereas a football fan searching for the phrase probably means the Carolina team. To provide perfect relevance, you’d need to know what each of us was interested in. You’d need to know that I’m pretty clueless about football; you’d need to know who I was.
The challenge was getting enough data to figure out what’s personally relevant to each user. Understanding what someone means is tricky business—and to do it well, you have to get to know a person’s behavior over a sustained period of time.
But how? In 2004, Google came up with an innovative strategy. It started providing other services, services that required users to log in. Gmail, its hugely popular e-mail service, was one of the first to roll out. The press focused on the ads that ran along Gmail’s sidebar, but it’s unlikely that those ads were the sole motive for launching the service. By getting people to log in, Google got its hands on an enormous pile of data—the hundreds of millions of e-mails Gmail users send and receive each day. And it could cross-reference each user’s e-mail and behavior on the site with the links he or she clicked in the Google search engine. Google Apps—a suite of online word-processing and spreadsheet-creation tools—served double duty: It undercut Microsoft, Google’s sworn enemy, and it provided yet another hook for people to stay logged in and continue sending click signals. All this data allowed Google to accelerate the process of building a theory of identity for each user—what topics each user was interested in, what links each person clicked.
By November 2008, Google had several patents for personalization algorithms—code that could figure out the groups to which an individual belongs and tailor his or her results to suit that group’s preferences. The categories Google had in mind were pretty narrow: to illustrate, the patent offered the example of “all persons interested in collecting ancient shark teeth” and “all persons not interested in collecting ancient shark teeth.” People in the former category who searched for, say, “Great White incisors” would get different results from the latter.
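The patent's two-bucket scheme can be sketched in a few lines. Everything here is invented for illustration (the group names, the keyword test, and the canned result lists); the point is only the shape of the idea: classify the user from past behavior, then serve the list tuned for that class.

```python
# Hedged sketch of group-based personalization: sort a user into an
# interest group from their search history, then serve that group's
# results. All names and data below are hypothetical.

GROUP_RESULTS = {
    "fossil-collectors": ["Buying ancient shark teeth",
                          "Megalodon tooth identification guide"],
    "everyone-else":     ["Great white shark facts",
                          "Shark anatomy overview"],
}

def assign_group(search_history):
    """Crude classifier: two or more fossil-flavored queries = collector."""
    fossil_terms = {"shark teeth", "fossil", "megalodon"}
    hits = sum(any(t in q for t in fossil_terms) for q in search_history)
    return "fossil-collectors" if hits >= 2 else "everyone-else"

def personalized_results(query, search_history):
    # The same query returns different lists depending on the group.
    return GROUP_RESULTS[assign_group(search_history)]

history = ["megalodon tooth size", "where to buy fossil shark teeth"]
print(personalized_results("Great White incisors", history)[0])
```

Two users typing the identical query see different pages, which is precisely the behavior the patent describes.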
Today, Google monitors every signal about us it can get its hands on. The power of this data is hard to overstate: If Google sees that I log on first from New York, then from San Francisco, then from New York again, it knows I’m a bicoastal traveler and can adjust its results accordingly. By looking at what browser I use, it can make some guesses about my age and even perhaps my politics.
How much time you take between the moment you enter your query and the moment you click on a result sheds light on your personality. And of course, the terms you search for reveal a tremendous amount about your interests.
Even if you’re not logged in, Google is personalizing your search. The neighborhood—even the block—that you’re logging in from is available to Google, and it says a lot about who you are and what you’re interested in. A query for “Sox” coming from Wall Street is probably shorthand for the financial legislation “Sarbanes-Oxley,” while the same query from most other neighborhoods is more likely about baseball.