They wanted an e-book service that could compete with the Amazon Kindle. And part of their plan for doing that was to say, “Hey, we can offer higher-quality versions of all these old public domain books than what Amazon has.”
At the time, Amazon was claiming to have four hundred thousand e-books. Three hundred thousand of those were public domain books that had been scanned by somebody and gone through one pass of optical character recognition. There would be typos or misspellings or wrong words everywhere. Not really joyful to read.
The idea was that Google could have a similar catalog of these older public domain works but that they would actually be readable. Honestly, I think it was largely a marketing thing. To be able to say, “We have five hundred thousand e-books and nobody else has five hundred thousand e-books.”
So reCAPTCHA would be the weapon that Google could use to beat Amazon at e-books, by offering a way to clean up text produced by not-so-great optical recognition software.
Yeah. Google had a team, and still does, that worked on optical recognition software. The project was open-source and called Tesseract. Tesseract was closely tied to the Google Books team. We met with them during our first few weeks, and sat next to them.
Tesseract was okay, but it wasn’t as good as the commercial software we had been using for reCAPTCHA, which was called ABBYY. So Google wanted reCAPTCHA to improve the text quality.
What was it like to be acquired by Google?
It was exciting. We had a lot of code that was specific to The New York Times. They had a particular format of articles and sections and so on. Now we were doing books, which is a very different sort of thing.
Also, the scale was completely new. At Google, we were working with millions of books. And we had way more access to computing power, obviously.
I’d imagine that Google had more computing resources available than your startup.
Yes, and reCAPTCHA was an old-school startup, before there was any of this cloud stuff. The front ends that served the CAPTCHAs were hosted on four servers: two on the East Coast and two on the West Coast. Eventually we added three servers in Europe for latency reasons—because of a big client that wanted low latency for their European users.
So we went from having a handful of servers that we had to manage ourselves to having as many resources as we wanted. In the first year or two at Google, we easily scaled up our traffic by six to eight times what it had been before.
When we got acquired, we were serving maybe four or five thousand CAPTCHAs per second. Which is not bad. Facebook used reCAPTCHA. So did Ticketmaster, Twitter, and a bunch of sites that were a big deal ten years ago and that nobody remembers anymore.
But within a year at Google, we were easily double or triple that. Not due to us doing any special marketing or anything. It was just organic growth from the sites that were already using us, plus others saying, “Okay, they’re part of Google now, so they’re not going to just disappear.” Which I guess is different than what people say when startups get acquired by Google these days!
These days, when they buy a startup, they usually just burn it down so it doesn’t become a competitor.
Yeah.
You mentioned that Google Books had already scanned millions of books. Where did those books come from originally?
By the time we arrived, Google Books had been going on for years. It was first announced back in 2004. All of the book scanning was done in collaboration with libraries. Harvard and the University of Michigan were the two largest ones in the U.S.
The way it worked was that books that weren’t checked out would get trucked off from the library to a Google scan center. There, they had people turning the pages and taking photos with cameras from above to scan them, to ensure it was a nondestructive scanning process.
I did get to see a scan center at one point. It’s one of the first situations that I became aware of where Google was using TVCs.3 Google wasn’t directly employing the people scanning the books—there was some third party that was responsible for the scan center operations at any given place.
Libraries liked the project, because their whole point was the preservation of written material. So preserving that material digitally for future generations seemed good. And the libraries got the scans of the books to do whatever they wanted to with them.
What was the original impetus for the project?
My best guess is that Larry Page just thought it would be cool. He probably decided it was worth doing because compared to the scale of Google, the amount of resources required was not huge. And it made sense given Google’s culture and mission back in 2004. Google was a search engine. The reasoning that I always heard was, “So you can search the web, but there’s a whole bunch of human knowledge that’s stuck in dead-tree form. Why can’t you search all of that as well?”
The Culture Is Changing
I’m curious about your evolution in how you saw the company. When you first joined, you were excited. Suddenly you had a huge amount of computing power at your disposal, and you were part of this big, ambitious project to digitize the world’s books. How did that feeling evolve over time?
Many different things changed at Google, both culturally and engineering-wise, over the nine years I was there.
The Google of nine years ago felt much closer to Larry and Sergey’s original vision. It was honest techno-utopianism. Google Books was a great example of that. “We’re just going to try and scan all the world’s books because we did some numbers on the back of an envelope and it seemed like we could.” And they did actually scan 20