August 20, 2007

Unveiling The Mystery of Google

Dan McCue  /  Charleston Regional Business Journal

Four months after announcing its intention to build a data center here, and four months before the facility is officially scheduled to open, Google Inc. continues to fire the imagination of many Lowcountry residents.

Economic developers and members of the Charleston Digital Corridor describe the Internet search-engine giant's local plans as a validation of the region's efforts to create an innovation-based cluster here. Scores of hopefuls have applied for the first few positions the company has advertised; it anticipates hiring approximately 200 workers in all. And Google, along with its subsidiary YouTube, even managed to bring the eight Democrats running for president to Charleston for a debate sanctioned by the Democratic National Committee and broadcast by CNN.

But if you ask most of those with Google on their minds what they expect to transpire behind closed doors at the Mt. Holly Commerce Park once the data center is up and running at the end of the year, the responses turn vague and speculative. And Google, for one, isn't inclined to lift the veil shielding the mystery from view. "Because of the competitive nature of our business, we do not disclose the specifics of what is stored in any of our data centers," said Andrew Johnson, Google's East Coast regional manager for hardware operations. "What I can tell you is that the Mt. Holly facility will be an integral part of Google's network and help us continue to provide great service to our customers," he added.

Inside scoop across Internet
Ironically, for all the company's much-discussed penchant for secrecy, some of the best and most in-depth descriptions of Google's data centers and how they function are readily available on the Internet if one is lucky enough to stumble upon a clue here and there. One of the best descriptions of all was offered in 2003 by Google fellow Jeff Dean at a University of Washington symposium that was recorded for broadcast by the university's television station and saved on www.researchchannel.org.

Dean joined Google in mid-1999, just a year after the company was founded by Larry Page and Sergey Brin. As a member of the company's Systems Infrastructure Group, his day-to-day work revolved around large-scale distributed systems, performance monitoring, microprocessor architecture and information retrieval. In both his presentation and in a paper he wrote around the same period with fellow Googlers Luiz Andre Barroso and Urs Holzle, titled "Web Search for a Planet: The Google Cluster Architecture," Dean made it plain that Google's business model is based almost entirely on scale.

"I"m not much for mission statements, but I think Google has a good one," Dean told an assembly of roughly 200 technology students. "It's to organize the world's information and make it universally accessible and useful."

Dean opined that it's a particularly good mission statement for workers and management too because "almost anything fits under it and it ensures you'll never run out of things to do." But when one considers that Google currently has a searchable index of more than 4 billion Web pages and more than 880 million images, it becomes clear the company's mission is entirely dependent on the efficient and reliable management of scale, and that's where the data centers become critical.

As Googlers Monika Henzinger and Steve Lawrence wrote in a research study, "Extracting knowledge from the World Wide Web," administering that scale requires lots of computers, extensive networking capability between all those computers, and software that's not dependent on a particular computer being online.

"The challenge at Google is how to build reliable systems out of unreliable individual parts," Dean said.

The hocus-pocus of Google is the set of algorithms it uses to process data in a structured, efficient way while also presenting the information to the person sitting at his or her keyboard in new and interesting ways. To do the latter, Dean said, he and others must continue to improve the quality of search results by analyzing large amounts of data.

Google is just like you

The rest of what makes Google a company valued by the market at roughly $160 billion isn't all that different from any medium-to-large company in the Lowcountry: It tries to get as much value as it can out of the hardware it uses.

What that means is that when the data center in Mt. Holly is complete, it won't be filled with rows of the newest, top-of-the-line servers on the market; instead, it will feature scores of moderately priced computers, likely bought from a discount vendor. "The company essentially is looking for the most computational power per dollar it can get," Dean said.

So instead of purchasing a single massive Intel server for three-quarters of a million dollars, it will purchase extensive racks of computers that might cost only a third of that price. Back when Google was evolving from a Stanford University research project into a Mountain View, California-based corporation, data centers were run by hosting providers operating under a rather dumb, low-tech premise: They charged the early search providers by the square foot, when most of the cost associated with these companies was related to power and cooling. "Google would cram as many machines as they could into every square foot," Dean said.
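In rough terms, the arithmetic behind that "computational power per dollar" calculus can be sketched as follows. The throughput figures below are purely invented for illustration; only the price points echo the comparison above.

# Back-of-envelope comparison of computational power per dollar.
# The query-throughput numbers are hypothetical; the prices mirror the
# comparison above (a $750,000 server vs. a rack costing a third as much).
big_iron = {"name": "single high-end server", "cost_usd": 750_000, "queries_per_sec": 10_000}
commodity_rack = {"name": "rack of commodity machines", "cost_usd": 250_000, "queries_per_sec": 15_000}

for option in (big_iron, commodity_rack):
    qps_per_dollar = option["queries_per_sec"] / option["cost_usd"]
    print(f"{option['name']}: {qps_per_dollar:.4f} queries/sec per dollar")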

Once it began building data centers across the country and around the world, another aspect of managing scale came into play: the need to move large racks of computers into place quickly and with minimal risk of damage. As soon as the Google building in Mt. Holly is completed, the company will begin bringing in large racks of computers on wheels. The computers are wired together in the racks and all share a common switch. Once the computers are in place, the racks will be locked down and Google personnel will wire the control switches for the individual racks together to create a cluster.

Redundancy usurps failure

Just as Google's pursuit of value in creating its data centers mirrors the experience of any company seeking to thrive in a competitive environment, so too does its experience with technology. Many machines within Google's data centers fail every day.

The company deals with these failures through replication and redundancy, which is in any case necessary to handle the sheer amount of information being sought and processed. Redundancy also makes reliance on less expensive hardware more practical.

Basically, at the data center level, Google's massive index of Web sites, similar to the index at the back of a book, only much bigger, is broken down into more manageable "shards," which are then replicated across several machines. When someone types in a Google search, it is routed to the nearest data center, where it is sent to several computer clusters at once. That way, if one computer breaks down, the cluster's overall capacity will be diminished by an infinitesimal degree and the search will proceed unimpeded.
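For readers who want a more concrete picture, the scheme Dean describes can be sketched in a few lines of Python. The sketch below is a toy illustration, not Google's actual software; the shard counts, replica counts and function names are invented. It shows only the basic idea: split the index into shards, keep several copies of each shard, and keep answering queries even when one machine is down.

import random

# Toy "index": maps a word to the documents that contain it. The index is
# split into shards by hashing the word, and each shard is stored on
# several replica machines.
NUM_SHARDS = 4
REPLICAS_PER_SHARD = 3

class Replica:
    """One machine holding a copy of one shard of the index."""
    def __init__(self):
        self.data = {}      # word -> list of document ids
        self.alive = True   # machines fail every day; mark dead ones

    def lookup(self, word):
        if not self.alive:
            raise ConnectionError("replica is down")
        return self.data.get(word, [])

# Build the sharded, replicated index.
shards = [[Replica() for _ in range(REPLICAS_PER_SHARD)] for _ in range(NUM_SHARDS)]

def add_document(doc_id, words):
    for word in words:
        shard = shards[hash(word) % NUM_SHARDS]
        for replica in shard:   # write the entry to every replica of the shard
            replica.data.setdefault(word, []).append(doc_id)

def search(word):
    """Route the query to the right shard and try replicas until one answers."""
    shard = shards[hash(word) % NUM_SHARDS]
    for replica in random.sample(shard, len(shard)):   # spread the load across replicas
        try:
            return replica.lookup(word)
        except ConnectionError:
            continue    # one machine is down; try another copy
    raise RuntimeError("every replica of this shard is unavailable")

# Index two "pages," knock one machine offline, and the search still succeeds.
add_document("page-1", ["charleston", "google", "data", "center"])
add_document("page-2", ["google", "search"])
shards[hash("google") % NUM_SHARDS][0].alive = False
print(search("google"))   # -> ['page-1', 'page-2']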

Searches and shards of the massive Google index are also replicated across data centers to reduce the company's vulnerability to a facility failure or regional power failure. "Relative to how often the index is updated, thanks to parallelization, data centers can handle many, many queries per second," Dean said.
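Replication pays a second dividend beyond surviving failures: if the same query is handed to several copies of the data at once, the user gets the fastest answer rather than the average one. A minimal sketch of that pattern, again with invented names and timings rather than anything drawn from Google's systems:

import concurrent.futures
import random
import time

def query_replica(replica_id, word):
    """Stand-in for asking one cluster for results; response time varies by machine."""
    time.sleep(random.uniform(0.01, 0.2))   # simulated network and processing delay
    return f"results for '{word}' from {replica_id}"

def hedged_search(word, replica_ids):
    """Send the same query to every replica and return the first answer to arrive."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(replica_ids)) as pool:
        futures = [pool.submit(query_replica, rid, word) for rid in replica_ids]
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED
        )
        # A production system would cancel the stragglers; this toy pool simply
        # waits for them to finish when it shuts down.
        return next(iter(done)).result()

print(hedged_search("charleston", ["cluster-east-1", "cluster-east-2", "cluster-central-1"]))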

Multiple replications of search tasks also reduce the average time it takes to complete a search.

The real question relating to the Mt. Holly site is, given Google's penchant for expanding the services it offers, including many new services geared specifically to the business community, which of them will run off computers here?

Again, Google's Andrew Johnson declined to be specific. "We do not disclose what services run in any of our operating locations due to competitive advantage," he said. "That said, we are very excited about the growth in our business services sector and the Mt. Holly data center will help us serve many of our products to our expanding customer base."