After years of closely guarding the formula behind its search algorithms, Google is opening up a little. The search engine company has kept its ranking formula secret for two reasons, competition and the prevention of abuse, said Udi Manber, Google’s vice president of engineering, search quality, in a post on the corporate blog. Manber said the blog post is the first part of a renewed effort at the company “to open up a bit more than we have in the past.”
Manber said the most famous part of Google’s ranking algorithm is PageRank, an algorithm developed by Google cofounders Larry Page and Sergey Brin. While PageRank is still in use, it is “a part of a much larger system,” he said. “Other parts include language models (the ability to handle phrases, synonyms, diacritics, spelling mistakes, and so on), query models (it’s not just the language, it’s how people use it today), time models (some queries are best answered with a 30-minutes old page, and some are better answered with a page that stood the test of time), and personalized models (not all people want the same thing),” he said.
Excerpt from the Google blog:
“The heart of the group is the team that works on core ranking. Ranking is hard, much harder than most people realize. One reason for this is that languages are inherently ambiguous, and documents do not follow any set of rules. There are really no standards for how to convey information, so we need to be able to understand all web pages, written by anyone, for any reason. And that’s just half of the problem. We also need to understand the queries people pose, which are on average fewer than three words, and map them to our understanding of all documents. Not to mention that different people have different needs. And we have to do all of that in a few milliseconds.
The most famous part of our ranking algorithm is PageRank, an algorithm developed by Larry Page and Sergey Brin, who founded Google. PageRank is still in use today, but it is now a part of a much larger system. Other parts include language models (the ability to handle phrases, synonyms, diacritics, spelling mistakes, and so on), query models (it’s not just the language, it’s how people use it today), time models (some queries are best answered with a 30-minutes old page, and some are better answered with a page that stood the test of time), and personalized models (not all people want the same thing).
Another team in our group is responsible for evaluating how well we’re doing. This is done in many different ways, but the goal is always the same: improve the user experience. This is not the main goal, it is the only goal. There are automated evaluations every minute (to make sure nothing goes wrong), periodic evaluations of our overall quality, and, most importantly, evaluations of specific algorithmic improvements. When an engineer gets a new idea and develops a new algorithm, we test their ideas thoroughly. We have a team of statisticians who look at all the data and determine the value of the new idea. We meet weekly (sometimes twice a week) to go over those new ideas and approve new launches. In 2007, we launched more than 450 new improvements, about 9 per week on the average. Some of these improvements are simple and obvious — for example, we fixed the way Hebrew acronym queries are handled (in Hebrew an acronym is denoted by a (”) next to the last character, so IBM will be IB”M), and some are very complicated — for example, we made significant changes to the PageRank algorithm in January. Most of the time we look for improvements in relevancy, but we also work on projects where the sole purpose is to simplify the algorithms. Simple is good.”
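For readers curious about what PageRank computes, here is a minimal sketch of the textbook version of the algorithm, written as a simple power iteration. It is not Google's production code, which the excerpt describes as one part of a much larger system. The function name `pagerank`, the damping value of 0.85, the iteration count and the four-page `toy_web` graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration (illustrative only).

    `links` maps each page to the list of pages it links to.
    Returns a dict mapping each page to its score; scores sum to 1.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with rank spread evenly over all pages.
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page first receives the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked to by every other page.
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, page "c" ends up with the highest score because every other page links to it, while "d", which no page links to, keeps only the baseline random-jump share.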