Supporting Articles For Busby SEO Test
The most dramatic change in search engine design in the past several years has been the development of search engines that account for the Web’s hyperlink structure. Latent semantic indexing (LSI), with its singular value decomposition (SVD) of a term-by-document matrix, is an approach that works well for smaller document collections but has problems with scalability. Computing and storing an SVD-based LSI model for the entire Web is not tractable.
In general, the idea of this relatively new approach is that there are certain pages on the Web that are recognized as the “go to” places for certain information, and there is another set of pages that legitimize those esteemed positions by pointing to them with links. For example, let us say there is a website called The History of Meat and Potatoes. If enough other websites link to The History of Meat and Potatoes website, then The History of Meat and Potatoes shows up high on the list when queries are made. These “go to” places are known as authorities, and those webpages that point to authorities are known as hubs. It is a mutually reinforcing approach: good hubs point to good authorities, and good authorities are pointed to by good hubs.
In 1998, Jon Kleinberg of Cornell University formalized this approach with the hyperlink-induced topic search (HITS) algorithm, which takes into account the Web’s social network. Basically, not only does the query pull in a set of pages that match the term, but the algorithm also evaluates the webpages that link to those matching pages. Ultimately, the user is presented with both sets of pages: the authority list and the hub list. One example of how those results might appear on the screen is found at the website for the search engine Teoma (www.teoma.com). Teoma provides the user not only with a standard results list but also with a list of additional “authorities.”
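The mutual reinforcement between hubs and authorities can be illustrated with a small iterative computation. The following is a minimal sketch, not Kleinberg's full procedure: it assumes the set of pages matching the query has already been collected, and the page names, the `web` graph, the fixed iteration count, and the helper names `normalize` and `hits` are all hypothetical choices made for illustration.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0.0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a small link graph.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: add up the hub scores of the pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # Hub update: add up the authority scores of the pages linked to.
        new_hub = {p: sum(new_auth[d] for d in links.get(p, [])) for p in pages}

        auth = normalize(new_auth)
        hub = normalize(new_hub)

    return hub, auth


# Hypothetical toy graph: two guide pages pointing at two content pages.
web = {
    "guide_a": ["meat_history", "potato_facts"],
    "guide_b": ["meat_history"],
}
hub, auth = hits(web)
for page in sorted(auth, key=auth.get, reverse=True):
    print(f"{page:14s} authority={auth[page]:.3f} hub={hub[page]:.3f}")
```

In this toy graph, the page that both guides point to receives the highest authority score, and the guide that points to both content pages receives the highest hub score.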
The advantage of the HITS algorithm for web queries is that the user receives two lists of results: a set of authoritative pages he or she seeks and a set of link pages that can give a more comprehensive look at the information that is available. One major disadvantage is that these computations are performed after the query is submitted, so the time to create the lists would be unacceptable to most users. Also, on the downside, a webmaster can skew the results by adding links to his or her own page to increase its authority score and hub score.
Another, better-known linkage data approach is the PageRank algorithm developed by the founders of Google, Larry Page and Sergey Brin. Page and Brin were graduate students at Stanford in 1998 when they published a paper describing the fundamental concepts of the PageRank algorithm, which later became the underlying algorithm that drives Google. Unlike the HITS algorithm, where the results are created after the query is made, Google has the Web crawled and indexed ahead of time, and the links within these pages are analyzed before the query is ever entered by the user. Basically, Google looks not only at the number of links to The History of Meat and Potatoes website (referring to the earlier example) but also at the importance of those referring links. Google determines how many other Meat and Potatoes websites also link to the referring site and how important those sites are. Again, all of this is computed before the query is ever typed by the user. Once a query comes in, the results are returned based on these intricate PageRank values.
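The idea that a link from an important page counts for more than a link from an obscure one can be sketched with a simple power iteration. This is only an illustrative sketch and not Google's production method: the damping factor of 0.85, the uniform treatment of pages without outgoing links, the fixed iteration count, the page names, and the function name `pagerank` are all assumptions that do not come from the text above.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Return a dict of rank values for a small link graph.

    links maps each page to the list of pages it links to.
    damping is an assumed parameter, not a value taken from the text.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}

        # Every other page splits its rank evenly among the pages it links to.
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share

        rank = new_rank

    return rank


# Hypothetical toy graph: both content pages have exactly one incoming link,
# but "portal" is itself linked to by two blogs while "blog_c" has no in-links.
web = {
    "blog_a": ["portal"],
    "blog_b": ["portal"],
    "portal": ["meat_history"],
    "blog_c": ["potato_facts"],
    "meat_history": [],
    "potato_facts": [],
}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{page:14s} {ranks[page]:.4f}")
```

Because the ranks depend only on the link graph, they can be computed ahead of time and simply looked up when a query arrives. In the toy graph, the page linked from the well-referenced portal ends up with a higher rank than the page linked from the unreferenced blog, even though each has a single incoming link.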
Although the Google search engine is in a dominant position in the web searching marketplace, it is not without its shortcomings.
A strategy to improve one’s position on a results page known as “Googlebombing” involves creating additional websites or manipulating weblogs (known as “blogs”) that can artificially beef up the number of referring links [85, 37]. Like most of the major commercial search engines, Google is limited in its advanced concept matching. Referring to the earlier Meat and Potatoes example, if a user queries with synonyms such as “beef,” “spuds,” and “history,” the chances of getting a results list with The History of Meat and Potatoes as one of the top entries would be lower. Also, if the user wanted to know about The History of Meat and Potatoes in China, it would be difficult to process the polysemic query term “China.” Would the search engine process the term “China” as meaning China the Asian country or china meaning “fine dishes”?
It is important to note that the HITS and the PageRank algorithms are not the only approaches that take into account the hyperlink structure of the Web. Other methods such as the stochastic approach for link structure analysis (SALSA) algorithm of Lempel and Moran combine elements of the HITS and PageRank algorithms [49]. However, in the remaining sections of this chapter, the focus shifts to the theoretical aspects of the two original link algorithms, HITS and PageRank.


