Thursday, September 20, 2007

Anchor Text Analysis Part 4 of 4

This post is part of a series about anchor text analysis. Please click on the links below to explore further:

Part 1
Part 2
Part 3

Anchor text can be used to facilitate a focused crawler, which fetches pages relevant to a topic-oriented search engine (Chakrabarti et al., 1999). A focused crawler would begin at the homepage of a limited URL domain, such as a corporation’s website.

Since anchor text usually provides a good summary of its destination page, it can be used to develop a strategy for crawling unvisited URLs (Diligenti et al., 2000; Li et al., 2005). In 2005, Li et al. compared web crawlers across data sets from four Japanese universities. They evaluated a standard breadth-first crawler, a hierarchical focused crawler that moved from parent to child pages, and a focused crawler guided by anchor texts using a decision tree. Li et al. (2005) discovered that the anchor text crawler significantly outperformed the others, needing to crawl only 3% of the pages to reach 50% of the relevant pages. Additionally, the anchor text crawler found deep, relevant pages more effectively than the others did.
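Li et al.’s strategy can be pictured as a priority-ordered crawl frontier in which unvisited URLs are ranked by how well their incoming anchor text matches the target topic. This is only a toy illustration: the term-overlap score below stands in for their decision-tree classifier, and the example links are invented.

```python
from heapq import heappush, heappop

def anchor_score(anchor_text, topic_terms):
    """Fraction of topic terms that appear in the anchor text."""
    words = set(anchor_text.lower().split())
    return sum(t in words for t in topic_terms) / len(topic_terms)

def crawl_order(links, topic_terms):
    """links: (anchor_text, url) pairs found on already-crawled pages.
    Returns URLs in descending order of anchor-text relevance, so the
    most promising unvisited pages are fetched first."""
    frontier = []
    for anchor, url in links:
        # Negate the score because heapq is a min-heap.
        heappush(frontier, (-anchor_score(anchor, topic_terms), url))
    return [heappop(frontier)[1] for _ in range(len(frontier))]

links = [("click here", "http://example.com/a"),
         ("machine learning tutorial", "http://example.com/b"),
         ("learning resources", "http://example.com/c")]
print(crawl_order(links, ["machine", "learning"]))
```

Note how the uninformative “click here” link sinks to the bottom of the frontier, echoing the observation later in this series that such anchor text conveys nothing about its destination.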

Anchor text is also helpful for locating homepages, which can be viewed as destination pages with many anchor texts pointing to them. Westerveld et al. (2002) experimented with combinations of link-, URL-, and anchor-text-based methods to create algorithms that find entry pages. They discovered that anchor-text-only methods outperformed full-text-only methods, and that anchor text provided high-precision results.

Chakrabarti et al. (1998a) suggest that anchor texts can be used to create an automatic resource compiler (ARC), a program that combines link and text analysis to produce a list of authoritative web resources like those on human-compiled resource lists such as Yahoo!’s. Building on Kleinberg’s (1999) theory of authority pages, hubs, and extended anchor text, their study created automatically generated resource lists that fared as well as or better than the human-compiled lists in user studies. The automated lists were generated faster and were easier to update than the human ones. The ARC received its worst scores for subjects such as “affirmative action,” which have many web pages and human-compiled resources. Topics such as “cheese,” “alcoholism,” and “Zen Buddhism” fared best with the ARC because they are less political and have a smaller web presence.

Kumar et al. (2006), using the CLEVER (CLient-side EigenVector Enhanced Retrieval) search system developed at the IBM Almaden Research Center, discovered that link-based ranking with anchor text performed better than content-based techniques. They enhanced Kleinberg’s (1999) HITS algorithm with weighted relevancy values using extended anchor text. Kumar et al. also found that link-based ranking with anchor text was even more successful when combined with content analysis.

Anchor text has also proven efficient for searching intranets. Fagin et al. (2003) point out the differences between querying the Internet and an intranet. Unlike the democracy of the Internet, an intranet reflects the autocratic or bureaucratic position of the organization it serves. Intranet documents are designed to be informative, and there is little incentive to design a page that will attract traffic or many links. In fact, the concept of rank is inconsequential, as all pages, no matter their relevance or importance, are treated equally. Internet searches are deemed successful by users when they satisfice; intranet queries are more precise because users are looking for a specific answer. When the researchers crawled IBM’s intranet, they found that anchor text was surprisingly efficient, and its ranking performance increased as the recall parameter was relaxed. For recall at the 20th position, anchor text led to a 15% improvement, influenced more than 87% of queries, and roughly doubled recall overall. It also outperformed title indexing.

Hawking et al. (1999) point out that experiments using the TREC-8 (Text REtrieval Conference) Small and Large Web Tracks did not find searching with anchor text effective. However, Craswell et al. (2001, 2003) explain that the TREC trials were based on subject search rather than site-finding tasks, the focus of their work. They define the difference as follows: in a site search, a user is looking for a particular site; a subject search aims to return as many relevant documents as possible. Eiron & McCurley (2003) explain the dichotomy as “related to the fact that the predominant use of web search engines is for the entry page search task, but that ad hoc queries appear in a very heavy tail of the distribution of queries. Thus commercial search engines achieve their success by serving most of the people most of the time” (459). Craswell et al. (2001, 2003) discovered that ranking based on anchor text was twice as effective for site finding as content ranking. Overall, anchor text is ideal for locating sites, but may underperform in some subject search experiments.

Traditional information retrieval systems measure precision and recall on large, closed bodies of information (Salton & Buckley, 1990). The Internet, due to its democratic nature, is large, growing, and dynamic, making total recall impossible, and it offers few identifying structures that could improve precision and recall. In this environment, anchor text analysis has proven to be a surprisingly valuable and versatile method for retrieving authoritative and relevant information.

Works Cited

Anick, P. G. (1994). Adapting a full-text information retrieval system to the computer troubleshooting domain. In Proceedings of ACM SIGIR, 349-358.

Bharat, K., & Henzinger, M. (1998). Improved algorithms for topic distillation in a hyperlinked environment. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (102-111). New York: ACM Press.

Brin, S. & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks, 30 (1-7), 107-117.

Chakrabarti, S., Dom, B., Gibson, D., Kleinberg, J., Raghavan, P., & Rajagopalan, S. (1998a) Automatic resource compilation by analyzing hyperlink structure and associated text. In Proceedings of the 7th International World Wide Web Conference, 1-7, 65-74.

Chakrabarti, S., Dom, B., Gibson, D., Kumar, S. R., Raghavan, P., Rajagopalan, S. & Tomkins, A. (1998b). Experiments in Topic Distillation. ACM SIGIR Workshop on Hypertext Information Retrieval on the Web. 13-21.

Chakrabarti, S., van den Berg, M., & Dom, B. (1999). Focused crawling: a new approach to topic-specific web resource discovery. In The Eighth International World Wide Web Conference, Toronto, Canada, May 1999.

Craswell, N., Hawking, D., & Robertson, S.E. (2001). Effective site finding using link anchor information. In Proceedings of the 24th ACM SIGIR, 250-257.

Craswell, N., Hawking, D., Wilkinson, R., & Wu, M. (2003). Overview of the TREC-2003 web track. In Proceedings of TREC 2003.

Croft, W. B., Cook, R., and Wilder, D. (1995). Providing government information on the internet: Experience with ‘THOMAS.’ University of Massachusetts Technical Report 95-96.

Dash, R. K. (2005). Increasing blog traffic: 10 factors affecting your search engine rankings. BlogSpinner-X. Retrieved on August 13, 2007 from:
http://blogspinner.blogspot.com/2005/11/increasing-blog-traffic-10-factors.html

Diligenti, M., Coetzee, F. M., Lawrence, S., Giles, C. L., & Gori, M. (2000). Focused crawling using context graphs. In Proceedings of the 26th VLDB Conference, 527-534.

Eiron, N., & McCurley, K. S. (2003). Analysis of anchor text for web search. In
Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (459-460). New York: ACM Press.

Fagin, R., Kumar, R., McCurley, K.S., Novak, J., Sivakumar, D., Tomlin, J. A., & Williamson, D. P. (2003). Searching the workplace web. In Proceedings of WWW2003.

Fürnkranz, J. (1999). Exploiting structural information for text classification on the WWW. Proceedings of IDA-99, 3rd Symposium on Intelligent Data Analysis, 487-498.

Glover, E. J., Tsioutsiouliklis, K., Lawrence, S., Pennock, D., & Flack, G. W. (2002). Using web structure for classifying and describing web pages. In Proceedings of WWW2002.

Hawking, D., Voorhees, E., Bailey, P., & Craswell, N. (1999). Overview of TREC-8 Web Track. In Proceedings of TREC-8.

Hiler, J. (March 3, 2002). Google time bomb. Microcontent News. Retrieved December 2, 2007 from:
http://www.microcontentnews.com/articles/googlebombs.htm

Jin, R., Hauptmann, A. G., & Zhai, C. (2002). Title language model for information retrieval. In Proceedings of the 25th ACM SIGIR, 42-48.

Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604-632.

Kraft, R., & Zien, J. (2004). Mining anchor text for query refinement. WWW2004, 666-674.

Kumar, R., Raghavan, P., Rajagopalan, S. & Tomkins, A. (2001). On semi-automated web taxonomy construction. In Proceedings of the 4th ACM WebDB. New York: ACM Press, 91-96.

Kumar, R., Raghavan, P., Rajagopalan, S., & Tomkins, A. (2006). Core algorithms in the CLEVER System. ACM Transactions on Internet Technology, 6(2), 131-152.

Langville, A. N. & Meyer, C. D. (2006). Google’s PageRank and beyond: The science of search engine rankings. Princeton, NJ: Princeton University Press.

Li, J., Furuse, K., & Yamaguchi, K. (2005). Focused crawling by exploiting anchor text using decision tree. In International World Wide Web Conference, Chiba, Japan, 1190-1191.

Lu, W. H., Chien, L. F., & Lee, H. J. (2002). Translation of web queries using anchor text mining. ACM Transactions on Asian Language Information Processing, 1(2), 159-172.

Lu, W. H., Chien, L. F., & Lee, H. J. (2004). Anchor text mining for translation of web queries: A transitive translation approach. ACM Transactions on Information Systems, 22 (2), 242-269.

Maarek, Y. & Smadja, F. (1989). Full text indexing based on lexical relations. In Proceedings of the 12th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM Press, 198-206.

McBryan, O. (1994). GENVL and WWWW: Tools for taming the web. In First International World Wide Web Conference, Geneva, Switzerland.

Salton, G. & Buckley, C. (1990). Improving retrieval performance for relevance
feedback. Journal of the American Society for Information Science, 41(4), 288-297.

Westerveld, T., Kraaij, W., & Hiemstra, D. (2002). Retrieving web pages using content, links, URLs and anchors. In Tenth Text Retrieval Conference, 663-672.

Monday, September 10, 2007

Anchor Text Analysis Part 3 of 4

This post is part of a series about anchor text analysis. Please click on the links below to explore further:

Part 1
Part 2
Part 4

Eiron & McCurley (2003) observe that queries are often very short and contain few terms on average. Some researchers have noted that users create queries of one to three words without thought of query construction (Anick, 1994; Croft et al., 1995). Anchor text is remarkably similar to queries because it is short and summarizes what the user hopes to find. It also matches queries in term distribution: both are typically a noun phrase or a noun with adjectives, and they rarely contain verbs (Eiron & McCurley, 2003). The vocabulary and grammatical form of queries and anchor text are more similar to each other than either is to full-text documents (Eiron & McCurley, 2003). When the query terms and anchor text match, the found documents are relevant to the query (Eiron & McCurley, 2003).

The similarity between anchor text and query formation can be exploited by creating an algorithm that suggests refinements. Kraft & Zien (2004) suggest that better-quality refinements originate from anchor text than from mining the document’s content. They found that anchor text used with algorithms provided high-quality refinements for queries of one or two words, the typical length of everyday search engine queries. They propose that further research should explore how anchor text can be used to broaden or change the direction of the search, or how extended anchor text can refine queries (Kraft & Zien, 2004).

Theoretically, an algorithm for query refinement would find anchor texts that are similar to the text used in the query. However, there are problems with using anchor text alone (Kraft & Zien, 2004). There may be too many anchor texts matching a single-term query, and some anchor texts are automatically generated by Web authoring tools (Kraft & Zien, 2004). Ranking methods are therefore employed in addition to anchor text. With this method, queries are easier to process because the algorithm only has to process anchor texts, not full documents. Additionally, destination documents with a large number of anchor texts pointing to them are, on the whole, more relevant and popular than pages with fewer anchor texts.
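One way to picture such a refinement algorithm is to suggest longer, frequently occurring anchor texts that contain all of the query’s terms. The sketch below is a simplification built on that assumption; Kraft & Zien’s actual algorithms and ranking are more sophisticated, and the example corpus is invented.

```python
from collections import Counter

def suggest_refinements(query, anchor_texts, k=3):
    """Suggest anchor texts that contain every query term plus at least
    one new term, ranked by frequency in the anchor-text corpus."""
    q = set(query.lower().split())
    counts = Counter(a.lower() for a in anchor_texts)
    candidates = [(n, a) for a, n in counts.items()
                  if q < set(a.split())]  # strict superset: adds new terms
    candidates.sort(key=lambda pair: -pair[0])
    return [a for n, a in candidates[:k]]

anchors = ["java tutorial", "java", "java tutorial",
           "java swing tutorial", "python tutorial"]
print(suggest_refinements("java", anchors))
```

Frequent refinements surface first, reflecting the observation above that heavily repeated anchor texts tend to describe relevant, popular destinations.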

Anchor text can also supply synonyms. Eiron & McCurley (2003), in their experiments with querying the IBM intranet with anchor text, point out that the common term “layoff” is often called “restructuring,” “downsizing,” “rightsizing,” “outsourcing” or a dozen other euphemisms in the corporate world. Anchor text could assist in locating documents about layoffs that do not contain the term in their title, text, or keywords.

Similarly, anchor text, with link structures, can be used to construct multilingual lexicons and find translation equivalents of query terms. Since bi- or multilingual human editors create anchor text, they can differentiate between subtleties of connotation better than any machine translation. Exploiting anchor text is also less time-consuming and expensive than building multilingual lexicons or thesauri. Lu et al. (2002) created an experimental system to perform English-Chinese web searches, which proved efficient for translating queries containing new terminology or proper names. Later experiments by Lu et al. (2004) combined anchor text with a bilingual dictionary to suggest helpful query terms to refine the search.

Anchor text works in much the same way as a folksonomy. Instead of using traditional subject indexing or a controlled vocabulary, users employ freely chosen keywords to describe the content of the destination page. Often the anchor text is more descriptive and concise than the full text of the destination page. For example, the website for General Motors, www.gm.com, is about the automobile manufacturer, yet “automobile manufacturer” does not appear on the page. The anchor text “automobile manufacturer” on a web page pointing to the General Motors site provides this information.

Anchor text is also important when the destination pages do not have text that can be crawled by a search engine. For instance, a destination page of magnolia illustrations can only be described by its anchor texts, because the illustrations are JPEGs, which provide no textual information about what they represent.

Anchor text, combined with document titles, provides subject information about the document (Jin et al., 2002). Both express a concept of what the document is about, although they are constructed differently linguistically. However, anchor text is more useful than titles, because a page can accumulate many anchor texts but has only one title.

Thursday, September 6, 2007

Anchor Text Analysis Part 2 of 4

This post is part of a series about anchor text analysis. Please click on the links below to explore further:

Part 1
Part 3
Part 4

Oliver McBryan first discussed the usefulness of anchor text in Internet searching in 1994. He discussed the operations of GENVL (GENerate Virtual Library), an interactive hierarchical virtual library cataloged by subject, and WWWW, the WWW Worm, a search engine. He proposed that a search engine could be constructed on anchor texts alone.

Anchor text supports the ranking of websites, built on the premise that if an author creates a web page and uses anchor text to describe another page, then the anchor text is a good indicator of the destination page’s content. Google, as well as many other search engines, uses link analysis to create PageRank values (Brin & Page, 1998).

When other sites point to the destination page with the same anchor text, the page is more likely to be relevant to users whose query terms appear in that anchor text. Anchor text, especially combined with text-based methods, improves ranking quality (Bharat & Henzinger, 1998; Chakrabarti et al., 1998b; Kleinberg, 1999; Kumar et al., 2001).

Kleinberg (1999) suggests that anchor text can be used to determine authoritative content on the web: “Hyperlinks encode a considerable amount of latent human judgment, and we claim that this type of judgment is precisely what is needed to formulate a notion of authority” (204). Using the HITS (Hypertext Induced Topic Search) algorithm, Kleinberg divides the web into two kinds of pages. An authority page contains high-quality information about a subject. A hub contains many links to high-quality web pages. Authority pages and hubs exist in a mutually reinforcing relationship in which anchor texts are embedded and point to reliable information.
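That mutually reinforcing relationship can be sketched as the core HITS iteration: a page’s authority score is summed from the hub scores of pages linking to it, and its hub score from the authority scores of pages it links to. This toy version omits the root-set construction and the anchor-text weighting of real systems, and the example graph is invented.

```python
def hits(graph, iterations=20):
    """graph: dict mapping each page to the list of pages it links to.
    Returns (authority, hub) score dicts after repeated updates."""
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of pages that link to p.
        auth = {p: sum(hub[q] for q in graph if p in graph[q])
                for p in pages}
        # Hub: sum of authority scores of pages that p links to.
        hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in pages}
        # Normalize so the scores do not diverge.
        na = sum(auth.values()) or 1.0
        nh = sum(hub.values()) or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return auth, hub

graph = {"hub1": ["site_a", "site_b"], "hub2": ["site_a"]}
auth, hub = hits(graph)
# site_a is linked from both hubs, so it emerges as the top authority.
print(max(auth, key=auth.get))
```

The example shows the reinforcement in miniature: site_a gains authority from two hubs, and hub1 earns the higher hub score for pointing at both authorities.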

However, some anchor text may not be useful at all. “Click here” is a popular anchor text that conveys nothing about the source or destination pages. Anchor texts can also be exploited through black hat practices, in which spammers create multiple pages whose anchor texts point to their website and artificially inflate its rank. PageRank, and other well-engineered systems, minimize the practice by requiring that pages be cited by important pages.

Destination pages may also suffer from inaccurate or spurious anchor texts (Hiler, 2002). A well-known example of anchor text mischief occurred in 2003 when the highest-ranking result of the query terms “miserable failure” was George W. Bush’s official biography. Of the more than 800 links to Bush’s page, only 32 of them used the phrase (Langville & Meyer, 2006). In the years since, search engines have improved their algorithms to defeat the Googlebombing phenomenon.

Saturday, September 1, 2007

Anchor Text Analysis Part 1 of 4

This post is part of a series about anchor text analysis. Please click on the links below to explore further:

Part 2
Part 3
Part 4

Due to its enormity, diversity, and lack of formal structure, the Internet presents a unique dilemma for information retrieval. To fulfill informational needs, users translate their questions into short queries through a search engine. Using an algorithm, the search engine scours the web to locate and rank relevant documents.

Several factors make it difficult to evaluate the effectiveness of the search. Relevancy is subjective, gauged by the user. A document that may be judged relevant by evaluative techniques may not satisfy the user’s needs. Additionally, users often do not know exactly what they are looking for and experiment with different queries to refine their search. The growth and popularity of the Internet has also made ascertaining the authoritativeness of a document complicated. Unlike the closed corpus used in traditional methods of information retrieval, the web presents a massive collection with varying degrees of reliability.

Anchor text, however, is one of the few structures on the Internet that can improve the effectiveness of a search. It enhances a query’s quality because it assigns concise, human-generated terms to the web page it describes. Kumar et al. (2006) explain it best:

The networking revolution made it possible for hundreds of millions of individuals to create, share, and consume content on a truly global scale, demanding new techniques for information management and searching. One particularly fruitful direction has emerged here: exploiting the link structure of the Web to deliver better search, classification and mining. The idea is to tap into the annotation implicit in the actions of content creators: for instance when a page-creator incorporates hyperlinks to the home pages of several football teams, it suggests that the page-creator is a football fan (132).

This paper presents an overview of information retrieval using anchor text analysis from the past decade, which has proven to be a versatile and efficient method of finding information.

On the web, a hyperlink consists of two parts: the destination page and the anchor text on the source page. The author of the source page determines the anchor text, which describes the destination page. Anchor text is defined as the highlighted clickable text that is displayed for a hyperlink on a web page.

An illustrative example is the anchor text in this link:

<a href="index.html">information retrieval</a>

“Information retrieval” is associated with the document “index.html.” From this configuration, it is assumed that a human editor of the web page believed that “index.html” was about information retrieval. When a web crawler follows links in a document to crawl additional documents, it indexes the anchor text “information retrieval” not only with the source document in which it is embedded, but also with the destination document, index.html.
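That indexing step can be sketched as a small inverted index in which each anchor term is credited to both the source and the destination page. The data structures and example pages below are illustrative assumptions, not any particular engine’s design.

```python
from collections import defaultdict

def build_index(pages):
    """pages: dict mapping source url -> list of (anchor_text, dest_url)
    links found on that page. Each anchor term is indexed under the
    source page (where it appears) and the destination page (which it
    describes)."""
    index = defaultdict(set)  # term -> set of urls
    for source, links in pages.items():
        for anchor, dest in links:
            for term in anchor.lower().split():
                index[term].add(source)
                index[term].add(dest)
    return index

pages = {"blog.html": [("information retrieval", "index.html")]}
index = build_index(pages)
# A query for "retrieval" now finds index.html even if that page
# never uses the word itself.
print(sorted(index["retrieval"]))
```

This is also the mechanism behind the General Motors example earlier in the series: the destination page inherits descriptive terms it never contains.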

Extended anchor text is the anchor text itself plus the words surrounding it. Fürnkranz (1999) defined extended anchor text as the headings immediately before the anchor text and the paragraph containing it. Glover et al. (2002) narrowed the definition to the 25 words before and after the anchor text. Extended anchor text exploits lexical affinity, the principle that if there is relevant text close to a link, the link is more likely to be relevant to that text (Maarek & Smadja, 1989).
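Under Glover et al.’s (2002) window definition, extracting extended anchor text amounts to taking a fixed number of words on either side of the link. A minimal sketch, assuming the source page has already been flattened to a word list (a real extractor would work on parsed HTML):

```python
def extended_anchor_text(words, start, end, window=25):
    """words: the source page as a list of words; the anchor text
    occupies words[start:end]. Returns the anchor text plus up to
    `window` words on each side, per Glover et al. (2002)."""
    lo = max(0, start - window)
    hi = min(len(words), end + window)
    return " ".join(words[lo:hi])

words = "the best site for information retrieval research today".split()
# The anchor text is words[4:6], i.e. "information retrieval".
print(extended_anchor_text(words, 4, 6, window=2))
```

Shrinking the window to two words keeps the example readable; the surrounding words “for … research” are exactly the kind of nearby text that lexical affinity predicts will be relevant to the link.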