Saturday, September 1, 2007

Anchor Text Analysis Part 1 of 4

This post is part of a series about anchor text analysis. Please click on the links below to explore further:

Part 2
Part 3
Part 4

Because of its enormity, diversity, and lack of formal structure, the Internet presents a unique challenge for information retrieval. To fulfill their informational needs, users translate their questions into short queries and submit them to a search engine. The search engine then applies an algorithm to locate and rank relevant documents from the web.

Several factors make it difficult to evaluate the effectiveness of a search. Relevancy is subjective and is ultimately gauged by the user: a document judged relevant by evaluative techniques may still fail to satisfy the user’s needs. Additionally, users often do not know exactly what they are looking for and experiment with different queries to refine their search. The growth and popularity of the Internet have also made it harder to ascertain the authoritativeness of a document. Unlike the closed corpus used in traditional methods of information retrieval, the web presents a massive collection with varying degrees of reliability.

Anchor text, however, is one of the few structures on the Internet that can improve the effectiveness of a search. It enhances a query’s quality because it assigns concise, human-generated terms to the web page it describes. Kumar et al. (2006) explain it best:

The networking revolution made it possible for hundreds of millions of individuals to create, share, and consume content on a truly global scale, demanding new techniques for information management and searching. One particularly fruitful direction has emerged here: exploiting the link structure of the Web to deliver better search, classification and mining. The idea is to tap into the annotation implicit in the actions of content creators: for instance when a page-creator incorporates hyperlinks to the home pages of several football teams, it suggests that the page-creator is a football fan (132).

This paper presents an overview of the past decade of research on information retrieval using anchor text analysis, which has proven to be a versatile and efficient method of finding information.

On the web, a hyperlink consists of two parts: the destination page and the anchor text on the source page. The author of the source page determines the anchor text, which describes the destination page. Anchor text is defined as the highlighted clickable text that is displayed for a hyperlink on a web page.

An illustrative example is the anchor text for this link:

information retrieval

Here the anchor text “information retrieval” is associated with the destination document “index.html.” From this configuration, it is assumed that the human editor of the source page believed that “index.html” was about information retrieval. When a web crawler follows the links in a document to crawl additional documents, it indexes the anchor text “information retrieval” not only with the source document in which it is embedded, but also with the destination document, index.html.
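The indexing step described above can be sketched in a few lines of Python. This is only an illustration, not how any particular crawler works: the class `AnchorExtractor`, the function `index_anchor_text`, the page name `source.html`, and the sample markup are all hypothetical, and the sketch makes the simplifying assumptions that terms are split on whitespace and that link targets are stored exactly as written rather than resolved to full URLs.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collects (href, anchor text) pairs from one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, anchor_text) pairs
        self._href = None   # href of the <a> element currently open, if any
        self._parts = []    # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._parts = []

    def handle_data(self, data):
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            text = " ".join("".join(self._parts).split())
            self.links.append((self._href, text))
            self._href = None


def index_anchor_text(source_url, html, index):
    """Record each anchor term under both the source and the destination."""
    parser = AnchorExtractor()
    parser.feed(html)
    parser.close()
    for href, text in parser.links:
        for term in text.lower().split():
            index[term].add(source_url)  # document the link is embedded in
            index[term].add(href)        # document the link points to
    return parser.links


# Hypothetical source page containing the example link.
index = defaultdict(set)
page = '<p>Read about <a href="index.html">information retrieval</a> here.</p>'
links = index_anchor_text("source.html", page, index)
```

After this runs, the terms “information” and “retrieval” are each associated with both source.html and index.html, even though the words may never appear in index.html itself.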

Extended anchor text consists of the anchor text itself together with the words surrounding it. Fürnkranz (1999) defined extended anchor text as the headings immediately before the anchor text and the paragraph containing it. Glover et al. (2002) narrowed the definition to the 25 words before and after the anchor text. Extended anchor text relies on lexical affinity, the principle that if there is relevant text close to a link, the link is more likely to be relevant (Maarek & Smadja, 1989).
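The fixed-window definition of extended anchor text can be illustrated with a short Python sketch. The function name `extended_anchor_text`, its parameters, and the sample sentence are hypothetical, and splitting words on whitespace is an assumption. Only the default window of 25 words comes from the definition attributed to Glover et al. above.

```python
def extended_anchor_text(words, start, end, window=25):
    """Return the anchor words plus up to `window` words on each side.

    `words` is the source page as a list of words, and words[start:end]
    is the anchor text itself.
    """
    left = max(0, start - window)            # do not run past the page start
    right = min(len(words), end + window)    # do not run past the page end
    return words[left:right]


# Hypothetical sentence in which "information retrieval" is the anchor text.
words = ("a survey of classic papers on information retrieval "
         "and web search engines").split()

# A small window of 2 words keeps the example easy to inspect.
small = extended_anchor_text(words, start=6, end=8, window=2)

# The default window of 25 words covers this entire short sentence.
full = extended_anchor_text(words, start=6, end=8)
```

With a window of 2, the result is the six words “papers on information retrieval and web.” With the default window, the whole sentence is returned because it is shorter than 25 words on either side of the anchor.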
