Monday, September 10, 2007

Anchor Text Analysis Part 3 of 4

This post is part of a series about anchor text analysis. Please click on the links below to explore further:

Part 1
Part 2
Part 4

Eiron & McCurley (2003) observe that queries are often very short and contain few terms on average. Some researchers have noted that users create queries of one to three words without giving thought to query construction (Anick, 1994; Croft et al., 1995). Anchor text is remarkably similar to queries because it is short and summarizes what the user hopes to find. It also matches queries in term distribution: both are typically either a noun phrase or a noun with adjectives, and they rarely contain verbs (Eiron & McCurley, 2003). In vocabulary and grammatical form, queries and anchor text resemble each other more than either resembles full-text documents (Eiron & McCurley, 2003). When the query terms match the anchor text, the retrieved documents are relevant to the query (Eiron & McCurley, 2003).

The similarity between anchor text and query formation can be exploited by creating an algorithm that suggests refinements. Kraft & Zien (2004) suggest that better-quality refinements come from anchor text than from mining the document’s content. They found that anchor text used with algorithms provided high-quality refinements for queries of one or two words, the typical length of everyday search engine queries. They propose that further research should explore how anchor text can be used to broaden or change the direction of the search, or how extended anchor text can refine queries (Kraft & Zien, 2004).

Theoretically, an algorithm for query refinement would find anchor texts that are similar to the text used in the query. However, there are problems with using anchor text alone (Kraft & Zien, 2004): there may be too many anchor texts that match a single-term query, and some anchor texts are automatically generated by Web authoring tools. Ranking methods are therefore employed in addition to anchor text. With this approach, queries are easier to process because the algorithm only has to process anchor texts, not full documents. Additionally, destination documents with many anchor texts pointing to them are, on the whole, more relevant and popular than pages with fewer anchor texts.
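The idea of matching a query against anchor texts and ranking the matches can be illustrated with a minimal Python sketch. It is not the algorithm of Kraft & Zien (2004): the function name suggest_refinements, the sample anchor texts, and the ranking rule (how many links use a given anchor text) are all assumptions made for illustration.

```python
from collections import Counter


def suggest_refinements(query, anchor_texts, max_suggestions=5):
    """Suggest longer anchor texts that contain every term of the query.

    Hypothetical ranking rule: anchor texts used by more links rank higher;
    ties are broken alphabetically so the output is deterministic.
    """
    query_terms = set(query.lower().split())
    counts = Counter()
    for anchor in anchor_texts:
        normalized = " ".join(anchor.lower().split())
        terms = set(normalized.split())
        # Keep only anchors that add at least one term to the query.
        if query_terms and query_terms < terms:
            counts[normalized] += 1
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [text for text, _ in ranked[:max_suggestions]]


# Made-up anchor texts, one entry per link found on the web.
sample_anchors = [
    "jaguar cars",
    "Jaguar Cars",
    "jaguar habitat",
    "click here",
    "jaguar",
]
suggestions = suggest_refinements("jaguar", sample_anchors)
```

Because only the short anchor strings are scanned, the sketch never touches the full text of any document, which is the processing advantage described above.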

Anchor text can also supply synonyms. Eiron & McCurley (2003), in their experiments with querying the IBM intranet with anchor text, point out that the common term “layoff” is often called “restructuring,” “downsizing,” “rightsizing,” “outsourcing” or a dozen other euphemisms in the corporate world. Anchor text could assist in locating documents about layoffs that do not contain the term in their title, text, or keywords.
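A small Python sketch shows how this can work in practice: if destination pages are indexed by the words in the anchor texts that point to them, a search for “layoff” can reach a page that only ever calls itself a restructuring plan. The function names, the example.com URLs, and the link data are hypothetical and are not taken from Eiron & McCurley (2003).

```python
from collections import defaultdict


def build_anchor_index(links):
    """Map each anchor-text term to the set of destination URLs it points to.

    `links` is an iterable of (anchor_text, destination_url) pairs.
    """
    index = defaultdict(set)
    for anchor_text, destination in links:
        for term in anchor_text.lower().split():
            index[term].add(destination)
    return index


def search(index, query):
    """Return the destinations whose anchor terms include every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical intranet links; the destination page itself never says "layoff".
sample_links = [
    ("layoff announcement", "http://intranet.example.com/restructuring-plan"),
    ("restructuring plan", "http://intranet.example.com/restructuring-plan"),
    ("cafeteria menu", "http://intranet.example.com/menu"),
]
anchor_index = build_anchor_index(sample_links)
layoff_pages = search(anchor_index, "layoff")
```

Here the page is found through the vocabulary of the people who linked to it, not through its own wording.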

Similarly, anchor text, together with link structures, can be used to construct multilingual lexicons and find translation equivalents of query terms. Since bilingual or multilingual human editors create anchor text, they can differentiate between subtleties of connotation better than any machine translation. Exploiting anchor text is also less time-consuming and expensive than compiling multilingual lexicons or thesauri. Lu et al. (2002) created an experimental system to perform English-Chinese web searches, which proved efficient for translating queries containing new terminology or proper names. Later experiments by Lu et al. (2004) combined anchor text with a bilingual dictionary to suggest helpful query terms to refine the search.
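The underlying intuition is that anchor texts written in different languages but pointing to the same page are likely to describe the same thing. The Python sketch below counts such co-occurrences. It is a simplified illustration, not the model used by Lu et al.; the function name, the language tags (assumed to be known for every anchor), and the sample links are hypothetical.

```python
from collections import Counter, defaultdict


def translation_candidates(links, source_term):
    """Rank anchor texts in other languages that share a destination
    with `source_term`.

    `links` is an iterable of (anchor_text, language, destination_url)
    triples. The score is simply the number of shared destination pages.
    """
    anchors_by_page = defaultdict(set)
    for anchor_text, language, destination in links:
        anchors_by_page[destination].add((anchor_text.lower(), language))

    source = source_term.lower()
    scores = Counter()
    for anchors in anchors_by_page.values():
        source_languages = {lang for text, lang in anchors if text == source}
        if not source_languages:
            continue  # this page is never linked with the source term
        for text, lang in anchors:
            if lang not in source_languages:
                scores[text] += 1
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up links: English and Chinese anchors pointing to the same pages.
sample_links = [
    ("Yahoo", "en", "http://example.com/a"),
    ("雅虎", "zh", "http://example.com/a"),
    ("Yahoo", "en", "http://example.com/b"),
    ("雅虎", "zh", "http://example.com/b"),
    ("搜尋引擎", "zh", "http://example.com/b"),
]
candidates = translation_candidates(sample_links, "yahoo")
```

The anchor that co-occurs with the source term on the most pages ranks first, which is why proper names and new terminology, often missing from dictionaries, can still be matched.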

Anchor text works the same way as a folksonomy. Instead of using traditional subject indexing or a controlled vocabulary, users employ freely chosen keywords to describe the content of the destination page. Often the anchor text is more descriptive and concise than the full text of the destination page. For example, the website for General Motors is about the automobile manufacturer, yet “automobile manufacturer” does not appear on the page. The anchor text “automobile manufacturer” on a web page pointing to the General Motors site provides this information.

Anchor text is also important when the destination pages do not have text that can be crawled by a search engine. For instance, a destination page of magnolia illustrations can only be described by anchor text because the illustrations are JPEG images, which do not provide textual information about what they represent.

Anchor text, combined with document titles, provides subject information about the document (Jin et al., 2002). Both anchor text and titles express a concept of what the document is about, although they are constructed differently linguistically. However, anchor text is more useful than titles, because a document can have many anchor texts pointing to it but only one title.
