Thursday, September 20, 2007

Anchor Text Analysis Part 4 of 4

This post is part of a series about anchor text analysis. Please click on the links below to explore further:

Part 1
Part 2
Part 3

Anchor text can be used to guide a focused crawler, which fetches pages relevant to a topic-oriented search engine (Chakrabarti et al., 1999). A focused crawler would begin at the homepage of a limited URL domain, such as a corporation’s website.

Since it usually provides a good summary of the destination page, anchor text can be used to develop a strategy for crawling unvisited URLs (Diligenti et al., 2000; Li et al., 2005). In 2005, Li et al. compared web crawlers across data sets from four Japanese universities, evaluating a standard breadth-first crawler, a hierarchical focused crawler that moved from parent to child pages, and a focused crawler guided by anchor text using a decision tree. Li et al. (2005) found that the anchor text crawler significantly outperformed the other crawlers, needing to visit only 3% of the pages to retrieve 50% of the relevant ones. The anchor text crawler was also able to find deep, relevant pages more effectively than the others were; a simplified sketch of such a crawler follows.
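As a rough illustration of the idea, the Python sketch below scores unvisited links with a decision tree trained on labeled anchor texts and crawls the highest-scoring URLs first. The training anchors, link graph, and URLs are all invented for the example; a real crawler would fetch pages over HTTP and learn from a labeled seed crawl rather than this toy data.

```python
# A minimal sketch of an anchor-text-guided focused crawler in the spirit of
# Li et al. (2005). All URLs and the toy link graph below are hypothetical.
import heapq
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: anchor texts labeled relevant (1) or not (0).
train_anchors = ["faculty research papers", "admissions office",
                 "machine learning lab", "campus parking map"]
train_labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()
clf = DecisionTreeClassifier().fit(
    vectorizer.fit_transform(train_anchors), train_labels)

# Toy link graph: url -> list of (anchor_text, target_url).
graph = {
    "http://example.edu/": [("machine learning lab", "http://example.edu/ml"),
                            ("campus parking map", "http://example.edu/parking")],
    "http://example.edu/ml": [("faculty research papers",
                               "http://example.edu/ml/papers")],
    "http://example.edu/parking": [],
    "http://example.edu/ml/papers": [],
}

def crawl(seed, budget=3):
    """Visit the highest-scoring URLs first, scoring each unvisited URL
    by the decision tree's relevance probability for its anchor text."""
    frontier = [(-1.0, seed)]          # max-heap via negated scores
    visited = set()
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        for anchor, target in graph.get(url, []):
            if target not in visited:
                score = clf.predict_proba(vectorizer.transform([anchor]))[0][1]
                heapq.heappush(frontier, (-score, target))
    return visited

print(crawl("http://example.edu/"))
```

With the budget of three pages, the crawler skips the low-scoring parking page entirely and descends straight to the deep research page, which is the behavior Li et al. observed at scale.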

Anchor text is also helpful for locating homepages, which can be viewed as destination pages with many anchor texts pointing to them. Westerveld et al. (2002) experimented with combinations of link-, URL-, and anchor-text-based methods to create algorithms for finding entry pages. They discovered that anchor-text-only methods outperformed full-text-only methods and that anchor text provided high-precision results.
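To make the idea concrete, here is a small sketch of the anchor-document approach: every page accumulates the anchor texts of its incoming links, and candidates for a navigational query are ranked by anchor term matches with a mild preference for shallow URLs. The links and URLs are fabricated, and the scoring is far simpler than the combinations Westerveld et al. actually tested.

```python
# A minimal sketch of entry-page finding: rank pages by how well the anchor
# texts pointing at them match the query, with a small prior favoring
# shallow (root-like) URLs. The link data below is invented for illustration.
from collections import defaultdict
from urllib.parse import urlparse

# (anchor_text, target_url) pairs harvested from a hypothetical crawl.
links = [
    ("acme corporation home", "http://acme.example/"),
    ("acme", "http://acme.example/"),
    ("acme quarterly report", "http://acme.example/ir/2007/q2.html"),
    ("contact acme support", "http://acme.example/support/contact.html"),
]

# Aggregate all incoming anchor texts into one "anchor document" per URL.
anchor_docs = defaultdict(list)
for anchor, url in links:
    anchor_docs[url].extend(anchor.lower().split())

def url_depth(url):
    return len([p for p in urlparse(url).path.split("/") if p])

def score(query, url):
    terms = query.lower().split()
    matches = sum(anchor_docs[url].count(t) for t in terms)
    return matches / (1 + url_depth(url))   # shallow URLs get a boost

query = "acme home"
for url in sorted(anchor_docs, key=lambda u: score(query, u), reverse=True):
    print(f"{score(query, url):.2f}  {url}")
```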

Chakrabarti et al. (1998a) suggest that anchor texts can be used to create an automatic resource compiler (ARC). An ARC is a program that combines link and text analysis to create a list of authoritative web resources similar to human-compiled directories such as Yahoo! Building on Kleinberg’s (1999) theory of authority pages, hubs, and extended anchor text, their study created automatically generated resource lists that fared as well as or better than the human-compiled lists in user studies. The automated lists were generated faster and were easier to update than the human ones. The ARC received its worst scores for subjects such as “affirmative action,” which has many web pages and human-compiled resources. Topics such as “cheese,” “alcoholism,” and “Zen Buddhism” fared best with the ARC because they are less political and have a smaller web presence.

Kumar et al. (2006), using the CLEVER (CLient-side EigenVector Enhanced Retrieval) search system developed at the IBM Almaden Research Center, discovered that link-based ranking with anchor text performed better than content-based techniques. They enhanced Kleinberg’s (1999) HITS algorithm with weighted relevancy values derived from extended anchor text. Kumar et al. also found that link-based ranking with anchor text was even more successful when combined with content analysis.
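The following toy version conveys the flavor of that enhancement: a standard HITS power iteration in which each edge is weighted by how well its anchor text matches the query. The graph, anchors, and weighting function are illustrative assumptions, not the CLEVER system’s actual code.

```python
# A minimal sketch of Kleinberg-style HITS with edges weighted by the
# relevance of their anchor text to the query. Toy data throughout.
import numpy as np

pages = ["A", "B", "C", "D"]
# (source, target, anchor_text) triples for a made-up web graph.
edges = [("A", "B", "zen buddhism temples"),
         ("A", "C", "site map"),
         ("D", "B", "zen meditation guide"),
         ("D", "C", "zen buddhism history")]

query_terms = {"zen", "buddhism"}

def anchor_weight(anchor):
    """Weight an edge by the fraction of query terms its anchor contains."""
    words = set(anchor.split())
    return 0.1 + len(query_terms & words) / len(query_terms)  # small floor

n = len(pages)
idx = {p: i for i, p in enumerate(pages)}
W = np.zeros((n, n))
for src, dst, anchor in edges:
    W[idx[src], idx[dst]] = anchor_weight(anchor)

hubs = np.ones(n)
auths = np.ones(n)
for _ in range(50):                      # power iteration
    auths = W.T @ hubs                   # authorities gather weighted in-links
    hubs = W @ auths                     # hubs gather weighted out-links
    auths /= np.linalg.norm(auths)
    hubs /= np.linalg.norm(hubs)

for p in sorted(pages, key=lambda p: -auths[idx[p]]):
    print(f"{p}: authority={auths[idx[p]]:.3f}, hub={hubs[idx[p]]:.3f}")
```

Because the “site map” link carries almost no weight for this query, page B emerges as the strongest authority even though B and C have the same number of in-links, which is the effect the anchor-text weighting is meant to produce.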

Anchor text has also proven effective for searching intranets. Fagin et al. (2003) point out the differences between querying the Internet and an intranet. Unlike the democracy of the Internet, an intranet reflects the autocratic or bureaucratic structure of the organization it serves. Intranet documents are designed to be informative, and there is little incentive to design a page that will attract traffic or accumulate links. In fact, the concept of rank is inconsequential, as all pages, no matter their relevance or importance, are treated equally. Internet searches are deemed successful by users when they satisfice; intranet queries are more precise because users are looking for a specific answer. When the researchers crawled IBM’s intranet, they found that anchor text was surprisingly effective, and its ranking performance increased as the recall parameter was relaxed. For recall at the 20th position, anchor text led to a 15% improvement, influenced more than 87% of queries, and roughly doubled overall recall. It also outperformed title indexing.

Hawking et al. (1999) point out that experiments using the TREC-8 (Text Retrieval Conference) Small and Large Web Tracks did not find searching with anchor text effective. However, Craswell et al. (2001, 2003) explain that the TREC trials were based on subject search rather than the site finding tasks that were the focus of their work. The difference is that in site finding the user is looking for one particular site, while a subject search aims to return as many relevant documents as possible. Eiron and McCurley (2003) explain the dichotomy as “related to the fact that the predominant use of web search engines is for the entry page search task, but that ad hoc queries appear in a very heavy tail of the distribution of queries. Thus commercial search engines achieve their success by serving most of the people most of the time” (459). Craswell et al. (2001, 2003) discovered that ranking based on anchor text was twice as effective for site finding as content ranking; the contrast is illustrated below. Overall, anchor text is ideal for locating sites, but may underperform in some subject search experiments.
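The sketch below contrasts the two signals on a made-up navigational query: a plain term-frequency content score favors a page that merely discusses the topic, while an anchor-text score favors the entry page itself. The two-document collection and link data are fabricated for illustration, and real systems would use a proper retrieval model rather than raw counts.

```python
# A minimal sketch contrasting content-based and anchor-text-based ranking
# on a navigational ("site finding") query. All data is invented.
docs = {
    "http://uni.example/sigir-history.html":
        "the sigir conference has a long history of information retrieval research",
    "http://sigir.example/":
        "welcome program registration",
}
# Incoming anchor texts per URL from a hypothetical link crawl.
anchors = {
    "http://uni.example/sigir-history.html": ["conference history essay"],
    "http://sigir.example/": ["sigir", "sigir conference", "sigir home page"],
}

def tf_score(tokens, query):
    """Raw term-frequency match between a token list and the query."""
    return sum(tokens.count(t) for t in query.split())

query = "sigir conference"
for url in docs:
    content = tf_score(docs[url].split(), query)
    anchor = tf_score(" ".join(anchors[url]).split(), query)
    print(f"content={content}  anchor={anchor}  {url}")
# For this navigational query, the entry page wins on anchor score even
# though the essay page wins on raw content match.
```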

Traditional information retrieval systems measure precision and recall on large, closed collections (Salton & Buckley, 1990). The Internet, due to its democratic nature, is large, growing, and dynamic, making total recall impossible, and it offers few identifying structures that can improve precision and recall. In this environment, anchor text analysis has proven to be a surprisingly valuable and versatile method for retrieving authoritative and relevant information.

Works Cited

Anick, P. G. (1994). Adapting a full-text information retrieval system to the computer troubleshooting domain. In Proceedings of ACM SIGIR, 349-358.

Bharat, K., & Henzinger, M. (1998). Improved algorithms for topic distillation in a hyperlinked environment. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (102-111). New York: ACM Press.

Brin, S. & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks, 30 (1-7), 107-117.

Chakrabarti, S., Dom, B., Gibson, D., Kleinberg, J., Raghavan, P., & Rajagopalan, S. (1998a). Automatic resource compilation by analyzing hyperlink structure and associated text. In Proceedings of the 7th International World Wide Web Conference, Computer Networks, 30(1-7), 65-74.

Chakrabarti, S., Dom, B., Gibson, D., Kumar, S. R., Raghavan, P., Rajagopalan, S., & Tomkins, A. (1998b). Experiments in topic distillation. In ACM SIGIR Workshop on Hypertext Information Retrieval on the Web, 13-21.

Chakrabarti, S., van den Berg, M., & Dom, B. (1999). Focused crawling: a new approach to topic-specific web resource discovery. In The Eighth International World Wide Web Conference, Toronto, Canada, May 1999.

Craswell, N., Hawking, D., & Robertson, S. E. (2001). Effective site finding using link anchor information. In Proceedings of the 24th ACM SIGIR, 250-257.

Craswell, N., Hawking, D., Wilkinson, R., & Wu, M. (2003). Overview of the TREC-2003 Web Track. In Proceedings of TREC 2003.

Croft, W. B., Cook, R., and Wilder, D. (1995). Providing government information on the internet: Experience with ‘THOMAS.’ University of Massachusetts Technical Report 95-96.

Dash, R. K. (2005). Increasing blog traffic: 10 factors affecting your search engine rankings. BlogSpinner-X. Retrieved August 13, 2007 from http://blogspinner.blogspot.com/2005/11/increasing-blog-traffic-10-factors.html

Diligenti, M., Coetzee, F. M., Lawrence, S., Giles, C. L., & Gori, M. (2000). Focused crawling using context graphs. In Proceedings of the 26th VLDB Conference, 527-534.

Eiron, N., & McCurley, K. S. (2003). Analysis of anchor text for web search. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (459-460). New York: ACM Press.

Fagin, R., Kumar, R., McCurley, K.S., Novak, J., Sivakumar, D., Tomlin, J. A., & Williamson, D. P. (2003). Searching the workplace web. In Proceedings of WWW2003.

Fürnkranz, J. (1999). Exploiting structural information for text classification on the WWW. In Proceedings of IDA-99, 3rd Symposium on Intelligent Data Analysis, 487-498.

Glover, E. J., Tsioutsiouliklis, K., Lawrence, S., Pennock, D., & Flake, G. W. (2002). Using web structure for classifying and describing web pages. In Proceedings of WWW2002.

Hawking, D., Voorhees, E., Bailey, P., & Craswell, N. (1999). Overview of TREC-8 Web Track. In Proceedings of TREC-8.

Hiler, J. (2002, March 3). Google time bomb. Microcontent News. Retrieved December 2, 2007 from http://www.microcontentnews.com/articles/googlebombs.htm

Jin, R., Hauptmann, A. G., & Zhai, C. (2002). Title language model for information retrieval. In Proceedings of the 25th ACM SIGIR, 42-48.

Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604-632.

Kraft, R., & Zien, J. (2004). Mining anchor text for query refinement. In Proceedings of WWW2004, 666-674.

Kumar, R., Raghavan, P., Rajagopalan, S. & Tomkins, A. (2001). On semi-automated web taxonomy construction. In Proceedings of the 4th ACM WebDB. New York: ACM Press, 91-96.

Kumar, R., Raghavan, P., Rajagopalan, S., & Tomkins, A. (2006). Core algorithms in the CLEVER System. ACM Transactions on Internet Technology, 6(2), 131-152.

Langville, A. N. & Meyer, C. D. (2006). Google’s PageRank and beyond: The science of search engine rankings. Princeton, NJ: Princeton University Press.

Li, J., Furuse, K., & Yamaguchi, K. (2005). Focused crawling by exploiting anchor text using decision tree. In International World Wide Web Conference, Chiba, Japan, 1190-1191.

Lu, W. H., Chien, L. F., & Lee, H. J. (2002). Translation of web queries using anchor text mining. ACM Transactions on Asian Language Information Processing, 1(2), 159-172.

Lu, W. H., Chien, L. F., & Lee, H. J. (2004). Anchor text mining for translation of web queries: A transitive translation approach. ACM Transactions on Information Systems, 22 (2), 242-269.

Maarek, Y. & Smadja, F. (1989). Full text indexing based on lexical relations. In Proceedings of the 12th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM Press, 198-206.

McBryan, O. (1994). GENVL and WWWW: Tools for taming the web. In First International World Wide Web Conference, Geneva, Switzerland.

Salton, G. & Buckley, C. (1990). Improving retrieval performance for relevance feedback. Journal of the American Society for Information Science, 41(4), 288-297.

Westerveld, T., Kraaij, W., & Hiemstra, D. (2002). Retrieving web pages using content, links, URLs and anchors. In Tenth Text Retrieval Conference, 663-672.
