Search Engine Optimization

HITS search algorithm



HITS search algorithm - HITS (an acronym for Hyperlink-Induced Topic Search) is a search algorithm developed by Jon Kleinberg to find the most authoritative pages for a broad search query. It is based on scoring pages either as "hubs", meaning pages that link to numerous authoritative Web pages for a given topic, or as "authorities", meaning Web pages that are pointed to by the hub pages.

Kleinberg reasoned that there were two types of search queries, specific and broad. For specific queries, the problem is to find any pages that match the criteria. For broad queries, there is an abundance problem: too many pages match the criteria, so the problem is to find the most relevant ones for that query. He further argued that the content of a page is not always the best indicator of its relevance to the topic. The Harvard University home page does not contain the word "Harvard" in especially high concentrations and would not ordinarily rank very high for the keyword "Harvard". However, the pages linking to that page would have a high concentration of that word. Presumably, if they also linked to many other similar pages, they would be "hubs" for the keyword. In Kleinberg's system, relevant candidate pages are first identified using a search in a simpler search engine. The algorithm then identifies hubs and authorities among those candidates and computes separate scores for each, depending on the links they give or receive among the pages found for the specific keyword. A page that is pointed to by many hubs gets a high authority value, while a page that points to many high-authority pages gets a high hub value.
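The hub and authority scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not Kleinberg's implementation: the function name `hits`, the tiny link graph, the fixed number of iterations and the choice to scale scores to unit length are all assumptions made for the example. It also assumes the candidate pages and their links have already been collected by some other search step.

```python
import math


def _unit(scores):
    # Scale scores so their squares sum to 1 (keeps values from growing without bound).
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=20):
    """links maps each page to the pages it links to (hypothetical input format)."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority value: sum of the hub values of the pages pointing to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub value: sum of the authority values of the pages it points to.
        new_hub = {p: sum(new_auth[d] for d in links.get(p, ())) for p in pages}
        auth, hub = _unit(new_auth), _unit(new_hub)

    return hub, auth


# Made-up candidate set: three link pages pointing at two content pages.
sample_links = {
    "hub1": ["a", "b"],
    "hub2": ["a", "b"],
    "hub3": ["a"],
    "a": [],
    "b": [],
}
hub_scores, auth_scores = hits(sample_links)
```

In this made-up graph, page "a" ends up with the highest authority value because all three link pages point to it, and "hub1" and "hub2" receive higher hub values than "hub3" because they point to both content pages. Every page in the sketch receives both a hub value and an authority value.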

The algorithm becomes increasingly useful as the Web expands, because increasingly specific queries have a sufficient number of corresponding pages that might be considered authorities or hubs, even if the query may seem to be relatively abstruse.

HITS bears some similarity to the Hilltop search algorithm. It differs from Google PageRank because it is a way of ranking authority for a specific keyword, and because in implementation it is done "on the fly" when the query is processed, rather than when the page is processed. Google has evidently adopted some criteria that are based on the content of anchor text in links to a page, since a page can be brought to the #1 position in Google for a query solely on the basis of anchor text.

A disadvantage of the algorithm is that, because it is executed at query time, it may perform very poorly, depending on the number of iterations required to rank the hubs and authorities.
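To make the cost concrete, the following sketch counts how many rounds of score updates are needed before the authority values stop changing. The function name `hits_iteration_count`, the tolerance `tol`, the cap `max_iter` and the sample graphs are hypothetical, chosen only for illustration. Each round visits every link once, so the work done at query time grows with both the number of links and the number of rounds.

```python
import math


def _unit(scores):
    # Scale scores so their squares sum to 1.
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits_iteration_count(links, tol=1e-9, max_iter=1000):
    """Return the number of update rounds until authority values change by less than tol."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    if not pages:
        return 0

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for round_number in range(1, max_iter + 1):
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        new_hub = {p: sum(new_auth[d] for d in links.get(p, ())) for p in pages}
        new_auth, new_hub = _unit(new_auth), _unit(new_hub)

        # Largest change in any authority value during this round.
        change = max(abs(new_auth[p] - auth[p]) for p in pages)
        hub, auth = new_hub, new_auth
        if change < tol:
            return round_number

    return max_iter  # Gave up: scores had not settled within max_iter rounds.


# A single link settles almost immediately; a denser graph needs more rounds.
rounds_simple = hits_iteration_count({"h": ["a"]})
rounds_denser = hits_iteration_count(
    {"hub1": ["a", "b"], "hub2": ["a", "b"], "hub3": ["a"], "a": [], "b": []}
)
```

The number of rounds depends on the link structure of the candidate pages, which is not known until the query arrives. That is why the running time is hard to bound in advance.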

A second disadvantage of the algorithm is that it may be making assumptions about the structure of the Web that no longer hold true. In the early days, there were many Web pages that were legitimate collections of links. This was due to the poor quality of search results. With the advent of better search engine technology, these collections have assumed a minor role, and have generally resolved themselves into:

1- Legitimate human-edited directories like www.dmoz.org

2- Link farms and directories for exploitation purposes.

3- "Link bait" pages that have useful links in order to draw links from others.

4- Pages that are themselves authoritative, but also include a number of external links, such as Wikipedia.

5- Link pages based on mutual link exchanges. Like "link bait" collections, these can be legitimate sources of authority and relevance in some cases. Political action groups are going to exchange links with like-minded groups.  

The algorithm doesn't seem to take account of case number 4: "hubs" that are also authorities, and which are perhaps the best indicators of good external pages. On the other hand, results might be unduly influenced by link farms. It might be possible to do away with the tricky identification of hubs and the iterative ranking simply by relying on a number of good directories like the dmoz Open Directory Project.

References

Kleinberg, J. "Authoritative sources in a hyperlinked environment", Journal of the ACM 46:5:604-632, 1999.

US Patent 6112202  Aug 29, 2000.

Note - Definitions of Search Engine Optimization terms are based on inferences from common usage and definitions given by other sources. Conclusions about search engine behavior are based on understanding of the behavior of the most popular search engines. Both are subject to error or may change. Search engine company management may define or use a term or set or change any policy in any way they see fit, and may make these definitions and specifications public or not. These decisions and definitions are beyond our control.  

Notice: Copyright

All materials are copyright 2008 by Ami Isseroff. All rights reserved. These pages may not be reproduced in any form in electronic or printed media without express written permission from the author.


SEO - Web Site Search Engine Optimization Contact: Webmaster(at)Yu-hu.com