The Quality of Google Search Results
Ami Isseroff
Sept 28, 2008
Yesterday I wrote about
Google Quality Rater Secrets,
discussing how Google uses humans to rate its search engine results. Improving the quality of search engine results was
the rationale behind the original search engine and ranking algorithm described by the Google founders in The Anatomy of a Large-Scale Hypertextual Web Search Engine
and
The PageRank Citation Ranking: Bringing Order to the Web.
Ten years ago, search engines were poor. Results were often random and they never fit what users were
really looking for, but rather, what Web site owners and search engine companies wanted them to see. Google made a giant
improvement in that situation. At the same time, it was helped by the growth of the Web, which, along with a lot of
junk, has also generated a large amount of high quality information as well as commercial sites where you can buy just
about anything.
I decided to test the Google search engine against its own criteria, which its quality raters use to
judge results. While it is possible to test an infinite number and variety of searches, I decided to concentrate on some
informational searches, including both commercial and non-commercial information. What I found is that Google results
rarely measure up to its own standards, especially when the information may be missing or hard to come by. But even when
the information is there, Google falls short. Sometimes pages that are better, more relevant answers to the query are
pushed off the first page of Google results by blogs and junk. An increasing proportion of pages retrieved by Google for
informational queries are restricted access articles. If you don't belong to JSTOR or Project MUSE through a library or
don't want to pay for the article, you won't get the information. These pages should not be the first ones retrieved by
a search because most people cannot access them. They are a sort of Webspam because they frequently promise information
that is accessible only for members.
For more complex queries, Google evidently simply did not have the answers at all, but would not
"admit it," and poured out oodles of listings that were "off topic" - not relevant to the query at all.
I tried to match the query results to expectations based on
Google's quality rating. None of the results
could be "vital" because I was not looking for any firms. That left the categories: Useful (what you expected to get);
Relevant (has relevant information, but may be too broad or too narrow, or a sub-page of the correct site, or a brief
article); Not relevant (too broad or too narrow to fit the query, or has a link to relevant information but is not
relevant itself); and Off topic (ignores part of the query). For a query about [universities in india], universities in Europe
are Off Topic. As I noted in the article about quality rating, this is a frequent fault of Google results. Here are the
results. You can try the queries yourself and you should get similar results.
Google Search Quality Results
["infant mortality" Ecuador] query
My first query was: ["infant mortality"
Ecuador]. For a query about Ecuador, I would expect to get a site sponsored by the government or an article that is all
about Ecuador. A perfect match for this query about infant mortality in Ecuador would be an article that was wholly
about infant mortality in Ecuador, discussed the reasons for high mortality and progress in lowering it, and gave
statistics for infant mortality over a long period. At minimum, one would expect a page retrieved among the first ten
results to give at least the figures for infant mortality in Ecuador for a single year.
Query Results: Google claimed to retrieve 155,000 pages for this query. Of the first ten pages listed (first
page of search engine results), all had something about the topic, but none met expectations. The top page was "not
relevant" as it was too broad - it was a UN report about general health conditions in Ecuador that listed a single
sentence about infant mortality.
One page was a graph of infant mortality and another statistic. Most of the articles were about health or social
progress in general in Ecuador rather than about infant mortality. They at least had figures for infant mortality, but
were too broad or too narrow and did not have more than a sentence about it. In Google rating terms they
were between "relevant" and "not relevant." One page was restricted access (JSTOR). At least one page, listed as
number 10, must be rated "not relevant" because it is about abortion issues rather than infant mortality - Safe Abortion
Hotline Launched in Anti-Choice Ecuador.
www.rhrealitycheck.org/blog/2008/07/17/safe-abortion-hotline-lauched-antichoice-ecuador
It should really be rated "off topic." A better page than listing number 10 is certainly the Wikipedia page about
Ecuadorian demography that is retrieved as number 16:
Demographics of Ecuador - Wikipedia, the free encyclopedia
en.wikipedia.org/wiki/Demographics_of_Ecuador
That article, however, was not about infant mortality in Ecuador, though it
at least had some relevant information.
While there are articles about infant mortality in Ecuador, none of them
was available in full on the Web. Sometimes just
titles were listed or abstracts were given, with or without the possibility of paid access to the entire article.
["infant mortality" Ecuador 1920] Query
My next query was: ["infant mortality" Ecuador 1920]. Google
claimed to have retrieved over 9,000 results. I could not find a single result that had information about infant
mortality in Ecuador in 1920. The results were less relevant than those for the broader query. The top result was
about living longer in general:
Life Expectancy
www.healthpromoting.com/Articles/articles/expect.htm
It mentioned Ecuador somewhere, and infant mortality somewhere else.
The second result at least had the words "Ecuador" and "infant mortality" in it, but it
is about coding a study of infant mortality statistics, and did not give any actual results:
Codebook for "A New Dataset on Infant Mortality Rates, 1816-2002 ...
anessakimball.com/docs/research/InfantMortalityRate_data/IMR_codebk.pdf
The others were similarly off topic for various reasons. None seems to have had any information about
infant mortality in Ecuador in 1920. A book listing, however, did have information about life expectancy in one province
of Ecuador in the period in question.
[Poland "Gross National Product" 1930] Query
For the query [Poland "Gross National Product" 1930] Google claimed to have found 1,550 pages. Not one of the first 10
listings included the gross national product of Poland for the year 1930. Some were not about Poland at all, most were
not about 1930. Listing #23 was the first listing that had an estimate for the year 1929.
http://books.google.com/books?id=82ncGA4GuN4C&pg=PA22&lpg=PA22&dq=Poland+%22Gross+National+Product%22+1930&source=web&ots=wmM8kGJAL3&sig=5IVOWB0MH9F-aAPCuEEFjxTKLwI&hl=en&sa=X&oi=book_result&resnum=3&ct=result
[antispam software download] query
For the query [antispam software download] Google claimed to have found 1,450,000 pages. Google asked if I
really meant "Anti-Spam" software. That query, however, retrieved far fewer results - about 500,000.
Useful (or full) results for this query should have been pages that offered a choice of products to download. Given
the large number of listings, one would expect good results. Of the results retrieved, three of those listed on the first
page were off topic - not related to SPAM in any interpretation. One is the free AVG anti-Virus and anti-Spyware
product. A second is Windows Defender, which stops popups. Neither of these products protects against mail spam or Webspam,
though they are good products. A third product is anti-Spyware, not related to anti-Spam. Most of the pages were
"relevant" - in that they provided the opportunity to download a single product. One page was "useful" (top rating) - it
gave the opportunity to download a choice of products.
[Free graphics software download] query
Google claimed to have returned 5,720,000 pages for this query. Of the first 10 listings, six were evidently "useful"
(highest quality rating) because they were, as expected, listings of a choice of software. Three were "relevant," in
that they provided a single product (too narrow for query). One was off topic. The product, www.smartdraw.com/, as
frequently happens with such searches, was advertised as a "free download," but in fact the "free" part is only a demo.
In effect, this is Webspam, but Google has no good way to defend against it at present and does not try.
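A tally like the one above can be turned into a single figure per query, which makes first pages easier to compare. The sketch below is only an illustration: the numeric weights and the names `Rating` and `first_page_score` are assumptions made for this example, not anything Google publishes. The counts are the ones reported above for the free graphics query.

```python
from collections import Counter
from enum import Enum


class Rating(Enum):
    """Rating categories as described in this article (the weights are assumed)."""
    USEFUL = 3
    RELEVANT = 2
    NOT_RELEVANT = 1
    OFF_TOPIC = 0


def first_page_score(ratings):
    """Return the summed weights as a fraction of the best possible total (0.0 to 1.0)."""
    if not ratings:
        raise ValueError("no ratings given")
    total = sum(r.value for r in ratings)
    return total / (Rating.USEFUL.value * len(ratings))


# Counts reported above for [Free graphics software download]:
# six useful, three relevant, one off topic.
free_graphics = (
    [Rating.USEFUL] * 6 + [Rating.RELEVANT] * 3 + [Rating.OFF_TOPIC] * 1
)

counts = Counter(r.name for r in free_graphics)
score = first_page_score(free_graphics)
print(counts)
print(f"weighted score: {score:.2f}")
```

With these assumed weights, a first page made up entirely of "useful" listings would score 1.0, so the figure shows how far a given first page falls short of that.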
Google Search Quality: Summary and Conclusions
Five different queries in Google yielded fair to poor results. In no case were the first ten results
listed "Useful" results according to Google criteria. For narrow searches where the information may not exist, Google
presented irrelevant results. The search engine acted like a student who does not know the answer to a question and
produces an "answer" that is vaguely related to the subject. Google doesn't know how to say "I don't know" and doesn't
know when it does not know. Google is not "aware" of important types of information provided in the query. For
example, it can't recognize that 1920 is a date rather than just a set of characters and doesn't process information
accordingly. It is not case sensitive for any query, so it cannot tell the difference between Apple (computer) and
apple (fruit), or Windows (operating system) and windows (glass-covered openings in houses), or Word (word processor)
and word (language unit).
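The case and date points can be illustrated with a toy query normalizer. This is a hypothetical sketch, not a description of how Google actually processes queries: `fold_query` stands in for a case-insensitive engine, and `looks_like_year` is an assumed heuristic for spotting a year in a query.

```python
import re


def fold_query(query):
    """Lowercase and split a query, as a case-insensitive engine would (assumed behavior)."""
    return query.lower().split()


def looks_like_year(token):
    """Assumed heuristic: a four-digit token from 1000 to 2099 is probably a year."""
    return re.fullmatch(r"1\d{3}|20\d{2}", token) is not None


# After folding, "Apple" and "apple" are indistinguishable.
same = fold_query("Apple") == fold_query("apple")

# A date-aware engine could tag the year instead of treating it as plain characters.
tokens = fold_query('"infant mortality" Ecuador 1920')
years = [t for t in tokens if looks_like_year(t)]
print(same, years)
```

Once a token is tagged as a year, an engine could in principle require that the page's figures refer to that year, rather than merely contain the same four characters somewhere.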
The above defects are somewhat hard to remedy. It is much easier to remove restricted results from the
first links presented, because these are in effect SPAM. Likewise, it shouldn't be a problem to slap a big penalty on
those who advertise free software and are really only allowing free demonstration copies. Google should be
applying Webspam criteria more fairly. Misleading Web sites should all be penalized equally. It is also hard to
understand why Google lists products that are not anti-SPAM products when the query asks for anti-SPAM. Part of the
problem seems to be that Google depends too much on its PageRank
algorithm. The pages at the big Web sites, or those that have "Authority,"
even if they don't match the query, seem to push out other pages that have the right answer to the query.
"Authority" seems to be pretty arbitrary. An irrelevant blog article was pushed ahead of a somewhat
relevant Wikipedia article in the query about infant mortality in Ecuador, because the article was about a much more
popular topic - abortions, and therefore got a lot of links.
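The effect described here follows from how a link-based score works: it is computed from the link graph alone, before any query is seen. The sketch below is a simplified power-iteration version of the PageRank idea, not Google's production ranking. The page names and their links are invented for illustration, and the damping factor of 0.85 is the value usually quoted from the founders' paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Pages with no outgoing links spread their score evenly over all pages.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: three sites link to a popular blog post,
# but only one of them links to the page that answers the query.
links = {
    "popular_blog_post": [],
    "relevant_article": [],
    "site_a": ["popular_blog_post"],
    "site_b": ["popular_blog_post"],
    "site_c": ["popular_blog_post", "relevant_article"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

Nothing in this calculation looks at the query, so a heavily linked page keeps its high score whether or not it answers the question being asked.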
Another frequent defect I have encountered when searching in Google is presentation of text that is
not on the linked page at all. This can be because of a deliberate SPAM attempt. Usually though, it is because Google
spidered the first page of a Web log or other updated main page, and then the Web log was updated, so the information
remained only in the permanent article page. The back page was not listed among the top results, presumably because its
PageRank was too low.
Search engine results will never be better than the information on the Web. If there really are no Web
pages all about infant mortality in Ecuador, no search engine will find such pages. But Google and other search engines
should be able to learn to screen out the "non-results" and to at least warn the user when no results match the query.
They can also learn to screen out SPAM and irrelevant pages, no matter how "authoritative" they might seem.
Ami Isseroff
Notice: Copyright
All materials are copyright 2008 by Ami Isseroff. All rights reserved. These pages may not be reproduced in any
form in electronic or printed media without express written permission from the author.