Sunday, January 23, 2011

Google's spam and content farm problem is not "better than it has ever been"

(Update below) I use Google Alerts to keep abreast of certain topics, such as my MIT program and mentions of my name. The automated search results that are emailed to me are interesting, particularly the ones based on my name. They almost always look like this:

The results are garbage, filled with contextually unrelated and bizarre terms such as "Lamont Arizona" or "lamont rupture disk." The links in the screenshot above take users to a page filled with links about Georgia real estate and references to many random terms (including my name), while the Danish site contains scores of random terms that include my last name (such as product or business names -- "Lamont auto", etc.), but no links. They contain no useful information about me or any other topic, yet pages like them are generated every week and added to Google's index. I looked back at my email archive, and found that pages created more than a year ago are still active.

What is their purpose? Pages like this are machine-generated spam, designed to get eyeballs on pages filled with advertisements, or to boost the search-engine ranking of linked sites. Google's wonderful search engine depends on language in headlines and page text as well as inbound links to determine which sites deserve to be at the top of search engine results when certain terms are typed in. That's great for quality sites which have lots of inbound links and deserve to be at the top because they are likely to be the most relevant and useful for users.

The problem is that the system has been effectively reverse-engineered by spammers and content farms, who add little, if any, value with spam pages and poorly written trash or copied content boxed in by ads and affiliate links. In some cases, the pages don't contain any ads, but lots of links to other pages that someone wants to raise in the search engine rankings. Links and search-engine rankings translate to money if the keyword is popular or relates to something that people research online with the intention of buying. The ultimate prize for the spammers and content farms is getting their garbage on the first page of Google search results. The fact that quality pages (or the original content) that people are more likely to be interested in are pushed down or off the page is of little concern to them.

So I read with some interest a post by Google's Matt Cutts on the company's recent efforts to fight the problems described above. He said:
"January brought a spate of stories about Google’s search quality. Reading through some of these recent articles, you might ask whether our search quality has gotten worse. The short answer is that according to the evaluation metrics that we’ve refined over more than a decade, Google’s search quality is better than it has ever been in terms of relevance, freshness and comprehensiveness."
Say what? I don't know what sorts of metrics Google is using, but I see poor-quality results showing up in almost all of the searches that I perform. Not always on the first page -- on a Google search for my own name, there are enough high-quality results (my blogging and social networking activity, plus references to other people with the same name) to push the garbage off the first page of search engine results. Still, starting at the top of page 3 of the results, I see results for bogus/scammy paid ringtone and "fast download" schemes attached to copyrighted technical mp3s that I produced for Computerworld when I was an editor there. Besides being illegal, such pages are not the original source of the mp3s (the originating site's pages are), yet the scraped content repackaged into paid services ranks higher than the originating site, which offers the mp3s for free.

For popular terms, however, the garbage routinely outranks the real thing. For instance, when you search for "online education," affiliate garbage dominates the first page of results. I am sure practically everyone reading this post has had a similar experience: attempting to conduct some serious research using Google, only to be presented on the first page of results with trick sites or utter drivel. It wastes users' time, and in some cases gives people false or misleading information. Organized hacking and crime rings have also joined the party -- for one of the bogus mp3 sites I observed, the credit card payment server is located in Russia. How many innocent people have attempted to pay for something through this system, and have ended up having their credit card information stolen or malware downloaded to their computers?

I also found it strange that Google is bragging about quality while the main problems that people were criticizing the company for in December and January -- low-quality content farms and scrapers -- still clog up search results (see one humorous example described on this Hacker News thread). Certain keywords are basically useless for finding quality content on the first page of results (Wikipedia sometimes ranks high, but the quality is often questionable). SEO-driven content farms have simply taken over.

To Cutts' credit, he tried to respond to some of the questions and criticisms on Hacker News, but it is premature to crow about "quality" when spam, low-quality information, and other garbage fill search engine results. The problem clearly has not been fixed.

Update: Matt Cutts and I debated the definition of "quality" and what an increase in spam and content farms does to Google's quality metrics. Part of the Twitter thread can be accessed here.

Disclosure: I am not a spammer or content farm, but I do use Google Adsense and Amazon Associates, and rewrite my headlines to improve their search engine ranking. 

Sources and research: Google, Paid Content, Techmeme, my own experience.
