iSnare.com - Free Content Articles Directory

Pushing Bad Data- Google’s Latest Black Eye

 
Eric Lester

In September 2005, after a schoolyard "measuring contest" with rival Yahoo, Google stopped counting, or at least publicly displaying, the number of pages it had indexed. That count topped out around 8 billion pages before it was removed from the homepage. News broke recently through various SEO forums that Google had suddenly, over the past few weeks, added another few billion pages to the index. This might sound like a reason for celebration, but this "accomplishment" does not reflect well on the search engine that achieved it.

What had people buzzing was the nature of those few billion fresh pages. They were blatant spam, containing Pay-Per-Click (PPC) ads and scraped content, and in many cases they were ranking well in the search results, pushing out far older, more established sites. A Google representative responded on the forums by calling the issue a "bad data push," an explanation that met with groans throughout the SEO community.

How did someone manage to dupe Google into indexing so many pages of spam in such a short period of time? I'll provide a high-level overview of the process, but don't get too excited. Just as a diagram of a nuclear explosive won't teach you how to build the real thing, you won't be able to run off and do this yourself after reading this article. Still, it makes for an interesting tale, one that illustrates the ugly problems cropping up with ever-increasing frequency in the world's most popular search engine.

A Dark and Stormy Night
Our story begins deep in the heart of Moldova, sandwiched scenically between Romania and Ukraine. In between fending off local vampire attacks, an enterprising local had a brilliant idea and ran with it, presumably away from the vampires... His idea was to exploit how Google handled subdomains, and not just a little bit, but in a big way.

The heart of the issue is that Google currently treats subdomains much the same way it treats full domains: as unique entities. This means it will add the homepage of a subdomain to the index and return at some later point to do a "deep crawl." A deep crawl is simply the spider following links from the domain's homepage deeper into the site until it finds everything, or gives up and comes back later for more.
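As a rough illustration of the "deep crawl" idea, here is a minimal sketch of a breadth-first crawl over an in-memory link graph. It is not Google's crawler: the `deep_crawl` function, the `budget` parameter (standing in for "gives up and comes back later"), and the toy `SITE` graph are all assumptions made up for this example.

```python
from collections import deque

def deep_crawl(links, homepage, budget):
    """Follow links breadth-first from `homepage`, visiting at most `budget` pages.

    `links` maps each page to the pages it links to (a hypothetical toy site).
    Returns (pages visited in order, pages still queued for a later visit).
    """
    seen = {homepage}
    queue = deque([homepage])
    visited = []
    while queue and len(visited) < budget:
        page = queue.popleft()
        visited.append(page)
        for target in links.get(page, []):
            if target not in seen:      # never queue the same page twice
                seen.add(target)
                queue.append(target)
    return visited, list(queue)

# Hypothetical five-page site used only for illustration.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/"],
}

visited, pending = deep_crawl(SITE, "/", budget=3)
print(visited)   # pages reached before the budget ran out
print(pending)   # pages left for the next visit
```

With a small budget the crawl stops partway and leaves a queue behind; with a large enough budget it reaches every linked page.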

Briefly, a subdomain is a "third-level domain." You've probably seen them before; they look something like this: subdomain.domain.com. Wikipedia, for instance, uses them for languages: the English version is "en.wikipedia.org" and the Dutch version is "nl.wikipedia.org." Subdomains are one way to organize large sites, as opposed to multiple directories or even separate domain names altogether.
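To make the "third-level domain" idea concrete, the sketch below splits a hostname into its subdomain and its domain. The `split_host` helper is hypothetical, and it assumes a single-label top-level domain such as .com or .org; multi-part suffixes like .co.uk would need a public-suffix list, which this example deliberately leaves out.

```python
def split_host(host):
    """Split a hostname into (subdomain, domain).

    Assumes the domain is the last two labels (e.g. "wikipedia.org").
    This is a simplification and is wrong for suffixes such as "co.uk".
    """
    labels = host.lower().rstrip(".").split(".")
    domain = ".".join(labels[-2:])
    subdomain = ".".join(labels[:-2])   # empty string when there is no subdomain
    return subdomain, domain

print(split_host("en.wikipedia.org"))
print(split_host("nl.wikipedia.org"))
print(split_host("wikipedia.org"))
```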

So, we have a kind of page Google will index virtually "no questions asked." It's a wonder no one exploited this situation sooner. Some commentators believe the reason may be that this "quirk" was introduced after the recent "Big Daddy" update. Our Eastern European friend got together some servers, content scrapers, spambots, PPC accounts, and some all-important, very inspired scripts, and mixed them all together thusly...

Five Billion Served- And Counting...
First, our hero crafted scripts for his servers that would, when GoogleBot dropped by, start generating an essentially endless number of subdomains, each with a single page containing keyword-rich scraped content, keyworded links, and PPC ads for those keywords. Spambots were then sent out to put GoogleBot on the scent via referral and comment spam on tens of thousands of blogs around the world. The spambots provided the broad setup, and it didn't take much to get the dominoes to fall.

GoogleBot finds the spammed links and, as is its purpose in life, follows them into the network. Once GoogleBot is inside, the scripts running the servers simply keep generating pages, page after page, each on a unique subdomain, each with keywords, scraped content, and PPC ads. These pages get indexed, and suddenly you've got yourself a Google index 3-5 billion pages heavier in under 3 weeks.
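The telltale shape of this scheme is a single registered domain answering under an enormous number of distinct hostnames. The sketch below counts unique subdomains per domain in a list of crawled hosts and flags domains above a cutoff. It is only an illustration of that pattern, not how Google detects or handles it; the function names, the `threshold` value, and the sample hosts (built on the reserved example.com domain) are assumptions, and the two-label domain split is the same simplification as before.

```python
from collections import defaultdict

def subdomains_per_domain(hosts):
    """Count distinct subdomains seen for each two-label domain."""
    seen = defaultdict(set)
    for host in hosts:
        labels = host.lower().split(".")
        domain = ".".join(labels[-2:])
        subdomain = ".".join(labels[:-2])
        if subdomain:                    # ignore bare domains
            seen[domain].add(subdomain)
    return {domain: len(subs) for domain, subs in seen.items()}

def flag_floods(hosts, threshold):
    """Return domains with more than `threshold` distinct subdomains."""
    counts = subdomains_per_domain(hosts)
    return sorted(d for d, n in counts.items() if n > threshold)

# Hypothetical crawl log: two ordinary hosts plus 1,000 generated ones.
hosts = ["en.wikipedia.org", "nl.wikipedia.org"]
hosts += [f"kw{i}.example.com" for i in range(1000)]

print(subdomains_per_domain(hosts))
print(flag_floods(hosts, threshold=100))
```

A fixed cutoff like this would misfire on legitimate sites that use many subdomains, so any real system would need more signals than a raw count.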

Reports indicate that, at first, the PPC ads on these pages were from AdSense, Google's own PPC service. The ultimate irony, then, is that Google benefits financially from all the impressions being charged to AdSense users as they appear across these billions of spam pages. The AdSense revenues from this endeavor were the point, after all: cram in so many pages that, by sheer force of numbers, people would find and click on the ads in those pages, making the spammer a nice profit in a very short amount of time.

Billions or Millions? What is Broken?
Word of this achievement spread like wildfire from the DigitalPoint forums, at least within the SEO community. The "general public" is, as of yet, out of the loop, and will probably remain so. A response by a Google engineer appeared on a Threadwatch thread about the topic, calling it a "bad data push". Basically, the company line was that they have not, in fact, added 5 billion pages. Later claims include assurances that the issue will be fixed algorithmically. Those following the situation (by tracking the known domains the spammer was using) see only that Google is removing them from the index manually.

The tracking is accomplished using the "site:" command, which theoretically displays the total number of indexed pages from the site you specify after the colon. Google has already admitted there are problems with this command, and "5 billion pages", they seem to be claiming, is merely another symptom of it. These problems extend beyond the site: command to the displayed number of results for many queries, which some feel are highly inaccurate and in some cases fluctuate wildly. Google admits they have indexed some of these spammy subdomains, but so far they haven't provided any alternate numbers to dispute the 3-5 billion shown initially via the site: command.
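For readers who have not used it, a site: query is just an ordinary search whose query text is "site:" followed by the domain. The sketch below builds the search URL for such a query; the `site_query_url` helper and the example.com domain are illustrative assumptions, and the code does not fetch or parse the result count.

```python
from urllib.parse import urlencode

def site_query_url(domain):
    """Build a Google search URL for the query "site:<domain>"."""
    return "https://www.google.com/search?" + urlencode({"q": f"site:{domain}"})

print(site_query_url("example.com"))
```

Opening the resulting URL in a browser shows the estimated number of indexed pages for that domain, which is the figure the trackers were watching.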

Over the past week the number of spammy domains and subdomains indexed has steadily dwindled as Google personnel remove the listings manually. There's been no official statement that the "loophole" is closed. This poses an obvious problem: now that the way has been shown, a number of copycats will rush to cash in before the algorithm is changed to deal with it.

Conclusions
There are, at minimum, two things broken here: the site: command, and the obscure, tiny bit of the algorithm that allowed billions (or at least millions) of spam subdomains into the index. Google's current priority should probably be to close the loophole before they're buried in copycat spammers. The issues surrounding the use or misuse of AdSense are just as troubling for those who might be seeing little return on their advertising budget this month.

Do we "keep the faith" in Google in the face of these events? Most likely, yes. The question is not so much whether they deserve that faith as the fact that most people will never know this happened. Days after the story broke there's still very little mention in the "mainstream" press. Some tech sites have mentioned it, but this isn't the kind of story that will end up on the evening news, mostly because the background knowledge required to understand it goes beyond what the average citizen is able to muster. The story will probably end up as an interesting footnote in that most esoteric and neoteric of worlds, "SEO History."


Mr. Lester served for 5 years as webmaster for ApolloHosting.com and previously worked in the IT industry an additional 5 years. Apollo Hosting provides website hosting, ecommerce hosting, vps hosting, and web design services to a wide range of customers.

Article published on June 26, 2006 at Isnare.com
 

© 2004-2009. Isnare Free Articles - An Isnare Online Technologies Free Articles Project. All Rights Reserved.   Privacy Policy