Sometime around January or February, a number of webmasters began to notice that Google had somehow “lost” huge portions of their websites. References to their sites, generally the index pages and a seemingly random selection of internal pages, still existed in Google’s listings, but pages that once drove sizable amounts of traffic appeared to vanish into the ether. As February rolled into March, more reports were posted to blogs and forums by frustrated webmasters who noticed that the number of pages from their sites had declined significantly in Google’s index.

Many SEO firms, including StepForth, received information requests and research projects from clients who wanted to know what had happened to their sites. In all cases, we did the best we could but, given the obvious complexity of the update and the lack of fresh information from Google, recommendations given during this period resembled shotgun-style SEO advice more than the laser focus most of us would normally prefer to offer our clients. As is the case with most major updates, investigation as often as not leads to more questions.

Matt Cutts, Google’s Search Quality Officer and #1 communicator, answered many of those questions yesterday in an open and wide-ranging post titled “Indexing Timeline”.

The post outlines how Google staff have examined and responded to webmasters’ queries and complaints stemming from the Bigdaddy update. It also addresses a number of issues facing webmasters who have seen sections of their sites disappear from the SERPs, including the quality of both in-bound and out-bound links, irrelevant reciprocal linking schemes, and duplicate text found on vertical reference and affiliate sites.

According to his timeline, on March 13, GoogleGuy asked webmasters to offer example sites for Google’s analysis in a post at WebmasterWorld. Commenting on the sites offered up for examination, Cutts wrote,

“After looking at the example sites, I could tell the issue in a few minutes. The sites that fit “no pages in Bigdaddy” criteria were sites where our algorithms had very low trust in the inlinks or the outlinks of that site. Examples that might cause that include excessive reciprocal links, linking to spammy neighborhoods on the web, or link buying/selling. The Bigdaddy update is independent of our supplemental results, so when Bigdaddy didn’t select pages from a site, that would expose more supplemental results for a site.”

That quote covers a lot of ground but it explains a great deal of Google’s post-Bigdaddy behaviour.

Google bases its ranking algorithm on trust. That might sound naïve to the uninformed, but we are discussing one of the most informed electronic entities that has ever existed. Google also keeps historic records on every item contained in its index. Though it bases its opinions on a baseline of trust, those opinions are extremely well informed.

In order to remain continually informed, it spiders everything it can and sorts the data later. Google maintains a massive number of indexes including one known as the supplemental index. The supplemental index is a much larger representation of documents found on the web than those included in the main Google index.

“We’re able to place fewer restraints on sites that we crawl for this supplemental index than we do on sites that are crawled for our main index. For example, the number of parameters in a URL might exclude a site from being crawled for inclusion in our main index; however, it could still be crawled and added to our supplemental index.” (source: Google Help Center)
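The Help Center quote suggests a crawler-side routing decision based on, among other things, how many query parameters a URL carries. The sketch below illustrates that idea in Python; the threshold of 3 and the queue names are invented for illustration, not documented Google values.

```python
# A minimal sketch of the kind of heuristic the Help Center hints at:
# URLs with many query parameters are routed to a secondary crawl queue.
# The max_params threshold is an illustrative assumption, not Google's.
from urllib.parse import urlparse, parse_qs

def crawl_queue(url, max_params=3):
    """Route a URL to the 'main' or 'supplemental' crawl queue by parameter count."""
    params = parse_qs(urlparse(url).query)
    return "main" if len(params) <= max_params else "supplemental"

print(crawl_queue("http://example.com/products?id=7"))
# -> main
print(crawl_queue("http://example.com/p?id=7&sess=9f2&sort=asc&ref=rss&page=2"))
# -> supplemental
```

A dynamic site stuffing session IDs, sort orders, and referral tags into every URL would, under a heuristic like this, see many of its pages land in the less restrictive supplemental index rather than the main one.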

Many of the results that seemed to vanish are assumed to have been drawn from the supplemental index before the update. “A supplemental result is just like a regular web result, except that it’s pulled from our supplemental index”.

As Cutts is quoted saying above, Bigdaddy results are separate from supplemental results. When a reference to a site is found in the main (Bigdaddy) results, Google does not necessarily dip into supplemental results as often as it might have previously.

Quality On, Quality In and Quality Out

Google has gotten better at judging the quality of content found on a document and within a site. Content includes text, images, titles, tags and both inbound and outbound links. Google has consistently said that well-built sites offering quality information and a positive user experience should perform well throughout its search indexes, and it provides a wealth of information via the Google Help Center and through its webmaster-focused spokespersons, Cutts and GoogleGuy.

As Google has gotten better at determining the origin and history of content found in its various indexes, it tries to snip away at duplicate forms of on-site content, with the goal of listing the most trustworthy sites under any given user query in the main index.

Having been inundated over the years with multiple replications of what was already considered duplicate content, Google (and other search engines) has gotten very good at knowing whether it has already indexed similar or duplicate content. Google is capable of examining text (including individual paragraphs), images and link networks (in- and outbound links), looking for telltale signs of duplicate content.
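One well-known family of techniques for spotting near-duplicate text, described publicly in the information-retrieval literature as w-shingling, compares documents by their overlapping word windows. The sketch below is illustrative only; the window size, the similarity threshold a search engine would apply, and the sample sentences are all assumptions, not Google’s actual method or parameters.

```python
# A minimal near-duplicate sketch using w-shingling (hashed word windows)
# and Jaccard similarity. Window size w=4 is an arbitrary illustration.
def shingles(text, w=4):
    """Return the set of hashed w-word shingles for a document."""
    words = text.lower().split()
    return {hash(" ".join(words[i:i + w])) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b) if a | b else 0.0

doc1 = "fresh organic plants delivered from our local nursery every week"
doc2 = "fresh organic plants delivered from our local nursery every day"
doc3 = "quit smoking forum with hair care and credit card offers"

print(jaccard(shingles(doc1), shingles(doc2)))  # high: near-duplicates
print(jaccard(shingles(doc1), shingles(doc3)))  # zero: unrelated text
```

Two pages pulling boilerplate from the same product feed would score very high on a measure like this, which is exactly the situation the next paragraph describes.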

If, for example, it perceives a site displaying product information pulled from the same product database that 25,000 other sites pull duplicate product information from, Google is not likely to rank that site well. Similarly, if it finds duplicate networks of reciprocal links shared among several pages in its index, it is not likely to assign a high trust value to that document.

Reciprocal linking strategies

“As these indexing changes have rolled out, we’ve improved how we handle reciprocal link exchanges and link buying/selling.”

Though Cutts points at reciprocal linking as an indicator to Google that there might be issues with a website’s credibility, that doesn’t automatically mean that all reciprocal links are going to cause problems for webmasters. Common sense and the value of delivering a quality user experience should dictate decisions around link strategies.

For example, if a professional landscaper provided links to plant nurseries in his or her region, and those nurseries in turn provided links to that landscaper, Google would likely consider those to be quality links. There is a direct relevance between the two sources of information. A network of links shared between local landscaping businesses, nurseries, horticultural institutes, permaculture initiatives, non-profit volunteer groups and gardening centers would also likely be judged beneficial to Google users and not subject to supplemental penalization.

On the other hand, a network of obviously purchased or traded links between anyone willing to exchange them, regardless of relevancy or direct user benefit, is likely to trip any number of filters present in the Bigdaddy/Jagger upgrades.

Cutts provided an example of a simple error made by a real estate site. Along with a number of internal reference links to exotic properties displayed as a footer-style site map, Cutts found several out-bound links with anchor text reading Credit Cards, Quit Smoking Forum, and Hair Care. When he refreshed the page, he saw a similar set of links, only this time the out-bound links were directed towards mortgage sites, credit card sites, and exercise equipment. Cutts commented, “…if you were getting crawled more before and you’re trading a bunch of reciprocal links, don’t be surprised if the new crawler has different crawl priorities and doesn’t crawl as much.”

Affiliate Text and Content

Cutts devoted a long paragraph to affiliate text, mentioning a T-shirt site that once had about 100 pages indexed, a number recently reduced to only 5.

“The person said that every page has original content, but every link that I clicked was an affiliate link that went to the site that actually sold the T-shirts. And the snippet of text that I happened to grab was also taken from the site that actually sold the T-shirts. The site has a blog, which I’d normally recommend as a good way to get links, but every link on the blog is just an affiliate link. The first several posts didn’t even have any text, and when I found an entry that did, it was copied from somewhere else. So I don’t think that the drop in indexed pages for this domain necessarily points to an issue on Google’s side. The question I’d be asking is why anyone would choose your “favourites” site instead of going directly to the site that sells T-shirts?”

The Ghosts of minutes past

We live in the present. Our websites live in the past as well as the present. Google keeps tabs on all documents in its index, and even if it has “spidered content that was posted only moments before,” it has an elephant’s memory for previous details and a computer’s ability to pull lots of information together to get a bigger picture of how all those details fit together.

Google works by following links. Google ranks by examining the quality of content found on a site and also on the sites that link into, or are linked to from, sites in its indexes. If you have seen a great deal of page content fall away from Google’s index, or if you are just generally interested in how Google is working, read Cutts’ Bigdaddy “Indexing Timeline”.