I am in disbelief. Less than 40 minutes ago the Official Google Webmaster Central Blog announced that Google can now fill out web forms and spider the resulting content. Previously, search engines not only avoided doing this, it was widely held that such content would be useless since it wouldn’t ‘necessarily’ be formatted for the eyes of searchers. Apparently Google is now casting that assumption aside.

How to Block Google’s Spider From Your Form
From the announcement it appears that Google is not yet spidering forms on a widespread basis. Here is a quote that sums up their policy on forms quite nicely:

“Only a small number of particularly useful sites receive this treatment, and our crawl agent, the ever-friendly Googlebot, always adheres to robots.txt, nofollow, and noindex directives. That means that if a search form is forbidden in robots.txt, we won’t crawl any of the URLs that a form would generate. Similarly, we only retrieve GET forms and avoid forms that require any kind of user information. For example, we omit any forms that have a password input or that use terms commonly associated with personal information such as logins, userids, contacts, etc. We are also mindful of the impact we can have on web sites and limit ourselves to a very small number of fetches for a given site.”
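To illustrate the kind of form the quote describes, here is a hypothetical GET search form (the action path and field name are made up for this sketch); submitting it just builds a URL, which is exactly what Googlebot can now fetch:

```html
<!-- Hypothetical example: a GET form. Submitting "widgets" produces the
     URL /search?q=widgets, a plain URL Googlebot can crawl like any link. -->
<form action="/search" method="get">
  <input type="text" name="q">
  <input type="submit" value="Search">
</form>
```

Per their policy, a form like this could be crawled, while a POST form, or one containing a password field or inputs named things like "login" or "userid", would be skipped.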

So in short, if you want to block Googlebot from your form, the easiest ways are to use a CAPTCHA or to block the page entirely from spiders using your robots.txt file.
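Since the announcement says Googlebot honors robots.txt, a minimal sketch of the blocking approach might look like this (assuming, hypothetically, that your form lives at /search.html and generates result URLs under /search):

```text
# robots.txt — block all well-behaved crawlers from the form page
# and from any form-generated result URLs
User-agent: *
Disallow: /search.html
Disallow: /search
```

Alternatively, since the quote says Googlebot also respects noindex, you could leave the form crawlable but add `<meta name="robots" content="noindex">` to the result pages themselves.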

My Thoughts
Interesting indeed. I don’t really see how this kind of data would be useful to Google, but as the post states, they do consider whether the content is of any use before adding it to their index. I suppose this is yet another corner of the Internet that Google wants to be sure it isn’t missing in its never-ending quest to index the world’s information.

Special thanks to Google Reader for bringing this breaking news to my doorstep 🙂 I love technology!