
It used to be part of SEO 101 that search engines don't use forms. For some time now, however, Google has been experimenting with filling in and submitting forms, and the resulting pages are already in its index. When Google started, dynamic sites were the exception; today, even a simple personal website is more likely to be a dynamic WordPress blog than a static GeoCities page. With dynamic pages comes the ever-present temptation to add forms to everything, and that's just the kind of thing that attracts the innovative engineers at Google.
Google crawls forms by taking keywords from your page, inserting them into the form's fields, and submitting the form to its landing page. By doing this, Google hopes to uncover content that is languishing in your database, or pages of results pointing to information that no currently crawled page links to directly.
Under no circumstances should a webmaster rely on Google's form crawling behaviour; you cannot control it, and it is not a good way to get your content indexed. If you do not wish to link to all your pages directly, use a sitemap in XML format; Google can then use it to find your otherwise hidden content.
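As a rough illustration, a minimal XML sitemap following the sitemaps.org protocol might look like the fragment below. The example.com URLs are placeholders, not pages from any real site; list your own unlinked pages instead.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per page you want crawlers to find (placeholder URLs) -->
  <url>
    <loc>https://www.example.com/products/widget-42</loc>
  </url>
  <url>
    <loc>https://www.example.com/articles/archive-2008</loc>
  </url>
</urlset>
```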
Treat Google's form crawling as a helping hand for sites which aren't as spider-friendly as they could be, and nothing more. It is not a licence to use forms exclusively for navigation; such a practice is still considered harmful.
If you've thought out your SEO well, Google will already know about all the content you want to share, and any other URLs are most probably duplicate content or error messages you wouldn't want crawled in any case.
For the well-designed site, form crawling by Google is unhelpful at best. Here are a few methods which can help stop not only Google but also other automated spiders that attempt to crawl forms dead in their tracks:
A typical HTML form tag specifies an 'action' attribute, the location to which the form should be submitted. A simple way to confuse automated form crawling is to leave the action blank in the markup and then use client-side JavaScript to set it when the form is submitted.
Here's an example:
<form method="POST" action="">
<input type="text" name="search">
<!-- The real action is set only when the submit button is activated -->
<input type="submit" value="search"
onclick="this.form.action='form_processor.php';">
</form>
The simple code above changes your form action to 'form_processor.php' at the last moment, just as the form is submitted. At the time of writing, Google's form crawling code does not run this client-side JavaScript, so it will try to submit to the current page, a most unfruitful experience.
Google and other polite search engines respect robots.txt, a file which resides in the root of your server and specifies which paths crawlers may access. Set the action of your form to a script within a directory you have Disallow-ed to spiders and they will not submit the form.
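A minimal robots.txt for this approach could look like the fragment below. The /forms/ directory name is only an assumption for illustration; use whichever directory actually holds your form-processing scripts.

```
# Applies to every crawler that honours robots.txt
User-agent: *
# Hypothetical directory containing form_processor.php
Disallow: /forms/
```

The form would then point at a script inside that directory, for example action="/forms/form_processor.php".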
Knowing that the form is of no use to a search engine, you can inspect the User-Agent header sent by each visitor (available in PHP as $_SERVER['HTTP_USER_AGENT']) and simply not output forms to known search engine spiders. This requires you to keep a list of search engine user agents (the string which identifies the browser or source of the visit) and check each visit against it.
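The check described above can be sketched as follows. The article mentions PHP; this sketch uses server-side JavaScript instead, to stay in the same language as the earlier client-side example. The signature list and the names isKnownSpider and renderSearchForm are illustrative assumptions, not a complete or authoritative list of crawlers.

```javascript
// Assumed, incomplete list of substrings that appear in the user agent
// strings of some well-known crawlers. Extend it to suit your own traffic.
const SPIDER_SIGNATURES = ["googlebot", "bingbot", "slurp", "baiduspider", "yandexbot"];

// Returns true when the user agent string contains a known crawler signature.
function isKnownSpider(userAgent) {
  if (!userAgent) {
    return false; // a missing header is treated as an ordinary visitor
  }
  const ua = userAgent.toLowerCase();
  return SPIDER_SIGNATURES.some((signature) => ua.includes(signature));
}

// Returns the form markup for ordinary visitors and an empty string for spiders.
function renderSearchForm(userAgent) {
  if (isKnownSpider(userAgent)) {
    return "";
  }
  return '<form method="POST" action="form_processor.php">' +
    '<input type="text" name="search">' +
    '<input type="submit" value="search">' +
    "</form>";
}
```

In a real page you would pass the request's User-Agent header to renderSearchForm and write the result into the response. Bear in mind that user agent strings are easily spoofed, so this only deters crawlers that identify themselves honestly.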
You can decrease the likelihood of your forms being crawled by not using conventional field names. In place of 'search', use an unrecognisable string of characters or a number. This is the weakest of the suggested techniques.