In an effort to understand how the simple robots.txt file works for search engine crawlers, and all of its uses, I've run across a couple of very interesting things.
Let's start with how a web search engine crawls. The crawler goes to the website and starts at the main page. Off this main page there are normally many HTML links. The crawler follows all of these links, which lead to other links, and those to still others; in this way the crawler develops a picture of the entire site. The catch is that this picture may not contain every page on the site.
Enter the Deep Web. This is the part of the web that normal search engine crawlers don't see. A page that isn't linked from anywhere on the main site can't be found by a crawler. There's no way for the crawler to know what that page's name is, aside from guessing, and that could take a very large number of guesses. The crawler program has other things to do, like mapping sites elsewhere.
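To make the crawling idea concrete, here is a minimal sketch in Python. It does not touch the network; the site is a made-up dictionary of pages and the links on each one (every page name here is hypothetical). The crawler starts at the main page and follows links until it runs out of new ones, and the page nothing links to is never discovered.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
SITE_LINKS = {
    "/": ["/about", "/articles"],
    "/about": ["/"],
    "/articles": ["/articles/1", "/articles/2"],
    "/articles/1": ["/articles"],
    "/articles/2": [],
    "/unlinked-report": [],  # exists on the server, but nothing links to it
}

def crawl(start, links):
    """Follow links breadth-first from start and return every page reached."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

found = crawl("/", SITE_LINKS)
missed = set(SITE_LINKS) - found  # pages the crawler never saw

print(sorted(found))
print(sorted(missed))
```

A real crawler would fetch each page and parse the HTML for links, but the blind spot is the same: if no link leads to a page, the crawl never reaches it.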
There are attempts to find these hidden pages. Protocols developed by Google and others attempt to put the onus on website administrators. The owner of the site has to decide what content to advertise, and lists it in a file, much like the robots.txt file used to block content from being indexed by search engines. The purpose is the opposite, though: these are pages a crawler may not find on its own, but that the owner wants to make sure it finds.
Other areas of untapped data are things like flight schedules, medical journals and writings, and basically any searchable database that doesn't have direct HTML links to its contents. Programs are being developed to submit specialized search terms to these online databases and index the results like the regular web.
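As a rough illustration of what such a file can look like, the sketch below builds a small sitemap in the XML format described at sitemaps.org, using only Python's standard library. The URLs and the build_sitemap helper are made up for the example; only the urlset, url, and loc element names and the namespace come from the protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical pages the owner wants crawlers to find.
sitemap_xml = build_sitemap([
    "http://www.example.com/",
    "http://www.example.com/unlinked-report",
])
print(sitemap_xml)
```

The protocol also allows optional per-URL details such as a last-modified date; they are left out here to keep the sketch short.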
Back to the beginning: as I mentioned, I was researching robots.txt and how it works. This is an 'exclusion protocol'. It tells a well-behaved crawler NOT to crawl a particular link or URL, which generally keeps it from being indexed and advertised. One concrete use I found that works well is for redundant links on the site. If there are links to 'print only' formats that duplicate the article, they could clutter up the search results for the site. One misconception is that the robots.txt file can be used to keep pages hidden from hackers. This is patently wrong. The robots.txt file is accessible by anyone; that is how crawlers figure out what one doesn't want indexed. But because it is accessible by anyone, it is also a convenient place for hackers to look and see what they can poke at. Robots.txt has its uses, but keeping unwanted visitors out of a particular part of the site should be done with the authentication that the web server offers.
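Here is a small sketch of both points, using Python's standard urllib.robotparser module. The robots.txt content, the /print/ and /admin/ paths, and the example.com URLs are all made up for illustration. The first half shows a polite crawler checking the rules before fetching; the second half shows that anyone who reads the file gets the same list of paths.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site with print-only duplicates of its articles.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler asks before fetching each URL.
can_fetch_article = parser.can_fetch("*", "http://www.example.com/articles/1")
can_fetch_print = parser.can_fetch("*", "http://www.example.com/print/articles/1")

# Anyone else can read the same file and see exactly what was excluded.
disallowed = [
    line.split(":", 1)[1].strip()
    for line in ROBOTS_TXT.splitlines()
    if line.lower().startswith("disallow:")
]

print(can_fetch_article, can_fetch_print)
print(disallowed)
```

Nothing in the file is enforced by the server. A crawler that ignores the rules, or a person typing the URL into a browser, can still request /admin/ directly, which is why that kind of page needs real authentication.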
References
http://www.sitemaps.org/protocol.php
http://www.deeppeep.org/
http://www.nytimes.com/2009/02/23/technology/internet/23search.html?_r=1&th&emc=th
http://en.wikipedia.org/wiki/Deep_Web
Saturday, January 8, 2011