Crawling a website (cont'd...)
The simple crawling scheme outlined above requires several modules that fit together as shown in the figure below:
1. The URL frontier, containing URLs to be fetched in the current crawl (in the case of continuous crawling, a URL may have been fetched previously but is back in the frontier for re-fetching).
2. A DNS resolution module that determines the web server from which to obtain the page specified by a URL.
3. A fetch module that uses the http protocol to fetch the web page at a URL.
4. A parsing module that extracts the text and the set of links from a fetched web page's source code.
5. A duplicate elimination module that determines whether an extracted link already exists in the URL frontier or has recently been fetched.
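As a rough illustration, modules 1 and 5 can be sketched together: a FIFO frontier that silently drops any URL it has already seen. This is a minimal, hypothetical version (the class and method names are ours, not from any standard crawler library), ignoring re-fetching in continuous crawling:

```python
from collections import deque

class URLFrontier:
    """Sketch of the URL frontier with built-in duplicate elimination.

    A real frontier would also handle politeness delays, priorities,
    and re-fetch scheduling; this keeps only the queue + seen-set core.
    """
    def __init__(self, seeds=()):
        self._queue = deque()
        self._seen = set()
        for url in seeds:
            self.add(url)

    def add(self, url):
        # Duplicate elimination: only enqueue URLs not seen before.
        if url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def pop(self):
        # Next URL to fetch, or None when the frontier is empty.
        return self._queue.popleft() if self._queue else None
```

A crawler thread would call `pop()` to obtain its next URL and `add()` for each extracted link that survives the tests and filters.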
Fig 1. Crawler architecture
Crawling is done by anywhere from one to hundreds of threads, each of which loops through the logical cycle shown in Figure 1. These threads may be run in a single process, or may be partitioned among multiple processes running at different nodes of a distributed system. A crawler thread takes a URL from the frontier and fetches the web page at that URL, usually using the http protocol. The fetched page is then written into a temporary store, where a number of operations are performed on it. Next, the page is parsed, and the text and the links in it are extracted. The text, together with any tag information, is passed on to the indexer. In addition, each extracted link goes through a series of tests and filters to determine whether it should be added to the URL frontier. First, the thread tests whether a web page with the same content has already been seen at another URL. Finally, the URL is checked for duplicate elimination: if the URL has already been crawled, it is not added to the frontier again.
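The logical cycle above can be sketched as a single-threaded loop over an in-memory stand-in for the web. This is a hedged sketch: `FAKE_WEB`, `crawl`, and the stub fetch/parse behavior are illustrative assumptions, not part of any real crawler, and DNS resolution, robots.txt checks, and the temporary store are omitted:

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "a": ("page a", ["b", "c"]),
    "b": ("page b", ["a", "c"]),
    "c": ("page c", []),
}

def crawl(seeds, fetch):
    """One thread's logical cycle: pop a URL from the frontier, fetch the
    page, test for duplicate content, index the text, then run duplicate
    elimination on each extracted link before adding it to the frontier."""
    frontier = deque(seeds)
    seen_urls = set(seeds)     # URLs already crawled or queued
    seen_content = set()       # page texts already seen at some URL
    indexed = []               # stands in for handing text to the indexer
    while frontier:
        url = frontier.popleft()
        result = fetch(url)            # fetch module (HTTP in a real crawler)
        if result is None:
            continue
        text, links = result           # parsing module (pre-split here)
        if text in seen_content:       # same content seen at another URL?
            continue
        seen_content.add(text)
        indexed.append((url, text))    # pass the text on to the indexer
        for link in links:
            if link not in seen_urls:  # URL duplicate elimination
                seen_urls.add(link)
                frontier.append(link)
    return indexed
```

Starting from seed `"a"`, the loop visits each of the three fake pages exactly once, because the seen-URL test prevents `"a"` and `"c"` from being re-queued when they reappear as links.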