tutorialpoint.org

Crawling a website

Introduction

We can picture the World Wide Web as a huge network. Web pages are connected to each other directly or indirectly. The pages are like nodes, and the paths that connect them are called links. When we search for information on the web, a search engine needs to "crawl" over the web pages to collect the desired information.

Necessity of crawling

There are billions of pages on the web, and most of them are interconnected. When we submit a query to a search engine, it has to find information relevant to that query. Crawling is how it gathers the web content from which the results are drawn. The pages are not fetched live at the moment of the search, however: they are crawled offline, ahead of time, and information about them is stored. The search output is then returned from the stored data.
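To illustrate the idea of answering a search from stored data, here is a minimal sketch in Python. It assumes the crawl has already happened: the `crawled_pages` dictionary, its example.com URLs, and the helper names `build_index` and `search` are all made up for illustration. A real search engine stores far more than a word-to-page map, so treat this as one simple way to picture it, not as how any particular engine works.

```python
from collections import defaultdict

# Hypothetical pages already fetched by a crawler (URL -> page text).
crawled_pages = {
    "http://example.com/a": "web crawlers follow links",
    "http://example.com/b": "search engines store crawled pages",
    "http://example.com/c": "links connect web pages",
}

def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word of the query, sorted."""
    words = query.lower().split()
    if not words:
        return []
    # Start from the pages matching the first word, then narrow down.
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)

index = build_index(crawled_pages)   # built offline, before any query arrives
print(search(index, "web links"))    # answered from the stored index only
```

Note that `search` never touches the pages themselves. It only reads the index that was built beforehand, which is the point made above about results coming from stored data.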

Crawler

Crawlers, or spiders, are automated programs (robots) that read and scan the content of web pages. During crawling, a crawler visits the important links and reads the meta tags and text of each page. It collects tags, links, keywords, and phrases from all pages and stores selected pieces on disk. Later, they can be retrieved using queries.
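The scanning step described above can be sketched with Python's standard `html.parser` module. The class name `PageScanner` and the sample HTML are assumptions made for this example. A production crawler would also need to fetch pages over the network, resolve relative links, and cope with malformed markup, none of which is shown here.

```python
from html.parser import HTMLParser

class PageScanner(HTMLParser):
    """Collects links, named meta tags, and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []   # href values of <a> tags, in page order
        self.meta = {}    # meta name -> content
        self.text = []    # non-empty text fragments
        self._skip = 0    # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep text only when it is not script or style content.
        if not self._skip and data.strip():
            self.text.append(data.strip())

# Made-up page used only to exercise the scanner.
sample_html = """
<html><head>
<title>Sample page</title>
<meta name="keywords" content="crawler, spider">
</head><body>
<p>Crawlers read pages.</p>
<a href="/about">About</a>
<a href="http://example.com/next">Next</a>
</body></html>
"""

scanner = PageScanner()
scanner.feed(sample_html)
print(scanner.links)
print(scanner.meta)
print(scanner.text)
```

After `feed` returns, `scanner.links` holds the two href values, `scanner.meta` holds the keywords entry, and `scanner.text` holds the text fragments. These are the "selected pieces" a crawler might then write to storage.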

More specifically, crawlers are the programs that search engines use to scan and analyze websites in order to determine their importance, and thus their ranking in the search results for certain keywords. Crawlers are very active and often account for a large share of the visits to websites all over the Internet. The objective of crawling is to gather information about web pages quickly and efficiently by following the link structure that interconnects them.
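Following the link structure can be sketched as a breadth-first traversal. The example below is an assumption-laden illustration: `link_graph` is a made-up in-memory site standing in for real pages, and `crawl`, `get_links`, and `max_pages` are hypothetical names. Breadth-first order is only one possible visiting strategy, and real crawlers add politeness delays, robots.txt handling, and prioritization on top of it.

```python
from collections import deque

# Hypothetical link structure: each page maps to the pages it links to.
link_graph = {
    "home": ["about", "products"],
    "about": ["home", "team"],
    "products": ["home", "item1", "item2"],
    "team": [],
    "item1": ["products"],
    "item2": ["products", "item1"],
}

def crawl(start, get_links, max_pages=100):
    """Visit pages breadth-first from start, each page at most once."""
    seen = {start}             # pages already queued, so none is queued twice
    frontier = deque([start])  # pages waiting to be visited
    order = []                 # pages in the order they were visited
    while frontier and len(order) < max_pages:
        page = frontier.popleft()
        order.append(page)
        for target in get_links(page):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return order

order = crawl("home", lambda page: link_graph.get(page, []))
print(order)  # ['home', 'about', 'products', 'team', 'item1', 'item2']
```

The `seen` set is what keeps the crawler from looping forever on pages that link back to each other, and `max_pages` gives a simple way to bound the amount of work.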