
Crawling a website (cont'd...)

DNS resolution

Each web server has a unique IP address: a sequence of four bytes, generally written as four integers (each between 0 and 255) separated by dots; for instance, 207.142.131.248 is a numerical IP address associated with the host www.wikipedia.org. Given a URL such as www.wikipedia.org in text form, translating it to an IP address (in this case, 207.142.131.248) is a process known as DNS resolution or DNS lookup; here DNS stands for Domain Name Service. During DNS resolution, the program that wishes to perform the translation contacts a DNS server, which returns the translated IP address.
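As a quick illustration, Python's standard socket module exposes the system's DNS resolver, so a crawler can perform this lookup in a single call. This is a minimal sketch; the address returned depends on the DNS records at the time you run it:

```python
import socket

# Ask the system's DNS resolver to translate a hostname into an IP address.
ip = socket.gethostbyname("www.wikipedia.org")
print(ip)  # prints a dotted-quad string; the actual value varies over time
```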

The URL frontier

The URL frontier at a node is given URLs by its crawl process (or by the host splitter of another crawl process). It maintains the URLs in the frontier and hands them out in a particular order whenever a crawler thread asks for a URL. Two important considerations govern the order in which URLs are returned by the frontier. The first is that high-quality pages that change frequently should be prioritized for frequent crawling; the priority of a page should therefore be a function of both its change rate and its quality. The combination is necessary because a large number of spam pages change completely on every fetch. The second is politeness: repeated fetch requests to the same host within a short time span must be avoided.
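To make the idea concrete, here is a minimal sketch of a priority-ordered frontier in Python. The priority score, combining change rate and quality, is assumed to be computed by the caller; production crawlers use a more elaborate multi-queue design that also enforces politeness per host:

```python
import heapq

class URLFrontier:
    """A simplified URL frontier: URLs are handed out in priority order.

    The priority is a caller-supplied score (lower = fetch sooner),
    assumed to combine a page's change rate and its quality. This sketch
    omits the politeness machinery a real frontier would add.
    """

    def __init__(self):
        self._heap = []     # entries are (priority, insertion_order, url)
        self._counter = 0   # tie-breaker so equal priorities pop in FIFO order
        self._seen = set()  # avoid enqueuing the same URL twice

    def add(self, url, priority):
        if url in self._seen:
            return
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, self._counter, url))
        self._counter += 1

    def get(self):
        """Return the next URL a crawler thread should fetch, or None."""
        if not self._heap:
            return None
        _, _, url = heapq.heappop(self._heap)
        return url
```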

Writing a simple crawler

Now that we are familiar with the working principle of a web crawler, I am providing a simple crawler written in the Python programming language. To write a web crawler, knowledge of a website's source code is a must. The source code of a website is nothing but HTML, so an understanding of HTML structure and syntax is required. The Python program that I provide at the end of this writing can fetch the URLs mentioned on a particular page. URL links in HTML are given in "a" tags, so we have to locate each "a" tag and read the value of its "href" attribute.

For crawling the links, we first have to fetch the source code of the website, which we do using the Python library urllib. We then parse this code using another Python library, BeautifulSoup. The code provided at the end of the text can also collect the keywords, title, and description of the web pages.
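As a preview, here is a minimal sketch of the approach just described, using urllib to fetch a page's source and BeautifulSoup to parse it. The example URL is arbitrary, and the meta tags may be absent on any given page:

```python
import urllib.request
from bs4 import BeautifulSoup  # install with: pip install beautifulsoup4

def crawl(url):
    """Fetch a page; extract its title, meta description/keywords, and links."""
    # Step 1: fetch the page's HTML source with urllib.
    with urllib.request.urlopen(url) as response:
        html = response.read()

    # Step 2: parse the source with BeautifulSoup.
    soup = BeautifulSoup(html, "html.parser")

    # The page title, if present.
    title = soup.title.string if soup.title else None

    # Description and keywords live in <meta> tags, if the page has them.
    def meta_content(name):
        tag = soup.find("meta", attrs={"name": name})
        return tag.get("content") if tag else None

    description = meta_content("description")
    keywords = meta_content("keywords")

    # URL links are given in <a> tags: read each tag's href attribute.
    links = [a["href"] for a in soup.find_all("a", href=True)]

    return title, description, keywords, links

if __name__ == "__main__":
    title, description, keywords, links = crawl("https://www.wikipedia.org")
    print("Title:", title)
    print("Description:", description)
    print("Keywords:", keywords)
    for link in links[:10]:
        print(link)
```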
