Which programming language you start with really depends on where you want to go with programming. The great thing about this field is that there is an abundance of smaller fields you can go into, each using programming in its own way. For web applications, a good start is HTML, later working your way through CSS, JavaScript, jQuery, PHP, SQL, and any of the JavaScript libraries; Ruby is also a popular choice, so I would recommend checking that out too. For more scientific fields, or areas with more machine learning and A.I., Python is generally a great place to start, as it is widely used in that field of study; C++ is also a very useful language to know there, though it can be a little more challenging for beginners. For game and application development, languages such as C#, C, Swift, Kotlin, and Java are most often used.
Description
Most of us are familiar with web spiders and crawlers like GoogleBot - they visit a web page, index content there, and then visit outgoing links from that page. Crawlers are an interesting technology with continuing development.
Web crawlers marry queuing and HTML parsing, and they form the basis of search engines and the like. Writing a simple crawler is a good exercise in putting a few things together; writing a well-behaved crawler is another step up.
For this challenge you may use any single-shot web client you wish, e.g. Python's httplib or any of a number of libcurl bindings; you may NOT use a crawling library like Mechanize. You may use an HTML parsing library like BeautifulSoup; you may NOT use a headless browser like PhantomJS. The purpose of this challenge is to tie together fetching a page, discovering links in it, reassembling those links into absolute URLs, adding them to a queue, managing the depth of that queue, and visiting them in some reasonable order - while avoiding duplicate visits.
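As an aside (not part of the challenge text), the "reassembling links" step usually means resolving relative hrefs against the page they were found on. A minimal sketch of that, together with duplicate avoidance, using Python's standard urllib.parse might look like the following; the reassemble name and example URLs are only placeholders:

from urllib.parse import urljoin

seen = set()

def reassemble(base_url, hrefs):
    # Resolve each href against the page it was found on and
    # drop anything that has already been queued this session.
    absolute_urls = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # '/page/2' -> 'http://example.com/page/2'
        if absolute not in seen:
            seen.add(absolute)
            absolute_urls.append(absolute)
    return absolute_urls

# reassemble('http://example.com/a/', ['/b', 'c.html', 'http://other.org/'])
# -> ['http://example.com/b', 'http://example.com/a/c.html', 'http://other.org/']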
Your crawler MUST support the following features:
HTTP/1.1 client behaviors
GET requests are the only method you must support
Parse all links presented in HTML - anchors, images, scripts, etc.
Take at least two options - a starting (seed) URL and a maximum depth to recurse to (e.g. a depth of "1" fetches the HTML page and all resources associated with it, such as images and scripts, but does not visit any outgoing anchor links; a depth of "2" also visits the anchor links found on that first page, and so on)
Do not visit the same link more than once per session
Optional features include HTTPS support, support for robots.txt (a small sketch follows the warning below), restricting the crawler to a set of domains, and storing results (for example, the way wget does).
Be careful with what you crawl! Don't get yourself banned from the Internet. I highly suggest you crawl a local server you control, as you may otherwise trigger rate limits and other mechanisms that identify unwanted visitors.
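For the optional robots.txt support mentioned above, one possible sketch uses the standard library's urllib.robotparser to ask whether a given user agent may fetch a URL before requesting it; the user-agent string and URLs here are placeholders, not part of the challenge:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://quotes.toscrape.com/robots.txt')  # placeholder site
rp.read()

# Ask before fetching; skip the URL when this returns False
print(rp.can_fetch('MyToyCrawler', 'http://quotes.toscrape.com/page/2/'))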
Solution
in Python
import asyncio

import aiohttp
from bs4 import BeautifulSoup

# URLs fetched in this session, used to avoid duplicate visits
visited = set()


async def crawl(node, max_depth, session):
    # Stop recursing once the requested depth is exceeded
    if node['depth'] > max_depth:
        print('reached max depth')
        return
    # The next level collects every link discovered at the current level
    node['next'] = {
        'depth': node['depth'] + 1,
        'urls': [],
    }
    for link in node['urls']:
        links = await get_links(link, session)
        node['next']['urls'].extend(links)
    await crawl(node['next'], max_depth, session)


async def get_links(url, session):
    print(f'getting links on {url}')
    # Skip URLs already seen this session and relative links
    if url in visited or url.startswith('/'):
        print(f'already visited {url}')
        return []
    visited.add(url)
    links = []
    async with session.get(url) as resp:
        text = await resp.text()
    soup = BeautifulSoup(text, 'html.parser')
    # Collect only absolute anchor links (href containing '//')
    for a in soup.find_all('a', href=True):
        if '//' in a['href']:
            print(a['href'])
            links.append(a['href'])
    return links


async def main():
    # max_depth = 2
    # root = r'http://quotes.toscrape.com/'
    root = input('Website to crawl: ')
    max_depth = int(input('Depth to crawl (int): '))
    print(f'crawling {root} at depth {max_depth}')
    links = {
        'urls': [root],
        'depth': 1,
    }
    # ClientSession must be used as an *async* context manager
    async with aiohttp.ClientSession() as session:
        await crawl(links, max_depth, session)
    print(links)
    print(f'Number of pages crawled: {len(visited)}')


asyncio.run(main())
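To run the solution you need the third-party packages aiohttp and beautifulsoup4 installed (for example with pip install aiohttp beautifulsoup4); the commented-out root URL suggests http://quotes.toscrape.com/ as a friendly test target. If you wanted the optional domain restriction, one possible approach (not part of the posted solution; the same_domain helper is just an illustration) is to compare hosts with urllib.parse before fetching inside get_links:

from urllib.parse import urlparse

def same_domain(url, seed):
    # True when url points at the same host as the seed URL
    return urlparse(url).netloc == urlparse(seed).netloc

# same_domain('http://quotes.toscrape.com/page/2/', 'http://quotes.toscrape.com/')  -> True
# same_domain('https://www.goodreads.com/quotes', 'http://quotes.toscrape.com/')    -> False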