Which programming language you start with really depends on where you want to go with programming. The great thing about this field is that there is an abundance of smaller fields you can go into, each using programming in its own way. For web applications, a good start is HTML, later working your way through CSS, JavaScript, jQuery, PHP, SQL, and any of the JavaScript libraries; Ruby is also a popular choice, so I would recommend checking that out too. For more scientific fields, or areas with more machine learning and A.I., Python is generally a great place to start, as it is widely used in that field of study; C++ is also a very useful language to know there, though it can be a little more challenging for beginners. For game and application development, languages such as C#, C, Swift, Kotlin, and Java are most often used.
Description
Most of us are familiar with web spiders and crawlers like GoogleBot - they visit a web page, index content there, and then visit outgoing links from that page. Crawlers are an interesting technology with continuing development.
Web crawlers marry queuing and HTML parsing, and they form the basis of search engines and the like. Writing a simple crawler is a good exercise in putting a few things together; writing a well-behaved crawler is another step up.
For this challenge you may use any single-shot web client you wish, e.g. Python's httplib or any of a number of libcurl bindings; you may NOT use a crawling library like Mechanize. You may use an HTML parsing library like BeautifulSoup; you may NOT use a headless browser like PhantomJS. The purpose of this challenge is to tie together fetching a page, discovering links in it, reassembling those links into absolute URLs, adding them to a queue, managing the depth of that queue, and visiting them in some reasonable order - while avoiding duplicate visits.
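As an aside (not part of the challenge text), the "reassembling links" step usually means resolving relative hrefs against the page they were found on. A minimal sketch of that, together with duplicate avoidance, using Python's standard urllib.parse might look like the following; the reassemble name and example URLs are only placeholders:

from urllib.parse import urljoin

seen = set()

def reassemble(base_url, hrefs):
    # Resolve each href against the page it was found on and
    # drop anything that has already been queued this session.
    absolute_urls = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # '/page/2' -> 'http://example.com/page/2'
        if absolute not in seen:
            seen.add(absolute)
            absolute_urls.append(absolute)
    return absolute_urls

# reassemble('http://example.com/a/', ['/b', 'c.html', 'http://other.org/'])
# -> ['http://example.com/b', 'http://example.com/a/c.html', 'http://other.org/']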
Your crawler MUST support the following features:
HTTP/1.1 client behaviors
GET requests are the only method you must support
Parse all links presented in HTML - anchors, images, scripts, etc.
Take at least two options - a starting (seed) URL and a maximum depth to recurse to (e.g. a depth of "1" fetches the HTML page and all resources associated with it, such as images and scripts, but does not visit any outgoing anchor links; a depth of "2" also visits the anchor links found on that first page, and so on)
Do not visit the same link more than once per session
Optional features include HTTPS support, support for robots.txt (a small sketch follows the warning below), restricting the crawler to a set of domains, and storing results (for example, the way wget does).
Be careful with what you crawl! Don't get yourself banned from the Internet. I highly suggest you crawl a local server you control, as you may otherwise trigger rate limits and other mechanisms that identify unwanted visitors.
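For the optional robots.txt support mentioned above, one possible sketch uses the standard library's urllib.robotparser to ask whether a given user agent may fetch a URL before requesting it; the user-agent string and URLs here are placeholders, not part of the challenge:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://quotes.toscrape.com/robots.txt')  # placeholder site
rp.read()

# Ask before fetching; skip the URL when this returns False
print(rp.can_fetch('MyToyCrawler', 'http://quotes.toscrape.com/page/2/'))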
Solution
in Python
import asyncio

import aiohttp
from bs4 import BeautifulSoup

# URLs fetched in this session, used to avoid duplicate visits
visited = set()


async def crawl(node, max_depth, session):
    # Stop recursing once the requested depth is exceeded
    if node['depth'] > max_depth:
        print('reached max depth')
        return
    # The next level collects every link discovered at the current level
    node['next'] = {
        'depth': node['depth'] + 1,
        'urls': [],
    }
    for link in node['urls']:
        links = await get_links(link, session)
        node['next']['urls'].extend(links)
    await crawl(node['next'], max_depth, session)


async def get_links(url, session):
    print(f'getting links on {url}')
    # Skip URLs already seen this session and relative links
    if url in visited or url.startswith('/'):
        print(f'already visited {url}')
        return []
    visited.add(url)
    links = []
    async with session.get(url) as resp:
        text = await resp.text()
    soup = BeautifulSoup(text, 'html.parser')
    # Collect only absolute anchor links (href containing '//')
    for a in soup.find_all('a', href=True):
        if '//' in a['href']:
            print(a['href'])
            links.append(a['href'])
    return links


async def main():
    # max_depth = 2
    # root = r'http://quotes.toscrape.com/'
    root = input('Website to crawl: ')
    max_depth = int(input('Depth to crawl (int): '))
    print(f'crawling {root} at depth {max_depth}')
    links = {
        'urls': [root],
        'depth': 1,
    }
    # ClientSession must be used as an *async* context manager
    async with aiohttp.ClientSession() as session:
        await crawl(links, max_depth, session)
    print(links)
    print(f'Number of pages crawled: {len(visited)}')


asyncio.run(main())
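To run the solution you need the third-party packages aiohttp and beautifulsoup4 installed (for example with pip install aiohttp beautifulsoup4); the commented-out root URL suggests http://quotes.toscrape.com/ as a friendly test target. If you wanted the optional domain restriction, one possible approach (not part of the posted solution; the same_domain helper is just an illustration) is to compare hosts with urllib.parse before fetching inside get_links:

from urllib.parse import urlparse

def same_domain(url, seed):
    # True when url points at the same host as the seed URL
    return urlparse(url).netloc == urlparse(seed).netloc

# same_domain('http://quotes.toscrape.com/page/2/', 'http://quotes.toscrape.com/')  -> True
# same_domain('https://www.goodreads.com/quotes', 'http://quotes.toscrape.com/')    -> False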