Python Web Crawler in Less Than 50 Lines

14 02 2009

I got kind of bored today and wrote a pretty simple web crawler in Python, and it turned out to be less than 50 lines. It doesn’t store output; I’ll leave that up to anyone who wants to use the code because, well, there are just too many ways to choose from. Right now you pass it a starting link as a parameter and it will crawl forever until it runs out of links, which is not a likely condition. So here you go. Have fun. Feel free to ask questions.

 

import sys
import re
import urllib2
import urlparse
tocrawl = set([sys.argv[1]])
crawled = set([])
keywordregex = re.compile(r'<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile(r'<a\s+href=[\'"](.*?)[\'"].*?>')

while 1:
	try:
		crawling = tocrawl.pop()
		print crawling
	except KeyError:
		break  # nothing left to crawl
	url = urlparse.urlparse(crawling)
	try:
		response = urllib2.urlopen(crawling)
	except Exception:  # skip pages that fail to load, but let Ctrl-C through
		continue
	msg = response.read()
	startPos = msg.find('<title>')
	if startPos != -1:
		endPos = msg.find('</title>', startPos+7)
		if endPos != -1:
			title = msg[startPos+7:endPos]
			print title
	keywordlist = keywordregex.findall(msg)
	if len(keywordlist) > 0:
		keywordlist = keywordlist[0]
		keywordlist = keywordlist.split(", ")
		print keywordlist
	links = linkregex.findall(msg)
	crawled.add(crawling)
	for link in links:
		if link.startswith('/'):
			link = 'http://' + url[1] + link
		elif link.startswith('#'):
			link = 'http://' + url[1] + url[2] + link
		elif not link.startswith('http'):
			link = 'http://' + url[1] + '/' + link
		if link not in crawled:
			tocrawl.add(link)
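
The prefix checks at the end of the loop can also be handed to the standard library. This is a minimal sketch, not part of the original script: `normalize_link` is a hypothetical helper name, and the import fallback is only there so the snippet runs on both Python 3 and the Python 2 used in the post. Unlike the script, `urljoin` resolves a bare relative link against the directory of the current page rather than the site root, and `urldefrag` drops the `#fragment` so the same page is not queued once per anchor.

```python
try:
    from urllib.parse import urljoin, urldefrag  # Python 3
except ImportError:
    from urlparse import urljoin, urldefrag      # Python 2, as in the post


def normalize_link(page_url, link):
    """Resolve link against page_url and strip any #fragment."""
    absolute = urljoin(page_url, link)
    return urldefrag(absolute)[0]


# Inside the crawl loop this could stand in for the startswith() checks:
# link = normalize_link(crawling, link)
```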

11 responses

16 02 2009
Teifion

I get errors that startPos and endPos are not defined. I figured startPos should be set to 0, but what is endPos set to?

16 02 2009
The.Anti.9

My apologies, there was HTML in the quotes in the startPos and endPos definitions that I forgot to change to HTML escape characters. Fixed now.

3 03 2009
Suzy

Just dropping by. Btw, your website has great content!

______________________________
Unlimited Public Records Searches!

3 03 2009
shawn

Hi the.anti.9

I’m trying to put a web business together. I met a guy from a company who indicated he could write me a spider and a web crawler that I could use to attract business to my site, but after 4 weeks of trying to track him down I’m starting to realize he is full of it. I would very much enjoy a conversation with you regarding a small project for a web crawler to acquire leads for my web site. I can be reached at nigelshawn@yahoo.com

22 07 2009
ncr_bhikari

Nice and concise article… I’ve not tested it though…

1 08 2009
Sam

Hello,

When running the script I get this error:

Traceback (most recent call last):
  File "/Users/samiles3/Desktop/crawler.py", line 5, in <module>
    tocrawl = set([sys.argv[1]])
IndexError: list index out of range

Any ideas?

Thank you for your help.

4 08 2009
desiNerd

@Sam, it seems you didn't provide the URL as input. You have to run the above program like this:
desiNerd-laptop$ python crawler.py http://www.google.com

Since you didn't provide the last argument, it errored out saying the list index is out of range: in your case argv holds only the script name at argv[0], whereas the program tries to fetch the URL from argv[1], the first argument provided. Hope this helps. Thanks.
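
For reference, the missing-argument case can be caught before the crawl starts. This is a minimal sketch only: `get_start_url` is a hypothetical helper that is not in the posted script, and the usage message wording is an assumption.

```python
import sys


def get_start_url(argv):
    """Return the start URL from argv, or exit with a usage message."""
    if len(argv) < 2:
        # argv[0] is the script name, so a URL is only present at argv[1]
        raise SystemExit("usage: python %s <start-url>" % argv[0])
    return argv[1]


# In the crawler this would replace the direct sys.argv[1] lookup:
# tocrawl = set([get_start_url(sys.argv)])
```

Run without a URL, the script would then stop with the usage line instead of an IndexError traceback.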

6 08 2009
Sam

Thanks, desiNerd, that works great.

I am running the script (thanks :D), but sometimes when it comes across a URL with either a # anchor or a PHP ? parameter in it, the script gets stuck on that URL and just prints it over and over.

Any ideas how I might solve this?

Thank you for your help once again.

Sam

6 08 2009
Sam

Please forgive me, but I have solved this problem now.

29 08 2009
Varun

Wondering whether the admin of the website being crawled will get to know about our crawler in any way?

They may also block the IP the crawler is hosted on.

Is there any way to make the request (actually sent out by our crawler) look as if it is coming from a web browser (like IE, Firefox, Opera)?
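
One common approach to the last question is to send a User-Agent header with each request. This is a minimal sketch, not taken from the posted script: `build_request` and `EXAMPLE_USER_AGENT` are hypothetical names, the header string is a placeholder, and the import fallback is only there so the snippet runs on both Python 3 and the Python 2 used in the post.

```python
try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen          # Python 2, as in the post

# Placeholder value; any string can be sent as the User-Agent.
EXAMPLE_USER_AGENT = "Mozilla/5.0 (compatible; ExampleCrawler/0.1)"


def build_request(url, user_agent=EXAMPLE_USER_AGENT):
    """Return a Request that carries a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


# Usage (performs a network call, so it is left commented out):
# response = urlopen(build_request("http://www.example.com/"))
```

The header only changes how the client identifies itself. The site admin can still see the requests in the server logs and can still block the crawler's IP.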

29 10 2009
jabberbuzz

great job !!
