I got kind of bored today and wrote a pretty simple web crawler in Python; it turned out to be less than 50 lines. It doesn’t store output. I’ll leave that up to anyone who wants to use the code because, well, there are just too many ways to choose from. Right now you pass it a starting link as a parameter and it will crawl forever until it runs out of links, but that is not a likely condition. So here ya go. Have fun. Feel free to ask questions.
import sys
import re
import urllib2
import urlparse
tocrawl = set([sys.argv[1]])
crawled = set([])
keywordregex = re.compile('<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile('<a\s*href=[\'"](.*?)[\'"].*?>')
while 1:
    try:
        crawling = tocrawl.pop()
        print crawling
    except KeyError:
        # tocrawl is empty: nothing left to crawl
        break
    url = urlparse.urlparse(crawling)
    try:
        response = urllib2.urlopen(crawling)
    except Exception:
        # skip pages that fail to load (Ctrl-C still stops the crawler)
        continue
    msg = response.read()
    # print the page title, if there is one
    startPos = msg.find('<title>')
    if startPos != -1:
        endPos = msg.find('</title>', startPos+7)
        if endPos != -1:
            title = msg[startPos+7:endPos]
            print title
    # print the meta keywords, if there are any
    keywordlist = keywordregex.findall(msg)
    if len(keywordlist) > 0:
        keywordlist = keywordlist[0]
        keywordlist = keywordlist.split(", ")
        print keywordlist
    links = linkregex.findall(msg)
    crawled.add(crawling)
    for link in links:
        # turn relative links into absolute ones
        if link.startswith('/'):
            link = 'http://' + url[1] + link
        elif link.startswith('#'):
            link = 'http://' + url[1] + url[2] + link
        elif not link.startswith('http'):
            link = 'http://' + url[1] + '/' + link
        if link not in crawled:
            tocrawl.add(link)
I get errors saying that startPos and endPos are not defined. I figured startPos should be set to 0, but what should endPos be set to?
My apologies, there was HTML in the quotes in the startPos and endPos definitions that I forgot to change to the HTML escape characters. Fixed now.
Hi the.anti.9
I’m trying to put a web business together. I met a guy from a company who indicated he could write me a spider and a web crawler that I could use to attract business to my site, but after 4 weeks of trying to track him down I’m starting to realize he is full of it. I would very much enjoy a conversation with you regarding a small project for a web crawler to acquire leads for my web site. I can be reached at nigelshawn@yahoo.com
Nice and concise article… I’ve not tested it though…
Hello,
When running the script I get the error:
Traceback (most recent call last):
  File "/Users/samiles3/Desktop/crawler.py", line 5, in <module>
    tocrawl = set([sys.argv[1]])
IndexError: list index out of range
Any ideas?
Thank you for your help.
@Sam, it seems you didn’t provide the URL as input. You have to run the above program like this:
desiNerd-laptop$ python crawler.py http://www.google.com
Since you didn’t provide the last argument, it errored out saying that the argv index is out of range: in your case argv holds only the script name (argv[0]), whereas the program tries to fetch the URL from argv[1], the first argument provided. Hope this helps. Thanks.
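For anyone who would rather get a usage hint than a traceback, here is a minimal sketch of an argument check. It is written for Python 3 (the script above is Python 2), and get_start_url is a hypothetical helper name, not something from the original script.

```python
def get_start_url(argv):
    """Return the start URL from an argv-style list, or None if it is missing.

    get_start_url is a hypothetical helper, not part of the original script.
    """
    # argv[0] is always the script name, so the URL (if given) is argv[1].
    if len(argv) < 2:
        return None
    return argv[1]


# Example with hand-written argument lists standing in for sys.argv:
print(get_start_url(["crawler.py"]))
print(get_start_url(["crawler.py", "http://www.google.com"]))
```

In the real script you would pass sys.argv to it and print a message like "usage: python crawler.py <start-url>" before exiting when it returns None.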
Thanks, desiNerd, that works great.
I am running the script (thanks), but sometimes when it comes across a URL with either a # anchor or a PHP ? parameter in it, the script gets stuck on this URL and just prints the URL over and over.
Any ideas how I might solve this?
Thank you for your help once again.
Sam
Please forgive me, but I have solved this problem now.
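Sam doesn’t say what his fix was, so the following is only a guess at one thing that can help: the script builds absolute links by string concatenation and keeps the # part, so the same page can end up in the queue under several different URLs. A minimal sketch, assuming Python 3 (in Python 2 the same two functions live in the urlparse module); normalize_link is a hypothetical helper name:

```python
from urllib.parse import urljoin, urldefrag


def normalize_link(page_url, link):
    """Resolve `link` against the page it was found on and drop any #fragment.

    normalize_link is a hypothetical helper, not part of the original script.
    """
    # urljoin handles '/path', 'relative/path' and '#anchor' links alike.
    absolute = urljoin(page_url, link)
    # urldefrag splits off the fragment, which never changes the page fetched.
    url, _fragment = urldefrag(absolute)
    return url


# Hypothetical page and links, for illustration only:
page = "http://example.com/a/b.html"
for link in ["#top", "c.php?x=1", "/root", "http://other.example/"]:
    print(normalize_link(page, link))
```

With this, a "#top" link on a page resolves back to the page itself, which is already in the crawled set and so is not queued again.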
I’m wondering whether the admin of the website being crawled will get to know about our crawler in any way.
They may also block the IP address that the crawler is hosted on.
Is there any way to make the requests sent by our crawler look as if they are coming from a web browser (like IE, Firefox, or Opera)?
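On the first question: every request shows up in the server’s access log, and by default urllib2 identifies itself with a User-Agent of the form Python-urllib/2.x, so an admin who looks can spot it. The header can be set to anything you like. A minimal sketch, assuming Python 3 (urllib2.Request in Python 2 takes the same headers argument); the USER_AGENT string and the build_request and fetch names are made up for illustration:

```python
import urllib.request

# Hypothetical User-Agent string; substitute whatever you want to send.
USER_AGENT = "Mozilla/5.0 (compatible; ExampleCrawler/0.1)"


def build_request(url, user_agent=USER_AGENT):
    """Create a Request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Download a page using the custom User-Agent and return the raw bytes."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

This only changes the header. The requests still come from the same IP address, so a site can still block the crawler that way.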
great job !!