Python Web Crawler in Less Than 50 Lines
I got kind of bored today and wrote a pretty simple web crawler in Python, and it turned out to be less than 50 lines. It doesn’t store output; I’ll leave that up to anyone who wants to use the code because, well, there are just too many ways to choose from. Right now you pass it a starting link as a parameter and it will crawl until it runs out of links, which is not a likely condition. So here ya go. Have fun. Feel free to ask questions.
import sys
import re
import urllib2
import urlparse
tocrawl = set([sys.argv[1]])  # URLs still to visit, seeded from the command line
crawled = set([])  # URLs already visited
keywordregex = re.compile(r'<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')
while 1:
    try:
        crawling = tocrawl.pop()
        print crawling
    except KeyError:
        break  # nothing left to crawl
    url = urlparse.urlparse(crawling)
    try:
        response = urllib2.urlopen(crawling)
    except Exception:
        continue  # skip pages that fail to load
    msg = response.read()
    startPos = msg.find('<title>')
    if startPos != -1:
        endPos = msg.find('</title>', startPos+7)
        if endPos != -1:
            title = msg[startPos+7:endPos]
            print title
    keywordlist = keywordregex.findall(msg)
    if len(keywordlist) > 0:
        keywordlist = keywordlist[0]
        keywordlist = keywordlist.split(", ")
        print keywordlist
    links = linkregex.findall(msg)
    crawled.add(crawling)
    for link in links:
        if link.startswith('/'):
            link = 'http://' + url[1] + link  # site-absolute path
        elif link.startswith('#'):
            link = 'http://' + url[1] + url[2] + link  # anchor on the current page
        elif not link.startswith('http'):
            link = 'http://' + url[1] + '/' + link  # relative path
        if link not in crawled:
            tocrawl.add(link)
** EDIT **
This was a very early draft of this program. As it turns out, I revisited this project a few months later and it evolved much more.
If you would like to check out the more evolved form, feel free to have a look here at my github!
I get errors that startPos and endPos are not defined. I figured startPos should be set to 0, but what is endPos set to?
My apologies, there was HTML in the quotes in the startPos and endPos definitions that I forgot to change to HTML escape characters. Fixed now.
hi the.anti.9
I’m trying to put a web business together. I met a guy from a company who indicated he could write me a spider and a web crawler that I could use to attract business to my site, but after 4 weeks of trying to track him down I’m starting to realize he is full of it. I would very much enjoy a conversation with you regarding a small project for a web crawler to acquire leads for my web site. I can be reached at nigelshawn@yahoo.com
nice and concise article…i’ve not tested it though…
Hello,
When running the script I get the error:
Traceback (most recent call last):
File "/Users/samiles3/Desktop/crawler.py", line 5, in <module>
tocrawl = set([sys.argv[1]])
IndexError: list index out of range
Any ideas?
Thank you for your help.
Sam: You need to pass it a start url.
@ Sam, it seems you didn’t provide the URL as input. You have to run the above program like this:
desiNerd-laptop$ python crawler.py http://www.google.com
Because you didn’t provide the last argument, it errored out saying the list index is out of range: in your case sys.argv holds only the script name (sys.argv[0]), whereas the program tries to fetch the URL from sys.argv[1], the first argument provided. Hope this helps. Thanks.
Thanks, desiNerd, that works great.
I am running the script (thanks :D ) but sometimes when the script comes across a URL with either a # anchor or a PHP ? parameter in it, the script gets stuck on this URL and just prints the URL over and over.
Any ideas how I might solve this?
Thank you for your help once again.
Sam
Please forgive me, but I have solved this problem now.
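Sam did not say what the fix was. One plausible contributor to this kind of symptom is that URLs differing only in their # fragment are queued as separate pages even though they point at the same document. The sketch below is an assumption rather than Sam’s solution; it uses Python 3’s urllib.parse (the post itself uses the Python 2 urlparse module), and normalize_url is a hypothetical helper name.

```python
from urllib.parse import urldefrag

def normalize_url(url):
    """Return url without its #fragment so the same page is queued only once."""
    return urldefrag(url).url

# Hypothetical candidate links, as they might come out of the link regex.
candidates = [
    "http://example.com/page#top",
    "http://example.com/page#comments",
    "http://example.com/page?id=2",
]

# Two fragment variants collapse into one entry; the query-string URL stays distinct.
seen = set(normalize_url(candidate) for candidate in candidates)
```

Query strings are left alone here on purpose, because ?id=1 and ?id=2 can be genuinely different pages.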
Wondering whether the admin of the website being crawled will get to know about our crawler in any way?
They may also block the IP from which the crawler is being hosted.
Is there any way to make the request sent by our crawler look as if it is coming from a web browser (like IE, Firefox, Opera)?
great job !!
You may want to check out a web crawler that I have written as well. It is made up of two classes, a parser and a crawler. It’s simple to use and you can extend the crawler class to do just about anything. The crawler on this page is a good example of how to make a purely procedural crawler, but in most cases, an object oriented one will be easier to maintain and modify :) http://www.esux.net/content/simple-python-web-crawler-thats-easy-build
Pretty neat :) I came across the site because we’re writing a Python API for our web-crawling service, 80legs.
I am new to Python. What is the meaning of “You need to pass it a start url”?
Thanks!
@vinayak: You need to take the above code and put it into a file and save the file with a name such as “crawler.py”. Next, navigate to that file in your shell and type:
python crawler.py http://www.somewebsite.com
Replace somewebsite.com with an actual website and it should start giving results.
The meaning of “You need to pass it a start url” is… take a look at line 5:
tocrawl = set([sys.argv[1]])
That line sets the variable tocrawl with the value of sys.argv[1], which refers to the first value passed to the script when it is executed. If that is not set, there should be an error.
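The IndexError discussed above can be avoided by checking the argument list before reading it. This is a minimal sketch in Python 3 syntax, not part of the original script; get_start_url is a hypothetical helper name.

```python
import sys

def get_start_url(argv):
    """Return the first command-line argument, or None if none was given."""
    # argv[0] is the script name, so the start URL would be argv[1].
    if len(argv) < 2:
        return None
    return argv[1]

if __name__ == "__main__":
    start_url = get_start_url(sys.argv)
    if start_url is None:
        print("usage: python crawler.py <start-url>")
    else:
        print("starting crawl at", start_url)
```

Run without an argument, this prints a usage hint instead of a traceback.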
Interesting stuff. Crawling the web with Python can be easier if you use other tools, like BeautifulSoup. I’m writing a series of tutorials on my blog discussing it. It is in Portuguese, but Google Translate works well in this case =P
http://herberthamaral.com/2010/02/criando-web-crawlers-em-python-parte-i/
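To illustrate why a parser can be easier than a regex: the post’s link pattern expects href to come immediately after `<a`, so a tag such as `<a class="nav" href="...">` is missed. The sketch below does not use BeautifulSoup; it swaps in Python 3’s standard-library html.parser so it runs without extra installs. LinkCollector and extract_links are hypothetical names.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links

# Made-up markup: extra attributes and uppercase tags are both handled.
sample = '<p><a class="nav" href="/about">About</a> <A HREF=\'http://example.com/\'>Home</A></p>'
found = extract_links(sample)
```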
Thanks! This is nice code, very well thought out. The only thing you might want to add is robots.txt compliance to be a good web citizen (denizen? it’s a web crawler…)
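A rough sketch of what robots.txt compliance could look like, assuming Python 3’s urllib.robotparser. The rules are parsed from an in-memory string so the example runs offline; the agent name "tiny-crawler" and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

def build_robot_rules(robots_txt_text):
    """Parse the text of a robots.txt file into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt_text.splitlines())
    return parser

# A made-up robots.txt that keeps every crawler out of /private/.
example_rules = """\
User-agent: *
Disallow: /private/
"""

rules = build_robot_rules(example_rules)
allowed = rules.can_fetch("tiny-crawler", "http://example.com/index.html")
blocked = rules.can_fetch("tiny-crawler", "http://example.com/private/data.html")
```

In a live crawler the rules would normally be loaded per host with set_url() and read(), and can_fetch() would be consulted before each request.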
@ Varun: But that would be dishonest! Not sure, maybe you can tamper with what Python sends as its user agent string. I would start looking at urllib.URLopener -> the version variable, I think Python submits this as its UA?
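For reference, this is roughly how a User-Agent header can be set explicitly. The sketch assumes Python 3’s urllib.request rather than the Python 2 urllib mentioned above, and it uses an honest, made-up agent string that identifies the crawler instead of imitating a browser; build_request is a hypothetical helper.

```python
from urllib.request import Request

def build_request(url, user_agent):
    """Create a request object that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})

# The agent string and info URL are placeholders, not a real registered bot.
request = build_request(
    "http://example.com/",
    "tiny-crawler/0.1 (+http://example.com/bot-info)",
)

# Passing `request` to urllib.request.urlopen() would send it with this header.
```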
Dearest, I am a computer science student from Egypt. My graduation project is to fill an open source digital library with images and their metadata from a web crawler. I want a very simple web crawler that gets the image metadata and that’s it.
Please reply if you can help me, thanks :)
Serves as a great start, and saves me some work. Thanks a lot!!!!!
Nice work !
Hello,
Can you please modify the code for me?
I don’t know Python.
Here are the requirements:
1. The code should run in Google App Engine.
2. If I give a domain name, say http://www.mycompany.com,
it will fetch the internal URLs that have */gallery/* in them.
Say, URLs like this -
http://www.mycompany.com/abcd/gallery/xyz/test.html
http://www.mycompany.com/gallery/xyz/test.html
http://mycompany.com/abcd/gallery/test.html
etc.
It will be a great help if you can mail me the modified code.
Thank you in advance.
Ayan Debnath
iosoft@gmail.com
See a related article with extensive explanations:
http://ms4py.org/2010/04/27/python-search-engine-crawler-part-1/
Please, I am doing research and need to write a web crawler that gathers the number of web pages and hyperlinks on a website (otago.ac.nz).
Can I use the above code for that as well? I am writing it in Python.
The above code will let you do that if you make a modification. Namely, you have to limit it to only crawl otago.ac.nz, so for each hyperlink that it finds, you have to have it check if it is part of that website, and then just keep a count of the results.
Hi Chaim, the thing is I am so new to Python. I have been struggling to do what you said for about 2 months now, but I still can’t get it working. How do you limit it to only that website?
Any help will be appreciated a lot.
It’s been a while since I’ve looked at this, but essentially you need to parse the domain of each link that is found and determine if it contains the string otago.ac.nz. This can be done by a simple regular expression search or a sub-string comparison or by using some module to break apart the domain name. It really doesn’t matter how you do it; you just want to determine if that string is part of the URL found. If it is not, then don’t add it to the list to keep indexing.
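A small sketch of the “module to break apart the domain name” option, assuming Python 3’s urllib.parse. It compares the host name rather than searching the whole URL, because a plain sub-string test would also match otago.ac.nz appearing in a path or query string. is_internal and split_links are hypothetical names, and the same split answers the internal versus external question.

```python
from urllib.parse import urlparse

def is_internal(link, site="otago.ac.nz"):
    """True if link's host is the site itself or one of its subdomains."""
    host = urlparse(link).hostname or ""
    return host == site or host.endswith("." + site)

def split_links(links, site="otago.ac.nz"):
    """Separate links into (internal, external) lists, keeping their order."""
    internal = [link for link in links if is_internal(link, site)]
    external = [link for link in links if not is_internal(link, site)]
    return internal, external

# Made-up links for illustration.
internal, external = split_links([
    "http://www.otago.ac.nz/study/",
    "http://example.com/?ref=otago.ac.nz",
    "http://otago.ac.nz/",
])
```

Inside the crawler, only the internal links would be added to tocrawl, and the lengths of the two lists give the counts.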
i will be so grateful if i am helped ………cheers
Hey guys, got notified of this post through backtype. If you’re not interested in having to write your own web crawler, you can use our crawling service at http://www.80legs.com.
Cheers,
Shion
Hi, yeah.
I tried running this code but had an error with line 12, where it says “print crawling”.
What do I do?
But Shion, you guys are robots.txt compliant. Otherwise I’d use you ;)
Sorry, dumb question. How can I navigate to the file in a shell?
please where in the code do u out the web site you want to crawl?
Thanks
where do I out? or where do I put? I’m going to assume put. You give the start url as a parameter when you call the program.
> python crawler.py http://yoursite.com
Hi! Sometimes I think that my life has no sense, but when I visit blogs like yours I find myself happy. Keep writing on.
Thanks for your reply! This was a very early draft of this program. If you would like to see the more evolved version you should go here to my github!
I have added this post to my favorites. Now and on I will read it more often.
Hi! You have an error in
elif not link.startswith('http'):
link = 'http://' + url[1] + '/' + link
The last line should look like
link = urlparse.urljoin(crawling, link)
Here’s why:
Imagine you are crawling from the page http://site.com/1/test.html
and there is a link to “2/test2.html”.
Your code redirects to:
site.com/2/test2.html
and mine to: site.com/1/2/test2.html (as it should be!)
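The behaviour described in this comment can be checked directly. The sketch assumes Python 3, where urljoin lives in urllib.parse (in the Python 2 used by the post it is urlparse.urljoin); the base URL is the commenter’s example and the other link values are made up.

```python
from urllib.parse import urljoin

# The page currently being crawled, as a plain string.
base = "http://site.com/1/test.html"

# urljoin resolves each kind of link against the current page.
resolved = {
    "relative": urljoin(base, "2/test2.html"),
    "site_absolute": urljoin(base, "/top.html"),
    "fragment": urljoin(base, "#section"),
    "full": urljoin(base, "http://other.com/x"),
}
```

One call covers all three branches of the original if/elif chain, and full URLs pass through unchanged.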
Thanks for your ideas. :)
Please, it’s me again. I tried using this code on a large website (www.otago.ac.nz) and it worked, but how do I distinguish the external links from the internal links?
I also tried running this code on a smaller website, e.g. http://www.dejumomagbagbeola.org, but it raised an iteration error.
I hope someone will help me make the code distinguish internal and external links.
Please ! Please Please !
Thanks for the code, I will modify it to demonstrate a simple sql injection detection mechanism :)