Python Web Crawler in Less Than 50 Lines

I got kind of bored today and wrote a pretty simple web crawler in Python, and it turned out to be less than 50 lines. It doesn’t store output; I’ll leave that up to anyone who wants to use the code because, well, there are just too many ways to choose from. Right now you pass it a starting link as a parameter and it will crawl forever until it runs out of links, which is not a likely condition. So here ya go. Have fun. Feel free to ask questions.

import sys
import re
import urllib2
import urlparse
tocrawl = set([sys.argv[1]])
crawled = set([])
keywordregex = re.compile('<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')

while 1:
	try:
		crawling = tocrawl.pop()
		print crawling
	except KeyError:
		break  # nothing left to crawl, so leave the loop cleanly
	url = urlparse.urlparse(crawling)
	try:
		response = urllib2.urlopen(crawling)
	except:
		continue
	msg = response.read()
	startPos = msg.find('<title>')
	if startPos != -1:
		endPos = msg.find('</title>', startPos+7)
		if endPos != -1:
			title = msg[startPos+7:endPos]
			print title
	keywordlist = keywordregex.findall(msg)
	if len(keywordlist) > 0:
		keywordlist = keywordlist[0]
		keywordlist = keywordlist.split(", ")
		print keywordlist
	links = linkregex.findall(msg)
	crawled.add(crawling)
	for link in (links.pop(0) for _ in xrange(len(links))):
		if link.startswith('/'):
			link = 'http://' + url[1] + link
		elif link.startswith('#'):
			link = 'http://' + url[1] + url[2] + link
		elif not link.startswith('http'):
			link = 'http://' + url[1] + '/' + link
		if link not in crawled:
			tocrawl.add(link)
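
The listing above is Python 2 (print statement, urllib2, urlparse). For readers on Python 3, here is a rough sketch of the same loop, not the author's later version. It keeps the title and link extraction and leaves out the keyword regex for brevity. The names fetch, crawl, fetch_page and max_pages are assumptions added for illustration; max_pages exists only so the loop is guaranteed to stop.

```python
import re
import sys
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen

# Same idea as the regexes in the listing above, written as raw strings.
TITLE_RE = re.compile(r'<title>(.*?)</title>', re.IGNORECASE | re.DOTALL)
LINK_RE = re.compile(r'<a\s+[^>]*?href=[\'"](.*?)[\'"]', re.IGNORECASE)


def fetch(url):
    """Download a page and return its body as text ('' on any failure)."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read().decode('utf-8', errors='replace')
    except Exception:  # catch-all, mirroring the bare except in the original
        return ''


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Crawl from start_url and return {url: title} for the pages visited.

    fetch_page and max_pages are hypothetical additions, not part of the
    original script: they make the loop testable and make sure it ends.
    """
    tocrawl = {start_url}
    crawled = {}
    while tocrawl and len(crawled) < max_pages:
        crawling = tocrawl.pop()
        html = fetch_page(crawling)
        match = TITLE_RE.search(html)
        crawled[crawling] = match.group(1).strip() if match else None
        for link in LINK_RE.findall(html):
            # Resolve relative links against the current page, drop "#fragment".
            link = urldefrag(urljoin(crawling, link))[0]
            if link.startswith('http') and link not in crawled:
                tocrawl.add(link)
    return crawled


if __name__ == '__main__' and len(sys.argv) > 1:
    for url, title in crawl(sys.argv[1]).items():
        print(url, title)
```

Run it the same way as the original, with the start URL as the only argument. Passing your own fetch_page (for example, one that reads from a dict of canned pages) lets you try the loop without touching the network.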

** EDIT **

This was a very early draft of this program. As it turns out, I revisited this project a few months later and it evolved much more.
If you would like to check out the more evolved form, feel free to have a look here at my github!

  1. February 16, 2009 at 12:02 pm | #1

    I get an errors that start and end pos are not defined. I figured start pos should be set to 0 but what is end pos set to?

  2. The.Anti.9
    February 16, 2009 at 1:49 pm | #2

    my apologies, there was html in the quotes in startPos and endPos definitions that i forgot to change to the html escape characters. Fixed now.

  3. Suzy
    March 3, 2009 at 1:30 am | #3

    Just dropping by.Btw, you website have great content!


  4. March 3, 2009 at 4:40 am | #4

    hi the.anti.9

    I’m trying to put a web business together. I met a guy froma company who indicated he could right me a spider and a web crawler that I could use to atract businee to my site but after 4 weeks of trying to track him down I’m starting to realize he is full of it. I would very much enjoy a converation with you regarding a small project for a web crawler to aquire leads to my web site. I can be reach at nigelshawn@yahoo.com

  5. ncr_bhikari
    July 22, 2009 at 3:52 am | #5

    nice and concise article…i’ve not tested it though…

  6. Sam
    August 1, 2009 at 12:15 pm | #6

    Hello,

    When running script I get the error:

    Traceback (most recent call last):
    File “/Users/samiles3/Desktop/crawler.py”, line 5, in
    tocrawl = set([sys.argv[1]])
    IndexError: list index out of range

    Any ideas?

    Thank you for your help.

  7. The.Anti.9
    August 1, 2009 at 12:20 pm | #7

    Sam: You need to pass it a start url.

  8. desiNerd
    August 4, 2009 at 1:42 pm | #8

    @ Sam, it seems you didn’t provide the url as input …you have to run the above program like this:
    desiNerd-laptop$ python crawler.py http://www.google.com

    as you didn’t provide the last argument it errored out saying that the argv list index is out of range: in your case argv holds only the script name (argv[0]), whereas the program tries to fetch the URL from argv[1], the first argument provided…hope this helps. thanks.

  9. Sam
    August 6, 2009 at 4:26 pm | #9

    Thanks, desiNerd, that works great.

    I am running script (thanks :D ) but sometimes when the script comes across a url with either a # anchor in or a php ? parameter in, the script gets stuck on this URL, and just prints the URL over and over.

    Any ideas how I might solve this?

    Thank you for your help once again.

    Sam

  10. Sam
    August 6, 2009 at 4:29 pm | #10

    Please forgive me, but I have solved this problem now.

  11. Varun
    August 29, 2009 at 12:25 am | #11

    Wondering whether the admin of the website being crawled, will get to know about our crawler in any way?

    They may also block a particular IP from which the crawler is being hosted.

    Is there any way to impersonate as if the request (actually sent out by our crawler) to the website is coming from some web browser (like IE, firefox, opera)?

  12. October 29, 2009 at 12:32 am | #12

    great job !!

    • January 13, 2010 at 1:50 pm | #13

      You may want to check out a web crawler that I have written as well. It is made up of two classes, a parser and a crawler. It’s simple to use and you can extend the crawler class to do just about anything. The crawler on this page is a good example of how to make a purely procedural crawler, but in most cases, an object oriented one will be easier to maintain and modify :) http://www.esux.net/content/simple-python-web-crawler-thats-easy-build

  13. January 13, 2010 at 2:09 pm | #14

    Pretty neat :) I came across the site because we’re writing a Python API for our web-crawling service, 80legs.

  14. vinayak
    February 25, 2010 at 5:01 pm | #15

    I am new to python.What is the meaning of “You need to pass it a start url”

    Thanks!

  15. March 2, 2010 at 12:59 am | #16

    @vinayak: You need to take the above code and put it into a file and save the file with a name such as “crawler.py”. Next, navigate to that file in your shell and type:
    python crawler.py http://www.somewebsite.com

    Replace somewebsite.com with an actual website and it should start giving results.

    The meaning of “You need to pass it a start url” is… take a look at line 5:
    tocrawl = set([sys.argv[1]])

    That line sets the variable tocrawl with the value of sys.argv[1], which refers to the first value passed to the script when it is executed. If that is not set, there should be an error.
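
To make the missing-argument case friendlier than an IndexError, the script could check the argument list before using it. This is a minimal sketch in Python 3; get_start_url and main are hypothetical helpers, not part of the original script.

```python
def get_start_url(argv):
    """Return the first command-line argument, or None if it is missing.

    argv[0] is always the script name, so the start URL lives in argv[1].
    """
    if len(argv) < 2:
        return None
    return argv[1]


def main(argv):
    """Print a usage hint instead of crashing when no URL was given."""
    start_url = get_start_url(argv)
    if start_url is None:
        print('usage: python crawler.py <start-url>')
        return 1
    print('starting crawl at', start_url)
    return 0
```

In a real script you would finish with sys.exit(main(sys.argv)) after importing sys, so the shell sees a non-zero exit status when the URL is missing.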

  16. March 7, 2010 at 7:13 pm | #17

    Interesting stuff. Crawling web with Python can be easier if you use another tools, like BeautifulSoup. I’m writting a serie of tutorials in my blog discussing it. It is in portuguese, but Google Translation works well in this case =P

    http://herberthamaral.com/2010/02/criando-web-crawlers-em-python-parte-i/

  17. Alfred
    March 9, 2010 at 2:13 am | #18

    Thanks! This is nice code, very well thought out. Only thing you might want to add is robots.txt compliance to be a good web citizen (denizen? its a web-crawler…)

    @ Varun: But that would be dishonest! Not sure, maybe you can tamper with what Python sends as its user agent string. I would start looking at urllib.URLopener -> the version variable, I think Python submits this as its UA?
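
Both points in this comment can be sketched with the Python 3 standard library: urllib.request.Request accepts a headers dict, and urllib.robotparser.RobotFileParser answers whether a URL may be fetched. The user agent name and the helper functions below are made up for illustration. Sending an honest crawler name, as here, is an assumption about good practice and not something the original script does.

```python
from urllib.request import Request
from urllib.robotparser import RobotFileParser

USER_AGENT = 'ExampleCrawler/0.1'  # hypothetical name, pick your own


def build_request(url, user_agent=USER_AGENT):
    """Build a request that identifies the crawler via the User-Agent header."""
    return Request(url, headers={'User-Agent': user_agent})


def make_robots_checker(robots_lines, user_agent=USER_AGENT):
    """Return a function telling whether a URL may be fetched.

    robots_lines is the content of a robots.txt file, split into lines.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return lambda url: parser.can_fetch(user_agent, url)
```

Against a live site you would call set_url() with the site's robots.txt address and then read() on the parser instead of parse(), and pass the object returned by build_request to urlopen.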

  18. fayrouz
    March 21, 2010 at 9:25 am | #19

    dearest, i am a computer science student from egypt, my graduation project is to fill an open source digital library with images and there meta data from a web crawler i want a very simple web crawler that get the image meta data and thats it
    plzz reply if u can help me thnx :)

  19. dude
    March 24, 2010 at 7:00 pm | #20

    Serves as a great start, and saves me some work. Thanks a lot!!!!!

  20. security guy
    March 30, 2010 at 8:20 pm | #21

    Nice work !

  21. Ayan Debnath
    April 20, 2010 at 3:46 am | #22

    Hello,

    Can you pls modify the CODE for me.
    I don’t know python.

    Here is the requirements -

    1. The CODE should run in Google App Engine.

    2. If I give domain name, say – http://www.mycompany.com
    It will fetch me internal URLs that will have */gallery/* in it.

    Say, URLs like this -

    http://www.mycompany.com/abcd/gallery/xyz/test.html
    http://www.mycompany.com/gallery/xyz/test.html
    http://mycompany.com/abcd/gallery/test.html
    etc.

    It will be a great help if you can mail me the modified code.

    Thank you in advance.

    Ayan Debnath
    iosoft@gmail.com

  22. April 28, 2010 at 6:17 am | #23

    See a related article with extensive explanations:
    http://ms4py.org/2010/04/27/python-search-engine-crawler-part-1/

  23. Adejumo Magbagbeola
    June 1, 2010 at 4:28 pm | #24

    Please i am doing a research and need to write a web crawler that gather the number of web pages and hyperlinks on a website(otago.ac.nz)

    can i use the above code for that as well …………… i am writing with python

    • Chaim
      June 1, 2010 at 5:29 pm | #25

      The above code will let you do that if you make a modification. Namely, you have to limit it to only crawl otago.ac.nz, so for each hyperlink that it finds, you have to have it check if it is part of that website, and then just keep a count of the results.

      • August 23, 2010 at 8:57 am | #26

        Hi Chaim, the thing is i am so new to python , i have actually been struggling to do what you have said for like 2months now,but i cant still fix it … how do you limit it to only that web site ?
        any help will be appreciated alot

    • August 24, 2010 at 8:47 pm | #27

      It’s been a while since I’ve looked at this, but essentially you need to parse each URL that is found and determine if its domain contains the string otago.ac.nz. This can be done by a simple regular expression search or a sub-string comparison or by using some module to break apart the domain name. It really doesn’t matter how you do it – you just want to determine if that string is part of the URL found. If it isn’t, then don’t add it to the list to keep indexing.
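
One way to do the check described above, sketched in Python 3 with urllib.parse. The functions is_internal and count_links are hypothetical names. Comparing the hostname (and not searching the whole URL) is a choice made here so that a link such as http://example.com/?ref=otago.ac.nz is not mistaken for an internal one.

```python
from urllib.parse import urlparse


def is_internal(link, domain):
    """True if link points at domain or one of its subdomains."""
    host = urlparse(link).hostname or ''
    return host == domain or host.endswith('.' + domain)


def count_links(links, domain):
    """Return (internal, external) counts for an iterable of absolute URLs."""
    internal = sum(1 for link in links if is_internal(link, domain))
    return internal, len(links) - internal
```

In the crawler you would call is_internal(link, 'otago.ac.nz') just before adding a link to tocrawl, and skip the link when it returns False.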

  24. Adejumo Magbagbeola
    June 1, 2010 at 4:29 pm | #28

    i will be so grateful if i am helped ………cheers

  25. June 1, 2010 at 5:38 pm | #29

    Hey guys – got notified of this post through backtype. If you’re not interested in having to write your own web crawler, you can user our crawling service at http://www.80legs.com.

    Cheers,
    Shion

  26. Adejumo Magbagbeola
    June 6, 2010 at 5:32 pm | #30

    hi ‘ yeah
    i tried running this ;code but had an error with line 12… where it says “print crawling”

    what do i do

  27. Amit
    June 15, 2010 at 3:07 pm | #31

    But Shion, you guys are robots.txt compliant. Otherwise I’d use you ;)

  28. wan
    June 21, 2010 at 8:15 pm | #32

    Sorry, dumn question. How can I navigate to the file in a shell?

  29. Adejumo
    July 26, 2010 at 9:10 pm | #33

    please where in the code do u out the web site you want to crawl?

    Thanks

    • The.Anti.9
      July 26, 2010 at 9:16 pm | #34

      where do I out? or where do I put? I’m going to assume put. You give the start url as a parameter when you call the program.

      > python crawler.py http://yoursite.com

  30. August 2, 2010 at 3:49 am | #35

    Hi! Sometimes I think that my life has no sense, but when I visit blogs like yours I find myself happy. Keep writing on.

    • The.Anti.9
      August 2, 2010 at 8:29 am | #36

      Thanks for your reply! This was a very early draft of this program. If you would like to see the more evolved version you should go here to my github!

  31. August 2, 2010 at 6:51 am | #37

    I have added this post to my favorites. Now and on I will read it more often.

  32. py-fan
    August 2, 2010 at 8:11 am | #38

    Hi! You got an error in
    elif not link.startswith('http'):
    link = 'http://' + url[1] + '/' + link

    Last line should look like
    link = urlparse.urljoin(crawling, link)

    That’s why:
    imagine you are crawling from the page http://site.com/1/test.html
    and there is a link to “2/test2.html”
    Your code redirects to:
    site.com/2/test2.html
    and mine to: site.com/1/2/test2.html (as it should be!)

    Thanks for your ideas. :)
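
The urljoin suggestion can be tried out in isolation. This sketch uses Python 3, where the function lives in urllib.parse. Note that the first argument has to be the page's URL as a string (crawling in the listing), not the parsed tuple url. The helper name resolve is hypothetical, and dropping the "#fragment" with urldefrag is an extra step assumed here so that the same page is not queued twice under different anchors.

```python
from urllib.parse import urljoin, urldefrag


def resolve(page_url, link):
    """Turn a link found on page_url into an absolute URL without a fragment."""
    absolute = urljoin(page_url, link)   # handles '/x', 'x', '../x', '#x', full URLs
    return urldefrag(absolute)[0]        # 'http://a/b#c' -> 'http://a/b'


page = 'http://site.com/1/test.html'
print(resolve(page, '2/test2.html'))    # http://site.com/1/2/test2.html
print(resolve(page, '/2/test2.html'))   # http://site.com/2/test2.html
print(resolve(page, '#top'))            # http://site.com/1/test.html
```

With this, the three startswith branches in the original loop collapse into a single call.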

  33. August 23, 2010 at 8:33 am | #39

    Please it s me again …….. i tried using this code on a large website(www.otago.ac.nz) and it worked but please how do i specify the external links and the internal links ? ?

    And also i tried running this code on smaller website e.g http://www.dejumomagbagbeola.org but it raised and iteration error…

    i hope some one will help me out with making the code specify internal and external links

    Please ! Please Please !

  34. March 27, 2012 at 6:52 am | #40

    Thanks for the code, I will modify it to demonstrate a simple sql injection detection mechanism :)
