22 Feb

How to get favicon.ico files from Alexa Top 1000 sites in 2 minutes with Python

Make folders:

mkdir -p favicons/icons ; cd favicons

Get a list of Alexa Top 1000 sites:

curl -s -O http://s3.amazonaws.com/alexa-static/top-1m.csv.zip ; unzip -q -o top-1m.csv.zip top-1m.csv ; head -1000 top-1m.csv | cut -d, -f2 | cut -d/ -f1 > topsites.txt
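
For anyone who would rather do the head/cut part in Python, here is a minimal sketch of the same filtering step. It assumes top-1m.csv has already been downloaded and unzipped, and that every row looks like rank,site. The names top_domains and write_topsites are made up for this illustration, and the code targets Python 3.

```python
from itertools import islice


def top_domains(rows, limit=1000):
    """Return the site column of the first `limit` rows of a rank,site CSV."""
    domains = []
    # islice plays the role of `head -1000`
    for row in islice(rows, limit):
        fields = row.strip().split(',')
        if len(fields) < 2:
            # assumption: rows without a comma are skipped rather than kept
            continue
        # `cut -d, -f2` keeps the second field, `cut -d/ -f1` drops any path
        domains.append(fields[1].split('/')[0])
    return domains


def write_topsites(csv_path='top-1m.csv', out_path='topsites.txt', limit=1000):
    """Write one domain per line, like the shell one-liner does."""
    with open(csv_path) as src, open(out_path, 'w') as dst:
        for domain in top_domains(src, limit):
            dst.write(domain + '\n')
```

Calling write_topsites() inside the favicons folder should leave a topsites.txt in the same one-domain-per-line format.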

The time-saving one-liner above was found here.

Install gevent:

yum install python-gevent

Gevent is a high-performance network framework for Python built on top of libevent and greenlets.

Here is an example shipped with gevent, with a few modifications:

#!/usr/bin/python
# Copyright (c) 2009 Denis Bilenko. See LICENSE for details.
"""Spawn multiple workers and wait for them to complete"""

import gevent
from gevent import monkey

# patches stdlib (including socket and ssl modules) to cooperate with other greenlets
monkey.patch_all()

import urllib2
from socket import setdefaulttimeout

# give up on any socket operation that blocks for more than 30 seconds
setdefaulttimeout(30)

PREFIX = 'http://www.'

# one URL per site listed in topsites.txt
urls = [PREFIX + line.strip() for line in open('topsites.txt')]

def print_head(url):
    print ('Starting %s' % url)
    url = url + '/favicon.ico'
    try:
        data = urllib2.urlopen(url).read()
    except Exception, e:
        print 'error', url, e
        return

    # drop the 'http://www.' prefix and flatten the rest into a file name
    fn = 'icons/' + url[len(PREFIX):].replace("/", "-")
    # binary mode, since icons are not text
    myFile = open(fn, 'wb')
    myFile.write(data)
    myFile.close()

# one greenlet per site, then wait for all of them to finish
jobs = [gevent.spawn(print_head, url) for url in urls]

gevent.joinall(jobs)

Saved as get.py, it runs like this:

[dande@host favicons]$ time python ./get.py
...

real 0m50.644s
user 0m1.914s
sys 0m0.888s
[dande@host favicons]$
[dande@host favicons]$ ls icons/ | wc -l
889
[dande@host favicons]$
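
The run above takes about 50 seconds of wall-clock time but under 3 seconds of CPU, so it is almost entirely waiting on the network. The same spawn-everything-and-wait pattern can be sketched without gevent. What follows is not the gevent approach but a stdlib thread pool, written for Python 3. The names icon_filename, fetch, save_icon and save_all are made up for this sketch, and the worker count of 50 is an arbitrary assumption.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

PREFIX = 'http://www.'


def icon_filename(url):
    """Drop the 'http://www.' prefix and turn slashes into dashes."""
    return url[len(PREFIX):].replace('/', '-')


def fetch(url, timeout=30):
    """Download url and return the response body as bytes."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def save_icon(domain, out_dir, fetcher=fetch):
    """Fetch one favicon; return the saved path, or None on any error."""
    url = PREFIX + domain + '/favicon.ico'
    try:
        data = fetcher(url)
    except Exception as exc:
        print('error', url, exc)
        return None
    path = os.path.join(out_dir, icon_filename(url))
    # binary mode, since icons are not text
    with open(path, 'wb') as fh:
        fh.write(data)
    return path


def save_all(domains, out_dir='icons', workers=50, fetcher=fetch):
    """Run save_icon for every domain on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda d: save_icon(d, out_dir, fetcher), domains))
```

With topsites.txt and the icons folder in place, something like save_all(line.strip() for line in open('topsites.txt')) should behave much like get.py. The fetcher argument is only there so the download step can be swapped out, for example when trying the code offline.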

Well, there’s not much point to this beyond fooling around with Python.


