How to get favicon.ico files from Alexa Top 1000 sites in 2 minutes with Python

Make folders:

mkdir -p favicons/icons ; cd favicons

Get a list of Alexa Top 1000 sites:

curl -s -O ; unzip -q -o top-1m.csv ; head -1000 top-1m.csv | cut -d, -f2 | cut -d/ -f1 > topsites.txt
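
For readers who prefer to stay in Python, the head/cut part of the pipeline above can be sketched roughly as follows. This assumes each line of top-1m.csv has the form rank,domain; the helper name top_domains and its limit parameter are made up for illustration and are not part of the original one-liner.

```python
from itertools import islice

def top_domains(lines, limit=1000):
    """Return the domain column from the first `limit` lines of the CSV."""
    domains = []
    for line in islice(lines, limit):        # like `head -1000`
        fields = line.strip().split(',')     # like `cut -d, -f2`
        if len(fields) < 2:
            continue                         # skip blank or malformed lines
        # keep only the host part, like `cut -d/ -f1`
        domains.append(fields[1].split('/')[0])
    return domains
```

Passing an open file object for top-1m.csv as lines and writing the result to topsites.txt, one domain per line, should give the same list as the shell version.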

The time-saving one-liner above was found here.

Install gevent:

yum install python-gevent

Gevent is a high-performance network framework for Python built on top of libevent and greenlets.

A few modifications to an example shipped with gevent:

# Copyright (c) 2009 Denis Bilenko. See LICENSE for details.
"""Spawn multiple workers and wait for them to complete"""

import gevent
from gevent import monkey

# patches stdlib (including socket and ssl modules) to cooperate with other greenlets
monkey.patch_all()

import urllib2
from socket import setdefaulttimeout

# don't hang forever on dead hosts; the 10-second value is an assumption
setdefaulttimeout(10)

urls = ['http://www.' + line.strip() for line in open('topsites.txt')]

def print_head(url):
    print ('Starting %s' % url)
    url = url + '/favicon.ico'
    try:
        data = urllib2.urlopen(url).read()
    except Exception, e:
        print 'error', url, e
        return

    # drop the leading 'http://www.' (11 characters) to build the file name
    fn = 'icons/' + url[11:].replace("/", "-")
    myFile = open(fn, 'wb')
    myFile.write(data)
    myFile.close()

# one greenlet per site, then wait for all of them to finish
jobs = [gevent.spawn(print_head, url) for url in urls]

gevent.joinall(jobs)
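
The file name in the script is built by dropping the first 11 characters of the URL, which works only because every URL here starts with 'http://www.'. A slightly more defensive sketch of the same naming step, using the standard library's URL parser, might look like this. The helper icon_filename is hypothetical and is not part of the original script or of gevent.

```python
try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse       # Python 2

def icon_filename(url, folder='icons'):
    """Map a favicon URL to a flat file name inside `folder`."""
    parts = urlparse(url)
    host = parts.netloc
    # drop a leading 'www.' so names match the entries in topsites.txt
    if host.startswith('www.'):
        host = host[len('www.'):]
    # flatten the path so everything lands in a single directory
    name = (host + parts.path).replace('/', '-')
    return folder + '/' + name
```

For 'http://www.google.com/favicon.ico' this yields 'icons/google.com-favicon.ico', the same name the slice-based version produces, and it also copes with URLs that use https or lack the www prefix.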

[dande@host favicons]$ time python ./

real 0m50.644s
user 0m1.914s
sys 0m0.888s
[dande@host favicons]$
[dande@host favicons]$ ls icons/ | wc -l
[dande@host favicons]$

Well, there’s not much point to this beyond fooling around with Python.
