How to get favicon.ico files from Alexa Top 1000 sites in 2 minutes with Python
Make folders:
mkdir -p favicons/icons ; cd favicons
Get a list of Alexa Top 1000 sites:
curl -s -O http://s3.amazonaws.com/alexa-static/top-1m.csv.zip ; unzip -q -o top-1m.csv.zip top-1m.csv ; head -1000 top-1m.csv | cut -d, -f2 | cut -d/ -f1 > topsites.txt
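For anyone who would rather stay in Python, the trimming step can be sketched roughly as below. This is not part of the original one-liner: the function name top_domains and the sample rows are made up for illustration, and it assumes every line of top-1m.csv has the form rank,domain.

```python
def top_domains(lines, limit=1000):
    """Return the host part of the first `limit` 'rank,domain' rows."""
    domains = []
    for line in lines:
        if len(domains) >= limit:
            break
        fields = line.strip().split(',')
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        # like `cut -d, -f2 | cut -d/ -f1`: second field, minus any path
        domains.append(fields[1].split('/')[0])
    return domains


# Hypothetical sample rows in the assumed rank,domain layout
sample = ["1,google.com", "2,facebook.com", "3,example.com/some/path"]
print(top_domains(sample, limit=2))
```

Passing open('top-1m.csv') instead of the sample list and writing the result out one domain per line should give much the same topsites.txt as the shell pipeline.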
The time-saving one-liner above was found here.
Install gevent:
yum install python-gevent
Gevent is a high-performance network framework for Python built on top of libevent and greenlets.
Here is an example shipped with gevent, with a few modifications (saved as get.py):
#!/usr/bin/python
# Copyright (c) 2009 Denis Bilenko. See LICENSE for details.
"""Spawn multiple workers and wait for them to complete"""

import gevent
from gevent import monkey

# patches stdlib (including socket and ssl modules) to cooperate with other greenlets
monkey.patch_all()

import urllib2
from socket import setdefaulttimeout

# give up on hosts that do not answer within 30 seconds
setdefaulttimeout(30)

urls = ['http://www.' + line.strip() for line in open('topsites.txt')]


def print_head(url):
    print('Starting %s' % url)
    url = url + '/favicon.ico'
    try:
        data = urllib2.urlopen(url).read()
    except Exception as e:
        print('error %s %s' % (url, e))
        return
    # strip the 11-character 'http://www.' prefix and flatten slashes
    fn = 'icons/' + url[11:].replace('/', '-')
    # favicons are binary data, so write in binary mode
    with open(fn, 'wb') as icon_file:
        icon_file.write(data)


# one greenlet per site; wait until all of them have finished
jobs = [gevent.spawn(print_head, url) for url in urls]
gevent.joinall(jobs)
[dande@host favicons]$ time python ./get.py
...
real 0m50.644s
user 0m1.914s
sys 0m0.888s
[dande@host favicons]$
[dande@host favicons]$ ls icons/ | wc -l
889
[dande@host favicons]$
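The files counted in icons/ get their names from the script skipping the first 11 characters of each URL and replacing slashes with dashes. A small sketch of that mapping, pulled out into a helper for clarity; the names PREFIX and icon_filename are made up here and do not appear in the script itself:

```python
PREFIX = 'http://www.'  # 11 characters long, hence the slice at index 11


def icon_filename(url):
    """Map a favicon URL to the path the script saves it under."""
    # Drop the 'http://www.' prefix, then flatten the remaining slashes
    return 'icons/' + url[len(PREFIX):].replace('/', '-')


print(icon_filename('http://www.google.com/favicon.ico'))
```

Because the domain is part of the file name, each site gets its own file instead of everything being written to a single favicon.ico.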
Well, there's not much point to this except fooling around with Python.