How to get favicon.ico files from Alexa Top 1000 sites in 2 minutes with Python
Make folders:
mkdir -p favicons/icons ; cd favicons
Get a list of Alexa Top 1000 sites:
curl -s -O http://s3.amazonaws.com/alexa-static/top-1m.csv.zip ; unzip -q -o top-1m.csv.zip top-1m.csv ; head -1000 top-1m.csv | cut -d, -f2 | cut -d/ -f1 > topsites.txt
The time-saving one-liner above was found here.
Install gevent:
yum install python-gevent
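For anyone who would rather stay in Python, here is a rough sketch of what the head/cut part of that pipeline does. The function name `top_domains` and the sample lines are made up for illustration; the only thing assumed about the real file is the `rank,domain` layout that the cut commands imply.

```python
from itertools import islice


def top_domains(lines, limit=1000):
    """Return the domain column of the first `limit` 'rank,domain' lines."""
    domains = []
    for line in islice(lines, limit):  # head -1000
        fields = line.strip().split(',')
        if len(fields) < 2:
            continue  # malformed line: simply skipped in this sketch
        # cut -d, -f2 | cut -d/ -f1: second CSV field, minus any path part
        domains.append(fields[1].split('/')[0])
    return domains


# Made-up sample input in the assumed 'rank,domain' layout
sample = ['1,example.com\n', '2,example.org/some/path\n', '3,example.net\n']
print(top_domains(sample, limit=2))
```

With the real file this would be called as `top_domains(open('top-1m.csv'))` and the result written to topsites.txt, one domain per line.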
Gevent is a high-performance network framework for Python built on top of libevent and greenlets.
A few modifications to an example shipped with gevent:
#!/usr/bin/python
# Copyright (c) 2009 Denis Bilenko. See LICENSE for details.
"""Spawn multiple workers and wait for them to complete"""
# one URL per site listed in topsites.txt
urls = ['http://www.' + line.strip() for line in open('topsites.txt')]

import gevent
from gevent import monkey
# patches stdlib (including socket and ssl modules) to cooperate with other greenlets
monkey.patch_all()
import urllib2
from socket import setdefaulttimeout
setdefaulttimeout(30)

def print_head(url):
    print ('Starting %s' % url)
    url = url + '/favicon.ico'
    try:
        data = urllib2.urlopen(url).read()
    except Exception as e:
        print 'error', url, e
        return
    # drop the leading 'http://www.' (11 characters) and flatten slashes
    fn = 'icons/' + url[11:].replace("/", "-")
    myFile = open(fn, 'wb')  # binary mode, since icons are not text
    myFile.write(data)
    myFile.close()

jobs = [gevent.spawn(print_head, url) for url in urls]
gevent.joinall(jobs)
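The file-name step in the script is terse, so here it is in isolation as a small sketch. The helper name `icon_filename` and the `PREFIX` constant are mine, not part of the script; the slice offset of 11 in the original is just the length of 'http://www.'.

```python
PREFIX = 'http://www.'  # 11 characters, the offset used in the script


def icon_filename(url, prefix=PREFIX):
    """Map a favicon URL to a flat file name under icons/."""
    if url.startswith(prefix):
        url = url[len(prefix):]
    # Slashes would otherwise be read as directories, so flatten them
    return 'icons/' + url.replace('/', '-')


print(icon_filename('http://www.example.com/favicon.ico'))
# prints icons/example.com-favicon.ico
```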
[dande@host favicons]$ time python ./get.py
...
real 0m50.644s
user 0m1.914s
sys 0m0.888s
[dande@host favicons]$
[dande@host favicons]$ ls icons/ | wc -l
889
[dande@host favicons]$
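If gevent is not at hand, the same spawn-everything-then-wait pattern can be approximated with threads from the standard library. This is a different technique (a thread pool in Python 3, not greenlets), and the names `fetch_all` and `fake_fetch` are hypothetical. The fake fetcher is only there so the sketch runs without a network; as in the script, failed downloads are reported and skipped.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, workers=50):
    """Run fetch(url) for every URL concurrently; return {url: data} for successes."""
    def safe_fetch(url):
        try:
            return url, fetch(url)
        except Exception as exc:
            print('error', url, exc)  # mirror the script: report and move on
            return url, None

    # Leaving the with-block waits for all workers, like gevent.joinall
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(safe_fetch, urls))
    return {url: data for url, data in results if data is not None}


def fake_fetch(url):
    # Hypothetical stand-in for a real download, so the sketch runs offline
    if 'broken' in url:
        raise IOError('simulated failure')
    return b'icon-bytes'


urls = ['http://www.example.com/favicon.ico',
        'http://www.broken.example/favicon.ico']
icons = fetch_all(urls, fake_fetch)
print(len(icons), 'of', len(urls), 'fetched')
```

For real downloads, `fake_fetch` could be swapped for something like `lambda u: urllib.request.urlopen(u, timeout=30).read()`.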
Well, there's not much point to this beyond fooling around with Python.