22 Feb

How to get favicon.ico files from Alexa Top 1000 sites in 2 minutes with Python

Make folders:

mkdir -p favicons/icons ; cd favicons

Get a list of Alexa Top 1000 sites:

curl -s -O http://s3.amazonaws.com/alexa-static/top-1m.csv.zip ; unzip -q -o top-1m.csv.zip top-1m.csv ; head -1000 top-1m.csv | cut -d, -f2 | cut -d/ -f1 > topsites.txt

The time-saving one-liner above was found here.
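
For readers who would rather stay in Python, the extraction step of the shell one-liner can be sketched with the standard library. This is a minimal sketch written for Python 3, assuming the CSV rows have the form rank,domain as the cut commands imply. The name top_domains is made up for illustration.

```python
import csv
import io


def top_domains(csv_text, limit=1000):
    """Return the first `limit` domains from 'rank,domain' CSV text."""
    domains = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(domains) >= limit:
            break
        if len(row) < 2:
            continue  # skip blank or malformed rows
        # Mirror `cut -d/ -f1`: keep only the part before the first slash.
        domains.append(row[1].split('/')[0])
    return domains


sample = "1,google.com\n2,facebook.com\n3,example.com/some/path\n"
print(top_domains(sample, limit=2))
```

Writing the returned list to topsites.txt, one domain per line, gives the same input file the script below expects.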

Install gevent:

yum install python-gevent

Gevent is a high-performance network framework for Python built on top of libevent and greenlets.

Here is a slightly modified version of an example shipped with gevent:

#!/usr/bin/python
# Copyright (c) 2009 Denis Bilenko. See LICENSE for details.
"""Spawn multiple workers and wait for them to complete"""

import gevent
from gevent import monkey

# patches stdlib (including socket and ssl modules) to cooperate with other greenlets
monkey.patch_all()

# urllib2 means this script targets Python 2
import urllib2
from socket import setdefaulttimeout
setdefaulttimeout(30)

PREFIX = 'http://www.'
urls = [PREFIX + line.strip() for line in open('topsites.txt')]

def print_head(url):
    print('Starting %s' % url)
    url = url + '/favicon.ico'
    try:
        data = urllib2.urlopen(url).read()
    except Exception as e:
        print('error %s %s' % (url, e))
        return

    # drop the 'http://www.' prefix and flatten the path into a file name
    fn = 'icons/' + url[len(PREFIX):].replace('/', '-')
    # favicons are binary, so write in binary mode
    with open(fn, 'wb') as icon_file:
        icon_file.write(data)

jobs = [gevent.spawn(print_head, url) for url in urls]

gevent.joinall(jobs)

[dande@host favicons]$ time python ./get.py
...

real 0m50.644s
user 0m1.914s
sys 0m0.888s
[dande@host favicons]$
[dande@host favicons]$ ls icons/ | wc -l
889
[dande@host favicons]$
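
The script above needs Python 2 and gevent. On a Python 3 box without gevent, the same spawn-and-join pattern could be sketched with threads from the standard library. Note that this swaps greenlets for a thread pool, so it is a substitute technique rather than the gevent approach. The names icon_filename, http_fetch, save_icon and fetch_all, and the injectable fetch parameter, are illustrative assumptions.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

PREFIX = 'http://www.'


def icon_filename(url):
    """Map 'http://www.example.com/favicon.ico' to 'example.com-favicon.ico'."""
    return url[len(PREFIX):].replace('/', '-')


def http_fetch(url):
    """Default fetcher: download the URL body with a 30 second timeout."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def save_icon(site, out_dir, fetch):
    """Fetch one favicon; return the saved path, or None on any error."""
    url = PREFIX + site + '/favicon.ico'
    try:
        data = fetch(url)
    except Exception as exc:
        print('error', url, exc)
        return None
    path = os.path.join(out_dir, icon_filename(url))
    with open(path, 'wb') as icon_file:  # favicons are binary
        icon_file.write(data)
    return path


def fetch_all(sites, out_dir='icons', fetch=http_fetch, workers=50):
    """Run save_icon for every site on a thread pool and wait for all of them."""
    os.makedirs(out_dir, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda site: save_icon(site, out_dir, fetch), sites))
```

A call such as fetch_all(line.strip() for line in open('topsites.txt')) would then play the role of the spawn and joinall lines. Passing a fake fetch function makes it possible to try the logic without touching the network.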

Well, there’s not much point to this except fooling around with Python.

21 Feb

How to download Coursera materials using Python

Install coursera-dl by Dirk Gorissen:

python-pip install coursera-dl

Make a folder to store files:

mkdir -p ./courses/comnetworks-2012-001

Run:

coursera-dl -u your_email -p your_password ./courses/comnetworks-2012-001 comnetworks-2012-001

Enjoy.

To check for new materials, run the same command again. coursera-dl is smart enough to skip files you already have:

- Downloading resources for 2-6 Link Layer Overview (0414)
- "2-readings.pdf" already exists, skipping
- "2-6-link-overview-ink.pdf" already exists, skipping
- "2 - 6 - 2-6 Link Layer Overview (0414).txt" already exists, skipping
- "2 - 6 - 2-6 Link Layer Overview (0414).srt" already exists, skipping
- "2 - 6 - 2-6 Link Layer Overview (0414).mp4" already exists, skipping
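
The skip-on-rerun behavior in that log is easy to picture in code. This is a minimal sketch of the general idea in Python 3, not coursera-dl’s actual implementation. The function download_if_missing and its fetch callback are hypothetical names.

```python
import os


def download_if_missing(path, fetch):
    """Write fetch() to `path` unless the file already exists.

    Returns True if the file was downloaded, False if it was skipped.
    """
    if os.path.exists(path):
        print('- "%s" already exists, skipping' % os.path.basename(path))
        return False
    data = fetch()  # only pay for the download when the file is missing
    with open(path, 'wb') as out:
        out.write(data)
    return True
```

Because the existence check comes first, running the same command twice costs only a directory lookup per file the second time.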

20 Feb

Watching specified files/folders for changes in Python

Sometimes you need to monitor files and folders for changes on a Linux box. One way to do this is incrond. There’s also a Pythonic way: several Python wrappers around the inotify feature are available. Here we’ll cover Watcher, a simple Python daemon (GitHub repo). First of all, we need to install the python-inotify package:

yum install python-inotify.noarch

python-inotify uses a Linux kernel feature called inotify (available since kernel version 2.6.13). It lets user-space programs receive notifications about file system events.
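
To make “file system events” concrete: without inotify, a program has to poll, that is, list the directory repeatedly and compare snapshots. The sketch below shows that polling fallback in Python 3 using only the standard library. It is not how inotify or python-inotify works (the kernel pushes events instead of being asked), and the function names are illustrative.

```python
import os


def snapshot(directory):
    """Return the set of entry names currently in `directory`."""
    return set(os.listdir(directory))


def diff_snapshots(before, after):
    """Return (created, deleted) entry names between two snapshots."""
    return sorted(after - before), sorted(before - after)
```

A poller would call snapshot in a loop with a sleep in between and report each difference. inotify avoids both the repeated listing and the delay between a change and its detection.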

Now you can download the latest version of the config and the daemon:

mkdir watcher
cd watcher
wget https://raw.github.com/splitbrain/Watcher/master/watcher.ini
wget https://raw.github.com/splitbrain/Watcher/master/watcher.py

Modify your watcher.ini to meet your requirements:

[DEFAULT]
logfile=/tmp/watcher.log
pidfile=/tmp/watcher.pid
[job1]
watch=/tmp
events=create,delete
recursive=false
autoadd=true
command=ls -l $filename
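
To see how such a file reads from Python, here is a minimal Python 3 sketch that parses the config above with the standard configparser module. It is only an illustration of the file format, not Watcher’s own code. In particular, replacing $filename with the path that triggered the event via string.Template is an assumption, and /tmp/new.txt is a made-up example path.

```python
import configparser
from string import Template

WATCHER_INI = """\
[DEFAULT]
logfile=/tmp/watcher.log
pidfile=/tmp/watcher.pid
[job1]
watch=/tmp
events=create,delete
recursive=false
autoadd=true
command=ls -l $filename
"""

# interpolation=None keeps '$filename' (and any '%') literal while parsing
config = configparser.ConfigParser(interpolation=None)
config.read_string(WATCHER_INI)

job = config['job1']
events = [name.strip() for name in job['events'].split(',')]
recursive = job.getboolean('recursive')

# Assumption: $filename stands for the path that triggered the event.
command = Template(job['command']).safe_substitute(filename='/tmp/new.txt')

print(events, recursive, command)
```

Values from [DEFAULT], such as logfile, are visible from every job section, which is why they only need to be written once.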

Now you are ready to start the Watcher daemon:

chmod u+x watcher.py
./watcher.py -c watcher.ini debug

19 Feb

10 Minutes Celery Introduction

Celery is an asynchronous task queue/job queue based on distributed message passing. This post is not a detailed introduction but rather a short how-to for getting started with Celery.

Using Celery requires several components:

  • a broker. Think of it as a transport. You can choose among RabbitMQ, Redis or SQL servers;
  • a worker application, which executes tasks;
  • a client application, which adds tasks to the queue.
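
To make the three roles concrete, here is a toy in-process model in Python 3 built only on the standard library. It is not Celery and involves no real broker: a queue.Queue stands in for the transport, a thread stands in for the worker process, and the dictionary results stands in for a result backend. All names are illustrative.

```python
import queue
import threading

broker = queue.Queue()   # stands in for Redis/RabbitMQ: carries task messages
results = {}             # stands in for a result backend


def add(x, y):
    return x + y


def worker():
    """Worker: pull task messages off the broker and execute them."""
    while True:
        message = broker.get()
        if message is None:      # sentinel: shut down
            break
        task_id, func, args = message
        results[task_id] = func(*args)


thread = threading.Thread(target=worker)
thread.start()

# Client: enqueue a task, then tell the worker to stop once it is done.
broker.put(('task-1', add, (4, 4)))
broker.put(None)
thread.join()

print(results['task-1'])
```

With Celery the three parts live in separate processes, possibly on separate machines, and the messages are serialized before they travel through the broker.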

Let’s get started. First, install Celery. I run a Fedora server; if you use Debian, use apt-get instead.

yum install python-celery.noarch

For the sake of simplicity we’ll use Redis as a broker. It’s fast, simple to set up and doesn’t consume a lot of resources.

yum install redis

Now we can tune some options. Here’s an example redis.conf:

daemonize no
pidfile /var/run/redis/redis.pid
port 6379
bind 127.0.0.1
timeout 0
loglevel notice
logfile /var/log/redis/redis.log
databases 16
save 900 1
save 300 10
save 60 10000
rdbcompression yes
dbfilename dump.rdb
dir /var/lib/redis/
slave-serve-stale-data yes
appendonly no
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
slowlog-log-slower-than 10000
slowlog-max-len 128
vm-enabled no
vm-swap-file /tmp/redis.swap
vm-max-memory 0
vm-page-size 32
vm-pages 134217728
vm-max-threads 4
hash-max-zipmap-entries 512
hash-max-zipmap-value 64
list-max-ziplist-entries 512
list-max-ziplist-value 64
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
activerehashing yes
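
The three save lines in that config are worth a closer look. Each save <seconds> <changes> directive asks Redis to write a snapshot to disk once at least that many seconds have passed and at least that many write operations have occurred. The Python 3 sketch below only models that rule for illustration. The names parse_save_rules and should_snapshot are made up and are not part of Redis.

```python
REDIS_CONF = """\
save 900 1
save 300 10
save 60 10000
"""


def parse_save_rules(conf_text):
    """Return a list of (seconds, changes) pairs from 'save' directives."""
    rules = []
    for line in conf_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == 'save':
            rules.append((int(parts[1]), int(parts[2])))
    return rules


def should_snapshot(elapsed_seconds, changes, rules):
    """True if any rule's time and change thresholds are both met."""
    return any(elapsed_seconds >= seconds and changes >= min_changes
               for seconds, min_changes in rules)


rules = parse_save_rules(REDIS_CONF)
print(rules)
```

Read together, the rules mean that a busy instance is snapshotted about once a minute, while a nearly idle one is still saved within 15 minutes of a single write.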

We will also need the celery-with-redis package, which Celery requires to work with Redis:

python-pip install -U celery-with-redis

Keep in mind that this command will also update your current Celery installation along with its dependencies. It’s not a big deal, but it’s good to know.

Now let’s create our worker application, called tasks.py:

from celery import Celery
celery = Celery('tasks', broker='redis://localhost:6379/0', backend='redis://localhost:6379/1')

@celery.task
def add(x, y):
    return x + y

Now we can launch it:

celery -A tasks worker --loglevel=info

You should get output similar to this:

 
-------------- celery@turtle v3.0.15 (Chiastic Slide)
---- **** -----
--- * *** * -- [Configuration]
-- * - **** --- . broker: redis://localhost:6379/0
- ** ---------- . app: tasks:0x1d1b690
- ** ---------- . concurrency: 1 (processes)
- ** ---------- . events: OFF (enable -E to monitor this worker)
- ** ----------
- *** --- * --- [Queues]
-- ******* ---- . celery: exchange:celery(direct) binding:celery
--- ***** -----

[Tasks]
. tasks.add

[2013-02-19 23:52:42,339: WARNING/MainProcess] celery@turtle ready.
[2013-02-19 23:52:42,361: INFO/MainProcess] consumer: Connected to redis://localhost:6379/0.

Here’s our client application:

from tasks import add
result = add.delay(4, 4)
print result.get(timeout=1)

Note that here we use Celery in synchronous mode: we wait until the result is ready. I believe that in most cases one would use Celery asynchronously; here we block just to get a result and make sure everything works.
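
The difference between the two styles can be shown without a broker. The Python 3 sketch below uses concurrent.futures from the standard library as an analogy, not Celery itself: future.result(timeout=1) blocks in the same way result.get(timeout=1) does, and future.done() plays the role of the ready() check that Celery’s result object offers.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def add(x, y):
    time.sleep(0.1)  # pretend the task takes a while
    return x + y


with ThreadPoolExecutor(max_workers=1) as pool:
    # Blocking style, like result.get(timeout=1): wait for the answer.
    blocking_value = pool.submit(add, 4, 4).result(timeout=1)

    # Non-blocking style: submit, keep working, collect the result later.
    future = pool.submit(add, 2, 3)
    finished_yet = future.done()  # a cheap check that never blocks
    # ... the caller is free to do other work at this point ...
    later_value = future.result(timeout=1)

print(blocking_value, later_value)
```

In a real application the client would usually store the task id and pick up the result later, or not at all if only the side effect of the task matters.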

Output:

[dande@turtle ~]# python client.py
8
[dande@turtle ~]#

Now that everything is ready, we can start thinking about what to do with this setup.

By the way, if you are interested in how Celery uses Redis, run:

redis-cli monitor

17 Feb

How to put an Apache web site into maintenance mode

This is how you can put your website, running under the Apache web server, into maintenance mode. Note that you need to have mod_rewrite enabled.

RewriteEngine On
RewriteCond %{DOCUMENT_ROOT}/maintenance.html -f
RewriteCond %{REQUEST_FILENAME} !/maintenance.html
RewriteRule ^.*$ /maintenance.html [L]

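If the file maintenance.html exists in the document root, all requests will be rewritten to it. The second condition excludes the maintenance page itself, so the rewrite does not loop.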
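
The decision these rules make can be written out as a small Python 3 function. This is a simplified model for illustration only: it treats REQUEST_FILENAME as a plain path string and ignores everything else mod_rewrite does. The function name rewrite and its parameters are made up.

```python
import re


def rewrite(request_filename, maintenance_exists):
    """Return the path that would be served under the rules above."""
    # RewriteCond %{DOCUMENT_ROOT}/maintenance.html -f
    if not maintenance_exists:
        return request_filename
    # RewriteCond %{REQUEST_FILENAME} !/maintenance.html
    # The pattern is a regex, so '.' matches any character.
    if re.search(r'/maintenance.html', request_filename):
        return request_filename  # already the maintenance page: no loop
    # RewriteRule ^.*$ /maintenance.html [L]
    return '/maintenance.html'
```

Switching maintenance mode on and off therefore comes down to creating or removing (or renaming) maintenance.html, with no Apache restart involved in this model.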

16 Feb

Devops Weekly Digest 02/16