A cache module for Python CGI scripts
After an earlier failed attempt at writing a cache module for Python CGI scripts, a not-so-nice email from one of my web hosts made me try again: they mentioned that they have no plans to enable Apache's mod_cache module. I suspect that the pickle module somehow messed up my previous attempt, leaving me with no choice but to disable it and burden the server more than needed.
The building blocks for a flat-file cache module are a unique mapping from URL to filename and a place to store the files. An MD5 hash provides a sufficiently unique mapping, and a directory is a nice place to store files. We also need a time limit (in seconds) so the web pages are not cached forever.
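To illustrate the mapping before diving into the module, here is a quick sketch (the URL is made up; any string works):

import md5
url = "/index.py?page=about"  # a hypothetical request URI
# the hex digest gives a fixed-length, filesystem-safe filename
print md5.new(url).hexdigest() + ".txt"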
The complete module
"""A module that writes a webpage to a file so it can be restored at a later time
Interface:
filecache.write(...)
filecache.read(...)
"""
import time
import os
import md5
def key(url):
    """Return a unique cache key for url."""
    k = md5.new()
    k.update(url)
    return k.hexdigest()

def filename(basedir, url):
    """Return the cache filename for url in basedir."""
    return "%s/%s.txt" % (basedir, key(url))

def write(url, basedir, content):
    """Write content to the cache file in basedir for url."""
    fh = open(filename(basedir, url), "w")
    fh.write(content)
    fh.close()

def read(url, basedir, timeout):
    """Read cached content for url in basedir if it is fresher than timeout (in seconds)."""
    fname = filename(basedir, url)
    content = ""
    if os.path.exists(fname) and (os.stat(fname).st_mtime > time.time() - timeout):
        fh = open(fname, "r")
        content = fh.read()
        fh.close()
    return content
A minimal example, including time measurement
Instead of explaining what the functions do, I hope they are fairly understandable and that a usage example is sufficient for understanding how the module works. As a bonus, the example includes timing so you can see how long it takes to build your pages from scratch as opposed to reading them from the cache.
import time
startTime = time.clock()
import sys
import os
import filecache

cache_timeout = 10  # seconds
cache_basedir = "cache"

cache = filecache.read(os.environ.get("REQUEST_URI", ""), cache_basedir, cache_timeout)
if cache:
    print cache
    # timing is printed as an HTML comment at the bottom of the page source
    print "<!-- %.3f seconds -->" % (time.clock() - startTime)
    sys.exit()

# generate output
output = "stuff"

# write output to cache
filecache.write(os.environ.get("REQUEST_URI", ""), cache_basedir, output)
print output
print "<!-- %.3f seconds -->" % (time.clock() - startTime)
Store the example as example.py, store the cache module as filecache.py, and create a directory named cache. Run it as
python example.py
Note that the timeout is set very low, at 10 seconds. This is fine for testing but not much more.
While this very minimal example is slower when the output is fetched from the cache, I can assure you that this is not the case with more realistic web pages. In my case, I have seen speedups from around 0.7 seconds to hardly measurable times (0.00 to 0.01 seconds). This does not include the time needed to start the Python interpreter and import the time module, so a very popular site might still get you in trouble with your web host. I think Apache's mod_cache module would take care of that too, but it wasn't available in my case.
There is no way to clear the cache other than an rm * or similar in the cache directory. It works for me, but probably not for a very dynamic site.
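If stale files ever become a problem, a small pruning helper along these lines could automate the cleanup (a minimal sketch; prune is not part of the module above, and it assumes the cache directory holds nothing but cache files):

import os
import time

def prune(basedir, timeout):
    """Remove cache files in basedir older than timeout seconds."""
    cutoff = time.time() - timeout
    for name in os.listdir(basedir):
        path = os.path.join(basedir, name)
        if os.path.isfile(path) and os.stat(path).st_mtime < cutoff:
            os.remove(path)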
This cache module is used in production at Good Web Hosting Info with a timeout of one hour. The time measurement is shown at the bottom of the source code. It's not a very busy site, so it's quite likely you will get a page built from scratch if you look beyond the front page. The time precision is 1/100 of a second, so cached pages normally show 0.000 or 0.010 seconds.
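For that site, the only change from the example above is the timeout, expressed in seconds:

cache_timeout = 3600  # one hour, instead of the 10 seconds used for testing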
The Python files are also available from here.