mod_rewrite revisited

May 04, 2011 by olsner

I suddenly decided to continue my earlier mod_rewrite experiments, BF+Thue in mod_rewrite and my first excursion in mod_rewrite. That code all worked in theory (and for very small examples), but not in practice: Apache runs out of memory.

The Problem

The basic problem is that while it is processing a single request, Apache never returns any memory to the system. Apache expects trivial things such as interpreting Turing-complete languages to complete quickly and without using more than a smidgeon of memory.

To get around this, I decided to change the "run-time system" to make a redirect for each step, i.e. sending a redirect response back to the user agent and telling it to make a completely new request to a different URL, instead of making mod_rewrite loop on the Apache side until it's done. Apache allocates the memory for each of these requests separately, so its memory use stays bounded (by the size of the program's state, anyway).
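
As a rough illustration of the redirect-per-step idea, here is a small Python sketch. It is not part of the RTS: `RULES`, `handle_request` and `follow_redirects` are hypothetical names, the q markers are left out, and the sketch always rewrites the leftmost match, which is only one of the choices Thue allows.

```python
# Hypothetical stand-in for the compiled rewrite rules: (lhs, rhs) pairs.
RULES = [("ab", "ba")]


def handle_request(state):
    """Serve one 'request': apply at most one rule, then stop.

    Returns the redirect target (the new state), or None when no rule
    matches, which means the program has finished.
    """
    for lhs, rhs in RULES:
        index = state.find(lhs)
        if index != -1:
            return state[:index] + rhs + state[index + len(lhs):]
    return None


def follow_redirects(state, max_redirects):
    """Play the client: keep requesting the redirect target until done."""
    hops = 0
    while True:
        target = handle_request(state)
        if target is None:
            return state, hops
        if hops == max_redirects:
            raise RuntimeError("redirect limit reached before the program finished")
        state = target
        hops += 1
```

The server side only ever holds one state string per call, and the loop lives entirely in the client, which is why the client's redirect limit becomes the limiting factor.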

Implementing this mostly meant simply replacing the [N] flag (which jumps back to the start of the list of rewrite rules and continues rewriting) with the [R] flag, which sends a redirect to the client. But of course a few other minor shenanigans were lurking, primarily around the "bootstrapping" step.

The original RTS relied on Apache always having a slash at the start of the URI to bootstrap the program, and relied on Apache not complaining if you removed the slash internally while doing the rewriting. With the new redirecting approach, we will need to continue processing from a URI that has a slash (since it's a completely new request from the client) without bootstrapping that URL again. As described in the comments below, we now use a 'q' to tell a bootstrapped program from an unstarted program instead of checking whether the slash was removed or not.
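
To see why the 'q' check makes bootstrapping safe to repeat, the two bootstrap rules from the prologue can be mimicked in Python. This is only a sketch: `bootstrap` is a hypothetical name, and Python's `re` module stands in for mod_rewrite's regex engine.

```python
import re


def bootstrap(uri):
    """Mimic the two bootstrap rules on a request URI."""
    # Unstarted program: a slash followed by anything except 'q'.
    # Prefix the 'qq^' marker and drop the slash.
    uri = re.sub(r"^/([^q].*)$", r"qq^\1", uri)
    # Already bootstrapped: the state starts with 'q', so only the
    # leading slash is removed.
    uri = re.sub(r"^/(.*)$", r"\1", uri)
    return uri
```

For example, `bootstrap("/hello")` gives `qq^hello`, and feeding that back in as `/qq^hello` (as a redirected client would) leaves it as `qq^hello`. One consequence of this scheme is that an input string that itself starts with 'q' would be taken for an already started program.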

While I was making these changes, I also made the compiler output the RTS along with the program instead of requiring separate program and RTS config files.

Thue mod_rewrite Compiler v2.0

Without further ado, here is the updated Thue to mod_rewrite compiler in its entirety, now including all the RTS it needs!

# Bootstrapping part: output the RTS prologue
1 {
# This file should be included in an Apache2 config file for a VirtualHost. \
# Enable rewriting\
RewriteEngine on\

# Hack to make actual files available if you know their names\
RewriteCond /var/www/rewrite%{REQUEST_FILENAME} -f\
RewriteRule ^.*$ - [L]

# This should only run on the first input string to add the interpreter.
# After each redirect, the client makes a new request and we return to the
# first rule, so this must be safe and idempotent.
# An initial q indicates whether we have bootstrapped the program yet, the
# second q is a separator between the output and the current program state.
# We also add a ^ just for fun.
RewriteRule ^/([^q].*)$ qq^$1
# Remove any leading slash to simplify the rest of our processing
RewriteRule ^/(.*)$ $1

# Add an 'r' to distinguish unchanged strings. This would be the termination
# condition for our rewrite system - if the 'r' is still left after running the
# chain of rewrites, we're done and stop looping.
# This rule is probably completely unnecessary since we'll always redirect
# when rewriting, which should already prevent us from reaching the
# end-of-program rule before we're done.
RewriteRule ^(q.*)q(.*)$ $1qr$2

# Strip comments
# Rewrite question marks since they interfere with mod_rewrite by making the
# productions look like requests with query strings, which Apache then splits.
# .*+(){}^$<>[\\ |]\|\]\)/\\\.*+(){}^$<
s/\([.*+(){}^$<>[\\ |]\|\]\)/\\\1/g
# Skip empty lines
/^$/ d
# This is where we're actually doing something worthwhile:
s/^\(.\+\)::=~\(.*\)$/RewriteRule ^q(.*)qr?(.*)\1(.*)$ q$1\2q$2$3 [R,L]/
s/^\(.\+\)::=\(.*\)$/RewriteRule ^q(.*)qr?(.*)\1(.*)$ q$1q$2\2$3 [R,L]/
# End-of-program marker, we want to ignore everything after this line
/^::=$/ { s/::=//; b end }

# Restart processing and input next program line

# Output the RTS epilogue after the last line of program text
# If the string has changed, we are not yet done. Loop to the beginning and
# remove the changed-marker. This doesn't change the string at all ('-' is a
# special substitution string that passes along the original string), just has
# the [N] flag to trigger Apache to restart rewriting.
# This rule is probably completely unnecessary since we'll always redirect
# when rewriting.
RewriteRule ^(q.*q[^r].*)$ $1 [N]

# If the string is still unchanged, apply output formatting and send to the
# interface CGI script.

# First, remove the unchanged-marker 'r'
RewriteRule ^q(.*)qr(.*)$ q$1q$2
# Finally, pass the resulting string to a simple CGI or PHP script to print it
# to the web browser. [R] means to redirect to it, ending rewriting.
RewriteRule ^q(.*)q(.*)$ /print.php?$1 [R]
# Done. Quit. Get out of here.
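
For readers who don't speak sed, the escaping step and the two production rules above can be approximated in Python. This is a rough port under assumptions, not the compiler itself: `SPECIAL` and `compile_rule` are hypothetical names, and edge cases (such as an output text that itself contains `::=`) are not handled the way sed would handle them.

```python
import re

# The characters the sed script backslash-escapes before building a rule.
SPECIAL = re.compile(r"([.*+(){}^$<>\[\\ |\]])")


def compile_rule(line):
    """Translate one Thue production into a RewriteRule line, or None."""
    line = SPECIAL.sub(r"\\\1", line)
    # Output production "lhs::=~text": append text to the output part.
    output = re.match(r"^(.+)::=~(.*)$", line)
    if output:
        lhs, text = output.groups()
        return "RewriteRule ^q(.*)qr?(.*)" + lhs + "(.*)$ q$1" + text + "q$2$3 [R,L]"
    # Ordinary production "lhs::=rhs": replace lhs in the program state.
    plain = re.match(r"^(.+)::=(.*)$", line)
    if plain:
        lhs, rhs = plain.groups()
        return "RewriteRule ^q(.*)qr?(.*)" + lhs + "(.*)$ q$1q$2" + rhs + "$3 [R,L]"
    return None
```

With this sketch, `compile_rule("a::=b")` yields `RewriteRule ^q(.*)qr?(.*)a(.*)$ q$1q$2b$3 [R,L]`: the text between the first and second q is the output so far, and everything after the second q is the program state being rewritten.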

Testing it out

For obvious reasons I wouldn't want you to test this on my server, but if you want to set this up on your own server, setup should be very easy: set up a new virtual host and let the virtual host config stanza include the compiled output of your Thue program.

One "small" caveat is that pretty much every HTTP client available will only follow a very small number of consecutive redirects. To actually see a largeish program (such as hello world) run to completion, you will have to reconfigure your HTTP client to follow a very large number of redirects (a few hundred thousand). Using the command-line curl program, curl -L -g --max-redirs 300000 seems to do the trick.