BF+Thue in mod_rewrite

Based on the mod_rewrite experiment from the other week, I hacked up a small Thue to mod_rewrite compiler in sed. For testing, I have used the brainfuck interpreter in Thue (as far as I can see, this is like the only example of a "real-world" Thue program). I tested this on the small 3+5 examples included in the BF Thue distribution and tried to test it on BF Hello World, but apache runs out of memory before completing the rewriting ;-)

Anyway, in this BF interpreter, the interpretation is bootstrapped by a circumflex, '^', so the interpreter placeholder I introduced last week had a natural use allowing us to simply enter the brainfuck program as a URL, followed by a colon and input as binary underscore-separated numbers (terminated by 0_), and have patience.

The Thue interpreter needs to keep track of a few bits of state - all of this has to be encoded in some reliable way in the URL we're continually rewriting:

  • Have we added the interpreter? For BF+Thue, this is '^' which is really something that has leaked from this one specific Thue program into the interpreter - which is really ugly, but hey, at least it was easy. Anyway, since we're in the root of the virtual host, we can use the leading slash to encode this information - simply remove it after adding the interpreter, and we can match on ^/(.*)$ to determine whether we should add the bootstrap. (I have since realized that there is not much that prevents a malicious Thue program to output a slash as its first character, thereby breaking my entire system. Well, well, in version 2.0 I'll replace this with a more robust encoding system.)
  • The current output string. Stored in the first part of the URL, terminated by a 'q' (I'm having real trouble here with selecting characters unused in the program and all its possible states - otherwise matching would be a bitch)
  • Did we rewrite anything in the last pass over the state? If nothing was rewritten, we must terminate the program. After the 'q' that terminates the output string, we store an 'r' before starting going through the rules and then every time we apply a rule we remove the 'r'. If the 'r' is still there after having tried every rule in the program, we're done and pass the output string to the printer PHP script.
  • Finally, the current program state is stored after the q and r markers and goes on to the end of the URL.

In the end, this became something quite beautiful in its ugliness, although it could certainly need some robustness work. For example, it blindly relies on the Thue program to never ever use q or r anywhere (and probably a few other things I haven't thought of). This should be changed to a totally robust encoding of anything that the rewrite rules should operate on as data, that can't interfere with the things that are used as control structures and data formatting by the rewrite rules.

Here is the translator sed script in all its regexp-wtf glory, highlighted as Perl since GeSHi unfortunately doesn't have sed highlighting.

(EDIT: Not actually highlighted as Perl anymore since the replacement of wordpress! I have created my own sed highlighter.)

# Remove comments
# Rewrite question marks since they interfere with mod_rewrite by making the
# productions look like requests with query strings, which apache then splits.
# Escape characters with special meanings in regexps.
s/\([.*+(){}^$<>[\\ |]\|\]\)/\\\1/g
# Remove the ::= terminating line since it doesn't define a rule.
# Output rule: ~asf means to print the string "asf" and remove the matched string.
s/^\(.*\)::=~\(.*\)$/RewriteRule ^(.*)qr?(.*)\1(.*)$ $1\2q$2$3 [N]/
# Normal rule: match substring and replace it
s/^\(.*\)::=\(.*\)$/RewriteRule ^(.*)qr?(.*)\1(.*)$ $1q$2\2$3 [N]/

Note that the resulting rewrite rules are meant to be inserted into the mod_rewrite embedded language bootstrap I built in the previous post.

No responses so far