Thursday, June 18, 2009

Clojure's Agents: Scientists, Monkeys, and MD5

Just recently, I decided to make my first attempt at doing something useful with Clojure's agents. Now please don't confuse useful with helpful. I use the term useful in the most pejorative sense imaginable and reserve it for things like solar powered flashlights and inflatable dartboards. That being said, I did actually write a program that does something marginally interesting, but before we dive into that, let's take a step back.

The official description of agents written by Rich Hickey himself says the following, "Agents provide independent, asynchronous change of individual locations. Agents are bound to a single storage location for their lifetime, and only allow mutation of that location (to a new state) to occur as a result of an action. Actions are functions (with, optionally, additional arguments) that are asynchronously applied to an Agent's state and whose return value becomes the Agent's new state."

This is an accurate description of what agents do, but I had to read that paragraph many times to really get it. If your eyes are still returning to fixed position, cast your fears aside, and let me take you by the hand. We're about to take baby steps through what agents offer followed by an incredibly contrived (and somewhat long-winded) example.

At the most basic level, agents provide an easy way to work with some data in the background (another thread). To create an agent, you bind a variable to it and give it a starting value (zero in this case).

user=> (def counter (agent 0))
#'user/counter

An agents state is modified by sending it a function. In the case of a counter, we'd typically want a function that takes a value and increments it by one. Additional arguments can be passed to this function, but we're keeping things simple for now.

user=> (defn add-one [x] (inc x))
#'user/add-one

We could have used an anonymous function, but I'm being overly explicit for the sake of example. Note that the agent will pass its current state as the first argument to our function. Having composed the appropriate function, we can fire it off to the agent.

user=> (send counter add-one)
#<Agent@8edf3c: 1>

The agent passes its current state (zero) to add-one as the first argument. The add-one function returns the new result (one) back to the agent, and it is stored as the agent's new state. The send function's return value is the agent in its new state. If we want to examine the agent's current state, we simply dereference it.

user=> @counter
1

It's worth mentioning that you can retrieve an agent's state from any thread at any time without blocking. Once send is called, the agent begins its processing in the background and program execution continues. If you anticipate that your agent will engage in blocking operations, you should use send-off rather than send. The usage of send-off verses send has implications on how threads are managed in the background and should be explored in the privacy of your own home.

With the introductions out of the way, I will make good on my promise to present an incredibly contrived example. I was just going to lay out some code and explain it, but I had spicy mustard on my pancakes this morning, so I've decided to evolve this into a full blown scenario.

Imagine for a moment that you've been kidnapped by a mad scientist. Let's call him Dr. Moreau. Being completely insane, Dr. Moreau has decided to transplant your brain with that of a chimpanzee and then conduct experiments involving extended programming sessions with ASP. He leaves you alone with the chimp in his laboratory while he goes to fetch his bone saw.

Luckily, the chimp knows sign language and explains that Dr. Moreau stores all his passwords as MD5 sums of lowercase four letter strings in the character range of a-z. The doctor has OCD, so he has sixty passwords. If you can decrypt them all and feed them into his mainframe in the next five minutes, the lab will self-destruct and you and the monkey will go free.

The monkey gestures towards a terminal and reminds you that time is of the essence, so the job must be done in parallel. He also lets you know that the machine has four processors and one programming language installed. That language is Clojure.

Dispelling disbelief that a situation so incredibly contrived is actually happening, you decide a proof of concept would be the best place to start. You roll up your sleeves and begin programming.

First things first, it's obvious that this program will need to be capable of generating MD5 sums given an input string. Thankfully, this is a very common task. The monkey directs you to g-o-o-g-l-e-dot-com, and you quickly find the following nugget.

(ns agentmd5
(:refer-clojure)
(:import
   (java.security
     NoSuchAlgorithmException
     MessageDigest)
   (java.math BigInteger)))

; computes an MD5 sum of a string
; (http://www.holygoat.co.uk/blog/entry/2009-03-26-1)
(defn md5-sum
"Compute the hex MD5 sum of a string."
[#^String str]
(let [alg (doto (MessageDigest/getInstance "MD5")
            (.reset)
            (.update (.getBytes str)))]
  (try
    (.toString (new BigInteger 1 (.digest alg)) 16)
    (catch NoSuchAlgorithmException e
      (throw (new RuntimeException e))))))

Knowing that this program will run on multiple cores and that identical MD5s may be requested several times, you decide it might be wise to memoize this function so that redundant calls will come from cache.

; memoized version of MD5
(def md5-memo (memoize md5-sum))
That problem out of the way, the next step is to brute force all possible combinations of four character strings. Having done some Perl and Ruby programming in the past, you initially think, "I'll just use the range operator." Unfortunately, range operators aren't on the menu today, so we'll be doing things the old fashioned way. For your first attempt, you adapt some Java code you find online.
; starts at "aaaa" and rotates string based on `n'
; "aaaa", "aaab", "aaac", etc...
(defn base26 [n]
    (let [seed-string "aaaa"
                      s (new StringBuilder seed-string)
                      a_val (int \a)]
      (loop [pos 3 x (int n)]
            (when (pos? x)
              (let [digit (char (+ a_val (unchecked-remainder x 26)))]
                (.setCharAt s pos digit)
                (when (pos? pos)
                  (recur (int (dec pos)) (unchecked-divide x 26))))))
      (.toString s)))
This code works just fine initially, and thanks to some quick help from some friendly Clojurians it runs nearly as fast as the Java code from which it was adapted. You're pleased with this until you realize that an obscure bug prevents this code from working on the Java 5 SE VM. What if Dr. Moreau's mainframe is running Java 5 SE?! Never satisfied with marginal quality even when your life is on the line, you make a second attempt.
(defn string-to-base36 [string]
          (Integer/parseInt string 36))

(doall (for [x (range (string-to-base36 "aaaa")
                       (string-to-base36 "zzzz"))]
                       (Integer/toString x 36)))
This works, but it's much slower, and it includes a slew of character combinations outside of what's needed. The monkey reassuringly squeezes your shoulder and encourages one last attempt.
; returns the next iteration of a given character sequence
(defn next-string [#^"[C" chars start-char end-char]
  (loop [i (int (- (alength chars) 1))]
    (if (>= i 0)
      (let [last-char (= (aget #^"[C" chars i) (char end-char))
            recur-val (int (if last-char (- i 1) -1))]
        (if last-char
          (aset chars i (char start-char))
          (aset chars i (char (inc (int (aget chars i))))))
        (recur recur-val))))
  (new String chars))

; returns a sequence of strings given a length, range, and result size
; length=4 start-char=a end-char=z result-size=5 would yield
; ["aaaa" "aaab" "aaac" "aaad" "aaae"]
(defn alpha-gen [length start-char end-char result-size]
  (let [#^"[C" chars (make-array (Character/TYPE) length)]
    ; initialize the array
    (loop [i 0]
      (if (< i length)
        (do (aset chars (int i) (char start-char))
          (recur (inc i)))))
    (for [i (range result-size)]
      (next-string chars start-char end-char))))

The code is somewhat dense and sprinkled with type hints, but it does exactly what you need. You benchmark it and see that it's as fast as your initial attempt, and it works beautifully on Java 5 and Java 6 (as it should...).

Thinking that these generated values will be needed multiple times, you decide to pre-calculate all the combinations and then make them available to your agents via cache. This will use more memory, but hey, your life is on the line! We could just use Clojure's memoize function once more, but we're having fun here, so let's make a closure instead and make use of Clojure's refs at the same time.

; creates an initial cached copy of all character combinations
; subsequent calls return the cached copy rather than calculating
(def string-perms
  (let [xs (ref [])]
    (fn []
      (if (empty? @xs)
        (dosync (ref-set xs (doall (alpha-gen 4 \a \z (Math/pow 26 4)))))
        @xs))))

Next you need a function to find the string that corresponds to a given MD5. The hard work has already been done, so we just glue some of our functions together.

; cycles through character combinations for a given md5 string
; until the matching result is found
(defn decode-md5 [md5]
   (first (filter #(= md5 (md5-memo %1)) (string-perms))))

It's worth mentioning that filter is lazy. This means that it will only provide as much data as it's asked for. Once the first match is found, the calculation will cease to prevent wasting additional resources.

Feeling very confident in your programming prowess, you quickly slop together a function to apply your decode-md5 function to all possible character combinations. The monkey looks at you with a faint air of suspicion but shrugs his shoulders and goes back to eating a banana.

; decodes a bucket
(defn decode [bucket]
   (map decode-md5 bucket))

You decide that the best way to divide up the work load across agents will be to break the input data into buckets. Not worrying about how this will actually work, you throw all your concrete instincts to the wind.

; returns a collection of agents for a given set of work buckets
(defn spawn-agents [agents buckets]
  (if (empty? buckets)
    agents
    (recur (conj agents (agent (first buckets)))
         (rest buckets))))

The monkey can hardly contain himself. You've done all the work so far in just under three minutes (did I mention you're in a time warp?). With two minutes to spare, you take care of the initializing the data your program will use. For your initial dry run, you decide to populate your dataset with an MD5'd test string of "cloj".

; initialize string permutations
(string-perms)

; number of tasks to doll out to the agents
(def num-work-units 60)

; generate requested number of work units
(def work-units (map md5-sum (for [x (range num-work-units)] "cloj")))

; number of agents to spawn
(def num-agents 4)

; divide the units into buckets for each agent
(def work-buckets (partition (int (/ (count work-units)
                                   num-agents)) work-units))

You realize that the strategy you're using to partition work units into work buckets allows for some data loss if things don't divide perfectly but then chuckle at the ultra-controlled and incredibly contrived situation you're in.

With your task nearly complete and almost a minute left before Dr. Moreau returns, you spawn the agents and send them their jobs.

; create an agent for each bucket
(def agents (spawn-agents [] work-buckets))

; send each agent a job to do
(doseq [agent agents]
  (send agent decode))

; wait on the agents to finish
(apply await agents)

; view the results
(doseq [agent agents]
  (doseq [result @agent] (println result)))

; clean up
(shutdown-agents)

Dr. Moreau runs Linux, so you fire up top. Eager to see his incredibly evil CPU pegged at 400%, you invoke your program. Text begins streaming across the terminal, but to your complete horror only one processor is being used. Luckily, you're the world's greatest debugger, so the thirty seconds you have left is more than ample.

In this case, the problem lies in the same laziness that we took advantage of with filter earlier. In many instances, Clojure will only calculate what it believes you will need to speed up processing in the present. In some cases, this can work against you.

The decode function above passes the decode-md5 function to map with a bucket as the input data. The map function sees that you're not asking for any specific number of items, so it says, "I'll take care of it later. Baywatch is on."

When you begin requesting results from your agents, none of the work has been done. To make matters worse, you're iterating over the agents in series, so the lazy map function is being called in a completely serial fashion.

With only fifteen seconds left, the monkey begins banging on the keyboard. You're about to crawl under a table when you realize the error of your ways. If you could force the values from the map to be materialized when it's initially called, all your problems would go away (well, there's that IRS thing, but I digress...).

Invoking the Vim editor, you make a small change.

; decodes a bucket
(defn decode [bucket]
  (doall (map decode-md5 bucket)))

The doall function tells Clojure we want the work done now. With just seconds to spare, you test the revised copy. All four processors are utterly consumed as bytes begin spinning out of the ether. There's an electric charge in the monkey's eye, and the test completes rapidly.

Some additional testing shows that processing a 6,000 item dataset takes one minute twenty-six seconds with four agents, and three minutes fifteen seconds with one agent. It's also helpful to consider that Dr. Moreau's co-worker has MySQL running buckwild on the same machine, and it frequently pegs two or three cores. This makes accurate benchmarking difficult.

Triumphs aside, there's bad news. You've wasted all your time screwing around with generating characters and testing. Dr. Moreau has returned with the bone saw. As he approaches, you think back to ancient times and realize that these kinds of situations used to be reconciled without Clojure. You grab a nearby stapler and begin flogging the doctor. The monkey joins in and inflicts a painful bite. The doctor rapidly succumbs.

Exhausted after a hard day's work, you take the monkey to Ben & Jerry's for some well deserved ice cream. You enjoy a nice evening together chatting about Dr. Moreau, Clojure, and possible content for upcoming blog articles.

The full source code for the example program is available here. No animals were harmed during the development of this program.

Tuesday, June 2, 2009

Markov Chaining

A friend of mine wrote a very nice Markov Chainer in Python a while back. I had a lot of fun playing with it and decided to try my hand at rewriting it in Clojure. This proved to be trickier than I expected as it was my first Clojure program beyond simple factorial and fibonacci generators, but I learned a lot in the process. The main issue I ran into was my own misunderstanding of how Clojure adds elements to it's various sequence types.

It's fast to add elements to the head of a list and the tail of a vector. Clojure takes advantage of this by adding elements to the optimal position of a sequence based on it's type. Being totally oblivious to this fact, I spent quite a while trying to pin down the difference in behavior between my Python reference implementation and my own version. Once I discovered the source of the program, I had a classic hacker aha moment and a great deal of satisfaction.

Note that this program stays in the realm of immutable data 100% of the time. I believe this may be the reason that it's much slower than the Python version, but I could be wrong. For whatever reason, there's a pretty large speed difference between the two programs. If anyone has any suggestions, I'll be happy to try them.

(ns markov 
  (use clojure.contrib.str-utils)) 

(set! *warn-on-reflection* true)

(defn flatten 
  "Takes any nested combination of sequential things (lists, vectors, 
  etc.) and returns their contents as a single, flat sequence. 
  (flatten nil) returns nil." 
  [x] 
  (filter (complement sequential?) 
          (rest (tree-seq sequential? seq x))))

(defn rand-elt 
  "Return a random element of this seq" 
  [s] 
  (nth s (rand-int (count s)))) 

(defn clean [txt] 
  "clean given txt for symbols disruptive to markov chains" 
  (let [new-txt (re-gsub #"[:;,^\"()]" "" txt) 
        new-txt (re-gsub #"'(?!(d|t|ve|m|ll|s|de|re))" "" new-txt)] new-txt)) 

(defn chain-lengths [markov-chain] 
  "return a set of lengths for each element in the collection" 
  (let [markov-keys (map keys markov-chain)] 
    (set (for [x markov-keys] (count x))))) 

(defn max-chain-length [markov-chain] 
  "return the length lf the longest chain" 
  (apply max (chain-lengths markov-chain))) 

(defn chain 
  "Take a list of words and build a markov chain out of them. 
  The length is the size of the key in number of words." 
  ([words] 
   (chain words 3)) 
  ([words length] 
   (loop [markov-chain {} 
          keychain (for [x (range length)] nil) 
          words (map clean words)] 
     (let [first-word (first words)] 
       (if (seq words) 
         (recur (assoc markov-chain keychain 
                       (cons first-word (get markov-chain keychain))) 
                (concat (rest keychain) [first-word]) 
                (rest words)) 
         (assoc markov-chain keychain [])))))) 

(defn split-sentence [text] 
  "Convert a string to a collection on common boundaries" 
  (filter seq (re-split #"[,.!?()\d]+\s*" text))) 

(defn file-chain 
  "Create a markov chain from the contents of a given file" 
  ([file] 
   (file-chain file 3)) 
  ([file length] 
   (let [sentences (split-sentence (slurp file)) 
         flatten-list (fn [& x] (flatten (list x)))] 
     (loop [markov-chain {} words sentences] 
       (if (seq words) 
         (recur (merge-with flatten-list 
                            markov-chain 
                            (chain (re-split #"\s+" (first words)))) 
                (rest words)) 
         markov-chain))))) 

(defn construct-sentence 
   "Build a sentence from a markov chain structure.  Given a 
   Markov chain (any size key),  Seed (to start the sentence) and 
   Proc (a function for choosing the next word), returns a sentence 
   composed until is reaches the end of a chain (an end of sentence)." 
  ([markov-chain] 
   (construct-sentence markov-chain nil rand-elt)) 
  ([markov-chain seed] 
   (construct-sentence markov-chain seed rand-elt)) 
  ([markov-chain seed proc] 
   (loop [words (if seed seed (rand-elt (keys markov-chain))) 
          sentence (str-join " " (filter identity words))] 
     (if (seq (markov-chain words)) 
       (let [word-new (proc (markov-chain words))] 
         (recur (concat (rest words) [word-new]) 
                (str-join " " (into [sentence] [word-new])))) 
       sentence))))
Here's some example usage:
(ns main (use markov))
(def markov (file-chain "corpus.txt"))
(doseq [x (range 100)]
  (doseq [x (range 3)] (println (construct-sentence markov)))
  (println))

SyntaxHighlighter For Clojure

I'm really surprised that Blogger doesn't include a solution for syntax highlighting as it hosts a ton of blogs related to programming. Fortunately, there are plenty of third party solutions out there. I've settled on SyntaxHighlighter, and while it's not perfect, it does a good enough job. With Clojure being as new as it is, SyntaxHighlighter doesn't include a brush file for the language. This should change in the future since I have submitted a brush to the author and have been informed that it will be included in the next release. In the meantime, here's the file which provides highlighting support for a good number of common functions.
SyntaxHighlighter.brushes.Clojure = function()
{
  // Contributed by Travis Whitton

  var funcs = ':arglists :doc :file :line :macro :name :ns :private :tag :test new alias alter ' +
              'and apply assert class cond conj count def defmacro defn defstruct deref do '     +
              'doall dorun doseq dosync eval filter finally find first fn gen-class gensym if '  +
              'import inc keys let list loop map ns or print println quote rand recur reduce '   +
              'ref repeat require rest send seq set sort str struct sync take test throw '       +
              'trampoline try type use var vec when while';

  this.regexList = [
          { regex: new RegExp(';.*$', 'gm'),                               css: 'comments' },
          { regex: SyntaxHighlighter.regexLib.multiLineDoubleQuotedString, css: 'string' },
          { regex: /\[|\]/g,                                               css: 'keyword' },
          { regex: /'[a-z][A-Za-z0-9_\-]*/g,                               css: 'color1' }, // symbols
          { regex: /:[a-z][A-Za-z0-9_\-]*/g,                               css: 'color2' }, // keywords
          { regex: new RegExp(this.getKeywords(funcs), 'gmi'),             css: 'functions' }
      ];

  this.forHtmlScript(SyntaxHighlighter.regexLib.aspScriptTags);
}

SyntaxHighlighter.brushes.Clojure.prototype     = new SyntaxHighlighter.Highlighter(); 
SyntaxHighlighter.brushes.Clojure.aliases       = ['clojure', 'Clojure', 'clj'];

Monday, June 1, 2009

Method Introspection

As a Ruby hacker, I enjoy the some of the introspection features it provides. When I first started with Clojure, I missed the ability to introspect on methods easily when reaching into the Java world. Luckily, it only took a few lines of code to get what I wanted.

(defn class-methods [x]
 (let [c (if (class? x) x (class x))]
  (distinct (sort (seq 
                   (map #(.getName %1)
                    (.getMethods c)))))))

(println (let [x (new Integer 1)] (class-methods x)))