A Clojure drama in 3 acts and an epilogue




Act 1 : Scraping web pages with enlive

The goal is to retrieve all the plays of a famous author (here Molière, the 17th-century French playwright) and the characters of those plays. (source : toutmoliere.net)

(ns drama.act1
  (:require  [net.cgrand.enlive-html :as h]
             [clojure.string :as s]))

Enlive selector

Enlive is a templating system that works as follows :

  1. templates are plain HTML, without any special tags
  2. the HTML page is converted into a tree of nodes, like {:tag :a :attrs {:href "/"} :content ()}
  3. Enlive provides functions to select and transform this tree structure

Web scraping with enlive is done in 2 steps :

  1. Use enlive selectors to find the part of the HTML page containing the requested information
  2. Use plain functions to extract the information from the selected nodes
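
Since nodes are plain maps, the two steps can be sketched without enlive at all. This is only an illustration : the page tree is hand-written sample data, and select-tag and link->map are names made up for this sketch. Enlive's own h/select is far more capable.

```clojure
;; A hand-written node tree, in the shape enlive produces (hypothetical sample data).
(def page
  {:tag :ul :attrs {:class "listerub"}
   :content [{:tag :li :attrs {}
              :content [{:tag :a :attrs {:href "a.html"} :content ["Play A"]}]}
             {:tag :li :attrs {}
              :content [{:tag :a :attrs {:href "b.html"} :content ["Play B"]}]}]})

;; Step 1, select : walk the tree and keep the nodes with the given tag.
(defn select-tag [node tag]
  (filter #(= tag (:tag %))
          (tree-seq map? :content node)))

;; Step 2, extract : a plain function turns a node into the data we want.
(defn link->map [n]
  {:url (-> n :attrs :href)
   :title (-> n :content first)})

(map link->map (select-tag page :a))
;; => ({:url "a.html", :title "Play A"} {:url "b.html", :title "Play B"})
```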

Converts a source (URL, file or string) into nodes

(defn resource
  [s]
  (let [r (cond
            (.startsWith s "http:") (java.net.URL. s)
            (.exists (java.io.File. s)) (java.io.File. s)
            :else s)]
    (if (not (string? r)) (h/html-resource r) (h/html-snippet r))))

(def moliere "http://toutmoliere.net/")

Extract all plays

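A typical scraping function has 2 steps : select and extract.

Enlive selectors are a flexible way to express which part of the HTML you want.

The syntax can be a bit confusing at first sight, but it follows a few simple rules :

  1. a selector is always written inside a [] ; the steps of this outer [] are nested, like the descendant combinator in CSS : [:li :a] matches any a inside a li
  2. an inner [] means "and" : in [:li [:a (h/attr= :href "/")]] the node must be an a AND have an href equal to "/"
  3. the keywords follow the CSS syntax, for example :div#liste1 or :ul.listerub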


Extracts the list of the author's plays from http://toutmoliere.net/oeuvres.html (a local copy is available in resources/data/oeuvres.html)

(defn extract-plays
  [url]
  (let [nodes (h/select (resource url) [:div#liste1 :ul.listerub :li :a])
        extract (fn [n]
                  {:url (str moliere (-> n :attrs :href))
                   :title (-> n :content first s/trim)
                   :date (-> n (h/select [:i]) first h/text s/trim)})]
    (map extract nodes)))

Extract the characters

This involves a more complex logic : from the play's main page, go to the page of act 1, then extract the list of characters from there.

2 sample pages are available locally : resources/data/{ecoledesfemmes.html,ecoledesfemmes_acte1.html}

Extract url of Acte 1

(defn characters-url
  [nodes]
  (->> (h/select nodes [:ul#lapiece [:a (h/attr= :title "Acte 1")]])
       first :attrs :href
       (str moliere)))

Returns a list of characters [name, description]. This one is a bit trickier : depending on the page, the characters are either plain text nodes (one line per character) or the rows of a table.

(defn extract-characters
  [nodes]
  (let [selector1 [:div#centre_texte :div :div h/text-node]
        selector2 [:div#centre_texte
                   [:div h/first-of-type]
                   [:table h/first-of-type] :tr]
        items (h/select nodes selector1)
        items (if (< 1 (count items))
                (s/split-lines (apply str items)) ; one line = one character
                (map h/text (h/select nodes selector2)))
        trim (fn [s] (-> s
                         (s/replace-first #"^[,. ]+" "")    ; trim left
                         (s/replace-first #"[,. ]+$" "")))  ; trim right
        extractor (fn [s]
                    (map (partial s/join " ")
                         (split-with #(= (.toUpperCase %) %)
                                     (s/split (trim s) #"[,. ]+"))))
        validate (fn [c] (when (not (empty? (first c))) c))]
    (keep (comp validate extractor) items)))
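
The name/description split can be tried on its own. The sketch below repeats the same idea (the name is the leading run of upper-case words) in a standalone function. parse-character is a name made up for this example and the input lines are invented, they do not come from the site.

```clojure
(require '[clojure.string :as s])

(defn parse-character
  "Splits a line into [name description]; the name is the leading run of
  upper-case words. Returns nil when the line does not start with a name."
  [line]
  (let [trimmed (-> line
                    (s/replace-first #"^[,. ]+" "")
                    (s/replace-first #"[,. ]+$" ""))
        [name-words desc-words] (split-with #(= (s/upper-case %) %)
                                            (s/split trimmed #"[,. ]+"))
        name (s/join " " name-words)]
    (when-not (s/blank? name)
      [name (s/join " " desc-words)])))

(parse-character "ALICE MARTIN, sister of Bob.")
;; => ["ALICE MARTIN" "sister of Bob"]

(parse-character "the scene is in Paris.")
;; => nil
```

Note that the separators (commas, dots) inside the description are lost, because the line is split on them before being joined again with spaces.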

Associate the characters with a play

Put it all together

(defn append-characters
  [{u :url :as play}]
  (let [curl (characters-url (resource u))
        chars (extract-characters (resource curl))]
    (assoc play
      :characters-url curl
      :characters chars)))

Returns all the wanted information as a lazy sequence, i.e. the data is only fetched when requested.

Please use it with caution as it scrapes more than 60 web pages.

(defn all-in-one
  []
  (map append-characters
       (extract-plays "http://toutmoliere.net/oeuvres.html")))
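
What "only fetched when requested" means can be observed with a stand-in for the download. This is a sketch with no network access : fake-fetch and the page names are hypothetical, and a list is used as input because chunked collections (vectors, ranges) are realized 32 elements at a time.

```clojure
;; Records every "download" so we can see when the work really happens.
(def fetched (atom []))

(defn fake-fetch
  "Stand-in for a page download (hypothetical, no network involved)."
  [url]
  (swap! fetched conj url)
  {:url url})

;; Building the lazy sequence does not call fake-fetch yet.
(def pages (map fake-fetch (list "page-1" "page-2" "page-3")))

@fetched       ; still empty at this point
(first pages)  ; realizes the first element only
(doall pages)  ; forces the whole sequence
@fetched       ; => ["page-1" "page-2" "page-3"]
```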

Some IO functions

(defn coll->file
  [f coll & {:keys [separator] :or {separator "|"}}]
  (spit f (apply str (map #(str (s/join separator %) "\n") coll))))

Returns a list of rows, one per line of the file, each row being the list of its trimmed fields. If a header is supplied, each row is returned as a map keyed by the header.

(defn file->coll
  [f & {:keys [separator header size] :or {separator "|"}}]
  (let [lines (.split (slurp f) "\n")
        separator ({"|" "\\|"} separator separator) ; "|" must be escaped in a regex
        cut (fn [l] ((if (sequential? header)
                       (partial zipmap header)
                       identity)
                     (map #(.trim %)
                          (if size (.split l separator size)
                              (.split l separator)))))]
    (map cut lines)))
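
The file format can be explored in memory, without touching the disk. The records below are invented sample data and parse-line is a helper made up for this sketch. It shows why the separator is escaped and what the size limit is for.

```clojure
(require '[clojure.string :as s])

;; Hypothetical sample records, shaped like the maps built in this act.
(def sample-plays
  [{:title "Play A" :date "1660" :url "http://example.org/a.html"}
   {:title "Play B" :date "1665" :url "http://example.org/b.html"}])

;; Writing side : juxt picks the columns, join builds one line per record.
(def lines
  (map #(s/join "|" ((juxt :title :date :url) %)) sample-plays))

;; Reading side : "|" is the regex alternation operator, so it has to be escaped.
(defn parse-line [line]
  (mapv s/trim (s/split line #"\|")))

(map parse-line lines)
;; => (["Play A" "1660" "http://example.org/a.html"]
;;     ["Play B" "1665" "http://example.org/b.html"])

;; With a header, each row becomes a map again.
(map #(zipmap [:title :date :url] (parse-line %)) lines)

;; A limit keeps extra separators inside the last field.
(vec (.split "Play A|ALICE|sister | friend" "\\|" 3))
;; => ["Play A" "ALICE" "sister | friend"]
```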

Writes all plays to resources/data/moliere_plays.txt.

(defn plays->file
  [plays]
  (coll->file "resources/data/moliere_plays.txt"
              (map (juxt :title :date :url) plays)))

Writes all characters to resources/data/moliere_characters.txt, skipping the invalid ones.

(defn characters->file
  [plays]
  (let [valid? (fn [c] (and (< 1 (count c))
                            (= (first c) (.toUpperCase (first c)))))]
    (coll->file "resources/data/moliere_characters.txt"
                (mapcat (fn [{cs :characters t :title}]
                          (keep (fn [c] (when (valid? c) (cons t c))) cs))
                        plays))))



Act 2 : Let's play with data

Cascalog is used to query our data. It's built on top of Hadoop and Cascading, but you don't need any knowledge of the Hadoop ecosystem or of map/reduce to use it.

Most of the time, Cascalog lets you concentrate on "what" you want rather than on "how" to get it : it's declarative, like SQL.

(ns drama.act2
  (:require [drama.act1 :as a1]
            [cascalog.api :as ca]
            [cascalog.ops :as co]))

List of [title date url] records


Data in cascalog are lists of tuples

(def plays 
  (a1/file->coll "resources/data/moliere_plays.txt" ))

List of all [play's title, character's name, character's description] records

(def  characters
  (a1/file->coll "resources/data/moliere_characters.txt" :size 3))

Some cascalog queries

A cascalog query always has these 3 parts :

  1. The query operator, which defines and/or executes the query : <- , ?<- or ??<-
  2. The output columns of the query
  3. The predicates : generators, operations and aggregators

Get all characters of a play

(defn find-characters
  [title]
  (ca/??<- [?name ?desc]
           (characters title ?name ?desc)))

Get all the plays where a character is present : query using an implicit join

(defn find-plays
  [name]
  (ca/??<- [?title ?date]
           (plays ?title ?date ?url)
           (characters ?title name ?desc)))
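
For readers new to the implicit join, here is the same idea written in plain Clojure : the variable ?title appearing in both predicates is what ties a play to its characters. This is a sketch of the logic only, not of how Cascalog executes it. plays-with and the tuples are made up for the example.

```clojure
;; Made-up tuples, shaped like the plays and characters collections.
(def sample-plays
  [["Play A" "1660" "a.html"]
   ["Play B" "1665" "b.html"]
   ["Play C" "1670" "c.html"]])

(def sample-characters
  [["Play A" "ALICE" "a sister"]
   ["Play B" "ALICE" "a neighbour"]
   ["Play B" "BOB" "a servant"]
   ["Play C" "BOB" "a doctor"]])

(defn plays-with
  "Plays in which the character `name` appears, as [title date] pairs."
  [plays characters name]
  (for [[title date _url] plays
        [c-title c-name _desc] characters
        ;; the join condition that ?title expresses implicitly in the query
        :when (and (= title c-title) (= name c-name))]
    [title date]))

(plays-with sample-plays sample-characters "ALICE")
;; => (["Play A" "1660"] ["Play B" "1665"])
```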

List all characters with their number of occurrences in plays

(defn list-characters
  []
  (ca/??<- [?name ?ct]
           (characters ?title ?name ?desc)
           (co/count ?ct)))

List all plays with the number of their characters

(defn list-plays
  []
  (ca/??<- [?title ?date ?url ?ct]
           (plays ?title ?date ?url)
           (characters ?title ?name ?desc)
           (co/count ?ct)))

Get the n most used characters

(defn top-n-characters
  [n]
  (let [count-q (ca/<- [?name ?ct]
                       (characters ?title ?name ?desc)
                       (co/count ?ct))
        q (co/first-n count-q n :sort ["?ct"] :reverse true)]
    (ca/??- q)))
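
The counting queries have a simple plain-Clojure counterpart, which can help to check the results on a small data set. Again this is only a sketch of the logic : character-counts, top-n and the tuples are invented for the example, and the order of characters with equal counts is not specified.

```clojure
;; Made-up [title name description] tuples.
(def sample-characters
  [["Play A" "ALICE" "a sister"]
   ["Play B" "ALICE" "a neighbour"]
   ["Play C" "ALICE" "a widow"]
   ["Play B" "BOB" "a servant"]
   ["Play C" "BOB" "a doctor"]
   ["Play C" "CAROL" "a maid"]])

(defn character-counts
  "Map of character name to the number of tuples it appears in."
  [characters]
  (frequencies (map second characters)))

(defn top-n
  "The n most frequent names as [name count] pairs, highest count first."
  [characters n]
  (->> (character-counts characters)
       (sort-by val >)
       (take n)
       (map (fn [[name ct]] [name ct]))))

(character-counts sample-characters)
;; => {"ALICE" 3, "BOB" 2, "CAROL" 1}

(top-n sample-characters 2)
;; => (["ALICE" 3] ["BOB" 2])
```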

Act 3 : Back to the web

Architecture in place :

  1. Ring interface : 2 maps (request and response) and 2 kinds of functions (handler and middleware)
  2. Routing with moustache
  3. HTML templating with enlive

(ns drama.act3
  (:use [net.cgrand.moustache :only [app]]
        [ring.middleware.file :only [wrap-file]]
        [ring.util.codec :only [url-decode]]
        [ring.util.response :only [response content-type file-response]]
        [ring.adapter.jetty :only [run-jetty]])
  (:require [net.cgrand.enlive-html :as h]
            [drama.act2 :as a2]))

Enlive templating system

It's based on 2 macros, defsnippet and deftemplate. Both define a function : a snippet returns a sequence of nodes, a template returns a sequence of strings.


;; "main.html" is an assumed template path : the actual file name is not given here
(h/defsnippet list-item "main.html" [:div#main :ul :li]
  [{:keys [title text url nolink total]}]
  [[:a h/first-of-type]]
  (h/do-> (if nolink identity (h/set-attr :href (str "/" title)))
          (h/content title))
  [[:a (h/nth-of-type 2)]] (when url (h/set-attr :href url))
  [[:span h/first-of-type]] (h/content text)
  [[:span (h/nth-of-type 2)]] (h/content (str total)))

;; Builds a transformation that prepends prefix to the attribute att of a node
(defn prepend-attrs [att prefix]
  (fn [node] (update-in node [:attrs att] (fn [v] (str prefix v)))))
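
An enlive transformation is just a function from a node to a node, so prepend-attrs can be tried on a hand-written node without any template. The node below is hypothetical sample data. The function is repeated here only to keep the snippet self-contained.

```clojure
(defn prepend-attrs [att prefix]
  (fn [node] (update-in node [:attrs att] (fn [v] (str prefix v)))))

;; A node in the shape enlive uses (hypothetical stylesheet link).
(def link-node
  {:tag :link :attrs {:rel "stylesheet" :href "css/main.css"} :content nil})

((prepend-attrs :href "/") link-node)
;; => {:tag :link, :attrs {:rel "stylesheet", :href "/css/main.css"}, :content nil}
```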


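;; "main.html" is an assumed template path : the actual file name is not given here
(h/deftemplate main "main.html" [title items]
  [[:link (h/attr= :rel "stylesheet")]] (prepend-attrs :href "/")
  [:div#main :h3] (h/content title)
  [:div#main :ul] (if (and (sequential? items) (seq items))
                    (h/content (map list-item items))
                    (h/substitute "")))

;; Converts a [title date url count] tuple into the map expected by list-item
(defn vec->item [[t d u c]]
  {:title t :text d :url u :total c})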

Renders the view as HTML encoded in UTF-8

(defn render
  [body]
  (content-type
   (response body)
   "text/html ; charset=utf-8"))

Routing requests

Ring is a perfect example of the motto "data and functions". It consists of :

  1. The request and the response, which are plain data (maps)
  2. The handler, a function that returns a response given a request
  3. The middleware, a higher-order function : it takes a handler as first parameter and returns a new handler function
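
These three notions fit in a few lines of plain Clojure, with no Ring dependency. The sketch below is an illustration only : hello-handler and wrap-html-utf8 are names made up for this example, not functions provided by Ring.

```clojure
;; Handler : a function from a request map to a response map.
(defn hello-handler [request]
  {:status 200
   :headers {}
   :body (str "You asked for " (:uri request))})

;; Middleware : takes a handler and returns a new handler.
;; wrap-html-utf8 is a made-up name for this sketch, not a Ring function.
(defn wrap-html-utf8 [handler]
  (fn [request]
    (assoc-in (handler request)
              [:headers "Content-Type"]
              "text/html; charset=utf-8")))

(def app-handler (wrap-html-utf8 hello-handler))

;; A request is just a map, so a handler can be called directly from the REPL.
(app-handler {:uri "/" :request-method :get})
;; => {:status 200,
;;     :headers {"Content-Type" "text/html; charset=utf-8"},
;;     :body "You asked for /"}
```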


routes describes the behaviour of the web app : how to handle each incoming request. app is the main macro of moustache, its body consists of 2 parts :

  1. middlewares
  2. routes

Test your routes from the REPL : (routes {:uri "/" :request-method :get})


(def routes
  (app
   (wrap-file "resources") ;; to get CSS files
   [] (fn [req] (render (main "Molière Works" (map vec->item a2/plays))))
   [play &] (fn [req] (render (main play (map vec->item (a2/find-characters play)))))))

Generates HTML pages for each play

(defn generate-pages
  []
  (doseq [[title _] a2/plays]
    (spit (str "resources/generated/" title ".html")
          (apply str (main title
                           (map #(assoc (vec->item %) :nolink 1)
                                (a2/find-characters title)))))))

;; Generates the summary page listing all plays
(defn generate-summary
  []
  (spit "resources/generated/plays.html"
        (apply str (main "Molière Works" (map vec->item (a2/list-plays))))))

;; Returns a handler serving the generated page called name
(defn baked-handler [name]
  (fn [req]
    (file-response
     (str name ".html")
     {:root "resources/generated" :index-files? true
      :allow-symlinks? false})))

Here, instead of running a cascalog query to get the list of characters, the handler serves the generated page

(def baked-routes
  (app
   (wrap-file "resources")
   [""] (baked-handler "plays")
   [play &] (baked-handler play)))

Starts the Jetty server with your routes. Note : passing a var, here (var baked-routes), instead of the function itself lets you redefine the routes while the server is running, which allows interactive web development

(defn start
  [& [port & options]]
  (run-jetty (var baked-routes) {:port (or port 8080) :join? false}))

(defn -main []
  (let [port (try (Integer/parseInt (System/getenv "PORT"))
                  (catch Throwable t 8080))]
    (start port)))

(ns drama.epilogue)

Game of life : Beauty of clojure in action

  1. code from https://gist.github.com/2491305 on https://github.com/laurentpetit/mixit2012
  2. first version with indexes
  3. final version some time later because simple ain't easy

It's a classic example, also present in the book "Clojure Programming" by Emerick, Carper and Grand.


The game is represented as a set of the living cells #{[1 0] [1 1] [1 2]}

(defn neighbours [[x y]]
  (for [dx [-1 0 1] dy (if (zero? dx) [-1 1] [-1 0 1])]
    [(+ dx x) (+ dy y)]))

let's go step by step ... in the REPL :

(def board #{[1 0] [1 1] [1 2]})

(take 5 (iterate step board))

(defn step
  [cells]
  (set (for [[loc n] (frequencies (mapcat neighbours cells))
             :when (or (= n 3) (and (= n 2) (cells loc)))]
         loc)))
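
To see why this works, it helps to look at the intermediate value : frequencies over all the neighbours gives, for each location, its number of living neighbours. The sketch below repeats neighbours to stay self-contained. The names blinker, counts and next-gen are introduced for this example only.

```clojure
(defn neighbours [[x y]]
  (for [dx [-1 0 1] dy (if (zero? dx) [-1 1] [-1 0 1])]
    [(+ dx x) (+ dy y)]))

(def blinker #{[1 0] [1 1] [1 2]})

;; Each location mapped to its number of living neighbours.
(def counts (frequencies (mapcat neighbours blinker)))

(counts [0 1]) ; => 3, a dead cell with 3 living neighbours : it is born
(counts [1 1]) ; => 2, a living cell with 2 living neighbours : it survives
(counts [1 0]) ; => 1, a living cell with 1 living neighbour : it dies

;; Keeping only the locations that satisfy the rule gives the next generation.
(def next-gen
  (set (for [[loc n] counts
             :when (or (= n 3) (and (= n 2) (blinker loc)))]
         loc)))

next-gen ; => #{[0 1] [1 1] [2 1]} (the printed order of a set may differ)
```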

A board is a plain string with h lines and each line contains w characters


Just to show it live. The code itself is a good example of :

  1. java interop
  2. atom simplest form of concurrency in clojure

Instructions to run it :

  1. (swing-board board 5 5) opens the swing window with an empty board
  2. (play #{[1 0] [1 1] [1 2]} 100) brings it to life
  3. (def continue? false) stops the GUI refresh, then close the swing window

(defn str-board
  [cells w h]
  (apply str (for [y (range h)
                   x (range (inc w))]
               (cond
                 (= x w) \newline
                 (cells [x y]) \O
                 :else \.))))
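
A quick check of the rendering, on the 3 cells used earlier and on the generation that follows them. str-board is repeated so that the snippet runs on its own, and the two sets are written out by hand.

```clojure
(defn str-board
  [cells w h]
  (apply str (for [y (range h)
                   x (range (inc w))]
               (cond
                 (= x w) \newline   ; one extra column ends each line
                 (cells [x y]) \O   ; living cell
                 :else \.))))       ; dead cell

(print (str-board #{[1 0] [1 1] [1 2]} 3 3))
;; .O.
;; .O.
;; .O.

(print (str-board #{[0 1] [1 1] [2 1]} 3 3))
;; ...
;; OOO
;; ...
```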

The atom stores the current state of the game

Waiting time (in ms) before computing the next state

Refresh time interval (in ms)

(def board
  (atom #{}))

(def sleep 200) ; assumed value : the actual one is not given here

(def refresh-interval 40)

Set it to false to stop the game

(def continue?  true)

Displays, in a Swing JTextArea, a board of size [w h] with the living cells held in the atom r. Refreshes the board at each refresh interval.

(defn swing-board
  [r w h]
  (let [t (doto (javax.swing.JTextArea. "" h w)
            (.setFont (java.awt.Font/decode "Monospaced 48")))
        j (doto (javax.swing.JFrame. "Game of Life")
            (.add t)
            (.pack)
            (.setVisible true))]
    (future (while continue?
              (Thread/sleep refresh-interval)
              (.setText t (str-board @r w h))))))

Given the initial state of the board, computes the next n states (with the function step) and updates the board after each sleep period

(defn play
  [init n]
  (future
    (reset! board init)
    (dotimes [_ n]
      (Thread/sleep sleep)
      (swap! board step))))
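
The reset! / swap! pair used by play can be tried on a simpler state. swap! applies a function to the current value atomically, so concurrent updates are never lost. The sketch below uses a counter instead of a board. The names counter and threads, and the numbers, are arbitrary choices for this example.

```clojure
(def counter (atom 0))

(reset! counter 10)  ; replaces the value, whatever it was
(swap! counter inc)  ; computes the new value from the old one
@counter             ; => 11

;; 4 threads, each incrementing the same atom 1000 times.
(reset! counter 0)
(let [threads (vec (repeatedly 4 #(Thread. (fn [] (dotimes [_ 1000]
                                                    (swap! counter inc))))))]
  (doseq [t threads] (.start t))
  (doseq [t threads] (.join t)))

@counter             ; => 4000
```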