drama 0.1.0-SNAPSHOT
A Clojure drama in 3 acts and a prologue
Act I : Scraping web pages with Enlive
The goal is to retrieve all the plays of a famous author (here Molière, the 17th-century French playwright) and the characters of those plays. (source : toutmoliere.net) | (ns drama.act1 (:require [net.cgrand.enlive-html :as h] [clojure.string :as s])) | |||||||||||||||||||||
Enlive selector
Enlive is a templating system built around selectors.
Web scraping with Enlive is done in 2 steps: select the nodes, then extract the data from them.
| ||||||||||||||||||||||
Converts a source (URL, file or string) into nodes | (defn resource [s] (let [r (cond (.startsWith s "http:") (java.net.URL. s) (.exists (java.io.File. s)) (java.io.File. s) :else s)] (if (not (string? r)) (h/html-resource r) (h/html-snippet r)))) | |||||||||||||||||||||
(def moliere "http://toutmoliere.net/") | ||||||||||||||||||||||
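Extract all plays
A typical scraping routine has 2 steps: select and extract. Enlive selectors are a flexible way to express an HTML selection. The syntax can be confusing at first sight, but it follows a few simple rules: a vector chains steps from ancestor to descendant, a nested vector requires a single element to match all of its conditions, and a set means "or".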
| ||||||||||||||||||||||
Extracts the author's list of plays from http://toutmoliere.net/oeuvres.html (a local copy is available in resources/data/oeuvres.html) | (defn extract-plays [url] (let [nodes (h/select (resource url) [:div#liste1 :ul.listerub :li :a]) extract (fn [n] {:url (str moliere (-> n :attrs :href)) :title (-> n :content first s/trim) :date (-> n (h/select [:i]) first h/text s/trim)})] (map extract nodes))) | |||||||||||||||||||||
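The extract step works on plain data: Enlive represents each element as a map with :tag, :attrs and :content keys. The sketch below applies the same extraction to a hand-written node in plain Clojure, without Enlive. The node, its href and the names sample-node, base-url and extract-sketch are assumptions made for illustration; the real markup of the site may differ.

```clojure
(require '[clojure.string :as s])

;; Hypothetical node, shaped like what Enlive would produce for
;; <a href="avare.html"> L'Avare <i> 1668 </i></a>
(def sample-node
  {:tag :a
   :attrs {:href "avare.html"}
   :content [" L'Avare " {:tag :i :attrs nil :content [" 1668 "]}]})

(def base-url "http://toutmoliere.net/")

(defn extract-sketch [n]
  {:url   (str base-url (-> n :attrs :href))
   :title (-> n :content first s/trim)
   ;; plain-Clojure stand-in for (h/select [:i]) followed by h/text:
   ;; take the first direct child tagged :i and join its text content
   :date  (->> (:content n)
               (filter #(= :i (:tag %)))
               first
               :content
               (apply str)
               s/trim)})

(extract-sketch sample-node)
;; => {:url "http://toutmoliere.net/avare.html", :title "L'Avare", :date "1668"}
```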
Extract the characters
This involves more complex logic: from the play's main page, go to the play's Act 1 page, then extract the list of characters from there. Two sample pages are available locally: resources/data/{ecoledesfemmes.html,ecoledesfemmes_acte1.html} | ||||||||||||||||||||||
Extracts the URL of the Act 1 page ("Acte 1") | (defn characters-url [nodes] (->> (h/select nodes [:ul#lapiece [:a (h/attr= :title "Acte 1")]]) first :attrs :href (str moliere))) | |||||||||||||||||||||
Returns a list of characters [name, description]. This one is trickier, with several cases to handle: when the text-node selector matches, each line of text is one character; otherwise the rows of the first table are used. Each item is trimmed of leading and trailing punctuation before being split. | (defn extract-characters [nodes] (let [selector1 [:div#centre_texte :div :div h/text-node] selector2 [:div#centre_texte [:div h/first-of-type] [:table h/first-of-type] :tr] items (h/select nodes selector1) items (if (< 1 (count items)) (s/split-lines (apply str items)) (map h/text (h/select nodes selector2))) trim (fn [s] (-> s (s/replace-first #"^[,. ]+" "") (s/replace-first #"[,. ]+$" ""))) extractor (fn [s] (map (partial s/join " ") (split-with #(= (.toUpperCase %) %) (s/split (trim s) #"[,. ]+")))) validate (fn [c] (when (not (empty? (first c))) c))] (keep (comp validate extractor) items))) | |||||||||||||||||||||
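The heart of the extraction is the split between the upper-case name and the rest of the line. Here is a minimal standalone sketch of that step, reusing the same regular expressions; the helper names trim-punct and name+desc and the sample line are illustrative, not part of the project.

```clojure
(require '[clojure.string :as s])

;; Strip leading and trailing commas, dots and spaces.
(defn trim-punct [line]
  (-> line
      (s/replace-first #"^[,. ]+" "")
      (s/replace-first #"[,. ]+$" "")))

;; Leading all-upper-case words form the name, the remaining words the description.
(defn name+desc [line]
  (map (partial s/join " ")
       (split-with #(= (.toUpperCase ^String %) %)
                   (s/split (trim-punct line) #"[,. ]+"))))

(name+desc "ARNOLPHE, autrement M. de la Souche.")
;; => ("ARNOLPHE" "autrement M de la Souche")
```

Note that splitting on punctuation loses it inside the description ("M." becomes "M"), which is acceptable here since only the name is used as a key.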
Put it all together
Associates the characters with a play | (defn append-characters [{u :url :as play}] (let [curl (characters-url (resource u)) chars (extract-characters (resource curl))] (assoc play :characters-url curl :characters chars))) | |||||||||||||||||||||
Returns all the wanted information as a lazy sequence, i.e. the data is only fetched when requested. Use it with caution: it scrapes more than 60 web pages. | (defn all-in-one [] (map append-characters (extract-plays "http://toutmoliere.net/oeuvres.html"))) | |||||||||||||||||||||
Some IO functions | ||||||||||||||||||||||
(defn coll->file [f coll & {:keys [separator] :or {separator "|"}}] (spit f (apply str (map #(str (s/join separator %) "\n") coll)))) | ||||||||||||||||||||||
Reads a file back and returns one sequence of trimmed fields per line. If a header is supplied, it returns a list of maps instead. The separator is a regular expression for String.split, so "|" is escaped; :size is passed as the split limit. | (defn file->coll [f & {:keys [separator header size] :or {separator "|"}}] (let [lines (.split (slurp f) "\n") separator ({"|" "\\|"} separator separator) cut (fn [l] ((if (sequential? header) (partial zipmap header) identity) (map #(.trim %) (if size (.split l separator size) (.split l separator)))))] (map cut lines))) | |||||||||||||||||||||
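Two details of file->coll are easy to miss: String.split takes a regular expression, so "|" has to be escaped, and a positive limit keeps trailing empty fields that a plain split would drop. The sketch below isolates both; regex-separator, cut and the sample records are illustrative names and data. Keeping every record at 3 fields is probably why the characters file is read with :size 3, but that is an assumption.

```clojure
;; Map lookup with a default: "|" is escaped, anything else is used as is.
(defn regex-separator [separator]
  ({"|" "\\|"} separator separator))

;; Split one line into trimmed fields, optionally with a split limit.
(defn cut [^String line separator size]
  (let [^String re (regex-separator separator)]
    (mapv #(.trim ^String %)
          (if size
            (.split line re (int size))
            (.split line re)))))

(cut "Tartuffe|ORGON|mari d'Elmire" "|" nil)
;; => ["Tartuffe" "ORGON" "mari d'Elmire"]

(cut "Tartuffe|ORGON|" "|" nil)  ; trailing empty field is dropped
;; => ["Tartuffe" "ORGON"]

(cut "Tartuffe|ORGON|" "|" 3)    ; the limit keeps it
;; => ["Tartuffe" "ORGON" ""]
```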
Writes all plays to resources/data/moliere_plays.txt. | (defn plays->file [plays] (coll->file "resources/data/moliere_plays.txt" (map (juxt :title :date :url) plays))) | |||||||||||||||||||||
Writes all characters to resources/data/moliere_characters.txt, skipping invalid ones. | (defn characters->file [plays] (let [valid? (fn [c] (and (< 1 (count c)) (= (first c) (.toUpperCase (first c)))))] (coll->file "resources/data/moliere_characters.txt" (mapcat (fn [{cs :characters t :title}] (keep (fn [c] (when (valid? c) (cons t c))) cs)) plays)))) | |||||||||||||||||||||
Further information on enlive | ||||||||||||||||||||||
Act 2 : Let's play with data
Cascalog is used to query our data. It is built on top of Hadoop and Cascading, but you do not need any knowledge of the Hadoop ecosystem or of map/reduce to use it. Most of the time, Cascalog lets you concentrate on "what" you want, not on "how": it is declarative, like SQL. | (ns drama.act2 (:require [drama.act1 :as a1] [cascalog.api :as ca] [cascalog.ops :as co])) | |||||||||||||||||||||
Model
Data in Cascalog are lists of tuples.
List of all plays as [title date url] | (def plays (a1/file->coll "resources/data/moliere_plays.txt")) | |||||||||||||||||||||
List of all character records as [title of the play, character's name, character's description] | (def characters (a1/file->coll "resources/data/moliere_characters.txt" :size 3)) | |||||||||||||||||||||
Some Cascalog queries
A Cascalog query always has these 3 parts: the query operator (here ??<-), the vector of output variables, and one or more predicates.
Gets all characters of a play | (defn find-characters [title] (ca/??<- [?name ?desc] (characters title ?name ?desc))) | |||||||||||||||||||||
Gets all the plays in which a character appears. The query uses an implicit join: ?title appears in both predicates. | (defn find-plays [name] (ca/??<- [?title ?date] (plays ?title ?date ?url) (characters ?title name ?desc))) | |||||||||||||||||||||
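To see what the implicit join computes, here is a rough plain-Clojure analogy using clojure.set/join, which performs a natural join on the keys two relations share (here :title). This is not Cascalog and not how Cascalog executes the query; the sample data and the name find-plays-sketch are illustrative.

```clojure
(require '[clojure.set :as set])

;; Illustrative sample data: relations are sets of maps.
(def sample-plays
  #{{:title "Tartuffe" :date "1664"}
    {:title "L'Avare"  :date "1668"}})

(def sample-characters
  #{{:title "Tartuffe" :name "ORGON"}
    {:title "Tartuffe" :name "MARIANE"}
    {:title "L'Avare"  :name "HARPAGON"}
    {:title "L'Avare"  :name "MARIANE"}})

;; Join on the shared :title key, keep the rows for one character,
;; and project [title date], like the output variables of the query.
(defn find-plays-sketch [character-name]
  (->> (set/join sample-plays sample-characters)
       (filter #(= character-name (:name %)))
       (map (juxt :title :date))
       set))

(find-plays-sketch "MARIANE")
;; => #{["Tartuffe" "1664"] ["L'Avare" "1668"]}
```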
Lists all characters with their number of occurrences in plays | (defn list-characters [] (ca/??<- [?name ?ct] (characters ?title ?name ?desc) (co/count ?ct))) | |||||||||||||||||||||
Lists all plays with the number of their characters | (defn list-plays [] (ca/??<- [?title ?date ?url ?ct] (plays ?title ?date ?url) (characters ?title ?name ?desc) (co/count ?ct))) | |||||||||||||||||||||
Gets the n characters that appear in the most plays | (defn top-n-characters [n] (let [count-q (ca/<- [?name ?ct] (characters ?title ?name ?desc) (co/count ?ct)) q (co/first-n count-q n :sort ["?ct"] :reverse true)] (ca/??- q))) | |||||||||||||||||||||
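For a small in-memory data set, the counting and top-n queries have a direct plain-Clojure counterpart, which can help to check what the Cascalog versions should return. A minimal sketch, assuming rows shaped like the characters tuples; the sample rows and the names count-characters and top-n-sketch are illustrative.

```clojure
;; Illustrative rows: [title name description]
(def sample-rows
  [["Tartuffe" "ORGON"    "mari d'Elmire"]
   ["Tartuffe" "MARIANE"  "fille d'Orgon"]
   ["L'Avare"  "HARPAGON" "pere de Cleante"]
   ["L'Avare"  "MARIANE"  "amante de Cleante"]])

;; Group by name and count, like (co/count ?ct) grouped on ?name.
(defn count-characters [rows]
  (frequencies (map second rows)))

;; Sort by count, descending, and keep the first n entries.
(defn top-n-sketch [n rows]
  (->> (count-characters rows)
       (sort-by val >)
       (take n)))

(count-characters sample-rows)
;; => {"ORGON" 1, "MARIANE" 2, "HARPAGON" 1}

(top-n-sketch 1 sample-rows)
;; => (["MARIANE" 2])
```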
Act 3 : Back to the web
The architecture in place: Ring with its Jetty adapter to serve requests, Moustache for routing, Enlive for templating, and the Act 2 queries for the data. | (ns drama.act3 (:use [net.cgrand.moustache :only [app]] [ring.middleware.file :only [wrap-file]] [ring.util.codec :only [url-decode]] [ring.util.response :only [response content-type file-response]] [ring.adapter.jetty :only [run-jetty]]) (:require [net.cgrand.enlive-html :as h] [drama.act2 :as a2])) | |||||||||||||||||||||
Enlive templating system
It is based on 2 macros: defsnippet and deftemplate. | ||||||||||||||||||||||
Snippet for one list item, taken from list.html | (h/defsnippet list-item "list.html" [:div#main :ul :li] [{:keys [title text url nolink total]}] [[:a h/first-of-type]] (h/do-> (if nolink identity (h/set-attr :href (str "/" title))) (h/content title)) [[:a (h/nth-of-type 2)]] (when url (h/set-attr :href url)) [[:span h/first-of-type]] (h/content text) [[:span (h/nth-of-type 2)]] (h/content (str total))) | |||||||||||||||||||||
(defn prepend-attrs [att prefix] (fn[node] (update-in node [:attrs att] (fn[v] (str prefix v))))) | ||||||||||||||||||||||
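An Enlive transformation is just a function from a node to a node, so prepend-attrs can be tried on a plain map. The definition is repeated from above to keep the snippet self-contained; the sample node and its href are made up for illustration.

```clojure
;; Same definition as above: returns a transformation that prefixes one attribute.
(defn prepend-attrs [att prefix]
  (fn [node]
    (update-in node [:attrs att] (fn [v] (str prefix v)))))

;; Hypothetical <link rel="stylesheet" href="css/style.css"> node
(def link-node
  {:tag :link
   :attrs {:rel "stylesheet" :href "css/style.css"}
   :content nil})

((prepend-attrs :href "/") link-node)
;; => {:tag :link, :attrs {:rel "stylesheet", :href "/css/style.css"}, :content nil}
```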
Main page template, built from list.html | (h/deftemplate main "list.html" [title items] [[:link (h/attr= :rel "stylesheet")]] (prepend-attrs :href "/") [:div#main :h3] (h/content title) [:div#main :ul] (if (and (sequential? items) (seq items)) (h/content (map list-item items)) (h/substitute ""))) | |||||||||||||||||||||
(defn vec->item [[t d u c]] {:title t :text d :url u :total c}) | ||||||||||||||||||||||
Renders the view as UTF-8 HTML | (defn render [body] (content-type (response body) "text/html; charset=utf-8")) | |||||||||||||||||||||
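Routing requests
Ring is a perfect example of the motto "data and functions": requests and responses are plain maps, handlers are functions from a request to a response, and middleware are functions that wrap handlers.
The routes can be tested from the REPL by calling them with a request map. wrap-file is there to serve the CSS files. | (def routes (app (wrap-file "resources") [] (fn [req] (render (main "Molière Works" (map vec->item a2/plays)))) [play &] (fn [req] (render (main play (map vec->item (a2/find-characters play))))))) | |||||||||||||||||||||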
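Because a Ring handler is only a function over maps, it can be exercised without starting a server. The sketch below uses plain Clojure, with no Ring or Moustache dependency, to show the handler and middleware shapes; hello-handler, wrap-server-header and app-sketch are hypothetical names, not part of the project.

```clojure
;; A handler: request map in, response map out.
(defn hello-handler [req]
  {:status 200
   :headers {"Content-Type" "text/html; charset=utf-8"}
   :body (str "You asked for " (:uri req))})

;; A middleware: takes a handler and returns a new handler.
(defn wrap-server-header [handler]
  (fn [req]
    (assoc-in (handler req) [:headers "Server"] "drama")))

(def app-sketch (wrap-server-header hello-handler))

;; "Testing from the REPL" is a plain function call with a request map.
(app-sketch {:request-method :get :uri "/Tartuffe"})
;; => {:status 200,
;;     :headers {"Content-Type" "text/html; charset=utf-8", "Server" "drama"},
;;     :body "You asked for /Tartuffe"}
```

The real routes can be called the same way, with a map carrying at least :request-method and :uri.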
Generates one HTML page per play | (defn generate-pages [] (doseq [[title _] a2/plays] (spit (str "resources/generated/" title ".html") (apply str (main title (map #(assoc (vec->item %) :nolink 1) (a2/find-characters title))))))) | |||||||||||||||||||||
(defn generate-summary [] (spit (str "resources/generated/plays.html") (apply str (main "Molière Works" (map vec->item (a2/list-plays)))))) | ||||||||||||||||||||||
(defn baked-handler [name] (fn [req] (file-response (str name ".html") {:root "resources/generated" :index-files? true :allow-symlinks? false}))) | ||||||||||||||||||||||
Here, instead of running a Cascalog query to get the list of characters, the handler serves the generated page | (def baked-routes (app (wrap-file "resources") [""] (baked-handler "plays") [play &] (baked-handler play))) | |||||||||||||||||||||
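Starts a Jetty server with your routes.
Note: passing (var baked-routes) rather than the value lets you redefine the routes without restarting the server, and :join? false keeps the calling thread (e.g. the REPL) free. | (defn start [ & [port & options]] (run-jetty (var baked-routes) {:port (or port 8080) :join? false})) | |||||||||||||||||||||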
(defn -main [] (let [port (try (Integer/parseInt (System/getenv "PORT")) (catch Throwable t 8080))] (start port))) | ||||||||||||||||||||||
(ns drama.epilogue) | ||||||||||||||||||||||
Game of life : Beauty of Clojure in action
It is a classic example, also found in the book "Clojure Programming" by Emerick, Carper and Grand. | ||||||||||||||||||||||
Logic
The game is represented as the set of its living cells | ||||||||||||||||||||||
(defn neighbours [[x y]] (for [dx [-1 0 1] dy (if (zero? dx) [-1 1] [-1 0 1])] [(+ dx x) (+ dy y)])) | ||||||||||||||||||||||
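Let's go step by step, in the REPL: a cell is alive in the next generation if it has exactly 3 living neighbours, or 2 if it is already alive | (defn step [cells] (set (for [[loc n] (frequencies (mapcat neighbours cells)) :when (or (= n 3) (and (= n 2) (cells loc)))] loc))) | |||||||||||||||||||||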
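A quick REPL session on the classic "blinker" pattern shows the two functions at work. The definitions are repeated so the snippet runs on its own; the blinker var is an illustrative name.

```clojure
;; The 8 cells around [x y].
(defn neighbours [[x y]]
  (for [dx [-1 0 1]
        dy (if (zero? dx) [-1 1] [-1 0 1])]
    [(+ dx x) (+ dy y)]))

;; Count how many living neighbours each location has, then apply the rules.
(defn step [cells]
  (set (for [[loc n] (frequencies (mapcat neighbours cells))
             :when (or (= n 3) (and (= n 2) (cells loc)))]
         loc)))

;; Three cells in a vertical line
(def blinker #{[1 0] [1 1] [1 2]})

(count (neighbours [1 1]))
;; => 8

(step blinker)          ; the line becomes horizontal
;; => #{[0 1] [1 1] [2 1]}

(step (step blinker))   ; and flips back after a second step
;; => #{[1 0] [1 1] [1 2]}
```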
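GUI
Just to show it live; the code itself is a good example of Java interop (Swing) and of Clojure's concurrency tools (atom and future).
To run it, open the window with swing-board on the board atom, then call play with an initial set of living cells.
A board is rendered as a plain string of h lines, each line containing w characters | (defn str-board [cells w h] (apply str (for [y (range h) x (range (inc w))] (cond (= x w) \newline (cells [x y]) \O :else \.)))) | |||||||||||||||||||||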
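The rendering can be checked without any GUI. In this small sketch the definition is repeated to keep it self-contained, and the 3x3 blinker is an illustrative input.

```clojure
;; One character per cell, plus a newline at the end of each row.
(defn str-board [cells w h]
  (apply str
         (for [y (range h)
               x (range (inc w))]
           (cond
             (= x w)       \newline
             (cells [x y]) \O
             :else         \.))))

(str-board #{[1 0] [1 1] [1 2]} 3 3)
;; => ".O.\n.O.\n.O.\n"

(print (str-board #{[0 1] [1 1] [2 1]} 3 3))
;; ...
;; OOO
;; ...
```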
board is an atom that stores the current state of the game; sleep is the waiting time (in ms) before computing the next state; refresh-interval is the display refresh period (in ms) | (def board (atom #{})) (def sleep 200) (def refresh-interval 40) | |||||||||||||||||||||
Set it to false to stop the game | (def continue? true) | |||||||||||||||||||||
Displays a board of size [w h] in a Swing JTextArea, showing the living cells held in atom r. The display is refreshed every refresh-interval. | (defn swing-board [r w h] (let [t (doto (javax.swing.JTextArea. "" h w) (.setFont (java.awt.Font/decode "Monospaced 48"))) j (doto (javax.swing.JFrame. "Game of Life") (.add t) .pack .show)] (future (while continue? (Thread/sleep refresh-interval) (.setText t (str-board @r w h)))))) | |||||||||||||||||||||
Given the initial state of the board, computes the next n states (with the step function) and updates the board after each sleep period | (defn play [init n] (future (reset! board init) (dotimes [_ n] (Thread/sleep sleep) (swap! board step)))) | |||||||||||||||||||||
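The play function combines two concurrency primitives: a future runs the loop on a background thread, and an atom holds the state that other threads read. A minimal sketch of the same pattern, with a counter instead of a board; the names counter and worker are illustrative.

```clojure
;; Shared state, readable from any thread with @counter.
(def counter (atom 0))

;; The loop runs in the background; the calling thread is not blocked.
(def worker
  (future
    (dotimes [_ 5]
      (Thread/sleep 10)
      (swap! counter inc))))

;; Dereferencing the future blocks until the loop is done.
;; dotimes returns nil, so that is the future's value.
@worker
;; => nil

@counter
;; => 5
```

In the game, the Swing refresh loop plays the role of the reader: it dereferences the board atom at its own pace while play keeps swapping in new states.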