Introduction to testing Cascalog with Midje.
Here at Screen6 we have been using Cascalog to write our Hadoop jobs in Clojure. Even though we could have gone with using plain Clojure, Cascalog gives the ability to write Map-Reduce jobs in a fast and concise way. Another big bonus that comes from using Cascalog is the ease of testing. This blog posts covers a simple use case for Cascalog and shows a few simple test cases that we can write against Cascalog queries.
Testing: Hadoop way
Before we dive into testing Cascalog jobs, a short detour into testing approaches in Hadoop is deserved.
The accepted way of testing MapReduce jobs in Hadoop is to combine unit tests and occasional runs in local cluster. There are multiple issues with this approach. First of all, youâre going to double the work youâre required to do in order to mark the build as âlocally tested". Second of all, itâs not easy to debug failures in Hadoop cluster. Sure, you can go through stacktraces, but getting your debugger to work in such environment is quite cumbersome.
There is an alternative to this â Apache MRUnit. Itâs a superior approach to running a local cluster, however it still suffers from being extremely low level when compared to Midje-Cascalog or even Cascadingâs built-in testing utilities. Usability of MRUnit is also undermined by the fact that most Hadoop jobs are rarely split into composable parts, often having lots of code repetition across similar tasks.
Overall it feels that Hadoop community often treats testing as a second class citizen and this sentiment truly hurts adoption of best testing practices.
Cascalog
A common use case for writing MapReduce jobs is to parse billions of log entries to produce sensible view into patterns of data access, load distribution, etc. Itâs also a task we have to solve at Screen6. In fact, itâs such a common problem, that masses of MapReduce frameworks were created in the past years. The most prominent of them is Hadoop, first developed at Yahoo in 2005 and since gradually adopted as the go-to framework for data analysis.
However, it is worth noting that despite being the de-facto data science tool, Hadoop was never considered âeasy" when it came to developing. It had and still has a high learning curve, which led to creation of frameworks that try to provide a higher level of abstraction when working with huge arrays of data. One of the most popular of such frameworks is Cascading, which provides a more declarative way of describing your data processing pipeline. Cascalog stands even higher on the abstraction ladder, implementing Datalog, a truly declarative language on top of existing Cascading library. Unlike many other libraries and frameworks that increase the abstraction level while dropping the ability to go one level deeper, Cascalog gives you the ability to actually resort to writing Cascading when required. In our experience thatâs rarely needed and in fact our code doesnât have any âabstraction breaking".
MapReduce with Cascalog
Problem statement
Before we can start with testing, weâll layout a problem that we can solve with Cascalog. A common task for us at Screen6 is extracting useful information out of log files. Itâs neither a unique technical challenge, nor is it inherently exciting, but itâs such a common thing to do that itâs always nice to have a way to minimize the amount of code you need to write to accomplish it.
Since our data pipeline is a quite contrived and just explaining it would take a space required for a few blog posts, weâll devise a simpler example of such usage. Letâs say youâve created a map service that has a URI scheme similar to OpenStreetMap.
There are a few valuable pieces of statistics that you can get from analyzing the log files accumulated by such service. For example, you could easily determine which areas of the map get most requests, thus enabling you to tune your caching. If you are using subdomains for faster loading of your tiles from JavaScript you might check if the request distribution is equally spread among such subdomains.
Letâs say your tile server has Nginx serving the tiles and it is running with a slightly customize access log format. Here is a made-up example of a line in this log:
2013-08-01T:07:51:23+0100 123.65.150.10 GET a.tiles.example.com /16/1213/721.png 200 1013 4512 "http://.example.com/mapview" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.25 Safari/534.3"
First test
So, we have the following information in our log files: timestamp of request, clientâs IP address, HTTP method used, host name, URI, status code, response time in microseconds, size of response in bytes, referer and user agent. Letâs write our test for this exact line. For now we just want to use plain Clojure. We will be also using Midje library, which integrates with Cascalog easily. We can just jump into Clojure REPL and play around with writing our test:
(def line "2013-08-01T07:51:23+0100 123.65.150.10 GET a.tiles.example.com /16/1213/721.png 200 1013 4512 \"http://map.example.com\" \"Mozilla/5.0\"") (fact (parse-line line) => [1375339883 "123.65.150.10" "GET" "a.tiles.example.com" "/16/1213/721.png" 200 1013 4512 "http://map.example.com" "Mozilla/5.0"])
This is plain Midje and plain Clojure, we donât actually do anything with Cascalog. However, this example will later evolve into Midje-Cascalog based test.
Hereâs a short excerpt of the code that would be responsible for parsing the line in a way that satisfies the test:
(ns ... (:require ... [clojure.data.csv :as csv] [clj-time.format :as time.format] [clj-time.core :as time.core] [clj-time.coerce :as time.coerce] ...) ...) (defn set-timestamp [entry] (let [time-format (time.format/formatter "yyyy-MM-dd'T'HH:mm:ssZ")] (->> (entry :timestamp) (time.format/parse time-format) to-unixtime (assoc entry :timestamp)))) (defn parse-line [line] (let [fields [:timestamp :ip :http-method :host :uri :response-code :response-time :response-size :referer :user-agent] entry (first (csv/read-csv line :separator \space))] (-> (zipmap fields entry) set-timestamp ... (mapv fields))))
Midje-Cascalog
This seems simple enough, weâre simply treating each line as a map, run a few transformations on this map and then return a vector of fields in the correct order. Now we can actually write the test against Cascalog code:
(ns ...-test (:require ... [midje.cascalog :refer [produces]])) (fact (parse-logs [[line]]) => (produces [[1375339883 "123.65.150.10" "GET" "a.tiles.example.com" 16 1213 721 200 1013 4512 "http://map.example.com" "Mozilla/5.0"]]))
As we can see, not much has changed. In fact, we had to change very little to accommodate for Cascalog: we pass vector of vectors to the query (parse-logs), the right part of the fact is wrapped in produces call and the response is also a vector of vectors. We can also see that the actual expected output changed a bit, instead of getting URI /16/123/721.png we are actually getting three integers [16 123 721]. Letâs take a look at the corresponding Cascalog query:
(defn parse-logs [source] (<- [?timestamp ?ip ?http-method ?host ?zoom ?x ?y ?response-code ?response-time ?response-size ?referer ?user-agent] (source ?line) (parse-line ?line :> ?timestamp ?ip ?http-method ?host ?uri ?response-code ?response-time ?response-size ?referer ?user-agent) (parse-uri ?uri :> ?zoom ?x ?y)))
Simple query composition
Of course, in the real world youâd have some checks interspersed, but for the purpose of this blog post weâre omitting those. This query can be our building block for future queries. For example, if we want to know during which hours our servers gets most requests, we could write the following query:
(defn most-popular-hours [source] (<- [?hour-of-day ?hits] ((select-fields source ["?timestamp"]) ?timestamp) (get-hour-of-day ?timestamp :> ?hour-of-day) (cascalog.ops/count :> ?hits)))
And now our test âfact" becomes quite simply this:
(fact (most-popular-hours (parse-logs [[line]])) => (produces [[6 1]]))
As can be seen from these examples, writing tests with Midje-Cascalog is quite an enjoyable experience.
Conclusion
The test cases we shown so far have been quite short. In fact, it makes little to no sense to run your tests using only a handful of log entries. More often youâll extract a few thousand lines of relevant data, make sure there are a few broken entries and test against that. However, writing Cascalog tests doesnât become a harder task even then, the only part that you need to take care of is loading of resources, your tests however will retain their structure.
Midje-Cascalog is a fantastic tool and is certainly a great selling point when considering Cascalog as an option for handling your MapReduce tasks. And since itâs still Clojure, itâs extremely easy to compose your queries and your facts. In our experience, using Cascalog allows to spend more time on tackling the problem at hand and trying different approaches when compared to writing pure MapReduce jobs using Hadoop. And since Cascalog is still using Hadoop to actually execute the jobs, weâre getting the best of both worlds â interactivity and ease of development with Clojure and stability and scalability of Hadoop platform.
Further reading
Here is a wonderful introduction into the world of Cascalog testing. If youâd like to familiarize yourself with Cascalog a bit, I suggest you go through "Cascalog for the Impatient" tutorial.











