Bruno Bonacci makes some very good points about why using a single, coherent solution to manipulate data results in higher productivity, comparing it with what Pig and Hive require:
In languages like Pig and Hive, in order to perform complex manipulation of your data you have to write User Defined Functions (UDFs). UDFs are a great way to extend the basic functionality; however, for Hive and Pig you have to use a different language to write your UDFs, as the basic SQL or Pig Latin languages have only a handful of functions and lack basic control structures. Both offer the possibility to write UDFs in a number of different languages (which is great), but this requires a programming paradigm switch by the developer. Pig allows you to write UDFs in Java, Jython, JavaScript, Groovy, Ruby and Python; for Hive you need to write them in Java (good article here).

I won’t use UDFs in Java as the example, as the comparison wouldn’t be fair (life is too short to write them in Java), but let’s assume that you want to write a UDF for Pig and you want to use Python. If you go for the JVM platform version (Jython) you won’t be able to use existing modules coming from the Python ecosystem (unless they are in pure Python). The same goes for Ruby and JavaScript. If you decide to use Python you will have the setup burden of installing Python and all the modules that you intend to use on every Hadoop task node.

So, you start with a language such as Pig Latin or SQL; you have to write, compile and bundle UDFs in a different language; you are constrained to use only the plain language without importing modules, or face the extra burden of additional setup; and, as if that were not enough, you have to smooth over the type differences between the two languages as data moves back and forth to the UDF. For me that’s enough to say that we can do better than that.

Cascalog is a Clojure DSL, so your main language is Clojure, your custom functions are Clojure, the data are represented in Clojure data types, and the runtime is the JVM: no switch required, no additional compilation required, no installation burden, and you can use all available libraries in the JVM ecosystem.
I’m not a big fan of SQL, except in the cases where it really belongs; SQL-on-Hadoop is my least favorite topic, except perhaps for the whole complexity of the ecosystem. In the space of multi-format/unstructured data I’ve always liked the pragmatism and legibility of Pig. But the OP is definitely right about the added complexity.
This also reminded me of the Python vs R “war”.
Welcome to another issue of Clojure weekly, my small routine blog contribution to the Clojure sphere! These are just a few links, normally 4 or 5 URLs, pointing at articles, documentation, screencasts, podcasts or anything else that attracts my attention. I add a small comment so you can decide if you want to look at the whole thing or not. That’s it, enjoy!
Screen6: Introduction to testing Cascalog with Midje. Here's a quick introduction to all you need to work test-first with Cascalog. Cascalog comes with integrated testing capabilities, which makes it very attractive for designing Cascalog queries incrementally. This post shows how you can use Midje instead of clojure.test. Sure, you can always try queries at the REPL, but having a bit of confidence when you evolve the code and the data is a nice thing to have.
Why I'm Productive in Clojure There are a couple of good points in this post about what differentiates a good language from a not so good one. The final goal while developing is expressing the problem domain in a language that can then be sent down to the hardware. We all know this can be done in assembler or with punch cards as well. A good language will let you express the problem domain with minimal effort, without forcing you into unrelated constructs. We normally call those unrelated constructs "boilerplate" because they tend to be repeated everywhere. The other good point is about the richness of tools in a language. A language can offer many ways to solve a problem, but the important thing is whether there is a common denominator or a guiding principle. If guiding principles permeate the language, then the overhead for a developer of keeping in mind all the tools the language offers to solve the problem is minimal.
clojure/tools.trace tools.trace contains a set of utilities for better debugging and tracing of functions. One I especially like is the possibility to replace a "defn" with "deftrace", after which each invocation of that function prints its input parameters and output results. For recursive functions it prints each invocation at a different indentation level. Another interesting macro is "trace-forms", which, when an error occurs in code that refers back to let bindings, shows the failing form as it is seen once all the bindings have been replaced. Too difficult to explain, better if you just try it out.
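To give a feel for the "deftrace" behaviour described above without a Clojure REPL at hand, here is a rough Python analogue. It is only a sketch of the idea (print arguments and results, indenting nested calls), not the tools.trace implementation; the decorator name `deftrace` is borrowed for illustration and the output format is an assumption.

```python
import functools


def deftrace(fn):
    """Wrap fn so every call prints its arguments and result, indented by call depth."""
    depth = 0  # current nesting level, shared by all calls to this function

    @functools.wraps(fn)
    def wrapper(*args):
        nonlocal depth
        indent = "| " * depth
        shown_args = ", ".join(repr(arg) for arg in args)
        print(f"{indent}TRACE {fn.__name__}({shown_args})")
        depth += 1
        try:
            result = fn(*args)
        finally:
            depth -= 1  # restore the depth even if fn raises
        print(f"{indent}=> {result!r}")
        return result

    return wrapper


@deftrace
def factorial(n):
    # The recursive call goes through the traced wrapper, so each level is indented.
    return 1 if n <= 1 else n * factorial(n - 1)


factorial(3)
# TRACE factorial(3)
# | TRACE factorial(2)
# | | TRACE factorial(1)
# | | => 1
# | => 2
# => 6
```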
Luminus - A Clojure Web Framework I recently discovered Luminus. I had a look at the documentation and quickly tried building a project. If only I had this when I was working on the https://github.com/reborg/iad server! Luminus is the closest thing to Rails I have found so far. It is heavily opinionated and hence very fast for the kind of applications it is aimed at: the classic web interface on top of database persistence. The author of Luminus is also the author of "Web Development with Clojure" below.
The Pragmatic Bookshelf | Web Development with Clojure A new beta book became available on the PragProg.com bookshelf: "Web Development with Clojure" by Dmitri Sotnikov is the first Clojure book dedicated specifically to web development that I know of. Dmitri is the author of the Luminus framework and many other Clojure contributions; see https://github.com/yogthos/luminus
Here at Screen6 we have been using Cascalog to write our Hadoop jobs in Clojure. Even though we could have gone with plain Clojure, Cascalog gives us the ability to write MapReduce jobs in a fast and concise way. Another big bonus that comes from using Cascalog is the ease of testing. This blog post covers a simple use case for Cascalog and shows a few simple test cases that we can write against Cascalog queries.
Testing: Hadoop way
Before we dive into testing Cascalog jobs, a short detour into testing approaches in Hadoop is in order.
The accepted way of testing MapReduce jobs in Hadoop is to combine unit tests with occasional runs in a local cluster. There are multiple issues with this approach. First of all, you’re going to double the work you’re required to do in order to mark the build as “locally tested”. Second, it’s not easy to debug failures in a Hadoop cluster. Sure, you can go through stack traces, but getting your debugger to work in such an environment is quite cumbersome.
There is an alternative to this — Apache MRUnit. It’s a superior approach to running a local cluster, however it still suffers from being extremely low level when compared to Midje-Cascalog or even Cascading’s built-in testing utilities. Usability of MRUnit is also undermined by the fact that most Hadoop jobs are rarely split into composable parts, often having lots of code repetition across similar tasks.
Overall it feels that the Hadoop community often treats testing as a second-class citizen, and this sentiment truly hurts adoption of best testing practices.
Cascalog
A common use case for writing MapReduce jobs is to parse billions of log entries to produce a sensible view into patterns of data access, load distribution, etc. It’s also a task we have to solve at Screen6. In fact, it’s such a common problem that many MapReduce frameworks have been created in the past years. The most prominent of them is Hadoop, first developed at Yahoo in 2005 and since gradually adopted as the go-to framework for data analysis.
However, it is worth noting that despite being the de facto data science tool, Hadoop was never considered “easy” when it came to developing. It had and still has a steep learning curve, which led to the creation of frameworks that try to provide a higher level of abstraction when working with huge arrays of data. One of the most popular of these frameworks is Cascading, which provides a more declarative way of describing your data processing pipeline. Cascalog stands even higher on the abstraction ladder, implementing Datalog, a truly declarative language, on top of the existing Cascading library. Unlike many other libraries and frameworks that raise the abstraction level while dropping the ability to go one level deeper, Cascalog lets you resort to writing Cascading when required. In our experience that’s rarely needed, and in fact our code doesn’t contain any “abstraction breaking”.
MapReduce with Cascalog
Problem statement
Before we can start with testing, we’ll lay out a problem that we can solve with Cascalog. A common task for us at Screen6 is extracting useful information out of log files. It’s neither a unique technical challenge, nor is it inherently exciting, but it’s such a common thing to do that it’s always nice to have a way to minimize the amount of code you need to write to accomplish it.
Since our data pipeline is quite complex and just explaining it would take the space of a few blog posts, we’ll devise a simpler example of such usage. Let’s say you’ve created a map service that has a URI scheme similar to OpenStreetMap.
There are a few valuable pieces of statistics that you can get from analyzing the log files accumulated by such a service. For example, you could easily determine which areas of the map get the most requests, thus enabling you to tune your caching. If you are using subdomains for faster loading of your tiles from JavaScript, you might check whether the requests are spread equally among those subdomains.
Let’s say your tile server has Nginx serving the tiles and it is running with a slightly customized access log format. Here is a made-up example of a line in this log:
2013-08-01T:07:51:23+0100 123.65.150.10 GET a.tiles.example.com /16/1213/721.png 200 1013 4512 "http://.example.com/mapview" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.25 Safari/534.3"
First test
So, we have the following information in our log files: timestamp of request, client’s IP address, HTTP method used, host name, URI, status code, response time in microseconds, size of response in bytes, referer and user agent. Let’s write our test for this exact line. For now we just want to use plain Clojure. We will also be using the Midje library, which integrates with Cascalog easily. We can just jump into the Clojure REPL and play around with writing our test:
This is plain Midje and plain Clojure; we don’t actually do anything with Cascalog yet. However, this example will later evolve into a Midje-Cascalog based test.
Here’s a short excerpt of the code that would be responsible for parsing the line in a way that satisfies the test:
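The same idea can be sketched outside Clojure. The Python version below is only an illustration and not the code from the original post: the function name `parse_line`, the field names, and the choice of `shlex.split` to honour the quoted referer and user agent are all assumptions.

```python
import shlex

# Field names, in the order they appear in the log line (names are illustrative).
FIELDS = ["timestamp", "ip", "method", "host", "uri", "status",
          "response_time_us", "bytes", "referer", "user_agent"]

# The made-up log line from the text above.
LINE = ('2013-08-01T:07:51:23+0100 123.65.150.10 GET a.tiles.example.com '
        '/16/1213/721.png 200 1013 4512 "http://.example.com/mapview" '
        '"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) '
        'AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.25 Safari/534.3"')


def parse_line(line):
    """Turn one access-log line into a list of fields in log order."""
    tokens = shlex.split(line)  # splits on whitespace, keeps quoted strings whole
    if len(tokens) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(tokens)}")
    record = dict(zip(FIELDS, tokens))  # treat the line as a map first
    for key in ("status", "response_time_us", "bytes"):
        record[key] = int(record[key])  # numeric fields become integers
    return [record[name] for name in FIELDS]  # back to a vector in a fixed order


# A plain test against this exact line.
parsed = parse_line(LINE)
assert parsed[:5] == ["2013-08-01T:07:51:23+0100", "123.65.150.10", "GET",
                      "a.tiles.example.com", "/16/1213/721.png"]
assert parsed[5:8] == [200, 1013, 4512]
```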
This seems simple enough: we’re simply treating each line as a map, running a few transformations on this map and then returning a vector of fields in the correct order. Now we can actually write the test against Cascalog code:
As we can see, not much has changed. In fact, we had to change very little to accommodate Cascalog: we pass a vector of vectors to the query (parse-logs), the right-hand side of the fact is wrapped in a produces call, and the response is also a vector of vectors. We can also see that the expected output changed a bit: instead of getting the URI /16/123/721.png we are actually getting three integers [16 123 721]. Let’s take a look at the corresponding Cascalog query:
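The shape of that change, rows in and rows out with the tile URI split into integers, can be illustrated with a small Python sketch. This is not the Cascalog query from the post; `tile_coordinates` and `tile_rows` are hypothetical names, and the sketch assumes the OpenStreetMap-style /zoom/x/y.png layout mentioned earlier.

```python
import re

# OpenStreetMap-style tile path: /zoom/x/y.png
TILE_URI = re.compile(r"^/(\d+)/(\d+)/(\d+)\.png$")


def tile_coordinates(uri):
    """Split a tile URI into [zoom, x, y] integers."""
    match = TILE_URI.match(uri)
    if match is None:
        raise ValueError(f"not a tile URI: {uri!r}")
    return [int(group) for group in match.groups()]


def tile_rows(rows):
    """Take a list of one-field rows (each holding a URI), return rows of integers."""
    return [tile_coordinates(uri) for (uri,) in rows]


# Input and expected output are both lists of rows, mirroring the vector of vectors.
assert tile_rows([["/16/1213/721.png"]]) == [[16, 1213, 721]]
```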
Of course, in the real world you’d have some checks interspersed, but for the purpose of this blog post we’re omitting those. This query can be our building block for future queries. For example, if we want to know during which hours our servers get the most requests, we could write the following query:
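As a rough picture of what such an aggregation computes, here is a Python sketch that counts requests per hour. It stands in for the idea only and is not the Cascalog query; the function name `requests_per_hour` is hypothetical, and the regular expression assumes the timestamp format of the made-up log line above.

```python
import re
from collections import Counter

# Matches the hour in timestamps such as 2013-08-01T:07:51:23+0100
# (the colon after T is optional, so plain ISO-style timestamps work too).
HOUR = re.compile(r"T:?(\d{2}):")


def requests_per_hour(timestamps):
    """Count requests per hour of day; returns sorted (hour, count) pairs."""
    hours = Counter()
    for timestamp in timestamps:
        match = HOUR.search(timestamp)
        if match:  # silently skip entries without a recognisable hour
            hours[int(match.group(1))] += 1
    return sorted(hours.items())


sample = [
    "2013-08-01T:07:51:23+0100",
    "2013-08-01T:07:59:02+0100",
    "2013-08-01T:13:05:40+0100",
]
assert requests_per_hour(sample) == [(7, 2), (13, 1)]
```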
As can be seen from these examples, writing tests with Midje-Cascalog is quite an enjoyable experience.
Conclusion
The test cases we have shown so far have been quite short. In fact, it makes little to no sense to run your tests using only a handful of log entries. More often you’ll extract a few thousand lines of relevant data, make sure there are a few broken entries and test against that. However, writing Cascalog tests doesn’t become harder even then: the only part you need to take care of is the loading of resources, while your tests retain their structure.
Midje-Cascalog is a fantastic tool and is certainly a great selling point when considering Cascalog as an option for handling your MapReduce tasks. And since it’s still Clojure, it’s extremely easy to compose your queries and your facts. In our experience, using Cascalog allows us to spend more time on tackling the problem at hand and trying different approaches, compared to writing pure MapReduce jobs using Hadoop. And since Cascalog still uses Hadoop to actually execute the jobs, we’re getting the best of both worlds: the interactivity and ease of development of Clojure, and the stability and scalability of the Hadoop platform.
Further reading
Here is a wonderful introduction to the world of Cascalog testing. If you’d like to familiarize yourself with Cascalog a bit, I suggest you go through the "Cascalog for the Impatient" tutorial.
Hello weekly user! After a couple of weeks of vacation, Clojure weekly is back! Here you can find a few Clojure related links pointing at articles, documentation, screencasts, podcasts or anything else that attracts my attention. I add a small comment so you can decide if you want to look at the whole thing or not. That’s it, enjoy!
cascalog/midje-cascalog at develop · nathanmarz/cascalog Midje Cascalog used to be a standalone project on GitHub that now has been deprecated and incorporated into the main Cascalog project. This is already by definition a success! Cascalog is fun to use and probably for this reason it's easy to create complicated queries that become pretty soon difficult to maintain. Consider also that Cascalog errors tend to be difficult to troubleshoot. One strategy is to build complexity incrementally and extract functions with a specific small scope. Midje cascalog helps exactly with that, but with the added bonus of covering each specific piece of behaviour with some tests, incrementally confirming assumptions about the result of subqueries.
Riemann - Dashboards Riemann is a monitoring and event processing framework. The main idea is to handle any sort of application event (logging, performance, exceptions and so on). Riemann is similar to New Relic, although it is definitely not as feature-complete, multi-platform and cross-language as New Relic is. After all, Riemann is a Clojure product with a Clojure DSL, so it has a very specific, narrow focus. The link points at the out-of-the-box dashboard that comes with Riemann, which is probably good enough for a lot of installations.
Skills Matter : In The Brain of Uncle Bob (Robert C. Martin) Fascinating retrospective of Uncle Bob talking about the pillars of object orientation: encapsulation, inheritance and polymorphism. Uncle Bob's claim at the end is that OOP is more about enabling easy polymorphism than anything else. He recognizes that concepts like encapsulation or inheritance are possible in the C language, and that something like polymorphism already existed in assembly language as a technique to avoid the expensive release of boards containing a sequential program in a series of 32K EPROMs. But ease of polymorphism has a tremendous impact: it enables dependency inversion, one of the most popular techniques for directing dependencies in a single direction, which ultimately enables maintainability of software. He didn't talk about functional programming in this context, so the following are my thoughts. Encapsulation in FP has a simpler meaning: it mainly means cohesion of related functions, which is achieved with namespaces (Clojure) and functors (ML and close derivatives like Scala). Inheritance in FP is also possible in terms of relations between namespaces (not involving types). Polymorphism is also heavily present in Clojure: multimethods and protocols are the enabling techniques. Polymorphism in FP has the same result: it can be used to invert dependencies and increase maintainability. That is maybe why FP is an equally powerful paradigm for modern programming.
clojure-cookbook/CONTRIBUTING.md at master · clojure-cookbook/clojure-cookbook · GitHub This is a neat idea. A cook-book format is always welcome in my book collection, whatever the language. Having one for Clojure is definitely good news. In this case the idea is even better and recipes are contributed by the community for the community. The current list of proposed chapters doesn't contain "executing code in parallel" but that might change in the future (I've got a few ideas about it). Looking forward to the final book!
An intro to Scalding, Twitter’s Scala API for Cascading, by Dean Wampler:
“There’s no better way to write general-purpose Hadoop MapReduce programs when specialized tools like Hive and Pig aren’t quite what you need.”
Watch the video and slides below.
An Overview of Scalding
Scalding for Hadoop
✚ At Twitter, the creators of Scalding, different teams use different libraries for dealing with different scenarios.
✚ Dean Wampler is the co-author of the Programming Scala book so his preference for Scala is understandable.
✚ Do you know any other teams or companies using Scalding instead of Cascading or Cascalog?
This is the only interesting paragraph from InfoWorld’s article “Twitter’s programmers speed Hadoop development”:
Three Twitter teams are using Cascading in combination with programming languages: The revenue team uses Scala, the publisher analytics team uses Clojure, and the analytics team uses Jython.
Each of these combinations led to new projects:
Scala + Cascading => Scalding
Clojure + Cascading => Cascalog
Jython + Cascading => PyCascading
An interesting question I couldn’t answer is why each team prefers a different language. My hypothesis:
Scala with its strong typing for handling clear models generating numbers that must always be correct.
A good explanation of why Cascading, Cascalog, and other frameworks hiding away the details of MapReduce are making things easier for non-programmers:
Data scientists at The Climate Corporation chose to create their algorithms in Cascalog, which is a high-level Clojure-based machine learning language built on Cascading. Cascading is an advanced Java application framework that abstracts the MapReduce APIs in Apache Hadoop and provides developers with a simplified way to create powerful data processing workflows. Programming in Cascalog, data scientists create compact expressions that represent complex batch-oriented AI and machine learning workflows. This results in improved productivity for the data scientists, many of whom are mathematicians rather than computer scientists. It also gives them the ability to quickly analyze complex data sets without having to create large complicated programs in MapReduce. Furthermore, programmers at The Climate Corporation also use Cascading directly for creating jobs inside Hadoop streaming to process additional batch-oriented data workflows.
A brief but very clear explanation of the benefits of using Cascalog-checkpoints by Paul Lam:
Building Cascading/Cascalog queries can be visualised as assembling pipes to connect a flow of data. Imagine that you have Flow A and B. Flow B uses the result from A along with other bits. Thus, Flow B is dependent on A. Typically, if a MapReduce job fails for whatever reason, you simply fix what’s wrong and start the job all over again. But what if Flow A takes hours to run (which is common for a MR job) and the error happened in Flow B? Why re-do all that processing for Flow A if we know that it finished successfully?
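The general checkpointing idea behind this can be sketched in a few lines of Python. This is not the Cascalog-checkpoint API, only a minimal illustration under stated assumptions: each step's result is small enough to store as JSON in a working directory, and the names `checkpointed`, `flow_a`, `flow_b` and `run` are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def checkpointed(name, workdir, compute):
    """Run compute() once; later runs reuse the result stored under workdir."""
    marker = Path(workdir) / f"{name}.json"
    if marker.exists():
        return json.loads(marker.read_text())  # step already finished, skip it
    result = compute()  # if this raises, no marker is written
    marker.write_text(json.dumps(result))
    return result


calls = {"flow_a": 0}  # counts how often the expensive step really runs


def flow_a():
    calls["flow_a"] += 1
    return [1, 2, 3]


def flow_b(a_result, fail):
    if fail:
        raise RuntimeError("flow B failed")
    return sum(a_result)


def run(workdir, fail_b):
    a_result = checkpointed("flow_a", workdir, flow_a)
    return checkpointed("flow_b", workdir, lambda: flow_b(a_result, fail_b))


workdir = tempfile.mkdtemp()
try:
    run(workdir, fail_b=True)  # first attempt: Flow A succeeds, Flow B fails
except RuntimeError:
    pass
total = run(workdir, fail_b=False)  # second attempt: Flow A is not recomputed
assert total == 6 and calls["flow_a"] == 1
```

A real workflow would also need to clean up the checkpoint directory after a fully successful run, so that the next run starts from fresh data.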