Top Posts Tagged with #cascalog

Bruno Bonacci makes some very good points about why a single, coherent solution for manipulating data results in higher productivity, comparing it with what Pig and Hive require:

In languages like Pig and Hive, in order to do complex manipulation of your data you have to write User Defined Functions (UDFs). UDFs are a great way to extend the basic functionality; however, for Hive and Pig you have to use a different language to write your UDFs, as the basic SQL or Pig Latin languages have only a handful of functions and they lack basic control structures. Both offer the possibility of writing UDFs in a number of different languages (which is great); however, this requires a programming paradigm switch by the developer. Pig allows you to write UDFs in Java, Jython, JavaScript, Groovy, Ruby and Python; for Hive you need to write them in Java (good article here). I won’t give an example of UDFs in Java as the comparison wouldn’t be fair, life is too short to write them in Java, but let’s assume that you want to write a UDF for Pig and you want to use Python. If you go for the JVM platform version (Jython) you won’t be able to use existing modules coming from the Python ecosystem (unless they are in pure Python). The same goes for Ruby and JavaScript. If you decide to use Python you will have the setup burden of installing Python and all the modules that you intend to use on every Hadoop task node. So, you start with a language such as Pig Latin or SQL, you have to write, compile and bundle UDFs in a different language, you are constrained to use only the plain language without importing modules or face the extra burden of additional setup and, as if that were not enough, you have to smooth over the type differences between the two languages as data moves back and forth with the UDF. For me that’s enough to say that we can do better than that. Cascalog is a Clojure DSL, so your main language is Clojure, your custom functions are Clojure, the data are represented in Clojure data types, and the runtime is the JVM: no switch required, no additional compilation required, no installation burden, and you can use all available libraries in the JVM ecosystem.

I’m not a big fan of SQL, except in the cases where it really belongs; SQL-on-Hadoop is my least favorite topic, with the possible exception of the whole complexity of the ecosystem. In the space of multi-format/unstructured data I’ve always liked the pragmatism and legibility of Pig. But the OP is definitely right about the added complexity.

This also reminded me about the Python vs R “war”.

Original title and link: Complex data manipulation in Cascalog, Pig, and Hive (NoSQL database©myNoSQL)

#Cascalog #Pig #Hive #Hadoop #BigData

Clojure Weekly, Sep 5th, 2013

Welcome to another issue of Clojure weekly, my small routine blog contribution to the Clojure sphere! These are just a few links, normally 4/5 urls, pointing at articles, documentation, screencasts, podcasts or anything else that attracts my attention. I add a small comment so you can decide if you want to look at the whole thing or not. That’s it, enjoy!

Screen6 — Introduction to testing Cascalog with Midje. Here's a quick introduction to all you need to test first with Cascalog. Cascalog comes with integrated testing capabilities which makes it very attractive to incrementally design cascalog queries. This post shows you how you can use Midje instead of core test. Sure you can always REPL them but it sounds like having a bit of confidence when you evolve the code and the data is a nice thing to have.

Why I'm Productive in Clojure There are a couple of good points in this post about what differentiates a good language from a not so good one. The final goal while developing is to express the problem domain in a language that can then be sent down to the hardware. We all know this can be done in assembler or with punch cards as well. A good language will let you express the problem domain with minimal effort, without forcing you into unrelated constructs. We normally call those unrelated constructs "boilerplate" because they tend to be repeated everywhere. The other good point is about the richness of tools in a language. A language can offer many ways to solve a problem, but the important thing is whether there is a common denominator or a guiding principle. If guiding principles permeate the language, then the overhead for a developer of keeping in mind all the tools the language offers to solve the problem is minimal.

clojure/tools.trace tools.trace contains a set of utilities for better debugging and tracing of functions. One I especially like is the possibility to replace a "defn" with "deftrace", so that each invocation of that function prints its input parameters and output results. Recursive invocations are each printed at a different indentation level. Another interesting macro is "trace-forms", which, in case of an error involving let bindings that refer back to earlier ones, shows the problem as it is seen after replacing all bindings. Too difficult to explain, better if you just try it out.

Luminus - A Clojure Web Framework I recently discovered Luminus. I had a look at the documentation and quickly tried building a project. If only I had this when I was working on the https://github.com/reborg/iad server! Luminus is the closest thing to Rails I have found so far. It is heavily opinionated and hence very fast for the kind of applications it is aimed at: the classic web interface on top of a database. The author of Luminus is also the author of "Web Development with Clojure" below.

The Pragmatic Bookshelf | Web Development with Clojure A new beta book has become available on the PragProg.com bookshelf. "Web Development with Clojure" by Dmitri Sotnikov is the first Clojure book dedicated specifically to web development that I know of. Dmitri is the author of the Luminus framework and many other Clojure contributions at https://github.com/yogthos/luminus

#clojure #luminus #cascalog #book #productivity

Introduction to testing Cascalog with Midje.

Here at Screen6 we have been using Cascalog to write our Hadoop jobs in Clojure. Even though we could have gone with plain Clojure, Cascalog gives us the ability to write MapReduce jobs in a fast and concise way. Another big bonus that comes from using Cascalog is the ease of testing. This blog post covers a simple use case for Cascalog and shows a few simple test cases that we can write against Cascalog queries.

Testing: Hadoop way

Before we dive into testing Cascalog jobs, a short detour into testing approaches in Hadoop is in order.

The accepted way of testing MapReduce jobs in Hadoop is to combine unit tests with occasional runs in a local cluster. There are multiple issues with this approach. First of all, you’re going to double the work you’re required to do in order to mark the build as “locally tested”. Second of all, it’s not easy to debug failures in a Hadoop cluster. Sure, you can go through stack traces, but getting your debugger to work in such an environment is quite cumbersome.

#testing #cascalog #clojure

Clojure Weekly, July 2nd, 2013

Hello weekly user! After a couple of weeks of vacation, Clojure weekly is back! Here you can find a few Clojure related links pointing at articles, documentation, screencasts, podcasts or anything else that attracts my attention. I add a small comment so you can decide if you want to look at the whole thing or not. That’s it, enjoy!

cascalog/midje-cascalog at develop · nathanmarz/cascalog Midje Cascalog used to be a standalone project on GitHub that has now been deprecated and incorporated into the main Cascalog project. This is already by definition a success! Cascalog is fun to use, and probably for this reason it's easy to create complicated queries that soon become difficult to maintain. Consider also that Cascalog errors tend to be difficult to troubleshoot. One strategy is to build complexity incrementally and extract functions with a specific small scope. Midje Cascalog helps exactly with that, with the added bonus of covering each specific piece of behaviour with some tests, incrementally confirming assumptions about the result of subqueries.

Riemann - Dashboards Riemann is a monitoring and event processing framework. The main idea is to handle any sort of application event (logging, performance, exceptions and so on). Riemann is similar to New Relic, although it is definitely not as feature complete, multi-platform and cross-language as New Relic is. After all, Riemann is a Clojure product with a Clojure DSL, so it has a very specific, narrow focus. The link points at the out-of-the-box dashboard that comes with Riemann, which is probably good enough for a lot of installations.

Skills Matter : In The Brain of Uncle Bob (Robert C. Martin) Fascinating retrospective by Uncle Bob talking about the pillars of object orientation: encapsulation, inheritance and polymorphism. Uncle Bob's claim at the end is that OOP is more about enabling easy polymorphism than anything else. He recognizes that concepts like encapsulation and inheritance are possible in the C language, and that something like polymorphism already existed in assembly language as a technique to avoid the expensive release of boards containing a sequential program in a series of 32K EPROMs. But easy polymorphism has a tremendous impact: it enables dependency inversion, one of the most popular techniques for directing dependencies in a single direction, which ultimately enables maintainability of software. He didn't talk about functional programming in this context, so the following are my thoughts. Encapsulation in FP has a simpler meaning: it mainly means cohesion of related functions, which is achieved with namespaces (Clojure) and functors (ML and close derivatives like Scala). Inheritance in FP is also possible in terms of relations between namespaces (not involving types). Polymorphism is also heavily present in Clojure: multimethods and protocols are the enabling techniques. Polymorphism in FP has the same result: it can be used to invert dependencies and increase maintainability. That is maybe why FP is an equally powerful paradigm for modern programming.

clojure-cookbook/CONTRIBUTING.md at master · clojure-cookbook/clojure-cookbook · GitHub This is a neat idea. A cook-book format is always welcome in my book collection, whatever the language. Having one for Clojure is definitely good news. In this case the idea is even better and recipes are contributed by the community for the community. The current list of proposed chapters doesn't contain "executing code in parallel" but that might change in the future (I've got a few ideas about it). Looking forward to the final book!

#clojure #midje #cascalog #riemann #monitoring #oop #unclebob #cookbook

An Overview of Scalding

An intro to Scalding [1], Twitter’s Scala API for Cascading, by Dean Wampler [2]:

“There’s no better way to write general-purpose Hadoop MapReduce programs when specialized tools like Hive and Pig aren’t quite what you need.”

Watch the video and slides below.

An Overview of Scalding

Scalding for Hadoop

✚ At Twitter, the creators of Scalding, different teams use different libraries for dealing with different scenarios.

✚ Dean Wampler is the co-author of the Programming Scala book so his preference for Scala is understandable.

✚ Do you know any other teams or companies using Scalding instead of Cascading or Cascalog?

[1] Scalding

[2] Dean Wampler: Principal Consultant at Think Big Analytics

Original title and link: An Overview of Scalding (NoSQL database©myNoSQL)

#Scalding #cascading #cascalog #twitter

This is the only interesting paragraph from InfoWorld’s article “Twitter’s programmers speed Hadoop development”:

Three Twitter teams are using Cascading in combination with programming languages: The revenue team uses Scala, the publisher analytics team uses Clojure, and the analytics team uses Jython.

Each of these combinations led to new projects:

Scala + Cascading => Scalding

Clojure + Cascading => Cascalog

Jython + Cascading => PyCascading

An interesting question I couldn’t answer is why each team prefers a different language. My hypothesis:

Scala, with its strong typing, for handling well-defined models that generate numbers that must always be correct.

Clojure for designing new analysis models.

Jython for quick experimentation with data.

Your thoughts?

Original title and link: Twitter and Their Cascading Libraries for Dealing With Different Scenarios (NoSQL database©myNoSQL)

#Cascading #Scalding #Cascalog #PyCascading #Twitter

A good explanation of why Cascading, Cascalog, and other frameworks hiding away the details of MapReduce are making things easier for non-programmers:

Data scientists at The Climate Corporation chose to create their algorithms in Cascalog, which is a high-level Clojure-based machine learning language built on Cascading. Cascading is an advanced Java application framework that abstracts the MapReduce APIs in Apache Hadoop and provides developers with a simplified way to create powerful data processing workflows. Programming in Cascalog, data scientists create compact expressions that represent complex batch-oriented AI and machine learning workflows. This results in improved productivity for the data scientists, many of whom are mathematicians rather than computer scientists. It also gives them the ability to quickly analyze complex data sets without having to create large complicated programs in MapReduce. Furthermore, programmers at The Climate Corporation also use Cascading directly for creating jobs inside Hadoop streaming to process additional batch-oriented data workflows.

Original title and link: Cascalog and Cascading: Productivity Solutions for Data Scientists (NoSQL database©myNoSQL)

#Cascading #Cascalog #Hadoop #MapReduce #Clojure #Java #BigData

A brief but very clear explanation of the benefits of using Cascalog-checkpoints by Paul Lam:

Building Cascading/Cascalog queries can be visualised as assembling pipes to connect a flow of data. Imagine that you have Flow A and B. Flow B uses the result from A along with other bits. Thus, Flow B is dependent on A. Typically, if a MapReduce job fails for whatever reason, you simply fix what’s wrong and start the job all over again. But what if Flow A takes hours to run (which is common for an MR job) and the error happened in Flow B? Why redo all that processing for Flow A if we know that it finished successfully?
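
The idea in the quote can be illustrated with a small language-neutral sketch. This is plain Python, not the Cascalog-checkpoint API: the helper names (`run_step`, `run_workflow`), the step names and the JSON files used as checkpoints are all assumptions made up for illustration. Each step saves its result once it succeeds, so a rerun after a failure in Flow B reuses the saved output of Flow A instead of recomputing it.

```python
import json
import tempfile
from pathlib import Path


def run_step(name, compute, checkpoint_dir):
    """Run `compute` unless a saved result for step `name` already exists.

    Hypothetical helper: a JSON file per step stands in for a real checkpoint.
    """
    saved = Path(checkpoint_dir) / f"{name}.json"
    if saved.exists():
        # A previous run finished this step, so reuse its output.
        return json.loads(saved.read_text())
    result = compute()
    # The checkpoint is written only after the step has succeeded.
    saved.write_text(json.dumps(result))
    return result


def run_workflow(checkpoint_dir, flow_a, flow_b):
    """Flow B depends on the result of Flow A."""
    a = run_step("flow-a", flow_a, checkpoint_dir)
    return run_step("flow-b", lambda: flow_b(a), checkpoint_dir)


# Count how many times the (supposedly expensive) Flow A really runs.
calls = {"a": 0}


def flow_a():
    calls["a"] += 1
    return [1, 2, 3]


def broken_flow_b(a):
    raise RuntimeError("bug in Flow B")


def fixed_flow_b(a):
    return sum(a)


with tempfile.TemporaryDirectory() as checkpoint_dir:
    try:
        run_workflow(checkpoint_dir, flow_a, broken_flow_b)  # fails in Flow B
    except RuntimeError:
        pass
    # After fixing Flow B, the rerun picks up Flow A's saved result.
    total = run_workflow(checkpoint_dir, flow_a, fixed_flow_b)
```

After the second run, Flow A has executed only once even though the workflow was started twice. A real checkpointing library would also have to deal with stale or partial outputs, which this sketch ignores.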

Original title and link: Cascalog-Checkpoint: Fault-Tolerant MapReduce Topologies (NoSQL database©myNoSQL)

#Cascalog #Cascading #MapReduce #Twitter

