Top Posts Tagged with #distributed cache

SharePoint - Weekly News - Allied Consultants

Original Article: http://alliedc.com/sharepoint-weekly-news-3/

Joins in pig

Joins play a significant role in Pig's data transformations, and we use them in most of our Pig scripts. Here we try to simplify this baffling topic.

A common database join strategy is to first select records from both inputs, sort both inputs on the join key, and then walk through both inputs together, joining the records that match the join condition. We can join on multiple keys, provided the keys on both sides have the same or compatible types.
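The sort-then-walk strategy described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not Pig's actual implementation; the function name `sort_merge_join`, the sample relations, and the choice of the first field as the join key are all assumptions made for the example.

```python
def sort_merge_join(left, right, key=lambda rec: rec[0]):
    """Inner equi-join of two lists of tuples by sorting, then walking both."""
    left = sorted(left, key=key)
    right = sorted(right, key=key)
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = key(left[i]), key(right[j])
        if lk < rk:
            i += 1          # left key too small, advance left
        elif lk > rk:
            j += 1          # right key too small, advance right
        else:
            # Find the run of right records sharing this key.
            j_end = j
            while j_end < len(right) and key(right[j_end]) == lk:
                j_end += 1
            # Pair every left record with that key against the whole run.
            while i < len(left) and key(left[i]) == lk:
                for r in right[j:j_end]:
                    out.append(left[i] + r)
                i += 1
            j = j_end
    return out


# Hypothetical sample data: (id, name) and (user_id, item).
users = [(2, "bob"), (1, "alice"), (3, "carol")]
orders = [(1, "book"), (3, "pen"), (1, "lamp"), (4, "desk")]
joined = sort_merge_join(users, orders)
```

Each input is scanned once after sorting, which is why this strategy suits inputs that are already sorted on the join key.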

We can classify joins on many parameters:

1) Equi join, Non equi join

2) Inner join, Outer join

3) Replicated, Skewed and Merge join

4) Map side join, Reduce side join

We shall discuss each of them below:

Equi and Non equi join: (classification based on operator we use)

If the join uses the '=' condition it is called an equi join; Pig supports equi joins with the JOIN keyword. If the join uses an operator other than '=' it is called a non equi join. Non equi joins are supported by doing a CROSS product followed by a FILTER condition.
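To make the cross-then-filter idea concrete, here is a small Python sketch. It models the two steps on in-memory lists and is not Pig code; the helper names `cross` and `non_equi_join` and the sample data are invented for the illustration.

```python
from itertools import product


def cross(left, right):
    """Cartesian product: every left record paired with every right record."""
    return [l + r for l, r in product(left, right)]


def non_equi_join(left, right, predicate):
    """Non equi join modelled as a cross product followed by a filter."""
    return [row for row in cross(left, right) if predicate(row)]


# Hypothetical sample data: (item, price) and (budget_name, limit).
prices = [("pen", 5), ("book", 20)]
budgets = [("small", 10), ("large", 50)]

# Keep the pairs where price < limit, a condition that '=' cannot express.
affordable = non_equi_join(prices, budgets, lambda row: row[1] < row[3])
```

The cross product grows with the product of the two input sizes, so this approach is best kept for cases where at least one input is small.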

Inner and outer join: (classification based on how we are fetching records from relations)

If we need only the intersection of the relations we use an inner join (keyword JOIN). If we need the intersection along with all records of the left relation we use a left outer join (keyword LEFT OUTER). If we need the intersection along with all records of the right relation we use a right outer join (keyword RIGHT OUTER). If we need the intersection and also all records of both relations we use a full outer join (keyword FULL OUTER).
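The difference between the four variants is easiest to see on a tiny example. The Python sketch below is an assumption-laden model rather than Pig itself: the function `join`, its `how` parameter, and the sample relations are hypothetical, and the first field of each tuple is taken as the join key. Missing sides are filled with `None`, standing in for the nulls an outer join produces.

```python
from collections import defaultdict


def join(left, right, how="inner"):
    """how is one of 'inner', 'left', 'right', 'full'; field 0 is the key."""
    right_by_key = defaultdict(list)
    for r in right:
        right_by_key[r[0]].append(r)
    left_width = len(left[0]) if left else 0
    right_width = len(right[0]) if right else 0

    out = []
    matched_keys = set()
    for l in left:
        matches = right_by_key.get(l[0], [])
        if matches:
            matched_keys.add(l[0])
            out.extend(l + r for r in matches)      # the intersection
        elif how in ("left", "full"):
            out.append(l + (None,) * right_width)   # unmatched left record
    if how in ("right", "full"):
        for r in right:
            if r[0] not in matched_keys:
                out.append((None,) * left_width + r)  # unmatched right record
    return out


# Hypothetical sample data.
users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (3, "pen")]

inner = join(users, orders, "inner")
left_outer = join(users, orders, "left")
right_outer = join(users, orders, "right")
full_outer = join(users, orders, "full")
```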

Replicated, Skewed and Merge: (Based on internal join implementation mechanism)

We can change how the join is implemented internally with the keyword USING '<type of join>'. Pig provides several techniques for running a join job on a Hadoop cluster. Which technique we use depends on the nature of the data in the relations being joined.

1) Replicated:

If one relation is small and the other is large, we use a replicated join. The small relation is shipped to every map task through Hadoop's distributed cache and held in memory, so the whole join is done on the map side. This is also called a fragment replicate join. It can be applied to inner and left outer joins only. It can be used for more than two tables; in that case the first (leftmost) table should be the large one, and all the tables after it are loaded into memory, so together they must fit in memory.
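A rough Python model of the replicated idea follows: the small relations are turned into in-memory hash tables and the large relation is streamed past them once. This is a sketch of the technique only; the name `replicated_join`, the sample data, and the use of field 0 as the join key are assumptions, and the distributed cache itself is not modelled.

```python
from collections import defaultdict


def replicated_join(large_stream, small_relations):
    """Inner join: hash every small relation in memory, stream the large one."""
    tables = []
    for rel in small_relations:
        table = defaultdict(list)
        for rec in rel:
            table[rec[0]].append(rec)
        tables.append(table)

    for rec in large_stream:
        rows = [rec]
        for table in tables:
            matches = table.get(rec[0], [])
            rows = [row + m for row in rows for m in matches]
            if not rows:
                break  # no match in this small relation, drop the record
        yield from rows


# Hypothetical sample data: one large relation and two small ones.
big = [(1, "click"), (2, "view"), (1, "buy"), (9, "click")]
tiny = [(1, "alice"), (2, "bob")]
mini = [(1, "gold")]

result = list(replicated_join(iter(big), [tiny, mini]))
```

Because the large relation is only iterated and never stored, memory use depends on the small relations alone.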

2) Skewed:

When the join keys are skewed, that is, a few keys account for a very large share of the records, we use a skewed join. Such keys are treated separately: their records are spread over several reducers, with the number chosen by a calculation (based on things like the size of the skewed key's values and a reducer's processing capacity). First, an MR job samples the input and identifies the skewed keys. Normal keys are then joined in the usual way, while records of skewed keys are allocated to a separate set of reducers. A skewed join can take at most two tables; if there are more tables we need to break the work down into a series of joins. We can use this technique for inner and outer joins.
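The splitting of a hot key over several reducers can be sketched as below. This is a simplified model, not Pig's algorithm: it counts every left record instead of sampling, the capacity rule `max_per_reducer` is a stand-in for the real calculation, and all names (`plan_skewed_keys`, `skewed_join`, `home`) are hypothetical. The point it illustrates is that left records of a skewed key are dealt out round-robin, while the matching right records are copied to each reducer that received a share.

```python
import zlib
from collections import Counter, defaultdict


def plan_skewed_keys(left, num_reducers, max_per_reducer):
    """Map each over-capacity key to the number of reducers it is spread over."""
    counts = Counter(rec[0] for rec in left)
    return {k: min(num_reducers, -(-c // max_per_reducer))  # ceiling division
            for k, c in counts.items() if c > max_per_reducer}


def skewed_join(left, right, num_reducers, max_per_reducer):
    plan = plan_skewed_keys(left, num_reducers, max_per_reducer)
    reducers = [{"left": [], "right": []} for _ in range(num_reducers)]

    def home(key):
        # Deterministic "hash partitioner" for the sketch.
        return zlib.crc32(repr(key).encode()) % num_reducers

    next_slot = defaultdict(int)
    for rec in left:
        k = rec[0]
        slot = next_slot[k] % plan.get(k, 1)   # always 0 for normal keys
        next_slot[k] += 1
        reducers[(home(k) + slot) % num_reducers]["left"].append(rec)

    for rec in right:
        k = rec[0]
        for slot in range(plan.get(k, 1)):     # replicate for skewed keys
            reducers[(home(k) + slot) % num_reducers]["right"].append(rec)

    rows = []
    for r in reducers:                          # ordinary join per reducer
        by_key = defaultdict(list)
        for rec in r["right"]:
            by_key[rec[0]].append(rec)
        for l in r["left"]:
            rows.extend(l + m for m in by_key.get(l[0], []))
    return rows, [len(r["left"]) for r in reducers]


# Hypothetical data: "hot" dominates the left relation.
left = [("hot", i) for i in range(6)] + [("cold", 0)]
right = [("hot", "H"), ("cold", "C")]
rows, loads = skewed_join(left, right, num_reducers=3, max_per_reducer=2)
```

Without the split, all six "hot" records would land on a single reducer; with it, no reducer in this example receives more than three left records.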

3) Merge:

When the records of both relations are already sorted on the join key, we can use a merge join. This join can be done entirely in the map phase, so there is no reduce phase. It can be applied to inner joins only.

Map side and Reduce side join: (based on where join operation is run and completed)

A map side join is more efficient, but it works only under certain conditions: the data must be sorted on the join key, each input must be divided into the same number of partitions, and all records for a particular key must reside in the same partition.

Reduce side join is the default join, and it goes through all the phases: map, shuffle and reduce. In the map phase we tag each output record with the relation it came from. The shuffle phase then sends all records with the same join key to the same reducer, which uses the tags to tell the two sides apart and join them.
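The phases of a reduce side join can be imitated in plain Python as a sketch. The functions `map_phase`, `shuffle` and `reduce_phase`, the tags "L" and "R", and the sample data are assumptions made for the illustration; a real job would run these steps across many machines. In this model the tag records which relation a record came from, and grouping is done on the join key.

```python
from collections import defaultdict
from itertools import chain


def map_phase(relation, tag):
    """Emit (join_key, (tag, record)) so the reducer knows the record's origin."""
    for rec in relation:
        yield rec[0], (tag, rec)


def shuffle(mapped):
    """Group all tagged records by join key, as the shuffle phase would."""
    groups = defaultdict(list)
    for key, tagged in mapped:
        groups[key].append(tagged)
    return groups


def reduce_phase(groups):
    """For each key, separate the two sides by tag and pair them up."""
    out = []
    for key in sorted(groups):
        lefts = [rec for tag, rec in groups[key] if tag == "L"]
        rights = [rec for tag, rec in groups[key] if tag == "R"]
        out.extend(l + r for l in lefts for r in rights)
    return out


# Hypothetical sample data.
users = [(1, "alice"), (2, "bob")]
orders = [(2, "pen"), (1, "book"), (1, "lamp")]

mapped = chain(map_phase(users, "L"), map_phase(orders, "R"))
joined = reduce_phase(shuffle(mapped))
```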

#join #pig #skewed #merge join #distributed cache #mapside #reduce side #joinsinpig

Cache < Data Grid < Database

I would like to clarify definitions for the following technologies:

In-Memory Distributed Cache

In-…

#distributed cache #gridgain #in-memory data grid #in-memory database