Top Posts Tagged with #diggbigg

Fetch data from HBASE database from R using rhbase package

Sometimes you may have to analyze a dataset that is stored in HBASE tables on a Hadoop cluster. I recently ran into this situation, and Revolution Analytics's rhbase package came to the rescue. Although the tutorial on the rhbase wiki is well documented, I faced some issues along the way, so I thought I should write a step-by-step guide for future reference.

The following things are required to be installed on the server and client.

Server: Linux-Ubuntu, Hadoop, HBASE, R, Rstudio-server, thrift

Client: Browser

For this guide I assume Ubuntu, Hadoop, and HBASE are already installed and configured on the server side and there are some data tables in the HBASE database.

1. R

sudo apt-get install r-base

2. Rstudio Server

Install RStudio Server by following the instructions given on the download site

3. Apache Thrift

Download Apache Thrift version 0.9.0 rather than 0.9.1

Install all Thrift prerequisites

Ubuntu

$ sudo apt-get install libboost-dev libboost-test-dev libboost-program-options-dev libevent-dev automake libtool flex bison pkg-config g++ libssl-dev

CentOS5/Rhel5

$ sudo yum install automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel

- Build Thrift according to the instructions

- Update PKG_CONFIG_PATH in bashrc by typing the following command in the terminal (sudo is not needed to edit your own bashrc):

nano ~/.bashrc

And paste the following line at the end

export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig/

Then source the bashrc via

source ~/.bashrc
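If you would rather script this edit than open an editor, a minimal sketch follows. The helper name `append_once` is an assumption made for illustration and is not part of Thrift or rhbase. It appends a line to a file only when that exact line is not already present, so running it twice does not duplicate the export.

```shell
#!/bin/sh
# append_once FILE LINE: append LINE to FILE unless that exact line already exists.
# (Hypothetical helper, not part of any package mentioned in this guide.)
append_once() {
    file=$1
    line=$2
    # -q: quiet, -x: match the whole line, -F: fixed string (no regex)
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Usage (single quotes keep $PKG_CONFIG_PATH unexpanded in the file):
# append_once "$HOME/.bashrc" 'export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig/'
```

The usage line is left commented out so that pasting the block does not modify anything until you choose to run it.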

Verify that the pkg-config path is correct by typing this in the terminal:

pkg-config --cflags thrift

And it should return -I/usr/local/include/thrift and not something like -I/usr/local/include
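The same check can be automated. The sketch below assumes only what is stated above, namely that a correct setup reports an include path ending in /thrift. The function name `has_thrift_include` is hypothetical.

```shell
#!/bin/sh
# has_thrift_include CFLAGS: succeed only if CFLAGS contains an include flag
# whose path ends in /thrift (for example -I/usr/local/include/thrift).
# (Hypothetical helper for illustration.)
has_thrift_include() {
    # $1 is intentionally unquoted so the flags are split on whitespace
    for flag in $1; do
        case $flag in
            -I*/thrift|-I*/thrift/) return 0 ;;
        esac
    done
    return 1
}

# Usage on the server:
# has_thrift_include "$(pkg-config --cflags thrift)" && echo "ok" || echo "PKG_CONFIG_PATH looks wrong"
```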

Copy Thrift library

sudo cp /usr/local/lib/libthrift-0.9.0.so /usr/lib/

4. Start HBASE and Thrift server

HBASE

start-hbase.sh

HBASE should now be running. To verify that Hadoop and all related applications are running, and to query HBASE, perform the following

jps

You should see something like,

6162 DataNode

6739 TaskTracker

502 JobTracker

7029 HMaster

14867 Jps

12245 Main

5924 NameNode

7320 HRegionServer

13740 ThriftServer

6412 SecondaryNameNode
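Rather than scanning the list by eye, you can filter it. This is a rough sketch: the function name `missing_daemons` is hypothetical, and the set of expected processes is an assumption taken from the sample output above, so adjust it for your cluster.

```shell
#!/bin/sh
# missing_daemons: read `jps` output on stdin and print the name of each
# expected process that is absent. The expected list is an assumption
# based on the sample output in this guide.
missing_daemons() {
    out=$(cat)
    for name in NameNode DataNode HMaster HRegionServer; do
        # jps prints "<pid> <ClassName>"; compare the second field exactly,
        # so that SecondaryNameNode does not count as NameNode
        if ! printf '%s\n' "$out" | awk -v n="$name" '$2 == n { found = 1 } END { exit !found }'; then
            printf '%s\n' "$name"
        fi
    done
}

# Usage on the server:
# jps | missing_daemons
```

An empty result means every expected process was found.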

hbase shell
hbase(main):003:0> list #To see list of tables
hbase(main):003:0> describe('TABLE_NAME') #Small description of concerned table
hbase(main):003:0> scan('TABLE_NAME') #Get content of that table

Thrift

hbase thrift start

If it throws an error then try this

hbase thrift start -threadpool

5. rhbase

Installing rhbase package generally requires 2 steps

wget https://raw.github.com/RevolutionAnalytics/rhbase/master/build/rhbase_1.2.0.tar.gz
R CMD INSTALL rhbase_1.2.0.tar.gz

6. Log in to RStudio Server

Type the server's IP in the browser with port 8787. For example, 192.168.20.10:8787

7. Query HBASE from R

library(rhbase)
hostLoc = '192.168.20.10' #Give your server IP
port = 9090 #Default port for thrift service
#Connect once. serialize="character" is only needed if the data in the table is character; otherwise omit it
hb.init(hostLoc, port, serialize="character")
hb.list.tables()
hb.describe.table("TABLE_NAME")
data <- c()
#Scan the table one row at a time and collect the rows
iter <- hb.scan(tablename='TABLE_NAME', startrow="1", colspec="FamilyName:")
while(length(row <- iter$get(1))>0){
  data <- c(data, row)
}

Enjoy, now you can browse, read, write, and modify tables stored in HBASE through R.

#hbase #R #DiggBigg #rhbase #rstudio #big-data


How do I start learning Hadoop?

#hadoop #DiggBigg #big-data #map-reduce

Due to the large list of Colleges with Data Science Degrees, I receive a number of email inquiries with questions about choosing a program. I have not attended any of the programs, and I am not sure how qualified I am to provide guidance. Anyhow, I will do my best to share what information I do have.

Originally, the list started out with 5 schools. Now the list is well over 100 schools, so I have not been able to keep up with all the intricate details of every program. There are not very many undergraduate options, and the list only contains a few PhD programs, so the information here will be focused on pursuing a master's degree.

Start by asking 2 questions:

What are my current data science skills?

What are my future data science goals?

Those 2 questions can provide a lot of guidance. Understand that data science consists of a number of different topic areas:

Mathematical Foundation (Calculus/Matrix Operations)

Computing (DB, programming, machine learning, NoSQL)

Communication (visualization, presentation, writing)

Statistics (regression, trees, classification, diagnostics)

Business (domain specific knowledge)

After seeing the above lists, this is where things get cloudy. Everyone brings a different set of existing skills, and everyone has different future goals. Here are a few scenarios that might clear things up.

Data Scientist

The most common approach is to attempt to build knowledge in all 5 topic areas. If this is your goal, find the topic areas where you are weakest and target a graduate program to help you bolster those weak skills. In the end, you will come out with a broad range of highly desired skills.

Specialist

A different approach is to select one topic area and get really, really good. For example, maybe you want to be an expert on machine learning. If that is your goal, then maybe a traditional computer science graduate program is what is best. In the end, you will be well-suited to be an effective member of a data science team or pursue a PhD.

Data Manager

A third and also common approach comes from people who want to help fill the expected void of 1.5 million data-savvy managers. These people do not necessarily want to know the deep details of the algorithms, but they would like an understanding of what the algorithms can do and when to use which algorithm. In this case, a graduate program from a business school (MBA) might be a good choice. Just make sure the program also covers the non-business topics of data science.

Example

I think NYU is the best example of a school that can help a person achieve just about any data science goal. The NYU program is a university-wide initiative, so the program is integrated with many departments (math, CS, Stats, Business, and others). Therefore, a student could possibly tailor a program to reach a variety of future goals. Plus, New York has a lot of companies solving interesting data science problems.

Conclusion

There you have it. It does not narrow the choices down, but it should help to provide some guidance. Other factors to consider are length of a program and/or location.

Good luck with your decision, and feel free to leave a comment if you have any good/bad experiences with any of the particular graduate programs.

#datascience #DiggBigg

#DiggBigg

#DiggBigg #datascience


What Does a Data Scientist Do?

Big Data [sorry] & Data Science: What Does a Data Scientist Do? from Data Science London

#datascience #DiggBigg #presentation

Data Science Use Cases

Background

For each type of analysis think about:

What problem does it solve and for who?

How is it being solved today?

What are the data inputs and where do they come from?

What are the outputs and how are they consumed? (online algo, static report)

Is it a revenue leakage ("saves us money") or a revenue growth ("makes us money") problem?

Use Cases By Function

Sales

Lead prioritization

What is a given lead's likelihood of closing?

revenue impact: supports growth

usage: online algorithm and static report

Demand forecasting

Logistics

Demand forecasting

How many of what thing do you need and where will we need them? (Enables lean inventory and prevents out of stock situations.)

revenue impact: supports growth and militates against revenue leakage

usage: online algorithm and static report

#ideas #datascience #DiggBigg #use-cases

