Fetch data from an HBASE database in R using the rhbase package
Sometimes you need to analyze a dataset that is stored in HBASE tables on a Hadoop cluster. I recently ran into this situation, and Revolution Analytics' rhbase package came to the rescue. The tutorial on the rhbase wiki is well documented, but I ran into some issues along the way, so I decided to write a step-by-step guide for future reference.
The following things are required to be installed on the server and client.
Server: Linux-Ubuntu, Hadoop, HBASE, R, Rstudio-server, thrift
Client: Browser
For this guide I assume Ubuntu, Hadoop, and HBASE are already installed and configured on the server side and there are some data tables in the HBASE database.
1. R
sudo apt-get install r-base
2. RStudio Server
Install RStudio Server by following the instructions given on the download site
3. Apache Thrift
Download Apache Thrift version 0.9.0 rather than 0.9.1
Install all Thrift pre-requisites
Ubuntu
$ sudo apt-get install libboost-dev libboost-test-dev libboost-program-options-dev libevent-dev automake libtool flex bison pkg-config g++ libssl-dev
CentOS5/Rhel5
$ sudo yum install automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel
Build Thrift according to its instructions
Update PKG_CONFIG_PATH in bashrc. Open the file by typing the following command in the terminal (sudo is not needed to edit your own bashrc):
nano ~/.bashrc
And paste the following line at the end
export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig/
Then source the bashrc via
source ~/.bashrc
Verify that the pkg-config path is correct by typing this in the terminal:
pkg-config --cflags thrift
And it should return -I/usr/local/include/thrift and not something like -I/usr/local/include
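The check above can also be scripted, which helps when the output contains several flags. The following is only a minimal sketch: check_thrift_cflags is a hypothetical helper name, not part of Thrift or pkg-config, and it assumes the include path ends in /include/thrift as described above.

```shell
# Hypothetical helper: succeeds only if the given cflags string contains
# an include flag whose path ends in /include/thrift.
check_thrift_cflags() {
  local flag
  # Intentionally unquoted so the string is split into individual flags.
  for flag in $1; do
    case "$flag" in
      -I*/include/thrift) return 0 ;;
    esac
  done
  return 1
}

# Usage: feed it the real pkg-config output.
# Nothing is printed if pkg-config cannot find Thrift at all.
if cflags=$(pkg-config --cflags thrift 2>/dev/null); then
  if check_thrift_cflags "$cflags"; then
    echo "pkg-config path looks correct: $cflags"
  else
    echo "unexpected include path: $cflags" >&2
  fi
fi
```

If the second message appears, PKG_CONFIG_PATH most likely does not yet include /usr/local/lib/pkgconfig/.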
Copy Thrift library
sudo cp /usr/local/lib/libthrift-0.9.0.so /usr/lib/
4. Start HBASE and Thrift server
HBASE
start-hbase.sh
HBASE should now be running. To verify that Hadoop and all related applications are running, and to query HBASE, do the following:
jps
You should see something like,
6162 DataNode
6739 TaskTracker
502 JobTracker
7029 HMaster
14867 Jps
12245 Main
5924 NameNode
7320 HRegionServer
13740 ThriftServer
6412 SecondaryNameNode
hbase shell
hbase(main):003:0> list #To see list of tables
hbase(main):003:0> describe('TABLE_NAME') #Small description of concerned table
hbase(main):003:0> scan('TABLE_NAME') #Get content of that table
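Reading the jps listing by eye gets tedious, so here is a rough sketch of how the check could be automated. The missing_daemons function is a hypothetical helper, and the list of process names passed to it is only an assumption based on the sample output above; adjust it to your own cluster.

```shell
# Hypothetical helper: reads `jps` output on stdin and prints each required
# process name (given as arguments) that does not appear in it.
missing_daemons() {
  local output name
  output=$(cat)
  for name in "$@"; do
    # Compare against the second column exactly, so that
    # SecondaryNameNode is not mistaken for NameNode.
    if ! printf '%s\n' "$output" |
        awk -v n="$name" '$2 == n { found = 1 } END { exit !found }'; then
      printf '%s\n' "$name"
    fi
  done
}

# Usage (only runs when jps is on the PATH):
if command -v jps >/dev/null 2>&1; then
  jps | missing_daemons NameNode DataNode HMaster HRegionServer
fi
```

An empty result means every listed process was found. Any name that is printed belongs to a process that is not running.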
Thrift
hbase thrift start
If it throws an error then try this
hbase thrift start -threadpool
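Before moving on, it can be worth confirming that the Thrift server is actually accepting connections. The snippet below is a sketch under a few assumptions: port_open is a hypothetical helper, it relies on bash's /dev/tcp feature and the coreutils timeout command being available, and it uses 9090, the default Thrift port used later in this guide.

```shell
# Hypothetical helper: succeeds if a TCP connection to host:port can be
# opened within two seconds. Inside the inner bash, $0 is the host and
# $1 is the port.
port_open() {
  timeout 2 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# 9090 is the default port for the Thrift service.
if port_open localhost 9090; then
  echo "Thrift server is reachable on port 9090"
else
  echo "nothing is listening on port 9090 yet"
fi
```

If the second message appears, give the Thrift server a few seconds to start and try again.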
5. rhbase
Installing rhbase package generally requires 2 steps
wget https://raw.github.com/RevolutionAnalytics/rhbase/master/build/rhbase_1.2.0.tar.gz
R CMD INSTALL rhbase_1.2.0.tar.gz
6. Log in to RStudio Server
Type the server's IP address followed by port 8787 into the browser. For example, 192.168.20.10:8787
7. Query HBASE from R
require(rhbase)
hostLoc = '192.168.20.10' #Give your server IP
port = 9090 #Default port for thrift service
#serialize="character" is only needed if the data in the table is stored as characters; otherwise leave it out
hb.init(hostLoc, port, serialize="character")
hb.list.tables()
hb.describe.table("TABLE_NAME")
data <- c()
iter <- hb.scan(tablename='TABLE_NAME', startrow="1", colspec="FamilyName:")
#Fetch one row at a time until the scanner returns nothing
while(length(row <- iter$get(1))>0){
  data <- c(data, row)
}
Enjoy, now you can browse, read, write, and modify tables stored in HBASE through R.













