Fetch data from an HBASE database in R using the rhbase package
Sometimes you may have to analyze a dataset that is stored in HBASE tables on a Hadoop cluster. I recently ran into this situation, and Revolution Analytics's rhbase package came to the rescue. Although the tutorial on the rhbase wiki is well documented, I faced some issues along the way, so I thought I should write a step-by-step guide for future reference.
The following things are required to be installed on the server and client.
For this guide I assume Ubuntu, Hadoop, and HBASE are already installed and configured on the server side and there are some data tables in the HBASE database.
1. R
sudo apt-get install r-base
2. RStudio Server
Install RStudio Server by following the instructions given on the download site
3. Apache Thrift
Download Apache Thrift version 0.9.0 rather than 0.9.1
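If a machine already has Thrift on it, a quick shell check can tell you whether it is the version this guide expects. This is only a sketch: it assumes `thrift -version` prints a line of the form "Thrift version 0.9.0", and the function name is_supported_thrift is made up for this example.

```shell
# Succeed only when the given version string reports Thrift 0.9.0.
# Assumption: `thrift -version` prints "Thrift version X.Y.Z".
is_supported_thrift() {
  case "$1" in
    "Thrift version 0.9.0") return 0 ;;
    *) return 1 ;;
  esac
}

# Typical use on the server (not run here):
#   is_supported_thrift "$(thrift -version)" || echo "Install Thrift 0.9.0 first"
```

The check is deliberately strict, since this guide recommends 0.9.0 over 0.9.1.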
Now HBASE must be running. To verify that Hadoop and all related services are running, and to query HBASE, do the following:
jps
You should see something like,
6162 DataNode
6739 TaskTracker
502 JobTracker
7029 HMaster
14867 Jps
12245 Main
5924 NameNode
7320 HRegionServer
13740 ThriftServer
6412 SecondaryNameNode
hbase shell
hbase(main):003:0> list #To see list of tables
hbase(main):003:0> describe('TABLE_NAME') #Small description of concerned table
hbase(main):003:0> scan('TABLE_NAME') #Get content of that table
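The jps check above is easy to script if you run it often. The following is a minimal sketch, not something taken from the rhbase wiki: the function name missing_daemons is hypothetical, and it assumes jps prints one "<pid> <name>" pair per line, as in the listing above. You pass in the daemon names you care about, and it prints the ones that are not running.

```shell
# Print the names of required daemons that do not appear in jps-style output.
# Reads "<pid> <name>" lines on stdin; takes the required names as arguments.
missing_daemons() {
  running="$(awk '{print $2}')"   # keep only the process-name column
  missing=""
  for name in "$@"; do
    # -x matches the whole line, so NameNode does not match SecondaryNameNode
    if ! printf '%s\n' "$running" | grep -qxF "$name"; then
      missing="$missing $name"
    fi
  done
  # Drop the leading space; the line is empty when nothing is missing
  printf '%s\n' "${missing# }"
}

# Typical use on the server (not run here):
#   jps | missing_daemons NameNode DataNode HMaster HRegionServer
```

Once the Thrift server has been started, ThriftServer can be added to the list of names.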
4. Thrift
hbase thrift start
If it throws an error then try this
hbase thrift start -threadpool
5. rhbase
Installing rhbase package generally requires 2 steps
wget https://raw.github.com/RevolutionAnalytics/rhbase/master/build/rhbase_1.2.0.tar.gz
R CMD INSTALL rhbase_1.2.0.tar.gz
6. Log in to RStudio Server
Type the server's IP address followed by port 8787 into the browser. For example, 192.168.20.10:8787
7. Query HBASE from R
require(rhbase)
hostLoc = '192.168.20.10' #Give your server IP
port = 9090 #Default port for thrift service
hb.init(hostLoc, port)
hb.init(hostLoc, port, serialize="character") #Only needed if the table data is stored as characters; otherwise skip this step
hb.list.tables()
hb.describe.table("TABLE_NAME")
data <- c()
iter <- hb.scan(tablename='TABLE_NAME', startrow="1", colspec="FamilyName:")
while(length(row <- iter$get(1))>0){
  data <- c(data, row) #Append each scanned row to the result
}
iter$close() #Release the scanner once all rows are read
Enjoy, now you can browse, read, write, and modify tables stored in HBASE through R.
Due to the large list of Colleges with Data Science Degrees, I receive a number of email inquiries with questions about choosing a program. I have not attended any of the programs, and I am not sure how qualified I am to provide guidance. Anyhow, I will do my best to share what information I do have.
Originally, the list started out with 5 schools. Now the list is well over 100 schools, so I have not been able to keep up with all the intricate details of every program. There are not very many undergraduate options, and the list only contains a few PhD programs, so the information here will be focused on pursuing a master's degree.
Start by asking 2 questions:
What are my current data science skills?
What are my future data science goals?
Those 2 questions can provide a lot of guidance. Understand that data science consists of a number of different topic areas:
Mathematical Foundation (Calculus/Matrix Operations)
After seeing the above lists, this is where things get cloudy. Everyone brings a different set of existing skills, and everyone has different future goals. Here are a few scenarios that might clear things up.
Data Scientist
The most common approach is to attempt to build knowledge in all 5 topic areas. If this is your goal, find the topic areas where you are weakest and target a graduate program to help you bolster those weak skills. In the end, you will come out with a broad range of highly desirable skills.
Specialist
A different approach is to select one topic area and get really, really good. For example, maybe you want to be an expert on machine learning. If that is your goal, then maybe a traditional computer science graduate program is what is best. In the end, you will be well-suited to be an effective member of a data science team or pursue a PhD.
Data Manager
A third and also common approach comes from people who want to help fill the expected void of 1.5 million data-savvy managers. These people do not necessarily want to know the deep details of the algorithms, but they would like an understanding of what the algorithms can do and when to use which algorithm. In this case, a graduate program from a business school (MBA) might be a good choice. Just make sure the program also covers the non-business topics of data science.
Example
I think NYU is the best example of a school that can help a person achieve just about any data science goal. The NYU program is a university-wide initiative, so the program is integrated with many departments (math, CS, Stats, Business, and others). Therefore, a student could possibly tailor a program to reach a variety of future goals. Plus, New York has a lot of companies solving interesting data science problems.
Conclusion
There you have it. It does not narrow the choices down, but it should help to provide some guidance. Other factors to consider are length of a program and/or location.
Good luck with your decision, and feel free to leave a comment if you have any good/bad experiences with any of the particular graduate programs.
What are the data inputs and where do they come from?
What are the outputs and how are they consumed? (online algorithm, static report, etc.)
Is this a revenue leakage ("saves us money") or a revenue growth ("makes us money") problem?
Use Cases By Function
Sales
Lead prioritization
What is a given lead's likelihood of closing
revenue impact: supports growth
usage: online algorithm and static report
Demand forecasting
Logistics
Demand forecasting
How many of what thing do you need and where will we need them? (Enables lean inventory and prevents out of stock situations.)
revenue impact: supports growth and militates against revenue leakage
usage: online algorithm and static report
Predicting Lifetime Value (LTV)
what for: if you can predict the characteristics of high LTV customers, this supports customer segmentation, identifies upsell opportunities and supports other marketing initiatives
usage: can be both an online algorithm and a static report showing the characteristics of high LTV customers
Wallet share estimation
working out the proportion of a customer's spend in a category that accrues to a company allows that company to identify upsell and cross-sell opportunities
Usage: can be both an online algorithm and a static report showing the characteristics of low wallet share customers
Churn
working out the characteristics of churners allows a company to make product adjustments, and an online algorithm allows them to reach out to likely churners
usage: can be both an online algorithm and a static report showing the characteristics of likely churners
Customer segmentation
If you can understand qualitatively different customer groups, then you can give them different treatments (perhaps even by different groups in the company). Answers questions like: what makes people buy, stop buying, etc.
usage: static report
Product mix
What mix of products offers the lowest churn? eg. Giving a combined policy discount for home + auto = low churn
usage: online algorithm and static report
Cross selling/Recommendation algorithms
Given a customer's past browsing history, purchase history and other characteristics, what are they likely to want to purchase in the future?
usage: online algorithm
Up selling
Given a customer's characteristics, what is the likelihood that they'll upgrade in the future?
usage: online algorithm and static report
Channel optimization
what is the optimal way to reach a customer with certain characteristics?
usage: online algorithm and static report
Discount targeting
What is the probability of inducing the desired behavior with a discount?
usage: online algorithm and static report
Reactivation likelihood
What is the reactivation likelihood for a given customer
usage: online algorithm and static report
Adwords optimization and ad buying
calculating the right price for different keywords/ad slots
Risk
Credit risk
Treasury or currency risk
How much capital do we need on hand to meet these requirements?
Fraud detection
predicting whether or not a transaction should be blocked because it involves some kind of fraud (eg credit card fraud)
Accounts Payable Recovery
Predicting the probability that a liability can be recovered given the characteristics of the borrower and the loan
Anti-money laundering
Using machine learning and fuzzy matching to detect transactions that contradict AML legislation (such as the OFAC list)
Customer support
Call centers
Call routing (ie determining wait times) based on caller id history, time of day, call volumes, products owned, churn risk, LTV, etc.
Call center message optimization
Putting the right data on the operator's screen
Call center volume forecasting
predicting call volume for the purposes of staff rostering
Human Resources
Resume screening
scores resumes based on the outcomes of past job interviews and hires
Employee churn
predicts which employees are most likely to leave
Training recommendation
recommends specific training based on performance review data
Talent management
looking at objective measures of employee success
Use Cases By Vertical
Healthcare
Claims review prioritization
payers picking which claims should be reviewed by manual auditors
Medicare/medicaid fraud
Tackled at the claims processors, EDS is the biggest & uses proprietary tech
Medical resources allocation
Hospital operations management
Optimize/predict operating theatre & bed occupancy based on initial patient visits
Alerting and diagnostics from real-time patient data
Embedded devices (productized algos)
Exogenous data from devices to create diagnostic reports for doctors
Prescription compliance
Predicting who won't comply with their prescriptions
Physician attrition
Hospitals want to retain Drs who have admitting privileges in multiple hospitals
Survival analysis
Analyse survival statistics for different patient attributes (age, blood type, gender, etc) and treatments
Medication (dosage) effectiveness
Analyse effects of administering different types and dosages of medication for a disease
Readmission risk
Predict risk of readmission based on patient attributes, medical history, diagnosis & treatment
Consumer Financial
Credit card fraud
Banks need to prevent, and vendors need to prevent
Retail (FMCG - Fast-moving consumer goods)
Pricing
Optimize per time period, per item, per store
Was dominated by Retek, but got purchased by Oracle in 2005. Now Oracle Retail.
JDA is also a player (supply chain software)
Location of new stores
Pioneered by Tesco
Dominated by Buxton
Product layout in stores
This is called "plan-o-gramming"
Merchandizing
when to start stocking & discontinuing product lines
Identifying contractors who are regularly involved in poor performing products
Design issue prediction
Predicting that a construction project is likely to have issues as early as possible
Life Sciences
Identifying biomarkers for boxed warnings on marketed products
Drug/chemical discovery & analysis
Crunching study results
Identifying negative responses (monitor social networks for early problems with drugs)
Diagnostic test development
Hardware devices
Software
Diagnostic targeting (CRM)
Predicting drug demand in different geographies for different products
Predicting prescription adherence with different approaches to reminding patients
Putative safety signals
Social media marketing on competitors, patient perceptions, KOL feedback
Image analysis or GCMS analysis in a high throughput manner
Analysis of clinical outcomes to adapt clinical trial design
COGS optimization
Leveraging molecule database with metabolic stability data to elucidate new stable structures
Hospitality/Service
Inventory management/dynamic pricing
Promos/upgrades/offers
Table management & reservations
Workforce management (also applies to lots of verticals)
Electrical grid distribution
Keep AC frequency as constant as possible
Seems like a very "online" algorithm
Manufacturing
Sensor data to look at failures
Quality management
Identifying out-of-bounds manufacturing
Visual inspection/computer vision
Optimal run speeds
Demand forecasting/inventory management
Warranty/pricing
Travel
Aircraft scheduling
Seat mgmt, gate mgmt
Air crew scheduling
Dynamic pricing
Customer complaint resolution (give points in exchange)
Call center stuff
Maintenance optimization
Tourism forecasting
Agriculture
Yield management (taking sensor data on soil quality, common in newer John Deere et al truck models, and determining what seed varieties, seed spacing to use, etc.)
Mall Operators
Predicting tenants' capacity to pay based on their sales figures and their industry
Predicting the best tenant for an open vacancy to maximise overall sales at a mall
Education
Automated essay scoring
Utilities
Optimise Distribution Network Cost Effectiveness (balance Capital & Operating Expenditure)
Predict Commodity Requirements
Other
Sentiment analysis
Loyalty programs
Sensor data
Alerting
What's going to fail?
Deduplication
Procurement
Use Cases That Need Fleshing Out
Procurement
Negotiation & vendor selection
Are we buying from the best producer
Marketing
Direct Marketing
Response rates
Segmentations for mailings
Reactivation likelihood
RFM
Discount targeting
FinServ
Phone marketing
Generally as a follow-up to a DM or a churn predictor
Email Marketing
Offline
Call to action w/ unique promotion
Why are people responding? How do I adjust my buy (where, when, how)?
"I'm sure we are wasting half our money here, but the problem is we don't know which ad"
Media Mix Optimization
Kantar Group and Nielsen are dominant
Hard part of this is getting to the data (good samples & response vars)
Healthcare
CRM & utilization optimization
Claims coding
Formulary determination and pricing
How do I get you to use my card for auto-pay? Paypal? etc. Unsolved.