Hadoopi has been updated and now has wired networking (for improved performance and reliability) plus the addition of metrics collection with Prometheus and visualisation of those metrics in Grafana dashboards.
Anya is live and ready to show you everything. Watch her strip, dance, and perform exclusive shows just for you. Interact in real-time and make your fantasies come true.
✓ Live Streaming✓ Interactive Chat✓ Private Shows✓ HD Quality
Anya is LIVE right now
FREE
Free to watch • No registration required • HD streaming
Visualising IOT Data on a Pi Cluster using Mesos, Spark & Kafka
The sensorpi repo on GitHub holds various ramblings, scripts and code put together for and experiment to visualise realtime sensor data processed on a cluster of Raspberry Pis. Not dissimilar to https://www.circuits.dk/datalogger-example-using-sense-hat-influxdb-grafana/ but using the features of the cluster for near realtime processing.
But first a few caveats, unlike my Hadoopi project as this is an experiment there isn’t chef code to setup and configure the cluster of Raspberry Pis (you’ll need to do this by hand). Originally this project was intended to implement a SMACK (Spark, Mesos, Akka, Cassandra and Kafka) stack. You’ll see the end result is more of a SMKS stack (Spark, Mesos, Kafka and Scala) acting as a transfer mechanism between two Pis for capturing sensor mnetrics and visualsation of the collected data. On the sensor side a Pi zero is using an EnviroPhat pushing data to Kafka via Python. On the visualisation side there is an influxdb instance and grafana server to store and serve a realtime dashboard of the data.
Despite all of those caveats, I learned a tonne about running a Mesos on a cluster (on tin), writing Scala code, building it, Spark streaming, IOT sensors, Influxdb TICK stack and Grafana dashboards. So if you want to play along I expect you’ll learn all those things too, so it’s a case of manual setup, please make sure your command line Fu is cranked up to 11!
What You Will Need
The project uses a cluster of 5 Raspberry Pis, a wireless hub and 60W usb power adapter, full details of the cluster hardware can be found on the Hadoopi project. You’ll also need a development PC, I’m running Debian linux as it supports all the technologies and tools to build and deploy. Finally for the sensor I’m using an EnviroPhat from PiMoroni on a Raspberry Pi Zero W.
These instructions below will help you set up the cluster, this is how we are going to spread the workload across it:
Raspberry Pi Zero W and EnviroPhat - Python application posting sensor data to a Kafka Topic
Master01 - Running influxDB and Grafana
Master02 - Mesos Master, Launch Scripts and NFS share for spark app checkpointing
Worker01/02/03 - mesos slaves for running tasks from mesos
Each of the workers will run the spark application and kafka brokers as part of the mesos cluster.
You’ll be using the development PC for building the scala application and submitting it to Spark, if you have Kafka installed you can also use that to view the messages on the Kafka topic to help with troubleshooting.
As stated above this project really is an experiment because you could post metrics from the sensor Pi directly to influxdb - therefore the mesos, spark, scala and kafka components really are their just to learn.
Compilation and Packaging
Because the Raspberry Pi uses an arm architecture we are going to have to compile Mesos and Kafka, you’ll also find minimal documentation on achieving this, I’m using instructions that Netflix posted after one of their hackdays and blog post so as a result we have quite an old version of Mesos 0.28.3. I’ll be using the tried and trusted jessie lite 23-09-16 distribution of raspbian as I’d proven whilst building the Hadoopi project.
Boot one of you Pis from a fresh install and run the following to prepare the environment for compiling:
sudo -i apt-get update apt-get -y install oracle-java8-jdk tar wget git autoconf libtool build-essential python-dev python-boto libcurl4-nss-dev libsasl2-dev libsasl2-modules maven libapr1-dev libsvn-dev update-alternatives --config java select /usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/jre/bin/java java -version <--- check it's the 1.8 SDK vi /etc/dphys-swapfile set CONF_SWAPSIZE to 1024 /etc/init.d/dphys-swapfile restart free -h <--- check you have 1GB of swap
And now build the binary (warning this may take 3-4 hours):
./bootstrap mkdir build cd build ../configure --disable-python make mkdir /opt/mesos make DESTDIR=/opt/mesos/ install tar -C /opt/mesos/ -zcvf /home/pi/mesos-install.tar.gz .
Now transfer the binary mesos-install.tar.gz down to the dev pc
Next we need to make the kafka binary for mesos, following from the above build on the Pi:
cd apt-get -y install oracle-java8-jdk git java -version <--- check it's the 1.8 SDK git clone https://github.com/mesos/kafka cd kafka vi build.gradle version <--- change this to theabsolute path for Config.scala, so prepend /root/kafka/ ./gradlew jar downloadKafka cd /root tar -C /root/kafka/ -zcvf /home/pi/kafka-install.tgz .
Now transfer the binary kafka-install.tgz down to the dev pc
You can now power off the Pi and wipe the sdcard for reuse.
Setup OS and Networking
On each of the 5 nodes in the cluster - master01/02/worker01/02/03 (so excluding the sensor Pi). The Network is is setup to use the 10.0.0.255 subnet with master01/02 at 10.0.0.11 and 12, the workers being 21, 22, 23.
sudo -i apt-get update vi /etc/wpa_supplicant/wpa_supplicant.conf country=GB ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev update_config=1 network={ ssid="XXXXXXXX" <---- replace twith the SSID and KEY for the wireless network psk="YYYYYYYY" } vi /etc/dhcpcd.conf ...add to bottom... interface wlan0 static ip_address=10.0.0.XX/24 <-- 10.0.0.11,12,21,22,23 as appropriate static routers=10.0.0.1 static domain_name_servers=10.0.0.1 hostnamectl set-hostname ZZZZZZ <- change to master01/02/worker01/02/03 as appropriate vi /etc/hosts ...add to bottom... 10.0.0.11 master01 10.0.0.12 master02 10.0.0.21 worker01 10.0.0.22 worker02 10.0.0.23 worker03 vi /etc/ssh/ssh_config ...add to bottom... StrictHostKeyChecking no <--- this will break sshd on a stretch based distro of raspbian, it's ok on jessie service networking restart ifup wlan0 ifconfig <---- check got correct ip apt-get install nscd
You may for convenience wish to setup passwordless login from your dev pc
At this point you should have all 5 cluster nodes setup with network connectivity, the correct hostnames and access to the other nodes in the cluster.
Setting up the Worker Nodes
Now we are going to setup each of the worker nodes installing mesos, zookeeper, spark and startup script.
Firstly copy the mesos binary from you dev pc to each pi
scp mesos-install.tar.gz pi@workerXX:~
Then on each worker pi:
sudo -i cd / tar -zxvf /home/pi/mesos-install.tar.gz apt-get install libcurl4-nss-dev libsvn-dev mkdir /opt/mesos-slave-work_dir sudo -i apt-get install oracle-java8-jdk apt-get install zookeeperd vi /etc/zookeeper/conf/zoo.cfg ...add server list... server.1=worker01:2888:3888 server.2=worker02:2888:3888 server.3=worker03:2888:3888 echo "X" > /etc/zookeeper/conf/myid <---- X is 1, 2 or 3 correcponding to worker01/02/03 service zookeeper start
Next Create the startup script in /root, change the “ip” and “hostname” as applicable for the node
cd vi startup.sh #!/bin/bash mount -t nfs4 10.0.0.12:/ /mnt mesos-slave --log_dir=/var/log/mesos/ --work_dir=/opt/mesos-slave-work_dir --ip=10.0.0.XX --hostname=workerXX --master=zk://worker01:2181,worker02:2181,worker03:2181/mesos --launcher=posix --resources="cpus(*):2; mem(*):768;disk(*):40000;ports:[31000-32000,7000-7001,7199-7199,9042-9042,9160-9160]" <---- replace XX with ip address and hostname chmod a+x startup.sh
Now setup spark
cd /opt/ wget https://archive.apache.org/dist/spark/spark-1.6.2/spark-1.6.2-bin-hadoop2.6.tgz tar -zxvf spark-1.6.2-bin-hadoop2.6.tgz ln -s /opt/spark-1.6.2-bin-hadoop2.6 /opt/spark cp spark-1.6.2-bin-hadoop2.6/conf/spark-defaults.conf.template spark-1.6.2-bin-hadoop2.6/conf/spark-defaults.conf vi spark-1.6.2-bin-hadoop2.6/conf/spark-defaults.conf spark.network.timeout 500s spark.testing.reservedMemory 64000000 spark.testing.memory 128000000 cp spark-1.6.2-bin-hadoop2.6/conf/spark-env.sh.template spark-1.6.2-bin-hadoop2.6/conf/spark-env.sh vi spark-1.6.2-bin-hadoop2.6/conf/spark-env.sh ...add to bottom... SPARK_MASTER_IP=10.0.0.12
That’s the worker nodes ready to go.
Setting up master02
The master02 node is the most fidly to setup, it runs the majority of services, so there are a few steps to install, mesos, spark, nfs share and kafka.
Copy the required files to the Pi - that’s startup scripts, the mesos and kafka bianaries onto the node:
sudo -i cd / tar -zxvf /home/pi/mesos-install.tar.gz apt-get install libcurl4-nss-dev libsvn-dev
Next we need to setup an nfs share for the spark applications to use for checkpointing, details taken from http://www.instructables.com/id/Turn-Raspberry-Pi-into-a-Network-File-System-versi/
Next repeat the install of spark on master02 as was done on the worker nodes.
Now install kafka:
cd /opt mkdir kafka cd kafka tar -zxvf /home/pi/kafka-install.tgz vi /opt/kafka/kafka-mesos.properties ...remove all lines and add... storage=zk:/kafka-mesos master=zk://worker01:2181,worker02:2181,worker03:2181/mesos zk=worker01:2181,worker02:2181,worker03:2181/KafkaCluster api=http://master02:7000
Finally move that startup scripts to /root
mv /home/pi/*.sh /root chmod a+x /root/*.sh
That’s master02 ready to go.
Setting up master01
We’ll install 2 TICK stack components - influxdb 1.4 (which unfortunately doesn’t have the free clustering option) and the Chronograf. For dashboards we’ll also install Grafana.
Let’s install influxdb and create an empty database for our data:
sudo -i apt-get install apt-transport-https curl -sL https://repos.influxdata.com/influxdb.key | sudo apt-key add - source /etc/os-release echo "deb https://repos.influxdata.com/debian jessie stable" | sudo tee /etc/apt/sources.list.d/influxdb.list apt-get update apt-get install influxdb vi /etc/influxdb/influxdb.conf [http] # Determines whether HTTP endpoint is enabled. enabled = true # The bind address used by the HTTP service. bind-address = ":8086" # Determines whether user authentication is enabled over HTTP/HTTPS. auth-enabled = false service influxd restart influx -precision rfc3339 ---issue db creation command --- create database sensordata exit
apt-get install apt-transport-https curl curl https://bintray.com/user/downloadSubjectPublicKey?username=bintray | sudo apt-key add - echo "deb https://dl.bintray.com/fg2it/deb jessie main" | sudo tee -a /etc/apt/sources.list.d/grafana.list apt-get update apt-get install grafana vi /etc/grafana/grafana.ini [server] # Protocol (http, https, socket) protocol = http # The ip address to bind to, empty will bind to all interfaces ;http_addr = # The http port to use http_port = 3000 service grafana-server restart
Check http://master01:3000 to check grafana is working
Setting up the Sensor
Connect up the EnviroPhat and boot, next configure the networking as for the main cluster, I chose 10.0.0.99 as the ip.
Upload the sensor.py code
scp sensor/sensor.py pi@XXXXX:~ <--- replace XXX with the ip of your sensor pi
Install the supporting libraries and configure:
sudo -i apt-get install pip python-envirophat pip2 install envirophat sudo raspi-config <------ enable i2c bus from the interfaces option pip2 install kafka-python
Everything is ready to go
Starting up the Cluster
This is where the fun starts, Make sure all the Pis are up and running and we can start playing.
Open 3 terminal windows to worker01/02/03
sudo -i ./startup.sh
The workers should have mounted the nfs share from master02 and kick off the mesos slave daemons.
Next open 3 terminal windows to master02
sudo -i ./start-mesos-master.sh
Open http://master02:5050/#/slaves check mesos is running
Take a look at the slaves and check our worker nodes are present and correct.
Open http://master02:5050/#/ check the 3 tasks are running for the Kafka brokers
Get the Sensor Sending Metrics to Kafka
Next ssh to the sensor pi and run:
python sensor.py
On the dev pc, locate your local kafka install and read from the kafka topic with the console consumer to check messages are making it to kafka
./bin/kafka-console-consumer.sh --bootstrap-server 10.0.0.21:7000,10.0.0.22:7000,10.0.0.23:7000 --topic test --from-beginning
You’ll see the sensor data from kafka being displayed on the consumer console.
Building the Scala Application and getting ready to deploy
One of the challenges I set myself with this project was to write code in a new language, I chose to use Scala and Spark streaming (an ideal fit for mesos), much of the work is based around the learnings from a couple of books and the excellent Taming Big Data with Spark Streaming and Scala - Hands On! by Frank Kane.
I’ve put together a small spark streaming app that loads small batches of data from Kafka and pushes them to influxdb. One of the main frustrations during development was library dependency issues, I worked originally in ScalaIDE http://scala-ide.org/ submitting the app to my local spark instance using the data sources on the Pi Cluster. This was somewhat “clunky” but worked, packaging up the application to deploy to the cluster using the Scala Build Tool was also problematic, but after much wrangling I got it working. Here’s the files included for the KafkaSensor app:
To build the fat/uber jar (app, config and dependencies), install sbt from https://www.scala-sbt.org/ Set the version in project/assembly.sbt as directed in https://github.com/sbt/sbt-assembly. The dependencies are defined in the build.sbt:
There include libraries for kafka, json parsing, config management and influxdb. The job of sbt is to go and grab the nested dependencies of jars needed to build the jar and package it up for runtime.
You may want to review the runtime settings included in the project in src/main/resources/application.conf.
To build the jar run:
sbt assembly
If all goes welll you’ll have a jar file target/scala-2.10/KafkaSensor-assembly-1.0.jar
Running the Scala Application on the Cluster
Now we have compiled our application, we need to use the spark libraries on the development pc to submit the application to the cluster.
Open 2 terminal windows on your dev pc. In one window run a web server to allow the cluster to load your jar:
cd target/scala-2.10 python -m SimpleHTTPServer 8000
In the second, submit the spark application (you’ll need the spark binaries installed), replace the 10.0.0.9 with you dev pc ip
You’ll note the teeny memory settings needed to get the app to run on the cluster (see the hadoopi project for more details on cluster memory allocation). Check the application is running on the cluster http://master02:5050/#/ watch the driver and then worker tasks be created
Select one of the tasks and look at the “stdout” output, you should see the heartbeat message
View Metrics in Chronograf + Grafana
Now the application is running, lets look at the data being collected in Chronograf, point the browser at http://master01:8888/
Use the data explorer to quickly find the metrics and knock up a quick graph.
Next we need to set up Grafana, make sure the server is running on master01 and restart it if not
sudo -i service grafana-server restart
Log into Grafana by hitting http://master01:3000/, login as user “admin” password “admin” (change these at your leisure)
Next set up the influxdb data source to http://master01:8086
Then import the “grafana/EnviroPhat.json” dashboard.
And at long last view the dashboard and metrics
Stopping the App and Cluster
To kill the spark app, much like how it was launched run:
Then terminate the running processes for the kafka framework and mesos master by hitting ^C on each terminal window on master02, also do this on the mesos slave processes on each worker.
Summary
As you’ll have seen this is very much an experiment, the original plan to build a SMACK stack hasn’t come to fruition. The cluster is really a massive overkill to simply recreate https://www.circuits.dk/datalogger-example-using-sense-hat-influxdb-grafana/
But in doing this as I said at the start I’ve learned a whole heap about Mesos, Spark and Scala and that was always the intention (and it turns out my soldering skills aren’t too bad either).
This project contains the configuration files and chef code to configure a cluster of five Raspberry Pi 3s as a working Hadoop running Hue.
This video shows how to set up and configure the cluster using this code.
The versions of installed Hadoop components are:
hadoop 2.6.4
hue 3.11.0
hbase 1.2.4
pig 0.12.1
hive 1.2.1
spark 1.6.2
livy 0.2.0
oozie 4.3.0
sqoop 1.99.4
solr 4.10.4
impala - not supported
Inspiration
Running Hadoop on Rasberry Pis is not a new idea, a lot of work has been done by individuals and I wanted to make sure their efforts were recognised. Their work and formed the basis of my attempts and inspired me to start the project.
My day job uses the Cloudera distribution of Hadoop, which provides a great management interface, but that means I’m somewhat shielded from the inner workings of Hadoop configuration and tooling. I decided to put this distribution together to see if was feasible/practical to run Hadoop on a cluster of Raspberry Pis, but also to get more exposure to it’s tooling and configuration. At the same time I wanted a feature rich and easy to use “suite” of tools so I based the project around Hue. I also wanted it to be easily recreateable so used Chef.
Caveats
There are some caveates to bear in mind when using this tool.
No Impala - don’t think the Pi has enough power whilst running other Hadoop components - happy to receive pull requests :-)
No Sentry - access to the data on the cluster will be based around HDFS security
Not a production reference - it’s a learning exercise, given the performance of the cluster you really wouldn’t want to run anything “real” on there.
Teeny amount of memory so only basic stuff - The 1GB of ram on the Pi is a real limitation, you’re really only going to be able to one task at a time, so go easy on it
Its slowwwwwwwww - the combination of teeny amount of RAM, 4 cores and wifi networking means this is built for speed, be realistic with your expectations!
Setup requires basic linux command line fu and understanding of network configuration - you are going to be compiling applications, running chef code and possibly doing a little fault finding, you’ll need some basic linux sysadmin skills for this.
It’s compiles and configures correctly as of NOW - there are loads of dependencies on 3rd party libraries and packages, both when compiling binaries or configuring the cluster, but things get move, deleted or just change, you’ll need be able to diagnose problems and change settings.
Ask if you want help - github issues are ideal but please be clear in what attempts you have made to diagnose and fix.
The hardware
To build the cluster you are going to need:
5 x Raspberry Pi 3s
Nylon Spacers to stack the Pis
Acrylic Raspberry Pi case for base and lid to keep the dust off
Wireless Router - I’m using a TP-Link tl-wr802n travel router
Anker 60w 6 usb port hub
6 usb cables
5 x Samsung Evo+64GB micro sd cards
Computer for administering via ssh, running a webserver and a web browser to access Hue
Just a quick note on the micro sd cards, perfromance of cards can vary wildly, I was recommended the Evo+ cards as their performance at reading and writing small files was very good, you can check the performance of yours using https://github.com/geerlingguy/raspberry-pi-dramble/blob/master/setup/benchmarks/microsd-benchmarks.sh
Making Binaries
Although most of hadoop eco system is written in java and available in binary format, there are some components that need to be built for the target architecture for performance. As the Raspberry Pi is ARM based we simply can’t take the precompiled binaries and run them, we’re going to have to compile those. These are Hadoop (with the correct version of protobuf libraries that we will also need to build), Oozie and Hue.
Install the Rasbian Jessie Lite version dated 23/09/16 onto you sd card, the development of the project was based around this version.
This version has a few “features” that are really useful, firstly it will auto expand the file system on first boot, secondly the SSH server is enabled by default (which is probably an error) but it means we can configure the Pi headless without the need of doing basic service config without the fuss of connecting a keyboard and monitor.
Compile protobuf
Locate you Pi’s ip address (via some form of network scan, or go through the fuss of connect a monitor and keyboard) then ssh to it as the pi user. then download and unpack the protobuf v2.5 source, tweak the build script so that it refers to the new location of the google test suite and build and install the libraries:
sudo -i apt-get update apt-get install dh-autoreconf wget https://github.com/google/protobuf/archive/v2.5.0.zip unzip v2.5.0.zip cd protobuf-2.5.0 vi autogen.sh
Change the references to the google test suite:
# Check that gtest is present. Usually it is already there since the # directory is set up as an SVN external. if test ! -e gtest; then echo "Google Test not present. Fetching gtest-1.5.0 from the web..." wget https://github.com/google/googletest/archive/release-1.5.0.zip unzip release-1.5.0.zip mv googletest-release-1.5.0 gtest fi
Then generate the build configuration files
./autogen.sh ./configure --prefix=/usr
Build and install protobuf
make make check make install
Compile Hadoop
We now need to compile the Hadoop binaries, download and unpack the Hadoop 2.6.4 source, tweak pom.xml so it bypasses the documentation generation as this fails on the Pi, apply the HADOOP-9320 patch and build the binary.
cd apt-get install oracle-java8-jdk wget http://apache.mirror.anlx.net/hadoop/common/hadoop-2.6.4/hadoop-2.6.4-src.tar.gz tar -zxvf hadoop-2.6.4-src.tar.gz vi hadoop-2.6.4-src/pom.xml
Disable the problem step in by adding the following to <properties>…</properties>
<additionalparam>-Xdoclint:none</additionalparam>
Next apply the HADOOP-9320 patch
cd hadoop-2.6.4-src/hadoop-common-project/hadoop-common/src wget https://issues.apache.org/jira/secure/attachment/12570212/HADOOP-9320.patch patch < HADOOP-9320.patch cd ~/hadoop-2.6.4-src/
Next install a whole bunch of build tools and libraries:
Once build package the archive ready for deployment
cd hadoop-dist/target/ cp -R hadoop-2.6.4 /opt/hadoop-2.6.4 cd /opt tar -zcvf /root/hadoop-2.6.4.armf.tar.gz hadoop-2.6.4
Compile Hue
Download the Hue 3.11.0 source, unpack it, apply the necessary patches and tweak the spark_shell.py so it defaults to 256MB for spark driver and executor memory rather than 1GB.
cd /opt wget https://dl.dropboxusercontent.com/u/730827/hue/releases/3.11.0/hue-3.11.0.tgz tar -zxvf hue-3.11.0.tgz cd hue-3.11.0
Download the patch to fix the example loading issues with 3.11.0, patch the desktop/core/src/desktop/api2.py and desktop/core/src/desktop/tests_doc2.py files:
Change the driverMemory and executorMemory defaults from 1GB to 256MB in desktop/libs/notebook/src/notebook/connectors/spark_shell.py
Then build the apps:
make apps
Then package them up:
cd /opt tar -zcvf /root/hue-3.11.0.armf.tar.gz hue-3.11.0
Compiling Oozie
Finally we need to compile oozie. Download the oozie 4.3.10 source and unpack it, we need to change the target java version to 1.8 in the pom.xml file, then build the binaries, download the ext-2.2. library (to enable the oozie web interface) and copy to the applicable folder and package the files.
Download the files and unpack:
cd apt-get install oracle-java8-jdk maven wget http://archive.apache.org/dist/oozie/4.3.0/oozie-4.3.0.tar.gz tar -zxvf oozie-4.3.0.tar.gz cd oozie-4.3.0/
Change the target java version to 1.8 in pom.xml, then set a few environment variables to stop the build process running out of memory, and build the binary:
Now download the ext-2.2 library and package the files:
cd /root/oozie-4.3.0/distro/target/oozie-4.3.0-distro/oozie-4.3.0 mkdir libext cd libext wget http://archive.cloudera.com/gplextras/misc/ext-2.2.zip cd /root/oozie-4.3.0/distro/target/oozie-4.3.0-distro/ tar -zcvf /root/oozie-4.3.0.armf.tar.gz oozie-4.3.0
Making the Compiled Files Available
Transfer the compiled binary files to you computer and start the python webserver to make them available to the Pis as you configure the cluster:
python -m SimpleHTTPServer
Installing and Configuring the Cluster
The cluster will comprise of 5 Raspberry Pis, 3 of them will be configured as worker nodes (worker01, worker02 and worker03) and have the HDFS Datnode and Yarn Nodemanager Hadoop components installed. The remaining two Pis will be setup as master01 and master02.
The master nodes will have the the Hadoop components:
master01
HDFS Namenode
Yarn Resource Manager
Hue
Hive
Pig
Oozie
Sqoop
master02
Hbase
Zookeeper
Spark & Livy
Solr
Networking
The cluster is setup to run on the 10.0.0.x network range:
The TP-Link router is setup in “WISP Client” mode so it bridges the two wifi networks from 10.0.0.x to 192.168.2.x that way I can set the cluster up so it has outbound internet connectivity from the 10.0.0.x network on via my 192.168.2.x network. The great advantage with this setup is the cluster and router can taken away from my 192.168.2.x network and use the cluster without having to reconfigure the network.
You can modify the network configuration by modifying the chef attributes or refactoring the network recipe. You may also want to add the nodes to your administering computer /etc/hosts file.
Code
The code for the projects is available over at https://github.com/andyburgin/hadoopi
Installing
Firstly write the Jessie Lite version dated 23/09/16 onto your sd cards. We will setup the Pis in turn, worker01-03 and then master02, for the final step when we setup master01 which will need all the Pi’s powered on as it modifies files on the HDFS file system. This means you’ll only need on pi connected to ethernet.
So insert the first sd card into one of your Pis, connect the ethernet cable and power it up, wait a minute whilst the Pi boots and SSH into the pi, you can determine the ip address of your pi by either using a network scanner or by connecting a monitor to the hdmi port.
Once we have ssh'ed into the Pi we need to update the system:
sudo -i apt-get update
Installed git and chef:
DEBIAN_FRONTEND=noninteractive apt-get -y install chef git
Clone the code from github to the Pi:
git clone https://github.com/andyburgin/hadoopi.git cd hadoopi
Set your wifi SSID and password as environment variables:
export WIRESSID=myssid export WIREPASS=mypassword
Then run chef against the worker01.json file
chef-solo -c solo.rb -j worker01.json
Wait for it to finish and:
poweroff
You now need to repeat these steps using a fresh sd card for worker02, worker03 and master02.
For the setup of master01 place the final sd card in the pi with the ethernet connection, setup the remaining 4 pis with the previously configured sd cards and power up all of the pis. As before ssh into the Pi you are configuring, update the system, install git & chef, clone the code, set you network SSID & password, then run:
chef-solo -c solo.rb -j master01.json
Then we need an extra chef run to install additional Hadoop components and configure files on the HDFS filesystem that runs across the cluster:
chef-solo -c solo.rb -j master01-services.json
Then ssh into all of the nodes (via their 10.0.0.x ip addresses):
poweroff
Starting and Stopping the Cluster and Installing Examples
As a one off exercise before we start the cluster lets install some test data into mysql for later testing with sqoop, we do this before we start the Hadoop services as this will require too many system resources to tun both. SSH to master01 where our mysql instance runs (this is required to hold Hue configuration data, but we can also use it to hold test data too) and run:
sudo -i git clone https://github.com/datacharmer/test_db.git cd test_db mysql < employees.sql mysql -u root -e "CREATE USER 'hduser'@'%' IDENTIFIED BY 'hduser';" mysql -u root -e "GRANT ALL PRIVILEGES on employees.* to 'hduser'@'%' WITH GRANT OPTION;" mysql -u root -e "FLUSH PRIVILEGES;"
Starting the Cluster
On master01 and master02 you will find some scripts in the /root/hadoopi/scripts folder. After powering on the cluster (you no longer need the ethernet connection as all nodes will connect to the wifi connection) start the cluster on master01 by issuing:
sudo -i cd ~/hadoopi/scripts ./master01-startup.sh
Then on master02 run:
sudo -i cd ~/hadoopi/scripts ./master02-startup.sh
Once those scripts have sucessfully run you are now ready to configure Hue.
Configuring Hue
We access hue via a web browser, point you we bbrowser your web browser at:
http://master01:8888/
The first time you hit the url you’ll be asked to create a Hue administrator user, the cluster is configured to work withe the “hduser” user, so make sure you use that with a password of your choice.
From the “Quick Start Wizard” select the “examples” tab and one by one install each of the examples. Frome the “Data Browser” menu take a look at the “Metastore Tables” to check Hive data is available and the “HBase” option to view the data there. Finally select one of the Dashboards under the “Search” menu to heck the Solr indexes are available.
Stopping the Cluster
Powering down the cluster is as easy as starting it, on master01 run:
sudo -i cd ~/hadoopi/scripts ./master01-stop.sh
Then on master02 run:
sudo -i cd ~/hadoopi/scripts ./master02-stop.sh poweroff
Finally on each of the worker nodes run:
sudo poweroff
From now on you can simply power on the cluster and run the startup scripts on master01 and master02 as root. Once you have finished using the cluster shut it down by running the stop script on master01 and master02, then running the “poweroff” command on each of the 5 nodes as root.
Hue Examples
I won’t go through every example using installed in hue, watching the vide will convey far more, but I’ll cover some of the key ones and allow you to explore. Please remember the Raspberry Pi has a teeny amount of memory and runs things very slowly, be patient and don’t submit more than one job at once, doing so will more than likely cause your jobs to fail.
Hive Examples
From the “Query Editor” menu select “Hive”, in the resulting query window select one of the example scripts e.g. “Sample Salary Growth” and hit the “play” button to submit the query.
Choose the “Job Browser” icon and open in a new tab, watch the job transition from Accepted->Running->Succeeded.
Return to the original query tab and you’ll see the results displayed in the browser.
Pig Example
From the “Query Editor” menu select “Pig”, then select the “samples” tab and then the “Upper Text (example)”, in the resulting query editor amend the 2nd line to:
upper_case = FOREACH data GENERATE UPPER(text);
Click the play button to submit the query, you’ll be prompted to specify an output path, so choose “/tmp/pigout” and click “Yes”.
Go to the job browser and wait for the two generated jobs to finish, then click the “HDFS Browser” icon and navigate to /tmp/pigout and view on e of the “part-m-????” file and see the generated text.
Spark Notebooks
Spark notebooks are one of the best features of Hue, allowing you to edit code directly in the browser and run it via spark on the cluster. We’ll try out the three supported languages Python, Scala and R (we don’t have Impala support).
Select “Notebooks” menu and open the “Sample Notebook”, we’ll need to start a session for each language, we’ll try Python first so click the “context” icon to open the context menu and then choose “Recreate” next to the PySpark option, this creates a Spark job viewable via the job browser.
When the session is ready and the job is running, click the “play” icon next to each of the 4 examples and the cluster will calculate the result and display it in the browser, feel free to experiment with the results and edit the code.
When you have finished with the PySpark examples you need to kill the session/job, so on the context menu click “close” next to the PySpark menu.
To try the Scala examples click “Recreate” next to the Scala option on the context menu to create the session and associated job. When running you’ll be able to edit Scala code in the browser and interact with the results.
After trying both examples kill the session/job by closing the session on the context menu.
Skip over the Impala examples as they aren’t supported.
Start the R session by clicking “create” next to the option and when the job/session have started edit the path in the R sample to /tmp/web_logs_1.csv The R example doesn’t used HDFS so you’ll need to SSH onto each worker and run:
Back on the Notebook click the play icon next to the code and examine the output including plot.
Finally close the session on the context menu so the spark job ends.
Sqoop example
We’re going to use Sqoop to transfer the employee sample data we installed earlier into HDFS. To do this we’ll need to configure two Sqoop links and a Sqoop Job.
Create the first link by selecting the “Data Browser” and then “Sqoop”. Choose “Manage Links” then “Click here to add one”
Then click “Save”, choose “Manage Links” again to add the second link. Click “New Link” and enter the follwoing values:
Name: hdfs out
Connector: hdfs-connector
JDBC Driver Class:
JDBC Connection String:
Username: hduser
Password: ******
We need to then edit that link to complete setting it up, click “Manage Links” the the “hdfs out” link to bring up the edit page. Make the follwoing settings:
Name: hdfs out
HDFS URI: hdfs://master01:54310/
Finally click Save.
Next we need to configure the job, select “Click here to add one” and fill out the following:
Step 1: Information
Name: import employees from mysql to hdfs
From Link: mysql employees in
To Link: hdfs out
Step 2: From
Schema name: employees
Table name: employees
Table SQL statement:
Table column names:
Partition column name:
Null value allowed for the partition column:
Boundary query:
Step 3: To
Output format: TEXT_FILE
Compression format: NONE
Custom compression format:
Output directory: /tmp/sqoopout
Start the job by clicking “Save and run”, navigate to the job browser and wait for the job to be submitted and run sucessfully. Then Navigate to hdfs browser and check the data in “/tmp/sqoopout”
Ozzie Examples
There are many Ozzie examples installed, so I’ll only talk about a few of them here. Lets firstly run the Shell example. Select “Query Editors” from the menu and the “Job Designer”, next click the “Shell” example, you’ll be presented with the job editor. I for scroll down you’ll see a parameter to the “hello.py” command is “World!”.
Click the “Submit” button (then confirm) and you’ll be presented with the workflow view of the job, wait for the job to run and and select the log icon next “hello.py”. In the resulting log you’ll see the phrase “Hello World!”
Next let’s look at amore complicated workflow, from the menu select “Workflows” -> “Editors” -> “Workflows”, on the resulting page select “Spark” and you’ll be presented with the job configuration page, for the spark workflow to run we will need to configure the settings for spark, click the “pen” icon to edit the job, next click the “cogs” icon on the spark step to access the steps properties. In the “Options List” filed enter:
Click the “disk” icon to save the changs to the workflow and click the “play” icon to submit the job, enter “/tmp/sparkout” into the output field and hit “Submit”. Again you’ll be presented with the workflow view, wait for the job to finish and then use the hdfs browser to navigate to the “/tmp/sparkout” folder and view one of the data files to check the spark job copied the intended files.
Solr Dashboards
Select “Search” from the menu and then each of the dashboards in turn (Twitter, Yelp Reviews and Web Logs).
You’ll see each present a variety of charts, graphs, maps and text along with interactive filtering features to explore the data held in the solr search indexes.
Reminder, Power Off the Cluster
Once you have finished experimenting with the cluster please don’t just turn off the power, you rick corrupting your data and possibly the sd card. So follow the procedure of running the shutdown scripts on master01 and master02, then running the poweroff command on each of the pis before turning the power off.
Wrapup
I hope you have fun setting up and playing with the code, I learned a tonne of stuff setting it up and I hope you do too. If you have any improvements then please send pull requests to github, any problems and I'll respond to issues there too.
One of the challenges of playing with log data is you rarely have a fresh supply of a high "production" volume logs to hand, especially in a test/development environment.
I recently stumbled upon a nodejs tool called nodejs-logreplay by Adam Lundrigan (you can find his blog over at http://adam.lundrigan.ca/) which takes an existing log file and relays it in realtime against a webserver. You can try this out by cloning https://github.com/adamlundrigan/nodejs-logreplay All the usage notes are already contained in the Readme.md but to summerise:
edit config/test.json setting the source parameter to point to an existing log file and the target webserver.
node replay.js test
This will read the existing log file and replay the requests in sequence in time (obviously you'll be missing the original posted data).
This is great BUT I was looking for something to simulate the creation of the log file, here the target webserver sees all the generated traffic from a single source, consequently this is reflected in the resulting access log file, which isn't quite what I wanted.
I thought, wouldn't it be great is Adams' tool rather than sending the requests to a webserver wrote it to a file in realtime with the existing request data from the log (e.g. source ip) ? So I forked Adams project, and replaced his http request code with the time shifted requests being written to file instead.
The code is available at https://github.com/andyburgin/nodejs-logreplay it uses an additional dateformatter library, but I've included a package.json file so you can run "npm install". I've also changed the config file format so you can pick an output file location:
vi config/test.json
{
"output": "/home/andy/rewrite.log",
"source": "/home/andy/input.log",
"speedupFactor": "10"
}
You need to run "node replay.js test" and to see the resulting file open another terminal and run "tail -f /home/andy/rewrite.log" (change path accordingly)
As an example rather than tailing our new logfile to the console, lets pipe it through maptails and watch the "War-Games-Esqe" animation.
Experiment 2 - Vagrant Chef setup for Beaver, Logstash and Kibana
I was never completely happy with how I'd left Experiment 2, there were too many "ignore this error" and "as a work around you'll need to...", I'm please to say these are now gone.
Recap
For those of you that are new to experiment 2 (or indeed my blog) it's an attempt to allow anyone to simply grab vagrant and quickly have a working Logstash server with Kibana web interface up and running with an accompanying machine to ship logs to to it (via beaver). All the configuration work has been done and apart from setting some ip addresses to suit your setup you should be up and running comparatively quickly (depending upon you machine spec and internet bandwidth.
Fixing, Fixing, Fixing
So after much refactoring and fixing:
Using hashicorp Ubuntu 12.04 LTS box from https://vagrantcloud.com/
Chef 11 hack is gone - now using the chef omnibus installer plugin
apt repository update workaround gone - now using the chef omnibus installer plugin
Apart from setting ip addresses in the Vagentfiles the networking quirks are gone
Beaver install twaeks no longer needed
The false Apache test errors have now gone
No need to manually autostart kibana - that's working now
UFW config much simpler
GeoIP and Maps working in Kibana - woot!
So head on over to https://github.com/andyburgin/burginsdatathing/tree/master/experiment02 grab the files and read the README.md - that lists all the prerequsites and versions of tools etc used.
That's a wrap!
As it's feature packed and I've removed a the "quirks" and configuration "idiosyncrasies" I think I'll park Experiment 2 for now (at least until the chef cookbook supports 1.4.x version of logstash).
Anya is live and ready to show you everything. Watch her strip, dance, and perform exclusive shows just for you. Interact in real-time and make your fantasies come true.
✓ Live Streaming✓ Interactive Chat✓ Private Shows✓ HD Quality
Anya is LIVE right now
FREE
Free to watch • No registration required • HD streaming
es2gexf - extract logstash and elasticsearch data to a gexf file
In my last post http://data.andyburgin.co.uk/post/65706647269/visualising-logstash-apache-data-in-gephi I generated some pretty visualisations of apache webserver logs by extracting data from my logstash elasticsearch server using a python script.
You can now grab the script from https://github.com/andyburgin/es2gefx
I still need to add command line arguments, but the parameters in the code are:
host & port - the ip and port number of elasticsearch url
starttime & endtime - in format %Y%m%d%H%M%S includes all entries in elasticsearch within these timestamps
evttype - the value of the @type field that indicates an elasticsearch entry generated logstash that match the %{COMBINEDAPACHELOG} grok pattern
relatedfield & relatetimeout - field and time period to be used for edge detection •verbose - debug messages
It was built using Python 2.7.3 on Debian Wheezy vm, you will need to download the latest https://github.com/paulgirard/pygexf don't use the easy_install method as this will install 0.2.2 which is missing some of the newer features needed. You just need a to put the files in a gexf folder next to the es2gefx.py file.
The tool has limitations, there is a limit to what the xml libraries underlying pygexf can do. I've been generation 25000 nodes and 18000 edges without problems.
BTW It's my first "real" stab at Python development, so I encourage everyone to take a look and feedback (nicely).
I've been wanting to create a data visualisation in Gephi for a while, so using the stack I've built in Experiments one and two I made this...
Look Shinys!
How did you do that ?
Details, Details, Please ?
Gephi is a network graph manipulation and visualisation tool, it lets you create a series of nodes (e.g. people) and edges to define their relationships (e.g. person A likes Person B), thats graph theory at it's simplest level, so please check out the cousera course and youtube videos for a much more in depth overview of what Gephi and graph theory are:
http://gephi.org/
https://www.coursera.org/course/sna
Sebastien Heymann Exploratory Network Analysis with Gephi
http://www.youtube.com/watch?v=Y7Ah6VylIak
http://www.youtube.com/watch?v=L6hHv6y5GsQ
http://www.youtube.com/watch?v=ejzrR6RupNA
http://www.youtube.com/watch?v=Mr574c4bORU
There are two issues I have with Gephi, the first is it easy to import your facebook/twitter data and make a nice network graph and post it on a blog, thus allowing you to claim you are a data scientist and data visualisation expert - trust me you're not. This is to the data analyst what comic sans is to a graphic designer. By doing this you're missing all the network analysis maths and calculations that can tell you something about the graph and draw facts about what you are looking at. The second issue is Gephi is so damn nice once you've studdied network analysis it's a bit like the old saying "when all you have is a hammer, everything looks like a nail" - with Gephi everything starts to look like a network graph (and sometimes it just isnt).
Nail Hunting
Now I have my logstash server collecting log data, tokenising it and chucking it into elasticsearch I thought it would be good to see if I could get the data into a network graph (yes I know I nail hunting). I seem to have a tonne of apache log data and I thought looking for relationships between the "hits" might show me more than I can see already in the raw logfiles or Kibana.
In/Out shake it all about
Challenge number one was how do I get data out of elasticsearch, build the relationships and then get that into a format that Gehpi can read ? I first looked at Java, found a library that could talk to elasticsearch and installed netbeans. After several evenings of teeth gnashing and saying rude words I decided I was loosing focus on what I was trying to do and instead fighting with configuration and environmental issues with my setup. I decided to try another route as to br quite honest it wasn't fun.
Next I tried python using the elasticsearch rest API - boom, after a couple of hours I had json data that was stored by logstash showing in the console, I implemented the "scroll" search and all was rosie despite the single but obvious problem - how do I get the data from python to a gephi format? A quick search found https://github.com/paulgirard/pygexf which can not only write a gefx file but supports a lot of the modern Gephi features e.g the timeline (by specifying a lifespan for a node). A word of note here, the version in the python package manager is very out of date and has several features I needed missing, so I had to manually download and use the latest version from github.
Time and Relationships
I started to generate gefx files with time based nodes - but with no edges. At this point I realised that I needed to stop feeling so smug about my accomplishments. Unless I could find meaningful relationships in the data then this exercise was a bit pointless. I scratched my head and thought about what the data could show us, I had a series of sequential hits from a webserver, the obvious thing I could do is relate them by client! So using the client ip field (which is a bit iffy as there could be multiple clients behind a firewall looking to the webserver like a single user) I could relate the nodes - I got me some edges!
Too Much Too Young
I got a little carried away and started to generate gefx files with a 100000 nodes and many more edges between each node, this was fine until the underlying xml library behind pygexf would report "Killed" when I tried to write the file - I decided to scale back my ideas a little and I limited the results to about 25000 nodes, I also settled for a single edge per node (the next hit in the timeline with a matching ip address, within 5 seconds). This should give me a chain of hits for each "page view".
Driving Gephi
Once the data was in Gephi I pretty much use it for visualisation, partition (colour) based on ip address, rank (size) based on response size, enable timeline and select a timeframe, finally apply a layout (force atlas 2) - I haven't done much with filtering and I've yet too look properly at the statistics.
So hold on, I know what you might be thinking - you've just made pretty stuff ? isn't that a bit contradictory ? what did I say about posting your twitter relationships on a blog ? comics sans ? There is an ounce of truth in that, but what I haven't shown on the video is what you can find when you explore the graph (in Data Analyst words "Exploratory Visual Analysis"). For example by scrolling along the timeline I can see bots, lots of them. I can see when they visit, what they are looking at, it's really useful and I've learnt a lot about the traffic to the servers, now yes I could have done some of this in Kibana or reading the logfiles, but here I can see it standing out like a sore thumb, but there's much more to find using the statistical tools and features of Gephi.
Summary
So it's a good starting point, there are some obvious limitations to the work so far. A bit of hammer and nails still rings true, after all I am trying to force linear time based data into a network, I think there's still a relationship I haven't spotted to give me the light bulb moment. I'm only using the apache log data, not the error logs nor syslogs, I suspect these will add real value. On a practical level I may need to change how I generate the gexf file to get more data into it, perhaps an interim store such as neo4j. When I've got this sussed I'm sure the gephi streaming API will be amazing, I expect to be making a logstash output for this generating the graphs in real time - soon my friends, soon...
Just to recap, I chose Beaver to ship my logs as it's written in python which comes preinstalled on all the cloud servers I use, it has a small runtime footprint and using the logstash shipper would require java to be installed and consume a not insignificant chunk of memory. Another advantage here is Beaver can create an ssh tunnel to the logstash server and make the redis server look like it's local. So you have the data encrypted as it transports it and Beaver will reconnect automatically if the tunnel breaks.
In the below instructions I'm going to make modifications to our wordpress demo, this is just for illustration, if you get it all up and running then when you relaunch the vm with either "vagrant up" or "vagrant provision" it will overwrite the changes. Unfortunately at present there's nothing in the logstash cookbook to configure the beaver ssh related options, so see this as more of a proof of concept than how to rollout for production.
On the Logstash Server
Firstly add a user and generate a key and store it
Firstly the syslog messages are just stored as a single text message, this needs to be tokenised, so lets add a new filter to the logstash config in the Vagrantfile:
Note that by default the ufw cookbook I am using is opening port 22 to the world, you may want to change this by forking the cookbook and removing that code.
I did try and get the geoip filter working, but despite best efforts no luck, if the logstash cookbook gets upgraded to support version 1.2.x I suspect that will be quite easy.
In latest rollout it looks like the apache2 cookbook is throwing some errors relating to the security conf having the wrong server tokens, I believe this is a fault with the cookbook.
Play along at home
I’ve updated the vagrantfiles and berkshelf files on my github account
Anya is live and ready to show you everything. Watch her strip, dance, and perform exclusive shows just for you. Interact in real-time and make your fantasies come true.
✓ Live Streaming✓ Interactive Chat✓ Private Shows✓ HD Quality
Anya is LIVE right now
FREE
Free to watch • No registration required • HD streaming
In experiment one I used chef to setup a logstash server which threw incoming log data from a wordpress stack into elasticsearch. This was then interrogated and analysed using the wonderful Kibana interface.
I'd be stretching the term "visualising the data" with Kibana, as pretty and functional as it is, it's job isn't to create informative and immersive graphs of the data. To see data trends I could use something like a centralised Munin server, but that's not going to give me information about the logfiles, just performance metrics and doesn't make use of all the wonderful log data sat in elasticsearch.
However logstash does come with an output interface to graphite, I've not used graphite before and it's high time I had, so here's what I did... ...but hang on a moment, the observant readers of this article will say "excuse me, graphite receives the data straight from logstash, it doesn't talk to elasticsearch" and you'd be right, but I have some future plans to do that.
It's Just a Graph
So let's have my take on what graphite is, please feel free to correct me in the comments.
Graphite is a there to render graphs from the data/events it's given, it keeps an aggregate store of this data so that it can provide an historical view. But just to be clear it does not store all of the events, just a running total. There's some basic dash boarding built in, but it's main job is to render graphs, that's all it really does. If you think you can just point a logstash output at graphite and it will render wonderful graphs you're very much mistaken, there's some setting up to do first.
When you send data to graphite you send an event, which comprises of 3 bits, and event name, a timestamp and a value (numeric not string). You can then get graphite to render a graph of selected event names over a defined time period. You can also choose what type of graph, e.g. line vs stacked.
That's a really basic overview and there are loads of other features for customising the graphs, please see the graphite docs for more info.
Let's Cook
I'm going to use the existing experiment one vagrant files, there's no need to change the wordpress definitions as that's going to just keep sending log data, it doesn't need to know what happens to it once it hits logstash.
So first off I thought I'd clear out the contents of my Berkshelf so I've got the latest cookbooks, I stopped any running VMs and delete the contents of the berkhelf folder (for me it's C:\Users\username\.berkshelf\*). Then to add the graphite cookbook to the berkshelf file for the logstash server and use the cookbook, next add the recipe to the chef solo config along with a graphite output for logstash - easy ? no.
The Mangled Cutlery Drawer
In Experiment one I wrote at length about the "quirks" of trying to get the chef recipes playing nicely together, this isn't helped by my use case of wanting all machines on my LAN subnet (so needing eth1 and the associated binging of servers to that interface). So I'm still dealing with that complication and the forced chef 11 upgrade for the aging precise vagrant base box I'm using.
But here's some of the other "Lessons Learnt"
One of the cookbooks uses a "hostsfile" recipe, which has been updated to version 2 and now doesn't work for me. I had to hardwire to an older version in my Berkshelf file.
If you decide to try out kibana v3 rather than the one supplied in the logstash cookbook, you have to place the kibana attributes definition before the logstash attributes - looks like if an attribute is already set in the json it can't then change it - so ordering in the json matters.
I spent way to long trying to get the "head" plugin for elasticsearch to work, might revisit that in the future, I just expected it to be easy.
Home Improvements
In addition to getting graphite up and running I added a few improvements. I decided to add the preview release of Kibana v3 - completely new interface (even includes a map). It's really sweet looking and it lets you customise the layout by defining a series of panels and what data is shown on each, I can see some lovely dashboards being made with this (wonder if you can add a graphite graph ?) hope it's ready for release soon.
Earlier I described a graphite event as having 3 bits - name, timestamp and value. Here's where I got a bit stuck with how I had logstash configured, basically I no numeric data, everything from the syslog and apache logs was stored as strings. As the apache logs are of most interest I needed to get them converted from the apache combined format into the constituent bits. Luckily for me logstash has this filter built in so I tweaked the config to include it so I could graph the response size for each log entry in graphite.
The readme.md gives full instructions on how to get the servers up and running.
Final Thoughts
I'm starting to see the benifits of filtering the messages, not only because I can start viewing them in graphite, but in Kibana I can be more selective in what I'm looking for, which will help tracking down problems in the future.
I was hoping to make a series of graphs and not just the my measly single one I've produced, but you can only graph the data you have. The thing to remember about graphite is it just makes graphs, you need to tell it what's on those graphs. It's only going to start being useful or "look cool" when it's configured. The real usefulness will come when it's integrated into a dashboard or NOC display, and with all graphs you need to understand how to interpret them, you'll also need to remember to look at them too.
At some point in the future I'm going to play with the geoip filtering in logstash so I can use the map in Kibana. To do that I'm going to have to find some public apache log files and look at "back filling" in logstash, but I think for the next experiment I'll do some logfile data visualisation.
Experiment 1 - Vagrant, Chef, Elasticsearch, Logstash and Kibana
Time to roll my sleeves up and get stuck in with making something. Part of my day job involves being a sysadmin, looking after servers, deploying releases of code and on the odd occasion troubleshooting. It dawned on me that the servers are constantly keeping track of events on the system and applications, by generating logfiles, basically that's a lot of data when you have a few servers. What I'm doing in this experiment is setting up a centralised logging server and using a web frontend to view and interrogate the collated logs. I'm going to use the devops notion of configuration as code by using chef to provision servers.
The initial idea here was to spin up a virtual machine using Vagrant and provision it using Chef solo to install Logstash. Next I'd create another VM (again using Vagrant and Chef solo) to run apache and send it's logs to the first VM. It seemed fairly simple in principle, the Chef cookbook for logstash already existed and the "stack" is established and described in the accompanying docs http://logstash.net/docs/1.1.13/tutorials/getting-started-centralized
When I started the experiment I'd only really used Vagrant and Chef to spin up a couple of basic VMs with only a "few moving parts", simple stuff, not the sack a described above. However I had done tonnes of reading and taken note of the advice and wisdom discussed in many episodes of http://foodfightshow.org/ (if you use Chef this is by far the best practical hands on knowledge you can get).
So despite my plan being somewhat ambitious I wanted to try to keep things simple, I was potentially trying to learn a lot of new technologies at the same time. The problem with learning technologies (especially for those that are in their infancy when compared to other established tech) is they all have their own little "quirks" and "gotchas" as part of their learning curve, these could cause one to stumble and trip along the way. I'm rather glad I wasn't more ambitious as it turned out not only did I stumble, I sometimes ended up flat on my face.
Vagrant Up!
To get started I decided to use "Ubuntu 12.04 LTS server" which simplifies things because I could use the stock precise64 vagrant box, so getting the basic box up and running was as simple as...
install vagrant to c:\vagrant
create a folder for the vm in c:\vagrant\logstashdemo
run "vagrant init" in the new folder
tweak the config vagrantfile
run "vagrant up" in c:\vagrant\logstashdemo
Wait for the vm to spin up
"vagrant ssh" and look around
logout and run "vagrant halt"
..fun eh ?
OK Logstash, here I come ready or not
To install logstash I decided to use Chef Solo, that way there's no need to set up a chef server and might be a good way to set things up at the day job. So following the instructions I download the cookbook, set the recipe run list in the vagrantfile and boom! cookbook dependency hell. Each cookbook requires a different cookbook, so I download that only to find that another 3 cookbooks are needed etc. After some reading it turns out Berkshelf is my new best friend, especially the vagrant plugin. I download and install that, tweak the config to use berkshelf and yay the dependencies resolve themselves - thank you berkshelf.
Next I hit a few distribution and housekeeping issues, I like to use bridged or NAT networking in virtual box, so machines appear on the same LAN subnet as my machine, The vagrant box images already have an eth0 adapter for themselves with their own subnet, specifying "config.vm.network :public_network" creates new eth1 nic, which is fairly inconvenient. To setting the IP address on eth1 was a problem as all the recipes that tweak network settings only seem to work with eth0. After much gnashing of teeth I decided that this was blocking me from actually making progress with Logstash, so I took the decision to "bodge" the network config by running a shell command once the box is up - yep it's a dirty hack but it works...
...I now discover some of my cookbooks need chef, which is vexing as the vagrant base box comes with chef 10 installed. It also appears that some of the apt repositories need refreshing before the recipes run, so I force the a quick update:
I hope to revisit this one day to remove the hacks when I have more experience with chef, I suspect I need to extend one of the recipes that manage network settings to manipulate eth1.
Cook me some Logstash
I base my Logstash install on the vagrantfile supplied with the cookbook at https://github.com/lusis/chef-logstash/blob/master/Vagrantfile
I modified the vagrantfile to add redis to the chef runlist, be warned the first run of provisioning takes an age, there's a lot of software to download and install.
It's not surprising that with so many different components it didn't work out of the box. To get it all cooking (see what I did there) I worked on configuring each component one by one, getting them running and then talking to the each other, so...
Redis -> Logstash -> Elasticseach -> Kibana
By far the longest part of this project was getting attribute values right for each component and getting their syntax correct in the chef.json. This meant a lot of provisioning, checking the underlying component config files and checking the right values had gone in. This would also have been easier if I'd been more familiar with setting up the software components.
However all the hard work has paid off the chef.json attributes are really neat, it's easy to read and understand how the installed components are setup. In fact it looks so simple it doesn't show really show how much effort was needed to get it so tidy.
Having said that there's still one glitch, kibana isn't starting at boot up so it needs a lttle "sudo service kibana start" (I really should report it as a bug or look for a chkconfig recipe to fix it).
Can I have a little wordpress
Next I need some logs to send to the logstashdemo server, I usually work with webservers so I thought I'd knock together a simple wordress site and send the logs using a logstash shipper. I decided not to use the shipper included in logstash as I'd need to install the Java JVM and I'd like a small footprint (both disk space and memory). I choose a python based shipper called "Beaver" (all my boxes have python 2.x installed by default) and it will quite happily talk to redis.
As with the logstashdemo server I hit problems with networking and chef 10, so I'm using the same hacks. Again the real hardwork was getting the syntax right for the shipper's config in the chef.json - trial and error (mostly error) was the key, mostly fighting with the creation of the input array for beaver. So the syntax is slightly more bulky, but it looks simple and is readable.
More fun ensued when trying to get both VMs running at the same time (which is the point). It turns out that berkshelf creates a folder for the VMs to read their cookbooks from, if you run two VMs at once there is a conflict and one of the VMs will moan and fail during provisioning because some of it's cookbooks are missing. After much head scratching I discovered the this was a known issue with the vagrant berkshelf plugin and a fix had been submitted. I'm please to say it works and has saved me from rethinking my approach to the whole project.
Conclusion
So I have a simple solution to crank up 2 vms, let themselves provision and configure. Then point a browser at kibana, a browser at wordpress. As if by magic I can view and interogate the log data using Kibana - that's pretty cool, especially as the vagrantfiles for both give you a clear understanding of what's installed and what each component does.
There are a few things I'm still a little confused about, I expected to have to use "knife" and some of the other Chef configuration files (roles and environments). Not even seeing these makes me think I might have missed something ? I expect when I write my first recipe things will become more involved.
The moral of this story is try not to learn loads of new stuff all at the same time, especially when using some technologies which are in the grand scheme of things still new. But in the end I learnt loads and really pleased with the end results,
Play along at home
I've uploaded the vagrantfiles and berkshelf files to my github account
https://github.com/andyburgin/burginsdatathing/tree/master/experiment01
The readme.md gives full instructions on how to play along at home. So have a go, improvements and fixes (especially to the dirty hacks and the non starting Kibana) welcome.
One final note, if you do use any of the config in a production environment, make sure you have iptables or ufw up and running along with any other hardening/security precautions you want to take e.g. you should run the beaver -> redis connection over an ssh tunnel.
Credits and References
I really should give some credit to the resources I used to put this experiment together:
The Logstash cookbook - https://github.com/lusis/chef-logstash
Much wisdom on the Food Fight Show -http://foodfightshow.org/
Good starting point for the "stack" - http://logstash.net/docs/1.1.13/tutorials/getting-started-centralized
Good overview of getting Logstash and Elasticsearch up and running - http://devopsanywhere.blogspot.co.uk/2012/07/stash-those-logs-set-up-logstash.html
Everything you wanted to know about getting Beaver, Redis and Logstash to play nicely - http://josediazgonzalez.com/2013/01/01/setting-up-beaver-for-use-with-logstash/
Nice bit about Logstash alternatives and why it wins - http://dfwarden.blogspot.co.uk/2012/10/the-long-road-to-logstash.html
As the new year started I decided I'd learn something new, I've always been interested in "data" and decided it was time I learnt something about "Data Analysis".
I'd already discovered a really good course over on coursera for "Social Network Analysis", I'd been looking for a more "structured" way to learn about Gephi (which is awesome check out http://gephi.org/) and this fitted the bill - https://www.coursera.org/course/sna . But I started the course as it was about to finish, so although I worked my way through the 6 weeks of lectures using Gephi and NetLogo I never got the qualification as I didn't have time.
This course is an applied statistics course focusing on data analysis. The course will begin with an overview of how to organize, perform, and write-up data analyses. Then we will cover some of the most popular and widely used statistical methods like linear regression, principal components analysis, cross-validation, and p-values. Instead of focusing on mathematical details, the lectures will be designed to help you apply these techniques to real data using the R statistical programming language, interpret the results, and diagnose potential problems in your analysis. You will also have the opportunity to critique and assist your fellow classmates with their data analyses.
So the course started pretty well, there was tonnes of information and a shed loads of stats (not something I'd done much of). I really liked the way I felt like I was learning a craft, for example here's "The Method"
Define the question
Define the ideal data set
Determine what data you can access
Obtain the data
Clean the data
Exploratory data analysis
Statistical prediction/modelling
Interpret results
Challenge results
Synthesize/write up results
Create reproducible code
However, I soon found I was learning more stats than I anticipated. Which really wasn't what I wanted to do, but I was nearly half way through and decided to stick with it despite the 5 hours a week it was taking (doesn't sound much but on top of a family and day job I just had no spare time). In addition to the lectures and coursework I had two written assignments to do - involving cleaning, exploratory analysis, modelling/prediction, partitioning data samples to prove/disprove then refine. Not done anything like it since collage (OK maybe my Prince 2 exam).
Despite the hard work I'm still glad I did it, but was hoping to produce more visual stuff in R than the odd line and plot graphs. In the end I got 88.5% (basically a mark off a distinction) I actually quite proud.
In away it's been the thing that's kick started this blog I've got some relevant Data Analysis skills and the Sysadmin background. Should be able to combine the two.
Let's start at the beginning, a very good place to start
I'm putting this blog together for a number of reasons...
Learn lots of new stuff
Provide a record of who,what,why,when,where
Let whoever is interested play along at home
The Day Job
It's impossible for me to separate many of the things I'll talk about on this blog from the "day job", I love my day job, but I really don't have the RandD time I'd like to have to try out many of the tools and techniques.
I see what I do here as mutually beneficial, I get to scratch my itch (play with the latest tools and techniques), the result can be applied at work, albeit in the context of the work environment (basically considering the performance, robustness and security which as a professional I should).
How does the proverb go? - "Find a Job You Love and You’ll Never Work a Day in Your Life" well that's the point really.
Scratching eh ? So what's the itch ?
Well as I said above the day job doesn't really let me have the RandD time I'd like, and there is some stuff I'd like to do - that's part one of the itch. Let's apply an encapsulating term to this, maybe even a buzzword, let's say "Devops"
Part two of the itch, well I see lots of data on a day to day basis, I'm intrigued by the current buzz word of "Big Data" - as I see it getting data, cleaning it, analysing it and then Visualising the data
So that's the point, combining devops techniques to setup the platform to analyse data and make something pretty that explains or highlights patterns in it - that's a good enough starting point I think, things will probably evolve in different directions but we have to start somewhere.