Using the ELK stack to examine data

Summary

ELK is an acronym that stands for Elasticsearch, Logstash, and Kibana. The three are very frequently used together to collect, parse, and view large amounts of data.

Elasticsearch is a NoSQL database with a very nice RESTful API in front of it, which makes it very easy to put data into and get data out of. It is also very easy to set up a cluster of servers, which makes searching even faster because the data and the queries are spread out among all the nodes in your cluster.
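To get a feel for how friendly that API is, here is a quick illustrative sketch (not part of the repo): it assumes an Elasticsearch listening on localhost:9200, the test-index name and the document body are made up, and newer Elasticsearch versions use _doc where the older index/type/id URL style below uses a type name.

# index (create) a document
curl -XPUT 'http://localhost:9200/test-index/doc/1' -H 'Content-Type: application/json' -d '{"greeting": "hello", "value": 42}'

# fetch it back by id
curl 'http://localhost:9200/test-index/doc/1?pretty'

# or search for it
curl 'http://localhost:9200/test-index/_search?q=greeting:hello&pretty'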

Logstash is a data-processing pipeline. The pipeline has modules that run in three phases: the first phase is the input phase, and there are many modules that let you load data from various sources. The second phase is the filter phase, which lets you transform and filter your data. The final phase is the output phase, and like the input phase there are many modules that let you write the data out in many formats, one of which is to output directly to Elasticsearch.

In this post, I'll walk through setting up a small Docker environment using Vagrant and then using a Docker image to start up a complete ELK stack. Once the environment is up, you'll be able to log in to the container, load CSV data into Elasticsearch via Logstash, and then view the data with Kibana.

Source code to follow along

If you want to follow along with this example, you can clone the git repo that has all the files you need:

git clone https://github.com/analogpixel/longmontDataScience
cd longmontDataScience/elk


Vagrant

To get Vagrant working, you just need to install Vagrant and VirtualBox. Once that is installed, you can cd into the elk directory and run vagrant up. If everything works correctly, you should see Vagrant download an image, start it, and then build the machine (after the machine is up, you can run vagrant ssh and it will log you into the virtual machine). There is a lot of voodoo going on here that I'll explain now:

After Vagrant has started your virtual machine, it is going to run a script that is embedded in the Vagrantfile:

apt-get update
apt-get install -y puppet
cd /vagrant
puppet module install puppetlabs-apt
puppet apply server.pp


This script tells the machine to update the apt repo cache, install Puppet, install the puppetlabs-apt module, and then run puppet apply to finish configuring the machine.

Puppet

The server.pp file is a Puppet manifest that configures the apt repos for Docker, installs Docker, and puts all the support files into the correct place. At the end of the file you can see that it also runs the process that transforms the CSV data we were given into something that works better with Logstash.
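If you ever tweak server.pp and want to re-apply it without rebuilding the VM, you can re-run the provisioning step; this is standard Vagrant and Puppet usage rather than something specific to this repo:

# from the host, in the elk directory: re-run the embedded provisioning script
vagrant provision

# or, from inside the VM (vagrant ssh): re-apply just the Puppet manifest
cd /vagrant && sudo puppet apply server.pp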

The docker ELK container

By this point Vagrant has run and Puppet has run (hopefully), and your virtual machine is ready to pull down the Docker image and run it. To do this you can run the start.sh script in the elk directory, or you can just run:

sudo bash
sysctl -w vm.max_map_count=262144
docker run --rm -d -m 4g -p 5601:5601 -p 9200:9200 -p 5044:5044 --name elk -v /data:/data sebp/elk


This will download the ELK container (if you don't have it already) and start it up with all the ports mapped and the container's /data volume mapped to the /data directory on the VM. The sysctl line raises vm.max_map_count because Elasticsearch requires at least 262144 to start. Also, when Vagrant started, there was a line in the Vagrantfile that mapped the Kibana port through to your host operating system:

server1.vm.network "forwarded_port", guest: 5601, host: 5601


So, once you have given the container a little while to start up, you should be able to go to http://localhost:5601 and see Kibana come up.
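If the page doesn't come up right away, a couple of sanity checks from inside the VM (not part of the repo's scripts) will tell you whether the container and its services are ready; run them as root (sudo bash), the same way the container was started:

docker ps --filter name=elk      # the elk container should show a status of "Up"
docker logs elk | tail -n 20     # startup logs for Elasticsearch, Logstash, and Kibana
curl -s http://localhost:9200    # Elasticsearch returns a small JSON banner once it is ready
curl -sI http://localhost:5601   # Kibana returns an HTTP status line once it is listening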

Logging into the container

We are getting to an inception moment here: you are running Vagrant, which started a virtual machine on your host machine; now you are logging into that virtual machine, and from there you are going to log into a container. From the virtual machine, run:

docker exec -it elk bash


This will connect to the container that was named elk (--name elk) and run bash, allowing you to log in. Once in, change to the /data directory, because that's where all the fun is going to take place.
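Once you are inside the container, you should find the files the Puppet manifest dropped into /data; the exact listing depends on whether the transform step (covered below) has already run:

cd /data
ls    # expect something like: csvPipe.conf  data.csv  output.csv  transform.py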

Logstash

As stated in the summary, Logstash is a pipeline that takes data in, does something with it, and then exports it. This configuration is stored in the csvPipe.conf file and looks a lot like this:

input {
  file {
    path => "/data/output.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  csv {
    separator => ","
    columns => ["DateTime","AIRFLOW","INBOARDTEMP","INBOARDVIBRATION","OUTBOARDTEMP","OUTBOARDVIBRATION"]

    convert => { "AIRFLOW" => "float"
                 "INBOARDTEMP" => "float"
                 "INBOARDVIBRATION" => "float"
                 "OUTBOARDTEMP" => "float"
                 "OUTBOARDVIBRATION" => "float"
               }
  }

  # https://www.elastic.co/guide/en/logstash/current/plugins-filters-date.html
  # Joda-time tokens: M = month, d = day, yy = two-digit year, HH:mm:ss = time
  date {
    match => ["DateTime", "M/d/yy HH:mm:ss"]
    target => "Date"
  }
}

output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "blower-data"
  }
  stdout {}
}


The interesting parts here are the convert and date sections. Without them, the csv filter would load every column as a string, and when you got to Kibana you wouldn't be able to do much with the data. The convert block tells Logstash to take each column listed and convert its type to a float (other types are available). The date block looks at the DateTime field pulled in from the CSV file, matches it against the given date format, and writes out a new field called Date that holds the parsed date and time. Assuming the data uses month/day/two-digit-year followed by a 24-hour time, a merged value such as 4/7/15 10:30:00 lines up token for token with the pattern M/d/yy HH:mm:ss.

Before we can do the date magic described above, we have to do a little data wrangling, because the date and time were given to us in a format (two separate columns) that doesn't work well with Logstash. If you look at the transform.py file, you'll see the Python code:

import csv

# read the raw data we were given and write the merged file logstash will ingest
reader = csv.reader(open("/data/data.csv"), delimiter=',')
writer = csv.writer(open("/data/output.csv", "w"), delimiter=',', lineterminator='\n')

# fix up the header since we are merging date and time into one field
headers = next(reader)[1:]
headers[0] = "dateTime"

writer.writerow(headers)

# merge the first two columns (date and time) into a single dateTime field
for row in reader:
    newRow = [row[0] + " " + row[1]] + row[2:]
    writer.writerow(newRow)


This opens up the data we were given (data.csv), merges the date and time columns into one field, and writes the new data to output.csv. If you used the Puppet manifest to set everything up, this step was already run; if not, you'll need to run python transform.py first to get your data into a format Logstash will appreciate.
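A quick way to eyeball the result (just a sanity check; the actual column names come from your data):

head -n 1 /data/data.csv     # original header: date and time in separate columns
head -n 1 /data/output.csv   # merged header, starting with dateTime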

At this point you are really close; all that's left is to run Logstash from the /data directory:

cd /data
/opt/logstash/bin/logstash -f csvPipe.conf


After a few seconds, you should have a bunch of data sitting in Elasticsearch, ready to be viewed with Kibana.
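Before moving on to Kibana, you can confirm that documents actually landed in the blower-data index straight from the Elasticsearch API; this is an optional check, not part of the original walkthrough:

curl 'http://localhost:9200/_cat/indices?v'                      # blower-data should be listed with a non-zero doc count
curl 'http://localhost:9200/blower-data/_count?pretty'           # total number of documents in the index
curl 'http://localhost:9200/blower-data/_search?size=1&pretty'   # peek at one document, including the parsed Date field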

Kibana

Now we are ready to view some data in Kibana. If everything has worked so far, you should be able to connect to http://localhost:5601 on your host machine and see Kibana.

Since this is a fresh install of Kibana, we'll need to tell it where our data is. By default it looks for indexes matching logstash-*; change the index pattern to blower-data, because that is the index name we picked when we wrote to Elasticsearch in the csvPipe.conf file. Make that change, wait a few seconds, and the UI should change a bit: you'll see a box asking for the Time-field name. Select the Date field in the dropdown, since that is where we put the converted date and time in csvPipe.conf (target => "Date"). Click Create, and then click over to the Discover tab on the left.

And see that there is NO data. HAHAHA, all is for nothing. Just kidding. If you look in the upper right corner, you'll see that the time range it is looking at is the last 15 minutes. Click on the 15 minutes area and you'll get a dropdown that lets you pick other time ranges; go for the last year, and you should see your data appear.

