This is Part 4 of my tutorial series on ELK on CentOS 7

  • Part 1 - Operating System, Java and Tweaks
  • Part 2 - Elasticsearch
  • Part 3 - Kibana
  • Part 4 - Logstash with Nginx (This page)

The next component of the ELK stack is Logstash. This component receives data from different sources, aggregates and filters it, and prepares it for ingestion by Elasticsearch.

You don't necessarily need Logstash for a lot of the things I show in this tutorial. For example, Filebeat can send logs directly to Elasticsearch without going through Logstash. However, there are two reasons to still use Logstash: first, it will help you understand the ELK stack better, and second, if you're planning to collect logs from multiple servers, Logstash is the way to go.

Install Logstash

Again, as with Elasticsearch and Kibana, we need to make sure we have Elastic's GPG key installed:

$ sudo rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch

Then we create the repo file:

$ sudo nano /etc/yum.repos.d/logstash.repo

Dump this into the empty file:

[logstash-6.x]
name=Elastic repository for 6.x packages
baseurl=https://artifacts.elastic.co/packages/6.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md

Now we can install Logstash:

$ sudo yum install logstash -y

Next, we enable Logstash on startup:

$ sudo systemctl enable logstash.service

And then we start Logstash:

$ sudo systemctl start logstash.service

Configure Logstash

In this part, we will configure Logstash to receive log data from a remote Nginx web server. But before we do that, we obviously need to make sure Logstash accepts only encrypted connections from clients it can verify, otherwise anybody could send us their logs and spam us.

Configure SSL

This section is based on a tutorial by Benjamin Knofe and has been updated for Elastic 6.x.

In your home folder, run these commands to generate the CA key and certificate:

$ cd ~
$ openssl genrsa -out ca.key 2048
$ openssl req -x509 -new -nodes -key ca.key -sha256 -days 3650 -out ca.crt

Answer the questions; the only one that really matters is the Common Name (the hostname), which should be the FQDN of your Elastic server. Unfortunately, if you mistype while running the last command, you have to press Ctrl+C to abort and run the whole thing again.
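If you'd rather skip the interactive questions entirely, you can pass the subject directly on the command line instead. This is just a sketch; the subject values, including elk.example.com, are placeholders for your own details:

$ openssl req -x509 -new -nodes -key ca.key -sha256 -days 3650 -subj "/C=XX/ST=State/L=City/O=Example/CN=elk.example.com" -out ca.crt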

Logstash Certificate

Next we need to generate a certificate for Logstash. Create another file in your home folder called logstash.conf:

$ nano logstash.conf

Dump this into the file and read the instructions for it below:

[req]
distinguished_name = req_distinguished_name
req_extensions = v3_req
prompt = no

[req_distinguished_name]
countryName                     = XX
stateOrProvinceName             = XXXXXX
localityName                    = XXXXXX
postalCode                      = XXXXXX
organizationName                = XXXXXX
organizationalUnitName          = XXXXXX
commonName                      = XXXXXX
emailAddress                    = XXXXXX

[v3_req]
keyUsage = keyEncipherment, dataEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @alt_names

[alt_names]
DNS.1 = DOMAIN_1
DNS.2 = DOMAIN_2
DNS.3 = DOMAIN_3
DNS.4 = DOMAIN_4

In the [req_distinguished_name] section, change the XX placeholders to the values you entered earlier when generating the ca.crt file (it's not a must-have, though).

In the [alt_names] section, change the entries to match all FQDNs of the machines you will use this certificate on. We will create a separate certificate for each web server, so don't put those in here. Only add additional machines if you're using this certificate on several nodes of an ELK cluster.

In our case, since we're only running it on one machine, just keep the first entry, change it to match your ELK host's FQDN and delete the rest.
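For example, with a hypothetical FQDN of elk.example.com, the trimmed [alt_names] section would simply read:

[alt_names]
DNS.1 = elk.example.com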

Logstash Key

Next, generate the Logstash key:

$ openssl genrsa -out logstash.key 2048
$ openssl req -sha512 -new -key logstash.key -out logstash.csr -config logstash.conf

Next, we need to get the serial number of the CA.

$ openssl x509 -in ca.crt -text -noout -serial

The last line of the output is the serial number. Copy only the number and put it into a file (replace [SERIALNUMBER] below, but keep the quotation marks):

$ echo "[SERIALNUMBER]" > serial
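If you'd rather not copy the number by hand, the following one-liner extracts it and writes the serial file in one go (it does exactly the same thing):

$ openssl x509 -in ca.crt -noout -serial | cut -d= -f2 > serial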

Signing of the Logstash Certificate

Next, we create and sign the Logstash certificate:

$ openssl x509 -days 3650 -req -sha512 -in logstash.csr -CAserial serial -CA ca.crt -CAkey ca.key -out logstash.crt -extensions v3_req -extfile logstash.conf
$ mv logstash.key logstash.key.pem && openssl pkcs8 -in logstash.key.pem -topk8 -nocrypt -out logstash.key
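As an optional sanity check, you can verify that the freshly signed certificate validates against our CA:

$ openssl verify -CAfile ca.crt logstash.crt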

Store and Secure the Logstash Certificates and keys

Let's now create a folder to store all of this (including the configuration files) and change the file permissions. Make sure you're in the directory where all the files we just created are (it should be your home folder):

$ sudo mkdir /etc/elk-certs
$ sudo mv -t /etc/elk-certs/ ca.* logstash.* serial
$ cd /etc/elk-certs
$ sudo chown logstash:root *
$ sudo chmod o-rwx *.key*

The last line removes access to all private keys for everyone except the logstash user and the root group.
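A quick listing lets you double-check that ownership and permissions turned out as intended (the .key files should show no permissions for "others"):

$ ls -l /etc/elk-certs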

Create the Filebeat Certificate

We will create the Filebeat certificate on the same machine, since we need the CA we just created to sign it. So make sure you're still in the proper folder:

$ cd /etc/elk-certs

Create a new file called beat.conf:

$ sudo nano beat.conf

And dump this into it:

[req]
distinguished_name = req_distinguished_name
req_extensions = v3_req
prompt = no

[req_distinguished_name]
countryName                     = XX
stateOrProvinceName             = XXXXXX
localityName                    = XXXXXX
postalCode                      = XXXXXX
organizationName                = XXXXXX
organizationalUnitName          = XXXXXX
commonName                      = XXXXXX
emailAddress                    = XXXXXX

[ usr_cert ]
# Extensions for server certificates (`man x509v3_config`).
basicConstraints = CA:FALSE
nsCertType = client, server
nsComment = "OpenSSL FileBeat Server / Client Certificate"
subjectKeyIdentifier = hash
authorityKeyIdentifier = keyid,issuer:always
keyUsage = critical, digitalSignature, keyEncipherment, keyAgreement, nonRepudiation
extendedKeyUsage = serverAuth, clientAuth

[v3_req]
keyUsage = keyEncipherment, dataEncipherment
extendedKeyUsage = serverAuth, clientAuth

Make sure the commonName matches the hostname of the web server you will be sending logs from (not a hard requirement, but you get the idea).

Then we generate all the necessary files (note that this time I'm using sudo, since we're now in a folder that should only be writable by root):

$ sudo openssl genrsa -out beat.key 2048
$ sudo openssl req -sha512 -new -key beat.key -out beat.csr -config beat.conf
$ sudo openssl x509 -days 3650 -req -sha512 -in beat.csr -CAserial serial -CA ca.crt -CAkey ca.key -out beat.crt -extensions v3_req -extensions usr_cert  -extfile beat.conf

Secure the key file:

$ sudo chmod o-rwx beat.key

Finally, you need to copy beat.crt and beat.key to the web server which runs Filebeat, along with ca.crt so Filebeat can verify the Logstash certificate.
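How you transfer them is up to you; scp is the obvious choice. A sketch, assuming a hypothetical web server web01.example.com and a user with SSH access (adjust user, host and destination path to your setup):

$ sudo scp /etc/elk-certs/beat.crt /etc/elk-certs/beat.key /etc/elk-certs/ca.crt user@web01.example.com:/tmp/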

Configuring a Pipeline

A pipeline in Logstash is the process of receiving data, filtering and processing it, and then sending it on somewhere else. We will of course send the data to Elasticsearch, but you can also send it to other destinations (Hadoop etc.).

By default, Logstash defines the main pipeline. If you're running your ELK stack for one or two purposes only, that's absolutely fine. But if you're running ELK for all sorts of data crunching, I would highly recommend defining pipelines for specific purposes. For example, if you're planning to collect log files from multiple web servers (even a couple of hundred), you should define a pipeline for that. If necessary, split that up further into multiple pipelines, for example along the lines of clusters that have nothing to do with each other.

That being said, make sure you keep an eye on your performance parameters. By default, the Elastic setup is optimized for one pipeline. If you're running multiple busy pipelines, you need to adjust the settings.

In our case, we will define a new pipeline called webservers while leaving the main pipeline alone for now. This means we don't have to worry about performance tuning:

$ sudo nano /etc/logstash/pipelines.yml

Your pipelines.yml should look like this now:

# This file is where you define your pipelines. You can define multiple.
# For more information on multiple pipelines, see the documentation:
#   https://www.elastic.co/guide/en/logstash/current/multiple-pipelines.html

- pipeline.id: main
  path.config: "/etc/logstash/conf.d/*.conf"
- pipeline.id: webservers
  path.config: "/etc/logstash/webserver.conf.d/*.conf"

You only need to add the last two lines; the rest should already be there.

Configuring the Pipelining Process

In the section above, we have defined a new pipeline. Now we need to tell Logstash what the pipeline is supposed to do. First, we have to make sure the folder for our new pipeline exists:

$ sudo mkdir /etc/logstash/webserver.conf.d

Before we go ahead and configure the pipeline, a few words on how the configuration files are actually built. Each pipeline configuration looks like this (don't use this skeleton as is, of course):

input {
  ...
}

filter {
  ...
}

output {
  ...
}

As you can see, it consists of three main sections: input, filter and output. Most configuration examples keep the whole configuration in one file, and that often makes sense. In this case though, we will split the configuration into one file which contains the input and output sections and a few other files which contain the filters for each type of incoming log, specifically Apache2 and Nginx logs. This is for readability only.

Define Input and Output

Now, let's get stuff done. First we're going to generate a file called webserver_io.conf:

$ sudo touch /etc/logstash/webserver.conf.d/webserver_io.conf

And this is what it will look like:

input {
	beats {
		port => 5044
		host => "0.0.0.0"
		ssl => true
		ssl_certificate_authorities => ["/etc/elk-certs/ca.crt"]
		ssl_certificate => "/etc/elk-certs/logstash.crt"
		ssl_key => "/etc/elk-certs/logstash.key"
		ssl_verify_mode => "force_peer"
		}
	}
	
output {
	elasticsearch {
		hosts => ["localhost:9200"]
		index => "webserverlogs-%{+YYYY.MM.dd}"
		# template => "/etc/logstash/index_templates/webserver_template"
		# template_name => "webserverlogs"
		}
	}

Please note the two lines which are commented out for now in the output section. We will uncomment those later, and the explanation will follow as well.

For the details of the SSL configuration, you can check out the respective document on Elastic's website.

The host => "0.0.0.0" directive makes sure Logstash doesn't only accept connections on the localhost interface, but on all interfaces of the machine it's running on.

Some of you might notice the index => "webserverlogs-%{+YYYY.MM.dd}" line in the output section.

What we're actually doing here is creating a daily index and dumping the logs from all web servers into it. We can do this because Kibana allows us to aggregate all those indices but still distinguish the sources of the logs. So in Kibana you will be able to see the results of all web servers combined, even if they come from Nginx and Apache, or view just the logs from specific servers.
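Once logs start flowing in later (after Filebeat is set up on the web server), you can verify that the daily indices are actually being created. A quick check, assuming Elasticsearch listens on localhost:9200 as set up in Part 2:

$ curl -XGET 'localhost:9200/_cat/indices/webserverlogs-*?v'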

Define the Filter for Nginx (link for Apache2 below)

(If you are using Apache2 instead of Nginx, follow this link)

Filters are the core of Logstash. This is where you can preprocess data in a way that Elasticsearch can really use later. You can parse all sorts of log files, but also other sources such as CSV files or even Twitter feeds, and make them digestible for Elasticsearch.

As mentioned above, separating the filter section from the input and output section is a bit unusual but makes sense here, especially because the filter definitions for NGINX are a bit complex, and if you have to deal with multiple web servers - e.g. if you're running a farm of Apaches, NGINXs, and maybe even IISs - this makes the configuration much more readable and also easier to maintain.

Let's create a new configuration file which will contain our Nginx filters:

$ sudo nano /etc/logstash/webserver.conf.d/nginx_filter.conf

Dump the following content in the file (config source):

filter {
  if [fileset][module] == "nginx" {
    if [fileset][name] == "access" {
      grok {
        match => { "message" => ["%{IPORHOST:[nginx][access][remote_ip]} - %{DATA:[nginx][access][user_name]} \[%{HTTPDATE:[nginx][access][time]}\] \"%{WORD:[nginx][access][method]} %{DATA:[nginx][access][url]} HTTP/%{NUMBER:[nginx][access][http_version]}\" %{NUMBER:[nginx][access][response_code]} %{NUMBER:[nginx][access][body_sent][bytes]} \"%{DATA:[nginx][access][referrer]}\" \"%{DATA:[nginx][access][agent]}\""] }
      }
      mutate {
        add_field => { "read_timestamp" => "%{@timestamp}" }
      }
      date {
        match => [ "[nginx][access][time]", "dd/MMM/YYYY:H:m:s Z" ]
        remove_field => "[nginx][access][time]"
      }
      useragent {
        source => "[nginx][access][agent]"
        target => "[nginx][access][user_agent]"
        remove_field => "[nginx][access][agent]"
      }
      geoip {
        source => "[nginx][access][remote_ip]"
      }
    }
    else if [fileset][name] == "error" {
      grok {
        match => { "message" => ["%{DATA:[nginx][error][time]} \[%{DATA:[nginx][error][level]}\] %{NUMBER:[nginx][error][pid]}#%{NUMBER:[nginx][error][tid]}: (\*%{NUMBER:[nginx][error][connection_id]} )?%{GREEDYDATA:[nginx][error][message]}"] }
      }
      mutate {
        rename => { "@timestamp" => "read_timestamp" }
      }
      date {
        match => [ "[nginx][error][time]", "YYYY/MM/dd H:m:s" ]
        remove_field => "[nginx][error][time]"
      }
    }
  }
}

Now we should have a fully configured Logstash instance which can process Nginx log files.
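Before restarting Logstash, it can't hurt to let it validate the configuration itself. A quick sanity check, assuming the default RPM install path:

$ sudo /usr/share/logstash/bin/logstash --path.settings /etc/logstash --config.test_and_exit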

Let's make sure Logstash is set to start at system boot and then restart the beast so it picks up the new configuration (a plain start would do nothing, since it's already running from earlier):

$ sudo systemctl enable logstash.service
$ sudo systemctl restart logstash.service

To make sure everything is running smoothly, let's check the daemon's network sockets:

$ sudo lsof -Pni | grep logstash

This should give you something like this:

java     15411      logstash   61u  IPv6 2028031      0t0  TCP 127.0.0.1:34850->127.0.0.1:9200 (ESTABLISHED)
java     15411      logstash  100u  IPv6 2028036      0t0  TCP *:5044 (LISTEN)
java     15411      logstash  103u  IPv6 2028046      0t0  TCP 127.0.0.1:9600 (LISTEN)

If you have any troubles, log in as root and check out the log files in /var/log/logstash.
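For a live view while debugging, tailing the main log file is usually enough (assuming the default RPM layout, where the file is typically called logstash-plain.log):

$ sudo tail -f /var/log/logstash/logstash-plain.log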

GeoIP to location conversion

One thing is still missing. We want to be able to display locations on a map. For this, we need to add a couple of things. First, we need to install the Elasticsearch ingest-geoip plugin:

$ sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install ingest-geoip

This path varies with your installation. Shown above is the default path for RPM and DEB based installations.

When you run it, you will get a permission warning which you need to confirm.
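Elasticsearch only loads plugins at startup, so restart the service once the installation has finished:

$ sudo systemctl restart elasticsearch.service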

In some tutorials you will see configurations for an alternative GeoIP database which seems to offer higher accuracy. The installation and configuration are fairly simple, but I won't cover them here.

Index Templates

When I was starting with ELK, I soon got confused by the terms "index", "index template" and "index pattern". So here is a quick reference for the three terms and what they are used for:

An Index is what the term suggests: the actual dataset, stored as key:value pairs in the Elasticsearch core. That's what's at the heart of your number crunching, the database.

An Index Pattern is a simple name-based aggregation of indices in Kibana. In our example, the web servers will create multiple individual indices called webserverlogs-YYYY.MM.dd, where the latter part represents the date. Once you start visualizing things in Kibana, you need an index pattern to tell Kibana which indices you want to include in your visualization. You could, for example, only look at 2018 data or just the month of July by choosing the appropriate index pattern (webserverlogs-2018.* and webserverlogs-2018.07.* respectively in our example).

An Index Template is a set of settings and mappings applied when an index is created. This happens directly in the Elasticsearch core but can be triggered through Logstash (which is what I'll show in a minute) or through Kibana's developer interface. I will not go through the parameters, since that is a whole other topic you need to consider once you implement production clusters. A starting point is Elastic's documentation on the settings.

By default, Logstash tries to load an Index Template into Elasticsearch at startup which is simply called logstash. To make our tutorial work, we will need to create a new Index Template and have Logstash load it into Elasticsearch when it starts. Amongst other things, this will set the geo_point mapping correctly. If we don't do that, we will not get proper geolocation values for your visitors.

Create a new folder in /etc/logstash and create a new file for the Index Template:

$ sudo mkdir /etc/logstash/index_templates
$ sudo nano /etc/logstash/index_templates/webserver_template

The file should contain this (note that if you change anything, you should increase the version number by 1):

{
  "webserverlogs" : {
    "order" : 0,
    "version" : 1,
    "index_patterns" : [
      "webserverlogs-*"
    ],
    "settings" : {
      "index" : {
        "refresh_interval" : "5s"
      }
    },
    "mappings" : {
      "_default_" : {
        "dynamic_templates" : [
          {
            "message_field" : {
              "path_match" : "message",
              "match_mapping_type" : "string",
              "mapping" : {
                "type" : "text",
                "norms" : false
              }
            }
          },
          {
            "string_fields" : {
              "match" : "*",
              "match_mapping_type" : "string",
              "mapping" : {
                "type" : "text",
                "norms" : false,
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          }
        ],
        "properties" : {
          "@timestamp" : {
            "type" : "date"
          },
          "@version" : {
            "type" : "keyword"
          },
          "geoip" : {
            "dynamic" : true,
            "properties" : {
              "ip" : {
                "type" : "ip"
              },
              "location" : {
                "type" : "geo_point"
              },
              "latitude" : {
                "type" : "half_float"
              },
              "longitude" : {
                "type" : "half_float"
              }
            }
          }
        }
      }
    },
    "aliases" : { }
  }
}

Save and close, then head back to the configuration file /etc/logstash/webserver.conf.d/webserver_io.conf:

$ sudo nano /etc/logstash/webserver.conf.d/webserver_io.conf

Remember the two commented lines in the output section earlier? Uncomment them, so the output section now looks like this (and of course leave the input section untouched):

output {
	elasticsearch {
		hosts => ["localhost:9200"]
		index => "webserverlogs-%{+YYYY.MM.dd}"
		template => "/etc/logstash/index_templates/webserver_template"
		template_name => "webserverlogs"
		}
	}
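For Logstash to pick up the changed output configuration and push the template, restart it once more. Afterwards you can check whether Elasticsearch has registered the template; a quick check, again assuming Elasticsearch listens on localhost:9200:

$ sudo systemctl restart logstash.service
$ curl -XGET 'localhost:9200/_template/webserverlogs?pretty'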

Conclusion

This was a long one, but now we have a running environment which is able to collect Nginx log files. Next up will be the same for Apache2 and we'll add some cookies to the mix as well. Stay tuned.