Monitoring Toolchain for Application Metrics

Monitoring your system landscape is critical for detecting outages as soon as possible. It enables you to minimize downtime, so that you get the most benefit out of your systems. Classical monitoring tools like Nagios and Icinga focus on host and operating system metrics, such as disk errors, CPU load and RAM usage.
When it comes to application metrics, they mostly offer simple up/down checks. These kinds of checks tell you that something is wrong, but in general give you no hint that something is about to go wrong, let alone what exactly is wrong.

To extract that kind of information, you need deeper insight into your individual applications by collecting and monitoring application metrics. As applications are very individual, so are the concrete application metrics you should collect and monitor. For one application, it might be crucial to monitor how often a specific error occurs in invocations of a remote system; for another, it is critical to monitor the performance of a certain database query that enables the application's main use case. Therefore, you need a very flexible toolchain that enables you to collect, store, visualize and alert on whatever application metrics you need.

These four logical components are summarized in Picture 1. This article focuses on introducing concrete implementations of those logical components. We will only skim their surface and won't go into detail; more information can be obtained by following the links throughout this article.

Collection

The collection component is responsible for gathering the required metrics and transporting them to the storage engine. To obtain your required application metrics, you usually have to write custom code. For example, you could create an interceptor which inspects the result of every invocation of a remote system and calculates the percentage of erroneous responses during the last 5 minutes. You could then either push the updated metric to your storage component immediately after each invocation, or temporarily store it in JMX and expose it via Jolokia for later polling.
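To illustrate the polling variant: Jolokia exposes MBean attributes over HTTP, so a single GET request suffices to read the current value. A minimal sketch, assuming a Jolokia agent listening on localhost:8080 and a hypothetical MBean com.myapp:type=RemoteCalls with an ErrorRate attribute:

curl -s 'http://localhost:8080/jolokia/read/com.myapp:type=RemoteCalls/ErrorRate'

The response is a small JSON document containing the current value, which a collector can pick up periodically.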

One concrete tool supporting this polling approach is the Jolokia2 input plugin of Telegraf. Its setup and configuration are straightforward and easy to experiment with. Although Telegraf belongs to the InfluxData stack, most of its input plugins can also produce output formats for third-party storage engines. At the time of this writing, the Jolokia2 input plugin only supports InfluxDB's line protocol format, but it is likely to support other output formats in the future, as there is an ongoing effort to make all input plugins support different output formats out of the box.
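To give an impression of what such a setup looks like, here is a minimal sketch of a Telegraf configuration that polls a Jolokia agent and writes the results to InfluxDB; the agent URL, the MBean and the database name are assumptions for illustration:

[[inputs.jolokia2_agent]]
  # poll this Jolokia agent for JMX metrics
  urls = ["http://localhost:8080/jolokia"]

  [[inputs.jolokia2_agent.metric]]
    # read the JVM heap usage as an example metric
    name  = "jvm_memory"
    mbean = "java.lang:type=Memory"
    paths = ["HeapMemoryUsage"]

[[outputs.influxdb]]
  # write everything to this InfluxDB instance
  urls = ["http://10.73.5.161:8086"]
  database = "mydb"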

Storage

Application metrics are best stored in databases designed for this purpose, called time series databases. Compared to relational databases, which are optimized for random-access reads, updates and deletes, time series databases largely sacrifice random operations in order to support very high throughput for append-only inserts, with data always indexed primarily by the creation timestamp.

Therefore, time series databases can handle much higher insert loads than relational databases, which is very important as you might collect thousands or tens of thousands of metric points per minute throughout your application landscape. Because data is primarily indexed by creation time, the main use case of querying the latest application state can also be served very efficiently. As application metrics are not updated after their insertion, sacrificing update performance is not really a disadvantage.

Although you will not want to store your application metrics forever and will sooner or later delete them, you do not need good random deletion performance either. It suffices to be able to efficiently drop a whole bucket of data that is no longer required, and time series databases support this use case, too. As storing a lot of data also requires a lot of disk space, time series databases offer efficient compression and aggregation functionality to save disk space. Compression algorithms such as gzip might be familiar, but the aggregation functionality deserves further explanation.

When analyzing application metrics, a high resolution for the latest metrics is very helpful, whereas for older metrics it matters much less. When analyzing the live system state, seeing how a certain metric develops in 10-second intervals lets you react quickly to problems and precisely correlate it with dependent metrics. Older metrics are no longer needed at such a high resolution and can be aggregated to minutes, hours or even days, saving disk space. Time series databases offer special support for that. Common time series databases are Graphite, Prometheus and InfluxDB, and any of them can serve as the storage component in your application metrics monitoring toolchain. In this article, we will run through a simple write and read example with InfluxDB.
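In InfluxDB, for instance, both aspects are covered by retention policies, which automatically drop data older than a configured duration, and continuous queries, which downsample data into a coarser resolution. A rough sketch with made-up policy names and durations (the HTTP API used here is introduced right below):

curl -s -XPOST 'http://10.73.5.161:8086/query' --data-urlencode 'q=CREATE RETENTION POLICY two_weeks ON mydb DURATION 2w REPLICATION 1 DEFAULT'
curl -s -XPOST 'http://10.73.5.161:8086/query' --data-urlencode 'q=CREATE RETENTION POLICY one_year ON mydb DURATION 52w REPLICATION 1'
curl -s -XPOST 'http://10.73.5.161:8086/query' --data-urlencode 'q=CREATE CONTINUOUS QUERY cq_active_users_1h ON mydb BEGIN SELECT mean(value) AS value INTO mydb.one_year.active_users_1h FROM active_users GROUP BY time(1h), * END'

The first two statements keep raw data for two weeks and downsampled data for a year; the continuous query then writes hourly averages of active_users into the longer-lived policy.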

Interaction with InfluxDB happens via its HTTP API. For more interactive use, there is also a CLI client, which invokes the HTTP API internally. Sending metrics to InfluxDB is quite easy. For testing purposes, you can use curl against the HTTP API directly. Just type:

curl -s -i -XPOST 'http://10.73.5.161:8086/write?db=mydb' --data-binary 'active_users,application=myapp value=750'

to send the current value for the metric "active_users" with tag "application=myapp" and value 750 to the InfluxDB instance running at 10.73.5.161:8086. Tags are indexed automatically to support efficient queries. InfluxDB confirms a successful write with status code 204 (Picture 2).

To keep the example simple, we didn't provide a timestamp, so InfluxDB just uses its own current time when receiving the data. Also notice that you don't have to create a table upfront as you would in a relational database. InfluxDB takes care of that internally, so you can just write arbitrary metrics without further ado.
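If you do want to provide a timestamp yourself, for example when importing historical data, you can append it to the line protocol as nanoseconds since the Unix epoch (the timestamp below is just an arbitrary example):

curl -s -i -XPOST 'http://10.73.5.161:8086/write?db=mydb' --data-binary 'active_users,application=myapp value=750 1465839830100400200'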

Querying data with curl is also easy.

curl -s -G 'http://10.73.5.161:8086/query?pretty=true&db=mydb' --data-urlencode "q=SELECT value FROM active_users WHERE application='myapp'"

returns the result in JSON format (Picture 3). Also note that the query syntax resembles SQL, which makes it very intuitive to use.
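Aggregations are just as straightforward. As a sketch, the following averages the metric over 10-minute windows for the last hour:

curl -s -G 'http://10.73.5.161:8086/query?pretty=true&db=mydb' --data-urlencode "q=SELECT mean(value) FROM active_users WHERE application='myapp' AND time > now() - 1h GROUP BY time(10m)"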

Visualization

The curl output might suffice for testing purposes, but not for production use cases. Humans are very good at pattern recognition, so let's plot the data as a graph with Grafana. Picture 4 shows an example setup for our "active_users" metric. Besides Grafana, there is quite a variety of other visualization tools you could use. We picked Grafana because of its nice interface and rich storage engine support. If you are interested in alternatives, feel free to check out Kibana or Datadog; at first glance, though, they seem more tightly coupled to a particular monitoring stack.
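If you prefer to script the setup shown in Picture 4 instead of clicking through the UI, the InfluxDB data source can also be registered via Grafana's HTTP API. A minimal sketch, assuming a local Grafana instance with its default admin credentials:

curl -s -XPOST 'http://admin:admin@localhost:3000/api/datasources' -H 'Content-Type: application/json' --data-binary '{"name":"influxdb","type":"influxdb","access":"proxy","url":"http://10.73.5.161:8086","database":"mydb"}'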

Grafana supports several styles for representing your graphs; for further information, check the Grafana playground for some live examples. It also supports lots of different data sources, not only InfluxDB. If you have an existing MySQL database, you could visualize its data with Grafana, too. To get an overview of all the plugins, check https://grafana.com/plugins. For manual monitoring, i.e. a person watching a monitor with all the graphs, we could stop our journey here, but read on to discover how Grafana can also be used for automatic alerting.

Alerting

Let's suppose you have a pool of database connections and you would like to be alerted if the average number of used connections over the last 5 minutes exceeds a certain limit, say 20. Grafana allows you to define custom conditions using the query language of your metrics storage engine, in our case InfluxDB. Picture 5 shows how to define such a condition for our use case in Grafana.
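The alert condition builds on the metrics query that draws the graph. As an illustration, such a query might look like the following in the "Metrics" tab; the measurement name db_connections_used is hypothetical, and $timeFilter and $__interval are placeholders that Grafana fills in with the selected time range and resolution:

SELECT mean(value) FROM db_connections_used WHERE $timeFilter GROUP BY time($__interval)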

The "A" in "query(A, 5m, now)" references the automatically created name of that query in the "Metrics" tab, which draws the graph of used database connections, just as shown earlier for the "active_users" metric. The condition is evaluated every 60 seconds. You can also configure the behavior for situations in which no data is available or an exception or timeout occurs. In order to configure notifications for that alert, you first have to set up the required channel. Grafana supports a lot of them, as shown in Picture 6.

For example, after entering the configuration for your SMTP server, you can configure an email notification for your alert, as shown in Picture 7.

Of course, in case of an alert, your dashboard will also show that information (see Picture 8).

Conclusion

The toolchain introduced here offers a lot of flexibility and is definitely worth further evaluation. It enables you to monitor whatever metrics you can think of and gives you deeper insight into your applications, which allows you to continuously remove bottlenecks, optimize performance and keep your customers satisfied.