MTR Monitor

In December 2017, Hetzner, our hosting provider for the Build Platform, had a major network incident that lasted for almost a whole week. Our users were rightly frustrated.

You can find more information about the incident in our public Post Mortem.

To detect and help prevent such situations in the future, we have set up a transatlantic monitoring system based on MTR reports and on curl requests to vendors that are important for our platform, such as GitHub and DockerHub. This system should report any issues in the network between Germany (Hetzner) and the US (GitHub, DockerHub).

This project is part of the effort to have MTR reports readily available before, during, and after incidents, so that we can send them to Hetzner.

The project consists of two parts: an MTR monitor that continuously tests the quality of the network by running mtr from both sides of the Atlantic, and a CURL monitor that continuously tries to establish an HTTPS connection to the other side of the Atlantic.

MTR reports are generated every 5 minutes and uploaded to an S3 bucket. Results of CURL tests are displayed on the Platform — Network Grafana dashboard and are connected to PagerDuty based alerts.

Currently, we have the following routes covered:

  • Germany (Hetzner) -> AWS US East 1 (part of Job Runner)
  • Germany (Hetzner) -> AWS US West 1 (part of Job Runner)
  • Germany (Hetzner) -> AWS US West 2 (part of Job Runner)
  • AWS US East 1 -> Builder sb1 in Hetzner (standalone AWS instance with Docker container)
  • AWS US West 1 -> Builder sb1 in Hetzner (standalone AWS instance with Docker container)
  • AWS US West 2 -> Builder sb1 in Hetzner (standalone AWS instance with Docker container)

The tests from Germany are executed from every Builder machine, where this project is injected as a gem.

The DNS records of the US-based MTR monitors are the following:

  • mtr-monitor.us-east-1.semaphoreci.com
  • mtr-monitor.us-west-1.semaphoreci.com
  • mtr-monitor.us-west-2.semaphoreci.com

These records point to the Load Balancer. If you want to SSH into the machines, use the following commands:

To create a new MTR monitor, follow this guide.

Location of the generated MTR reports

The MTR monitor generates MTR reports, stores them on the local machine, and uploads them to S3.

Local reports are located in the /var/log/mtr directory and are named using the following structure:

/var/log/mtr/<name>__<YYYY-MM-DD-HH-MM>__<host-ip-address>_to_<target-ip-address>.txt

For example, if you call your report hetzner-to-us-east-1 and run it at 2017-12-18 12:33:06, the log will be generated in:

/var/log/mtr/hetzner-to-us-east-1__2017-12-18-12-33__142-21-43-11_to_138-21-32-191.txt

On S3, the path will follow the same convention, but will use a nested directory structure:

s3://<bucket-name>/<name>/<YYYY-MM-DD-HH-MM>/<host-ip-address>_to_<target-ip-address>.txt
s3://<bucket-name>/hetzner-to-us-east-1/2017-12-18-12-33/142-21-43-11_to_138-21-32-191.txt
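The two layouts can be sketched in a few lines of Ruby. This is only an illustration of the convention, not the gem's own code: the helper names (`dashed`, `report_file_name`, `local_report_path`, `s3_report_uri`) are hypothetical, and the use of UTC for the timestamp is an assumption.

```ruby
# Hypothetical helpers that illustrate the naming convention above.
# They are not part of the mtr_monitor gem's API.

# IP addresses appear in file names with dashes instead of dots.
def dashed(ip)
  ip.tr(".", "-")
end

def report_file_name(host_ip, target_ip)
  "#{dashed(host_ip)}_to_#{dashed(target_ip)}.txt"
end

# Flat layout used on the local file system.
def local_report_path(name, time, host_ip, target_ip, dir: "/var/log/mtr")
  stamp = time.strftime("%Y-%m-%d-%H-%M")
  "#{dir}/#{name}__#{stamp}__#{report_file_name(host_ip, target_ip)}"
end

# Nested layout used on S3.
def s3_report_uri(bucket, name, time, host_ip, target_ip)
  stamp = time.strftime("%Y-%m-%d-%H-%M")
  "s3://#{bucket}/#{name}/#{stamp}/#{report_file_name(host_ip, target_ip)}"
end

# Assumption: timestamps are taken in UTC.
time = Time.utc(2017, 12, 18, 12, 33, 6)
puts local_report_path("hetzner-to-us-east-1", time, "142.21.43.11", "138.21.32.191")
puts s3_report_uri("my-private-bucket-name", "hetzner-to-us-east-1", time, "142.21.43.11", "138.21.32.191")
```

With the inputs from the example above, the first line printed is the same local path as shown earlier in this section.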

Report Name

The name of the report is used to group reports with the same purpose on S3 and on the local file system.

We use the following naming convention:

<from>-to-<destination>

Examples:

hetzner-to-github
us-east-1-to-hetzner-sb1
hetzner-to-us-west-2
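If you build report names in code, a small helper can keep them consistent. The sketch below is hypothetical (the gem does not necessarily validate names), and the allowed character set of lowercase letters, digits, and hyphens is an assumption based on the examples above.

```ruby
# Hypothetical helper; the allowed characters are an assumption
# derived from the example names, not a rule enforced by the gem.
REPORT_NAME = /\A[a-z0-9]+(?:-[a-z0-9]+)*-to-[a-z0-9]+(?:-[a-z0-9]+)*\z/

def report_name(from, destination)
  name = "#{from}-to-#{destination}"
  raise ArgumentError, "invalid report name: #{name}" unless name.match?(REPORT_NAME)
  name
end

puts report_name("us-east-1", "hetzner-sb1")
```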

Using MTR Monitor as a gem

The MTR monitor can be used as a gem and injected into existing Ruby applications. Currently, we inject the MTR monitor into Job Runner.

First, add the mtr_monitor gem to your Gemfile:

gem 'mtr_monitor'

Secondly, use the report class to generate a report:

require "logger"
require "mtr_monitor"

name       = "google"
mtr_target = "google.com"

s3_bucket             = "my-private-bucket-name" # change this
aws_access_key_id     = "<KEY>"
aws_secret_access_key = "<KEY>"

# The values below are placeholders; replace them with your own settings.
mtr_options          = "<mtr-options>"
dig_ip_address       = "<ip-address>"
logdna_ingestion_key = "<KEY>"
logger               = Logger.new(STDOUT)

options = {
  :name => name,
  :mtr_target => mtr_target,
  :s3_bucket => s3_bucket,
  :mtr_options => mtr_options,
  :aws_access_key_id => aws_access_key_id,
  :aws_secret_access_key => aws_secret_access_key,
  :dig_ip_address => dig_ip_address,
  :logdna_ingestion_key => logdna_ingestion_key,
  :logger => logger
}

MtrMonitor::Report.new(options).generate

The above snippet will:

  • generate an MTR report on your local system under the /var/log/mtr directory
  • upload the report to the provided S3 bucket
  • submit metrics via Watchman and generate a metric "pulse"

If you want to generate reports continuously, create a CRON task that will call the above code. To monitor if the CRON task is running as expected, you should set up an alert on Grafana based on the "pulse" metric.

The pulse metric has the format network.mtr.pulse and is tagged with the hostname of the server where the MTR monitor is running and with the name of the metric.
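The alert condition itself is configured in Grafana, but its logic can be sketched in Ruby. The example below assumes the 5-minute report interval mentioned earlier and treats two missed runs as a failure; both the threshold and the `pulse_missing?` helper are illustrative assumptions.

```ruby
# Illustrative only: the real alert lives in Grafana.
# Assumption: reports (and therefore pulses) are expected every 5 minutes.
REPORT_INTERVAL = 5 * 60 # seconds

# Returns true when no pulse has arrived for more than `missed_runs` intervals.
def pulse_missing?(last_pulse_at, now: Time.now, missed_runs: 2)
  now - last_pulse_at > REPORT_INTERVAL * missed_runs
end

now = Time.utc(2017, 12, 18, 12, 33)
puts pulse_missing?(now - 60, now: now)      # a pulse one minute ago
puts pulse_missing?(now - 11 * 60, now: now) # last pulse eleven minutes ago
```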

MTR hops are also submitted to Grafana. Based on these metrics you can observe the packet loss, avg, best, and worst latency on the network. For more information read the code in lib/mtr_monitor/metrics.rb.
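To give a feel for where these per-hop numbers come from, here is a minimal sketch that parses the hop lines of an `mtr --report` output. It is not the code from lib/mtr_monitor/metrics.rb; the `Hop` struct, the `parse_hops` helper, and the sample report are made up for illustration.

```ruby
# Illustrative parser for `mtr --report` output; not the gem's implementation.
Hop = Struct.new(:index, :host, :loss, :sent, :last, :avg, :best, :worst, :stdev)

# Matches lines such as:
#   1.|-- 10.0.0.1   0.0%   10   0.4   0.5   0.3   0.9   0.2
# The percent sign is optional because mtr omits it for 100.0 loss.
HOP_LINE = /\A\s*(\d+)\.\|--\s+(\S+)\s+([\d.]+)%?\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*\z/

def parse_hops(report)
  report.each_line.map { |line| HOP_LINE.match(line) }.compact.map do |m|
    Hop.new(m[1].to_i, m[2], m[3].to_f, m[4].to_i,
            m[5].to_f, m[6].to_f, m[7].to_f, m[8].to_f, m[9].to_f)
  end
end

# Made-up sample report.
sample = <<~REPORT
  Start: 2017-12-18T12:33:06+0000
  HOST: builder                     Loss%   Snt   Last   Avg  Best  Wrst StDev
    1.|-- 10.0.0.1                   0.0%    10    0.4   0.5   0.3   0.9   0.2
    2.|-- ???                       100.0    10    0.0   0.0   0.0   0.0   0.0
    3.|-- 192.0.2.7                 20.0%    10   95.1  96.3  94.8 101.2   1.9
REPORT

parse_hops(sample).each do |hop|
  puts "hop #{hop.index} (#{hop.host}): loss=#{hop.loss}% avg=#{hop.avg}ms best=#{hop.best}ms worst=#{hop.worst}ms"
end
```

Each parsed hop carries the packet loss and the average, best, and worst latency, which are the values you would chart per hop.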

Bump gem version

  1. Change version in lib/mtr_monitor/version.rb
  2. Run bundle
  3. Push gem to RubyGems manually or let Semaphore do it for you automatically

Update MTR monitor in Job Runner

MTR monitor is run in the mtr_report cron task within Job Runner.

To update the version used in Job Runner:

  1. Run bundle update mtr_monitor.
  2. Make sure the code using the gem corresponds to the new version.
  3. Try to run the task locally.
  4. Deploy to staging sec1 and check if it works properly.
  5. Finally, deploy to production build servers.

Using MTR Monitor as a standalone Docker container

The MTR monitor can be used as a standalone Docker container. This is our current approach for monitors that are hitting Germany from the United States.

By default, the containers running on us-east-1, us-west-1, and us-west-2 are automatically deployed on every merge into master for this repository.

The container on the EC2 machines triggers the generation of an MTR report every 5 minutes. Every time a report is generated, the following steps are executed:

  • a new MTR report is generated on the local system under the /var/log/mtr directory
  • the report is uploaded to the provided S3 bucket
  • metrics are submitted via Watchman and a pulse is generated
  • the MTR cleaner is initiated, which removes all reports older than 2 weeks from the local system
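The cleanup step can be sketched as follows. This is not the gem's cleaner: the `clean_old_reports` helper is hypothetical, and using the file modification time (rather than the timestamp in the file name) to decide the age of a report is an assumption.

```ruby
# Hypothetical cleaner sketch; not the implementation used by the gem.
TWO_WEEKS = 14 * 24 * 60 * 60 # seconds

# Deletes *.txt reports in `dir` whose modification time is older than
# `max_age` seconds and returns the list of deleted paths.
def clean_old_reports(dir, max_age: TWO_WEEKS, now: Time.now)
  cutoff = now - max_age
  Dir.glob(File.join(dir, "*.txt"))
     .select { |path| File.mtime(path) < cutoff }
     .each { |path| File.delete(path) }
end
```

For example, `clean_old_reports("/var/log/mtr")` would remove every local report that was last modified more than 14 days ago.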

To monitor if the CRON task is running as expected, you should set up an alert on Grafana based on the "pulse" metric.

The pulse metric has the format network.mtr.pulse and is tagged with the hostname of the server where the MTR monitor is running and with the name of the metric.

MTR hops are also submitted to Grafana. Based on these metrics you can observe the packet loss, avg, best, and worst latency on the network. For more information read the code in lib/mtr_monitor/metrics.rb.

This is deployed as a docker-compose group of Docker images. One Docker image generates the MTR reports, while the other one exposes an nginx server that responds yes to incoming requests.
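From the other side of the Atlantic, a check against this nginx endpoint could look roughly like the sketch below. It uses Ruby's standard Net::HTTP library; the helper names, the 5-second timeout, and the assumption that the response body is exactly "yes" with a 200 status are illustrative and not taken from the project.

```ruby
require "net/http"
require "uri"

# Assumption: a healthy monitor answers with HTTP 200 and the body "yes".
def healthy_response?(status, body)
  status.to_s == "200" && body.to_s.strip == "yes"
end

# Tries to establish an HTTP(S) connection and returns true or false.
# Any network or TLS error is treated as a failed check.
def probe(url, timeout: 5)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == "https",
                             open_timeout: timeout,
                             read_timeout: timeout) do |http|
    http.get(uri.request_uri)
  end
  healthy_response?(response.code, response.body)
rescue StandardError
  false
end
```

A call such as `probe("https://mtr-monitor.us-east-1.semaphoreci.com")` returns a boolean that can be turned into a metric for the dashboard.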