Monitoring Creation of Log Files in s3

I manage several apps that write various pieces of data to the local file system and rely on Fluentd to ship them to s3. There is solid monitoring around the Fluentd aggregator process, but I wanted better visibility and alerting when things aren’t written to s3 as expected.

The solution I came up with was a custom Datadog check. The files I am monitoring are written to a bucket under a dated prefix, something like example-logs/data/event-files/<year>/<month>/<day>, with a new path created in the s3 bucket for each day’s date, e.g. example-logs/data/event-files/2018/08/15. The Datadog check sends the count of objects under the current date’s prefix as a gauge. You can then monitor that objects are created each day as expected and at a normal rate.

Here is an example config:

init_config:

instances:
# this will monitor s3://example-logs/data/production/event-log/<year>/<month>/<day>
  - bucket: example-logs
    prefix: data/production/event-log
    tags:
      - 'env:production'
      - 'log:event-log'
  - bucket: example-logs
    prefix: data/staging/event-log
    tags:
      - 'env:staging'
      - 'log:event-log'

The check will add the current date path to the prefix automatically.

How to Set it Up for Yourself

  • Install boto3 via the Datadog embedded pip:
/opt/datadog-agent/embedded/bin/pip install boto3
  • Add s3_object_count.py to /etc/datadog-agent/checks.d
  • Add your config file to /etc/datadog-agent/conf.d/s3_object_count.d
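Once those are in place, restart the agent. You can also run the check one-off to confirm it’s working (assuming Agent v6’s CLI):

sudo systemctl restart datadog-agent
sudo -u dd-agent datadog-agent check s3_object_count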

The code for the check is pretty simple.
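Condensed, it looks roughly like this (a sketch rather than the exact code; the metric name here is illustrative):

from datetime import date

import boto3

from checks import AgentCheck  # base class shipped with the Datadog agent


class S3ObjectCount(AgentCheck):
    def check(self, instance):
        bucket = instance['bucket']
        # append today's date to the configured prefix, e.g.
        # data/production/event-log -> data/production/event-log/2018/08/15
        prefix = '%s/%s' % (instance['prefix'].rstrip('/'),
                            date.today().strftime('%Y/%m/%d'))
        # paginate so days with more than 1000 objects are counted fully
        paginator = boto3.client('s3').get_paginator('list_objects_v2')
        count = 0
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            count += page.get('KeyCount', 0)
        self.gauge('s3.object_count', count, tags=instance.get('tags', []))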

See https://github.com/dschaaff/datadog-checks for the full source.

Bash Function to SSH into ec2 Instances

I’ve often found myself with an instance ID that I want to log into to look at something. It sucks looking up the IP when you don’t know the DNS name. I’m sure there are other ways to do this, but here is what I came up with.


getec2ip() {
  # look up the private IP for an instance ID via the AWS CLI, parsed with jq
  aws ec2 describe-instances --instance-ids "$1" \
    | jq --raw-output '.Reservations[0].Instances[0].PrivateIpAddress'
}

assh() {
  # resolve the instance ID to an IP, then ssh in
  host=$(getec2ip "${1}")
  ssh "user@${host}"
}

This relies on the AWS CLI and jq to parse out the IP, and it has made it much easier for me to quickly hop onto an instance.
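With those in your shell profile, hopping onto a box is just (instance ID made up):

assh i-0123456789abcdef0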

Jenkins Dynamic EC2 Slaves

There is a nice plugin for Jenkins that lets you dynamically add capacity by spinning up EC2 instances on demand and then terminating them after a configurable idle timeout expires. This is a great way to save money on an AWS-based build infrastructure.

Unfortunately, the plugin documentation is really light and there are a few gotchas to look out for.

Security Groups

This field only accepts comma-separated security group IDs, not names. This is frustrating because other fields in the plugin take a space-separated list (e.g. labels).
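For example (IDs made up):

# Security Groups field: comma-separated IDs
sg-0123456789abcdef0,sg-0fedcba9876543210
# Labels field: space-separated
linux docker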

Running in VPC

If you’re a sane person, you’re going to want to run these instances in a private subnet of your VPC. This is entirely possible but hidden in the advanced settings. If you expand the advanced settings you’ll see a field for the subnet ID. Set this to the ID of the private subnet in your VPC you want the instances to run in.

Don’t Rely On the User Data/Init Script to Install Dependencies

Installing dependencies at boot adds a lot of time before the instance comes online and is usable by Jenkins. A better approach is to bake an AMI with all the build dependencies you need. The only delay is then the instance boot time.

This is far from an exhaustive walkthrough but highlights the issues I ran into setting it up.

Terraform AMI Maps

Up until today we had been using a map variable in Terraform to choose our Ubuntu 14.04 AMI based on region.

variable "ubuntu_amis" {
    description = "Mapping of Ubuntu 14.04 AMIs."
    default = {
        ap-northeast-1 = "ami-a25cffa2"
        ap-southeast-1 = "ami-967879c4"
        ap-southeast-2 = "ami-21ce8b1b"
        cn-north-1     = "ami-d44fd2ed"
        eu-central-1   = "ami-9cf9c281"
        eu-west-1      = "ami-664b0a11"
        sa-east-1      = "ami-c99518d4"
        us-east-1      = "ami-c135f3aa"
        us-gov-west-1  = "ami-91cfafb2"
        us-west-1      = "ami-bf3dccfb"
        us-west-2      = "ami-f15b5dc1"
    }
}

We would then set the AMI ID like so when creating an ec2 instance:

ami = "${lookup(var.ubuntu_amis, var.region)}"

The problem we ran into is that we now use Ubuntu 16 by default and wanted to expand the AMI map to contain its IDs as well. I quickly discovered that nested maps like the one below do not work (at the time, Terraform map variables only supported flat string values).

 variable "ubuntu_amis" {
    description = "Mapping of Ubuntu 14.04 AMIs."
    default = {
        "ubuntu14" = {
          ap-northeast-1 = "ami-a25cffa2"
          ap-southeast-1 = "ami-967879c4"
          ap-southeast-2 = "ami-21ce8b1b"
          cn-north-1     = "ami-d44fd2ed"
          eu-central-1   = "ami-9cf9c281"
          eu-west-1      = "ami-664b0a11"
          sa-east-1      = "ami-c99518d4"
          us-east-1      = "ami-c135f3aa"
          us-gov-west-1  = "ami-91cfafb2"
          us-west-1      = "ami-bf3dccfb"
          us-west-2      = "ami-f15b5dc1"
      }
      "ubuntu16" = {
          ap-northeast-1 = "ami-a25cffa2"...

I also tried the solution from this old GitHub issue, but it is no longer valid since the concat function now only accepts lists. In the end I landed on using a variable for the OS version and flattening the map keys, like this.

variable "os-version" {
    description = "Whether to use ubuntu 14 or ubuntu 16"
    default     = "ubuntu16"
}
ariable "ubuntu_amis" {
    description = "Mapping of Ubuntu 14.04 AMIs."
    default = {
          ubuntu14.ap-northeast-1 = "ami-a25cffa2"
          ubuntu14.ap-southeast-1 = "ami-967879c4"
          ubuntu14.ap-southeast-2 = "ami-21ce8b1b"
          ubuntu14.cn-north-1     = "ami-d44fd2ed"
          ubuntu14.eu-central-1   = "ami-9cf9c281"
          ubuntu14.eu-west-1      = "ami-664b0a11"
          ubuntu14.sa-east-1      = "ami-c99518d4"
          ubuntu14.us-east-1      = "ami-c135f3aa"
          ubuntu14.us-gov-west-1  = "ami-91cfafb2"
          ubuntu14.us-west-1      = "ami-bf3dccfb"
          ubuntu14.us-west-2      = "ami-f15b5dc1"
          ubuntu16.ap-northeast-1 = "ami-a68e3ec7"
          ubuntu16.ap-southeast-1 = "ami-5b7ed338"
          ubuntu16.ap-southeast-2 = "ami-e2112881"
          ubuntu16.cn-north-1     = "ami-593feb34"
          ubuntu16.eu-central-1   = "ami-df02c5b0"
          ubuntu16.eu-west-1      = "ami-be376ecd"
          ubuntu16.sa-east-1      = "ami-8f34aae3"
          ubuntu16.us-east-1      = "ami-2808313f"
          ubuntu16.us-gov-west-1  = "ami-19d56d78"
          ubuntu16.us-west-1      = "ami-900255f0"
          ubuntu16.us-west-2      = "ami-7df25b1d"
    }
}

Then we use a lookup like this when creating an instance:

ami = "${lookup(var.ubuntu_amis, "${var.os-version}.${var.region}")}"

Hopefully this helps someone out, and if you know of a better way to accomplish this, please share!

OpenVPN and ec2 Jumbo Frames

While troubleshooting site-to-site links running OpenVPN recently, I ran into an issue with MTU sizing on the ec2 end. When we originally set up the links we followed the performance tuning advice found here. The relevant portion is that we set tun-mtu 6000. Why did we do this? Here’s OpenVPN’s explanation:

By increasing the MTU size of the tun adapter and by disabling OpenVPN’s internal fragmentation routines the throughput can be increased quite dramatically. The reason behind this is that by feeding larger packets to the OpenSSL encryption and decryption routines the performance will go up. The second advantage of not internally fragmenting packets is that this is left to the operating system and to the kernel network device drivers. For a LAN-based setup this can work, but when handling various types of remote users (road warriors, cable modem users, etc) this is not always a possibility.
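In config terms, that tuning looked roughly like this (the fragment and mssfix lines are what disable OpenVPN’s internal fragmentation):

tun-mtu 6000
fragment 0
mssfix 0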

During later testing we discovered that we could easily push 40 Mb/s over the OpenVPN tunnel into the ec2 instance, but only 1 Mb/s or less going the opposite direction. Obviously not ideal.

So how does ec2 MTU sizing play into this? We had on-premises VMs on one side with a standard MTU of 1500 on their interfaces, while the ec2 instance had jumbo frames enabled with an MTU of 9001. This meant that our traffic toward ec2 was fragmented to an MTU of 1500 by Linux on the on-premises OpenVPN instance. Traffic out of ec2 toward the other end was not fragmented by the VPN instance at all; it was sent along to the internet gateway, where it was chopped down to a standard MTU size to cross the internet. This manifested as significantly lower throughput going from ec2 to the on-premises instance.

After discovering this, we set the MTU of eth0 on the ec2 instance to 1500 and lowered the OpenVPN tun-mtu to 1500. Since implementing those changes, we are achieving equal throughput in both directions!
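For reference, the interface side of the fix is a one-liner (eth0 being the instance’s primary interface):

# on the ec2 instance, drop the interface MTU from 9001 to 1500
sudo ip link set dev eth0 mtu 1500

along with tun-mtu 1500 in the OpenVPN configs on both ends. Note that an ip link change does not persist across reboots, so you’ll also want to set the MTU in your network configuration.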