Thursday 27th October 2016

Mesos & Marathon

Recently I implemented an Apache Mesos cluster running the Marathon framework. Apache Mesos makes a team of servers behave like a single super-server, pooling all of the machine resources into a great big heap of CPUs & memory. The cluster consists of three master nodes and an arbitrary number of slave nodes. The masters keep track of which tasks are running on which slaves and how much resource is available, in use or on offer.

Mesos Masters & Slaves diagram

Marathon

The master nodes are also home to the Marathon framework. Marathon is a bit like init for your cluster: applications are deployed within Docker containers that Marathon manages, and Marathon makes use of Mesos to provision resources for the long-running tasks you want to run. You can use Marathon to run services in a Docker container or simply from disk.

You tell Marathon how and what to provide for a service with a simple JSON file. This file specifies how much CPU time the service is guaranteed and how much memory it is allowed to use. You can also specify storage volumes to mount inside the container so that your services have persistence. Here's an example of a Marathon JSON configuration for an Apache cluster that I put together. It runs two instances, with the websites' content and vhost configuration loaded by way of volumes. If I want to add a site, I simply add the vhost name to the labels section and place the usual vhost.conf file in the sites-enabled folder on any of the slaves. I then do a rolling restart of the service in Marathon and the site is live without interrupting any of the existing services.

{
  "id": "apache2",
  "cmd":"/usr/sbin/apache2ctl -D FOREGROUND",
  "cpus": 0.1,
  "mem": 256,
  "instances": 2,
  "constraints": [["hostname", "UNIQUE", ""]],
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "docker.marathon.mesos:5000/apache2",
      "network": "BRIDGE",
      "portMappings": [
        { "containerPort": 80, "hostPort": 0, "servicePort": 0, "protocol": "tcp" }
      ]
    },
    "volumes": [
      {
        "containerPath": "/srv",
        "hostPath": "/mnt/gv0/apache2/srv/",
        "mode": "RO"
      },
      {
        "containerPath": "/etc/apache2/sites-enabled",
        "hostPath": "/mnt/gv0/apache2/etc/sites-enabled/",
        "mode": "RO"
      }
    ]
  },
  "labels":{
    "HAPROXY_GROUP":"external",
    "HAPROXY_0_BIND_ADDR":"0.0.0.0",
    "HAPROXY_0_VHOST":"vorax.org,www.vorax.org"
  },
 "healthChecks":[
   {
        "protocol": "HTTP",
        "path": "/",
        "gracePeriodSeconds": 3,
        "intervalSeconds": 10,
        "portIndex": 0,
        "timeoutSeconds": 10,
        "maxConsecutiveFailures": 3
   }
 ]
}
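
For reference, this is roughly how a definition like that gets pushed at Marathon's REST API. The hostname master1 and the filename apache2.json are just placeholders for your own setup:

# Submit a new app definition to Marathon (its API listens on port 8080 on the masters)
curl -X POST http://master1:8080/v2/apps \
     -H "Content-Type: application/json" \
     -d @apache2.json

# Push an updated definition for an existing app, e.g. after adding a vhost label
curl -X PUT http://master1:8080/v2/apps/apache2 \
     -H "Content-Type: application/json" \
     -d @apache2.json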

The really cool thing about all this is that once you have told Marathon what services you want to run and how many instances of them you need, it will keep them going for as long as you have enough hardware to make it happen. If a machine fails, Marathon will happily restart the failed services wherever there is spare capacity in the cluster.
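
Scaling is just another update to the app definition. As a rough sketch (again assuming Marathon is reachable at master1:8080), bumping the Apache service from two instances to four looks like this:

# Scale the apache2 app to four instances; Marathon finds the space and starts them
curl -X PUT http://master1:8080/v2/apps/apache2 \
     -H "Content-Type: application/json" \
     -d '{ "instances": 4 }'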

Traffic

Having containers move around from slave to slave is great for resilience but a royal pain in the butt from a DNS perspective. After all, your service is not highly available if no one knows where it went in the event of a disaster. Enter Marathon-lb. Marathon-lb is a small bit of connective tissue that receives events from the Marathon framework and reconfigures HAProxy on-the-fly to ensure that no matter where your container is running in the cluster, traffic will still reach it. I run a copy of Marathon-lb on each slave in the cluster.
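
As a sketch, starting an instance of Marathon-lb on a slave looks something like this (master1 is a placeholder, and in practice you would run it as a Marathon app or a unit file rather than by hand). The --group flag matches the HAPROXY_GROUP label in the Apache definition above:

# Run Marathon-lb in a container on the slave, listening for Marathon events over SSE
docker run -d --name marathon-lb --net=host \
  mesosphere/marathon-lb sse \
  --marathon http://master1:8080 \
  --group external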

In addition to Marathon-lb I also use Mesos-DNS to help with service discovery from within the cluster. Mesos-DNS handily provides names for your running services using the following format: {service-name}.{framework}.mesos

e.g. to reach my rabbitmq container I can use the name rabbitmq.marathon.mesos; the name rabbitmq is the ID you assign to the service in the JSON file you pass to Marathon.

Mesos-DNS is also updated on-the-fly as containers are added, removed and moved about. If you have multiple instances of a service (like my Apache example) then requesting the service by name will also spread the load, as Mesos-DNS returns any IP from the pool of running instances.
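
From any box inside the cluster (assuming Mesos-DNS is listed as a resolver on the slaves), that looks like this:

# Resolve services by name from inside the cluster
dig +short rabbitmq.marathon.mesos   # the single rabbitmq instance
dig +short apache2.marathon.mesos    # one A record per running Apache instance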

Shared Storage

A problem that occurs with having containers moved to new machines in the event of a failure is that any persistent data that was on the old (now missing) machine dies with its host. I found a way round this for configuration and low-speed data using GlusterFS.

Each of the slaves runs the GlusterFS client, which I use to mount a shared file system that is kept in sync across all the slaves. This means that no matter which slave a Docker container finds itself running on, it is always able to get at its persistent storage volume.
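
The mount itself is plain GlusterFS. On a slave it would look something like this (gluster1 is a placeholder for one of the Gluster peers; gv0 is the volume the Apache definition above maps its paths from):

# One-off mount of the shared volume
mount -t glusterfs gluster1:/gv0 /mnt/gv0

# Or permanently, via /etc/fstab
gluster1:/gv0  /mnt/gv0  glusterfs  defaults,_netdev  0  0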

In addition to this I'm also researching a tool called REX-Ray that will allow me to provision storage volumes directly from my cloud provider and attach them to Docker containers no matter which Mesos slave they find themselves on.

Block storage

GlusterFS is great for non-performance-sensitive data but is nowhere near good enough as a back-end for databases and the like. For this we use REX-Ray, which allows us to bind a block storage volume to a container on the fly.

REX-Ray takes care of provisioning the volume from our OpenStack provider and keeps it attached to the right container as that container moves from one node to another in the cluster.
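
For a flavour of what that looks like, here's a rough sketch of an app definition using Marathon's external volumes support with the REX-Ray Docker volume driver. The image, paths and volume name are illustrative rather than one of our real definitions, and the upgrade strategy is there so the old task releases the volume before the replacement tries to attach it:

{
  "id": "postgres",
  "cpus": 1,
  "mem": 1024,
  "instances": 1,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "postgres:9.5",
      "network": "BRIDGE",
      "portMappings": [
        { "containerPort": 5432, "hostPort": 0, "protocol": "tcp" }
      ]
    },
    "volumes": [
      {
        "containerPath": "/var/lib/postgresql/data",
        "external": {
          "name": "postgres-data",
          "provider": "dvdi",
          "options": { "dvdi/driver": "rexray" }
        },
        "mode": "RW"
      }
    ]
  },
  "upgradeStrategy": {
    "minimumHealthCapacity": 0,
    "maximumOverCapacity": 0
  }
}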