NOTE: Numenta has
announced a strategic partnership with Avik Partners,
please read more about the future of
Grok for IT Analytics.
I was at the AWS summit in NYC, and when I got back to California, I saw a
notification from Grok that our web server auto scaling group (ASG) was acting
anomalously. At first I thought it was only because we were seeing heavier than
usual traffic since we were mentioned in Information Week and VentureBeat, but
when I checked it out in Grok’s mobile app, I noticed a couple of odd and
First, I saw an odd pattern in the NetworkIn graph – instead of a flattish line
with minor variation and different sized rough edged spikes I’d expect from
regular web requests, or a line trending upward as the web traffic went up from
the media mentions, what I was seeing was a stable line with very high, very
sharp spikes at regular intervals. On closer examination, the peak spikes were
even more suspicious – they were all very nearly the same size.
Secondly, Grok had also flagged the group’s CPU metric, and when I looked at it,
the graph was very atypical. This ASG normally has a CPU load that hovers around
20%, and what I was seeing was hovering around 40-45% with surges to 100%. The
surges were not as sharply edged as the network spikes, but there were still
surges in load.
Finally, even though the CPU was staying above the level we’d configured our ASG
to add instances at, when I took a look at the ASG instance count it was still
hovering around the lower bound of the ASG’s size limit.
After staring at the individual graphs on the phone for a bit, I pulled up the
Grok web chart page so I could compare them to each other – sometimes I prefer
using the web chart because it allows me to view more than one of an instance or
ASG’s underlying metric graphs’ raw data charts at once, and also because by
viewing them on a 15 inch MacBook Pro screen instead of a phone, I can see a
bigger picture of what is going on. Sure enough, once I expanded the CPU and
NetworkIn underlying charts on my laptop I could see that the surges in CPU were
lagging behind the spikes in network, and I immediately got an idea of what was
going on, and a quick look at the ASG’s scaling history and the logs on a
freshly started web server instance confirmed my suspicions.
Here’s a little background about our web server back end auto scaling group. The
Numenta web site itself is stored in a git repository and uses a combination of
docpad, nodejs and some custom scripts to render the static pages of the site
from various source assets. Each web server updates the local copy of the git
repository every fifteen minutes, and if the master branch has changed, it
renders the site and swaps the new rendered document directory tree into place.
That’s the normal process, and because we’re using git we only have to download
the changes in the source files since the last site update.
What was happening in this case though, was that the site renders were failing
on fresh instances. So the ASG would spin up a new instance, the new instance
would have to download the entire git repository, it would attempt to run the
build process and the build would fail. Then the health check script would run,
see that the web server didn’t have a valid document root, report the instance
as unhealthy to the ASG, and the ASG would terminate the instance. Once the cool
down expired, we’d lather, rinse and repeat. If all the old instances had gotten
terminated in the ASG, we’d have had downtime on the website since new instances
were failing to become healthy and get added to the load balancer.
The reason the problem was only showing up on new instances was because the
support script was just checking that a particular component was installed, not
that it was a specific version. When the upstream source for that component
updated to a new version, it wasn’t compatible with our build process, so new
instances would fail to build. The identically sized NetworkIn spikes were when
a new instance started up and did the initial git clone of the website
repository, and the wider CPU spikes were when the new instance would attempt to
render the website’s static assets.
We resolved the issue in two ways. In the short term, we updated the build
script so that it specified exact versions for all of its dependencies. This
stopped the new instance churn right away. In the long term, we stopped
rendering the site on the web servers – we had initially decided to build the
static site on the web server to eliminate machine-generated files from the
repository, but now we pre-render everything and check the static site into the
repository. It can be served immediately after a git pull and the web servers
don’t have to do anything else when we update the web site content.
Here’s an example of the new, normal curves after all the fixes were put in
I could have figured this out without Grok, but I would have had to look at
every metric for the ASG, then scrubbed through them to find where the anomalous
behavior began, then rummaged through the log files, but, and this is an
important but, I wouldn’t have started looking until the last good web server
backend failed and I got an alert that the website was down. Thresholds on the
CPU load or NetworkIn wouldn’t have helped me – the spikes I saw were what I
would have expected to see when a new instance was added to the ASG, what made
them unusual was how rapidly the instances were getting added and dropped.
Instead, Grok notified me when the anomalous behavior started, highlighted
exactly which metrics were showing the strange behavior, and let me compare them
to each other immediately so I could get good picture of the situation before I
started looking in my logs, and before the website actually went offline.