Last fall, we came across a blog post published by AWS, Amazon’s cloud computing subsidiary, announcing Amazon Kinesis Analytics for anomaly detection, which uses the Random Cut Forest (RCF) algorithm to identify anomalies in streaming data.
Seeing a new algorithm for anomaly detection piqued our interest because we have developed our own benchmark for anomaly detection at Numenta, called the Numenta Anomaly Benchmark (NAB). For those who are unfamiliar with NAB, it is the first benchmark designed to evaluate anomaly detection algorithms on streaming time-series data, and it has a unique scoring scheme that gives more credit to algorithms that find anomalies earlier while penalizing false positives and missed anomalies.
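To make the idea of rewarding early detections concrete, here is a minimal Python sketch of a window-relative weighting function, modeled on the scaled sigmoid described in the NAB paper. The function name and the position convention (-1.0 at the start of an anomaly window, 0.0 at its end, positive values after the window) are assumptions for illustration, and the per-profile weights that NAB's scorer applies on top are omitted.

```python
import math


def scaled_sigmoid(relative_position):
    """Weight a detection by where it falls relative to an anomaly window.

    relative_position: -1.0 at the window start, 0.0 at the window end,
    and greater than 0.0 for detections after the window has closed.
    """
    if relative_position > 3.0:
        # Far past the window: treat as a full-weight false positive.
        return -1.0
    # Equivalent to 2 * sigmoid(-5 * x) - 1: positive inside the window
    # (largest at the start), zero at the window end, negative after it.
    return 2.0 / (1.0 + math.exp(5.0 * relative_position)) - 1.0


if __name__ == "__main__":
    for position in (-1.0, -0.5, 0.0, 0.5, 4.0):
        print(position, round(scaled_sigmoid(position), 3))
```

A detection at the very start of a window earns close to the full reward, one at the end of the window earns nothing, and one after the window is penalized.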
NAB’s open source repository contains more than 50 labeled data streams taken from a wide range of real-world sources. While reading the AWS paper on using the RCF algorithm for anomaly detection, we were pleasantly surprised to see that the authors demonstrated their algorithm on one of NAB’s sample datasets, which consists of six months of taxi ridership volume in New York City.
Out of curiosity, we asked Luiz Scheinkman, one of our engineers, to run RCF on NAB to see how the new algorithm would score compared to the other anomaly detection algorithms we’ve evaluated using NAB, including our own HTM algorithm.
… And the results are in
Amazon’s RCF ranked fifth out of the 10 anomaly detection algorithms tested on NAB. Most of these are algorithms we found available in open source, while a few were entries submitted to the 2016 NAB Competition.
| Detector | Standard Profile | Reward Low FP | Reward Low FN |
|---|---|---|---|
| Random Cut Forest | 51.7 | 38.4 | 59.7 |
| Twitter ADVec v1.0.0 | 47.1 | 33.6 | 53.5 |
To better understand how Luiz evaluated RCF on NAB, I sat down with him and asked him a few questions about the process.
Can you give more details on how you integrated RCF into NAB?
There are three different ways to test custom algorithms on NAB:
1. Create a custom detector using the NAB API
2. Give NAB anomaly scores before the threshold optimization phase
3. Give NAB the anomaly detections
Because the algorithm was already implemented in AWS Kinesis Data Analytics, I chose option 2: I streamed NAB data directly to AWS Kinesis and calculated the anomaly scores using the built-in RANDOM_CUT_FOREST function.
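The difference between options 2 and 3 comes down to who applies the threshold. As a rough Python sketch (the function name and sample scores are hypothetical, and this is not NAB's own code), turning option 2's anomaly scores into option 3's detections amounts to:

```python
def scores_to_detections(anomaly_scores, threshold):
    """Flag every record whose anomaly score meets or exceeds the threshold."""
    return [1 if score >= threshold else 0 for score in anomaly_scores]


if __name__ == "__main__":
    # Hypothetical scores; with option 2, NAB chooses the threshold itself.
    sample_scores = [0.05, 0.10, 0.92, 0.40, 0.75]
    print(scores_to_detections(sample_scores, threshold=0.7))
```

With option 2, NAB optimizes the threshold on the detector's behalf, so the detector only needs to supply one score per record.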
How much tweaking did you have to do to the algorithm to get these results?
I used the AWS Kinesis Data Analytics default template for anomaly detection as is; the only tweak was to normalize the anomaly score.
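The interview does not say which normalization was applied. One common choice, shown here purely as an assumption, is min-max scaling, which maps raw scores onto the range 0 to 1 so that a single threshold can be swept across them. The function name is hypothetical.

```python
def min_max_normalize(raw_scores):
    """Rescale scores linearly so the smallest maps to 0.0 and the largest to 1.0."""
    if not raw_scores:
        return []
    low, high = min(raw_scores), max(raw_scores)
    if high == low:
        # All scores are identical; there is no spread to rescale.
        return [0.0 for _ in raw_scores]
    span = high - low
    return [(score - low) / span for score in raw_scores]


if __name__ == "__main__":
    # Made-up raw scores, not actual RANDOM_CUT_FOREST output.
    print(min_max_normalize([0.8, 1.2, 3.5, 0.9]))
```

Note that min-max scaling needs the whole score series up front, so it suits an offline evaluation like this one rather than a live stream.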
Have you tried changing the algorithm parameters to see if you would get different results?
I’ve tried different values for the “shingleSize” and “numberOfTrees” parameters and published the results with the best scores I found, which turned out to come from the default values of the RANDOM_CUT_FOREST function. The results are easy to replicate and the parameters are easy to change, so it would be great if anyone in the open source community could try different parameters and let me know if that improves the scores.
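A parameter search like this one can be organized as a small grid sweep. In the Python sketch below, the candidate values and the `run_and_score` callback are hypothetical placeholders: in practice the callback would reconfigure the application, stream the NAB data, and return the resulting NAB score.

```python
from itertools import product


def best_parameters(shingle_sizes, tree_counts, run_and_score):
    """Try every (shingle size, tree count) pair and keep the best-scoring one.

    Returns a tuple (score, shingle_size, number_of_trees), or None if
    either list of candidates is empty.
    """
    best = None
    for shingle_size, number_of_trees in product(shingle_sizes, tree_counts):
        score = run_and_score(shingle_size, number_of_trees)
        if best is None or score > best[0]:
            best = (score, shingle_size, number_of_trees)
    return best


if __name__ == "__main__":
    def fake_score(shingle_size, number_of_trees):
        # Arbitrary stand-in so the sketch runs; these are not NAB results.
        return shingle_size * 0.1 + number_of_trees * 0.01

    print(best_parameters([1, 4, 8], [50, 100, 200], fake_score))
```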
How can somebody else replicate these results?
I’ve outlined how you can do so in seven steps. You can also find these instructions in our NAB Random Cut Forest repository.
1. Clone the NAB repository
This command will clone the repository:
git clone https://github.com/numenta/NAB.git
2. Configure your AWS credentials
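Use the AWS Command Line Interface (CLI) tool to set your credentials. The standard AWS CLI command for this is `aws configure`, which prompts for your access key ID, secret access key, default region, and output format.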
3. Create NAB results folder structure
This command will create the necessary directories and entries in the config/thresholds.json file:
python scripts/create_new_detector.py --detector randomCutForest
4. Create AWS Kinesis Analytics Application
This command will create and configure a new AWS Kinesis Analytics Application ready to receive NAB data from the input stream and output anomaly scores suitable for NAB to the output stream:
python nab/detectors/random_cut_forest/random_cut_forest.py --create
5. Stream all files
To stream all NAB data files use the following command:
python nab/detectors/random_cut_forest/random_cut_forest.py --stream
6. Clean up
At the end of the evaluation, it is recommended you delete all resources used to compute the anomaly scores. Use the following command to delete all AWS resources created by this script:
python nab/detectors/random_cut_forest/random_cut_forest.py --delete
7. Compute NAB scores
Once you have calculated anomaly scores for all NAB data, you can use NAB’s standard commands to compute NAB scores. For example, run the following command from NAB’s root directory to optimize the anomaly score threshold for your algorithm’s detections, run the scoring algorithm, and normalize the raw scores to yield your final NAB scores:
python run.py -d randomCutForest --optimize --score --normalize
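The normalization step rescales the raw score so that a detector that flags nothing scores 0 and a perfect detector scores 100, as described in the NAB paper. Here is a minimal Python sketch of that rescaling, with hypothetical function and argument names and made-up example numbers:

```python
def normalize_nab_score(raw_score, null_score, perfect_score):
    """Map a raw score onto a scale where null -> 0 and perfect -> 100."""
    if perfect_score == null_score:
        raise ValueError("perfect and null scores must differ")
    return 100.0 * (raw_score - null_score) / (perfect_score - null_score)


if __name__ == "__main__":
    # Made-up raw scores for illustration only.
    print(normalize_nab_score(raw_score=15.0, null_score=-10.0, perfect_score=40.0))
```

Because the scale is anchored at the null detector, a detector that does worse than flagging nothing ends up with a negative normalized score.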
If you have any questions or comments or would like to share your results, don’t hesitate to leave a comment below or start a thread on the NAB section of the HTM Forum.