The Care and Feeding of Machine Learning

August 29, 2016 / Chris Lord Alex Baker

Carbon Black attacks malware from every angle (behavioral, static, human, and machine) and combines these with the collective intelligence provided by our partners and customers to deliver complete coverage across our security offerings.

In this post, we want to discuss the design and technology supporting one of those angles: a state-of-the-art binary-analysis pipeline. This system takes advantage of the millions of real-world binaries we see every day to feed a pipeline that incorporates dozens of auto-scaling AWS instances, distributed Docker containers, clusters of H2O-based machine-learning engines, and a suite of logging and visualization capabilities. This gives us the ability to run a broad suite of algorithms and analytics across hundreds of millions of files from a collection of billions.

Data Processing

The backbone of our analysis pipeline is a cluster of auto-scaling AWS instances. Each worker instance contains a series of containers that handle three main tasks:

1. Ingest incoming binaries: extract and compute features, statistics, and abstractions from incoming binaries. Binaries come from customers, partners, and trawls of the web for diverse goodware and malware samples. This produces a flow of millions of new abstractions per day that must be analyzed.

2. Analyze extracted features: continuously apply machine-learning models to abstractions, algorithmically identifying instances of malicious patterns. The output of this task is a series of predictions about binaries’ potential maliciousness and relationships to known malware families. These predictions are validated against outside intelligence.

3. Enrich with intelligence: update information on binaries from internal and external intelligence and compare past predictions. Intelligence comes from our partners, our customers, and Carbon Black malware analysts. The output of this task is information on the efficacy of different models and data that is used to update or retrain models.
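To make the three tasks concrete, here is a minimal sketch in Python. It is an illustration under our own assumptions, not the production code: the function names, the three features (hash, size, byte entropy), and the entropy threshold that stands in for a trained model are all hypothetical.

```python
import hashlib
import math
from collections import Counter


def extract_features(data: bytes) -> dict:
    """Ingest: reduce a binary to a small feature record (an 'abstraction')."""
    counts = Counter(data)
    total = len(data)
    # Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0).
    entropy = 0.0
    if total:
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "size": total,
        "entropy": entropy,
    }


def analyze(features: dict, entropy_threshold: float = 7.0) -> bool:
    """Analyze: toy stand-in for a trained model; flags high-entropy files."""
    return features["entropy"] >= entropy_threshold


def enrich(predictions: dict, intelligence: dict) -> dict:
    """Enrich: compare past predictions with verdicts that arrived later.

    Both arguments map a file hash to True (malicious) or False (benign).
    Returns a confusion-matrix tally that can be used to judge a model.
    """
    tally = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for sha, predicted in predictions.items():
        if sha not in intelligence:
            continue  # no outside verdict for this file yet
        actual = intelligence[sha]
        key = ("t" if predicted == actual else "f") + ("p" if predicted else "n")
        tally[key] += 1
    return tally
```

The tally returned by `enrich` is the kind of efficacy signal that could feed a decision to update or retrain a model.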


Monitoring and Scaling

We monitor the performance of our cluster using Amazon Auto Scaling. When resource usage rises beyond specific thresholds, additional containers are spun up using Docker, and additional instances are added as needed. Similarly, when usage falls (for example, because the backlog of binaries to process has been cleared), the system aggressively spins down containers and instances. The load-based scaling is augmented by AWS Lambda scripts that fine-tune processing allocation between ingest, analyze, and enrich tasks to keep all tasks responsive.
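The scaling logic can be sketched in a few lines of Python. This is a simplified illustration, not our Auto Scaling or Lambda configuration: the thresholds, the doubling and halving steps, and the proportional split across tasks are assumptions chosen for the example.

```python
def desired_workers(current: int, utilization: float,
                    scale_up_at: float = 0.75, scale_down_at: float = 0.25,
                    min_workers: int = 1, max_workers: int = 64) -> int:
    """Threshold-based scaling: grow when busy, shrink aggressively when idle."""
    if utilization > scale_up_at:
        target = current * 2
    elif utilization < scale_down_at:
        target = current // 2
    else:
        target = current
    return max(min_workers, min(max_workers, target))


def allocate(total_workers: int, backlog: dict) -> dict:
    """Split workers across tasks in proportion to each task's backlog.

    Every task keeps at least one worker so that it stays responsive.
    """
    tasks = list(backlog)
    if total_workers < len(tasks):
        raise ValueError("need at least one worker per task")
    alloc = {task: 1 for task in tasks}
    spare = total_workers - len(tasks)
    total_backlog = sum(backlog.values())
    if spare == 0 or total_backlog == 0:
        return alloc  # nothing to distribute; spare workers stay idle
    shares = {task: spare * backlog[task] / total_backlog for task in tasks}
    for task in tasks:
        alloc[task] += int(shares[task])
    # Hand out workers lost to rounding, largest fractional share first.
    leftover = total_workers - sum(alloc.values())
    by_remainder = sorted(tasks, key=lambda t: shares[t] - int(shares[t]), reverse=True)
    for task in by_remainder[:leftover]:
        alloc[task] += 1
    return alloc
```

For example, `allocate(10, {"ingest": 700, "analyze": 200, "enrich": 100})` gives most workers to ingest while leaving analyze and enrich with enough capacity to keep moving.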

We also use Amazon’s spot instances to respond to changes in pricing. The engine automatically spins up dozens of instances to process a high volume of binaries when it is cost-effective to do so, and will defer some processing when prices are high.
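As a rough sketch of the price-aware decision, the Python function below processes everything when the spot price is at or below a ceiling and only the urgent work otherwise. The split into urgent and deferrable work, the price ceiling, and the per-instance capacity are hypothetical values for illustration; the real engine weighs more than this.

```python
import math


def plan_spot_batch(urgent: int, deferrable: int,
                    spot_price: float, price_ceiling: float,
                    per_instance: int = 10_000, max_instances: int = 48):
    """Return (instances_to_start, binaries_deferred) for one planning round."""
    # Deferrable work is only scheduled while the spot price is acceptable.
    to_process = urgent
    if spot_price <= price_ceiling:
        to_process += deferrable
    instances = min(max_instances, math.ceil(to_process / per_instance))
    processed = min(to_process, instances * per_instance)
    deferred = urgent + deferrable - processed
    return instances, deferred
```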


This dynamic approach to using instances works because our processing is completely stateless; we use a series of SQS queues to pass messages, S3 buckets to store binaries, and a variety of high-performance databases for storing analyses, predictions, statistics, and gathered intelligence.
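The stateless pattern looks roughly like the sketch below. To keep it self-contained, an in-memory `queue.Queue` stands in for an SQS queue and plain dictionaries stand in for the S3 bucket and the results database; the message shape and the function names are assumptions made for the example.

```python
import hashlib
import json
import queue


def handle_message(body: str, blob_store: dict, results: dict) -> str:
    """Process one message using only the message and the external stores."""
    message = json.loads(body)
    data = blob_store[message["key"]]  # fetch the binary from the object store
    digest = hashlib.sha256(data).hexdigest()
    # Keying the result by content hash makes redelivered messages harmless.
    results[digest] = {"key": message["key"], "size": len(data)}
    return digest


def drain(work_queue: queue.Queue, blob_store: dict, results: dict) -> int:
    """Handle messages until the queue is empty; return how many were handled."""
    handled = 0
    while True:
        try:
            body = work_queue.get_nowait()
        except queue.Empty:
            return handled
        handle_message(body, blob_store, results)
        handled += 1
```

Because a worker keeps nothing between messages, any instance can be stopped at any time and another can pick up the next message without coordination.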

Elastic Stack provides visibility into the process via searchable logs, and a combination of Telegraf, InfluxDB, and Grafana is used to generate dashboards that provide insight into performance at a glance. The dashboards are broadcast internally on wall displays to keep other teams at Carbon Black informed of our progress and mission (and they’re cool looking, too).

Automatic Model Building

Incoming binaries are tagged by a collection of models trained on sample sets drawn from an existing, constantly updated library of billions of malicious and benign binaries. For example, one of our modeling approaches utilizes large H2O clusters (running outside the main data processing pipeline) to automatically generate Gradient Boosting Machines made of thousands of connected decision trees, each trained on millions of sampled binaries. Statistics are constantly gathered and used to further fine-tune our methods, creating an ever-improving feedback cycle. But we’re getting ahead of ourselves. We’ll discuss much more about our approaches to model building in future blog posts. In the meantime, head here for some more details about our approach to dynamic analysis.
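For readers new to gradient boosting, here is a toy version in pure Python. It is not H2O's API and not our model: it uses one-split trees (stumps), squared-error loss on a 0/1 label, and two made-up features, purely to show how each new tree is fitted to the errors left by the trees before it.

```python
def fit_stump(X, residuals):
    """Find the single-feature threshold split that best fits the residuals."""
    best = None
    for f in range(len(X[0])):
        for threshold in sorted({row[f] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[f] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[f] > threshold]
            if not left or not right:
                continue
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            err = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, f, threshold, lmean, rmean)
    if best is None:  # no valid split: every row is identical
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, row):
    f, threshold, left_value, right_value = stump
    return left_value if row[f] <= threshold else right_value


def fit_gbm(X, y, n_trees=50, learning_rate=0.3):
    """Boosting loop: each stump is trained on the current residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        # For squared-error loss, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, X)]
    return base, learning_rate, stumps


def gbm_predict(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)


# Hypothetical features per binary: [byte entropy, import count]; label 1 = malicious.
X = [[7.9, 120], [7.6, 300], [7.8, 80], [4.1, 900], [5.0, 1500], [3.2, 400]]
y = [1, 1, 1, 0, 0, 0]
model = fit_gbm(X, y)
```

A score above 0.5 from `gbm_predict(model, row)` would be read as "likely malicious" in this toy setup. A production system would use deeper trees, a classification loss, and far more data.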

The data-analysis pipeline keeps our data diverse and up to date, and it would not have been possible without some amazing work by other members of R&D (Doran Smestad, Jason Cherry, and John Langton) and the support of the Carbon Black Threat and Product teams in tapping existing data and intelligence sources.

Having this pipeline is crucial, because it’s a rapidly changing world out there, and attackers don’t wait for your models to update.
