There are countless ways to apply machine learning to the task of identifying malware. Approaches based on static analysis have historically been popular, but they have not proven sufficient for reliably detecting new attacks. Rather, the most resilient machine-learning approaches involve dynamic analysis: evaluating programs based on the actions they take. In this blog post, we’ll discuss how we’re using the data and resources at Carbon Black to do dynamic analysis right.
Dynamic Analysis: Difficult but Necessary
The term “dynamic analysis” (sometimes called behavioral analysis) exists primarily to differentiate it from static analysis. Static approaches look at a file as it sits idly on the disk, attempting to determine its nature without allowing it to perform any of its intended actions. It is widely understood that dynamic analysis is a more resilient, effective approach to detecting malware [1]. It’s easy for malware to look benign, but it’s hard for it to act benign.
Then why do so many products out there only do static analysis?
One of the main reasons is that static analysis is comparatively easy to execute at scale. You gather some samples of good and bad data, extract some features, and feed them into your algorithms. Here at Carbon Black, we’ve got good and bad files by the hundreds of millions, so performing static analysis was low-hanging fruit. Our architecture for performing this analysis is detailed in this post.
But we know we need to do more. The thing is, if you want to do dynamic analysis, you need examples of malware actually running, and you need samples of normal, everyday computer usage to compare them against. Getting this data in a machine-learning-ready format, at scale, is a huge challenge.
Challenges and How We Address Them
The ultimate goal of a dynamic-analysis algorithm is to identify malware in a real-world setting. It’d be sensible to assume that we would train on lots of samples of real-world malware and real-world goodware but, unfortunately, it’s not quite as simple as that. What follows is a summary of the issues you face when performing dynamic analysis and how we’ve tried to address them.
A Solid Real-World Baseline
Dynamic analysis generally involves identifying sequences of events that are indicative of malware. Of course, it isn’t enough to identify sequences that are common in malware execution. You need to find sequences that are common in malware but rare during normal, benign execution. The more discriminating your sequences are between bad and good behavior, the fewer false positives you’ll have to wade through.
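To make the idea of a "discriminating" sequence concrete, here is a minimal sketch. It assumes each execution is recorded as an ordered list of event names, and it scores every two-event sequence by a smoothed log-ratio of how often it appears in malware traces versus benign traces. The event names, the `discriminating_scores` function, and the scoring rule are all illustrative assumptions, not a description of the Carbon Black pipeline.

```python
from collections import Counter
from math import log

def ngrams(events, n=2):
    """Return every run of n consecutive events as a tuple."""
    return [tuple(events[i:i + n]) for i in range(len(events) - n + 1)]

def trace_frequency(traces, n=2):
    """Count how many traces contain each n-gram at least once."""
    counts = Counter()
    for trace in traces:
        counts.update(set(ngrams(trace, n)))
    return counts

def discriminating_scores(malware_traces, benign_traces, n=2, smoothing=1.0):
    """Smoothed log-ratio of malware vs. benign trace frequency per n-gram.

    Positive scores mean the sequence is relatively more common in malware;
    negative scores mean it is relatively more common in benign activity.
    """
    mal = trace_frequency(malware_traces, n)
    ben = trace_frequency(benign_traces, n)
    scores = {}
    for gram in set(mal) | set(ben):
        p_mal = (mal[gram] + smoothing) / (len(malware_traces) + 2 * smoothing)
        p_ben = (ben[gram] + smoothing) / (len(benign_traces) + 2 * smoothing)
        scores[gram] = log(p_mal / p_ben)
    return scores

# Toy traces with hypothetical event names.
malware_traces = [
    ["proc_start", "file_write", "reg_mod", "net_conn"],
    ["proc_start", "reg_mod", "net_conn"],
]
benign_traces = [
    ["proc_start", "file_write", "file_read"],
    ["proc_start", "file_read", "net_conn"],
    ["proc_start", "file_write", "net_conn"],
]

scores = discriminating_scores(malware_traces, benign_traces)
top = max(scores, key=scores.get)
print(top, round(scores[top], 3))
```

In this toy data, a sequence such as `("proc_start", "file_write")` shows up in both sets and scores below zero, which is exactly why frequency in malware alone is not a useful signal.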
To find the most discriminating sequences, you need a large body of everyday, attack-free data to train on. This can be challenging, but it’s one place where Carbon Black is extremely well-positioned. We have a huge body of real-world execution data, as gathered by the Carbon Black Response product. By working with anonymized data, contributed by numerous partners and customers, we have access to a world-class dynamic analysis goodware baseline.
Bolstering Malware Samples
Of course, there isn’t very much real-world malware data to be found on our customers’ computers (if there were, we wouldn’t be doing our jobs very well!). There are other solutions: we could engage in full-scale, full-context mock attacks of isolated machines and record the results, for example. But we need data at large scale. A detonation pipeline is the answer: it enables us to automatically execute numerous malware samples in a virtual machine while Cb Response sensors gather the execution data that we need. This is a common approach, but it’s not without its own complications.
Validating Detonation Against Real Data
The risk here is that detonated malware will not reflect the real-world behavior that we’re trying to detect and protect against. For example, some malware attempts to circumvent observation by detonation services, and differences in context may lead to differences in the actions taken. On some level, this is an inescapable challenge when using this approach [2], and we use the best tools we can to stay ahead in the cat-and-mouse game between detonation services and evasive malware. In addition, we statistically compare our detonations to the real-world examples and use those results to mitigate these differences.
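One simple way to picture a statistical comparison between detonated and real-world executions is to compare how often each event type occurs in the two data sets. The sketch below uses Jensen-Shannon divergence for that purpose. This is an assumed, illustrative choice of statistic, and the event names and helper functions are hypothetical; the post does not specify which tests are actually used.

```python
from collections import Counter
from math import log2

def event_distribution(traces):
    """Relative frequency of each event type across a collection of traces."""
    counts = Counter(event for trace in traces for event in trace)
    total = sum(counts.values())
    return {event: count / total for event, count in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    Returns 0.0 for identical distributions and 1.0 for distributions
    that share no events at all.
    """
    keys = set(p) | set(q)
    mixture = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(dist):
        return sum(v * log2(v / mixture[k]) for k, v in dist.items() if v > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)

# Toy traces with hypothetical event names.
detonated = [["proc_start", "reg_mod", "net_conn"], ["proc_start", "net_conn"]]
real_world = [["proc_start", "reg_mod", "net_conn", "file_write"]]

gap = js_divergence(event_distribution(detonated), event_distribution(real_world))
print(round(gap, 3))
```

A value near zero suggests the detonation environment produces an event mix similar to the field data for that sample; a large value flags a sample whose detonated behavior may not be representative.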
Our approach involves detonating very specific malware samples so that we can compare an unknown program against classes and families of known bad behavior. This means that the results of our algorithms are not simply “good” or “bad,” but expressed in terms of similarity to one or more known attacks, providing the defender with more context. The details of this approach will be explored in an upcoming post.
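As a rough illustration of what "similarity to known attacks" could look like as an output, the sketch below ranks hypothetical malware families by Jaccard similarity between their behavior sets and the behaviors observed from an unknown program. The family names, behavior labels, and the choice of Jaccard similarity are assumptions made for the example only; the actual method is left to the upcoming post.

```python
def jaccard(a, b):
    """Jaccard similarity of two behavior sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def rank_families(observed, family_profiles):
    """Return (family, similarity) pairs, most similar family first."""
    scores = {
        name: jaccard(observed, profile)
        for name, profile in family_profiles.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical family profiles built from detonated samples.
family_profiles = {
    "family_a": {"reg_mod", "net_conn", "file_encrypt"},
    "family_b": {"proc_inject", "net_conn"},
}

# Behaviors observed from an unknown program.
observed = {"reg_mod", "net_conn", "file_write"}

ranking = rank_families(observed, family_profiles)
print(ranking)  # [('family_a', 0.5), ('family_b', 0.25)]
```

Reporting a ranked list rather than a single verdict is what gives the defender context: the result reads as "behaves most like family_a" instead of just "bad".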
Because we have the resources at the ready, we are also detonating select goodware samples. This enables us to make several useful comparisons. We can 1) compare detonated malware and goodware, allowing for a parallel same-kind analysis; and 2) compare real-world goodware executions to detonation-driven executions of the same files. These serve as sanity checks, and help to inform the results from our primary machine learning efforts.
Full-Scale Data for Full-Scale Problems
Any reasonable machine-learning approach to endpoint security is going to face the problem of obtaining training data at scale. If you’re looking at files, you’ll need a lot of files. If you’re looking at behavior, you’re going to need a lot of behavior. Unfortunately, obtaining lots and lots of examples of real attacks, as they happen, isn’t necessarily feasible.
Our solution is to use:
- A massive body of baseline data
- A torrent of detonation data
- Statistics and comparisons between behaviors for validation
Collectively, these approaches give us a powerful set of tools to generate patterns of malicious behavior. Stay tuned for more details about the machine learning that this is all supporting!
[1] The argument that dynamic methods are more effective (if more difficult) than static methods is common in articles about automated dynamic analysis, for example here, here, here, here, here, and this survey. Generally speaking, the arguments for static methods are that they are faster, easier, or safer to implement, but not that they are inherently more effective than dynamic approaches. One rare exception is the argument that static disassembly might catch unexecuted code blocks that dynamic approaches could miss.
[2] Consider, for example, this paper: 9 of the 10 related research approaches it describes use some form of virtualization or emulation.