This year I had the pleasure of presenting at DEF CON. The goal of the research I presented was to look at identifying various packers, compilers and cryptors by using patterns in assembly mnemonics, in addition to a couple other PE header values. Spoiler alert, the technique works pretty well, as the results were encouraging. This post will summarize my presentation “I am packer and so can you.” In addition, I’ll point to the code so you can begin creating your own signatures and begin playing with (and improving) this specific technique.
The motivation for the research was an attempt to create an easy-to-understand, simple-to-manage, cross-platform method of packer (from here on out “packer” will be a generic term for compiler, packer, cryptor) detection.
Another requirement was to provide some type of fuzzy-matching while capturing some aspect of executable behavior. Several of the tools traditionally used for packer detection haven’t been updated in years or are simple translations of old techniques to new tools. The area seemed ripe for an update. One of my areas of interest is data science and its various applications, and DEF CON was a great excuse to apply some of my knowledge.
The solution involves disassembling a binary to get the various assembly mnemonics at the beginning of the executable (those at AddressOfEntryPoint). These mnemonics represent program flow and help describe some of the context/behavior of each executable. Once this representation is available for an executable there has to be a way to compare it to other executables to gauge similarity. This step also creates the data for judging similarity between executables.
While you could use a machine-learning (ML) model, there are some disadvantages for this type of tool. To deal with the issues of a model-based approach, I created an easy-to-understand signature language. In addition to the assembly mnemonics, the MajorLinkerVersion, MinorLikerVersion, and NumberOfSections values from the PE header are used. These values are not required but allow for signature scoping because these values are often influenced by packers.
[Microsoft : Visual Studio 2002/2003]
mnemonics = push,mov,push,push,push,mov,push,mov,sub,push,push,push,mov,call,xor,mov,mov,mov,and,mov,shl,add,mov,shr,mov,push,call,pop,test,jne
Doing a straight comparison between the mnemonics from a given executable and a signature is one way to judge a match (or not), however it doesn’t give a good metric on the amount of similarity or a fuzzy-match. To accomplish this, I used a take on Levenshtein Distance that introduces a penalty to edits left vs. the right. Since instructions are executed in order, capturing that order was important while still providing a way to not require following all program execution.
In other words, we can be slightly lazy about instructions as they continue since we likely don’t know the length of the packer stub or initial function. In order to accomplish that, I used a take on Levenshtein distance known as “Tapered Levenshtein.” To compute this value, the following equation is used: sum(1 – (mnemonic position / len(mnemonic set))). Position 0 (the leftmost mnemonic) would have a weight of one, and the values would decrease proportionally from there. Using the signature above as an example:
Edits required to convert
(8/10) + (6/10) + (4/10) = 1.8 edits
and a similarity of
1 – 1.8/10 = 80% similar
Using this computation, and the additional header values, I evaluated several datasets. While more in-depth analysis and visualizations can be seen in the slides, the randomized test dataset of approximately 411,000 samples showed really nice clustering using tapered Levenshtein. The data also produced expected results when I spot-checked clusters with various other packer-identifying technologies. In addition to the slides, code and some initial signatures can be found here: