Digital image processing has constantly evolved since its inception more than half a century ago. From Lawrence Roberts’ seminal thesis on machine perception to modern-day robotic vision, numerous techniques have been developed, each tailored to the challenges of a particular application.
At Bit9 + Carbon Black, we’ve integrated image recognition into our existing software reputation and malware analysis, giving you a new tool to combat social engineering. This feature is included in Carbon Black v5.1’s “Icon Matching Feed.”
Directly comparing icons byte-for-byte isn't enough; we need algorithms that genuinely recognize images. Much like binary polymorphism, malware authors often subtly change their icons to evade traditional signature-based detection.
Our process has five main steps, broadly similar to any image recognition process:
- Developing trusted icon reference sets
- Icon extraction and normalization
- Feature extraction
- Similarity calculation & match to a trusted icon
- Trust validation
Developing Trusted Icon Reference Sets
Image recognition is more precisely “image categorization.” Given a set of known icons divided into categories and an unknown icon, which category best fits the unknown icon?
Since phishing attacks take advantage of brand recognition, categorization requires a list of recognized brands and, for each brand, a set of all icons associated with the brand. These can be remarkably diverse. Here’s a sample from the Adobe PDF icon reference set:
These reference sets form the foundation of the image recognition process. These are the brands your users recognize, trust, and are more likely to click when they shouldn’t.
Icon Extraction and Normalization
The first step in recognizing an image is extracting it from the binary – more complex than you might expect, because strictly compliant PEs are the unicorn of the security world. This was the first place where our experience with malware came in handy. RT_ICON resources house the images – either as whole PNG files or, in the case of bitmaps, just the pixel data (with the bitmap metadata kept in the RT_GROUP_ICON resource alongside the directory). The extracted icons come in different file formats, resolutions, and color depths, so we reformat them all into a consistent intermediate representation with uniform resolution and color depth.
Feature Extraction
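As a concrete (and simplified) illustration of the normalization step, here's a sketch in Python/NumPy that collapses an icon to a fixed-size grayscale array. The target resolution, grayscale weights, and nearest-neighbor resampling are illustrative choices, not the parameters we actually use:

```python
import numpy as np

def normalize_icon(pixels: np.ndarray, size: int = 32) -> np.ndarray:
    """Convert an extracted icon (H x W x C, uint8) into a uniform
    size x size grayscale array with values in [0, 1]."""
    # Collapse any color channels to luminance (ITU-R BT.601 weights).
    if pixels.ndim == 3:
        rgb = pixels[..., :3].astype(np.float64)
        gray = rgb @ np.array([0.299, 0.587, 0.114])
    else:
        gray = pixels.astype(np.float64)
    gray /= 255.0
    # Nearest-neighbor resample to the target resolution.
    h, w = gray.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return gray[np.ix_(rows, cols)]

# A 48x48 RGBA icon normalizes to a 32x32 grayscale array.
icon = np.random.randint(0, 256, (48, 48, 4), dtype=np.uint8)
normalized = normalize_icon(icon)
print(normalized.shape)  # (32, 32)
```

Whatever the exact parameters, the point is that every icon leaves this stage in the same shape, so downstream feature extraction never has to care about the original format.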
There isn’t one right algorithm for all contexts, so we studied what makes icons different from other, common image-recognition subjects. The most obvious distinction is that icons are very small files. Where traditional recognition is an exercise in reduction, feature extraction from icons is a bit like squeezing blood from a stone. Compared to images of natural objects, color-based features are relatively insignificant in icons. Shapes are more consistent, constrained by market recognition and operating parameters. For example, the black and white PDF icon is still clearly a PDF:
Sensitivity to local extrema can be problematic with natural subjects (such as recognizing a person’s mouth without concern for whether the lips are chapped), but icons don’t have room for gradual contours, vague geometry, or other aberrations. Object positioning within the “scene” likewise isn’t especially characterizing of icons. If, for example, a user encounters an icon with its text placed below the logo rather than above it, the meaning conveyed by those components is entirely unchanged:
For these reasons, and more, our feature extraction implementation focuses primarily (but not exclusively) on edge detection. For example, here’s a normalized input icon and its edge representation:
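Our exact edge-detection implementation is beyond the scope of this post, but a classic Sobel filter captures the idea. This NumPy sketch (with an arbitrarily chosen threshold) turns a normalized grayscale icon into a binary edge map:

```python
import numpy as np

def sobel_edges(gray: np.ndarray, threshold: float = 0.25) -> np.ndarray:
    """Return a binary edge map from a grayscale image in [0, 1]."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T  # vertical-gradient kernel
    h, w = gray.shape
    mag = np.zeros((h, w))
    # Convolve the interior with the two Sobel kernels.
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            patch = gray[i - 1:i + 2, j - 1:j + 2]
            gx = np.sum(patch * kx)
            gy = np.sum(patch * ky)
            mag[i, j] = np.hypot(gx, gy)
    return mag > threshold

# A solid square on a dark background yields edges only along its border.
img = np.zeros((16, 16))
img[4:12, 4:12] = 1.0
edges = sobel_edges(img)
```

Because the gradient magnitude is zero across flat regions, the interior of the square (and the background) drop out entirely, leaving just the outline – exactly the kind of compact, shape-centric feature that suits icons.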
Similarity Calculation and Match to a Trusted Icon
After the edges have been extracted and quantified, the edges of any two icons can be compared to quantify their similarity:
In this image, the large icons in the center are from a trusted reference set. The icons at the top and bottom are “unknown” icons we are analyzing. As you move from left to right, you see the stages of the normalization process, edge detection and similarity calculation. These two unknown icons match the reference icon with confidence scores of 47 and 57.
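Our production scoring is more involved, but the Jaccard index (overlapping edge pixels divided by the union of edge pixels) gives a feel for how two edge maps can be reduced to a single 0–100 confidence score. The function below is an illustrative stand-in, not our actual metric:

```python
import numpy as np

def edge_similarity(a: np.ndarray, b: np.ndarray) -> int:
    """Score two same-shape binary edge maps on a 0-100 scale using
    the Jaccard index (overlap of edge pixels over their union)."""
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 100  # both maps empty: trivially identical
    overlap = np.logical_and(a, b).sum()
    return int(round(100 * overlap / union))

# Identical maps score 100; a shifted copy scores lower.
a = np.zeros((8, 8), dtype=bool)
a[2:6, 2:6] = True
print(edge_similarity(a, a))                      # 100
print(edge_similarity(a, np.roll(a, 2, axis=1)))  # 33
```

A real metric would also need some tolerance for small translations and scaling, which is part of why the normalization stage matters so much.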
Here are three more examples of similarity matching. In all three cases, the trusted reference icon is on the left and a malware icon is on the right:
A brief aside – “Icon Obfuscation”
If every icon were precisely the same, this kind of analysis would be straightforward. Unfortunately, the real world is complicated, and we commonly see several distinct categories of obfuscation. For the malware author, it's a delicate balancing act: deviate as much as possible to avoid detection while still being recognized by the user/victim as the trusted icon.
By far, the most common method is noise injection (the “pepper shaker”). It’s simple, not specific to any particular subject, and produces virtually limitless distinct icons despite requiring changes only barely perceptible to the human eye.
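To see why noise injection defeats byte-level signatures, consider this toy demonstration: toggling the low bit of a handful of pixels changes the icon's hash while leaving the image visually untouched. The helper name and pixel counts here are, of course, hypothetical:

```python
import hashlib
import numpy as np

def pepper_shake(pixels: np.ndarray, n: int = 20, seed: int = 0) -> np.ndarray:
    """Flip the low bit of n randomly chosen pixels -- invisible to the
    eye, but enough to change any byte-level signature."""
    rng = np.random.default_rng(seed)
    out = pixels.copy()
    ys = rng.integers(0, out.shape[0], n)
    xs = rng.integers(0, out.shape[1], n)
    out[ys, xs] ^= 1  # toggle the low bit of the chosen pixels
    return out

icon = np.full((32, 32), 200, dtype=np.uint8)
variant = pepper_shake(icon)
# The byte-level signatures no longer match...
print(hashlib.md5(icon.tobytes()).hexdigest() ==
      hashlib.md5(variant.tobytes()).hexdigest())  # False
# ...yet no pixel moved by more than a single gray level.
print(int(np.abs(icon.astype(int) - variant.astype(int)).max()))  # 1
```

With a different seed for every sample, an attacker gets a fresh hash each time for free – which is precisely why similarity-based matching, rather than exact matching, is required.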
Unlike noise injection, other methods selectively alter components of the icon. Consider these two IE-related logos: reversing the direction of the "orbiter" and filling the globe with "ocean" means that, while the relationship isn't lost, they are obviously of different design.
This type of obfuscation requires creativity and risks upsetting the balancing act, which is why it's employed less frequently.
Trust Validation
Once a binary’s icon has been matched to a reference set, we know the application the icon is expected to come from. We can compare attributes of the binary with the expected attributes from that vendor:
- Does the icon even make sense in this context? If the icon should be from a Word document, why is it coming from an executable?
- Is the binary signed by the right organization? If it's an Acrobat icon, is the binary signed by Adobe? If it's an Excel icon, is the binary signed by Microsoft? Did the sample also come from that organization?
- Is the binary metadata consistent with the other binaries from that organization? If it's an Acrobat binary from a Windows system, does the "CompanyName" field in the version-information resource say "Adobe Corporate" like all other binaries from Adobe?
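The checks above can be sketched as a simple validation function. The reference table, field names, and brand key below are entirely hypothetical – our actual validation draws on the Software Reputation Service rather than a hard-coded dictionary:

```python
# Hypothetical reference data: the signer, company name, and file types
# we expect for binaries carrying each brand's icon (names illustrative).
EXPECTED = {
    "adobe_pdf": {"signer": "Adobe", "company": "Adobe", "kinds": {"document"}},
}

def validate_trust(brand: str, signer: str, company: str, kind: str) -> list:
    """Return the list of failed checks for a binary whose icon
    matched the given brand's trusted reference set."""
    ref = EXPECTED[brand]
    failures = []
    if kind not in ref["kinds"]:
        failures.append("unexpected file type for this icon")
    if ref["signer"] not in signer:
        failures.append("signed by the wrong organization")
    if ref["company"] not in company:
        failures.append("inconsistent CompanyName metadata")
    return failures

# A PDF icon on an unsigned executable trips all three checks.
print(validate_trust("adobe_pdf", signer="", company="Evil Corp",
                     kind="executable"))
```

Any non-empty result is a strong signal that the icon is being worn as a disguise.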
This trust validation is where Bit9 + Carbon Black’s Threat Intelligence Cloud shines. Our Software Reputation Service is the industry’s largest catalog of trusted software. It gives us a deep database of trusted software to make this association with confidence.
We hope this brief peek into a few of the components behind the new Icon Matching feed helps to clarify how the process works. With this feed enabled, when one of your users clicks on what appears to be a PDF attachment, you'll have this analysis process standing by to alert you that the binary backing the "PDF" had no relation to Adobe. Let's put a stake in the heart of phishing attacks!