Many security solutions employ signature-based detection. To bypass it, attackers often take existing malicious samples and create new ones that preserve the original malicious behavior but have distinct signatures. This is usually done with malware toolkits that can perform various transformations, such as obfuscation, packing, name shuffling, and patching. The resulting samples have unique, previously unseen signatures, but may still be similar to known malicious samples.

Detecting similarity between samples is therefore an important capability for modern security solutions, for two reasons. First, similarity detection can improve malware detection: if a given sample is similar enough to previously seen samples that were already flagged as malicious, then it is very likely to be malicious as well. Second, similarity detection can help automate the task of identifying malicious campaigns, which are often based on similar variants of the same malicious sample.

In this blog post, we focus on similarity in the context of Microsoft Office macros, which are widely exploited by attackers as a platform for delivering malware. We discuss several patterns of similarity based on real-world samples that we detected in the wild, and briefly describe our solution.

Office Macros: Background

A macro is a special-purpose program embedded in a Microsoft Office document (Word, Excel, PowerPoint, etc.) that is used to automate various tasks, such as keystrokes and mouse movements. A macro consists of modules written in VBA (Visual Basic for Applications), a powerful imperative programming language. A macro can access various machine resources: it can read and write to the file system, use the network, and execute commands. The execution of a macro is usually triggered by a callback function that is invoked when a specific event occurs. For example, the following macro is triggered when the (Word) document is opened, and executes a command that runs calc.exe:

The versatility and expressiveness of VBA are what attract attackers to use macros as a component in the infection flow.
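To make the trigger mechanism concrete, here is a minimal Python sketch that scans macro source code for procedures whose names match well-known automatic entry points. The list of names is a small, non-exhaustive assumption, and the sample macro is a hypothetical stand-in written in the spirit of the calc.exe example described above, not a sample from our traffic.

```python
import re

# Assumed, non-exhaustive list of procedure names that Office invokes automatically.
AUTO_EXEC_NAMES = {
    "autoopen", "auto_open", "document_open", "workbook_open", "worksheet_change",
}

# Matches VBA procedure headers such as "Private Sub Document_Open()".
PROC_HEADER = re.compile(
    r"^\s*(?:(?:Public|Private)\s+)?(?:Sub|Function)\s+(\w+)",
    re.IGNORECASE | re.MULTILINE,
)


def find_triggers(vba_source: str) -> list[str]:
    """Return the names of procedures that look like automatic entry points."""
    return [
        name
        for name in PROC_HEADER.findall(vba_source)
        if name.lower() in AUTO_EXEC_NAMES
    ]


# Hypothetical macro: runs calc.exe when the Word document is opened.
SAMPLE_MACRO = '''
Private Sub Document_Open()
    Shell "calc.exe", vbNormalFocus
End Sub

Sub Helper()
End Sub
'''

print(find_triggers(SAMPLE_MACRO))  # ['Document_Open']
```

A name-based scan like this only flags candidate entry points. It says nothing about what the triggered code actually does.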

Similarity Patterns: Examples

In this section, we will discuss several similarity patterns that we observed in samples from our clients’ traffic.

Identifier Shuffling

One way to obtain different variants of the same malicious macro is by renaming identifiers: functions, variables, etc. This produces distinct macros that are semantically equivalent, that is, they exhibit the same behavior. To illustrate this, let’s have a look at two malicious macros ([1], [2]) that we detected in our clients’ traffic. In both cases, the macro was contained in a Microsoft Office document that was delivered as an email attachment.

As you can see, these two macros are identical up to the names of some functions and variables. The matching identifiers are marked here in the same color. For example, the variable KYjo in the first macro (on the left) corresponds to the variable kBHE in the second macro (on the right).

Here, the identifier renaming clearly has no effect on the behavior. Both macros use the built-in callback function Worksheet_Change to initiate the malicious flow, which is obfuscated using various properties of the embedding document (page setup, spreadsheet cells, etc.).
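One simple way to see through identifier shuffling is to rename every user-defined identifier to a canonical placeholder, numbered by order of first appearance, and then compare the results. The Python sketch below illustrates the idea under strong assumptions: the keyword list is deliberately tiny, the tokenizer is a single regular expression, and both macros are short hypothetical stand-ins that only borrow the identifier names KYjo, kBHE, and fkldf from the samples (the name pLmq is invented).

```python
import re

# Assumed, deliberately small set of VBA keywords and built-ins to leave untouched.
KEEP = {
    "sub", "function", "end", "dim", "as", "set", "if", "then", "else",
    "for", "next", "string", "object", "private", "public", "call",
    "createobject", "worksheet_change", "byval", "range",
}

# A token is either a simple string literal or an identifier-like word.
TOKEN = re.compile(r'"[^"]*"|[A-Za-z_]\w*')


def normalize_identifiers(vba_source: str) -> str:
    """Replace each user-defined identifier with ID0, ID1, ... by first appearance."""
    mapping: dict[str, str] = {}

    def rename(match: re.Match) -> str:
        token = match.group(0)
        if token.startswith('"') or token.lower() in KEEP:
            return token  # string literals and known keywords stay as they are
        key = token.lower()  # VBA identifiers are case-insensitive
        if key not in mapping:
            mapping[key] = f"ID{len(mapping)}"
        return mapping[key]

    return TOKEN.sub(rename, vba_source)


# Hypothetical macros that differ only in identifier names.
MACRO_A = '''
Private Sub Worksheet_Change(ByVal Target As Range)
    Dim KYjo As String
    KYjo = "cmd"
    Call fkldf(KYjo)
End Sub
'''

MACRO_B = '''
Private Sub Worksheet_Change(ByVal Target As Range)
    Dim kBHE As String
    kBHE = "cmd"
    Call pLmq(kBHE)
End Sub
'''

print(MACRO_A == MACRO_B)                                                # False
print(normalize_identifiers(MACRO_A) == normalize_identifiers(MACRO_B))  # True
```

Numbering by first appearance makes the result independent of the names chosen, but not of statement order, so reordered code would need a more structural comparison.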

After deobfuscation, the malicious flow can be described as follows: 

First, an ActiveX object is created by invoking the CreateObject API with the winmgmts:Win32_Process class (line X). Then, the arguments (CommandLine and ProcessStartupInformation) for the Create method of the WMI class are constructed (line X), where the CommandLine argument contains the command to be executed. Finally, the function fkldf is called and the method Create is invoked with the constructed arguments (line X), which leads to the execution of a PowerShell command that downloads and executes a malicious JavaScript payload.

Modifying Constants

Another way to obtain different variants of the same malicious macro is by modifying constants (strings, integers, etc.). Malicious macros often construct strings in an obfuscated manner, and then use them to execute commands or access the network at a later stage. This is done in the following sample ([3]), which was detected in the traffic of one of our clients:

In this macro, the function TCONETC first defines an obfuscated constant string (line X), and then uses the Replace API to perform several transformations (lines X) which eventually result in a PowerShell command that will be executed later in the function Auto_open using the Shell API.
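The Replace-based construction can be mimicked in a few lines of Python. This is only a sketch of the general technique: the obfuscated string, the filler markers, and the resulting (harmless) command are all invented for illustration and are not taken from sample [3].

```python
def apply_replacements(obfuscated: str, steps: list[tuple[str, str]]) -> str:
    """Apply a chain of (find, replacement) steps, like successive Replace calls."""
    result = obfuscated
    for old, new in steps:
        # Like VBA's Replace with default arguments, this replaces every occurrence.
        result = result.replace(old, new)
    return result


# Hypothetical obfuscated constant: a harmless command padded with filler markers.
OBFUSCATED = "po#wer#sh@@ell -Com#mand Wri@@te-Out#put hel@@lo"

# Hypothetical transformation steps that strip the filler markers.
STEPS = [("#", ""), ("@@", "")]

print(apply_replacements(OBFUSCATED, STEPS))
# powershell -Command Write-Output hello
```

Because the final command exists only after the last replacement, a signature written against the deobfuscated text will not match the macro source directly.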

Such attacks often reappear in slightly different configurations. The main logic for constructing the strings is reused, and only the initial input is patched in order to modify the parameters of the attack: network locations, executed commands, etc.

We observed this pattern in other attacks that we prevented across different clients. In one of the detected samples ([4]), the macro differs from the previously mentioned macro ([3]) only in the first statement of the function TCONETC.

As you can see, the difference is in the section of the initial string that corresponds to the URL, since here the attacker is trying to launch the attack using a different network location.
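Variants of this kind can be grouped by masking the constants and comparing what remains. The following Python sketch replaces string and integer literals with placeholders and then lists the constants that differ. Only the function name TCONETC comes from the samples; the function bodies and the example URLs are hypothetical.

```python
import re

# A literal is either a simple string or a standalone integer.
LITERAL = re.compile(r'"[^"]*"|\b\d+\b')


def skeleton(vba_source: str) -> str:
    """Mask string and integer literals so only the code structure remains."""
    return LITERAL.sub(
        lambda m: "STR" if m.group(0).startswith('"') else "NUM", vba_source
    )


def constants(vba_source: str) -> list[str]:
    """Return the literals in order of appearance."""
    return LITERAL.findall(vba_source)


# Hypothetical variants that differ only in the URL part of the initial string.
VARIANT_A = '''
Function TCONETC() As String
    TCONETC = "fetch http://one.example/a"
    TCONETC = Replace(TCONETC, "#", "")
End Function
'''

VARIANT_B = '''
Function TCONETC() As String
    TCONETC = "fetch http://two.example/b"
    TCONETC = Replace(TCONETC, "#", "")
End Function
'''

same_structure = skeleton(VARIANT_A) == skeleton(VARIANT_B)
changed = [
    (a, b)
    for a, b in zip(constants(VARIANT_A), constants(VARIANT_B))
    if a != b
]

print(same_structure)  # True
print(changed)         # [('"fetch http://one.example/a"', '"fetch http://two.example/b"')]
```

Pairing constants by position is only meaningful once the skeletons match, which is why the structural check comes first.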

Reusing Primitives

As we mentioned before, many malicious macros rely on obfuscation, which is typically implemented using a sequence of encoding or decoding operations. The macros that we discussed in the previous section use slightly different constants when constructing the obfuscated strings, but they use the same encoding and decoding procedures for obfuscating the PowerShell command. We observed that some malicious macros not only modify strings, but also use different encoding or decoding procedures for implementing the obfuscation mechanism.

This can be observed in the following two macros ([5] and [6]) which were detected in our clients’ traffic. The first one is a Word macro (on the left) and the second one is an Excel macro (on the right). Each of the macros first stores some obfuscated data in the variables Based and Named, then writes the value of Based to the file system using the function writeBytes, and finally executes a command using the value of Named. In this case, however, each of the macros uses a different encoding procedure, that is, a different sequence of operations, to construct the obfuscated data (the values of Named and Based).

Despite the differences in the encoding procedures, other parts of these macros reveal some interesting similarities. First, both macros use the same decoding procedure, the function decodeBase64, which performs a standard decoding of a base64-encoded string:
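For readers less familiar with base64, the behavior of such a decoding primitive can be sketched in Python with the standard library. This is an equivalent illustration, not the VBA code of decodeBase64 from samples [5] and [6], and the encoded value is a harmless placeholder.

```python
import base64


def decode_base64(encoded: str) -> bytes:
    """Decode a standard base64 string, rejecting characters outside the alphabet."""
    return base64.b64decode(encoded, validate=True)


# Hypothetical payload: a harmless file name, encoded and then decoded again.
encoded = base64.b64encode(b"calc.exe").decode("ascii")
print(decode_base64(encoded))  # b'calc.exe'
```

Since the decoding step is standard and reversible, it is a natural fragment to recognize across otherwise different macros.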

Second, both macros contain the function writeBytes which implements a primitive for dropping files. The two implementations are similar, and differ only because of some identifier shuffling (the variable cbinaryStream corresponds to the variable binaryStream).

Our Solution

In general, similarity detection is a hard problem that has been researched in both industry and academia in many contexts, such as malware detection and code clone detection. The approaches proposed to solve it are based on various techniques, including textual analysis, structural analysis, semantic analysis, and machine learning. Our challenge is to detect similarity efficiently and precisely, while keeping both the false positive rate and the false negative rate low. Perception Point’s advanced macro analysis engine approaches the problem by relying on code fragments. In a nutshell, the platform decomposes the macro into code fragments and tries to match them against code fragments that were seen in other samples. By doing this, Perception Point is able to keep customers safe and secure, no matter the macro.
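To give a feel for fragment-based matching, here is a minimal Python sketch. It is not Perception Point’s engine: it assumes that a fragment is a whole Sub or Function, normalizes identifiers and literals with a single regular expression, and scores two macros by the Jaccard overlap of their fragment hashes. The two macros are hypothetical; only the names writeBytes, binaryStream, and cbinaryStream are borrowed from the samples discussed above.

```python
import hashlib
import re

# Assumed, deliberately small set of VBA keywords and built-ins to leave untouched.
KEEP = {"sub", "function", "end", "dim", "set", "for", "to", "next", "createobject"}

# A fragment is assumed to be everything from a Sub/Function header to its End line.
PROC = re.compile(
    r"^[ \t]*(?:(?:Public|Private)[ \t]+)?(Sub|Function)\b.*?^[ \t]*End[ \t]+\1\b",
    re.IGNORECASE | re.MULTILINE | re.DOTALL,
)
TOKEN = re.compile(r'"[^"]*"|\d+|[A-Za-z_]\w*')


def normalize(fragment: str) -> str:
    """Mask literals, rename identifiers by first appearance, collapse whitespace."""
    names: dict[str, str] = {}

    def repl(match: re.Match) -> str:
        token = match.group(0)
        if token[0] == '"':
            return "STR"
        if token[0].isdigit():
            return "NUM"
        key = token.lower()
        if key in KEEP:
            return key
        return names.setdefault(key, f"ID{len(names)}")

    return " ".join(TOKEN.sub(repl, fragment).split())


def fragment_hashes(vba_source: str) -> set[str]:
    """Hash every normalized procedure found in the macro source."""
    return {
        hashlib.sha256(normalize(m.group(0)).encode("utf-8")).hexdigest()
        for m in PROC.finditer(vba_source)
    }


def similarity(macro_a: str, macro_b: str) -> float:
    """Jaccard overlap between the fragment hash sets of two macros."""
    a, b = fragment_hashes(macro_a), fragment_hashes(macro_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical macros sharing a file-dropping primitive with shuffled identifiers.
WORD_MACRO = '''
Sub writeBytes(path, data)
    Dim binaryStream
    Set binaryStream = CreateObject("ADODB.Stream")
    binaryStream.Write data
    binaryStream.SaveToFile path, 2
End Sub

Sub Document_Open()
    writeBytes "a.bin", "AAAA"
End Sub
'''

EXCEL_MACRO = '''
Sub writeBytes(path, data)
    Dim cbinaryStream
    Set cbinaryStream = CreateObject("ADODB.Stream")
    cbinaryStream.Write data
    cbinaryStream.SaveToFile path, 2
End Sub

Sub Workbook_Open()
    Dim i
    For i = 1 To 3
        writeBytes "b.bin", "BBBB"
    Next
End Sub
'''

print(similarity(WORD_MACRO, EXCEL_MACRO))  # one shared fragment out of three distinct
```

Exact hashing of normalized fragments is cheap and precise but brittle, since a single inserted statement changes the hash. A production system would likely need fuzzier matching and a policy for fragments that are common in benign macros.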

References:

  1. 0005144ebb03d2f5a5b17e21362c628ddc1705e910cfd56032b7b55c932b68da
  2. 20e2093192e7b7b96c067cd8f16cee4ccb51e8c10676050646877bc83dc34a27
  3. 31e93f3226377174335eabda90bc771425043cf412dd91b257f1814be085c715
  4. 6586c7399b24c4b29c2173ec47a733cab38abe3d175b47bbdd7188e3ab1dd0c3
  5. 536eaf59d72519d5e1cc52e98e212fdf52855f1828d3326fcd22be5071b231a0
  6. b5f6912f1291dc26442e02bb2e79c7c13613a87d23ddf0c294c9d02b231aab70