Empowering Gemini for Malware Analysis with Code Interpreter and Google Threat Intelligence
One of Google Cloud's major missions is to arm security professionals with modern tools to help them defend against the latest threats. Part of that mission involves moving closer to a more autonomous, adaptive approach in threat intelligence automation.
In our latest advancements in malware analysis, we’re equipping Gemini with new capabilities to address obfuscation techniques and obtain real-time insights on indicators of compromise (IOCs). By integrating the Code Interpreter extension, Gemini can now dynamically create and execute code to help deobfuscate specific strings or code sections, while Google Threat Intelligence (GTI) function calling enables it to query GTI for additional context on URLs, IPs, and domains found within malware samples. These tools are a step toward transforming Gemini into a more adaptive agent for malware analysis, enhancing its ability to interpret obfuscated elements and gather contextual information based on the unique characteristics of each sample.
This work builds on earlier steps. We previously explored critical preparatory steps with Gemini 1.5 Pro, using its 2-million-token input window to process substantial sections of decompiled code in a single pass. To improve scalability, we then introduced Gemini 1.5 Flash, adding automated binary unpacking through Mandiant Backscatter before the decompilation phase to tackle certain obfuscation techniques. Yet, as any seasoned malware analyst knows, the true challenge often begins once the code is exposed. Malware developers frequently employ obfuscation tactics to conceal critical IOCs and underlying logic. Malware may also download additional malicious code, making it challenging to fully understand the behavior of a given sample.
For large language models (LLMs), obfuscation techniques and additional payloads create unique challenges. When dealing with obfuscated strings such as URLs, IPs, domains, or file names, LLMs often “hallucinate” without explicit decoding methods. Additionally, LLMs cannot access external resources such as URLs that host additional payloads, which often results in speculative interpretations of the sample’s behavior.
To help with these challenges, Code Interpreter and GTI function calling tools provide targeted solutions. Code Interpreter enables Gemini to autonomously create and execute custom scripts, as needed, using its own judgment to decode obfuscated elements within a sample, such as strings encoded with XOR-based algorithms. This capability minimizes interpretation errors and enhances Gemini's ability to reveal hidden logic without requiring manual intervention.
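As a rough illustration of the kind of script involved, the simplest XOR-based scheme is a repeating-key XOR. The sketch below is not taken from any sample; the function name `xor_decode`, the key, and the placeholder URL are all assumptions chosen for the example.

```python
from itertools import cycle


def xor_decode(data: bytes, key: bytes) -> bytes:
    """XOR each byte of data with the key, repeating the key as needed."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


# Hypothetical values: obfuscate a placeholder string, then recover it.
key = b"k3y"
obfuscated = xor_decode(b"http://example.com/payload", key)
recovered = xor_decode(obfuscated, key)
print(recovered.decode("ascii"))
```

Because XOR is its own inverse, the same function both obfuscates and recovers the string, which is why a short generated script is often enough to decode such values exactly rather than guess at them.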
Meanwhile, GTI function calling expands Gemini’s reach by retrieving contextualized information from Google Threat Intelligence on suspicious external resources such as URLs, IPs, or domains, providing verified insights without speculative guesses. Together, these tools equip Gemini to better handle obfuscated or externally hosted data, bringing it closer to the goal of functioning as an autonomous agent for malware analysis.
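To make the function calling pattern concrete, here is a minimal sketch of how a model-issued lookup request might be routed to a handler. Everything in it is hypothetical: the tool name `lookup_indicator`, its parameters, the shape of `model_call`, and the stubbed backend are illustrative assumptions, not the actual GTI or Gemini interfaces.

```python
import json

# Hypothetical tool declaration in the JSON-schema style commonly used for
# LLM function calling. The name and fields are illustrative only.
LOOKUP_TOOL = {
    "name": "lookup_indicator",
    "description": "Return threat intelligence context for a URL, IP, or domain.",
    "parameters": {
        "type": "object",
        "properties": {
            "indicator": {"type": "string"},
            "kind": {"type": "string", "enum": ["url", "ip", "domain"]},
        },
        "required": ["indicator", "kind"],
    },
}


def lookup_indicator(indicator: str, kind: str) -> dict:
    """Stub backend; a real deployment would query a threat intelligence service."""
    return {"indicator": indicator, "kind": kind, "context": "no data (stub)"}


HANDLERS = {"lookup_indicator": lookup_indicator}


def dispatch(call: dict) -> str:
    """Run the function the model asked for and return a JSON result string."""
    handler = HANDLERS.get(call["name"])
    if handler is None:
        return json.dumps({"error": f"unknown function {call['name']}"})
    return json.dumps(handler(**call["args"]))


# A model-issued call as it might look after parsing (hypothetical shape).
model_call = {
    "name": "lookup_indicator",
    "args": {"indicator": "example.com", "kind": "domain"},
}
print(dispatch(model_call))
```

The point of the pattern is that the model only names a function and its arguments; the surrounding application performs the lookup and feeds the result back, so the context comes from the intelligence source rather than from the model's own guess.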
To illustrate how these enhancements boost Gemini's capabilities, let's look at a practical example. In this case, we’re analyzing a PowerShell script that contains an obfuscated URL that hosts a second-stage payload. This particular sample was previously analyzed with some of the most advanced publicly available LLMs, which incorporate code generation and execution as part of their reasoning process. Despite these capabilities, each model “hallucinated,” generating completely fabricated URLs instead of accurately revealing the correct one.
Obfuscated PowerShell code sample to be analyzed by Gemini
Using Code Interpreter and GTI function calling as part of its reasoning process, Gemini autonomously generated the following report without any human intervention, applying these tools when it deemed them necessary to process and extract additional information from the sample.
Final Report: PowerShell Malware with URL Deobfuscation
This report analyzes a PowerShell script designed to download and execute a file from an obfuscated URL. The script employs a custom obfuscation routine resembling the RC4 stream cipher to conceal the malicious URL.
Functionality:
- Obfuscation Routine: The script defines a function F that performs byte-level XOR encryption using a key derived from the string "tox2". This function is used to deobfuscate the malicious URL.
- URL Decryption: The obfuscated URL, stored as a byte array, is decrypted using the F function with "tox2" as the key.
- File Download: The script downloads a file from the decrypted URL (hxxps://filedn[.]eu/lODWTgN8sswHA6Pn8HXWe1J/tox2/Scan_docs%2398097960[.]msi) and saves it to the system's temporary directory under a randomly generated filename.
- Execution: Finally, the script executes the downloaded file.
Indicators of Compromise (IOCs):
- URL: hxxps://filedn[.]eu/lODWTgN8sswHA6Pn8HXWe1J/tox2/Scan_docs%2398097960[.]msi
Deobfuscation Code:
    def deobfuscate(x, y):
        s = list(range(256))
        j = 0
        k = [ord(c) for c in y]
        for i in range(256):
            j = (j + s[i] + k[i % len(k)]) % 256
            s[i], s[j] = s[j], s[i]
        i = 0
        j = 0
        output = []
        for m in range(len(x)):
            i = (i + 1) % 256
            j = (j + s[i]) % 256
            s[i], s[j] = s[j], s[i]
            output.append(x[m] ^ s[(s[i] + s[j]) % 256])
        return output

    a = [0x25, 0x7B, 0x5A, 0xD8, 0x9C, 0xFF, 0xC3, 0xA4, 0x95, 0xB7, 0x9D, 0x6E,
         0x5F, 0x4D, 0x49, 0xFB, 0x78, 0x32, 0xB8, 0x17, 0x72, 0x01, 0x31, 0xFB,
         0xC0, 0x58, 0x45, 0xCB, 0x5C, 0x3C, 0xDD, 0x00, 0x60, 0x87, 0x25, 0xB2,
         0x44, 0xA6, 0xC5, 0xEA, 0x71, 0xAA, 0x9D, 0xA1, 0x05, 0x5F, 0xBD, 0x41,
         0x28, 0x70, 0x30, 0x6B, 0x96, 0x24, 0x92, 0xA0, 0x9D, 0x92, 0x23, 0x79,
         0x69, 0x2C, 0x1B, 0x1E, 0xEB, 0x69, 0x8F, 0x04, 0x10, 0x74, 0xBC]
    b = "tox2"
    c = deobfuscate(a, b)
    decoded_url = "".join(chr(i) for i in c)
    print(decoded_url)
Google Threat Intelligence (GTI) Lookup:
The GTI lookup provides valuable context regarding the identified IOC:
- Threat Actor: UNC5687, known for using MESHAGENT in phishing campaigns impersonating the Security Service of Ukraine.
- Campaign: This specific URL is linked to a campaign spreading MESHAGENT via phishing emails with malicious links. The campaign involved a captcha page leading to the download of either a malicious MSI or EXE file.
- Malware: The campaign delivers MESHAGENT, an open-source remote access framework, configured to communicate with C2 domains linked to a service called "AnonVNC."
Conclusion:
The analyzed PowerShell script is confirmed to be malicious, associated with a phishing campaign by UNC5687 distributing the MESHAGENT remote access tool.
Gemini identified that the script employs an XOR-based obfuscation algorithm that resembles RC4 to conceal the download URL. Recognizing this pattern, Gemini autonomously generated and executed a Python deobfuscation script within the Code Interpreter sandbox, successfully revealing the external resource.
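For readers less familiar with RC4, the following sketch shows why one routine can serve for both obfuscation and deobfuscation: the cipher derives a keystream from the key and XORs it with the data, so applying it twice with the same key returns the original bytes. This is a textbook RC4 written for illustration, not the script Gemini produced, and the placeholder URL is an assumption rather than data from the sample.

```python
def rc4(data: bytes, key: bytes) -> bytes:
    """Textbook RC4: key scheduling, then XOR the data with the keystream."""
    s = list(range(256))
    j = 0
    # Key-scheduling algorithm: permute s based on the key bytes.
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    # Pseudo-random generation: one keystream byte per input byte.
    i = j = 0
    out = bytearray()
    for byte in data:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


key = b"tox2"  # key string named in the report
plaintext = b"https://example.com/placeholder.msi"  # placeholder, not from the sample
ciphertext = rc4(plaintext, key)

# Applying the same routine with the same key restores the original bytes.
assert rc4(ciphertext, key) == plaintext
print(ciphertext.hex())
```

Without the key and the exact algorithm, the ciphertext bytes carry no readable hint of the URL, which is consistent with models fabricating plausible-looking URLs when they try to decode such data by inspection alone.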
With the URL in hand, Gemini then used GTI function calling to query Google Threat Intelligence for further context. This lookup linked the URL to UNC5687, a threat cluster known for using a remote access tool in phishing campaigns impersonating the Security Service of Ukraine.
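The report prints the URL in defanged form (hxxps and [.]) so that it cannot be clicked by accident. A lookup would presumably need the original form, so a pipeline might convert between the two. The helpers below are a minimal sketch of that convention; the function names are hypothetical, defanging styles vary between teams, and the example uses a placeholder URL.

```python
def defang(url: str) -> str:
    """Make a URL non-clickable: http -> hxxp and '.' -> '[.]'."""
    return url.replace("http", "hxxp", 1).replace(".", "[.]")


def refang(text: str) -> str:
    """Reverse defang(): hxxp -> http and '[.]' -> '.'."""
    return text.replace("hxxp", "http", 1).replace("[.]", ".")


original = "https://example.com/files/setup.msi"  # placeholder URL
safe = defang(original)
print(safe)
print(refang(safe) == original)
```

Keeping indicators defanged in reports and restoring them only at query time reduces the chance that an analyst or a mail client resolves a malicious link unintentionally.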
As we’ve seen, the integration of these tools has strengthened Gemini’s ability to function as a malware analyst capable of adapting its approach to address obfuscation and gathering vital context on IOCs. By incorporating the Code Interpreter and GTI function calling, Gemini is better equipped to navigate complex samples by autonomously interpreting hidden elements and contextualizing external references.
While these are significant advancements, many challenges remain, especially given the vast diversity of malware and scenarios that exist in the threat landscape. We’re committed to making steady progress, and future updates will continue to enhance Gemini's capabilities, moving us closer to a more autonomous, adaptive approach in threat intelligence automation.