In what follows, an in-depth forensic analysis of document-based code injection attacks is performed. The capabilities of the ShellOS framework are used to pinpoint no-op sleds and payloads for analysis, and then to examine the structure of Windows API call sequences, as well as the overall behavior of the code injected into the document.
The analysis is based on 10,000 distinct PDF documents collected from the wild and provided through several sources[4]. Many of these were submitted directly to a submission server (running the ShellOS framework) available on the University of North Carolina campus. All the documents used in this analysis had previously been labeled as malicious by antivirus engines, so this subsequent analysis focuses on what one can learn about the malicious code, rather than whether the document is malicious or not.
To get a sense of how varied these documents are (e.g., whether they come from different campaigns, use different exploits, or use different obfuscation techniques), a preliminary analysis is performed using jsunpack (Hartstein, 2010) and VirusTotal[5]. Figure 6.1 shows that the set of PDFs spans from 2008, shortly after the first emergence of malicious PDF documents in 2007, up to July of 2011. Only 16 of these documents were unknown to VirusTotal when queries were submitted in January of 2012.
[Figure 6.2: Results from jsunpack showing (a) known vulnerabilities and (b) exploits per PDF. Panel (a), "Vulnerabilities": percentage of PDFs with a matching signature (as reported by jsunpack) for CVE-2009-4324, CVE-2009-1493, CVE-2009-1492, CVE-2009-0927, CVE-2008-2992, and CVE-2007-5659; bar values as extracted (axis order not preserved): 72%, 18.6%, 18.4%, 54.6%, 0.26%, 5.4%. Panel (b), "No. of Exploits": percentage of the data set against the number of exploits in a single PDF (1 to 5); bar values as extracted (axis order not preserved): 19.59%, 63.43%, 12.73%, 3.92%, 0.31%.]
Figure 6.2(a) shows the Common Vulnerabilities and Exposures (CVE) identifiers, as reported by jsunpack. The CVEs reported are, of course, only for those documents that jsunpack could successfully unpack and whose unpacked JavaScript matched a signature. The percentages here do not sum to 100% because most documents contain more than one exploit. Of the successfully labeled documents, 72% contain the original exploit that fueled the rise of malicious PDFs in 2007, namely the collab.collectEmail exploit (CVE-2007-5659). As can be seen in Figure 6.2(b), most of the documents contain more than one exploit, with the second most popular exploit, getAnnots (CVE-2009-1492), appearing in 54.6% of the documents.
6.3.1 On Payload Polymorphism
Polymorphism has long been used to uniquely obfuscate each instance of a payload to evade detection by anti-virus signatures (Song et al., 2010; Talbi et al., 2008). A polymorphic payload contains a few decoder instructions, followed by the encoded portion of the payload.
The approach one can use to analyze polymorphism is to trace the execution of the first n instructions in each payload (n = 50 is used in this evaluation). In these n instructions, one can observe either a decode loop or the immediate execution of non-polymorphic code. ShellOS detects code injection payloads by executing from each position in a buffer, then triggering on a heuristic, such as the PEB heuristic (Polychronakis et al., 2010). However, since payloads are sometimes prepended by a NOP sled, tracing the first n instructions from the execution start position may capture only sled instructions. Therefore, to isolate the NOP sled from the injected code, one executes each detected payload several times. The first execution detects the presence of injected code and indicates the buffer offset of both the execution start position (most likely the start of the NOP sled) and the instruction where the heuristic is triggered (at some location inside the payload itself). One then executes the buffer multiple times, starting at the offset where the heuristic originally triggered and moving backwards until the heuristic successfully triggers again (resetting the state after each execution). This new offset indicates the first instruction required by the injected code to function properly, and the following analysis begins the n-instruction trace from here.
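The backward search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `find_payload_start` and `toy_triggers` are hypothetical names, and the `triggers` callback stands in for a full emulation run from a fresh state (ShellOS itself is not invoked here). The toy heuristic simply assumes that execution reaches the payload intact from any sled position or from the payload's first byte.

```python
from typing import Callable, Optional


def find_payload_start(
    buffer: bytes,
    start_offset: int,
    trigger_offset: int,
    triggers: Callable[[bytes, int], bool],
) -> Optional[int]:
    """Walk backwards from the trigger offset until the heuristic fires again.

    `triggers(buffer, offset)` is a hypothetical callback: it is assumed to
    reset all state, execute from `offset`, and report whether the heuristic
    fired.
    """
    for offset in range(trigger_offset, start_offset - 1, -1):
        if triggers(buffer, offset):
            return offset  # first instruction the payload needs
    return None


# Toy buffer: an 8-byte NOP sled followed by a 6-byte payload
# (push 0x30; pop ecx; mov eax, fs:[ecx]).
SLED = b"\x90" * 8
PAYLOAD = b"\x6a\x30\x59\x64\x8b\x01"
BUFFER = SLED + PAYLOAD


def toy_triggers(buffer: bytes, offset: int) -> bool:
    # Stand-in heuristic: execution only succeeds when it begins in the sled
    # or exactly at the first payload byte.
    return offset <= len(SLED)


# The first run started at offset 0 and the heuristic fired inside the payload.
payload_start = find_payload_start(BUFFER, 0, len(SLED) + 4, toy_triggers)
print(payload_start)
```

A linear walk is used because the paragraph describes moving backwards one position at a time; whether each step is a byte or an instruction boundary is an assumption of this sketch.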
Figure 6.3 shows the number of code injection payloads found in each of the 220 unique starting sequences traced. Uniqueness in this case is rather strict, and is determined by exactly matching instruction sequences (including operands). Notice the heavy-tailed distribution. Upon examining the actual instruction sequences in the tail, it is apparent that the vast majority of these are indeed the same instruction sequence, but with varying operand values, which is indicative of polymorphism. After re-binning the unique sequences by ignoring the operand values, the distribution remains similar to that shown in Figure 6.3, but with only 108 unique starting sequences.
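To make the two notions of uniqueness concrete, the following sketch bins traces first by exact match and then by instruction mnemonic only. The trace representation (a list of `(mnemonic, operand text)` pairs) and the names `strict_key`, `loose_key`, and `bin_traces` are assumptions made for illustration, as is the reading that the values ignored during re-binning are the operand text of each instruction; the sample traces are invented and are not drawn from the data set.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence, Tuple

Instruction = Tuple[str, str]  # (mnemonic, operand text)


def strict_key(trace: Sequence[Instruction], n: int = 50) -> Hashable:
    """Exact match on the first n instructions, operands included."""
    return tuple(trace[:n])


def loose_key(trace: Sequence[Instruction], n: int = 50) -> Hashable:
    """Match on mnemonics only, ignoring the operand values."""
    return tuple(mnemonic for mnemonic, _ in trace[:n])


def bin_traces(
    traces: Sequence[Sequence[Instruction]],
    key: Callable[[Sequence[Instruction]], Hashable],
) -> Counter:
    """Count how many traces fall into each unique starting sequence."""
    return Counter(key(trace) for trace in traces)


# Invented traces: the first three differ at most in one immediate value.
TRACES: List[List[Instruction]] = [
    [("xor", "ecx, ecx"), ("mov", "cl, 0x10"), ("xor", "byte [eax], 0x41")],
    [("xor", "ecx, ecx"), ("mov", "cl, 0x2a"), ("xor", "byte [eax], 0x41")],
    [("xor", "ecx, ecx"), ("mov", "cl, 0x10"), ("xor", "byte [eax], 0x41")],
    [("mov", "eax, fs:[0x30]"), ("mov", "eax, [eax+0xc]")],
]

strict_bins = bin_traces(TRACES, strict_key)
loose_bins = bin_traces(TRACES, loose_key)
print(len(strict_bins), len(loose_bins))
```

Sorting either counter with `most_common()` gives the rank-ordered counts of the kind plotted in a heavy-tailed distribution.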
Surprisingly, however, 90% of payloads analyzed are completely non-polymorphic. This is in stark contrast to prior empirical studies of code injection payloads (Polychronakis et al., 2009; Zhang et al., 2007; Payer et al., 2005a). One plausible explanation of this difference may be that prior studies examined payloads on the wire (e.g., network service-level exploits). Network-level exploits operate in plain view of intrusion detection systems and therefore require obfuscation of the payloads themselves. Document-based exploits, such as those in this data set, have the benefit of using the document format itself (e.g., object compression) to obfuscate the injected code, or the ability to pack it at the JavaScript level rather than the machine-code level.
[Figure 6.3: Unique sequences observed. 93% of payloads use 1 of the top 10 unique starting sequences observed. The x-axis is the unique sequence of the first 50 x86 instructions/operands of the shellcode (1 to 220); the y-axis is the count, on a logarithmic scale.]
The 10% of payloads that are polymorphic represent most of the heavy tail in Figure 6.3. Of the payloads in this set, 11% use the fstenv GetPC instruction. The remaining 89% use call as their GetPC instruction. Of the non-polymorphic sequences, 99.6% begin by looking up the address of the TEB with no attempt to obfuscate these actions. Only 7 payloads try to be evasive in their TEB lookup; they first push the TEB offset to the stack, then pop it into a register via: push byte 0x30; pop ecx; mov eax,fs:[ecx].
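As a rough illustration of how the GetPC split could be tallied, the sketch below labels a trace by the first GetPC-style instruction it contains. It assumes each trace is available as a list of mnemonics and that it is applied only to payloads already identified as polymorphic, since non-polymorphic code may also contain call instructions. The function name `classify_getpc` and the sample traces are hypothetical.

```python
from typing import Sequence

# fnstenv is the non-waiting form of fstenv; both are treated as the same
# GetPC technique in this sketch.
GETPC_FPU = {"fstenv", "fnstenv"}


def classify_getpc(mnemonics: Sequence[str], n: int = 50) -> str:
    """Return 'fstenv', 'call', or 'none' for the first n traced mnemonics."""
    for mnemonic in mnemonics[:n]:
        name = mnemonic.lower()
        if name in GETPC_FPU:
            return "fstenv"
        if name == "call":
            return "call"
    return "none"


# Invented example traces of decoder stubs.
fpu_stub = ["fldz", "fnstenv", "pop", "xor", "loop"]
call_stub = ["jmp", "pop", "xor", "loop", "call"]

print(classify_getpc(fpu_stub), classify_getpc(call_stub))
```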
6.3.2 On API Call Patterns
To test the effectiveness of the automatic API call hooking and simulation described in this chapter, each payload in the data set is allowed to continue executing in ShellOS. The average analysis time, per payload API sequence traced, is approximately 2 milliseconds. The following is one example of an API trace provided to the analyst by the diagnostics:
begin snippet
LoadLibraryA("urlmon")
LoadLibraryA("shell32")
GetTempPathA(Len = 64, Buffer = "C:\TEMP\")
URLDownloadToFile(URL = "http://(omitted).php?spl=pdf_sing&s=0907...(omitted)...FC2_1&fh=", File = "C:\TEMP\a.exe")
ShellExecuteA(File = "C:\TEMP\a.exe")
ExitProcess(ExitCode = -2)
end snippet
This particular example downloads a binary to the affected machine, then executes it. Of particular interest to an analyst is the domain (redacted in this example), which can subsequently be added to a blacklist. Also of interest is the obvious text-based information pertinent to the exploit used, e.g., spl=pdf_sing, which identifies the exploit used in this attack as CVE-2010-2883. Other payloads contain similar identifying strings as well, e.g., exp=PDF (Collab), exp=PDF (GetIcon), or ex=Util.Printf, presumably for bookkeeping in an overall diverse attack campaign.
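Pulling the host and the exploit tag out of a traced download URL is straightforward with Python's standard library. The sketch below is illustrative only: the URL is a made-up placeholder (the real one is redacted in the trace), and the helper names `download_host` and `exploit_tags` are hypothetical. The set of parameter names checked is taken from the examples quoted in the text.

```python
from typing import Dict, Optional
from urllib.parse import parse_qs, urlsplit

# Query parameter names seen in the examples quoted in the text.
EXPLOIT_KEYS = ("spl", "exp", "ex")


def download_host(url: str) -> Optional[str]:
    """Return the host portion of a URL, a candidate for a blacklist entry."""
    return urlsplit(url).hostname


def exploit_tags(url: str) -> Dict[str, str]:
    """Return any exploit-identifying query parameters found in the URL."""
    params = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return {key: params[key][0] for key in EXPLOIT_KEYS if key in params}


# Placeholder URL; the host and path are invented for this example.
SAMPLE_URL = "http://malware.example/gate.php?spl=pdf_sing&s=0907&fh="

print(download_host(SAMPLE_URL), exploit_tags(SAMPLE_URL))
```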
Overall, automatic hooking handles a number of API calls without corresponding handler implementations, for example: LoadLibraryA → GetProcAddress → URLDownloadToFile → [FreeLibrary+0] → WinExec → ExitProcess. In this example, FreeLibrary is an API call that has no handler implementation. The automatic API hooking discovered the function name and that the function is called directly by the payload, hence the +0 offset. Next, the automatic simulation disassembles the API code to find a ret, adjusts the stack appropriately, and sets a valid return value. The new API hooking techniques also identify a number of payloads that attempt to bypass function hooks by jumping a few bytes into the API entry point. The payloads that make use of this technique only apply it to a small subset of their API calls. This hook bypassing is observed for the following functions: VirtualProtect, CreateFileA, LoadLibraryA, and WinExec. In the following API call sequence, the method described in this chapter automatically identifies and handles hook bypassing in 2 API calls: GetFileSize[6] → GetTickCount → ReadFile → GetTickCount → GlobalAlloc → GetTempPathA → SetCurrentDirectoryA → [CreateFileA+5] → GlobalAlloc → ReadFile → WriteFile → CloseHandle → [WinExec+5] → ExitProcess. In this case, the stacks are automatically adjusted to account for the +5 jump into the CreateFileA and WinExec API calls. After the stack adjustment, the API calls are handled as usual.
[6] A custom handler is required for GetFileSize and ReadFile. The handler reads the original document file to provide the correct file size and contents to the payload.
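The bracketed `[Name+offset]` notation used in the call sequences can be illustrated with a small resolver that maps a call target to the nearest preceding API entry point. This is a sketch under assumptions, not the mechanism used by ShellOS: the export addresses are invented, `resolve_api_call` and `format_call` are hypothetical names, and the `max_offset` cut-off is an arbitrary choice to avoid attributing distant addresses to an API.

```python
import bisect
from typing import List, Optional, Tuple


def resolve_api_call(
    target: int, exports: List[Tuple[int, str]], max_offset: int = 16
) -> Optional[Tuple[str, int]]:
    """Map a call target to (API name, byte offset into the API)."""
    ordered = sorted(exports)
    addresses = [address for address, _ in ordered]
    index = bisect.bisect_right(addresses, target) - 1
    if index < 0:
        return None  # target lies below every known entry point
    base, name = ordered[index]
    offset = target - base
    if offset > max_offset:
        return None  # too far past the entry point to be a hook bypass
    return name, offset


def format_call(name: str, offset: int, has_handler: bool) -> str:
    """Bracket calls that bypass a hook or that lack a handler."""
    if has_handler and offset == 0:
        return name
    return f"[{name}+{offset}]"


# Invented entry-point addresses, for illustration only.
EXPORTS = [(0x7C801A28, "CreateFileA"), (0x7C8623AD, "WinExec")]

resolved = resolve_api_call(0x7C801A28 + 5, EXPORTS)
print(format_call(resolved[0], resolved[1], has_handler=True))
```

A non-zero offset signals that the payload skipped the first bytes of the API, so the simulated stack has to be adjusted for the skipped prologue before the call is handled as usual.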
Table 6.1: Code Injection Payload API Trace Patterns
API call columns: LoadLibraryA, GetProcAddress, GetTempPathA, GetSystemDirectory, URLDownloadToFile, URLDownloadToCacheFile, CreateProcessA, ShellExecuteA, WinExec, FreeLibrary, DeleteFileA, ExitThread, TerminateProcess, TerminateThread, SetUnhandledExceptionFilter, ExitProcess
Id   Cnt
1    5566
2    1008
3    484
4    397
5    317
6    179
7    90
8    75
9    46
10   36
11   27
12   25
13   20
14   12
15   10
[In the original table, circled step numbers in each row mark the order in which the pattern calls each API; their column positions are not recoverable from the extracted text.]
Typically, an exploit will crash or silently terminate an exploited application. However, several interesting payloads observed make an effort to mask the fact that an exploit has occurred on the end-user's machine. Several API call sequences first load a secondary payload from the original document: GetFileSize → VirtualAlloc → GetTickCount → ReadFile. Then, assembly-level code decodes the payload (typically xor-based), and transfers control to the second payload, which goes through another round of decoding itself. The secondary payload then drops two files extracted from the original document to disk, an executable and a PDF: GetTempPathA → GetTempFileNameA → CreateFileA → [LocalAlloc+0] → WriteFile → CloseHandle → WinExec → CreateFileA → WriteFile → CloseHandle → CloseHandle → [GetModuleFileNameA+0] → WinExec → ExitProcess. An executable is launched in the background, while a PDF is launched in the foreground via 'cmd.exe /c start'.
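For readers unfamiliar with xor-based decoding, the following minimal sketch shows the idea with a repeating key. The key value, the sample bytes, and the helper names `xor_decode` and `guess_single_byte_key` are all assumptions for illustration; the actual keys and decoder loops used by the payloads in the data set are not given in the text.

```python
from typing import Optional


def xor_decode(data: bytes, key: bytes) -> bytes:
    """XOR every byte of data with a repeating key (the operation is symmetric)."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def guess_single_byte_key(data: bytes, known_prefix: bytes) -> Optional[int]:
    """Brute-force a one-byte key, assuming the decoded prefix is known."""
    for key in range(256):
        if xor_decode(data[: len(known_prefix)], bytes([key])) == known_prefix:
            return key
    return None


# Invented second stage (push 0x30; pop ecx; mov eax, fs:[ecx]) and key.
SECOND_STAGE = b"\x6a\x30\x59\x64\x8b\x01"
ENCODED = xor_decode(SECOND_STAGE, b"\x5a")

print(xor_decode(ENCODED, b"\x5a") == SECOND_STAGE)
```

Because XOR is its own inverse, the same routine serves for both encoding and decoding, which is part of why it is a common choice for small decoder stubs.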
Overall, over 50 other unique sequences of API calls are found in the data set. Table 6.1 shows the full API call sequences for only the 15 most frequent payloads. As with observations of the first n assembly-level instructions, the call sequences have a heavy-tailed distribution.