Code Disassembly with Pydasm
In this post, I would like to show you how to use pydasm, a well-known Python wrapper for libdasm. It attempts to capture all the functionality of libdasm and bring its versatility to Python.
libdasm is a C library that tries to provide a simple and convenient way to disassemble Intel x86 raw opcode bytes (machine code). It can parse and print out opcodes in AT&T and Intel syntax.
Nowadays there are more powerful disassemblers, such as the Capstone Engine, but as I often use pydbg, a pure-Python win32 debugger interface, I still use pydasm for some of my projects.
Installation
We will do our tests on a Windows based machine (I will use Windows 10). To prepare the environment you need to install the following dependencies:
Note: Be sure to add python.exe to your environment PATH. This can be done automatically during the installation of Python.
$ cd pydasm-master
$ python setup.py install
running install
running bdist_egg
running egg_info
creating pydasm.egg-info
...
creating 'dist\pydasm-1.5-py2.7-win32.egg' and adding 'build\bdist.win32\egg' to it
removing 'build\bdist.win32\egg' (and everything under it)
Processing pydasm-1.5-py2.7-win32.egg
Copying pydasm-1.5-py2.7-win32.egg to c:\python27\lib\site-packages
Adding pydasm 1.5 to easy-install.pth file
Installed c:\python27\lib\site-packages\pydasm-1.5-py2.7-win32.egg
Processing dependencies for pydasm==1.5
Finished processing dependencies for pydasm==1.5
Now, you just have to copy pydasm.pyd from C:\Users\User\Desktop\pydasm-master\build\lib.win32-2.7\ into C:\Python27\DLLs.
Exported Methods
Pydasm comes with a few exported methods that help us disassemble binary files. If you look at the source code, you will see that, as a wrapper, pydasm provides Python counterparts of libdasm’s functions, including:
get_instruction()
get_instruction_string()
get_mnemonic_string()
get_operand_string()
get_register_type()
Let’s take a look at each of them…
About get_instruction()
This method is the Python counterpart of libdasm’s get_instruction(). Here is the prototype of the method:
def get_instruction(data, mode)
It decodes an instruction from the given buffer. It takes 2 arguments:
data, a string containing the data to disassemble.
mode, either MODE_16 or MODE_32.
It returns an INSTRUCTION object, or None if the instruction can’t be disassembled. I won’t explain the content of the INSTRUCTION object, but for your information, here is the structure definition:
typedef struct _INSTRUCTION {
    int length;             // Instruction length
    enum Instruction type;  // Instruction type
    enum Mode mode;         // Addressing mode
    BYTE opcode;            // Actual opcode
    BYTE modrm;             // MODRM byte
    BYTE sib;               // SIB byte
    int modrm_offset;       // MODRM byte offset
    int extindex;           // Extension table index
    int fpuindex;           // FPU table index
    int dispbytes;          // Displacement bytes (0 = no displacement)
    int immbytes;           // Immediate bytes (0 = no immediate)
    int sectionbytes;       // Section prefix bytes (0 = no section prefix)
    OPERAND op1;            // First operand (if any)
    OPERAND op2;            // Second operand (if any)
    OPERAND op3;            // Additional operand (if any)
    PINST ptr;              // Pointer to instruction table
    int flags;              // Instruction flags
    short eflags_affected;  // Processor eflags affected
    short eflags_used;      // Processor eflags used by this instruction
    int iop_written;        // mask of affected implied registers (written)
    int iop_read;           // mask of affected implied registers (read)
} INSTRUCTION, *PINSTRUCTION;
Note: For more details about the INSTRUCTION object, I suggest you read the source code of pydasm to get a full understanding of how it works and how you can use it in your code.
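To make the modrm, sib and dispbytes fields a little more concrete, here is a minimal sketch that does not use pydasm at all. It splits a ModRM byte and a SIB byte into their 2-3-3 bit fields, using the bytes d9 74 24 f4 (the encoding of fstenv [esp-0xc] from the shellcode disassembled later in this post). The helper names split_modrm and signed8 are hypothetical and only serve the illustration.

```python
def split_modrm(byte):
    """Split a ModRM or SIB byte into its three bit fields (2, 3 and 3 bits)."""
    return (byte >> 6) & 0x3, (byte >> 3) & 0x7, byte & 0x7


def signed8(byte):
    """Interpret a byte as a signed 8-bit displacement."""
    return byte - 0x100 if byte & 0x80 else byte


# d9 74 24 f4: opcode, ModRM, SIB, 8-bit displacement
opcode, modrm, sib, disp = 0xD9, 0x74, 0x24, 0xF4

mod, reg, rm = split_modrm(modrm)
scale, index, base = split_modrm(sib)

# mod=1 means an 8-bit displacement follows, rm=4 means a SIB byte follows
print("mod=%d reg=%d rm=%d" % (mod, reg, rm))
# base=4 is ESP, index=4 means "no index register"
print("scale=%d index=%d base=%d" % (scale, index, base))
# 0xf4 read as a signed byte is -12, hence [esp-0xc]
print("disp=%d" % signed8(disp))
```

libdasm does this decoding for you and stores the results in the INSTRUCTION and OPERAND structures; the sketch only shows where those values come from.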
About get_instruction_string()
This method is the Python counterpart of libdasm’s get_instruction_string(). Here is the prototype of the method:
def get_instruction_string(INSTRUCTION, format, offset)
It transforms an INSTRUCTION object into its string representation. It takes 3 arguments:
INSTRUCTION, the INSTRUCTION object.
format, the format, FORMAT_INTEL or FORMAT_ATT, depending on how you want to display the code.
offset, the address of the instruction (for example its virtual address in the executable), used to display the targets of relative calls and jumps.
It returns a string representation of the disassembled instruction or zero if there is nothing to disassemble.
Note: Although Intel is the standard assembly syntax on the x86 platform and is generally thought to be nicer than AT&T, there is still good reason to learn AT&T, as the gcc compiler emits code in this syntax. If you don’t know which to choose, I suggest FORMAT_INTEL.
The offset parameter can simply be set to zero. The following two listings show the difference it makes. Here is an example with zero as the parameter:
push byte 0x60
push dword 0x46e070
call 0x210e ; RVA
mov edi,0x94
mov eax,edi
call 0xffffddee ; RVA
With the base address as parameter:
push byte 0x60
push dword 0x46e070
call 0x44e5f4 ; VA
mov edi,0x94
mov eax,edi
call 0x44a2e0 ; VA
As you can see, if we specify the virtual address of the instruction we want to disassemble, the targets of calls and jumps are displayed as absolute virtual addresses. Otherwise they are computed relative to the offset you passed (zero in this example).
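The arithmetic behind the two listings can be sketched in a few lines, without pydasm. A call rel32 instruction (opcode 0xE8) stores a signed 32-bit displacement that is relative to the address of the next instruction, so the displayed target is the instruction address plus 5 plus the displacement. The function name call_target is hypothetical, and the instruction address 0x44c4e6 is not given in the listings; it is inferred here by subtracting 0x210e from 0x44e5f4.

```python
import struct


def call_target(code, address):
    """Return the absolute target of a 5-byte `call rel32` located at `address`."""
    if len(code) < 5 or code[0:1] != b"\xe8":
        raise ValueError("not a call rel32 instruction")
    # Signed little-endian displacement stored after the opcode byte
    rel32, = struct.unpack_from("<i", code, 1)
    # The displacement is relative to the address of the next instruction
    return (address + 5 + rel32) & 0xFFFFFFFF


call = b"\xe8\x09\x21\x00\x00"  # rel32 = 0x2109

print("0x%x" % call_target(call, 0))         # 0x210e
print("0x%x" % call_target(call, 0x44c4e6))  # 0x44e5f4 (address inferred, see above)
```

The same formula explains the second call in the listings: a negative displacement wraps around to 0xffffddee when the address is zero, and gives 0x44a2e0 once a real address is supplied.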
About get_mnemonic_string()
This method is the Python counterpart of libdasm’s get_mnemonic_string(). Here is the prototype of the method:
def get_mnemonic_string(INSTRUCTION, format)
It transforms an instruction object’s mnemonic into its string representation. It takes 2 arguments:
INSTRUCTION, the INSTRUCTION object.
format, the format, FORMAT_INTEL or FORMAT_ATT, depending on how you want to display the code.
It returns a string representation of the mnemonic.
About get_operand_string()
This method is the Python counterpart of libdasm’s get_operand_string(). Here is the prototype of the method:
def get_operand_string(INSTRUCTION, operand, format, offset)
It transforms an instruction object’s operand into its string representation. It takes 4 arguments:
INSTRUCTION, the INSTRUCTION object.
operand, the operand index (0,1,2).
format, the format, FORMAT_INTEL or FORMAT_ATT, depending on how you want to display the code.
offset, the address of the instruction (for example its virtual address in the executable), used to display the targets of relative calls and jumps.
It returns a string representation of the disassembled operand.
About get_register_type()
This method is the Python counterpart of libdasm’s get_register_type(). Here is the prototype of the method:
def get_register_type(OPERAND)
It returns the type of the register used by an operand. It takes 1 argument:
OPERAND, the OPERAND object.
It returns a long integer representing the type of the register. I won’t explain the content of the OPERAND object, but for your information, here is the structure definition:
typedef struct _OPERAND {
    enum Operand type;   // Operand type (register, memory, etc)
    int reg;             // Register (if any)
    int basereg;         // Base register (if any)
    int indexreg;        // Index register (if any)
    int scale;           // Scale (if any)
    int dispbytes;       // Displacement bytes (0 = no displacement)
    int dispoffset;      // Displacement value offset
    int immbytes;        // Immediate bytes (0 = no immediate)
    int immoffset;       // Immediate value offset
    int sectionbytes;    // Section prefix bytes (0 = no section prefix)
    WORD section;        // Section prefix value
    DWORD displacement;  // Displacement value
    DWORD immediate;     // Immediate value
    int flags;           // Operand flags
} OPERAND, *POPERAND;
A Bit of Practice
Now we will disassemble a Windows executable (PE file format). In a PE file, the first bytes contain the structure of the file and don’t represent any executable code. To disassemble the code, we need to find the entry point of the .text section. To do that I will use a Python module called pefile.
Note: If you want more information about pefile and the Portable Executable format, I suggest you read my previous blog post.
Note: For the tests I used putty.exe. PuTTY is a free implementation of SSH and Telnet for Windows, but you can use any executable or binary file.
import pydasm
import pefile

exe_path = "putty.exe"

# Store the file in a variable
with open(exe_path, 'rb') as fd:
    data = fd.read()

# Get the EP, raw size and virtual address of the code
pe = pefile.PE(exe_path)
ep = pe.OPTIONAL_HEADER.AddressOfEntryPoint
raw_size = pe.sections[0].SizeOfRawData
ep_va = ep + pe.OPTIONAL_HEADER.ImageBase
print "[*] Entry Point: " + hex(ep)
print "[*] Raw Size: " + hex(raw_size)
print "[*] EP VA: " + hex(ep_va)

# Start disassembly at the file offset of the EP (the EP itself is an RVA)
start = pe.get_offset_from_rva(ep)
offset = start
# End of the first section's raw data (assumed to be .text and to contain the EP)
end = pe.sections[0].PointerToRawData + raw_size

# Loop until the end of the .text section
while offset < end:
    # Get the instruction at the current offset
    i = pydasm.get_instruction(data[offset:], pydasm.MODE_32)
    if not i:
        # Stop at the first byte sequence that cannot be decoded
        break
    # Print a string representation of the instruction, using its virtual address
    print pydasm.get_instruction_string(i, pydasm.FORMAT_INTEL, ep_va + (offset - start))
    # Go to the next instruction
    offset += i.length
In this code sample, we fill a variable with the content of the binary and get the address of the code we want to disassemble. Then we call get_instruction(), passing the data starting at the current offset and asking for 32-bit mode (it’s a 32-bit binary). Next we call get_instruction_string(), passing the INSTRUCTION structure, the format and the virtual address. You could use zero as the offset, but a real disassembler would show you the virtual address. Finally, we loop until the end of the section, advancing the offset by the length of each instruction.
Here is a sample from the output:
[*] Entry Point: 0x550f0
[*] Raw Size: 0x5c000
[*] EP VA: 0x4550f0
push byte 0x60
push dword 0x478108
call 0x4ac2f4
mov edi,0x94
mov eax,edi
call 0x4a9cb0
mov [ebp-0x18],esp
mov esi,esp
...
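For readers curious about where the values read through pefile come from, here is a minimal sketch that extracts them by hand with the struct module. It assumes a 32-bit PE32 image (optional header magic 0x10b) and only reads the entry point RVA, the image base, and the section table needed to map the entry point to a file offset. The function name read_pe32_entry is hypothetical, and pefile remains the more robust choice for real files.

```python
import struct


def read_pe32_entry(data):
    """Return (entry_rva, image_base, entry_file_offset) for a PE32 image."""
    # e_lfanew, at offset 0x3C of the DOS header, points to the PE signature
    e_lfanew, = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("not a PE file")

    # COFF file header: NumberOfSections at +6, SizeOfOptionalHeader at +20
    num_sections, = struct.unpack_from("<H", data, e_lfanew + 6)
    opt_size, = struct.unpack_from("<H", data, e_lfanew + 20)

    # The optional header follows the 4-byte signature and 20-byte COFF header
    opt = e_lfanew + 24
    magic, = struct.unpack_from("<H", data, opt)
    if magic != 0x10B:
        raise ValueError("only PE32 is handled in this sketch")
    entry_rva, = struct.unpack_from("<I", data, opt + 16)
    image_base, = struct.unpack_from("<I", data, opt + 28)

    # Section headers are 40 bytes each; the four fields after the 8-byte name
    # are VirtualSize, VirtualAddress, SizeOfRawData and PointerToRawData
    table = opt + opt_size
    for n in range(num_sections):
        vsize, vaddr, raw_size, raw_ptr = struct.unpack_from(
            "<4I", data, table + 40 * n + 8)
        if vaddr <= entry_rva < vaddr + max(vsize, raw_size):
            return entry_rva, image_base, entry_rva - vaddr + raw_ptr
    raise ValueError("entry point is not inside any section")
```

Calling read_pe32_entry(data) on the bytes of a 32-bit executable should give the same entry point and image base that pefile reports, plus the file offset at which disassembly has to start.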
Reading a Shellcode
You can also read pure binary data, like a shellcode. This can be useful when you download a random shellcode from the Internet and don’t know what it really does (think about an rm -rf / shellcode…).
import pydasm
# windows/messagebox - 281 bytes
# http://www.metasploit.com
# VERBOSE=false, EXITFUNC=process, TITLE=BreakInSecurity,
# TEXT=Hello, I'm in ur code, ICON=INFORMATION
shellcode = ("\xd9\xeb\x9b\xd9\x74\x24\xf4\x31\xd2\xb2\x77\x31\xc9\x64\x8b"
"\x71\x30\x8b\x76\x0c\x8b\x76\x1c\x8b\x46\x08\x8b\x7e\x20\x8b"
"\x36\x38\x4f\x18\x75\xf3\x59\x01\xd1\xff\xe1\x60\x8b\x6c\x24"
"\x24\x8b\x45\x3c\x8b\x54\x28\x78\x01\xea\x8b\x4a\x18\x8b\x5a"
"\x20\x01\xeb\xe3\x34\x49\x8b\x34\x8b\x01\xee\x31\xff\x31\xc0"
"\xfc\xac\x84\xc0\x74\x07\xc1\xcf\x0d\x01\xc7\xeb\xf4\x3b\x7c"
"\x24\x28\x75\xe1\x8b\x5a\x24\x01\xeb\x66\x8b\x0c\x4b\x8b\x5a"
"\x1c\x01\xeb\x8b\x04\x8b\x01\xe8\x89\x44\x24\x1c\x61\xc3\xb2"
"\x08\x29\xd4\x89\xe5\x89\xc2\x68\x8e\x4e\x0e\xec\x52\xe8\x9f"
"\xff\xff\xff\x89\x45\x04\xbb\x7e\xd8\xe2\x73\x87\x1c\x24\x52"
"\xe8\x8e\xff\xff\xff\x89\x45\x08\x68\x6c\x6c\x20\x41\x68\x33"
"\x32\x2e\x64\x68\x75\x73\x65\x72\x88\x5c\x24\x0a\x89\xe6\x56"
"\xff\x55\x04\x89\xc2\x50\xbb\xa8\xa2\x4d\xbc\x87\x1c\x24\x52"
"\xe8\x61\xff\xff\xff\x68\x69\x74\x79\x58\x68\x65\x63\x75\x72"
"\x68\x6b\x49\x6e\x53\x68\x42\x72\x65\x61\x31\xdb\x88\x5c\x24"
"\x0f\x89\xe3\x68\x65\x58\x20\x20\x68\x20\x63\x6f\x64\x68\x6e"
"\x20\x75\x72\x68\x27\x6d\x20\x69\x68\x6f\x2c\x20\x49\x68\x48"
"\x65\x6c\x6c\x31\xc9\x88\x4c\x24\x15\x89\xe1\x31\xd2\x6a\x40"
"\x53\x51\x52\xff\xd0\x31\xc0\x50\xff\x55\x08")
format = pydasm.FORMAT_INTEL
mode = pydasm.MODE_32
offset = 0
while offset < len(shellcode):
    instruction = pydasm.get_instruction(shellcode[offset:], mode)
    if not instruction:
        # Skip a byte that cannot be decoded and try again
        offset += 1
        continue
    print pydasm.get_instruction_string(instruction, format, 0)
    offset += instruction.length
Output
fldpi
wait
fstenv [esp-0xc]
xor edx,edx
mov dl,0x77
xor ecx,ecx
mov esi,fs:[ecx+0x30]
mov esi,[esi+0xc]
mov esi,[esi+0x1c]
...
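The loop used in both examples is a classic linear sweep: decode at the current offset, advance by the instruction length, and skip a single byte when decoding fails. Here is a minimal sketch of that logic, written independently of pydasm so it can be run anywhere. The names linear_sweep, toy_decode and TOY_LENGTHS are hypothetical, and the toy decoder only knows the length of a handful of one-byte opcodes.

```python
def linear_sweep(code, decode):
    """Yield (offset, length) for each instruction decoded in `code`.

    `decode` receives the remaining bytes and returns an instruction
    length, or None when the bytes cannot be decoded.
    """
    offset = 0
    while offset < len(code):
        length = decode(code[offset:])
        if not length:
            # Undecodable byte: skip it and try to resynchronise
            offset += 1
            continue
        yield offset, length
        offset += length


# Toy table for illustration only: opcode -> total instruction length
TOY_LENGTHS = {0x90: 1, 0xC3: 1, 0x6A: 2, 0x68: 5, 0xE8: 5}


def toy_decode(buf):
    """Return the length of the instruction at the start of `buf`, or None."""
    opcode = bytearray(buf[:1])[0]
    length = TOY_LENGTHS.get(opcode)
    if length is None or length > len(buf):
        return None
    return length


# push byte 0x60; push dword 0x46e070; one unknown byte; nop; ret
code = b"\x6a\x60\x68\x70\xe0\x46\x00\xff\x90\xc3"
print(list(linear_sweep(code, toy_decode)))
# [(0, 2), (2, 5), (8, 1), (9, 1)]
```

With pydasm, the decode argument could presumably be a small wrapper around get_instruction() that returns the length attribute of the INSTRUCTION object, or None when nothing is decoded.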
Conclusion
We are done with this quick introduction to pydasm. I hope it will be useful for your next projects. Have fun!