A friend of mine recently put together a simple virtual machine to demonstrate how code virtualization obfuscation works in practice (you can check out their write-up here: mth.lc). It looked like a fun target, so I decided to spend some time reversing it to understand the underlying interpreter mechanics and find a clean way to solve it.
When you run the binary, you get a classic crackme prompt:
% ./obf_crackme
--- MATA-VIRT CRACKME ---
Enter the secret number: 1234
FAIL: Try again deeper in the VM...
Setup and reconnaissance
I dropped the binary into IDA Pro to locate the entry point. The main function is straightforward: it grabs an integer from the user via scanf and passes it to check_logic(). If the return value matches 15575 (0x3CD7), we win.
int __fastcall main(int argc, const char **argv, const char **envp)
{
int v4;
int v5;
v5 = 0;
printf("--- MATA-VIRT CRACKME ---\n");
printf("Enter the secret number: ");
if ( scanf("%d", &v4) != 1 )
return 1;
if ( (unsigned int)check_logic() == 15575 )
printf("OK: Access Granted! You cracked it.\n");
else
printf("FAIL: Try again deeper in the VM...\n");
return 0;
}
Looking at the disassembly for main, we can see that the input is stored on the stack and then loaded into the edi register (the first argument register in the System V AMD64 ABI) right before calling _check_logic:
__text:0000000100002590|push rbp__text:0000000100002591|mov rbp, rsp__text:0000000100002594|sub rsp, 0x10__text:0000000100002598|mov [rbp-0x4], 0__text:000000010000259f|lea rdi, [rip+0xc8] ; "--- MATA-VIRT CRACKME ---\n"__text:00000001000025a8|call _printf__text:00000001000025ad|lea rdi, [rip+0xd5] ; "Enter the secret number: "__text:00000001000025b6|call _printf__text:00000001000025bb|lea rdi, [rip+0xe1] ; "%d"__text:00000001000025c2|lea rsi, [rbp-0x8]__text:00000001000025c8|call _scanf__text:00000001000025cd|cmp eax, 1__text:00000001000025d0|je loc_1000025DB__text:00000001000025d2|mov [rbp-0x4], 1__text:00000001000025d9|jmp loc_10000260F__text:00000001000025db|mov edi, [rbp-0x8]__text:00000001000025de|call _check_logic__text:00000001000025e3|cmp eax, 0x3CD7__text:00000001000025e8|jne loc_1000025FA__text:00000001000025ea|lea rdi, [rip+0xb5] ; "OK: Access Granted! You cracked it.\n"__text:00000001000025f3|call _printf__text:00000001000025f8|jmp loc_100002608__text:00000001000025fa|lea rdi, [rip+0xca] ; "FAIL: Try again deeper in the VM...\n"__text:0000000100002603|call _printf__text:0000000100002608|mov [rbp-0x4], 0__text:000000010000260f|mov eax, [rbp-0x4]__text:0000000100002612|add rsp, 0x10__text:0000000100002616|pop rbp__text:0000000100002617|retThe check_logic function is just a trampoline. It loads the address of the obfuscated bytecode into r10 and jumps directly into the VM interpreter entry point:
__text:0000000100002618|lea r10, [rip-0x161f] ; bytecode at 0x100001000__text:000000010000261f|jmp vm_entry ; tail call to VM at 0x100000800Reversing the VM interpreter
The VM entry point starts at 0x100000800. The interpreter sets up its context by allocating stack space and saving the native registers. This saved register area effectively serves as the virtual machine’s register file (vregs):
vm_entry:
push rbp
mov rbp, rsp
sub rsp, 0xA0 ; stack frame
mov [rbp-0x80], rax ; vreg[0] = saved rax
mov [rbp-0x78], rbx ; vreg[1] = saved rbx
mov [rbp-0x70], rcx ; vreg[2] = saved rcx
mov [rbp-0x68], rdx ; vreg[3] = saved rdx
mov [rbp-0x60], rsi ; vreg[4] = saved rsi
mov [rbp-0x58], rdi ; vreg[5] = saved rdi (user input)
mov [rbp-0x50], r8 ; vreg[6] = saved r8
mov [rbp-0x48], r9 ; vreg[7] = saved r9
mov r12, r10 ; r12 = VM instruction pointer
lea r13, [rbp-0x80] ; r13 = virtual register file base
dispatch: ; dispatch loop
movzx rax, byte ptr [r12] ; fetch opcode byte
inc r12 ; advance IP
lea r8, jump_table ; r8 = base of jump table
movsxd rax, dword ptr [r8+rax*4] ; index by opcode
add rax, r8 ; compute handler address
push rax
ret ; jump to handlerSince the user input is stored in rdi on entry, it maps directly to vreg[5] in the VM state. The interpreter fetches each opcode, scales it to index a relative jump table, and branches to the respective instruction handler.
To solve this statically, we would need to map out every handler in the jump table, reconstruct the custom bytecode format, and write a disassembler or decompiler for the VM. While doable, there are faster ways to bypass or solve this program.
Bypassing the check
If the goal is simply to force the application to accept any input, we can bypass the check entirely. Looking back at main, the decision to grant access depends on a conditional jump:
__text:00000001000025e8|jne loc_1000025FAWe can attach a debugger like lldb and modify the instruction at 0x1000025E8 from a jnz (75 10) to a jz (74 10). This reverses the logic so that any incorrect answer passes the check:
% lldb obf_crackme
[+] Loaded lldbinit version 3.2.436 @ lldb-22.1 (clang version)
(lldbinit) target create "obf_crackme"
Current executable set to './obf_crackme' (x86_64).
(lldbinit) r
Process 47941 stopped
* thread #1, stop reason = signal SIGSTOP
frame #0: 0x0000000200011d90 dyld`_dyld_start
(lldbinit) memory write 0x1000025E8 74 10
(lldbinit) c
Process 47941 resuming
--- MATA-VIRT CRACKME ---
Enter the secret number: 1
OK: Access Granted! You cracked it.
Process 47941 exited with status = 0 (0x00000000)
We can also write a simple bash script to brute-force the input. Since the input is an integer, if the space of possible keys is small, a simple loop will find it:
#!/bin/bash
for i in $(seq 1 100000); do
result=$(echo "$i" | ./obf_crackme 2>/dev/null | tail -1)
if echo "$result" | grep -q "OK"; then
echo "Found: $i"
exit 0
fi
done
echo "Not found in range 1-100000"
Running the script:
% ./brute_force.sh
Found: 10000
./bruteforce.sh 67.43s user 64.34s system 105% cpu 2:04.65 total
The correct input is 10000. This took about two minutes to complete. If the search space had been wider or required a 64-bit value, brute-forcing would have been impractical.
Symbolic execution with angr
What I really wanted to try to tackle this VM was angr. It is a symbolic execution engine that can analyze binaries and reason about their behavior, and they can be highly effective against VM obfuscation schemes. According to the angr documentation:
angr’s power comes not from it being an emulator, but from being able to execute with what we call symbolic variables. Instead of saying that a variable has a concrete numerical value, we can say that it holds a symbol, effectively just a name. Then, performing arithmetic operations with that variable will yield a tree of operations (termed an abstract syntax tree or AST, from compiler theory). ASTs can be translated into constraints for an SMT solver, like z3, in order to ask questions like “given the output of this sequence of operations, what must the input have been?” Source: https://docs.angr.io/en/latest/core-concepts/solver.html
Let’s implement a script to resolve this challenge.
The first step is to load the binary into angr, following the loading documentation:
import angr
import claripy
proj = angr.Project('./obf_crackme', auto_load_libs=False)
# auto_load_libs=False prevents angr from pulling in system libraries
# (libc, dyld, etc.) which we don't need and would slow things down
obj = proj.loader.main_object
print(obj, obj.entry)
This output shows the mapped Mach-O segment addresses:
<MachO Object obf_crackme, maps [0x100000000:0x10000cfff]> 4294976912
The next step is to create a blank symbolic state at the instruction 0x1000025db right after scanf has stored the user’s input and just before calling check_logic:
__text:00000001000025db|mov edi, [rbp-0x8]__text:00000001000025de|call _check_logicAccording to the angr documentation on states:
.blank_state()constructs a “blank slate” blank state, with most of its data left uninitialized. When accessing uninitialized data, an unconstrained symbolic value will be returned.
state = proj.factory.blank_state(
addr=BASE + 0x25db,
add_options={
angr.options.SYMBOL_FILL_UNCONSTRAINED_MEMORY,
angr.options.SYMBOL_FILL_UNCONSTRAINED_REGISTERS,
}
)
stack_base = 0x7fffffffffe000 # arbitrary stack base address (don't collide with other data regions in the symbolic state)
state.regs.rbp = stack_base
state.regs.rsp = stack_base - 0x100
I added the two options to fill uninitialized memory and registers with symbolic values according to the instructions given when running the script:
angr.storage.memory_mixins.default_filler_mixin | The program is accessing register with an unspecified value. This could indicate unwanted behavior.
angr.storage.memory_mixins.default_filler_mixin | angr will cope with this by generating an unconstrained symbolic variable and continuing. You can resolve this by:
angr.storage.memory_mixins.default_filler_mixin | 1) setting a value to the initial state
angr.storage.memory_mixins.default_filler_mixin | 2) adding the state option ZERO_FILL_UNCONSTRAINED_{MEMORY,REGISTERS}, to make unknown regions hold null
angr.storage.memory_mixins.default_filler_mixin | 3) adding the state option SYMBOL_FILL_UNCONSTRAINED_{MEMORY,REGISTERS}, to suppress these messages.
Next, I created a symbolic variable (a 32-bit symbolic bitvector) to represent the user input, and stored it in the virtual register vreg[5] (the rdi register), the same stack slot scanf would write to, once again following the instructions given by angr’s documentation on solver/bitvectors:
user_input = claripy.BVS('user_input', 32)
state.mem[state.regs.rbp - 8].int = user_input
Everything is now set up to symbolically execute the VM. The next step is to create a simulation manager and explore the state space, setting our target to 0x1000025e3 (cmp eax, 0x3CD7). angr will trace the paths through check_logic (the VM bytecode interpreter) while treating user_input as symbolic.
According to the documentation:
An extremely common operation in symbolic execution is to find a state that reaches a certain address, while discarding all states that go through another address. Simulation manager has a shortcut for this pattern, the
.explore()method.
simgr = proj.factory.simulation_manager(state)
simgr.explore(find=BASE + 0x25e3)
When any state reaches 0x1000025e3, it’s added to simgr.found[]:
found = simgr.found[0]
At the comparison, eax holds the return value from check_logic(user_input). For the binary to print "Access Granted", we need eax == 0x3CD7. We add this as a constraint to the found state’s solver and ask it to evaluate the symbolic variable.
Z3 (the underlying SMT solver) backtracks through all the constraints that accumulated during symbolic execution, including every VM bytecode operation that depended on user_input, and finds the unique concrete value that makes everything consistent.
found.add_constraints(found.regs.eax == 0x3CD7)
solution = found.solver.eval(user_input)
print(f"Solution: {solution}")
Here is the complete angr script:
import angr
import claripy
BASE = 0x100000000
proj = angr.Project('./obf_crackme', auto_load_libs=False)
state = proj.factory.blank_state(
addr=BASE + 0x25db,
add_options={
angr.options.SYMBOL_FILL_UNCONSTRAINED_MEMORY,
angr.options.SYMBOL_FILL_UNCONSTRAINED_REGISTERS,
}
)
stack_base = 0x7fffffffffe000
state.regs.rbp = stack_base
state.regs.rsp = stack_base - 0x100
user_input = claripy.BVS('user_input', 32)
state.mem[state.regs.rbp - 8].int = user_input
simgr = proj.factory.simulation_manager(state)
simgr.explore(find=BASE + 0x25e3)
if simgr.found:
found = simgr.found[0]
found.add_constraints(found.regs.eax == 0x3CD7)
solution = found.solver.eval(user_input)
print(f"Solution: {solution}")
else:
print("Path not found.")
Running the script prints the correct input needed to satisfy the check:
% python3 solve.py
Solution: 10000
Because the VM does not use complex cryptographic operations (like one-way hash functions), the SMT solver is able to back-propagate through the virtualized instructions and solve the condition directly!