Cracking a virtualized binary with Angr

A friend of mine recently put together a simple virtual machine to demonstrate how code virtualization obfuscation works in practice (you can check out their write-up here: mth.lc). It looked like a fun target, so I decided to spend some time reversing it to understand the underlying interpreter mechanics and find a clean way to solve it.

When you run the binary, you get a classic crackme prompt:

% ./obf_crackme
--- MATA-VIRT CRACKME ---
Enter the secret number: 1234
FAIL: Try again deeper in the VM...

Setup and reconnaissance

I dropped the binary into IDA Pro to locate the entry point. The main function is straightforward: it grabs an integer from the user via scanf and passes it to check_logic(). If the return value matches 15575 (0x3CD7), we win.

int __fastcall main(int argc, const char **argv, const char **envp)
{
  int v4;
  int v5;

  v5 = 0;
  printf("--- MATA-VIRT CRACKME ---\n");
  printf("Enter the secret number: ");
  if ( scanf("%d", &v4) != 1 )
    return 1;
  if ( (unsigned int)check_logic() == 15575 )
    printf("OK: Access Granted! You cracked it.\n");
  else
    printf("FAIL: Try again deeper in the VM...\n");
  return 0;
}

Looking at the disassembly for main, we can see that the input is stored on the stack and then loaded into the edi register (the first argument register in the System V AMD64 ABI) right before calling _check_logic:

__text:0000000100002590|push    rbp__text:0000000100002591|mov     rbp, rsp__text:0000000100002594|sub     rsp, 0x10__text:0000000100002598|mov     [rbp-0x4], 0__text:000000010000259f|lea     rdi, [rip+0xc8]   ; "--- MATA-VIRT CRACKME ---\n"__text:00000001000025a8|call    _printf__text:00000001000025ad|lea     rdi, [rip+0xd5]   ; "Enter the secret number: "__text:00000001000025b6|call    _printf__text:00000001000025bb|lea     rdi, [rip+0xe1]   ; "%d"__text:00000001000025c2|lea     rsi, [rbp-0x8]__text:00000001000025c8|call    _scanf__text:00000001000025cd|cmp     eax, 1__text:00000001000025d0|je      loc_1000025DB__text:00000001000025d2|mov     [rbp-0x4], 1__text:00000001000025d9|jmp     loc_10000260F__text:00000001000025db|mov     edi, [rbp-0x8]__text:00000001000025de|call    _check_logic__text:00000001000025e3|cmp     eax, 0x3CD7__text:00000001000025e8|jne     loc_1000025FA__text:00000001000025ea|lea     rdi, [rip+0xb5]   ; "OK: Access Granted! You cracked it.\n"__text:00000001000025f3|call    _printf__text:00000001000025f8|jmp     loc_100002608__text:00000001000025fa|lea     rdi, [rip+0xca]   ; "FAIL: Try again deeper in the VM...\n"__text:0000000100002603|call    _printf__text:0000000100002608|mov     [rbp-0x4], 0__text:000000010000260f|mov     eax, [rbp-0x4]__text:0000000100002612|add     rsp, 0x10__text:0000000100002616|pop     rbp__text:0000000100002617|ret

The check_logic function is just a trampoline. It loads the address of the obfuscated bytecode into r10 and jumps directly into the VM interpreter entry point:

__text:0000000100002618|lea     r10, [rip-0x161f]  ; bytecode at 0x100001000__text:000000010000261f|jmp     vm_entry           ; tail call to VM at 0x100000800

Reversing the VM interpreter

The VM entry point starts at 0x100000800. The interpreter sets up its context by allocating stack space and saving the native registers. This saved register area effectively serves as the virtual machine’s register file (vregs):

vm_entry:
  push rbp
  mov  rbp, rsp
  sub  rsp, 0xA0                    ; stack frame

  mov  [rbp-0x80], rax              ; vreg[0] = saved rax
  mov  [rbp-0x78], rbx              ; vreg[1] = saved rbx
  mov  [rbp-0x70], rcx              ; vreg[2] = saved rcx
  mov  [rbp-0x68], rdx              ; vreg[3] = saved rdx
  mov  [rbp-0x60], rsi              ; vreg[4] = saved rsi
  mov  [rbp-0x58], rdi              ; vreg[5] = saved rdi  (user input)
  mov  [rbp-0x50], r8               ; vreg[6] = saved r8
  mov  [rbp-0x48], r9               ; vreg[7] = saved r9

  mov  r12, r10                     ; r12 = VM instruction pointer
  lea  r13, [rbp-0x80]              ; r13 = virtual register file base

dispatch:                           ; dispatch loop
  movzx rax, byte ptr [r12]         ; fetch opcode byte
  inc   r12                         ; advance IP
  lea   r8, jump_table              ; r8 = base of jump table
  movsxd rax, dword ptr [r8+rax*4]  ; index by opcode
  add   rax, r8                     ; compute handler address
  push  rax
  ret                               ; jump to handler

Since the user input is stored in rdi on entry, it maps directly to vreg[5] in the VM state. The interpreter fetches each opcode, scales it to index a relative jump table, and branches to the respective instruction handler.

To solve this statically, we would need to map out every handler in the jump table, reconstruct the custom bytecode format, and write a disassembler or decompiler for the VM. While doable, there are faster ways to bypass or solve this program.

Bypassing the check

If the goal is simply to force the application to accept any input, we can bypass the check entirely. Looking back at main, the decision to grant access depends on a conditional jump:

__text:00000001000025e8|jne     loc_1000025FA

We can attach a debugger like lldb and modify the instruction at 0x1000025E8 from a jnz (75 10) to a jz (74 10). This reverses the logic so that any incorrect answer passes the check:

% lldb obf_crackme
[+] Loaded lldbinit version 3.2.436 @ lldb-22.1 (clang version)
(lldbinit) target create "obf_crackme"
Current executable set to './obf_crackme' (x86_64).
(lldbinit) r
Process 47941 stopped
* thread #1, stop reason = signal SIGSTOP
    frame #0: 0x0000000200011d90 dyld`_dyld_start
(lldbinit) memory write 0x1000025E8 74 10
(lldbinit) c
Process 47941 resuming
--- MATA-VIRT CRACKME ---
Enter the secret number: 1
OK: Access Granted! You cracked it.
Process 47941 exited with status = 0 (0x00000000)

We can also write a simple bash script to brute-force the input. Since the input is an integer, if the space of possible keys is small, a simple loop will find it:

#!/bin/bash
for i in $(seq 1 100000); do
  result=$(echo "$i" | ./obf_crackme 2>/dev/null | tail -1)
  if echo "$result" | grep -q "OK"; then
    echo "Found: $i"
    exit 0
  fi
done
echo "Not found in range 1-100000"

Running the script:

% ./brute_force.sh
Found: 10000
./bruteforce.sh  67.43s user 64.34s system 105% cpu 2:04.65 total

The correct input is 10000. This took about two minutes to complete. If the search space had been wider or required a 64-bit value, brute-forcing would have been impractical.

Symbolic execution with angr

What I really wanted to try to tackle this VM was angr. It is a symbolic execution engine that can analyze binaries and reason about their behavior, and they can be highly effective against VM obfuscation schemes. According to the angr documentation:

angr’s power comes not from it being an emulator, but from being able to execute with what we call symbolic variables. Instead of saying that a variable has a concrete numerical value, we can say that it holds a symbol, effectively just a name. Then, performing arithmetic operations with that variable will yield a tree of operations (termed an abstract syntax tree or AST, from compiler theory). ASTs can be translated into constraints for an SMT solver, like z3, in order to ask questions like “given the output of this sequence of operations, what must the input have been?” Source: https://docs.angr.io/en/latest/core-concepts/solver.html

Let’s implement a script to resolve this challenge.

The first step is to load the binary into angr, following the loading documentation:

import angr
import claripy

proj = angr.Project('./obf_crackme', auto_load_libs=False)
# auto_load_libs=False prevents angr from pulling in system libraries
# (libc, dyld, etc.) which we don't need and would slow things down
obj = proj.loader.main_object
print(obj, obj.entry)

This output shows the mapped Mach-O segment addresses:

<MachO Object obf_crackme, maps [0x100000000:0x10000cfff]> 4294976912

The next step is to create a blank symbolic state at the instruction 0x1000025db right after scanf has stored the user’s input and just before calling check_logic:

__text:00000001000025db|mov     edi, [rbp-0x8]__text:00000001000025de|call    _check_logic

According to the angr documentation on states:

.blank_state() constructs a “blank slate” blank state, with most of its data left uninitialized. When accessing uninitialized data, an unconstrained symbolic value will be returned.

state = proj.factory.blank_state(
    addr=BASE + 0x25db,
    add_options={
        angr.options.SYMBOL_FILL_UNCONSTRAINED_MEMORY,
        angr.options.SYMBOL_FILL_UNCONSTRAINED_REGISTERS,
    }
)

stack_base = 0x7fffffffffe000 # arbitrary stack base address (don't collide with other data regions in the symbolic state) 
state.regs.rbp = stack_base
state.regs.rsp = stack_base - 0x100

I added the two options to fill uninitialized memory and registers with symbolic values according to the instructions given when running the script:

angr.storage.memory_mixins.default_filler_mixin | The program is accessing register with an unspecified value. This could indicate unwanted behavior.
angr.storage.memory_mixins.default_filler_mixin | angr will cope with this by generating an unconstrained symbolic variable and continuing. You can resolve this by:
angr.storage.memory_mixins.default_filler_mixin | 1) setting a value to the initial state
angr.storage.memory_mixins.default_filler_mixin | 2) adding the state option ZERO_FILL_UNCONSTRAINED_{MEMORY,REGISTERS}, to make unknown regions hold null
angr.storage.memory_mixins.default_filler_mixin | 3) adding the state option SYMBOL_FILL_UNCONSTRAINED_{MEMORY,REGISTERS}, to suppress these messages.

Next, I created a symbolic variable (a 32-bit symbolic bitvector) to represent the user input, and stored it in the virtual register vreg[5] (the rdi register), the same stack slot scanf would write to, once again following the instructions given by angr’s documentation on solver/bitvectors:

user_input = claripy.BVS('user_input', 32)
state.mem[state.regs.rbp - 8].int = user_input

Everything is now set up to symbolically execute the VM. The next step is to create a simulation manager and explore the state space, setting our target to 0x1000025e3 (cmp eax, 0x3CD7). angr will trace the paths through check_logic (the VM bytecode interpreter) while treating user_input as symbolic.

According to the documentation:

An extremely common operation in symbolic execution is to find a state that reaches a certain address, while discarding all states that go through another address. Simulation manager has a shortcut for this pattern, the .explore() method.

simgr = proj.factory.simulation_manager(state)
simgr.explore(find=BASE + 0x25e3)

When any state reaches 0x1000025e3, it’s added to simgr.found[]:

found = simgr.found[0]

At the comparison, eax holds the return value from check_logic(user_input). For the binary to print "Access Granted", we need eax == 0x3CD7. We add this as a constraint to the found state’s solver and ask it to evaluate the symbolic variable.

Z3 (the underlying SMT solver) backtracks through all the constraints that accumulated during symbolic execution, including every VM bytecode operation that depended on user_input, and finds the unique concrete value that makes everything consistent.

found.add_constraints(found.regs.eax == 0x3CD7)
solution = found.solver.eval(user_input)
print(f"Solution: {solution}")

Here is the complete angr script:

import angr
import claripy

BASE = 0x100000000
proj = angr.Project('./obf_crackme', auto_load_libs=False)
state = proj.factory.blank_state(
    addr=BASE + 0x25db,
    add_options={
        angr.options.SYMBOL_FILL_UNCONSTRAINED_MEMORY,
        angr.options.SYMBOL_FILL_UNCONSTRAINED_REGISTERS,
    }
)

stack_base = 0x7fffffffffe000
state.regs.rbp = stack_base
state.regs.rsp = stack_base - 0x100

user_input = claripy.BVS('user_input', 32)
state.mem[state.regs.rbp - 8].int = user_input

simgr = proj.factory.simulation_manager(state)
simgr.explore(find=BASE + 0x25e3)
if simgr.found:
    found = simgr.found[0]
    found.add_constraints(found.regs.eax == 0x3CD7)
    solution = found.solver.eval(user_input)
    print(f"Solution: {solution}")
else:
    print("Path not found.")

Running the script prints the correct input needed to satisfy the check:

% python3 solve.py
Solution: 10000

Because the VM does not use complex cryptographic operations (like one-way hash functions), the SMT solver is able to back-propagate through the virtualized instructions and solve the condition directly!

Setup and reconnaissance#

Reversing the VM interpreter#

Bypassing the check#

Symbolic execution with angr#

Comments

Setup and reconnaissance

Reversing the VM interpreter

Bypassing the check

Symbolic execution with angr