Reverse Engineering and Decompilers [2024-02-10]

Posted on Sat 10 February 2024 in Thought

Diving Back into Code

I have been delving back into lower-level code, specifically assembly, as a refresher on reverse engineering and exploit development techniques. It has been fun to relearn those techniques and to see the tooling available today.

Currently, I am experimenting with Ghidra, Hopper, and radare2. While I find assembly fairly easy to read, sometimes you need to translate those instructions into more readable higher-level code, such as C/C++. Unfortunately, good decompilers are relatively rare, and commercial ones such as the Hex-Rays decompiler are expensive.

It dawned on me: "Why don't I have GPT become my decompiler?" It's funny because LLMs are already used as copilots for developers (writing, suggesting, and debugging code), and debugging is at the core of reverse engineering. So naturally, if GPT, Code Llama, and other LLMs can write code, they should also be able to reverse engineer it.

Below is a simple example borrowed from Blue Fox: Arm Assembly Internals and Reverse Engineering, which I highly recommend. I compiled it with plain gcc on macOS (Sonoma), with no optimization flags.

#include <unistd.h>

int main(void) {
    write(1, "Hello!\n", 7);
}

Compiling the code produces an arm64 Mach-O object file:

$ file hello.o
hello.o: Mach-O 64-bit object arm64

Running objdump -d on that object file gives me the disassembly:

Disassembly of section __TEXT,__text:

0000000000000000 <ltmp0>:
       0: a9bf7bfd      stp x29, x30, [sp, #-16]!
       4: 910003fd      mov x29, sp
       8: 52800020      mov w0, #1
       c: 90000001      adrp    x1, 0x0 <ltmp0+0xc>
      10: 91000021      add x1, x1, #0
      14: d28000e2      mov x2, #7
      18: 94000000      bl  0x18 <ltmp0+0x18>
      1c: 52800000      mov w0, #0
      20: a8c17bfd      ldp x29, x30, [sp], #16
      24: d65f03c0      ret

Now, suppose I didn't have the source code. This is the usual situation when you are handed a set of binaries and need to understand how they work. You would disassemble the code using tools typically already installed on Unix/Linux/Mac, or with tools like Ghidra, IDA, Hopper, radare2, etc. Then you would need to study the instructions, noting functions, labels, variables, etc. This can be time-consuming, especially if you are under time pressure to finish analyzing the target program.
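To give a feel for what studying the instructions by hand involves, here is a minimal sketch that decodes the raw instruction words behind the three `mov` lines in the listing above. It assumes only the A64 MOVZ (move wide with zero) encoding, which is the instruction those `mov` aliases correspond to. The names `movz_fields`, `decode_movz`, and `format_movz` are my own, purely for illustration, and this is nowhere near a general disassembler.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Fields of an A64 MOVZ instruction (hypothetical helper type). */
struct movz_fields {
    bool is_64bit;   /* true for an x register, false for a w register */
    unsigned rd;     /* destination register number (bits 4..0) */
    uint64_t value;  /* 16-bit immediate after applying the hw shift */
};

/*
 * Layout: sf(31) | opc=10 (30..29) | 100101 (28..23) | hw (22..21)
 *         | imm16 (20..5) | Rd (4..0)
 * Returns true and fills *out if word is a MOVZ; false otherwise.
 */
bool decode_movz(uint32_t word, struct movz_fields *out)
{
    /* Bits 30..23 must read 10100101 (0xA5) for MOVZ. */
    if (((word >> 23) & 0xFFu) != 0xA5u)
        return false;

    bool is_64bit = (word >> 31) & 1u;
    unsigned hw = (word >> 21) & 0x3u;

    /* The 32-bit form only allows shifts of 0 or 16 bits. */
    if (!is_64bit && hw > 1)
        return false;

    out->is_64bit = is_64bit;
    out->rd = word & 0x1Fu;
    out->value = (uint64_t)((word >> 5) & 0xFFFFu) << (16 * hw);
    return true;
}

/*
 * Writes something like "mov x2, #7" into buf. Returns the snprintf
 * result, or -1 if word is not a MOVZ. Register 31 (the zero register)
 * is printed as a plain number to keep the sketch short.
 */
int format_movz(uint32_t word, char *buf, size_t len)
{
    struct movz_fields f;
    if (!decode_movz(word, &f))
        return -1;
    return snprintf(buf, len, "mov %c%u, #%llu",
                    f.is_64bit ? 'x' : 'w', f.rd,
                    (unsigned long long)f.value);
}
```

Feeding it the words 0x52800020, 0xd28000e2, and 0x52800000 from the listing recovers w0 = 1, x2 = 7, and w0 = 0: the file descriptor, the length, and the return value. Doing this for every instruction in a real binary is exactly the tedious part.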

This is where an LLM comes in. I gave the disassembly to GPT-4, and it came surprisingly close to recreating the small program:

#include <unistd.h>

int main() {
    // Assuming the address setup by `adrp` and `add` instructions points to some data
    const char *data = "Example"; // Placeholder for the actual data pointed to by x1 after `adrp` + `add`
    ssize_t bytes_written;

    // The `mov w0, #0x1` instruction suggests we're writing to stdout (file descriptor 1)
    // The `mov x2, #0x7` instruction suggests the length of the data is 7 bytes
    // Therefore, the call to _write corresponds to writing 7 bytes from `data` to stdout
    bytes_written = write(1, data, 7);

    // Return 0 indicating success
    return 0;
}

I was impressed with the results, and I will definitely keep that in my back pocket for when I truly need to dive a bit deeper.

Here's the full output of the GPT-4 decompilation.