diff --git a/compile_pipeline.md b/compile_pipeline.md
new file mode 100644
index 0000000..a166309
--- /dev/null
+++ b/compile_pipeline.md
@@ -0,0 +1,155 @@
+# What Happens When You Run `g++ app.cpp -o app`
+
+It looks like a single command, but under the hood `g++` orchestrates a multi-stage pipeline. Here is what actually happens, step by step.
+
+---
+
+## The Full Pipeline
+
+```
+app.cpp
+   │
+   ▼  [1] Preprocessor (cpp)
+app.ii   ← expanded source (macros, #includes resolved)
+   │
+   ▼  [2] Compiler (cc1plus)
+app.s    ← assembly code
+   │
+   ▼  [3] Assembler (as)
+app.o    ← relocatable object file (ELF / Mach-O / COFF)
+   │
+   ▼  [4] Linker (ld / lld / link.exe)
+app      ← final executable
+```
+
+`g++` is a driver: it calls each of these tools in sequence and passes the right flags between them. None of the intermediate files are kept unless you ask. GCC normally writes them to temporary files (or pipes them between stages with `-pipe`) and deletes them afterwards. Use `g++ -S app.cpp` to stop at assembly, or `-save-temps` to keep every intermediate.
+
+---
+
+## Stage 1: Preprocessing
+
+**Tool:** `cpp` (the C preprocessor; modern GCC runs it as part of `cc1plus` rather than as a separate process)
+
+The preprocessor handles all directives that start with `#`:
+
+- `#include <iostream>`: literally pastes the content of the `iostream` header (and everything it includes) into your source
+- `#define FOO 42`: performs textual substitution of `FOO` for the rest of the file
+- `#ifdef` / `#ifndef` / `#endif`: conditionally includes or excludes blocks of code
+- `#pragma once`: prevents a header from being included more than once (non-standard, but widely supported)
+
+The output is a single, flat `.ii` file: C++ source with every `#include` and macro already expanded. Apart from linemarkers such as `# 1 "app.cpp"` (and any `#pragma`s passed through to the compiler), no directives remain, and the file is often tens of thousands of lines long even for a small program.
+
+```bash
+# You can inspect this stage yourself:
+g++ -E app.cpp -o app.ii
+```
+
+---
+
+## Stage 2: Compilation
+
+**Tool:** `cc1plus` (GCC's C++ compiler proper)
+
+The compiler takes the preprocessed source and:
+
+1. **Parses** it into an Abstract Syntax Tree (AST)
+2. **Type-checks**: validates that types match, overloads resolve, and templates instantiate correctly
+3. **Optimizes**: applies transformations based on the `-O` level
+4. **Emits assembly**: produces human-readable `.s` text in the target ISA (x86-64, ARM, etc.)
+
+```bash
+# Stop after compilation, get assembly:
+g++ -S app.cpp -o app.s
+```
+
+```asm
+# Example of what app.s might contain for add(int, int) at -O0 (Intel syntax, as with -masm=intel; GCC defaults to AT&T)
+_Z3addii:
+    push    rbp
+    mov     rbp, rsp
+    mov     DWORD PTR [rbp-4], edi
+    mov     DWORD PTR [rbp-8], esi
+    mov     edx, DWORD PTR [rbp-4]
+    mov     eax, DWORD PTR [rbp-8]
+    add     eax, edx
+    pop     rbp
+    ret
+```
+
+Note the mangled name `_Z3addii`: that is `add(int, int)` after C++ name mangling (Itanium C++ ABI) encodes the name length and parameter types into the symbol. Running `c++filt _Z3addii` reverses it.
+
+---
+
+## Stage 3: Assembly
+
+**Tool:** `as` (the GNU assembler; Clang normally uses its integrated assembler instead)
+
+The assembler converts the `.s` text file into a binary **object file** (`.o`). This is a relocatable binary that contains:
+
+- **Machine code** for all functions defined in this translation unit
+- **A symbol table** listing every symbol defined here and every symbol referenced but not yet defined
+- **Relocation entries**: placeholders saying "at this byte offset, fill in the final address of symbol `X`"
+
+The object file is **not yet executable** because:
+- References to functions/globals in other `.cpp` files are unresolved
+- Absolute memory addresses haven't been assigned yet
+
+```bash
+# Stop at object file:
+g++ -c app.cpp -o app.o
+
+# Inspect the symbol table:
+nm app.o
+#                  U _ZSt4cout   ← U = undefined, still unresolved
+# 0000000000000000 T _Z3addii    ← T = defined in text (code) section
+```
+
+---
+
+## Stage 4: Linking
+
+**Tool:** `ld` (GNU, which `g++` invokes through `collect2`), `lld` (LLVM), or `link.exe` (MSVC)
+
+This is where everything comes together. The linker:
+
+1. **Collects** all `.o` files (yours plus any from `-l` libraries)
+2. **Resolves symbols**: for every `U` (undefined) symbol in any object, finds a matching definition (`T`, `D`, `B`, ...) in another object or library
+3. **Applies relocations**: patches those placeholder bytes with real addresses (references into shared libraries are finalized by the dynamic loader at run time)
+4. **Lays out sections**: merges the `.text`, `.data`, `.bss`, and `.rodata` sections from all objects into one of each
+5. **Writes the executable**: outputs an ELF (Linux), Mach-O (macOS), or PE (Windows) binary with a proper entry point
+
+```
+app.o          ← your code
+   +
+libstdc++.so   ← C++ standard library (iostream, string, etc.)
+   +
+libc.so        ← C standard library (malloc, printf, etc.)
+   +
+crt1.o         ← C runtime startup (defines _start, which sets up argc/argv and eventually calls main())
+   │
+   ▼
+app            ← fully linked executable
+```
+
+Even though you only wrote `app.cpp`, the final binary depends on the C++ standard library, the C library, and the platform startup objects. The startup objects are copied into the executable; shared libraries are recorded as dependencies and mapped in when the program loads.
+
+```bash
+# See what the linker actually pulls in:
+g++ app.cpp -o app -Wl,--verbose 2>&1 | less
+
+# Or check what shared libraries the final binary depends on:
+ldd app
+```
+
+---
+
+## Quick Summary
+
+| Stage      | Input          | Output   | Key job                               |
+|------------|----------------|----------|---------------------------------------|
+| Preprocess | `app.cpp`      | `app.ii` | Expand macros and `#include`s         |
+| Compile    | `app.ii`       | `app.s`  | Parse, type-check, optimize, emit asm |
+| Assemble   | `app.s`        | `app.o`  | Encode asm as binary machine code     |
+| Link       | `app.o` + libs | `app`    | Resolve symbols, assign addresses     |
+
+When you run `g++ app.cpp -o app`, all four stages happen invisibly in sequence. The `-o app` flag names only the final output, not any of the intermediates.