strangecpp/compile_pipeline.md
2026-02-24 11:49:10 +01:00

# What Happens When You Run `g++ app.cpp -o app`
At first glance, it looks like a single command — but under the hood, `g++` is orchestrating a multi-stage pipeline. Here's what actually happens, step by step.
---
## The Full Pipeline
```
app.cpp
▼ [1] Preprocessor (cpp)
app.ii ← expanded source (macros, #includes resolved)
▼ [2] Compiler (cc1plus)
app.s ← assembly code
▼ [3] Assembler (as)
app.o ← relocatable object file (ELF / Mach-O / COFF)
▼ [4] Linker (ld / lld / link.exe)
app ← final executable
```
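`g++` is a driver: it calls each of these tools in sequence and passes the right flags between them. You normally never see the intermediate files. GCC preprocesses and compiles in a single `cc1plus` run, and the assembly and object files are temporary files that are deleted once linking finishes (with `-pipe` they are replaced by pipes where possible). To keep them, pass `-save-temps`, or stop the pipeline early (e.g., `g++ -S app.cpp` to stop at assembly).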
---
## Stage 1 — Preprocessing
**Tool:** the C preprocessor (available standalone as `cpp`; modern GCC builds it into `cc1plus`, so `g++` does not launch it as a separate process)
The preprocessor handles all directives that start with `#`:
- `#include <iostream>` — literally pastes the content of `iostream` (and everything it includes) into your source
- `#define FOO 42` — performs textual substitution across the file
- `#ifdef` / `#ifndef` / `#endif` — conditionally includes or excludes blocks of code
- `#pragma once` — prevents a header from being included more than once
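The output is a single, flat `.ii` file: C++ source with every `#include` and macro already expanded. The only `#` lines left are linemarkers (such as `# 1 "app.cpp"`) that record where each chunk came from, plus any `#pragma` the compiler itself still needs to see. It can run to tens of thousands of lines even for a small program.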
```bash
# You can inspect this stage yourself:
g++ -E app.cpp -o app.ii
```
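
To make the directives above concrete, here is a small sketch of a translation unit that uses them. The file name `preprocess_demo.cpp` and the names `SQUARE`, `GREETING`, `square_of_successor`, and `greeting` are made up for illustration; they are not part of `app.cpp`.

```cpp
// preprocess_demo.cpp (hypothetical file name)
#include <string>               // replaced by the full text of the <string> header

#define SQUARE(x) ((x) * (x))   // function-like macro: pure text substitution

#ifndef GREETING                // only defined here if not already set on the command line
#define GREETING "hello"
#endif

// After preprocessing, the body reads: return ((n + 1) * (n + 1));
int square_of_successor(int n) {
    return SQUARE(n + 1);
}

// After preprocessing, the body reads: return std::string("hello");
std::string greeting() {
    return std::string(GREETING);
}
```

Running `g++ -E` on a file like this should show both function bodies with the macros already replaced, preceded by the expanded contents of `<string>`. Passing `-DGREETING='"hi"'` on the command line defines the macro first, so the `#ifndef` block is skipped.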
---
## Stage 2 — Compilation
**Tool:** `cc1plus` (GCC's C++ compiler proper)
The compiler takes the preprocessed source and:
1. **Parses** it into an Abstract Syntax Tree (AST)
2. **Type-checks** — validates that types match, overloads resolve, templates instantiate correctly
3. **Optimizes** — applies transformations based on the `-O` level
4. **Emits assembly** — produces human-readable `.s` text in the target ISA (x86-64, ARM, etc.)
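
The type-checking step is the one most visible from source code. The sketch below shows overload resolution and template instantiation being decided at compile time; the names `twice` and `describe` are hypothetical and chosen only for this example.

```cpp
#include <string>
#include <type_traits>

// A function template: no code is generated until it is instantiated.
template <typename T>
T twice(T value) {
    return value + value;
}

// Two overloads: the compiler picks one per call site from the argument type.
int describe(int)    { return 1; }
int describe(double) { return 2; }

// Checked during compilation: twice(1.5) instantiates twice<double>.
static_assert(std::is_same<decltype(twice(1.5)), double>::value,
              "twice(1.5) should instantiate twice<double>");

// Each use below forces a separate instantiation: twice<int>, twice<std::string>.
int twice_int(int n) { return twice(n); }
std::string twice_text(const std::string& s) { return twice(s); }
```

If the `static_assert` condition were false, or if a call such as `twice(nullptr)` were added, compilation would stop at this stage and no `.s` file would be produced.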
```bash
# Stop after compilation, get assembly:
g++ -S app.cpp -o app.s
```
```asm
# Example of what app.s might contain for a simple function
# (Intel syntax, as produced with -masm=intel at -O0; GCC emits AT&T syntax by default)
_Z3addii:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
mov edx, DWORD PTR [rbp-4]
mov eax, DWORD PTR [rbp-8]
add eax, edx
pop rbp
ret
```
Note the mangled name `_Z3addii` — that's `add(int, int)` after C++ name mangling encodes the parameter types into the symbol name.
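
The mangling can be checked from inside a program. This is a minimal sketch that assumes GCC or Clang, since `abi::__cxa_demangle` from `<cxxabi.h>` belongs to the Itanium C++ ABI and is not available under MSVC. The helper `demangle` and the function `add_c` are names invented for this example.

```cpp
#include <cxxabi.h>   // abi::__cxa_demangle (GCC/Clang, Itanium C++ ABI)
#include <cstdlib>    // std::free
#include <string>

// C++ linkage: the symbol encodes the parameter types (_Z3addii).
int add(int a, int b) { return a + b; }

// C linkage: no mangling, the symbol is simply add_c.
extern "C" int add_c(int a, int b) { return a + b; }

// Turn a mangled symbol back into a readable signature.
// Falls back to the input if the name cannot be demangled.
std::string demangle(const char* mangled) {
    int status = 0;
    char* raw = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || raw == nullptr) {
        return mangled;
    }
    std::string readable(raw);
    std::free(raw);   // the buffer is malloc-allocated by the ABI library
    return readable;
}
```

Calling `demangle("_Z3addii")` yields `add(int, int)`, which is the same translation the `c++filt` tool performs on the command line.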
---
## Stage 3 — Assembly
**Tool:** `as` (the GNU assembler; Clang uses its built-in integrated assembler by default)
The assembler converts the `.s` text file into a binary **object file** (`.o`). This is a relocatable binary — it contains:
- **Machine code** for all functions defined in this translation unit
- **A symbol table** listing every symbol defined here and every symbol referenced but not yet defined
- **Relocation entries** — placeholders saying "at this byte offset, fill in the final address of symbol `X`"
The object file is **not yet executable** because:
- References to functions/globals in other `.cpp` files are unresolved
- Absolute memory addresses haven't been assigned yet
```bash
# Stop at object file:
g++ -c app.cpp -o app.o
# Inspect the symbol table:
nm app.o
# U _ZSt4cout ← U = undefined, still unresolved
# T _Z3addii ← T = defined in text (code) section
```
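
Different kinds of definitions end up in different sections and get different letters in `nm`. The following sketch uses hypothetical names (`counter`, `scratch`, `bump`, `next_id`, `scratch_size`); the letters in the comments are what `nm` typically reports on an ELF target when compiling without optimization, not guaranteed output.

```cpp
#include <cstddef>

int counter = 42;        // initialized global:      .data, typically `D counter`
int scratch[1024];       // zero-initialized global: .bss,  typically `B scratch`

// Internal linkage: visible only inside this translation unit,
// so it shows up as a local symbol (lowercase `t`) if it is emitted at all.
static int bump(int step) {
    counter += step;
    return counter;
}

// External linkage: other object files may reference it (`T _Z7next_idv`).
int next_id() {
    return bump(1);
}

// Also external; reports how many elements the .bss array holds.
std::size_t scratch_size() {
    return sizeof(scratch) / sizeof(scratch[0]);
}
```

With optimization enabled the compiler may inline `bump` and drop its symbol entirely, which is one reason the symbol table of the same source can differ between `-O0` and `-O2`.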
---
## Stage 4 — Linking
**Tool:** `ld` (on Linux), `lld` (LLVM), or `link.exe` (MSVC)
This is where everything comes together. The linker:
1. **Collects** all `.o` files (yours + any from `-l` libraries)
2. **Resolves symbols**: for every undefined (`U`) symbol in any object, finds a matching definition (`T` for code, `D` or `B` for data) in another object or library
3. **Lays out sections**: merges the `.text`, `.data`, `.bss`, and `.rodata` sections from all objects and assigns them final addresses
4. **Applies relocations**: patches the placeholder bytes with those final addresses
5. **Writes the executable**: outputs an ELF (Linux), Mach-O (macOS), or PE (Windows) binary with a proper entry point
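
Symbol resolution is easiest to picture with a declaration in one file and the definition in another. The sketch below shows both halves in a single listing for brevity; the split into `app.cpp` and `math.cpp`, and the function `twice_sum`, are assumptions made for illustration.

```cpp
// ---- app.cpp (hypothetical): sees only a declaration of add() ----
int add(int a, int b);        // declaration: promises that a definition exists somewhere

int twice_sum(int a, int b) {
    return 2 * add(a, b);     // compiled as a call with a relocation against _Z3addii
}

// ---- math.cpp (hypothetical): would normally be a separate file ----
int add(int a, int b) {       // definition: the symbol the linker matches the call to
    return a + b;
}
```

If the two halves are placed in separate files and each is compiled with `-c`, `nm app.o` should list `_Z3addii` as `U` and `nm math.o` should list it as `T`. Leaving `math.o` off the link line makes the linker fail with an "undefined reference" error, because nothing supplies the definition.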
```
app.o ← your code
+
libstdc++.so ← C++ standard library (iostream, string, etc.)
+
libc.so ← C standard library (malloc, printf, etc.)
+
crt1.o ← startup code (defines _start, sets up argc/argv, calls main())
▼
app ← fully linked executable
```
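Even though you only wrote `app.cpp`, the final binary depends on code from the C++ standard library, the C library, and the platform startup objects. The startup objects are copied into the executable. The `.so` libraries are linked dynamically by default, so the executable only records references to them and the dynamic loader maps them in at run time (use `-static` to copy library code into the binary instead).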
```bash
# See what the linker actually pulls in:
g++ app.cpp -o app -Wl,--verbose 2>&1 | less
# Or check what shared libraries the final binary depends on:
ldd app
```
---
## Quick Summary
| Stage | Input | Output | Key job |
|-------------|-------------|---------|--------------------------------------|
| Preprocess | `app.cpp` | `app.ii`| Expand macros and `#include`s |
| Compile | `app.ii` | `app.s` | Parse, type-check, optimize, emit asm|
| Assemble | `app.s` | `app.o` | Encode asm as binary machine code |
| Link | `app.o` + libs | `app` | Resolve symbols, assign addresses |
When you run `g++ app.cpp -o app`, all four stages happen invisibly in sequence. The `-o app` flag only names the final output — not any of the intermediates.