C to Assembly
How Does C Code Become Assembly
The compiler does a lot of stuff to translate C code to assembly
- Choose assembly instructions to implement C oeprations
- Implement C conditionals and loops using jumps and branches
- Choose registers and memory locations to store data
- Move data among the registers and memory to satifisy dependecies
- Coordinate function calls
- Try to make the assembly fast.
However, the direct mapping from C to assembly is not so obvious
Clang/LLVM Compilation Pipeline
To understand this translation process, let us see how the compiler reasons about it.
You can see what clang
compiler does by looking at the LLVM IR
outline
- LLVM IR Primer
- C To LLVM IR
- Straigh-line C code to LLVM IR
- C functions to LLVM IR
- C conditions to LLVM IR
- LLVM IR aatributes
- LLVM IR to Assembly
- Linux x86-64 calling convention
Components of LLVM IR
LLVM IR is similar to assembly
- LLVM IR uses a simple instruction format, i.e.,
<dst operand> = <op code><src operands>
- LLVM IR code adopts a similar structure to assembly code
- Control flow is implemented using conditional and unconditional branches
LLVM IR is simpler than assembly
- Smaller instruction set
- Infinite LLVM IR registers, similar to variables in C
- No implicit FLAGS register or condition codes
- No explicit stack pointer or frame pointer
- C-like type system
- C-like functions
LLVM IR Registers
LLVM IR stores values variables, called registers
- Syntax:
%<name>
- LLVM register are like C variables: LLVM supports an infinite number of registers, each distinguished by name
- Register names are local to each LLVM IR function
One catch: We shall see that LLVM hijacks its syntax for registers to refer to “basic blocks”.
LLVM IR Instructions
LLVM-IR code is organized into instructions
- Syntax for instructions that produce a value:
%<name> = <opcode> <operand list>
- Syntax for other insturctions:
<opcode> <operand list>
- Operands are registers, constants, or “basica blocks”
Common LLVM IR Instructions
LLVM IR Data Types
- Integers:
i<number>
- A 64-bit integer:
i64
- A 1-bit integer:
i1
- A 64-bit integer:
- Floating-point values:
double
,float
- Arrays:
[<number> x <type>]
- An array of 5 integer:
[5 x i32]
- An array of 5 integer:
- Structs:
{<type>, ...}
- Vectors:
< <number> x <type> >
- Pointers:
<type>*
- A pointer to an 8-bit integer:
i8*
- A pointer to an 8-bit integer:
- Labels(i.e., basic blocks):
label
Straight-line C code to LLVM IR
Straight-line C code (i.e., containing no conditionals or loops) becomes a sequence of LLVM IR instructions
- Arguments are evaluated before the C operation
- Intermediate results are stored in registers
Aggregate Types
A variable with an aggregate type (i.e., an array or a struct) is typically stored in memory. Accessing the aggregate type involves computing an address and then reading or writing memory.
The getelementptr
instruction computes a memory address from a pointer and a list of indices. Example: Compute the address %2 + 0 + %4
.
%5 = getelementptr inbounds [7 x i32], [7 x i32]* %2, i64 0, i64 %4
LLVM IR Functions
Functions in LLVM IR resemble functions in C.
LLVM IR function parameters map directly to their C counterparts.
Baisc Blocks
The body of a function definition is partitioned into basic blocks: sequence of instructions (i.e., straight-line code) where control only enters through the first instruction and only exists from the last.
Control-Flow Graphs
Control-flow instructions (e.g., br
instructions) induce control-flow edges between the basic blocks of a function, creating a control-flow graph(CFG).
C Conditionals
A conditional in C is translated into a conditional branch instruction, br
, in LLVM IR
The conditional branch in LLVM IR takes as arugments a 1-bit
integer and two basic-blcok labels
.
A conditional branch terminates its basic block and creates 2 outgoing control-flow edeges in CFG.
Unconditional Branches
If a br
instructions has just one operand, it is an unconditional branch
An unconditional branch terminates its basic block and produces 1
outgoing control-flow edge.
In general, a C conditional typically creates a diamond pattern in the CFG
Loops
The loop control for a C loop consists of a loop induction variable, an initialization, a condition, and an increment. The induction varaible changes registers at the code for the loop increment.
LLVM IR maintains the static single assignment(SSA) invariant: a register is defined by at most one instruction in a function. But what happens to the induction variable which changes as iteration goes or as the loop unfolds ? The answer is phi
instruction.
The phi
instruction speficies, for each predecessor p
of a basic block B
, the value of the destination register if control enters B
via P
.
In this particular code, the phi
instruction says, if the the code comes from block6
which is the entry point the loop, then the reigster %9
is going to be 0
. Otherwise, it is going to adpot the value %14
which holds the incremental operation (i+=1
).
- A block with multiple incoming edges may have
phi
instructions - The
phi
instruction is not a real instruction. It won’t appear in the assembly, it’s just a trick for LLVM to represent loops.
LLVM IR Attributes
LLVM IR constructs (e.g., instructions, operands, functions, and function parameter) might be decorated with attributes. Some attributes are derived from the source code.
Other attributes are determined by compiler analysis.
%15 = load double, double* %24, align 8
Summary of LLVM IR
LLVM IR is similar to assembly, but simpler.
- All computed values are stored in registers.
- SSA: Each reigster name is written on at most one line of IR
- A function is modeled as a control-flow graph, whose nodes are basic blocks(straight line code), and whose edges denote control flow between basic blocks.
- Compared to C, all operators are explict.
- All integer sizes are apparent.
- There are no implicit operations, e.g., type casts.
LLVM IR to Assembly
THe compiler must perform three tasks to translate LLVM IR into x86-64 assembly.
- Select assembly instructions to implement LLVM IR instructions.
- Allocate x86-64 general-purpose registers to hold values.
- Coordinate function calls.
When a program executes, virutal memory is organized into segments.
Assembly code contains directives that refer to and operate on sections of assembly.
- Segment directives organize the contents of an assembly file into segments.
.text
: Identifies the text segment.bss
: Identifies the bss segment-
.data
: Identifies the data segment - Storage directives store content into the current segment
x: .space 20
: Allcoate 20 bytes at locationx
y: .long 172
Stores the constant172L
at the locationy
z: .asciz "6.172" : Store the string "6.172\0" at location
z`.align 8
: Align the next content to 8-byte boundary.
- Scope and linkage directive control linking.
- Example
.globl fib
: Makes “fib” visiable to other object files.
- Example
Stack Segment
The stack segment stores data in memory to manage function calls and returns. More specifically, what data is stored on the stack?
- The return address of a function call
- Register state, so different functions can use the same registers
- Function arguments and local variables that don’t fit in the registers.
Coodinating Function Calls
Hwo do functions in different object files coordinate their use of the stack and of register state?
Answer: Functions abide by a calling convention
The Linux x86-64 calling convention
The Linux x86-64 calling convention organizes the stack into frames, where each function instantiation gets a single frame of its own.
- The
%rbp
register points to the top of the current stack frame - The
%rsp
register points to the bottom of the current stack frame
The call
and ret
instructions use the stack and the instruction pointer, %rip
, to manage the return address of each function call.
- A
call
instruction pushes%rip
onto the stack and jumps to the operand, which is the address of a function. - A
ret
instrutcion pops%rip
from the stack and returns to the caller.
PROBLEM: Say we have two functions, function A wants to call function B. Those functions might want to use the same registers. Who’s responsible for preserving the register state across a function call and return? e.g., A doesn’t get corrupted data from reigsters.
- The caller might waste work saving register stat that the callee doesn’t use.
- The callee might waste work saving register state that the caller wasn’t using.
The Linux x86-64 calling convention does a bit both.
- Callee-saved registers:
%rbx, %rbp, %r12-%r15
- All other registers are caller-saved