C to Assembly

Posted on 2021-07-13

How Does C Code Become Assembly

The compiler does a lot of stuff to translate C code to assembly

Choose assembly instructions to implement C oeprations
Implement C conditionals and loops using jumps and branches
Choose registers and memory locations to store data
Move data among the registers and memory to satifisy dependecies
Coordinate function calls
Try to make the assembly fast.

However, the direct mapping from C to assembly is not so obvious

Clang/LLVM Compilation Pipeline

To understand this translation process, let us see how the compiler reasons about it.

You can see what clang compiler does by looking at the LLVM IR

outline

LLVM IR Primer
C To LLVM IR
- Straigh-line C code to LLVM IR
- C functions to LLVM IR
- C conditions to LLVM IR
- LLVM IR aatributes
LLVM IR to Assembly
- Linux x86-64 calling convention

Components of LLVM IR

LLVM IR is similar to assembly

LLVM IR uses a simple instruction format, i.e., <dst operand> = <op code><src operands>
LLVM IR code adopts a similar structure to assembly code
Control flow is implemented using conditional and unconditional branches

LLVM IR is simpler than assembly

Smaller instruction set
Infinite LLVM IR registers, similar to variables in C
No implicit FLAGS register or condition codes
No explicit stack pointer or frame pointer
C-like type system
C-like functions

LLVM IR Registers

LLVM IR stores values variables, called registers

Syntax: %<name>
LLVM register are like C variables: LLVM supports an infinite number of registers, each distinguished by name
Register names are local to each LLVM IR function

One catch: We shall see that LLVM hijacks its syntax for registers to refer to “basic blocks”.

LLVM IR Instructions

LLVM-IR code is organized into instructions

Syntax for instructions that produce a value: %<name> = <opcode> <operand list>
Syntax for other insturctions: <opcode> <operand list>
Operands are registers, constants, or “basica blocks”

Common LLVM IR Instructions

LLVM IR Data Types

Integers: i<number>
- A 64-bit integer: i64
- A 1-bit integer: i1
Floating-point values: double, float
Arrays: [<number> x <type>]
- An array of 5 integer: [5 x i32]
Structs: {<type>, ...}
Vectors: < <number> x <type> >
Pointers: <type>*
- A pointer to an 8-bit integer: i8*
Labels(i.e., basic blocks): label

Straight-line C code to LLVM IR

Straight-line C code (i.e., containing no conditionals or loops) becomes a sequence of LLVM IR instructions

Arguments are evaluated before the C operation
Intermediate results are stored in registers

Aggregate Types

A variable with an aggregate type (i.e., an array or a struct) is typically stored in memory. Accessing the aggregate type involves computing an address and then reading or writing memory.

The getelementptr instruction computes a memory address from a pointer and a list of indices. Example: Compute the address %2 + 0 + %4.

%5 = getelementptr inbounds [7 x i32], [7 x i32]* %2, i64 0, i64 %4

LLVM IR Functions

Functions in LLVM IR resemble functions in C.

LLVM IR function parameters map directly to their C counterparts.

Baisc Blocks

The body of a function definition is partitioned into basic blocks: sequence of instructions (i.e., straight-line code) where control only enters through the first instruction and only exists from the last.

Control-Flow Graphs

Control-flow instructions (e.g., br instructions) induce control-flow edges between the basic blocks of a function, creating a control-flow graph(CFG).

C Conditionals

A conditional in C is translated into a conditional branch instruction, br, in LLVM IR

The conditional branch in LLVM IR takes as arugments a 1-bit integer and two basic-blcok labels.

A conditional branch terminates its basic block and creates 2 outgoing control-flow edeges in CFG.

Unconditional Branches

If a br instructions has just one operand, it is an unconditional branch

An unconditional branch terminates its basic block and produces 1 outgoing control-flow edge.

In general, a C conditional typically creates a diamond pattern in the CFG

Loops

The loop control for a C loop consists of a loop induction variable, an initialization, a condition, and an increment. The induction varaible changes registers at the code for the loop increment.

LLVM IR maintains the static single assignment(SSA) invariant: a register is defined by at most one instruction in a function. But what happens to the induction variable which changes as iteration goes or as the loop unfolds ? The answer is phi instruction.

The phi instruction speficies, for each predecessor p of a basic block B, the value of the destination register if control enters B via P.

In this particular code, the phi instruction says, if the the code comes from block6 which is the entry point the loop, then the reigster %9 is going to be 0. Otherwise, it is going to adpot the value %14 which holds the incremental operation (i+=1).

A block with multiple incoming edges may have phi instructions
The phi instruction is not a real instruction. It won’t appear in the assembly, it’s just a trick for LLVM to represent loops.

LLVM IR Attributes

LLVM IR constructs (e.g., instructions, operands, functions, and function parameter) might be decorated with attributes. Some attributes are derived from the source code.

Other attributes are determined by compiler analysis.

%15 = load double, double* %24, align 8

Summary of LLVM IR

LLVM IR is similar to assembly, but simpler.

All computed values are stored in registers.
SSA: Each reigster name is written on at most one line of IR
A function is modeled as a control-flow graph, whose nodes are basic blocks(straight line code), and whose edges denote control flow between basic blocks.
Compared to C, all operators are explict.
- All integer sizes are apparent.
- There are no implicit operations, e.g., type casts.

LLVM IR to Assembly

THe compiler must perform three tasks to translate LLVM IR into x86-64 assembly.

Select assembly instructions to implement LLVM IR instructions.
Allocate x86-64 general-purpose registers to hold values.
Coordinate function calls.

When a program executes, virutal memory is organized into segments.

Assembly code contains directives that refer to and operate on sections of assembly.

Segment directives organize the contents of an assembly file into segments.
.text: Identifies the text segment
.bss: Identifies the bss segment
.data: Identifies the data segment
Storage directives store content into the current segment
- x: .space 20 : Allcoate 20 bytes at location x
- y: .long 172 Stores the constant 172L at the location y
- z: .asciz "6.172" : Store the string "6.172\0" at location z`
- .align 8: Align the next content to 8-byte boundary.
Scope and linkage directive control linking.
- Example .globl fib: Makes “fib” visiable to other object files.

Stack Segment

The stack segment stores data in memory to manage function calls and returns. More specifically, what data is stored on the stack?

The return address of a function call
Register state, so different functions can use the same registers
Function arguments and local variables that don’t fit in the registers.

Coodinating Function Calls

Hwo do functions in different object files coordinate their use of the stack and of register state?

Answer: Functions abide by a calling convention

The Linux x86-64 calling convention

The Linux x86-64 calling convention organizes the stack into frames, where each function instantiation gets a single frame of its own.

The %rbp register points to the top of the current stack frame
The %rsp register points to the bottom of the current stack frame

The call and ret instructions use the stack and the instruction pointer, %rip, to manage the return address of each function call.

A call instruction pushes %rip onto the stack and jumps to the operand, which is the address of a function.
A ret instrutcion pops %rip from the stack and returns to the caller.

PROBLEM: Say we have two functions, function A wants to call function B. Those functions might want to use the same registers. Who’s responsible for preserving the register state across a function call and return? e.g., A doesn’t get corrupted data from reigsters.

The caller might waste work saving register stat that the callee doesn’t use.
The callee might waste work saving register state that the caller wasn’t using.

The Linux x86-64 calling convention does a bit both.

Callee-saved registers: %rbx, %rbp, %r12-%r15
All other registers are caller-saved