Decimal and binary both use a positional numbering system
Base ten: first position is \(10^0\), then \(10^1\), \(10^2\), etc.
Base two (binary): first position is \(2^0\), then \(2^1 = 2\), then \(2^2 = 4\), etc.
Unsigned Binaries
Regular binary doesn't incorporate a sign anywhere, and thus cannot represent negative numbers
Largest number in \(k\) bits is \(2^k - 1\), written as \(k\) ones: \(111\dots1\)
Decimal to Binary Conversion
Method of repeated division: Take a decimal number. If the number is odd, add a \(1\) to the list and subtract \(1\) from the number. Otherwise, write a \(0\). Then, on the line below, divide the number in half and keep doing the same thing. The binary representation of the number is read bottom to top
Method of repeated subtraction: find the largest power of two that fits into the number, subtract it, and add a \(1\) at the corresponding place in the binary number. Keep repeating this process until arriving at \(0\) (worked example below)
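Worked example (repeated division) for \(13\): odd → write \(1\), go to \(6\); even → write \(0\), go to \(3\); odd → write \(1\), go to \(1\); odd → write \(1\), go to \(0\). Read bottom to top: \(1101_2 = 13\)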
Two's-Complement Notation
Convention: \(-1 = 111\dots1\)
Leading \(1\) indicates a negative number
Leading \(0\) indicates a positive number
Conversion to negative number: flip the bits and add 1 (example after this list)
\(x + \bar{x} = -1 \implies -x = \bar{x} + 1\)
Largest positive number with \(k\) bits: \(011\dots1 = 2^{k-1}-1\)
Smallest negative number with \(k\) bits: \(1\dots11 = -2^{k-1}\)
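E.g. negating \(5 = 00000101\) in \(8\) bits: flip the bits to get \(11111010\), then add \(1\) to get \(11111011 = -5\)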
Sign Extension
If more bits are allocated for a number, we must make sure that the number has the same value by sign-extending it
Positive numbers are sign-extended with \(0\)s, negative numbers are sign-extended with \(1\)s
Adding leading \(0\)s for positive numbers and \(1\)s for negative numbers doesn't change their values
Sign-extension in RISC-V
RISC-V words have 32 bits
If a 12 or 20 bit value needs to be stored, it should be sign-extended
Example: addi t0, t0, 15
addi meaning "add immediate"
The immediate takes up 12 bits in the instruction, but is sign-extended to 32-bits for the computation
V02 - Storing Data in Memory
Hexadecimal Notation
Binary numbers can have digits grouped into 4s and be expressed as hexadecimal numbers
Prefixed with 0x to let us know they are hexadecimal
With two's-complement, negative numbers usually have leading Fs
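E.g. \(1101\,1110\,1010\,1101_2\) groups as \(1101|1110|1010|1101 =\) 0xDEAD; in \(32\)-bit two's-complement, \(-16\) is 0xFFFFFFF0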
Memory
Each byte stored in memory has an index called an address
E.g. 0x00001004 is an address
On a 32-bit machine, addresses have 32 bits
Each address points to a value
Registers are special named storage locations inside the processor; they are referred to by name rather than by memory address
Memory Organization
Each memory address points to a single byte
Since each address names one byte, consecutive word addresses differ by \(4\)
Each individual value in memory is a byte, but these are usually clumped into groups of 4 to form a word
Words can only be accessed at word-aligned addresses; byte offsets must be used to access individual bytes within a word
Endianness
RISC-V is little-endian: the least significant byte of a word is stored at the smallest memory address
In written notation, the 1's bit is still the last (rightmost) bit and the leftmost bit is the most significant (the one that represents the highest power of two), regardless of endianness
Big-endian architectures do the reverse: the most significant byte is stored at the smallest address
Advantage of big-endian: memory dumps read like place-value notation numbers; advantage of little-endian: a value's address is the same no matter how many of its bytes are read
There is no "right choice"; they are equally right
V03 - Organization of a computer
Arithmetic Operations
# Arithmetic Add
add a, b, c # a <- b + c
# Add immediate
addi a, b, 20 # a <- b + 20
Compiling C Code
// C code to compile
f = (g + h) - (i + j)
// RISC-V Implementation
add t0, g, h
add t1, i, j
sub f, t0, t1
// Assumptions
f goes in s0, g in s1, h in s2, i in s3, j in s4
Note that each "temporary" value in the C assignment is calculated first, and those are used to calculate the final value
This is what registers t0 to t6 are for
Register Types
RISC-V has a 32x32 bit register file; it has 32 registers: x0 to x31
These have the following aliases
t0 - t6 for temporary values
s0 - s11 for saved variables
Addressing an Array
If the array \(A\) contains values of size \(n\) bytes, then we access \(A[i]\) in memory at \(\text{firstElementAddress} + (n*i)\)
In practice, we often need to calculate the offset for this address using arithmetic instructions, save it in a register, and then use that register to access the memory we want
// C code
A[0] = h + A[2]
// RISC-V Implementation
lw t0, 8(s1)   // loads A[2] by using an 8-byte (2-word) offset
add t0, s2, t0 // adds the two terms together
sw t0, 0(s1)   // saves the result where we need it (0-word offset)
// Assumptions: h in s2, A in s1
lw t0, 8(s1) # takes the value in the memory address s1 with offset
# 8 and saves it in t0
sw t0, 0(s1) # saves the value in t0 at the memory address s1 with offset 0
The offset describes how many bytes past the base address the access happens (notation: n(register))
This number can be negative
Registers vs. Memory
Registers are faster to access than memory
Using memory requires loads and stores; registers do not
We aim to use only the registers if we can
Memory should be used for composite data, e.g. arrays, strings, structures, etc.
Immediate Operands
An immediate is a number specified directly in an assembly instruction
E.g. addi t0, t0, 1 has the immediate \(1\)
These exist to make the common case fast (a design principle of RISC-V): small constants are often required. No need to load these out of memory all the time
V04 - Data Transfer between Memory and Registers
Example: lw (load word)
Example: sw (store word)
V05 - Binary Representation of Instructions
Instructions get encoded into binary numbers
This is called machine code
RISC-V, being 32-bit, compiles into 32-bit instruction words
General Idea
Different segments (bit fields) of the word store different things, including
The addresses of the registers involved
The operation being done (opcode and funct codes)
Immediates
Different bit fields may encode things differently
E.g. registers can't be negative, so their encoding is not two's-complement
There are a bunch of different formats that are optimized for different kinds of things
Bits might not be directly in order; there are hardware reasons for this (keeping fields like register numbers in fixed positions simplifies decoding)
Formats
V05 - Branches and Jumps
In code, we need to make decisions: if this, then that
// C code
if (i == j) {
    f = g + h;
} else {
    f = g - h;
}
// RISC-V Implementation
bne s3, s4, Subtract // branch to Subtract if s3 does not equal s4
add s0, s1, s2       // add statement
jal zero, Exit       // jump to Exit (jal zero always jumps)
Subtract:
sub s0, s1, s2       // subtract statement
Exit:
...
New instructions and registers
bne: Branch if not equal
There's also bge (branch \(\geq\)), blt (branch \(<\)), and beq (branch \(=\))
The other two inequalities aren't necessary, since we can accomplish those with what we already have (design principle at work)
jal: Jump and Link
If the first register is x0, it is an unconditional jump (j can be used in RARS as a shortcut for this)
It is really about jumping to functions; jal automatically writes the return address (next instruction after the jal) into the corresponding register
x0: register that is hardwired to the constant 0 and won't be overwritten
Useful for setting one register to another (mv t0, t1 in RARS behaves like add t0, t1, zero), setting a variable to \(0\), and jumping
Branch instructions use the SB type
The label to branch to (assembled into a PC-relative offset) is stored as an immediate
This format leaves more space for the immediate so the branch can reach as far as possible
Jumps use the UJ (unconditional jump) type, which leaves even more space for the immediate
V06 and V07 - Logical Operations and Large Constants
Shifts
slli: Shift Left Logical Immediate: shifts bits left and appends 0s to the little end
E.g. slli t2, s0, 4 with \(s_0 = 00000010\) would set \(t_2\) to \(00100000\)
Shifting left by \(n\) bits is equivalent to multiplying the number by \(2^n\), and is much faster
Shifting left by \(16\) moves the lower half of the word into the upper half
Shifting left by \(32\) would conceptually make the register equal to \(0\) (though RV32 shift instructions only use \(5\) bits of shift amount, so \(32\) itself can't be encoded)
sll: non-immediate version of slli
srli: Shift Right Logical Immediate: shifts bits right and appends 0s to the big end
E.g. srli t2, s0, 2 with \(s_0 = 01000000\) would set \(t_2\) to \(00010000\)
Shifting right by \(n\) bits is equivalent to floor-dividing by \(2^n\), i.e. \(f(r_0) = \displaystyle\lfloor{\frac{r_0}{2^n}}\rfloor\)
Floor division because the least significant bits get lost
srli does not sign-extend; it always shifts in \(0\)s
srl: non-immediate version of srli
srai/sra: Shift Right Arithmetic (Immediate): these sign-extend the right-shift
Bitwise Operations
and t0, t1, t2
or t0, t1, t2
xor t0, t1, t2
Each regular logical operation, performed bitwise (comparing each bit together with its correspondent in the other number)
Each of these has an immediate version (add an "i" to the end of the instruction)
Bitmasks
The andi instruction can be used to implement masking since it only gives back the bits that are \(1\) in the immediate
E.g. using \(00001111\) gives the lowest four bits of whatever the other value is
The bits where \(1\) are in the immediate are unchanged; the rest are \(0\)
Getting a Specific part of a Number
To get some specific bits from a number, there are two strategies
Number is small
Figure out the immediate you need that is \(1\) for the bits you want and \(0\) everywhere else
Use andi to mask for those bits
Shift the result to the right using srli
Number is large:
Figure out the immediate you need that has the right pattern of bits
Shift it to the left
Mask and shift the result to the right like before
Both of these can also be accomplished by only shifting
Shift in both directions up to each end of the bit field you want; the rest of the number gets filled with 0s
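A minimal sketch of the mask-and-shift approach and of the shift-only approach (assuming we want bits 7-4 of t0, with the result in t1; field position and registers are illustrative):

```
# strategy 1: mask with andi, then shift right
andi t1, t0, 0xF0   # keep only bits 7:4 (the immediate is 1 exactly there)
srli t1, t1, 4      # shift the field down so it starts at bit 0
# strategy 2: shifting in both directions only
slli t1, t0, 24     # 24 = 31 - 7: the field's top bit becomes bit 31
srli t1, t1, 28     # 28 = 24 + 4: everything else has been shifted out as 0s
```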
xori ... -1
Using xori with the \(-1\) immediate performs logical negation (flips all the bits)
This is because the two's-complement representation of \(-1\) is all \(1\)s (\(111\dots111\)), and xor with a \(1\) flips a bit
Loading Immediates
li: loads an immediate into a register, starting at the little (least significant) end
This is a RARS pseudo-instruction, not a real one
RARS implements this as follows
# target
li s0, 0x00FFFF00
# implementation
lui s0, 0x1000 # s0 <- 0x01000000
addi s0, s0, -256 # -256 is 0xF00 as a 12-bit immediate; addi sign-extends it to 0xFFFFFF00
# adding these forces everything left of the 1 to roll over to 0 (amazing!)
lui: loads an immediate into a register, starting at the most significant end
Essentially equivalent to using li to load it, then shifting it to the left by the remaining amount so that the loaded value reaches the most significant end
This is a real instruction
Bits that aren't used by lui get zeroed; no original bits remain after an lui instruction
Example for Masking, Large Constants
Topic V0A is an example where actual RISC-V instruction binary representations are compared to learn more about what they do. This involves
Bitwise logic
Moving groups of bits around in the word
RISC-V instruction type knowledge
Manipulating large constants
V08 - Loops and Conditionals
While Loop
// C Code
while (save[i] == k) {
    i = i + j;
}
Idea: keep evaluating the loop condition, branch to the top of the loop if it's true and branch past the loop once it's not
In particular, the most efficient order is
On startup, immediately perform actions required to evaluate condition and evaluate
Branch past the loop to Exit if the condition is false, continue if true
Perform loop body, including iteration
Perform actions required to evaluate condition and evaluate. Branch back to the loop body if the condition still holds, fall through to Exit if not
bne ... , Exit # check condition immediately
Loop: ... # start the loop body here (on this line)
add ... # (cont. ) loop body and iteration
beq ..., Loop # branch back to the loop top if the condition still holds; otherwise fall through
Exit:
to add: look in slides and figure out what the less efficient example is (the "first thought" example)
More Conditional Operations
slt: set less than (sets 1 if less than and 0 otherwise)
slti: set less than, with immediate
Used in conjunction with branches against zero, e.g. beq/bne t0, zero, Label, for inequality-based loops (sketch below)
beqz, bnez are pseudo-instructions in RARS
sltu, sltiu are the unsigned versions of slt and slti
Only set-less-than is needed (no set-greater-than), because we can reverse the operand order and branch against zero to cover the other conditions
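E.g. a small sketch of the pattern (registers and constant illustrative):

```
slti t0, s0, 10       # t0 = 1 if s0 < 10, else 0
bne  t0, zero, Label  # branch taken exactly when s0 < 10
```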
Basic Blocks
A basic block is a sequence of instructions with no branch targets (except the beginning) and no branches (except at the end)
Thus, if one instruction in the block is evaluated they all will be or have been
Compilers identify basic blocks for optimization
V09 - Storing Arrays in Memory
An array is a contiguous chunk of memory
Arrays of integers have one word per element, since integers are 32-bit
Strings (arrays of characters) have one byte per element, since ASCII characters are 8-bit
Strings also get null-terminated; their last character is \0 = 0x00
![[V03 - Organization of a computer#Addressing an Array]]
Another C Example
// C Code
y = A[B[j]]
// RISC-V Implementation
slli t1, t0, 2 // 4 * j
add t2, s7, t1 // B + 4 * j (address of B[j])
lw t3, 0(t2)   // store B[j] in t3
slli t4, t3, 2 // 4 * B[j]
add t5, s6, t4 // A + 4 * B[j] (address of A[B[j]])
lw s3, 0(t5)   // load the thing at this address into y
V0B - Procedure Calls, Stack, and Memory Layout
A callable unit has many names: procedure, function (no side effects), routine, subroutine, subprogram, method (object-oriented), etc.
We will use the term procedure
Procedures
A procedure is a "function" with side effects
The caller is the code that references the function
E.g. foo()
The callee is the code that defines the function, and runs when the function is called
E.g. void foo() { ... }
Functions can call other functions; this is what makes them so useful as tools of abstraction
Returning
Functions must return program flow to the point where they were originally called
The function returns to the return address
The caller must remember the return address while the function is running
The callee must return to the correct address
RISC-V has a single register ra to store a return address
It may also have return values: values the function computes that the caller likely wants
The Stack
Since RISC-V only has one ra register, we can't just leave return addresses there: functions called by other functions would have their return addresses overwritten
The stack exists in memory, and stores the following, which form a stack frame
The value of fp (the frame pointer)
The value of ra (the return address)
Any register modified by the function
Local variables that get stored in memory
The stack pointer in register sp always points to the top of the stack
We use this to figure out where to place the next stack frame
It gets decremented when we write something to the stack
The frame pointer in register fp (s0) points to the start of the current frame
We use this to figure out where to access values in the current stack frame, since, unlike sp, it doesn't move when new values are pushed
The function is responsible for keeping track of this
What Does the Stack Do when We Call a Procedure?
The return address (the program counter pc plus 4, i.e. the instruction after the call) is written into ra
ra is saved onto the stack, and sp is decremented/updated so it points to the top of the stack
The function is evaluated. Say this includes calling another function
fp is moved, since this new function creates a new frame. However, if this inner function doesn't call anything in turn, its frame may stay empty
ra is overwritten with new value and the pc is moved to the new function
The function returns; pc is moved to the value of ra and execution continues as before.
ra is restored from the stack. If we changed other things, they would be as well
The (outer) function returns, meaning
pc is moved to ra
sp is incremented/updated; the function's frame is popped from the stack
(if this is also in a function) ra is restored to its previous value from the stack
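A minimal sketch of the save/restore pattern described above (frame size and saved registers are illustrative):

```
func:
    addi sp, sp, -8    # grow the stack: room for a 2-word frame
    sw   ra, 4(sp)     # save the return address
    sw   s0, 0(sp)     # save a callee-saved register we plan to use
    # ... body; may contain jal calls that overwrite ra ...
    lw   s0, 0(sp)     # restore the saved register
    lw   ra, 4(sp)     # restore the return address
    addi sp, sp, 8     # pop the frame
    jalr zero, ra, 0   # return to the caller
```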
Memory Layout
The stack grows downwards from the top of memory (i.e. it starts at the highest address and grows toward lower ones) and serves as the automatic storage
The stack (obviously) is a type of stack data structure
The dynamic data / heap grows upwards, toward the stack. This is memory allocated dynamically, like using malloc in C or new in Java
Static data are global variables, constant arrays, strings, etc. These exist for the whole run of the program
Text is the actual program code, stored as binary instructions
V0C - Register Calling Conventions
Different registers are used for different things
There needs to be agreement between the caller and callee about which registers should be used for what purposes, since some are globally accessible and may be overwritten
What values are expected to stay the same before and after a function call?
Running a process is almost like having external code running
Registers with Conventions
s0-s11 are the saved registers: any values from s0 to s11 must remain the same before and after function execution
The callee can overwrite s0-s11 for their own purposes, but must restore them before the function returns
s0 is the frame pointer
t0-t6 are the temporary registers: these are allowed to change after function execution
They should be used as needed for computing things, not relied upon to save values
The callee is free to modify these
a0-a7, ra are the "procedure registers": they are used for things related to procedure calls
a0 and a1 are used to specify parameters for a procedure, and may contain return values from the callee once the procedure is done
a2-a7 are more registers to specify parameters
ra is the return address, and is used to return to the point after the call
sp must be preserved across calls: the callee must restore it before returning
Same as the s registers
Other Registers
zero: always contains the value \(0\)
sp: stack pointer
gp: global pointer
tp: thread pointer
The frame pointer fp is actually s0
Table of Calling Conventions
| Register | Name | Use | Saver |
| --- | --- | --- | --- |
| x0 | zero | The constant value 0 | N.A. |
| x1 | ra | Return address | Caller |
| x2 | sp | Stack pointer | Callee |
| x3 | gp | Global pointer | -- |
| x4 | tp | Thread pointer | -- |
| x5-x7 | t0-t2 | Temporaries | Caller |
| x8 | s0/fp | Saved register/frame pointer | Callee |
| x9 | s1 | Saved register | Callee |
| x10-x11 | a0-a1 | Function arguments/return values | Caller |
| x12-x17 | a2-a7 | Function arguments | Caller |
| x18-x27 | s2-s11 | Saved registers | Callee |
| x28-x31 | t3-t6 | Temporaries | Caller |
V0D - Recursive Functions
Non-leaf procedures are procedures that call other procedures
For nested calls, these need to save ra and callee-saved registers onto the stack and restore them before returning
Recursive functions are non-leaf functions that call themselves
Recursive functions must use the stack, just like any other nested function
It's possible for compilers to optimize tail recursion by simply replacing stack frames instead of pushing new ones, since the stack unwinds immediately when the base case is reached
Like any function call, recursive function calls add a frame to the stack. If the function is tail-recursive, it is possible that the compiler can optimize the stack use by writing the new values for scoped variables to the same stack frame instead of to a new one each time.
Lecture V0D has an in-depth example of the RISC-V implementation of a recursive C function, which should be consulted as an example
V0E - Different Sizes of Data
Strings are arrays of characters, which are only one byte each. As such, they are typically manipulated on the byte level, not the word level
# load byte from the address in rs1 into rd
# this is sign-extended to 32-bits (fills the whole word)
lb rd, offset(rs1)
# load unsigned byte (zero extension instead of sign-extension)
lbu rd, offset(rs1)
# stores the rightmost byte in rs2 into the memory address in rs1
sb rs2, offset(rs1)
Analogous instructions exist for half-words, which are 2 bytes (16 bits)
These are lh, lhu, and sh
Manipulating Different Sizes of Arrays
When accessing arrays, the number of bytes to increment by is the size of the data being stored
Integer: 32 bits = 4 bytes = 1 word
Character: 8 bits = 1 byte = 1/4 word
Structs: the size of the struct is the total of the number of bytes in each member
String Manipulation Example
An example of the RISC-V implementation of the strcpy function in C is given in Lecture V0E
V0F - Relative Performance
The algorithm determines how many operations get executed
The programming language / compiler / architecture determines how many machine instructions are executed per operation
The processor and memory system determine how fast instructions are executed
The I/O system, including the OS, determines how fast I/O operations are executed
Response Time vs. Throughput
Response time: how long it takes to do a task (also known as latency)
Throughput: the total amount of work done per unit time
E.g. adding processors leaves the response time of a single task the same but increases throughput
Power increases over time, but eventually hits a ceiling
Uniprocessor performance is constrained by power, instruction-level parallelism, memory latency
MIPS performance metric: millions of instructions per second
Doesn't account for differences in ISAs between computers and instruction complexity
Execution time is a better performance measure, since that's what ultimately matters
Instruction-level parallelism: hardware executes multiple instructions at once
This is abstracted away from the programmer
Multicore processors: processors with multiple cores. These require the programmer to express the parallelism themselves
SPEC
SPEC: standard performance evaluation corporation
Has various work groups for systems, high performance, graphics, research
Joint effort between large tech companies in industry and research universities
Develops benchmarks, i.e. programs used to measure performance
Amdahl's Law
Improving a single aspect of a computer doesn't lead to a proportional increase in overall performance
An improvement only touches the fraction of runtime spent in that aspect, so only that part shrinks proportionally
E.g. doubling the speed of the mul instruction where it accounts for \(50\%\) of the program's runtime cuts total runtime by only \(25\%\); even an infinitely fast mul could improve the program by at most \(50\%\)
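In formula form (the standard statement of Amdahl's Law): \(T_{\text{improved}} = \frac{T_{\text{affected}}}{\text{improvement factor}} + T_{\text{unaffected}}\). For the example above: \(T = \frac{0.5}{2} + 0.5 = 0.75\) of the original runtime, i.e. a \(1.33\times\) overall speedup from a \(2\times\) local improvement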
Takeaways
Performance and cost is improving due to technological development
Hardware and software development involves hierarchical layers of abstraction
ISA is the interface between hardware and software
Power is a limiting factor; thus, parallelism can improve performance significantly
V13 - Bubble Sort RISC-V Example
Swap Function
Swaps an element in an array with the next element in the array
// C
void swap(int A[], int k) {
    int temp;
    temp = A[k];
    A[k] = A[k+1];
    A[k+1] = temp;
}
// RISC-V
swap:
    slli t1, a1, 2    // A[] is in a0; get k*4
    add  t1, a0, t1   // get the address of A[k]
    lw   t0, 0(t1)    // temp storage for A[k]
    lw   t2, 4(t1)    // load A[k+1] into t2
    sw   t2, 0(t1)    // put A[k+1] into A[k]
    sw   t0, 4(t1)    // put A[k] value (stored in temp) into A[k+1]
    jalr zero, ra, 0  // end function
The V13 slides have an excellent example of generating RISC-V from C for the swap function
Bubble Sort
void bubblesort(int A[], int n) {
    int i, j;
    for (i = 0; i < n; i++) {
        for (j = i - 1; j >= 0 && A[j] > A[j+1]; j--) {
            swap(A, j);
        }
    }
}
We move through the list \(n\) times and swap each successive pair of elements that's not in order (\(O(n^2)\))
Implementing Bubblesort in RISC-V
We must remember to follow the for loop template: initialize, then test → loop body → increment, then back to test; branch past the loop once the test fails
We must remember to save certain values (like \(A[\,]\) and \(n\))
A good exercise to do on one's own to practice transcribing algorithms
Effect of the Compiler
We notice that the compiler greatly speeds up bubble sort, i.e. the relative performance of unoptimized bubblesort vs. optimized bubblesort is high
However, algorithms like quicksort outperform bubblesort considerably, even without any compiler optimization
Takeaway: nothing can fix a dumb algorithm
V14 - Calling Conventions for Functions inside Loops
We have to keep in mind which variables might be altered (temporaries, arguments/returns) and which must stay the same (saved, ra)
We also have to remember to restore saved variables ourselves using the stack
Why is Calling a Function inside of a Loop Different?
It's not!
It's just that the function gets called over and over because it's inside the loop, so the consequences of our poor choices will very likely have an effect
Common antipatterns
Using a temporary register as the loop iterator. The function called may change this value; it must live in a saved register
Treating the argument registers as belonging to the function we are writing the loop in, forgetting that they are used (and may be changed) by the function we call
V15 - Indexing vs. Pointers
Array Indexing: we keep track of the index of the element in the array we are tracking
We might have to calculate the actual pointer to the element each time
Requires a shift (by \(2\)) inside the loop since we have to multiply the iterator by \(4\) each time, assuming the array is of integers
Pointer Indexing: we keep track of a pointer that moves along each element of the array
We might have a more complicated check to see if the loop is done, since that is usually specified in terms of index
Example: clear writes \(0\)s into every element in an integer array
// Array indexing
void clear(int A[], int n) {
    int i;
    for (i = 0; i < n; i++) {
        A[i] = 0;  // notice we access a value by its array index
    }
}
// Pointer indexing
void clear(int* A, int n) {
    int* p;
    for (p = &A[0]; p < &A[n]; p += 1) {
        *p = 0;  // notice we dereference the pointer to access the value to change
    }
}
The RISC-V implementations are straightforward; they are found in the lecture slides.
We find that the pointer-indexing version is generally faster (fewer instructions per iteration): array indexing needs separate instructions to increment the iterator and recompute the address, while pointer indexing does both with a single pointer increment
However, the compiler can optimize this difference away when we write in C
Overall, it's better to write whatever makes the code clearer and safer, i.e. with array indexing
V16 - Meaning of Dereferencing
What does *p mean?
In this example, p and x are function arguments, so they correspond to the a0 and a1 registers
void foo(int* p, int x) {
    *p = x;  // STORE the VALUE of x in the memory location whose ADDRESS is in p
             // Analogous RISC-V instruction: sw a1, 0(a0) (a store word)
    x = *p;  // LOAD into x the VALUE from the memory location whose ADDRESS is in p
             // Analogous RISC-V instruction: lw a1, 0(a0) (a load word)
}
If p and x are in a stack frame, the RISC-V instructions are a bit different
We have to load both p and x into registers (since they are in stack memory), then perform the save word
Remember: to access these variables from the stack, we need to use offsets from the stack or frame pointer
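A sketch of that case (the offsets are illustrative, assuming p sits at 0(fp) and x at 4(fp)):

```
lw t0, 0(fp)   # load the pointer p itself from the stack
lw t1, 4(fp)   # load the value of x from the stack
sw t1, 0(t0)   # *p = x: store x at the address held in p
```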
V17 - Examples of String Functions
Strings are stored as arrays of \(8\)-bit characters in memory
These are encoded into integers using the ASCII encodings
Examples of String Functions
All of these are straightforward implementations in RISC-V, and should be done as exercises
strlen: returning the number of characters in a string
Uses a simple loop that branches out once the current character being examined is the null terminator (see the sketch after this list)
maxlen: given an array of strings, return the number of characters in the longest one
Requires a nested for loop
Since this is a nested array, we have double pointers: pointers to pointers to characters. Note that we will have to perform load instructions twice to follow them
Uses a function call to strlen
summax: given a list of classes (each a list of student names), return the sum of the lengths of the longest name in each class
Once again, requires calling maxlen in a loop
Takeaway: each lw follows one level of pointers, taking us one level closer to the base value
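A minimal sketch of strlen along these lines (assuming the string's address arrives in a0 and the length is returned in a0):

```
strlen:
    mv   t0, a0          # t0 walks along the string
loop:
    lbu  t1, 0(t0)       # load the current character (zero-extended)
    beq  t1, zero, done  # stop at the null terminator
    addi t0, t0, 1       # advance one byte (characters are one byte each)
    jal  zero, loop      # back to the test
done:
    sub  a0, t0, a0      # length = end address - start address
    jalr zero, ra, 0     # return to the caller
```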
Types
char: character
char*: address of or pointer to a char
char**: address of or pointer to a pointer to a char
V18 - Parameter Passing by Value and Reference
What's the Difference?
When passing by value, a parameter must be copied into the stack (or a register) when a function is called
This is fine for small things like integers, but does NOT need to be done for large structures like arrays, structs, etc
Instead, when passing by reference, a reference to the object is passed into the function, which means it doesn't need to be copied
This saves space, but means that any changes made to the referenced object will persist after the function is done (mutability and global state)
This can be a good thing (take that functional programmers!) but can also lead to unintended bugs that are difficult to track down
V19 - Synchronization
Naive Approach to Multiple Processors Asking Same Data
Write into a shared place when accessing a resource; when a resource is about to be used, write a \(1\) into it, then proceed
Won't work: nothing is being checked
Check before writing too. Only write if the shared place is a \(0\)
This seems like it would work, but it's possible that P1 reads, then P2 reads, then P1 writes, then P2 writes. Since there is a small window between reading and writing, if P1 and P2 check at almost the same time, they might both read \(0\), then both write a \(1\) without realizing they are overwriting each other. This leads to a data race
Data race: the result of a read from memory depends on the order of access
This is what synchronization tries to avoid
Synchronization
Requires hardware support and an atomic read/write operation
I.e. there is a specific instruction in the instruction set that corresponds to synchronous access of data
This instruction could be an atomic swap of register and memory
Atomic read/write memory operation: an operation that performs a read and write at exactly the same time; in a single operation
No access to the location is allowed between the read and the write operation
This avoids the data race problem from the naive implementation
Synchronization in RISC-V
Load reserved: used to load a shared resource, like whatever signals if the memory is free to access
lr.w for words (64-bit RISC-V also has lr.d for doublewords)
Store conditional: succeeds if (what is stored in) the location has not changed since the load reserved; fails if the location has changed
Returns 0 in rd if the store succeeds
Returns non-0 in rd if the store fails
Atomic Swap Example
try: # s1 contains the address of the resource shared between the processors
lr.w t0, (s1)     # load reserved word from the shared resource; t0 <- value at s1
                  # every processor now has a record that the address in s1 is load-reserved
sc.w t1, s4, (s1) # store conditional: try to store s4 to the address in s1
                  # t1 receives the status of the store (0 = success)
                  # the address in s1 is no longer load-reserved afterwards
bne t1, zero, try # if the store failed, branch back and try again
add s4, zero, t0  # the atomic swap succeeded; put the loaded value into s4
Detailed slides for this example are found in the V19 Slides
V1A and V1B - Blackboard Architecture and Transactional Memory
Based on how people work at a blackboard: people don't communicate with each other directly; they write everything they need to solve a problem on the blackboard where everyone can see it at the same time. If a person has a better solution, they simply erase what they want to replace on the blackboard and write that. No ambiguity exists here because it is plainly visible when someone is writing on the board.
Blackboard Architecture
Example: Newton's Method
Two processors are trying to find the minimum of a function (or get as close to it as they can)
One is using Newton's method; the other is guessing randomly
If one has a better guess than what is currently on the blackboard, it replaces the guess
The Newton's-method process also has to read the blackboard to know where to evaluate the next derivative
Both the current min's x-value \(x\) and y-value \(f(x)\) are on the board
No-lock Version
Everyone writes to the shared blackboard to solve their problem
However, they just read and write randomly, so a value might get overwritten while another processor is using it in a calculation, and the process won't "realize" this has happened
E.g. a processor writes the value for \(x\), but before it writes \(f(x)\), the other one reads that \(x\) value and \(f(x)\) value, getting an inconsistent pair
In order to solve this, a lock is required
Lock Version
Lock: a variable that is \(0\) when no one is writing to the blackboard and \(1\) when someone is accessing the board (reading or writing)
The lock is read and written to using load reserved and store conditional
Reads and writes are only allowed by a new processor when the lock is \(0\)
Before writing to the blackboard, the store conditional instruction is used to update the lock variable to \(1\) to let everyone looking at the blackboard know that it is being accessed
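A sketch of acquiring and releasing such a lock with load reserved / store conditional (assuming the lock variable's address is in a0):

```
acquire:
    lr.w  t0, (a0)          # read the lock flag under reservation
    bne   t0, zero, acquire # lock is 1: someone is on the board; retry
    addi  t1, zero, 1
    sc.w  t2, t1, (a0)      # try to atomically write 1 (claim the lock)
    bne   t2, zero, acquire # store failed: another processor got in; retry
    # ... read and write the blackboard freely here ...
release:
    sw    zero, 0(a0)       # write 0: the board is free again
```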
Blackboard Architecture Terms
Critical section: the piece of code that accesses the shared resource (only one process may be in it at a time)
Race condition: the outcome depends on the timing of accesses by multiple processes
Starvation: one process fails to access a resource for an extended amount of time
Deadlock: no process can make progress because each is waiting on another
Transactional Memory
Architectural support for lock free data structures
Implemented on mainframes
Each thread in a transaction can perform a load-reserved (LL/LR) on many different memory locations, and the commit operation performs a kind of SC that takes effect on those locations simultaneously, with every store failing or succeeding together
Leads to vulnerabilities
V1C - Compiling, Linking, and Loading
Producing an Object Module
Assembler and/or compiler: Translates the program into machine instructions and provides information for building a complete program from the pieces
Header: describes the content of the object module (function definitions, etc)
Text Segment: translated instructions
Static Data Segment: data allocated for the entire life of the program (e.g. "external" files that the program uses but doesn't modify)
Relocation info: for content that depends on the absolute address of where the program is located
Symbol table: global definitions and external references
Debug info: for associating with the source code
Linking Object Modules
Produces an executable image
Merges different segments
Figures out the addresses of the labels
Resolves the local and external references (relative and absolute addresses)
The program can be loaded at a fixed absolute location in virtual memory, so location dependencies don't need to be fixed at load time
Loading a Program
Load the image file on the disk into the memory
Read the header to figure out the segment sizes
Create virtual address space
Copy text and initialized data into memory
Or set up page table entries so they can be included from there
Set up arguments on the stack
Initialize registers, including the stack pointer, frame pointer, gp
Jump to the startup routine (copies the arguments into a0, etc. and calls main, then invokes the exit system call when main returns)
Dynamic Linking
Only link and load a library procedure when it is called
Requires the code to be relocatable: the linker has to fix the code to run in a new location
Position-independent code can run anywhere in memory without needing to be changed
Avoids image bloat and automatically adapts to new library versions
Lazy Linkage
watch video!
Dynamic JIT Compilers
V1D - Computer Arithmetic
Operations on Integers
Addition and subtraction
Multiplication and division
Dealing with overflow
Addition
Bitwise Addition
Two bits get added
There is a carry bit to signify if the one bit has overflowed
Cin: carry in: input for the carry bit
Cout: carry out: output for carry bit (whether it gets used)
Four possible results of this are indicated below, as well as the logical circuit used to execute the arithmetic
Collectively, this structure is known as a half-adder
The xor operation happens to encapsulate most of the logic for addition; however, in the case that both digits are 1, we also need a carry bit to keep track of the next place
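In equations, the half-adder is \(S = A \oplus B\) and \(C_{out} = A \cdot B\)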
Combining Half-adders
Two half-adders can be combined into a full one-bit adder, which adds three bits
These correspond to \(a\), \(b\), and \(Cin\)
Outputs: the sum \(S\) (a one-bit number, i.e. \(0\) or \(1\)) and \(Cout\) (whether a carry is used)
We can abstract this whole circuit as a black box that adds three one-bit inputs (two operand bits and a carry)
It has a truth table like any other combinational circuit
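Written out, the full adder's logic is \(S = A \oplus B \oplus C_{in}\) and \(C_{out} = A \cdot B + C_{in} \cdot (A \oplus B)\)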
Ripple-carry Adder
We take a full adder and connect the carry-out of the first adder into the carry-in of the second adder, then do the same with the next, etc
If we connect \(n\) adders this way, we can add together two \(n\)-bit numbers
It is indeed that easy!
However, travelling through a circuit takes time; the ripple-carry adder is \(O(n)\), where \(n\) is the number of bits in the number
This is less efficient than it can be; we can use a carry look-ahead adder
Carry Look-ahead Adder
Carry look-ahead unit: computes the carries from the propagate and generate signals
Is itself a black box that can be implemented logically
Each full adder has a propagate bit and a generate bit
Propagate bit: tells you that if the unit receives a \(Cin\), then it will propagate that carry to the next unit
Generate bit: regardless of the \(Cin\) state, this unit will generate a \(Cout\) of \(1\)
Each adder unit has its \(S\), \(p\) (propagate), and \(g\) (generate) bits feeding into the carry lookahead unit
Receives its \(Cin\) from the carry lookahead unit as well
Each carry lookahead unit has a final \(Cout\), \(p\), and \(g\), which means they can be chained together as well
Logic for generating the \(g\) and \(p\) signals
\(g_i = A_i \cdot B_i\)
\(p_i = A_i + B_i\)
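From these, the lookahead unit computes every carry directly: \(c_{i+1} = g_i + p_i \cdot c_i\), which expands (e.g. \(c_2 = g_1 + p_1 g_0 + p_1 p_0 c_0\)) so each carry depends only on \(c_0\) and the \(g\), \(p\) signals, not on the previous adder's output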
This is more efficient because having a central source know the generate and propagate properties will allow it to "figure out" if it can skip certain parts of the calculation, since it already knows the answer
E.g. if all the propagate bits are \(1\), whatever the \(Cin\) to the main unit is will also be its \(Cout\)
Subtraction
Add the negation of the second operand
The negation in this case will be two's-complement negation, not logical negation
Subtraction can underflow, much like addition overflows
Overflow
If the sum is too large for the bits containing it, then the sum will overflow and the sign of the result will change
E.g. in \(4\) bits, \(7\) + \(1\) is \(-8\) (assuming two's-complement representation)
If inputs have different signs, there is no chance of overflow. However, in the other cases, overflow is possible
This is the opposite for subtraction
Checking for Overflow
Unsigned addition: if the result is smaller than either of the operands, an overflow occurred
Also works for signed addition when the sign of one operand is known (likely an immediate)
Signed addition: compute \(c = a + b\) into a new register, then check whether \(c < a\) and whether \(b < 0\). If exactly one of these is true, an overflow occurred (sketch below)
This needs to be done manually in RISC-V
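A sketch of the signed check (assuming \(a\) in t0 and \(b\) in t1; Overflow is a label elsewhere):

```
add  t2, t0, t1        # c = a + b
slt  t3, t2, t0        # t3 = 1 if c < a
slt  t4, t1, zero      # t4 = 1 if b < 0
bne  t3, t4, Overflow  # exactly one of the two is true -> overflow occurred
```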
Design Principles: Dealing With Overflow
Some languages and ISAs, like C and RISC-V, ignore overflow; they give you enough rope to hang yourself with
RISC-V has no special instructions that help you perform overflow checks
Instructions that could result in overflow should be used with a check
Then, if an overflow occurs, invoke the exception handler
Other languages require that an exception is raised when an overflow occurs
Save PC in user exception program counter
Jump to handler address which is stored in user trap handler base address
Multimedia applications involve different data sizes (e.g. \(8\) and \(16\)-bit numbers), so there are more efficient arithmetic instructions to help with that
V1E - Multiplication and Division
Unsigned Multiplication
For \(a\times b = c\), \(a\) is the multiplicand, \(b\) is the multiplier, and \(c\) is the product
The length of the product is the sum of the operand lengths
E.g. a \(4\)-bit number times \(4\) bit number is an \(8\)-bit number
Multiplication Hardware
Multiplicand, multiplier, and product (after computation, 0 initially) will be stored somewhere
Initially only the bottom half of the multiplicand register holds the multiplicand; since it shifts left during the computation, its register needs to be twice as wide as the operand itself
64-bit adder (ALU) to perform addition operations
Control unit that performs tests to decide what calculations to do
Multiplication Algorithm
Calculate a bunch of partial products and add them together, like the elementary school multiplication algorithm
The sum terms are the individual bits of the multiplier times the multiplicand
Since these are binary, multiplying by \(1\) or \(0\) is the same as either adding the number or not
We have to shift left (multiply by 2) each time we add a term, since we need to move to the next place-value place
Multiplicand shifts left and multiplier shifts right
Thus, we start with the least significant bit of the multiplier and end with the most significant
Full Run-through of a Multiplication
The least significant bit of the multiplier is checked. If it is
1: Multiplicand is added to product (using bitwise addition)
0: The addition step is skipped and we move back around to the shift step
Multiplicand register shifts by one bit to the left
Multiplier shifts to the right by one bit; this repeats once per bit of the multiplier
Better Multiplication
Notice
The upper half of the multiplicand doesn't get used; it gets initialized to \(0\)s and then shifted out of the register during the multiplication
The adder can only write into the upper half of the product, then shift to the right
Thus, we can store the multiplier in the bottom half of the product register to save space, since we will get rid of that area eventually
Also, the shifts we performed for the separate registers before can be executed with one shift of the combined register
Fastest Multiplication
Idea
We compute all the partial products instead of accumulating them in the product sequentially (this uses more storage but gives us speed)
Then, we use a tree of adders to add each two results pairwise, then the results pairwise, etc.
This is more expensive because it takes more space and more adders, but performs multiplication in \(O(\log{n})\) adds, where \(n\) is the number of bits
Pipelining
Pipelining: performing tasks at staggered times so that every resource is being used to perform the task
As opposed to waiting for the task to be done before starting a new one, which leaves resources dormant if the task is made up of subtasks
For our multiplication algorithm, this means that we don't need to wait for each layer of the tree to be done computing before performing the next multiplication operation
Instead, each layer of the tree may be used for a different operation at the same time
Note that this doesn't make a single multiplication algorithm any faster, but does speed up a sequence of them
RISC-V Multiplication
mul: performs 32x32 multiplication and places the lower 32 bits in rd
Note that we lose half the bits!
mulh: performs 32x32 multiplication and places the higher 32 bits in rd
Together, mul and mulh perform a full multiplication
This is because a 32x32 bit multiplication is 64 bits, which fits into two words, not one
Internally, the RISC-V processor may fuse these two into a single multiplication if it sees them right after each other
mulhu: performs unsigned multiplication and stores the upper half into the register
mulhsu: performs signed x unsigned multiplication and stores the upper bits
We always use mul for the lower bits because the low 32 bits of the product are the same whether the operands are signed or unsigned; only the upper 32 bits depend on the signs, which is why the signedness variants (mulh, mulhu, mulhsu) exist for the upper half
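E.g. a sketch of a full signed 64-bit product (registers illustrative):

```
mulh t2, t0, t1   # upper 32 bits of t0 * t1
mul  t3, t0, t1   # lower 32 bits of t0 * t1
# (t2, t3) together form the 64-bit product; keeping the pair adjacent allows fusing
```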
Division
It is possible to do this with multiplication hardware, but we will not explore that
RISC-V
div: performs integer division
Also divu
rem: stores the remainder of an integer division
Also remu
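E.g. together the two recover the full result of an integer division:

```
div t2, t0, t1   # t2 = t0 / t1, rounded toward zero
rem t3, t0, t1   # t3 = remainder, so that t0 = t2 * t1 + t3
```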
V1F - Polling, Interrupts, Watchdog Polling
Sometimes, we need to process asynchronous events that can happen at any time, and will cause the program to stop evaluating what it was before, then return to it
E.g. keyboard input
Exceptions and Interrupts
Exception: something caused by executing instructions (e.g. an error, like reading from a misaligned address)
Interrupt: something caused by something external to the program (e.g. I/O, power failure)
Exception Overview
When an exception occurs, the processor will
Save the address of the instruction that caused the exception in the User Exception Program Counter (uepc)
Jump to a specified address, which is stored in the User Trap Handler base address (utvec). This leads to OS code that can handle the exceptions; this is the OS's responsibility by definition, since RISC-V cannot deal with it
The OS needs to know the following things
The reason for the exception (can be reported in two ways)
Cause register: a special register that encodes the reason (i.e. each type of reason has a number); the OS decodes it to find the reason
Interrupt vectors: the PC receives the BASE field plus an offset determined by the exception type (the start address of that type's handler)
I.e. instead of decoding the exception itself, the PC gets directed to different code directly for different types of exceptions
The utvec mode bit determines which way gets used (\(0\) for cause register, \(1\) for vector)
The particular instruction that caused the exception; this will be in uepc
Polling vs. Interrupts
Polling: processor stops regularly and checks if an outside process needs attention
This wastes time and resources, but doesn't require trust that the process will notify the main process when necessary
Common for low-performance or embedded systems because of predictable timing, low hardware cost
Interrupts
The external process sends an interrupt to the main process when it is done; this causes the main process to stop and deal with the external process
Work similarly to exceptions, except it is not synchronized to the program's instructions
Polling Watchdog
When there is a cause to interrupt the processor, start a timer
If the processor has not polled for the input by the time the timer is done, interrupt the processor
This way, the processor has a bit of freedom to poll when it is most convenient for it
Originally proposed for multi-threaded systems
V20 - Exceptions in RISC-V
What Happens when an Exception Occurs
Process jumps to the code starting at the base of the trap handler (exception handler, stored in utvec)
Exception handler examines the cause of the exception and jumps to the corresponding OS handler for the exception in question
The OS handler may
Terminate the process that caused the exception
Perform an action (e.g. print) and return
Writing an Exception Handler
Can an exception handler write registers on the stack?
MIPS: possibility that the sp itself is corrupted, so no
In RISC-V, there are multiple levels of exception handler, so it depends
User level: maybe, so RISC-V will let you
Supervisor level and above: nope!
As such, we will assume that the stack cannot be used
So where do we store registers? We create a stack in the kernel area
All the registers must remain unchanged before and after an exception
Control and Status Registers
ucause register: encodes the reason that an exception or interrupt was raised
Status register: ustatus
Most important bit: bit 0, which determines whether user interrupts are enabled or disabled
Interrupt-enable register: uie
Control and Status Register Instructions
Each instruction that manipulates a CSR has 12 bits to specify which CSR
Atomic read-write (read-write in one operation)
csrrw: control status register read-write
csrrw t0, 0, t1 // t0 <- CSR0 (zero-extended) and CSR0 <- t1
csrrwi t0, 0, 0x01 // t0 <- CSR0 (zero-extended) and CSR0 <- imm
csrrc: atomic read and clear bits
csrrc t0, 0, t1 // t0 <- CSR (zero-extended) and CSR <- CSR & ~t1
// i.e. the complement of t1 is bitwise-ANDed into the CSR
// any bit that's a 1 in t1 becomes a 0 in the CSR
// an immediate version also exists
List of Control and Status Registers
| Register name | Register number | Usage |
| --- | --- | --- |
| ustatus | 0 | Interrupt mask and enable bits |
| uie | 4 | Enable user interrupts such as timer and keyboard (external) |
| utvec | 5 | The base address of the trap handler |
| uscratch | 64 | Temporary register to be used freely in the user trap handler |
| uepc | 65 | Address of the instruction that caused the exception |
| ucause | 66 | Exception type |
| utval | 67 | Memory address at which an offending memory reference occurred |
| uip | 68 | User interrupt pending |
List of Exception Codes
| Interrupt | Exception Code | Description |
| --- | --- | --- |
| 1 | 0 | User software interrupt |
| 1 | 1 | Supervisor software interrupt |
| 1 | 2 | Reserved for future standard use |
| 1 | 3 | Machine software interrupt |
| 1 | 4 | User timer interrupt |
| 1 | 5 | Supervisor timer interrupt |
| 1 | 6 | Reserved for future standard use |
| 1 | 7 | Machine timer interrupt |
| 1 | 8 | User external interrupt |
| 1 | 9 | Supervisor external interrupt |
| 1 | 10 | Reserved for future standard use |
| 1 | 11 | Machine external interrupt |
| 1 | 12-15 | Reserved for future standard use |
| 1 | >=16 | Reserved for platform use |
Three levels of execution
User level: when the user program is running (the programs that we write)
OS-level
Machine level: software running underneath the operating system (e.g. BIOS)
Re-entrant exception Handlers
When will exceptions be re-enabled (i.e. able to happen again)?
I.e. can exception handler code itself be interrupted?
If we want to be able to throw exceptions during exceptions, we need a re-entrant exception handler
So, we need to store some sort of stack somewhere to support this recursive exception behaviour
This is the kernel stack, which is created in the kernel memory region
The pointer for this stack will be a regular RISC-V register that you designate
In the handler, we must
Write the values to the stack and move the kernel stack pointer as necessary
Restore the position of the pointer before the handler is done
Exceptions must be disabled in between writing to the stack and moving the kernel stack pointer
This is done by updating the value in uscratch, which gets updated with an atomic read-write
I/O devices are managed by I/O controller hardware
Transfers data to and from the device
Synchronizes operations with software
Command registers: cause a device to do something; the I/O is bound to the register
Status registers: Indicate what the device is doing and if errors occur
Data registers: writing and reading to/from the device
Types of I/O Mappings
Memory mapped I/O: some special memory addresses are used to access an I/O register
Address decoder captures I/O accesses
What looks like a regular load actually loads data from the device; store commands work the same way
OS ensures that I/O-mapped memory is only accessible to the kernel; users cannot interact with it directly
Thus, we use ecalls to access the I/O
I/O instructions: there is a separate set of instructions to access I/O registers
These can only be accessed in kernel mode
Storage and I/O
All the I/O controllers are connected to the I/O interconnect, which communicates with the processor via interrupts
This interconnect is also connected to memory in order to serve memory-mapped I/O
- Parallel bus: multiple data lines (wires)
- Busses are either synchronous (and use a clock signal like a processor) or asynchronous (use some sort of handshake)
V23 - Floating-point representation
Normalization: expressing a number with one digit before the decimal point
Used for scientific notation
E.g. base \(10\): \(26.73 = 2.673_{10} \times 10^1\)
Converting Decimal Numbers to Binary
Direct option: just figure it out. The decimals work the same way as base-\(10\), just with inverse powers of \(2\) instead of \(10\)
Multiplication option: multiply by a power of \(2\), convert, then place the decimal point appropriately
Repeated multiplication: like [[V01 - Binary Representation#Decimal to Binary Conversion|repeated division]], but multiplying by \(2\) instead of dividing by it
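E.g. converting \(0.625\) by repeated multiplication: \(0.625 \times 2 = 1.25\) (write \(1\)), \(0.25 \times 2 = 0.5\) (write \(0\)), \(0.5 \times 2 = 1.0\) (write \(1\)), done: \(0.625 = 0.101_2\)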
Floating-point Standard: IEEE 754 for 32 bits
Standard for single (32-bit) and double (64-bit) precision floating point numbers
Developed in response to diverging representations; now ~ universal
Technical Specification
Bits are allocated as follows
1 bit: sign (0 for positive, 1 for negative)
First (most significant) bit of the number
8 bits: biased exponent (stored as an unsigned integer)
The actual exponent is the stored value minus the bias of \(127\)
23 bits: fraction (the fractional part of the significand)
Uses inverse powers of two instead of regular ones
The first (most significant) fraction bit represents \(\frac{1}{2}\), the next \(\frac{1}{4}\), etc.; a leading \(1\) is implicit and not stored
So the whole significand ranges from \(1\) up to just under \(2\), which makes sense since it "fills the gap" between each possible step of the exponent value
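Worked example: \(-0.75 = -1.1_2 \times 2^{-1}\), so the sign is \(1\), the biased exponent is \(-1 + 127 = 126 = 01111110_2\), and the fraction is \(100\dots0\); assembled, the word is 0xBF400000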
Specifications with even less bits are useful for machine learning
Various allocations for exponent vs. fraction
As little as 8 bits are sometimes used
V24 - Floating-point Arithmetic
Specific Operations
Addition
Align the binary points: shift the significand of the number with the smaller exponent right until the exponents match (this loses precision in the smaller number)
Add significands
Renormalize the result
Round-off the result to fit the available representation space
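E.g. (in decimal for readability): \(1.5 \times 10^2 + 2.5 \times 10^0\): rewrite the smaller as \(0.025 \times 10^2\), add to get \(1.525 \times 10^2\) (already normalized), then round to the available digits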
Multiplication
Add the exponents together
Multiply the significands
Normalize the product
Round-off the product
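E.g. \((1.5 \times 2^2) \times (1.25 \times 2^1)\): exponents \(2 + 1 = 3\), significands \(1.5 \times 1.25 = 1.875\), giving \(1.875 \times 2^3 = 15\), already normalized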
Accurate Arithmetic
Involves keeping extra digits during intermediate computation: a guard digit and a round digit (two extra digits on the right), plus a sticky bit that records whether any nonzero digits were shifted off past them
All of these are specified by IEEE 754
Associativity of Floating Point Addition
Floating-point addition is not associative
E.g. with \(x = -1.5_{10} \times 10^{38}\), \(y = 1.5_{10} \times 10^{38}\), \(z = 1.0\): \((x + y) + z = 0.0 + 1.0 = 1.0\), but \(x + (y + z) = x + 1.5_{10} \times 10^{38} = 0.0\), because adding \(1.0\) to \(1.5_{10} \times 10^{38}\) is lost to rounding
Parallel programming may assume associativity when interleaving operations
Such programs need to be validated under varying degrees of parallelism
V25 - Floating Point in RISC-V
video…
Floating point and f-registers
RISC-V supports single-precision and double-precision formats
Has three architecture extensions: F, D, and Q; these provide f-registers that can be used like the regular ones, but for floating-point values
F-extension: adds f-registers for \(32\)-bit floating point
Note that f0 is not hardwired with the value \(0\)
D-extension: Widens the f-registers to 64 bits for double-precision floating-point
Q-extension: Widens to 128 bits for… even more precision I guess
Q for "quads"
Floating Point load and store
// floating-point load word
flw rd, imm(rs1) // loads a word from memory into an f-register
// F[rd] = M[R[rs1] + imm]
// rd is an f-register; rs1 is a regular (integer) base register
// floating-point store word
fsw rs2, imm(rs1) // stores into regular memory from an f-register
// M[R[rs1] + imm] = F[rs2]
// rs2 is an f-register
Floating Point Arithmetic Instructions
Both exist in single-precision (.s) and double-precision (.d) variants
fadd.s // addition
fsub.s // subtraction
fmul.s // multiplication
fdiv.s // division
feq.s, flt.s, fle.s // comparisons (rd is set to 1 if the comparison holds, 0 otherwise)
fcvt.s.w // converts the integer in rs1 to a floating-point value, which gets stored in rd
fcvt.w.s // same idea in reverse: converts floating-point -> integer
Example
Topic V26 is an example of a floating-point matrix multiplication function translated/compiled from C to RISC-V
It contains a large subset of the operations discussed in this topic
The circuit can "read" from the state and "write" to the state
Also known as a finite state machine
Combinational Elements
AND gate: \(A \land B\)
Multiplexer: has multiple inputs and one output; chooses one input to pipe through to the output
Has a switch that determines which input becomes the output, I.e. \(S \, ? \, input1 : input0\)
Adder: \(Y = A+B\) (discussed earlier)
ALU: \(Y = F(A, B)\) for some operation \(F\)
Sequential Elements
Register: stores data in a circuit
Uses a clock signal to determine when to update the stored value
So, inputs are clock signal (Clk), data input (D), and output (Q, current state)
At every rising clock edge (when the clock moves from \(0\) to \(1\)), the input is stored and becomes the output; this makes the register edge-triggered
However, this would mean all the registers update on every clock edge, which is not useful
Register with write control: only updates on the clock edge when the write control is 1 (additional input to the register)
Requires an additional input: Write signal
The output only updates to match the input when the Write signal is true
Clocking Methodology
Combinational logic transforms the data during clock cycles
In between clock edges
Inputs data from state element and outputs the transformed data to state element
This may be different elements or the same element
Longest delay determines the clock period
Building a Datapath
Datapath: processes data and addresses in the CPU (i.e. takes care of the instructions and their accesses to ALU and memory)
Must include
Program counter (remember, this is the address to fetch the current instruction being executed from memory)
Memory (clearly we need this to get the instructions)
ALU for operations that don't read/write (at least, we need an adder to increment the program counter)
ALU Operation input to specify which operation we are performing
Regular two inputs, one output scheme
Register file
Register File
Register file: combination of all the registers in the processor
Inputs are 5-bit buses, because 5 bits are needed to specify which of the \(2^{5} = 32\) registers we want to access
We need two such inputs to specify the two registers to read from and one to specify the register to write to
We also need a data input carrying what to write (32 bits wide)
RegWrite control line: active if we need to write for the instruction, inactive otherwise
5-to-32 decoder: activates the single output line (0-31) that the 5-bit input refers to
For each register, C is control and D is data
If RegWrite is 1, then exactly one of the registers receives C=1 and the others receive 0; that is the one that will be written to. If RegWrite is 0, then all the Cs are 0
The data has to be sent to all the registers (!), but at most one actually gets data written to it
The register outputs are sent to multiplexers that select which values appear as read data 1 and 2
Read register numbers 1 and 2 (both 5-bit buses) drive the mux select inputs to make this decision
Each of these is really 32 one-bit multiplexers bundled into one circuit (one per data bit), but we can abstract that away
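As a behavioral sketch (not gate-level; the type and function names here are made up for illustration), the register file's ports behave roughly like this in C:

#include <stdint.h>

// Behavioral sketch of the 32 x 32-bit register file
typedef struct { uint32_t regs[32]; } regfile;

// Read ports are combinational: the 5-bit register number selects the value
uint32_t rf_read(const regfile *rf, uint8_t r) { return rf->regs[r & 31]; }

// The write port only takes effect when RegWrite is asserted on the clock edge;
// register x0 stays hardwired to 0 in RISC-V, so writes to it are dropped
void rf_write(regfile *rf, uint8_t rd, uint32_t data, int reg_write) {
    if (reg_write && (rd & 31) != 0) rf->regs[rd & 31] = data;
}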
General Design Notes
Components are abstracted as boxes that do things with inputs and outputs
If we have to decide between two things, we can represent this as depending on a signal in that is either 1 or 0.
Specifically, a "do this OR that" decision with one simple condition can be implemented by wiring both options into a multiplexer and coming up with a select signal that describes what we want
Wires can go backwards into units and can get split into multiple places
Everything is just the redirection of inputs and outputs
V28 - Single-Cycle Datapath
Now that we know the structure, we look at how a datapath is actually implemented for RISC-V
Single cycle: one cycle per instruction
Arithmetic instructions
We look at the opcode to figure out the instruction and format
Take the corresponding bits from rs1, rs2, and rd, and connect these to the corresponding inputs
These fields sit at fixed bit positions within the instruction, so they can be wired directly to the register file inputs
Register file figures out which registers to use and loads the data
ALU calculates the result of the operation (e.g. add)
Output of ALU is inputted into the write-data input of the register file
Remember, the write register rd was already provided to the register file during decode
RegWrite must be enabled before the write happens
Load/Store Instructions
Requires an additional data memory unit to handle load and store instructions
Receives an address input (used for both reads and writes)
Write data is also provided (the data to be written)
Can receive either a MemRead or MemWrite signal, which tells the unit which operation is being performed
Also requires an immediate generation unit to handle (i.e. sign-extend) the immediate from the instruction in order to use it
Copies the MSB (the sign bit) of the immediate into all the "empty" upper bits (a small C sketch follows below)
The rest of the structure is the same as an arithmetic instruction
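A minimal C sketch of that sign extension, assuming the usual 12-bit I-type immediate (the function name is made up):

#include <stdint.h>

// Sign-extend a 12-bit immediate to 32 bits by replicating bit 11 (the sign bit);
// assumes imm12 holds only the low 12 bits of the field
int32_t sext12(uint32_t imm12) {
    // shift the 12-bit field to the top of the word, then arithmetic-shift back;
    // the arithmetic right shift copies the sign bit into the upper 20 bits
    return (int32_t)(imm12 << 20) >> 20;
}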
Load Instruction
Store Instruction
Don't have to write into the register file, since store instructions don't change the value of any registers
Combining Arithmetic and Load/Store Designs
If we combine these into one circuit, there are two possible inputs for the ALU data
We use a mux to determine which signal gets used
ALU mux control signal: ALUSrc
The same thing happens for the signal that goes into the register file's write data (memData vs. ALU result)
Register file mux control signal: MemToReg
Note on Hardware
For some instructions (e.g. loads), the second register value (read data 2) isn't used by anything
However, the hardware will retrieve this value and pass it forward through the circuit
It is the control signals that make this value irrelevant; they prevent it from being used anywhere downstream
This "redundancy" is in the circuit because it simplifies the design
Branch Instructions
We need another adder in our design to compute the new value for the program counter
We also need to shift the immediate left by 1, since branch offsets have an implicit 0 as their least significant bit (targets are always 2-byte aligned)
An adder is used to add the PC and the immediate to get the branch target
We only write the branch target to the PC if the branch condition is true. Otherwise, we increment the PC by 4 (as we do for any other instruction to advance it)
Complete Datapath
Steps for Each Instruction Type
All instructions start with fetch and decode, because the instruction and registers always have to be read
R-Type (Arithmetic)
Fetch: load instruction from memory and increment PC
Read: read both registers from the register file
Setting of control lines is also computed in this step
Execute: the ALU operates on the data read from the register file. The opcode and funct fields tell the ALU which operation to perform
Write back: Write the ALU result back to the destination register
Load
Fetch and decode
Execute: the ALU computes the sum of the value read from the register file (the base address) and the sign-extended immediate
Memory: use result of the ALU as the address for the data memory
Write back: write data from memory unit into destination register
Branch-on-equal
Fetch, decode
Execute: the ALU subtracts one register value from the other to test for equality. A separate adder calculates the target address by adding the sign-extended, bitshifted immediate to the PC
Write back: the Zero output of the ALU determines which value (PC + 4 or the branch target) gets stored in the PC
Datapath with Control
All of the control signals come from the control unit
Jump-and-Link instructions
Saves PC + 4 (the return address) into \(R[rd]\) and sets the PC to PC + offset (a PC-relative jump)
The existing PC + 4 and PC + immediate adders can be reused; mainly this needs another input on the register file's write-data mux so PC + 4 can be written to \(R[rd]\)
V29 - Pipelining
Consider a task that uses a number of different resources. Instead of waiting for each task to complete before starting the next one, we overlap the tasks so that, at any moment, each resource is being used by a different task
All modern processors execute in a pipelined fashion
Speedup
Speedup (finite number of tasks): \(\text{Speedup} = \dfrac{\text{time to complete the tasks without pipelining}}{\text{time to complete the tasks with pipelining}}\)
Speedup (nonstop tasks): \(\text{Speedup} = \text{number of concurrent tasks on average}\)
Unbalanced pipelines (i.e. stages that take different amounts of time) reduce the speedup, since the slowest stage sets the pace
Pipelining increases throughput; it does not reduce latency, i.e. the time it takes to execute a single instruction
RISC-V Pipeline
RISC-V instruction execution follows 5 stages
IF: Instruction Fetch
ID: instruction decode and register read
EX: Execute operation or calculate address
MEM: Access memory
WB: write back to register
Assume register read/write takes 100ps and the other stages take 200ps
Different formats take different times (load is the longest, branch is the shortest)
The longest instruction takes 800ps, so without pipelining, instructions take \(\approx\) 800ps each, stacked end to end
RISC-V can be pipelined by running each of these steps at the same time for different instructions
Pipelined version: a new instruction can start every 200ps, so on average an instruction completes every 200ps
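A quick sanity check with these numbers: for a long stream of instructions, \(\text{Speedup} = \dfrac{800\,\text{ps}}{200\,\text{ps}} = 4\). This is below the ideal factor of 5 (the number of stages) because the pipeline is unbalanced: the 100ps register stages still occupy a full 200ps slot.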
ISA Design for Pipelining
All instructions are 32 bits → more uniform decoding for more balanced pipeline
Few and regular instruction formats → more overlap in the pipeline
Registers can be decoded and read in one step
Load/store addressing → address can be calculated in third stage, memory accessed in 4th stage
This is why we can't perform arithmetic operations directly from or to the memory
Hazards
Certain situations may prevent starting the next instruction in the next cycle
Structural Hazard: a resource the instruction needs is busy
Data Hazard: An instruction depends on data produced by a previous instruction
Control Hazard: control action (i.e. conditional) depends on the previous instruction
Structural Hazards
Processors don't actually have separate memory for instructions and data
Structural Hazard (precise): the data access of a load/store instruction and the instruction fetch need to use memory at the same time
Solution: processors have separate caches for "data memory" and "instruction memory"
Caches will be covered later
Data Hazards
An instruction depends on completion of data access by a previous instruction
// x1 may cause a data hazard
add x1, x2, x3
sub x4, x1, x5
Solution 1: insert a bubble into the pipeline so that we have time for the data to become available
Bubble: a space in the pipeline that doesn't have an instruction
However, this decreases performance, so we would like to do something better
Forwarding (Bypassing)
Forwarding: using a result directly when it is computed instead of waiting for it to be stored in a register
This requires extra connections in the datapath
The output of the execution unit (i.e. the ALU) is connected directly back to its input, so the next instruction can use the result immediately
Load-use Data Hazard (Stall)
If a value is loaded from memory and used by the very next instruction, we have no choice but to insert a bubble
Even with forwarding, the loaded value is only available after the memory stage, which is one stage too late for the next instruction's execute stage
Values cannot be sent backward in time
Code-scheduling to Avoid Stalls
The compiler knows how the pipeline works, so it can reorder code (!) to minimize load-use data hazards (and other hazards)
The compiler must understand the instruction dependency of the program to do this
V2A - Pipeline Control Hazards
Branches determine control flow, so fetching the next instruction depends on the branch outcome
This isn't always known in time: the branch outcome (and thus the next instruction to fetch) may not have been computed by the pipeline yet
So, figuring out all the targets should be done as early in the pipeline as possible
This can be done by adding hardware, or by simply stalling on a branch
Speculative Execution
Instead of waiting (stalling) for the result, we start executing instructions as if we already knew the branch outcome
If the prediction is right, we just keep executing, having saved time
If it's wrong, then we stall and recompute
This matters more and more as pipelines get longer, since the cost of a stall grows with pipeline depth
Actual processors have between 14 and 20 stages in their pipelines
In RISC-V, this can be done by ignoring the branch (and moving onto the next instruction), since failed branches just move to the next one anyway
Instructions that turn out to be wrong are turned into NOPs (no-operation), which nullifies them
Instruction can be turned into a NOP by setting all the control lines to 0
Known as instruction/load squashing
Realistic Branch Prediction
Static branch prediction: Instead of just moving past the branch, the prediction should be based on typical branch behavior
E.g. a backwards branch is usually a loop, and most of the time, loops iterate many times before exiting; why else would we write one? So, backward branches are predicted taken
E.g. a forward branch is usually a conditional. Programmers tend to put the code they expect to run less often in the branched-over path, so forward branches are predicted not taken
Dynamic branch prediction: the branch prediction depends on past behavior of either that branch or the branches in the program
Hardware counts branch behavior, i.e. it records the recent branch history (see the sketch after this list)
Stored in a table indexed by the low-order bits of the branch instruction's address
We assume that future behavior will follow past behavior (general design principle in computer architecture)
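One common counting scheme (the notes don't pin one down, so treat this as an illustrative C sketch): a table of 2-bit saturating counters indexed by the branch's PC. The table size and names are made up.

#include <stdint.h>

#define BHT_SIZE 1024 // hypothetical table size (a power of two)

// 2-bit saturating counters: 0-1 predict not taken, 2-3 predict taken
static uint8_t bht[BHT_SIZE];

// index by the low-order bits of the branch instruction's address
static unsigned bht_index(uint32_t pc) { return (pc >> 2) & (BHT_SIZE - 1); }

int predict_taken(uint32_t pc) { return bht[bht_index(pc)] >= 2; }

// once the branch resolves, nudge the counter toward the actual outcome
void train(uint32_t pc, int taken) {
    uint8_t *c = &bht[bht_index(pc)];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
}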
Calculating the Branch Target
Even if a speculation is right, we still need to calculate the target address, which takes time and space in the pipeline
Solution: branch target buffer
Cache of target addresses, indexed by the PC when the instruction is fetched
We can fetch the target immediately if we have used it before
Some architectures have a branch delay slot: the instruction directly after a branch is always executed
V2B - Pipelined Datapath
Our single-cycle datapath is not suitable for pipelining because all of the stages of execution are done in a single circuit, at the same time
A pipelined execution requires a redesign of the datapath
Design Changes
In the initial datapath, signals mostly flow from left to right, but two flow backwards; both backward flows are sources of hazards
Write back to registers
Next value of program counter
We need registers between stages to store the intermediate results between stages
These also need to store the control signals that will be used later
The bits that usually get fed back into the register file (for example) will get stored in the intermediate result register
Pipelined Control
Data Hazards in ALU Instructions
Possibility: we write to a register in one instruction, then read from it in the immediately following instruction
In this case, the value accessed in the second instruction will be out of date
Forwarding
We can resolve these data hazards with forwarding: adding connections from later pipeline stages back to earlier ones (e.g. from the ALU output back to the ALU input)
V2D - Instruction-Level Parallelism
still make notes on this
Pipelining: executing multiple instructions in parallel
To increase ILP (Instruction-level parallelism):
Deeper pipeline (less work per stage → shorter clock cycle)
Wider pipeline (replicate pipeline stages → more instructions executed per clock cycle)
Multiple Issue
Static multiple issue: compiler groups instructions into "issue packets", which can be thought of as a really long instruction that contains many concurrent operations
The compiler groups instructions to be issued together and packages them into "issue slots"
Dynamic multiple issue: CPU looks at the stream of instructions and chooses which instructions to issue each cycle
The processor can execute instructions out of program order
However, the results must be committed back to the registers in order; this is required so that exceptions can work as expected
Dynamically Scheduled CPU
The way we have treated registers is a simplified concept: registers are best thought of as names for values that propagate between instructions, not actual places where things are stored; many values never need to live in the architectural register file at all
Register renaming: the hardware can rename registers, i.e. map architectural register names onto a larger pool of physical registers. This eliminates write-after-read and write-after-write dependencies, i.e. lets multiple instructions execute at the same time when their reuse of register names wouldn't actually affect each other
There are only 32 architectural registers because that's all the space available in the instruction to specify a register. We can have more physical registers not nameable by instructions, but still usable by the hardware through renaming
Why do dynamic scheduling?
Not all stalls can be predicted
Can't always schedule around branches
Different implementations of an ISA have different latencies and hazards
Speculation
When we guess what to do with an instruction
Start the operation as soon as possible, then check whether the guess was right
Do the common case fast, and do the rare case correct
i.e. it makes sense to spend extra time on the rare cases as long as the common cases are as fast as possible
If the speculation is wrong, roll back and do the correct thing.
If the common case really is common, immediately choosing it, then checking (and correcting if necessary) is cost-effective compared to always checking and always being correct
Example: speculation about whether a store instruction and a later load instruction store/load independent values or related values. We might speculate they are unrelated, then run both instructions at the same time since we don't think they affect each other
Speculation and Exceptions
If an exception occurs on a speculatively executed instruction, it must not take effect unless the speculation turns out to be correct
Static speculation: can add ISA support for deferring exceptions
Dynamic speculation: can buffer exceptions
V2F - Memory Hierarchy
So far, we have abstracted memory as an array
It turns out that memory is organized in a more complex way than a linear array
This complexity comes from the fact that different storage technologies have different costs and access times
Types of Memory
SRAM: static RAM: data remains stable as long as power is provided
Very quick to access, but extremely expensive per GB
DRAM: dynamic RAM: the data may disappear if the RAM is not refreshed
"Normal" access time (~100x) SRAM access time, normal price per GB
Storage drives (disk/SSD)
Extremely slow access time but very cheap per GB
We want to emulate a system with the speed of SRAM, but that is more cost effective
Principles of Locality
Principle of locality: programs access a small portion of their address space at any time
Temporal locality: things accessed recently are likely to be accessed again soon
E.g. accesses done in loops
Spatial Locality: memory being accessed is usually close in the address space to other memory being accessed
E.g. accessing each element of an array in order
Taking Advantage of Locality
Everything (code and data) is stored on disk (e.g. solid-state drives)
Recently accessed and nearby items are copied from disk to smaller DRAM (main memory)
The most recently accessed and nearby items are copied from DRAM into SRAM (cache memory) that is attached directly to the CPU
Its access time is very fast
Memory Hierarchy
Overview of memory hierarchy: there are various places to store things in memory. The most used locations are stored close to the CPU in very fast-access locations. The further away storage is from the CPU, the larger, slower, and cheaper it gets
Block: smallest unit of stored data (may be multiple words)
Hit: memory access that is satisfied by a higher level in memory
Miss: memory access is not satisfied
The hit/miss ratio is an important factor in performance (a standard formula follows this list)
Miss penalty: time taken to access non-cached memory
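These definitions combine into a standard figure of merit (not stated explicitly in the notes, but implied): \(\text{AMAT} = \text{hit time} + \text{miss rate} \times \text{miss penalty}\), the average memory access time.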
V30 - Cache Memory
Cache memory: level(s) of memory that are closer to the CPU than the storage
Identifying Blocks
Cache memory is organized into blocks; each block may be multiple words
Each block needs a unique identifier; \(2^k\) blocks means \(k\)-bit identifiers
Since each block may have multiple words, we need an offset to determine which word we are looking for
\(2^k\) words/block means \(k\)-bit offset
Cache Addressing
In order of LSB (right) to MSB (left):
First 2 bits: offset (when trying to access a specific word from each block)
Next 12 bits: cache block index
Rest of the address: tag, which is stored before the block
Identifies which address in memory the block came from
Only the address bits not already implied by the index and offset need to be stored as the tag
Note that since there are \(2^{32}\) memory locations, multiple (many) memory locations map to the same place in the cache
However, we still need to know if the content of the block is valid (i.e. was the address we are looking for stored there, or a different one that maps to the same place?)
There is a "valid bit" that is 1 if the cache is valid and 0 otherwise
Associativity
We want a given memory address to be able to map to more than one block in the cache
Use case: if two memory locations that map to the same block are used over and over, we have to rewrite the cache on every access, which defeats the purpose
Solution: we group blocks into a set
Each block still has a valid bit and a tag
Each set gets an index, but we need fewer of these than for individual blocks, since there are multiple blocks per set
Now, when matching a tag, we have to check each tag in the set to see if it matches, which takes more time
New notation: 0xFF00_0 for 17 bits: the first four digits are hexadecimal (16 bits), and the digit after the underscore is binary (1 bit)
When placing new blocks in the cache:
We must decide which block to remove
A good policy: remove the block that was used least recently (an application of temporal locality)
To implement this, we add a bit to the cache design that tracks which block was least recently used (the LRU bit)
\(n\) blocks per set: \(n\)-way associative cache
Increasing associativity decreases the miss rate, but returns are diminishing
Set-associative cache diagram
V31 - Block Size and Associativity
We can figure out how many bits we need per block (offset bits) and how many sets there are in the cache (index bits)
more calculations follow from the definitions (a worked example is sketched below)
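A worked example under assumed parameters (these numbers are illustrative, not from the lecture): a 32 KiB, 4-way set-associative cache with 64-byte blocks has \(\log_2 64 = 6\) offset bits, \(\frac{32768}{64 \times 4} = 128\) sets, and therefore \(\log_2 128 = 7\) index bits; the remaining address bits form the tag.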
LRU for Higher Associativity
The higher the associativity, the more bits we need to track the LRU ordering (exactly ordering \(n\) blocks takes about \(\log_2(n!)\) bits)
This gets expensive quickly, so sometimes it makes more sense not to precisely track the LRU bits
2-way is simple, 4-way is manageable, but more than that is not worth tracking precisely
Picking randomly (which can be done with hardware, thanks IBM!) is generally pretty good because it statistically avoids worst-case picks
Alternative: Not Most Recently Utilized
Tree-based algorithm for approximating LRU
Each node in the tree is one bit indicating on which side (left or right) the least recently used blocks lie
When a block (leaf) is accessed, the node bits along its path are set to point away from it (see the sketch after this list)
For an \(n\)-way set-associative cache, this requires \(n - 1\) bits per set (one per internal tree node)
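A behavioral C sketch of the 4-way case (3 node bits; the type and function names are made up):

#include <stdint.h>

// Tree pseudo-LRU state for one 4-way set: a root node plus one node per half
// Convention: a node bit of 0 means "the victim is on the left", 1 means "on the right"
typedef struct { uint8_t root, left, right; } plru4_t;

// On an access to way w (0..3), point the nodes on w's path away from it
void plru4_touch(plru4_t *t, unsigned w) {
    if (w < 2) {             // w lives in the left subtree
        t->root = 1;         // the next victim should come from the right
        t->left = (w == 0);  // within the left pair, point away from w
    } else {
        t->root = 0;
        t->right = (w == 2);
    }
}

// Follow the node bits from the root to find the pseudo-LRU victim
unsigned plru4_victim(const plru4_t *t) {
    if (t->root == 0) return t->left == 0 ? 0 : 1;
    else              return t->right == 0 ? 2 : 3;
}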
Replacement Policy
Direct mapped: no choice but to replace what's in the cache when two blocks map to the same index
Set associative
Prefers to replace an invalid entry, if there is one
Otherwise, we must choose somehow (generally LRU or LRU approx, maybe random)
Finding the Block from an Address
Finding where an address goes in an \(n\)-way associative cache: \(\text{set} = \lfloor \frac{\text{address}}{\text{bytes per block}} \rfloor \bmod \text{\# of sets}\) (in a direct-mapped cache, each set is a single block, so this is mod the number of blocks)
I.e. if an address is "larger" than the cache, it just wraps back around to the start, modulo-style
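The same calculation as a one-function C sketch (names made up; for a direct-mapped cache, pass the number of blocks as num_sets):

#include <stdint.h>

// Which set an address falls into, wrapping around modulo-style
unsigned set_index(uint32_t addr, unsigned bytes_per_block, unsigned num_sets) {
    return (addr / bytes_per_block) % num_sets;
}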
Large Blocks
Larger blocks reduce miss rate, due to spatial locality
However, for a fixed cache size, larger blocks → fewer blocks → more competition for a particular block → higher miss rate
Specifically, pollution (bringing things into the cache that don't get used again) is more of an issue
Also, larger block → larger miss penalty
Larger Cache
Why not just make the cache larger?
The size of the cache is limited by the physical distance between blocks in the cache
The distance to the furthest cache block puts a lower bound on cache access time
V32 - Write Strategy
Strategy for Hits
Write-back Policy
When the processor is trying to do a store, and the location to write to is in the cache
In this case, the write is only written to the cache!
However, the corresponding block in memory is now out of date (stale)
We add a dirty bit to each cache block: it is cleared when the block is brought in from memory and set when the processor writes to the block, so it tracks whether the cached copy has diverged from memory
Then, when the block is eventually removed from the cache, we write its new value back into memory (hence "write-back")
Write-through Policy
Instead of just writing to the cache (like write-back), any change to the block gets written to both the cache and the memory
Thus, the memory and cache are always synchronized
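A schematic C contrast of the two hit policies (the cache_line type and names are made up; real caches do this in hardware):

#include <stdint.h>

typedef struct { uint32_t tag, data; int valid, dirty; } cache_line;

// write-back: update the cache only and mark the line dirty;
// memory is brought up to date later, when the line is evicted
void store_hit_write_back(cache_line *line, uint32_t value) {
    line->data = value;
    line->dirty = 1;
}

// write-through: update the cache and memory together, so they never diverge
void store_hit_write_through(cache_line *line, uint32_t *mem_word, uint32_t value) {
    line->data = value;
    *mem_word = value;
}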
Strategy for Misses
Write-allocate Policy
Usually paired with the write-back policy
Processor fetches entire block from memory, makes space for it in the cache, and writes it into the cache
The line is now dirty, which is recorded by the dirty bit
Write-around Policy
Usually paired with the write-through policy
The new value is only written to memory; the cache is not filled
The point: written data that won't be read again soon would only pollute the cache, evicting more useful blocks
Write Buffer
No matter the policy, writing to memory takes a very long time, and may halt other instructions (i.e. reads)
We can add another layer (a write buffer) to the memory hierarchy to help with this
It is small and close to the cache, so it can be read from and written to quickly
At some point, there will be enough resources to transfer the contents of the write buffer to memory. Then, the write buffer is cleared
With every read, we must now check whether what we are looking for is in the write buffer or in memory
We have to search every entry in the write buffer for this. Yikes!
L3 and L4 caches are present in some systems, but main memory usually comes after L2
Interactions with Advanced CPUs
Out-of-order CPUs can execute instructions during a cache miss
Pending store stays in a load/store unit
Dependent instructions accumulate in reservation stations while independent instructions continue
Performance analysis is much harder to work out directly; we rely on actual simulations to measure performance
Interaction with software
Code should be written with locality in mind, since it will (in theory) run faster on hardware
Compilers also optimize for memory access
V36 - Virtual Memory
Lies!! When you access data from memory (as the processor), the address the processor generates is not the same as the actual address the data is stored at
Why Lie?
Data protection: if all programs can access all the memory, then data may leak (by accident) or be spied on (on purpose)
Providing the illusion of larger memory
This made more sense when machines had very limited memory (i.e. punch-card days)
Overview
We have a cache, main memory, and a disk
Cache → memory is like memory → disk; the disk is much larger but much slower
We use the same locality principles
We divide the data in the main memory into pages (often 4 KiB)
Each page consists of a number of cache blocks
Pages are stored on the disk
The main memory is used as a cache for the content of the disk
This system is called virtual memory
VM is managed by both hardware and software (because of the large sizes of data involved), unlike the memory cache, which is all hardware
Virtual Address Space
Each program has a private virtual address space, all of which map to the physical memory address space
Pages in the virtual space correspond to frames in the physical space; maps exist between individual pages and frames
Hardware and OS operate this system
Data may be shared between programs through the frames, but this requires operating system permission
The "cache miss" for virtual memory is called a page fault
Page faults are incredibly slow, so we really want to minimize them
Very smart replacement algorithms (OS topic)
Placement is fully associative (any page can go in any frame), located by a page table lookup rather than an associative hardware search
Address Translation
Some virtual addresses map to physical addresses; others map to pages currently on disk, and accessing those triggers a page fault
Pages, like blocks, have page numbers and page offsets
The page offset isn't changed when addresses are translated
The page number is translated into a physical frame/page number
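A minimal C sketch of this split-and-recombine step, assuming 4 KiB pages (12 offset bits; names made up):

#include <stdint.h>

#define PAGE_BITS 12 // 4 KiB pages -> 12-bit page offset

uint32_t page_offset(uint32_t va) { return va & ((1u << PAGE_BITS) - 1); }
uint32_t page_number(uint32_t va) { return va >> PAGE_BITS; }

// translation keeps the offset and swaps the page number for a frame number
uint32_t physical_address(uint32_t frame, uint32_t va) {
    return (frame << PAGE_BITS) | page_offset(va);
}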
Address Translation in RISC-V
The virtual address holds several index fields: the first indexes the root page table, whose entry points to the next-level table, and so on; the final lookup yields the physical frame number
SATP: Supervisor Address Translation and Protection register
Contains the physical page number of the root page table, an address space identifier, and a mode field; the fields are WARL ("write any values, read legal values")
missing
Writes
Disk writes take millions of cycles → we use write-back with a dirty bit in the PTE
Translation Lookaside Buffer
The TLB caches virtual address → physical address translations, since by the principle of locality, we probably translate the same addresses often
A TLB miss may either mean a page is present (but not indexed in the TLB), or not present at all
So, a miss raises an exception that copies the PTE from memory to the TLB, then restarts the instruction
If the page is not present, a page fault will occur and be dealt with normally
Page Fault Handler
Use faulting VA to find PTE
Locate the page on the disk
Choose the page to replace (if dirty, write to disk first)
Read page into memory and update the page table
Return to the normal process
Memory Protection
Different tasks can share parts of the virtual address space, but we need to protect against errant access
Hardware has support for OS protection
Privileged supervisor mode (kernel mode)
Privileged instructions
Page tables and other state information only accessible in supervisor mode