Computer Architecture I CMPUT229

V01 - Binary Representation

Unsigned Binaries

Decimal to Binary Conversion

2-Complement Notation

Sign Extension

Sign-extension in RISC-V

V02 - Storing Data in Memory

Hexadecimal Notation

Memory

Memory Organization

Endianness

V03 - Organization of a computer

Arithmetic Operations

# Arithmetic Add
add a, b, c     # a <- b + c

# Add immediate
addi a, b, 20   # a <- b + 20

Compiling C Code

// C code to compile
f = (g+h) - (i+j)

// RISC-V Implementation
add t0, g, h
add t1, i, j
sub f, t0, t1

// Assumptions
f goes in s0, g in s1, h in s2, i in s3, j in s4

Register Types

Addressing an Array

// C code
A[0] = h + A[2]

// RISC-V Implementation
lw t0, 8(s1)      // loads A[2] by using an 8-byte (2-word) offset
add t0, s2, t0    // adds the two terms together
sw t0, 0(s1)      // saves the result where we need it (0-word offset)

// Assumptions: h in s2, A in s1
lw t0, 8(s1)       # takes the value in the memory address s1 with offset
                   # 8 and saves it in t0
sw t0, 0(s1)       # saves the value in t0 at the memory address s1 with offset 0

Registers vs. Memory

Immediate Operands

V04 - Data Transfer between Memory and Registers

Example: lw (load word)

Example: sw (save word)

V05 - Binary Representation of Instructions

General Idea

Formats

V05 - Branches and Jumps

// C code
if (i == j) {
    f = g + h;
} else {
    f = g - h;
}

// RISC-V Implementation
            bne s3, s4, Subtract   // branch to subtract if s3 does not equal s4
            add s0, s1, s2         // add statement
            jal zero, Exit         // jump to exit (jal zero always jumps)
Subtract:   sub, s0, s1, s2        // subtract statement
Exit: ...

V06 and V07 - Logical Operations and Large Constants

Shifts

Bitwise Operations

add t0, t1, t2
or t0, t1, t2
xor t0, t1, t2

Bitmasks

Getting a Specific part of a Number
  1. Number is small
  2. Number is large:

xori ... -1

Loading Immediates

Example for Masking, Large Constants

V08 - Loops and Conditionals

While Loop

// C Code
while (save[i] == k) {
    i = i + j
}
        bne ... , Exit   # check condition immediately
Loop:   ...              # start the loop body here (on this line)
        add ...          # (cont. ) loop body and iteration
        beq ..., Loop    # jump to loop top or continue if fulfilled
Exit: 

More Conditional Operations

Basic Blocks

V09 - Storing Arrays in Memory

An array is a contiguous chunk of memory

![[V03 - Organization of a computer#Addressing an Array]]

Another C Example

# C Code
y = A[B[j]]

# RISC-V Implementation
slli t1, t0, 2    // 4 * j
add t2, s7, t1    // B + 4 * j (address of B[j])
lw t3, 0(t2)      // store B[j] in t3
slli t4, t3, 2    // 4 * B[j]
add t5, s6, t4    // A + 4 * B[j] (address of A[B[j]])
lw s3, 0(t5)      // load the thing at this address into y

V0B - Procedure Calls, Stack, and Memory Layout

Procedures

Returning

The Stack

What Does the Stack Do when We Call a Procedure?

  1. Address of current instruction (the program counter pc) is written into ra
  2. ra is saved onto the stack, and sp is decremented/updated so it points to the top of the stack
  3. The function is evaluated. Say this includes calling another function
    1. fp is moved up, since this new function creates a new frame. However, if this function doesn't in turn call more, it will stay empty
    2. ra is overwritten with new value and the pc is moved to the new function
    3. The function returns; pc is moved to the value of ra and execution continues as before.
  4. ra is restored from the stack. If we changed other things, they would be as well
  5. The (outer) function returns, meaning

Memory Layout

V0C - Register Calling Conventions

Registers with Conventions

Other Registers

Table of Calling Conventions

Register Name Use Saver
x0 zero The constant value 0 N.A.
x1 ra Return address Caller
x2 sp Stack pointer Callee
x3 gp Global pointer --
x4 tp Thread pointer --
x5-x7 t0-t2 Temporaries Caller
x8 s0/fp Saved register/frame pointer Callee
x9 s1 Saved register Callee
x10-x11 a0-a1 Function arguments/return values Caller
x12-x17 a2-a7 Function arguments Caller
x18-x27 s2-s11 Saved registers Callee
x28-x31 t3-t6 Temporaries Caller

V0D - Recursive Functions

Like any function call, recursive function calls add a frame to the stack. If the function is tail-recursive, it is possible that the compiler can optimize the stack use by writing the new values for scoped variables to the same stack frame instead of to a new one each time.

Lecture V0D has an in-depth example of the RISC-V implementation of a recursive C function, which should be consulted as an example

V0E - Different Sizes of Data

# load byte from the address in rs1 into rd
# this is sign-extended to 32-bits (fills the whole word)
lb rd, offset(rs1)

# load unsigned byte (zero extension instead of sign-extension)
lbu rd, offset(rs1)

# stores the rightmost byte in rs2 into the memory address in rs1
sb rs2, offset(rs1)

Manipulating Different Sizes of Arrays

String Manipulation Example

V0F - Relative Performance

Response Time vs. Throughput

Performance

Where Does time Get Spent Executing a Program?

V11 - Clocks Per Instruction

Definitions

Where Do These Values come From?

More Detail on CPI

V12 - Power, Multiprocessor, SPEC, Amdahl's Law

Power and Performance

SPEC

Amdahl's Law

Takeaways

V13 - Bubble Sort RISC-V Example

Swap Function

// C
void swap(int A[], int k) {
    int temp;
    temp = A[k];
    A[k] = A[k+1];
    A[k+1] = temp;
}

// RISC-V
swap:
    slli t1, a1, 2      // A[] is in a0; get k*4
    add t1, a0, t1      // get the address of A[k]
    lw t0, 0(t1)        // temp storage for A[k]
    lw t2, 4(t1)        // load A[k+1] into t2
    sw t2, 0(t1)        // put A[k+1] into A[k]
    sw t0, 4(t1)        // put A[k] value (stored in temp) into A[k+1]
    jalr zero, ra, 0    // end function

Bubble Sort

void bubblesort(int A[], int n) {
    int i, j;
    for(int i = 0; i < n; i++) {
        for(int j = i-1, j >= 0 && A[j] > A[j+1]; j--) {
            swap(A, j);
        }
    }
}

Implementing Bubblesort in RISC-V

Effect of the Compiler

V14 - Calling Conventions for Functions inside Loops

Why is Calling a Function inside of a Loop Different?

V15 - Indexing vs. Pointers

// Array indexing
void clear(int A[], int n) {
    int i;
    for(i = 0; i < size; i++) {
        A[i] = 0;   // notice we access a value by its array index
    }
}

// Pointer Indexing
void clear(int* A, int n) {
    int* p;
    for(p = &array[0]; p < &array[size], p += 1) {
        *p = 0   // notice we deference the pointer to access the value to change
    }
}

V16 - Meaning of Dereferencing

void foo(int *p, int x) {
    *p = x;  // STORE the VALUE of x in the memory location whose ADDRESS is in p
    // Analogous RISC-V Instruction: sw a1, 0(a0) (like doing save word)
    x = *p;  // LOAD into x the VALUE from the memory location whose ADDRESS is in p and 
    // Analogous RISC-V Instruction: lw a1, 0(a0) (like doing load word)
}

V17 - Examples of String Functions

Examples of String Functions

All of these are straightforward implementations in RISC-V, and should be done as exercises

Types

V18 - Parameter Passing by Value and Reference

What's the Difference?

V19 - Synchronization

Naive Approach to Multiple Processors Asking Same Data

Synchronization

Synchronization in RISC-V

Atomic Swap Example

try:
// s1 contains the shared resrouce between the processors
lr.w t0, (s1)      // load reserved word from the shared resource
// t0 contains the value in s1
// now that a thing has been loaded from the synchronized area
// every processor (involved?) has a record that the address of s1 is load-reserved
sc.w t1, s4, (s1)  // store conditional into s4 (?) from s1 (shared area)
// t1 now contains the status of the store operation
// now the address is no-longer load-reserved;
// the address of s1 is no longer stored as load-reserved if it succeeded
bne t1, zero, try  // if the store fails, branch back to the try
// we know the synchronized thing succeeded
add s4, zero, t0   // put the loaded value into s4

V1A and V1B - Blackboard Architecture and Transactional Memory

Based on how people work at a blackboard: people don't communicate with each other directly; they write everything they need to solve a problem on the blackboard where everyone can see it at the same time. If a person has a better solution, they simply erase what they want to replace on the blackboard and write that. No ambiguity exists here because it is plainly visible when someone is writing on the board.

Blackboard Architecture

Example: Newton's Method

No-lock Version

Lock Version

Blackboard Architecture Terms

Transactional Memory

V1C - Compiling, Linking, and Loading

Producing an Object Module

Linking Object Modules

Loading a Program

Dynamic Linking

Lazy Linkage

watch video!

Dynamic JIT Compilers

V1D - Computer Arithmetic

Operations on Integers

Addition

Bitwise Addition

Combining Half-adders

Ripple-carry Adder

Carry Look-ahead Adder

Subtraction

Overflow

Checking for Overflow

Design Principles: Dealing With Overflow

V1E - Multiplication and Division

Unsigned Multiplication

Multiplication Hardware

Multiplication Algorithm

Full Run-through of a Multiplication

Better Multiplication

Fastest Multiplication

Pipelining

RISC-V Multiplication

Division

RISC-V

V1F - Polling, Interruptions, Watchdog Pooling

Exceptions and Interrupts

Exception Overview

Polling vs. Interrupts

Polling Watchdog

V20 - Exceptions in RISC-V

What Happens when an Exception Occurs

Writing an Exception Handler

Control and Status Registers

Control and Status Register Instructions

List of Control and Status Registers

Register name Register number Usage
ustatus 0 interrupt mask and enable bits
uie 4 Enable user interrupts such as Timer and Keyboard (external)
utvec 5 The base address of the trap handler is stored in this register
uscratch 64 Temporary register to be used freely in the user trap handler
uepc 65 address of instruction that caused exception
ucause 66 exception type
utval 67 memory address at which an offending memory reference occurred
uip 68 User interrupt pending

List of Exception Codes

Interrupt Exception Code Description
1 0 User software interrupt
1 1 Supervisor software interrupt
1 2 Reserved for future standard use
1 3 Machine software interrupt
1 4 User timer interrupt
1 5 Supervisor timer interrupt
1 6 Reserved for future standard use
1 7 Machine timer interrupt
1 8 User external interrupt
1 9 Supervisor external interrupt
1 10 Reserved for future standard use
1 11 Machine external interrupt
1 12-15 Reserved for future standard use
1 >=16 Reserved for platform use

Re-entrant exception Handlers

Non-reentrant exception Handlers

V22 - Input and Output Devices

I/O Devices

Types of I/O Mappings

Storage and I/O

- Parallel bus: multiple data lines (wires) - Busses are either synchronous (and use a clock signal like a processor) or asynchronous (use some sort of handshake)

V23 - Floating-point representation

Converting Decimal Numbers to Binary

Floating-point Standard: IEEE 754 for 32 bits

Technical Specification

IEEE-754 Representation Formula

Formula
\(a = (-1)^{S}\times \text{fraction} \times 2^{\text{biased exponent} -127}\)
Special Representations

Sign Bit and Bias vs. Two-s Complement

Possible Range of Representation

Other precisions

V24 - Floating-point Arithmetic

Specific Operations

Addition

  1. Align decimal point of numbers to that of the number with the smallest exponent (leads to a loss of precision)
  2. Add significands
  3. Renormalize the result
  4. Round-off the result to fit the available representation space

Multiplication

  1. Add the exponents together
  2. Multiply the significands
  3. Normalize the product
  4. Round-off the product

Accurate Arithmetic

guard and round digits???

Associativity of Floating Point Addition

V25 - Floating Point in RISC-V

video…

Floating point and f-registers

Floating Point load and store

// floating-point load word
flwrd, imm(rs1)
// loads word into an f-register
// F[rd] = M[R[rs1] + imm]
// rs1 is an f-register

// floating point save word
fsw rs2, imm(rs1)
// saves into regular memory from an f-register
// M[R[rs1] + imm] = F[rs2]
// rs2 is an f-register

Floating Point Arithmetic Instructions

fadd.s // addition
fsubs.s // subtraction
fmul.s // multiplication 
fdiv.s // division
feq.s, flt.s, fle.s // comparison (rd is set to 0 if values are not equal, etc)
fcvt.s.w // converts integer in rs1 to floating-point value, which gets stored in rd
fcvt.w.s // same as above, but converts floating-point -> integer

Example

V27 - Datapath Building Blocks

Logic Design Basics

Combinational Elements

Sequential Elements

Clocking Methodology

Building a Datapath

Register File

General Design Notes

V28 - Single-Cycle Datapath

Arithmetic instructions

  1. We look at the opcode to figure out the instruction and format
  2. Take the corresponding bits from rs1, rs2, and rd, and connect these to the corresponding inputs
  3. Register file figures out which registers to use and loads the data
  4. ALU calculates the result of the operation (e.g. add)
  5. Output of ALU is inputted into the write-data input of the register file

Load/Store Instructions

Load Instruction

Store Instruction

- Don't have to write into the register file, since store instructions don't change the value of any registers ## Combining Arithmetic and Load/Store Designs - If we combine these into one circuit, there are two possible inputs for the ALU data - We use a mux to determine which signal gets used - ALU mux control signal: `ALUSrc` - The same thing happens for the signal that goes into the register file's write data (memData vs. ALU result) - Register file mux control signal: `MemToReg`

Note on Hardware

Branch Instructions

Complete Datapath

Steps for Each Instruction Type

R-Type (Arithmetic)

Load

Branch-on-equal

Datapath with Control

V29 - Pipelining

Speedup

RISC-V Pipeline

ISA Design for Pipelining

Hazards

Structural Hazards

Data Hazards

// x1 may cause a data hazard
add    x1, x2, x3
sub    x4, x1, x5
Forwarding (Bypassing)
Load-use Data Hazard (Stall)
Code-scheduling to Avoid Stalls

[[V2A - Pipeline Control Hazards]]

V2A - Pipeline Control Hazards

Speculative Execution

Realistic Branch Prediction

Calculating the Branch Target

V2B - Pipelined Datapath

Design Changes

Pipelined Control

Data Hazards in ALU Instructions

Forwarding

V2D - Instruction-Level Parallelism

still make notes on this

Multiple Issue

dynamically scheduled CPU

Speculation

Speculation and Exceptions

V2F - Memory Hierarchy

Types of Memory

Principles of Locality

Taking Advantage of Locality

Memory Hierarchy

V30 - Cache Memory

Identifying Blocks

Cache Addressing

Associativity

Set-associative cache diagram

V31 - Block Size and Associativity

more calculations here, but the follow from the definitions

LRU for Higher Associativity

Alternative: Not Most Recently Utilized

Replacement Policy

Finding the Block from an Address

Large Blocks

Larger Cache

V32 - Write Strategy

Strategy for Hits

Write-back Policy

Write-through Policy

Strategy for Misses

Write-allocate Policy

Write-around Policy

Write Buffer

V33 - Cache Performance

Components of CPU Time

Memory Accesses

Strategies for Solving Problems

V34 - Average Memory Access Time

Performance - Summary

V35 - Multi-level caches

Interactions with Advanced CPUs

Interaction with software

V36 - Virtual Memory

Why Lie?

Overview

Virtual Address Space

Address Translation

Address Translation in RISC-V

missing

Writes

Translation Lookaside Buffer

Page Fault Handler

  1. Use faulting VA to find PTE
  2. Locate the page on the disk
  3. Choose the page to replace (if dirty, write to disk first)
  4. Read page into memory and update the page table
  5. Return to the normal process

Memory Protection

Entire System Overview

V37 - Memory Hierarchy Concepts

https://www.youtube.com/watch?v=ljS1vTHx04o

Review

Virtual Machines

Cache Coherence Problem

more