CMPUT 379
Knowing how the OS works is crucial for efficient and secure programming, much like how knowledge of computer architecture is crucial for writing an efficient interpreter.
To study OS is to study the design of large software systems
E.g. we will learn the client-server model and how its equivalent is implemented in an operating system
An OS virtualizes a physical resource: it transforms a resource into a general and powerful virtual form of itself
The OS as a referee
The OS as an illusionist
The OS as a glue
A core kernel that runs all the time in the kernel mode (supervisor mode/privileged mode)
System programs and daemons, which come with the operating system but are not part of the kernel
systemd is the system daemon in Linux that starts all the other system daemons
Middleware, which "connects" additional services to application developers, e.g. databases, multimedia, graphics
How does the OS make it easier to use and program a computer?
Recall that the CPU fetches and runs instructions one at a time, using the registers as its workspace. However, data doesn't just exist in the registers, but in memory and storage as well.
Because programs benefit from fast memory, it is economical to organize memory into a hierarchy of levels, each making a different tradeoff between cost (power) and speed. A clever compiler (or programmer) can make good guesses about which memory levels to use, so having more levels to choose from is generally more efficient.
A single-processor system contains one general-purpose processor with one processing core for user processes. In contrast, a multiprocessor system contains multiple single-core CPUs, allowing multiple processes to run at the same time.
A multicore system has multiple cores residing on a single processor chip. Each core has its own L1 cache, but the L2 cache and below are shared between cores. Because less distance needs to be travelled, multicore systems are more efficient than their multi-chip counterparts.
An operating system performs multiprogramming by keeping multiple processes in memory at the same time. The OS schedules the processes by picking one to run on the CPU, waiting until it terminates or has to wait, then picking another one, etc. This increases CPU utilization because a CPU core is less likely to be idle if it has multiple processes to choose from.
An interrupt is a signal sent by a device controller to the CPU's interrupt-request line(s) that informs the CPU that an event (e.g. an I/O request) has occurred.
There are actually two interrupt-request lines: the maskable one for device controllers, etc., and the non-maskable one for "urgent" interrupts like hardware faults. This way, if critical code is running, maskable interrupts can be deferred until that code is complete.
Interrupts have a priority level. The CPU will defer the handling of a low-level interrupt if a high-level interrupt is raised.
The CPU polls the interrupt-request line after each instruction. If an interrupt has occurred, the CPU stops what it was doing, catches the interrupt, and uses the in-memory interrupt vector to dispatch the appropriate interrupt-handler routine. The handler clears the interrupt by servicing the device.
If there are more interrupt handlers than addresses available in the interrupt vector, interrupt chaining can be used. Instead of pointing to a handler directly, an entry in the interrupt vector points to the head of a linked list of interrupt handler addresses. Each handler in the list is called in turn until one of them services the interrupt.
An exception/trap is a software-generated interrupt that may be generated by an error (e.g. division by \(0\)) or a user request for an OS service (e.g. a system call or ecall).
Multitasking extends this idea to the process level by letting the CPU execute multiple processes at the same time; this is done by switching between them frequently. Choosing the next process to run requires CPU scheduling.
Lecture 3 slides Linking and Loading article
System calls are provided by the kernel to application programs
unistd.h
The following tasks require a (UNIX) system call to be executed. Consider how often your C programs do one of these things.
File operations: open(), close(), read(), write(), stat(), lseek(), link(), etc.
Permissions: chmod(), chown(), etc.
Process control: getpid(), fork(), exec(), wait(), exit(), etc.
The strace Linux command can be used to trace which system calls are executed when an application is run.
Generally, system calls are called through a C library wrapper instead of directly. These functions are more convenient because they handle the steps around making the syscall, e.g. trapping to the kernel mode.
SYS_write via syscall() from syscalls.h calls the syscall directly, whereas calling write() from unistd.h calls it indirectly.
The POSIX API is a standard API across UNIX-based systems for system calls. This interface provides functions to interact with system calls that abstract away the low-level details of making the particular syscall on a particular system.
Thus, code calling this API is portable between (POSIX-compliant) systems because the implementation of a particular POSIX function is chosen for the system it runs on.
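A minimal sketch of both paths, assuming Linux/glibc, where syscall() is declared in unistd.h and the SYS_write constant in sys/syscall.h:

```c
#include <unistd.h>      // write() wrapper and syscall()
#include <sys/syscall.h> // SYS_write constant (on Linux)

int main(void) {
    // indirect: the libc wrapper handles trapping into the kernel for us
    write(STDOUT_FILENO, "via wrapper\n", 12);
    // direct: invoke the raw system call by number ourselves
    syscall(SYS_write, STDOUT_FILENO, "via syscall\n", 12);
    return 0;
}
```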
Here is where the philosophy of operating systems diverges from that of the languages they are written in; they do not give you enough rope to hang yourself with
An operating system has two (or more) operating modes that grant different levels of access to the hardware. Kernel mode grants the full privileges of hardware, whereas user mode does not. This design exists to protect the OS from user programs, and to protect user programs from each other.
The purpose of a system call is to allow a user program to run privileged instructions; access to these is mediated by the operating system. So, the underlying services of the OS can only be accessed through the "interface" (protection boundary) of an operating system in order to protect the hardware.
The computer's hardware has at least one status bit that indicates the current mode (i.e. user or kernel). Different hardware may have different modes → different bit configurations. So, the protection of privileged instructions happens at the hardware level.
Before context switching (switching processes), the OS loads a base/relocation register with the smallest legal physical memory address of the process and a limit register with the measure of memory allocated to the process. Then, in user mode, each reference (instruction and data address) is checked to ensure it falls between the base and the base+limit addresses.
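A rough sketch of that check, as a simplification of what the hardware does on every reference:

```c
#include <stdint.h>

// a reference is legal only if it falls in [base, base + limit);
// otherwise the hardware traps to the OS (simplified model)
int reference_is_legal(uint64_t addr, uint64_t base, uint64_t limit) {
    return addr >= base && addr < base + limit;
}
```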
It is possible that a process never waits for I/O, and thus would continue running until termination (in the best case) without ever giving up the CPU. To ensure the kernel gets regular control, a timer is set that interrupts the CPU at regular intervals (e.g. \(100\) microseconds).
On each timer interrupt, control returns to the OS, which can choose a new process to execute.
A compiler converts a source file (e.g. ASCII text) into an object file that corresponds directly to the source file
gcc -c main.c → main.o. The source is first preprocessed (gcc -E), then compiled, then assembled into the object file.
A linker combines the object files and specific libraries into a single binary executable
gcc -o main main.o -lm → main (executable)
A loader brings the executable file created by the linker into memory. Specifically, the loader adds the code and data from the executable to the process memory. The loader is invoked via the execve() syscall.
ELF (Executable and Linkable Format) is the standard executable file format of UNIX systems.
The ELF Relocatable File contains the compiled code and a symbol table that contains metadata about functions and variables in the program; this is used to link the object file with other object files.
The ELF Executable File contains the address of the first instruction of the program; this can be loaded into memory
Every ELF file starts with the magic number 0x7F 0x45 0x4C 0x46 (the last \(3\) bytes spell "ELF" in ASCII).

| Name | Contents |
|---|---|
| .text | Machine code (instructions). The content of this can be viewed with objdump -drS objfile on UNIX systems |
| .data | Initialized global variables |
| .bss | (block storage start) Uninitialized global variables; this section occupies no space in the object file |
| .rodata | Read-only data, e.g. constant strings |
| .symtab | Symbol table |
| .rel.text, .rel.data | Relocation information for functions and global variables that are referenced, but not defined in the file (i.e. external). The linker modifies these sections by resolving external references |
make my own?
UNIX has a monolithic structure: everything in the kernel is compiled into a single static binary file that runs in a single address space. As a result, the communication with the kernel is fast and syscalls have little overhead.
Linux is also monolithic, with a core kernel and additional dynamically loaded kernel modules for services like device drivers, file systems, etc.
Applications use glibc, the GNU version of the standard C library; this is where POSIX functionality is provided. glibc is the syscall interface to the kernel.
The marvel of software engineering that is the linux kernel is available online. Its directories are include (public headers), kernel, arch (hardware dependent code), fs (filesystems), mm (memory management), ipc (interprocess communication), drivers, usr (user-space code), lib (common libraries).
Darwin follows a layered structure with a microkernel, BSD (threads, command line, networking, file system), and user-level services (GUI). Layered structures are simpler to construct and debug, since each layer uses only the functionality of the layers below it; on the other hand, defining the layers well is difficult, and crossing many layers adds overhead.
The microkernel structure is characterized by a small kernel providing basic functionality (e.g. virtual memory, scheduling) and interprocess communication (e.g. message passing through ports). Other OS functionality is provided through user-level processes.
Windows NT is also layered, with the Windows executive (Ntoskrnl.exe) providing core functionality. The kernel(s) exist between the hardware abstraction layer and the executive services layer (where the Windows executive is). The kernel handles thread scheduling and interrupt handling.
The native Windows API is undocumented, so code requiring OS services (everything) runs on an environment subsystem with a documented API, like Win32, POSIX, or OS/2. WSL also fits in here, I imagine.
A process is a program being executed in an environment with restricted rights. The process has resources (e.g. CPU registers, memory to contain code, etc) and encapsulates one or more threads sharing process resources.
(PID stands for "Process ID".) Note that processes are not programs; different processes might run different instances of the same program, e.g. multiple instances of a web or file browser.
A process contains a stack, a heap, and static text and data sections. The heap is grown with the malloc() library function or the sbrk() system call. Since both the stack and the heap are able to grow over a program's execution, they grow into the same unallocated space. Specifically, the stack grows downwards from the top of memory whereas the heap grows upwards, with respect to the memory addresses.
Each process has a distinct and isolated address space that defines which memory addresses it can access. No process can write to the memory of another process. The address space contains virtual addresses, which are translated to physical addresses by the hardware.
Code is compiled as if the program starts at address 0x00000000; the addresses must be adjusted if the program is relocated.
A thread is a sequential execution stream of instructions. A process may have multiple threads; these threads share the address space of the process as well as the heap, text, data, etc. However, each thread has its own registers, program counter, and stack.
The OS keeps track of the state of each process's execution in a process control block (PCB). This is a kernel data structure in memory that represents the context (runtime information) of the process. This includes:
the process state: running | ready | blocked/waiting | etc.
the CPU registers: PC, SP, HP (heap pointer), base and limit registers, page-table register, general-purpose registers
In Linux, the PCB is represented by the C structure task_struct, which is defined in <linux/sched.h>
Before a program is loaded into memory, the loader creates a PCB, address space, stack, and heap for the process. It also pushes argc and argv to the stack. Then, the registers and program counter are initialized. Finally, the program and data are loaded into memory.
Although only one process can be running on a core at a given time, the OS is "juggling" many processes, which may be in ready or waiting states. Active processes are placed in queues to be dealt with, such as the ready queue and the device (I/O) queues.
The queue may change due to a process action (e.g. termination or making a syscall), OS actions (scheduling), or external actions (hardware interrupts).
A process is a zombie process if it has terminated, but its parent hasn't read its exit status yet.
A process becomes an orphan process if its parent is terminated while the process is still running.
The scheduler maintains a doubly linked list of PCBs that are moved between the job queue, ready queue, and the device queues. It regularly selects a process from the ready queue to run; the algorithm defining this selection may be tuned to prefer fairness, minimum latency, or specific guarantees.
Overall, the scheduler selects the process to run next, whereas the dispatcher actually executes the context switch.
Context switching is the act of stopping a process and starting another one; the context contains the values of the CPU registers, the process state, and information about the memory.
Context switching is expensive because it requires loading the process' hardware registers (e.g. PC) from the PCB when the process starts and saving them to the PCB when the process ends.
How can we create, control, and terminate processes?
Go fork yourself
// returns the current PID
getpid()
// duplicates the current process (a new PID is assigned to the child process)
fork()
// loads a new binary file into memory without changing its PID
// this is the loader command from earlier
execve()
// waits until one of its child processes terminates
wait()
// waits until the specified child process terminates
waitpid()
// terminates a process
// note: _exit and exit are different things; this has CMPUT 379 assignment implications
_exit()
// causes the calling process to sleep until a signal is delivered that either terminates the process or causes the invocation of a signal-catching function
pause()
// suspends execution of a process for at least the specified time (can be interrupted by a signal that triggers the invocation of a handler)
nanosleep()
// sends a signal (interrupt-like notification) to another process
kill()
// sets handlers for signals
sigaction()
The fork (fork()) syscall creates a child process that inherits a copy of its parent's memory, file descriptors, CPU registers, etc. Both the parent and child process will execute after the fork instruction; either one might run first (no guarantee).
fork returns an integer \(n\) (technically of type pid_t): in the child, \(n = 0\); in the parent, \(n\) is the PID of the new child; on failure, \(n = -1\) and no child is created.
systemd is the root process of the entire "fork tree", with a PID of \(1\).
Presumably, the PID of the child is kept track of on the stack somewhere, lest a zombie process be created.
#include <stdio.h>
#include <unistd.h>
int main() {
// COMMON TO ALL PROCESSES
// fork 3 times
fork();
fork();
fork();
printf("hello\n");
// "hello" is printed 2^3 = 8 times
return 0;
}
Note that the forked child processes can "see" everything, i.e. the code is running "multiple" times. This means we need to check if we're in the parent or child process when running code. It also means that anything before the fork is visible to both the parent and child processes; this lets us create shared structures like pipes, shared memory, etc.
So, DON'T FORK IN MULTITHREADED PROCESSES!
A forked child keeps running the parent's program, unless we call execve immediately afterwards.
execve
The exec (execve()) system call allows a process to load a different program and start executing it at main. The argument array (argv, from which the new program derives argc) must be passed to the call, along with an environment array.
We usually call execve after calling fork. Note that this makes all the memory copied by fork useless immediately.
We can use vfork if execve is called immediately; this doesn't copy all the memory over from the parent process. We won't need access to that memory if we call execve immediately, since it replaces the memory anyway.
wait
A parent process can execute concurrently with its children or wait until some or all of them terminate; this is achieved with the wait() syscall, which puts the parent to sleep until a result is returned.
The child calls exit() to indicate it is time to return to the sleeping parent process. The child's termination raises a SIGCHLD signal to unblock the parent and return the child's value and PID. If zombie children already exist, wait() returns immediately: one of their PIDs is returned and the zombie is reaped.
_exit
The termination of a process is the final reclamation of its resources by the OS. All open files are closed and the process's memory is deallocated.
A process is usually terminated by using return in main, but it can be terminated explicitly by calling exit() (C library function) or _exit() (syscall). Note that exit() is called indirectly when main returns.
A process can also terminate abnormally with abort(), which generates the SIGABRT signal. Any functions registered with atexit() are not called, open files are not closed, etc. Not ideal!
A process may terminate its own child with the kill() system call, which takes as parameters the child's PID and the signal to use, in this case SIGKILL
The kill command uses the SIGTERM signal by default. A related command is killall, which kills by name instead of process ID.
A process becomes a zombie when it has terminated, but its parent has not yet called wait().
A process becomes an orphan if its parent terminates without invoking wait().
An orphan is adopted by the init (PID = \(1\)) process as its parent. We can check for this with if(getppid() == 1) in C. init repeatedly calls wait to allow the exit statuses of any reparented orphans to be collected.
The priority of a process can be adjusted with nice(); nice(incr) increases the nice value of the process by incr. A lower nice value has a higher scheduling priority; a process is "nice" by giving up some of its CPU share.
The ptrace system call lets a process intercept the system calls of another process. This lets it check arguments, set breakpoints, peek at registers, etc.
The sleep call puts the process on a timer queue to wait for a certain amount of time; this allows alarm functionality to be implemented.
In short, a shell is a process control system that lets programmers create and manage processes to do tasks. Logging into a (UNIX) machine starts a shell process.
When the shell runs a command, fork and exec are called implicitly, as a pair.
fork and exec
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
int main() {
//...
pid_t ppid = getpid(); // store parent’s pid
pid_t pid = fork(); // create a child
if(pid == 0) {
// child continues here
printf("Child pid: [%d]\n", getpid());
// ...
} else if (pid > 0) {
// parent continues here
printf("Parent pid: [%d] Child pid: [%d]\n", ppid, pid);
// ...
} else {
perror("fork failed!");
exit(1);
}
// ...
}
We can run the ps command to check the processes' IDs. The -el flag increases the information listed; the -u flag lets processes be filtered by the user that created them.
fork and wait
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
int main() {
// ...
pid_t ppid = getpid(); // store parent’s pid
pid_t pid = fork(); // create a child
if(pid == 0) { // child continues here
// ...
} else if (pid > 0) { // parent continues here
// ...
int child_status;
pid_t cpid = wait(&child_status);
} else {
perror("fork failed!");
}
// ...
}
fork and exec
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
int main() {
// ...
pid_t ppid = getpid(); // store parent’s pid
pid_t pid = fork(); // create a child
if(pid == 0){
// child continues here
// build the argument array; mark the end with a null pointer
char *args[] = {"/bin/ls", "-l", NULL};
execve("/bin/ls", args, NULL);
/* exec doesn’t return on success!
so if we got here, it must have failed! */
perror("exec failed!");
} else if (pid > 0) {
// parent continues here ...
int status;
pid_t cpid = wait(&status); // pass NULL if not interested in exit status
if (WIFEXITED(status)) {
printf("child exit status was %d\n", WEXITSTATUS(status));
}
} else {
perror("fork failed!");
}
//...
}
Welp, I'm on a watchlist now
#include <signal.h>
#include <unistd.h>
#include <stdio.h>
int main() {
// ...
int ppid = getpid(); // store parent’s pid
int pid = fork(); // create a child
if(pid == 0) { // child continues here
sleep(10); // child sleeps for 10 seconds
// ...
exit(0);
} else {
// parent continues here
// ...
printf( "Type any character to kill the child.\n" );
char answer[10];
fgets(answer, sizeof answer, stdin); // gets() is unsafe; fgets bounds the read
if (!kill(pid, SIGKILL)) {
printf("Killed the child.\n");
}
}
}"Kill" is the word associated with sending signals. For some reason.
Signals are software interrupts; they provide a way of handling asynchronous events by sending extremely short messages between processes (and are thus a form of IPC). Each signal has an associated action.
Signals are defined as integer constants in the <signal.h> header; their textual representation always starts with "SIG", e.g. SIGKILL is \(9\).
kill -l can be used to list all the signals on some architectures.
A signal is posted if the event that triggers it has occurred, delivered if the action associated with it is taken, pending if it has been posted but not delivered yet, and blocked if the target process does not want it delivered.
A signal can be generated by
special terminal keys: ctrl-C → SIGINT (interrupt), ctrl-Z → SIGTSTP (stop), ctrl-\ → SIGQUIT (quit)
the kill command or syscall, which is how we manually send a signal to a process
software conditions, e.g. SIGURG, SIGPIPE, SIGALRM
A process can send a signal to another process or a group of processes using the kill syscall. However, this can only happen if the user ID of the receiving process is the same as the one of the sending process (i.e. permission is required).
A signal carrying extra data can be sent with the sigqueue function. A process can send a signal to itself with the raise function; raise(sig) is roughly equivalent to kill(getpid(), sig). This is why we say processes "raise" exceptions; these are self-sent.
A process can wait for a signal by calling pause; this puts the process to sleep until a signal is caught.
The alarm function can be used to generate a signal for this purpose.
A process may have a signal disposition, which determines how it responds to signals.
SIG_DFL: let the default action happen
SIG_IGN: ignore the signal (except for signals that can't be ignored, like SIGKILL and SIGSTOP)
A signal handler is a function that takes a single integer as an argument and returns nothing, i.e. everything is accomplished through side-effects.
sighandler_t signal(int signum, sighandler_t handler);
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
int i;
void quit(int code) {
fprintf(stderr, "\nInterrupt (code= %d, i= %d)\n", code, i);
}
int main (void) {
if(signal(SIGQUIT, quit) == SIG_ERR)
perror("can't catch SIGQUIT");
// loop that wastes time in a measurable way
for (i= 0; i < 9e7; i++)
if (i % 10000 == 0) putc('.', stderr);
return(0);
}
This behavior varies across different UNIX versions though.
The action associated with a signal may be examined or modified using sigaction, which supersedes the signal function from earlier UNIX releases.
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <signal.h>
void signal_callback_handler(int signum) {
printf("Caught the signal!\n");
// uncommenting the next line breaks the loop when signal received
// exit(1);
}
int main() {
struct sigaction sa;
sa.sa_flags = 0;
sigemptyset(&sa.sa_mask);
sa.sa_handler = signal_callback_handler;
sigaction(SIGINT, &sa, NULL); // we are not interested in the old disposition
// sigaction(SIGTSTP, &sa, NULL);
while (1) {}
}
sigprocmask may be used to block signals in a signal set from getting delivered to a process.
The bit vector representing the signal set may be modified with sigemptyset(), sigfillset(), sigaddset(), sigdelset(), and sigismember()
When a process forks, the child inherits the parent's signal dispositions and signal mask; signal handlers are also defined in the child because it has inherited its parent's memory.
When a process execs, caught signals are reset to their default disposition, because the handler functions no longer exist in the new program's memory; all other signal statuses are left alone and the signal mask is preserved.
Cooperating processes work with each other to accomplish a single task; this improves performance (e.g. increases parallelism) and program structure
Information sharing (i.e. IPC) is likely required between cooperating processes; it is on the OS to make this happen. The two fundamental approaches to this are message passing using a message queue and shared memory.
Message passing requires kernel intervention through system calls. Sharing is clear and easy to track, but requires more overhead in the form of copying data and crossing domains.
Implementation: a message queue (implemented as a linked list of messages) is stored in the kernel; a message queue identifier pointed to this list is shared with all cooperating processes.
send(consumer_pid, next_producer) and receive(producer_pid, next_consumer) syscalls.

| Property | Options | Description |
|---|---|---|
| Direction | simplex / half-duplex / full-duplex | Which way the message travels (half-duplex: both ways, but one at a time) |
| Boundaries | datagram / byte stream | Datagram has message boundaries; byte stream does not |
| Connection model | connection-oriented / connectionless | Connection-oriented: recipient is specified when the connection starts, so it doesn't need to be re-specified. Connectionless: each send operation has a parameter that contains its recipient |
| Reliability | | Can messages get reordered? corrupted? lost? |
Blocking: message passing may involve blocking operations, i.e. operations that force the sender and receiver to wait until the message has been received or is available, respectively.
Shared memory is achieved by establishing a mapping between a process' memory space to a named memory object, which is shared among processes. The processes that need access to the memory are forked after the memory is created so that they know the memory object's name.
Each process calls the mmap() function to use the shared memory.
POSIX uses memory-mapped files; each region of the shared memory is associated with a file. So, shared memory objects are implemented with files. We interact with the SMO through the following syscalls:
shm_open(const char* name, int O_flag, mode_t mode) creates and opens a shared memory object; it returns a file descriptor that refers to the newly created object (since it is indeed a file)
ftruncate(int file_desc, off_t length) lets us set the size of the shared memory object
shm_unlink(const char* name) removes the shared memory object by de-allocating (and destroying) the contents of the associated region of memory
Finally, the mmap syscall establishes the memory-mapped file that contains this shared memory object (we can unmap it with munmap). mmap takes parameters for the address of the shared memory section, the length (size) of the section, prot (the memory protection flags, e.g. PROT_READ and PROT_WRITE), flags (e.g. MAP_SHARED), and an offset into the file.
The producer-consumer problem (a.k.a. the bounded buffer problem) arises any time we have a process (the producer) writing to a buffer and a different process (the consumer) reading from the same buffer. This pattern is common; a canonical example is a shared pipe.
The producer-consumer problem requires the producer and consumer to communicate, i.e. inter-process communication is required. So, we can implement it in terms of message passing or shared memory.
We defer the management of the message queue to the kernel, using send and receive as an interface to interact with the queue. Each process needs to be able to name (identify) the other processes to communicate.
We have while(true) loops here because we are always waiting for a new update from the buffer; this is a common pattern in systems that wait for interruptions and/or have blocking processes, e.g. user input (e.g. dragonshell) or waiting for the message queue.
int main() {
//...
if(fork() != 0) {
producer();
} else {
consumer();
}
}
int producer() {
//...
while(true) {
//...
next_p = /* produced item */;
send(C_pid, next_p);
//...
}
}
int consumer() {
//...
while(true) {
//...
receive(P_pid, &nextc);
// consume nextc
//...
}
}
this code doesn't quite make sense to me
We use the mmap system call to create shared memory. We store the buffer in this shared memory and manipulate it directly. As such, we need to keep track of some variables and constants:
N (constant) is the size of the buffer. In this example, we use a circular buffer, i.e. one where we loop back around to the start when we reach the end (typically implemented with a modulus operator).
in points to the next free location and out points to the next full location (i.e. where we should read from).
int main() {
//...
// map a shared region for the buffer; the real call takes
// (addr, length, prot, flags, fd, offset), e.g.
// prot = PROT_READ | PROT_WRITE and flags = MAP_SHARED
mmap(/*...*/, PROT_READ | PROT_WRITE, MAP_SHARED, /*...*/);
in = out = 0;
if(fork() != 0) producer(); else consumer();
}
int producer() {
//...
while(true) {
//... (produce the item however)
nextp = /* produced item */;
// wait while the buffer is full (no free slot to write into)
while((in + 1) % N == out) {}
// if we're here, (in + 1) % N != out, so we write to the buffer
buffer[in] = nextp;
in = (in + 1) % N;
}
}
int consumer() {
//...
while(true) {
//...
// wait here until there is something to read from the buffer
while(in == out) {}
// read from buffer
nextc = buffer[out];
// update out; we implement a circular buffer with the modulus operator
out = (out + 1) % N;
// consume the next item however
}
}
To create the producer, we use shm_open to create the shared memory file descriptor, then call ftruncate to establish the size of the shared memory. Then, we use mmap to create the shared memory object and get a pointer to it. Finally, we use sprintf to write to the shared memory; we increment the pointer to make sure we're writing to the right place.
The consumer does all the exact same setup as the producer: shm_open and mmap (among others) with the same name. However, instead of writing to the shared memory, we read from it, e.g. by passing the mapped pointer to printf. Finally, we call shm_unlink to remove the shared memory object when we're done.
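A minimal sketch of the producer side using these calls; the object name "/379-shm" and the size are arbitrary choices, and error handling is omitted:

```c
#include <fcntl.h>    // O_* constants
#include <stdio.h>
#include <sys/mman.h> // shm_open, mmap
#include <sys/stat.h> // mode constants
#include <unistd.h>   // ftruncate

int main(void) {
    const char *name = "/379-shm"; // illustrative object name
    const size_t size = 4096;

    // create and size the shared memory object
    int fd = shm_open(name, O_CREAT | O_RDWR, 0666);
    ftruncate(fd, size);

    // map it into our address space
    char *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    // write to the shared memory, advancing the pointer as we go
    ptr += sprintf(ptr, "hello ");
    ptr += sprintf(ptr, "world");
    return 0;
}
```

Older Linux systems may need -lrt when linking; the consumer would shm_open and mmap the same name, read from the returned pointer, then shm_unlink.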
Note: the notes also have examples of this in System V; I'm not sure these are as directly relevant so I didn't include them. But, they are useful for comparison of system design, so I've made a note that these are indeed in the slides.
A pipe is a channel established between two processes to communicate; the data flows through the kernel, so a pipe is a (particularly lightweight) form of message passing. Pipes imply a producer/consumer relationship.
An ordinary pipe (a.k.a. anonymous pipe) also implies a parent-child relationship: the parent creates a pipe, then forks a child process that has access to the pipe too. These are the only processes with access to the pipe.
In UNIX, pipes are implemented as special types of files. Thus, they are identified by file descriptors and are manipulated with the read and write syscalls. Namely, we read to consume content and write to produce content. This data pipe flows through the kernel.
Reading from a pipe whose write end has been closed returns 0 to indicate EOF; writing to a pipe whose read end has been closed raises the SIGPIPE signal.
// structure to store the necessary file descriptors
int fd[2];
// create a pipe
pipe(fd);
// the file descriptor for the READ end is stored in fd[0]
// the file descriptor for the WRITE end is stored in fd[1]
A file descriptor is an integer index that identifies an open file in a file descriptor table; each process has its own table, but descriptors (and the open-file entries behind them) are inherited across fork.
0 is the file descriptor for stdin, 1 for stdout, and 2 for stderr; these have the symbolic names STDIN_FILENO, STDOUT_FILENO, STDERR_FILENO.
File descriptors can be obtained via a syscall by opening a file/pipe/socket/etc, or by inheritance from a parent process.
We can implement I/O redirection by closing a standard stream and re-allocating the corresponding descriptor to another file (or pipe). But there are better ways to do this
// close the standard input
close(STDIN_FILENO);
// open my file; open returns the lowest unused descriptor, which is now 0 (stdin)
open("/path/to/my/file", O_RDONLY);
// we can do this because input is abstracted as a file anyway
// so feeding in a file directly works since it uses the same interface
Instead, we can duplicate a file descriptor with the dup2 system call: dup2(oldfd, newfd) makes newfd refer to the same open file as oldfd, closing whatever newfd referred to first if necessary. This lets us redirect a standard stream in a single step, without separately closing it and hoping the right descriptor is reused.
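A small sketch of redirection done this way (the file name out.txt is illustrative):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    dup2(fd, STDOUT_FILENO); // descriptor 1 now refers to out.txt
    close(fd);               // the duplicate keeps the file open
    printf("this line ends up in out.txt\n");
    return 0;
}
```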
The "pipes as a file" pattern is common in UNIX: many things are implemented as files, e.g. devices, sockets, pipes, actual files, etc. YOU get a file and YOU get a file and YOU get a file.
We use the same syscalls (open, read, write, and close) to interact with all of them.
To pipe to another process, we would need to create the pipe (i.e. the file descriptors), fork a child process, close the unused ends of the pipe for both processes, call execve to run the command we want in the child process, then wait for the command to terminate. This is annoying and tedious to implement. So, popen() does exactly this.
popen() returns a standard I/O file pointer from which we read the standard output of the child process. Then, pclose() closes the standard I/O stream, waits for the command to terminate, and returns its termination status.
#include <stdio.h>
#include <unistd.h>
#define LINESIZE 20
int main (int argc, char *argv[]) {
// set up constants and buffer
size_t size=0;
char buf[LINESIZE];
FILE *fp;
// open the pipe
// provide the command we will execve
// we can provide "w" as the second parameter to write to the command's stdin instead
fp = popen("ls -l", "r");
// continue reading from the pipe as long as we can;
// fgets stores each line in the buffer
while(fgets(buf, LINESIZE, fp) != NULL)
// echo each line from the buffer to standard out
printf("%s\n", buf);
// close the pipe
pclose(fp);
return 0;
}
Named pipes (FIFOs in UNIX) generalize ordinary pipes by allowing bidirectional communication between any (possibly several) processes, not just parent-child pairs. Essentially, a named pipe is a named file that any process can read from or write to.
In UNIX, a FIFO is created with the mkfifo syscall and is manipulated like an ordinary pipe, i.e. with open, read, write, close.
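A minimal sketch of the writer side, assuming an illustrative path /tmp/myfifo (a reader would open the same path with O_RDONLY and read):

```c
#include <fcntl.h>
#include <sys/stat.h> // mkfifo
#include <unistd.h>

int main(void) {
    mkfifo("/tmp/myfifo", 0666);            // create the named pipe (a special file)
    int fd = open("/tmp/myfifo", O_WRONLY); // blocks until a reader opens the FIFO
    write(fd, "hello\n", 6);
    close(fd);
    return 0;
}
```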
Aside: how is a named pipe functionally different from shared memory? they seem really similar
In BASH, we use the character | to open a pipe between two processes; BASH will handle the opening and closing. This pipe redirects the standard output of the left process into the standard input of the right process.
p1 | p2 | p3 …
# the pipe identity
./process_1 > temp.file && ./process_2 < temp.file
# equivalent to
./process_1 | ./process_2
Internally, the shell wires this up by duplicating file descriptors (as with dup2 above).
e.g. the hardcoded pipeline w | wc -w counts the words in the output of w
A distributed system is a set of loosely coupled (often physically separate) nodes connected via a communication network, e.g. the internet. Each node has an operating system, as well as its own resources, e.g. a CPU, etc.
Nodes may be configured in a client-server configuration (the server has a resource the client wants to use), a peer-to-peer configuration (each node acts as both a client and a server), or a hybrid model.
Why make a system distributed?
A distributed operating system is capable of migrating data, computation, and processes between its nodes without its user needing to be aware of its internal function. A high level of parallelism is needed to achieve this.
Robustness is the ability of a system to withstand failures. Systems should be fault-tolerant, i.e. able to withstand certain amounts of certain types of failure.
Transparency is the degree to which the system appears to its users as a conventional, centralized system. The system should act the same way from every access point, e.g. files stored on remote servers should appear like those stored locally to the system's user.
Scalability is the ability of a system to adapt to the increased load from accepting new resources. A system should respond gracefully, i.e. not slow down significantly or fail under increased load.
Consistency is the ability of a system to ensure that its cached local data is consistent with the master data source.
A network is a set of communication links that let multiple computers communicate via packets (the atomic unit of data transmission) through a network interface. A protocol (common set of communication rules) must be adopted across the system.
A local area network (LAN) covers a small geographic area, e.g. a building. LANs connect computers with common peripherals (e.g. printers, storage servers, etc); they must be fast and reliable to be effective. Nodes may connect to a LAN via ethernet or wifi.
A wide area network (WAN) connects geographically separate sites across the world. Its links are serious physical infrastructure, e.g. telephone lines, data lines, optical cables, satellite channels, etc. Due to the distances involved, WAN is slower and less reliable than LAN.
LANs are connected together into WANs.
Network data is divided into packets and flows between switching points; computers at these points control the packet flow. When a packet arrives at its destination, the computer is interrupted.
If packets need to travel between different LANs, a router needs to be used to direct traffic from one network to another. The router reads the packet header to determine the packet's destination, finds the closest node to the destination in a routing table, then forwards packets there.
Packets are accepted by processes, so a <host-name, PID> pair is needed to identify which host and process the packet is meant for. Each process in a system has a unique PID, and each system in a network has a unique name, so this pair alone can identify the process in question.
The domain name system (DNS) is a global distributed database system for resolving hostname-IP (internet protocol) address mappings. It maps human-readable names, e.g. gpu.srv.ualberta.ca, to IP addresses, e.g. the IPv4 address 129.128.5.180 (a 32-bit integer).
Internet communication is organized into 4 layers of abstraction
The application layer provides a way for processes to exchange data between themselves. The transport layer partitions data into packets, maintains the order of packets, and transfers packets between hosts. The internet layer routes these packets through the network; this involves encoding and decoding addresses. Finally, the link layer handles error transmission over physical media, including the required error correction and detection.
A packet can move through a LAN using only MAC addresses. To deliver a packet addressed by IP, the address resolution protocol (ARP) is used to resolve the IP address to a MAC address.
The ARP cache can be inspected with arp -a.
Every host has a name and an associated IP address; this may be an IPv4 address (32-bit) or an IPv6 address (128-bit).
The loopback address 127.0.0.1 refers to the local host.
Once the host has been located, the receiving or sending process is identified by a port number. This is part of the transport layer; the TCP and UDP protocols are responsible for this. A host has multiple ports → multiple processes can send and receive packets
Well-known services live on standard ports, e.g. FTP on port \(21\), SSH on port \(22\), etc.
UDP allows for fixed-size messages (datagrams) up to some maximum size. UDP is unreliable: packets may be lost or arrive out of order (although corruption is still checked). UDP is connectionless: no setup or teardown is required, i.e. the protocol is stateless
TCP is reliable and connection-oriented: it provides abstractions to allow in-order, uninterrupted streaming across an unreliable network. Connections are opened and closed with control packets. The opening consists of a three-way handshake (SYN, SYN + ACK, ACK).
When a host sends a packet, the receiver must send an acknowledgement packet (ACK). If this is not received before a timer expires, the sender will time out and retransmit the packet. The sender keeps track of the current set of packets sent but not yet acknowledged.
The sequence counter assigns an order to the packets; the receiver can use this to detect duplicates, i.e. cases where the original packet was received but its acknowledgement was lost in transmission, causing a retransmission.
Sockets are like pipes that go outside
Distributed computing often follows the client-server model: a server provides services (e.g. file service, database service, name service) that a client (process) might request.
A socket is a connection endpoint that provides an abstraction over a network I/O queue. Each endpoint is defined by an IP address and a port number. Communication between two processes requires two sockets, one for each process.
Sockets come in two common types:
A socket supports some set of address domains: address spaces tailored to some aspect of communication (e.g. INET for IP protocols, UNIX domains for processes on the same machine)
The socket API is declared in sys/socket.h.
We create a socket with the socket syscall; this returns a socket descriptor that identifies the socket that was created (or -1 on error).
int socket(int domain, int type, int protocol);
domain: AF_INET for IPv4, AF_INET6 for IPv6, AF_UNIX for the UNIX domain
type: SOCK_DGRAM for connectionless, SOCK_STREAM for connection-oriented messages, SOCK_SEQPACKET, etc.
protocol: UDP, TCP, ICMP, IP, IPv6, etc.; 0 sets the default for the given domain and type
Sockets work like files; in fact, since this is UNIX, sockets are abstracted by files (of course).
We can close, read, and write from sockets. dup can be used to duplicate a socket, and shutdown() can be used to disable a socket in one or both directions.
#include <sys/socket.h>
int sockfd;
// TCP
sockfd = socket(AF_INET, SOCK_STREAM, 0);
// sockfd= socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
// UDP
sockfd = socket(AF_INET, SOCK_DGRAM, 0);
// sockfd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
We can identify the process we want to communicate with by host name (e.g. IPv4 address) and/or by port number.
We call getaddrinfo(name, port_number, …) to get a list of addrinfo structures, each containing a local address that can be passed to bind().
struct addrinfo {
int ai_flags;
// socket domain
int ai_family;
// socket type
int ai_socktype;
// protocol
int ai_protocol;
socklen_t ai_addrlen;
// contains IP address and port number
struct sockaddr *ai_addr;
char *ai_canonname;
struct addrinfo *ai_next;
}
Calling socket doesn't assign an address to the socket: we either need to call bind to associate a port (must be larger than \(1024\)) or leave that choice to the OS when listen or connect is called.
getsockname() can be used to discover the address bound to a socket.
A connection is uniquely defined by the 5-tuple (source IP address, source port number, destination IP address, destination port number, protocol).
bind
A server will call listen (providing the socket descriptor and the number of connect requests to queue) to convert the socket to a listening socket, which starts allowing clients to connect.
Then, the server will call accept to create a new connection socket for a particular client connection (from the queue). Thus, accept returns a new socket descriptor for the connection socket so that the original socket is still available to receive requests.
accept blocks until a connection request arrives (unless it is in non-blocking mode).
Connection-oriented services (e.g. TCP) require a connection to be established first; this is done by calling connect, which connects a socket to the specified remote socket address.
connect establishes the connection between the two sockets; on error, connect returns -1.
We write data with send (connection-oriented) or sendto (connectionless), which take flags.
send blocks until all the data has been transferred; sendto requires a destination address for its connectionless socket.
We read data with recv (connection-oriented) or recvfrom (connectionless, requires a source address), which also take flags.
We can force recv to block until it has received as much data as we requested by using the MSG_WAITALL flag.
We can also use read and write to, well, read and write data. However, these aren't socket-specific, so they provide a lower level of abstraction and are more error-prone. The socket-specific functions deal with sockets more cleanly and safely.
We can use poll and select to check and wait for a descriptor to become ready for I/O; read and write also don't let us specify flags. Trying to send or receive data on a broken socket raises a SIGPIPE signal.
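Tying these calls together, a minimal sketch of a TCP server that accepts one connection and echoes one message back; the port 8379 is an arbitrary choice and error handling is omitted:

```c
#include <netinet/in.h> // sockaddr_in, htons, htonl
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    int listenfd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY); // any local interface
    addr.sin_port = htons(8379);              // arbitrary port > 1024

    bind(listenfd, (struct sockaddr *)&addr, sizeof addr);
    listen(listenfd, 5); // queue up to 5 pending connect requests

    // accept returns a NEW socket for this connection,
    // leaving listenfd free to receive further requests
    int connfd = accept(listenfd, NULL, NULL); // blocks until a client connects

    char buf[128];
    ssize_t n = recv(connfd, buf, sizeof buf, 0);
    if (n > 0) send(connfd, buf, n, 0); // echo the message back

    close(connfd);
    close(listenfd);
    return 0;
}
```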
When a local function makes a syscall, the function prepares the arguments and raises the syscall exception, which gets handled by the system call handler. This handler runs, produces the result in a register, then returns control to the function. Easy.
The idea of Remote Procedure Call (RPC) is to make calling syscalls on remote systems as easy as this.
rpc
We need RPCs to implement distributed systems, since a call could be made on any device, not just the one that the local process is running on.
The client sets up a message buffer, then reads the function identifier and arguments into the buffer (message serialization or marshalling). Next, the message is sent across a network to the destination RPC server. The client waits for the reply, then unpacks (unmarshalls) the code and returns it to the calling process.
Each message is addressed to an RPC daemon listening on a port on the RPC server. When a message arrives, it is unpacked (unmarshalled), the call is performed with the unpacked parameters, the result is re-marshalled, and the reply is sent back.
The RPC server needs a runtime library to handle the tasks related to accepting, processing, and sending the data.
How can we best utilize the CPU? I.e. how do we manage the ready queue?
Scheduling is difficult for a few reasons
In particular, processes consist of alternating CPU bursts and I/O bursts. A balanced process has a roughly equal composition of CPU and I/O bursts; an unbalanced process may favor one more than the other.
// Grammars on the mind
// Can you tell I'm also taking CMPUT 415 this term?
process: (IO_burst)? (CPU_burst IO_burst)+ (CPU_burst)?
A workload is a set of tasks for the system to run. Each task has a specific arrival time and burst length; if the system is a real-time system, it also has a deadline.
A performance metric is a criterion that can be used to compare scheduling policies. A scheduling policy is an algorithm that decides the order in which the tasks are executed.
Job scheduling (long-term scheduling) is the process by which the OS decides which and how many jobs should execute in main memory at the same time
CPU scheduling (short-term scheduling) is the process by which the OS picks the next process from the ready queue.
The kernel decides when the scheduler is run. In a non-preemptive system, the scheduler must wait for the running process to give up the CPU (by terminating or blocking), whereas a preemptive system can interrupt a running process to run the scheduler (a timer interrupt makes sure the scheduler gets run regularly).
Minimizing the average turnaround time is the same as minimizing the time it takes jobs to execute in general; this is the defining problem we are trying to solve. So, turnaround time is probably the most telling metric, and the one we focus on.
Clearly, it is not possible to optimize all of these criteria at the same time (or is it?). So, we choose a policy to optimize a metric, which may involve optimizing one or more of the following criteria:
I think I posted a reel to this song
For the scheduling algorithms we will cover, we assume each user has a single process, which has a single thread, and the machine has one processing core. All processes are assumed to be independent.
These are not realistic assumptions for modern computers. However, developing scheduling algorithms under relaxation of the above assumptions is still largely an open problem.
Part II contains algorithms. Yay!!
make some cool diagrams for this!
First-come-first-served scheduling
Algorithm
Tasks run in the order they arrive, each until it terminates or blocks. So, the FCFS scheduler runs only when a job terminates or gets blocked because it is waiting for I/O (i.e. it is non-preemptive).
This algorithm is simple to implement and has low overhead.
Since the order of tasks remains unchanged, short jobs can get "stuck" behind long jobs. So, the wait time is highly variable. As the variance of tasks sizes increases, the average response time of FCFS gets longer.
FCFS isn't fair: a long task occupies the CPU for time directly proportional to its length, by definition, and everything behind it must wait that long.
Finally, overlap between the CPU and I/O is clunky because CPU-heavy processes will force I/O-heavy processes to wait, leaving the I/O devices idle for (possibly long) periods of time. This degrades the experience of the user.
Round-Robin Scheduling
Algorithm
Each task runs for at most a fixed time quantum \(Q\); when the quantum expires, the task is preempted and moved to the back of the ready queue. If \(Q\) is too long, waiting time increases and round-robin degenerates to FCFS as \(Q \to \infty\), since processes are preempted less and less. If \(Q\) is too short, a higher percentage of time is spent on context switching (the cost per switch is constant, so total overhead grows with the number of switches), increasing overhead.
Round-robin is pretty good; most systems today use it.
Round-robin is fair, by definition.
If tasks are equal in size, average waiting time can suffer if there is "constructive interference" between the quantum length (\(Q\)) and the length of the tasks. Namely, if the task length is slightly longer than a multiple of \(Q\), then tasks that are almost done are added back to the queue. Next time the scheduler passes through the queue, each task ends quickly, increasing context-switch overhead.
SJF/SRTF Scheduling
Algorithm
Run the job with the shortest next CPU burst first; SRTF (shortest remaining time first) is the preemptive variant. Since burst lengths aren't known in advance, we generally apply the concept of temporal locality: we estimate the next burst length using the previous burst lengths we already saw. In particular, the exponential moving average estimator \(\hat{b}\), defined by \(\left\{ \begin{array}{ll}\hat{b}(1) := b(1) \\ \hat{b}(t) := \eta\, b(t-1) + (1-\eta)\, \hat{b}(t-1) & \text{ for } \eta \in (0, 1] \end{array}\right.\) where \(b(t)\) is the observed burst length at time \(t\).
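A one-line sketch of this estimator in C (the function and parameter names are illustrative):

```c
// exponential moving average of burst lengths: weight the most recently
// observed burst by eta and the running estimate by (1 - eta)
double next_burst_estimate(double eta, double observed_burst, double prev_estimate) {
    return eta * observed_burst + (1.0 - eta) * prev_estimate;
}
```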
SJF is provably optimal for minimizing average waiting time, and is compatible with both preemptive and non-preemptive schedulers.
It is not possible to accurately predict the amount of CPU time a process will need ahead of time (so then why is this here? Is it impossible, or are we just bad at it? Is there a bound on how good we can get at it? Is it an active area of research?). Under this assumption, SRTF degenerates into picking tasks pseudo-randomly.
If new short jobs keep arriving in the queue, a long-running CPU task may starve, i.e. never be allocated resources.
aside: do this! it's cool.
Can we design a policy that minimizes response time, is fair and starvation-free, with low overhead?
We can annotate each task with a (usually) user-defined priority that indicates how important it is. Then, we keep multiple queues, one for each priority level.
However, if the scheduler simply picks from the highest priority queue that isn't empty, jobs are prone to starvation again. Instead we can
These solutions may increase average response time, but optimize fairness.
If a high-priority task is blocked by a lower-priority task, the high-priority task may grant its high priority to the low-priority task so that it can run on its behalf; once the low-priority task is finished, the high-priority task can run again. This is known as priority donation (or priority inheritance).
Multi-level Feedback Queue (MFQ) Scheduling
Algorithm
Keep multiple round-robin queues at different priority levels: new tasks enter at a high priority, and a task that uses up its full quantum is demoted to a lower-priority queue (typically with a longer quantum). Note that any improvement to fairness will come from giving long jobs more CPU time, which degrades average wait time. Some "solutions":
Lottery Scheduling
Algorithm
Give each job some number of tickets; at every scheduling decision, draw a ticket uniformly at random and run the job holding it. So, the share of CPU cycles allocated to a job is probabilistically proportional to the number of tickets assigned to it. Thus, the priority of a job corresponds to how many tickets are given to it. Every job is given at least one ticket to avoid starvation.
So, performance degrades gracefully as the load changes because the effect of adding or removing a process is probabilistically "spread out" over all existing processes equally.
There are two approaches
Aside: a common ready queue would exponentially alleviate the problems with other algorithms, right? since a single core isn't blocking anything. or can we just abstract this away as stuff we've seen already? look into this
A real-time system is a system where each task specifies a deadline by which it must be completed. In a soft real-time system, failing to meet the deadline leads to degradation in performance. In a hard real-time system, missed deadlines cause the system to fail entirely.
Soft real-time systems often use a preemptive, priority-based scheduler that assigns real-time tasks to the highest priority. E.g. in Windows, priority levels \(16\)-\(32\) are reserved for real-time processes (\(32\) is the highest level).
Hard real-time systems require the use of admission control: the scheduler can only accept a task if it can guarantee it will finish execution by the deadline. Otherwise, the scheduler rejects the task
A periodic task has a fixed deadline \(d\), processing time \(t\), and will require the CPU at a constant interval \(p\). So, (for any of this to have a chance of working) we must have \(0\leq t\leq d\leq p\). The task will need \(\dfrac{t}{p}\) of the CPU's time.
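A small sketch of an admission test built on this fact; the threshold of total utilization \(\leq 1\) is the EDF bound (other schedulers use stricter thresholds), and the example task set is made up:

```c
#include <stdio.h>

// admit a task set only if the CPU shares of all periodic tasks fit
int admit(const double t[], const double p[], int n) {
    double utilization = 0.0;
    for (int i = 0; i < n; i++)
        utilization += t[i] / p[i]; // each task needs t/p of the CPU
    return utilization <= 1.0;
}

int main(void) {
    double t[] = {20, 35}, p[] = {50, 100}; // 0.4 + 0.35 = 0.75 -> admitted
    printf("admit? %s\n", admit(t, p, 2) ? "yes" : "no");
    return 0;
}
```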
Rate-Monotonic Scheduling
Algorithm
Assign each task a static priority inversely proportional to its period (shorter period → higher priority), and always preemptively run the highest-priority ready task.
Earliest-deadline-first (EDF) Scheduling
Algorithm
Assign priorities dynamically: the task with the earliest deadline runs first. If a feasible schedule exists, EDF is theoretically optimal with respect to CPU utilization. However, if no schedule exists, EDF may cause more tasks to miss their deadlines than is necessary.
We define the laxity or slack time as \(\text{time until deadline} - \text{remaining execution time required} = \text{deadline} - \text{current time} - \text{remaining execution time required}\).
Least-Laxity-First (LLF) Scheduling
Algorithm
Always run the task with the least laxity. So, LLF is essentially a refinement of EDF, inheriting from the definition of laxity as a refinement of "time until deadline". The overall structures are the same; it's just the "cost function" that's different.
Threads are the next level down from processes; the threads of a process execute the same program at the same time, using the same address space. They can further "split" tasks, but incur lower costs than separate processes.
Modern systems have multiple processing cores and the OS needs to handle many things at once. So, introducing parallelism has the potential to improve performance. A process with multiple threads has multiple "little pieces" that can run in parallel.
E.g. a web browser may have a thread to render the HTML and CSS, another to retrieve data from the network, another for responding to user input, etc. Thus, all of these things can happen at the same time.
E.g. a web server executing multiple threads concurrently
E.g. the kernel of an operating system is multi-threaded; each thread performs a specific task, e.g. device management, memory management, interrupt handling, etc.
A thread is a single stream of execution within a process, representing an independently schedulable task. So, each process must have one or more threads, i.e. one or more points of execution.
Each thread has a thread ID (analogous to a process ID), a program counter, a stack pointer (and thus its own stack), and a set of registers to use. However, the heap, text, and data sections are shared with the rest of the threads in the process, as well as other OS resources (e.g. open files, signals)
The resources unique to a thread are stored in the thread control block (TCB), analogous to the process control block. This may also include thread-local storage (TLS): per-thread data that persists across function calls (i.e. it works differently than the stack).
The address space of a process is shared among its threads, i.e. many threads exist in one protection domain.
Since each thread of a multi-threaded process shares the heap, data, files, etc, data-sharing is much easier and does not require syscalls, message passing, shared memory, etc.
If two threads run successively on a single processor, a context switch is still required because the PC and registers must be replaced. However, since the address space is the same, context switches between threads are less expensive.
Order between threads is not specified or guaranteed, so threads may run in arbitrary order. Thus, we can design the program so all the threads can execute concurrently (asynchronous threading) or force the main thread to wait until the other threads terminate (synchronous threading).
#define N 100
int in, out;
int buffer[N];
void producer() {
//...
}
void consumer() {
//...
}
int main() {
in = 0;
out = 0;
// we could also use pthread_create, which takes a function pointer
fork_thread(producer());
fork_thread(consumer());
// ...
}
Three streams of execution exist after calling fork_thread twice: the main thread and the two new threads. In practice, we use fork_thread or pthread_create to create new threads.
We can quantify performance gain with Amdahl's law: if a fraction \(S\) of a system is serial (i.e. sequential) and the remaining fraction \(1-S\) is parallelizable, then we have the bound \(\boxed{\displaystyle\text{speedup with respect to adding new cores }\leq \left({S+\dfrac{1-S}{N}}\right)^{-1}}\) where the system has \(N\) processing cores.
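A small sketch that evaluates this bound, assuming for illustration a program whose serial fraction is \(S = 0.25\):

```c
#include <stdio.h>

// upper bound on speedup from Amdahl's law: 1 / (S + (1 - S) / N)
double amdahl_bound(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void) {
    // with S = 0.25, no number of cores can beat 1/S = 4x
    for (int n = 1; n <= 16; n *= 2)
        printf("N = %2d cores -> speedup <= %.2f\n", n, amdahl_bound(0.25, n));
    return 0;
}
```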
We have two types of threads: user threads and kernel threads.
A kernel thread (or lightweight process) is a thread directly managed by the OS; the kernel manages and schedules it.
A user-level thread is a thread that the OS is not aware of. Instead, the programmer uses a thread library (e.g. C-threads) to implement and manage the threads in code. This library handles the creation, synchronization, and scheduling of the threads instead of the OS.
Since the OS doesn't know about the user-level threads, it can't use them to make scheduling decisions, decreasing the performance of the scheduler. Solving this problem requires the thread library to communicate with the kernel, which would require syscalls.
Thus, the user and kernel threads must communicate in some configuration: one-to-one (high parallelism and concurrency, used in most current implementations), many-to-one (doesn't work for multicore systems since threads of the same process can't be parallelized), many-to-many (lets us use fewer kernel threads than user threads), or two-level (hybrid).
A system is parallel if it can perform more than one task at the same time. Multi-threading can improve the parallelism of a program.
A system is concurrent if all of its tasks can make progress at any given time. This can be implemented by switching quickly between processes, and thus does not require multi-threading.
Tasks must be identified and split in such a way that they are independent and can be run in parallel. They should also perform equal amounts (and value) of work for effective load balancing. Distributing tasks across cores is the core aim of task parallelism.
If there is dependency between data accessed by multiple tasks, the task execution must be synchronized, i.e. we need to make sure that conflicting tasks don't run at the same time.
Testing and debugging get much harder in multithreaded applications because execution happens in multiple places and has elements of non-determinism (e.g. due to no guarantees on thread order).
Forking in Multithreaded Programs?
fork has two versions: one that duplicates all the threads of the parent process, and one that creates a single-threaded child process. Generally, the thread copied is the one that called fork in the first place.
If exec is called right after fork, duplicating all the threads is a waste of time and memory because the exec call replaces the whole memory of the process containing the threads.
Under implicit threading, thread creation and management is left to the compiler and run-time libraries of a program. Thus, programmers just need to identify tasks that can run in parallel in their programs; the infrastructure they use will do the actual thread allocation for them.
Under explicit threading, a library provides an API for creating and managing threads to the programmer, who must use it to implement the threading manually
In PThreads, a thread container data type pthread_t acts as a handle of a thread, i.e. something to identify it (like a PID).
See man pthread for more information.
// creates a thread and stores its handle in the first argument;
// the routine to run is passed as a function pointer (not called here),
// along with the argument to hand to it
// in practice, implemented with the clone system call on Linux
pthread_create(&tid, &attr, start_routine, arg);
// initializes a thread attribute object, allowing the initial attributes of the thread to be set
pthread_attr_init(&attr);
// blocks the calling thread until the given thread has terminated
pthread_join(tid, &retval);
// causes the calling thread to give up the CPU voluntarily
pthread_yield();
// terminates the calling thread and executes its cleanup handlers
pthread_exit(retval);
A synchronous signal is delivered directly to the thread that caused the signal to be generated. On the other hand, asynchronous signals (i.e. signals from external events) are sent to the first thread that hasn't blocked that signal, or possibly to all the threads that haven't blocked it.
POSIX Pthreads has a pthread_kill(pthread_t pid, int signal) function that lets the user send a signal to a specific thread.
Under asynchronous cancellation, a thread will terminate the target thread immediately. Thus, the OS does not have time to react, and might not be able to free/reclaim any resources allocated to the target thread.
Under deferred cancellation, the target thread periodically checks (polls) if it has been cancelled, specifically at cancellation points. If a pending cancellation request is found, the thread will terminate itself and a cleanup handler is invoked to release the thread's resources.
Cancellation points include blocking calls like read. So, we can cancel threads blocked while waiting for input. (A thread can also insert its own cancellation point by calling pthread_testcancel.) In PThreads, we cancel a thread using pthread_cancel(pthread_t tid). Note that threads can set their own cancellation type with pthread_setcanceltype(); deferred cancellation is the default. A thread may choose to disable cancellation entirely.
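A rough usage sketch of deferred cancellation (the worker routine and its loop body are made up; error handling omitted):

#include <pthread.h>
#include <stdio.h>

// hypothetical worker: loops forever, polling for cancellation each iteration
void *worker(void *arg) {
    (void)arg;
    while (1) {
        // ... do a unit of work ...
        pthread_testcancel();  // deferred cancellation point: terminate here if cancelled
    }
    return NULL;
}

int main(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    pthread_cancel(tid);      // request cancellation (deferred by default)
    pthread_join(tid, NULL);  // wait until the worker has actually terminated
    printf("worker cancelled\n");
    return 0;
}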
A thread pool is a collection of worker threads that execute callback functions for the application, i.e. are passed functions (as arguments) to execute inside the thread. At startup, some fixed number of threads are created; threads wait for tasks to be delegated to them
We can implement a job queue to queue incoming requests; this covers the case when we have more requests than threads in the pool. Once a thread becomes idle, it is "added back to the pool" and the next job in the queue is assigned to it.
Aside: to solve the producer-consumer problem, we might wish to lock access to the job queue and use a condition variable to store whether there is space in the job queue or not.
Under the fork-join model, a thread forks subthreads with fork, passes them arguments, waits for them to complete work in parallel, then joins them with join and combines/processes the results.
This is like a synchronous version of a thread pool; the user (or library, if it uses fork-join to implement something) determines the number of threads that need to be created, which would depend on the number of tasks.
Aside: fork-join follows the structure of divide and conquer algorithms well, implying that they can be easily parallelized under this structure. For example, to implement merge sort, we could fork into two threads to recursively merge sort each half of the list, then join them and collect the results by merging the two lists, as per the definition of the algorithm.
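A minimal pthreads sketch of that merge sort idea (the span type and function names are mine; a real implementation would fall back to a sequential sort below some size threshold instead of forking a thread at every level):

#include <pthread.h>
#include <string.h>

// a span of the array to sort
typedef struct { int *a; int n; } span;

// merge the sorted halves a[0..mid) and a[mid..n)
static void merge(int *a, int n, int mid) {
    int tmp[n];
    int i = 0, j = mid, k = 0;
    while (i < mid && j < n) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < n) tmp[k++] = a[j++];
    memcpy(a, tmp, n * sizeof *a);
}

static void *msort(void *arg) {
    span *s = arg;
    if (s->n < 2) return NULL;
    int mid = s->n / 2;
    span left = { s->a, mid }, right = { s->a + mid, s->n - mid };
    pthread_t t;
    pthread_create(&t, NULL, msort, &left);  // fork: sort the left half in a new thread
    msort(&right);                           // this thread sorts the right half
    pthread_join(t, NULL);                   // join: wait for the left half to finish
    merge(s->a, s->n, mid);                  // combine the two sorted halves
    return NULL;
}

Usage would look like span s = { arr, len }; msort(&s);.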
// the server loop is constantly running
serverLoop() {
// blocked in AcceptConnection until there is a new connection
connection = AcceptConnection();
// creates a new thread to handle each request that comes in
thread_fork(ServiceRequest, connection);
}
Recap: cooperating processes (or threads) share data via message passing or shared memory. However, concurrent access to shared data can lead to inconsistency, so programmers must synchronize access to shared data.
A race condition occurs when the outcome of a program depends on the order in which threads accessing shared data are scheduled. Since this order is not defined, race conditions lead to unpredictable, non-deterministic results.
// initialize: n = ARRAY_SIZE - 1
if (n == ARRAY_SIZE) {
return -1;
}
array[n] = valueA;
n++;
If two threads run this code at the same time, there are a few possibilities: for example, both threads may pass the bounds check before either increments n, so both write to the same slot (one value is lost) and n ends up past ARRAY_SIZE, which later bounds checks won't catch.
Since a certain bug might only happen for a certain interleaving of the threads (which itself is non-deterministic), it can be extremely difficult to debug race conditions.
An operation is atomic if it either runs to completion or not at all. Single instructions may be atomic, but compound operations (e.g. ones compiled from lines of user code) might not be, e.g. x = x + 1. Even an ordinary read-modify-write of memory is compound: we must first load the value into a register, modify it, then write it back.
If an operation is not atomic, threads may interleave within it, meaning the state of the program directly before and directly after the instruction might be different, with the difference being caused by something unrelated to the instruction.
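To make this concrete, here is a small demonstration program (names made up): two threads each increment a shared counter a million times, but counter++ compiles to a separate load, add, and store, so increments interleave and the final total usually comes out below two million.

#include <pthread.h>
#include <stdio.h>

long counter = 0;  // shared, unsynchronized

void *work(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;  // not atomic: load, add, then store
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, work, NULL);
    pthread_create(&b, NULL, work, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    // usually prints less than 2000000: increments from the two threads interleave
    printf("%ld\n", counter);
    return 0;
}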
We can derive that for \(N\) lines of code and \(k\) threads, we have \(\dfrac{(kN)!}{(N!)^k}\) ways to interleave the threads, i.e. ways the different threads may traverse the same program. This corresponds to the number of random walks from the "top-left" to "bottom-right" corners of a \(k\)-dimensional \(N\times \dots\times N\) lattice.
We note that each traversal might not produce a unique result, i.e. multiple different traversals might end up "accomplishing" the same thing. So, this formula doesn't count how many different program outcomes there are.
A critical section is a block of code (a sequential set of programming instructions) that cannot be executed in parallel by multiple threads, i.e. if it were, a race condition would occur. Thus, there is mutual exclusion of threads in the critical section: only one can be in there at a time.
To be effective, a critical section must be correct, efficient (entry and exit of the critical section is fast and the critical section is short), flexible (have few restrictions), and support high concurrency.
Just leave a note! I.e. have a variable that we store 1 in when a thread is in the critical section and 0 otherwise. Then, before entering the critical section, we just check what the note is.
This assumes that load and store are atomic and that no hardware support is required. This may not actually be true in practice.
Just have different notes for each thread, and check each thread's note before entering the critical section. This would be implemented as an array of notes.
We have two notes, but in one of the threads, we spin (stall) until we see that the other thread's note is 0; then we enter the critical section. We implement the spin by having a while(true) loop that breaks if the note becomes 0.
This works, strictly speaking, but sucks for a few reasons
Dekker's Algorithm
Algorithm
Peterson's algorithm simplifies Dekker's algorithm by merging both while loops into one loop
Peterson's Algorithm
Algorithm
Set your lock flag to 1 and give the other process the turn. Then, spin while the other process is locked and the turn is not yours. After that, you are free to enter the critical section, then remove your lock. Thread i only enters the critical section if lock[j] == false or turn == i, both of which are conditions under which it is allowed to access the section.
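A minimal C sketch of the algorithm as described (variable names follow the description above; on modern hardware these plain loads and stores would additionally need memory barriers, which come up below):

#include <stdbool.h>

bool lock[2] = {false, false};  // lock[i]: thread i wants to enter
int turn = 0;                   // which thread must yield when both want in

void enter(int i) {             // entry section for thread i
    int j = 1 - i;              // the other thread
    lock[i] = true;             // set your lock
    turn = j;                   // give the other thread the turn
    while (lock[j] && turn == j)
        ;                       // spin while the other wants in and has the turn
}

void leave(int i) {
    lock[i] = false;            // remove your lock
}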
We can prove that the two threads' entry conditions cannot both be true at the same time: if both locks are set, thread 0 can be in the critical section only if turn == 0 and thread 1 only if turn == 1, and turn cannot hold both values at once. Thus, mutual exclusion is guaranteed if the invariant holds before entering the critical section.
Peterson's algorithm is correct, and the solution is symmetric, i.e. the code of both threads "looks the same" (the only difference is which entry in the lock array is accessed). However, it is limited to two threads and requires busy-waiting.
A solution is correct if it has safety (i.e. guarantees mutual exclusion), liveness (guarantees that progress will be made if needed, i.e. a thread waiting to enter an empty critical section will eventually enter it) and a bounded wait time.
To eliminate busy-waiting and lock contention (threads competing to hold the same lock) altogether, we need hardware support for synchronization.
We have three main strategies for hardware-supported synchronization: memory barriers, atomic hardware instructions, and atomic variables (usually implemented with atomic hardware instructions).
Basically, all the synchronization problems come from the fact that a process can be interrupted at any time. So, we find ways to prevent interrupts from being a problem.
A non-preemptive CPU scheduler only gets control when internal or external events (i.e. interrupts) happen. So, on uniprocessor systems, we can simply disable interrupts from happening.
In hardware, we implement an atomic read-modify-write instruction that reads a value from memory into a register and writes a new value to it in a single instruction. Thus, it can't be interrupted; the critical section is formed of a single instruction.
Examples: test_and_set in x86 assembly atomically reads a value from memory into a register and writes 1 back to the memory location, exchange atomically swaps between a register and a memory address, compare_and_swap does a conditional swap of values based on the values of some registers.
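Following the usual textbook presentation, the semantics of these instructions can be written out in C, with the understanding that the hardware executes each function body as one uninterruptible step:

#include <stdbool.h>

// reads the old value and unconditionally sets the flag, as one atomic step
bool test_and_set(bool *target) {
    bool old = *target;
    *target = true;
    return old;
}

// writes new_value only if the current value matches expected;
// always returns the old value, as one atomic step
int compare_and_swap(int *value, int expected, int new_value) {
    int old = *value;
    if (old == expected)
        *value = new_value;
    return old;
}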
We can implement a spinlock with test_and_set:
while(true) {
while(test_and_set(&lock));
// we are entering the critical section
// ...
// critical section done
lock = false;
}
While the lock is 1 (true), test_and_set will keep reading and writing 1 to it, so the inner loop spins; once the lock is released, test_and_set reads 0, returns false, and the thread enters the critical section.
We can implement a spinlock with compare_and_swap as well:
while(true) {
while(compare_and_swap(&lock, 0, 1) != 0);
// we are entering the critical section
// ...
// critical section done
lock = 0;
}
compare_and_swap returns 1 when the lock is held by another thread and 0 when the lock is free. Since the lock is atomically set to 1 as it is acquired, there can't be a race condition between threads trying to obtain the lock.
We can define an increment function that atomically increments a value in memory using compare_and_swap and an atomic variable.
Aside: C++ provides the header <atomic>, which provides type aliases and operations for all the integral types (i.e. integers, bools, longs, etc). Operations include load, store, exchange, compare_exchange, add, etc.
void increment(atomic_int *v) {
int temp;
do {
temp = *v;
} while (temp != compare_and_swap(v, temp, temp + 1));
}
The loop only exits if the returned "before" value is the same as temp, i.e. if no other thread modified *v between the read and the swap; otherwise it retries.
Synchronization primitives must remain invariant over all the threads of a process.
Programming languages provide primitive, atomic operations for synchronization. These offer the basic building blocks to create synchronous programs, just like the lambda calculus provides the basic building blocks to create functional programs.
A mutex lock is a high-level programming abstraction of an object that only one thread can hold at a time; it may be implemented with a blocking operation or a spinlock (which uses busy waiting).
A mutex lock has an Acquire method and a Release method; this is the interface through which it communicates with the rest of the program. Acquire blocks until the lock is acquired. So, if the program executes past an Acquire call, the mutex lock must have been acquired (analogously for Release).
// recall: all threads of a process share the heap
void *malloc(size_t size) {
    // assume we have some mutex lock over the heap
    // we get the heap lock
    heaplock.acquire();
    // allocate the memory with syscalls, etc
    p = /* the memory */;
    // release the lock
    heaplock.release();
    return p;
}
void free(void *p) {
    heaplock.acquire();
    // deallocate memory with syscalls, etc
    heaplock.release();
}
We can implement mutex locks on uniprocessor systems by disabling interrupts with intr_disable() and re-enabling them with intr_enable(); this happens at the start and end (respectively) of both acquire and release.
This strategy won't work on multiprocessors, since disabling interrupts on one processor won't disable it on the others. So, for multiprocessor systems, we can use instructions like test_and_set or compare_and_swap to implement a busy-waiting loop that atomically checks the status of the lock.
A two-phase lock spins for a small amount of time to see if the lock can be acquired (spin phase). If it can't, the caller is put to sleep (sleep phase). This is a heuristic that assumes temporal locality: if a lock isn't available now, it probably won't be available for a while.
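A rough sketch of a two-phase lock using C11 atomics (the spin limit is arbitrary, and the sleep phase is approximated with sched_yield; a real implementation would block in the kernel, e.g. via futex on Linux):

#include <stdatomic.h>
#include <sched.h>

atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void two_phase_acquire(void) {
    while (1) {
        for (int i = 0; i < 1000; i++)                // spin phase (limit is arbitrary)
            if (!atomic_flag_test_and_set(&lock_flag))
                return;                               // got the lock while spinning
        sched_yield();                                // "sleep" phase: give up the CPU
    }
}

void two_phase_release(void) {
    atomic_flag_clear(&lock_flag);
}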
Combined with mutexes, condition variables allow threads to wait in a race-free way for arbitrary conditions to occur; the conditions themselves are boolean predicates over the shared data that the mutex protects.
A condition variable is an abstraction that supports conditional synchronization. A queue of threads waits for a specific event that happens in the critical section. So, the value of the condition variable is defined by data protected by a mutex lock.
A condition variable defines three functions:
- Wait(), which takes a lock to be released; it atomically releases the lock and goes to sleep, i.e. blocks the thread until signalled, then reacquires the lock when it wakes up
- Notify() (historically, signal()), which wakes up one waiting thread and puts it on the ready queue
- NotifyAll() (historically, Broadcast()), which wakes up all the waiting threads and puts them on the ready queue
We take the following steps when dealing with a condition variable (building up to the while-loop pattern below):
A mutex lock is a device to prevent more than one thread from "moving forward" in the code; the thread must acquire the lock to do this, and the "primitive" part is that only one thread can acquire it at a time.
Sometimes, we can implement what we need just with mutex locks. However, whether we need to enter the critical section could be predicated on program state that can only be checked inside the critical section (e.g. whether a bounded buffer, which lives in the critical section, is full or not). We can't do this with just a mutex lock, since we would need to hold the lock just to look inside the critical section in the first place.
A condition variable is a primitive that lets threads communicate about state inside the critical section. A thread calls wait on it (while holding the lock) to block until another thread calls signal (wake one thread) or broadcast (wake all threads) from inside the critical section. Thus, threads can learn about changes to the critical section's state without repeatedly acquiring the lock just to check.
We often use a while loop to handle a mutex lock and condition variable: acquire the lock, and while the desired condition does not hold, wait on the condition variable. Under the hood, this releases the mutex lock and pauses the execution of that thread until the condition variable is signaled/broadcasted, then reacquires the lock before re-checking the condition.
My implementation of MapReduce (assignment 2) has good examples of this pattern everywhere.
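A minimal sketch of this pattern with pthreads (the ready flag and function names are made up):

#include <pthread.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
int ready = 0;  // the condition, protected by lock

void wait_until_ready(void) {
    pthread_mutex_lock(&lock);
    while (!ready)                      // re-check after every wakeup
        pthread_cond_wait(&cv, &lock);  // atomically releases lock and sleeps
    // condition holds here, and we hold the lock
    pthread_mutex_unlock(&lock);
}

void make_ready(void) {
    pthread_mutex_lock(&lock);
    ready = 1;                 // change the state inside the critical section
    pthread_cond_signal(&cv);  // wake one waiter
    pthread_mutex_unlock(&lock);
}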
A semaphore is a generalization of the mutex lock + condition variable pattern: a semaphore has an integer variable (as opposed to the lock, which is a boolean variable) that supports the wait and signal operations.
A semaphore keeps a wait queue of threads waiting to access the critical section. If a thread calls wait and the semaphore is free (i.e. not \(0\)), it grants access and decrements the variable. Otherwise, the process is placed in the wait queue. When signal is called, one process is unblocked and the semaphore is incremented; the next process in the queue is given access.
A semaphore can be implemented with a count variable, a condition variable to check the count, and a mutex lock to protect accesses to count.
Mutual exclusion can be implemented by a binary semaphore by calling wait directly before the critical section and signal directly after. Used this way, a counting semaphore initialized to \(k\in \mathbb{N}\) would let at most \(k\) threads into the critical section at a time.
Unlike with condition variables, calling signal on a semaphore with an empty queue does have side effects: it still increments the semaphore's count, which gets decremented the next time wait is called. Calling signal on an "empty" condition variable simply does nothing; a thread that calls wait afterwards still blocks.
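Following the count + condition variable + mutex implementation described above, a sketch in C (the type and function names are mine, chosen to avoid clashing with POSIX's sem_t; the mutex and condition variable are assumed to be initialized):

#include <pthread.h>

typedef struct {
    int count;
    pthread_mutex_t lock;
    pthread_cond_t nonzero;
} csem;

void csem_wait(csem *s) {
    pthread_mutex_lock(&s->lock);
    while (s->count == 0)  // queue up until a unit is free
        pthread_cond_wait(&s->nonzero, &s->lock);
    s->count--;
    pthread_mutex_unlock(&s->lock);
}

void csem_signal(csem *s) {
    pthread_mutex_lock(&s->lock);
    s->count++;                        // incrementing with no waiters is the "side effect"
    pthread_cond_signal(&s->nonzero);  // wake one waiter, if any
    pthread_mutex_unlock(&s->lock);
}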
A monitor is an abstract encapsulation of a lock and a number of condition variables into a class. It hides the synchronization primitives behind this class interface; the user of the monitor interacts with the critical section through thread-safe methods, which access the critical section's data.
Under Mesa-style (non-blocking) semantics, signal puts a waiting thread on the ready queue, but the signalling thread keeps the lock. So, when it releases the lock, another thread could grab it first.
Under Hoare-style (blocking) semantics, signal gives the lock (and thus processor access) to the waiting thread, which executes immediately. When this execution is done, the processor and lock are given back to the signalling thread (so, a signal queue needs to be used).
Both styles of monitor can be implemented with semaphores:
Course notes contain examples
Bounded buffer problem: A finite buffer holds shared data, i.e. different processes read from and write to it. We need mutual exclusion to be sure of the data's integrity; we also need to know if the buffer is full or empty to avoid writing or reading, respectively.
Readers-Writers problem: Again we have a shared buffer, but we might have multiple processes reading and writing. We need to assure that only one writer at a time accesses the data, but many readers should be able to read at the same time.
Some thread libraries (e.g. pthreads) provide a read/write lock primitive to solve this problem specifically (see the sketch after this list).
Dining-Philosophers Problem: \(n\) philosophers sit at a table with \(n\) chopsticks. Philosophers switch between thinking and eating. They require \(2\) chopsticks to eat, but can only pick up \(1\) at a time. How can they assure no deadlocks and no starvation without communicating directly?
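As referenced above, a sketch of the readers-writers pattern using the pthreads read/write lock:

#include <pthread.h>

pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
int shared_data;

int reader(void) {
    pthread_rwlock_rdlock(&rw);  // many readers may hold this at once
    int v = shared_data;
    pthread_rwlock_unlock(&rw);
    return v;
}

void writer(int v) {
    pthread_rwlock_wrlock(&rw);  // writers get exclusive access
    shared_data = v;
    pthread_rwlock_unlock(&rw);
}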
Atomicity bug: An operation consists of multiple instructions and a thread is interrupted in the middle of the instruction by another thread that changes some of the data.
Order-violation bug: Some order of thread executions is assumed (usually for threads in a parent-child relationship) that doesn't always hold true. Look for these by considering what would happen if a child thread ran immediately after creation, pausing the parent.
A deadlock occurs when two threads access shared data but don't follow the same protocol for interacting with the locks (or whatever synchronization primitive), e.g. they acquire the same two locks in opposite orders, so each ends up waiting for a lock the other holds:
// thread A
lock1.acquire();
lock2.acquire();
// thread B
lock2.acquire();
lock1.acquire();
// both threads are in each other's wait queues
We define a system to have resource types \(R_1, \dots, R_m\), with \(w_i\) interchangeable instances of resource type \(R_i\) (e.g. CPUs, disks, locks, etc.) and threads (or processes, whatever) \(T_1, \dots, T_n\). To utilize a resource, a thread must request it, use it, and release it.
We can define a resource-allocation graph with vertices \(\set{R_1, \dots, R_m}\cup\set{T_1, \dots, T_n}\), a request edge \(T_i \to R_j\) whenever \(T_i\) is waiting for an instance of \(R_j\), and an assignment edge \(R_j \to T_i\) whenever an instance of \(R_j\) is currently allocated to \(T_i\).
If a deadlock has occurred, the resource-allocation graph must have a cycle. However, the existence of a cycle doesn't imply a deadlock; it is necessary but not sufficient.
We can also define a wait-for graph with vertices \(T_1, \dots, T_n\) by joining two threads if a path from one to the other exists in the resource-allocation graph.
Four conditions are needed for a deadlock to occur:
- Mutual exclusion: at least one resource is held in a non-sharable mode
- Hold and wait: a thread holds at least one resource while waiting to acquire others
- No preemption: a resource can only be released voluntarily by the thread holding it
- Circular wait: a cycle of threads exists in which each waits for a resource held by the next
We establish a policy for when to check for a deadlock (e.g. if a request can't be fulfilled in some amount of time, CPU utilization drops below a threshold, at a regular interval, check before every resource allocation, etc). Then, when we need to check, we run a cycle-detection algorithm in the resource-allocation graph (\(O(|E|+|V|)\)).
In practice, operating systems just defer this problem to the user (this is usually cheaper, so saves time overall).
There are a few options:
Strategy I: Always preempt all the resources of a waiting thread
Strategy II: Preempt the resources of a waiting thread when another thread requests those resources. This involves checking if a requested resources is being allocated to a waiting thread.
Easy saving and restoration of resource state is needed for these strategies to work effectively. So, the resource must store its state in something like CPU registers.
We can define a total order over the resources, i.e. an injective function \(f : \set{R_1, \dots, R_m} \to \mathbb{N}\). Then, we require every thread to acquire resources in increasing order of \(f\); this makes a circular wait impossible.
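Revisiting the two-lock example from before, under an order with \(f(\text{lock1}) < f(\text{lock2})\) both threads must acquire in the same order, so the cycle can no longer form:

// both thread A and thread B must now acquire in the same order
lock1.acquire();
lock2.acquire();
// ... critical section ...
lock2.release();
lock1.release();
// whichever thread gets lock1 first can always eventually get lock2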
A sequence of thread executions \(t_1, \dots, t_n\) is safe if each execution \(t_i\) can run with the currently available resources plus the resources held by all earlier executions (i.e. all \(t_j\) for \(j < i\)), since those will have terminated and released their resources by then.
An unsafe state doesn't guarantee a deadlock, but is needed for a deadlock to have a chance of occurring. Deadlock cannot occur in a safe state.
The purpose of avoidance is to prevent the system from ever going into an unsafe state. Under resource reservation, threads provide the maximum number of instances of each resource type they might need. Then, every time a thread requests a resource, the system checks if granting this resource would lead to an unsafe state.
We keep the following data structures for \(n\) threads, \(m\) resources:
- available[j], counting how many instances of resource \(R_j\) are available
- max[i][j], tracking how many instances of \(R_j\) can be requested by thread \(T_i\)
- allocation[i][j], tracking current resource allocations
- need[i][j], tracking how many more instances a thread may need (the difference between max and allocation)
To request resources, we keep a request vector; request[i][j] is the number of instances of \(R_j\) wanted by \(T_i\). If request[i] is larger than need[i], raise an error since \(T_i\) has claimed more than its maximum. Otherwise, once the resources are available, tentatively allocate them:
- available -= request[i]
- allocation[i] += request[i]
- need[i] -= request[i]
If the resulting state is safe, the allocation stands; otherwise, it is rolled back and \(T_i\) waits.
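The safety check itself can be sketched in C (this is the standard safety algorithm; the array sizes N and M are illustrative):

#include <stdbool.h>

#define N 5  // threads
#define M 3  // resource types

// can all threads finish in some order, starting from this state?
bool is_safe(int available[M], int allocation[N][M], int need[N][M]) {
    int work[M];
    bool finish[N] = {false};
    for (int j = 0; j < M; j++) work[j] = available[j];

    for (int done = 0; done < N; ) {
        bool progress = false;
        for (int i = 0; i < N; i++) {
            if (finish[i]) continue;
            bool can_run = true;
            for (int j = 0; j < M; j++)
                if (need[i][j] > work[j]) { can_run = false; break; }
            if (can_run) {
                // T_i can finish; it then releases everything it holds
                for (int j = 0; j < M; j++) work[j] += allocation[i][j];
                finish[i] = true;
                progress = true;
                done++;
            }
        }
        if (!progress) return false;  // no thread can proceed: unsafe
    }
    return true;
}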
Drawbacks of banker's algorithm
- Each thread's maximum resource needs must be declared in advance
- The set of threads and resources is assumed to stay fixed
- The safety check must run on every request, adding overhead
Just don't try to avoid at all! EZ.
How are memory addresses the CPU uses mapped to physical memory addresses? What is the difference between static and dynamic relocation? How do we allocate contiguous memory? What is fragmentation and how do we get rid of it?
- An instruction is fetched from memory according to the value of the program counter
- The instruction is decoded, possibly requiring loading more from memory
- The instruction is executed
- The result is stored in a register or written back to memory
The CPU can directly access registers (which live inside the CPU core) and main memory. However, the CPU cannot access disk addresses, so anything needed from there must be loaded into memory first.
Also, registers can be accessed in one cycle (229 review!), but memory access usually takes much longer, forcing the CPU to stall. Implementing caches can reduce the symptoms of this problem, but isn't a "full" solution because it doesn't (and could never) guarantee that memory never needs to be accessed.
A \(k\)-bit memory address can trivially reference \(2^k\) locations in memory; the size of each location depends on the system, but is often \(1\) word (\(32\) bits). A memory address characterizes a memory space, which may be different from the actual memory layout of the machine, so some sort of translation between these two "spaces" is needed.
I really should have made my 229 notes over the summer to prepare for this, oops!