Operating System Concepts CMPUT379

01 - Introduction

Why Study Operating Systems?

Knowing how the OS works is crucial for efficient and secure programming, much like how knowledge of computer architecture is crucial for writing an efficient interpreter.

To study OS is to study the design of large software systems

E.g. we will learn the client-server model and how its equivalent is implemented in an operating system

What is an Operating System?

An OS virtualizes a physical resource: it transforms a resource into a general and powerful virtual form of itself

The Roles of an OS

The OS as a referee

The OS as an illusionist

The OS as a glue

What Constitutes an OS?

A core kernel that runs all the time in the kernel mode (supervisor mode/privileged mode)

System programs and daemons, which come with the operating system but are not part of the kernel

Middleware, which "connects" additional services to application developers, e.g. databases, multimedia, graphics

02 - Interaction with Hardware

How does the OS make it easier to use and program a computer?

Components of Hardware (229 recap!)

Memory Hierarchy (229 recap!)

Recall that the CPU fetches and runs instructions one at a time, using the registers as its workspace. However, data doesn't just exist in the registers, but in memory and storage as well.

Because programs benefit from fast memory but fast memory is expensive, it is economical to organize memory into a hierarchy of levels, each with a different tradeoff between cost (and power) and speed. A clever compiler (or programmer) can make good guesses about which level each piece of data should live in, so having more levels to choose from generally improves efficiency.

Types of Processor Systems

A single-processor system contains one general-purpose processor with one processing core for user processes. In contrast, a multiprocessor system contains multiple single-core CPUs, allowing multiple processes to run at the same time.

A multicore system has multiple cores residing on a single processor chip. Each core has its own L1 cache, but the L2 cache and below are shared between cores. Because less distance needs to be travelled, multicore systems are more efficient than their multi-chip counterparts.

Making Use of the CPU

Multiprogramming

An operating system performs multiprogramming by keeping multiple processes in memory at the same time. The OS schedules the processes by picking one to run on the CPU, waiting until it terminates or has to wait, then picking another one, etc. This increases CPU utilization because a CPU core is less likely to be idle if it has multiple processes to choose from.

Interrupts

An interrupt is a signal sent by a device controller to the CPU's interrupt-request line(s) that informs the CPU that an event (e.g. an I/O request) has occurred.

There are actually two interrupt-request lines: a maskable one used by device controllers, etc, and a non-maskable one reserved for "urgent" interrupts like unrecoverable hardware faults. This way, if the CPU is executing a critical sequence of instructions, maskable interrupts can be deferred until that sequence is complete.

Interrupts have a priority level. The CPU will defer the handling of a low-priority interrupt if a higher-priority interrupt is raised.

Handling Interrupts

The CPU polls the interrupt-request line after each instruction. If an interrupt has occurred, the CPU stops what it was doing, catches the interrupt, and uses the in-memory interrupt vector to dispatch the appropriate interrupt-handler routine. The handler clears the interrupt by servicing the device.

If there are more interrupt handlers than addresses available in the interrupt vector, interrupt chaining can be used. Instead of pointing to a handler directly, an entry in the interrupt vector points to the head of a linked list of interrupt handler addresses. Each handler in the list is called in turn until one of them services (clears) the interrupt.

Exceptions/Traps

An exception/trap is a software-generated interrupt that may be generated by an error (e.g. division by \(0\)) or a user request for an OS service (e.g. a system call or ecall).

Multitasking

Multitasking extends this idea to the process level by letting the CPU execute multiple processes at the same time; this is done by switching between them frequently. Choosing the next process to run requires CPU scheduling.

03 - System Calls, Linking and Loading

Lecture 3 slides Linking and Loading article

What is a System Call?

System calls are provided by the kernel to application programs

The following tasks require a (UNIX) system call to be executed. Consider how often your C programs do one of these things.

The strace linux command can be used to trace which system calls are executed when an application is run.

Interacting with System Calls

Direct vs. Indirect System Calls

Generally, system calls are invoked through a C library wrapper instead of directly. These wrapper functions are more convenient because they handle the steps around making the syscall, e.g. trapping to kernel mode.
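For instance, here is a minimal sketch contrasting the glibc wrapper with a direct system call on Linux (syscall() and SYS_getpid are Linux-specific):

#include <stdio.h>
#include <unistd.h>      // getpid() wrapper
#include <sys/syscall.h> // SYS_getpid, syscall()

int main(void) {
    // convenient C library wrapper around the getpid system call
    pid_t wrapped = getpid();
    // direct system call, bypassing the wrapper
    pid_t direct = (pid_t) syscall(SYS_getpid);
    printf("wrapper: %d, direct: %d\n", (int) wrapped, (int) direct);
    return 0;
}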

POSIX API

The POSIX API is a standard API across unix-based systems for system calls. This interface provides functions to interact with system calls that abstract away the low-level details of making the particular syscall on a particular system.

Thus, code calling this API is portable between (POSIX-compliant) systems because the implementation of a particular POSIX function is chosen for the system it runs on.

Protecting the OS

Here is where the philosophy of operating systems diverges from that of the languages they are written in; they do not give you enough rope to hang yourself with

Operating Modes

An operating system has two (or more) operating modes that grant different levels of access to the hardware. Kernel mode grants the full privileges of hardware, whereas user mode does not. This design exists to protect the OS from user programs, and to protect user programs from each other.

The purpose of a system call is to allow a user program to run privileged instructions; access to these is mediated by the operating system. So, the underlying services of the OS can only be accessed through the "interface" (protection boundary) of an operating system in order to protect the hardware.

The computer's hardware has at least one status bit that indicates the current mode (i.e. user or kernel). Different hardware may have different modes → different bit configurations. So, the protection of privileged instructions happens at the hardware level.

Base and Limits Registers

Before context switching (switching processes), the OS loads a base/relocation register with the smallest legal physical memory address of the process and a limit register with the size of the memory range allocated to the process. Then, in user mode, each reference (instruction and data address) is checked to ensure it falls between the base and the base+limit addresses.
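As a conceptual sketch (in reality this check is performed by the hardware on every memory reference, not by software):

#include <stdbool.h>
#include <stdint.h>

// Model of the hardware check: a user-mode reference to addr is legal only if
// it falls in [base, base + limit); otherwise the hardware traps to the OS.
bool reference_is_legal(uint64_t addr, uint64_t base, uint64_t limit) {
    return addr >= base && addr < base + limit;
}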

Timer Interrupts

It is possible that a process never waits for I/O, and thus would continue running until it terminates (and that is the best case). To ensure the kernel gets regular control, a timer is set that interrupts the CPU at regular intervals (e.g. \(100\) microseconds).

On each timer interrupt, control returns to the kernel, which can choose a new process to execute.

Running a User Program

Steps: Program → Process

A compiler converts a source file (e.g. ASCII text) into an object file that corresponds directly to the source file

A linker combines the object files and specific libraries into a single binary executable

A loader brings the executable file created by the linker into memory. Specifically, the loader adds the code and data from the executable to the process memory.

Executable and Linkable Format (ELF)

ELF (Executable and Linkable Format) is the standard executable file format of UNIX systems.

The ELF Relocatable File contains the compiled code and a symbol table that contains metadata about functions and variables in the program; this is used to link the object file with other object files.

The ELF Executable File contains the address of the first instruction of the program; this can be loaded into memory

ELF File Sections
| Name | Contents |
| --- | --- |
| .text | Machine code (instructions). The content of this can be viewed with objdump -drS objfile on UNIX systems |
| .data | Initialized global variables |
| .bss | (block storage start) Uninitialized global variables; empty in the object file (only its size is recorded, and the memory is zero-filled at load time) |
| .rodata | Read-only data, e.g. constant strings |
| .symtab | Symbol table |
| .rel.text, .rel.data | Relocation information for functions and global variables that are referenced, but not defined in the file (i.e. external). The linker modifies these sections when resolving external references |

Compiler and Linker Diagrams

make my own?

Operating System Structure

UNIX

UNIX has a monolithic structure: everything in the kernel is compiled into a single static binary file that runs in a single address space. As a result, the communication with the kernel is fast and syscalls have little overhead.

LINUX

LINUX is also monolithic, with a core kernel and additional dynamically loaded kernel modules for services like device drivers, file systems, etc.

Applications use glibc, the GNU version of the standard C library; this is where POSIX functionality is provided. glibc is the syscall interface to the kernel.

The marvel of software engineering that is the linux kernel is available online. Its directories are include (public headers), kernel, arch (hardware dependent code), fs (filesystems), mm (memory management), ipc (interprocess communication), drivers, usr (user-space code), lib (common libraries).

Darwin (MacOS kernel)

Darwin follows a layered structure with a microkernel, BSD (threads, command line, networking, file system), and user-level services (GUI). Layered structures have the following advantages and disadvantages:

The microkernel structure is characterized by a small kernel providing basic functionality (e.g. virtual memory, scheduling) and interprocess communication (e.g. message passing through ports). Other OS functionality is provided through user-level processes.

Windows NT

Windows NT is also layered, with the Windows executive (Ntoskrnl.exe) providing core functionality. The kernel(s) exist between the hardware abstraction layer and the executive services layer (where the Windows executive is). The kernel handles thread scheduling and interrupt handling.

The native Windows API is undocumented, so code requiring OS services (everything) runs on an environment subsystem with a documented API, like Win32, POSIX, or OS/2. WSL also fits into here I imagine.

04 - Process Abstraction

A process is a program being executed in an environment with restricted rights. The process has resources (e.g. CPU registers, memory to contain code, etc) and encapsulates one or more threads sharing process resources.

Note that processes are not programs; different processes might run different instances of the same program, e.g. multiple instances of a web or file browser.

Process Memory Layout

A process contains a

Since both the stack and the heap are able to grow over a program's execution, they grow into the same unallocated space. Specifically, the stack grows downwards from the top of memory whereas the heap grows upwards, with respect to the memory addresses.

Each process has a distinct and isolated address space that defines which memory addresses it can access. No process can write to the memory of another process. The address space contains virtual addresses, which are translated to physical addresses by the hardware.

Threaded Processes

A thread is a sequential execution stream of instructions. A process may have multiple threads; these threads share the address space of the process as well as the heap, text, data, etc. However, each thread has its own registers, program counter, and stack.

Process Control Block

The OS keeps track of the state of each process' execution in a process control block (PCB). This is a kernel data structure in memory that represents the context (runtime information) about the process. This includes

In Linux, the PCB is represented by the C structure task_struct, which is defined in <linux/sched.h>

Loading Revisited

Before a program is loaded into memory, the loader creates a PCB, address space, stack, and heap for the process. It also pushes argc and argv to the stack. Then, the registers and program counter are initialized. Finally, the program and its data are loaded into memory.

Keeping Track of Processes

Although only one process can be running on a core at a given time, the OS is "juggling" many processes, which may be in ready or waiting states. Active processes are placed in a queue to be dealt with. These include:

The queue may change due to a process action (e.g. termination or making a syscall), OS actions (scheduling), or external actions (hardware interrupts).

Zombie and Orphan Processes

A process is a zombie process if it has terminated, but its parent hasn't read its exit status yet.

A process becomes an orphan process if its parent is terminated while the process is still running.

Scheduling

The scheduler maintains a doubly linked list of PCBs that are moved between the job queue, ready queue, and the device queues. It regularly selects a process from the ready queue to run; the algorithm defining this selection may be tuned to prefer fairness, minimum latency, or other guarantees.

Overall, the scheduler selects the process to run next, whereas the dispatcher actually executes the context switch.

Context Switching

Context switching is the act of stopping a process and starting another one; the context contains the values of the CPU registers, the process state, and information about the memory.

Context switching is expensive because it requires saving the hardware registers (e.g. the PC) of the outgoing process to its PCB and loading the registers of the incoming process from its PCB.

05 - Process Management

How can we create, control, and terminate processes?

Go fork yourself

Process Management System Calls

// returns the current PID
getpid() 

// duplicates the current process (a new PID is assigned to the child process)
fork() 

// loads a new binary file into memory without changing its PID
// this is the loader command from earlier
execve() 

// waits until one of its child processes terminates
wait() 

// waits until the specified child process terminates
waitpid() 

// terminates a process
// note: _exit and exit are different things; this has CMPUT 379 assignment implications
_exit() 

// causes the calling process to sleep until a signal is delivered that either terminates the process or causes the invocation of a signal-catching function
pause() 

// suspends execution of a process for at least the specified time (can be interrupted by a signal that triggers the invocation of a handler)
nanosleep() 

// sends a signal (interrupt-like notification) to another process
kill() 

// sets handlers for signals
sigaction() 

Forking a New Process

The fork (fork()) syscall creates a child process that inherits a copy of its parent's memory, file descriptors, CPU registers, etc. Both the parent and child process will execute after the fork instruction; either one might run first (no guarantee).

The parent receives the child's PID as fork's return value; it should keep track of it (e.g. in a local variable on the stack) so it can later wait() on the child, lest a zombie process be created.

#include <stdio.h>
int main() {
    // COMMON TO ALL PROCESSES
    // fork 3 times
    fork();
    fork();
    fork();
    
    printf("hello\n");
    // "hello" is printed 2^3 = 8 times
    return 0;
}

Note that the forked child processes can "see" everything, i.e. the code is running "multiple" times. This means we need to check if we're in the parent or child process when running code. It also means that anything before the fork is visible to both the parent and child processes; this lets us create shared structures like pipes, shared memory, etc.

Common Problems with Fork

So, DON'T FORK IN MULTITHREADED PROCESSES!

Loading Programs: execve

The exec (execve()) system call allows a process to load a different program and start executing it at main. An array of argument strings (argv, terminated by a null pointer) and an environment array are passed to execve; the new program receives the arguments and their count (argc) in its main.

We usually call execve after calling fork. Note that this makes all the memory copied by fork useless immediately.

Waiting for Child Processes to Terminate: wait

A parent process can execute concurrently with its children or wait until some or all of them terminate; this is achieved with the wait() command, which puts the parent to sleep until a result is returned.

Terminating a Process: _exit

The termination of a process is the final reclamation of its resources by the OS. All open files are closed and the process' memory is deallocated.

A process is usually terminated by using return in main, but it can be terminated explicitly by calling exit() (C library function) or _exit() (syscall). Note that exit() is called indirectly when main returns.

Kill

A process may terminate its own child with the kill() system call, which takes as parameters the child's PID and the signal to use, in this case SIGKILL

Dealing with Resulting Zombie and Orphan Processes

A process becomes a zombie when it has terminated, but its parent has not yet called wait().

A process becomes an orphan if its parent terminates (without invoking wait()) while the process is still running; orphans are adopted by the init process, which reaps them when they terminate.

Other Process Control Syscalls

The priority of a process can be adjusted with nice(); nice(incr) increases the nice value of the process by incr. A lower nice value means a higher scheduling priority; a process is "nice" by giving up some of its CPU share.

The ptrace system call lets a process intercept the system calls of another process. This lets it check arguments, set breakpoints, inspect registers, etc.

The sleep call puts the process on a timer queue, waiting for a specified amount of time; this is how alarm functionality can be implemented.

The Shell

In short, a shell is a process control system that lets programmers create and manage processes to do tasks. Logging into a (UNIX) machine starts a shell process.

Code Examples

Forking a Process

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

int main() { 

    //... 
    pid_t ppid = getpid(); // store parent's pid 
    pid_t pid = fork(); // create a child 
    
    if(pid == 0) { 
        // child continues here 
        printf("Child pid: [%d]\n", getpid()); 
        // ... 
    } else if (pid > 0) { 
        // parent continues here 
        printf("Parent pid: [%d] Child pid: [%d]\n", ppid, pid); 
        // ... 
    } else { 
        perror("fork failed!"); 
        exit(1); 
    }
    // ...
}

We can run the ps command to check the IDs of running processes. The -el flag increases the information listed; the -u flag lets processes be filtered by the user that created them.

Combining fork and wait

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main() { 
    // ... 
    pid_t ppid = getpid(); // store parent's pid 
    pid_t pid = fork(); // create a child 
    
    if(pid == 0) { // child continues here 
        // ... 
    } else if (pid > 0) { // parent continues here 
        // ... 
        int child_status; 
        pid_t cpid = wait(&child_status); // blocks until a child terminates 
    } else { 
        perror("fork failed!"); 
    }
    // ...
}

Combining fork and exec

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main() { 
    // ... 
    int status; 
    pid_t ppid = getpid(); // store parent's pid 
    pid_t pid = fork(); // create a child 
    
    if(pid == 0){ 
        // child continues here 
        // build the argument array; mark the end with a null pointer
        char *args[] = { "ls", "-l", NULL }; 
        execve("/bin/ls", args, NULL); 
        /* exec doesn't return on success! 
        so if we got here, it must have failed! */ 
        
        perror("exec failed!"); 
        _exit(1); // do not fall through into the parent's code 
    } else if (pid > 0) { 
        // parent continues here ... 
        pid_t cpid = wait(&status); // pass NULL if not interested in exit status 
        
        if (WIFEXITED(status)) {
            printf("child exit status was %d\n", WEXITSTATUS(status));
        }
        
    } else { 
        perror("fork failed!"); 
    }
    //... 
}

Parent Killing Its Child

Welp, I'm on a watchlist now

#include <signal.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h> // for exit()

int main() { 
    // ... 
    int ppid = getpid(); // store parent's pid 
    int pid = fork(); // create a child 
    
    if(pid == 0) { // child continues here 
        sleep(10); // child sleeps for 10 seconds 
        // ... 
        exit(0); 
    } else { 
        // parent continues here 
        // ... 
        printf( "Type any character to kill the child.\n" ); 
        char answer[10]; 
        fgets(answer, sizeof(answer), stdin); // gets() is unsafe; fgets bounds the read 
        
        if (!kill(pid, SIGKILL)) { 
            printf("Killed the child.\n"); 
        } 
    }
}

06 - Signals

"Kill" is the word associated with sending signals. For some reason.

Signals are software interrupts; they provide a way of handling asynchronous events by sending extremely short messages between processes (and are thus a form of IPC). Each signal has an associated action.

Signals are defined as integer constants in the <signal.h> header; their textual representation always starts with "SIG", e.g. SIGKILL is \(9\).

A signal is posted if the event that triggers it has occurred, delivered if the action associated with it is taken, pending if it has been posted but not delivered yet, and blocked if the target process does not want it delivered.

Signal Generation

A signal can be generated by

POSIX.1's reliable-signal

A process can send a signal to another process or a group of processes using the kill syscall. However, this can only happen if the user ID of the receiving process is the same as the one of the sending process (i.e. permission is required).

A process can wait for a signal by calling pause; this puts the process to sleep until a signal is caught.

A process may have a signal disposition, which determines how it responds to signals.

Defining a Signal Handler

A signal handler is a function that takes a single integer as an argument and returns nothing, i.e. everything is accomplished through side-effects.

sighandler_t signal(int signum, sighandler_t handler);
#include <stdio.h>
#include <unistd.h>
#include <signal.h>

int i; 

void quit(int code) { 
    fprintf(stderr, "\nInterrupt (code= %d, i= %d)\n", code, i); 
} 

int main (void) { 

    if(signal(SIGQUIT, quit) == SIG_ERR) 
        perror("can't catch SIGQUIT"); 
    
    // loop that wastes time in a measurable way
    for (i= 0; i < 9e7; i++) 
        if (i % 10000 == 0) putc('.', stderr); 

    return(0); 
}

This behavior varies across different UNIX versions though.

POSIX Signal Environment

The action associated with a signal may be examined or modified using sigaction, which supersedes the signal function from earlier UNIX releases.

#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <signal.h>

void signal_callback_handler(int signum) { 
    printf("Caught the signal!\n"); 
    // uncommenting the next line breaks the loop when signal received 
    // exit(1); 
} 
    
int main() { 
    struct sigaction sa; 
    sa.sa_flags = 0; 
    sigemptyset(&sa.sa_mask); 
    sa.sa_handler = signal_callback_handler; 
    sigaction(SIGINT, &sa, NULL); // we are not interested in the old disposition 
    // sigaction(SIGTSTP, &sa, NULL); 
    while (1) {}
}

sigprocmask may be used to block signals in a signal set from getting delivered to a process.

The bit vector representing the signal set may be modified with sigemptyset(), sigfillset(), sigaddset(), sigdelset(), and sigismember()
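A minimal sketch of blocking SIGINT around a critical section with sigprocmask (the "critical work" is just a placeholder):

#include <signal.h>
#include <stdio.h>

int main(void) {
    sigset_t block_set, old_set;

    // build a signal set containing only SIGINT
    sigemptyset(&block_set);
    sigaddset(&block_set, SIGINT);

    // block SIGINT: it stays pending instead of being delivered
    sigprocmask(SIG_BLOCK, &block_set, &old_set);
    printf("SIGINT is blocked here (Ctrl-C is deferred)\n");
    // ... critical work ...

    // restore the previous mask; a pending SIGINT is delivered now
    sigprocmask(SIG_SETMASK, &old_set, NULL);
    return 0;
}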

How Does Process Creation Interact with Signals?

When a process forks, the child inherits the parent's signal dispositions and signal mask; signal handlers are also defined in the child because it has inherited its parent's memory.

When a process execs, caught signals are reset to their default action, because the handler code from the old program no longer exists in the new program; all other signal dispositions are left alone and the signal mask is preserved.

07 - Interprocess Communication I

Cooperating processes work with each other to accomplish a single task; this improves performance (e.g. increases parallelism) and program structure

Information sharing (i.e. IPC) is likely required between cooperating processes; it is on the OS to make this happen. The two fundamental approaches to this are message passing using a message queue and shared memory.

Message Passing

Message passing requires kernel intervention through system calls. Sharing is clear and easy to track, but requires more overhead in the form of copying data and crossing domains.

Implementation: a message queue (implemented as a linked list of messages) is stored in the kernel; a message queue identifier pointing to this list is shared with all cooperating processes.

| Property | Options | Description |
| --- | --- | --- |
| Direction | simplex, half-duplex, full-duplex | Which way the message travels (half-duplex: both ways, but only one at a time) |
| Boundaries | datagram, byte stream | Datagrams have message boundaries; byte streams do not |
| Connection model | connection-oriented, connectionless | Connection-oriented: the recipient is specified when the connection starts, so it doesn't need to be re-specified. Connectionless: each send operation has a parameter that contains its recipient |
| Reliability | | Can messages get reordered? corrupted? lost? |

Issues with Message Passing

Blocking: message passing may involve blocking operations, i.e. operations that force the sender and receiver to wait until the message has been received or is available, respectively.

Shared Memory

Shared memory is achieved by establishing a mapping between a process' memory space and a named memory object, which is shared among processes. The processes that need access to the memory are forked after the memory is created so that they know the memory object's name.

Shared Memory in POSIX

POSIX uses memory-mapped files; each region of the shared memory is associated with a file. So, shared memory objects are implemented with files. We interact with the SMO through the following syscalls:

Finally, the mmap syscall establishes the mapping for this shared memory object (the mapping can later be removed with munmap). mmap takes parameters for the address at which to place the mapping (usually NULL to let the OS choose), the length (size) of the mapping, the memory protection flags prot (e.g. PROT_READ | PROT_WRITE), the mapping flags (e.g. MAP_SHARED), the file descriptor of the shared memory object, and an offset into the object at which the mapping starts.

The Producer-Consumer Problem

The producer-consumer problem (a.k.a. the bounded buffer problem) arises any time we have a process (the producer) writing to a buffer and a different process (the consumer) reading from the same buffer. This pattern is common; a canonical example is in a shared pipe.

The producer-consumer problem requires the producer and consumer to communicate, i.e. inter-process communication is required. So, we can implement in terms of message passing or shared memory.

Solution: Message Passing Implementation

We defer the management of the message queue to the kernel, using send and receive as an interface to interact with the queue. Each process needs to be able to name (identify) the other processes to communicate.

int main() {
    //...
    if(fork() != 0) {
        producer();
    } else {
        consumer();
    }
}

int producer() {
    //...
    while(true) {
        //...
        next_p = /* produced item */;
        send(C_pid, next_p);
        //...
    }
}

int consumer() {
    //...
    while(true) {
        //...
        receive(P_pid, &nextc);
        // consume nextc
        //...
    }
}

Solution: Shared Memory Implementation

this code doesn't quite make sense to me

We use the mmap system call to create shared memory. We store the buffer in this shared memory and manipulate it directly. As such, we need to keep track of some variables and constants

int main() {
    //...
    // map a shared region that will hold buffer, in, and out
    // (arguments abridged; protection would be PROT_READ | PROT_WRITE, flags MAP_SHARED)
    mmap(/*...*/);
    in = out = 0;
    
    if(fork() != 0) producer(); else consumer();
}

int producer() {
    //...
    while(true) {
        //... (produce the item however)
        nextp = /* produced item */;
        // busy-wait while the buffer is full (no free slot to write into)
        while((in + 1) % N == out) {}
        // if we're here, (in + 1) % N != out, so we write to the buffer
        buffer[in] = nextp;
        in = (in + 1) % N;
    }
}

int consumer() {
    //...
    while(true) {
        //...
        // wait here until there is something to read from the buffer
        while(in == out) {}
        // read from buffer
        nextc = buffer[out];
        // update out; we implement a circular buffer with the modulus operator
        out = (out + 1) % N;
        // consume the next item however
    }
}

POSIX Code Examples

Producer

To create the producer, we use shm_open to create the shared memory file descriptor, then call ftruncate to establish the size of the shared memory. Then, we use mmap to map the shared memory object and get a pointer to it. Finally, we use sprintf to write to the shared memory; we increment the pointer to make sure we're writing to the right place.
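A minimal sketch of those producer steps (the object name "/cmput379_shm" and the messages are made up for illustration; error checks omitted, and older systems may need linking with -lrt):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const char *name = "/cmput379_shm"; // illustrative object name
    const size_t size = 4096;

    // create the shared memory object and set its size
    int fd = shm_open(name, O_CREAT | O_RDWR, 0666);
    ftruncate(fd, size);

    // map the object into this process' address space
    char *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    // write into the shared memory, advancing the pointer past what was written
    ptr += sprintf(ptr, "Hello ");
    ptr += sprintf(ptr, "from the producer\n");
    return 0;
}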

Consumer

We do all the exact same setup as the producer: shm_open and mmap (among others) with the same name. However, instead of writing to the shared memory, we read from the memory instead by setting it as the "source" for printf. Finally, we call shm_unlink to remove the shared memory object when we're done.
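A matching consumer sketch (same illustrative object name as the producer above):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const char *name = "/cmput379_shm"; // must match the producer's name
    const size_t size = 4096;

    // open the existing shared memory object read-only and map it
    int fd = shm_open(name, O_RDONLY, 0666);
    char *ptr = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);

    // read from the shared memory: the mapped region is the "source" for printf
    printf("%s", ptr);

    // remove the shared memory object now that we're done with it
    shm_unlink(name);
    return 0;
}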

Reader and Writer Example


Note: the notes also have examples of this in System V; I'm not sure these are as directly relevant so I didn't include them. But, they are useful for comparison of system design, so I've made a note that these are indeed in the slides.

08 - Interprocess Communication II (Pipes and File Descriptors)

A pipe is a channel established between two processes to communicate; it is a comparatively lightweight form of IPC. Pipes imply a producer/consumer relationship.

Ordinary Pipes

An ordinary pipe (a.k.a. anonymous pipe) also implies a parent-child relationship: the parent creates a pipe, then forks a child process that has access to the pipe too. These are the only processes with access to the pipe.

In UNIX, pipes are implemented as special types of files. Thus, they are identified by file descriptors and are manipulated with the read and write syscalls. Namely, we read to consume content and write to produce content. The piped data flows through the kernel.

// structure to store the necessary file descriptors
int fd[2];
// create a pipe 
pipe(fd);
// the file descriptor for the READ end is stored in fd[0]
// the file descriptor for the WRITE end is stored in fd[1]

File Descriptors

A file descriptor is an integer index into a process' file descriptor table, which is maintained by the kernel; each entry refers to an entry in the system-wide open file table. Descriptors themselves are per-process, but a child inherits copies of its parent's descriptors, and those copies refer to the same open files.

File descriptors can be obtained via a syscall by opening a file/pipe/socket/etc, or by inheritance from a parent process.

Implementing I/O Redirection

We can implement I/O redirection by closing a standard stream and re-allocating the corresponding descriptor to another file (or pipe). But there are better ways to do this

// close the standard input
close(STDIN_FILENO);
// set the standard input to read from my file:
// open returns the lowest free descriptor number, which is now 0 (stdin)
open("/path/to/my/file", O_RDONLY);
// we can do this because input is abstracted as a file anyway
// so feeding in a file directly works since it uses the same interface

Instead, we can duplicate a file descriptor with the dup2 system call. dup2(oldfd, newfd) makes newfd refer to the same open file as oldfd (closing whatever newfd referred to before), so e.g. dup2(fd, STDOUT_FILENO) redirects standard output to fd's file in a single step, without the separate close and open.
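A small sketch of redirecting standard output into a file with dup2 (the file name is made up):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // open the file that should receive the output
    int fd = open("out.log", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    // make descriptor 1 (stdout) refer to the same open file as fd;
    // dup2 closes the old stdout and replaces it in one step
    dup2(fd, STDOUT_FILENO);
    close(fd); // descriptor 1 still keeps the file open

    printf("this line goes to out.log, not the terminal\n");
    return 0;
}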

Everything is a File

The "pipes as a file" pattern is common in UNIX: many things are implemented as files, e.g. devices, sockets, pipes, actual files, etc. YOU get a file and YOU get a file and YOU get a file.

Piping to Another Process

To pipe to another process, we would need to create the pipe (i.e. the file descriptors), fork a child process, close the unused ends of the pipe for both processes, call execve to run the command we want in the child process, then wait for the command to terminate. This is annoying and tedious to implement. So, popen() does exactly this.

Then, pclose() closes the standard I/O stream, waits for the command, and returns termination status.

#include <stdio.h>
#include <unistd.h>
#define LINESIZE 20

int main (int argc, char *argv[]) {

    // set up constants and buffer
    size_t size=0; 
    char buf[LINESIZE]; 
    FILE *fp; 
    
    // open the pipe
    // provide the command we will execve
    // we could provide "w" as the second parameter to write to the command's stdin instead
    fp = popen("ls -l", "r"); 
    // continue reading from the pipe as long as we can;
    // fgets stores each line of the command's output in the buffer
    while(fgets(buf, LINESIZE, fp) != NULL) 
        // print the buffered line to standard out
        printf("%s\n", buf); 
    // close the pipe
    pclose(fp); 
    return 0; 
}

Named Pipes

Named pipes (FIFOs in UNIX) generalize ordinary pipes by allowing bidirectional communication between any (possibly several) processes, not just parent-child pairs. Essentially, a named pipe is a named file that any process can read from or write to.

In UNIX, a FIFO is created with the mkfifo syscall and is manipulated like an ordinary pipe, i.e. with open, read, write, close.
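A sketch of the writing side (the path /tmp/myfifo is made up; a separate process would open the same path with O_RDONLY and read from it):

#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    const char *path = "/tmp/myfifo"; // illustrative FIFO path

    // create the named pipe: a special file visible in the filesystem
    mkfifo(path, 0666);

    // open blocks until another process opens the FIFO for reading
    int fd = open(path, O_WRONLY);
    write(fd, "hello fifo\n", strlen("hello fifo\n"));
    close(fd);

    // remove the FIFO from the filesystem when done
    unlink(path);
    return 0;
}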

Aside: how is a named pipe functionally different from shared memory? they seem really similar

Appendix: Actually Using a Pipe in BASH

In BASH, we use the character | to open a pipe between two processes; BASH will handle the opening and closing. This pipe redirects the standard output of the left process into the standard input of the right process.

# the pipe identity
./process_1 > temp.file && ./process_2 < temp.file
# equivalent to
./process_1 | ./process_2

Code Examples

Anonymous Pipe
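A minimal sketch of a parent writing to its child through an anonymous pipe (the message and buffer size are arbitrary):

#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fd[2];
    char buf[32];
    pipe(fd); // fd[0] = read end, fd[1] = write end

    if (fork() == 0) {
        // child: consumer reads from the pipe
        close(fd[1]); // close the unused write end
        ssize_t n = read(fd[0], buf, sizeof(buf) - 1);
        buf[n] = '\0';
        printf("child read: %s\n", buf);
        close(fd[0]);
    } else {
        // parent: producer writes into the pipe
        close(fd[0]); // close the unused read end
        write(fd[1], "hello pipe", strlen("hello pipe"));
        close(fd[1]);
        wait(NULL);
    }
    return 0;
}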

Duplication

Hardcoded: w | wc -w
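A guess at what the hardcoded w | wc -w example looks like, using pipe plus dup2 and execlp (error checks mostly omitted):

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fd[2];
    pipe(fd);

    if (fork() == 0) {
        // first child runs "w"; its stdout is redirected into the pipe
        dup2(fd[1], STDOUT_FILENO);
        close(fd[0]);
        close(fd[1]);
        execlp("w", "w", (char *) NULL);
        perror("exec w failed");
        _exit(1);
    }
    if (fork() == 0) {
        // second child runs "wc -w"; its stdin is redirected from the pipe
        dup2(fd[0], STDIN_FILENO);
        close(fd[0]);
        close(fd[1]);
        execlp("wc", "wc", "-w", (char *) NULL);
        perror("exec wc failed");
        _exit(1);
    }

    // parent: close both ends so wc eventually sees end-of-file, then reap the children
    close(fd[0]);
    close(fd[1]);
    wait(NULL);
    wait(NULL);
    return 0;
}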

09 - Distributed Systems

A distributed system is a set of loosely coupled (often physically separate) nodes connected via a communication network, e.g. the internet. Each node has an operating system, as well as its own resources, e.g. a CPU, etc.

Nodes may be configured in a client-server configuration (the server has a resource the client wants to use), a peer-to-peer configuration (each node acts as both a client and a server), or a hybrid model.

Why make a system distributed?

A distributed operating system is capable of migrating data, computation, and processes between its nodes without its user needing to be aware of its internal function. A high level of parallelism is needed to achieve this.

Design Goals of Distributed Systems

Robustness is the ability of a system to withstand failures. Systems should be fault-tolerant, i.e. able to withstand certain amounts of certain types of failure.

Transparency is the degree to which the system appears to its users as a single conventional, centralized system rather than a collection of machines. The system should act the same way from every access point, e.g. files stored on remote servers should appear to the user just like files stored locally.

Scalability is the ability of a system to adapt to the increased load from accepting new resources. A system should respond gracefully, i.e. not slow down significantly or fail under increased load.

Consistency is the ability of a system to ensure that its cached local data is consistent with the master data source.

Basics of Communication

A network is a set of communication links that let multiple computers communicate via packets (the atomic unit of data transmission) through a network interface. A protocol (common set of communication rules) must be adopted across the system.

Network Structure

A local area network (LAN) covers a small geographic area, e.g. a building. LANs connect computers with common peripherals (e.g. printers, storage servers, etc); they must be fast and reliable to be effective. Nodes may connect to a LAN via ethernet or wifi.

A wide area network (WAN) connects geographically separate sites across the world. Its links are serious physical infrastructure, e.g. telephone lines, data lines, optical cables, satellite channels, etc. Due to the distances involved, WAN is slower and less reliable than LAN.

LANs are connected together into WANs.

Principles of Communication over Networks

Network data is divided into packets and flows between switching points; computers at these points control the packet flow. When a packet arrives at its destination, the computer is interrupted.

If packets need to travel between different LANs, a router needs to be used to direct traffic from one network to another. The router reads the packet header to determine the packet's destination, finds the closest node to the destination in a routing table, then forwards the packet there.

Identifying the Destination

Packets are accepted by processes, so a <host-name, PID> pair is needed to identify which host and process the packet is meant for. Each process in a system has a unique PID, and each system in a network has a unique name, so this pair alone can identify the process in question.

The domain name system (DNS) is a global distributed database system for resolving hostname-to-IP (internet protocol) address mappings. It maps human-readable names, e.g. gpu.srv.ualberta.ca, to IP addresses, e.g. the IPv4 address 129.128.5.180 (a 32-bit integer).

TCP/IP (Internet Protocol Suite)

Internet communication is organized into 4 layers of abstraction

The application layer provides a way for processes to exchange data between themselves. The transport layer partitions data into packets, maintains the order of packets, and transfers packets between hosts. The internet layer routes these packets through the network; this involves encoding and decoding addresses. Finally, the link layer handles transmission over the physical media, including the required error detection and correction.

A packet can move through a LAN using only MAC addresses. The address resolution protocol (ARP) is used to map an IP address to the corresponding MAC address so the packet can be delivered on the local network.

Transferring Packets between Hosts

Every host has a name and an associated IP Address; this may be an IPv4 address (32-bit) or an IPv6 address (128-bit).

Finding the Right Process

Once the host has been located, the receiving or sending process is identified by a port number. This is part of the transport layer; the TCP and UDP protocols are responsible for this. A host has multiple ports → multiple processes can send and receive packets

User Datagram Protocol (UDP)

UDP allows for fixed-size messages (datagrams) up to some maximum size. UDP is unreliable: packets may be lost or out of order (although corruption is still checked). UDP is connectionless: no setup or takedown is required, i.e. the process is stateless

Transmission Control Protocol (TCP)

TCP is reliable and connection-oriented: it provides abstractions to allow in-order, uninterrupted streaming across an unreliable network. Connections are opened and closed with control packets. The opening consists of a three-way handshake (SYN, SYN + ACK, ACK).

When a host sends a packet, the receiver must send an acknowledgement packet (ACK). If this is not received before a timer expires, the sender will timeout and retransmit the packet. The sender keeps track of the current set of packets sent but not yet acknowledged.

The sequence counter assigns an order to the packets; the receiver can use this to detect duplicates, e.g. when the original packet was received but its acknowledgement was lost in transmission, causing a retransmission.

10 - Sockets and Servers

Sockets are like pipes that go outside

Distributed computing often follows the client-server model: a server provides services (e.g. file service, database service, name service) that a client (process) might request.

A socket is a connection endpoint that provides an abstraction over a network I/O queue. Each endpoint is defined by an IP address and a port number. Communication between two processes requires two sockets, one for each process.

Sockets come in two common types: connection-oriented stream sockets (SOCK_STREAM, used by TCP) and connectionless datagram sockets (SOCK_DGRAM, used by UDP).

A socket supports some set of address domains: address spaces tailored to some aspect of communication (e.g. INET for IP protocols, UNIX domains for processes on the same machine)

Creating Sockets

We create a socket with the socket syscall; this returns a socket descriptor that identifies the socket that was created (or -1 on error).

int socket(int domain, int type, int protocol);

Sockets work like files; in fact, since this is UNIX, sockets are abstracted by files (of course).

#include <sys/socket.h>

int sockfd; 
// TCP 
sockfd = socket(AF_INET, SOCK_STREAM, 0); 
// sockfd= socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); 

// UDP 
sockfd = socket(AF_INET, SOCK_DGRAM, 0); 
// sockfd= socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

Addressing

We can identify the process we want to communicate with by host name (e.g. IPv4) and/or by port number.

struct addrinfo {
    int ai_flags;
    // socket domain
    int ai_family;
    // socket type
    int ai_socktype;
    // protocol
    int ai_protocol;
    socklen_t ai_addrlen;
    // contains IP address and port number
    struct sockaddr *ai_addr;
    char *ai_canonname;
    struct addrinfo *ai_next;
};

Calling socket doesn't assign an address to the socket: we either need to call bind to associate a port (must be larger than \(1024\)) or leave that choice to the OS when listen or connect is called.

Connections

A connection is uniquely defined by the 5-tuple (source IP address, source port number, destination IP address, destination port number, protocol).

Accepting a Connection

A server will call listen (providing the socket descriptor and the maximum number of pending connection requests to queue) to convert the socket to a listening socket, which starts allowing clients to connect.

Then, the server will call accept to create a new connection socket for a particular client connection (from the queue). Thus, accept returns a new socket descriptor for the connection socket so that the original socket is still available to receive requests.
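A minimal TCP server sketch tying socket, bind, listen, and accept together (port 8080 and the reply are arbitrary; error checks omitted):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int listenfd = socket(AF_INET, SOCK_STREAM, 0);

    // bind the socket to port 8080 on any local interface
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(listenfd, (struct sockaddr *) &addr, sizeof(addr));

    // convert to a listening socket with a queue of up to 5 pending requests
    listen(listenfd, 5);

    // accept returns a NEW descriptor for this client; listenfd keeps listening
    int connfd = accept(listenfd, NULL, NULL);
    send(connfd, "hello\n", 6, 0);
    close(connfd);
    close(listenfd);
    return 0;
}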

Establishing Connections

Connection-oriented services (e.g. TCP) require a connection to be established first; this is done by calling connect, which connects a socket to the specified remote socket address.
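A matching client sketch (127.0.0.1:8080 is the made-up address of the server above):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int sockfd = socket(AF_INET, SOCK_STREAM, 0);

    // fill in the remote socket address (IP address + port) to connect to
    struct sockaddr_in server;
    memset(&server, 0, sizeof(server));
    server.sin_family = AF_INET;
    server.sin_port = htons(8080);
    inet_pton(AF_INET, "127.0.0.1", &server.sin_addr);

    // establish the TCP connection (the three-way handshake happens here)
    connect(sockfd, (struct sockaddr *) &server, sizeof(server));

    char buf[64];
    ssize_t n = recv(sockfd, buf, sizeof(buf) - 1, 0);
    buf[n] = '\0';
    printf("server said: %s", buf);
    close(sockfd);
    return 0;
}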

Transferring Data

We write data with send (connection-oriented) or sendto (connectionless), which take flags.

We read data with recv (connection-oriented) or recvfrom (connectionless, requires source address), which also take flags.

We can also use read and write to, well, read and write data. However, these aren't socket-specific, so they provide a lower level of abstraction and are more error-prone. The socket-specific functions deal with sockets more cleanly and safely.

Trying to send or receive data on a broken socket creates a SIGPIPE signal.

Diagrams: put these in the right spot

11a - Remote Procedure Call

When a local function makes a syscall, the function prepares the arguments and raises the syscall exception, which gets handled by the system call handler. This handler runs, produces the result in a register, then returns control to the function. Easy.

The idea of Remote Procedure Call (RPC) is to make calling syscalls on remote systems as easy as this.

We need RPCs to implement distributed systems, since a syscall could be made on any device, not just the one that the local process is running on.

Client-side

The client sets up a message buffer, then reads the function identifier and arguments into the buffer (message serialization or marshalling). Next, the message is sent across a network to the destination RPC server. The client waits for the reply, then unpacks (unmarshalls) the code and returns it to the calling process.
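A toy sketch of the marshalling step, packing a function identifier and two integer arguments into a flat buffer (the message layout, the use of network byte order, and the function name are all assumptions for illustration):

#include <arpa/inet.h> // htonl: convert to network byte order
#include <stdint.h>
#include <string.h>

// Pack "call function func_id with arguments a and b" into buf.
// Returns the number of bytes written; this buffer is what gets sent to the RPC server.
size_t marshall_call(uint8_t *buf, uint32_t func_id, int32_t a, int32_t b) {
    uint32_t net_id = htonl(func_id);
    uint32_t net_a  = htonl((uint32_t) a);
    uint32_t net_b  = htonl((uint32_t) b);

    size_t off = 0;
    memcpy(buf + off, &net_id, sizeof(net_id)); off += sizeof(net_id);
    memcpy(buf + off, &net_a,  sizeof(net_a));  off += sizeof(net_a);
    memcpy(buf + off, &net_b,  sizeof(net_b));  off += sizeof(net_b);
    return off;
}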

Server-side

Each message is addressed to an RPC daemon listening to a port on the RPC server. When a message arrives, it is unpacked (unmarshalled), the syscall is made with parameters, re-marshalled, and the reply is sent back.

Runtime Library

The RPC server needs a runtime library to handle the tasks related to accepting, processing, and sending the data:

Diagrams

11b - CPU Scheduling I

How can we best utilize the CPU? I.e. how do we manage the ready queue?

Motivation

Scheduling is difficult for a few reasons

In particular, processes consist of alternating CPU bursts and I/O bursts. A balanced process has roughly equal composition of CPU and I/O bursts; an unbalanced process may favor one more than the other.

// Grammars on the mind
// Can you tell I'm also taking CMPUT 415 this term?
process: (IO_burst)? (CPU_burst IO_burst)+ (CPU_burst)?

Definitions

A workload is a set of tasks for the system to run. Each task has a specific arrival time and burst length; if the system is a real-time system, it also has a deadline.

A performance metric is a criterion that can be used to compare scheduling policies. A scheduling policy is an algorithm that decides the order in which the tasks are executed.

Scheduling

Job scheduling (long-term scheduling) is the process by which the OS decides which and how many jobs should execute in main memory at the same time

CPU scheduling (short-term scheduling) is the process by which the OS picks the next process from the ready queue.

The kernel decides when the scheduler is run. In a non-preemptive system, the scheduler must wait for a notable event to occur (e.g. the running process blocking, terminating, or making a syscall), whereas a preemptive system can interrupt a running process to run the scheduler (e.g. on a timer interrupt, which makes sure the scheduler gets run regularly).

Performance Criteria for Scheduling

Minimizing the average turnaround time is the same as minimizing the time it takes jobs to execute in general; this is the defining problem we are trying to solve. So, turnaround time is probably the most telling metric, and the one we focus on.

Clearly, it is not possible to optimize all of these criteria at the same time (or is it?). So, we choose a policy to optimize a metric, which may involve optimizing one or more of the following criteria:

Assumptions

I think I posted a reel to this song

For the scheduling algorithms we will cover, we assume each user has a single process, which has a single thread, and the machine has one processing core. All processes are assumed to be independent.

These are not realistic assumptions for modern computers. However, developing scheduling algorithms under relaxation of the above assumptions is still largely an open problem.

Part II contains algorithms. Yay!!

12 - CPU Scheduling II

make some cool diagrams for this!

First Come, First Served (FCFS/FIFO)

First-come-first-served scheduling

Algorithm
Jobs are executed to completion in the order that they arrive.

So, the FCFS scheduler runs only when the current job terminates or blocks waiting for I/O (i.e. it is non-preemptive)

Advantages

This algorithm is simple to implement and has low overhead.

Disadvantages

Since the order of tasks remains unchanged, short jobs can get "stuck" behind long jobs. So, the wait time is highly variable. As the variance of task sizes increases, the average response time of FCFS gets longer.

FCFS isn't fair: a long task holds the CPU for its entire length, so short tasks arriving behind it wait for a time far out of proportion to their own needs.

Finally, overlap between the CPU and I/O is clunky because CPU-heavy processes will force I/O-heavy processes to wait, leaving the I/O devices idle for (possibly long) periods of time. This degrades the experience of the user.

Round-Robin Scheduling

Round-Robin Scheduling

Algorithm
Each task is allocated the CPU for a fixed time quantum. If the process is not finished executing by the end of this period, it re-enters the queue at the end.

If \(Q\) is too long, waiting time increases and round-robin degenerates to FCFS as \(Q \to \infty\), since processes are preempted less and less. If \(Q\) is too short, a larger fraction of time is spent on context switching (the fixed context-switch cost is paid more often), increasing overhead.

Round-robin is pretty good; most systems today use it.

Advantages

Round-robin is fair, by definition.

Disadvantages

If tasks are equal in size, average waiting time can suffer if there is "constructive interference" between the quantum length (\(Q\)) and the length of the tasks. Namely, if the task length is slightly longer than a multiple of \(Q\), then tasks that are almost done are added back to the queue. Next time the scheduler passes through the queue, each task ends quickly, increasing context-switch overhead.

Shortest [Job | Remaining Time] First Scheduling (SJF/SRTF)

SJF/SRTF Scheduling

Algorithm
Each time it runs, the scheduler picks the job that has the least expected amount of work left

Estimating Burst Lengths of Processes

Generally, we apply the concept of temporal locality: we estimate the next burst length from the burst lengths we have already observed. In particular, if \(b(t)\) denotes the measured length of the \(t\)-th burst, the exponential moving average estimator \(\hat{b}\) is defined by \(\left\{ \begin{array}{ll}\hat{b}(1) := \text{an initial guess} \\ \hat{b}(t+1) := \eta \times b(t) + (1-\eta) \times \hat{b}(t) & \text{ for } \eta \in (0, 1] \end{array}\right.\).
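The update step is tiny in code; a sketch (eta and the initial estimate are tuning choices):

// next estimate = eta * most recently observed burst + (1 - eta) * previous estimate
double next_estimate(double eta, double observed_burst, double prev_estimate) {
    return eta * observed_burst + (1.0 - eta) * prev_estimate;
}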

Advantages

SJF is provably optimal for minimizing average waiting time, and is compatible with both preemptive and non-preemptive schedulers.

Disadvantages

It is not possible to accurately predict the amount of CPU time a process will need ahead of time (so then why is this here? is it impossible, or are we just bad at it. Is there a bound on how good we can get at it? is it an active area of research?). Under this assumption, SRTF degenerates into picking tasks pseudo-randomly.

If new short jobs keep arriving in the queue, a long-running CPU task may starve, i.e. never be allocated resources.

aside: do this! it's cool.

13 - CPU Scheduling III

Can we design a policy that minimizes response time, is fair and starvation-free, with low overhead?

Priority Scheduling

We can annotate each task with a (usually) user-defined priority that indicates how important it is. Then, we keep multiple queues, one for each priority level.

However, if the scheduler simply picks from the highest priority queue that isn't empty, jobs are prone to starvation again. Instead we can

These solutions may increase average response time, but optimize fairness.

Priority Inversion

If a high-priority task is blocked by a lower-priority task, the high-priority task may temporarily lend its high priority to the low-priority task (priority inheritance) so that it can run on its behalf. Once the low-priority task is finished, the high-priority task can run again.

Multi-level Feedback Queue (MFQ) Scheduling

Multi-level Feedback Queue (MFQ) Scheduling

Algorithm
The scheduler keeps track of multiple job queues, classified by priority level. Each queue is evaluated with round-robin (possibly with different quanta); once a queue is empty, the scheduler checks the queue with next-highest priority.

Note that any improvement to fairness will come from giving long jobs more CPU time, which degrades average wait time. Some "solutions":

Lottery Scheduling

Lottery Scheduling

Algorithm
Each process gets some number of lottery tickets. Each time slice, a winning ticket is picked and the corresponding process is run.

So, the share of CPU cycles allocated to a job is probabilistically proportional to the number of tickets assigned to each job. Thus, the priority of a job corresponds to how many tickets are given to it. Every job is given at least one ticket to avoid starvation

So, performance degrades gracefully as the load changes because the effect of adding or removing a process is probabilistically "spread out" over all existing processes equally.

Scheduling on Multi-core Systems

There are two approaches

Aside: a common ready queue system would exponentially alleviate the problems with other algorithms, right? since a single core isn't blocking anything or can we just abstract this away as stuff we've seen already? look into this

Scheduling on Real-time Systems

A real-time system is a system where each task specifies a deadline by which it must be completed. In a soft real-time system, failing to meet the deadline leads to degradation in performance. In a hard real-time system, missed deadlines cause the system to fail entirely.

Soft real-time systems often use a preemptive, priority-based scheduler that assigns real-time tasks the highest priorities. E.g. in Windows, priority levels \(16\)-\(31\) are reserved for real-time processes (\(31\) is the highest level).

Hard real-time systems require the use of admission control: the scheduler can only accept a task if it can guarantee it will finish execution by the deadline. Otherwise, the scheduler rejects the task

Rate-Monotonic Scheduling (for Periodic tasks)

A periodic task has a fixed deadline \(d\), processing time \(t\), and will require the CPU at a constant interval \(p\). So, (for any of this to have a chance of working) we must have \(0\leq t\leq d\leq p\). The task will need \(\dfrac{t}{p}\) of the CPU's time.

Rate-Monotonic Scheduling

Algorithm
The priority of a task is proportional to the inverse of its period, i.e. to its rate (frequency)

Earliest-Deadline-First (EDF) Scheduling

Earliest-deadline-first (EDF) Scheduling

Algorithm
The earlier a task's deadline, the higher its priority

If a feasible schedule exists, EDF is theoretically optimal with respect to CPU utilization. However, if no schedule exists, EDF may cause more tasks to miss their deadlines than is necessary.

Least-Laxity-First (LLF) Scheduling

We define the laxity or slack time as \(\text{time until deadline } - \text{ remaining execution time required}\) \(=\text{deadline }-\text{ current time }-\text{ remaining execution time required}\).

Least-Laxity-First (LLF) Scheduling

Algorithm
The priority of a task is inversely proportional to the task's laxity

So, LLF is essentially a refinement of EDF, inheriting from the definition of laxity as a refinement of "time until deadline". The overall structures are the same; it's just the "cost function" that's different.

14 - Intro to Threads

Threads are the next level down from processes; the threads of a process execute the same program at the same time, using the same address space. They let us further "split" tasks, but incur lower costs than using separate processes.

Motivation

Modern systems have multiple processing cores and the OS needs to handle many things at once. So, introducing parallelism has the potential to improve performance. A process with multiple threads has multiple "little pieces" that can run in parallel.

E.g. a web browser may have a thread to render the HTML and CSS, another to retrieve data from the network, another for responding to user input, etc. Thus, all of these things can happen at the same time.

E.g. a web server executing multiple threads concurrently

E.g. the kernel of an operating system is multi-threaded; each thread performs a specific task, e.g. device management, memory management, interrupt handling, etc.

Running the Same Program at Different Points

A thread is a single stream of execution within a process, representing an independently schedulable task. So, each process must have one or more threads, i.e. one or more points of execution.

Each thread has a thread ID (analogous to a process ID), a program counter, a stack pointer (and thus its own stack), and a set of registers to use. However, the heap, text, and data sections are shared with the rest of the threads in the process, as well as other OS resources (e.g. open files, signals)

The resources unique to a thread are stored in the thread control block (TCB), analogous to the process control block. This may also include thread-local storage (TLS): per-thread data that persists across function calls (unlike stack variables) but is not shared with the other threads (unlike globals).

Multi-threaded Processes

The address space of a process is shared among its threads, i.e. many threads exist in one protection domain.

Since each thread of a multi-threaded process shares the heap, data, files, etc, data-sharing is much easier and does not require syscalls, message passing, shared memory, etc.

If two threads run successively on a single processor, a context switch is still required because the PC and registers must be replaced. However, since the address space is the same, context switches between threads are less expensive.

Structure of a Multi-threaded Program

Order between threads is not specified or guaranteed, so threads may run in arbitrary order. Thus, we can design the program so all the threads can execute concurrently (asynchronous threading) or force the main thread to wait until the other threads terminate (synchronous threading).

#define N 100
int in, out;
int buffer[N];

void producer() {
    //...
}

void consumer() {
    //...
}

int main() {
    in = 0;
    out = 0;
    // the routines are passed as function pointers, not called;
    // we could also use pthread_create, which takes a function pointer the same way
    fork_thread(producer);
    fork_thread(consumer);
    // ...
}

Benefits of Multithreading

Performance Gain

We can quantify performance gain with Amdahl's law: if a fraction \(S\) of a program is serial (i.e. sequential) and the remaining fraction \(1-S\) is parallelizable, then on \(N\) processing cores we have the bound \(\boxed{\displaystyle\text{speedup with respect to adding new cores }\leq \left({S+\dfrac{1-S}{N}}\right)^{-1}}\).
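
As a quick worked example (hypothetical numbers): with \(S = 0.25\) and \(N = 4\) cores, the speedup is at most \(\left(0.25 + \dfrac{0.75}{4}\right)^{-1} \approx 2.29\); even with infinitely many cores, it can never exceed \(\dfrac{1}{S} = 4\).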

How Does Multithreading Work?

We have two types of threads: user threads and kernel threads.

Kernel Threads

A kernel thread (or lightweight process) is a thread directly managed by the OS; the kernel manages and schedules it.

User-level Threads

A user-level thread is a thread that the OS is not aware of. Instead, the programmer uses a thread library (e.g. C-threads) to implement and manage the threads in code. This library handles the creation, synchronization, and scheduling of the threads instead of the OS.

Communicating between both Levels of Threads

Since the OS doesn't know about the user-level threads, it can't use them to make scheduling decisions, decreasing the performance of the scheduler. Solving this problem requires the thread library to communicate with the kernel, which requires syscalls.

Thus, the user and kernel threads must communicate in some configuration: one-to-one (high parallelism and concurrency, used in most current implementations), many-to-one (doesn't work for multicore systems since threads of the same process can't be parallelized), many-to-many (lets us use fewer kernel threads than user threads) or two-level (hybrid).

Concurrency vs. Parallelism

A system is parallel if it can perform more than one task at the same time. Multi-threading can improve the parallelism of a program.

A system is concurrent if all of its tasks can make progress at any given time. This can be implemented by switching quickly between processes, and thus does not require multi-threading.

Challenges of Programming on Multi-core Systems

Tasks must be identified and split in such a way that they are independent and can be run in parallel. They should also perform equal amounts (and value) of work for effective load balancing. Distributing tasks across cores is the core aim of task parallelism.

If there is dependency between data accessed by multiple tasks, the task execution must be synchronized, i.e. we need to make sure that conflicting tasks don't run at the same time.

Testing and debugging get much harder in multithreaded applications because execution happens in multiple places and has elements of non-determinism (e.g. due to no guarantees on thread order).

15 - The POSIX Thread Library

Forking in Multithreaded Programs

fork has two versions: one that duplicates all the threads of the parent process, and one that creates a single-threaded child process. Generally, the thread copied is the one that called fork in the first place.

Implicit vs. Explicit Threading

Under implicit threading, thread creation and management is left to the compiler and run-time libraries of a program. Thus, programmers just need to identify tasks that can run in parallel in their programs; the infrastructure they use will do the actual thread allocation for them.

Under explicit threading, a library provides an API for creating and managing threads to the programmer, who must use it to implement the threading manually

POSIX Thread Library

In PThreads, an opaque data type pthread_t acts as a handle of a thread, i.e. something to identify it (like a PID).

// creates a thread and returns the thread's handle through its first argument;
// the start routine is passed as a function pointer (not called),
// along with an argument for it
// in practice, implemented with the clone system call on Linux
pthread_create(&tid, &attr, start_routine, arg);

// allows the initial attributes of the thread to be set
pthread_attr_init(&attr);

// blocks the calling thread until the given thread has terminated
pthread_join(tid, &retval);

// causes the calling thread to give up the CPU voluntarily
pthread_yield();

// terminates the calling thread and executes cleanup handlers
pthread_exit(retval);

Signals in Multithreaded Programs

A synchronous signal is delivered directly to the thread that caused the signal to be generated. On the other hand, asynchronous signals (i.e. signals from external events) are sent to the first thread that hasn't blocked that signal, and possibly all the threads that haven't blocked that signal.

POSIX Pthreads has a pthread_kill(pthread_t tid, int sig) function that lets the user send a signal to a specific thread.

Cancelling (terminating) Threads

Under asynchronous cancellation, a thread will terminate the target thread immediately. Thus, the OS does not have time to react, and might not be able to free/reclaim any resources allocated to the target thread.

Under deferred cancellation, the target thread periodically checks (polls) if it has been cancelled, specifically at cancellation points. If a pending cancellation request is found, the thread will terminate itself and a cleanup handler is invoked to release the thread's resources.

In PThreads, we cancel a thread using pthread_cancel(pthread_t tid). Note that threads can set their own cancellation type with pthread_setcanceltype(); deferred cancellation is the default. A thread may choose to disable cancellation entirely.
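
A minimal sketch of deferred cancellation (the worker's loop body and the one-second delay are made up for illustration; compile with cc -pthread):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

void *worker(void *arg) {
    int oldtype;
    // deferred cancellation is the default, but we can set it explicitly
    pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, &oldtype);
    while (1) {
        // ... do a chunk of work ...
        pthread_testcancel();   // explicit cancellation point: polls for a pending request
    }
    return NULL;
}

int main(void) {
    pthread_t tid;
    void *res;

    pthread_create(&tid, NULL, worker, NULL);
    sleep(1);               // let the worker run for a bit
    pthread_cancel(tid);    // request cancellation; worker exits at its next cancellation point
    pthread_join(tid, &res);
    if (res == PTHREAD_CANCELED)
        printf("worker was cancelled\n");
    return 0;
}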

Thread Pools

A thread pool is a collection of worker threads that execute callback functions for the application, i.e. are passed functions (as arguments) to execute inside the thread. At startup, some fixed number of threads are created; threads wait for tasks to be delegated to them

We can implement a job queue to queue incoming requests; this covers the case when we have more requests than threads in the pool. Once a thread becomes idle, it is "added back to the pool" and the next job in the queue is assigned to it.

Aside: to solve the producer-consumer problem, we might wish to lock access to the job queue and use a condition variable to store whether there is space in the job queue or not.

The Fork-Join Model

Under the fork-join model, a thread forks subthreads with fork, passes them arguments, waits for them to complete work in parallel, then joins them with join and combines/processes the results.

This is like a synchronous version of a thread pool; the user (or library, if it uses fork-join to implement something) determines the number of threads that need to be created, which would depend on the number of tasks.

Aside: fork-join follows the structure of divide and conquer algorithms well, implying that they can be easily parallelized under this structure. For example, to implement merge sort, we could fork into two threads to recursively merge sort each half of the list, then join them and collect the results by merging the two lists, as per the definition of the algorithm.

Code Examples

Basic Threads with PThreads
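
A minimal create/join sketch (the work struct, the "doubling" work, and all names are made up for illustration; compile with cc -pthread):

#include <pthread.h>
#include <stdio.h>

// argument and result are passed through this struct on the creator's stack
struct work { int input; int output; };

void *runner(void *arg) {
    struct work *w = arg;        // recover the argument from void*
    w->output = w->input * 2;    // stand-in for real work
    pthread_exit(NULL);          // equivalent to returning from the start routine
}

int main(void) {
    pthread_t tid;
    struct work w = { .input = 21, .output = 0 };

    // create the thread, passing the start routine as a function pointer
    pthread_create(&tid, NULL, runner, &w);

    // block until the new thread terminates
    pthread_join(tid, NULL);

    printf("result: %d\n", w.output);   // prints 42
    return 0;
}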

Multithreaded Server Concept

// the server loop is constantly running
serverLoop() {
    // blocked by AcceptConnection until there is a new connection
    connection = AcceptConnection();
    // creates a new thread to handle each request that comes in
    thread_fork(ServiceRequest, connection);
}

16 - Synchronization I - Concurrency Problems

Recap: cooperating processes (or threads) share data via message passing or shared memory. However, concurrent access to shared data can lead to inconsistency, so programmers must synchronize access to shared data.

Race Conditions

A race condition occurs when the outcome of a program depends on the order in which threads accessing shared data are scheduled. Since this order is not defined, race conditions lead to unpredictable, non-deterministic results.

// initialize: n = ARRAY_SIZE - 1
if (n == ARRAY_SIZE) {
    return -1;
}
array[n] = valueA;
n++;

If two threads run this code at the same time, there are a few possibilities: the threads might happen to serialize cleanly, both might write to the same slot (one value is lost), or both might pass the bounds check and the second write goes past the end of the array.

Since a certain bug might only happen for a certain interleaving of the threads (which itself is non-deterministic), it can be extremely difficult to debug race conditions.

Atomicity

An operation is atomic if it either runs to completion or not at all. Single instructions may be atomic, but compound operations (e.g. ones compiled from lines of user code) might not be, e.g. x = x + 1. This includes ordinary reads and updates of memory, since the value must first be loaded into a register, modified, and then written back.

If an operation is not atomic, threads may interleave within it, meaning the state of the program directly before and directly after the instruction might be different, with the difference being caused by something unrelated to the instruction.
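
A small demonstration sketch (made-up example; compile without optimizations, e.g. cc -pthread race.c): two threads increment a shared counter with no synchronization, and lost updates usually leave the final value well below 2000000.

#include <pthread.h>
#include <stdio.h>

#define ITERATIONS 1000000

int counter = 0;   // shared, unsynchronized

void *incrementer(void *arg) {
    for (int i = 0; i < ITERATIONS; i++)
        counter = counter + 1;   // load, add, store: threads may interleave between the steps
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, incrementer, NULL);
    pthread_create(&t2, NULL, incrementer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    // expected 2000000, but lost updates usually make it smaller
    printf("counter = %d\n", counter);
    return 0;
}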

A Combinatorics Aside That No One Asked For

We can derive that for \(N\) lines of code and \(k\) threads, we have \(\dfrac{(kN)!}{(N!)^k}\) ways to interleave the threads, i.e. ways the different threads may traverse the same program. This corresponds to the number of monotone lattice paths from the "top-left" to the "bottom-right" corner of a \(k\)-dimensional \(N\times \dots\times N\) lattice.

We note that each traversal might not produce a unique result, i.e. multiple different traversals might end up "accomplishing" the same thing. So, this formula doesn't count how many different program outcomes there are.
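
For instance, with \(N = 2\) lines and \(k = 2\) threads, the formula gives \(\dfrac{(2\cdot 2)!}{(2!)^2} = \dfrac{24}{4} = 6\) interleavings, which is small enough to enumerate by hand; with \(N = 10\) and \(k = 2\) it is already \(\dfrac{20!}{(10!)^2} = 184756\).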

Critical Sections

A critical section is a block of code (a sequential set of programming instructions) that cannot be executed in parallel by multiple threads, i.e. if it were, a race condition would occur. Thus, we require mutual exclusion of threads in the critical section: only one can be in there at a time.

To be effective, a critical section must be correct, efficient (entry and exit of the critical section is fast and the critical section is short), flexible (have few restrictions), and support high concurrency.

Naive Synchronization Strategies

Just leave a note! I.e. have a variable that we store 1 in when a thread is in the critical section and 0 otherwise. Then, before entering the critical section, we just check what the note is.

Just have different notes for each thread, and check each thread's note before entering the critical section. This would be implemented as an array of notes.

A Naive Strategy that Works

We have two notes, but in one of the threads, we spin (stall) until we see that the other thread's note is 0; then we enter the critical section. We implement the spin with a while(true) loop that breaks when the note becomes 0.

This works, strictly speaking, but sucks for a few reasons

17 - Synchronization Algorithms and Hardware Support

Dekker's Algorithm

Dekker's Algorithm

Algorithm
We keep track of a lock, as well as an extra variable that stores which thread's turn it is to enter the critical section. While the critical section is locked, we continue checking whether it's our turn to access it. If it isn't, we release the lock (so the other thread can access the critical section), then wait (spin) until it's our turn again. When the critical section unlocks, we stop spinning, execute it, then give the turn to the other thread and release our lock.

Peterson's Algorithm

Peterson's algorithm simplifies Dekker's algorithm by merging both while loops into one loop

Peterson's Algorithm

Algorithm
Initialize your own lock to 1 and give the other process the turn. Then, spin while the other process is locked and the turn is not yours. After that, you are free to enter the critical section, then remove your lock.
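
A sketch of Peterson's algorithm in C for thread \(i \in \{0, 1\}\) (function names are chosen for illustration; real code would also need memory barriers, since compilers and CPUs may reorder these accesses):

int lock[2] = {0, 0};   // lock[i] == 1 means thread i wants to enter
int turn;               // which thread we defer to

// code for thread i (0 or 1); the other thread runs the same code with the roles swapped
void enter_critical_section(int i) {
    int other = 1 - i;
    lock[i] = 1;        // set our own lock to 1
    turn = other;       // give the other thread the turn
    // spin while the other thread holds its lock and the turn is not ours
    while (lock[other] == 1 && turn == other)
        ;               // busy-wait
    // safe to enter the critical section here
}

void exit_critical_section(int i) {
    lock[i] = 0;        // remove our lock
}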

We can prove that the following boolean expressions cannot both be true at the same time; thus, mutual exclusion is guaranteed if the invariant holds before entering the critical section.

Peterson's algorithm is correct, and the solution is symmetric, i.e. the code of both threads "looks the same" (the only difference is which entry in the lock array is accessed). However, it is limited to two threads and requires busy-waiting.

Correctness Conditions

A solution is correct if it has safety (i.e. guarantees mutual exclusion), liveness (guarantees that progress will be made if needed, i.e. a thread waiting to enter an empty critical section will eventually enter it) and a bounded wait time.

Hardware-supported Synchronization

To eliminate busy-waiting and deal with lock contention (multiple threads competing for the same lock), we need hardware support for synchronization.

We have three main strategies for hardware-supported synchronization: memory barriers, atomic hardware instructions, and atomic variables (usually implemented with atomic hardware instructions).

Basically, all the synchronization problems come from the fact that a process can be interrupted at any time. So, we find ways to prevent interrupts from being a problem.

Synchronization by Disabling Interrupts

With a non-preemptive scheduler, the OS only gets control of the CPU when internal or external events (i.e. traps and interrupts) happen. So, on uniprocessor systems, we can simply disable interrupts around the critical section.

Synchronization by Atomic Instructions

In hardware, we implement an atomic read-modify-write instruction that reads a value from memory into a register and writes a new value to it in a single instruction. Thus, it can't be interrupted; the critical section is formed of a single instruction.

Examples: test_and_set atomically reads a value from memory into a register and writes 1 (true) back to the memory location; exchange (e.g. x86 xchg) atomically swaps between a register and a memory address; compare_and_swap (e.g. x86 cmpxchg) does a conditional swap based on a comparison with an expected value.
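
Textbook-style C pseudocode for these instructions (the hardware executes each function body atomically; this is just a model of their behaviour):

// returns the old value and unconditionally sets the location to true
int test_and_set(int *target) {
    int old = *target;
    *target = 1;
    return old;
}

// returns the old value; swaps in new_value only if the old value matches expected
int compare_and_swap(int *value, int expected, int new_value) {
    int old = *value;
    if (old == expected)
        *value = new_value;
    return old;
}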

Implementing Spinlocks

We can implement a spinlock with test_and_set:

while(true) {
    while(test_and_set(&lock));
    // we are entering the critical section
    // ...
    // critical section done
    lock = false;
}

We can implement a spinlock with compare_and_swap as well

while(true) {
    while(compare_and_swap(&lock, 0, 1) != 0);
    // we are entering the critical section
    // ...
    // critical section done
    lock = 0;
}

Implementing Atomic Variables

We can define an increment function that atomically increments a value in memory using compare_and_swap and an atomic variable.

void increment(atomic_int *v) {
    int temp;
    do {
        temp = *v;
    } while (temp != (compare_and_swap(v, temp, temp+1)));
}

18 - Synchronization Primitives I

Synchronization primitives must remain invariant over all the threads of a process

Programming languages provide primitive, atomic operations for synchronization. These offer the basic building blocks to create synchronous programs, just like the lambda calculus provides the basic building blocks to create functional programs.

Mutex Locks

A mutex lock is a high-level programming abstraction of an object that only one thread can hold at a time; it may be implemented with a blocking operation or a spinlock (which uses busy waiting).

// recall: all threads of a process share the heap
void *malloc (size_t size) {
    // assume we have some mutex lock over the heap
    // we get the heap lock
    heaplock.acquire();
    // allocate the memory with syscalls, etc
    p = /* the memory */;
    // release the lock
    heaplock.release();
    return p;
}

void free(void *p) {
    heaplock.acquire();
    // deallocate memory with syscalls, etc
    heaplock.release();
}

We can implement mutex locks on uniprocessor systems by disabling interrupts with intr_disable() and re-enabling them with intr_enable(); this happens at the start and end (respectively) of both acquire and release.

This strategy won't work on multiprocessors, since disabling interrupts on one processor won't disable it on the others. So, for multiprocessor systems, we can use instructions like test_and_set or compare_and_swap to implement a busy-waiting loop that atomically checks the status of the lock.

A two-phase lock spins for a small amount of time to see if the lock can be acquired (spin phase). If it can't, the caller is put to sleep (sleep phase). This is a heuristic that assumes temporal locality; if a lock isn't available now, it probably won't be available for a while.

Condition Variables

Combined with mutexes, condition variables allow threads to wait in a race-free way for arbitrary conditions to occur; the conditions themselves are predicates over shared data protected by the mutex, not values stored inside the condition variable.

A condition variable is an abstraction that supports conditional synchronization. A queue of threads waits for a specific event that happens in the critical section. So, the value of the condition variable is defined by data protected by a mutex lock.

A condition variable defines three functions: wait (release the lock and block until another thread signals), signal (wake one waiting thread), and broadcast (wake all waiting threads).

We take the following steps when dealing with a condition variable:

  1. Acquire the lock to enter the critical section
  2. Check the condition inside the critical section. If the event we need hasn't happened yet, call wait, which blocks the thread and releases the lock (reacquiring it when woken). Otherwise, just continue and release the lock when done.

Example: Vending Machine

Intuitive Summary

A mutex lock is a device to prevent more than one thread from "moving forward" in the code; a thread must acquire the lock to do so, and the "primitive" part is that only one thread can acquire it at a time.

Sometimes, we can implement what we need with just mutex locks. However, whether we should proceed may be predicated on program state that can only be checked inside the critical section (e.g. whether a bounded buffer, which lives in the critical section, is full or not). A mutex lock alone can't express "wait until this state changes", since checking the state requires being inside the critical section in the first place.

A condition variable is a primitive that lets us coordinate around state inside the critical section. A thread that holds the lock can call wait to release the lock and block until another thread calls signal (wake one thread) or broadcast (wake all threads) from inside the critical section; the woken thread reacquires the lock before continuing. Thus, we can react to information from the critical section without busy-waiting on the lock.

Usage Pattern

We often use a while loop to handle a mutex lock and condition variable:

  1. Acquire the lock
  2. Enter a while loop that checks the negation of the "condition" inside the critical section, calling wait on the condition variable in the loop body
  3. By definition, after this loop, we have the lock and the condition we are checking for is true; the code after this is the critical section
  4. Release the lock

My implementation of MapReduce (assignment 2) has good examples of this pattern everywhere.
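
A sketch of the pattern with POSIX threads, for a hypothetical bounded buffer (BUFFER_SIZE, buffer_count, and the function names are made up for illustration):

#include <pthread.h>

#define BUFFER_SIZE 16            // hypothetical capacity

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t has_space = PTHREAD_COND_INITIALIZER;
int buffer_count = 0;             // shared state protected by lock

void produce_item(void) {
    pthread_mutex_lock(&lock);                  // 1. acquire the lock
    while (buffer_count == BUFFER_SIZE)         // 2. while the condition we want is false...
        pthread_cond_wait(&has_space, &lock);   //    ...wait (releases the lock while blocked)
    buffer_count++;                             // 3. lock held and space available: critical section
    pthread_mutex_unlock(&lock);                // 4. release the lock
}

void consume_item(void) {
    pthread_mutex_lock(&lock);
    // (a full version would symmetrically wait on a "has_items" condition if the buffer is empty)
    buffer_count--;
    pthread_cond_signal(&has_space);            // wake one waiting producer
    pthread_mutex_unlock(&lock);
}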

19 - Synchronization Primitives II

Semaphores

A semaphore is a generalization of the mutex lock + condition variable pattern: a semaphore has an integer variable (as opposed to the lock, which is a boolean variable) that supports the wait and signal operations.

A semaphore keeps a wait queue of threads waiting to access the critical section. If a thread calls wait and the semaphore is free (i.e. not \(0\)), it grants access and decrements the variable. Otherwise, the process is placed in the wait queue. When signal is called, one process is unblocked and the semaphore is incremented; the next process in the queue is given access.

Mutual exclusion can be implemented by a binary semaphore by calling wait directly before the critical section and signal directly after. Used the same way, a counting semaphore initialized to \(k\in \mathbb{N}\) would let at most \(k\) threads into the critical section at a time.

Unlike with condition variables, calling signal on a semaphore with an empty queue does have side effects: it still increments the semaphore's count, which gets decremented the next time wait is called. Calling signal on an "empty" condition variable does nothing; the signal is simply lost if no thread is waiting.
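
A sketch of mutual exclusion with a POSIX binary semaphore (the shared counter is made up for illustration; initializing with a value \(k > 1\) would instead allow up to \(k\) threads in at once):

#include <semaphore.h>
#include <pthread.h>

sem_t mutex;              // binary semaphore used as a lock
int shared_counter = 0;

void *worker(void *arg) {
    sem_wait(&mutex);     // decrement; blocks if the value is already 0
    shared_counter++;     // critical section
    sem_post(&mutex);     // increment; wakes one waiting thread, if any
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    sem_init(&mutex, 0, 1);   // initial value 1 -> at most one thread inside at a time
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    sem_destroy(&mutex);
    return 0;
}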

Monitors

A monitor is an abstract encapsulation of a lock and a number of condition variables into a class. It hides the synchronization primitives behind this class interface; the user of the monitor interacts with the critical section through thread-safe methods, which access the critical section's data.

Both styles of monitor can be implemented with semaphores:

Course notes contain examples

20 - Synchronization Examples

Classic Synchronization Problems

Bounded buffer problem: A finite buffer holds shared data, i.e. different processes read from and write to it. We need mutual exclusion to be sure of the data's integrity; we also need to know if the buffer is full or empty to avoid writing or reading, respectively.

Readers-Writers problem: Again we have a shared buffer, but we might have multiple processes reading and writing. We need to ensure that only one writer at a time accesses the data, but many readers should be able to read at the same time.

Dining-Philosophers Problem: \(n\) philosophers sit at a table with \(n\) chopsticks. Philosophers switch between thinking and eating. They require \(2\) chopsticks to eat, but can only pick up \(1\) at a time. How can they avoid deadlock and starvation without communicating directly?

21, 22 - Concurrency Bugs and Deadlock

Non-deadlock Concurrency Bugs

Atomicity bug: An operation consists of multiple instructions and a thread is interrupted in the middle of the operation by another thread that changes some of the data.

Order-violation bug: Some order of thread executions is assumed (usually for threads in a parent-child relationship) that doesn't always hold true. Look for these by considering what would happen if a child thread ran immediately after creation, pausing the parent.

Deadlock Bugs

A deadlock occurs when two threads each hold something the other needs and wait for each other forever, e.g. because they don't follow the same protocol for acquiring the locks (or whatever synchronization primitive) protecting shared data:

// thread A
lock1.acquire();
lock2.acquire();

// thread B
lock2.acquire();
lock1.acquire();

// both threads are in each other's wait queues

The System Model

We define a system to have resource types \(R_1, \dots, R_m\), with \(w_i\) interchangeable instances of resource type \(R_i\) (e.g. CPUs, disks, locks, etc.) and threads (or processes, whatever) \(T_1, \dots, T_n\). To utilize a resource, a thread must request it, use it, and release it.

We can define a resource-allocation graph with vertices \(\{R_1, \dots, R_m\}\cup\{T_1, \dots, T_n\}\); a request edge \(T_i \to R_j\) means \(T_i\) is waiting for an instance of \(R_j\), and an assignment edge \(R_j \to T_i\) means an instance of \(R_j\) is allocated to \(T_i\).

If a deadlock has occurred, the resource-allocation graph must have a cycle. However, the existence of a cycle doesn't imply a deadlock; it is necessary but not sufficient.

We can also define a wait-for graph with vertices \(T_1, \dots, T_n\) by adding an edge \(T_i \to T_j\) whenever \(T_i\) is waiting for a resource held by \(T_j\), i.e. whenever such a path exists in the resource-allocation graph.

When Do Deadlocks Occur?

Four conditions are needed for a deadlock to occur:

  1. Mutual exclusion: at least one resource can be held by only one thread at a time
  2. Hold and wait: a thread holds at least one resource while waiting for others
  3. No preemption: a resource can only be released voluntarily by the thread holding it
  4. Circular wait: there is a cycle of threads, each waiting for a resource held by the next

Dealing with Deadlocks

Detection

We establish a policy for when to check for a deadlock (e.g. if a request can't be fulfilled in some amount of time, CPU utilization drops below a threshold, at a regular interval, check before every resource allocation, etc). Then, when we need to check, we run a cycle-detection algorithm in the resource-allocation graph (\(O(|E|+|V|)\)).

In practice, operating systems just defer this problem to the user (this is usually cheaper, so saves time overall).

Recovery

There are a few options:

Prevention

No Preemption

Strategy I: Always preempt all the resources of a waiting thread

Strategy II: Preempt the resources of a waiting thread when another thread requests those resources. This involves checking whether a requested resource is currently allocated to a waiting thread.

Easy saving and restoration of resource state is needed for these strategies to work effectively. So, they suit resources whose state can be saved and restored cheaply, such as CPU registers and memory contents.

Circular Waiting

We can define a total order over the resources, i.e. an injective function \(f : \{R_1, \dots, R_m\} \to \mathbb{N}\). Then we require that every thread acquires resources only in increasing order of \(f\), which makes a circular wait impossible.
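
A sketch of how this fixes the two-lock deadlock from earlier (pseudocode in the same style as that example): if we decide, say, \(f(\texttt{lock1}) = 1 < f(\texttt{lock2}) = 2\), both threads must acquire lock1 before lock2, so the circular wait can't form.

// both threads now follow the same global order: lock1 before lock2

// thread A
lock1.acquire();
lock2.acquire();
// ... use both resources ...
lock2.release();
lock1.release();

// thread B (previously acquired lock2 first)
lock1.acquire();
lock2.acquire();
// ... use both resources ...
lock2.release();
lock1.release();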

Avoidance

Resource Reservation

A sequence of thread executions \(t_1, \dots, t_n\) is safe if each execution \(t_i\) can run to completion with the currently available resources plus the resources held by all earlier executions (i.e. all \(t_j\) for \(j <i\)), which will have been released by the time \(t_i\) needs them.

An unsafe state doesn't guarantee a deadlock, but is needed for a deadlock to have a chance of occurring. Deadlock cannot occur in a safe state.

The purpose of avoidance is to prevent the system from ever going into an unsafe state. Under resource reservation, threads provide the maximum number of instances of each resource type they might need. Then, every time a thread requests a resource, the system checks if granting this resource would lead to an unsafe state.

Banker's Algorithm

We keep the following data structures for \(n\) threads, \(m\) resources:

available[j]: the number of currently available instances of \(R_j\)
max[i][j]: the maximum number of instances of \(R_j\) that \(T_i\) may ever need
allocation[i][j]: the number of instances of \(R_j\) currently allocated to \(T_i\)
need[i][j] = max[i][j] - allocation[i][j]: the number of instances of \(R_j\) that \(T_i\) may still request

To request resources, we keep a request vector; request[i][j] is the number of instances of \(R_j\) wanted by \(T_i\).

  1. If request[i] is larger than need[i], raise an error, since \(T_i\) has claimed more than its declared maximum
  2. If there are enough available resources to satisfy the request, proceed. Otherwise, \(T_i\) must wait until the resources become available
  3. "Pretend" to allocate the resources by modifying the algorithm state to what it would be if the request were granted, then check for safety by searching for a safe sequence (repeatedly look for a "next" thread whose remaining need can be met, marking threads as finished as we go; if the search terminates before all threads are marked, the state is unsafe and the request must wait)
available -= request[i]
allocation[i] += request[i]
need[i] -= request[i]
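
A sketch of the safety search from step 3 in C (NUM_THREADS, NUM_RESOURCES, and the array shapes are placeholders following the data structures above):

#define NUM_THREADS   5
#define NUM_RESOURCES 3

// returns 1 if the current (possibly "pretend") state has a safe sequence, 0 otherwise
int is_safe(int available[NUM_RESOURCES],
            int allocation[NUM_THREADS][NUM_RESOURCES],
            int need[NUM_THREADS][NUM_RESOURCES]) {
    int work[NUM_RESOURCES];
    int finished[NUM_THREADS] = {0};

    for (int j = 0; j < NUM_RESOURCES; j++)
        work[j] = available[j];

    // keep looking for a thread whose remaining need fits in work
    int progress = 1;
    while (progress) {
        progress = 0;
        for (int i = 0; i < NUM_THREADS; i++) {
            if (finished[i]) continue;
            int fits = 1;
            for (int j = 0; j < NUM_RESOURCES; j++)
                if (need[i][j] > work[j]) { fits = 0; break; }
            if (fits) {
                // pretend T_i runs to completion and releases everything it holds
                for (int j = 0; j < NUM_RESOURCES; j++)
                    work[j] += allocation[i][j];
                finished[i] = 1;
                progress = 1;
            }
        }
    }

    // safe iff every thread could be placed in the sequence
    for (int i = 0; i < NUM_THREADS; i++)
        if (!finished[i]) return 0;
    return 1;
}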

Drawbacks of Banker's Algorithm

Head in the Sand Approach

Just don't try to avoid at all! EZ.

23 - Memory Management

How are memory addresses the CPU uses mapped to physical memory addresses? What is the difference between static and dynamic relocation? How do we allocate contiguous memory? What is fragmentation and how do we get rid of it?

Review: the Instruction Execution Cycle

  1. An instruction is fetched from memory according to the value of the program counter
  2. The instruction is decoded, possibly requiring loading more from memory
  3. The instruction is executed
  4. The result is stored in a register or written back to memory

The CPU can directly access only its registers and main memory. It cannot address disk locations directly, so anything needed from disk must be loaded into memory first.

Also, registers can be accessed in one cycle (229 review!), but memory access usually takes much longer, forcing the CPU to stall. Implementing caches can reduce the symptoms of this problem, but isn't a "full" solution because it doesn't (and could never) guarantee that memory never needs to be accessed.

Memory Addressing

A \(k\)-bit memory address can trivially reference \(2^k\) locations in memory; the size of each location depends on the system, but is often \(1\) word (\(32\) bits). A memory address characterizes a memory space, which may be different from the actual memory layout of the machine, so some sort of translation between these two "spaces" is needed.

! - Class info

Course Goals

Textbooks and Resources

I really should have made my 229 notes over the summer to prepare for this, oops!

Evaluation

Assignments

Labs

Overview of Topics