Lecture Zero: Introduction

Housekeeping

Grade breakdown:

  • 50% homework (5 assignments, 10% per homework released on GitHub and submitted on Gradescope)
  • 40% final project
  • 10% participation

Class is curved.

Post questions on Piazza (not for debugging, but for conceptual questions on the homeworks or lecture clarifications) or come to OH.

What is DevOps?

DevOps is about breaking down the wall between developers (people writing code) and operations (people releasing and deploying code into production and making sure it is reliable). Traditionally these have been two very separate teams, which means that the incentives of developers and operations engineers don’t always align. Developers aren’t motivated to make life easier for operations and operations isn’t motivated to make life easier for developers. When a crash happens in production, the people handling the crash aren’t the ones familiar with the code.

The key concept behind DevOps is that if these two teams share responsibilities, they can build empathy and align their incentives, which ultimately leads to a better experience for the end user because new features are more stable and reliable.

There are a few main DevOps solutions we will be focusing on in CIS 188:

  • Automated testing and deployment (we can easily ship new features with testing)
  • Easy deploy rollback (if something breaks we can revert quickly)
  • Observability (so we can know when something is wrong)

The main takeaway is to get developers involved in the operations process so that developers can use their skills to build tools to automate away the tedious parts of operations jobs. DevOps is not a role, but a way of doing things.

Python

We’ll be using Python for most of the development side of the DevOps solutions we cover in this course. It’s common and well supported in the infrastructure space because it’s easy to learn and there is wide library support.

Python example

# Comments start with a `#`

import time # import a module from the standard library

for i in range(1, 16): # For loops iterate only over iterables like lists and `range()`
    print(i)
    if i % 3 == 0 and i % 5 == 0:  # Conditional expressions
        print(time.time())
        print("fizzbuzz")  # Strings can be double or single quoted
    elif i % 3 == 0:
        print("fizz")
    elif i % 5 == 0:
        print("buzz")

Running the above code produces the following output:

$ python3 test.py
1
2
3
fizz
4
5
buzz
6
fizz
7
8
9
fizz
10
buzz
11
12
fizz
13
14
15
1607229328.9530184
fizzbuzz

Code in Python can run at the top-level, but it’s good practice to pull logic into functions:

def get_buzz(i): # (def)ine a function
    if i % 3 == 0 and i % 5 == 0:
        return "fizzbuzz"
    elif i % 3 == 0:
        return "fizz"
    elif i % 5 == 0:
        return "buzz"
    return ""

for i in range(1, 16):
    print(str(i) + " " + get_buzz(i)) # Call a function

You can check out CIS 192 for more learning materials, and come to office hours with any questions about Python!

Packaging

Writing code is useful, reusing code is even more useful! Making sure that you have access to the right packages and are also using the correct version of each package is no easy task. Python’s default package manager, pip, will simply take the most recent version of a package (and its dependencies) and pull it down, not taking into account compatibility with other dependencies and potential conflicts a mismatch might cause. Other package managers built on top of pip, like Poetry, help solve this problem. Poetry is like NPM for JavaScript, or Maven for Java. Take a look at the Poetry demo below for a more in-depth explanation!

Most importantly, Poetry helps us create reproducible build environments wherever we run our code: on our local machines, on our friends’ machines, or even on a production server somewhere in “the cloud.”

How it works

Poetry creates and manages two files. pyproject.toml is Poetry’s dependency file: a human-readable (and writeable) file which declares “acceptable versions” of packages. This is generally a range, such as “1.1 - 1.12”, which is useful if, say, version 2.0 contains a breaking change. poetry.lock is the lock file: an autogenerated file that records the specific package versions chosen, including dependencies of dependencies. Poetry uses the lock file to persist the result of resolving the constraints in pyproject.toml, so that every install gets the same versions.
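To make the idea of an “acceptable version” range concrete, here is a minimal sketch of how a caret constraint such as ^1.19.5 (the form Poetry prints in the demo below) can be checked. The helper names are hypothetical and this is not Poetry’s own code. It assumes plain three-part numeric versions with a non-zero major version, in which case ^1.19.5 means “at least 1.19.5 and below 2.0.0”.

```python
def parse_version(text):
    """Turn a version string such as '1.19.5' into a tuple of ints: (1, 19, 5)."""
    return tuple(int(part) for part in text.split("."))


def satisfies_caret(version, constraint):
    """Check a version against a caret constraint such as '^1.19.5'.

    Simplified sketch: assumes a non-zero major version, for which
    '^1.19.5' means '>=1.19.5 and <2.0.0'.
    """
    minimum = parse_version(constraint.lstrip("^"))
    candidate = parse_version(version)
    next_major = (minimum[0] + 1, 0, 0)
    # Tuples compare element by element, which matches version ordering here.
    return minimum <= candidate < next_major


print(satisfies_caret("1.19.5", "^1.19.5"))  # True
print(satisfies_caret("1.21.0", "^1.19.5"))  # True
print(satisfies_caret("2.0.0", "^1.19.5"))   # False
```

The lock file then pins one concrete version from inside that range, which is what makes installs repeatable.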

The Poetry demo is replicated below, so here’s a relevant comic courtesy of xkcd, describing the spaghetti mess that Poetry helps us avoid: Relevant XKCD

Demos

Poetry

One recurring design pattern we see in DevOps is the package manager. This is a tool that helps manage your program’s dependencies. In other words, the package manager is in charge of keeping track of what packages your project needs to run correctly, and then downloading those packages in a way that makes it easy for your program to use this auxiliary code.

We’ll look at a few different package managers over the course of the semester. Node has one called NPM (Node Package Manager), Java has a package manager called Maven, and Python has a few offerings. Note that these package managers are all a little different because they work with different languages that all have different nuances. This is why we can’t reuse package managers across languages.

The Python package manager we’ll be using is called Poetry. Essentially, Poetry allows you to download certain Python libraries, then it creates a virtual Python environment on your machine to run your code with the given libraries. So, why the virtual environment? The answer is that Python varies a lot from version to version (especially Python 2 compared to Python 3). The virtual environment ensures that you, your team of developers, and your production environment are all on the same version of Python. This way we can avoid bugs that arise when code written for one version of Python is run with a different version.
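If you are ever unsure which interpreter is running your code, the standard library can tell you. The sketch below uses a common check: inside a virtual environment, sys.prefix differs from sys.base_prefix. The function name in_virtualenv is just an illustrative choice.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


print(sys.version_info[:3])  # the (major, minor, micro) version in use
print(sys.executable)        # path of the interpreter running this script
print(in_virtualenv())
```

Running this with plain python3 and again with poetry run python should show different interpreter paths if Poetry created its own environment.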

Now, let’s get into how to actually use Poetry. First, make sure that you have Poetry installed on your machine. Instructions for installation can be found here.

Once you have Poetry installed, let’s create a new project:

# Create a new folder called poetry_demo
$ mkdir poetry_demo
# Enter the new folder
$ cd poetry_demo
$ poetry init

Now Poetry will give you lots of options for how to initialize your project. Just hit enter for all of them (Poetry will use the default setup, which is fine for our purposes). Once you’ve finished, you’ll see that there is a new file pyproject.toml in the directory. This is the file that stores the information we just initialized.

Next, let’s add a dependency:

$ poetry add numpy
Creating virtualenv poetry-demo-KkU142w6-py3.9 in /Users/airbenderang/Library/Caches/pypoetry/virtualenvs
Using version ^1.19.5 for numpy

Updating dependencies
Resolving dependencies... (39.8s)

Writing lock file

Package operations: 1 install, 0 updates, 0 removals

  • Installing numpy (1.19.5)

Now, Poetry actually does two things here. It downloads NumPy, but before that it creates a virtual environment which we are going to use to run our Python code. If we wanted to use a deprecated version of Python (like Python 2) we could configure Poetry to set up the virtual environment so it runs an older release of Python. Again, you will see a new file in your directory: the poetry.lock file. It doesn’t make much sense to humans, but the poetry.lock file tracks which packages your program depends on and the exact version number of each of those packages.

Finally, let’s run some code in Poetry’s virtual environment. There are two ways to run Python programs with Poetry. The first is to type poetry run python script.py, which runs a Python script in the Poetry environment. Instead, we will open a new shell that has the Poetry virtual environment as its default Python environment:

$ poetry shell
Spawning shell within /Users/airbenderang/Library/Caches/pypoetry/virtualenvs/poetry-demo-KkU142w6-py3.9
$ . /Users/airbenderang/Library/Caches/pypoetry/virtualenvs/poetry-demo-KkU142w6-py3.9/bin/activate
# Open a new Python interactive terminal
$ python
Python 3.9.1 (default, Dec 24 2020, 16:53:18) 
[Clang 12.0.0 (clang-1200.0.32.28)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> x = np.array([[1,2],[3,4]])
>>> x
array([[1, 2],
       [3, 4]])
>>> y = np.linalg.inv(x)
>>> y
array([[-2. ,  1. ],
       [ 1.5, -0.5]])
>>> exit() # To leave the Python terminal
# Then exit again to leave the Poetry shell
$ exit

It looks like NumPy works! This means that Poetry has been properly able to manage our dependencies so that they are accessible when we run our Python code with Poetry. Now, let’s make a simple Python file and have Poetry run it. Create a new file called average.py in the same directory as your pyproject.toml and poetry.lock and paste this code into it:

import sys
import numpy as np

# sys.argv[0] is the script name, so the real arguments start at index 1
if len(sys.argv) < 2:
    print("Not enough command line arguments")
    sys.exit(1)

try:
    xs = [int(arg) for arg in sys.argv[1:]]
except ValueError:  # int() raises ValueError for text that is not an integer
    print("Command line arguments are not integers")
    sys.exit(1)

print(np.average(np.asarray(xs)))

Now, we can run this in the virtual environment created by Poetry:

$ poetry run python average.py 1 2 3 4
2.5

Awesome, it looks like this is working, too. Try changing around the command line arguments!

Lecture One: Networking

Slides: https://docs.google.com/presentation/d/1FVklEogqEGn6zsp8YOpCuynUCCvhB6mXusB979pMCak/edit#slide=id.p

7 Layers of the OSI Model

The internet is built in layers that allow us to abstract away a lot of the complexity that is inherent to networks. One common model for layers is the OSI Model (from top to bottom):

  • Application: End user layer (HTTP, FTP, SSH, DNS)
  • Presentation: Syntax layer (SSL, SSH, IMAP, FTP, JPEG)
  • Session: Synch and send to port (APIs, sockets)
  • Transport: End-to-end connections (TCP, UDP, QUIC)
  • Network: Packets (IP, ICMP, IPSec, IGMP)
  • Data Link: Frames (Ethernet, PPP, Switch, Bridge)
  • Physical: Physical structure (Coax, Fiber, Wireless, Hubs, Repeaters)

Internet Protocol (IP)

IP is the lowest level of abstraction that we will cover in this course. You can think of an IP address as a single atomic element that lives on a network. A computer is given an IP address that consists of four octets (32 bits of data) separated by dots. A subnet is a subset of all IP addresses available. The IP ranges given for Loopback and Private IPs below represent subnets, but a subnet can be any range of IPs.

There are a few special subnets that you might become familiar with:

  • Loopback: IP addresses on the local machine.
    • 127.0.0.0 - 127.255.255.255 (127.0.0.0/8)
  • Private: Only for devices inside the local network; these addresses are never surfaced publicly.
    • 10.0.0.0 - 10.255.255.255 (10.0.0.0/8)
    • 172.16.0.0 - 172.31.255.255 (172.16.0.0/12)
    • 192.168.0.0 - 192.168.255.255 (192.168.0.0/16)
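The ranges above can be checked with Python’s standard ipaddress module. This is a small sketch; the classify function and its labels are illustrative names, not part of any library.

```python
import ipaddress

# The special subnets listed above, written in CIDR notation.
LOOPBACK = ipaddress.ip_network("127.0.0.0/8")
PRIVATE = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def classify(address_text):
    """Label an IPv4 address as loopback, private, or other."""
    address = ipaddress.ip_address(address_text)
    if address in LOOPBACK:
        return "loopback"
    if any(address in subnet for subnet in PRIVATE):
        return "private"
    return "other"


print(classify("127.0.0.1"))       # loopback
print(classify("172.31.255.255"))  # private
print(classify("172.32.0.1"))      # other

# A /12 prefix fixes the first 12 bits, which is why the range ends at 172.31.x.x.
print(PRIVATE[1][0], "-", PRIVATE[1][-1])  # 172.16.0.0 - 172.31.255.255
```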

If you’d like to see your IP address on your machine, you can run one of these commands depending on your OS:

  • Linux: ip addr show
  • Mac: ifconfig
  • Windows: ipconfig /all

Visualizing the Network

Say we have two machines on a subnet. These machines know that each other exist and are able to reach one another, but they don’t yet have a way to communicate. They need some shared language that will allow them to send information back and forth.

Subnet Diagram

A Note on IPv6

The four octets in IPv4 give us only 32 bits of data to specify an IP address, which allows 2^(32) = 4,294,967,296 possible IP addresses. When IPv4 was invented that seemed like more than we would ever need, but nowadays every smartphone, TV, PlayStation, or laptop might need an IP address. The solution was IPv6, which allows for 128 bits of data in IP addresses, drastically expanding the number of available IP addresses.
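The size difference is easy to confirm with the standard ipaddress module. This short sketch counts every address in the “whole internet” network for each protocol version.

```python
import ipaddress

# "/0" means no bits are fixed, so the network covers every possible address.
ipv4_total = ipaddress.ip_network("0.0.0.0/0").num_addresses
ipv6_total = ipaddress.ip_network("::/0").num_addresses

print(ipv4_total)                           # 4294967296
print(ipv4_total == 2 ** 32)                # True
print(ipv6_total == 2 ** 128)               # True
print(ipv6_total // ipv4_total == 2 ** 96)  # True
```

In other words, IPv6 has 2^(96) times as many addresses as IPv4.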

However, IPv6 has really struggled with adoption because network engineers aren’t too keen on migrating all of their technology to IPv6. 45% of network traffic in the US happens over IPv6, but that’s because many big tech companies have adopted IPv6 on their internal networks. Outside of that, few people have switched over, so we’ll use IPv4 throughout this class.

Transport Layer: TCP

TCP is a protocol that is layered on top of IP to allow for a shared language between computers. TCP has robust error-checking to ensure that all of the information transmitted over the internet actually gets to the destination. This error-checking can also cause latency and congestion issues as it requires the person receiving the data to also send back a confirmation that they got everything.

TCP also introduces the concept of ports. A port is a number ranging from 1-65535 that allows for differentiating between connection types. Typically ports are designated for a specific kind of connection: HTTP is port 80, SSH is port 22, MySQL is port 3306. You can actually configure your machine to listen for whatever kind of connection on whichever port, but you’ll confuse everyone else who is expecting specific connection types on specific ports.

TCP also allows for multiple concurrent connections. For example, you could have a number of users all connecting to port 80 on an HTTP server and TCP would still be able to manage these connections. This is an essential feature: if we could only establish a single connection, the client-server model could never work in networking.
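Here is a minimal sketch of several clients sharing one server port, using only the standard socket and threading modules. It runs entirely on the loopback address and lets the operating system pick a free port. The function names and the “echo” behaviour are choices made for this example.

```python
import socket
import threading


def handle(connection):
    """Echo back whatever one client sends, then close that connection."""
    with connection:
        data = connection.recv(1024)
        connection.sendall(b"echo: " + data)


def serve(listener, expected_clients):
    """Accept a fixed number of clients, handling each in its own thread."""
    workers = []
    for _ in range(expected_clients):
        connection, _peer = listener.accept()
        worker = threading.Thread(target=handle, args=(connection,))
        worker.start()
        workers.append(worker)
    for worker in workers:
        worker.join()


def read_all(sock):
    """Read from a socket until the other side closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(1024)
        if not chunk:
            return b"".join(chunks)
        chunks.append(chunk)


listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 asks the OS for any free port
listener.listen()
port = listener.getsockname()[1]
server = threading.Thread(target=serve, args=(listener, 2))
server.start()

# Both clients connect to the same server port before either sends data.
first = socket.create_connection(("127.0.0.1", port))
second = socket.create_connection(("127.0.0.1", port))
first.sendall(b"one")
second.sendall(b"two")
replies = [read_all(first), read_all(second)]

first.close()
second.close()
server.join()
listener.close()
print(replies)
```

Each accepted connection gets its own socket object on the server, so the two conversations stay separate even though they share one listening port.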

There are some alternatives to TCP:

  • UDP: This protocol is quite similar to TCP but without the error checking. This is great when you want data to be sent fast, and don’t care too much about correctness. If you’re in a Zoom call, you won’t notice if some pixels are flipped during a single frame or some audio is warbled as long as the call keeps up with real time, so better to be fast than correct.
  • QUIC: This protocol is built on UDP and is the transport underneath HTTP/3. It will hopefully resolve some of the latency issues that come from TCP error-checking.
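For comparison with TCP, here is a minimal UDP sketch on the loopback address. Notice that there is no connect or accept step: the sender just fires a datagram at an address and port, and nothing confirms delivery. The payload and the two-second timeout are arbitrary choices for the example.

```python
import socket

receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))  # port 0 asks the OS for any free port
receiver.settimeout(2.0)         # UDP gives no delivery guarantee, so don't wait forever
port = receiver.getsockname()[1]

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"frame 1", ("127.0.0.1", port))  # no handshake, no acknowledgement

try:
    payload, _source = receiver.recvfrom(1024)
except socket.timeout:
    payload = None  # the datagram was lost, and UDP will not resend it

sender.close()
receiver.close()
print(payload)
```

On a real network an application using UDP has to decide for itself what to do about lost or reordered datagrams, which is exactly the work QUIC takes on.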

Socket Status

$ ss -4lt
State    Recv-Q   Send-Q     Local Address:Port       Peer Address:Port   Process
LISTEN   0        128              0.0.0.0:47497           0.0.0.0:*
LISTEN   0        4096             0.0.0.0:mysql           0.0.0.0:*
LISTEN   0        10               0.0.0.0:57621           0.0.0.0:*
LISTEN   0        1              127.0.0.1:12315           0.0.0.0:*

The ss command allows us to examine the TCP ports on our machine. Here we can see that each process is “listening”, which means that it is awaiting connections. In the client-server model these would be the servers that are awaiting connections from clients.

One key point is that in the Local Address:Port column we see both 0.0.0.0 and 127.0.0.1. When we see 0.0.0.0 this means that the port can be connected to by a machine with any IP address. Conversely, when we see 127.0.0.1 this means that the process on this port can only be connected to from the host machine itself. For example, if you are running a local database for a webserver, you only want your host machine to write to that database and not outside users.
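The address a server shows in that column is simply the address it passed to bind. A short sketch with the standard socket module, where listen_on is a hypothetical helper name:

```python
import socket


def listen_on(host):
    """Open a listening TCP socket on the given host and an OS-chosen port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    sock.listen()
    return sock


local_only = listen_on("127.0.0.1")   # reachable only from this machine
all_interfaces = listen_on("0.0.0.0")  # reachable on every network interface

local_address = local_only.getsockname()[0]
open_address = all_interfaces.getsockname()[0]
print(local_address)  # 127.0.0.1
print(open_address)   # 0.0.0.0

local_only.close()
all_interfaces.close()
```

While the sockets are open, running ss -4lt in another terminal should list both of them with these local addresses.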

Application Layer: HTTP

Now we layer on top of TCP up to the application layer where we implement HTTP. We can see how HTTP uses TCP which uses IP so the layers of the internet are starting to come together.

HTTP is the protocol behind the web. It operates on the client/server model: the client requests something from the server, and the server responds to that request with the desired result. The server will serve many clients. HTTP underlies the orchestration that we will build up to throughout the course, and so it’s important to start with foundational understanding. If we run into issues with our deployments, knowledge of how HTTP enables communication over the web is going to be vital in addressing the problem.

Components of an HTTP Request

There are four main components of any HTTP request:

  • Method: What do we want to do?
    • Verbs like GET, POST, PUT
  • URL: Where are we sending the request?
  • Headers: Metadata about the request and how to handle it
    • Computer’s client ID
    • Caching settings
  • Body: The data associated with the request
    • For example, final submission step of a form is a POST request with a body as the form data.
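The four components map directly onto the arguments of urllib.request.Request in the standard library. The sketch below only builds a request object and inspects it, so nothing is sent over the network. The URL and form field are made up for illustration.

```python
import urllib.parse
import urllib.request

# Body: form data encoded the way a browser would encode a submitted form.
form = urllib.parse.urlencode({"username": "aang"}).encode("ascii")

request = urllib.request.Request(
    url="http://example.com/register",  # URL: where the request would go
    data=form,                          # Body: the data sent along
    headers={"Content-Type": "application/x-www-form-urlencoded"},  # Headers
    method="POST",                      # Method: what we want to do
)

print(request.get_method())  # POST
print(request.full_url)      # http://example.com/register
print(request.data)          # b'username=aang'
# urllib stores header names with only the first letter capitalized.
print(request.get_header("Content-type"))
```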

HTTP Methods

There are a number of HTTP methods, but we will use primarily these four:

  • GET
    • No body
    • Idempotent (same request gets the same result)
    • No side effects
    • URL parameters (http://youtube.com/?search=avatar)
    • Common use case is to retrieve the content of a webpage
  • POST
    • Has a body.
    • Will have side effects (posting a comment on a post)
    • Common use case is form submission (think user registration)
  • PUT
    • Puts the body onto the server
    • Idempotent
    • Common use case is file upload (profile picture)
      • Putting the same file onto the server multiple times should only upload one file
  • DELETE
    • Deletes the resource at the given URL (assuming you have permissions)
    • Common use case is deleting a file (deleting your profile picture)

Idempotency is an important concept with HTTP, but also in DevOps more broadly. An operation is idempotent when, no matter how many times it is performed on a system, the system only changes as if it was performed once. A simple example is multiplying a non-zero integer by zero. 4*0=0, and 4*0*0=0 and so on. The “multiply by zero” operation has an effect on the resulting product the first time, but any more applications of the operation won’t change the outcome.

Bringing this back to DevOps, there are many times when you want to be able to retry an action, but don’t want to actually perform that action multiple times if two requests happen to both succeed. If you want to provision a virtual machine on a cloud provider, retrying the “provision” action shouldn’t result in you having two virtual machines and paying twice what you expected to pay.
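The difference between an idempotent and a non-idempotent operation can be sketched with an in-memory stand-in for a server. The FileStore class and its method names are hypothetical; put behaves like PUT (store the body at a given path) and post_comment behaves like POST (create something new each time).

```python
class FileStore:
    """A toy server: files are keyed by path, comments are appended to a list."""

    def __init__(self):
        self.files = {}
        self.comments = []

    def put(self, path, content):
        """Idempotent: repeating the call leaves the same single file in place."""
        self.files[path] = content

    def post_comment(self, text):
        """Not idempotent: every call adds another comment."""
        self.comments.append(text)


store = FileStore()
for _ in range(3):  # simulate a client retrying the same request three times
    store.put("/avatars/me.png", b"image-bytes")
    store.post_comment("first!")

print(len(store.files))     # 1
print(len(store.comments))  # 3
```

Retrying the put is safe, while retrying the post leaves duplicates behind, which is why retry logic is much easier to write around idempotent operations.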

HTTP URLs

A URL like https://httpstat.us/200 has three components to it:

  • https: This is the protocol (how we communicate with the host)
  • httpstat.us: This is the host (the place we are communicating with)
    • The port defaults to 80 for HTTP and 443 for HTTPS, but you can also specify it with :port after the host.
  • /200: This is the path (what resource we want to access on the host)
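The same breakdown can be done with urllib.parse from the standard library. In this sketch, describe and DEFAULT_PORTS are illustrative names; urlsplit itself reports no port when the URL does not include one, so the default is filled in by hand.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def describe(url):
    """Split a URL into (protocol, host, port, path), applying the default port."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return parts.scheme, parts.hostname, port, parts.path


print(describe("https://httpstat.us/200"))
# ('https', 'httpstat.us', 443, '/200')
print(describe("http://localhost:8000/health"))
# ('http', 'localhost', 8000, '/health')
```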

HTTP Headers

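Headers pass metadata about the request. Examples include Host, User-Agent, and Accept, all of which appear in the example request below, as well as Content-Type, which describes the format of the body.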

Body

The body is used to pass chunks of data along with the request. This is pretty straightforward: it is generally text in the format specified by the Content-Type header.

Example Request

$ curl -vv 'httpstat.us/200'
*   Trying 172.67.148.117...
* TCP_NODELAY set
* Connected to httpstat.us (172.67.148.117) port 80 (#0)
> GET /200 HTTP/1.1
> Host: httpstat.us
> User-Agent: curl/7.58.0
> Accept: */*

Here we can see that when we send an HTTP request, we first establish a TCP connection with the host on a given port (typically port 80 for HTTP) and then we transmit the HTTP request (method, headers, and body) through the TCP connection.
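To see that an HTTP request really is just text sent over a TCP connection, the sketch below starts a tiny local server with the standard http.server module and then talks to it through a raw socket instead of an HTTP library. The OkHandler class is a made-up stand-in for a real site, and everything stays on the loopback address.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class OkHandler(BaseHTTPRequestHandler):
    """Answer every GET request with a small plain-text body."""

    def do_GET(self):
        body = b"200 OK"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


server = HTTPServer(("127.0.0.1", 0), OkHandler)  # port 0: any free port
port = server.server_address[1]
worker = threading.Thread(target=server.handle_request)  # serve one request
worker.start()

# The request is plain text: a request line, headers, then a blank line.
request_text = (
    "GET /200 HTTP/1.1\r\n"
    f"Host: 127.0.0.1:{port}\r\n"
    "Connection: close\r\n"
    "\r\n"
)

# Step 1: open the TCP connection. Step 2: send the HTTP text through it.
with socket.create_connection(("127.0.0.1", port)) as connection:
    connection.sendall(request_text.encode("ascii"))
    chunks = []
    while True:
        chunk = connection.recv(4096)
        if not chunk:  # the server closed the connection
            break
        chunks.append(chunk)

worker.join()
server.server_close()

response = b"".join(chunks).decode("ascii")
status_line = response.split("\r\n", 1)[0]
print(status_line)
```

The first line of the response carries the status code, followed by the response headers, a blank line, and the body.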

Note that in HTTP/1.0 a new TCP connection is established for every request. HTTP/1.1 can keep a connection open and reuse it, but requests on a single connection are still handled one at a time. There is some overhead to creating a TCP connection and to waiting on earlier requests, so this is something that newer protocols like HTTP/2 and QUIC have attempted to improve upon.

Another solution is WebSockets, which start from an HTTP request and then establish a persistent and bidirectional connection between two computers. This is great for use cases where small amounts of information have to be sent often, like a chat server.

Components of HTTP Responses

When you make a request to an HTTP server, it sends you back a response. The HTTP response looks similar to the HTTP request:

  • Status Code: A number indicating the status of the response.
  • Headers: Metadata about the response, similar to those in the request.
  • Body: The actual data within the response (when you request a webpage, this would be the HTML that your browser will render and display on your screen).

Status Codes

The status codes are organized by the first digit in the 3-digit code:

  • 2xx: Success
    • 200: OK
    • 201: Created (new user registered correctly)
    • 202: Accepted
  • 3xx: Redirection
    • 301: Moved Permanently
    • 302: Found
    • 304: Not Modified (for caching, more detail later)
  • 4xx: Client error (people making the request messed up)
    • 400: Bad Request (server doesn’t know how to respond to the request)
    • 401: Unauthorized (you are not authenticated: your credentials are missing or wrong)
    • 403: Forbidden (the server knows who you are, but you aren’t allowed access)
    • 404: Not found (the URL path you specified does not exist)
  • 5xx: Server error (people processing the request messed up)
    • 500: Internal Server Error (maybe you made a syntax error in your Python code)
    • 503: Service Unavailable

No need to remember these status codes because they’re uniform across all of the internet, so it’s super easy just to look them up if you encounter them. Normally, googling the status code will be able to point you in the right direction if you’re having trouble with HTTP communication.
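Python’s standard library also ships the standard list of codes as http.HTTPStatus, which can be handy in scripts. In this sketch, describe_status and the CLASSES table are illustrative names. The grouping by first digit follows the list above.

```python
from http import HTTPStatus

# The class of a status code is determined by its first digit.
CLASSES = {2: "Success", 3: "Redirection", 4: "Client error", 5: "Server error"}


def describe_status(code):
    """Return a readable description such as '404 Not Found (Client error)'."""
    status = HTTPStatus(code)
    category = CLASSES.get(code // 100, "Other")
    return f"{code} {status.phrase} ({category})"


print(describe_status(201))  # 201 Created (Success)
print(describe_status(304))  # 304 Not Modified (Redirection)
print(describe_status(404))  # 404 Not Found (Client error)
print(describe_status(503))  # 503 Service Unavailable (Server error)
```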