Lecture Zero: Introduction
- 50% homework (5 assignments, 10% per homework released on GitHub and submitted on Gradescope)
- 40% final project
- 10% participation
Class is curved.
Post questions on Piazza (not for debugging, but for conceptual questions on the homeworks or lecture clarifications) or come to OH.
What is DevOps?
Breaking down the wall between developers (people writing code) and operations (people releasing and deploying code into production and making sure it is reliable). Traditionally these have been two very separate teams, which means that the incentives of developers and operations engineers don’t always align. Developers aren’t motivated to make life easier for operations and operations isn’t motivated to make life easier for developers. When a crash happens in production, the people handling the crash aren’t the ones familiar with the code.
The key concept behind DevOps is that if these two teams share responsibilities, they can build empathy and align their incentives, which ultimately leads to a better experience for the end user because new features are more stable and reliable.
There are a few main DevOps solutions we will be focusing on in CIS 188:
- Automated testing and deployment (we can easily ship new features with testing)
- Easy deploy rollback (if something breaks we can revert quickly)
- Observability (so we can know when something is wrong)
The main takeaway is to get developers involved in the operations process so that developers can use their skills to build tools to automate away the tedious parts of operations jobs. DevOps is not a role, but a way of doing things.
We’ll be using Python for most of the development side of the DevOps solutions we cover in this course. It’s common and well supported in the infrastructure space because it’s easy to learn and there is wide library support.
    # Comments start with a `#`
    import time  # Import a module from the standard library

    for i in range(1, 16):  # For loops iterate over iterables like lists and `range()`
        print(i)
        if i % 3 == 0 and i % 5 == 0:  # Conditional expressions
            print(time.time())
            print("fizzbuzz")  # Strings can be double or single quoted
        elif i % 3 == 0:
            print("fizz")
        elif i % 5 == 0:
            print("buzz")
Running the above code produces the following output:
    $ python3 test.py
    1
    2
    3
    fizz
    4
    5
    buzz
    6
    fizz
    7
    8
    9
    fizz
    10
    buzz
    11
    12
    fizz
    13
    14
    15
    1607229328.9530184
    fizzbuzz
Code in Python can run at the top-level, but it’s good practice to pull logic into functions:
    def get_buzz(i):  # (def)ine a function
        if i % 3 == 0 and i % 5 == 0:
            return "fizzbuzz"
        elif i % 3 == 0:
            return "fizz"
        elif i % 5 == 0:
            return "buzz"
        return ""

    for i in range(1, 16):
        print(str(i) + " " + get_buzz(i))  # Call a function
You can check out CIS 192 for more learning materials, and come to office hours with any questions about Python!
Writing code is useful, but reusing code is even more useful! Making sure that you have access to the right packages, and that you are using the correct version of each package, is no easy task. Python’s default package manager,
Most importantly, Poetry helps us create reproducible build environments wherever we run our code: on our local machines, on our friends’ machines, or even on a production server somewhere in “the cloud.”
How it works
Poetry creates and manages two files.
pyproject.toml is Poetry’s dependency file: a human-readable (and writeable) file which declares “acceptable versions” of packages. This is generally a range, such as “1.1 - 1.12”, which keeps out version 2.0 if it contains a breaking change.
poetry.lock is the lock file: an autogenerated file used to declare specific package versions, including dependencies of dependencies. Poetry uses the lock file to save and persist its resolution of conflicts that it resolves from the list in
One recurring design pattern we see in DevOps is the package manager. This is a tool that helps manage your program’s dependencies. In other words, the package manager is in charge of keeping track of what packages your project needs to run correctly, and then downloading those packages in a way that makes it easy for your program to use this auxiliary code.
We’ll look at a few different package managers over the course of the semester. Node has one called NPM (Node Package Manager), Java has a package manager called Maven, and Python has a few offerings. Note that these package managers are all a little different because they work with different languages that all have different nuances. This is why we can’t reuse package managers across languages.
The Python package manager we’ll be using is called Poetry. Essentially, Poetry allows you to download certain Python libraries, then it creates a virtual Python environment on your machine to run your code with the given libraries. So, why the virtual environment? The answer is that Python varies a lot from version to version (especially Python 2 compared to Python 3). The virtual environment ensures that you, your team of developers, and your production environment are all on the same version of Python. This way we can avoid the bugs that arise when code written for one version of Python is actually run with a different version.
Now, let’s get into how to actually use Poetry. First, make sure that you have Poetry installed on your machine. Instructions for installation can be found here.
Once you have Poetry installed, let’s create a new project:
    # Create a new folder called poetry_demo
    $ mkdir poetry_demo
    # Enter the new folder
    $ cd poetry_demo
    $ poetry init
Now Poetry will give you lots of options for how to initialize your project. Just hit enter for all of them (Poetry will use the default setup, which is fine for our purposes). Once you’ve finished, you’ll see that there is a new file, pyproject.toml, in the directory. This is the file that stores the information we just initialized.
Next, let’s add a dependency:
    $ poetry add numpy
    Creating virtualenv poetry-demo-KkU142w6-py3.9 in /Users/airbenderang/Library/Caches/pypoetry/virtualenvs
    Using version ^1.19.5 for numpy
    Updating dependencies
    Resolving dependencies... (39.8s)
    Writing lock file
    Package operations: 1 install, 0 updates, 0 removals
      • Installing numpy (1.19.5)
Poetry actually does two things here. It downloads NumPy, but before that it creates a virtual environment, which we are going to use to run our Python code. If we wanted to use a deprecated version of Python (like Python 2), we could configure Poetry to set up the virtual environment so it runs an older release of Python. Again, you will see a new file in your directory: the poetry.lock file. It doesn’t make much sense to humans, but the poetry.lock file tracks which packages your program depends on and the version number of each package.
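The `^1.19.5` in the output above is a "caret" version range: Poetry accepts any release from 1.19.5 up to, but not including, the next release that bumps the leftmost non-zero number (2.0.0 here). The sketch below illustrates that rule. The helper names `parse`, `caret_bounds`, and `caret_allows` are made up for this example and are not part of Poetry, and the sketch assumes plain three-part versions like `X.Y.Z`.

```python
def parse(version):
    """Turn "1.19.5" into the tuple (1, 19, 5) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def caret_bounds(constraint):
    """Return (lower, upper) for a caret constraint such as "^1.19.5".

    Assumption: the version has exactly three parts and is not 0.0.0.
    """
    lower = parse(constraint.lstrip("^"))
    if len(lower) != 3 or not any(lower):
        raise ValueError("this sketch only handles non-zero X.Y.Z versions")
    # Find the leftmost non-zero component, bump it, and zero out the rest.
    index = next(i for i, part in enumerate(lower) if part != 0)
    upper = lower[:index] + (lower[index] + 1,) + (0,) * (2 - index)
    return lower, upper


def caret_allows(constraint, version):
    """True if `version` falls in the half-open range [lower, upper)."""
    lower, upper = caret_bounds(constraint)
    return lower <= parse(version) < upper


if __name__ == "__main__":
    for candidate in ["1.19.4", "1.19.5", "1.21.0", "2.0.0"]:
        print(candidate, caret_allows("^1.19.5", candidate))
```

The range lives in pyproject.toml, while poetry.lock pins the one exact version (1.19.5 above) that was chosen from that range.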
Finally, let’s run some code in Poetry’s virtual environment. There are two ways to run Python programs with Poetry. The first is to type poetry run python script.py, which runs a Python script in the Poetry environment. Instead, we will open a new shell that has the Poetry virtual environment as its default Python environment:
    $ poetry shell
    Spawning shell within /Users/airbenderang/Library/Caches/pypoetry/virtualenvs/poetry-demo-KkU142w6-py3.9
    $ . /Users/airbenderang/Library/Caches/pypoetry/virtualenvs/poetry-demo-KkU142w6-py3.9/bin/activate
    # Open a new Python interactive terminal
    $ python
    Python 3.9.1 (default, Dec 24 2020, 16:53:18)
    [Clang 12.0.0 (clang-1184.108.40.206)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import numpy as np
    >>> x = np.array([[1,2],[3,4]])
    >>> x
    array([[1, 2],
           [3, 4]])
    >>> y = np.linalg.inv(x)
    >>> y
    array([[-2. ,  1. ],
           [ 1.5, -0.5]])
    >>> exit() # To leave the Python terminal
    # Then exit again to leave the Poetry shell
    $ exit
It looks like NumPy works! This means that Poetry has managed our dependencies properly, so they are accessible when we run our Python code with Poetry. Now, let’s make a simple Python file and have Poetry run it. Create a new file called average.py in the same directory as your poetry.lock and paste this code into it:
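    import sys

    import numpy as np

    # sys.argv[0] is the script name, so we need at least one more argument
    if len(sys.argv) < 2:
        print("Not enough command line arguments")
        sys.exit(1)

    xs = []
    try:
        for i in range(1, len(sys.argv)):
            xs.append(int(sys.argv[i]))
    except ValueError:  # int() raises ValueError on non-integer input
        print("Command line arguments are not integers")
        sys.exit(1)

    print(np.average(np.asarray(xs)))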
Now, we can run this in the virtual environment created by Poetry:
    $ poetry run python average.py 1 2 3 4
    2.5
Awesome, it looks like this is working, too. Try changing around the command line arguments!
Lecture One: Networking
7 Layers of the OSI Model
The internet is built in layers that allow us to abstract away a lot of the complexity that is inherent to networks. One common model for layers is the OSI Model (from top to bottom):
- Application: End user layer (HTTP, FTP, SSH, DNS)
- Presentation: Syntax layer (SSL, SSH, IMAP, FTP, JPEG)
- Session: Synch and send to port (APIs, sockets)
- Transport: End-to-end connections (TCP, UDP, QUIC)
- Network: Packets (IP, ICMP, IPSec, IGMP)
- Data Link: Frames (Ethernet, PPP, Switch, Bridge)
- Physical: Physical structure (Coax, Fiber, Wireless, Hubs, Repeaters)
Internet Protocol (IP)
IP is the lowest level of abstraction that we will cover in this course. You can think of an IP address as a single atomic element that lives on a network. A computer is given an IP address that consists of four octets (32 bits of data) separated by dots. A subnet is a subset of all IP addresses available. The IP ranges given for Loopback and Private IPs below represent subnets, but a subnet can be any range of IPs.
There are a few special subnets that you might become familiar with:
- Loopback: IP addresses on the local machine.
  - 127.0.0.0 - 127.255.255.255 (127.0.0.0/8)
- Private: Only for devices inside the local network; these addresses are never surfaced publicly.
  - 10.0.0.0 - 10.255.255.255 (10.0.0.0/8)
  - 172.16.0.0 - 172.31.255.255 (172.16.0.0/12)
  - 192.168.0.0 - 192.168.255.255 (192.168.0.0/16)
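To make the subnet notation above concrete, here is a small sketch using Python's standard `ipaddress` module. It checks which of the special subnets an address falls into. The `SPECIAL_SUBNETS` table and the `classify_address` helper are names invented for this example.

```python
import ipaddress

# The special subnets listed above, written in CIDR notation.
SPECIAL_SUBNETS = {
    "loopback": [ipaddress.ip_network("127.0.0.0/8")],
    "private": [
        ipaddress.ip_network("10.0.0.0/8"),
        ipaddress.ip_network("172.16.0.0/12"),
        ipaddress.ip_network("192.168.0.0/16"),
    ],
}


def classify_address(address):
    """Return "loopback", "private", or "other" for an IPv4 address string."""
    ip = ipaddress.ip_address(address)
    for label, networks in SPECIAL_SUBNETS.items():
        if any(ip in network for network in networks):
            return label
    return "other"


if __name__ == "__main__":
    for address in ["127.0.0.1", "10.1.2.3", "172.31.255.255", "172.32.0.0"]:
        print(address, "->", classify_address(address))

    # A /12 prefix fixes the first 12 bits, leaving 32 - 12 = 20 bits free.
    network = ipaddress.ip_network("172.16.0.0/12")
    print(network[0], "-", network[-1], "holds", network.num_addresses, "addresses")
```

The number after the slash is how many leading bits are fixed, which is why 172.16.0.0/12 ends at 172.31.255.255 rather than at a dot boundary.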
If you’d like to see the IP address of your machine, the command depends on your OS. On Linux, you can run:
ip addr show
Visualizing the Network
Say we have two machines on a subnet. This means that these machines know each other exist and are able to reach each other, but they don’t yet know how to communicate. They need some shared language that will allow them to send information back and forth.
A Note on IPv6
The four octets in IPv4 give us only 32 bits of data to specify an IP address, which allows 2^(32) = 4,294,967,296 possible IP addresses. When IPv4 was invented that seemed like more than we would ever need, but nowadays every smartphone, TV, PlayStation, or laptop might need an IP address. The solution was IPv6, which allows for 128 bits of data in IP addresses, drastically expanding the number of available IP addresses.
However, IPv6 has really struggled with adoption because network engineers aren’t too keen on migrating all of their technology to IPv6. 45% of network traffic in the US happens over IPv6, but that’s because many big tech companies have adopted IPv6 on their internal networks. Outside of that, few people have switched over, so we’ll use IPv4 throughout this class.
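As a quick check of the address-space arithmetic above, Python's arbitrary-precision integers let us compare the two sizes directly. This is only a back-of-the-envelope sketch, and the variable names are ours.

```python
# Number of distinct addresses that fit in 32 bits (IPv4) and 128 bits (IPv6).
ipv4_addresses = 2 ** 32
ipv6_addresses = 2 ** 128

print(f"IPv4: {ipv4_addresses:,} addresses")
print(f"IPv6: {ipv6_addresses:,} addresses")

# Each extra bit doubles the space, so IPv6 is 2^(128 - 32) = 2^96 times larger.
ratio = ipv6_addresses // ipv4_addresses
print(f"IPv6 is 2^{ratio.bit_length() - 1} times larger than IPv4")
```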
Transport Layer: TCP
TCP is a protocol that is layered on top of IP to allow for a shared language between computers. TCP has robust error-checking to ensure that all of the information transmitted over the internet actually gets to the destination. This error-checking can also cause latency and congestion issues as it requires the person receiving the data to also send back a confirmation that they got everything.
TCP also introduces the concept of ports. A port is a number from 1 to 65535 that differentiates between connection types. Typically ports are designated for a specific kind of connection: HTTP is port 80, SSH is port 22, MySQL is port 3306. You can actually configure your machine to listen for any kind of connection on any port, but you’ll confuse everyone else who is expecting specific connection types on specific ports.
TCP also allows for multiple concurrent connections. For example, you could have a number of users all connecting to port 80 on an HTTP server and TCP would still be able to manage these connections. This is an essential feature: if we could only establish a single connection, the client-server model could never work in networking.
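Here is a minimal sketch of several clients holding connections to the same TCP port at once, using Python's standard `socket` and `threading` modules. Everything runs on the loopback address, and port 0 asks the operating system for any free port. The function names and the upper-casing "protocol" are invented for this demo.

```python
import socket
import threading


def handle_client(connection):
    """Read one message from a client, send it back in upper case, then close."""
    with connection:
        data = connection.recv(1024)
        connection.sendall(data.upper())


def serve(listener, expected_clients):
    """Accept a fixed number of clients, handling each on its own thread."""
    workers = []
    for _ in range(expected_clients):
        connection, _address = listener.accept()
        worker = threading.Thread(target=handle_client, args=(connection,))
        worker.start()
        workers.append(worker)
    for worker in workers:
        worker.join()


def run_demo(client_count=3):
    # One listening socket on one port serves every client.
    listener = socket.create_server(("127.0.0.1", 0))
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve, args=(listener, client_count))
    server.start()

    # Open every client connection before sending anything, so that several
    # connections to the same port are open at the same time.
    clients = [socket.create_connection(("127.0.0.1", port)) for _ in range(client_count)]
    replies = []
    for number, client in enumerate(clients):
        with client:
            client.sendall(f"hello from client {number}".encode())
            replies.append(client.recv(1024).decode())

    server.join()
    listener.close()
    return replies


if __name__ == "__main__":
    for reply in run_demo():
        print(reply)
```

Each call to `accept()` returns a separate socket for one client, which is how the server keeps the conversations apart even though they all target the same port.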
There are some alternatives to TCP:
- UDP: This protocol is quite similar to TCP but without the error checking. This is great when you want data to be sent fast, and don’t care too much about correctness. If you’re in a Zoom call, you won’t notice if some pixels are flipped during a single frame or some audio is warbled as long as the call keeps up with real time, so better to be fast than correct.
- QUIC: This protocol is built on UDP and is the transport behind HTTP/3. It will hopefully resolve some of the latency issues that come from TCP error-checking.
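For contrast with TCP, here is a sketch of sending a single UDP datagram over the loopback address. There is no connection setup and no confirmation from the receiver. The `send_datagram` helper is a name made up for this example, and the two-second timeout is an arbitrary choice so the demo cannot hang if a datagram were ever dropped.

```python
import socket


def send_datagram(message):
    """Send one UDP datagram to a local receiver and return what arrived."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    with receiver, sender:
        # Port 0 asks the operating system for any free port.
        receiver.bind(("127.0.0.1", 0))
        receiver.settimeout(2.0)
        # No handshake and no acknowledgement: send it and move on.
        sender.sendto(message, receiver.getsockname())
        data, _address = receiver.recvfrom(1024)
    return data


if __name__ == "__main__":
    print(send_datagram(b"one video frame"))
```

Notice that the sender never learns whether the datagram arrived. Any retry or ordering logic has to be built by the application, which is the trade-off UDP makes for speed.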
    $ ss -4lt
    State    Recv-Q   Send-Q   Local Address:Port    Peer Address:Port   Process
    LISTEN   0        128      0.0.0.0:47497         0.0.0.0:*
    LISTEN   0        4096     0.0.0.0:mysql         0.0.0.0:*
    LISTEN   0        10       0.0.0.0:57621         0.0.0.0:*
    LISTEN   0        1        127.0.0.1:12315       0.0.0.0:*
Above is the output of the ss command, which allows us to examine the TCP ports on our machine. Here we can see that each process is “listening”, which means that it is awaiting connections. In the client-server model these would be the servers that are awaiting connections from clients.
One key point is that in the Local Address:Port column we see both 0.0.0.0 and 127.0.0.1. When we see 0.0.0.0, this means that the port can be connected to by a machine with any IP address. Conversely, when we see 127.0.0.1, this means that the process on this port can only be connected to from the host machine itself. For example, if you are running a local database for a webserver, you only want your host machine to write to that database and not outside users.
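The address in the Local Address:Port column is chosen by the program when it starts listening. The sketch below opens one listener on each kind of address. The `listen_on` helper is a name invented for this example, and port 0 again means "any free port".

```python
import socket


def listen_on(host):
    """Open a listening TCP socket on the given host address and a free port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((host, 0))
    listener.listen()
    return listener


if __name__ == "__main__":
    # Reachable only from this machine, like the 127.0.0.1:12315 row above.
    local_only = listen_on("127.0.0.1")
    # Reachable through any of this machine's IP addresses, like the 0.0.0.0 rows.
    open_to_all = listen_on("0.0.0.0")

    for listener in (local_only, open_to_all):
        host, port = listener.getsockname()
        print(f"listening on {host}:{port}")
        listener.close()
```

While the script is running, both sockets would show up as LISTEN rows in the ss output.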
Application Layer: HTTP
Now we layer on top of TCP up to the application layer where we implement HTTP. We can see how HTTP uses TCP which uses IP so the layers of the internet are starting to come together.
HTTP is the protocol behind the web. It operates on the client/server model: the client requests something from the server, and the server responds to that request with the desired result. The server will serve many clients. HTTP underlies the orchestration that we will build up to throughout the course, and so it’s important to start with a foundational understanding. If we run into issues with our deployments, knowledge of how HTTP enables communication over the web is going to be vital in addressing the problem.
Components of an HTTP Request
There are four main components of any HTTP request:
- Method: What do we want to do?
  - Verbs like GET, POST, PUT
- URL: Where are we sending the request?
- Headers: Metadata about the request and how to handle it
  - Computer’s client ID
  - Caching settings
- Body: The data associated with the request
  - For example, the final submission step of a form is a POST request with the form data as its body.
There are a number of HTTP methods, but we will primarily use these four:
- GET
  - No body
  - Idempotent (repeating the same request gets the same result)
  - No side effects
  - URL parameters (http://youtube.com/?search=avatar)
  - Common use case is to retrieve the content of a webpage
- POST
  - Has a body
  - Will have side effects (posting a comment on a post)
  - Common use case is form submission (think user registration)
- PUT
  - Puts the body onto the server
  - Common use case is file upload (profile picture)
  - Putting the same file onto the server multiple times should only upload one file
- DELETE
  - Deletes the resource at the given URL (assuming you have permissions)
  - Common use case is deleting a file (deleting your profile picture)
Idempotency is an important concept with HTTP, but also in DevOps more broadly. An operation is idempotent when, no matter how many times it is performed on a system, the system only changes as if it was performed once. A simple example is multiplying a non-zero integer by zero: 4*0=0, 4*0*0=0, and so on. The “multiply by zero” operation has an effect on the resulting product the first time, but any more applications of the operation won’t change the outcome.
Bringing this back to DevOps, there are many times when you want to be able to retry an action, but don’t want to actually perform that action multiple times if two requests happen to both succeed. If you want to provision a virtual machine on a cloud provider, retrying the “provision” action shouldn’t result in you having two virtual machines and paying twice what you expected to pay.
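To see the difference between idempotent and non-idempotent operations without any networking, here is a toy in-memory model of a server. The `ProfileStore` class and its method names are hypothetical and only mirror the behavior described for PUT, POST, and DELETE above.

```python
class ProfileStore:
    """A toy stand-in for a server's storage, used to compare retries."""

    def __init__(self):
        self.pictures = {}  # URL path -> file contents
        self.comments = []  # every comment ever posted

    def put(self, path, body):
        # PUT: store the body at this exact path, replacing whatever was there.
        self.pictures[path] = body

    def post(self, body):
        # POST: every call creates something new, so a retry has a new effect.
        self.comments.append(body)

    def delete(self, path):
        # DELETE: removing something that is already gone changes nothing.
        self.pictures.pop(path, None)


if __name__ == "__main__":
    store = ProfileStore()

    # Retrying a PUT leaves exactly one picture on the "server".
    store.put("/users/1/picture", b"cat.png")
    store.put("/users/1/picture", b"cat.png")
    print("pictures after two PUTs:", len(store.pictures))

    # Retrying a POST leaves a duplicate comment behind.
    store.post("Nice picture!")
    store.post("Nice picture!")
    print("comments after two POSTs:", len(store.comments))

    # Retrying a DELETE is harmless.
    store.delete("/users/1/picture")
    store.delete("/users/1/picture")
    print("pictures after two DELETEs:", len(store.pictures))
```

This is why a client can safely retry a PUT or DELETE after a timeout, while retrying a POST needs extra care.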
A URL like https://httpstat.us/200 has three components to it:
- https: This is the protocol (how we communicate with the host)
- httpstat.us: This is the host (the place we are communicating with)
  - The port defaults to 80 for HTTP and 443 for HTTPS, but you can also specify it with :port after the host.
- /200: This is the path (what resource we want to access on the host)
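Python's standard library can split a URL into these same pieces. The sketch below uses `urllib.parse.urlsplit`. The `describe_url` helper and the `DEFAULT_PORTS` table are names we made up, and the second URL in the demo is just an illustrative local address.

```python
from urllib.parse import urlsplit

# Default ports applied when the URL does not spell one out.
DEFAULT_PORTS = {"http": 80, "https": 443}


def describe_url(url):
    """Break a URL into protocol, host, port, and path."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "port": port,
        "path": parts.path,
    }


if __name__ == "__main__":
    print(describe_url("https://httpstat.us/200"))
    # An explicit :port after the host overrides the default.
    print(describe_url("http://localhost:8000/health"))
```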
Headers pass metadata about the request. Here are some examples:
- Authentication (include some header that verifies your identity)
- Caching (we can have a header that saves the age of the request for caching purposes)
- Cookies (Browser information that is passed to the HTTP server as a header)
- Body info (is it in XML, JSON, or something else?)
The body is used to pass chunks of data along with the request. This is pretty straightforward: it is generally text that’s given in a format specified by the headers.
    $ curl -vv 'httpstat.us/200'
    *   Trying 220.127.116.11...
    * TCP_NODELAY set
    * Connected to httpstat.us (18.104.22.168) port 80 (#0)
    > GET /200 HTTP/1.1
    > Host: httpstat.us
    > User-Agent: curl/7.58.0
    > Accept: */*
Here we can see that when we send an HTTP request, we first establish a TCP connection with the host on a given port (typically port 80 for HTTP) and then we transmit the HTTP request (method, headers, and body) through the TCP connection.
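We can reproduce what curl does above by hand: open a TCP connection, then write the HTTP request into it as plain text. To avoid depending on an outside website, this sketch starts a tiny local server with Python's standard `http.server` module first. The `HelloHandler` class, the `raw_get` helper, and the response text are all invented for this demo.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Demo handler that answers every GET with a short text body."""

    def do_GET(self):
        body = f"you asked for {self.path}\n".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def raw_get(host, port, path):
    """Send a GET request over a plain TCP socket and return the raw response."""
    request = (
        f"GET {path} HTTP/1.1\r\n"  # method, path, and protocol version
        f"Host: {host}:{port}\r\n"  # headers, one per line
        "Connection: close\r\n"
        "\r\n"                      # a blank line ends the headers (GET has no body)
    )
    with socket.create_connection((host, port)) as connection:
        connection.sendall(request.encode("ascii"))
        chunks = []
        while True:  # read until the server closes the connection
            chunk = connection.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks).decode()


def run_demo():
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    thread = threading.Thread(target=server.serve_forever)
    thread.start()
    try:
        return raw_get("127.0.0.1", server.server_port, "/200")
    finally:
        server.shutdown()
        thread.join()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

The printed response starts with a status line, followed by the response headers, a blank line, and then the body.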
Note that in early versions of HTTP, a new TCP connection is established for each request. HTTP/1.1 can keep a connection open and reuse it for later requests, but there is still some overhead to creating a TCP connection, so this is something that newer protocols like HTTP/2 and QUIC have attempted to improve upon.
Another solution is websockets which are built on top of HTTP to establish a persistent and bidirectional connection between two computers. This is great for use cases where small amounts of information have to be sent often, like a chat server.
Components of HTTP Responses
When you make a request to an HTTP server, it sends you back a response. The HTTP response looks similar to the HTTP request:
- Status Code: A number indicating the status of the response.
- Headers: Metadata about the response, similar to those in the request.
- Body: The actual data within the response (when you request a webpage, this would be the HTML that your browser will render and display on your screen).
The status codes are organized by the first digit in the 3-digit code:
- 2xx: Success
  - 200: OK
  - 201: Created (new user registered correctly)
  - 202: Accepted
- 3xx: Redirection
  - 301: Moved Permanently
  - 302: Found
  - 304: Not Modified (for caching, more detail later)
- 4xx: Client error (people making the request messed up)
  - 400: Bad Request (server doesn’t know how to respond to the request)
  - 401: Unauthorized (you haven’t proven who you are, or your credentials are wrong)
  - 403: Forbidden (the server knows who you are, but you aren’t allowed access)
  - 404: Not Found (the URL path you specified does not exist)
- 5xx: Server error (people processing the request messed up)
  - 500: Internal Server Error (maybe you made a syntax error in your Python code)
  - 503: Service Unavailable
No need to remember these status codes because they’re uniform across all of the internet, so it’s super easy just to look them up if you encounter them. Normally, googling the status code will be able to point you in the right direction if you’re having trouble with HTTP communication.
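If you would rather look a code up from Python than from a search engine, the standard library ships the official names in `http.HTTPStatus`. The sketch below pairs each name with the first-digit categories listed above. The `CATEGORIES` table and the `explain_status` helper are names made up for this example.

```python
from http import HTTPStatus

# First digit of the status code -> category, as in the list above.
CATEGORIES = {
    2: "Success",
    3: "Redirection",
    4: "Client error",
    5: "Server error",
}


def explain_status(code):
    """Return a one-line description such as "404 Not Found (Client error)"."""
    status = HTTPStatus(code)  # raises ValueError for codes Python does not know
    category = CATEGORIES.get(code // 100, "Other")
    return f"{code} {status.phrase} ({category})"


if __name__ == "__main__":
    for code in [200, 201, 301, 401, 403, 404, 500, 503]:
        print(explain_status(code))
```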