First Semester at CMU

cmu content heavy post

Dec 25, 2023

I finished my classes last week at CMU, so I decided to write my thoughts on them since I had some strong feelings, especially cloud computing.

Foundations of Software Engineering (18652)

This course was required for my degree and consisted of a semester-long full-stack team project plus a bunch of quizzes and readings covering software engineering practices. The project involved developing a chat application with a frontend (HTML, CSS, JS) and a backend (NodeJS, MongoDB). Developing the project itself was pretty rewarding since it was relatively complex, but the lectures were super boring and honestly reminded me of my days in CSCE 315 at TAMU. The exam and quizzes were also terribly hard and pretty bad imo, but it is what it is.

While the quality of the project we made in the end was questionable, I definitely made some lifelong friends and the team bond was the strongest I have had in a group project 😀.

Cloud Computing (15619)

One of the legendary CS courses at CMU that covers cloud computing topics. This class is completely project-based and consists of 6 bi-weekly projects, a semester-long team project for the grad version, and weekly quizzes. Each project has a budget, and exceeding it incurs either a 10% or 100% penalty depending on how much you exceed it. So ironically, the more time you spend on it, the more likely you’ll get a 0.

Below, I’ll share my thoughts on each project and its difficulty (rated out of 5).

P1. Elasticity AWS auto-scaling (5/5)

The goal of the project was to invoke AWS APIs to create an auto-scaling group that will scale out when there’s an increasing number of requests and scale in otherwise. Implementing this with the Python APIs was around 900 lines of code, but I would say the harder part was tuning the scaling parameters since I had to reach a target RPS while staying within a cost per hour, and the auto grader had very tight bounds. Furthermore, I couldn’t just brute force to find the best parameters since that would exceed the project budget. Anyways, each submission took 40 minutes to run and I spent around 20 hours in total. This was the hardest individual project for me in the class.
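To give a feel for the parameters I was tuning, here is a minimal sketch of a threshold-based scaling decision in plain Python. It is not the AWS API and not my actual submission: `ScalingPolicy`, `desired_capacity`, and every number in it are hypothetical, picked only to show how thresholds and a cooldown interact.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    # All values are hypothetical, not the course's actual settings.
    min_instances: int = 1
    max_instances: int = 5
    scale_out_cpu: float = 80.0  # add an instance above this average CPU %
    scale_in_cpu: float = 20.0   # remove an instance below this average CPU %
    cooldown_s: int = 120        # minimum seconds between scaling actions


def desired_capacity(policy, current, avg_cpu, seconds_since_last_action):
    """Return the instance count the group should move to."""
    if seconds_since_last_action < policy.cooldown_s:
        return current  # still cooling down, so hold steady
    if avg_cpu > policy.scale_out_cpu:
        return min(current + 1, policy.max_instances)
    if avg_cpu < policy.scale_in_cpu:
        return max(current - 1, policy.min_instances)
    return current


policy = ScalingPolicy()
print(desired_capacity(policy, current=2, avg_cpu=91.0, seconds_since_last_action=300))  # 3
print(desired_capacity(policy, current=2, avg_cpu=91.0, seconds_since_last_action=30))   # 2
print(desired_capacity(policy, current=2, avg_cpu=5.0, seconds_since_last_action=300))   # 1
```

Even in this toy version you can see the tension: scale out too eagerly and the cost per hour goes up, scale out too late and the RPS target is missed.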

P2. Containers: Docker and Kubernetes (4/5)

The goal of this project was to containerize a chat application using Docker, push it to a Docker registry, and then deploy the containers to Kubernetes clusters on multiple clouds for fault tolerance. The project wasn’t terribly difficult since there wasn’t much coding, but there was a lot of new stuff to learn, including Docker management, Helm charts, pod auto-scaling, multi-cloud deployment on Azure and GCP, and creating a CI/CD pipeline using GitHub Actions.

P3.1 Cloud Storage - SQL and NoSQL (3/5)

The goal was to compare MySQL and HBase on a specific dataset. The MySQL part of the project was free, but HBase was honestly hard to use and took me a long time to understand. HBase is essentially a column-oriented distributed database where you should pre-split the table into multiple regions and then design a RowKey pattern that spreads the data evenly across them. Done right, this is scalable, fast, and fault-tolerant. Theoretically, it’s cool, but it was so hard to use compared to MySQL that I don’t think I will ever want to use it again.
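One common way to get an even RowKey spread is to "salt" the key with a hash-derived prefix. The sketch below is plain Python and does not talk to HBase; `salted_row_key`, `NUM_REGIONS`, and the `user000123`-style IDs are all made up for illustration and are not the schema from the project.

```python
import hashlib
from collections import Counter

NUM_REGIONS = 4  # hypothetical number of pre-split regions


def salted_row_key(natural_key: str, num_regions: int = NUM_REGIONS) -> str:
    """Prefix the key with a stable hash bucket so writes spread across regions."""
    digest = hashlib.md5(natural_key.encode("utf-8")).digest()
    bucket = digest[0] % num_regions
    return f"{bucket:02d}|{natural_key}"


# Sequential IDs would otherwise sort next to each other and pile into one region.
keys = [salted_row_key(f"user{i:06d}") for i in range(10_000)]
per_region = Counter(key.split("|", 1)[0] for key in keys)
for bucket in sorted(per_region):
    print(bucket, per_region[bucket])
```

The trade-off is that a range scan over the natural key now has to check every bucket, so this only pays off when even write distribution matters more than ordered scans.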

P3.2 Cloud Storage - Heterogeneous Storage on the Cloud (1/5)

A simple project that used several types of databases including SQL, Neo4j, and MongoDB. This was the simplest project in the class since it was mainly just learning and using the databases. I wish they had also added Redis to the set since that would have been kind of different yet useful.

P4. Iterative Processing with Spark (3/5)

In this project, I had to do exploratory data analysis and implement the PageRank algorithm using Spark and Scala. Honestly, it wasn’t too bad since a lot of the starter code was already provided, but understanding Scala syntax was slightly annoying. Also, since Spark is a distributed data processing framework, I needed to provision a cluster to run it at scale, which is expensive and annoying. I ended up trying both an Azure HDInsight cluster and Azure Databricks, and the latter is way better, although a bit more expensive.
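For anyone who hasn't seen PageRank before, the iterative idea fits in a few lines. This is a plain-Python stand-in rather than the Spark/Scala code from the project; the toy graph, the damping factor of 0.85, and the iteration count are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is shared among all pages.
        dangling = sum(ranks[p] for p in pages if not links.get(p))
        new_ranks = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Each pass needs the full result of the previous pass, which is why the project is called iterative processing and why a framework that keeps data in memory between passes, like Spark, is a good fit.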

P5. Stream Processing with Kafka and Samza (2/5)

The goal of this project was to implement a ride-sharing application like Uber where I had to process large amounts of real-time data to match drivers with customers. Once a customer requested a ride, I also had to serve the right ads to them based on their interests and health data in real time. This is all done using Kafka, a publish/subscribe messaging system for handling real-time data, and Samza, which processes the data with low latency. This was probably my second favorite individual project since the implementation was straightforward. It’s also cool yet scary to see how powerful and fast data can be. For instance, health data from your Apple Watch can pretty much predict your mood and the food you want in real time.
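The matching logic boils down to keeping a small piece of state up to date as events arrive. Below is a rough sketch in plain Python with no Kafka or Samza involved; the event types `DRIVER_LOCATION` and `RIDE_REQUEST`, the field names, and the nearest-driver rule are my own simplifications, not the project's actual spec.

```python
import math


def match_riders(events):
    """Consume events in arrival order and yield (rider_id, driver_id) matches."""
    available = {}  # driver_id -> (x, y) last reported position
    for event in events:
        if event["type"] == "DRIVER_LOCATION":
            available[event["driver_id"]] = (event["x"], event["y"])
        elif event["type"] == "RIDE_REQUEST":
            if not available:
                yield event["rider_id"], None  # nobody free right now
                continue
            rider = (event["x"], event["y"])
            nearest = min(available, key=lambda d: math.dist(available[d], rider))
            del available[nearest]  # the matched driver is now busy
            yield event["rider_id"], nearest


stream = [
    {"type": "DRIVER_LOCATION", "driver_id": "d1", "x": 0, "y": 0},
    {"type": "DRIVER_LOCATION", "driver_id": "d2", "x": 5, "y": 5},
    {"type": "RIDE_REQUEST", "rider_id": "r1", "x": 4, "y": 4},
    {"type": "DRIVER_LOCATION", "driver_id": "d1", "x": 1, "y": 1},
    {"type": "RIDE_REQUEST", "rider_id": "r2", "x": 9, "y": 9},
    {"type": "RIDE_REQUEST", "rider_id": "r3", "x": 0, "y": 0},
]
print(list(match_riders(stream)))
```

In a real deployment the events would come off a Kafka topic and the driver state would live in the stream processor's local store, but the per-event logic keeps roughly this shape.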

P6. Machine Learning on the Cloud (3/5)

In this project, I had to do some basic feature engineering tasks, tune a model’s hyperparameters using Google Vertex AI service, deploy the model on the cloud and use it as a service from a Flask backend, use Cloud Vision API, and train our own image recognition model using AutoML. I honestly thought Google’s Vertex AI service was pretty difficult to use since there are multiple components, but maybe it was just very different from the way I used to do ML and train my own models on Google Colab. From my understanding, Vertex AI is better for ML applications in the industry.

Team Project (6/5)

As you can see, the projects get easier after P2, but that’s because the team project starts during P2 and goes until the end of the semester.

There are three phases to the team project.

Phase 1 focuses on implementing three web services: Blockchain, Qrcode, and Twitter friend recommendation. The Blockchain and Qrcode services are relatively straightforward since they’re all just implementation. However, the Twitter service requires you to first migrate (ETL) 1TB of raw data from an S3 bucket to a MySQL database, and then implement the service based on the schema you designed. Each ETL run can take hours, so the DB schema design should be finalized early in the project and the ETL implementation should be correct to avoid reruns and stay within the budget (although we have a ton of budget : D). After that, we had to write Terraform scripts and create a self-managed Kubernetes cluster to deploy everything for submission.

Phase 2’s task was to increase the throughput of the Twitter service and ensure all three of the services could run simultaneously during the live test at the end of the phase, where a load generator will spam requests at your endpoint for 3 hours. Since most teams don’t finish the Twitter service in phase 1, they spend the majority of the time finishing it in phase 2.

Phase 3 focused on using managed container services, including ECS on Fargate, EKS on Fargate, and EKS on managed node groups. We also had to move our database to a managed one: RDS, ElastiCache (Memcached, Redis), Keyspaces (Cassandra), Neptune (graph), DocumentDB, or DynamoDB. Phase 3 itself wasn’t hard, but getting high up on the leaderboard was difficult.

Thoughts on the class overall

This was probably my favorite CS class of all the CS classes I have ever taken from high school through my master’s, probably because it covered so much, was well organized, and yet gave the students enough flexibility in the team project to try new things. The material is very useful in the industry, and I have seen some of the technologies, like ECS Fargate, used multiple times at Capital One. In fact, some students in the class are industry people just here to take this class. There were multiple days where I stayed up until 2 - 3 am and worked 12 hours a day just to finish an assignment, but I learned a lot and it was all worth it. Previously, I had mainly debugged things at the program level, so finding the bottleneck in a whole system was quite a new experience. My biggest regret in this class was our ranking in the team project. Had we used C++ for the first two web services and tried Redis as a caching layer for the Twitter service, we would have ranked much higher. In fact, we had the Redis schema designed and implemented but just didn’t have time to integrate it. As a competitive programmer, this hurts 😭.

For anyone taking the class, I recommend the grad version (15619) over the undergrad version (15319) since it includes the team project, and that’s where the gold is.

CMU experience so far

When I first arrived at CMU, I didn’t think I’d vibe that well. But so far I have made great memories with the people here. Maybe it’s because the program is so small that everyone is tightly coupled, whereas TAMU was huge and thus I spent the majority of the time in my dorm. I have also been going to more events compared to TAMU, whether that’s late-night parties, karaoke, or events hosted by CMU.

Classes have been very time-consuming. At TAMU, I only stayed up till 2 am four times across three years, but I have stayed up till 2 am for more than two weeks this semester at CMU. Then again, I did take a very hard class combo, so it should only get easier going forward.

There are also a good number of tech events, both in the Bay Area and remote. Some notable ones include two crypto talks by a16z VCs at Stanford, an AI talk by a researcher from OpenAI, a cloud computing talk by the Azure CTO, and a bunch of talks at CMU about job hunting and startups. I think the best thing about being in the Bay Area and at a good university is the opportunity to go to many of these events with high-profile speakers for free.

I haven’t met anyone that’s crazily cracked. There are definitely a lot of smart people, but not many I admire the way I did in high school. Maybe it’s just that I haven’t talked to enough people. Though I’m comparing this relatively to Plano West, where I have met the most cracked group of people.

“If I have seen further, it is by standing on the shoulders of giants.” - Isaac Newton. It is in the brilliance of others that we often find the sparks that illuminate our own paths.

Mark Sturt

Dec 26, 2023

Cheers to a great first semester! Glad it went well, especially since you moved to an entirely new area and had to make new friends / connections from scratch (with the exception of your roommate :) ). It's definitely very different from TAMU.

"There were multiple days where I stayed up until 2 - 3 am and worked 12 hours a day just to finish an assignment, but I learned a lot and it was all worth it"

Honestly, the best classes we take tend to follow this scheme 😅 (cough cough networking); I feel like these types of classes mainly are so difficult because the first time you take them you're being exposed to so many new ideas / languages / libraries / techniques / etc. to the point that it will take hours just to learn how to do something simple. But I agree that it definitely is worth it because once you've done it once, now you have a bunch of experience that you can take with you to either other classes or the future

"I’m comparing this relatively to Plano West, where I have met the most cracked group of people"

Yeah, once you meet Jimmy, there's not many other people you can meet who can compare

"As a competitive programmer, this hurts 😭"

👀

Thanks for the write-up and looking forward to next semester's. Might just have to apply in hopes of taking these classes now...
