I think of life as a long adventure through an infinite, mountainous terrain. The landscape is knowledge, and the exploration thereof is learning. We all start in the well-travelled lowlands, gradually making our way to more challenging regions. Most of the time, we use maps to help us navigate, and we work with others to surmount the challenges. The nature of the pursuit is to ultimately find oneself in an unknown land, breaking new ground in the hope of finding interesting new sights and, with luck, taking the time to mark them on our map.
In this landscape, the elevation at a location is the unreachability of that knowledge, and the steepness of the slope is the difficulty of the route to get there. Some achievements are at the top of sheer cliffs with minuscule holds all the way up, while others are long treks up gradually ascending terrain. Most knowledge can be acquired in multiple ways, trading off difficulty for time or vice versa. Some routes that are passable to some are too strenuous for others, but there's usually another way around given the determination.
Every person, throughout their life, becomes familiar with different regions of the landscape, and each contributes some of that familiarity to the atlases that are universities, governments, corporations, cultures, and families. The total mapped-out area of this world is the sum of human knowledge. I want to make that bigger.
The Terrain Ahead
The summit in the distance
Motivating oneself is much easier when one has something to strive towards. For me, that pursuit is what I'll call a distributed programming framework. Let me explain what I mean by this, since it's not an idea that I've yet encountered in the wild (understandably, since it is unashamedly ambitious).
I'll start with the current lay of the computer-engineering land. As things stand, computer engineering is rigidly componentized. There are broad categories of technologies, each with restricted interoperability with the others. One way to break it down is as follows.
- Hardware
This includes networking hardware, general-purpose computation, and specialized computation like GPUs, supercomputers, and IBM's new "neural" computation.
- Operating systems
In the last decade, with the rise of commodity hardware and containerization, infrastructure has been converging on Linux as the OS of choice (setting distributions aside). Of course, Windows and OS X aren't going anywhere in the consumer space, and the entry of newcomers like Android, ChromeOS, and iOS will have unpredictable effects there.
- Programming languages
Despite the entrenchment of well-established staples (C++, Java, JS, C# (I guess?)), the field of programming languages has demonstrated a surprising averseness to stagnation (or even convergence). A number of trends are apparent, including: dynamic typing (Python, Ruby), functional programming (Haskell, Scala, Clojure, F#), "stick it on the JVM" syndrome (Clojure, Groovy, Scala), stronger type systems (Haskell, Scala, Rust), re-thinking systems programming (D, Go, Rust), and provability (Agda, Coq). There are so many programming languages that engineers are expected to know more than one and be able to learn new ones quickly.
- Infrastructure
This is further divided into databases, data processing, and orchestration. All of these have made tremendous advances in the last 15 years. The advances were kicked off by Google with BigTable, MapReduce, and Borg, respectively, but the ideas have made their way into OSS (e.g. Hadoop, Cassandra, Mesos) and innovation is happening everywhere with the NoSQL movement (Manhattan, MongoDB, Riak, CockroachDB, Spanner), stream processing (Flink, Samza, Spark, Storm) and further innovations on that (lambda architecture, Dataflow model), and an abundance of orchestration systems, both self-hosted (Docker, Kubernetes, Mesos) and not (AWS, Azure, GCE).
- Application services
The land of the webserver and intermediate application-specific logic, these services and the libraries they're built on evolve as quickly as the demands on applications. Blink, and you'll be left behind.
- Application frontends
The most important are HTML/CSS/JS, Android, and iPhone. Thanks to their ubiquity, these foundational technologies are protected (or held back) from rapid evolution. However, higher levels of abstraction (Web libraries in particular) evolve along with the underlying application services. Their ubiquity also means it's unprecedentedly easy to ship products to the consumer.
The peak I see in the distance eliminates these problems. It works by pushing the componentization behind existing layers of abstraction. In particular, it would push application frontends, application services, and infrastructure into the programming language component. Instead of dealing with unwieldy components that talk over the network, I want to push the notion of end-to-end network communication behind the abstraction of a programming language (note that I said end-to-end, where current languages just give you a single end of the pipe). Operating systems and hardware do not share a common medium with the other components, and so are out of reach at the moment.
So, what does this buy us? Merging the components means the components become libraries, not whole applications. They can interact via a programmatic API instead of a network API, so a compiler can check for the kinds of errors that are otherwise only possible to check with integration tests. Deployment of large-scale infrastructure becomes straightforward thanks to the homogenized system. Most importantly, arbitrary application boundaries disappear, allowing a whole new level of reusability.
Predictably, this change introduces tremendous challenges. Absorbing end-to-end communication means the "language" has to know about the network and the computers on it. This makes it a totally different beast from a regular programming language. It now needs a distributed runtime, with all the problems that brings (Orleans is an interesting project in this direction). Solving the distributed runtime problem requires powerful abstractions that implicitly handle the common problems of distribution, while remaining customizable enough to patch the holes when the abstractions leak (Finagle is a fascinating endeavor to do this with an RPC library). Another big problem is supporting the independent evolution I mentioned above. That means the system must support partial deployments (e.g. just deploy the database for a specific application). The uniformity of the proposed system is itself a challenge. If the plethora of programming languages teaches us anything, it's that no one language will ever fulfill all requirements and tastes. Accepting this, we must design a programming language that allows itself to evolve along with its libraries (Haskell is an interesting case study with its language extensions), maybe even creating an intermediate language for the runtime that languages can compile to (similar to JVM bytecode). The system itself should be open-source, of course, but it should also allow the distribution of closed-source packages, similarly to what Java can do, but maybe even with licensing to support a variety of business models.
I don't know that such a system is feasible in my lifetime, or even that it is a good idea. It may be that the current networked component model is the way to go, and someone will figure out a way to make it manageable. If that becomes the case, I'll happily change course and lend the maps I have drawn in pursuit of the new peak. Until then, this is the summit I'm heading towards, and looking for the best route there keeps me motivated and happy.
Scouting the landscape
I expect the path to a distributed programming framework will be long and circuitous, so I'm scouting the area before attempting to summit.
My time at Twitter served this end, giving me an understanding of the difficulties of running applications that run across hundreds or thousands of machines. I learned about the importance of metrics from my team, the unwieldiness of multi-zone deployments from my time writing a multi-zone alerting system, the difficulties of the network through Finagle, the downsides of coordination from our Zookeeper clusters, the unforeseen issues that can arise in storage systems through Manhattan, the varied requirements of orchestration from Aurora and Mesos, and the discomfort of working in an overly-complex type system from Scala.
I've continued to do independent research, particularly on the mathematical side of Computer Science. My friend and former intern Kyle introduced me to Dependent Type Theory, a fascinating type theory that allows one to write and prove constructivist mathematical theories that are checked by the type checker for correctness. The programming language Agda (Overview, Learn You an Agda) is a promising approach to making this idea more practically useful. To familiarize myself with Rust's interesting memory ownership model, I wrote a simple gas simulator (inspired by an assignment Iryna had in one of her Astrophysics classes).
My new position on Google's Cloud Dataflow team is an opportunity to learn about the demands of data processing.
I think of these experiences as (relatively) short climbs surrounding the main peak (reading the Agda paper was a brutal two weeks of struggling up a granite face (no I'm not saying it was as hard as the Dawn Wall, just that El Cap is awesome)). Through these smaller climbs, I get glimpses of potential routes to the main peak. Plus, each climb is satisfying in its own right. That's how I determine what I want to work on.
The Roads Behind
Google: At the head of a new trail
Here in March 2016, I'm just starting out at Google, so the nature of my work and what I can expect to learn have yet to become clear. I chose the Cloud Dataflow team because the project (being open-sourced as Apache Beam) is at the forefront of data processing. Data processing is a mountain range I've only seen from afar, so I want to bag a summit or two.
I've had numerous conversations with friends about the reasons to choose either a startup or an established company. From what I've heard about Google, there's a tremendous amount of knowledge embedded in their developer tools, a term in which I'm including everything from local tools to infrastructure. In other words, Google has its own (mostly) private atlas of knowledge whose breadth most startups will never even approach. I chose Google to read that atlas and use it to guide me through as much terrain as I can. I'm sure their tooling will be a boon for my productivity, but I also want, as much as possible, to understand the reasoning behind the design of those tools.
Twitter: A path to a new land
I worked at Twitter as an intern in Summer 2012 and full time from August 2013 to March 2016. It was my first serious foray into the land of Computer Science. I'd taken many CS classes in college, but my time was limited by the requirements of my Physics degree, so I didn't have time to learn the practicalities. Computers have always fascinated me, though, and I was eager to break into a (lucrative) new field.
I spent both the internship and my full-time years on the Observability team, Twitter's internal monitoring team. From there, I got a good deal of experience, not only from running the stack itself (which I'll describe presently) but also from using the team as a vantage point into other teams' challenges. Observability was something of a nexus point between the operations teams (SREs and Command & Control), software engineers who worked on live services (mostly the Logic part in Example 2 in Marius's excellent talk on Finagle), and software engineers who worked on common libraries (e.g. Finagle itself).
The Observability stack has a pretty straightforward design, but a great deal of complexity in practice. You can refer to this outdated blog post for an understanding of what it looked like when I started. Some components have shifted or been removed since, but much of it remains the same. The data flows as follows:
- Applications aggregate their metrics, such as counters and histograms, into minutely windows.
- Every minute, a colocated process queries an HTTP endpoint for a JSON dump of that window's metrics and forwards the dump to the timeseries writing service.
- The writing service processes these dumps and writes them to the underlying timeseries database (Manhattan) and the indexing service that tracks metric groupings.
- When the timeseries reading service receives a query for data, it performs index lookups, retrieves timeseries data, and performs the computations specified in the query, returning the final result to the caller.
- The caller can either be a user via a chart on a webpage or command line tool, or the alerting system. The latter operates by allowing users to configure sets of queries to run periodically, along with criteria determining the state of the query and a policy on how to notify in case the state requires it.
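To make the first and last steps of that flow concrete, here is a minimal sketch in Python. It is not Twitter's code: the `Aggregator` class, the `alert_state` function, the metric name, and the thresholds are all hypothetical names made up for illustration, and the real stack also handled histograms, the HTTP and JSON plumbing, storage, and indexing.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # "minutely" windows, as in the list above


class Aggregator:
    """Hypothetical in-application aggregator for counters."""

    def __init__(self):
        # window start (seconds) -> metric name -> count
        self.windows = defaultdict(lambda: defaultdict(int))

    def incr(self, name, timestamp, delta=1):
        # Round the timestamp down to the start of its one-minute window.
        start = int(timestamp) // WINDOW_SECONDS * WINDOW_SECONDS
        self.windows[start][name] += delta

    def dump(self, start):
        # Return and discard one window's metrics: the kind of payload a
        # colocated collector might fetch and forward to a writing service.
        counters = self.windows.pop(start, {})
        return {"window_start": start, "counters": dict(counters)}


def alert_state(value, warn, critical):
    """Map a query result to a state using user-configured thresholds."""
    if value >= critical:
        return "critical"
    if value >= warn:
        return "warning"
    return "ok"


agg = Aggregator()
for ts in (0, 10, 59, 60, 61):
    agg.incr("requests", ts)

first_window = agg.dump(0)
state = alert_state(first_window["counters"]["requests"], warn=2, critical=5)
```

Aggregating in the application keeps the per-minute payload small no matter how many events occur, which is presumably part of why the work is split between the application and the collector this way.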
I spent most of my time working on the alerting system, maintaining the old one at first, then later making a replacement for it. By the time I left, the project to replace the alerting system had grown into a project to create a whole new visualization UI with a new configuration system for users. The project, which I led for the last 6 months, was productionized, and migration by users from the old to new system was well under way.
That work was my primary focus, but I dedicated substantial time to understanding other libraries and systems. Finagle, and the underlying Util, are likely the most involved codebase I've tried to understand. Documentation is good for superficial topics, but very poor for understanding the involved depths of Finagle. By spelunking through the uncharted caverns, I learned about some of the problems encountered by RPC systems. However, I also learned higher-level concepts: the importance of documentation, including what kinds of documentation are and aren't worth spending time on; the power of well-considered abstractions for simplifying APIs; techniques for spelunking more efficiently, e.g. trying to answer a single question at a time, using types to simplify reasoning, and using grep effectively. I learned about the misery of being on call when all the services are burning up. I learned to enjoy helping users with their problems instead of resenting the time they took away from my "real work".
Caltech: Above the tree line
My experience of college was atypical by the standards of most public universities. I knew when I accepted that Caltech has a reputation for brutal academic standards, but I wasn't expecting the degree to which that would hold. It was like the difference between seeing a trail on a map and actually hiking it. You know it's labeled "very strenuous" and that it's a ten-thousand-foot elevation gain over 15 miles, but you really don't understand what that means until you're gasping for air, legs cramping, racing the implacable sun to get to the top before night comes. I did get some advance warning when I took the Math and Physics placement tests. I was placed in the remedial Math section and didn't test out of the introductory Physics class despite my 5s on the Calculus BC and both Physics C AP tests, the latter of which I'd taught myself.
As the trimesters passed, each one presenting a new gauntlet of challenges to face, I had little idle time. There was usually time for a single social outing per week and maybe a few hours of video games or TV. Most socializing happened collaborating on sets, the furnaces whose heat smelted away the slag of unfamiliarity and tact from my friendships, leaving behind close friendships that I expect will last a lifetime despite long distance. The remainder of the time was spent at the gym, working with the same friends to hone our bodies the way our studies were honing our minds.
Looking back, my time at Caltech is some of the most valued of my life. I discovered that "being smart" is not about already knowing everything you're supposed to know, but rather about having the humility to admit you don't know something and the dedication to figure it out. I learned how to organize my time, and how to take advantage of my natural tendencies to be more efficient. I learned how to deal amicably with people who think completely differently from myself and how to build strong friendships with those who think compatibly. I realized the importance of humility, but gained confidence in myself to take on new challenges. It really was like a long, strenuous hike: four years of Type-II fun interspersed with a healthy share of Type-I. I came away with far more than I had when I arrived.
Carroll High School: Strolling through the grassy foothills
I grew up a lot in high school, but I can't say the academics were the main challenge. Those 4 years were spent learning human skills. I made a lot of progress figuring out how to interact with people: how to avoid insulting potential friends, how to get over the self-consciousness of new situations. I discovered the gratification of working hard towards a goal from the Cross Country team.
I was not particularly driven, and I didn't really have a goal. I meandered around idly, doing what my parents told me to do (usually), but often halfheartedly. I did my chores, but needed many reminders. I got good grades, but not because I studied. I played the violin, but didn't practice methodically like I knew I should. I ran, but I would often stop early when it became too painful, and I didn't go above 30 miles per week until my junior year. I had crushes, but I didn't act on them.
I gradually discovered what I was missing. I dedicated myself to running in my senior year, and I buckled down and studied for standardized tests. I applied to top colleges, aiming high. The whole picture didn't come together, though. It would take another 4 years at Caltech for me to understand how I can be successful. Don't misunderstand me, I'm not claiming I know any "secrets to success" or anything cheesy like that. Those are far from secret: consistency, goals, and humility. What I've done is figure out how to motivate myself to do these things. That doesn't mean I won't encounter problems in the future or that I won't fail, but I hope to tackle those problems head on and get back on my feet after a failure. I learned how to walk. Now I need to get to exploring the landscape before me.
The landscape metaphor is inadequate for capturing the whole picture, since it entirely misses the self-reinforcing nature of knowledge acquisition. That is to say, knowledge begets more knowledge, typically faster. I still like the metaphor. It plays to my love of nature, and is a useful medium for expressing my thoughts. I hope it wasn't cumbersome or cliché.