My Journey in Data
Everyone has to start somewhere. That start can be unexpected or by design, but eventually you look back and maybe it makes sense. (I've been watching Dark on Netflix, so time travel has been on my mind.)
Trippy philosophy aside, let me talk about what data means to me and what brought me to this day. I want to focus on what I did and learned rather than any company-specific stuff.
That, in a few words, is pretty much a summary of my journey. Let me elaborate.
What if?
Honestly, I would have ended up working at IBM in Research Triangle Park (RTP), NC. I had the opportunity to join the PureSystems group in 2014, right after grad school. My research was on security in the cloud, and IBM was the big player in RTP that scooped up folks from NC State.
I really don’t know what would have come to be had I accepted that offer. It was compelling, and I wouldn’t have had to move across the country.
Still, I kept hearing "big data" despite having been in the cloud space for a while. It seemed like something was brewing, but I never quite understood what until I finally joined this place in Palo Alto, CA.
Cloudera
It was the place to be in data at the time. In a way, the elephant in the room (if you know, you know).
The initial push
It started with two back-to-back courses, one on Hadoop administration and one on MapReduce, which were enough to keep anyone engaged. Learning how to spin up a single-node Hadoop cluster with daemon processes for the individual services was fun. I managed to get certified in both. Back then, certifications mattered, I guess?
Follow that with the required reading of Hadoop: The Definitive Guide by Tom White and you are in the midst of the entire big data ecosystem. It was like reading the whole Harry Potter series in one go.
While it was a great overview of all the technologies, reading only gets you so far. You need to jump in.
My role
This role wasn't something I expected. It was operations work with customers and their large-scale deployments. These were technical customers, with issues that could surface anywhere across a distributed system.
Debugging, advocating best practices, and seeing how things break at scale was a trip on its own. We'd see things break in new ways on a daily basis.
I worked closely with the largest customer there and even got to engage with Sales and Professional Services. It was a role unlike any other.
What I learned
Highlighting some of the areas I worked on.
MapReduce - taught me the fundamentals of distributed computing. Writing jobs in Java, seeing how schedulers deal with applications, and tuning memory requirements were all part of this (see the sketch after this list).
Spark - I saw this phenomenon being born and then taking the world by storm. You didn't have Python in this world back then; it was Scala and Java only. Python was the forgotten third child that received no love. But Spark was powerful and had a strong community, and it laid the foundation of my knowledge in this domain.
HDFS - The big beast of storage. Look under the hood and there's enough complexity and intricacy that you might never fully grasp it. But it was the de facto storage choice for this world; everything from MapReduce to Hive to Spark used HDFS. I don't know what could accurately replace it in the present world.
Hive - SQL over MapReduce. Yes, please. It brought in a lot of new capabilities but still had a strong MapReduce dependency. There was work on Hive-on-Spark, but I can’t remember using it. The Metastore, however, was a valuable piece here (I realized this years later).
Kafka - Funny enough, we at Cloudera received a company-wide Kafka training the same week Confluent was born. It's like seeing a competitor form out of thin air. I used to attend Kafka meetups at LinkedIn's Mountain View campus (the Unite room) to learn more about this new thing. It did deliver and change a lot of the messaging infrastructure world, as predicted back then.
Other tools - notable mentions to Kudu, Pig, Impala, Flume, Sqoop, Cloudera Manager, and HBase. These were around and kicking. I didn't go deep into them, so I won't speak to them here. I saw things here and there, but I wish I had explored more; there were some interesting lessons in how these tools were built.
Languages - Java was the main one here. A bit of Scala and Python also made it into my day-to-day.
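To make the MapReduce point above concrete, here's a minimal word-count sketch, written in Scala against the classic Hadoop MapReduce API. We wrote these in Java back then; the names and paths here are hypothetical, but the Mapper/Reducer/Job shape is the same.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Emits (word, 1) for every token in a line of input.
class TokenizerMapper extends Mapper[Object, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: Object, value: Text,
                   context: Mapper[Object, Text, Text, IntWritable]#Context): Unit = {
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { token =>
      word.set(token)
      context.write(word, one)
    }
  }
}

// Sums the counts for each word on the reduce side.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    context.write(key, new IntWritable(sum))
  }
}

object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(classOf[TokenizerMapper])
    job.setMapperClass(classOf[TokenizerMapper])
    job.setCombinerClass(classOf[SumReducer]) // combiner reuses the reducer to cut shuffle volume
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args(1))) // must not already exist
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```

In practice, the memory tuning mentioned above usually comes down to container settings like mapreduce.map.memory.mb and watching how the scheduler places those containers.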
My experience here kicked off my open source contribution journey too. Read more: [Open source journey]
I left Cloudera in early 2017 to join another company that was starting to build a team around big data.
Stitch Fix
Moving from being a vendor to being a user of these technologies is a strange shift, not just in engineering practices but also in how you think as an engineer. Different challenges and different perspectives.
My role
In the Algorithms group at Stitch Fix, there were two large teams: Data Science and Data Platform. I was in a sub-team within Data Platform that was tasked with building Data tooling for the Data science teams. Sort of a common tooling layer that all of the DS teams could use.
I stayed in the same team throughout my tenure and it was a joy to learn, build, and contribute to the mission of helping Data science be more productive.
What I learned
I wrote a piece about the general lessons from my time here: [5 lessons from 5 years of building data infrastructure], but I didn't cover technologies in it. I'm going to do that here - somewhat concisely, since these are large topics that might even need their own posts.
These are areas of work and tooling alike:
Apache Spark - Bringing my earlier experience here was valuable, but I learned a lot more about building the infrastructure to support Spark usage. I wrote ETLs in Spark to test things or to work with data scientists, but my work focused on building the tooling for them: managing the clusters (EMR) and the Spark binaries, adding new releases, patching, and building all the internal Scala libraries. I spoke about this in my talk last year: Modularized ETL writing with Apache Spark (see the first sketch after this list).
Presto/Trino - Ad-hoc queries were just as important as scheduled ETLs in our world. Learned about Presto’s deployments, its inner workings, and scheduling needs, and that helped in the testing of Trino later. An important tool, IMO, that helps power valuable business insights for the team.
S3 - This was our source of truth: all data lived in buckets. Our team managed this warehouse, so we were responsible for all the service interactions. I wrote the migration tooling for this warehouse back in 2018 to bring it to a cleaner state. S3 did a lot of heavy lifting, but we wouldn't have gotten the value without the tooling we built around it.
Apache Livy - I adore this one. It was a job server for running Spark jobs in batch and, later, streaming fashion. We used it primarily in our reader/writer tooling that delivered data to users as a Pandas DataFrame (see the Livy sketch after this list). I spoke about this here: Improving ad hoc and production workflows at Stitch Fix.
Metadata - There's so much I could say about this. I wrote a lot of tooling here, and staring at a 24,000-line Thrift-generated file to find problems scars you. It taught me a lot about Hive, general metadata management, and how the Metastore interacts with Spark and Presto. I want to build something in the future around the problems I've seen here; metadata is useful, but it takes effort to drive value out of it (see the Metastore sketch after this list). I spoke about this too: Building a Metadata ecosystem using the Hive Metastore.
Privacy - I was part of the effort to help move the business into the UK, which meant GDPR and eventually CCPA. I was in London the day GDPR went into effect, and I met a bunch of folks who had solved, or were solving, these problems. Having an English breakfast at E Pellicci's while talking GDPR with practitioners is a memory that has stuck with me. I can't go into specifics given the sensitivity of the topic, but it was a teachable moment in my career.
Infrastructure tooling - Stitch Fix had such a great foundation of developer tooling that deploying services was a breeze. I owned a few services and had to set up a lot of them end-to-end, which taught me about load balancing, integration testing, memory management, alerting, monitoring, and metrics.
Data quality - Understanding what breaks, and when, is a hard problem, especially at scale. We built internal Spark-based tooling to identify problems in data and help improve it. Without going into much detail beyond the talk I gave (the Spark one mentioned earlier), it was similar to AWS Deequ, but automated for our users. I wrote the Spark piece (all the calculations at scale), and doing that efficiently taught me a ton of lessons (see the data quality sketch after this list). It also helped bring a testing-first culture into our ETL writing.
Apache Iceberg - The last thing I did was Iceberg. I dove into it after concluding that it fit our S3 + Hive Metastore world better than Delta or Hudi. I wrote the initial implementations, worked with open source to push improvements to Iceberg itself, and planned out its future at the company before departing. I also wrote a Scala-based REST server for Iceberg's cataloging needs - that was a ride of its own (see the Iceberg sketch after this list).
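To make a few of these concrete, here are some rough sketches. First, the modular ETL style from the Spark item: the job, columns, and S3 paths below are hypothetical, but the core idea is keeping the transformation a pure DataFrame-in, DataFrame-out function so it can be exercised against tiny hand-built DataFrames in a unit test, no cluster required.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object ExampleJob {
  // Pure transformation: easy to unit-test with small, in-memory DataFrames.
  def transform(orders: DataFrame): DataFrame =
    orders
      .filter(col("status") === "shipped")
      .groupBy(col("client_id"))
      .agg(count(lit(1)).as("shipped_orders"), max(col("shipped_at")).as("last_shipped_at"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("example-etl").getOrCreate()
    // Hypothetical input/output locations; real paths came from the surrounding tooling.
    val orders = spark.read.parquet("s3://example-bucket/raw/orders/")
    transform(orders).write.mode("overwrite").parquet("s3://example-bucket/derived/shipped_orders/")
    spark.stop()
  }
}
```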
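For the Livy item, this is the shape of submitting a Spark batch through Livy's REST API. The endpoint, jar path, and class name are made up; the actual reader/writer tooling was much more than a single call like this.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object LivySubmit {
  def main(args: Array[String]): Unit = {
    // Livy's POST /batches starts a Spark batch job from a jar that the cluster can reach.
    val payload =
      """{
        |  "file": "s3://example-bucket/jars/example-etl.jar",
        |  "className": "ExampleJob",
        |  "conf": {"spark.executor.memory": "4g"}
        |}""".stripMargin

    val request = HttpRequest.newBuilder()
      .uri(URI.create("http://livy.example.internal:8998/batches")) // hypothetical Livy endpoint
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(payload))
      .build()

    val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    // Livy responds with a batch id and state; poll GET /batches/{id} to track the job.
    println(s"${response.statusCode()}: ${response.body()}")
  }
}
```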
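For the metadata item, a rough sketch of poking at the Hive Metastore through its Thrift-backed Java client; the URI, database, and table names are hypothetical.

```scala
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient
import scala.collection.JavaConverters._

object MetastoreInspect {
  def main(args: Array[String]): Unit = {
    val conf = new HiveConf()
    conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://metastore.example.internal:9083") // hypothetical URI
    val client = new HiveMetaStoreClient(conf)
    try {
      // List tables in a database and peek at one table's storage location and partition keys.
      client.getAllTables("analytics").asScala.foreach(println)
      val table = client.getTable("analytics", "orders")
      println(s"location: ${table.getSd.getLocation}")
      println(s"partition keys: ${table.getPartitionKeys.asScala.map(_.getName).mkString(", ")}")
    } finally {
      client.close()
    }
  }
}
```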
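For the data quality item, a toy version of the idea: compute a handful of metrics in a single Spark pass and assert on them. This is purely illustrative and far simpler than the internal tooling (or Deequ); the dataset and column names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object DataQualityChecks {
  // Computes a few cheap metrics in one pass: total rows plus null counts per column.
  def runChecks(df: DataFrame, notNullColumns: Seq[String]): Map[String, Long] = {
    val aggs = count(lit(1)).as("row_count") +:
      notNullColumns.map(c => sum(when(col(c).isNull, 1).otherwise(0)).as(s"nulls_$c"))
    val row = df.agg(aggs.head, aggs.tail: _*).collect().head
    row.getValuesMap[Long](row.schema.fieldNames.toSeq)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dq-sketch").getOrCreate()
    val orders = spark.read.parquet("s3://example-bucket/derived/shipped_orders/")
    val metrics = runChecks(orders, Seq("client_id", "shipped_orders"))
    metrics.foreach { case (name, value) => println(s"$name = $value") }
    // Fail loudly when a basic expectation is violated.
    assert(metrics("nulls_client_id") == 0, "client_id should never be null")
    spark.stop()
  }
}
```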
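And for Iceberg, a sketch of the Spark-side setup in an S3 + Hive Metastore world: register an Iceberg catalog backed by the Metastore and create a table with hidden partitioning. The catalog name, warehouse bucket, and table are all hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object IcebergSetup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("iceberg-sketch")
      // Register an Iceberg catalog backed by the Hive Metastore, with data living in S3.
      .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
      .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.demo.type", "hive")
      .config("spark.sql.catalog.demo.warehouse", "s3://example-bucket/warehouse")
      .getOrCreate()

    // Hidden partitioning: Iceberg derives the partition from ts; readers never see a partition column.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS demo.analytics.events (
        |  id BIGINT,
        |  ts TIMESTAMP,
        |  payload STRING)
        |USING iceberg
        |PARTITIONED BY (days(ts))""".stripMargin)

    spark.stop()
  }
}
```

From there, reads and writes go through that catalog, and Iceberg tracks snapshots and partitions in its own metadata rather than relying on Metastore partition listings.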
I’m immensely grateful for my tenure at Stitch Fix. It taught me a lot and strengthened my foundation.
After 5 years, my career brought me to a point where I looked outward to find the next step and see where I could learn more.
dbt Labs
I think it’s too early to talk much about what I do here. A quarter has passed and there’s a lot of learning left. But here is what excites me and some early learnings.
My role
I'm part of a tiny team that works cross-functionally across the dbt Engineering org. Our goal is to provide the strategy and execution path that leads the teams in the org to a better future for our product.
It's different from my previous roles, so I'm excited about the opportunity here. Interfacing with different teams and learning about the product and the company has been a delight.
What I learned
In progress for most of these:
Golang - Still early; I haven't gone much beyond a few tutorials. I can see how useful and powerful this language is, and will be, for the team. I'm exploring and will share more once I'm a bit further along.
Protobuf + gRPC - My Thrift experience is definitely helping here. I'm amazed at the tooling and how useful this is for standardization across the board. I'm getting to understand API design, contracts, and schemas a lot better now.
Connecting the business to the product architecture - How do you scale and efficiently design a business's tooling for the future? That's the question I'm trying to answer. I'm learning more about architecture and business concepts than I ever did, through a lot of books, tutorials, team interviews, and design prototypes. Ongoing and educational.
dbt open source - the open source product has a vibrant community that is passionate about the future. Doing some Spark-related ongoing work here. Excited to work with the community as it builds for the future.
Looking ahead
There’s so much in the data world that I don’t know. You may think all these learnings are enough to be useful - maybe, but I think a lot more needs to happen in my life.
If you’ve read this far, thank you for spending the time. I am not sure how unique or different my journey is from everyone else. But I do know there’s a lot more traveling to do. I’ll leave you with this: [Traveling Man - Ricky Nelson].
Until the next post, thanks for reading.
Stay encouraged.