Reflecting on Data+AI Summit 2022
It was my first in-person conference in three years. A big, scary moment, you know, with the pandemic still around in some form. But it wasn’t all bad. I learned some things. Let me share them with you.
Source: myself
Before 2022
I first attended this conference in 2016 (New York). Back then it was called Spark Summit and revolved around Apache Spark. I was somewhat knowledgeable in the area, and it was nice to see the community that usually talked in JIRA come together in one place.
It took me until 2019 to attend again, this time as part of my job. I had worked heavily on Spark for the previous five or six years, and that took me to the now-renamed Data+AI Summit as an attendee. Spark had grown a lot since 2016, and users were maturing along with their use cases and data architectures.
Leaping forward to 2021, I spoke at Data+AI Summit for the first time. My talk was, not surprisingly, Spark-related. Leading the Spark charge in my previous role taught me a lot about the nuances of building tooling around Spark. Here’s the talk if you are interested: Modularized ETL writing using Apache Spark.
Overall conference
The conference page and other material do a better job of highlighting the new initiatives announced, so I won’t discuss them here in detail.
Databricks, the showrunner, announced a lot of new initiatives. They’ve historically timed their key announcements, including open-source Spark releases, for this big show. It helps marketing, I guess 🙂
That being said, the conference has evolved.
It isn’t as Spark-centric anymore, and you can see that in the announcements and the use cases powered by it. Yes, Spark is still around, but I see it playing a behind-the-scenes role underneath the abstractions. I guess it’s now akin to a programming language?
The focus has now shifted to the community and its needs. Unity Catalog would become the new metadata layer, Delta is the storage layer where all the data lands, and the Lakehouse is the refinement and long-term “warehouse”. A Databricks Cleanroom helps analyze shared data securely.
These are signs of the community shifting: more toward use cases and less about the technology. Who cares what ML tool you use; MLflow will help manage it.
But this old world has a new change to deal with.
Modern Data Stack
This is a reckoning for the old-school data patterns. I come from the old world of Hadoop, Spark, and Hive, and I now work for dbt Labs. I’m feeling the change myself (more on that later).
But the MDS changes the user’s perspective and focuses on the exact nature of the problem rather than being a catch-all. That has given rise to a plethora of tooling focused specifically on the problem at hand.
MDS can be visualized like this:
Source: https://continual.ai/post/the-modern-data-stack-ecosystem-fall-2021-edition
Faced with this new paradigm, the traditional ETL/data warehouse/Spark/big data world is rethinking itself. Even if you have your data in one place, many of the use cases you need on top of that data become a Build vs. Buy debate.
And going by this conference and its announcements, Databricks is trying to be the one shop to do all of this. Whether it succeeds or not, only time will tell.
My changed perspective
What I worked on before dbt (someday, I’ll blog more about these):
Data warehouse in AWS S3
Spark for everything ETL: reading, writing, and Spark SQL for most of our SQL
Presto/Trino for ad-hoc querying
Pandas + Apache Livy for Data Science use cases
Hive Metastore + tooling for everything metadata
Airflow for orchestration
AWS EMR, then EMR on EKS for infrastructure
Complicated, yes. Build vs Buy was almost always Build. So, a lot of building came with a lot of lessons.
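To make the “Build” part concrete, here’s a minimal sketch of the kind of job that stack ran day to day: Spark reading from S3, transforming with DataFrames, and registering the output in the Hive Metastore so Presto/Trino could query it, with Airflow scheduling it on EMR. The bucket, database, and table names below are made up.

```python
# A hypothetical daily ETL job from the "Build" era: PySpark on EMR,
# S3 as the warehouse, Hive Metastore as the metadata layer.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("daily-orders-etl")   # made-up job name
    .enableHiveSupport()           # table metadata lives in the Hive Metastore
    .getOrCreate()
)

# Raw data landed in the S3 "warehouse" (bucket/path are hypothetical)
orders = spark.read.parquet("s3://example-raw-bucket/orders/")

# Transform with the DataFrame API (Spark SQL would work just as well)
daily_revenue = (
    orders
    .where((F.col("ds") == "2022-06-27") & (F.col("status") == "complete"))
    .groupBy("ds", "customer_id")
    .agg(F.sum("amount").alias("revenue"))
)

# Write back to S3 and register the table so Presto/Trino can query it too
daily_revenue.write.mode("overwrite").format("parquet").saveAsTable("analytics.daily_revenue")
```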
Having recently switched to dbt Labs, I went from the above world to this Modern Data Stack world. My former colleague called dbt Labs the “belle of the ball”, and it was evident from the people who showed up.
dbt makes things simpler with regard to analytics engineering. Pushing the code down to the data platform (a cloud data warehouse like Snowflake, Databricks, BigQuery, or Redshift) and giving the user Git-style project management can go a long way. In my opinion, it simplifies collaboration in an already complicated world of data.
I spoke to attendees who came to our booth. The prompt I used was: “What data problem keeps you/your team up at night?” And then the floodgates opened 😃
I could understand and relate to the problems they faced. Listening to their data problems made me think about how dbt could fit into a world like the one I described above. Since I’m not in sales or a sales-related role, I naturally helped them understand the technical side of solving their problems.
My colleagues at the booth were amazing to watch and learn from as they spoke to attendees and helped them understand more about dbt.
I think we changed some minds at this conference. I know I have transformed.
Highlights for me
A few things stood out.
dbt
I know dbt the product. I understand the problem it solves. But what I learned was how transformative it is to the world of data: what it empowers and how much easier it makes things for someone who knows SQL, and soon Python as well.
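To give a flavor of that, here’s a rough sketch of what a dbt Python model looks like, assuming a Spark-backed platform like Databricks. The model and source names are made up, and since Python models were still being previewed at the time, treat the exact API as subject to change.

```python
# models/customer_revenue.py - a hypothetical dbt Python model.
# dbt passes in a project handle (dbt) and a session on the data platform;
# on Databricks that session hands back PySpark DataFrames.
def model(dbt, session):
    dbt.config(materialized="table")

    # ref() resolves another model in the project, just like {{ ref() }} in SQL
    orders = dbt.ref("stg_orders")

    # Plain DataFrame code; dbt materializes whatever DataFrame is returned
    customer_revenue = (
        orders
        .groupBy("customer_id")
        .agg({"amount": "sum"})
        .withColumnRenamed("sum(amount)", "revenue")
    )
    return customer_revenue
```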
As someone who recently joined the company, I find it amazing to see how engaged and passionate the community is around this product. I know this will continue and help make the product better. I’m excited for Coalesce 2022.
Spark Connect
This is the big one that I am excited about. In a past life, it would have removed much of the need for abstractions and made the infrastructure around Spark a lot easier to support.
I managed to find time with Martin Grund, a former colleague who leads this project at Databricks, to pick his brain about the motivation behind this work and what’s next.
In a nutshell, this removes the complexity of shipping your code to the driver. The client (PySpark in this case) translates DataFrame operations into unresolved relations, which are sent over a gRPC + Protobuf layer to the driver (the cluster side), which then resolves, validates, and executes them.
The important links here are the ongoing work page ([SPIP: SPARK-39375]) and the [high-level design overview].
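To picture what that means for a user, here’s a rough sketch of the client side, assuming the API lands roughly as the SPIP describes (the endpoint, host, and table names here are hypothetical):

```python
# A thin PySpark client under Spark Connect: no driver runs in this process.
from pyspark.sql import SparkSession

# Connect to a remote Spark driver over gRPC instead of embedding one locally
# (host and port are made up).
spark = SparkSession.builder.remote("sc://spark-connect.example.com:15002").getOrCreate()

# These calls only build an unresolved logical plan on the client...
events = spark.read.table("analytics.events")
summary = events.groupBy("event_type").count()

# ...which is serialized via gRPC + Protobuf, sent to the driver, resolved,
# validated, and executed there, with the results streamed back to the client.
summary.show()
```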
Delta Open source
All of Delta is open source as part of the Delta Lake 2.0 release. Finally! I hope it stops Snowflake employees from posting about it.
I think this is long overdue. Databricks, I’m sensing, battle-tested a lot of these features internally and released them afterwards rather than taking an open-source-first approach; the opposite of what Iceberg/Hudi are doing. Although those technologies and their respective SaaS companies, Tabular and OneHouse, are most likely building differentiated things in private as well.
Open Sourcing all of Delta Lake
Photon
It’s the next-generation query engine on Databricks, written to be directly compatible with Apache Spark APIs. It’s a fast, vectorized C++ execution engine for Spark and SQL workloads that runs behind Spark’s existing programming interfaces.
Apache Spark and Photon and the Photon SIGMOD paper
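The appealing part for users is that nothing in the code changes; whether Photon handles a query is a cluster-level configuration choice. A small sketch of that, with a made-up table name:

```python
# Nothing Photon-specific here: the same DataFrame code runs on the classic
# JVM engine or on Photon, depending on how the cluster is configured.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("photon-agnostic-job").getOrCreate()

trips = spark.read.table("samples.trips")  # hypothetical table

# On a Photon-enabled cluster, eligible operators in this plan run on the
# vectorized C++ engine; otherwise the regular Spark engine executes them.
(
    trips
    .where(F.col("fare") > 0)
    .groupBy("pickup_zip")
    .agg(F.avg("fare").alias("avg_fare"))
    .explain()  # the physical plan shows Photon operators when they kick in
)
```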
Unity Catalog
A unified governance solution for all data assets. This is going to hurt a lot of other players in the data catalog space, assuming it covers all the bells and whistles. I saw the demo at their booth, and I think it does make a lot of the metadata and data catalog setup easier. As a developer, my only user-experience concern is surfacing Unity Catalog behind an abstraction that is easy to use.
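For context, the developer-facing part of this is mostly the three-level namespace (catalog.schema.table) plus centralized grants. A hedged sketch, with all object and group names made up:

```python
# Unity Catalog's three-level namespace from a notebook or job.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tables are addressed as catalog.schema.table (names are hypothetical)
df = spark.sql("SELECT * FROM prod_catalog.finance.daily_revenue LIMIT 10")
df.show()

# Governance (grants, lineage, auditing) hangs off the same objects centrally,
# e.g. a SQL grant to a group (Databricks SQL syntax, group name made up):
spark.sql("GRANT SELECT ON TABLE prod_catalog.finance.daily_revenue TO `analysts`")
```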
Ending thoughts
It was interesting to see so many new vendor names at the Expo hall. It tells me there’s more left to this data world.
I also ran into a few former Cloudera colleagues who are now in different companies in this space. That tells me how much of that old world has now adapted to this new way of thinking about data and its use cases. Hadoop may have died, but I think it founded a new age of data tooling that we are collectively experiencing.
I like that Data+AI Summit still has fewer suits and more practitioners (going by the attendees). If that trend reverses, you know the conference is headed for its demise. I hope it survives in the long run.
Until the next post, thanks for reading.
Stay encouraged.