Summary
Building applications on top of unbounded event streams is a complex endeavor, requiring careful integration of multiple disparate systems that were engineered in isolation. The ksqlDB project was created to address this state of affairs by building a unified layer on top of the Kafka ecosystem for stream processing. Developers can work with the SQL constructs that they are familiar with while automatically getting the durability and reliability that Kafka offers. In this episode Michael Drogalis, product manager for ksqlDB at Confluent, explains how the system is implemented, how you can use it for building your own stream processing applications, and how it fits into the lifecycle of your data infrastructure. If you have been struggling to build services on low-level streaming interfaces, then give this episode a listen and try it out for yourself.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Are you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake and real-time streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you’re a listener for a special offer!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Michael Drogalis about ksqlDB, the open source streaming database layer for Kafka
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what ksqlDB is?
- What are some of the use cases that it is designed for?
- How do the capabilities and design of ksqlDB compare to other solutions for querying streaming data with SQL such as Pulsar SQL, PipelineDB, or Materialize?
- What was the motivation for building a unified project for providing a database interface on the data stored in Kafka?
- How is ksqlDB architected?
- If you were to rebuild the entire platform and its components from scratch today, what would you do differently?
- What is the workflow for an analyst or engineer to design and build an application on top of ksqlDB?
- What dialect of SQL is supported?
- What kinds of extensions or built in functions have been added to aid in the creation of streaming queries?
- How are table schemas defined and enforced?
- How do you handle schema migrations on active streams?
- Typically a database is considered a long term storage location for data, whereas Kafka is a streaming layer with a bounded amount of durable storage. What is a typical lifecycle of information in ksqlDB?
- Can you talk through an example architecture that might incorporate ksqlDB, including the source systems, applications that might interact with the data in transit, and any destination systems for long term persistence?
- What are some of the less obvious features of ksqlDB or capabilities that you think should be more widely publicized?
- What are some of the edge cases or potential pitfalls that users should be aware of as they are designing their streaming applications?
- What is involved in deploying and maintaining an installation of ksqlDB?
- What are some of the operational characteristics of the system that should be considered while planning an installation such as scaling factors, high availability, or potential bottlenecks in the architecture?
- When is ksqlDB the wrong choice?
- What are some of the most interesting/unexpected/innovative projects that you have seen built with ksqlDB?
- What are some of the most interesting/unexpected/challenging lessons that you have learned while working on ksqlDB?
- What is in store for the future of the project?
Contact Info
- @michaeldrogalis on Twitter
- michaeldrogalis on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- ksqlDB
- Confluent
- Erlang
- Onyx
- Apache Storm
- Stream Processing
- Kafka
- ksql
- Kafka Streams
- Pulsar
- Pulsar SQL
- PipelineDB
- Materialize
- Kafka Connect
- RocksDB
- Java Jar
- CLI == Command Line Interface
- PrestoDB
- ANSI SQL
- Pravega
- Eventual Consistency
- Confluent Cloud
- MySQL
- PostgreSQL
- GraphQL
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances.
Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. Are you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and auto scaling so you can focus on your exciting data projects. Your team will get the most complete, accurate, and ready to use behavioral web and mobile data delivered into your data warehouse, data lake, and real time data streams.
Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you're a listener for a special offer. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council.
Upcoming events include Strata Data in San Jose and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey. And today, I'm interviewing Michael Drogalis about ksqlDB, the open source streaming database layer for Kafka. So, Michael, can you start by introducing yourself?
[00:02:12] Unknown:
Yeah. Thanks for having me, Tobias. My name is Michael Drogalis. I'm a product manager at Confluent, where I work on stream processing, kind of the direction and strategy of both ksqlDB and Kafka Streams.
[00:02:24] Unknown:
And do you remember how you first got involved in the area of data management?
[00:02:28] Unknown:
Yeah. So it kinda goes back to college, actually. I think my first interest in distributed systems was in some of the papers around Erlang at the time, which weren't necessarily new but were kinda cool because they were new to me. And so I think kinda 1 thing led to another with different papers and areas of research. And a couple of years after I finished college, I was excited enough about stream processing, which was pretty nascent at the time in 2012 or 2013, that I started to build an open source streaming platform in Clojure named Onyx. And so I kinda worked on that for a couple years, which was pretty cool because, yeah, there wasn't a whole lot of motion going on in stream processing at the time. Storm was kinda new, but there wasn't much else.
And, you know, I ended up working on that for a number of years with a guy named Lucas Bradstreet. And, eventually, he and I ended up founding a business on top of that named Distributed Masonry, where we built what was effectively tiered storage for Kafka on top of Onyx. And a couple years later, Confluent acquired us, and that's kinda how we got to now.
[00:03:28] Unknown:
Now you're overseeing the development and product direction of ksqlDB. So can you start by just giving a bit of a description about what it is?
[00:03:36] Unknown:
Yeah. I think the best way to understand what it is is to take a look at the problem that we saw. And so, you know, we look at a lot of applications of streaming, and sort of the overall trend in the world is that everyone wants things to be more immediate and more real time, not just on a personal level, but you also see this in a more pronounced way in businesses. You know, the way that you shop and you bank and you hail rides is actually very different than it was, like, 10 years ago or so. And it kinda gives you this illusion that a company is, like, all hands on deck just for you. It's very, very special, and lots of businesses are trying to pivot to this model, which is effectively businesses trying to scale in a near unlimited way with their market by taking humans out of the loop. And so at Confluent, we think that the answer to doing this is events and stream processing. That's kind of what's powering this whole evolution of businesses being powered by software.
But the problem is actually pulling this off is really hard. If you look at what a typical event streaming architecture looks like, it's many distributed systems. Almost every time you have separate distributed systems for event capture, for event storage, for stream processing, and then you have, like, a database for taking your processed events and aggregates and, you know, storing them so you can answer queries for applications. And the problem is when you're trying to build 1 of these systems, you have these 3 to 5 separate distributed systems and they all have separate mental models for doing everything. They have different ways of securing and scaling and monitoring, and it's up to you as a developer to operate them as if they're 1.
But this is actually really hard in practice because, you know, you as a developer, every time you want to make a change you need to think in these 3 to 5 separate ways all at the same time, and it's just completely exhausting. I like to say it's like you're trying to build a car out of parts, but none of the manufacturers have talked to each other. And so we wanted this to be a lot simpler. We actually wanted everyone to be able to do event streaming much in the way that everyone is able to build a web application with Rails or microservices with Spring. And, you know, we understand that people sort of wanted to get back to something a bit simpler like they had with, you know, just the database and these 3 tier applications. And so that's kind of what led to us building ksqlDB. And so what it is is an event streaming database for building stream processing applications. It has sort of very few moving parts and has everything that you need to be able to build these applications. So it has the ability to connect to different data sources on the outside. It uses Kafka for storage, it has its own stream processing runtime, and it's able to serve queries to applications. And so it's very lightweight, and it sort of acts as a database for stream processing to get you something a lot simpler.
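As a rough sketch of what that surface area looks like in ksqlDB's SQL (the topic, stream, and column names here are illustrative, not from the episode):

```sql
-- Define a stream over a Kafka topic; Kafka is the storage layer.
CREATE STREAM rides (ride_id VARCHAR, city VARCHAR, fare DOUBLE)
  WITH (KAFKA_TOPIC = 'rides', VALUE_FORMAT = 'JSON', PARTITIONS = 1);

-- A persistent query: the built-in stream processing runtime keeps
-- this aggregate incrementally updated as new events arrive, and the
-- resulting table can be queried by applications.
CREATE TABLE fares_by_city AS
  SELECT city, SUM(fare) AS total_fare
  FROM rides
  GROUP BY city
  EMIT CHANGES;
```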
[00:06:17] Unknown:
And just to give some context about how long ago did this project get started, and what was the landscape like at the time as far as the availability of querying streaming data using SQL interfaces?
[00:06:30] Unknown:
I think the project itself is maybe 2 or 2 and a half years old. It actually predates my time at Confluent. And so, you know, early on, ksql, the predecessor to ksqlDB, was sort of this SQL dialect that just compiled to Kafka Streams. And that was actually pretty good at the time because people, like, in 2016, 2017, were still figuring out how do you do this stream processing thing. The APIs were pretty complicated. There were different sorts of APIs sort of being fostered in parallel to try to figure out, like, what's the best way to work with streaming and batch data at the same time. And so SQL was a nice improvement because it sort of got people away from the complications that were starting to evolve at the code layer. And so there wasn't a whole lot of innovation going on here. We saw this as a good incremental step in the right direction, but ksqlDB, we feel, is actually the right, you know, major step in the next direction, where it's kinda something categorically new that builds on what we already had.
[00:07:23] Unknown:
There are a few different products that are available now that, at least at surface level, are similar in terms of the capabilities to what ksqlDB is offering as far as being able to use a SQL interface on different streams of data. So notably thinking about the Pulsar SQL layer, or there's another product called PipelineDB, which I know got acquired, so I'm not sure what the current status of that is. Or most recently, there's a product that got launched called Materialize that offers the ability to generate streaming materialized views of information as it's being piped from relational databases using Debezium. So I'm wondering if you can just give a bit of a comparison in terms of the capabilities and primary focus of ksqlDB relative to some of those other projects.
[00:08:13] Unknown:
Yeah. I think it's good that everyone's realizing this is a problem. I think all these products actually signal that people realize how important stream processing is and how complicated it tends to be. 1 of the things that I really like about ksql, ksqlDB rather, is that it's built on Kafka Streams, which is a very battle tested stream processing runtime. And so we're able to take advantage of years of effort that's gone into reliability and fault tolerance and scalability. And so 1 of the things that we get for free with Kafka Streams is that if, you know, you have a stream processing topology and you terminate your query, or rather, you pause it or you stop it and then you bring it back up on another node, it's actually able to recover all of its intermediate state. And we didn't really have to do very much work at all in ksqlDB to make that work. That's actually just something that we inherit from Kafka Streams. And so there's a lot of history that we're building on. But I think the bigger thing is that we actually wanna unify, you know, our entire platform as 1. We've actually been working for quite a long time at different pieces of the puzzle to be able to do event streaming. And I think a good example of that is in the Kafka ecosystem, there's a component called Kafka Connect, which allows you to move data into and out of Kafka from a range of systems, and sort of gives you the framework to do this in an exactly once, fault tolerant manner. And so over the last year, Confluent has been working really, really hard to build up the range of connectors.
It's super important that if you're gonna use a system like this, you gotta have easy ways to get the data into and out of the system. And so we've built up now more than 100 connectors out of the box. I think we're almost at 110 at this point. And so when you use ksqlDB, we actually have support for connectors built right in, and not just 2 or 3 connectors, actually all the connectors. And so you can talk to a variety of databases or SaaS applications or cloud providers, kind of anything you can think of. We're just trying to make it really easy, and so we've put a lot of work into really leveraging the whole ecosystem. And so it's not as if we're just, like, laser focused on stream processing. We're actually very focused on making sure that we're taking advantage of all the other pieces that we're using. And that's kind of 1 of the ways that we're trying to make stream processing easier for everyone.
[00:10:16] Unknown:
In terms of the motivation for building this unified layer on top of the overall Kafka ecosystem, I'm curious how that has driven the priorities of building ksqlDB and where its areas of focus are, some of the edge cases that existed in trying to tie all these underlying systems together by yourself that ksqlDB is addressing, and some of the challenges that you faced in trying to provide an easy-to-use interface on top of all those different systems and tie them together into a cohesive whole?
[00:10:55] Unknown:
Yeah. I think the driving pain point is asking a simple question, like, can a developer do it? Can a developer build a stream processing application on a Friday night with a pizza, where you just want to, like, make something fun in a couple of hours? And the answer for the most part is no. You spend a couple of hours trying to, like, change a configuration that's supposed to affect, like, several different systems, and you just tear your hair out trying to make it work. I've had plenty of nights like this, and I'm actually, like, rather experienced with stream processing, and yet it's super hard for me. And so for someone who's a newcomer, how can this possibly be tractable? And so our motivation is something simple enough that you could trust it in a hackathon, and you know what the bar for that is like. You would never dare use something that is not completely understandable or reliable in a hackathon because your life will be really hard. And so we wanted something that was easy to understand, something that had far fewer moving parts. And another piece of this is when you start to integrate all these systems together, what you often wanna do is actually have 2 kinds of queries. You actually wanna have the database-like point in time lookup queries, where you're just asking a question and getting an answer, and then you want these streaming type queries that are powering these almost subscription-like behaviors in the real time apps that you use. And so when you put all these systems together, like, that's often the goal, is to actually unify these 2 pieces. And so with ksqlDB, we wanted 1 query language that could do both of them. So you're not actually trying to stratify these 2 different systems together and sort of, like, make this piecemeal architecture. We actually thought we could make an event streaming database that's able to do both of them natively, and that's kind of the rationale behind it. So this might be a good time to just talk about the overall
[00:12:31] Unknown:
architecture of ksqlDB and how it interacts with and interfaces with Kafka and some of the other components in the Confluent ecosystem.
[00:12:40] Unknown:
Yeah. So, as I was saying, it's built directly on top of Kafka Streams. And so what basically happens is we have the notion of persistent queries. And so you can ask a query, and every time some data comes in, we keep these views or tables sort of incrementally updated. When you take 1 of these queries, when you submit them to ksqlDB, we compile that directly into a Kafka Streams topology and run that on 1 of the nodes in the cluster of servers. And so that's actually just like running Kafka Streams natively. And then all the communication is actually happening through the Kafka brokers. And so you have the cluster of ksqlDB servers doing all the processing and the cluster of Kafka brokers doing the storage, and that's sort of how they line up. The big question that we get a lot is, how do you handle state? And we use an approach where we actually use embedded RocksDB on each of the ksqlDB instances, so as we're creating state, we're doing sort of a 2 phase thing, where we're keeping your state in storage locally on the disk through RocksDB for fast access, but it's transient. We actually store the entire change log that backs the table in the Kafka brokers, which are replicated and durable. And so if you lose a ksqlDB server, you can fail over to a different 1 that's able to replay the whole change log into the new server. We do some optimizations where sometimes we can actually skip that change log replay process, but that's sort of how it works at a fundamental level if you, you know, take all bets off the table.
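A couple of CLI statements make that mapping visible (the query ID below is illustrative; real IDs come from the SHOW QUERIES output):

```sql
-- List the persistent queries currently running on the cluster.
SHOW QUERIES;

-- EXPLAIN prints the underlying Kafka Streams topology that a
-- persistent query was compiled into, which is what actually
-- executes on the ksqlDB servers.
EXPLAIN CTAS_FARES_BY_CITY_0;
```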
[00:14:01] Unknown:
And you mentioned that the Kafka Connect libraries are built right into ksqlDB. So is that just part of the deployed binary once you get a ksqlDB instance running so that you don't have to manage those separately? Or is that still something that you would configure in terms of the Kafka layer and ksqlDB just takes advantage of those inputs and outputs from the system?
[00:14:22] Unknown:
That's the idea. So when you download ksqlDB, what you're able to do is actually add connectors to the class path. So connectors are JARs that you add to your class path. And once you have them on the class path, you can use some syntax in ksqlDB to say create source connector or create sink connector. And we offer that up in 2 ways. We think, like, 1 of the easier ways to use that out of the box is actually to configure nothing else. You put the connector on your class path and then you're just able to go. You're able to say, create this connector, and internally, ksqlDB will actually run the connector for you on its servers. And so there's no separate cluster. There's actually just 1 cluster doing both of these things for you. We do have lots of people who do high volume workloads. And once you get into sort of higher volume territory, you are very mindful of resource isolation. And so the same syntax, being able to say create source connector, actually works not only in this embedded mode, it also works on an externalized Connect cluster. And so we actually let you choose, and it's the same program for both. And so we think this is a pretty powerful way to let people go from beginner stages to sort of intermediate and then all the way to high volume mission critical use cases.
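For concreteness, the embedded connector syntax looks something like this sketch, assuming the connector JAR is already on the class path (the connector name and config values here are illustrative; the JDBC source connector's properties are documented by Confluent):

```sql
-- Runs on the ksqlDB servers themselves in embedded mode; pointing
-- the same statement at an external Connect cluster is a matter of
-- server configuration, not different syntax.
CREATE SOURCE CONNECTOR riders_jdbc WITH (
  'connector.class' = 'io.confluent.connect.jdbc.JdbcSourceConnector',
  'connection.url'  = 'jdbc:postgresql://localhost:5432/riders',
  'mode'            = 'incrementing',
  'incrementing.column.name' = 'id',
  'topic.prefix'    = 'jdbc_'
);
```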
[00:15:28] Unknown:
When you're dealing with multiple different systems that are all working together, even when they're designed to be consumed as an entire unit, there are some cases where you run into weird edge cases or design decisions that you might not have made if you were to redo the entire system from scratch. So if you were to just rebuild the entire platform today, primarily focused on the ksqlDB use case, what are some of the things that you think you would do differently?
[00:15:57] Unknown:
That's a great question. I think 1 thing we would have done is taken a much harder line about what things look like at the top level. I think if you read through our documentation, a very fair criticism is that we don't really make a call around how much Kafka you need to know. And so sometimes you see these low level implementation details about Kafka sort of bubbling up all the way through. A good example is configuring auto offset reset, which is, like, the lowest layer in the Kafka client libraries for whether you consume from the beginning or the end of the stream. Super important concept. It actually does need to surface in some form at the very top in ksqlDB so you can, you know, really know which end of the log to consume from. But we actually do it in the most low level way. And so I think trying to make a call about how to make that much more unified would have been better. You can probably make similar arguments about product seams around partitions. Partitions are very, very important to every layer of the stack. But sort of the way that we handled each layer can be a bit different and a bit confusing, and those are the challenges that we need to work through. I think we've done a pretty good job so far at making sure that you can sort of not get too much in the weeds when you use ksqlDB. But there are certainly cases here and there where, if we were to reconstruct it from the start, that would be something we'd keep a very careful eye towards.
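This is the leak being described: in the ksqlDB CLI, the low-level consumer setting surfaces as a session property:

```sql
-- Read from the beginning of the log for subsequent queries...
SET 'auto.offset.reset' = 'earliest';

-- ...or only from newly arriving events (the default).
SET 'auto.offset.reset' = 'latest';
```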
[00:17:03] Unknown:
And with having everything being vertically integrated and designed to interface together, it's definitely easier to have a system that is pleasant to use and easy to consume. But are there any common abstractions that you think that as an industry we can orient around where it would be possible to use something like ksqlDB as an interchangeable layer on top of other streaming systems?
[00:17:28] Unknown:
Yeah. That's an interesting question. And so the way that you communicate with ksqlDB in terms of its abstractions are streams and tables. And we think that that's probably the right way to model event streams in the long run. I think academia is still out on, like, what is the right abstraction for this and how do we get towards a unified model for doing all this. I think in the end, we'd actually be very happy if there was a standard for how you express these things, and it would just be a lot easier to sort of interchange these different components and systems. And so I think because streaming SQL is still so early, there's not really consensus on what that should look like, but it's something we're very much looking forward to.
[00:18:06] Unknown:
In terms of the workflow of actually using ksqlDB and building something on top of it, what is involved in actually designing and building and deploying an application that's going to use ksqlDB as the execution context?
[00:18:21] Unknown:
There's 3 kinds of queries in ksqlDB. There's persistent queries, which are the continuous queries that people sort of know when you talk about deploying a query as a stream processor, and that's like a server side query. And then there's 2 other kinds of queries, which are client side queries. They're push queries and pull queries, and they kinda take the perspective of an application. Sometimes you wanna pull the current answer to you on demand once. That's a pull query, like a database. And then there are push queries, when you wanna have a subscription to changes as events come in. That's a push query for us.
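Side by side, the 2 client side kinds look like this (the table and key are illustrative, carrying on the earlier sketch):

```sql
-- Pull query: a point-in-time lookup against a materialized table,
-- like asking a traditional database a question once.
SELECT total_fare FROM fares_by_city WHERE city = 'Lisbon';

-- Push query: the same shape of question, but EMIT CHANGES turns it
-- into a subscription that streams each new answer to the client.
SELECT total_fare FROM fares_by_city WHERE city = 'Lisbon' EMIT CHANGES;
```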
And so when you're designing a program, you're probably starting out with these persistent queries. And so you boot up a ksqlDB server locally and then you connect to it through a CLI. And this CLI is actually made to feel very similar to, like, MySQL or Postgres, where you're sort of having this interactive, REPL-like experience where you're creating streams, tweaking them, modifying them, tearing them down, really just trying to get it right and, like, banging on it until it's good. Kind of similar to Python in some sense. And as you get these things right, you'll actually just deploy them permanently. And so you'll send them off to a ksqlDB server, and it will take care of running these things reliably, even in the presence of faults. And so that's kind of the first piece, where you're figuring out, like, what are my core streams and tables and how do they connect together? And then the second piece is, how do I actually involve my application to leverage some of this data?
And today, that is where ksqlDB presents a REST API. And you would sort of design your system to communicate with that by issuing these queries where, you know, select from table where, you know, key is x, maybe give it to me in the streaming form, maybe give it to me just in a regular form, and you just build your app then. You sort of get the answers back and you live in your own application framework. And so we're trying to be very conscientious of not making ksqlDB an island. We actually wanna make it very easy for you to use the tools that you do today to build applications. And so that's sort of what it looks like in terms of how you put together a full application flow.
We are doing some work to make it really easy actually to integrate at the programming language level. And so in a normal database like Postgres, you actually just expect to be able to have, like, a driver or a library where you, you know, spin it up and give it your connection details, and then you have a nice, pleasant, you know, application level interface in, like, Java or Node or Ruby or whatever. And that's an area of work for us, where maybe this REST API continues to exist, but we actually have something a little bit more native for the programming language, where you feel even more at home in your favorite programming language.
[00:20:47] Unknown:
And another element of that interface is the dialect of SQL. And so I'm curious what you're targeting in ksqlDB as far as what constructs are supported, and what are some of the types of extensions or built in functions that you've added to simplify the work of operating in this streaming space? And what are some of the constructs that you've had to leave out because they make too many assumptions about the nature of the relational aspect of the database?
[00:21:16] Unknown:
We do have something a little nonstandard. I could be wrong on this, but I think the initial thought was, let's take the grammar for Presto and run with that. Because back when it was started, in 2016, maybe even late 2015 rather, there wasn't a whole lot of work into what the standard for this could look like. And so we do have something where it's similar to ANSI SQL, but with a few extensions. And so we do have some syntax around creating streams and tables where they sort of wrap normal SQL. It ends up being pretty easy to parse if you look at the nesting of these and the query bits. The query language itself is actually made to feel as much like ANSI SQL as possible, and that's something that we're trying to get to. I do think, as I said earlier, if we can actually standardize on this, it's better for everyone. The trouble is, though, that there's really no notion of what streaming SQL looks like. There has been some work in academia around a model where we kinda take SQL as we know it today and then we augment it with a clause like emit changes, which is essentially what we do. We've actually taken a step in that direction as well. And so we actually just wanna make it as unmodified as possible. And if you think about why that's good, you actually wanna take advantage of all the preexisting work that's gone into SQL. So you have lots of BI tools, lots of graphing tools that already speak SQL through JDBC. And so it's better for everyone, actually, if everyone's speaking the same language, because you can latch on to this ecosystem of the language that everyone is already speaking. And so, yeah, that's kinda where we're at with that. Another aspect of the SQL interface
[00:22:48] Unknown:
is the definition of tables and schemas and the migration of those schemas as they evolve and add new columns or rename things or change types. And so I'm wondering how that is defined in ksqlDB and the different layers in the Kafka ecosystem that operate together to enforce those constraints on the data as it's flowing through?
[00:23:12] Unknown:
So the way that you sort of express types in ksqlDB is through, yeah, like a schema language, where you'd say, like, create stream or table and then you say something that actually looks very similar to, like, MySQL or Postgres, where you list off your columns and their types. And so that's the top level layer of enforcement. You asked about how you evolve that over time. That's something that we're working on right now. It's a really interesting area of research, where it's by no means a solved problem, and there's different layers to this when you think about what goes into actually modifying not just a streaming SQL statement, but really, under the covers, a stream processing topology.
It kind of depends on what is the compatibility between the old statement and the new statement, and is there state. And so what we're working on right now, what we're likely to publish pretty soon, is a series of design changes sort of across the ecosystem to support this. Not just in ksqlDB, it would require changes in Kafka Streams and actually probably a few changes to Kafka itself, which are good for everyone. And so we're thinking about this idea of, like, a user controlled event, where you can sort of insert markers into the stream. And this is very early, but the way that we envision this working is that it's actually very invisible to you at the top if we get it right, where you can do some sort of a schema alteration or an evolution and say, you know, alter table add column, and what we could do under the covers is that we could actually spin up a parallel processing topology, have it churn through your data if necessary and actually, you know, run it with the new code, and then at just the right time cut over to the new 1. And so some people would probably call this, like, shadow processing, where you're running the new topology in the background and you're waiting for just the right moment to transactionally cut over in a safe manner.
This is how people tend to do it today. They actually do this by hand, or they do this with some amount of programmatic guardrails, and it's complicated. It's sort of the best way people can figure out to do replay, which is actually really important to getting the event streaming story right. And so we think we're on to a way to do this in a much more elegant manner, and 1 that would save a ton of incidental complexity.
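A sketch of the 2 pieces being described, with the caveat that the evolution half was still design work at the time of this conversation: the ALTER statement below is a hypothetical illustration of the kind of statement discussed, not settled syntax, and exact key and type declarations vary across ksqlDB versions:

```sql
-- The schema language: typed columns declared up front, much like
-- MySQL or Postgres.
CREATE TABLE users (
  id    VARCHAR PRIMARY KEY,
  name  VARCHAR,
  email VARCHAR
) WITH (KAFKA_TOPIC = 'users', VALUE_FORMAT = 'JSON');

-- Hypothetical evolution step: under the covers this could shadow
-- process the stream with a new topology and cut over once caught up.
ALTER TABLE users ADD COLUMN signup_date VARCHAR;
```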
[00:25:07] Unknown:
That sounds somewhat similar to a conversation that I had with 1 of the developers on the Pravega platform as far as how they handle things like updates in the schema, where they just treat that as a separate stream with different checkpoints. And then you can rely on those checkpoints and their coincidence with the checkpoints in the other streams for being able to enforce those changes in schema as you evolve the code, and being able to process older events based on how the schema is defined in terms of the other adjacent streams?
[00:25:41] Unknown:
Yeah. That is a similar approach, and it becomes a really interesting exercise to think of: how do you do this when you are in the presence of incompatible state? And so maybe I have an aggregation that looks 1 way today, and then tomorrow I want it to actually take a different shape. You know, you sort of get into this problem of, are these 2 states even compatible? Can I actually use any of the prior work that I did in the last stream processing topology in the new 1? And that is actually really unclear to us, and I think it's unclear to most people who are researching that today. But I think that would be a really cool area that comes out of the work we're doing.
[00:26:12] Unknown:
Another expectation that people have when they hear anything that has DB at the end of it is the ability to do things like joins or have transaction support. And so I'm curious what level of capability exists in ksqlDB at the moment for those types of operations.
[00:26:29] Unknown:
Right now, ksqlDB is an eventually consistent database, which kind of puts it out on its own. Databases are often transactional or have some level of even weak consistency guarantees. But eventually consistent databases actually are out there, and many people sort of rely on this today in the form of a key value store in S3. People sometimes forget this, but Amazon S3 is eventually consistent despite being, like, 1 of the most used data stores on the planet. It is something that we're actually looking at doing. I think a lot of people actually are able to get away with eventual consistency for streaming because their workloads do tend to be sharded anyway, but it's something that we're looking forward to improving over time.
[00:27:05] Unknown:
And then another question is the aspect of the life cycle of data as it exists within ksqlDB and how the streaming nature of it factors into the duration of storage that you might be able to rely on and being able to life cycle data out into other systems or have tiered storage or anything like that. And so I'm curious how that's handled in this ecosystem of the ksqlDB platform.
[00:27:32] Unknown:
This question comes up a lot with Kafka itself, and really the root of the question is, how much data can you put in Kafka? Because it's providing the primitive log for you, essentially providing the transaction log that you'd have in your normal database. And the answer is, kind of, it depends on what you're comfortable with. And so the classic example people tend to refer to is the New York Times, who stores all of their data for all their publications in a single topic, and that can work. That can work if you have the sizing right and you actually have enough capacity and you have the right volume to be able to do that. And so we sort of inherit this in ksqlDB. It's sort of like, what are you comfortable with? 1 of the upshots is that stream processing applications tend to have a bias to operate over a particular window of time. So they actually care more about the here and the now. They may care about things that are a week or a month old. But sort of the models that have been developed to handle this with windowing have this idea of, like, shedding their previous windows out to permanent storage. And so Kafka Streams actually can do this as well, where you're sort of, like, TTL-ing the window and you no longer keep it in storage. And so, yeah, I think ksqlDB sort of inherits this property.
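In ksqlDB terms, that bias toward the here and now shows up in the window clause. Something like this sketch (the sizes are illustrative, and the RETENTION option may vary by ksqlDB version) bounds how long old windows are kept in local state:

```sql
-- Hourly counts, with old windows shed from state after a week.
CREATE TABLE rides_per_hour AS
  SELECT city, COUNT(*) AS ride_count
  FROM rides
  WINDOW TUMBLING (SIZE 1 HOUR, RETENTION 7 DAYS)
  GROUP BY city
  EMIT CHANGES;
```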
1 of the more interesting things that we're working on, and I alluded to this at the beginning of the interview, is that my company had previously been working on tiered storage for Apache Kafka, and now we're here. And so 1 of the things that we rolled out in Confluent Platform, our enterprise product around Apache Kafka, is tiered storage for Kafka itself. And so we're really keen to be able to take advantage of that, not just at the user topic level, but actually also for intermediate topics and other places in our system. I think that's gonna unlock a huge number of use cases for Kafka over the long run, and it's a really exciting development in the ecosystem.
[00:29:11] Unknown:
Another aspect of the use cases for ksqlDB is where it sits in the overall ecosystem of someone's infrastructure or the overall data landscape, where I know that it has capabilities built in from the Connect platform for being able to consume data from other systems and deliver it to destination systems. But I'm wondering if you can just talk through an example architecture of somebody who has incorporated ksqlDB into their application life cycle and their data life cycle, and just some of the ways that it's being used and leveraged? Yeah. So I guess I'll answer the last part first. There's kind of 2 main use cases for ksqlDB.
[00:29:50] Unknown:
We see people building a lot of data pipelines with it. And then we see people doing this pattern of asynchronously materializing views. And when you put both of them together, it's actually a great design to be able to build an application on top of for stream processing. And so that's, at a technical level, really the use cases that we aim at. And in the field, we've seen some really cool applications of this. We've seen people build transactional caches in core banking. We've seen people do event driven microservices for, you know, human resources SaaS software. We've seen streaming ETL for ride hailing and digital music providers and travel and tourism. We've seen a whole bunch of different things. So if you take 1 of these, you can sort of think of maybe, like, an online marketing company who's using data in real time to, like, be able to sell things. And so maybe some source systems, like, you may be picking off messages from customers or calls or events from some, like, legacy platform.
We often see people putting their data initially into, like, SQL Server and capturing it with Debezium and then putting it into Kafka. And so where ksqlDB fits in is that you may actually wanna build an application on top of all this incoming data. And so you can create a connector for Debezium that will talk to SQL Server and move your data into Kafka, and then do some processing in ksqlDB, perhaps some aggregations. And then in your front end, you can actually use this unified query model for pushing and pulling data. And so you can ask a question about your data and get an answer back, or you can do the streaming variant of that and subscribe to all the answers. And then the final piece that we see a lot of, it's not just source connectors, it's also sink connectors. And so ksqlDB doesn't wanna be the only database. We actually wanna make it easier to use other databases too, and so that's another reason that connectors are so important. And so where ksqlDB's query model isn't quite right, you can spill your data out to something like Elasticsearch, for example, and do more indexing on it and keep it in there for the long term. And so we try to make it really easy to play with a variety of technologies. But, really, in the end, you're spanning source systems, moving data into Kafka, doing some processing, serving applications, and then pushing data out into other data stores for other kinds of indexing and processing.
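Stitched together, that example architecture might look roughly like this in ksqlDB's SQL. Everything here is a sketch: the connector names, topic names, and columns are invented for illustration, and the exact Debezium and Elasticsearch connector properties depend on the connector versions in use:

```sql
-- 1. Capture changes from SQL Server with Debezium into Kafka.
CREATE SOURCE CONNECTOR orders_cdc WITH (
  'connector.class'      = 'io.debezium.connector.sqlserver.SqlServerConnector',
  'database.hostname'    = 'sqlserver',
  'database.dbname'      = 'sales',
  'database.server.name' = 'sales'
);

-- 2. Declare a stream over the captured topic and keep an
--    aggregation incrementally updated.
CREATE STREAM orders (customer_id VARCHAR, amount DOUBLE)
  WITH (KAFKA_TOPIC = 'sales.dbo.orders', VALUE_FORMAT = 'JSON');

CREATE TABLE order_totals AS
  SELECT customer_id, SUM(amount) AS lifetime_value
  FROM orders
  GROUP BY customer_id
  EMIT CHANGES;

-- 3. Spill results out to Elasticsearch for richer indexing.
CREATE SINK CONNECTOR order_totals_es WITH (
  'connector.class' = 'io.confluent.connect.elasticsearch.ElasticsearchSinkConnector',
  'connection.url'  = 'http://elasticsearch:9200',
  'topics'          = 'ORDER_TOTALS'
);
```

The front end then issues pull or push queries against the ORDER_TOTALS table over the REST API, as described earlier.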
[00:31:56] Unknown:
And in terms of the capabilities and potential use cases for ksqlDB, what are some of the less obvious ones or some of the less publicized capabilities or features that you think should be more widely publicized and that people should be more aware of?
[00:32:13] Unknown:
I don't think people quite understand the power that this dual query model offers you. And so the whole idea that we wanna get to is that you can take the same query, ask a question, get an answer, but also just tack a clause on and it changes to get the streaming variation of that. I think it's not gonna be, like, a panacea. You're not gonna be able to ask arbitrary queries like that, but we are sort of working towards making this more and more expressive over time. And then I think the other piece, and we've been talking about it the whole time because I'm so excited about it, but I think it still has yet to reach its potential, is the ability to run connectors right inside of ksqlDB. This is huge. Every time I've worked on a system, just getting the data in, step 1, can be so painful.
Now we make it quite easy. There's a tool, Confluent Hub, to be able to grab connectors off of the internet, put them on your class path, and then there's simple syntax to run them. This alleviates a huge amount of pain that I've had working with stream processing over the years. And so I think that this is 1 of the most important things that we're doing.
[00:33:10] Unknown:
In terms of the design and implementation of the streaming applications, what are some of the potential edge cases or pitfalls that users should be aware of as they're designing and implementing their applications?
[00:33:23] Unknown:
I think there's 2 things. So the first is that ksqlDB is sort of an asynchronous system, and so it's good at asynchronously computing views, and it is eventually consistent. The thing that may surprise people is that we don't do read after write consistency yet. And so you wouldn't really wanna use this as, like, an OLTP store for your primary data. It's just not cut out for that. You actually are using this as a replica of your primary data to be able to serve your applications. And then the second piece is something that we're working on right now. We've just cut some improvements for this in our last release, and we're continuing to make it even better: the availability of the serving layer. We're bounded by our fail over time. And this goes down to the architecture of Kafka Streams, where you have several workers running tasks and they are asynchronously materializing topics themselves. And so when you do a failover, the secondary worker may not be all the way caught up. And so you actually may have reduced availability because of that. And so this is something, again, that we're working on, and we actually think we can improve this quite a bit over time. For somebody who is interested in deploying and maintaining an installation of ksqlDB,
[00:34:25] Unknown:
what are some of the operational considerations of the system that they should be thinking about while designing the deployment strategy? And what is actually involved in getting it up and running if they don't already have a Kafka platform deployed?
[00:34:42] Unknown:
So, to answer the last piece first: yeah, you need Kafka and you need ksqlDB to rock and roll with this. And so, with Kafka, that's sort of freely available on the Internet. You can get that in a variety of forms. Docker is a pretty common way to do that. Confluent Cloud is actually, like, another great way to get Kafka in a fully managed setting. So that is the first piece. You need Kafka to be able to do this for storage. And then the second piece is ksqlDB, which today we just ship in Docker. And you can run that in a standalone setting. You can run that in a clustered mode. But the idea is it sort of presents itself as a container, and you can link these things together. And then you have a series of remote processes that are available for taking queries over the REST API or just over, like, a local command line. And I forgot the rest of your question.
[00:35:26] Unknown:
Yeah. Also just curious about things like potential scaling limitations and issues with high availability and any potential bottlenecks that might exist in the deployment infrastructure for things like latency or just overall compute capacity?
[00:35:41] Unknown:
I think the biggest thing that I would advise people to think about is, like, what is the tenancy of your queries? And so this is a perennial problem with databases and data warehouses, where actually doing multi tenancy is extremely hard, and you may have 1 query that just, like, nukes the performance of others. And this is actually something we've chosen not to work on a whole lot, because we don't think we're gonna improve on the status quo very much when there's actually a lot of other interesting things to go after. And so you're gonna wanna think about, like, you know, what is the resource utilization of this set of queries, and try to group together the queries for a single use case on a single ksqlDB cluster. You don't really wanna load up all of your queries on 1 database. People have done that in the past with MySQL, where they use a single table for, you know, their online website and also for the analytics, and then their availability sucks because they're actually just, like, crushing their server all the time. And so it's kind of a similar metaphor, where you actually don't wanna put all your eggs in 1 basket, and you wanna localize the queries that are a little bit more performance sensitive.
[00:36:36] Unknown:
When is ksqlDB the wrong choice and somebody would be better suited going with just a traditional relational database or a data warehouse or some other type of streaming platform?
[00:36:47] Unknown:
I think there's probably 2 pieces there. So first of all, if you need strong consistency, today ksqlDB doesn't do that. And so with relational databases, people often need high degrees of consistency. And so turning towards, like, a traditional tried and true database like MySQL or Postgres is great a lot of the time. And another thing, and this is probably, like, an unusual answer to your question, is if you wanna do batch processing. We don't think that batch processing is completely dead. We actually think maybe in terms of implementation, streaming is probably the right way to power batch processing, because you actually end up solving a lot of similar problems, but batch processing has continued to be useful. And so 1 thing we try to advise people is, like, don't force fit your problem into streaming if it's not a streaming problem.
Many problems are streaming problems, but there's actually continued use for batch processing out there. And so don't try to fit a square peg in a round hole. It just doesn't always work. In terms of the use cases and usages of ksqlDB, what are some of the most interesting or unexpected or innovative projects that you've seen people build with it? 1 of the most interesting experiments that I've seen recently is someone doing anomaly detection on mainframes. And 1 of the reasons that I like this story so much is that they're taking really, really old technology and they're using really, really new technology to examine it. And so it's this interesting juxtaposition of taking things that I have never worked with in my life with things that we're just starting to come out with. And so it's this really curious combination of the new and the old.
[00:38:15] Unknown:
And what are some of the most interesting or unexpected or challenging lessons that you've learned in the process of working with and on ksqlDB?
[00:38:23] Unknown:
I would say that new categories are hard, and this is a lesson that I've learned several times over. And so when I was working on my stream processor very early in my career, stream processing maybe academically wasn't new, but in terms of industry, no 1 was really doing it. And so that was just confusing to explain. And then I learned this again with my company, Distributed Masonry, building a streaming native data warehouse, where tiered storage for Kafka, again, wasn't something that people were thinking about. So it was, like, new category, new kind of technology.
How do you explain it? And then here again, an event streaming database, it's like a new take on databases for a stream processing world. Trying to explain these categorically new things is extremely complicated, and so a lesson that I continually relearn is how to explain complicated things in a really simple way, and it is probably 1 that I will always relearn for the rest of my life.
[00:39:15] Unknown:
In terms of the product road map for ksqlDB and the future of the overall Kafka ecosystem in which it resides, what do you have planned?
[00:39:27] Unknown:
I think the main thing that we wanna do is just make it easier for engineers to be able to build these stream processing applications. And so we're laser focused on this. Our internal mantra is that we wanna give people 1 mental model for doing everything they need to build a stream processing app. And so, most immediately, we're focused on reworking this external API layer. I mentioned today we have a REST API. We actually want something a bit more suitable for the long term, so we're working on what a better, more robust interface would look like. And along with that, we're working on first class client support. So we're actually rolling out clients for Java and JavaScript, and so you can just plug and play really, really fast. And we think we can riff on this even further. And so clients are 1 way that you can expand your footprint. We think another thing that's super popular is GraphQL.
There's this front end world where everyone's doing event driven programming in, you know, React and Vue. And then in the back end world, we have Kafka, where everyone's doing event driven programming. And these 2 communities don't really talk to 1 another. And anyone who's, you know, in the tough spot of having to make these 2 things communicate has to build this awful middle layer. And so GraphQL has interestingly come on the scene, where it's providing this, you know, data backend agnostic layer for doing queries, not only for regular queries but also for subscriptions. And we looked at this and we said, this actually sounds very similar to what we're trying to do. And so we're looking at it in just sort of an experimental way. Like, what would it look like if you married these 2 things up? What would it look like if you had GraphQL support over your ksqlDB data, and hence your Kafka data?
And so we're really excited about that. We're gonna sort of put out some work there really soon and see what people think. And then the last major piece that we're working on, as I mentioned, is schema evolution. I'm really excited about that. That is challenging work that I think is gonna be very much rewarding. But we hope all these things make it easier for people to build stream processing applications in general.
[00:41:19] Unknown:
Are there any other aspects of the ksqlDB project or the ways that it's being used or the overall efforts going into streaming applications that we didn't discuss yet that you'd like to cover before we close out the show?
[00:41:33] Unknown:
Oh, I can't think of any. That's a good set of questions.
[00:41:37] Unknown:
Alright. Well, for anybody who wants to follow along with the work that you are doing or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today?
[00:41:52] Unknown:
I think the biggest thing is increased retention for streaming data. And so I firmly believe that tiered storage is gonna change the landscape of how people use Kafka. We really, really think that Kafka is a technology of similar importance to the relational database. And so it is of utmost importance that people can store more data in it for longer and cheaper. And I think that this is 1 of the bigger gaps that's about to be solved, and it's gonna be quite important over the next few years.
[00:42:21] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share your work and expertise on ksqlDB and the landscape of streaming applications. As you said, definitely a very important area for people to be exploring right now. And so it's great to see that there are people out there trying to make that easier to achieve. So thank you for all of your time and effort on that front, and I hope you enjoy the rest of your day. Thanks for having me. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways that it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to the Episode
Guest Introduction: Michael Drogalis
What is ksqlDB?
History and Evolution of ksqlDB
Developer Experience and Challenges
Architecture of ksqlDB
Workflow and Query Types in ksqlDB
SQL Interface and Extensions
Schema Evolution and Data Lifecycle
Use Cases and Applications of ksqlDB
Operational Considerations and Deployment
When Not to Use ksqlDB
Future Roadmap and Developments
Closing Thoughts and Contact Information