Combining Python And SQL To Build A PyData Warehouse - Episode 227

Summary

The ecosystem of tools and libraries in Python for data manipulation and analytics is truly impressive, and continues to grow. There are, however, gaps in their utility that can be filled by the capabilities of a data warehouse. In this episode Robert Hodges discusses how the PyData suite of tools can be paired with a data warehouse for an analytics pipeline that is more robust than either can provide on their own. This is a great introduction to what differentiates a data warehouse from a relational database and ways that you can think differently about running your analytical workloads for larger volumes of data.

Do you want to try out some of the tools and applications that you heard about on Podcast.__init__? Do you have a side project that you want to share with the world? Check out Linode at linode.com/podcastinit or use the code podcastinit2020 and get a $20 credit to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.


Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Taking a look at recent trends in the data science and analytics landscape, it’s becoming increasingly advantageous to have a deep understanding of both SQL and Python. A hybrid model of analytics can achieve a more harmonious relationship between the two languages. Read more about the Python and SQL Intersection in Analytics at mode.com/init. Specifically, we’re going to be focusing on their similarities, rather than their differences.
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host as usual is Tobias Macey and today I’m interviewing Robert Hodges about how the PyData ecosystem can play nicely with data warehouses

Interview

  • Introductions
  • How did you get introduced to Python?
  • To start with, can you give a quick overview of what a data warehouse is and how it differs from a "regular" database for anyone who isn’t familiar with them?
    • What are the cases where a data warehouse would be preferable and when are they the wrong choice?
  • What capabilities does a data warehouse add to the PyData ecosystem?
  • For someone who doesn’t yet have a warehouse, what are some of the differentiating factors among the systems that are available?
  • Once you have a data warehouse deployed, how does it get populated and how does Python fit into that workflow?
  • For an analyst or data scientist, how might they interact with the data warehouse and what tools would they use to do so?
  • What are some potential bottlenecks when dealing with the volumes of data that can be contained in a warehouse within Python?
    • What are some ways that you have found to scale beyond those bottlenecks?
  • How does the data warehouse fit into the workflow for a machine learning or artificial intelligence project?
  • What are some of the limitations of data warehouses in the context of the Python ecosystem?
  • What are some of the trends that you see going forward for the integration of the PyData stack with data warehouses?
    • What are some challenges that you anticipate the industry running into in the process?
  • What are some useful references that you would recommend for anyone who wants to dig deeper into this topic?

Keep In Touch

Picks

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:12
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app, or want to try a project you hear about on the show, you'll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models and running your continuous integration, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of this show. Taking a look at recent trends in the data science and analytics landscape, it's becoming increasingly advantageous to have a deep understanding of both SQL and Python. A hybrid model of analytics can achieve a more harmonious relationship between the two languages. Read more about the Python and SQL intersection in analytics at mode.com/init, that's I-N-I-T. Specifically, we'll be focusing on their similarities rather than their differences. And you listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O'Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to pythonpodcast.com/conferences today to learn more about these and other events, and take advantage of our partner discounts when you register. Your host, as usual, is Tobias Macey, and today I'm interviewing Robert Hodges about how the PyData ecosystem can play nicely with data warehouses. So Robert, can you start by introducing yourself?
Robert Hodges
0:02:09
Hi, my name is Robert Hodges, and I'm CEO of Altinity. We offer commercial support and software for ClickHouse, which is a popular open source data warehouse. Beyond that, I have a pretty long background in databases. I started working on databases in 1983, with a system called M204, and ClickHouse is now database number 20, I think, because I'm kind of losing count.
Tobias Macey
0:02:34
Yeah, after the first few, I'm sure you probably stopped bothering to keep exact track.
Robert Hodges
0:02:38
Well, every time I count, I keep finding ones that I forgot about. And it's definitely been very variable. For example, I worked for over 10 years with MySQL, then there were others like DB2 where I used it for a day and decided I hated
0:02:49
it. So it still counts.
Tobias Macey
0:02:51
And also, databases is in some ways a bit of a nebulous definition, where if you squint enough, different things can be considered databases that you might not think of at first blush.
Robert Hodges
0:03:01
Absolutely. And I think what's interesting over the last, say, 10 to 12 years is that the range of things that we properly consider databases has increased enormously, as people deal with different kinds of data, different amounts of data, and different problems that they're trying to solve when they analyze the data, which has driven a plethora of different approaches to databases.
Tobias Macey
0:03:23
And do you remember how you first got introduced to Python?
Robert Hodges
0:03:25
You know, that's a really good question. I think what it was, and I can't remember where it happened, is that I read an article a long time ago that said Python was so beautifully designed that you would be up and running and productive in about four hours. And so I thought, okay, that sounds good. And I went and tried it. And it did seem pretty easy to use, but I didn't program with it at the time. I got introduced to it at an industrial level at VMware, about two and a half years ago, where I was working on a project doing tools for one of the VMware cloud products. I was sort of the architect of an expert group that was developing open source Python APIs. And that was where I really learned how to use Python well, learned how to do things like actually properly program modules, do documentation, do setup.py, things like that. So that was really my full education.
Tobias Macey
0:04:15
And so for this conversation, can you start with giving a bit of a quick overview about what a data warehouse is, and some of the ways that it differs from a quote unquote regular database that somebody might think of, for anybody who isn't familiar with this space?
Robert Hodges
0:04:29
Sure. So I think that it helps to start this off by defining what we mean by a regular database, because as I've referred to, and as I think many people know, there are many types of databases. So when people think about a database, they commonly think about something like MySQL, which stores data in tables, which means that there's a set of columns that each table contains, and then the data are actually stored as rows. And this is the same data structure that you see when you work in Excel. So generally speaking, we call that kind of database a row-oriented store, which means that when you actually look at how the data are stored, on disk or on SSD as the case may be, as well as the way that they're accessed, we go get them by rows. And this type of database was one of the earliest that was developed. And it's very well optimized for things like managing data in a bank or for managing sessions on a website, because there are a lot of little updates, and we can go to the row, change it, put it back, and so on and so forth. Now, that kind of database has a problem, which is that as you get to very, very large amounts of data, the fact that it's efficient for updates leads to a trade-off that it's not particularly efficient for reading. And one of the reasons is that if you had, for example, a table in your database that contained taxi rides, and one of those fields was the number of people that rode in the taxi, and you wanted to just take the average, you would end up reading and pulling into memory every single one of the rows, just to read that one little tiny field. So as a result, data warehouses began to develop in the 90s. And they took a different approach to the way they organize data, which is that they organized it in columns. So you could have your billion row table, and your field which contains the number of people in the taxi would be stored in a single column. And the idea is that the data warehouse is, in this and many other ways, optimized to read data very, very efficiently. So if you want to do that average, what you're going to do is you're going to read down this column very quickly. And on top of that, the data warehouse typically has optimizations like that array of data, if you will, is going to be compressed, and the compression algorithms can be extremely efficient. So for example, you can get 200-fold compression on one of those columns, depending on what the values are. So it compresses down to a very small amount of data. Plus, we use very sophisticated processing techniques so that we can take advantage of what are called SIMD instructions, single instruction multiple data, that allow us to operate on multiple values at a time when these things are loaded into the CPU registers. So this type of database, which is optimized for reads, is now generally what we mean when we talk about data warehouses.
Tobias Macey
0:07:23
And there are some cases where people might end up using a row-oriented store as a data warehouse just because that's what they have close to hand. I know that I've seen it with things like Microsoft SQL Server, and I've heard of people using Postgres for that purpose if they have a small enough data set where they're able to perform their analytics, but the pattern in which they're using it is more read-oriented than write-oriented, so that they can run their analytics off of it without necessarily interrupting their application database.
Robert Hodges
0:07:56
Exactly. And in fact, what I started to notice perhaps 12 years ago, when I was working with customers, was that, I ran a business that was focused, as it turned out, on clustering for MySQL databases, and we had people that ran pretty big MySQL installations. But what you would see in their data sets was they might have 200 tables in a database, but when you actually went and looked at how big the tables were, there were probably one or two which turned out to be extremely large, sometimes containing hundreds of millions of rows, and then the rest of the tables would tail off real quickly, because they would be, you know, things like customers or currency values, or things like that, that turned out to be small numbers. So what has happened is that people have done exactly as you described: they start out using MySQL, Postgres, or Microsoft SQL Server, they get to the point where they have about 100 or 200 million rows, and that's about the point where it becomes very, very difficult to do these read operations on the row-oriented database. Not only does it disrupt the operational work that's going on, so for example trying to update the tables, but the other thing is it becomes extremely slow. And the difference in performance can be staggering. When you have 200 million rows in a MySQL table, the difference between the performance there and running in a data warehouse will be a factor of 100, sometimes even more in some cases.
Tobias Macey
0:09:14
And on the other side of what we're talking about today, there's the PyData ecosystem, which encompasses a large number of packages, but at the core of which most people will consider things such as NumPy, pandas, maybe the Jupyter notebook format, and probably things such as scikit-learn, or maybe some of the machine learning libraries. And I'm wondering, what are some of the cases where those libraries on their own aren't necessarily sufficient for being able to process data efficiently, or where you might need to lean on something such as a data warehouse in conjunction with the PyData stack for being able to achieve a particular project?
Robert Hodges
0:09:53
Yeah, that's a great question. So first of all, I want to say that these tools, the PyData ecosystem as a whole, is really inspiring, because it has such a wealth of operations that you can do on matrix- and vector-oriented data. Plus, it's coupled with really great machine learning as well as visualization tools. So it's a really, really great ecosystem. But I think the single biggest thing that I hear about is people complaining, hey, my data don't fit into memory anymore. And so what you end up seeing, at least in my experience, is that when you're running models in Python, you'll end up sampling, because you actually can't hold all the data in memory. So you just load a sample, say 10% or 1% or something like that, you train your model on that, or you run your model, or you do whatever operation that you're doing, you'll do it on a small fraction of the data, and the basic reason is you can't put it all in memory and process it at once. And I think this is happening for two reasons. One is that Python, bless its heart, does not have a particularly efficient way of representing data in memory. This is actually common to other systems like Java; it's not a Python problem per se. But it's just that when you materialize objects in memory, they're not necessarily using memory layout as efficiently as they can. The second thing, and I think this is maybe more important, is that by default, pandas wants to have everything in memory at once. So if you have a matrix, or a pandas DataFrame, it's going to want to have the whole thing in memory. And there's not a notion, for example, at least by default, of being able to stream through data. And so that can also be a problem when you're trying to deal with large amounts of information.
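As a rough illustration of the workarounds being described here, this is a minimal pandas sketch of streaming through a file in chunks and keeping only the columns and running totals you need, rather than holding the whole data set in memory (the file and column names are hypothetical):

    import pandas as pd

    # Read only the column we actually need rather than the whole table,
    # and stream through the file in fixed-size chunks instead of loading
    # it all at once, keeping just running totals in memory.
    total = 0
    rows = 0
    for chunk in pd.read_csv("taxi_rides.csv",
                             usecols=["passenger_count"],
                             chunksize=1_000_000):
        total += chunk["passenger_count"].sum()
        rows += len(chunk)

    print("average passengers per ride:", total / rows)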
Tobias Macey
0:11:38
Yeah, for a lot of these out-of-core systems, people might start to lean on things such as Dask or the Ray project for being able to scale across multiple machines and multiple cores, both for the memory issues that you were laying out, as well as some of the issues with dealing with embarrassingly parallel computations. And I'm curious how that plays into the overall use case for data warehouses, and some of the other ways that those broader elements in the ecosystem can also help in terms of managing larger data volumes, and also where the data warehouse can benefit.
Robert Hodges
0:12:11
Yeah, I think that's a great question, because I think what you see happening with projects like Dask, at one level you could say, hey, this is Python kind of rediscovering or re-implementing MapReduce, the notion that you can do distributed processing and you have a lot of storage across a bunch of nodes. But I think more particularly, what I see happening in the PyData ecosystem is that people are beginning to replicate in Python things that are already solved in data warehouses. And I'll give you a couple of examples. Data warehouses like Vertica or ClickHouse, the one that I operate on, are very good at spreading data across multiple nodes. They have the ability to break it up, we call this sharding, so break it up into pieces that can be spread across nodes, and then run the query in parts and join the results together, or aggregate the results together, before they return. Another thing we can do is we can replicate the data. So as the number of users on your system extends upwards, and you're beginning to ask more questions concurrently, the ability to have replicas allows you to scale the performance very efficiently, because different queries can run against different replicas. So I think what's happening is that as these systems like Dask start to emerge, it's time to ask the question, do we want to implement this entire thing in Python? Or do we want to actually go look in the SQL data warehouse, see what they've got there, and maybe begin to shift the boundaries a little bit between what we do inside the database and what we do in Python.
Tobias Macey
0:13:54
And in terms of the actual data warehouses themselves, you've already mentioned ClickHouse and Vertica. There are a number of other ones out there, such as Snowflake, that's a recent addition. And then there are also various things that you could potentially consider in terms of data warehousing, bordering the line with data lakes, such as the Hadoop ecosystem, or things that the Databricks folks are doing with Delta Lake. And I'm wondering, what are some of the main axes along which the different data warehouses try to differentiate themselves, and some of the considerations that somebody should be thinking about as they're trying to select a platform for moving forward with?
Robert Hodges
0:14:33
Sure.
0:14:34
I think we can definitely divide the market up into some important segments. So for example, you mentioned Snowflake, that's a cloud data warehouse, and there's a family of data warehouses that are really cloud native. They may be tied to a particular cloud implementation, which would be the case with things like BigQuery on Google Cloud, or Redshift on Amazon, or Snowflake, which can span clouds, but depends on the underlying cloud capabilities to work. Those data warehouses have great capabilities, they're well funded, and they have very good SQL implementations. One of the things that particularly Snowflake and BigQuery do very efficiently is they decouple the processing from the storage. So for example, one of the things I like about Snowflake and their processing model is that you can have a big business with a couple of organizations, and they can each do queries on the same data. The way that Snowflake handles this is they spin up what are called virtual data warehouses, which are the compute part. Each business unit will get their own virtual data warehouse, and they can go to town on this underlying data that they're reading without interfering with each other at all. So that's one class of data warehouse that I think is really important to look at. And where people tend to make choices in that direction, I think probably the single biggest factor is, has your business just decided we're going to be on Google? If it has, then there's probably a pretty strong argument for looking very closely at Google BigQuery, because you're already there and your data is stored in object storage. So that's one big class. I think another important class of data warehouse is the traditional data warehouses, of which Vertica and Microsoft SQL Server and Oracle all sort of play into this. I think the most interesting one is Vertica. That was a very innovative column store, they're doing some interesting things in terms of separating compute and storage, and they've now got a cloud version of it called Eon. So that's another one to look at. They have good capabilities; I think the trade-off there is they tend to be expensive to operate, and it's not just because they're proprietary software with expensive licensing, they tend to require pretty good hardware to run on. And then you have data warehouses like ClickHouse. ClickHouse is kind of an interesting case. It's open source, Apache licensed, and it's also more like MySQL in the sense that, even though it's a column store and has all these properties that make it run very fast for reads, it's very simple to operate. And it's also very portable across a lot of different environments. So for example, we run ClickHouse on everything from bare metal, since we have a lot of people who still run their own data centers or lease space to run their equipment, to Kubernetes, all the way to VMs running in the cloud. So these are three different classes of data warehouse, and I think depending on your use case, where your price points are, and what it is you're looking at doing, each of them has virtues and also drawbacks.
Tobias Macey
0:17:37
And I'll also just point out that you and one of your colleagues were also on my other podcast, talking a bit more deeply about ClickHouse, and I've also covered a number of these other technologies on that show as well, for anybody who wants to go further down the rabbit hole.
Robert Hodges
0:17:51
Exactly. That's a great rabbit hole, we like living down there. But it's nice to be, you know, sort of out in the open air for a little while this afternoon.
Tobias Macey
0:17:59
And so once somebody has selected a data warehouse and they've got it deployed, and now they're looking to actually populate it with data and be able to start integrating it into their overall workflow, what are some of the ways that the PyData ecosystem can help facilitate that process of getting data into it and making sure that all of the data is appropriately cleaned and the schema is matching appropriately? And then also in terms of ways to ensure that the schema is optimized for the types of analytics that you might be doing, and just some of the other considerations going into the design and implementation of the data warehouse?
Robert Hodges
0:18:37
Yeah, that's an interesting question. So I can give an example from our work on ClickHouse, which I think illustrates some of where Python fits in. A lot of data in the real world is in CSV, comma separated values. And it turns out that pandas has a pretty good ability to read CSV; it's relatively tolerant of different formats. And so one of the ways that I've actually seen customers using Python to help with ingest is that they will use pandas to read the CSV files, clean them up, write them to Parquet, and then we have the ability to ingest Parquet directly into ClickHouse. So that's an example of where Python is there to help in terms of the data cleanup up front. I think more generally, and this is not true of everybody, but in the systems where we see very large amounts of data being ingested, what happens is that Python kind of stays out of the way, because, for example, about half of the people that we work with ingest data from Kafka, and ClickHouse, like some other data warehouses, can actually read Kafka queues directly. So that's an example of where you don't actually need an intermediary, and if you want to do cleanup, you'll actually wait till it's in the data warehouse, and then you often use SQL to clean it up. And that's a very common pattern. The other way that data gets loaded is that you read it from object storage. So for example, if you use Redshift, it has this great command called COPY, which is used to read files, but it's been adapted in Redshift so that it can read data directly out of S3. So where PyData would fit in is that it might be the stuff up front, for example, scraping the data off other online systems. For example, I did an application with students from University of California Davis, where they built scrapers that would go search for prices for web resources, like EC2 on Amazon. They'd search for them on the web, put them into S3, and then we would read them into the data warehouse from there. So if PyData is involved, it's really at the front end of collecting the data and putting it into object storage. Those are two big ways that data gets into these systems at scale, and Python is sort of helping along the way.
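A minimal sketch of that pandas-to-Parquet hand-off, assuming pyarrow is installed for the Parquet write and that the file, table, and column names are hypothetical:

    import pandas as pd

    # Read a messy CSV, letting pandas tolerate the odd formatting.
    df = pd.read_csv("prices_raw.csv")

    # Example cleanup: drop incomplete rows and normalize a timestamp column.
    df = df.dropna(subset=["price"])
    df["collected_at"] = pd.to_datetime(df["collected_at"], utc=True)

    # Write Parquet, which the warehouse can ingest directly.
    df.to_parquet("prices_clean.parquet", index=False)

    # The Parquet file can then be loaded, for example with the ClickHouse CLI:
    #   clickhouse-client --query "INSERT INTO prices FORMAT Parquet" < prices_clean.parquet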
Tobias Macey
0:20:56
And on the other end, once the data has already been loaded and you're looking to do some sort of analysis, or maybe even train some machine learning models based on the data that's there, what are some of the ways that an analyst or a data scientist might interact with the data warehouse, and what tools would they use to do so?
Robert Hodges
0:21:11
Yeah, I think that in this particular case, I can really only speak to examples from ClickHouse, but I think these are relevant for other databases. So one of the simple ways that you can interact with the data warehouse is you just fire up the Python client and, you know, sort of build a simple application on top of it. When I start to analyze data, I typically start in a Jupyter notebook, and there are two drivers that I can use. One is called clickhouse-driver. That's a standard database client: you make a connection, you call a method where you supply a SQL command, you get back a result set, and it's usually one or two lines of code to pop that into pandas. So that's one way that you can get it. Another way is there's a SQLAlchemy driver. SQLAlchemy is implemented pretty widely across databases, and you'll use that with the %sql magic function, which will give you a result set, which again you can pop into pandas and begin operating on. So those are pretty typical ways to get data out of the data warehouse. I think where it gets interesting is where you start to explore, hey, can the data warehouse do more for me than just dump raw data? Can I actually do some of the things that I want to do in pandas, do them in the data warehouse, and save myself some time? That's where you begin to dig into, okay, what can I do in the data warehouse that's actually going to save me time in Python.
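As a minimal sketch of the first approach, using the clickhouse-driver package from a notebook (the host and table names here are hypothetical):

    import pandas as pd
    from clickhouse_driver import Client

    client = Client(host="localhost")

    # Run a SQL query and get the rows plus column metadata back.
    rows, columns = client.execute(
        "SELECT pickup_date, avg(passenger_count) AS avg_passengers "
        "FROM taxi_rides GROUP BY pickup_date ORDER BY pickup_date",
        with_column_types=True,
    )

    # One or two lines of code to pop the result set into pandas.
    df = pd.DataFrame(rows, columns=[name for name, _type in columns])
    print(df.head())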
Tobias Macey
0:22:45
And one of the potential issues that you might deal with, if you're just trying to, as you said, use the database as just a dumb storage layer and pull all the data back out into Python, is running into the memory bottlenecks that we referenced earlier. So what are some of the strategies for working around those bottlenecks, particularly if you're dealing with large volumes of data, and just ways that the data warehouse can help supplement the work that you would be doing within Python?
Robert Hodges
0:23:12
Yeah, I think when you start to frame it as, how do I work with the memory I have available, at that point you're asking the right question, because this is a problem that databases have fundamentally been occupied with since, well, at least as long as I've been working with them, which is to say decades. There are a couple of different ways that you can think about this. What the data warehouse does is it basically allows you to access data that is vastly larger than the amount of memory that you have, and there are at least three different ways that you can think about using this. For example, if you need to do queries that only need certain columns, the data warehouse is going to answer those queries for you very efficiently. So instead of thinking in terms of having everything in a pandas DataFrame that contains all of the columns and all the rows, just think in terms of having only the columns that you're working with, because you can get them very quickly out of the database, and if you need more, you can go back and ask for it. So thinking in terms of, let's go and isolate the columns we're working with, bring them in, and then operate on them in pandas, that's one thing we can do. Another thing that data warehouses can do very efficiently is down-sampling. SQL has this great feature called the materialized view, and the idea with a materialized view is that it's a different representation of the data that is designed to be either smaller or more efficient, because of the way it's sorted, for example, or the way that it's stored, than the original source data. So for example, if you're doing sampling off devices, or collecting prices in the market, this is essentially going to be time series data, and what you can do with a materialized view is you can down-sample it, so that instead of getting a data point for every time you do a measurement, you can reduce it to time segments, like 15-minute segments. This vastly reduces the amount of data that you collect. Moreover, with the materialized view, the database will basically do this down-sampling at the time you ingest the data; it does it once. So if you then go to the view and ask for the data, you're going to get it back really fast, and you're going to get a much, much smaller amount of data that you can then operate on. I think what happens in pandas, if you just work off files, which is the way some of the pipelines work, is you end up answering these questions again and again and again. So that's another important way that the data warehouses can help.
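A rough sketch of the kind of ClickHouse materialized view being described, created from Python with clickhouse-driver (the table, column names, and engine choice are hypothetical, and the exact DDL varies between warehouses and versions):

    from clickhouse_driver import Client

    client = Client(host="localhost")

    # Down-sample raw sensor readings into 15-minute buckets at ingest time.
    client.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS readings_15m
        ENGINE = SummingMergeTree
        ORDER BY (sensor_id, bucket)
        AS SELECT
            sensor_id,
            toStartOfFifteenMinutes(ts) AS bucket,
            sum(value) AS value_sum,
            count()    AS sample_count
        FROM readings
        GROUP BY sensor_id, bucket
    """)

    # Queries against the view touch the pre-aggregated, much smaller data set.
    rows = client.execute(
        "SELECT bucket, value_sum / sample_count AS avg_value "
        "FROM readings_15m WHERE sensor_id = 42 ORDER BY bucket"
    )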
Tobias Macey
0:25:49
And also, another Python tool that can help in terms of creating and maintaining those materialized views is the data build tool, or dbt, which will help in terms of ensuring that you have some measure of testing and consistency as far as processing the source data and creating the materialized views from it. And then also, once you have a materialized view, or even if you're just dealing with the table without doing any additional processing on it, one of the ways that it can help, from what you were referring to before as far as pulling in windowed time segments, is that you can actually let the database handle the cursor and feed chunks of memory at a time, instead of trying to implement that logic yourself in your Python code to say, okay, I'm only going to process this amount, then I need to shift this out and pull in the next bit. You can actually use the database as your working memory, pulling the piece that you actually care about for that particular computation and then putting it back into the data warehouse.
Robert Hodges
0:26:43
Absolutely correct. And in fact, I mentioned three ways that data warehouses can help, and that sort of gets to the third. Particularly when your data sets are very large, you can think of the raw data as existing in a window. For example, when you're collecting web logs, or perhaps temperature sensor data, the data that you collect that's newest is the most interesting, and what tends to happen is that after a certain period of time, that data becomes less interesting to keep around. So the data warehouse can do what you're describing at a very large scale, in the sense that you can put a time to live on data. This is a common feature in many data warehouses, so that your raw data will actually time out after some period of days or weeks, or whatever you choose, and it just goes away. The database does this automatically, so it's maintaining a window of the raw data that you can then jump in and look at without you having to do anything special, like having complex pipelines or a lot of logic. The other thing is, with materialized views and other techniques, you can down-sample, and you can keep those aggregates around for much longer periods of time. So when you combine these together, the database itself is kind of holding your data in an optimized way that in effect creates these different granularities of data that you can look at. And then finally, to the point that you were making, database connectivity APIs are very good at streaming data. So for example, in ClickHouse, with the streaming APIs in the wire protocol, you tend to get data in chunks. So if you write your Python code well, you can basically get a piece at a time, process it, throw the memory away, get the next piece, and so on and so forth. This kind of buffering of data is something that the connectivity APIs have been doing since the late 80s; databases are really good and very well optimized for this problem.
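A minimal sketch of that chunked, streaming style of access with clickhouse-driver, assuming its execute_iter method and a hypothetical table; the driver fetches blocks over the native protocol, so only a small window of rows sits in Python memory at any time:

    from clickhouse_driver import Client

    client = Client(host="localhost")

    # Stream rows in blocks instead of materializing the whole result set,
    # keeping only running totals in Python.
    total = 0
    count = 0
    for (passenger_count,) in client.execute_iter(
        "SELECT passenger_count FROM taxi_rides",
        settings={"max_block_size": 100_000},
    ):
        total += passenger_count
        count += 1

    print("average passengers per ride:", total / count)

    # On the warehouse side, the raw table itself can carry a time to live so
    # old rows age out automatically, for example a ClickHouse table defined
    # with "TTL ts + INTERVAL 30 DAY".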
Tobias Macey
0:28:43
And another case where the data scientist or the analyst might be leaning on Python is for being able to interact with tools such as TensorFlow or PyTorch for building machine learning models, which generally require a fairly substantial amount of training data. And I'm wondering how the data warehouse fits into that flow, versus some of the other ways that they might be consuming the training information for building the models, either in these deep learning neural networks or in some of the more traditional machine learning algorithms that they might be leaning on scikit-learn for.
Robert Hodges
0:29:14
I think that's the place where we actually see the biggest gap in the technology right now. If you look at the state of the art, particularly among the people that I speak with, who in some cases deal with very large data sets, what they're typically doing is pulling this data out into Spark and doing their machine learning there, or they're pulling it out and doing it in TensorFlow. So you're basically taking the raw data, or maybe down-sampled aggregates that are in the data warehouse, and you're just copying them out and running the machine learning on it. For example, training models, in which case you would just drop the data after you're done, or executing models, in which case you would score the data and maybe put it back in the database. So there's still a pretty big gap there. I think that as we go forward, I'm seeing two things which are really pretty interesting. One is that you're beginning to see the data warehouses actually put at least basic machine learning into the data warehouse itself. Google BigQuery is doing that: databases traditionally have a CREATE TABLE command, and they have a CREATE MODEL command, so you can begin to run basic machine learning models there. The other thing that's happening is we're starting to see the emergence of other ways of sharing data between data warehouses and machine learning systems. These are represented in projects like Arrow, which is a columnar memory format that is pretty accessible from Python. That's actually a project that's being driven by Wes McKinney, who's also the author of pandas. And the idea there is that we're going to have common memory formats to make transfer of data simpler, but also open up the possibility that we can communicate through things like shared memory, as opposed to having TCP/IP streaming to move the data. So I think those are a couple of interesting things that are happening, but it's still very, very basic, and I think there's a lot more that we need to do to move that forward.
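For a feel of the Arrow hand-off being described, here is a minimal pyarrow sketch; the data is made up, and the point is the shared columnar format rather than any particular warehouse integration:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"sensor_id": [1, 1, 2], "value": [0.5, 0.7, 0.9]})

    # Convert the pandas frame to an Arrow table: a columnar, language-neutral
    # in-memory format that other engines and ML tools can consume.
    table = pa.Table.from_pandas(df)

    # Arrow tables can be persisted as Parquet or handed to another process
    # without converting row by row over a database connection.
    pq.write_table(table, "sensor_values.parquet")

    # And read back into pandas when needed.
    df_again = pq.read_table("sensor_values.parquet").to_pandas()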
Tobias Macey
0:31:12
Another interesting development, too, is things like what Microsoft is doing with embedding the Python runtime into Microsoft SQL Server, to try and bring the compute directly in line with the data so that you don't have to deal with the wire transfer and serialization and deserialization yourself, right?
Robert Hodges
0:31:30
That's a really interesting development. In fact, that's one of the reasons why I'm very interested in Arrow. We call that kind of operation a UDF, a user defined function, and databases like Postgres have had this for a long time, the ability to hook in, for example, C routines, but also Java and Python and other things like that. I think the problem that I see with this is that if you just do it in a naive way, where for every row you go and call Python, it's just horribly inefficient, because the way that data warehouses operate on data efficiently is that everything's an array. They basically break up the array into pieces, they farm it out onto all the cores that are available, and they just go, you know, screaming through this data to process it as quickly as possible. If you have to turn around for every value and call something in Python, that doesn't work. Now, I haven't used the Microsoft SQL Server capabilities, but I think what we need is something that allows you to deal with data at the level of blocks. And that's where I think something like Arrow comes in, something that combines the capabilities of Arrow, where you can actually share the data formats, with the way that, for example, ClickHouse processes materialized views, where when we populate a materialized view, we don't actually just do one row at a time, we do thousands or hundreds of thousands of rows at a time. So you have to think in terms of processes that allow you to do these operations on very, very large amounts of data using kind of a streaming processing model that allows you to get to the data really quickly.
Tobias Macey
0:33:11
And so in addition to things like machine learning, where we don't have an excellent solution yet, what are some of the other limitations of data warehouses in the context of the Python data ecosystem, as far as being able to run analyses or some of the other data science workflows that somebody might be trying to perform?
Robert Hodges
0:33:31
Well, I think the biggest thing I see is there's just such a richness of things that you can do in NumPy and pandas that still aren't fully supported in data warehouses. Most mature data warehouses will do a pivot, but for example, ClickHouse doesn't do a full pivot the way that pandas does it, and pandas makes this so easy to do. Pandas, and Python in general, also has a wealth of statistical functions. There are certain databases, for example kdb+, a column-oriented store, sort of a data warehouse, that are renowned for the fact that they have a very rich set of functions, but in general the databases don't have the richness of statistical functions that you find in the PyData ecosystem. And I think that creates a problem, because if you look at what people are doing with machine learning and data science, it's fundamentally driven by statistics. So if you don't have those statistics in SQL, you actually can't farm work out to SQL very easily; you either have to have user defined functions, or people do it themselves, and as I mentioned, there are a lot of inefficiencies there. So I think this is a big gap. I think this is where we need to look over and see the good things that are being done in Python and actually pull more of that stuff into the database and make the SQL implementations richer.
Tobias Macey
0:34:58
And in terms of we're all trends in the industry and in the Python community, what are the some of the things that you see going forward that you're excited about as far as the tighter integration of the pie data stack and data warehouses and any challenges that you anticipate us running into as we try to proceed along that path?
Robert Hodges
0:35:17
Yeah, I think that moving models into the database is a very interesting development. I'm a little bit skeptical that databases are going to be able to do this as well as Python does, and the reason is that if you go to scikit-learn and you look at the models, they just have a raft of parameters that you actually need to be able to twist and turn to enable the model not to be overfitted, not to be underfitted, and to work effectively. The database implementations that I've seen have a very limited parameterization. So for example, with BigQuery, as far as I can tell, you give it the data to train the model and it kind of does it, but you don't actually know quite what it did to train the model; you don't have full access to the hyperparameters that you would want to adjust to make the model work well. I think that's a problem, and it's not clear to me that the data warehouses are going to solve this. I think one of the really interesting questions that we all have to work on is, how can we take advantage of the fact that there are very, very powerful machine learning and deep learning systems that exist outside the database? How can we combine them? So instead of thinking about this data being pushed into the database, instead think about how these systems can work together. For example, one of the ClickHouse users that I know, their biggest desire is to be able to take a row of data, pass it to a model for scoring, and write that data back into the database, all in a single operation. That's what we have to focus on, I think, to really join these two worlds together, it's problems like that, and we have to think about how to do it at scale. And I think there are some very interesting things that we can do if you have data spread across many nodes. In many ways, this is not unlike what we had in the Hadoop file system, where you had data spread across a bunch of disk spindles and you could send the processing down to go grab the data off those disks. So I think there are some things that we can do, but there's definitely some work to do to implement those ideas and really be able to join these two worlds together in an efficient way.
Tobias Macey
0:37:48
For anybody who wants to dig deeper into this space. What are some references that you found useful that you'd recommend?
Robert Hodges
0:37:55
That's an interesting question. So I'm sort of into academic papers. If you want to understand what a data warehouse is, and you're reasonably familiar with data, I think one of the quickest ways to get up to speed is to go read the C-Store paper, by Mike Stonebraker and a bunch of database stars; there are about 14 different people on the paper. It basically described what later became Vertica, so it was a column store. It built on things that had already been done in the 90s and then introduced a bunch of other interesting things. I think that sort of gives you a notion of how data warehouses work and the kinds of things that they can do. Beyond that, I think the simplest thing is to go try them out. ClickHouse, for example, is very easy to use: if you're running on Ubuntu, you say apt install clickhouse-server and it comes down. And there are other systems that you can try out. For example, Redshift on Amazon was really a complete groundbreaker in terms of ease of use, just being able to click a few buttons and have a data warehouse spin up, and BigQuery is the same way. And if you're on Azure, you can do similar things with the Microsoft equivalent based on Microsoft SQL Server. So I think just going and trying this stuff is probably the best thing to do, and just sort of beginning to understand how you can actually use these systems. They're very accessible at this point.
Tobias Macey
0:39:28
And are there any other aspects of the overall space of the PyData ecosystem and data warehouses in general that we didn't discuss yet that you'd like to cover before we close out the show?
Robert Hodges
0:39:38
I think another place where there's sort of an interesting long-term integration is how we deal with data warehouses and GPU integration. For example, one of the reasons you go to TensorFlow is that TensorFlow has a much more efficient processing model, with way, way more compute than we can get to when running on conventional hosts. Data warehouses today are optimized for I/O; they don't necessarily have GPU integration. So I think that's another interesting case where, if data warehouses begin to be able to take advantage of GPUs, that may open up some other interesting opportunities for adjusting the split of where processing happens. So that's another thing we're definitely looking at with a lot of interest.
Tobias Macey
0:40:22
Yeah, I know that there's been some movement in that space with things like Kinetica, which is a GPU-powered database, but I haven't looked closely at it myself to be able to give any deep details on that.
Robert Hodges
0:40:33
Right, there's a database called MapD, which I believe is now called OmniSci. That's a really interesting space. And I do want to say that, you know, I've been sort of focused on moving things into the database, but there's just as much that PyData people can learn from what's going on in the data warehouse, because I think what's happening in the data space is people are beginning to recapitulate solutions to problems that data warehouses solved a long time ago, you know, sort of how to distribute data, how to turn it into an embarrassingly parallel problem so you can get results very quickly. On the flip side, I think people in databases need to be looking very, very hard at the ease of use and just the wealth of operations that you can perform using the PyData ecosystem modules, you know, NumPy, pandas, Seaborn, scikit-learn. There's just so much stuff there that I think this is something that we can really learn from.
Tobias Macey
0:41:32
And for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose a book that I've been reading called Foundations for Architecting Data Solutions. It's an O'Reilly published book, and it's been great for just getting a high level overview about the things you need to be thinking about if you're trying to plan out a new data infrastructure. So I definitely recommend that for somebody who's interested in getting more into this space. And with that, I'll pass it to you, Robert. Do you have any picks this week?
Robert Hodges
0:42:03
I think my pick is going back and reading old papers, as I say, like the C-Store paper, which I read again preparing for the show. I love reading that paper; it's just a really, really great thing to read. Beyond that, the books that really interest me are things like Python Machine Learning by Sebastian Raschka. That's now in its second edition, which I just got a little while ago. It's not something you can read all at once; I just keep going back to it, and whenever I have time, I go look at what he has, sort of work the exercises, and just try to keep learning more and more stuff about how to deal with data in Python.
Tobias Macey
0:42:37
Well, I appreciate you taking the time today to join me and share your interest and experience in the cross section of data warehouses and the PyData ecosystem. It's definitely an interesting cross section and an area that I am excited to see more development in. So thank you for your time and all of your efforts on that front, and I hope you enjoy the rest of your day.
Robert Hodges
0:42:57
Yeah, thank you, Tobias. It's always great being on your show.
Tobias Macey
0:43:02
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.