Summary
The process of building and deploying machine learning projects requires a staggering number of systems and stakeholders to work in concert. In this episode Yaron Haviv, co-founder of Iguazio, discusses the complexities inherent to the process, as well as how he has worked to democratize the technologies necessary to make machine learning operations maintainable.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- Your host is Tobias Macey and today I’m interviewing Yaron Haviv about Iguazio, a platform for end to end automation of machine learning applications using MLOps principles.
Interview
- Introduction
- How did you get involved in the area of data science & analytics?
- Can you start by giving an overview of what Iguazio is and the story of how it got started?
- How would you characterize your target or typical customer?
- What are the biggest challenges that you see around building production grade workflows for machine learning?
- How does Iguazio help to address those complexities?
- For customers who have already invested in the technical and organizational capacity for data science and data engineering, how does Iguazio integrate with their environments?
- What are the responsibilities of a data engineer throughout the different stages of the lifecycle for a machine learning application?
- Can you describe how the Iguazio platform is architected?
- How has the design of the platform evolved since you first began working on it?
- How have the industry best practices around bringing machine learning to production changed?
- How do you approach testing/validation of machine learning applications and releasing them to production environments? (e.g. CI/CD)
- Once a model is in production, what are the types and sources of information that you collect to monitor their performance?
- What are the factors that contribute to model drift?
- What are the remaining gaps in the tooling or processes available for managing the lifecycle of machine learning projects?
- What are the most interesting, innovative, or unexpected ways that you have seen the Iguazio platform used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building and scaling the Iguazio platform and business?
- When is Iguazio the wrong choice?
- What do you have planned for the future of the platform?
Contact Info
- @yaronhaviv on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Iguazio
- MLOps
- Oracle Exadata
- SAP HANA
- Mellanox
- NVIDIA
- Multi-Model Database
- Nuclio
- MLRun
- Jupyter Notebook
- Pandas
- Scala
- Feature Imputing
- Feature Store
- Parquet
- Spark
- Apache Flink
- Apache Beam
- NLP (Natural Language Processing)
- Deep Learning
- BERT
- Airflow
- Dagster
- Kubeflow
- Argo
- AWS Step Functions
- Presto/Trino
- Dask
- Hadoop
- Sagemaker
- Tecton
- Seldon
- DataRobot
- RapidMiner
- H2O.ai
- Grafana
- Storey
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Your host is Tobias Macey. And today, I'm interviewing Yaron Haviv about Iguazio, a platform for end to end automation of machine learning applications using MLOps principles. So Yaron, can you start by introducing yourself?
[00:01:09] Unknown:
Hi, everyone. I'm Yaron. I'm CTO and co-founder of Iguazio. I've spent many years in high tech, at companies like Mellanox and others, with a lot of interest also in data analytics and data engineering, which is a good segue to this podcast.
[00:01:22] Unknown:
And do you remember how you first got involved in the area of data science and analytics?
[00:01:26] Unknown:
When I was at Mellanox, I was heading all the data center activities, the open source projects, you know, Hadoop, Linux, working with databases. We helped build things like Oracle Exadata and SAP HANA and other solutions. Mellanox is sort of an infrastructure player, and right now it's part of NVIDIA. My role was essentially to help optimize the performance of each one of those databases, the storage layers, etcetera. So at a certain point, we said, you know what, maybe we can do it better. And we formed the company. We brought some architects from other companies like IBM and EMC and others. And essentially, our initiative was to build something in the area of a multi-model database, essentially something that performs really, really fast, but also can do analytics on sort of historical data and real time data and all of that. And obviously, the key application for that is data science. You have people that need to do training on one end, but then inferencing on the other end. And one of the biggest challenges in data science is around data, and we'll probably get to talk about it.
[00:02:25] Unknown:
Can you give a bit more background about what it is that you've built at Iguazio and some of the story behind how it got started and your motivation for creating the company?
[00:02:40] Unknown:
Essentially, what we started with is sort of an extremely high performance database for those kinds of applications of machine learning and analytics. What we started with was essentially using traditional flash storage and getting the performance of in-memory databases, which we still do, you know, sub-millisecond latencies and things like that. So using that, you can essentially consolidate a lot of the different layers. You don't need to move things around. You just point to them with different references. And we just started exposing different APIs into that engine so people can work against it with Spark, with Pandas, with SQL, with NoSQL, with Presto, all the traditional tools.
So that helped you build machine learning pipelines. As things progressed, we needed a lot more flexibility for the computation part. So we started forming serverless engines that are very flexible but very high performance. You've probably heard the term serverless functions, like Amazon Lambda, etcetera, but those are really slow and they're stateless. So what we've created is some engines, one of them is called Nuclio, very popular, about 4,000 stars on GitHub, which is about 100 times faster than Amazon Lambda, and it's also stateful. So you can actually build transformations, you know, real time transformations for data, and you could build, like, inferencing for a model. You could take data from any type of source, whether it's HTTP, you know, JSON packets or Kafka streams and all that. It has native integration with Kafka and all that. And then the next layer that we built on top of it as we progressed was actually all the machine learning pipeline that goes along with it, like the serving layers, the model monitoring, the training, the ingestion and preparation of data before the training.
And so essentially, we created a stack whose key role is operationalizing the machine learning side. So most of the platforms on the data science side are really more about, okay, let's do some experiment and training and maybe we throw some models into, like, an HTTP endpoint. But what all of our customers essentially had to build is a real pipeline, which involves transformation of lots of data at scale and then doing training and maybe retraining. And then you also need to be able to build exactly the same data on the production serving pipeline attached to operational databases and all of that. So this is our focus: building operational pipelines. And we can talk later about some of the users and examples that we have.
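To make the Nuclio model described above concrete, here is a minimal sketch of a stateful Python function in the style of Nuclio's `init_context`/`handler` entry points. The dummy model, the payload shape, and the field names are illustrative assumptions and not part of Nuclio itself.

```python
import json

class _DummyModel:
    # Stand-in for a real model object; in practice you would load a trained
    # model from a file or model registry here.
    def predict(self, rows):
        return [sum(row) for row in rows]

def init_context(context):
    # Nuclio calls this once per worker at startup, which is where the
    # "stateful" part comes in: load the model (or open a connection)
    # and keep it on the context for reuse across events.
    context.user_data.model = _DummyModel()

def handler(context, event):
    # Called for every event, whether it arrived over HTTP or from a Kafka
    # trigger; event.body carries the raw payload.
    record = json.loads(event.body)
    prediction = context.user_data.model.predict([record["features"]])
    context.logger.info(f"scored event for user {record.get('user_id')}")
    return json.dumps({"prediction": prediction})
```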
[00:05:12] Unknown:
One of the interesting things about machine learning and data science is that there are drastic differences in terms of the scale at which they might be executed. So a hobbyist data scientist might pull down a static dataset that has the training data already labeled, and then they will, you know, build some sort of experiment to try and do some inference on it, like the New York taxicab dataset that's been used in a lot of examples, or MNIST.
[00:05:44] Unknown:
And we also have that in one of our demos.
[00:05:47] Unknown:
Exactly. Yeah. And then on the other side, you might have a large scale machine learning operation where you're pulling in real time clickstream data from a massive web property. So you're looking at, you know, thousands or millions of events on a daily basis. So that's a very different order of scale and magnitude and requirements for the pipelines.
[00:06:09] Unknown:
So this is exactly the challenge that we're addressing. Okay? That's the problem, you know. So a data scientist opens up the Jupyter notebook, he gets some CSV file that someone extracted for him, some data engineer or whatever. And maybe a few CSV files, like one from every database or table, and he starts playing around with it in his Jupyter. As it gets slightly larger, his Jupyter, like, is running out of memory, if you know the phenomenon. But now, assuming he needs to do exactly the same stuff at scale, what are his options? You know, now he needs to move from Pandas, you know, very nice Python semantics, to SQL and code against a data warehouse with, like, where and join and all that stuff. Or maybe he needs to start learning Scala or, you know, the variations of Spark with Python and things like that in order to build it. For some, that's not the road. They don't understand any of that because they're looking for something which is more Pythonic and more sort of abstract.
There are also a lot of transformations built for machine learning that are not really popular on sort of the database side, like feature imputing, you know, or scaling of features, encoding, and things like that, but that still require a lot of processing across lots of data. So this is really where we built our open source frameworks around our database, to be able to make it very transparent to the user, as if he's working with Pandas or Spark or whatever, but be able to process ten or more times the amount of data. And actually, instead of storing it in memory, we store it in flash, which is relatively cheap, and process the data. And we created an abstraction layer on top of it so we can really program this logic once and then use exactly the same abstract logic for the training, or like batch preparation, and for serving, like real time serving, leveraging the low latency access.
And this really eliminates a lot of those problems that you just spoke about.
[00:08:12] Unknown:
Yeah. That's definitely interesting because as you said, you know, if you go from a handful of CSV files to real time data streams, there's a massive learning gap in terms of the familiarity that any engineer or data scientist might have or the capacity that the organization might have for being able to deploy and support those types of technical requirements. If you can just talk a bit more about some of the ways that those challenges manifest as you go from theorizing about a problem, experimenting with a sampling of data to see, can you get anything interesting out of this, to then trying to bring it into production and some of the impedance mismatches that occur in that journey?
[00:08:54] Unknown:
One of the things that we've worked a lot on in the last year is something called the feature store. I don't know if you've heard the term. At Google, Uber, Netflix, and so on, the way that they're doing it is, you know, data engineers or even data scientists produce features. What is a feature? It's essentially processing a bunch of data elements and creating something meaningful. You know, think like you have a stream of data of, like, ticks or monetary transactions. What you really care about on the machine learning side is, like, what's the average transaction volume or how many products a customer typically buys in a day. That's a feature. And each one of those features, if you think data engineering wise, essentially can translate to, like, a SQL query and all of that, and you can build that. Some features are more complicated, like time oriented aggregations.
SQL is not the native language for those kinds of things. So in a feature store, you define the business logic to create the features. You place those features in a catalog. And then when the data scientist wants to train a model, he essentially says, you know what? I need the average age for users. You know, everything has a name, it's labeled. And behind it, it's hiding all the data transformation logic, all the analytics logic behind those features. So it goes and feeds that into the training. It runs the training, and then as you go into the operational aspect, this is where the real fun begins. You know? Before, what we said is, like, it goes to the warehouse, runs some SQL query, gets features, and does training. But when you go into the production pipeline, it looks pretty different. It's like web logs, it's HTTP APIs, you know, it's like accessing an external service to grab the data because you need the data right now, like listening on a click stream, which probably arrives using Kafka. But as you get the click stream and there's, like, a user ID, you may wanna know some data about the user, like gender, like when they first registered or whatever.
That means that you need to now go to a NoSQL database and join, you know, go fetch the record of the user and combine the two and so on. That makes it, you know, exponentially harder than just doing some queries on a data warehouse. And really, where the industry trend today is, is to build that abstract logic into something called the feature store, where you just define the sets of transformations. And there are machines, you know, essentially software programs, that translate that into two parallel universes. You know, one for a more batch kind of pipeline and the other one for a real time pipeline. This is really where we leverage the serverless technologies that we built.
Essentially, this abstract logic compiles into a serverless function, and that serverless function is running in real time. Like, take an example of processing a Kafka stream. And Nuclio has auto scaling features and integration with Kafka and rolling upgrades and lots of other capabilities, where you just need to focus on the business logic. So the feature store compiler, if you will, builds the pipeline into the serverless function, deploys the serverless function, and then the serverless function has a life of its own. It just listens on Kafka, will scale to fit the exact workload, write the results into our NoSQL or into, like, Parquet files for offline use, or into other forms, like to another stream, to inference on, or whatever is needed. So that level of abstraction is making the operations of machine learning significantly simpler.
Because then you essentially define those features, you put them in a catalog, someone just collects them into his application, and collecting them is not like a data catalog. It's not like just selecting a dataset. Essentially, the features hide behind them the entire data engineering pipeline. There's also a lot of computation when you say feature. It looks abstract, but the entire, like, stream processing or data analytics logic is hidden behind those objects.
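As a rough illustration of how a feature can hide the whole transformation pipeline behind a name, the sketch below is loosely modeled on the open source MLRun feature store API. Exact signatures, aggregation arguments, and generated feature names vary between versions, and the dataset, entity, and feature names here are made up, so treat it as a sketch of the idea rather than a copy-paste example.

```python
import pandas as pd
import mlrun.feature_store as fstore

# Sketch only: API details differ across MLRun versions.
# Define a feature set keyed by user_id, with a time-window aggregation;
# whether ingestion runs as batch or real time is decided at deploy time.
transactions = fstore.FeatureSet(
    "transactions", entities=[fstore.Entity("user_id")], timestamp_key="timestamp"
)
transactions.add_aggregation("amount", ["avg", "count"], ["1h", "1d"], period="10m")

df = pd.DataFrame({
    "user_id": ["joe", "joe", "jane"],
    "amount": [20.0, 35.0, 12.5],
    "timestamp": pd.to_datetime(
        ["2021-05-01 10:00", "2021-05-01 10:20", "2021-05-01 10:05"]
    ),
})
fstore.ingest(transactions, df)

# Training side: pull the named features as an offline dataframe.
vector = fstore.FeatureVector(
    "fraud-vector", ["transactions.amount_avg_1h", "transactions.amount_count_1d"]
)
train_df = fstore.get_offline_features(vector).to_dataframe()

# Serving side: the same feature names, but served from the online store.
service = fstore.get_online_feature_service(vector)
print(service.get([{"user_id": "joe"}]))
```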
[00:12:59] Unknown:
RudderStack's smart customer data pipeline is warehouse first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and Zendesk help you go beyond event streaming. With RudderStack, you can use all of your customer data to answer more difficult questions and send those insights to your whole customer data stack. Sign up for free at dataengineeringpodcast.com/rudder today.
In terms of actually digging more into how the Feature Store manages those transformations, I'm curious to understand some of the scaling challenges that exist around the definition of a feature and then sort of whether you compute it lazily or eagerly and making sure that you can optimize for the latencies from when somebody requests the value from a feature to when it's able to deliver it because you're mentioning that, you know, it might be a batch process or it might be a real time process or it might be trying to merge between both of those where it periodically computes the batch and then maintains a stateful stream computation or aggregation across that to layer on top of the batched data.
I'm also curious about some of the complexities that arise as you start to build out this large catalog of features, and any sort of usage patterns on the machine learning side too, as far as memoizing the return value so that you don't recompute it every time it gets called. Or, you know, you might use the feature at one level of the logic flow, and then you might need to pull it again with maybe a different parameter or even the same parameter, just making sure that you're not wasting compute time throughout the life cycle of the feature computation as well as the execution of the machine learning model in production.
[00:14:56] Unknown:
So what we've done, we support a bunch of engines. Most people that build a feature store support, like, Spark. Okay? And Spark has its limitations. Some of the ones that you mentioned are essentially limitations of how Spark works, and micro batching and all of that. And also, most feature stores, what they do, they essentially build the stream and they build the offline features. And essentially what they do, they copy the offline dataset into, like, a NoSQL data store periodically. So that essentially, by definition, creates a data skew. Because if you haven't copied, like, for five minutes, this record of user Joe into the key value store, that means that for five minutes, this data is inaccurate.
That's especially important for, like, time series and time window calculations, where if you build the feature in the key value store, say the average purchases over the last one hour, every 30 minutes, that means that when you go and read that data, it has a 30 minute skew. And if you keep updating it every two minutes, that means you're gonna kill the database or kill the computation. Okay? The way that we've addressed it, so first, we support all that as well, but we also built some of the logic into the database. Okay? Because the database is the thing that gets those updates, and it essentially calculates the results instead of responding with the value that's stored in a cell. It essentially has the equation built into the database, and it knows how to calculate things like time window aggregations and other things that have those problems of data skew. Now also remember where we started: we started by building a database that performs like an in-memory database but just runs on flash. So based on all benchmarks, we do, like, half a millisecond response time. So you can essentially get a feature vector in half a millisecond. We have a deployment in Samsung for, like, credit recommendation.
We enrich 10,000 features on a feature vector. It's data coming in, it's a recommendation added for a credit application where they have 70 different sources. Every source contributes, you know, dozens of features, to a total of about 10,000 features for an individual user. Now the deep learning model that needs to go and take a decision, what is the recommendation, needs to read 10,000 values on every transaction. When they initially worked in the lab, you know, the data scientists finished their work with the Jupyter, etcetera, and said, you know what? Let's go and run this thing. They managed to do, like, 6 requests per second. Okay? That was the number. The requirement for the application in a real business workload was about 6,000 requests per second. Okay? They're serving tens of millions of subscribers.
So the idea is probably not to take a thousand servers or a thousand databases and just replicate it. That's really where you have to have a mechanism that has much higher concurrency built in. It's the same challenge of someone building something in Jupyter and now having to move to something which is more like stream processing. You don't do a SQL query every time, not 6,000 times a second. You have to essentially build the results dynamically, doing all sorts of transformations on the data in flight, and store it in sort of more of a key value layout. So when you do a get, you get it in half a millisecond, not in, like, a second or more. And so this is where we shine: the ability to ingest tons of data and insert it extremely quickly, and exactly the same data can be used for training because we have sequential query capability.
Now all of that is wrapped in a new framework that we've built, which is not just specific to our database. We took the ideas from, like, Apache Beam and Flink and all of that. The key challenge with those frameworks is that they're very Java oriented. The data scientists, they like Python by nature, you know, Pandas and all of that. So what we built is an asynchronous engine, which is pretty much like Apache Beam, but it's native Python. It uses asyncio and some other acceleration techniques. It also works with the underlying serverless engine. What you do, you write Python business logic.
That Python business logic translates into something like Flink, but using this engine we call Storey, and it spreads itself across serverless microservices that do the actual execution of the transformation pipeline. And by the way, it's all open source and people can play around with it.
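For a feel of what Storey looks like from Python, here is a small in-process flow in the style of the examples in the storey repository. Step class names have shifted between releases (the source step in particular), so check the current docs before relying on this exact spelling.

```python
import storey

# Build a tiny flow: each step is plain Python, and the same style of graph
# can be spread across serverless functions for real-time workloads.
controller = storey.build_flow([
    storey.SyncEmitSource(),                           # entry point for emitted events
    storey.Map(lambda event: event * 2),               # a simple transformation
    storey.Filter(lambda event: event > 5),            # drop small values
    storey.Reduce(0, lambda acc, event: acc + event),  # aggregate what is left
]).run()

for value in range(10):
    controller.emit(value)

controller.terminate()
print(controller.await_termination())  # the final reduced result
```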
[00:19:29] Unknown:
That's definitely interesting. I actually spend a lot of my time in Python as well and actually have another podcast that covers that ecosystem. And as you mentioned, a lot of the heavy hitters in the streaming ecosystem are coming from the Java landscape. And there is that impedance mismatch between the interfaces that are available for Python developers to be able to execute against them and the ways that those frameworks represent data and process it under the hood. And so that's where a lot of the pain comes from for people that I've talked to who are using things like PySpark for trying to build their jobs, because the way that you think about problems in Python is very different than the way that you think about it in Spark, with Scala and Java.
[00:20:15] Unknown:
This really gives velocity to the data scientist and the data engineer. You know, they write some basic script, you know, and that script runs thousands of requests per second by just, like, deploying it. We've shown accelerations of 20x on running exactly the same business logic, translating it, compiling it into the serverless processing pipeline, 20x faster. There are some blogs that have been written about it. And also, by moving it to something more Pythonic by nature, you can incorporate all the things that the Python ecosystem developed. So the nice thing about our feature store, it's the only one that doesn't go just to machine learning. We support deep learning. We support NLP.
We have a financial customer that actually feeds the feature store with a PDF document. Okay? It extracts the text from the PDF document as another step in this microservice pipeline, it runs a sort of BERT model for text recognition and predictions, extracts entities, who are we talking about in that document, and all of that. And essentially, the final result is stored in a table. So you can build very interesting things if you leverage the ecosystem and you don't try to invent a parallel ecosystem. You just focus on the challenges of making things extremely concurrent and, you know, avoiding garbage collection, all those things that we do under the hood to make it faster than Java.
[00:21:41] Unknown:
Digging a bit more into some of the workflow around feature stores, particularly from the perspective of a data engineer, I'm interested in understanding how the transformation logic is defined and executed, and how the feature store might integrate with something like a Dagster or an Airflow for managing the full flow of the data pipeline, from collecting the source information through to building the transformations for the feature store and storing those values in that engine for the machine learning engineers and data scientists to be able to consume them?
[00:22:16] Unknown:
So you can use things like Airflow, but the cool thing about what we just described is that it's a federated DAG. A federated real time DAG. So it means that it's, you know, like the things I mentioned, like Flink, but it's not limited to real time. It can build batch workflows or real time workflows. We also work with Kubeflow and Argo, which are sort of the equivalent of another pipeline, which is more Kubernetes native. So we support both the real time pipeline engine that we developed, and that pipeline can actually also do batch, it's just how you trigger the steps within the graph, and the other one is Kubeflow. The advantage of those engines is they're more microservice native.
So the underlying thing in the graph is a serverless function. If you're familiar with, like, Amazon Step Functions, that's another sort of pipelining mechanism, where the individual steps in a Step Function are like a Lambda function or an Amazon service. So we built something similar, it's just, again, much higher performance. You can just connect those with streams. Between one step and another, you can actually have, like, a stream. Another thing that we've done differently than those frameworks that you mentioned: part of the problem in Airflow is that every step is its own process and potentially its own container or workload. So let's assume I just need a very simple transformation, you know, take the data, drop a column, and then maybe do some other things, you know, multiply by two, etcetera.
So I will create a very heavy workflow, a heavy pipeline. For each one of those transformations, I need to essentially go and bring up a process or a container and so on. So one of the things that we've done is that essentially, you can co-locate the different steps in the same process or spread them across processes, based on what the user actually needs. So you avoid serialization, deserialization, moving to JSON and back and all of that. You avoid bringing containers up and down. This is why it's extremely fast in, like, real time workloads. But it's also nice for batch workloads. No one said that you have to, you know, waste CPU energy for fun.
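The point about co-locating lightweight steps can be illustrated without any framework at all; the toy sketch below (not Iguazio's actual graph API, just hypothetical step names) shows why running small transformation steps in one process avoids per-step containers and the JSON serialization hops between them.

```python
# Hypothetical lightweight pipeline steps; each one is a plain function.
def drop_column(row, column="debug_info"):
    row = dict(row)
    row.pop(column, None)
    return row

def double_amount(row):
    row["amount"] = row["amount"] * 2
    return row

STEPS = [drop_column, double_amount]

def run_in_process(rows, steps=STEPS):
    # All steps share one process: no container startup per step and no
    # serialize/deserialize hop between them.
    for row in rows:
        for step in steps:
            row = step(row)
        yield row

events = [{"amount": 10, "debug_info": "x"}, {"amount": 3, "debug_info": "y"}]
print(list(run_in_process(events)))
```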
[00:24:27] Unknown:
Digging a bit more too into the sort of breakdown of responsibilities around the producers and consumers of a feature store, how have you typically seen the workflow orient itself? Are the data engineers generally the ones who end up defining the transformations for building out the features? Or is that something that is usually more self serve from the data scientist side, with the data engineers responsible just for setting up the data access and the initial sort of extract and load, or maybe some of the initial transformations for providing the data to the feature store to then do the feature extraction?
[00:25:08] Unknown:
I think you're raising a very important point, which is that in some organizations, there are significant silos. There's, like, the data science group, you know, the guys with the Jupyter and the CSVs, and there's the data engineer guy, and there's the DevOps guys, and the ML engineers now, and, like, each one works in his own silo. And when you create those boundaries, from my experience with the customers, the ones that are more advanced say, think of it as, okay, we have a business application to build and we have a multidisciplinary team that collaborates. Those are the ones that I see, like, generating projects, getting to production very quickly. The ones who are saying, you know what? I'm the data science group and I don't know how to do data engineering or I don't even know who the data engineer is. Those are the ones where I see the projects failing all the time. Because eventually, it has to be a collaboration of the team of data scientists, the DevOps, and the data engineer. Now, it needs to be in a way that, you know, the data scientist has to do some analytics.
Because some of the things that he's trying out are, you know what, let's try and aggregate this feature. Let's bring this data. Maybe I need to transform it this way, etcetera. Maybe I need to scale the feature and so on, in order to see what's making a bigger impact on the model. On the other end, there are things like ETL processes, you know, bringing the data from your Oracle or data warehouse and all that. Probably the data scientist is not gonna touch that part of the world. So they have to work in collaboration, where sometimes baseline features will be produced by the data engineer.
And sometimes, data scientists will actually do that work. So this is one critical point about feature stores: they have to be able to speak the two dialects. You know? On one end, the data engineers that may want Spark and, you know, all of that, or may want to do things that are more heavyweight. And also the data scientist that can just go and consume whatever the data engineer did, or maybe change it and build some new features. Some data engineering tasks that I mentioned before are very data science oriented, like imputing, which is essentially filling in missing values.
Sometimes the way that you do them, the filling in of missing values, has a lot of statistical logic built into it. Sometimes it actually runs a machine learning model to predict what values need to be substituted in. Okay? And those things require a collaboration between the two teams, and essentially having them work on the same platform. This is really one of the things that differentiates us: the fact that we do speak those two dialects. In our platform, we have a managed Spark, managed Presto, we have Dask, which sits between, you know, the data engineering and the data scientists. And we have, like, TensorFlow and scikit-learn and all of that. Compare that to, for example, Hadoop as a data engineering play that mostly caters to building data lakes and batch computation.
What we built is sort of the hybrid of the machine learning pipeline. So it's like training, model evaluation, serving, model monitoring, drift analysis, all of that, which are very data science oriented features, along with analytics, stream processing, ETLs, and so on.
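Since imputation comes up here as a task that sits between data engineering and data science, here is a small generic scikit-learn example (not Iguazio-specific, with made-up column values) showing both flavors mentioned above: a purely statistical fill and a model-based fill where missing values are predicted from the other columns.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 enables IterativeImputer
from sklearn.impute import IterativeImputer, SimpleImputer

df = pd.DataFrame({
    "age":    [25.0, np.nan, 40.0, 31.0],
    "income": [48_000.0, 52_000.0, np.nan, 61_000.0],
})

# Statistical imputation: fill each missing value with the column median.
simple = SimpleImputer(strategy="median")
filled_simple = pd.DataFrame(simple.fit_transform(df), columns=df.columns)

# Model-based imputation: each column with gaps is regressed on the others.
iterative = IterativeImputer(random_state=0)
filled_model = pd.DataFrame(iterative.fit_transform(df), columns=df.columns)

print(filled_simple)
print(filled_model)
```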
[00:28:26] Unknown:
And a couple of interesting things I wanna dig into out of that. One being the question of data lineage tracking and how that plays into the feature store, where that's one of the big trends that's been coming up a lot lately: just how do you make sure that you understand the overall data quality questions and where the information is coming from? That's also an important question to ask from the data science perspective: I have this feature, it's going to give me this output value, but where is it sourcing the data from? How is it making those computations, so that I can understand some of the ramifications that that might have in terms of what I'm inferring from the values that it's returning?
[00:29:09] Unknown:
Essentially, this is really one of the challenges with the platforms that are decoupled. Like, you know, you have other feature stores, companies that are focused just on the feature store. You have a company that's focused just on model serving. You have a company just focused on training, etcetera. And then how do you connect the dots? Like, if the training logic runs and generates a dataset, how do you track the lineage across that part? If you have serving that essentially does inferencing and makes a prediction, and that prediction has a value, don't you wanna know how you got to that predicted value? How do you keep the sort of lineage across all those different components, not to mention data governance, you know, quality policies, and all of that? So what we've done, we've worked quite a bit, I had a lot of meetings with Microsoft and Google and other companies, and we tried to create a model of what an end to end solution would look like. And we created a work group. And out of that work group, we wrote some documents and modeled the entire problem.
The stepchild of that is an open source project that we've created called MLRun. And the general idea was modeling each element in this pipeline. Also the data engineering parts, not just the data science parts. And there's a common thread across all the objects in that sort of metadata database. Everything is versioned, everything has lineage. And because it's sort of more of an end to end thing, like when you're running a job, it auto detects the process, the git commit of the code in the process, and so on. It essentially tags the job with all that information. So it's collecting information about a code snapshot, a configuration snapshot. So which scikit-learn version was I running with this experiment, for example.
The data lineage tracking: which data went into that process, which data went out of that process. And again, everything is versioned, including the data artifacts that go in and out of each one of those pipelines. So that allows you to create a more holistic solution. Because what I see, people just go and take, you know, like, Tecton and Seldon and Azure and SageMaker and all that, and they stitch their own solution. You find that 80% of the work is just creating the stitching, creating the glue across all those components, and, you know, different security paradigms, different ways of doing lineage for each one of those components, different ways of provisioning, different ways of doing upgrades.
So by having one framework that we've built for all of the operational pipeline, from data gathering to producing models and then monitoring the models, it's much more powerful and it requires less manual work to do each one of the steps.
[00:31:52] Unknown:
Another question that I was interested in digging into is, once you have the machine learning model built and you've got it deployed, what are some of the types of information that you track to understand how well the model is performing, when is it starting to go through model drift, how much drift is acceptable, and what are the necessary timelines from being able to say I've detected a statistically important level of drift in the model to then having the retrained model deployed into production, and just some of the trade offs that are necessary there? And how much of that responsibility lies on the data engineering and data platform team to be able to manage the production environment for that model, and how much time does the data science or machine learning team spend providing input to or actively monitoring some of those aspects of the machine learning project?
[00:32:49] Unknown:
Essentially, there are three different pipelines, logically. Okay? There's the research pipeline where we develop the model, you know, we throw some data into a data lake or Parquet files. We run some analytics. We prepare data. We do the training, and then we generate the model. There's a second pipeline, which is essentially, you know, we get more up to date operational data. We run stream processing. We do all the inferencing and so on. There's a third pipeline, which we call the monitoring or governance pipeline, where everything that was generated in the model serving is getting sent back into yet another stream, sort of an output stream that records every activity that went on with the model. What was the latency for inferencing? Which data came in? Which data went out? And so on. And that also needs analytics to understand that data. So drift, what is drift detection? Essentially, real time stream processing on the outputs of the model behavior. And in there, you also have some batch processing because some decisions or some insights are not, like, real time. You have to look at a bigger time window, and maybe you need to incorporate results that came, like, two hours later or two days later, and to evaluate those results versus the predicted results. So there's also some batch pipeline. So what we've done, we've combined all of those into a single thing. And the feature store is also an essential part of all of that, in our case; other feature stores don't do it. Because how do you monitor the quality of the model? Essentially, you have to look into two different things. One is what you could call data quality or data drift. Okay? So say I was training my model, and the statistical behavior of a feature, the number of clicks per hour, okay, was that on average you have, like, 10 clicks per hour per user, or maybe women this much and men this much, and so on. And based on that, I created the training and I created the model. What is a model? It's just a mathematical equation. You know, it just takes a bunch of numbers and generates an outcome. Now, if the data right now in production is different, the average number of clicks per hour in production is different, by definition, it means that my model is inaccurate.
It cannot make exactly the same prediction if the statistical behavior of the features going into it isn't the same between production and training. So that means the first mechanism for identifying drift is essentially in the feature store: when you ingest features, it essentially auto learns the features. It creates a statistical analysis of all the features that are going into the feature store and stores the statistical analysis of those features. In addition, in real time, it calculates the real time statistical analysis of those features using stream processing.
Okay? By listening, by tapping on the model serving containers. And then what you essentially do, you compare them in real time. You compare the offline feature statistical analysis that was created before training with the real time statistical analysis, and this is how you can create a drift indication for the data. And again, the way that we did it is integrated in the feature store. And that also goes back to my point: let's assume that you don't have your own model serving engine, you just come with a feature store from a vendor. So how would you integrate all this functionality that I just described across a bunch of different things? You have model serving companies, and you have, you know, model training companies. You have feature store companies and so on. So they all need to agree on the same metadata.
So the feature store would store the statistical analysis in a certain pattern, the training would generate it, and the serving would know how to generate a similar pattern, just in real time. And then there would be another, fourth product for drift analysis that would now need to know how to work against all those schemas. That's extremely hard if you don't build an integrated approach like we did. So in our case, the nice thing is it's actually glueless, because the minute that you define the feature, you've already defined the fact that the system needs to start gathering the statistical analysis, and it's using exactly the same stream processing technology I mentioned before in order to calculate the gaps and time windows and all of that. And that's for the data drift. There's also accuracy drift, which is slightly different, which means essentially we have to compare the results, not the incoming data, and not just the statistical pattern of the results, which is also important, but against the actual data. So let's assume I predicted the stock market is gonna go up and it went down. So after a few minutes, I know if my decision was correct or not.
So I can essentially go and do some time shift on the results and compare the predicted results with the shifted actual data, and I can also measure the exact accuracy of my model. So that's another way of monitoring.
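As a generic illustration of the data-drift comparison described here (not the platform's internal implementation), the sketch below compares a feature's training-time distribution with a live window using a two-sample Kolmogorov-Smirnov test; the feature, window sizes, and threshold are made up.

```python
import numpy as np
from scipy import stats

def drift_check(training_values, live_values, alpha=0.05):
    # Two-sample KS test between the training distribution of a feature
    # (e.g. clicks per hour) and the values observed in a production window.
    statistic, p_value = stats.ks_2samp(training_values, live_values)
    return {"statistic": statistic, "p_value": p_value, "drifted": p_value < alpha}

rng = np.random.default_rng(0)
train = rng.normal(loc=10, scale=2, size=5_000)  # distribution seen at training time
live = rng.normal(loc=13, scale=2, size=1_000)   # production window with a shifted mean

print(drift_check(train, live))  # flags drift because the mean has moved
```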
[00:37:53] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours or days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30 day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they'll send you a cool water flask. Another aspect of the statistical analysis of the models and understanding how they're supposed to be behaving is the question of testing and validation and propagating the model from the experimentation through to the training and validation stages and then into production, and just how you manage the overall continuous integration, testing, and continuous deployment process for machine learning projects?
[00:39:16] Unknown:
We're very oriented towards CI/CD pipelines, like in microservices. And we use either the traditional tools in the Kubernetes ecosystem, like Kubeflow, or others. Also, within all the frameworks that we've done, there is a way to do, like, git hooks and things like that. So essentially, we have a nice demo where you can just push a commit or a change in a data definition or training Python code or whatever. And as you push this commit to Git, it will essentially trigger the CI workflow. The CI workflow runs the machine learning pipeline and responds with the actual accuracy of the model based on the whole training pipeline.
It throws it back as a pull request comment on GitHub. So essentially, someone can actually see it, just like traditional CI. It tells you, okay, this is the accuracy of the new model that was trained. And then you can take a decision by just making a comment on your pull request saying, you know what? Go deploy it into a test cluster, or a specific model into a test cluster, and so on. And you can also create a review process, because in Git, you can say, you know what? The only guy that can approve a pull request is this guy. I know that there are some platforms that have their own, like, built in capabilities for that. We prefer something more standard, a more open approach, whether it's GitHub or GitLab or other Git services. We believe that's the right approach to build your automation around.
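The "report accuracy back to the pull request" step can be done with nothing more than GitHub's REST API for issue comments; the snippet below is a generic sketch of that one step (the repository name, PR number, token variable, and threshold are made up, and this is not Iguazio's own mechanism).

```python
import json
import os
import urllib.request

def post_accuracy_comment(repo, pr_number, accuracy, threshold=0.90):
    # Post the trained model's accuracy as a comment on the pull request,
    # using GitHub's issues/comments endpoint (PRs share issue numbers).
    verdict = "meets threshold" if accuracy >= threshold else "below threshold"
    payload = {"body": f"Model trained. Accuracy: {accuracy:.3f} ({verdict})."}
    request = urllib.request.Request(
        f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# Example (hypothetical repository and PR number):
# post_accuracy_comment("my-org/ml-project", 42, accuracy=0.934)
```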
[00:40:48] Unknown:
We've spent a lot of time talking about a lot of the problems that Iguazio solves and, at a high level, discussed some of the architecture. But can you dig a bit deeper into how the overall Iguazio platform is architected and some of the integration points and extension points that are available for bringing Iguazio into a company's existing data platform?
[00:41:10] Unknown:
Our key focus, we don't necessarily have to do everything in the pipeline. Our key focus is the key challenge that most people are facing: operationalizing machine learning. You hear those things like 80% of machine learning projects fail because of, you know, how difficult it is to make them operational. As we talked about at the start of the show, you know, the guy with the Jupyter that plays around with the CSV, and then he needs to make it something that actually works in a real pipeline. They don't have any clue, you know. More knowledgeable organizations understand that they need to now bring Kafka in and bring stream processing and automation and Spark and so on.
But many organizations have some junior data scientists that just finished school. They don't have any clue on how to build it. So this is the area that we're focused on: how to accelerate this project, the process of, okay, you played around with your Jupyter, now let's see how we scale this. Now, you can choose to work with AutoML tools like DataRobot. We also have strong partnerships with Microsoft Azure, with SageMaker, and so on. You can choose to do the training using AutoML, but the biggest challenge is creating those operational pipelines. And 80% of the challenge we just spoke about is about data. You know, data engineering, real time analysis, monitoring of data, statistical analysis of data, and so on. Those are the things that most data scientists and data science groups only tackle when someone starts shouting, why don't we do anything productive for the company with all this data science group that we just created?
This is why we also appeal to data scientists, because we help them translate what they built in the lab, with minimal effort, into something that actually yields value to the business, a recommendation engine or predictive maintenance logic. And the way that we work: first, much of our technology is open source. It's standardized around standards; Kubernetes is the cluster framework that we work on top of. We work with all data sources. And we also have our own real time database that is very essential for the process, but you don't have to use it. It's not the data lake. It's, like, this thing that allows very quick shuffling and transformations on data.
And we have all those serverless engines on top, and an automation layer, in order to streamline the process and eliminate 80% of the work or even more. You know, we have customers that told me, you know what? Last year, we got a single pipeline to production. We've used you guys for, like, six months, and we've already produced four projects and productized them, because that gives them the velocity, all this automation that we built. And by the way, you know, people don't have to buy our product. We're happy when people work with some of our open source technologies and like to adopt them. We know that tier one cloud providers are using some of our technologies. We're good with that as well.
[00:44:06] Unknown:
With all of the experience that you've built up working in the machine learning space and helping to provide the tooling and automation for companies to be able to bring their ideas into production, what are some of the gaps that you see still remaining in the overall tooling or processes, whether in terms of the development workflows or organizational aspects, that are still open questions and still need to be addressed for managing the overall life cycle of machine learning projects?
[00:44:36] Unknown:
I don't think there are major things that are missing. I think, you know, now feature stores are becoming more and more mainstream. Many people still don't understand, maybe, the difference between, like, a feature store and a data catalog. I think we tried to make it clear here that a feature store is also about the automated transformation and the ability to do real time and offline. That relieves a lot of the operational aspects. It's not like a catalog where you just, you know, register offline datasets. So I think this is gonna solve a lot of the challenges.
On the machine learning side, I think AutoML actually makes it simple even for data engineers to go into data science, if you think about it. Because, you know, a lot of the black magic was, okay, do I use XGBoost or do I use, like, some other algorithm, and the right parameters to use, and so on. But with all those AutoML tools and products and open source projects, now it's sort of becoming a commodity. So the real number one challenge that people have, and I keep on hearing that 80% of it, is how do we select the features, the best features, how do we produce the best features? And then on the other side, how do we track the behavior of the model, when do we retrain, and all those, you know, operational challenges? I don't think there are significant missing tools, maybe slightly on the, like, deep learning side.
It's still complicated to build neural networks and all of that, and there's still a lot of voodoo over there. But I think the biggest challenge is having a more integrated approach. How do you choose the right tools out of this jungle? You know, there are probably thousands of products and open source projects that do machine learning. How do you create an ecosystem out of all of that that talks to each other and fills all those gaps? That's a lot of what we're trying to do, whether it's using our open source projects or integrating our open source projects with the best of breed tools in the ecosystem, or the best of breed managed cloud services in the ecosystem. So trying to reduce as much as possible the friction of building a solution. I think that's the key challenge: how do you reduce the friction of building an operational pipeline?
And some of it has to do with politics. Because, again, I told you before, I see that the successful companies we work with are the ones that broke the barriers. There are no silos. It's like a team, you know, and the data scientist can talk to the data engineer across the hall and ask him for a favor. And then when he finishes, he talks to the DevOps guys and they bring up the container. And one of the challenges, for example, is throwing stuff over the wall. The data scientist writes some code in Jupyter, and then someone needs to go and transform it to, like, a Docker container that runs on Kubernetes. Essentially, someone just takes the code, probably throws away 90% of it, and recodes it. But if they work in collaboration, for example with some of the mechanisms we offer, and others offer, essentially the data scientist writes some code, maybe puts in some hooks for exception handling, and the data engineering and DevOps guys improve that. Now let's assume we need to tweak something in the code: instead of redoing the entire ceremony, the data scientist can go into the code that's already, like, ready for production and change a parameter or add a parameter, not necessarily redoing the whole thing.
We have to be very conscious of people's time and create more of a continuous deployment and integration workflow in collaboration across those three personas.
[00:48:12] Unknown:
And in terms of the ways that you've seen the Iguazio platform being used and some of the different open source projects that you've built, what are some of the most interesting or innovative or unexpected projects that you've seen built with your technology?
[00:48:25] Unknown:
So actually, one cool project that we engaged with throughout the COVID time was one of the hospitals in the Middle East, which essentially used us for real time patient deterioration analysis. So essentially, you think, like, for every patient, you have lots of different datasets. You have the medical record, you know, he has certain diseases, male, female, age, and so on. Then you have all sorts of measurements. Like, yesterday, you took a blood sample and that was the result, or you took some, I don't know, urine test or whatever. Those things are being done periodically, on a sort of frequent basis. Just like, yesterday you did this, and a week ago you did the other thing. And there are also, like, seven different datasets. One of the others, the real time dataset, is the bed sensor data. So the real time heart rate, movement sensors, and there are lots of sensors in the patient's location.
So you analyze all those different things: the medical records, the measurements of his samples, you know, the real time data coming from the bed, potentially also things like x-rays, which, if taken, require deep learning, and you incorporate it into a real time decision. Okay, this patient must get this treatment pretty urgently, and you actually throw an alert to the doctor that he has to go and treat the patient, as well as showing a nice dashboard with, like, reds and greens, showing which patient needs to get which treatment with what importance. So think, like, prediction of the deterioration of patients.
So that's one pretty interesting application, and also an important one.
[00:50:02] Unknown:
Yeah. It's definitely very cool. It's always great when you're able to see the tools that you build being used to help people. In terms of your experience building the Iguazio platform and the open source technologies that you use and produce, and helping to scale the business around it, what are some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:50:25] Unknown:
I think one of the challenges that we keep on facing, you know, one of the key values that we're bringing is automation. Automation in order to save time for people, to go faster to production. Where we sometimes see challenges is internal teams, especially when you go on prem. Because in the cloud, people are very much oriented toward, let's go and grab some pre baked service and use it. But especially with IT, which is sort of an oxymoron, because usually the more modern, the sharper ones are the ones working on the cloud with all the latest and greatest. Sometimes on premises, you don't have, like, the best technologies, and you're working with internal IT and all of that.
But for some reason, a lot of organizations want to invent things by themselves. And they'll say, no, we can build it in a week. Now, you know they can't build it in a week. We have, like, 80 people and we've been working on it for a few years. You know, it's not something you can build in a week. Or sometimes you have consultants that come and say, you know what? We're gonna build some stack and we're gonna build some TensorFlow for you and we'll build everything. And then you ask, okay, but, you know, TensorFlow changes every few weeks and you'll need to modify your containers. Once the consultant goes away, how are you going to keep on maintaining that work? So you'd rather have a product versus doing things yourself or with some third party consultants.
That reminds me of speaking to a CIO of one of the insurance companies that we work with, talking about consultants in general. He was saying, you know what? Those consultants, they interview us. They ask us what's in the data, how is the data used, you know, all those questions about the data. We waste tons of energy. We explain to them how it works and all that, what kind of aggregations we need to do, what's the business logic, and all of that. And then they go run some scikit-learn on all the things that we've done. I'd rather just get the data scientists. Instead of, you know, teaching the consultants and then paying them, and then they leave and we're sort of stuck with something we don't even know how to maintain, I'd rather just bring in a couple of data scientists and have some automation built for them so they can produce this logic by themselves. Because the biggest thing you need is to understand the business logic of your business.
You know? If you don't understand the business logic of your business, you can't really build a solution, because you always have analysts that have a gut feeling about, you know, what are the trends that you need to monitor, things like that. Those are things that, if you bring in a consultant, mean that you have to go and teach the consultants all this business knowledge that you've gathered throughout the years. Yeah. So this is really our role. We also have data scientists in the company, but their role is essentially... our data scientists are very oriented towards operations.
They also have data engineering skills. They usually have DevOps skills. And their role is to work with the customer's data scientists and empower them and teach them how to move away from sort of a Jupyter concept and move to Python, do some unit tests, and build things in a more reproducible way, and things like that. And then experiment tracking and other things potentially needed to make it something that you can just, like, throw into a machine that will retrain itself, will run at scale, and all of that. So what we're trying to do is take organizations and teach them how to, you know, walk and run instead of doing it for them. Giving them the fishing net, you know, instead of the fish.
[00:53:49] Unknown:
For people who are looking to be able to build these machine learning pipelines, they wanna be able to get their models into production. What are the cases where Iguazio is the wrong choice?
[00:54:01] Unknown:
So if it's, like, an entirely batch oriented flow and also small scale data, if Jupyter is good for you or you don't have an operational pipeline... there are a lot of companies that, you know what, they use a machine learning model and they do some batch prediction and they create some fancy report for the CEO to show that we could have done something differently. Those are probably not a good fit for us. What we're focusing on is people that build applications around machine learning. More operational, more online, you know, things like recommendation engines, fraud analysis. I gave the example of the hospital, you know, the patient monitoring.
We do predictive infrastructure monitoring at huge scale, running essentially all the network appliances, all of NetApp's predictive maintenance, on our platform on a few virtual machines. So, you know, we have airports. We actually monitor everything, like belt condition, cameras, all that, with real time insights that set which lane goes first into the gate, things like that. So operational use cases, where we're actually trying to put machine learning and AI to work for you, these are the ones that we accept. If, you know, you have a dataset, you wanna run some training and make a report, don't use us. Okay? Or if you have very small scale, or you're not trying to create sort of a CI/CD kind of mentality in the organization, don't use us. When I see that the organization is very, very siloed, like the data science group, I ask them, okay, how do you do the real time features?
They say, no, it's not a problem. I say, how come it's not a problem? You have to do it. And they say, no, we throw everything we do over to the other department, and they essentially recode everything. So there's no problem, you know, it's someone else's problem. We can't fix that, because if politically the organization doesn't know how to work together, we'll just run into the wall.
[00:55:53] Unknown:
And as you continue to build out the technical and business capabilities of Iguazio and continue to contribute to these open source frameworks that you support, what are some of the things that you have planned for the near to medium term, and what are you most excited about in the overall machine learning space?
[00:56:10] Unknown:
I love technologies. I'm excited about many things. But I think the key focus areas for us right now are, first, enhancing the feature store: making it more powerful, more automated, more scalable, you know, higher performance and so on. That's one thing. The other thing is that we've produced these sort of DAG technologies, where you can build real time and offline pipelines, or serverless pipelines, and they have a lot of built-in primitives that we created, essentially a library of things like aggregators, joiners, things like that. Think of it like Lego building blocks, like a playground for building analytics using higher level Lego bricks, and then it just compiles to something extremely fast that can work on batch or real time. So that's something where we're going to expand the different primitives and extend the other capabilities to the harder problems of data consistency, exactly-once semantics, and so on. And the third area, which also leverages the other two technologies (they're all sort of connected), is the post-production side: enhancing the model monitoring capabilities, alerting, retraining, data governance, and all of that. So you see, a lot of our focus is not necessarily making better AutoML. I think that market is already very saturated and there are good open source projects in that field. And that's also something that we support in our pipelines: you can just launch a training job in Azure ML or SageMaker from within our pipeline. So it's not where we think we need to spend energy. But the other things that I mentioned, like the feature store, the real time pipelines, and the model monitoring and analysis, sort of serve the CI/CD for MLOps. Those are the areas where I think there are the biggest challenges today in the industry, around our theme, which is operationalizing machine learning.
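To give a feel for the "Lego building blocks" idea, here is a minimal, hypothetical sketch (the Pipeline class and step names are illustrative, not Iguazio's actual API) of composing small reusable primitives like a filter and an aggregator into one pipeline definition that could in principle be executed over a batch file or a live stream:

```python
# Hypothetical sketch of composable pipeline primitives ("Lego bricks").
from typing import Callable, Iterable, List

class Pipeline:
    """Chains simple record-transforming steps into one callable flow."""
    def __init__(self) -> None:
        self.steps: List[Callable[[Iterable[dict]], Iterable[dict]]] = []

    def add(self, step: Callable[[Iterable[dict]], Iterable[dict]]) -> "Pipeline":
        self.steps.append(step)
        return self

    def run(self, records: Iterable[dict]) -> List[dict]:
        data: Iterable[dict] = records
        for step in self.steps:
            data = step(data)
        return list(data)

# Reusable primitives.
def only_card_payments(records):
    return (r for r in records if r["method"] == "card")

def sum_by_user(records):
    totals: dict = {}
    for r in records:
        totals[r["user"]] = totals.get(r["user"], 0) + r["amount"]
    return ({"user": u, "total": t} for u, t in totals.items())

# Compose the bricks into one flow definition.
pipeline = Pipeline().add(only_card_payments).add(sum_by_user)

events = [
    {"user": "a", "method": "card", "amount": 10},
    {"user": "a", "method": "cash", "amount": 5},
    {"user": "b", "method": "card", "amount": 7},
]
print(pipeline.run(events))  # [{'user': 'a', 'total': 10}, {'user': 'b', 'total': 7}]
```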
[00:58:02] Unknown:
Would you agree with that? Yeah, I definitely agree. The biggest challenges that I've seen in terms of understanding and adopting machine learning are just figuring out how to get access to the data, how to operationalize that, and then just managing the overall life cycle. A lot of the work that I've seen going into machine learning is in more of these static use cases that you mentioned, where you're just pulling the data, you're building a report, and then you have something fancy to show for it, but not as much of the building it into the core of your application. And that's where a lot of the current energy I see going into the overall space is figuring out how do we make this maintainable, particularly for smaller teams who don't have the capacity of a Google or a Facebook to, you know, just throw a bunch of engineers at the problem and work their way through it. Yeah, and we also invest a lot in usability.
[00:58:57] Unknown:
We have a couple of UX designers working for us, because the problem is that we have lots of capabilities and lots of technology, and when you have lots of technologies, it's not so trivial for the novice. There's a term called clickers and coders; you're probably familiar with that. There are platforms like RapidMiner or whatever, where you just, like, click, click, click, or even H2O or DataRobot: you throw in a CSV and it generates a model for you. But it's very limited because it only works with the CSV. It probably won't work with a terabyte worth of data. And also it's very canned logic, you know, it has a bunch of very specific algorithms that it works with. We provide huge flexibility, so it's not like you push one button, there are a few buttons to push. So we're trying to make the experience much easier, you know, incorporate wizards, even do some AI/ML in the way that we present things, to make it simpler and slicker to consume.
[00:59:54] Unknown:
For anybody who does want to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:00:08] Unknown:
One of the interesting points... I was speaking to some Gartner analyst and saying, okay, you know what? Machine learning used to be manual and now there's, like, AutoML. Okay? DevOps used to be manual and now there's serverless and all of that. And in data engineering, we're still sort of in the stone age. Okay? SQL queries and all of that. So, you know, I think that the feature store story is not just for machine learning. If you think about the concept that you've built the feature store for, because you wanna simplify things for the data scientist, you can actually apply the same concepts of a feature store to data engineering, to reports, you know. One of the things that we've done that is pretty unique in our feature store: most feature stores' output is a key value table or a data warehouse or Parquet or BigQuery or something like that. From our feature store, we can also output time series and SQL data.
So if you want to put a BI dashboard on top of some features that you've built, we actually support real time dashboards like Grafana, and we also support BI dashboards through JDBC plugins. So you can think about the feature store as the next level of abstraction also for the analytics folks, for the traditional ETL flow. That's slightly more futuristic for some of the data engineers, I think, but that's my gut feeling. What do you think about it? I definitely think that using the feature store
[01:01:39] Unknown:
as a step in the overall pipelining of a data engineering flow would definitely be interesting, and it could definitely help to simplify some of the discovery that goes into it, particularly as you grow the number of pipelines where there are going to be commonalities in terms of the transformation steps that you build or some of the aggregations that you need to run across different data sources. And using the feature store as an output for some of those downstream stages, I think, could definitely simplify a lot of the complexity that goes into building and maintaining all of these pipelines.
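As a rough illustration of the BI-on-top-of-features idea raised above, here is a minimal, hypothetical sketch (the table name, columns, and the SQLite target are assumptions, not Iguazio's actual output format) of materializing computed feature values into a SQL table that a Grafana-style or BI dashboard could then query directly:

```python
# Hypothetical sketch: write computed feature values into a SQL table so a
# dashboard can query them; SQLite stands in for whatever database the
# dashboard actually reads from over JDBC/ODBC.
import sqlite3
import pandas as pd

# Feature values as they might come out of a feature computation step.
features = pd.DataFrame({
    "user_id": [1, 2, 3],
    "txn_count_1h": [4, 0, 7],
    "avg_amount_1d": [12.5, 0.0, 33.2],
})

conn = sqlite3.connect("features.db")
# Materialize (replace) the feature table; a dashboard would simply
# SELECT from it on every refresh.
features.to_sql("user_features", conn, if_exists="replace", index=False)

# The kind of query a BI tool might issue against the same table.
print(pd.read_sql(
    "SELECT user_id, txn_count_1h FROM user_features WHERE txn_count_1h > 0",
    conn,
))
conn.close()
```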
[01:02:14] Unknown:
It doesn't have to feed a training model. It could feed event driven logic, simple event processing. It could feed a dashboard or something like that. And not everything is just batch-like querying, you know, SQL and all of that. You can update a SQL database so you can build a BI system on top of it, but keep it fresh. And still, there are lots of challenges in stream processing: all the challenges around always-on and auto scaling and exactly-once semantics and all of that. So this is another area where the traditional, you know, Kafka, Spark Streaming, and all that, you can displace with something which is more, like, intelligent and more abstract. And we build it based on those sort of Lego building blocks I mentioned. For instance, one of the nice things we have in the higher level abstraction is dealing with aggregations.
I like to show this sort of hospital example with some of the code in the open. Like, I'm building aggregates for all the telemetry data of a patient in 3 lines of Python. Now, if I need to build that in SQL, it's a few hundred lines of SQL that's undebuggable. So this is another thing that we can start thinking about: okay, how do we build a higher level abstraction for the data engineering app? Yeah, definitely something that I'll be interested to dig a bit more into, and hopefully see some different frameworks or best practice
[01:03:34] Unknown:
guides start to factor into the overall approach to data engineering and data management.
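For a sense of what "aggregates in a few lines of Python" might look like in practice, here is a small, hypothetical pandas sketch (the column names and the one-hour window are assumptions, and this is not the actual Iguazio code referred to above) that computes rolling aggregates over patient telemetry:

```python
# Hypothetical sketch: rolling telemetry aggregates per patient in a few lines
# of pandas; the equivalent windowed logic in hand-written SQL is typically
# far longer and harder to debug.
import pandas as pd

telemetry = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "ts": pd.to_datetime([
        "2021-06-01 10:00", "2021-06-01 10:30", "2021-06-01 11:15",
        "2021-06-01 10:05", "2021-06-01 10:50",
    ]),
    "heart_rate": [88, 95, 112, 70, 73],
}).set_index("ts").sort_index()

# One-hour rolling mean and max heart rate, computed independently per patient.
aggs = (
    telemetry.groupby("patient_id")["heart_rate"]
    .rolling("1h")
    .agg(["mean", "max"])
)
print(aggs)
```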
[01:03:39] Unknown:
One of the interesting phenomena that could happen is sort of a mix between predictive and prescriptive. Okay? So think of how we create a feature. We run some analytics, okay, and we create a new feature. But some features may be predictive. Like, I can add a column through imputing, to give one example. Or maybe one of the columns essentially needs to be a predictive value, generated using a model. So today we have, like, okay, those guys do data engineering and produce a set of features, and those guys do machine learning and take the features. I gave some examples, like in NLP, converting unstructured data to structured data, like creating a predictive value. So I think one of the things that we may see in the future is essentially that things start mixing: you'll see machine learning in analytics and analytics in machine learning. Yeah, I've definitely already started to see some of the machine learning applications get factored into some of the data engineering pipelines, particularly
[01:04:44] Unknown:
in the data quality ecosystem, where a lot of companies are starting to build models around doing statistical analysis of column values to understand whether an outlier is something that is genuinely bad data, or something that is actually expected based on some factors external to the business that, you know, might contribute to a larger than normal seasonal drop in business because of COVID, for instance. Yep.
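As a small, hypothetical illustration of the "predictive feature" idea raised above (the column names and the model choice are assumptions, not a specific Iguazio feature), here is a sketch where a simple model fills in a missing column as part of feature preparation, so the output of the data engineering step already contains a predicted value:

```python
# Hypothetical sketch: use a small model to impute a missing feature column
# during feature preparation, mixing a predictive step into a data pipeline.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "tenure_months": [3, 14, 60, 72, np.nan, np.nan],  # column with gaps
})

known = df.dropna(subset=["tenure_months"])
missing = df[df["tenure_months"].isna()]

# Train a tiny model on the rows where the value is known...
model = LinearRegression().fit(known[["age"]], known["tenure_months"])

# ...and use its predictions as the imputed feature for the remaining rows.
df.loc[missing.index, "tenure_months"] = model.predict(missing[["age"]])
print(df)
```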
[01:05:10] Unknown:
Exactly.
[01:05:11] Unknown:
Well, thank you very much for taking the time today to join me and share all the work that you're doing with Iguazio and working to help bring machine learning into production. It's definitely an interesting and challenging problem domain, so it's always great to speak to people who are helping to simplify that and make it more accessible. So I appreciate all the time and effort that you've put into that, and I hope you have a good rest of your day. Thank you very much, and it was fun talking to you. Likewise. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to the Podcast and Guest
Yaron Haviv's Background and Journey into Data Science
Founding and Vision of Iguazio
Challenges in Machine Learning and Data Science
Feature Store and Its Importance
Scaling and Optimization in Feature Store
Integration with Data Pipelines and Workflow Management
Collaboration Between Data Engineers and Data Scientists
Data Lineage and Governance
Monitoring and Managing Model Performance
Continuous Integration and Deployment for Machine Learning
Architectural Overview of Iguazio
Future of Machine Learning Tooling and Processes
Innovative Use Cases of Iguazio
Future Plans and Exciting Developments
Closing Thoughts and Contact Information