Distributed Computing In Python Made Easy With Ray - Episode 258

Summary

Distributed computing is a powerful tool for increasing the speed and performance of your applications, but it is also a complex and difficult undertaking. While performing research for his PhD, Robert Nishihara ran up against this reality. Rather than cobbling together another single purpose system, he built what ultimately became Ray to make scaling Python projects to multiple cores and across machines easy. In this episode he explains how Ray allows you to scale your code easily, how to use it in your own projects, and his ambitions to power the next wave of distributed systems at Anyscale. If you are running into scaling limitations in your Python projects for machine learning, scientific computing, or anything else, then give this a listen and then try it out!

Do you want to try out some of the tools and applications that you heard about on Podcast.__init__? Do you have a side project that you want to share with the world? Check out Linode at linode.com/podcastinit or use the code podcastinit2020 and get a $20 credit to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.



Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Your host as usual is Tobias Macey and today I’m interviewing Robert Nishihara about Ray, a framework for building and running distributed applications and machine learning workloads in Python

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by describing what Ray is and how the project got started?
    • How did the environment of the RISE lab factor into the early design and development of Ray?
  • What are some of the main use cases that you were initially targeting with Ray?
    • Now that it has been publicly available for some time, what are some of the ways that it is being used which you didn’t originally anticipate?
  • What are the limitations for the types of workloads that can be run with Ray, or any edge cases that developers should be aware of?
  • For someone who is building on top of Ray, what is involved in either converting an existing application to take advantage of Ray’s parallelism, or creating a greenfield project with it?
  • Can you describe how Ray itself is implemented and how it has evolved since you first began working on it?
  • How does the clustering and task distribution mechanism in Ray work?
  • How does the increased parallelism that Ray offers help with machine learning workloads?
    • Are there any types of ML/AI that are easier to do in this context?
  • What are some of the additional layers or libraries that have been built on top of the functionality of Ray?
  • What are some of the most interesting, challenging, or complex aspects of building and maintaining Ray?
  • You and your co-founders recently announced the formation of Anyscale to support the future development of Ray. What is your business model and how are you approaching the governance of Ray and its ecosystem?
  • What are some of the most interesting or unexpected projects that you have seen built with Ray?
  • What are some cases where Ray is the wrong choice?
  • What do you have planned for the future of Ray and Anyscale?

Keep In Touch

Picks

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:13
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models or running your CI/CD pipelines, they've got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey, and today I'm interviewing Robert Nishihara about Ray, a framework for building and running distributed applications and machine learning workloads in Python. So, Robert, can you start by introducing yourself?
Robert Nishihara
0:01:10
Yeah. Well, first of all, thanks so much for speaking with me. I am one of the creators of an open source project called Ray, and I'm one of the co-founders and CEO of Anyscale, which is commercializing Ray. Before starting this company about half a year ago, I did a PhD in machine learning and distributed systems at UC Berkeley. Ray was a project we began several years ago at UC Berkeley during the PhD program to address problems that we were running into in our own research, and that has since transitioned into Anyscale.
Tobias Macey
0:01:47
And do you remember how you first got introduced to Python?
Robert Nishihara
0:01:50
So, I first started using Python in college. When I started out doing machine learning research as an undergraduate, the dominant tool at that time was MATLAB. Everyone was using MATLAB to run experiments and do regressions and things like that. At some point, everyone started transitioning to Python, and once deep learning really took off around 2013, and frameworks like Theano, and of course later on TensorFlow and PyTorch, took off, that really solidified Python's dominance in machine learning.
Tobias Macey
0:02:22
And so you mentioned that Ray is a distributed task execution framework and that it started when you were doing your PhD research at UC Berkeley. My understanding is that you were a part of the RISE Lab at the time. I'm wondering if you can give a bit more background on the project and the motivations for starting it, some of the challenges that you were facing in your research that led to it, and how the overall environment of the RISE Lab factored into some of the early design and development decisions of the project?
Robert Nishihara
0:02:51
That's a great question. So the research that I started out doing — and that Philipp, so Philipp Moritz is one of my co-founders and also one of the creators of Ray — we were doing research in more theoretical machine learning, trying to design better algorithms for reinforcement learning and optimization. If you walk into the AI research building at Berkeley, or at any research university, one thing you'll notice is that you have all these machine learning researchers, and they have backgrounds in math and statistics, and they're trying to design better algorithms for machine learning, but what they actually spend a ton of their time doing is low-level engineering to scale things up and to speed things up. The experiments in machine learning are very computationally intensive and can be quite time consuming to run, and so speeding things up, or scaling things up to run on a cluster instead of on a laptop, can be the difference between an experiment taking a week versus a day versus an hour. So you have all these graduate students and machine learning practitioners who are not only designing the machine learning algorithms and implementing them, they're also building all of this infrastructure and scaffolding underneath to run their algorithms at a large scale. We did this ourselves, of course, but we felt that there had to be a better way that could let us just focus on the algorithms and the applications we were trying to implement, and not have to become experts in distributed computing. So we set out to try to build better tools. And of course, we'd used many of the existing tools out there. Even back as an undergraduate, when I was doing machine learning research a number of years ago, I used a tool called IPython Parallel, which is one of the first tools for doing parallel and distributed Python, and that made my life a lot easier. But when it came to deep learning and more modern machine learning applications, the tools just weren't there yet, and so these researchers were constantly building their own ad hoc tooling and new infrastructure. That was the scenario we found ourselves in: to do machine learning, you really had to build a lot of infrastructure, and that's why we set out to try to build tools to make this much, much easier. You asked about the RISE Lab — well, when I started the PhD program, I was part of the AMP Lab at Berkeley, which created Spark, and that transitioned into the RISE Lab. In both of these settings, I was surrounded not just by experts in machine learning, but also experts in distributed systems. I was coming from a more theoretical background — I was thinking about machine learning algorithms and now starting to get interested in how to build tools and build systems — and it wouldn't have been possible if I hadn't been in that environment. I was surrounded by experts in distributed computing, I was able to learn from them and from their experience with Spark and other open source systems, and that really helped us avoid some of the pitfalls. It helped us move much more quickly in terms of building a great framework.
Tobias Macey
0:06:03
And one of the common themes in a lot of the projects that I've had on the show, and people that I've spoken with, is that they set out to solve one problem and on the way discovered that they actually have to solve some completely orthogonal problem first, and half the time it seems that they never actually get back to the original problem they were trying to deal with. I'm curious what your experience was — whether Ray just consumed all of your attention and you never actually made it back to your original research, or whether you were able to at least get back to what you were originally trying to accomplish after you got to a working state with Ray.
Robert Nishihara
0:06:35
That's funny that you ask. Yeah, when we started Ray, it was a little bit of a detour. We were doing machine learning research, and it became clear to us that the tooling and the infrastructure was the bottleneck, so we set out to build something better there. And of course, the goal from the start with building Ray was to build useful open source software, to try to solve this pain point where people really struggle to scale up their applications from one machine to a cluster. When we started out, we had an idea of the API that we wanted and the behavior that we wanted, and we figured it would take us about a month to finish that. And of course, in a month we did have something — we had a prototype, and we had something working — but to really make something useful for people, to give companies confidence that they can run it in production, requires much more than a prototype; you have to really take it all the way. So even if you can get some initial prototype up and running pretty quickly, to really polish it and really make it great there's a lot of extra work involved, and it takes longer than you initially expect. Now, one interesting thing related to what you asked is that there haven't been too many changes to the scope of what we've been building. We started out quite broad, but over time we encountered more and more use cases that we realized we could support that we didn't initially realize we could support, and I can talk more about these later. One example: Ray actually started out as purely functional and stateless — essentially, it started out as a way to scale up Python functions. But then we realized a lot of the applications we really cared about were stateful and required managing mutable state in the distributed setting. So we introduced the ability to scale up not just Python functions, but also Python classes, by introducing actors. That was one API change that really opened up a lot of new use cases in a really powerful way.
Tobias Macey
0:08:40
Yeah, when I was looking through the documentation and seeing the difference between just a remote task versus a remote actor, at first glance it wasn't really clear to me the significance of the minor syntax difference, because you still just use a decorator — it's just a matter of whether you decorate a class versus a function to determine whether or not statefulness is incorporated into it. So I think that the API and the sort of user experience that you landed on is fairly elegant and very minimal, while still being able to provide a significant amount of power to the end user.
Robert Nishihara
0:09:11
That's exactly right — that's something we very much strive for: to keep it simple and to have only a few concepts that are powerful and general.
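For reference, a minimal sketch of the decorator-based API discussed above, assuming a recent Ray release: the same `@ray.remote` decorator turns a function into a stateless remote task and a class into a stateful actor.

```python
import ray

ray.init()

@ray.remote
def square(x):
    # Stateless remote task: each call can run on any worker in the cluster.
    return x * x

@ray.remote
class Counter:
    # Stateful actor: the instance lives in its own worker process,
    # and method calls mutate state held by that process.
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]

counter = Counter.remote()
print(ray.get([counter.increment.remote() for _ in range(3)]))  # [1, 2, 3]
```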
Tobias Macey
0:09:20
And so you mentioned that some of the original work you were doing was on some of these advanced machine learning algorithms, such as reinforcement learning, and that was some of the initial use case that you were targeting. I'm curious, now that it has been publicly available for some time and it's been out in the wild, what are some of the other ways that it's being used which you didn't originally anticipate?
Robert Nishihara
0:09:40
Yeah, one use case that we didn't initially anticipate was serving — so, inference, putting machine learning models in production — but it turns out that Ray is actually a pretty natural fit for this use case. And the powerful thing here, the exciting thing, is that when you have a system that can support not just the serving and inference side of the equation but also the training side, you can build online learning applications that do both, where you have algorithms and applications that are continuously interacting with the real world and making decisions, but also learning from those interactions and updating themselves, updating the models. So you have applications where — and there are companies doing this in production — you have data streaming in, you're incrementally processing this data in a streaming fashion and training new recommendation models, and then serving recommendations from those models back to the users, which then affects the data that you're collecting, and there's this tight loop. Normally, when companies implement this kind of online learning application, they're actually stitching together several different distributed systems: they'll take one for stream processing, one for training machine learning models, and then one for serving. And here we've had companies that are able to take that kind of setup, with several different distributed systems stitched together, and rebuild it all on top of Ray. It becomes simpler because it's on top of one framework — they don't have to learn how to use a ton of different distributed systems. It can become faster because you aren't moving data between different systems, which can sometimes be expensive. And it generally gets simpler because these different distributed systems are usually not designed to interface with each other. If you're using Horovod or distributed TensorFlow for training, and you're using Flink or Spark Streaming for the stream processing, Flink and Horovod were not really designed to compose together. But if you can do these things on top of Ray, then it can be as natural as just using Python and importing different libraries like NumPy and pandas, and you can use all of these things together.
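A rough, hypothetical sketch of the pattern described here — a single actor that both updates a model incrementally and serves predictions from the same in-memory state — using only the core Ray API, not Ray's serving library; the model logic below is a stand-in.

```python
import ray

ray.init()

@ray.remote
class OnlineModel:
    """Hypothetical actor that trains incrementally and serves predictions."""

    def __init__(self):
        self.weight = 0.0  # stand-in for real model parameters

    def train_on_batch(self, events):
        # Incremental update from a small batch of (feature, label) pairs,
        # as if they had just arrived from a stream processor.
        for x, y in events:
            self.weight += 0.01 * (y - self.weight * x) * x
        return self.weight

    def predict(self, x):
        # Serving reads the same mutable state the training loop updates.
        return self.weight * x

model = OnlineModel.remote()
ray.get(model.train_on_batch.remote([(1.0, 2.0), (2.0, 3.5)]))
print(ray.get(model.predict.remote(1.5)))
```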
Tobias Macey
0:11:46
Another project in the Python ecosystem that is tackling the distributed computation need, that I'm also aware of, is Dask, and I know that it was originally built for more scientific computation and research workloads. I'm wondering what your thoughts are on some of the trade-offs of when somebody might be looking to use Dask versus Ray, or ways that they can work together?
Robert Nishihara
0:12:12
That's a great question. Dask has some great libraries for data frames and distributed array computation, which we haven't been focused on so much. With Ray, we've been very focused on scalable machine learning — everything from training machine learning models, to reinforcement learning, to hyperparameter search, to serving — and the kinds of applications we've been focused on really benefit from the ability to do stateful computation with actors. In terms of what Dask provides at a low level, it's very similar to the Ray remote function API — the ability to take Python functions and scale them up. But a lot of the scalable machine learning libraries that we're building really require stateful computation, so there's some difference in emphasis there. And of course, I think Dask began its life focused on large multi-core machines and really optimizing for high performance on a single multi-core machine. We are focused on performance very much in the large multi-core machine setting as well as the cluster setting, and really trying to match the performance of low-level tools — like if you were just using gRPC on top of Kubernetes and things like that, then using Ray should ideally be just as fast.
Tobias Macey
0:13:45
For the types of workloads that somebody might think of for Ray, what are some of the limitations that they should be aware of, or edge cases that they might run into as they're developing the application, or any of the specific design considerations that you have found to be beneficial for successful implementations?
Robert Nishihara
0:14:02
There are a couple of different things to be aware of. One is that the Ray API can be pretty low level, and so to implement all of the features and all of the things you want for your application, you may need to build quite a bit of stuff. For example, if you're looking to do distributed hyperparameter search, well, we provide a great library for scalable hyperparameter search, so you can do that out of the box on top of Ray and it works well. On the other hand, if you want to do something like large-scale stream processing, we currently don't have a great library for stream processing, and so while it should be possible in principle to do that on top of Ray, doing so would require first building your own stream processing library, or some subset of one, before you can really do that well.
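For context, a hedged sketch of the hyperparameter search library mentioned here (Ray Tune); the exact names below (tune.run, tune.report, tune.grid_search, get_best_config) assume a recent Ray release, and the training function is a stand-in for a real training loop.

```python
from ray import tune

def trainable(config):
    # Stand-in for a real training loop; report a score for this trial.
    score = -(config["lr"] - 0.1) ** 2
    tune.report(score=score)

analysis = tune.run(
    trainable,
    config={"lr": tune.grid_search([0.001, 0.01, 0.1, 1.0])},
)
print(analysis.get_best_config(metric="score", mode="max"))
```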
Tobias Macey
0:14:51
And if somebody is trying to convert an existing application to be able to scale out some subset of its functionality, what does the workflow look like for being able to do that, and how does it compare to somebody who's approaching this in a greenfield system?
Robert Nishihara
0:15:08
Yeah, so this is actually an area where Ray really shines. With Ray, if you want to scale up an existing Python application that you already wrote, you can often — not always, but often — just add a few lines, change a few lines, add a few annotations, and take that code and run it anywhere from your laptop to a large cluster. The important thing here is not just the number of lines of code — it's nice if you only have to add seven lines of code — but that you don't have to restructure your code; you don't have to re-architect it and change the abstractions and change the interfaces to coerce it into using Ray. It should really be something you can do in place. The reason for this has to do with the abstractions that Ray provides. If you want to scale up an arbitrary Python application using Spark, for example, well, the core abstraction that Spark provides is a large-scale, distributed data set. So if you have an application where the core abstraction is a large data set, and you're manipulating that data set, then Spark is the perfect choice. On the other hand, if you're trying to do something like hyperparameter search, or inference and serving, where the main abstraction is not really a data set — it might be something like a machine learning experiment, or a neural network that you're trying to deploy — then to scale that up with Spark, you have to coerce it into this data set abstraction, which is not a natural fit, and that can lead you to having to re-architect your application. On the other hand, with Ray — we of course have higher-level libraries, but the core Ray API is not introducing any new concepts. It's not introducing higher-level abstractions. It's just taking the existing concepts of Python functions and classes — and of course, pretty much all Python applications are built with functions and classes — and translating those two concepts into the distributed setting. So you have a way of taking any application that's built with functions and classes and, ideally with just a few modifications, running that in the distributed setting. This is something where Ray can feel very lightweight, and a number of different users have told us that they love Ray because they can get started with it so quickly: they can tell people on their team how to use it in 30 minutes, and then they're off and running, and they don't have to really change their code much. They can just take their existing code and scale it up.
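A minimal before-and-after sketch of the kind of in-place change described above, assuming a recent Ray release; `process` is a hypothetical existing function, not something from the show.

```python
import ray

ray.init()

def process(record):
    # Existing single-machine function, left unchanged.
    return record * 2

# Original, sequential version:
# results = [process(r) for r in range(1000)]

# Parallel version: wrap the existing function with ray.remote, call .remote(),
# and gather results with ray.get() -- no restructuring of the application.
process_remote = ray.remote(process)
results = ray.get([process_remote.remote(r) for r in range(1000)])
```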
Tobias Macey
0:17:49
In terms of Ray itself, I'm wondering if you can discuss how it's actually implemented and some of the ways that it has evolved since you first began working on it?
Robert Nishihara
0:17:52
One thing we've done is we've actually rewritten it from scratch several times, as we got new ideas and thought of improved ways to architect it. We had an initial prototype that was written in Rust, and I think the implementation after that was in C++ — a C++ implementation with lots of multi-threading. Then we switched to an implementation in C, actually, where all of the different components, like the scheduler and the object store and so on, had single-threaded event loops. And from there, more recently, we've switched back to C++, using gRPC as the underlying RPC framework. We've continued to re-architect it quite a bit, both to improve performance and stability and to simplify the design. I think the people working on Ray have done a great job of making these bold changes and really not being afraid to tear things down and rewrite them when it makes sense to do so. Now, how does it actually work? There are a number of key components. One is the scheduler, and the scheduler is actually a distributed concept, where we have a different scheduler process on each machine. The scheduler is responsible for things like starting new worker processes, assigning tasks to different workers, and doing the bookkeeping for what resources are in use — which CPUs, GPUs, or memory are currently allocated. There's also an object store process, and this is a pretty important performance optimization for machine learning applications. A lot of machine learning applications rely heavily on large numerical arrays and large numerical data, and when you have a bunch of different worker processes that are all using the same data set, or the same neural network weights, it can get quite expensive to send copies of the state around to each worker process and for each process to have its own copy of the data. So an important optimization is to allow each worker, instead of having its own copy, to just have a pointer — to use shared memory to access the data. What happens is we have this object store process, which can store arbitrary Python objects, but the important case is objects containing large arrays of numbers, and then all the different worker processes can access that data just by having a pointer to shared memory, instead of having to create their own copy. That way you can avoid a lot of expensive serialization and deserialization. This is something where we first developed the object store as part of Ray, and then we contributed it to the Apache Arrow project, which was started by Wes McKinney — a project we've collaborated with very closely — and now people are using the object store as a standalone thing, independent of Ray.
Robert Nishihara
0:21:06
I would say those are some of the key architectural pieces. There's more, of course — there's the Ray autoscaler, which is a cluster launcher. One of the pain points with doing distributed computing is starting and stopping clusters, managing the servers, and installing all of your libraries and dependencies on the VMs. What the Ray autoscaler does is let you just run a command from your laptop — you can run "ray up" and that'll start a cluster. You can do it on AWS, or GCP, or Azure, or any cloud provider, and you can essentially configure and manage these clusters on any cloud provider just by running a single command. So that's another important part of what we're building.
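To make the shared-memory object store described above concrete, here is a small hedged sketch assuming a recent Ray release: a large NumPy array is put into the object store once, and many tasks read it by reference rather than each receiving a private copy.

```python
import numpy as np
import ray

ray.init()

# Put a large array into the shared-memory object store once...
weights = np.random.rand(10_000_000)
weights_ref = ray.put(weights)

@ray.remote
def score(weights, batch_id):
    # Workers on the same machine read the array via shared memory
    # (zero-copy for large numpy arrays) instead of holding private copies.
    return float(weights.sum()) + batch_id

# ...and pass the reference to many tasks instead of the array itself.
print(ray.get([score.remote(weights_ref, i) for i in range(4)]))
```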
Tobias Macey
0:21:53
Yeah, the operational characteristics of distributed computation frameworks are often one of the hardest parts and one of the biggest barriers to entry, so I was pretty impressed when I was reading through the documentation and saw that you had that built in as a first-class concern. And just briefly going back for a moment, I was surprised to hear that you ended up going from Rust to C++, because it's usually the other direction that I've heard people going. So I'm curious what the state of the Rust language was at the time that you made that decision, or any constraints that it had, because I've usually heard of people going to it because of its memory safety capabilities.
Robert Nishihara
0:22:28
Yeah, that's a great question. I think if we had stuck with Rust, it would have been a fine choice and would have had a lot of advantages — and of course, it was a lot of fun to write, and the memory safety is a big advantage. At the time, we were somewhat less sure about the trajectory of Rust and the maturity of the ecosystem. For some of the third-party libraries that we were integrating, like gRPC and Arrow, the C++ support was very mature, and in some sense the conservative choice was to do C++. In addition, we wanted it to be something that a lot of people in the community could contribute to, and of course a lot of people do know Rust, but there are perhaps more people who are able to contribute in C++. That said, we do care a lot about memory safety, and we do follow best practices when using C++ to try to be as careful as possible.
Tobias Macey
0:23:31
Yeah, the ecosystem is generally one of the biggest motivators for choosing any particular project or language, because if you go with something that's brand new and very promising, it can be exciting, but after you get started for a little bit, you realize how much you're leaning on the rest of the ecosystem and the rest of the work that everybody is contributing to help you move faster. So I can definitely sympathize with the motivation for going back to C++. And then, going back to the operational characteristics, I think it's great that you have this autoscaler built in. One of the questions I had is about being able to manage dependencies in the instances that you're deploying, and how you manage populating the code that you're sending to those instances, along with any dependent libraries that are necessary for them to be able to execute.
Robert Nishihara
0:24:24
Yeah, and that's an ongoing area that we're working on, so I don't think it's fully solved yet. One thing that's always been a challenge is that with distributed computing, if you build some large-scale distributed application and you're running it on a cluster, it's quite challenging to actually share that application with someone else in a way that they can easily reproduce. Part of the reason for that is that it's not enough to just share the code — there's quite a bit more in terms of the cluster configuration, the libraries you're depending on, maybe some data that lives somewhere. And in machine learning, people often spend more time trying to reproduce other people's experiments than they do running their own experiments. This is something we've seen quite a bit. So one thing we are doing is trying to make it easy for people to share their applications, and then for others to just run those applications, and I think we can make that as easy as essentially clicking a button. Part of that is that if you're building an application with Ray, it's a very portable API — it's something you can run anywhere from your laptop to any cloud provider or a Kubernetes cluster — and then making sure that we are specifying all the relevant dependencies, whether those are Python libraries, or instance types, or data. That's something we're developing in the open source project, to try to encapsulate all these dependencies.
Tobias Macey
0:25:55
As you have built out more of the capabilities of Ray, I know that you've also added these other layers and libraries in the form of RLlib and the Tune library that you mentioned for being able to do distributed hyperparameter search. I'm wondering what you have found to be some of the most interesting, challenging, or complex aspects of the work of building and maintaining Ray and its associated ecosystem.
Robert Nishihara
0:26:18
Yeah, so let me say a little bit more about all the different pieces that we're building. There are at least two parts to Ray. There's the core runtime, which has the remote decorator and the ability to scale up Python functions and classes. And then on top of that, we're building an ecosystem of libraries targeting scalable machine learning, which includes things like, as you mentioned, hyperparameter tuning, training, serving, and reinforcement learning. These are all things where, to do reinforcement learning or to do training, you typically have to distribute the computation to get it done in a reasonable amount of time. So those are libraries that we're very excited about, and you can use them together seamlessly. So what is the challenge? Well, in some sense, there are several different projects happening here that are being developed in the same code base, so it can be a little challenging to keep everything organized and to make it easy for people to contribute to, say, RLlib without having to know anything about the Ray core or about the serving library. It can be a little harder for people to navigate the code base because there are just more directories and more things going on. But there are also a lot of advantages to having things in the same code base, and in particular, you don't have to worry about incompatible versions of the reinforcement learning library, the tuning library, and the core Ray library.
0:27:49
And that's a pretty big advantage.
Tobias Macey
0:27:51
And another aspect of this overall effort is, as you mentioned, the Anyscale company that you and some of your collaborators have formed around Ray to help support its future development. I'm wondering if you can discuss a bit about the business model of the company and some of the ways that you're approaching the governance of the Ray project and its associated ecosystem.
Robert Nishihara
0:28:14
Great question. So, we don't have a product yet — we haven't announced any product plans — but if you look at other companies that are commercializing open source projects — Databricks, for example, or GitHub, and you can also think of MongoDB, or Elastic, or Confluent — having a great open source project is a great starting point, but it's not enough to create a successful business. In addition to having a great open source project, you also need to have a great product. If you look at GitHub, for example — Git, of course, is an open source project, and GitHub is much more than just Git, right? It also includes tools for code reviews, tools for continuous integration, tools for sharing your applications with other people and collaborating on large projects. So similarly, we plan on building products that add a lot of value to users of the open source project and that make life easier for developers. One way to think about what we're trying to do — stepping back a little bit, the main premise of this company is that distributed computing is becoming the norm, that distributed computing is becoming increasingly essential for all types of applications, but especially applications that touch machine learning. And one of the challenges is that doing distributed computing is actually quite difficult and requires a lot of expertise. So what we're trying to do with the company is to make it as easy to program clusters of machines as it is to program on your laptop. You could think about the products we're trying to offer as essentially the experience of being able to develop as if you're on your laptop, but you have an infinite laptop — your laptop has infinite compute and infinite memory — and you don't have to know anything about distributed systems.
Tobias Macey
0:30:19
One of the elements that's associated with that is the idea of being able to design your software in a way that it is able to be processed in either an idempotent or an embarrassingly parallel way. I'm curious what your experience has been with people onboarding onto Ray and trying to build an application, and then accidentally coding themselves into a corner because they have something that doesn't actually work when it's spread out across multiple machines, and is actually much better as a single-process computation.
Robert Nishihara
0:30:51
So, to scale up an application with Ray, for that to make sense there does have to be some opportunity for parallelism — there have to be some things that can happen at the same time. Not every application is like that, but a good number are, and that includes more than just embarrassingly parallel applications. For example, hyperparameter tuning: you can write it in an embarrassingly parallel way, but when you get to some more sophisticated hyperparameter search algorithms, it's no longer embarrassingly parallel. You're actually looking at the experiments as they're running and stopping the poorly performing ones early, you're potentially devoting more resources to the ones that are doing well, you're sharing information from some of the experiments to choose parameters for starting new experiments. And of course, you're running multiple experiments in parallel, and each experiment might itself be distributed — it might be doing distributed reinforcement learning or distributed training. So you have this nested parallelism, and even a conceptually simple thing like hyperparameter search can get quite sophisticated, so the simple embarrassingly parallel story on its own is not enough for supporting hyperparameter search well. So you're right, it does take some amount of sophistication to scale up applications and do it in the most efficient possible way. But if you are implementing a hyperparameter search library, or one of these applications, you will often have a good understanding of the structure of the computation and what kinds of things could potentially happen in parallel. So we've found that people generally have the right intuition for how to scale up their applications, if they've developed the applications.
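A small hedged sketch of the nested parallelism described here, assuming a recent Ray release: a remote task launches further remote tasks of its own, so each outer trial can fan out its own distributed work. The trial and evaluation logic is hypothetical.

```python
import ray

ray.init()

@ray.remote
def evaluate(config, seed):
    # Stand-in for the "inner" distributed work of one trial (e.g. one rollout).
    return (config["lr"] * seed) ** 2

@ray.remote
def run_trial(config):
    # Tasks can launch other tasks, giving nested parallelism:
    # the outer trial fans out and gathers its own remote evaluations.
    scores = ray.get([evaluate.remote(config, seed) for seed in range(4)])
    return sum(scores) / len(scores)

configs = [{"lr": lr} for lr in (0.001, 0.01, 0.1)]
print(ray.get([run_trial.remote(c) for c in configs]))
```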
Tobias Macey
0:32:41
What are some of the most interesting or unexpected ways that you have seen Ray used and some of the types of applications that people have built with it?
Robert Nishihara
0:32:49
You know, it's all over the place. I think one of the most exciting applications I've seen is a large-scale online learning application used by Ant Financial, which is the largest fintech company in the world. They're using Ray for a number of different use cases, ranging from fraud detection to recommendations. These are applications that are really taking advantage of Ray's generality, because they combine everything from processing streaming data, to training models, to serving models, to doing hyperparameter search, to launching tasks from other tasks, and being able to do both batch computation and interactive computation. So these are some of the most stressful workloads.
Robert Nishihara
0:33:35
They were running these applications in production during Double 11, which is the biggest shopping festival in the world. So that's one of the most exciting applications. We're also really excited to see a number of companies using our libraries — for reinforcement learning, tuning, and distributed training — to build their own internal machine learning platforms, as well as to scale up their machine learning applications. This is everything from trading stocks, to designing airplane wings, to optimizing supply chains, and things like that.
Tobias Macey
0:34:16
What do you have planned for the future of Ray and its associated ecosystem?
Robert Nishihara
0:34:20
Like I mentioned, we think that distributed computing is becoming increasingly important and will be essential for many different types of applications. At the same time, we think that there's no good way to do it today — doing distributed computing has a big barrier to entry. For example, if you're a biologist, you might know Python and you can write some Python scripts to solve your biology problems, but if you have a lot of data, or you need to run things faster or scale things up, then you not only have to be an expert in biology, but you also have to be an expert in distributed computing, and that's just kind of too much of a barrier to entry for most people. So the goal really is to make it as easy to program clusters of machines as it is to program on your laptop — to let people take advantage of these kinds of cluster resources and take advantage of all the advances in machine learning, but without having to really know anything beyond just Python. And given how many people are learning Python these days and how rapidly Python is growing, we think that'll be a lot of people.
Tobias Macey
0:35:25
And also, given the fact that the core elements of Ray are written in C++, in terms of the capabilities of the task distribution and the actual execution, I imagine that there's the possibility for expanding beyond Python as well.
Robert Nishihara
0:35:42
Oh, yeah, I forgot to mention that. We already support Java, and there are companies using Java — doing distributed Java through Ray — in production. Python, of course, is our biggest focus, but we're also working on support for distributed C++ through Ray, and there are even some applications of Ray that combine both Python and Java. A lot of companies have their machine learning in Python, but their business logic is in Java, and they want to be able to call their machine learning from the application side. They're able to do that using Ray, which is pretty exciting. And of course, Ray is designed in a language-agnostic way, and there's the potential to add support for many more languages.
Tobias Macey
0:36:27
Are there any other aspects of the Ray project itself, or the associated libraries, or the overall space of distributed computation, or the work that you're doing with Anyscale, that we didn't discuss yet that you'd like to cover before we close out the show?
Robert Nishihara
0:36:40
So one answer to that — there are a lot of things we're working on — is that while we are trying to make it as easy as possible to develop distributed applications, developing distributed applications is not enough on its own, because once you develop your application and you run it, and then you run into some bug, you have to debug your application. Debugging distributed applications is notoriously difficult. So one thing we're focused on, that is very important to us, is to make the experience of debugging cluster applications as easy or easier than debugging applications on your laptop. A big part of that is thinking about the user experience — it's about providing the right visualizations and surfacing the right information so that developers don't have to go looking for it, it's just right there. It's about providing great tools for profiling and understanding the performance characteristics of your application. One thing we've developed, which is still an experimental feature but is really exciting, is that through the Ray dashboard, if you're running a live Ray application, you can essentially click a button to profile a running actor or a running task and get a flame graph showing you where time is being spent in that actor and what the computational bottlenecks are. Normally, to profile your code, even on your laptop, you have to look up the documentation for some profiling library, you have to instrument your code, you have to rerun your code — it's a bit involved. We're trying to make it so that even if you didn't think ahead of time that you would want to profile your code, after you've already started running it you can just click a button and get the profiling information. Things like this have the potential to provide a really great debugging experience.
Tobias Macey
0:38:31
Yeah, that's definitely a pretty impressive and important element of the overall user experience, so I'm glad that you're considering that and trying to integrate it into the core runtime.
Robert Nishihara
0:38:41
Yeah, we think it'll be pretty fantastic.
Tobias Macey
0:38:44
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing or get involved with the project, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to follow up on a recent pick on the topic of board games with a couple of specifics. I recently picked up the Dungeons & Dragons Castle Ravenloft board game, which has a simplified mechanism — it doesn't have the whole Dungeon Master and everything — and it's something that you can play with just a couple of people, and it's pretty enjoyable, so I had fun playing that recently with my kids. I also picked up another game called One Deck Dungeon, which you can play with one or two people. It's just a card game that you can quickly go through to have a quick little dungeon crawl and try to survive through to the boss — an interesting game mechanic there. So I recommend picking up either of those if you need something to do with some spare time. And with that, I'll pass it to you, Robert — do you have any picks this week?
Robert Nishihara
0:39:39
I recently read
0:39:41
The Everything Store, which is a history of Amazon and some stories about Amazon during the early days, and I really enjoyed reading that.
Tobias Macey
0:39:50
I'll definitely have to take a look at that as well. So, thank you very much for taking the time today to join me and discuss the work that you're doing with Ray and Anyscale and the associated libraries that you're building on top of all of that. It's definitely a very interesting project that I've been hearing a lot about recently, so I'm happy to have had the chance to talk to you, and I appreciate all the work that you're doing on that. Thank you again, and I hope you enjoy the rest of your day.
Robert Nishihara
0:40:13
Thanks so much and you too.
Tobias Macey
0:40:18
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management. Visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it — email [email protected] with your story. To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Liked it? Take a second to support Podcast.__init__ on Patreon!