When you start working on a data project there are always unknowns that you have to explore. One of them is the total volume of data that you will eventually need to handle, along with the speed and scale at which it will need to be processed. If you optimize for scale too early, the complexity of distributed systems raises the barrier to entry, but if you invest in a lot of engineering up front, it can be challenging to refactor for scale later. Modin is a project that aims to remove that decision by letting you seamlessly replace your existing Pandas code and scale across CPU cores or across a cluster of machines. In this episode Devin Petersohn explains why he started working on this problem, how Modin is architected to allow for a smooth escalation from small to large volumes of data and compute, and how you can start using it today to accelerate your Pandas workflows.
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Devin Petersohn about Modin, a Pandas-compatible dataframe library for datasets from 1MB to 1TB+.
- How did you get introduced to Python?
- Can you describe what Modin is and the story behind it?
- Why study dataframes?
- How do dataframes compare to databases?
- What can you do in a dataframe that you couldn’t in a database?
- What are your overall goals for the Modin project?
- Who are the target users of Modin and how does that influence your prioritization of features?
- What are some of the API inconsistencies that you have had to abstract and work around between Pandas, Ray, and Dask to give users a seamless experience?
- What are some of the considerations in terms of capabilities or user experience that will influence whether to use Ray or Dask as the execution engine?
- Can you describe how Modin is implemented?
- How has the constraint of replicating the Pandas API influenced your architectural choices?
- What are the most complex or challenging Pandas APIs to replicate in Modin?
- In addition to the core Pandas API you have also added experimental features such as SQL support and a spreadsheet interface. How have those capabilities affected the range of potential use cases and end users?
- What are some of the complexities that come from acting as a middleware between the Pandas API and the Ray and Dask frameworks?
- What are some of the initial ideas or assumptions that you had about the design or utility of Modin that have been challenged as you worked through building and releasing it?
- What are the most interesting, innovative, or unexpected ways that you have seen Modin used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Modin?
- When is Modin the wrong choice?
- What do you have planned for the future of Modin?
Keep In Touch
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast, for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email firstname.lastname@example.org with your story.
- To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
- UC Berkeley