Pandas is a Swiss Army knife for data processing in Python, but it has long been difficult to customize. The latest release adds an extension interface for defining custom data types with namespaced APIs. This makes it possible to build and combine domain-specific use cases and alternative storage mechanisms. In this episode Tom Augspurger describes how the new ExtensionArray works, how it came to be, and how you can start building your own extensions today.
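To give a feel for what a namespaced API looks like, here is a minimal sketch using the accessor registration hook in `pandas.api.extensions`, assuming pandas 0.23 or later. The accessor name `temp`, the class `TemperatureAccessor`, and its method are hypothetical and chosen only for illustration. A full ExtensionArray subclass requires considerably more code and is not shown here.

```python
import pandas as pd


# Register a custom namespace on every Series.
# "temp" and TemperatureAccessor are hypothetical names for this sketch.
@pd.api.extensions.register_series_accessor("temp")
class TemperatureAccessor:
    def __init__(self, series):
        # pandas passes in the Series that the accessor is attached to
        self._series = series

    def to_fahrenheit(self):
        # Treat the values as degrees Celsius and convert them
        return self._series * 9 / 5 + 32


celsius = pd.Series([0.0, 100.0])
fahrenheit = celsius.temp.to_fahrenheit()
print(fahrenheit.tolist())
```

Keeping the methods under their own namespace means a third-party package can add domain-specific behavior without colliding with the built-in Series methods.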
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 200Gbit network, all controlled by a brand new API, you’ve got everything you need to scale up. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute.
- To get worry-free releases download GoCD, the open source continuous delivery server built by ThoughtWorks. You can use their pipeline modeling and value stream map to build, control, and monitor every step from commit to deployment in one place. And with their new Kubernetes integration it’s even easier to deploy and scale your build agents. Go to podcastinit.com/gocd to learn more about their professional support services and enterprise add-ons.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email firstname.lastname@example.org.
- To help other people find the show please leave a review on iTunes or Google Play Music, tell your friends and co-workers, and share it on social media.
- Your host as usual is Tobias Macey and today I’m interviewing Tom Augspurger about the extension interface for Pandas data frames and the use cases that it enables.
- How did you get introduced to Python?
- Most people are familiar with Pandas, but can you describe at a high level the new extension interface?
- What is the story behind the implementation of this functionality?
- Prior to this interface what was the option for anyone who wanted to extend Pandas?
- What are some of the new data types that are available as external packages?
- What are some of the unique use cases that they enable?
- How is the new interface implemented within Pandas?
- What were the most challenging or difficult aspects of building this new functionality?
- What are some of the more interesting possibilities that you are aware of for new extension types?
- What are the limitations of the interface for libraries that add new array functionality?
- What is the next major change or improvement that you would like to add in Pandas?
Keep In Touch
- Original IP Address proposal
- Mid-implementation blog post
- Wes McKinney
- Array ufunc