Build Composable And Reusable Feature Engineering Pipelines with Feature-Engine - Episode 338

Summary

Every machine learning model has to start with feature engineering. This is the process of combining input variables into a more meaningful signal for the problem that you are trying to solve. Many times this process can lead to duplicating code from previous projects, or introducing technical debt in the form of poorly maintained feature pipelines. In order to make the practice more manageable Soledad Galli created the feature-engine library. In this episode she explains how it has helped her and others build reusable transformations that can be applied in a composable manner with your scikit-learn projects. She also discusses the importance of understanding the data that you are working with and the domain in which your model will be used to ensure that you are selecting the right features.

Do you want to try out some of the tools and applications that you heard about on Podcast.__init__? Do you have a side project that you want to share with the world? With Linode’s managed Kubernetes platform it’s now even easier to get started with the latest in cloud technologies. With the combined power of the leading container orchestrator and the speed and reliability of Linode’s object storage, node balancers, block storage, and dedicated CPU or GPU instances, you’ve got everything you need to scale up. Go to pythonpodcast.com/linode today and get a $100 credit to launch a new cluster, run a server, upload some data, or… And don’t forget to thank them for being a long time supporter of Podcast.__init__!


Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Your host as usual is Tobias Macey and today I’m interviewing Soledad Galli about feature-engine, a Python library to engineer features for use in machine learning models

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you describe what feature-engine is and the story behind it?
  • What are the complexities that are inherent to feature engineering?
    • What are the problems that are introduced due to incidental complexity and technical debt?
  • What was missing in the available set of libraries/frameworks/toolkits for feature engineering that you are solving for with feature-engine?
  • What are some examples of the types of domain knowledge that are needed to effectively build features for an ML model?
  • Given the fact that features are constructed through methods such as normalizing data distributions, imputing missing values, combining attributes, etc. what are some of the potential risks that are introduced by incorrectly applied transformations or invalid assumptions about the impact of these manipulations?
  • Can you describe how feature-engine is implemented?
    • How have the design and goals of the project changed or evolved since you started working on it?
  • What (if any) difference exists in the feature engineering process for frameworks like scikit-learn as compared to deep learning approaches using PyTorch, Tensorflow, etc.?
  • Can you describe the workflow of identifying and generating useful features during model development?
    • What are the tools that are available for testing and debugging of the feature pipelines?
  • What do you see as the potential benefits or drawbacks of integrating feature-engine with a feature store such as Feast or Tecton?
  • What are the most interesting, innovative, or unexpected ways that you have seen feature-engine used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on feature-engine?
  • When is feature-engine the wrong choice?
  • What do you have planned for the future of feature-engine?

Keep In Touch

Picks

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com) with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers

Links

The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA

Liked it? Take a second to support Podcast.__init__ on Patreon!