Data Science

Version Control For Your Machine Learning Projects - Episode 206

Summary

Version control has become table stakes for any software team, but for machine learning projects there has been no good answer for tracking all of the data that goes into building and training models, and the output of the models themselves. To address that need, Dmitry Petrov built the Data Version Control project, known as DVC. In this episode he explains how it simplifies communication between data scientists, reduces duplicated effort, and eases concerns around reproducing and rebuilding models at different stages of the project’s lifecycle. If you work as part of a team that is building machine learning models or other data intensive analysis, then make sure to give this a listen and then start using DVC today.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Bots and automation are taking over whole categories of online interaction. Discover.bot is an online community designed to serve as a platform-agnostic digital space for bot developers and enthusiasts of all skill levels to learn from one another, share their stories, and move the conversation forward together. They regularly publish guides and resources to help you learn about topics such as bot development, using them for business, and the latest in chatbot news. For newcomers to the space, they have the Beginners Guide To Bots that will teach you the basics of how bots work, what they can do, and where they are developed and published. To help you choose the right framework and avoid the confusion about which NLU features and platform APIs you will need, they have compiled a list of the major options and how they compare. Go to pythonpodcast.com/discoverbot today to get started and thank them for their support of the show.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Dmitry Petrov about DVC, an open source version control system for machine learning projects

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what DVC is and how it got started?
  • How do the needs of machine learning projects differ from other software applications in terms of version control?
  • Can you walk through the workflow of a project that uses DVC?
    • What are some of the main ways that it differs from your experience building machine learning projects without DVC?
  • In addition to the data that is used for training, the code that generates the model, and the end result there are other aspects such as the feature definitions and hyperparameters that are used. Can you discuss how those factor into the final model and any facilities in DVC to track the values used?
  • In addition to version control for software applications, there are a number of other pieces of tooling that are useful for building and maintaining healthy projects such as linting and unit tests. What are some of the adjacent concerns that should be considered when building machine learning projects?
  • What types of metrics do you track in DVC and how are they collected?
    • Are there specific problem domains or model types that require tracking different metric formats?
  • In the documentation it mentions that the data files live outside of git and can be managed in external storage systems. I’m wondering if there are any plans to integrate with systems such as Quilt or Pachyderm that provide versioning of data natively and what would be involved in adding that support?
  • What was your motivation for implementing this system in Python?
    • If you were to start over today what would you do differently?
  • Being a venture backed startup that is producing open source products, what is the value equation that makes it worthwhile for your investors?
  • What have been some of the most interesting, challenging, or unexpected aspects of building DVC?
  • What do you have planned for the future of DVC?

Keep In Touch

Picks

  • Tobias
  • Dmitry
    • Go outside and get some fresh air 🙂

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Wes McKinney's Career In Python For Data Analysis - Episode 203

Summary

Python has become one of the dominant languages for data science and data analysis. Wes McKinney has been working for a decade to make tools that are easy and powerful, starting with the creation of Pandas, and eventually leading to his current work on Apache Arrow. In this episode he discusses his motivation for this work, what he sees as the current challenges to be overcome, and his hopes for the future of the industry.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Check out the Practical AI podcast from our friends at Changelog Media to learn and stay up to date with what’s happening in AI
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with O’Reilly Media for the Strata conference in San Francisco on March 25th and the Artificial Intelligence conference in NYC on April 15th. Here in Boston, starting on May 17th, you still have time to grab a ticket to the Enterprise Data World, and from April 30th to May 3rd is the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Your host as usual is Tobias Macey and today I’m interviewing Wes McKinney about his contributions to the Python community and his current projects to make data analytics easier for everyone

Interview

  • Introductions
  • How did you get introduced to Python?
  • You have spent a large portion of your career on building tools for data science and analytics in the Python ecosystem. What is your motivation for focusing on this problem domain?
  • Having been an open source author and contributor for many years now, what are your current thoughts on paths to sustainability?
  • What are some of the common challenges pertaining to data analysis that you have experienced in the various work environments and software projects that you have been involved in?
    • What area(s) of data science and analytics do you find are not receiving the attention that they deserve?
  • Recently there has been a lot of focus and excitement around the capabilities of neural networks and deep learning. In your experience, what are some of the shortcomings or blind spots to that class of approach that would be better served by other classes of solution?
  • Your most recent work is focused on the Arrow project for improving interoperability across languages. What are some of the cases where a Python developer would want to incorporate capabilities from other runtimes?
    • Do you think that we should be working to replicate some of those capabilities into the Python language and ecosystem, or is that wasted effort that would be better spent elsewhere?
  • Now that Pandas has been in active use for over a decade and you have had the opportunity to get some space from it, what are your thoughts on its success?
    • With the perspective that you have gained in that time, what would you do differently if you were starting over today?
  • You are best known for being the creator of Pandas, but can you list some of the other achievements that you are most proud of?
  • What projects are you most excited to be working on in the near to medium future?
  • What are your grand ambitions for the future of the data science community, both in and outside of the Python ecosystem?
  • Do you have any parting advice for active or aspiring data scientists, or resources that you would like to recommend?

The Past, Present, and Future of Deep Learning In PyTorch - Episode 202

Summary

The current buzz in data science and big data is around the promise of deep learning, especially when working with unstructured data. One of the most popular frameworks for building deep learning applications is PyTorch, in large part because of its focus on ease of use. In this episode Adam Paszke explains how he started the project, how it compares to other frameworks in the space such as TensorFlow and CNTK, and how it has evolved to support deploying models into production and on mobile devices.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Check out the Practical AI podcast from our friends at Changelog Media to learn and stay up to date with what’s happening in AI
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with O’Reilly Media for the Strata conference in San Francisco on March 25th and the Artificial Intelligence conference in NYC on April 15th. Here in Boston, starting on May 17th, you still have time to grab a ticket to the Enterprise Data World, and from April 30th to May 3rd is the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Your host as usual is Tobias Macey and today I’m interviewing Adam Paszke about PyTorch, an open source deep learning platform that provides a seamless path from research prototyping to production deployment

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what deep learning is and how it relates to machine learning and artificial intelligence?
  • Can you explain what PyTorch is and your motivation for creating it?
    • Why was it important for PyTorch to be open source?
  • There is currently a large and growing ecosystem of deep learning tools built for Python. Can you describe the current landscape and how PyTorch fits in relation to projects such as TensorFlow and CNTK?
    • What are some of the ways that PyTorch is different from TensorFlow and CNTK, and what are the areas where these frameworks are converging?
  • How much knowledge of machine learning, artificial intelligence, or neural network topologies is necessary to make use of PyTorch?
    • What are some of the foundational topics that are most useful to know when getting started with PyTorch?
  • Can you describe how PyTorch is architected/implemented and how it has evolved since you first began working on it?
    • You recently reached the 1.0 milestone. Can you talk about the journey to that point and the goals that you set for the release?
  • What are some of the other components of the Python ecosystem that are most commonly incorporated into projects based on PyTorch?
  • What are some of the most novel, interesting, or unexpected uses of PyTorch that you have seen?
  • What are some cases where PyTorch is the wrong choice for a problem?
  • What is the process for incorporating these new techniques and discoveries into the PyTorch framework?
    • What are the areas of active research that you are most excited about?
  • What are some of the most interesting/useful/unexpected/challenging lessons that you have learned in the process of building and maintaining PyTorch?
  • What do you have planned for the future of PyTorch?

Polyglot: Multi-Lingual Natural Language Processing with Rami Al-Rfou - Episode 190

Summary

Using computers to analyze text can produce useful and inspirational insights. However, when working with multiple languages the capabilities of existing models are severely limited. In order to help overcome this limitation Rami Al-Rfou built Polyglot. In this episode he explains his motivation for creating a natural language processing library with support for a vast array of languages, how it works, and how you can start using it for your own projects. He also discusses current research on multi-lingual text analytics, how he plans to improve Polyglot in the future, and how it fits in the Python ecosystem.

Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. Podcast.__init__ listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a trial.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Rami Al-Rfou about Polyglot, a natural language pipeline with support for an impressive number of languages

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by describing what Polyglot is and your reasons for starting the project?
  • What are the types of use cases that Polyglot enables which would be impractical with something such as NLTK or SpaCy?
  • A majority of NLP libraries have a limited set of languages that they support. What is involved in adding support for a given language to a natural language tool?
    • What is involved in adding a new language to Polyglot?
    • Which families of languages are the most challenging to support?
  • What types of operations are supported and how consistently are they supported across languages?
  • How is Polyglot implemented?
  • Is there any capacity for integrating Polyglot with other tools such as SpaCy or Gensim?
  • How much domain knowledge is required to be able to effectively use Polyglot within an application?
  • What are some of the most interesting or unique uses of Polyglot that you have seen?
  • What have been some of the most complex or challenging aspects of building Polyglot?
  • What do you have planned for the future of Polyglot?
  • What are some areas of NLP research that you are excited for?

Of Checklists, Ethics, and Data with Emily Miller and Peter Bull - Episode 184

Summary

As data science becomes more widespread and has a bigger impact on the lives of people, it is important that those projects and products are built with a conscious consideration of ethics. Keeping ethical principles in mind throughout the lifecycle of a data project helps to reduce the overall effort of preventing negative outcomes from the use of the final product. Emily Miller and Peter Bull of Driven Data have created Deon to improve the communication and conversation around ethics among and between data teams. It is a Python project that generates a checklist of common concerns for data oriented projects at the various stages of the lifecycle where they should be considered. In this episode they discuss their motivation for creating the project, the challenges and benefits of maintaining such a checklist, and how you can start using it today.

Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to scale up. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at podcastinit.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Emily Miller and Peter Bull about Deon, an ethics checklist for data projects

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by describing what Deon is and your motivation for creating it?
  • Why a checklist, specifically? What’s the advantage of this over an oath, for example?
  • What is unique to data science in terms of the ethical concerns, as compared to traditional software engineering?
  • What is the typical workflow for a team that is using Deon in their projects?
  • Deon ships with a default checklist but allows for customization. What are some common addendums that you have seen?
    • Have you received pushback on any of the default items?
  • How does Deon simplify communication around ethics across team boundaries?
  • What are some of the most often overlooked items?
  • What are some of the most difficult ethical concerns to comply with for a typical data science project?
  • How has Deon helped you at Driven Data?
  • What are the customer facing impacts of embedding a discussion of ethics in the product development process?
  • Some of the items on the default checklist coincide with regulatory requirements. Are there any cases where regulation is in conflict with an ethical concern that you would like to see practiced?
  • What are your hopes for the future of the Deon project?

How Python Is Used To Build A Startup At Wanderu with Chris Kirkos and Matt Warren - Episode 183

Summary

The breadth of use cases that Python supports, coupled with the level of productivity that it provides through its ease of use, has contributed to the incredible popularity of the language. To explore the ways that it can contribute to the success of a young and growing startup, two of the lead engineers at Wanderu discuss their experiences in this episode. Matt Warren, the technical operations lead, explains the ways that he is using Python to build and scale the infrastructure that Wanderu relies on, as well as the ways that he deploys and runs the various Python applications that power the business. Chris Kirkos, the lead software architect, describes how the original Django application has grown into a suite of microservices, where they have opted to use a different language and why, and how Python is still being used for critical business needs. This is a great conversation for understanding the business impact of the Python language and ecosystem.

Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to scale up. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at podcastinit.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Matt Warren and Chris Kirkos about the ways that they are using Python at Wanderu

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by describing what Wanderu does?
    • How is the platform architected?
  • What are the broad categories of problems that you are addressing with Python?
  • What are the areas where you chose to use a different language or service?
  • What ratio of new projects and features are implemented using Python?
    • How much of that decision process is influenced by the fact that you already have so much pre-existing Python code?
    • For the projects where you don’t choose Python, what are the reasons for going elsewhere?
  • What are some of the limitations of Python that you have encountered while working at Wanderu?
  • What are some of the places that you were surprised to find Python in use at Wanderu?
  • What have you enjoyed most about working with Python?
    • What are some of the sharp edges that you would like to see smoothed over in future versions of the language?
  • What is the most challenging bug that you have dealt with at Wanderu that was attributable in some sense to the fact that the code was written in Python?
  • If you were to start over today on any of the pieces of the Wanderu platform, are there any that you would write in a different language?
  • Which libraries have been the most useful for your work at Wanderu?
    • Which ones have caused you the most pain?

Understanding Machine Learning Through Visualizations with Benjamin Bengfort and Rebecca Bilbro - Episode 166

Summary

Machine learning models are often inscrutable and it can be difficult to know whether you are making progress. To improve feedback and speed up iteration cycles Benjamin Bengfort and Rebecca Bilbro built Yellowbrick to easily generate visualizations of model performance. In this episode they explain how to use Yellowbrick in the process of building a machine learning project, how it aids in understanding how different parameters impact the outcome, and the improved understanding among teammates that it creates. They also explain how it integrates with the scikit-learn API, the difficulty of producing effective visualizations, and future plans for improvement and new features.

Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to scale up. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute.
  • To get worry-free releases download GoCD, the open source continuous delivery server built by ThoughtWorks. You can use their pipeline modeling and value stream map to build, control and monitor every step from commit to deployment in one place. And with their new Kubernetes integration it’s even easier to deploy and scale your build agents. Go to podcastinit.com/gocd to learn more about their professional support services and enterprise add-ons.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Your host as usual is Tobias Macey and today I’m interviewing Rebecca Bilbro and Benjamin Bengfort about Yellowbrick, a scikit extension to use visualizations for assisting with model selection in your data science projects.

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you describe the use case for Yellowbrick and how the project got started?
  • What is involved in visualizing scikit-learn models?
    • What kinds of information do the visualizations convey?
    • How do they aid in understanding what is happening in the models?
  • How much direction does Yellowbrick provide in terms of knowing which visualizations will be helpful in various circumstances?
  • What does the workflow look like for someone using Yellowbrick while iterating on a data science project?
  • What are some of the common points of confusion that your students encounter when learning data science and how has Yellowbrick assisted in achieving understanding?
  • How is Yellowbrick implemented and how has the design changed over the lifetime of the project?
  • What would be required to integrate with other visualization libraries and what benefits (if any) might that provide?
    • What about other ML frameworks?
  • What are some of the most challenging or unexpected aspects of building and maintaining Yellowbrick?
  • What are the limitations or edge cases for Yellowbrick?
  • What do you have planned for the future of Yellowbrick?
  • Beyond visualization, what are some of the other areas that you would like to see innovation in how data science is taught and/or conducted to make it more accessible?

Pandas Extension Arrays with Tom Augspurger - Episode 164

Summary

Pandas is a Swiss Army knife for data processing in Python, but it has long been difficult to customize. In the latest release there is now an extension interface for adding custom data types with namespaced APIs. This allows for building and combining domain specific use cases and alternative storage mechanisms. In this episode Tom Augspurger describes how the new ExtensionArray works, how it came to be, and how you can start building your own extensions today.

Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 200Gbit network, all controlled by a brand new API you’ve got everything you need to scale up. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute.
  • To get worry-free releases download GoCD, the open source continuous delivery server built by ThoughtWorks. You can use their pipeline modeling and value stream map to build, control and monitor every step from commit to deployment in one place. And with their new Kubernetes integration it’s even easier to deploy and scale your build agents. Go to podcastinit.com/gocd to learn more about their professional support services and enterprise add-ons.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Your host as usual is Tobias Macey and today I’m interviewing Tom Augspurger about the extension interface for Pandas data frames and the use cases that it enables

Interview

  • Introductions
  • How did you get introduced to Python?
  • Most people are familiar with Pandas, but can you describe at a high level the new extension interface?
    • What is the story behind the implementation of this functionality?
    • Prior to this interface what was the option for anyone who wanted to extend Pandas?
  • What are some of the new data types that are available as external packages?
    • What are some of the unique use cases that they enable?
  • How is the new interface implemented within Pandas?
  • What were the most challenging or difficult aspects of building this new functionality?
  • What are some of the more interesting possibilities that you are aware of for new extension types?
  • What are the limitations of the interface for libraries that add new array functionality?
  • What is the next major change or improvement that you would like to add in Pandas?

Keep In Touch

Picks

Links


Asking Questions From Data Using Active Learning with Tivadar Danka - Episode 162

Summary

One of the challenges of machine learning is obtaining large enough volumes of well-labelled data. One approach to reducing the effort required for labelling data sets is active learning, in which outliers are identified and labelled by domain experts. In this episode Tivadar Danka describes how he built modAL to bring active learning to bioinformatics. He uses it for human-in-the-loop training of models that detect cell phenotypes in massive unlabelled datasets. He explains how the library works, how he designed it to be modular for a broad set of use cases, and how you can use it for training models of your own.
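
To make the idea of a labelling loop concrete, here is a minimal pure-Python sketch of uncertainty sampling, one common active learning query strategy. It does not use modAL or reflect its API. The one-dimensional threshold model, the `oracle` function standing in for the domain expert, and the hidden boundary at 3.0 are all assumptions invented for illustration.

```python
def fit_threshold(labelled):
    """Toy model: put the decision boundary midway between the largest
    negative example and the smallest positive example."""
    negatives = [x for x, is_positive in labelled if not is_positive]
    positives = [x for x, is_positive in labelled if is_positive]
    return (max(negatives) + min(positives)) / 2


def most_uncertain(pool, threshold):
    """Uncertainty sampling: pick the unlabelled point nearest the boundary,
    which is where this toy model is least sure of its prediction."""
    return min(pool, key=lambda x: abs(x - threshold))


def oracle(x):
    """Stand-in for the human expert (assumed true boundary at 3.0)."""
    return x >= 3.0


labelled = [(0.0, False), (10.0, True)]      # small labelled seed set
pool = [1.0, 2.0, 2.8, 3.2, 4.5, 7.0, 9.0]   # unlabelled examples

queried = []
for _ in range(3):                           # labelling budget: 3 questions
    threshold = fit_threshold(labelled)
    x = most_uncertain(pool, threshold)
    pool.remove(x)
    labelled.append((x, oracle(x)))          # ask the expert for a label
    queried.append(x)

final_threshold = fit_threshold(labelled)
print(queried, final_threshold)
```

The loop spends its labelling budget on points near the current boundary rather than on points the model already classifies with confidence, which is what keeps the number of questions put to the expert small.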

Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to scale up. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute.
  • To get worry-free releases download GoCD, the open source continuous delivery server built by ThoughtWorks. You can use their pipeline modeling and value stream map to build, control and monitor every step from commit to deployment in one place. And with their new Kubernetes integration it’s even easier to deploy and scale your build agents. Go to podcastinit.com/gocd to learn more about their professional support services and enterprise add-ons.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected].
  • To help other people find the show please leave a review on iTunes or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Your host as usual is Tobias Macey and today I’m interviewing Tivadar Danka about modAL, a modular active learning framework for Python 3

Interview

  • Introductions
  • How did you get introduced to Python?
  • What is active learning?
    • How does it differ from other approaches to machine learning?
  • What is modAL and what was your motivation for starting the project?
  • For someone who is using modAL, what does a typical workflow look like to train their models?
  • How do you avoid oversampling and causing the human in the loop to become overwhelmed with labeling requirements?
  • What are the most challenging aspects of building and using modAL?
  • What do you have planned for the future of modAL?

Keep In Touch

Picks

Links


Scaling Deep Learning Using Polyaxon with Mourad Mourafiq - Episode 158

Summary

With the release of libraries such as TensorFlow, PyTorch, scikit-learn, and MXNet, it is easier than ever to start a deep learning project. Unfortunately, it is still difficult to manage scaling and reproduction of training for these projects. Mourad Mourafiq built Polyaxon on top of Kubernetes to address this shortcoming. In this episode he shares his reasons for starting the project, how it works, and how you can start using it today.

Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 200Gbit network, all controlled by a brand new API you’ve got everything you need to scale up. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute.
  • Finding a bug in production is never a fun experience, especially when your users find it first. Airbrake error monitoring ensures that you will always be the first to know so you can deploy a fix before anyone is impacted. With open source agents for Python 2 and 3 it’s easy to get started, and the automatic aggregations, contextual information, and deployment tracking ensure that you don’t waste time pinpointing what went wrong. Go to podcastinit.com/airbrake today to sign up and get your first 30 days free, and 50% off 3 months of the Startup plan.
  • To get worry-free releases download GoCD, the open source continuous delivery server built by ThoughtWorks. You can use their pipeline modeling and value stream map to build, control and monitor every step from commit to deployment in one place. And with their new Kubernetes integration it’s even easier to deploy and scale your build agents. Go to podcastinit.com/gocd to learn more about their professional support services and enterprise add-ons.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected].
  • Your host as usual is Tobias Macey and today I’m interviewing Mourad Mourafiq about Polyaxon, a platform for building, training and monitoring large scale deep learning applications.

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you give a quick overview of what Polyaxon is and your motivation for creating it?
  • What is a typical workflow for building and testing a deep learning application?
  • How is Polyaxon implemented?
    • How has the internal architecture evolved since you first started working on it?
    • What is unique to deep learning workloads that makes it necessary to have a dedicated tool for deploying them?
    • What does Polyaxon add on top of the existing functionality in Kubernetes?
  • It can be difficult to build a Docker container that holds all of the necessary components for a complex application. What are some tips or best practices for creating containers to be used with Polyaxon?
  • What are the relative tradeoffs of the various deep learning frameworks that you support?
  • For someone who is getting started with Polyaxon what does the workflow look like?
    • What is involved in migrating existing projects to run on Polyaxon?
  • What have been the most challenging aspects of building Polyaxon?
  • What are your plans for the future of Polyaxon?

Keep In Touch

Picks

Links
