Data Science

Build Your Own Knowledge Graph With Zincbase - Episode 223

Summary

Computers are excellent at following detailed instructions, but they have no capacity for understanding the information that they work with. Knowledge graphs are a way to approximate that capability by building connections between elements of data, allowing us to discover relationships among disparate information sources that were previously unknown. In our day-to-day work we encounter many instances of knowledge graphs, but building them has long been a difficult endeavor. In order to make this technology more accessible, Tom Grek built Zincbase. In this episode he explains his motivations for starting the project, how he uses it in his daily work, and how you can use it to create your own knowledge engine and begin discovering new insights of your own.
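
As a rough illustration of the idea, here is a minimal sketch of storing facts and rules in Zincbase and querying them, written in the Prolog-style syntax its README demonstrates. The entities and the rule are invented for this example, so treat the exact method names and syntax as something to verify against the project documentation rather than a definitive reference.

    # A minimal, hypothetical sketch of building and querying a small
    # knowledge graph with Zincbase. The facts and rule below are made up;
    # check the project docs before relying on these exact method names.
    from zincbase import KB

    kb = KB()

    # Store individual facts as predicate(subject, object) triples.
    kb.store('works_at(tom, primer)')
    kb.store('works_at(jane, primer)')
    kb.store('lives_in(tom, san_francisco)')

    # Store a rule: anyone who works at primer is a colleague of tom.
    kb.store('colleague_of(X, tom) :- works_at(X, primer)')

    # Query with a variable; each answer binds X to a matching entity.
    for answer in kb.query('colleague_of(X, tom)'):
        print(answer['X'])  # e.g. jane (and possibly tom itself)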

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Podcast.__init__ listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a trial.
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Tom Grek about knowledge graphs, when they’re useful, and his project Zincbase that makes them easier to build

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what a knowledge graph is and some of the ways that they are used?
    • How did you first get involved in the space of knowledge graphs?
  • You have built the Zincbase project for building and querying knowledge graphs. What was your motivation for creating this project and what are some of the other tools that are available to perform similar tasks?
  • Can you describe how Zincbase is implemented and some of the ways that it has evolved since you first began working on it?
    • What are some of the assumptions that you had at the outset of the project which have been challenged or updated in the process of working on and with it?
  • What are some of the common challenges when building or using knowledge graphs?
  • How has the domain of knowledge graphs changed in recent years as new approaches to entity resolution and data processing have been introduced?
  • Can you talk through a use case and workflow for using Zincbase to design and populate a knowledge graph?
  • What are some of the ways that you are using Zincbase in your own projects?
  • What have you found to be the most challenging/interesting/unexpected lessons that you have learned in the process of building and maintaining Zincbase?
  • What do you have planned for the future of the project?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Open Source Automated Machine Learning With MindsDB - Episode 218

Summary

Machine learning is growing in popularity and capability, but for a majority of people it is still a black box that they don’t fully understand. The team at MindsDB is working to change this state of affairs by creating an open source tool that is easy to use without a background in data science. By simplifying the training and use of neural networks, and making their logic explainable, they hope to bring AI capabilities to more people and organizations. In this interview George Hosu and Jorge Torres explain how MindsDB is built, how to use it for your own purposes, and how they view the current landscape of AI technologies. This is a great episode for anyone who is interested in experimenting with machine learning and artificial intelligence. Give it a listen and then try MindsDB for yourself.
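
To give a feel for how little ceremony the project aims for, here is a rough sketch of the train-and-predict workflow modeled on MindsDB’s early Python API. The CSV file and column names are hypothetical, and the interface has changed between releases, so consult the current documentation before copying this verbatim.

    # A hedged sketch of MindsDB's workflow; the file and column names are
    # invented, and the API may differ in newer releases.
    from mindsdb import Predictor

    predictor = Predictor(name='home_rentals')

    # Train directly from a CSV, naming the target column. Type inference,
    # encoding, and network construction all happen internally.
    predictor.learn(from_data='home_rentals.csv', to_predict='rental_price')

    # Ask for a prediction from a partial row of inputs.
    result = predictor.predict(when={'number_of_rooms': 2, 'sqft': 700})
    print(result[0])  # predicted rental_price plus confidence information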

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Podcast.__init__ listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a trial.
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing George Hosu and Jorge Torres about MindsDB, a framework for streamlining the use of neural networks

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what MindsDB is and the problem that it is trying to solve?
    • What was the motivation for creating the project?
  • Who is the target audience for MindsDB?
  • Before we go deep into MindsDB can you explain what a neural network is for anyone who isn’t familiar with the term?
  • For someone who is using MindsDB can you talk through their workflow?
    • What are the types of data that are supported for building predictions using MindsDB?
    • How much cleaning and preparation of the data is necessary before using it to generate a model?
    • What are the lower and upper bounds for volume and variety of data that can be used to build an effective model in MindsDB?
  • One of the interesting and useful features of MindsDB is the built in support for explaining the decisions reached by a model. How do you approach that challenge and what are the most difficult aspects?
  • Once a model is generated, what is the output format and can it be used separately from MindsDB for embedding the prediction capabilities into other scripts or services?
  • How is MindsDB implemented and how has the design changed since you first began working on it?
    • What are some of the assumptions that you made going into this project which have had to be modified or updated as it gained users and features?
  • What are the limitations of MindsDB and what are the cases where it is necessary to pass a task on to a data scientist?
  • In your experience, what are the common barriers for individuals and organizations adopting machine learning as a tool for addressing their needs?
  • What have been the most challenging, complex, or unexpected aspects of designing and building MindsDB?
  • What do you have planned for the future of MindsDB?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Algorithmic Trading In Python Using Open Tools And Open Data - Episode 216

Summary

Algorithmic trading is a field that has grown in recent years due to the availability of cheap computing and platforms that grant access to historical financial data. QuantConnect is a business that has focused on community engagement and open data access to create opportunities for learning and growth for their users. In this episode CEO Jared Broad and senior engineer Alex Catarino explain how they have built an open source engine for testing and running algorithmic trading strategies in multiple languages, the challenges of collecting and serving current and historical financial data, and how they provide training and opportunity to their community members. If you are curious about the financial industry and want to try it out for yourself then be sure to listen to this episode and experiment with the QuantConnect platform for free.
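
For a sense of what an algorithm on the platform looks like, here is a minimal buy-and-hold strategy in the style of QuantConnect’s Python API. It is not a standalone script: the QCAlgorithm base class and the Resolution type are provided by the LEAN engine (or the web IDE) at runtime, and the dates and ticker here are arbitrary.

    # A minimal buy-and-hold algorithm sketch for the LEAN engine; the
    # engine injects QCAlgorithm and Resolution, so this runs inside
    # QuantConnect rather than as a plain Python script.
    class BuyAndHold(QCAlgorithm):

        def Initialize(self):
            # Backtest window, starting cash, and the data we subscribe to.
            self.SetStartDate(2017, 1, 1)
            self.SetEndDate(2018, 1, 1)
            self.SetCash(100000)
            self.AddEquity("SPY", Resolution.Daily)

        def OnData(self, data):
            # On the first bar of data, put the whole portfolio into SPY.
            if not self.Portfolio.Invested:
                self.SetHoldings("SPY", 1.0)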

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Podcast.__init__ listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a trial.
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • The Python Software Foundation is the lifeblood of the community, supporting all of us who want to run workshops and conferences, run development sprints or meetups, and ensuring that PyCon is a success every year. They have extended the deadline for their 2019 fundraiser until June 30th and they need help to make sure they reach their goal. Go to pythonpodcast.com/psf today to make a donation. If you’re listening to this after June 30th of 2019 then consider making a donation anyway!
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Jared Broad and Alex Catarino about QuantConnect, a platform for building and testing algorithmic trading strategies on open data and cloud resources

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what QuantConnect is and how the business got started?
  • What is your mission for the company?
  • I know that there are a few other entrants in this market. Can you briefly outline how you compare to the other platforms and maybe characterize the state of the industry?
  • What are the main ways that you and your customers use Python?
  • For someone who is new to the space can you talk through what is involved in writing and testing a trading algorithm?
  • Can you talk through how QuantConnect itself is architected and some of the products and components that comprise your overall platform?
  • I noticed that your trading engine is open source. What was your motivation for making that freely available and how has it influenced your design and development of the project?
  • I know that the core product is built in C# and offers a bridge to Python. Can you talk through how that is implemented?
    • How do you address latency and performance when bridging those two runtimes given the time sensitivity of the problem domain?
  • What are the benefits of using Python for algorithmic trading and what are its shortcomings?
    • How useful and practical are machine learning techniques in this domain?
  • Can you also talk through what Alpha Streams is, including what makes it unique and how it benefits the users of your platform?
  • I appreciate the work that you are doing to foster a community around your platform. What are your strategies for building and supporting that interaction and how does it play into your product design?
  • What are the categories of users who tend to join and engage with your community?
  • What are some of the most interesting, innovative, or unexpected tactics that you have seen your users employ?
  • For someone who is interested in getting started on QuantConnect what is the onboarding process like?
    • What are some resources that you would recommend for someone who is interested in digging deeper into this domain?
  • What are the trends in quantitative finance and algorithmic trading that you find most exciting and most concerning?
  • What do you have planned for the future of QuantConnect?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

A Data Catalog For Your PyData Projects - Episode 213

Summary

One of the biggest pain points when working with data is dealing with the boilerplate code required to load it into a usable format. Intake encapsulates all of that and puts it behind a single API. In this episode Martin Durant explains how to use the Intake data catalogs for encapsulating source information, how it simplifies data science workflows, and how to incorporate it into your projects. It is a lightweight way to enable collaboration between data engineers and data scientists in the PyData ecosystem.
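
In practice the workflow centers on catalogs: a data engineer publishes a YAML file describing sources, and a data scientist loads them without caring about drivers or file formats. Here is a minimal sketch, assuming a hypothetical catalog.yml with an entry named traffic_counts.

    # A minimal sketch of consuming an Intake catalog; the catalog path and
    # the entry name are hypothetical.
    import intake

    cat = intake.open_catalog('catalog.yml')
    print(list(cat))             # discover which sources are available

    source = cat.traffic_counts  # pick an entry; driver details stay hidden
    df = source.read()           # load eagerly into a pandas DataFrame
    ddf = source.to_dask()       # or lazily, as a Dask collection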

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Martin Durant about Intake, a lightweight package for finding, investigating, loading and disseminating data

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what Intake is and the story behind its creation?
    • Can you outline some of the other projects and products that intersect with the functionality of Intake and describe where it fits in terms of use case and capabilities? (e.g. Quilt Data, Arrow, Data Retriever)
  • Can you describe the workflows for using Intake, both from the data scientist and the data engineer perspective?
  • One of the persistent challenges in working with data is that of cataloging and discovery of what already exists. In what ways does Intake address that problem?
    • Does it have any facilities for capturing and exposing data lineage?
  • For someone who needs to customize their usage of Intake, what are the extension points and what is involved in building a plugin?
  • Can you describe how Intake is implemented and how it has evolved since it first started?
    • What are some of the most challenging, complex, or novel aspects of the Intake implementation?
  • Intake focuses primarily on integrating with the PyData ecosystem (e.g. NumPy, Pandas, SciPy, etc.). What are some other communities that are, or could be, benefiting from the work being done on Intake?
    • What are some of the assumptions that are baked into Intake that would need to be modified to make it more broadly applicable?
  • What are some of the assumptions that were made going into this project that have needed to be reconsidered after digging deeper into the problem space?
  • What are some of the most interesting/unexpected/innovative ways that you have seen Intake leveraged?
  • What are your plans for the future of Intake?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Building A Privacy Preserving Voice Assistant - Episode 211

Summary

Being able to control a computer with your voice has rapidly moved from science fiction to science fact. Unfortunately, the majority of platforms that have been made available to consumers are controlled by large organizations with little incentive to respect users’ privacy. The team at Snips is building a platform that runs entirely offline and on-device so that your information is always in your control. In this episode Adrien Ball explains how the Snips architecture works, the challenges of building a speech recognition and natural language understanding toolchain that works on limited resources, and how they are tackling issues around usability for casual consumers. If you have been interested in taking advantage of personal voice assistants, but wary of using commercially available options, this is definitely worth a listen.
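
To make the NLU side concrete, here is a small sketch in the style of the snips-nlu Python package: the engine is trained locally on a JSON dataset of example utterances, then parses text into an intent with typed slots. The dataset file and the utterance are invented, and nothing in this flow calls out to a remote service.

    # A hedged sketch of local intent parsing with snips-nlu; dataset.json
    # is a hypothetical training file in Snips' dataset format.
    import io
    import json

    from snips_nlu import SnipsNLUEngine

    with io.open('dataset.json', encoding='utf-8') as f:
        dataset = json.load(f)

    engine = SnipsNLUEngine()
    engine.fit(dataset)  # training happens on-device; no data leaves it

    parsed = engine.parse("Turn the living room lights to fifty percent")
    print(parsed['intent']['intentName'])
    print(parsed['slots'])  # typed slot values extracted from the utterance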

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Adrien Ball about Snips, a set of technologies to make voice controlled systems that respect users’ privacy

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what Snips is and how it got started?
  • For someone who wants to use Snips can you talk through the onboarding process?
    • One of the interesting features of your platform is the option for automated training data generation. Can you explain how that works?
  • Can you describe the overall architecture of the Snips platform and how it has evolved since you first began working on it?
  • Two of the main components that can be used independently are the ASR (Automated Speech Recognition) and NLU (Natural Language Understanding) engines. Each of those have a number of competitors in the market, both open source and commercial. How would you describe your overall position in the market for each of those projects?
  • I know that one of the biggest challenges in conversational interfaces is maintaining context for multi-step interactions. How is that handled in Snips?
  • For the NLU engine, you recently ported it from Python to Rust. What was your motivation for doing so and how would you characterize your experience between the two languages?
    • Are you continuing to maintain both implementations and if so how are you maintaining feature parity?
  • How do you approach the overall usability and user experience, particularly for non-technical end users?
    • How is discoverability handled (e.g. finding out what capabilities/skills are available)?
  • One of the compelling aspects of Snips is the ability to deploy to a wide variety of devices, including offline support. Can you talk through that deployment process, both from a user perspective and how it is implemented under the covers?
    • What is involved in updating deployed models and keeping track of which versions are deployed to which devices?
  • What is involved in adding new capabilities or integrations to the Snips platform?
  • What are the limitations of running everything offline and on-device?
    • When is Snips the wrong choice?
  • In the process of building and maintaining the various components of Snips, what have been some of the most useful/interesting/unexpected lessons that you have learned?
    • What have been the most challenging aspects?
  • What are some of the most interesting/innovative/unexpected ways that you have seen the Snips technologies used?
  • What is in store for the future of Snips?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Probabilistic Modeling In Python (And What That Even Means) - Episode 209

Summary

Most programming is deterministic, relying on concrete logic to determine the way that it operates. However, there are problems that require a way to work with uncertainty. PyMC3 is a library designed for building models to predict the likelihood of certain outcomes. In this episode Thomas Wiecki explains the use cases where Bayesian statistics are necessary, how PyMC3 is designed and implemented, and some great examples of how it is being used in real projects.
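
The canonical small example is inferring the bias of a coin: declare a prior over the unknown probability, tie it to observed flips through a likelihood, and let the sampler produce a posterior distribution rather than a single point estimate. The data below is simulated purely for illustration.

    # Estimating a coin's bias with PyMC3; the flips are simulated so the
    # "true" answer (0.7) is known and the posterior can be sanity-checked.
    import numpy as np
    import pymc3 as pm

    np.random.seed(42)
    flips = np.random.binomial(n=1, p=0.7, size=100)  # 100 biased flips

    with pm.Model():
        # Prior belief about the heads probability before seeing data.
        p_heads = pm.Beta('p_heads', alpha=1, beta=1)
        # Likelihood: how the observed flips depend on the unknown parameter.
        pm.Bernoulli('obs', p=p_heads, observed=flips)
        # Draw posterior samples with MCMC.
        trace = pm.sample(1000, tune=1000)

    print(trace['p_heads'].mean())  # posterior mean, close to 0.7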

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Thomas Wiecki about PyMC3, a project for probabilistic programming in Python

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what probabilistic programming is?
  • What is the PyMC3 project and how did you get involved with it?
  • The opening line for the project README is packed with a slew of terms that are rather opaque to the lay-person. Can you unpack that a bit and discuss some of the ways that PyMC3 is used in real-world projects?
  • How much knowledge of statistical modeling and Bayesian statistics is necessary to make effective use of PyMC3?
  • Can you talk through an example use case for PyMC3 to illustrate how you would use it in a project?
    • How does it compare to the way that you would approach the same problem in a deterministic or frequentist modeling framework?
  • Can you describe how PyMC3 is implemented?
  • There are a number of other projects that build on top of PyMC3, what are some that you find particularly interesting or noteworthy?
  • What do you find to be the most useful features of PyMC3 and what are some areas that you would like to see it improved?
  • What have been the most interesting/unexpected/challenging lessons that you have learned in the process of building and maintaining PyMC3?
  • What is in store for the future of PyMC3?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Version Control For Your Machine Learning Projects - Episode 206

Summary

Version control has become table stakes for any software team, but for machine learning projects there has been no good answer for tracking all of the data that goes into building and training models, and the output of the models themselves. To address that need, Dmitry Petrov built the Data Version Control project known as DVC. In this episode he explains how it simplifies communication between data scientists, reduces duplicated effort, and simplifies concerns around reproducing and rebuilding models at different stages of the project’s lifecycle. If you work as part of a team that is building machine learning models or other data-intensive analysis then make sure to give this a listen and then start using DVC today.
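
The core loop is small: initialize DVC alongside git, track large files with it, and commit only the lightweight metadata. The sketch below drives the CLI from Python purely to keep these examples in one language; normally you would type the same commands at a shell. The paths are hypothetical, and git and dvc must already be installed.

    # A minimal sketch of starting a DVC-tracked project; the paths and the
    # commit message are invented.
    import subprocess

    def run(*cmd):
        print('$', ' '.join(cmd))          # echo the command for clarity
        subprocess.run(cmd, check=True)    # fail loudly on errors

    run('git', 'init')
    run('dvc', 'init')                     # add DVC metadata to the repo
    run('dvc', 'add', 'data/train.csv')    # track a large file outside git
    run('git', 'add', 'data/train.csv.dvc', '.gitignore')
    run('git', 'commit', '-m', 'Track training data with DVC')
    # Later, `dvc push` uploads the data to remote storage, and a teammate
    # can recreate the exact workspace with `git pull` plus `dvc pull`.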

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Bots and automation are taking over whole categories of online interaction. Discover.bot is an online community designed to serve as a platform-agnostic digital space for bot developers and enthusiasts of all skill levels to learn from one another, share their stories, and move the conversation forward together. They regularly publish guides and resources to help you learn about topics such as bot development, using them for business, and the latest in chatbot news. For newcomers to the space they have the Beginners Guide To Bots that will teach you the basics of how bots work, what they can do, and where they are developed and published. To help you choose the right framework and avoid the confusion about which NLU features and platform APIs you will need they have compiled a list of the major options and how they compare. Go to pythonpodcast.com/discoverbot today to get started and thank them for their support of the show.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Dmitry Petrov about DVC, an open source version control system for machine learning projects

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what DVC is and how it got started?
  • How do the needs of machine learning projects differ from other software applications in terms of version control?
  • Can you walk through the workflow of a project that uses DVC?
    • What are some of the main ways that it differs from your experience building machine learning projects without DVC?
  • In addition to the data that is used for training, the code that generates the model, and the end result there are other aspects such as the feature definitions and hyperparameters that are used. Can you discuss how those factor into the final model and any facilities in DVC to track the values used?
  • In addition to version control for software applications, there are a number of other pieces of tooling that are useful for building and maintaining healthy projects such as linting and unit tests. What are some of the adjacent concerns that should be considered when building machine learning projects?
  • What types of metrics do you track in DVC and how are they collected?
    • Are there specific problem domains or model types that require tracking different metric formats?
  • In the documentation it mentions that the data files live outside of git and can be managed in external storage systems. I’m wondering if there are any plans to integrate with systems such as Quilt or Pachyderm that provide versioning of data natively and what would be involved in adding that support?
  • What was your motivation for implementing this system in Python?
    • If you were to start over today what would you do differently?
  • Being a venture-backed startup that is producing open source products, what is the value equation that makes it worthwhile for your investors?
  • What have been some of the most interesting, challenging, or unexpected aspects of building DVC?
  • What do you have planned for the future of DVC?

Keep In Touch

Picks

  • Tobias
  • Dmitry
    • Go outside and get some fresh air 🙂

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Wes McKinney's Career In Python For Data Analysis - Episode 203

Summary

Python has become one of the dominant languages for data science and data analysis. Wes McKinney has been working for a decade to make tools that are easy and powerful, starting with the creation of Pandas, and eventually leading to his current work on Apache Arrow. In this episode he discusses his motivation for this work, what he sees as the current challenges to be overcome, and his hopes for the future of the industry.
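
A concrete way to see what the Arrow work enables is a round trip from pandas into Arrow’s columnar format and out to Parquet, the kind of cross-language, copy-minimizing handoff discussed in the episode. The DataFrame here is invented.

    # Moving data between pandas, Arrow, and Parquet with pyarrow; the
    # DataFrame contents and file name are made up for illustration.
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({'city': ['Boston', 'Oslo'], 'temp_c': [21.5, 14.2]})

    # An Arrow table is a language-independent columnar structure that C++,
    # R, Java, and other runtimes can share without serialization overhead.
    table = pa.Table.from_pandas(df)

    pq.write_table(table, 'weather.parquet')  # persist as Parquet
    df_again = pq.read_table('weather.parquet').to_pandas()  # and back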

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Check out the Practical AI podcast from our friends at Changelog Media to learn and stay up to date with what’s happening in AI
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with O’Reilly Media for the Strata conference in San Francisco on March 25th and the Artificial Intelligence conference in NYC on April 15th. Here in Boston, starting on May 17th, you still have time to grab a ticket to the Enterprise Data World, and from April 30th to May 3rd is the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Your host as usual is Tobias Macey and today I’m interviewing Wes McKinney about his contributions to the Python community and his current projects to make data analytics easier for everyone

Interview

  • Introductions
  • How did you get introduced to Python?
  • You have spent a large portion of your career on building tools for data science and analytics in the Python ecosystem. What is your motivation for focusing on this problem domain?
  • Having been an open source author and contributor for many years now, what are your current thoughts on paths to sustainability?
  • What are some of the common challenges pertaining to data analysis that you have experienced in the various work environments and software projects that you have been involved in?
    • What area(s) of data science and analytics do you find are not receiving the attention that they deserve?
  • Recently there has been a lot of focus and excitement around the capabilities of neural networks and deep learning. In your experience, what are some of the shortcomings or blind spots to that class of approach that would be better served by other classes of solution?
  • Your most recent work is focused on the Arrow project for improving interoperability across languages. What are some of the cases where a Python developer would want to incorporate capabilities from other runtimes?
    • Do you think that we should be working to replicate some of those capabilities into the Python language and ecosystem, or is that wasted effort that would be better spent elsewhere?
  • Now that Pandas has been in active use for over a decade and you have had the opportunity to get some space from it, what are your thoughts on its success?
    • With the perspective that you have gained in that time, what would you do differently if you were starting over today?
  • You are best known for being the creator of Pandas, but can you list some of the other achievements that you are most proud of?
  • What projects are you most excited to be working on in the near to medium future?
  • What are your grand ambitions for the future of the data science community, both in and outside of the Python ecosystem?
  • Do you have any parting advice for active or aspiring data scientists, or resources that you would like to recommend?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

The Past, Present, and Future of Deep Learning In PyTorch - Episode 202

Summary

The current buzz in data science and big data is around the promise of deep learning, especially when working with unstructured data. One of the most popular frameworks for building deep learning applications is PyTorch, in large part because of its focus on ease of use. In this episode Adam Paszke explains how he started the project, how it compares to other frameworks in the space such as Tensorflow and CNTK, and how it has evolved to support deploying models into production and on mobile devices.
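
That ease of use largely comes from PyTorch’s define-by-run style: the computation graph is built as ordinary Python executes, so a training step reads like plain imperative code. Here is a minimal sketch with random stand-in data.

    # One training step for a tiny regression network; the data is random
    # noise, purely to show the define-by-run workflow.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    x = torch.randn(64, 10)  # a batch of 64 examples, 10 features each
    y = torch.randn(64, 1)   # fake regression targets

    optimizer.zero_grad()
    loss = loss_fn(model(x), y)  # the graph is built as this line runs
    loss.backward()              # autograd traverses it to get gradients
    optimizer.step()
    print(loss.item())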

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Check out the Practical AI podcast from our friends at Changelog Media to learn and stay up to date with what’s happening in AI
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with O’Reilly Media for the Strata conference in San Francisco on March 25th and the Artificial Intelligence conference in NYC on April 15th. Here in Boston, starting on May 17th, you still have time to grab a ticket to the Enterprise Data World, and from April 30th to May 3rd is the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Your host as usual is Tobias Macey and today I’m interviewing Adam Paszke about PyTorch, an open source deep learning platform that provides a seamless path from research prototyping to production deployment

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what deep learning is and how it relates to machine learning and artificial intelligence?
  • Can you explain what PyTorch is and your motivation for creating it?
    • Why was it important for PyTorch to be open source?
  • There is currently a large and growing ecosystem of deep learning tools built for Python. Can you describe the current landscape and how PyTorch fits in relation to projects such as Tensorflow and CNTK?
    • What are some of the ways that PyTorch is different from Tensorflow and CNTK, and what are the areas where these frameworks are converging?
  • How much knowledge of machine learning, artificial intelligence, or neural network topologies are necessary to make use of PyTorch?
    • What are some of the foundational topics that are most useful to know when getting started with PyTorch?
  • Can you describe how PyTorch is architected/implemented and how it has evolved since you first began working on it?
    • You recently reached the 1.0 milestone. Can you talk about the journey to that point and the goals that you set for the release?
  • What are some of the other components of the Python ecosystem that are most commonly incorporated into projects based on PyTorch?
  • What are some of the most novel, interesting, or unexpected uses of PyTorch that you have seen?
  • What are some cases where PyTorch is the wrong choice for a problem?
  • What is the process for incorporating these new techniques and discoveries into the PyTorch framework?
    • What are the areas of active research that you are most excited about?
  • What are some of the most interesting/useful/unexpected/challenging lessons that you have learned in the process of building and maintaining PyTorch?
  • What do you have planned for the future of PyTorch?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Polyglot: Multi-Lingual Natural Language Processing with Rami Al-Rfou - Episode 190

Summary

Using computers to analyze text can produce useful and inspirational insights. However, when working with multiple languages the capabilities of existing models are severely limited. In order to help overcome this limitation Rami Al-Rfou built Polyglot. In this episode he explains his motivation for creating a natural language processing library with support for a vast array of languages, how it works, and how you can start using it for your own projects. He also discusses current research on multi-lingual text analytics, how he plans to improve Polyglot in the future, and how it fits in the Python ecosystem.
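
Here is a small sketch of what the pipeline looks like in use, on a Portuguese sentence invented for this example: language detection and entity extraction share one interface across languages. Note that per-language models must be fetched first with Polyglot’s downloader (for Portuguese, something like `polyglot download embeddings2.pt ner2.pt`), so check the documentation for the exact package names.

    # Language detection and named entity extraction with Polyglot; the
    # sentence is invented, and the Portuguese models must be downloaded
    # beforehand via Polyglot's downloader.
    from polyglot.text import Text

    text = Text("O Brasil anunciou um novo investimento em São Paulo.")
    print(text.language.code)  # detected language, here 'pt'

    for entity in text.entities:
        print(entity.tag, list(entity))  # e.g. I-LOC ['Brasil']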

Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. Podcast.__init__ listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a trial.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Rami Al-Rfou about Polyglot, a natural language pipeline with support for an impressive number of languages

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by describing what Polyglot is and your reasons for starting the project?
  • What are the types of use cases that Polyglot enables which would be impractical with something such as NLTK or SpaCy?
  • A majority of NLP libraries have a limited set of languages that they support. What is involved in adding support for a given language to a natural language tool?
    • What is involved in adding a new language to Polyglot?
    • Which families of languages are the most challenging to support?
  • What types of operations are supported and how consistently are they supported across languages?
  • How is Polyglot implemented?
  • Is there any capacity for integrating Polyglot with other tools such as SpaCy or Gensim?
  • How much domain knowledge is required to be able to effectively use Polyglot within an application?
  • What are some of the most interesting or unique uses of Polyglot that you have seen?
  • What have been some of the most complex or challenging aspects of building Polyglot?
  • What do you have planned for the future of Polyglot?
  • What are some areas of NLP research that you are excited for?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA