Data Engineering

A Data Catalog For Your PyData Projects - Episode 213

Summary

One of the biggest pain points when working with data is dealing with the boilerplate code required to load it into a usable format. Intake encapsulates all of that and puts it behind a single API. In this episode Martin Durant explains how to use Intake data catalogs for encapsulating source information, how it simplifies data science workflows, and how to incorporate it into your projects. It is a lightweight way to enable collaboration between data engineers and data scientists in the PyData ecosystem.
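As a rough illustration of the pattern described above, the sketch below shows how a catalog entry replaces hand-written loading code. It is a minimal example rather than anything from the episode: the catalog file name ("catalog.yml") and the source entry ("trips") are hypothetical, and the entry is assumed to point at a CSV driver.

```python
# Minimal Intake sketch; the catalog file and the "trips" entry are
# hypothetical examples, assumed to be defined with a CSV driver.
import intake

# Open the catalog; Intake resolves the driver and arguments for each entry.
cat = intake.open_catalog("catalog.yml")

# Discover what data sources the catalog exposes.
print(list(cat))

# Read one source into a pandas DataFrame without writing any loading code.
df = cat.trips.read()
print(df.head())
```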

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Martin Durant about Intake, a lightweight package for finding, investigating, loading and disseminating data

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what Intake is and the story behind its creation?
    • Can you outline some of the other projects and products that intersect with the functionality of Intake and describe where it fits in terms of use case and capabilities? (e.g. Quilt Data, Arrow, Data Retriever)
  • Can you describe the workflows for using Intake, both from the data scientist and the data engineer perspective?
  • One of the persistent challenges in working with data is that of cataloging and discovery of what already exists. In what ways does Intake address that problem?
    • Does it have any facilities for capturing and exposing data lineage?
  • For someone who needs to customize their usage of Intake, what are the extension points and what is involved in building a plugin?
  • Can you describe how Intake is implemented and how it has evolved since it first started?
    • What are some of the most challenging, complex, or novel aspects of the Intake implementation?
  • Intake focuses primarily on integrating with the PyData ecosystem (e.g. NumPy, Pandas, SciPy, etc.). What are some other communities that are, or could be, benefiting from the work being done on Intake?
    • What are some of the assumptions that are baked into Intake that would need to be modified to make it more broadly applicable?
  • What are some of the assumptions that were made going into this project that have needed to be reconsidered after digging deeper into the problem space?
  • What are some of the most interesting/unexpected/innovative ways that you have seen Intake leveraged?
  • What are your plans for the future of Intake?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Gnocchi: A Scalable Time Series Database For Your Metrics with Julien Danjou - Episode 189

Summary

Do you know what your servers are doing? If you have a metrics system in place then the answer should be “yes”. One critical aspect of that platform is the time series database that allows you to store, aggregate, analyze, and query the various signals generated by your software and hardware. As the size and complexity of your systems scale, so does the volume of data that you need to manage, which can put a strain on your metrics stack. Julien Danjou built Gnocchi during his time on the OpenStack project to provide a time-oriented data store that would scale horizontally and still provide fast queries. In this episode he explains how the project got started, how it works, how it compares to the other options on the market, and how you can start using it today to get better visibility into your operations.
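For a feel of how Gnocchi is used, here is a rough sketch against its HTTP API using the requests library. The endpoint layout follows Gnocchi's documented REST interface, but the host, port, missing authentication, and the "low" archive policy name are simplifying assumptions, so treat it as illustrative only.

```python
# Illustrative sketch of pushing and querying measures over Gnocchi's REST
# API; the base URL, lack of auth, and archive policy name are assumptions.
import requests

BASE = "http://localhost:8041"  # hypothetical Gnocchi endpoint

# Create a metric whose measures will be aggregated per an archive policy.
metric = requests.post(
    f"{BASE}/v1/metric",
    json={"archive_policy_name": "low"},
).json()

# Push a couple of raw measures into that metric.
requests.post(
    f"{BASE}/v1/metric/{metric['id']}/measures",
    json=[
        {"timestamp": "2019-05-01T12:00:00", "value": 4.2},
        {"timestamp": "2019-05-01T12:05:00", "value": 3.7},
    ],
)

# Query back the pre-computed aggregates (here the mean).
aggregates = requests.get(
    f"{BASE}/v1/metric/{metric['id']}/measures",
    params={"aggregation": "mean"},
).json()
print(aggregates)
```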

Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. Podcast.__init__ listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a trial.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Julien Danjou about Gnocchi, an open source time series database built to handle large volumes of system metrics

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by describing what Gnocchi is and how the project got started?
    • What was the motivation for moving Gnocchi out of the OpenStack organization and into its own top level project?
  • Time series databases and metrics-as-a-service platforms are both fairly crowded spaces. What are the unique features of Gnocchi that would lead someone to deploy it in place of other options?
    • What are some of the tools and platforms that are popular today which hadn’t yet gained visibility when you first began working on Gnocchi?
  • How is Gnocchi architected?
    • How has the design changed since you first started working on it?
    • What was the motivation for implementing it in Python and would you make the same choice today?
  • One of the interesting features of Gnocchi is its support of resource history. Can you describe how that operates and the types of use cases that it enables?
    • Does that factor into the multi-tenant architecture?
  • What are some of the drawbacks of pre-aggregating metrics as they are being written into the storage layer (e.g. loss of fidelity)?
    • Is it possible to maintain the raw measures after they are processed into aggregates?
  • One of the challenging aspects of building a scalable metrics platform is support for high-cardinality data. What sort of labelling and tagging of metrics and measures is available in Gnocchi?
  • For someone who wants to implement Gnocchi for their system metrics, what is involved in deploying, maintaining, and upgrading it?
    • What are the available integration points for extending and customizing Gnocchi?
  • Once metrics have been stored, aggregated, and indexed, what are the options for querying and analyzing the collected data?
  • When is Gnocchi the wrong choice?
  • What do you have planned for the future of Gnocchi?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Of Checklists, Ethics, and Data with Emily Miller and Peter Bull - Episode 184

Summary

As data science becomes more widespread and has a bigger impact on the lives of people, it is important that those projects and products are built with a conscious consideration of ethics. Keeping ethical principles in mind throughout the lifecycle of a data project helps to reduce the overall effort required to prevent negative outcomes from the use of the final product. Emily Miller and Peter Bull of Driven Data have created Deon to improve the communication and conversation around ethics among and between data teams. It is a Python project that generates a checklist of common concerns for data-oriented projects at the various stages of the lifecycle where they should be considered. In this episode they discuss their motivation for creating the project, the challenges and benefits of maintaining such a checklist, and how you can start using it today.

Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to scale up. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at podcastinit.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Emily Miller and Peter Bull about Deon, an ethics checklist for data projects

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by describing what Deon is and your motivation for creating it?
  • Why a checklist, specifically? What’s the advantage of this over an oath, for example?
  • What is unique to data science in terms of the ethical concerns, as compared to traditional software engineering?
  • What is the typical workflow for a team that is using Deon in their projects?
  • Deon ships with a default checklist but allows for customization. What are some common addendums that you have seen?
    • Have you received pushback on any of the default items?
  • How does Deon simplify communication around ethics across team boundaries?
  • What are some of the most often overlooked items?
  • What are some of the most difficult ethical concerns to comply with for a typical data science project?
  • How has Deon helped you at Driven Data?
  • What are the customer facing impacts of embedding a discussion of ethics in the product development process?
  • Some of the items on the default checklist coincide with regulatory requirements. Are there any cases where regulation is in conflict with an ethical concern that you would like to see practiced?
  • What are your hopes for the future of the Deon project?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Fast Stream Processing In Python Using Faust with Ask Solem - Episode 176

Summary

The need to process unbounded and continually streaming sources of data has become increasingly common. One of the popular platforms for implementing this is Kafka along with its streams API. Unfortunately, this requires all of your processing or microservice logic to be implemented in Java, so what’s a poor Python developer to do? If that developer is Ask Solem of Celery fame then the answer is, help to re-implement the streams API in Python. In this episode Ask describes how Faust got started, how it works under the covers, and how you can start using it today to process your fast moving data in easy-to-understand Python code. He also discusses ways in which Faust might be able to replace your Celery workers, and all of the pieces that you can replace with your own plugins.
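To give a flavor of the programming model, here is a minimal Faust agent that consumes a Kafka topic. The app name, topic name, record fields, and local broker URL are made-up examples rather than anything from the episode.

```python
# Minimal Faust sketch: an agent processing events from a Kafka topic.
# App/topic names, record fields, and the broker URL are assumptions.
import faust


class Order(faust.Record):
    account_id: str
    amount: float


app = faust.App("orders-app", broker="kafka://localhost:9092")
orders_topic = app.topic("orders", value_type=Order)


@app.agent(orders_topic)
async def process_orders(orders):
    # Faust hands the agent an async stream of deserialized Order records.
    async for order in orders:
        print(f"Processing {order.amount} for account {order.account_id}")


if __name__ == "__main__":
    # Equivalent to running "faust -A <module> worker" from the command line.
    app.main()
```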

Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to scale up. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at podcastinit.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Ask Solem about Faust, a library for building high performance, high throughput streaming systems in Python

Interview

  • Introductions
  • How did you get introduced to Python?
  • What is Faust and what was your motivation for building it?
    • What were the initial project requirements that led you to use Kafka as the primary infrastructure component for Faust?
  • Can you describe the architecture for Faust and how it has changed from when you first started writing it?
    • What mechanism does Faust use for managing consensus and failover among instances that are working on the same stream partition?
  • What are some of the lessons that you learned while building Celery that were most useful to you when designing Faust?
  • What have you found to be the most common areas of confusion for people who are just starting to build an application on top of Faust?
  • What has been the most interesting/unexpected/difficult aspects of building and maintaining Faust?
  • What have you found to be the most challenging aspects of building streaming applications?
  • What was the reason for releasing Faust as an open source project rather than keeping it internal to Robinhood?
  • What would be involved in adding support for alternate queue or stream implementations?
  • What do you have planned for the future of Faust?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Asking Questions From Data Using Active Learning with Tivadar Danka - Episode 162

Summary

One of the challenges of machine learning is obtaining large enough volumes of well-labelled data. An approach to mitigate the effort required for labelling data sets is active learning, in which the samples that the model is least certain about are identified and labelled by domain experts. In this episode Tivadar Danka describes how he built modAL to bring active learning to bioinformatics. He is using it to do human-in-the-loop training of models that detect cell phenotypes in massive unlabelled datasets. He explains how the library works, how he designed it to be modular for a broad set of use cases, and how you can use it for training models of your own.
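As a concrete, if simplified, sketch of that workflow, the example below pairs modAL's ActiveLearner with a scikit-learn classifier on synthetic data; in a real project the queried labels would come from a domain expert rather than a pre-existing array.

```python
# Minimal modAL sketch: uncertainty-driven active learning on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from modAL.models import ActiveLearner

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(1000, 4))
y_pool = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(int)

# Seed the learner with a small labelled set.
learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    X_training=X_pool[:10],
    y_training=y_pool[:10],
)

# Repeatedly ask for the samples the model is least certain about,
# "label" them (here by looking up y_pool), and feed them back in.
for _ in range(20):
    query_idx, _query_sample = learner.query(X_pool)
    learner.teach(X_pool[query_idx], y_pool[query_idx])

print("Accuracy on the pool:", learner.score(X_pool, y_pool))
```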

Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to scale up. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute.
  • To get worry-free releases download GoCD, the open source continuous delivery server built by ThoughtWorks. You can use their pipeline modeling and value stream map to build, control and monitor every step from commit to deployment in one place. And with their new Kubernetes integration it’s even easier to deploy and scale your build agents. Go to podcastinit.com/gocd to learn more about their professional support services and enterprise add-ons.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Your host as usual is Tobias Macey and today I’m interviewing Tivadar Danka about modAL, a modular active learning framework for Python3

Interview

  • Introductions
  • How did you get introduced to Python?
  • What is active learning?
    • How does it differ from other approaches to machine learning?
  • What is modAL and what was your motivation for starting the project?
  • For someone who is using modAL, what does a typical workflow look like to train their models?
  • How do you avoid oversampling and causing the human in the loop to become overwhelmed with labeling requirements?
  • What are the most challenging aspects of building and using modAL?
  • What do you have planned for the future of modAL?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Great Expectations For Your Data Pipelines with Abe Gong and James Campbell - Episode 161

Summary

Testing is a critical activity in all software projects, but one that is often neglected in data pipelines. The complexities introduced by the inherent statefulness of the problem domain and the interdependencies between systems combine to make pipeline testing difficult to manage. To make this endeavor more manageable, Abe Gong and James Campbell have created Great Expectations. In this episode they discuss how you can use the project to create tests in the exploratory phase of building a pipeline and leverage those to monitor your systems in production. They also discuss how Great Expectations works, the difficulties associated with pipeline testing and managing associated technical debt, and their future plans for the project.
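As a small taste of that exploratory workflow, the sketch below uses the pandas-backed entry point that Great Expectations exposed around the time of this episode; the file and column names are hypothetical.

```python
# Minimal Great Expectations sketch; "orders.csv" and its columns are
# hypothetical examples.
import great_expectations as ge

# Wrap a CSV in a Great Expectations dataset so expectations can be declared.
df = ge.read_csv("orders.csv")

# Declare expectations while exploring the data.
df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_between("amount", min_value=0, max_value=10000)

# Validate a batch of data against the accumulated expectations.
results = df.validate()
print(results)
```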

Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 200Gbit network, all controlled by a brand new API you’ve got everything you need to scale up. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute.
  • Finding a bug in production is never a fun experience, especially when your users find it first. Airbrake error monitoring ensures that you will always be the first to know so you can deploy a fix before anyone is impacted. With open source agents for Python 2 and 3 it’s easy to get started, and the automatic aggregations, contextual information, and deployment tracking ensure that you don’t waste time pinpointing what went wrong. Go to podcastinit.com/airbrake today to sign up and get your first 30 days free, and 50% off 3 months of the Startup plan.
  • To get worry-free releases download GoCD, the open source continuous delivery server built by ThoughtWorks. You can use their pipeline modeling and value stream map to build, control and monitor every step from commit to deployment in one place. And with their new Kubernetes integration it’s even easier to deploy and scale your build agents. Go to podcastinit.com/gocd to learn more about their professional support services and enterprise add-ons.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • Your host as usual is Tobias Macey and today I’m interviewing James Campbell and Abe Gong about Great Expectations, a tool for testing the data in your analytics pipelines

Interview

  • Introduction
  • How did you first get introduced to Python?
  • What is Great Expectations and what was your motivation for starting it?
  • What are some of the complexities associated with testing analytics pipelines?
    • What types of tests can be executed to ensure data integrity and accuracy?
  • What are some examples of the potential impact of pipeline debt?
  • What is Great Expectations and how does it simplify the process of building and executing pipeline tests?
  • What are some examples of the types of tests that can be built with Great Expectations?
  • For someone getting started with Great Expectations what does the workflow look like?
  • What was your reason for using Python for building it?
    • How does the choice of language benefit or hinder the contexts in which Great Expectations can be used?
  • What are some cases where Great Expectations would not be usable or useful?
  • What have been some of the most challenging aspects of building and using Great Expectations?
  • What are your hopes for Great Expectations going forward?

Contact Info

Picks

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Bonobo: Lightweight ETL Toolkit for Python 3 with Romain Dorgueil - Episode 143

Summary

A majority of the work that we do as programmers involves data manipulation in some manner. This can range from large scale collection, aggregation, and statistical analysis across distributed systems to something as simple as making a graph in a spreadsheet. In the middle of that range is the general task of ETL (Extract, Transform, and Load), which has its own range of scale. In this episode Romain Dorgueil discusses his experiences building ETL systems and the problems that he routinely encountered that led him to creating Bonobo, a lightweight, easy-to-use toolkit for data processing in Python 3. He also explains how the system works under the hood, how you can use it for your projects, and what he has planned for the future.
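To illustrate the shape of a Bonobo pipeline, here is a minimal three-node graph. The data is inlined purely for the example; real graphs would extract from files, APIs, or databases.

```python
# Minimal Bonobo sketch: a small extract/transform/load graph with
# inlined example data.
import bonobo


def extract():
    # Generators feed rows into the graph one at a time.
    yield "alice", 3
    yield "bob", 5


def transform(name, count):
    yield name.title(), count * 2


def load(name, count):
    print(f"{name}: {count}")


def get_graph():
    graph = bonobo.Graph()
    graph.add_chain(extract, transform, load)
    return graph


if __name__ == "__main__":
    bonobo.run(get_graph())
```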

Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • I would like to thank everyone who supports us on Patreon. Your contributions help to make the show sustainable.
  • When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at podcastinit.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your awesome app. And now you can deliver your work to your users even faster with the newly upgraded 200 Gbit network in all of their datacenters.
  • If you’re tired of cobbling together your deployment pipeline then it’s time to try out GoCD, the open source continuous delivery platform built by the people at ThoughtWorks who wrote the book about it. With GoCD you get complete visibility into the life-cycle of your software from one location. To download it now go to podcastinit.com/gocd. Professional support and enterprise plugins are available for added peace of mind.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Your host as usual is Tobias Macey and today I’m interviewing Romain Dorgueil about Bonobo, a data processing toolkit for modern Python

Interview

  • Introductions
  • How did you get introduced to Python?
  • What is Bonobo and what was your motivation for creating it?
    • What is the story behind the name?
  • How does Bonobo differ from projects such as Luigi or Airflow?
    [RD] After I explain why those are totally different things, maybe a good follow-up would be to ask about differences from other data streaming solutions, like Apache Beam or Spark.
  • How is Bonobo implemented and how has its architecture evolved since you began working on it?
  • What have been some of the most challenging aspects of building and maintaining Bonobo?
  • What are some extensions that you would like to have but don’t have the time to implement?
  • What are some of the most interesting or creative uses of Bonobo that you are aware of?
  • What do you have planned for the future of Bonobo?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Event Sourcing with John Bywater - Episode 131

Summary

The way that your application handles data and the way that it is represented in your database don’t always match, leading to a lot of brittle abstractions to reconcile the two. In order to reduce that friction, instead of overwriting the state of your application on every change, you can log all of the events that take place and then render the current state from that sequence of events. John Bywater joins me this week to discuss his work on the Event Sourcing library, why you might want to use it in your applications, and how it can change the way that you think about your data.
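To make the idea concrete, the sketch below illustrates event sourcing in plain Python rather than the API of John's library: state is never overwritten, it is recomputed by replaying the ordered log of events.

```python
# Conceptual event sourcing sketch (not the eventsourcing library's API):
# the event log is the source of truth and current state is derived from it.
from dataclasses import dataclass


@dataclass
class FundsDeposited:
    amount: int


@dataclass
class FundsWithdrawn:
    amount: int


def apply(balance, event):
    # Each event describes a change; applying it yields the next state.
    if isinstance(event, FundsDeposited):
        return balance + event.amount
    if isinstance(event, FundsWithdrawn):
        return balance - event.amount
    return balance


# The append-only log of everything that happened...
events = [FundsDeposited(100), FundsWithdrawn(30), FundsDeposited(5)]

# ...and the current state, rebuilt by replaying the log from the beginning.
balance = 0
for event in events:
    balance = apply(balance, event)
print(balance)  # 75
```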

Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • I would like to thank everyone who supports the show on Patreon. Your contributions help to make the show sustainable.
  • When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at podcastinit.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your awesome app. And now you can deliver your work to your users even faster with the newly upgraded 200 Gbit network in all of their datacenters.
  • If you’re tired of cobbling together your deployment pipeline then it’s time to try out GoCD, the open source continuous delivery platform built by the people at ThoughtWorks who wrote the book about it. With GoCD you get complete visibility into the life-cycle of your software from one location. To download it now go to podcastinit.com/gocd. Professional support and enterprise plugins are available for added peace of mind.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Your host as usual is Tobias Macey and today I’m interviewing John Bywater about event sourcing, an architectural approach to make your data layer easier to scale and maintain.

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by describing the concept of event sourcing and the benefits that it provides?
  • What is the event sourcing library and what was your reason for starting it?
  • What are some of the reasons that someone might not want to implement an event sourcing approach in their persistence layer?
  • Given that you are storing a record for each event that occurs on a domain object, how does that affect the amount of storage necessary to support an event sourced application?
  • What is the impact on performance and latency from an end user perspective when the application is using event sourcing to render the current state of the system?
  • What does the internal architecture and design of your library look like and how has that evolved over time?
  • In the case where events are delivered out of order, how can you ensure that the present view of an object is reflected accurately?
  • For someone who wants to incorporate an event sourcing design into an existing application, how would they do that?
  • How do you manage schema changes in your domain model when you need to reconstruct present state from the beginning of an object's event sequence?
  • What are some of the most interesting uses of event sourcing that you have seen?
  • What are some of the features or improvements that you have planned for the future of your event sourcing library?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Data Retriever with Henry Senyondo - Episode 122

Summary

Analyzing and interpreting data is a large portion of the work involved in scientific research. Getting to that point can be a lot of work on its own because of all of the steps required to download, clean, and organize the data prior to analysis. This week Henry Senyondo talks about the work he is doing with Data Retriever to make data preparation as easy as running retriever install.
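For a sense of what that looks like from Python, here is a rough sketch of the retriever package's programmatic interface. The helper names (datasets, install_csv) and the example dataset are assumptions to verify against the project's documentation.

```python
# Rough sketch of the Data Retriever Python interface; the helpers and the
# "iris" dataset name are assumptions, not guaranteed API.
from retriever import datasets, install_csv

# List the public datasets that Data Retriever knows how to fetch and clean.
for dataset in datasets():
    print(dataset.name)

# Download, clean, and install one dataset as local CSV files.
install_csv("iris")
```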

Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • I would like to thank everyone who supports us on Patreon. Your contributions help to make the show sustainable.
  • When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at www.podcastinit.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your awesome app.
  • Visit the site to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Your host as usual is Tobias Macey and today I’m interviewing Henry Senyondo about Data Retriever, the package manager for public data sets.

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you explain what Data Retriever is and the problem that it was built to solve?
  • Are there limitations as to the types of data that can be managed by Data Retriever?
  • What kinds of data sets are currently available and who are the target users?
  • What is involved in preparing a new dataset to be available for installation?
  • How much of the logic for installing the data is shared between the R and Python implementations of Data Retriever and how do you ensure that the two packages evolve in parallel?
  • How is the project designed and what are some of the most difficult technical aspects of building it?
  • What is in store for the future of Data Retriever?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Cauldron with Scott Ernst - Episode 111

Summary

The notebook format that has been exemplified by the IPython/Jupyter project has gained in popularity among data scientists. While the existing formats have proven their value, they still suffer from difficulties in collaboration and maintainability. Scott Ernst created the Cauldron notebook to be testable, production-ready, and friendly to version control. This week we explore the capabilities, use cases, and architecture of Cauldron and how you can start using it today!
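As a loose sketch of what a Cauldron step might look like, the example below shows a single step file; the display helpers and the shared attribute name are assumptions based on the project's general approach rather than verified API.

```python
# Hypothetical sketch of one Cauldron notebook step (each step is its own
# Python file under version control); cd.display/cd.shared usage is assumed.
import cauldron as cd
import pandas as pd

# Steps pass data to each other by attaching values to the shared context.
df = pd.DataFrame({"sample": ["a", "b", "c"], "value": [1, 2, 3]})
cd.shared.samples = df

# Rendered output goes through the display API instead of cell side effects.
cd.display.markdown("## Sample values")
cd.display.table(df)
```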

Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • I would like to thank everyone who supports us on Patreon. Your contributions help to make the show sustainable.
  • When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at www.podcastinit.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your awesome app.
  • Visit the site to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Your host as usual is Tobias Macey and today I’m interviewing Scott Ernst about Cauldron, a new notebook format built with software engineering best practices in mind.

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what Cauldron is and what problem you were trying to solve when you created it?
  • In the documentation it mentions that you can use any editor for creating the content of the notebook. Can you describe a typical workflow of authoring the various files and cells and viewing the output?
  • How does Cauldron compare to the Jupyter notebook format and what factors would lead someone to choose one over the other?
  • Does Cauldron support running languages other than Python? If not then what would be involved in adding that capability?
  • Cauldron notebooks support unit tests of individual cells. How does that process work and what are the limitations?
  • The option for running the notebook in the context of a task workflow tool appears to be a powerful capability. What are some of the considerations that are necessary when writing a notebook to be run in that manner?
  • What are some of the most interesting or unexpected projects that you have seen people using Cauldron for?
  • What do you have planned for the future of Cauldron?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA