Great Expectations For Your Data Pipelines with Abe Gong and James Campbell

May 13th, 2018

50 mins 42 secs

About this Episode

Summary

Testing is a critical activity in all software projects, but one that is often neglected in data pipelines. The complexities introduced by the inherent statefulness of the problem domain and the interdependencies between systems make pipeline testing difficult to manage. To make this endeavor more manageable, Abe Gong and James Campbell created Great Expectations. In this episode they discuss how you can use the project to create tests in the exploratory phase of building a pipeline and leverage those tests to monitor your systems in production. They also discuss how Great Expectations works, the difficulties associated with pipeline testing and managing the associated technical debt, and their future plans for the project.

Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 200Gbit network, all controlled by a brand new API, you’ve got everything you need to scale up. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute.
  • Finding a bug in production is never a fun experience, especially when your users find it first. Airbrake error monitoring ensures that you will always be the first to know so you can deploy a fix before anyone is impacted. With open source agents for Python 2 and 3 it’s easy to get started, and the automatic aggregations, contextual information, and deployment tracking ensure that you don’t waste time pinpointing what went wrong. Go to podcastinit.com/airbrake today to sign up and get your first 30 days free, and 50% off 3 months of the Startup plan.
  • To get worry-free releases, download GoCD, the open source continuous delivery server built by ThoughtWorks. You can use their pipeline modeling and value stream map to build, control, and monitor every step from commit to deployment in one place. And with their new Kubernetes integration it’s even easier to deploy and scale your build agents. Go to podcastinit.com/gocd to learn more about their professional support services and enterprise add-ons.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions, I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
  • Your host as usual is Tobias Macey, and today I’m interviewing James Campbell and Abe Gong about Great Expectations, a tool for testing the data in your analytics pipelines.

Interview

  • Introduction
  • How did you first get introduced to Python?
  • What is Great Expectations and what was your motivation for starting it?
  • What are some of the complexities associated with testing analytics pipelines?
    • What types of tests can be executed to ensure data integrity and accuracy?
  • What are some examples of the potential impact of pipeline debt?
  • What is Great Expectations and how does it simplify the process of building and executing pipeline tests?
  • What are some examples of the types of tests that can be built with Great Expectations?
  • For someone getting started with Great Expectations what does the workflow look like?
  • What was your reason for using Python for building it?
    • How does the choice of language benefit or hinder the contexts in which Great Expectations can be used?
  • What are some cases where Great Expectations would not be usable or useful?
  • What have been some of the most challenging aspects of building and using Great Expectations?
  • What are your hopes for Great Expectations going forward?

Contact Info

Picks

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA