Data Engineering

Simplified Data Extraction And Analysis For Current Events With Newspaper - Episode 280

News media is an important source of information for understanding the context of the world. To make it easier to access and process the contents of news sites, Lucas Ou-Yang built the Newspaper library, which automates the retrieval of articles and prepares them for analysis. In this episode he shares how the project got started, how it is implemented, and how you can get started with it today. He also discusses how recent improvements in the utility and ease of use of deep learning libraries open new possibilities for future iterations of the project.
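
To give a sense of the workflow, here is a minimal sketch of how an article is typically fetched and parsed with Newspaper; the URL is just a placeholder.

```python
# Minimal sketch of pulling and parsing a single article with Newspaper.
from newspaper import Article

url = "https://example.com/some-news-story"  # placeholder URL
article = Article(url)
article.download()   # fetch the raw HTML
article.parse()      # extract title, authors, publish date, and body text
article.nlp()        # run the built-in keyword and summary extraction

print(article.title)
print(article.authors)
print(article.keywords)
print(article.summary)
```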


Digging Into Dagster: An Opinionated Open Source Framework For Data Orchestration - Episode 279

Data applications are complex and continually evolving, often requiring collaboration across multiple teams. In order to keep everyone on the same page a high level abstraction is needed to facilitate a cross-cutting view of the data orchestration across integration, transformation, analytics, and machine learning. Dagster is an innovative new framework that leans on the power and flexibility of Python to provide an extensible interface to the complete lifecycle of data projects. In this episode Nick Schrock explains how he designed the Dagster project to allow for integration with the entire data ecosystem while providing an opinionated structure for connecting the different stages of computation. He also discusses how he is working to grow an open ecosystem around the Dagster project, and his thoughts on building a sustainable business on top of it without compromising the integrity of the community. This was a great conversation about playing the long game when building a business while providing a valuable utility to a complex problem domain.
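
As a rough illustration of the programming model, here is a minimal sketch of a Dagster job composed of a few ops; the op names and logic are placeholders, and the API has evolved over time, so check the current documentation for specifics.

```python
# Minimal sketch of a Dagster job: ops are units of computation, and the
# job wires their dependencies together by calling them like functions.
from dagster import job, op

@op
def extract():
    # stand-in for pulling records from a source system
    return [1, 2, 3]

@op
def transform(records):
    return [r * 10 for r in records]

@op
def load(records):
    print(f"loaded {len(records)} records")

@job
def etl():
    load(transform(extract()))

if __name__ == "__main__":
    etl.execute_in_process()
```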


When, Why, and How To Use Web Scraping In A Nutshell - Episode 278

The internet is a rich source of information, but a majority of it isn’t accessible programmatically through APIs or databases. To address that shortcoming there are a variety of web scraping frameworks that aid in extracting structured data from web pages. In this episode Attila Tóth shares the challenges of web data extraction, the ways that you can use it, and how Scrapy and ScrapingHub can help you with your projects.
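
For a sense of what a scraping project looks like in practice, here is a minimal Scrapy spider sketched against the Scrapy tutorial's demo site; the selectors are specific to that page and would need to be adapted for other sites.

```python
# Minimal Scrapy spider that yields structured items and follows pagination.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # yield one structured item per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # follow the pagination link and parse it with the same callback
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Running it with `scrapy runspider quotes_spider.py -o quotes.json` writes the extracted items to a JSON file.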


A Data Catalog For Your PyData Projects - Episode 213

One of the biggest pain points when working with data is dealing with the boilerplate code required to load it into a usable format. Intake encapsulates all of that and puts it behind a single API. In this episode Martin Durant explains how to use the Intake data catalogs for encapsulating source information, how it simplifies data science workflows, and how to incorporate it into your projects. It is a lightweight way to enable collaboration between data engineers and data scientists in the PyData ecosystem.
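
As a rough sketch of the workflow Intake enables, the snippet below assumes a hypothetical catalog.yml with an entry named my_dataset; the catalog file itself declares where the data lives and which driver loads it.

```python
# Minimal sketch of loading data through an Intake catalog.
import intake

# catalog.yml would declare the sources (driver, path, parameters) in YAML
cat = intake.open_catalog("catalog.yml")

print(list(cat))        # list the named data sources in the catalog
ds = cat.my_dataset     # hypothetical entry defined in catalog.yml
print(ds.discover())    # inspect schema and metadata without a full load
df = ds.read()          # load the source into a pandas DataFrame
```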


Gnocchi: A Scalable Time Series Database For Your Metrics with Julien Danjou - Episode 189

Do you know what your servers are doing? If you have a metrics system in place then the answer should be “yes”. One critical aspect of that platform is the timeseries database that allows you to store, aggregate, analyze, and query the various signals generated by your software and hardware. As the size and complexity of your systems scale, so does the volume of data that you need to manage which can put a strain on your metrics stack. Julien Danjou built Gnocchi during his time on the OpenStack project to provide a time oriented data store that would scale horizontally and still provide fast queries. In this episode he explains how the project got started, how it works, how it compares to the other options on the market, and how you can start using it today to get better visibility into your operations.


Of Checklists, Ethics, and Data with Emily Miller and Peter Bull - Episode 184

As data science becomes more widespread and has a bigger impact on the lives of people, it is important that those projects and products are built with a conscious consideration of ethics. Keeping ethical principles in mind throughout the lifecycle of a data project helps to reduce the overall effort of preventing negative outcomes from the use of the final product. Emily Miller and Peter Bull of Driven Data have created Deon to improve the communication and conversation around ethics among and between data teams. It is a Python project that generates a checklist of common concerns for data oriented projects at the various stages of the lifecycle where they should be considered. In this episode they discuss their motivation for creating the project, the challenges and benefits of maintaining such a checklist, and how you can start using it today.


Fast Stream Processing In Python Using Faust with Ask Solem - Episode 176

The need to process unbounded and continually streaming sources of data has become increasingly common. One of the popular platforms for implementing this is Kafka along with its streams API. Unfortunately, this requires all of your processing or microservice logic to be implemented in Java, so what’s a poor Python developer to do? If that developer is Ask Solem of Celery fame then the answer is, help to re-implement the streams API in Python. In this episode Ask describes how Faust got started, how it works under the covers, and how you can start using it today to process your fast moving data in easy to understand Python code. He also discusses ways in which Faust might be able to replace your Celery workers, and all of the pieces that you can replace with your own plugins.
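
For a flavor of what stream processing looks like in Faust, here is a minimal sketch of an agent consuming a Kafka topic; the broker address, topic name, and record fields are placeholders.

```python
# Minimal sketch of a Faust app with one agent processing a Kafka topic.
import faust

app = faust.App("orders-app", broker="kafka://localhost:9092")

class Order(faust.Record):
    account_id: str
    amount: float

orders_topic = app.topic("orders", value_type=Order)

@app.agent(orders_topic)
async def process_orders(orders):
    # each agent consumes a stream of deserialized Order records
    async for order in orders:
        print(f"processing order for {order.account_id}: {order.amount}")

if __name__ == "__main__":
    app.main()  # start a worker with: python this_file.py worker
```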


Asking Questions From Data Using Active Learning with Tivadar Danka - Episode 162

One of the challenges of machine learning is obtaining large enough volumes of well labelled data. One approach to reducing the effort required to label data sets is active learning, in which the most informative unlabelled samples are identified and handed to domain experts for labelling. In this episode Tivadar Danka describes how he built modAL to bring active learning to bioinformatics. He is using it for doing human in the loop training of models to detect cell phenotypes with massive unlabelled datasets. He explains how the library works, how he designed it to be modular for a broad set of use cases, and how you can use it for training models of your own.
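
As a rough sketch of the human-in-the-loop pattern, the snippet below runs a small modAL query loop against synthetic data, with a random stand-in where a domain expert's label would normally go.

```python
# Minimal sketch of an active learning loop with modAL and scikit-learn.
import numpy as np
from modAL.models import ActiveLearner
from sklearn.ensemble import RandomForestClassifier

# small labelled seed set plus a larger unlabelled pool (synthetic data)
X_seed = np.random.rand(20, 4)
y_seed = np.random.randint(0, 2, size=20)
X_pool = np.random.rand(500, 4)

learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    X_training=X_seed,
    y_training=y_seed,
)

for _ in range(10):
    # ask the learner which pool sample it is least certain about
    query_idx, query_sample = learner.query(X_pool)
    # a domain expert would supply the real label here; we fake one
    y_new = np.random.randint(0, 2, size=1)
    learner.teach(X_pool[query_idx], y_new)
    X_pool = np.delete(X_pool, query_idx, axis=0)
```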


Great Expectations For Your Data Pipelines with Abe Gong and James Campbell - Episode 161

Testing is a critical activity in all software projects, but one that is often neglected in data pipelines. The complexities introduced by the inherent statefulness of the problem domain and the interdependencies between systems combine to make pipeline testing difficult to manage. To make this endeavor more manageable Abe Gong and James Campbell have created Great Expectations. In this episode they discuss how you can use the project to create tests in the exploratory phase of building a pipeline and leverage those to monitor your systems in production. They also discuss how Great Expectations works, the difficulties associated with pipeline testing and managing associated technical debt, and their future plans for the project.
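
To give a flavor of the approach, here is a minimal sketch using the early pandas-flavoured Great Expectations API; the file name and columns are placeholders, and the library's interface has changed considerably in later releases.

```python
# Minimal sketch of declaring and validating expectations on a dataset.
import great_expectations as ge

df = ge.read_csv("orders.csv")  # a pandas DataFrame with expectation methods

df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_unique("order_id")
df.expect_column_values_to_be_between("amount", min_value=0, max_value=100000)

# validate() runs every expectation declared above and reports the results,
# which is what you would wire into a pipeline as a data quality gate
results = df.validate()
print(results)
```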


Bonobo: Lightweight ETL Toolkit for Python 3 with Romain Dorgueil - Episode 143

A majority of the work that we do as programmers involves data manipulation in some manner. This can range from large scale collection, aggregation, and statistical analysis across distributed systems to something as simple as making a graph in a spreadsheet. In the middle of that range is the general task of ETL (Extract, Transform, and Load), which has its own range of scale. In this episode Romain Dorgueil discusses his experiences building ETL systems and the problems that he routinely encountered that led him to creating Bonobo, a lightweight, easy to use toolkit for data processing in Python 3. He also explains how the system works under the hood, how you can use it for your projects, and what he has planned for the future.
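
As a rough illustration of the style, here is a minimal sketch of a Bonobo graph wiring together placeholder extract, transform, and load steps.

```python
# Minimal sketch of a Bonobo ETL graph built from plain Python callables.
import bonobo

def extract():
    # a real pipeline would read from files, APIs, or databases here
    yield from ["alpha", "beta", "gamma"]

def transform(value):
    yield value.upper()

def load(value):
    print(value)

def get_graph():
    graph = bonobo.Graph()
    graph.add_chain(extract, transform, load)
    return graph

if __name__ == "__main__":
    bonobo.run(get_graph())
```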
