Data Analysis

Fast And Educational Exploration And Analysis Of Graph Data Structures With graph-tool - Episode 322

If you are interested in a library for working with graph structures that will also help you learn more about the research and theory behind the algorithms then look no further than graph-tool. In this episode Tiago Peixoto shares his work on graph algorithms and networked data and how he has built graph-tool to help in that research. He explains how it is implemented, how it evolved from a simple command line tool to a full-fledged library, and the benefits that he has found from building a personal project in the open.

Read More

Keep Your Analytics Lint Free With SQLFluff - Episode 318

The growth of analytics has accelerated the use of SQL as a first class language. It has also grown the amount of collaboration involved in writing and maintaining SQL queries. With collaboration comes the inevitable variation in how queries are written, both structurally and stylistically which can lead to a significant amount of wasted time and energy during code review and employee onboarding. Alan Cruickshank was feeling the pain of this wasted effort first-hand which led him down the path of creating SQLFluff as a linter and formatter to enforce consistency and find bugs in the SQL code that he and his team were working with. In this episode he shares the story of how SQLFluff evolved from a simple hackathon project to an open source linter that is used across a range of companies and fosters a growing community of users and contributors. He explains how it has grown to support multiple dialects of SQL, as well as integrating with projects like DBT to handle templated queries. This is a great conversation about the long detours that are sometimes necessary to reach your original destination and the powerful impact that good tooling can have on team productivity.

Read More

Data Exploration and Visualization Made Effortless with Lux - Episode 313

Data exploration is an important step in any analysis or machine learning project. Visualizing the data that you are working with makes that exploration faster and more effective, but having to remember and write all of the code to build a scatter plot or histogram is tedious and time consuming. In order to eliminate that friction Doris Lee helped create the Lux project, which wraps your Pandas data frame and automatically generates a set of visualizations without you having to lift a finger. In this episode she explains how Lux works under the hood, what inspired her to create it in the first place, and how it can help you create a better end result. The Lux project is a valuable addition to the toolbox of anyone who is doing data wrangling with Pandas.

Read More

Be Data Driven At Any Scale With Superset - Episode 307

Becoming data driven is the stated goal of a large and growing number of organizations. In order to achieve that mission they need a reliable and scalable method of accessing and analyzing the data that they have. While business intelligence solutions have been around for ages, they don’t all work well with the systems that we rely on today and a majority of them are not open source. Superset is a Python powered platform for exploring your data and building rich interactive dashboards that gets the information that your organization needs in front of the people that need it. In this episode Maxime Beauchemin, the creator of Superset, shares how the project got started and why it has become such a widely used and popular option for exploring and sharing data at companies of all sizes. He also explains how it functions, how you can customize it to fit your specific needs, and how to get it up and running in your own environment.

Read More

Simplified Data Extraction And Analysis For Current Events With Newspaper - Episode 280

News media is an important source of information for understanding the context of the world. To make it easier to access and process the contents of news sites Lucas Ou-Yang built the Newspaper library that aids in automatic retrieval of articles and prepare it for analysis. In this episode he shares how the project got started, how it is implemented, and how you can get started with it today. He also discusses how recent improvements in the utility and ease of use of deep learning libraries open new possibilities for future iterations of the project.

Read More

When, Why, and How To Use Web Scraping In A Nutshell - Episode 278

The internet is a rich source of information, but a majority of it isn’t accessible programmatically through APIs or databases. To address that shortcoming there are a variety of web scraping frameworks that aid in extracting structured data from web pages. In this episode Attila Tóth shares the challenges of web data extraction, the ways that you can use it, and how Scrapy and ScrapingHub can help you with your projects.

Read More

Open Source Product Analytics With PostHog - Episode 266

You spend a lot of time and energy on building a great application, but do you know how it’s actually being used? Using a product analytics tool lets you gain visibility into what your users find helpful so that you can prioritize feature development and optimize customer experience. In this episode PostHog CTO Tim Glaser shares his experience building an open source product analytics platform to make it easier and more accessible to understand your product. He shares the story of how and why PostHog was created, how to incorporate it into your projects, the benefits of providing it as open source, and how it is implemented. If you are tired of fighting with your user analytics tools, or unwilling to entrust your data to a third party, then have a listen and then test out PostHog for yourself.

Read More

Build Your Own Personal Data Repository With Nostalgia - Episode 248

The companies that we entrust our personal data to are using that information to gain extensive insights into our lives and habits while not always making those findings accessible to us. Pascal van Kooten decided that he wanted to have the same capabilities to mine his personal data, so he created the Nostalgia project to integrate his various data sources and query across them. In this episode he shares his motivation for creating the project, how he is using it in his day-to-day, and how he is planning to evolve it in the future. If you’re interested in learning more about yourself and your habits using the personal data that you share with the various services you use then listen now to learn more.

Read More

Building A Business On Building Data Driven Businesses - Episode 246

In order for an organization to be data driven they need easy access to their data and a simple way of sharing it. Arik Fraimovich built Redash as a way to address that need by connecting to any data source and building attractive dashboards on top of them. In this episode he shares the origin story of the project, his experiences running a business based on open source, and the challenges of working with data effectively.

Read More

Exploratory Data Analysis Made Easy At The Command Line - Episode 230

There are countless tools and libraries in Python for data scientists to perform powerful analyses, but they often have a setup cost that acts as a barrier to ad-hoc exploration of data. Visidata is a command line application that eliminates the friction involved with starting the discovery process. In this episode Saul Pwanson explains his motivation for creating it, why a terminal environment is a useful place for this work, and how you can use Visidata for your own work. If you have ever avoided looking at a data set because you couldn’t be bothered with the boilerplate for a Jupyter notebook, then Visidata is the perfect addition to your toolbox.

Read More