Data Analysis

An Exploration Of Effective Pandas Practices With Matt Harrison - Episode 348

Pandas has grown to be a ubiquitous tool for working with data at every stage. It has become so well known that many people learn Python solely for the purpose of using Pandas. With all of this activity and the long history of the project it can be easy to find misleading or outdated information about how to use it. In this episode Matt Harrison shares his work on the book “Effective Pandas” and some of the best practices and potential pitfalls that you should know for applying Pandas in your own work.

Read More

Build Better Analytics And Models With A Focus On The Data Experience - Episode 340

A lot of time and energy goes into data analysis and machine learning projects to address various goals. Most of the effort is focused on the technical aspects and validating the results, but how much time do you spend on considering the experience of the people who are using the outputs of these projects? In this episode Benn Stancil explores the impact that our technical focus has on the perceived value of our work, and how taking the time to consider what the desired experience will be can lead us to approach our work more holistically and increase the satisfaction of everyone involved.

Read More

An Exploration Of Financial Exchange Risk Management Strategies - Episode 336

The world of finance has driven the development of many sophisticated techniques for data analysis. In this episode Paul Stafford shares his experiences working in the realm of risk management for financial exchanges. He discusses the types of risk that are involved, the statistical methods that he has found most useful for identifying strategies to mitigate that risk, and the software libraries that have helped him most in his work.

Read More

Network Analysis At The Speed Of C With The Power Of Python Using NetworKit - Episode 327

Analysing networks is a growing area of research in academia and industry. In order to be able to answer questions about large or complex relationships it is necessary to have fast and efficient algorithms that can process the data quickly. In this episode Eugenio Angriman discusses his contributions to the NetworKit library to provide an accessible interface for these algorithms. He shares how he is using NetworKit for his own research, the challenges of working with large and complex networks, and the kinds of questions that can be answered with data that fits on your laptop.

Read More

Taking Aim At The Legacy Of SQL With The Preql Relational Language - Episode 325

SQL has gone through many cycles of popularity and disfavor. Despite its longevity it is objectively challenging to work with in a collaborative and composable manner. In order to address these shortcomings and build a new interface for your database oriented workloads Erez Shinan created Preql. It is based on the same relational algebra that inspired SQL, but brings in more robust computer science principles to make it more manageable as you scale in complexity. In this episode he shares his motivation for creating the Preql project, how he has used Python to develop a new language for interacting with database engines, and the challenges of taking on the legacy of SQL as an individual.

Read More

Fast And Educational Exploration And Analysis Of Graph Data Structures With graph-tool - Episode 322

If you are interested in a library for working with graph structures that will also help you learn more about the research and theory behind the algorithms then look no further than graph-tool. In this episode Tiago Peixoto shares his work on graph algorithms and networked data and how he has built graph-tool to help in that research. He explains how it is implemented, how it evolved from a simple command line tool to a full-fledged library, and the benefits that he has found from building a personal project in the open.

Read More

Keep Your Analytics Lint Free With SQLFluff - Episode 318

The growth of analytics has accelerated the use of SQL as a first class language. It has also grown the amount of collaboration involved in writing and maintaining SQL queries. With collaboration comes the inevitable variation in how queries are written, both structurally and stylistically which can lead to a significant amount of wasted time and energy during code review and employee onboarding. Alan Cruickshank was feeling the pain of this wasted effort first-hand which led him down the path of creating SQLFluff as a linter and formatter to enforce consistency and find bugs in the SQL code that he and his team were working with. In this episode he shares the story of how SQLFluff evolved from a simple hackathon project to an open source linter that is used across a range of companies and fosters a growing community of users and contributors. He explains how it has grown to support multiple dialects of SQL, as well as integrating with projects like DBT to handle templated queries. This is a great conversation about the long detours that are sometimes necessary to reach your original destination and the powerful impact that good tooling can have on team productivity.

Read More

Data Exploration and Visualization Made Effortless with Lux - Episode 313

Data exploration is an important step in any analysis or machine learning project. Visualizing the data that you are working with makes that exploration faster and more effective, but having to remember and write all of the code to build a scatter plot or histogram is tedious and time consuming. In order to eliminate that friction Doris Lee helped create the Lux project, which wraps your Pandas data frame and automatically generates a set of visualizations without you having to lift a finger. In this episode she explains how Lux works under the hood, what inspired her to create it in the first place, and how it can help you create a better end result. The Lux project is a valuable addition to the toolbox of anyone who is doing data wrangling with Pandas.

Read More

Be Data Driven At Any Scale With Superset - Episode 307

Becoming data driven is the stated goal of a large and growing number of organizations. In order to achieve that mission they need a reliable and scalable method of accessing and analyzing the data that they have. While business intelligence solutions have been around for ages, they don’t all work well with the systems that we rely on today and a majority of them are not open source. Superset is a Python powered platform for exploring your data and building rich interactive dashboards that gets the information that your organization needs in front of the people that need it. In this episode Maxime Beauchemin, the creator of Superset, shares how the project got started and why it has become such a widely used and popular option for exploring and sharing data at companies of all sizes. He also explains how it functions, how you can customize it to fit your specific needs, and how to get it up and running in your own environment.

Read More

Simplified Data Extraction And Analysis For Current Events With Newspaper - Episode 280

News media is an important source of information for understanding the context of the world. To make it easier to access and process the contents of news sites Lucas Ou-Yang built the Newspaper library that aids in automatic retrieval of articles and prepare it for analysis. In this episode he shares how the project got started, how it is implemented, and how you can get started with it today. He also discusses how recent improvements in the utility and ease of use of deep learning libraries open new possibilities for future iterations of the project.

Read More