Data Science

Data Exploration and Visualization Made Effortless with Lux - Episode 313

Data exploration is an important step in any analysis or machine learning project. Visualizing the data that you are working with makes that exploration faster and more effective, but having to remember and write all of the code to build a scatter plot or histogram is tedious and time-consuming. In order to eliminate that friction, Doris Lee helped create the Lux project, which wraps your Pandas data frame and automatically generates a set of visualizations without you having to lift a finger. In this episode she explains how Lux works under the hood, what inspired her to create it in the first place, and how it can help you create a better end result. The Lux project is a valuable addition to the toolbox of anyone who is doing data wrangling with Pandas.
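
As a rough sketch of the workflow described above (the dataset and column names here are made up, and the exact API is worth checking against the Lux documentation), importing Lux alongside Pandas is enough to get recommended visualizations when a data frame is displayed in a notebook:

```python
import pandas as pd
import lux  # installed as the lux-api package; hooks into Pandas on import

# Hypothetical dataset; any DataFrame loaded after importing lux gets the widget.
df = pd.read_csv("employees.csv")

# Displaying the frame in a Jupyter notebook shows a toggle between the plain
# table and a gallery of automatically generated charts.
df

# Optionally steer the recommendations toward the columns you care about.
df.intent = ["salary", "department"]  # hypothetical column names
df
```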

Read More

Analyzing The Ecosystem of Python Data Companies With Tony Liu - Episode 305

There are a large and growing number of businesses built by and for data science and machine learning teams that rely on Python. Tony Liu is a venture investor who is following that market closely and betting on its continued success. In this episode he shares his own journey into the role of an investor and discusses what he is most excited about in the industry. He also explains what he looks at when investing in a business and gives advice on what potential founders and early employees of startups should be thinking about when starting on that journey.

Read More

Go From Notebook To Pipeline For Your Data Science Projects With Orchest - Episode 304

Jupyter notebooks are a dominant tool for data scientists, but they lack a number of conveniences for building reusable and maintainable systems. Machine learning projects in particular need a way to pivot from exploring a dataset or problem to integrating that solution into a larger workflow. Rick Lamers and Yannick Perrenet were tired of struggling with one-off solutions when they created the Orchest platform. In this episode they explain how Orchest allows you to turn your notebooks into executable components that are integrated into a graph of execution for running end-to-end machine learning workflows.
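
As an illustration of what a single pipeline step might look like, here is a minimal sketch based on my recollection of the Orchest SDK; treat the call names and the step output names as assumptions and confirm them against the project documentation:

```python
import orchest  # Orchest SDK available inside a pipeline step (assumed API)

# Read whatever the upstream steps in the pipeline graph produced.
inputs = orchest.get_inputs()
raw_rows = inputs["cleaned_data"]  # hypothetical name of an upstream output

# The work that used to live in an exploratory notebook cell.
features = [row for row in raw_rows if row is not None]

# Hand the result to downstream steps in the graph.
orchest.output(features, name="features")
```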

Read More

Giving Your Data Science Projects And Teams A Home At DagsHub - Episode 301

Collaborating on software projects is largely a solved problem, with a variety of hosted or self-managed platforms to choose from. For data science projects, collaboration is still an open question. There are a number of projects that aim to bring collaboration to data science, but each of them addresses a different aspect of the problem. Dean Pleban and Guy Smoilovsky created DagsHub to give individuals and teams a place to store and version their code, data, and models. In this episode they explain how DagsHub is designed to make it easier to create and track machine learning experiments, and to serve as a way to promote collaboration on open source data science projects.
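
One concrete piece of that experiment tracking story is that DagsHub can act as a remote MLflow tracking server for a repository. The snippet below is a minimal sketch under that assumption; the repository URL, credentials, and the `.mlflow` URI pattern are placeholders to verify against the DagsHub documentation:

```python
import os
import mlflow

# Placeholder credentials for the hosted tracking server.
os.environ["MLFLOW_TRACKING_USERNAME"] = "your-user"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "your-token"

# Assumed URI pattern for a DagsHub-hosted MLflow tracking server.
mlflow.set_tracking_uri("https://dagshub.com/your-user/your-repo.mlflow")

with mlflow.start_run():
    mlflow.log_param("model", "random_forest")
    mlflow.log_metric("accuracy", 0.92)
```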

Read More

Turning Notebooks Into Collaborative And Dynamic Data Applications With Hex - Episode 294

Notebooks have been a useful tool for analytics, exploratory programming, and shareable data science for years, and their popularity is continuing to grow. Despite their widespread use, there are still a number of challenges that inhibit collaboration and use by non-technical stakeholders. Barry McCardel and his team at Hex have built a platform to make collaboration on Jupyter notebooks a first class experience, as well as allowing notebooks to be parameterized and exposing their logic through interactive web applications. In this episode Barry shares his perspective on the state of the notebook ecosystem, why it is such a powerful tool for computing and analytics, and how he has built a successful business around improving the end-to-end experience of working with notebooks. This was a great conversation about an important piece of the toolkit for every analyst and data scientist.

Read More

Add Anomaly Detection To Your Time Series Data With Luminaire - Episode 293

When working with data it’s important to know whether it is correct. If there is a time dimension, then it can be difficult to tell whether a given variation is normal. Anomaly detection is a useful tool to address these challenges, but a difficult one to do well. In this episode Smit Shah and Sayan Chakraborty share the work they have done on Luminaire to make anomaly detection easier to work with. They explain the complexities inherent to working with time series data, the strategies that they have incorporated into Luminaire, and how they are using it in their data pipelines to identify errors early. If you are working with any kind of time series then it’s worth giving Luminaire a look.
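
To make the problem concrete, here is a deliberately simple rolling z-score check in plain Pandas; it is not Luminaire's API, just an illustration of the kind of point anomaly on a time series that a library like Luminaire detects with far more sophistication:

```python
import pandas as pd

def flag_anomalies(series: pd.Series, window: int = 24, threshold: float = 3.0) -> pd.Series:
    """Mark points that sit far from the recent rolling mean."""
    rolling_mean = series.rolling(window).mean()
    rolling_std = series.rolling(window).std()
    zscore = (series - rolling_mean) / rolling_std
    return zscore.abs() > threshold

# Hourly metric that is flat except for a spike at the end.
ts = pd.Series(
    [10.0] * 48 + [55.0],
    index=pd.date_range("2021-01-01", periods=49, freq="H"),
)
print(flag_anomalies(ts).tail())  # only the final spike is flagged
```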

Read More

Scale Your Data Science Teams With Machine Learning Operations Principles - Episode 289

Building a machine learning model is a process that requires well curated and cleaned data and a lot of experimentation. Doing it repeatably and at scale with a team requires a way to share your discoveries with your teammates. This has led to a new set of operational ML platforms. In this episode Michael Del Balso shares the lessons that he learned from building the platform at Uber for putting machine learning into production. He also explains how the feature store is becoming the core abstraction for data teams to collaborate on building machine learning models. If you are struggling to get your models into production, or scale your data science throughput, then this interview is worth a listen.
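
For readers unfamiliar with the term, the sketch below is a purely hypothetical, in-memory illustration of the feature store idea: feature transformations are registered once by name so that training and serving code compute them the same way. It is not the API of Uber's platform or any vendor's product:

```python
from typing import Callable, Dict, List

class FeatureStore:
    """Toy registry of named feature transformations shared across a team."""

    def __init__(self) -> None:
        self._features: Dict[str, Callable[[dict], float]] = {}

    def register(self, name: str, fn: Callable[[dict], float]) -> None:
        self._features[name] = fn

    def get_features(self, entity: dict, names: List[str]) -> Dict[str, float]:
        # The same definitions serve both model training and online inference.
        return {name: self._features[name](entity) for name in names}

store = FeatureStore()
store.register("order_total", lambda order: sum(order["line_items"]))
store.register("item_count", lambda order: float(len(order["line_items"])))

order = {"line_items": [12.5, 3.0, 9.99]}
print(store.get_features(order, ["order_total", "item_count"]))
```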

Read More

Bringing Artificial Intelligence Projects From Idea To Production - Episode 287

Artificial intelligence applications can provide dramatic benefits to a business, but only if you can bring them from idea to production. Henrik Landgren was behind the original efforts at Spotify to leverage data for new product features, and in his current role he works on an AI system to evaluate new businesses to invest in. In this episode he shares advice on how to identify opportunities for leveraging AI to improve your business, the capabilities necessary to enable a successful project, and some of the pitfalls to watch out for. If you are curious about how to get started with AI, or what to consider as you build a project, then this is definitely worth a listen.

Read More

Growing Dask To Make Scaling Python Data Science Easier At Coiled - Episode 275

Python is a leading choice for data science due to the immense number of libraries and frameworks readily available to support it, but it is still difficult to scale. Dask is a framework designed to transparently run your data analysis across multiple CPU cores and multiple servers. Using Dask lifts a limitation for scaling your analytical workloads, but brings with it the complexity of server administration, deployment, and security. In this episode Matthew Rocklin and Hugo Bowne-Anderson discuss their recently formed company Coiled and how they are working to make the use and maintenance of Dask in production easier. They share their goals for the business, their approach to building a profitable company based on open source, and the difficulties they face while growing a new team during a global pandemic.
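
As a small taste of the scaling model discussed here (file paths and column names are placeholders), Dask exposes a Pandas-style API while splitting the work into partitions that can run across cores or a cluster:

```python
import dask.dataframe as dd

# Lazily read many CSV files as one logical dataframe, split into partitions.
df = dd.read_csv("logs-2020-*.csv")

# Build the computation with familiar Pandas-style operations...
daily_errors = df[df["status"] >= 500].groupby("date")["status"].count()

# ...and only execute it, in parallel, when a concrete result is needed.
print(daily_errors.compute())
```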

Read More

Building The Seq Language For Bioinformatics - Episode 257

Bioinformatics is a complex and computationally demanding domain. The intuitive syntax of Python and extensive set of libraries make it a great language for bioinformatics projects, but it is hampered by the need for computational efficiency. Ariya Shajii created the Seq language to bridge the divide between the performance of languages like C and C++ and the ecosystem of Python with built-in support for commonly used genomics algorithms. In this episode he describes his motivation for creating a new language, how it is implemented, and how it is being used in the life sciences. If you are interested in experimenting with sequencing data then give this a listen and then give Seq a try!
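
To give a feel for the workloads involved, here is a plain Python version (deliberately not Seq syntax) of a typical genomics inner loop, counting overlapping k-mers in a read; this is the kind of tight loop that Seq aims to compile down to native-code performance:

```python
from collections import Counter

def count_kmers(read: str, k: int = 8) -> Counter:
    """Count every overlapping k-length substring in a DNA read."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))

read = "ACGTACGTACGTTTGACGT"  # toy sequence
print(count_kmers(read, k=4).most_common(3))
```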

Read More