GenSim with Radim Řehůřek

The podcast about Python and the people who make it great

20 August 2016

GenSim with Radim Řehůřek - E71

Summary

Understanding the context of a piece of text is generally thought to be the domain of human intelligence. However, topic modeling and semantic analysis allow a computer to determine whether different messages and articles are about the same thing. This week we spoke with Radim Řehůřek about his work on Gensim, a Python library for performing unsupervised analysis of unstructured text and applying machine learning models to the problem of natural language understanding.
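
As a rough illustration of what "about the same thing" can mean to a computer, the sketch below compares texts by the words they share. It uses only the Python standard library and does not use Gensim's API; the helper names `bag_of_words` and `cosine_similarity` and the sample texts are made up for this example. Real topic models go further than raw word overlap, so treat this as a simplified starting point rather than how Gensim works internally.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count how often each word occurs."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    # Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents: two on one subject, one unrelated.
docs = {
    "python_1": "Python is a programming language used for data analysis",
    "python_2": "Data analysis in the Python programming language",
    "cooking": "Simmer the onions and garlic in olive oil",
}
vectors = {name: bag_of_words(text) for name, text in docs.items()}

same_topic = cosine_similarity(vectors["python_1"], vectors["python_2"])
different_topic = cosine_similarity(vectors["python_1"], vectors["cooking"])

# The two Python texts score higher than the Python/cooking pair.
print(same_topic > different_topic)
```

Word counting like this misses synonyms and paraphrases, which is the gap that topic modeling and semantic analysis aim to close.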

Brief Introduction

Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable. For details on how to support the show you can visit our site at pythonpodcast.com.
Linode is sponsoring us this week. Check them out at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for your next project.
We are also sponsored by Sentry this week. Stop hoping your users will report bugs. Sentry’s real-time tracking gives you insight into production deployments and information to reproduce and fix crashes. Check them out at getsentry.com and use the code podcastinit at signup to get a $50 credit on your account.
Visit our site to subscribe to our show, sign up for our newsletter, read the show notes, and get in touch.
To help other people find the show, you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers.
Join our community! Visit discourse.pythonpodcast.com for your opportunity to find out about upcoming guests, suggest questions, and propose show ideas.
Your hosts, as usual, are Tobias Macey and Chris Patti.
Today we’re interviewing Radim Řehůřek about Gensim, a library for topic modeling and semantic analysis of natural language.

Interview with Radim Řehůřek

Introductions
How did you get introduced to Python? – Chris
Can you start by giving us an explanation of topic modeling and semantic analysis? – Tobias
What is Gensim and what inspired you to create it? – Tobias
What facilities does Gensim provide to simplify the work of this kind of language analysis? – Tobias
Can you describe the features that set it apart from other projects such as NLTK or spaCy? – Tobias
What are some of the practical applications that Gensim can be used for? – Tobias
One of the features that stuck out to me is that Gensim can process corpora on disk that would be too large to fit into memory. Can you explain some of the algorithmic work that was necessary to make this streaming possible? – Tobias
- Given that it can handle streams of data, could it also be used in the context of something like Spark? – Tobias

Gensim also supports unsupervised model building. What kinds of limitations does this have and when would you need a human in the loop? – Tobias
- Once a model has been trained, how does it get saved and reloaded for subsequent use? – Tobias

What are some of the more unorthodox or interesting uses people have put Gensim to that you’ve heard about? – Chris

In addition to your work on Gensim, and partly due to its popularity, you have started a consultancy for customers who are interested in improving their data analysis capabilities. How does that feed back into Gensim? – Tobias

Are there any improvements in Gensim or other libraries that you have made available as a result of issues that have come up during client engagements? – Tobias

Is it difficult to find contributors to Gensim because of its advanced nature? – Tobias

Are there any resources you’d like to recommend our listeners explore to get a more in depth understanding of topic modeling and related techniques? – Chris