Understanding the context of a piece of text is generally thought to be the domain of human intelligence. However, topic modeling and semantic analysis allow a computer to determine whether different messages and articles are about the same thing. This week we spoke with Radim Řehůřek about his work on Gensim, a Python library for performing unsupervised analysis of unstructured text and applying machine learning models to the problem of natural language understanding.
Do you want to try out some of the tools and applications that you heard about on Podcast.__init__? Do you have a side project that you want to share with the world? Check out Linode at linode.com/podcastinit or use the code podcastinit2020 and get a $20 credit to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable. For details on how to support the show, you can visit our site at pythonpodcast.com.
- Linode is sponsoring us this week. Check them out at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for your next project.
- We are also sponsored by Sentry this week. Stop hoping your users will report bugs. Sentry’s real-time tracking gives you insight into production deployments and information to reproduce and fix crashes. Check them out at getsentry.com and use the code podcastinit at signup to get a $50 credit on your account.
- Visit our site to subscribe to our show, sign up for our newsletter, read the show notes, and get in touch.
- To help other people find the show, you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers.
- Join our community! Visit discourse.pythonpodcast.com for your opportunity to find out about upcoming guests, suggest questions, and propose show ideas.
- Your hosts, as usual, are Tobias Macey and Chris Patti.
- Today we’re interviewing Radim Řehůřek about Gensim, a library for topic modeling and semantic analysis of natural language.
Interview with Radim Řehůřek
- How did you get introduced to Python? – Chris
- Can you start by giving us an explanation of topic modeling and semantic analysis? – Tobias
- What is Gensim and what inspired you to create it? – Tobias
- What facilities does Gensim provide to simplify the work of this kind of language analysis? – Tobias
- Can you describe the features that set it apart from other projects such as NLTK or spaCy? – Tobias
- What are some of the practical applications that Gensim can be used for? – Tobias
- One of the features that stuck out to me is the fact that Gensim can process corpora on disk that would be too large to fit into memory. Can you explain some of the algorithmic work that was necessary to make this streaming process possible? – Tobias
- Given that it can handle streams of data, could it also be used in the context of something like Spark? – Tobias
- Gensim also supports unsupervised model building. What kinds of limitations does this have and when would you need a human in the loop? – Tobias
- Once a model has been trained, how does it get saved and reloaded for subsequent use? – Tobias
- What are some of the more unorthodox or interesting uses people have put Gensim to that you’ve heard about? – Chris
- In addition to your work on Gensim, and partly due to its popularity, you have started a consultancy for customers who are interested in improving their data analysis capabilities. How does that feed back into Gensim? – Tobias
- Are there any improvements in Gensim or other libraries that you have made available as a result of issues that have come up during client engagements? – Tobias
- Is it difficult to find contributors to Gensim because of its advanced nature? – Tobias
- Are there any resources you’d like to recommend our listeners explore to get a more in-depth understanding of topic modeling and related techniques? – Chris
Keep In Touch
- Dark Matter and the Dinosaurs by Lisa Randall
- Nadia Eghbal
- SQL Addict
- Latent Dirichlet Allocation (LDA)
- Keynote in Italy on distributed processing
- Google Scholar references for Gensim
- Stylometric analysis
- On Writing Well
- Student Incubator
- Wikipedia on topic modeling