Polyglot: Multi-Lingual Natural Language Processing with Rami Al-Rfou - Episode 190

Summary

Using computers to analyze text can produce useful and inspirational insights. However, when working with multiple languages the capabilities of existing models are severely limited. In order to help overcome this limitation Rami Al-Rfou built Polyglot. In this episode he explains his motivation for creating a natural language processing library with support for a vast array of languages, how it works, and how you can start using it for your own projects. He also discusses current research on multi-lingual text analytics, how he plans to improve Polyglot in the future, and how it fits in the Python ecosystem.

linode-banner-sponsor-largeDo you want to try out some of the tools and applications that you heard about on Podcast.__init__? Do you have a side project that you want to share with the world? Check out Linode at linode.com/podcastinit or use the code podcastinit2019 and get a $20 credit to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.


Clubhouse LogoThis episode of Podcast.__init__ is brought to you by Clubhouse, the first project management platform for software development that brings everyone together so that teams can focus on what matters – creating products their customers love. Clubhouse provides the perfect balance of simplicity and structure for better cross-functional collaboration. Its fast, intuitive interface makes it easy for people on any team to focus-in on their work on a specific task or project, while also being able to “zoom out” to see how that work is contributing towards the bigger picture. With a simple API and robust set of integrations, Clubhouse also seamlessly integrates with the tools you use everyday, getting out of your way so that you can deliver quality software on time.

Listeners of Podcast.__init__ can sign up for two free months of Clubhouse by visiting pythonpodcast.com/clubhouse.



Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. Podcast.__init__ listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a trial.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected])
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Rami Al-Rfou about Polyglot, a natural language pipeline with support for an impressive amount of languages

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by describing what Polyglot is and your reasons for starting the project?
  • What are the types of use cases that Polyglot enables which would be impractical with something such as NLTK or SpaCy?
  • A majority of NLP libraries have a limited set of languages that they support. What is involved in adding support for a given language to a natural language tool?
    • What is involved in adding a new language to Polyglot?
    • Which families of languages are the most challenging to support?
  • What types of operations are supported and how consistently are they supported across languages?
  • How is Polyglot implemented?
  • Is there any capacity for integrating Polyglot with other tools such as SpaCy or Gensim?
  • How much domain knowledge is required to be able to effectively use Polyglot within an application?
  • What are some of the most interesting or unique uses of Polyglot that you have seen?
  • What have been some of the most complex or challenging aspects of building Polyglot?
  • What do you have planned for the future of Polyglot?
  • What are some areas of NLP research that you are excited for?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA