Stream Processing

Fast Stream Processing In Python Using Faust with Ask Solem - Episode 176

Summary

The need to process unbounded and continually streaming sources of data has become increasingly common. One of the popular platforms for implementing this is Kafka along with its streams API. Unfortunately, this requires all of your processing or microservice logic to be implemented in Java, so what’s a poor Python developer to do? If that developer is Ask Solem of Celery fame, then the answer is to help re-implement the streams API in Python. In this episode Ask describes how Faust got started, how it works under the covers, and how you can start using it today to process your fast-moving data in easy-to-understand Python code. He also discusses ways in which Faust might be able to replace your Celery workers, and all of the pieces that you can replace with your own plugins.
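For a rough feel of what processing a stream one event at a time looks like in Python, here is a minimal sketch that uses only the standard library's asyncio. It is not Faust's API and it does not talk to Kafka: the in-memory queue, the `source` and `count_by_key` functions, and the sample events are all hypothetical stand-ins chosen for illustration.

```python
import asyncio
from collections import Counter


async def source(events, queue):
    """Publish each event onto the queue, then a sentinel to mark the end."""
    for event in events:
        await queue.put(event)
    # A real stream is unbounded; the sentinel only exists so this demo stops.
    await queue.put(None)


async def count_by_key(queue, counts):
    """Consume events one at a time and keep a running count per user."""
    while True:
        event = await queue.get()
        if event is None:
            break
        counts[event["user"]] += 1


async def main():
    queue = asyncio.Queue()
    counts = Counter()
    events = [{"user": "alice"}, {"user": "bob"}, {"user": "alice"}]
    # Producer and consumer run concurrently on the same event loop.
    await asyncio.gather(source(events, queue), count_by_key(queue, counts))
    return counts


result = asyncio.run(main())
print(dict(result))
```

The running `Counter` plays the role of per-key state that is updated as each event arrives, which is the general shape of the processing discussed in the episode.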

Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to scale up. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at podcastinit.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Ask Solem about Faust, a library for building high performance, high throughput streaming systems in Python

Interview

  • Introductions
  • How did you get introduced to Python?
  • What is Faust and what was your motivation for building it?
    • What were the initial project requirements that led you to use Kafka as the primary infrastructure component for Faust?
  • Can you describe the architecture for Faust and how it has changed from when you first started writing it?
    • What mechanism does Faust use for managing consensus and failover among instances that are working on the same stream partition?
  • What are some of the lessons that you learned while building Celery that were most useful to you when designing Faust?
  • What have you found to be the most common areas of confusion for people who are just starting to build an application on top of Faust?
  • What have been the most interesting/unexpected/difficult aspects of building and maintaining Faust?
  • What have you found to be the most challenging aspects of building streaming applications?
  • What was the reason for releasing Faust as an open source project rather than keeping it internal to Robinhood?
  • What would be involved in adding support for alternate queue or stream implementations?
  • What do you have planned for the future of Faust?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

GenSim with Radim Řehůřek - Episode 71

Summary

Being able to understand the context of a piece of text is generally thought to be the domain of human intelligence. However, topic modeling and semantic analysis can be used to allow a computer to determine whether different messages and articles are about the same thing. This week we spoke with Radim Řehůřek about his work on GenSim, which is a Python library for performing unsupervised analysis of unstructured text and applying machine learning models to the problem of natural language understanding.

Brief Introduction

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable. For details on how to support the show you can visit our site at pythonpodcast.com
  • Linode is sponsoring us this week. Check them out at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for your next project
  • We are also sponsored by Sentry this week. Stop hoping your users will report bugs. Sentry’s real-time tracking gives you insight into production deployments and information to reproduce and fix crashes. Check them out at getsentry.com and use the code podcastinit at signup to get a $50 credit on your account.
  • Visit our site to subscribe to our show, sign up for our newsletter, read the show notes, and get in touch.
  • To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers
  • Join our community! Visit discourse.pythonpodcast.com for your opportunity to find out about upcoming guests, suggest questions, and propose show ideas.
  • Your hosts as usual are Tobias Macey and Chris Patti
  • Today we’re interviewing Radim Řehůřek about Gensim, a library for topic modeling and semantic analysis of natural language.

Interview with Radim Řehůřek

  • Introductions
  • How did you get introduced to Python? – Chris
  • Can you start by giving us an explanation of topic modeling and semantic analysis? – Tobias
  • What is Gensim and what inspired you to create it? – Tobias
  • What facilities does Gensim provide to simplify the work of this kind of language analysis? – Tobias
  • Can you describe the features that set it apart from other projects such as NLTK or spaCy? – Tobias
  • What are some of the practical applications that Gensim can be used for? – Tobias
  • One of the features that stuck out to me is the fact that Gensim can process corpora on disk that would be too large to fit into memory. Can you explain some of the algorithmic work that was necessary to allow for this streaming process to be possible? – Tobias
    • Given that it can handle streams of data, could it also be used in the context of something like Spark? – Tobias
  • Gensim also supports unsupervised model building. What kinds of limitations does this have and when would you need a human in the loop? – Tobias
    • Once a model has been trained, how does it get saved and reloaded for subsequent use? – Tobias
  • What are some of the more unorthodox or interesting uses people have put Gensim to that you’ve heard about? – Chris
  • In addition to your work on Gensim, and partly due to its popularity, you have started a consultancy for customers who are interested in improving their data analysis capabilities. How does that feed back into Gensim? – Tobias
  • Are there any improvements in Gensim or other libraries that you have made available as a result of issues that have come up during client engagements? – Tobias
  • Is it difficult to find contributors to Gensim because of its advanced nature? – Tobias
  • Are there any resources you’d like to recommend our listeners explore to get a more in depth understanding of topic modeling and related techniques? – Chris
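The question above about corpora that are too large to fit into memory comes down to iterating over documents lazily instead of loading them all at once. The following is a minimal standard-library sketch of that idea, not Gensim's own implementation or API. The `stream_documents` and `document_frequencies` helpers and the tiny temporary corpus file are hypothetical.

```python
import os
import tempfile
from collections import Counter


def stream_documents(path):
    """Yield one tokenized document per line without loading the whole file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.lower().split()


def document_frequencies(documents):
    """Count how many documents each token appears in, one document at a time."""
    frequencies = Counter()
    for tokens in documents:
        # set() so a token repeated within one document is counted once.
        frequencies.update(set(tokens))
    return frequencies


with tempfile.TemporaryDirectory() as tmp:
    corpus_path = os.path.join(tmp, "corpus.txt")
    with open(corpus_path, "w", encoding="utf-8") as handle:
        handle.write("Python streams data\nstreams of text data\n")
    # Only one line of the file is held in memory at any moment.
    frequencies = document_frequencies(stream_documents(corpus_path))

print(frequencies["streams"], frequencies["python"])
```

Because the generator hands over one document at a time, memory use depends on the vocabulary size rather than on the size of the corpus on disk.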

Keep In Touch

Picks

Links


Airflow with Maxime Beauchemin - Episode 44

Visit our site to listen to past episodes, support the show, join our community, and sign up for our mailing list.

Summary

Are you struggling with trying to manage a series of related, interdependent batch jobs? Then you should check out Airflow. In this episode we spoke with the project’s creator Maxime Beauchemin about what inspired him to create it, how it works, and why you might want to use it. Airflow is a data pipeline management tool that will simplify how you build, deploy, and monitor your complex data processing tasks so that you can focus on getting the insights you need from your data.
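To make the idea of interdependent batch jobs concrete, here is a small sketch that orders jobs by their dependencies using Python's standard `graphlib` module. It is not Airflow's API, and it leaves out scheduling, retries, and monitoring. The job names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each key maps to the set of jobs it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "enrich": {"extract"},
    "report": {"clean", "enrich"},
}


def run_pipeline(dependencies, run_job):
    """Run every job only after all of the jobs it depends on have run."""
    order = []
    for job in TopologicalSorter(dependencies).static_order():
        run_job(job)
        order.append(job)
    return order


executed = run_pipeline(dependencies, lambda job: print(f"running {job}"))
```

Declaring the dependencies as data and letting a sorter derive the execution order is the core of the directed acyclic graph approach. A workflow engine adds the operational pieces around it.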

Brief Introduction

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • Subscribe on iTunes, Stitcher, TuneIn or RSS
  • Follow us on Twitter or Google+
  • Give us feedback! Leave a review on iTunes, Tweet to us, send us an email or leave us a message on Google+
  • Join our community! Visit discourse.pythonpodcast.com for your opportunity to find out about upcoming guests, suggest questions, and propose show ideas.
  • I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable. For details on how to support the show you can visit our site at pythonpodcast.com
  • Linode is sponsoring us this week. Check them out at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for your next project
  • I would also like to thank Hired, a job marketplace for developers and designers, for sponsoring this episode of Podcast.__init__. Use the link hired.com/podcastinit to double your signing bonus.
  • Your hosts as usual are Tobias Macey and Chris Patti
  • Today we are interviewing Maxime Beauchemin about his work on the Airflow project.

Interview with Maxime Beauchemin

  • Introductions
  • How did you get introduced to Python? – Chris
  • What is Airflow and what are some of the kinds of problems it can be used to solve? – Chris
  • What are some of the biggest challenges that you have seen when implementing a data pipeline with a workflow engine? – Tobias
  • What are some of the signs that a workflow engine is needed? – Tobias
  • Can you share some of the design and architecture of Airflow and how you arrived at those decisions? – Tobias
  • How does Airflow compare to other workflow management solutions, and why did you choose to write your own? – Chris
  • One of the features of Airflow that is emphasized in the documentation is the ability to dynamically generate pipelines. Can you describe how that works and why it is useful? – Tobias
  • For anyone who wants to get started with using Airflow, what are the infrastructure requirements? – Tobias
  • Airflow, like a number of the other tools in the space, supports interoperability with Hadoop and its ecosystem. Can you elaborate on why JVM technologies have become so prevalent in the big data space and how Python fits into that overall problem domain? – Tobias
  • Airflow comes with a web UI for visualizing workflows, as do a few of the other Python workflow engines. Why is that an important feature for this kind of tool and what are some of the tasks and use cases that are supported in the Airflow web portal? – Tobias
  • One problem with data management is tracking the provenance of data as it is manipulated and shuttled between different systems. Does Airflow have any support for maintaining that kind of information and if not do you have recommendations for how practitioners can approach the issue? – Tobias
  • What other kinds of metadata can Airflow track as it executes tasks and what are some of the interesting uses you have seen or created for that information? – Tobias
  • With all the other languages competing for mindshare, what made you choose Python when you built Airflow? – Chris
  • I notice that Airflow supports Kerberos. It’s an incredibly capable security model but that comes at a high price in terms of complexity. What were the challenges and was it worth the additional implementation effort? – Chris
  • When does the data pipeline/workflow management paradigm break down and what other approaches or tools can be used in those cases? – Tobias
  • So, you wrote another tool recently called Panoramix. Can you describe what it is and maybe explain how it fits in the data management domain in relation to Airflow? – Tobias

Keep In Touch

Picks


Dag Brattli on RxPy - Episode 26

Visit our site to listen to past episodes, support the show, and sign up for our newsletter!

Summary

Dag Brattli is an engineer with Microsoft, and in his spare time he ported the Reactive Extensions framework to Python in the form of the RxPy library. In this episode we had the opportunity to speak with Dag and learn more about what ReactiveX is, why it is useful, and how you can use it in your Python programs. It is a very powerful programming pattern for manipulating data streams, which are becoming increasingly common in modern software architectures.
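As a rough illustration of the reactive pattern, the sketch below builds a tiny push-based stream in plain Python, where values are pushed to a subscriber through a chain of operators. It is not RxPy's API: the `Observable` class, its `map` and `filter` methods, and the `from_iterable` helper are hypothetical and written only for this example.

```python
class Observable:
    """A tiny push-based stream: the subscriber is called for each item."""

    def __init__(self, subscribe):
        self._subscribe = subscribe

    def subscribe(self, on_next):
        """Start the stream and deliver every item to on_next."""
        self._subscribe(on_next)

    def map(self, fn):
        """Return a new stream with fn applied to every item."""
        return Observable(
            lambda on_next: self._subscribe(lambda item: on_next(fn(item)))
        )

    def filter(self, predicate):
        """Return a new stream that only forwards items matching predicate."""
        def subscribe(on_next):
            def forward(item):
                if predicate(item):
                    on_next(item)
            self._subscribe(forward)
        return Observable(subscribe)


def from_iterable(items):
    """Create a stream that pushes each item of an iterable to its subscriber."""
    def subscribe(on_next):
        for item in items:
            on_next(item)
    return Observable(subscribe)


received = []
stream = from_iterable(range(6)).filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
stream.subscribe(received.append)
print(received)
```

Nothing runs until `subscribe` is called, at which point each value flows through the filter and the map before reaching the subscriber.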

Brief Introduction

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • Subscribe on iTunes, Stitcher, TuneIn or RSS
  • Follow us on Twitter or Google+
  • Give us feedback! Leave a review on iTunes, Tweet to us, send us an email or leave us a message on Google+
  • I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable. For details on how to support the show you can visit our site at
  • I would also like to thank Hired, a job marketplace for developers, for sponsoring this episode of Podcast.__init__. Use the link hired.com/podcastinit to double your signing bonus.
  • We are recording today on October 2nd, 2015 and your hosts as usual are Tobias Macey and Chris Patti
  • Today we are interviewing Dag Brattli about the RxPy project
On Hired, software engineers & designers can get 5+ interview requests in a week, and each offer has salary and equity upfront. With full time and contract opportunities available, users can view the offers and accept or reject them before talking to any company. Work with over 2,500 companies from startups to large public companies hailing from 12 major tech hubs in North America and Europe. Hired is totally free for users, and if you get a job you’ll get a $2,000 “thank you” bonus. If you use our special link to sign up, then that bonus will double to $4,000 when you accept a job. If you’re not looking for a job but know someone who is, you can refer them to Hired and get a $1,337 bonus when they accept a job.

Interview with Dag Brattli

  • Introductions
  • How did you get introduced to Python?
  • For our listeners who haven’t heard of it before, can you describe what RxPy is and why someone might want to use it?
  • What problem domains are best suited for using the Reactive X approach?
  • What is involved in integrating RxPy into an existing code base?
  • When should we use RxPy over asyncio or asynchronous workers like Celery?
  • What resources or tutorials do you recommend people use when trying to understand how and when to use the Reactive X tools?
  • What in particular about Python lends itself to the ReactiveX pattern, and what features of the language does RxPy leverage in particular in its implementation?
  • In what ways does the Python implementation of the Reactive X framework differ from those of other languages?
  • The project description references the use of LINQ for querying the various data streams that RxPy enables consumption of. I had always heard of LINQ in the context of traditional database queries. What makes LINQ a good choice for stream processing?
  • I mostly hear about ReactiveX in terms of UI design, but the project description seemed to indicate it was much more generally useful. What are some of the less common and more interesting problems that RxPy lends itself to solving?

Picks

Keep In Touch