An Open Source Toolchain For Natural Language Processing From Explosion AI - Episode 256

Summary

The state of the art in natural language processing is a constantly moving target. With the rise of deep learning, previously cutting edge techniques have given way to robust language models. Through it all, the team at Explosion AI has built a strong presence with the trifecta of SpaCy, Thinc, and Prodigy, supporting fast and flexible data labeling to feed deep learning models, and performant and scalable text processing. In this episode founder and open source author Matthew Honnibal shares his experience growing a business around cutting edge open source libraries for the machine learning development process.

Do you want to try out some of the tools and applications that you heard about on Podcast.__init__? Do you have a side project that you want to share with the world? Check out Linode at linode.com/podcastinit or use the code podcastinit2020 and get a $20 credit to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.



Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on great conferences. And now, the events are coming to you, with no travel necessary! We have partnered with organizations such as ODSC, and Data Council. Upcoming events include the Observe 20/20 virtual conference on April 6th and ODSC East which has also gone virtual starting April 16th. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host as usual is Tobias Macey and today I’m interviewing Matthew Honnibal about the Thinc and Prodigy tools and an update on SpaCy

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by giving an overview of your mission at Explosion?
  • We spoke previously about your work on SpaCy. What has changed in the past 3 1/2 years?
    • How have recent innovations in language models such as BERT and GPT-2 influenced the direction or implementation of the project?
  • When I last looked SpaCy only supported English and German, but you have added several new languages. What are the most challenging aspects of building the additional models?
    • What would be required for supporting symbolic or right-to-left languages?
  • How has the ecosystem for language processing in Python shifted or evolved since you first introduced SpaCy?
  • Another project that you have released is Prodigy to support labelling of datasets. Can you talk through the motivation for creating it and describe the workflow for someone using it?
    • What was lacking in the other annotation tools that you have worked with that you are trying to solve for in Prodigy?
  • What are some of the most challenging or problematic aspects of labelling data sets for use in machine learning projects?
    • What is a typical scale of data that can be reasonably handled by an individual or small team working with Prodigy?
      • At what point do you find that it makes sense to use a labeling service rather than generating the labels yourself?
  • Your most recent project is Thinc for building and using deep learning models. What was the motivation for creating it and what problem does it solve in the ecosystem?
    • How does its design and usage compare to other deep learning frameworks such as PyTorch and Tensorflow?
    • How does it compare to projects such as Keras that abstract across those frameworks?
  • How do the SpaCy, Prodigy, and Thinc libraries work together?
  • What are some of the biggest challenges that you are facing in building open source tools to meet the needs of data scientists and machine learning engineers?
  • What are some of the most interesting or impressive projects that you have seen built with the tools your team is creating?
  • What do you have planned for the future of Explosion, SpaCy, Prodigy, and Thinc?

Keep In Touch

Picks

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Transcript
Tobias Macey
0:00:12
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app, or want to try a project you hear about on the show, you'll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models or running your CI and CD pipelines, they've got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on great conferences, and now the events are coming to you, with no travel necessary. We have partnered with organizations such as ODSC and Data Council. Upcoming events include the Observe 20/20 virtual conference on April 6th and ODSC East, which has also gone virtual, starting April 16th. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host as usual is Tobias Macey, and today I'm interviewing Matthew Honnibal about the Thinc and Prodigy tools and an update on spaCy. So Matthew, can you start by introducing yourself?
Matthew Honnibal
0:01:45
Hi, Tobias. Thanks for having me again. I'm the creator of the spaCy natural language processing library. It's a popular tool for working with text in Python, so it's often used for information extraction projects, and, you know, also data science projects to understand text at scale. And I'm the co-founder of a company, Explosion AI. We also make an annotation tool called Prodigy, and we've recently updated and released the machine learning component of spaCy as its own library, Thinc, which is another thing that I'm excited to talk to you about today.
Tobias Macey
0:02:27
And you were actually on the podcast about three and a half years ago to talk about spaCy, so I'm definitely excited to hear about where things have gone since then. But before that, can you share how you first got introduced to Python?
Matthew Honnibal
0:02:39
Sure.
0:02:40
So like a lot of people, I came to problems that I wanted to solve with programming before I came to decisions about languages or, you know, those sorts of technical things. Basically, I started out in linguistics, and I was doing research, and I wanted to process volumes of text to answer questions about grammar, or to, you know, basically work through the linguistic theory that I was working with. And so it just sort of started from there, and I started writing small scripts and everything. I actually first started out with Perl, but I quite quickly switched across to Python. This was in around 2004, 2005, and since then I have, you know, really worked with Python for pretty much my whole career. Except that eventually I realized that I wanted to write programs that ran faster, and in particular programs which were more memory efficient, so that I could write, you know, basically concise data structures that would work well with the problems that I was working on. And so then I started working with Cython, and I found it a really good compromise for that, because for some problems it just is a lot easier if you can sit down and plan out the memory ahead of time and sort of reason about how much you can hold in memory. And that very much informed how spaCy was written, because the library is really implemented in Cython rather than in Python directly.
Tobias Macey
0:04:00
At the time when we spoke, the Natural Language Toolkit was still sort of the de facto standard for anybody who wanted to do any sort of natural language processing. But these days, most of the time when I see references to people doing any sort of NLP, spaCy has become the more prominent library for that. So I'm curious what your sense of that has been, as the creator and maintainer of spaCy, and how things have progressed over the past few years in terms of the level of popularity and adoption for your library.
Matthew Honnibal
0:04:28
So NLTK is still an extremely popular and useful library, and they really do different things. So I would, you know, never want to say that there's only one way to do it, or that the other tools are, like, deprecated or something. There's still a lot of functionality in NLTK, and there are use cases where people find its approach useful, of not having to initialize as much, like, you know, load a large model into memory, and it's basically got all of these utility functions. So it's still certainly a very popular tool. But yeah, I've been pleased to see that a lot of people have been finding spaCy useful, especially for the pretrained models and sort of pretrained processing pipelines, and also the data structures for working with different annotations. So one thing that, you know, I think spaCy is quite good at is, if you've got annotations like entities in text, and you want to do things like get relations between them, or you want to find other parts of speech, or you want to retokenize text, the object-oriented interface in spaCy makes it quite easy to interact with those annotation layers together. People are also finding the processing pipelines quite useful. So being able to string together rule-based matching with entity recognition, then apply some other rules on top, and then get out the document at the end is something that I think spaCy is quite strong at as well, and that's why people are using it for these processing pipelines. And spaCy also has a little bit more of an industrial use-case focus, so it's more oriented towards production use cases. And so I think there's a lot of companies who have basically been looking for a tool that has that kind of focus, rather than one which is more oriented towards teaching and research.
Tobias Macey
And for the work that you're doing at Explosion, you mentioned that you founded the company around the same time that we last talked, as a follow-on from your work on spaCy. I'm wondering if you can give a bit of an overview of the mission for that company and highlight some of the different projects that you've been working on there.
Matthew Honnibal
Yeah. So when we first started out with Explosion, we did some consulting projects for six or seven months. We preferred this, you know, together with my co-founder Ines, to raising an investment round, and it was really a good way to basically understand what sort of problems people had with NLP and figure out what we wanted to do next. And so then the product that we ended up releasing was this annotation tool, Prodigy, and that's been going very well since then. That's been, you know, really funding our activities in the company. So the way that we see things is that one of the needs that people have for machine learning techniques is to be able to develop with them closely themselves. So around the time we founded Explosion, there were a lot of people who were thinking that AI technologies like NLP would be something that you consumed as a cloud service, and you would really have very few developers working closely with these technologies. And our bet was different. Our bet was that this is a bit more like web development, in that to really make use of the technology effectively in projects and products, people would need to work with it closely, and there would be a lot of developers who were wanting to understand how all of these technologies fit together.
And so open source and sort of self-run technologies would be the way that people wanted to build their projects. And I think that that's largely true: that's largely the way that people have been working with AI, using open source libraries or, you know, at least self-hosted technologies that they can really understand in detail. And, you know, so we wanted to basically have a tech stack that fit along with that sort of viewpoint, so that people could run their projects themselves and, you know, basically move faster and try things out.
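To make the pipeline style Matthew describes concrete, here is a minimal sketch, assuming spaCy v3's API and an installed en_core_web_sm model; the match pattern and example sentence are invented for illustration:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

# Rule-based matching layered on top of the statistical pipeline.
matcher = Matcher(nlp.vocab)
matcher.add("FINANCIAL_TERMS", [[{"LOWER": {"IN": ["profit", "loss"]}}]])

doc = nlp("Acme Corp reported a quarterly profit of $12 million in Berlin.")

# Statistical annotations: named entities and their labels.
for ent in doc.ents:
    print(ent.text, ent.label_)

# Rule-based matches interoperate with the same Doc object.
for match_id, start, end in matcher(doc):
    print("rule match:", doc[start:end].text)
```

The Doc object is the shared data structure here: both the model's entity annotations and the hand-written rules read from and refer back to it.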
Tobias Macey
0:08:03
And so, going back to spaCy: in the last three and a half years, there have been a lot of new innovations and shifts in direction for the study and usage of natural language, with models such as BERT and GPT-2 coming out. I'm curious how that has influenced the direction or the implementation of spaCy itself, and any other product developments or project updates that have happened in that timeframe that you think are worth noting.
Matthew Honnibal
0:08:30
Yeah, so that's definitely been a very exciting thing that's been happening with natural language processing. Essentially, all of these models give the ability to have very accurate models through language model pretraining. So a problem for natural language processing technologies has always been this problem that's broadly called the knowledge acquisition bottleneck. And that's that there's so much knowledge that's kind of in the background about language, and about the world, that anything that works with language has to understand in order to get any specific application done. So let's say you want to do something, you know, reasonably boring, like extract certain financial figures from some documents, say, I don't know, profit and loss statements from, like, you know, company filings or something. There are all of these other words and all these background things that the model has to sort of understand something about in order to figure out which sentences are of interest and which sentences are providing that information. And if you have a person that you're teaching to do this, along with all of their knowledge about the world and general intelligence, they have this knowledge of language that means that they only have to learn a little bit about the task before they can do it very accurately. Whereas if you have a model that's seeing all of these words for the first time, you need an enormous number of examples to teach it this boring task. It would be like having a new employee, and instead of just teaching them what to do, you have to teach them English as well. That's, you know, obviously a huge learning curve, where you want to be able to import the general capability of English and just teach your task on top. And that's always been a well identified problem for natural language processing models, and finally, over the last couple of years, we've really had a big breakthrough in how this is done. So these models basically start off learning to predict the next word, or some task similar to it, from large bodies of text, and you can start off with that knowledge and apply it to some specific task. Now, the challenge at the moment is that these models have largely been developed by research labs where compute costs are completely not a consideration, and they've especially been developed to favor GPU and TPU devices. So this means that if you just run these models straight from research at the moment, they're really quite expensive to run. And so if you want to be processing large volumes of text, and you want to run the processing several times because you want to keep experimenting, the costs of running those models start to add up very quickly. You also have problems with serving them, because if you've got a model that requires a GPU device, the latency for using it effectively starts to get quite significant, because you need to batch up a lot of examples. So it's been very exciting to see these breakthroughs happen, but the challenge has been basically adjusting, you know, the architectures that we have, and finding the right compromise between models which are still cheap to run and low enough latency, while being able to take advantage of the high accuracy from these new techniques. And these problems are constantly evolving, and, you know, there's more and more work that's coming out recently about making these models smaller and more efficient as well.
Tobias Macey
0:11:45
And another element of the large models that I've seen referenced is doing things like transfer learning, for being able to take the existing models and then swap out a couple of the layers to make them fit your specific use case. Is that something that spaCy is used for in that context as well, or is that sort of outside of the scope?
Matthew Honnibal
0:12:03
So we have a command, spacy pretrain, where you can, you know, run language model pretraining, even, you know, basically from scratch. But also, especially with Thinc, it's quite easy to sort of plug these layers together to take advantage of these sorts of technologies. And in spaCy, the reason that we sort of redesigned Thinc was really to take advantage of this type of model better. So one of the challenges that, you know, is basically introduced by the new transformer architectures and the new ways of doing machine learning: when spaCy was first developed, I thought carefully about what the right level of abstraction was to present to developers, so that they could take advantage of natural language processing technologies, you know, basically which bits of complexity to shield off from them and which bits to present as decisions that they'll be making. So the level of abstraction that was sort of most sensible when I was designing the library was to think at the component level, and say, all right, well, this is a named entity recognizer, and it does this task of assigning labels to text. And then you can combine it with, like, a tagger, or you can combine it with, you know, a parser, and these are things which will analyze the text, and then you'll get back a Doc object, and you'll basically work with the Doc object from there. Now, with the neural network technologies, and in particular with the transfer learning technologies, the level of abstraction that's sort of most handy for developers to work with is a little bit different. Because you want to be able to take these models and basically be thinking about the tensors, and saying, all right, I'll feed this word representation out into this layer, and I'll share that information with this other layer. And that's really a level that, you know, developers want to be working with now, because the knowledge of these models is pretty detailed in the community; there's a lot of people who, you know, understand these things pretty well. And so the abstraction is different, and this is something that we've basically wanted to adjust in the library, and, you know, make it easy to work at that level, while still making sure that the library does the job that it originally did, working at the pipeline level as well.
Tobias Macey
0:14:10
Another development that has happened since we last talked is the fact that you've added support for a number of other languages, whereas at the time I believe it was only English and German. I'm curious what you have found to be some of the most challenging aspects of building those additional models for different languages, and any challenges that you see in terms of being able to support things like symbolic languages like Japanese or Korean, or right-to-left languages such as Arabic.
Matthew Honnibal
0:14:35
Yeah, so in terms of supporting more languages, I would say that the two big challenges are DevOps and data. The DevOps challenge is simply that, as we've added more languages, the operational complexity of training all of the models, and the automation required to have those jobs complete well, you know, reliably and with low manual effort, so that we can get all of those artifacts built and tested for each release, was something that took longer than I thought it might. So, you know, the training jobs take a fair bit of time, and then for each individual training job you need to be able to resume it and stuff. So I tried out a number of technologies, like Airflow, Luigi, and things, and ended up with, you know, basically a setup that works well for us. But this was definitely a challenge, and, you know, that was a thing that took a fair bit of time, setting up all of these things, partly because, you know, these DevOps tasks are ones in which I didn't have such deep expertise. So it was a bit of a learning curve for me as well. And then the other one is just the data resources. So we want to make sure that for all of these languages, when we produce a model, it's, you know, basically useful to people, and that the models don't just sort of exist for the sake of it. So that's been something that's been difficult, especially with different corpora having inconsistent licensing and stuff. You know, we want to make sure that the models that we produce are available for commercial use for people, and also that the data is good enough that it's, you know, something that's actually useful. So over time, one of the things that's changed since we last talked is the Universal Dependencies corpora have gotten a lot better and, you know, pretty consistent, and so that's something that we've been able to take advantage of to produce more of these models.
Tobias Macey
0:16:21
And for being able to build out these models, as you said, one of the challenges is having the appropriate corpora, and I imagine that another aspect is being able to label it effectively and find pre-labeled datasets, which I'm sure is where some of the inspiration for your Prodigy tool came from. I'm wondering if you can just talk through a bit of the motivation for creating it, and describe the use cases that it enables and the workflow for somebody using it.
Matthew Honnibal
0:16:46
Yes, so there are definitely questions about labeling data and, you know, some of the problems around this. When we were doing the consulting, this was definitely something that teams were struggling with. So probably the most important thing that we thought we could offer that was a little bit different from, you know, or lacking in people's process was, I guess you could say, more of an agile methodology to data labeling. So the naive view of labeling data is, well, you just sort of decide what the labels should be, and then you tell somebody to, you know, apply that labeling scheme, and then the problem is just this grunt-work task of getting the thing done. And for some tasks it looks a little bit like that; some image tasks are a little bit more like that. But certainly for language, as soon as you come up with any labeling scheme and you start applying it to text, you very quickly realize that there's all these edge cases, and it's kind of edge cases all the way down. And even more importantly, you realize that there are ways that you can adjust the labeling scheme that will hit a better compromise between what will be useful for your model, what will be useful for your end goal, what will be easy to annotate, and what the model will be able to, you know, annotate effectively. So, you know, the other day we were working on a little demo of Prodigy, and we were annotating instances of, like, ingredients in cooking discussions, because we wanted to say, all right, well, there are these trends in what sort of food people are using, especially for home cooks and things. So we wanted to say, all right, well, can we summarize the frequencies with which different ingredients are mentioned? And this sounds, you know, simple enough, but then you quickly realize that there's not really a clear distinction between what's an ingredient and what's a finished product. Because sometimes you might have something that, you know, I don't know, like, chicken fillets could be an ingredient in a recipe, or it could be the recipe itself. And there's all sorts of other examples like this, where you're kind of not sure of the boundaries. And so you're always making these decisions when you're annotating any project. And that means that you have to take a pretty flexible view of what you're doing, and it means that you have to be able to start and stop the annotations and look at the data and have it be a basically integrated process. That's really what we did with Prodigy: we made sure that it was a tool that was fully scriptable, and that you can really have control of the annotation process yourself, and you'll be able to build out whatever capabilities and automation you need as well. So you can drive it from Python, and if you can basically write a function in Python that generates the data, then that's something that you can quite easily put in your little function and then have thrown up in a web browser for you to click through, and it will be saved in a local database. And you can make different choices if you want to scale out; for instance, you can have the database save to, like, you know, a MySQL instance instead of a local SQLite file, and you can host the application in different ways. But at the core of it is a tool for any data scientist working individually.

It's a really quick way to be able to build out these experiments yourself and be able to try different things, so that you don't have this stumbling block where, as soon as you need some small amount of adaptation, you hit a sort of process block, and you, you know, have to do something different, or have to go to your team and, you know, basically apply for funding to throw it out to an external labeling service. Instead, it's just something that you can do flexibly yourself.
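As a concrete illustration of that scriptability, here is a minimal sketch of a custom recipe, following Prodigy's documented @prodigy.recipe pattern; the recipe name and labels are invented for the cooking example above, and details may vary between Prodigy versions:

```python
# recipe.py -- a sketch of a custom Prodigy recipe.
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("ingredient-ner")
def ingredient_ner(dataset: str, source: str):
    # Stream of {"text": ...} examples loaded from a local JSONL file.
    stream = JSONL(source)
    return {
        "dataset": dataset,        # annotations are saved to this dataset
        "stream": stream,          # examples to show in the browser
        "view_id": "ner_manual",   # built-in manual span-labeling UI
        "config": {"labels": ["INGREDIENT", "DISH"]},
    }
```

Assuming a JSONL file of text records, this would be started with something like: prodigy ingredient-ner cooking_data recipes.jsonl -F recipe.py, which serves the web UI and saves annotations to the local database.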
Tobias Macey
0:20:12
And I know that there are some other labeling tools out there, so I'm curious what you saw as being some of the lacking features or capabilities in the available market that necessitated building out Prodigy as an alternative to them.
Matthew Honnibal
0:20:25
So the number one thing was really the design of it as a developer tool, and a scriptable developer tool. Because when we talked to people about their experience doing annotation and using annotation tools, almost all of them had built annotation tools in-house. And that was something that was worth thinking about. It's like, okay, so if this is a type of problem where people are very frequently motivated to write their own tools, you know, why would that be? And, you know, the simple answer would be, well, nobody's come up with just the right annotation tool that everybody needs. And I don't think it's quite that. I think that, you know, the needs are quite flexible, and people want to have control of the process, because that's kind of efficient, and different needs are different. And so we wanted to make sure that it was something that people could really work with, and that they could work with as developers. So, you know, the scriptability, and the fact that people can interact with it programmatically and have it self-hosted, is something that we really wanted to build into the design. Most of the other things are designed as web applications, and programming against a web application that you're not hosting is always going to be kind of limited. The other problem is data privacy. So the vast majority of our users really don't want to, and often simply can't, upload their text into some cloud service. And this makes a lot of sense to me. You know, like, if I've got text in a platform that's private to me, I don't think that that vendor or those people should be sending that data to, you know, some external third parties. And, you know, since then the regulation has also caught up with this view that I've had of how things should work, and I think that that's right. And I think that the US is catching up with this as well, and will have, you know, rules that are more standardized, like it works in Europe too.
Tobias Macey
0:22:08
Another thing that I saw that was appealing about Prodigy is the fact that it supports multiple different types of data for labeling. So it has capabilities for text, so that you can do things like named entity recognition, like you were referring to earlier, being able to say, as in the example you gave, this is an ingredient versus this is a finished product. But it also has support for doing labeling of images and segmentation of those images, to say, you know, this is a rectangular area, this is a polygonal area, and this is the label associated with it. And then you also have support for some other data types as well. So I'm curious what you found to be some of the challenges of building a tool that supports those different data types, and some of the value that you've seen come out of it.
Matthew Honnibal
0:22:49
Yeah. So basically just trying to make the right compromises between, you know, what people need in different use cases, without diffusing too much and being less useful for any particular use case. So, you know, to be clear, I do think that there will always be other tooling that people want to use as well. And I think that some of the worst things that computational tools, or actually tools in general, can do is to try to be the one-stop shop for all use cases in all situations. I think, you know, it's important to do the job that you set out for yourself well. But a lot of people have found it useful to have this variety of capabilities, so that you don't have to have very different workflows just because you now have an image task as opposed to a text task. We've also introduced some nice audio support recently as well, and Ines had some fun building that out and getting, you know, an interface that's helpful for that. So one of the challenges has been designing workflows for things that we don't do ourselves so often. You know, we still don't do a lot of image work in terms of our actual projects, and we don't have as deep expertise in it. So making sure that we're, you know, basically doing something that's helpful to people, without having this closer connection to it, is something that we've had to think carefully about. And then, you know, different data sizes and stuff. So, you know, obviously the size of the input for something like video, image, or audio is quite different from text, and so we had to make some adjustments in the way that the database works and stuff to make sure that those are well accommodated.
Tobias Macey
0:24:18
And you mentioned some of the challenge of being able to allocate funding for working with an external labeling service, whereas you can have the capacity for doing your own labeling, at least at a small to medium scale. I'm curious what you see as being a reasonable scale of data that can be handled by an individual or a small team, and at what point you think it's necessary to start working with some of these labeling services to handle more large-scale or more fine-grained labeling for the data that you need to use for building your models or building your product.
Matthew Honnibal
0:24:51
Well, I think that you definitely can get something to production working with, basically, you know, just the ad hoc resources of yourself or your team, or, like, maybe some interns or some other junior people around the place. And it depends on the task, and it depends on how much data is needed to get the models trained, because there's no one answer for this; different tasks have different complexities and things. But I would say that the number of examples that you need per model is dropping all the time, because the transfer learning technologies are very good. So I would say that, especially now, you need less data than ever. And if you find that you're needing, you know, hundreds of hours of annotation, then rather than just saying, well, that's our lot, that's just how much we need, I would always suggest that you look at ways that you can redesign the models, because it may be that, you know, something's wrong with the way that you're actually defining the problem. So I'll give you an example of this, one of the examples I use in some of my talks. Imagine that what you wanted to do is extract information from crime reports, and you want to fill out this database of, you know, the victim's name, the perpetrator's name, the location where the event happened, the event type, or something. So one way of doing this is very direct, and you might say, all right, I'm going to do this as a labeling task and label this span of text, John Smith, as victim, and then this other span of text as, you know, location of crime or something. And, you know, that's definitely a way that you will be able to train the model, but you're coupling pieces of information. You're coupling the identification of John Smith as a person with the fact that the event is about, well, actually you're coupling three pieces of information: the sentence is about, you know, the event type, crime; John Smith's role in that event is the role of victim; and John Smith is a person. So if you factor those three pieces of information out, you can often need far less data, because the decision of, all right, that's a person versus not a person is, you know, basically easier, and it doesn't require as much information about the whole rest of the sentence. Similarly, the information of, you know, is this sentence about a crime or is it not about a crime, that's one bit of information that you can annotate over the whole sentence. And so if you annotate these separately, and you train the models separately, you can often need far less data. And so you'll have some situations where people are finding that the model isn't converging well, and their first instinct is to either try a different architecture or to, you know, annotate more data, when by far the best lever to pull is one which people don't really have practice pulling, because it's not one which you'll have gotten from shared tasks or from writing papers and things. And that lever is: how can I redesign the task? How can I find a different way to either need less for the application, or to just structure the models differently so that they attack different parts of the task and define things differently? How can I say, all right, well, what if I did this as a sentence labeling task rather than as labeling the words in the sentence?

Would my application be able to deal with that slightly less precise piece of information? Well, if so, maybe you'll find that the model converges far faster and far better. So to answer the question about when I would actually switch to a labeling service: I think I would actually use a labeling service basically never. As the alternative, after you have a first prototype, I would actually be hiring people to work in-house. And, you know, they can be remote employees or, like, you know, people on freelance contracts, but I would always want them to be specific people that I can talk to, under the supervision of the project. Because after you get past the prototyping stage, the task of labeling data isn't this discrete event where you do it once, get back this batch of data, and then the project is kind of shut down. It'll be something where you constantly want this feed of data and feed of examples, so that you can keep monitoring the model and keep improving it over time. And you don't want to have it as these, like, discrete contracts where the data is going to be different each time you go back to the service, because you're getting it done by different people, with different standards, potentially with different pricing. It's, you know, basically something that you want to have consistent control over over time, because your needs will change as well: you're going to find that, oh, okay, I want to adjust slightly the way that the data is annotated, because there's this problem in the application or problem in the model that needs to be solved.
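To make the decomposition Matthew describes concrete, here is a hedged sketch: instead of one joint "victim" span label, a generic pretrained NER model is combined with a separate sentence-level decision, and the two cheap decisions are joined with simple logic. The keyword stub stands in for a separately trained sentence classifier, and the example text and names are invented for illustration:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # generic pretrained NER: PERSON, GPE, ...

CRIME_WORDS = {"robbed", "assaulted", "stolen", "attacked"}

def is_crime_sentence(sent):
    # Stand-in for a separately trained sentence classifier (e.g. a
    # spaCy textcat component); a keyword stub keeps the sketch short.
    return any(tok.lower_ in CRIME_WORDS for tok in sent)

doc = nlp("John Smith was robbed on Main Street in Springfield.")
for sent in doc.sents:
    if is_crime_sentence(sent):
        # Each factored decision is cheap on its own: a generic person
        # label plus a sentence-level event label, joined by a rule.
        people = [e.text for e in sent.ents if e.label_ == "PERSON"]
        places = [e.text for e in sent.ents if e.label_ in {"GPE", "LOC", "FAC"}]
        print({"event_type": "crime", "people": people, "locations": places})
```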
Tobias Macey
0:29:33
And the third component that we mentioned at the opening, that ties into this whole ecosystem that you're building out, is the Thinc project, which you mentioned was extracted from the spaCy project originally. I'm wondering if you can talk a bit more about the motivation for releasing it as its own library, and some of the primary problems that you're aiming to solve with it within the ecosystem of machine learning and data science.
Matthew Honnibal
0:30:00
Yeah, so spaCy always kind of came with its own machine learning implementations. Initially it was, you know, basically a pretty simple linear model that was optimized to work with very sparse features, using the averaged perceptron algorithm. For these linear models this was pretty common: basically everybody would implement them themselves, and, you know, most other parsers would have their own, like, linear model implementations lurking within them. So I did it the same way, and I found it, you know, basically a helpful way to keep the model efficient and working well. And then over time, as the neural network models came in, I had already been implementing neural network code. By the time PyTorch came out, I was, you know, basically done with the models we were experimenting with for spaCy 2. If PyTorch had come out earlier, before I'd, you know, basically been doing all of that work, there's probably every chance we would have just used PyTorch from the start. But one of the advantages that we saw in spaCy 2 of, you know, sticking with our own implementation was that we could make the library a little bit smaller, because we only had to implement the models that we needed, and we didn't have to, you know, drag in huge binaries from an external library. And we were able to make sure that we didn't have a dependency on a specific version of PyTorch, because we knew that the library would evolve quickly, and we wanted to make sure that people never had a situation where they had two projects, and spaCy needed a particular range of versions of PyTorch, and some other code needed a different range of versions of PyTorch, so that they had this deadlock. So that was always something that we were conscious of as well. So, you know, over time we have kept using our own implementations. But as I mentioned, more and more people have wanted to interact with the machine learning layer underneath: people need to be able to define their own models and bring their own models into, you know, spaCy and Prodigy. And so what we decided to do, rather than standardizing on PyTorch directly, was to
0:32:04
sort of adapt Thinc into a library that could sit as a wrapper around different machine learning solutions underneath. So in addition to Thinc's own implementations of things, you can use it as just a sort of interface layer above PyTorch. So you can really easily define any PyTorch model that you want, and then Thinc is just the interface layer that interacts with spaCy. So that was what we sort of set out to do. And the way that we approached that was to really think about, you know, what's the sort of lightest-weight, most minimal interface that is necessary for this type of deep learning library. And we ended up with a functional-programming-inspired design that features a very minimal interface, like a single Model class, and the actual work of the layers is done in function definitions. And instead of bringing in a definition of, you know, some sort of autograd mechanism or tape-based differentiation like you have in PyTorch, there's just this convention of callback mechanisms. And then the different relationships between layers, like, say, a feed-forward relationship, or concatenation, or subtraction or something, are all handled by higher-order functions. So we ended up with a design that's really quite lightweight and minimal, and the library itself is quite small and easy to read. And this means that you can really bring any model that you want from another library, whether PyTorch, MXNet, or TensorFlow, and plug it into spaCy and into Thinc. We also built out a few other features that we felt would be very helpful, and the main one addresses something that I think is kind of underrated as a problem in machine learning, which is the problem of configuration. So we always had in spaCy 2 this problem of how to pass configuration through a tree of objects. One way to do it is that you pass into some, you know, component a whole tree of configuration that defines that model, and then maybe the components of the model, et cetera. So you have this blob of configuration that you pass top-down into something. But this means that as soon as that component has pluggable, you know, sub-pieces, it can never know what features or what configuration options its individual parts will need. So if I want to, say, configure a parser or a tagger, and I want to give flexibility over the model of that, or allow people to change individual pieces of that model, then I have to pass this opaque blob of configuration forward. And the functions being configured probably have defaults and things, so you end up with this problem of different defaults being set, and you can very often have problems where you think that you've overridden a default and you haven't. So instead of passing the configuration top-down like that, we have a way of defining the configuration bottom-up through the config file, and letting the tree of objects be defined and sort of built from that. And we found that really helpful in keeping the code clean and, you know, helping to manage this problem with defaults. And then finally, we've got typing. So Python 3 has type declarations,
and we've really made good use of these in Thinc. It's the first time I've really seen good, really full support for NumPy arrays and things in, you know, the PyData ecosystem, so that you can get static type errors for something like indexing an array in a way that's invalid, because it's, you know, a three-dimensional array and you've used too many indices or something. So yeah, I think that that's an exciting feature that will be very helpful to people as well.
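For readers who want to see what that looks like, here is a minimal sketch of the combinator style and the bottom-up config system, based on the thinc.api surface in Thinc 8.x; the layer sizes and learning rate are arbitrary, and Thinc's typed array annotations (e.g. thinc.types.Floats2d) can be layered on top for the static shape checking mentioned above:

```python
from thinc.api import Config, registry, chain, Relu, Softmax

# Layers compose through higher-order functions rather than a class
# hierarchy: chain() expresses a feed-forward relationship between layers.
model = chain(Relu(nO=64), Relu(nO=64), Softmax())

# Configuration is declared bottom-up and resolved into a tree of
# objects, so defaults live in the registered functions themselves,
# not in an opaque blob passed top-down through components.
CONFIG = """
[optimizer]
@optimizers = "Adam.v1"
learn_rate = 0.001
"""
resolved = registry.resolve(Config().from_str(CONFIG))
optimizer = resolved["optimizer"]
```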
Tobias Macey
0:35:49
And one of the things that I thought about as I was looking at Thinc is that, because it acts as a high-level wrapper for multiple different frameworks for doing deep learning, it puts it in some sense in the same space as the Keras project. I'm curious what your sense of the comparison is between Thinc and Keras in terms of acting as that wrapping layer.
Matthew Honnibal
0:36:14
Yes. So I really found the functional programming style in Keras, you know, very interesting when I first saw it, and it was definitely something that helped inspire the approach that we took. I think that over time the sort of focus of Keras has shifted a little bit, and, you know, it is really part of TensorFlow these days. It's basically become the main interface into TensorFlow for people using it; it's really a high-level, you know, API for TensorFlow that's quite coupled into the TensorFlow ecosystem. So I would say that the focus is a little bit different with Thinc, and we've also benefited from coming to it a little bit later and being able to come up with a design that's a little bit tighter and that we'll be able to maintain with more consistency over time, so that, you know, we really hope that we will not have to make breaking changes, and that we are able to basically keep the design quite concise and coherent. So I would say that the use case is a little bit different, and that Keras is not so much a wrapper around different things as just, you know, a key part of TensorFlow specifically.
Tobias Macey
0:37:20
And then, because Thinc also acts as its own framework for building these neural nets and doing deep learning, I'm curious what you have found to be the strategy that you use in terms of determining when to do something entirely in Thinc, versus when to incorporate either PyTorch or TensorFlow or some of those other frameworks into the network as a component of the Thinc project, rather than doing it entirely without those other frameworks.
Matthew Honnibal
0:37:50
So our goal is to avoid having, in the long term, models which have a strict dependency on PyTorch or TensorFlow in the core pipeline APIs in spaCy, because I think that this does make things sort of operationally simpler for most spaCy users. But the way that I see it is that PyTorch in particular is a really excellent compiler of these architectures. It's able to take very general code: you can implement things in a very sort of neutral way without worrying about the performance details, and PyTorch will pretty much always do, you know, a pretty good job of that. And so it's a lot easier to get to, you know, basically a good performance level without having to manage the specifics of the computation, and in particular the specifics of the device. That said, if you take any specific architecture and you implement something in CUDA or C yourself, you can usually at least match the performance that you would get from something like PyTorch. So the way that I would do it is that when I'm experimenting with something and I, say, want to try out a GRU, well, I might not have a GRU implementation in Thinc, so I can, you know, just plug in PyTorch's. And there's no performance penalty for doing that: there's no overhead in translating the tensors from CuPy to PyTorch, because it uses the DLPack protocol. And when you have PyTorch, we have a thing that lets you set CuPy's memory to be allocated via PyTorch as well, so you've only got one memory pool. So there's really no disadvantage; you just have to have PyTorch installed. So there'll be all sorts of architectures where it's either easier to implement it initially in PyTorch, or, you know, there's already PyTorch code for it, where I would be using that directly. And then eventually, if we want to provide that to spaCy users, or if I feel like I can do a little bit better for production by optimizing that specific architecture, I would switch that over to Thinc. The other thing is that sometimes you have these sort of high-level building blocks of models, and some of that composition is actually easier to do in Thinc, by just thinking about the different sort of functions that you plug together. So in particular, I'm used to the way of writing things in Thinc, and I find it, you know, kind of concise for defining models and trying different things out. But for different components, you know, maybe there'll be something where it's easier to have a thin wrapper around it. Now, in terms of users of spaCy and users of Prodigy, almost always they'll be more familiar with PyTorch, and they'll want to work more directly in PyTorch and, you know, have that as the development framework, especially initially. And then they can just use a thin wrapper from Thinc around it, and, you know, over time maybe they'll decide to do some other specific thing in Thinc rather than doing it in PyTorch directly. But the aim is to let people work with the frameworks that they want to work with, and I imagine that most developers will want to work with, you know, PyTorch, as that's a pretty standard technology in machine learning.
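Here is a hedged sketch of what that wrapping looks like in practice, using thinc.api.PyTorchWrapper as in Thinc 8.x; the module and layer sizes are arbitrary placeholders:

```python
import torch.nn
from thinc.api import PyTorchWrapper, chain, Softmax

# An arbitrary PyTorch block; the sizes here are placeholders.
torch_block = torch.nn.Sequential(
    torch.nn.Linear(300, 64),
    torch.nn.ReLU(),
)

# Wrapped, the PyTorch module behaves like any other Thinc layer, so it
# can be composed with native Thinc layers via the same combinators.
model = chain(PyTorchWrapper(torch_block), Softmax(nO=10))
```

The design choice here is that development can stay in PyTorch, while the wrapper handles the boundary, so switching a component to a native Thinc implementation later doesn't change the surrounding pipeline.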
Tobias Macey
0:40:50
So all of these tools that we've been talking about are open source, and they're something that you're working on as a core element of your business. I'm curious what you have found to be some of the biggest challenges in terms of building and maintaining these tools to meet the needs of data scientists and machine learning engineers, and the approach that you're taking to making them sustainable.
Matthew Honnibal
0:41:11
So one thing that's definitely difficult about this is that the technologies are changing so quickly, both in terms of the research underneath them and also the software ecosystems around them. We have to strike the right balance between maintaining a good amount of backwards compatibility and stability for people, while also moving quickly enough to take advantage of new opportunities from the technology and new integrations with things, and basically providing a better experience and improvements to data scientists. So I would say that that's definitely something that's challenging about this type of work: to, you know, basically keep pushing ourselves to deliver the best quality software that we can. And certainly there's this constant background thing of, I don't know, the continuous integration systems changing underneath us or something. So at some point we implement everything and get set up with, you know, Travis and Appveyor and Circle CI, and then, okay, Azure Pipelines comes out and we see that, okay, that's a better option, so we migrate over to that instead. And different wheel formats and different build tools and things. So there's this sort of background level of, you know, basically all of the boring problems of the technology stack around us changing and improving, and different libraries and all of these things. And that's definitely something that occupies a surprising amount of time: just all of the rest of these, like, ecosystem things and interactions with, you know, all of the other software that you wouldn't think of as core parts of solving the problem. But it's definitely things that need to be done to basically keep delivering high quality software.
Tobias Macey
0:42:45
And then, as far as the most interesting or impressive projects that you've seen, I'm curious what you have found to be notable and worth calling out, either things that your team has created with the tools that you're building, or things that you have seen built with those tools that you're releasing.
Matthew Honnibal
0:43:05
So we're always really blown away by seeing all the things that people are building with spaCy, and this is definitely something that's, you know, constantly increasing and improving. So we have a collection of these projects on the website, the spaCy Universe. Two that I want to call out in particular are spaCy pipelines for specific types of text. One is Blackstone, which is a spaCy pipeline for legal text processing, and another one is scispaCy, which is a spaCy pipeline for biomedical text processing developed by the Allen Institute. Another project that I think is super cool is this information extraction system called Holmes, based on predicate logic. That's something that I've always wanted to dig into a bit more; it was developed within a company and then, you know, basically kindly open sourced, so it's really quite a substantial project, and I think, you know, it's definitely cool. Another one, which we developed internally, that people might want to check out is this project called sense2vec, where we train word vectors based on, you know, text that's been preprocessed with spaCy, so noun phrases have been merged into one token, or entities have been merged into one token. And early this year we ran spaCy over all the text on Reddit. So this was, you know, several billions of words, from 2010 to 2020, and we used this to get the entities and basically make vectors, and then we precomputed similarities for those vectors on GPU, so that you can get pretty much instant nearest-neighbor queries across all of these, you know, terms. So you can find similarities across entities and things, which is quite cool to play with.
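For anyone curious, here is a hedged sketch of querying those vectors with the sense2vec package's standalone API; the path is a placeholder for one of the released Reddit vector archives:

```python
from sense2vec import Sense2Vec

# The path is a placeholder for a downloaded sense2vec vector archive.
s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_vectors")

# Keys are merged phrases plus a coarse sense tag, as produced by the
# spaCy preprocessing Matthew describes (e.g. noun phrases as one token).
query = "natural_language_processing|NOUN"
if query in s2v:
    # Similarities are precomputed, so nearest-neighbor lookups are fast.
    for key, score in s2v.most_similar(query, n=3):
        print(key, score)
```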
Tobias Macey
0:44:48
And as a contributor and maintainer of all these projects, and as somebody who is running a business that relies on them, what have been some of the most interesting or unexpected or challenging lessons that you've learned over the past few years?
Matthew Honnibal
0:45:02
So one of the things that's, you know, definitely important is the way that the projects are documented and communicated. And, you know, this also stems back to initial design decisions as well, and basically making sure that things are consistent in the projects and consistent in the libraries. I think that this really makes things, you know, sort of more useful and open to a wider audience. So this was something in particular that, you know, improved a lot with my collaboration with Ines, and she has been really a driving force in getting the guides and, like, the level of explanation that we can deliver to, you know, basically a higher level. And I think that that's something that's really been setting apart some of our projects as well. We saw this in particular when we went back and did Thinc: there were so many things where, you know, we felt like we'd done this before, setting up these libraries and setting up a tool which people would find useful, and, you know, ways of doing the documentation and things that people would need. So we've learned a lot from the types of questions that people have, and the types of API design decisions that will, you know, lead us into maintenance problems or will be confusing to people. And we've been able to head off some of those things at the start with Thinc, which we've been pleased about. So those are all things which, you know, I definitely think that we've learned, as well as ways of setting up the testing and making sure that the code is well tested and testable, to avoid some of these bugs in the first place.
Tobias Macey
0:46:24
And as you look to the future of the Explosion company and the projects that you're building there, what do you have planned, and what are you most excited for?
Matthew Honnibal
0:46:33
So one of the things that we've been working on for a long time, and are excited to finally get out, will be an extension to Prodigy called Prodigy Teams, which has more of a sort of team management interface and, you know, has a hosted component where you can allocate work to individual annotators and start and stop annotation tasks and things. So that's something that we've been working hard on. Because part of it is a more managed architecture, and there's a more intricate web app behind it, it's taken us a little longer to develop, but it's been going well, and we're really excited to get that out to people. And yeah, that's, you know, basically the main thing that we've got in the pipeline, as well as getting spaCy 3 out, which will make it much easier to use transformer models and really allow you to bring your own model and, you know, basically make it easier to interact with those technologies through spaCy.
Tobias Macey
0:47:25
Are there any aspects of your work at Explosion, or the spaCy and Prodigy and Thinc tools, that we didn't discuss yet, or anything else in the space of natural language processing and deep learning that you'd like to cover before we close out the show?
Matthew Honnibal
0:47:38
Yeah, so one of the things that's been different at Explosion since we last spoke is we've managed to grow a small but extremely effective team that we've been working with. So spaCy's core maintainers now also include Sofie and Adriane, so we've got a team of now four people working on it full time, in addition to myself and my co-founder Ines, and that's really been helpful. And then we also have Sebastián Ramírez, who has joined the company. We started working with him because we started using his open source library FastAPI, which I think is a really great tool that, you know, any Python developer who needs to write REST APIs should check out. And so he's been working on Prodigy Teams, and we've got another developer, Justin, who's in the US, who's working on Teams as well. So yeah, we've, you know, now got a few extra developers working with us. On the scale of things it's still quite a small team, but, you know, we really feel quite blessed to be working with people who are very effective and work quite independently. And, you know, I feel like it's a very fun collaboration that we have with people.
Tobias Macey
0:48:43
For anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. This week, I'm going to choose the movie Onward, which I watched recently with my family. It's a movie aimed at younger kids, but it's great for the whole family: hilarious, really interesting storyline, and we just had a great time watching it. So if you're looking for something to watch with the whole family, I'll definitely recommend it. And with that, I'll pass it to you, Matthew. Do you have any picks this week?
Matthew Honnibal
0:49:13
So outside of the, you know, AI space, a lot of my time over the last few weeks has been spent following the coronavirus pandemic. I'm sure by the time you listen to this, anything that I say will be different, but I guess just stay safe out there. And, you know, in terms of picks for things within the ecosystem, or recommendations to make: one project that I think is really cool that people might check out is this library called Ray, which is developed by some people originally from a lab at Berkeley. I think it's a really cool way to, you know, basically write distributed applications for machine learning in Python. So it's still quite young, but I think they've got a nice design, and it's something that, you know, I think will continue to be popular and is one to check out.
Tobias Macey
0:50:03
Yeah, I'll definitely second that one. And the original creators of the library have also founded a company called Anyscale to try and accelerate the development of that framework and turn it into a viable business, so definitely something to keep an eye on there. Well, thank you very much for taking the time today to join me and share the work that you're doing at Explosion on all of your different projects. Definitely a lot of interesting tools that contribute a lot to the ecosystem, so I appreciate all of your time and effort on that front, and I hope you enjoy the rest of your day.
Matthew Honnibal
0:50:33
Thanks, you too.
Tobias Macey
0:50:37
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast, at dataengineeringpodcast.com, for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it: email [email protected] with your story. To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.