SpaCy with Matthew Honnibal

Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great. I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable.

When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for running your next app. You'll want to make sure that your users don't have to put up with bugs, so you should use Rollbar for tracking and aggregating your application errors to find and fix the bugs before your users notice they exist. Use the link rollbar.com/podcastinit to get 90 days and 300,000 errors tracked for free on their bootstrap plan.

You can also visit our site to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. And to help other people find the show, you can leave a review on iTunes or Google Play Music and tell your friends and coworkers.

You can also join the community at discourse.pythonpodcast.com to find out about upcoming guests, suggest questions, and propose show ideas.

Your host as usual is Tobias Macey, and today I'm interviewing Matthew Honnibal about spaCy and Explosion AI.

So, Matthew, could you please introduce yourself?

Hi. So first of all, thanks, Tobias. It's good to be here. So as you said, I'm the developer of spaCy. I've been doing natural language processing research for most of my career. And in 2014, I decided that instead of writing grant proposals and ascending in academia, I would leave and write a commercial open source library, and this became spaCy.

And how did you first get introduced to Python?

So, this actually takes me back a long way. I first started programming in Python in 2004, so it was really my first serious language. Before that I'd been programming a little bit in Perl, just sort of self taught, and I was working as a research assistant on a project, and this project was using Python a little bit. And so I got 2 big books on Python, including the O'Reilly Python Cookbook, and, you know, read them and started writing my little scripts to search vocabulary words and things in text in Python.

So can you start by sharing what spaCy is and what problem you were trying to solve when you first created it?

So spaCy is a library for doing natural language processing. In other words, the main thing that it's good for is if you want to understand a large volume of text or extract information from it or process it in some useful way. Then you usually want to first of all split the string up into sentences and words, then you want to understand how the words' meanings relate to each other, how the sentences are structured, and find proper nouns and numeric entities like people, locations, events. All of these are tasks which are common in language processing, and they're all nontrivial to write. They typically require statistical models to get accurate, because a word might have multiple parts of speech, the word apple might sometimes refer to an apple, it might sometimes refer to a company, etcetera. So these are all areas of ongoing research, and this was the type of research that I was engaged in as an academic.

And I saw that, okay, these technologies are getting increasingly useful and practical, and they're starting to be applied in a much wider variety of contexts, including conversational UIs and chatbot type things now. And I saw that the largest companies have teams that are essentially tasked with taking the papers that are produced by researchers and turning them into systems that their software engineers can use. But this sort of library was really lacking in the wider startup ecosystem, and so I thought that that was a shame and that, you know, smaller companies or the wider ecosystem were really missing out on this, and that there was really space for somebody to come in and fill that gap. So that's what I decided to do.

Another library that I'm sure a lot of listeners are familiar with is the Natural Language Toolkit. So I'm wondering, how does spaCy differ from the NLTK, and are there any cases where the NLTK would still be a better choice than spaCy?

So the Natural Language Toolkit also has its origins in academic research, but from a different perspective. So NLTK was really written to assist undergraduate teaching in natural language processing and computational linguistics, so it was really written to help, you know, introduce these topics to new students and researchers, and in particular, students and researchers from a variety of disciplines who don't necessarily know how to program already. And I think that it does a great job at introducing people to these sorts of basic linguistic topics and issues in natural language processing. But it's not designed primarily to help people, you know, build a product. It's not designed to help you really put something in production. And that's really why, you know, when I decided to write spaCy, I saw that, you know, I hadn't been using the Natural Language Toolkit myself because it didn't fit my needs that well, and I saw that there were a lot of other people who, you know, needed basically a different sort of library that had different concerns and different priorities.

And are there any sort of fundamental aspects of how NLTK and spaCy are written that are significantly different that allow for any greater feature set in spaCy either now or in the future?

Sure. So the difference in focus manifests in 2 differences of design. 1 is that spaCy is written in, I guess, memory managed code. Like, it's written as native C extensions that are implemented in Cython. So all of the functionality is written to go fast in the library itself. NLTK favors implementations that are as transparent as possible for teaching and education, and written in Python, not written as C extensions. And for anything that needs to go fast or be accurate, the library favors calling out to external services. So you'll install some other software, like you'll install the Stanford library, run it as a service, and then NLTK will communicate with it. So the NLTK library itself doesn't tend to provide many of these modules by itself. Instead, it's kind of the scaffolding that lets you call out to external tools, but this does mean that you have to set up these pipelines yourself and also that it's less efficient.

The other difference in focus is that spaCy is, I guess, more opinionated. I believe more in saying, alright, if you wanna build these things and get them done, then we should only really deliver you the 1 best way of doing this, and really stay up to date with the current state of the art, the current methodology with these things, and aggressively prune away all the other algorithms and all the other ways of doing things. Whereas NLTK really favors giving you a smorgasbord of lots of different options, and then once an algorithm is implemented in the library, it tends not to be deleted or dropped from support. So there's lots of examples of, you know, a history of ways that people used to do things in the library, whereas spaCy favors just getting you up and running and doing the task directly.

And does the difference in focus manifest in spaCy being any more difficult to use or requiring any more upfront knowledge of computational linguistics than the NLTK does?

So in many ways, spaCy actually requires less knowledge to use because you don't have to configure the pipeline yourself. You don't have to decide what, you know, modules to use or how you want these things to work or, you know, what all of these algorithms are. You basically feed in text and get back a document object and then, you know, all of these details are kind of taken care of under the hood. If you wanna go kind of 1 step in a different direction and, you know, learn something about the history of algorithms in the field or learn something about certain concepts that might be in a university course, then the implementations in spaCy are much worse to study than the implementations in NLTK. If you want to, you know, crack open the code and understand how everything works, well, spaCy's code is a little bit more difficult to understand in this respect. On the other hand, the algorithms are actually implemented in spaCy, whereas a lot of these algorithms are actually not in NLTK itself. They'll be in external services.

Can you dig a bit deeper into the internal design and architecture of how spaCy is built and also share what some of the biggest challenges have been during its development process?

Sure. So I guess architecturally, the first thing is you get a string, right, and you wanna split it up into tokens. So most NLP libraries use what's called destructive tokenization, as in, you know, the object that you get back from this tokenization process is a list of strings. This means that you don't keep hold of where the white space occurred and you can't easily associate the annotations back to the original document, the original string.

SpaCy instead immediately starts resolving the strings into, I guess, references to lexical types and returns back what are effectively pointers to vocabulary items. And then these rich types are annotated progressively with more information. So it's a slightly more object oriented approach. The primitives are, you know, C data structures that kind of accrue annotations over time, rather than getting back a list of strings and then having annotations in other lists of strings that refer to them. And, you know, when you're using the library, you can kind of see this at work: you get a document object that has all of the annotations in itself, and then you can iterate over sentences within it, or tokens within it, and I think that this actually helps people use the library a lot. In particular, it helps people associate different types of annotations together. Like, you know, if you want to look at the named entities in a document, you don't tend to just want the named entities. You also wanna understand something about how that relates to the other types of information. You wanna say, alright, what's the sentence that this entity is within? What are the other entities within that sentence? What are the word vectors, or meaning representations, that are associated with that entity? How does it relate to the syntactic structure of the sentence?

So spaCy's design really helps you kind of get at all of these different views of the document, but of course managing these different views is difficult, and, you know, I guess 1 of the architectural decisions is to avoid copying the data. Instead, as much as possible, the Python API gives you views of the underlying C data and avoids making copies. Everything kind of refers back to a single source of truth for the data representation.
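To make the contrast with destructive tokenization concrete, here is a minimal pure-Python sketch. It is not spaCy's implementation (which, as described above, lives in C data structures); the names `Vocab`, `TokenRecord`, `tokenize`, and `detokenize` are hypothetical, and the sketch assumes the text does not begin with whitespace. It only illustrates two ideas from the answer: each token stores a reference to a shared vocabulary entry rather than its own string, and the whitespace is kept so the original text can be rebuilt.

```python
import re
from dataclasses import dataclass


class Vocab:
    """Toy vocabulary: maps each distinct string to an integer ID and back."""

    def __init__(self):
        self._ids = {}
        self._strings = []

    def intern(self, text):
        # Reuse the existing ID for this string, or assign the next free one.
        if text not in self._ids:
            self._ids[text] = len(self._strings)
            self._strings.append(text)
        return self._ids[text]

    def string(self, lex_id):
        return self._strings[lex_id]


@dataclass
class TokenRecord:
    lex_id: int        # reference to the lexical type in the vocabulary
    start: int         # character offset of the token in the original string
    trailing_ws: str   # whitespace that followed the token, kept verbatim


def tokenize(text, vocab):
    """Split on whitespace, keeping offsets and the whitespace itself."""
    records = []
    for match in re.finditer(r"(\S+)(\s*)", text):
        lex_id = vocab.intern(match.group(1))
        records.append(TokenRecord(lex_id, match.start(1), match.group(2)))
    return records


def detokenize(records, vocab):
    """Rebuild the input exactly from the token records."""
    return "".join(vocab.string(r.lex_id) + r.trailing_ws for r in records)


vocab = Vocab()
text = "the cat  saw\nthe dog"
records = tokenize(text, vocab)
print(detokenize(records, vocab) == text)
```

Because both occurrences of "the" resolve to the same vocabulary ID, any information attached to that lexical type would only need to be stored once, and the double space and the newline survive the round trip.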

Is the single source of truth an optimization for allowing for greater speed within the library? Because I know that that's 1 of the main focuses.

So it's a bit of both. Yes, there's an efficiency advantage, although sometimes it would actually be more efficient to create local copies, but really it's about consistency. You know, let's say you take a span, which is like a slice of words. So it's, you know, some number of tokens. It's like token 3 to token 6. If you're creating copies of the data, you then can't modify that span. Otherwise, if you do modify that span, then the modifications that you make won't be reflected in some other slice of the document that you take. And it tends to be very confusing to kind of mentally keep track of what operations copy data and what operations don't, and once the user has to kind of keep this in mind, of what's a copy, what's not, what's a view, the library I think becomes very difficult to use and you get lots of subtle bugs and subtle problems and subtle limitations about what's possible and what's not. Instead I think the stable and consistent options are 1, everything's a copy, which I think ends up being quite difficult to use, or 2, everything's a view. And so I think actually it's better to adopt this single source of truth policy for this type of application.
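The copy-versus-view point can be shown with a small pure-Python sketch. The `Doc` and `Span` classes here are hypothetical stand-ins written for this illustration, not spaCy's classes: the span holds only a reference to its parent document plus two indices, so a change made through one span is visible through any overlapping span.

```python
class Doc:
    """Owns the token data; every other object only refers back to it."""

    def __init__(self, words):
        self.words = list(words)
        self.tags = [None] * len(self.words)

    def span(self, start, end):
        return Span(self, start, end)


class Span:
    """A view: holds a reference to the Doc plus two indices, no token data."""

    def __init__(self, doc, start, end):
        self.doc = doc
        self.start = start
        self.end = end

    def set_tag(self, tag):
        # Writes go to the parent Doc, so every other view sees them.
        for i in range(self.start, self.end):
            self.doc.tags[i] = tag

    @property
    def tags(self):
        # Reads are resolved against the parent Doc at access time.
        return self.doc.tags[self.start:self.end]


doc = Doc(["I", "visited", "New", "York", "City", "yesterday"])
entity = doc.span(2, 5)        # "New York City"
overlapping = doc.span(3, 6)   # "York City yesterday"
entity.set_tag("GPE")

# For contrast, a copied slice: edits to it never reach the Doc.
copied_tags = list(doc.tags[0:2])
copied_tags[0] = "CHANGED"

print(overlapping.tags)
print(doc.tags[0])
```

With views, the overlapping span sees the new tags without any synchronization step, while the edit to the copied list stays invisible to the document, which is the kind of inconsistency the answer describes.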

And what are some of the other benefits that are provided by maintaining the original document, including the spaces, rather than using destructive tokenization like some of the other libraries?

So 1 of the benefits is that you can really process the whole document more easily. So 1 of the nice things with spaCy is that instead of saying, alright, step 1, we'll split the document into sentences and then analyze each of those sentences individually, we really read the whole document from left to right. And this includes the named entity recognizer, the parser, the tagger. All of these things see the whole document in context, they don't just see a sentence. This means that if your document has different structures like paragraphs in it and things, the library is really able to take them into account. And it also means that you're able to do things like extract the whole quote even if that quote crosses sentence boundaries or crosses paragraphs, because you're able to see paragraph boundaries as, you know, new line tokens, and you're able to, you know, basically process this type of white space and document structure at the same time as you're processing the internal syntactic structure of the sentences.

It's also very useful if you're building an application that has to, in the end, print markup in the original document. So let's say you wanna highlight the negative sentiment words, like highlight the parts of a review that have led the system to call this a negative review. If the annotation views things as a sequence of tokens and doesn't relate those tokens back to the original text, you then have a difficult alignment problem. And this is just extra work that you shouldn't have to do, and in some cases, it introduces extra sources of errors that should be unnecessary. Most of the time if we wanna actually use these annotations, eventually, we wanna know something about the original text. And so it makes sense to actually be able to relate the annotations back to the text.
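The highlighting case can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `NEGATIVE_WORDS` is a made-up toy lexicon rather than a trained sentiment model, and `tokens_with_offsets` and `highlight_negative` are hypothetical helpers. The point is that once every token carries its character offsets, markup can be written back into the original text with no alignment step.

```python
import re

# Hypothetical toy lexicon; a real system would use a trained model.
NEGATIVE_WORDS = {"slow", "broken", "rude"}


def tokens_with_offsets(text):
    """Yield (word, start, end) so each token maps back to the text."""
    for match in re.finditer(r"\w+", text):
        yield match.group(0), match.start(), match.end()


def highlight_negative(text):
    """Wrap negative words in ** markers, leaving everything else untouched."""
    pieces = []
    cursor = 0
    for word, start, end in tokens_with_offsets(text):
        if word.lower() in NEGATIVE_WORDS:
            pieces.append(text[cursor:start])             # text before the word
            pieces.append("**" + text[start:end] + "**")  # the marked word
            cursor = end
    pieces.append(text[cursor:])                          # untouched tail
    return "".join(pieces)


review = "Service was  SLOW,\nand the waiter was rude."
marked = highlight_negative(review)
print(marked)
```

The double space, the newline, and the original capitalization all survive, because the output is assembled from slices of the original string rather than from a list of normalized tokens.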

And does the fact that the entire document is visible to each of the algorithms allow for any more complicated analysis or analysis of more complicated sentence structures?

Yes. So in particular it helps with, I guess, loosely formatted or conversational things which are maybe poorly punctuated. Other libraries are kind of developed and evaluated always on what's called gold standard sentence boundaries. As in, you know, if you look at a system like Google's SyntaxNet, the research that was done there is only ever done on documents that have perfect boundaries in place. And this means that the model that's trained has never seen an incorrectly recognized sentence. This makes it very vulnerable to formatting errors or inconsistent formats that happen at run time. So if your document is punctuated slightly differently, then a sentence boundary detector might segment the document differently, and this means that the downstream parsing tasks will receive input that's very unlike any that they've received during training. So there's a kind of robustness to variation that comes from seeing the whole document, because the model can be trained on the whole document and isn't trained on this perfectly preprocessed input.

It's also potentially useful for speech analysis. So if you're parsing the output of a speech recognizer, usually you get a whole long turn. Speech doesn't come with, you know, nicely punctuated sentences. And so if we wanna understand this in the parser, then we want to be parsing all of this at once and having the parser predict the sentence boundaries. So doing this whole document approach is good for that as well. And finally, there are other languages which don't naturally come with the punctuation like English does. So in these situations, it's also useful to have the syntax kind of aware of long streams of text rather than only aware of limited sized sentences.

The subject of different languages in the context of a natural language processing library, I imagine, adds a lot of complexity. And I know that, for the moment at least, the languages that are supported by spaCy are English and German. So I'm wondering, what's the level of effort and complexity involved in adding support for other languages?

So it varies somewhat across languages and, in particular, it varies depending on what resources are available from the computational linguistics community for other languages. 1 thing that's good is that the amount of effort and complexity is rapidly falling because the kind of feature tuning and problems that used to occur a lot with multilingual processing are now being addressed by neural network methods. So we now have machine learning models which are much more general across different inputs, and you have to tune them a lot less in terms of the features. And so the models more and more are able to read everything from the characters up, and so this aspect of things is improving. We're having to rely a bit less on these manually constructed resources. That said, if you want to train something like an entity recognizer well, you do need example data for this. And so if that example data isn't provided in a kind of ready made form for us to consume, or if we have to construct this ourselves, this makes it take longer to support these languages.

So I'm pleased to say that support for other languages is coming along well. We'll have good announcements for this in the upcoming period, and I think that after the development of the German model, a lot of these things go from 1 to many. And so I think we're actually getting ready to roll out a lot of resources for this.

And in terms of the overall size of these models, is that something that is contained in a separate data download that you would retrieve for any particular languages that you're interested in? Or is it currently shipped with the actual library itself?

So that's a good question. So at the moment, the English and German models are free data downloads that you kind of download from our server and inject into spaCy, and these datasets allow the library to perform well on general purpose English and German text. Now even within the same language there are often use cases for different statistical models. For instance, if you want to process Twitter text, that requires different knowledge from processing, say, biomedical articles or legal articles, even in English. So there are use cases for different statistical models for different purposes, and of course this applies across languages. So, you know, to process Spanish you need different knowledge in the library than you do to process English. And this knowledge is, you know, currently separate from the library as these data downloads, and this will continue because, you know, it's kind of fundamental to how these statistical models work.

So this is actually 1 of the commercialization things. We wanna develop these resources, and developing these resources is expensive and supporting them is expensive, but we think that if we develop this once for many people, we can then have that as a resource that lots of people can use as, I guess, premium content that they can inject into the library. So we'll do the work of developing a nice Twitter model or developing a nice model for Spanish social media text or, you know, Chinese legal text, and then all sorts of other people in the community can then access this for a relatively small fee as a commercial offering.

So you mentioned at the beginning of the episode as well that you were focused on having spaCy be available commercially. And I'm wondering what the business model looks like around spaCy, and what the open source licensing looks like for spaCy as well for somebody who wants to use just the freely available pieces of it.

So spaCy is available under the MIT license. Initially, when it was first launched, I thought about a dual licensing structure with AGPL and commercial, and I relatively quickly saw that this didn't work very well, at least for this library. The community around these things really much prefers to use MIT licensed libraries, and there's really not as much enthusiasm or community for AGPL licensing options. Instead, the model for the library that we're going to be rolling out is having premium data offerings that we develop for particular use cases in particular languages that people can access. And this works particularly well for the neural network models because a lot of the data assets are very computationally expensive to pre-train. So we wanna have pre-trained lower levels of neural networks, of word vectors, and recurrent neural network models that people can really build on top of, and a lot of these components take, you know, up to weeks of pre-training on large data samples. And so we think that this is an opportunity for us to centralize some of this work and make it available to the community as a commercial offering.

So you briefly mentioned a focus on deep learning or integration with deep learning as well, and that's something that I noticed in the documentation also. So I'm wondering why deep learning is such an important aspect of natural language processing and what the integration points look like for spaCy?

So over the last 2 years, if you look at, I guess, the trends in the research and the outcomes that are being demonstrated, more and more, there are very few situations where deep learning isn't the better approach, and I think that the times where it's not the right way to design an application will continue to shrink. So given that this is what I recommend almost all of our consulting clients use, I think it's very important for the library to use it. I think that this is actually the way that people should be using natural language processing going forward, especially because it allows you to kind of future proof your application and more easily integrate new and improved methods of doing things. So being up to date with the latest methods is super important because if you look at what people want to do for natural language processing, they're building things which were not good ideas to build 2 or 3 years ago. Right? And so you want to say, well, what's changed? Well, what's changed is that the research has moved forward and we now have better methods and more accurate methods of doing things. And really that basically translates to deep learning. So we want to help people stay up to date with the best ways of doing things, the easiest and most accurate ways of writing their application. So that's really what we're hoping to deliver, and we wanna make sure that spaCy is the easiest way to connect deep learning to text. We want to make sure that it's easy to take text and feed it forward into a deep learning model, with lots of kind of well designed primitives and well designed things to, you know, basically make that process as smooth as possible.

So 1 of the add on projects that you've built around spaCy that I think is very interesting and useful, particularly for somebody who's new to the field of computational linguistics, is the displaCy tool for being able to visualize the way that spaCy breaks apart sentence structure. So I'm wondering if you can explain a bit more about what the tool is and how it was built, and also why you think it was important to create?

Sure. So the displaCy tool is actually the first thing that I collaborated on with my co-founder Ines Montani. So she's the other half of Explosion AI at the moment, and this was a very productive collaboration because Ines has long experience as a front end developer, which is a type of skill that I really lack. And before we developed the displaCy front end, we really had trouble communicating what the functionality in spaCy did and how to use it, and also even understanding how the library should behave or will behave on any given sentence. So displaCy actually functions both as a demonstration and as documentation. I mean, I use it all the time when I'm trying to use the dependency parser because otherwise you have to understand this whole annotation manual about how spaCy will annotate the structure of a sentence. So if you wanna refer to the labels or structure, then being able to visually explore that analysis is useful.

More broadly, displaCy is, you know, part of a sort of growing suite of visualization tools that help people understand basically everything that the library does, and we're also building out annotation tooling to help both us and other people, you know, create this data efficiently, because ultimately, you know, the machine learning stuff is only useful if you can connect humans to it, and in order to create useful machine learning technologies you tend to have to annotate data and get the knowledge streaming from people into the computer. And so all of these things are, you know, human computer interaction problems where interfaces are really an important part of the whole equation.

So for somebody who wants to get started using something like spaCy, what are some of the signs that would suggest that using natural language processing would be a good fit for their particular problem? And also, what are some of the kinds of applications where spaCy would be useful that might not necessarily be obvious candidates for it?

So anything where you find yourself taking text as input and you wanna compute some function of that text that you find can't be trivially resolved at the character level, you tend to end up needing some level of natural language processing. So for instance, if the function that you're trying to compute over this text ends up depending on the content of the words, then you'll almost always want to kind of step back and not just have a rule based approach. You'll almost always want to say, alright, well, let's have something that actually splits the text into words accurately and also gives us something that is able to interact with the meaning of those words. Like for instance, word vectors or, you know, finding entities or finding the relationships between those words. So this is actually pretty common, like it's pretty common that you have some text as input and you want to compute something over it that is not just the length of the string or whether it's uppercase or lower case.

So in terms of types of applications which could consider using a library like spaCy, I actually think information retrieval stuff like search things could potentially benefit from taking a more detailed view of tokenization. And spaCy's tokenizer is actually fast enough to be used in, you know, something like a search project or a search product. And instead of having this kind of, like, slightly ad hoc approach to stemming the words, it may actually be useful to basically use a natural language processing library like this.

So would spaCy be potentially useful for situations where somebody might otherwise consider using Elasticsearch or the Lucene engine?

Well, I would actually say that if you need a full featured search engine like Elasticsearch or Lucene, then spaCy is no replacement. But if you're building something like Elasticsearch or Lucene, like, you know, for instance there are Python libraries which give you search engine functionality over, you know, more limited domains, like Whoosh or something. If you're building something like that, then it may be worth using spaCy or some other similar library.

You've commented a few times about the speed of spaCy and the fact that you implemented a large portion of it in Cython. So I'm wondering why is speed such an important focus for an NLP library, and I guess what inspired you to have that be 1 of the distinguishing factors?

So speed is very important for natural language processing because there's always more text. So in most applications you can expect the cost of some computational job to decrease sharply over time. For anything where the working set is growing at the same speed as the hardware, that's not really true. So you might have a task like analyse all of the articles in Wikipedia. Well, if you wanted to do that task in 2007 and you want to do it now, the growth in Wikipedia's size is racing the improvements in the hardware. And so that job actually isn't very much cheaper to compute given the same speed of process now than it was in 2007. That's a pretty rare situation. Usually, you know, if you wanna compute some fluid dynamics problem or something, well, your problem's maybe not growing in size as much as, you know, the natural language processing problems are.

So basically, we really want people to be able to read whole web dumps, and we want people to be able to read all of the text that's on their platform, and this tends to take a long time. I mean, even with spaCy as fast as it is, you know, when I want to compute something over, say, all posts on Reddit, well, I'm looking at, you know, either setting up a difficult clustering thing or, you know, it maybe takes a couple of days if I can work all the cores of a single machine. The other thing is that because we're moving the data around, actually just using Hadoop and, you know, sort of solving everything with parallelization tends not to be that attractive. We'd rather have the library be 100 times faster than have to use 100 times as many nodes, because if you do use 100 times as many nodes in a Hadoop cluster, well, you'll end up actually achieving performance that's still much less fast than just writing efficient code to start with.

What are the scaling strategies for spaCy when you do need to increase the rate at which you can churn through a corpus of data?

So I usually prefer to use a single machine with lots of cores because then you don't have to worry about networking. The problem is just much simpler. So I tend to basically just book a machine on my generous SoftLayer credits that IBM has given us, and book a machine with maybe 90 cores or something. Within each process, I fan out to 8 or 10 threads, because spaCy supports release of the global interpreter lock so it can efficiently use multiple threads, and then I just use simple Python multiprocessing. So the idea is that natural language processing tasks tend to be embarrassingly parallel: you wanna process each document or chunk of documents independently.
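The process-level half of that strategy can be sketched with the standard library alone. This is a minimal illustration, not the setup used for spaCy: `make_batches`, `process_batch`, and `process_corpus` are hypothetical names, and the per-document "work" is just a token count standing in for a real pipeline. Each worker process receives whole batches of documents and never needs to talk to the others.

```python
from multiprocessing import Pool


def make_batches(docs, batch_size):
    """Group documents into lists of at most batch_size items."""
    return [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]


def process_batch(batch):
    """Stand-in for real NLP work: count whitespace-separated tokens."""
    return [len(doc.split()) for doc in batch]


def process_corpus(docs, n_processes=4, batch_size=2):
    """Process batches in separate worker processes, keeping input order."""
    batches = make_batches(docs, batch_size)
    with Pool(processes=n_processes) as pool:
        results = pool.map(process_batch, batches)
    # Flatten the per-batch results back into one list, one entry per document.
    return [count for batch_result in results for count in batch_result]


if __name__ == "__main__":
    corpus = ["a b c", "d e", "f", "g h i j", "k"]
    print(process_corpus(corpus, n_processes=2, batch_size=2))
```

Batching keeps the cost of sending work between processes small relative to the work itself, and because no batch depends on another, adding cores should scale throughput roughly in proportion until memory or I/O becomes the limit.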

How have you managed to ensure that the release of the GIL doesn't lead to any sort of data races or segfaults in your programs?

So essentially, I just let OpenMP take care of a lot of these things. Fortunately, as I said, we really just have to have this single prange loop over the documents. So each thread receives its little batch of documents, and they don't have to communicate until the work on that batch of documents is done. So we just have to make sure that we don't have any kind of shared memory, like, you know, when you're processing a document, that it doesn't accumulate any state that is then under contention, and that's enough to make sure that we don't have any of these race conditions.
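The "no shared state until the batch is done" pattern can be sketched in plain Python threads. This is an analogy only: it does not use Cython's prange or OpenMP, pure Python threads still hold the GIL while running Python code, and `process_batch` and `count_words` are hypothetical names. What it shows is the data layout: each thread writes only to its own local result, and merging happens afterwards in a single thread.

```python
from concurrent.futures import ThreadPoolExecutor


def process_batch(batch):
    """Work on one batch using only local state, then return the result."""
    local_counts = {}  # private to this call, so nothing is under contention
    for doc in batch:
        for word in doc.split():
            local_counts[word] = local_counts.get(word, 0) + 1
    return local_counts


def count_words(docs, n_threads=4, batch_size=2):
    """Count words across documents, one batch per task."""
    batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = list(pool.map(process_batch, batches))
    # Merging happens in the calling thread, after every batch has finished.
    totals = {}
    for partial in partials:
        for word, n in partial.items():
            totals[word] = totals.get(word, 0) + n
    return totals


totals = count_words(["a b a", "b c", "a", "c c"])
print(totals)
```

Since no two threads ever write to the same object, no locks are needed, which is the property the answer relies on to avoid race conditions.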

You can set things up so that you do get a problem. There's a common problem with threading where if you have kind of nested threads, then you'll hit an error. So if you try to use the multithreading from within something where you've already fanned out into a child thread, then there will be a problem, but this tends to be a kind of uncommon use case.

So, you know, as much as this is a problem in theory, people don't seem to have this problem.

And you've mentioned your company, Explosion AI, that you ended up building on top of the success of spaCy. So I'm wondering if you can explain a bit about what the goals are for the company and what are some of the kinds of services that you're offering through it.

So the goals are really to help people take advanced natural language processing and put it into production. So really give people access to the latest developments in research so that they can build, you know, really great applications that couldn't be built yesterday. So that's, I guess, the main goal of the company and what we're really trying to achieve.

So one way that we're achieving that at the moment is by working with clients on an individual basis. It tends to be that a lot of machine learning problems and natural language processing problems sort of start to look alike if you've solved them before and if you kinda have the right abstraction over these things. So we can quite efficiently help people solve a specific problem by working with their problem directly, in a way that would be difficult to do by giving them just a library. Because if you give them a library, they still have to understand how to wire together these primitives into something that really solves their problem. So that's one thing that we're doing. We have this, you know, basically sort of consulting model where we develop a machine learning model and ship it to the clients based on their specific needs, and they get full ownership of the code and data.

Another way that we're doing this is to essentially build out, as I said, this sort of data store or model store, which lets people use more premium models and more accurate models that help for their specific use case. So at the moment spaCy supports data for general purpose English text and general purpose German text, but this isn't enough. We think that there's a use case for, you know, specific models for different languages and different genres, and that these data assets can be used in common between different people, and so we can build them once and have many people benefit from them. So those are the two things that we're really working on to achieve our goal of making it easy to use the latest natural language processing solutions.

So what are some of the most interesting uses of spaCy that you have seen or created?

So one of our users that, you know, I guess, got on the library early and has built impressive things is Chartbeat. So they are a media analytics company that helps large media companies understand their readership and understand their readers' behavior on their sites. And so to do this, they want to understand the content of the headlines and stories that people are clicking on and how the readers are behaving on the sites, and they're using spaCy to assist this.

So another company that's doing good things with spaCy is a startup in the UK called Cytora, and they have a product that helps their customers understand risk in their supply chain and basically across their business. And so Cytora read news articles and other content on the web and summarize these things into alerts that, you know, basically tell some customer that, okay, there's likely to be an interruption in supply at your factories in South America due to political conditions there or due to natural disasters that are, you know, even quite deep in the supply chain. So even, like, 3 steps removed, of a company that sells to a company that you buy from. And I think that both of these are quite interesting applications that have people trying to understand all of the knowledge in the world.

Another thing that people are building at the moment is chat applications, and I think that this is really something which people are still figuring out, and people will, you know, I guess refine and nail over time. But there's definitely lots of interest in this, and I expect to see interesting products being released shortly.

Yeah. I think that the recent explosion of interest in chat focused bots is definitely a great area for use and innovation of libraries like spaCy. One of the areas that I particularly see it and interact with it is in bots that are used for executing operations tasks. So sort of like the chat ops kind of focus or bots that are used inside group chat rooms like Slack or HipChat in a company context. So being able to add a more conversational style interaction with the bot as opposed to just the strict keyword based model that has been the sort of state of the art to date, I think would be definitely a very useful application of libraries like spaCy.

Yeah. And actually there's a whole ecosystem evolving of companies that I guess are setting out to help people build these applications, because I think that it's very early in this sort of space of development, and I think that as a development community people still really have to solve a lot of the problems associated with these things and, you know, it'll be exciting to see how that unfolds over time.

So what do you have planned for the future of spaCy?

So we definitely wanna deepen the deep learning integrations. So at the moment, the parser and named entity recognizer still use the linear models, partly because until now the deep learning libraries have been quite difficult to install, and we really wanna make sure that it's as easy as possible to get up and running with spaCy across different platforms. The ecosystem around these things is improving, and in particular, the research models that are being released continue to improve. So that's definitely something that we wanna target. We also wanna target deepened support for other languages and continued improvements in documentation and tutorials. Fortunately, with the 1.0 release, we've managed to improve the documentation a lot, but there's always still more to write and we wanna make sure that the library is as easy to use as possible.

So are there any other topics that you think we should cover before we close out the show?

No. Not that I can think of. I think that, you know, seems good.

So for anybody who wants to follow you and keep in touch with what you're up to, what would be the best way for them to do that?

So following on Twitter, either @honnibal or @spacy_io, and also the spaCy newsletter, which is listed on the spaCy site, is definitely the best thing to do. If you wanna work with us, then we have a process set up on Explosion AI that essentially gives you some nice questions in a nice Typeform to, you know, help us figure out how we can help.

Great. So with that, I will move us into the picks.

So for my picks this week, I'm actually gonna pick a new field recorder that I picked up recently called the Zoom H4n Pro. So I picked that up because I'm actually going to be having a booth on the exhibit floor at PyCon US for 2017. So I wanted to be able to have something that I could use for recording ad hoc interviews. So I picked up that. And then to go along with it, I also got the Shure SM58 microphone, because the Zoom recorder has XLR inputs so that you can plug in external microphones so that you can get a little higher quality recording in addition to the microphones that are built into the recorder itself. So I'm gonna pick both of those. I've been experimenting with it a little bit so far, and they both seem to be really high quality. And I'm pretty excited for being able to actually put them into use. So with that, I will pass it to you. Do you have anything for us this week?

Not off the top of my head. Sorry.

Sure. No problem. Well, I really appreciate you taking the time out of your day to tell us more about spaCy. It's definitely an interesting project. And after talking to you about it some more and thinking about it, I'm interested in seeing if I can apply it to building out a chatbot for my work so that I can add a little bit more fluent syntax for people who are interacting with it.

Oh, cool. Sounds interesting. You know, inevitably, there'll be something that's confusing, and I hope that you ask questions about it on the issue tracker or on Stack Overflow when it is.

I will do that.

Okay. Cool. Thank you.

Have a great evening.

You too. Bye.
