Machine Learning

AI Driven Automated Code Review With DeepCode - Episode 226

Summary

Software engineers are frequently faced with problems that have been fixed by other developers in different projects. The challenge is how and when to surface that information in a way that increases their efficiency and avoids wasted effort. DeepCode is an automated code review platform that was built to solve this problem by training a model on a massive array of open source code and the history of its bug and security fixes. In this episode their CEO Boris Paskalev explains how the company got started, how they build and maintain the models that provide suggestions for improving your code changes, and how it integrates into your workflow.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host as usual is Tobias Macey and today I’m interviewing Boris Paskalev about DeepCode, an automated code review platform for detecting security vulnerabilities in your projects

Interview

  • Introductions
  • Can you start by explaining what DeepCode is and the story of how it got started?
  • How is the DeepCode platform implemented?
  • What are the current languages that you support and what was your guiding principle in selecting them?
    • What languages are you targeting next?
    • What is involved in maintaining support for languages as they release new versions with new features?
      • How do you ensure that the recommendations that you are making are not using languages features that are not available in the runtimes that a given project is using?
  • For someone who is using DeepCode, how does it fit into their workflow?
  • Can you explain the process that you use for training your models?
    • How do you curate and prepare the project sources that you use to power your models?
      • How much domain expertise is necessary to identify the faults that you are trying to detect?
      • What types of labelling do you perform to ensure that the resulting models are focusing on the proper aspects of the source repositories?
  • How do you guard against false positives and false negatives in your analysis and recommendations?
  • Does the code that you are analyzing and the resulting fixes act as a feedback mechanism for a reinforcement learning system to update your models?
    • How do you guard against leaking intellectual property of your scanned code when surfacing recommendations?
  • What have been some of the most interesting/unexpected/challenging aspects of building the DeepCode product?
  • What do you have planned for the future of the platform and business?

Keep In Touch

Picks

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:15
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O'Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to pythonpodcast.com/conferences today to learn more about these and other events, and take advantage of our partner discounts to save money when you register. Your host, as usual, is Tobias Macey, and today I'm interviewing Boris Paskalev about DeepCode, an automated code review platform for detecting security vulnerabilities in your projects. So Boris, can you start by introducing yourself?
Boris Paskalev
0:01:47
Hi, my name is Boris Paskalev. I'm CEO and co-founder of DeepCode. We are currently based in Zurich, Switzerland.
Tobias Macey
0:01:55
And so can you start by explaining a bit about what the DeepCode project is, and some of the story of how it got started?
Boris Paskalev
0:02:01
So ultimately, what DeepCode does is learn from the global development community, every single issue that was ever fixed and how it was fixed, and combine this knowledge of all development, almost like crowdsourcing the development knowledge, to prevent every single user from repeating those mistakes that are already known. In addition, we actually have predictive algorithms to understand issues that may not have been fixed yet, but could actually appear in software development. And where we started: the idea started with the other two co-founders. They actually spent more than six years researching the space of program analysis and learning from big code, which means like billions of lines of code that are available out there. They did that at ETH Zurich, which is what we call the MIT of Europe, and they are among the foremost experts in the world in that space, with hundreds of publications. And when they finished the research and our CTO published his PhD, we decided that it totally made sense to actually build it into a platform and revolutionize how software development works.
Tobias Macey
0:03:10
And was there any particular reason for focusing specifically on security defects in code and how to automatically resolve or detect them?
Boris Paskalev
0:03:19
Actually, security was a later add-on. We actually did that in 2019, which just started this year, and we published a specific paper on that. The platform itself is not targeting anything specifically; it can detect any issues, be that a bug, performance, you name it. Security was just a nice add-on feature that we added, and it was pretty novel as well.
Tobias Macey
0:03:43
So in terms of the platform itself, can you talk a bit about how it's implemented and the overall architecture for the actual platform, and how it interacts with users' code bases?
Boris Paskalev
0:03:54
So pretty much what it does is, there are two steps, both in learning and in analyzing code. The first step is, we take your code and we analyze it quickly: we convert it, using standard parsing for each language, and then we actually do a Datalog extraction of semantic facts about the code to build a customized internal representation of the various interactions: every single object, how the object propagates and interacts with functions, gets into other objects, how they change, etc. And this knowledge represents pretty much the intent and how the program functions. Then we do that for every single version of the program, so we see over time, when people commit code and change code, how that changes, and that gives us the delta: what is changing and how people are fixing things. Then we learn, extremely fast, over these hundreds of thousands of repositories, obviously with billions of lines of code, and we actually identify trends. This is where our machine learning kicks in. It identifies trends in how people fix things: what are the most common fixes, are there specific weird cases, etc. And this is how we have the scalable global knowledge, as we call it.
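
As an illustration of the idea Boris describes here, extracting facts from parsed code and diffing them across versions, the following is a minimal toy sketch using Python's standard ast module. It is only an approximation of the concept; DeepCode's actual pipeline uses Datalog extraction and a far richer internal representation.

    import ast

    # Toy extraction: collect (function_name, argument_count) facts from
    # every call site in a piece of source code.
    def extract_call_facts(source):
        facts = set()
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                facts.add((node.func.id, len(node.args)))
        return facts

    before = "open('data.txt')"
    after = "open('data.txt', 'r')"

    # The delta between two versions approximates "what this fix changed".
    print(extract_call_facts(after) - extract_call_facts(before))   # {('open', 2)}
    print(extract_call_facts(before) - extract_call_facts(after))   # {('open', 1)}
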
Tobias Macey
0:05:03
For the languages that you're currently supporting, I noticed that you're focusing, at least for the time being, on Python and JavaScript, and I believe there are one or two others. And I'm wondering what your criteria were for selecting the languages that you were targeting for evaluation and automated fixing, and some of the other languages you are thinking about targeting next?
Boris Paskalev
0:05:23
Yep. So pretty much we started with the most popular languages out there. I mean, there are different charts, but kind of the standard suspects are obviously Python, Java, JavaScript. Then, following down that line, we're looking at C#, PHP, and then will come C and C++, and down the list. I mean, we're getting more and more requests for various languages, so it's a combination of the ranking and popularity of the language, as well as specific customer requests, specifically big companies that are asking for very specific languages.
Tobias Macey
0:05:52
Given the dynamic nature of things like Python and JavaScript, I'm wondering what are some of the difficulties you have faced as far as being able to statically analyze the languages and avoid any cases where there might be things like monkey patching going on, or maybe some sort of code obfuscation?
Boris Paskalev
0:06:12
Yeah, so since we don't do the typical static analysis here, we are actually doing a static semantic analysis, and we do that in context. That allows us to go much deeper. For example, if you have a particular object, and then you put it into an array, and then the object comes out, we still know that it's the exact same object. So that kind of gets us closer to a dynamic analysis as well. Those are some of the features that allow us to analyze and identify much more complex issues, closer to what you would get with an interprocedural analysis, if you will. And this allows you to get much, much higher accuracy, not have the false positives that other tools will throw at you, and identify issues that classical syntactic static analysis would not be able to see at all.
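
To make the array example concrete, here is a small Python snippet of the kind of bug that tracking a value through a container can catch; the comments note what a semantic analysis can conclude that a line-by-line syntactic check cannot:

    # Create a file so the example is runnable end to end.
    with open("data.txt", "w") as f:
        f.write("hello")

    handles = []
    fh = open("data.txt")
    handles.append(fh)
    fh.close()

    # A syntactic, line-by-line check loses track of `fh` once it enters the
    # list; a semantic analysis that follows the object knows handles[0] is
    # the exact same, now-closed, file and can flag the read below.
    same_fh = handles[0]
    data = same_fh.read()  # ValueError: I/O operation on closed file
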
Tobias Macey
0:07:02
Another thing that can potentially complicate that matter is the idea of third party dependencies and how that introduces code into the overall runtime. And I'm wondering how you approach that as well, particularly as those dependencies are updated and evolved.
Boris Paskalev
0:07:17
Pretty much, for dependencies, we scan the dependency's code if the code is included in your repository. There are many other services out there that have a list of dependencies and their versions, and which ones might have issues or not; we don't do that, because those are pretty much static databases that do that. But we do look at how you actually call a specific API. So if you have a dependency, and you're calling some kind of a function from it, we are actually going to identify how you are calling the function, telling you if you are calling the function in the right way, or if the third parameter that you're passing is not the right one, etc. But specifically which dependencies you incorporate, we don't actually look at. I mean, we can tell you you're importing something more than once, or importing something you're not using; things like this we have as well. And that's kind of the scope that we go into.
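
For a sense of what those checks look like in practice, here is a small illustrative Python snippet; the diagnostics in the comments are paraphrased examples of the categories Boris mentions, not DeepCode's actual output:

    import json
    import json  # duplicate import: the module is imported twice
    import os    # unused import: imported but never referenced

    payload = json.loads('{"a": 1}')

    # Wrong argument to a dependency's API: json.dumps accepts no `pretty`
    # keyword, so a semantic check of the call site can flag the parameter
    # before this raises TypeError at runtime.
    text = json.dumps(payload, pretty=True)
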
Tobias Macey
0:08:08
Another thing that introduces complexities is as languages themselves evolve and introduce new capabilities or keywords, and I'm wondering how you keep up with those release cycles and ensure that your analyzers and your recommendation engines are staying up to date with those capabilities. And then also, on the other side, ensuring that any recommendations that you provide in your code reviews match the target runtime for the code base as it stands. So for instance, if somebody wrote a Python project that is actually using Python 2, that you don't end up suggesting fixes that rely on Python 3 features.
Boris Paskalev
0:08:44
So, the first one: when languages change and evolve, which is, again, pretty common these days, there are two things. First of all, are the parsers supporting the new feature? We have to get the latest version of the parser; if the parser is supporting it, that's great, and if the parser is not supporting it, then we have to do our own extensions until the parsers start supporting it, because we pretty much use standard parsers with minimal extensions, only when needed. The second case is if there's something fundamentally different about the language; this is where we might actually have to extend our internal representation to support it. But something really fundamental like that we rarely see in existing languages; that's more what happens if you add a new language. So those are the two major branches when a new thing comes in, but for the majority of things there's very little that we have to do beyond updating to the latest parser. On the second question that you asked, about Python version 2 versus version 3: we don't specifically differentiate that, but if we give you suggestions that only apply to Python version 3, and you're on Python version 2, you can just ignore those suggestions. And you can actually create a set of rules saying, okay, these are all the rules that apply to Python version 3, just ignore them; you can put that into a config file, and until you migrate to version 3 you can just ignore those rules.
Tobias Macey
0:10:07
And there it also gets a little bit more difficult within Python 3 versions. For instance, if your code is targeting Python 3.5, you don't want to suggest fixes that incorporate things such as f-strings or data classes. And I'm curious how you approach that as well, or if it's more just based on what the user specifies in their config as far as the runtime that they're using.
Boris Paskalev
0:10:31
That is a great question. So we don't have anything very strong in that space. The thing that helps there is that all the suggestions we provide are contextually based, so usually you can actually see what's happening before and after a specific issue, and if they're version specific, then you will not get the recommendation, because it looks different in your case. That doesn't cover all the cases, obviously; I think you're right to be asking that question, and we don't have a great solution for it. We leave it to the developer to actually say, when they see a suggestion, nope, I don't care about that. Clearly, as I said, we can do the ignoring rules. But those changes are rare. I mean, they do happen, and we've seen cases where the developer says, yeah, I don't care about this yet, I haven't updated, and that happens. But we usually target most of our suggestions, and the learning, since it's automated, gets the learnings from the latest version. So as a large percentage of the development community moves to the latest version and makes changes related to it, you'll be getting suggestions for that as well.
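
The f-string case raised in the question is a concrete example of a suggestion that is correct in general but wrong for an older runtime:

    # Illustration of a version-sensitive suggestion. Both lines are
    # equivalent, but the second is only valid syntax on Python 3.6+;
    # a project pinned to Python 3.5 would want to ignore this class of
    # recommendation via the rule config described above.
    name = "world"

    greeting = "Hello, {}!".format(name)  # valid on every Python 3 version
    greeting = f"Hello, {name}!"          # f-strings: Python 3.6 and later only
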
Tobias Macey
0:11:27
Can you describe a bit more about the overall workflow for somebody who's using deep code and how it fits into their development process?
Boris Paskalev
0:11:34
Yep. So the most standard one that we envision, and that we see is most popular out there, is as a developer tool that lives on Git. So pretty much you log in with your Git account, GitHub, Bitbucket, whatever that is, you see all the repositories that you want to analyze, and you subscribe them. Once the repository is subscribed, you get two things. First of all, every time you do a pull request, we actually analyze it and we tell you, within this diff, are you introducing any new issues? So that's number one: continuously monitoring the new code being generated. The second piece is continuous monitoring of your old code, because old code also ages. As the development community changes, new security vulnerabilities are uncovered, etc., and something that you've written two years ago may actually not be secure anymore. You actually want to get pinged for that, because very few people actually go back and look into code from two years ago. So that will give you a ping as well, saying, hey, this function here has to be updated to a new encryption, for example, to make sure it's secure. So those are the two major pieces; again, it fully lives in Git. In addition to that, we also offer an API and a command line interface, so you can really integrate our solution anywhere you want. It could be as part of continuous integration; we actually have that in GitHub already, so that once you finish the pull request, before the merge, it can tell you: hey, we analyzed it, there's no critical stuff, please proceed; or, there's one critical issue, look at it. The API and command line interface allow you to script, within minutes, a checker at any point in your workflow, because developers in different companies or setups have very different development workflows, and they might want it at different stages: if you have a QA team, continuous integration, continuous delivery, versus individual builds every day or month, whatever that is.
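
As a sketch of the pre-merge gate described here, the following shows how a command line interface might be scripted into a continuous integration step. The deepcode command name, its flags, and the JSON layout are hypothetical placeholders for illustration, not the tool's documented interface:

    import json
    import subprocess
    import sys

    # Run the analyzer over the working tree. Every flag and the JSON shape
    # below are hypothetical placeholders, not a documented interface.
    result = subprocess.run(
        ["deepcode", "analyze", "--path", ".", "--format", "json"],
        capture_output=True, text=True, check=False,
    )
    report = json.loads(result.stdout or "{}")

    # Fail the build before the merge if any critical issue was reported.
    critical = [s for s in report.get("suggestions", [])
                if s.get("severity") == "critical"]
    if critical:
        print(f"{len(critical)} critical issue(s) found; blocking merge.")
        sys.exit(1)
    print("No critical issues found; proceed with the merge.")
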
Tobias Macey
0:13:22
And then in terms of the model itself, can you describe a bit about the overall process that you're using for training and some of the inputs that you use as far as curating the projects that you're using as references to ensure that they are of sufficient quality and that you're not relying on something that is maybe using some non standard conventions?
Boris Paskalev
0:13:44
Yep, so two points on this. We do have a custom curation; it takes a lot of different things into account: how active the project is, how many contributors, how many stars, etc. That's continuously updating. And this is mainly done because there are a lot of projects on Git that haven't been touched for like two years, or have only one developer who never touches them; there's kind of a long tail of such projects, and we just don't want to waste time analyzing them. The machine learning automatically weeds out the kind of poison pills, in a way: a random developer who fixed something in the wrong way. And this is where it comes in with the probability that we assign to every single suggestion that we have, which is based on how many people fixed it this way, whether there is a trend of a lot of people fixing it, how many counterexamples there are, and how many of these issues actually exist in the open source community today. Based on that, we can automatically sift out issues, because when you fix something wrongly, it's very unlikely that many people have fixed it the same wrong way. That only happens, for example, if somebody publishes a wrong solution and nobody catches it. That can happen for a week or two, but usually it gets resolved quickly, and then our knowledge base automatically updates.
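
A toy sketch of the two mechanisms Boris describes, the repository filter and the per-suggestion probability, follows. The thresholds and the formula are invented purely to make the shape of the idea concrete; they are not DeepCode's actual model:

    # Invented thresholds: skip inactive, single-developer, long-tail projects.
    def worth_learning_from(stars, contributors, months_since_commit):
        return stars >= 10 and contributors >= 2 and months_since_commit <= 24

    # Invented formula: a "poison pill" fix seen once scores near zero, while
    # a fix applied consistently across many repositories scores near one.
    def suggestion_confidence(times_fixed_this_way, counterexamples,
                              open_occurrences):
        support = times_fixed_this_way / (times_fixed_this_way + counterexamples + 1)
        prevalence = min(1.0, open_occurrences / 100)
        return support * prevalence

    print(suggestion_confidence(1, 0, 3))      # one-off fix: ~0.015
    print(suggestion_confidence(400, 5, 900))  # widespread trend: ~0.99
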
Tobias Macey
0:14:58
in terms of the amount of domain expertise that's necessary for identifying those faults that you're trying to detect. I'm curious if you're using sort of expert labeling techniques, where you have somebody going through and identifying the faults that were created and the associated fixes, or if you're relying on more of an unsupervised learning model for being able to build the intelligence into your engine.
Boris Paskalev
0:15:23
So it's mainly unsupervised learning. We actually do have some labeling, which is based on how severe the issue is. We have a categorization of critical, warning, and info type suggestions, so we have to actually categorize which ones are critical, and this is where our team comes in. But that's per type of issue, so within two hours you can label hundreds of thousands of different suggestions. It's a pretty quick process, with very minimal supervision that we have to do. Everything else is pretty much fully automatic. We automatically detect the type of issue: is it security, is it a bug, is it performance, etc. We use a number of techniques there; we have NLP on the commit messages, and we obviously look into the specific code and what it does semantically, because we have a predictive algorithm that infers the usage of specific functions and objects, so we actually know what they're doing and in what setting they're being used.
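
As a deliberately simplified illustration of using signals from commit messages to guess an issue's type, consider the following sketch; real systems use far richer NLP models, and the keyword lists here are assumptions for demonstration only:

    # Illustrative keyword lists; these are assumptions, not a real model.
    ISSUE_KEYWORDS = {
        "security":    ["cve", "vulnerability", "xss", "injection", "overflow"],
        "bug":         ["fix", "crash", "exception", "regression"],
        "performance": ["slow", "speed up", "optimize", "latency"],
    }

    def classify_commit(message):
        msg = message.lower()
        for issue_type, keywords in ISSUE_KEYWORDS.items():
            if any(kw in msg for kw in keywords):
                return issue_type
        return "unknown"

    print(classify_commit("Fix XSS vulnerability in template rendering"))  # security
    print(classify_commit("Optimize cache lookup to reduce latency"))      # performance
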
Tobias Macey
0:16:20
And you mentioned that for the pull request cases you're relying on parsing the diffs of what's being changed. And I'm curious if there are cases where the diff just doesn't provide enough context about the overall intent of the code, and any approaches that you have for being able to mitigate some potential false positives or false negatives, where you missed something because the code is only changing maybe one line, but you need the broader context to understand what's being fixed.
Boris Paskalev
0:16:50
Ah, okay, sorry, maybe I didn't clarify that correctly. We do analyze the whole tree; we always do the full analysis. But usually the semantic changes are only within the diff, and we actually show you what they are. So if a change that you make on this line of code is causing a security issue somewhere else, we will actually catch that, absolutely. We cannot analyze anything smaller than that, because our internal representation requires the context of what's happening, so we have to analyze every single function and procedure to see what it does. So we will analyze everything, but usually the changes that are happening are in the diffs, because that's where the focus is, though the issue could come from a different part of the code base as well. In terms of the false positives and false negatives you mentioned, there are a number of techniques to lower those. We have kind of a record high accuracy rate compared to any of the existing tools today, and that's mainly based on contextual analysis, so we actually know in which cases the problem is there, and on the fact that we usually have thousands of examples, so it's pretty accurate. And we're not doing a syntax-based comparison; we're doing a semantic comparison. We're not just looking at what you're doing in the specific lines of code, because without knowing the semantic details about it you could actually be very wrong; looking at it semantically gives you a considerably higher accuracy rate.
Tobias Macey
0:18:12
And in terms of identifying those false positives and false negatives, if you do identify maybe a false positive, is there any way for the users to be able to label it as such, so that it can get fed back into your machine learning models and you can prevent that from happening in the future? And just any other sort of feedback mechanisms that are built in for users to be able to feed information back into your model to improve it over time?
Boris Paskalev
0:18:38
Yep, so we have two ways. First of all, you can ignore rules on your own: you can say, hey, this rule I don't like, and you can decide if you want to do that for a project or in general. And the second is you can actually give a kind of thumbs up or thumbs down with a comment saying, yeah, I don't like this because of such-and-such. So those are the two main mechanisms where we look at it. And clearly, for open source, we get the feedback automatically, whether an issue was fixed or not, and, as I said earlier, we look at how many of these issues exist in the code bases out there and how many of these types of issues have been fixed, which is part of our probability assessment of whether an issue should actually be flagged or not.
Tobias Macey
0:19:18
And in terms of the code that you're analyzing, I'm wondering, again, how that feeds back into your models, particularly in the case where somebody might be scanning a private repository and there is some sort of intellectual property in terms of algorithms or anything along those lines, and preventing that from getting fed back into your model so that it gets surfaced as a recommendation on somebody else's project.
Boris Paskalev
0:19:42
Yep. So we do not learn from private code; it does not become part of the public knowledge. We have a special function where you can learn from your private code, and that becomes your own knowledge; that's usually for larger companies with large code bases. When we analyze your code, we don't learn from that code. We do learn from open source repositories, and depending on the licensing, there are some open source repositories that you can see but cannot use. From those, we are never going to create the suggestion examples; we will still count them toward how many times we've seen that issue and how often it's been fixed, but we will never show them as an example. The fix examples will only come from fully open source projects.
Tobias Macey
0:20:27
And in terms of the overall challenges, or anything that was particularly interesting or unexpected that you've come across in the process of building and growing the DeepCode project and the business around it, what have been some of the things that were notable in your experience?
Boris Paskalev
0:20:45
Wow, that's an interesting question. I think the one that is most striking is the number of different technologies and innovations that we have to build. We create new versions of the platform a lot; we're actually literally about to release a new one in a matter of weeks, and we've released it to some pilot customers already, which considerably increases the coverage while maintaining the same high accuracy. So we really have to come up with new things all the time. Half of our team is focusing on really inventing new stuff, and we publish about half of it, because those are pretty interesting findings; the rest we keep internal, because obviously it is proprietary, and over time it comes out. So it's really the sheer volume of new things that you have to build. There are so many modules that when our CTO starts drawing the whole picture, it takes hours, since it's a bunch of small boxes, and each one on its own is kind of a different innovation that came up. That's really interesting, and I was not expecting that two years ago when I started looking into it. When I look at it today, we are still doing a lot of that, and when I look at the roadmap, a lot of new things are coming in this space as well. So that is quite interesting, and it explains why there has never been a platform so far that really goes deep into understanding code in this way, and is then able to learn from such a large set of big code out there in an extremely fast way.
Tobias Macey
0:22:12
In terms of the platform itself and its capabilities, what are some of the overall limitations, and some of the cases where you might not want to use it, or might want to avoid some of the recommendations that it comes out with, just because of some of the artifacts of the code that you're trying to feed through it?
Boris Paskalev
0:22:30
Sure, good question. So, no limitations in general; it's fully scalable and can support any language, that's the base of the architecture. As for specific areas where you don't want to use it: we haven't found one yet. Ultimately, these are the basic building blocks. Maybe when we start delivering some more higher-level architectural analysis, some of those spaces might come up, but that's still to come. For the basic building blocks, finding bugs and issues in your code, we haven't found any specific areas where it doesn't apply. Some projects may have a little bit higher false positive rate than others for specific reasons, as you mentioned with the Python version, for example, using Python version 2 and being given a lot of Python version 3 suggestions. But other than that, there is no industry or language or focus-specific limitation.
Tobias Macey
0:23:16
And another thing that is a potential challenge are cases where the code base itself is quite large. I'm wondering if you have run into any issues where you've hit an upper limit in terms of your deployed platform being able to parse and keep the entirety of that structure, semantically, in the working set, and any strategies that you've developed to be able to work around that.
Boris Paskalev
0:23:40
The platform is designed to literally handle anything, millions of lines of code in seconds. I mean, think about it: we are learning from billions of lines of code, and in order to do that efficiently we've built some pretty efficient algorithms. So we haven't seen any issues; we have analyzed some pretty large code bases without problems. On average, compared to other tools, we tend to be oftentimes a hundred times faster in the analysis. So yeah, I think scalability is definitely not an issue. It happened a couple of times that we ran out of hard disk space because of caching, but since we're in the cloud it was pretty fast to add more.
Tobias Macey
0:24:21
Yeah, I was just thinking in terms of some of the sizes of monorepos for the Googles and Facebooks of the world, where it takes, you know, potentially hours to actually clone the entire history of the project, and some of the workarounds that they've had to do. But I'm sure that that's the sort of one tenth of one percent case where code is even of that scale. I was just curious if you had ever run into something like that.
Boris Paskalev
0:24:47
You're right, the cloning is the slow part. For those large repositories, usually cloning takes a while, and then the analysis is much, much faster in our case. So we actually separate those out when we report, so people know why it's slow. Cloning is sometimes fast, sometimes slow, especially depending on the network in the cloud and how many people are on it, but then the analysis is much, much faster than the cloning.
Tobias Macey
0:25:13
What are some of the other user experience tweaks that you've ended up having to introduce, just to improve the overall reception of your product and make sure that users are able to take full advantage of it?
Boris Paskalev
0:25:26
I mean, the areas where we've tweaked a bit are specifically explanations, trying to actually explain to the customer what the issue is. We actually had to release a whole new engine just for that, because people were saying, yeah, that's a bit confusing. So we had to build on the UI perspective as well, so people understand what it is. There is also obviously work in progress on the website, specifically explaining to customers that their code is secure: we don't use it, we're not going to display it, as you rightfully asked, to other customers, we're not going to use it for anything else, and we're not going to store it. There are other companies that have had issues with that, so we're very diligent about it. But yeah, those are kind of the major areas out there.
Tobias Macey
0:26:08
And looking forward, what are some of the features or improvements that you have planned for the platform and for the business.
Boris Paskalev
0:26:16
So the key one, and our internal main KPI for this year, is the number of actual issues, like recall, that we can find. As I mentioned, that's going to be coming up very soon, so expect something like a four to five times increase in the number of issues that we can detect. That's pretty exciting. Among the other things we're looking at, we're ultimately doing code fixing; we're starting to look into that right now, but that's likely an early 2020 release: being able to give you suggestions for how to fix things automatically, so you don't even have to write the code or try to understand it. We don't recommend that, obviously, but the capability is going to be there. The other one is, as I mentioned, trying to analyze the code at a more architectural, semantic level and describe it; that's another big one. We're also toying with some more interesting stuff, like fully automatic test generation, but for that we have to see the results and how commercially viable it will be. We have quite a long roadmap of cool things that will come up. And on the purely operational side, more integrations; obviously, people are asking for integrations, so we're going to be releasing our first IDE integration quite soon, where developers will be able to just directly get the results in their IDE while the analysis runs somewhere else. And hopefully, if that spins out well, we'll kind of open it up so anybody can build IDE integrations, because there's quite a list of IDEs out there.
Tobias Macey
0:27:44
Yeah, being able to identify some architectural patterns and ways that the code can be internally restructured to improve it, either in terms of the understandability of it, or potentially the scalability or extensibility, would definitely be interesting. And also what you were mentioning as far as test cases: either identifying where a test case isn't actually performing the assertion that you think it is, or cases where you're missing a test, and being able to surface at least a stub suggesting how to encompass that missing piece of functionality and verify it.
Boris Paskalev
0:28:21
Correct, yeah. In the test case space specifically, the area that we're looking at is finding the test case out there that is most suitable for exactly what you're doing, because that's human-generated already, and a human will maintain it in the long run, which is pretty much the main Achilles' heel for all the current automatic test case generation out there, and then adjusting it a little bit so it's perfect for you. That's really the focus area that we're going after in that space, which is pretty exciting. As I said, if it turns out to work, it will be an amazing product as well, and a nice add-on. But yeah, the platform has grown in a way that lets us build multiple products, and we're just scratching the surface; lots more will come up.
Tobias Macey
0:28:58
So there are some other tools that are operating in this space, at least tangentially, or, you know, at surface value might appear to be doing something along the same lines of what you're doing, the most notable being the Kite project. And I'm wondering if you can provide some compare and contrast between your product and Kite, and any others that you're tracking in a similar space?
Boris Paskalev
0:29:20
Yep. So Kite is a great tool; it's a great IDE integration, and they have some great inline suggestions. The main differentiation between Kite and any other similar tool doing static analysis is that they look at the code at a much shallower level. They actually try to tell you, hey, it looks like, based on what you're typing, a lot of other people are typing this, which is almost like treating the code as regular text, like it's just syntax. Whereas we're actually doing semantic analysis; we're saying, you're typing this, and the parameter that you're passing in is not right: the object you're passing in is not the right type, whatever that is. So that's kind of the main differentiation. Their suggestions are mainly about autocompleting a bit faster as you type; they do go a bit deeper and give you kind of linter-type suggestions as well, but again with a higher false positive rate, obviously, because it doesn't go deeper to understand the issue and doesn't give you the contextual analysis. So the recall and the accuracy are the two main things to measure: we can find considerably more things, and the accuracy rate will be considerably higher. That's the main differentiation out there. But, by the way, they have an amazing UI, an amazing design, and an amazing community behind them, so it's a great tool as well.
Tobias Macey
0:30:38
Are there any other aspects of the work that you're doing at DeepCode, or just the overall space of automated fixes and automated reviews, that we didn't discuss yet that you'd like to cover before we close out the show?
Boris Paskalev
0:30:50
Yeah, I don't want to go too deep into things that are more experimental, because those take time and I don't want to get people too excited, since they might take years to be ready. But the space is ripe, that's pretty much what I have to say, and there will be a lot of new things coming up, so developers should be extremely excited about what's coming.
Tobias Macey
0:31:10
And for anybody who wants to follow along with you or get in touch, I'll have you add your preferred contact information to the show notes. And so, with that, I'll move into the picks. This week, I'm going to choose a book series that I read a while ago, and that I'm probably going to be revisiting soon, called the Redwall series by Brian Jacques. It focuses on a bunch of woodland animal characters, and it's a very elaborate and very detailed world and series that he built up, with a lot of complex history. So it's definitely worth checking out if you're looking for a new book or set of books to read. They all stand alone nicely; you don't have to read them in any particular order, but all together they give you a much broader view of his vision for that world. So I definitely recommend that. And with that, I'll pass it to you, Boris. Do you have any picks this week?
Boris Paskalev
0:31:59
Yes, a pick this week: in general, the AI space has been going great. I mean, everybody knows there's no real AI, it's mostly machine learning, but there are a couple of new areas coming in that space, and that's very exciting. It's pretty much applying machine learning to everything, or to big data, so that's lovely. But in contrast to that, because we all do that every day and it's our passion here at DeepCode, my pick would be to do a little bit less of that, do some sports, and go outside.
Tobias Macey
0:32:26
That's always a good recommendation, and something that bears repeating. So thank you for taking the time today to join me and describe the work that you're doing with DeepCode. It's definitely an interesting platform, and I'll probably be taking a look at it myself. So thank you for all of your work on that, and I hope you enjoy the rest of your day.
Boris Paskalev
0:32:42
Thank you very much. You too.
Tobias Macey
0:32:45
Thank you for listening to the show. If you want to hear more and you don't want to wait until next week, then check out my other show, the Data Engineering Podcast, with deep dives on databases, data pipelines, and how to manage information in the modern technology landscape. Also, don't forget to leave a review on iTunes to make it easier for others to find this show.

Build Your Own Knowledge Graph With Zincbase - Episode 223

Summary

Computers are excellent at following detailed instructions, but they have no capacity for understanding the information that they work with. Knowledge graphs are a way to approximate that capability by building connections between elements of data that allow us to discover new connections among disparate information sources that were previously unknown. In our day-to-day work we encounter many instances of knowledge graphs, but building them has long been a difficult endeavor. In order to make this technology more accessible, Tom Grek built Zincbase. In this episode he explains his motivations for starting the project, how he uses it in his daily work, and how you can use it to create your own knowledge engine and begin discovering new insights of your own.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Podcast.init listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a trial.
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Tom Grek about knowledge graphs, when they’re useful, and his project Zincbase that makes them easier to build

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what a knowledge graph is and some of the ways that they are used?
    • How did you first get involved in the space of knowledge graphs?
  • You have built the Zincbase project for building and querying knowledge graphs. What was your motivation for creating this project and what are some of the other tools that are available to perform similar tasks?
  • Can you describe how Zincbase is implemented and some of the ways that it has evolved since you first began working on it?
    • What are some of the assumptions that you had at the outset of the project which have been challenged or updated in the process of working on and with it?
  • What are some of the common challenges when building or using knowledge graphs?
  • How has the domain of knowledge graphs changed in recent years as new approaches to entity resolution and data processing have been introduced?
  • Can you talk through a use case and workflow for using Zincbase to design and populate a knowledge graph?
  • What are some of the ways that you are using Zincbase in your own projects?
  • What have you found to be the most challenging/interesting/unexpected lessons that you have learned in the process of building and maintaining Zincbase?
  • What do you have planned for the future of the project?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Open Source Automated Machine Learning With MindsDB - Episode 218

Summary

Machine learning is growing in popularity and capability, but for a majority of people it is still a black box that we don’t fully understand. The team at MindsDB is working to change this state of affairs by creating an open source tool that is easy to use without a background in data science. By simplifying the training and use of neural networks, and making their logic explainable, they hope to bring AI capabilities to more people and organizations. In this interview George Hosu and Jorge Torres explain how MindsDB is built, how to use it for your own purposes, and how they view the current landscape of AI technologies. This is a great episode for anyone who is interested in experimenting with machine learning and artificial intelligence. Give it a listen and then try MindsDB for yourself.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Podcast.init listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a trial.
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing George Hosu and Jorge Torres about MindsDB, a framework for streamlining the use of neural networks

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what MindsDB is and the problem that it is trying to solve?
    • What was the motivation for creating the project?
  • Who is the target audience for MindsDB?
  • Before we go deep into MindsDB can you explain what a neural network is for anyone who isn’t familiar with the term?
  • For someone who is using MindsDB can you talk through their workflow?
    • What are the types of data that are supported for building predictions using MindsDB?
    • How much cleaning and preparation of the data is necessary before using it to generate a model?
    • What are the lower and upper bounds for volume and variety of data that can be used to build an effective model in MindsDB?
  • One of the interesting and useful features of MindsDB is the built in support for explaining the decisions reached by a model. How do you approach that challenge and what are the most difficult aspects?
  • Once a model is generated, what is the output format and can it be used separately from MindsDB for embedding the prediction capabilities into other scripts or services?
  • How is MindsDB implemented and how has the design changed since you first began working on it?
    • What are some of the assumptions that you made going into this project which have had to be modified or updated as it gained users and features?
  • What are the limitations of MindsDB and what are the cases where it is necessary to pass a task on to a data scientist?
  • In your experience, what are the common barriers for individuals and organizations adopting machine learning as a tool for addressing their needs?
  • What have been the most challenging, complex, or unexpected aspects of designing and building MindsDB?
  • What do you have planned for the future of MindsDB?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Algorithmic Trading In Python Using Open Tools And Open Data - Episode 216

Summary

Algorithmic trading is a field that has grown in recent years due to the availability of cheap computing and platforms that grant access to historical financial data. QuantConnect is a business that has focused on community engagement and open data access to grant opportunities for learning and growth to their users. In this episode CEO Jared Broad and senior engineer Alex Catarino explain how they have built an open source engine for testing and running algorithmic trading strategies in multiple languages, the challenges of collecting and serving current and historical financial data, and how they provide training and opportunity to their community members. If you are curious about the financial industry and want to try it out for yourself then be sure to listen to this episode and experiment with the QuantConnect platform for free.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Podcast.init listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a trial.
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • The Python Software Foundation is the lifeblood of the community, supporting all of us who want to run workshops and conferences, run development sprints or meetups, and ensuring that PyCon is a success every year. They have extended the deadline for their 2019 fundraiser until June 30th and they need help to make sure they reach their goal. Go to pythonpodcast.com/psf today to make a donation. If you’re listening to this after June 30th of 2019 then consider making a donation anyway!
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Jared Broad and Alex Catarino about QuantConnect, a platform for building and testing algorithmic trading strategies on open data and cloud resources

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what QuantConnect is and how the business got started?
  • What is your mission for the company?
  • I know that there are a few other entrants in this market. Can you briefly outline how you compare to the other platforms and maybe characterize the state of the industry?
  • What are the main ways that you and your customers use Python?
  • For someone who is new to the space can you talk through what is involved in writing and testing a trading algorithm?
  • Can you talk through how QuantConnect itself is architected and some of the products and components that comprise your overall platform?
  • I noticed that your trading engine is open source. What was your motivation for making that freely available and how has it influenced your design and development of the project?
  • I know that the core product is built in C# and offers a bridge to Python. Can you talk through how that is implemented?
    • How do you address latency and performance when bridging those two runtimes given the time sensitivity of the problem domain?
  • What are the benefits of using Python for algorithmic trading and what are its shortcomings?
    • How useful and practical are machine learning techniques in this domain?
  • Can you also talk through what Alpha Streams is, including what makes it unique and how it benefits the users of your platform?
  • I appreciate the work that you are doing to foster a community around your platform. What are your strategies for building and supporting that interaction and how does it play into your product design?
  • What are the categories of users who tend to join and engage with your community?
  • What are some of the most interesting, innovative, or unexpected tactics that you have seen your users employ?
  • For someone who is interested in getting started on QuantConnect what is the onboarding process like?
    • What are some resources that you would recommend for someone who is interested in digging deeper into this domain?
  • What are the trends in quantitative finance and algorithmic trading that you find most exciting and most concerning?
  • What do you have planned for the future of QuantConnect?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Building A Privacy Preserving Voice Assistant - Episode 211

Summary

Being able to control a computer with your voice has rapidly moved from science fiction to science fact. Unfortunately, the majority of platforms that have been made available to consumers are controlled by large organizations with little incentive to respect users’ privacy. The team at Snips are building a platform that runs entirely offline and on-device so that your information is always in your control. In this episode Adrien Ball explains how the Snips architecture works, the challenges of building a speech recognition and natural language understanding toolchain that works on limited resources, and how they are tackling issues around usability for casual consumers. If you have been interested in taking advantage of personal voice assistants, but wary of using commercially available options, this is definitely worth a listen.
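
The NLU side of the stack is available as the snips-nlu Python library; a minimal sketch of training and querying it, assuming a training dataset in the documented Snips JSON format stored in a hypothetical dataset.json:

```python
import json

from snips_nlu import SnipsNLUEngine

# load a training dataset in the Snips JSON format (hypothetical file)
with open("dataset.json") as f:
    dataset = json.load(f)

# fit the engine on the intents and example utterances in the dataset
engine = SnipsNLUEngine()
engine.fit(dataset)

# parse a query into an intent and its slots, entirely on-device
parsing = engine.parse("Turn on the lights in the kitchen")
print(parsing["intent"], parsing["slots"])
```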

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Adrien Ball about Snips, a set of technologies to make voice-controlled systems that respect users’ privacy

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what Snips is and how it got started?
  • For someone who wants to use Snips can you talk through the onboarding process?
    • One of the interesting features of your platform is the option for automated training data generation. Can you explain how that works?
  • Can you describe the overall architecture of the Snips platform and how it has evolved since you first began working on it?
  • Two of the main components that can be used independently are the ASR (Automated Speech Recognition) and NLU (Natural Language Understanding) engines. Each of those has a number of competitors in the market, both open source and commercial. How would you describe your overall position in the market for each of those projects?
  • I know that one of the biggest challenges in conversational interfaces is maintaining context for multi-step interactions. How is that handled in Snips?
  • For the NLU engine, you recently ported it from Python to Rust. What was your motivation for doing so and how would you characterize your experience between the two languages?
    • Are you continuing to maintain both implementations and if so how are you maintaining feature parity?
  • How do you approach the overall usability and user experience, particularly for non-technical end users?
    • How is discoverability handled (e.g. finding out what capabilities/skills are available)
  • One of the compelling aspects of Snips is the ability to deploy to a wide variety of devices, including offline support. Can you talk through that deployment process, both from a user perspective and how it is implemented under the covers?
    • What is involved in updating deployed models and keeping track of which versions are deployed to which devices?
  • What is involved in adding new capabilities or integrations to the Snips platform?
  • What are the limitations of running everything offline and on-device?
    • When is Snips the wrong choice?
  • In the process of building and maintaining the various components of Snips, what have been some of the most useful/interesting/unexpected lessons that you have learned?
    • What have been the most challenging aspects?
  • What are some of the most interesting/innovative/unexpected ways that you have seen the Snips technologies used?
  • What is in store for the future of Snips?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Version Control For Your Machine Learning Projects - Episode 206

Summary

Version control has become table stakes for any software team, but for machine learning projects there has been no good answer for tracking all of the data that goes into building and training models, and the output of the models themselves. To address that need Dmitry Petrov built the Data Version Control project known as DVC. In this episode he explains how it simplifies communication between data scientists, reduces duplicated effort, and eases concerns around reproducing and rebuilding models at different stages of the project’s lifecycle. If you work as part of a team that is building machine learning models or other data-intensive analysis then make sure to give this a listen and then start using DVC today.
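
As a taste of the workflow, large artifacts are tracked with the dvc command line alongside Git, and they can be read back from Python through the dvc.api helper described in the DVC documentation; a minimal sketch, where the repository URL, file path, and tag are hypothetical placeholders:

```python
import dvc.api

# open a data file tracked by DVC at a specific Git revision;
# the repo URL, path, and tag below are hypothetical placeholders
with dvc.api.open(
    "data/train.csv",
    repo="https://github.com/example/project",
    rev="v1.0",
) as f:
    header = f.readline()
```

Because the revision is just a Git reference, checking out an old tag reproduces the exact data and model files that shipped with that version of the code.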

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Bots and automation are taking over whole categories of online interaction. Discover.bot is an online community designed to serve as a platform-agnostic digital space for bot developers and enthusiasts of all skill levels to learn from one another, share their stories, and move the conversation forward together. They regularly publish guides and resources to help you learn about topics such as bot development, using them for business, and the latest in chatbot news. For newcomers to the space they have the Beginners Guide To Bots that will teach you the basics of how bots work, what they can do, and where they are developed and published. To help you choose the right framework and avoid the confusion about which NLU features and platform APIs you will need they have compiled a list of the major options and how they compare. Go to pythonpodcast.com/discoverbot today to get started and thank them for their support of the show.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Dmitry Petrov about DVC, an open source version control system for machine learning projects

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what DVC is and how it got started?
  • How do the needs of machine learning projects differ from other software applications in terms of version control?
  • Can you walk through the workflow of a project that uses DVC?
    • What are some of the main ways that it differs from your experience building machine learning projects without DVC?
  • In addition to the data that is used for training, the code that generates the model, and the end result there are other aspects such as the feature definitions and hyperparameters that are used. Can you discuss how those factor into the final model and any facilities in DVC to track the values used?
  • In addition to version control for software applications, there are a number of other pieces of tooling that are useful for building and maintaining healthy projects such as linting and unit tests. What are some of the adjacent concerns that should be considered when building machine learning projects?
  • What types of metrics do you track in DVC and how are they collected?
    • Are there specific problem domains or model types that require tracking different metric formats?
  • In the documentation it mentions that the data files live outside of git and can be managed in external storage systems. I’m wondering if there are any plans to integrate with systems such as Quilt or Pachyderm that provide versioning of data natively and what would be involved in adding that support?
  • What was your motivation for implementing this system in Python?
    • If you were to start over today what would you do differently?
  • Being a venture backed startup that is producing open source products, what is the value equation that makes it worthwhile for your investors?
  • What have been some of the most interesting, challenging, or unexpected aspects of building DVC?
  • What do you have planned for the future of DVC?

Keep In Touch

Picks

  • Tobias
  • Dmitry
    • Go outside and get some fresh air 🙂

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

The Past, Present, and Future of Deep Learning In PyTorch - Episode 202

Summary

The current buzz in data science and big data is around the promise of deep learning, especially when working with unstructured data. One of the most popular frameworks for building deep learning applications is PyTorch, in large part because of its focus on ease of use. In this episode Adam Paszke explains how he started the project, how it compares to other frameworks in the space such as Tensorflow and CNTK, and how it has evolved to support deploying models into production and on mobile devices.
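
The define-by-run execution model that distinguishes PyTorch from graph-first frameworks is easiest to see in code; a minimal sketch of eager execution and autograd:

```python
import torch
import torch.nn as nn

# a small feed-forward network, defined eagerly like ordinary Python
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

x = torch.randn(16, 4)         # a batch of 16 random inputs
loss = model(x).pow(2).mean()  # any Python expression extends the graph

loss.backward()                # autograd replays the recorded operations
print(model[0].weight.grad.shape)  # gradients land on the parameters
```

Because the graph is rebuilt on every forward pass, ordinary Python control flow (loops, conditionals, debuggers) works inside the model, which is a large part of the ease of use discussed in the episode.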

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Check out the Practical AI podcast from our friends at Changelog Media to learn and stay up to date with what’s happening in AI
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with O’Reilly Media for the Strata conference in San Francisco on March 25th and the Artificial Intelligence conference in NYC on April 15th. Here in Boston, starting on May 17th, you still have time to grab a ticket to Enterprise Data World, and from April 30th to May 3rd is the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Your host as usual is Tobias Macey and today I’m interviewing Adam Paszke about PyTorch, an open source deep learning platform that provides a seamless path from research prototyping to production deployment

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what deep learning is and how it relates to machine learning and artificial intelligence?
  • Can you explain what PyTorch is and your motivation for creating it?
    • Why was it important for PyTorch to be open source?
  • There is currently a large and growing ecosystem of deep learning tools built for Python. Can you describe the current landscape and how PyTorch fits in relation to projects such as Tensorflow and CNTK?
    • What are some of the ways that PyTorch is different from Tensorflow and CNTK, and what are the areas where these frameworks are converging?
  • How much knowledge of machine learning, artificial intelligence, or neural network topologies are necessary to make use of PyTorch?
    • What are some of the foundational topics that are most useful to know when getting started with PyTorch?
  • Can you describe how PyTorch is architected/implemented and how it has evolved since you first began working on it?
    • You recently reached the 1.0 milestone. Can you talk about the journey to that point and the goals that you set for the release?
  • What are some of the other components of the Python ecosystem that are most commonly incorporated into projects based on PyTorch?
  • What are some of the most novel, interesting, or unexpected uses of PyTorch that you have seen?
  • What are some cases where PyTorch is the wrong choice for a problem?
  • What is the process for incorporating these new techniques and discoveries into the PyTorch framework?
    • What are the areas of active research that you are most excited about?
  • What are some of the most interesting/useful/unexpected/challenging lessons that you have learned in the process of building and maintaining PyTorch?
  • What do you have planned for the future of PyTorch?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Polyglot: Multi-Lingual Natural Language Processing with Rami Al-Rfou - Episode 190

Summary

Using computers to analyze text can produce useful and inspirational insights. However, when working with multiple languages the capabilities of existing models are severely limited. In order to help overcome this limitation Rami Al-Rfou built Polyglot. In this episode he explains his motivation for creating a natural language processing library with support for a vast array of languages, how it works, and how you can start using it for your own projects. He also discusses current research on multi-lingual text analytics, how he plans to improve Polyglot in the future, and how it fits in the Python ecosystem.
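
A minimal sketch of the pipeline API from the Polyglot documentation; note that the per-language models have to be downloaded separately (for example with the polyglot download command) before these calls will work:

```python
from polyglot.text import Text

# Polyglot detects the language before running the rest of the pipeline
text = Text("Bonjour, le monde est beau.")

print(text.language.code)  # -> 'fr'
print(text.words)          # tokenization for the detected language
print(text.entities)       # named entities, where a model is available
```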

Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. Podcast.__init__ listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a trial.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Rami Al-Rfou about Polyglot, a natural language pipeline with support for an impressive number of languages

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by describing what Polyglot is and your reasons for starting the project?
  • What are the types of use cases that Polyglot enables which would be impractical with something such as NLTK or SpaCy?
  • A majority of NLP libraries have a limited set of languages that they support. What is involved in adding support for a given language to a natural language tool?
    • What is involved in adding a new language to Polyglot?
    • Which families of languages are the most challenging to support?
  • What types of operations are supported and how consistently are they supported across languages?
  • How is Polyglot implemented?
  • Is there any capacity for integrating Polyglot with other tools such as SpaCy or Gensim?
  • How much domain knowledge is required to be able to effectively use Polyglot within an application?
  • What are some of the most interesting or unique uses of Polyglot that you have seen?
  • What have been some of the most complex or challenging aspects of building Polyglot?
  • What do you have planned for the future of Polyglot?
  • What are some areas of NLP research that you are excited for?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Of Checklists, Ethics, and Data with Emily Miller and Peter Bull - Episode 184

Summary

As data science becomes more widespread and has a bigger impact on the lives of people, it is important that those projects and products are built with a conscious consideration of ethics. Keeping ethical principles in mind throughout the lifecycle of a data project helps to reduce the overall effort required to prevent negative outcomes from the use of the final product. Emily Miller and Peter Bull of Driven Data have created Deon to improve the communication and conversation around ethics among and between data teams. It is a Python project that generates a checklist of common concerns for data-oriented projects at the various stages of the lifecycle where they should be considered. In this episode they discuss their motivation for creating the project, the challenges and benefits of maintaining such a checklist, and how you can start using it today.
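
Deon itself is a small command line tool; a minimal sketch of driving it from Python, where the output filename is a hypothetical choice and the -o flag follows the project’s documented usage (running deon directly in a shell is equivalent):

```python
import subprocess

# write the default ethics checklist to a Markdown file in the current
# project; "ETHICS.md" is a hypothetical filename choice
subprocess.run(["deon", "-o", "ETHICS.md"], check=True)
```

The generated checklist can then be committed alongside the code and reviewed at each stage of the project.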

Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to scale up. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at podcastinit.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Emily Miller and Peter Bull about Deon, an ethics checklist for data projects

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by describing what Deon is and your motivation for creating it?
  • Why a checklist, specifically? What’s the advantage of this over an oath, for example?
  • What is unique to data science in terms of the ethical concerns, as compared to traditional software engineering?
  • What is the typical workflow for a team that is using Deon in their projects?
  • Deon ships with a default checklist but allows for customization. What are some common addendums that you have seen?
    • Have you received pushback on any of the default items?
  • How does Deon simplify communication around ethics across team boundaries?
  • What are some of the most often overlooked items?
  • What are some of the most difficult ethical concerns to comply with for a typical data science project?
  • How has Deon helped you at Driven Data?
  • What are the customer facing impacts of embedding a discussion of ethics in the product development process?
  • Some of the items on the default checklist coincide with regulatory requirements. Are there any cases where regulation is in conflict with an ethical concern that you would like to see practiced?
  • What are your hopes for the future of the Deon project?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Understanding Machine Learning Through Visualizations with Benjamin Bengfort and Rebecca Bilbro - Episode 166

Summary

Machine learning models are often inscrutable and it can be difficult to know whether you are making progress. To improve feedback and speed up iteration cycles, Benjamin Bengfort and Rebecca Bilbro built Yellowbrick to easily generate visualizations of model performance. In this episode they explain how to use Yellowbrick in the process of building a machine learning project, how it aids in understanding how different parameters impact the outcome, and the improved understanding among teammates that it creates. They also explain how it integrates with the scikit-learn API, the difficulty of producing effective visualizations, and future plans for improvement and new features.
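
A minimal sketch of the visualizer pattern that Yellowbrick layers on top of the scikit-learn API; recent releases render with show() (earlier versions used poof()):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import ConfusionMatrix

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# a visualizer wraps an estimator and follows the fit/score conventions
viz = ConfusionMatrix(LogisticRegression(max_iter=1000))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)  # scoring populates the confusion matrix
viz.show()                 # render the figure
```

Because the visualizer exposes the same interface as the estimator it wraps, it can drop into an existing scikit-learn workflow without restructuring the code.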

Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to scale up. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute.
  • To get worry-free releases download GoCD, the open source continuous delivery server built by Thoughtworks. You can use their pipeline modeling and value stream map to build, control and monitor every step from commit to deployment in one place. And with their new Kubernetes integration it’s even easier to deploy and scale your build agents. Go to podcastinit.com/gocd to learn more about their professional support services and enterprise add-ons.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Your host as usual is Tobias Macey and today I’m interviewing Rebecca Bilbro and Benjamin Bengfort about Yellowbrick, a scikit-learn extension that uses visualizations to assist with model selection in your data science projects.

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you describe the use case for Yellowbrick and how the project got started?
  • What is involved in visualizing scikit-learn models?
    • What kinds of information do the visualizations convey?
    • How do they aid in understanding what is happening in the models?
  • How much direction does Yellowbrick provide in terms of knowing which visualizations will be helpful in various circumstances?
  • What does the workflow look like for someone using Yellowbrick while iterating on a data science project?
  • What are some of the common points of confusion that your students encounter when learning data science and how has Yellowbrick assisted in achieving understanding?
  • How is Yellowbrick implemented and how has the design changed over the lifetime of the project?
  • What would be required to integrate with other visualization libraries and what benefits (if any) might that provide?
    • What about other ML frameworks?
  • What are some of the most challenging or unexpected aspects of building and maintaining Yellowbrick?
  • What are the limitations or edge cases for Yellowbrick?
  • What do you have planned for the future of Yellowbrick?
  • Beyond visualization, what are some of the other areas that you would like to see innovation in how data science is taught and/or conducted to make it more accessible?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA