
Cultivating The Python Community In Argentina - Episode 229

Summary

The Python community in Argentina is large and active, thanks largely to the motivated individuals who manage and organize it. In this episode Facundo Batista explains how he helped to found the Python user group for Argentina and the work that he does to make it accessible and welcoming. He discusses the challenges of encompassing such a large and distributed group, the types of events, resources, and projects that they build, and his own efforts to make information free and available. He is an impressive individual with a substantial list of accomplishments, and he exhibits the best of what the global Python community has to offer.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host as usual is Tobias Macey and today I’m interviewing Facundo Batista about his experiences founding and fostering the Argentinian Python community, working as a core developer, and his career in Python

Interview

  • Introductions
  • How did you get introduced to Python?
  • What was your motivation for organizing a Python user group in Argentina?
  • How does the geography and culture of Argentina influence the focus of the community?
  • Argentina is a fairly large country. What is the reasoning for having the user group encompass the whole nation and how is it organized to provide access to everyone?
  • What are some notable projects that have been built by or for members of PyAr?
    • What are some of the challenges that you faced while building CDPedia and what aspects of it are you most proud of?
  • How did you get started as a core developer?
    • What areas of the language and runtime have you been most involved with?
  • As a core developer, what are some of the most interesting/unexpected/challenging lessons that you have learned?
  • What other languages do you currently use and what is it about Python that has motivated you to spend so much of your attention on it?
  • What are some of the shortcomings in Python that you would like to see addressed in the future?
  • Outside of CPython, what are some of the projects that you are most proud of?
  • How has your involvement with core development and PyAr influenced your life and career?

Keep In Touch

Picks

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Raw transcript:
Tobias Macey
0:00:12
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you need somewhere to deploy it. So take a look at our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models and running your continuous integration, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of this show. And you listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O'Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to pythonpodcast.com/conferences today to learn more about these and other events and take advantage of our partner discounts when you register. Your host, as usual, is Tobias Macey. And today I'm interviewing Facundo Batista about his experiences founding and fostering the Argentinian Python community, working as a core developer, and his overall career in Python. So, Facundo, could you start by introducing yourself?
Facundo Batista
0:01:47
Hello. Thanks for having me. Yes, I'm Facundo, I'm an electronic engineer. I started programming for fun when I was a kid, in a lot of different languages, until, when working as an engineer, I found Python and fell in love with it, like 20 years ago, or 18 years ago?
Tobias Macey
0:02:18
And do you remember how you first got introduced to Python?
Facundo Batista
0:02:21
I used to work in a telecommunications company where we had to process a lot of information server side. At that point the language that I was most comfortable with was C, which I worked with a lot in the university, but as you may know, processing text server side with C is not fun at all. So I started to find out what I could do. I found Perl and I did some developments with Perl, but I was protesting all day because of Perl's syntax and everything. So a work companion told me, have you heard about Python? No, no, I hadn't. You should read this tutorial. So he gave me the official tutorial for Python, I think it was 2.2 or 2.1 at that time. And when I got through the tutorial, my first impression was, this looks nice, but it's, like, too simple. I didn't know if it would be powerful enough for the processing I wanted to do. So my first test with it was doing a recursive analysis of the networks to try to find potential loops, or something similar, that was kind of complex and a lot of processing. And Python worked just fine. So I said, oh, I really like this language.
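As a rough illustration of the kind of check Facundo describes, not his original code, a standard depth-first cycle search over an adjacency-list network might look like this in Python:

    # Illustrative only: a depth-first search that reports whether a network,
    # given as an adjacency list, contains a loop (a cycle).
    def has_cycle(graph):
        """graph: dict mapping node -> list of neighbour nodes."""
        visited, on_stack = set(), set()

        def visit(node):
            visited.add(node)
            on_stack.add(node)
            for neighbour in graph.get(node, []):
                if neighbour in on_stack:  # back edge: we found a loop
                    return True
                if neighbour not in visited and visit(neighbour):
                    return True
            on_stack.remove(node)
            return False

        return any(visit(node) for node in graph if node not in visited)

    print(has_cycle({"a": ["b"], "b": ["c"], "c": ["a"]}))  # True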
Tobias Macey
0:04:00
So after you discovered Python and started using it, you ended up helping to found the Python Argentina user group. And I'm wondering what your overall motivation was for getting involved with that, and some of the story behind your founding of the group.
Facundo Batista
0:04:16
The moment I started to work in Python, I started doing a lot of things with Python, and a couple of work companions also used Python with me, but nobody else knew about Python. None of my friends knew Python at that time. So I said, I cannot be the only person in Argentina who does Python. I mean, I knew the international community and everything, but there should be something in Argentina. So we did the normal meetup thing; we had a meeting, three people in that original meeting, and the three of us were working in Python, but at the same time we knew that somebody else should be working in Python. So we decided at some point to start a mailing list about it, and probably a web page. And that was the origin of it. I mean, it was the need to talk with somebody else that also used that technology. That was it.
Tobias Macey
0:05:23
Right. And I've actually heard a number of references to people coming from Argentina who are involved in Python, in both the local community there as well as the international community. And I'm curious how large the Python Argentina user group has gotten to be over the years.
Facundo Batista
0:05:38
It's difficult to measure, because we don't have a formal process for you to join the community, so it depends on which numbers you take. For example, we have a mailing list, and on the mailing list there are around two thousand three hundred people. But we know that a lot of young people are not on the mailing list, because they tend to not use mail. We created a Telegram group for Python Argentina a couple of years ago, and it's already more than 1,000 people. So it's difficult to know, because we don't know how much of one group is in the other, and at the last PyCon Argentina there were more than 1,000 people attending. So it's a large group.
Tobias Macey
0:06:36
And Argentina itself is a fairly large country, and the group that you have put together services the entirety of the nation. And I'm wondering how the overall geography and culture of your country influences the focus of the community, and any of the challenges that you face in terms of trying to facilitate interaction for such a widely distributed group of people.
Facundo Batista
0:07:01
It's a problem, because it's not only that our country is large. As I tend to say to people visiting the country, if you want to go to the south, you have to travel 2,000 kilometers, and if you want to go to the north, you have to travel another 2,000 kilometers. It's a large country. But the problem is deeper than that. Argentina is very centralist, I don't know if that's an English word; everything tends to happen in Buenos Aires, with the exception of a couple of other big cities like Córdoba, Rosario, or Mendoza. Most of the technology happens in Buenos Aires. So when we wanted to found the Python group, at the same time we didn't want to found it just for Buenos Aires, because we knew that we would be excluding a lot of people. So from the very beginning, when we decided that we would be addressing the whole of Argentina, we decided to call it Python Argentina, and at the same time we started purely virtually, actually, at the beginning. So that part was easy, because the mailing list you can join from anywhere. But the meetings, of course, were local, so there were a lot of meetings in Buenos Aires. When new people from other provinces or cities started to join, we started to encourage them: do meetings in your cities, talk with people locally. We all work in Python, but we have different problems, or even with the same problems, for example the quantity of companies working with Python, or job offers, etc., maybe with the same problem the solutions are different. So let's have this group address the whole of Argentina, but let's not be Buenos Aires-centric, and try to make it as federal as possible.
Tobias Macey
0:09:29
And from your experience overall of being a technologist and living and working in Buenos Aires, and interacting with people in the broader community that you work with, what has been your sense as far as the level of popularity of Python as compared to other languages or technologies that are being used in Argentina?
Facundo Batista
0:09:49
I think that in that regard it's no different from other countries or areas. We have a lot of people working in other languages, like commercial languages with a good base in universities, like Java, or PHP, or C++. And at the same time, we have a lot of languages that are not widely used but have a good community here, especially in universities, like Lisp or Haskell. But again, similar to what happens in a lot of other places, Python had steady growth, but not really growing a lot until seven or ten years ago, when at some point a lot of people started to use Python, and like five years ago or something it literally exploded with the quantity of people trying to learn Python from the science world. So I don't have particularly specific data for Argentina versus other countries, but from what I've heard, and in my experience, it's similar to what happened in the US or Europe.
Tobias Macey
0:11:22
And what are some of the ways that you facilitate the growth and interaction of the community, and some of the types of resources and events that you help to provide?
Facundo Batista
0:11:33
Our focus is pretty much on the community. I mean, we are Python Argentina, we are a group of people into Python. So our focus is to make people talk together and get together around Python, from the mailing list or the Telegram groups, where we provide assistance so anybody can learn Python or find answers for their problems around Python, to meetings, and we have several kinds of meetings and events; the idea is always to make people get together around the language. One of the basic rules that we have for events in Python Argentina is that we want the events to be free. We don't want to charge you for you to be able to talk about Python with somebody else. So PyCon Argentina, for example, was always free, which is kind of unusual compared to what happens in the rest of the world.
Tobias Macey
0:12:49
Yeah, it's definitely much different than typical technology conferences that I've had experience with. And I know that in general conference organization and management can be both time consuming and expensive. So I'm wondering how you've approached that in order to be able to provide it as a free resource for people?
Facundo Batista
0:13:07
Well, we have sponsors. I mean, companies sponsor the events, so we get that money and pay for the expenses. We are somehow limited, in the sense that, for example, we don't provide lunches or T-shirts for everybody, or this kind of generic stuff that you get when you go to a paid event. Because, I mean, you're not paying for anything, so we cannot give you lunch. But our focus is for you to be able to access the information; the information should be free, whether you have money or not. That's the focus.
Tobias Macey
0:13:55
And in addition to PyCon Argentina, you have also been working on this PyCamp event. And I'm wondering if you can describe a bit about what that is, and how that got started?
Facundo Batista
0:14:07
Well, PyCamp for me is one of the events that I like the most every year in Argentina. It's a small event. I mean, it's not for a wide attendance; we get together every year, like 40 or 50 people, in a place that provides the basics for us to survive, like electricity, internet, bathrooms, food, and that kind of stuff. And we spend four days coding and hacking and playing board games and doing fun activities like learning how to fight with swords, and that kind of stuff. It's a very nice event, where you just go to do Python for four days. It's very nice, very nice. We have a lot of good pictures of that which I share a lot. You should reproduce this in other countries for people to have fun.
Tobias Macey
0:15:10
Yeah, that definitely sounds like a lot of fun. And I'm curious if the sword fighting expertise came from within the group, or if that's something that you brought somebody in from the outside for?
Facundo Batista
0:15:19
No, there is somebody in the group that is a specialist in that, so every PyCamp he gives some sort of little teaching session. But we also normally do sports, like playing football or basketball. Or, for example, at the last PyCamp we had a talk from a specialist about astronomy. We were in the mountains in a really dark place, so he talked about stars to us for an hour, and it was very, very good.
Tobias Macey
0:15:58
Yeah, definitely send links to pictures of that for anybody who wants to take a look. And I'll definitely advocate for anybody else to replicate that, because it sounds like a good time and something that would be worthwhile to help grow some community engagement, and just be an excuse to get out and do something different.
Facundo Batista
0:16:15
Yes.
Tobias Macey
0:16:17
So in terms of the overall community, I'm wondering what have been some of the main points of focus in terms of just general themes of events and talks, and some of the notable projects that have been built by or for members of Python Argentina?
Facundo Batista
0:16:33
Yes, well, the focus is mostly the people, like getting everybody together to talk about Python, but with some specifics, like information should be free to anybody, as I said before, but also diversity. We have been heavily focused on diversity since, I don't know, 10 years ago, similar to what the PSF was doing also 10 years ago, before diversity was really on the agenda for everybody. We were like pioneers, with the PSF, around that. So it's mostly the people. But sometimes, as a group, we want to attack some different projects. For example, one of the longest running ones that we have, and the one I am most proud of, is the CDPedia. CDPedia is a project where we package the whole Wikipedia on a CD. I mean, originally it was a CD, then we started the DVD version, and then we also started a pen drive version. But the idea is always the same: you go with a CD or a DVD or a pen drive to a computer with no internet at all, and you have the whole Wikipedia content. Of course, we are addressing the Spanish part of Wikipedia, even if we have the idea to make it multi-language at some point. But the idea is for you to go with a CD or DVD, for example, to a school that is distant from any city, where you have computers but you don't have internet, which is quite common in Argentina because we have so many rural areas. So the idea is that you have a computer, you don't have internet, but having the CDPedia you can get all the information from Wikipedia. It's a very, very good project; it's been there for like 13 years or something.
Tobias Macey
0:18:56
That definitely is great, to be able to provide that information access. And I'm curious what are some of the challenges and strategies that you're faced with to make it possible to have all of that information available offline and internally linked, so that it doesn't require any outbound network access, and any potential applications that could be made from that project to things like maybe packaging up sections of the Internet Archive for similar purposes.
Facundo Batista
0:19:29
I think that it's very difficult to make it generic, because the processing of the Wikipedia pages is so specific to Wikipedia pages, because of the need to compress them to the maximum. So it's very difficult to make it generic. The challenges around the project are mostly about the compression of pages and images on one side, but the index is also very difficult to achieve. Remember that the original aim was to make a CD, and CDs are slow. So the moment you want to search for something and open that specific page, you cannot really be reading 100 megabytes to uncompress something in memory; you should have access in small blocks.
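As an illustration of the small-block idea Facundo describes, here is a minimal sketch; the index layout, file names, and compression choice are assumptions for illustration, not CDPedia's actual on-disk format:

    # Articles are stored in many small, independently compressed blocks, so
    # serving one page only requires reading and decompressing one small block
    # instead of the whole archive.
    import json
    import lzma

    def load_index(path="index.json"):
        # hypothetical index: article title -> (block file, offset, length)
        with open(path, encoding="utf-8") as f:
            return json.load(f)

    def read_article(title, index, blocks_dir="blocks"):
        block_name, offset, length = index[title]
        with open(f"{blocks_dir}/{block_name}", "rb") as f:
            compressed = f.read()           # one small block, e.g. a few hundred KB
        data = lzma.decompress(compressed)  # cheap: only this block is decompressed
        return data[offset:offset + length].decode("utf-8")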
0:20:38
The other big challenge is how do you
0:20:45
determine which pages you will include, and which images from those pages you will include, in the CDPedia. Because if you make it fit on a CD, you have 600 megabytes, but if you aim for a DVD, you have like almost five gigabytes. And at the same time we have a version with all images and all pages that is meant for pen drives, which was around 13 gigabytes the last time we compiled it. The process of selecting which pages is quite difficult. But that's only the technical challenge of a project like this, because you also have the associated challenges. The moment you have a CD, or the moment you have a DVD, with the whole Wikipedia, how do you distribute it? Because if you have the problem that you don't have good internet in the school, how do you download it in the first place? We had success regarding that: Jimmy Wales, who is the founder of Wikipedia, as a gift to... no, sorry, it was the other way around. A person that has a company working with the education ministry in Argentina made a gift to Jimmy Wales: the possibility to distribute the CDPedia in all the country. So we had the disc in all schools in Argentina, I think around 2011 or something, which was a very good thing.
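The page-selection problem Facundo describes, fitting the most valuable pages into a fixed byte budget, can be sketched roughly like this; the greedy score-per-byte rule is an illustrative assumption, not CDPedia's actual ranking algorithm:

    CD_BUDGET = 600 * 1024 * 1024  # bytes on a CD; a DVD is roughly 4.7 GB

    def select_pages(pages, budget=CD_BUDGET):
        """pages: iterable of (title, size_in_bytes, relevance_score)."""
        chosen, used = [], 0
        # greedily prefer pages that give the most "value" per byte
        for title, size, score in sorted(pages, key=lambda p: p[2] / p[1], reverse=True):
            if used + size <= budget:
                chosen.append(title)
                used += size
        return chosen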
Tobias Macey
0:22:53
Another aspect of the project, too, is that because Wikipedia is a continually evolving body of information, there's the issue of staleness, where some pages, for instance, are going to be unmodified because they're historical records that don't necessarily have a lot of flux, but for any sort of scientific information that might have been updated since the last time the information was compiled, there's the challenge of being able to redistribute those updates. And I'm curious if you have any thoughts on that problem, or any ways of maybe sending incremental updates for people who already have an existing copy, or, because of the fact that it's entirely self referential, if that's even viable.
Facundo Batista
0:23:35
We analyzed that a couple of times. It was very difficult to produce incremental updates, because at some point we looked at some stats, and there were so many changes, and as a lot of pages are referenced by a lot of other pages, I think the number was around 65% of the pages that needed to be modified. So at some point you just get a new snapshot and deliver this new snapshot; incrementals weren't worth it. Yes, the problem of pages going stale is a problem of all snapshots. The moment you get a snapshot, you are stuck with that. But there is a similar challenge around that: what can you do to prevent, or mostly avoid, people doing bad things to the Wikipedia pages and you distributing that as the truth? I prefer to have this page about, I don't know, some scientific thing that is two months old but true, than one that is up to date but is a lie, or is vandalism about something. So we have a lot of algorithms about, when we decide to include a page in the snapshot, which version of that page we choose; in a lot of situations we don't choose the latest version. So it's complicated.
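One simple way to express that kind of "don't blindly ship the newest edit" rule is sketched below; the specific heuristic (prefer the newest revision that is old enough and was never reverted) is an illustrative assumption, not CDPedia's actual selection algorithm:

    from datetime import datetime, timedelta

    def pick_revision(revisions, min_age_days=30):
        """revisions: newest-first list of dicts with 'timestamp' (datetime)
        and 'reverted' (bool). Prefer the newest revision that has had time to
        settle and was never reverted; otherwise fall back to the oldest one."""
        cutoff = datetime.utcnow() - timedelta(days=min_age_days)
        for rev in revisions:
            if rev["timestamp"] <= cutoff and not rev["reverted"]:
                return rev
        return revisions[-1]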
Tobias Macey
0:25:29
Yeah, it's definitely a complex challenge. And as you said, it's not just the technical aspects, it's also the social aspects of it. And because of the fact that a lot of the people who are using it don't have internet access, it's not necessarily viable to just ship those increments over the internet; you would have to have another physical medium of sending them along, and then have a way of merging the information on a hard drive or something like that. So best of luck in that overall effort. And then beyond your involvement in Python Argentina, and working on projects such as CDPedia, you have also been working as a core developer for CPython. And I'm wondering how you got started on that path, and what specific areas of the language and runtime you've been most involved with and most focused on.
Facundo Batista
0:26:13
I had this problem around 2002, where I started a personal project for managing my own money, my own finances. And quickly I found out that float, as it was at the time, was not a good fit for handling money. So, trying to see how you can handle money in Python, I found out that there was this idea of creating this decimal data type, which is the best fit for handling money, but that was not really there yet. So on the mailing list somebody suggested that this decimal data type was what I needed, and I decided to make it happen. There was code around there, and there is this spec from IBM which specifies exactly how the decimal data type should work. So I started to work on this decimal module. I received a lot of help from people that knew a lot about numbers, and I got in touch with people like Tim Peters, or Eric Snow, or... well, there were a lot of people involved. But my main success there was to start and finish a very complicated PEP, which was the decimal module PEP, and then implementing the module. At that point I became a core developer, because I was committing a lot of code, committing a lot of tests, basically working on the decimal module. Beyond the decimal module itself, I like to participate in Python bug days a lot, and I started to create small events in Argentina for people to grab bugs of CPython and work on them. And I normally tend to work on stuff like that in Python sprints and everything. But even as a core developer, I really don't spend a lot of time with the source code. For the last 15 years or something, or mostly the last 10 years, I was heavily focused on the community part of Python and not so much on the code. I tend to commit every, I don't know, every several months, because of helping somebody with patches or bugs or whatever, but it's mostly an effort of creating a community of people helping with the code than helping with the code myself. For example, I participated several times in Google Summer of Code for people who wanted to write code in Python and that kind of stuff.
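The PEP Facundo is referring to is PEP 327, which added the decimal module to the standard library. A quick illustration of the float problem he hit and what Decimal provides instead:

    # Binary floats accumulate representation error that is unacceptable for
    # money; decimal.Decimal does exact decimal arithmetic.
    from decimal import Decimal

    print(0.1 + 0.2)                          # 0.30000000000000004
    print(0.1 + 0.2 == 0.3)                   # False

    print(Decimal("0.10") + Decimal("0.20"))  # 0.30
    print(Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))  # True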
Tobias Macey
0:29:45
And I'm curious what it is about the Python language and community that has caused you to spend so much of your time and attention on it, as opposed to other endeavors that you might spend your time on, or other languages that you might be using professionally or personally.
Facundo Batista
0:30:01
On one side, the Python language itself is something very nice and fun to work with. It's something which works well enough in most of the contexts where I use the language, or, in my particular case, in all the contexts where I use a language. So I really don't have the need to use another language. When I do projects in my free time, for fun or for learning technologies, I do them in Python because I like it. And then I ended up working as a Python software engineer in a couple of companies; I have been working at Canonical for more than 10 years now, doing Python. So I use Python everywhere. On the other hand, the community is one of the healthiest communities that I have found in the software world. There was always this good attitude of people around the language; the community was always very welcoming, always very respectful, and it's a good place to be, a place people are encouraged to be. It happened to me a lot, when getting people from other languages into Python in Argentina, that one of the things they said was: I really like this mailing list, because I can ask a silly question and nobody will hit me on the head with something. Or, specifically speaking about diversity, there are a lot of people who are not male, white, and in a good socio-economic position who are really happy with the community. And I think this is representative of the status of the Python community around the globe, which is very good. It's very good, but at the same time, I don't know if I would call it an anomaly, but it's not usual that communities are so well behaved.
Tobias Macey
0:32:38
Yeah, it's definitely remarkable the amount of effort that has been put in by members of the community globally to help foster that overall sense of welcome to new people of all skill levels. And just the fact that it has been able to be maintained and sustainable as the community has grown beyond its original roots is pretty remarkable. And I think the fact that there is an organization, in the form of the PSF, at the core of it to help drive a lot of those efforts and set standards for the community has helped to allow it to scale to the point that it has.
Facundo Batista
0:33:16
yes, yes.
0:33:19
For example, the PSF always made a focus on diversity. For example, every year at these PyCamps we do this: the PyCamp is the only event in Argentina that is not free, because, I mean, you have to pay for the hotel and everything, but we normally give money to people to be able to attend. And we put a focus on diversity there, with PSF sponsorship specifically for that, which makes the community more diverse. And at some point it becomes a positive circle: making the community more diverse will attract more diversity itself, and at some point we can stop all being the same in the community.
Tobias Macey
0:34:16
And as a user of the Python language and a committer to the runtime for such a long period, I'm sure that there are aspects of the language that you've run into that you would like to see improved or modified. And I'm wondering if there's anything notable that you would like to see addressed in the near to medium future?
Facundo Batista
0:34:35
In general I am very happy with the language. For example, other people say it's slow in some situations, but for me that's not really a problem. What I would really want to see improved in the mid-term is the startup time of the Python process, the time between when you type python3 in the terminal and when the script really starts executing. Improving that time, I think, would really help a lot of different areas where Python could be more widely spread. The problem is that you cannot really execute a small Python script in a millisecond. I'm exaggerating, but that's the idea.
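A quick way to see the startup cost Facundo is talking about, sketched here by timing a do-nothing interpreter run and then using the import-time breakdown that CPython has offered since 3.7 (exact numbers will vary by machine):

    import subprocess
    import sys
    import time

    # Time how long it takes CPython to start and exit without doing any work.
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", "pass"], check=True)
    print(f"bare startup: {time.perf_counter() - start:.3f}s")

    # -X importtime (Python 3.7+) prints a per-module import time breakdown to stderr.
    subprocess.run([sys.executable, "-X", "importtime", "-c", "import json"], check=True)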
Tobias Macey
0:35:42
And outside of your work on Python Argentina and the CPython runtime, and some of the other open source projects that you've mentioned, what are some of the other areas that you spend your time on, and projects that you're most proud of?
Facundo Batista
0:35:56
I really use a lot of the time of my life to make my kids happy, make them grow, and be with them, enjoy them while they are growing. They are still small, but time goes by so fast. I play tennis, I love playing tennis. And a lot of my free time I put into computers and software projects and the community. One of the projects where I spend a lot of time is one called fades, which is, let's say, an automatic virtualenv wrapper for your projects. It's more than a virtualenv wrapper in the sense that you really don't notice that you are using a virtualenv in your project or in your day to day. You only specify the dependencies, and the process, or your interactive interpreter, or whatever, executes inside a virtualenv. But you don't really need to know that the virtualenv is under there, or how to create it, or how to activate it, or anything, which makes it very, very good for people starting in Python, because they don't need to install dependencies or anything. If they use fades, they just specify the dependencies that they want, and the script will run in a virtualenv with only those dependencies, automatically.
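A hypothetical example of the workflow Facundo describes: fades documents marking dependencies with a comment on the import line and then running the script through fades, though the exact marker syntax and options should be checked against the current fades documentation.

    # Run with:  fades quotes.py
    # The "# fades" comment marks requests as a dependency; fades creates (or
    # reuses) a virtualenv containing it and runs the script inside that env.
    import requests  # fades

    resp = requests.get("https://api.github.com/repos/PyAr/fades")
    print(resp.json()["description"])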
Tobias Macey
0:37:43
And how would you characterize the overall influence that your involvement as a core developer, and with the Python Argentina group, has had on your life and career?
Facundo Batista
0:37:57
I don't know if being a core developer itself influenced a lot of what I do in Python Argentina. What really affected what I do in Python Argentina, in that sense, was being part of the Python Software Foundation, being part of the group of people involved in making the language better, and then translating a lot of those attitudes and good practices from the overall PSF to Argentina. Specifically regarding my career, well, I'm an electronic engineer. I started working at a telecommunications company 20 years ago, working as an engineer. But then, as I started being more and more involved with Python, I was head of developers in a company around 2006, then went back to work as an electronic engineer in another telecommunications company, but then jumped to Canonical, and I have been doing Python there for almost 11 years now. So it heavily influenced my career, because I really work as a developer, even if I didn't study that at the university.
Tobias Macey
0:39:29
Well, for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move into the picks. And this week I'm going to choose a book that I picked up from the library recently that's been a lot of fun, called The Dictionary of Difficult Words. And it's just a bunch of different words that you wouldn't typically use in everyday language that are interesting to say or hear, and they've got useful and complex definitions. So it's just great to explore language, it's fun and entertaining, and there are a lot of funny illustrations to accompany the words, so it's great to sit down and look at it with your kids. So I've been having fun with that. And with that, I'll pass it to you. Do you have any picks this week?
Facundo Batista
0:40:10
Well, I will encourage anybody working with virtualenvs to take a look at fades and start using it. At the beginning you don't really see the value of it. I mean, you say, oh, another wrapper, why would I use one, what is the benefit of it? But the moment you start really using it, you will not stop. It's very, very helpful in everyday Python usage.
Tobias Macey
0:40:45
All right. I'll have to take a look at that. Well, thank you very much for taking the time today to join me and discuss your experience working with Python and helping to contribute to the growth of the community. I appreciate all your efforts on that front and I hope you enjoy the rest of your day.
Facundo Batista
0:40:58
Okay, thank you. Thank you for having me. Bye bye.
Tobias Macey
0:41:04
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast, at dataengineeringpodcast.com for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email the host at podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.

Python Powered Journalistic Freedom With SecureDrop - Episode 228

Summary

The internet has made it easier than ever to share information, but at the same time it has increased our ability to track that information. In order to ensure that news agencies are able to accept truly anonymous material submissions from whistleblowers, the Freedom of the Press Foundation has supported the ongoing development and maintenance of the SecureDrop platform. In this episode core developers of the project explain what it is, how it protects the privacy and identity of journalistic sources, and some of the challenges associated with ensuring its security. This was an interesting look at the amount of effort that is required to avoid tracking in the modern era.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host as usual is Tobias Macey and today I’m interviewing Jen Helsby and Kushal Das about SecureDrop, a secure platform for submitting and receiving documents anonymously

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by describing what SecureDrop is and how it got started?
    • How did you get involved in the project?
  • Can you give some background on where and why it is useful?
  • For someone using a running instance, what does their workflow look like?
    • What are some of the ways that you minimize user experience hurdles to prevent them from circumventing the security through laziness or apathy?
  • I was a bit surprised to see the references to the messaging system that is included. Why is that an important feature?
  • What form do the submissions generally take and what are the limits on formats that you can accept?
  • How is the system itself architected and how has the design evolved since the first implementation?
  • In terms of the security protocols and technologies that are implemented, what factors are you considering as you develop the project?
    • What are the weak points or edge cases that could lead to compromise and how do you guard against them?
  • In terms of the deployment and maintenance of a SecureDrop instance, how much technological sophistication is necessary for the organization running it, and how much effort do you put into simplifying it?
  • What are some of the notable uses of a SecureDrop deployment and what motivates you to continue working on it?
  • What are the most interesting/innovative/unexpected uses of SecureDrop that you have seen?
  • How do you approach the sustainability of the platform?
  • What have you found most challenging/interested/unexpected in your work on SecureDrop?
  • What is in store for the future of the project?

Keep In Touch

Picks

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Raw transcript:
Tobias Macey
0:00:12
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you need somewhere to deploy it. So take a look at our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models and running your continuous integration, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And you listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O'Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to pythonpodcast.com/conferences today to learn more about these and other events and take advantage of our partner discounts when you register. Your host as usual is Tobias Macey, and today I'm interviewing Jen Helsby and Kushal Das about SecureDrop, a secure platform for submitting and receiving documents anonymously. So Jen, can you start by introducing yourself?
Jen Helsby
0:01:47
Sure. My name is Jen Helsby, and I'm the lead developer of SecureDrop.
Tobias Macey
0:01:50
and Kushal, can you introduce yourself?
Kushal Das
0:01:52
Hi, I'm also a maintainer of SecureDrop, and I'm part of various other projects, including being a Python core developer, and we both, Jen and me, are also part of the Tor Project.
Tobias Macey
0:02:02
And Jen, do you remember how you first got introduced to Python?
Jen Helsby
0:02:05
Yeah, I started using Python when I was in graduate school. I did a PhD in astrophysics, and so I started using Python for data analysis. This is some years ago.
Tobias Macey
0:02:14
And Kushal. How about you? Do you remember how you first got introduced to Python?
Kushal Das
0:02:17
I mean, I saw Python back in college days. But then in 2005, someone told me that I could try to write applications for my Nokia phone using Python. Sadly, I had a different model of Nokia, which never had Python. But yeah, that's how I got into it.
Tobias Macey
0:02:33
And so can you start by describing a bit about what the SecureDrop project is, and how it got started, and how you got involved with it?
Jen Helsby
0:02:40
Yeah, totally. SecureDrop is an anonymous whistleblowing platform that was first created by Aaron Swartz, Kevin Poulsen, and James Dolan around 2012. And that was around the time that WikiLeaks was in its heyday, and they had this submission system, and they were getting interesting documents through it. And so the idea was to create an open source project that would be something similar that major news organizations could use to also get documents while protecting the identity of sources. And so it's not really a new idea, because news organizations have had anonymous tip lines for some time, but doing it in today's kind of surveillance landscape is the challenge. And I got involved like three years ago; I had installed SecureDrop and thought it was a great project, and so I started working on it.
Kushal Das
0:03:27
I think my path to the project was a different one. I watched the Freedom of the Press Foundation and SecureDrop from a distance all the time. And back in 2017, I was actually wondering if I should drop an email to Freedom of the Press saying, hey, can I work on the project full time? And I was not being able to do so at the time; my wife Anwesha actually pushed me, like, just write to them and see what happens. And that changed many things in life, and I'm happy working on the project.
Tobias Macey
0:03:52
And so can you give a bit of background as to some of the wheres and whys of when the SecureDrop project is used?
Jen Helsby
0:04:00
Yeah, totally. So there are a lot of people that might want to share information with a news organization, but might fear what will happen to them if they are identified as the person who shared the information. So they might be fired, or in more kind of extreme situations they might even be potentially charged with a crime, or worse. And generally, reporters, at least in the US, will refuse to provide a source's identity when asked by the government. And so the general problem is that, you know, in today's world, all communications are mediated by third parties, and the government doesn't need to ask a journalist for the identity of their source; they can just go to a third party and ask them. And so we started seeing this happen a lot more during the Obama administration, where the government would get a court order to acquire the telephone records of a journalist in order to identify the source. That happened with the Associated Press, for example, under Obama, and happens even more under Trump, unfortunately. And so that's the situation where, if organizations think they're going to get sources of that type, then providing SecureDrop, along with other channels, is a good idea.
Tobias Macey
0:05:10
Yeah, given the fact that a lot of the original ways that journalism was done were much more face to face, it was possible to shield your sources, because you didn't have those electronic trails for people to be able to follow and uncover who might have released a particular document. But with the global nature of communication, and the fact that a lot more people will be collaborating over larger distances, it increases the availability and access to that information. But as you said, it increases the potential risk. So it's definitely good that there are platforms such as SecureDrop available to help ensure that there is the availability of that information without necessarily putting people at risk in the process of providing it.
Jen Helsby
0:05:51
Yeah, absolutely. Even meeting physically in the modern age, sometimes people are like, okay, well, I won't call them on the phone, but then I'll meet them physically. And that still, you know, in a city, certainly produces a significant amount of data, because there are CCTV cameras everywhere, facial recognition, and that's something that, depending on the adversary you're concerned about, could be used to identify you. So yes, it's quite a hard problem.
Tobias Macey
0:06:13
So for somebody who is running an instance of SecureDrop, and somebody who wants to submit some information to that organization, what does the overall workflow look like for the person who is submitting, both in terms of just discovering the availability of it in the first place, and then actually providing the information, and then on the receiving end, the actions required to actually retrieve that information and make use of it?
Kushal Das
0:06:37
Yeah, I mean, I can talk from the source point of view, and then Jen can explain what happens from the journalist point of view. So generally, most of the news organizations also publish the URL of their SecureDrop instance via other mediums, like the normal news website; some physical newspapers also print it in the physical copy. And we also have a directory where we have verified URLs from different organizations that are running SecureDrop. So a source can find it from any of these. In one particular case we saw, one organization actually put their URL on a billboard in front of another large organization. So when a source sees these, if they try to read a little bit more about how they can submit, all of these websites generally also give some sort of bare minimal steps for the sources: how they can use the Tails operating system, that they should go to a different network, like a cafe or somewhere, don't try to do anything from your office network. And then, using Tor Browser on Tails, they can open up the instance and just click and log in and submit any documents, or they can ask any questions, send some sort of messages. And from the source point of view, they do not get a username, password, or any details like that; they just get one big diceware-style passphrase, which they just have to remember for next time's use. Jen, do you want to go ahead with the journalist side?
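The "diceware-style" passphrase Kushal mentions is a codename built from randomly chosen dictionary words. As a toy sketch of the idea only (SecureDrop's real wordlist, word count, and generation code differ):

    import secrets

    # A real diceware wordlist has thousands of entries; this tiny list is only
    # for illustration.
    WORDLIST = ["correct", "horse", "battery", "staple", "orbit", "velvet",
                "cactus", "lantern", "pillow", "walnut"]

    def generate_codename(words=7):
        # secrets gives cryptographically strong randomness, unlike random.choice
        return " ".join(secrets.choice(WORDLIST) for _ in range(words))

    print(generate_codename())  # e.g. "walnut orbit cactus velvet pillow horse staple"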
Jen Helsby
0:08:05
Once the source has uploaded either documents or messages to the SecureDrop web server, then the journalist will come along and they will access another web application, that is separate from the web application that sources are using, again using Tor Browser. And they will download those documents, whatever they're interested in, and then they will transfer those documents across an air gap. So they will transfer them to a machine that's never been connected to the internet and is not currently internet connected, using some storage device like USB drives or CDs. So they take these documents across, and that is where they decrypt and read those documents, on an air-gapped machine which we call the Secure Viewing Station. And so that's the only place where documents can be decrypted. So at that point, they'll either decide to respond to the source, in which case they need to go back to an online machine and send messages back to that source, who can then log in again and read them, or they will transfer those documents that they've decrypted to another workstation in the newsroom, or print them out, such that they can take them to their editor or whatever their workflow is after that point. So it's kind of a bit laborious having to traverse this air gap. But the main concern that motivates that design is that it's one of those scenarios where you're asking just random people on the internet to submit you files of any type, and then the journalist is going to open those files. And so the concern is, what if the file contains malware? And so we want to keep that compartmentalized from the rest of the system.
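SecureDrop encrypts submissions with GPG to the instance's public key, so that only the Secure Viewing Station, which holds the private key, can decrypt them. A minimal sketch of that split using the python-gnupg library; the recipient, home directories, and passphrase-free key here are invented for illustration and are not SecureDrop's actual code:

    import gnupg

    # On the online application server: only the public key is available.
    server_gpg = gnupg.GPG(gnupghome="/var/lib/securedrop-demo/keys")
    encrypted = server_gpg.encrypt("contents of the submission",
                                   recipients=["journalist@example.org"],
                                   always_trust=True)
    ciphertext = str(encrypted)

    # On the air-gapped Secure Viewing Station: the private key lives only here,
    # so only this machine can recover the plaintext (assuming no key passphrase).
    svs_gpg = gnupg.GPG(gnupghome="/home/amnesia/Persistent/keys")
    plaintext = svs_gpg.decrypt(ciphertext)
    print(plaintext.data.decode())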
Tobias Macey
0:09:40
And because of the fact that there is the potential for malware, I'm wondering what any sort of best practice or standard operating procedure is in terms of the air gap computer, as far as ensuring that it is up to date with its security updates and has some sort of adequate protection to prevent any sort of malware from corrupting the rest of the machine. Or, I mean, given the fact that it's air gapped, there's less of a blast radius, where you don't have to worry about it escaping from there. But I'm wondering if there's any sort of potential compromise as far as other information on the machine that might get destroyed in the process of opening some of those files, or just making sure that the overall security of that system is up to date as well, given that it's not connected to the internet?
Jen Helsby
0:10:25
Yeah, I mean, one of the main challenges with an air gap is that it's not going to be getting automatic security updates, and so people do need to manually update. The main concern right now with this air-gapped system is that if an attacker can get code execution, it is the same place where the private key is stored, and so we still don't want to allow that to happen. If they do get code execution, all of the physical devices that could be used to exfiltrate data are removed. So, for example, the network cards are removed, the mics are removed, etc. So if you can get a foothold, it's hard to get data off the system.
Kushal Das
0:11:04
I was saying that we also use Tails on both the journalist workstation and also the SVS, the Secure Viewing Station. So it does also provide some sort of support as an air-gapped system here.
Jen Helsby
0:11:18
Tails has this amnesiac property, which is why we use it. So almost everything on the system will be destroyed when you reboot it. There's just one directory that persists, and everything else is destroyed. So that's a real advantage in the case of malware potentially getting a foothold on the Secure Viewing Station.
Tobias Macey
0:11:41
And as you mentioned, some of the overall workflow, particularly on the receiving side, is a bit laborious. And then also for the person who's submitting the information, as you said, there's the potential for responding back to them, but it requires them to actively go back and log in with that randomly generated passphrase to be able to see if there are any return messages, without any way of being notified of their presence. So I'm curious if there's any sort of common workflow that people use to try and reduce any sort of latency or barrier as far as the return communication to maintain some sort of a dialogue, or if the document submission can serve as the riskiest piece of business, and then the rest of the communication can happen in somewhat of a more convenient form factor.
Jen Helsby
0:12:27
So there are people that just come to SecureDrop, dump documents, and never return, and then there are people that have these more extended interactions; there are people that only talk through SecureDrop and have long-running relationships with journalists. And the truth is that we don't know too much about individual news organizations. I should have said that we just write the software, and then news organizations install it and operate it themselves; we can't SSH into anybody's SecureDrop, it's all managed by an administrator at each individual organization. And as a project, we don't want to know too much. I mean, we need to know some things about what users are doing in order to design the system, but we don't want to know too much, because it's obviously very sensitive, and that would make us a place where you could go to gather information about these common workflows and potentially use that to attack a news organization.
Tobias Macey
0:13:23
And also in terms of just the overall user experience, having too many difficult steps or too much inconvenience in the process can often lead people to just short circuit the security and take shortcuts that will prevent the overall effectiveness of the system. And I'm wondering how you approach that user experience and education step to ensure that the overall use case and workflow of SecureDrop remains secure, and sort of prevents people, or encourages people not, to take those shortcuts that might compromise it.
Kushal Das
0:13:55
It's kind of hard for the journalists to actually use any other system than the proper journalist workstation to access the SecureDrop instance. And even on the journalist workstation, journalists can download any kind of submission, but they cannot view it until they actually move it to the air-gapped Secure Viewing Station. So even if they want, there is no simple way to, you know, bypass the security; the way it is designed, it's difficult, at such a level that it cannot easily be bypassed to make that whole flow easy right now.
Jen Helsby
0:14:33
yeah, they would need to know how to like export private keys and stuff like that. Second, man, a security.
Kushal Das
0:14:38
And we also do trainings at those places, most of those places where they take help from us about installing secure dropping things. So like freedom of the press foundations, digital security, they not only teach about how to use secure job and make it into a muscle memory, they also have to learn about digital security one on one and more details, so that the overall digital the secret, how you
0:15:01
it's better for the journalist
Tobias Macey
0:15:03
and going back to the messaging system, because of the fact that this is at least at face value more of a one way relationship where somebody will submit documents to the news organization. I'm curious why you feel that the return messaging and being able to have that be a communications channel is important to the overall workflow and utility of the system.
Jen Helsby
0:15:25
Yeah, one of the challenges with a system like this is that the source journalist relationship is a human one. And so it can be hard to develop that rapport. Without having some kind of back and forth, it might take some time before a journey, before source is comfortable sharing something until they know that it's going to be handled properly, and that they're kind of going to be safe. And so you can imagine that that's one of the uses of the messaging system. And journalists might have follow up questions, they might need clarification on what a document needs, if it's particularly technical in nature, or they might kind of need a pointer to where they can find out more. And so that kind of back and forth is what the messaging system is most useful for.
Tobias Macey
0:16:10
And then also, as far as the types of submissions, I'm curious what form they generally take, whether they're PDF documents typically, or if they're sort of zip archives, just the sort of general volume and scope and sort of format support that's necessary for ensuring that you're able to access that information on the secure workstation, the air gap workstation, once you've retrieved it,
Kushal Das
0:16:34
that is a fine size limitation like 500 Mb that is to start with, then for as far as the file types are concerned, there is no limitation. sources can submit any kind of document. And depending on the jar, and depending on the like, how the journalists want to view those documents in future like after they decrypted it on the secure workstation, they may want to move it out to some other system like like some other fancy system, maybe we through be able to play that video or document and watch the document. So
Jen Helsby
0:17:04
yeah, generally what we try to have good support for in terms of like opening documents is like office kind of documents, PDFs, most common audio and video formats. That's what you can open up and tales machine. And I think, you know, like if you get a sequel database dump or something like that, that would need to be taken to either another machine, or you would need the news organization would need to ferry like a deadlock and open that file nicely onto the workstation.
Tobias Macey
0:17:33
And then in terms of the overall system architecture, I'm wondering if you can talk through how it's designed and how it's deployed, and some of the reasoning behind using Python as the implementation language.
Jen Helsby
0:17:46
Sure, yeah. So the way that it's architected right now is every news organization installs two servers, so they're both run Ubuntu server, and one server is an application server. So that hosts two web applications, one that's used by the sources to submit the documents is previously described in one that's used by journalists to access documents. So that's the first server which we call the application server. And then the second server is a monitoring server that runs a host base IDs that just monitors the application server and then sends alerts for potentially suspicious activity to administrators, administrators here being the person at the news organization who's charged with keeping the security up and running order. And then we have a network firewall that separates the security area of the network from the rest of the network in case there's a compromise of their news organization, network or compromise of the security network just to keep things separated. And all of that is hosted on prem at a news organization. So might be in their data center. Or it might be you know, some cases like the editor's office or the General Counsel's Office. And then both sources and journalists only access the epic server through a veto veto on in services. And that is done primarily to protect sources and make sure that they do come in through Tor. And then admins can either use it or they can just use regular land to administer the service. And then also, a new news organization needs to have a online journalist workstation that the journalists can use to download documents, and then the secure viewing station that is at gap just described earlier. In terms of using Python, we want to generally pick technologies that are widely used and established and easy to maintain. And we really do get an advantage of using Python. So we use that for the two web applications and for a CI that administrators use to administer the system.
Tobias Macey
0:19:49
And given that the organizations that are running these instances don't necessarily have a lot of technical staff, particularly in the case of independent noise news organizations that may be fairly small. I'm wondering how you approach the overall system designed to reduce the maintenance burden on those organizations and ensure that they're able to keep it up to date and appropriately secure so that it fulfills its original intent,
Kushal Das
0:20:15
like what the actual administrator see is one or two single in a couple of basically a couple of small commands. And a gentleman said those are written in Python, but what does commands actually do is that they fire up a setup, and it will play like playbooks. And those playbooks make sure that the servers are in the correct state, like diabetes rules, like what all software has to be installed, what kind of kernel it should run, all details for all of those servers are exactly the same. And that we can only achieve by using these answerable runs. And that also helps to make sure that even if the administrator doesn't know much about Linux systems, they can just type this one single command, which will make sure that the servers are in the latest good set the way it should be.
Tobias Macey
0:21:00
As far as the overall security protocols and technologies that you're using, what are some of the main factors that you're considering as you develop the project and any weak points or edge cases that you are aware of, and that you try to guard against that it could potentially lead to a compromise?
Jen Helsby
0:21:16
Yeah. So generally, as I said earlier, we try to use widely used and established tools. So for example, if we add a dependency, we want to make sure that it's very commonly used. And we will, you know, when we make an update to that dependency, will review the changes, we do things like that, in terms of just general architect and for the project, we do threat modeling to analyze the functionality, the potential threats, and then when we're deciding what mitigation to apply, we go back to a threat model. So we have a document that's internal that contains every potential threat to the system. And then we try to rank all those threats to determine how to allocate engineering efforts so that we don't spend time mitigating threats that are either low impact or very hard to actually execute as an adversary. In terms of weak points and edge cases, probably the biggest challenge right now is just there are limits to what you know, any technical tool can do. So this cases where sources can be identified. And you know, unfortunately, we have seen this not necessarily people that use secured Rob, but people that try to share information with news organizations, operational security failures, you know, if you're using a tool, like scooter, and then you also email and use organization direct and those kind of situation, or if you're in a news organization, if you're in a organization as a leaker. And you're sharing a document that only a few people have access to, and access to the document is logged. That's another really challenging problem that we can't really engineer around. And so those are the biggest threats that face potential sources right now. And I think, you know, certain organizations realize that just having really good logging and other letting internally can potentially mean that as soon as somebody plugs into USB drive, you can flag it. So that is probably the biggest issue.
Tobias Macey
0:23:09
And as far as the overall development of the platform, what have been some of the most interesting or unexpected or challenging aspects in your experience of working on it and maintaining it and interacting with users,
Kushal Das
0:23:21
this further development or like use cases or things we found interesting.
Tobias Macey
0:23:25
Yeah. So for now, mainly just focusing on the actual development and maintenance of the project. And then we'll talk about some of the interesting use cases after
Kushal Das
0:23:33
I think, for me, like what I always find really challenging is that we are trying to secure systems where we do not have any access to all the secured Rob instances servers, they are running inside the organization's we're running them. And we as developers, like have zero access to those. So somehow, we have to make sure that those systems get upgraded and stays secure as it should they should be. That's one of the biggest one in my mind. Yeah, Jen.
Jen Helsby
0:24:00
Yeah, that's an ongoing challenges, especially because we're supporting like, we have contracts with some news organizations to help support the instance. And then another issue is just designing a system while trying to intentionally not know too much about how it is used. That's kind of an ongoing issue. And as far as any sort of interesting or unexpected uses of secure drop, or notable cases where it has proven beneficial. I'm wondering if there are any stories that you can share on that front, there's a recent case where we, what was announced at DEF CON, this month, beginning of this month, that apparently the US federal government is going to use secure drop in order to get security vulnerabilities. So this the reason why they want to secure up in that case is potential security, researchers are concerned about retribution. And so if they could submit through secure drop, they can make sure that whatever agency is aware of the vulnerability and fix it with them being identified,
Kushal Das
0:25:02
and that is the other story, which is about, like someone wrote an anti diversity memo at Google. And it got leaked multiple times multiple versions YRCK dropped it against different organizations, which was a big news all across.
Tobias Macey
0:25:17
Yeah, that would definitely took a while for people to get around it. And there was a lot of conversation and consternation on all sides of that conversation. And then in terms of the overall sustainability of the platform and the project, how do you approach any sort of required funding and men making sure that you have an appropriate level of staffing on the development side, and then also the overall process for user feedback to ensure that you're incorporating new features or system improvements that make sure that everybody who's using secure drop are getting the benefit that they want?
Jen Helsby
0:25:52
Yeah, totally. So in terms of sustainability of the project. And to kick it off, it's been really fortunate that the project is supported by freedom of the press foundation. So Chris shell and I are both employed by freedom of the press Foundation, they took the project over, after Aaron Swartz unfortunately passed away, I believe, in 2012. And so FBF, which is show for freedom of the press Foundation has supported development for several years since then. We've also been fortunate to get funding from Mozilla open source support, which supports a bunch of internet freedom projects like Tor as well. And so thanks to their support, and other kind of grant based funding and small donors that donate FBF, we've been able to keep the project maintained. In terms of user feedback, we get user feedback, either through just our bug tracker, like other projects, we have like a private Support Portal, that organizations that install secure drop, can use to file tickets, if there's an issue or if there's something that they want change. Then we also do surveys and user testing, and we chat privately to existing users. I've secured Rob to do that,
Tobias Macey
0:27:06
since secure drop is a platform that provides a means for circumventing surveillance. I'm wondering if there have been any cases where you've had to deal with any sort of pushback from either governments or other organizations that are either trying to shut the project down or have some sort of influence over it?
Kushal Das
0:27:26
Not that I know about anything?
Tobias Macey
0:27:28
Yeah, yeah, I'm not aware of anything like that, either. I mean, I think we have, at least we both started working on the project when it had already kind of become pretty mainstreamed. And a lot of big news organizations like New York Times, etc. Were using it. And so I think it would be pretty controversial if you know a government agency were to kind of publicly go after secure drop the project at this stage. And then in terms of the overall system maintenance, you mentioned that you have the answerable playbooks that allow users to you get it deployed, I'm curious how you publicize to the different agencies that there's a new release available, and how you simplify the update process to ensure that they're running the latest versions, particularly if you have any dependencies that have some sort of CV or vulnerability that's a present on the system to ensure that they stay up to date. And then particularly for a long running instances, how you help them with any sort of system upgrades of the underlying operating system.
Kushal Das
0:28:29
So like all secured of servers, by default, they get, like any security updates that comes out from the window as an operating system. And then we also like, if there is any changes from us, or new version, or new bug fixed version, those will also get pulled into the servers and deployed without any intervention from the system administrators. So and the servers regularly get rebooted every day, based on the time the SIS admins decided, and so that, and we do a lot of QA on those updates to make sure that those updates can pull in any other actual security updates or any other kind of dependencies, which are required to be there. And as far as the operating system updates, like we did one recently, we moved out of Trustees into the annual open to And for that, we actually, like worked a lot on the messaging and making sure that ministers and administrators get the proper steps and like documents and everything, so that they can go to the certain steps to make sure that the transition happens without any hiccups. So those all of those things together helps the systems to be updated.
Tobias Macey
0:29:44
And then also as far as testing and verification, I'm wondering what that QA process looks like to ensure that you're not introducing bugs, or potential regressions are security vulnerabilities into the platform as you're preparing a release,
Kushal Das
0:29:58
secret drop is a free song project and the source code, the bug, the bug trackers, and everything is public. And you can like actually anyone wants to go and check, they will find the issues file for each of the release, where we have a huge amount of like us steps like each and every parts of the project we manually verify. And then, as far as like, if you ask me as a developer, this is one of the best tested project as I've seen in my life. As far as that like integration test cases, the kind of unit test cases we have in the project. And like for any kind of feature to go in, it actually gets verified by multiple reviewers. And then all like we all, like continuously running and executing those scripts and act on server to make sure that the server behaves the way it should. And we have like two weeks of
Jen Helsby
0:30:49
Yeah, before every release, no, I was gonna say the exact same thing that we do a freeze two weeks before release. I wanted to test everything. And make sure that you know, even though every new feature has test coverage, we still want to test things manually, because some parts of the architecture are difficult to have automated tests for so we have like test for the web application, we have tests for the system state using testing for but for example, the full workflow of installing from tales that to service that's not fully tested. And so we do do that each regular release and multiple times.
Tobias Macey
0:31:29
As far as your overall experience for each of you individually. In on the project what have been some of the most interesting or unexpected or useful lessons that you've learned in the process?
Kushal Das
0:31:39
Nick, I think Jane also already mentioned one of the things is that supporting any project where we do not have any kind of access that was kind of difficult. And like me building a system which is which will be used by people who are not always so much into Linux or like friendly to the, our, our incense, like the developers we are life. So any building any new system for users, keeping those users in mind is always a challenge.
Jen Helsby
0:32:07
I guess for me like making sure you know for any system that has a large number of potential threats, making sure that time is being spent on the kind of lowest hanging fruit in it in a more rigorous way like we've done with the threat modeling process I described earlier is is so valuable, because it's kind of like security nerd, sometimes we want to focus on like the most interesting attacks that we can think of. And it can be tempting to get drawn into those by kind of having a more rigorous approach to it. Okay, the easiest thing that an attacker could do is x, like, let's make sure that we reduce the risk of this and come up with the mitigation is really valuable. And I haven't really seen too many projects of this type publicly presenting that information, kind of how they went about the threat modeling process, it would be cool to see that we've shared some of our threat modeling, documentation in our public docs, Dr. Secure drop.org. And as far as any particular packages or libraries that have been most useful in the process of building secure drop, I'm wondering if there are any that are notable that you'd like to call out
Kushal Das
0:33:11
with Epic SN is flawless. And we also already mentioned danceable that the replicate like huge application and we use molecule for testing and testing for testing part. And then a DOD project is obviously the one of the biggest thing of the whole project or singtel. torrent is Jen
Jen Helsby
0:33:27
Yeah, first security automation and project that we use, which is really great expanded to do static analysis, which we run in ci, bandit and safety so that we can get what we can fail ci when CV is found in one of our dependencies. And if other issues are introduced in a PR just reduces the amount of manual review, it's really great to, you know, easy to integrate, and probably useful for any project, not just one that's security sensitive in terms
Tobias Macey
0:33:52
of the future of the project, what are some of the new features or improvements or just overall work and effort that you you have in store in the near to medium term and any help that you are looking for from the community to improve it or add new capabilities?
Jen Helsby
0:34:09
Sure. So I guess one of the challenges right now is that we've made it pretty easy for sources to share documents with journalists, they just need to download Tor Browser and get a website, basically. But a lot of that complexity has been offloaded to the journal aside, as you know, it's described earlier with this kind of clunky workflow. And so one of the things we've been working on it is making it easier for journalists to check secure drop so that instead of it taking maybe 30 minutes, it may be could only take five minutes. And so we've been working on a project for journalists a workstation that combines that currently two separate workstations. So right now, we have this online workstation that's connected to the internet that they used to download the documents. And then we have a separate workstation. That's a gaps that they used to read the documents. And so we've been experimenting using cubes, which is a great project. And you should all check it out, which is basically a Zen distribution where everything is running inside a VM. And so they also have this concept called disposable VM, which is kind of perfect for a secure drop, because it's the kind of situation where you could open a potentially malicious submission in this disposable VM. If it gets popped. It's fine. It's compartmentalised in the VM, modular Zen escapes, and then it's destroyed after use. And so we've been experimenting kind of architected a kind of into VM pipeline that would download documents, pass them to a VM that's running a nice GUI for the user. And then when the user clicks a button, Open Document, it opens in this disposable VM. So that's all the code for that is public on our GitHub. org freedom of press. And so if you're interested in helping out probably the easiest place for people to get involved would be this gooey that I described, which is written in Python is cute. And there's a lot of active development on that right now.
Tobias Macey
0:36:02
Are there any other aspects of the secure job project or the use cases that it enables that we didn't discuss yet that you'd like to cover before we close out the show?
Jen Helsby
0:36:11
I don't think so. But if you are maybe interested in learning about the organizations, that if you have information you would want to share, you should download Tor Browser and then go to secure drop.org slash directory to get a list of many of them.
Tobias Macey
0:36:27
All right? Well, for anybody who wants to get in touch with either of you or follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And so with that, I'll move us into the pics. And this week, I'm going to choose laser tag because I got to hang out with my kids yesterday and some of their friends and our friends. And we all had a lot of fun playing laser tag together. So it's not something I've really done in the past but turned out to be quite enjoyable. So if you're looking for something to get up and move around and have fun doing it, it's worth taking a look at that. And with that option TL Do you have any pics this week?
Kushal Das
0:37:02
Oh, I'm actually waiting for not not for this week, but within few weeks, like Edward Snowden book is coming out, so I'm just waiting for that.
Tobias Macey
0:37:11
All right, and Jen, do you have any pics this week?
Jen Helsby
0:37:13
Do I have any pics? Huh? I'm racking my brain. I don't know that I do. But I will definitely check out Edward Snowden spark released September 17.
Tobias Macey
0:37:23
Alright, well, thank you both for taking the time today to join me and discuss your work on secure drop. It's definitely an interesting project and an interesting problem space. So I appreciate your efforts on that and I hope you enjoy. I hope you enjoy the rest of your day.
Unknown
0:37:36
Thank you. Thanks, Tobias.
Tobias Macey
0:37:40
Thank you for listening. Don't forget to check out our other show the data engineering podcast at data engineering podcast com for the latest on modern data management. And visit the site at Python podcasts. com to subscribe to the show, sign up for the mailing list and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Female host said podcast and a.com with your story.
Unknown
0:38:03
To help other people find the show. Please
Tobias Macey
0:38:05
leave a review on iTunes and tell your friends and coworkers

Combining Python And SQL To Build A PyData Warehouse - Episode 227

Summary

The ecosystem of tools and libraries in Python for data manipulation and analytics is truly impressive, and continues to grow. There are, however, gaps in their utility that can be filled by the capabilities of a data warehouse. In this episode Robert Hodges discusses how the PyData suite of tools can be paired with a data warehouse for an analytics pipeline that is more robust than either can provide on their own. This is a great introduction to what differentiates a data warehouse from a relational database and ways that you can think differently about running your analytical workloads for larger volumes of data.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Taking a look at recent trends in the data science and analytics landscape, it’s becoming increasingly advantageous to have a deep understanding of both SQL and Python. A hybrid model of analytics can achieve a more harmonious relationship between the two languages. Read more about the Python and SQL Intersection in Analytics at mode.com/init. Specifically, we’re going to be focusing on their similarities, rather than their differences.
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host as usual is Tobias Macey and today I’m interviewing Robert Hodges about how the PyData ecosystem can play nicely with data warehouses

Interview

  • Introductions
  • How did you get introduced to Python?
  • To start with, can you give a quick overview of what a data warehouse is and how it differs from a "regular" database for anyone who isn’t familiar with them?
    • What are the cases where a data warehouse would be preferable and when are they the wrong choice?
  • What capabilities does a data warehouse add to the PyData ecosystem?
  • For someone who doesn’t yet have a warehouse, what are some of the differentiating factors among the systems that are available?
  • Once you have a data warehouse deployed, how does it get populated and how does Python fit into that workflow?
  • For an analyst or data scientist, how might they interact with the data warehouse and what tools would they use to do so?
  • What are some potential bottlenecks when dealing with the volumes of data that can be contained in a warehouse within Python?
    • What are some ways that you have found to scale beyond those bottlenecks?
  • How does the data warehouse fit into the workflow for a machine learning or artificial intelligence project?
  • What are some of the limitations of data warehouses in the context of the Python ecosystem?
  • What are some of the trends that you see going forward for the integration of the PyData stack with data warehouses?
    • What are some challenges that you anticipate the industry running into in the process?
  • What are some useful references that you would recommend for anyone who wants to dig deeper into this topic?

Keep In Touch

Picks

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat

Links

The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA

Click here to read the raw transcript...
Tobias Macey
0:00:12
Hello, and welcome to podcast.in it the podcast about Python and the people who make it great. When you're ready to launch your next app, I want to try a project you hear about on the show you need somewhere to deploy it. So take a look at our friends over at winnowed. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network all controlled by a brand new API, you can get everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models and running your continuous integration, they just launched dedicated CPU instances, go to Python podcast.com slash the node that's LINODE. Today to get a $20 credit and launch a new server and under a minute, and don't forget to thank them for the continued support of this show. Taking a look at recent trends in the data science and analytics landscape, it's becoming increasingly advantageous to have a deep understanding of both SQL into Python. A hybrid model of analytics can achieve a more harmonious relationship between the two languages. Read more about the Python and SQL intersection and [email protected] slash in it that's INIT. Specifically will be focusing on their similarities rather than their differences. And you listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet listen and learn from your peers you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media Day diversity Corinthian global Intelligence Center data Council. Upcoming events include the O'Reilly AI conference, the strata data conference, the combined events of the data architecture, summit and graph forum and data Council in Barcelona. Go to Python podcast.com slash conferences today to learn more about these and other events and take advantage of our partner discounts when you register. Your host, as usual, is Tobias Macey. And today I'm interviewing Robert Hodges, about how the PI Data ecosystem can play nicely with data warehouses. So Robert, can you start by introducing yourself?
Robert Hodges
0:02:09
Hi, my name is Robert Hodges, and I'm CEO of volatility, we offer commercial support and software for click house, which is a popular open source data warehouse. Beyond that, I have a pretty long background in databases. I started working on databases in 1983, with a system called em to a four and click house is now database number 20. I think, because I'm kind of losing count.
Tobias Macey
0:02:34
Yeah, after the first few, I'm sure you probably stopped bothering to keep exact track.
Robert Hodges
0:02:38
Well, every time I count, I keep finding ones that I forgot about. And it's it's definitely it's been very variable. For example, I worked for over 10 years with my sequel, then there were others like DB to where I used it for a day and decided I hated
0:02:49
it. So it still counts.
Tobias Macey
0:02:51
And also databases is in some ways a bit of a nebulous definition where if you squint enough different things can be considered databases that you might not think and at first blush.
Robert Hodges
0:03:01
Absolutely. And I think what's interesting over the last perhaps, say 10 to 12 years is the range of things that we properly consider databases have been increased enormously as people deal with different kinds of data, different amounts of data and different problems that they're trying to solve when they analyze the data driven sort of a plethora of different kinds of approaches to databases.
Tobias Macey
0:03:23
And do you remember how you first got introduced to Python?
Robert Hodges
0:03:25
You know, that's a really good question. I think what it was, and I can't remember where it happened, I read an article that said that a long time ago that said, Python was so beautifully designed that it is something where you would be up and running and productive in about four hours. And so I thought, Okay, that sounds good. And I went and tried it. And it did seem pretty easy to use, but I didn't program with it at the time, I got into introduced to it at an industrial level at VMware, about two and a half years ago, where I was working on a project doing tools for one of the VMware cloud products. And I was helping to I was sort of architect expert group that was developing open source Python API's. And that was where I really learned how to use Python, well learned how to do things like actually properly program modules, do documentation set up, do set up to PY, things like that. So that that was really my full education.
Tobias Macey
0:04:15
And so for this conversation, can you start with giving a bit of a quick overview about what a data warehouse is, and some of the ways that it differs from a quote unquote, regular database that somebody might think of for anybody who isn't familiar with this space? Sure.
Robert Hodges
0:04:29
So I think that it helps to start this up by defining what we mean by regular database, because I just said, as I, as I've referred to, and as I think many people know, there are many types of databases. So when people think about a database, they commonly think about something like my sequel, which stores data in tables, and the tables consist of so that means that there's a set of columns that each table contains, and then the data are actually stored as rows. And this is the same data structure that you see when you work in Excel. So what we generally speaking, we call that kind of database of grow oriented store, which means that when you actually look at how the data are stored, on disk, or on SSD, as the case may be, as well as the way that they're accessed, we go get them by rows. And this type of database was one of the earliest that was developed. And it's very well optimized for things like managing data in a bank or for managing sessions. And on a website, because there's a lot of little change, there's a lot of little updates, and we can go to the road, change it, put it back, and so on and so forth. Now, that kind of database has a problem, which is that as you get to very, very large amounts of data, the fact that it's efficient for update leads to a trade off that it's not particularly efficient for reading. And one of the reasons is that if you wanted to, for example, how to data, a data warehouse would like a table in your in your database that contain taxi rides, and one of those fields was the number of people that that wrote in the taxi, and he wanted to just take the average, you would end up reading and pulling into storage every single all the rest of the rows, just to read that one little tiny field. So as a result, data warehouses began to develop in the 90s. And they took a different approach to the way they organize data, which is that they organized it in columns. So you could have your billion row table. And you might have your field, which contains the number of people in the taxi that would be stored in a single column. And the idea is that the that the data warehouse is in this and many other ways optimize to read data very, very efficiently. So if you want to do that average, what you're going to do is you're going to read down this column very quickly. And and on top of that, the data warehouse typically has optimizations like that that array of data if you will, is going to be compressed and the compression algorithms are extremely can be extremely efficient. So for example, you can get 200 full compression on one of those columns, depending on depending on what the values are. So it compresses down to a very small amount of data plus, we use very sophisticated processing techniques. So that we can take advantage of what are called si MD instructions, single instruction multiple data that allow us to to operate on multiple values at a time when these things are loaded on the CPU registers. So this type of database, which is optimized for read is now generally what we know if when we talk about data warehouses.
Tobias Macey
0:07:23
And there are some cases where people might end up using a row oriented store as a data warehouse just because that's what they have close to hand, I know that I've seen it with things like the Microsoft sequel server, and I've heard of people using Postgres for that purpose if they have a small enough data set where they're able to perform their analytics, but the pattern in which they're using it is more along the lines of a sort of multiple more read oriented than right oriented so that they can run their analytics off of it without necessarily interrupting their application database. Exactly.
Robert Hodges
0:07:56
And in fact, what I started to notice perhaps 12 years ago, when I was working with customers was that you would see, we I ran a business that was actually focused on on as it turned out, clustering for MySQL databases, and we had people that ran pretty big MySQL installations, but what you would see in their data sets was they might have 200 tables in a database. But when you actually went and look at how big the tables were, they were probably one or two, which turned out to be extremely large, sometimes they would contain hundreds of millions of rows, and then the rest of the tables would tail off real quickly. Because then they would be you know, things like customers currency values, or things like that, that turned out to be small numbers. So what has happened is that people have done exactly as you described, they start out using my sequel, Postgres Microsoft sequel server, they get to the point where they have about 100 or 200 million rows. And that's the kind of the point where it becomes very, very difficult to do these read operations on the on the row oriented database, not only does it disrupt the operational work that's going on, so for example, trying to update the tables. But the other thing is because comes extremely slow. And the difference in performance can be staggering. When you have 200 million rows in a MySQL table, the difference between the performance there and running in a data warehouse will be a factor of 100, sometimes even more in some cases.
Tobias Macey
0:09:14
And on the other side of what we're talking about today, there's the PI Data ecosystem, which encompasses a large number of packages, but at the core of which most people will consider things such as non pie pandas, maybe the Jupiter notebooks format, and probably things such as psychic learn, or maybe some of the machine learning libraries. And I'm wondering, what are some of the cases where those libraries on their own aren't necessarily sufficient for being able to process data either efficiently? Or where you might need to lean on something such as a data warehouse in conjunction with the PI Data stack for being able to achieve a particular project?
Robert Hodges
0:09:53
Yeah, that's a great question. So I think the biggest thing that I see is that well, first of all, I want to say that these tools, the pie data, that whole ecosystem is really inspiring, because it has such a wealth of operations that you can do on on matrix and vector oriented data. Plus, it's coupled with really great machine learning as well as visualization tools. So so it's a really, really great ecosystem. But I think the single biggest thing that I hear about is people probably first of all people just complaining, hey, my, my data don't fit into memory anymore. And so what you end up seeing, at least in my experience, is that when you're running models in Python, you'll end up sampling, because you actually can't hold all the data in memory. So you just you just load a sample, say 10%, or 1%, or something like that, you train your model on that, or you run your model, or you you do whatever whatever operation that you're doing, you'll do on a small fraction, the data and the basic reason is you can't put it all in memory and process it at once. And I think part of this is part of this is because I think this is happening for two reasons. One is that Python bless its heart does not have a particularly efficient way of representing data memory. This is actually common to other systems like Java, it's not, it's not, it's not a Python problem, per se. But it's just that when you materialize objects in memory, they're not necessarily using memory layout as efficiently as they can. The second thing, and I think this is maybe more important is that by default, pandas wants to have everything in memory at once. So if you have a matrix, or you know, like a pandas data frame, it's going to want to have the whole thing in memory. And there's not a notion, for example, at least by default of being able to stream through data. And so that can also be a problem when you're trying to deal with large amounts of information.
Tobias Macey
0:11:38
Yeah, for a lot of these out, of course, systems, people might start to lean on things such as desk or the ray project for being able to scale across multiple machines and multiple cores, both for the memory issues that you were laying out, as well as some of the issues with dealing with embarrassingly parallel computations. And I'm curious how that plays into the overall use case for data warehouses. And some of the other ways that those broader elements in the ecosystem can also help in terms of managing larger data volumes, and also where the data warehouse can benefit. Yeah,
Robert Hodges
0:12:11
I think that's a great question, because it begins to because I think what you see happening with projects like task at one level, you could say, Hey, this is Python, kind of rediscovering or re implementing MapReduce, which, you know, the notion that you can do distributed processing you have, you have a lot of storage across a bunch of notes. But I think more particularly what I see happening in the in the, in the pie data ecosystem is that people are beginning to replicate in Python, things that are already solved in data warehouses. And I'll give you a couple of examples that data warehouses like vertical or click house, the one that I I operate on are very good at spreading data across multiple nodes. So they have the ability to break it up into we call this sharpening. So break it up into pieces that can be spread across notes, and then run the query in parts and join the results together, or, or aggregate the results together, before they return. Another thing we can do is we can replicate the data. So as as the number of users on your system span extends up upwards, and and you're, you know, beginning to add, you know, ask a lot of, you know, more questions concurrently, the ability to have replicas allows you to scale the performance very efficiently. So because, you know, different queries can run against different replicas. So I think this is that I think what's happening is that as these systems like tasks start to emerge, it's time to ask the question, do we want to implement this entire thing in Python? Or do we want to actually go look in the sequel data warehouse, see what they've got there, and maybe begin to sort of shift the boundaries a little bit between what we do inside the database and what we do in Python.
Tobias Macey
0:13:54
And in terms of the actual data warehouses themselves, you've already mentioned, click house and vertical, there are a number of other ones out there, such as snowflake, that's a recent addition. And then there are also various things that you could potentially consider in terms of data warehousing, better bordering the line with data lake such as the Hadoop system, or things that that the data bricks folks are doing with Delta lake. And I'm wondering what are some of the main axes along which the different data warehouses try to differentiate differentiate themselves and some of the considerations that somebody should be thinking about as they're trying to select a platform for moving forward with?
Robert Hodges
0:14:33
Sure, I think that's,
0:14:34
I think we can definitely divide the market up into into some important segments. So for example, you mentioned snowflake, that's a cloud data warehouse. And there, there's a family of data warehouses that are really cloud native, they are always going to, they may be tied to a particular cloud implementation. That would be the case with things like Big Query on Google Cloud, or redshift on Amazon, or snowflake, which can span clouds, but depends on the underlying cloud capabilities to work. So those are those data warehouses have some have great capabilities, they have very full sequel, they're well funded, they have very good sequel. Implementations, they also deal one of the things that particularly snowflake and bit query to very efficiently as they decouple the processing from the story from the from the storage. So for example, one of things I like about snowflake and their processing model is that you can have a big business, and you can have a couple organizations, they can each do queries on the same data. And the way that snowflake handles this is they spin up what are called virtual data warehouses, which are the compute part, each business unit will get their own virtual data warehouse. And they can go to town on this underlying data that they're reading without interfering with each other at all. So so that's that's one class, a data warehouse that I think is really important to look at. And I think where people tend to make choices in that direction is if I think probably the single biggest factor is has your business just decided to we're going to be on Google, if you are, then there's probably a pretty strong argument for looking very closely at Google Cloud because your Google Big Query, because you're already there, your data stored in object storage. So that's one big class. I think another important class of data warehouse is the traditional data warehouses like of which Veronica and Microsoft sequel server and an Oracle are sort of all, you know, sort of all play into this, I think the most interesting one is vertical. That was a very innovative column store, they're doing some interesting things in terms of separating compute and storage, they've now got a cloud version of it called Eon. So that's another one to look at. I think the trade they have good capabilities, I think the trade off there is they tend to be expensive to operate. And it's not just because their proprietary software with with expensive licensing, they tend to require pretty, pretty good hardware to run on. And then you have data warehouses, like click house, click house is kind of an interesting case, it's open source, Apache License, it's also more like my sequel, in the sense that it is, even though it's a column store, and has all these these properties that make it run very fast for reads, it's very simple to operate. And it's also very portable across a lot of a lot of different environments. So for example, we run click house on everything from bare metal, we have a lot of people who run their own, still run their own data centers, or lease space, to run their equipment to Kubernetes all the way to the AMS running in the cloud. So these are three different classes of data warehouse. And I think, depending on your use case, you know, where your price points are, you know, what, what is it you're looking at doing there, each of them has has virtues and also drawbacks,
Tobias Macey
0:17:37
and also just point out that you and one of your colleagues was also on my other podcast, talking a bit more deeply about click house. And I've also covered a number of these other technologies on that show as well, for anybody who wants to go further down the rabbit hole.
Robert Hodges
0:17:51
Exactly. That's a great rabbit hole we we like living down there. So absolutely, you know, sort of out in the open air for a little while this afternoon.
Tobias Macey
0:17:59
And so once somebody has selected a data warehouse, they've got it deployed. And now they're looking to actually populate it with data and be able to start integrate, again, integrating it into their overall workflow, what are some of the ways that the pie data ecosystem can help facilitate that process of getting data into it and making sure that all of the data is appropriately cleaned, and the schema is matching appropriately? And then also in terms of ways to ensure that the schema is optimized for the types of analytics that you might be doing? And just some of the other considerations going into the design and implementation of the data warehouse?
Robert Hodges
0:18:37
Yeah, that's an interesting question. So I can give an example from, you know, from our work on click house, which I think illustrates some of where Python fits in. So a lot of data in the in the real world is in is in CSV, so comma separated values. And it turns out that pandas is pretty nice for has good ability to read CSV, it's relatively tolerant of different formats. And so one of the ways that we I've actually seen customers and reading are using Python to help with ingest is that they will actually use pandas to read the CSV files, clean them up, write them to parquet, and then we have the ability to ingest parquet directly into click house. So that's an example of where Python is kind of there to help in terms of the data cleanup up front. I think more generally, one of the things and this is not true of everybody. But I think that in the systems where we see very large amounts of data being ingested, actually, I think what happens is that Python, you kind of stay out of the way, because for example, about half of the people that we work with ingest data from Kafka, so the quick house like other, like some other data warehouses can actually read cough cookies directly. So that's an example of where if where you don't actually need an intermediary, and if you want to do clean up, you'll actually wait till it's in the data warehouse, and then you often use sequel to clean it up. So and that's a very common, that's a very common you pattern, I think the other place where the other way that data gets loaded is that you read it from object store. So for example, if you if you use redshift, it has this great command called copy, which is used to read files, but it's been adapted in redshift, so that it can read data directly out of s3. So we're PI data with with a PI Data ecosystem would fit in his pie data might be the stuff up front, for example, that scraping the data off other online systems. So for example, I did an application with students from University of California Davis, where they built scrapers that would go search for prices for web, for web resources, like easy to on Amazon, they search for them on the web, they put them into s3, and then we would read them, then we would read them into the data warehouse from there. So if paid, it is involved, it's really at the front end of collecting the data and putting into object store. Those are two big ways that data gets into these systems at scale. And Python is sort of helping along the way.
Tobias Macey
0:20:56
And on the other end, once the data is already been loaded. And you're looking to do some sort of analysis, or maybe even trained some machine learning models based on the data that's there. What are some of the ways that an analyst or a data scientist might use for interacting with the data warehouse?
Robert Hodges
0:21:11
Yeah, I think that in this particular case, I can, I can really only speak to examples from click house, but but I think these are relevant for other databases. So for example, one of the ways that you can, one of the simple ways that you can interact with the data warehouse is you just fire up the Python Client, and, you know, sort of build a simple application on top of it. So when I start to analyze data, I typically start in a Jupiter notebook. And what I will use is there's two drivers that I can use. One is called the click house driver. So that's, that's a standard data database client, you make a connection, you call a method where you supply a sequel command, you get back a result set. And it's usually a one or two lines of code to pop that into pandas. So that's one way that you can get it. Another way is there's a sequel alchemy driver. So in sequel alpha me as is implemented pretty widely across across databases, you'll use that as a percent sequel magic function, that'll give you a result set, which again, you can, you can pop into pandas and begin and begin operating on. So so those are pretty typical ways to get data out of the data warehouse, I think where it gets interesting is, is where you start to explore, hey, how much can it can the data warehouse do more for me than just dump raw data? Can I actually do some of the things that I want to do in pandas? And, you know, do it in the data warehouse and save myself some time? So that's, that that's where you begin to dig into, okay, what's the What can I do in the data warehouse, it's actually going to save me time and in Python.
Tobias Macey
0:22:45
And one of the potential issues that you might deal with if you're just trying to do as you said, and just use the database is just a dumb storage layer and pull all the data back out into Python is running into the memory bottlenecks that we referenced earlier? So what are some of the strategies for working around those bottlenecks, particularly if you're dealing with large volumes of data, and just ways that the data warehouse can help supplement the work that you would be doing within Python? Yeah,
Robert Hodges
0:23:12
Yeah, I think when you start to frame it as, how do I work with the memory I have available, at that point you're asking the right question, because this is a problem that databases have fundamentally been occupied with for at least as long as I've been working with them, which is to say decades. What the data warehouse does is it basically allows you to access data that is vastly larger than the amount of memory that you have, and there are at least three different ways that you can think about using this. For example, if you need to do queries that only need certain columns, the data warehouse is going to answer those queries for you very efficiently. So instead of having everything in a pandas data frame that contains all of the data and all the rows, just think in terms of having only the columns that you're working with, because you can get them very quickly out of the database, and if you need more you can go back and ask for it. So thinking in terms of, let's go and isolate the columns we're working with, bring them in, and then operate on them in pandas, that's one thing we can do. Another thing that data warehouses can do very efficiently is downsampling. SQL has this great feature called the materialized view, and the idea with a materialized view is that it is a different representation of the data that is designed to be either smaller or more efficient, because of the way it's sorted or stored, than the original source data. So for example, if you're doing sampling off devices, or collecting prices in the market, this is essentially going to be time series data, and what you can do with a materialized view is downsample it, so that instead of getting a data point for every measurement, you reduce it to time segments, like 15-minute segments. This vastly reduces the amount of data that you collect. Moreover, with the materialized view, the database will basically do this downsampling at the time you ingest the data; it does it once. So if you then go to the view and ask for the data, you're going to get it back really fast, and you're going to get a much smaller amount of data that you can then operate on. I think what happens in pandas, if you just work off files, which is the way some of the pipelines work, is that you end up answering these questions again and again and again. So that's another important way that data warehouses can help.
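A sketch of what that kind of downsampling materialized view might look like in ClickHouse, issued from Python; the readings table, its columns, and the 15-minute bucket are assumptions for the illustration.

```python
from clickhouse_driver import Client

client = Client(host='localhost')

# Hypothetical raw table: one row per device measurement.
client.execute('''
    CREATE TABLE IF NOT EXISTS readings (
        device_id UInt32,
        ts DateTime,
        value Float64
    ) ENGINE = MergeTree() ORDER BY (device_id, ts)
''')

# Materialized view that downsamples to 15-minute buckets at ingest time,
# keeping a sum and a count so averages can be recomputed later.
client.execute('''
    CREATE MATERIALIZED VIEW IF NOT EXISTS readings_15m
    ENGINE = SummingMergeTree() ORDER BY (device_id, bucket)
    AS SELECT
        device_id,
        toStartOfFifteenMinutes(ts) AS bucket,
        sum(value) AS value_sum,
        count() AS sample_count
    FROM readings
    GROUP BY device_id, bucket
''')
```

Querying readings_15m and dividing value_sum by sample_count then gives the 15-minute averages without touching the raw rows again.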
Tobias Macey
0:25:49
And another Python tool that can help in terms of creating and maintaining those materialized views is the data build tool, or dbt, which will help ensure that you have some measure of testing and consistency as far as processing the source data and creating the materialized views from it. And then once you have a materialized view, or even if you're just dealing with the table without doing any additional processing on it, one of the ways that it can help, from what you were referring to before as far as pulling in windows of time segments, is that you can actually let the database handle the cursor and feed chunks of memory at a time, instead of having to implement that logic yourself in your Python code to say, okay, I'm only going to process this amount, then I need to shift this out and pull in the next bit. You can use the database as your working memory, pulling the piece that you actually care about for that particular computation and then putting it back into the data warehouse.
Robert Hodges
0:26:43
Absolutely correct. And in fact, I mentioned that there are three ways that data warehouses can help; this sort of gets to the third. Particularly when your data sets are very large, you can think of the raw data as existing in a window. So, for example, when you're collecting web logs, or perhaps temperature sensor data, the data that you collect that's newest is the most interesting, and what tends to happen is that after a certain period of time that data becomes less interesting to keep around. So the data warehouse can do what you're describing at a very large scale, in the sense that you can put a time to live on data. This is a common feature in many data warehouses, so that your raw data will actually time out after some period of days or weeks, or whatever you choose, and it just goes away. The database does this automatically, so it's maintaining a window of the raw data that you can jump in and look at without having to do anything special, like complex pipelines or a lot of logic. The other thing is that with materialized views and other techniques you can downsample, and you can keep those around for much longer periods of time. So when you combine these together, the database itself is holding your data in an optimized way that in effect creates these different granularities of data that you can look at. And then finally, to the point that you were making, database connectivity APIs are very good at streaming data. For example, in ClickHouse the wire protocol streams, so you tend to get data in chunks. If you write your Python code well, you can basically get a piece at a time, process it, throw the memory away, get the next piece, and so on and so forth. This kind of buffering of data is something that the connectivity APIs have been doing since the late 80s; databases are really good at it and very well optimized for this problem.
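Two sketches of the ideas above, again with hypothetical table and column names: a TTL clause that expires raw rows automatically, and the clickhouse-driver streaming interface that hands a result back in chunks instead of one giant buffer.

```python
from clickhouse_driver import Client

client = Client(host='localhost')

# Raw data times out automatically after 30 days; the table and the
# retention period here are just examples.
client.execute('''
    CREATE TABLE IF NOT EXISTS weblogs (
        ts DateTime,
        url String,
        status UInt16
    ) ENGINE = MergeTree() ORDER BY ts
    TTL ts + INTERVAL 30 DAY
''')

# Stream a large result set block by block instead of loading it all at once.
rows = client.execute_iter(
    'SELECT ts, url, status FROM weblogs WHERE status >= 500',
    settings={'max_block_size': 100000},
)
error_count = 0
for ts, url, status in rows:   # each block is fetched as it is consumed
    error_count += 1
print(error_count)
```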
Tobias Macey
0:28:43
And another case where the data scientist or the analyst might be leaning on Python is for being able to interact with tools such as TensorFlow or PyTorch for building machine learning models, which generally require a fairly substantial amount of training data. And I'm wondering how the data warehouse fits into that flow versus some of the other ways that they might be consuming the training information for building the models, either in these deep learning neural networks or in some of the more traditional machine learning algorithms that they might be leaning on scikit-learn for.
Robert Hodges
0:29:14
I think that's the place where we actually see the biggest gap in the technology right now. If you look at the state of the art, particularly among the people that I speak with, who in some cases deal with very large data sets, what they're typically doing is pulling this data out into Spark and doing their machine learning there, or they're pulling it out and doing it in TensorFlow. So you basically are taking the raw data, or maybe downsampled aggregates that are in the data warehouse, and you're just copying them out and running the machine learning on it, for example training models, in which case you would just drop the data after you're done, or executing models, in which case you would score the data and maybe put it back in the database. So there's still a pretty big gap there. As we go forward, I'm seeing two things which are really pretty interesting. One is that you're beginning to see the data warehouses actually put at least basic machine learning into the data warehouse itself. Google BigQuery is doing that: where databases traditionally have a CREATE TABLE command, they have a CREATE MODEL command, so you can begin to run basic machine learning models there. The other thing that's happening is we're starting to see the emergence of other ways of sharing data between data warehouses and machine learning systems. These are represented in projects like Arrow, which is a columnar memory format that is pretty accessible from Python, and that's a project that's being driven by Wes McKinney, who's also the creator of pandas. The idea there is that we're going to have common memory formats to make transfer of data simpler, but also open up the possibility that we can communicate through things like shared memory, as opposed to having TCP/IP streaming to move the data. So I think those are a couple of interesting things that are happening, but it's still very basic, and I think there's a lot more that we need to do to move that forward.
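For a concrete sense of what Arrow provides, here is a small sketch with pyarrow, converting between a pandas DataFrame and an Arrow table and writing it to the Feather on-disk format; the column contents are made up for the example.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather

# A toy frame standing in for query results or training data.
df = pd.DataFrame({'device_id': [1, 2, 3], 'value': [0.5, 1.25, 2.0]})

# Convert to Arrow's columnar in-memory format.
table = pa.Table.from_pandas(df)

# Hand the same columnar data to another process or system via a file
# (or via shared memory / IPC streams in more elaborate setups).
feather.write_feather(table, '/tmp/readings.feather')

# Read it back without going through a row-oriented wire protocol.
df_again = feather.read_feather('/tmp/readings.feather')
print(df_again)
```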
Tobias Macey
0:31:12
Another interesting development too is things like what Microsoft is doing with embedding the Python runtime into Microsoft SQL Server, to try to bring the compute directly in line with the data so that you don't have to deal with the wire transfer and the serialization and deserialization itself, right?
Robert Hodges
0:31:30
That's a really interesting development. In fact, I think that's one of the reasons why I'm very interested in Arrow. We call that kind of operation a UDF, a user-defined function, and databases like Postgres have had this for a long time: the ability to hook in, for example, C routines, but also Java and Python and other things like that. The problem that I see with this is that if you do it in a naive way, where for every row you go and call into Python, it's just horribly inefficient, because the way that data warehouses operate on data efficiently is that everything's an array; they basically break up the array into pieces, they farm it out onto all the cores that are available, and they just go screaming through this data to process it as quickly as possible. If you have to turn around for every value and call something in Python, that doesn't work. I haven't used the Microsoft SQL Server capabilities, but I think what we need is something that allows you to deal with data at the level of blocks. That's where I think something like Arrow comes in: something that combines the capabilities of Arrow, where you can actually share the data formats, with the way that, for example, ClickHouse processes materialized views, where when we populate a materialized view we don't do one row at a time, we do thousands or hundreds of thousands of rows at a time. So you have to think in terms of processes that allow you to do these operations on very large amounts of data using a streaming processing model that lets you get to the data really quickly.
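The performance point about per-row calls versus block-at-a-time processing can be illustrated entirely in Python with NumPy; this is just a toy comparison, not how any particular database implements UDFs.

```python
import numpy as np

values = np.random.rand(1_000_000)

def score(x):
    # Some per-value computation a UDF might perform.
    return x * 2.0 + 1.0

# Row-at-a-time: one Python call per value, the pattern that gets slow.
row_at_a_time = [score(v) for v in values]

# Block-at-a-time: one call over the whole array, the vectorized pattern
# that column stores and Arrow-style interfaces are built around.
block_at_a_time = values * 2.0 + 1.0

assert np.allclose(row_at_a_time, block_at_a_time)
```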
Tobias Macey
0:33:11
And so in addition to things like machine learning, where we don't have an excellent solution yet, what are some of the other limitations of data warehouses in the context of the Python data ecosystem, as far as being able to run analyses or some of the other data science workflows that somebody might be trying to perform?
Robert Hodges
0:33:31
Well, I think the biggest thing I see is that there's just such a richness of things that you can do in NumPy and pandas that still aren't fully supported in data warehouses. Most mature data warehouses will do a pivot, but for example ClickHouse doesn't do a full pivot the way that pandas does it; pandas makes this so easy to do. And pandas, and Python in general, has a wealth of statistical functions. There are certain databases, for example KDB, a column-oriented store, or sort of a data warehouse, that is renowned for having a very rich set of functions, but in general the databases don't have the richness of statistical functions that you find in the PyData ecosystem. I think that creates a problem, because if you look at what people are doing with machine learning and data science, it's fundamentally driven by statistics. So if you don't have those statistics in SQL, you actually can't farm work out to SQL very easily; you either have to have user-defined functions and write them yourself, and as I mentioned there's a lot of inefficiency there. So I think this is a big gap. This is where we need to look over and see the good things that are being done in Python and actually pull more of that stuff into the database and make the SQL implementation richer.
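As a small example of the kind of operation that is a one-liner in pandas but awkward or missing in many SQL dialects, here is a pivot alongside some of the statistical helpers; the columns are invented for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'region':  ['east', 'east', 'west', 'west', 'west'],
    'product': ['a', 'b', 'a', 'a', 'b'],
    'sales':   [10.0, 12.5, 8.0, 11.0, 9.5],
})

# A full pivot with an aggregate per cell: one line in pandas.
pivot = df.pivot_table(values='sales', index='region',
                       columns='product', aggfunc='mean')
print(pivot)

# A taste of the statistical functions that come along for free.
print(df['sales'].describe())
print(df.groupby('region')['sales'].agg(['mean', 'std', 'median']))
```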
Tobias Macey
0:34:58
And in terms of overall trends in the industry and in the Python community, what are some of the things that you see going forward that you're excited about, as far as the tighter integration of the PyData stack and data warehouses, and any challenges that you anticipate us running into as we try to proceed along that path?
Robert Hodges
0:35:17
Yeah, I think moving models into the database is a very interesting development, but I'm a little bit skeptical that databases are going to be able to do this as well as Python does. The reason is that if you go to scikit-learn and look at the models, they have a raft of parameters that you need to be able to twist and turn to enable the model to not be overfitted, not be underfitted, and work effectively. The database implementations that I've seen have a very limited parameterization. For example, with BigQuery, as far as I can tell, you give it the data to train the model and it kind of does it, but you don't actually know quite what it did to train the model; you don't have full access to the hyperparameters that you would want to adjust to make the model work well. I think that's a problem, and it's not clear to me that the data warehouses are going to solve it. One of the really interesting questions that we all have to work on is how we can take advantage of the fact that there are very powerful machine learning and deep learning systems that exist outside the database. How can we combine them? So instead of thinking about data being pushed into the database, think about how these systems can work together. For example, one of the ClickHouse users that I know, their biggest desire is to be able to take a row of data, pass it to a model for scoring, and write that data back into the database, all in a single operation. That's what we have to focus on to really join these two worlds together, problems like that, and we have to think about how to do it at scale. I think there are some very interesting things we can do if you have data spread across many nodes. In many ways this is not unlike what we had in the Hadoop file system, where you had data spread across a bunch of disk spindles and you could send the processing down to go grab the data off those disks. So I think there are things we can do, but there's definitely some work to do to implement those ideas and really join these two worlds together in an efficient way.
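A rough sketch of the "score rows and write them back" pattern described above, assuming a running ClickHouse server and hypothetical observations and scores tables; the model here is a stand-in trained on random data purely so the example runs.

```python
import numpy as np
import pandas as pd
from clickhouse_driver import Client
from sklearn.linear_model import LogisticRegression

# Stand-in model trained on random data purely so the example runs;
# in practice this would be a real model loaded from disk.
model = LogisticRegression().fit(np.random.rand(100, 3),
                                 np.random.randint(0, 2, 100))

client = Client(host='localhost')

# Pull only the feature columns we need into pandas.
rows, cols = client.execute(
    'SELECT id, f1, f2, f3 FROM observations',
    with_column_types=True,
)
df = pd.DataFrame(rows, columns=[name for name, _ in cols])

# Score outside the database, then write the results back in one batch.
df['score'] = model.predict(df[['f1', 'f2', 'f3']]).astype(float)
client.execute(
    'INSERT INTO scores (id, score) VALUES',
    list(df[['id', 'score']].itertuples(index=False, name=None)),
)
```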
Tobias Macey
0:37:48
For anybody who wants to dig deeper into this space. What are some references that you found useful that you'd recommend?
Robert Hodges
0:37:55
That's an interesting question. I'm sort of into academic papers, and I think if you want to understand what a data warehouse is, and you're reasonably familiar with data, one of the quickest ways to get up to speed is to go read the C-Store paper by Mike Stonebraker and a bunch of database stars; there are about 14 different people on the paper. That paper basically described what later became Vertica: it was a column store, it built on things that had already been done in the 90s, and then it introduced a bunch of other interesting things. I think that gives you a notion of how data warehouses work and the kinds of things that they can do. Beyond that, the simplest thing is to go try them out. ClickHouse, for example, is very easy to use: if you're running on Ubuntu, you say apt install clickhouse-server and it comes down. There are other systems that you can try out as well. For example, Redshift on Amazon was a real groundbreaker in terms of ease of use, just being able to click a few buttons and have a data warehouse spin up, and BigQuery is the same way. If you're on Azure, you can do similar things with the Microsoft equivalent based on Microsoft SQL Server. So I think just going and trying this stuff is probably the best thing to do, and just begin to understand how you can actually use these systems. They're very accessible at this point.
Tobias Macey
0:39:28
And are there any other aspects of the overall space of the PyData ecosystem and data warehouses in general that we didn't discuss yet that you'd like to cover before we close out the show?
Robert Hodges
0:39:38
I think another place where there's an interesting long-term integration is how we deal with data warehouses and GPU integration. For example, one of the reasons you go to TensorFlow is that it has a much more efficient processing model and way more compute than we can get to when running on conventional hosts. Data warehouses today are optimized for I/O; they don't necessarily have GPU integration. So I think that's another interesting case: if data warehouses begin to be able to take advantage of GPUs, that may open up some other interesting opportunities for adjusting the split of where processing happens. So that's another thing we're definitely looking at with a lot of interest.
Tobias Macey
0:40:22
Yeah, I know that there's been some movement in that space with things like Kinetica, which is a GPU-powered database, but I haven't looked closely at it myself to be able to give any deep details on that.
Robert Hodges
0:40:33
Right, there's a database called MapD, which I believe is now called OmniSci. That's a really interesting space. And I do want to say that I've been sort of focused on moving things into the database, but there's just as much that PyData people can learn from what's going on in the data warehouse, because some of what's happening in the data space is that people are beginning to recapitulate solutions to problems that data warehouses solved a long time ago: how to distribute data, how to turn it into an embarrassingly parallel problem so you can get results very quickly. On the flip side, I think people in databases need to be looking very hard at the ease of use and the wealth of operations that you can perform using the PyData ecosystem modules, you know, NumPy, pandas, seaborn, scikit-learn. There's just so much stuff there, and I think this is something that we can really learn from.
Tobias Macey
0:41:32
And for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week I'm going to choose a book that I've been reading called Foundations for Architecting Data Solutions. It's an O'Reilly published book, and it's been great for just getting a high level overview of the things you need to be thinking about if you're trying to plan out a new data infrastructure. So I definitely recommend that for somebody who's interested in getting more into this space. And with that, I'll pass it to you, Robert. Do you have any picks this week?
Robert Hodges
0:42:03
I think my pick is going back and reading old papers. As I say, the C-Store paper I read preparing for the show; I love reading that paper, it's just a really, really great thing to read. Beyond that, the books that really interest me are things like Python Machine Learning by Sebastian Raschka, which is now in its second edition, which I got a little while ago. It's not something you can read all at once; I just keep going back to it, and whenever I have time I go look at what he has, work the exercises, and just try to keep learning more and more about how to deal with data in Python.
Tobias Macey
0:42:37
Well, I appreciate you taking the time today to join me and share your interest and experience in the cross section of data warehouses and the PyData ecosystem. It's definitely an interesting cross section and an area that I am excited to see more development in. So thank you for your time and all of your efforts on that front, and I hope you enjoy the rest of your day.
Robert Hodges
0:42:57
Yeah, thank you, Tobias. It's always great being on your show.
Tobias Macey
0:43:02
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast, at dataengineeringpodcast.com for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it; email [email protected] with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.

AI Driven Automated Code Review With DeepCode - Episode 226

Summary

Software engineers are frequently faced with problems that have been fixed by other developers in different projects. The challenge is how and when to surface that information in a way that increases their efficiency and avoids wasted effort. DeepCode is an automated code review platform that was built to solve this problem by training a model on a massive array of open sourced code and the history of their bug and security fixes. In this episode their CEO Boris Paskalev explains how the company got started, how they build and maintain the models that provide suggestions for improving your code changes, and how it integrates into your workflow.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host as usual is Tobias Macey and today I’m interviewing Boris Paskalev about DeepCode, an automated code review platform for detecting security vulnerabilities in your projects

Interview

  • Introductions
  • Can you start by explaining what DeepCode is and the story of how it got started?
  • How is the DeepCode platform implemented?
  • What are the current languages that you support and what was your guiding principle in selecting them?
    • What languages are you targeting next?
    • What is involved in maintaining support for languages as they release new versions with new features?
      • How do you ensure that the recommendations that you are making are not using languages features that are not available in the runtimes that a given project is using?
  • For someone who is using DeepCode, how does it fit into their workflow?
  • Can you explain the process that you use for training your models?
    • How do you curate and prepare the project sources that you use to power your models?
      • How much domain expertise is necessary to identify the faults that you are trying to detect?
      • What types of labelling do you perform to ensure that the resulting models are focusing on the proper aspects of the source repositories?
  • How do you guard against false positives and false negatives in your analysis and recommendations?
  • Does the code that you are analyzing and the resulting fixes act as a feedback mechanism for a reinforcement learning system to update your models?
    • How do you guard against leaking intellectual property of your scanned code when surfacing recommendations?
  • What have been some of the most interesting/unexpected/challenging aspects of building the DeepCode product?
  • What do you have planned for the future of the platform and business?

Keep In Touch

Picks

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat

Links

The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA

Click here to read the raw transcript...
Tobias Macey
0:00:15
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O'Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to pythonpodcast.com/conferences today to learn more about these and other events, and take advantage of our partner discounts to save money when you register. Your host, as usual, is Tobias Macey, and today I'm interviewing Boris Paskalev about DeepCode, an automated code review platform for detecting security vulnerabilities in your projects. So Boris, can you start by introducing yourself?
Boris Paskalev
0:01:47
Hi, my name is Boris Paskalev. I'm CEO and co-founder of DeepCode. We're currently based in Zurich, Switzerland.
Tobias Macey
0:01:55
And so can you start by explaining a bit about what the DeepCode project is, and some of the story of how it got started?
Boris Paskalev
0:02:01
So ultimately, what DeepCode does is learn from the global development community: every single issue that was ever fixed and how it was fixed. It combines this knowledge, almost like crowdsourcing development knowledge, to prevent every single user from repeating mistakes that are already known. In addition, we actually have predictive algorithms to understand issues that may not have been fixed yet but could appear in software development. As for where we started: the idea started with the other two co-founders, who spent more than six years researching the space of program analysis and learning from big code, which means the billions of lines of code that are available out there. They did that at ETH Zurich, which is what we call the MIT of Europe, and they are some of the foremost experts in the world in that space, with hundreds of publications in the area. When they finished the research and our CTO published his PhD, we decided it totally made sense to build it into a platform and revolutionize how software development works.
Tobias Macey
0:03:10
And was there any particular reason for focusing specifically on security defects in code and how to automatically resolve or detect them?
Boris Paskalev
0:03:19
Actually, security was a later add-on; we did that in 2019, which just started this year, and we published a specific paper on it. The platform itself is not targeting anything specific; any issue that's been fixed, be it a bug, performance, you name it, can be detected. Security was just a nice add-on feature that we added, and it was pretty novel as well.
Tobias Macey
0:03:43
So in terms of the platform itself, can you talk a bit about how it's implemented, the overall architecture for the platform, and how it interacts with users' code bases?
Boris Paskalev
0:03:54
So pretty much what it does is two steps, both for learning and for analyzing code. The first step is we take your code and analyze it quickly: we use standard parsing of each language, and then we do a Datalog extraction of semantic facts about the code to build a customized internal representation of the various interactions of every single object, how the object propagates, how it interacts with functions, gets into other objects, how they change, etc. This knowledge represents pretty much the intent and how the program functions. Then we do that for every single version of the program, so we see over time, when people commit code and change code, how that changes, and that gives us the delta, what is changing and how people are fixing things. Then we learn, extremely fast, over hundreds of thousands of repositories, obviously billions of lines of code, and we identify trends. This is where our machine learning kicks in: it identifies trends in how people fix things, what the most common things are, whether there are specific weird cases, etc. And this is how we have the scalable global knowledge, as we call it.
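To make the general idea of parsing code and extracting facts about it concrete, here is a deliberately tiny sketch using Python's built-in ast module. It only records which functions get called and with how many arguments; it is not DeepCode's representation, which per the discussion tracks object propagation and data flow, and the sample source is invented for the illustration.

```python
import ast

source = '''
def handler(request):
    data = request.json()
    items = sorted(data["items"])
    return len(items)
'''

class CallCollector(ast.NodeVisitor):
    """Record which functions get called and with how many arguments."""

    def __init__(self):
        self.facts = []

    def visit_Call(self, node):
        # Recover a printable name for the thing being called.
        if isinstance(node.func, ast.Name):
            name = node.func.id
        elif isinstance(node.func, ast.Attribute):
            name = node.func.attr
        else:
            name = '<dynamic>'
        self.facts.append((name, len(node.args)))
        self.generic_visit(node)

collector = CallCollector()
collector.visit(ast.parse(source))
print(collector.facts)   # [('json', 0), ('sorted', 1), ('len', 1)]
```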
Tobias Macey
0:05:03
For the languages that you're currently supporting, I noticed that you're focusing, at least for the time being, on Python and JavaScript, and I believe there are one or two others. I'm wondering what your criteria were for selecting the languages that you were targeting for evaluation and automated fixing, and some of the other languages you're thinking about targeting next.
Boris Paskalev
0:05:23
Yep. So pretty much we started with the most popular languages out there. I mean, there are different charts, but the usual suspects are obviously Python, Java, JavaScript, then following down that line we're looking at C#, PHP, and then C and C++, and on down the list. We're getting more and more requests for various languages, so it's a combination of the ranking and popularity of the language, as well as specific customer requests, specifically big companies asking for very specific ones.
Tobias Macey
0:05:52
Given the dynamic nature of things like Python and JavaScript, I'm wondering what some of the difficulties are that you face as far as being able to statically analyze the languages and avoid any cases where there might be things like monkey patching going on, or maybe some sort of code obfuscation.
Boris Paskalev
0:06:12
Yeah, so since we're not doing the typical static analysis here, we're actually doing a static semantic analysis, and we do that in context. That allows us to go much deeper. For example, if you have a particular object, and then you put it into an array, and then the object comes out, we still know that it's the exact same object. So that gets us closer to a dynamic analysis as well. Those are some of the features that allow us to analyze and identify much more complex issues, closer to a dynamic or interprocedural analysis, if you will. This lets you get much higher accuracy and not have the false positives that other tools will throw at you there, as well as identify issues that classical syntactic static analysis would not be able to see at all.
Tobias Macey
0:07:02
Another thing that can potentially complicate that matter is the idea of third party dependencies and how they introduce code into the overall runtime. I'm wondering how you approach that as well, particularly as those dependencies are updated and evolve.
Boris Paskalev
0:07:17
Pretty much, for dependencies we scan the dependency's code if that code is included in your repository. There are many other services out there that keep a list of dependencies and their versions and which ones might have issues or not; we don't do that, because that's pretty much a static database lookup. But we do look at how you actually call specific APIs. So if you have a dependency and you're calling some kind of function from it, we will identify how you're calling the function, telling you whether you're calling it in the right way, or that the third parameter you're passing is not the right one, etc. But specifically which dependencies you incorporate, we don't look at. I mean, we can tell you that you're importing something more than once, or importing something you're not using; things like that we have as well. That's the scope that we go into.
Tobias Macey
0:08:08
Another thing that introduces complexity is when languages themselves evolve and introduce new capabilities or keywords. I'm wondering how you keep up with those release cycles and ensure that your analyzers and recommendation engines stay up to date with those capabilities, and then also, on the other side, ensuring that any recommendations that you provide in your code reviews match the target runtime for the code base as it stands. So for instance, if somebody's Python project is actually using Python 2, that you don't end up suggesting fixes that rely on Python 3 features.
Boris Paskalev
0:08:44
So the first one, when languages change and evolve, which is again pretty common these days, there are two things. First of all, is the parser supporting the new feature? We have to get the latest version of the parser, and if it's supported, that's great; if the parser is not supporting it, then we have to do our own extensions until the parsers start supporting it, because we pretty much use standard parsers with minimal extensions, only when needed. The second is, if there's something fundamentally different about the language, that's where we might actually have to extend our internal representation to support it, but that has to be something really fundamental; we rarely see that for existing languages, that's more something that happens if you add a new language. So those are the two major branches when something new comes in, but for the majority of things there's very little that we have to do beyond updating to the latest parser. On the second question that you asked, about Python 2 versus Python 3: we don't specifically differentiate that, but if we give you suggestions dedicated to Python 3, saying you have to be doing this, and you're on Python 2, you can just say to ignore those suggestions. You can actually create a set of rules saying, okay, these are all the rules that are Python 3 specific, just ignore them; you can put that into a config file, and until you migrate to version 3 you can just ignore those rules.
Tobias Macey
0:10:07
And it also gets a little bit more difficult within Python 3 versions. For instance, if your code is targeting Python 3.5, you don't want to suggest fixes that incorporate things such as f-strings or data classes. I'm curious how you approach that as well, or if it's more just based on what the user specifies in their config as far as the runtime that they're using.
Boris Paskalev
0:10:31
That's a great point. We don't have anything very strong in that space. What helps there is that all the suggestions we provide are contextually based, so usually we can see what's happening before and after specific issues, and if they're version specific then often you will not get the recommendation, because the context looks different. That doesn't cover all the cases, obviously; I think you're right to ask that question, and we don't have a great solution for it. We leave it to the developer to, when they see a suggestion, say, nope, I don't care about that. Clearly, as I said, we can do the ignore rules. But those changes are rare. They do happen, and we've seen cases where the developer says, yeah, I don't care about this yet, I haven't updated, and that happens. But because our learning is automated, it gets the learnings from the latest version, so as a large percentage of the development community moves to the latest version and makes changes related to it, you'll be getting suggestions for that as well.
Tobias Macey
0:11:27
Can you describe a bit more about the overall workflow for somebody who's using DeepCode and how it fits into their development process?
Boris Paskalev
0:11:34
Yep. So the most standard one that we envision, and that we see is most popular out there, is as a developer tool that lives on Git. So pretty much you log in with your Git account, GitHub, Bitbucket, whatever that is, you see a list of the repositories that you want to analyze, and you subscribe them. Once a repository is subscribed, you get two things. First, every time you do a pull request we analyze it and tell you whether, in this diff, you are introducing any new issues. So that's number one, continuously monitoring the new code being generated. The second piece is continuously monitoring your old code, because old code also ages: as the development community changes, new security vulnerabilities are uncovered, etc. Something that you wrote two years ago may actually not be secure anymore, and you want to get pinged for that, because very few people actually go back and look at code from two years ago. So that will give you a ping as well, saying, hey, this function here, the code has to be updated to a new encryption scheme, for example, to make sure it's secure. Those are the two major pieces; again, it fully lives in Git. In addition to that, we offer an API and a command line interface, so you can really integrate our solution anywhere you want. It could be part of continuous integration; we actually have that in GitHub already, so that once you finish the pull request, before the merge it can tell you, hey, we analyzed it, there's no critical stuff, please proceed, or, there's one critical issue, look at it. The API and command line interface allow you to script, within minutes, a checker at any point in your workflow, because developers in different companies or setups have very different development workflows, and they might want it at different stages: if you have a QA team, continuous integration, continuous delivery versus individual builds every day or month, whatever that is.
Tobias Macey
0:13:22
And then in terms of the model itself, can you describe a bit about the overall process that you're using for training, and some of the inputs that you use as far as curating the projects that you're using as references, to ensure that they are of sufficient quality and that you're not relying on something that is maybe using some non-standard conventions?
Boris Paskalev
0:13:44
Yep. So two points on this. We do have a custom curation; it takes into account a lot of different things: how active the project is, how many contributors, how many stars, etc. That's continuously updating. And this is mainly done because there are a lot of projects on Git that haven't been touched for two years, or have only one developer who never touches them, so there's a long tail of such projects and we just don't want to waste time analyzing them. The machine learning automatically weeds out the kind of poison pills, in a way, like a random developer who fixed something in the wrong way. This is where the probability we assign to every single suggestion comes in, which is based on how many people fixed it this way, whether there is a trend of a lot of people fixing it, how many counterexamples there are, and how many such issues actually exist in the open source community today. Based on that, we can automatically weed out issues, because when you fix something wrongly, it's very unlikely that many people have fixed it the same wrong way. That only happens, for example, if somebody publishes a wrong solution and nobody catches it, and that can happen for, like, one or two weeks, but usually it gets resolved immediately, and then our knowledge base automatically updates.
Tobias Macey
0:14:58
in terms of the amount of domain expertise that's necessary for identifying those faults that you're trying to detect. I'm curious if you're using sort of expert labeling techniques, where you have somebody going through and identifying the faults that were created and the associated fixes, or if you're relying on more of an unsupervised learning model for being able to build the intelligence into your engine.
Boris Paskalev
0:15:23
So it's mainly unsupervised learning. We do have some labeling, which is based on how severe the issue is, so we have a categorization of critical, warning, and info types of suggestions. We have to actually categorize which ones are critical, and that's where our team comes in, but that's per type of issue, so within two hours you can label hundreds of thousands of different suggestions. It's a pretty quick process with very minimal supervision. Everything else is pretty much fully automatic: we automatically detect the type of issue, is it security, is it a bug, is it performance, etc. We use a number of techniques there; we have NLP on the commits, and we obviously look into the specific code and semantically what it does, because we have a predictive algorithm that infers the usage of specific functions and objects, so we actually know what they're doing and in what setting they're being used.
Tobias Macey
0:16:20
And you mentioned that for the pull request cases you're relying on parsing the diffs of what's being changed. I'm curious if there are cases where the diff just doesn't provide enough context about the overall intent of the code, and any approaches that you have for being able to mitigate potential false positives or false negatives where you missed something because the code is only changing maybe one line, but you need the broader context to understand what's being fixed.
Boris Paskalev
0:16:50
Ah, okay, so maybe I didn't clarify that correctly. We do analyze the whole tree; we always do the full analysis. But usually the semantic changes are only within the diff, and we actually show you what it is. So if a change that you make on this line of code is causing a security issue somewhere else, we will absolutely catch that. We cannot analyze anything smaller than that, because our internal representation requires the context of what's happening, so we have to analyze every single function and procedure to see what it does. So we analyze everything, but usually the changes are happening in the diff, because that's where the focus is, though the issue could come from a different part of the code base as well. In terms of the false positives and false negatives you mentioned, there are a number of techniques to lower those. We have kind of a record high accuracy rate compared to any of the existing tools today, and that's mainly based on contextual analysis, so we actually know in which cases the problem is there, and on the fact that we usually have thousands of examples. So it's pretty accurate, and we're not doing a syntax-based comparison but a semantic one. We're not just looking at what you're doing on the specific lines of code, because without knowing the semantic details about it you could be very wrong; looking at it semantically gives you a considerably higher accuracy rate.
Tobias Macey
0:18:12
And in terms of identifying those false positives and false negatives, if you do identify maybe a false positive, is there any way for the users to label it as such, so that it can get fed back into your machine learning models and you can prevent that from happening in the future, and just any other sort of feedback mechanisms that are built in for users to feed that back into your model to improve it over time?
Boris Paskalev
0:18:38
Yep, so we have two ways. First of all, you can ignore rules on your own: you can say, hey, this rule I don't like, and you can decide if you want to do that for a project or in general. And the second is you can actually give a kind of thumbs up or thumbs down with a comment saying, yeah, I don't like this because of blah. Those are the two main mechanisms. And clearly for open source we get the feedback automatically, whether an issue was fixed or not, and as I said earlier we look at how many of these issues exist in the code bases out there and how many of these types of issues have been fixed, which is part of our probability assessment of whether an issue should actually be flagged or not.
Tobias Macey
0:19:18
And in terms of the code that you're analyzing, I'm wondering, again, how that feeds back into your models, particularly in the case where somebody might be scanning a private repository and there is some sort of intellectual property in terms of algorithms or anything along those lines, and preventing that from getting fed back into your model so that it gets surfaced as a recommendation on somebody else's project.
Boris Paskalev
0:19:42
Yep. So we do not learn from private code; it does not become part of the public knowledge. We have a special function where you can learn from your private code, and that becomes your own knowledge; that's usually for larger companies with large code bases. When we analyze your code, we don't learn from that code; we learn from open source repositories. And depending on the licensing, there are some open source repositories that you can see but cannot use, so for those we are never going to create the suggestion examples from them. We will still count them toward how many times we've seen that issue and whether it's been fixed, but they are never shown as an example; the fix examples will only come from fully open source projects.
Tobias Macey
0:20:27
And in terms of the overall challenges, or anything that was particularly interesting or unexpected that you've come across in the process of building and growing the DeepCode project and the business around it, what has been notable in your experience?
Boris Paskalev
0:20:45
Wow, that's an interesting question. I think the one that is most striking is the number of different technologies and innovations that we have to build. We create new versions of the platform a lot; we're literally about to release a new one in a matter of weeks, and we've released it to some pilot customers already, and it considerably increases the coverage while maintaining the same high accuracy. So yeah, we really have to come up with new things all the time. Half of our team is focused on inventing new stuff; we publish about half of it, because those are pretty interesting findings, and the rest we keep internal because obviously it's proprietary, though over time it comes out. So it's really the sheer volume of new things that you have to build. There are so many modules that when our CTO starts drawing the whole picture it takes hours, since it's a bunch of small boxes and each one on its own is a different innovation that came up. That's really interesting, and I was not expecting it two years ago when I started looking into this. When I look at it today, we're still doing a lot of that, and when I look at the roadmap there are a lot of new things coming in this space as well. So that is quite interesting, and it explains why there has never been a platform so far that really goes deep into understanding code in that way and is able to learn from such a large set of big code out there in an extremely fast way.
Tobias Macey
0:22:12
In terms of the platform itself and its capabilities, what are some of the overall limitations, and some of the cases where you might not want to use it, or might want to avoid some of the recommendations that it comes out with, just because of some of the artifacts of the code that you're trying to feed through it?
Boris Paskalev
0:22:30
Sure, good question. So, no limitations in general: it's fully scalable and can support any language; that's the beauty of the architecture. A specific area where you wouldn't want to use it, we haven't found one yet. Ultimately, that comes down to the basic building blocks. Maybe when we start delivering more higher-level architectural analysis some of those spaces might come up, but that's still to come. For the basic building blocks, finding bugs and issues in your code, we haven't found any specific areas. I mean, some projects may have a little bit higher false positive rate than others for specific reasons, as you mentioned with the Python version, for example using Python 2 and being given a lot of Python 3 suggestions, but other than that there is nothing industry, language, or focus specific.
Tobias Macey
0:23:16
And another potential challenge is cases where the code base itself is quite large. I'm wondering if you've run into any issues where you've hit an upper limit in terms of your deployed platform being able to parse and keep the entirety of that structure, semantically, in the working set, and any strategies that you've developed to work around that.
Boris Paskalev
0:23:40
The platform is designed to literally handle anything: millions of lines of code in seconds. I mean, think about it, we are learning from billions of lines of code, and in order to do that efficiently we've built some pretty efficient algorithms. So we haven't seen any issues, and we've analyzed some pretty large code bases. On average, when I compare it to other tools, we tend to be oftentimes a hundred times faster in the analysis. So yeah, I think scalability is definitely not an issue. It did happen a couple of times that we ran low on hard disk space because of caching, but since we're in the cloud it was pretty fast to add a lot more.
Tobias Macey
0:24:21
Yeah, I was just thinking in terms of the sizes of some of the monorepos for the Googles and Facebooks of the world, where it takes potentially hours to actually clone the entire history of the project, and some of the workarounds that they've had to do. But I'm sure that's the sort of one-tenth of one percent case where code is even of that scale. I was just curious if you had ever run into something like that.
Boris Paskalev
0:24:47
You're right, the cloning is the slow part. For those large repositories, cloning usually takes a while, and then the analysis is much, much faster in our case. So we actually separate those steps out, so people know which part is slow. But yeah, cloning is sometimes fast, sometimes slow, especially depending on the network in the cloud and how many people are on it, but then the analysis is much, much faster than the cloning.
Tobias Macey
0:25:13
What are some of the other user experience tweaks that you've ended up having to introduce to improve the overall reception of your product and make sure that users are able to take full advantage of it?
Boris Paskalev
0:25:26
I mean, the areas where we've tweaked a little bit are specifically explanations, trying to actually explain to the customer what the issue is; we actually had to release another new engine just for that, because people were saying, yeah, that's a bit confusing. So we had to build on the UI perspective as well, so people understand what's going on. There's also obviously work in progress on the website, specifically explaining to customers that their code is secure, that we don't use it, that we're not going to display it, as you rightfully asked, to other customers, that we're not going to use it for anything else, and that we're not going to store it. There are other companies that have had issues with that, so we're very diligent about it. But yeah, those are the major areas there.
Tobias Macey
0:26:08
And looking forward, what are some of the features or improvements that you have planned for the platform and for the business?
Boris Paskalev
0:26:16
So the key one, and our main internal KPI for this year, is the number of actual issues, the recall, that we can find. As I mentioned, that's coming up very soon, so expect something like a four to five times increase in the number of issues that we can detect, which is pretty exciting. Other things that we're looking at: ultimately we're doing code fixing; we're starting to look into that right now, but that's likely an early 2020 release. That's being able to give you suggestions on how to fix something automatically, so you don't even have to write the code or try to understand it. We don't recommend that, obviously, but the capability is going to be there. The other one is, as I mentioned, trying to analyze the code at a more architectural, semantic level and describe it; that's another big one. We're also toying with some more interesting stuff, like fully automatic test case generation, but there we have to see the results and how commercially viable it will be. We have quite a long roadmap of cool things that will come up. And on the purely operational side, getting more integrations; obviously people are asking for integrations, so we're going to be releasing our first IDE integration quite soon, where developers will be able to just directly get the results in their IDE while the analysis runs somewhere else. Hopefully that spins out well and we open it up so anybody can do any IDE integration, because there's quite a list of IDEs out there.
Tobias Macey
0:27:44
Yeah, being able to identify some architectural patterns and ways that the code can be internally restructured to improve it, either in terms of just the understandability of it, or potentially the scalability or extensibility, would definitely be interesting. And also what you were mentioning as far as test cases, either identifying where a test case isn't actually performing the assertion that you think it is, or cases where you're missing a test, and being able to surface at least a stub suggesting how to encompass that missing piece of functionality and verify it.
Boris Paskalev
0:28:21
Correct, yeah. In the test case space specifically, the area we're looking at is finding the existing test case out there that is most suitable for exactly what you're doing, because it's already human-generated and will be human-maintained in the long run, which is pretty much the main Achilles' heel for all the current automatic test case generation out there, and then adjusting it a little bit so it fits you perfectly. That's the focus area we're going after in that space, which is pretty exciting. As I said, if it turns out to work, it will be an amazing product and a nice add-on. But yeah, the platform has gotten to a point where we can build multiple products on it, and we're just scratching the surface; lots more will come.
Tobias Macey
0:28:58
So there are some other tools operating in this space, at least tangentially, or that at surface value might appear to be doing something along the same lines as what you're doing, the most notable being the Kite project. And I'm wondering if you can compare and contrast your product with Kite and any others that you're tracking in a similar space?
Boris Paskalev
0:29:20
Yep. So Kite is a great tool. It has a great IDE integration and some great inline suggestions. The main differentiation between Kite, or any other similar tool doing that kind of analysis, is that they look at the code at a much shallower level. They effectively say, hey, based on what you're typing, a lot of other people are typing this, which is almost like treating the code as regular text, just syntax. Whereas we're actually doing semantic analysis: we're saying, you're typing this, and the parameter you're passing in is not right; the object you're passing in is an int and it has to be a long, or whatever it is. So that's the main differentiation. Their suggestions are mainly there to help you complete code a bit faster as you type; they do go a bit deeper and give you linter-style suggestions as well, but that gives you a higher false positive rate, obviously, because it doesn't go deep enough to understand the issue and doesn't give you the contextual analysis. So recall and accuracy are the two main things to measure: we can find considerably more things, and our accuracy rate will be considerably higher. That's the main differentiation. But to be fair, Kite has an amazing UI, amazing design, and an amazing community behind it, so it's a great tool as well.
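As an illustration of the difference Boris describes, here is a minimal, hypothetical Python sketch of the kind of issue a semantic, type-aware analyzer can flag but a purely textual suggestion engine would miss; the function name and values are invented for the example and are not taken from DeepCode.

    from datetime import timedelta

    def schedule_retry(delay_seconds: int) -> timedelta:
        """Return how long to wait before retrying a failed request."""
        return timedelta(seconds=delay_seconds)

    # A text-level suggestion engine only sees that this call "looks like"
    # other calls in its corpus.  A semantic analyzer that tracks the types
    # flowing into the call can flag that a str is being passed where a
    # number is expected, which otherwise only fails once the code runs.
    try:
        schedule_retry("30")
    except TypeError as exc:
        print("type mismatch caught only at runtime:", exc)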
Tobias Macey
0:30:38
Are there any other aspects of the work that you're doing at DeepCode, or the overall space of automated fixes and automated reviews, that we didn't discuss yet that you'd like to cover before we close out the show?
Boris Paskalev
0:30:50
Yeah, I don't want to go too deep into things that are more experimental, because those take time and I don't want to get people too excited when they might take years to be ready. But the space is ripe, that's pretty much all I have to say, and there will be a lot of new things coming up, so developers should be extremely excited about what's coming.
Tobias Macey
0:31:10
And for anybody who wants to follow along with you or get in touch, I'll have you add your preferred contact information to the show notes. And with that, I'll move into the picks. This week, I'm going to choose a book series that I read a while ago and that I'm probably going to be revisiting soon, called the Redwall series by Brian Jacques. It focuses on a bunch of woodland animal characters, and it's a very elaborate and detailed world and series that he built up with a lot of complex history. So it's definitely worth checking out if you're looking for a new book or set of books to read. They all stand alone nicely, so you don't have to read them in any particular order, but all together they give you a much broader view of his vision for that world. So I definitely recommend that. And with that, I'll pass it to you, Boris. Do you have any picks this week?
Boris Paskalev
0:31:59
Yes, one pick this week. In general, the AI space has been going great. I mean, everybody knows there's no real AI so much as machine learning, but there are a couple of new areas coming up in that space, and that's very exciting; it's pretty much applying machine learning to everything and to big data. So that's lovely. But as a contrast, because we all do that every day and it's our passion here, my pick is to do a little bit less of that, do some sports, and go outside.
Tobias Macey
0:32:26
That's always a good recommendation and something that bears repeating. So thank you for taking the time today to join me and describe the work that you're doing with DeepCode. It's definitely an interesting platform, and I'll probably be taking a look at it myself. So thank you for all of your work on that, and I hope you enjoy the rest of your day.
Boris Paskalev
0:32:42
Thank you very much, you too.
Tobias Macey
0:32:45
Thank you for listening to the show. If you want to hear more and you don't want to wait until next week, then check out my other show, the Data Engineering Podcast, with deep dives on databases, data pipelines, and how to manage information in the modern technology landscape. Also, don't forget to leave a review on iTunes to make it easier for others to find this show.

Security, UX, and Sustainability For The Python Package Index - Episode 225

Summary

PyPI is a core component of the Python ecosystem that most developers have interacted with as either a producer or a consumer. But have you ever thought deeply about how it is implemented, who designs those interactions, and how it is secured? In this episode Nicole Harris and William Woodruff discuss their recent work to add new security capabilities and improve the overall accessibility and user experience. It is a worthwhile exercise to consider how much effort goes into making sure that we don’t have to think much about this piece of infrastructure that we all rely on.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Nicole Harris and William Woodruff about the work they are doing on the PyPI service to improve the security and utility of the package repository that we all rely on

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by sharing how you each got involved in working on PyPI?
    • What was the state of the system at the time that you first began working on it?
  • Once you committed to working on PyPI how did you each approach the process of identifying and prioritizing the work that needed to be done?
    • What were the most significant issues that you were faced with at the outset?
  • How often have the issues that you each focused on overlapped at the cross section of UX and security?
    • How do you balance the tradeoffs that exist at that boundary?
  • What is the surface area of the domains that you are each working in? (e.g. web UI, system API, data integrity, platform support, etc.)
    • What are some of the pain points or areas of confusion from a user perspective that you have dealt with in the process of improving the platform?
  • What have been the most notable features or improvements that you have each introduced to PyPI?
    • What were the biggest challenges with implementing or integrating those changes?
  • How do you approach introducing changes to PyPI given the volume of traffic that it needs to support and the level of importance that it serves in the community?
  • What are some examples of attack vectors that exist as a result of the nature of the PyPI platform and what are you most concerned by?
  • How does poor accessibility or user experience impact the utility of PyPI and the community members who interact with it?
  • What have you found to be the most interesting/challenging/unexpected aspects of working on Warehouse?
    • What are some of the most useful lessons that you have learned in the process?
  • What do you have planned for future improvements to the platform?
    • How can the listeners get involved and help out?
  • How was this work funded?

Keep In Touch

  • Nicole
    • @nlhkabu on Twitter
    • Website
    • If you’re using CI to upload to PyPI and would like to speak with Nicole please book a time here
    • If you’re using assistive technology and would like to speak with Nicole please book a time here
  • William
    • @8x5clPW2
    • Website
    • Email
    • Please get in touch if you’d like to work with Trail of Bits on your next security project!

Picks

Links

The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA

Click here to read the raw transcript...
Tobias Macey
0:00:13
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models and running your CI/CD pipelines, they just launched dedicated CPU instances. They've also got worldwide data centers, including a new one in Toronto and one opening in Mumbai at the end of the year. So go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of this show. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference, with upcoming events including the O'Reilly AI conference, the Strata Data conference, and the combined events of the Data Architecture Summit and Graphorum. Go to pythonpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. Your host, as usual, is Tobias Macey, and today I'm interviewing Nicole Harris and William Woodruff about the work they are doing on the PyPI service to improve the security and utility of the package repository that we all rely on. So Nicole, can you start by introducing yourself?
Nicole Harris
0:01:38
Yeah, hi, my name is Nicole Harris. I've been working on PyPI, or the Warehouse project, which is the codebase that powers PyPI, for about three or four years now. In my day job I manage a UX/UI team at a company called PeopleDoc, but in my spare time I work on PyPI.
Tobias Macey
0:02:00
William, can you introduce yourself?
William Woodruff
0:02:02
Sure. So my name is William Woodruff. I'm a security engineer with a small security consultancy called Trail of Bits. I've actually been working on Warehouse for only about five or six months now; we started the work back in March. During my day job I sort of split my time between engineering and research. On the research side I do program analysis research, mostly government funded. On the engineering side I work on mostly open source projects, like Warehouse and osquery and things like that.
Tobias Macey
0:02:30
And going back to you, Nicole, do you remember how you first got introduced to Python?
Nicole Harris
0:02:33
So my background is in HTML, CSS, design, and user interface work. Python wasn't the first technology that I was exposed to in terms of the web, but my husband is actually a Python developer; he started teaching himself programming by learning Django. So through him, basically, I got introduced to Python and learned enough Python to be useful alongside my front-end skills.
Tobias Macey
0:03:06
And William, do you remember how you first got introduced to Python?
William Woodruff
0:03:09
I think I used Python in a few university courses, but I didn't actually start programming in it in earnest until I took this job. Before that I mostly did C and Ruby, so this has been a nice turn for me.
Tobias Macey
0:03:26
And given the fact that you haven't been using it in your day-to-day, I'm curious how much effort it's been to get up to speed with the codebase and be able to understand it and be effective with it, and how much of your experience with Ruby in particular was able to translate easily?
William Woodruff
0:03:43
Oh, so I think, fortunately, the Warehouse codebase is probably one of the nicest Python codebases I've worked on. It has essentially 100% unit test coverage, and the idioms of the frameworks that it uses are well preserved across the codebase. So it was actually relatively easy to get up to speed. And thankfully, I had both Nicole and everybody over on the PSF side, as well as Sumana at Changeset Consulting, to answer my questions as they came up.
Tobias Macey
0:04:11
And so for both of you, I'm wondering if you can start by sharing a bit about how you each got involved in working on the PyPI project and the main responsibilities that you have.
Nicole Harris
0:04:24
Yeah, so I can maybe start there. I think it must have been in 2015. Donald Stufft, who is the lead developer on Warehouse, which is the project powering PyPI, sent out, I think he actually opened a GitHub ticket that said: help, I need a designer, this is not something that I'm good at, I'm rebuilding this thing, this is completely outside of my skill set, so please retweet. And it was through one of my friends, who I had actually met at a Python unconference, that I kind of put my hand up and said, hi, I'm Nicole, this is what I do, and I think I can help you. So that's how I got involved, and my involvement has kind of extended from there. In terms of my responsibilities, I'm responsible for the UX and the UI, so the user experience and user interface, as well as the HTML and CSS codebase for the Warehouse project. So a bit of coding and a bit of designing.
Tobias Macey
0:05:36
And William, how about yourself?
William Woodruff
0:05:37
Yeah. So on my side, I got involved through the current contract that I'm working on, which is the OTF-funded security improvements to Warehouse. My work has primarily revolved around four key changes to the Warehouse codebase, to improve both the ability for users to secure their accounts as well as the general security posture of the PyPI codebase. I can talk about the specifics of those improvements as we go forward, but that was how I got started.
Tobias Macey
0:06:06
And particularly for you, Nicole, what was the state of the system at the time that you first began working on it, and what were any of the notable issues that you were first faced with?
Nicole Harris
0:06:17
So I don't know if you're aware of the full history of PyPI. When I joined the project, I think it was 2015 or 2016, PyPI was still powered by an old codebase that had been written, I think, before web frameworks even existed. I think Donald described it as from before we even knew how to use Python to build great web experiences. So in terms of the state of the ecosystem, there was this old codebase that Donald really discouraged me from diving into; he was like, look, don't look at it, it's not best practice, what we're going to do is rebuild this from scratch. So I had a fairly clean slate in terms of the user interface and, in fact, the HTML and CSS codebase. Donald did have some Bootstrap templates, I think, in the codebase, but they weren't particularly finessed; they were basically just outputting data onto the screen. So I basically rebuilt that from scratch and made a whole lot of decisions about how we were going to structure, not so much the templates, but certainly the CSS, because we were using an SCSS codebase, so that it would be something that would be easy to maintain moving forward. Because if any of your listeners have experience working on large codebases with CSS, they'll know it can get out of control pretty quickly, so we needed to build that structure in from the beginning.
William Woodruff
0:08:01
Yeah, so as I started working on Warehouse, one of the first things I looked at was the present security posture of the site and the various common weak points in package management, such as name squatting, project name reuse, or username reuse. And overall, as far as package managers and package indices go, Warehouse was in a pretty good state. For example, as I began working, it already supported preventing common typosquatting attacks on packages, and it had rate limiters and other mechanisms in place to prevent the really common low-level attacks against package indices. The things that I ended up working on as part of the OTF-funded scope were things that are above and beyond the current norm for package indices, and that would be two-factor authentication, API tokens, which surprisingly are not the norm for package indices, and the audit logging infrastructure.
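To make the typosquatting idea concrete, here is an illustrative sketch, not Warehouse's actual implementation, of how a registry could compare a newly submitted project name against a short list of popular packages using only the Python standard library; the package list and similarity cutoff are invented for the example.

    import difflib

    POPULAR_PACKAGES = ["requests", "numpy", "django", "flask", "urllib3"]

    def possible_typosquats(candidate, cutoff=0.85):
        """Return popular package names the candidate is suspiciously close to."""
        matches = difflib.get_close_matches(candidate.lower(), POPULAR_PACKAGES,
                                            n=3, cutoff=cutoff)
        # An exact match is a plain name collision, not a typosquat.
        return [name for name in matches if name != candidate.lower()]

    print(possible_typosquats("reqeusts"))   # ['requests']
    print(possible_typosquats("requests"))   # []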
Tobias Macey
0:09:01
And Will, I understand that you have also worked on the Homebrew package manager, and I'm wondering what your initial reactions were as you started digging into Warehouse, how it compared to your prior experience of working with other package managers, and some of the common security pitfalls that are germane to that particular type of application.
William Woodruff
0:09:22
I will say I'm probably Homebrew's current worst maintainer; I'm probably one of the least active ones. But the security issues that Homebrew has to deal with are, somewhat unfortunately, orthogonal to traditional package management issues, primarily because Homebrew revolves around a central repository for all packages. So we actually have finer-grained control over both the integrity of packages and their origin, because we can actually see the Git commits, as well as run CI checks, basically as every package is updated. So it's all centralized in a way that, for example, PyPI can't necessarily do. But that being said...
Tobias Macey
0:10:06
And so for each of you, once you began working on the PyPI codebase and working toward some of the initial issues, I'm curious whether the problems that you were addressing were identified ahead of time, or what your overall approach was for determining the most critical and most important tasks to undertake to improve the overall security and user experience of the platform.
Nicole Harris
0:10:35
So I can take this one; I think this relates to the way that this project has actually been funded. As well as being a contributing designer slash developer on PyPI, I'm also a member of the Python Packaging Working Group, which is the sub-organization or working group that works under the Python Software Foundation to raise money for packaging-related projects. And it was through that working group that we actually got funding to make the security improvements that users are starting to see being rolled out on PyPI. So the scope of the work that Will and I have been undertaking is directly related to the application that we made to the Open Technology Fund, who have actually funded this work. What we did is we looked at their mission and their vision and their values, we looked at the different grant streams, and we made an application for the items that we thought were relevant to their particular fund. And that was what determined the scope of everything that has been funded through that particular initiative. So I think, Will, you'd probably agree that in coming into this project we had fairly well defined parameters around what was and wasn't in scope, based on what was being funded and what we said we were going to do.
William Woodruff
0:12:02
Yeah, I think that's correct. We had a high-level idea of the individual goals we wanted to achieve based on the work that we scoped out with the OTF. And then once we actually began work, we prioritized the individual tasks based on what we thought would have both the highest user impact as well as what we could roll out with minimal disruption to things like package upload and the user experience.
Tobias Macey
0:12:28
And given that you're both focusing on somewhat different areas of the platform, I'm wondering how often the issues that you're focusing on have had overlap, and what the cross section ends up being between user experience and security, particularly given that the interfaces that you're dealing with aren't necessarily just the web UI that you see when you load up the web page.
William Woodruff
0:12:52
So I'm actually of the opinion that UI is severely underrated in terms of user security. Users oftentimes don't really know how to engage with the security features that security engineers expose to them, and this is an issue that I've run into on other platforms that I've worked on. A huge boon of working with Nicole has been building a set of features and then seeing how to expose them correctly to users, something that I'm not personally equipped to do. Seeing her build a setup that is actually extremely pleasant to use and extremely intuitive has been really great.
Tobias Macey
0:13:34
And then in terms of the tradeoffs that exist: I know that oftentimes there's a conflict between improving the overall security of a system and still keeping it usable, because as you ratchet down too tightly on making something ultimately secure, you start to encourage people to take shortcuts that ultimately reduce the effectiveness of your practices. So how do you try to balance that, and what are some of the common patterns that you have settled on to make sure that you're improving the security as much as possible while still making sure that people are adhering to the security practices?
William Woodruff
0:14:12
Yeah, I think a huge challenge when designing secure systems is security fatigue. One of the last things you want to do is, like I said, ratchet down the system so much that users become frustrated and take shortcuts to achieve their ends. That's one of the issues you often see with two-factor implementations: a two-factor implementation will require a user to sign on or re-authenticate so frequently that users will just move their TOTP secret onto a Post-it note and just Ctrl-C, Ctrl-V it, and thereby dissolve the second-factor component of the authentication scheme.
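For readers unfamiliar with TOTP, a minimal sketch using the third-party pyotp library (assuming it is installed) shows the mechanics William is describing: the code rotates every 30 seconds, but only because both sides hold the shared secret, so copying that secret around turns it into just another password.

    import pyotp

    # The base32 secret is the "second factor".  It lives in the user's
    # authenticator app; if it gets pasted onto a Post-it or into a script,
    # it degenerates into a second password.
    secret = pyotp.random_base32()
    totp = pyotp.TOTP(secret)

    code = totp.now()              # six-digit code, rotates every 30 seconds
    print("current code:", code)
    print("verifies:", totp.verify(code))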
Tobias Macey
0:14:46
And I'm wondering, too, if you can just enumerate the overall list of interfaces and the total surface area of the problems that you're each working with, given the particular mix that exists in the PyPI project. With some projects that might be limited to just the web UI, for others it might be just an API, but with PyPI there's the web interface, there are the APIs that users rely on, there's the actual data integrity, as well as the actual interactions people have in downloading and installing the packages, which is potentially another attack vector that isn't necessarily going to be present in other projects.
William Woodruff
0:15:25
Yeah, so the work that I did primarily centered around the API and the web interface. The security features that we added, specifically two-factor authentication, API tokens, and an audit log of events: the two-factor authentication is intended primarily for use with the web interface, and audit log visibility is provided via the web interface, although some audit events are actually captured as the user hits the API for sensitive actions, such as package uploads, file uploads, or removals. And also on the API side, there are the API tokens themselves, which the user will interact with via a tool like setuptools, or twine, or any of the other clients that interact with the Warehouse API.
Tobias Macey
0:16:16
And Nicole, for you as well, I'm wondering what the surface areas are that you're dealing with as far as the user experience work, and some of the ways that that manifests in the different tradeoffs and interactions between the APIs and the web UI, and the overall package upload experience, etc.
Nicole Harris
0:16:36
Yeah, so in terms of this current contract, my work has been limited to basically what Will just described. The first part was making sure that users find it easy to set up two-factor authentication and then to use it when logging into pypi.org. So that's the first thing we worked on. Then we looked at the API keys, sorry, API tokens, we're avoiding the word keys and I can tell you why later, and at making it easy for users to set up those tokens. And then obviously, as Will said as well, exposing the audit log to end users. In terms of my work with regard to the way that people interact with PyPI outside the browser, that's really limited to making sure that the instructional text and the help text that we show on pypi.org is actually useful enough for people to be able to do what they need to do. For example, with the API tokens that we've just deployed, I've been running some user tests that have revealed that perhaps the way that we display the token, and the instructions that we currently give to users, are not good enough for them to understand what they need to do next using whatever tool they're using. So that's where my sphere of influence sits: making sure that people have the information they need to be able to interact with PyPI, however they need to do that.
Tobias Macey
0:18:28
That I'm sure is also complicated by the fact that there are any number of different tools that people might be using that would require access to that API token. I know that there's pip, and there's twine for being able to upload things, and flit, and I'm sure any number of different homegrown applications. I'm wondering how that plays into your efforts to make sure that the instructions are clear and accessible, and how far you're willing to take the effort before you decide that you've covered enough ground, that the majority of people are handled, and that anybody in one of those edge cases is there because of something they've decided to do that isn't necessarily something that needs to be supported by the people responsible for the PyPI infrastructure.
Nicole Harris
0:19:16
Yeah, I think there are sort of two factors when thinking about, or at least how I think about, designing for PyPI: people have different workflows, as you've just described, and you also have people with really vastly different levels of knowledge. Python is now being used a lot as a teaching language, so I'm really aware that PyPI could be the first package index that some people are using or experiencing, and they might not be familiar with all of the concepts that we present to them. On the flip side, you have people who've been coding for decades and are really familiar with all the concepts. So it's a real challenge to make sure that you're explaining things enough for beginners whilst not talking down to people who are really experienced. I tend to lean towards giving more information for beginners, because at the end of the day, experienced users can ignore instructions that they already know and don't need. In terms of weighing up how much information to give, we tend to take a lot of feedback from the community. I've run user tests, and here I'm thinking less about the API tokens and more about the two-factor authentication workflow that we worked on: I ran a whole lot of user tests when we were rolling out those interfaces, with people with different levels of experience who were using different tools. For example, for TOTP, some people were using a password manager to create the temporary one-time password, some people were using a mobile phone, and there were all sorts of other ways that people were doing it. What we did in the end was put a whole lot of examples into our text: these are the kinds of applications that you might choose to use. And we made sure that we had a good balance there between the most popular tools, so things like Google Authenticator and Authy floated to the top of the list as things that people were mentioning frequently, while also mentioning the less common use cases, making sure, for example, that we were listing non-proprietary solutions as well, because we know that there are members of the community who prefer not to use proprietary software. So it's really about prioritizing the way that you present the information to cover the most common use case first, and then give the information for the edge cases later. And I would say it's the same when we're talking about WebAuthn, which is two-factor authentication with some kind of device. Lots of people understand that as "oh, I authenticate with a YubiKey", because the YubiKey is probably the most popular USB key that you can use with that particular standard. But we do have people in the community who are using other things. So what we ended up doing was writing the instructional and help text in such a way as to emphasize USB keys, mentioning certain brand names so people would associate what we were talking about with the correct concept, and then also mentioning, hey, there are all these other ways that you can do this as well.
So I think that balance is quite good. Because generally, if you are not using the most mainstream solution out there on the market, then you're probably a more familiar and more advanced user anyway, in which case the help text or the instructional text is less necessary for you than it might be for a beginner who's using something fairly mainstream.
William Woodruff
0:23:28
Yeah, and to add on to that, for the API tokens work we did: one thing that's pretty interesting about the Python package ecosystem as a whole is that there are a whole lot of third-party clients out there, and a whole lot of third-party implementations that talk to these APIs. So as we were designing the initial API tokens approach, we realized that we would probably have to make concessions in terms of authentication semantics to make them fit into all of these third-party clients that expect a username and password instead of a general-purpose key authentication. As we were working on that, we also realized very quickly that the range of continuous integration setups, as well as other automated systems, constrained our ability to add certain token prefixes and certain sub-usernames. So doing all that work was pretty interesting, because it involved community feedback, as well as trying to guess the common happy paths and unhappy paths for common uses of the tokens.
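As a concrete example of how tokens slot into the existing username/password interface, here is a short sketch of uploading a release with twine and a PyPI API token. The token value and the PYPI_API_TOKEN variable name are placeholders invented for the example; the sketch assumes twine's documented support for the TWINE_USERNAME and TWINE_PASSWORD environment variables.

    import glob
    import os
    import subprocess

    # PyPI API tokens reuse the existing credential fields: the username is
    # the literal string "__token__" and the password is the token itself.
    os.environ["TWINE_USERNAME"] = "__token__"
    os.environ["TWINE_PASSWORD"] = os.environ.get("PYPI_API_TOKEN", "pypi-<placeholder>")

    # Upload whatever distributions have already been built into dist/.
    subprocess.run(["twine", "upload", *glob.glob("dist/*")], check=True)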
Nicole Harris
0:24:26
One thing I'd also like to add is, I don't know exactly when this podcast is going to go out, but in terms of those API tokens, I'm currently still working on improving the help text and the instructional text. I do need to seek feedback from members of the community as to which tools they are using their API tokens with, so that I can make sure I am covering as many of those use cases as possible within the help and instructional text. So I suppose that's a bit of a call to action, and I know we'll probably get a chance to make another one by the end of this podcast. But if you're a community member out there, and particularly if you're using a continuous integration service to upload your package to PyPI, and you'd like to test out the API tokens, I'd really like to speak to you, because understanding your workflow, and how we can document it in the user interface and give brief but useful instructions, would be very valuable.
Tobias Macey
0:25:32
As we've been discussing, there is a wide variety of people and patterns in terms of how the PyPI infrastructure is interacted with, and I'm curious how that informs and affects your overall workflow and strategy for introducing changes to the platform, and how you validate and control the rollout of those changes.
Nicole Harris
0:26:00
Yeah, so I can speak on that a little bit. In terms of releasing new features, a lot of this is actually handled by Sumana from Changeset Consulting, who's our project manager for this contract; she's worked as a project manager for previous contracts as well. What she does is reach out to the community and do a lot of communication about what the upcoming features are going to be. When we release a new feature, it's marked as a beta feature, so it comes with the warning that this is something we've shipped, but it's still not certified as perfect and production-ready, so set your expectations that things might change. And she does communication at that point as well, reaching out to the community to say, hey, we've released this new thing, please go and test it. At that stage I obviously also do some outreach in terms of user testing with people to see if they've got any problems working through the interfaces. But also, because of her work in communicating what's going on to the wider community, we do tend to get a lot of tickets opened up on GitHub where people say, hey, I've tried out this thing and it's not quite working, there's a bug, or I'm using a browser that you haven't tested it with, or whatever it is. And then we go and address those particular issues before we can move out of the beta period. So it's been quite smooth so far. Yes, there are bugs, but we expect that to happen within that period, and we've been quite good at turning around and fixing them. And because we're labeling things as beta, people understand that that's part of the process of developing software. Will, did you want to comment on that at all, in terms of some of the changes maybe that we've had to make based on feedback from the community?
William Woodruff
0:28:11
Yeah. So I think the big things that come to mind are what you mentioned earlier, the confusion about token versus key: the security token sense of the word versus what I originally called API tokens, which we quickly realized confuses users because they associate "token" with a physical device. We've also, on more of the development side, and I think I mentioned this earlier, benefited from Warehouse having pretty comprehensive unit tests. So as we've been developing, we've been fortunate to catch things that otherwise would probably have blown up in production, both via unit tests and via sort of smoke tests by either Sumana or the reviewers on the PSF side, that would be Ernest, Donald, and Dustin.
Tobias Macey
0:28:56
So we've mentioned the API tokens and some of the two-factor auth features that have been introduced. I'm curious, what have been some of the other notable features or improvements that you've been involved with?
Nicole Harris
0:29:09
Well, I suppose I've been involved since very early on, so I'm going to scope my answer to this particular contract, which is the OTF contract. So yeah, as you said, two-factor authentication and API tokens, and then the audit log. From my point of view there are two sides to that audit log. We have an account audit log, so when you log into your account you can see, okay, when did I last change my password, when did I set up an API token, when did I enable two-factor authentication, and so on. So we've got that exposed. And then we've also got project audit logs, for things that have happened on an individual project, so for example a new release has been made, or an API token has been created that has permissions on this project, things like that. The other thing to mention is that the OTF grant doesn't just cover security. When we made the application through the Python Packaging Working Group, we also received funding to improve both the accessibility and the localization of pypi.org. So some of my work, well, I'm already working on this, but it's going to be my work moving forward as well, is to improve the accessibility of pypi.org for people who are using assistive technologies, for example people who are using screen readers, people who are limited to just using their keyboard, people who are using high contrast mode, and so on. And we're also going to be implementing localization, so making it possible to translate at least the interface copy on pypi.org into local languages, so French, Chinese, whatever community contributions we get for translations. Those things are within the scope of the OTF contract as well. So that's super exciting, because it's not just about thinking about how we can make the site more secure, but also how we can make it more universally accessible for people who have different needs and who are in different Python communities around the world.
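To give a feel for what an account audit entry might record, here is a purely illustrative Python sketch; the field names and values are invented for the example and are not Warehouse's actual schema.

    from datetime import datetime, timezone

    # Hypothetical shape of a single account audit event.
    event = {
        "actor": "example-user",
        "action": "account:two_factor.enable",
        "ip_address": "203.0.113.7",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "details": {"method": "totp"},
    }
    print(event)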
Tobias Macey
0:31:36
And William, in terms of the attack vectors that you have considered for PyPI, I know you said that in general it was in a fairly good security stance, as far as already having some capacity for mitigating typosquatting attacks. But I'm wondering if there are other attack vectors that you have looked at, or other things that you're concerned about for PyPI. I recognize that I'm not asking you to make any sort of improper disclosure, just, in general, some of the thoughts that you have about security and attack vectors for a package repository.
William Woodruff
0:32:14
So the really common attack vectors that you see on package indices and package managers are the typosquatting, package takeover, and phishing-based attacks, where someone will try to take over an account, or add themselves as a contributor to a project, and then push up a malicious version of that project that contains, you know, a malware dropper or whatever it needs to be. And as I said, fortunately PyPI already had a few pretty good mitigations in place, including for typosquatting, and rate limiting to prevent credential brute forcing. There are some things that are already well-known weaknesses in PyPI's setup. Those include the way that roles are currently structured: at the moment, any account can be added to any project as an owner without that targeted user's consent, and, prior to this audit log work, without a ton of history or logging to record that change. So there are issues with transparency in package ownership, as well as transparency in changes to package control. And, I'm actually not positive about this, but I believe that currently, if you delete your project on pypi.org, another user can claim that name. If that happens, you can then imagine a sort of package-reuse attack, where a popular package gets deleted by an attacker, and then they become, in a sense, the legitimate owner, because they've actually claimed the project rather than taking it over.
Nicole Harris
0:33:49
Yeah, that's correct to my knowledge, Will. However, they can't release any files that have previously been released, if that makes sense, so it would only be new versions moving forward. But you're right in the sense that, yeah, they would own the package and have the legitimacy of that package name. With regard to your first comment, I know that we do have a pull request in progress that I'm hoping will address the issue of requiring permission before adding collaborators soon.
William Woodruff
0:34:29
Yeah. There's also the more general problem of actively scanning projects, or rather packages, as they get uploaded. And that's, as far as I know, an unsolved problem in the world of package maintenance, and I don't think it's something that I could reasonably be asked to solve.
Nicole Harris
0:34:45
Will, what do you mean by that? You said active scanning.
William Woodruff
0:34:47
Yeah. So imagine scanning for common indicators of compromise, or common indicators that a package is malicious, for some fuzzy definition of malicious. You can imagine, like, a released package that contains malware samples, or what have you.
Tobias Macey
0:35:05
And particularly given the flexibility of Python and the ability to obfuscate the actual intent of the code, it's definitely a non-trivial, and in the general case effectively undecidable, problem to definitively determine whether or not a package is malicious or has nefarious intent.
William Woodruff
0:35:24
Yeah, this is a problem that some of the most locked-down platforms in the world struggle with; Apple, with their App Store, struggle with static analysis immensely. So I think it would be completely unreasonable to expect a dynamic language on a community-maintained index to solve this problem.
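A tiny Python sketch makes the obfuscation point concrete: the two calls below behave identically at runtime, but a scanner that only pattern-matches on source text sees nothing suspicious in the second form. This is a generic illustration, not a description of any particular scanner.

    import base64

    source = 'print("hello from a package")'
    encoded = base64.b64encode(source.encode())

    # Transparent form: trivially visible to a text-based scanner.
    print("hello from a package")

    # Obfuscated equivalent: same behaviour, but the intent is hidden until
    # the code actually executes.
    exec(base64.b64decode(encoded).decode())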
Tobias Macey
0:35:40
So in terms of your overall experience of working on and with the PyPI platform, and the community of users who rely on it, what have been some of the most interesting or challenging or unexpected aspects of that work?
William Woodruff
0:35:55
I can try answering that. So on my side, at least, I've done community management before, some of it in my role as a Homebrew maintainer and some of it on my own open source projects, as well as through the open source work that Trail of Bits does. But it is different every time. Especially when dealing with feature changes that affect potentially tens of thousands of people, it can be challenging to get people to see your side of things, especially when it comes to something like event logs. Very understandably, users are wary of any feature that records their IP address or records security-salient events about their actions. So it can be difficult to explain the value of those recordings to users who don't necessarily see it from a security perspective, and coming up with a compromise where we're able to record enough information to take action while also preserving their privacy and mitigating their concerns can be a challenge, especially for countries where GDPR compliance is key.
Nicole Harris
0:37:05
I think on my side, one of the issues with doing design in the open on open source community projects is that the work is very, very visible, and it is really hard to satisfy everybody. Everybody's using different browsers, everybody has different use cases, and we don't have any full-time resources looking at the user experience of PyPI; it's just me, and the hours that I have, either in my spare time when I'm working as a volunteer, or my contracted hours on this contract. So it has been challenging to try to satisfy everyone and make everybody happy. That was probably most challenging during the transition from the old PyPI codebase to pypi.org, when there were a lot of changes, which was disruptive to people's existing workflows. On the other hand, there were a lot of people who were like, yay, PyPI has moved into the modern era and it works on mobile. So there are two sides to every coin. What I've tried to do in terms of my work on PyPI is make sure that when decisions are made, they're really backed by user research, user feedback, or user testing. So it's not just a case of me saying, well, it's my opinion that it should be like this, and therefore my opinion is most important, but actually being able to show people: I looked into this, I looked at prior art, or I spoke to people within the community, and this is the reason that this decision has been made. And when you actually articulate the reason, and you show people that you've thought about it more than just "this is my opinion", people are really responsive to that. So that's been quite a positive experience for me in interacting with the Python community, who as a whole are a very friendly bunch of people.
Tobias Macey
0:39:15
In terms of the future work that you either have planned for your existing contract, or that you have identified as potential improvements to the platform in general, what do you think is most interesting or most notable? And what are some of the ways that listeners and the broader community can get involved and help out with your efforts, and with the overall work needed to keep the PyPI platform healthy and viable for the long run?
Nicole Harris
0:39:45
So I can address that in terms of the current contract. Most of the security work is kind of done now; there are a few things that we need to wrap up, and as I mentioned, I would really like to talk to anybody who's using CI to upload to PyPI, because that would be really helpful for me in making sure that the interface is working for those use cases. In terms of the rest of this contract, as I mentioned earlier, we have accessibility and localization, which are the last two subjects that we need to address. On accessibility, I've also put out a call recently: I'd really like to talk to any members of the Python community who are interacting with websites using assistive technologies. So if you're a user who's online using a screen reader, I would love to speak to you. Same if you're someone who's limited to using a keyboard, or if you're using high contrast mode, or if you're zooming in your browser a lot because of poor eyesight. The reason I would really like to speak to people who are using the web in those ways is that we're doing an audit against the WCAG 2.0 standard, which is kind of the accessibility standard. But just being able to tick the box isn't, in my view, enough. Obviously we want to check the box and say, yes, we're compliant, but actually testing the interface with people who are using assistive technology, and seeing that it's working for them in real life with real use cases, is super important as well. So it's not really enough just to check the boxes; we really need to talk to people about how they're using the site. And on the localization side, and I think there'll be more communication about this later as we get into that milestone, we are going to be looking for people to help us actually translate the interface copy into different languages. So once we've got the technical implementation done, we're going to want people to translate it into whatever language they'd like, barring Arabic and Hebrew and any right-to-left languages, because that is outside the scope of the current project.
William Woodruff
0:42:15
Yeah, and on the security side of things, there are items that are out of scope for the current contract but that I believe are planned for a future iteration on the Warehouse codebase. For the API tokens, the implementation that we went with is based on a security token format called macaroons, and one of the interesting things about macaroons is that they have embedded in them something called a caveat language, which allows for a rich description of the permissions associated with each token. Currently we have a version field in our caveat language that allows for those permissions to be iterated on and modified, to allow for really rich interactions with the authentication system. So you can imagine, in the future, the plan is to add tokens that expire after exactly one use, or are only allowed between certain hours of the day, or can only be used from a certain domain or a certain authenticated IP, things like that. I think we've put out, on the Warehouse issue tracker, a request for help with that.
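For a flavor of how caveats work, here is a minimal sketch using the third-party pymacaroons library; the location, identifier, key, and caveat string are invented for the example, and Warehouse's actual caveat language (with its version field) is more structured than this.

    from pymacaroons import Macaroon, Verifier

    # Mint a token and attach a caveat that scopes it to a single project.
    m = Macaroon(location="example.invalid", identifier="token-id", key="server-secret")
    m.add_first_party_caveat("project = sampleproject")
    serialized = m.serialize()

    # On each request the server verifies the signature and checks every caveat.
    v = Verifier()
    v.satisfy_exact("project = sampleproject")
    print(v.verify(Macaroon.deserialize(serialized), "server-secret"))  # True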
Nicole Harris
0:43:16
Yes, and I should mention here as well that if any of your listeners are interested in contributing to the Warehouse project, the issue tracker is in fairly good shape and fairly well managed. We tag issues with "needs discussion" or "help required", so going onto the issue tracker and having a look at which discussions are happening is a useful way of finding out where you could help make PyPI more sustainable, in terms of the feature development that we're currently working on. The other thing I'd like to mention, and I think what Will already said today reinforces this, is that it's a really nice codebase to work on: pretty easy to set up with Docker and Docker Compose, with really great unit test coverage. It really is a very nice codebase, so if you're looking to make an open source contribution, I think it's a good candidate. And we do also welcome people who are making their first contribution to open source, so it's not just your more experienced listeners who can contribute to the Warehouse codebase. There are plenty of tickets tagged with "good first issue", specifically for people who are looking to make more minor contributions to ease their way into open source.
William Woodruff
0:44:43
Yeah, I do want to hammer that point home: it really is a nice codebase. I've worked on a lot of both open source and proprietary codebases, written in some combination of Python 2 and Python 3, or, you know, now Python 3 but migrated from Python 2, with very bespoke setups and environments that were clearly developed from an engineer's desk somewhere inside of an office. And Warehouse, fortunately, is not one of those codebases.
Tobias Macey
0:45:08
And is it worth digging more into the actual funding behind this work, how that's structured, and just some of the overall sustainability efforts to be able to maintain and upgrade the PyPI and Warehouse platform?
Nicole Harris
0:45:22
Yeah, so I can talk about that. As I mentioned earlier, I'm a member of the Python Packaging Working Group, which raises money not just for PyPI but for any packaging-related project, and it was through that that we got this grant from the Open Technology Fund, or OTF, to actually be able to do this work. It's the second major grant that we've gotten for PyPI; you might be familiar with the fact that we were granted a MOSS grant, the Mozilla Open Source Support grant, last year, and that was to migrate from the old version of PyPI to the new Warehouse codebase and to retire that old codebase. So, so far, through the Packaging Working Group we've had two fairly substantial grants, which have allowed us to really improve the package index. That working group continues to work on making grant applications for different subjects, not just PyPI but also many of the tools that interact with PyPI, such as pip. So we're hoping that in the next year or so we will have more money coming in from those applications to fund more sustainable development for Python's packaging ecosystem in general. The other thing to mention is that we're very fortunate with PyPI to have a number of great sponsors who actually give us the infrastructure for free. I don't have the figures right now in terms of how much that's worth, but it's certainly millions per year that it would cost to actually run the Python package index, and a lot of that is borne by our CDN, Fastly, whose donation to us is actually quite enormous. So in terms of sustainability, we have a mixture of the funding coming through from grant applications, and we have these different companies giving us their services to enable us to keep the service up. The other thing that we appreciate is that we have a donation page on pypi.org, where members of the community can donate towards the Python Packaging Working Group, so that we can have a budget to pay for maintenance and improvements to both PyPI and other projects. An ideal scenario in the future is that we would have enough recurring donations from the community to set up a more reliable, either part-time or full-time, situation where we have people working on packaging as their job, because at the moment we really have mostly just contracts that come and go depending on the money that comes in.
Tobias Macey
0:48:18
Are there any other aspects of your current efforts on the PyPI infrastructure, or any other aspects of the overall platform, that we didn't discuss yet that you'd like to cover before we close out the show?
Nicole Harris
0:48:30
Yeah, I can't think of anything. Can you think of anything, Will?
William Woodruff
0:48:33
Oh, no, not in particular. I mean, there are some interesting things about WebAuthn and TOTP that we could go into, but they'd be a bit in the weeds.
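
(For listeners who want a concrete picture of the TOTP side, the sketch below shows a minimal time-based one-time-password check using the third-party pyotp library. It is purely illustrative and is not drawn from Warehouse's actual two-factor implementation.)

    # Minimal TOTP sketch using the third-party pyotp library.
    # Illustrative only; not Warehouse's actual two-factor code.
    import pyotp

    # Each user is assigned a random base32 secret, shared once (e.g. via a QR code).
    secret = pyotp.random_base32()
    totp = pyotp.TOTP(secret)

    # The provisioning URI is what an authenticator app scans to store the secret.
    uri = totp.provisioning_uri(name="user@example.com", issuer_name="Example Package Index")

    # At login time, verify the six-digit code the user submits.
    code = totp.now()  # stand-in for the code typed by the user
    print("accepted" if totp.verify(code) else "rejected")
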
Tobias Macey
0:48:42
Well, for anybody who does want to dig deeper into that, if you have any specific references that you found useful, I can add them to the show notes. And for anyone who wants to follow up with either of you or get in touch and follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose the show The Expanse. I started watching that recently, and I've gotten through the first season and into the second. It's a very interesting and well-done sci-fi series chronicling some dramatic events far in the future, where humans have gone beyond Earth and started populating other areas of the solar system. It's an interesting and well-put-together show with a lot of good environmental aspects, such as the Creole language that people speak further out in the asteroid belt. So if you're looking for something new to watch, I recommend it. And with that, I'll pass it to you. Do you have any picks this week?
William Woodruff
0:49:37
Sure, yeah. I don't know if I have a media pick. I'm not actually normally a big nonfiction person, but I've been reading a biography of Abraham Lincoln by Carl Sandburg, who's a well-known American poet. So it's a little bit out of, I think, not his expertise, but his field of renown. But it's been a pretty interesting read so far. It's actually a surprisingly nuanced biography of his life, in the sense that it goes through both the political and military failures that he encountered. And it's just been interesting to read, because, you know, you learn this stuff in like 10th grade in American high schools, but then it gets dropped.
Nicole Harris
0:50:14
I do have an answer.
0:50:17
So, last week or the week before, I watched a documentary on Netflix called The Great Hack, which was particularly interesting to me because I live in the UK. It talked about Brexit and Cambridge Analytica and what's been happening there, which I haven't followed as closely as I probably should have. So yeah, for anybody out there who's interested in documentaries, it's certainly very interesting and very topical at the moment with regards to the current political climate.
Tobias Macey
0:50:50
Well, thank you both very much for taking the time today to join me and discuss your work on the PyPI platform and infrastructure, and some of the ways that it will improve the overall viability of the platform in the long term and improve the available workflows for the people using it. I appreciate all of your efforts on that front, and I hope you enjoy the rest of your day. Thank you.
Nicole Harris
0:51:11
Thank you.

Learning To Program In Python With CodeGrades - Episode 224

Summary

With the increasing role of software in our world there has been an accompanying focus on teaching people to program. There are numerous approaches that have been attempted to achieve this goal with varying levels of success. Nicholas Tollervey has begun a new effort that borrows the approach adopted by musicians and martial artists, using a series of grades to provide recognition for the achievements of students. In this episode he explains how he has structured the study groups, syllabus, and evaluations to help learners build projects based on their interests and guide their own education while incorporating useful skills that are necessary for a career in software. If you are interested in learning to program, teaching others, or acting as a mentor, then give this a listen and then get in touch with Nicholas to help make this endeavor a success.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and Corinium Global Intelligence. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today Nicholas Tollervey is back to talk about his work on CodeGrades, a new effort that he is building to blend his backgrounds in music, education, and software to help teach kids of all ages how to program.

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by describing what CodeGrades is and what motivated you to start this project?
    • How does it differ from other approaches to teaching software development that you have encountered?
    • Is there a particular age or level of background knowledge that you are targeting with the curriculum that you are developing?
  • What are the criteria that you are measuring against and how do those criteria change as you progress in grade levels?
  • For someone who completes the full set of levels, what level of capability would you expect them to have as a developer?
  • Given your affiliation with the Python community it is understandable that you would target that language initially. What would be involved in adapting the curriculum, mentorship, and assessments to other languages?
    • In what other ways can this idea and platform be adapted to accommodate other engineering skills? (e.g. system administration, statistics, graphic design, etc.)
  • What interesting/exciting/unexpected outcomes and lessons have you found while iterating on this idea?
  • For engineers who would like to be involved in the CodeGrades platform, how can they contribute?
  • What challenges do you anticipate as you continue to develop the curriculum and mentor networks?
  • How do you envision the future of CodeGrades taking shape in the medium to long term?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA

Build Your Own Knowledge Graph With Zincbase - Episode 223

Summary

Computers are excellent at following detailed instructions, but they have no capacity for understanding the information that they work with. Knowledge graphs are a way to approximate that capability by building connections between elements of data that allow us to discover new connections among disparate information sources that were previously unknown. In our day-to-day work we encounter many instances of knowledge graphs, but building them has long been a difficult endeavor. In order to make this technology more accessible Tom Grek built Zincbase. In this episode he explains his motivations for starting the project, how he uses it in his daily work, and how you can use it to create your own knowledge engine and begin discovering new insights of your own.
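
As a rough illustration of the underlying idea, a knowledge graph can be thought of as a collection of (subject, predicate, object) triples that can be pattern-matched to surface connections. The short sketch below uses plain Python to store and query a few triples; it is a conceptual example only and does not reflect Zincbase's actual API.

    # Conceptual sketch: a knowledge graph as (subject, predicate, object) triples.
    # Plain Python for illustration; this is not Zincbase's API.
    triples = [
        ("tom", "created", "zincbase"),
        ("zincbase", "written_in", "python"),
        ("python", "is_a", "programming_language"),
    ]

    def query(subject=None, predicate=None, obj=None):
        """Return every triple matching the (possibly partial) pattern."""
        return [
            (s, p, o)
            for (s, p, o) in triples
            if (subject is None or s == subject)
            and (predicate is None or p == predicate)
            and (obj is None or o == obj)
        ]

    print(query(subject="zincbase"))  # [('zincbase', 'written_in', 'python')]
    print(query(predicate="is_a"))    # [('python', 'is_a', 'programming_language')]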

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Podcast.init listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a trial.
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Tom Grek about knowledge graphs, when they’re useful, and his project Zincbase that makes them easier to build

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what a knowledge graph is and some of the ways that they are used?
    • How did you first get involved in the space of knowledge graphs?
  • You have built the Zincbase project for building and querying knowledge graphs. What was your motivation for creating this project and what are some of the other tools that are available to perform similar tasks?
  • Can you describe how Zincbase is implemented and some of the ways that it has evolved since you first began working on it?
    • What are some of the assumptions that you had at the outset of the project which have been challenged or updated in the process of working on and with it?
  • What are some of the common challenges when building or using knowledge graphs?
  • How has the domain of knowledge graphs changed in recent years as new approaches to entity resolution and data processing have been introduced?
  • Can you talk through a use case and workflow for using Zincbase to design and populate a knowledge graph?
  • What are some of the ways that you are using Zincbase in your own projects?
  • What have you found to be the most challenging/interesting/unexpected lessons that you have learned in the process of building and maintaining Zincbase?
  • What do you have planned for the future of the project?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA

Docker Best Practices For Python In Production - Episode 222

Summary

Docker is a useful technology for packaging and deploying software to production environments, but it also introduces a different set of complexities that need to be understood. In this episode Itamar Turner-Trauring shares best practices for running Python workloads in production using Docker. He also explains some of the security implications to be aware of and digs into ways that you can optimize your build process to cut down on wasted developer time. If you are using Docker, thinking about using it, or have just heard of it recently, then it is worth your time to listen and learn about some of the cases you might not have considered.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • To connect with the startups that are shaping the future and take advantage of the opportunities that they provide, check out Angel List where you can invest in innovative businesses, find a job, or post a position of your own. Sign up today at pythonpodcast.com/angel and help support this show.
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Itamar Turner-Trauring about what you need to know about running Python workloads in Docker

Interview

  • Introductions
  • How did you get introduced to Python?
  • For anyone who is unfamiliar with it, can you describe what Docker is and the benefits that it can provide?
  • What was your motivation for dedicating so much time and energy to the specific area of using Docker for Python production usage?
  • What are some of the common issues that developers and operations engineers run into when dealing with Docker and its build system?
  • What are some of the issues that are specific to Python that you have run into when using Docker?
  • How does the ecosystem for Python in containers compare to other languages that you are familiar with?
  • What are some of the security issues that engineers are likely to run into when using some of the advice and pre-existing containers that are publicly available?
  • One of the issues that you call out is the speed of container builds. What are some of the contributing factors that lead to such slow packaging times?
    • Can you talk through some of the aspects of multi-layer packages and useful ways to take proper advantage of them?
  • There have been some recent projects that attempt to work around the shortcomings of the Dockerfile itself. What are your thoughts on that overall effort and any specific tools that you have experimented with?
  • When is Docker the wrong choice for a production environment?
    • What are some useful alternatives to Docker, for Python specifically and for software distribution in general that you have had good luck with?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA

Protecting The Future Of Python By Hunting Black Swans - Episode 221

Summary

The Python language has seen exponential growth in popularity and usage over the past decade. This has been driven by industry trends such as the rise of data science and the continued growth of complex web applications. It is easy to think that there is no threat to the continued health of Python, its ecosystem, and its community, but there are always outside factors that may pose a threat in the long term. In this episode Russell Keith-Magee reprises his keynote from PyCon US in 2019 and shares his thoughts on potential black swan events and what we can do as engineers and as a community to guard against them.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • And to grow your professional network and find opportunities with the startups that are changing the world then Angel List is the place to go. Go to pythonpodcast.com/angel to sign up today.
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Russell Keith-Magee about potential black swans for the Python language, ecosystem, and community and what we can do about them

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what a Black Swan is in the context of our conversation?
  • You were the opening keynote for PyCon this year, where you talked about some of the potential challenges facing Python. What motivated you to choose this topic for your presentation?
  • What effect did your talk have on the overall tone and focus of the conversations that you experienced during the rest of the conference?
    • What were some of the most notable or memorable reactions or pieces of feedback that you heard?
  • What are the biggest potential risks for the Python ecosystem that you have identified or discussed with others?
  • What is your overall sentiment about the potential for the future of Python?
  • As developers and technologists, does it really matter if Python continues to be a viable language?
  • What is your personal wish list of new capabilities or new directions for the future of the Python language and ecosystem?
  • For listeners to this podcast and members of the Python community, what are some of the ways that we can contribute to the long-term success of the language?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA

A Modern Open Source Project Management Platform - Episode 220

Summary

Project management is a discipline that has been through many incarnations, spawning an entire industry of businesses and tools. The challenge is to build a platform that is sufficiently powerful and adaptable to fit the workflow of your teams, while remaining opinionated enough to be useful. It also helps to have an open and extensible platform that can be customized as needed. In this episode Pablo Ruiz Múzquiz explains the motivation for creating the open source tool Taiga, how it compares to the other options in the market, and how you can use it for your own projects. He also discusses the challenges inherent to project management tools, his philosophies on what makes a project successful, and how to manage your team workflows to be most effective. It was helpful learning from Pablo’s long experience in the software industry and managing teams of various sizes.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Pablo Ruiz Múzquiz about Taiga, a project management platform for agile developers & designers and project managers who want a beautiful tool that makes work truly enjoyable

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what Taiga is and the reason for building it?
    • Project management platforms have been available for a long time. Can you describe how Taiga fits into that market and what makes it stand out?
  • Can you describe how you view project management and some of the unique challenges that it poses when building a tool for it?
    • How do the requirements differ between project management for software teams vs other disciplines?
  • How is Taiga implemented and how has the system design evolved since it was first started?
  • For someone who is using Taiga can you talk through the features of the platform and how it fits into a typical workflow?
  • How do you maintain a balance between usability and structure in managing project workflows against flexibility and customization?
  • Within an engineering team how do you view the responsibility for driving and maintaining the lifecycle of a project?
  • What are the most common points of friction within a project management workflow and how are you working to address them in Taiga?
    • Onboarding and discovery for a new contributor in a given project is often painful. What are some steps that a project manager or product team can take to make that process more palatable?
  • How has the landscape of project management practices and tools changed since you first began working on Taiga and how has that influenced your roadmap?
  • What have been the most challenging or difficult aspects of building and growing the Taiga project and community?
    • What lessons have you learned in the process that have been particularly valuable or unexpected?
  • What are some of the most interesting/unexpected/innovative ways that you have seen Taiga used?
  • When is Taiga the wrong choice for a given project or team?
  • What do you have planned for the future of Taiga?

Added by Pablo

  1. Why did you choose AGPLv3 for a license?
  2. How can Taiga integrate itself with other platforms that are typically used by teams?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA