Easy Data Validation For Your Python Projects With Pydantic - Episode 263

Summary

One of the most common causes of bugs is incorrect data being passed throughout your program. Pydantic is a library that provides runtime checking and validation of the information that you rely on in your code. In this episode Samuel Colvin explains why he created it, the interesting and useful ways that it can be used, and how to integrate it into your own projects. If you are tired of unhelpful errors due to bad data then listen now and try it out today.

Machine learning is finding its way into every aspect of software engineering, making understanding it critical to future success. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype.

Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. Podcast.__init__ is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there’s no obligation. Go to pythonpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.


Do you want to try out some of the tools and applications that you heard about on Podcast.__init__? Do you have a side project that you want to share with the world? Check out Linode at linode.com/podcastinit or use the code podcastinit2020 and get a $20 credit to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.



Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show because you love Python and want to keep your skills up to date. Machine learning is finding its way into every aspect of software engineering. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. Podcast.__init__ is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there’s no obligation. Go to pythonpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.
  • Your host as usual is Tobias Macey, and today I’m interviewing Samuel Colvin about Pydantic, a library for enforcing type hints at runtime.

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by describing what Pydantic is and what motivated you to create it?
  • What are the main use cases that benefit from Pydantic?
  • There are a number of libraries in the Python ecosystem to handle various conventions or "best practices" for settings management. How does Pydantic fit in that category and why might someone choose to use it over the other options?
  • There are also a number of libraries for defining data schemas or validation such as Marshmallow and Cerberus. How does Pydantic compare to the available options for those cases?
    • What are some of the challenges, whether technical or conceptual, that you face in building a library to address both of these areas?
  • The 3.7 release of Python added built-in support for dataclasses as a means of building containers for data with type annotations. What are the tradeoffs of Pydantic vs the built-in dataclass functionality?
  • How much overhead does Pydantic add for doing runtime validation of the modelled data?
  • In the documentation there is a nuanced point that you make about parsing vs validation and your choices as to what to support in Pydantic. Why is that a necessary distinction to make?
    • What are the limitations in terms of usage that you are accepting by choosing to allow for implicit conversion or potentially silent loss of precision in the parsed data?
    • What are the benefits of punting on the strict validation of data out of the box?
  • What has been your design philosophy for constructing the user facing API?
  • How is Pydantic implemented and how has the overall architecture evolved since you first began working on it?
    • What have you found to be the most challenging aspects of building a library for managing the consistency of data structures in a dynamic language?
      • What are some of the strengths and weaknesses of Python’s type system?
  • What is the workflow for a developer who is using Pydantic in their code?
    • What are some of the pitfalls or edge cases that they might run into?
  • What is involved in integrating with other libraries/frameworks such as Django for web development or Dagster for building data pipelines?
  • What are some of the more advanced capabilities or use cases of Pydantic that are less obvious?
  • What are some of the features or capabilities of Pydantic that are often overlooked which you think should be used more frequently?
  • What are some of the most interesting, innovative, or unexpected ways that you have seen Pydantic used?
  • What are some of the most interesting, challenging, or unexpected lessons that you have learned through your work on or with Pydantic?
  • When is Pydantic the wrong choice?
  • What do you have planned for the future of the project?

Keep In Touch

Picks

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Transcript

Tobias Macey
0:00:13
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 gigabit private networking, node balancers, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models or running your CI and CD pipelines, they've got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show because you love Python and want to keep your skills up to date. Machine learning is finding its way into every aspect of software engineering. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a machine learning expert who provides unlimited one-to-one mentorship support throughout the program via video conferences. You'll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. Springboard offers a job guarantee, meaning that you don't have to pay for the program until you get a job in the space. Podcast.__init__ is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there's no obligation. Go to pythonpodcast.com/springboard and apply today. Make sure to use the code AISPRINGBOARD when you enroll. Your host as usual is Tobias Macey, and today I'm interviewing Samuel Colvin about Pydantic, a library for enforcing type hints at runtime. So Samuel, can you start by introducing yourself?
Samuel Colvin
0:02:10
Hi, I'm Samuel. I am a software developer. I usually split my time between a SaaS company, TutorCruncher, that I have been working on for many years, and on now health, which is a very exciting health tech company; we do blood and genetic testing to give people actual health data. And then I also spend too much of my time doing open source.
Tobias Macey
0:02:29
And do you remember how you first got introduced to Python?
Samuel Colvin
0:02:31
I do. I got offered an internship when I was at university by a company that used Python quite a lot. I actually didn't do the internship, but I played with Python and got hooked. I've done quite a lot of development in other languages, in MATLAB and C# and Julia and Rust, and obviously, like everyone, JavaScript, but I've always come back to Python.
Tobias Macey
0:02:48
And so a few years ago, you started the Pydantic project. And I'm wondering if you can give a bit of a description about what that project is and what it is that motivated you to create it in the first place.
Samuel Colvin
0:02:59
Yeah, I can remember the problem I was trying to solve. I was trying to parse a dictionary of HTTP request headers into a kind of class, a bit like a data class, with some properties that had type annotations. And I was really frustrated that all of the validation libraries that I could find didn't respect or care about
0:03:17
type annotations. In fact, they directly conflicted with them. So I guess I started digging
0:03:23
around, found you can get access to the annotations, and went from there. And a few days later, I released version 0.0.1 and some people used it. It got quite popular on Hacker News right then, and the rest is history.
Tobias Macey
0:03:35
And so in terms of the main use cases that benefit from Pydantic, you mentioned being able to parse a dictionary of headers into a class object, but what are some of the other ways that it's being used?
Samuel Colvin
0:03:48
So the fun and the challenging bit of Pydantic is that it's used in quite a lot of different situations. It's used for settings management; you can think of it a bit like the settings.py file in Django, or in any server project where you would have settings for the DSN for connecting to databases, what port to use, and a thousand other things. But it's also used for API and form data validation. And then I think people use it at library boundaries, to confirm when people are using an external library that
0:04:21
they are passing that library the correct arguments,
0:04:24
and then it's also used by data scientists in a data processing pipeline kind of scenario. So the range of different ways in which it's used has been really interesting, but in certain situations it's also been a bit confusing, I think, for developers who assume that everyone else is doing with it what they're doing. About a year ago Sebastián Ramírez, who I think you interviewed a couple of weeks ago, started FastAPI, which uses Pydantic, and that's where the library really took off in terms of its popularity. But what's been really interesting is that since then its usage has really exploded, and not just with FastAPI. Pydantic now has just a bit over a million downloads a month, quite a lot more than FastAPI and the other libraries that use it. So it's obviously being used, I suspect, somewhere by big corporations who are running CI thousands of times a day, which is why it's being downloaded so much.
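The motivating case described above, turning a dictionary of HTTP headers into an object whose fields follow its type annotations, can be sketched with the standard library alone. This is only an illustration of the idea and not Pydantic's implementation; the `RequestHeaders` class, its fields, and the `parse_into` helper are hypothetical names, and the sketch assumes every annotation is a simple callable type such as `int` or `str`.

```python
from typing import Any, Dict, get_type_hints


class RequestHeaders:
    """Plain class whose annotations describe the expected field types."""
    host: str
    content_length: int
    keep_alive: float


def parse_into(cls: type, raw: Dict[str, Any]) -> Any:
    """Build an instance of cls, coercing each raw value with its annotated type."""
    obj = cls()
    errors = {}
    for name, expected in get_type_hints(cls).items():
        if name not in raw:
            errors[name] = "field required"
            continue
        try:
            # Calling the annotated type coerces the value, e.g. int("348") -> 348.
            setattr(obj, name, expected(raw[name]))
        except (TypeError, ValueError) as exc:
            errors[name] = str(exc)
    if errors:
        raise ValueError(f"invalid data for {cls.__name__}: {errors}")
    return obj


headers = parse_into(
    RequestHeaders,
    {"host": "example.com", "content_length": "348", "keep_alive": "1.5"},
)
print(headers.content_length + 1)  # 349
```

All field errors are collected before raising, so a caller sees every problem with the input at once rather than only the first.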
Tobias Macey
0:05:14
And in terms of the use cases that you're seeing for it, you know, the Web API is one sort of obvious avenue for it. And then you also mentioned it being used in the data science and data engineering contexts. Are there any ways that it's being employed that you found to be particularly surprising or any types of feature requests that you're getting for given contexts that you have either been sort of surprised and delighted by or had to actively turn down in terms of trying to avoid feature creep?
Samuel Colvin
0:05:45
Yeah, the feature creep
0:05:46
issue is definitely challenging because of the demand; people want to use it in lots of different situations. There's an issue around strictness in how you validate or parse data, which I think we'll come on to in a bit, which is definitely problematic because different people have strong and differing ideas about how it should work in that regard. I guess when I first started using it, I was always thinking about kind of dumb data: JSON-type inputs, mostly strings, but also obviously floats and ints. But a lot of the time it's used in contexts where the inputs can be quite complex Python objects, and so a lot of its usage has expanded to do validation of those complex objects. That definitely wasn't how it first started.
Tobias Macey
0:06:29
And there are a number of other libraries that exist in the Python ecosystem for being able to do various things like settings management, which Pydantic is focusing on, and that try to enforce different sorts of best practices. And then there is also another suite of libraries for being able to do things like define data schemas, or define validation logic for input data structures, such as Marshmallow and Cerberus. I'm wondering if you can give a bit of an idea of the comparison of how Pydantic fits in both of those different ecosystems and some of the reasons that people might want to choose Pydantic over the other available options.
Samuel Colvin
0:07:07
So looking at the settings case first of all, I'm a big fan of the twelve-factor app approach to building applications. And so a Pydantic BaseSettings class will automatically read and infer environment variables, or environment variables defined in a .env file. If you've ever used Django, you will have a settings file, but a great number of your default settings will have os.getenv around them to allow you to override that default. That's automatic in Pydantic, and of course, because it also coerces types, it will automatically coerce a string to, say, an int or a path or whatever else you might want it to be. The main difference, of course, is that it uses type hints to define what your types should be, so you don't have to set that out twice or conflict with type hints. I also think that tests need to be a first-class citizen when you think about settings. So a settings class can also take arguments when you initialize it, which will override the defaults and also the environment variables, making it really easy to use in a testing situation. And then lastly, Pydantic isn't tied to any particular framework or ecosystem. It can be used by tools like FastAPI, but it's not tied to them, so it can be used in any library you like. In terms of comparing it to tools like Marshmallow and Cerberus, as I say, the first thing you'll notice is that type hints are used. The second thing will be speed. Pydantic is about two and a half times faster than Marshmallow and, from our benchmarks, 26 times faster than Cerberus. Cerberus must have some problem there, because I've asked a question on Stack Overflow, and I think on their repo, about why it's so slow, but I've never got an answer. So it seems that Cerberus in particular is very slow. But Pydantic is faster than all of the other libraries that we've benchmarked, or on a par with the fastest ones. It's compiled with Cython.
And there are binaries available for Linux, Mac, and Windows from PyPI, which is one of the reasons it's so fast. But even without that it's among the fastest of the validation libraries. As you will learn when you start using Pydantic, we lean towards coercion over strictness, unlike other libraries, which is confusing in some scenarios, but mostly it's useful. If you think about parsing, say, GET arguments, the values will always be strings. If you had a value, say age, that you wanted to be an int, or a timedelta or a datetime, you need to have that coercion. There's no way that you could do strict validation that the input value must be an int when the inputs have to be strings. And lastly, I hope that we manage to be friendly and helpful on GitHub. Unlike lots of projects, Pydantic allows people to ask questions on GitHub, and I try to always be as helpful as I can, which, to put it mildly, isn't true of all projects.
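The settings behaviour described in this answer (class-level defaults, environment variables that override them, constructor arguments that override both, and coercion driven by the type hints) can be approximated in a few lines of standard-library Python. This is a rough sketch of the pattern, not Pydantic's settings class; `EnvSettings`, `AppSettings`, and the variable names are assumptions made for illustration, and .env file support is left out.

```python
import os
from pathlib import Path
from typing import get_type_hints


class EnvSettings:
    """Tiny stand-in for a settings base class (hypothetical, stdlib only)."""

    def __init__(self, **overrides):
        for name, expected in get_type_hints(type(self)).items():
            if name in overrides:
                # Explicit arguments win, which keeps tests simple.
                raw = overrides[name]
            elif name.upper() in os.environ:
                # Environment variables override class-level defaults.
                raw = os.environ[name.upper()]
            else:
                raw = getattr(type(self), name)
            # Coerce with the annotated type: "8080" -> 8080, "/tmp" -> Path("/tmp").
            setattr(self, name, expected(raw))


class AppSettings(EnvSettings):
    port: int = 8000
    data_dir: Path = Path("/var/data")
    dsn: str = "postgres://localhost/app"


os.environ["PORT"] = "8080"
settings = AppSettings(dsn="sqlite://")
print(settings.port, settings.data_dir, settings.dsn)
```

Here `port` comes from the environment and arrives as an `int`, `data_dir` falls back to its default, and `dsn` is taken from the constructor argument, which is the override a test would typically use.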
Tobias Macey
0:10:03
And because of the fact that you are addressing these two different use cases of being able to be used for settings validation, as well as being able to handle validation of input and output data or data as it traverses an application, what are some of the tensions or challenges that you face, whether technical or conceptual, or just in terms of requests from the community, in being able to build a library that addresses these different use cases?
Samuel Colvin
0:10:30
So I said earlier that Pydantic leans towards coercion over strictness. So if you pass it a string of a number, say the string of characters 123, into a field which is an integer, then it will convert that string to an integer. That occasionally confuses and frustrates people, and they say it should be strict and refuse a string as the input to an int. If you then say, well, what would happen if I pass a string to a path field, for example, they would say, oh well, it should definitely do coercion then. The conversation goes on, and what you realize is that the person you're speaking to is thinking in the JSON world, where you have the seven types of data in JSON, and they assume that there should be no coercion between them, but that there should be coercion from those JSON types to higher types like, I don't know, UUIDs or paths. So one of the problems I've seen, and it's quite understandable, is people assuming that other people's usage is similar to theirs. They might assume everyone's using it for an API where everything's JSON, when actually it's being used in lots of other contexts. Or they might assume that the data being passed in is lots of different Python objects and that JSON is not relevant. That's been a problem both technically, in how strict Pydantic should be, and conceptually, in explaining to people that it's used for lots of different things, which is great, but it can also lead to slight confusion.
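The distinction drawn here can be made concrete with plain Python conversions. In the sketch below every input is a string, as it would be in a query string; the field names and values are made up for illustration. The first conversion crosses between two JSON types, which is the one people tend to object to, while the other two go from a JSON type to a richer Python type, which most people expect to happen.

```python
from pathlib import Path
from uuid import UUID

raw = {
    "age": "123",
    "log_file": "/var/log/app.log",
    "request_id": "12345678-1234-5678-1234-567812345678",
}

# JSON string -> JSON number: the coercion that strictness advocates question.
age = int(raw["age"])

# JSON string -> higher-level types: coercion that is rarely questioned.
log_file = Path(raw["log_file"])
request_id = UUID(raw["request_id"])

print(age, log_file.name, request_id.version)
```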
Tobias Macey
0:11:52
In terms of the actual usage of Pydantic, from looking at the documentation it's largely based around building up these class objects that act as containers for the different data fields, and in a lot of ways it's very similar to the data classes that were added in Python 3.7. So I'm wondering what you see as being the trade-offs of using the built-in data classes versus Pydantic, and what challenges you faced when data classes were added as a first-class concern, given that it seems that you started the project before they were part of mainline Python.
Samuel Colvin
0:12:26
Yeah, Pydantic was first released for Python, I think it was 3.6, which didn't have data classes, and we had to add support for them later. So Pydantic has its own version of data classes, which are really just validation on top of standard data classes. The data class you get when you initialize a data class that uses the Pydantic decorator will effectively be a completely vanilla data class, just that validation will have gone on. More generally, the number one trade-off, of course, is that data classes don't do any data validation. So it might say foo needs to be an int, but if you pass it a string, or a nested dictionary of UUIDs, data classes aren't going to care; they aren't going to do anything. And so you're relying on mypy or static type analysis to check that that's actually true, whereas, obviously, Pydantic does the runtime type checking. The second big difference is that data classes use the pattern, or arguably anti-pattern, of generating a bunch of Python code and then calling eval on that to create the data class, which is quite slow and prevents you from doing things like compiling with Cython. Pydantic classes don't do any of that stuff, and so they can work with Cython. The only case where that happens, implicitly, is when we're using our own variant of standard library data classes, where of course we have to use the standard initialization of a data class once we've done the validation. And then the third big difference is that, of course, Pydantic has lots of other tools on top of it, whether that be parsing JSON, validators, serialization to things like JSON, or nested data structures. All that stuff is available in Pydantic and is not going to be available in the standard library data classes, which are kind of the building block. They're simpler, and they're great, and I use them quite a lot, but they're not right in every scenario.
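The first trade-off is easy to see with the standard library alone. In this small sketch (the class names are illustrative), the plain data class accepts a value that contradicts its annotation, and the only way to get a runtime check is to write one by hand, here in `__post_init__`.

```python
from dataclasses import dataclass


@dataclass
class Item:
    foo: int


# No error is raised: the annotation is only a hint for static tools.
item = Item(foo="not a number")
print(type(item.foo).__name__)  # str


@dataclass
class CheckedItem:
    foo: int

    def __post_init__(self):
        # Hand-written runtime check; raises ValueError for non-numeric input.
        self.foo = int(self.foo)


print(CheckedItem(foo="42").foo)  # 42
```

A static checker would flag `Item(foo="not a number")`, but nothing stops it at runtime, which is the gap that runtime validation is meant to fill.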
Tobias Macey
0:14:14
So because Pydantic is a library, and also because it's doing this runtime type checking, as opposed to just the ahead-of-time validation that you might get from a linter or something like mypy, what are some of the points of overhead or the potential complications that get added by using Pydantic in place of the built-in capabilities or doing this ahead-of-time checking?
Samuel Colvin
0:14:35
Well, it's going to be a lot slower, of course, to call a function where you go through the whole of data validation before you call it; that's unavoidable. There are cases where it's better to use the duck typing and catch-the-error approach rather than doing the validation first, but of course there are cases where you need validation. Compared to handwritten validation, because of the compiling, Pydantic is generally on a par, maybe slightly faster or slightly slower than handwritten validators. So it is, of course, slower than not doing validation. But if you're going to do validation, Pydantic is pushing towards the fastest you can do it in Python, I think.
Tobias Macey
0:15:13
And as far as the conversation that we started about strict validation versus type coercion, I know that you added some explicit points in the documentation to call that out and avoid confusion, because of some conversations that came up earlier in the lifecycle of the project. And I'm wondering if you can talk a bit more about some of the nuances of that strictness versus coercion, some of the ways that it manifests, and some of the limitations that you see in terms of the use cases by explicitly not supporting that strict validation in favor of being able to do the coercion.
Samuel Colvin
0:15:51
Yeah, I mean, Pydantic started off trying to be fast and trying to be simple. And so I took the approach that if I want something to be an integer, then the simplest thing to do is call the int built-in on that value. If it succeeds, then you know you've got an integer, and if it fails, then you take that error and use it as the basis for the exception you're going to throw. In most cases we still do that. That does mean that, for example, if you pass it a string, that will be passed to int, or if you pass something to a list field, it'll just call the list built-in and give you back a list. There are cases where that makes no sense. For example, virtually everything can be cast to a string because it has the __str__ method on it. So it wouldn't make sense just to call str on everything and say that if it can be cast to a string, then it is a string. It wouldn't make any sense, if you pass a list of integers to a field you expect to be a string, for you just to get back the string representation of that list. But there were some other weird ones that people came up with early on, which were mistakes that we fixed. For example, because you were just calling list on something to see if it's a list, and then calling, for example, int on each item to see if it's an int, if you passed the string 123 to something meant to be a list of ints, you got back the list [1, 2, 3], which was very confusing. So there are cases where we've gradually moved to be slightly stricter in the right places. I think that's the right approach to take. There is still an open issue, and literally there's also an open issue in my head, about whether we should have a completely strict mode. But from what I've seen, when you explain to someone what a completely strict mode means, things like a path has to be an instance of Path, not a string, that generally isn't what people want.
So I think we will avoid going completely strict and instead move towards slowly making things stricter in the right ways, but only have one mode. And if you really want to be stricter, you can use validators to do that yourself. In terms of cases where that means it doesn't work: if you had a kind of testing situation where you wanted to test the output of a webhook or of an API, Pydantic wouldn't be the right tool, because it will lean towards doing coercion instead of just checking your assumptions. This is also where you could lose data. For example, if you had a float of 3.1415 and you pass that to an int, you'll silently lose data, because it would convert that to 3. But most of the problems, I think, really come from developer confusion: they assume it would be strict, and that's where they end up getting confused and getting into trouble. And then I get a slightly irate issue saying, why the hell does it work this way?
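Each of the cases mentioned in this answer can be reproduced with the built-ins it refers to. The snippet below uses only plain Python, so it shows the underlying behaviour of `int`, `str`, and `list` rather than what any particular Pydantic release does; `strict_int` is a hypothetical helper showing what a hand-rolled stricter check might look like.

```python
# Calling the target type is the simplest possible coercion.
print(int("123"))  # 123

# str() succeeds on almost anything, so it cannot serve as a check.
print(str([1, 2, 3]))  # [1, 2, 3]  (a string, not a list)

# list() followed by int() on each item turns the string "123" into three ints.
print([int(ch) for ch in list("123")])  # [1, 2, 3]

# int() on a float truncates silently, which is where data can be lost.
print(int(3.1415))  # 3


def strict_int(value):
    """Hypothetical stricter check: accept only real ints (and not bools)."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"expected int, got {type(value).__name__}")
    return value


print(strict_int(7))  # 7
```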
Tobias Macey
0:18:24
And for those cases where there is the potential for data loss, I know that in some ways it's potentially impossible to determine ahead of time if there is any sort of lossy conversion that's going to happen. But have you explored the possibility of adding in some types of warning, or hooks to capture that this is going to lose data, and then being able to raise errors during the development cycle, or anything along those lines for giving people that option of lossy versus lossless conversions?
Samuel Colvin
0:18:54
Well, we've thought about this. The strict mode, I think, would be the route there. So you would enable the strict version of data classes during testing, and then you would know that in testing it was working. That's just a lot more work and not something that I think people actually want in the end. If you look at libraries like, I don't want to pick on Cerberus, but I know that you have to explicitly set up any coercion that you want, which is in some ways more explicit, and explicit is better than implicit. But it does also mean that in every single place you want to use some higher-order object that will require some coercion, you need to explicitly set that, whereas Pydantic tries really hard to just work out of the box in the most likely scenario. I mean, I also built it for me, for projects I needed, and so I built it the way I wanted. And to a degree, unless people pay me, they can kind of have it the way I want it, to be a bit blunt about it.
Tobias Macey
0:19:41
And that brings up an interesting point too, about the API design of the library and how you approach the overall philosophy of the structure of the project and the interfaces that you expose to users. And I'm wondering what your prior experience has been in terms of building out projects that are more widely used, and what your thoughts are in terms of how to design that API in a way that is easy to adopt, as well as being appropriately expressive for the problem space that you're working in?
Samuel Colvin
0:20:09
I think it's a difficult and nuanced problem. I think that humility is quite a useful attribute of a developer and probably an under-considered one. Trying to remember what it was like when you first started developing, and remembering how little you understood, is really valuable when you're building open source code that will be used a lot by junior developers. And I mean, Sebastián Ramírez, who built FastAPI and has helped me a lot on Pydantic, is the master of that. He seems great at creating open source projects and getting them to grow, and doing it in a way which arguably isn't even always the technically perfect way, but which allows the most people to use his code and to get something done. Then there are more practical things you can do. I know Sebastián was talking on your podcast previously about things like setting every keyword argument explicitly in the public-facing API, so that with IDEs you can kind of get there without having to use the documentation. Lots of projects don't do that, and it's really frustrating. So I definitely try to get things like that right and make it easy to use. I suppose my overall approach is that as long as it's fast and it's easy to get started with, I'm happy to skip some edge cases and allow people to fix them themselves or use another tool. I guess it's better to be right for the majority of people and usable than to do everything.
Tobias Macey
0:21:27
And in terms of how Pydantic itself is implemented, can you talk through the overall design of the project and some of the ways that it has evolved since you first began working on it?
Samuel Colvin
0:21:36
I would say that the biggest change is that we got to version one late last year. There were a few understandable complaints from big projects using Pydantic that it was a moving target. Pre version one I was quite rushed, and I was quite keen on breaking things to make it better, and that understandably frustrated people who had it as one of many dependencies. So we got to version one, and I've tried really hard to avoid any backwards-incompatible changes since then. The main other change in Pydantic over the years has been the fact that there are now four main interfaces to Pydantic. There's the BaseModel approach, which is the primary one. Then there's Pydantic data classes, which I talked about earlier. Then you have parse_obj_as, which allows you to parse or validate any object you like: you give it the type and you give it the raw data, and it will either succeed or fail. And then most recently, in version 1.5, we released validate_arguments, which is a function decorator that allows you to validate the arguments of any function. And then Sebastián added, back when he was first working on FastAPI, schema generation using JSON Schema, which was another big step forward for Pydantic. In terms of other changes, David Montague worked a lot on getting Pydantic to compile with Cython, which has made a big difference to performance. That was one of the big changes we made last year.
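To give a feel for what a function decorator that validates arguments involves, here is a minimal standard-library sketch. It is not the Pydantic decorator mentioned above and makes no claim about how that one is built; `coerce_arguments` and `repeat` are hypothetical names, and the sketch only handles annotations that are simple callable types such as `int` or `str`.

```python
import functools
import inspect
from typing import get_type_hints


def coerce_arguments(func):
    """Hypothetical decorator: coerce each argument with its annotated type."""
    hints = get_type_hints(func)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Map positional and keyword arguments onto parameter names.
        bound = signature.bind(*args, **kwargs)
        bound.apply_defaults()
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            if expected is not None:
                # Raises ValueError or TypeError when the value cannot be coerced.
                bound.arguments[name] = expected(value)
        return func(*bound.args, **bound.kwargs)

    return wrapper


@coerce_arguments
def repeat(text: str, times: int = 2) -> str:
    return text * times


print(repeat("ab", times="3"))  # ababab
```

Binding through `inspect.signature` means positional, keyword, and default arguments all go through the same coercion path before the wrapped function runs.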
Tobias Macey
0:22:56
And for the actual internal architecture of the project, I'm wondering how the class definitions are structured to be able to do things like gain access to the type attributes of the arguments to the class or of the fields within the class. Also, in terms of the typing capabilities in Python itself, I'm wondering what you have found to be some of the well-considered aspects of it, and any challenges that you face in terms of shortcomings of the typing system. I know that whenever this conversation comes up, people will invariably look to things like Haskell for some of the more elegant ways to handle complex types. But I'm wondering what your overall thoughts are on typing in Python and the ways that you approach leveraging it within the project.
Samuel Colvin
0:23:42
So the first bit of that question is super easy. In the metaclass, it's really easy to get access to the annotations object on a class and then use that as the basis for your validation. The tough bit is introspecting the types to work out what they are, and then developing code to suitably parse or validate data to check whether or not it's compliant with that. That's been the tough bit. Somewhere, and I was looking for it before this and couldn't find it, Guido has said that he explicitly doesn't think that type annotations are designed for runtime type checking. I think originally they said, "They're here, do whatever you like with them," and then later on it was clarified that they weren't for runtime type checking. And that kind of shows when you start introspecting types: it's hellishly complicated to get access to what they are. It was a lot of the work early on in pydantic, and it's still one of the frustrations to this day. Even if you think about a sequence of integers, you have Collection, Sequence, list, set, frozenset, Iterable. There are many different versions of what you might generically call an array of integers. So that's been problematic, and goes on being a problem. Not all of that is completely solved in pydantic, but most of the time it just works.
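The two halves of that answer can be sketched with the standard library alone. This is a toy illustration, not pydantic's actual implementation: ModelMeta, field_types and is_collection_of are hypothetical names, and the "array of something" check is deliberately simplistic.

```python
import collections.abc
from typing import Dict, FrozenSet, List, Sequence, get_args, get_origin, get_type_hints


class ModelMeta(type):
    """Toy metaclass that records a class's annotations at creation time."""

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        # The easy part: annotations are just a mapping of field name -> type.
        cls.field_types = get_type_hints(cls)


class Model(metaclass=ModelMeta):
    pass


class Reading(Model):
    sensor: str
    samples: List[int]
    tags: FrozenSet[str]


def is_collection_of(tp, item_type) -> bool:
    """The hard part: work out whether `tp` means 'some array of item_type'."""
    origin = get_origin(tp)  # e.g. list for List[int], None for plain int
    args = get_args(tp)      # e.g. (int,) for List[int]
    return (
        isinstance(origin, type)
        and issubclass(origin, collections.abc.Iterable)
        and not issubclass(origin, collections.abc.Mapping)
        and args[:1] == (item_type,)
    )
```

Even this small helper has to treat list, Sequence and frozenset as one family while excluding mappings, which gives a flavour of why introspecting types at runtime gets complicated quickly.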
Tobias Macey
0:25:04
So for a developer who's interested in getting started with pydantic or integrating it into an existing project, what is the overall workflow for adding in those type definitions, and what are some of the capabilities that are exposed by bringing pydantic into that project?
Samuel Colvin
0:25:20
So getting started with pydantic should be as simple as pip install pydantic. That will give you a compiled binary on whatever operating system you're using. Then have a look through the docs and get started. I think it's almost in some ways too easy to get started, because everyone already has an idea about how type hints work. That's probably one of the reasons that people find the pitfalls: they haven't gone through the docs. Quite understandably, like the rest of us, they just start going and end up running into problems. Those type hints are awesome, because they then work with your IDE, with mypy, with your own intuition, with PyCharm or whatever IDE you're using. In particular, PyCharm has a plugin specifically for pydantic, built by the community, which is awesome, and which makes usage of Python with pydantic even easier and even better. But yeah, the use of type hints avoids you having to learn another schema micro-language for defining your models. You can just use standard Python and off you go.
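A first model can look like the sketch below. It assumes pydantic has been installed with pip, and the User class and its fields are invented for illustration. It also shows the coercion pitfall that comes up in the next question: a numeric string is accepted for an int field rather than rejected.

```python
from pydantic import BaseModel, ValidationError


class User(BaseModel):
    # Plain Python type hints are the schema; there is no separate DSL.
    id: int
    name: str
    signup_note: str = ""  # fields with defaults are optional


# Coercion: the string "123" is accepted and converted to the integer 123.
user = User(id="123", name="Samuel")

# Data that cannot be coerced raises a ValidationError describing each field.
problem_fields = []
try:
    User(id="not a number", name="Samuel")
except ValidationError as exc:
    problem_fields = [error["loc"] for error in exc.errors()]
```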
Tobias Macey
0:26:17
And what are some of the edge cases that developers often run into, aside from the confusion about the strictness versus coercion of the typing information?
Samuel Colvin
0:26:28
One of the useful divisions, I think, is that when you're defining your model, it can't look at the rest of the world. So let's say that you're creating a user with an email address, and you need that email address to be unique. You can't do the check that the email address is unique within a validator in general, because that validator can't be asynchronous, and even if you're using a synchronous database lookup, it's generally bad practice to put that inside the validator. So using pydantic gives you a very good division between checking the data is consistent and then checking that the data works with the rest of the world. But that's something that quite often leads to confusion. There's a bit of tooling around how to raise pydantic validation errors after you've done the initial validation, which I'm going to work on in future. There are some differences with dataclasses, even if you use the pydantic dataclasses, mostly around implicit constraints in dataclasses: you can't have extra arguments to them, which occasionally people have problems with. And then there's a kind of complex question about whether to pass around pydantic models, or dictionaries created from pydantic models, or dataclasses you create, or how you then pass your data along. If it's as simple as accessing an attribute of a model and saving that to the database, that's fine. But obviously people have complex processing workflows, and people have lots of different solutions for that. It works well with pickle, so you can just pickle your models, but even that occasionally leads to problems.
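The division described in this answer, checking that the data is self-consistent first and checking it against the rest of the world second, might be sketched as follows. The names NewUser, EmailTaken and register are hypothetical, and a plain set stands in for the database lookup.

```python
from pydantic import BaseModel, ValidationError


class NewUser(BaseModel):
    name: str
    email: str


class EmailTaken(Exception):
    """Raised by the second, 'rest of the world' check."""


def register(raw: dict, taken_emails: set) -> NewUser:
    # Step 1: pydantic checks that the data is well formed on its own.
    user = NewUser(**raw)
    # Step 2: the uniqueness check lives outside the model and its validators.
    if user.email in taken_emails:
        raise EmailTaken(user.email)
    taken_emails.add(user.email)
    return user


taken = {"ada@example.com"}
created = register({"name": "Sam", "email": "sam@example.com"}, taken)
```

Keeping step 2 out of the model means the lookup can be asynchronous or transactional without the validator needing to know about it.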
Tobias Macey
0:28:00
And for the capability of being able to access the different attributes and pass along the data, what are some of the best practices that you have found to be useful, in terms of whether you pass the exact model or just a representation of the data and then do the coercion back and forth throughout?
Samuel Colvin
0:28:19
I don't think there's a single good answer. I do occasionally end up with dataclasses that shadow models, or at least shadow the parts of models that I need in a particular context. Sometimes it is as simple as calling dict on the model, getting back that dictionary and using that. But in general, I think that the best practice would be to continue passing the model around and using that model directly, rather than converting it to a dictionary too early.
Tobias Macey
0:28:46
And another element of using pydantic within these applications, particularly using the model terminology, is that people might get confused if they are used to working with ORMs in web frameworks such as Django. I'm wondering what the options are for being able to integrate pydantic with some of the other elements of the ecosystem, starting with things like ORMs, using it within Django or with things like SQLAlchemy, for doing the validation of the data as it's flowing into the application and being able to easily convert it into a database object.
Samuel Colvin
0:29:24
Yeah, I've got to put Django to one side, because I think that the best and the worst of Django is that it's batteries included. If you're going to use Django, you're probably best using vanilla Django and Django REST framework and leaving pydantic out of it. I'm sure that there are some cases where people do build stuff with Django but use pydantic, but I definitely haven't. SQLAlchemy and other ORMs are a more nuanced question. Pydantic has a from_orm mode, which basically allows it to inspect the attributes of an ORM class and build a pydantic model from there. I personally am not particularly pro ORMs. I would much rather write my queries in raw SQL or whatever than have this ORM step in between. So I haven't done that much recently with ORMs and pydantic, and I would say ORMs are quite often a mistake, full stop. So I don't put that much effort into using them myself, but I know many people do. And you do find yourself having to define your data twice, in the form of a pydantic model and a SQLAlchemy model. But I think that's just unavoidable, at least in most cases. There are no doubt exceptions where you could auto-generate one or the other, but I would say unless you had thousands of different tables, and therefore models, it wouldn't be worth it.
Tobias Macey
0:30:35
And talking about it in the context of databases also brings up the interesting question of the ability to define pydantic classes that relate specifically to other classes, so that you can do things like joining across different data objects or specifying relations between those objects as they flow throughout your application. I'm wondering what you have found in terms of some of the advanced usage capabilities of pydantic that are not necessarily obvious at first blush.
Samuel Colvin
0:31:06
One of the things I've been amazed by is how advanced many people's usage of pydantic has been, in ways which I definitely haven't used it myself. Pydantic doesn't have, and it's an open issue, a way of handling recursive links between models. I can't think of a great example right now, but say you have a user linked to a pet, and then you have the pet linked back to the user. That can lead to recursion problems in pydantic. It's an open issue, and I'd love someone to come along and help with that. But in short, I haven't solved that because I suspect that models like that are an anti-pattern in the first place, and so they just shouldn't exist. I would much prefer to keep my models individual and not connected, and have an integer field if there's a foreign key, for example, rather than having some implicit link to another model that automatically does a query that I can't see, because I think that complexity gets you into hot water when it's not actually saving you enough time. In terms of other advanced features of pydantic, the generic models built by David Montague are scarily powerful. If you think about generics in the context of Python typing, generic models are like that, but with validation on top. So you can define a model that has some generic type or types associated with it, and pydantic will then go and do the validation based on dynamic types within the definition of the model. Another of the powerful features of pydantic is custom types with custom validation and custom schemas, which I don't think people are aware of enough and which don't get used enough. Probably the documentation could do with some work there. One of the things that I often find myself suggesting as a solution to people who ask questions is custom base models with custom config, and even modified methods like dict or json on the model, which gets around lots of people's requests for more features. And then the validation error system in pydantic is quite complex. It allows things like custom translations and customized messages on errors, which again probably aren't documented well enough, because people don't seem to be aware of that and the power of what that can do.
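The preference stated earlier in this answer, unconnected models with a plain integer field standing in for a foreign key, might look like the sketch below. Owner, Pet and the owners dictionary are invented for illustration; the dictionary stands in for whatever explicit query you would actually write.

```python
from pydantic import BaseModel


class Owner(BaseModel):
    id: int
    name: str


class Pet(BaseModel):
    id: int
    name: str
    owner_id: int  # a plain integer, not a nested Owner model


# Stand-in for a table of owners keyed by primary key.
owners = {1: Owner(id=1, name="Ada")}

pet = Pet(id=10, name="Rex", owner_id="1")

# The lookup is explicit and visible at the call site, not hidden in the model.
owner = owners[pet.owner_id]
```

Because neither model refers to the other, there is no cycle for validation or serialization to recurse through.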
Tobias Macey
0:33:11
And another interesting capability that you touched on briefly earlier is the ability to generate things like JSON Schema from a pydantic model. I'm wondering if there is the capability of doing something like that in reverse, where you have a schema definition and can generate the corresponding model object for validating other instances of that schema.
Samuel Colvin
0:33:37
Um, there are some third-party projects, I think at least two out there, that generate either Python code or pydantic models. The problem with that is that things like mypy and static typing and your own intuition aren't going to work, because the model isn't defined in code somewhere. So in general, I would say that the Python code should be your single source of truth about the definition of what your data should look like, and then you should generate the schema from there. Yeah, there are tools that can generate Python code to represent a schema, and I'm sure there are contexts in which they're useful. I haven't used them myself.
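Treating the Python model as the single source of truth and generating the schema from it can be sketched like this. The example assumes the pydantic 1.x method name schema(), and the Episode model is invented for illustration.

```python
from pydantic import BaseModel


class Episode(BaseModel):
    number: int
    title: str
    explicit: bool = False  # has a default, so it is not required


# Generate a JSON Schema document (as a plain dict) from the model definition.
schema = Episode.schema()

number_type = schema["properties"]["number"]["type"]
required_fields = sorted(schema["required"])
```

Since the schema is derived from the class, mypy and IDE tooling keep working against the model while other systems consume the generated document.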
Tobias Macey
0:34:10
And I'm wondering too if you have explored the space of being able to generate other types of schema objects beyond just JSON Schema. So for instance, in the data engineering context, being able to use a pydantic model to create instances of Avro objects or Parquet rows, things like that.
Samuel Colvin
0:34:29
I haven't myself; I haven't worked on that. I would have said that the best approach would be to work from the current JSON Schema dicts and go about generating from that, but I haven't had any experience myself.
Tobias Macey
0:34:40
And then as far as being able to integrate pydantic with other frameworks, because of the fact that it is largely just vanilla Python, it seems like it's fairly straightforward for things like using it in the settings module of a Django project. But what have you found to be some of the useful tips in terms of the overall process of integrating with things like the Dagster project for ETL workflows, or Pyramid or Flask for web applications, or frameworks of other types that people might be trying to use the data validation capabilities within?
Samuel Colvin
0:35:14
I've used it quite a lot with libraries like Starlette. Obviously it's a cornerstone of FastAPI. I know people use it quite a lot for settings management in Flask, and I think there are some libraries around for that, but I haven't used Flask for a few years, so I haven't been working with that. It's become more and more a cornerstone of the data pipelines used in machine learning projects. It's been kind of amazing to see, because it was an application I had never thought of before. But now you see all of these quite popular machine learning packages, like DeepPavlov and Transformers from Hugging Face, using pydantic both for settings management and for parsing the data before doing the machine learning, which has been really interesting. I haven't had much experience with them myself, but the amazing thing about pydantic has been seeing other people pick it up and run with it and do stuff I had never thought of.
Tobias Macey
0:36:05
And what are some of the other interesting or innovative or unexpected ways that you have seen pydantic used that you have been particularly surprised or impressed by?
Samuel Colvin
0:36:16
Probably the most surprising thing for me has been big companies you would have heard of, like Microsoft, IBM, AWS, NSA, Uber and Salesforce, using pydantic, which was never something I would have expected when I first hacked something together and released it on PyPI. In terms of particular projects, Facebook have their fastMRI project for making MRI scanning faster, and then their ReAgent machine learning library, that use pydantic. Microsoft use pydantic through FastAPI for core Windows and Office services, which is amazing. There's a Mexican neobank called Cuenca, I hope I pronounced that right, who use pydantic for their interbank transfer validation. The Molecular Sciences Software Institute use pydantic a lot, and as far as I know they're using it for their COVID response. And there are lots of other projects in academia and in industry. Each of those projects is cool, but the most gratifying thing for me has been seeing the sheer number and diversity of different projects that have used it in ways I wouldn't have thought of.
Tobias Macey
0:37:15
And particularly for some of the scientific contexts, have you found people using pydantic alongside things like Pint for being able to handle unit conversions and being able to incorporate that validation or transform logic within the pydantic models?
Samuel Colvin
0:37:32
One of the frustrations I find writing open source code is that you can see the open source tip of the usage iceberg, but you can't see the closed source usage. It's tantalizing to be able to see some of it and guess at what other people are doing with it, but not to be able to go in and actually understand what people are doing with it. So the short answer is I just don't know. Some of those projects are obviously open, and I'm sure if I spent some weeks digging away I would find out what people are doing with it, and I'm sure that would be very useful for me in terms of how I develop it further, but I haven't done that. A lot of the usage is closed source, and so you just don't know. But it's interesting, because you get the occasional intuition about what people are doing. I had a credit checking agency who pointed out a bug the other day in a regex, and it had never occurred to me that that company would be using it. So you occasionally get a hint at what's underneath the sea level of that iceberg, but mostly, I don't know.
Tobias Macey
0:38:26
And in terms of your own experience of building and growing the pydantic project, what are some of the most interesting or challenging or unexpected lessons that you've learned in that process?
Samuel Colvin
0:38:38
It's been really fun working with some big companies. I would do that again, either commercially or just for free, out of curiosity to see how larger organizations use it. There's a strange paradox when you start writing open source: on day one, you desperately hope some other people will install it and use it. But careful what you wish for, because three years later I now spend an hour or two a day working on pydantic, mostly just answering issues, which definitely wasn't what I planned to do, but it's been really interesting. I talked earlier about humility and remembering what it was like not to know how to do things. Answering a lot of, quote, "dumb" questions from people who are relatively new to Python has been a good experience in reminding me how much I know, in a sense, and how lucky I am to be able to write code like I can. And so I do my best to give back and not to get frustrated by people asking questions that I think they could have worked out with a few minutes on Google.
Tobias Macey
0:39:35
Some of the other challenging aspects of open source maintainership are also things like knowing when to say no to particular feature requests, but also the quote-unquote bus factor of the project, and figuring out what the succession path for maintainership is if you decide to step away from the project, or you're no longer actively using it and you want to spend your time on other things. I'm wondering what your own personal approach to that is, or any other thoughts that you have on maintaining open source projects.
Samuel Colvin
0:40:07
Yeah, I think it's really hard. And I wouldn't deny that the bus factor is quite a big problem with pydantic right now, in that I'm doing the majority of the work. It's really gratifying to see other people, particularly Sebastián Ramírez and David Montague but lots of other people as well, contributing to it and taking time to answer questions. But I think it's an outstanding big problem. I think that we do a lot of patting ourselves on the back and saying how great open source is. I was reading an essay by John Mark, which I'll leave a link to, called, perhaps slightly hyperbolically, "Open Source Has Failed". I wouldn't go quite that far, but I definitely don't think it's quite as rosy as it should be. I definitely see the problem that all these big and very profitable companies use a library like pydantic, and of course many more, but don't contribute financially. That then leads to a bus factor with projects like this, and even more so with FastAPI, if you like. I think Sebastián is amazing, and he's done lots, and it's got incredibly popular very quickly. But I don't think that any of the organizations that use it have succession planning or have thought about what would happen if those libraries stopped being maintained. I think the good thing is that, at least with pydantic, it's relatively stable. I have lots of interesting things I want to do in v2 that I will one day get round to working on, but without that, it's not like it's going to suddenly crash or fall to the ground if it doesn't have any work on it for a few weeks or a few months.
Tobias Macey
0:41:28
And in terms of selecting the libraries to use within an application, if people are looking at pydantic, what are the cases where it's the wrong choice, and they might be better suited using built-in dataclasses or something like marshmallow or some other settings management library?
Samuel Colvin
0:41:46
I'd say the first case is that it's not a substitute for strictly typed compiled languages like C++ or Rust or Go. You occasionally see questions where I wonder whether the real answer is that you shouldn't be using Python. It helps get around some of those problems, but it's not going to be a substitute for them. I also think that often the old-fashioned Python approach of duck typing and catching the error, easier to ask forgiveness than permission, just try it, works well. If you end up validating every single input in Python, it gets really, really slow. So if you are reasonably sure about the inputs, and if there's no particular security concern with just calling it and seeing what happens, then that's often a better approach. It's also obviously wrong, as I said earlier, in the strict validation context of, say, a test case where you want to confirm that a webhook or an API is giving you the right data. Pydantic is not the right tool for that, because it leans, as always, towards coercion over strictness. But if you are determined to write Python, and if you want to do validation, I think pydantic is probably the best tool in most scenarios. That's obviously my biased opinion.
Tobias Macey
0:42:46
And as you continue to work on and maintain the project, what are some of the things that you have planned for the future of it, and maybe some of the ways that you're planning on using it in your own work?
Samuel Colvin
0:43:01
Yeah, I'm heading towards v2 now, and I've got quite a lot of big features that I want to work on in v2, and some stuff I want to break to get right. With validators currently, you can think of them like a list of functions: each function is called in turn, and the output from one is given as the argument to the next, unless an error occurs. That's slightly slow and somewhat confusing. I would much prefer validators to work a bit like middleware, so it's a function stack, and each function calls the next function along, but from the outside that just looks like one function. That should be faster in the case that you are doing some common piece of validation where it's just one function, but also much more powerful, because you could do your checks before or after calling the next function without having to mess with the order of the validators. I also think one of the things marshmallow does well is that it makes serialization a kind of first-class concern. So it's not just about the parsing stage, but it's also about the output stage, and pydantic has quite a few things I want to work on in that regard: input and output aliases, and computed fields, so fields on a model that don't come explicitly from an input but instead are computed, either eagerly or lazily, based on other fields. And there's a whole bunch of other stuff. Have a look at the v2 milestone on GitHub. I'd love some feedback.
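The difference between the two validator styles described here can be sketched in plain Python. This is only an illustration of the idea, not pydantic's current or planned implementation, and every name in it (run_chain, build_stack, positive_after and so on) is hypothetical.

```python
from typing import Any, Callable, List

Validator = Callable[[Any], Any]
Handler = Callable[[Any], Any]
Middleware = Callable[[Any, Handler], Any]


def run_chain(validators: List[Validator], value: Any) -> Any:
    """List style: call each validator in turn, feeding output to the next."""
    for validate in validators:
        value = validate(value)
    return value


def _bind(middleware: Middleware, call_next: Handler) -> Handler:
    """Wrap one middleware around the handler that comes after it."""
    def handler(value: Any) -> Any:
        return middleware(value, call_next)
    return handler


def build_stack(middlewares: List[Middleware]) -> Handler:
    """Middleware style: fold the functions into a single callable."""
    handler: Handler = lambda value: value  # innermost step returns the value
    for middleware in reversed(middlewares):
        handler = _bind(middleware, handler)
    return handler


# List-style validators.
def strip(value: str) -> str:
    return value.strip()


def to_int(value: str) -> int:
    return int(value)


# Middleware-style validators: each decides when to call the next one.
def strip_mw(value: str, call_next: Handler) -> Any:
    return call_next(value.strip())


def int_mw(value: str, call_next: Handler) -> Any:
    return call_next(int(value))


def positive_after(value: Any, call_next: Handler) -> Any:
    result = call_next(value)  # run the inner validators first
    if result <= 0:            # then check their final output
        raise ValueError("value must be positive")
    return result


stack = build_stack([positive_after, strip_mw, int_mw])
```

Note that positive_after is listed first yet performs its check last, which is the "before or after calling the next function" flexibility that a flat list cannot express without reordering.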
Tobias Macey
0:44:18
And are there any other aspects of your work on pydantic or the overall field of data validation and type checking, or any ways that you're looking for help from the community, that we didn't discuss and that you'd like to cover before we close out the show?
Samuel Colvin
0:44:34
Oh, I would say that I've got quite a lot of issues with a label called feedback. I'd love people's feedback. Obviously there's a risk of just bikeshedding, but getting people's input is really useful before you release a feature, rather than them being furious once you've done it. So I don't need people to write code or submit issues, but just a bit of feedback, a plus one here or there, makes it much easier to work out what people are looking for. Other than that, nothing in particular.
Tobias Macey
0:44:56
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week I'm going to choose something that I had forgotten about for a while and remembered during a conversation with my family the other day: devil sticks, also known as juggling sticks or flower sticks, as a way to pass the time and have something to keep your idle hands busy. So if you're looking for something to do or play with, I'll definitely recommend checking that out. And so with that, I'll pass it to you, Samuel. Do you have any picks this week?
Samuel Colvin
0:45:28
I do. In terms of books, I would thoroughly recommend Flash Boys by Michael Lewis. I know it's not that recent, but it's an awesome book. In fact, everything by Michael Lewis; I'm a massive fanboy of his. Then in a more computing-specific context, there's Algorithms to Live By by Brian Christian and Tom Griffiths, which is an awesome book, much better than its title suggests. In terms of TV, I've really enjoyed Sex Education on Netflix. If you're bored at home at the moment, I would certainly recommend it; it's extremely funny. And then in terms of tech, I recently found a website called ngrok.com, which creates a tunnel from a port on your local machine to the public internet, which is awesomely helpful when you're developing and you want to show something to someone, or if you want to have an HTTPS connection to a local port. That was a really nice find, a really useful tool.
Tobias Macey
0:46:14
All right. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with pydantic. It's definitely a very interesting project, and one that I'm planning to make use of in some of my work. So thank you for all of your time and effort on that, and I hope you enjoy the rest of your day.
Samuel Colvin
0:46:28
Thank you very much, you too.
Tobias Macey
0:46:33
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast, at dataengineeringpodcast.com for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email [email protected] with your story. To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.