Build Your Own Domain Specific Language in Python With textX - Episode 269

Summary

Programming languages are a powerful tool and can be used to create all manner of applications; however, sometimes their syntax is more cumbersome than necessary. For some industries or subject areas there is already an agreed upon set of concepts that can be used to express your logic. For those cases you can create a Domain Specific Language, or DSL, to make it easier to write programs that can express the necessary logic with a custom syntax. In this episode Igor Dejanović shares his work on textX and how you can use it to build your own DSLs with Python. He explains his motivations for creating it, how it compares to other tools in the Python ecosystem for building parsers, and how you can use it to build your own custom languages.

Do you want to try out some of the tools and applications that you heard about on Podcast.__init__? Do you have a side project that you want to share with the world? With Linode’s managed Kubernetes platform it’s now even easier to get started with the latest in cloud technologies. With the combined power of the leading container orchestrator and the speed and reliability of Linode’s object storage, node balancers, block storage, and dedicated CPU or GPU instances, you’ve got everything you need to scale up. Go to pythonpodcast.com/linode today and get a $60 credit to launch a new cluster, run a server, upload some data, or… And don’t forget to thank them for being a long time supporter of Podcast.__init__!



Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
  • Your host as usual is Tobias Macey and today I’m interviewing Igor Dejanović about textX, a meta-language for building domain specific languages in Python

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by describing what a domain specific language is and some examples of when you might need one?
  • What is textX and what was your motivation for creating it?
  • There are a number of other libraries in the Python ecosystem for building parsers, and for creating DSLs. What are the features of textX that might lead someone to choose it over the other options?
  • What are some of the challenges that face language designers when constructing the syntax of their DSL?
  • Beyond being able to parse and process an arbitrary syntax, there are other concerns for consumers of the definition in terms of tooling. How does textX provide support to those end users?
  • How is textX implemented?
    • How has the design or goals of textX changed since you first began working on it?
  • What is the workflow for someone using textX to build their own DSL?
    • Once they have defined the grammar, how do they distribute the generated interpreter for others to use?
  • What are some of the common challenges that users of textX face when trying to define their DSL?
  • What are some of the cases where a PEG parser is unable to unambiguously process a defined grammar?
  • What are some of the most interesting/innovative/unexpected ways that you have seen textX used?
  • What have you found to be the most interesting, unexpected, or challenging lessons that you have learned while building and maintaining textX and its associated projects?
  • While preparing for this interview I noticed that you have another parser library in the form of Parglare. How has your experience working with textX informed your designs of that project?
    • What lessons have you taken back from Parglare into textX?
  • When is textX the wrong choice, and someone might be better served by another DSL library, different style of parser, or just hand-crafting a simple parser with a regex?
  • What do you have planned for the future of textX?

Keep In Touch

Picks

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Raw transcript:
Tobias Macey
0:00:13
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try out a project you hear about on the show, you'll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode today, that's L-I-N-O-D-E, and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your own home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today. Your host, as usual, is Tobias Macey, and today I'm interviewing Igor Dejanović about textX, a meta-language for building domain specific languages in Python. So Igor, can you start by introducing yourself?
Igor Dejanović
0:01:29
Hi, Tobias. Thanks for having me. Sure, I'm Igor Dejanović. I work as a professor at the University of Novi Sad, where I teach several courses in software engineering, and the most relevant for this podcast is probably the course on DSLs. That's one of the reasons why textX exists, actually.
Tobias Macey
0:01:50
All right. And do you remember how you first got introduced to Python?
Igor Dejanović
0:01:54
Well, I remember it was relatively late, since Python has been around since the nineties and I was using different languages back then. I even missed the opportunity to use Python in the days when I did a lot of sysadmin work, where Python is used a lot. I think I first tried to pick up Python in 2008, but I remember that I was put off by the semantic whitespace at the time, because I didn't have experience with any language that did something like that. So I cooled off a little bit and tried Python again, I think in 2009, and I decided it was time to give it a few days: let's try it for a couple of days and see how things go. And I remember that, day after day, I actually started to enjoy even the whitespace stuff, and the nesting with whitespace started to feel very logical to me. Later on I learned there is actually a name for that in the DSL literature. It's called secondary syntax: the part of the language syntax that does not have specific semantics but that you can use freely. That's whitespace, for example, in a textual language; in a graphical language it's colors, shapes, and positions. And if you have a lot of secondary syntax, you get all sorts of readability issues, because people tend to develop their own styles. Python is quite good here, because reducing secondary syntax actually improves readability. So yeah, I actually love it a lot now.
Tobias Macey
0:03:33
And then, in terms of the context of domain specific languages, can you give a bit of a description about what they are and some of the cases where you might need to build one versus just using a general purpose programming language?
Igor Dejanović
0:03:47
A DSL is a language tailored and constrained to a particular domain. It is at the right level of abstraction, which gives its users, the domain experts, a higher level of expressiveness by removing unnecessary information that is part of the common understanding. What do I mean by that? For example, imagine two lawyers: if they're talking about some legal issues, because they are operating in the same domain, they can remove all the unnecessary information that is common understanding. Their expressiveness is higher; they can use shorter forms to convey information between each other. But if they are about to explain something to a person outside of the legal domain, they will have to be much more verbose, because they do not share the common understanding. Besides using the concepts of the domain, DSLs also use a concrete notation that is used in the given domain, thus ideally making the domain expert capable of specifying the solution on their own. Of course, in practice that is not always achieved, but even if the domain expert is not using the DSL directly, it is much easier to communicate with the developer when they're looking at a notation that is familiar to them. DSLs come in different forms and shapes. We have, for example, internal and external DSLs. Internal DSLs are those built inside a host language: if you use some clever features of the language, you can make something that looks like a different language but is actually interpreted or compiled by the same compiler. Some languages are more capable in that direction; Lisp, for example, is well known for being very capable when it comes to DSLs, and in Lisp DSLs are created all the time. Then we have some more contemporary languages: Ruby, for example, is also very popular for building internal DSLs. And even in Python, which is not that capable for building internal DSLs, we see internal DSLs all the time. For example, Django is a web application framework, and in Django, for the definition of a data model, you create a class that extends the model class and then you specify some class attributes as instances of fields, and from that description Django is able to dynamically generate all kinds of stuff, like the object-relational mapper, the SQL schema, or the admin interface for CRUD operations (there's a small sketch of this style below). On the other hand, external DSLs are full-blown languages on their own. They have their own syntax, and we have to build a compiler or an interpreter for them, so they are much harder to build and maintain. And then, of course, we have other categorizations, for example by concrete syntax: we have textual or graphical or some other notation like tabular. Those are all DSLs, just with a different interface to the user. Why should we use DSLs? Well, first of all, when you are constrained in what you can say and you use only the domain concepts, you can express the solution in a much more condensed form. You're more expressive, because the commonly understood stuff is hidden from you; it's built into the tooling, into the platform, into the compiler, and that enables us to be more productive. There are some case studies: for example, there is a case study by MetaCase, done at Nokia, I think maybe 10 years ago, on the development of mobile applications, where they measured a productivity boost by a factor of 10.
But in general, what we can observe in practice is that you should achieve at least a factor of five if you're implementing the DSL in the right way. Another appealing reason for using DSLs, besides the productivity boost, is that your solution, your knowledge from some domain, is stored and specified in a DSL which is independent from the underlying technology. So it will evolve not with the technology itself, but with the domain. What that means is that your knowledge is preserved and can be easily transferred to another target platform. And, what is also important, because it is at the right level of abstraction that is familiar to the domain expert, the specification of the solution also serves as up-to-date documentation of the system.
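To make the Django example above concrete, here is a minimal sketch of the internal-DSL style Igor describes; the model and field names are illustrative, not from the episode, and assume a configured Django project:

```python
# Declarative class attributes act as a small internal DSL: Django reads
# them to generate the ORM layer, the SQL schema, and the admin interface.
from django.db import models

class Author(models.Model):
    name = models.CharField(max_length=100)
    birth_date = models.DateField(null=True)

class Book(models.Model):
    title = models.CharField(max_length=200)
    # A relationship expressed declaratively rather than imperatively
    author = models.ForeignKey(Author, on_delete=models.CASCADE)
```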
Tobias Macey
0:08:40
And so, textX is a toolchain for being able to define your own DSLs and incorporate them into Python and other language projects. I'm wondering if you can give a bit more description about the textX project, some of the motivation behind creating it, and some of its origin story.
Igor Dejanović
0:08:59
Well, textX is actually a DSL and a tool for building DSLs, so it's a meta-language. The motivation was that early in my career I got introduced to model-driven engineering, and DSLs are, let's say, a different flavor of it; they overlap a lot. So I quickly got into DSL stuff through a project called Xtext. It is a Java-based project, and I think it was somewhere in 2005 or 2006 when I played with and used Xtext. I always wanted something similar to Xtext but in Python, and I wanted something for the DSL course, so it should be lightweight, something easy to use. Xtext was a little bit heavier because it is Java-based and Eclipse-based, and it is a lot harder to learn, so I wanted something easier for a student to get started with. That's the motivation. And I think I started developing it in maybe 2015. That's when I decided to sit down and actually implement it, because I realized there was nothing similar to it in the Python world at the time.
Tobias Macey
0:10:21
And as you mention, there are a number of other libraries that exist in Python for building DSLs or writing parsers. What are some of the capabilities of textX that make it stand out and might cause somebody to choose it over some of the other available options, and maybe some of the characteristics of the overall space of DSLs in Python that make something like textX useful and necessary?
Igor Dejanović
0:10:46
Yeah, there are great options in the Python world. We're actually lucky to have many parsing options: there are, for example, PLY and SLY, pyparsing, Parsimonious, and Lark, even ANTLR, which is a Java tool but can produce a parser for Python. But all those tools are, I would say, more like classical parsing tools, where you need much more involvement in the building. When you're building a DSL with them, besides the grammar you have to describe the actions which are used to transform the parse tree to something else, or which do that on the fly without building a parse tree. So you have to do a lot more to maintain your language. textX is built on an idea that stems right from Xtext, where through the grammar you actually describe both the parser, or the syntax of the language, and what we call the metamodel of the language, the structure of the language. So in a way it is constrained; that's why I like to say it's a DSL for building DSLs. You're constrained, but through that constraint you are actually much more productive and you can say more with less, because all you need is more or less the grammar itself, and you can start using your language. textX will, out of the box, create all the necessary elements of your language dynamically: it will create classes that correlate with your grammar rules, and it will create a parser that will parse the textual representation of your model and instantiate objects of your dynamically created classes. And that all happens at runtime, just by reading the grammar, so your language is much easier to maintain.
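As a rough sketch of the workflow Igor describes, where the grammar alone yields both the parser and the dynamically created classes (the grammar and model here are invented for illustration):

```python
from textx import metamodel_from_str

# The grammar defines both the syntax and the metamodel of the language.
grammar = """
Model: entities+=Entity;
Entity: 'entity' name=ID '{' properties+=Property '}';
Property: name=ID ':' type=ID;
"""

mm = metamodel_from_str(grammar)   # classes are created dynamically here
model = mm.model_from_str("""
entity Person {
    name : string
    age : int
}
""")

# The result is a graph of plain Python objects, not a parse tree.
print(model.entities[0].name)                # -> Person
print(model.entities[0].properties[1].type)  # -> int
```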
Tobias Macey
0:12:34
And for people who are building these DSLs, how does the actual definition of the grammar and hooking into the behavior work? As you mentioned, with the direct parsing tools you have to be much more manual and explicit about it.
Igor Dejanović
0:12:48
Well, when you're using classical parsing, you make a grammar and then you will either get the parse tree or some nested lists, for example, as one way to transform the parsed content to some data structure, but then you will need to transform that to some other form to be usable in further processing. And you have all sorts of things: if it is some real language, you have, for example, reference resolving, or you have to define parent-child relationships between elements. That's all built into textX. So by just giving textX the grammar, you get a nice graph; you are not getting a parse tree, you are actually getting an object graph of plain Python objects, connected by the reference resolving. It's not a tree, it's a graph, and you can use it straight away. And you can plug into the creation, in a way: you can define something called, in textX, object processors. An object processor is a callable in Python that is called to either check or transform the object that is being created. In that way you can implement additional semantic checks, or you can change the object being created on the fly. That is similar to, for example, the semantic actions you write in classical parsing, but it is an optional thing you can add to introduce additional semantics or additional transformations of the object.
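A hedged sketch of the object-processor mechanism described above; the grammar and the validation rule are invented for illustration:

```python
from textx import metamodel_from_str
from textx.exceptions import TextXSemanticError

grammar = "Model: points+=Point; Point: 'point' x=INT y=INT;"
mm = metamodel_from_str(grammar)

def check_point(point):
    # Called for every Point as it is created: either validate it (as
    # here) or return a replacement object to transform it on the fly.
    if point.x < 0 or point.y < 0:
        raise TextXSemanticError("point coordinates must be non-negative")

mm.register_obj_processors({"Point": check_point})
model = mm.model_from_str("point 1 2 point 3 4")  # passes the check
```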
Tobias Macey
0:14:27
And so, for those semantics, is that something like being able to say that this particular token is a keyword, whereas this other type of token is the beginning of a function definition or the end of a function definition, and things like that?
Igor Dejanović
0:14:40
Well, that will actually be given to you straight away because, if you're asking about tokenization of the input, textX is based on PEG grammars and a recursive descent parsing technique called packrat parsing. So it will distinguish between, for example, the name of a function and some keyword, because it has unlimited lookahead and can resolve that ambiguity
Tobias Macey
0:15:08
for people who are defining their DSLs, what are some of the challenges that they face in just constructing the syntax of their target language? And what are some of the types of inspiration that they might look to in determining how it's going to look, and the user experience of the people who are going to be using that language and the DSL parser and logic that are generated and built with textX?
Igor Dejanović
0:15:35
Well, there are all sorts of challenges during language design. First of all, when building domain specific languages, we must ensure that the domain is covered correctly. What that means is that you build into your language the correct concepts and relationships, that there are no concepts left out, and that you don't have any additional concepts that are irrelevant for the domain. What usually happens in practice, and is a danger for the DSL developer, is that you start with a DSL for some domain, but then it's very tempting to add stuff to that language, and many times the DSL ends up being a GPL, a general purpose language. That's one consideration. The other is defining the proper syntax for the language, and when we are talking about DSLs in general, we are not talking only about textual languages; the concrete syntax can be anything: a graphical representation, a tabular one. The concrete syntax is actually the interface to the user, what the user will see, feel, and interact with during the usage of the language. So it must be very nice, easy to use, and very intuitive, and to be intuitive it must correlate to the existing language in the domain; we're just formalizing what users are probably already using. Those are considerations regarding the syntax. And of course, if we're building textual syntaxes, we have to consider all the technical stuff like parsing issues, left recursion, ambiguities. And at the end come the semantics. We usually describe semantics by interpreting or compiling our language to something else, so choosing the right execution style, whether we should interpret our language or compile it, makes a difference. And, though it's probably less important, we have to take care about the runtime performance of our language. All these decisions will influence how effective our language will be in practice.
Tobias Macey
0:17:50
And on the point of runtime performance, what are some of the capabilities of Python that would lead someone to use it as the host language for a DSL? And what are some of the cases where somebody might want to use a DSL with a different host language that's more optimized for particular latencies or particular target environments?
Igor Dejanović
0:18:12
Well, it depends how critical the runtime performance is for your use case. I usually look at two performances: one is runtime, and one is development time, or maintenance time. If it's more important to you to quickly develop your language and to easily maintain it, then Python is a really good choice. Its dynamic nature gives you a quick turnaround, and you can experiment easily; no wonder it's used for prototyping and for things that are, you know, write once and throw away. If your system is critical and needs something that must give better runtime performance, Python is probably not a good option; there are other languages that could serve as a better host. But you can still use textX or similar tools built on Python to generate code for some other runtime platform. So these are different aspects: what are you using for developing the language, and what are you using for the runtime? The runtime can be different. You can easily generate, for example, Rust code from your models using textX; there is no constraint in that regard.
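A hedged sketch of that idea, pairing a textX model with a template engine to emit code for a different runtime; Jinja2 and the Rust output are arbitrary illustrative choices, not prescribed by textX:

```python
from textx import metamodel_from_str
from jinja2 import Template

mm = metamodel_from_str("Model: structs+=Struct; Struct: 'struct' name=ID;")
model = mm.model_from_str("struct Point struct Line")

# The development-time tooling is Python, but the generated target code
# can be for any platform: Rust in this illustration.
template = Template(
    "{% for s in structs %}pub struct {{ s.name }};\n{% endfor %}"
)
print(template.render(structs=model.structs))
```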
Tobias Macey
0:19:28
And as far as the overall end user experience of working with the DSLs that are being built using textX, what are some of the associated needs as far as tooling or the overall ecosystem of building and working in that environment? And what are some of the additional associated projects of textX, or capabilities built into it, that help in that overall process?
Igor Dejanović
0:19:53
Yes, well, tooling is very important when talking about DSLs. Given all the benefits you get from DSLs, the main reason why people didn't use DSLs so much in the past is probably the tooling support, because it's not easy to build yourself from scratch and to maintain. There is actually a class of software tools specifically made for building DSLs, and they are called language workbenches. textX is not a language workbench; it's a simpler tool. A language workbench is an integrated environment for building and evolving languages, so it's much more complex. But for the DSL part, there is a textx command that you get when you install the library. When you install the library in a Python virtual environment, you get a textx command that can be used to check your model, or to visualize your model or metamodel, for example. It can, for example, generate a nice diagram of your grammar; it is a class-like diagram that describes the structure of the language. Or you can use the textx command to start a project, to make the initial outline of the project. And there are other useful tools: for example, there is support for the language server protocol and Visual Studio Code integration. It is a project, textX-LS, that Daniel Elero is working on. He was working on a master's thesis regarding the language server protocol for textX-based languages, and after his master's thesis was finished he continued to work on it, and now we have a second version of that project that is developing very nicely. So anyone who wants to try textX should check out that project. It is under the same organization on GitHub as textX itself,
Tobias Macey
0:22:03
yeah, being able to have that syntax highlighting and the language server support in the development environments will certainly reduce the burden for people who want to be able to take advantage of the DSL without just looking at a blank wall of text and not really having any indicators of what the different tokens are and what their meaning might be in relation to each other.
Igor Dejanović
0:22:25
Definitely. It's the first thing that should be done when you're producing tooling for your language: syntax highlighting, code completion, and code navigation. You should help your user to easily navigate around and to get help from the environment. And textX-LS is a project exactly for that: for any language you develop using textX, it can automatically generate the Visual Studio Code integration for that language. So out of the box you get syntax highlighting for your language, which you can further configure if you're not satisfied with the results, and it is planned to support all styles of IDE support, like navigation and completion. So it's not fully finished yet, but it's very usable at the moment
Tobias Macey
0:23:14
and can be tried out. Digging deeper into textX itself, can you talk through how it's implemented and some of the ways that the structure of the project and its overall goals have evolved since you first began working on it?
Igor Dejanović
0:23:27
Well, it's built on top of a parser called Arpeggio. It's a PEG parser I started developing in 2009; it's probably the first real project I did in Python. So when I decided to write textX, I decided, okay, I will use PEG parsing, because I knew Arpeggio very well, and I realized I would probably need to tweak it along the way to support all the features I wanted to have in textX. And that was a good decision, I think, because along the way I did have to tune a few things in Arpeggio itself to make developing some textX features easier. So basically, Arpeggio is doing the parsing; textX is just a layer above Arpeggio. How it works is that, if you open the textX source, you will see there is a textX grammar language defined in Arpeggio syntax, and when the grammar is parsed, there is a visitor that will build the metamodel and another parser out of the grammar. That other parser is an Arpeggio parser for your new language, and the metamodel is the object holding all the information about your language: all the concepts, all the relationships, everything is contained in that object. That metamodel object is actually used as the API entry point for further parsing. You create the metamodel and you say metamodel.model_from_file, or model_from_str, depending on whether you want to parse a file or a string, and the Arpeggio parser built dynamically for your language is accompanied by a visitor that will transform the parse tree into the object graph corresponding to your grammar. And that design didn't change much from the beginning. The core design remained the same, but the language itself, the grammar, evolved over time. It started as the Xtext language; my idea at the time was to just make an Xtext implementation in Python, but it actually grew over time and added some additional shortcuts in the grammar language itself, and some easier ways to specify things. For example, there is something called a repetition modifier in textX: when you want to match zero or more things, or one or more things, you can attach a syntactic addition to the plus sign or the star and say, okay, match zero or more elements or objects, but they should be separated by something, and you just add the separator. In classical parsing, what would you do? You would have to do that manually: you would have to say match this, and then a comma and this, zero or more times. In textX it's much shorter to write, as the sketch below shows. There are also, for example, rule modifiers that don't exist in Xtext but do exist in textX. And there is a relatively recent addition of unordered choice: that's when you have a sequence of things you want to match and you say, okay, match this sequence in any order. textX, or Arpeggio beneath it, will match all those elements, in whatever order they appear, which is very handy for languages that define, for example, some keywords that can be written in any order. So, more or less, that's it about the design; the core remained pretty much the same to this day.
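A small sketch of that repetition-modifier syntax: the bracketed separator after `+=` matches one or more INTs separated by commas, which in classical parsing would need an explicit "match this, then zero or more comma-plus-this" rule. The grammar is invented for illustration:

```python
from textx import metamodel_from_str

grammar = """
Model: lists+=List;
List: 'list' name=ID '=' values+=INT[','];
"""
mm = metamodel_from_str(grammar)
model = mm.model_from_str("list primes = 2, 3, 5, 7")
print(model.lists[0].values)   # -> [2, 3, 5, 7]
```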
Tobias Macey
0:27:12
And for people who are using textX for building their own languages, you mentioned a little bit about the need for having the grammar definition and then being able to parse the written language of the end user and generate the concrete model from that. But what is the overall end-to-end workflow for somebody who is defining a new language with textX and then distributing it to end users for them to be able to actually make use of it and develop within it?
Igor Dejanović
0:27:46
Well, the workflow can differ depending on how complex your DSL is. You can start very simple: you can define your language embedded in a Python module, just write a string with a little grammar, and then call one function that will transform that string into the metamodel, and then you can use the metamodel. So it's just a few lines of code if your language is very simple. But if you are developing something more complex, then you can build a whole language project. There is now support for that in an additional project called textX-dev, which can be installed together with textX, either by installing textX with the dev optional dependency or by installing textX-dev directly. When you install that project, it adds an additional command to textx called startproject. It's similar, for example, to how Django would create a new project: you type textx startproject, answer several questions, and the initial project is generated. In that project you have a grammar file, where you should go to define your grammar, and the project has language registration already built in. It is done through the setuptools extension-point mechanism, so languages can be discovered; they are actually like plugins for textX in a way. You can use textx to list languages, and generators for languages are also registered in the project's setup.py, so you can list generators too. In that case the workflow is: start the project with textx startproject, and then create the grammar. Usually I tend to first open just a blank file and try to write some model in it, how I would like to express some solution. I write that model, and then in parallel I develop a grammar for it. I usually have a small unit test that I run constantly to see if everything works, or I just use, for example, the textx command line to check if the grammar is okay. And then I iterate: I extend the model, I add some new things in the model, then I extend the grammar and see if everything parses. When I'm done with the syntax part, when I'm satisfied with how the model reads and how the grammar looks, then I design the semantics: I build a compiler for it, either reusing some template engine, or I make a little interpreter for the language. At the end of the process you can pack it up, make a package of it, and release it on PyPI, for example, so the user can just install the language. And if the user would like to have IDE support, you can use textX-LS to build a Visual Studio Code plugin with all the syntax highlighting stuff, and then you can distribute your language through that plugin. So those are the options, and it's very flexible: you can use it very simply, or you can use it as a full-blown language project
Tobias Macey
0:31:14
for the languages that you're defining, that brings up an interesting thought as far as how you would provide things like unit testing capabilities for the people who are writing the language, to ensure that what they're building is going to parse properly or function as intended. And I'm curious what your experience has been as far as how frequently people will actually go that extra mile to build additional ecosystem tooling for their languages, and just the overall need for it, and some of the points at which it hits the tipping point of complexity where that's even necessary.
Igor Dejanović
0:31:49
Well, it all depends who the end users are. If, for example, the end users are people who are not that technically savvy, it's probably a good investment to make good tooling support. And for the testing, I think it's generally always good to write tests when you're developing your language. I usually cover all the open source projects that I work on with pytest tests with good coverage; I think it's very important, besides the documentation. I generally feel more confident when doing some larger refactorings or changing the language; I want to be sure that the assumptions I had before are not broken. So I think it's worthwhile to put some additional work into proper testing.
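A minimal sketch of the kind of grammar unit test Igor mentions, using pytest; the grammar and assertions are invented for illustration:

```python
import pytest
from textx import metamodel_from_str
from textx.exceptions import TextXSyntaxError

GRAMMAR = "Model: greetings+=Greeting; Greeting: 'hello' name=ID;"

def test_valid_model_parses():
    mm = metamodel_from_str(GRAMMAR)
    model = mm.model_from_str("hello World hello textX")
    assert [g.name for g in model.greetings] == ["World", "textX"]

def test_invalid_model_is_rejected():
    mm = metamodel_from_str(GRAMMAR)
    with pytest.raises(TextXSyntaxError):
        mm.model_from_str("goodbye World")
```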
Tobias Macey
0:32:42
And then, as far as the specifics of the parsing implementation, I know you mentioned that you're using a PEG parser with some customization of the grammar syntax. But what are some of the situations where a PEG parser is unable to unambiguously process a defined grammar and concrete implementations of it, and you would be better served with a different parsing approach?
Igor Dejanović
0:33:07
Well, PEG parsers are really nice for their simplicity. It's kind of what you will probably end up with if you try to build a parser manually; you will probably go to recursive descent. It's easy to understand, and PEG parsers are really easy to debug. But they have this difference compared to context-free grammars: their choice is ordered. The alternatives are ordered, and by that I mean that when you have several alternatives to match at some point, you're telling the parser: try this; if it does not succeed, try the other one; and do this until you find something that succeeds. So in a way PEGs are more imperative in comparison to context-free grammars, which are more declarative: you say this non-terminal is this or this or this, I don't care in what order, I just declare it. And the problem with PEG is that it will always be unambiguous. That might sound good, but in practice it's actually not always good, because it hides the ambiguity in the language: it will just go from left to right and pick the first alternative from the ordered choice that matches, and that is the way it resolves ambiguity, but it's not always what you want. And you will not get any warning; the grammar is very hard to analyze for those things. For example, a typical problem you have with PEGs: imagine you try to match 'a', and if that doesn't succeed, you match 'a' and then 'b'. You can see that this second alternative will never succeed, or never be reached: if you find 'a' in the input, it will be matched by the first choice, so the 'a' with 'b' afterwards will never be reached, and that can introduce parsing problems in practice. The most difficult problems are when you reorder the ordered choice: you are actually changing the language, but it's hard to see how. So in big grammars that can be problematic; you don't get any analysis from the tool. With the other parsing approaches, which are based on CFGs and do some preprocessing and grammar analysis, you do have some help, for example through shift-reduce conflicts that tell you that at some point you have either an ambiguity or you need more lookahead to resolve something. So PEGs are easy to debug, easy to understand, but
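A hedged sketch of that unreachable-alternative pitfall in textX's PEG-style ordered choice; the grammar is invented for illustration:

```python
from textx import metamodel_from_str
from textx.exceptions import TextXSyntaxError

# In a PEG the '|' alternatives are tried in order, so in `broken` the
# branch 'a' 'b' can never be reached: a lone 'a' always matches first.
broken = "Model: elems+=Elem; Elem: 'a' | 'a' 'b';"
fixed = "Model: elems+=Elem; Elem: 'a' 'b' | 'a';"

try:
    metamodel_from_str(broken).model_from_str("a b")
except TextXSyntaxError:
    # 'a' was consumed by the first alternative, leaving a dangling 'b'
    print("second alternative was never tried")

metamodel_from_str(fixed).model_from_str("a b")  # parses fine
```

Note there is no warning for the broken ordering; the grammar is simply wrong in a way only testing reveals, which is exactly the analysis gap Igor contrasts with CFG-based tools.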
Tobias Macey
0:36:05
do have their own problems. For people who are using textX and building their own languages, what are some of the common challenges that they run into, either in terms of overcoming the limitations of the PEG grammar, or in the overall process of building the DSL and making it available to their end users for doing the work that the DSL is intended for?
Igor Dejanović
0:36:34
So, if I understood the question: from my experience, the problem with open source projects is that you don't always get full feedback from the users, but I do have a lot of feedback from my students. It usually goes relatively smoothly whenever you have good documentation and good examples; they will just read through that and generally they understand it very well, so they don't have many problems with it. Initially they sometimes have a sort of fear of parsing in general, probably because previously they were exposed to some old classic tools like flex and yacc and similar, so they consider parsing very hard and hard to understand. But I think that fear is very quickly overcome when they start to work with tools that are easier to understand and use.
Tobias Macey
0:37:38
And as far as projects that you have seen built with textX, or that you've built yourself, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
Igor Dejanović
0:37:48
Well, again, most of the projects I see developed are from my students. There are several projects listed on the textX front page, who uses it, but usually users don't reach out that much. So I encourage users of textX who are listening to this podcast to drop me a line about what they're using textX for; I always like to hear about it. From the other projects, probably the most interesting was a project done by several students. It's a language for describing guitar tablatures. They call the project pyTabs; it's on GitHub. A guitar tablature is a way to write a piece of music for guitar, but for folks that, for example, don't have formal education and don't know how to read notes, it's a very easy format to understand. It actually depicts the neck of the guitar in ASCII art, where you see six strings running horizontally, and on each string there is a number that tells you the fret you have to press when you play that note. And they managed to parse that with textX, and the grammar is actually very elegant. If you think about it, it's like a two-dimensional language: you are not only parsing horizontally but vertically also, because you have to correlate the different strings at the same position. And the interpreter for that language will play the music. So they designed a language whose semantics is playing the music described by the user. For me, that was a very innovative and interesting way of using textX.
Tobias Macey
0:39:36
Yeah, that's really cool. And as far as your experience of building textX and maintaining it, and continuing to use it as a teaching tool, what have you found to be some of the most interesting or unexpected or challenging lessons that you've learned in the process?
Igor Dejanović
0:39:53
Well, I learned that maintenance of open source projects in general is time consuming and not very easy to do, especially when the project starts to get some traction. When you're maintaining a project, you have a lot of work to do just organizing stuff: making sure every issue is commented on and every pull request is reviewed, etc., and that the release process is done properly, with right versioning and things like that. Since maintenance of open source projects is done on a daily basis in free time, it's something hard to do. So, I think it was a year and a half ago, I got a really great contribution from Pierre Bayerl. It was an implementation of custom scoping support, because textX always had this reference resolving thing. Let me quickly describe what it is: for example, when you're parsing, at some place you say, here I want to match the name of some object I defined somewhere else, and textX at that place will resolve that to a proper Python reference, so you don't have to do that yourself. That's why you end up with a graph of Python objects, not a tree. And the scoping was done using a global scope: textX in older versions, or by default, will search for that kind of object, that type of object, globally, and that's not always what you want. So Pierre built support for custom scopes: you can define a custom scope provider, where you can define in Python the actual algorithm for how that object is to be found (there's a small sketch of this below). Another piece of that pull request was support for multi-meta-models: you can, for example, have several different grammars, and you can build a model that refers to things from another model in another language. You can even reference things that are outside of textX: for example, you can reference a specific node in a JSON file or a specific node in an XML file. That support is really cool. He did that a year and a half ago, sent the pull request, and we had really great collaboration on that pull request. When we merged it to master, I asked Pierre to join the project to help with maintenance, and I'm really happy he accepted, so he is now co-maintaining the project with me. It's much easier when you have a co-maintainer, because we can discuss design decisions, and sometimes I don't have time to look at some pull request or some issue, sometimes Pierre doesn't have time; it's much easier when there are more people.
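A hedged sketch of that custom-scoping support, using the built-in fully-qualified-name provider in the pattern shown in the textX scoping documentation; the package/class grammar is invented for illustration:

```python
from textx import metamodel_from_str
from textx.scoping.providers import FQN

grammar = """
Model: packages+=Package;
Package: 'package' name=ID '{' classes+=Class '}';
Class: 'class' name=ID ('extends' parent=[Class|FQN])?;
FQN: ID ('.' ID)*;
"""

mm = metamodel_from_str(grammar)
# "*.*" applies the provider to every reference attribute of every rule,
# replacing the default global-scope lookup.
mm.register_scope_providers({"*.*": FQN()})

model = mm.model_from_str("""
package base { class Object }
package app  { class User extends base.Object }
""")
user = model.packages[1].classes[0]
print(user.parent.name)   # -> Object, resolved across packages
```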
Tobias Macey
0:42:46
And one of the other interesting things that I found while doing the research for this conversation is that, in addition to textX and the Arpeggio parser that it's using, you've also built another parser using a different type of grammar support, called parglare. I'm wondering what your motivation was for creating that, some of the ways that your experience with Arpeggio and textX fed into the work you did there, and some of the ways that the work you're doing on parglare has informed decisions about how you approach things with textX and Arpeggio.
Igor Dejanović
0:43:19
Well, it actually started with
0:43:23
the problems in PEG parsing that I recognized at the time. For example, one problem I already talked about: it's an unambiguous parsing, which is not always what you want; there are hidden ambiguities. The other thing, generally related to top-down parsers, is that they don't accept left-recursive rules, and sometimes a grammar is most naturally described using left-recursive rules. For example, if you're building something that is heavily expression oriented, it's much easier to encode naturally: if you're building expressions for arithmetic operations, you can easily say an expression is expression plus expression, or expression minus expression, and so on. With top-down parsing you must avoid the left recursion, so you encode those rules differently, which is not very natural. So I wanted to experiment with another parsing approach. My idea was to offer an additional backend for textX, so that instead of Arpeggio you could, as an option, plug in some other parser. So I built parglare to experiment with LR parsers, bottom-up parsers, and I was especially interested in generalized parsing, so parglare also implements GLR parsing. Later on, I realized that trying to put two different parsing styles in the textX project would be very complicated, so I decided not to do that. But the parglare project itself was developing quite nicely, and I really liked the results I got, especially with GLR parsing. And I like the way you can use context-free grammars to declaratively express your language. One lesson I took from working on parglare is that sometimes it's really nice to have something easy to start with, like a PEG parser, especially for students that are learning parsing; but sometimes you need more power, and for that there is a bottom-up parser with a declarative specification of the language and with full generalized parsing like GLR, which can accept any context-free grammar, even an ambiguous one, and in case of ambiguity can produce parse forests, so all possible solutions for your input. That's especially important, for example, for natural language processing, where you have ambiguity in your language by default.
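For contrast, a hedged sketch of the left-recursive, declarative grammar style parglare accepts directly, adapted from the spirit of parglare's calculator example (the exact grammar and actions here are illustrative):

```python
from parglare import Grammar, Parser

# Left-recursive rules with declarative associativity and priority; a
# top-down PEG parser would require rewriting these to avoid left recursion.
grammar = r"""
E: E '+' E  {left, 1}
 | E '*' E  {left, 2}
 | '(' E ')'
 | number;

terminals
number: /\d+/;
"""

actions = {
    "E": [
        lambda _, n: n[0] + n[2],   # E '+' E
        lambda _, n: n[0] * n[2],   # E '*' E
        lambda _, n: n[1],          # '(' E ')'
        lambda _, n: int(n[0]),     # number
    ],
}

parser = Parser(Grammar.from_string(grammar), actions=actions)
print(parser.parse("2 + 3 * 4"))    # -> 14
```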
Tobias Macey
0:46:11
And then, for people who are making the decision of what to use, what are the cases where textX is the wrong choice, and they might be better served by either using a different DSL library, or using a simple parsing library and then doing the manual resolution of how that logic is supposed to play out, or just using a simple regex for the smaller cases?
Igor Dejanović
0:46:34
Well, if you're trying to parse something more complex, it's generally not a very wise choice to just use regexes. I generally recommend using some parsing library, because even if you think you can easily craft your parser, there are all sorts of edge cases that are already handled in a parsing library, and there is good error reporting and things like that. But crafting parsers can give you additional control, so if you want real, total control over the parsing process, or, for example, if you want to learn parsing in depth, then you can go with a handcrafted parser. As for different libraries: well, textX is not a great choice if you really want to influence the outcome of your parsing. If you want complete control over what you are transforming your input into, or, for example, if you want to get the best possible runtime performance, or if you want to process a stream of tokens as they arrive, then textX is not a suitable parser for that. Or, for example, if your input is naturally ambiguous, like a natural language or some other ambiguous language, you cannot use a PEG parser for that in general. So, generally, it comes down to whether you need full control and want to produce something that does not correspond fully to your grammar. Let me give you an example: if you are building an arithmetic expression language, an expression-based language, maybe you want to evaluate the expression on the fly. If you want that, then textX is not a good choice, because you will always end up with the object graph, and you will have to transform that graph into the result of the expression.
Tobias Macey
0:48:21
And as you continue working with textX and using it for your own purposes and for your teaching, what are some of the new capabilities or features or just overall improvements that you have planned for it, or associated projects that you have in mind to build?
Igor Dejanović
0:48:36
Well, first of all, one thing we discussed recently was to drop Python 2 support from textX and Arpeggio. They're still compatible with both Python 2 and 3, and because of that we cannot move on with some Python-3-only stuff. For example, one of the things I would really like to see in textX is type hinting, so we can provide stricter checking of types in the library itself. And there is also one bigger feature we have been planning for maybe a year: it is a small DSL for custom scoping providers. That is the part Pierre was working on. Right now you describe scope providers with Python functions, and the idea is to create a small DSL where you can describe the scoping rule in a very small and simple language that you can embed in the grammar itself. So at the place where you use the reference, you can write an expression that tells textX how to resolve the reference. And because we were discussing that across several issues, we made, in the wiki, a TEP, a textX enhancement proposal, where we collected all the ideas about that DSL in one document. That's probably something we should work on at some point in the future when we find some time.
Tobias Macey
0:50:12
Well, for anybody who wants to get in touch with you or follow along with the work that you're doing, or contribute to your work on textX and your other libraries, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. This week I'm going to choose the project wemake-python-styleguide. It's a set of plugins for providing fairly strict linting of your project.