Building The Seq Language For Bioinformatics - Episode 257

Summary

Bioinformatics is a complex and computationally demanding domain. The intuitive syntax of Python and extensive set of libraries make it a great language for bioinformatics projects, but it is hampered by the need for computational efficiency. Ariya Shajii created the Seq language to bridge the divide between the performance of languages like C and C++ and the ecosystem of Python with built-in support for commonly used genomics algorithms. In this episode he describes his motivation for creating a new language, how it is implemented, and how it is being used in the life sciences. If you are interested in experimenting with sequencing data then give this a listen and then give Seq a try!

Do you want to try out some of the tools and applications that you heard about on Podcast.__init__? Do you have a side project that you want to share with the world? Check out Linode at linode.com/podcastinit or use the code podcastinit2020 and get a $20 credit to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.



Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API, you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on great conferences. And now, the events are coming to you, with no travel necessary! We have partnered with organizations such as ODSC and Data Council. Upcoming events include the Observe 20/20 virtual conference on April 6th and ODSC East, which has also gone virtual, starting April 16th. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host as usual is Tobias Macey, and today I’m interviewing Ariya Shajii about Seq, a programming language built for bioinformatics and inspired by Python.

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by describing what Seq is and your motivation for creating it?
    • What was lacking in other languages or libraries for your use case that is made easier by creating a custom language?
    • If someone is already working in Python, possibly using BioPython, what might motivate them to consider migrating their work to Seq?
  • Can you give an impression of the scope and nature of the tasks or projects that a biologist or geneticist might build with Seq?
  • What was your process for identifying and prioritizing features and algorithms that would be beneficial to the target audience?
  • For someone using Seq can you describe their workflow and how it might differ from performing the same task in Python?
  • How is Seq implemented?
    • What are some of the features that are included to simplify the work of bioinformatics?
    • What was your process of designing the language and runtime?
    • How has the scope or direction of the project evolved since it was first conceived?
  • What impact do you anticipate Seq having on the domain of bioinformatics and genomics?
  • What have you found to be the most interesting, unexpected, and/or challenging aspects of building a language for this problem domain?
  • What is in store for the future of Seq?

Keep In Touch

Picks

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast, for the latest on modern data management.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:13
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 gigabit private networking, node balancers, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models or running your CI/CD pipelines, they've got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey, and today I'm interviewing Ariya Shajii about Seq, a programming language built for bioinformatics and inspired by Python. So Ariya, can you start by introducing yourself?
Ariya Shajii
0:01:08
Sure. I'm a fourth-year PhD student at MIT CSAIL, working with Professors Bonnie Berger and Saman Amarasinghe. Before Seq, I worked on various other genomics applications, like genotyping and sequence alignment for a particular kind of third-generation sequencing data. And now, for the past couple of years, we've been working on this project. So that's a little bit about me.
Tobias Macey
0:01:34
And do you remember how you first got introduced to Python?
Ariya Shajii
0:01:36
So the first language I learned was actually Java, in high school. Shortly after that I stumbled across Python, and I think I kind of gravitated towards it because it seemed like a language that was really targeted at people who don't have a computer science background. Its design, the very simple, clean syntax, and the large community kind of led me towards Python. So that's how I first encountered it.
Tobias Macey
0:02:05
And so now you're working on this Seq project. I'm wondering if you can give a bit of a description of what it is, your motivation for creating it, and how you first got interested in the area of bioinformatics?
Ariya Shajii
0:02:18
Sure. So I guess I'll just say a few words about what Seq actually is. At a high level, Seq is essentially a domain-specific language for computational genomics and bioinformatics. Language-wise, it's based largely on Python, so you can think of it as sort of a Python implementation on top of which we've added a few domain-specific features, types, and compiler optimizations. The key difference between Seq and CPython, for example, is that Seq compiles to LLVM IR with no runtime overhead, so the language itself is completely statically typed. And unlike CPython, we have no runtime type information or anything like that. As far as the motivation is concerned, really what motivated Seq was the fact that we were looking at these different genomics applications, and what we realized was that they're doing different things at a high level, but they're essentially reusing the same primitives. They're operating on sequences, DNA sequences for example, which are strings of A, C, G, and T. They're dealing with really big data. They're doing operations like sequence alignment, sequence indexing, hashing, stuff like that. So they're all sort of using the same set of building blocks, and that's what really motivated us to build a language that exposed these operations as primitives and a compiler that understood them and could do optimizations on them.
We tried for a really long time to build a high-level language that exposed these primitives as coarse-grained building blocks that you could glue together in different ways. We actually spent like a year and a half trying to do that, and we essentially found that it wasn't possible. In bioinformatics, yes, they're using these primitives, but there's so much stuff interspersed in between that building a high-level language with just these building blocks was essentially impossible. I think a good analogy for that is MATLAB, right? MATLAB is a language for linear algebra, but it's actually a low-level language, because you can't express all the things that are relevant to a linear algebra application with matrix operations alone. You need some low-level infrastructure, and I think bioinformatics is really similar.
So that's why we settled on Python, mainly, again, because of its appeal to non-programmers and the large community that it has. But we still need performance. Biological data is growing really fast, much faster than Moore's law, and because of that, performance is really critical in our domain. So that's why we took Python and reimplemented it from the ground up in a statically typed way with, again, no runtime overhead. So that's the motivation for Seq. In terms of bioinformatics itself, I think the reason I gravitated towards it was, I don't know, I wish I had a more concrete answer, but it's just a really cool field, I think, and it's a really concrete application of computer science. So that's why I really like this field.
Tobias Macey
0:05:22
Yeah, it's definitely an interesting area. And it's exciting to see some of the ways that computational power can be applied to the realms of biology to get a more concrete and in-depth understanding than what we've been able to get previously. Definitely. And do you have much experience in the past of implementing different languages and compilers? Or is this one of your first forays into that space?
Ariya Shajii
0:05:48
So my research has always been basically bioinformatics, but on the side I've had a few toy projects here and there that were compilers. So I have a little bit of knowledge, nothing really super formal, just from personal projects that I've worked on in the past. But I'd say this is definitely the biggest compiler-related project that I've been a part of.
Tobias Macey
And you mentioned that in some of the other available languages and runtimes there weren't primitives specific enough to bioinformatics to get the performance that you were looking for, so you went down the road of creating this custom implementation, and you also targeted Python as the syntax for it. I'm curious, for somebody who's already working in Python, and maybe using something like the BioPython packages for some sort of bioinformatics processing, what might motivate them to consider migrating their work to Seq, even if they already have an existing code base? And what are some of the potential challenges that they might face in that conversion?
Ariya Shajii
0:06:51
Sure. So one of our goals is basically to make the Python-to-Seq transition as seamless as possible. We're not quite at the stage where you can literally take an entire pre-existing Python code base and run it in Seq; you'll probably need to change a few minor things here and there. But we're continuing to close the language gap, and we're hoping to get to that stage at some point in the near future. If you have some small snippet of Python code, most of the time it should just work as is in Seq. About BioPython in particular, and maybe some other libraries just the same, I think one of the main reasons for switching would definitely be performance. BioPython is a pure Python library, and we've actually done a few benchmarks against it, and Seq is substantially faster, especially if you're dealing with these huge data sets. Now, one thing I will mention that's interesting is we're actually implementing BioPython in Seq. BioPython has a pre-existing API, we have a lot of the same functionality, and we're in the process right now of implementing BioPython's API using the primitives that are available in Seq. So again, I think the point is to try to make that transition as seamless as possible. But for someone who's really interested in performance, I think Seq would definitely be a pretty good alternative to BioPython.
I think another important point when talking about the difference between a language like Python and a language like Seq is the metadata overhead, for lack of a better term. If you think about a linear algebra application that does operations on a few really big matrices, most of the time is actually spent in linear algebra kernels that are hand-optimized C or Fortran kernels, and you really only have a few objects that exist in your Python application. In bioinformatics, or in a lot of these genomics applications that we're interested in, the situation is really different: you have potentially billions of sequences that you're processing at any given time, so any metadata overhead or runtime type information or per-object overhead adds up really quickly. So I think that was also one of the motivations to build Seq in a way that eliminated all of that overhead, and that's another reason why Seq might be a good choice for those kinds of applications.
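The per-object overhead described in this answer can be made concrete with a small CPython measurement. This is only an illustrative sketch: the read length and read count are made-up numbers chosen for the example, and `sys.getsizeof` reports CPython's own bookkeeping, which says nothing about how Seq lays out sequences internally.

```python
import sys

# Hypothetical workload, for illustration only: many short reads.
READ_LENGTH = 100        # assumed read length in bases
NUM_READS = 1_000_000    # assumed number of reads held in memory

read = "ACGT" * (READ_LENGTH // 4)

# A CPython str carries a header (reference count, type pointer, length,
# hash, flags) in addition to the characters themselves.
object_size = sys.getsizeof(read)
payload_size = len(read)  # one byte per base for ASCII text
overhead_per_read = object_size - payload_size

print(f"payload bytes per read:  {payload_size}")
print(f"object bytes per read:   {object_size}")
print(f"overhead bytes per read: {overhead_per_read}")
print(f"overhead across {NUM_READS:,} reads: "
      f"{overhead_per_read * NUM_READS / 1e6:.0f} MB")
```

The exact header size depends on the CPython version and platform, so the figures printed will vary. The point is only that a fixed cost per object is multiplied by the number of sequences.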
Tobias Macey
0:09:13
In terms of the actual code bases that people who work in bioinformatics are going to be writing and collaborating on, what is the general scope and scale of those types of projects, in terms of the complexity of the applications and maybe the number of different code files? Is it something where somebody writes a fairly straightforward script to brute force through these different sequences? Or is it something where people are going to be leaning heavily on libraries and wanting to integrate with something like web frameworks or other components of the ecosystem?
Ariya Shajii
0:09:48
Yeah, that's a good question. I think it varies a lot. There are cases where you're doing some kind of ad hoc analysis and you really only have a single script that you're worried about. There are definitely a lot of cases where you have some bigger project that's not necessarily relying on external libraries, but the project itself is bigger and you have multiple source files or something like that. And, like you said, there are definitely cases where you're relying on external libraries, probably not web related, but definitely things like machine learning frameworks, for instance. Because of that, it's really important for Seq to have interoperability with other frameworks, and that's been a priority for us. So I think it definitely varies a lot based on your concrete application.
Tobias Macey
0:10:41
For the primitives that you're incorporating into Seq, you mentioned that there are some elements that are lacking in different implementations. I'm curious what your process was for identifying what the useful primitives were, and some of the algorithms to bake into the implementation of Seq to simplify the work of people working in bioinformatics.
Ariya Shajii
0:11:04
Sure. So I think that's actually been one of the hardest things to do. Again, we spent a really long time trying to design a language that isolated these primitives and just exposed them in a higher-level language. And the reason that was hard is that, abstractly, you have these primitives like sequence alignment, for example, which would be something like a Smith-Waterman dynamic programming algorithm, or some kind of sequence indexing, whether you use a hash table or an FM-index, which is also a very common data structure in genomics applications. But when you actually look at implementations of these things, they vary so much from case to case that it's really hard to provide them as fixed, coarse-grained building blocks. So again, this was why we settled on a much lower-level language. Having said that, there are definitely things that are essentially ubiquitous. Sequence types, again: if you're dealing with DNA, that's strings of A, C, G, and T, and you can also have an N base in there, which is an ambiguous nucleotide. There are k-mer types, which are also frequently used. Sequences are arbitrary length, whereas k-mers are fixed-length sequences of length k, and you can do things like 2-bit encode those. There are very common operations like reverse complement, where you take a sequence, physically reverse it, and then swap A's with T's and C's with G's; that's done on sequences in basically any genomics application. And again, there are string matching algorithms like Smith-Waterman, or Hamming distance calculations.
So these are the things that we thought were useful to implement in Seq. I think the other benefit of Seq is that it's a compiler, so we can actually do higher-level optimizations that a library couldn't necessarily do. For example, reverse complement has various algebraic rules: if you take a sequence and reverse complement it twice, you get the same sequence back. A compiler that knows about these things can exploit those algebraic rules and do optimizations that a library couldn't do.
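The primitives named in this answer are easy to state in plain Python, which may help readers who are new to the domain. The sketch below is not Seq code and does not reflect how Seq implements these operations; the function names and the particular 2-bit code assignment (A=0, C=1, G=2, T=3) are assumptions made for the example.

```python
# Complement table for DNA bases; the ambiguous base N maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq: str) -> str:
    """Swap A<->T and C<->G, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]

# One possible 2-bit code for the four unambiguous bases (an assumption).
TWO_BIT = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_kmer(kmer: str) -> int:
    """Pack a k-mer into a single integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | TWO_BIT[base]
    return value

def hamming_distance(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

seq = "AACGTN"
print(reverse_complement(seq))                      # NACGTT
print(reverse_complement(reverse_complement(seq)))  # AACGTN, the original
print(encode_kmer("ACGT"))                          # 27, i.e. 0b00011011
print(hamming_distance("ACGT", "AGGT"))             # 1
```

The second print line is the algebraic rule mentioned above: applying reverse complement twice is the identity, which is the kind of fact a domain-aware compiler could use to drop redundant work.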
Tobias Macey
0:13:15
And then for the target audience of Seq, is it largely people who are working in the sciences who just need to be able to process the data that they're working with? Or is it also common that there might be a set of programmers on staff who work with the domain experts to understand the scope of what they're trying to work with and then implement the actual applications for them? I'm curious what types of challenges that poses in terms of how to approach the interface and workflow for the people who are actually using Seq in their day-to-day work.
Ariya Shajii
0:13:50
Sure. So I think Seq as a language is in a unique place, because in a way it bridges the gap between a really usable language like Python and a performant language like C or C++. Because of that, if you're someone who is not necessarily an expert in programming, but is, let's say, a biologist, and you're trying to analyze some really large data set, that's someone we actually designed the language for. It's supposed to be something that someone without a background in programming or software engineering can pick up relatively easily and use. But at the same time, someone who is, let's say, an algorithmicist designing these really performance-critical applications for sequence alignment or sequence assembly or what have you, that's also someone who I think could benefit from Seq, because we do a bunch of these domain-specific optimizations that are pretty difficult to replicate by hand. So both sides of the spectrum are people we're interested in targeting. I guess that's sort of a tall order, but that's something that we've thought a lot about.
Tobias Macey
0:15:02
For the overall workflow itself, what are some of the tooling elements that you have implemented to simplify it? And how might the actual development process differ for somebody who's used to working in Python? I know that, at least for the time being, you don't have a REPL, so that is one impact. But what are some of the other elements that might differ in terms of how people will approach building things with Seq versus what they were doing in Python?
Ariya Shajii
0:15:28
Sure. I think that's probably the area that we need to work on the most. For example, Seq has a debug mode that gives you nice stack traces and all sorts of things like that, just like Python does. But again, Seq compiles to LLVM, so the debugging process for Seq today is definitely more challenging than it is in Python. That's something we're hoping to work on in the near future, to give people the tooling and debugging support that they're familiar with if they come from a Python background. I think the other interesting thing in that regard is that because Seq is a domain-specific language, we could actually do things like domain-specific debugging. If you're processing sequences and aligning sequences, we could potentially implement something like a domain-specific debugger that lets you visualize alignments, visualize sequences, and all that kind of stuff. So we're not quite where we want to be in terms of debugging and tooling, and the primary reason for that is really just manpower; we're just limited on manpower. But again, that's something that we would definitely want to work on.
Tobias Macey
And because of the fact that the syntax is very similar to Python, are you able to lean on any of the existing tools, such as linters or static analysis, that are available in the Python ecosystem and modify them to work with Seq as a shortcut to getting some of those tooling elements in? And then another aspect of the workflow and tooling is the question of being able to test the applications that you're developing with Seq. I'm wondering what facilities you either currently have or are planning on?
Ariya Shajii
Sure. So in terms of existing Python tools, that's not something we've explicitly explored, but I think it definitely makes a lot of sense. Again, the syntax is almost identical to Python. We've added a few extra language features like pipelines and pattern matching, so we might have to make some modifications to some of those existing tools that you mentioned, but that's definitely in the realm of possibility. For testing, we actually have a testing framework in Seq that is somewhat similar to what exists in Python. Essentially, you can annotate a function with the test decorator, and then you can have asserts in that function that won't terminate the program; they'll just fail the test if the assert fails. But I think it would be really nice to expand that and maybe even implement some of the testing facilities that exist in regular Python in Seq. And again, the fact that we haven't done that is really just because of manpower. We just haven't had anyone to work on that yet.
Tobias Macey
0:18:18
For the actual implementation of Seq itself, I know that the syntax is, as you said, largely similar to Python. I'm wondering how you approached the actual construction of the language and the compiler, whether you were able to leverage any elements of the CPython implementation, or at least use it as a reference for things like the tokenizer when building the parser and the compiler, and some of the underlying architectural decisions that you've made as you have gone through implementing Seq?
Ariya Shajii
0:18:50
Sure. So a lot of the language design, thankfully, was essentially just dictated by Python. Python behavior and semantics, and of course syntax, are basically all present in Seq. On top of that, like I said, we've added a few features. There are pipelines, for instance, which are a pretty natural abstraction for thinking about processing genomic data. There's pattern matching, which you might find in functional languages; the novel aspect of that in Seq is that we allow genomic pattern matching, so you can match sequences, and you can have wildcard bases or stars like in a regular expression, stuff like that. And there are some of the new types that I talked about. We don't reuse any components of CPython. I think the parser is something that we could possibly reuse. Again, we'd have to add support for some of these other language features, but Python's parser is something that we could potentially build on. In terms of the runtime, our philosophy when designing it was essentially to make it as minimal as possible. All of the core types like lists, dictionaries, and sets, and even a lot of the functionality of more primitive types like integers, floats, and strings, are implemented in Seq. We have a small runtime library, mainly for garbage collection and a few hand-optimized SIMD functions for sequence alignment, but our goal has basically been to reduce that as much as possible and implement everything that we can in Seq itself. So that's been our design methodology so far.
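As a loose analogy for the pipeline and genomic pattern matching features mentioned in this answer, the following Python sketch chains generator stages and matches k-mers against a pattern with wildcard bases. It is not Seq syntax: the stage functions, the sample reads, and the use of `_` as the wildcard symbol are all assumptions made for the example.

```python
from typing import Iterable, Iterator

WILDCARD = "_"  # assumed wildcard symbol for this sketch

def kmers(reads: Iterable[str], k: int) -> Iterator[str]:
    """Stage 1: emit every length-k substring of every read."""
    for read in reads:
        for i in range(len(read) - k + 1):
            yield read[i:i + k]

def matching(kmer_stream: Iterable[str], pattern: str) -> Iterator[str]:
    """Stage 2: keep k-mers that match the pattern; a wildcard matches any base."""
    for kmer in kmer_stream:
        if len(kmer) == len(pattern) and all(
            p == WILDCARD or p == base for p, base in zip(pattern, kmer)
        ):
            yield kmer

# Made-up input reads; stages are composed so data streams through lazily.
reads = ["ACGTAC", "GGACTT"]
hits = list(matching(kmers(reads, 3), "AC_"))
print(hits)  # ['ACG', 'ACT']
```

Because each stage is a generator, no intermediate list of k-mers is built, which is one reason a pipeline is a natural way to think about streaming over very large sets of reads.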
Tobias Macey
0:20:25
Because of the fact that you are using the Python syntax as a reference, I'm curious how evolutions of the language are going to be reflected in Seq itself, what plans you have for maintaining feature compatibility as the language evolves, and any potential challenges that you anticipate as a result of that.
Ariya Shajii
0:20:46
Yeah, I think that is definitely a challenging problem. As changes are made to Python, we'll have to play catch-up a little bit, at least in the short term. I think, again, your suggestion of using CPython's parser, for example, might actually help in that regard, because it would possibly make it easier to integrate some of these language changes. But with Seq being a separate language that's essentially independent from CPython, I think that's definitely a challenge to think about.
Tobias Macey
0:21:18
Yeah. And you mentioned that when you first started approaching this problem, you spent a year going in a different direction before you ended up with the current direction of building seek. And from the time that you actually started building this custom runtime, I'm curious how the overall scope or direction of the project has evolved, and how much that differs from your original conceptions of it.
Ariya Shajii
0:21:40
Sure. So I think Seq today really can be used as a general-purpose language. It's probably best suited for scientific applications that don't rely on Python's dynamic features, some of which we've had to get rid of. But if you have some scientific application that isn't dynamic in nature, I think Seq could be a good fit, even if it's outside of bioinformatics. And again, with it being a low-level language, we've talked a lot about extending to other subfields of bioinformatics. Right now we're focused mostly on computational genomics, but there's more to bioinformatics than that; there's phylogenetics, for example, or population genomics. And I think Seq is in a place now where we can potentially target those fields as well. When we started, like I said, we were thinking about a much higher-level language, and I think if we had gone that route, it would have been much more difficult for us to expand to some of these other areas. And even in other domains outside of bioinformatics, I think Seq can potentially be applicable. So at this point our domain of interest is still bioinformatics, but I think Seq can definitely be a useful tool even outside of that field.
Tobias Macey
0:22:51
As far as the actual bioinformatics field, I know that the documentation for this project alludes to the fact that there is a lot of messy code or inconsistent approaches to problem solving in the way that the software is developed, and challenges in terms of the speed of execution. I'm wondering what impact you anticipate Seq having on the overall domain of bioinformatics and genomics, on some of the standards that could be implemented in the training of people who are working in those fields to improve their overall capacity for running these analyses, and the impact that this increased speed has on their ability to perform meaningful research?
Ariya Shajii
0:23:36
Sure. So we hope to see a lot more tools and methods being written in Seq in the coming months and years. I think the benefit that would have is that not only would it give everyone a unified framework for bioinformatics software development, but optimizations and features that we add in future versions of the Seq compiler could even be applied retroactively to existing software. If someone writes a piece of software in Seq today and, however many months down the line, we add support for a GPU back end or an FPGA back end, then our hope is that the software written today could just run as is on those back ends, and the same goes for any other compiler optimizations that we add. In an ideal world, Seq would allow bioinformatics software to keep pace with the growing data, which, again, is really outpacing Moore's law. I think by 2025 it's predicted that genomic data will even have surpassed Twitter and YouTube data. So it's a really, really fast-growing data set, and I think Seq gives us at least one tool to keep pace with that.
Tobias Macey
0:24:42
In terms of the data sets, do you find that there is a large volume of information available in the public domain for people to do their own experimentation and test out Seq? Or is a lot of the information held in private data sets by different companies working in the biotech or pharmaceutical industries? And are there any challenges that you've seen in terms of making Seq available and getting it into the hands of people who are doing this type of research, or any collaborations that you are either currently engaged in or seeking, to get that feedback to help evolve the language?
Ariya Shajii
0:25:24
Sure. So in terms of data, it's a combination of both. There's definitely a lot of publicly available data out there. There are many databases; one of them is SRA, the Sequence Read Archive, which has a ton of publicly available sequencing data. In terms of collaboration, Google Cloud Life Sciences actually recently reached out to us to talk about running Seq on the cloud, so we're starting to work with them to develop a cloud back end for Seq. That's something we're really excited about, especially as the scale of the data increases: having something like a distributed computing back end for Seq and, in general, a compiler that can perform not only single-machine optimizations but also optimizations that are relevant to a distributed computing environment. I think that's a really, really powerful tool.
Tobias Macey
0:26:21
As you have been building the Seq language, working on improvements and experimentation with it, and working with some of the end users, what have you found to be some of the most interesting, unexpected, or challenging aspects of building a language for this problem domain, and of language and compiler design overall?
Ariya Shajii
0:26:39
I think, definitely, it's something I alluded to earlier. In terms of bioinformatics, identifying the core primitives and operations in bioinformatics and computational genomics is actually really, really hard, again, because you could draw these very coarse-grained boxes around things like alignment or indexing or hashing and stuff like that. But when you actually look at the concrete implementation of these things, there are some subtle details, and those details actually have algorithmic implications. So it's really hard to identify those primitives. But I think we're sort of on the right track in giving people a low-level infrastructure to implement these things themselves. Again, that was one of the motivations behind going with a lower-level language rather than a higher-level DSL. In terms of actual compiler design, for me, dealing with the generic types and duck typing of Python in a statically typed context has actually been really, really hard. What we're doing in Seq is essentially taking a Python program that's dynamically typed by nature and imposing a static type system on top of that, and that can lead to some really difficult-to-resolve corner cases. So on the programming language and compiler design side, that's been a really hard problem. And we're still continuing to close the language gap with Python, so there are still some cases that we're working on there. So I would say, on both the bioinformatics side and the programming language side, those are the two biggest challenges, for me at least.
Tobias Macey
0:28:11
Do you think that there are other problem domains that would benefit from having a similar runtime available to them? And do you think that there is an overall benefit to having custom languages for some of these different research areas or use cases, versus having general purpose languages that are broadly applicable but not necessarily optimized? What do you see as the trade-offs across the overall spectrum of programming language availability for solving some of these interesting and challenging problems?
Ariya Shajii
0:28:43
So I think it varies a lot by domain. Bioinformatics is a unique domain in that a lot of practitioners are not software engineers or programmers or computer scientists by trade. So I think for bioinformatics, for something like Seq, going the Python route and implementing something that behaves like Python and has the semantics of Python was really useful. In terms of general purpose languages versus domain-specific languages, I think both definitely have their use cases. I think with a DSL, it's important for interoperability with existing libraries and systems to be a priority. And again, this is something I mentioned earlier, but we don't want to fall into the trap of not being interoperable with other systems, so that someone who uses Seq is unable to use, let's say, NumPy or TensorFlow or some of these other libraries that exist for Python. So I think DSLs are really good, especially if performance is critical. But at the same time, interoperability definitely needs to be a priority. And that's something that we've had in the back of our minds as we've worked on Seq.
Tobias Macey
0:29:55
What do you have planned for the future of Seq in the near to medium term, and what are some of the overall impacts that you hope to have as it progresses?
Ariya Shajii
0:30:05
So I think in the short term, definitely, we're still working to close the language gap with Python. Right now we have a unidirectional type checker. So in most cases we can do type deduction: if you have a = 2 + 2, for example, we can tell that a is an int. That's a super easy case. But Python actually has some more complicated cases, for instance if you use lambdas. In Python, lambdas aren't typed. So if you sort something, you'd have something like list.sort with some lambda, and you actually need bidirectional type checking to resolve the type of that lambda. So we're in the process of implementing that right now, along with some other things like optional types, which should allow us to deal with None. For example, in Python you can assign None to anything; in a statically typed context that's a little bit more tricky. So that's another thing that we're working on, just closing the language gap a little bit more. Beyond that, we're working on a new intermediate representation that's a little bit higher level than LLVM IR. Our hope is that it will allow us to do a lot more Python-specific and domain-specific optimizations. So in Python, let's say you have a case where you're concatenating three strings; that's something that we could potentially recognize in this new intermediate representation and optimize. And on the bioinformatics side, again, there's that case that I mentioned where you reverse complement something twice. That's something that we can potentially catch if we have that IR as well. So that's another project we're really excited about. Another thing I alluded to earlier was different back ends like GPU and FPGA, and seeing how those interact with these various domain-specific optimizations.
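The double reverse complement case mentioned above can be illustrated with a minimal sketch in plain Python. This is not Seq's actual intermediate representation or compiler API; the `Var` and `RevComp` node types and the `simplify`, `revcomp`, and `evaluate` helpers are hypothetical names invented for illustration. The idea is only that a higher-level IR keeps "reverse complement" visible as a node, so a rewrite pass can cancel two consecutive applications before any code is generated.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical toy IR for illustration only; not Seq's real IR.
@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class RevComp:
    arg: "Expr"


Expr = Union[Var, RevComp]

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]


def simplify(expr: Expr) -> Expr:
    """Rewrite RevComp(RevComp(x)) to x, working bottom-up."""
    if isinstance(expr, RevComp):
        inner = simplify(expr.arg)
        if isinstance(inner, RevComp):
            # Two reverse complements in a row cancel out.
            return inner.arg
        return RevComp(inner)
    return expr


def evaluate(expr: Expr, env: dict) -> str:
    """Interpret an expression against a mapping of variable names to sequences."""
    if isinstance(expr, Var):
        return env[expr.name]
    return revcomp(evaluate(expr.arg, env))
```

Under these assumptions, `simplify(RevComp(RevComp(Var("s"))))` reduces to `Var("s")`, and evaluating the original and simplified expressions gives the same sequence, which is what makes the rewrite safe to apply.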
That's another thing we're really excited about exploring. And, like I mentioned, we're working with Google Cloud Life Sciences to run Seq on the cloud and seeing what we can do there. I think that opens up the door to a whole bunch of other domain-specific optimizations that a compiler like Seq could do. So those are the projects that are ongoing right now that we're really excited about. And then just getting more people involved in Seq, I think. So far we've had a really limited number of people working on it. It's myself and the co-first author on our paper, Ibrahim Numanagić, who was a former postdoc at MIT and is now a professor at the University of Victoria. So it's mainly been the two of us, with undergrads, UROPs, working with us. So getting more people involved is really something we're very excited about as well, for the future success and sustainability of the project.
Tobias Macey
What are some of the risks that you think could pose a threat in terms of its future viability? And I'm wondering what your thoughts are in terms of the level of involvement that you're going to have once you have finished your PhD program.
Ariya Shajii
So for the first part of your question, I think I'll just have to come back to interoperability, because I think that's such an important point. We really want to make sure that if someone uses Seq, they're still able to use these other Python tools and libraries and frameworks that exist today. I think we're not quite at the place we want to be right now. We have Python interoperability in Seq. So if you have some Python function that's pure Python, you can call it right now in Seq, and we do all of the marshaling between Python types and Seq types. But what we talked about, having tooling and debugging support, that's something that we're actively working on. So in terms of viability, I think that's a really important aspect.
In terms of my own involvement, I think this is a project where we have a huge laundry list of ideas and things we want to explore. So I'm not yet 100% sure about what my future plans are going to be, but I definitely envision working on this even after I complete my PhD. So I think this is definitely a long-term project, and I'm really excited about it.
Tobias Macey
0:33:58
Yeah. Are there any other elements of the Seq project itself, or bioinformatics, or language design that we didn't discuss that you'd like to cover before we close out the show?
Ariya Shajii
0:34:07
I think we did a pretty good job, to be honest, covering all our bases.
Tobias Macey
0:34:11
Well, for anybody who wants to follow along with the work that you're doing, get in touch, or contribute to the project, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. This week I'm going to choose board games as a way to have something to do, particularly in these interesting times. So I definitely recommend taking a look at your board game closet, or maybe contributing to it. One that I've been enjoying playing with my kids is a game called Labyrinth. And I'll also mention BoardGameGeek as a great site for discovering and reading reviews of different board games as you're determining what new ones to add to your collection. And so with that, I'll pass it to you. Do you have any picks this week?
Ariya Shajii
0:34:50
So I would have to go with a documentary that I recently watched called Breakthrough. It's not a new movie, I think it was released in 2019, but it's one that I happened to watch recently. It's about Jim Allison, a scientist whose work led to new cancer treatments, and he ultimately won the Nobel Prize because of it. He is a really unorthodox scientist, I would say, and the movie shows his perseverance throughout his research career. It was a really interesting movie, and I'd recommend it.
Tobias Macey
0:35:21
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Seq. It's definitely a very interesting project, and one where I'm excited to see some of the impacts that it will have as it continues to grow and gain adoption. So I appreciate all the work that you're doing there, and I hope you enjoy the rest of your day.
Ariya Shajii
0:35:38
Thank you so much for having me.
Tobias Macey
0:35:43
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast, at dataengineeringpodcast.com for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email [email protected] with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.