One of the most persistent challenges faced by organizations of all sizes is the recording and distribution of institutional knowledge. In technical teams this is exacerbated by the need to incorporate technical review feedback and manage access to data before publishing. When faced with this problem as an early data scientist at Airbnb, Chetan Sharma helped create the Knowledge Repo project as a solution. In this episode he shares the story behind its creation and growth, how and why it was released as open source, and the features that make it a compelling option for your own team’s knowledge management journey.
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Chetan Sharma about Knowledge Repo, an open source framework for managing documentation for technical users.
How did you get introduced to Python?
- EE + CS/AI + Stats degrees
- Airbnb working on ML models
- Knowledge Repo itself
Can you describe what Knowledge Repo is and the story behind it?
- We started seeing interviewees use IPython notebooks and thought they were great
- Wanted to push more people to use notebooks, but they weren’t very shareable or vettable
- Existing notebook hosting services weren’t very good, and weren’t built for people who aren’t data stakeholders. They were especially poor with images and showed annoying cell blocks
- Made a simple post processor to remove cell blocks, push the images to S3, and host on Flask
- Once we were pushing notebooks into a GitHub repo for hosting on a Flask app, so many things became possible
- Review cycles
- Shareability / collaboration features
- Indexing / searching
- Concurrently, great work was happening on developing internal R packages / Python libraries to provide consistent, branded aesthetics
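The post processor described above can be sketched roughly as follows. This is a minimal illustration, not Airbnb's actual implementation: it assumes the standard Jupyter notebook JSON layout (nbformat 4), and the function name `notebook_to_markdown` and the `images/` path prefix are hypothetical. The S3 upload is left out; the sketch only returns the decoded image bytes that an uploader would receive.

```python
import base64
import json


def notebook_to_markdown(notebook, image_prefix="images/"):
    """Drop code cell sources, keep prose, and pull PNG outputs out as files.

    Returns (markdown_text, images), where images maps a file name to raw bytes.
    Hypothetical helper for illustration only.
    """
    parts = []
    images = {}
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "markdown":
            # nbformat stores source as a string or a list of strings
            source = cell.get("source", "")
            parts.append(source if isinstance(source, str) else "".join(source))
        elif cell.get("cell_type") == "code":
            # Skip the code itself; keep only image outputs
            for output in cell.get("outputs", []):
                png = output.get("data", {}).get("image/png")
                if png is None:
                    continue
                name = f"{image_prefix}output_{len(images)}.png"
                images[name] = base64.b64decode(png)
                parts.append(f"![output]({name})")
    return "\n\n".join(parts), images


# A tiny hand-written notebook standing in for a real .ipynb file
example = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Booking trends\n", "Summary text."]},
        {"cell_type": "code", "source": "plot(df)", "outputs": [
            {"output_type": "display_data",
             "data": {"image/png": base64.b64encode(b"fake-png-bytes").decode()}}
        ]},
    ]
}
markdown, images = notebook_to_markdown(json.loads(json.dumps(example)))
print(markdown)
```

The returned markdown contains only prose and image links, so it can be rendered by a simple web app without exposing code cells to readers.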
What are some of the approaches that teams typically take for recording and sharing institutional knowledge?
- Copy and paste to Google Docs, Slides
- Facebook was using Facebook photo albums
- Untrustworthy, not discoverable, divorced from the code
What are the unique requirements that are introduced when attempting to record and distribute learnings related to data such as A/B experiments, analytical methods, data sets, etc.?
- Reproducibility is a big one
- Making sure the learnings are trustworthy (good data? no bugs?)
- Distributing widely, across the org and across time
- Experimentation is at the end of a research-design-build-measure cycle, strategic analysis is often before
- Capturing all of the context
Can you describe how the Knowledge Repo project is architected?
- Repositories: a store of posts, most commonly a GitHub repo
- Markdown as original lingua franca, eventually a KR specific “KR post” concept (which is still basically markdown)
- Post processors
- Convert whatever upstream file to markdown / KR post (Jupyter notebook, R Markdown, markdown were the original ones)
- Handle images and other large assets, usually pushing them to cloud storage
- Evolved to handle PDFs, Google Docs, Keynote presentations
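One way to picture the architecture above is a registry of converters keyed by file type, each returning markdown plus any extracted assets. The sketch below is an assumption-laden illustration, not Knowledge Repo's actual code: the names `KRPost`, `CONVERTERS`, `converter`, and `to_post` are all hypothetical, and the `.txt` converter exists only to show how a new format plugs in.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class KRPost:
    """Hypothetical stand-in for a "KR post": markdown plus extracted assets."""
    markdown: str
    assets: dict = field(default_factory=dict)  # file name -> raw bytes


# Maps a file extension to the function that converts that format
CONVERTERS = {}


def converter(*extensions):
    """Register a function as the converter for the given file extensions."""
    def register(func):
        for ext in extensions:
            CONVERTERS[ext] = func
        return func
    return register


@converter(".md")
def from_markdown(text):
    # Markdown is the lingua franca, so it passes through unchanged
    return KRPost(markdown=text)


@converter(".txt")
def from_plain_text(text):
    # Illustrative extra format: treat the first line as the title
    title, _, body = text.partition("\n")
    return KRPost(markdown=f"# {title}\n\n{body.strip()}")


def to_post(filename, text):
    """Pick a converter by extension and return a KRPost."""
    ext = Path(filename).suffix.lower()
    if ext not in CONVERTERS:
        raise ValueError(f"no converter registered for {ext!r}")
    return CONVERTERS[ext](text)


post = to_post("notes.txt", "Pricing analysis\nConversion rose after the change.")
print(post.markdown)
```

Under this design, supporting another upstream format means adding one decorated function, which matches the idea that anything convertible to markdown and images can become a post.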
What were the motivating factors for making it available as an open source project?
- It was such a common problem. Even incredibly sophisticated data teams at Uber, Facebook, etc. were begging us to share the system.
What is the workflow for creating, sharing, and discovering information in an installation of Knowledge Repo?
- Create a GitHub repo for hosting strategic analysis
- Use the KR script to create a stub/template for whatever format you’re working in
- Do your work in Jupyter, etc.
- Instead of using git commands (git add), use knowledge scripts (knowledge add), which are basically the git commands with postprocessors
- Do typical GitHub workflows
- See the result in the hosted knowledge repo app
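To make the stub/template step in the workflow above more concrete, here is a rough sketch of what generating a post stub could look like. It is not the project's own script: the helper `make_stub` is hypothetical, and the header fields (title, authors, tags, created_at, updated_at, tldr) are an assumption modeled on a YAML front matter block, so check the project documentation for the exact fields it expects.

```python
from datetime import date


def make_stub(title, authors, tags, today=None):
    """Build a markdown stub with a YAML-style header for a new post.

    Hypothetical helper; values are written as-is with no YAML escaping.
    """
    today = today or date.today()
    lines = ["---", f"title: {title}", "authors:"]
    lines += [f"- {author}" for author in authors]
    lines.append("tags:")
    lines += [f"- {tag}" for tag in tags]
    lines += [
        f"created_at: {today.isoformat()}",
        f"updated_at: {today.isoformat()}",
        "tldr: One or two sentences summarizing the findings.",
        "---",
        "",
        "Write the analysis here.",
    ]
    return "\n".join(lines)


stub = make_stub("Search ranking experiment", ["jane_doe"], ["experiments"],
                 today=date(2021, 1, 15))
print(stub)
```

Starting every post from a stub like this keeps metadata consistent, which is what makes later indexing and searching practical.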
What are some of the options available for extending or customizing an installation of Knowledge Repo?
- More postprocessors! Google Docs, presentations, UX research: anything can be done in KR with a simple postprocessor to turn it into markdown/images/PDF
- Tying the system to your internal data tools. For example, an experimentation system like Eppo or whatever you use for marketing campaigns
If you were to start over today, what are some of the ways that you might approach the solution to knowledge management differently?
- Think of it more holistically:
What are the most interesting, innovative, or unexpected ways that you have seen Knowledge Repo used?
- UX research
- Writing up a guide for acquihiring
- Demonstrations of capabilities, data framework
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Knowledge Repo?
- Strategic analysis needs to be elevated, this leads to paradigm changes
- Organizational problems are helped by tools like KR, e.g. promotions
- Meeting people’s tools/workflows where they are is powerful
When is Knowledge Repo the wrong choice?
Keep In Touch
- Underrated cooking ingredients: chickpea flour, butter fried kimchi (in grilled cheese, nachos)
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email email@example.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.