Social Good

Entity Extraction, Document Processing, And Knowledge Graphs For Investigative Journalists with Friedrich Lindenberg - Episode 186

Summary

Investigative reporters face the challenging task of identifying complex networks of people, places, and events gleaned from a mixed collection of sources. Turning those various documents, electronic records, and research into a searchable and actionable collection of facts is an interesting and difficult technical challenge. Friedrich Lindenberg created the Aleph project to address this issue and in this episode he explains how it works, why he built it, and how it is being used. He also discusses his hopes for the future of the project and other ways that the system could be used.

Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API, you’ve got everything you need to scale up. Go to podcastinit.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show please leave a review on iTunes or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at podcastinit.com/chat
  • Registration for PyCon US, the largest annual gathering across the community, is open now. Don’t forget to get your ticket and I’ll see you there!
  • Your host as usual is Tobias Macey and today I’m interviewing Friedrich Lindenberg about Aleph, a tool to perform entity extraction across documents and structured data.

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what Aleph is and how the project got started?
  • What is investigative journalism?
    • How does Aleph fit into their workflow?
    • What are some other tools that would be used alongside Aleph?
    • What are some ways that Aleph could be useful outside of investigative journalism?
  • How is Aleph architected and how has it evolved since you first started working on it?
  • What are the major components of Aleph?
    • What are the types of documents and data formats that Aleph supports?
  • Can you describe the steps involved in entity extraction?
    • What are the most challenging aspects of identifying and resolving entities in the documents stored in Aleph?
  • Can you describe the flow of data through the system from a document being uploaded through to it being displayed as part of a search query?
  • What is involved in deploying and managing an installation of Aleph?
  • What have been some of the most interesting or unexpected aspects of building Aleph?
  • Are there any particularly noteworthy uses of Aleph that you are aware of?
  • What are your plans for the future of Aleph?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Lorena Mesa - Episode 78

Summary

One of the great strengths of the Python community is the diversity of backgrounds that our practitioners come from. This week Lorena Mesa talks about how her focus on political science and civic engagement led her to a career in software engineering and data analysis. In addition to her professional career she founded the Chicago chapter of PyLadies, helps teach women and kids how to program, and was voted onto the board of the PSF.

Brief Introduction

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable.
  • Check out our sponsor Linode for running your awesome new Python apps. Check them out at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for your next project.
  • You want to make sure your apps are error-free so give our other sponsor, Rollbar, a look. Rollbar is a service for tracking and aggregating your application errors so that you can find and fix the bugs in your application before your users notice they exist. Use the link rollbar.com/podcastinit to get 90 days and 300,000 errors for free on their bootstrap plan.
  • Visit our site to subscribe to our show, sign up for our newsletter, read the show notes, and get in touch.
  • By leaving a review on iTunes or Google Play Music, you make it easier for other people to find us.
  • Join our community! Visit discourse.pythonpodcast.com to help us grow and connect our wonderful audience.
  • Your host as usual is Tobias Macey
  • Today we’re interviewing Lorena Mesa about what inspires her in her work as a software engineer and data analyst.

Interview with Lorena Mesa

  • Introductions
  • How did you get introduced to Python?
  • How did your original interests in political science and community outreach lead to your current role as a software engineer?
  • You dedicate a lot of your time to organizations that help teach programming to women and kids. What are some of the most meaningful experiences that you have been able to facilitate?
  • Can you talk a bit about your work getting the PyLadies chapter in Chicago off the ground and what the reaction has been like?
  • Now that you are a member of the board for the PSF, what are your goals in that position?
  • What is it about software development that made you want to change your career path?
  • What are some of the most interesting projects that you have worked on, whether for your employer or for fun?
  • Do you think that the bootcamp you attended did a good job of preparing you for a position in industry?
  • What is your view on the concept that software development is the modern form of literacy? Do you think that everyone should learn how to program?

Keep In Touch

Twitter

Picks

Links

Glyph on Ethics in Software - Episode 17

Visit our site for past episodes and extra content.

Summary

In this episode we had a nice long conversation with Glyph Lefkowitz of Twisted fame about his views on the need for an established code of ethics in the software industry. Some of the main points that were covered include the need for maintaining a proper scope in the ongoing discussion, the responsibilities of individuals and corporations, and how any such code might compare with those employed by other professions. This is something that every engineer should be thinking about and the material that we cover will give you a good starting point when talking to your compatriots.

Brief Introduction

  • Welcome to Podcast.__init__ the podcast about Python and the people who make it great
  • Date of recording – July 21, 2015
  • Hosts Tobias Macey and Chris Patti
  • Follow us on iTunes, Stitcher, TuneIn, Google+ and Twitter
  • Give us feedback! (iTunes, Twitter, email, Disqus comments)
  • We donate our time to you because we love Python and its community. If you would like to return the favor you can send us a donation. Everything that we don’t spend on producing the show will be donated to the PSF to keep the community alive.
  • Overview – Interview with Glyph Lefkowitz about ethics in software

Interview with Glyph

  • Introductions
  • How did you get introduced to Python? – Chris
    • 2000 – large scale collaborative gaming system in Java
      • Asynchronous IO
      • Twisted
  • Let’s start with the bad news :) What are some of the potential widespread implications of less than ethical software that you were referring to in your PyCon talk? – Chris
    • Robot Apocalypse :) (Not really)
      • Much of the discussion around this derails into unrealistic nightmare scenarios
      • Therac-25 radiation machine
      • Toyota unintended acceleration scandal
    • Real worry – gradual erosion of trust in programmers and computers
    • First requirement for a code of ethics – a clear understanding of the reality you’re trying to litigate
    • The search for ethics will likely begin in academia where this aspect of software dev is more like psychology.
  • In your talk you commented on the training courses that lawyers are required to take as part of their certification. Do you think the fact that there is no standardized certification body for software development contributes to a lack of widely held ethical principles in software engineering? – Tobias
    • Do you think that it is necessary to form such a certification mechanism for developers as part of the effort to establish a recognized ethical code? – Tobias
    • If we were to create a certification to indicate proper training in the software engineer’s code of ethics, how do you think that would affect the rate at which people enter the industry? – Tobias
  • Assuming we can all agree on a set of relatively strict professional ethics that would prevent the above from happening, how would we enforce those ethics? Or do you advocate an honor system? – Chris
    • Ethics are by definition an honor system
    • Enforcement would be straightforward – professional organizations to maintain a record and deviations from that record
    • Need better laws & better jurisprudence
    • We need an Underwriters Laboratories seal for software development ethics
    • Code of software ethics will not and should not tell you how to be a decent human being.
    • Devs / companies can create software that could be used for evil – “We are merchants of death and these are lethal weapons” – could conceivably earn the ethical software developer’s seal of approval.
  • Where does accessibility of the software we make fit into a code of ethics? Do you think there should be a minimum level of support for technologies such as screen readers or captioning for audio content in the software that we build? – Tobias
    • Minimum levels of knowledge required
    • Minimum levels of content in curriculum
  • In your talk you mentioned how Rackspace’s stance on user support matches the ideals you’d previously laid out, can you flesh that out a bit for us? What does that mean to individual Rackers in their day to day work lives? – Chris
  • In your talk you mentioned that availability of the software source should be mandatory for compliance with a properly defined ethical framework. What mechanisms for providing that access do you think would be acceptable? Should there be a central repository for housing and providing access to that source? – Tobias
    • Would the list of acceptable mechanisms change according to the intended audience of the software? – Tobias
    • What responsibility do you think producers of software should have to maintain an archive of the source for past versions? – Tobias
    • How should we define what level of access is provided? In the case of commercial software should the source only be available to paying customers, perhaps delivered along with the product? This also poses an interesting quandary for SaaS providers. Should they provide the source to their systems only to paying customers, or to potential customers as well? – Tobias
    • This question of transparency and availability of source is especially interesting in the light of a number of stories that have come out recently about patients who have been provided with prostheses and other medical devices. In a number of cases, shortly after receiving the device, the company who made it, which are increasingly startups, goes out of business, leaving the patient with no way of obtaining support for something that they are dependent on for their health and well-being. Having the source for those devices available would help mitigate the impact of such a situation. – Tobias
  • You brought up an interesting aspect of the trust equation and its relevance to the need for an ethical code. Because what we do as software engineers is effectively viewed as sorcery by a vast majority of the public, they must therefore wholly place their trust in us as part of using the products that we create. As you mentioned with the demise of the scribe with the rise of literacy, increasing the overall awareness of how software works at a basic level partially reduces that dependency of trust. At what level of aptitude do you think our relationship with our users becomes more equitable? How does the concept of source availability play into this topic of general education? – Tobias
  • What can the Python community in particular do to start the ball rolling towards defining a set of professional ethics, and what has it already done in this area? – Chris
    • PSF Code of Conduct is a starting point
      • PSF is an organization of individuals
      • Corporations are cagey about getting involved for fear of it becoming a legally binding contract
    • Django Code of Conduct more specific

Picks

Keep In Touch

Eric Schles on Fighting Human Trafficking with Python - Episode 12

Listen to past episodes, read about the hosts or donate to the show at podcastinit.com

Brief Introduction

  • Date of recording – June 10th, 2015
  • Hosts Tobias Macey and Chris Patti
  • Follow us on iTunes, Stitcher or TuneIn
  • Give us feedback! (iTunes, Twitter, email, Disqus comments)
  • You can donate (if you want)!
  • Overview – Interview with Eric Schles

Interview with Eric Schles

  • Introductions
  • How did you get introduced to Python?
  • What inspired you to take up the fight against slavery? Is there a personal story behind this choice?
  • Some of your work touches on the “Deep Web”. Can you provide listeners with some context around what that term means and the role it plays in what you do?
    • Tor .onion sites (Hidden Services) are examples
    • Anonymous Web Experience
    • Anonymity allows for illegal, immoral things like buying and selling people
    • Conceptually very important idea
    • Bruce Schneier – Web technologies need to be more privacy aware
    • Like a really scary version of “The Internet of the Old Days”
    • Photos of young, exploited men and women
    • Pedophiles are building communities, having parties through these hidden services
    • Eric feels that Tor is an extreme
    • Feels there has to be a way to protect the rights of legitimate users while protecting against pedophiles
    • Maybe a voting system?
    • The Tor project feels that any compromise lessens the anonymity that’s so important for people in embattled countries (Worded that poorly -Chris)
    • No metrics on the amount of pedophilia that actually happens on Tor – probably a lot
    • Sexually abused victims of trafficking grow up damaged, unable to do anything else
    • Consumers of this type of porn were often themselves victims of sexual abuse
    • Structural dissonance which exists to create this problem in society needs to be addressed
    • Google puts the number to the anti-trafficking hotline at the top of any trafficking search results
    • Darren (Derek?) Hayes – redirect to trafficking resources when viewing advertisements for victims of trafficking
  • Why did you choose Python as opposed to any other tool for your search engine?
    • Needed solutions quickly with the ability to evolve as needed
    • Able to rapidly develop and incorporate new features
    • Easy to scale as needed
    • Flask is easier to prototype and iterate with
    • Python data science tools make the analysis easy
    • Able to finish a 2-year C++ project in 3 weeks using Python
    • Doing data science in Ruby is challenging
    • Pandas DataFrame galvanized the creation of a lot of other useful tools
    • Vincent – write Python which compiles to D3
  • Can you provide a high level description of the technical details of the search engine that you created, and what it’s like to work with Tor through Python?
    • Directed search engine
    • “It would be like if you went to Google but everything you watched was porn which you were uncomfortable seeing and you were sad”
    • Get most case information through regular old detective work
    • Person arrested / in holding yields phone number, other attributes that can feed the search engine
    • Google can’t scrape the deep web
    • Memex tool indexes the deep web – Eric’s search engine uses that
    • Eric does design work for the Memex project
    • Developed by the amazing Chris White
    • Eric’s search engine uses the Tor driver in Selenium to access .onion sites
  • What are some of the technical and legal challenges that you experienced in the course of your work?
    • Most of the technical challenges are around automated processing
    • Legal structure provides some limits on what can be worked on
  • Does your search engine try to infer who might be engaged in work voluntarily as opposed to those being forced into it against their will?
    • No, because they get all their case referrals from detective work
    • You have to have been hospitalized or in some other way come to the attention of the authorities for being deprived of rights
    • Trafficking looks very different in different cultures
    • Global similarities
    • Afraid to say why if hurt
    • Forced into having sex against your will
    • Clear patterns of indication
    • Urban versus Suburban versus Rural
    • Fracking towns
    • Demographics are very different – mostly men, very few women, LOTS of ads for sex workers
    • Only helping people that want to be helped
  • What was the most surprising fact you uncovered as part of your research?
    • Imagery of exploited children is so depressing and sad
  • Without revealing anything you shouldn’t, are you aware of anyone being set free as a result of your work?
    • “Not my work, our work”
    • Not an individual effort
    • Lawyers, analysts, the larger DA’s office
  • Given the complicated socio-economic aspects of human trafficking and the prosecution of those who are responsible, can you discuss some of the moral and ethical considerations that you have been confronted with while building these tools?
    • Privacy is the biggest concern
    • Open source book to teach colleagues at the DA’s office how to program in Python
    • Sometimes Eric works at Civic Hall
  • Are there any projects out there that you consider similar to what you are working on?
  • What would it take for other municipalities and law enforcement agencies to get started with using your tools?
  • How can our listeners get involved and help you with this? – Chris
    • Tweet at @EricSchles or E-mail Eric
    • Volunteer for any of the nonprofit anti-trafficking groups
  • Message to the community: There is a world of good waiting to happen

Picks

Keep in Touch

More From Eric

  • He presented at PyGotham 2014
  • He also talked at the Open Data Science Conference 2015 Boston
