End-to-end project narrative

I was one of the first people to work with IRS filings from nonprofits. Just extracting information from the files involved a fairly extensive series of steps: gaining access to the thousands of schema files, figuring out how to use them to decode the data, and even dealing with an S3 bucket that caused most operations to fail. This work drew the attention of AWS itself, which led to a couple of articles on their official blog.

With the data worked out, I realized there was a golden opportunity to disrupt the then-paywalled industry of nonprofit data. The challenge was to make it discoverable. To this end, I designed and built a portal to publish the data. I hired a team of offshore designers and front-end engineers while developing a back-end built around a document database and an Elasticsearch deployment.

I was particularly proud of the search engine, though in hindsight I can see much better ways to do what I did. There are 1.9 million nonprofits in the United States, and most of them have very little text on their profiles, so (this being 2017) I augmented the text to give the full-text search engine more to work with. Had I known then what I know now, I would have used Paragraph Vector (doc2vec) or averaged GloVe embeddings (again, 2017!) to create a vector index for approximate nearest neighbor search.
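To make the averaged-embedding idea concrete, here is a minimal sketch, not what the portal actually ran. Everything in it is an assumption for illustration: the three-dimensional `TOY_VECTORS` table stands in for real pretrained GloVe vectors, the profile names and texts are made up, and the exact brute-force scan in `search` stands in for a real approximate nearest neighbor index.

```python
import math

# Hypothetical 3-dimensional word vectors, made up for illustration.
# Real GloVe vectors are pretrained and typically have 50 to 300 dimensions.
TOY_VECTORS = {
    "food": [0.9, 0.1, 0.0],
    "hunger": [0.8, 0.2, 0.1],
    "pantry": [0.85, 0.15, 0.05],
    "animal": [0.1, 0.9, 0.0],
    "shelter": [0.2, 0.8, 0.1],
    "rescue": [0.15, 0.85, 0.1],
    "music": [0.0, 0.1, 0.9],
    "orchestra": [0.05, 0.1, 0.95],
    "youth": [0.3, 0.3, 0.4],
}


def embed(text):
    """Average the vectors of known words and L2-normalize the result.

    Returns None when no word in the text has a vector.
    """
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return None
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]


def build_index(profiles):
    """Map each profile name to its embedding, skipping profiles with no known words."""
    index = {}
    for name, text in profiles.items():
        vec = embed(text)
        if vec is not None:
            index[name] = vec
    return index


def search(index, query, k=2):
    """Return the k profile names most similar to the query by cosine similarity.

    This is an exact scan over every entry; an approximate nearest neighbor
    library would replace this loop at the scale of millions of profiles.
    """
    q = embed(query)
    if q is None:
        return []
    # Vectors are unit length, so the dot product equals cosine similarity.
    scored = [(sum(a * b for a, b in zip(q, vec)), name) for name, vec in index.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical nonprofit profiles with very short descriptions.
PROFILES = {
    "Community Food Pantry": "food pantry",
    "Paws Animal Rescue": "animal rescue shelter",
    "Youth Orchestra": "youth orchestra music",
}
INDEX = build_index(PROFILES)
```

In this toy setup, calling `search(INDEX, "hunger", k=1)` ranks the food pantry first even though the word "hunger" never appears in any profile, which is the gap that augmenting the text for a full-text engine was trying to close by other means.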

We quickly reached tens of thousands of monthly users. Eventually the major player in this space, Candid, removed their paywall for similar information, and we considered our job complete.

David's raw ML reference notes
