Almost everyone complains about the quality of FBO's search feature. We want to provide developers and users with access to something better. Starting today, GovTribe gives you the ability to search the full text of FBO procurement documents in real-time, through our API. That means RFPs, SOWs, award documents, and more. In the next few weeks, we'll be adding the same capability to our iPhone and iPad products.
We thought it would be fun to give you a quick look at how we did it. Our requirements were pretty simple:
- Give users a fantastic search experience in real-time.
- Give us the flexibility to quickly reindex our data (we use Elasticsearch) without needing to download hundreds of thousands of attachments, re-extract text, and generally do a lot of busy work.
All of the text of all of the procurement documents you could ever read, or ever want to, is, well, a lot of gigabytes. We don't want to have to move that much data around the Internet every time we need to reindex or change mappings. We also don't want to convert every PDF, DOCX, XLSX, and who knows what to pure, clean text more than once.
What we ended up doing is straightforward, but met our requirements nicely:
- First, get a few big EC2 boxes. For us, c3.2xlarge hit the sweet spot of compute, memory and price. We used these to download, clean, convert and upload the attachments.
- Install Supervisor on each, and use it to keep an instance of TikaJAXRS up and running. Tika is an exceptional attachment extractor, and quite reliable once you get it working. It's also what Elasticsearch's Attachments Plugin uses to extract text from files.
- Move the cleaned data into S3. We ended up keeping the output in HTML, as opposed to plain text, as it preserved some metadata we'd rather not lose.
- Before we send the file to S3, we store its hash in the S3 metadata as 'x-amz-meta-md5'. This makes sure we won't download the same file twice, even if the file name or URL changes.
- Index. Our friendly search provider (Searchly) keeps our Elasticsearch boxes running. All we had to do was spend some time tuning our index mappings and queries to make sure our results made sense. Want to find the government's answers to questions about a solicitation? Now you can.
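To give a feel for that kind of search, here is a minimal sketch of an Elasticsearch query body for a phrase search over extracted document text. The field name "content", the helper name build_fulltext_query, and the example phrase are assumptions made for illustration; the actual index mappings and the GovTribe API request format are not shown in this post.

```python
import json


def build_fulltext_query(phrase, size=10):
    """Build an Elasticsearch query body for a phrase search.

    Assumes (hypothetically) that the extracted document text is indexed
    in a field called "content".
    """
    return {
        "size": size,
        # match_phrase requires the words to appear together, in order
        "query": {"match_phrase": {"content": phrase}},
        # ask Elasticsearch to return the matching snippets as well
        "highlight": {"fields": {"content": {}}},
    }


if __name__ == "__main__":
    print(json.dumps(build_fulltext_query("questions and answers"), indent=2))
```

A body like this would be sent to the index's _search endpoint; tuning the mappings and analyzers behind the searched field is what decides whether the results make sense.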
Note that the name of the author is contained in the metadata of the extracted HTML:
<meta name="meta:author" content="Sean.MacheskiBrasher"/>
At last count we've extracted 58,070 government contracting officers from FBO data. In the future, we think it would be nice to link them to the documents they've written. Keeping document metadata makes that possible.
We also wanted to store the data that FBO provides along with the files (the package name, file name, and file description). We chose to capture this in S3 user metadata fields.
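The metadata handling described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the code we run: the metadata key names other than 'md5', the seen_hashes set used for the duplicate check, and all function names are hypothetical, and the post does not say exactly which bytes are hashed. The only S3 call used is put_object with its Metadata argument, which boto3 sends as 'x-amz-meta-*' headers.

```python
import hashlib
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect name/content pairs from <meta> tags in extracted HTML."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />
        if tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name") and attrs.get("content"):
                self.meta[attrs["name"]] = attrs["content"]


def extract_meta(html_text):
    """Return a dict of meta tags, e.g. {"meta:author": "..."}."""
    parser = MetaTagParser()
    parser.feed(html_text)
    return parser.meta


def build_user_metadata(file_bytes, package_name, file_name, description):
    """Build the S3 user metadata dict for one attachment.

    Key names other than "md5" are hypothetical. boto3 prefixes each key
    with "x-amz-meta-", so "md5" is stored as "x-amz-meta-md5".
    """
    return {
        "md5": hashlib.md5(file_bytes).hexdigest(),
        "package-name": package_name,
        "file-name": file_name,
        "file-description": description,
    }


def upload_if_new(s3_client, bucket, key, body, user_metadata, seen_hashes):
    """Upload body unless a file with the same hash was already stored.

    seen_hashes is an assumed in-memory set of known MD5 digests.
    Returns True if an upload happened, False if it was skipped.
    """
    digest = user_metadata["md5"]
    if digest in seen_hashes:
        return False
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        Metadata=user_metadata,
        ContentType="text/html",
    )
    seen_hashes.add(digest)
    return True
```

With boto3 installed, s3_client would be boto3.client("s3"). Keep in mind that S3 limits user-defined metadata to 2 KB per object, so long file descriptions may need truncating.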
Eager to check it out? Register for an API key.