We're constantly trying to sort, tweak and refine the data that goes into our iPhone app, hōrd. Making government procurement data pretty has become an obsession around here, and remains one of our biggest challenges. But why? We want our users to have a delightful experience using our app, even if the data that powers it is anything but.
For a long time our biggest data quality headache has been vendors. When we say vendors, we mean any company, person, or organization that does business with the federal government. Clean vendor data for us means our users are able to easily track the procurement activity of their competition. For example, you should be able to track "Widget Federal" without having to worry about data entry errors like "Federal, Widget", "WidigtFed" or "Widgets Fedeal LLC", or holding companies like "Widget Federal Arlington". Splitting joint ventures like "Widget Federal and WidgetAccessories.com, a JV" is also crucial. Here are some of the more egregious examples we see on a daily basis. "Before" is what we get from the government's data feeds, and "After" is what we display to our users:
- Before: DATEX-OHMEDA INC DBA GE HEALTHCARE BIOSCIENCE BIOPROCESS [DUNS: 129501685],3030 OHMEDA DR,MADISON WI 53707-7550
- After: General Electric, Inc.
- Before: Integrated Security Solutions, Inc. (044574767) 108 Cooperative Way Kalispell, MT 59901-2386
- After: Integrated Security Solutions, Inc.
- Before: HHSN268201200015I - The New York Stem Cell Foundation, DUNS 796026149, 163 Amsterdam Avenue, New York, NY 10023-5858, ID/IQ Min: $1,000.00 / Max: $6,984,000.00.
- After: The New York Stem Cell Foundation
- Before: CADILLAC GAGE TEXTRON INC. DBA-10237 CADILLAC GAGE TEXTRON INC. DBA 19401 CHEF MENTEUR HWY NEW ORLEANS LA 70129-2565 US
- After: Cadillac Gage Textron Inc.
You can see some similarities between the 'before' items. They all contain address data. Some have additional items like DUNS numbers (one labelled as such, one not), contract numbers and award values. Some read more like sentences and some do not. Classifying and removing these less important bits seems difficult, but there are really only a few dozen non-name features that show up in the vendor award data we get from the government (DUNS numbers, NCAGEs, contract numbers, addresses, etc.). The trick is to reliably recognize and remove the features we don't care about, and preserve those we do.
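To make the idea concrete, here is a minimal sketch in Python. It is illustrative only: the patterns and the function name `strip_non_name_features` are assumptions made up for this example, and it covers just three of the features mentioned above (a leading contract number, DUNS numbers, and the start of a street address).

```python
import re

# Assumed shape of a leading contract number: a run of 10 or more capital
# letters and digits followed by a dash, e.g. "HHSN268201200015I - ".
LEADING_CONTRACT = re.compile(r"^[A-Z0-9]{10,}\s*-\s*")

# The first non-name feature is treated as the end of the vendor name.
# These three alternatives are hypothetical and far from exhaustive.
FIRST_NOISE = re.compile(
    r"""
    \[?\s*\bDUNS\b           # labelled DUNS number, with or without a bracket
    | \(\d{9}\)              # unlabelled nine-digit number in parentheses
    | \b\d{1,6}\s+[A-Za-z]   # street number followed by a street name
    """,
    re.IGNORECASE | re.VERBOSE,
)


def strip_non_name_features(raw: str) -> str:
    """Return the part of a raw vendor string that looks like the name."""
    # Drop a contract number at the front, if there is one.
    text = LEADING_CONTRACT.sub("", raw.strip())
    # Cut everything from the first recognized non-name feature onward.
    match = FIRST_NOISE.search(text)
    if match:
        text = text[: match.start()]
    # Tidy up separators left dangling at either end.
    return text.strip(" ,;-")


print(strip_non_name_features(
    "Integrated Security Solutions, Inc. (044574767) "
    "108 Cooperative Way Kalispell, MT 59901-2386"
))
# prints: Integrated Security Solutions, Inc.
```

A sketch like this will get some names wrong. A vendor whose name contains a number followed by a word would be cut short by the street address pattern, and resolving a "DBA" string to its parent company needs a lookup that is not shown here.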
So with a little NLP magic (thanks PHP NLP Tools!) and some very convoluted regex patterns, we were able to merge thousands of dirty vendors into a much nicer, cleaner set.
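For a feel of what merging can look like, here is a deliberately simple Python sketch. It is not the NLP approach described above: it swaps in a plain normalized-key technique, and the names `merge_key`, `group_vendors` and the `LEGAL_SUFFIXES` list are assumptions invented for this example.

```python
import re
from collections import defaultdict

# Assumed list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"inc", "llc", "corp", "co", "ltd"}


def merge_key(name: str) -> str:
    """Build a comparison key that ignores case, punctuation, word order
    and common legal suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(t for t in tokens if t not in LEGAL_SUFFIXES))


def group_vendors(names):
    """Group raw vendor names that share the same comparison key."""
    groups = defaultdict(list)
    for name in names:
        groups[merge_key(name)].append(name)
    return dict(groups)


groups = group_vendors(["Widget Federal", "Federal, Widget", "Widget Federal LLC"])
print(groups)
# prints: {'federal widget': ['Widget Federal', 'Federal, Widget', 'Widget Federal LLC']}
```

Exact keys like this catch reordered names and stray suffixes, but not misspellings such as "WidigtFed". Those need fuzzier matching.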
When you navigate our app, you'll be greeted by a set of vendors that is much easier to understand. Add one to your hōrd today and keep up with the competition. Enjoy!