We do a fair amount of data reporting and extraction on local county tax records for a variety of clients (tax collectors, lawyers, accountants, etc.). One of the more interesting aspects of this data mining is retrieving, normalizing, and reporting on data from state or local sources.
The local sources are often challenging because they sometimes have no data integrity rules in place to dictate the format of first or last names. Often the data comes with no “legend” or data specification, so we need to deduce our own rules.
When the data can also include sub-owners, the problem becomes considerably harder, because each additional owner field brings its own formatting variations.
For instance, with one county we may have a format like the following:
Owner 1: Smith Bob
Owner 2: Smith Jane
Based on those two records, we deduce that the very first word in each owner field is the last name.
After reviewing the rest of the data records, we find a whole slew of variations:
Owner 1: Smith Bob & Lisa
Owner 2: Smith Jane
Owner 1: Smith Bob
Owner 2: Jane (w)
Owner 1: Smith Jame
Owner 2: Bob (H)
Owner 1: Smith Bob Jr.
Owner 2: Jane & Jill
Owner 1: Smith Bob Jr.
Owner 2: Smith-Wesson Jane
Data mining in this situation is very much a game of finding patterns and creating a simplified model.
The best method is to find a single “truth” that holds consistently throughout the data set and build from there. For instance, in the above records, the last name for “Owner 1” is ALWAYS the first word. This gives us a fixed anchor from which to build patterns that model every variation in the data set.
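As a minimal sketch of that first truth, the snippet below splits an “Owner 1” value into its last name and whatever follows. The function name `split_owner1` is hypothetical and only for illustration; it assumes the value is whitespace-separated, as in the samples above.

```python
from typing import Tuple


def split_owner1(raw: str) -> Tuple[str, str]:
    """Split an "Owner 1" value into (last_name, remainder).

    Assumption: the first whitespace-separated word is the last name,
    which is the one rule that held for every sample record above.
    """
    parts = raw.strip().split(maxsplit=1)
    if not parts:
        # Empty or blank field: nothing to split.
        return "", ""
    last_name = parts[0]
    remainder = parts[1] if len(parts) > 1 else ""
    return last_name, remainder


print(split_owner1("Smith Bob & Lisa"))
print(split_owner1("Smith Bob Jr."))
```

The remainder is deliberately left untouched, so later rules (co-owners joined by “&”, suffixes such as “Jr.”) can be layered on top without revisiting this step.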
Another truth we find is that a husband or wife is always designated with either an (H) or a (W), though the case of the letter varies, as in the (w) above.
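One way to capture the spouse designation is a small regular expression. This is a sketch only: the name `parse_spouse_marker` is hypothetical, and it assumes the marker always sits at the end of the field, which matches the samples above (“Jane (w)”, “Bob (H)”) but may not hold for every county.

```python
import re
from typing import Optional, Tuple

# Matches a trailing "(H)" or "(W)" in either case, allowing stray spaces.
SPOUSE_MARKER = re.compile(r"\(\s*([HhWw])\s*\)\s*$")


def parse_spouse_marker(raw: str) -> Tuple[str, Optional[str]]:
    """Return (name_without_marker, role), where role is
    "husband", "wife", or None when no marker is present."""
    match = SPOUSE_MARKER.search(raw)
    if match is None:
        return raw.strip(), None
    role = "husband" if match.group(1).upper() == "H" else "wife"
    return raw[:match.start()].strip(), role


print(parse_spouse_marker("Jane (w)"))
print(parse_spouse_marker("Bob (H)"))
print(parse_spouse_marker("Smith Jane"))
```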
As you run your data set through your data mining routines, other patterns will begin to emerge that you can build upon. It all starts with finding that first truth and building upon it.
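To help further patterns surface, one option is to reduce each raw value to a coarse “shape” and count how often each shape occurs. The sketch below is an illustration under stated assumptions, not our production routine: the names `token_class` and `shape` are hypothetical, and the suffix list goes beyond the single “Jr.” seen in the samples.

```python
import re
from collections import Counter

# Assumption: only "Jr." appears in the samples; the other suffixes are guesses.
SUFFIX = re.compile(r"(jr|sr|ii|iii|iv)\.?", re.IGNORECASE)
SPOUSE = re.compile(r"\([HhWw]\)")


def token_class(token: str) -> str:
    """Map one whitespace-separated token to a coarse category."""
    if token == "&":
        return "AMP"
    if SPOUSE.fullmatch(token):
        return "SPOUSE"
    if SUFFIX.fullmatch(token):
        return "SUFFIX"
    if "-" in token:
        return "HYPHENATED"
    return "NAME"


def shape(raw: str) -> str:
    """Describe a raw owner value as a sequence of token categories."""
    return " ".join(token_class(t) for t in raw.split())


samples = [
    "Smith Bob", "Smith Jane", "Smith Bob & Lisa", "Jane (w)",
    "Bob (H)", "Smith Bob Jr.", "Jane & Jill", "Smith-Wesson Jane",
]
shape_counts = Counter(shape(s) for s in samples)

# Frequent shapes are candidates for a rule; rare ones flag records to review.
for pattern, count in shape_counts.most_common():
    print(count, pattern)
```

Shapes that occur often are good candidates for the next rule, while one-off shapes point to records worth inspecting by hand.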
We love solving puzzles like these and more! Please do not hesitate to contact us about any data-related projects (reporting, taxes, state and local tax mining, etc.).