# Deduplication 2: Deduplicate harder

The deduplication exercise from yesterday talked about how to deduplicate clean, easy-to-use data.  If you ever find data like that, please let me know.

In reality, there are always quirks in the data to work around.  Here’s how to solve a few of them.

The middle name conundrum.  There is no good way to deal with middle names in a database, because the way people use them is non-standard.  For some people, a middle name is never used and sometimes a throwback to a bygone ancestor (e.g. ,Ronald Bilius Weasley).  For some, it is an integral part of the first name (e.g. Mary Kate and Ashley Olson; that is just one long name, right?).  Others skip right over the first name and go right to the middle (e.g., F. Murray Abraham).  This is why a good database will have a salutations field you can fill in, so your letters go to Ron, Mary Kate, and Murray, and not Ronald, Mary, and F.

This, however, makes deduplication a bear.  You want to preserve the original first name, but need to manipulate the data:

1. Insert two columns next to your first name row (more if you think people will have three or more names in the first name field)
2. Copy (Ctrl + C) the first name column into the first empty column.
3. Highlight the column and go to Data > Text to Columns
4. Choose Delimited and hit Next
5. Choose Space and hit Finish

This separates the names into separate columns.  Dedup as normal based on these fake name columns, then delete these columns and you have clean(er) data.

Forms of names. An automated system won’t be able to tell you that Robert and Bob are the same person, but you can figure out close matches of names like Robert and Rob.  The trick here is when you deduplicate for first name to put

=LEFT(A2,3)=LEFT(A3,3)

=A2=A3

This compares the three first letters of cell A2 to the first three letters of cell A3.  This solves your Rob/Robert’s and Nick/Nicholas’s of the world.  If you want to be a little less restrictive, you can do two instead of three and solve for your Nicole/Nikki’s and your Jo/Joanne’s.

This handy trick also works well for addresses.  Generally if the first 4-5 digits of the address are the same, they are close enough for government work (but still catches them if someone put Rd instead of Dr).