Deduplication 2: Deduplicate harder

The deduplication exercise from yesterday talked about how to deduplicate clean, easy-to-use data.  If you ever find data like that, please let me know.

In reality, there are always quirks in the data to work around.  Here’s how to solve a few of them.

The middle name conundrum.  There is no good way to deal with middle names in a database, because the way people use them is non-standard.  For some people, a middle name is never used and sometimes a throwback to a bygone ancestor (e.g. ,Ronald Bilius Weasley).  For some, it is an integral part of the first name (e.g. Mary Kate and Ashley Olson; that is just one long name, right?).  Others skip right over the first name and go right to the middle (e.g., F. Murray Abraham).  This is why a good database will have a salutations field you can fill in, so your letters go to Ron, Mary Kate, and Murray, and not Ronald, Mary, and F.

This, however, makes deduplication a bear.  You want to preserve the original first name, but need to manipulate the data:

  1. Insert two columns next to your first name row (more if you think people will have three or more names in the first name field)
  2. Copy (Ctrl + C) the first name column into the first empty column.
  3. Highlight the column and go to Data > Text to Columns
  4. Choose Delimited and hit Next
  5. Choose Space and hit Finish

This separates the names into separate columns.  Dedup as normal based on these fake name columns, then delete these columns and you have clean(er) data.

Forms of names. An automated system won’t be able to tell you that Robert and Bob are the same person, but you can figure out close matches of names like Robert and Rob.  The trick here is when you deduplicate for first name to put

=LEFT(A2,3)=LEFT(A3,3)

Instead of

=A2=A3

 This compares the three first letters of cell A2 to the first three letters of cell A3.  This solves your Rob/Robert’s and Nick/Nicholas’s of the world.  If you want to be a little less restrictive, you can do two instead of three and solve for your Nicole/Nikki’s and your Jo/Joanne’s.

This handy trick also works well for addresses.  Generally if the first 4-5 digits of the address are the same, they are close enough for government work (but still catches them if someone put Rd instead of Dr).

Adding as you go

Because Excel is a flat file and not an actual database, you will frequently have to sum gifts to get a total amount of giving.  Let’s go back to our initial email list, now with gift amounts and the middle names broken out.  Columns F, G, and H, are set up to compare A, C, and D.

thirdexcellist

Here’s how to proceed:

  1. Set up column I with the statement =AND(F2,G2,H2) (and copy it for all of the rows). This returns TRUE if all of the columns are TRUE and FALSE if any are.  This puts in one cell whether something is a duplicate or not.
  2. Set up the first row of column J with the statement =IF(I1,J1+E2,E2). This is the trickiest bit.  What this is saying is “Look at the row before this.  If it’s the same as this row, add this donation to all of the previous donations that match this record.  If not, just put in the one gift.”
  3. Copy this column and paste values like you did with the TRUE/FALSE earlier.

fourthexcellist

Now, when you deduplicate, you’ll be getting the total amount of gifts for a donor, summed up.

There are certainly not perfect.  But to do perfect, you need a database.

Deduplication 2: Deduplicate harder

Please share your thoughts.

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s