A brief instructional video about the benefits of data hygiene

(Billy is walking to school.  He is in black and white, but mostly white, between his grey blond hair, grey pale skin, grey khaki pants, and white white button-up shirt.  Only his shoes show black – certainly not his small town.)

Narrator: Woah.  Hold on there, Billy.

(Billy stops and looks for the voice).

Narrator: Where you going in such a hurry?

(Billy holds up the books he is carrying and taps on his watch with a finger)

Narrator: Ah, going to school.  And I suppose you haven’t even given a thought to DATA HYGIENE?

(Billy pauses, has a quizzical look, and proceeds to sniff his armpits.)

Narrator: No, not that type of hygiene.  DATA hygiene.  One of the things that makes America great (cut to stock footage of a flag waving and other good and wholesome things) is Americans doing positive, American things for other Americans, helping our neighbors, our pets, and our environment.  For example, elderly people! Puppies! Parks!  All of this is possible because of our positive, American civic associations of good people.

(Cut back to Billy nodding enthusiastically.  If Billy were tragically stuck inside a paper bag, he could not act his way out)

Narrator: But what if ALL OF THIS CAME TO AN END?  Without data hygiene, our civic associations wouldn’t be able to help preserve our positive American way of life.  They wouldn’t be able to reach new volunteers, get voluntary donations, or help out in your neighborhood.

(The previous images of an elderly person, a puppy, and a park, all disappear in rapid succession)

Narrator: THIS WOULD LEAD TO COMMUNISM!

Billy: Holy shit!

Narrator: Wait, you can talk? I thought we had you muted.  Why didn’t you talk before?

(Billy shrugs.)

Narrator: There are two types of data: data that you control and user-entered data.  Let’s take a tour of data you control.

(Billy is whisked away to an average city street)

Narrator: Here is the hustle and bustle of a nonprofit organization’s traffic.  All sorts of data come through here on a daily basis, from caging vendors to development officers to syncs with other databases.  And we can control these with some basic community standards.  But you have to make sure these community standards are followed, from address line 1 standards…

(close-up on the street sign, which magically changes from Mockingbird Lane to Mockingbird Ln)

…to address line 2 standards…

(close-up on an apartment, labeled as 123 Main Street, Apartment 3-B, which changes to
123 Main St
Apt 3B)

…to pick lists that dictate what titles and suffixes can be used.

(A mailbox that says Reverend Timothy Lovejoy, III, changes to Rev Timothy Lovejoy III).

Narrator: It’s better that these things are cleaned up in the database…

(Billy opens a door and sees workers in a dank, grey office working on various spreadsheets.  A supervisor is whipping them if they don’t clean lists fast enough.  He recoils in horror.)

So that hard-working Americans don’t have to fix them later.  Fortunately, you have professionals doing data entry here, so that should limit duplication and errors.  The same can’t be said for…

(Click. Billy is instantly transported to a seedy red light district.  Crude graffiti adorns the spare brick walls and trash lies, some in bags, some not, along the street.)

Narrator: USER GENERATED DATA!

(A scantily clad woman of the night beckons to Billy.  He starts to walk over)

Narrator: Billy! What do you think you are doing?

(Billy stops and gives the camera a sheepish shrug).

Narrator: Here in user-generated data, anything goes.  In order to get people to fill out forms, you have to make it very easy.  Most of the time, that means leaving out items on the form…

(a metal mailbox is embedded in the wall.  It says simply “Jim”.)

… not validating standard fields or following capitalization rules…

(the neighbor’s mailbox is labeled “mt. cHris hartnet, mbs”)

… having people put data wherever…

(the camera pans up to reveal a steel mill in the background labeled “Dr. Alice Humphries, DDS”)

… and just plain having jokers plant fake information.

(the building next to the mill is labeled “I. P. Freely.”  Billy looks frightened.)

Narrator: Yes, this lack of cleanliness is the price you pay for freedom.  American freedom!  But we all have to do our part, or else the civic organizations won’t be able to reach anyone in this neighborhood.  And you know what that means, don’t you Billy?

(Billy nods and mouths the word “communism”)

Narrator: That’s right.  Here we have to use broader hygiene techniques.  Enforce your capitalization rules, not theirs.  Every one of these houses should be run through address standardization, so the addresses are correct.  They should also be run against National Change of Address (NCOA) records, so you know when someone has moved.  If someone’s email address doesn’t work, make sure to suppress it or you will be labelled a spammer.  And you know what happens to spammers.

(The camera shows a man in a stockade.  Residents are throwing cans of Spam at him.)

Narrator: Appending data like telephone numbers, addresses, and email addresses when you don’t have them is also usually worth the investment.  That way, you know if Jim back there is the same Jim who just wrote your nonprofit a big check.

(Click. Billy is back on his school grounds.)

Narrator: So, Billy, you know the importance of data hygiene, right?

(Billy nods enthusiastically and runs off camera)

Narrator: Data hygiene: it’s the right thing to do.  America.


Deduplication 2: Deduplicate harder

The deduplication exercise from yesterday talked about how to deduplicate clean, easy-to-use data.  If you ever find data like that, please let me know.

In reality, there are always quirks in the data to work around.  Here’s how to solve a few of them.

The middle name conundrum.  There is no good way to deal with middle names in a database, because the way people use them is non-standard.  For some people, a middle name is never used and is sometimes just a throwback to a bygone ancestor (e.g., Ronald Bilius Weasley).  For some, it is an integral part of the first name (e.g., Mary Kate and Ashley Olsen; that is just one long name, right?).  Others skip right over the first name and go right to the middle (e.g., F. Murray Abraham).  This is why a good database will have a salutation field you can fill in, so your letters go to Ron, Mary Kate, and Murray, and not Ronald, Mary, and F.

This, however, makes deduplication a bear.  You want to preserve the original first name, but need to manipulate the data:

  1. Insert two columns next to your first name column (more if you think people will have three or more names in the first name field)
  2. Copy (Ctrl + C) the first name column into the first empty column.
  3. Highlight the column and go to Data > Text to Columns
  4. Choose Delimited and hit Next
  5. Choose Space and hit Finish

This separates the names into separate columns.  Dedup as normal based on these fake name columns, then delete these columns and you have clean(er) data.
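For anyone doing this outside Excel, here is a rough Python sketch of the same Text to Columns split. The function name `split_first_name` and the `max_parts` parameter are made up for illustration; it assumes names are separated by spaces and that anything past the last column should be kept rather than dropped.

```python
def split_first_name(first_name, max_parts=3):
    """Split a first-name field on spaces and pad it to max_parts columns."""
    # split() with no argument also collapses runs of spaces.
    parts = first_name.split()
    # Fold any extra names into the last column so nothing is lost.
    if len(parts) > max_parts:
        parts = parts[:max_parts - 1] + [" ".join(parts[max_parts - 1:])]
    # Pad short names with empty strings so every row has the same width.
    return parts + [""] * (max_parts - len(parts))


names = ["Ronald Bilius", "Mary Kate", "F. Murray", "Ron"]
split_names = [split_first_name(name) for name in names]
```

As with the spreadsheet version, the original first name field is left alone; the split pieces are throwaway columns used only for comparing.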

Forms of names. An automated system won’t be able to tell you that Robert and Bob are the same person, but you can figure out close matches of names like Robert and Rob.  The trick, when you deduplicate on first name, is to put

=LEFT(A2,3)=LEFT(A3,3)

Instead of

=A2=A3

This compares the first three letters of cell A2 to the first three letters of cell A3.  This solves your Rob/Robert’s and Nick/Nicholas’s of the world.  If you want to be a little less restrictive, you can do two instead of three and solve for your Nicole/Nikki’s and your Jo/Joanne’s.

This handy trick also works well for addresses.  Generally, if the first 4-5 characters of the address are the same, they are close enough for government work (and the match still catches them if someone put Rd instead of Dr).
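The prefix comparison can be sketched in Python along these lines. The helper name `prefix_match` is hypothetical; it lowercases both values because Excel's = comparison on text ignores case, and trimming stray spaces is an extra assumption on top of what the formula does.

```python
def prefix_match(a, b, n=3):
    """True when the first n characters agree, ignoring case and outer spaces."""
    return a.strip().lower()[:n] == b.strip().lower()[:n]


# Three letters catches Rob/Robert but not Robert/Bob.
rob_robert = prefix_match("Rob", "Robert")
robert_bob = prefix_match("Robert", "Bob")

# Two letters is looser and picks up Nicole/Nikki and Jo/Joanne.
nicole_nikki = prefix_match("Nicole", "Nikki", n=2)
jo_joanne = prefix_match("Jo", "Joanne", n=2)

# Five characters of an address ignores a Rd/Dr mix-up at the end.
same_address = prefix_match("123 Main Rd", "123 Main Dr", n=5)
```

The looser the prefix, the more false matches you will get, so it is worth eyeballing the results before deleting anything.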

Adding as you go

Because Excel is a flat file and not an actual database, you will frequently have to sum gifts to get a total amount of giving.  Let’s go back to our initial email list, now with gift amounts and the middle names broken out.  Columns F, G, and H, are set up to compare A, C, and D.

[Image: thirdexcellist]

Here’s how to proceed:

  1. Set up column I with the statement =AND(F2,G2,H2) (and copy it for all of the rows). This returns TRUE if all of the columns are TRUE and FALSE if any of them is FALSE.  This puts in one cell whether something is a duplicate or not.
  2. Set up the first data row of column J (cell J2) with the statement =IF(I1,J1+E2,E2) and copy it down. This is the trickiest bit.  What this is saying is “Look at the row before this.  If it’s the same as this row, add this donation to all of the previous donations that match this record.  If not, just put in the one gift.”  (In row 2 this relies on I1 being empty.  If I1 holds a text header, the formula returns #VALUE!, so put =E2 in J2 and start the formula in J3 as =IF(I2,J2+E3,E3).)
  3. Copy this column and paste values like you did with the TRUE/FALSE earlier.

[Image: fourthexcellist]

Now, when you deduplicate, you’ll be getting the total amount of gifts for a donor, summed up.
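Outside Excel, the same "sort, then sum the matching rows" idea might look like the Python sketch below. The donor rows and the `total_gifts` name are invented for illustration, and the match here is on the full first name, last name, and email rather than on the split name columns.

```python
from itertools import groupby


def donor_key(row):
    """Match on first name, last name, and email, ignoring case."""
    first, last, email, _amount = row
    return (first.lower(), last.lower(), email.lower())


def total_gifts(rows):
    """Return one row per donor with all of that donor's gifts summed."""
    totals = []
    # groupby only groups adjacent rows, so sort first (as in the spreadsheet).
    for _key, group in groupby(sorted(rows, key=donor_key), key=donor_key):
        group = list(group)
        first, last, email, _amount = group[0]
        totals.append((first, last, email, sum(r[3] for r in group)))
    return totals


# Hypothetical gift list: (first name, last name, email, gift amount)
gifts = [
    ("Ernst", "Blofeld", "ernst@example.org", 25),
    ("James", "Bond", "james@example.org", 7),
    ("Ernst", "Blofeld", "ernst@example.org", 100),
]
donor_totals = total_gifts(gifts)
```

Unlike the running total in column J, this produces the summed row directly, so there is no separate delete step afterwards.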

These are certainly not perfect.  But to do perfect, you need a database.


Your trapezoid zero database – a spreadsheet

Why trapezoid zero? Because square one is too dignified.

If you are like most nonprofit direct marketers, you have one true donor database that filters in all of the data you need from various applications, gives you a view into what every constituent needs and wants, and allows you to segment and communicate with them instantly.

Then you wake up from your dream. You perform requisite amounts of personal grooming, cover your dirty shameful bits with clothing sufficiently uncomfortable to convey gravitas, go to work, and stare at the latest list you have pulled in Excel.

Excel is not a bad program per se and has gotten even better since Clippy was taken out for some cannoli in downtown Redmond and never heard from again.  The trick is that it is not intended to do what we do with it for address lists and the like.  For that, Microsoft created Access, which is actually a bad program per se.

I promise to talk about using Excel for its intended purpose, but first, here’s how to manage your lists in Excel, starting with basic deduplication.

[Image: edharris]

Deduplication could effectively be called Sisyphean, except someone is pushing the blinkered rock down as you are trying to push it up. You are never, never done deduplicating, especially given that the most common records for deduplication are your best supporters (given that they will be engaging the organization from all angles).

So let’s assume you have a very simple list.  You want to remove duplicates from an email list.  Specifically, you are looking for exact duplicates – you don’t mind emailing the same email twice as long as it is for different names and you don’t mind emailing the same person twice as long as it is at two separate addresses.  Here’s your list:

[Image: firstexcellist]

As you can see, there are a few Blofelds in there to get rid of.  Your list will be longer and less menacing.

  1. Select all by hitting Ctrl + A
  2. Go to Data > Sort
  3. Sort by all three fields in any order
  4. Now, you are going to compare each row to the one below it. Go to the top row and the first empty column and put in =A2=A3.  This creates a TRUE/FALSE statement where anything that comes back TRUE is identical.
  5. Drag the lower right corner of the cell you just created down so it covers all rows and over to the right so it covers as many columns as you have. It will look something like this:
  6. Highlight the TRUE/FALSE columns, hit Ctrl + C to copy the columns, then right-click on the same columns to bring up the special menu. Select Paste Values.  What you are doing here is replacing your TRUE/FALSE statements with the results of those statements.  Because you are going to be sorting by these, you don’t want Excel recalculating.
  7. Select all by hitting Ctrl + A
  8. Go to Data > Sort
  9. Sort by the three TRUE/FALSE columns. The output will look something like this.

    [Image: secondexcellist]
    As you can see, there are two records in rows 14 and 15 that were exact matches for other records in the database, since they have three columns of TRUE.  Those are your duplicates and you can delete them.  One might say you have a license to delete them.
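The sort-and-compare steps above can be sketched in Python as follows. The sample rows and the `drop_exact_duplicates` name are hypothetical, and unlike Excel's = comparison this version is case-sensitive, so "Bond" and "bond" would count as different records.

```python
def drop_exact_duplicates(rows):
    """Sort the rows, then keep a row only if it differs from the one below it."""
    ordered = sorted(rows)
    kept = []
    for i, row in enumerate(ordered):
        is_last = i == len(ordered) - 1
        # Same test as =A2=A3 across every column: compare to the next row.
        if is_last or row != ordered[i + 1]:
            kept.append(row)
    return kept


# Hypothetical list: (first name, last name, email)
email_list = [
    ("Ernst", "Blofeld", "ernst@example.org"),
    ("James", "Bond", "james@example.org"),
    ("Ernst", "Blofeld", "ernst@example.org"),
    ("Ernst", "Blofeld", "number1@example.org"),
]
clean_list = drop_exact_duplicates(email_list)
```

The same person at a different email address survives, which matches the rule set out at the start: only rows that match on every field are removed.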

That works fine for simple data.  But how often does that happen?  Tomorrow, I will cover some advanced tips for deduplication in Excel.
