The dirty dirty data tricks that dirty dirty people will use to try to get their way

Matthew Berry, New York Times bestselling author and mediocre fantasy football advice-giver (this is a compliment; you have to listen to the podcast), does a column each year called “100 Facts.”  In his intro, each time, he warns about the exercise he is going to undertake.  Statistics can be shaded in whatever way you wish (I’m paraphrasing him), so he acknowledges that he is presenting the best facts to support his perceptions of players.  But he goes further to say that’s all other fantasy football analysts are doing as well – he’s just the one being honest about it.  It’s the analyst’s equivalent of Penn and Teller’s cups and ball trick with clear cups – just because you know how the trick is done doesn’t make it less entertaining.

With the knowledge of statistics comes the responsibility of presenting them effectively.  My first and much beloved nonprofit boss used to say that if you interrogate the data, it will confess.  I would humbly submit a corollary: if you torture the data, it will start confessing to stuff just to make you stop.

A well-wrapped statistic is better than Hitler’s “big lie”; it misleads, yet it cannot be pinned on you.
— How to Lie with Statistics by Darrell Huff

So here are some common tricks people will use to make their points.  Arm yourself against this, less you be the victim of data presented with either malice or ignorance.

The wonky y-axis

The person presenting to you was supposed to increase revenue by a lot.  In fact, s/he increased it by only a little.  The weasel solution?  Make a mountain out of that molehill:

deceptiverevenuegraph

Note that the difference between the top and bottom of the y-axis is only $10,000.  Here’s what that same graph looks like with the y-axis starting at 0, as we are trained to expect unless there’s a very good reason:

zero based revenue graph

Both are true, but the latter is a more accurate representation of what went on over the year.

Ignoring reference points

Let’s take a look at that last graph with the budgeted goal added in.

revenue graph with budget goal

This tells a very different story, no? Always be on the lookout for context like this.

The double wonky y-axis

I’ve been saving a Congressional slide for this blog post.  I make no claims about which side of this issue is true or right or moral or whatever.  That said, this is also a good example of having quality debates with good data versus intentionally putting your spin on the ball.

This graph was presented by Congressman Jason Chaffetz in the debate over Planned Parenthood.


Hat tip to PolitiFact.

The graph seems to say that Planned Parenthood health screenings have decreased, abortions have increased, and now Planned Parenthood performs more abortions than health screenings.

But this is a case where the graph has two different y-axes.  Looking at the data, you can see that there were still well more than double as many prevention services performed as abortions.  When we look at the graph, it looks like the opposite is true.

Again, you may choose to do with this information what you will; there are many who would say one abortion is too many.  However, to paraphrase Daniel Patrick Moynihan, you can have your own opinions, but not your own facts.

The outliers

This is one of those things that is less frequently used by people to fool you and more often overlooked by people who subsequently fool themselves.

Here’s a sample testing report.

test2

This one seems like a pretty clean win for Team Test Letter.  Generally, you are going to take the .2% point decrease in response rate in order to increase average gift by $7 and an additional 14.6 cents per piece mailed out.  Game, set, match.

But one must always ask the uber-question, why. So you look at the donations. It turns out a board member mailed her annual $10,000 gift to the test package.  No such oddball gifts went to the control package.  Since this is not likely a replicable event, let’s take out this one chance donation out and look at the data again.
test3An even cleaner win for Team Control.  The test appears to have suppressed both response rate and average gift.

Percentages versus absolutes

Check out the attached graph of email open rates, where a new online team came it and the director bragged about the increase in open rates.  I actually saw a variant of this one happen live.

openrate1

Wow.  Clearly, much better subject lines under the new regime, no?  More people are getting our messages.

Well, for clarity, let’s look at this on a month by month basis.

openrate2

So, something happened in July that spiked open rates. Maybe it’s the new team, but we must ask why. One of the common culprits, when you are looking at percentages, is a change in N, the denominator.  Let’s look at the same graph, but instead of percentages, we are going to look at the number of people who opened the email.

openrate3

Huh.  Our big spike disappeared.

In looking into this, July is when we started suppressing people who had not opened an email in the past six months.  This is actually a very strong practice, preventing people who don’t want to get email from you, have moved on to another address, or were junk data to begin with off of your files.  As a result, your likelihood of being called spam goes down significantly.

So it wasn’t that twice as many people were opening emails; it was that half as many people (the good half) were getting the emails.

Correlation does not equal causation

The wonderful site FiveThirtyEight recently did a piece on how Matt Damon is more attractive in movies where he is perceived as being smarter.  For example, see how dreamy Damon is perceived to be as super-genius Will Hunting.  As Irene Adler says to Sherlock in the eponymous BBC series, brainy is the new sexy.

And you can look at this and think a logical conclusion: the smarter a Matt Damon character is in a movie, the more attractive that character is perceived to be.  This is plausible even though dreaminess was judged from a still frame – if Matt Damon is wearing an attractive sweater, it’s one of the Bourne movies; if it’s WWII garb, probably Saving Private Ryan.

This conclusion would reason that when Damon plays Neil DeGrasse Tyson in the upcoming biopic, his resultant sexiness will distract from the physical mismatching casting.

There’s also the hypothesis posited by the author: “The more attractive Damon is perceived to be in a movie, the smarter he is perceived to be.”  This says the reverse of the above: if Damon is attractive in a movie, he will be perceived to be smart.  This too is plausible – we tend to overestimate the competence of people we find to be attractive (hence why there is no picture of me on the site – you would immediately start discounting my advice).

Or it could be an exogenous third factor that causes both.  What if make-up artists want to symbolize dumbness by making actors unattractive (actually, since it’s Matt Damon, let’s say less attractive not unattractive)?  Film is after all a visual medium and since they know people underestimate less attractive people, they aim to make less competent characters less attractive.

Those are the ways correlation can go: A can cause B, B can cause A, or C can cause A and B.

This is what we must guard against in drawing final conclusions, but rather continually refined theories.  Let’s say you are seeing a general trend that your advocacy mail packages are doing better than your average mail package.  It’s generally safe to say more advocacy mail packages would be better.  But what if it isn’t the advocacy messaging, but that advocacy messages have a compelling reply device?  Or that when you mailed your advocacy pieces, you were also in the news?

One of the key parts of determining the results of a test is learning what the test actually means.  It’s important to strip away other possibilities until you have determined what the real mechanism is for success or failure.  This is why, for the blog analysis last week, I did a regression analysis rather than a series of correlations – to control for autocorrelations.

You don’t have to be versed in all manner of stats; the most important thing it to keep asking why.  From that, you can find the closest version to the truth.

The dirty dirty data tricks that dirty dirty people will use to try to get their way

7 direct marketing charts your boss must see today

Yay!  It’s my first clickbait-y headline!

I preach, or at least will be preaching, the gospel of testing everything.  There have been times that it has been a rough year for the mail schedule, but then we get to a part of the year we tested into last year, so I know that the projections are going to be pretty good and our tweaks are going to work.  It is those times that there are but one set of footprints on the beach, for it is the testing that is carrying me. So I eventually had to test out one of these headlines — my apologies in advance if it works.

The truth is that there are no such charts that run across all organizations.  There are general topics that you need to cover with your boss – file health in gross numbers, file health by lifecycle segment, in-year performance, long-term projections, how your investments are performing.

But what you need to do is tell your story.  You need to analyze all of the data, make your call, and present all of the evidence that makes your case and all of the evidence that opposes it.

This sounds simple, but how often do you see presentations that feature slides that educate to no end – slides that repeat and repeat but come to no point.  Also, they are repetitive and recapitulate what has already been said.

On Monday, I brought up the war between art and science marketers.  The secret to how the artists win is:

Stories with pictures

Yes, really. The human brain craves narrative and will put a story to about anything that comes in front of it.  It also retains images better than anything else.  There’s a semi-famous experiment where they gave noted oenologists (French for “wine snobs”)* white wine with red food coloring. The experts used all of the words that one uses to describe red wine, without ever noting that it was actually a white wine. When confronted with this, the so-called wine experts all resigned their posts and took up the study of nonprofit direct marketing to do something useful with their lives.

winesmeller

OK, I’m lying about that last part.

My point is that we privilege our sight over all other sense – in essence, we are all visual learners.  When we see words on a slide, our brain, which is still trying to figure out why it isn’t hunting mastodons, sees the letters and has to pause to think “what’s with all of those defective pictures.”

So, as I’ve been writing a lot of defective pictures and I promised the seven direct marketing charts your boss must see today, let’s discuss a story that you would want to tell and how you would present it.

1.

Graph1

The idiot I replaced the idiot that I replaced cut acquisition mailings in 2012.

2.

Graph2

It spiked net revenue for a time, enough for him to find another job.

3.

Graph3

But that has really screwed us out of multiyear donors coming into 2015.  You can see the big drop in multiyear donors in 2014 because they weren’t acquired two years earlier.

4.

graph4

And multiyear donors are our best donors.  You’ll also note that our lapsed reacquired donors have greater yearly value than newly acquired with about the same retention rate.  Thus, my first strategic priority to focus more in reacquiring lapsed donors.  Not as good as the multiyear donor that idiot made sure we didn’t have coming into the file this year, but pretty darn good.

5.

graph5

Lapsed donors have actually decreased as a portion of our average acquisition mailing…

6.

graph6

…yet they have been cheaper to acquire.  In summary, they are better donors than newly acquired donors and they are cheaper to acquire, yet we’ve been reaching out to them less.  Thus, we have an opportunity here.

7.

graph7

Because of this insight and because my salary significantly lags the national average for a direct marketing manager of $67,675, I believe I deserve a raise.  I’m now open for questions.

I swear that in many presentations, this would be over 30 slides and over an hour long.  I’ve actually given some of those presentations and if someone was in one of those and is still reading this, I apologize.

Some key notes from this:

  • Note the use of color to draw attention to the areas that are important to you. Other data are there to provide background, but if you are giving the presentation, it is incumbent upon you to guide the mind of your audience.  In fact, if you are giving the presentation, you may wish to present the chart/graph/data normally, then have the important colors jump out (or the less important ones fade away), arrows fly in, and text appear.
  • As mentioned, this is a different structure of presentation that would normally occur. Normally, there would be a section on file health, then one on revenues, one on strategic priorities, and so on.  However, when you structure like that, the slide that makes the point of why you are doing the strategic priorities you are doing may be 50 slides early.  You can say, “remember the slide that said X?” but regardless of what the answer is, the answer is really is no.  You are smarter than that.  You are going to use data to support narrative, not mangle your story to fit an artificial order of data.
  • There is one point per image (with the exception of #4, which had a nice segue opportunity) and no bullet points. Bullet points help in Web reading (hence my using them here), but they actually hurt memory and retention in presentations.

With this persuasive power, though, comes persuasive responsibility.  Not in the sense that your PowerPoint will soon have you enough dedicated followers to form your own doomsday cult, although if that opportunity arises, please take the high road.

What I mean is as you get better and better at distilling your point, there will be a temptation to take shortcuts and to tilt the presentation so it favors your viewpoint beyond what is warranted.  Part of this is ethical, to be sure – don’t be that type of person – but a larger part is that no one person is smarter than everyone else summed together.  Even readers of this blog.  If you omit or gloss over important data points, you aren’t allowing honest disagreement and insights among your audience that can come to greater understanding.  By creating an army of ill-informed meat puppets, you are going it alone trusting on your knowledge and skill alone to get you through.  There will be a day and that day may be soon when the insight you will need will be in someone else’s head.

You do have to prioritize for your audience.  You may have noticed some other points you would have covered in these graphs – retention in this program is falling and cost to acquire donors is increasing.  This person chose to focus on lapsed but didn’t hide the other metrics, which is sound policy.

So we will cap off the week tomorrow with tricks that other people use to shade their data.  I debated doing this section because it could be equally used as a guide to shade your data.  But you are trusting me and I’m trusting you.  Knowledge is not good or bad in and of itself, but let’s all try to use it for good.

* Oenology is actually from the Greek words for “wine” and “study of,” but that isn’t funny…

7 direct marketing charts your boss must see today

The basics of direct marketing reporting – part two

Yesterday, we talked about the key metrics you want to look at in Excel – 13-14 indicators that speak to you about progress and testing results.

However, a direct marketing Muggle will look and these data and say “Huh.  Interesting.” This is direct marketing Muggle code for “this is not interesting and it makes me think of my Algebra II class, which was taught by a nun.”

While you will want all of the data, you will want a skinnier, clearer chart for others, preferably with colors that call out what is actually important.  Let’s look at a fairly standard test – your thesis was that extra personalization in the letter would increase average gift versus your control.  Here’s what this could look like:

uglytest

The first thing to notice is that your hypothesis was wrong – average gift didn’t go up.  But now you have another decision – should you pay for the additional personalization in the future?

You, as a direct marketing professional, can read this chart.  The increased personalization caused response rate to increase.  As a result, gross income per piece went up and net income per piece went up.  However, return on investment went down; the additional investment didn’t bring in as much as the investments before it.  What would you recommend to your boss?

This is a judgment call based on your goals for your program.  One good approach would be to call for a retest – possibly with even more personalization or to see if you can get the personalization costs down or different ask strings to try to boost average gift.  This is clearly not in the 95% percentile one way or the other (which are other good fields to add to your spreadsheet when you get more advanced), so more testing would be good.

But I know which one I would mail more quantity of when the next test is done – I would use the personalization version as the control.  For me, net per piece matters more than ROI.  Our donors’ time is a scarce and valuable commodity.  There are only so many times you have the opportunity to get in front of them, so if you have the opportunity to maximize their investment of their time, versus trying to go for cost control in borderline cases like this one.

Charity Navigator would disagree with me, as they focus on cost of fundraising, so that’s another point in my argument’s favor.  Remember the Charity Navigator Constanza test – hear what they have to say and do the opposite and it will be to your benefit.

So now you have your course of action.  Now you have to have other people see it your way.  Time to explain it:

pretty test

The first thing to note is that it’s legible.  The second is quantity and absolute gross, net, and cost numbers are gone.  These don’t have any relevance to the decision over what to roll out with.  If you leave them in, there’s a natural human temptation to think biggest = better, especially when it’s called revenue and has a dollar sign in front of it.  For a layperson, it’s good to eliminate these distractions.

Then we’ve color-coded the winning parts.  Control wins on cost; test wins on response rate and ROI, gross income and net.  This helps draw attention to the salient bits.  It is amazing how much these little steps can help focus minds.

You will note that I left ROI in there, even though it is evidence that does not support the case you are trying to make.  I’ve talked about testing as a central commandment on the direct marketer’s tablets.  But testing is nothing if there isn’t intellectual honesty.  You have to make the case, but also give your team all of the information to challenge you and make your arguments better.

This is usually where the aesthetic marketers get us data-driven marketers.  They tell quality stories based not on what is true, but on what we wish were true.

We must become equally good storytellers, because a good story plus data beats just a good story.  On Thursday, I’ll talk about how to present data in a compelling way, but first, we have to figure out how to measure our metrics.

The basics of direct marketing reporting – part two

The basics of direct marketing reporting

So there have been some unjustified slaps at Excel over the past week, as well as against hamsters, Ron Weasley, and the masculinity/femininity of people named Kris.  (The one against Clippy was totally justified.)

clippy

It seems only right, then, to talk about things that Excel is actually good at – doing calculations and presenting data.

There are two general schools of marketing people: art versus science.  The art folks appreciate the aesthetics of marketing and aim toward beautiful design and copy.  They will talk about white space and the golden ratio and L-shaped copy and such.  They elevate fad into trend into fashion. They were responsible for the Apple “1984” commercial and don’t understand why the guy with bad toupee on late night commercials is really successful. They can read the nine-point font they are proposing for your Website and don’t care if it is actually usable.

The job of the science people is to make sure that these people don’t damage your organization too much.*  Our motto is “Beauty is beautiful, but what else does it do?”, or it would be if we started having mottos.  Our tools are the well-designed study, the impertinent question (e.g., “I understand that our brand guidelines say to use Garamond, but our testing shows Baskerville converts better. Would we rather stick to the brand guidelines or raise more money?”), and the clear data presentation.

This last one can be hard for us. Too often, when we present our data, the data goes up against a beautiful story that people wish was true and loses.

So we need to cover not only what data you want to collect (today), but how to present it compellingly (tomorrow).

A standard Excel chart for mail pieces

The things I like to see, in approximate order, are:

  • Enough things to identify the piece/panel/list
  • Quantity mailed
  • Response rate
  • Number of donors
  • Average donation
  • Gross revenue
  • Cost
  • Net revenue
  • Gross per thousand
  • Cost per thousand
  • Net per thousand
  • Return on investment
  • Cost to raise a dollar

That’s for a donor piece; for acquisition, I’d recommend adding cost to acquire.

So that’s what data to collect; tomorrow, we will look at how to present it.

* I am framing this as a battle largely for dramatic purposes. Ideally, you have a data person who respects the talents of a high-quality designer and a designer who likes to focus on what works. These together are stronger than any one alone.**

** But if you have to pick one, pick the scientist.

The basics of direct marketing reporting

Semi-advanced direct marketing Excel statistics

In addition to not being a database, Excel is also not a statistics package.  If you are going to do anything advanced, I highly recommend R.  The programming language, not the John-Cleese-played Q replacement in The World is not Enough and Die Another Day.

Cleese

Cheer up! Yes, it’s a lesser part in lesser Bond movies.
You, John Cleese, are still an international treasure.

Anyway, stats in Excel.  We’ll start with correlations, as they can give us some insight into blog and Facebook traffic and interactions.

Wait, you argue – Facebook is not direct marketing.  First, yes, you are correct.  Second, no, you are wrong; there is a way that you can use Facebook as a direct marketer.  I’ll talk about this more when I do a whole social media week (don’t worry, folks, I promise to spend time deflating the hype and hopefully producing things you can print out and get to board members and say “See? Let’s put money in places where it will make money!”).  Third, because Facebook has limited value (but not zero) as a direct marketing vehicle, you can test things on there to see how they resonate with your audience.  Granted, your Facebook audience and direct mail audience will probably be fairly dissimilar, but your online audience is probably similar to your Facebook audience.  And what you are looking for is what makes compelling online content for you.  So this is a way to make Facebook your testing ground before you put it on to a real platform (i.e., your Web site, your emails).

I’m going to demonstrate this on this blog’s stats because I have the data available. However, I’ve also done this with Facebook posts very successful.  The prep work you will have to do for either blog posts or Facebook posts is to record your outcomes (view, likes, shares, and/or comments), to code the subject matter of each post, and to put in any other variables like day of the week that may be relevant.  Here’s my version of this:

beginningofregression

I have my blog posts on the left and the various factors on the right (and there are more tags, but I need to cut it off at some point to display it.  Yes, I have a blog post that has zero views. If you would like to break the seal on a blog post about why we do segmentation, it’s here; I’m sure it would appreciate a visit.

I then went through this and deleted tags that only applied to a single post. Then I ran a correlation for each individual variable to page views. The correlation function looks like

=CORREL($B2:$B20,I2:I20)

And is expressed from 1 to -1.

correlations

Here’s what that looks like.  Let’s look at the days of the week first, because there appears to be an effect here — Monday content has been king, with a strong correlation to page views. It will be interesting going forward to see in the long term whether that is the nature of the content (I’ve been trying to put introduction content on Mondays and get progressively more involved throughout the week) or that people are more interested in reading blog posts on Mondays.  Nevertheless, I can probably do a better job of setting up the rest of the week as must-see content, since Tuesday, Wednesday, and Thursday are all a bit negative.

Images tend to correlate well to views as well, probably because they show better in social media.  I’d been noticing this from just a glance as well, so you will probably be seeing more images here in the weeks to come. They will also be less boring images, since some of the lower performers were images of equations and Excel sheets. It is not coincidence that the tallest Python led this blog post.

And it looks like cultivation and multichannel efforts are winning while conversion, lifetime value, and personalization are not as strong, with negative correlations to page views. I won’t be acting on this immediately, but keeping an eye on it. And I do have a multichannel week planned in the near future, so we’ll be able to test whether that’s an artifact in the data.

However, you might notice that the reason cultivation is ranked so highly is that it is in the two top performing posts, which are Monday posts. Is it the topic or the day that made those strong?  For this, you need regression.

Normally, you wouldn’t do this after only 20 blog posts.  We are not going to be able to draw any statistically significant conclusions, but I do want to show you how it’s done.

  1. Go to Data > Data Analysis > Regression
  2. Select the range from your outcome variable as your Y range and the range of the independent variables you want to test in the X range like so:
    regression panel
  3. Hit OK. You’ll get something that looks like this.  In my case, it’s a really, really bad regression:
    a bad regression

Yuck. The things you would normally be looking for are:

  • in R-squared, you are looking for as close to 1 as possible.  One would mean your model is totally predictive. Zero means it predicts nothing at all.
  • In P-value per variable, you are looking for less than .05.  That would show if there is a statistically significant relationship between any of the variables and your output. In this case, there isn’t and we can pretty much throw out the whole thing.
  • If there is a relationship, you want to look at the coefficient for two things:
    • Is it positive or negative? Positive is good things; negative is bad things.
    • How big is the relationship? In this one, if these were significant, it would be a bad idea to post about personalization again, as posting about it reducing views by 7.  But it isn’t significant, so I’m not yet worried.

Hope this helps you with the stats side of Excel.  Tune in next week, when we look at some of the things that Excel is actually good at.

Semi-advanced direct marketing Excel statistics

Appending data in Excel

One of the primary reasons to have a database is to have data in various tables.  A simple example is a database with a people table – a list of your donors – and a donations table.  This allows for there to be a one to many relationship between the donors and donations.  After all, a good donor will generally donate multiple times.  By using a key in the people table, each donation is marked with that same key – e.g., all donations from donor 12398 have 12398 in their donor field and are associated with that record in the people table.

Excel can’t do this.  As a flat file, all you can represent are one-to-one relationships.  If you had to have multiple gifts in Excel, you would have duplicate donor records for each instance of the gift.

But you can append data to a record in Excel.

A while back, I wanted to do some homemade modeling on a file to add sex in as a variable (male/female in this case, to simplify the model).  Unfortunately, I didn’t have a list of who was male or female in this particular file, so I used US census records to determine how many males and females there were with a given first name.  This led to the odd conclusion that someone who had a name that 60 females and 40 males have in the US is a 60% woman.  Thank goodness I was using this for modeling and not for assigning Mr. or Ms.

By the way, if you were wondering if there were any perfect half-female, half-male names, there were three: Ariel, Hong, and Kris.  Sydney, Robbie, Frankie, and Kerry were almost androgynous, just missing the cut.  The Excel sheet is here if you are curious.

To complete the model, I had to put the percentage of femaleness with each record.  I used a somewhat fiddly function called VLOOKUP.

VLOOKUP tells Excel to look through a bunch of rows to find a particular value in the left column of an array, find it, and return another column in that array.  To wit:

=VLOOKUP(A1, ‘[Male v Female first names.xlsx]Sheet1’!$A:$B,2,FALSE)

This assumes that A1 is the first name you want to look up.  This then goes over to the document called Male v Female first names and looks at the first two columns, A & B.  It searches for the name in column A.  If it finds it, it returns back the value in B next to that value.  So, if A1 is Hong, it finds Hong in the array, sees the value next to that is 50%, and puts it in the cell in question.  The FALSE at the end just means you want an exact match, rather than the closest one.

It is by no means as pretty or efficient as a SQL JOIN statement.  In fact, if you have a lot of these to do, as I did, copy and paste it for the full column, then get yourself some tea and a crossword puzzle while Excel labors like a hamster on a wheel.  But, on the plus side, you don’t have to learn SQL.  You still should, and we’ll cover SQL for direct marketers at some point, but you can’t learn everything immediately.

Appending data in Excel

Deduplication 2: Deduplicate harder

The deduplication exercise from yesterday talked about how to deduplicate clean, easy-to-use data.  If you ever find data like that, please let me know.

In reality, there are always quirks in the data to work around.  Here’s how to solve a few of them.

The middle name conundrum.  There is no good way to deal with middle names in a database, because the way people use them is non-standard.  For some people, a middle name is never used and sometimes a throwback to a bygone ancestor (e.g. ,Ronald Bilius Weasley).  For some, it is an integral part of the first name (e.g. Mary Kate and Ashley Olson; that is just one long name, right?).  Others skip right over the first name and go right to the middle (e.g., F. Murray Abraham).  This is why a good database will have a salutations field you can fill in, so your letters go to Ron, Mary Kate, and Murray, and not Ronald, Mary, and F.

This, however, makes deduplication a bear.  You want to preserve the original first name, but need to manipulate the data:

  1. Insert two columns next to your first name row (more if you think people will have three or more names in the first name field)
  2. Copy (Ctrl + C) the first name column into the first empty column.
  3. Highlight the column and go to Data > Text to Columns
  4. Choose Delimited and hit Next
  5. Choose Space and hit Finish

This separates the names into separate columns.  Dedup as normal based on these fake name columns, then delete these columns and you have clean(er) data.

Forms of names. An automated system won’t be able to tell you that Robert and Bob are the same person, but you can figure out close matches of names like Robert and Rob.  The trick here is when you deduplicate for first name to put

=LEFT(A2,3)=LEFT(A3,3)

Instead of

=A2=A3

 This compares the three first letters of cell A2 to the first three letters of cell A3.  This solves your Rob/Robert’s and Nick/Nicholas’s of the world.  If you want to be a little less restrictive, you can do two instead of three and solve for your Nicole/Nikki’s and your Jo/Joanne’s.

This handy trick also works well for addresses.  Generally if the first 4-5 digits of the address are the same, they are close enough for government work (but still catches them if someone put Rd instead of Dr).

Adding as you go

Because Excel is a flat file and not an actual database, you will frequently have to sum gifts to get a total amount of giving.  Let’s go back to our initial email list, now with gift amounts and the middle names broken out.  Columns F, G, and H, are set up to compare A, C, and D.

thirdexcellist

Here’s how to proceed:

  1. Set up column I with the statement =AND(F2,G2,H2) (and copy it for all of the rows). This returns TRUE if all of the columns are TRUE and FALSE if any are.  This puts in one cell whether something is a duplicate or not.
  2. Set up the first row of column J with the statement =IF(I1,J1+E2,E2). This is the trickiest bit.  What this is saying is “Look at the row before this.  If it’s the same as this row, add this donation to all of the previous donations that match this record.  If not, just put in the one gift.”
  3. Copy this column and paste values like you did with the TRUE/FALSE earlier.

fourthexcellist

Now, when you deduplicate, you’ll be getting the total amount of gifts for a donor, summed up.

There are certainly not perfect.  But to do perfect, you need a database.

Deduplication 2: Deduplicate harder

Your trapezoid zero database – a spreadsheet

Why trapezoid zero? Because square one is too dignified.

If you are like most nonprofit direct marketers, you have one true donor database that filters in all of the data you need from various applications, gives you a view into what every constituent needs and wants, and allows you to segment and communicate with them instantly.

Then you wake up from your dream. You perform requisite amounts of personal grooming, cover your dirty shameful bits with clothing sufficiently uncomfortable to convey gravitas, go to work, and stare at the latest list you have pulled in Excel.

Excel is not a bad program per se and has gotten even better since Clippy was taken out for some cannoli in downtown Redmond and never heard from again.  The trick is that it is not intended to do what we do with it for address lists and the like.  For that, Microsoft created Access, which is actually a bad program per se.

I promise to talk about using Excel for its intended purpose, but first, here’s how to manage your lists in Excel, starting with basic deduplication.

edharris

Deduplication could effectively be called Sisyphean except some is pushing the blinkered rock down as you are trying to push up. You are never, never done deduplicating, especially given that the most common records for deduplication are your best supporters (given that they will be engaging the organization from all angles).

So let’s assume you have a very simple list.  You want to remove duplicates from an email list.  Specifically, you are looking for exact duplicates – you don’t mind emailing the same email twice as long as it is for different names and you don’t mind emailing the same person twice as long as it is at two separate addresses.  Here’s your list:

firstexcellist

As you can see, there are a few Blofelds in there to get rid of.  Your list will be longer and less menacing.

  1. Select all by hitting Ctrl + A
  2. Go to Data > Sort
  3. Sort by all three fields in any order
  4. Now, you are going to compare each row to the one below it. Go to the top row and the first empty column and put in =A2=A3.  This creates a TRUE/FALSE statement where anything that comes back TRUE is identical.
  5. Drag the lower right corner of the cell you just created down so it covers all rows and over to the right so it covers as many columns as you have. It will look something like this:
  6. Highlight the TRUE/FALSE columns, hit Ctrl + C to copy the columns, then right-click on the same columns to bring up the special menu. Select Paste Values.  What you are doing here is replacing your TRUE/FALSE statements with the results of those statements.  Because you are going to be sorting by these, you don’t want Excel recalculating.
  7. Select all by hitting Ctrl + A
  8. Go to Data > Sort
  9. Sort by the three TRUE/FALSE columns. The output will look something like this.

    secondexcellist
    As you can see, there are two records in rows 14 and 15 that were exact matches for other records in the database, since they have three columns of TRUE.  Those are your duplicates and you can delete them.  One might say you have a license to delete them.

That works fine for simple data.  But how often does that happen?  Tomorrow will be some advanced tips for deduplication in Excel.

Your trapezoid zero database – a spreadsheet