If you don’t know what a linear regression analysis is or how it is measured, I recommend you start with my post on running regressions in Excel here.

OK, now that you’re back, you’ll notice I did an OK job of saying what a linear regression analysis is and what it means, but I didn’t mention why these would be valuable. Today, we rectify this error.

In yesterday’s post on correlations, I mentioned that they only work for two variables at a time. This is extremely limiting, in that most of your systems are more complex than this. Additionally, because of interactions between multiple variables, it’s difficult to determine what is causing what. I’ve discussed before how the failure of the US housing market was related to people assuming that variables that were correlated with each other were actually independent.

Linear regression analysis allows you to look at how several variables relate to an outcome at the same time, each one while accounting for the others. As a result, **regression analysis is the primary basic modeling algorithm.** In fact, it’s often used as a baseline for other approaches: if you can’t beat the regression analysis, it’s back to the drawing board.
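To make the difference from a two-variable correlation concrete, here is a minimal sketch in Python using NumPy. The data and the variable names (`gifts_per_year`, `appeals_received`, `annual_giving`) are made up for illustration and are not from any real file. The numbers are built so that donors who give more often also get more appeals, which is enough to make a simple correlation point the wrong way.

```python
import numpy as np

# Hypothetical, made-up donor data (an assumption for illustration only).
gifts_per_year = np.array([1, 2, 3, 4, 5, 6], dtype=float)
appeals_received = np.array([2, 1, 4, 3, 6, 5], dtype=float)

# Giving is constructed so each extra gift adds 5 and each extra appeal subtracts 2.
annual_giving = 10 + 5 * gifts_per_year - 2 * appeals_received

# Two-variable view: correlation between appeals and giving on their own.
pairwise_r = np.corrcoef(appeals_received, annual_giving)[0, 1]

# Regression view: both predictors at once, plus a column of ones for the intercept.
X = np.column_stack([np.ones_like(gifts_per_year), gifts_per_year, appeals_received])
coefs, *_ = np.linalg.lstsq(X, annual_giving, rcond=None)
intercept, b_gifts, b_appeals = coefs

print("correlation of appeals with giving:", round(pairwise_r, 2))
print("regression coefficient on gifts:", round(b_gifts, 2))
print("regression coefficient on appeals:", round(b_appeals, 2))
```

In this toy data the correlation between appeals and giving comes out positive, while the regression coefficient on appeals is negative once gift frequency is accounted for. Real data will be noisier, so treat this as a picture of the idea rather than a recipe.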

Two side notes here:

First, if you are interested in learning to do this yourself, I strongly recommend Kaggle competitions. Kaggle is where people compete for money to produce the best models for various things. Right now, for example, they are running a $200,000 competition on diagnosing heart disease, a $50,000 competition for stock market modeling, and a $10,000 competition to identify endangered whales from photographs.

It’s some pretty cool data stuff and the best part is that they have tutorial competitions for people like me (and perhaps you; I would hate to assume). One sample is to model which passengers would survive the sinking of the Titanic from variables like age, sex, ticket class, fare, etc. They walk you through correlation, regression, and some more advanced modeling techniques we’ll discuss later in the week. Here, as ever, they look for improvement on regression as the goal of more advanced models.

Second, it’s tempting to view regression as a Mendoza line* of modeling: a lowered hurdle that shouldn’t be bothered with. But regression can give you fairly powerful results and, unlike many of the more advanced modeling techniques we’re going to discuss, you can do it and interpret it yourself.

That said, like correlation, it doesn’t know what to do with non-linear relationships. For example, you have probably noticed that your response rate falls off significantly after a donor hasn’t donated in 12 months (plus or minus). A regression model that looks at number of months since last gift will ignore this and assume that the difference between 10 and 11 months is the same as the difference between 12 and 13 months. And it isn’t. It also will choke on our ask string test in the same way as correlations will.
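Here is a small sketch of that 12-month problem in plain Python. The response rates are invented for illustration (a flat 10% through month 12, then 2%), and `fit_line` and `predict` are hypothetical helper names, not part of any library.

```python
# Hypothetical data: response rate by months since last gift,
# flat through month 12 and then dropping sharply (made-up numbers).
months = list(range(1, 25))
response = [0.10 if m <= 12 else 0.02 for m in months]


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(months, response)


def predict(m):
    """Response rate the straight-line model expects at month m."""
    return intercept + slope * m


actual = dict(zip(months, response))

# The straight line takes the same step every month...
linear_step_10_11 = predict(11) - predict(10)
linear_step_12_13 = predict(13) - predict(12)

# ...while the made-up data takes no step at all, then one big one.
actual_step_10_11 = actual[11] - actual[10]
actual_step_12_13 = actual[13] - actual[12]

print("model step, month 10 to 11:", round(linear_step_10_11, 4))
print("model step, month 12 to 13:", round(linear_step_12_13, 4))
print("actual step, month 10 to 11:", round(actual_step_10_11, 4))
print("actual step, month 12 to 13:", round(actual_step_12_13, 4))
```

The model spreads the drop evenly across every month, so it overstates the decline before the cliff and understates it at the cliff.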

So here are some things worth testing with regression analyses:

**Demographic variables**: you may know the composition of your donor file (and if you are like most non-profits, it’s probably female skewed). But have you looked at which sex ends up becoming the better donor over time? A regression analysis may show that the men on your file donate more or more often (or not), which could change your list selects (I have been known to put a gender select on an outside file rental to improve its performance).

**Modeling your lapsed file**: Using RFM analysis, you know what segments perform best for you and which go into your lapsed program (if not, use RFM analysis to figure out what segments perform best for you). However, there may be hidden gems in your lapsed file: donors who missed a gift (according to you) and would react well if approached again. Taking your appended data like wealth, demographics, and other variables alongside your standard RFM analysis can help find some of these folks to reach out to.

**Content analysis:** In the earlier regression article, I showed a (bad) example of using regression analysis to find out what blog posts work best. This can be applied to Facebook or other content as well.

What I didn’t mention is that once you have this data, it probably applies across media. What works on Facebook and in your blog are probably good topics for your enewsletters, email appeals, and possibly paper newsletters as well. Through this type of topic analysis, you will figure out what your constituents react to, then give them more of it.
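For the topic analysis described above, one possible sketch (in Python with NumPy) is to tag each post with yes/no topic flags and regress engagement on those flags. Everything here is assumed for illustration: the topic flags `is_story` and `is_stats`, the share counts, and the tidy way the numbers line up.

```python
import numpy as np

# Hypothetical posts. Columns: intercept (always 1), is_story, is_stats.
posts = np.array(
    [
        [1, 0, 0],
        [1, 1, 0],
        [1, 0, 1],
        [1, 1, 1],
        [1, 1, 0],
        [1, 0, 1],
    ],
    dtype=float,
)

# Made-up share counts for each post, in the same order as the rows above.
shares = np.array([20, 35, 15, 30, 35, 15], dtype=float)

# Least-squares fit: each coefficient is that topic's estimated lift in shares
# over a post with neither topic, holding the other topic fixed.
effects, *_ = np.linalg.lstsq(posts, shares, rcond=None)
baseline, story_effect, stats_effect = effects

print("baseline shares:", round(baseline, 1))
print("lift from a story post:", round(story_effect, 1))
print("lift from a stats post:", round(stats_effect, 1))
```

With real posts the fit will not be this clean, and a small number of posts per topic will make the estimates jumpy, so read the coefficients as hints about what to test next.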

This, however, looks at your audience monolithically. In future posts, I’ll talk about both some ways to cluster/segment your file, like k-means clustering, and some ways of improving on regression analysis with techniques like Bayesian analysis. For now, though, it’s time to look at some formulae that rule our worlds even beyond direct marketing: what do Google and Facebook use?

* A baseball term coming from Mario Mendoza, a weak-hitting shortstop whose batting average usually hovered around .200. Anyone with a batting average below Mendoza’s was said to be hitting below the Mendoza line, or very poorly. (He made up for this for several years with strong fielding.) And now you know the rest of the story.