Tuesday, April 16, 2013

Another perspective on the uses of control groups

I have been reading Eric Siegel's book on Predictive Analytics. Though it is a "pop science" account, with the usual "this will change the world" subtitle, it is definitely a worthwhile read.

In chapter 7 he talks the reader through what are called "uplift models". These are Decision Tree models that not only differentiate groups who respond differently to an intervention, but also show how much differently they respond when compared to a control group where there is no intervention. All this is in the context of companies marketing their products to the population at large, not the world of development aid organisations.

(Temporarily putting aside the idea of uplift models...) In this chapter he happens to use the matrix below, to illustrate the different possible sets of consumers that exist, given two types of scenarios that can be found where both a control and intervention group are being used.
But what happens if we re-label the matrix, using more development project type language? Here is my revised version below:

Looking at this new matrix it struck me that evaluators of development projects may have a relatively impoverished view of the potential uses of control groups. Normally the focus is on the net difference in the improvement between households in the control and intervention groups: How big is it and is it statistically significant? In other words, how many of those in the intervention group were really "self-help'ers" who would have improved anyway, versus being "need-help'ers" who would not have improved without the intervention?

But this leaves aside two other sets of households who also surely deserve at least equal attention. One set is the "hard cases", who did not improve in either setting. Possibly the poorest of the poor. How often are their numbers identified with the same attention to statistical detail? The other set is the "confused", who have improved in the control group, but not in the intervention group. Perhaps these are the ones we should really worry about, or at least be able to enumerate. Evaluators are often asked, in their ToRs, to also give attention to negative project impacts, but how often do we systematically look for such evidence?

Okay, but how will we recognise these groups? One way is to look at the distribution of cases that are possible. Each group can be characterised by how cases are distributed in the control and intervention group, as shown below. The first group (in green) are probably "self-help'ers" because the same proportion also improved in the control group. The second group are more likely to be "need-help'ers" because fewer people improved in the control group. The third group are likely to be the "confused" because more of them did not improve in the intervention group than in the control group. The fourth group are likely to be the "hard cases" if the same high proportion did not improve in the control group either.
At an aggregate level only one of the four outcome combinations shown above can be observed at any one time. This is the kind of distribution I found in the data set collected during a 2012 impact assessment of a rural livelihoods project in India. Here the overall distribution suggests that the “need-help’ers” have benefited.
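
As a rough illustration, the four patterns described above can be written as a simple rule on two numbers: the percentage of households that improved in the control group and in the intervention group. This is only a sketch. The function name, the 5-point tolerance for "the same proportion" and the 50% cut-off for a "high" proportion are assumptions made for illustration, not values taken from the data set.

```python
def classify_distribution(control_improved_pct, intervention_improved_pct,
                          tolerance=5.0, high=50.0):
    """Label a group by comparing % improved in control vs intervention.

    tolerance and high are illustrative assumptions, not values from the data.
    """
    difference = intervention_improved_pct - control_improved_pct
    if difference > tolerance:
        return "need-help'ers"   # fewer improved in the control group
    if difference < -tolerance:
        return "confused"        # more improved without the intervention
    # Roughly the same proportion improved in both groups.
    if control_improved_pct >= high:
        return "self-help'ers"   # most improved anyway
    return "hard cases"          # most did not improve in either group


# Hypothetical percentages, for illustration only.
print(classify_distribution(30, 60))
print(classify_distribution(60, 30))
```

With different cut-offs the same group could receive a different label, so the thresholds would need to be argued for in any real analysis.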

How do we find if and where the other groups are? One way of doing this is to split the total population into sub-groups, using one household attribute at a time, to see what difference it makes to the distribution of results. For example, I thought that a household’s wealth ranking might be associated with differences in outcomes. So I examined the distribution of outcomes for the poorest and least poor of the four wealth ranked groups. In the poorest group, those who benefited were the “need-help’ers”, but in the “Well-Off” group those who benefited were the “self-help’ers”, perhaps as expected.

There are still the two other kinds of outcomes that might exist in some sub-groups: the “hard cases” and the “confused”. How can we find where they are? At this point my theory-directed search fails me. I have no idea where to look for them. There are too many household attributes in the data set to manually examine how different each one's distribution of outcomes is from the aggregate distribution.

This is the territory where an automated algorithm would be useful. Using one attribute at a time, it would split the main population into two sub-groups, and search for the attribute that made the biggest difference. The difference to look for would be extremity of range, as measurable by the Standard Deviation.  The reason for this approach is that the most extreme range would be where one cell in the control group was 100 and the other was 0, and similarly in the intervention group. These would be pure examples of the four types of outcome distributions shown above. [Note that in the two wealth ranked sub-groups above, the Standard Deviation of the distributions was 14% and 15% versus 7% in the whole group]
This is the same sort of work that a Decision Tree algorithm does, except Decision Trees usually search for binary outcomes and use different “splitting” criteria. I am not sure if they can use the Standard Deviation, or if they can use another measure which would deliver the same results (i.e. identify four possible types of outcomes).
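
The search described above could be sketched roughly as follows. This is not a Decision Tree implementation, just a one-level search under stated assumptions: each household is a dictionary with a "group" key ("control" or "intervention") and an "improved" key (True/False); the four cells are percentages within each group; and the spread is the population Standard Deviation of those four percentages. The attribute names and the toy records are hypothetical.

```python
from statistics import pstdev


def cell_percentages(households):
    """Return [control improved %, control not improved %,
    intervention improved %, intervention not improved %],
    or None if either group is empty."""
    cells = []
    for group in ("control", "intervention"):
        members = [h for h in households if h["group"] == group]
        if not members:
            return None
        improved = sum(1 for h in members if h["improved"])
        pct = 100.0 * improved / len(members)
        cells.extend([pct, 100.0 - pct])
    return cells


def spread(households):
    """Standard Deviation of the four cell percentages (None if undefined)."""
    cells = cell_percentages(households)
    return None if cells is None else pstdev(cells)


def most_extreme_subgroup(households, attributes):
    """Try one attribute at a time and return (spread, attribute, value)
    for the sub-group whose outcome distribution is most extreme."""
    best = None
    for attribute in attributes:
        for value in sorted({h[attribute] for h in households}):
            subgroup = [h for h in households if h[attribute] == value]
            s = spread(subgroup)
            if s is not None and (best is None or s > best[0]):
                best = (s, attribute, value)
    return best


# Hypothetical toy records, for illustration only.
rows = [
    ("control", False, True, True), ("control", False, True, False),
    ("intervention", False, True, True), ("intervention", False, True, False),
    ("control", True, False, True), ("control", False, False, False),
    ("intervention", True, False, True), ("intervention", True, False, False),
]
households = [
    {"group": g, "improved": i, "landless": l, "female_headed": f}
    for g, i, l, f in rows
]
best = most_extreme_subgroup(households, ["landless", "female_headed"])
print(best)
```

For a two-valued attribute, looking at each value in turn covers both halves of the split. A spread of 50 is the maximum possible under these assumptions, and corresponds to one of the four "pure" distributions.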

Wednesday, April 10, 2013

Predicting evaluability: An example application of Decision Tree models

The project: In 2000 ITAD did an Evaluability Assessment of Sida-funded democracy and human rights projects in Latin America and South Africa. The results are available here: Vol.1 and Vol.2. It is a thorough and detailed report.

The data: Of interest to me were two tables of data, showing how each of the 28 projects was rated on 13 different evaluability assessment criteria. The use of each of these criteria is explained in detail in the project specific assessments in the second volume of the report.

Here are the two tables. The rows list the evaluability criteria and the columns list the projects that were assessed. The cell values show the scores on each criterion: 1 = best possible, 4 = worst possible. The bottom row summarises the scores for each project, and assumes an equal weighting for each criterion, except for the top three, which were not included in the summary score.


The question of interest: Is it possible to find a small sub-set of these 13 criteria which could act as good predictors of likely evaluability? If so, this could provide a quicker means of assessing where evaluability issues need attention.

The problem: With 13 different criteria there are conceivably 2 to the power of 13 possible combinations of criteria that might be good predictors, i.e. 8,192 possibilities.
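
The arithmetic can be checked in a couple of lines. The second figure is an added illustration, not something from the report: it counts only the "small" sub-sets of one, two or three criteria, where the cut-off of three is an arbitrary assumption.

```python
from math import comb

n_criteria = 13

# Every criterion is either in or out of a sub-set (this count includes the empty set).
all_subsets = 2 ** n_criteria

# Sub-sets containing one, two or three criteria only.
small_subsets = sum(comb(n_criteria, k) for k in range(1, 4))

print(all_subsets)    # 8192
print(small_subsets)  # 377
```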

The response: I amalgamated both tables into one, in an Excel file, and re-calculated the total scores, by including scores for the first three criteria (recoded as Y=1, N=2). I then recoded the aggregate score into a binary outcome measure, where 1 = above average evaluability scores and 2 = below average scores.
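
The recoding was done in Excel, but the same steps can be sketched in code. This is a minimal sketch under two assumptions: since 1 is the best possible score, "above average evaluability" is taken to mean a total below the mean total; and a project exactly on the mean is coded 2. The project names and ratings are hypothetical.

```python
from statistics import mean

YES_NO = {"Y": 1, "N": 2}  # recoding of the first three criteria


def total_score(ratings):
    """Sum a project's ratings, recoding Y/N answers to 1/2."""
    return sum(YES_NO.get(rating, rating) for rating in ratings)


def binary_outcomes(projects):
    """Map each project to 1 (above average evaluability) or 2 (below).

    Assumes lower totals are better, because 1 = best possible score.
    """
    totals = {name: total_score(ratings) for name, ratings in projects.items()}
    average = mean(list(totals.values()))
    return {name: 1 if total < average else 2 for name, total in totals.items()}


# Hypothetical ratings, shortened to six criteria for illustration.
projects = {
    "Project A": ["Y", "Y", "Y", 1, 2, 1],
    "Project B": ["N", "N", "Y", 4, 3, 4],
}
outcomes = binary_outcomes(projects)
print(outcomes)
```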

I then imported this data into RapidMiner, an open source data mining package, and used the Decision Tree module within that package to generate the following Decision Tree model, which I will explain below.


The results: Decision Tree models are read from the root (at the top) to the leaf, following each branch in turn.

This model tells us, in respect to the 28 projects examined, that IF a project scores less than 2.5 (which is good) on "Identifiable outputs" AND it scores less than 3.5 on "project benefits can be attributed to the project intervention alone" THEN there is a 93% probability that the project is reasonably evaluable (i.e. has an above average aggregate score for evaluability in the original data set). It also tells us that 50% of all the cases (projects) meet these two criteria.

Looking down the right side of the tree we see that IF the project scores more than 2.5 (which is not good) on "Identifiable outputs" AND it scores less than 2.5 on "broad ownership of project purpose amongst stakeholders" (even though that is a good score) THEN there is a 100% probability that the project will have low evaluability. It also tells us that 32% of all cases meet these two criteria.
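
The two branches described above can be read as plain IF/THEN rules. The sketch below encodes only those two branches; the rest of the tree is not described in the text, so the function returns None for any other combination. The function and argument names are made up for illustration.

```python
def predict_evaluability(identifiable_outputs, attribution, broad_ownership):
    """Apply the two tree branches described in the text.

    Scores run from 1 (best) to 4 (worst). Returns (label, probability),
    or None where the text does not describe the branch.
    """
    if identifiable_outputs < 2.5:
        if attribution < 3.5:
            return ("reasonably evaluable", 0.93)
        return None  # branch not described in the text
    if broad_ownership < 2.5:
        return ("low evaluability", 1.00)
    return None  # branch not described in the text


print(predict_evaluability(identifiable_outputs=2, attribution=3, broad_ownership=4))
print(predict_evaluability(identifiable_outputs=3, attribution=1, broad_ownership=2))
```

The probabilities are those reported by the model for the 28 projects in this data set, so they describe how well the rules fit these cases rather than how well they would predict new ones.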

Improvements: This model could be improved in two ways. Firstly, the outcome measure, which is an above/below average aggregate score for each project, could be made more demanding, so that only projects in the top 25 percent were rated as having good evaluability. We may want to set a higher standard.

Secondly, the assumption that all criteria are of equal importance, and thus their scores can simply be added up, could be questioned. Different weights could be given to each criterion, according to their perceived causal importance (i.e. the effects they will have). This will not necessarily bias the Decision Tree model towards using those criteria in a predictive model. If all projects were rated highly on a highly weighted criterion, that criterion would have no particular value as a means of discriminating between them, so it would be unlikely to feature in the Decision Tree at all.
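
A small sketch of both points: a weighted total, and a check for criteria that cannot discriminate between projects because every project has the same score on them. The weights, criterion names and scores are hypothetical and chosen only to show the mechanics.

```python
def weighted_total(scores, weights):
    """Sum a project's scores, each multiplied by its criterion weight."""
    return sum(weights[criterion] * score for criterion, score in scores.items())


def non_discriminating(projects):
    """Return the criteria on which every project has the same score."""
    criteria = list(next(iter(projects.values())).keys())
    return [
        criterion for criterion in criteria
        if len({scores[criterion] for scores in projects.values()}) == 1
    ]


# Hypothetical weights and scores, for illustration only.
weights = {"identifiable_outputs": 3, "broad_ownership": 1}
projects = {
    "P1": {"identifiable_outputs": 1, "broad_ownership": 2},
    "P2": {"identifiable_outputs": 1, "broad_ownership": 4},
}
totals = {name: weighted_total(scores, weights) for name, scores in projects.items()}
print(totals)
print(non_discriminating(projects))
```

In this toy case the heavily weighted criterion has the same score for both projects, so it changes the totals by the same amount and gives a tree nothing to split on.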

Weighting and perhaps subsequent re-weighting of criteria may also help reconcile any conflict between what are accurate prediction rules and what seems to make sense as a combination of criteria that will cause high or low evaluability. For example, in the above model it seems odd that a criterion of merit (broad ownership of project purpose) should help us identify projects that have poor evaluability.

Your comments are welcome.

PS: For a pop science account of predictive modelling see Eric Siegel's book on Predictive Analytics