Monday, October 24, 2011

Evaluation quality standards: Theories in need of testing?


Since the beginning of this year I have been part of a DFID-funded exercise which has the aim of “Developing a broader range of rigorous designs and methods for impact evaluations”. Part of the brief has been to develop draft quality standards, to help identify “the difference between appropriate, high quality use of the approach and inappropriate/poor quality use”.

A quick search of what already exists suggests that there is no shortage of quality standards. Those relevant to development projects have been listed online here. They include:
  • Standards agreed by multiple organisations, e.g. OECD-DAC and various national evaluation societies. The former are of interest to aid organisations, whereas the latter are of more interest to evaluators.
  • Standards developed for use within individual organisations, e.g. DFID and EuropeAID
  • Methodology specific standards, e.g. those relating to randomised and other kinds of experimental methods, and qualitative research
In addition, there is a much larger body of academic literature on the use and misuse of various more specific methods.

A scan of the criteria I have listed shows that a variety of types of evaluation criteria are used, including:
  • Process criteria, where the focus is on how evaluations are done, e.g. relevance, timeliness, accessibility, inclusiveness
  • Normative criteria, where the focus is on principles of behaviour, e.g. independence, impartiality, ethicality
  • Technical criteria, where the focus is on attributes of the methods used, e.g. reliability and validity
Somewhat surprisingly, technical criteria like reliability and validity are in the minority, being two of at least 20 OECD-DAC criteria. The more encompassing topic of Evaluation Design is only one of the 17 main topics in the DFID Quality Assurance template for reviewing draft evaluations. There are three possible reasons why this is so: (a) process attributes may be more important, in terms of their effects on what happens to an evaluation, during and after its production; (b) it is hard to identify generic quality criteria for a diversity of evaluation methodologies; (c) lists have no size limits, so many other criteria can sit alongside the technical ones. For example, the DFID QA template has 85 subsidiary questions under its 17 main topics.

Given these circumstances, what is the best way forward for addressing the need for quality standards for “a broader range of rigorous designs and methods for impact evaluations”? The first step might be to develop specific guidance, packaged in separate notes on particular evaluation designs and methods. The primary problem may be a simple lack of knowledge about the methods available; knowing how to choose between them may in fact be “a problem we would like to have”, one that needs to be addressed only after people know at least something about the alternative methods. The Asian Development Bank has addressed this issue through its “Knowledge Solutions” series of publications.

The second step would be to develop more generic guidance that can be incorporated into the existing quality standards. Our initial proposal focused on developing some additional design-focused quality standards that could be used with some reliability across different users. But perhaps this is a side issue. Finding out which quality criteria really matter may be more important. However, there seems to be very little evidence on which quality attributes matter. In 2008 Forss et al. carried out a study, “Are Sida Evaluations Good Enough? An Assessment of 34 Evaluation Reports”. The authors gathered and analysed empirical data on 40 different quality attributes of evaluation reports published between 2003 and 2005. Despite suggestions to do so, the report was not required to examine the relationship between these attributes and the subsequent use of the evaluations. Yet the insufficient use of evaluations has been a long-standing concern to evaluators and to those funding evaluations.

There are at least four different hypotheses that would be worth testing in any future version of the Sida study that did look at the relationship between evaluation quality and usage:
  1. Quality is largely irrelevant; what matters is how the evaluation results are communicated.
  2. Quality matters, especially the use of a rigorous methodology that is able to address attribution issues.
  3. Quality matters, especially the use of participatory processes that engage stakeholders.
  4. Quality matters, but it is a multi-dimensional issue. The more dimensions are addressed, the more likely it is that the evaluation results will be used.
The first is in effect the null hypothesis, and one which needs to be taken seriously. The second hypothesis seems to be the position taken by 3ie and other advocates of RCTs and their next-best substitutes. It could be described as the killer assumption being made by RCT advocates that is yet to be tested. The third could be the position of some of the supporters of the “Big Push Back” against inappropriate demands for performance reporting. The fourth is the view present in the OECD-DAC evaluation standards, which can be read as a narrative theory of change about how a complex of evaluation quality features will lead to evaluation use and strengthened accountability, and contribute to learning and improved development outcomes. I have taken the liberty of identifying the various possible causal connections in that theory of change in the network diagram below. As noted above, one interesting feature is that the attributes of reliability and validity are only one part of a much bigger picture.


[Network diagram: possible causal connections in the OECD-DAC evaluation quality standards, read as a theory of change. Click on the image to view a larger version.]
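To make the fourth hypothesis a little more concrete, here is a minimal sketch, in Python and with entirely invented data, of how a future Sida-type study might test whether the number of quality dimensions an evaluation satisfies predicts whether its findings are later used. The variable names, the ten notional quality dimensions and the logistic regression are my own illustration, not anything proposed by Forss et al., DFID or the OECD-DAC.

# Illustrative sketch only: does the number of quality dimensions met predict use?
# All data below are invented; "quality_dimensions_met" and "used" are hypothetical variables.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(seed=1)

n_reports = 34                                           # e.g. one batch of evaluation reports
quality_dimensions_met = rng.integers(0, 11, n_reports)  # notional score out of 10 dimensions

# Simulate a weak positive relationship between quality score and reported use, plus noise
latent = -1.0 + 0.3 * quality_dimensions_met + rng.normal(0.0, 1.0, n_reports)
used = (latent > 0).astype(int)                          # 1 = findings reportedly used, 0 = not

# Hypothesis 4: the more dimensions addressed, the higher the probability of use
X = sm.add_constant(quality_dimensions_met)
model = sm.Logit(used, X).fit(disp=False)
print(model.summary())                                   # inspect the sign and size of the slope

In a real study the quality scores would of course come from something like the Forss et al. assessment of report attributes, and “use” would need its own careful definition and measurement.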

While we wait for the evidence…

We should consider transparency as a pre-eminent quality criterion, one that is applicable across all types of evaluation designs. It is a meta-quality, enabling judgments about other qualities. It also addresses the issue of robustness, which was of concern to DFID. The more explicit and articulated an evaluation design is, the more vulnerable it will be to criticism and the identification of error. Robust designs will be those that can survive this process. This view connects to wider ideas in the philosophy of science about the importance of falsifiability as a quality of scientific theories (promoted by Popper and others).

Transparency might be expected at both a macro and micro level. At the macro level, we might ask these types of quality assurance questions:
  • Before the evaluation: Has an evaluation plan been lodged, and does it include the hypotheses to be tested? Doing so will help reduce selective reporting and opportunistic data mining.
  • After the evaluation: Is the evaluation report available? Is the raw data available for re-analysis using the same or different methods?
Substantial progress is now being made with the availability of evaluation reports. Some bilateral agencies are considering the use of evaluation/trial registries, which are increasingly commonplace in some fields of research. However, the availability of raw data seems likely to remain the most challenging requirement for many evaluators.

At the micro level, more transparency could be expected in the particular contents of evaluation plans and reports. The DFID Quality Assurance templates seem to be the most operationalised set of evaluation quality standards available at present. The following types of questions could be considered for inclusion in those templates:
  • Is it clear how specific features of the project/program influenced the evaluation design?
  • Have rejected evaluation design choices been explained?
  • Have terms like impact been clearly defined?
  • What kinds of impact were examined?
  • Where attribution is claimed, is there also a plausible explanation of the causal processes at work?
  • Have distinctions been made between causes which are necessary, sufficient or neither (but still contributory)?
  • Are there assessments of what would have happened without the intervention?
This approach seems to have some support in other spheres of evaluation work not associated with development aid: “The transparency, or clarity, in the reporting of individual studies is key” (TREND statement, 2004).
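Purely as an illustration of what “operationalised” could mean in practice, the micro-level questions above could be recorded as a structured checklist, with an explicit answer and comment for each transparency item rather than a single overall judgement. The sketch below is in Python; the class names, item wording and scoring rule are my own invention and do not correspond to any existing DFID template.

# Illustrative sketch only: a structured transparency checklist for one evaluation report.
# The item wording paraphrases the questions above; it is not an official DFID QA template.
from dataclasses import dataclass, field

@dataclass
class ChecklistItem:
    question: str       # the transparency question being asked
    answered_yes: bool  # reviewer's judgement
    comment: str = ""   # optional explanation

@dataclass
class TransparencyChecklist:
    report_title: str
    items: list[ChecklistItem] = field(default_factory=list)

    def score(self) -> float:
        """Share of transparency questions answered 'yes'."""
        return sum(i.answered_yes for i in self.items) / len(self.items) if self.items else 0.0

checklist = TransparencyChecklist(
    report_title="Example impact evaluation",
    items=[
        ChecklistItem("Is it clear how features of the project influenced the evaluation design?", True),
        ChecklistItem("Have rejected evaluation design choices been explained?", False, "No alternatives discussed"),
        ChecklistItem("Are terms like 'impact' clearly defined?", True),
        ChecklistItem("Where attribution is claimed, is the causal process plausibly explained?", True),
        ChecklistItem("Is there an assessment of what would have happened without the intervention?", False),
    ],
)
print(f"Transparency items met: {checklist.score():.0%}")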

In summary, three main recommendations have been made above:
  • Develop technical guidance notes, separate from additional quality criteria
  • Identify specific areas where transparency of evaluation designs and methods is essential, for possible inclusion in DFID QA templates, and the like
  • Seek and use opportunities to test out the relevance of different evaluation criteria, in terms of their effects on evaluation use
PS: This text was the basis of one of the presentations to DFID staff (and others) at a workshop on 7th October 2011 on the subject of “Developing a broader range of rigorous designs and methods for impact evaluations”. The views expressed above are my own and should not be taken to reflect the views of either DFID or others involved in the exercise.