Warning: long post, deep weeds.
Last week saw some really interesting thinking in the development economics blogosphere, focused on design questions for external validity (the applicability of case-specific findings to other cases). This is a central question for research on civic tech and accountability programming, which talks a lot about wanting an evidence base, but remains dominated by case studies, enthusiasm and a handful of amateur researchers. We see this clearly a couple of times a year (at TicTech, the Open Data Research Forum), where the community gathers to talk about evidence, share novel case studies, acknowledge that we can’t generalize from case studies, and then talk some more as if we can.
If we want to develop the kinds of general heuristics and rules of thumb that would be useful for the people who actually design and prioritize programming modalities, the kind that would make it possible to learn across country contexts, then we have to be smarter about how we design our research and conceptualize our evidence base. There’s a lot to learn from development economics in that regard. Development studies is like pubescent civic tech and accountability’s older uncle, who used to be cool, but still knows how to get shit done. In particular, there was a lot to learn from last week’s discussions about generalization and validity.
Most prominently, researchers from @JPAL_Global wrote in SSIR about what they call “The Generalizability Puzzle” of anticipating whether evidence on programming in one context is applicable in another. The authors rightly point out that the common perception of having to choose between local evidence and strong evidence from other contexts is a false dichotomy. To move past that dichotomy, they emphasize focusing on general mechanisms of human behaviour rather than policy and program details. This makes a lot of sense if done thoughtfully, and they propose 4 steps by which to do so.
Step 1: What is the disaggregated theory behind the program?
Step 2: Do the local conditions hold for that theory to apply?
Step 3: How strong is the evidence for the required general behavioral change?
Step 4: What is the evidence that the implementation process can be carried out well?
The authors illustrate these steps with a couple of cases, and Dave Evans test drives it with one of his own on the Development Impact blog. Both Evans and the J-Pal researchers emphasize thoughtful assessment and composition of evidence over formulaic approaches to generalization, which is also the main takeaway from Lant Pritchett’s geeky screed against RCT enthusiasm on the @CGDev blog (more below). These calls for careful consideration when extrapolating case-based evidence reinforce the hard fact that external validity across country contexts is a pipe dream. But they also remind us that strict external validity isn’t really what we’re after.
What’s it mean for civic tech and accountability evidence?
External validity is complicated. That might be the most relevant takeaway for civic tech’s sprawling portfolio of case studies. In fact it’s worse than that, because the host of case studies produced by the civic tech and accountability research space over the last year are almost exclusively qualitative and vary dramatically in rigor. Generalizing from rigorous quantitative analysis in specific contexts is hard enough (arguably impossible), but it’s profoundly difficult with a pile of narrative apples and oranges. We recognize this to some degree, and talk a lot about how context is king, but would benefit from taking a more thoughtful look at generalization.
It’s true that you can’t generalize from qualitative data in the same way as quant data, simply because you don’t have a sample that’s representative in the quant sense, full stop. But you can generalize theoretical constructs from qualitative case studies–in fact, that’s kind of the whole point. In many ways, this makes external validity tests more challenging for qualitative than quantitative work, because the tests aren’t as universally explicit or widely recognized. It’s easy to fall into the habit of talking about cases as if they represent or “are like” some larger group of cases–that’s where our language naturally wants to take us. But that is a mode of statistical analysis that simply doesn’t apply to the kind of qualitative case study civic tech lives in. What does apply is analytical generalization, in which lessons and findings are based not on surface-level similarities between cases, but on underlying structural dynamics, and can thus be generalized across dissimilar cases.
There’s a whole host of scholarship discussing this (see Yin, 2005; George & Bennett, 2005; Mitchell, 1983; Burawoy, 1998), and much of it complains about qualitative case-based scholarship that tends to be little more than sloppy storytelling. Again, this is partly because rigorous methods and case study design are harder in qual work than in quant, because the same simple and universally applicable procedures simply don’t exist. But that’s all the more reason for the civic tech and accountability space to take these issues seriously (after a solid decade of funding sloppy case studies and playing fast and loose with “evidence”). We need to think carefully about addressing questions of external validity in both research and programming, and what it means for our evidence base.
Though the development economics debates are primarily focused on quant analysis, there are some overarching methodological takeaways we’d do well to consider. Looking at the posts referenced above, we can start with three interlinked strategies with which to do so: the importance of combining different types of evidence, reliance on theoretical models, and thoughtful research design.
Firstly, the approaches linked above all emphasize the careful combination of different types of evidence to compose a strong evidence base. The objective isn’t triangulation for triangulation’s sake, as with mixed methods approaches (though it requires just as much thoughtful design). It’s a little more desperate, and has to do with identifying the evidence – any evidence – that speaks to the underlying social mechanisms. In the civic tech and accountability space this means both critical appraisals of evidence quality, and casting a very wide net.
Casual storytelling and loose narratives from the field can provide useful insights in composing and prioritizing components of an evidence base, both for policy and research design. But we’re likely to benefit just as much from identifying evidence outside of familiar research circles. If we want to understand how the visibility of citizen feedback mechanisms will motivate local government accountability in Honduras, we’ll want to look at the comparable research in other studies (the “teeth” of citizen voice according to Peixoto & Fox). We’ll also want to look at other research on Honduran municipal government incentive structures that will tell us something about how local government actors react to public pressure. We might find that in public administration studies or research in a dedicated policy area. There might also be useful information in NGO cases and reports, or we might need to look at other countries. The bottom line here is that the net should be wide, comparison should be critical and judicious, and that’s a lot more work than we’re accustomed to.
Secondly, civic tech and accountability has a bad habit of ignoring theory. I’m not talking about theory for theory’s sake (sorry STS studies). This is about building a functional set of assumptions regarding the underlying social mechanisms that can directly inform study design and the articulation of learning. As the J-Pal researchers put it,
[This] underscores the importance of drawing connections between seemingly dissimilar studies in a way that a good literature review does. These academic reviews that discuss the common mechanisms behind effective programs are useful for policy makers precisely because they home in on the underlying behaviors that generalize across superficially different contexts. This is very different from the growing fashion in some policy circles of promoting meta-analyses, which are traditionally used in medicine and simply average the effects found across different studies.
This also ties back directly to the question of combining data types. For Pritchett, ignoring theory is one of the main external validity problems encountered by RCT enthusiasm.
If the essence of science was doing experiments, there would be a Nobel Prize for alchemy. The essence of science is theory because only theory provides the framework within which individual empirical results can be evaluated and aggregated. That doing X had impact ΔY in a given context is, in the absence of theory, zero rigorous evidence about the likely impact of doing X in any other context. For that matter, it is only theory that can tell us what “context” even means.
This is important for the civic tech and accountability field, both in terms of existing quantitative evidence, and for the qualitative work in the pipeline. We need sound theoretical frameworks about underlying social mechanisms, drawn from broad literature bases. These are hard to establish because they require both a familiarity with literature that people embedded in the field are hard pressed to develop, and a familiarity with the field that is impossible for most academics. Developing them will require a concerted effort (ping, donors); without it, we’ll continue to confuse evidentiary forests for trees.
Smart evidence composition and theory building are both deeply tangled up with larger research design issues. Pritchett’s gripe makes these entanglements very clear. And though his argument is nuanced and a little confusing, it’s worth sketching out:
Pritchett is focused on a specific type of RCT, which aims to explain the effects of causes (what he calls “x-centric” research, because it is designed around the independent, x variable), and the hidden limitations and analytical dangers that accompany such designs. X-centric designs aim to determine what happens to an outcome of interest if input variables are tweaked, rather than attempting to explain the causes behind outcomes of interest, and Pritchett describes their susceptibility to analytical and theoretical failures at length.
In the context of civic tech and accountability, Pritchett is warning about all the methodological and analytical problems we can get into if we spend our energies opportunistically testing the inputs we can control (how does providing citizens with different types of information on service provision impact the ways in which they relate to service providers) rather than exploring what has influenced outcomes (why do citizens demand accountability of service providers). The difference is subtle, but critical when building an evidence base, and has everything to do with the interaction of research design and a theoretical base.
The beauty of an x-centric approach is that a researcher can do empirical work in the complete absence of a model or theory of Y. All I need to do is randomize units into “treatment” and “control,” and do X to the treatments and not the controls, and I can trace out the (mean) impulse response function on Y of doing X by comparing the paths of the treatment and control. Any researcher can then make (seemingly) rigorous statements of the type “In the following context and background conditions, my research did X and the (average) impact on Y of doing X was ΔY.”
The danger of the x-centric approach is that a researcher can do empirical work in the complete absence of a model or theory of Y. This can be sold as an advantage and the search for “causes of effects” dismissed as irrelevant, or worse (e.g., Gelman and Imbens 2013). But without a model of Y, x-centric research can easily become eccentric, in many ways.
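The mechanics Pritchett describes are worth seeing stripped to their bones: randomize units into treatment and control, do X to the treatments, and compare mean outcomes. A minimal sketch in Python (all numbers hypothetical, including the assumed "true effect") shows that the whole computation can be done with zero theory of why X moves Y:

```python
import random

random.seed(0)

# Hypothetical units, each with an unobserved baseline outcome Y
units = [{"y_base": random.gauss(50, 10)} for _ in range(1000)]

# Randomize units into "treatment" and "control"
random.shuffle(units)
treatment, control = units[:500], units[500:]

# "Do X" to the treatment group: assume a true effect of +5 on Y, plus noise
TRUE_EFFECT = 5.0
y_treat = [u["y_base"] + TRUE_EFFECT + random.gauss(0, 2) for u in treatment]
y_ctrl = [u["y_base"] + random.gauss(0, 2) for u in control]

# The estimated (mean) impact of doing X: the difference in group means
delta_y = sum(y_treat) / len(y_treat) - sum(y_ctrl) / len(y_ctrl)
print(f"Estimated mean impact of X on Y: {delta_y:.2f}")
```

Note that nothing here required a model of Y: the difference in means is a seemingly rigorous statement about this context, and exactly the kind of number that, in Pritchett’s terms, is zero rigorous evidence about any other context.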
One way to think about this dynamic in the civic tech and accountability space is to suggest that research shouldn’t be asking what works, but rather how information mechanisms influence accountability and governance. The difference is subtle, but important. We are in the early days of understanding tech and accountability, still engaged in the magical thinking of 18th century physicians diagnosing humors. We still need to focus on exploratory and explanatory research over research that tests or confirms hypotheses. Designing research that asks how things work rather than whether things work is the first step towards understanding the underlying social mechanisms and developing a theory that’s actually useful to programmers and funders.
The good news is that case studies tend naturally towards asking how things work, especially if they are causally designed. This means that we have a lot of material to work with, and most case study designs not yet implemented can be strengthened without a tremendous amount of effort. But we should address our rhetoric, because we talk an awful lot about producing evidence of what works, as if that were really possible in some kind of general sense. At the very least, that kind of talk raises false expectations among practitioners and funders, and we should avoid it.
Instead, we should think about an evidence base in more general and more nuanced terms, and consider more carefully how it will be used and by whom. We should design research that clearly contributes to explaining and complicating social mechanisms, and we should begin carefully composing the qualitative and quantitative evidence we have towards theoretical models that are actually useful.
These three strategies aren’t very specific, and they’re demanding. But applied carefully and openly, they can help us take the first step in a fairly long process of moving towards evidence-based decision-making and better outcomes. It’s about that gangly teenager of civic tech maturing into something less cool, but decidedly more effective.