Replication studies in software engineering – the good, the bad and the ugly.

Software engineering, in its essence, is very fascinating for one particular reason: it intersects with a plethora of application domains, and we continuously have the chance to make contributions to our field, but also to society. In software engineering, itself an interdisciplinary area where mathematics meets, inter alia, psychology, we develop concepts and new methods whose suitability to solve an envisioned problem often remains unpredictable when they are applied in practice. Software engineering is inherently complex because the socio-economic contexts in which we apply software engineering concepts are themselves inherently complex. We are challenged on a daily basis by trying to understand isolated problems while our research on such problems reveals new problems, and so on, which makes this a perfectly reasonable playground for curiosity-driven minds.

Replication Studies (the good)

Too often, we see researchers proposing (yet another) new method/approach/framework/…, although the value of such a contribution might depend on whether and to what extent it solves a practical problem, in which case reasoning by argument, a formal proof, or a simulation is no longer sufficient to convince other researchers of that value. For those contributions that need empirical evidence, too often the claims made are not sufficiently underpinned with such evidence.

This is not per se a problem, as we always have to start research at some point; and this also includes proposing new ideas in order to discuss them, give feedback, and steer their development and their testing for sensitivity in certain contexts. However, those contributions often remain at the level of solution proposals and find their way through various conferences where we rely on the contributions’ expected value, where we extend proposed concepts with further concepts, and where we continue our philosophical discussions about whether this or that twist in the concepts is better; in the end, too often still without any evidence whatsoever on whether those concepts even work and to what extent. When we then take the value of certain methods for granted, we step into the trap of believing rather than knowing – simplified: we, as researchers, then no longer practice what we preach.

We have to face that many concepts in software engineering reveal their true nature (and value) when applied in a socio-economic context – with people involved, each with their own background, experiences, and expectations. This brings us to the value of evidence-based research. In our context, we would then apply our software engineering method, for example an approach for effort estimation, in a certain software development project, make some observations about its effects and side-effects, evaluate the approach taking into account feedback from project participants, and finally draw some conclusions.

From a researcher’s perspective, experimental research thereby becomes a crucial and challenging task. It is crucial, as experimentation of any kind, ranging from classical action research through observational studies to broad exploratory surveys, is fundamentally necessary to understand the practical needs and improvement goals in software engineering, to steer problem-driven research, and to investigate the value of new methods via validation research. It is challenging, because we still need a solid empirical basis that allows for generalisations taking into account the human factors that influence our research area, which is hard to standardise anyway.

Now, when conducting an industrial case study, we can draw some conclusions and, at some point, maybe infer a (grounded) theory. This gives us some confidence about the value of our method, lets us understand practical needs, and lets us distil some improvement goals for our method.

The problem that remains: we might still lack a solid empirical basis. The reason seems to lie in an ironically paradoxical circumstance: the appropriate design of a study and a descriptive interpretation of the results that goes beyond purely observational, qualitative analyses and reasoning are very challenging, because we might begin our empirical endeavour without empirically grounded theories that would support us in formulating appropriate hypotheses and research questions. In turn, we often lack such theories because we still have no empirically sound data basis, which leaves us with the problem that our results are hard to generalise (or to discuss in relation to existing evidence): they cannot be viewed as representative.

The fact is, however, that software engineering is currently driven by solution work rather than by empirical work. This shortcoming makes it hard to generalise findings in an empirically sound manner and to gain objectivity in our conclusions, which leads us to replication studies. The principle of replication studies is self-explanatory; their rationale is perfectly explained by Karl Popper in his book “The Logic of Scientific Discovery”:

We do not take even our own observations quite seriously, or accept them as scientific observations, until we have repeated and tested them. Only by such repetitions can we convince ourselves that we are not dealing with a mere isolated coincidence, but with events which, on account of their regularity and reproducibility, are in principle intersubjectively testable.

The use of isolated experiments and case studies without replication in different contexts is not sufficient for significant validation research in software engineering, as pointed out by various researchers. The more a method depends on a practical context, the higher the need for replicating experiments, because the involved stakeholders and the sensitive context both have a strong influence on any experimental endeavour in validating a discipline that is hard to standardise anyway. Otherwise, our conclusions remain to a large extent subjective.

The bad

The problem with this view is quite simple. We, as researchers, know (or should know) that we need to occasionally rely on evidence-based research; and to increase our external validity, i.e. the ability to state that the repeated application of a method in another context leads to the same or similar effects, we need to reproduce our studies. This view, commonly accepted in medical research, is often neglected in software engineering. At the same time, we also know (or should know) that designing a study, collecting the data, analysing the data, and finally reporting on our results is a very time-consuming task. This leads to the question of which value a replication has for our own research: should we really invest a year in another case study to strengthen our confidence? A PhD student might understandably answer this question with “Hell, no!”. The effect is often that we rely on an experiment performed with students as part of a tutorial, or on some toy examples that hardly reflect reality.

Replication studies can thus only be a long-term goal, which we often can only reach in an ideal research environment with enough funding and often only as a community effort. This, in turn, leads to the question of what the responsibility of the community really is.

The ugly

Despite the undeniable relevance of replication studies, we have to take into account that they are very time-consuming and often bring us only one small step further in confirming a theory, which makes them unpopular among researchers. The ugly truth, however, remains that replication studies are also often not appreciated by the community. We were very surprised by a review we once received for a replication study on the application of a requirements engineering approach in two industrial contexts, in which the reviewer was “disappointed” that “nothing new was contributed”. The missing appreciation of the effort invested in the study was not the problem – the ignorance towards the value of such studies for the community was.

The importance and the problem of ignorance

I myself might have become (sometimes) ignorant regarding the value of new approaches that are presented without any empirical evidence about their strengths and limitations. I know, however, that we are confronted with the problem that much valuable research is hindered from finding its way into a community simply because its authors have not validated it yet, even though we should give it a chance to be presented at an event and discussed in front of the community, and give the authors the chance to seek collaborations to initiate an experimental phase as part of community work. At the same time, I am also convinced that too often empirical work gets a lukewarm reception when it offers no new methods/approaches/frameworks/… to the field. While the former can be, to some extent, reasonable, the latter has really become a problem.

What is even more worrying is that (maybe it is only an impression) software engineering researchers sometimes seem to belong exclusively to one of two mutually exclusive evangelistic groups, each ignorant towards the other: the conceptual witnesses and the empirical ones. We need to recognise the importance of empirical research and the value of replication studies; we also need to recognise that purely conceptual work is the result of research that often can only be done in a step-wise manner, while empirical evidence cannot be provided right from the beginning. If we are to be ignorant, it should be towards unsupported claims and methodological flaws, not towards contributions we might not like.