Grounded Theory: A small introduction
Most of the data we analyse in our research projects is textual in nature, ranging from interview transcripts and survey results through wikis and other published material to software code. Textual data, however, is notoriously difficult to analyse. For quantitative data, we have a multitude of techniques and tools that are well known to practitioners; we handle numbers all the time. Yet what should we do with all the textual data? So far, its analytical potential remains largely untapped.
Fortunately, we have seen an interesting development in the research area of analysing texts, often promoted under the headings text mining or text analytics. Both mean roughly the same thing: systematically retrieving and analysing textual data to gain additional insights, for example into a software development project.
In this post, I will summarise the basic concepts of manual text analytics. The content results, in large part, from current work by Stefan Wagner and me on a book chapter (for Analysing Software Data). I will intentionally refrain from introducing techniques for automated text analytics and focus instead on the most frequently cited qualitative text analysis technique, known as Grounded Theory.
Grounded Theory (GT) is the most cited approach for qualitative data analysis; at the same time, it comes with a plethora of different interpretations. In a nutshell, GT describes a qualitative research approach to inductively build a “theory”, i.e. it aims at generating testable knowledge from data rather than testing existing knowledge. To this end, we make use of various empirical methods to generate data, and we structure and classify the information to infer a theory (see also the figure). A theory, in its essence, provides explanations for certain phenomena via a set of interrelated concepts. Rather than the notion of theory used in the natural sciences, we rely here on the notion of a social theory, i.e. a set of falsifiable and testable statements/hypotheses. Like most qualitative research methods, GT has its origins in the social sciences; it was first introduced in 1967 by Glaser and Strauss.
A detailed introduction to the background of Grounded Theory, and its delineation from similar concepts that arose during the evolution of GT, is given by Birks and Mills in their book Grounded Theory: A Practical Guide. For the remainder of this post, I will introduce the core technique, manual coding, and rely on the terms and concepts as introduced in the context of GT.
Manual Coding
Once we have collected textual data for analysis and interpretation, e.g. via interview research, it needs to be structured and classified. This classification is often referred to as coding: we identify patterns in texts with an explanatory or exploratory purpose, serving as a basis for further analysis, interpretation and validation. Coding can be done in two ways: manually or automated. In this post, I introduce coding as a manual process.
Manual coding is a creative process that depends on the experiences, views and interpretations of those who analyse the data to build a hierarchy of codes. During this coding process, we conceptualise textual data via pattern building. We abstract from the textual data, e.g. interview transcripts or commit comments stated in natural language, and build a model that captures the assertions in the form of concepts and relations. Coding thus interprets the data manually: it is a creative process which assigns meaning to statements and events. One could also say that we try to create a big picture out of single dots.
There are various articles and textbooks proposing coding processes and discussing the particularities of related data retrieval methods, such as why and how to build trust between interviewers and interviewees. The least common denominator of these approaches, however, lies in three basic steps of the coding process itself, followed by a validation step:
- Open coding aims at analysing the data by adding codes (representing key characteristics) to small coherent units in the textual data, and categorising the developed concepts in a hierarchy of categories as an abstraction of a set of codes – all repeatedly performed until reaching a “state of saturation”.
- Axial coding aims at defining relationships between the concepts, e.g. “causal conditions” or “consequences”.
- Selective coding aims at inferring a central core category.
- Validation, finally, aims at confirming the developed model with the authors of the original textual data.
Open coding brings initial structure into unstructured text by abstracting from potentially large amounts of textual data and assigning codes to single text units. The result of open coding can range from flat sets of codes to hierarchies of codes. For example, we might want to code the answers given by quality engineers in interviews at one company to build a taxonomy of defects they encounter in requirements specifications. During open coding, we then classify single text units with codes. This might result, for example, in a taxonomy of defect types, such as natural language defects, which could be further refined, e.g. to sentences in passive voice.
During axial coding, we then might want to assign dependencies between the codes in the taxonomies. For example, the quality engineers might have experienced that sentences in passive voice frequently lead to misunderstandings and, later on, to change requests. The axial coding process then might yield a cause-effect chain that shows potential implications of the initially defined defects.
The final selective coding would then bring the results of open and axial coding together to build one holistic model of requirements defects and their potential impacts.
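To make these three steps more tangible, here is a minimal sketch in Python of the artefacts they might produce for the running example; the concrete codes, categories and relations are illustrative placeholders, not results from an actual study.

```python
from dataclasses import dataclass

@dataclass
class Code:
    name: str        # e.g. "passive voice"
    rationale: str   # why this code was assigned

# Open coding: codes grouped into a hierarchy of categories.
taxonomy = {
    "natural language defects": [
        Code("passive voice", "the actor of the sentence remains unclear"),
        Code("ambiguous adverbs", "words like 'usually' leave room for interpretation"),
    ],
}

# Axial coding: directed dependencies between concepts,
# e.g. causal conditions and consequences.
relations = [
    ("passive voice", "causes", "misunderstandings"),
    ("misunderstandings", "lead to", "change requests"),
]

# Selective coding: one core category ties the model together.
core_category = "requirements specification defects"
```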
The idea of (manual) coding – as postulated in Grounded Theory – is to build a model based on textual data, i.e. “grounded” in that data. As the primary goal is to extract information from text, we need to keep the process flexible during both the actual text retrieval and the coding. For example, when conducting interviews, we perform an initial coding of the first transcripts. If we find interesting phenomena whose causes we would like to understand better, we change the questions for subsequent interviews; for instance, an interviewee might state that the low quality of requirements specifications also has to do with low motivation in a team, leading to new questions on what the root causes of that low motivation might be. We thereby follow a concurrent data generation and collection along with an emerging model, which is also steered according to research or business objectives.
During the open coding step, we continuously decompose the data until we find small units to which we can assign codes (“concept assignment”). This step alone shows that the overall process cannot be performed strictly sequentially. During this step, we found it useful:
- to browse the textual data (or samples of it) before coding in order to get a first idea of its content and meaning and, finally, of potential codes we could apply,
- to continuously compare the codes with each other during coding, and especially with newly incoming textual data, and
- to note down the rationale for each code to keep the coding process reproducible (of special importance when relying on independent re-coding by another analyst); a minimal record sketch follows this list.
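As an illustration of the third point, each code assignment could be kept together with its rationale and its coder, e.g. in a record like the following; the field names are our own choice, not a prescribed format.

```python
from dataclasses import dataclass

@dataclass
class CodedUnit:
    text_unit: str   # the small coherent unit taken from the text
    code: str        # the code assigned to this unit
    rationale: str   # why the code fits, kept for reproducibility
    coder: str       # who assigned it, relevant for independent re-coding

example = CodedUnit(
    text_unit="The system shall usually respond fast.",
    code="ambiguous adverbs",
    rationale="'usually' and 'fast' are not testable",
    coder="analyst A",
)
```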
Having a set of codes, we allocate them to categories as a means of abstraction. For instance, when analysing phenomena in a software development process, we might have various codes for phenomena in the change management process; thinking in terms of processes might then lead us to choose “change management” as the category. During axial coding, we then assign directed dependencies between the codes. Finally, the last step in the coding process is the identification of the core category, which can often also be predefined by the overall objective of the text analysis (e.g. “requirements specification defects”).
The overall coding process is performed until we reach theoretical saturation, i.e. the point where no new codes (or categories) are identified and the results are convincing to all participating analysts.
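Theoretical saturation is a judgement call rather than a formula, but the stopping rule behind it can be sketched as follows; the `patience` threshold is our illustrative heuristic, and `code_unit` stands for the manual, creative step of an analyst assigning codes.

```python
def code_until_saturation(units, code_unit, patience=10):
    """Code text units until `patience` consecutive units
    yield no new codes, i.e. tentative saturation is reached.
    `code_unit` is the manual step: unit -> set of codes."""
    codebook = set()
    unproductive = 0
    for unit in units:
        new_codes = code_unit(unit) - codebook
        if new_codes:
            codebook |= new_codes
            unproductive = 0
        else:
            unproductive += 1
            if unproductive >= patience:
                break  # no new codes for a while: stop coding
    return codebook
```

Whether the resulting codes are also convincing to all participating analysts remains, of course, a human decision.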
Challenges
The introduced coding process is subject to various challenges, of which we consider the following three to be the most frequent.
Coding as a creative process. Coding is always a creative process. When analysing textual data, we decompose it into small coherent units and assign codes to them. In this step, we have to find codes that reflect the intended meaning of the data while choosing an appropriate level of detail for them. This alone shows the subjectivity inherent to coding, which demands a validation of the results. Yet, we apply coding with an exploratory or explanatory purpose rather than a confirmatory one, which means that the validation of the resulting model is usually left to subsequent investigations. This, however, does not justify treating the model we define as a free invention. A means to increase the robustness of the model is analyst triangulation, where coding is performed by a group of individuals or where the coding results (or a sample) of one coder are independently reproduced by other coders as a means of internal validation. This increases the probability that the codes reflect the actual meaning of the textual units. We still need, where possible, to validate the resulting theory with the authors of the textual data or the interviewees represented by the transcripts.
Coding alone or coding in teams? This challenge concerns the validity of the codes themselves. As stated, coding (and the interpretation of codes) is a subjective process that depends on the experiences, expectations and beliefs of the coder who interprets the textual data. To a certain extent, the results of the coding process can be validated (see also the next paragraph). Given that this is not always possible, however, we recommend, again, analyst triangulation as a means to minimise the degree of subjectivity.
Validating the results. We can distinguish between internal validation, where we form, for example, teams of coders (the above-mentioned analyst triangulation) to minimise threats to internal validity, and external validation. The latter aims at validating the resulting theory with further interview participants or people otherwise responsible for the textual data we interpret. This, however, is often not possible, for example when coding results from an anonymous survey. In those cases, the only mitigation is to pay close attention to the internal validation.
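Internal validation can also be quantified. One common measure, though not specific to Grounded Theory, is Cohen's kappa, which corrects the raw agreement between two coders for chance; here is a minimal sketch, assuming each coder assigned exactly one code per text unit.

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Chance-corrected agreement between two coders who coded
    the same text units (one code per unit, same order)."""
    n = len(codes_a)
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Two coders, five text units:
print(cohens_kappa(["A", "B", "A", "A", "C"],
                   ["A", "B", "B", "A", "C"]))  # ~0.69
```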
Example
We conducted an industrial survey study in 2013 as a collaboration between TU München and the University of Stuttgart. We had been working with industrial partners on requirements engineering (RE) for several years and had a subjective understanding of typical problems in this area. Yet, we often stumbled over the fact that there was no general and systematic investigation of the state of the practice and the contemporary problems of performing requirements engineering. Therefore, we developed a study design and questionnaire, called Naming the Pain in Requirements Engineering (NaPiRE), to tackle this challenge. While you are not likely to perform the same study, the way we analysed the free-text answers to our open questions is applicable to any kind of survey. You can find more information on the complete survey on the website: http://www.re-survey.org/.
For analysing the free-text answers in our survey, we followed the manual coding procedure introduced above. In contrast to the idealised way of coding, however, we already had a predefined set of codes (the given RE problems) for which we wanted to know how the participants see their implications. For this reason, we had to deviate from the standard procedure and rely on a mix of bottom-up and top-down coding. We started with selective coding and built the core category with two sub-categories: RE problems, with a set of codes each representing one RE problem, and Implications, which groups the codes defined for the answers given by the participants. For the second category, we conducted open and axial coding on the answers until reaching saturation for a hierarchy of (sub-)categories, codes and relationships.
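The resulting mixed structure can be sketched as follows; the concrete problem and implication codes below are illustrative placeholders, not the actual NaPiRE codes.

```python
core_category = "RE problems and their implications"

# Top-down: predefined codes, one per given RE problem.
re_problems = ["incomplete requirements", "communication flaws"]

# Bottom-up: codes for implications, emerging from open coding
# of the free-text answers, grouped into (sub-)categories.
implications = {
    "process": ["additional rework", "delayed milestones"],
    "product": ["defective features"],
}

# Axial coding: relations from problems to implications.
relations = [
    ("incomplete requirements", "leads to", "additional rework"),
    ("communication flaws", "leads to", "defective features"),
]
```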
During the coding process, we had to tackle several challenges. One was the lack of appropriate tool support for manual coding, especially when working in distributed environments; another was the impossibility of validating the results by getting feedback from the respondents. The following figure sketches the procedure we followed during manual coding in our NaPiRE studies. What you can also see is the wonderful tool support we currently have at our disposal (spreadsheets and post-its, yes).
Because of the latter challenge, we relied on researcher triangulation during the open coding step, as this was the step that depended most on subjective interpretation of the answers to the open questions. During open coding, we first decomposed the data in spreadsheets and then worked with paper cards on which we also noted the rationale for selected codes. In a third step, we arranged the cards into categories on a whiteboard. As a validation step, a third researcher then independently repeated the open coding process on a sample.