Keppel Health Review


Missing data: who cares?

Across medical fields and study designs, missing data points are one of the most consistent features of quantitative healthcare research. It is tempting to view missing data as a minor inconvenience, a statistical oddity, or a technical concern, when in reality it can invalidate research findings and exacerbate existing inequalities.


Missing data refers to observations (e.g. blood pressure reading, number of hospital admissions, or yearly income) that should be recorded for individuals in a scientific study but, for a plethora of reasons, are not.


The causes of omissions are multiple and varied. Among other factors, the prevalence of missing data depends on the nature of the study and of the data being gathered. For example, it has been found that clinical trials in the field of palliative care have especially high levels of missing data: 23.1% of outcome observations were missing, according to a 2016 review. This can be explained by the fact that palliative patients are likely to be too sick to engage with a research study to completion. Indeed, the occurrence of missing data becomes more common when studies are required to record more information (and therefore involve a higher level of commitment) for each individual. Elsewhere, missing data might be a result of a study addressing a particularly sensitive topic—study participants are often reluctant to disclose information on sexual behaviour, for instance. In this case, the reasons for not providing information are embedded in social, political, cultural, and religious factors.

Missing data can invalidate study results

Statistical analyses use information from a group of study participants to draw conclusions about a broader population, via a set of predetermined assumptions. When a dataset is incomplete, the default assumption is often that the missing data points are randomly spread across study participants, or are too few in number to meaningfully affect the results. These assumptions are problematic, not only because they do not always hold, but because they are made implicitly, without considering what the true nature and underlying cause of the omissions might be.
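The "randomly spread" assumption is what statisticians call missing completely at random (MCAR), and its appeal is easy to see in a toy simulation. The sketch below (plain Python; the blood-pressure numbers are invented for illustration) loses 30% of readings purely by chance and shows that the estimated mean barely moves:

```python
import random
import statistics

random.seed(1)

# Hypothetical population: systolic blood pressure readings (mmHg),
# drawn around a true mean of 130 (all numbers are invented).
full_sample = [random.gauss(130, 15) for _ in range(10_000)]

# Missing completely at random (MCAR): every reading has the same
# 30% chance of being lost, regardless of its value or whose it is.
observed = [x for x in full_sample if random.random() > 0.30]

# Both means sit close to the true value of 130: MCAR shrinks the
# sample but does not systematically shift the estimate.
print(round(statistics.mean(full_sample), 1))
print(round(statistics.mean(observed), 1))
```

The trouble, as the article goes on to argue, is that real omissions rarely behave this obligingly: when the chance of being missing depends on who the participant is or what the value would have been, the surviving data are no longer a fair sample.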

Biases, that is, systematic differences between a quantity calculated from study data and its true value in the broader population, can arise when missing data points are more common among some groups than others. Consider a hypothetical study investigating the link between cannabis use at age 15 and mental health problems at age 21. In such a study, concerns about confidentiality might make participants hesitant to provide information about cannabis use, and it is plausible that this hesitancy would be greater among certain subgroups, introducing bias into the results. Those experiencing mental illness, for instance, might be more likely to provide complete information than those who are not, as they feel less need to hide information from the observers. These differences in response have the potential to distort or exaggerate the link between early cannabis use and mental health later in life.
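This mechanism can be made concrete with a small simulation. The sketch below (plain Python; every probability is invented for illustration) builds a cohort in which cannabis users without mental illness are the most reluctant to disclose their use, then compares the risk ratio computed from the full data with the one a complete-case analysis would report:

```python
import random

random.seed(42)

# Hypothetical cohort: cannabis use at 15 doubles the risk of mental
# illness at 21, from 10% to 20% (all probabilities are invented).
records = []
for _ in range(50_000):
    cannabis = random.random() < 0.30
    illness = random.random() < (0.20 if cannabis else 0.10)
    # Non-random missingness: users *without* mental illness are the
    # most reluctant to disclose their cannabis use.
    if illness:
        disclosed = random.random() < 0.90
    else:
        disclosed = random.random() < (0.40 if cannabis else 0.80)
    records.append((cannabis, illness, disclosed))

def risk_ratio(rows):
    """P(illness | cannabis user) / P(illness | non-user)."""
    users = [ill for can, ill, _ in rows if can]
    nonusers = [ill for can, ill, _ in rows if not can]
    return (sum(users) / len(users)) / (sum(nonusers) / len(nonusers))

true_rr = risk_ratio(records)                      # close to 2.0
cca_rr = risk_ratio([r for r in records if r[2]])  # noticeably larger

print(round(true_rr, 2), round(cca_rr, 2))
```

Under these made-up disclosure patterns, the complete-case estimate exaggerates the true risk ratio purely because of who chose to answer; a different disclosure pattern could just as easily dilute the association instead.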

This is a simple illustration of how missing data can bias results and reduce the validity of research findings. In reality, the process can be further complicated by incomplete data across multiple categories, and by other factors that bear on the validity of the outcome. What if willingness to disclose information is also linked to gender or socioeconomic status? Could the number of omissions be significantly affected by a factor not included in the dataset, or one that is impossible to measure?

A variety of statistical methods exist to minimise the bias that missing data introduce into a dataset. Choosing the appropriate method requires an understanding of both its statistical underpinning and the clinical context within which the data were gathered. As it stands, most studies in the medical literature simply discard individuals with missing data points and analyse the remainder, an approach known as complete case analysis (CCA). This is convenient, and it is the default option in most commonly used statistical software. It is also important to recognise that, in certain contexts, it is a statistically sound way to handle missing data points. However, inappropriate use is widespread, and in many circumstances more sophisticated approaches, such as inverse probability weighting, direct likelihood, or multiple imputation, would produce more accurately representative results than CCA.
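As a rough illustration of why the choice of method matters, the sketch below (plain Python; all numbers are invented) constructs a dataset in which sicker patients have both higher outcome values and a higher chance of the outcome going unrecorded. CCA then underestimates the population mean, while inverse probability weighting, which weights each complete case by the inverse of its chance of being observed, recovers it. Note that the observation probabilities are known by construction here; in a real study they would have to be estimated, for example with a logistic regression on the observed covariates:

```python
import random
import statistics

random.seed(7)

# Hypothetical records: sicker patients have higher outcome values
# and are more likely to have the outcome missing (all invented).
rows = []
for _ in range(20_000):
    severe = random.random() < 0.50            # observed covariate
    outcome = random.gauss(20 if severe else 10, 3)
    recorded = random.random() < (0.40 if severe else 0.90)
    rows.append((severe, outcome, recorded))

true_mean = statistics.mean(y for _, y, _ in rows)          # about 15

# Complete case analysis: simply drop rows with a missing outcome.
# Severe cases vanish disproportionately, dragging the mean down.
cca_mean = statistics.mean(y for _, y, rec in rows if rec)

# Inverse probability weighting: each complete case also stands in
# for the similar cases that went missing.
def weight(severe):
    return 1 / (0.40 if severe else 0.90)

num = sum(y * weight(s) for s, y, rec in rows if rec)
den = sum(weight(s) for s, y, rec in rows if rec)
ipw_mean = num / den                                        # about 15

print(round(true_mean, 1), round(cca_mean, 1), round(ipw_mean, 1))
```

The weighting step only works because missingness here depends on a covariate that was recorded; when omissions depend on the missing value itself in ways no observed variable captures, no reweighting or imputation scheme can fully undo the bias.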

Ideally, missing data should be tackled at the source, with efforts made to prevent it from occurring altogether. Recommendations include designing data collection with this in mind: for example, by reducing the amount of information each participant needs to give, or by phrasing sensitive questions with attention to societal context. More generally, the burden of study participation, such as the time commitment involved, can be reduced.

Missing data in electronic health records

Electronic health records hold massive potential for healthcare research. They contain the data generated in the daily running of the health service, such as measurements taken when an individual attends hospital for treatment. Their power lies in providing an inexpensive way to study large numbers of people using already-recorded data, bypassing the need to design and implement data collection methods for a study. However, because the data in these records are collected primarily for clinical purposes rather than for research, missing data are often a significant problem. Data quality can also suffer from difficulties in linking and maintaining datasets within and between organisations and localities.


The use of electronic health records has highlighted another critical consideration: missing values are often more prevalent in the records of individuals who are already systematically marginalised within healthcare and research. These are also the individuals whose health outcomes are compromised by structural inequities, and for whom evidence-based solutions are often most urgently required.

The COVID-19 pandemic produced many examples of the power of electronic health records in providing evidence-based responses to public health crises, while also highlighting data quality as an important concern. As more attention was paid to inequalities in health outcomes by ethnicity, it became clear that ethnicity had not been recorded in a consistent and equitable way in healthcare data. A 2021 report by the Nuffield Trust found that one-third of prisoners' admitted patient care activity had missing ethnicity data, compared with 13% for the general population. These findings are vitally important: if we are to truly understand the intersecting nature of healthcare inequalities, we need more complete data on which to base our research.

The statistical methods employed to reduce bias in observational studies and clinical trials have been adapted for use with electronic health records. However, preventing the occurrence of missing data becomes more challenging for these large and complex data sets. Ensuring data completeness involves substantial improvements in data infrastructure and widespread changes in the processes through which the data are gathered. Positive results have been achieved in the United Kingdom (UK) by introducing incentives for the collection of general practitioner data, which led to reductions in the levels of missing ethnicity data in these records. There is also hope that the COVID-19 pandemic, although detrimental in the short term, may have exposed the systemic issues which arise with and from data incompleteness and that require our urgent attention.

Missing data reveal structural inequities

The issue of missing data is a clear example of how quantitative science is fundamentally intertwined with the societal landscape it operates within. Not only are unequal patterns of missing data symptomatic of structural inequities in our healthcare systems, but they also contribute to minimising the experiences of marginalised members of society.

The problem is deep-rooted and daunting in its complexity, but it is also an area where researchers can take tangible steps for change. In discussions around this topic, collaboration and communication emerge as proposed solutions. In practice, this means greater awareness of the importance of missing data among both those who collect and those who analyse it. It also means improving engagement with individuals providing data to better support them in research participation. Furthermore, at all stages of a scientific study, there should be communication between clinicians and analysts so that assumptions about a dataset—relating to both statistical underpinning and clinical context—can be clearly articulated. It may never be possible to eradicate missing data altogether, but with better collaboration, its prevalence can be reduced and its handling improved.