Can the data speak for itself? Tackling inequalities and exclusions in statistical research
For years, researchers have unequivocally emphasised the health and social inequalities underpinning many health systems and much medical research. On 17 November 2021, the Centre for Statistical Methodology at the London School of Hygiene and Tropical Medicine (LSHTM) invited four researchers to discuss efforts to identify and eliminate these inequalities. The symposium, Tackling Inequalities and Exclusions in Statistical Research, explored a range of ongoing research efforts, particularly those intended to mitigate hidden biases in data practices. In this article, Teresa reflects on the researchers’ work and on the ideas presented during the event, all aimed at ensuring fair decision-making in healthcare.
Rohini Mathur emphasised that ethnicity data is collected within a wider social context, and is thus shaped by historical, political, and social factors, including racism and discrimination. The influence of the socio-cultural environment on how data is generated and collected undermines the accuracy of ethnicity data itself. For instance, the majority of ethnicity data used in health research is collected from primary care facilities. As such, the data is unlikely to be representative of the whole population, as it omits people who do not use or attend these health services. Moreover, the ethnic categories commonly listed in demographic and diversity surveys are often unsuitable for people who belong to multiple ethnic groups, who may find that the rigid categories do not capture their identities fully. Mathur discussed a case in which a patient selected 'White', 'South Asian', 'Black', and 'Other' when asked to report their ethnicity on different days. Because appropriate and encompassing categories are lacking, current ethnicity statistics will not always accurately reflect the population under study.
Despite these problems, Mathur still believes that using and improving ethnicity data is essential to gain a better understanding of health disparities. Ethnicity data allows researchers to understand how health outcomes in ethnic minority groups may differ from trends observed in better-represented populations. For instance, she and her colleagues studied the differences in COVID-19 infection and related hospitalisation rates across ethnic groups in the United Kingdom (UK). During the first wave, ethnic minority groups had an excess risk of suffering from COVID-19. However, their elevated risk decreased in the subsequent wave. Similar trends have been observed in the United States (US). Mathur and her colleagues further noted that ethnic minority groups may be at higher risk due to a variety of other determinants, such as occupation, household circumstances, and the influence of policies and practices on their health-related behaviours. Mathur concluded that, “… health outcomes are determined by factors associated with ethnicity, not ethnicity itself”. We need to address the wider socioeconomic risk factors that contribute to these disparities, and improved ethnicity data allows us to do so with evidence-based research.
Mhairi Aitken stressed the need to build public trust in research studies. She explained that bias comes in different forms: it can be present in the raw data, but it can also arise during data collection and analysis. For example, unconscious biases can influence how researchers formulate survey questions and carry out analyses. For this reason, it is important that individuals with different perspectives and experiences shape data practices and influence research designs. Public engagement and trust in research are needed to gain these diverse insights and to ensure that researchers do not interpret their findings from a narrow perspective.
Aitken advised that public engagement can be achieved by raising awareness about the importance of a research problem, consulting with the study population through socially sensitive and appropriate approaches, and empowering the population to take part in conducting socially meaningful and impactful research studies. One way to achieve this is for researchers to begin wide-scale conversations with the general public on the uses of data in health research. Including the public in these conversations can resolve potential concerns about data practice and build public trust. Another approach is for researchers and patients to work collaboratively to co-design policies and practices, thereby broadening the process of research analysis by including patient views and experiences.
Aitken concluded that addressing data ethics through public engagement can both limit the negative impacts of research on patients and participants and maximise the potential benefits of research for all members of society. Data collection and statistical analysis are useful tools to address inequalities in society, and ethical considerations are needed to support and monitor this process.
Darshali Vyas expressed the view that clinical decision-making algorithms that include race as a risk factor are unlikely to reflect genetic differences in observed health outcomes; a more likely explanation is that observed differences in health are due to the effects of racism and social inequalities.
In one of her studies, she discovered that the vaginal birth after caesarean delivery (VBAC) calculator used in the US unfairly discriminates against African American and Hispanic women. The VBAC calculator is used to predict the chance of a successful vaginal delivery. In the US, the prediction model treats race as a risk factor for an unsuccessful vaginal delivery, along with factors such as height, weight, and previous vaginal delivery. This means that, for women of the same age and body stature, the VBAC calculator may assign African American women a lower predicted chance of successful vaginal birth after caesarean than White American women. As a consequence of this prediction, African American mothers are less likely to be offered vaginal birth than White American mothers. Vyas compared the VBAC calculator used in the US to those used in Canada and Sweden and discovered that the same tool can be useful without including race, implying that race has been erroneously identified as a predictive risk factor. Instead, it is likely that socio-cultural factors and systemic racism are more legitimate risk factors in unsuccessful vaginal deliveries for African American women in the US.
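To make the mechanism concrete, here is a minimal sketch of how a race term in a logistic prediction model changes the output for two otherwise identical patients. Every coefficient and input below is hypothetical and chosen only for illustration; none is taken from the actual VBAC calculator, and the use of body mass index in place of height and weight is likewise an assumption.

```python
import math

# Hypothetical coefficients for illustration only.
# They are NOT the coefficients of any real clinical calculator.
INTERCEPT = 1.0
COEF_AGE = -0.02           # per year of age
COEF_BMI = -0.04           # per unit of body mass index (assumed input)
COEF_PRIOR_VAGINAL = 0.9   # 1 if a previous vaginal delivery, else 0
COEF_RACE_FLAG = -0.7      # applied when the race indicator is 1


def predicted_success(age, bmi, prior_vaginal, race_flag, use_race=True):
    """Return a predicted probability from a simple logistic model."""
    log_odds = (
        INTERCEPT
        + COEF_AGE * age
        + COEF_BMI * bmi
        + COEF_PRIOR_VAGINAL * prior_vaginal
    )
    if use_race:
        # The race term shifts the log-odds for everyone with race_flag == 1,
        # regardless of any clinical characteristic.
        log_odds += COEF_RACE_FLAG * race_flag
    return 1.0 / (1.0 + math.exp(-log_odds))


# Two patients with identical clinical characteristics.
patient = dict(age=30, bmi=27, prior_vaginal=1)

p_flag0 = predicted_success(**patient, race_flag=0)
p_flag1 = predicted_success(**patient, race_flag=1)
p_no_race = predicted_success(**patient, race_flag=1, use_race=False)

print(f"race term on,  flag=0: {p_flag0:.3f}")
print(f"race term on,  flag=1: {p_flag1:.3f}")
print(f"race term off, flag=1: {p_no_race:.3f}")
```

With the race term switched on, the two patients receive different predictions even though nothing clinical distinguishes them; with it switched off, the predictions coincide.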
Further, Vyas pointed out that many clinical decision-making tools use White American as the default value and include other races as risk factors. She emphasised that a correlation between race and health outcomes does not show that race is the cause of the observed health disparities. In fact, Vyas explained that much scientific evidence suggests that there is more genetic variation within racial groups than between racial groups. She added: “In the conversation of race, the differences between associations and causation should be more explicit.”
Vyas concluded that the idea of race should be clearly distinguished from genetic differences that can be traced through ancestry. She also encouraged professional societies to review tools, create guidelines, and re-evaluate existing race corrections on a case-by-case basis to identify and rethink principles and assumptions that are based on racial discrimination. Many of these small changes will accumulate to create structural change towards greater health equity.
Sherri Rose believes justice and fairness are essential in health research. When tackling a public health or medical problem, researchers usually identify the specific research question, the population they want to study, the social contexts of their data, and the methods and algorithms they plan to use in their data analysis. According to Rose, ethical machine learning and data analysis in health research should involve taking steps to ensure an equitable distribution of health benefits, risks, costs, and resources.
Rose analysed statistical models that health insurance providers use in the US to distribute their funds across health programmes. She discovered that many of these models disadvantage older adults and people with mental illness, as their specific health conditions are not considered in the methods. Rose termed this an ‘algorithmic fairness problem’. To address this problem, she uses a measure called ‘group fairness’ to compare how fairly an intervention treats people within a group who share similar characteristics, and how fairly it treats that group relative to people with different characteristics. One of her main research aims is to improve equity in healthcare decision-making by including built-in fairness criteria in statistical methods.
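One simple way to picture a group-level fairness check is to compare a model's average prediction error for a group with its average error for everyone else. The sketch below is an assumption-laden illustration rather than Rose's own method: the function names, the choice of mean residual as the metric, and all of the numbers are hypothetical.

```python
from statistics import mean


def mean_residual(predicted, observed):
    """Average of (predicted - observed); negative means under-prediction."""
    return mean(p - o for p, o in zip(predicted, observed))


def group_fairness_gap(predicted, observed, in_group):
    """Compare average prediction error inside a group with that outside it.

    Returns (group error, rest-of-population error, difference).
    Assumes both the group and its complement are non-empty.
    """
    rows = list(zip(predicted, observed, in_group))
    grp = [(p, o) for p, o, g in rows if g]
    rest = [(p, o) for p, o, g in rows if not g]
    grp_err = mean_residual(*zip(*grp))
    rest_err = mean_residual(*zip(*rest))
    return grp_err, rest_err, grp_err - rest_err


# Made-up illustrative values (annual spending, arbitrary units).
predicted = [100.0, 120.0, 90.0, 110.0, 200.0, 210.0]
observed = [100.0, 110.0, 95.0, 105.0, 260.0, 250.0]
in_group = [False, False, False, False, True, True]  # hypothetical group flag

grp_err, rest_err, gap = group_fairness_gap(predicted, observed, in_group)
print(f"group mean error: {grp_err}")
print(f"rest mean error:  {rest_err}")
print(f"gap:              {gap}")
```

In this made-up data, spending for the flagged group is under-predicted while spending for everyone else is predicted almost exactly, so a funding formula based on these predictions would systematically under-resource the group. A built-in fairness criterion would aim to shrink that gap during model fitting.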
When considering the implications of research findings, it is important to remember that “[patient] data is not toy data”. Rose emphasised that patients “are individual people. We need to have respect for data.” In other words, data is not simply a collection of abstract numbers; the numbers reflect individual circumstances and stories. When analysing data, Rose encouraged researchers to examine the human stories, society, and real-world situations behind the figures. In doing so, she hopes researchers will keep in mind that the purpose of statistical analysis is to understand and solve real-world problems—including those that relate to health and social inequalities.
In closing the symposium, the chair credited the researchers for their contributions to the progress that has been made in resolving previously overlooked biases and unfair assumptions. However, social inequalities continue to translate into health disparities, and we must keep building on the advancements that have been made thus far. More research and policy plans that target these inequalities are needed to create a healthier and more equitable society.
This article represents the author’s understanding of, and reflections on, the symposium. It does not speak on behalf of the four researchers. Importantly, the KHR acknowledges that the researchers conscientiously used the terms 'race' and 'ethnicity' to convey their points. We recognise the overlap between the two terms and appreciate that they can have different meanings and connotations for different readers. Further information on the symposium and the speakers’ research can be found on the LSHTM Centre for Statistical Methodology’s webpage.
Researcher Bios
Rohini Mathur is an epidemiologist specialising in health equity research at LSHTM. Her work involves investigating the quality and completeness of ethnicity data in the United Kingdom and how electronic health records and cohort data are used in health research.
Mhairi Aitken is an ethics fellow of the public policy programme at the Alan Turing Institute. She is a sociologist who researches the social and ethical dimension of digital innovation and has a particular interest in the role of public engagement in data practice.
Darshali Vyas is currently a resident physician in medicine at Massachusetts General Hospital. She has researched the flaws of race-based clinical decision calculators and explored how these race-based prediction algorithms may perpetuate race-based inequalities.
Sherri Rose is an associate professor of the Centre of Health Policy at Stanford University and Co-Director of the Health Policy Data Science Lab. She is interested in developing and integrating statistical machine learning approaches to improve human health.