
Race and/or ethnicity in predictive models
The use of race and/or ethnicity in predictive (also called prognostic) models is, in a word, problematic. It is problematic for a few reasons, and fortunately, this has received much-needed attention in recent years.
Including parameters for race and/or ethnicity may well be done with the purest of intentions, namely trying to ensure well-calibrated estimates of risk. For example, there may be evidence suggesting or convincingly demonstrating the risk of an outcome is higher or lower in certain demographic groups defined by race and/or ethnicity. But these population-level observations, while certainly important from the perspective of outcome ascertainment, population health, and health equity, do not then mean all people within that group have the same degree of increased or decreased risk. Indeed, to project these population-level findings to each person who belongs to that group is a recipe for ecological fallacy. Worse still, incorporating race and/or ethnicity into prognostic models based on these population-level observations could exacerbate rather than alleviate health inequities.
I have, unfortunately, heard various ill-founded arguments defending the use of race and/or ethnicity in prognostic models. These range from arguments suggesting removal implies race and/or ethnicity don’t matter at all or that removal is dismissive of something that may be core to a person’s sense of self or identity; arguments that it doesn’t matter how we predict, just that the predictions are good; and even arguments that invoke (either implicitly or explicitly) biologic essentialism. In all cases, the arguers seem to earnestly desire to serve people well, but their arguments nevertheless betray an incomplete understanding of clinical epidemiology and best practices in predictive modeling and/or (for the arguments invoking biologic essentialism) fundamentally flawed perspectives.
To be clear, race and/or ethnicity matter. They matter a lot. It is just their direct incorporation as predictor variables within predictive models that is problematic. We know (and have known for some time, e.g., from the Human Genome Project) that race is a social construct, and there is more genetic variation within a given race than between two given races. But this truth does not negate in any way the coexisting truth that systemic racism and structural inequity are real problems. And these problems can and do lead to health inequity. For example, it is not controversial to say that, on average, certain races have higher risk factor prevalence (e.g., higher average BP, higher average cholesterol) than other races. And these inequities in turn raise the risk for various adverse health outcomes. But, crucially, the existence of systemic racism and structural inequity and their impact on upstream risk factors for adverse health outcomes does not justify saying a given individual is at higher or lower risk because of their race and/or ethnicity per se. To use race and/or ethnicity as a pseudo-surrogate in that manner is crude at best. But worse, using race and/or ethnicity as a predictor also commits the ecological fallacy by assuming all people of a given race and/or ethnicity have an entirely homogeneous lived experience. Which is, of course, untenable. Better, then, to focus on the upstream risk factors themselves, which are the true culprits. Although some might worry that omission of race and/or ethnicity could lead to poorer estimation, this is still a fallacious line of reasoning. To whatever extent race and/or ethnicity have predictive value, the follow-up question should always be “But why?”, because to argue it is the person’s race per se is to argue from biologic essentialism.
In other words, if one were to draw a directed acyclic graph (DAG) demonstrating the relationship between race and/or ethnicity and the outcome in question, the DAG would be very complex, but it would demonstrate the relationship is not so much a direct one, but rather, mediated by a host of other factors that are associated both with race and/or ethnicity (e.g., due to the aforementioned impacts of systemic racism and structural inequity) and with the outcome of interest.
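A deliberately simplified numeric sketch may help make the mediation argument concrete. It assumes a toy world in which group membership influences only how common an upstream risk factor is, and the outcome depends only on that risk factor. The group labels, function names, and every probability below are hypothetical values chosen for illustration; none of them come from real data.

```python
# Toy illustration of mediation: group -> upstream risk factor -> outcome.
# All labels and probabilities are hypothetical, chosen only for illustration.

# P(risk factor is elevated | group): structural inequity makes the risk
# factor more common in one group than in the other.
P_ELEVATED = {"A": 0.2, "B": 0.5}

# P(outcome | risk factor level): note there is no group term at all.
P_OUTCOME = {"not elevated": 0.05, "elevated": 0.30}


def population_risk(group):
    """Average outcome risk across everyone in a group."""
    p = P_ELEVATED[group]
    return (1 - p) * P_OUTCOME["not elevated"] + p * P_OUTCOME["elevated"]


def individual_risk(group, level):
    """Risk for one person; the group argument is deliberately unused."""
    return P_OUTCOME[level]


for group in ("A", "B"):
    print(f"Group {group}: population-level risk = {population_risk(group):.3f}")

for level in ("not elevated", "elevated"):
    risks = {g: individual_risk(g, level) for g in ("A", "B")}
    print(f"Risk factor {level}: individual risk by group = {risks}")
```

In this toy world the population-level risks differ (0.100 versus 0.175), yet two individuals with the same risk factor level have identical risk regardless of group. A model that used only the group label would assign 0.175 to a person in group B whose risk factor is not elevated, when that person's risk is 0.05.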
As a further example, let us consider the factors identified by the Pooled Cohort Equations (PCEs) as being prognostically important for 10-year ASCVD risk. Imagine a female 62 years of age who has a systolic BP of 155 mm Hg, takes BP medication, has a total cholesterol of 200 mg/dL, HDL cholesterol of 45 mg/dL, smokes, and has type II diabetes. According to the PCEs, a person with these parameters who is Black has a 10-year ASCVD risk that is about 20 percentage points higher than a person with these parameters who is White. Although females who are Black may have worse risk factors on average (and thus, more females who are Black suffer adverse health outcomes) than females who are White, one must remember this is a population-level observation, and ultimately, implementation of predictive models occurs at the individual level. And if a clinician sees two individual females whose modeled risk factors are the same, then for the model to predict a 20 percentage point difference between them because one is Black and the other is White should raise serious skepticism and concern. With the revised PCEs, the difference is smaller but still about 10 percentage points.
I have occasionally heard people argue that race and/or ethnicity may be serving as surrogates for other, unmeasured risk factors. After all, the discussion above notes how systemic racism and structural inequity can lead to higher risk factor prevalence. But just saying race and/or ethnicity are surrogates for unmeasured risk factors is an unconvincing argument, and it also fosters epistemic sloth (see the “But why?” comment above). Although it is a hyperbolic example, it would be like observing that wearing a bathing suit is associated with drowning and then arguing that having a bathing suit on is a surrogate for an unmeasured risk factor for drowning. This is superficially true but epistemically lazy. The real culprits are being in or around bodies of water, people’s varying proficiency with swimming and water safety habits, and so on, not the bathing suit itself. And surely, not all people wearing bathing suits are at equal risk for drowning.
Again, nothing herein in any way diminishes or contradicts literature demonstrating health inequities and the paramount importance of continuing work to mitigate and eventually eliminate such disparity; it simply underscores the importance of recognizing and working toward remedying those population-level observations without committing a textbook ecological fallacy by incorporating race and/or ethnicity into predictive models intended for individual risk prediction.
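For readers who want to check the arithmetic behind the 62-year-old example above, the following is a minimal standalone sketch of the original (2013) PCEs for females. It is written in plain Python rather than with preventr, and it is not preventr's implementation. The coefficients are those published with the 2013 ACC/AHA risk assessment guideline; the function name, argument names, and dictionary keys are hypothetical names invented for this sketch.

```python
import math

# Coefficients of the original (2013) Pooled Cohort Equations for females.
# Key names are hypothetical; 0.0 marks a term that model does not include.
PCE_FEMALE = {
    "white": {
        "ln_age": -29.799, "ln_age_sq": 4.884,
        "ln_tc": 13.540, "ln_age_ln_tc": -3.114,
        "ln_hdl": -13.578, "ln_age_ln_hdl": 3.149,
        "ln_sbp_treated": 2.019, "ln_age_ln_sbp_treated": 0.0,
        "ln_sbp_untreated": 1.957, "ln_age_ln_sbp_untreated": 0.0,
        "smoker": 7.574, "ln_age_smoker": -1.665,
        "diabetes": 0.661,
        "mean_score": -29.18, "baseline_survival": 0.9665,
    },
    "black": {
        "ln_age": 17.114, "ln_age_sq": 0.0,
        "ln_tc": 0.940, "ln_age_ln_tc": 0.0,
        "ln_hdl": -18.920, "ln_age_ln_hdl": 4.475,
        "ln_sbp_treated": 29.291, "ln_age_ln_sbp_treated": -6.432,
        "ln_sbp_untreated": 27.820, "ln_age_ln_sbp_untreated": -6.087,
        "smoker": 0.691, "ln_age_smoker": 0.0,
        "diabetes": 0.874,
        "mean_score": 86.61, "baseline_survival": 0.9533,
    },
}


def pce_risk_female(race, age, total_c, hdl_c, sbp, bp_treated, smoker, diabetes):
    """10-year ASCVD risk (as a proportion) from the original PCEs for females."""
    c = PCE_FEMALE[race]
    ln_age, ln_tc = math.log(age), math.log(total_c)
    ln_hdl, ln_sbp = math.log(hdl_c), math.log(sbp)
    # Treated and untreated systolic BP have separate coefficients.
    sbp_key = "ln_sbp_treated" if bp_treated else "ln_sbp_untreated"
    score = (
        c["ln_age"] * ln_age
        + c["ln_age_sq"] * ln_age ** 2
        + c["ln_tc"] * ln_tc
        + c["ln_age_ln_tc"] * ln_age * ln_tc
        + c["ln_hdl"] * ln_hdl
        + c["ln_age_ln_hdl"] * ln_age * ln_hdl
        + c[sbp_key] * ln_sbp
        + c["ln_age_" + sbp_key] * ln_age * ln_sbp
        + (c["smoker"] + c["ln_age_smoker"] * ln_age) * int(smoker)
        + c["diabetes"] * int(diabetes)
    )
    # Risk = 1 - S0 ^ exp(individual score - mean score)
    return 1 - c["baseline_survival"] ** math.exp(score - c["mean_score"])


# The example from the text: identical inputs, differing only in the race term.
example = dict(age=62, total_c=200, hdl_c=45, sbp=155,
               bp_treated=True, smoker=True, diabetes=True)
risk_white = pce_risk_female("white", **example)
risk_black = pce_risk_female("black", **example)
print(f"White: {risk_white:.3f}  Black: {risk_black:.3f}  "
      f"difference: {100 * (risk_black - risk_white):.1f} percentage points")
```

Run as written, this should give risks of roughly 0.30 and 0.51, a gap of about 20 percentage points, consistent with the figure quoted above. Every measured input is identical between the two calls; only the coefficient set selected by the race label differs.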
As further reassurance that removal of race and/or ethnicity from prognostic models does not portend poorer estimation, we need to look no further than the PREVENT (Predicting Risk of cardiovascular disease EVENTs) equations: Despite not using race and/or ethnicity in their modeling, they still have much better calibration than the PCEs. And the PREVENT equations are not unique in this regard. Good results have also been seen for estimation of eGFR via the reparameterized CKD-EPI equations and prediction of successful vaginal birth after cesarean (VBAC) via the reparameterized TOLAC (Trial of Labor After Cesarean) model.
I could write much more about this, but detailing this issue comprehensively is beyond the scope of this documentation. I would have felt remiss, however, if this package supported estimation via the PCEs without commenting on the “elephant in the room”. The following resources are good starting points for those who are interested in understanding the issue more fully:
With specific regard to the PCEs, the models formally consider the races Black and White. More specifically, however, they also incorporate ethnicity to a certain degree, because they specify non-Hispanic Black and non-Hispanic White. Even if one set aside the per se problem of using race and/or ethnicity in prognostic models, the fact the PCEs reduce something as complex as race and ethnicity to non-Hispanic Black or non-Hispanic White is obviously additionally problematic. What about the various other races and ethnicities? What about the false dichotomy of Black or White, or of Hispanic or non-Hispanic (e.g., what about someone who has one parent who is Black and one who is White)? And so on. Guidance from ACC/AHA at the time of the publication of the PCEs suggested one might consider using the equations for people who are non-Hispanic White for people who are neither Black nor White (see 4.1, Recommendation 2), but this also carries with it challenging and uncomfortable implications. Among these is how best to convey that within implementations of the model. Within efforts to foster inclusivity, there is considerable discourse surrounding avoidance of “othering” language. And yet, there seems to be no clear way to achieve this within the PCEs. Even phrases like “neither Black nor White”, “something else”, etc. are only superficially successful at best. While they might avoid the literal use of a word like “other”, these variants are still, de facto, othering, because they categorize the person based on their non-belongingness to another group.
With all the above, one might wonder why preventr supports estimating risk using the PCEs. In addition to the issues discussed here, the PCEs are well-known to overestimate risk. However, the PCEs are still widely used, and until the release of the PREVENT equations, they were the primary method for estimating ASCVD risk in many areas given their development and endorsement by the ACC/AHA in their 2013 guideline. It is thus easy to envision use cases where it would be instructive to see the difference in risk estimation between the PCEs and the PREVENT equations. This is the primary reason for incorporation.