Towards fairness in machine learning for dermatology: a skin tone representation disparities study

Celia Cintas
2025/02/10 | Reflections

Most machine learning (ML) models assume ideal conditions and rely on the premise that clinical data comes from the same distribution as the training samples. However, this premise is often not met in real-world applications; in clinical settings, we encounter various hardware devices and diverse patient sub-populations with differing or unknown conditions. We also need to evaluate existing disparities in dermatological outcomes that may be carried over into, and exacerbated by, our ML solutions, so that these solutions are clinically feasible and have a positive societal impact. Addressing the robustness and fairness of these solutions is essential given the growing interest in these models within the dermatology field.

Medical students and doctors around the world learn about skin conditions through textbooks, presentation slides, and other educational materials that illustrate the manifestations of these diseases primarily on light-colored skin, even though there are many more people with dark skin than with light skin globally. Not every skin condition appears the same on light skin as it does on dark skin; for instance, basal cell carcinoma looks pink and pearly on light skin, while it may present as pigmented and shiny on dark skin. When medical education materials fail to represent the vast diversity of skin tones, doctors trained with these resources are less likely to accurately diagnose and treat patients with dark skin who come to them with a skin condition. With skin cancer, this often results in diagnostic delays, meaning that patients are diagnosed at a more advanced and harder-to-treat stage of the disease.
 
 

Previous analyses of dermatology-related academic materials (journals and textbooks) have shown an under-representation of Fitzpatrick skin types (FST) V and VI; however, the images were annotated and analyzed manually, i.e., a domain expert located each image in a textbook and labeled its skin tone. Unfortunately, this manual approach is not tractable for a large corpus because it is labor-intensive, causes visual fatigue in the operator, and is subject to intra- and inter-observer error in skin tone labeling. As part of our research, we provided an open-source pipeline that automatically quantifies the imbalance in representation between two skin tone groups in textbooks (FST I-IV and FST V-VI); while this binary split does not capture finer granularity in skin tones, it does assess the most historically excluded skin tones. We evaluated our method on four medical textbooks, where brown and black skin tone images made up only 10.5% of all skin images.
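
To make the final counting step concrete, here is a minimal Python sketch of how a representation share could be computed once every skin image has a binary label. It is an illustration under stated assumptions, not the code of our pipeline: the function name `dark_skin_share`, the label strings, and the example labels are all hypothetical, and the labels are made up rather than taken from the textbooks we studied.

```python
from collections import Counter
from typing import Iterable

# Hypothetical label strings for the two skin tone groups.
LIGHT = "FST I-IV"
DARK = "FST V-VI"


def dark_skin_share(labels: Iterable[str]) -> float:
    """Return the fraction of skin images labeled FST V-VI."""
    counts = Counter(labels)
    unknown = set(counts) - {LIGHT, DARK}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = counts[LIGHT] + counts[DARK]
    if total == 0:
        raise ValueError("no skin images to summarize")
    return counts[DARK] / total


# Made-up labels for illustration only (not data from the study).
example_labels = [LIGHT] * 17 + [DARK] * 3
share = dark_skin_share(example_labels)
print(f"{DARK} share: {share:.1%}")
```

In practice, the labels would come from the skin tone estimation step rather than from a hand-written list, and the same share could be reported per textbook or per chapter.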

 
 

Putting this work together required an interdisciplinary team with diverse skill sets across several continents and institutions, from Nairobi, Kenya (where I and other co-authors are based) to Zurich, Switzerland, and Stanford, USA. The dermatologists on the team identified the problem, labeled a large number of textbook figures to serve as baseline and training data for the ML methods, and actively participated in validating each step of the proposed solution. The ML scientists improved the process of extracting figures from an entire document and refined models previously created for identifying non-diseased patches of skin in images and estimating their tone.
 

With our proposed pipeline, we were able to recapitulate the findings in the literature, which show a significant underrepresentation of FST V-VI skin in dermatology educational materials. This project makes it possible to run bias assessments at scale without spending hours on manual labeling. We envision this technology as a tool for medical educators, publishers, and practitioners to assess skin tone diversity in their educational materials and to set a minimum standard for authors and publishers to follow, thus helping to improve health equity. This work is an excellent example of ML serving not to replace doctors but to aid them, by improving their training and making them more effective.

_______________

Celia Cintas is a Staff Research Scientist at IBM Research Africa, based in Nairobi. She is a member of the AI Science team at the Kenya Lab. Her current research focuses on improving ML techniques to address challenges in global health and on exploring subset scanning for anomaly detection under generative models.



Published: 02/10/2025