The hidden health dangers of data labeling in AI development

Carlos Andrés Arroyave Bernal
08/20/2024 | Reflections


My doctoral research reflected on the role of the workers who make the use of artificial intelligence possible in countries like Colombia. It raised questions about the conditions of a segment of the global AI workforce that is often overlooked: workers employed through humanitarian initiatives for labor inclusion.

Some humanitarian initiatives have promoted the employment of such individuals at AI companies in countries like Colombia. Typically based in poorer countries of the Global South, these workers are subcontracted by large tech companies to train AI systems under precarious working conditions. Known as data labelers, they perform work that is essential to the development of AI, yet they occupy the most marginalized position in these major tech companies' production chains (Partnership on AI).

Beyond the discussion of the adverse working conditions that data labelers face (inadequate compensation, partial inclusion, or the biases attributed to them in data processing and model development), one issue I consider paramount is their mental health.
 
Mural by Beastman [Image credit: Beastman/Creative Commons CC]

The conditions under which AI is produced and developed are linked to the health of data labelers. These conditions include a lack of autonomy, privacy violations, and increased surveillance. For example, workload is distributed according to the time a labeler has available, whether four hours (equivalent to one shift) or eight, and according to the number of images the client demands to be processed. The speed at which data is labeled is then timed to determine the hourly pay rate.

The speed of data labeling is tied to the type of artificial intelligence being fed, since classifications vary in complexity. The pressure to label as much data as possible per unit of time limits the active breaks digital labelers can take. Labelers speak of daily quotas or targets to meet, and note that once the work becomes sufficiently repetitive, processes can be standardized. Because the goal is to process the maximum number of data points per unit of time, this standardization also takes a toll on visual health, given the high volume of images and videos that must be processed.

The literature reports that migrant workers are often employed in precarious jobs that lead to worse general and mental health (Arici, 2019). Studies of migrant workers' health have identified a high burden of common mental disorders such as depression (Moyce, 2017), as well as a possible deterioration in health status and a lower quality of life compared to natives. Factors associated with these outcomes include legal status, language barriers, and literacy levels, which limit migrant workers' control over their environment and working conditions (Benazizi, 2023).

In this context, it is concerning that labelers, many of whom are migrants, are assigned to projects identifying xenophobic content, particularly because some may themselves have been victims of abuse or mistreatment due to their migrant status, may lack social welfare in the host countries, and may face difficulties with acculturation and adaptation. In other words, annotators who belong to populations lacking social protection may suffer more when required to label sensitive content.

Among the mental health risks labelers face is the need to view material with high levels of violence or pornographic content. This is compounded by limited access to psychological support services and medical care, and the precariousness of their working conditions further hinders their ability to attend to their well-being. Yet in labelers' own narratives, the concern is not so much their mental health as the need to remain neutral in the labeling work.
 
Pervasive idea that AI labeling must remain 'neutral' has been detrimental to work of data labelers [Image credit: Creative Commons]

It is striking that people are employed to classify information on the basis of an ostensibly "neutral" human judgment. Labelers are tasked with tagging information according to classifications pre-determined by the company they work for, and these manually assigned labels ultimately feed large AI databases. The way these systems operate, there comes a point when they "learn" to make the classifications on their own: content involving violence, xenophobia, or pornography is classified by AI thanks to the manual work initially performed by the labelers, reproducing the same biases and cultural stereotypes of those who fed it.
 
One of the labelers interviewed for my doctoral dissertation stated: “In tweet labeling, one had to interpret the message. Even though some messages contained pornography and disturbing images, one had to remain neutral because we were producing that image, and if we let ourselves be influenced by the message, it would be poorly labeled” (Data labeler, interview, July 2, 2023).
 
Labelers constantly demand neutrality of themselves, pushing themselves to act as if they were machines. This occurs because the labeler marks content by selecting from labels pre-established by the client, suppressing any reflection on what they see. In training AI, the industry strives to make the human, understood as sensitive and partial, disappear. Beneath this tedious task of labeling millions of data points and images, AI learns as it is trained, but at a high cost to data labelers that is not sufficiently acknowledged. AI development's continued dependence on this large workforce is hidden, and the data labelers ultimately bear the costs.

This workforce, besides being hidden and marginalized, is exposed to toxic and violent material and can face various psychological problems. We need to ask what the mental health impacts on data labelers are, and, consequently, what the mental health costs of these jobs are.


Carlos Andrés Arroyave Bernal is the Director of the Master of Science (MSc) Program in Transdisciplinary Health Studies at Universidad Externado de Colombia. He is a PhD Candidate in Social Sciences and Humanities at Universidad Nacional de Colombia.
 


