What do you get when you run a quick experiment using the new software tool a whole industry is talking about?
Last week, researchers Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli put a draft study on arXiv claiming that ChatGPT annotates data better than humans for certain tasks. The press jumped on this, suggesting that ChatGPT can replace the workers who train AI.
Data annotation workers like us started discussing the study on forums and in chats. We quickly realized its claims – and the press around them – were missing a lot. Fortunately, journalist Chloe Xiang at Vice reached out to us, listened, and published our side of the story. We've been organizing with Turkopticon for several years, developing an analysis of our work and collectively campaigning for better working conditions on Amazon Mechanical Turk. Now, we're faced with new challenges related to ChatGPT and perceptions from academia and industry.
Today, we want to outline our perspective on this study and the related hype, and why it matters to us and everyone who works in software.
First, does the study apply to all or even most uses of automated data labeling? No. The study only looked at tweet classification – a far cry from all of AI. The authors had ChatGPT classify tweets for “relevance, topic, stance, problem or solution framing, and policy framing” and found that different runs of ChatGPT agreed with each other more often than the human annotators did. They claimed that coding would be much cheaper this way. Even then, ChatGPT is still trained by humans – data annotators go through its outputs before it hits the public to make sure the results make sense and are not toxic. Computers don’t have access to our changing norms of what is appropriate speech or what is respectful. There is no ChatGPT without human workers.
Second, was the study posted on arXiv put through a peer-review process? No, it was not. Publishing hype harms our collective understanding of technology, beyond just workers. In a world that rewards clicks and attention, journalism and academia are both at risk of prioritizing hype. If we slowed down for peer review, we would find some of the obvious shortcomings. Even better, the most impacted workers and communities should have a say before findings like these land and the press has its day.
Third, do claims like the ones made in the study have real-life consequences? Yes, absolutely. Amazon Mechanical Turk already hides us, the data workers, from the requesters who have us sort, classify, label, and judge their data. Requesters often blame us when something goes wrong, thinking we must be bots or “low skill” rather than looking at the design of their own tasks. We worry that studies like these encourage the AI requesters who typically post work to MTurk to pursue automation without understanding how to generate high-quality results, degrading the quality of the data and AI we all get. We worry that studies like these treat us as costs to be cut rather than as people who bring skills and knowledge to the process.
Language is a living, breathing thing. A computer program doesn’t go out into the world to fact-check what it scraped off the internet. Similarly, ChatGPT might generate text, but a human still has to read it to decide whether it is “good” text. With all the hype about ChatGPT replacing writers, data annotators, and artists, we have to remember that writing and knowing are not just making words or assigning categories – they are about judgment.