Making Data, Making Worlds: The Consequences of the Generative Turn in Big Data and AI
Ludovico Rella (Durham University)
26/02/2024 | Reports Back
Event organizers Kristian Bondo Hansen (right) and Ludovico Rella (left).
Last year, on September 28 and 29, a two-day workshop at the Royal Geographical Society in London brought together scholars from Geography, Media Studies, Science and Technology Studies, Cultural Studies, and Political Science. Making Data, Making Worlds investigated the processes of world-making associated with generative Artificial Intelligence algorithms (think ChatGPT and Stable Diffusion), large-scale simulation environments, and digital twins. The event was a demonstration of radical interdisciplinarity and of collaborative, care-based scholarly practice. The program devoted ample time for presenters and discussants to engage with the papers and to respond to thoughtful and provocative questions raised by audience members. These discussions, which also continued informally over food and drinks, shared a vision that a “generative turn” is taking place across multiple technologies: whereas algorithms were previously used either to classify or to make predictions based on ‘real’ data, contemporary algorithms are now used to generate new, synthetic data.
This generative turn raises new questions and complicates our answers to old ones. First, synthetic data promise a ‘technological fix’ to the bias problem in facial recognition - rooted in the overabundance of white faces in facial recognition training datasets - by synthetically generating under-represented faces. Second, Reinforcement Learning from Human Feedback (RLHF) promises to automate content moderation, remove hate speech, misinformation, and harmful content from synthetically generated text and pictures, and align AI algorithms with human values. Yet while promising to automate content moderation, RLHF still requires, as the name admits, a large amount of human intervention on extremely distressing material. The event treated these techno-fixes as promises that raise more questions than they answer, since the technologies are relatively recent and the broader landscape of generative world-making is fraught and complicated.
The event also considered the generative processes of world-making by interrogating the distinctions between the ‘real’ and the ‘simulated’. Large-scale simulations are often touted as a way to forecast rare and extreme events, from climate change to the car accidents that self-driving cars are supposed to avoid. Although self-driving cars might be trained on simulated car crashes, this has not provided a fully convincing solution to the risks they pose on ‘real’ roads, while complicating the landscape of responsibilities when accidents do occur. As two event participants note in previous papers, the bottom line is that these technologies promise to automate data production to fight data scarcity and to reduce the risks associated with algorithms once deployed. However, these fixes are not unproblematic: risks and ethical questions remain.
World-making and worlding have long histories in STS, and were considered during the event through engagements with feminist and postcolonial studies, among others. Postcolonial scholar Gayatri Spivak understood worlding as cartography, an effort to impose a map over a territory in order to partition it and impose imperial hierarchies: “the worlding of a world generates the force to make the ‘native’ see himself as ‘other’”. Donna Haraway situates worlding within the field of STS by introducing the possibility of forms of “worlding from below” based on kin-making with fellow human and more-than-human agents, including “inanimate” objects such as devices, sensors, and algorithms. More recently, echoing the idea that design practices are also world-making practices, postcolonial anthropologist Arturo Escobar has argued for a ‘pluriversal design approach’ capable of producing worlds that incorporate diverse epistemologies and ontologies. N. Katherine Hayles’s work also informed many papers, as her study of posthuman cognition has been highly influential in moving beyond Western liberal ideas of individuality. All these thinkers informed, directly or indirectly, our thoughts over the two days of the workshop, unveiling the conceptual and ethico-political stakes of our work: making models of the world, as well as accumulating data about it, is always a hallmark of power, understood as power over what to include in and what to exclude (data, representations, agents, subjects, etc.) from the world in the making. The ways models and data come to be are an integral part of how power is enacted in society.
Against this backdrop, the event was packed with two days of thoughtful and in-depth discussions of papers from leading scholars on algorithms, each drawing upon their respective disciplines. Every day saw five papers presented, unpacked by discussants, and then engaged by the audience through Q&A. The discussions were kicked off on September 28 by plenary speaker Louise Amoore, who presented “The World Model: Governing Logics of Generative AI” (currently under review), co-authored with Alexander Campolo, Ben Jacobsen, and myself. Drawing on earlier research by Katherine Hayles that extended cognition beyond humans (hence including machines), our paper discussed how models of the world and the rationalities of machine learning systems, especially generative ones, anticipate new processes of generative world-making. There is intense discussion nowadays about whether generative AI algorithms have a model of the world or not. Louise argued that the latent space of a language model or a diffusion model - such as GPT, Stable Diffusion, DALL-E, or Midjourney - represents the world model of those algorithms, and that this is what enables the algorithm to act in the world. A latent space is a compressed space representing all the data that an AI algorithm like GPT or Stable Diffusion has been trained on: it captures salient features of individual pictures or texts, as well as similarities between them and their “grouping” and clustering around concepts. As computer scientists put it, a model generates pictures or text by “walking the latent space” and exploring the relations between data.
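For readers unfamiliar with the idea, the sketch below is a toy illustration of what “walking the latent space” means, using scikit-learn’s PCA as a stand-in for the far richer latent spaces of models like GPT or Stable Diffusion; it is not how any of those models is actually implemented.

```python
# A toy illustration of "walking the latent space", with PCA as a stand-in
# for the far richer latent spaces of generative models.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 50))      # stand-ins for image or text embeddings
pca = PCA(n_components=8).fit(data)    # the "latent space": a compressed representation

z_a = pca.transform(data[:1])          # latent code of one data point
z_b = pca.transform(data[1:2])         # latent code of another

# "Walking" the latent space: interpolate between the two codes and decode
# each intermediate point back into data space, producing samples that were
# never in the training data.
for t in np.linspace(0.0, 1.0, 5):
    z = (1 - t) * z_a + t * z_b
    generated = pca.inverse_transform(z)
    print(f"step t={t:.2f}, generated sample shape: {generated.shape}")
```

The point of the toy is only that generation happens by moving through the compressed space and decoding back out of it, rather than by copying stored examples.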
Nanna Bonde Thylstrup presented her new project on “Data Loss, Machine Unlearning and Machine Forgetting”. This paper homes in on a very complex question in Machine Learning: how do we make an algorithm forget data it has been trained on, without it glitching? From lawsuits over copyrighted material being scraped to train algorithms to the right to be forgotten, the reasons to delete data abound. However, these algorithms learn from data in complex ways, and it is much harder to delete what they have learned once they have learned it. Nanna’s paper was an investigation into algorithmic memory and more-than-human cognition, showing how deletion and removal can be productive acts.
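As a purely illustrative aside (my own sketch, not Nanna’s method), the minimal example below shows why deletion is harder than it sounds: removing a record from storage leaves an already-fitted model unchanged, and exact unlearning in this toy case amounts to retraining from scratch, which is what research on machine unlearning tries to approximate more cheaply.

```python
# A minimal illustration of why machine "forgetting" is hard: removing a
# record from the dataset does not undo its influence on an already-trained
# model; exact unlearning here means retraining from scratch without it.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)     # trained on all 100 records

# "Delete" the first record from storage...
X_kept, y_kept = X[1:], y[1:]
# ...but the fitted coefficients still reflect the deleted record.
print("coefficients after deletion (unchanged):", model.coef_)

# Exact unlearning: retrain from scratch on the remaining records.
retrained = LinearRegression().fit(X_kept, y_kept)
print("coefficients after retraining:", retrained.coef_)
print("residual influence of the deleted record:", model.coef_ - retrained.coef_)
```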
The following afternoon was dedicated to discussions on the political economy of generative AI. Devika Narayan, from Bristol Digital Futures Institute and Bristol Business School, presented her research on the changing markets and players in cloud infrastructures, demonstrating how ‘software assets’ including algorithms, data, and virtualized computing infrastructures are gaining centrality. Big Tech is investing in fixed capital in the form of very expensive cloud infrastructures, but it is also trying to ‘emancipate’ itself from hardware by virtualizing these infrastructures and making them available as a service, on demand, so as to more easily extract value from what would otherwise be a growing sunk cost. This shift towards just-in-time infrastructure provision tends to play into, and sometimes reinforce, the tendency towards concentration in cloud infrastructure provision, with impacts on access and on the power structures associated with the AI industry. Malcolm Campbell-Verduyn, from the University of Groningen, presented on growth and its embeddedness within the very fabric of infrastructures, including cloud and other digital systems. Among the conclusions drawn by the piece was an invitation towards infrastructure reuse, retrofit, and repair, rather than simply pushing for new infrastructures.
(Nanna Bonde Thylstrup and James Steinhoff)
James Steinhoff, from University College Dublin, and Tanja Wiehn, from the University of Copenhagen, presented on Nvidia Omniverse and the political economy surrounding Nvidia and synthetic data production. If contemporary capitalism, as James argued, is often defined by its relationship with data - think of “platform capitalism” or “surveillance capitalism” - then synthetic data point to the different types of valuation and capitalization entailed by data production, rather than just Big Data mining and analytics. Synthetic data are often marketed as “data from nowhere” that do not suffer from the biases and privacy problems attached to “real” data. However, this effort to “de-spatialize” data is never fully successful. Rather than fully decoupling data from individuals or from space itself, synthetic data represent, in James and Tanja’s work, a further step towards the automation of data production, which in turn enables new ways to capitalize on data beyond surveillance. This does not mean that surveillance is disposed of, but rather that a new form of extraction and capitalization is built on top of and off the back of surveillance, providing new ways of filling the “data gap” haunting AI-driven applications.
The second day continued with further discussions on simulation, generativity, and futurity. Generating a world is also generating a future, in that it produces data “unseen” in the training set. Furthermore, these algorithms are often used as aids to imagine and summon up futures: how will climate change impact specific areas? What would happen to a city if disruptions hit specific neighbourhoods? And which data are the models trained on? Paulan Korenhof (Wageningen University), whose previous studies examined the proliferation of digital twins and their ethical and political implications, presented new research on Destination Earth, the European Union’s extremely ambitious project to build a digital twin of the planet by the end of the decade. Digital twins are not merely digital representations, but also socio-technical systems that “make” these representations and produce a “world”. They are likely to affect modes of seeing, participating, and intervening in the context of environmental governance, and thereby configure a political relation between the virtual and the real. The fact that Destination Earth is a digital replica of the planet, yet designed and directed entirely from and within Europe, might risk reproducing the “views from nowhere” that have been extensively criticized by feminist and decolonial research in technoscience.
Afterwards, Orit Halpern, from TU Dresden, presented research on the futures and imaginaries produced and foreclosed by artificial intelligence and automation patents. By filing a patent, companies both envision a future and prevent others from envisioning similar technologies, or alternative uses of the same technologies. In both digital twins and patents, envisioning a technology also means envisioning the world that technology will inhabit and how the world will be changed by it, and vice versa.
Following lunch, the final three papers homed in on the role of synthetic data, their production, and the reasons and rationalities that animate their production and use in practical applications. Ben Jacobsen, from the University of York, and Johannes Bruder, from the University of Applied Sciences and Arts Northwestern Switzerland, focused on the ‘gap’ between the true and the fake produced by the emergence of synthetic data. Johannes in particular looked at how deepfakes on Twitter may or may not trigger speculative events in financial markets, and how these intersect with changes in the synthetic data industry. Johannes’s paper argued that the “real” is no longer the benchmark for synthetic data: synthetic data, rather, aim to be “authentic”, as argued in an Accenture white paper. Reality is no longer the benchmark; rather, it is the very output of these systems, constantly produced, reproduced, and reworked. Authenticity, then, has more to do with plausibility than with truth. Ben then looked precisely at “the gap” as the productive tension between the synthetic and the real, wherein the distinctions between them are never fixed. Here, the ongoing and contingent play of proximities and distances produces a creative tension between the “too close” and the “too far away”.
(Taylor Spears and Kristian Bondo Hansen)
The last paper dealt with the impact of generative AI on financial expertise. Taylor Spears, from the University of Edinburgh Business School, and Kristian Bondo Hansen, from Copenhagen Business School and co-organizer of the event, looked at the emerging adoption of synthetic data technologies within finance and financial markets. They argued that synthetic data must ultimately be understood as the product of financial modelling infrastructures. Synthetic data generation technologies are “technologies of translation” that facilitate both the embedding of machine learning methods into new markets and the sharing of data among financial actors (for a seminal paper on translation, see here). Echoing Johannes’s paper, Taylor and Kristian foregrounded how synthetic data are changing the nature of financial expertise, potentially blurring the boundaries between finance and computer science.
Reality is being reworked by generative AI and large-scale simulation, with consequences spanning from how we plan to mitigate and hopefully to counter climate change, to how we fight against disinformation and its harmful consequences in elections and financial markets, to how we regulate an increasingly concentrated oligopolistic cloud infrastructure market. In a way, going back to Spivak’s point that worlding entails the “othering” of the native, we might start to wonder whether “the real” as such is othered by synthetic data: how is the real problematized by the proliferation of data that do not have any distinguishable origin points in individuals such as artists, experts, and authors?