To shed light on and address contemporary social and environmental problems such as climate change and rampant environmental injustice, organizations such as the Environmental Protection Agency have recognized a need to link and visualize diverse data – i.e., data characterizing different phenomena, collected by different sources at different times. Designing data analyses and visualizations that effectively characterize the complexity of these problems poses significant challenges. Often, the structure and meaning of data vary significantly across cultural contexts and scientific disciplines. Further, characterizing complex issues requires information infrastructure that can link fine-grained data with ‘big picture’ data, producing narratives that attend to social phenomena shaped at many scales. Recent efforts to build both the technical and social infrastructure to support data integration have attempted to address these challenges. Through ethnographic studies of data integration efforts, this line of research aims to create new knowledge on both the challenges of and opportunities for integrating data to address contemporary problems.

Data integration has been defined as the “problem of combining data residing at different sources, and providing the user with a unified view of these data” (Lenzerini 2002); data integration is enabled by the creation and adoption of standards, protocols, and technologies for identifying, formatting, describing, and linking disparate data sets. I characterize the Semantic Web as one such effort; its designers aim to identify, describe, and link global data sets into a shared, machine-readable Web of data. Recent literature in Information Studies and Science and Technology Studies (STS) has shown that building standards for data integration is complicated; as practitioners aim to make data shareable, different understandings of how data should be named, stored, and linked produce “friction.” Further, this literature has shown that all data integration – no matter how “big” or expansive – produces data margins. All data schemas impose a certain “order of things,” and all data visualizations frame social phenomena in ways that exclude some populations and problems from the picture. Those leveraging data, both big and small, to address contemporary problems thus must do so with an understanding of what underlying infrastructures afford, the ways they can eclipse data, and how they can be responsibly leveraged. This study focuses primarily on individuals in data communities that have become attuned to such socio-political data concerns or unsettled by other limits of data integration.

This study will draw on my experiences with STS, Web and data science, and the digital humanities to address the following questions:

  1. What cultural, historical, and infrastructural conditions have shaped the design and affordances of data integration infrastructure – particularly linked data standards and Semantic Web technologies?
  2. What limits – such as those imposed by language or the design of base infrastructure – do designers face as they architect data integration infrastructure, and how do designers engage such limits?
  3. What motivations, modes of inquiry, and expertise do data practitioners engage as they leverage such standards and technologies to foreground marginalized knowledge?
  4. What sensibilities, skills, and routines should constitute a critical data practice?

First, I will conduct an ethnographic study of communities designing information infrastructure for data integration. This will include interviews with, and participant observation of, those designing and vetting standards and technologies for linking data on the World Wide Web (through organizations like the W3C) and those building the technical and social bridges for data sharing (through organizations like the Research Data Alliance). Second, I will study organizations and activists leveraging data integration infrastructure to address contemporary problems. I will conduct a series of interviews with data practitioners aiming to bring together diverse data sets to draw new connections between social and environmental phenomena. In these interviews, I will aim to characterize the motivations and expertise practitioners bring to their data practice, how they understand the biases of data and the limits of data infrastructure and data analysis, and how they situate their work in relation to such challenges. In doing so, I will attempt to characterize what constitutes a “critical data practice” in the era of big data. (See Appendix I for an elaboration of “critical data practice.”)

This research explicitly addresses recent calls in STS to study information infrastructures by analyzing the values that get built into them and the types of knowledge work they afford. While this literature tends to approach the study of information infrastructure by analyzing what designers do (how they collaborate, negotiate, and eventually come to consensus), this line of research aims to better characterize the conditions that structure designers' work. In other words, it considers what makes designing standards and technologies for data integration so challenging (e.g., the design of legacy infrastructures and epistemic conflicts over how data gets defined). In doing so, it aims to better understand how information infrastructure designers and data practitioners orient and adapt their work in the face of complex conditions – at times leveraging technologies deviously to support political ends other than those for which they were designed. In drawing attention to projects that do this particularly well, this research will offer accounts of arenas where critical engagements with big data remain promising (while always limited). In this sense, this work will also contribute to literature on the practices and politics of big data. Finally, the work will contribute to pedagogical literature on data literacy, demonstrating the sensibilities and skills required of contemporary critical data practitioners.