For two weeks in late June, as part of the Beautiful Data workshop held at Harvard University, our own Jeff Steward, director of digital infrastructure and emerging technology, had the pleasure of discussing data with some great thinkers from the museum field, the humanities, and other related disciplines. Here Jeff shares insights from the workshop.
We were merely two days into the workshop, discussing institutional portraiture and the shape of data, when the conversation veered into questions of data sanitization. After my presentations on data visualization I was often asked, “How do you clean your data?” or “How good is your data that you are able to create visualizations like this?” The short answer is that the museums’ data is just okay; it’s often dirty, denormalized, and full of gaps.
For example, we use 299 different historical periods (such as “Hellenistic period, Early” and “Edo period, 1615–1868”) to describe works of art. But only 14% of our records actually have an assigned period, leaving 86% of our collections unfindable by this particular attribute.
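A coverage figure like that 14% is simple to measure. What follows is only a minimal sketch in Python, not the museums' actual tooling: the record layout, the `period` field name, and the `attribute_coverage` helper are all assumptions made for illustration, and the sample records are invented.

```python
def attribute_coverage(records, field):
    """Return the fraction of records with a non-empty value for `field`.

    `records` is assumed to be an iterable of dicts; a missing key,
    None, or a blank string all count as "no value assigned".
    """
    records = list(records)
    if not records:
        return 0.0
    filled = sum(1 for r in records if str(r.get(field) or "").strip())
    return filled / len(records)


# Invented sample records; the field names are hypothetical.
sample = [
    {"id": 1, "period": "Hellenistic period, Early"},
    {"id": 2, "period": None},
    {"id": 3, "period": ""},
    {"id": 4},
]

coverage = attribute_coverage(sample, "period")
print(f"{coverage:.0%} of records have a period")
```

Running the same check across every descriptive field gives a quick map of where the gaps are before any cleaning decisions get made.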
Harvard metaLAB’s Yanni Loukissas quoted British anthropologist Mary Douglas as saying, “Dirt is matter out of place.” In the context of my work, the dirt is what reveals the culture of our data and the character of our institution. In other words, we should find the proper place for our dirt instead of fixating on the ideal of pristine data.
I spend a lot of time sifting and sorting data, looking for that perfect thread through it all to illustrate some point or to tease out a new idea. Doing this allows me to create visualizations like page view terrains and abstractions of object colors, but I’m being quite selective about the data. It wasn’t until I heard some of the discussions during the workshop that I really took to heart the idea that in our desire to clean and shape data, we introduce bias. Sometimes we unintentionally play with history through cataloguing decisions and the data we elect to change, share, and present.
A recent series of visualizations from Northeastern University provides a novel view of the rise and fall of cultural centers using data sets from Freebase, the General Artist Lexicon, and Union List of Artist Names Online (ULAN). The series illustrates the birth and death places of 150,000 individuals, but large gaps in the data set are quickly apparent: there is hardly any activity in Africa, South America, and Asia, for instance; much of the movement shows individuals leaving Europe. The series fails to acknowledge the existence of cultural centers in parts of the world outside the influence of Europe. In this case, the bias is fairly evident.
In our own visualization of object page views, I’ve already made decisions that will shape the viewer’s impression of the museums’ data. I’ve ordered objects by their identification number and have elected to arrange them in a square area. The wide, seemingly empty stretches represent untouched areas of our collections. If I were to reorder the objects by their classification (paintings, coins, sculpture, and so on) or change the shape of the area of the terrain, those empty stretches would take on an entirely different meaning.
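To make the effect of that ordering decision concrete, here is a rough Python sketch under stated assumptions. It is not the code behind the page view terrain: the `layout_square` helper, the field names, and the sample objects are all hypothetical. It sorts objects by a chosen key and pours their page view counts, row by row, into the smallest square grid that holds them.

```python
import math


def layout_square(objects, sort_key):
    """Sort objects by `sort_key` and lay out their view counts in a square grid.

    Returns a list of rows. Cells left over after the last object are
    padded with None so that every row has the same length.
    """
    ordered = sorted(objects, key=sort_key)
    n = len(ordered)
    side = math.isqrt(n)
    if side * side < n:
        side += 1  # round up to the next whole grid size
    cells = [o["views"] for o in ordered] + [None] * (side * side - n)
    return [cells[i * side:(i + 1) * side] for i in range(side)]


# Invented sample objects; the field names are hypothetical.
objects = [
    {"id": 1, "classification": "Paintings", "views": 12},
    {"id": 2, "classification": "Coins", "views": 0},
    {"id": 3, "classification": "Paintings", "views": 7},
    {"id": 4, "classification": "Coins", "views": 0},
]

by_id = layout_square(objects, lambda o: o["id"])
by_class = layout_square(objects, lambda o: (o["classification"], o["id"]))

print(by_id)     # zero-view objects are scattered through the grid
print(by_class)  # zero-view objects are gathered into one stretch
```

The data is identical in both grids; only the sort key changes, and with it where the empty stretches fall and what a viewer is likely to read into them.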
So what do we do about this issue? We can start by identifying our agenda and our bias up front, and by contextualizing our work and data as much as possible. That means providing details on the source of the data, along with insight into the methodology and reasoning behind the design of our information and the way we choose to visualize it.
The workshop helped me understand just how important it is to question what might be missing when I align data in a particular way and, more importantly, to ask whether I’m hiding anything, jumping to conclusions, limiting discoveries, or narrowing potential outcomes in my design and presentation of data.