Week 2 - Data Basics
Welcome to week 2 of Data Librarianship and Management! This week we're going to go over the basics of data, especially to make sure we're on the same page with some vocabulary.
Overview
This week, you all should:
- Read the articles for this week before our live discussion on Thursday on Zoom from 6:30 - 7:30pm Eastern
- Read/watch the lecture
- Complete homework 1!
Lecture
Hi everyone, welcome to week 2 of data librarianship!! This week we’ll be discussing what data is, how the definitions change for different disciplines, and how data looks as it moves across different phases of its lifecycle.
But I want to start by saying 'thank you' to you all for finishing the self-assessment and letting me know which weeks you'd like to moderate. The results of the self-assessment are always so helpful for me in planning the pace of the course more effectively. A lot of you reported confidence in learning new technology, and some of you specifically mentioned programming/coding in the comments section. I want to be clear that down the line, some homework assignments will have both a reading/writing option and a hands-on option that can include some programming. You choose either the reading/writing or the hands-on option, not both (unless you really want to, but I'm not giving extra credit for it). So you'll never be forced to try coding, but the option is available to you, and I'm here to help should the need arise.
Ok, so let's get into it. The first thing to unpack in understanding what it means to be a data librarian is the concept of data itself: how it evolves over time, how different domains of research or internal services conceptualize it, and why it's in the purview of librarians to be involved in data- or computationally-intensive work. This lecture, broken up by specific questions, is meant to give an overview of data and data librarianship at large. Next week, we'll do a deep dive into the different types of data that you might encounter, such as GIS or qualitative data.
To start, how does data factor into librarianship?
Let's start at the very beginning (a very good place to start). Most of y'all mentioned that you took some data-related courses, such as:
- Database design & development
- Programming for cultural heritage and metadata
- Digital Preservation and Curation
- Metadata design
This is great!! I think it speaks to how important and embedded data is within librarianship, and how we can all expect to encounter data across different areas of the field. A lot of you in the discussion for week 1 talked about how the Emmelhainz reading focused on academic librarianship, and we touched on the fact that there are many opportunities for data librarians to work in other types of Galleries, Libraries, Archives, and Museums (GLAMs), and even in companies/industry (though the job titles won't have 'librarian' in them). We'll examine building data services in week 8. I think you all should go through this course as if you were working in the type of organization you want to work in after graduation, starting up a new data service and skilling up in this area.
So what is data?
Data, in many respects, is in the eye of the beholder. We even have multiple ways of saying it: dat-ah, or day-ta. [clip of Dr. Eggman from Sonic Boom being confused over the many ways to say 'data']. One person's "just a lab notebook" is another person's "rich unstructured data". One person's "primary source documents" is another person's "corpus". What one calls data depends largely on the discipline/field of study and the methodologies in which one is situated. Jargon switching is an essential part of the job of a data librarian (and I'd argue, of librarianship as a whole, but this isn't intro to librarianship, it's intro to data librarianship). What I call 'jargon switching' is closely related to what folks who study linguistics call code switching, which is a great illustration of the concept. Code switching is the process of shifting from one linguistic code (a language or dialect) to another, depending on the social context or conversational setting. It happens when a speaker alternates between two or more languages, or language varieties, in the context of a single conversation. It's typically used to align speakers with others in a specific situation (e.g. defining oneself as a member of a group or community), to announce specific identities, create certain meanings, and facilitate particular interpersonal relationships.
This basically speaks to the idea that how I would talk about metadata to a librarian is different from how I would describe it to a history undergraduate, a physics professor, or a museum curator. Being able to adjust your jargon to different levels of expertise, and to different methodologies and disciplines, is key to providing a cohesive data service. I had a great conversation with a conservator once, who was charged with conserving large-scale artworks. She came to me because she had a lot of what she called "light data" and she was having trouble managing it. In the data reference interview I conducted with her, I found out it was a specific type of spectrometer data. Spectrometer data! That was the same type of data an astrophysics professor had asked me about a few months before, which they described as "spectra from the machine". So it goes to show that even when people talk about the same type of data, how they describe it varies across fields.
So being able to jargon switch is really a skill to cultivate as a data librarian, and it starts right at the definition of data. Let’s look at this definition of research data from the reading for this week:
Research data, unlike other types of information, is collected, observed, or created, for purposes of analysis to produce original research results.
This gives us some very specific parameters to work within, but you might notice that this could potentially exclude analysis of secondary data (e.g. someone else's data!), which isn't great because we want others to make use of our work for the betterment of our fields. What about this one:
Data that are descriptive of the research object, or are the object itself.
This one is nice to me because it includes metadata in its definition of data, which seems right to me given how much research we can do on metadata, if it's well-formed and accurate (feel free to disagree with me and tell me, though!). Let's go to our last and very general definition:
Any information you use in your research
This might fall under "don't be so open-minded that your brains fall out", or it might be the answer to all the squabbling over "what counts as data". Personally, I want to encourage people to take care of all their research materials until the point where it's better to send them over to the professionals [gestures at self and all], so if calling everything data gets us there, I'm fine with it. Let's pivot and ask:
What types of research methods or data types might each definition exclude? What nuances are we missing in these broad strokes?
It's worth mentioning that when discussing data with researchers across domains, sometimes the word 'data' doesn't even come up – it's offensive to some, for instance to qualitative researchers, whose work can be deeply interpersonal. During qualitative projects, researchers are deeply embedded in the communities they study. I've heard a few times from researchers, after I called their interviews 'data', that "these are real people, not data points." And so when I'm speaking to these scholars, I tend to use words like 'materials' and 'your work' instead of 'data' or 'corpora'.
So now let’s examine a table of data types, taken and adapted from our readings –
Types of Data | |
---|---|
Documents & spreadsheets | Slides, artifacts, specimens, samples |
Laboratory notebooks, field notebooks, diaries | Source code |
Questionnaires, transcripts, codebooks | Metadata |
Audiotapes, videotapes | Database & database content |
Photographs, films | Models, algorithms, scripts |
Protein or genetic sequences | Contents of an application (input, output, logfiles for analysis software, simulation software, schemas) |
Spectra | Methodologies and workflows |
Administrative data | |
Standard operating procedures and protocols | |
I like this because it’s inclusive of not only typical things we might think of, like spreadsheets, but also logfiles for software (which can be incredibly useful in finding errors or understanding the provenance of data) and administrative data. Borgman (2011) also gives us a few different categories of data to think about:
- Observational data, such as weather measurements, surveys
- Computational data, such as models, simulations
- Experimental data, such as chemical reactions in a lab
- Records of government, business, public/private life, such as archival records, open government data, law cases
Leek (2015) has a very quantitative point of view, but I think it’s worth going over his definition of a dataset, which he separates by level of processing:
- Raw data - the data as you got it, which should be kept read-only (i.e. you never change it, but you can make derivatives as needed). Leek says "if you did any manipulation of the data at all, it is not the raw form of the data"
- Tidy data - the processed data, ready for analysis. 'Tidy' has a specific meaning, which we'll get into more next week when we talk about quantitative data.
- Codebook - a document that describes each variable and the values in the tidy data and also contains information about the study design and choices you made.
- Recipe - the explicit steps for getting from raw to tidy. To Leek, ideally this would be a script, to limit 'human error': it takes the raw data as input and produces the tidy data as output. It also records the software used to go from A –> B and the system you ran it on (macOS, Windows, Linux). A sketch of what such a recipe script might look like follows this list.
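To make the 'recipe' idea concrete, here is a minimal sketch of what such a script could look like in Python. To be clear, this is my illustration, not Leek's own code: the file names and cleaning steps (survey_raw.csv, lower-casing column names, dropping empty rows) are hypothetical stand-ins.

```python
# recipe.py - a minimal sketch of a Leek-style "recipe" script.
# The file names and cleaning steps below are hypothetical stand-ins.
import platform
import sys

import pandas as pd

RAW_FILE = "data/raw/survey_raw.csv"     # the raw data: read, never modified
TIDY_FILE = "data/tidy/survey_tidy.csv"  # the tidy derivative this script produces

def main():
    raw = pd.read_csv(RAW_FILE)

    # Example cleaning steps: normalize column names, drop fully empty rows.
    tidy = raw.rename(columns=str.lower).dropna(how="all")

    tidy.to_csv(TIDY_FILE, index=False)

    # Record the software and system used to go from A --> B, per Leek.
    print(f"python {sys.version.split()[0]} on {platform.platform()}")
    print(f"pandas {pd.__version__}")

if __name__ == "__main__":
    main()
```

The specific cleaning steps matter less than the shape: raw data in, tidy data out, environment recorded, and no hand-editing in between.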
In my opinion, one thing we miss with large, general categories of data is that data is created within a variety of situations – for example, completely unrelated to research, or as a by-product of research. A researcher's letters (emails now) are rich data for historians, future researchers in the same area, and even for genealogical research. Imagine someone taking your emails now and preserving them as valuable data. I personally would be appalled by that notion because I am real salty in my emails, but it's basically the foundation of whole fields studying the history of scholarship. I'll talk about it more in week 6 when we discuss reproducibility, but Sir Isaac Newton wrote some really great salty letters that have given us an interesting glimpse into how reproducibility manifested during his time.
Another example of the breadth of forms data can take, even when we don’t immediately clock them as data, is from a guest blog post I wrote for the National Museum of Natural History field book project, during my time as an NDSR at the American Museum of Natural History:
While the scope of my project is in the digital realm, I am constantly shown the value of field books and older scientific texts through conversations with science staff. All the scientists at the AMNH are as passionate about our historic collections of field notebooks as they are about their own field notes […] During my interviews, many scientists have expressed to me that the most important data from their work are actually their field notes–the majority of which are still done with good old fashioned pencil and paper.
I go on to describe an encounter with a curator in the mammalogy department at the AMNH, in their historical fieldbook collection:
As I flipped through some of the newer, less vulnerable books he told me he often comes into this section of the archives to examine old accounts of expeditions, which tend to include species descriptions, and descriptions of environments that have changed drastically in the intervening years. He told me sometimes he visited these books as frequently as once a day because the information within these hundred year old volumes is so helpful to his research.
These fieldbooks were meant as a record of a research expedition – not as research material itself. Yet they have become a trove of data for researchers both within and outside the AMNH. All this underscores the main point: data can be a LOT of stuff, and a good portion of a patron-facing job will be convincing patrons that their materials, including what they consider ephemera, are likely important and should be well-documented and preserved. However, we have to think critically. The assumptions behind individual work can deeply influence the way data is created, gathered, procured, or otherwise generated. Borgman (2011) discusses two opposite ends of the spectrum (her spectrum, I'd add) at length:
Exploratory investigations: pursuing specific questions, usually about a specific phenomenon
One example is biological research that involves collecting water samples from the same beach to look at bacteria. And then there’s also:
Observational investigations: systematically capturing the same set of observations over long periods of time to propose a new theory or interpretation of natural phenomena
One prominent example of this is climate modeling.
When working with data, understanding the origins, assumptions, and methods involved in its creation will help frame how you (and your patrons!) can use it or collect it (in the library sense). Some good questions to ask whenever you're encountering a fresh dataset are:
- What are the potential sources of bias in this data?
- How was the data collected? Was it ethically collected?
- What is the strongest argument for using this data?
- What is the strongest argument against using this data?
Once you understand how the data was created, for what purpose, with what biases, and under which methodologies and frameworks, then you can begin to work with it or assist others in working with it. And each step in the data lifecycle has its own set of questions:
Planning for data | Processing data | Analyzing data | Preserving data | Publishing data |
---|---|---|---|---|
how will we manage this data?<br>what data sources will we use to get the data?<br>what format will the data be in?<br>how will we collect this data? | how will we check, validate, or clean the data?<br>how will we describe that process?<br>how will we describe the data? | how will we interpret the data?<br>what research outputs will be produced?<br>what format will the data be in?<br>how will we ready this data for publication? | what is the best archival format for our type of data?<br>what needs to be preserved alongside our data to make it useful to others?<br>what type of metadata and documentation do we submit with it? | which repository or archive is the right one for our data?<br>how will we make sure our data is indexed widely?<br>how can we get credit for sharing our data? |
STORAGE & BACKUP | —> | —> | —> | —> |
METADATA & DESCRIPTION | —> | —> | —> | —> |
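Because storage & backup runs underneath every stage of the lifecycle, it's worth making one piece of it concrete. Below is a minimal sketch of a fixity check, i.e. comparing checksums to confirm that a backup copy of a file hasn't silently changed, using only Python's standard library; the file paths are hypothetical.

```python
# fixity_check.py - a minimal sketch of a fixity (checksum) check,
# a storage-and-backup safeguard relevant at every lifecycle stage.
# File paths are hypothetical.
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

original = Path("data/raw/survey_raw.csv")
backup = Path("backup/survey_raw.csv")

if sha256_of(original) == sha256_of(backup):
    print("Backup verified: checksums match.")
else:
    print("WARNING: backup differs from the original!")
```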
But no matter which stage the data is at in the lifecycle, I think it needs the following at a minimum:
- Metadata, or structured information about the data that describes its contents and structure (a small sketch of what this can look like follows this list).
- Codebooks or documentation about the specifics of the analysis, such as variable names, participant tracking, or research workflows.
- Reliable storage and backup.
- The tools used to create, modify, and analyze the data.
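To make that minimum concrete, here is a sketch of what a small, machine-readable metadata record might look like, written out as JSON. The field names and values are illustrative only; a real project would follow a community standard (e.g. DataCite or a discipline-specific schema).

```python
# make_metadata.py - a sketch of writing a minimal metadata record to sit
# alongside a dataset. All field names and values are illustrative.
import json

metadata = {
    "title": "Beach water bacteria samples",       # hypothetical dataset
    "creator": "A. Researcher",
    "date_collected": "2024-06-15",
    "description": "Weekly water samples from a single beach site.",
    "files": ["survey_raw.csv", "survey_tidy.csv"],
    "codebook": "codebook.md",                     # where variables are defined
    "software": ["python 3.12", "pandas 2.2"],     # tools used on the data
}

with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

Even a lightweight record like this travels with the data and answers "what is this?" for a future user (or for you, six months later).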
But we’ll discuss those in depth more in the course to come. I hope your curiosity has been even more piqued, because we’ll be doing deep dives into a lot of the topics I’ve given an overview for tonight. Next week we’ll examine the different types of data, such as quantitative, GIS, “big”, and qualitative. Looking forward to seeing you for our live discussion!
Homework 01
Pick one of these 'Collections as Data' personas (do not pick one that reflects your current or past realities!):
Then evaluate these four digital objects to answer the question, “is this data?”, from the perspective of your persona (e.g. imagine you are a data journalist and say if each one is data or not):
Your evaluation should follow this template for each digital object:
Name of object:
Creator:
Date Accessed:
Briefly describe the digital object.
Why is this or is this not data?
What are the potential sources of bias in this object?
Is the data ready-to-use or does it require more work to make it usable?
Is there any accompanying material to help secondary users understand what it is? If so, please describe and link to it. If not, describe what documentation or metadata might help make it useful for others.