Because these collections are extensive and spread across multiple repositories at the University, making changes is not a simple task of updating a few database records; it requires institutional coordination and changes implemented in multiple systems by multiple people. To address this challenge, I worked with the team to develop tools for understanding the metadata across different repositories and for identifying pathways to address the problems the project has identified. We quickly realized that bulk editing and analysis approaches would usefully complement the work. By analyzing the collections as a whole, we hoped to develop understandings that could inform or confirm our analyses, particularly the insights and recommendations arising from the analysis of harmful terminology. Specifically, we hoped to develop tools for analyzing, in aggregate, the hundreds of descriptive records identified during our collections survey. While much work in the rest of the project has focused on collection inventory and analysis, we also hoped to create methods that could assist collections managers in analyzing and understanding collection metadata, with the ultimate goal of creating reusable, code-based tools that could assist our project and others in changing or updating collection descriptions.
The project team identified more than two hundred finding aids describing materials at the University of Michigan related to the Philippines. These documents provided the basis of a dataset for analysis. The finding aids came from three repositories on campus: the Bentley Historical Library, the Special Collections Research Center, and the Clements Library. All provided data in Encoded Archival Description (EAD), a text-based markup format expressed in eXtensible Markup Language (XML), a standard data format frequently used to share metadata. The analysis of the data was undertaken by graduate student Ella Li (School of Information) and me. We worked together to develop a series of analysis tools, aiming to produce useful data visualizations and to begin creating tools that could be used or repurposed to make changes to the descriptions. We developed analysis modules using the Python programming language and widely used analysis and visualization tools. The resulting code is available, and can be reused or repurposed, through a series of interactive code examples now available on GitHub.
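To give a sense of the aggregate approach, here is a minimal sketch, not the project's actual code, of how Python's standard library can tally terms across the descriptive fields of a folder of EAD files. The directory name and the term list are hypothetical placeholders:

```python
# A minimal sketch for tallying terms across EAD finding aids in aggregate.
# The directory name and term list below are hypothetical placeholders.
from collections import Counter
from pathlib import Path
import re
import xml.etree.ElementTree as ET

EAD_DIR = Path("finding_aids")    # folder of EAD XML files (placeholder)
TERMS = ["native", "primitive"]   # placeholder terminology list

def descriptive_text(xml_path):
    """Yield the text of common descriptive elements, ignoring namespaces."""
    for el in ET.parse(xml_path).iter():
        tag = el.tag.rsplit("}", 1)[-1]   # strip any '{namespace}' prefix
        if tag in ("unittitle", "abstract", "scopecontent"):
            yield " ".join(el.itertext())

counts = Counter()
for ead_file in sorted(EAD_DIR.glob("*.xml")):
    for text in descriptive_text(ead_file):
        for term in TERMS:
            counts[term] += len(re.findall(rf"\b{re.escape(term)}\b", text, re.I))

for term, total in counts.most_common():
    print(f"{term}: {total} occurrence(s) across the corpus")
```

A tally like this is only a starting point, but it shows how a few lines of code can read hundreds of records at once rather than one finding aid at a time.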
While many recent attempts to control library collection priorities have focused on the control of social and cultural narratives, there is another threat to library collections: the refusal to admit that a library should ever get rid of a book! Let’s call it “book fever.” This is the sort of thing that possesses a woman in a small Ontario city to get more and more (and more!) books even though she has no space to store them. As the CBC reported in January 2022, she sounds like a bookseller gone obsessive: she owns a book store but now “has a barn and two other farmhouses on the same property full of donated” books. That might not be a problem were there some use for these books, yet “The books are essentially in deep storage, inaccessible to customers and unsearchable by the bookstore’s three-person staff since most of them aren’t catalogued.” She calls it a “tsunami of books” and says she has “no place to store them.” Yet it appears she can’t stop accepting books from the local Salvation Army, which receives hundreds of books a week.
While clearly these books are of no interest to their previous owners, there doesn’t seem to be any indication of demand elsewhere. Is this like food waste, where inefficient distribution and overproduction lead to thousands of pounds of waste every year yet many people still don’t have enough food? You see, books are things, but ideas aren’t quite the same as things: there is a zero-sum aspect to things - you either have a thing or you don’t, and if you give it to someone else, they have the thing - ideas, on the other hand, can be shared but simultaneously kept. I suspect that what these book hoarders want is to share knowledge, but they’re confusing that with the keeping of books.
A few librarians slid into this conversation on twitter. Mary Cavanagh pointed out that while one of Ranganathan’s laws may be “to every book its reader,” that doesn’t mean mindless hoarding; instead, recycling and deaccessioning unused books is “ethical, necessary, and practical.”
While 1st Ranganathan rule may be, ‘to every book a reader,’ 2nd rule of librarianship is that recycling old unwanted books is ethical, necessary, and practical. To the bin. https://t.co/lbrKbyCRs7
— Mary Cavanagh (@mfcavanagh) January 16, 2022
Another, David Kemper, tweeted that such hoarding is a “disservice” and one respondent pointed out that some of these books came from public libraries where presumably people “won’t even borrow [them] literally for free.”
Agreed. I see some public library stickers on one book there, so there are books that people won't even borrow literally for free. I challenge anyone to detect even one book that is worth reading in those pictures.
— Ryan Deschamps (@RyanDeschamps) January 19, 2022
Another library perspective noted “Books are not inherently sacred. . . . Weeding collections is a good thing.”
Apparently it needs to be said again. Books are not inherently sacred. Books have a lifespan. Recycling books is not a bad thing. Weeding collections is a good thing. Books are not inherently sacred. Stop acting like they are. https://t.co/QQ5Mo4cjLo
— Natalie (@InkyLibrarian) January 16, 2022
Presumably, should someone show up at these aforementioned barns and want to take one of the books, the hoarder should be happy, right? But it doesn’t seem like anyone is stampeding to the place. Captain Obvious on twitter suggested a free warehouse, which was duly laughed off the stage.
LOL! OK, OK, I give up, librarians. It's Monday morning and I have no time to fight for the idea of a free warehouse of cast-off books. I get it: you know all about this problem, etc., and you don't have money to solve it, etc.
— Captain Obvious (@lsanger) January 17, 2022
I don't have time to engage further. I'm done.
(By the way, did you mean a public library, where we share the costs of maintaining the collection so more of us can use the things and, more importantly, find what we’re looking for, not some random castoffs?)
The problem is real. In December, Karen Heller wrote about Fran Lebowitz’s collection of books in The Washington Post. Heller noted Wonder Books, a warehouse of 6 million books in Frederick, Maryland, that continues to receive 300,000 volumes every month. The books sit in literal piles and bins. Like the Canadian book tsunami, these Americans were “drowning in old books.” It may be that nobody likes to throw away a book, but there remain a lot of books out there that aren’t very good. Huge runs of hardbacks are published and immediately put on sale in big box stores, where most will go unread. And there are a lot of books that might benefit from reconsideration. The Awful Library Books site, with the tagline “Hoarding is not collection development!”, has many to choose from. For example, unless yours is a collection devoted to the history of book organization and cataloging (there are such collections!), your high school library should not retain a copy of the ALA Rules for Filing Catalog Cards from 1942, unless you’re preparing your kids for a career in WWII-era libraries. It might be a good reference point for the students in my “Introduction to the Organization of Information” class who need to know about the history of the profession, but probably not for many others. Plenty of institutions do have a copy and the resources to keep it; it’s not doing anyone any good lost in a barn!
Book hoarders may have overlooked Ranganathan’s first law, which is: books are for use.
Whether you call it tsundoku, bibliomania, or just plain book fever, the activity of collecting and storing books is clearly a human passion. But when it gets too focused on the possession of books qua books, we have lost focus on the reason that we make libraries: to promote knowledge, learning, and community engagement.
A previous libraries playlist, unfortunately missed, could’ve focused on the “cheese slice bookmark” tweets, which were in high circulation around January 2020 when the University of Liverpool Library tweeted a photo of a packaged cheese slice found stuffed between the pages of a returned book. It was covered by Know Your Meme, though it’s now under review (presumably for being too “niche”?).
Exciting to see this out! It was a privilege to work with @archivalflip and @CLIRgrants on the program assessment for Amplifying Unheard Voices! There's a lot for funders here, & there's lots of advice for cultural heritage grantseekers too! https://t.co/awm9s8GEI9 pic.twitter.com/R7TBlVAmnY
— Jesse Johnston 💻🖋😻🐕🎓🌈🐝☮️ (@jesseajohnston) February 16, 2023
In their description of the report, CLIR writes:
This report summarizes a yearlong program assessment of “Amplifying Unheard Voices,” a major revision of CLIR’s Digitizing Hidden Collections grant program. The revision sought to expand the reach and appeal of the program to a broader range of institutions, including independent and community organizations, and to emphasize the digitization of historical materials that tell the stories of groups underrepresented in the digital historical record. Significant changes were made to the application structure, new applicant support resources were created, eligibility was expanded to Canada, and new thematic emphases and program values were added. The assessment was based on a series of qualitative data-gathering activities that included stakeholder groups and staff. Through surveys and interviews of applicants, inquirers, proposal reviewers, and staff, the authors provide a holistic view of the program, offer a series of recommendations, and identify areas for further attention.
The full report and accompanying data are available from CLIR as a pdf.
“We see further potential to increase equity in funding programs and representation of community stories in the digital historical record.”
Digitizing Hidden Collections (DHC) is a major funding program that has supported the digitization of unique historical collections since 2015. Grants are administered and awarded by CLIR, with funding from The Andrew W. Mellon Foundation. In 2020, CLIR worked with Mellon to adapt the program so it could better serve organizations that seek grants less frequently and emphasize the digitization of historical collections that tell the stories of groups underrepresented in the digital historical record. The report summarizes my year-long work with Ricky Punzalan to assess the resulting program, “Amplifying Unheard Voices.” Through the 2021 program revision, CLIR aimed to expand the reach and appeal of the program to a broader range of institutions, including independent and community-based organizations. The revision implemented significant changes to the application structure, created new applicant support resources, expanded eligibility to Canadian applicants, and added new thematic emphases and stated program values.
Enthusiasm for the program is high. The changes in the 2021 DHC:AUV iteration were warmly received by many potential applicants, including organizations that are not frequent grant seekers for collections-related activities as well as many organizations that had previously applied to DHC. The revised program was recognized as a critical funding resource, unique in its newly articulated support for collections digitization in conjunction with social justice priorities. These interests are clearly expressed in the program values and benefit the preservation of and access to more representative digital collections and records. CLIR’s resource materials for applicants were praised highly for their clarity, comprehensiveness, and approachability, and for being readily usable and accessible. The expanded membership of the review panel represented expertise well suited to evaluating the new group of applicants and proposals received. Overall, the program’s accessibility, the appeal of a call for proposals emphasizing underrepresented perspectives in collections, and the continuing support for digitization were welcomed and well received. Even among those interested in the program who elected not to submit applications, more than half hoped to submit in future competitions if given the option.
Alongside these positive elements, we identified areas in which the program would benefit from further attention as it moves ahead:
We conclude the assessment with optimism about the program’s possibilities but also with an awareness of the significant work required to maintain and improve such funding programs. We note the high enthusiasm for increased support of community-based memory initiatives that will diversify the historical record and make that record more digitally available. At the same time, the assessment reveals challenges of funding digitization projects in cultural heritage: the significant time required for design, implementation, and management of multiyear programs; the limitations of project grants; and the challenges of making incremental yet responsive changes within a longstanding program.
The project revealed enthusiasm for and potential of the future of DHC:AUV, but more broadly, we see further potential to increase equity in funding programs and representation of community stories in the digital historical record.
This site’s posts share a boilerplate introduction kept in a template file (`toc_mapping_humanities_data.md`). Jekyll provides a capability that allows me to include this boilerplate text so that it appears on any page of the site where I put the code to reference it. I had referenced the template in a handful of posts, but I soon realized that “mapping humanities” wasn’t really correct; the main point, in fact, was humanities data curation, so I renamed the template `intro_humanities_data_curation.md`. Now, I needed to update each post where I had included the template with the new template name.
It would be possible to make these replacements by hand, but it seemed quicker to ask the computer to do this for me. I also thought that it might be useful to show how to use the command shell to make this update in a quick and consistent way. The basic steps would be:

1. Create a pattern that matches the include line to be changed.
2. Find all of the posts that contain that line.
3. Edit each file to replace the old template name with the new one.
Each of these steps can be done with a command shell tool: the first using a regular expression, the second using grep and find, and the third using a stream editor like sed, which can apply the regular expression to make the changes.
Each of the files where I used the introductory template included the following line:
{% include toc_mapping_humanities_data.md %}
This line told the Jekyll site generator to insert the template at this point. To include the new template, I needed to replace the above with the following line:
{% include intro_humanities_data_curation.md %}
To do this, I created the following regular expression, which matches the portion of the line that I wanted to change. Using grouping, I could tell the command which parts to keep and which to replace:
(include )[a-z_]*(\.md)
Now, `grep` and `sed` come into the picture. I used `grep` to identify all of the lines where the include command occurs:
grep 'include [a-z_]*.md' _posts/202[012]-[01]*.md
To make the replacement, I used the `sed` command. `sed` is a “stream editor” designed to replace matches in its input based on pattern matching such as regular expressions.
To test `sed`, for example, we can pipe in a specific input with a command like this one:
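echo '{% include toc_mapping_humanities_data.md %}' | sed -E 's/(include )[a-z_]*(\.md)/\1intro_humanities_data_curation\2/'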
This is useful for testing the regular expression patterns, too. The `-E` option specifies extended regular expressions, which are necessary for features like group matching. In this case, the parentheses in the first pattern set groups, which are then referenced by number, starting with 1, in the replacement string (i.e., `\1` and `\2`).
So the sed command may be:
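sed -E 's/(include ).*(\.md)/\1intro_humanities_data_curation\2/' _posts/202[012]-[01]*.md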
This will print the entire modified stream (the file contents) to the terminal window or display. If you review the contents, you can confirm the change has worked. Now the changes need to be written back into the files themselves, if you want to make the actual edits.
To do this, we can use the `find` command to search for the specific files, then run the `sed` command on each of them:
find _posts -type f -name '202[012]-[01]*.md' -exec sed -i '' -E 's/(include ).*(\.md)/\1intro_humanities_data_curation\2/' {} \;
The above illustrates some of the power of the `find` command. Here it searches for each of the markdown files whose name matches the given pattern, then the `-exec` option runs the `sed` command with the regular expression created above on each file. The `-i ''` option tells `sed` to edit the files in place (the empty string, required on MacOS, means no backup files are kept).
Aside: This `find` construction uses what looks like a regular expression to select certain files, but this is in fact a different sort of pattern matching. While similar to regex, it is an example of “filename expansion,” also known as “globbing,” an approach to pattern matching that resembles regex but is specific to file paths (see Globbing). To search with a true regular expression instead, `find` offers the `-regex` option. While a more limited pattern-matching option, globbing can do a lot. For example, to find posts from months before August (01 through 07), try the `find` command: `find _posts -type f -name '202[012]-0[1234567]-*'`
So to sum up: using this one-line command, developed piece by piece, I quickly updated all of the include statements in all of the posts on the site.
For reference, here are the other essays in the series:
Encoding Reparative Description: Preliminary Thoughts (17 September 2023)
Find and replace data in the shell (11 September 2022)
Wrangling Humanities Data: An Interactive Map of NEH Awards (17 July 2022)
Wrangling Humanities Data: Using Regex to Clean a CSV (27 February 2021)
Wrangling Humanities Data: Exploratory Maps of NEH Awards by State (22 January 2021)
Wrangling Humanities Data: Cleaning and Transforming Data (19 January 2021)
Wrangling Humanities Data: Finding and Describing Data (20 December 2020)
This installment uses the geospatial dataset previously created and describes how to display the data in an interactive map on the web. As in the previous post, you can also download a version of this post from the GitHub repository along with all of the data discussed here. File references discussed below are included in the same neh-grant-data-project repository.
Although my previous essays have explored various data-related topics, this post continues a theme of mapping humanities data. Previous installments walked through the process of preserving, transforming, and visualizing this data, which is a list of grants awarded by the NEH during the 1960s (the agency’s first five years). This post demonstrates how to create interactive, web-friendly maps more familiar to everyday users. The goal is to plot the grant information from the 1960s on a map background that allows zooming in and out, repositioning, and the option to click on each point to get information about the referenced grant.
Rather than building the map from scratch (as previously), this demonstration uses widely used, open, and pre-existing code libraries to generate the map with the desired features. The primary tool is Leaflet.js, a javascript library that will knit together the map tiles and the grant data.
Leaflet provides a library of tools, written in javascript, that help to display geospatial information via a web browser. By combining this library, which builds on common frameworks and code approaches to web publishing, with the geospatial grant data, we can create, display, and share the map. The elements that power the map include:

- an html page that loads the Leaflet library and provides a container for the map
- a css file that styles the page and the map container
- a javascript file that builds the map, loads the data, and creates the popups
- the geojson grant data created in the previous installment
- a basemap of tiles drawn from the OpenStreetMap project
A similar process is outlined in a clear and approachable way by Kim Pham in “Web Mapping with Python and Leaflet,” Programming Historian 6 (2017), doi:10.46430/phen0070. Pham’s tutorial starts by setting up all of the map elements in one file (with a different dataset), then splits them out into three files as below. If you want to see how this would look in one file, refer to the above tutorial.
Because the html and css elements of this step are the shortest, we will go through those first. Below is a walkthrough description of each file.
The first file is an html file (`basic-map-neh-1960s-leaflet.html`).
In the `<head>` section, this file calls the javascript and css files that are required for operating the Leaflet library. The header also pulls in a custom css file for our map (see the next file walkthrough).
In the `<body>` section, the html includes only one empty div with an `id="map"` attribute. This tag is all we need: it gives leaflet a place in the page on which to hook the full map.
Finally, just before the closing `</html>` tag, the file references the javascript file that we will use to create the map (see the subsequent file walkthrough).
<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" href="https://unpkg.com/leaflet@1.8.0/dist/leaflet.css" integrity="sha512-hoalWLoI8r4UszCkZ5kL8vayOGVae1oxXe/2A4AO6J9+580uKHDO3JdHb7NzwwzK5xr/Fs0W40kiNHxM9vyTtQ==" crossorigin="" />
<script src="https://unpkg.com/leaflet@1.8.0/dist/leaflet.js" integrity="sha512-BB3hKbKWOc9Ez/TAwyWxNXeoV9c1v6FIeYiBieIWkpLjauysF18NzgR1MBNBXf8/KABdlkX68nAhlwcDFLGPCQ==" crossorigin=""></script>
<script src="https://code.jquery.com/jquery-2.1.4.min.js"></script>
<link rel="stylesheet" href="basic-map-neh-1960s-leaflet.css" />
</head>
<body>
<div id="map"></div>
</body>
<script src="basic-map-neh-1960s-leaflet.js"></script>
</html>
Next is a short css file (`basic-map-neh-1960s-leaflet.css`). This file provides styling information to the browser about how to display the map. Most important is the block for the `map` id, which tells the browser where to display the leaflet map.
body {
margin: 0;
padding: 0;
}
#map {
position: absolute;
top: 0;
bottom: 0;
width: 100%;
height: 750px;
}
The file that pulls this all together is the javascript, which calls the leaflet functions to create the map, loads the grant data, and creates individual markers for each grant point (`basic-map-neh-1960s-leaflet.js`).
The file opens by calling a function via `window.onload`. This means that each time the window is loaded (or reloaded), the browser will execute the instructions to draw (or redraw) the map.
Next, we create a basemap variable (`var basemap`), which provides information about the underlying map layer. In this case, we draw the tiles from the OpenStreetMap project and provide attribution.
Then, the `$.getJSON` command loads the geojson data (created previously).
Using leaflet’s functions (recognizably prepended with `L.`), the file gives instructions for parsing each geojson element into points. The majority of this section is a series of filters that provide information for displaying the text (e.g., how to display the integers as US dollars) or correcting missing information (such as unlisted Institution fields). At the end of this block, in the `layer.bindPopup` statement, a formatted string creates the text of the popup for each grant on the map.
Finally, leaflet is instructed to draw the map. The view is set (zoom level and the latitude and longitude to center the view). And, the basemap and geojson are added as layers to the interactive map.
The next section explains how to use python locally to view (serve) these files and explore the interactive map as it could appear if published to the web.
window.onload = function () {
var basemap = L.tileLayer('http://{s}.tile.osm.org/{z}/{x}/{y}.png', {
attribution: '© <a href="http://osm.org/copyright">OpenStreetMap</a> contributors'
});
// retrieve the geojson data
$.getJSON("neh_1960s_grants.geojson", function(data) {
// set popups for each point
var geojson = L.geoJson(data, {
onEachFeature: function (feature, layer) {
// set content formatting to format and correct missing information for popups
if ( feature.properties.Institution == null ) {
feature.properties.Institution = 'an unaffiliated, independent scholar'
}
if ( feature.properties.Participants == null ) {
feature.properties.Participants = 'unlisted'
}
if ( feature.properties.ProjectTitle == null ) {
feature.properties.ProjectTitle = 'unlisted'
}
let dollarUS = Intl.NumberFormat('en-US', {
style: 'currency',
currency: 'USD',
})
// create the popups
layer.bindPopup(
`<p>In ${ feature.properties.YearAwarded }, ${ feature.properties.Institution } (in ${ feature.properties.InstCity }, ${ feature.properties.InstState }) was awarded ${ dollarUS.format(feature.properties.AwardOutright) } for <a href="https://securegrants.neh.gov/publicquery/main.aspx?f=1&gn=${ feature.properties.AppNumber }">NEH project number ${ feature.properties.AppNumber }</a>.<br /><br /><strong>Project Title:</strong> ${ feature.properties.ProjectTitle }<br /><strong>Project participants:</strong> ${ feature.properties.Participants }<br /><strong>NEH Program:</strong> ${ feature.properties.Program }<br /><strong>NEH Division:</strong> ${ feature.properties.Division }</p>`
);
}
});
// set up the map, set viewport
var map = L.map('map')
.setView([37.90, -94.66], 4); //continental US view
basemap.addTo(map);
geojson.addTo(map);
});
};
After the files are created, we can use the python HTTP server to display them locally and see how they might appear on the live web. For this, use the python3 `http.server` module. (Note: this should only be used locally for testing, not in a production environment.)
To display (and serve) the files with python, open a command shell, navigate to the location where the files are stored, then run:

python3 -m http.server

(With legacy python 2, the equivalent command was python -m SimpleHTTPServer.)
You can specify a server port if you like (for example, python3 -m http.server 8080), or you can use the default. When the server starts, you will see something like this displayed in the shell:

Serving HTTP on 0.0.0.0 port 8000 ...
Note the port number. Now, open a web browser. In the browser’s location bar, enter localhost (or 127.0.0.1) followed by the port number, rather than a usual URL. For example, if the port is 8000 as above, you would request either of the following addresses in the browser:

localhost:8000/

or 127.0.0.1:8000
If the files are working correctly, you should see something like a map dotted with blue markers representing the grant data.
It’s an exciting opportunity, and I’m looking forward to joining the work of advancing the innovative UMSI curriculum, which has already pioneered digital skills development for early-career librarians, archivists, and other information professionals. While I have some concerns about returning to academia in general—the inherent elitism and classism, as well as the overheated (and under-acknowledged) prestige economy don’t align well with my personal values—the opportunity to join UMSI specifically presents a chance to facilitate training for archivists and digital curators.
I have a longtime connection to the University of Michigan. It is my alma mater, and notwithstanding the shameful and disappointing sexual misconduct (to put it diplomatically) in the news recently, I remain optimistic that the university as a social institution still has potential to make the world better through education and knowledge production. I hope to be part of the effort to move the U and its culture in ethical, egalitarian, and inclusive directions.
I have a great regard for many of the positive social and cultural values that I see in the state universities of the Midwest, including educational breadth, increasing access to knowledge for more people, and sharing critical thought, culture, and knowledge through teaching. Aside from the problematic settler colonialism behind the Land-Grant College Acts, the fundamental focus of the state university on service to all people in the state appeals to me. That said, the University of Michigan in fact predates the federal land-grant legislation, having been envisioned through the 1817 Treaty of the Rapids with the Council of the Three Fires, which stipulated that the university should educate the descendants of the land givers, the Anishinaabeg and Wyandot. Although the institution has not yet realized that potential (though land acknowledgments have become more frequent and respectful), I believe that we can build a more ethical institution that stewards and answers the call to serve, educate, and inform the community of all people in Michigan.
I remain an idealist at heart and will be working to serve the values that align with mine, which are very clearly at the heart of UMSI’s work. The school’s current mission is to “create and share knowledge so that people will use information—with technology—to build a better world.” Among UMSI’s list of core values, I hope I can particularly support:
Teaching in a program like the one at UMSI is a chance to create a digital archives and curation curriculum that allows students to build on the digital and technological foundations that UMSI has invested in through its curriculum over the last decade, and to apply those skills to work in archives, digital curation, libraries, and other cultural heritage spaces. This type of curriculum development is something to which I have already contributed in my teaching at the University of Maryland, where I began building a curriculum of curation skills and tools for digital collections, and in my work at the Library of Congress, where I began a Library Carpentry series. I look forward to expanding those learning opportunities in this new role at UMSI.
Most grant and fellowship proposals are evaluated in a peer review process. Generally, program staff conduct an initial review. This review covers technical questions and may confirm that you are eligible according to program guidelines, that you follow budget rules, and other aspects that are “matters of fact.” A second audience that may read your application is oversight advisors, such as board members of a foundation, federal oversight boards, or others interested in assuring that funded projects advance the funder’s mission and programs. (If you’re applying to a competition that does not use a peer review process, try to find out from the funder how proposals will be evaluated.)
The most crucial audience to consider as an applicant is that of the peer review step. These readers are generally knowledgeable about particular disciplines, project methods or research approaches, or other areas that require specialist review. It is in this step that your proposal will be evaluated most critically to ensure that it is making significant contributions to the program area.
Federal funders frequently seek peer reviewers. The opportunity to volunteer as a reviewer offers a great way to get involved as a subject specialist. Each agency has a different process and may be looking for slightly different information. Here is a selected list of resources to help you or your colleagues indicate interest:
To get an idea of the general types of readers a particular funder may recruit, look at annual reports or other publications from the funder. Most applications to federal agencies are evaluated by federal advisory panels, which may be listed in agency reports, the Federal Register, or in the federal advisory committee database.
Knowing more about who will read your application is a great way to be actively involved in the funding process and to raise your understanding of the research landscape.
A giant container ship the length of four football pitches has become wedged across Egypt’s Suez Canal, blocking one of the world’s busiest trade routes.
At first, it seemed like just another inconvenience that comes up from day to day on the news, but as it became clear that the ship would be stuck for a while and we learned that it was potentially disrupting 10% of global trade, this situation received more and more attention. As photos emerged, the story caught fire and became a social media sensation and a great Internet meme. In an early example of remixing, twitter user @jdgtranen recaptioned the photo with the BBC logo, combining it with the lyrics from WAP:
when Cardi B said “ I want you to park that Big Mack Truck right in this little garage”....the Suez Canal FELT that 😤😤 pic.twitter.com/lz8xvYT7qW
— joshua gutterman tranen (@jdgtranen) March 26, 2021
Memes don’t always attract the attention of library twitter! But this one did, as soon as another photo started to get attention. It turns out the engineers didn’t really have a great high-tech solution for this problem; in fact, they were sending earth-moving machines to remove sand from the bow of the ship. Well, even a giant backhoe, digger, or dump truck is dwarfed by a ship the length of four football fields that can carry 20,000 truck-sized containers! The photo that caught attention is this one, of the lone digger:
Managed to dig out good part of the bulbous thingy. It's still stuck. #Evergiven #SuezCanal #Suez pic.twitter.com/zbeD59LA6V
— Guy With The Digger At Suez Canal (@SuezDiggerGuy) March 25, 2021
(Yes, there’s at least one account tweeting from the perspective of the digger @SuezDiggerGuy.)
The tiny digger, working away at its task of removing enough sand to free the freighter’s fifty-foot draft from the mud, became a great metaphor for various aspects of library work… here are a few notables from the past day.
An image meme from @vwyeth shows the paradigm: huge problems (here it’s the ship denoted with the text “Structural Problem”) confronted by tiny, possibly ineffective or mismatched, solutions (in this case, suggesting that one person making one small change won’t change “the system”):
Ok I like this one even better. pic.twitter.com/fRCRlTT3WO
— Vanessa Wyeth (@vwyeth) March 25, 2021
Or this, shared by @SgWingo, suggesting the hollowness of naive advice in the face of the overwhelming pandemic:
— Sarah Wingo (@SgWingo) March 26, 2021
There are a few that suggest library work, perhaps symbolized by the never-ending, often-thankless, and frequently-invisible work of maintenance, might be like the digger. Here @sonicstacey offers this picture as a metaphor for the inadequate budgets that memory institutions often make available for digital preservation:
not sure if I did this right pic.twitter.com/znJNoFYkSt
— 🖥 terminal boredom 💾 (@sonicstacey) March 26, 2021
Or, as @remembrancermx points out, maybe the freighter is the backlog problem. Archives in particular face many challenges in funding all the work needed to process collections, including weeding irrelevant materials, physical stabilization and preservation, creating usable information that describes the materials and helps users find them, and then the costs of caring for the materials and serving them to users over a long period of time. That takes a lot of time, people, infrastructure, and resources, yet many archives are short on staff and resources. Perhaps the problem is inadequate staffing?! Have you ever felt like the lone digger if you’re a “lone arranger”?
Did I do it right? pic.twitter.com/sHKQJBYJA1
— Yonah (@remembrancermx) March 25, 2021
There’s a second layer of memes embedded here, as many users tweet the image with something like “hope i did this right…” encouraging others to respond with other images, encouragement, or different takes.
For more takes on this meme, check out the “Suez Canal Jam” on Know Your Meme.
For reference, here are the other essays in the series:
Encoding Reparative Description: Preliminary Thoughts (17 September 2023)
Find and replace data in the shell (11 September 2022)
Wrangling Humanities Data: An Interactive Map of NEH Awards (17 July 2022)
Wrangling Humanities Data: Using Regex to Clean a CSV (27 February 2021)
Wrangling Humanities Data: Exploratory Maps of NEH Awards by State (22 January 2021)
Wrangling Humanities Data: Cleaning and Transforming Data (19 January 2021)
Wrangling Humanities Data: Finding and Describing Data (20 December 2020)
Have you heard of regular expressions and wondered how to make use of them? This post is for someone who has asked that question. It assumes a basic understanding of “regex” and shows how to use a full-featured text editor to clean up plain-text data. The data in question comes from a larger project, which pulls bibliographic data from a major citation database in CSV form, transforms the data and extracts certain elements (DOIs of publications), then feeds the information into Zotero to create a shared bibliography. At some point in summer 2019, the CSV files began to include new fields that contained line breaks and non-text characters, which broke my data workflow. A process I could previously run directly on the output from the database now required cleaning up the CSV’s formatting errors first. At first, this was not too onerous - a few lines to delete - but after a month, it grew to hundreds of lines in a CSV with thousands of lines. I needed a way to quickly search for the error patterns and fix as many problems as possible in a batch. I decided to explore regex as a solution. Around the same time, I taught a workshop that included an overview of regular expressions, and someone asked for a “real world” use case that could illustrate how to implement regex in a functional way. This is that use case.
The use case comes from an ongoing project prompted by the Covid-19 pandemic. Only a couple of months in, the pandemic had already brought about major reorientations in health and bioscience research. This first affected researchers in epidemiology and the health sciences, who immediately began to work on new vaccines, but it quickly grew to include research into the effectiveness of public health measures and into the direct impact as well as cascading effects of the disease on particular communities, among many other areas. At large research universities, active research into the novel coronavirus immediately ramped up, and many existing research projects and labs were reoriented to investigate the virus and the disease. This resulted in a flood of medical, scientific, and social research publications. At the University of Michigan, I began a project to track these publications starting in April 2020. As with so many things involving the Covid pandemic, our work has been responsive, quickly adapting to the situation; over the last ten months, we have honed the process of identifying citations, identified multiple ways to present the list of publications, and reworked the workflow for gathering and processing the data.
In this post, I will explain how I’ve been using advanced text editors and pattern-matching routines to parse and clean the data we’re gathering. Specifically, I will demonstrate how I use Visual Studio Code (aka VSCode) to clear up some data quality issues, using multi-cursor editing and regular expression strings to identify patterns for correction. If you are looking for a text editor, the wikipedia comparison of text editors is a good place to start; in the past, I have used TextWrangler, Sublime, Brackets, and Atom, but at present VSCode is an excellent option. In the future, I plan to add another post or two explaining the data workflow in more detail, since this is only one of many steps. The outcome of the workflow is the list of publications included in a publicly available bibliography at https://myumi.ch/3qnOG (via Zotero).
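To give a sense of where the cleaned file goes next, here is a minimal sketch of the DOI-extraction step in Python. This is not the project’s actual code: it assumes the CSV has already been cleaned, and the filenames are hypothetical placeholders (as discussed below, the DOIs sit in the third column of the export).

```python
# A minimal sketch of the DOI-extraction step, assuming a cleaned CSV.
# The filenames here are hypothetical placeholders.
import csv

with open("citations_clean.csv", newline="", encoding="utf-8") as src, \
        open("dois.txt", "w", encoding="utf-8") as out:
    reader = csv.reader(src)
    next(reader, None)  # skip the header row
    for row in reader:
        # the DOI sits in the third column of this export
        if len(row) >= 3 and row[2].strip():
            out.write(row[2].strip() + "\n")
```

A list of DOIs like this can then be fed to Zotero’s “Add Item by Identifier” feature to build out the bibliography.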
Okay, let’s get into text editors and regex!
Displayed above is the CSV file as it appears in VSCode, which I discuss in more detail below. Note that the rows are easy to distinguish, and the individual fields noted in the header row are color-coded, which makes the data somewhat easier to read in the lower rows. Various extensions can be added to VSCode to aid in processing CSV files; they are available in the VSCode extension “marketplace” (note that most of the CSV extensions are free). In the illustration above, I am using the “Rainbow CSV” extension to make the “cells” more visible to a human reader - it gives each field in the line a different color, which is much easier to see than a tiny comma.
One of the first things I noted when viewing the file is that the first line is not one that is necessary for my project (it provides information about the file and the specific search string that was used to generate the result), and the second line contains the field headers.
So, I deleted the first line. In VSCode, you can select the line with the mouse, then delete it. Or, if you want to use the keyboard, position the cursor on the line you want to delete, then type `Ctrl+X` (`Cmd+X` on MacOS) to cut the whole line.
VSCode is full of handy keyboard shortcuts. If you use shortcuts frequently, there are lists of useful ones, like this one from the VSCode developers for Windows users, or this one for Mac users. Print one out and keep it beside your desk until you learn them all! Or see the many lists generated by other shortcut users, like this list from Deepak Gupta, for additional useful keyboard commands in VSCode. You can even create new shortcuts, called “key bindings,” within VSCode using built-in features.
Regular expressions are a useful set of pattern-searching techniques that allow you to find very specific patterns within text. For example, have you ever wanted to find things under both their British and American spellings - say, all of the places where the word digitize appears, knowing it might be spelled digitize or digitise? Regular expressions can help! If your search tool supports regular expressions, you could input the string `digiti[sz]e`, and it would match either spelling. Regular expression syntax is complicated and can be quite powerful, but I am only going to go into a few specific search expressions in this post. If you’re interested in learning more about regular expressions, or “regex” as they are often called, check out the introduction from Library Carpentry, or search for one of the many cheatsheets available online, such as this regex cheat sheet from MIT.
Many advanced text editors support the use of regular expressions for advanced searches. In VSCode, you can bring up the find window by typing `Ctrl+F` (`Cmd+F` on MacOS) or by opening the `Edit` pull-down menu and selecting `Find`. The find-and-replace console appears at the upper right corner of the VSCode window, and you can activate (or deactivate) regular expressions in searches by selecting the `.*` button at the right end of the search input prompt.
If you start to see patterns that go beyond matching a word or phrase, you may want to consider using regular expressions. For example, are there many lines that begin with blank spaces or letters when they should begin with numbers? Or are there some lines that may begin with letters, but only if they are followed by a line that begins with a number? If you have identified patterns like these, regex may be a tool you want to use. Regex also offers more advanced usage that supports selecting and changing certain patterns. (I will not discuss regex replacement features here, but you can learn more about that functionality in Bohyun Kim’s post at the ACRL TechConnect Blog.)
In this post, I am using basic regex to identify certain patterns that caused problems in a CSV document I received. When I opened the file, I noticed that many lines included unescaped commas, non-alphanumeric characters, tabs, or other strange formatting that made the CSV invalid. While fixing the CSV and preserving all of the data would require more refined regex work, I only needed to remove the errant lines while preserving the lines that were correct, plus the first three columns (I need a list of the DOI entries, which are in the third column). Here’s how I used regular expressions and VSCode to do that:
The lines that were not “broken” all began with a number; lines that did not begin with a number were what made the file an invalid CSV. To identify these lines, I searched for any line beginning with an upper- or lower-case letter that was not followed by a line beginning with a number:
^[A-Za-z].*\n(?!^[0-9])
Using VSCode, I selected each of these matches with the “multi-cursor” option. To select all of the matching lines, type `Shift+Ctrl+L` (or `Shift+Cmd+L` on MacOS). I made sure the cursor was at the beginning of each of these lines, then deleted the selected text to remove the unwanted lines using `Ctrl+X` (`Cmd+X` on MacOS).
Some lines were completely blank, or appeared to be. The following expression matches all lines that are empty or contain only whitespace characters. Once selected, again delete the selection in batch using the multiple-select method above.
^\s*$
Reviewing the remaining lines, I noticed that many were not blank, but they did begin with spaces or tab characters. The previous pattern did not match them since they were not blank lines. Most of the tabs (though not all) had been converted to sequences of 6 or 8 spaces. To catch this case, I created a pattern to look for 6 spaces at the beginning of the line (this also catches the cases that have 8 spaces), plus any characters following, up to a line break (`\n`). Then, using a negative lookahead (the parentheses at the end of the pattern), the pattern checks whether the following line begins with a numeral; if it does, the line is skipped and not matched. The reason for skipping such lines is to reduce the possibility of deleting needed information and fields.
^[\s]{6}.*\n(?!^[0-9])
This is how regexper visualizes the match.
Instead of using the delete-line method to remove the material, as before, I took a different approach this time. This pattern selects the entire line, so the multiple select grabs the entirety of the undesired content. To remove it, use the multiple select (`Shift+Ctrl+L` or `Shift+Cmd+L`), then hit the delete key. Bye bye!
There were still some lines that looked blank but turned out to contain literal tab characters. Regex allows you to look for these with `\t`, so I searched for any cases where the character occurred at the beginning of the line and paired this with the lookahead to avoid matching any lines followed by a line of needed content.
^[\t].*\n(?![0-9])
Then select, delete, and buh bye!
At this point, a few lines remained with “odd characters” at the beginning, which turned out to be bullet point characters. After a few tries, I realized serendipitously that the following regex will select these lines:
^.\n(?![0-9])
This selects any single-character line that is not followed by a line beginning with a number. Then, select those remaining and delete! This time I used the remove-line method (`Ctrl+X`) again.
Now most of the blank or extra lines are gone, but there remain lines that begin in the middle of a cell. With the other lines gone, I can identify these by matching anything that doesn’t have a number at the beginning. Then, to prevent the deletion of content beyond the cell (delimited by a double quotation mark), the pattern stops when it finds a double quotation mark.
^[^0-9][^"]+
This is a good pattern for error-checking CSVs (if your lines begin with a numerical index): it matches lines not beginning with a digit, then matches up to a double quotation mark. This selects the text to delete, and it catches most of the stray abstracts that do not have multiple sentences.
There are still a few lines that need help. I identified these by searching for lines that don’t begin with a digit:
^[^0-9]
At this point, there are only a few (for February’s CSV there were only 3!), and these can be corrected one by one.
Finally, I use the CSVLint extension for VSCode, which checks for errors. There were a few lines missing a cell, which means that some of these patterns selected and/or deleted too much. These lines still had the most important information - the DOI - so I fixed each one by appending a comma to the end of the line, adding the missing empty cell.
Regular expressions are fussy and esoteric. It is not always clear why certain patterns match (obviously the computer knows, but it can take a while to decrypt what is happening even if you are the person who wrote the expression), and the patterns often don’t do quite what you think they will (note that I still had about 6 lines to fix by hand because of uncaught errors). That said, there is something fun in thinking through how the pattern will work, and whether it is going to match exactly the content that you want.
Perhaps that joy of finding and matching patterns is what some people find appealing about the game of “regex golf.” That’s basically a riddle game: given two groups of strings (say, the titles of Star Trek and Star Wars movies), you try to create a pattern that matches all of the items in one list but none of the items in the other. I’m not sure I would play the game, but after working through some “real world” examples, I can see the appeal (but also the frustration) in that work. To end, here is an XKCD about regex golf (I had to read the explanation):
I am not likely to choose this approach to cleaning up a CSV file in the future. Instead, I would use a tool specialized for CSV as a format. Nonetheless, this was a good way to practice regular expressions.
If you are interested in working with regular expressions, you may find these resources, including a few helpful validators and visualizers, of use: