Archiving the Epistolæ project

[Image: Miniature of Matilda of Tuscany from the frontispiece of Donizo’s Vita Mathildis (Vatican Library, Codex Vat. Lat. 4922, fol. 7v).]
— Wikipedia (https://en.wikipedia.org/wiki/Matilda_of_Tuscany)

Press release

by Sarai Vega and Esther Jackson

Columbia University Libraries is excited to announce the archiving of the Epistolæ project, a remarkable collection of medieval Latin letters to and from women, now accessible through the Columbia University Academic Commons.

Professor Joan M. Ferrante, Professor Emeritus of English and Comparative Literature at Columbia University, along with her colleagues, created the Epistolæ project and translated these letters. Working with the Columbia Center for New Media Teaching and Learning (now the Columbia Center for Teaching and Learning), Dr. Ferrante was able to develop this open database.

Covering the period from the 4th to the 13th centuries, this collection features letters in their original Latin alongside English translations, organized by the names of the women who wrote or received the letters. Each letter is complemented by biographical sketches. The collection has become an invaluable resource for students and scholars, offering a glimpse into these women’s lives and their roles in medieval society.

The Libraries portion of this project was conceived as a way to increase discoverability of and access to its materials by creating archival copies, each assigned a DOI (digital object identifier). As a result of this project, 2,329 works are now available through Columbia’s Academic Commons for view, download, and research.

Technical details 

by Sarai Vega

The main objectives of this repository project were to:

  1. Archive stable, citable HTML versions (galleys) of each letter and biography in Academic Commons, Columbia’s digital repository.
  2. Gather metadata about each of the letters and women from the existing Hugo website build, and convert it to CSV files, which could be imported into the cataloging system.
  3. Insert DOIs and other information into the HTML galleys, including the fact that the version in Academic Commons is archival.

I used Python to accomplish these objectives. Throughout the process of developing this code, the number of files, their formats, and their file types had to be considered. I focused on writing code that could automatically handle many files and inform the user when there was a problem. I explored the folder system of the Epistolæ website and researched how to work with HTML, HTML.md, and CSV files. For the code to collect metadata from and change HTML files, it was necessary to understand and use HTML syntax.
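
As a rough illustration, the batch loop might have looked something like the sketch below. The folder name is hypothetical, and the real code did its metadata extraction where the final comment sits:

from pathlib import Path

# Hypothetical local copy of the site; the real folder names differed.
site_root = Path("epistolae-site/content")

# Visit every HTML file under the site and report any that cannot be read,
# so that problem files are flagged instead of silently skipped.
for file_path in sorted(site_root.rglob("*.html")):
    try:
        text = file_path.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        print("Error:", file_path)
        continue
    # ...metadata extraction from `text` would happen here...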

Across the different groups and topics, the thought process behind the code stayed the same. Writing out the steps the code should take before typing it out helped me to clarify the process and catch any pitfalls early. 

The code opens the needed file and uses specific markers in the HTML syntax to extract the desired information. These markers include HTML tags (<p>, <main>, etc.), keywords, and dashes inserted in the text to separate sections. The extracted information is then saved to various lists and dictionaries.
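
To sketch what this marker-based extraction looks like in practice (the sample HTML and the “Name:”/“Dates:” labels below are hypothetical stand-ins for the site’s real markers):

# Hypothetical sample; the real pages used their own markers and labels.
html = "<main><p>Name: Matilda of Tuscany</p><p>Dates: 1046-1115</p></main>"

record = {}
# Keep only the content inside <main>, then scan each <p> paragraph.
body = html.split("<main>")[1].split("</main>")[0]
for chunk in body.split("<p>")[1:]:
    text = chunk.split("</p>")[0]
    if ":" in text:
        key, value = text.split(":", 1)
        record[key.strip()] = value.strip()

print(record)  # {'Name': 'Matilda of Tuscany', 'Dates': '1046-1115'}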

Lists and dictionaries are both used to store data; the main difference is how they index information. Lists are indexed by numbers starting from zero, while dictionaries store key-value pairs. While testing the code, I printed these lists and dictionaries to check that the right information was being extracted. If they looked correct, their contents were appended to a previously set-up CSV file. Appending to the end of the CSV file kept the metadata together, with one row per file.
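
A minimal sketch of that append step, with hypothetical column names and row values (Python’s csv module handles the quoting):

import csv, os

# Hypothetical columns and row; the real headers matched the cataloging template.
fieldnames = ["file_name", "name", "dates"]
row = {"file_name": "matilda.html", "name": "Matilda of Tuscany", "dates": "1046-1115"}

# Append one row per processed file so the metadata accumulates in one CSV,
# writing the header only when the file does not exist yet.
write_header = not os.path.exists("people_metadata.csv")
with open("people_metadata.csv", "a", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    if write_header:
        writer.writeheader()
    writer.writerow(row)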

Once the code was written, it was tested with a few files. This was done to check what the output CSV files would look like and how the code would handle different formats; the testing process often surfaced new information and errors. For example, while working with the “People” files, the code was originally written to handle one letter sent and one received. Even the column names in the CSV files were set up specifically for this format. However, when testing with a few randomly selected files, I found that the files had varying numbers of letters sent and received: some had long lists of letters sent or received, while others were empty. I then had to rethink the data collection process to accommodate these differences. Instead of expecting one letter sent and one received, the code now takes the sections listed under “sent” and “received” as whole blocks, without worrying about dividing up the information.
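
A sketch of that revised approach, with hypothetical section markers standing in for the real ones:

# Hypothetical page text and markers; the real pages used similar keywords.
page_text = """Biography and notes...
Letters sent:
- To Pope Gregory VII (1074)
- To Abbot Hugh of Cluny (1080)
Letters received:
"""

def section_after(text, keyword, stops):
    # Return everything after `keyword`, cut off at the next stop marker.
    if keyword not in text:
        return ""
    tail = text.split(keyword, 1)[1]
    for stop in stops:
        tail = tail.split(stop, 1)[0]
    return tail.strip()

sent = section_after(page_text, "Letters sent:", ["Letters received:"])
received = section_after(page_text, "Letters received:", [])
# `sent` holds the whole block of sent letters (many, few, or none), and
# `received` comes out empty here, mirroring the empty sections found in testing.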

I took precautions when collecting metadata to avoid errors, especially since it would take too long to check all the files by hand. These precautions include Python’s “try-except” syntax, which allows the code to exit a function smoothly if an error occurs: Python first runs the code under “try,” and if an error occurs, it runs the code under “except.” An example can be seen below:

try:
    with open(file_path, 'r') as h:
        [...Continued code...]
except (UnicodeDecodeError, IndexError):
    print("Error:", file_path)

The code under “except” changes depending on what was being worked on. For some functions, the code only prints the name of the file causing the error. For others, it prints the file name and adds a placeholder in the CSV file; this extra step allows the user to return to the CSV file and manually fill in the information. In the example above, the current file name is printed if either a “UnicodeDecodeError” or an “IndexError” occurs; I decided to specify these two errors since they appeared multiple times during testing. With the “try-except” syntax in place, errors did not stop functions from running, and it was never a mystery when one occurred. This level of testing before running the Python code on all the files was consistent across the entire project. The finished CSV files were then put into a Google Sheet shared with other Digital Scholarship staff.
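
For the functions that also wrote a placeholder, the pattern was roughly like this sketch (the extraction helper and the “CHECK MANUALLY” marker are hypothetical):

def extract_name(text):
    # Hypothetical helper; the real code used the markers described earlier.
    return text.split("<h1>")[1].split("</h1>")[0]

def collect_row(file_path):
    # Return a metadata row, or a placeholder row if extraction fails.
    try:
        with open(file_path, "r", encoding="utf-8") as h:
            text = h.read()
        return [file_path, extract_name(text)]
    except (UnicodeDecodeError, IndexError):
        print("Error:", file_path)
        return [file_path, "CHECK MANUALLY"]  # placeholder to fill in by hand

# Each row, placeholder or not, is then appended to the CSV as before.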

It took some trial and error to edit the HTML files properly. I copied the files and tested the HTML syntax, deleting and adding certain portions to see how they would affect the overall look when opened in a browser window. Similar to the metadata collection part, markers in the HTML syntax were necessary to ensure the inserted text would be put in the same position for all the files. Below is a sample of the code used to add the archival note:

[1] break_content = new_content.split("</body>")
[2] insert_txt = '<p>This is an archived work created in 2024 and downloaded from Columbia University Academic Commons.</p>\n</body>'
[3] final_list = [break_content[0], insert_txt, break_content[1]]
[4] final_content = ''.join(final_list)

When adding the archival note, I used the “</body>” tag towards the end of the HTML files as a consistent marker the code could find. Instead of editing the original files, I decided to copy the text, edit it, and write it into a new HTML file. In this case, the text was copied and then split into two parts at the “</body>” marker (as shown in line [1] above). The archival note with HTML syntax is in line [2]; the “<p> … </p>” tags were added so the note would match the style of the surrounding text. In line [3], the text is put into a list with the archival note between the two parts. Python’s “join( )” method in line [4] combines the elements of the input list, using the string it is called on as the separator. In this case, the strings from “final_list” are joined with an empty string (‘’) between them. After the text is joined together, it is ready to be put into a new HTML file.
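
That last step, continuing from the snippet above, might look like this (the output file name is hypothetical):

# Write the edited text to a new file, leaving the original untouched.
with open("archived_copy.html", "w", encoding="utf-8") as out:
    out.write(final_content)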

I included this note at the bottom of all the HTML files without any variation. However, to add the DOIs, the text had to change according to which file was being worked on. The DOIs are sorted according to file numbers and types in the shared Google Sheet. Once sorted, the DOIs are downloaded as a CSV file and uploaded to the coding workspace. The code then goes through each row and collects the DOI and corresponding file name. Using this information, the correct file is opened, and the DOI is added using a process similar to the one described above for the archival note:

[1] split_at = "</main>"
[2] doi_format = """<h2 class="mt-4">DOI: </h2> \n""" + str(file_doi) + "\n</main>"
[3] split_content = content.split(split_at)
[4] final = split_content[0] + doi_format + split_content[1]

The “<h2 class="mt-4"> … </h2>” tags in line [2] ensure the “DOI:” header matches the style of the other headers towards the bottom of the file. The variable “file_doi” holds the current DOI, and “str( )” is a Python function that converts its input to a text string. Also, as shown in lines [2] and [4], multiple Python strings can be joined into one string using the addition symbol (+). This allows the HTML syntax in the variable “doi_format” to remain the same, the only difference being the DOI number saved in “file_doi.”
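
Putting the pieces together, the loop that drives this insertion might look like the following sketch (the CSV layout and the output file names are hypothetical):

import csv

# Hypothetical CSV layout: a "file_name" column and a "doi" column per row.
with open("dois.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        file_name, file_doi = row["file_name"], row["doi"]
        with open(file_name, "r", encoding="utf-8") as h:
            content = h.read()
        # Same insertion as above: split at </main>, rejoin with the DOI header.
        doi_format = """<h2 class="mt-4">DOI: </h2> \n""" + str(file_doi) + "\n</main>"
        split_content = content.split("</main>")
        final = split_content[0] + doi_format + split_content[1]
        with open(file_name.replace(".html", "_archived.html"), "w", encoding="utf-8") as out:
            out.write(final)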

Keeping the code versatile and flexible was important in this process. The goal was to maximize efficiency while completing the main objectives properly. I am glad that I got the opportunity to work on this project and learn new skills and information!
