Website Design / Digitization, OCR, and Text as Data
I worked with two very distinct kinds of digital sources for the first exercise. I had never looked at these kinds of sources closely until now; or rather, I sensed the differences between them, but comparing them was something I would not have done on my own. As someone born in the growing digital age, I'm used to having these sources accessible to me.
Despite both being accessible online, they perform different functions and are created, preserved, and experienced in their own ways.
Digital history is mediated, and I became more aware of this fact as I explored the two sources. I find that both come across as authoritative: they are searchable, organized, and widely accessible. But because both sources are shaped by technological assumptions, they also reflect decisions made by institutions: the newspaper's editors and the Library of Congress. This places a limit on what is preserved and on what data is accessible for us to view.
The first link is a digitized newspaper from the Library of Congress.
Chicago, Ill.; New York, N.Y.
The second is a born-digital website archived by the Wayback Machine.
Both can be freely accessed online.
THE DAILY WORKER (September 1, 1927).
This digital source can be accessed through the Library of Congress's Chronicling America collection. The name itself is interesting: to chronicle is to record what was. This specific daily newspaper is political in nature. On a closer read, it focuses on the labor movements of that time, presenting the "left" side of politics and its critique of capitalism. Even the ads and editorials reflect this ideology.
I find the detailed metadata of Chronicling America to be its strength. I was able to search by date, state, publication, and subject or keyword, making it easy to navigate and locate what I might need.
Two questions came to mind as I explored the site. First, what is the scope of the archived newspapers in the database? I encountered only two results with my selected search. Second, as with a physical archive, I wonder who made the decision to digitize these primary sources.
I did a bit of external research about The Daily Worker and found that its roots lie in Communist Party ideology. This digitized version has been shaped by institutions like the University of Illinois at Urbana-Champaign Library and the Library of Congress. Was the decision to release these specific newspapers meant to cater to particular audiences, or is it meant to inform the rest of us as well?
In my experience with digitized newspapers, some can be hard to read because of the scanning quality or the OCR processing. But the metadata organization of this site is extremely well done, in my opinion. I could read each word, zoom in or out, and easily access the provided images.
If this site were to disappear, the contextual information it provides would be lost along with it. I must have been ignorant until this point, because I never knew about this side of politics in the Americas back then. But now I am a little more informed.
Tumblr, 2007-2013, Internet Archive Wayback Machine
I decided to pick this site because a few days ago I accidentally deleted my Tumblr account, which contained blog posts from when I was 18 or maybe even younger. I was going through Reddit and learned that this cannot be undone. That account of mine is lost forever from their database. A Reddit user recommended the Wayback Machine for recovering older posts, but I was unlucky and could not find anything from my account. It made me a bit sad, because those posts held my memories and my younger self blogging and recording.
Tumblr, the born-digital source, was created when I was just 6 years old. Since then, it has transformed into a sophisticated digital platform capable of using bots. I used the Wayback Machine to explore archived versions of Tumblr from 2007, 2008, 2013, and 2014, and I add my own experience of the site up to now, 2026.
The images on the left are from its founding years, and the ones on the right are from its middle years. They showcase the rapid growth of digital platforms, which evolve based on feedback and user needs. The early version is text-based and minimal, containing only the essential tools you need for a blog site.
By 2008, it had become slightly more refined, but it was still simpler than in 2013-2014. Now, Tumblr is one of my favourite blog sites. It is capable of adding embedded links and even linking out to other sites.
As you know, this blog is done on Tumblr. :)
The Wayback Machine can only capture a fragment of what the experience of Tumblr might have been for many individuals. Tumblr has become so much more interactive, with newer design elements such as replies, reposts, and notes. The Wayback Machine seems to have archived Tumblr only up to the year 2016, which would explain why I could not find my lost account in the archive. This limited scope of archival data makes me wonder whether informed consent would be required to add more from the remaining years. I personally do not remember signing anything about this. Nevertheless, I am indifferent about the decision to share my personal data.
I discovered the structures that shape how I use Tumblr today, but not the lived experience of the site, though one can imagine it. My access to the archive was limited to what the Wayback Machine could provide me. It has successfully preserved fragments of Tumblr but not its full ecosystem.
Because it is a born-digital archive, Tumblr's content is often personal or user-generated. This raises an ethical problem about whose data is preserved and whether there are policies or consent involved before it is shared with the internet. Maybe I will check again in a few years to see if they have decided to capture or showcase my old blog on the Wayback Machine.
Reading Like a Machine: OCR in Action
The OCR activity emphasized how technology shapes historical interpretation.
When I scanned the left photo, an excerpt of Georgias, the OCR was working with very clean printed text, since all the lettering shares the same font and layout. The main text was decoded very easily, and after copying and pasting, most of the text was searchable. The smaller text in the side margin was not captured as easily.
On the other hand, when I scanned my notes from a class, the OCR failed to recognize any of my handwriting. The text can be selected, but it is not searchable, and what it produced is unreadable. If I simply look at the page, it is legible, but the OCR failed entirely.
For many historians, this error matters. If I were to research a handwritten source, the OCR might fail to recognize it, and it would become invisible to search. Valuable information is lost: ideas, names, and content. This doesn't mean these things did not exist in the past, only that they are misinterpreted.
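To make the "invisible" part concrete, here is a minimal sketch of what a basic keyword search sees. It assumes a very simple search that only checks whether the keyword appears in the OCR text. The two sample strings and the function name `find_keyword` are made up for illustration; they are not my actual scans or the search used by any real archive.

```python
def find_keyword(text: str, keyword: str) -> bool:
    """Case-insensitive keyword check, a stand-in for a basic search box."""
    return keyword.lower() in text.lower()


# Hypothetical OCR outputs, invented for this example.
clean_print_ocr = "The workers voted to strike on Monday."
handwriting_ocr = "Tke w0rkers v0ted t0 str1ke 0n M0nday."

# The printed text is found; the garbled handwriting is not.
print(find_keyword(clean_print_ocr, "strike"))
print(find_keyword(handwriting_ocr, "strike"))
```

A person can still read the second line and guess the word, but the search cannot, so the source never shows up in the results.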
This exercise has taught me that digitized sources and born-digital sources are not exact depictions of the past but are constructed and incomplete. They are shaped by the growing technological era and by institutions that make decisions for us. Although I appreciate the free access they give us, there certainly are limitations.