Monday, October 5, 2009

A Visual Idea of Crowd-Sourcing Portland Records

Following up on my post the other day about crowd-sourcing the transcription of Portland directories, I decided to make a quick diagram of what I'm trying to do. Maybe this will be more compelling, so please take a look:

Part of this is a recap of thoughts and concerns; another part is thinking out loud:
  1. The first step is capturing individual pages of the City Directories. A big problem is that these things are too big to easily copy; the interior parts of the pages are far enough away from the scan bar that they copy black (I tried to imply that with the page illustration). One idea is to find existing copies of the directories outside of the library collection, purchase them, then cut the spines to facilitate scanning. This seems extremely expensive (and I hate the idea of cutting spines in general). Another option would be high-resolution photos of the pages, although I don't know enough about photography to gauge the expense or difficulty of this.
  2. Pages have a varying number of entries on them, and a human eye (at least currently) is required to correctly parse out the different entries. Cutting and pasting each record is an easy option, but I haven't yet looked for any software that will automatically save and number pasted JPGs. I'm not very worried about this step, as I'm sure that someone has created software that will allow quick pasting to sequential files. If I'm wrong about that, this may be a more difficult step.
  3. The actual interface for recording individual records is something that I am currently trying to learn, and it's been very difficult for me. The components that I'm focusing on are individual users and records, and how to record both. My plan is to figure out how to do this using MySQL and PHP, although I have very little experience with either. The goal of this process follows this pattern:

    The user logs in and starts transcribing data. I want a user login so we can track how many records each user is logging, and so I can reward the top-ranking users. The secondary reason is that each record will need to go through at least two users: the two transcriptions will be compared, and if they match, the record goes into the database. If a record is transcribed differently by two people, it gets sent back through this process. It's a quality control process.
  4. After data is coded twice in the same way, it gets entered into the final database. This database is meant to record accurate historical information that can be later data-mined.
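
As a rough illustration of the two-pass check described in steps 3 and 4, here is a minimal sketch. It uses Python instead of MySQL and PHP purely to keep the example short, and it keeps everything in memory, which a real site would not do. Every name in it (DoubleEntryQueue, submit, normalize, and the sample fields) is made up for illustration, and the rule that whitespace and capitalization differences don't count as mismatches is an assumption, not something decided above.

```python
from collections import Counter


def normalize(fields):
    """Collapse whitespace and case so trivial differences are not mismatches.

    Assumption: two transcriptions that differ only in spacing or
    capitalization should be treated as the same.
    """
    return {key: " ".join(str(value).split()).lower() for key, value in fields.items()}


class DoubleEntryQueue:
    """In-memory stand-in for the 'two users must agree' workflow."""

    def __init__(self):
        self.pending = {}       # entry_id -> (user, fields) waiting for a second pass
        self.accepted = {}      # entry_id -> fields confirmed by two different users
        self.tally = Counter()  # user -> number of transcriptions submitted

    def submit(self, entry_id, user, fields):
        if entry_id in self.accepted:
            return "already accepted"
        first = self.pending.get(entry_id)
        if first is not None and first[0] == user:
            raise ValueError("a second, different user must confirm this entry")
        self.tally[user] += 1
        if first is None:
            # First transcription: hold it until someone else transcribes it too.
            self.pending[entry_id] = (user, fields)
            return "pending"
        # Second transcription: compare it with the first one.
        del self.pending[entry_id]
        if normalize(first[1]) == normalize(fields):
            self.accepted[entry_id] = fields
            return "accepted"
        return "sent back"  # mismatch: the entry starts the process over


# Made-up sample data, not taken from any directory.
queue = DoubleEntryQueue()
print(queue.submit(1, "alice", {"name": "Doe, Jane", "occupation": "clerk"}))
print(queue.submit(1, "bob", {"name": "doe, jane ", "occupation": "Clerk"}))
print(queue.tally.most_common())  # per-user counts, for ranking contributors
```

The tally is what would drive the rewards for top-ranking users, and a mismatched entry simply drops out of the pending pool so that two fresh transcriptions are collected for it.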


This is what I want to do. I'm aiming to digitize all of Portland's directories. Do you have advice? Better ways of doing the same thing? Do you have the time and expertise to help? Is anyone interested in this?

Saturday, October 3, 2009

The Hope of Crowd-Sourcing Historical Data Entry

I spent almost 4 hours at Portland's Central Library today doing data-entry work. My ultimate goal was uncertain, but I was armed with an Excel spreadsheet, the 1893 City of Portland Directory, and the 1900 City of Portland Directory. I started from the beginning of each of the directories, entering all information for persons and businesses listed. I got as far as "Acre" in my endeavors before I called it a day. I've done this sort of thing about two dozen times in the past year.

Like I said, my ultimate goal is unknown (although I've got some vague future plans), but the immediate goal is to digitize these directories. After leaving the Library, I walked back towards my neighborhood, silently reflecting on the amount I accomplished and doing calculations. On average, I complete one entry in one minute. Based on this statistic, it would take 42 weeks of full-time data entry to go through just the 1893 directory. I don't have that kind of time. I don't have the kind of money to pay other people for that kind of time. In the grand scheme of things, my 4 hours of work has amounted to nothing.

In the past, I've toyed around with photocopying the directories, scanning them, OCRing them, then coding the data. This hasn't worked out very well. First (and this is a big problem for any solution), the books are extremely thick and fragile, making it very hard to clearly photocopy a page. Even though I've spent a decade of my life making photocopies for a living, I still have trouble getting a clear photocopy of most of these pages. Second, OCR technology as it currently exists is extremely fallible, and century-old print makes it worse. Third, coding the OCR'd data is not automatic and is time-consuming in itself. I haven't walked around figuring out the statistics for the time this might take me, but I'm pretty sure that it's still more time than I have.

My solution? Obtain the best copies I can of these directories, scan them, separate them by entry, and then set up a process where other humans help me do the data entry. If I could divide the 42 weeks of labor between 1000 willing people, we might all get somewhere. Have you seen the United States Geological Survey's North American Bird Phenology Program? That's what I have in mind.
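
To put rough numbers on that division of labor, here is a small back-of-the-envelope sketch in Python. The one-minute-per-entry pace, the 42 weeks, and the 1000 volunteers come from this post; the 40-hour week is an assumption about what "full-time" means, and the function name and the optional passes parameter (for having each entry transcribed more than once) are made up for illustration.

```python
def minutes_per_volunteer(weeks, volunteers, hours_per_week=40, passes=1):
    """Split a block of full-time work evenly among volunteers.

    hours_per_week=40 is an assumed definition of "full-time".
    passes is how many times each entry gets transcribed.
    """
    total_minutes = weeks * hours_per_week * 60 * passes
    return total_minutes / volunteers


# 42 weeks at one entry per minute implies this many entries (under the 40-hour assumption).
implied_entries = 42 * 40 * 60
print(implied_entries)

# Minutes of work each of 1000 volunteers would need to contribute.
print(minutes_per_volunteer(weeks=42, volunteers=1000))

# The same split if every entry is transcribed twice.
print(minutes_per_volunteer(weeks=42, volunteers=1000, passes=2))
```

Under those assumptions the job shrinks from most of a year for one person to a couple of hours per volunteer, which is the whole appeal of spreading it out.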

This is something that I think is worth spending time working towards, even if I can see some big problems with this. The first is the noted problem of getting a clean copy of the documents. Another is breaking the scanned pages into individual entries - that will take some labor. The next big problem is setting up the system; I currently have no idea how to program something like this. Assuming that I can overcome those hurdles, the next difficulty is finding people who are interested in participating in this project.

Even with these problems, I see potential in this and I'm going to work towards it. My exact goals might be a little hazy right now, but I'm a huge believer in the utility of converting historical analog information to digital media, and I'm convinced that this information will be hugely useful to future researchers. What do you think?