Monday, October 5, 2009

A Visual Idea of Crowd-Sourcing Portland Records

In reference to my post the other day about my desire to crowd-source transcribing Portland directories, I decided to make a quick diagram of what I'm trying to do. Maybe this will be more compelling, so please take a look (click on the image for a larger view):

Part of this is a recap of thoughts and concerns; another part is thinking out loud:
  1. The first step is capturing individual pages of the City Directories. A big problem is that these volumes are too big to copy easily; the inner margins of the pages sit far enough away from the scan bar that they copy black (I tried to imply that with the page illustration). One idea is to find existing copies of directories outside of the library collection, purchase them, then cut the spines to make scanning easier. This seems extremely expensive (and I hate the idea of cutting spines in general). Another option would be high-resolution photos of the pages, although I don't know enough about photography to gauge the expense or difficulty of this.
  2. Pages have a varying number of entries on them, and a human eye (at least currently) is required to correctly parse out the different entries. Cutting and pasting each record is an easy option, but I haven't yet looked for any software that will automatically save and number pasted JPGs. I'm not too worried about this step, as I'm sure that someone has created software that will allow quick pasting to sequential files. If I'm wrong about that, this may be a more difficult step.
  3. The actual interface for recording individual records is something that I am currently trying to learn, and it's been very difficult for me. The components that I'm focusing on are individual users and records, and how to record both. My plan is to figure out how to do this using MySQL and PHP, although I have extremely little experience with either. The process follows this pattern:

    The user logs in and starts transcribing data. I want a user login so we can track how many records each user is logging, and so I can reward the top-ranking users. The secondary reason for the login is that each record will need to go through at least two users - the two transcriptions will be compared, and if they match, the record goes into the database. If a record is recorded differently by two people, then it gets sent back through this process. It's a quality control process.
  4. After data is coded twice in the same way, it gets entered into the final database. This database is meant to record accurate historical information that can be data-mined later.
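
To make the quality-control idea in steps 3 and 4 a little more concrete, here is a minimal sketch of the "two users must agree" rule. It is written in Python and keeps everything in memory purely for illustration; it is not the MySQL and PHP version I'm hoping to build. The class name, the status strings, the field names (`name`, `address`), and the choice to ignore differences in capitalization and spacing when comparing are all assumptions made up for this example.

```python
from collections import defaultdict


def normalize(entry):
    """Canonical form of a transcription (assumption: case and extra
    whitespace should not count as a disagreement between two users)."""
    return tuple(
        sorted((field, " ".join(str(value).split()).lower())
               for field, value in entry.items())
    )


class DoubleEntryQueue:
    """Hypothetical in-memory stand-in for the transcription database."""

    def __init__(self):
        self.pending = {}               # record_id -> (user, entry) awaiting a second user
        self.final = {}                 # record_id -> entry accepted by two users
        self.counts = defaultdict(int)  # user -> number of transcriptions logged

    def submit(self, record_id, user, entry):
        if record_id in self.final:
            return "already accepted"

        first = self.pending.get(record_id)
        if first is not None and first[0] == user:
            # The same user cannot confirm their own work; keep the newer version.
            self.pending[record_id] = (user, entry)
            return "waiting"

        self.counts[user] += 1
        if first is None:
            self.pending[record_id] = (user, entry)
            return "waiting"

        # A second, different user has transcribed the same record: compare.
        del self.pending[record_id]
        if normalize(first[1]) == normalize(entry):
            self.final[record_id] = entry
            return "accepted"
        return "rejected"  # both versions dropped; the record goes back in the pool


queue = DoubleEntryQueue()
print(queue.submit(1, "ann", {"name": "John Smith", "address": "12 Oak St"}))
print(queue.submit(1, "bob", {"name": "john  smith", "address": "12 Oak St"}))
print(dict(queue.counts))
```

The per-user counts are what a "top-ranking users" list would be built from. A real version would probably also need some way to break ties when two users keep disagreeing about a hard-to-read entry, which this sketch leaves out.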


This is what I want to do. I'm aiming to digitize all of Portland's directories. Do you have advice? Better ways of doing the same thing? Do you have the time and expertise to help? Is anyone interested in this?

2 comments:

LLL said...
This comment has been removed by a blog administrator.
Patricia said...

Unlike the above spam, I'll leave a real comment.

This seems like a very doable project. If I recall from my boring job days, there was some similar thing where old out-of-print books were scanned and normal people transcribed the pages so they could be read/searched etc. People kept stats on how many pages they did, they felt good because they were preserving something that might be lost, etc.

I would, however, challenge you to define why this would be an important project, as your commenter on the other post pointed out.

As for manpower, if you can capture the under-worked desk job people--believe me there are a lot of them who are bored of having nothing to do at work--you will have this done in no time.

I have a feeling that this would be a great project for a second "New Deal" type work program, but I think we won't see those days again for a while, if ever.