Saturday, October 3, 2009

The Hope of Crowd-Sourcing Historical Data Entry

I spent almost 4 hours at Portland's Central Library today doing data-entry work. My ultimate goal was uncertain, but I was armed with an Excel spreadsheet, the 1893 City of Portland Directory, and the 1900 City of Portland Directory. I started from the beginning of each of the directories, entering all information for persons and businesses listed. I got as far as "Acre" in my endeavors before I called it a day. I've done this sort of thing about two dozen times in the past year.

Like I said, my ultimate goal is unknown (although I've got some vague future plans), but the immediate goal is to digitize these directories. After leaving the Library, I walked back towards my neighborhood, silently reflecting on how much I had accomplished and doing calculations. On average, I complete one entry in one minute. At that rate, it would take 42 weeks of full-time data entry to go through just the 1893 directory. I don't have that kind of time. I don't have the kind of money to pay other people for that kind of time. In the grand scheme of things, my 4 hours of work have amounted to nothing.

In the past, I've toyed around with photocopying the directories, scanning them, OCRing them, then coding the data. This hasn't worked out very well. First (and this is a big problem for any solution), the books are extremely thick and fragile, making it very hard to clearly photocopy a page. Even though I've spent a decade of my life making photocopies for a living, I still have trouble getting a clear photocopy of most of these pages. Second, OCR technology as it currently exists is extremely fallible, and century-old print is far from ideal input for it. Third, coding the OCR'd data is not automatic and is time-consuming in itself. I haven't worked out the statistics for how long this might take me, but I'm pretty sure that it's still more time than I have.

My solution? Obtain the best copies I can of these directories, scan them, separate them by entry, and then set up a process where other humans help me do the data entry. If I could divide the 42 weeks of labor between 1000 willing people, we might all get somewhere. Have you seen the United States Geological Survey's North American Bird Phenology Program? That's what I have in mind.
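For the curious, here is a rough sketch of the arithmetic behind that idea. The one-entry-per-minute pace, the 42 weeks, and the 1000 people come from the numbers above; the 40-hour week is my assumption about what "full-time" means, and the variable names are just for illustration.

```python
# Back-of-the-envelope estimate, not a measurement.
MINUTES_PER_ENTRY = 1    # observed pace: about one entry per minute
WEEKS_FULL_TIME = 42     # estimated full-time effort for the 1893 directory
HOURS_PER_WEEK = 40      # ASSUMPTION: "full-time" means a 40-hour week
VOLUNTEERS = 1000        # hypothetical number of willing people

# Total effort in minutes, and the number of entries that implies.
total_minutes = WEEKS_FULL_TIME * HOURS_PER_WEEK * 60
total_entries = total_minutes // MINUTES_PER_ENTRY

# Share of the work if it were split evenly among the volunteers.
minutes_per_volunteer = total_minutes / VOLUNTEERS

print(f"Implied entries in the 1893 directory: {total_entries:,}")
print(f"Total hours of data entry: {total_minutes / 60:,.0f}")
print(f"Minutes per volunteer: {minutes_per_volunteer:.1f}")
```

Under those assumptions, an even split works out to a bit under two hours of typing per person, which is less than half of what I did in one afternoon at the Library.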

This is something that I think is worth spending time working towards, even if I can see some big problems with this. The first is the noted problem of getting a clean copy of the documents. Another is breaking the scanned pages into individual entries - that will take some labor. The next big problem is setting up the system; I currently have no idea how to program something like this. Assuming that I can overcome those hurdles, the next difficulty is finding people who are interested in participating in this project.

Even with these problems, I see potential in this and I'm going to work towards it. My exact goals might be a little hazy right now, but I'm a huge believer in the utility of converting historical analog information to digital media, and I'm convinced that this information will be hugely useful to future researchers. What do you think?

1 comment:

Unknown said...

Can you give me an example of a time when you wished you had this info digitally, or when you did, and it was awesome? Alternatively, what do you think are the top three reasons why this info would be helpful from someone's perspective who isn't you? I am curious about the hypothetical reasons that volunteering my time would turn out to be a really awesome thing, even if you're not sure why it would be awesome for your own personal stuff.