Part of this is a recap of thoughts and concerns; the rest is thinking out loud:
- The first step is capturing individual pages of the City Directories. A big problem is that the volumes are too thick to copy easily: the inner margins of the pages sit far enough from the scan bar that they copy black (I tried to show that in the page illustration). One idea is to find existing copies of the directories outside the library collection, purchase them, and cut the spines so the pages can be scanned flat. This seems extremely expensive (and I hate the idea of cutting spines in general). Another option is high-resolution photographs of the pages, although I don't know enough about photography to gauge the expense or difficulty.
- Pages hold a varying number of entries, and (at least currently) a human eye is required to separate the entries correctly. Cutting and pasting each record into its own image is the easy option, but I haven't yet looked for software that will automatically save and number the pasted JPGs. I'm not very worried about this step, since I'm fairly sure someone has written software that allows quick pasting to sequentially numbered files. If I'm wrong, this step may be harder than I expect.
- The actual interface for recording individual records is something I am currently trying to learn, and it has been very difficult for me. The components I'm focusing on are individual users and individual records, and how to store both. My plan is to build this with MySQL and PHP, although I have very little experience with either. The process is meant to follow this pattern:
The user logs in and starts transcribing data. I want a login so we can track how many records each user transcribes, and so I can reward the top-ranking users. The second reason for the login is quality control: each record needs to be transcribed by at least two different users, and the two versions are then compared. If they match, the record goes into the database. If they differ, the record is sent back through the transcription process.
- After a record has been coded the same way twice, it is entered into the final database. This database is meant to hold accurate historical information that can be data-mined later.
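
To make the double-entry check concrete, here is a minimal sketch of the compare-and-promote logic. It is only an illustration under stated assumptions: it is written in Python with the built-in sqlite3 module standing in for MySQL and PHP so that it runs on its own, and the table names (transcriptions, verified), the function names, and the whitespace normalization are all hypothetical choices, not a settled design.

```python
import sqlite3


def normalize(text):
    """Collapse runs of whitespace so spacing differences are not mismatches.

    Assumption: only spacing is forgiven; spelling and case must agree.
    """
    return " ".join(text.split())


def open_db(path=":memory:"):
    """Create the two hypothetical tables used by this sketch."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS transcriptions (
            image_id INTEGER NOT NULL,
            username TEXT NOT NULL,
            body     TEXT NOT NULL,
            PRIMARY KEY (image_id, username)  -- one attempt per user per record
        );
        CREATE TABLE IF NOT EXISTS verified (
            image_id INTEGER PRIMARY KEY,
            body     TEXT NOT NULL
        );
    """)
    return db


def submit(db, image_id, username, text):
    """Store one transcription; return 'pending', 'verified', or 'rejected'."""
    body = normalize(text)
    with db:  # commits on success, rolls back if an exception is raised
        done = db.execute(
            "SELECT 1 FROM verified WHERE image_id = ?", (image_id,)
        ).fetchone()
        if done:
            return "verified"  # already promoted; nothing more to do
        db.execute(
            "INSERT INTO transcriptions VALUES (?, ?, ?)",
            (image_id, username, body),
        )
        rows = db.execute(
            "SELECT body FROM transcriptions WHERE image_id = ?", (image_id,)
        ).fetchall()
        if len(rows) < 2:
            return "pending"  # waiting for a second user
        if rows[0][0] == rows[1][0]:
            db.execute("INSERT INTO verified VALUES (?, ?)", (image_id, body))
            return "verified"
        # Mismatch: discard both attempts so the record is transcribed again.
        db.execute("DELETE FROM transcriptions WHERE image_id = ?", (image_id,))
        return "rejected"


def leaderboard(db):
    """Count stored transcriptions per user, highest first."""
    return db.execute(
        "SELECT username, COUNT(*) FROM transcriptions "
        "GROUP BY username ORDER BY COUNT(*) DESC, username"
    ).fetchall()
```

In this sketch a second submission by the same user for the same record raises sqlite3.IntegrityError, which is how the "two different users" rule is enforced. Rejected attempts are deleted, so the leaderboard only counts transcriptions that are pending or verified; whether rejected work should still earn credit is a design question I have not settled.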
This is what I want to do: digitize all of Portland's directories. Do you have advice, or better ways of doing the same thing? Do you have the time and expertise to help? Is anyone interested in this?