Sunday, April 10, 2011

Security Mechanism Can Enhance Historical Record Keeping

Recently a friend showed me an incredibly creative application of computing technology with all sorts of possible uses. I am a language buff (that means Big Fan) so I immediately grabbed onto historical language issues. Now, this technology project has been around a little while so it isn't completely hot off the press. But I didn't know about it so you may not either.

The reCAPTCHA project. You probably (?) know what a CAPTCHA is: those squiggly, tortured, hard-to-read words you have to type in sometimes to gain access to a site, sign up for an online account or post a comment on a blog. If you don't type the word correctly you don't get in. Fortunately you get additional chances with new words. This technology exists because people can usually read the words but computers cannot, which keeps automated spamming out.

Here's the cool upgrade. Have you seen one of those where you have to type in TWO words next to one another? Why two? Isn't one enough? Well, this is what is going on. There are projects underway (which you probably *are* familiar with) to digitize books, newspapers and archival texts. With the really old ones, the handwritten ones, the faded or slightly crumpled ones, the OCR (Optical Character Recognition) software being used to scan those pages and turn them into digital text cannot read many of the words.

Fortunately the OCR software flags the words it cannot read. Many of those words are being placed into those CAPTCHA screens. So you get two words: one is a known word, made to look hard to read, and the second is a word that was hard to read to start with (perhaps made more so, just to be safe :). That same unknown word is paired with a known word and given to many, many, many people in CAPTCHA boxes. If you get the known word right, you get in (at least that is my understanding). If the vast majority of people who get their known word right ALSO give the same answer for the unknown word, the odds are the unknown word has been correctly identified, and it is plugged into its spot in the original scanned text.
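For the programmers reading, here is a rough sketch of how that agreement check might look in Python. This is only my guess at the idea, not the project's actual code: the function names, the shape of the submissions, and both thresholds (`min_votes` and `min_agreement`) are made up for illustration.

```python
from collections import Counter


def normalize(word):
    """Compare answers without regard to case or surrounding spaces."""
    return word.strip().lower()


def passes_check(expected_known, typed_known):
    """Access depends only on the known (control) word."""
    return normalize(typed_known) == normalize(expected_known)


def resolve_unknown_word(submissions, min_votes=3, min_agreement=0.75):
    """Return the agreed reading of the unknown word, or None if undecided.

    Each submission is a tuple (expected_known, typed_known, typed_unknown).
    Both thresholds are hypothetical values chosen for this example.
    """
    # Only count answers from people who typed their known word correctly.
    votes = Counter(
        normalize(typed_unknown)
        for expected_known, typed_known, typed_unknown in submissions
        if passes_check(expected_known, typed_known)
    )
    total = sum(votes.values())
    if total < min_votes:
        return None  # not enough trustworthy answers yet
    word, count = votes.most_common(1)[0]
    # Accept the top answer only if a large majority agrees on it.
    return word if count / total >= min_agreement else None


submissions = [
    ("morning", "morning", "upon"),
    ("river", "river", "Upon"),
    ("castle", "castle", "upon"),
    ("garden", "gardin", "apon"),  # known word wrong, so this vote is ignored
    ("window", "window", "upon"),
]
print(resolve_unknown_word(submissions))
```

Notice that the person who mistyped the known word is left out of the vote entirely, which is the point of pairing the two words in the first place.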

I realize the book scanning projects are controversial in some spheres, especially in regard to recent works. What captures (oh ow ow pun) my imagination is the thought of Medieval texts written by monks, with all those flourishes, being scanned and transcribed. OK, they are usually in Latin. Providing a Latin word in a CAPTCHA screen might be a giveaway, but the idea of being able to digitize previously inaccessible texts by harnessing the power of millions of people passing through online security checks is impressive. For now we may have to stick with Victorian English or Elizabethan English, or whatever the native language is of the country where a site is based, but eventually, I suspect, computer scientists will find a way to put this to work with languages that are no longer in common use.

Meanwhile, and this might already be in the works, Latin or other special-use languages might be applied to sub-populations of sites where they make sense. For example, to get into a repository of ancient Sanskrit resources you might have to type a two-word Sanskrit CAPTCHA. That way, ancient texts could be made publicly accessible much faster than they would have been otherwise.

More information about the project:
