OC – Are They Good or Not?
Our next step in the larger project is to automatically pull the text out of the labels using a method called OCR (optical character recognition). OCR has been around for a long time, and this is certainly not the first attempt to apply it to biodiversity specimens. OCR of museum specimens poses many challenges (e.g. varied handwriting and fonts), and no single solution has emerged to resolve them. What we are striving to do is build on what has been done in the past and develop a human-in-the-loop workflow. This means that we anticipate that some specimens can be transcribed automatically, but many will still require human eyes. This is where you and Notes from Nature can be a huge help!
Next up will be an expedition where we ask volunteers to look over the OCR results and tell us how the OCR did. Hence the name, OC – Are They Good or Not? Get it? We are all about the puns. We’ll present images of the original label and the OCR output and ask volunteers to tell us what errors, if any, they see. We don’t need to know every single error letter by letter. We just need to know whether there are errors and what kind they are. For example, if a word or letter is present in the original label but not in the OCR output, that is called a deletion. You may notice that some images are side by side while others are presented one on top of the other. We did this to make the images fit as well as we could within the image viewer.
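For readers curious how an error type like a deletion could be spotted by a program, here is a minimal Python sketch. It is not the project's actual workflow: it assumes a trusted transcription of the label is already available, which is exactly what volunteers help supply. The function name `classify_ocr_errors` and the labels "insertion" and "substitution" are our own illustrative choices; only "deletion" comes from the description above.

```python
from difflib import SequenceMatcher


def classify_ocr_errors(label_text, ocr_text):
    """Compare a trusted label transcription with OCR output, word by word.

    Hypothetical helper for illustration only. Returns a list of
    (error_kind, words_in_label, words_in_ocr) tuples.
    """
    label_words = label_text.split()
    ocr_words = ocr_text.split()
    matcher = SequenceMatcher(a=label_words, b=ocr_words, autojunk=False)

    errors = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            # Present in the original label but missing from the OCR output.
            errors.append(("deletion", label_words[i1:i2], []))
        elif tag == "insert":
            # Present in the OCR output but not in the original label.
            errors.append(("insertion", [], ocr_words[j1:j2]))
        elif tag == "replace":
            # Words that differ between the label and the OCR output.
            errors.append(("substitution", label_words[i1:i2], ocr_words[j1:j2]))
    return errors


if __name__ == "__main__":
    label = "Quercus alba collected near river"
    ocr = "Quercus alba near river"
    for kind, in_label, in_ocr in classify_ocr_errors(label, ocr):
        print(kind, in_label, in_ocr)
```

In this made-up example the word "collected" appears on the label but not in the OCR output, so it is reported as a deletion. The sketch compares whole words; finding missing letters would need a character-level comparison.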
If this sounds fun to you, please head over to the Labs Project and give it a try.
— The Notes from Nature Team