OC – Are They Good or Not?
We are excited to announce our next installment in the Digi-Leap series. Our last expedition in this series asked you to identify the labels on a specimen and the type of text it contained.
Our next step in the larger project is to automatically pull the text out of the labels using a method called OCR (optical character recognition). OCR has been around for a long time, and this is certainly not the first attempt to apply it to biodiversity specimens. OCR of museum specimens poses many challenges (e.g. varied handwriting and fonts), and no single solution has emerged to resolve them. What we are striving to do is build on what has been done in the past and develop a human-in-the-loop workflow. This means that we anticipate that some specimens can be transcribed automatically, but many will still require human eyes. This is where you and Notes from Nature can be a huge help!
Next up will be an expedition where we ask volunteers to look over the OCR results and tell us how the OCR did. Hence the name, OC – Are They Good or Not? Get it? We are all about the puns. We’ll present images of the original label and the OCR output and ask volunteers to tell us what errors, if any, they see. We don’t need to know every single error, letter by letter. We just need to know whether there are errors and what kind they are. For example, if a word or letter is present in the original label but not in the OCR output, that is called a deletion. You may notice that some images are side by side while others are stacked one on top of the other. We did this to make the images fit as well as we could within the image viewer.
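For readers curious how error types like a deletion could be spotted automatically, here is a minimal sketch in Python. It is not the project's actual pipeline. It assumes a trusted transcription of the label is available to compare against, and it uses the standard library's difflib to line up the two texts word by word. The function name, the sample label text, and the terms "insertion" and "substitution" are our own illustrative choices; only "deletion" comes from the description above.

```python
from difflib import SequenceMatcher

# Map difflib's opcode tags to error names. Only "deletion" is a term used in
# the post; "insertion" and "substitution" are assumed names for this sketch.
ERROR_KINDS = {"delete": "deletion", "insert": "insertion", "replace": "substitution"}


def classify_ocr_errors(label_text, ocr_text):
    """Return a list of (error_kind, label_words, ocr_words) tuples.

    label_text is a trusted transcription of the label (hypothetical input);
    ocr_text is the OCR output for the same label.
    """
    label_words = label_text.split()
    ocr_words = ocr_text.split()
    matcher = SequenceMatcher(a=label_words, b=ocr_words, autojunk=False)
    errors = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Words in the label but missing from the OCR output are deletions;
            # words only in the OCR output are insertions.
            errors.append((ERROR_KINDS[tag], label_words[i1:i2], ocr_words[j1:j2]))
    return errors


# Made-up label text, for illustration only.
print(classify_ocr_errors(
    "Quercus alba L. Collected by J. Smith",
    "Quercus alba Collected by J. Smith",
))
# prints [('deletion', ['L.'], [])]
```

A word-level comparison like this only reports what kind of error occurred and roughly where, which matches the level of detail volunteers are asked for. It does not replace human review, since a trusted transcription is exactly what most specimens lack.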
If this sounds fun to you, please head over to the Labs Project and give it a try.
— The Notes from Nature Team
Have you considered the Mahalanobis Taguchi System (MTS) for pattern recognition, as developed by Dr. P. C. Mahalanobis and Dr. Genichi Taguchi? This differs from and is superior to other machine learning approaches for several reasons. One of those is an optimization step to reduce the number of parameters that must be evaluated to get a good answer. This reduces run time and data storage requirements. The best book on the subject is “Quality Recognition & Prediction: Smarter Pattern Technology with the Mahalanobis-Taguchi System” by Shoichi Teshima and Yoshiko Hasegawa. A brief intro article I wrote about MTS is available at: https://modsimworld.org/papers/2016/Mahalanobis_Taguchi_System_for_Pattern_Recognition_Prediction_and_Optimization.pdf
Happy to discuss. – Steve Holcomb
Suffolk, Virginia, USA
Wow, thanks so much for the information! We have considered it and will continue to, but we haven’t finalized our methods at this point. That is to say that our final pipeline is far from complete at this stage of the project. Best, Michael for the Notes from Nature Team