So What Do We Do with All of Your Transcriptions?
We wanted to explain more about what happens behind the scenes after our awesome Notes from Nature volunteers do transcriptions or classifications. What do we do with the data, and how do we get it back to curators and other scientists at museums? One thing you may not know is that every label is transcribed by three different people. The idea is that more folks examining labels will lead to better results. For example, if two people enter Wisconsin for the state and one person accidentally enters Wyoming, then we can assume Wisconsin is correct and Wyoming was a mistake. We also know that some labels are tough to interpret, and sometimes a couple of different guesses can get closer to the right answer than just one.
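For fields with a fixed set of values, like a state name, that three-way vote is easy to sketch in Python. This is a hypothetical illustration of the idea, not the toolkit's actual code, and the function name is ours:

```python
from collections import Counter

def plurality(answers):
    """Return the value at least two transcribers agree on, or None."""
    value, votes = Counter(answers).most_common(1)[0]
    return value if votes >= 2 else None

print(plurality(["Wisconsin", "Wisconsin", "Wyoming"]))  # Wisconsin
```

When all three answers differ, there is no consensus to report, which is why the harder matching rules described below exist.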
This seems pretty easy, right? Well… it gets more complicated when we start working with free-text labels: those text boxes where you enter sentences and phrases from the label, like locality information (“Route 46 next to a tree by the stop sign on 4th street”) or habitat data (“in a field”). How do we compare answers for these kinds of labels? What do we do with extra punctuation? Extra spaces? Extra words? Different words?
We have spent the last few months writing code that helps handle these kinds of situations. Essentially, we first try to find labels that match; if none do, we select the best label we can from the set of answers. We have set up a series of decision rules to go through your answers. First, we ask whether two of the three answers are identical, including spaces and punctuation. If they match, we are done. If not, we remove extra spaces and punctuation, ignore capitalization, and again ask whether two of the three answers are identical. If so, we select the one with the most characters, with the idea of keeping as much information as possible.
These two labels match once we remove punctuation and spaces and ignore capitals. Here we take the one with more characters to include as much information as possible:

- Rd. 10 KM 24
- RD. 10. KM 24 ← selected (more characters)
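The normalize-then-compare step can be sketched like this. This is a hypothetical Python illustration with our own helper names; the toolkit's actual code may differ:

```python
import re

def normalize(text):
    """Lowercase, drop punctuation, and collapse extra whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def reconcile_normalized(a, b):
    """If two answers match after normalizing, keep the longer original."""
    if normalize(a) == normalize(b):
        return max(a, b, key=len)  # more characters = more information kept
    return None

print(reconcile_normalized("Rd. 10 KM 24", "RD. 10. KM 24"))  # RD. 10. KM 24
```

Both labels normalize to “rd 10 km 24”, so they count as a match, and the version with more characters is the one reported.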
At this next stage things get a little more complicated, and we use our decision rules to select the best answer we can among the three. First we look for pairs of labels where all of the words from one are found in the other (a partial ratio match). If we find such a pair, we take the label with the most words.
- North Fork of Salmon River at Deep Creek, by US-93
- North Fork of the Salmon River at Deep Creek, by US-93 ← selected (partial match, more words)
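A word-subset check along these lines could look like the following. Again, this is a hypothetical sketch with our own helper names, not the toolkit's actual code:

```python
import re

def words(text):
    """Normalize a label and return its set of words."""
    return set(re.sub(r"[^\w\s]", "", text.lower()).split())

def partial_match(a, b):
    """If one answer's words are all contained in the other's,
    keep the answer with more words."""
    if words(a) <= words(b) or words(b) <= words(a):
        return max(a, b, key=lambda s: len(s.split()))
    return None

print(partial_match(
    "North Fork of Salmon River at Deep Creek, by US-93",
    "North Fork of the Salmon River at Deep Creek, by US-93",
))  # the second label wins: it has one more word
```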
Finally, we compare the answers using a ‘fuzzy matching’ scheme. Fuzzy matching looks for partial matches between words: for example, someone may have written ‘rd’ where someone else wrote ‘road’, and our fuzzy matching will allow those to be considered the same. This strategy also tolerates slight misspellings. If we get a fuzzy match between two labels, we take the label with the most words. That ensures that we get the most data we can from these answers.
- County Line Road 2 mi E of airport
- County Line Rd. 2 mi. E. of airport ← selected (fuzzy match)
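One simple stand-in for fuzzy matching is the similarity ratio from Python's standard difflib. The real toolkit may use a different fuzzy-matching library, and the 0.8 threshold and the tie-break on character count are our own assumptions for this sketch:

```python
from difflib import SequenceMatcher

def fuzzy_match(a, b, threshold=0.8):
    """If two answers are similar enough, keep the one with more words,
    breaking ties on character count."""
    score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    if score >= threshold:
        return max(a, b, key=lambda s: (len(s.split()), len(s)))
    return None

print(fuzzy_match("County Line Road 2 mi E of airport",
                  "County Line Rd. 2 mi. E. of airport"))
```

The two airport labels score well above the threshold despite the abbreviation and punctuation differences, so they are treated as the same answer.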
The end result of all this is a reconciliation “toolkit”. We pass all transcriptions from finished expeditions through this toolkit, and it delivers three products. The first is just the raw data. The second is a best-guess transcription based on the field-by-field reconciliation described above. The third is perhaps the most important: a summary of what we did and how we did it, as an .html file. The summary output is something we are extending as we think of new things that providers might want to see. Here is an example from the New World Swallowtail Expedition, one of the more difficult expeditions we’ve launched.
More recently, we have added some new features, including information about how many transcriptions each transcriber completed (based on their login names at Zooniverse) and a plot of transcription “effort” across all transcribers. The effort plot is very new, but we wanted to show whether most of the effort comes from a very few people or is spread more evenly across transcribers. Here is an example for a different expedition, “WeDigFLPlants’ Laurels of Florida”:
Finally, we give providers information about how each label was reconciled (whether there was an exact, partial, or fuzzy match). We do this so providers can go through the results and decide if there are some they want to check. We also highlight any problem records: those for which we could not get a match, or those for which there was only one answer, so we could not compare answers. Here is an example from one label. The areas in green are the three different answers, the top row is the ‘best guess’ reconciled record, and the gray row is information about how the reconciliation was done. For example, in the first column, Country, all three answers were Myanmar, and in gray it says we had an exact match with three answers. The cells in red are potential issues (in this case, only one answer was given).
The goal of all of this is to make it easy for providers to use these data right away. We’ll also note that this tool lets us take an overall look at transcription “success” rates, something we may come back to in future posts, because these numbers are striking and illustrate the high value of this effort.
– Julie Allen, Notes from Nature data scientist
Good stuff, but I’m slightly concerned about the fuzzy matching. Doesn’t it effectively encourage us to expand abbreviations? I seem to remember that in NFN 1 we were asked to transcribe exactly, to get around the risk of expanding things wrong.
It’s not often an issue, but sometimes there are situations where an abbreviation is misleading, or ambiguous. For example, in this record https://www.zooniverse.org/projects/zooniverse/notes-from-nature/talk/subjects/3847547 I originally thought that the collector had mixed minutes (as in geographical co-ordinates) and miles; but it could equally be feet and metres.