How Weird is That?
Specimen collectors often have deep experience with the natural world, and occasionally they notice things that aren’t as they expected. In a recent survey of over 220 collectors from across taxonomic disciplines (botanists, ornithologists, entomologists, etc.), over half (59%) reported documenting the anomalies that they observe on their specimen labels, which is great. However, there is a huge diversity of ways in which they do this, which makes it hard to find their observations. When asked to provide words that they use in those descriptions, survey respondents gave 170 unique words and phrases. Most of these words and phrases can be used in ways that might not communicate an anomaly. For example, “early” is a frequently cited word to describe a phenological anomaly (i.e., an anomaly related to the timing of life history events). “Flowering early” is an observation of an anomaly; “specimen collected in early morning” is not. Even words that might be thought straightforward, like “Strange”, appear in ways that are not documenting an anomaly (e.g., “Strange Road” as a place name).
With this new project, “How Weird is That?”, we are seeking help to classify specimen records as including an observation of an anomaly or not. These classifications will then be used to train machines to differentiate between the two cases. To ensure that some of the records being considered include observations of anomalies, we’ve searched the 120 million specimen records at iDigBio for each of 25 terms cited by collectors as useful in describing anomalies. In the project’s first Notes from Nature Expedition, we included all of the records that have images associated with them and that contain the terms “early”, “earlier”, or “earliest”. The second expedition includes records that use the terms “late”, “later”, or “latest”. After that, we will do a second late-later-latest set of specimens, then move on to other terms like “weird”, “abnormal”, and “odd”. The further classification of statements of anomalies as being about phenology, distribution, or other things will be used to refine the machine learning step. Once the machines have been taught to flag assertions of an anomaly, that information can be handed off much faster to those who could use it, such as those studying invasive species or mismatches between the arrival of migratory birds and the emergence of the insects that they eat.
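To make the term search concrete, here is a minimal sketch of whole-word matching against label text. It is not the actual iDigBio query used for the project; the regular-expression approach, the function names, and the third sample label are assumptions made for illustration. Only the term families and the first two sample phrases come from the description above.

```python
import re

# Term families named in this post. Matching them with a regex is an
# assumption for illustration, not the project's actual search method.
TERM_FAMILIES = {
    "early": ["early", "earlier", "earliest"],
    "late": ["late", "later", "latest"],
}

def build_pattern(terms):
    """Compile a case-insensitive, whole-word pattern for a list of terms."""
    # Longest terms first so the alternation is easy to read; the \b anchors
    # keep "late" from matching inside words such as "plate" or "lateral".
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)

def matching_terms(label_text, terms):
    """Return the sorted, lowercased terms that occur as whole words."""
    pattern = build_pattern(terms)
    return sorted({m.group(0).lower() for m in pattern.finditer(label_text)})

# The first two labels are phrases quoted in this post; the third is hypothetical.
labels = [
    "Flowering early",
    "Specimen collected in early morning",
    "Leaves pearly white below",
]
for text in labels:
    print(text, "->", matching_terms(text, TERM_FAMILIES["early"]))
```

In this sketch the first two labels both match “early”, even though only the first records an anomaly. A keyword search can narrow the pool of records, but it cannot separate the two cases, which is why the classifications from participants are needed.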
Finally, a few things to note. We expect that most images associated with specimen records will contain the specimen labels, but that is not always the case. So as not to bias the sampling and diminish the utility of the machine learning rules that we arrive at, we have not removed any records from the datasets by acting on potentially faulty assumptions, such as “images of fossils don’t ever contain labels” or “bird images are only ever made in the field and not after specimen preparation is complete.” This leads us to an important point: specimens are preserved plants, insects, birds, fish, etc. If you think that viewing dead organisms, whether in the field (e.g., a photo of a beached whale) or after preservation (e.g., an insect on a pin), will trigger unpleasant reactions for you, we encourage you to contribute to science in a different Notes from Nature project. Also, please note that some handwriting on labels is hard to read. If that’s the case for something you see, use “Uncertain” as a response, and we will check it later. Last, please be assured that classifying a specimen record as not containing an observation of an anomaly is as valuable to our process as finding one that does. The machines need both kinds of examples to learn how to differentiate.
We are tremendously grateful to participants in this activity and hope to keep things interesting throughout this data creation campaign by remaining engaged in Talk and providing occasional blog updates. Thank you and enjoy!
— Austin Mast, Florida State University