CITSCribe & Notes from Nature Hackathon

Post by Austin Mast

The CITSCribe Hackathon, co-organized by Zooniverse’s Notes from Nature Project (www.notesfromnature.org) and iDigBio (www.idigbio.org), brought together over 30 programmers and researchers from the areas of biodiversity research and digital humanities for a week to further enable public participation in the transcription of biodiversity specimen labels. There are approximately 1 billion biodiversity research specimens in US collections alone, but it is estimated that information from just 10% of them is currently digitized and online. Digitization of these specimens gives researchers access to vast quantities of information in their investigations of timely subjects such as climate change, invasive species, and the extinction crisis. The magnitude of the task of bringing those specimens into digital format far exceeds current capacity and requires new, Internet-scale approaches to engage the public to help with the task and learn more about biodiversity collections. Participants in the hackathon were energized by the opportunity to work on groundbreaking citizen-science projects with immediate and strong impacts in the areas of biodiversity and applied conservation.

The event opened on December 16, 2013, at iDigBio’s University of Florida (Gainesville, FL) center with the co-organizers Rob Guralnick (University of Colorado, Boulder) and Austin Mast (Florida State University) introducing the group to the process of digitization of biodiversity specimens, the heterogeneity of specimen labels, and the role that public participation tools and public participants play in the digitization workflow. This was followed by a brief introduction to the development tracks that sub-groups might like to tackle during the week: (1) interoperability between public participation tools and biodiversity data systems, (2) transcription quality assessment/quality control (QA/QC) and the reconciliation of replicate transcriptions, (3) integration of optical character recognition (OCR) into the transcription workflow, and (4) user engagement. The brief introductions and expressions of interest that followed made it clear that there would be a critical mass of complementary interests and competencies in each track for the week (Yay!).

After Cody Meche (an Agile Trainer and Coach at Davisbase Consulting) energized the group with a talk on agile development best practices (thanks for volunteering your time, Cody!), Alex Thompson (iDigBio) presented some of the digital resources that iDigBio had assembled prior to the hackathon (including a Vagrant script to build a virtual machine for the Notes from Nature web interface) and helped the programmers set up their development environments in a “Tech-up!” session. Yonggang Liu presented the new iDigBio Image Ingestion Appliance for the iDigBio Cloud, a storage resource for public participation tools. The hackathon participants then self-organized into development tracks to plan deliverables and development roadmaps in a Team-up! session, and these activities culminated in presentations to the whole group in a Stand-up! session after lunch on Day 2.

Huge progress was made in a series of Code-sprints and Stand-up! sessions that made up much of the second half of Day 2 and all of Days 3 and 4. These were punctuated by occasional Mix-up! sessions in which either pairs of development teams met to discuss areas of overlap or the participants were completely randomized into new groups to discuss new directions not yet taken. A call-in from Laura Whyte, the Director of Citizen Science at Adler Planetarium, provided an exciting overview of the latest activities at Zooniverse, including GalaxyZoo Quench (a project that is engaging the public in everything from data collection to data analysis to manuscript writing) and ZooTeach (a site where teachers can find lesson plans that complement Zooniverse projects). And an excursion to the Florida Museum of Natural History (including its colorful Butterfly Rainforest) on Wednesday afternoon provided a bit of a breather from all of the coding.

On the final day, hackathon tracks presented their final Stand-up!, a parade of creative and useful solutions for public participation in transcriptions. The interoperability track (Alex T., Ted H., Matthew M., Ed G., Robert B., Greg R., Yonggang L.) introduced their code to produce a Darwin Core Archive that describes discrete projects (sometimes called “Expeditions” or “Missions”) for ingestion by public participation tools and export from those tools back to the data providers. This includes code to generate descriptions of the project (e.g., taxonomic and geographic scope) in Ecological Metadata Language, along with record-level descriptions of images and digitization projects using Audubon Core and Darwin Core. Parts of this code were added to a beta version of the iDigBio image ingestion appliance and Symbiota, a biodiversity data management tool. Much of the further development in this area will involve creation of a public participation management tool to create and manage projects of this type and to download and process publicly generated data.

The QA/QC track (Jun L., Tony K., Al M., Chuck M.) tackled a big challenge in citizen science transcription: how to take the outputs from citizen science transcription tools and assure the highest quality end result. Team QA/QC introduced an innovative pipeline for building consensus from multiple transcription replicates at the level of characters or, alternatively, tokens, using the MAFFT alignment tool (a tool typically used for DNA sequence alignment). They demonstrated ca. 35% agreement between the consensus that the two methods generate and gold standard data (transcribed by highly trained digitizers) for exact matches. They also wrote a script to normalize name strings (e.g., from “A. R. and F. T. Smith” to “A. R. Smith, F. T. Smith”). Much of the further development in this area will involve optimizing the alignment algorithm for this task and making the consensus builder into a web service that can take input replicate transcriptions and output a consensus transcription.
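To make the consensus idea concrete, here is a minimal sketch in Python. It is not the track's code. It assumes the replicate transcriptions have already been aligned to equal length (for example by an alignment tool such as MAFFT), with "-" marking gaps, and then takes a simple majority vote in each column. The function name `consensus` and the sample strings are hypothetical.

```python
from collections import Counter

GAP = "-"  # assumed gap symbol in the aligned replicates


def consensus(aligned):
    """Return the column-wise majority consensus of equal-length aligned strings."""
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        raise ValueError("all aligned replicates must have the same length")
    out = []
    for column in zip(*aligned):
        # Pick the most frequent symbol in this column; on a tie, the
        # symbol seen first in the column wins.
        symbol, _ = Counter(column).most_common(1)[0]
        # A winning gap means most replicates had nothing here, so skip it.
        if symbol != GAP:
            out.append(symbol)
    return "".join(out)


# Three hypothetical replicate transcriptions of one collector field,
# already the same length, with one typo in the second replicate.
replicates = [
    "A. R. Smith",
    "A. R. Smlth",
    "A. R. Smith",
]
print(consensus(replicates))  # prints "A. R. Smith"
```

A plain majority vote treats every replicate equally. A production consensus builder would likely need to weigh transcriber reliability and handle ties more carefully.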

The integration of OCR track (Go Team Ll Ll!; William U., Deb P., Andrea M., Sylvia O., Miao C., Jason B.) created word clouds (using n-gram scoring, faceting, and Solr for indexing + Carrot² for visualization) and explored their use in two steps of the pipeline: a step in which the public participant selects a subset of specimens with a word of interest from the word cloud and a data cleaning step, where infrequent words are highlighted by the system. They also created an interface for exploring the words using histograms, rather than word clouds. Much of the further development in this area will involve integration of the word selection step into public participation tools and integration of the visualization for data cleaning into a processing tool, such as the public participation management system.
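The data cleaning step, in which infrequent words are highlighted, can be illustrated with a small Python sketch. It is a stand-in under simple assumptions, not the track's Solr and Carrot² pipeline: it counts plain lowercase word frequencies across a handful of made-up OCR label strings and reports the words at or below a threshold. The function name `rare_words` and the `max_count` parameter are hypothetical.

```python
import re
from collections import Counter


def rare_words(texts, max_count=1):
    """Return, in sorted order, the words occurring at most max_count times."""
    counts = Counter(
        word
        for text in texts
        for word in re.findall(r"[a-z]+", text.lower())
    )
    return sorted(word for word, n in counts.items() if n <= max_count)


# Made-up OCR output from three specimen labels; the third contains a
# misspelling of "Florida".
labels = [
    "Florida Leon County pine flatwoods",
    "Florida Leon County roadside",
    "Flordia Leon County pine flatwoods",
]
print(rare_words(labels))  # prints ['flordia', 'roadside']
```

Note that the output mixes a likely OCR error ("flordia") with a legitimate but uncommon word ("roadside"), which is why a step like this highlights candidates for a person to review rather than correcting them automatically.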

The user engagement track (Go Team Honey Badger!; Julie A., Matthew B., David B., Paul F., Lisa L., Paul K.) made progress on a diversity of useful fronts. Their completed “ditto” function, which uses a key binding to autocomplete Notes from Nature fields from previous entries, is sure to make data entry in that system far more efficient. Other code from that group added functionality to Notes from Nature to show all target fields at once in a single window for easy tabbing between them and to flag specimens with explanations for skipping them (e.g., specimen label obscured, specimen label illegible). The group brainstormed dashboard functionality for public participation tools, created a mock-up for a dashboard in Notes from Nature, and coded a dashboard (tentatively called My Dashboard) in Atlas of Living Australia’s Biodiversity Volunteer Portal. These dashboards provide such things as a map of specimens transcribed by the public user, the user’s badges, and completed missions in which the user participated. The group also produced white papers on ideas to encourage user sign-in, gamification ideas for Notes from Nature and the Biodiversity Volunteer Portal, and classification of user experience. Much of the further progress in this area will involve testing and implementation of this new functionality in the production versions of Notes from Nature and Atlas of Living Australia’s Biodiversity Volunteer Portal.

Hackathon participants represented a broad range of career stages (undergraduate students, graduate students, postdoctoral scholars, computer programmers, and university faculty) and institutions, including the Adler Planetarium, University of California–Berkeley, Cornell University, Harvard University, King’s College London, Australian Museum, Smithsonian, New York Botanical Garden, Botanical Research Institute of Texas, Illinois Natural History Survey, Atlanta University Center, National Ecological Observatory Network, and many others. Digital humanities projects represented at the hackathon included the University of Iowa Libraries’ DIYHistory Transcription Project, Indiana University’s Data to Insight Center, the Outreach Ethnomusicology project, and the FromThePage.com transcription project. Biodiversity projects represented included Notes from Nature, iDigBio, VertNet, Atlas of Living Australia, Symbiota, FilteredPush, Morphbank, Smithsonian Digital Volunteers, and the Biodiversity Heritage Library.

Documentation of the hackathon can be found at the CITSCribe wiki (https://www.idigbio.org/wiki/index.php?title=Transcription_Hackathon). This includes a complete participant list and many recorded presentations. Hackathon participants used the hashtag #CITSCribe, and a few additional photos are available at https://www.facebook.com/iDigBio/photos_stream.