Between 20th and 24th September 2021, representatives of the 14 institutions partnering in the BiCIKL project teamed up in a hybrid hackathon, held simultaneously at the Meise Botanic Garden (Belgium) and on Zoom.
The hackathon aimed to explore in practice how data types stored at the participating infrastructures and other infrastructures can best interact with each other, in order to provide a much more complete picture of biodiversity knowledge with minimal effort on the part of researchers.
As a first step, the partners were tasked with identifying barriers to interoperability between the infrastructures. To do so, they formed nine teams and embarked on nine topical projects meant to improve on the existing databases, tools and platforms by creating new workflows to link their infrastructures and feed them with additional contextual data.
During the first day of the hackathon, the project leads presented their respective infrastructures and the hackathon topics. What followed were three days of intensive hacking within and between the assigned teams. Each group then presented their results and lessons learned at the hackathon’s final meeting on Friday.
As a bonus, members of LifeWatch ERIC working on their new LifeBlock data management and storage ecosystem - which relies on blockchain and IPFS technology - were invited to demonstrate the infrastructure and explore possible use cases.
Project 1: Finding the lost parents
Together, the teams of the Meise Botanic Garden, Swiss Institute of Bioinformatics, Global Biodiversity Information Facility (GBIF) and Plazi looked into plant hybrids: an excellent example of taxa where scientific names abound.
The issue here is that while names appear in various places, including the participating infrastructures of the Catalogue of Life, International Plant Names Index, Global Biodiversity Information Facility Taxonomic Backbone, Wikidata and TreatmentBank, there is currently no resource to collate information about the parent species for individual hybrids.
During the hackathon, Team 1 jointly parsed plant hybrid names from literature and taxonomic checklists and found 20,785 accepted hybrid names and 24,744 hybrid names which could not be matched to any taxon. The team found the parents for a total of 23,351 hybrids. The hybrids and their parents were then matched in the Catalogue of Life. They also looked at occurrence data to find pairs of species where one is native and one is alien to a country.
Additionally, the team identified a major obstacle when it came to matching hybrids between databases: the multiplication sign “ × ” has not been consistently used across literature and has instead been repeatedly replaced with various similar symbols and letters, thus impeding automated discovery.
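The symbol problem described above can be illustrated with a minimal sketch. It is not the team's actual code: the function names, the list of look-alike characters and the choice to put a single space on either side of the sign are all assumptions made for illustration. It normalises common stand-ins for the multiplication sign and then splits a hybrid formula into its two parent names.

```python
import re

HYBRID_SIGN = "\u00d7"  # the proper multiplication sign

# Look-alike symbols sometimes used instead of the multiplication sign.
# Illustrative only, not an exhaustive list.
_LOOKALIKES = str.maketrans({c: HYBRID_SIGN for c in "\u2715\u2716\u2a09\u2a2f"})

# A letter x/X standing alone (at the start or after whitespace, and
# followed by whitespace) is assumed to be a hybrid marker.
_LETTER_X = re.compile(r"(?:^|(?<=\s))[xX](?=\s)")


def normalize_hybrid_sign(name: str) -> str:
    """Return the name with hybrid markers unified to a spaced multiplication sign."""
    name = name.translate(_LOOKALIKES)
    name = _LETTER_X.sub(HYBRID_SIGN, name)
    # Assumed spacing convention: one space on each side of the sign.
    return re.sub(r"\s*" + HYBRID_SIGN + r"\s*", f" {HYBRID_SIGN} ", name).strip()


def split_hybrid_formula(name: str):
    """Return (parent_a, parent_b) for a hybrid formula, or None otherwise.

    Only names of the form 'Genus species x Genus species' are treated as
    formulas; named hybrids such as 'Mentha x piperita' carry no parents.
    """
    parts = [p.strip() for p in normalize_hybrid_sign(name).split(HYBRID_SIGN)]
    if len(parts) == 2 and all(len(p.split()) >= 2 for p in parts):
        return parts[0], parts[1]
    return None
```

With this sketch, `normalize_hybrid_sign("Mentha x piperita")` yields `"Mentha × piperita"`, and `split_hybrid_formula("Mentha aquatica X Mentha spicata")` yields the two parent names, while a named hybrid returns `None`. A real pipeline would need a longer list of substitutes and a proper name parser.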
To discover and/or validate the links between a hybrid and its potential parents, the team designed a prototype based on literature and on terminologies constructed from GBIF data. For a given hybrid name, an automatic process retrieves the publication(s) that mention this specific hybrid together with at least two “non-hybrid” species. The publication(s) can then be displayed in a dedicated viewer that highlights annotations (in particular the species names), allowing users to confirm the existence of a parent/child relationship.
Project 2: How good are Triple IDs in ENA?
At the hackathon, the Botanic Garden and Botanical Museum Berlin, European Nucleotide Archive (ENA) and Global Biodiversity Information Facility (GBIF) tasked themselves with finding a way to automate methods for validation, correction and annotation of links between specimens and sequence data from ENA created through GBIF triple IDs.
They queried all triple IDs present in the different object categories in ENA using the ENA API, then ran those against the GBIF API to determine the number of exact matches, as well as the number of approximate matches. Then, they compared metadata from GBIF and primary sources to resolve issues with the approximate matches and with triple IDs that had multiple potential hits.
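The distinction between exact and approximate matches can be sketched as follows. This is a simplified illustration, not the team's method: it assumes a triple ID has the form `institutionCode:collectionCode:catalogNumber`, it makes no calls to the ENA or GBIF APIs, and the "approximate" rule (ignoring case, spaces and punctuation) is a hypothetical stand-in for whatever fuzzy comparison was actually used.

```python
import re
from typing import NamedTuple, Optional


class Triplet(NamedTuple):
    institution: str
    collection: str
    catalog: str


def parse_triplet(raw: str) -> Optional[Triplet]:
    """Split 'institution:collection:catalog' into its parts, or return None."""
    parts = [p.strip() for p in raw.split(":")]
    if len(parts) != 3 or not all(parts):
        return None
    return Triplet(*parts)


def _loose(value: str) -> str:
    """Lower-case the value and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def classify(query: Triplet, candidate: Triplet) -> str:
    """Return 'exact', 'approximate' or 'none' for a pair of triple IDs."""
    if query == candidate:
        return "exact"
    # Hypothetical fuzzy rule: equal once case and punctuation are ignored.
    if all(_loose(q) == _loose(c) for q, c in zip(query, candidate)):
        return "approximate"
    return "none"
```

For example, with made-up codes, `XYZ:Herb:A-12345` and `xyz:HERB:A 12345` would be classified as an approximate match. Approximate matches and IDs with several candidates are the cases that would then need the metadata comparison described above.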
The team joined forces with Team 3 and explored GBIF’s clustering algorithm. What they’ve learned will be proposed as a procedure for auto-correction and annotation of triple IDs in ENA.
Project 3: Enhancing the GBIF clustering algorithms
In their project, the Global Biodiversity Information Facility (GBIF), Biodiversity Information Standards (TDWG), Naturalis, the Botanic Garden and Botanical Museum Berlin and the European Nucleotide Archive (ENA) joined efforts to establish extra linkages between objects through the GBIF APIs, including those between material citations, specimens and sequences. They also devised a more robust mechanism to monitor ongoing improvements to the GBIF clustering. The team used Microsoft Azure Cloud and Databricks collaboratively for data exploration.
By the end of the hackathon, they reported a 15% increase in records on GBIF and a 50% increase for the European Nucleotide Archive.
Project 4: Assigning Latin scientific names to operational taxonomic units based on sequence clusters
In the meantime, the teams of the Global Biodiversity Information Facility (GBIF), UNITE/PlutoF and the Catalogue of Life looked into how best to select a taxon name for an operational taxonomic unit (OTU): a crucial step in linking molecular data to museum specimens, treatments etc. and a major bottleneck in metabarcoding studies.
Together, they identified best practices and room for improvement in the algorithms currently in use at UNITE/PlutoF and the International Barcode of Life project’s BINs dataset in GBIF, with a focus on those relying on accession numbers.
Project 5: Registering biodiversity-related vocabulary as Wikidata lexemes and linking their meaning to Wikidata items
On their end, the teams of Senckenberg - Leibniz Institution for Biodiversity and Earth System Research, Text Technology Lab and Leibniz Institute of Freshwater Ecology and Inland Fisheries explored how biodiversity natural language processing (NLP) pipelines can be linked to and from Wikidata in a way that enhances both. The team used NLP to extract biodiversity-related words and phrases, aiming to upload the vocabulary as lexemes to Wikidata. Those lexemes could then be provided with further information about their forms, senses and usages.
Having tested the workflow during the hackathon, the team now plans to apply it systematically to one or more of BiCIKL’s research infrastructures, in order to better disambiguate terms and recognise biodiversity-related entities with higher precision across a growing range of linguistic and thematic contexts and document types.
Project 6: FAIR Digital Object design from multiple sources
In Project 6, Naturalis joined efforts with the Botanic Garden and Botanical Museum Berlin, Senckenberg – Leibniz Institution for Biodiversity and Earth System Research and DiSSCo to explore methods for creating semantically enhanced FAIR Digital Objects (FDO) able to interconnect disparate biodiversity data within a coherent structure in the near future.
During the hackathon, they brought into play DiSSCo’s new Wikimedia-based modelling framework created for openDS modelling and worked together on identifying and fixing several issues within the tool. Then, they created a simple model for digital specimens in the tool and an automated workflow for semantic validation of new digital specimen objects. The team also came up with recommendations for data modelling to improve interoperability of the research infrastructures.
Project 7: Enriching Wikidata with information from OpenBiodiv about type specimens in context from different literature sources
Meanwhile, members of Pensoft, the Wikidata community and Elixir teamed up to explore ways to enrich existing Wikidata records - already containing taxa entries accompanied by identifiers linking to data from Zoobank, GBIF, iNaturalist, NCBI Taxonomy, Plazi, BOLD Systems and others - with literature data coming from the OpenBiodiv knowledge graph. The latter is continuously being populated with Linked Open Data statements extracted from literature, following the OpenBiodiv-O ontology.
During the hackathon, the team ran SPARQL queries against OpenBiodiv to explore collections which have been used in the description of taxa using type specimens. For prominent collections currently not featured on Wikidata, they suggested new pages.
Having looked into the existing Wikidata properties, they identified types of information about type specimens that could be fed from OpenBiodiv to Wikidata, thus establishing links between type specimens and taxon names, institutions, locations and literature references.
Project 8: Linking specimen with material citation and vice versa
Team 8, comprising Plazi, the SIB Swiss Institute of Bioinformatics (SIB), Berner Fachhochschule (BFH), Meise Botanic Garden, the Global Biodiversity Information Facility (GBIF) and the Botanic Garden and Botanical Museum Berlin, tasked themselves with creating bi-directional links between specimens, their material citations, treatments and publications via API. With such links in place, a researcher would be able to easily understand what is known about a specimen at any time.
As a result of their work at the hackathon, the team came up with algorithms that allow finding the respective representations in either database, as well as annotating the specimen and material citations with the respective identifiers. Additionally, they designed a user interface that facilitates human interactions to curate links created by automation or create links that have not been discovered.
Project 9: Hidden women in science
In Project 9, the Meise Botanic Garden, Wikidata, Plazi, Global Biodiversity Information Facility (GBIF), Consortium of European Taxonomic Facilities (CETAF) and Science Stories set out to ensure that the achievements of women in science do not go unnoticed and that appropriate credit is given where it’s due.
At the hackathon, the collaborators wrote SPARQL queries against Wikidata to search for “missing” women scientists and began filling in the gaps by creating new entries on Wikipedia, Wikidata, Bionomia and Science Stories. The team even demonstrated how they created one such entry for a female botanist on a Wikipedia Weekly episode, which aired on YouTube.
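As a rough idea of what such a query might look like, here is a minimal sketch in Python. It is not the team's query. It assumes that “missing” means a woman botanist who has a Wikidata item but no English Wikipedia article, and the identifiers used (P21 for sex or gender, Q6581072 for female, P106 for occupation, Q2374149 for botanist) should be checked against Wikidata before use. The helper names and the User-Agent string are hypothetical.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(occupation_qid: str = "Q2374149", limit: int = 50) -> str:
    """Build a SPARQL query for women with the given occupation who have
    no English Wikipedia article (an assumed definition of 'missing')."""
    return f"""
SELECT ?person ?personLabel WHERE {{
  ?person wdt:P21 wd:Q6581072 ;
          wdt:P106 wd:{occupation_qid} .
  FILTER NOT EXISTS {{
    ?article schema:about ?person ;
             schema:isPartOf <https://en.wikipedia.org/> .
  }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {limit}
""".strip()


def run_query(query: str) -> list:
    """Send the query to the Wikidata Query Service and return the result rows."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    request = urllib.request.Request(
        url, headers={"User-Agent": "hidden-women-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]
```

Calling `run_query(build_query(limit=10))` would return up to ten candidate entries, each of which still needs human review before any new page is created.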
Invited project: A LifeBlock traceability and provenance service for data from external sources
Members of LifeWatch ERIC also participated in the hackathon in order to refine and test their new LifeBlock data provenance and traceability system. They communicated with various other teams to explore possible use cases for LifeBlock. The LifeBlock technology allows management of data from multiple sources with traceability, provenance and application of the FAIR principles.
The LifeBlock infrastructure aims to help the community, including BiCIKL’s partners and participating databases, by ensuring:
- Visibility and recognition of researchers
- Anti-tampering, as any modification is traced back to its origin and approved by the authors
- Backup system through IPFS
- Tokenization of ecosystem services
- Traceability
- History of changes performed, including author and modification
- Replicability: data, algorithms, research environments and parameters can be included in LifeBlock for each execution and experiment
- Perpetual link between data, citations and publications
We will hear more about the activities and outputs of the first BiCIKL hackathon at the TDWG 2021 conference coming up on 18-22 October 2021. Stay tuned!