"This paper comes as an output of task 1.2 in the BiCIKL Project. The idea was to put a hackathon in place in order to bring the community together" - says Sofie Meeus from Meise Botanic Garden, task Leader - "producing recommendations coming out directly from the needs of the scientists. That has been quite a challenge, considering we were doing this during a pandemic crisis".
She's referring to "Recommendations for interoperability among infrastructures" (https://riojournal.com/article/96180/), recently published in the RIO journal.
"It has been a full week of hackathon in hybrid mode. Very demanding, for sure, but I think it worked out pretty well, as a founding moment for the BiCIKL community and from a scientific point of view" - Meeus added - "The BiCIKL project is born from a vision that biodiversity data are most useful if they are presented as a network of data that can be integrated and viewed from different starting points. BiCIKL’s goal is to realise that vision by linking biodiversity data infrastructures, particularly for literature, molecular sequences, specimens, nomenclature and analytics. So we acted."
To make those links, we need to better understand the existing infrastructures: their limitations, the nature of the data they hold, the services they provide and, in particular, how they can interoperate. With those aims in mind, in the autumn of 2021, 74 people from the biodiversity data community engaged in a total of twelve hackathon topics, assessing the current state of interoperability between infrastructures holding biodiversity data.
"These topics examined interoperability from several angles. Some were research subjects that required interoperability to get results, some examined modalities of access and the use and implementation of standards, while others tested technologies and workflows to improve the linkage of different data types. And we came out with 5 high-level recommendations for infrastructures (that can be detailed into more specific needs)" - says Meeus.
(1) The use of data brokers
An alternative to linking infrastructures directly is for a third-party infrastructure to act as a broker between them. Data brokerage is particularly important where multiple identifier systems exist, as with person identifiers: no single resource holds all the links between people, specimens and literature, and no single person identifier system works for every situation. Data brokers have a crucial role in providing links between identifier systems, creating links where no other source exists, and providing links that can be curated by the community.
- If direct linking cannot be supported between infrastructures, explore using data brokers to store links.
- Cooperate with open linkage brokers to provide a simple way of establishing two-way links between infrastructures, without having to coordinate among many different organisations.
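One way to picture such a broker is as a curated, two-way link table between identifier systems. The sketch below is purely illustrative: the class name, the identifier values and the provenance labels are all hypothetical, not part of any BiCIKL infrastructure.

```python
# Minimal sketch of a third-party identifier broker (all records hypothetical).
# Each link asserts that two identifiers from different systems refer to the
# same person, with a provenance label so the community can curate it.
from collections import defaultdict

class IdentifierBroker:
    def __init__(self):
        # (system, identifier) -> set of linked ((system, identifier), source)
        self._links = defaultdict(set)

    def add_link(self, a, b, source="community"):
        """Store a two-way link between identifiers from two systems."""
        self._links[a].add((b, source))
        self._links[b].add((a, source))

    def lookup(self, key):
        """Return all identifiers linked to `key`, with their provenance."""
        return sorted(self._links[key])

broker = IdentifierBroker()
# Hypothetical example: one collector known under two identifier systems.
broker.add_link(("orcid", "0000-0002-1825-0097"),
                ("wikidata", "Q123456"), source="curator")

print(broker.lookup(("orcid", "0000-0002-1825-0097")))
# [(('wikidata', 'Q123456'), 'curator')]
```

Storing the provenance alongside each link is what makes community curation possible: a disputed link can be traced back to whoever asserted it.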
(2) Building communities and trust
The user community will not only use the linked infrastructures but also contribute to them, for example by enriching data brokers and providing feedback to infrastructures. Infrastructures should facilitate the reporting of issues, including issues related to incompatibilities between infrastructures.
- Facilitate and encourage the external reporting of issues related to their infrastructure and its interoperability.
- Facilitate and encourage requests for new features related to their infrastructure and its interoperability.
- Provide development roadmaps openly.
(3) Cloud computing as a collaborative tool
During the hackathon we demonstrated the ability to collaborate openly across institutions using shared infrastructure.
Beyond collaboration, cloud infrastructures also commonly offer services built on massive-scale machine learning implementations, including powerful enrichment services such as georeferencing, computer vision, translation and data clustering. Infrastructures can use such state-of-the-art services to enrich the data they serve and to make links to other infrastructures, benefitting from economies of scale they could not achieve on their own.
- Provide cloud-based environments to allow external participants to contribute and test changes to features.
- Consider the opportunities that cloud computing brings as a means to enable shared management of the infrastructure.
- Promote the sharing of knowledge around big data technologies amongst partners, using cloud computing as a training environment.
(4) Standards compliance
It is fairly obvious that adoption of, and continued compliance with, community standards is a positive step towards interoperability. Standards need to be developed by a broad community to be useful to that whole community, but standards development and compliance require investment by infrastructures. So the recommendations are:
- Invest in standards compliance and work with standards organisations to develop new standards and extend existing ones.
- Report on and review standards compliance within an infrastructure, with metrics that give credit for work on standards compliance and development.
(5) Multiple modalities of access
The ways researchers access data strongly influence how research is conducted and how easily researchers can do what they want. BiCIKL infrastructures aim to provide Open Data that can be used however users want: they want to support innovative uses and novel applications, but also more prosaic uses of the data, so that more and better science can be done in a timely manner. The modes by which data are accessed are therefore an important consideration in reducing the barriers and friction to the use of these data, and they are critical to what uses can be made of the data. We recommend that infrastructures provide as many different modalities of access as possible; only then will they give access to the data without limiting how researchers can use them. We have distinguished four basic levels of access, all of which are used in the community. These are:
- browsing the data via a web portal,
- programmatic access via an API,
- downloading data to be used locally and
- personal requests for unique sets of data.
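The third modality, local download, can be illustrated with a minimal sketch. The records below are invented; the column names follow Darwin Core terms, but the file layout is an assumption, not that of any particular infrastructure.

```python
# Hedged sketch: analysing a locally downloaded occurrence table (CSV) whose
# columns use Darwin Core terms. All records below are invented examples.
import csv
import io

DOWNLOADED = """occurrenceID,scientificName,decimalLatitude,decimalLongitude
occ-1,Puma concolor,4.5,-74.1
occ-2,Puma concolor,-12.0,-77.0
occ-3,Quercus robur,51.5,-0.1
"""

def count_by_species(text):
    """Tally how many occurrence records each species has."""
    counts = {}
    for row in csv.DictReader(io.StringIO(text)):
        name = row["scientificName"]
        counts[name] = counts.get(name, 0) + 1
    return counts

print(count_by_species(DOWNLOADED))
# {'Puma concolor': 2, 'Quercus robur': 1}
```

Once the data are local, researchers are free to combine them with any tool they like, which is exactly the flexibility this modality is meant to provide.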
On this point, our recommendations are:
- Provide as many different modalities of access as possible.
- Avoid requiring personal contacts to download data.
- Provide a full description of an API and the data it serves.
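The last recommendation can be made concrete with a machine-readable API description, for example in the OpenAPI format. The fragment below is purely illustrative: the title, path and fields are hypothetical and do not describe any actual BiCIKL infrastructure.

```yaml
# Hypothetical fragment of an OpenAPI description for a biodiversity data API.
openapi: 3.0.3
info:
  title: Example Biodiversity Data API
  version: "1.0"
  description: Serves specimen records with links to other infrastructures.
paths:
  /specimens/{id}:
    get:
      summary: Retrieve one specimen record by its stable identifier.
      parameters:
        - name: id
          in: path
          required: true
          schema:
            type: string
      responses:
        "200":
          description: A specimen record, including links to literature.
          content:
            application/json:
              schema:
                type: object
                properties:
                  id: {type: string}
                  scientificName: {type: string}
                  literatureLinks:
                    type: array
                    items: {type: string, format: uri}
```

A description like this documents both the API and the data it serves, so that users can build clients without personal contact with the data providers.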