Integrating phylogenetic data within BiCIKL: a roundtable full of suggestions

28 June 2023

“Integrating Phylogenetic data within BiCIKL: Defining and implementing a data model to improve linkage of phylogenetic trees to DNA sequences, literature, taxonomy, and specimens”. 

This was the title of the 4th BiCIKL Roundtable, held on 26-27 April at the GBIF Secretariat in Copenhagen and online.


The main goal of the workshop was to expand participation in BiCIKL beyond DNA sequences, literature, taxonomy, and specimens by inviting like-minded people who are interested in FAIR data and involved with phylogenetic infrastructures.


The secondary goal was to devise case studies to integrate or link phylogenetic data, such as phylogenetic trees, DNA matrices, and material citations, into the BiCIKL infrastructures and vice versa. The roundtable was also used to update the BiCIKL community on the developing GBIF data model and to use that model as a standard for integrating phylogenetic data with BiCIKL.


To that end, the following researchers from outside BiCIKL attended the roundtable:

·      TreeBase - Bill Piel, University of Singapore

·      OpenTree - Mark Holder, University of Kansas

·      Data Model - John Wieczorek, TDWG, Argentina

·      Citation and clustering - Nicki Nicolson, Kew


Over a dozen BiCIKL investigators attended, as well as many GBIF Secretariat staff and staff from the Danish Natural History Museum.

The event started with a full afternoon of presentations on Wednesday, followed by a hands-on, all-day workshop on Thursday to develop the use cases.

The roundtable clearly found a focus on two data sets that are critical to integration. The first is the Nexus-formatted phylogenetic data that is currently used by TreeBase and OpenTree and therefore across the phylogenetic community. A Nexus file can contain a DNA matrix and a phylogeny that can be visualized in an interactive manner. The second data set is a Darwin Core (DwC) file that contains identifiers that can link the data in the Nexus file to other data commonly held by BiCIKL research infrastructures (RIs).
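As a rough illustration of how the two files could work together, the sketch below reads the taxon labels from a small Nexus file and looks each one up in a DwC-style table. It is only a minimal sketch under stated assumptions: the `nexusLabel` join column, the placeholder taxon names, and every identifier value are hypothetical, and the proposed DwC-Nexus format discussed at the roundtable may be organized quite differently.

```python
import csv
import io
import re

# Toy Nexus file with placeholder taxa (not real data).
NEXUS_TEXT = """#NEXUS
BEGIN TAXA;
    DIMENSIONS NTAX=3;
    TAXLABELS Aus_bus Aus_cus Dus_eus;
END;
BEGIN TREES;
    TREE tree1 = ((Aus_bus,Aus_cus),Dus_eus);
END;
"""

# Toy DwC-style table. 'nexusLabel' is a hypothetical join column;
# the other headers are Darwin Core terms. All values are invented.
DWC_TEXT = """nexusLabel,scientificName,catalogNumber,associatedSequences
Aus_bus,Aus bus,SPEC-001,SEQ-001
Aus_cus,Aus cus,SPEC-002,SEQ-002
"""


def read_taxlabels(nexus_text):
    """Return the labels listed in the TAXLABELS command.

    Simplified: assumes unquoted, whitespace-separated labels.
    """
    match = re.search(r"TAXLABELS\s+(.*?);", nexus_text,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).split() if match else []


def link_labels(nexus_text, dwc_text):
    """Map each Nexus taxon label to its DwC record, or None if unmatched."""
    reader = csv.DictReader(io.StringIO(dwc_text))
    records = {row["nexusLabel"]: row for row in reader}
    return {label: records.get(label) for label in read_taxlabels(nexus_text)}


links = link_labels(NEXUS_TEXT, DWC_TEXT)
for label, record in links.items():
    if record is None:
        print(f"{label}: no DwC record found")
    else:
        print(f"{label}: specimen {record['catalogNumber']}, "
              f"sequence {record['associatedSequences']}")
```

Tips of the tree without a matching record are reported rather than silently dropped, which is the kind of gap a taxonomic or specimen matching step would need to resolve.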


The following general needs were identified:

·      An improved phylogenetic tree repository, given the code issues with TreeBase

·      The ability to visualize trees in RIs interactively, with additional data such as GBIF occurrences or specimen labels

·      An improved taxonomic matching system for phylogenies using ChecklistBank

·      Improved material citations to integrate the evidence behind a phylogeny (specimen, DNA, literature, taxonomy) with BiCIKL RIs


In addition to these specific needs, the group discussed longer-term issues such as providing input trees and citations for OpenTree.

Four specific use cases were scoped:


1.      TreeBase publishing to GBIF for long-term preservation. The current TreeBase data store will be translated into the proposed DwC-Nexus files needed for integration. The data will be turned into treatments and material citations, and the matrix will be stored (TBD). The Nexus phylogeny will be made available via GBIF, and the TreeBase and OpenTree identifiers will be made available.

2.     Harvesting phylogenies from PDF literature. The phylogenetic figures in PDFs will be extracted and formatted in Nexus. These will be matched with the DwC data already harvested from the same publication. They will be submitted to GBIF as material citations and made available for linking to other RIs. These literature material citations will contain Nexus files with phylogeny IDs, taxonomy, specimen and sequence information.

3.     Data paper publishing model for new projects. A set of Nexus and DwC-type files from recently published phylogenetic papers will be harvested and reformatted into the Pensoft publication model, which creates the linkages via the identifiers. This will be published as a data paper, which will then be accessible via GBIF as a material citation.

4.     PAFTOL publishing with the new data model. The developing PAFTOL dataset, which has well-structured Nexus and DwC data, will be integrated with Kew specimen data and ingested into the new GBIF data model as a test case. This will add a new dimension to the data model and provide knowledge on how the other use cases could eventually be integrated via the developing model.


The use cases will be running in parallel over the next several months. GBIF will act as a coordinator among the projects. The hope is that later this year the developments can be the basis of a public webinar to demonstrate the advances, seek input, and introduce new RIs to BiCIKL.