Data Science Project: BIOSCAN-5M

Overview

Figure: The BIOSCAN-5M dataset.

This project introduces a new large-scale dataset of over five million samples. Each sample includes a high-quality microscopic RGB image, a DNA nucleotide barcode sequence, and a Barcode Index Number (BIN).

Data Curation-Governance

Data curation involved the collection, organization, and maintenance of data to ensure its quality and accessibility. The key tasks included:

ETL/ELT: Extract-Transform-Load / Extract-Load-Transform

Data Migration:

Data Transformation:

Multiple-Data-Modalities

Images

The BIOSCAN-5M dataset contains 5,150,808 high-quality images of living organisms.

Figure: BIOSCAN-5M high-quality microscopic RGB images.

DNA Nucleotide Barcode Sequences

The DNA barcode sequence below shows the arrangement of the four nucleotides, Adenine (A), Thymine (T), Cytosine (C), and Guanine (G), within a designated gene region, such as the mitochondrial cytochrome c oxidase subunit I (COI) gene. The sequence is visually represented in blocks of distinct colors:

            TTTATATTTTATTTTTGGAGCATGATCAGGAATAGTTGGAACTTCAATAAGTTTATTAATTCGAACAGAATTAAGCCAACCAGGAATTTTTATTGGTAATGACCAAATTTATAATGTAATTGTTACAGCTCATGCCTTTATTATAATTTTTTTTATAGTTATACCTATTATAATTGGAGGATTCGGAAATTGACTAGTCCCATTAATATTAGGAGCTCCTGATATAGCTTTCCCTCGAATAAATAATATAAGTTTTTGAATGTTACCTCCTTCATTAACTCTATTATTATCAAGAAGAATAGTTGAAAATGGAGCTGGAACAGGATGAACTGTTTATCCCCCTTTATCCTCAGGAACTGCTCATGCAGGAGCTTCTGTTGATCTTGCTATTTTCTCTTTACATTTAGCAGGAATTTCTTCAATTCTTGGAGCTGTAAATTTTATTACAACAATTATTAATATACGATCTTCAGGAATTACACTTGATCGAATACCTTTATTTGTTTGATCTGTAATTATTACAGCTATTCTACTTTTACTGTCTCTTCCAGTATTAGCTGGAGCTATTACAATATTATTAACTGATCGTAATTTAAATACATCTTTTTTTGACCCAATTGGAGGAGGAGATCCAATTCTATATCAACATTTAT
            
Figure: Visual representation of the DNA sequence, offering a glimpse into the structure of DNA.
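
As a small illustration of working with a barcode like the one above, the following sketch counts each nucleotide and computes the GC fraction for the first 20 bases of that sequence. The function name `nucleotide_composition` is hypothetical and is not part of the project code.

```python
from collections import Counter


def nucleotide_composition(barcode: str) -> dict:
    """Return the length, per-nucleotide counts, and GC fraction of a DNA barcode."""
    seq = barcode.strip().upper()
    counts = Counter(seq)
    total = len(seq)
    gc_fraction = (counts["G"] + counts["C"]) / total if total else 0.0
    return {
        "length": total,
        "counts": {base: counts.get(base, 0) for base in "ATCG"},
        "gc_fraction": gc_fraction,
    }


# First 20 bases of the barcode shown above.
prefix = "TTTATATTTTATTTTTGGAG"
print(nucleotide_composition(prefix))
# {'length': 20, 'counts': {'A': 4, 'T': 13, 'C': 0, 'G': 3}, 'gc_fraction': 0.15}
```

The low GC fraction of this prefix is only a property of the toy input; no claim is made here about the dataset as a whole.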

Color Scheme

The color scheme is designed as follows:

These nucleotides, represented by their respective colors, play a pivotal role in defining the genetic information encoded within the DNA sequence.

Textual Taxonomy Labels

Taxonomic group ranking annotations categorize organisms hierarchically based on evolutionary relationships. They organize species into groups based on shared characteristics and genetic relatedness.

Figure: Taxonomic classification tree.

Geographic Information

Each dataset sample includes geographic information about the collection sites, captured through four key data attributes:

1- Latitude and 2- Longitude

Figure: Locations of the specimen collection sites.

3- Country and 4- Province/State

Figure: Countries associated with the collection sites.

Size Information

Each dataset sample includes size information about each specimen, captured through three key data attributes:

1- Image Measurement Value

The number of pixels occupied by the organism in its image.
Figure: Pixel count.

2- Area Fraction

The fraction of the original image that the cropped image comprises, based on the bounding box produced by the cropping tool.

3- Scale Factor

The ratio of the size of the cropped image to the size of the cropped and resized image, based on the bounding box produced by the cropping tool.
Figure: Bounding box detected by our cropping tool.
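
The two ratios above reduce to simple arithmetic, sketched below. This is an illustration under stated assumptions, not the project's implementation: the bounding box is taken as a pixel width and height, the resize step is assumed to scale the crop so that its shorter side equals a target length, and the 256-pixel target is a placeholder value. All function and parameter names are hypothetical.

```python
def area_fraction(orig_w: int, orig_h: int, box_w: int, box_h: int) -> float:
    """Fraction of the original image area covered by the bounding-box crop."""
    return (box_w * box_h) / (orig_w * orig_h)


def scale_factor(box_w: int, box_h: int, resized_short_side: int = 256) -> float:
    """Ratio of the crop size to the resized-crop size.

    ASSUMPTION: the ratio is measured along the shorter side, and the
    resized crop has a shorter side of `resized_short_side` pixels.
    """
    return min(box_w, box_h) / resized_short_side


# Toy example: a 512x384 crop taken from a 1024x768 image.
print(area_fraction(1024, 768, 512, 384))  # 0.25
print(scale_factor(512, 384))              # 1.5
```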

Statistical Analytics

Biological Statistics

Geographical Statistics

Genetic Statistics

Figure: Distribution of pairwise distances of subgroups of class. The x-axis shows the subgroup categories sorted alphabetically.
Figure: Distribution of pairwise distances of subgroups of order. The x-axis shows the subgroup categories sorted alphabetically.
Figure: Distribution of pairwise distances of subgroups of species. Among the species, there are 8,372 distinct subgroups with sufficient identical barcodes for calculating pairwise distances, which makes visualization challenging. To address this, the groups are sorted in descending order based on their mean distances and partitioned into 100 bins. These bins are used to plot the distribution of pairwise distances within the species rank. The mean distance of each bin is displayed along the x-axis.
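
The sort-and-bin procedure used for the species rank can be sketched as follows. This is a minimal, dependency-free illustration: the function name, the near-equal bin sizes, and the choice to report each bin's mean as the mean of its group means are assumptions, since the text does not specify them.

```python
from statistics import mean


def bin_groups_by_mean_distance(group_distances: dict, n_bins: int = 100) -> list:
    """Sort groups by mean pairwise distance (descending) and split them into bins.

    `group_distances` maps a group name to its list of pairwise distances.
    Returns a list of (bin_mean, pooled_distances) tuples, one per bin.
    """
    if not group_distances:
        return []
    ranked = sorted(group_distances.items(), key=lambda kv: mean(kv[1]), reverse=True)
    n_bins = min(n_bins, len(ranked))
    base, extra = divmod(len(ranked), n_bins)  # near-equal bin sizes
    bins, start = [], 0
    for i in range(n_bins):
        size = base + (1 if i < extra else 0)
        chunk = ranked[start:start + size]
        start += size
        pooled = [d for _, dists in chunk for d in dists]
        # ASSUMPTION: a bin's mean is the mean of its groups' mean distances.
        bin_mean = mean([mean(dists) for _, dists in chunk])
        bins.append((bin_mean, pooled))
    return bins


# Toy data: five hypothetical species groups split into two bins.
toy = {
    "sp_a": [0.0, 0.2],
    "sp_b": [1.0, 1.2],
    "sp_c": [0.5],
    "sp_d": [2.0, 2.0],
    "sp_e": [0.1, 0.1],
}
for bin_mean, pooled in bin_groups_by_mean_distance(toy, n_bins=2):
    print(round(bin_mean, 3), pooled)
```

Each bin's pooled distances can then be plotted as one distribution, with the bin mean as its x-axis label.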

Data Distributions

ML Benchmark

Data Partition

Figure: Data split distribution to facilitate closed-world and open-world settings.

BarcodeBERT: DNA Sequence Classification

The proposed semi-supervised learning setup, based on BarcodeBERT (Arias et al., 2023), has two stages. (1) Pretraining: DNA sequences are tokenized using non-overlapping k-mers, and 50% of the tokens are masked for the masked language modeling (MLM) task. Tokens are encoded and fed into a transformer model. The output embeddings are used for token-level classification. (2) Fine-tuning: All DNA sequences in a dataset are tokenized using non-overlapping k-mer tokenization, and all tokenized sequences, without masking, are passed through the pretrained transformer model. Global mean-pooling is applied over the token-level embeddings, and the output is used for taxonomic classification.

The results are presented in Gharaee et al. (2024).

Figure: BarcodeBERT model.
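
The tokenization, masking, and pooling steps described in this section can be sketched in plain Python. This is not the BarcodeBERT code: the k-mer length `k=4`, the `[MASK]` token string, the handling of a trailing remainder, and all function names are assumptions for illustration, and the transformer itself is omitted.

```python
import random


def kmer_tokenize(seq: str, k: int = 4) -> list:
    """Split a DNA sequence into non-overlapping k-mers (trailing remainder dropped)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def mask_tokens(tokens: list, mask_ratio: float = 0.5,
                mask_token: str = "[MASK]", seed: int = 0):
    """Replace a fixed fraction of tokens with a mask token.

    Returns the masked token list and the sorted masked positions.
    """
    rng = random.Random(seed)
    n_mask = int(len(tokens) * mask_ratio)
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = mask_token
    return masked, positions


def mean_pool(embeddings: list) -> list:
    """Average token-level embedding vectors into one sequence-level vector."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


tokens = kmer_tokenize("TTTATATTTTATTTTTGGAGCATG", k=4)
print(tokens)  # ['TTTA', 'TATT', 'TTAT', 'TTTT', 'GGAG', 'CATG']

masked, positions = mask_tokens(tokens)  # pretraining: half the tokens are masked
print(masked, positions)

# Fine-tuning: token embeddings (toy values here) are mean-pooled per sequence.
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```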

Zero-shot Clustering

Images and DNA are each passed through one of several pretrained encoders. These representations are clustered with Agglomerative Clustering.

The results are presented in Gharaee et al. (2024).

Figure: Zero-shot clustering.
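
The clustering step can be illustrated with a naive, dependency-free agglomerative procedure over precomputed embeddings. The cosine distance, the average linkage, and the function names are assumptions; the text does not state which metric or linkage was used. In practice a library implementation such as scikit-learn's `AgglomerativeClustering` would be the usual choice.

```python
import math


def cosine_distance(a: list, b: list) -> float:
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def agglomerative_cluster(embeddings: list, n_clusters: int) -> list:
    """Naive average-linkage agglomerative clustering; returns one label per embedding."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Average pairwise cosine distance between the members of two clusters.
        total = sum(cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy embeddings standing in for the outputs of a pretrained encoder.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(agglomerative_cluster(embeddings, n_clusters=2))  # [0, 0, 1, 1]
```

This version recomputes every linkage on each merge, so it is only suitable for small toy inputs.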

BIOSCAN-CLIBD: Multi-modal Contrastive Learning

Our experiments using BIOSCAN-CLIBD (Gong et al., 2024) are conducted in two steps. (1) Training: Multiple modalities, including RGB images, textual taxonomy, and DNA sequences, are encoded separately and trained using a contrastive loss function. (2) Inference: An image or DNA embedding is used as a query and compared to the key embeddings obtained from a database of images, DNA, and text. Cosine similarity is used to find the closest key embedding, and the corresponding taxonomic label is used to classify the query.

Figure: BIOSCAN-CLIBD model.
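
The inference step, in which a query embedding is matched to its closest key by cosine similarity, can be sketched as below. The embeddings are toy values rather than encoder outputs, and the function names and example labels are hypothetical.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_by_nearest_key(query: list, keys: list, key_labels: list) -> str:
    """Return the taxonomic label of the key embedding most similar to the query."""
    best = max(range(len(keys)), key=lambda i: cosine_similarity(query, keys[i]))
    return key_labels[best]


# Toy key database: one embedding per labeled record (image, DNA, or text).
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
key_labels = ["Diptera", "Coleoptera", "Hymenoptera"]

# Toy query embedding, e.g. from an image or a DNA barcode.
query = [0.1, 0.9, 0.2]
print(classify_by_nearest_key(query, keys, key_labels))  # Coleoptera
```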

We report top-1 macro accuracy (%) on the test set using different amounts of pre-training data (1 million vs. 5 million records from BIOSCAN-5M) and various combinations of aligned embeddings (image, DNA, and text) during contrastive training. Our results include accuracy for image-to-image, DNA-to-DNA, and image-to-DNA query-key combinations. As a baseline, we provide the results without contrastive learning (no alignment). We report accuracy separately for seen and unseen species, along with the harmonic mean (H.M.) between these values.
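
For reference, the two summary metrics can be computed as in the sketch below. It assumes the usual definition of macro accuracy as the unweighted mean of per-class accuracies. The function names and toy labels are hypothetical, and the numbers are not results from the experiments.

```python
from collections import defaultdict


def macro_accuracy(y_true: list, y_pred: list) -> float:
    """Top-1 macro accuracy (%): the unweighted mean of per-class accuracies."""
    correct, total = defaultdict(int), defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return 100.0 * sum(correct[c] / total[c] for c in total) / len(total)


def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Harmonic mean (H.M.) of the seen-species and unseen-species accuracies."""
    if seen_acc + unseen_acc == 0:
        return 0.0
    return 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc)


# Toy labels: class "a" is 2/3 correct and class "b" is 1/1 correct.
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
print(round(macro_accuracy(y_true, y_pred), 2))  # 83.33

# Toy seen and unseen accuracies.
print(round(harmonic_mean(80.0, 40.0), 2))  # 53.33
```

The harmonic mean rewards balanced performance: it stays low when either the seen or the unseen accuracy is low.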

BIOSCAN-CLIBD predicting orders


BIOSCAN-CLIBD predicting families


BIOSCAN-CLIBD predicting genus


BIOSCAN-CLIBD predicting species


Detailed results are presented in Gharaee et al. (2024).

Tools and Technologies

This section discusses the design and implementation of the project.

Data Processing Pipeline

The pipeline includes stages for:

Data Migration

Data Structure

Data Visualization

Various libraries were used for data visualization:

Data Analytics and Data Processing

Tools and libraries used for data analytics and processing include:

Computing Infrastructure

Infrastructure and environments used include: