BIOSCAN-5M Portfolio

Project Description

This project introduced a large-scale dataset of over five million samples, each consisting of a high-quality microscopic RGB image, a DNA barcode sequence, and a Barcode Index Number. The dataset comprises both structured and unstructured data formats. Structured data, provided in CSV and JSON-LD formats, includes taxonomy labels, DNA sequences, Barcode Index Numbers, geographic metadata, and specimen size information. Unstructured data consists primarily of images.

Key Contributions

Led the design and coordination of a 5M-image multimodal dataset, guiding multi-partner teams through data curation, preprocessing, and machine learning experimentation.
Developed preprocessing pipelines for image cropping, resizing, and bounding box extraction to support scalable training workflows.
Structured image data and metadata for ML readiness, including split-aware subdirectories (pretrain, train, validation, test, val_unseen, test_unseen, etc.) and metadata in CSV and JSON-LD formats to ensure consistency across training and evaluation pipelines.
Created a Hugging Face-compatible loader supporting resolution variants and standard ML splits (HuggingFace-Dataset).
Uploaded and organized datasets on cloud and open-access platforms to ensure accessibility and reproducibility (Links & Resources).
Led benchmarking of multimodal classification and clustering models, providing baseline evaluations and performance tracking.
Computed image-level statistics including bounding box-derived size features (area fraction, scale factor) to support downstream analyses.
Assisted in cleaning and standardizing biological taxonomic labels to resolve annotation inconsistencies at scale.
Curated and visualized biological, geographical, and genetic statistics across millions of records using dynamic tables and distribution plots.
Analyzed taxonomic-level variation using pairwise genetic distance distributions and the Shannon diversity index to support downstream ML task findings (Barcode-Metrics).
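As a rough illustration of the Shannon diversity index mentioned above, the sketch below computes H = -sum(p_i * ln p_i) over the label proportions of a group of specimens. The function name and the toy labels are hypothetical and are not taken from the project code; the natural logarithm is an assumption, since the base is not stated here.

```python
import math
from collections import Counter

def shannon_diversity(labels):
    """Shannon index H = -sum(p_i * ln p_i) over the label proportions."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Hypothetical labels: two equally frequent groups give H = ln(2).
toy_labels = ["sp_a", "sp_a", "sp_b", "sp_b"]
h = shannon_diversity(toy_labels)
```

A group dominated by a single label yields a value near zero, while evenly spread labels yield higher values.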

Tools & Technologies

Python, PyTorch, TensorFlow, Pandas, NumPy, PySpark (pyspark.sql)
h5py, CSV, JSON, Pickle, Hugging Face Datasets
Google API Client, PyDrive, Requests, google-auth
Transformers (e.g., ViT, DetrFeatureExtractor), timm
OpenCV, scikit-learn, PIL (Pillow)
Matplotlib (mpl_toolkits), Seaborn, Plotly
Cluster Computing, Parallel Computation
Virtual Environment, Bash Environment
Digital Research Alliance of Canada

Code / Git

GitHub Page

Research / Paper

Paper

Presentations

Proceedings of NeurIPS 2024

Links & Resources

Additional Context

Multiple Data Modalities

Images

The BIOSCAN-5M dataset contains 5,150,808 high-quality images of living organisms.

BIOSCAN-5M Insect Project — BIOSCAN-5M high-quality microscopic RGB images.

DNA Nucleotide Barcode Sequences

The presented DNA barcode sequence illustrates the nucleotide arrangement—Adenine (A), Thymine (T), Cytosine (C), and Guanine (G)—within a designated gene region, such as the mitochondrial cytochrome c oxidase subunit I (COI) gene.

TTTATATTTTATTTTTGGAGCATGATCAGGAATAGTTGGAACTTCAATAAGTTTATTAATTCGAACAGAATTAAG...

Visual representation of DNA sequence — This visual representation offers a glimpse into the intricate structure of DNA.

Color Scheme

Adenine (A): Red
Thymine (T): Blue
Cytosine (C): Green
Guanine (G): Yellow
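The color scheme above can be sketched as a simple lookup from nucleotide to RGB value. This is only an illustrative mapping: the exact RGB triples, the gray fallback for other symbols, and the function name are assumptions, not the code used to render the figure.

```python
# Color scheme from the list above, expressed as RGB triples (exact shades assumed).
NUCLEOTIDE_COLORS = {
    "A": (255, 0, 0),    # red
    "T": (0, 0, 255),    # blue
    "C": (0, 255, 0),    # green
    "G": (255, 255, 0),  # yellow
}
UNKNOWN_COLOR = (128, 128, 128)  # assumption: gray for any other symbol

def sequence_to_colors(sequence):
    """Return one RGB triple per nucleotide in the sequence."""
    return [NUCLEOTIDE_COLORS.get(base, UNKNOWN_COLOR) for base in sequence.upper()]

# First ten nucleotides of the barcode shown above.
colors = sequence_to_colors("TTTATATTTT")
```

The resulting list can be reshaped into rows of pixels to draw the sequence as a colored strip or grid.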

Textual Taxonomy Labels

Taxonomic group ranking annotations categorize organisms hierarchically based on evolutionary relationships.

Geographic Information

Each dataset sample includes geographic information about the collection sites, captured through four key data attributes:

1- Latitude and 2- Longitude

BIOSCAN-5M-lat-lon — Locations of the specimen collection sites.

3- Country and 4- Province/State

BS-5M-country — Countries associated with the sites of collection.

Size Information

Each dataset sample includes size information about each specimen, captured through three key data attributes:

1- Image Measurement Value

Count of pixels occupied by the organism in its image.

2- Area Fraction

The fraction of the original image that the cropped image comprises based on bounding box information.

3- Scale Factor

The ratio of the size of the cropped image to the size of the cropped and resized image, based on bounding box information.

BS-5M-bbx — Bounding Box detected by our cropping tool.
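The two bounding-box-derived attributes can be sketched as follows. This is a minimal illustration under stated assumptions: the area fraction is taken as bounding box area over original image area, and the scale factor is taken as the longer side of the crop divided by the longer side of the resized image. The function names, parameter names, and the example sizes are hypothetical.

```python
def area_fraction(bbox_width, bbox_height, image_width, image_height):
    """Fraction of the original image area covered by the bounding box crop."""
    return (bbox_width * bbox_height) / (image_width * image_height)

def scale_factor(bbox_width, bbox_height, resized_long_side):
    """Ratio of the crop's longer side to the resized image's longer side (assumed definition)."""
    return max(bbox_width, bbox_height) / resized_long_side

# Hypothetical example: a 512x384 crop from a 1024x768 image, resized to 256 on the long side.
fraction = area_fraction(512, 384, 1024, 768)
factor = scale_factor(512, 384, 256)
```

Under these assumptions, multiplying a length measured in the resized image by the scale factor recovers the length in pixels of the original image.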

Statistical Analytics

Biological Statistics

Geographical Statistics

Genetic Statistics

Data Distributions

ML Benchmarks

Data Partition

Seen: samples whose species label is an established scientific species name. Used in closed-world settings.

train
val
test

Unseen: samples labelled with an established scientific name for the genus and a uniquely identifying placeholder name for the species. Used in open-world settings.

key_unseen
val_unseen
test_unseen

Heldout: samples labelled with a placeholder genus and species name. Used in novelty detection.

other_heldout

Unknown: samples without a species label. Used in self- and semi-supervised learning.

pretrain

Data split distribution to facilitate closed world and open world settings.

BarcodeBERT: DNA Sequence Classification

Two stages of the proposed semi-supervised learning setup based on BarcodeBERT (Arias et al., 2023). (1) Pretraining: DNA sequences are tokenized into non-overlapping k-mers, and 50% of the tokens are masked for the masked language modeling (MLM) task. The tokens are encoded and fed into a transformer model, and the output embeddings are used for token-level classification. (2) Fine-tuning: All DNA sequences in a dataset are tokenized with the same non-overlapping k-mer tokenization and passed, without masking, through the pretrained transformer model. Global mean-pooling is applied over the token-level embeddings, and the pooled output is used for taxonomic classification.
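The tokenization and masking steps of the pretraining stage can be sketched as below. This is an illustration of the idea only, not the BarcodeBERT implementation: the choice of k = 4, the "[MASK]" token string, dropping a trailing remainder shorter than k, and the function names are all assumptions.

```python
import random

def kmer_tokenize(sequence, k):
    """Split a sequence into non-overlapping k-mers.

    Assumption: a trailing remainder shorter than k is dropped.
    """
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, k)]

def mask_tokens(tokens, mask_ratio=0.5, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with the mask token."""
    rng = random.Random(seed)
    n_mask = round(len(tokens) * mask_ratio)
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [mask_token if i in masked_positions else tok
              for i, tok in enumerate(tokens)]
    return masked, sorted(masked_positions)

# First 24 nucleotides of the barcode shown earlier on this page.
tokens = kmer_tokenize("TTTATATTTTATTTTTGGAGCATG", k=4)
masked, positions = mask_tokens(tokens)
```

During pretraining, the model would be asked to predict the original k-mer at each masked position; during fine-tuning, the unmasked token list is used directly.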

The results are presented in Gharaee et al. (2024).

Zero-shot Clustering

Images and DNA sequences are each passed through one of several pretrained encoders. The resulting representations are clustered with agglomerative clustering.
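To illustrate the clustering step, the sketch below implements a tiny agglomerative clustering in plain Python on toy 2-D vectors that stand in for encoder embeddings. It is a teaching sketch, not the benchmark code: average linkage, Euclidean distance, and the function name are assumptions, and in practice a library implementation such as scikit-learn's AgglomerativeClustering would be the more usual choice.

```python
import math

def agglomerative_cluster(points, n_clusters):
    """Merge the two closest clusters (average linkage) until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average Euclidean distance over all cross-cluster pairs.
        dists = [math.dist(points[i], points[j]) for i in a for j in b]
        return sum(dists) / len(dists)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    # Convert the cluster membership lists into one label per point.
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Toy stand-ins for embeddings: two well-separated groups.
embeddings = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]
labels = agglomerative_cluster(embeddings, n_clusters=2)
```

No labels are used at any point, which is what makes the setting zero-shot: cluster quality is assessed afterwards by comparing the assignments with the taxonomic labels.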

The results are presented in Gharaee et al. (2024).

BIOSCAN-CLIBD: Multi-modal Contrastive Learning

Our experiments with BIOSCAN-CLIBD (Gong et al., 2024) are conducted in two steps. (1) Training: Multiple modalities, including RGB images, textual taxonomy, and DNA sequences, are encoded separately and trained with a contrastive loss function. (2) Inference: An image or DNA embedding is used as a query and compared to the key embeddings obtained from a database of images, DNA sequences, and text. Cosine similarity is used to find the closest key embedding, and the corresponding taxonomic label is used to classify the query.
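The inference step amounts to a nearest-neighbor lookup under cosine similarity. The sketch below shows that lookup on toy vectors; the function names, the 3-dimensional embeddings, and the example labels are hypothetical and only stand in for real encoder outputs and key labels.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify_query(query, keys, key_labels):
    """Return the label of the key embedding most similar to the query."""
    best = max(range(len(keys)), key=lambda i: cosine_similarity(query, keys[i]))
    return key_labels[best]

# Toy key database: one embedding per labelled record.
keys = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
key_labels = ["Diptera", "Coleoptera"]
predicted = classify_query((0.9, 0.1, 0.0), keys, key_labels)
```

Because all modalities share one embedding space after contrastive training, the same lookup works whether the query and keys come from the same modality or from different ones.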

We report top-1 macro accuracy (%) on the test set using different amounts of pre-training data (1 million vs. 5 million records from BIOSCAN-5M) and various combinations of aligned embeddings (image, DNA, and text) during contrastive training. Our results include accuracy for image-to-image, DNA-to-DNA, and image-to-DNA query-key combinations. As a baseline, we provide the results without contrastive learning (no alignment). We report accuracy separately for seen and unseen species, along with the harmonic mean (H.M.) between these values.
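The two summary metrics can be sketched as follows, assuming the usual definitions: macro accuracy is the unweighted mean of per-class accuracies, and the harmonic mean combines the seen and unseen accuracies as 2su / (s + u). The function names and the toy values are hypothetical and are not results from the paper.

```python
from collections import defaultdict

def macro_accuracy(y_true, y_pred):
    """Unweighted mean of per-class top-1 accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

def harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean of seen and unseen accuracy; zero if both are zero."""
    if seen_acc + unseen_acc == 0:
        return 0.0
    return 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc)

# Toy example: class "a" is always right, class "b" is always wrong.
acc = macro_accuracy(["a", "a", "a", "b"], ["a", "a", "a", "c"])
hm = harmonic_mean(80.0, 20.0)  # illustrative values only
```

Macro averaging keeps frequent species from dominating the score, and the harmonic mean stays low unless both seen and unseen accuracy are reasonably high.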

BIOSCAN-CLIBD predicting orders

BIOSCAN-CLIBD predicting families

BIOSCAN-CLIBD predicting genus

BIOSCAN-CLIBD predicting species

Detailed results are presented in Gharaee et al. (2024).