Overview
This project introduces BIOSCAN-5M, a new large-scale dataset of over five million samples. Each sample includes a high-quality microscopic RGB image, a DNA nucleotide barcode sequence, and a Barcode Index Number.
- Composed of both structured (CSV metadata) and unstructured (images, text) data types.
- Structured metadata is in JSON-LD and CSV formats, including taxonomy classification labels, DNA nucleotide sequences, and Barcode Index Numbers.
- Unstructured data is organized into chunks of 10,000 images per directory for the original full-size images, cropped images, resized original images, and resized cropped images.
- Images organized into directories were then stored in HDF5 and ZIP formats.
- Data was uploaded to Google Drive and other platforms such as Zenodo, Kaggle, and Hugging Face.
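As a minimal sketch of the chunked directory layout described above, the snippet below maps an image's position to a chunk directory. The directory naming (`chunk_<id>`), the `images` root, and the use of a 0-based running index are illustrative assumptions; only the chunk size of 10,000 comes from this document.

```python
from pathlib import Path

CHUNK_SIZE = 10_000  # images per directory, as stated in the overview


def chunk_directory(image_index: int, root: str = "images") -> Path:
    """Return the (hypothetical) chunk directory for the image at a 0-based index."""
    chunk_id = image_index // CHUNK_SIZE
    # "chunk_<id>" is an illustrative naming scheme, not the dataset's actual layout.
    return Path(root) / f"chunk_{chunk_id}"


def num_chunks(total_images: int) -> int:
    """Number of directories needed to hold total_images (ceiling division)."""
    return -(-total_images // CHUNK_SIZE)
```

Under these assumptions, fixed-size chunks keep every directory small enough to list and archive quickly, and the last chunk simply holds the remainder.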
Data Curation-Governance
Data curation involved the collection, organization, and maintenance of data to ensure its quality and accessibility. The key tasks included:
- Data Validation: Ensuring data accuracy and consistency.
- Data Cleaning: Removing errors and inconsistencies from the dataset.
- Data Enrichment: Enhancing the dataset with additional relevant information.
- Data Documentation: Providing detailed descriptions and metadata.
ETL/ELT: Extract-Transform-Load
Data Migration:
- Performed using Google Cloud Platform (GCP) client libraries and supporting Python packages, including:
  - Google API Client
  - PyDrive
  - Requests
  - Google-auth
Data Transformation:
- Cleaned and standardized data samples into a fixed format (e.g., JPEG images).
- Removed corrupted samples using Python libraries such as Pillow (PIL), OpenCV, and Pandas.
- Cropped images using the Detection Transformer (DETR; Carion et al., 2020).
- Resized images to 256 pixels on their shorter side.
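The resizing and corrupted-sample steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual script: the function names, the bicubic resampling filter, and the choice to treat any undecodable file as corrupted are all assumptions. `shorter_side_size` is pure Python; `resize_image` needs the Pillow package.

```python
def shorter_side_size(width: int, height: int, target: int = 256) -> tuple[int, int]:
    """Scale (width, height) so the shorter side equals target, keeping aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = target / min(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


def resize_image(src_path: str, dst_path: str, target: int = 256) -> bool:
    """Resize one image and save it as JPEG; return False if it cannot be decoded."""
    from PIL import Image, UnidentifiedImageError  # requires the Pillow package

    try:
        with Image.open(src_path) as img:
            img = img.convert("RGB")  # JPEG has no alpha channel
            new_size = shorter_side_size(*img.size, target)
            # Bicubic resampling is an assumed choice; the source does not name a filter.
            img.resize(new_size, Image.BICUBIC).save(dst_path, format="JPEG")
        return True
    except (UnidentifiedImageError, OSError):
        return False  # treat unreadable files as corrupted samples
```

For example, a 1024x768 image maps to 341x256, so the shorter side lands exactly on 256 while the aspect ratio is preserved up to rounding.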
Multiple-Data-Modalities
Images
The BIOSCAN-5M dataset contains 5,150,808 high-quality images of living organisms.
DNA Nucleotide Barcode Sequences
The presented DNA barcode sequence illustrates the nucleotide arrangement—Adenine (A), Thymine (T), Cytosine (C), and Guanine (G)—within a designated gene region, such as the mitochondrial cytochrome c oxidase subunit I (COI) gene. This sequence is visually represented in blocks of distinct colors:
TTTATATTTTATTTTTGGAGCATGATCAGGAATAGTTGGAACTTCAATAAGTTTATTAATTCGAACAGAATTAAGCCAACCAGGAATTTTTATTGGTAATGACCAAATTTATAATGTAATTGTTACAGCTCATGCCTTTATTATAATTTTTTTTATAGTTATACCTATTATAATTGGAGGATTCGGAAATTGACTAGTCCCATTAATATTAGGAGCTCCTGATATAGCTTTCCCTCGAATAAATAATATAAGTTTTTGAATGTTACCTCCTTCATTAACTCTATTATTATCAAGAAGAATAGTTGAAAATGGAGCTGGAACAGGATGAACTGTTTATCCCCCTTTATCCTCAGGAACTGCTCATGCAGGAGCTTCTGTTGATCTTGCTATTTTCTCTTTACATTTAGCAGGAATTTCTTCAATTCTTGGAGCTGTAAATTTTATTACAACAATTATTAATATACGATCTTCAGGAATTACACTTGATCGAATACCTTTATTTGTTTGATCTGTAATTATTACAGCTATTCTACTTTTACTGTCTCTTCCAGTATTAGCTGGAGCTATTACAATATTATTAACTGATCGTAATTTAAATACATCTTTTTTTGACCCAATTGGAGGAGGAGATCCAATTCTATATCAACATTTAT
Color Scheme
The color scheme is designed as follows:
- Adenine (A): Red
- Thymine (T): Blue
- Cytosine (C): Green
- Guanine (G): Yellow
These nucleotides, represented by their respective colors, play a pivotal role in defining the genetic information encoded within the DNA sequence.
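A short sketch of how the color scheme above could be applied to a barcode: the sequence is grouped into runs of identical nucleotides, and each run is paired with its color. The function name, the run-length block format, and the gray fallback for any character other than A, T, C, or G are assumptions made for illustration.

```python
from itertools import groupby

# Color scheme as listed above.
NUCLEOTIDE_COLORS = {"A": "red", "T": "blue", "C": "green", "G": "yellow"}


def color_blocks(sequence: str, unknown: str = "gray") -> list[tuple[str, str, int]]:
    """Run-length encode a sequence into (nucleotide, color, run_length) blocks."""
    blocks = []
    for base, run in groupby(sequence.upper()):
        # Characters outside A/T/C/G get the assumed fallback color.
        blocks.append((base, NUCLEOTIDE_COLORS.get(base, unknown), len(list(run))))
    return blocks
```

Each block can then be drawn as a colored rectangle whose width is proportional to its run length.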
Textual Taxonomy Labels
Taxonomic group ranking annotations categorize organisms hierarchically based on evolutionary relationships. They organize species into groups based on shared characteristics and genetic relatedness.
Geographic Information
Each dataset sample includes geographic information about the collection sites, captured through four key data attributes:
1- Latitude
2- Longitude
3- Country
4- Province/State
Size Information
Each dataset sample includes size information about each specimen, captured through three key data attributes:
1- Image Measurement Value
Count of pixels occupied by the organism in its image.
2- Area Fraction
The fraction of the original image that the cropped image comprises, based on the bounding box information from the cropping tool.
3- Scale Factor
The ratio of the cropped image to the cropped and resized image, based on the bounding box information from the cropping tool.
Statistical Analytics
Biological Statistics
Geographical Statistics
Genetic Statistics
Data Distributions
ML Benchmark
Data Partition
- Seen: Samples whose species label is an established scientific name of a species. Used in closed-world settings.
  - train
  - val
  - test
- Unseen: Samples labelled with an established scientific name for the genus and a uniquely identifying placeholder name for the species. Used in open-world settings.
  - key_unseen
  - val_unseen
  - test_unseen
- Heldout: Samples labelled with a placeholder genus and species name. Used in novelty detection.
  - other_heldout
- Unknown: Samples without a species label. Used in self- and semi-supervised learning.
  - pretrain
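The partition scheme above can be expressed as a small lookup table. The sketch below buckets metadata records by partition name; the `split` field name and the record format (a list of dictionaries) are assumptions, while the partition names and their groups are taken from the list above.

```python
from collections import defaultdict

# Partition name -> group, following the list above.
SPLIT_GROUPS = {
    "train": "seen",
    "val": "seen",
    "test": "seen",
    "key_unseen": "unseen",
    "val_unseen": "unseen",
    "test_unseen": "unseen",
    "other_heldout": "heldout",
    "pretrain": "unknown",
}


def group_by_partition(records, split_field="split"):
    """Bucket metadata records by partition name; reject names not listed above."""
    buckets = defaultdict(list)
    for record in records:
        name = record[split_field]  # "split" is a hypothetical column name
        if name not in SPLIT_GROUPS:
            raise ValueError(f"unknown partition: {name!r}")
        buckets[name].append(record)
    return dict(buckets)
```

Rejecting unlisted names early makes it harder for a typo in the metadata to leak samples between the closed-world and open-world evaluations.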
BarcodeBERT: DNA Sequence Classification
The proposed semi-supervised learning setup, based on BarcodeBERT (Arias et al., 2023), has two stages. (1) Pretraining: DNA sequences are tokenized using non-overlapping k-mers, and 50% of the tokens are masked for the masked language modeling (MLM) task. Tokens are encoded and fed into a transformer model. The output embeddings are used for token-level classification. (2) Fine-tuning: All DNA sequences in a dataset are tokenized using non-overlapping k-mer tokenization, and all tokenized sequences, without masking, are passed through the pretrained transformer model. Global mean-pooling is applied over the token-level embeddings, and the output is used for taxonomic classification.
The results are presented in Gharaee et al. (2024).
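The tokenization and masking steps of the pretraining stage can be sketched in a few lines. This is an illustration, not the BarcodeBERT implementation: the value `k=4`, the `[MASK]` token string, dropping an incomplete trailing k-mer, and the fixed random seed are all assumptions.

```python
import random


def kmer_tokenize(sequence: str, k: int = 4) -> list[str]:
    """Split a sequence into non-overlapping k-mers, dropping any incomplete tail."""
    usable = len(sequence) - len(sequence) % k
    return [sequence[i:i + k] for i in range(0, usable, k)]


def mask_tokens(tokens, mask_ratio=0.5, mask_token="[MASK]", seed=0):
    """Mask a random fraction of tokens; return (masked_tokens, masked_positions)."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    n_mask = round(len(tokens) * mask_ratio)
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for pos in positions:
        masked[pos] = mask_token
    return masked, positions
```

During pretraining the model would be asked to predict the original k-mer at each masked position; during fine-tuning only `kmer_tokenize` is used, since no tokens are masked.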
Zero-shot Clustering
Images and DNA are each passed through one of several pretrained encoders. These representations are clustered with Agglomerative Clustering.
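A minimal sketch of the clustering step, assuming the encoder outputs are already available as a NumPy array of shape `(n_samples, dim)`. It uses scikit-learn's `AgglomerativeClustering` with its default Ward linkage; the L2 normalization, the linkage, and the number of clusters are assumptions, since this document does not specify them.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def cluster_embeddings(embeddings: np.ndarray, n_clusters: int) -> np.ndarray:
    """L2-normalize embeddings and cluster them with agglomerative clustering."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / np.clip(norms, 1e-12, None)  # avoid division by zero
    # Default linkage is Ward on Euclidean distance; an assumed, not documented, choice.
    return AgglomerativeClustering(n_clusters=n_clusters).fit_predict(normalized)
```

The returned cluster ids are arbitrary integers, so they would be compared with taxonomic labels through a permutation-invariant clustering metric rather than by direct equality.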
The results are presented in Gharaee et al. (2024).
BIOSCAN-CLIBD: Multi-modal Contrastive Learning
Our experiments using BIOSCAN-CLIBD (Gong et al., 2024) are conducted in two steps. (1) Training: Multiple modalities, including RGB images, textual taxonomy, and DNA sequences, are encoded separately and trained using a contrastive loss function. (2) Inference: An image or DNA embedding is used as a query and compared to the embeddings obtained from a database of image, DNA, and text embeddings (keys). Cosine similarity is used to find the closest key embedding, and the corresponding taxonomic label is used to classify the query.
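The inference step can be illustrated with a nearest-key lookup. This is a plain-Python sketch on toy vectors, assuming embeddings are given as lists of floats; the function names are hypothetical, and a real system would use batched matrix operations over the encoder outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_query(query, keys, key_labels):
    """Return the label of the key embedding most similar to the query."""
    best = max(range(len(keys)), key=lambda i: cosine_similarity(query, keys[i]))
    return key_labels[best]
```

Because cosine similarity ignores vector magnitude, the query and keys can come from different encoders (for example, image query against DNA keys) as long as contrastive training has aligned them in a shared space.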
We report top-1 macro accuracy (%) on the test set using different amounts of pre-training data (1 million vs. 5 million records from BIOSCAN-5M) and various combinations of aligned embeddings (image, DNA, and text) during contrastive training. Our results include accuracy for image-to-image, DNA-to-DNA, and image-to-DNA query-key combinations. As a baseline, we provide the results without contrastive learning (no alignment). We report accuracy separately for seen and unseen species, along with the harmonic mean (H.M.) between these values.
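For reference, the two summary metrics named above can be computed as follows. The sketch assumes macro accuracy means the unweighted mean of per-class accuracies, which is the usual definition but is not spelled out in this document; function names are illustrative.

```python
from collections import defaultdict


def macro_accuracy(y_true, y_pred) -> float:
    """Unweighted mean of per-class top-1 accuracies, in percent."""
    correct, total = defaultdict(int), defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    if not total:
        return 0.0
    return 100.0 * sum(correct[c] / total[c] for c in total) / len(total)


def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Harmonic mean (H.M.) of the seen and unseen accuracies."""
    if seen_acc + unseen_acc == 0:
        return 0.0
    return 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc)
```

The harmonic mean penalizes imbalance: a model scoring 80% on seen species and 20% on unseen species gets an H.M. of 32%, well below the arithmetic mean of 50%.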
BIOSCAN-CLIBD predicting orders
BIOSCAN-CLIBD predicting families
BIOSCAN-CLIBD predicting genera
BIOSCAN-CLIBD predicting species
Detailed results are presented in Gharaee et al. (2024).
Tools and Technologies
This section discusses the design and implementation of the project.
Data Processing Pipeline
The pipeline includes stages for:
- Data Ingestion
- Data Validation
- Data Transformation
- Data Storage
- Data Access
- Feature Engineering
Data Migration
- Google API Client
- PyDrive
- Requests
- Google-auth
Data Structure
- Arrays
- Lists
- Stacks
- Dictionaries
Data Visualization
Utilized various libraries for data visualization:
- Matplotlib (mpl_toolkits)
- Seaborn
- Plotly
Data Analytics and Data Processing
Tools and libraries used for data analytics and processing include:
- TensorFlow
- Scikit-learn
- PyTorch
- transformers (DetrFeatureExtractor)
- h5py
- Pandas
- PySpark (pyspark.sql)
- csv
- json
- Pillow (PIL)
- pickle
- timm
- shutil
Computing Infrastructure
Infrastructure and environments used include:
- Google Cloud Platform
- Digital Research Alliance of Canada
- Virtual Environment
Links
Overview of links related to the project:
Version Control Systems:
Dataset and Code Sharing Platforms:
Research Article Platforms: