Overview
In this project, a new large-scale dataset of over one million samples was introduced. Each sample includes a high-quality microscopic RGB image, a DNA nucleotide barcode sequence, and a Barcode Index Number.
- Composed of both structured (CSV metadata) and unstructured (images, text) data types.
- Structured metadata is in JSON-LD and CSV formats, including taxonomy classification labels, DNA nucleotide sequences, and Barcode Index Numbers.
- Unstructured data is organized into chunks of 10,000 images per directory, in four variants: original full-size, cropped, resized-original, and resized-cropped images.
- The image directories were then packaged in HDF5 and ZIP formats.
- Data was uploaded to Google Drive and other platforms such as Zenodo, Kaggle, and Hugging Face.
Data Curation and Governance
Data curation involved the collection, organization, and maintenance of data to ensure its quality and accessibility. The key tasks included:
- Data Validation: Ensuring data accuracy and consistency.
- Data Cleaning: Removing errors and inconsistencies from the dataset.
- Data Enrichment: Enhancing the dataset with additional relevant information.
- Data Documentation: Providing detailed descriptions and metadata.
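A minimal validation pass over the CSV metadata might look like the sketch below; the required column names here (`sampleid`, `order`, `family`, `nucraw`, `uri`) are hypothetical stand-ins, not the dataset's actual schema:

```python
import csv
from io import StringIO

# Hypothetical required columns; substitute the dataset's real schema.
REQUIRED = ["sampleid", "order", "family", "nucraw", "uri"]

def validate_rows(csv_text: str):
    """Yield (row, problems) per metadata record; an empty problems list means clean."""
    reader = csv.DictReader(StringIO(csv_text))
    for row in reader:
        # Flag any required field that is missing or blank.
        problems = [col for col in REQUIRED if not (row.get(col) or "").strip()]
        yield row, problems
```

Rows with a non-empty `problems` list would then be routed to the cleaning step.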
ETL/ELT: Extract-Transform-Load
Data Migration:
- Performed using Google Cloud Platform (GCP) libraries, including:
  - Google API Client
  - PyDrive
  - Requests
  - Google-auth
Data Transformation:
- Cleaned and standardized data samples into a fixed format (e.g., JPEG images).
- Removed corrupted samples using Python libraries such as PIL, Pillow, OpenCV, and Pandas.
- Cropped images using the Detection Transformer (DETR). Carion et al. (2020)
- Resized images to 256 pixels on their shorter side.
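The corrupted-sample check and shorter-side resize can be sketched with Pillow; the helper names are illustrative, not the project's actual API:

```python
from io import BytesIO
from PIL import Image

def is_valid_image(data: bytes) -> bool:
    """Return True if the bytes decode to a readable image (corruption check)."""
    try:
        Image.open(BytesIO(data)).verify()
        return True
    except Exception:
        return False

def resize_shorter_side(img: Image.Image, target: int = 256) -> Image.Image:
    """Resize so the shorter side equals `target`, preserving aspect ratio."""
    w, h = img.size
    scale = target / min(w, h)
    return img.resize((round(w * scale), round(h * scale)))
```

For example, a 640x480 image resizes to 341x256: the 480-pixel shorter side is scaled to 256 and the width follows proportionally.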
Statistical Analytics
Fine-Grained Classification
Fine-grained classification involves distinguishing between classes that are very similar to each other, often requiring detailed and nuanced feature analysis. In taxonomic classification, this concept is illustrated as we move from broader, less specific levels (e.g., phylum, class) to more specific levels (e.g., genus, species) within the taxonomic tree. This necessitates a classification model that can accurately differentiate between these closely related classes.
Long-Tailed Distribution
The BIOSCAN-1M dataset exhibits a long-tailed distribution, where a small number of classes have a large number of samples, while most classes have very few samples. This distribution can result in models that perform well on the frequently occurring classes but struggle with the underrepresented classes due to insufficient training data. For instance, in the BIOSCAN-1M dataset, the order Diptera contains 896,324 samples out of a total of 1,128,313, which represents approximately 80% of all samples. This illustrates how a few classes dominate the dataset, highlighting the long-tailed nature of the distribution.
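Two summary statistics make the long tail concrete: the head class's share of all samples, and the imbalance ratio (largest class count over smallest). The sketch below uses a small hypothetical label set, not the actual BIOSCAN-1M counts, with the head class at ~80% to mirror Diptera's share:

```python
from collections import Counter

# Hypothetical per-sample order labels; in BIOSCAN-1M, the order Diptera
# alone accounts for roughly 80% of all samples.
labels = (["Diptera"] * 80 + ["Hymenoptera"] * 12
          + ["Coleoptera"] * 6 + ["Lepidoptera"] * 2)

counts = Counter(labels)
total = sum(counts.values())
head_share = counts.most_common(1)[0][1] / total       # share of the largest class
imbalance_ratio = max(counts.values()) / min(counts.values())
```

On the real dataset, the same two numbers quantify how severely the tail classes are starved of training data.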
High Class Imbalance Ratio
There is a significant class imbalance in the BIOSCAN-1M dataset, with some classes being heavily underrepresented compared to others. This high class imbalance ratio can lead to biases in model training, where the model becomes skewed towards the majority classes and shows poor performance on the minority classes. Addressing this imbalance is crucial for ensuring the model's effectiveness across all classes.
ML Benchmark
This section outlines the machine learning benchmark tasks and results for the BIOSCAN-1M dataset. The benchmarks were designed to evaluate classification performance at different taxonomy levels and dataset sizes.
Data Sampling
The BIOSCAN-1M dataset was sampled in the following ways:
Taxonomy Levels: Two datasets were created:
- BIOSCAN-1M-Insect: sampled at the taxonomy order level.
- BIOSCAN-1M-Diptera: sampled at the family level within the order Diptera.
Dataset Sizes: Each dataset was further divided into three sizes to address usability and feasibility for end users in various domains:
- Small: 50,000 samples
- Medium: 200,000 samples
- Large:
  - Insect (order): 1,100,000 samples
  - Diptera (family): 891,000 samples
Stratified Class-Based Split
Data samples for each dataset size were split into train (70%), validation (10%), and test (20%) sets using a class-based mechanism to ensure consistent data distributions across all sets.
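A class-based 70/10/20 split can be sketched in plain Python: shuffle each class independently, then slice it by the target ratios so every set keeps the same class distribution. The function name and signature are illustrative:

```python
import random

def stratified_split(samples, labels, ratios=(0.7, 0.1, 0.2), seed=0):
    """Split per class so train/val/test keep the same class distribution."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    train, val, test = [], [], []
    for y, items in by_class.items():
        rng.shuffle(items)
        n_train = int(ratios[0] * len(items))
        n_val = int(ratios[1] * len(items))
        train += [(s, y) for s in items[:n_train]]
        val += [(s, y) for s in items[n_train:n_train + n_val]]
        test += [(s, y) for s in items[n_train + n_val:]]
    return train, val, test
```

With 100 samples of class A and 50 of class B, this yields 70 + 35 train, 10 + 5 validation, and 20 + 10 test samples, preserving the 2:1 class ratio in every set.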
Multi-class Classification
Two image-based classification benchmark experiments were designed and conducted on all three dataset sizes:
- Insect-Order: 16 classes.
- Family-Diptera: 40 classes.
Transfer Learning
Models were fine-tuned on the datasets using two pretrained backbones to facilitate transfer learning:
- ResNet50: A deep residual learning framework for image recognition. He et al. (2016)
- Vision Transformer (ViT-B/16-224): A transformer-based model for image classification. Dosovitskiy et al. (2020)
Robustness and Generalizability
To ensure the robustness and generalizability of the models, each experiment was repeated with 3 different random seeds. This approach allows us to account for the impact of randomness in our results. The total number of experiments conducted is given by:
3 (dataset sizes) × 2 (classification tasks) × 2 (backbone models) × 2 (loss functions) × 3 (seeds) = 72
For the results of these experiments, with means and standard deviations, see Tables A3-A4 of Gharaee et al. (2023).
Evaluation
The model with the best performance on the validation set is selected and used for test experiments. The metrics used for the evaluation are as follows:
- Top-1 Accuracy: The proportion of test samples where the top predicted class matches the true label.
- Top-5 Accuracy: The proportion of test samples where the true label is among the top five predicted classes.
- Macro-F1 Score: The macro-averaged F1 score, which computes the F1 score for each class and then averages these scores, giving equal weight to each class.
- Loss: Monitored during training and evaluated post-training to measure the effectiveness of the model’s learning.
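The accuracy and macro-F1 metrics above can be computed from raw class scores as in this self-contained sketch (the toy scores and labels are made up for illustration):

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, y in zip(scores, labels):
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += y in topk
    return hits / len(labels)

def macro_f1(preds, labels, n_classes):
    """Per-class F1 averaged with equal weight per class (robust to imbalance)."""
    f1s = []
    for c in range(n_classes):
        tp = sum(p == c and y == c for p, y in zip(preds, labels))
        fp = sum(p == c and y != c for p, y in zip(preds, labels))
        fn = sum(p != c and y == c for p, y in zip(preds, labels))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / n_classes

# Toy data: 4 samples, 3 classes.
scores = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4], [0.6, 0.2, 0.2]]
labels = [1, 0, 2, 2]
preds = [max(range(len(r)), key=lambda i: r[i]) for r in scores]
top1 = top_k_accuracy(scores, labels, k=1)
f1 = macro_f1(preds, labels, n_classes=3)
```

Because macro-F1 weights every class equally, it penalizes a model that ignores the tail classes, which matters given the dataset's long-tailed distribution.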
Findings and Results
The results indicate that:
- Cross-Entropy: Generally performed better, achieving higher accuracy.
- Vision Transformer: Demonstrated competitive performance, especially on larger datasets, showing its robustness and capability in handling high-dimensional image data.
Deployment
The models trained on the BIOSCAN-1M datasets are stored in the project's Google Drive folder. The pretrained models and the AI-based tool built on them are used by biologists at the Centre for Biodiversity Genomics (CBG) to streamline biological taxonomy classification, a process traditionally performed by human experts that is costly and time-consuming.
Tools and Technologies
This section discusses the design and implementation of the project.
Data Processing Pipeline
The pipeline includes stages for:
- Data Ingestion
- Data Validation
- Data Transformation
- Data Storage
- Data Access
- Feature Engineering
Data Migration
- Google API Client
- PyDrive
- Requests
- Google-auth
Data Structure
- Arrays
- Lists
- Stacks
- Dictionaries
Data Visualization
Utilized various libraries for data visualization:
- Matplotlib
- Seaborn
- Plotly
Model Development and Deployment
Tools and libraries used for model development and deployment include:
- TensorFlow
- Scikit-learn
- PyTorch
- transformers (DetrFeatureExtractor)
- h5py
- Pandas
- csv
- json
- PIL
- pickle
- timm
- shutil
Computing Infrastructure
Infrastructure and environments used include:
- Google Cloud Platform
- Digital Research Alliance of Canada
- Virtual Environment
Links
Overview of the links related to the project:
Version Control Systems:
Dataset and Code Sharing Platforms:
Research Article Platforms: