Data Science Project: BIOSCAN-1M

Project Description

Figure: Sample image from the BIOSCAN-1M dataset.

BIOSCAN-1M is a large-scale dataset combining over 1 million RGB microscopic images with metadata such as DNA barcode sequences and taxonomy. It supports biodiversity and AI research through multimodal learning.

Key Contributions

Tools & Technologies

Code / Git

Research / Paper

Presentations

Additional Context

Statistical Analysis

Fine-Grained Classification

Fine-grained classification involves distinguishing between classes that are very similar to each other, often requiring detailed and nuanced feature analysis. In taxonomic classification, this concept is illustrated as we move from broader, less specific levels (e.g., phylum, class) to more specific levels (e.g., genus, species) within the taxonomic tree. This necessitates a classification model that can accurately differentiate between these closely related classes.

Figure: Taxonomy classification.

Long-Tailed Distribution

The BIOSCAN-1M dataset exhibits a long-tailed distribution: a small number of classes have a large number of samples, while most classes have very few. Models trained on such data tend to perform well on the frequent classes but struggle with the underrepresented ones because they see too few training examples. For instance, the order Diptera contains 896,324 of the 1,128,313 samples, or about 79% of the dataset, so a single class dominates at the order level.
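The Diptera share quoted above can be checked with a few lines of Python. This is only an arithmetic check using the two counts given in the text; the variable names are illustrative.

```python
# Sample counts quoted in the text above.
diptera_samples = 896_324
total_samples = 1_128_313

# Fraction of the dataset taken up by the single order Diptera.
diptera_share = diptera_samples / total_samples
remaining_share = 1.0 - diptera_share

print(f"Diptera share: {diptera_share:.1%}")              # Diptera share: 79.4%
print(f"All other orders combined: {remaining_share:.1%}")  # All other orders combined: 20.6%
```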

Figure: Class distribution of the taxonomy group order.

High Class Imbalance Ratio

There is a significant class imbalance in the BIOSCAN-1M dataset, with some classes being heavily underrepresented compared to others. This high class imbalance ratio can lead to biases in model training, where the model becomes skewed towards the majority classes and shows poor performance on the minority classes. Addressing this imbalance is crucial for ensuring the model's effectiveness across all classes.
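One common way to quantify imbalance is the ratio between the largest and smallest class sizes. The sketch below assumes that definition, which the text does not state explicitly, and uses made-up class names and counts purely for illustration; `imbalance_ratio` is a hypothetical helper, not part of the BIOSCAN-1M code.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (largest class size) / (smallest class size) for a label list."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical toy labels; these counts are NOT taken from BIOSCAN-1M.
toy_labels = (
    ["class_a"] * 80
    + ["class_b"] * 15
    + ["class_c"] * 4
    + ["class_d"] * 1
)

ratio = imbalance_ratio(toy_labels)
print(ratio)  # 80.0
```

A ratio of 1.0 would mean perfectly balanced classes; the larger the value, the more the majority class outweighs the rarest one.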

ML Benchmarks

This section outlines the machine learning benchmark tasks and results for the BIOSCAN-1M dataset. The benchmarks were designed to evaluate classification performance at different taxonomy levels and dataset sizes.

Data Sampling

The BIOSCAN-1M dataset was sampled in the following ways:

    Taxonomy Levels: Two datasets were created:

    • BIOSCAN-1M-Insect: Samples labeled at the taxonomic order level.
    • BIOSCAN-1M-Diptera: Samples from the order Diptera, labeled at the taxonomic family level.

    Dataset Sizes: Each dataset was further divided into three sizes to address usability and feasibility for end users in various domains:

    • Small: 50,000 samples
    • Medium: 200,000 samples
    • Large:
      • Order: 1,100,000 samples
      • Diptera: 891,000 samples

Stratified Class-Based Split

Data samples for each dataset size were split into train (70%), validation (10%), and test (20%) sets using a class-based mechanism to ensure consistent data distributions across all sets.
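A class-based 70/10/20 split of this kind can be sketched as follows. This is a minimal standard-library illustration, not the project's actual splitting code: the function name `stratified_split`, the rounding rule, and the toy labels are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.1, 0.2), seed=0):
    """Split sample indices into train/val/test, class by class.

    Each class is shuffled and divided separately so that all three
    sets keep roughly the same class proportions.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        # Whatever remains goes to the test set.
        test.extend(indices[n_train + n_val:])
    return train, val, test


# Hypothetical toy labels with three unevenly sized classes.
toy_labels = ["a"] * 100 + ["b"] * 50 + ["c"] * 10
train, val, test = stratified_split(toy_labels)
print(len(train), len(val), len(test))  # 112 16 32
```

Very small classes need care with a scheme like this: with only one or two samples, rounding can leave the validation or test set without any example of that class.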

Multi-class Classification

Two image-based classification benchmark experiments were designed and conducted on all three dataset sizes:

  • Insect-Order: 16 classes.
  • Family-Diptera: 40 classes.

Figure: Class distribution of the two sampled datasets.

Transfer Learning

Two pretrained backbone models were fine-tuned on each dataset to facilitate transfer learning.

Robustness and Generalizability

To ensure the robustness and generalizability of the models, each experiment was repeated with 3 different random seeds. This approach allows us to account for the impact of randomness in our results. The total number of experiments conducted is given by:

3 (dataset sizes) × 2 (classification tasks) × 2 (backbone models) × 2 (loss functions) × 3 (seeds) = 72

For the results of these experiments, reported as mean and standard deviation, see Tables A3-A4 of Gharaee et al. (2023).
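The experiment count above can be confirmed by enumerating the full grid. In this sketch the dataset sizes and task names come from the text; the second backbone, the second loss function, and the seed values are not named in this section, so they appear as clearly labeled placeholders.

```python
from itertools import product

dataset_sizes = ["small", "medium", "large"]
tasks = ["Insect-Order", "Family-Diptera"]
# Only the Vision Transformer is named in this section; the other entry is a placeholder.
backbones = ["vision_transformer", "second_backbone_placeholder"]
# Only cross-entropy is named in this section; the other entry is a placeholder.
loss_functions = ["cross_entropy", "second_loss_placeholder"]
# Placeholder seed values; the text only says three seeds were used.
seeds = [0, 1, 2]

experiments = list(product(dataset_sizes, tasks, backbones, loss_functions, seeds))
print(len(experiments))  # 72
```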

Evaluation

The model with the best performance on the validation set is selected and used for test experiments. The metrics used for the evaluation are as follows:

  • Top-1 Accuracy: The proportion of test samples where the top predicted class matches the true label.
  • Top-5 Accuracy: The proportion of test samples where the true label is among the top five predicted classes.
  • Macro-F1 Score: The macro-averaged F1 score, which computes the F1 score for each class and then averages these scores, giving equal weight to each class.
  • Loss: Monitored during training and evaluated post-training to measure the effectiveness of the model’s learning.
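The accuracy and F1 metrics listed above can be sketched in plain Python. This is an illustrative implementation under simple assumptions (integer class labels, one score per class for each sample), not the evaluation code used in the benchmarks; the function names and the toy scores are made up. The toy example uses top-2 rather than top-5 because it has only three classes.

```python
def top_k_accuracy(scores, targets, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, target in zip(scores, targets):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)


def macro_f1(predictions, targets, num_classes):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for c in range(num_classes):
        tp = sum(p == c and t == c for p, t in zip(predictions, targets))
        fp = sum(p == c and t != c for p, t in zip(predictions, targets))
        fn = sum(p != c and t == c for p, t in zip(predictions, targets))
        denominator = 2 * tp + fp + fn
        # A class that is never predicted and never present scores 0 here.
        f1_scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(f1_scores) / num_classes


# Hypothetical scores for 4 samples over 3 classes.
toy_scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
toy_targets = [0, 1, 1, 2]
toy_predictions = [max(range(3), key=lambda c: row[c]) for row in toy_scores]

top1 = top_k_accuracy(toy_scores, toy_targets, k=1)
top2 = top_k_accuracy(toy_scores, toy_targets, k=2)
f1 = macro_f1(toy_predictions, toy_targets, num_classes=3)
print(top1, top2, f"{f1:.3f}")  # 0.75 1.0 0.778
```

Because macro-F1 weights every class equally, it is more sensitive than top-1 accuracy to poor performance on rare classes, which matters for a long-tailed dataset.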

Findings and Results

The results indicate that:

  • Cross-Entropy: Generally achieved higher accuracy than the other loss function evaluated.
  • Vision Transformer: Demonstrated competitive performance, especially on larger datasets, showing its robustness and capability in handling high-dimensional image data.

Figure: Per-class top-1 test accuracy of the Insect-Order and Diptera-Family classification experiments on the Large dataset.
Figure: Confusion matrix of the Insect-Order experiments.
Figure: Confusion matrix of the Diptera-Family experiments.

Deployment

The models trained on the BIOSCAN-1M datasets are stored in the project's Google Drive folder. Biologists at the Centre for Biodiversity Genomics (CBG) use the pretrained models and the accompanying AI-based tool to streamline taxonomic classification, a process that is traditionally performed by human experts and is costly and time-consuming.