Data Science Project: BIOSCAN-1M

Overview

Figure: BIOSCAN-1M dataset sample.

This project introduces a new large-scale dataset of over one million samples. Each sample includes a high-quality microscopic RGB image, a DNA nucleotide barcode sequence, and a Barcode Index Number (BIN).

Data Curation-Governance

Data curation involved the collection, organization, and maintenance of data to ensure its quality and accessibility. The key tasks included:

ETL/ELT: Extract-Transform-Load / Extract-Load-Transform

Data Migration:

Data Transformation:

Statistical Analytics

Fine-Grained Classification

Fine-grained classification involves distinguishing between classes that are very similar to each other, often requiring detailed and nuanced feature analysis. In taxonomic classification, this concept is illustrated as we move from broader, less specific levels (e.g., phylum, class) to more specific levels (e.g., genus, species) within the taxonomic tree. This necessitates a classification model that can accurately differentiate between these closely related classes.

Figure: Taxonomy classification.

Long-Tailed Distribution

The BIOSCAN-1M dataset exhibits a long-tailed distribution, where a small number of classes have a large number of samples, while most classes have very few samples. This distribution can result in models that perform well on the frequently occurring classes but struggle with the underrepresented classes due to insufficient training data. For instance, in the BIOSCAN-1M dataset, the order Diptera contains 896,324 samples out of a total of 1,128,313, which represents approximately 80% of all samples. This illustrates how a few classes dominate the dataset, highlighting the long-tailed nature of the distribution.
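As a quick check of the figure quoted above, the Diptera share can be recomputed from the two counts given in the text. This is only an arithmetic sketch, and the variable names are illustrative.

```python
# Counts quoted in the text above.
diptera_samples = 896_324
total_samples = 1_128_313

# Fraction of the whole dataset that belongs to the single order Diptera.
diptera_share = diptera_samples / total_samples
print(f"Diptera share of all samples: {diptera_share:.1%}")
```

The result is about 79.4%, consistent with the "approximately 80%" stated above.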

Figure: Class distribution of the taxonomy group order.

High Class Imbalance Ratio

There is a significant class imbalance in the BIOSCAN-1M dataset, with some classes being heavily underrepresented compared to others. This high class imbalance ratio can lead to biases in model training, where the model becomes skewed towards the majority classes and shows poor performance on the minority classes. Addressing this imbalance is crucial for ensuring the model's effectiveness across all classes.
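A minimal sketch of how the imbalance can be quantified is shown below. The imbalance ratio here is taken as the size of the largest class divided by the size of the smallest, and inverse-frequency class weights are shown as one common mitigation; neither is confirmed to be the exact definition or remedy used in this project. The function names and the toy label list are hypothetical and are not real BIOSCAN-1M counts.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def inverse_frequency_weights(labels):
    """Per-class weights proportional to 1 / class frequency.

    With n samples and k classes, a class with c samples gets n / (k * c),
    so a perfectly balanced dataset yields a weight of 1.0 for every class.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


# Toy labels for illustration only (NOT the real dataset distribution).
toy_labels = ["Diptera"] * 80 + ["Hymenoptera"] * 15 + ["Coleoptera"] * 5

ratio = imbalance_ratio(toy_labels)
weights = inverse_frequency_weights(toy_labels)
print(f"imbalance ratio: {ratio:.1f}")
for cls, w in sorted(weights.items()):
    print(f"{cls}: weight {w:.3f}")
```

Weights of this kind could be passed to a weighted loss so that errors on rare classes count more during training.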

ML Benchmark

This section outlines the machine learning benchmark tasks and results for the BIOSCAN-1M dataset. The benchmarks were designed to evaluate classification performance at different taxonomy levels and dataset sizes.

Data Sampling

The BIOSCAN-1M dataset was sampled in the following ways:

Stratified Class-Based Split

Data samples for each dataset size were split into train (70%), validation (10%), and test (20%) sets using a class-based mechanism to ensure consistent data distributions across all sets.
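The following is a minimal sketch of a class-based (stratified) 70/10/20 split, assuming each sample carries a single class label. The function name `stratified_split`, the per-class rounding rule, and the toy labels are assumptions for illustration and are not taken from the project's code.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.1, 0.2), seed=0):
    """Split sample indices into train/val/test, class by class.

    Each class is shuffled and divided separately, so every split keeps
    roughly the same class proportions as the full dataset.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):  # sorted for reproducibility
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy labels for illustration only.
toy_labels = ["a"] * 100 + ["b"] * 10
train_idx, val_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Note that with this rounding rule, classes with very few samples may end up with an empty validation or test portion, which is something a long-tailed dataset would need to handle explicitly.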

Multi-class Classification

Two image-based classification benchmark experiments were designed and conducted on all three dataset sizes:

Figure: Class distribution of the two sampled datasets.

Transfer Learning

Two pretrained backbone models were fine-tuned on the dataset to facilitate transfer learning:

Robustness and Generalizability

To ensure the robustness and generalizability of the models, each experiment was repeated with 3 different random seeds. This approach allows us to account for the impact of randomness in our results. The total number of experiments conducted is given by:

3 (dataset sizes) × 2 (classification tasks) × 2 (backbone models) × 2 (loss functions) × 3 (seeds) = 72
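The count above can be reproduced by enumerating the full experiment grid. In this sketch the task names follow the Insect-Order and Diptera-Family experiments mentioned in this document, while the dataset-size names other than "large", the backbone names, and the loss names are placeholders rather than the project's actual identifiers.

```python
from itertools import product

# Placeholder identifiers; only the number of options per factor matters here.
dataset_sizes = ["small", "medium", "large"]
tasks = ["insect_order", "diptera_family"]
backbones = ["backbone_a", "backbone_b"]
losses = ["loss_a", "loss_b"]
seeds = [0, 1, 2]

# Every combination of the five factors is one training run.
experiments = list(product(dataset_sizes, tasks, backbones, losses, seeds))
print(len(experiments))  # 3 * 2 * 2 * 2 * 3 = 72
```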

For the results of these experiments, reported as mean and standard deviation, see Tables A3-A4 of Gharaee et al. (2023).
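Aggregating one configuration over its three seeds can be sketched as follows. The accuracy values are made-up placeholders, not results from the paper, and the use of the sample standard deviation (rather than the population one) is an assumption.

```python
from statistics import mean, stdev


def summarize_over_seeds(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    return mean(scores), stdev(scores)


# Placeholder per-seed accuracies for one configuration (NOT real results).
placeholder_scores = [0.90, 0.92, 0.91]

avg, std = summarize_over_seeds(placeholder_scores)
print(f"{avg:.3f} +/- {std:.3f}")
```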

Evaluation

The model with the best performance on the validation set is selected and used for test experiments. The metrics used for the evaluation are as follows:

Findings and Results

The results indicate that:

Figure: Per-class top-1 test accuracy of the Insect-Order and Diptera-Family classification experiments on the Large dataset.
Figure: Confusion matrix of the Insect-Order experiments.
Figure: Confusion matrix of the Diptera-Family experiments.
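The per-class top-1 accuracy and confusion matrices referenced in the captions above can be computed along the following lines. This is a plain-Python sketch with hypothetical function names and toy predictions; it is not the project's evaluation code.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_accuracy(matrix, classes):
    """Top-1 accuracy per class: diagonal entry divided by the row sum."""
    accuracy = {}
    for i, c in enumerate(classes):
        total = sum(matrix[i])
        accuracy[c] = matrix[i][i] / total if total else float("nan")
    return accuracy


# Toy labels and predictions for illustration only.
classes = ["Diptera", "Hymenoptera"]
y_true = ["Diptera", "Diptera", "Diptera", "Hymenoptera"]
y_pred = ["Diptera", "Diptera", "Hymenoptera", "Hymenoptera"]

cm = confusion_matrix(y_true, y_pred, classes)
acc = per_class_accuracy(cm, classes)

# Micro average weights every sample equally; macro weights every class equally.
micro = sum(cm[i][i] for i in range(len(classes))) / len(y_true)
macro = sum(acc.values()) / len(acc)
print(cm, acc, micro, macro)
```

On a long-tailed dataset the micro average is dominated by the largest classes, so comparing it with the macro average gives a rough view of how minority classes fare.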

Deployment

The models trained on the BIOSCAN-1M datasets are stored in the project's Google Drive folder. The pretrained models and the accompanying AI-based tool are used by biologists at the Centre for Biodiversity Genomics (CBG) to streamline biological taxonomy classification, a process that is traditionally performed by human experts and is costly and time-consuming.

Tools and Technologies

This section discusses the design and implementation of the project.

Data Processing Pipeline

The pipeline includes stages for:

Data Migration

Data Structure

Data Visualization

Various libraries were used for data visualization:

Model Development and Deployment

Tools and libraries used for model development and deployment include:

Computing Infrastructure

Infrastructure and environments used include: