Overview
In this project, a new large-scale dataset of over one million samples was introduced. Each sample includes a high-quality microscopic RGB image, a DNA nucleotide barcode sequence, and a Barcode Index Number.
- Composed of both structured (CSV metadata) and unstructured (images, text) data types.
- Structured metadata is in JSON-LD and CSV formats, including taxonomy classification labels, DNA nucleotide sequences, and Barcode Index Numbers.
- Unstructured data is organized into chunks of 10,000 images per directory, in four variants: original full-size, cropped, resized-original, and resized-cropped images.
- The image directories were then packaged in HDF5 and ZIP formats.
- Data was uploaded to Google Drive and other platforms such as Zenodo, Kaggle, and Hugging Face.
Data Curation and Governance
Data curation involved the collection, organization, and maintenance of data to ensure its quality and accessibility. The key tasks included:
- Data Validation: Ensuring data accuracy and consistency.
- Data Cleaning: Removing errors and inconsistencies from the dataset.
- Data Enrichment: Enhancing the dataset with additional relevant information.
- Data Documentation: Providing detailed descriptions and metadata.
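A minimal validation pass over the CSV metadata might look like the sketch below; the required column names here (`sampleid`, `order`, `family`, `nucraw`, `uri`) are hypothetical stand-ins, not the dataset's actual schema:

```python
import csv
from io import StringIO

# Hypothetical required columns; substitute the dataset's real schema.
REQUIRED = ["sampleid", "order", "family", "nucraw", "uri"]

def validate_rows(csv_text: str):
    """Yield (row, problems) per metadata record; an empty problems list means clean."""
    reader = csv.DictReader(StringIO(csv_text))
    for row in reader:
        # Flag any required field that is missing or blank.
        problems = [col for col in REQUIRED if not (row.get(col) or "").strip()]
        yield row, problems
```

Rows with a non-empty `problems` list would then be routed to the cleaning step.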
ETL/ELT: Extract-Transform-Load
Data Migration:
- Performed using Google Cloud Platform (GCP) libraries, including:
  - Google API Client
  - PyDrive
  - Requests
  - Google-auth
Data Transformation:
- Cleaned and standardized data samples into a fixed format (e.g., JPEG images).
- Removed corrupted samples using Python libraries such as PIL, Pillow, OpenCV, and Pandas.
- Cropped images using the Detection Transformer (DETR). Carion et al. (2020)
- Resized images to 256 pixels on their shorter side.
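The corrupted-sample check and shorter-side resize can be sketched with Pillow; the helper names are illustrative, not the project's actual API:

```python
from io import BytesIO
from PIL import Image

def is_valid_image(data: bytes) -> bool:
    """Return True if the bytes decode to a readable image (corruption check)."""
    try:
        Image.open(BytesIO(data)).verify()
        return True
    except Exception:
        return False

def resize_shorter_side(img: Image.Image, target: int = 256) -> Image.Image:
    """Resize so the shorter side equals `target`, preserving aspect ratio."""
    w, h = img.size
    scale = target / min(w, h)
    return img.resize((round(w * scale), round(h * scale)))
```

For example, a 640x480 image resizes to 341x256: the 480-pixel shorter side is scaled to 256 and the width follows proportionally.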
Statistical Analytics
Fine-Grained Classification
Fine-grained classification involves distinguishing between classes that are very similar to each other, often requiring detailed and nuanced feature analysis. In taxonomic classification, this concept is illustrated as we move from broader, less specific levels (e.g., phylum, class) to more specific levels (e.g., genus, species) within the taxonomic tree. This necessitates a classification model that can accurately differentiate between these closely related classes.
Long-Tailed Distribution
The BIOSCAN-1M dataset exhibits a long-tailed distribution, where a small number of classes have a large number of samples, while most classes have very few samples. This distribution can result in models that perform well on the frequently occurring classes but struggle with the underrepresented classes due to insufficient training data. For instance, in the BIOSCAN-1M dataset, the order Diptera contains 896,324 samples out of a total of 1,128,313, which represents approximately 80% of all samples. This illustrates how a few classes dominate the dataset, highlighting the long-tailed nature of the distribution.
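Two summary statistics make the long tail concrete: the head class's share of all samples, and the imbalance ratio (largest class count over smallest). The sketch below uses a small hypothetical label set, not the actual BIOSCAN-1M counts, with the head class at ~80% to mirror Diptera's share:

```python
from collections import Counter

# Hypothetical per-sample order labels; in BIOSCAN-1M, the order Diptera
# alone accounts for roughly 80% of all samples.
labels = (["Diptera"] * 80 + ["Hymenoptera"] * 12
          + ["Coleoptera"] * 6 + ["Lepidoptera"] * 2)

counts = Counter(labels)
total = sum(counts.values())
head_share = counts.most_common(1)[0][1] / total       # share of the largest class
imbalance_ratio = max(counts.values()) / min(counts.values())
```

On the real dataset, the same two numbers quantify how severely the tail classes are starved of training data.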
High Class Imbalance Ratio
There is a significant class imbalance in the BIOSCAN-1M dataset, with some classes being heavily underrepresented compared to others. This high class imbalance ratio can lead to biases in model training, where the model becomes skewed towards the majority classes and shows poor performance on the minority classes. Addressing this imbalance is crucial for ensuring the model's effectiveness across all classes.
ML Benchmark
This section outlines the machine learning benchmark tasks and results for the BIOSCAN-1M dataset. The benchmarks were designed to evaluate classification performance at different taxonomy levels and dataset sizes.
Data Sampling
The BIOSCAN-1M dataset was sampled in the following ways:
Taxonomy Levels: Two datasets were created:
- BIOSCAN-1M-Insect: sampled at the taxonomy order level.
- BIOSCAN-1M-Diptera: sampled at the family level within the order Diptera.
Dataset Sizes: Each dataset was further divided into three sizes to address usability and feasibility for end users in various domains:
- Small: 50,000 samples
- Medium: 200,000 samples
- Large:
  - Insect (order): 1,100,000 samples
  - Diptera (family): 891,000 samples
Stratified Class-Based Split
Data samples for each dataset size were split into train (70%), validation (10%), and test (20%) sets using a class-based mechanism to ensure consistent data distributions across all sets.
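A class-based 70/10/20 split can be sketched in plain Python: shuffle each class independently, then slice it by the target ratios so every set keeps the same class distribution. The function name and signature are illustrative:

```python
import random

def stratified_split(samples, labels, ratios=(0.7, 0.1, 0.2), seed=0):
    """Split per class so train/val/test keep the same class distribution."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    train, val, test = [], [], []
    for y, items in by_class.items():
        rng.shuffle(items)
        n_train = int(ratios[0] * len(items))
        n_val = int(ratios[1] * len(items))
        train += [(s, y) for s in items[:n_train]]
        val += [(s, y) for s in items[n_train:n_train + n_val]]
        test += [(s, y) for s in items[n_train + n_val:]]
    return train, val, test
```

With 100 samples of class A and 50 of class B, this yields 70 + 35 train, 10 + 5 validation, and 20 + 10 test samples, preserving the 2:1 class ratio in every set.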
Multi-class Classification
Two image-based classification benchmark experiments were designed and conducted on all three dataset sizes:
- Insect-Order: 16 classes.
- Family-Diptera: 40 classes.
Transfer Learning
Models were fine-tuned on the datasets using two pretrained backbones to facilitate transfer learning:
- ResNet50: A deep residual learning framework for image recognition. He et al. (2016)
- Vision Transformer (ViT-B/16-224): A transformer-based model for image classification. Dosovitskiy et al. (2020)
Robustness and Generalizability
To ensure the robustness and generalizability of the models, each experiment was repeated with 3 different random seeds. This approach allows us to account for the impact of randomness in our results. The total number of experiments conducted is given by:
3 (dataset sizes) × 2 (classification tasks) × 2 (backbone models) × 2 (loss functions) × 3 (seeds) = 72
For the results of these experiments, with means and standard deviations, see Tables A3-A4 of Gharaee et al. (2023).
Evaluation
The model with the best performance on the validation set is selected and used for test experiments. The metrics used for the evaluation are as follows:
- Top-1 Accuracy: The proportion of test samples where the top predicted class matches the true label.
- Top-5 Accuracy: The proportion of test samples where the true label is among the top five predicted classes.
- Macro-F1 Score: The macro-averaged F1 score, which computes the F1 score for each class and then averages these scores, giving equal weight to each class.
- Loss: Monitored during training and evaluated post-training to measure the effectiveness of the model’s learning.
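The accuracy and macro-F1 metrics above can be computed from raw class scores as in this self-contained sketch (the toy scores and labels are made up for illustration):

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, y in zip(scores, labels):
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += y in topk
    return hits / len(labels)

def macro_f1(preds, labels, n_classes):
    """Per-class F1 averaged with equal weight per class (robust to imbalance)."""
    f1s = []
    for c in range(n_classes):
        tp = sum(p == c and y == c for p, y in zip(preds, labels))
        fp = sum(p == c and y != c for p, y in zip(preds, labels))
        fn = sum(p != c and y == c for p, y in zip(preds, labels))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / n_classes

# Toy data: 4 samples, 3 classes.
scores = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4], [0.6, 0.2, 0.2]]
labels = [1, 0, 2, 2]
preds = [max(range(len(r)), key=lambda i: r[i]) for r in scores]
top1 = top_k_accuracy(scores, labels, k=1)
f1 = macro_f1(preds, labels, n_classes=3)
```

Because macro-F1 weights every class equally, it penalizes a model that ignores the tail classes, which matters given the dataset's long-tailed distribution.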
Findings and Results
The results indicate that:
- Cross-Entropy: Generally performed better, achieving higher accuracy.
- Vision Transformer: Demonstrated competitive performance, especially on larger datasets, showing its robustness and capability in handling high-dimensional image data.
Deployment
The models trained on the BIOSCAN-1M datasets are stored in the project's Google Drive folder. The pretrained models and the AI-based tool built on them are used by biologists at the Centre for Biodiversity Genomics (CBG) to streamline biological taxonomy classification, a process traditionally performed by human experts that is costly and time-consuming.
Tools and Technologies
This section discusses the design and implementation of the project.
Data Processing Pipeline
The pipeline includes stages for:
- Data Ingestion
- Data Validation
- Data Transformation
- Data Storage
- Data Access
- Feature Engineering
Data Migration
- Google API Client
- PyDrive
- Requests
- Google-auth
Data Structure
- Arrays
- Lists
- Stacks
- Dictionaries
Data Visualization
Utilized various libraries for data visualization:
- Matplotlib
- Seaborn
- Plotly
Model Development and Deployment
Tools and libraries used for model development and deployment include:
- TensorFlow
- Scikit-learn
- PyTorch
- transformers (DetrFeatureExtractor)
- h5py
- Pandas
- csv
- json
- PIL
- pickle
- timm
- shutil
Computing Infrastructure
Infrastructure and environments used include:
- Google Cloud Platform
- Digital Research Alliance of Canada
- Virtual Environment
Links
Overview of the links related to the project:
Version Control Systems:
Dataset and Code Sharing Platforms:
Research Article Platforms: