BIOSCAN-5M Portfolio

Project Description

barcodemetric — DNA Nucleotide Barcode Sequence

This project provides tools for analyzing DNA nucleotide barcode sequences derived from the BIOSCAN-1M and BIOSCAN-5M datasets. The data is hierarchically structured by taxonomic ranks—from phylum down to species—where each rank contains subgroups of identical barcode sequences. The project enables statistical analysis of DNA barcode variation across taxonomy to support biodiversity and bioinformatics research.

Key Contributions

Processed and analyzed DNA barcode data from BIOSCAN-1M and BIOSCAN-5M across hierarchical taxonomic ranks.
Computed Shannon Diversity Index (SDI) to quantify genetic richness and evenness within taxonomic subgroups.
Measured genetic similarity using pairwise Damerau–Levenshtein distances on aligned DNA barcode sequences.
Implemented subgroup filtering, random sampling, and alignment strategies to support large-scale analysis.
Developed dual pipelines using Pandas for small datasets and Apache Spark (PySpark) for multi-million–record datasets.
Aggregated statistical summaries (mean, median, standard deviation) of diversity and distance metrics across taxonomic levels.
Applied interquartile range (IQR)–based analysis to characterize distribution spread and identify variability and outliers.
Generated distribution plots and boxplots to visualize SDI and pairwise distance statistics across taxonomic ranks.
Enabled reusable outputs (Parquet files and DataFrames) for downstream bioinformatics and ML workflows.

Tools & Technologies

Python, PyTorch, TensorFlow, Pandas, NumPy, PySpark (pyspark.sql)
h5py, CSV, JSON, Pickle, Parquet, DataFrames
Matplotlib (mpl_toolkits), Seaborn, Plotly
Cluster Computing, Parallel Computation
Virtual Environment, Bash Environment
Digital Research Alliance of Canada

Code / Git

BarcodeMetrics

Research / Paper

Paper

Additional Context

DNA Barcode Sequence

A DNA barcode sequence represents the ordered arrangement of the four nucleotides— Adenine (A), Thymine (T), Cytosine (C), and Guanine (G)—within a specific gene region, commonly the mitochondrial cytochrome c oxidase subunit I (COI) gene.

    TTTATATTTTATTTTTGGAGCATGATCAGGAATAGTTGGAACTTCAATAAGTTTATTAATTCGAACAGAATTAAG...

Shannon Diversity Index (SDI)

The Shannon Diversity Index (SDI) quantifies genetic diversity by accounting for both:

Richness: Number of unique DNA barcode sequences
Evenness: Distribution of samples across those barcodes

Higher SDI values indicate greater genetic diversity with more evenly distributed barcodes, while lower values suggest dominance by a small number of barcodes.

Definition:

\( \mathrm{SDI}_T = -\sum_{i=1}^{N} p_i \log_2(p_i) \)

N is the number of unique barcodes in subgroup T
p_i is the proportion of samples with the i-th barcode

Damerau–Levenshtein Distance for DNA Barcodes

The Damerau–Levenshtein distance measures sequence similarity by counting the minimum number of edit operations—insertions, deletions, substitutions, and adjacent transpositions— required to transform one DNA barcode into another.

Average pairwise distances are computed across taxonomic ranks using the following procedure:

Subgroups with fewer than 4 unique sequences are excluded
Subgroups with more than 1,000 sequences are randomly sampled to 1,000
Sequences are aligned using MAFFT
Pairwise distances are computed using the Damerau–Levenshtein metric
Mean and standard deviation are aggregated across subgroups at each taxonomic rank

Pairwise Distance Calculation

The repository provides two implementations for computing pairwise DNA barcode distances:

Pandas — suitable for small to medium-sized datasets
Apache Spark (PySpark) — optimized for large-scale datasets such as BIOSCAN-5M

Note: For optimal performance, experiments should be conducted in two phases, selecting the appropriate implementation based on dataset size.

Illustrations

The visualization summarizes the distribution of SDI and pairwise distances across taxonomic ranks. Each boxplot displays the Interquartile Range (IQR)—the range between the 25th percentile (Q1) and the 75th percentile (Q3)—capturing the central 50% of the data. The median (Q2) is shown as a line inside the box. Whiskers extend up to 1.5 times the IQR from Q1 and Q3, indicating data spread and helping to identify potential outliers.

Bioinformatics Project: Barcode-Metric