Bioinformatics Project: Barcode-Metric

Project Description

barcodemetric
DNA Nucleotide Barcode Sequence

This project provides tools for analyzing DNA nucleotide barcode sequences derived from the BIOSCAN-1M and BIOSCAN-5M datasets. The data is hierarchically structured by taxonomic ranks—from phylum down to species—where each rank contains subgroups of identical barcode sequences. The project enables statistical analysis of DNA barcode variation across taxonomy to support biodiversity and bioinformatics research.

Additional Context

DNA Barcode Sequence

A DNA barcode sequence is the ordered arrangement of the four nucleotides, Adenine (A), Thymine (T), Cytosine (C), and Guanine (G), within a specific gene region, commonly the mitochondrial cytochrome c oxidase subunit I (COI) gene. For example:

    TTTATATTTTATTTTTGGAGCATGATCAGGAATAGTTGGAACTTCAATAAGTTTATTAATTCGAACAGAATTAAG...
  

Shannon Diversity Index (SDI)

The Shannon Diversity Index (SDI) quantifies genetic diversity by accounting for both:

  • Richness: Number of unique DNA barcode sequences
  • Evenness: Distribution of samples across those barcodes

Higher SDI values indicate greater genetic diversity with more evenly distributed barcodes, while lower values suggest dominance by a small number of barcodes.

Definition:

\( \mathrm{SDI}_T = -\sum_{i=1}^{N} p_i \log_2(p_i) \)

  • \( N \) is the number of unique barcodes in subgroup \( T \)
  • \( p_i \) is the proportion of samples in \( T \) with the \( i \)-th barcode
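
As a minimal sketch of the definition above, the SDI of one subgroup can be computed from a plain list of barcode strings, one entry per sample. The function name `shannon_diversity` and the toy sequences are illustrative assumptions, not part of the repository.

```python
import math
from collections import Counter


def shannon_diversity(barcodes):
    """Return the base-2 Shannon Diversity Index of a list of barcode strings.

    Hypothetical helper: each list entry is one sample's barcode.
    """
    total = len(barcodes)
    if total == 0:
        return 0.0
    counts = Counter(barcodes)  # samples per unique barcode
    # p_i is the share of samples carrying the i-th unique barcode.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Four barcodes with one sample each: maximal evenness for N = 4.
even = shannon_diversity(["ACGT", "ACGA", "ACGC", "ACGG"])
# One barcode carried by 7 of 8 samples: low diversity.
skewed = shannon_diversity(["ACGT"] * 7 + ["ACGA"])
print(even, skewed)
```

With four equally frequent barcodes the index equals \( \log_2 4 = 2 \), the maximum for \( N = 4 \); the skewed subgroup scores well below that.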

Damerau–Levenshtein Distance for DNA Barcodes

The Damerau–Levenshtein distance quantifies how different two sequences are by counting the minimum number of edit operations (insertions, deletions, substitutions, and adjacent transpositions) required to transform one DNA barcode into another. A smaller distance means more similar barcodes.
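
The sketch below illustrates the idea with the optimal string alignment variant (often called restricted Damerau–Levenshtein), which is simpler to implement than the unrestricted metric and can give a larger value in rare cases. It is an illustration only; which variant or library the repository uses is not stated here, and the name `osa_distance` is an assumption.

```python
def osa_distance(a, b):
    """Edit distance allowing insertions, deletions, substitutions, and
    adjacent transpositions (optimal string alignment variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i leading characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j leading characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition: the last two characters are swapped.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("ACGT", "AGCT"))  # swapping the adjacent C and G is one edit
print(osa_distance("ACGT", "ACT"))   # one deletion
```

The table costs O(len(a) × len(b)) time and memory per pair, which is one reason large subgroups are sampled before pairwise comparison.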

Average pairwise distances are computed across taxonomic ranks using the following procedure:

  • Subgroups with fewer than 4 unique sequences are excluded
  • Subgroups with more than 1,000 sequences are randomly sampled to 1,000
  • Sequences are aligned using MAFFT
  • Pairwise distances are computed using the Damerau–Levenshtein metric
  • Mean and standard deviation are aggregated across subgroups at each taxonomic rank
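
The procedure above can be sketched in plain Python as follows. Several details are assumptions made for illustration: sequences are taken to be aligned already (the MAFFT step is not reproduced), the 1,000-sequence cap is applied to the list as given, and the distance function is pluggable, with a simple Hamming count as a stand-in rather than the Damerau–Levenshtein metric. All function and constant names are hypothetical.

```python
import random
import statistics
from itertools import combinations

MIN_UNIQUE = 4        # subgroups with fewer unique sequences are excluded
MAX_SEQUENCES = 1000  # larger subgroups are randomly sampled down to this size


def hamming(a, b):
    """Stand-in distance for equal-length aligned sequences (not Damerau-Levenshtein)."""
    return sum(x != y for x, y in zip(a, b))


def subgroup_mean_distance(sequences, distance=hamming, rng=random):
    """Mean pairwise distance for one subgroup, or None if the subgroup is excluded."""
    if len(set(sequences)) < MIN_UNIQUE:
        return None
    if len(sequences) > MAX_SEQUENCES:
        sequences = rng.sample(list(sequences), MAX_SEQUENCES)
    # Alignment is assumed to have been done before this point.
    return statistics.mean(distance(a, b) for a, b in combinations(sequences, 2))


def rank_summary(subgroups, distance=hamming, rng=random):
    """Aggregate subgroup means at one rank into (mean, standard deviation)."""
    means = []
    for sequences in subgroups.values():
        value = subgroup_mean_distance(sequences, distance, rng)
        if value is not None:
            means.append(value)
    if not means:
        return None
    spread = statistics.stdev(means) if len(means) > 1 else 0.0
    return statistics.mean(means), spread


# Toy rank with two subgroups; "B" has only 2 unique sequences and is skipped.
toy_rank = {
    "A": ["AAAA", "AAAT", "AATT", "ATTT"],
    "B": ["AAAA", "AAAA", "AAAT"],
}
summary = rank_summary(toy_rank)
print(summary)
```

Passing a different `distance` callable swaps the metric without touching the filtering and aggregation logic.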

Pairwise Distance Calculation

The repository provides two implementations for computing pairwise DNA barcode distances:

  • Pandas — suitable for small to medium-sized datasets
  • Apache Spark (PySpark) — optimized for large-scale datasets such as BIOSCAN-5M

Note: For best performance, run the experiments in two phases and choose the implementation by dataset size: Pandas for small to medium-sized datasets, PySpark for large-scale ones.

Illustrations

The figures summarize the distributions of SDI and pairwise distance across taxonomic ranks. Each boxplot displays the interquartile range (IQR), the range between the 25th percentile (Q1) and the 75th percentile (Q3), which covers the central 50% of the data. The median (Q2) is shown as a line inside the box. Whiskers extend to the most extreme data points within 1.5 times the IQR of Q1 and Q3; points beyond the whiskers are potential outliers.
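
To make the boxplot quantities concrete, the following sketch computes them for a small list of values using only the standard library. Quantile conventions differ between tools, so the numbers a plotting library reports may vary slightly; the `boxplot_stats` name and the sample values are illustrative assumptions.

```python
import statistics


def boxplot_stats(values):
    """Return quartiles, IQR, whisker ends, and outliers for a list of numbers."""
    # "inclusive" interpolates linearly between the observed data points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Whiskers stop at the most extreme points still inside the fences.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }


stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(stats)
```

In this toy input the value 50 lies above Q3 + 1.5 × IQR, so it is reported as an outlier and the upper whisker ends at 9.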

Distributions of SDI across taxonomic ranks


Distributions of pairwise distance across subgroups of class


Distributions of pairwise distance across subgroups of order
