BacPrune-Rust performs linkage disequilibrium (LD) pruning for haploid genotype matrices using D' and r scores with optional additional filtering by minor allele frequency (MAF). Written in Rust and designed to scale to matrices with millions of variants and thousands of isolates, such as the complete CRyPTIC Consortium M. tuberculosis dataset.
BacPrune-Rust is a self-contained compiled binary with no runtime dependencies. Building from source requires Rust ≥ 1.78 (stable toolchain, edition 2021).
BacPrune-Rust is available on the bioconda channel. Install into an existing environment:
conda install -c bioconda bacprune
or create a new dedicated environment (recommended):
conda create -n bacprune -c bioconda bacprune
To build from source (requires Rust ≥ 1.78, stable toolchain, edition 2021), run the following:
git clone https://github.com/bacpop/BacPrune-Rust
cd BacPrune-Rust
cargo build --release
The compiled binary will be located at target/release/bacprune.
You can also install the Python package from source using pip (requires Rust ≥ 1.78 and maturin):
git clone https://github.com/bacpop/BacPrune-Rust
cd BacPrune-Rust
pip install maturin
pip install .
BacPrune offers three modes: pruning by r² (Pearson r-squared), pruning by |D'| (Lewontin's D'), and pruning only variants with identical presence and absence (perfect positive correlation) as determined by variant hashes.
r² is the default and is recommended for use in GWAS applications where LD pruning is used to reduce nonidentifiability arising from correlations between predictors, as it will prune strong positive and negative correlations without pruning weakly-correlated variants (which |D'| may).
Usage: bacprune [OPTIONS] <INPUT_FILE> <N_ROWS> <N_COLS> <MAF_CUTOFF> <OUTPUT_DIRECTORY>
Arguments:
<INPUT_FILE> Path to the input CSV (numeric header row + sample rows of 0/1 genotypes)
<N_ROWS> Total number of rows in the CSV including the header row
<N_COLS> Number of columns in the CSV
<MAF_CUTOFF> Minor allele frequency cutoff; variants below this threshold are removed
<OUTPUT_DIRECTORY> Directory where output files are written
Options:
--ld <THRESHOLD> LD pruning threshold; pairs at or above this value are pruned. Required for --r and --dprime; not used with --dedup
--r Prune by r² (Pearson r-squared) threshold [default method]
--dprime Prune by |D'| (Lewontin's D') threshold
--dedup Remove exact duplicate variants only via hashing; no pairwise LD calculation
-h, --help Print help
-V, --version Print version
Tip
n_rows includes the header row
The genotype matrix should be in CSV format, with the first row being a header of (integer) variant IDs, and subsequent rows being samples with values 0 (reference) or 1 (alternate allele).
Note
MAF filtering operates on the frequency of the allele encoded as 1; thus, a variant with an allele encoded as 0 that is below the MAF threshold will never be filtered. If you wish to filter low-frequency reference alleles, normalize your genotype matrix so that the alternate allele is always 1 and the reference allele is always 0.
# r² (default) pruning for r²≥0.8, filter variants with MAF<5%
bacprune genotypes.csv 613 87092 0.05 ./results --ld 0.8
# D' pruning for |D'|=1, filter variants with MAF<5%
bacprune genotypes.csv 613 87092 0.05 ./results --ld 1 --dprime
# prune exact duplicates only, filter variants with MAF<1%
bacprune genotypes.csv 613 87092 0.01 ./results --dedup
All files are written to <output_directory>.
bacprune_rust_results.csv: pruned genotype matrix in the same format as the inputld_pruning_summary.csv: one row per representative SNP, listing the post-MAF-filter column indices (base 0) of all variants pruned from its groupdirection_of_correlation.csv: one row per post-MAF variant;Statusisrepresentative,positive_correlation(identical genotype pattern, r > 0), ornegative_correlation(complement pattern, r < 0)
Phase 1 (all modes): each variant column is hashed as a Vec<u8> and inserted into a hash map. Exact duplicates are removed in a single O(n·v) pass. Complementary columns hash differently and are both retained.
Phase 2 (--r and --dprime only): greedy pairwise scan over remaining variants. For each pair exceeding the --ld threshold, the lower-MAF variant is pruned. The sign of r determines the correlation direction recorded in direction_of_correlation.csv for both modes.
Pearson r²: r = (n·Σxy − Σx·Σy) / √(Σx·(n−Σx) · Σy·(n−Σy)) — threshold applied to r². The sign of r determines the correlation direction recorded in direction_of_correlation.csv.
D': D = f(00)·f(11) − f(10)·f(01), normalised by D_max. Always returns |D'| ∈ [0, 1]. Correlation direction is determined separately via r since D' discards the sign.