BacPrune-Rust

BacPrune-Rust performs linkage disequilibrium (LD) pruning for haploid genotype matrices using D' and r scores with optional additional filtering by minor allele frequency (MAF). Written in Rust and designed to scale to matrices with millions of variants and thousands of isolates, such as the complete CRyPTIC Consortium M. tuberculosis dataset.

Installation

Dependencies

BacPrune-Rust is a self-contained compiled binary with no runtime dependencies. Building from source requires Rust ≥ 1.78 (stable toolchain, edition 2021).

Installing with conda (recommended)

BacPrune-Rust is available on the bioconda channel. Install into an existing environment:

conda install -c bioconda bacprune

or create a new dedicated environment (recommended):

conda create -n bacprune -c bioconda bacprune

Building from source

To build from source (requires Rust ≥ 1.78, stable toolchain, edition 2021), run the following:

git clone https://github.com/bacpop/BacPrune-Rust
cd BacPrune-Rust
cargo build --release

The compiled binary will be located at target/release/bacprune.

You can also install the Python package from source using pip (requires Rust ≥ 1.78 and maturin):

git clone https://github.com/bacpop/BacPrune-Rust
cd BacPrune-Rust
pip install maturin
pip install .

Usage

BacPrune offers three modes: pruning by r² (Pearson r-squared), pruning by |D'| (Lewontin's D'), and pruning only variants with identical presence and absence (perfect positive correlation) as determined by variant hashes.

r² is the default and is recommended for use in GWAS applications where LD pruning is used to reduce nonidentifiability arising from correlations between predictors, as it will prune strong positive and negative correlations without pruning weakly-correlated variants (which |D'| may).

Usage: bacprune [OPTIONS] <INPUT_FILE> <N_ROWS> <N_COLS> <MAF_CUTOFF> <OUTPUT_DIRECTORY>

Arguments:
  <INPUT_FILE>        Path to the input CSV (numeric header row + sample rows of 0/1 genotypes)
  <N_ROWS>            Total number of rows in the CSV including the header row
  <N_COLS>            Number of columns in the CSV
  <MAF_CUTOFF>        Minor allele frequency cutoff; variants below this threshold are removed
  <OUTPUT_DIRECTORY>  Directory where output files are written

Options:
      --ld <THRESHOLD>  LD pruning threshold; pairs at or above this value are pruned. Required for --r and --dprime; not used with --dedup
      --r               Prune by r² (Pearson r-squared) threshold [default method]
      --dprime          Prune by |D'| (Lewontin's D') threshold
      --dedup           Remove exact duplicate variants only via hashing; no pairwise LD calculation
  -h, --help            Print help
  -V, --version         Print version

Tip

n_rows includes the header row

Formatting the genotype matrix

The genotype matrix should be in CSV format, with the first row being a header of (integer) variant IDs, and subsequent rows being samples with values 0 (reference) or 1 (alternate allele).

Note

MAF filtering operates on the frequency of the allele encoded as 1; thus, a variant with an allele encoded as 0 that is below the MAF threshold will never be filtered. If you wish to filter low-frequency reference alleles, normalize your genotype matrix so that the alternate allele is always 1 and the reference allele is always 0.

Example Commands

# r² (default) pruning for r²≥0.8, filter variants with MAF<5%
bacprune genotypes.csv 613 87092 0.05 ./results --ld 0.8

# D' pruning for |D'|=1, filter variants with MAF<5%
bacprune genotypes.csv 613 87092 0.05 ./results --ld 1 --dprime

# prune exact duplicates only, filter variants with MAF<1%
bacprune genotypes.csv 613 87092 0.01 ./results --dedup

Output files

All files are written to <output_directory>.

bacprune_rust_results.csv: pruned genotype matrix in the same format as the input
ld_pruning_summary.csv: one row per representative SNP, listing the post-MAF-filter column indices (base 0) of all variants pruned from its group
direction_of_correlation.csv: one row per post-MAF variant; Status is representative, positive_correlation (identical genotype pattern, r > 0), or negative_correlation (complement pattern, r < 0)

Algorithm

Phase 1 (all modes): each variant column is hashed as a Vec<u8> and inserted into a hash map. Exact duplicates are removed in a single O(n·v) pass. Complementary columns hash differently and are both retained.

Phase 2 (--r and --dprime only): greedy pairwise scan over remaining variants. For each pair exceeding the --ld threshold, the lower-MAF variant is pruned. The sign of r determines the correlation direction recorded in direction_of_correlation.csv for both modes.

Pearson r²: r = (n·Σxy − Σx·Σy) / √(Σx·(n−Σx) · Σy·(n−Σy)) — threshold applied to r². The sign of r determines the correlation direction recorded in direction_of_correlation.csv.

D': D = f(00)·f(11) − f(10)·f(01), normalised by D_max. Always returns |D'| ∈ [0, 1]. Correlation direction is determined separately via r since D' discards the sign.

Name		Name	Last commit message	Last commit date
Latest commit History 113 Commits
bioconda-recipe		bioconda-recipe
experimental		experimental
src		src
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BacPrune-Rust

Installation

Dependencies

Installing with conda (recommended)

Building from source

Usage

Formatting the genotype matrix

Example Commands

Output files

Algorithm

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

BacPrune-Rust

Installation

Dependencies

Installing with conda (recommended)

Building from source

Usage

Formatting the genotype matrix

Example Commands

Output files

Algorithm

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages