VariFAST: Automated Variant Filtering by Tagged-Signatures

Introduction

Next-Generation Sequencing (NGS) has become a foundational technology in genomics research. Despite advances in sequencing accuracy and variant calling, false-positive variants remain a significant issue in both whole-genome and whole-exome sequencing analyses. Traditionally, researchers rely on manual review with tools such as the Integrative Genomics Viewer (IGV) to filter out erroneous calls. However, this approach is labor-intensive, inconsistent across users and labs, and does not scale to large datasets.

To address these limitations, VariFAST (Variant Filter by Automated Scoring based on Tagged-signature) offers a novel automated alternative designed to replace manual IGV review with a more consistent, explainable, and efficient filtering strategy for both germline and somatic variants.

Challenges in Current Variant Filtering Methods

Manual review is subjective, time-consuming, and error-prone. Common automated methods also fall short:

  • Germline Variants:

    • VQSR (Variant Quality Score Recalibration) requires large sample sizes to be effective and cannot be applied to somatic variants.

  • Somatic Variants:

    • FilterMutectCalls provides initial filtering, but manual review remains essential.

Overview of VariFAST

VariFAST is an automated system designed to emulate and improve upon manual IGV review. It incorporates four core modules:

  1. Metric Calculation
    Quantifies 16–18 features (metrics) per variant from read characteristics, depending on whether the two somatic-only metrics apply.

  2. Tag Marking
    Labels variants with interpretive tags consistent with manual review standards.

  3. V-score Evaluation
    Computes a weighted score based on the presence and strength of metrics.

  4. Machine Learning Model (XGBoost)
    Applies a trained model to classify germline variants with improved accuracy.

Methodology

Input and Output
  • Input: .bam and .vcf files

  • Output: Annotated variants with scores and tags

Metric Calculation

VariFAST uses a set of metrics inspired by IGV SOPs:

  • High-impact metrics (Level 3): lcr, vafr, lm, ni, nd

  • Moderate (Level 2): lncr, mv, hdr

  • Low (Level 1): Remaining metrics like sse, dir, mm, r, etc.

Two additional metrics for somatic variants:

  • nvaf: Variant allele frequency in normal sample

  • lncr: Low coverage in normal sample
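
As a rough illustration of the two somatic metrics, the sketch below derives them from read counts in the matched normal sample. The function name, the `min_depth` cutoff of 10, and the simplification that depth equals reference plus alternate reads are assumptions made for this example, not VariFAST's actual definitions.

```python
def normal_sample_metrics(ref_reads, alt_reads, min_depth=10):
    """Sketch of nvaf and lncr for one variant in the normal sample.

    min_depth is a hypothetical cutoff; depth ignores reads supporting
    other alleles, which is a simplification.
    """
    depth = ref_reads + alt_reads
    # nvaf: fraction of normal-sample reads carrying the variant allele
    nvaf = alt_reads / depth if depth else 0.0
    # lncr: flag the variant when the normal sample is too shallow to trust
    lncr = depth < min_depth
    return {"nvaf": nvaf, "lncr": lncr}


well_covered = normal_sample_metrics(ref_reads=48, alt_reads=2)
shallow = normal_sample_metrics(ref_reads=3, alt_reads=0)
```

A non-zero nvaf suggests the allele is also present in the normal sample, which argues against a true somatic call, while a raised lncr flag means the normal sample cannot rule the variant out either way.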

Tag Marking

VariFAST defines 19 tags, including:

  • LM (Low Mapping)

  • NI / ND (Near Insertion / Deletion)

  • MV (Multiple Variants)

  • HDR (High Discrepancy Region)

  • RR (Repeat Region)

Variants with no tags are likely true positives.
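
The tagging step can be pictured as a set of rules applied to each variant's metrics. The sketch below is only an illustration: the metric field names and every threshold are hypothetical, chosen to show the shape of the logic rather than VariFAST's real rules.

```python
# Hypothetical rules: each tag fires when its (assumed) metric crosses an
# illustrative threshold. Field names and cutoffs are not VariFAST's own.
TAG_RULES = {
    "LM": lambda m: m["mean_mapq"] < 20,
    "NI": lambda m: m["bp_to_nearest_insertion"] <= 5,
    "ND": lambda m: m["bp_to_nearest_deletion"] <= 5,
    "MV": lambda m: m["variants_within_window"] > 1,
    "RR": lambda m: m["in_repeat_region"],
}


def mark_tags(metrics):
    """Return the tags whose rule fires, in rule order."""
    return [tag for tag, rule in TAG_RULES.items() if rule(metrics)]


clean = {
    "mean_mapq": 60,
    "bp_to_nearest_insertion": 250,
    "bp_to_nearest_deletion": 180,
    "variants_within_window": 1,
    "in_repeat_region": False,
}
noisy = {
    "mean_mapq": 12,
    "bp_to_nearest_insertion": 3,
    "bp_to_nearest_deletion": 400,
    "variants_within_window": 3,
    "in_repeat_region": False,
}

clean_tags = mark_tags(clean)   # no rule fires
noisy_tags = mark_tags(noisy)   # LM, NI and MV fire
```

Keeping the rules in a single table makes each tag traceable to one condition, which is the property that lets a reviewer see why a variant was flagged.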

V-score Evaluation

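The v-score is a weighted sum over the metrics:

v-score = Σᵢ wᵢ · xᵢ

where xᵢ is the value of metric i for the variant and wᵢ is the weight assigned to that metric.

  • A threshold of 3–4 robustly yields a high Fβ-score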

  • Fully interpretable scoring framework
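
A minimal sketch of the weighted sum follows. It assumes, purely for illustration, that each metric's weight equals its impact level (3, 2, or 1) and that each metric value is a 0/1 flag; the weights VariFAST actually uses may differ. The `threshold=4.0` default reflects the upper end of the 3–4 range quoted above.

```python
# Hypothetical weights: each metric's weight is set to its impact level.
# This mapping is an assumption for illustration, not VariFAST's own.
METRIC_WEIGHTS = {
    "lcr": 3, "vafr": 3, "lm": 3, "ni": 3, "nd": 3,   # Level 3
    "lncr": 2, "mv": 2, "hdr": 2,                     # Level 2
    "sse": 1, "dir": 1, "mm": 1, "r": 1,              # Level 1 (subset)
}


def v_score(flags, weights=METRIC_WEIGHTS):
    """Weighted sum of w_i * x_i over the metrics present in `flags`."""
    return sum(weights[name] * float(x) for name, x in flags.items())


def passes_filter(flags, threshold=4.0):
    """Keep a variant when its v-score stays below the threshold."""
    return v_score(flags) < threshold


clean = {"lcr": 0, "lm": 0, "mv": 0}
suspect = {"lm": 1, "ni": 1, "sse": 1}   # 3 + 3 + 1

clean_score = v_score(clean)
suspect_score = v_score(suspect)
```

Because the score is a plain sum, the contribution of every metric can be read off directly, which is what makes the framework interpretable.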

XGBoost Model
  • Uses all calculated metrics as input

  • Hyperparameter optimization via grid search

  • Offers superior filtering performance but with reduced interpretability compared to v-score


Performance Validation

Germline Variant Filtering
  • Tested on GIAB benchmark data (HG001–HG004)

  • Metrics: Fβ, Precision, Recall, Accuracy, MCC, AUC

  • Validated against Sanger sequencing

  • XGBoost achieved higher AUC than v-score (e.g., 0.790 vs 0.712)
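
For reference, the threshold-based metrics listed above can all be computed from a confusion matrix using their standard definitions. The counts in the sketch below are invented for illustration and are not results from the GIAB evaluation; AUC is omitted because it needs ranked scores rather than a single confusion matrix.

```python
import math


def classification_metrics(tp, fp, tn, fn, beta=1.0):
    """Precision, recall, F-beta, accuracy and MCC from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta * beta
    f_denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / f_denom if f_denom else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_beta": f_beta,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Made-up counts: 80 true variants kept, 20 false variants kept,
# 70 false variants removed, 30 true variants removed.
example = classification_metrics(tp=80, fp=20, tn=70, fn=30)
```

MCC uses all four cells of the confusion matrix, so it stays informative when true and false variants are imbalanced, which is one reason it is reported alongside accuracy.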

Somatic Variant Filtering
  • Applied to penile squamous cell carcinoma and pituitary adenomas

  • Strong consistency with manual review

  • A v-score cutoff of < 4 was effective at identifying true variants

Comparison with VQSR

Feature                    VariFAST               VQSR
Sample size requirement    No                     Yes (large)
Somatic variant support    Yes                    No
Germline filtering         Yes                    Yes
Interpretability           High (v-score, tags)   Low
INDEL performance          High                   Moderate
MCC (HG001_a)              0.522                  0.488
AUC (HG001_a)              0.790                  0.756

Computational Efficiency

  • Germline (30,000 variants @ 80x): ~3 hours (64 cores)

  • Somatic (108,400 variants, 136 samples @ 90x): ~3 hours

  • Complexity: O(nML logN)

  • Platform: Python package with ray for parallelization

  • Availability:
    GitHub: https://github.com/bioxsjtu/VariFAST

Limitations and Future Directions

  • Complex Variants (MNPs): Require enhanced detection logic

  • Metric Weighting: Needs adaptive learning for different datasets

  • Generalizability: Performance can vary across platforms and read depths

Conclusion

VariFAST is a reliable and explainable alternative to manual IGV-based variant filtering. By combining rule-based scoring (v-score) with machine learning (XGBoost), it achieves high accuracy for both germline and somatic variants with interpretable results. Its scalability, flexibility, and transparency make it a promising tool for modern genomics pipelines.
