VariFAST: Automated Variant Filtering by Tagged-Signatures

Introduction

Next-Generation Sequencing (NGS) has become a foundational technology in genomics research. Despite advances in sequencing accuracy and variant callers, false-positive variant calls remain a significant problem in both whole-genome and whole-exome sequencing analyses. Traditionally, researchers filter out erroneous calls by manual review in tools such as the Integrative Genomics Viewer (IGV). However, this approach is labor-intensive, inconsistent across users and labs, and does not scale to large datasets.

To address these limitations, VariFAST (Variant Filter by Automated Scoring based on Tagged-signature) offers a novel automated alternative designed to replace manual IGV review with a more consistent, explainable, and efficient filtering strategy for both germline and somatic variants.

Challenges in Current Variant Filtering Methods

Manual review is subjective, time-consuming, and error-prone. Common automated methods also fall short:

Germline Variants:
- VQSR (Variant Quality Score Recalibration) requires large sample sizes to be effective and cannot be applied to somatic variants.

Somatic Variants:
- FilterMutectCalls provides initial filtering, but manual review remains essential.

Overview of VariFAST

VariFAST is an automated system designed to emulate and improve upon manual IGV review. It incorporates four core modules:

Metric Calculation: Quantifies 16–18 features (metrics) per variant based on read characteristics.
Tag Marking: Labels variants with interpretive tags consistent with manual review standards.
V-score Evaluation: Computes a weighted score based on the presence and strength of metrics.
Machine Learning Model (XGBoost): Applies a trained model to classify germline variants with improved accuracy.

Methodology

Input and Output

Input: .bam and .vcf files
Output: Annotated variants with scores and tags
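
As a rough illustration of the input side, the sketch below reads the fixed VCF columns (CHROM, POS, REF, ALT) from text lines using only the Python standard library. The Variant class and parse_vcf_lines function are hypothetical names chosen for this example and are not part of VariFAST; reading the .bam file is left out because it requires a dedicated library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int   # 1-based position, as stored in the VCF POS column
    ref: str
    alt: str


def parse_vcf_lines(lines):
    """Yield one Variant per ALT allele from an iterable of VCF text lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta-information lines, and the header row
        fields = line.split("\t")
        chrom, pos, _id, ref, alt = fields[:5]
        # Multi-allelic sites list their ALT alleles separated by commas.
        for allele in alt.split(","):
            yield Variant(chrom, int(pos), ref, allele)


# Made-up records, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tG\t50\tPASS\t.",
    "chr2\t200\t.\tCT\tC,CTT\t30\tPASS\t.",
]
variants = list(parse_vcf_lines(example))
```

Splitting multi-allelic records into one entry per ALT allele is a design choice of this sketch; it keeps every downstream step working on a single (REF, ALT) pair.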

Metric Calculation

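VariFAST uses a set of metrics inspired by IGV standard operating procedures (SOPs), grouped into three impact levels:

High-impact metrics (Level 3): lcr, vafr, lm, ni, nd
Moderate-impact metrics (Level 2): lncr, mv, hdr
Low-impact metrics (Level 1): the remaining metrics, such as sse, dir, mm, and r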

Two additional metrics for somatic variants:

nvaf: Variant allele frequency in normal sample
lncr: Low coverage in normal sample
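
The level grouping above can be written down as a small lookup table. This is only a sketch: the metric names and levels are copied from the lists above, while the helper names (metric_level, applicable_metrics) are hypothetical, and the level of nvaf is not stated in the text, so it falls through to the Level 1 default here purely as an assumption.

```python
# Levels copied from the lists above. "lncr" is listed both as a Level 2
# metric and as a somatic-only metric, so it appears once in each structure.
METRIC_LEVELS = {
    "lcr": 3, "vafr": 3, "lm": 3, "ni": 3, "nd": 3,
    "lncr": 2, "mv": 2, "hdr": 2,
    "sse": 1, "dir": 1, "mm": 1, "r": 1,
}

# Metrics that need a matched normal sample.
SOMATIC_ONLY = {"nvaf", "lncr"}


def metric_level(name: str) -> int:
    """Return the impact level of a metric; unlisted metrics default to Level 1."""
    # Assumption: the text says the "remaining" metrics are Level 1. The level
    # of "nvaf" is not given, so it also ends up at 1 in this sketch.
    return METRIC_LEVELS.get(name, 1)


def applicable_metrics(names, somatic: bool):
    """Drop the normal-sample metrics when filtering germline variants."""
    return [n for n in names if somatic or n not in SOMATIC_ONLY]
```

For example, applicable_metrics(["lm", "nvaf", "lncr"], somatic=False) keeps only "lm", since the other two depend on a normal sample.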

Tag Marking

19 tags are used, such as:

LM (Low Mapping)
NI / ND (Near Insertion / Deletion)
MV (Multiple Variants)
HDR (High Discrepancy Region)
RR (Repeat Region)

Variants with no tags are likely true positives.
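
Tag marking can be pictured as a set of rules, each turning a metric value into a tag. The sketch below is illustrative only: the tag names come from the list above, but the metric keys (mean_mapq, dist_to_insertion, dist_to_deletion, n_alt_alleles) and every cutoff are made-up assumptions, not VariFAST's actual definitions.

```python
# Hypothetical rules: each tag fires when its metric crosses an invented cutoff.
# Missing metrics fall back to a value that does not trigger the tag.
TAG_RULES = {
    "LM": lambda m: m.get("mean_mapq", 60.0) < 20.0,
    "NI": lambda m: m.get("dist_to_insertion", float("inf")) <= 5,
    "ND": lambda m: m.get("dist_to_deletion", float("inf")) <= 5,
    "MV": lambda m: m.get("n_alt_alleles", 1) > 1,
}


def mark_tags(metrics: dict) -> list:
    """Return the tags whose rule is satisfied, in rule-definition order."""
    return [tag for tag, rule in TAG_RULES.items() if rule(metrics)]


# A variant with poorly mapped reads and two alternate alleles at the site.
suspicious = mark_tags({"mean_mapq": 10, "n_alt_alleles": 2})

# A variant with no metrics out of range gets no tags.
clean = mark_tags({"mean_mapq": 58, "dist_to_insertion": 40})
```

Keeping the rules in a dictionary means each tag can be traced back to exactly one condition, which matches the interpretability goal described in this section.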

V-score Evaluation

The v-score is a weighted sum over the metrics flagged for a variant.

Thresholds of 3–4 give a robustly high Fβ-score
Fully interpretable scoring framework
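
A minimal sketch of such a weighted sum is given below. It assumes, without confirmation from the text, that each tag's weight equals the level of its corresponding metric (3, 2, or 1) and that tags without a listed level weigh 1. The default cutoff of 4 is taken from the upper end of the 3–4 range quoted above; the function names are hypothetical.

```python
# Assumed weights: the level of the metric behind each tag (not confirmed).
TAG_WEIGHTS = {"LM": 3, "NI": 3, "ND": 3, "MV": 2, "HDR": 2}


def v_score(tags, weights=TAG_WEIGHTS, default_weight=1):
    """Sum the weights of all tags on a variant; untagged variants score 0."""
    return sum(weights.get(tag, default_weight) for tag in tags)


def passes_filter(tags, threshold=4):
    """Keep a variant when its v-score is below the threshold."""
    return v_score(tags) < threshold


untagged = v_score([])               # no tags at all
two_tags = v_score(["LM", "MV"])     # one Level 3 tag plus one Level 2 tag
```

Because the score is a plain sum, a filtered variant can always be explained by listing the tags that pushed it over the threshold.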

XGBoost Model

Uses all calculated metrics as input
Hyperparameter optimization via grid search
Offers superior filtering performance but with reduced interpretability compared to v-score
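
Grid search itself is simple to sketch: enumerate every combination of candidate hyperparameter values and keep the best-scoring one. The example below uses only the standard library. The parameter names (max_depth, learning_rate, n_estimators) are real XGBoost hyperparameters, but the candidate values are arbitrary, and toy_evaluate is a placeholder where a real pipeline would train a model and return a cross-validated score. None of this is taken from the VariFAST code.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Arbitrary candidate values, for illustration only.
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1, 0.3],
    "n_estimators": [100, 300],
}


def toy_evaluate(params):
    # Placeholder objective that stands in for model training and validation.
    return (
        -abs(params["max_depth"] - 5)
        - abs(params["learning_rate"] - 0.1)
        + params["n_estimators"] / 1000
    )


best_params, best_score = grid_search(param_grid, toy_evaluate)
```

The grid above has 3 x 3 x 2 = 18 combinations; the cost of an exhaustive search grows multiplicatively with each added hyperparameter.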

Performance Validation

Germline Variant Filtering

Tested on GIAB benchmark data (HG001–HG004)
Metrics: Fβ, Precision, Recall, Accuracy, MCC, AUC
Validated against Sanger sequencing
XGBoost achieved higher AUC than v-score (e.g., 0.790 vs 0.712)
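
For reference, most of the metrics listed above follow directly from the confusion-matrix counts. The sketch below uses the standard definitions of precision, recall, Fβ, accuracy, and the Matthews correlation coefficient (MCC); AUC is omitted because it needs per-variant scores rather than counts. The function name and the example counts are made up for illustration.

```python
import math


def classification_metrics(tp, fp, tn, fn, beta=1.0):
    """Compute precision, recall, F-beta, accuracy, and MCC from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta * beta
    f_denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / f_denom if f_denom else 0.0
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_beta": f_beta,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Invented counts: 80 true variants kept, 20 false variants kept,
# 70 false variants removed, 30 true variants removed.
result = classification_metrics(tp=80, fp=20, tn=70, fn=30)
```

Setting beta above 1 weights recall more heavily than precision, and setting it below 1 does the opposite.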

Somatic Variant Filtering

Applied to penile squamous cell carcinoma and pituitary adenomas
Strong consistency with manual review
Variants with v-score < 4 were effectively identified as true variants

Comparison with VQSR

Feature                       VariFAST                VQSR
Large sample size required    No                      Yes
Somatic variant support       Yes                     No
Germline filtering            Yes                     Yes
Interpretability              High (v-score, tags)    Low
INDEL performance             High                    Moderate
MCC (HG001_a)                 0.522                   0.488
AUC (HG001_a)                 0.790                   0.756

Computational Efficiency

Germline (30,000 variants @ 80x): ~3 hours (64 cores)
Somatic (108,400 variants, 136 samples @ 90x): ~3 hours
Complexity: O(nML log N)
Platform: Python package with ray for parallelization
Availability:
GitHub: https://github.com/bioxsjtu/VariFAST
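
The parallelization noted above follows a common pattern: split the variants into batches, process each batch on a separate worker, and merge the results in order. The sketch below shows that pattern with the standard library's concurrent.futures as a stand-in, not with ray, and it does not reflect how VariFAST is implemented. The names chunked, score_batch, and score_all are hypothetical, and score_batch is a placeholder for the real per-variant metric calculation.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def score_batch(batch):
    # Placeholder work: a real worker would compute metrics from the reads.
    return [len(variant) for variant in batch]


def score_all(variants, batch_size=2, workers=4):
    """Process batches in parallel and return results in the input order."""
    batches = list(chunked(variants, batch_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order the batches were submitted.
        results = list(pool.map(score_batch, batches))
    return [score for batch in results for score in batch]


scores = score_all(["a", "bb", "ccc", "dddd", "e"])
```

Batching keeps scheduling overhead low when there are many variants; with ray, each batch would typically be submitted as a remote task instead of to a thread pool.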

Limitations and Future Directions

Complex Variants (MNPs): Require enhanced detection logic
Metric Weighting: Needs adaptive learning for different datasets
Generalizability: Performance can vary across platforms and read depths

Conclusion

VariFAST is a reliable and explainable alternative to manual IGV-based variant filtering. By combining rule-based scoring (v-score) with machine learning (XGBoost), it achieves high accuracy for both germline and somatic variants with interpretable results. Its scalability, flexibility, and transparency make it a promising tool for modern genomics pipelines.