Introduction
Next-Generation Sequencing (NGS) has become a foundational technology in genomics research. Despite advances in sequencing accuracy and variant callers, false positive variants remain a significant issue in both whole genome and whole exome sequencing analyses. Researchers have traditionally relied on manual review with tools such as the Integrative Genomics Viewer (IGV) to filter out erroneous calls. However, this approach is labor-intensive, inconsistent across users and labs, and does not scale to large datasets.
To address these limitations, VariFAST (Variant Filter by Automated Scoring based on Tagged-signature) is an automated alternative designed to replace manual IGV review with a more consistent, explainable, and efficient filtering strategy for both germline and somatic variants.
Challenges in Current Variant Filtering Methods
Manual review is subjective, time-consuming, and error-prone. Common automated methods also fall short:
Germline Variants:
VQSR (Variant Quality Score Recalibration) requires large sample sizes to be effective and cannot be applied to somatic variants.
Somatic Variants:
FilterMutectCalls provides initial filtering, but manual review remains essential.
Overview of VariFAST
VariFAST is an automated system designed to emulate and improve upon manual IGV review. It incorporates four core modules:
Metric Calculation: Quantifies 16–18 features (metrics) per variant based on read characteristics.
Tag Marking: Labels variants with interpretive tags consistent with manual review standards.
V-score Evaluation: Computes a weighted score based on the presence and strength of metrics.
Machine Learning Model (XGBoost): Applies a trained model to classify germline variants with improved accuracy.
Methodology
Input and Output
Input: .bam and .vcf files
Output: Annotated variants with scores and tags
Metric Calculation
VariFAST uses a set of metrics inspired by IGV SOPs:
High-impact metrics (Level 3): lcr, vafr, lm, ni, nd
Moderate (Level 2): lncr, mv, hdr
Low (Level 1): Remaining metrics like sse, dir, mm, r, etc.
Two additional metrics for somatic variants:
nvaf: Variant allele frequency in normal sample
lncr: Low coverage in normal sample
Tag Marking
19 tags are used, such as:
LM (Low Mapping)
NI / ND (Near Insertion / Deletion)
MV (Multiple Variants)
HDR (High Discrepancy Region)
RR (Repeat Region)
Variants with no tags are likely true positives.
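The tagging step can be pictured as a set of per-tag rules applied to a variant's metrics. The following is a minimal sketch under stated assumptions: the metric field names, the numeric cut-offs, and the `mark_tags` function are all hypothetical and chosen only for illustration; they are not VariFAST's actual rules or API.

```python
# Hypothetical rules: each tag fires when its (made-up) condition holds.
# Field names and cut-offs are illustrative assumptions, not VariFAST's values.
TAG_RULES = {
    "LM": lambda m: m["low_mapq_fraction"] > 0.5,  # Low Mapping
    "NI": lambda m: m["dist_to_insertion"] <= 5,   # Near Insertion
    "ND": lambda m: m["dist_to_deletion"] <= 5,    # Near Deletion
    "MV": lambda m: m["alt_allele_count"] > 1,     # Multiple Variants
}


def mark_tags(metrics):
    """Return the sorted list of tags whose rule fires for one variant."""
    return sorted(tag for tag, rule in TAG_RULES.items() if rule(metrics))


# Two made-up variants: one clean, one in a poorly mapped region near an insertion.
clean = {"low_mapq_fraction": 0.02, "dist_to_insertion": 120,
         "dist_to_deletion": 300, "alt_allele_count": 1}
noisy = {"low_mapq_fraction": 0.80, "dist_to_insertion": 3,
         "dist_to_deletion": 300, "alt_allele_count": 1}

print(mark_tags(clean))  # [] -> no tags, so likely a true positive
print(mark_tags(noisy))  # ['LM', 'NI']
```

Keeping each rule as a separate named entry mirrors how a reviewer would justify a rejection: the tags that fired are the explanation.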
V-score Evaluation
A weighted sum:
v-score = Σ wᵢ * xᵢ
Robust threshold for high Fβ-score: 3–4
Fully interpretable scoring framework
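To make the weighted sum concrete, here is a minimal sketch. It assumes that each xᵢ is a 0/1 flag indicating whether a metric fired and that each weight wᵢ equals the metric's level (3, 2, or 1) from the Metric Calculation section; the text does not state the actual weights, so treat these as placeholders. The function names and the cut-off of 4 (the upper end of the 3–4 range) are likewise illustrative.

```python
# Assumed weights: the metric's level (3 = high impact, 2 = moderate, 1 = low).
# The real VariFAST weights may differ; this mapping is for illustration only.
METRIC_LEVEL = {
    "lcr": 3, "vafr": 3, "lm": 3, "ni": 3, "nd": 3,
    "lncr": 2, "mv": 2, "hdr": 2,
    "sse": 1, "dir": 1, "mm": 1, "r": 1,
}


def v_score(flags, weights=METRIC_LEVEL):
    """Weighted sum of w_i * x_i, where x_i is 1 if metric i fired, else 0."""
    return sum(weights[name] * int(fired) for name, fired in flags.items())


def passes_filter(flags, cutoff=4):
    """Keep a variant when its v-score is below the (assumed) cut-off."""
    return v_score(flags) < cutoff


# Two made-up variants.
variant_a = {"sse": True, "mm": True}   # two low-level metrics fired
variant_b = {"lcr": True, "mv": True}   # one high and one moderate metric fired

print(v_score(variant_a), passes_filter(variant_a))  # 2 True
print(v_score(variant_b), passes_filter(variant_b))  # 5 False
```

Because the score is a plain sum, the contribution of each metric to a rejection can be read off directly, which is what makes this framework interpretable.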
XGBoost Model
Uses all calculated metrics as input
Hyperparameter optimization via grid search
Offers superior filtering performance but with reduced interpretability compared to v-score
Performance Validation
Germline Variant Filtering
Tested on GIAB benchmark data (HG001–HG004)
Metrics: Fβ, Precision, Recall, Accuracy, MCC, AUC
Validated against Sanger sequencing
XGBoost achieved higher AUC than v-score (e.g., 0.790 vs 0.712)
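For reference, most of the metrics listed above follow directly from confusion-matrix counts. The sketch below uses the standard definitions of precision, recall, accuracy, Fβ, and MCC; the counts are made up, and β defaults to 1 because the value of β used in the evaluation is not given here. AUC is omitted since it needs ranked scores rather than counts.

```python
import math


def classification_metrics(tp, fp, tn, fn, beta=1.0):
    """Compute standard filter-quality metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)

    # F-beta weights recall beta times as much as precision.
    b2 = beta ** 2
    f_denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / f_denom if f_denom else 0.0

    # Matthews correlation coefficient; defined as 0 when any margin is empty.
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0

    return {"precision": precision, "recall": recall, "accuracy": accuracy,
            "f_beta": f_beta, "mcc": mcc}


# Made-up counts for illustration only.
metrics = classification_metrics(tp=80, fp=20, tn=70, fn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```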
Somatic Variant Filtering
Applied to penile squamous cell carcinoma and pituitary adenomas
Strong consistency with manual review
Effective in identifying true variants with v-score < 4
Comparison with VQSR
| Feature | VariFAST | VQSR |
|---|---|---|
| Sample size requirement | No | Yes (large) |
| Somatic variant support | Yes | No |
| Germline filtering | Yes | Yes |
| Interpretability | High (v-score, tags) | Low |
| INDEL performance | High | Moderate |
| MCC (HG001_a) | 0.522 | 0.488 |
| AUC (HG001_a) | 0.790 | 0.756 |
Computational Efficiency
Germline (30,000 variants @ 80x): ~3 hours (64 cores)
Somatic (108,400 variants, 136 samples @ 90x): ~3 hours
Complexity: O(nML log N)
Platform: Python package with ray for parallelization
Availability:
GitHub: https://github.com/bioxsjtu/VariFAST
Limitations and Future Directions
Complex Variants (MNPs): Require enhanced detection logic
Metric Weighting: Needs adaptive learning for different datasets
Generalizability: Performance can vary across platforms and read depths
Conclusion
VariFAST is a reliable and explainable alternative to manual IGV-based variant filtering. By combining rule-based scoring (v-score) with machine learning (XGBoost), it achieves high accuracy for both germline and somatic variants with interpretable results. Its scalability, flexibility, and transparency make it a promising tool for modern genomics pipelines.