ABSTRACT
Background Sequencing of both healthy and disease singletons yields many novel and low-frequency variants of uncertain significance (VUS). Complete gene and genome sequencing by next-generation sequencing (NGS) significantly increases the number of VUS detected. While prior studies have emphasized protein-coding variants, non-coding sequence variants have also been shown to contribute significantly to high-penetrance disorders, such as hereditary breast and ovarian cancer (HBOC). We present a strategy for analyzing different functional classes of non-coding variants based on information theory (IT).
Methods We captured and enriched for coding and non-coding variants in genes known to harbor mutations that increase HBOC risk. Custom oligonucleotide baits spanning the complete coding, non-coding, and intergenic regions 10 kb up- and downstream of ATM, BRCA1, BRCA2, CDH1, CHEK2, PALB2, and TP53 were synthesized for solution hybridization enrichment. Unique and divergent repetitive sequences were sequenced in 102 high-risk patients without identified mutations in BRCA1/2. In addition to protein-coding changes, IT-based sequence analysis was used to identify and prioritize pathogenic non-coding variants that occurred within sequence elements predicted to be recognized by proteins or protein complexes involved in mRNA splicing, transcription, and untranslated region (UTR) binding and structure. This approach was supplemented by in silico and laboratory analysis of UTR structure.
Results A total of 15,311 unique variants were identified, of which 245 occurred in coding regions. With the unified IT framework, 132 variants were identified and 87 functionally significant VUS were further prioritized. We also identified 4 stop-gain variants and 3 reading-frame-altering exonic insertions/deletions (indels).
Conclusions We have presented a strategy for complete gene sequence analysis followed by a unified framework for interpreting non-coding variants that may affect gene expression. This approach distills the large numbers of variants detected by NGS to a limited set of variants prioritized as potentially deleterious changes.
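
As an illustration of the kind of IT-based scoring the abstract refers to, the following minimal Python sketch computes the individual information (Ri) of a sequence against a weight matrix built from aligned binding sites, and the change (ΔRi) caused by a variant. It is a sketch under stated assumptions, not the pipeline used in this study: the weight formula `2 + log2 f(b, l)` follows the standard individual-information definition for a four-letter alphabet, the pseudocount replaces the small-sample correction, and the function names and the toy donor-like sites are hypothetical.

```python
import math

BASES = "ACGT"

def weight_matrix(sites):
    """Build per-position weights 2 + log2 f(b, l) from aligned sites.

    Assumption of this sketch: a pseudocount of 1 per base is used to
    avoid log2(0); it stands in for a proper small-sample correction.
    """
    n = len(sites)
    matrix = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        weights = {}
        for base in BASES:
            freq = (column.count(base) + 1) / (n + len(BASES))
            weights[base] = 2.0 + math.log2(freq)
        matrix.append(weights)
    return matrix

def individual_information(seq, matrix):
    """Ri (in bits): sum of the weights of the bases observed in seq."""
    return sum(matrix[pos][base] for pos, base in enumerate(seq))

def delta_ri(reference, variant, matrix):
    """Change in individual information: Ri(variant) - Ri(reference)."""
    return (individual_information(variant, matrix)
            - individual_information(reference, matrix))

# Hypothetical toy alignment of donor-like sites (illustration only).
toy_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTGAGT"]
toy_matrix = weight_matrix(toy_sites)

# Mean information content (Rsequence) is the average Ri over the sites.
r_sequence = sum(individual_information(s, toy_matrix)
                 for s in toy_sites) / len(toy_sites)

# A G>A change at the fourth position, which is G in every toy site.
change = delta_ri("CAGGTAAGT", "CAGATAAGT", toy_matrix)
```

In this toy case `change` is negative, meaning the variant sequence carries less information than the reference and would be predicted to bind more weakly under the model.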
Footnotes
* EJM and NGC should be considered to be joint first authors.
LIST OF ABBREVIATIONS
- ASSEDA: Automated Splice Site and Exon Definition Analysis
- BIC: Breast Cancer Information Core Database
- CASAVA: Consensus Assessment of Sequencing and Variation
- CIS-BP-RNA: Catalog of Inferred Sequence Binding Preferences of RNA binding proteins
- CRAC: Complex Reads Analysis and Classification
- DM2: Domain Mapping of Disease Mutations
- ENIGMA: Evidence-based Network for the Interpretation of Germline Mutant Alleles
- ExPASy: Expert Protein Analysis System
- GATK: Genome Analysis Toolkit
- HBOC: Hereditary Breast and Ovarian Cancer
- HGMD: Human Gene Mutation Database
- IARC: International Agency for Research on Cancer
- IGV: Integrative Genomics Viewer
- Indel: Insertion/deletion
- IT: Information theory
- LOVD: Leiden Open Variation Database
- MGL: Molecular Genetics Laboratory
- MLPA: Multiplex Ligation-dependent Probe Amplification
- NGS: Next-Generation Sequencing
- PTB: Polypyrimidine tract binding protein
- PTT: Protein Truncation Test
- PWM: Position Weight Matrix
- RBBS: RNA-Binding protein Binding Site
- RBP: RNA-Binding Protein
- RBPDB: RNA-Binding Protein DataBase
- Ri: Individual information
- Rsequence: Mean information content
- SHAPE: Selective 2’-Hydroxyl Acylation analyzed by Primer Extension
- SNV: Single Nucleotide Variant
- SRF: Splicing Regulatory Factor
- SRFBS: Splicing Regulatory Factor Binding Site
- SS: Splice Site
- TF: Transcription Factor
- TFBS: Transcription Factor Binding Site
- UTR: Untranslated Region
- VCF: Variant Call Format
- VUS: Variants of Uncertain Significance
- ΔRi: Change in individual information
- Patient sample IDs are assigned in the following manner: number-number+letter (e.g., 1–1A). If a sample was repeated, the IDs are separated by a “.” (e.g., 1–1A.2–1A).
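
The sample ID convention above can be checked mechanically. The sketch below is one possible parser, assuming that each ID is exactly two integers joined by a hyphen or en dash and followed by a single letter, and that repeated runs are joined by "."; the source does not say what the two numbers or the letter denote, so the field names and the function name `parse_sample_ids` are hypothetical.

```python
import re

# number, hyphen or en dash (U+2013), number, one letter
_SAMPLE_ID = re.compile(r"^(\d+)[-\u2013](\d+)([A-Za-z])$")

def parse_sample_ids(label):
    """Split a sample label into its runs.

    Returns one (first_number, second_number, letter) tuple per run;
    a repeated sample yields more than one tuple.
    """
    runs = []
    for part in label.split("."):
        match = _SAMPLE_ID.match(part)
        if match is None:
            raise ValueError(f"unrecognized sample ID: {part!r}")
        first, second, letter = match.groups()
        runs.append((int(first), int(second), letter))
    return runs

single = parse_sample_ids("1\u20131A")             # one run
repeated = parse_sample_ids("1\u20131A.2\u20131A")  # sample sequenced twice
```

Accepting both the ASCII hyphen and the en dash is a deliberate choice here, because the description uses a hyphen while the examples are typeset with an en dash.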