Scientists at the University of Chicago have developed a new statistical tool that improves the ability to find the genetic variants that cause disease. The tool, described in a paper published on January 26, 2024, in Nature Genetics, combines data from genome-wide association studies (GWAS) with predictions of gene expression to limit the number of false positives and more accurately identify causal genes and variants for a disease.
GWAS is a widely used approach to identify genes associated with various human traits, including common diseases. For instance, researchers compare the genome sequences of a large group of people with a specific disease with another set of sequences from healthy individuals. Differences found in the disease group may indicate genetic variants that increase the risk of this disease and require further investigation.
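The case-versus-control comparison described above can be illustrated for a single variant. The following is a minimal sketch, not the procedure used in the study: it assumes a simple Pearson chi-square test on allele counts, the function name `allele_chi_square` is hypothetical, and the counts are invented for illustration. A real GWAS repeats a test of this kind across millions of variants and adjusts for covariates and multiple testing.

```python
import math

def allele_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Rows are cases and controls; columns are the alternate and reference allele.
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if allele frequency were identical in both groups
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, the upper tail probability is erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Invented counts: the alternate allele is somewhat more common among cases
chi2, p = allele_chi_square(case_alt=600, case_ref=1400,
                            control_alt=500, control_ref=1500)
print(f"chi-square = {chi2:.2f}, p-value = {p:.2g}")
```

A small p-value here says only that the allele is more frequent among cases; it does not say that the variant itself causes the disease.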
However, most human diseases are not caused by a single genetic variation. Instead, they result from a complex interaction of multiple genes, environmental factors, and various other variables. Consequently, GWAS often identifies many variants in numerous regions of the genome that are associated with a disease. A key limitation of GWAS is that it identifies only associations, not causality. In a typical genomic region, many variants are highly correlated with one another due to a phenomenon called linkage disequilibrium. This is because DNA is passed down from one generation to the next in whole blocks rather than as individual genes, so nearby variants tend to be correlated.
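Linkage disequilibrium is commonly summarized by the squared correlation r² between the alleles at two variants. The sketch below computes it from a handful of invented haplotypes; the function name `ld_r2`, the data, and the variant labels A, B, and C are assumptions made purely for illustration.

```python
def ld_r2(haplotypes, i, j):
    """Squared correlation (r^2) between the alleles at positions i and j.

    Each haplotype is a tuple of 0/1 alleles, one entry per variant.
    """
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n          # allele frequency at variant i
    p_j = sum(h[j] for h in haplotypes) / n          # allele frequency at variant j
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n  # frequency of carrying both alleles
    d = p_ij - p_i * p_j                             # disequilibrium coefficient D
    return d * d / (p_i * (1 - p_i) * p_j * (1 - p_j))

# Invented haplotypes over three variants (A, B, C).
# A and B are usually inherited together; C is independent of A in this sample.
haplotypes = [
    (1, 1, 0), (1, 1, 1), (0, 0, 0), (0, 0, 1),
    (1, 1, 1), (0, 0, 0), (1, 0, 0), (0, 0, 1),
]

r2_ab = ld_r2(haplotypes, 0, 1)
r2_ac = ld_r2(haplotypes, 0, 2)
print(f"r^2(A, B) = {r2_ab:.2f}, r^2(A, C) = {r2_ac:.2f}")
```

When r² between two variants is high, an association signal at one of them shows up at the other as well, which is why association alone cannot single out the causal variant.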
“In a block, there may be many genetic variants, all correlated with disease risk, but you don’t know which variant is actually the causal variant. That’s the basic challenge of GWAS, namely how do we get from association to causation.”
Xin He, PhD, Associate Professor of Human Genetics and senior author of the new study
Complicating matters, most genetic variants are located in non-coding regions of the genome, which makes their effects difficult to interpret. A common strategy for addressing these challenges is to use gene expression levels. Expression quantitative trait loci (eQTLs) are genetic variants associated with gene expression.
The rationale for using eQTL data is that if a variant associated with a disease is an eQTL of some gene X, then X might be the link between the variant and the disease. The problem with this reasoning is that nearby variants and eQTLs of other genes can be correlated with the eQTL of gene X while affecting the disease directly, which leads to a false positive. Many methods have been developed to nominate risk genes from GWAS using eQTL data, but they all suffer from this fundamental problem of confounding by nearby associations. In fact, existing methods can produce false positive genes more than 50% of the time.
In the new study, Prof. He and Matthew Stephens, PhD, Ralph W. Gerard Professor and Chair of the Departments of Statistics and Professor of Human Genetics, developed a new method called causal-Transcriptome-wide Association Studies (cTWAS). It uses advanced statistical techniques to reduce the false positive rate. Instead of focusing on one gene at a time, the cTWAS model accounts for multiple genes and variants at once. Using a Bayesian multiple regression model, it can weed out confounding genes and variants.
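The difference between testing one gene at a time and modeling genes and variants together can be seen in a toy simulation. The sketch below is not the cTWAS method or its software: it swaps in ordinary least squares for the Bayesian multiple regression, and every name, effect size, and allele frequency in it is an assumption chosen for illustration. In the simulated data, a variant affects the trait directly, while the expression of a hypothetical gene X is driven by an eQTL that is merely correlated with that variant.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def marginal_slope(x, y):
    """Slope from regressing y on x alone (the one-at-a-time analysis)."""
    return cov(x, y) / cov(x, x)

def joint_slopes(x1, x2, y):
    """OLS slopes of y on x1 and x2 together, via the 2x2 normal equations."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

def draw_genotype(rng, freq=0.3):
    """Number of alternate alleles (0, 1, or 2) at an assumed allele frequency."""
    return sum(rng.random() < freq for _ in range(2))

def simulate(n=20000, seed=1):
    rng = random.Random(seed)
    variant, expression, trait = [], [], []
    for _ in range(n):
        v = draw_genotype(rng)                               # variant that truly affects the trait
        e = v if rng.random() < 0.9 else draw_genotype(rng)  # eQTL of gene X, in LD with v
        variant.append(v)
        expression.append(1.0 * e + rng.gauss(0.0, 0.5))     # gene X expression, driven by its eQTL
        trait.append(0.5 * v + rng.gauss(0.0, 1.0))          # gene X has NO effect on the trait
    return variant, expression, trait

variant, expression, trait = simulate()
naive_gene_effect = marginal_slope(expression, trait)
adjusted_gene_effect, variant_effect = joint_slopes(expression, variant, trait)
print(f"gene X alone:          {naive_gene_effect:.3f}")
print(f"gene X, joint model:   {adjusted_gene_effect:.3f}")
print(f"variant, joint model:  {variant_effect:.3f}")
```

Analyzed alone, gene X appears to affect the trait because its expression tracks the nearby variant. Once the variant is included in the same model, the apparent effect of gene X shrinks toward zero and the signal is attributed to the variant.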
“If you look at them one at a time, you will get false positives, but if you consider all the nearby genes and variants together, you are much more likely to find the causal gene,” He said.
The paper demonstrates the utility of the new technique by studying the genetics of LDL cholesterol levels. In one example, existing eQTL methods nominated a gene involved in DNA repair, whereas cTWAS pointed to a different variant, in the gene targeted by statins, a common class of medication for treating high cholesterol. Overall, cTWAS identified 35 putative causal genes for LDL, more than half of which had not been reported previously, suggesting new biological pathways and potential treatment targets for LDL.
The cTWAS software is now available for download from the He Lab website. He hopes to continue expanding its capabilities to include other types of omics data, such as splicing and epigenetics, and to integrate eQTLs from multiple tissue types.
“The software will allow people to conduct analyses linking genetic variations to phenotypes. That’s truly the biggest challenge facing the entire field,” He said. “We now have a much better tool for making these connections.”
Zhao, S., et al. (2024). Considering genetic confounding in transcriptome-wide association studies improves discovery of risk genes for complex traits. Nature Genetics. doi.org/10.1038/s41588-023-01648-9.