Department of Computer Science and Technology

A novel computational method for integrating gene expression across many different types of human tissue has been developed by a team including researchers here.

Their new model uses graph representation learning to process genetic information from around 50 different types of tissue taken from an individual, ranging widely from salivary gland to skin, stomach, spleen and skeletal muscle. The model then uses the learned factorised representations to predict – with greater flexibility than existing models – gene expression data for other types of tissue that were not collected. 

And this ability to predict unmeasured gene expression across a broad collection of tissues and cell types "may expand our understanding of the molecular origins of complex traits", the scientists say in a paper – "Hypergraph factorization for multi-tissue gene expression imputation" – that has just been published in Nature Machine Intelligence.

Using such an approach could help us gain a better understanding of the connections between different parts of the body (such as the brain and the gut) and the biological mechanisms that drive a variety of diseases from Alzheimer’s to cancer.

The work is described by its first author Ramon Viñas as "a step forward toward the multi-tissue and multi-scale integration of omics data".

To develop the new model, researchers from this Department – PhD students Ramon Viñas, Chaitanya Joshi and Dobrik Georgiev, along with supervisor Professor Pietro Liò – worked with colleagues from the Vanderbilt University Medical Centre and the Department of Statistics and Irving Institute for Cancer Dynamics at Columbia University.

It is the integration of multi-tissue gene expression data – and the ability to make accurate predictions based on it – that offers such promise.

Gene expression is the process by which our genetic information – our own personal set of instructions – is transformed into the proteins that carry out vital functions within our bodies, from building and repairing tissue to keeping our immune systems strong.

Gene expression is a rich source of molecular information that reflects how our body is functioning, or why it is falling ill. But it is hard to acquire and measure. Until now, as the paper explains, the invasiveness of the tissue sampling process has meant that "gene expression is usually measured independently in easy-to-acquire tissues, leading to an incomplete picture of an individual’s physiological state."

In other words, the molecular information available from easy-to-access material like blood has been used as a proxy for the molecular information from harder-to-access organs like the heart. "But this approach is not very accurate as gene expression is very tissue-specific," Ramon explains.

Two years ago, there was a step forward when the journal Science Advances published a paper on the TEEBoT method, which presented a computational model to predict gene expression in a target tissue from blood gene expression. This was a more accurate method, but it only handled a single reference tissue, such as blood.

The new method represents an advance on this. "What we do is predict gene expression in a target tissue of interest as a function of multiple collected tissues – for example, accessible tissues such as blood, skin, mucus, adipose tissue, etc," Ramon says. "Accounting for multiple tissues leads to increased statistical strength." In fact, the research team used the Genotype-Tissue Expression project (GTEx) dataset of over 50 human tissue types taken from more than 800 individuals.

He adds: "The novelty is that this approach leverages multiple tissue types from the same individual. In the past, previous technologies only focused on a single tissue at a time. But here, we can model a variable number of reference tissues per individual."

To address this task, the researchers represented the data as a hypergraph with three types of nodes: donor nodes, tissue nodes and metagene nodes. They then applied graph representation learning to iteratively update the features of the donor nodes.
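One way to picture this data structure is sketched below in Python. It is an illustration only, not the authors' implementation: the class name `ExpressionHypergraph`, its methods and the toy values are all assumptions, and it treats each measured value as a hyperedge joining one donor, one tissue and one metagene.

```python
from dataclasses import dataclass, field


@dataclass
class ExpressionHypergraph:
    """Toy container with hypothetical names; not the published model's code."""
    donors: set = field(default_factory=set)
    tissues: set = field(default_factory=set)
    metagenes: set = field(default_factory=set)
    # Each hyperedge joins one donor, one tissue and one metagene node
    # and carries the value measured for that combination.
    edges: dict = field(default_factory=dict)

    def add_sample(self, donor, tissue, metagene_values):
        """Register one collected tissue sample for a donor."""
        self.donors.add(donor)
        self.tissues.add(tissue)
        for metagene, value in metagene_values.items():
            self.metagenes.add(metagene)
            self.edges[(donor, tissue, metagene)] = value

    def collected_tissues(self, donor):
        """Tissues for which this donor has at least one hyperedge."""
        return {t for (d, t, _m) in self.edges if d == donor}

    def missing_tissues(self, donor):
        """Tissues seen in the dataset but not collected from this donor."""
        return self.tissues - self.collected_tissues(donor)


# Toy data (made up): donor_2 has blood but no skin sample.
graph = ExpressionHypergraph()
graph.add_sample("donor_1", "blood", {"metagene_0": 0.8, "metagene_1": -0.2})
graph.add_sample("donor_1", "skin", {"metagene_0": 0.1, "metagene_1": 0.5})
graph.add_sample("donor_2", "blood", {"metagene_0": 0.6, "metagene_1": -0.4})

print(sorted(graph.missing_tissues("donor_2")))
```

Because each donor simply has as many hyperedges as they have measurements, a structure like this can hold a different number of collected tissues per donor without any padding.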

"We were trying to integrate the gene expression data from the multiple collected tissues by using a paradigm known as message passing," Ramon says. "And once we did that, we could use the learned features to make predictions about what the gene expression would look like in the tissues that had not been collected."

Using this new method leads to improved results over standard imputation techniques, the researchers say, and also better performance than could be achieved by using a single reference tissue, for example via TEEBoT.
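The message-passing idea Ramon describes can be sketched very roughly as follows. This is a simplified illustration under stated assumptions, not the published model: the function names are hypothetical, the embeddings are fixed toy numbers rather than learned parameters, and the update is a plain average of messages where the real model would use trained neural network layers.

```python
def update_donor_features(donor_feats, tissue_emb, metagene_emb, edges):
    """One simplified round of message passing into the donor nodes.

    edges maps (donor, tissue, metagene) -> observed value. Each hyperedge
    sends the donor a message: value * tissue embedding * metagene embedding
    (element-wise). The donor adds the mean of its messages to its features.
    """
    dim = len(next(iter(donor_feats.values())))
    sums = {d: [0.0] * dim for d in donor_feats}
    counts = {d: 0 for d in donor_feats}
    for (donor, tissue, metagene), value in edges.items():
        for k in range(dim):
            sums[donor][k] += value * tissue_emb[tissue][k] * metagene_emb[metagene][k]
        counts[donor] += 1

    updated = {}
    for donor, feat in donor_feats.items():
        if counts[donor] == 0:
            # No collected tissues: the donor features stay as they were.
            updated[donor] = list(feat)
        else:
            updated[donor] = [f + s / counts[donor] for f, s in zip(feat, sums[donor])]
    return updated


def predict(donor_feat, tissue_vec, metagene_vec):
    """Score a (donor, tissue, metagene) triple as a three-way product sum."""
    return sum(d * t * m for d, t, m in zip(donor_feat, tissue_vec, metagene_vec))


# Toy, hand-picked numbers (assumptions, not learned from data).
donor_feats = {"a": [0.0, 0.0], "b": [1.0, 2.0]}
tissue_emb = {"blood": [1.0, 2.0], "heart": [0.5, 1.0]}
metagene_emb = {"m0": [1.0, 1.0], "m1": [2.0, 0.5]}
# Donor "a" has blood collected; donor "b" has nothing collected.
edges = {("a", "blood", "m0"): 2.0, ("a", "blood", "m1"): 4.0}

updated = update_donor_features(donor_feats, tissue_emb, metagene_emb, edges)
# Predict metagene m0 in heart, a tissue that was not collected from "a".
heart_m0 = predict(updated["a"], tissue_emb["heart"], metagene_emb["m0"])
print(updated["a"], heart_m0)
```

The point of the sketch is the flow of information: measurements from whichever tissues were collected are folded into one set of donor features, and those features are then combined with the embedding of an uncollected tissue to produce a prediction.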

It also allowed the researchers to develop additional insights into what are known as eQTLs – or expression Quantitative Trait Loci. "Essentially, this is information about which variants of an individual's genome are associated with their gene expression," Ramon explains.
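As a rough illustration of what such an association looks like, the sketch below fits a straight line relating the number of copies of a variant a person carries (0, 1 or 2) to their expression level for one gene in one tissue. It is a textbook least-squares slope, not the researchers' eQTL pipeline: the function name `eqtl_slope` and the toy numbers are assumptions, and real eQTL mapping also adjusts for covariates and corrects for testing many variants.

```python
def eqtl_slope(dosages, expression):
    """Least-squares slope of expression against variant dosage.

    dosages: copies of the variant per individual (0, 1 or 2).
    expression: expression of one gene in one tissue, same order.
    A slope far from zero suggests the variant is associated with expression.
    """
    if len(dosages) != len(expression):
        raise ValueError("dosages and expression must have the same length")
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("all individuals carry the same dosage")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    return sxy / sxx


# Toy data (made up): expression rises by one unit per copy of the variant.
dosages = [0, 1, 2, 0, 1, 2]
expression = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
slope = eqtl_slope(dosages, expression)
print(slope)
```

Filling in uncollected tissues gives an analysis like this more individuals per tissue, which is one plausible reason a completed dataset can surface associations that were missed before.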

"We were able to use our method to impute the gene expression of all uncollected tissues in GTEx to create a complete dataset where all individuals had all tissues collected. And using this complete dataset, we could identify novel eQTLs that had previously not been detected.

"From the computational side," he adds, "this approach is quite novel because as far as I know, it's the first method that can integrate gene expression information from different tissues of the human body in a single computational model. Most existing approaches to modelling gene expression only focus on a single tissue at a time. This method may improve downstream analyses in existing bioinformatics pipelines and may find application beyond computational genomics."



Published by Rachel Gardner on Friday 11th August 2023