Deep Learning Models to Accelerate Translational Cancer Genomics

Our ability to measure large numbers of molecular features directly from patient tumors and germline is unprecedented. However, our ability to rationally infer meaning from these large datasets remains limited. Typical machine learning models can be trained to capture the complex relationship between a patient's profile (e.g., genomics and transcriptomics) and different clinical outcomes (e.g., survival, progression, and drug response). While this is helpful in many cases, the “black box” nature of such models limits how well we can understand that relationship. Here, we will create a new model (P-NET), a customized neural network with a biologically inspired architecture, to both accurately predict biologically and clinically relevant outcomes and provide a better understanding of the interactions between different molecular components. The developed model will be used to predict the probability of cancer progression in patients with prostate cancer. A better understanding of the process of progression in cancer can guide clinical intervention, help identify potential molecular progression biomarkers, and enable new translational discoveries at the intersection of molecular profiling and clinical outcomes. The developed model will be available for researchers and practitioners to apply in different scenarios.

Final Report

Background:

In this project, we aimed to build a biologically inspired computational model, based on artificial neural networks, to predict clinical and molecular outcomes in cancer patients. The proposed model incorporates prior knowledge of curated biological pathways that facilitate different processes inside human cells. Each node of the proposed architecture represents a biological concept: a feature, gene, subpathway, or pathway. The connections between these nodes follow our understanding of the relationships between these biological entities. For our research, we identified a pathway database, the Reactome Pathway database (Fabregat et al., 2018), that can be used to build the internal architecture of the proposed neural network model. Based on a set of selected pathways from the Reactome database, we developed P-NET, a biologically informed sparse neural network architecture. We also devised a scoring algorithm to rank nodes based on their contribution to the network outcome.
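As a rough illustration of the sparse, pathway-constrained connectivity described above, the sketch below masks a dense layer so that each gene feeds only the pathways it belongs to. The gene and pathway names, the tanh activation, and the random weights are placeholders chosen for illustration; they are not Reactome content or the published P-NET configuration.

```python
import math
import random

# Hypothetical toy annotation (NOT Reactome): which genes belong to which pathway.
GENES = ["gene_a", "gene_b", "gene_c", "gene_d", "gene_e"]
PATHWAYS = {
    "pathway_1": ["gene_a", "gene_d"],
    "pathway_2": ["gene_b", "gene_c", "gene_e"],
}


def build_mask(genes, pathways):
    """Return a genes x pathways 0/1 matrix; 1 marks an allowed connection."""
    names = list(pathways)
    return [[1.0 if g in pathways[p] else 0.0 for p in names] for g in genes]


def masked_forward(x, weights, mask, bias):
    """One sparse layer: out_j = tanh(b_j + sum_i x_i * w_ij * m_ij)."""
    out = []
    for j in range(len(bias)):
        total = bias[j]
        for i, x_i in enumerate(x):
            # A zero mask entry removes the gene -> pathway connection entirely.
            total += x_i * weights[i][j] * mask[i][j]
        out.append(math.tanh(total))
    return out


rng = random.Random(0)
mask = build_mask(GENES, PATHWAYS)
# Assumed initialisation: small random weights, zero bias.
weights = [[rng.uniform(-1, 1) for _ in PATHWAYS] for _ in GENES]
bias = [0.0 for _ in PATHWAYS]

# Gene-level input for one hypothetical patient (one value per gene).
x = [1.0, 0.0, 0.0, 1.0, 0.0]
pathway_activations = masked_forward(x, weights, mask, bias)
```

Because masked weights never contribute, changing the input for a gene outside a pathway leaves that pathway's activation untouched, which is what makes each node traceable to a named biological entity.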

Development:

The developed network (P-NET) was trained and tested on more than 1000 patients (Armenia et al., 2018) to classify primary prostate samples versus metastatic samples based on the genomic profiles of the patients. The input to the network is the number of mutations per gene and the gene-level copy number for all of the measured genes. The trained model was tested on a separate subset of the same cohort (P1000), and we reported several performance metrics: Precision, Recall, Accuracy, F1, Area Under Curve (AUC), and Area Under Precision-Recall Curve (AUPRC). We compared the performance of our model to other known models, including fully connected neural networks, support vector machines, and decision trees; P-NET performed as well as or better than these models. The training and testing experiment was repeated multiple times using different training and testing samples to study the stability of the model.
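For readers less familiar with the metrics listed above, the following minimal sketch computes Precision, Recall, Accuracy, F1, and AUC for a binary primary-versus-metastatic classifier. The labels, scores, and the 0.5 decision threshold are made-up assumptions for illustration, not results from this project; AUPRC is omitted for brevity.

```python
def classification_metrics(y_true, y_score, threshold=0.5):
    """Threshold the scores and compute precision, recall, accuracy, and F1."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


def roc_auc(y_true, y_score):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = metastatic, 0 = primary) and model scores.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]

metrics = classification_metrics(y_true, y_score)
auc = roc_auc(y_true, y_score)
```

Unlike the other four metrics, AUC does not depend on the threshold, which is one reason it is commonly reported alongside them when comparing models.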

Results:

P-NET accurately identified patients with advanced cancer based on their genomic profiles. This shows that patients with localized and advanced disease exhibit distinct genomic characteristics that our computational model captures. Patients with high P-NET scores had greater rates of relapse, showing that P-NET may be useful in stratifying patients clinically and predicting biochemical recurrence (BCR) before it happens. By incorporating multiple types of genomic alterations in a systematic, unbiased way (addressing a major problem in cancer computational biology), P-NET recovered the known biology of metastatic castration-resistant prostate cancer (mCRPC) via AR, TP53, RB1, and PTEN disruption. By leveraging the fully interpretable layers of P-NET, we discovered focal MDM4 amplifications significantly enriched in mCRPC. We showed that this event promotes enzalutamide resistance, is therapeutically targetable in prostate cancer cells, and is a new actionable alteration in a genomically stratified subset of mCRPC patients.

We reported our development and results in a scientific manuscript that has been published in Nature under the title “Biologically informed deep neural network for prostate cancer discovery”.

The published paper has attracted considerable attention, with more than 51k accesses, 27 citations, and coverage in multiple news outlets. We are excited to see that the community is adopting our method and expanding it in different directions. The manuscript authors had the opportunity to share their work with different scientific communities through multiple invited talks nationally and internationally.

We deposited our developed model on GitHub and made the repository accessible to the public at https://github.com/marakeby/pnet_prostate_paper
A permanent version of the developed code is available at https://zenodo.org/record/5163855#.YXAqhEbMJcA

The data used in the project has been deposited in a Zenodo repository https://zenodo.org/record/5163213#.YXAq_EbMJcA

Challenges:

To fulfill the second Aim of the grant, we modified the P-NET model to work with melanoma patients. We identified a large cohort of molecularly characterized patients (Conway et al., 2021, Nature Genetics). We split the data into training, validation, and testing sets. We trained the developed P-NET using 838 whole-exome profiles to predict the molecular subtypes of the patients. The trained model was tested on a separate subset of the Conway dataset, and we reported several performance metrics: Precision, Recall, Accuracy, F1, AUC, and AUPRC. We compared the performance of our model to other known models, including fully connected neural networks, support vector machines, and decision trees. The training and testing experiments were repeated multiple times using different training and testing samples to study the stability of the model.
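The splitting and repetition procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the 80/10/10 fractions, the subtype labels and their counts, and the number of repeats are all hypothetical, since the report does not specify them.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split sample indices into train/validation/test, preserving class balance."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # a different seed gives a different partition
        n_train = int(round(fractions[0] * len(indices)))
        n_val = int(round(fractions[1] * len(indices)))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Hypothetical cohort: three made-up subtype labels with unequal sizes.
labels = ["subtype_1"] * 50 + ["subtype_2"] * 30 + ["subtype_3"] * 20

# Repeat the experiment with different random partitions to gauge stability.
splits = [stratified_split(labels, seed=s) for s in range(5)]
train_idx, val_idx, test_idx = splits[0]
```

Stratifying by subtype keeps rare classes represented in every partition, which matters more as the number of predicted classes grows.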

Because melanoma subtypes are more heterogeneous and the number of predicted classes is larger, the current P-NET implementation performs less well on melanoma than on prostate cancer. We believe that we can overcome this problem by collecting more samples for each subtype, an effort that requires resources and funds beyond the current ICI grant.

Conclusion:

We proposed and developed P-NET, a sparse biologically informed neural network that accurately identified patients with advanced cancer based on their genomic profiles. Patients with high P-NET scores had greater relapse rates, showing that our model can be used in early detection and patient stratification. Our model recovered known biology of mCRPC via AR, TP53, RB1, and PTEN disruption and computationally nominated focal MDM4 amplification as a potential target that we later experimentally validated. We believe that the work in this project lays the foundation for a new era of interpretable machine learning modeling in the cancer space. Such models can change the way we think about cancer progression and treatment response, and can empower researchers to make new data-driven discoveries that improve our understanding of cancer biology and the treatment of cancer patients. We will continue expanding our framework to overcome major challenges in the cancer space and will continue seeking funding through different mechanisms to support this development.

Learn More About Their Work:


  • The developed model has been deposited on GitHub, and the repository is publicly accessible at https://github.com/marakeby/pnet_prostate_paper
  • A permanent version of the developed code is available at https://zenodo.org/record/5163855#.YXAqhEbMJcA
  • The data used in the project has been deposited in a Zenodo repository at https://zenodo.org/record/5163213#.YXAq_EbMJcA