Integrate Pre-existing Knowledge in Biomedical Data Analysis with Graph Representation Learning

Albi, Giuseppe

Early Artificial Intelligence (AI) systems encoded domain-specific knowledge using knowledge representation techniques, such as expert systems. In contrast, modern AI paradigms like Deep Learning have thrived due to the abundance of data and computational resources, shifting toward a data-driven approach to decision-making. However, relying solely on this kind of approach can pose risks in certain fields, particularly in medicine and biomedical research, where expert knowledge plays a crucial role. Life sciences are inherently guided by curated, domain-specific expertise, which is empirically integrated with experimental data. A promising direction is offered by hybrid AI approaches that combine structured knowledge, such as medical understanding of specific phenotypes or known relationships among patient variables, with data-driven models. This integration can bridge the gap between structured knowledge representation and the flexibility of learning from data. The feasibility of hybrid methods is supported by the growing availability of structured biomedical knowledge, including knowledge bases (KBs), terminologies, and domain-specific bioinformatics repositories focused on areas such as precision medicine and pharmacology. Graph Representation Learning (GRL) has gained increasing attention in biology and medicine due to its ability to model complex relationships using graph structures. Graphs offer an efficient way to represent entities, their interconnections, and associated attributes as informative signals. GRL is especially valuable because it allows for the integration of heterogeneous data sources, such as real-world patient data or publicly available data repositories, with pre-existing biomedical knowledge. Knowledge Graphs (KGs), in particular, are effective in this context, as they harmonize diverse biomedical knowledge resources within a unified graph-based framework, facilitating comprehensive analysis in specialized domains.
The objective of this thesis is to propose AI frameworks that are based on GRL paradigms and integrate pre-existing knowledge, in order to address the analysis of biomedical data from the INTESTRAT-CAD project and from computational toxicology. The methodological background, reported in Chapter 2, starts by describing the graph notation and the formulation of Machine Learning (ML) tasks on graph structures. The chapter then introduces GRL methods, including traditional graph statistics, manifold learning, Topological Data Analysis (TDA) and graph embedding models learned with Neural Network (NN) architectures. The following chapters report three case studies, each addressing a specific biomedical task with a suitable GRL paradigm and each corresponding to one aim of the thesis: stratification and computational phenotyping (Aim 1), prediction of coronary artery stenosis for risk stratification (Aim 2), and toxicity prediction for small molecules (Aim 3). Specifically, Aim 1 and Aim 2 use real-world data from the INTESTRAT-CAD project, with a patient population belonging to the Epifania trial, while Aim 3 uses Tox21, a publicly available toxicology repository. In addition, Aim 1 takes as pre-existing knowledge the initial phenotypic medical definition assigned to the study population, while Aim 2 and Aim 3 share similar sources of biomedical knowledge, leveraging semantic relations from KGs. Chapter 3 describes Aim 1, which deals with the computational phenotyping of Coronary Artery Disease (CAD) patients on the basis of clinical data and of domain medical knowledge, defined as the initial CAD severity assigned to the patients when they were enrolled in the study. A cohort of 725 patients is used to create a dataset made up of clinical variables, such as demographics, medical history, laboratory exams and drug prescriptions, together with the initial CAD definition assigned by the clinicians.
The contribution is a TDA-based framework for semi-supervised computational phenotyping, in which the tuning of the Mapper hyperparameters is guided by the initial CAD label, and the new subgroups are characterized using the most discriminative features extracted from ML predictive models. Chapter 4 reports Aim 2, which deals with the development of a predictive risk model for CAD by combining omics and clinical variables with PrimeKG, a precision medicine-oriented KG. A different cohort from that of Aim 1 is considered, consisting of 723 patients, with a wider set of clinical variables and RNA-sequencing (RNA-seq) data. By first mapping the dataset features to PrimeKG nodes, and then using Knowledge Graph Embedding (KGE) models to learn representations of the KG entities, this study shows how to contextualize real-world data with pre-existing medical knowledge in order to create a new patient representation for coronary artery stenosis prediction. Chapter 5 reports Aim 3, whose objective is to augment a toxicity prediction model with semantic knowledge derived from ComptoxAI, a computational toxicology KG. Specifically, Graph Neural Networks (GNNs) are adopted to learn small-molecule representations by leveraging the molecules' 2-dimensional (2D) structure and their atom and edge attributes. This representation is then updated with the knowledge about the relations among chemicals, genes and Tox21 assays extracted from ComptoxAI, and all of these components are combined into a predictive pipeline for computational toxicology. Finally, Chapter 6 highlights the main motivations behind the thesis and discusses the major findings and possible future developments.

Integrate Pre-existing Knowledge in Biomedical Data Analysis with Graph Representation Learning / Giuseppe Albi, 2025 Nov 21. 37. ciclo