Development of Automatic Risk Evaluation Systems for Acute Myeloid Leukemia from Gene Expression Data

Léonard Sauvé

In America, thousands of adults will be diagnosed with acute myeloid leukemia (AML), of whom only an estimated 28% will survive five years. This white blood cell cancer is genetically heterogeneous, and current therapies are often ineffective against certain adverse subtypes of the disease. Our computational approaches aim to refine the AML subtypes and thereby lead to targeted therapies and better treatment. For these analyses, we investigate the Leucegene dataset, which comprises hundreds of gene expression samples from AML cancer cells. These data hold high potential for developing machine learning (ML) systems that identify risk groups automatically and directly from gene expression data. Such systems could be integrated into existing clinical processes and assist clinicians in selecting appropriate treatment protocols for AML patients. Our bioinformatics and ML approaches seek to address the open challenges that stand in the way of such advances, notably the capacity of these systems to handle large numbers of input variables, large sample sizes, and censored survival information. Our research also aims to robustly correct platform and handler artifacts, known as batch effects, that are introduced when multi-platform experimental data are integrated. Currently, a pipeline combining a Principal Component Analysis with a Cox proportional hazards (Cox-PH) regression can be used to compare survival modelling based on gene expression against survival modelling based on commonly used clinical features, and it shows that gene expression can indeed improve the accuracy of these systems. Investigations involving more complex ML approaches using deep neural networks are underway to further improve our AML risk score evaluation pipeline.
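
As an illustration of the kind of pipeline described above, the following minimal sketch projects an expression matrix onto a few principal components and fits a Cox-PH model on them. Everything in it is an assumption made for illustration: the data are synthetic (not Leucegene), the function names are hypothetical, the Cox model is fitted by plain gradient ascent on the partial likelihood with NumPy and assumes no tied event times, and the concordance index is used as one common accuracy measure for censored data, which the text above does not specify.

```python
import numpy as np

def pca_project(X, n_components):
    """Center X and project it onto its top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def fit_cox(Z, time, event, lr=0.1, n_iter=500):
    """Fit Cox-PH coefficients by gradient ascent on the partial log-likelihood.

    Assumes no tied event times. Sorting by decreasing time makes the risk set
    of sample i (everyone still at risk at time_i) equal to rows 0..i.
    """
    order = np.argsort(-time)
    Z, event = Z[order], event[order]
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        eta = Z @ beta
        w = np.exp(eta - eta.max())                 # stabilized hazard weights
        cum_w = np.cumsum(w)                        # sum of weights over risk set
        cum_wz = np.cumsum(w[:, None] * Z, axis=0)  # weighted covariate sums
        # Gradient: for each observed event, covariates minus risk-set average.
        grad = (Z - cum_wz / cum_w[:, None])[event].sum(axis=0)
        beta += lr * grad / event.sum()
    return beta

def concordance_index(risk, time, event):
    """Fraction of comparable pairs in which the higher risk fails first."""
    comparable = (time[:, None] < time[None, :]) & event[:, None]
    concordant = (risk[:, None] > risk[None, :]) & comparable
    tied = (risk[:, None] == risk[None, :]) & comparable
    return (concordant.sum() + 0.5 * tied.sum()) / comparable.sum()

# Synthetic stand-in for expression data: one latent factor drives both
# the "gene expression" and the hazard (illustrative assumption only).
rng = np.random.default_rng(0)
n_samples, n_genes = 200, 50
latent = rng.normal(size=n_samples)
X = latent[:, None] * rng.normal(size=n_genes) + rng.normal(size=(n_samples, n_genes))
true_time = rng.exponential(1.0 / np.exp(1.5 * latent))
censor_time = rng.exponential(2.0, size=n_samples)
time = np.minimum(true_time, censor_time)
event = true_time <= censor_time                    # False means censored

Z = pca_project(X, n_components=3)
Z = Z / Z.std(axis=0)                               # standardize components
beta = fit_cox(Z, time, event)
c_index = concordance_index(Z @ beta, time, event)
print(f"in-sample concordance index: {c_index:.3f}")
```

The score here is computed on the same samples used for fitting, so it is optimistic; a real comparison against clinical features would rely on held-out data or cross-validation and, most likely, on an established survival analysis library rather than this hand-written fit.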