Seminar | January 31 | 4-5 p.m. | 1011 Evans Hall
Professor Katie Pollard, Department of Epidemiology and Biostatistics, UC San Francisco; Gladstone Institutes; Chan Zuckerberg Biohub
Machine learning is a popular statistical approach in many fields, including genomics. We and others have used a variety of supervised machine-learning techniques to predict genes, regulatory elements, 3D interactions between regulatory elements and their target genes, and the effects of mutations on regulatory element function. I will highlight a few of these studies, emphasizing the strengths and weaknesses of different predictive models and the biological insights gained via variable importance analysis. I will then discuss some of our recent work exploring the limitations of popular machine-learning methods in genomics, where the biology underlying the training data frequently violates one or both parts of the independent and identically distributed (IID) assumption. The talk will conclude with some thoughts on modeling non-IID data and interpreting overfit models. The aim is to improve the application of supervised learning to biological data and to emphasize the mechanistic insights gained from modeling, rather than performance statistics per se.