Ensembles of Trees and CLTs: Inference and Machine Learning: Neyman Seminar

Seminar | November 20 | 4-5 p.m. | 1011 Evans Hall

Giles Hooker, Cornell University

Department of Statistics

This talk develops methods of statistical inference based on ensembles of decision trees: bagging, random forests, and boosting. Recent results have shown that when the bootstrap procedure in bagging methods is replaced by sub-sampling, predictions from these methods can be analyzed using the theory of U-statistics, which have a limiting normal distribution. Moreover, the limiting variance can be estimated within the sub-sampling structure.
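As a concrete illustration of the idea, the sketch below builds a small ensemble on subsamples drawn without replacement and estimates the variance of its prediction from the sub-sampling structure itself. This is a minimal sketch rather than the speaker's implementation: it assumes scikit-learn's DecisionTreeRegressor, the function name and parameters are illustrative, and the simplified approximation Var ≈ (k²/n)·ζ₁ + ζ_k/B for an ensemble of B trees on subsamples of size k follows the U-statistic theory mentioned above.

```python
# Minimal sketch (not the speaker's code): a subsampled tree ensemble with
# an internal variance estimate derived from the sub-sampling structure.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def subsampled_forest_ci(X, y, x_test, k=50, n_common=25, n_mc=100, rng=None):
    """Predict at x_test and estimate the variance of that prediction.

    Trees are built on subsamples of size k drawn WITHOUT replacement, so the
    ensemble behaves as an (incomplete) U-statistic. Variance is approximated
    by Var ~ (k^2 / n) * zeta_1 + zeta_k / B, where zeta_1 is estimated by
    holding one "common" training point fixed across a group of subsamples.
    """
    rng = np.random.default_rng(rng)
    n = len(y)
    group_means, all_preds = [], []
    for _ in range(n_common):
        i = rng.integers(n)                     # shared first element of the group
        preds = []
        for _ in range(n_mc):
            idx = rng.choice(np.delete(np.arange(n), i), size=k - 1, replace=False)
            idx = np.append(idx, i)             # subsample always contains point i
            tree = DecisionTreeRegressor().fit(X[idx], y[idx])
            preds.append(tree.predict(x_test.reshape(1, -1))[0])
        group_means.append(np.mean(preds))
        all_preds.extend(preds)
    zeta_1 = np.var(group_means, ddof=1)        # between-group variance component
    zeta_k = np.var(all_preds, ddof=1)          # variance of individual trees
    B = n_common * n_mc
    var_hat = (k ** 2 / n) * zeta_1 + zeta_k / B
    return np.mean(all_preds), var_hat
```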

Using this result, we can compare the predictions of a model learned with a feature of interest to those of a model learned without it, and ask whether the differences between them could have arisen by chance. By evaluating the model at a structured set of points, we can also ask whether it differs significantly from an additive model. We demonstrate these results in an application to citizen-science data collected by Cornell's Laboratory of Ornithology.
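A hedged sketch of that comparison: fit one ensemble with the feature of interest and one without it, then ask whether the difference in their predictions at a test point exceeds what chance would allow. It reuses the hypothetical subsampled_forest_ci helper from the previous sketch and, for simplicity, treats the two variance estimates as independent; building both models on shared subsamples would instead let the difference itself be analyzed as a U-statistic.

```python
# Hedged sketch of the feature-significance comparison described above.
# The helper subsampled_forest_ci is the hypothetical one sketched earlier.
import numpy as np
from scipy import stats

def feature_significance(X, y, x_test, feature, **kwargs):
    # Ensemble trained with the feature of interest included
    mu_full, var_full = subsampled_forest_ci(X, y, x_test, **kwargs)
    # Ensemble trained with that feature removed
    X_red = np.delete(X, feature, axis=1)
    x_red = np.delete(x_test, feature)
    mu_red, var_red = subsampled_forest_ci(X_red, y, x_red, **kwargs)
    # Under H0 (the feature is irrelevant) the difference is asymptotically
    # normal; independence of the two estimates is a simplifying assumption.
    z = (mu_full - mu_red) / np.sqrt(var_full + var_red)
    return 2 * stats.norm.sf(abs(z))            # two-sided p-value
```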

Time permitting, we will examine recent developments that extend these distributional results to boosting-type estimators. Boosting allows trees to be incorporated into more structured regression models, such as additive or varying-coefficient models, and often outperforms bagging by reducing bias.

Bio: Giles Hooker’s research concentrates on machine learning methods, particularly their intelligibility, along with the interface of statistics and differential equation models, robust statistics, and functional data analysis, with a particular emphasis on applications in ecology. He received a doctorate in Statistics from Stanford University and undertook postdoctoral studies with Jim Ramsay. He joined Cornell in 2006, where he is Associate Professor of Statistics and Data Science and of Computational Biology, and Director of Undergraduate Studies in Biometry and Statistics.