A challenging problem in cancer genetics is that tumors often consist of a few different groups of cells, so called clones, where each clone has different mutations, like copy-number (CN) variations. In whole genome sequencing the mutations of the different clones get mixed up, according to their relative unknown proportion in the tumor. However, CN's of single clones can only take values in a known finite set, denoted as the alphabet.

In this talk, we show how this structural information can solve the problem, which corresponds to blind source separation with finite alphabets.

First, we give a complete combinatorial characterization of identifiability, which lays the fundamentals of exact recovery theory in a completely new sparsity framework (Behr and Munk, 2017, IEEE Trans. Inf. Theory). In a statistical change-point regression model, as it appears, e.g., in CN applications, we introduce a multiscale approach and derive estimators with optimal convergence rates (up to log-factors) and uniform confidence statements for all quantities, including statistical error guarantees for the minimal “model dimension”, a task which is in general difficult to obtain (Behr et al., 2017, Ann. Stat., to appear). The estimator is computed efficiently using dynamic programming.

Finally, we give minimax results for the multivariate case, as it appears, for instance, in wireless digital communications. We outline how the combinatorial structure of finite alphabets arise challenging questions in computational statistics, such as potential optimality gaps.

This is joint work with Prof. Axel Munk (University of Göttingen) and Prof. Chris Holmes (Wellcome Trust Centre for Human Genetics, University of Oxford).