Statistics
http://events.berkeley.edu/index.php/calendar/sn/stat.html
Upcoming Events

Seminar 217, Risk Management: Predicting Portfolio Return Volatility at Median Horizons, Oct 2
http://events.berkeley.edu/index.php/calendar/sn/stat.html?event_ID=118742&date=2018-10-02
Commercially available factor models provide good predictions of short-horizon (e.g. one day or one week) portfolio volatility, based on estimated portfolio factor loadings and responsive estimates of factor volatility. These predictions are of significant value to certain short-term investors, such as hedge funds. However, they provide limited guidance to long-term investors, such as Defined Benefit pension plans, individual owners of Defined Contribution pension plans, and insurance companies. Because return volatility is variable and mean-reverting, the square root rule for extrapolating short-term volatility predictions to medium-horizon (one year to five years) risk predictions systematically overstates (understates) medium-horizon risk when short-term volatility is high (low). In this paper, we propose a computationally feasible method for extrapolating to medium-horizon risk predictions in one-factor models that substantially outperforms the square root rule.
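For reference, the square root rule criticized above scales a one-period volatility estimate to a T-period horizon as follows (a sketch of the standard rule in my own notation, not the speaker's formulation):

```latex
% Square root rule: scale one-period volatility to a T-period horizon.
% Exact only under IID returns; with mean-reverting volatility it is biased.
\[
  \hat{\sigma}_{T} \;=\; \hat{\sigma}_{1}\,\sqrt{T}
\]
% When the current estimate \hat{\sigma}_{1} sits above its long-run mean,
% this extrapolation overstates T-period risk; below it, it understates.
```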
The challenge of big data and data science for the social sciences, Oct 2
http://events.berkeley.edu/index.php/calendar/sn/stat.html?event_ID=120009&date=2018-10-02
The 2005 National Science Foundation workshop report on "Cyberinfrastructure for the Social and Behavioral Sciences" (Fran Berman and Henry Brady) argued that the methods of doing research in the social sciences would be transformed by big data and data science and that the social sciences should be centrally involved in studying the impacts of big data and data science on society. In "The Challenge of Big Data and Data Science," just completed for the Annual Review of Political Science, I have brought these arguments up to date. I will talk about defining "big data" and "data science," about the new kinds of research being done in the social sciences over the past decade that use big data and data science methods, and about the impacts of the information revolution on warfare, cities, the media, health care, and jobs and the ways that the social sciences must come to grips with them.
The Berkeley Distinguished Lectures in Data Science, co-hosted by the Berkeley Institute for Data Science (BIDS) and the Berkeley Division of Data Sciences, features Berkeley faculty doing visionary research that illustrates the character of the ongoing data revolution. This lecture series is offered to engage our diverse campus community and enrich active connections among colleagues. All campus community members are welcome and encouraged to attend. Arrive at 3:30 PM for light refreshments and discussion prior to the formal presentation.

Concentration from Geometry in High Dimension, Oct 3
http://events.berkeley.edu/index.php/calendar/sn/stat.html?event_ID=120103&date=2018-10-03
The concentration of Lipschitz functions around their expectation is a classical topic that continues to be very active. We will discuss some recent progress, including:
1. A tight log-Sobolev inequality for isotropic logconcave densities
2. A unified and improved large deviation inequality for convex bodies
3. An extension of the above to Lipschitz functions (generalizing the Euclidean squared distance)
The main technique of proof is a simple iteration (equivalently, a martingale process) that gradually transforms any density into one with a Gaussian factor, for which isoperimetric inequalities are considerably easier to establish. The talk is joint work with Yin Tat Lee (UW) and will involve some elementary calculus.

Statistical challenges in casualty estimation, Oct 3
http://events.berkeley.edu/index.php/calendar/sn/stat.html?event_ID=120434&date=2018-10-03
An accurate understanding of the magnitude and dynamics of casualties during a conflict is important for a variety of reasons, including historical memory, retrospective policy analysis, and assigning culpability for human rights violations. However, during times of conflict and their aftermath, collecting a complete or representative sample of casualties can be difficult if not impossible. One solution is to apply population estimation methods -- sometimes called capture-recapture or multiple systems estimation -- to multiple incomplete lists of casualties to estimate the number of deaths not recorded on any of the lists. In this talk, I give an introduction to the procedures by which population estimation is performed in the context of conflict mortality, which mainly consists of a record linkage step followed by capture-recapture estimation. I then describe some of my recent work in this area, which is directed at elucidating the limitations of these statistical methods and proposing variants with better properties. I will conclude with a discussion of open questions in this challenging area of applied statistics.
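As a minimal illustration of the capture-recapture step, here is the classical two-list Lincoln-Petersen estimator (a textbook sketch, not the speaker's method; the counts are hypothetical):

```python
def lincoln_petersen(n1, n2, m):
    """Classical two-list capture-recapture estimate of total population size.

    n1: number of deaths recorded on list 1
    n2: number of deaths recorded on list 2
    m:  number matched to both lists (after record linkage)
    """
    if m == 0:
        raise ValueError("no overlap between lists: estimate is undefined")
    return n1 * n2 / m

# Hypothetical counts: 200 deaths on list 1, 150 on list 2, 50 matched.
total = lincoln_petersen(200, 150, 50)   # estimated total deaths: 600.0
# Deaths appearing on neither list: estimated total minus the observed union.
undocumented = total - (200 + 150 - 50)  # 300.0
```

Multiple systems estimation with three or more lists relaxes the estimator's strong independence assumption, which is one source of the limitations the talk discusses.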
Center for Computational Biology Seminar, Oct 3
http://events.berkeley.edu/index.php/calendar/sn/stat.html?event_ID=120220&date=2018-10-03
Title: Making sense of the “noise” in cancer data

During carcinogenesis, cells accumulate thousands of somatic DNA mutations. “Driver” mutations bestow fitness advantages that lead to selective sweeps, increasing the frequency of mutated cells compared to those lacking the driver. These sweeps also increase the frequency of “passenger” mutations accumulated since the last such sweep. These mutations have little impact on cell function but provide information about the mutational processes that generated them. Both their type (e.g., A to C) and their genomic locations depend not only on what caused the mutation (e.g., UV light) but also on the chromatin state of the cell that acquired it. My lab developed Bayesian inference methods to classify somatic mutations into different ‘subclones’ that correspond to different sweeps. Our methods also use phylogenetic approaches to determine the relative order in which the sweeps occurred. We are now developing supervised and unsupervised learning methods to interpret this historical record of the cancer, in order to use the timing and patterns of somatic mutations to reconstruct the changes that a normal cell underwent during its transformation into a cancerous cell.

Seminar 217, Risk Management: Robust Learning: Information Theory and Algorithms, Oct 9
http://events.berkeley.edu/index.php/calendar/sn/stat.html?event_ID=118749&date=2018-10-09
This talk will provide an overview of recent results in high-dimensional robust estimation. The key question is the following: given a dataset, some fraction of which consists of arbitrary outliers, what can be learned about the non-outlying points? This is a classical question going back at least to Tukey (1960). However, this question has recently received renewed interest for a combination of reasons. First, many of the older results do not give meaningful error bounds in high dimensions (for instance, the error often includes an implicit sqrt(d)-factor in d dimensions). Second, recent connections have been established between robust estimation and other problems such as clustering and learning of stochastic block models. Currently, the best known results for clustering mixtures of Gaussians are via these robust estimation techniques. Finally, high-dimensional biological datasets with structured outliers such as batch effects, together with security concerns for machine learning systems, motivate the study of robustness to worst-case outliers from an applied direction.

The talk will cover both information-theoretic and algorithmic techniques in robust estimation, aiming to give an accessible introduction. We will start by reviewing the 1-dimensional case, and show that many natural estimators break down in higher dimensions. Then we will give a simple argument that robust estimation is information-theoretically possible. Finally, we will show that under stronger assumptions we can perform robust estimation efficiently, via a "dual coupling" inequality that is reminiscent of matrix concentration inequalities.
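The 1-dimensional robustness contrast can be seen in a toy sketch (illustrative numbers of my own, not from the talk): two arbitrary outliers drag the sample mean arbitrarily far from the truth, while the median barely moves.

```python
import statistics

# Seven clean observations centered at 0, then two adversarial outliers.
clean = [-0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3]
corrupted = clean + [1000.0, -5000.0]

mean_est = statistics.mean(corrupted)      # dragged to roughly -444
median_est = statistics.median(corrupted)  # still 0.0
```

The median tolerates any minority of arbitrary outliers in one dimension; the difficulty the talk addresses is that naive coordinate-wise analogues lose a dimension-dependent factor in high dimensions.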
Letters of recommendation in Berkeley undergraduate admissions: Program evaluation and natural language processing, Oct 9
http://events.berkeley.edu/index.php/calendar/sn/stat.html?event_ID=120051&date=2018-10-09
In Fall 2015 and 2016, UC Berkeley asked many freshman applicants to submit letters of recommendation as part of their applications. This was highly controversial. Proponents argued that letters would aid in the identification of disadvantaged students who had overcome obstacles that were not otherwise apparent from their applications, while opponents argued that disadvantaged students were unlikely to have access to adults who could write strong letters. I oversaw an experiment in the 2016-17 admissions cycle in which applications were scored with and without their letters. Initial analysis of the experiment indicated that when available the letters modestly improved the reader scores of students from underrepresented groups, and that few otherwise admissible students failed to submit letters when asked. I will also present results of a textual analysis of the letters themselves, using natural language processing to measure differences in the letters that underrepresented students receive compared to otherwise similarly qualified students not from underrepresented groups.

Large deviations of subgraph counts for sparse Erdős–Rényi graphs, Oct 10
http://events.berkeley.edu/index.php/calendar/sn/stat.html?event_ID=120664&date=2018-10-10
For each fixed integer $\ell\ge 3$ we establish the leading order of the exponential rate function for the probability that the number of cycles of length $\ell$ in the Erd\H{o}s--R\'enyi graph $G(N,p)$ exceeds its expectation by a constant factor, assuming $N^{-1/2}\ll p\ll 1$ (up to log corrections) when $\ell\ge 4$, and $N^{-1/3}\ll p\ll 1$ in the case of triangles. We additionally obtain the upper tail for general subgraph counts, as well as the lower tail for counts of seminorming graphs, in narrower ranges of sparsity. As in other recent works on the emerging theory of nonlinear large deviations, our general approach applies to functions on product spaces which are of ``low complexity'', though the notion of complexity used here is somewhat different. Based on joint work with Amir Dembo.

To persist or not to persist?, Oct 10
http://events.berkeley.edu/index.php/calendar/sn/stat.html?event_ID=120151&date=2018-10-10
Two long-standing, fundamental questions in biology are "Under what conditions do populations persist or go extinct? When do interacting species coexist?" The answers to these questions are essential for guiding conservation efforts and identifying mechanisms that maintain biodiversity. Mathematical models play an important role in identifying these mechanisms and, when coupled with empirical work, can determine whether or not a given mechanism is operating in a specific population or community. For over a century, nonlinear difference and differential equations have been used to identify these mechanisms. These models, however, fail to account for stochastic fluctuations in environmental conditions such as temperature and precipitation. In this talk, I present theorems about persistence, coexistence, and extinction for stochastic difference equations that account for species interactions, population structure, and environmental fluctuations. The theorems will be illustrated with models of Bay checkerspot butterflies, spatially structured acorn woodpecker populations, competition among Kansas prairie grass species, and the evolutionary game of rock, paper, and scissors. This work is in collaboration with Michel Benaïm.
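To give a concrete flavor of the class of models involved, here is a generic stochastic Ricker difference equation with IID environmental shocks (a sketch of my own choosing, not one of the speaker's case studies or theorems):

```python
import math
import random

def ricker_path(r, sigma, x0, steps, seed=1):
    """Stochastic Ricker map X_{t+1} = X_t * exp(r - X_t + sigma * E_t),
    where the E_t are IID standard normal environmental shocks."""
    rng = random.Random(seed)
    x, path = x0, [x0]
    for _ in range(steps):
        x = x * math.exp(r - x + sigma * rng.gauss(0.0, 1.0))
        path.append(x)
    return path

# With a positive mean low-density growth rate (r > 0), the population
# fluctuates around a positive level rather than declining toward zero.
path = ricker_path(r=1.0, sigma=0.3, x0=0.1, steps=500)
```

Persistence theorems of the kind presented in the talk make statements about the long-run behavior of such random paths (e.g., whether the time-average distribution concentrates away from extinction at 0).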
Seminar 217, Risk Management: Asymptotic Spectral Analysis of Markov Chains with Rare Transitions: A Graph-Algorithmic Approach, Oct 16
http://events.berkeley.edu/index.php/calendar/sn/stat.html?event_ID=118748&date=2018-10-16
Parameter-dependent Markov chains with exponentially small transition rates arise in modeling complex systems in physics, chemistry, and biology. Such processes often manifest metastability, and the spectral properties of the generators largely govern their long-term dynamics. In this work, we propose a constructive graph-algorithmic approach to computing the asymptotic estimates of eigenvalues and eigenvectors of the generator. In particular, we introduce the concepts of the hierarchy of Typical Transition Graphs and the associated sequence of Characteristic Timescales. Typical Transition Graphs can be viewed as a unification of Wentzell’s hierarchy of optimal W-graphs and Freidlin’s hierarchy of Markov chains, and they are capable of describing typical escapes from metastable classes as well as cyclic behaviors within metastable classes, for both reversible and irreversible processes. We apply the proposed approach to conduct zero-temperature asymptotic analysis of the stochastic network representing the energy landscape of the Lennard-Jones cluster of 75 atoms.

The Lovász theta function for random regular graphs, Oct 17
http://events.berkeley.edu/index.php/calendar/sn/stat.html?event_ID=120791&date=2018-10-17
The Lovász theta function is a classic semidefinite relaxation of graph coloring. In this talk I'll discuss the power of this relaxation for refuting colorability of uniformly random degree-regular graphs, as well as for distinguishing this distribution from one with a `planted' disassortative community structure. We will see that the behavior of this refutation scheme is consistent with the conjecture that coloring and community detection exhibit a `computationally hard but information-theoretically feasible' regime typical of random constraint satisfaction and statistical inference problems. The proofs will make use of orthogonal polynomials, nonbacktracking walks, and results on the spectra of random regular graphs.

Learning in Google Ads, Machines and People, Oct 17
http://events.berkeley.edu/index.php/calendar/sn/stat.html?event_ID=120789&date=2018-10-17
This talk is in two parts, both of which discuss interesting uses of experiments in Google search ads. In part 1, I discuss how we can inject randomness into our system to get causal inference in a machine learning setting. In part 2, I talk about experiment designs to measure how users learn in response to ads on Google.com.

4th Annual CDAR Symposium 2018, Oct 19
http://events.berkeley.edu/index.php/calendar/sn/stat.html?event_ID=119719&date=2018-10-19
The fourth annual CDAR Symposium, presented in partnership with State Street, will convene on Friday, October 19, 2018, from 8:30 am to 6:30 pm at UC Berkeley’s Memorial Stadium. Our conference will feature new developments in data science, highlighting applications to finance and risk management. Confirmed speakers include Jeff Bohn, Olivier Ledoit, Ulrike Malmendier, Steven Kou, Ezra Nahum, Roy Henriksson, and Ken Kroner.

The Consortium for Data Analytics in Risk (CDAR) supports research into innovation in data science and its applications to portfolio management and investment risk. Based in the Economics and Statistics Departments at UC Berkeley, CDAR was co-founded with State Street, Stanford, the Berkeley Institute for Data Science (BIDS), and the Southwestern University of Finance and Economics (SWUFE). This year, CDAR welcomes a new founding member, Swiss Re, based in Switzerland, and a new industry partner, AXA Rosenberg. CDAR organizes conferences, workshops, and research programs, bringing together academic researchers from the physical and social sciences, and industry researchers from financial management firms and technology development companies large and small.

Seminar 217, Risk Management: Proliferation of Anomalies and Zoo of Factors – What does the Hansen–Jagannathan Distance Tell Us?, Oct 23
http://events.berkeley.edu/index.php/calendar/sn/stat.html?event_ID=118743&date=2018-10-23
Recent research finds that prominent asset pricing models have mixed success in evaluating the cross-section of anomalies, highlighting the proliferation of anomalies and the zoo of factors. In this paper, I investigate the relative pricing performance of these models in explaining anomalies by comparing their misspecification errors, as measured by the Hansen–Jagannathan (HJ) distance. Using multiple-comparison inference based on the HJ distance, I find that a traded-factor model dominates the others for a given anomaly. However, in contrast to the recent work of Barillas and Shanken (2017) and Barillas, Kan, Robotti, and Shanken (2018), I find that the HJ distance is a general statistical measure for comparing models, and that some model-derived non-traded factors even outperform traded factors. Second, there is large variation in the shape and curvature of the confidence sets of the anomalies, which makes it difficult for any single SDF to satisfy all of them. My results imply that further work is required not only in pruning the number of priced factors but also in building models that explain the anomalies better.

Optimal robot action for and around people, Oct 23
http://events.berkeley.edu/index.php/calendar/sn/stat.html?event_ID=120052&date=2018-10-23
Estimation, planning, control, and learning are giving us robots that can generate good behavior given a specified objective and set of constraints. What I care about is how humans enter this behavior-generation picture, and I study two complementary challenges: 1) how to optimize behavior when the robot is not acting in isolation, but needs to coordinate or collaborate with people; and 2) what to optimize in order to get the behavior we want. My work has traditionally focused on the former, but more recently I have been casting the latter as a human-robot collaboration problem as well (where the human is the end-user, or even the robotics engineer building the system). Treating it as such has enabled us to use robot actions to gain information; to account for human pedagogic behavior; and to exchange information between the human and the robot via a plethora of communication channels, from external forces that the person physically applies to the robot, to comparison queries, to defining a proxy objective function.

Constructing (2+1)-dimensional KPZ evolutions, Oct 24
http://events.berkeley.edu/index.php/calendar/sn/stat.html?event_ID=120826&date=2018-10-24
The (d+1)-dimensional KPZ equation
\[
\partial_t h = \nu \Delta h + \frac{\lambda}{2}|\nabla h|^2 + \sqrt{D}\dot{W},
\]
in which $\dot{W}$ is a space--time white noise, is a natural model for the growth of $d$-dimensional random surfaces. These surfaces are extremely rough due to the white noise forcing, which leads to difficulties in interpreting the nonlinear term in the equation. In particular, it is necessary to renormalize the mollified equations to achieve a limit as the mollification is turned off. The $d = 1$ case has been understood very deeply in recent years, and progress has been made in $d \ge 3$, but little is known in $d = 2$. I will describe recent joint work with Sourav Chatterjee showing the tightness of a family of Cole--Hopf solutions to (2+1)-dimensional mollified and renormalized KPZ equations. This implies the existence of subsequential limits, which we furthermore can show do not coincide with solutions to the linearized equation, despite the fact that our renormalization scheme involves a logarithmic attenuation of the nonlinearity as the mollification scale is taken to zero.

Safe Learning in Robotics, Oct 24
http://events.berkeley.edu/index.php/calendar/sn/stat.html?event_ID=120922&date=2018-10-24
A great deal of research in recent years has focused on robot learning. In many applications, guarantees that specifications are satisfied throughout the learning process are paramount. For the safety specification, we present a controller synthesis technique based on the computation of reachable sets, using optimal control and game theory. In the first part of the talk, we will review these methods and their application to collision avoidance and avionics design in air traffic management systems, and networks of unmanned aerial vehicles. In the second part, we will present a toolbox of methods combining reachability with data-driven techniques inspired by machine learning, to enable performance improvement while maintaining safety. We will illustrate these “safe learning” methods on a quadrotor UAV experimental platform which we have at Berkeley, including demonstrations of motion planning around people.

Rigidity and tolerance for perturbed lattices, Oct 31
http://events.berkeley.edu/index.php/calendar/sn/stat.html?event_ID=120793&date=2018-10-31
Consider a perturbed lattice {v+Y_v} obtained by adding IID d-dimensional Gaussian variables {Y_v} to the lattice points in Z^d. Suppose that one point, say Y_0, is removed from this perturbed lattice; is it possible for an observer, who sees just the remaining points, to detect that a point is missing? In one and two dimensions, the answer is positive: the two point processes (before and after Y_0 is removed) can be distinguished using smooth statistics, analogously to work of Sodin and Tsirelson (2004) on zeros of Gaussian analytic functions (cf. Holroyd and Soo (2011)). The situation in higher dimensions is more delicate; our solution depends on a game-theoretic idea, in one direction, and on the unpredictable paths constructed by Benjamini, Pemantle and the speaker (1998), in the other. (Joint work with Allan Sly.)