The goal of statistical learning theory is to provide a solid theoretical analysis of the behavior of machine learning algorithms. Under the assumption that the data has been sampled from some underlying but unknown ground truth, we want to assess whether the results achieved by machine learning algorithms are trustworthy, whether the algorithms are well-behaved or erratic, and what their complexity is in terms of the data or computation time required.
Some branches of statistical learning theory are well-studied and "more or less solved," while others are only beginning to be investigated. We would like to highlight the following two areas:
Interactive and interpretable machine learning. Here we ask how a fruitful interaction between machine learning algorithms and human users can be achieved. This question is of rising importance: machine learning systems are becoming ever more complex and involved, which makes it hard to judge the meaning, implications, and trustworthiness of a machine's inference result. At the same time, machine learning systems are starting to have a serious impact on everyday life, so being able to control their results becomes ever more important. The question of interactive and interpretable machine learning clearly has human-computer interaction aspects, but it also raises many algorithmic and theoretical issues.
Up to now, our group has focused on one particular aspect, namely the question of how an algorithm's input can be provided by humans in a more natural way. We consider a setting where the input to a machine learning algorithm is given not in terms of similarity values (``On a scale from 0 to 1, the similarity between image A and image B is 0.8''), but rather in terms of distance comparisons (``Image A is more similar to image B than to image C''). Many studies in psychology provide evidence that such qualitative comparisons are much easier for human users to provide than quantitative similarity scores. From a theoretical point of view, however, this approach raises many questions: How can machine learning be performed on such qualitative input data, how do the approaches and results differ from those in the standard setting, and what kind of statistical guarantees can we give? We address these questions in our project on comparison-based learning below.
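To make the comparison-based setting concrete, the following self-contained sketch learns a planar embedding purely from triplet comparisons of the form ``a is more similar to b than to c'', using stochastic gradient descent on a hinge-style triplet loss. This is our own minimal illustration of ordinal embedding, not the group's actual method; all names, parameters, and the choice of loss are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth configuration: 30 points in the plane.
truth = rng.normal(size=(30, 2))
n = len(truth)

def sample_triplets(points, m, rng):
    """Sample m triplets (a, b, c) such that a is closer to b than to c."""
    triplets = []
    while len(triplets) < m:
        a, b, c = rng.choice(len(points), size=3, replace=False)
        d_ab = np.sum((points[a] - points[b]) ** 2)
        d_ac = np.sum((points[a] - points[c]) ** 2)
        triplets.append((a, b, c) if d_ab < d_ac else (a, c, b))
    return triplets

triplets = sample_triplets(truth, 2000, rng)

# Ordinal embedding: SGD on the hinge loss
#   max(0, margin + ||X_a - X_b||^2 - ||X_a - X_c||^2).
X = rng.normal(scale=0.1, size=(n, 2))
margin, lr = 0.1, 0.01
for epoch in range(50):
    for a, b, c in triplets:
        d_ab = np.sum((X[a] - X[b]) ** 2)
        d_ac = np.sum((X[a] - X[c]) ** 2)
        if d_ab + margin > d_ac:        # triplet violated or within the margin
            X[a] -= lr * 2 * (X[c] - X[b])   # pull a toward b, away from c
            X[b] += lr * 2 * (X[a] - X[b])   # pull b toward a
            X[c] -= lr * 2 * (X[a] - X[c])   # push c away from a

def triplet_accuracy(points, triplets):
    """Fraction of triplets whose ordering the embedding reproduces."""
    ok = sum(np.sum((points[a] - points[b]) ** 2)
             < np.sum((points[a] - points[c]) ** 2)
             for a, b, c in triplets)
    return ok / len(triplets)

print(f"triplets satisfied by learned embedding: "
      f"{triplet_accuracy(X, triplets):.2f}")
```

Note that the embedding is recovered only up to similarity transformations (rotation, translation, scaling), since triplet comparisons carry no absolute scale; this is one of the points where guarantees differ from the standard quantitative setting.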
Machine learning for scientific environments. While machine learning methods have been used for more than a decade in some areas of science, for example in bioinformatics or the neurosciences, we currently observe a rising trend to use them in many diverse scientific areas, ranging from the social sciences over physics to the geosciences. When machine learning methods are used in scientific contexts, it is of the highest importance to have reliable statistical guarantees.
A prime example is the area of network science. For more than a decade, the community has focused on exploratory work, investigating the properties of particular networks (social networks, brain networks, protein networks, transport networks, etc.). Confirmatory statistical analysis, however, is rare, even though it is of utmost importance: due to the inherent randomness in networks, it is not obvious how to distinguish ``random artifacts'' from ``true structure''. Hence, in one of our projects below we focus on constructing two-sample tests for populations of networks. For example, given the brain networks of some Alzheimer patients and the brain networks of a control group, how can we infer in a statistically sound way whether the two sets of brain networks behave similarly or not?
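The simplest instance of such a two-sample test reduces each network to a scalar summary statistic and applies a permutation test to the two samples of statistics. The sketch below is a generic illustration under that reduction, not the group's actual construction; the populations, the edge-density statistic, and all parameters are hypothetical (Erdos-Renyi graphs stand in for patient and control brain networks).

```python
import numpy as np

rng = np.random.default_rng(1)

def edge_density(adj):
    """Fraction of present edges in an undirected adjacency matrix."""
    n = adj.shape[0]
    return adj[np.triu_indices(n, k=1)].mean()

def permutation_test(stats_a, stats_b, n_perm=5000, rng=None):
    """Two-sample permutation test on a scalar network statistic.

    Tests the null hypothesis that both samples come from the same
    distribution, using the absolute difference of sample means."""
    if rng is None:
        rng = np.random.default_rng()
    pooled = np.concatenate([stats_a, stats_b])
    n_a = len(stats_a)
    observed = abs(stats_a.mean() - stats_b.mean())
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)       # reassign group labels at random
        if abs(perm[:n_a].mean() - perm[n_a:].mean()) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)        # add-one-smoothed p-value

def sample_graphs(n_graphs, n_nodes, p, rng):
    """Sample undirected Erdos-Renyi graphs with edge probability p."""
    graphs = []
    for _ in range(n_graphs):
        upper = np.triu(rng.random((n_nodes, n_nodes)) < p, k=1)
        graphs.append((upper + upper.T).astype(float))
    return graphs

# Two hypothetical populations with genuinely different edge probabilities.
patients = sample_graphs(20, 50, 0.30, rng)
controls = sample_graphs(20, 50, 0.25, rng)

stats_p = np.array([edge_density(g) for g in patients])
stats_c = np.array([edge_density(g) for g in controls])
p_value = permutation_test(stats_p, stats_c, rng=rng)
print(f"permutation p-value: {p_value:.4f}")
```

The hard part, and what makes the research question interesting, is of course hidden in this reduction: a single summary statistic discards most of the network structure, so choosing statistics (or test constructions) with power against the alternatives of interest is exactly where a statistically solid treatment is needed.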