In machine learning one considers computer algorithms that learn, in a certain sense, from data or
observations. A simple illustration of this is the automatic recognition of handwritten digits or
letters. In this case, the data consists of, e.g., images of handwritten digits together with the
additional information of which digit each image displays. This additional information is called
the label. Machine learning methods then process this data in such a way that they recognize
previously unseen handwritten digits as accurately as possible. The main difference from
conventional computer algorithms is that the programmers do not build information about the nature
of the problem at hand into the algorithm; in this example, information about the structural
differences between a handwritten "1" and an "8". In this sense, a machine learning algorithm
indeed learns from the data. For such learning methods, mathematics plays a central role at various
points: it starts with a precise definition of "learning", includes in many cases the description
of the learning process and its efficient implementation as a computer algorithm, and ends with
mathematically provable guarantees of whether, and how well, certain learning algorithms actually
learn.
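To make the digit example concrete, here is a minimal Python sketch; the dataset (scikit-learn's bundled 8 x 8 digit images) and the choice of classifier are our illustrative assumptions, not the methods discussed in this interview.

# A hedged sketch: learn digit labels from example images, then
# measure how well previously unseen digits are recognized.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()  # 8x8 grayscale images together with their labels
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

clf = SVC(kernel="rbf", gamma=0.001)  # a kernel-based learning method
clf.fit(X_train, y_train)             # "learning" from the labeled examples
print("accuracy on unseen digits:", clf.score(X_test, y_test))

Note that the code nowhere encodes what distinguishes a "1" from an "8"; the classifier infers this from the labeled examples alone.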
What do such guarantees look like?
This depends on many factors. Let us therefore consider the example of so-called binary
classification. Here we have only two possible labels, in the example above "1" and "8", and the
learning procedure should correctly determine the label of as many future digits as possible. In
the mathematical description of this learning objective, it is then assumed that the handwritten
digits are, in a certain sense, randomly drawn from all possible images of digits, but without
describing these possible images in detail. Ideally, we therefore only assume that the size of the
images, e.g. 100 x 100 pixels, is known. In addition, it is assumed that the labels can also be
random to a certain extent, for example to take the natural ambiguity of scrawly handwriting into
account.
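As a hedged sketch of how this setup is usually formalized (the notation here is ours, not from the interview): the learning method receives a dataset

\[ D = \bigl( (x_1, y_1), \ldots, (x_n, y_n) \bigr) \]

drawn independently from an unknown probability distribution \(P\) on \(X \times Y\), where \(X\) is the set of all, say, 100 x 100 pixel images and \(Y = \{-1, +1\}\) encodes the two labels. The randomness of the labels is described by the conditional probability \(\eta(x) = P(y = 1 \mid x)\), which may lie strictly between 0 and 1 for ambiguously written digits.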
For many learning methods it can then be proved that, given sufficient data, the optimal error
rate is approximately achieved with high probability. Moreover, the so-called "No-Free-Lunch
Theorem" shows that for no algorithm can one specify the required dataset size without describing
the random mechanism in a little more detail. Interesting guarantees, on the one hand, require a
dataset size that is as small as possible and, on the other hand, describe the random mechanism as
vaguely as possible. A vague description is important here, since when designing the learning
procedure we do not want to, or cannot, rely on extensive knowledge of the concrete application
problem. Finally, so-called adaptive learning methods manage to require the optimal dataset size
for a large range of vague descriptions of the random mechanism, without even knowing whether any
of these vague descriptions is valid, and if so, which one.
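To sketch what such a guarantee may look like (again in our notation, and only as an illustrative form): write \(R_P(f) = P(\{(x, y) : f(x) \neq y\})\) for the error rate of a classifier \(f\) and \(R_P^*\) for the optimal error rate over all classifiers. A typical result then states that for every accuracy \(\varepsilon > 0\) and confidence level \(\delta \in (0, 1)\) there is a sample size \(n(\varepsilon, \delta)\) such that

\[ P^n\bigl( D : R_P(f_D) \leq R_P^* + \varepsilon \bigr) \geq 1 - \delta \]

holds for all \(n \geq n(\varepsilon, \delta)\) and all distributions \(P\) matching the vague description, where \(f_D\) is the classifier the method produces from the dataset \(D\). The No-Free-Lunch Theorem says that no finite \(n(\varepsilon, \delta)\) can work uniformly over all distributions, while adaptive methods achieve a near-optimal \(n(\varepsilon, \delta)\) for many vague descriptions simultaneously.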
What kind of special mathematical knowledge is important here?
A precise understanding of random mechanisms plays a major role in the analysis of learning
methods, so probability theory is central. In addition, many methods cannot be understood
without detailed knowledge of certain high- or infinite-dimensional spaces, which leads to
functional analysis and approximation theory. When implementing a learning method one usually has
to solve an optimization problem, so corresponding skills are important, too. Finally, a good
understanding of the inner workings of a computer is essential for implementing a learning method
oneself, in order to keep the usually high computational costs as low as possible. In this sense,
one should therefore have quite a broad background.
What are current challenges?
Despite the current boom, our understanding of many learning processes is extremely limited.
For example, there is no complete theory on whether neural networks, which have become very popular
again in recent years, can actually learn in the above sense. In addition, most learning methods
are so-called black-box methods, for which it is, for example, impossible to explain a particular
prediction. Other aspects such as privacy and fairness, which arise from data-protection and
anti-discrimination considerations, play an increasingly important role, but at the moment they
are difficult to integrate into many learning methods. The same applies to the integration of
specific knowledge about the problem at hand, which after all is available in many concrete
applications. Furthermore, the generation of labels is usually very costly, which leads to the
demand for learning algorithms that only need a small fraction of the data to be labeled. This is
of course only a small selection of open questions, but these already illustrate that there is
still a considerable need for research.
What research is being conducted at the ISA in this regard?
For 20 years now, we have been working on the mathematics of machine learning. Initially,
learning guarantees for so-called kernel-based learning methods played a central role, but since
then the efficient implementation of these learning methods for large-scale data has become
important, too. Finally, in recent years, we have also worked on learning without labels
and on neural networks. In addition, we have repeatedly dealt with questions concerning the
robustness of learning methods and the correct description of learning problems.
At the moment, we are involved in the Cluster of Excellence SimTech, the International Max Planck Research School for Intelligent Systems, the Cyber Valley Initiative of the state of Baden-Württemberg, and the European Laboratory for Learning and Intelligent Systems.
Thank you for the interview.
Prof. Ingo Steinwart
Institute of Stochastics and Applications
Chair of Stochastics