52
$\begingroup$

I have recently become interested in Machine Learning and AI as a student of theoretical physics and mathematics, and have gone through some of the recommended resources dealing with statistical learning theory and deep learning in neural networks.

One of the main issues I find in my personal study of this discipline is that an overwhelming proportion of such resources focus on the practical side of ML, forfeiting rigour in favour of useful heuristics. This approach has its obvious merits, considering the great current interest in their applications both in science and technology, but I would like to go beyond what the average engineer might need and explore the more theoretical sides.

The elephant in the room is, of course, the fact that to date the inner workings of the main tools of AI, neural networks above all others, are not well understood. From what I can tell, there are a variety of approaches drawing from very diverse fields, including a physical perspective (see Huang's Statistical Mechanics of Neural Networks, or Statistical Field Theory for Neural Networks by Helias and Dahmen).

As an outsider, I have a hard time navigating the literature, so I thought I would ask quite an open question on this site (whether this is the right place, I do not know; I'm sure the moderators will let me know if it isn't). Could anyone lay out a map of the current landscape of AI research, from mainstream science to the cutting-edge approaches, and elucidate the types of mathematics needed to tackle them?

$\endgroup$
18
  • 22
    $\begingroup$ For a rapidly developing field, there may not be a clear map. $\endgroup$ Dec 10 at 21:55
  • 3
    $\begingroup$ @SamHopkins True, and I guess that's part of the appeal. Maybe I could have put "as clear as the circumstances allow" in the title, but I think it doesn't sound as catchy. Besides, I was partially inspired to make this post by this post about learning the basics of ML, as a sort of "next step" for the aspiring researcher. $\endgroup$
    – AI Bert
    Dec 10 at 22:04
  • 6
    $\begingroup$ Note that "AI research" is a much broader field than machine learning, let alone deep statistical machine learning -- it goes back 70+ years, encompassing many areas a layman might not think of as 'artificial intelligence' (see e.g. the table of contents of "AI: A Modern Approach"). Do you want to narrow the question to statistical "deep" neural-network machine learning? $\endgroup$
    – usul
    Dec 10 at 23:16
  • 2
    $\begingroup$ @usul Simulating neural networks on computers has been done since the 1950s, and deep learning goes back to the 1960s. They also go back 70+ years. $\endgroup$ Dec 11 at 0:12
  • 2
    $\begingroup$ @AIBert sounds good to me, just saying that answering your question would be a large undertaking. E.g. if you look at the NeurIPS conference call for papers, it mentions 12 general example areas and about 44 example sub-areas. Between just that conference, ICML, and AAAI alone there are several thousand papers published each year. On the theory side, there are the conferences COLT and ALT. Someone's answer discusses NLP-specific conferences. $\endgroup$
    – usul
    Dec 11 at 17:04

10 Answers

25
$\begingroup$

I highly recommend the syllabus for Boaz Barak's Harvard course on Foundations of Deep Learning. It balances a mathematical point of view with a respect for the fast-moving empirical aspects of the field.

$\endgroup$
17
$\begingroup$

At a more introductory level than Martin M. W.'s answer, I enjoyed Notes on Contemporary Machine Learning for Physicists by Jared Kaplan. In particular, it is a standalone text (although without exercises) and might be easier to follow than course slides without the accompanying lectures. It is written in a "theoretical physics" style, focusing on the intuition behind the mathematical concepts underlying machine learning.

$\endgroup$
1
  • 1
    $\begingroup$ Yes, this is a great reference! $\endgroup$ Dec 11 at 11:46
13
$\begingroup$

One interesting mathematical approach to neural networks is proving approximation theorems, e.g. as described at https://en.wikipedia.org/wiki/Universal_approximation_theorem

Universal approximation theorems imply that neural networks can approximate a wide variety of interesting functions to arbitrary accuracy when given appropriate weights.

Let $C(X, \mathbb{R}^m)$ denote the set of continuous functions from a subset $X$ of Euclidean space $\mathbb{R}^n$ to Euclidean space $\mathbb{R}^m$. Let $\sigma \in C(\mathbb{R}, \mathbb{R})$, and write $\sigma \circ x$ for $\sigma$ applied componentwise, $(\sigma \circ x)_i = \sigma(x_i)$. Then $\sigma$ is not polynomial if and only if for every $n \in \mathbb{N}$, $m \in \mathbb{N}$, compact subset $K \subseteq \mathbb{R}^n$, $f \in C(K, \mathbb{R}^m)$, and $\varepsilon > 0$ there exist $k \in \mathbb{N}$, $A \in \mathbb{R}^{k \times n}$, $b \in \mathbb{R}^k$, and $C \in \mathbb{R}^{m \times k}$ such that $$\sup_{x \in K} \| f(x) - g(x) \| < \varepsilon,$$ where $g(x) = C \cdot ( \sigma \circ (A \cdot x + b) )$.
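To make the statement concrete, here is a tiny numerical illustration (my own sketch, not part of the theorem or the works cited below): fix a non-polynomial activation $\sigma$, draw $A$ and $b$ at random, and fit only the outer matrix $C$ by least squares to approximate $f(x)=\sin(2\pi x)$ on $K=[0,1]$. The theorem of course allows optimizing $A$ and $b$ as well; freezing them at random values is just a shortcut that keeps the demo to a single linear solve.

```python
# A minimal illustration (not a proof) of the universal approximation theorem:
# approximate f(x) = sin(2*pi*x) on [0, 1] by g(x) = C @ sigma(A @ x + b),
# with A, b random and fixed, and only C fitted by least squares.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(2 * np.pi * x)

def sigma(t):                                   # a non-polynomial activation
    return np.tanh(t)

k = 200                                         # hidden width
A = rng.normal(scale=10.0, size=(k, 1))         # random inner weights (kept fixed)
b = rng.uniform(-10.0, 10.0, size=k)            # random biases (kept fixed)

x = np.linspace(0.0, 1.0, 1000).reshape(-1, 1)  # dense grid on the compact set K
H = sigma(x @ A.T + b)                          # hidden features sigma(Ax + b)

C, *_ = np.linalg.lstsq(H, f(x).ravel(), rcond=None)   # outer weights
g = H @ C

print("sup-norm error on the grid:", np.max(np.abs(f(x).ravel() - g)))
# Increasing k drives the error down, in the spirit of the theorem.
```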

One major work is "Error bounds for approximations with deep ReLU networks" by Dmitry Yarotsky, which states:

We will consider $L^\infty$-error of approximation of functions belonging to the Sobolev spaces $W_n^{\infty}([0, 1]^d )$ (without any assumptions of hierarchical structure).

One article with many references is "Why Deep Neural Networks for Function Approximation?".

$\endgroup$
4
  • 5
    $\begingroup$ These are interesting, but the problem with this kind of theory paper is that it always lags behind the practice. I used to read a bunch of these in grad school, but eventually I lost interest, because it didn't feel that the theory was building towards an actual understanding of what we were doing as practitioners. That's a long way of saying, I don't know if this can be described as part of a "mathematical approach to AI", because it just follows the empiricists and retells their story, rather than suggesting new ideas. $\endgroup$ Dec 11 at 10:02
  • 1
    $\begingroup$ I asked a related question a few years ago at Cross Validated: Iconic (toy) models of neural networks (stats.stackexchange.com/questions/279713/…). $\endgroup$ Dec 11 at 17:04
  • 1
    $\begingroup$ @DavisYoshida Although I share your sentiment (just make AI non-alchemy again!), my current stance is that our mathematical understanding indeed lags behind the practice, because the latter is easier to advance. So I still consider any good theoretical work important, albeit still far from practice. Hopefully from these building blocks someone eventually finds a breakthrough that makes AI more like science again. (As an anecdote, in my answer I shared a very recent theoretical work on the mathematical understanding of neural networks; it received an outstanding paper award at EMNLP.) $\endgroup$
    – justhalf
    Dec 12 at 2:43
  • $\begingroup$ @justhalf The approximation results let you, e.g., conceptualize the one-layer, very-large-width model parametrization as "all nice functions, up to numerics." But in the end I agree with Davis. Such results trace back at least to Wiener's Tauberian theorems, a time before Turing was even a uni student. Subjects lie dormant to an extent, are only improved when in demand, proofs are being turned constructive just now, etc. Functional analysis is a box mathematicians feel comfortable in. Thinking deeply about messy cat images scares them. $\endgroup$
    – Nikolaj-K
    Dec 16 at 2:47
9
$\begingroup$

Here are some relevant resources:

$\endgroup$
7
$\begingroup$

You might be interested in Greg Yang's work (e.g., https://arxiv.org/abs/2203.03466), which is very physics-inspired. This theory led to a zero-shot hyperparameter transfer method that seems to have been used in the most recent generation of LLMs. (I say "seems to", since the labs aren't very forthcoming with details.)

$\endgroup$
7
$\begingroup$

Much of what people call AI today consists of LLMs, so it would be good to look at NLP conferences such as ACL, EMNLP (this year's edition just concluded yesterday), and others.

There are some theoretical works there. For example, the recent EMNLP 2023 paper Unraveling Feature Extraction Mechanisms in Neural Networks, which uses Neural Tangent Kernels (NTK) to investigate the behavior of neural networks, received an outstanding paper award. Following the references from that paper may lead you to more theoretical works, such as viewing neural networks as Gaussian processes, as implicit matrix factorization, etc.

$\endgroup$
5
$\begingroup$

The recent book (2022) "Mathematical Aspects of Deep Learning," edited by Philipp Grohs and Gitta Kutyniok, provides a comprehensive overview of contemporary mathematical approaches to deep learning analysis.

Here is an excerpt of the first chapter:

We describe the new field of the mathematical analysis of deep learning. This field emerged around a list of research questions that were not answered within the classical framework of learning theory. These questions concern: the outstanding generalization power of overparametrized neural networks, the role of depth in deep architectures, the apparent absence of the curse of dimensionality, a surprisingly successful optimization performance despite the non-convexity of the problem, understanding what features are learned, why deep architectures perform exceptionally well in physical problems, and which fine aspects of an architecture affect the behavior of a learning task in which way. We present an overview of modern approaches that yield partial answers to these questions. For selected approaches, we describe the main ideas in more detail.

$\endgroup$
4
$\begingroup$

Notice: this post is based on my own views and experience with machine learning algorithms, which are orthogonal to the views and experience of everyone else.

Modern deep learning algorithms were not designed in order to have a nice mathematical theory nor were they designed to be particularly interpretable or understandable. They were instead designed to perform well while the inner workings of the neural network are mostly regarded as a black box. The end consumer of AI technologies will typically be unfamiliar with and uninterested in the inner workings of an AI system, so the developers of these systems develop AI models that look good on the outside, but the parameters of these AI models may not even be publicly available. In practice, neural networks tend to be a hodgepodge of techniques (such as convolutional layers, regularization, normalization, various activation functions, etc.) that experimentally work well for data in a specific format rather than a unified structure that mathematicians can prove theorems about. I think that this is not a good thing as we need to be more focused on AI safety/interpretability than simply performance.

Criteria by which machine learning is mathematical

A fitness function with a unique maximum or few local maxima should be considered more mathematical and interpretable than a fitness function with many local maxima (or, at the least, each local optimum should be completely describable using few bits of information). I prefer machine learning algorithms where training is pseudodeterministic, in the sense that the learned AI model does not depend very much on the random initialization or on other sources of randomness introduced during training. The pseudodeterminism should also be robust in several different ways. For example, the Hessian at the local maximum should not have eigenvalues that are too close to zero, and training should remain pseudodeterministic even when the fitness function is generalized. If these pseudodeterminism requirements are satisfied, then the trained model should be interpretable, and one should be able to investigate it mathematically.
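As a concrete way to check the Hessian part of this criterion, one can estimate the Hessian of a fitness function at a computed optimum by finite differences and look at how far its eigenvalues stay from zero. The toy fitness function below is purely illustrative and is not taken from the answer.

```python
# Sketch: finite-difference Hessian of a toy fitness function at its maximum,
# reporting how far the eigenvalues are from zero.
import numpy as np

def fitness(w):
    # Hypothetical smooth fitness with a unique maximum at w = (1, -2).
    return -(w[0] - 1.0) ** 2 - 3.0 * (w[1] + 2.0) ** 2

def hessian_fd(f, w, eps=1e-4):
    n = len(w)
    I = np.eye(n)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # Central-difference estimate of the mixed partial derivative.
            H[i, j] = (f(w + eps * I[i] + eps * I[j])
                       - f(w + eps * I[i] - eps * I[j])
                       - f(w - eps * I[i] + eps * I[j])
                       + f(w - eps * I[i] - eps * I[j])) / (4 * eps ** 2)
    return H

w_opt = np.array([1.0, -2.0])                  # the known maximizer of the toy fitness
eigs = np.linalg.eigvalsh(hessian_fd(fitness, w_opt))
print("Hessian eigenvalues at the optimum:", eigs)      # about [-6, -2]
print("smallest magnitude:", np.min(np.abs(eigs)))      # bounded away from zero
```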

The nature of neural networks

I consider neural networks with ReLU activation to be mathematical objects in the sense that the functions computed by these networks are exactly tropical rational functions (differences of two tropical polynomials), so perhaps the connection between ReLU networks and tropical geometry may be explored further.
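As a small worked example of this connection (my own illustration, using the usual max-plus conventions $\oplus=\max$ and $\odot=+$): for a scalar input $x$, a non-negative integer weight $w$, and a bias $b$, $$\operatorname{ReLU}(wx+b)=\max(wx+b,\,0)=\bigl(b\odot x^{\odot w}\bigr)\oplus 0,$$ which is a tropical polynomial in $x$. Composing such layers and taking signed linear combinations of their outputs yields differences of tropical polynomials, i.e. tropical rational functions, which is the identification referred to above.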

I have performed a simple experiment in which I trained a reasonably small neural network twice from the same initialization, taking some measures to ensure that the two trained networks would end up at the same local optimum. Even with these measures, the two networks ended up looking quite different from each other (a minimal version of this comparison is sketched below). To make things worse, after training a neural network one can often remove over 90 percent of the weights without harming its performance. This convinces me that trained neural networks still carry a lot of random information and are difficult to investigate from a pure mathematics perspective. Neural networks are quite noisy, and perhaps this noise is a reason for their lack of interpretability.
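For what it is worth, a minimal version of this experiment looks roughly as follows (a sketch using numpy only; the tiny architecture, the synthetic data, and the choice to vary only the sample order between runs are my own illustrative choices, not the exact setup described above).

```python
# Sketch: train the same tiny ReLU network twice from the same initialization,
# varying only the order in which training samples are visited, then compare weights.
import numpy as np

def train(order_seed, W1_init, W2_init, X, y, lr=0.01, epochs=100):
    W1, W2 = W1_init.copy(), W2_init.copy()
    rng = np.random.default_rng(order_seed)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):       # plain SGD, one sample at a time
            x = X[i:i + 1]
            h = np.maximum(0.0, x @ W1)         # ReLU hidden layer
            err = h @ W2 - y[i:i + 1]           # prediction error
            grad_W2 = h.T @ err                 # backpropagation through both layers
            grad_W1 = x.T @ ((err @ W2.T) * (h > 0))
            W1 -= lr * grad_W1
            W2 -= lr * grad_W2
    return W1, W2

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (np.sin(3 * X[:, 0]) * X[:, 1]).reshape(-1, 1)   # arbitrary smooth target

W1_init = rng.normal(scale=0.5, size=(2, 32))
W2_init = rng.normal(scale=0.5, size=(32, 1))

W1_a, W2_a = train(1, W1_init, W2_init, X, y)
W1_b, W2_b = train(2, W1_init, W2_init, X, y)

# Same initialization, different data order: how far apart do the weights drift?
print("relative distance, layer 1:", np.linalg.norm(W1_a - W1_b) / np.linalg.norm(W1_a))
print("relative distance, layer 2:", np.linalg.norm(W2_a - W2_b) / np.linalg.norm(W2_a))
```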

Alternative mathematical machine learning techniques

There are some machine learning algorithms that do have a nice mathematical theory behind them. Unfortunately, these more mathematical machine learning algorithms have not been developed to the point where they can compete with neural networks, but they still have an important role in machine learning, and I believe that people can develop the theory and practice of these more mathematical machine learning algorithms so that they will help with tasks that today can only be accomplished using deep neural networks.

The PageRank algorithm used by Google and other search engines consists of computing the Perron-Frobenius (dominant) eigenvector of a stochastic matrix built from the adjacency matrix of the web's directed link graph (with out-degree normalization and a damping term). The eigenvectors of the Laplacian matrix of a graph can likewise be used to partition the nodes of the graph into clusters. I would therefore consider spectral graph theory (along with related concepts such as the eigenvalues of the Laplacian on a Riemannian manifold) an area of mathematics applicable to machine learning; a short sketch of both computations follows.
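Both computations fit in a few lines. The sketch below (my own, on toy graphs) runs the damped power iteration behind PageRank and then splits a small graph into two clusters using the sign pattern of the Fiedler vector of its Laplacian; the damping value 0.85 is the conventional choice.

```python
# Sketch: (1) PageRank via power iteration on a damped, out-degree-normalized
#         link matrix; (2) two-way spectral partitioning via the graph Laplacian.
import numpy as np

# Toy directed graph for PageRank: adj[i, j] = 1 if there is a link i -> j.
adj = np.array([[0, 1, 1, 0],
                [0, 0, 1, 0],
                [1, 0, 0, 1],
                [0, 0, 1, 0]], dtype=float)

def pagerank(adj, damping=0.85, iters=100):
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    # Row-stochastic transition matrix; dangling nodes jump uniformly.
    P = np.where(out_deg > 0, adj / np.where(out_deg == 0, 1.0, out_deg), 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):                       # power iteration
        r = damping * (P.T @ r) + (1 - damping) / n
    return r / r.sum()

print("PageRank scores:", pagerank(adj))

# Toy undirected graph for spectral clustering: two triangles joined by one edge.
B = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    B[i, j] = B[j, i] = 1.0

L = np.diag(B.sum(axis=1)) - B                   # graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]                          # eigenvector of the 2nd smallest eigenvalue
print("cluster labels from the Fiedler vector:", (fiedler > 0).astype(int))
```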

Making machine learning more mathematical using functions on a complex domain

Suppose that $D=\{z\in\mathbb{C}:|z|\leq 1\}$, and let $f:D^n\rightarrow[-\infty,\infty)$ be a non-constant continuous function that is plurisubharmonic on the interior of $D^n$. Let $g:S_1^n\rightarrow[-\infty,\infty)$ be the restriction of $f$ to the torus $S_1^n$, where $S_1=\{z\in\mathbb{C}:|z|=1\}$ is the unit circle.

Let $L_f$ and $L_g$ denote the sets of local maxima of $f$ and $g$, respectively. Then $L_f\subseteq L_g$ and $\max f=\max g$ by the maximum principle. If $\mathbf{z}$ is a typical element of $L_g$, then there is approximately a $0.5^n$ probability that $\mathbf{z}$ also belongs to $L_f$, so $|L_f|\approx(0.5)^n|L_g|$. Since $f$ has fewer local maxima than $g$, the fitness function $f$ should be easier to investigate mathematically than $g$. The main problem, then, is to use fitness functions like $f$ to solve machine learning tasks.
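To illustrate the maximum-principle step in the simplest case $n=1$ (an example of mine, not from the answer): take $f(z)=\operatorname{Re}(z)$ on $D$, which is harmonic and hence subharmonic on the interior. Then $$\max_{z\in D} f(z)=1=\max_{\theta}\,g(e^{i\theta}),\qquad g(e^{i\theta})=\cos\theta,$$ with both maxima attained only at $z=1\in S_1$; no interior point of $D$ is a local maximum of $f$, consistent with $L_f\subseteq L_g$ and $\max f=\max g$.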

Unfortunately, I have not seen much literature on attempts to use plurisubharmonic fitness functions for machine learning, and there are certainly difficulties with this sort of approach, but it seems that researchers should spend more resources on developing fitness functions of a complex variable in order to build safer and more interpretable AI systems.

$\endgroup$
4
$\begingroup$

As was mentioned, you are probably asking for too much, with respect to the bleeding edge.

Nevertheless, here are a few comments and resources:

I would like to go beyond what the average engineer might need and explore the more theoretical sides.

Here are three textbooks concerned with theoretical machine learning, from less to more mathematically sophisticated:

None of these directly address the bleeding edge, but they may nevertheless be useful.

This webpage has some resources on the theory of reinforcement learning (under Textbook and Related Courses):

With respect to:

the current landscape of AI research, from mainstream science to the cutting-edge approaches, and elucidate the types of mathematics needed to tackle them?

The content of the above resources will be of some help. However, the "types of mathematics needed to tackle" the bleeding edge of a field are obviously not something that can be laid out with authority. In general, and I apologize if this list is offensively elementary, I have seen the following flavors of math used:

  • real analysis
  • measure theory
  • probability theory, including measure-theoretic
  • stochastic processes
  • maybe a bit of functional analysis
  • linear algebra
  • some statistics
  • some computer science techniques (analysis of algorithms, etc.)
  • optimization

So you can think of the above list of topics as areas of math that have been used, but probably not an exhaustive list of everything that could be useful.

Edit: And here is a paper I should have mentioned: The Mathematics of Artificial Intelligence, by Gitta Kutyniok.

$\endgroup$
3
$\begingroup$

Below are three books on the mathematics of deep learning that I have found. In particular, the author of the first one is a pure mathematician, so you can expect "high-quality" mathematics from his book.

  1. Deep Learning Architectures: A Mathematical Approach by Ovidiu Calin.

This book describes how neural networks operate from the mathematical point of view. As a result, neural networks can be interpreted both as function universal approximators and information processors. The book bridges the gap between ideas and concepts of neural networks, which are used nowadays at an intuitive level, and the precise modern mathematical language, presenting the best practices of the former and enjoying the robustness and elegance of the latter.

This book can be used in a graduate course in deep learning, with the first few parts being accessible to senior undergraduates. In addition, the book will be of wide interest to machine learning researchers who are interested in a theoretical understanding of the subject.

  2. Geometry of Deep Learning: A Signal Processing Perspective by Jong Chul Ye.

The focus of this book is on providing students with insights into geometry that can help them understand deep learning from a unified perspective. Rather than describing deep learning as an implementation technique, as is usually the case in many existing deep learning books, here, deep learning is explained as an ultimate form of signal processing techniques that can be imagined.

To support this claim, an overview of classical kernel machine learning approaches is presented, and their advantages and limitations are explained. Following a detailed explanation of the basic building blocks of deep neural networks from a biological and algorithmic point of view, the latest tools such as attention, normalization, Transformer, BERT, GPT-3, and others are described. Here, too, the focus is on the fact that in these heuristic approaches, there is an important, beautiful geometric structure behind the intuition that enables a systematic understanding. A unified geometric analysis to understand the working mechanism of deep learning from high-dimensional geometry is offered. Then, different forms of generative models like GAN, VAE, normalizing flows, optimal transport, and so on are described from a unified geometric perspective, showing that they actually come from statistical distance-minimization problems.

Because this book contains up-to-date information from both a practical and theoretical point of view, it can be used as an advanced deep learning textbook in universities or as a reference source for researchers interested in acquiring the latest deep learning algorithms and their underlying principles. In addition, the book has been prepared for a codeshare course for both engineering and mathematics students, thus much of the content is interdisciplinary and will appeal to students from both disciplines.

  3. The Principles of Deep Learning Theory: An Effective Theory Approach to Understanding Neural Networks by Daniel A. Roberts, Sho Yaida, and Boris Hanin.

This textbook establishes a theoretical framework for understanding deep learning models of practical relevance. With an approach that borrows from theoretical physics, Roberts and Yaida provide clear and pedagogical explanations of how realistic deep neural networks actually work. To make results from the theoretical forefront accessible, the authors eschew the subject's traditional emphasis on intimidating formality without sacrificing accuracy. Straightforward and approachable, this volume balances detailed first-principles derivations of novel results with insight and intuition for theorists and practitioners alike. This self-contained textbook is ideal for students and researchers interested in artificial intelligence, with minimal prerequisites of linear algebra, calculus, and informal probability theory, and it can easily fill a semester-long course on deep learning theory. For the first time, the exciting practical advances in modern artificial intelligence capabilities can be matched with a set of effective principles, providing a timeless blueprint for theoretical research in deep learning.

$\endgroup$
1
  • $\begingroup$ I find the last of these books fascinating but also puzzling. The authors use the term effective theory to mean a theory that accounts for the phenomenology without claiming to identify the true causes of the phenomena. Effective field theory has been successful in physics, but it remains to be seen whether it succeeds in the realm of deep learning. The Amazon editorial reviews include some big names, and they are enthusiastic, but also careful not to overstate the authors' contribution. $\endgroup$ Dec 17 at 9:22
