Questions tagged [learning-theory]

This tag is for questions related to the following branches: statistical learning theory, machine learning, Vapnik–Chervonenkis (VC) theory, and any other branch of learning theory that involves various kinds of mathematics.

2 votes
1 answer
40 views

Non-linear transforms of RKHS question

I was reading the paper Norm Inequalities in Nonlinear Transforms (referenced in this question) but ran into difficulties, so I was wondering if anyone could help? I think I follow the paper until I ...
asked by Mat
52 votes
10 answers
6k views

A clear map of mathematical approaches to Artificial Intelligence

I have recently become interested in Machine Learning and AI as a student of theoretical physics and mathematics, and have gone through some of the recommended resources dealing with statistical ...
1 vote
0 answers
59 views

Approximation of continuous functions by multilayer ReLU neural networks

For a continuous/Hölder function $f$ defined on a compact set $K$, a fixed depth $L$, and widths $m_1,m_2,\dots,m_L$, can we find a multilayer ReLU fully connected network $g$ with depth $L$ whose $i$-th layer has width ...
asked by Hao Yu
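The one-dimensional case of the question above has a classical constructive answer: a one-hidden-layer ReLU network can reproduce the piecewise-linear interpolant of $f$ exactly, with the change of slope at each knot as coefficients. A minimal sketch (the function names and the test function are illustrative, not from the question):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_interpolant(f, m):
    """One-hidden-layer ReLU network that linearly interpolates f
    at the m + 1 uniform grid points of [0, 1]."""
    knots = np.linspace(0.0, 1.0, m + 1)
    fv = f(knots)
    slopes = np.diff(fv) / np.diff(knots)
    # Coefficient of each ReLU unit: the change of slope at its knot.
    c = np.concatenate([[slopes[0]], np.diff(slopes)])
    def g(x):
        x = np.atleast_1d(np.asarray(x, float))
        return fv[0] + relu(x[:, None] - knots[:-1][None, :]) @ c
    return g

f = lambda x: np.sin(2 * np.pi * x)
g = relu_interpolant(f, 200)
xs = np.linspace(0.0, 1.0, 1000)
print(np.max(np.abs(f(xs) - g(xs))))  # sup-error is O(1/m^2) for smooth f
```

For smooth $f$ the sup-norm error of this interpolant decays like $O(1/m^2)$, which is one ingredient in the width/depth trade-offs studied for deeper networks.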
0 votes
0 answers
25 views

The hardness of active learning with fixed budget

I have been looking for theoretical papers studying this question of the fundamental hardness of PAC active learning algorithms. I found a few papers studying the problem from a fixed perspective (...
asked by rivana
1 vote
2 answers
191 views

Beating the $1/\sqrt n$ rate of uniform-convergence over a linear function class

Let $P$ be a probability distribution on $\mathbb R^d \times \mathbb R$, and let $(x_1,y_1), \ldots, (x_n,y_n)$ be an iid sample of size $n$ from $P$. Fix $\epsilon,t\gt 0$. For any unit-vector $w \in ...
asked by dohmatob
0 votes
1 answer
80 views

Is it reasonable to consider the subgaussian property of the logarithm of the Gaussian pdf?

Let $Y$ denote a Gaussian random variable characterized by a mean $\mu$ and a variance $\sigma^2$. Consider $N$ independent and identically distributed (i.i.d.) copies of $Y$, denoted as $Y_1, Y_2, \...
asked by Math_Y
2 votes
1 answer
81 views

VC-based risk bounds for classifiers on finite set

Let $X$ be a finite set and let $\emptyset\neq \mathcal{H}\subseteq \{ 0,1 \}^{\mathcal{X}}$. Let $\{(X_n,L_n)\}_{n=1}^N$ be i.i.d. random variables on $X\times \{0,1\}$ with law $\mathbb{P}$. ...
asked by Math_Newbie
1 vote
1 answer
156 views

Rademacher complexity for a family of bounded, nondecreasing functions?

Let $\{\phi_k\}_{k=1}^K$ be a family of functions mapping from an interval $[a, b]$ to $[-1, 1]$. That is, $\phi_k \colon[ a,b] \to [-1, 1]$ are nondecreasing maps on some finite interval $[a, b] \...
asked by Drew Brady
1 vote
0 answers
75 views

Distribution-free learning vs distribution-dependent learning

I came across some papers studying the problem of distribution-free learning, and I am interested in knowing the exact definition of distribution-free learning. I have searched some literature: In ...
asked by yinan
4 votes
0 answers
116 views

Progress on "Un-Alching" ML?

So, a couple of years ago I watched both Ali Rahimi's NIPS speech "Machine Learning is Alchemy", (where he talks about how the field lacks a solid, overarching, theoretical foundation) and ...
asked by dicaes
1 vote
1 answer
111 views

Tight upper bounds for the Gaussian width of the intersection of a hyper-ellipsoid and the unit ball

Let $\Lambda$ be a positive-definite matrix of size $n$ and let $R \ge 0$, which may depend on $n$. Consider the set $S := \{x \in \mathbb R^n \mid \|x\|_2 \le R,\,\|x\|_{\Lambda^{-1}} \le 1\}$ where $...
asked by dohmatob
1 vote
0 answers
24 views

Minimax statistical estimation of proximal transform $\mbox{prox}_g(\theta_0)$, from linear model data $y_i := x_i^\top \theta_0 + \epsilon_i$

tl;dr: My question pertains to the subject of minimax estimation theory (mathematical statistics), in the context of linear regression. Given a vector $\theta_0 \in \mathbb R^d$, consider the linear ...
asked by dohmatob
2 votes
0 answers
196 views

Covering/Bracketing number of monotone functions on $\mathbb{R}$ with uniformly bounded derivatives

I am interested in the $\| \cdot \|_{\infty}$-norm bracketing number or covering number of some collection of distribution functions on $\mathbb{R}$. Let $\mathcal{F}$ consist of all distribution ...
asked by masala
1 vote
0 answers
245 views

Conditions for equivalence of RKHS norm and $L^2(P)$ norm

Let $K$ be a psd kernel on an abstract space $X$ and let $H_K$ be the induced Reproducing Kernel Hilbert Space (RKHS). Let $P$ be a probability measure on $X$ such that $H_K \subseteq L^2(P_X)$ and ...
asked by dohmatob
1 vote
1 answer
71 views

How far from a sparse parity function can a function be and still look like such a function on small sets?

Let $\mathbb F_2^n$ denote the set of binary vectors of length $n$. A $k$-sparse parity function is a linear function $h:\mathbb F_2^n\to\mathbb F_2$ of the form $h(x)=u\cdot x$ for some $u$ of ...
asked by Jack M
0 votes
0 answers
32 views

Normalizing a parameter in a regression

I am thinking about the possibility of constraining a parameter in my regression, say the $\lambda$ in a ridge regression, to lie inside a range $\lambda \in [0,1]$. Do you have any ideas how I ...
asked by SUMQXDT
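One common trick for the question above is to reparameterize: $\lambda = t/(1-t)$ maps a bounded $t \in [0,1)$ bijectively onto $\lambda \in [0,\infty)$, so one can search over the bounded parameter $t$ instead. A hedged sketch (the closed-form ridge solver and toy data below are illustrative assumptions, not from the question):

```python
import numpy as np

def lam_from_t(t):
    # Map a bounded parameter t in [0, 1) to a ridge penalty lambda in [0, inf).
    return t / (1.0 - t)

def t_from_lam(lam):
    # Inverse map: lambda in [0, inf) back to t in [0, 1).
    return lam / (1.0 + lam)

def ridge(X, y, t):
    # Ridge estimate beta = (X^T X + lambda I)^{-1} X^T y, parameterized by t.
    lam = lam_from_t(t)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)
print(ridge(X, y, t=0.5))  # t = 0.5 corresponds to lambda = 1
```

The same bijection works for any nonnegative hyperparameter; its inverse lets existing $\lambda$ values be translated into the $[0,1)$ scale.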
0 votes
0 answers
83 views

Verification of a certain computation of VC dimension

Disclaimer: I'm not very familiar with the concept of VC dimension and how to manipulate such objects. I'd be grateful if experts on the subject (learning theory, probability) could kindly proofread ...
asked by dohmatob
0 votes
1 answer
150 views

VC dimension of a certain derived class of binary functions

Let $X$ be a measurable space and let $P$ be a probability distribution on $X \times \{\pm 1\}$. Let $F$ be a function class on $X$, i.e., a collection of (measurable) functions from $X$ to $\mathbb R$...
asked by dohmatob
0 votes
1 answer
139 views

Rademacher complexity of the function class $\{(x,y) \mapsto 1[|yf(x)-\alpha| \ge \beta]\}$ in terms of $\alpha$, $\beta$, and the Rademacher complexity of $F$

Let $X$ be a measurable space and let $P$ be a probability distribution on $X \times \{\pm 1\}$. Let $F$ be a function class on $X$, i.e., a collection of (measurable) functions from $X$ to $\mathbb R$...
asked by dohmatob
0 votes
0 answers
150 views

Upper-bound for bracketing number in terms of VC-dimension

Let $P$ be a probability distribution on a measurable space $\mathcal X$ (e.g., some Euclidean $\mathbb R^m$) and let $F$ be a class of functions $f:\mathcal X \to \mathbb R$. Given $f_1,f_2 \in F$, ...
asked by dohmatob
1 vote
0 answers
87 views

$L_1$ convergence rates for multivariate kernel density estimation

Let $X$ be a random variable on $\mathbb R^d$ with probability density function $f$, and let $X_1,\ldots,X_n$ be $n$ iid copies of $X$. Given a bandwidth parameter $h=h_n > 0$ and a kernel $...
asked by dohmatob
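For readers wanting to experiment with the estimator in the question above, here is a minimal multivariate KDE sketch with a Gaussian kernel (the bandwidth, dimension, and toy distribution are illustrative assumptions; the question itself concerns $L_1$ rates, which this does not address):

```python
import numpy as np

rng = np.random.default_rng(1)

def kde(x, data, h):
    """Gaussian-kernel density estimate
    f_n(x) = (1 / (n h^d)) * sum_i K((x - X_i) / h) at a single point x."""
    n, d = data.shape
    diffs = (x[None, :] - data) / h                              # (n, d)
    k = np.exp(-0.5 * (diffs ** 2).sum(axis=1)) / (2 * np.pi) ** (d / 2)
    return k.sum() / (n * h ** d)

# Toy check in d = 2 with standard-normal data.
d, n, h = 2, 2000, 0.3
data = rng.standard_normal((n, d))
x0 = np.zeros(d)
true_f0 = 1 / (2 * np.pi)  # N(0, I_2) density at the origin
print(kde(x0, data, h), true_f0)
```

With $n = 2000$ and $h = 0.3$ the estimate at the origin lands close to the true density; shrinking $h$ as $n$ grows at the usual rate $h_n \asymp n^{-1/(d+4)}$ balances the bias and variance terms.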
4 votes
0 answers
153 views

Convergence rates for kernel empirical risk minimization, i.e empirical risk minimization (ERM) with kernel density estimation (KDE)

Let $\Theta$ be an open subset of some $\mathbb R^m$ and let $P$ be a probability distribution on $\mathbb R^d$ with density $f$ in a Sobolev space $W_p^s(\mathbb R^d)$, i.e., all derivatives of $f$ ...
asked by dohmatob
4 votes
2 answers
233 views

Bounds on the number of samples needed to learn a real valued function class

Consider Theorem 6.8 in this book, https://www.cs.huji.ac.il/w~shais/UnderstandingMachineLearning/understanding-machine-learning-theory-algorithms.pdf It gives us a lower bound (and also an ...
asked by Student
1 vote
0 answers
51 views

Properties of a kernel convolution $K'(x,y) = \int_X\int_X K_0(x,a)K(a,b)K_0(b,y)d\mu(a)d\mu(b)$ where $K$ and $K_0$ are kernels on $(X,\mu)$

Let $(X,\mu)$ be a probability measure space and $K:X \times X \to \mathbb R$ be a (psd) kernel on $X$. Let $K_0$ be another kernel on $X$ and define a new kernel $\widetilde K$ on $X$ by $$ \...
asked by dohmatob
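On a finite discretization of $(X,\mu)$, the double integral defining the new kernel in the question above becomes a matrix product, which makes it easy to probe properties such as positive semi-definiteness numerically. A sketch (the Gaussian kernels, grid, and weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# Discretize (X, mu): N points with probability weights mu.
N = 40
pts = rng.uniform(-1.0, 1.0, size=(N, 1))
mu = np.full(N, 1.0 / N)

def gauss_gram(pts, s):
    # Gram matrix of the Gaussian (psd) kernel with width s.
    sq = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2 * s ** 2))

K = gauss_gram(pts, 0.5)   # the kernel K
K0 = gauss_gram(pts, 1.0)  # the kernel K0

# K'(x,y) = int int K0(x,a) K(a,b) K0(b,y) dmu(a) dmu(b)
# becomes K0 D K D K0 with D = diag(mu) on the discretization.
D = np.diag(mu)
Kp = K0 @ D @ K @ D @ K0

# K' inherits positive semi-definiteness:
# Kp = (K0 D) K (K0 D)^T, since K0 and D are symmetric.
print(np.linalg.eigvalsh(Kp).min())  # >= 0 up to round-off
```

The factorization in the last comment is the discrete shadow of the argument that the convolved kernel is again psd.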
2 votes
1 answer
148 views

Representer theorem for a loss / functional of the form $L(h) := \sum_{i=1}^n (|h(x_i)-y_i|+t\|h\|)^2$

Let $K:X \times X \to \mathbb R$ be a (positive-definite) kernel and let $H$ be the induced reproducing kernel Hilbert space (RKHS). Fix $(x_1,y_1),\ldots,(x_n,y_n) \in X \times \mathbb R$. For $t \ge ...
asked by dohmatob
5 votes
1 answer
356 views

Why is this nonlinear transformation of an RKHS also an RKHS?

I came across this paper (beginning of page 6) where they stated that if $f,h\in \mathcal{H}$, where $\mathcal{H}$ is an RKHS, then $l_{h,f}=\left|f(x)-h(x)\right|^q$ where $q\geq 1$ also belongs to ...
asked by Kashif
3 votes
0 answers
342 views

Analytic formula for the eigenvalues of kernel integral operator induced by Laplace kernel $K(x,x') = e^{-c\|x-x'\|}$ on unit-sphere in $\mathbb R^d$

Let $d \ge 2$ be an integer and let $X=\mathcal S_{d-1}$ be the unit sphere in $\mathbb R^d$. Let $\tau_d$ be the uniform distribution on $X$. Define a function $K:X \times X \to \mathbb R$ by $K(x,y) := ...
asked by dohmatob
1 vote
0 answers
213 views

Variance-based localized Rademacher complexity for RKHS unit-ball

Let $\mathscr X$ be a compact subset of $\mathbb R^d$ (e.g., the unit sphere). Let $K: \mathscr X \times \mathscr X \to \mathbb R$ be a positive kernel function and let $\mathscr H_K$ be the induced ...
asked by dohmatob
0 votes
0 answers
312 views

Lower-bound on expected value of norm of transformation of random vector with iid Rademacher coordinates

Let $n$ be a large positive integer. Let $A$ be a positive-definite matrix with eigenvalues $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n$ such that $\lambda_n = o(1) \to 0$ and $\lambda_i=\...
asked by dohmatob
1 vote
0 answers
93 views

Concentration for $\sum_{i=1}^n y_i \psi(x_i^\top u)$, for $y_1,\ldots,y_n \sim \{\pm 1\}$ and $x_1,\ldots,x_n$ uniform iid on hypersphere

Let $y_1,\ldots,y_n$ be drawn iid uniformly from $\{\pm 1\}$ and let $x_1,\ldots,x_n$ be drawn iid uniformly from the $(d-1)$-dimensional unit sphere $\mathbb S_{d-1}$, and independently from ...
asked by dohmatob
3 votes
1 answer
283 views

Games and the right mathematical framework for GANs

Generative Adversarial Networks were introduced in http://papers.nips.cc/paper/5423-generative-adversarial-nets, a paper with more than 20,000 citations. It is an important topic within deep learning. Are ...
asked by Turbo
1 vote
0 answers
77 views

Covering number after projection

In these lecture notes on Statistical Learning Theory we find the following definitions for covering numbers: Definition. Let $(\mathcal{W}, d)$ be a metric space and $\mathcal{F} \subset \mathcal{W}$...
asked by Jonas Metzger
1 vote
1 answer
261 views

Finite VC dimension > the number of free parameters

I'm looking for an example of the following: A hypothesis class $\mathcal{H}$ such that $\forall h \in \mathcal{H}$, the number of free parameters of $h$ is equal to $n \in \mathbb{N}$ (where $n < ...
asked by keyboardAnt
0 votes
1 answer
255 views

How large a sample $m$ is enough [closed]

I have a probability distribution $D$ over $X = R^d$. I have two samples $s_1$ and $s_2$ from $D$, of sizes $m_1$ and $m_2$, and a unit ball centered at the origin, $B(0)=\{x \in R^2: \|...
1 vote
1 answer
51 views

Fast rates in ERM: Extreme case of low-noise assumption implies non-differentiability

Some context: I am going through some literature on empirical risk minimization for bipartite ranking [1] that shows how certain "low-noise" conditions lead to fast rates of convergence of ...
asked by dmh
8 votes
4 answers
2k views

How to learn a continuous function?

Let $\Omega \subset \mathbb{R}^m$ be a bounded open subset with a smooth boundary. Problem: Given any bounded continuous function $f:\Omega\to\mathbb{R}$, can we learn it to a given accuracy $\...
asked by Rajesh D
1 vote
1 answer
489 views

Upper bounding VC dimension of an indicator function class

I would like to upper bound the VC dimension of the function class $ F$ defined as follows: $$ F := \left\{ (x,t) \mapsto \mathbb{1} \left( c_Q\min_{q \in Q} {\|x-q \|}_1 - t > 0 \right) \; | \; Q ...
asked by ato_42
10 votes
1 answer
582 views

Abstract mathematical concepts/tools appeared in machine learning research

I am interested in knowing about abstract mathematical concepts, tools or methods that have come up in theoretical machine learning. By "abstract" I mean something that is not immediately related to ...
0 votes
2 answers
270 views

Statistical divergence

Does anyone know about a statistical divergence of this type? \begin{equation} \text{D}(P||Q) = \frac{1}{2} \left[\text{KL}(M||P) + \text{KL}(M||Q)\right] \end{equation} where $M = \frac{1}{2} [P+Q]$....
asked by Apprentice
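The displayed quantity in the question above is a symmetrized divergence closely related to the Jensen–Shannon divergence, except that the KL arguments are reversed ($\mathrm{KL}(M\|P)$ rather than $\mathrm{KL}(P\|M)$). A quick numeric sketch for discrete distributions with full support (the function names are illustrative):

```python
import numpy as np

def kl(p, q):
    # KL(p || q) for discrete distributions; requires q > 0 wherever p > 0.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def d(p, q):
    # D(P||Q) = (1/2) [KL(M||P) + KL(M||Q)], M = (P+Q)/2, as in the question.
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * (kl(m, p) + kl(m, q))

p = [0.5, 0.4, 0.1]
q = [0.2, 0.3, 0.5]
print(d(p, q), d(q, p))  # equal: D is symmetric by construction
```

Like Jensen–Shannon, this $D$ is symmetric and vanishes iff $P = Q$; unlike Jensen–Shannon, it is infinite unless the supports of $P$ and $Q$ coincide, since $M$ appears on the left of each KL term.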
1 vote
1 answer
517 views

Why do we use Rademacher complexity for generalization error when we can have a trained function?

Let $G$ be a family of functions mapping from $Z$ to $[a, b]$ and $S=\left(z_{1}, \ldots, z_{m}\right)$ a fixed sample of size $m$ with elements in $Z$ . Then, the empirical Rademacher complexity of $...
asked by lee
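For intuition on the quantity defined in the question above, the empirical Rademacher complexity of a finite class can be estimated by Monte Carlo directly from its definition $\widehat{\mathfrak R}_S(G) = \mathbb E_\sigma\big[\sup_{g\in G} \frac{1}{m}\sum_{i=1}^m \sigma_i g(z_i)\big]$. A sketch (the toy class of random functions is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(values, n_mc=2000, rng=rng):
    """Monte Carlo estimate of the empirical Rademacher complexity of a
    finite class G, given as a (|G|, m) matrix of values g(z_i)."""
    k, m = values.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_mc, m))  # Rademacher signs
    # For each draw of sigma, take sup over g of (1/m) sum_i sigma_i g(z_i).
    sups = (sigma @ values.T / m).max(axis=1)
    return float(sups.mean())

# Toy class: 5 functions with values in [-1, 1] at m = 50 sample points.
values = rng.uniform(-1.0, 1.0, size=(5, 50))
print(empirical_rademacher(values))
```

The estimate sits well below 1 (the range bound) and shrinks as $m$ grows, which is the behavior the uniform-convergence bounds built on this quantity exploit.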
23 votes
1 answer
3k views

Relation between information geometry and geometric deep learning

Disclaimer: This is a cross-post from a very similar question on math.SE. I allowed myself to post it here after reading this meta post about cross-posting between mathoverflow and math.SE, I did ...
asked by Blupon
6 votes
0 answers
108 views

Functional Equation of Zeta Function on Statistical Model

I've been studying [1] because I was interested in his ideas on the zeta function. I'll define it here (cf. p. 31): The Kullback-Leibler distance is defined as $$ K(w)=\int q(x)f(x, w)dx\quad f(x,w)...
asked by Matt Cuffaro
0 votes
0 answers
124 views

Function classes with high Rademacher complexity

My question is twofold. Is there any general understanding of what makes a function class have high Rademacher complexity? (Sudakov minoration would say that one sufficient condition for a class of ...
asked by gradstudent
2 votes
0 answers
193 views

Shattering with sinusoids

Let $d \geq 2$ and $K$ some positive integer. Consider distinct points $\theta_1, \ldots, \theta_K\in \mathbb{T}^d$ and (not necessarily distinct) $z_1, \ldots, z_K \in \{-1,1\}$ such that $\sum\...
asked by Rajesh D
0 votes
2 answers
275 views

Use covering number to get uniform concentration from pointwise concentration

Let $\Theta$ be a subset of a metric space. Suppose $(X_\theta)_{\theta \in \Theta}$ is a random process on $\Theta$ which is $L$-Lipschitz and has the property that there exist constants $A, B>0$...
asked by dohmatob
1 vote
1 answer
363 views

Growth rate of bounded Lipschitz functions on compact finite-dimensional space

Let $\mathcal X$ be a metric space of diameter $D$ and "dimension" (e.g., doubling dimension) $d$. Let $L \in [0, \infty]$ and $M \in [0, \infty)$ and consider the class $\mathcal H_{M,L}$ of $L$-...
asked by dohmatob
1 vote
2 answers
318 views

Is it possible to “solve” iterative (convex/non-convex) optimization problems via learning (one-shot)?

I posted a following question in MSE, but I think it should be posted here in MO. Since I don't know how to transfer the post from MSE to MO, I have pasted the question below. Thank you in advance and ...
asked by user550103
2 votes
2 answers
484 views

Lower bound on misclassification rate of Lipschitz functions in terms of Lipschitz constant

Important note: @MateuszKwaśnicki in the comment section has raised a fundamental issue with the current statement of the problem. I'm trying to bugfix it. Setup: I wish to show that a Lipschitz ...
asked by dohmatob
8 votes
2 answers
1k views

VC dimension, fat-shattering dimension, and other complexity measures of a class of BV functions

I wish to show that a function which is "essentially constant" (defined shortly) can't be a good classifier (machine learning). For this I need to estimate the "complexity" of such a class of ...
asked by dohmatob
3 votes
0 answers
279 views

From the Sudakov minoration principle to lower bounds on Rademacher complexity

For a compact subset $S \subset \mathbb{R}^n$ (and an implicit metric $d$ on it) and $\epsilon >0$, let's define the following two standard quantities. Let ${\cal P}(\epsilon,S,d)$ be the $\epsilon-...
asked by gradstudent