Questions tagged [machine-learning]

The tag has no usage guidance.

2 votes
1 answer
40 views

Non-linear transforms of RKHS question

I was reading the paper Norm Inequalities in Nonlinear Transforms (referenced in this question) but ran into difficulties, so I was wondering if anyone could help? I think I follow the paper until I ...
Mat's user avatar
  • 41
52 votes
10 answers
6k views

A clear map of mathematical approaches to Artificial Intelligence

I have recently become interested in Machine Learning and AI as a student of theoretical physics and mathematics, and have gone through some of the recommended resources dealing with statistical ...
1 vote
0 answers
59 views

Approximation of a continuous function by a multilayer ReLU neural network

For a continuous/Hölder function $f$ defined on a compact set $K$, a fixed $L$ and $m_1,m_2,\dots,m_L$, can we find a multilayer fully connected ReLU network $g$ with depth $L$ whose $i$-th layer has width ...
Hao Yu's user avatar
  • 773
1 vote
2 answers
191 views

Beating the $1/\sqrt n$ rate of uniform-convergence over a linear function class

Let $P$ be a probability distribution on $\mathbb R^d \times \mathbb R$, and let $(x_1,y_1), \ldots, (x_n,y_n)$ be an iid sample of size $n$ from $P$. Fix $\epsilon,t\gt 0$. For any unit-vector $w \in ...
dohmatob's user avatar
  • 6,586
1 vote
0 answers
100 views

Matrix valued word embeddings for natural language processing

In natural language processing, an area of machine learning, one would like to represent words as objects that can easily be understood and manipulated using machine learning. A word embedding is a ...
Joseph Van Name's user avatar
3 votes
1 answer
145 views

Why is the logistic regression model good? (and its relation with maximizing entropy)

Suppose we're trying to train a classifier $\pi$ for $k$ classes that takes as input a feature vector $x\in\mathbb{R}^n$ and outputs a probability vector $\pi(x)\in\mathbb{R}^k$ such that $\sum_{v=1}^...
stupid_question_bot's user avatar
9 votes
1 answer
283 views

Who introduced the term hyperparameter?

I am trying to find the earliest use of the term hyperparameter. Currently, it is used in machine learning but it must have had earlier uses in statistics or optimization theory. Even the multivolume ...
AChem's user avatar
  • 813
2 votes
0 answers
77 views

Equivalence of score function expressions in SDE-based generative modeling

I am studying the paper "Score-Based Generative Modeling through Stochastic Differential Equations" (arXiv:2011.13456) by Song et al. The authors use the following loss function (Equation 7 ...
Po-Hung Yeh's user avatar
8 votes
1 answer
480 views

Geometric formulation of the subject of machine learning

Question: what is the geometric interpretation of the subject of machine learning and/or deep learning? Being "forced" to have a closer look at the subject, I have the impression that it ...
Manfred Weis's user avatar
  • 12.5k
1 vote
0 answers
95 views

Solutions to the problems in "Algebra, Topology, Differential Calculus, and Optimization Theory For Computer Science and Machine Learning" [closed]

Where can I find the solutions to the problems in the book "Algebra, Topology, Differential Calculus, and Optimization Theory For Computer Science and Machine Learning"?
zdo0x0's user avatar
  • 11
3 votes
0 answers
38 views

Prove the convergence of the LASSO model in the presence of restricted eigenvalues

I am researching the properties of the Lasso model $\hat \beta:= \operatorname{argmin} \{\|Y-X\beta\|_2^2/n+\lambda\|\beta\|_1\}$, specifically its convergence when the data satisfies restricted ...
GGbond's user avatar
  • 39
8 votes
0 answers
119 views

Worst margin when halving a hypercube with a hyperplane

Consider the $n$-cube $C_n=\lbrace-1,1\rbrace^n$ and the problem of partitioning it into halves with hyperplanes through the origin that avoid all its points. We can parameterize the hyperplanes by ...
Veit Elser's user avatar
  • 1,041
1 vote
0 answers
49 views

Curve fitting with "rough" loss functions

Many real-valued classification and regression problems can be framed as minimization in the following way. Setup: Let $\Theta \in \mathbb{R}^p$ be the parameter space that we are searching over. For ...
Simon Kuang's user avatar
2 votes
0 answers
423 views

Mathematics research relating to machine learning

Which branch(es) of math are most relevant to enhancing machine learning (mostly in terms of practical use as opposed to theoretical/possible use)? Specifically, I want to know about math research used ...
Artus's user avatar
  • 141
1 vote
1 answer
97 views

Adjoint sensitivity analysis for a cost functional under an ODE constraint

I am trying to recover the result given by equation 10 in the article here. I am unable to get rid of the integral, any help would be much appreciated. To keep the description as self contained as ...
Abhi. A's user avatar
  • 55
2 votes
0 answers
50 views

Convergence of minimiser of empirical risk to minimiser of population risk

Let $X_1, \dots, X_n \sim \mu$ be some random elements of a space $\mathcal{X}$. Let $H$ be a Hilbert space of functions $f: S \to \Re$ with norm $\|\cdot\|_H$. Let $\|f^*\|_{L_2(\mu)} < \infty$ ...
user27182's user avatar
  • 315
2 votes
0 answers
42 views

Can we get a family of classifiers $\left\{f_n\right\}_{n \in N}$ such that $\lim_{n\to\infty} (E_{(X_1, Y_1), \ldots,(X_n, Y_n) \sim \rho}[R(f_n)]-R(f_B))=0$?

For a given classifier $f: \mathbb{R}^d \mapsto\{0,1,2\}$, let $$ R(f):=\mathbb{E}_{(X, Y) \sim \rho}\left[\mathbb{1}_{f(X) \neq Y}\right] $$ and let $f_B$ be the Bayes classifier. Can we get a family of ...
fantacy_crs's user avatar
3 votes
0 answers
48 views

How to prove that the empirical risk converges to the expected risk as $n\to \infty$?

For example, in classical binary classification with $x \in \mathbb{R}^d$ and $y \in\{0,1\}$, let the empirical risk be $R_{\ell}^n(f):=\frac{1}{n} \sum_{i=1}^n \ell\left(f\left(X_i\right), Y_i\right)$ and ...
fantacy_crs's user avatar
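The convergence asked about here can be illustrated numerically for one fixed classifier. The following is a minimal sketch, not a proof: the toy distribution (uniform $X$ on $[0,1]$, labels flipped with probability 0.1), the threshold classifier, and the function name `empirical_risk` are all assumptions made for illustration. Under these assumptions the expected 0-1 risk is exactly 0.1, and the law of large numbers suggests the empirical risk should approach it.

```python
import random

def empirical_risk(n, seed=0, flip_prob=0.1):
    """Empirical 0-1 risk of the fixed classifier f(x) = 1[x > 0.5].

    Hypothetical toy model: X ~ Uniform(0, 1), and Y equals 1[X > 0.5]
    except that it is flipped with probability flip_prob.
    The expected risk of f is therefore exactly flip_prob.
    """
    rng = random.Random(seed)
    errors = 0
    for _ in range(n):
        x = rng.random()
        clean_label = 1 if x > 0.5 else 0
        y = 1 - clean_label if rng.random() < flip_prob else clean_label
        prediction = 1 if x > 0.5 else 0
        errors += int(prediction != y)
    return errors / n

if __name__ == "__main__":
    for n in (100, 10_000, 100_000):
        print(n, empirical_risk(n))
```

This only covers a single fixed $f$; uniform convergence over a whole function class needs additional tools (for example VC or Rademacher bounds).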
2 votes
1 answer
81 views

VC-based risk bounds for classifiers on finite set

Let $X$ be a finite set and let $\emptyset\neq \mathcal{H}\subseteq \{ 0,1 \}^{X}$. Let $\{(X_n,L_n)\}_{n=1}^N$ be i.i.d. random variables on $X\times \{0,1\}$ with law $\mathbb{P}$. ...
Math_Newbie's user avatar
4 votes
1 answer
264 views

Perceptron / logistic regression accuracy on the n-bit parity problem

$\DeclareMathOperator{\sgn}{sign}$The perceptron (similarly, logistic regression) of the form $y=\sgn(w^T \cdot x+b)$ is famously known for its inability to solve the XOR problem, meaning it can get ...
ido4848's user avatar
  • 141
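The XOR limitation mentioned in this excerpt can be checked by brute force. This is a small sketch under stated assumptions: inputs are encoded as $\{0,1\}^2$, labels as $\pm 1$, and the weight grid is an arbitrary illustrative choice, so the search is evidence rather than a proof. No threshold unit $y=\operatorname{sign}(w^T x+b)$ classifies all four XOR points correctly, so the best achievable accuracy is 3 out of 4.

```python
from itertools import product

# The four 2-bit inputs and their XOR labels encoded as +1 / -1.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [1 if a != b else -1 for a, b in points]

def accuracy(w1, w2, b):
    """Fraction of XOR points classified correctly by sign(w1*x1 + w2*x2 + b)."""
    correct = 0
    for (x1, x2), y in zip(points, labels):
        prediction = 1 if w1 * x1 + w2 * x2 + b > 0 else -1
        correct += int(prediction == y)
    return correct / len(points)

# Illustrative grid of weights and biases from -2.0 to 2.0 in steps of 0.5.
grid = [i / 2 for i in range(-4, 5)]
best_xor_accuracy = max(accuracy(w1, w2, b) for w1, w2, b in product(grid, repeat=3))

if __name__ == "__main__":
    print(best_xor_accuracy)
```

For example, $w=(1,1)$, $b=-0.5$ gets every point right except $(1,1)$.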
1 vote
0 answers
32 views

Convergent gradient-type scheme for solving smooth nonconvex constrained optimization problem

Let $x_1,\ldots,x_n \in \mathbb R^d$ and $y_1,\ldots,y_n \in \{\pm 1\}$, and $\epsilon, h \gt 0$. Define $\theta(t) := Q((t-\epsilon)/h)$, where $Q(z) := \int_{z}^\infty \phi(u)\,\mathrm{d}u$ is the ...
dohmatob's user avatar
  • 6,586
3 votes
0 answers
126 views

What is the meaning of big-O of a random variable?

I encountered this problem in the book "Pattern Recognition and Machine Learning" by Christopher M. Bishop. I excerpt it below: [screenshot of the book]. In the excerpt, the big-O notation $O(\xi^...
zzzhhh's user avatar
  • 31
2 votes
0 answers
70 views

Training an energy-based model (EBM) using MCMC

I'm reading this paper about training energy-based models (EBMs) and don't understand which parameters we are training. The part relevant to the question is on pages 1-4. Here is the ...
Garfield's user avatar
  • 201
1 vote
0 answers
145 views

How to maximize a certain function of hundreds of variables related to correlations between sets of vectors? (and win Kaggle :))

This might be helpful for a data science/bioinformatics challenge. Consider, for simplicity, three rectangular matrices $Y_{true}$, $Y_{predict0}$, $Y_{predict1}$ of the same size, say $70000 \times 140$. Let us ...
Alexander Chervov's user avatar
2 votes
0 answers
80 views

Nuclear norm minimization of convolution matrix (circular matrix) with fast Fourier transform

I am reading the paper "Recovery of Future Data via Convolution Nuclear Norm Minimization". It contains a definition of the convolution matrix. Given any vector $\boldsymbol{x}=(x_1,x_2,\ldots,x_n)^...
Xinyu Chen's user avatar
1 vote
0 answers
75 views

Distribution-free learning vs distribution-dependent learning

I came across some papers studying the problem of distribution-free learning, and I am interested in knowing the exact definition of distribution-free learning. I have searched some literature: In ...
yinan's user avatar
  • 11
4 votes
0 answers
116 views

Progress on "Un-Alching" ML?

So, a couple of years ago I watched both Ali Rahimi's NIPS speech "Machine Learning is Alchemy", (where he talks about how the field lacks a solid, overarching, theoretical foundation) and ...
dicaes's user avatar
  • 41
2 votes
0 answers
41 views

Combining SVD subspaces for low dimensional representations

Suppose we have matrix $A$ of size $N_t \times N_m$, containing $N_m$ measurements corrupted by some (e.g. Gaussian) noise. An SVD of this data $A = U_AS_A{V_A}^T$ can reveal the singular vectors $U_A$...
user2600239's user avatar
1 vote
0 answers
102 views

Can I minimize a mysterious function by running gradient descent on its neural net approximations? [closed]

A cross-post from AI StackExchange. I have a function, let's call it $F:[0,1]^n \rightarrow \mathbb{R}$, with, say, $10 \le n \le 100$. I want to find some $x_0 \in [0,1]^n$ such that $F(x_0)$ is ...
Vladimir Zolotov's user avatar
1 vote
0 answers
52 views

How to calculate the uniform entropy or VC dimension of the following class of functions?

When dealing with a U-process, I need to calculate the following uniform entropy. For any $\eta>0$, function class $\mathcal{F}$ containing functions $f=\left(f_{i, j}\right)_{1 \leq i \neq j \leq n}: \...
leslie zhang's user avatar
3 votes
1 answer
229 views

Independent input feature z can be removed: if y=f(x+z,z), then y=g(x)?

Let $y\in \mathbb{R}$ and $\mathbf{x},\mathbf{z}\in\mathbb{R}^p$ be a random variable and random vectors, respectively. Assume $y=f(\mathbf{x}+\mathbf{z},\mathbf{z})$ for some function $f$. Is the following statement ...
John's user avatar
  • 195
1 vote
0 answers
46 views

Sample Complexity/PAC-Learning Notation

In PAC Learning, Sample Complexity is defined as: The function $m_\mathcal{H} : (0,1)^2 \rightarrow \mathbb{N}$ determines the sample complexity of learning $\mathcal{H}$: that is, how many examples ...
user490208's user avatar
1 vote
0 answers
139 views

Stochastic Gradient Descent

I am not really sure how to approach this question, as I am a beginner in optimisation. Consider the function $f : B_1 \to \mathbb{R}$ with $f(x) = \left\lVert x \right\rVert_2^2$ and $B_1$ := {$...
Jacob Zhang's user avatar
5 votes
2 answers
274 views

Entropy & difference between max and min values of probability mass

Let $X$ be a random variable with probability mass function $p(x) = \mathbb{P}[X = x]$. I know entropy $H(X)$ of $X$ measures the uncertainty of $X$ and a large value of $H(X)$ means $p(x)$ is nearly ...
aest's user avatar
  • 143
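To make the two quantities in this question concrete, the sketch below computes the Shannon entropy (in bits) and the gap between the largest and smallest probability mass for two example distributions. The function names and the example distributions are assumptions chosen for illustration; the code does not establish any general inequality between the two quantities.

```python
import math

def entropy_bits(p):
    """Shannon entropy H(X) = -sum p(x) log2 p(x), with 0 log 0 taken as 0."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def mass_gap(p):
    """Difference between the largest and smallest probability mass."""
    return max(p) - min(p)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

if __name__ == "__main__":
    for name, p in (("uniform", uniform), ("skewed", skewed)):
        print(name, entropy_bits(p), mass_gap(p))
```

In this pair of examples the uniform distribution has the maximal entropy of 2 bits and zero gap, while the skewed one has lower entropy and a larger gap.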
1 vote
1 answer
177 views

Using Hoeffding inequality for risk / loss function

I have a question about Hoeffding's inequality, which states that for data points $X_1, \dots, X_n \in X$, which are i.i.d. according to a probability measure $P$ on $X$, we find an upper bound for: $...
Mathematiger's user avatar
20 votes
3 answers
3k views

How can Machine Learning help “see” in higher dimensions?

The news that DeepMind had helped mathematicians in research (one in representation theory, and one in knot theory) certainly got many thinking, what other projects could AI help us with? See MO ...
liuyao's user avatar
  • 485
2 votes
0 answers
196 views

Covering/Bracketing number of monotone functions on $\mathbb{R}$ with uniformly bounded derivatives

I am interested in the $\| \cdot \|_{\infty}$-norm bracketing number or covering number of some collection of distribution functions on $\mathbb{R}$. Let $\mathcal{F}$ consist of all distribution ...
masala's user avatar
  • 93
1 vote
0 answers
95 views

Limit cycles or stable solutions for k-dimensional piece-wise linear ODEs

Restless multi-armed bandits, a branch of reinforcement learning, have been shown to be PSPACE-hard, but Whittle has offered an implementable solution called the Whittle Index Policy. Weber and Weiss ...
KLiu's user avatar
  • 41
1 vote
0 answers
86 views

If two functions are close, can I prove that the difference of their empirical losses is also small?

I am trying to understand the proof of Theorem 3 in the paper "A Universal Law of Robustness via Isoperimetry" by Bubeck and Sellke. Basically, there exists at least one $w_{L,e}$ in $\...
2 votes
0 answers
43 views

Convergent algorithm for minimizing nonconvex smooth function

Let $\Phi$ be the Gaussian CDF, and for $\gamma\ge 0$ and $h>0$, define a loss function $\ell_{\gamma,h}:\{\pm 1\} \times \mathbb R$ by $$ \ell_{\gamma,h}(y,y') := \phi_{\gamma,h}(yy') := \Phi((yy'-\gamma)/h)...
dohmatob's user avatar
  • 6,586
0 votes
0 answers
32 views

Normalizing a parameter in a regression

I am wondering whether a parameter in my regression, say the $\lambda$ in a ridge regression, can somehow be constrained to a range: $\lambda \in [0,1]$. Do you have any ideas how I ...
SUMQXDT's user avatar
0 votes
0 answers
70 views

Shattering of a set of binary classifiers

Let $S$ be a set, and let $\mathcal{F}_{S}=\{f:S\to\{-1,+1\}\}$ be a set of different label assignments. Show that $\mathcal{F}_{S}$ shatters at least $|\mathcal{F}_{S}|$ subsets of $S$. Here is what ...
cbyh's user avatar
  • 143
1 vote
0 answers
72 views

Converting an indexed equation to a matrix one

I am helping a friend with a project involving neural networks and he wants to convert this equation into matrix notation: $$w_{ij} = \sum_{n=1}^N\left[\sum_{i=1}^I(r_{in}-y_{in})v_{ih}\right](1-z_{hn}...
user3308874's user avatar
3 votes
0 answers
194 views

What is the VC-dimension of regular convex k-gons in the plane?

Recall the relevant definitions: Let $H$ be a family of sets in $\mathbb{R}^d$. The intersection of $H$ with a point set $C$ is defined as $H\cap C = \{h\cap C\mid h\in H\}$. The VC-dimension of $H$ (...
Tassle's user avatar
  • 131
2 votes
1 answer
149 views

Derive equation for regularized logistic regression with batch updates

I am trying to understand this paper by Chapelle and Li "An Empirical Evaluation of Thompson Sampling" (2011). In particular, I am failing to derive the equations in algorithm 3 (page 6). ...
denvercoder9's user avatar
4 votes
1 answer
609 views

The ODE modeling for gradient descent with decreasing step sizes

The gradient descent (GD) with constant stepsize $\alpha^{k}=\alpha$ takes the form $$x^{k+1} = x^{k} -\alpha\nabla f(x^{k}).$$ Then, by constructing a continuous-time version of GD iterates ...
lazyleo's user avatar
  • 63
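The link between constant-stepsize GD and a continuous-time model can be seen on a one-dimensional example. The sketch below assumes the standard gradient flow $\dot x(t) = -\nabla f(x(t))$ as the continuous-time model and the test function $f(x) = x^2/2$; both are illustrative choices, not necessarily the construction used in the question. For this $f$, the GD iterates are $x^k = (1-\alpha)^k x^0$ and the flow is $x(t) = x^0 e^{-t}$, so the two should nearly agree at time $t = k\alpha$ when $\alpha$ is small.

```python
import math

def gd_iterate(x0, alpha, steps):
    """Run constant-stepsize gradient descent on f(x) = x^2 / 2."""
    x = x0
    for _ in range(steps):
        grad = x  # gradient of x^2 / 2
        x = x - alpha * grad
    return x

def gradient_flow(x0, t):
    """Exact solution of x'(t) = -x(t) with x(0) = x0."""
    return x0 * math.exp(-t)

if __name__ == "__main__":
    x0, t = 1.0, 1.0
    for alpha in (0.1, 0.01, 0.001):
        steps = round(t / alpha)
        print(alpha, gd_iterate(x0, alpha, steps), gradient_flow(x0, t))
```

Shrinking $\alpha$ while keeping $k\alpha$ fixed reduces the gap between the discrete iterate and the flow, which is the usual justification for the ODE viewpoint.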
2 votes
1 answer
148 views

Representer theorem for a loss / functional of the form $L(h) := \sum_{i=1}^n (|h(x_i)-y_i|+t\|h\|)^2$

Let $K:X \times X \to \mathbb R$ be a (positive-definite) kernel and let $H$ be the induced reproducing kernel Hilbert space (RKHS). Fix $(x_1,y_1),\ldots,(x_n,y_n) \in X \times \mathbb R$. For $t \ge ...
dohmatob's user avatar
  • 6,586
1 vote
0 answers
33 views

Correlating two matrices $A,B$ with stochastic dependency structure imposed by cross-validation

Consider a labelled data set $$D = \{(x_1, y_1),...,(x_n, y_n)\} $$ on which we want to evaluate a machine learning algorithm using $k$-fold cross validation with $m$ different random seeds. This ...
Joker123's user avatar
  • 153
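The evaluation protocol described in this excerpt, $k$-fold cross-validation repeated with $m$ random seeds, can be sketched as follows. The helper names `kfold_indices` and `repeated_kfold`, and the use of the seed values $0,\ldots,m-1$, are assumptions for illustration. The point of the sketch is that every seed yields a different partition of the same $n$ indices, which is the source of the dependency structure between the resulting score matrices.

```python
import random

def kfold_indices(n, k, seed):
    """Shuffle the indices 0..n-1 with the given seed and split them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Striding assigns every k-th shuffled index to the same fold.
    return [indices[i::k] for i in range(k)]

def repeated_kfold(n, k, m):
    """Return the k folds for each of m seeds (hypothetical seeds 0..m-1)."""
    return {seed: kfold_indices(n, k, seed) for seed in range(m)}

if __name__ == "__main__":
    splits = repeated_kfold(n=10, k=5, m=3)
    for seed, folds in splits.items():
        print(seed, folds)
```

Each fold serves once as the test set while the remaining folds form the training set, so scores from different seeds are computed on overlapping data.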
2 votes
1 answer
83 views

How to fit a set of parametrized data to a parametrized distribution?

I have a time series $d_i(a)$ which depends on the parameter $a$. On the other hand, I have a sequence of normal distributions $\mathcal{N}(0,Q_i(a))$, where the variance $Q_i$ depends on time and ...
ycz's user avatar
  • 51
2 votes
0 answers
36 views

Stochastic gradient descent in 'stronger' settings

I am minimizing a function $F(x) = \mathbb E(f(x,\Xi))$, where $\Xi$ is some random variable, by a stochastic gradient descent that generates a random number $\xi$ from the distribution of $\Xi$ at each ...
lrnv's user avatar
  • 653
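The scheme described in this question, drawing a fresh $\xi$ from the distribution of $\Xi$ at each step and moving along the sampled gradient, can be sketched on a toy problem. The choices below are assumptions for illustration only: $f(x,\xi) = \tfrac12 (x-\xi)^2$, $\Xi \sim \mathcal N(3, 1)$, and stepsize $1/k$. With these choices $F(x) = \mathbb E(f(x,\Xi))$ is minimized at $x = \mathbb E(\Xi) = 3$, and the iterate after $k$ steps equals the running mean of the samples.

```python
import random

def sgd_toy(num_steps, mean=3.0, seed=0):
    """SGD on F(x) = E[0.5 * (x - Xi)^2] with Xi ~ N(mean, 1) and stepsize 1/k."""
    rng = random.Random(seed)
    x = 0.0
    for k in range(1, num_steps + 1):
        xi = rng.gauss(mean, 1.0)   # fresh sample of Xi at each step
        grad = x - xi               # gradient of 0.5 * (x - xi)^2 in x
        x -= grad / k               # decreasing stepsize 1/k
    return x

if __name__ == "__main__":
    for steps in (10, 1_000, 20_000):
        print(steps, sgd_toy(steps))
```

The $1/k$ stepsize is one common choice that satisfies the usual Robbins-Monro conditions; other decreasing schedules are possible.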