Anna Korba (Sampling through Optimization of Discrepancies)

The task of sampling is to approximate a target measure $\mu^*$ given only partial information about it, such as an unnormalised density or a discrete approximation.

The quality of samples $x_1,\dots,x_n$ is measured by the integration error $|\frac{1}{n} \sum_{i=1}^n f(x_i) - \int f(x)\,\mathrm{d} \mu^*(x)|$ for test functions $f$.
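As a toy illustration of this criterion (my own example, not from the talk): for a standard Gaussian target and $f(x) = x^2$, the error can be checked directly against the known integral.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.standard_normal(n)       # i.i.d. samples from mu* = N(0, 1)
f = x**2                         # test function f(x) = x^2, exact integral E[x^2] = 1
error = abs(f.mean() - 1.0)      # |1/n sum_i f(x_i) - int f dmu*|
print(f.mean(), error)
```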

Reference: Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm, Liu and Wang, 2016.

Link to slides

The optimisation of a neural network can be mapped to the problem of approximating $\mu^{*}$, in the setting of a single-layer neural network with infinite width.

As an optimisation problem, this can be tackled with gradient flow with respect to the Wasserstein metric over probability measures on $\mathbb{R}^d$. To implement the gradient flow, particles are used to approximate the flow of measures.
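As a hedged illustration of such a particle system, here is a minimal sketch in the spirit of SVGD from the Liu and Wang reference above, assuming a standard Gaussian target so that the score is simply $-x$; kernel choice, step size, and particle count are my own arbitrary choices.

```python
import numpy as np

def rbf_kernel(x, h=1.0):
    """Pairwise RBF kernel k(x_i, x_j) and its gradient with respect to x_i."""
    diff = x[:, None, :] - x[None, :, :]         # (n, n, d), diff[i, j] = x_i - x_j
    k = np.exp(-(diff**2).sum(-1) / (2 * h**2))  # (n, n)
    grad_k = -diff * k[..., None] / h**2         # (n, n, d), d/dx_i k(x_i, x_j)
    return k, grad_k

def svgd_step(x, score, step=0.1, h=1.0):
    """One SVGD update: x_i += step * (1/n) sum_j [k(x_j, x_i) score(x_j) + grad_{x_j} k(x_j, x_i)]."""
    n = x.shape[0]
    k, grad_k = rbf_kernel(x, h)
    phi = (k @ score(x) + grad_k.sum(axis=0)) / n   # grad_k.sum(0)[i] = sum_j d/dx_j k(x_j, x_i)
    return x + step * phi

score = lambda x: -x                                # score of a standard Gaussian target (assumed)
rng = np.random.default_rng(0)
particles = rng.normal(loc=5.0, scale=0.5, size=(200, 2))
for _ in range(500):
    particles = svgd_step(particles, score)
print(particles.mean(axis=0), particles.std(axis=0))  # should approach ~[0, 0] and ~[1, 1]
```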

As gradient descent is central to machine learning, it is not surprising that gradient flows enter the picture; particle systems enter as approximations of the gradient flow of measures. The objective functionals over probability measures are divergences in this setting, with different choices available such as the KL divergence (relative entropy), the chi divergence, and the MMD (maximum mean discrepancy). The MMD is defined via a supremum over functions in the unit ball of a reproducing kernel Hilbert space (RKHS): the supremum of the difference of integrals against the two measures defines the metric known as MMD. She proposed a new objective, a convex combination of MMD and chi divergence known as DMMD, and in her talk she argued that this new objective is useful in the discrete case.
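For concreteness, a minimal sketch of a plug-in estimator of the squared MMD between two sample sets with an RBF kernel; the kernel, bandwidth, and toy data are my own choices, not those from the talk.

```python
import numpy as np

def mmd_squared(x, y, h=1.0):
    """Plug-in (V-statistic) estimate of MMD^2 between samples x and y
    for the RBF kernel k(a, b) = exp(-|a - b|^2 / (2 h^2))."""
    def gram(a, b):
        sq = ((a[:, None, :] - b[None, :, :])**2).sum(-1)
        return np.exp(-sq / (2 * h**2))
    return gram(x, x).mean() + gram(y, y).mean() - 2 * gram(x, y).mean()

rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, size=(500, 1))
x2 = rng.normal(0.0, 1.0, size=(500, 1))   # same distribution as x1
y = rng.normal(0.5, 1.0, size=(500, 1))    # shifted distribution
print(mmd_squared(x1, x2), mmd_squared(x1, y))  # near zero vs. clearly positive
```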

Jiaxin Shi (Stein's Method for Modern Machine Learning: From Gradient Estimation to Generative Modeling). This talk uses Stein's method, more precisely to construct a discrete-state Markov process whose stationary distribution is a given distribution. He is interested in the generator of this process, which serves as the Stein operator in his applications.

Stein operator
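To spell out the standard generator construction (background, not specific to this talk): if $\mathcal{A}$ is the generator of a Markov process with stationary distribution $p$, then $\mathbb{E}_{x\sim p}[(\mathcal{A}f)(x)] = 0$ for all $f$ in its domain, so $\mathcal{A}$ can be used as a Stein operator for $p$. For the overdamped Langevin diffusion targeting $p$ this gives the familiar Langevin Stein operator
\[
(\mathcal{A}_p f)(x) = \nabla \log p(x) \cdot f(x) + \nabla \cdot f(x), \qquad \mathbb{E}_{x \sim p}\big[(\mathcal{A}_p f)(x)\big] = 0 .
\]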

Gradient estimation is the problem of computing \(\nabla_{\eta} \mathbb{E}_{x \sim q_{\eta}}[f(x)]\),

where $q_{\eta}$ can be interpreted as a language model predicting the next token and $f$ as a reward model evaluating the outcome of the prediction. The gradient is then used to improve the model during training.
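A minimal sketch of the standard score-function (REINFORCE) estimator for this gradient, with the mean reward as a crude control variate; the categorical "token" setup below is my own toy example, not the talk's method (the talk's approach, per the reference below, uses discrete Stein operators instead).

```python
import numpy as np

def score_function_grad(eta, f, n_samples=10_000, rng=None):
    """REINFORCE estimate of grad_eta E_{x ~ q_eta}[f(x)] for a categorical q_eta
    with softmax logits eta, using the mean reward as a simple baseline."""
    rng = rng or np.random.default_rng(0)
    probs = np.exp(eta - eta.max())
    probs /= probs.sum()
    x = rng.choice(len(eta), size=n_samples, p=probs)   # sampled "tokens"
    rewards = f(x)
    baseline = rewards.mean()                           # crude control variate
    grad_log_q = np.eye(len(eta))[x] - probs            # grad_eta log q_eta(x) for softmax
    return ((rewards - baseline)[:, None] * grad_log_q).mean(axis=0)

eta = np.zeros(4)                                       # uniform distribution over 4 "tokens"
reward = lambda x: (x == 2).astype(float)               # toy reward: 1 if token 2 is produced
print(score_function_grad(eta, reward))                 # positive for token 2, negative elsewhere
```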

Reference: Measuring Sample Quality with Stein's Method, Jackson Gorham and Lester Mackey.

Reference (discrete Stein operators from Markov processes): Gradient Estimation with Discrete Stein Operators.

Another part of his talk was about blind source separation (e.g. separating individual voices in a noisy room), score matching, and the path from score networks to diffusion models. Diffusion models have been behind the recent explosive growth in generative models for images.
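For background (standard material, not the talk's contribution): the denoising score matching objective that underlies score-based diffusion models is
\[
\mathcal{L}(\theta) = \mathbb{E}_{x \sim p_{\mathrm{data}},\; \tilde x \sim \mathcal{N}(x, \sigma^2 I)}\left[\Big\| s_\theta(\tilde x) + \frac{\tilde x - x}{\sigma^2} \Big\|^2\right],
\]
whose minimiser $s_\theta$ approximates the score $\nabla \log p_\sigma$ of the noise-perturbed data distribution.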

Link to slides

Oliver Tse: Variational Acceleration Methods in the Space of Probability Measures. Roughly: acceleration of convergence for gradient-flow-like equations in the space of probability measures?

Check his paper for more.

Tim Laux: Emergence of Mean Curvature Flow in Learning, with Jona Lelmi, Leon Bungert, Kerrek Stinson (blackboard talk). The MBO scheme converges to mean curvature flow.

Reference: Large data limit of the MBO scheme for data clustering: Γ-convergence of the thresholding energies.
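A minimal sketch of the classical MBO thresholding scheme (diffuse the characteristic function, then threshold at 1/2), the scheme whose limit is mean curvature flow; the grid, time step, and square initial set below are my own illustrative choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mbo_step(chi, dt):
    """One MBO step: run heat diffusion on the characteristic function for time dt
    (Gaussian convolution with std sqrt(2*dt)), then threshold at 1/2."""
    diffused = gaussian_filter(chi, sigma=np.sqrt(2 * dt))
    return (diffused > 0.5).astype(float)

# Initial set: a square, which should shrink and round off under mean curvature flow.
n = 200
chi = np.zeros((n, n))
chi[60:140, 60:140] = 1.0
areas = []
for _ in range(50):
    chi = mbo_step(chi, dt=2.0)
    areas.append(chi.sum())
print(areas[0], areas[-1])   # the enclosed area decreases along the iterations
```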

Application to adversarial training: an improvement on the approach of Trillos and Murray.

Goal: \(\min_{u \in H} \mathbb{E}_{(X,Y) \sim \mu}[l(u(X),Y)]\), replaced in adversarial training by \(\min_{u\in H}\mathbb{E}_{(X,Y)\sim \mu}[\max_{\tilde x \in B_{\epsilon}(X)} l(u(\tilde x),Y)]\). The relevant boundary here is the decision boundary of a binary classification problem. Previous work used the 'wrong' scheme; his work modifies the functional optimised at each step, which is the right one to converge to mean curvature flow.

Reference: Gamma-convergence of a nonlocal perimeter arising in adversarial machine learning.
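To illustrate the robust objective above (my own toy setup, not the talk's scheme): for a linear binary classifier with logistic loss, the inner maximisation over the $\ell_\infty$ ball $B_{\epsilon}(x)$ has a closed form.

```python
import numpy as np

def robust_logistic_loss(w, b, X, y, eps):
    """Adversarial logistic loss for a linear classifier u(x) = w.x + b with labels y in {-1, +1}.
    The worst perturbation in the l_inf ball B_eps(x) is delta = -eps * y * sign(w),
    giving the worst-case margin y * u(x + delta) = y * u(x) - eps * |w|_1."""
    margins = y * (X @ w + b) - eps * np.abs(w).sum()
    return np.mean(np.log1p(np.exp(-margins)))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = np.sign(X @ w_true)
w, b = rng.normal(size=5), 0.0
print(robust_logistic_loss(w, b, X, y, eps=0.0))   # standard (clean) loss
print(robust_logistic_loss(w, b, X, y, eps=0.1))   # robust loss, never smaller than the clean loss
```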