Taiji Suzuki: Convergence of mean field Langevin dynamics and its application to neural network feature learning

Consider $\min_{\mu\in \mathcal{P}} L(\mu)$, where $L(\mu):= F(\mu)+ \lambda \operatorname{Ent}(\mu)$, $F$ is convex, and $\operatorname{Ent}(\mu)=\int \log (\mu)\, \mathrm{d} \mu$.
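As a minimal numerical sketch of this objective (not from the talk): discretize $\mu$ on a grid, take the convex choice $F(\mu)=\int V\,\mathrm{d}\mu$ for an illustrative potential $V(x)=x^2/2$, and check that the Gibbs measure $\mu^* \propto e^{-V/\lambda}$ scores lower than a uniform distribution.

```python
import numpy as np

# Evaluate the entropy-regularized objective L(mu) = F(mu) + lam * Ent(mu)
# on a discrete grid, with the linear (hence convex) F(mu) = <V, mu>.
# V and lam here are illustrative choices, not the speaker's setup.

def objective(mu, x, lam=0.1):
    V = 0.5 * x**2                    # example potential
    F = np.sum(V * mu)                # F(mu) = integral of V against mu
    ent = np.sum(mu * np.log(mu))     # Ent(mu) = sum mu log mu
    return F + lam * ent

x = np.linspace(-3, 3, 200)
gibbs = np.exp(-0.5 * x**2 / 0.1)    # Gibbs measure ∝ exp(-V / lam)
gibbs /= gibbs.sum()
uniform = np.full_like(x, 1.0 / len(x))
print(objective(gibbs, x), objective(uniform, x))
```

The Gibbs measure is the minimizer of $L$ over probability vectors, so it attains the smaller objective value.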

The idea is to use mean field Langevin dynamics to minimise $L$.

Gradient Langevin dynamics: $\mathrm{d} X_t = -\nabla L(X_t)\, \mathrm{d} t + \sqrt{2 \lambda}\, \mathrm{d} B_t$, where $L(X)= \frac{1}{n} \sum_{i=1}^n l_i(X) + R(X)$ is non-convex.

The discrete approximation via the Euler–Maruyama scheme is $X_{k+1}=X_k - \eta \nabla L(X_k) + \sqrt{2 \eta \lambda}\, \xi_k$, where $\xi_k \sim N(0,I)$.
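A minimal sketch of this Euler–Maruyama iteration for a toy quadratic loss $L(x)=x^2/2$ (so $\nabla L(x)=x$); the step size, noise level, and particle count are illustrative, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def gld(n_particles=20000, n_steps=500, eta=0.01, lam=0.5):
    """Run the Euler--Maruyama discretization of gradient Langevin dynamics
    for L(x) = x^2 / 2, tracking a cloud of independent chains."""
    x = rng.standard_normal(n_particles)
    for _ in range(n_steps):
        xi = rng.standard_normal(n_particles)
        # X_{k+1} = X_k - eta * grad L(X_k) + sqrt(2 eta lam) * xi_k
        x = x - eta * x + np.sqrt(2 * eta * lam) * xi
    return x

samples = gld()
print(samples.var())  # stationary law ∝ exp(-L/lam): Gaussian with variance ≈ lam
```

For this quadratic $L$, the invariant measure is $\propto e^{-L/\lambda}$, i.e. a centered Gaussian with variance $\lambda$, which the empirical variance of the chains should approximate.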

GLD can be interpreted as a Wasserstein gradient flow of $L$ (from the referenced paper).

Approximate mean-field Langevin dynamics by a finite particle system in discrete time.

Assuming a log-Sobolev inequality, one can show convergence of the approximation in terms of the loss function.


Reference: Convergence of mean-field Langevin dynamics: Time and space discretization, stochastic gradient, and variance reduction

Wuchen Li: Information Gamma Calculus: Convexity Analysis for Stochastic Differential Equations

Problem: sample a target measure $\pi(x) = \frac{1}{Z} e^{-V(x)}$ for a given $V: \Omega \to \mathbb{R}$, where $V$ is only partially known. Consider the SDE $\mathrm{d} X_t = b(X_t)\, \mathrm{d} t + \sqrt{2}\, a(X_t)\, \mathrm{d} B_t$ and assume an invariant measure $\pi$ exists. Question: convergence to the invariant measure.
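A standard example of a non-gradient drift that keeps the same invariant measure (a sketch, not the speaker's construction): add a skew-symmetric perturbation $J$ to the drift, $\mathrm{d}X = -(I+J)\nabla V\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B$, which still has $\pi \propto e^{-V}$ as invariant law. Here $V(x)=|x|^2/2$ is a toy choice.

```python
import numpy as np

rng = np.random.default_rng(2)
J = np.array([[0.0, 1.0], [-1.0, 0.0]])   # skew-symmetric: J^T = -J
A = np.eye(2) + J                          # non-reversible drift matrix

def sample(n_particles=20000, n_steps=600, eta=0.01):
    """Euler--Maruyama chains for dX = -(I+J) grad V dt + sqrt(2) dB,
    with V(x) = |x|^2 / 2, so grad V(x) = x."""
    x = rng.standard_normal((n_particles, 2))
    for _ in range(n_steps):
        grad_V = x
        x = x - eta * grad_V @ A.T + np.sqrt(2 * eta) * rng.standard_normal((n_particles, 2))
    return x

xs = sample()
print(np.cov(xs.T))  # invariant measure is the standard 2D Gaussian
```

Despite the rotational (non-reversible) component of the drift, the empirical covariance stays close to the identity, matching $\pi \propto e^{-|x|^2/2}$.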

Lyapunov method: compute the second derivative of the KL divergence; entropy dissipation analysis for non-gradient and degenerate stochastic dynamical systems.
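In the standard reversible case (which the talk generalizes via Gamma calculus to non-gradient and degenerate drifts), the first and second derivatives of the KL divergence along the flow are, schematically:

```latex
% Entropy dissipation sketch for reversible overdamped Langevin dynamics;
% Gamma_2 is the Bakry--Emery iterated carre du champ operator.
\frac{\mathrm{d}}{\mathrm{d}t}\,\mathrm{KL}(\rho_t \,\|\, \pi)
  = -\,\mathrm{I}(\rho_t \,\|\, \pi)
  := -\int \Big|\nabla \log \tfrac{\rho_t}{\pi}\Big|^2 \rho_t \,\mathrm{d}x,
\qquad
\frac{\mathrm{d}}{\mathrm{d}t}\,\mathrm{I}(\rho_t \,\|\, \pi)
  = -2\int \Gamma_2\!\Big(\log \tfrac{\rho_t}{\pi}\Big)\, \rho_t \,\mathrm{d}x .
```

A lower bound $\Gamma_2 \geq \kappa\, \Gamma_1$ then yields exponential decay of the relative Fisher information and hence of the KL divergence.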

Reference: Entropy dissipation via Information Gamma calculus: Non-reversible stochastic differential equations

An earlier idea for the second derivative of Fisher information appears in Lafferty (1988): The density manifold and configuration space quantization.

Mi Jung Park: Privacy-preserving Data Generation in the Era of Foundation Models: Generative Transfer Learning with Differential Privacy

Synthetic data is vulnerable to linkage attacks, which connect synthetic samples back to original data points. Differential privacy: learn as much as possible about a group while learning as little as possible about any individual. Fine-tuning models with DP methods prevents the model from memorizing training data (e.g., typing in a name should not recover the original portrait of a woman).

Method: kernel mean embedding. Applications: (latent) diffusion models, attention modules, image generation. Future directions: tabular data; is there a connection between robustness and DP?
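The core mechanics of DP fine-tuning can be sketched with a hand-rolled DP-SGD step (clip each per-example gradient, then add Gaussian noise to the sum); the clip norm and noise multiplier below are illustrative, and this is not the speaker's implementation.

```python
import numpy as np

rng = np.random.default_rng(3)

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1):
    """One DP-SGD gradient step: clip each row to norm <= clip_norm,
    sum, add Gaussian noise scaled to the clipping bound, and average."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    noisy_sum = clipped.sum(axis=0) + rng.normal(
        0.0, noise_multiplier * clip_norm, size=per_example_grads.shape[1])
    return noisy_sum / len(per_example_grads)

grads = rng.standard_normal((32, 10)) * 5.0   # fake per-example gradients
g = dp_sgd_step(grads)
print(g.shape)
```

Clipping bounds each individual's influence on the update, and the added noise is calibrated to that bound; together they give the per-step privacy guarantee.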