I am back in Berlin for the Workshop on Optimal Transport: From Theory to Applications – Interfacing Dynamical Systems, Optimization, and Machine Learning. I find the talk by Philippe Rigollet (MIT) really interesting. He considers “Transformers”, a neural network architecture, as mean-field interacting particle systems. The references for this work are arXiv:2312.10794 and arXiv:2305.05465.

Let me try to summarise this idea. We take the input data $x_0$. The trained neural network (a ResNet) essentially computes, recursively,

$x_{k+1}= x_k + \sigma( W_k x_k + b_k)$ where

$\sigma$ is a non-linear activation function, $W_k$ is a matrix and $b_k$ is a bias vector.
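To fix ideas, here is a minimal NumPy sketch of this recursion; the choice $\sigma = \tanh$ and the random, untrained parameters are purely illustrative.

```python
import numpy as np

def resnet_forward(x0, weights, biases):
    """Iterate x_{k+1} = x_k + sigma(W_k x_k + b_k), here with sigma = tanh."""
    x = x0
    for W, b in zip(weights, biases):
        x = x + np.tanh(W @ x + b)
    return x

# toy example: dimension d = 4, K = 10 residual layers with random parameters
rng = np.random.default_rng(0)
d, K = 4, 10
Ws = [0.1 * rng.standard_normal((d, d)) for _ in range(K)]
bs = [0.1 * rng.standard_normal(d) for _ in range(K)]
print(resnet_forward(rng.standard_normal(d), Ws, bs))
```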

This is a discrete-time evolution equation. He then takes a continuous-time approximation of this discrete-time dynamical system to obtain a so-called neural ODE,

$\dot x(t) = \sigma (W_t x(t) + b_t).$

Then the neural network is considered as the flow map $x_0 \mapsto x(T).$
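One way to see the correspondence between the two formulations (my own gloss, not spelled out in the talk) is a forward Euler discretisation with step size $h$:

$x((k+1)h) \approx x(kh) + h\,\sigma\big(W_{kh}\, x(kh) + b_{kh}\big),$

which recovers the ResNet recursion for $h = 1$; the layer index plays the role of time, and the depth of the network plays the role of the horizon $T$.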

In the context of natural language models, the so-called Transformers are models that predict the next word in a given sentence. Unlike the ResNet above, the Transformer’s input is a sequence of vectors $x_i(0)$.

The map from natural language to the input $x(0)=(x_i(0))_{i=1,2,\dots,n}\in \mathbb{R}^{d\times n}$ is as follows: the sentence ‘Mein Lieblingstier ist __’ (‘My favourite animal is __’) is broken down into $n$ segments, and each segment is represented by a real-valued vector, known as a token, in a dimension of the order of $d$, which encodes the word as well as its position in the sentence. In typical cases, $n \approx 512, d \approx 200$. The set of tokens $\{x_1,\dots,x_n\}$ is then represented as an empirical measure:

$\frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$

In this way, the Transformer is a flow map on the space of probability measures on $\mathbb{R}^d$.
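Here is a purely illustrative sketch of the input format; the actual tokeniser, embedding, and positional encoding of a Transformer are learned/engineered and different, so everything below is hypothetical.

```python
import numpy as np

# Toy illustration of the input format only.
sentence = ["Mein", "Lieblingstier", "ist", "__"]
n, d = len(sentence), 8            # in practice n ~ 512, d ~ a few hundred

rng = np.random.default_rng(0)
word_vec = {w: rng.standard_normal(d) for w in set(sentence)}
pos_vec = [np.sin((i + 1) * np.arange(1, d + 1) / d) for i in range(n)]

# tokens x_1, ..., x_n in R^d encode the word and its position in the sentence
X0 = np.stack([word_vec[w] + pos_vec[i] for i, w in enumerate(sentence)])

# the empirical measure (1/n) * sum_i delta_{x_i}: support points + uniform weights
support, weights = X0, np.full(n, 1.0 / n)
print(support.shape, weights)      # (4, 8) [0.25 0.25 0.25 0.25]
```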

The output is then a distribution of tokens, which allows us to make a prediction for the original problem. For example, the most likely completion of the above sentence could be “Schmetterling” (butterfly).

Since the dynamics of each $x_i$ depends on the other particles (see the papers for details), the tokens form an interacting particle system. The specific self-attention dynamics exhibits a clustering phenomenon: eventually the empirical measure collapses to a Dirac measure. In practice, however, due to metastability there are multiple metastable states, and these are what is observed in machine learning.
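A minimal simulation sketch of this clustering effect, using a simplified version of the self-attention dynamics with $Q = K = V = \mathrm{Id}$ and tokens constrained to the unit sphere (the papers study more general dynamics; the parameters below are illustrative):

```python
import numpy as np

def self_attention_flow(X, beta=5.0, dt=0.05, steps=2000):
    """Euler scheme for a simplified self-attention particle dynamics
    (Q = K = V = identity, tokens constrained to the unit sphere).
    X: (n, d) array of initial tokens."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    for _ in range(steps):
        # attention weights: softmax over j of beta * <x_i, x_j>
        A = np.exp(beta * (X @ X.T))
        A /= A.sum(axis=1, keepdims=True)
        V = A @ X                                        # attention-weighted average
        V -= np.sum(V * X, axis=1, keepdims=True) * X    # project onto tangent space
        X = X + dt * V
        X /= np.linalg.norm(X, axis=1, keepdims=True)    # stay on the sphere
    return X

rng = np.random.default_rng(0)
final = self_attention_flow(rng.standard_normal((32, 3)))
# pairwise distances shrink over time: the empirical measure approaches a
# Dirac mass (full collapse can be slow because of metastability)
print(np.max(np.linalg.norm(final[:, None] - final[None, :], axis=-1)))
```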

Questions:

In your result, you showed the long-time convergence of the continuous model. What about the convergence of the discrete-time model to the continuous-time model? This should answer the question: do your results connect to the discrete-time setting? What can we say about models with a finite number of layers?

A: Yes. The convergence of the discrete model is available.

Q: How does the result about the flow map answer the question: what is the output of the model given an input? (Did not ask)

Q: What is the family of interacting particle systems that is relevant to neural networks? What are their characteristics/similarities?

A: It is only applicable to the case of Transformer-related neural networks.

Q: How general is this strategy of using interacting particle systems to understand machine learning? Can we do this with models for other purposes?

A: According to Anna from TU Eindhoven, there can probably be other applications.

Q: Can you point me to some review papers on the state of the art in using interacting particle systems to understand machine learning?

A: He mentioned the keyword SVGD (Stein variational gradient descent). See also the paper Sinkformers: Transformers with Doubly Stochastic Attention.

Q: How does your result answer the question: What representation does the trained model learn?

A: Did not ask.

Alexander Mielke (Hellinger–Kantorovich (a.k.a. Wasserstein–Fisher–Rao) Spaces and Gradient Flows): the unbalanced transport problem allows creation and annihilation of mass, rather than only transporting mass while keeping the total mass constant.

Olga Mula (Reduced Models in Wasserstein Spaces for Forward and Inverse Problems) Keywords: Barycentre, inverse problem

  1. Forward problem
  2. Forward model reduction
  3. Inverse state estimation
  4. Inverse parameter estimation