Mean field sequence: an introduction

This is the first post in a planned series about mean field theory by Dmitry and Lauren (this post was generated by Dmitry with lots of input from Lauren, and was split into two parts, the second of which is written jointly). These posts are a combination of an explainer and some original research/ experiments. The goal of these posts is to explain an approach to understanding and interpreting model internals which we informally denote "mean field theory" or MFT. In the literature, the closest matching term is "adaptive mean field theory". We will use the term loosely to denote a rich emerging literature that applies many-body thermodynamic methods to neural net interpretability. It includes work on both Bayesian learning and dynamics (SGD), and work in wider "NNFT" (neural net field theory) contexts. Dmitry's recent post on learning sparse denoising also heuristically fits into this picture (or more precisely, a small extension of it). Our team at Principles of Intelligence (formerly PIBBSS) believes that this point of view on interpretability remains highly neglected, and should be better understood and these ideas should be used much more in interpretability thinking and tools. We hope to formulate this theory in a more user-friendly that can be absorbed and used by interpretability researchers. This particular post is closely related to the paper "Mitigating the Curse of Detail: Scaling Arguments for Feature Learning and Sample Complexity". The experiments are new. What do we mean by mean field theoryMean field theory is a vague term with many meanings, but for the first few posts at least we will focus on adaptive mean field theory (see for example this paper, written with a physicist audience in mind). It is a theory of infinite-width systems that is different from the more classical (and, as I'll explain below, less expressive) neural tangent kernel formalism and related Gaussian Process contexts. Ultimately it is a theory of neurons (which are treated somewhat like particles in a gas). While every single neuron in the theory is a relatively simple object, the neurons in a mean field picture allow for an emergent large-scale behavior (sometimes identified "features") that permits us to see complex interactions and circuits in what is a priori a "single-neuron theory". These cryptic phrases will hopefully be better understood as this post (and more generally as this series) progresses.Why MFTWe ultimately want to understand the internals of neural nets to a degree that can robustly (and ideally, in some sense "safely") interpret why a neural net makes a particular decision. So one might say that this implies that we should only care about theories that apply directly to real models. Finite width, large depth, etc. While this is fair, any interpretation must ultimately rely on some idealization. When we say "we have interpreted this mechanism", we mean that there is some platonic gadget or idealized model that has a mechanism "that we understand", and the real model's behavior is explained well by this platonic idealization. Thus making progress on interpretability requires accumulating an encyclopedia (or recipe book) of idealizations and simplified models. The famous SAE methodology is based on trying to fit real neural nets into an idealization inherited from compressed sensing (a field of applied math). As we will explain below, if we never had Neel Nanda's interpretation of the modular addition algorithm, we would get it "for free" by applying a mean field analysis to the related infinite-width model. As it were, the two use the same Platonic idealization[1]. Thus at least one view on the use of theory is to see it as a source of useful models that can be then applied to more realistic settings (with suitable modification, and, at least until a "standard model" theory of interpretability exists, necessarily incompletely). Useful theories should be simple enough to analyse mathematically (maybe with some simplifications, assumptions, etc.) and rich enough to illuminate new structure. We think that mean field theory (and its relatives) is well-positioned to take such a role.Brief FAQ section"Frequently asked questions about MFT" is a big topic that can be its own post. But before diving into a more technical introduction, we should address a few standard questions which keep cropping up, especially about comparisons between MFT and other better-known infinite-width limits. Doesn't infinite width mean that we're in the NTK (or more generally a Gaussian process) regime? The first analyses of neural nets at infinite width have been in the so-called NTK regime, where in particular the model "freezes" to its prior/ initialization at all but the last layer (which is performing linear regression). This is a remarkably deep picture that is for example sufficient to learn mnist. But approaches in this family exhibit extremely different behaviors from realistic nets (in particular the freezing of early neurons) and they are generalize much worse on problems that cannot be solved by some combination of clustering and linear regression (of which MNIST is an example). For example these methods learn only memorizing circuits in modular addition (at least in known regimes) and, worse, they are known to require exponential training data and complexity for learning algorithms that are well-known to be learnable by SGD (see for example the leap complexity paper) – this means that these techniques are fundamentally incompatible with these settings (more generally so-called "compositional" models - ones that have multiple serial steps which models tend to need depth for - have similar failures in this regime). This can be partially improved by including so-called "correction terms", but these only work when the Gaussian process has good performance by itself, and fail to ameliorate for the exponential complexity issues. Note that the Gaussian process picture is useful as a heuristic baseline. In particular it makes some predictions on scaling exponents that have some experimental agreement (and is related to the muP formalism). It turns out that the lack of expressivity of the Gaussian limit is due not to its having infinite width to a certain choice of how to take the infinite limit (and in particular how to scale weight regularization terms in the loss). Different limits and scalings give significantly more expressive behaviors as we shall see, and we use MFT as a catch-all term for these. (These different limits are also harder in general, at least in terms of exact mathematical analysis: the Gaussian process limit somewhat compensates for its lack of expressivity by having much easier math.)Isn't mean field theory only a Bayesian learning theory and doesn't that make it unrealistic? In physics contexts (like MFT, Gaussian Process learning, etc.) Bayesian learning is often theoretically easier to deal with, and we'll explain Bayesian learning predictions here (validated by tempering experiments). However a version of mean field for SGD learning exists and is called "Dynamical Mean Field Theory" (DMFT) (it extends the NTK in Gaussian process contexts). Probably more relevantly, Bayesian learning experiments frequently find similar structures to gradient-based methods (and are often easier to analyse). This is particularly well demonstrated in empirical results by the Timaeus group.Is mean field theory a theory of shallow models? Most existing papers on mean field theory work in the context of 2-layer neural nets (i.e. 2 linear layers, one nonlinear layer). However there is a fully general, and experimentally robust extension of the theory to a larger number of layers (see for example this lecture series), and we will look at such models here. In fact mean field theory can model mechanisms of arbitrary depth - but it works best for shallower models (or for shallow mechanisms in deep models), and would likely be less useful for modeling strongly depth-dependent phenomena.What is a success of mean field theory I should know about? Glad you asked! Most people know about the Modular Addition task, which was first explained mechanistically by Neel Nanda et al.'s grokking paper. The interpretation is heuristic: it shows that the model exhibits signatures of using a nice and unexpected trigonometric trick. It also interpolates between generalization and memorization in a sudden shift reminiscent of a phase transition. A more ambitious task (that was considered too hard to tackle in the interpretability community) would be to understand exactly what the model learns on a neuron-by-neuron basis in any setting that exhibits generalization/ grokking. Since models have inherent randomness (from initialization, and sometimes from SGD), the task is inherently a statistical one: explain the probability distribution on weights of learned models (at least to a suitable level of precision), and was generally believed to be quite hard. Thus it comes as a surprise to practitioners of interpretability that in fact there is a context where this is done. In the paper "Grokking as a First-order Phase Transition in Two Layer Networks", Rubin, Seroussi and Ringel constructed a complete explanation (experimentally verified to extremely high precision) for the modular addition network in the Bayesian learning setting (there are some other differences from Neel Nanda's approach, most notably the choice of loss function, but variants of the approach extend to these as well). The distribution is first understood at infinite width, then shown to apply at realistic (but large) width in the appropriate regime. When applying the adaptive mean field theory approach to this task, Fourier modes and the trigonometric mechanism fall out as a natural output of the theory – moreover they are fully explained on a statistical distribution level (i.e. we have a complete model "exactly what each neuron does" to an appropriate degree of precision, understood in a statistical physics sense). Of particular interest, the model explains a grokking-like phase transition between memorization (equivalently, a Gaussian process-like behaviour) and generalization (inherently mean field) and predicts the data fraction at which it happens (this is a Bayesian learning analog of predicting the distribution of when grokking happens in SGD-trained neural nets). The phenomenon is a genuine phase transition in the thermodynamic sense.Are real models in the mean field regime or the Gaussian process regime, or something else? This is an interesting question, whose answer is "this question doesn't make sense". The distinction between regimes applies to infinite width nets, i.e. to a totally non-standard setting. One can prove rigorous results with the gist that if the width is (sufficiently enormous with some giant bound) compared to the training data, the model is guaranteed to learn in one of these two regimes. However, no real models are that enormous. Instead, some phenomena and some mechanisms can be seen (experimentally or theoretically) to extend from infinite nets to nets of finite width. Sometimes these look more like mean field phenomena, sometimes they look like Gaussian process phenomena. For example in some sense MNIST is "GP-like" (GP stands for Gaussian process). Circuits in modular addition are, as it turns out, entirely explained by the MFT limit as we've explained above.Introduction to the theoryThe background (and the foreground)In physics, one often looks at systems with a large, stable background. A planet vs. a sun, an electron vs. a proton, a weakly interacting observer vs. a large system being observed. In these settings the "background" is the large system and the "foreground" or "test system" is the small system being studied. In these cases the background system may be fixed, or it may be undergoing some motion (like the sun moving around the galaxy's center), but the important idealization is that it does not react to the observer/ test system. In fact, the earth is applying a gravitational pull to the sun (and famously in quantum mechanics, observations always impact a system at a quantum level). But these "reverse" effects are small, so to a good approximation we can treat the sun as doing its own "stable" thing while earth is undergoing physics that depend strongly on the sun. Self-consistencyWhile typically the large "background" is a cleanly separate system from the small test system of the observer, it is sometimes extremely useful to treat the test system as a tiny piece of the background. So: the large background system may be a cup of water and the small test system may be a tiny bit of water at some location. Here while technically the full cup includes the tiny "test" bit, the large-scale behaviors (waves etc.) in the water don't really care to relevant precision if the test bit is changed or removed (at least if it's tiny enough). But the tiny bit of water definitely cares about the large-scale behaviors (waves, vortices or flows, etc.), to the extent that bits of water care about things. Similarly (and in a closely related way), "the economy" is a giant system that includes your neighborhood bakery. The bakery can be viewed as a small "test system": it is affected by the economy. If property prices go up or the economy tanks, it might close. But the economy is not (at least to leading order) affected by this bakery. It is perhaps affected by the union of all bakeries in the world, but if this particular bakery closes due to some random phenomenon (e.g. the lead baker retires), this won't massively impact the economy.This point of view is remarkably useful, because it introduces a notion of "self-consistency". Self-consistency when applied in this context comes from the following pair of intuitions:the behavior of each small component is (statistically) determined by the backgroundthe behavior of the background is the sum of its small components.If both of these assumptions are true, then these two observations (when turned into equations) are usually enough to fully pin down the system. Indeed, you have two functional relationships[2] :

mjx-c.mjx-c62::before {
padding: 0.694em 0.556em 0.011em 0;
content: "b";
}

mjx-c.mjx-c6B::before {
padding: 0.694em 0.528em 0 0;
content: "k";
}

mjx-c.mjx-c72::before {
padding: 0.442em 0.392em 0 0;
content: "r";
}

mjx-c.mjx-c66::before {
padding: 0.705em 0.372em 0 0;
content: "f";
}

mjx-c.mjx-c2218::before {
padding: 0.444em 0.5em 0 0;
content: "\2218";
}

mjx-c.mjx-c1D456.TEX-I::before {
padding: 0.661em 0.345em 0.011em 0;
content: "i";
}

mjx-c.mjx-c2192::before {
padding: 0.511em 1em 0.011em 0;
content: "\2192";
}

mjx-c.mjx-c1D719.TEX-I::before {
padding: 0.694em 0.596em 0.205em 0;
content: "\3D5";
}

mjx-c.mjx-c37::before {
padding: 0.676em 0.5em 0.022em 0;
content: "7";
}

mjx-math {
display: inline-block;
text-align: left;
line-height: 0;
text-indent: 0;
font-style: normal;
font-weight: normal;
font-size: 100%;
font-size-adjust: none;
letter-spacing: normal;
border-collapse: collapse;
word-wrap: normal;
word-spacing: normal;
white-space: nowrap;
direction: ltr;
padding: 1px 0;
}

mjx-container[jax="CHTML"][display="true"] {
display: block;
text-align: center;
margin: 1em 0;
}

mjx-container[jax="CHTML"][display="true"][width="full"] {
display: flex;
}

mjx-container[jax="CHTML"][display="true"] mjx-math {
padding: 0;
}

mjx-container[jax="CHTML"][justify="left"] {
text-align: left;
}

mjx-container[jax="CHTML"][justify="right"] {
text-align: right;
}

mjx-mo {
display: inline-block;
text-align: left;
}

mjx-stretchy-h {
display: inline-table;
width: 100%;
}

mjx-stretchy-h > * {
display: table-cell;
width: 0;
}

mjx-stretchy-h > * > mjx-c {
display: inline-block;
transform: scalex(1.0000001);
}

mjx-stretchy-h > * > mjx-c::before {
display: inline-block;
width: initial;
}

mjx-stretchy-h > mjx-ext {
/* IE */ overflow: hidden;
/* others */ overflow: clip visible;
width: 100%;
}

mjx-stretchy-h > mjx-ext > mjx-c::before {
transform: scalex(500);
}

mjx-stretchy-h > mjx-ext > mjx-c {
width: 0;
}

mjx-stretchy-h > mjx-beg > mjx-c {
margin-right: -.1em;
}

mjx-stretchy-h > mjx-end > mjx-c {
margin-left: -.1em;
}

mjx-stretchy-v {
display: inline-block;
}

mjx-stretchy-v > * {
display: block;
}

mjx-stretchy-v > mjx-beg {
height: 0;
}

mjx-stretchy-v > mjx-end > mjx-c {
display: block;
}

mjx-stretchy-v > * > mjx-c {
transform: scaley(1.0000001);
transform-origin: left center;
overflow: hidden;
}

mjx-stretchy-v > mjx-ext {
display: block;
height: 100%;
box-sizing: border-box;
border: 0px solid transparent;
/* IE */ overflow: hidden;
/* others */ overflow: visible clip;
}

mjx-stretchy-v > mjx-ext > mjx-c::before {
width: initial;
box-sizing: border-box;
}

mjx-stretchy-v > mjx-ext > mjx-c {
transform: scaleY(500) translateY(.075em);
overflow: visible;
}

mjx-mark {
display: inline-block;
height: 0px;
}

mjx-c {
display: inline-block;
}

mjx-utext {
display: inline-block;
padding: .75em 0 .2em 0;
}

mjx-mi {
display: inline-block;
text-align: left;
}

mjx-msub {
display: inline-block;
text-align: left;
}

mjx-mspace {
display: inline-block;
text-align: left;
}

mjx-msup {
display: inline-block;
text-align: left;
}

mjx-mn {
display: inline-block;
text-align: left;
}

mjx-TeXAtom {
display: inline-block;
text-align: left;
}

mjx-mtext {
display: inline-block;
text-align: left;
}

mjx-munderover {
display: inline-block;
text-align: left;
}

mjx-munderover:not([limits="false"]) {
padding-top: .1em;
}

mjx-munderover:not([limits="false"]) > * {
display: block;
}

mjx-msubsup {
display: inline-block;
text-align: left;
}

mjx-script {
display: inline-block;
padding-right: .05em;
padding-left: .033em;
}

mjx-script > mjx-spacer {
display: block;
}

mjx-mfrac {
display: inline-block;
text-align: left;
}

mjx-frac {
display: inline-block;
vertical-align: 0.17em;
padding: 0 .22em;
}

mjx-frac[type="d"] {
vertical-align: .04em;
}

mjx-frac[delims] {
padding: 0 .1em;
}

mjx-frac[atop] {
padding: 0 .12em;
}

mjx-frac[atop][delims] {
padding: 0;
}

mjx-dtable {
display: inline-table;
width: 100%;
}

mjx-dtable > * {
font-size: 2000%;
}

mjx-dbox {
display: block;
font-size: 5%;
}

mjx-num {
display: block;
text-align: center;
}

mjx-den {
display: block;
text-align: center;
}

mjx-mfrac[bevelled] > mjx-num {
display: inline-block;
}

mjx-mfrac[bevelled] > mjx-den {
display: inline-block;
}

mjx-den[align="right"], mjx-num[align="right"] {
text-align: right;
}

mjx-den[align="left"], mjx-num[align="left"] {
text-align: left;
}

mjx-nstrut {
display: inline-block;
height: .054em;
width: 0;
vertical-align: -.054em;
}

mjx-nstrut[type="d"] {
height: .217em;
vertical-align: -.217em;
}

mjx-dstrut {
display: inline-block;
height: .505em;
width: 0;
}

mjx-dstrut[type="d"] {
height: .726em;
}

mjx-line {
display: block;
box-sizing: border-box;
min-height: 1px;
height: .06em;
border-top: .06em solid;
margin: .06em -.1em;
overflow: hidden;
}

mjx-line[type="d"] {
margin: .18em -.1em;
}

mjx-mrow {
display: inline-block;
text-align: left;
}

mjx-munder {
display: inline-block;
text-align: left;
}

mjx-over {
text-align: left;
}

mjx-munder:not([limits="false"]) {
display: inline-table;
}

mjx-munder > mjx-row {
text-align: left;
}

mjx-under {
padding-bottom: .1em;
}

mjx-mtable {
display: inline-block;
text-align: center;
vertical-align: .25em;
position: relative;
box-sizing: border-box;
border-spacing: 0;
border-collapse: collapse;
}

mjx-mstyle[size="s"] mjx-mtable {
vertical-align: .354em;
}

mjx-labels {
position: absolute;
left: 0;
top: 0;
}

mjx-table {
display: inline-block;
vertical-align: -.5ex;
box-sizing: border-box;
}

mjx-table > mjx-itable {
vertical-align: middle;
text-align: left;
box-sizing: border-box;
}

mjx-labels > mjx-itable {
position: absolute;
top: 0;
}

mjx-mtable[justify="left"] {
text-align: left;
}

mjx-mtable[justify="right"] {
text-align: right;
}

mjx-mtable[justify="left"][side="left"] {
padding-right: 0 ! important;
}

mjx-mtable[justify="left"][side="right"] {
padding-left: 0 ! important;
}

mjx-mtable[justify="right"][side="left"] {
padding-right: 0 ! important;
}

mjx-mtable[justify="right"][side="right"] {
padding-left: 0 ! important;
}

mjx-mtable[align] {
vertical-align: baseline;
}

mjx-mtable[align="top"] > mjx-table {
vertical-align: top;
}

mjx-mtable[align="bottom"] > mjx-table {
vertical-align: bottom;
}

mjx-mtable[side="right"] mjx-labels {
min-width: 100%;
}

mjx-mtr {
display: table-row;
text-align: left;
}

mjx-mtr[rowalign="top"] > mjx-mtd {
vertical-align: top;
}

mjx-mtr[rowalign="center"] > mjx-mtd {
vertical-align: middle;
}

mjx-mtr[rowalign="bottom"] > mjx-mtd {
vertical-align: bottom;
}

mjx-mtr[rowalign="baseline"] > mjx-mtd {
vertical-align: baseline;
}

mjx-mtr[rowalign="axis"] > mjx-mtd {
vertical-align: .25em;
}

mjx-mtd {
display: table-cell;
text-align: center;
padding: .215em .4em;
}

mjx-mtd:first-child {
padding-left: 0;
}

mjx-mtd:last-child {
padding-right: 0;
}

mjx-mtable > * > mjx-itable > *:first-child > mjx-mtd {
padding-top: 0;
}

mjx-mtable > * > mjx-itable > *:last-child > mjx-mtd {
padding-bottom: 0;
}

mjx-tstrut {
display: inline-block;
height: 1em;
vertical-align: -.25em;
}

mjx-labels[align="left"] > mjx-mtr > mjx-mtd {
text-align: left;
}

mjx-labels[align="right"] > mjx-mtr > mjx-mtd {
text-align: right;
}

mjx-mtd[extra] {
padding: 0;
}

mjx-mtd[rowalign="top"] {
vertical-align: top;
}

mjx-mtd[rowalign="center"] {
vertical-align: middle;
}

mjx-mtd[rowalign="bottom"] {
vertical-align: bottom;
}

mjx-mtd[rowalign="baseline"] {
vertical-align: baseline;
}

mjx-mtd[rowalign="axis"] {
vertical-align: .25em;
}

mjx-menclose {
display: inline-block;
text-align: left;
position: relative;
}

mjx-menclose > mjx-dstrike {
display: inline-block;
left: 0;
top: 0;
position: absolute;
border-top: 0.067em solid;
transform-origin: top left;
}

mjx-menclose > mjx-ustrike {
display: inline-block;
left: 0;
bottom: 0;
position: absolute;
border-top: 0.067em solid;
transform-origin: bottom left;
}

mjx-menclose > mjx-hstrike {
border-top: 0.067em solid;
position: absolute;
left: 0;
right: 0;
bottom: 50%;
transform: translateY(0.034em);
}

mjx-menclose > mjx-vstrike {
border-left: 0.067em solid;
position: absolute;
top: 0;
bottom: 0;
right: 50%;
transform: translateX(0.034em);
}

mjx-menclose > mjx-rbox {
position: absolute;
top: 0;
bottom: 0;
right: 0;
left: 0;
border: 0.067em solid;
border-radius: 0.267em;
}

mjx-menclose > mjx-cbox {
position: absolute;
top: 0;
bottom: 0;
right: 0;
left: 0;
border: 0.067em solid;
border-radius: 50%;
}

mjx-menclose > mjx-arrow {
position: absolute;
left: 0;
bottom: 50%;
height: 0;
width: 0;
}

mjx-menclose > mjx-arrow > * {
display: block;
position: absolute;
transform-origin: bottom;
border-left: 0.268em solid;
border-right: 0;
box-sizing: border-box;
}

mjx-menclose > mjx-arrow > mjx-aline {
left: 0;
top: -0.034em;
right: 0.201em;
height: 0;
border-top: 0.067em solid;
border-left: 0;
}

mjx-menclose > mjx-arrow[double] > mjx-aline {
left: 0.201em;
height: 0;
}

mjx-menclose > mjx-arrow > mjx-rthead {
transform: skewX(0.464rad);
right: 0;
bottom: -1px;
border-bottom: 1px solid transparent;
border-top: 0.134em solid transparent;
}

mjx-menclose > mjx-arrow > mjx-rbhead {
transform: skewX(-0.464rad);
transform-origin: top;
right: 0;
top: -1px;
border-top: 1px solid transparent;
border-bottom: 0.134em solid transparent;
}

mjx-menclose > mjx-arrow > mjx-lthead {
transform: skewX(-0.464rad);
left: 0;
bottom: -1px;
border-left: 0;
border-right: 0.268em solid;
border-bottom: 1px solid transparent;
border-top: 0.134em solid transparent;
}

mjx-menclose > mjx-arrow > mjx-lbhead {
transform: skewX(0.464rad);
transform-origin: top;
left: 0;
top: -1px;
border-left: 0;
border-right: 0.268em solid;
border-top: 1px solid transparent;
border-bottom: 0.134em solid transparent;
}

mjx-menclose > dbox {
position: absolute;
top: 0;
bottom: 0;
left: -0.3em;
width: 0.6em;
border: 0.067em solid;
border-radius: 50%;
clip-path: inset(0 0 0 0.3em);
box-sizing: border-box;
}

mjx-stretchy-h.mjx-c23DF mjx-beg mjx-c::before {
content: "\E152";
padding: 0.32em 0 0.2em 0;
}

mjx-stretchy-h.mjx-c23DF mjx-ext mjx-c::before {
content: "\E154";
padding: 0.32em 0 0.2em 0;
}

mjx-stretchy-h.mjx-c23DF mjx-end mjx-c::before {
content: "\E153";
padding: 0.32em 0 0.2em 0;
}

mjx-stretchy-h.mjx-c23DF mjx-mid mjx-c::before {
content: "\E151\E150";
padding: 0.32em 0 0.2em 0;
}

mjx-stretchy-h.mjx-c23DF > mjx-ext {
width: 50%;
}

mjx-c.mjx-c28::before {
padding: 0.75em 0.389em 0.25em 0;
content: "(";
}

mjx-c.mjx-c1D44E.TEX-I::before {
padding: 0.441em 0.529em 0.01em 0;
content: "a";
}

mjx-c.mjx-c2C::before {
padding: 0.121em 0.278em 0.194em 0;
content: ",";
}

mjx-c.mjx-c1D44F.TEX-I::before {
padding: 0.694em 0.429em 0.011em 0;
content: "b";
}

mjx-c.mjx-c29::before {
padding: 0.75em 0.389em 0.25em 0;
content: ")";
}

mjx-c.mjx-c210E.TEX-I::before {
padding: 0.694em 0.576em 0.011em 0;
content: "h";
}

mjx-c.mjx-c3D::before {
padding: 0.583em 0.778em 0.082em 0;
content: "=";
}

mjx-c.mjx-c1D464.TEX-I::before {
padding: 0.443em 0.716em 0.011em 0;
content: "w";
}

mjx-c.mjx-c1D70F.TEX-I::before {
padding: 0.431em 0.517em 0.013em 0;
content: "\3C4";
}

mjx-c.mjx-c2295::before {
padding: 0.583em 0.778em 0.083em 0;
content: "\2295";
}

mjx-c.mjx-c1D465.TEX-I::before {
padding: 0.442em 0.572em 0.011em 0;
content: "x";
}

mjx-c.mjx-c1D467.TEX-I::before {
padding: 0.442em 0.465em 0.011em 0;
content: "z";
}

mjx-c.mjx-c1D436.TEX-I::before {
padding: 0.705em 0.76em 0.022em 0;
content: "C";
}

mjx-c.mjx-c3A::before {
padding: 0.43em 0.278em 0 0;
content: ":";
}

mjx-c.mjx-c22A4::before {
padding: 0.668em 0.778em 0 0;
content: "\22A4";
}

mjx-c.mjx-c2212::before {
padding: 0.583em 0.778em 0.082em 0;
content: "\2212";
}

mjx-c.mjx-c2E::before {
padding: 0.12em 0.278em 0 0;
content: ".";
}

mjx-c.mjx-c3E::before {
padding: 0.54em 0.778em 0.04em 0;
content: ">";
}

mjx-c.mjx-c30::before {
padding: 0.666em 0.5em 0.022em 0;
content: "0";
}

mjx-c.mjx-c27FA::before {
padding: 0.525em 1.858em 0.024em 0;
content: "\27FA";
}

mjx-c.mjx-c31::before {
padding: 0.666em 0.5em 0 0;
content: "1";
}

mjx-c.mjx-c3C::before {
padding: 0.54em 0.778em 0.04em 0;
content: " * {
display: table-cell;
}

mjx-mtext {
display: inline-block;
}

mjx-mstyle {
display: inline-block;
}

mjx-merror {
display: inline-block;
color: red;
background-color: yellow;
}

mjx-mphantom {
visibility: hidden;
}

_::-webkit-full-page-media, _:future, :root mjx-container {
will-change: opacity;
}

mjx-c::before {
display: block;
width: 0;
}

.MJX-TEX {
font-family: MJXZERO, MJXTEX;
}

.TEX-B {
font-family: MJXZERO, MJXTEX-B;
}

.TEX-I {
font-family: MJXZERO, MJXTEX-I;
}

.TEX-MI {
font-family: MJXZERO, MJXTEX-MI;
}

.TEX-BI {
font-family: MJXZERO, MJXTEX-BI;
}

.TEX-S1 {
font-family: MJXZERO, MJXTEX-S1;
}

.TEX-S2 {
font-family: MJXZERO, MJXTEX-S2;
}

.TEX-S3 {
font-family: MJXZERO, MJXTEX-S3;
}

.TEX-S4 {
font-family: MJXZERO, MJXTEX-S4;
}

.TEX-A {
font-family: MJXZERO, MJXTEX-A;
}

.TEX-C {
font-family: MJXZERO, MJXTEX-C;
}

.TEX-CB {
font-family: MJXZERO, MJXTEX-CB;
}

.TEX-FR {
font-family: MJXZERO, MJXTEX-FR;
}

.TEX-FRB {
font-family: MJXZERO, MJXTEX-FRB;
}

.TEX-SS {
font-family: MJXZERO, MJXTEX-SS;
}

.TEX-SSB {
font-family: MJXZERO, MJXTEX-SSB;
}

.TEX-SSI {
font-family: MJXZERO, MJXTEX-SSI;
}

.TEX-SC {
font-family: MJXZERO, MJXTEX-SC;
}

.TEX-T {
font-family: MJXZERO, MJXTEX-T;
}

.TEX-V {
font-family: MJXZERO, MJXTEX-V;
}

.TEX-VB {
font-family: MJXZERO, MJXTEX-VB;
}

mjx-stretchy-v mjx-c, mjx-stretchy-h mjx-c {
font-family: MJXZERO, MJXTEX-S1, MJXTEX-S4, MJXTEX, MJXTEX-A ! important;
}

@font-face /* 0 */ {
font-family: MJXZERO;
src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Zero.woff") format("woff");
}

@font-face /* 1 */ {
font-family: MJXTEX;
src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Regular.woff") format("woff");
}

@font-face /* 2 */ {
font-family: MJXTEX-B;
src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Bold.woff") format("woff");
}

@font-face /* 3 */ {
font-family: MJXTEX-I;
src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-Italic.woff") format("woff");
}

@font-face /* 4 */ {
font-family: MJXTEX-MI;
src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Italic.woff") format("woff");
}

@font-face /* 5 */ {
font-family: MJXTEX-BI;
src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-BoldItalic.woff") format("woff");
}

@font-face /* 6 */ {
font-family: MJXTEX-S1;
src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size1-Regular.woff") format("woff");
}

@font-face /* 7 */ {
font-family: MJXTEX-S2;
src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size2-Regular.woff") format("woff");
}

@font-face /* 8 */ {
font-family: MJXTEX-S3;
src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size3-Regular.woff") format("woff");
}

@font-face /* 9 */ {
font-family: MJXTEX-S4;
src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size4-Regular.woff") format("woff");
}

@font-face /* 10 */ {
font-family: MJXTEX-A;
src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_AMS-Regular.woff") format("woff");
}

@font-face /* 11 */ {
font-family: MJXTEX-C;
src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Regular.woff") format("woff");
}

@font-face /* 12 */ {
font-family: MJXTEX-CB;
src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Bold.woff") format("woff");
}

@font-face /* 13 */ {
font-family: MJXTEX-FR;
src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Regular.woff") format("woff");
}

@font-face /* 14 */ {
font-family: MJXTEX-FRB;
src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Bold.woff") format("woff");
}

@font-face /* 15 */ {
font-family: MJXTEX-SS;
src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Regular.woff") format("woff");
}

@font-face /* 16 */ {
font-family: MJXTEX-SSB;
src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Bold.woff") format("woff");
}

@font-face /* 17 */ {
font-family: MJXTEX-SSI;
src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Italic.woff") format("woff");
}

@font-face /* 18 */ {
font-family: MJXTEX-SC;
src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Script-Regular.woff") format("woff");
}

@font-face /* 19 */ {
font-family: MJXTEX-T;
src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Typewriter-Regular.woff") format("woff");
}

@font-face /* 20 */ {
font-family: MJXTEX-V;
src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Regular.woff") format("woff");
}

@font-face /* 21 */ {
font-family: MJXTEX-VB;
src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Bold.woff") format("woff");
}
Putting these together, we have the combined "self-consistency" equation:which means that the background field satisfies a fixed point equation for the composed function . It so happens that in many cases of interest, it has a unique solution. A classic example of a self-consistency equation is the supply-demand curve equilibrium. Here the background is a single number (price of a good) and the test system is the willingness of a single consumer to buy or of a single producer to sell, as a function of price (the actual "tiny components" consisting of individual consumers/producers are abstracted out, and the curve represents the average incentive). Of the above assumption 1 is most problematic. Thinking of each component as being determined by some "large-scale" stable system needs to be interpreted appropriately (in particular the relationship is often statistical: so for example the number of bakeries in a given neighborhood fluctuates due to people retiring/ moving/ etc., even if "the economy" is held constant; similarly, every bit of the sun reacts to magnetic/ gravitational fields from other bits, but in a statistical or thermodynamic sense). Sometimes local or so-called "emergent" effects break this directional relationship (and many interesting thermodynamic systems, such as the 2-dimensional Ising model, are precisely interesting in such contexts). But surprisingly often (at least with an appropriate formalism) the approximation of the foreground as fully determined by the background (in a statistical sense) is robust. For example if we are modeling the sun, viewing the "background system" too coarsely (as just the mass + electromagnetic field + temperature, say, of the entire sun) is insufficient. But instead we can view the "background system" as a giant union of many local systems, maybe comprising a few meter chunks. These are still "large" in the sense of being much larger than an atom (or a microscopic chunk), but studying their behavior (in an appropriate abstraction) offers sufficient resolution to model the sun extremely well. Similarly we can't apply a single supply-demand curve to the entire economy (bread costs different amounts in different places). But in appropriate contexts (for fungible products like oil, and on a "local economy" level where the economy is roughly uniform but not dominated by a single station, for example) self-consistency is a pretty good model. In many settings, the question of how well "assumption 1" above holds is related to a notion of connectedness. In the sun's magnetic plasma, the magnetic field experienced by a particle is accumulated over billions and billions of nearby particles - so the graph of interactions is extremely connected. In an oil economy, each consumer can typically choose between dozens of nearby stations which are reachable by car. However other settings (like the Ising model, or markets for rare and hard-to-transport goods) cannot be purely modeled by self-consistency as well. In physics, systems that are well-modeled by a self-consistency equation (coupled background and foreground systems) are generally called mean-field settings. A big triumph of statistical physics is to make situations with local/ emergent phenomena "behave as well as" mean field theories – renormalization is a fundamental tool here, and most textbooks on renormalization from a statistical-physics view tend to start with a discussion of mean-field methods. But settings that are directly mean-field (for example due to being highly connected or high-dimensional) are particularly nice, easy-to-study Neural nets and mean fieldNeural nets are physical systems. This is a vacuous statement – anything that has statistics can be studied using a physics toolkit (and in many ways statistical physics is just statistics with different terms). Indeed, real neural nets are immensely complex, and if there is some sense in which they can be locally decomposed into background-foreground consistencies, these must themselves be immensely complex and likely dependent on sophisticated tooling to identify (this is one of the reasons why we are running an agenda on renormalization).But it turns out that in some settings and architectures neural nets are extremely well-modeled by systems with high connectivity – and the reason is, naively enough, precisely the fact that they are highly connected (often fully-connected) on a neuron level (note that architectures that aren't "fully-connected" – e.g. CNNs – sometimes still have properties that make them "highly connected" from a physical point of view). The mean-field background and foreground for a neural netIn neural net MFT the foreground (or "system"/ "observer") abstraction is a neuron. This is typically a coordinate index of some layer. The important "background" thing that each neuron "carries" is what is called an activation function, often denoted by the letter . This is a function on data: given any input x, partially running the model on x returns a vector of activations. is its i'th component. This function is now the thing that a neuron contributes to the "background field" of the neural net.[3]Now if there are lots of neurons, each neuron's activation function reacts to a background generated by the other neurons: removing the neuron in this limit doesn't change the loss by much, so the background determines each neuron's behavior as a statistical distribution. Conversely, the background itself is composed of individual "foreground" neurons. The loop: background neuron distribution background must close, i.e. be self-consistent. Making sense of this loop is the key content of mean field theory of neural nets.In later installments we'll explain a bit more about the loop and show some examples of it working (or not). You can also see the original linked paper about the Curse of Detail for a more physics-forward view of this. Experimental setting and pretty pictureWe'll close with a toy example of "self-consistency", which is visually satisfying.In this setting we look at a 2-layer model that takes in a two-dimensional input variables and is trained on the target at a large width (here ) and on infinite data. The activation function is a bounded sigmoid-like function (the relu version of tanh). Each neuron at layer 1 is a function that only depends on a 2-dimensional row of the weight matrix, so the associated "test" field or particle can be plotted on a 2-dimensional graph. When we plot all of these together we get a good picture of the distribution of single-neuron functions that combine together to form the background system:The neurons above were trained jointly in a way that would allow them to interact.It has a nice clover-leaf like structure (it will reappear later when we look at continuous xors - a multi-layer setting where mean field performs compositional computation; already in this simple setting, the fact that the cloud of neurons is a "shaped" distribution rather than a flat Gaussian puts us solidly outside the Gaussian process regime). Now we can empirically measure how a single randomly initialized "foreground" neuron would react to the background generated by this model. To do this, we train 2048 iid single-neuron models on the resulting background from the fully trained model.[4] When we do this and combine the resulting 2048 neurons into a new model, we see that indeed it looks exactly the same as the background. When we compute its associated function, we get very similar loss.Each neuron in this picture was trained in a fully iid way, without interacting with any neuron, simply by "reacting to the background", i.e. learning the task in combination with the "blue" background above.Note that this isn't a property that comes "for free". If we were to use the wrong background (for example a the more Gaussian process-like model here) then samples of the foreground would fail to align to the background. Blue is background, orange is foreground (each orange neuron trained independently in reaction to background).The case of 2-layer networks is special: neuron functions are particularly simple to characterize, and the mean field has better properties (it's not "coupled"). But we'll see that deeper nets can still be analyzed using this language, and even using empirical methods we can get cleaner pictures of how they learn and process representations. In the next post, we will explain the physics behind these experiments and the experimental details of the models (github repo coming soon).^Technically they differ on whether they use the "pizza" vs. "clock" mechanisms, but the two idealizations are related, and both the mean field and the realistic setting can be modified to make use of either. ^Below, f and b should generally be understood as "statistical" functions: job choice is, perhaps, a probabilistic function depending on the economy, which includes both demand/ markets but also supply/ people's interests; conversely "the economy" is the average of production over the distribution of jobs. ^Technicalities. Depending on the situation can either be viewed as a function a finite training set or on an infinite "set of all possible inputs", usually a large Euclidean space (example: an MNIST input is a vector of pixel values). Unless we're working with finite training data, this is a priori an infinite-dimensional gadget; and worse, the thing that is actually summed over neurons – the analog of the "market" or "background field" is nonlinear in this objects[4]. There is also a subtlety here about SGD vs. Bayesian learning which I won't get into. But in mean-field settings that admit generalization (or for finite number of inputs), this background is effectively dominated by a small set of "relevant" directions.^Technical note: each single-neuron model is trained on the difference where is the trained model. ^In fact it is quadratic: the thing that sums over neurons is the "external square" of the neuron function, which is a function of a pair of inputs: knowing this sum fully determines the dynamics up to rotational symmetry, even for a finite-width model (it's often called the "data kernel" but is used very differently from the Gaussian process kernels, which do depend on an infinite-width assumption and lose a lot of information in finite-width and mean-field contexts).Discuss