Foundations
Attention assigns weights from scores. β acts like temperature: higher β concentrates weights; lower β spreads them. CEAS monitors spread and nudges β so attention stays inside a target band that is empirically stable for training and aligned with the model’s pseudo-critical regime.
Notation: let \(L_{\text{corr}}\) denote the correlation length (instead of the conventional \( \xi \)). “Critical” refers to critical phenomena: the regime where the system’s effective correlation length grows without bound—informally, a small local change influences the whole system. The controller steers the model toward its critical temperature, i.e., the point where \( L_{\text{corr}} \to \infty \). On finite machines this manifests as a pseudo-critical regime with a large but finite \( L_{\text{corr}} \) (near “blow-up,” yet bounded by model/context size). As model scale grows, finite-size effects shrink and the pseudo-critical behavior approaches the textbook limit.
- Fixed scaling is brittle. The textbook \(1/\sqrt{d_k}\) assumes one setting fits every head, layer, and dataset.
- Instability at the extremes. Too broad → noisy gradients; too sharp → stalled learning. Both waste compute.
- Targeted balance. CEAS keeps attention in the region where small score changes carry useful information.
Canonical Ensemble → Linear Regression and Entropy → Loss (KL → NLL).
Critical-region operation
The controller centers operation near the model’s pseudo-critical regime where information per update is maximized. A low-order (Landau-style) expansion is accurate enough here to steer β; as models scale up, the critical signatures and gains become more apparent.
Objective alignment
Training with negative log-likelihood equals minimizing KL divergence to data; in Gaussian settings this reduces to ordinary least squares. Managing β therefore directly manages the gap to data: sharper when evidence is clear, broader when it is not.
- Single, physics-grounded control knob: β is set by data dispersion and competition, not just embedding dimension.
- Compute discipline: Keeping entropy in a critical band reduces noisy updates and improves convergence stability.
- Production ready: Minimal code changes; complements standard optimizers and schedulers.
Note: CEAS is under active development. Patent pending.
The Controller
Closed-form initializer (“final address”)
Near the high-entropy regime, a principled starting value is
\[ \beta^\star \;=\; \frac{1}{\sigma_{qk}}\,\sqrt{2\,\ln N_{\mathrm{eff}}}\,, \]
where \(\sigma_{qk}\) is the empirical standard deviation of query–key dot products and \(N_{\mathrm{eff}}=\exp(H)\) is the effective competitor count.
One-step controller (online β tuning)
A Newton-style update drives β toward the target band while the representation shifts:
\[ \boxed{\beta_{\text{new}}=\beta+\frac{H(\beta)-H_{\text{target}}}{\beta\,\mathrm{Var}_{p_\beta}[s]+\varepsilon}} \]
Use a small \(\varepsilon>0\) for numerical safety. The same rule can be written with \(\log N_{\mathrm{eff}}\).
Where \(\beta^\star\) comes from (6 + 1)
- KL/entropy constraint: match a target divergence or entropy drop from uniform.
- Extreme-value gap: scale to the expected top-score gap \(\sim \sigma\sqrt{2\ln N_{\mathrm{eff}}}\).
- Free-energy balance: pick \( \beta \) at the saddle/minimum of a variational free-energy.
- Target-entropy rule: solve \(H(\beta)=H^{\star}\) for a chosen corridor.
- Variance-anneal: constrain output-weight variance of the softmax.
- Information-susceptibility / RG view: align with macro response as heads/scale increase.
- +1 control: the Newton update above maintains the corridor in real time.
Decision boundary for gating
Advanced Control
This controller accelerates entry into the useful regime (the entropy corridor) and continuously skips low-information work, while keeping a safe margin from pseudo-critical slowdowns. It is designed to drop cleanly into a standard Transformer training loop.
Controller Design
A) Faster relaxation into the corridor
Replace the unit-gain Newton step with a gain-scheduled update:
Defaults:
- 9k parameters: \(\kappa_{\max}=2.2,\; \kappa_\infty=1.0,\; \tau_\kappa=500\text{–}1000\) steps
- 14.4M parameters: \(\kappa_{\max}=1.8,\; \kappa_\infty=1.0,\; \tau_\kappa=1\text{–}2\text{k}\)
- GPT-3/4/5 scale: \(\kappa_{\max}=1.5,\; \kappa_\infty=1.0,\; \tau_\kappa=2\text{–}5\text{k}\)
Clip per update: \(|\Delta\beta| \le \Delta\beta_{\max}\). Defaults: 9k → 0.75; 14.4M → 0.5; GPT-scale → 0.3.
B) “Don’t get stuck near critical” margin
Use a correlation-length proxy (custom symbol) and hold a minimum gap from the pseudo-critical point:
Defaults: \(u_{\min}=0.06\) (9k), \(0.05\) (14.4M), \(0.04\) (GPT-scale). This caps \( \tau \sim \zeta_{\mathrm{CE}}^{\,z} \) and prevents critical slowing down from erasing gains.
C) Selective early gating, relaxed later
Gate by a dimensionless temperature-gap score \( T = \beta\,\sigma_{qk}\,\sqrt{2\ln N_{\mathrm{eff}}} \).
Threshold schedule:
- 9k: \(T_{\max}=1.8,\; T_\infty=1.05,\; \tau_T=600\) steps
- 14.4M: \(1.6,\; 1.02,\; 1.2\text{k}\)
- GPT-scale: \(1.5,\; 1.00,\; 2\text{–}4\text{k}\)
Token gating: keep tokens with \(T \ge T_{\text{gate}}\) or among top-\(q\) by \(T\) per head. Default (9k): \(q=0.55\) initially (~45% pruning), decaying to \(q=0.75\) by 2k steps.
Head gating: freeze head \(h\) when \(H_h \le H_{\text{freeze}}\) for \(w\) consecutive steps; unfreeze on exit. Defaults: \(H_{\text{freeze}} = \log N_{\mathrm{eff}} - 0.9;\; w=50\) (9k), 100 (14.4M), 200 (GPT-scale).
D) Guardrails (quality first)
- Pruning floors: keep at least \(m_{\min}\) tokens/sequence (e.g., 16–32) and at least \(h_{\min}\) heads/layer (e.g., 2–4).
- Back-off: if validation loss rises > 0.2σ (short EMA), decrease \(T_{\text{gate}}\) by 0.05 and halve \(\kappa(t)\) for 200 steps.
Integrated Cost Model (with pseudo-critical effects)
Baseline cost:
With controller:
Here \(T'_w \ll T_w\) (gain-scheduled \(\kappa(t)\) and the \(u_{\min}\) margin), \(\chi(t)\) is the pruned fraction (tokens + heads), and \(c(\cdot)\) includes finite-size effects via \(\tau \propto \zeta_{\mathrm{CE}}^{\,z}\) with the margin keeping \(\tau\) bounded.
End-to-end savings (closed-form approximation):
Define average prune rates \(\bar{\chi}_{\rm warm}, \bar{\chi}_{\rm steady}\) and warm-up speedup \(s=T_w/T'_w\).
Projected Savings (typical runs)
| Scale | \(s\) (warm-up speedup) | \(\bar{\chi}_{\rm warm}\) | \(\bar{\chi}_{\rm steady}\) | Projected savings |
|---|---|---|---|---|
| 9k | 2.4–3.2 | 0.45–0.55 | 0.22–0.30 | 35–52% (≥30% floor; ~45% common) |
| 14.4M | 1.8–2.4 | 0.35–0.45 | 0.18–0.26 | 26–40% |
| GPT-3 | 1.5–2.0 | 0.28–0.40 | 0.15–0.22 | 28–38% |
| GPT-4 | 1.4–1.8 | 0.25–0.35 | 0.12–0.20 | 24–34% |
| GPT-5 | 1.3–1.6 | 0.22–0.32 | 0.10–0.18 | 20–30% |
Larger models start closer to the corridor under the textbook \(1/\sqrt{d_k}\), so warm-up speedup \(s\) is smaller. However, steady-state gating (\(\bar{\chi}_{\rm steady}>0\)) provides persistent, scale-agnostic savings. The gap margin \(u_{\min}\) keeps \(\tau\) finite as pseudo-critical behavior strengthens with scale.
Drop-In Defaults
- Targets: \(H_{\text{target}}=\log N_{\mathrm{eff}}-1.1\) (tighten to −1.3 if stable). EMA windows: 64 steps for \(H\), 128 for \(\sigma_{qk}\).
- \(\beta\) init: \(\beta \leftarrow 1/\sqrt{d_k}\).
- Final address: \(\beta^\star \approx \dfrac{1}{\sigma_{qk}}\,\sqrt{2\ln N_{\mathrm{eff}}}\).
- Newton step: gain schedule \(\kappa(t)\) as above; clip \(|\Delta\beta|\).
- Gating: threshold \(T_{\text{gate}}(t)\) as above; maintain floors \(m_{\min}\) tokens/seq and \(h_{\min}\) heads/layer.
- Freeze: if \(H_h \le H_{\text{freeze}}\) for \(w\) steps, stop backprop through head \(h\); unfreeze when it exits the band.
- Back-off: if short-EMA validation loss rises > 0.2σ, set \(T_{\text{gate}}\leftarrow T_{\text{gate}}-0.05\) and \(\kappa\leftarrow \kappa/2\) for 200 steps.
Extending the same entropy/critical‑control lens beyond the attention temperature β—to learning rate, batch size, regularization, smoothing/dropout, and gating—compounds the gains. The result is a defensible path to ≥50% end‑to‑end training savings at LLM scale while meeting the same validation target.
2) Multi‑knob controller
Each knob is assigned (i) a local observable, (ii) a target band, and (iii) a one‑step update (Newton/PI style), with a pseudo‑critical margin to avoid \(\tau\!\sim\!\zeta_{\rm CE}^{\,z}\) blowups.
-
Attention temperature β (CEAS core)
Observable: attention entropy \(H\) (or \(N_{\rm eff}=e^H\)).
Update: gain‑scheduled Newton step on \(H\) toward \(H_{\text{target}}\).
Margin: keep \(u=\tfrac{|\beta-\beta_c|}{\beta_c}\ge u_{\min}\) so \(\zeta_{\rm CE}\) and \(\tau\) remain finite.
-
Learning rate \(\eta\) (critical‑damping target)
Observable: trust ratio \(\rho=\eta\,\lambda_{\max}(H_\theta)\) (or a curvature proxy via EMA).
Target: \(\rho\in[\rho_{\min},\rho_{\max}]\) (e.g., 0.02–0.08).
Update: \(\eta\leftarrow \eta\,\exp\!\big(\kappa_\eta(\rho^{*}-\rho)\big)\).
-
Batch size \(B\) (constant gradient‑noise scale)
Observable: GNS proxy \(g\) via online gradient variance.
Target: \(g\approx g^{*}\).
Update: \(B\leftarrow B\cdot \exp\!\big(\kappa_B(g/g^{*}-1)\big)\) with hardware caps.
-
Weight decay \(\lambda_{\rm wd}\) (spectral/entropy regularizer)
Observable: parameter spectral norm or parameter‑entropy \(H(\theta)\).
Target: keep \(H(\theta)\) in band (avoid collapse/explosion).
Update: \(\lambda_{\rm wd}\leftarrow \lambda_{\rm wd}+\kappa_\lambda\big(H^{*}-H(\theta)\big)\).
-
Label smoothing / dropout \(p\) (mutual‑information cap)
Observable: logits entropy \(H_{\rm logit}\) or calibration error.
Target: maintain a high‑entropy band early; anneal later.
Update: \(p\leftarrow \text{sched}(t)\) to keep \(H_{\rm logit}\!\to\!H_{\rm logit}^{*}\).
-
Token/head gating (work pruning)
Observable: temperature‑gap score \(T=\beta\,\sigma_{qk}\sqrt{2\ln N_{\rm eff}}\).
Target: schedule \(T_{\text{gate}}(t)\) high early, relaxing later.
Rule: keep tokens with \(T\ge T_{\text{gate}}\) or top‑\(q\) per head; freeze heads on persistently low entropy.
-
Pseudo‑critical margin (applies to all)
Define a custom correlation‑length proxy \(\zeta_{\rm CE}(\beta)=1/\big(\max(u,u_{\min})\big)^{\nu}\) (with \(\nu\in[0.5,1]\)).
Enforce \(u\ge u_{\min}\) by capping updates. This bounds \(\tau\propto \zeta_{\rm CE}^{\,z}\) and prevents critical slowing‑down from erasing the gains.
3) Why the gains compound
- Multiplicative warm‑up reduction. Typical factors when each knob is steered to an information‑optimal band: \(s_{\rm warm}^{(\beta)}\sim 1.5\! -\! 1.8,\; s_{\rm warm}^{(\eta)}\sim 1.2\! -\! 1.4,\; s_{\rm warm}^{(B)}\sim 1.1\! -\! 1.2,\; s_{\rm warm}^{(\text{reg})}\sim 1.05\! -\! 1.15\). Product \(s_{\rm warm}\approx 2.2\! -\! 3.0\) is common.
- Steady‑state keeps paying. Even when textbook \(1/\sqrt{d_k}\) lands closer to the corridor at huge scale, non‑zero \(\bar\chi_{\rm steady}\) (gating) and tempered \(\eta,B\) reduce steps by another 15–35%.
- Critical behavior helps—if the margin is enforced. Larger models sit nearer to pseudo‑criticality (better coupling), so smaller β changes propagate farther; the explicit \(u_{\min}\) gap prevents \(\tau\) blowups.
4) What to expect (projected ranges)
| Scale | Warm‑up speedup \(s_{\rm warm}\) | \(\bar\chi_{\rm warm}\) | \(\bar\chi_{\rm steady}\) | Steady speedup \(s_{\rm steady}\) | Projected savings |
|---|---|---|---|---|---|
| 9k | 2.6–3.4 | 0.45–0.55 | 0.22–0.30 | 1.20–1.35 | 45–60% |
| 14.4M | 2.1–2.8 | 0.38–0.48 | 0.18–0.26 | 1.20–1.30 | 38–52% |
| GPT‑3 | 1.9–2.5 | 0.30–0.42 | 0.18–0.24 | 1.20–1.30 | 35–50% |
| GPT‑4 | 1.8–2.4 | 0.28–0.38 | 0.16–0.22 | 1.18–1.28 | 32–48% |
| GPT‑5 | 1.7–2.2 | 0.25–0.35 | 0.15–0.20 | 1.15–1.25 | 30–45% |
Projections are end‑to‑end token‑update savings to the same validation target, under a bounded‑\(\tau\) regime.
5) Minimal drop‑in updates (beyond β)
- Curvature‑aware learning rate: maintain \(\rho=\eta\,\widehat{\lambda}_{\max}\in[0.02,0.08]\) via an EMA of top‑eigenvalue proxies (e.g., light power‑iteration every \(N\) steps).
- GNS‑scheduled batch: track gradient variance per layer; increase \(B\) when \(g>g^{*}\) (too noisy), decrease when \(g<g^{*}\) (wasting compute).
- Entropy‑tuned smoothing: adapt label smoothing/dropout to keep prediction‑entropy in a band early, then anneal.
- Regularization balance: nudge \(\lambda_{\rm wd}\) so parameter‑entropy or spectral radius stays inside a band; relax as the corridor stabilizes.
- Always enforce \(u_{\min}\): never allow any knob to push β closer than the pseudo‑critical gap; this guardrail preserves speedups by preventing \(\tau\) spikes.
6) MaxEnt add‑on: architecture & initialization
Extend the entropy/critical‑control lens to structural hyper‑parameters as well: matrix sizes (d_model, d_k, d_ff), number of heads H, attention pattern/positional scheme, activation parameters, and initialization scales. The Maximum Entropy (MaxEnt) principle selects the least‑assumptive configuration consistent with constraints (compute, memory, stability, and the corridor targets), reducing over‑/under‑provisioned work before training even starts.
-
(A) Initialization scales (per layer)
Choose weight std. σw so the temperature T = β·σqk·√(2·ln Neff) starts near a target band T* at step 0, while keeping variance propagation and kurtosis within bounds. This places layers closer to the entropy corridor from the first updates.
-
(B) Matrix sizes & heads
Evaluate a small, tile‑friendly catalog of tuples (H, d_k, d_ff, d_model) with measured cost (FLOPs/memory) and a corridor‑utility score (how well per‑head Neff stays in band for moderate β). Select via a softmax/Lagrange trade‑off between cost and utility, then fix the best tuple before training.
-
(C) Activation/normalization parameters
Maintain an output‑entropy band H(f(x)) using a tiny PI controller on activation parameters (and a sensible layer‑norm ε), plus a spectral‑radius cap to avoid heavy‑tail gradients.
-
(D) Attention pattern / positional scheme
Pick among rotary / learned / ALiBi / local patterns by the same cost–utility criterion, favoring options that keep early‑layer Neff high at fixed compute.
7) Updated projections with MaxEnt (structural)
| Scale | From MaxEnt structure/init | New total projection (vs. the previous table) |
|---|---|---|
| 9k | +8–12 pp | 52–70% |
| 14.4M | +5–9 pp | 43–61% |
| GPT‑3 | +4–8 pp | 39–58% |
| GPT‑4 | +3–7 pp | 35–54% |
| GPT‑5 | +3–6 pp | 33–51% |
pp = percentage points. Assumes: (i) small discrete architecture catalog aligned to hardware tiles, (ii) one‑shot MaxEnt pre‑selection before training (or very infrequent), and (iii) CEAS multi‑knob control active during training. Realized gains depend on dataloader throughput and compile/graph amortization.