Yihong Chen
ILCC Seminar, 27 Feb 2026
Postdoc
OATML, Oxford
Research
knowledge representation, abstraction, and acquisition
(KGs & LLMs)
PhD (CS)
FAIR & UCL
Goal
build computational systems that reproduce and improve human knowledge acquisition.
Large language models encode enormous amounts of structure through pretraining. However, controlling this structure remains difficult.
Most current interventions operate through the data distribution.
Indirect, expensive, and prone to model collapse under repeated synthetic training.
| | Knowledge Graph | LLM |
|---|---|---|
| Deletion | Remove edge | Indirect parameter updates |
| Update | Add fact | Indirect parameter updates |
| Transparency | High | Low |
| Knowledge unit | Node / triple | Entangled representation |
In symbolic systems, knowledge maps to explicit units. In LLMs, knowledge is distributed, so control is usually indirect: there are no turnable knobs.
Turning the knob in computational space should produce a targeted change in knowledge space (e.g., deleting a fact) while preserving unrelated capabilities (e.g., leaving other facts intact).
This leads to two complementary directions: first, carve the knob by decomposing model computation; second, make the knob turnable by improving model plasticity.
Can we decompose LLMs?
Yes, functionally.
A model is just a function, regardless of architecture. We can write the model as nested residual maps:
\( f = \mathrm{Dec} \circ \Big(\circ_{\ell=1}^L (\mathrm{id} + \gamma_\ell)\Big) \circ \mathrm{Enc} \)
Hidden recursion:
\( h_\ell = h_0 + \sum_{j=1}^\ell \gamma_j(h_{j-1}) \)
Residual links accumulate nested nonlinear contributions (source of entanglement).
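The recursion above can be checked numerically. A minimal sketch, using random tanh blocks as stand-ins for the residual maps \( \gamma_j \) (the matrices and sizes are illustrative, not from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 4, 3  # toy hidden width and depth

# Hypothetical nonlinear residual blocks gamma_j (small random tanh layers).
Ws = [rng.normal(size=(d, d)) * 0.1 for _ in range(L)]
gammas = [lambda h, W=W: np.tanh(W @ h) for W in Ws]

h0 = rng.normal(size=d)

# Forward pass through nested residual maps: h_l = h_{l-1} + gamma_l(h_{l-1}).
h = h0.copy()
contributions = []
for gamma in gammas:
    c = gamma(h)
    contributions.append(c)
    h = h + c

# The final state is the input plus the accumulated block outputs,
# matching h_L = h_0 + sum_j gamma_j(h_{j-1}).
assert np.allclose(h, h0 + sum(contributions))
```

Note that each contribution is evaluated at the previous hidden state, so later terms are nested nonlinear functions of earlier ones; this is exactly the entanglement the decomposition targets.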
For \(f\in C^{k+1}\) around basepoint \(x_0\):
\( f(x) = f(x_0) + \sum_{j=1}^k \frac{1}{j!} D^j f(x_0)\,(x-x_0)^{\otimes j} + O(\|x-x_0\|^{k+1}) \)
Jets: define the k-jet operator \( J_k f(x_0) \) as the truncated polynomial component.
Interpretation: each derivative tensor isolates interactions of specific orders.
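A scalar toy illustration of the truncation bound: for \( f = \exp \) (all derivatives equal \( \exp \)), the k-jet is the degree-k Taylor polynomial, and halving the step should shrink the error by roughly \( 2^{k+1} \). The function name is ours, for illustration only.

```python
import math

def jet_exp(x0, k):
    """k-jet of exp at x0: the degree-k Taylor polynomial (every derivative of exp is exp)."""
    def poly(x):
        return sum(math.exp(x0) * (x - x0) ** j / math.factorial(j) for j in range(k + 1))
    return poly

x0, k = 0.0, 3
J = jet_exp(x0, k)
# Truncation error is O(|x - x0|^{k+1}): halving the step should shrink
# the error by roughly 2^(k+1) = 16.
e1 = abs(math.exp(0.2) - J(0.2))
e2 = abs(math.exp(0.1) - J(0.1))
print(e1 / e2)  # ratio close to 16
```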
Setup. Let \( \bar{x} = \sum_{i=1}^N w_i x_i \) with convex weights \( w_i \ge 0 \), \( \sum_{i=1}^N w_i = 1 \), where each \( x_i \) lies within radius \( r \) of \( \bar{x} \).
Claim. For sufficiently small interaction radius \(r\),
\( J_k f(\bar{x}) = \sum_{i=1}^N w_i\, J_k f(x_i) + O(r^{k+1}) \)
Why this matters. Jets are approximately a convex combination of component jets, with controlled error. This gives a principled way to separate nested residual streams passing through one nonlinear block.
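A scalar sanity check of the low-order case of this claim. Because \( \sum_i w_i (x_i - \bar{x}) = 0 \), the first-order terms cancel, so the gap between the value at \( \bar{x} \) and the convex combination of values should shrink like \( r^2 \) (halving \( r \) divides it by about 4). The cluster and weights below are arbitrary choices for illustration:

```python
import math

def gap(r):
    """Gap |f(xbar) - sum_i w_i f(x_i)| for a point cluster of radius r."""
    f = math.sin
    xs = [1.0 - r, 1.0, 1.0 + r]   # cluster of radius r around 1.0
    ws = [0.25, 0.5, 0.25]         # convex weights (nonnegative, sum to 1)
    xbar = sum(w * x for w, x in zip(ws, xs))
    return abs(f(xbar) - sum(w * f(x) for w, x in zip(ws, xs)))

# First-order terms cancel at the convex combination, so the gap is O(r^2):
print(gap(0.2) / gap(0.1))  # ratio close to 4
```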
1) Redundancy. Transformers behave like ensembles over exponentially many skip/non-skip paths across scales, giving multiple pathways to carve.
2) Unification. Many interpretability methods are low-order decompositions: keep the terms we care about, discard the remainder.
Jet-order intuition (local expansion).
Extract in-model bigrams from low-complexity pathways by sweeping the vocabulary.
Then test whether these bigrams track learning during pretraining and change appropriately after finetuning.
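The sweep can be sketched as follows. This is a toy stand-in, not the actual extraction pipeline: random matrices play the roles of the embedding, a single low-order pathway, and the unembedding, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 8, 16  # toy vocabulary size and hidden width (illustrative only)

E = rng.normal(size=(V, d))   # embedding (stand-in for a pretrained model's)
W = rng.normal(size=(d, d))   # one low-complexity pathway, e.g. a single linear block
U = rng.normal(size=(d, V))   # unembedding

# Sweep the whole vocabulary through the chosen pathway and read off, for each
# input token, its most-favoured successor: the "in-model bigrams".
logits = E @ W @ U            # shape (V, V): row i scores successors of token i
bigrams = {i: int(np.argmax(logits[i])) for i in range(V)}
print(bigrams)
```

In a real model the pathway would be selected from the jet decomposition (e.g., the lowest-order terms through one block), and the sweep repeated per checkpoint to track how the bigram table evolves.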
Early checkpoints show noisy pairs (for example, yaml Adam), while later checkpoints surface more coherent pairs (for example, its own, make sure).
Model-level bigram shifts provide a direct check of whether finetuning changed the targeted knowledge.
RLHF improves ToxiGen scores, but LLMs like Llama-2-7B-Chat still retain toxic knowledge. With increasingly explicit prompts, toxicity resurfaces: 84% for hard prompts. Jet bigram analysis (our method) confirms that RLHF mostly hides, rather than removes, toxic patterns.
Blue: toxic mass change. Orange: refusal mass change.
Setup
Compare Llama2-7B-Chat vs augmented model
(Qi et al., ICLR 2025).
Relative bigram mass:
\( \frac{M_{aug} - M_{chat}}{M_{chat}} \).
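The metric is a simple relative change; a minimal sketch (the numeric inputs below are hypothetical, chosen only to echo the reported +8.43% shift):

```python
def relative_bigram_mass(m_aug, m_chat):
    """Relative change in bigram mass: (M_aug - M_chat) / M_chat."""
    return (m_aug - m_chat) / m_chat

# Hypothetical per-layer masses, for illustration only.
print(f"{relative_bigram_mass(1.0843, 1.0):+.2%}")  # +8.43%
```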
Toxic Bigram Mass
Joint metric ≈ −0.05% (no global removal).
Most layers: −0.5% to −2%.
Refusal Mass
Increased in mid–late MLP layers. Final layer (MLP32): +8.43%.
Interpretation
Alignment reshapes depth-wise distributions and reinforces refusal signatures rather than erasing toxic representations.
Xiangyu Qi et al. Safety Alignment Should Be Made More Than Just a Few Tokens Deep. ICLR 2025.
Yes, with model plasticity.
Definition. A change in computation should cause a targeted change in knowledge (e.g., removing one fact) while preserving unrelated capabilities.
Design principle. Modularize the system with a straightforward split: Part I, the token embeddings; Part II, the transformer body.
Question: can we make the body more general?
Apply controlled perturbations to embeddings during pretraining.
The body repeatedly recovers, learning invariance to lexical shifts.
Result: stronger generalization with targeted intervention.
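The reset-and-recover loop can be sketched on a toy model. This is not the actual pretraining setup: a tiny linear factorization (embeddings times a shared "body") trained by SGD stands in for the transformer, and the reset interval and learning rate are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4
targets = rng.normal(size=(V, d))      # fixed toy regression target per token

E = rng.normal(size=(V, d)) * 0.1      # Part I: token embeddings
body = rng.normal(size=(d, d)) * 0.1   # Part II: shared body (stand-in)

lr, reset_every = 0.05, 200            # hypothetical hyperparameters
for step in range(1, 1001):
    i = rng.integers(V)
    h = E[i] @ body                    # forward: embed, then body
    err = h - targets[i]
    # SGD on the squared error, updating both parts.
    body -= lr * np.outer(E[i], err)
    E[i] -= lr * body @ err
    if step % reset_every == 0:
        # Active forgetting: re-initialize the embeddings only. The body must
        # repeatedly recover with fresh embeddings, which pushes it toward
        # solutions that are easy to re-adapt (plasticity).
        E = rng.normal(size=(V, d)) * 0.1
```

The key design choice is that the perturbation is confined to Part I, so only the embedding-independent structure learned by the body survives every cycle.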
General vs. specific representations.
Meta-learning with minimal intervention during pretraining.
Repeated reset-and-recover cycles induce plasticity.
| Benchmark | Standard PLM | Forgetting PLM | Relative gain |
|---|---|---|---|
| XNLI (Acc) | 53.3 | 62.7 | +21.2% |
| MLQA (F1) | 34.3 | 43.4 | +33.8% |
| XQuAD (F1) | 36.1 | 49.0 | +60.9% |
| XQuAD @5K updates | 53% of final | 92% of final | faster adaptation |
1) Why control is hard
LLM knowledge is distributed and entangled, so data-level fixes can miss model-level structure.
2) Decomposition gives visibility
Jet decomposition reframes transformers as structured sums of computation paths, enabling interpretable probes such as in-model n-grams.
3) Plasticity makes intervention possible
Active Forgetting trains models to tolerate controlled perturbations, enabling targeted updates while preserving unrelated capabilities.
Bottom line
Controllability requires both: decomposition to diagnose and plasticity to intervene.
Questions? yihong.chen@cs.ox.ac.uk