Dual Forces in the Making of Knowledge Engines
Yihong Chen
University College London
Knowledge engines are shaped by two underlying, complementary forces: structure and destructure.
These forces manifest across both structured and unstructured modeling paradigms.
| Force | Structured paradigm | Unstructured paradigm |
|---|---|---|
| Structure | Language modeling objectives induce structure in factorization models. (Chapter 2) | Language modeling objectives induce structure in Transformers. (Chapter 3) |
| Destructure | Active forgetting supports generalization to unseen graphs. (Chapter 4) | Active forgetting supports generalization to unseen languages. (Chapter 5) |
Table 6.1: Structured and unstructured paradigms through the lens of structure and destructure.
Despite differences in implementation, a unifying lens reveals shared mechanisms:
| Force | Mechanistic Insight |
|---|---|
| Structure | Language modeling induces structured computation within models. (Part I) |
| Destructure | Active forgetting enables generalization to novel symbolic environments. (Part II) |
Table 6.2: A unified view of structure and destructure in language modeling.
A new self-supervised training objective
\[ \underset{\theta \in \Theta}{\arg\max} \sum_{\langle s, p, o \rangle \in \mathcal{G}} \left[ \log P_\theta(s \mid p, o) + \log P_\theta(o \mid s, p) + {\color{red}{\lambda \log P_\theta(p \mid s, o)}} \right] \]
\[ \log P_\theta(p \mid s, o) = \phi_\theta(s, p, o) - \log \sum_{p' \in \mathcal{R}} \exp\left[\phi_\theta(s, p', o)\right] \]
Equation (2.3): Extending 1vsAll with a relation prediction term
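As a rough illustration of Equation (2.3), the sketch below spells out the 1vsAll objective with the additional relation-prediction term in PyTorch. The DistMult scorer, the embedding dimension, and the helper names (`score_o`, `score_s`, `score_p`, `objective`) are illustrative assumptions rather than the thesis implementation; the red term in Equation (2.3) corresponds to `lam * loss_p`.

```python
# Minimal sketch of Equation (2.3): 1vsAll training over triples <s, p, o>
# extended with a relation-prediction term weighted by lambda.
# A DistMult scorer is assumed purely for illustration.
import torch.nn as nn
import torch.nn.functional as F


class DistMult(nn.Module):
    def __init__(self, num_entities: int, num_relations: int, dim: int = 200):
        super().__init__()
        self.ent = nn.Embedding(num_entities, dim)
        self.rel = nn.Embedding(num_relations, dim)

    def score_o(self, s, p):
        # phi(s, p, o') for every candidate object o'
        return (self.ent(s) * self.rel(p)) @ self.ent.weight.t()

    def score_s(self, p, o):
        # phi(s', p, o) for every candidate subject s'
        return (self.ent(o) * self.rel(p)) @ self.ent.weight.t()

    def score_p(self, s, o):
        # phi(s, p', o) for every candidate relation p'
        return (self.ent(s) * self.ent(o)) @ self.rel.weight.t()


def objective(model, s, p, o, lam: float = 1.0):
    # -[log P(o | s, p) + log P(s | p, o) + lambda * log P(p | s, o)],
    # each term a full-softmax (1vsAll) cross-entropy over all candidates.
    loss_o = F.cross_entropy(model.score_o(s, p), o)
    loss_s = F.cross_entropy(model.score_s(p, o), s)
    loss_p = F.cross_entropy(model.score_p(s, o), p)
    return loss_o + loss_s + lam * loss_p
```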
The objective treats all triple components (subject, predicate, and object) as tokens to be predicted.
| Dataset | Entity Prediction | Relation Prediction | MRR | Hits@1 | Hits@3 | Hits@10 |
|---|---|---|---|---|---|---|
| WN18RR | ❌ | ✅ | 0.258 | 0.212 | 0.290 | 0.339 |
| WN18RR | ✅ | ❌ | 0.487 | 0.441 | 0.501 | 0.580 |
| WN18RR | ✅ | ✅ | 0.488 | 0.443 | 0.505 | 0.578 |
| FB15K-237 | ❌ | ✅ | 0.263 | 0.187 | 0.287 | 0.411 |
| FB15K-237 | ✅ | ❌ | 0.366 | 0.271 | 0.401 | 0.557 |
| FB15K-237 | ✅ | ✅ | 0.388 | 0.298 | 0.425 | 0.568 |
| Aristo-v4 | ❌ | ✅ | 0.169 | 0.120 | 0.177 | 0.267 |
| Aristo-v4 | ✅ | ❌ | 0.301 | 0.232 | 0.324 | 0.438 |
| Aristo-v4 | ✅ | ✅ | 0.311 | 0.240 | 0.336 | 0.447 |
Table 2.3: The new objective brings consistent improvements across all datasets on all metrics, except Hits@10 on WN18RR.
In practice, however, the body is not general enough and requires a lot of data to adapt to a new language.
Is there any way to make the body more general?
A model with a general body would remain plastic to new embedding learning, adapting to a new language with far less data.
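One concrete reading of this body/embedding split, sketched under assumptions: freeze the pretrained body entirely and relearn only a freshly initialised embedding layer on the new language. The checkpoint name is a placeholder, and the LM head is assumed to be tied to the input embeddings; neither detail is taken from the thesis.

```python
# Hedged sketch of adapting a pretrained model to a new language by
# relearning only the token embeddings while the body stays frozen.
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("roberta-base")  # placeholder checkpoint

# Freeze every parameter of the body.
for param in model.parameters():
    param.requires_grad = False

# Re-initialise the input embeddings and make only them trainable.
embeddings = model.get_input_embeddings()
torch.nn.init.normal_(embeddings.weight, mean=0.0, std=0.02)
embeddings.weight.requires_grad = True

# Only the embedding matrix is optimised on the new-language corpus.
optimizer = torch.optim.AdamW([embeddings.weight], lr=1e-4)
```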
General vs. Specific
Meta-learning with minimal intervention during pretraining
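A minimal sketch of what this minimal intervention could look like, assuming that active forgetting amounts to periodically re-initialising the token embeddings during pretraining while the body keeps training. The reset interval, the HF-style `.loss` interface, and the loop structure are illustrative assumptions.

```python
# Hedged sketch of pretraining with active forgetting: the embedding layer is
# reset every `reset_every` optimiser steps, so the body cannot rely on any
# one particular set of embeddings and stays easy to pair with new ones.
import torch


def pretrain_with_forgetting(model, data_loader, optimizer, reset_every: int = 1000):
    step = 0
    for batch in data_loader:
        loss = model(**batch).loss  # assumes an HF-style model returning .loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        step += 1
        if step % reset_every == 0:
            # Forgetting event: re-initialise the token embeddings.
            # (Resetting their optimiser state is omitted in this sketch.)
            emb = model.get_input_embeddings()
            torch.nn.init.normal_(emb.weight, mean=0.0, std=0.02)
```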
Figure 5.6: Relative gains of forgetting PLMs over standard PLMs across languages for XNLI. Forgetting yields substantial relative gains for languages like Arabic, Hindi, Thai, Turkish, and Urdu.
Questions and Discussion