Simplex
Simplex is an AI safety research organization building a science of intelligence.
We believe that understanding intelligence is safety. AI systems are deployed across society, and we don't know how they work. Without genuine understanding, we can't reliably monitor, control, or reason clearly about what these systems are doing. But these same systems also present a new opportunity: for the first time, we have machines complex enough to serve as testbeds for theories of intelligence itself, including biological intelligence.
Our aim is to develop and apply a rigorous theory of latent internal structure in neural networks: how they organize their representations internally, and how that structure relates to computation and behavior. We want this theory to apply to intelligence both artificial and biological.
We believe that intelligence is the defining issue of our time. Beyond the technical challenge, intelligence forces us to ask what we actually are. This is bigger than AI safety in the narrow sense. It's about understanding what makes us human, how we relate to the minds we're building, and what we want to become.
Careers
We are hiring Research Scientists and Senior Research Scientists in the Bay Area, and Research Scientists in London.
Research
Transformers learn factored representations (Preprint)
Our world naturally decomposes into parts, but neural networks learn only from undifferentiated streams of tokens. We show that transformers discover this factored structure anyway, representing independent components in orthogonal subspaces and revealing a deep inductive bias toward decomposing the world into parts.

Rank-1 LoRAs encode interpretable reasoning signals (NeurIPS Workshop)
Reasoning performance can arise from minimal, interpretable changes to base model parameters.

Neural networks leverage nominally quantum and post-quantum representations (Preprint)
Neural nets discover and represent beliefs over quantum and post-quantum generative models.

Simplex progress report (Blog)

Next-token pretraining implies in-context learning (Preprint)
In-context learning arises predictably from standard next-token pretraining.

Constrained belief updates explain geometric structures in transformer representations (ICML)
Transformers implement constrained Bayesian belief updating shaped by architectural constraints.

AXRP Interview: Computational mechanics and transformers (Talk)

FAR Seminar: Building the science of predictive systems (Talk)

What can you learn from next-token prediction? (Talk)

Transformers represent belief state geometry in their residual stream (NeurIPS)
What computational structure are we building into large language models when we train them on next-token prediction? We present evidence that this structure is given by the meta-dynamics of belief updating over hidden states of the data-generating process.
Learn more about our work
Simplex was founded by Paul Riechers and Adam Shai in 2024, bringing together the best of physics and computational neuroscience to build the new science of intelligence needed for AI safety. We are a growing team of world-class researchers and engineers bringing scientific rigor to enable a brighter future.
We've shown that transformers trained on next-token prediction spontaneously organize their activations into geometric structures predicted by Bayesian inference over world models (manuscript, blog post). Even on simple training data, complex fractals emerge that reflect the hidden structure of the world the model is learning. The demo below lets you see it happen in real time.
A 42-parameter RNN trains in your browser on a 3-state hidden Markov model. As it learns to predict the next token, it organizes its activations into a fractal that mirrors optimal Bayesian belief geometry.
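If you want the computation behind the picture, here is a minimal sketch of the optimal Bayesian belief updating that the demo's RNN comes to mirror. The 3-state HMM parameters below are illustrative placeholders, not the demo's actual process:

```python
# Minimal sketch (illustrative parameters, not the demo's code): optimal Bayesian
# belief updating over the hidden states of a 3-state HMM.
import numpy as np

rng = np.random.default_rng(0)

# Token-labelled transitions: T[x, i, j] = P(next state j, emit token x | state i).
T = np.array([
    [[0.70, 0.10, 0.00], [0.05, 0.05, 0.00], [0.00, 0.05, 0.05]],  # token 0
    [[0.05, 0.05, 0.00], [0.10, 0.70, 0.00], [0.05, 0.00, 0.05]],  # token 1
    [[0.05, 0.00, 0.05], [0.00, 0.05, 0.05], [0.10, 0.00, 0.70]],  # token 2
])
assert np.allclose(T.sum(axis=(0, 2)), 1.0)  # total outgoing probability per state

def update_belief(eta, x):
    """Bayes step after observing token x: eta' is proportional to eta @ T[x]."""
    unnorm = eta @ T[x]
    return unnorm / unnorm.sum()

def sample_step(state):
    """Sample (token, next state) given the current hidden state."""
    joint = T[:, state, :]                       # shape (tokens, next states)
    flat = rng.choice(joint.size, p=joint.ravel())
    return divmod(flat, joint.shape[1])

state, eta = 0, np.full(3, 1 / 3)                # uniform initial belief
beliefs = [eta]
for _ in range(5000):
    x, state = sample_step(state)
    eta = update_belief(eta, x)
    beliefs.append(eta)

beliefs = np.array(beliefs)   # points in the 2-simplex; for suitable HMMs these
print(beliefs[-3:])           # trace out the fractal the demo visualizes
```

Each observed token updates the belief over hidden states, and plotting the resulting points in the probability simplex reveals the geometry that the trained network's activations come to echo.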
Since then, we've extended this in several directions: explaining how attention implements belief updating under architectural constraints, deriving in-context learning from training data structure, discovering quantum and post-quantum representations in networks, and showing that transformers decompose their world models into interpretable, factored parts. The perspective and intuition we've developed provide a unique edge for interpretability. For the bigger picture, see our progress report, the FAR Seminar talk, or this recent interview.
Our foundational result showed that transformers trained on next-token prediction spontaneously organize their activations into geometries predicted by Bayesian belief updating over hidden states of a world model. Even when trained on simple token sequences from hidden Markov models, complex fractals emerge in the residual stream, structures far removed from the surface statistics of the training data. We think of this work as a first step toward understanding what we are fundamentally training AI systems to do, and what representations we are implicitly training them to have.
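One way to make this concrete: if belief states are linearly embedded in activations, a simple affine readout from the residual stream should recover them. The sketch below illustrates that kind of regression check on synthetic placeholder activations (random stand-ins, not real model data):

```python
# Hedged sketch of a regression-style check (synthetic placeholder activations, not
# real model data): if belief states are linearly embedded in activations, an affine
# readout should recover them with high R^2.
import numpy as np

rng = np.random.default_rng(0)
n_points, d_model, n_states = 5000, 64, 3

beliefs = rng.dirichlet(np.ones(n_states), size=n_points)        # ground-truth beliefs
embed = rng.normal(size=(n_states, d_model))                     # hypothetical embedding
activations = beliefs @ embed + 0.01 * rng.normal(size=(n_points, d_model))

# Fit an affine map from activations back to the belief simplex (least squares).
X = np.hstack([activations, np.ones((n_points, 1))])
W, *_ = np.linalg.lstsq(X, beliefs, rcond=None)
pred = X @ W

ss_res = ((beliefs - pred) ** 2).sum()
ss_tot = ((beliefs - beliefs.mean(axis=0)) ** 2).sum()
print(f"R^2 of linear belief readout: {1 - ss_res / ss_tot:.3f}")
# With real residual-stream activations, plotting `pred` is what reveals the
# belief-state geometry (fractals, for suitable training processes).
```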
In Constrained Belief Updating Explains Transformer Representations, we asked how attention implements belief updating when Bayesian inference is fundamentally recurrent. We found that attention parallelizes recurrence by decomposing belief updates spectrally across heads, and we made and verified predictions about embeddings, OV vectors, attention patterns, and residual stream geometry at different layers.
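As a toy illustration of the underlying linear algebra (not the paper's construction, and ignoring token conditioning and normalization): spectrally decomposing a transition operator turns a chain of sequential belief-propagation steps into independent modes that can be recombined in a single parallel step, the kind of computation attention is suited to:

```python
# Toy illustration (not the paper's construction): spectral decomposition turns a
# sequential chain of belief propagation steps into independent eigenmodes that can
# be recombined in one parallel step. The matrix below is an illustrative placeholder.
import numpy as np

T = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.7, 0.2],
              [0.2, 0.2, 0.6]])                 # stochastic transition matrix
eta0 = np.array([1.0, 0.0, 0.0])                # initial belief

lam, R = np.linalg.eig(T)                       # T = R @ diag(lam) @ inv(R)
L = np.linalg.inv(R)

def belief_recurrent(n):
    eta = eta0.copy()
    for _ in range(n):                          # n sequential matrix-vector products
        eta = eta @ T
    return eta

def belief_spectral(n):
    coeffs = eta0 @ R                           # project the belief onto eigenmodes
    return np.real((coeffs * lam ** n) @ L)     # each mode scales independently

for n in (1, 5, 20):
    assert np.allclose(belief_recurrent(n), belief_spectral(n))
print(belief_spectral(20))
```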
We've also developed a theory of in-context learning grounded in training data structure. When training data mixes multiple sources, models must infer not just what hidden state the generator is in, but which source is active. This hierarchical belief updating necessarily produces power-law loss scaling with context length and explains why induction heads emerge.
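A minimal caricature of this setup, assuming just two i.i.d. binary sources instead of full hidden Markov models, already shows the effect: an ideal predictor's per-token loss falls with context length purely because its posterior over which source is active sharpens:

```python
# Minimal caricature (assumed setup, not the paper's experiments): two i.i.d. binary
# sources with different statistics. The Bayes-optimal predictor maintains a posterior
# over which source is active; its average loss drops as context accumulates.
import numpy as np

rng = np.random.default_rng(0)
p_one = np.array([0.2, 0.8])                    # P(token = 1) under each source
prior = np.array([0.5, 0.5])

def run_sequence(length=64):
    src = rng.integers(2)                       # the hidden "which source" variable
    post, losses = prior.copy(), []
    for _ in range(length):
        pred = post @ p_one                     # P(next token = 1 | context so far)
        x = rng.random() < p_one[src]
        losses.append(-np.log(pred if x else 1 - pred))
        lik = np.where(x, p_one, 1 - p_one)
        post = post * lik / (post * lik).sum()  # hierarchical update over sources
    return np.array(losses)

avg = np.mean([run_sequence() for _ in range(2000)], axis=0)
print(avg[[0, 1, 3, 7, 15, 31, 63]])            # average loss decays with position
```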
We've been asking what the most general computational framework for understanding neural network representations might be. Our initial work implied activations should lie in simplices, but we've now shown that networks discover quantum and post-quantum belief geometries when these are the minimal way to model their training data. This offers a new foundation for thinking about features, superposition, and what representations neural networks use on their own terms.
Most recently, we've shown that transformers naturally decompose their world model into interpretable parts. These factored belief representations provide an exponential advantage in dimensionality, and suggest that we can understand and surgically intervene on low-dimensional subspaces of large models.
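To see where the advantage comes from, consider the simplest case of a world that is a product of independent hidden processes. The joint belief over all state combinations then factors exactly into small per-factor beliefs, and with many factors the gap between sum and product of dimensions grows exponentially. The sketch below, using random illustrative HMMs rather than our experimental setup, verifies the two-factor case:

```python
# Hedged sketch (random illustrative HMMs, not our experimental setup): when the world
# is a product of independent hidden processes, the joint belief over n1*n2 states is
# exactly the outer product of two small per-factor beliefs, so n1 + n2 numbers
# suffice instead of n1 * n2.
import numpy as np

rng = np.random.default_rng(0)

def random_hmm(n_states, n_tokens):
    """Token-labelled transitions T[x, i, j] = P(next state j, emit token x | state i)."""
    T = rng.random((n_tokens, n_states, n_states))
    return T / T.sum(axis=(0, 2), keepdims=True)

def step(T, eta, x):
    unnorm = eta @ T[x]
    return unnorm / unnorm.sum()

T1, T2 = random_hmm(3, 2), random_hmm(4, 2)     # two independent factors
eta1, eta2 = np.full(3, 1 / 3), np.full(4, 1 / 4)
eta_joint = np.outer(eta1, eta2).ravel()        # belief over all 12 joint states

# Joint transitions for the product process, observing the token pair (x1, x2).
T_joint = np.stack([np.kron(T1[a], T2[b]) for a in range(2) for b in range(2)])

for _ in range(50):
    x1, x2 = rng.integers(2), rng.integers(2)   # an arbitrary observed token pair
    eta1, eta2 = step(T1, eta1, x1), step(T2, eta2, x2)
    eta_joint = step(T_joint, eta_joint, 2 * x1 + x2)
    assert np.allclose(eta_joint, np.outer(eta1, eta2).ravel())

print("joint belief factors exactly:", eta1.round(3), eta2.round(3))
```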