The Aphygo Blog

Explore our latest insights on AI memory, scaling, and the future of intelligent systems.

The Goldilocks Dilemma of AI Memory: Why Embedding Dimensions Matter

Johan Thulin - CTO Aphygo AB

In the world of Artificial Intelligence, "memory" isn't just about storing data; it's about understanding it. When an AI needs to remember a concept—like a user's preference, a paragraph from a document, or the meaning of a search query—it turns that information into a list of numbers called a vector embedding.

Imagine this embedding as a coordinate on a giant, multi-dimensional map. Similar concepts live close together, and dissimilar ones live far apart. This spatial relationship is what allows an AI to perform "semantic search," finding things based on meaning rather than just matching keywords.
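The map metaphor above can be sketched in a few lines of code. The embeddings below are tiny, hand-invented 4-dimensional vectors purely for illustration (a real model produces hundreds or thousands of dimensions from text); the point is that "closeness in meaning" reduces to a cosine-similarity computation:

```python
import numpy as np

# Toy 4-dimensional "embeddings" — invented for illustration only.
embeddings = {
    "a cat sat on the mat":  np.array([0.9, 0.1, 0.0, 0.2]),
    "a kitten on a rug":     np.array([0.8, 0.2, 0.1, 0.3]),
    "quarterly bank report": np.array([0.0, 0.9, 0.8, 0.1]),
}

def cosine_similarity(u, v):
    """Similarity of direction: ~1.0 = very close in meaning, ~0.0 = unrelated."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Pretend this is the embedding of the query "sleepy cat".
query = np.array([0.85, 0.15, 0.05, 0.25])

# Semantic search = rank documents by distance on the map, not by keywords.
ranked = sorted(embeddings,
                key=lambda t: cosine_similarity(query, embeddings[t]),
                reverse=True)
print(ranked)  # the two cat sentences rank above the finance one
```

No word in "quarterly bank report" overlaps with "sleepy cat", yet the vectors alone are enough to push it to the bottom of the ranking.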

But here is the crucial question for anyone building an AI system: How many dimensions should that map have?

Should it be a simple 2D plot, a complex 3D space, or a hyper-dimensional realm with thousands of axes? The answer is a classic "Goldilocks" problem. Too few dimensions, and your AI's memory is blurry. Too many, and it becomes slow, expensive, and strangely empty.

Let's explore the trade-offs of this critical architectural decision.

The Case for High Dimensions: A 4K View of the World

Modern state-of-the-art embedding models often use a huge number of dimensions—sometimes upwards of 3,072 or even 4,096. Why so many?

Think of high dimensionality as high resolution. A 4K TV can show you details that an old tube TV simply couldn't. Similarly, a high-dimensional embedding can capture nuanced distinctions in meaning. It can tell the difference between a "slightly annoyed kitten" and an "angry cat." It can distinguish between "bank" (financial institution) and "bank" (river edge) based on subtle contextual clues.

Pros of High Dimensions:

  • Precision: Captures fine-grained semantic details for highly accurate retrieval.
  • Nuance: Better at understanding complex, multi-faceted concepts.

Cons of High Dimensions:

  • The "Empty Space" Problem: In a massive 4096-dimensional space, data points are incredibly sparse. The distance between any two random points starts to look the same, which can make it harder to form meaningful clusters or find "neighbors." This is often called the curse of dimensionality.
  • Extreme Cost: Storing and processing these massive vectors requires significantly more memory (RAM) and computational power (GPU/CPU). A similarity check that takes 1 millisecond at low dimensions could take 5 to 10 times longer at high dimensions.
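The "empty space" problem is easy to see numerically. The toy experiment below uses random Gaussian points as a stand-in for real embeddings: as the dimension grows, the spread of distances shrinks relative to their mean, so every point starts to look roughly equidistant from every other:

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(dim, n_points=500):
    """(max - min) distance to an anchor point, relative to the mean distance.
    Small values mean 'everything looks equally far away'."""
    points = rng.standard_normal((n_points, dim))
    d = np.linalg.norm(points - points[0], axis=1)[1:]  # distances to point 0
    return (d.max() - d.min()) / d.mean()

for dim in (4, 64, 1024, 4096):
    print(dim, round(distance_spread(dim), 3))  # spread shrinks as dim grows
```

At 4 dimensions, nearest and farthest neighbors are clearly distinguishable; at 4096, the relative spread has collapsed, which is exactly why clustering gets harder.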

The Case for Low Dimensions: Speed and Connection

On the other end of the spectrum, you have models that use fewer dimensions, perhaps 256 or 384.

Think of this as a pixel-art version of an image. The fine details are gone—the "annoyed kitten" and "angry cat" might get compressed into the same general "grumpy feline" blob.

Pros of Low Dimensions:

  • Incredible Efficiency: They are fast to generate, cheap to store, and lightning-quick to search. You can run these systems on standard hardware without breaking the bank.
  • Dense Connectivity: Because the space is smaller, data points are forced to pack closer together. This makes it very easy to see who your "neighbors" are and to form dense, interconnected clusters of related concepts.

Cons of Low Dimensions:

  • Loss of Detail: Subtle semantic distinctions get lost in compression. You sacrifice accuracy for speed.

The "Sweet Spot" and the “Russian nesting doll”

For years, the industry standard has been around 768 dimensions (popularized by models like BERT). This has proven to be a fantastic middle ground—dense enough to form good connections but sharp enough to capture meaningful detail.

Recently, however, a new technique called Matryoshka Representation Learning (MRL) has emerged, offering the best of both worlds. MRL trains models so that the most important information is packed into the first few dimensions, much like the nested layers of the Russian matryoshka doll from which the technique takes its name.

This means you can take a massive 4096-dimensional embedding from a top-tier model and "slice" off just the first 512 or 768 dimensions.

  • You use the sliced version for your primary search index, keeping it fast, cheap, and densely connected.
  • You can still keep the full version in storage and use it only when you need that final, high-precision check.
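Assuming an MRL-trained model, the two-stage pattern above might look like the sketch below. The vectors here are random placeholders (a real MRL model would concentrate meaning in the leading dimensions); the mechanics of slicing, re-normalizing, and re-ranking are the point:

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Placeholder 4096-d "embeddings" — random here; a real MRL model would
# pack the most important information into the leading dimensions.
corpus_full = normalize(rng.standard_normal((1_000, 4096)))
query_full  = normalize(rng.standard_normal(4096))

SLICE = 512  # the cheap index keeps only the leading dimensions

# Important: re-normalize after slicing, or similarities are distorted.
corpus_small = normalize(corpus_full[:, :SLICE])
query_small  = normalize(query_full[:SLICE])

# Stage 1: fast scan over the sliced index, keep the top 50 candidates.
coarse_scores = corpus_small @ query_small
candidates = np.argsort(coarse_scores)[-50:]

# Stage 2: re-rank only those candidates with the full 4096-d vectors.
fine_scores = corpus_full[candidates] @ query_full
best = candidates[np.argmax(fine_scores)]
print("best match index:", int(best))
```

Stage 1 touches every document but only 512 of 4096 dimensions; stage 2 touches all 4096 dimensions but only 50 documents, so the expensive precision is spent exactly where it matters.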

This hybrid approach is quickly becoming the standard for building scalable, high-performance AI memory systems. It allows engineers to design systems that are fast enough for real-time interaction but smart enough to understand the nuances of human language.

Ultimately, there is no single "correct" dimension size. It depends entirely on what you are building. Are you building a high-frequency trading bot that needs speed above all else? Go low. Are you building a legal discovery tool where missing a nuance could be disastrous? Go high. For most general-purpose AI memory systems, finding that "Goldilocks" zone in the middle—or using a clever trick like MRL—is the key to success.


For a deeper dive into Matryoshka Representation Learning and how it enables flexible embedding sizes, the video "Matryoshka Representation Learning (MRL) for ML tasks and vector compression" provides an excellent overview of the core technique discussed above for balancing embedding dimension trade-offs.

Can a Swarm of Intelligent Ants Out-Think the Reliability Wall?

Rethinking AI Scaling with Mixture-of-Experts and Multi-Agent “Ant-Stacks”

1. The Reliability Wall – why “just scale it” eventually cracks

Every modern model carries a non-zero per-step error. Chain enough steps together and the chance that all of them succeed plunges exponentially (e.g., $0.999^{1000} \approx 0.37$). This compounding-error law is the invisible brick wall big-model teams keep slamming into. It explains why even trillion-parameter LLMs still hallucinate on very long tasks, and why self-driving stacks need so many safety layers.
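The compounding law is one line of arithmetic:

```python
# Probability that an agent completes an n-step task when each step
# independently succeeds with probability p_step.
def task_success(p_step: float, n_steps: int) -> float:
    return p_step ** n_steps

# 99.9% per-step reliability sounds excellent...
print(task_success(0.999, 10))    # fine for short tasks
print(task_success(0.999, 1000))  # roughly a one-in-three chance
print(task_success(0.999, 7000))  # near-certain failure
```

Note the asymmetry: per-step reliability improves linearly with effort, but task-level reliability decays exponentially with length, so longer tasks demand disproportionately better components.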

The core of the issue lies in a misunderstanding of how scale translates to capability. The impulse is often to simply add more parameters, but the mathematics of neural scaling laws tell a more complex story.

1.1 The Math of Diminishing Returns

In 2022, researchers at DeepMind formulated the "Chinchilla" scaling laws, which describe the relationship between a model's performance (Loss), its number of parameters (N), and the size of its training data (D). The equation reveals the trade-offs at the heart of large-scale training:

$$ L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} $$

Here $E$ is the irreducible loss of the data distribution itself, while $A$, $B$, $\alpha$, and $\beta$ are empirically fitted constants.

In essence, it shows that for a fixed compute budget, there are two primary ways to hit a wall:

  • The Data Wall: You can have a massive model (huge N), but if you don't have enough high-quality training data (D), its performance will plateau. The model essentially memorizes the data it has and fails to generalize.
  • The Model Wall: You can have a vast dataset (huge D), but if your model is too small (small N), it lacks the capacity to learn the complex patterns within the data.

The "just scale it" approach often fixates on N while underestimating the Herculean effort required to scale D with high-quality, diverse tokens. This leads to an inevitable plateau—the Reliability Wall—where adding more parameters yields diminishing returns and does not solve the fundamental problem of compounding error. No mainstream lab has solved this in the open-world case; at best we keep pushing the wall farther out.
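The Data Wall is easy to feel numerically. The sketch below plugs the constants fitted by Hoffmann et al. (2022) into the parametric loss; treat the specific numbers as illustrative rather than exact for any particular model. Holding D fixed while growing N leaves the data term as an immovable floor:

```python
# Chinchilla-style parametric loss L(N, D) = E + A/N^alpha + B/D^beta.
# Constants are the fits reported by Hoffmann et al. (2022) — illustrative only.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**alpha + B / n_tokens**beta

# Scale parameters 1000x while the dataset stays frozen at 10B tokens:
for n in (1e9, 1e10, 1e11, 1e12):
    print(f"N = {n:.0e}, D = 1e10 tokens -> loss {loss(n, 1e10):.3f}")
```

Each 10x jump in N buys less than the one before it, and no amount of N can push the loss below the floor set by E plus the frozen data term. That is the Model/Data trade-off in one loop.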

2. An ant-stack thought experiment

One ant can’t bridge a chasm, but a living “ant-bridge” of thousands can. Could AI copy that trick?

Each "ant" flavour below pairs its core idea with the leading evidence:

  • Sparse Mixture-of-Experts (MoE). Core idea: activate only the subset of parameters (experts) needed for the current token. Evidence: Google's Switch Transformer, DeepMind's GLaM, and Mistral's Mixtral achieve 2–4× better perplexity at equal FLOPs, with production inference possible on consumer GPUs.
  • Independent ensemble / voting. Core idea: run several distinct models, then vote or cross-check. Evidence: code-generation and medical-vision systems report >10× reductions in critical errors when 5–7 diverse models vote.
  • Multi-agent cooperation. Core idea: give each agent a role (planner, coder, critic, tester) and let them talk. Evidence: the DreamCoder team beats single LLMs on 50-step synthesis; Reflexion agents solve 2× more tasks in the Baby-AGI suite.
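A minimal sketch of the sparse-routing idea, with toy sizes and random weights invented purely for illustration: the router scores all experts, but only the top-k chosen ones actually run for a given token, so compute stays roughly constant while total capacity grows with the expert count:

```python
import numpy as np

rng = np.random.default_rng(2)

N_EXPERTS, TOP_K, D = 8, 2, 16  # toy sizes for illustration

W_gate = rng.standard_normal((D, N_EXPERTS)) * 0.1           # router weights
experts = [rng.standard_normal((D, D)) * 0.1 for _ in range(N_EXPERTS)]

def moe_forward(x):
    """Route token x to its top-k experts; only those experts compute."""
    logits = x @ W_gate
    top = np.argsort(logits)[-TOP_K:]        # indices of the chosen experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over the top-k only
    out = sum(w * (x @ experts[i]) for w, i in zip(weights, top))
    return out, top

x = rng.standard_normal(D)                   # one toy "token"
y, chosen = moe_forward(x)
print("experts used:", chosen, "of", N_EXPERTS)  # 2 of 8 run per token
```

With 2 of 8 experts active, the forward pass costs roughly a quarter of the dense-equivalent FLOPs while the model still stores all eight experts' worth of parameters.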

3. How far does the ant-bridge stretch?

3.1 Strengths

  • Capacity without linear compute (MoE)
  • Error drops exponentially when failures are independent (ensembles)
  • Division of labour & self-critique (multi-agent)
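The ensemble bullet follows from simple binomial arithmetic, under the crucial assumption that the voters' errors are truly independent (the hard part in practice, as the next section shows):

```python
from math import comb

def majority_correct(p: float, k: int) -> float:
    """Probability that a majority of k independent voters, each correct
    with probability p, reaches the right answer (k odd)."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range(k // 2 + 1, k + 1))

# One model right 90% of the time vs. small independent ensembles:
for k in (1, 5, 7):
    print(k, round(majority_correct(0.9, k), 4))
```

A single 90%-accurate model fails one time in ten; five independent such voters fail under one time in a hundred, and seven under one in three hundred. The catch is the independence assumption, which correlated training data quietly destroys.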

3.2 Cracks that re-appear

For each swarm tactic, here is where it still fails:

  • MoE: router misfires occur when the gating network shunts an outsized share of tokens to a handful of popular experts, overloading them while the rest sit idle; the resulting imbalance snowballs into local collapse that propagates system-wide. Training the router also adds its own instability and latency.
  • Ensemble: real-world inputs create correlated mistakes. Example: ten vision models all trained on web-scale images misidentify a snow-covered husky as a wolf because the dataset associates snow with wolves; their majority vote is confidently wrong.
  • Multi-agent: communication overhead grows as roughly O(K²) messages per round if K agents broadcast naïvely. Concurrency bugs, stale beliefs, and deadlocks emerge; benchmarks like MultiAgentBench show performance drop-offs beyond ~30 dialogue turns.
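The O(K²) broadcast overhead, compared against a hub-and-spoke coordinator topology, is plain arithmetic:

```python
def broadcast_messages(k: int, rounds: int) -> int:
    """Naive all-to-all: every agent messages every other agent each round."""
    return rounds * k * (k - 1)

def hub_messages(k: int, rounds: int) -> int:
    """Hub-and-spoke: each agent exchanges one message pair with a coordinator."""
    return rounds * 2 * k

for k in (5, 30, 100):
    print(f"{k:>3} agents, 10 rounds: "
          f"broadcast={broadcast_messages(k, 10):>6}, hub={hub_messages(k, 10):>5}")
```

At 5 agents the difference is negligible; at 100 agents naive broadcast sends nearly 50× more messages than the hub, which is why coordination topology becomes a first-class design decision as swarms grow.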

3.3 Engineering the “ant glue” – the hidden cost

Building a reliable swarm is not free:

  • Routing & Load-balancing: MoE routers require gradient-based gating, load-balancing losses, and sometimes reinforcement-learning fine-tuning. A single percentage-point routing skew can erase MoE gains.
  • Message Protocols: Agents need a language and a trust model. Choices span JSON-RPC, protobuf, or natural language with schema enforcement. All add latency; schema-based messages add 5–20% token overhead but cut parsing failures by 90%.
  • Shared Memory & Consensus: Who owns the project plan? Convergent approaches (central blackboard) scale poorly; peer-to-peer needs distributed locks or CRDTs.
  • Security & Sandboxing: Each agent is a new attack surface. Swarms in production must isolate file I/O, cap external calls, and sign messages.

TL;DR: The glue often outweighs the raw model FLOPs.

4. Design principles from successful swarms

  • Diversity is oxygen. Train experts and ensemble members on different data slices, objectives, or even modalities to slash error correlation.
  • Task-level redundancy beats per-step perfection. Fast rollback, replanning, or human hand-offs contain the blast radius of inevitable slips.
  • Critics are cheaper than builders. Small “auditor” models detect or patch flaws faster than large generators can avoid them.
  • Protocols → Products. Invest in explicit coordination APIs (topic-based routing, timeouts, arbitration rules). Good communication beats raw IQ in every ant-colony test so far.

4.5 Digital individuality — “Parented but unique”

A colony is resilient not because every ant is identical, but because each has slight genetic and experiential differences. The software analogue is digital DNA: agents that share a common parent model yet diverge through targeted, high-variance data and fine-tuning.

From diverse data to boutique experts
  • MoE → boutique experts. Swap generalists for specialists: a legal-only expert, a protein-folding expert, a Socratic-dialogue expert. The router now directs a query to the expert whose DNA fits the task, not merely to balance FLOPs.
  • Ensemble → orthogonal blind spots. Train one model on formal-logic proofs, another on synthetic adversarial corpora, a third on multilingual news. Their blind spots overlap far less, so majority vote becomes a true cross-check rather than a loud echo.
  • Multi-agent → genuine personalities.
    • Optimist Planner — fine-tuned on ambitious project retros.
    • Skeptic Critic — immersed in post-mortems and bug trackers.
    • Empirical Tester — fed unit tests and execution traces.

    These agents reason from distinct priors, instead of pretending to.

The colony’s common language

Individuality magnifies the need for integration and calibration:

  • How should a legal expert’s verdict weigh against a creative agent’s story arc?
  • How to stop a Skeptic agent from vetoing every idea?

The glue therefore needs richer schemas, confidence signals, and meta-arbitration learned on top of the experts themselves.

5. Is the reliability wall unsolvable?

In a formally closed domain (e.g. data transmission) you can push error → 0 with enough redundancy. In an open, adversarial world perfect reliability is provably out of reach—complexity, shifting distributions and adversaries guarantee residual risk. But a well-architected swarm raises the “good-enough” ceiling dramatically and, crucially, fails gracefully instead of catastrophically.

6. Final Take-aways for Practitioners

  • Think ecosystems, not monoliths. Combine MoE capacity, ensemble diversity, and agent-level checks. Curate data individually for these components; build specialist corpora instead of just shuffling the same web dump.
  • Measure correlation, not just accuracy. If all your ants share the same blind spot, the bridge still falls. Fine-tune for personality—imbue agents with specific cognitive traits like caution, creativity, or rigor to create truly orthogonal blind spots.
  • Budget for glue. Protocol engineering, logging, and orchestration pipelines deserve sprint time. This includes investing in arbitration; a meta-agent trained to fuse divergent worldviews can turn diversity into your greatest strength.
  • Iterate on protocols as hard as on models. Reliability lives in the interfaces between your agents.

Closing thought

A single super-ant may never cross the chasm alone, but a disciplined, well-diversified colony just might build a bridge long enough—and resilient enough—to carry real-world AI systems over the reliability wall. The future of scalable intelligence is less about one giant brain and more about thousands of specialised, cooperating minds held together by robust, well-engineered glue.