Curiosity about use cases and why? #2
-
Hello and thank you for your question! We released this work at a relatively early stage because we think that the benefits of rethinking some aspects of AI could unlock new avenues, and we really want to inspire and get people thinking about ways to explore intelligence. A core tenet of AI research is to understand intelligence. Biological intelligence (i.e., our best example) emerges from systems where neurons are far more complex than what NNs implement, where neural timing and synchronization matter, and where recurrence is central. Thus, our research is an exploration of aspects of intelligence that have yet to be fully explored and/or leveraged in modern models.

We set the maze task up to be intentionally difficult and distinct from how earlier work has set it up. The purpose of this was to build a task where complex sequential reasoning was a mandated requirement for success. Thus, tasks that require sequential reasoning are prime candidates. We are thinking long-term about what tasks are the best candidates for a CTM, and our next paper in this space will definitely attempt to bring these to light. As is often the case with novel ideas in AI, it can take some time to unpack the benefits, use cases, etc., and we hope that you will be inspired to come on this journey with us! To reiterate: our primary motivation is to understand, explore, and build advanced forms of artificial intelligence; the optimal use cases are important, but secondary to this core motivation.

**Scaling NLMs**

For a model of width D, the NLMs definitely do scale up the parameters (precisely as you have noted). Scaling models has proven potent over the past few years. Yet, perhaps the benefits are tapering off. Our perspective is that NLMs represent a new dimension to which scaling can be applied, and we're only beginning to explore the benefits it offers.

**Synchronization cost**

As we discuss in Section 2.4.1 of the paper, we use a sub-sampling approach and definitely do not compute the full synchronization matrix as you have described. Further, and as implemented in the code, we use a far cheaper but mathematically identical recursive computation of the synchronization representations (see Appendix K in the paper).

**The point of synchronization representations**

Synchronization is fundamentally new, and we have only scratched the surface of what it can enable. What synchronization does, fundamentally, is loosely decouple the action of thought (computation and reasoning) from the result thereof (e.g., a class prediction), enabling new flexibility. It is, once again, biologically inspired.
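As a rough illustration of the sub-sampled, recursive computation mentioned above, here is a minimal sketch assuming the simplest running-dot-product form (names like `sync_running` are hypothetical, not from the repository; the paper additionally describes a learnable decay-weighted form, omitted here for brevity):

```python
import torch

# Illustrative sketch only. For each sub-sampled neuron pair (i, j), the
# history dot product  S_t[i, j] = sum_{tau <= t} z_tau[i] * z_tau[j]
# is kept as a running sum. Each internal tick then costs O(1) per pair,
# i.e. O(P) for P sampled pairs, rather than O(D^2 * t) for the full matrix.

D, P = 512, 64                          # model width, number of sampled pairs
pairs_i = torch.randint(0, D, (P,))     # hypothetical random pair selection
pairs_j = torch.randint(0, D, (P,))
sync_running = torch.zeros(P)           # one running dot product per pair

def tick(z_t: torch.Tensor) -> torch.Tensor:
    """Advance synchronization by one internal tick, given activations z_t of shape (D,)."""
    global sync_running
    # Recursive update: mathematically identical to recomputing the dot
    # products over the stored history, without storing the history at all.
    sync_running = sync_running + z_t[pairs_i] * z_t[pairs_j]
    return sync_running                 # synchronization representation fed onward
```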
-
If we could seamlessly transfer the weights of a pre-trained Transformer model to the CTM architecture, we could temporarily leverage additional resources in the CTM to analyze the interpretability of a powerful model. Unfortunately, no theoretical basis for such a transfer has been identified in the existing formulations.
-
Hello,
Just curious: what would be the use cases of this architecture, and its benefits vs. a normal one?
What use cases? I can't seem to think of any that would justify training such a machine, because the more you scale it, the more resource-intensive it gets. Transformers are already very resource-intensive, but this is way beyond that in needed compute.
I mean, if every neuron in a width-$D$ model has its own MLP (even a small one), the total parameter count could explode for very wide networks, potentially more so than architectures with shared parameters. The processing of a history $A_t$ for each NLM also adds computational load per neuron.
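As a rough sanity check on how that count scales, here is a back-of-the-envelope sketch (the history length $M$ and hidden width $H$ are illustrative choices, not values from the paper):

```python
# Illustrative parameter count for per-neuron MLPs vs. one shared dense layer.
D = 4096          # model width: number of neurons
M = 16            # pre-activation history length seen by each NLM (made up)
H = 32            # hidden width of each private two-layer MLP (made up)

# Each neuron's private MLP: an (M -> H) layer plus an (H -> 1) output, with biases.
per_neuron = (M * H + H) + (H + 1)
nlm_total = D * per_neuron          # grows linearly in D for a fixed MLP size
dense_total = D * D                 # a single shared dense D x D layer, for scale

print(f"per-neuron MLP params: {per_neuron}")         # 577
print(f"all NLMs:              {nlm_total:,}")        # 2,363,392
print(f"one dense D x D layer: {dense_total:,}")      # 16,777,216
```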
And calculating $S_t = Z_t \cdot (Z_t)^\top$ is an $O(D^2 \cdot t)$ operation (or $O(D^2)$ if $Z_t$ means the current history of activations up to $t$ for $D$ neurons, with $D^2$ for the matrix multiplication at a fixed history length). While they subsample pairs, the full matrix (or at least dot products involving all neurons over their history) seems to be computed. For very large $D$ (millions/billions of neurons, as hypothesized for AGI), this $D^2$ factor is a classic scaling concern, so this looks extremely inefficient to me, and I can't see the benefits or use cases.
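For concreteness, the full-matrix computation being described would look like this, with illustrative sizes (a sketch only, not the repository's code, which sub-samples pairs instead):

```python
import torch

D, t = 1024, 50                # illustrative width and number of internal ticks
Z_t = torch.randn(D, t)        # activation history: D neurons over t ticks
S_t = Z_t @ Z_t.T              # full D x D synchronization matrix: O(D^2 * t) work
print(S_t.shape)               # torch.Size([1024, 1024])
```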