Curiosity about use cases and why? #2
-
Hello and thank you for your question! We released this work at a relatively early stage because we think that the benefits of rethinking some aspects of AI could unlock new avenues, and we really want to inspire and get people thinking about ways to explore intelligence. A core tenet of AI research is to understand intelligence. Biological intelligence (i.e., our best example) emerges from systems where neurons are far more complex than what NNs implement, where neural timing and synchronization matter, and where recurrence is central. Thus, our research is an exploration of aspects of intelligence that have yet to be fully explored and/or leveraged in modern models.

We set the maze task up to be intentionally difficult and distinct from how earlier work has set it up. The purpose of this was to build a task where complex sequential reasoning was a mandated requirement for success. Thus, tasks that require sequential reasoning are prime candidates. We are thinking long-term about what tasks are the best candidates for a CTM, and our next paper in this space will definitely attempt to bring these to light. As is often the case with novel ideas in AI, it can take some time to unpack the benefits, use cases, etc., and we hope that you will be inspired to come on this journey with us! To reiterate: our primary motivation is to understand, explore, and build advanced forms of artificial intelligence; the optimal use cases are important, but secondary to this core motivation.

**Scaling NLMs**

For a model of width D, the NLMs definitely do scale up the parameters (precisely as you have noted). Scaling models has proven potent over the past few years. Yet, perhaps the benefits are tapering off. Our perspective is that NLMs represent a new dimension to which scaling can be applied, and we're only beginning to explore the benefits it offers.

**Synchronization cost**

As we discuss in Section 2.4.1 of the paper, we use a sub-sampling approach and definitely do not compute the full synchronization matrix as you have described. Further, and as implemented in the code, we use a far cheaper but mathematically identical recursive computation of the synchronization representations (see Appendix K in the paper).

**The point of synchronization representations**

Synchronization is fundamentally new, and we have only scratched the surface of what it can enable. What synchronization does, fundamentally, is loosely decouple the action of thought (computation and reasoning) from the result thereof (e.g., a class prediction), enabling new flexibility. It is, once again, biologically inspired.
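As a rough illustration of the sub-sampled, recursive computation mentioned above, here is a minimal sketch assuming the simplest running-dot-product form (names like `sync_running` are hypothetical, not from the repository; the paper additionally describes a learnable decay-weighted form, omitted here for brevity):

```python
import torch

# Illustrative sketch only. For each sub-sampled neuron pair (i, j), the
# history dot product  S_t[i, j] = sum_{tau <= t} z_tau[i] * z_tau[j]
# is kept as a running sum. Each internal tick then costs O(1) per pair,
# i.e. O(P) for P sampled pairs, rather than O(D^2 * t) for the full matrix.

D, P = 512, 64                          # model width, number of sampled pairs
pairs_i = torch.randint(0, D, (P,))     # hypothetical random pair selection
pairs_j = torch.randint(0, D, (P,))
sync_running = torch.zeros(P)           # one running dot product per pair

def tick(z_t: torch.Tensor) -> torch.Tensor:
    """Advance synchronization by one internal tick, given activations z_t of shape (D,)."""
    global sync_running
    # Recursive update: mathematically identical to recomputing the dot
    # products over the stored history, without storing the history at all.
    sync_running = sync_running + z_t[pairs_i] * z_t[pairs_j]
    return sync_running                 # synchronization representation fed onward
```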
-
If we could seamlessly transfer the weights of a pre-trained Transformer model to the CTM architecture, we could temporarily leverage additional resources in the CTM to analyze the interpretability of a powerful model. Unfortunately, no theoretical basis for such a transfer has been identified in the existing formulations.
-
Hello,
Just curious: what would be the use cases of this architecture, and its benefits vs. a normal one?
What use cases? I can't seem to think of any that would justify training such a machine, because the more you scale it, the more resource-intensive it gets. Transformers are already very resource-intensive, but this is way beyond that in needed compute.
I mean, if every neuron in a width-$D$ model has its own MLP (even a small one), the total parameter count could explode for very wide networks, potentially more so than architectures with shared parameters. The processing of a history $A_t$ for each NLM also adds computational load per neuron.
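As a rough sanity check on how that count scales, here is a back-of-the-envelope sketch (the history length $M$ and hidden width $H$ are illustrative choices, not values from the paper):

```python
# Illustrative parameter count for per-neuron MLPs vs. one shared dense layer.
D = 4096          # model width: number of neurons
M = 16            # pre-activation history length seen by each NLM (made up)
H = 32            # hidden width of each private two-layer MLP (made up)

# Each neuron's private MLP: an (M -> H) layer plus an (H -> 1) output, with biases.
per_neuron = (M * H + H) + (H + 1)
nlm_total = D * per_neuron          # grows linearly in D for a fixed MLP size
dense_total = D * D                 # a single shared dense D x D layer, for scale

print(f"per-neuron MLP params: {per_neuron}")         # 577
print(f"all NLMs:              {nlm_total:,}")        # 2,363,392
print(f"one dense D x D layer: {dense_total:,}")      # 16,777,216
```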
And calculating $S_t = Z_t \cdot (Z_t)^\top$ is an $O(D^2 \cdot t)$ operation (or $O(D^2)$ if $Z_t$ means the current history of activations up to $t$ for $D$ neurons, with $D^2$ for the matrix multiplication at a fixed history length). While they subsample pairs, the full matrix (or at least dot products involving all neurons over their history) seems to be computed. For very large $D$ (millions/billions of neurons, as hypothesized for AGI), this $D^2$ factor is a classic scaling concern, so this looks extremely inefficient to me, and I can't see the benefits or use cases.
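For concreteness, the full-matrix computation being described would look like this, with illustrative sizes (a sketch only, not the repository's code, which sub-samples pairs instead):

```python
import torch

D, t = 1024, 50                # illustrative width and number of internal ticks
Z_t = torch.randn(D, t)        # activation history: D neurons over t ticks
S_t = Z_t @ Z_t.T              # full D x D synchronization matrix: O(D^2 * t) work
print(S_t.shape)               # torch.Size([1024, 1024])
```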