Question regarding Positional Information and Internal World Models in CTM for Maze Solving #4
-
Hi Tao.

Thank you for your kind words and question!

We discuss this in Section 4.4 of the paper, but I'll expand and reiterate for you, because this is a really crucial question that I believe will be important moving forward with research in this space. I don't quite know if I can speak directly to your TPA perspective, but I shall share my intuition and hopefully that helps. Let us consider an example solve; a few observations:
Gazing beyond the training path horizon

Interestingly, we noticed that even after it has 'looked' all the way to the end of the path it is predicting (e.g., length 100 in the paper; a hyper-parameter choice), it continues to gaze further along the path. If it 'could' output a longer path, it likely would. Perhaps a future experiment should entail taking the pretrained maze model, extending the number of internal ticks (which is crucial to observe this emergent phenomenon), and fine-tuning it; a minimal sketch of that setup follows this reply.

The attention heads are more complex than they seem

This is also a crucial point, and you can see it in the demo here: https://pub.sakana.ai/ctm/ . The CTM's attention is not simply looking beyond the path (although on average it seems to be, based on our analysis and the visualizations). Instead, it is looking at multiple locations, and some attention heads seem to be gathering more global perspectives. Set the animation speed to be very slow and watch the attention heads (below the maze) carefully to observe this; a plotting sketch for this kind of per-head inspection is also included below.

The patterns 'mature' over time

We did not cover this in the paper, but you can confirm it for yourself: as the CTM learns, what the attention pattern is doing changes over time. On several occasions we observed a sort of 'double take' phenomenon where the CTM would look down a path and then double back to fix any mistakes. As it gets better and learns more, it discards this wasteful process, but it is quite interesting to watch it learn. Further, in some instances we observed a 'backwards' solve, similar to some of the parity results.

Note that these points are based on my intuition and my opinions, and are not necessarily proven concretely or shared by everyone on the team.
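On the fine-tuning idea above, here is a minimal, self-contained sketch of the underlying principle in plain PyTorch, not the CTM codebase itself: because the recurrent weights are shared across internal ticks, a model pretrained with T ticks can be rebuilt with a larger tick count and still load the pretrained weights directly. `TinyRecurrentSolver` and all of its details are illustrative stand-ins, not names from the repository.

```python
# Sketch only: the tick count is a loop bound, not a parameter, so a
# checkpoint trained at 75 ticks loads cleanly into a 150-tick model.
import torch
import torch.nn as nn

class TinyRecurrentSolver(nn.Module):
    """Stand-in for a tick-based model: one shared cell unrolled T times."""
    def __init__(self, dim: int = 64, ticks: int = 75):
        super().__init__()
        self.ticks = ticks                # loop count only; no weights depend on it
        self.cell = nn.GRUCell(dim, dim)  # shared across every internal tick
        self.head = nn.Linear(dim, 4)     # e.g. move logits (N/E/S/W) per tick

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.zeros_like(x)
        outputs = []
        for _ in range(self.ticks):       # unroll the internal ticks
            h = self.cell(x, h)
            outputs.append(self.head(h))
        return torch.stack(outputs, dim=1)  # (batch, ticks, 4)

# "Pretrain" at 75 ticks, then rebuild at 150 ticks and reuse the weights:
# the state dicts match exactly because the tick count is not a parameter.
pretrained = TinyRecurrentSolver(ticks=75)
extended = TinyRecurrentSolver(ticks=150)
extended.load_state_dict(pretrained.state_dict())
print(extended(torch.randn(2, 64)).shape)  # torch.Size([2, 150, 4])
```

In the real experiment one would do the analogous thing with the CTM: rebuild it with more internal ticks, load the pretrained maze checkpoint, and fine-tune with the usual training loop.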
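And for watching the attention heads slowly outside of the demo page, one could render each head's attention map over the maze grid, one frame per internal tick. The sketch below uses random stand-in weights with an assumed `(ticks, heads, H, W)` shape; real values would come from whatever attention module the trained model exposes.

```python
# Sketch: per-head attention maps over a maze grid, one frame per tick.
# The weights here are random stand-ins, not real model outputs.
import matplotlib.pyplot as plt
import numpy as np

ticks, heads, H, W = 20, 4, 39, 39              # illustrative sizes only
attn = np.random.rand(ticks, heads, H, W)       # stand-in attention weights
attn /= attn.sum(axis=(-2, -1), keepdims=True)  # normalise each map to sum to 1

fig, axes = plt.subplots(1, heads, figsize=(3 * heads, 3))
for t in range(ticks):                          # step through internal ticks
    for h, ax in enumerate(axes):
        ax.clear()
        ax.imshow(attn[t, h], cmap="viridis")
        ax.set_title(f"head {h}, tick {t}")
        ax.axis("off")
    plt.pause(0.5)                              # slow playback, like the demo's slow setting
plt.show()
```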