
Audit/soorej scaling and reasoning #32

Open

Soorej30 wants to merge 6 commits into staging from audit/Soorej-ScalingAndReasoning

Conversation

@Soorej30 (Contributor)

Updated draft with image URLs

@Soorej30 (Contributor, Author)

@crheckman could you please take a look? Thank you

github-actions bot commented Jan 28, 2026

🚀 Preview Deployed

Your preview is ready for review!

🔗 Preview URL: https://arpg.github.io/vla-foundations/staging/pulls/32/textbook/audits/ReasoningAndScaling_audit/

Review Checklist

  • LaTeX equations render correctly
  • All sections are complete per the template
  • References are formatted properly
  • Figures/diagrams display correctly

Next Steps

  1. Review your rendered content using the preview link above
  2. Tag @crheckman when ready for instructor review
  3. Push updates to auto-refresh the preview

This preview will be removed when the PR is closed.

@crheckman (Collaborator) left a comment

First pass at reasoning review.

# Scaling and Reasoning in Foundational Models

The training of an LLM can be split into two stages: **pre-training** and **post-training**. Pre-training is the stage where we throw all the data we have at a transformer-based model and teach it to predict the next token from internet data. Post-training is the stage where we train the model's reasoning capabilities.

Collaborator:

or any available dataset, the larger the better seems to be clear (up to a certain model size and compute limit).

# Problem Domain & Taxonomy


We define **reasoning** as the capacity to produce **structured, coherent intermediate chains** that reflect underlying domain abstractions:

Collaborator:

you should define "reasoning" before here, like when you first invoke it in the section above; otherwise it reads very flimsily.


## Taxonomy of approaches

1. **Objective Shaping for Reasoning Emergence:** This class of methods treats reasoning as an emergent property that can be induced through carefully designed optimization objectives rather than explicit symbolic structure. Reinforcement learning is used to bias the model toward generating longer, more coherent intermediate chains by rewarding correctness at the trajectory level. **GRPO** exemplifies this approach by replacing absolute reward estimation with group-relative advantage normalization, implicitly encouraging internal competition between candidate reasoning traces. The core assumption is that structured reasoning can be recovered purely through reward shaping over token sequences, without grounding in external state dynamics.

Collaborator:

Add a section on pre-GRPO objective shaping through e.g. DPO and RLHF. That will result in a more natural segue into GRPO.
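
To make the suggested pre-GRPO objective shaping concrete, here is a minimal sketch of the standard DPO pairwise loss (not code from the audited draft; `beta` and the toy log-probabilities below are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    # Implicit reward of each response: beta * log(pi_theta / pi_ref)
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Push the chosen response's implicit reward above the rejected one's
    return -F.logsigmoid(chosen_reward - rejected_reward)

# Toy usage with made-up summed log-probabilities
loss = dpo_loss(torch.tensor(-12.3), torch.tensor(-15.1),
                torch.tensor(-13.0), torch.tensor(-14.8))
```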


These papers approach this through objective shaping for reasoning emergence, data and pretraining scaling, multimodal embodiment, and trade-offs between process complexity and context length.

## Taxonomy of approaches

Collaborator:

This section would benefit from formalization before laying out the taxonomy. What is grounding? What is the "objective"?

| Step | PPO | GRPO |
| -- | -- | -- |
| **Step 1**<br />**Rollout / Sampling** | Freeze current policy $\pi_\text{old}$<br />Sample output tokens $o_1, o_2, \ldots o_T$ | For each prompt, sample a group of $G$ responses $\{y_1, y_2, \ldots, y_G\}$<br />Sampling is done at the **response level**, not token level |
| **Step 2**<br />**Reward Computation** | Reward model assigns a score<br />KL penalty added per token<br />$r_t = r_\phi - \beta \log \frac{\pi_\theta(o_t)}{\pi_\text{ref}(o_t)}$ | Scalar reward assigned to each response<br />$\{r_1, r_2, \ldots, r_G\}$<br />No KL term added at this stage |
| **Step 3**<br />**Advantage Estimation** | Value model $V_\phi$ used with GAE<br />$\hat A_t = \sum_{l \ge 0} (\gamma \lambda)^l \delta_{t+l}$, where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ | Within-group reward normalization<br />$\hat A_i = \frac{r_i - \mu_r}{\sigma_r}$<br />$\mu_r$, $\sigma_r$ are the mean and standard deviation of the group rewards |

Collaborator:

Is this an approximation? Is this just a convenient reframing? Why does it "work"?
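
One way to read the group-relative reframing is that the group mean acts as a Monte-Carlo baseline in place of the learned value function, so the advantage is still "reward minus a baseline", just estimated from sibling samples of the same prompt. A minimal sketch of the GRPO side of Step 3 (illustrative only; the epsilon is an added numerical-stability assumption):

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for the G responses sampled for one prompt.

    rewards: tensor of shape (G,), one scalar reward per sampled response.
    The group mean serves as the baseline; the std rescales the advantages.
    """
    mu = rewards.mean()
    sigma = rewards.std(unbiased=False)
    return (rewards - mu) / (sigma + eps)

# Example: 4 responses to the same prompt, scored 1 if the final answer
# is correct and 0 otherwise -> advantages of roughly [+1, -1, -1, +1]
adv = grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0]))
```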

**DeepSeek-R1** uses GRPO to train the model to reason on its own.
**Cosmos-Reason1** also uses GRPO for the Physical AI RL stage of its training.

## Backbone Scaling

Collaborator:

This section blends all of these different architectures together, leading it to be a soup of efforts. While the table for backbone types tries to make this clear upfront, it is overall unclear why these model architectures are different, or how the RL is different across them (both DeepSeekMath and Cosmos-Reason1 say RL, is DeepSeek-R1 not RL? pretty sure it is...).

Recast this for clarity. Start with R1, then describe Math. Then go to Cosmos and why it's different (visually grounded reasoning especially).


**Cosmos-Reason1** uses a decoder-only multimodal LLM with two possible instantiations:
- Dense Transformer backbone
- Hybrid Mamba-MLP-Transformer architecture

Collaborator:

If you're going to cite Mamba here (and not describe it - which I recommend you don't), you probably should have a reference you want folks to look at.

Cosmos-Reason1-7B uses Qwen2.5-VL as the pre-trained model, while Cosmos-Reason1-56B uses InternViT-300M-V2.5 as the vision encoder and Nemotron-H as the LLM backbone.
For the 56B model, Cosmos-Reason1 also uses a Hybrid-Mamba-MLP-Transformer architecture instead of the standard Transformer backbone to avoid quadratic time complexity; for the 7B model, Qwen2.5-VL acts as the backbone.

<img src="https://raw.githubusercontent.com/arpg/vla-foundations/762372b95e020009c1083aed4d17dbfef9366357/content/textbook/audits/Mamba-MLP.png" alt="Hybrid-Mamba-MLP-architecture" width ="900" />

Collaborator:

Add a heading to the figure as it would help clarify what we're looking at.

| **Vision modality** | None | None |Vision encoder integrated |
| **Data focus** | Text reasoning + RL rewards | DeepSeek-V3: Text <br /> DeepSeek-R1: 146k questions/prompts with verifiable solutions | Vision + Physical reasoning + RL |
| **Data volume emphasis** | Large pretrained LLM <br /> RL via reasoning prompts | Large pretrained model on 14.8 trillion tokens <br /> |Large vision data + curated task-specific multimodal corpora |
| **Scaling effects** | Bigger models show stronger reasoning | Longer training leads to longer reasoning chains and emergent reasoning, with superhuman performance |Larger LLM + vision encoder yields better embodied reasoning |

Collaborator:

OK so bigger is better they all say. How well do they get? What are they good at? Show us the benchmarks/evaluations and tear into why they are good or not.

| **Model sizes** | 1.5B → 70B parameters | 671B parameters, with various distilled variants | 8B and 56B variants |
| **Vision modality** | None | None |Vision encoder integrated |
| **Data focus** | Text reasoning + RL rewards | DeepSeek-V3: Text <br /> DeepSeek-R1: 146k questions/prompts with verifiable solutions | Vision + Physical reasoning + RL |
| **Data volume emphasis** | Large pretrained LLM <br /> RL via reasoning prompts | Large pretrained model on 14.8 trillion tokens <br /> |Large vision data + curated task-specific multimodal corpora |

Collaborator:

What were the datasets? Large is obvious.


1. **Objective Shaping for Reasoning Emergence:** This class of methods treats reasoning as an emergent property that can be induced through carefully designed optimization objectives rather than explicit symbolic structure. Reinforcement learning is used to bias the model toward generating longer, more coherent intermediate chains by rewarding correctness at the trajectory level. **GRPO** exemplifies this approach by replacing absolute reward estimation with group-relative advantage normalization, implicitly encouraging internal competition between candidate reasoning traces. The core assumption is that structured reasoning can be recovered purely through reward shaping over token sequences, without grounding in external state dynamics.

2. **Data & Pretraining Modality Scaling:** This line of work focuses on scaling reasoning capacity through domain-specialized pretraining data rather than changes to the objective. In this paradigm, reasoning improves as a function of exposure to large volumes of structured domain text. DeepSeekMath follows this strategy by aggressively curating large-scale mathematical corpora using Common Crawl and OpenWebMath and combining them with instruction tuning and RL. Here, reasoning is not explicitly enforced but statistically induced via data distribution shift toward problems that require multi-step abstraction.

Contributor:

Is the reasoning for improvement just attributed to data distribution shift here? Or are methods like chain of thought still used?


1. **Objective Shaping for Reasoning Emergence:** This class of methods treats reasoning as an emergent property that can be induced through carefully designed optimization objectives rather than explicit symbolic structure. Reinforcement learning is used to bias the model toward generating longer, more coherent intermediate chains by rewarding correctness at the trajectory level. **GRPO** exemplifies this approach by replacing absolute reward estimation with group-relative advantage normalization, implicitly encouraging internal competition between candidate reasoning traces. The core assumption is that structured reasoning can be recovered purely through reward shaping over token sequences, without grounding in external state dynamics.

2. **Data & Pretraining Modality Scaling:** This line of work focuses on scaling reasoning capacity through domain-specialized pretraining data rather than changes to the objective. In this paradigm, reasoning improves as a function of exposure to large volumes of structured domain text. DeepSeekMath follows this strategy by aggressively curating large-scale mathematical corpora using Common Crawl and OpenWebMath and combining them with instruction tuning and RL. Here, reasoning is not explicitly enforced but statistically induced via data distribution shift toward problems that require multi-step abstraction.

Contributor:

Is there any reasoning scaling done through synthetic data generation rather than just crawling the web?


Group Relative Policy Optimization (GRPO) is an improvement over the previously used Proximal Policy Optimization (PPO) method for reinforcement learning.

PPO uses an actor-critic setup in which a value function is used to compute the advantage. The advantage (or Generalized Advantage Estimation, GAE) can be thought of as "how much better a specific action a<sub>t</sub> is compared to the average action in the current state s<sub>t</sub>."

Contributor:

How is the average action (reference) defined? Is it just the reference model in the GRPO vs PPO diagram?
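
For reference, in PPO-style RLHF the "average action" baseline is the learned critic $V_\phi(s_t)$; the reference model in the diagram is typically used only for the KL penalty, not for the advantage. A minimal sketch of the GAE recursion, assuming per-step rewards and critic values are already available (illustrative, not code from either paper):

```python
import torch

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards: (T,) per-step rewards.
    values:  (T+1,) critic estimates V(s_t), including a bootstrap value
             for the state reached after the final step.
    """
    T = rewards.shape[0]
    adv = torch.zeros(T)
    running = torch.tensor(0.0)
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```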

| **Step 5**<br />**Optimization**| Run $K$ epochs of SGD on the same batch | Reuse the same batch for $K$ epochs |


GRPO replaces value-function-based advantages with group-relative rewards, and moves KL regularization out of the reward and directly into the loss. Since the value function is a separate model that needs to be trained, skipping it makes the reinforcement learning stage faster and more efficient.

Contributor:

The value function represents the discounted sum of rewards at a given state. Group-relative rewards seems to be an approximation and doesn't consider future steps. Is there an insight into why this works? Does GRPO not face the issue of high variance in reward estimates?
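
To make "KL in the loss rather than in the reward" concrete, here is a rough sketch of the kind of per-token objective GRPO optimizes; the clipping threshold, KL coefficient, and the particular non-negative KL estimator are illustrative assumptions rather than values from the audited draft:

```python
import torch

def grpo_token_loss(logp_new, logp_old, logp_ref, advantage,
                    clip_eps=0.2, beta=0.04):
    """Clipped policy-gradient term plus an explicit KL penalty in the loss.

    logp_new / logp_old / logp_ref: per-token log-probs of the sampled
    response under the current, rollout-time, and frozen reference policies.
    advantage: the response's group-relative advantage, broadcast per token.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    policy_loss = -torch.min(unclipped, clipped)

    # KL penalty toward the reference model, kept in the loss instead of
    # being folded into the per-token reward.
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1  # always >= 0

    return (policy_loss + beta * kl).mean()
```

On the variance question: because all responses in a group share the same prompt, the group mean is a fairly low-variance baseline in practice, though a degenerate group whose rewards are all identical contributes no learning signal.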

- For input images, we convert the image into tiles and a low-resolution thumbnail image to maintain full context. Then these tokens are concatenated with interleaved tile IDs.
- For input videos, we convert it into a maximum of 32 frames at a max rate of 2 frames per second. The vision encoder generates 1,024 visual tokens per $448 \times 448$ frame, which are then downsampled by a factor of $2 \times 2$ (using PixelShuffle) to 256 tokens per frame.

These tokens are then concatenated with the text tokens and passed into the **Hybrid-Mamba-MLP-Transformer** model. The LLM processes both text tokens and projected visual tokens using standard self-attention. This early fusion of image and text tokens makes sense for this model since it helps the model scale better (8B -> 56B), and the hybrid architecture avoids the quadratic cost of repeated cross-attention at each decoding step.

Contributor:

Might be worth discussing the tradeoff between concatenating the visual tokens + text tokens to help the reader understand what (if any) downsides exist to this method. For example, they may avoid the cross-attention at decoding step, but does that come at the cost of anything performance-wise?
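
As a concrete picture of the token budget being discussed here, a minimal sketch of the 2x2 pixel-shuffle-style merge that takes 1,024 per-frame tokens down to 256 (the 32x32 grid and the embedding width below are illustrative assumptions; per the draft, an MLP projector then maps the result into the LLM embedding space):

```python
import torch

# One frame's visual tokens from the encoder: a 32x32 grid of patch
# embeddings (1,024 tokens). Embedding width 1152 is an assumed placeholder.
frame_tokens = torch.randn(1, 1024, 1152)        # (batch, tokens, dim)
grid = frame_tokens.reshape(1, 32, 32, 1152)
blocks = grid.reshape(1, 16, 2, 16, 2, 1152)      # carve out 2x2 blocks
blocks = blocks.permute(0, 1, 3, 2, 4, 5)         # (1, 16, 16, 2, 2, dim)
downsampled = blocks.reshape(1, 256, 4 * 1152)    # 256 wider tokens per frame
```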


In the DeepSeekMath paper, the focus is on training the model on a clean math corpus, fetched using different filtering algorithms so that the model focuses only on math-related problems in English and Chinese. The model is pretrained on 120B math tokens from Common Crawl and 500B tokens of natural-language and code data.
During the instruction-tuning and RL stages, the model is trained on GSM8K and MATH data with chain-of-thought (CoT), program-of-thought (PoT), and tool-integrated solutions in English and Chinese.

@lorinachey (Contributor) commented Feb 3, 2026:

Does instruction tuning using both English and Chinese improve performance in both languages or only one of them? Is this combined tuning (i.e. both English and Chinese at the same time) or is it separate so there are two models, one for each language?

| Step| PPO| GRPO|
| -- | -- | -- |
| **Step 1**<br />**Rollout / Sampling** | Freeze current policy $\pi_\text{old}$<br />Sample output tokens $o_1, o_2, \ldots o_T$ | For each prompt, sample a group of $G$ responses $\{y_1, y_2, \ldots, y_G\}$<br />Sampling is done at the **response level**, not token level |
| **Step 2**<br />**Reward Computation** | Reward model assigns a score<br />KL penalty added per token<br />$r_t = r_\phi - \beta \log \frac{\pi_\theta(o_t)}{\pi_\text{ref}(o_t)}$ | Scalar reward assigned to each response<br />$\{r_1, r_2, \ldots, r_G\}$<br />No KL term added at this stage |

Contributor:

might be worth adding a definition of KL penalty. I don't think it's necessarily ubiquitous in RL
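
For readers who have not seen it: the KL penalty is a regularizer that discourages the updated policy from drifting too far from the frozen reference model. In PPO-style RLHF it enters the per-token reward as in the table above; a minimal sketch (the coefficient `beta` is illustrative):

```python
import torch

def kl_penalized_reward(reward_model_score, logp_policy, logp_ref, beta=0.02):
    """Per-token reward with a KL penalty toward the reference policy.

    Mirrors r_t = r_phi - beta * log(pi_theta(o_t) / pi_ref(o_t)):
    tokens where the policy is much more confident than the reference
    model get their reward reduced.
    """
    kl_term = logp_policy - logp_ref   # log(pi_theta / pi_ref)
    return reward_model_score - beta * kl_term
```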

<img src="https://raw.githubusercontent.com/arpg/vla-foundations/762372b95e020009c1083aed4d17dbfef9366357/content/textbook/audits/Deepseek-arch.png" alt="DeepseekMath architecture" width ="900" />


In the Cosmos-Reason1 work, the focus shifts from text-based symbolic reasoning to multimodal physical and embodied reasoning. The model is trained using a multistage pipeline that includes large-scale vision pretraining followed by supervised fine-tuning and reinforcement learning on curated multimodal datasets targeting physical commonsense, spatial reasoning, and temporal understanding. Visual inputs are processed by a pretrained vision encoder and projected into the LLM embedding space using an MLP, allowing joint reasoning within a decoder-only language model. Experiments across the 8B and 56B model variants show that while increased model size improves performance, the primary gains come from domain-specific multimodal data and task-aligned reward signals, rather than additional generic image–text data or further scaling of parameters.

@lorinachey (Contributor) commented Feb 3, 2026:

This writeup mentions "physical commonsense" a lot, but I'm not sure what that looks like in terms of reasoning and training. For example, are we using natural language based descriptions as inputs in training that say what is "physically commonsense" in a particular situation? Is it a text-image pair saying what's physically reasonable? Is it embodiment specific? What does the data look like for training reasoning with physical commonsense?


2. **Data & Pretraining Modality Scaling:** This line of work focuses on scaling reasoning capacity through domain-specialized pretraining data rather than changes to the objective. In this paradigm, reasoning improves as a function of exposure to large volumes of structured domain text. DeepSeekMath follows this strategy by aggressively curating large-scale mathematical corpora using Common Crawl and OpenWebMath and combining them with instruction tuning and RL. Here, reasoning is not explicitly enforced but statistically induced via data distribution shift toward problems that require multi-step abstraction.

3. **Grounded Multimodal Embodiment:** Grounded embodiment approaches aim to scale reasoning by anchoring model representations to the physical world. Instead of optimizing purely over text trajectories, these methods incorporate visual inputs and action-relevant representations, with the goal of aligning semantic reasoning to physical common sense. Cosmos-Reason1 follows this paradigm by introducing ontologies for physical concepts and training multimodal models over vision-language inputs that proxy embodied interaction. The underlying hypothesis is that grounding reasoning in perceptual structure yields more robust abstractions than symbolic optimization alone, even if physical feedback remains indirect.

Contributor:

Does grounding through multimodal perception provide substantially new reasoning signals, or does it mainly serve as an auxiliary constraint compared to text-only training?


## Taxonomy of approaches

1. **Objective Shaping for Reasoning Emergence:** This class of methods treats reasoning as an emergent property that can be induced through carefully designed optimization objectives rather than explicit symbolic structure. Reinforcement learning is used to bias the model toward generating longer, more coherent intermediate chains by rewarding correctness at the trajectory level. **GRPO** exemplifies this approach by replacing absolute reward estimation with group-relative advantage normalization, implicitly encouraging internal competition between candidate reasoning traces. The core assumption is that structured reasoning can be recovered purely through reward shaping over token sequences, without grounding in external state dynamics.

Contributor:

What are the rewards for RL methods? In the latter sections, you mention RL doesn't require human annotations, how are the rewards determined?

prevent the model from performing much better than humans. This human-annotated data is also fairly expensive, which is why they use RL and GRPO to allow the model to reason by itself.
This self-evolution of reasoning also displays various interesting and useful properties. One such example is that, throughout the training process, the reasoning chains actually increase in length over time.
This is purely an emergent behavior, as the model isn't explicitly encouraged to think for longer. This increased time "thinking" also leads to various emergent behaviors such as exploring alternate solutions,
and the researchers note that the model begins to say "wait" in the later stages of training, flagging some parts of its own reasoning as important or relevant.

Contributor:

Should the increase in reasoning length always be interpreted as improved reasoning quality?

| Cosmos-Reason1 | 56B (SFT) | Physical Common Sense | 80.6% (average across Space, Time, Physics) |

Increasing model size does improve performance on different benchmarks, but the gains are incremental and modest rather than dramatic, indicating that parameter scaling alone does not introduce qualitatively new reasoning behavior.
Introducing RL, however, improved performance from the Minerva-comparable baseline to 88.2% on the GSM8K benchmark.

Contributor:

Are the reported accuracies all from the same evaluation setup?

## Diminishing Returns

### 1. **Model Scale**
- DeepSeekMath-Base 7B (64.2% on GSM8K) outperformed Minerva 540B (58.8%), demonstrating that a model 77x smaller can lead in reasoning if the data quality is sufficiently dense.

Contributor:

What does it mean for the data quality to be sufficiently dense? If we wanted to reproduce a dataset with that same characteristic, what would we need to do? How would we measure it?


Also, GRPO does not rely on a value function; this means that we can simply have the model generate new completions and score them with rule-based rewards to improve performance, rather than relying on human-curated data.

Cosmos-Reason1 assumes physical reasoning can be scaled via self-supervised MCQs like Arrow-of-Time or Spatial Puzzles. This allows the model to continue learning from the real world (in video format) even if the data is exhausted.

Contributor:

Are there alternatives to video data for introducing physical learning into the model? It seems like there is a gap between learning the scene and translating into functional motor commands based on your report so far.
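
Since rule-based, verifiable rewards come up repeatedly in this draft, here is a toy sketch of what such a reward function can look like for math-style answers (a hypothetical helper, not from either paper; real pipelines also check formatting and use math-aware equivalence rather than exact string matching):

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Toy verifiable reward: 1.0 if the final \\boxed{...} answer matches
    the reference answer exactly, 0.0 otherwise."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

# Example: a reasoning trace ending in "\boxed{42}" scored against "42"
print(rule_based_reward(r"... therefore the answer is \boxed{42}.", "42"))
```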

@crheckman (Collaborator) left a comment

Comments made during class reading time.


## Interaction Mechanisms

DeepSeek-R1 is a unimodal, self-attention-based, decoder-only model which improves its reasoning only through RL via GRPO rather than architectural changes. However, as the company/group is Chinese,

Collaborator:

I think you already mentioned what the architecture for DeepSeek-R1 is, so this is superfluous.


<img src="https://raw.githubusercontent.com/arpg/vla-foundations/762372b95e020009c1083aed4d17dbfef9366357/content/textbook/audits/Mamba-MLP.png" alt="Hybrid-Mamba-MLP-architecture" width ="900" />

## Interaction Mechanisms

Collaborator:

What do you mean by "Interaction Mechanisms"?

the model is built for both English and Chinese, which sometimes leads to it reasoning in both languages (more common in DeepSeek-R1-Zero, and addressed but not fully eliminated in R1).

Cosmos-Reason1 fuses the image and text modalities using a vision-encoder + projection-based fusion strategy.
- For input images, we convert the image into tiles and a low-resolution thumbnail image to maintain full context. Then these tokens are concatenated with interleaved tile IDs.

Collaborator:

This is the LLaVA-NeXT trick, thumbnail + image tiles for multi-resolution images. So in a lineage, this model is apparently worse than Qwen2.5-VL and that family of models.

For your writeup, please provide some kind of comparison here.

Moving away from the "throw more and more parameters and tokens at the model" idea, we start to focus more on data quality, reward-based optimization, and modality-grounded model design.

In the DeepSeekMath paper, the focus is on training the model on a clean math corpus, fetched using different filtering algorithms so that the model focuses only on math-related problems in English and Chinese. The model is pretrained on 120B math tokens from Common Crawl and 500B tokens of natural-language and code data.

Collaborator:

500B is half the size of Qwen3 (1T). Was it smaller because they couldn't get more data, did they do more curation, or something else?


In the DeepSeekMath paper, the focus is on training the model on a clean math corpus, fetched using different filtering algorithms so that the model focuses only on math-related problems in English and Chinese. The model is pretrained on 120B math tokens from Common Crawl and 500B tokens of natural-language and code data.
During the instruction-tuning and RL stages, the model is trained on GSM8K and MATH data with chain-of-thought (CoT), program-of-thought (PoT), and tool-integrated solutions in English and Chinese.

Collaborator:

CoT we sort of understand from the above. But what is PoT?

* rule-based or verifiable reward signals
* Raw text scale alone shows diminishing returns; we need custom RL or SFT to extract the model's full reasoning potential.

#### **Emergent Reasoning (DeepSeek-R1)**

Collaborator:

Dig into this, maybe with your own evaluation. You can load up DeepSeek-R1 and evaluate how frequently it makes a mistake against e.g. Cosmos-Reason1 on a question. How do the reasoning traces look? What about DeepSeek-R1 was the "killer app" that made the stock market go bonkers in early 2025, and what are the technical ramifications of this?


## Scaling Laws

- **Data Quality vs. Parameter Scale**: The authors show that DeepSeekMath-Base 7B (trained on 120B high-quality math tokens) achieves 64.2% on GSM8K, outperforming Minerva 540B. This suggests an empirical relationship where a **77x reduction in parameters** can be compensated for by a **7x increase in domain-specific data scale**.

Collaborator:

Interesting finding! Did they find a lower-bound saturation on a smaller model where these scaling laws started to become convergent (e.g., data had to scale by 100x to get to a 10x more reduction in parameters)?

## Diminishing Returns

### 1. **Model Scale**
- DeepSeekMath-Base 7B (64.2% on GSM8K) outperformed Minerva 540B (58.8%), demonstrating that a model 77x smaller can lead in reasoning if the data quality is sufficiently dense.

Collaborator:

You already mention this above (https://github.com/arpg/vla-foundations/pull/32/changes#r2760057482). What I want rather than a wall of text reiterating through various projections of metrics from the papers is a deep dive on this concept. What is really going on here? How far can you take this?

### 1. **Model Scale**
- DeepSeekMath-Base 7B (64.2% on GSM8K) outperformed Minerva 540B (58.8%), demonstrating that a model 77x smaller can lead in reasoning if the data quality is sufficiently dense.
- In Cosmos-Reason1, the 56B model showed an average improvement of only 5.2% over the 7B model in physical common sense (80.6% vs 75.4%).
- Both results show that without better objectives like long chain-of-thought (CoT), simply increasing parameters hits a wall in abstract and physical grounding.

Collaborator:

How do they measure physical grounding? It doesn't seem possible to do this without a benchmark, but it's unclear which benchmarks above are evidence for this conclusion.

* **DeepSeekMath**: This model does not directly address high-frequency control; it focuses on the logic of the thinking process. However, its GRPO framework is highly relevant: it reduces the computational overhead of RL by eliminating the value/critic model, which in a robotic context could free up memory and time for other computations or improve speed.
* **DeepSeek-R1**: Similar to DeepSeekMath, this model is not built for robotics or high-frequency control, but rather for reasoning across various tasks.
The MLA architecture of the preceding DeepSeek-V3 is highly relevant, though, as it allows for a dramatically lower KV-cache size, which allows for cheaper compute or larger models than before.
* **Cosmos-Reason1**: This model helps bring high-level reasoning and low-level robotic movement execution closer together. It achieves this using a text-based chain-of-thought mechanism. Since we have a CoT solution rather than only a high-level answer, we can break a large action into reason-based micro-activities (like "grab the door handle from the left") that are executable by the robot, while the model maintains a high-level understanding of the task.

Collaborator:

This is excellent motivation given the context of this course. You might want to bring this insight a little higher in the doc so it becomes clear to readers what the alignment between reasoning and micro-behaviors in robotic systems could be.

| **Scaling effects** | Bigger models show stronger reasoning | Longer training leads to longer reasoning chains and emergent reasoning, with superhuman performance |Larger LLM + vision encoder yields better embodied reasoning |

Cosmos-Reason1-7B uses Qwen2.5-VL as the pre-trained model, while Cosmos-Reason1-56B uses InternViT-300M-V2.5 as the vision encoder and Nemotron-H as the LLM backbone.
For the 56B model, Cosmos-Reason1 also uses a Hybrid-Mamba-MLP-Transformer architecture instead of the standard Transformer backbone to avoid quadratic time complexity; for the 7B model, Qwen2.5-VL acts as the backbone.

Contributor:

It is a big claim for using Hybrid-Mamba-MLP-transformer. Are there any other approaches that have improved since then?


<img src="https://raw.githubusercontent.com/arpg/vla-foundations/audit/Soorej-ScalingAndReasoning/content/textbook/audits/PPO%20vs%20GRPO.png" alt="PPO vs GRPO" width="900" />
**DeepSeek-R1** uses GRPO to train the model to reason on its own.
**Cosmos-Reason1** also uses GRPO for the Physical AI RL stage of its training.

Contributor:

The way in which Deepseek and Cosmos-Reason actually use GRPO isn't described here (and that seems like the most relevant part).
