
Audit/soorej scaling and reasoning #32

Open

Soorej30 wants to merge 6 commits into staging from audit/Soorej-ScalingAndReasoning

Conversation

@Soorej30 (Contributor)

Updated draft with image URLs

@Soorej30 (Contributor, Author)

@crheckman could you please take a look? Thank you

github-actions bot commented Jan 28, 2026

🚀 Preview Deployed

Your preview is ready for review!

🔗 Preview URL: https://arpg.github.io/vla-foundations/staging/pulls/32/textbook/audits/ReasoningAndScaling_audit/

Review Checklist

  • LaTeX equations render correctly
  • All sections are complete per the template
  • References are formatted properly
  • Figures/diagrams display correctly

Next Steps

  1. Review your rendered content using the preview link above
  2. Tag @crheckman when ready for instructor review
  3. Push updates to auto-refresh the preview

This preview will be removed when the PR is closed.

@crheckman (Collaborator) left a comment

First pass at reasoning review.

# Scaling and Reasoning in Foundational Models

The training of an LLM can be split into two stages: **pre-training** and **post-training**. Pre-training is the stage where we throw all the data we have at a transformer-based model and teach it to predict the next token from internet data. Post-training is the stage where we train the model's reasoning capabilities.

Collaborator:

or any available dataset, the larger the better seems to be clear (up to a certain model size and compute limit).

# Problem Domain & Taxonomy


We define **reasoning** as the capacity to produce **structured, coherent intermediate chains** that reflect underlying domain abstractions:

Collaborator:

you should define "reasoning" before here, like when you first invoke it in the section above; otherwise it reads very flimsily.


## Taxonomy of approaches

1. **Objective Shaping for Reasoning Emergence:** This class of methods treats reasoning as an emergent property that can be induced through carefully designed optimization objectives rather than explicit symbolic structure. Reinforcement learning is used to bias the model toward generating longer, more coherent intermediate chains by rewarding correctness at the trajectory level. **GRPO** exemplifies this approach by replacing absolute reward estimation with group-relative advantage normalization, implicitly encouraging internal competition between candidate reasoning traces. The core assumption is that structured reasoning can be recovered purely through reward shaping over token sequences, without grounding in external state dynamics.

Collaborator:

Add a section on pre-GRPO objective shaping through e.g. DPO and RLHF. That will result in a more natural segue into GRPO.
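
To make the suggested pre-GRPO objective shaping concrete, here is a minimal sketch of the standard DPO pairwise loss (not code from the audited draft; `beta` and the toy log-probabilities below are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    # Implicit reward of each response: beta * log(pi_theta / pi_ref)
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Push the chosen response's implicit reward above the rejected one's
    return -F.logsigmoid(chosen_reward - rejected_reward)

# Toy usage with made-up summed log-probabilities
loss = dpo_loss(torch.tensor(-12.3), torch.tensor(-15.1),
                torch.tensor(-13.0), torch.tensor(-14.8))
```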


These papers approach this through objective shaping for reasoning emergence, data and pretraining scaling, multimodal embodiment, and trade-offs between process complexity and context length.

## Taxonomy of approaches

Collaborator:

This section would benefit from formalization before laying out the taxonomy. What is grounding? What is the "objective"?

| Step | PPO | GRPO |
| -- | -- | -- |
| **Step 1**<br />**Rollout / Sampling** | Freeze current policy $\pi_\text{old}$<br />Sample output tokens $o_1, o_2, \ldots o_T$ | For each prompt, sample a group of $G$ responses $\{y_1, y_2, \ldots, y_G\}$<br />Sampling is done at the **response level**, not token level |
| **Step 2**<br />**Reward Computation** | Reward model assigns a score<br />KL penalty added per token<br />$r_t = r_\phi - \beta \log \frac{\pi_\theta(o_t)}{\pi_\text{ref}(o_t)}$ | Scalar reward assigned to each response<br />$\{r_1, r_2, \ldots, r_G\}$<br />No KL term added at this stage |
| **Step 3**<br />**Advantage Estimation** | Value model $V_\phi$ used with GAE<br />$\hat A_t = \sum_{l \ge 0} (\gamma \lambda)^l \delta_{t+l}$, where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ | Within-group reward normalization<br />$\hat A_i = \frac{r_i - \mu_r}{\sigma_r}$<br />$\mu_r$, $\sigma_r$ are the mean and standard deviation of the group rewards |

Collaborator:

Is this an approximation? Is this just a convenient reframing? Why does it "work"?
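
One way to read the group-relative reframing is that the group mean acts as a Monte-Carlo baseline in place of the learned value function, so the advantage is still "reward minus a baseline", just estimated from sibling samples of the same prompt. A minimal sketch of the GRPO side of Step 3 (illustrative only; the epsilon is an added numerical-stability assumption):

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for the G responses sampled for one prompt.

    rewards: tensor of shape (G,), one scalar reward per sampled response.
    The group mean serves as the baseline; the std rescales the advantages.
    """
    mu = rewards.mean()
    sigma = rewards.std(unbiased=False)
    return (rewards - mu) / (sigma + eps)

# Example: 4 responses to the same prompt, scored 1 if the final answer
# is correct and 0 otherwise -> advantages of roughly [+1, -1, -1, +1]
adv = grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0]))
```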

**DeepSeek-R1** uses GRPO to train the model to reason on its own.
**Cosmos-Reason1** also uses GRPO for the Physical AI RL stage of its training.

## Backbone Scaling

Collaborator:

This section blends all of these different architectures together, leading it to be a soup of efforts. While the table for backbone types tries to make this clear upfront, it is overall unclear why these model architectures are different, or how the RL is different across them (both DeepSeekMath and Cosmos-Reason1 say RL, is DeepSeek-R1 not RL? pretty sure it is...).

Recast this for clarity. Start with R1, then describe Math. Then go to Cosmos and why it's different (visually grounded reasoning especially).


**Cosmos-Reason1** uses a decoder-only multimodal LLM with two possible instantiations:
- Dense Transformer backbone
- Hybrid Mamba-MLP-Transformer architecture

Collaborator:

If you're going to cite Mamba here (and not describe it - which I recommend you don't), you probably should have a reference you want folks to look at.

Cosmos-Reason1-7B uses Qwen2.5-VL as the pre-trained model, while Cosmos-Reason1-56B uses InternViT-300M-V2.5 as the vision encoder and Nemotron-H as the LLM backbone.
For the 56B model, Cosmos-Reason1 also uses a Hybrid-Mamba-MLP-Transformer architecture instead of the standard Transformer backbone to avoid quadratic time complexity; for the 7B model, Qwen2.5-VL acts as the backbone.

<img src="https://raw.githubusercontent.com/arpg/vla-foundations/762372b95e020009c1083aed4d17dbfef9366357/content/textbook/audits/Mamba-MLP.png" alt="Hybrid-Mamba-MLP-architecture" width ="900" />

Collaborator:

Add a heading to the figure as it would help clarify what we're looking at.

| **Vision modality** | None | None |Vision encoder integrated |
| **Data focus** | Text reasoning + RL rewards | DeepSeek-V3: Text <br /> DeepSeek-R1: 146k questions/prompts with verifiable solutions | Vision + Physical reasoning + RL |
| **Data volume emphasis** | Large pretrained LLM <br /> RL via reasoning prompts | Large pretrained model on 14.8 trillion tokens <br /> |Large vision data + curated task-specific multimodal corpora |
| **Scaling effects** | Bigger models show stronger reasoning | Longer training leads to longer reasoning chains and emergent reasoning, with superhuman performance |Larger LLM + vision encoder yields better embodied reasoning |

Collaborator:

OK so bigger is better they all say. How well do they get? What are they good at? Show us the benchmarks/evaluations and tear into why they are good or not.

| **Model sizes** | 1.5B → 70B parameters | 671B parameters, with various distilled variants | 8B and 56B variants |
| **Vision modality** | None | None |Vision encoder integrated |
| **Data focus** | Text reasoning + RL rewards | DeepSeek-V3: Text <br /> DeepSeek-R1: 146k questions/prompts with verifiable solutions | Vision + Physical reasoning + RL |
| **Data volume emphasis** | Large pretrained LLM <br /> RL via reasoning prompts | Large pretrained model on 14.8 trillion tokens <br /> |Large vision data + curated task-specific multimodal corpora |

Collaborator:

What were the datasets? Large is obvious.


1. **Objective Shaping for Reasoning Emergence:** This class of methods treats reasoning as an emergent property that can be induced through carefully designed optimization objectives rather than explicit symbolic structure. Reinforcement learning is used to bias the model toward generating longer, more coherent intermediate chains by rewarding correctness at the trajectory level. **GRPO** exemplifies this approach by replacing absolute reward estimation with group-relative advantage normalization, implicitly encouraging internal competition between candidate reasoning traces. The core assumption is that structured reasoning can be recovered purely through reward shaping over token sequences, without grounding in external state dynamics.

2. **Data & Pretraining Modality Scaling:** This line of work focuses on scaling reasoning capacity through domain-specialized pretraining data rather than changes to the objective. In this paradigm, reasoning improves as a function of exposure to large volumes of structured domain text. DeepSeekMath follows this strategy by aggressively curating large-scale mathematical corpora using Common Crawl and OpenWebMath and combining them with instruction tuning and RL. Here, reasoning is not explicitly enforced but statistically induced via data distribution shift toward problems that require multi-step abstraction.

Contributor:

Is the reasoning for improvement just attributed to data distribution shift here? Or are methods like chain of thought still used?


1. **Objective Shaping for Reasoning Emergence:** This class of methods treats reasoning as an emergent property that can be induced through carefully designed optimization objectives rather than explicit symbolic structure. Reinforcement learning is used to bias the model toward generating longer, more coherent intermediate chains by rewarding correctness at the trajectory level. **GRPO** exemplifies this approach by replacing absolute reward estimation with group-relative advantage normalization, implicitly encouraging internal competition between candidate reasoning traces. The core assumption is that structured reasoning can be recovered purely through reward shaping over token sequences, without grounding in external state dynamics.

2. **Data & Pretraining Modality Scaling:** This line of work focuses on scaling reasoning capacity through domain-specialized pretraining data rather than changes to the objective. In this paradigm, reasoning improves as a function of exposure to large volumes of structured domain text. DeepSeekMath follows this strategy by aggressively curating large-scale mathematical corpora using Common Crawl and OpenWebMath and combining them with instruction tuning and RL. Here, reasoning is not explicitly enforced but statistically induced via data distribution shift toward problems that require multi-step abstraction.

Contributor:

Is there any reasoning scaling done through synthetic data generation rather than just crawling the web?


Group Relative Policy Optimization (GRPO) is an improvement over the previously used Proximal Policy Optimization (PPO) method for reinforcement learning.

PPO uses an actor-critic setup in which a value function is used to compute the advantage. The advantage (or Generalized Advantage Estimation, GAE) can be thought of as "how much better a specific action a<sub>t</sub> is compared to the average action in the current state s<sub>t</sub>."

Contributor:

How is the average action (reference) defined? Is it just the reference model in the GRPO vs PPO diagram?
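
For reference, in PPO-style RLHF the "average action" baseline is the learned critic $V_\phi(s_t)$; the reference model in the diagram is typically used only for the KL penalty, not for the advantage. A minimal sketch of the GAE recursion, assuming per-step rewards and critic values are already available (illustrative, not code from either paper):

```python
import torch

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards: (T,) per-step rewards.
    values:  (T+1,) critic estimates V(s_t), including a bootstrap value
             for the state reached after the final step.
    """
    T = rewards.shape[0]
    adv = torch.zeros(T)
    running = torch.tensor(0.0)
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```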

| **Step 5**<br />**Optimization**| Run $K$ epochs of SGD on the same batch | Reuse the same batch for $K$ epochs |


GRPO replaces value-function-based advantages with group-relative rewards, and moves KL regularization out of the reward and directly into the loss. Since the value function is a separate model that needs to be trained, skipping it makes the reinforcement learning stage faster and more efficient.

Contributor:

The value function represents the discounted sum of rewards at a given state. Group-relative rewards seems to be an approximation and doesn't consider future steps. Is there an insight into why this works? Does GRPO not face the issue of high variance in reward estimates?
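
To make "KL in the loss rather than in the reward" concrete, here is a rough sketch of the kind of per-token objective GRPO optimizes; the clipping threshold, KL coefficient, and the particular non-negative KL estimator are illustrative assumptions rather than values from the audited draft:

```python
import torch

def grpo_token_loss(logp_new, logp_old, logp_ref, advantage,
                    clip_eps=0.2, beta=0.04):
    """Clipped policy-gradient term plus an explicit KL penalty in the loss.

    logp_new / logp_old / logp_ref: per-token log-probs of the sampled
    response under the current, rollout-time, and frozen reference policies.
    advantage: the response's group-relative advantage, broadcast per token.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    policy_loss = -torch.min(unclipped, clipped)

    # KL penalty toward the reference model, kept in the loss instead of
    # being folded into the per-token reward.
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1  # always >= 0

    return (policy_loss + beta * kl).mean()
```

On the variance question: because all responses in a group share the same prompt, the group mean is a fairly low-variance baseline in practice, though a degenerate group whose rewards are all identical contributes no learning signal.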

- For input images, we convert the image into tiles and a low-resolution thumbnail image to maintain full context. Then these tokens are concatenated with interleaved tile IDs.
- For input videos, we convert it into a maximum of 32 frames at a max rate of 2 frames per second. The vision encoder generates 1,024 visual tokens per $448 \times 448$ frame, which are then downsampled by a factor of $2 \times 2$ (using PixelShuffle) to 256 tokens per frame.

These tokens are then concatenated with the text tokens and passed into the **Hybrid-Mamba-MLP-Transformer** model. The LLM processes both text tokens and projected visual tokens using standard self-attention. This early fusion of image and text tokens makes sense for this model since it helps the model scale better (8B -> 56B), and the hybrid architecture avoids the quadratic cost of repeated cross-attention at each decoding step.

Contributor:

Might be worth discussing the tradeoff between concatenating the visual tokens + text tokens to help the reader understand what (if any) downsides exist to this method. For example, they may avoid the cross-attention at decoding step, but does that come at the cost of anything performance-wise?
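
As a concrete picture of the token budget being discussed here, a minimal sketch of the 2x2 pixel-shuffle-style merge that takes 1,024 per-frame tokens down to 256 (the 32x32 grid and the embedding width below are illustrative assumptions; per the draft, an MLP projector then maps the result into the LLM embedding space):

```python
import torch

# One frame's visual tokens from the encoder: a 32x32 grid of patch
# embeddings (1,024 tokens). Embedding width 1152 is an assumed placeholder.
frame_tokens = torch.randn(1, 1024, 1152)        # (batch, tokens, dim)
grid = frame_tokens.reshape(1, 32, 32, 1152)
blocks = grid.reshape(1, 16, 2, 16, 2, 1152)      # carve out 2x2 blocks
blocks = blocks.permute(0, 1, 3, 2, 4, 5)         # (1, 16, 16, 2, 2, dim)
downsampled = blocks.reshape(1, 256, 4 * 1152)    # 256 wider tokens per frame
```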


In the DeepSeekMath paper, the focus is on training the model on a clean math corpus, fetched using different filtering algorithms so that the model focuses only on math-related problems in English and Chinese. The model is pretrained on 120B math tokens from Common Crawl and 500B tokens of natural-language and code data.
During the instruction-tuning and RL stages, the model is trained on GSM8K and MATH data with chain-of-thought (CoT), program-of-thought (PoT), and tool-integrated solutions in English and Chinese.

@lorinachey (Contributor) commented Feb 3, 2026:

Does instruction tuning using both English and Chinese improve performance in both languages or only one of them? Is this combined tuning (i.e. both English and Chinese at the same time) or is it separate so there are two models, one for each language?

| Step| PPO| GRPO|
| -- | -- | -- |
| **Step 1**<br />**Rollout / Sampling** | Freeze current policy $\pi_\text{old}$<br />Sample output tokens $o_1, o_2, \ldots o_T$ | For each prompt, sample a group of $G$ responses $\{y_1, y_2, \ldots, y_G\}$<br />Sampling is done at the **response level**, not token level |
| **Step 2**<br />**Reward Computation** | Reward model assigns a score<br />KL penalty added per token<br />$r_t = r_\phi - \beta \log \frac{\pi_\theta(o_t)}{\pi_\text{ref}(o_t)}$ | Scalar reward assigned to each response<br />$\{r_1, r_2, \ldots, r_G\}$<br />No KL term added at this stage |

Contributor:

might be worth adding a definition of KL penalty. I don't think it's necessarily ubiquitous in RL
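
For readers who have not seen it: the KL penalty is a regularizer that discourages the updated policy from drifting too far from the frozen reference model. In PPO-style RLHF it enters the per-token reward as in the table above; a minimal sketch (the coefficient `beta` is illustrative):

```python
import torch

def kl_penalized_reward(reward_model_score, logp_policy, logp_ref, beta=0.02):
    """Per-token reward with a KL penalty toward the reference policy.

    Mirrors r_t = r_phi - beta * log(pi_theta(o_t) / pi_ref(o_t)):
    tokens where the policy is much more confident than the reference
    model get their reward reduced.
    """
    kl_term = logp_policy - logp_ref   # log(pi_theta / pi_ref)
    return reward_model_score - beta * kl_term
```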

<img src="https://raw.githubusercontent.com/arpg/vla-foundations/762372b95e020009c1083aed4d17dbfef9366357/content/textbook/audits/Deepseek-arch.png" alt="DeepseekMath architecture" width ="900" />


In the Cosmos-Reason1 work, the focus shifts from text-based symbolic reasoning to multimodal physical and embodied reasoning. The model is trained using a multistage pipeline that includes large-scale vision pretraining followed by supervised fine-tuning and reinforcement learning on curated multimodal datasets targeting physical commonsense, spatial reasoning, and temporal understanding. Visual inputs are processed by a pretrained vision encoder and projected into the LLM embedding space using an MLP, allowing joint reasoning within a decoder-only language model. Experiments across the 8B and 56B model variants show that while increased model size improves performance, the primary gains come from domain-specific multimodal data and task-aligned reward signals, rather than additional generic image–text data or further scaling of parameters.

@lorinachey (Contributor) commented Feb 3, 2026:

This writeup mentions "physical commonsense" a lot, but I'm not sure what that looks like in terms of reasoning and training. For example, are we using natural language based descriptions as inputs in training that say what is "physically commonsense" in a particular situation? Is it a text-image pair saying what's physically reasonable? Is it embodiment specific? What does the data look like for training reasoning with physical commonsense?


2. **Data & Pretraining Modality Scaling:** This line of work focuses on scaling reasoning capacity through domain-specialized pretraining data rather than changes to the objective. In this paradigm, reasoning improves as a function of exposure to large volumes of structured domain text. DeepSeekMath follows this strategy by aggressively curating large-scale mathematical corpora using Common Crawl and OpenWebMath and combining them with instruction tuning and RL. Here, reasoning is not explicitly enforced but statistically induced via data distribution shift toward problems that require multi-step abstraction.

3. **Grounded Multimodal Embodiment:** Grounded embodiment approaches aim to scale reasoning by anchoring model representations to the physical world. Instead of optimizing purely over text trajectories, these methods incorporate visual inputs and action-relevant representations, with the goal of aligning semantic reasoning to physical common sense. Cosmos-Reason1 follows this paradigm by introducing ontologies for physical concepts and training multimodal models over vision-language inputs that proxy embodied interaction. The underlying hypothesis is that grounding reasoning in perceptual structure yields more robust abstractions than symbolic optimization alone, even if physical feedback remains indirect.

Contributor:

Does grounding through multimodal perception provide substantially new reasoning signals, or does it mainly serve as an auxiliary constraint compared to text-only training?


## Taxonomy of approaches

1. **Objective Shaping for Reasoning Emergence:** This class of methods treats reasoning as an emergent property that can be induced through carefully designed optimization objectives rather than explicit symbolic structure. Reinforcement learning is used to bias the model toward generating longer, more coherent intermediate chains by rewarding correctness at the trajectory level. **GRPO** exemplifies this approach by replacing absolute reward estimation with group-relative advantage normalization, implicitly encouraging internal competition between candidate reasoning traces. The core assumption is that structured reasoning can be recovered purely through reward shaping over token sequences, without grounding in external state dynamics.

Contributor:

What are the rewards for RL methods? In the latter sections, you mention RL doesn't require human annotations, how are the rewards determined?

prevent the model from performing much better than humans. This human-annotated data is also fairly expensive, which is why they use RL and GRPO to allow the model to reason by itself.
This self-evolution of reasoning also displays various interesting and useful properties. One such example is that, throughout the training process, the reasoning chains actually increase in length over time.
This is purely an emergent behavior, as the model isn't explicitly encouraged to think for longer. This increased time "thinking" also leads to various emergent behaviors such as exploring alternate solutions,
and the researchers note that the model begins to say "wait" in the later stages of training, flagging some parts of its own reasoning as important or relevant.

Contributor:

Should the increase in reasoning length always be interpreted as improved reasoning quality?

| Cosmos-Reason1 | 56B (SFT) | Physical Common Sense | 80.6% (average across Space, Time, Physics) |

Increasing model size does improve performance on different benchmarks, but the gains are incremental and modest rather than dramatic, indicating that parameter scaling alone does not introduce qualitatively new reasoning behavior.
Introducing RL, however, improved performance from the Minerva-comparable baseline to 88.2% on the GSM8K benchmark.

Contributor:

Are the reported accuracies all from the same evaluation setup?

## Diminishing Returns

### 1. **Model Scale**
- DeepSeekMath-Base 7B (64.2% on GSM8K) outperformed Minerva 540B (58.8%), demonstrating that a model 77x smaller can lead in reasoning if the data quality is sufficiently dense.

Contributor:

What does it mean for the data quality to be sufficiently dense? If we wanted to reproduce a dataset with that same characteristic, what would we need to do? How would we measure it?


Also, GRPO does not rely on a value function; this means that we can simply have the model generate new completions and score them with rule-based rewards to improve performance, rather than relying on human-curated data.

Cosmos-Reason1 assumes physical reasoning can be scaled via self-supervised MCQs like Arrow-of-Time or Spatial Puzzles. This allows the model to continue learning from the real world (in video format) even if the data is exhausted.

Contributor:

Are there alternatives to video data for introducing physical learning into the model? It seems like there is a gap between learning the scene and translating into functional motor commands based on your report so far.
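
Since rule-based, verifiable rewards come up repeatedly in this draft, here is a toy sketch of what such a reward function can look like for math-style answers (a hypothetical helper, not from either paper; real pipelines also check formatting and use math-aware equivalence rather than exact string matching):

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Toy verifiable reward: 1.0 if the final \\boxed{...} answer matches
    the reference answer exactly, 0.0 otherwise."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

# Example: a reasoning trace ending in "\boxed{42}" scored against "42"
print(rule_based_reward(r"... therefore the answer is \boxed{42}.", "42"))
```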

@crheckman (Collaborator) left a comment

Comments made during class reading time.


## Interaction Mechanisms

DeepSeek-R1 is a unimodal, self-attention-based, decoder-only model which improves its reasoning only through RL via GRPO rather than architectural changes. However, as the company/group is Chinese,

Collaborator:

I think you already mentioned what the architecture for DeepSeek-R1 is, so this is superfluous.


<img src="https://raw.githubusercontent.com/arpg/vla-foundations/762372b95e020009c1083aed4d17dbfef9366357/content/textbook/audits/Mamba-MLP.png" alt="Hybrid-Mamba-MLP-architecture" width ="900" />

## Interaction Mechanisms

Collaborator:

What do you mean by "Interaction Mechanisms"?

the model is built for both English and Chinese, which sometimes leads to it reasoning in both languages (more common in DeepSeek-R1-Zero, and addressed but not fully eliminated in R1).

Cosmos-Reason1 fuses the image and text modalities using a vision-encoder + projection-based fusion strategy.
- For input images, we convert the image into tiles and a low-resolution thumbnail image to maintain full context. Then these tokens are concatenated with interleaved tile IDs.

Collaborator:

This is the LLaVA-NeXT trick, thumbnail + image tiles for multi-resolution images. So in a lineage, this model is apparently worse than Qwen2.5-VL and that family of models.

For your writeup, please provide some kind of comparison here.

Moving away from the "throw more and more parameters and tokens at the model" idea, we start to focus more on data quality, reward-based optimization, and modality-grounded model design.

In the DeepSeekMath paper, the focus is on training the model on a clean math corpus, fetched using different filtering algorithms so that the model focuses only on math-related problems in English and Chinese. The model is pretrained on 120B math tokens from Common Crawl and 500B tokens of natural-language and code data.

Collaborator:

500B is half the size of Qwen3 (1T). Was it smaller because they couldn't get more data, did they do more curation, or something else?


In the DeepSeekMath paper, the focus is on training the model on a clean math corpus, fetched using different filtering algorithms so that the model focuses only on math-related problems in English and Chinese. The model is pretrained on 120B math tokens from Common Crawl and 500B tokens of natural-language and code data.
During the instruction-tuning and RL stages, the model is trained on GSM8K and MATH data with chain-of-thought (CoT), program-of-thought (PoT), and tool-integrated solutions in English and Chinese.

Collaborator:

CoT we sort of understand from the above. But what is PoT?

* rule-based or verifiable reward signals
* Raw text scale alone shows diminishing returns; we need custom RL or SFT to extract the model's full reasoning potential.

#### **Emergent Reasoning (DeepSeek-R1)**

Collaborator:

Dig into this, maybe with your own evaluation. You can load up DeepSeek-R1 and evaluate how frequently it makes a mistake against e.g. Cosmos-Reason1 on a question. How do the reasoning traces look? What about DeepSeek-R1 was the "killer app" that made the stock market go bonkers in early 2025, and what are the technical ramifications of this?


## Scaling Laws

- **Data Quality vs. Parameter Scale**: The authors show that DeepSeekMath-Base 7B (trained on 120B high-quality math tokens) achieves 64.2% on GSM8K, outperforming Minerva 540B. This suggests an empirical relationship where a **77x reduction in parameters** can be compensated for by a **7x increase in domain-specific data scale**.

Collaborator:

Interesting finding! Did they find a lower-bound saturation on a smaller model where these scaling laws started to become convergent (e.g., data had to scale by 100x to get to a 10x more reduction in parameters)?

## Diminishing Returns

### 1. **Model Scale**
- DeepSeekMath-Base 7B (64.2% on GSM8K) outperformed Minerva 540B (58.8%), demonstrating that a model 77x smaller can lead in reasoning if the data quality is sufficiently dense.

Collaborator:

You already mention this above (https://github.com/arpg/vla-foundations/pull/32/changes#r2760057482). What I want rather than a wall of text reiterating through various projections of metrics from the papers is a deep dive on this concept. What is really going on here? How far can you take this?

### 1. **Model Scale**
- DeepSeekMath-Base 7B (64.2% on GSM8K) outperformed Minerva 540B (58.8%), demonstrating that a model 77x smaller can lead in reasoning if the data quality is sufficiently dense.
- In Cosmos-Reason1, the 56B model showed an average improvement of only 5.2% over the 7B model in physical common sense (80.6% vs 75.4%).
- Both results show that without better objectives like long chain-of-thought (CoT), simply increasing parameters hits a wall in abstract and physical grounding.

Collaborator:

How do they measure physical grounding? It doesn't seem possible to do this without a benchmark, but it's unclear which benchmarks above are evidence for this conclusion.

* **DeepSeekMath**: This model does not directly address high-frequency control; it focuses on the logic of the thinking process. However, its GRPO framework is highly relevant: it reduces the computational overhead of RL by eliminating the value/critic model, which in a robotic context could free up memory and time for other computations or improve speed.
* **DeepSeek-R1**: Similar to DeepSeekMath, this model is not built for robotics or high-frequency control, but rather for reasoning across various tasks.
The MLA architecture of the preceding DeepSeek-V3 is highly relevant, though, as it allows for a dramatically lower KV-cache size, which allows for cheaper compute or larger models than before.
* **Cosmos-Reason1**: This model helps bring high-level reasoning and low-level robotic movement execution closer together. It achieves this using a text-based chain-of-thought mechanism. Since we have a CoT solution rather than only a high-level answer, we can break a large action into reason-based micro-activities (like "grab the door handle from the left") that are executable by the robot, while the model maintains a high-level understanding of the task.

Collaborator:

This is excellent motivation given the context of this course. You might want to bring this insight a little higher in the doc so it becomes clear to readers what the alignment between reasoning and micro-behaviors in robotic systems could be.

| **Scaling effects** | Bigger models show stronger reasoning | Longer training leads to longer reasoning chains and emergent reasoning, with superhuman performance |Larger LLM + vision encoder yields better embodied reasoning |

Cosmos-Reason1-7B uses Qwen2.5-VL as the pre-trained model, while Cosmos-Reason1-56B uses InternViT-300M-V2.5 as the vision encoder and Nemotron-H as the LLM backbone.
For the 56B model, Cosmos-Reason1 also uses a Hybrid-Mamba-MLP-Transformer architecture instead of the standard Transformer backbone to avoid quadratic time complexity; for the 7B model, Qwen2.5-VL acts as the backbone.

Contributor:

It is a big claim for using Hybrid-Mamba-MLP-transformer. Are there any other approaches that have improved since then?


<img src="https://raw.githubusercontent.com/arpg/vla-foundations/audit/Soorej-ScalingAndReasoning/content/textbook/audits/PPO%20vs%20GRPO.png" alt="PPO vs GRPO" width="900" />
**DeepSeek-R1** uses GRPO to train the model to reason on its own.
**Cosmos-Reason1** also uses GRPO for the Physical AI RL stage of its training.

Contributor:

The way in which Deepseek and Cosmos-Reason actually use GRPO isn't described here (and that seems like the most relevant part).
