
Commit 057fc87

feat: add formula recognition sample
1 parent b0a27ee commit 057fc87

4 files changed, +88 -0 lines changed


sample/Cnblogs.DashScope.Sample/Cnblogs.DashScope.Sample.csproj

Lines changed: 3 additions & 0 deletions
@@ -44,6 +44,9 @@
     <None Update="scanned.jpg">
       <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
     </None>
+    <None Update="math.jpg">
+      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
+    </None>
   </ItemGroup>
 
   <ItemGroup>
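
Why the .csproj change matters: the new sample opens math.jpg with a relative path, which resolves when the sample runs from its build output directory, and `CopyToOutputDirectory=PreserveNewest` is what puts the file there. A minimal, illustrative check of that assumption using only standard .NET APIs (this snippet is not part of the commit):

```csharp
using System;
using System.IO;

// Illustrative only: verify the image was copied to the build output,
// which is what CopyToOutputDirectory=PreserveNewest provides.
var imagePath = Path.Combine(AppContext.BaseDirectory, "math.jpg");
Console.WriteLine(File.Exists(imagePath)
    ? $"Sample image found at {imagePath}"
    : "math.jpg missing - check the <None Update=\"math.jpg\"> item in the .csproj");
```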

sample/Cnblogs.DashScope.Sample/Multimodal/OcrDocumentToLaTeXSample.cs

Lines changed: 27 additions & 0 deletions
@@ -41,3 +41,30 @@ public async Task RunAsync(IDashScopeClient client)
         }
     }
 }
+
+/*
+File uploaded: oss://dashscope-instant/52afe077fb4825c6d74411758cb1ab98/2025-11-29/0c817c8b-2d8c-4642-b805-bb20f5349b73/scanned.jpg
+LaTeX:
+```latex
+\section*{Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution}
+
+Peng Wang* \quad Shuai Bai* \quad Sinan Tan* \quad Shijie Wang* \quad Zhihao Fan* \quad Jinze Bai*? \\
+Keqin Chen \quad Xuejing Liu \quad Jialin Wang \quad Wenbin Ge \quad Yang Fan \quad Kai Dang \quad Mengfei Du \\
+Xuancheng Ren \quad Rui Men \quad Dayiheng Liu \quad Chang Zhou \quad Jingren Zhou \quad Junyang Lin*? \\
+Qwen Team \quad Alibaba Group
+
+\begin{abstract}
+We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMS). By scaling both the model size-with versions at 2B, 8B, and 72B parameters-and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at \url{https://github.com/QwenLM/Qwen2-VL}.
+\end{abstract}
+
+\section{Introduction}
+
+In the realm of artificial intelligence, Large Vision-Language Models (LVLMS) represent a significant leap forward, building upon the strong textual processing capabilities of traditional large language models. These advanced models now encompass the ability to interpret and analyze a broader spectrum of data, including images, audio, and video. This expansion of capabilities has transformed LVLMS into indispensable tools for tackling a variety of real-world challenges. Recognized for their unique capacity to condense extensive and intricate knowledge into functional representations, LVLMS are paving the way for more comprehensive cognitive systems. By integrating diverse data forms, LVLMS aim to more closely mimic the nuanced ways in which humans perceive and interact with their environment. This allows these models to provide a more accurate representation of how we engage with and perceive our environment.
+
+Recent advancements in large vision-language models (LVLMS) (Li et al., 2023c; Liu et al., 2023b; Dai et al., 2023; Zhu et al., 2023; Huang et al., 2023a; Bai et al., 2023b; Liu et al., 2023a; Wang et al., 2023b; OpenAI, 2023; Team et al., 2023) have led to significant improvements in a short span. These models (OpenAI, 2023; Touvron et al., 2023a,b; Chiang et al., 2023; Bai et al., 2023a) generally follow a common approach of \textit{visual encoder} $\rightarrow$ \textit{cross-modal connector} $\rightarrow$ \textit{LLM}. This setup, combined with next-token prediction as the primary training method and the availability of high-quality datasets (Liu et al., 2023a; Zhang et al., 2023; Chen et al., 2023b);
+
+*Equal core contribution, ?Corresponding author
+
+```
+Usage: in(2595)/out(873)/image(2540)/total(3468)
+*/

sample/Cnblogs.DashScope.Sample/Multimodal/OcrFormulaRecognitionSample.cs

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
using Cnblogs.DashScope.Core;

namespace Cnblogs.DashScope.Sample.Multimodal;

public class OcrFormulaRecognitionSample : ISample
{
    /// <inheritdoc />
    public string Description => "OCR Math Formula Recognition Sample";

    /// <inheritdoc />
    public async Task RunAsync(IDashScopeClient client)
    {
        // upload file
        await using var file = File.OpenRead("math.jpg");
        var ossLink = await client.UploadTemporaryFileAsync("qwen-vl-ocr-latest", file, "math.jpg");
        Console.WriteLine($"File uploaded: {ossLink}");

        // send the image as a single user message and request the formula_recognition OCR task
        var messages =
            new List<MultimodalMessage> { MultimodalMessage.User([MultimodalMessageContent.ImageContent(ossLink)]) };
        var completion = await client.GetMultimodalGenerationAsync(
            new ModelRequest<MultimodalInput, IMultimodalParameters>()
            {
                Model = "qwen-vl-ocr-latest",
                Input = new MultimodalInput { Messages = messages },
                Parameters = new MultimodalParameters()
                {
                    OcrOptions = new MultimodalOcrOptions()
                    {
                        Task = "formula_recognition",
                    }
                }
            });

        Console.WriteLine("LaTeX:");
        Console.WriteLine(completion.Output.Choices[0].Message.Content[0].Text);

        // report token usage when the API returns it
        if (completion.Usage != null)
        {
            var usage = completion.Usage;
            Console.WriteLine(
                $"Usage: in({usage.InputTokens})/out({usage.OutputTokens})/image({usage.ImageTokens})/total({usage.TotalTokens})");
        }
    }
}

/*
File uploaded: oss://dashscope-instant/52afe077fb4825c6d74411758cb1ab98/2025-11-29/073293f5-1294-4110-ba64-c614b509d7c6/math.jpg
LaTeX:
```latex
\begin{align*}
\tilde{G}(x) &= \frac{\alpha}{\kappa}x, \quad \tilde{T}_i = T, \quad \tilde{H}_i = \tilde{\kappa}T, \quad \tilde{\lambda}_i = \frac{1}{\kappa}\sum_{j=1}^{m}\omega_j - z_i, \\
L(\{p_n\}; m^n) + L(\{x^n\}, m^n) + L(\{m^n\}; q_n) &= L(m^n; q_n) \\
I^{m_n} - (L+1) &= z + \int_0^1 I^{m_n} - (L)z \leq x_m | L^{m_n} - (L) |^3 \\
&\leq \kappa\partial_1\psi(x) + \frac{\kappa^3}{6}\partial_2^3\psi(x) - V(x) \psi(x) = \int d^3y K(x,y) \psi(y), \\
\int_{B_{\kappa}(0)} I^{m}(w)^2 d\gamma &= \lim_{n\to\infty} \int_{B_{\kappa}(0)} r\psi(w_n)^2 d\gamma = \lim_{n\to\infty} \int_{B_{\kappa}(y_n)} d\gamma \geq \beta > 0,
\end{align*}
```
Usage: in(135)/out(339)/image(107)/total(474)
*/
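
For readers trying the new sample outside the repository's own sample host, a hypothetical driver is sketched below. The `DashScopeClient` constructor, its `Cnblogs.DashScope.Sdk` namespace, and the `DASHSCOPE_API_KEY` variable name are assumptions, not something this commit defines; only `ISample`, `IDashScopeClient`, and `OcrFormulaRecognitionSample` come from the diff above.

```csharp
using System;
using Cnblogs.DashScope.Core;
using Cnblogs.DashScope.Sample.Multimodal;
using Cnblogs.DashScope.Sdk; // assumed: package exposing a concrete DashScopeClient

// Hypothetical entry point: build a client from an API key held in an
// environment variable, then run the formula recognition sample once.
var apiKey = Environment.GetEnvironmentVariable("DASHSCOPE_API_KEY")
             ?? throw new InvalidOperationException("Set DASHSCOPE_API_KEY first.");
IDashScopeClient client = new DashScopeClient(apiKey); // assumed constructor taking the API key

var sample = new OcrFormulaRecognitionSample();
Console.WriteLine(sample.Description);
await sample.RunAsync(client);
```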

sample/Cnblogs.DashScope.Sample/math.jpg

Binary file added (20.4 KB)
