Commit ae57cb6

update doc (#3460)
1 parent ea26e30 commit ae57cb6

2 files changed, 22 additions and 22 deletions

docs/practical_tutorials/document_scene_information_extraction(deepseek)_tutorial.en.md

Lines changed: 11 additions & 11 deletions
@@ -58,9 +58,9 @@ The result shows that PP-ChatOCRv3 can extract text information from the image a
5858

5959
In practice, many document information extraction tasks involve multi-page PDF files rather than single images. Because such files often contain a large amount of text, passing all of it to a large language model at once raises the invocation cost and lowers extraction accuracy. To address this, the PP-ChatOCRv3 pipeline integrates vector retrieval: it stores the text from a multi-page PDF file in a vector database, retrieves the fragments most relevant to the query, and passes only those fragments to the large language model, which significantly reduces the invocation cost and improves extraction accuracy. The Baidu Cloud Qianfan platform provides four vector models for building vector databases of text. For the supported models and their characteristics, refer to the vector model section in the [API List](https://cloud.baidu.com/doc/WENXINWORKSHOP/s/Nlks5zkzu_en). Next, we will use the `embedding-v1` model to build a vector database of the text and pass the most relevant fragments to the `DeepSeek-V3` large language model, efficiently extracting key information from a multi-page PDF file.
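The retrieval step described above can be sketched in a few lines. This is only an illustration of the idea, not the pipeline's implementation: `toy_embed` is a hypothetical word-count stand-in for a real embedding model such as `embedding-v1`, and `build_index` and `retrieve` are names invented for this example.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(chunks):
    """Embed every text fragment once and keep the vectors alongside the text."""
    return [(chunk, toy_embed(chunk)) for chunk in chunks]


def retrieve(index, query, k=2):
    """Return the k fragments most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


pages = [
    "Party A bank account is held at the development zone branch",
    "The contract term is twelve months from the signing date",
    "Either party may terminate with thirty days written notice",
]
index = build_index(pages)
top = retrieve(index, "Party A bank", k=1)
print(top)
```

Only the fragments returned by `retrieve` would be placed in the prompt, which is why the cost of the large language model call does not grow with the number of pages.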
6060

61-
First, download [Test File 2](https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/contract.pdf), then replace the `api_key` in the following code and execute it:
61+
First, download [Test File 2](https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/contract2.pdf), then replace the `api_key` in the following code and execute it:
6262

63-
**Note**: Due to the large size of multi-page PDF files, the first execution requires a longer time for text information extraction and vector database establishment. The code saves the visual results of the model and the establishment results of the vector database locally, which can be loaded and used directly subsequently.
63+
**Note**: The free service provided by the Qianfan platform currently receives a very high volume of calls, so it limits the number of tokens per minute (TPM). Because multi-page PDF files are large, testing with your own PDF files that have too many pages may trigger a TPM limit error. This limit does not apply to large model services deployed in other ways or to Qianfan's paid users.
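The tutorial code that follows guards the visual prediction step with `os.path.exists(visual_predict_res_path)`, so an expensive result is computed once and reused on later runs. A minimal sketch of that load-or-compute pattern is shown here. The helper `load_or_compute`, the placeholder `slow_extract`, and the JSON file format are assumptions made for illustration and are not part of the pipeline's API.

```python
import json
import os
import tempfile


def load_or_compute(cache_path, compute):
    """Return (result, from_cache); compute and save only on a cache miss."""
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return json.load(f), True
    result = compute()
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False)
    return result, False


def slow_extract():
    # Placeholder for an expensive step such as OCR over every page.
    return {"pages": ["page 1 text", "page 2 text"]}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "visual_predict_res.json")
    first, first_cached = load_or_compute(path, slow_extract)
    second, second_cached = load_or_compute(path, slow_extract)

print(first_cached, second_cached)
```

The second call reads the saved file instead of running `slow_extract` again, which is the same reason a repeated run of the tutorial spends almost no time on visual prediction and vector building.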
6464

6565
```python
6666
import os
@@ -90,7 +90,7 @@ pipeline = create_pipeline(pipeline="PP-ChatOCRv3-doc", initial_predictor=False)
9090

9191
if not os.path.exists(visual_predict_res_path):
9292
visual_predict_res = pipeline.visual_predict(
93-
input="contract.pdf",
93+
input="contract2.pdf",
9494
use_doc_orientation_classify=False,
9595
use_doc_unwarping=False,
9696
)
@@ -138,20 +138,20 @@ After executing the above code, the result obtained is as follows:
138138

139139
```
140140
{'chat_res': {'甲方开户行': '日照银行股份有限公司开发区支行'}}
141-
Visual Predict Time: 18.6519s
142-
Vector Build Time: 6.1515s
143-
Chat Time: 7.0352s
144-
Total Time: 31.8385s
141+
Visual Predict Time: 15.3429s
142+
Vector Build Time: 4.8302s
143+
Chat Time: 3.457s
144+
Total Time: 23.6301s
145145
```
146146

147147
When we execute the above code again, the result obtained is as follows:
148148

149149
```
150150
{'chat_res': {'甲方开户行': '日照银行股份有限公司开发区支行'}}
151-
Visual Predict Time: 0.0161s
152-
Vector Build Time: 0.0016s
153-
Chat Time: 6.9516s
154-
Total Time: 6.9693s
151+
Visual Predict Time: 0.0104s
152+
Vector Build Time: 0.0006s
153+
Chat Time: 4.4056s
154+
Total Time: 4.4167s
155155
```
156156

157157
Comparing the two runs shows that during the first execution the PP-ChatOCRv3 pipeline extracts all text from the multi-page PDF file and builds a vector library, which takes longer. During subsequent executions, the pipeline only needs to load and query the saved vector library, which significantly reduces the total time. Combined with vector retrieval, the PP-ChatOCRv3 pipeline reduces the number of large language model calls needed when extracting from very long text, giving faster text extraction and more accurate location of key information. This makes it a more efficient solution for real-world multi-page PDF information extraction.

docs/practical_tutorials/document_scene_information_extraction(deepseek)_tutorial.md

Lines changed: 11 additions & 11 deletions
@@ -58,9 +58,9 @@ print(chat_result)
5858

5959
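In practice, many document information extraction tasks involve multi-page PDF files rather than single images. Because such files often contain a large amount of text, passing all of it to a large language model at once raises the invocation cost and lowers extraction accuracy. To address this, the PP-ChatOCRv3 pipeline integrates vector retrieval: it stores the text from a multi-page PDF file in a vector library, retrieves the fragments most relevant to the query, and passes only those fragments to the large language model, which significantly reduces the invocation cost and improves extraction accuracy. The Baidu Cloud Qianfan platform provides 4 vector models for building vector libraries of text. For the supported models and their characteristics, refer to the vector model section in the [API List](https://cloud.baidu.com/doc/WENXINWORKSHOP/s/Nlks5zkzu). Next, we will use the `embedding-v1` model to build a vector library of the text and pass the most relevant fragments to the `DeepSeek-V3` large language model, efficiently extracting key information from a multi-page PDF file.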
6060

61-
First, download [Test File 2](https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/contract.pdf), then replace the `api_key` in the following code and execute it:
61+
First, download [Test File 2](https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/contract2.pdf), then replace the `api_key` in the following code and execute it:
6262

63-
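**Note**: Because multi-page PDF files are large, the first execution needs more time for text extraction and vector library construction. The code saves the model's visual results and the vector library locally, so later runs can load them directly.
63+
**Note**: The free service provided by the Qianfan platform currently receives a very high volume of calls, so it limits the number of tokens per minute (TPM). Because multi-page PDF files are large, testing with your own PDF files that have too many pages may trigger a TPM limit error. This limit does not apply to large model services deployed in other ways or to Qianfan's paid users.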
6464

6565
```python
6666
import os
@@ -90,7 +90,7 @@ pipeline = create_pipeline(pipeline="PP-ChatOCRv3-doc", initial_predictor=False)
9090

9191
if not os.path.exists(visual_predict_res_path):
9292
visual_predict_res = pipeline.visual_predict(
93-
input="contract.pdf",
93+
input="contract2.pdf",
9494
use_doc_orientation_classify=False,
9595
use_doc_unwarping=False,
9696
)
@@ -138,20 +138,20 @@ print(f"Total Time: {round((end_time - start_time), 4)}s")
138138

139139
```
140140
{'chat_res': {'甲方开户行': '日照银行股份有限公司开发区支行'}}
141-
Visual Predict Time: 18.6519s
142-
Vector Build Time: 6.1515s
143-
Chat Time: 7.0352s
144-
Total Time: 31.8385s
141+
Visual Predict Time: 15.3429s
142+
Vector Build Time: 4.8302s
143+
Chat Time: 3.457s
144+
Total Time: 23.6301s
145145
```
146146

147147
When we execute the above code again, the result obtained is as follows:
148148

149149
```
150150
{'chat_res': {'甲方开户行': '日照银行股份有限公司开发区支行'}}
151-
Visual Predict Time: 0.0161s
152-
Vector Build Time: 0.0016s
153-
Chat Time: 6.9516s
154-
Total Time: 6.9693s
151+
Visual Predict Time: 0.0104s
152+
Vector Build Time: 0.0006s
153+
Chat Time: 4.4056s
154+
Total Time: 4.4167s
155155
```
156156

157157
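Comparing the two runs shows that during the first execution the PP-ChatOCRv3 pipeline extracts all text from the multi-page PDF file and builds a vector library, which takes longer. During subsequent executions, the pipeline only needs to load and query the saved vector library, which significantly reduces the total time. Combined with vector retrieval, the PP-ChatOCRv3 pipeline reduces the number of large language model calls needed when extracting from very long text, giving faster text extraction and more accurate location of key information. This makes it a more efficient solution for real-world multi-page PDF information extraction.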
