Add Windows system support of pipelines (#2903)

w5688414 · web-flow · commit 445ad5297948 · 2022-08-01T18:22:13.000+08:00
* Add Windows support for pipelines

* Fix garbled UTF-8 text on Windows

* Support deleting an index and updating indexed data in pipelines

* Add numba dependency
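
To illustrate the delete-then-rebuild behaviour introduced by `--delete_index`, here is a minimal sketch. It assumes a hypothetical `InMemoryStore` in place of the Elasticsearch document store, and the `run` helper and default values are illustrative only; it is not the code in `utils/offline_ann.py`.

```python
import argparse


def build_parser():
    # Flag names follow this change; the defaults are illustrative.
    parser = argparse.ArgumentParser()
    parser.add_argument("--index_name", default="baike_cities", type=str)
    parser.add_argument("--doc_dir", default="data/baike/", type=str)
    # store_true: the flag takes no value and is False when omitted.
    parser.add_argument("--delete_index", action="store_true")
    return parser


class InMemoryStore:
    """Hypothetical stand-in for the Elasticsearch document store."""

    def __init__(self):
        self.indexes = {}

    def delete_index(self, name):
        # Deleting a missing index is treated as a no-op in this sketch.
        self.indexes.pop(name, None)

    def write_documents(self, name, docs):
        self.indexes.setdefault(name, []).extend(docs)


def run(argv, store, docs):
    args = build_parser().parse_args(argv)
    # Clear the old index first, but only when the flag was passed.
    if args.delete_index:
        store.delete_index(args.index_name)
    store.write_documents(args.index_name, docs)
    return args
```

Without the flag, repeated runs append to the same index; with it, the index is cleared before the new documents are written.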
diff --git a/applications/experimental/pipelines/examples/question-answering/Install_windows.md b/applications/experimental/pipelines/examples/question-answering/Install_windows.md
@@ -0,0 +1,112 @@
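+# Building an End-to-End Question Answering System on Windows
+
+All of the steps below were carried out in an Anaconda environment. Once Anaconda is installed, open the `Anaconda Powershell Prompt` and run the steps that follow.
+
+## 1. Quick Start: Building a City Encyclopedia Question Answering System
+
+### 1.1 Environment and Installation
+
+a. Install the dependencies: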
+```bash
+pip install -r requirements.txt
+# 1) Install the pipelines package
+cd ${HOME}/PaddleNLP/applications/experimental/pipelines/
+python setup.py install
+```
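+### 1.2 Data
+The knowledge base consists of Baidu Baike encyclopedia articles on major Chinese cities, which we crawled. We extracted the unstructured text from every article and split it into paragraphs to form the knowledge base: 365 city articles in total, yielding 1318 paragraphs after splitting.
+
+### 1.3 Try the Question Answering System with One Command
+We provide a ready-made code example for building the city encyclopedia question answering system. Run the following commands to try it out.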
+
+
+```powershell
+# We recommend running this example on a GPU, which is considerably faster
+# Select one idle GPU; GPU 0 is assumed to be idle here
+$env:CUDA_VISIBLE_DEVICES='0'
+python examples/question-answering/dense_qa_example.py --device gpu
+# On a CPU-only machine, pass --device cpu instead; expect a longer run time
+Remove-Item Env:CUDA_VISIBLE_DEVICES -ErrorAction SilentlyContinue
+python examples/question-answering/dense_qa_example.py --device cpu
+```
+
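+### 1.4 Building a Web-Based Question Answering System
+
+The web-based question answering system has three main components: 1. an ANN service based on ElasticSearch, 2. a model service built on RestAPI, and 3. a WebUI built with Streamlit. The following steps set up these three services in turn and connect them into a visual question answering system.
+
+#### 1.4.1 Start the ANN Service
+1. Follow the official documentation to download [elasticsearch-8.3.2](https://www.elastic.co/cn/downloads/elasticsearch) and unzip it.
+2. Start the ES service
+Set `xpack.security.enabled` to false in `config/elasticsearch.yml`, as follows: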
+```
+xpack.security.enabled: false
+```
+
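+Then double-click `elasticsearch.bat` in the `bin` directory to start the service.
+
+3. Kibana, a visualization tool for elasticsearch (optional)
+To make the data easier to manage, you can use the Kibana visualization tool for management and analysis. Download it from [Kibana](https://www.elastic.co/cn/downloads/kibana), unzip it, and double-click `bin\kibana.bat` to run it.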
+
+#### 1.4.2 Write Document Data to the ANN Index
+```
+# Build the ANN index, using the city encyclopedia data as an example
+python utils/offline_ann.py --index_name baike_cities --doc_dir data/baike
+```
+
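+Parameters
+* `index_name`: the name of the index
+* `doc_dir`: the path to the txt text data
+* `delete_index`: whether to delete the existing index and its data, which clears the data in ES; defaults to false
+
+On success, the script prints logs like the following: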
+```
+INFO - pipelines.utils.logger -  Logged parameters:
+ {'processor': 'TextSimilarityProcessor', 'tokenizer': 'NoneType', 'max_seq_len': '0', 'dev_split': '0.1'}
+INFO - pipelines.document_stores.elasticsearch -  Updating embeddings for all 1318 docs ...
+Updating embeddings: 10000 Docs [00:16, 617.76 Docs/s]
+```
+When the run finishes, you can inspect the data in Kibana.
+
+#### 1.4.3 Start the RestAPI Model Service
+```powershell
+# Specify the YAML configuration file of the question answering system
+$env:PIPELINE_YAML_PATH='rest_api/pipeline/dense_qa.yaml'
+# Start the model service on port 8891
+python rest_api/application.py 8891
+```
+
+#### 1.4.4 Start the WebUI
+```powershell
+# Configure the model service address
+$env:API_ENDPOINT='http://127.0.0.1:8891'
+# Start the WebUI on port 8502
+python -m streamlit run ui/webapp_question_answering.py --server.port 8502
+```
+
+You can now open http://127.0.0.1:8502 in a browser to try the city encyclopedia question answering service.
+
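+#### 1.4.5 Updating the Data
+
+There are two ways to update the data. The first is to rerun `utils/offline_ann.py` as described above. The second is to upload files through the front-end interface, which supports the txt, pdf, image, and word formats. For a txt file, separate each passage with a blank line; the program splits the text on blank lines and indexes each passage. Sample data (demo.txt):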
+
+```
+兴证策略认为，最恐慌的时候已经过去，未来一个月市场迎来阶段性修复窗口。
+
+从海外市场表现看，
+对俄乌冲突的恐慌情绪已显著释放，
+海外权益市场也从单边下跌转入双向波动。
+
+长期，继续聚焦科技创新的五大方向。1)新能源(新能源汽车、光伏、风电、特高压等)，2)新一代信息通信技术(人工智能、大数据、云计算、5G等)，3)高端制造(智能数控机床、机器人、先进轨交装备等)，4)生物医药(创新药、CXO、医疗器械和诊断设备等)，5)军工(导弹设备、军工电子元器件、空间站、航天飞机等)。
+```
+
+## FAQ
+
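+#### pip fails to install the htbuilder package with `UnicodeDecodeError: 'gbk' codec can't decode byte....`
+
+This is caused by gbk, the default character encoding on Windows. Install the package from source instead; the issue has already been fixed there.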
+
+```
+git clone https://github.com/tvst/htbuilder.git
+cd htbuilder/
+python setup.py install
+```
diff --git a/applications/experimental/pipelines/examples/question-answering/README.md b/applications/experimental/pipelines/examples/question-answering/README.md
@@ -25,6 +25,8 @@
 
 ## 3. Quick Start: Building a City Encyclopedia Question Answering System
 
+The installation steps below are for Mac and Linux. For installation and usage on Windows, see [windows](./Install_windows.md).
+
 ### 3.1 Environment and Installation
 
 This experiment was run in the environment described below; users can also run it on their own GPU hardware:
@@ -90,8 +92,15 @@ curl http://localhost:9200/_aliases?pretty=true
 ```
 # Build the ANN index, using the city encyclopedia data as an example
 python utils/offline_ann.py --index_name baike_cities \
-                            --doc_dir data/baike
+                            --doc_dir data/baike \
+                            --delete_index
 ```
+
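+Parameters
+* `index_name`: the name of the index
+* `doc_dir`: the path to the txt text data
+* `delete_index`: whether to delete the existing index and its data, which clears the data in ES; defaults to false
+
 On success, the script prints logs like the following: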
 ```
 INFO - pipelines.utils.logger -  Logged parameters:
@@ -142,6 +151,20 @@ sh scripts/run_qa_web.sh
 
 You can now open http://127.0.0.1:8502 in a browser to try the city encyclopedia question answering service.
 
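+#### 3.4.5 Updating the Data
+
+There are two ways to update the data. The first is to rerun `utils/offline_ann.py` as described above. The second is to upload files through the front-end interface, which supports the txt, pdf, image, and word formats. For a txt file, separate each passage with a blank line; the program splits the text on blank lines and indexes each passage. Sample data (demo.txt):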
+
+```
+兴证策略认为，最恐慌的时候已经过去，未来一个月市场迎来阶段性修复窗口。
+
+从海外市场表现看，
+对俄乌冲突的恐慌情绪已显著释放，
+海外权益市场也从单边下跌转入双向波动。
+
+长期，继续聚焦科技创新的五大方向。1)新能源(新能源汽车、光伏、风电、特高压等)，2)新一代信息通信技术(人工智能、大数据、云计算、5G等)，3)高端制造(智能数控机床、机器人、先进轨交装备等)，4)生物医药(创新药、CXO、医疗器械和诊断设备等)，5)军工(导弹设备、军工电子元器件、空间站、航天飞机等)。
+```
+
 ## Reference
 [1]Y. Sun et al., “[ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation](https://arxiv.org/pdf/2107.02137.pdf),” arXiv:2107.02137 [cs], Jul. 2021, Accessed: Jan. 17, 2022. [Online]. Available: http://arxiv.org/abs/2107.02137
 
diff --git a/applications/experimental/pipelines/examples/question-answering/dense_qa_example.py b/applications/experimental/pipelines/examples/question-answering/dense_qa_example.py
@@ -41,7 +41,9 @@ def dense_qa_pipeline():
         doc_dir = "data/baike"
         city_data = "https://paddlenlp.bj.bcebos.com/applications/baike.zip"
         fetch_archive_from_http(url=city_data, output_dir=doc_dir)
-        dicts = convert_files_to_dicts(dir_path=doc_dir, split_paragraphs=True)
+        dicts = convert_files_to_dicts(dir_path=doc_dir,
+                                       split_paragraphs=True,
+                                       encoding='utf-8')
 
         if os.path.exists(args.index_name):
             os.remove(args.index_name)
diff --git a/applications/experimental/pipelines/examples/semantic-search/Install_windows.md b/applications/experimental/pipelines/examples/semantic-search/Install_windows.md
@@ -0,0 +1,91 @@
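+# Building an End-to-End Semantic Search System on Windows
+All of the steps below were carried out in an Anaconda environment. Once Anaconda is installed, open the `Anaconda Powershell Prompt` and run the steps that follow.
+
+## 1. Quick Start: Building a Semantic Search System
+
+### 1.1 Environment and Installation
+
+a. Install the dependencies: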
+```bash
+pip install -r requirements.txt
+# 1) Install the pipelines package
+cd ${HOME}/PaddleNLP/applications/experimental/pipelines/
+python setup.py install
+```
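+### 1.2 Data
+The semantic search database is built from the [DuReader-Robust dataset](https://github.com/baidu/DuReader/tree/master/DuReader-Robust), which contains 46972 paragraphs in total.
+
+### 1.3 Try the Semantic Search System with One Command
+We provide a ready-made code example that builds a semantic search system on the [DuReader-Robust dataset](https://github.com/baidu/DuReader/tree/master/DuReader-Robust). Run the following commands to try it out.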
+```powershell
+# We recommend running this example on a GPU, which is considerably faster
+# Select one idle GPU; GPU 0 is assumed to be idle here
+$env:CUDA_VISIBLE_DEVICES='0'
+python examples/semantic-search/semantic_search_example.py --device gpu
+# On a CPU-only machine, pass --device cpu instead; expect a longer run time
+Remove-Item Env:CUDA_VISIBLE_DEVICES -ErrorAction SilentlyContinue
+python examples/semantic-search/semantic_search_example.py --device cpu
+```
+
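+### 1.4 Building a Web-Based Semantic Search System
+
+The web-based semantic search system has three main components: 1. an ANN service based on ElasticSearch, 2. a model service built on RestAPI, and 3. a WebUI built with Streamlit. The following steps set up these three services in turn to form a visual semantic search system.
+
+#### 1.4.1 Start the ANN Service
+1. Follow the official documentation to download [elasticsearch-8.3.2](https://www.elastic.co/cn/downloads/elasticsearch) and unzip it.
+2. Start the ES service
+Set `xpack.security.enabled` to false in `config/elasticsearch.yml`, as follows: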
+```
+xpack.security.enabled: false
+```
+
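+Then double-click `elasticsearch.bat` in the `bin` directory to start the service.
+
+3. Kibana, a visualization tool for elasticsearch (optional)
+To make the data easier to manage, you can use the Kibana visualization tool for management and analysis. Download it from [Kibana](https://www.elastic.co/cn/downloads/kibana), unzip it, and double-click `bin\kibana.bat` to run it.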
+
+#### 1.4.2 Write Document Data to the ANN Index
+```
+# Build the ANN index, using the DuReader-Robust dataset as an example
+python utils/offline_ann.py --index_name dureader_robust_query_encoder --doc_dir data/dureader_robust_processed
+```
+
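+Parameters
+* `index_name`: the name of the index
+* `doc_dir`: the path to the txt text data
+* `delete_index`: whether to delete the existing index and its data, which clears the data in ES; defaults to false
+
+
+When the run finishes, you can inspect the data in Kibana.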
+
+#### 1.4.3 Start the RestAPI Model Service
+```powershell
+# Specify the YAML configuration file of the semantic search system
+$env:PIPELINE_YAML_PATH='rest_api/pipeline/semantic_search.yaml'
+# Start the model service on port 8891
+python rest_api/application.py 8891
+```
+
+#### 1.4.4 Start the WebUI
+```powershell
+# Configure the model service address
+$env:API_ENDPOINT='http://127.0.0.1:8891'
+# Start the WebUI on port 8502
+python -m streamlit run ui/webapp_semantic_search.py --server.port 8502
+```
+
+You can now open http://127.0.0.1:8502 in a browser to try the semantic search service.
+
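+#### 1.4.5 Updating the Data
+
+There are two ways to update the data. The first is to rerun `utils/offline_ann.py` as described above. The second is to upload files through the front-end interface, which supports the txt, pdf, image, and word formats. For a txt file, separate each passage with a blank line; the program splits the text on blank lines and indexes each passage. Sample data (demo.txt):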
+
+```
+兴证策略认为，最恐慌的时候已经过去，未来一个月市场迎来阶段性修复窗口。
+
+从海外市场表现看，
+对俄乌冲突的恐慌情绪已显著释放，
+海外权益市场也从单边下跌转入双向波动。
+
+长期，继续聚焦科技创新的五大方向。1)新能源(新能源汽车、光伏、风电、特高压等)，2)新一代信息通信技术(人工智能、大数据、云计算、5G等)，3)高端制造(智能数控机床、机器人、先进轨交装备等)，4)生物医药(创新药、CXO、医疗器械和诊断设备等)，5)军工(导弹设备、军工电子元器件、空间站、航天飞机等)。
+```
diff --git a/applications/experimental/pipelines/examples/semantic-search/README.md b/applications/experimental/pipelines/examples/semantic-search/README.md
@@ -30,6 +30,8 @@
 
 ## 3. Quick Start: Building a Semantic Search System
 
+The installation steps below are for Mac and Linux. For installation and usage on Windows, see [windows](./Install_windows.md).
+
 ### 3.1 Environment and Installation
 
 This experiment was run in the environment described below; users can also run it on their own GPU hardware:
@@ -95,6 +97,12 @@ curl http://localhost:9200/_aliases?pretty=true
 python utils/offline_ann.py --index_name dureader_robust_query_encoder \
                             --doc_dir data/dureader_dev
 ```
+
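+Parameters
+* `index_name`: the name of the index
+* `doc_dir`: the path to the txt text data
+* `delete_index`: whether to delete the existing index and its data, which clears the data in ES; defaults to false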
+
 #### 3.4.3 Start the RestAPI Model Service
 ```bash
 # Specify the YAML configuration file of the semantic search system
@@ -123,6 +131,20 @@ sh scripts/run_search_web.sh
 
 You can now open http://127.0.0.1:8502 in a browser to try the semantic search service.
 
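+#### 3.4.5 Updating the Data
+
+There are two ways to update the data. The first is to rerun `utils/offline_ann.py` as described above. The second is to upload files through the front-end interface, which supports the txt, pdf, image, and word formats. For a txt file, separate each passage with a blank line; the program splits the text on blank lines and indexes each passage. Sample data (demo.txt):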
+
+```
+兴证策略认为，最恐慌的时候已经过去，未来一个月市场迎来阶段性修复窗口。
+
+从海外市场表现看，
+对俄乌冲突的恐慌情绪已显著释放，
+海外权益市场也从单边下跌转入双向波动。
+
+长期，继续聚焦科技创新的五大方向。1)新能源(新能源汽车、光伏、风电、特高压等)，2)新一代信息通信技术(人工智能、大数据、云计算、5G等)，3)高端制造(智能数控机床、机器人、先进轨交装备等)，4)生物医药(创新药、CXO、医疗器械和诊断设备等)，5)军工(导弹设备、军工电子元器件、空间站、航天飞机等)。
+```
+
 ## FAQ
 
 #### The semantic search system runs, but the terminal output is garbled. How can this be fixed?
diff --git a/applications/experimental/pipelines/examples/semantic-search/semantic_search_example.py b/applications/experimental/pipelines/examples/semantic-search/semantic_search_example.py
@@ -40,7 +40,9 @@ def semantic_search_tutorial():
         dureader_data = "https://paddlenlp.bj.bcebos.com/applications/dureader_dev.zip"
 
         fetch_archive_from_http(url=dureader_data, output_dir=doc_dir)
-        dicts = convert_files_to_dicts(dir_path=doc_dir, split_paragraphs=True)
+        dicts = convert_files_to_dicts(dir_path=doc_dir,
+                                       split_paragraphs=True,
+                                       encoding='utf-8')
 
         if os.path.exists(args.index_name):
             os.remove(args.index_name)
diff --git a/applications/experimental/pipelines/requirements.txt b/applications/experimental/pipelines/requirements.txt
@@ -20,4 +20,5 @@ st-annotated-text
 streamlit==1.9.0
 fastapi
 uvicorn
-markdown
+markdown
+numba
diff --git a/applications/experimental/pipelines/utils/offline_ann.py b/applications/experimental/pipelines/utils/offline_ann.py
@@ -15,6 +15,10 @@
                     default='data/baike/',
                     type=str,
                     help="The doc path of the corpus")
+parser.add_argument(
+    '--delete_index',
+    action='store_true',
+    help='whether to delete the existing index before rebuilding it')
 
 args = parser.parse_args()
 
@@ -28,9 +32,10 @@ def offline_ann(index_name, doc_dir):
                                                 username="",
                                                 password="",
                                                 index=index_name)
-
     # Split each document into paragraphs
-    dicts = convert_files_to_dicts(dir_path=doc_dir, split_paragraphs=True)
+    dicts = convert_files_to_dicts(dir_path=doc_dir,
+                                   split_paragraphs=True,
+                                   encoding='utf-8')
 
     print(dicts[:3])
 
@@ -53,5 +58,18 @@ def offline_ann(index_name, doc_dir):
     document_store.update_embeddings(retriever)
 
 
+def delete_data(index_name):
+    document_store = ElasticsearchDocumentStore(host="127.0.0.1",
+                                                port="9200",
+                                                username="",
+                                                password="",
+                                                index=index_name)
+
+    document_store.delete_index(index_name)
+    print('Deleted existing elasticsearch index {}.'.format(index_name))
+
+
 if __name__ == "__main__":
+    if args.delete_index:
+        delete_data(args.index_name)
     offline_ann(args.index_name, args.doc_dir)