Commit d035ba2

Perform section extraction only in pre-embedding cleaner (#48)
1 parent c10bc02 commit d035ba2

File tree

5 files changed: +18 −52 lines changed


adi_function_app/README.md

Lines changed: 2 additions & 23 deletions
@@ -24,15 +24,13 @@ Once the Markdown is obtained, several steps are carried out:
 
 1. **Extraction of images / charts**. The figures identified are extracted from the original document and passed to a multi-modal model (GPT-4o in this case) for analysis. We obtain a description and summary of the chart / image to infer the meaning of the figure. This allows us to index and perform RAG analysis on the information that is visually obtainable from a chart, without it being explicitly mentioned in the surrounding text. The information is added back into the original chart.
 
-2. **Extraction of sections and headers**. The sections and headers are extracted from the document and additionally returned to the indexer under a separate field. This allows us to store them as a separate field in the index and therefore surface the most relevant chunks.
-
-3. **Cleaning of Markdown**. The final Markdown content is cleaned of any characters or unsupported Markdown elements that we do not want in the chunk, e.g. non-relevant images.
+2. **Cleaning of Markdown**. The final Markdown content is cleaned of any characters or unsupported Markdown elements that we do not want in the chunk, e.g. non-relevant images.
 
 Page-wise analysis in ADI is used to avoid splitting tables / figures across multiple chunks when the chunking is performed.
 
 The properties returned from the ADI Custom Skill are then used to perform the following skills:
 
-- Pre-vectorisation cleaning
+- Pre-vectorisation cleaning. This stage is important as we extract the section information from the headers in the document at this step. Additionally, we remove any Markdown tags or characters that would cause an embedding error.
 - Keyphrase extraction
 - Vectorisation
 

@@ -43,7 +41,6 @@ Using the [Phi-3 Technical Report: A Highly Capable Language Model Locally on Yo
 ```json
 {
"content": "\n<table>\n<caption>Table 1: Comparison results on RepoQA benchmark.</caption>\n<tr>\n<th>Model</th>\n<th>Ctx Size</th>\n<th>Python</th>\n<th>C++</th>\n<th>Rust</th>\n<th>Java</th>\n<th>TypeScript</th>\n<th>Average</th>\n</tr>\n<tr>\n<td>gpt-4O-2024-05-13</td>\n<td>128k</td>\n<td>95</td>\n<td>80</td>\n<td>85</td>\n<td>96</td>\n<td>97</td>\n<td>90.6</td>\n</tr>\n<tr>\n<td>gemini-1.5-flash-latest</td>\n<td>1000k</td>\n<td>93</td>\n<td>79</td>\n<td>87</td>\n<td>94</td>\n<td>97</td>\n<td>90</td>\n</tr>\n<tr>\n<td>Phi-3.5-MoE</td>\n<td>128k</td>\n<td>89</td>\n<td>74</td>\n<td>81</td>\n<td>88</td>\n<td>95</td>\n<td>85</td>\n</tr>\n<tr>\n<td>Phi-3.5-Mini</td>\n<td>128k</td>\n<td>86</td>\n<td>67</td>\n<td>73</td>\n<td>77</td>\n<td>82</td>\n<td>77</td>\n</tr>\n<tr>\n<td>Llama-3.1-8B-Instruct</td>\n<td>128k</td>\n<td>80</td>\n<td>65</td>\n<td>73</td>\n<td>76</td>\n<td>63</td>\n<td>71</td>\n</tr>\n<tr>\n<td>Mixtral-8x7B-Instruct-v0.1</td>\n<td>32k</td>\n<td>66</td>\n<td>65</td>\n<td>64</td>\n<td>71</td>\n<td>74</td>\n<td>68</td>\n</tr>\n<tr>\n<td>Mixtral-8x22B-Instruct-v0.1</td>\n<td>64k</td>\n<td>60</td>\n<td>67</td>\n<td>74</td>\n<td>83</td>\n<td>55</td>\n<td>67.8</td>\n</tr>\n</table>\n\n\nsuch as Arabic, Chinese, Russian, Ukrainian, and Vietnamese, with average MMLU-multilingual scores\nof 55.4 and 47.3, respectively. Due to its larger model capacity, phi-3.5-MoE achieves a significantly\nhigher average score of 69.9, outperforming phi-3.5-mini.\n\nMMLU(5-shot) MultiLingual\n\nPhi-3-mini\n\nPhi-3.5-mini\n\nPhi-3.5-MoE\n\n\n<!-- FigureContent=\"**Technical Analysis of Figure 4: Comparison of phi-3-mini, phi-3.5-mini and phi-3.5-MoE on MMLU-Multilingual tasks**\n\n1. **Overview:**\n - The image is a bar chart comparing the performance of three different models—phi-3-mini, phi-3.5-mini, and phi-3.5-MoE—on MMLU-Multilingual tasks across various languages.\n\n2. **Axes:**\n - The x-axis represents the languages in which the tasks were performed. 
The languages listed are: Arabic, Chinese, Dutch, French, German, Italian, Russian, Spanish, Ukrainian, Vietnamese, and English.\n - The y-axis represents the performance, likely measured in percentage or score, ranging from 0 to 90.\n\n3. **Legend:**\n - The chart uses three different colors to represent the three models:\n - Orange bars represent the phi-3-mini model.\n - Green bars represent the phi-3.5-mini model.\n - Blue bars represent the phi-3.5-MoE model.\n\n4. **Data Interpretation:**\n - Across all languages, the phi-3.5-MoE (blue bars) consistently outperforms the other two models, showing the highest bars.\n - The phi-3.5-mini (green bars) shows better performance than the phi-3-mini (orange bars) in most languages, but not at the level of phi-3.5-MoE.\n\n5. **Language-specific Insights:**\n - **Arabic**: phi-3.5-MoE shows significantly higher performance compared to the other two models, with phi-3.5-mini outperforming phi-3-mini.\n - **Chinese**: A similar trend is observed as in Arabic, with phi-3.5-MoE leading by a wide margin.\n - **Dutch**: Performance is roughly similar between phi-3.5-mini and phi-3.5-MoE, with phi-3.5-MoE being slightly better.\n - **French**: A clear distinction in performance, with phi-3.5-MoE far exceeding the other two.\n - **German**: phi-3.5-MoE leads, followed by phi-3.5-mini, while phi-3-mini lags significantly behind.\n - **Italian**: The performance gap narrows between phi-3.5-mini and phi-3.5-MoE, but the latter is still superior.\n - **Russian**: phi-3.5-MoE shows noticeably higher performance.\n - **Spanish**: The performance trend is consistent with the previous languages, with phi-3.5-MoE leading.\n - **Ukrainian**: A substantial lead by phi-3.5-MoE.\n - **Vietnamese**: An anomaly where all models show closer performance, yet phi-3.5-MoE still leads.\n - **English**: The highest performance is seen in English, with phi-3.5-MoE nearly reaching the maximum score.\n\n6. 
**Conclusion:**\n - The phi-3.5-MoE model consistently outperforms the phi-3-mini and phi-3.5-mini models across all MMLU-Multilingual tasks.\n - The phi-3.5-mini model shows a general improvement over the phi-3-mini, but the improvement is not as significant as phi-3.5-MoE.\n\nThis structured analysis provides a comprehensive understanding of the comparative performance of the mentioned models across multilingual tasks.\" -->\n\n\n We evaluate the phi-3.5-mini and phi-3.5-MoE models on two long-context understanding tasks:\nRULER [HSK+24] and RepoQA [LTD+24]. As shown in Tables 1 and 2, both phi-3.5-MoE and phi-\n3.5-mini outperform other open-source models with larger sizes, such as Llama-3.1-8B, Mixtral-8x7B,\nand Mixtral-8x22B, on the RepoQA task, and achieve comparable performance to Llama-3.1-8B on\nthe RULER task. However, we observe a significant performance drop when testing the 128K context\nwindow on the RULER task. We suspect this is due to the lack of high-quality long-context data in\nmid-training, an issue we plan to address in the next version of the model release.\n\n In the table 3, we present a detailed evaluation of the phi-3.5-mini and phi-3.5-MoE models\ncompared with recent SoTA pretrained language models, such as GPT-4o-mini, Gemini-1.5 Flash, and\nopen-source models like Llama-3.1-8B and the Mistral models. The results show that phi-3.5-mini\nachieves performance comparable to much larger models like Mistral-Nemo-12B and Llama-3.1-8B, while\nphi-3.5-MoE significantly outperforms other open-source models, offers performance comparable to\nGemini-1.5 Flash, and achieves above 90% of the average performance of GPT-4o-mini across various\nlanguage benchmarks.\n\n\n\n\n",
-    "sections": [],
     "page_number": 7
 }
 ```
@@ -133,16 +130,10 @@ If `chunk_by_page` header is `True` (recommended):
     "extracted_content": [
         {
             "page_number": 1,
-            "sections": [
-                "<LIST OF DETECTED HEADINGS AND SECTIONS FOR PAGE NUMBER 1>"
-            ],
             "content": "<CLEANED MARKDOWN CONTENT FOR PAGE NUMBER 1>"
         },
         {
             "page_number": 2,
-            "sections": [
-                "<LIST OF DETECTED HEADINGS AND SECTIONS FOR PAGE NUMBER 2>"
-            ],
             "content": "<CLEANED MARKDOWN CONTENT FOR PAGE NUMBER 2>"
         }
     ]
@@ -154,16 +145,10 @@ If `chunk_by_page` header is `True` (recommended):
     "extracted_content": [
         {
             "page_number": 1,
-            "sections": [
-                "<LIST OF DETECTED HEADINGS AND SECTIONS FOR PAGE NUMBER 1>"
-            ],
             "content": "<CLEANED MARKDOWN CONTENT FOR PAGE NUMBER 2>"
         },
         {
             "page_number": 2,
-            "sections": [
-                "<LIST OF DETECTED HEADINGS AND SECTIONS FOR PAGE NUMBER 1>"
-            ],
             "content": "<CLEANED MARKDOWN CONTENT FOR PAGE NUMBER 2>"
         }
     ]
@@ -182,9 +167,6 @@ If `chunk_by_page` header is `False`:
     "recordId": "0",
     "data": {
         "extracted_content": {
-            "sections": [
-                "<LIST OF DETECTED HEADINGS AND SECTIONS FOR THE ENTIRE DOCUMENT>"
-            ],
             "content": "<CLEANED MARKDOWN CONTENT FOR THE ENTIRE DOCUMENT>"
         }
     }
@@ -193,9 +175,6 @@ If `chunk_by_page` header is `False`:
     "recordId": "1",
     "data": {
         "extracted_content": {
-            "sections": [
-                "<LIST OF DETECTED HEADINGS AND SECTIONS FOR THE ENTIRE DOCUMENT>"
-            ],
             "content": "<CLEANED MARKDOWN CONTENT FOR THE ENTIRE DOCUMENT>"
         }
     }
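With this commit, section extraction no longer happens in the ADI step; the sections are recovered later in the pre-embedding cleaner from the Markdown headings themselves. As a rough standalone illustration (not the repo's exact function), the combined regex that this commit removes from `adi_2_ai_search.py` matches both Setext (`Title\n===`) and ATX (`## Title`) style headings:

```python
import re

# Combined pattern as it appears in the removed code: the first alternative
# matches Setext headings ("Title\n==="), the second matches ATX headings
# ("## Title") introduced by a newline.
COMBINED_PATTERN = r"(.*?)\n===|\n#+\s*(.*?)\n"


def extract_sections(markdown_text: str) -> list:
    """Return heading texts found in the Markdown, in document order."""
    matches = re.findall(COMBINED_PATTERN, markdown_text, re.DOTALL)
    # findall yields (setext_group, atx_group) tuples; keep non-empty parts.
    return [part for group in matches for part in group if part]


sample = "Introduction\n===\nBody text.\n\n## Results\nMore text.\n"
print(extract_sections(sample))  # -> ['Introduction', 'Results']
```

The `sample` text is illustrative only; the real pipeline runs this over ADI's Markdown output after figure content has been stripped.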

adi_function_app/adi_2_ai_search.py

Lines changed: 1 addition & 12 deletions
@@ -40,7 +40,7 @@ async def build_and_clean_markdown_for_response(
     """
 
     output_dict = {}
-    comment_patterns = r"<!-- PageNumber=\"[^\"]*\" -->|<!-- PageHeader=\"[^\"]*\" -->|<!-- PageFooter=\"[^\"]*\" -->|<!-- PageBreak -->"
+    comment_patterns = r"<!-- PageNumber=\"[^\"]*\" -->|<!-- PageHeader=\"[^\"]*\" -->|<!-- PageFooter=\"[^\"]*\" -->|<!-- PageBreak -->|<!-- Footnote=\"[^\"]*\" -->"
     cleaned_text = re.sub(comment_patterns, "", markdown_text, flags=re.DOTALL)
 
     # Remove irrelevant figures
@@ -52,18 +52,7 @@ async def build_and_clean_markdown_for_response(
 
     logging.info(f"Cleaned Text: {cleaned_text}")
 
-    markdown_without_figure_content = re.sub(
-        r"<!-- FigureContent=\"[^\"]*\" -->", "", cleaned_text, flags=re.DOTALL
-    )
-
-    combined_pattern = r"(.*?)\n===|\n#+\s*(.*?)\n"
-    doc_metadata = re.findall(
-        combined_pattern, markdown_without_figure_content, re.DOTALL
-    )
-    doc_metadata = [match for group in doc_metadata for match in group if match]
-
     output_dict["content"] = cleaned_text
-    output_dict["sections"] = doc_metadata
 
     output_dict["figures"] = figures
 
adi_function_app/pre_embedding_cleaner.py

Lines changed: 13 additions & 13 deletions
@@ -5,7 +5,7 @@
 import re
 
 
-def get_section(cleaned_text: str) -> list:
+def get_sections(cleaned_text: str) -> list:
     """
     Returns the section details from the content
@@ -52,7 +52,7 @@ def remove_markdown_tags(text: str, tag_patterns: dict) -> str:
     return text
 
 
-def clean_text(src_text: str) -> str:
+def clean_text_with_section_extraction(src_text: str) -> tuple[str, str]:
     """This function performs the following cleanup activities on the text: remove all unicode characters,
     remove line spacing, remove stop words, normalize characters
@@ -77,6 +77,8 @@ def clean_text(src_text: str) -> str:
     }
     cleaned_text = remove_markdown_tags(src_text, tag_patterns)
 
+    sections = get_sections(cleaned_text)
+
     # Updated regex to keep Unicode letters, punctuation, whitespace, currency symbols, and percentage signs,
     # while also removing non-printable characters
     cleaned_text = re.sub(r"[^\p{L}\p{P}\s\p{Sc}%\x20-\x7E]", "", cleaned_text)
@@ -88,7 +90,7 @@ def clean_text(src_text: str) -> str:
     except Exception as e:
         logging.error(f"An error occurred in clean_text: {e}")
         return ""
-    return cleaned_text
+    return cleaned_text, sections
 
 
 async def process_pre_embedding_cleaner(record: dict) -> dict:
@@ -114,19 +116,17 @@ async def process_pre_embedding_cleaner(record: dict) -> dict:
 
         # scenarios when page by chunking is enabled
         if isinstance(record["data"]["chunk"], dict):
-            cleaned_record["data"]["cleanedChunk"] = clean_text(
-                record["data"]["chunk"]["content"]
-            )
+            (
+                cleaned_record["data"]["cleanedChunk"],
+                cleaned_record["data"]["sections"],
+            ) = clean_text_with_section_extraction(record["data"]["chunk"]["content"])
             cleaned_record["data"]["chunk"] = record["data"]["chunk"]["content"]
-            cleaned_record["data"]["cleanedSections"] = clean_sections(
-                record["data"]["chunk"]["sections"]
-            )
         else:
-            cleaned_record["data"]["cleanedChunk"] = clean_text(record["data"]["chunk"])
+            (
+                cleaned_record["data"]["cleanedChunk"],
+                cleaned_record["data"]["sections"],
+            ) = clean_text_with_section_extraction(record["data"]["chunk"])
             cleaned_record["data"]["chunk"] = record["data"]["chunk"]
-            cleaned_record["data"]["cleanedSections"] = get_section(
-                record["data"]["chunk"]
-            )
 
     except Exception as e:
         logging.error("string cleanup Error: %s", e)
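After this change, `clean_text_with_section_extraction` hands back both the cleaned text and the sections extracted from it, so the skill no longer needs a chunk-level `sections` input. A heavily simplified sketch of that tuple-returning shape (a hypothetical stand-in: the real function also strips Markdown tags, stop words, and non-printable characters):

```python
import re


def clean_text_with_section_extraction_sketch(src_text: str) -> tuple:
    """Simplified sketch: pull ATX headings out as sections, then clean."""
    # Collect heading text ("# Title" -> "Title") as the sections list.
    sections = [
        m.lstrip("#").strip()
        for m in re.findall(r"^#+\s*.+$", src_text, re.MULTILINE)
    ]
    # Minimal "cleaning" for illustration: drop the heading markers.
    cleaned_text = re.sub(r"^#+\s*", "", src_text, flags=re.MULTILINE).strip()
    return cleaned_text, sections


cleaned, sections = clean_text_with_section_extraction_sketch("# Title\nBody text.")
```

Unpacking the tuple directly into `cleaned_record["data"]["cleanedChunk"]` and `cleaned_record["data"]["sections"]`, as the diff does, keeps the two outputs produced from the same cleaned text in lockstep.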

deploy_ai_search/ai_search.py

Lines changed: 1 addition & 3 deletions
@@ -220,9 +220,7 @@ def get_pre_embedding_cleaner_skill(self, context, source) -> WebApiSkill:
         pre_embedding_cleaner_skill_outputs = [
             OutputFieldMappingEntry(name="cleanedChunk", target_name="cleanedChunk"),
             OutputFieldMappingEntry(name="chunk", target_name="chunk"),
-            OutputFieldMappingEntry(
-                name="cleanedSections", target_name="cleanedSections"
-            ),
+            OutputFieldMappingEntry(name="sections", target_name="sections"),
         ]
 
         pre_embedding_cleaner_skill = WebApiSkill(

deploy_ai_search/rag_documents.py

Lines changed: 1 addition & 1 deletion
@@ -215,7 +215,7 @@ def get_index_projections(self) -> SearchIndexerIndexProjection:
                     name="Keywords", source="/document/pages/*/keywords"
                 ),
                 InputFieldMappingEntry(
-                    name="Sections", source="/document/pages/*/cleanedSections"
+                    name="Sections", source="/document/pages/*/sections"
                 ),
                 InputFieldMappingEntry(
                     name="Figures",
