'remove_field' option: also remove fldChar nodes #707
Merged
davidgohel merged 1 commit into davidgohel:master on Jan 14, 2026
…ing in 'run_content_text' using doc_summary
Contributor (Author)
BTW, great work on providing run data, potentially really useful!
Owner
Thank you, using your example shows the issue you detected!
library(tibble)
library(officer)
curl::curl_download("https://github.com/user-attachments/files/24586594/example.docx", "example.docx")
doc <- read_docx("example.docx")
z1 <- docx_summary(doc, preserve = TRUE, remove_fields = FALSE, detailed = TRUE)
z2 <- docx_summary(doc, preserve = TRUE, remove_fields = TRUE, detailed = TRUE)
as_tibble(z1)
#> # A tibble: 6 × 37
#> doc_index content_type run_index run_content_index run_content_text image_path
#> <int> <chr> <int> <int> <chr> <chr>
#> 1 1 paragraph 1 1 " " <NA>
#> 2 1 paragraph 2 1 <NA> <NA>
#> 3 1 paragraph 3 1 "ADDIN ZOTERO_I… <NA>
#> 4 1 paragraph 4 1 <NA> <NA>
#> 5 1 paragraph 5 1 "[1,2]" <NA>
#> 6 1 paragraph 6 1 <NA> <NA>
#> # ℹ 31 more variables: field_code <chr>, footnote_text <chr>, link <chr>,
#> # link_to_bookmark <chr>, bookmark_start <chr>, character_stylename <chr>,
#> # sz <int>, sz_cs <int>, font_family_ascii <chr>, font_family_eastasia <chr>,
#> # font_family_hansi <chr>, font_family_cs <chr>, bold <lgl>, italic <lgl>,
#> # underline <lgl>, color <chr>, shading <chr>, shading_color <chr>,
#> # shading_fill <chr>, paragraph_stylename <chr>, keep_with_next <lgl>,
#> # align <chr>, level <int>, num_id <int>, table_index <int>, row_id <int>, …
as_tibble(z2)
#> # A tibble: 5 × 37
#> doc_index content_type run_index run_content_index run_content_text image_path
#> <int> <chr> <int> <int> <chr> <chr>
#> 1 1 paragraph 1 1 " " <NA>
#> 2 1 paragraph 2 1 <NA> <NA>
#> 3 1 paragraph 3 1 <NA> <NA>
#> 4 1 paragraph 4 1 "[1,2]" <NA>
#> 5 1 paragraph 5 1 <NA> <NA>
#> # ℹ 31 more variables: field_code <chr>, footnote_text <chr>, link <chr>,
#> # link_to_bookmark <chr>, bookmark_start <chr>, character_stylename <chr>,
#> # sz <int>, sz_cs <int>, font_family_ascii <chr>, font_family_eastasia <chr>,
#> # font_family_hansi <chr>, font_family_cs <chr>, bold <lgl>, italic <lgl>,
#> # underline <lgl>, color <chr>, shading <chr>, shading_color <chr>,
#> # shading_fill <chr>, paragraph_stylename <chr>, keep_with_next <lgl>,
#> # align <chr>, level <int>, num_id <int>, table_index <int>, row_id <int>, …
Created on 2026-01-14 with reprex v2.1.1
With your fix, we will see the expected output:
library(tibble)
library(officer)
curl::curl_download("https://github.com/user-attachments/files/24586594/example.docx", "example.docx")
doc <- read_docx("example.docx")
z1 <- docx_summary(doc, preserve = TRUE, remove_fields = FALSE, detailed = TRUE)
z2 <- docx_summary(doc, preserve = TRUE, remove_fields = TRUE, detailed = TRUE)
as_tibble(z1)
#> # A tibble: 6 × 37
#> doc_index content_type run_index run_content_index run_content_text image_path
#> <int> <chr> <int> <int> <chr> <chr>
#> 1 1 paragraph 1 1 " " <NA>
#> 2 1 paragraph 2 1 <NA> <NA>
#> 3 1 paragraph 3 1 "ADDIN ZOTERO_I… <NA>
#> 4 1 paragraph 4 1 <NA> <NA>
#> 5 1 paragraph 5 1 "[1,2]" <NA>
#> 6 1 paragraph 6 1 <NA> <NA>
#> # ℹ 31 more variables: field_code <chr>, footnote_text <chr>, link <chr>,
#> # link_to_bookmark <chr>, bookmark_start <chr>, character_stylename <chr>,
#> # sz <int>, sz_cs <int>, font_family_ascii <chr>, font_family_eastasia <chr>,
#> # font_family_hansi <chr>, font_family_cs <chr>, bold <lgl>, italic <lgl>,
#> # underline <lgl>, color <chr>, shading <chr>, shading_color <chr>,
#> # shading_fill <chr>, paragraph_stylename <chr>, keep_with_next <lgl>,
#> # align <chr>, level <int>, num_id <int>, table_index <int>, row_id <int>, …
as_tibble(z2)
#> # A tibble: 2 × 37
#> doc_index content_type run_index run_content_index run_content_text image_path
#> <int> <chr> <int> <int> <chr> <chr>
#> 1 1 paragraph 1 1 " " <NA>
#> 2 1 paragraph 2 1 "[1,2]" <NA>
#> # ℹ 31 more variables: field_code <chr>, footnote_text <chr>, link <chr>,
#> # link_to_bookmark <chr>, bookmark_start <chr>, character_stylename <chr>,
#> # sz <int>, sz_cs <int>, font_family_ascii <chr>, font_family_eastasia <chr>,
#> # font_family_hansi <chr>, font_family_cs <chr>, bold <lgl>, italic <lgl>,
#> # underline <lgl>, color <chr>, shading <chr>, shading_color <chr>,
#> # shading_fill <chr>, paragraph_stylename <chr>, keep_with_next <lgl>,
#> # align <chr>, level <int>, num_id <int>, table_index <int>, row_id <int>, …
Created on 2026-01-14 with reprex v2.1.1
In `docx_summary()`, when `remove_fields` is true and `detailed` is true:
Recent changes cause empty nodes to be converted to paragraphs containing `NA` when using `docx_summary()`, meaning that each of them is represented by a row in the returned data frame that has `run_content_text` set to `NA`. I think this is not the desired behaviour.
See the code in `docx_runs_content_information()` in fortify_docx.R.
Therefore, the empty fldChar nodes also need to be removed, since they are converted to `NA`; this avoids `NA`s appearing in `run_content_text` when using `docx_summary()`.
Also attached is a .docx with an example field to try:
example.docx
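To illustrate why the fldChar runs matter, here is a minimal sketch using xml2 on a hand-written paragraph. It is not officer's implementation and not the content of example.docx: the XML, the `ADDIN EXAMPLE` instruction, and the variable names are assumptions made for illustration. A complex field is spread over several runs (begin, instruction, separate, result, end), so removing only the instruction leaves three runs that hold nothing but a fldChar marker.

```r
library(xml2)

# Hypothetical paragraph with one complex field (illustration only).
xml <- '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r><w:t xml:space="preserve">See </w:t></w:r>
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText xml:space="preserve"> ADDIN EXAMPLE </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="separate"/></w:r>
  <w:r><w:t>[1,2]</w:t></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p>'
p <- read_xml(xml)
ns <- xml_ns(p)

# Step 1: remove only the run carrying the field instruction.
xml_remove(xml_find_all(p, ".//w:r[w:instrText]", ns))
# The begin/separate/end marker runs are still present and carry no text.
n_after_instr <- length(xml_find_all(p, ".//w:r", ns))
n_marker_runs <- length(xml_find_all(p, ".//w:r[w:fldChar]", ns))

# Step 2: also remove the runs that only hold a fldChar marker.
xml_remove(xml_find_all(p, ".//w:r[w:fldChar]", ns))
runs <- xml_find_all(p, ".//w:r", ns)

# Only the runs with visible text remain.
run_text <- xml_text(runs)
print(run_text)
```

In this sketch the field result (`[1,2]`) is kept because it lives in an ordinary text run; only the instruction and the marker runs are dropped.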