PDFix Template


General Settings

General template settings

key type value
rtl bool False
substructure_form_xobject bool True
page_tag string NonStruct
debug_pagemap_stop string

Example:

{
  "template": {
    "settings": {
      "rtl": false,
      "substructure_form_xobject": true,
      "page_tag": "NonStruct",
      "debug_pagemap_stop": ""
    }
  }
}
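
The settings above can be sanity-checked before a template is used. The following is a minimal sketch, assuming the template is read as plain JSON with Python's standard `json` module; `EXPECTED_TYPES` and `check_settings` are hypothetical helper names that mirror the table above and are not part of PDFix.

```python
import json

# Expected Python types per key, taken from the settings table above.
EXPECTED_TYPES = {
    "rtl": bool,
    "substructure_form_xobject": bool,
    "page_tag": str,
    "debug_pagemap_stop": str,
}

TEMPLATE_JSON = """
{
  "template": {
    "settings": {
      "rtl": false,
      "substructure_form_xobject": true,
      "page_tag": "NonStruct",
      "debug_pagemap_stop": ""
    }
  }
}
"""


def check_settings(template_text):
    """Return a list of problems found in template.settings (empty if none)."""
    settings = json.loads(template_text)["template"]["settings"]
    problems = []
    for key, value in settings.items():
        expected = EXPECTED_TYPES.get(key)
        if expected is None:
            problems.append(f"unknown key: {key}")
        elif not isinstance(value, expected):
            problems.append(f"{key}: expected {expected.__name__}")
    return problems


print(check_settings(TEMPLATE_JSON))  # prints []
```

A string such as `"no"` for `rtl` would be reported as `rtl: expected bool`.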

Threshold Values


key value description
preflight_artifact_font_size_min 32 Minimum font size for artifact
preflight_artifact_w1 1 Horizontal alignment weight.
preflight_artifact_w2 1 Vertical alignment weight.
preflight_artifact_w3 1 Element width weight.
preflight_artifact_w4 1 Element height (for images) or font size (for text) weight.
preflight_artifact_w5 1 Page numbers weight.
preflight_artifact_distance 0.7 Maximum distance, normalized to the interval [0,1], at which elements can be an artifact, header, or footer.
preflight_artifact_cluster_points 2 Minimal number of points within the preflight_artifact_distance radius.
concurrent_threads 0 The number of concurrent threads. If zero, the number of concurrent threads supported by the implementation is used. If it's set to 1, no parallel algorithms are used.
text_only 0 Process only texts in pagemap.
rotation_detect 1 Detect and correct page rotation for reading.
background_color_red 255 Page background color - red.
background_color_green 255 Page background color - green.
background_color_blue 255 Page background color - blue.
background_color_diff 2 Page background color max color component difference.
bbox_expansion 2 Bounding box expansion - half of kTrTextHeight.
angle_deviation 0.015707963267949 Maximum angle deviation for horizontal and vertical elements.
header_ratio 0.15 Maximum percentage of a header height. Possible values from interval [0,1].
footer_ratio 0.15 Maximum percentage of a footer height. Possible values from interval [0,1].
artifact_w1 1 Artifact page border distance weight.
artifact_w2 1 Artifact image area weight.
artifact_border_distance_max 2 Maximum distance of artifact to page border.
artifact_similarity 0.7 Minimum similarity value when object or element is an artifact normalized to interval [0,1].
path_object_max 2000 Maximum number of subsequent path objects that are still paths.
path_object_min 100 Minimum number of subsequent path objects that are still paths.
initial_element_expansion 1 Initial element bounding box expansion when searching children. Size in points. If it is zero, half of the default page font size is used.
initial_element_overlap 0.5 Minimum percentage of covered area of element by the initial element.
annot_char_overlap 0.05 Minimum percentage of covered area of character by the annotation.
isolated_text_ratio 10 Maximum isolated text width ratio. Is multiplied with the font size.
isolated_text 80 Maximum isolated text width.
isolated_element_ratio 6 Maximum isolated element width/height ratio. Is multiplied with the font size.
element_isolated_w1 1 Element paragraph weight.
element_isolated_w2 1 Element width weight.
element_isolated_caption 1 If set to 1 and the element contains a caption (table, image, chart, note), do not mark it as an isolated element.
element_isolated_width_min 0 Minimal value of bbox width for isolated element. If zero, element_isolated_width_min_ratio is used. Size in points.
element_isolated_width_min_ratio 4 Minimal value of bbox width for isolated element, multiplied by the average page font size.
element_isolated_width_max 0 Maximal value of bbox width for isolated element. If zero, element_isolated_width_max_ratio is used. Size in points.
element_isolated_width_max_ratio 10 Maximal value of bbox width for isolated element, multiplied by the average page font size.
element_isolated_similarity 0.7 Minimum similarity value when element is isolated normalized to interval [0,1].
element_isolated_image_w1 1 Image vs page area weight.
element_isolated_image_w2 1 Elements isolated similarity weight.
element_isolated_image_w3 1 Images area vs join image area weight.
element_isolated_image_similarity 0.7 Minimum similarity value when isolated elements can be added to an image.
element_line_w1 1 Line width weight.
element_line_width_max 8 Maximal value of line width. If zero, element_line_width_max_ratio is used. Size in points.
element_line_width_max_ratio 1 Maximal value of line width, multiplied by the average page font size.
element_line_similarity 0.6 Minimum similarity value when element is recognized as line normalized to interval [0,1].
element_alignment_ratio 0.5 Ratio between baseline and bounding box alignments. Bounding box alignment precision is multiplied by element_alignment_ratio.
rect_image_similarity 0.7 Minimum similarity value when the rectangle should be an image normalized to interval [0,1].
rect_line_similarity 0.5 Minimum similarity value when the rectangle should be a line normalized to interval [0,1].
image_background_text 1 Text bounding box expansion.
image_overlap_distance 1 Maximum distance value when graphic page objects can be joined. Distance in points.
image_join_distance 8 Defines the maximum allowed distance (in points) between small images for them to be considered joinable. This parameter helps fine-tune the grouping of small image elements into a cohesive larger visual block based on their spatial proximity.
char_clip_ratio 0.5 Minimal ratio of the clipping area of the character compared to its original size.
word_space_width_ratio 0.6 The word_space_width_ratio is a multiplier that determines the threshold for identifying inter-word spaces by comparing the gap between characters to the typical width of a space character. It scales the space width so that small variations in spacing can be interpreted as either a valid word separator or a mere character gap.
word_space_width_min_ratio 0.1 The word_space_width_min_ratio is an additional multiplier that sets a minimum threshold for the allowed space between words. It ensures that, even when minimal character spacing is detected, the computed gap used to determine word boundaries does not fall below a baseline value relative to the font size.
word_space_distance_max 0 Maximum word space distance in points.
word_space_distance_max_ratio 0 Maximum word space distance. The value is multiplied by word font size.
word_space_ratio 1 Ratio that defines whether the text line is simple or justified.
word_space_update_min 0.2 Minimum ratio of detected word spacing.
word_space_update_max 4 Maximum ratio of detected word spacing. If set to 0, the word spacing update from lines is not applied.
word_space_update_distance 0.04 Distance for clustering word spaces in text line update.
word_splitter_ratio 2 Minimum space before splitter. The value is multiplied by most used font size.
word_splitter_distance 4 Maximum threshold value for word splitters detections. Real distance in points.
word_overlap 0.9 Minimum overlap percentage (0-1) required between bounding boxes to consider words as duplicates. A word must cover at least this percentage of another word's area to be considered overlapping.
text_line_baseline_ratio 0.1 Maximum baseline shift. Value multiplies minimal font. Baseline shift moves individual characters up or down in relation to other text on the same line.
text_line_baseline_elem_ratio 0.7 Maximum baseline line and element shift. Value multiplies minimal font.
text_line_underline_distance 2.6 Distance of the underline line and text baseline. Size in points.
text_line_underline_char_distance_ratio 0.1 Distance of the underline line start/end point and character bounding box. The value is multiplied by line font size. Size in points.
text_line_subscript_font_ratio 1 This ratio is used to calculate the maximum allowed baseline difference for joining a subscript with its main word. Specifically, multiply the word's font size by this ratio to get a threshold.
text_line_join_font_size_distance 0 Maximum difference between two font sizes, in points, at which two lines with different fonts can be joined.
text_line_distance_max 0 Maximum distance between lines. If zero, text_line_distance_max_ratio is used. Size in points.
text_line_distance_max_ratio 2 Maximum distance between lines. The value is multiplied by line font size.
text_line_join_distance 2 Maximum threshold value in line spacing detection for specific font size. The higher value allows creating paragraph with variable line spacings. The value is multiplied by font size.
text_line_chunk_distance_max 0 Maximum distance between chunks. If zero, text_line_chunk_distance_max_ratio is used. Size in points.
text_line_chunk_distance_max_ratio 6 Maximum distance between chunks. The value is multiplied by simple word spacing between words.
text_line_chunk_distance 0 A fixed threshold parameter used by the clustering algorithm to group word spaces in a line. When set to a nonzero value, it directly defines the threshold that determines whether adjacent word spaces are similar enough to be considered part of the same cluster. If zero, text_line_chunk_distance_ratio is used. Size in points.
text_line_chunk_distance_ratio 0.4 A relative multiplier that comes into play when the fixed threshold (text_line_chunk_distance) is zero. It calculates the threshold by multiplying the line's font size by the ratio, thereby adapting the clustering sensitivity to the text size.
text_chunk_distance 0 Maximum distance value when text chunks are vertically aligned. If zero, text_chunk_distance_ratio is used. Size in points.
text_chunk_distance_ratio 0.42 Maximum distance value when text chunks are vertically aligned. The value is multiplied by page font width.
text_chunk_simple_distance 0.4 Maximum distance value when text chunks create simple line. Normalized to interval [0,1].
text_chunk_word_distance 0.1 Maximum distance value when single line text has to be split to words. Normalized to interval [0,1].
text_height 8 Minimal text height on the page.
text_simple_similarity 0.96 Minimum similarity value when text lines create a simple paragraph normalized to interval [0,1].
text_justify_similarity 0.96 Minimum similarity value when text lines create a justified paragraph normalized to interval [0,1].
text_table_similarity 0.65 Minimum similarity value when text lines create a table normalized to interval [0,1].
text_paragraph_similarity 0.7 Minimum similarity value when text is paragraph normalized to interval [0,1].
text_split_distance 0.2 Dissimilarity boundary value when text lines create a paragraph.
text_column_similarity 0.7 Minimum similarity value that text creates a column normalized to interval [0,1].
label_image_detect 1 Graphic labels detection. Possible values: 0
label_word_detect 1 Texts labels detection. Possible values: 0
label_alignment_h 2 Maximum deviation of horizontal label alignment.
label_distance_ratio 10 Distance of the label and text. Is multiplied with the page most used font size.
label_baseline_ration 0.14 Multiplies minimal font. Maximum deviation of horizontal label aligned to text.
label_image_w1 1 Controls how much vertical alignment matters when clustering labels. A higher value enforces stricter alignment, while a lower value allows more variation.
label_image_w2 1 Controls how much the distance between a label and its associated text influences clustering. A higher value enforces stricter proximity, ensuring labels are closely linked to their text.
label_image_w3 1 This weight controls how much the label's width consistency matters in clustering. A higher value enforces that labels should have the same width, while a lower value allows more variation in width between labels.
label_image_w4 1 This weight determines how important the height consistency of labels is when clustering. A higher value enforces that labels should have the same height, while a lower value allows more flexibility in height differences.
label_image_w5 0.5 This weight adjusts how important the height relationship is between the image label and its associated text. A higher value means the height alignment between the label and the text is more significant in clustering decisions.
label_image_width_min 0 Specifies a fixed minimum width in points. If set to zero, the label_image_width_min_ratio is used instead.
label_image_width_min_ratio 0 Defines the minimum width as a multiple of the average font size. Useful when label size varies with font size.
label_image_width_max 0 Specifies a fixed maximum width in points. If set to zero, the label_image_width_max_ratio is used instead.
label_image_width_max_ratio 6 Defines the maximum width as a multiple of the average page font size. This ratio is applied when label_image_width_max is zero.
label_image_distance 4 Clustering threshold in points that decides when labels should be grouped together. A higher value makes clustering more flexible, allowing distant labels to merge, while a lower value keeps clusters tight and separate.
label_word_w1 1 Controls how much vertical alignment matters when clustering labels. A higher value enforces stricter alignment, while a lower value allows more variation.
label_word_w2 1 Controls how much the distance between a label and its associated text influences clustering. A higher value enforces stricter proximity, ensuring labels are closely linked to their text.
label_word_dist_sibling_ratio 4 This threshold, defined as a ratio multiplied by the sibling's font size, sets the maximum gap allowed between a label and its sibling element to be joined together. If the distance exceeds this value, the label and its sibling remain separate.
label_word_distance 0 Clustering threshold in points that decides when labels should be grouped together. A higher value makes clustering more flexible, allowing distant labels to merge, while a lower value keeps clusters tight and separate.
label_word_distance_ratio 1 Clustering threshold value that decides when labels should be grouped together. The value is multiplied by the average page font width.
toc_detect 1 TOC detection. Possible values: 0
toc_word_distance 0 Controls how much vertical alignment matters when clustering TOC words. A higher value enforces stricter alignment, ensuring TOC elements are well-structured.
toc_word_distance_ratio 1 Threshold ratio that determines when TOC entries should be clustered together. The value is multiplied by the average page font width.
graphic_table_detect 1 Graphic tables detection. Possible values: 0
graphic_table_detect_row 1 Row graphic tables detection.
graphic_table_detect_col 1 Column graphic tables detection.
graphic_table_alignment_distance 0.8 Maximum alignment distance value when elements can create a table. Distance in points.
graphic_table_split_w1 1 Table texts paragraph weight.
graphic_table_split_w2 1 Table texts horizontal alignment weight.
graphic_table_split_w3 1 Columns width weight.
graphic_table_split_w4 0.5 Number of columns weight.
graphic_table_split_w5 0.5 Number of rows weight.
graphic_table_split_w6 1 Page area weight.
graphic_table_split_col_max 5 Maximal number of columns when table can be split.
graphic_table_split_row_max 5 Maximal number of rows when table can be split.
graphic_table_split_similarity 0.7 Minimum similarity value when graphic table has to be split.
graphic_table_split_layout_similarity 0.7 Minimum similarity value when graphic table has to be split.
graphic_table_chart_similarity 0.3 Minimum similarity value when graphic table is a chart.
graphic_table_image_w1 -1 Images area weight. If -1, number of images is used.
graphic_table_image_w2 -1 Images weight. If -1, number of images is used.
graphic_table_image_w3 -1 Chart similarity weight. If -1, number of paths is used.
graphic_table_image_w4 1 Texts vertical alignment weight.
graphic_table_image_w5 1 Table size weight.
graphic_table_image_similarity 0.7 Minimum similarity value when graphic table has an image.
text_table_detect 1 Texts (not graphic) tables detection. Possible values: 0
text_table_detect_row 1 Row texts (not graphic) tables detection.
text_table_detect_col 1 Column texts (not graphic) tables detection.
text_table_row_alignment_type 1 Table row alignment type [0 - strong, 1 - average, 2 - weak].
text_table_col_alignment_type 1 Table column alignment type [0 - strong, 1 - average, 2 - weak].
text_table_col_similarity_type 0 Table column similarity type [0 - column alignment distance, 1 - element distance, 2 - element size, 3 - max].
text_table_col_distance 0.8 Maximum deviation value for detecting the nearest distances for table columns. Real distance in points.
text_table_col_similarity 0.36 Minimum similarity value when elements create table column.
text_table_alignment_type 2 Table column alignment type [0 - strong, 1 - average, 2 - weak]. Select strong for strictly aligned table elements.
text_table_alignment_distance 0.4 Maximum threshold value for detecting text tables.
text_table_text_col_w1 1 Text column paragraph weight.
text_table_text_col_w2 1 Text column width weight.
text_table_text_col_width_min 0 Minimal value of bbox width for text in table column. If zero, text_table_text_col_width_min_ratio is used. Size in points.
text_table_text_col_width_min_ratio 1 Minimal value of bbox width for text in table column, multiplied by the average page font size.
text_table_text_col_width_max 0 Maximal value of bbox width for text in table column. If zero, text_table_text_col_width_max_ratio is used. Size in points.
text_table_text_col_width_max_ratio 8 Maximal value of bbox width for text in table column, multiplied by the average page font size.
text_table_image_col_w1 1 Image column weight.
text_table_image_col_gs 1 If set to 1, image column has to have the same graphics state.
text_table_image_col_width_min 0 Minimal value of bbox width for image in table column. If zero, text_table_image_col_width_min_ratio is used. Size in points.
text_table_image_col_width_min_ratio 1 Minimal value of bbox width for image in table column, multiplied by the average page font size.
text_table_image_col_width_max 0 Maximal value of bbox width for image in table column. If zero, text_table_image_col_width_max_ratio is used. Size in points.
text_table_image_col_width_max_ratio 4 Maximal value of bbox width for image in table column, multiplied by the average page font size.
text_table_image_col_height_min 0 Minimal value of bbox height for image in table column. If zero, text_table_image_col_height_min_ratio is used.
text_table_image_col_height_min_ratio 1 Minimal value of bbox height for image in table column, multiplied by the average page font size.
text_table_image_col_height_max 0 Maximal value of bbox height for image in table column. If zero, text_table_image_col_height_max_ratio is used.
text_table_image_col_height_max_ratio 2 Maximal value of bbox height for image in table column, multiplied by the average page font size.
text_table_column_similarity 0.5 Minimum similarity value when elements create table column.
text_table_image_similarity_w1 1 Sect table image similarity area weight.
text_table_image_similarity_w2 1 Sect table image similarity chart weight.
text_table_image_similarity 0.7 Minimum similarity value when text table is image normalized to interval [0,1].
text_table_paragraph_similarity 0.7 Minimum similarity value when text table is paragraph normalized to interval [0,1].
table_update_delete_empty 1 Delete empty rows and cols.
table_update_split_by_cell 0 Split elements that should have been split originally. This usually happens when a paragraph is recognized instead of single lines, or when images (bullets) are joined together.
table_update_split_by_row 0 Split table texts to lines.
table_update_split_label 0 Split labels in tables.
table_update_span_empty 1 Span empty cells.
table_update_span_row 0 Join rows based on the maximum row span.
table_update_span_row_first 0 If set to true, rows are merged together first using span.
table_update_join 0 Join texts in a single cell.
table_update_cell_header 1 Detect headers.
table_span_col_ratio 0.1 Intersection percentage of colspan element. Possible values from interval [0,1].
table_span_row_ratio 0.2 Intersection percentage of rowspan element. Possible values from interval [0,1].
table_alignment_h 1 Maximum deviation (in points) of horizontal table aligned elements.
table_alignment_v 4 Maximum deviation (in points) of vertical table aligned elements.
table_line_intersection 1 Expansion (in points) for lines intersection. It's used in table detection.
form_table_detect 1 Recognize form fields as tables.
caption_distance 80 Distance of the caption and the image/table.
caption_alignment_h 4 Maximum deviation (in points) in caption and nearest element alignment.
caption_alignment_v 4 Maximum deviation (in points) in caption and nearest element alignment.
mc_detect 1 Update the element language, alternate description, and actual text based on kb. The default value is 1, but it can be set to 0 as an optimization when the alternate description is not required.
rd_sort 0 Sort elements: 0 - built-in, 1 - original content positions, 2 - by x and y coordinates, 3 - by rd_index.
rd_sort_direction 0 Sort elements: 0 - built-in, 1 - prefer columns, 2 - prefer rows.
rd_column_distance 0.8 Maximum threshold value for columns detection. Real distance in points.
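
Many thresholds in the table come in pairs: a fixed value in points and a `_ratio` companion that is used when the fixed value is zero. The following is a minimal sketch of that fallback rule, not the PDFix implementation; `effective_threshold` and the 10 pt font size are hypothetical and used only for illustration.

```python
def effective_threshold(fixed_points, ratio, font_size):
    """Use the fixed value in points unless it is zero; then scale by font size."""
    if fixed_points != 0:
        return fixed_points
    return ratio * font_size


# Defaults from the table: element_isolated_width_min = 0 and
# element_isolated_width_min_ratio = 4. A 10 pt average font is assumed.
width_min = effective_threshold(0, 4, 10.0)

# element_line_width_max = 8 is nonzero, so element_line_width_max_ratio is ignored.
line_width_max = effective_threshold(8, 1, 10.0)

print(width_min, line_width_max)  # prints 40.0 8
```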

Example:

{
  "template": {
    "pagemap": [
      {
        "text_table_paragraph_similarity": 0.7,
        "graphic_table_image_w1": -1,
        "label_baseline_ration": 0.14
      }
    ]
  }
}
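
Two of the defaults are easier to read after a quick conversion. The sketch below assumes that `angle_deviation` is expressed in radians and that `header_ratio` is a fraction of the page height; neither unit is stated explicitly in the table, and the 792-point page height is a hypothetical example.

```python
import math

angle_deviation = 0.015707963267949  # default from the table
header_ratio = 0.15                  # default from the table

# Under the radians assumption, the default equals pi / 200, i.e. 0.9 degrees.
print(math.isclose(angle_deviation, math.pi / 200))  # prints True
print(round(math.degrees(angle_deviation), 3))       # prints 0.9

# Hypothetical page height: 792 points (US Letter).
page_height = 792.0
header_height_max = header_ratio * page_height
print(round(header_height_max, 1))                   # prints 118.8
```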

Regular Expressions


key value
regex_hyphen \\w+-$
regex_bullet ^[\\u2010\\u2011\\u2212\\u005E\\u005B\\uF0A7\\uF097\\uF0BB\\u25CF\\u2022\\u25D8\\u25CB\\u25D9\\u2023\\u2043\\uF0B7\\u2212\\u204C\\u204D\\u25E6\\u29BE\\u29BF\\u21E8\\u25BA\\u25C4\\u2219\\u25A0\\uF06C\\u25A1\\u005D\\u25C6]$
regex_bullet_font (Wingdings)|(Symbol)
regex_label ^[\\[\\(]?((M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3}))|(\\d+)|([a-zA-Z]))[\\)\\]\\. ]$
label_chars .()[]
regex_decimal_numbering ^[\\[\\(]?(?:\\d{1,4}\\.){0,5}\\d{0,4}\\s?[\\)\\]\\.]?$
regex_roman_numbering ^[\\[\\(]?M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})[\\)\\]\\.]?$
regex_letter_numbering ^[\\[\\(]?[A-Za-z][\\)\\]\\.]$
regex_filling [._]{2,}
regex_filling_chars ._
regex_page_number (^\\d+$)|(^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$)
regex_first_cap ^[A-Z]
regex_terminal [\\.\\!\\?]$
regex_table_caption ((^table)|(^tab\\.))
regex_image_caption ((^image)|(^img\\.)|(^figure)|(^fig\\.))
regex_chart_caption ((^chart)|(^map))
regex_note_caption ((^source\\:)|(^note\\:))
regex_toc_caption ((^content)|(^toc))
regex_colon :$
regex_comma [,;]$
regex_letter ^[A-Za-z]$
number_chars -+.,%\\u20AC$\\u00A5\\u00A3
numbering_splitter_chars .()[]

Example:

{
  "template": {
    "pagemap_regex": [
      {
        "regex_page_number": "(^\\d+$)|(^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$)",
        "regex_letter": "^[A-Za-z]$"
      }
    ]
  }
}
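
Because the patterns are stored as JSON strings, every backslash is doubled in the template and becomes a single backslash once the JSON is decoded. The sketch below decodes the example above and tries the patterns with Python's `re` module. Using Python's regex engine is an assumption made for illustration; the engine PDFix uses may differ in details.

```python
import json
import re

# Raw string, so the doubled backslashes reach the JSON parser unchanged.
TEMPLATE_JSON = r"""
{
  "template": {
    "pagemap_regex": [
      {
        "regex_page_number": "(^\\d+$)|(^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$)",
        "regex_letter": "^[A-Za-z]$"
      }
    ]
  }
}
"""

patterns = json.loads(TEMPLATE_JSON)["template"]["pagemap_regex"][0]
page_number = re.compile(patterns["regex_page_number"])
letter = re.compile(patterns["regex_letter"])

for candidate in ["12", "XIV", "12a", ""]:
    print(repr(candidate), page_number.search(candidate) is not None)
# prints:
# '12' True
# 'XIV' True
# '12a' False
# '' True
```

Note the last line: in Python's engine the Roman-numeral branch is made entirely of optional groups, so it also matches an empty string.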

Functions

element_create

Create user-defined elements.

keys and values:

object_update

The test is triggered when the page content object is tested.

keys and values:

text_run_update

Updates a text run element after processing text objects.

keys and values:

text_run_neighbours

This test is triggered when forming text lines from text runs.

keys and values:

line_update

Updates a line element after detecting horizontal and vertical lines.

keys and values:

rect_update

Updates a rectangle element after detecting rectangles.

keys and values:

element_graphic_neighbours

Tests if two neighbouring path elements can form a single graphic table.

keys and values:

element_graphic_update

Updates line, rectangle, and graphic table elements after they are detected.

keys and values:

word_update

Updates a word element after detecting words.

keys and values:

word_neighbours

This test is triggered when forming text lines from words.

keys and values:

text_line_update

Updates a text line element after detecting text lines.

keys and values:

text_line_neighbours

Tests if two neighbouring text lines can form a paragraph.

keys and values:

text_update

Updates the text element after detecting paragraphs.

keys and values:

image_update

Updates an image after detecting basic images from page objects.

keys and values:

element_update

Updates an element after detecting basic elements.

keys and values:

table_update

Updates a table after the whole process of table detection is done.

keys and values:

cell_update

Updates a table cell after the whole process of table detection is done.

keys and values:

table_split

Updates the table after the entire table detection process is completed.

keys and values:

alt_update

Sets an alternate description for the element. The alternate description is established in a specific order. To skip a step, set the default value to false for that step.

keys and values:

actual_text_update

Sets the actual text for the element. The actual text is established in a specific order. To skip a step, set the default value to false for that step.

keys and values:

artifact_update

Marks an element as an artifact.

keys and values:

label_update

Update elements marked as labels to include them as part of the list.

keys and values:

list_update

Tests if a list is correct.

keys and values:

tag_image

Handles the process of tagging images. For repurposing and accessibility purposes, a Figure element should have either an Alt entry or an ActualText entry in its structure element dictionary. If both are absent, the default behavior is to tag the Figure with an empty alt attribute.

keys and values:

tag_table

Handles the process of tagging tables. For repurposing and accessibility purposes, a table should have headers. If no headers are detected, the default behavior is to leave the table without any elements.

keys and values:

tag_update

Updates the tag after it has been created.

keys and values:

annot_update

Updates the annotation tag after it has been created.

keys and values:

Schema

statement

The if statement type of the query. Depending on the statement, query evaluation either stops or continues once the condition passes.

  • values:
    • ['$if', '$elif', '$else']
  • default value: $if

keys and values:

  • "$if"
  • "$elif"
  • "$else"

$if

Can be used in all functions. Applies a rule when a condition is true.

  • type: statement

$elif

Can be used in all functions. Applies a rule when a condition is true.

  • type: statement

$else

Can be used in all functions. Applies a rule when a condition is not true.

  • type: statement
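
The three statement types can be pictured as an ordinary if/elif/else chain. The sketch below is an assumption about how such a chain behaves, modelled on conventional programming-language semantics; it is not taken from the PDFix source. The `(statement, passed, name)` tuples and `run_chain` are hypothetical.

```python
def run_chain(rules):
    """Return the names of the rules that fire.

    Each rule is (statement, passed, name). `passed` stands in for the
    already-evaluated query condition and is ignored for "$else".
    """
    fired = []
    chain_done = False  # True once a rule in the current chain has fired
    for statement, passed, name in rules:
        if statement == "$if":
            chain_done = False  # "$if" starts a new chain
            applies = bool(passed)
        elif statement == "$elif":
            applies = bool(passed) and not chain_done
        elif statement == "$else":
            applies = not chain_done
        else:
            raise ValueError(f"unknown statement: {statement}")
        if applies:
            fired.append(name)
            chain_done = True
    return fired


rules = [
    ("$if", False, "small"),
    ("$elif", True, "medium"),
    ("$elif", True, "large"),     # skipped: the chain has already fired
    ("$else", None, "fallback"),  # skipped for the same reason
    ("$if", True, "independent"),
]
print(run_chain(rules))  # prints ['medium', 'independent']
```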

query

The query defines thresholds and operations for a pagemap detection.

  • type: query

keys and values:

  • "param" params:

    • "pds_object_params"

      • A parameter that represents PdsObject. The value starts with the character $, followed by a number (e.g., $0_width). The number represents the index of the parameter in the param array.
    • "pde_element_params"

      • A parameter that represents PdeElement. The value starts with the character $, followed by a number (e.g., $0_width). The number represents the index of the parameter in the param array.
    • "pds_struct_elem_params"

      • A parameter that represents PdsStructElem. The value starts with the character $, followed by a number (e.g., $0_width). The number represents the index of the parameter in the param array.
    • "pdf_annot_params"

      • A parameter that represents PdfAnnots. The value starts with the character $, followed by a number (e.g., $0_width). The number represents the index of the parameter in the param array.
    • "pdf_rect"

    • "pdf_rgb"

    • "int"

      • Parameter that represents integer.
    • "bool"

      • Parameter that represents boolean value.
    • "float"

      • Parameter that represents floating value.
    • "string"

      • Parameter that represents string value.
  • "var" params:

    • "0_value"
  • "logical_operators"
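
The `$0_width` form described above combines an index into the param array with a property name. The following is a minimal sketch of splitting such a reference; the underscore separator is inferred from the single `$0_width` example, and `parse_param_ref` is a hypothetical helper.

```python
import re

# "$", a decimal index into the param array, "_", then a property name.
PARAM_REF = re.compile(r"^\$(\d+)_(\w+)$")


def parse_param_ref(value):
    """Split a reference such as "$0_width" into (index, property)."""
    match = PARAM_REF.match(value)
    if match is None:
        raise ValueError(f"not a parameter reference: {value!r}")
    return int(match.group(1)), match.group(2)


print(parse_param_ref("$0_width"))  # prints (0, 'width')
```

A name without a leading index, such as `$page_num`, is rejected by this pattern because it is not a parameter reference.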

param

Define the number and type of input parameters.

  • type: query_param

keys and values:

int

Parameter that represents integer.

  • type: int

bool

Parameter that represents boolean value.

  • type: bool

float

Parameter that represents floating value.

  • type: float

string

Parameter that represents string value.

  • type: string

var

User-defined variables. Use macros to define variables.

  • type: var

keys and values:

  • "0_value"

logical_operators

Available logical operators.

  • type: string
  • values:
    • ['$and', '$or', '$not']

keys and values:

  • "$and" params:

  • "$or" params:

  • "$not" params:

$and

Logical AND. All sub-conditions must be true.

  • type: logical_operator

keys and values:

$or

Logical OR. At least one sub-condition must be true.

  • type: logical_operator

keys and values:

$not

Logical NOT.

  • type: logical_operator

keys and values:
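
The three logical operators compose in the usual way. The sketch below evaluates a nested condition to show the semantics; the dictionary layout (one operator key per object, lists for `$and` and `$or`, a single operand for `$not`) is a hypothetical representation chosen for the example and may not match the template's actual structure.

```python
def evaluate(condition):
    """Evaluate a nested condition made of "$and", "$or", "$not" and booleans."""
    if isinstance(condition, bool):
        return condition  # leaf: an already-evaluated comparison
    (name, operand), = condition.items()  # exactly one operator per object
    if name == "$and":
        return all(evaluate(sub) for sub in operand)
    if name == "$or":
        return any(evaluate(sub) for sub in operand)
    if name == "$not":
        return not evaluate(operand)
    raise ValueError(f"unknown logical operator: {name}")


condition = {"$and": [True, {"$or": [False, True]}, {"$not": False}]}
print(evaluate(condition))  # prints True
```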

comparison_operators

Available comparison operators.

  • type: string
  • values:
    • ['$eq', '$ne', '$lt', '$lte', '$gt', '$gte', '$regex', '$in', '$nin']

keys and values:

  • "$eq"
  • "$ne"
  • "$lt"
  • "$lte"
  • "$gt"
  • "$gte"
  • "$regex"
  • "$in"
  • "$nin"

$eq

Equal to value.

$ne

Not equal to value.

$lt

Less than value.

  • type: comparison_operator
  • types: ["int", "float"]

$lte

Less than or equal to value.

  • type: comparison_operator
  • types: ["int", "float"]

$gt

Greater than value.

  • type: comparison_operator
  • types: ["int", "float"]

$gte

Greater than or equal to value.

  • type: comparison_operator
  • types: ["int", "float"]

$regex

Regular expression predicate.

  • type: comparison_operator
  • types: ["string"]

$in

Contain value operator.

  • type: comparison_operator
  • types: ["bbox"]

$nin

Not contain value operator.

  • type: comparison_operator
  • types: ["bbox"]
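
The comparison operators and their listed operand types can be summarized in a small evaluator. This is a sketch of the semantics only, not the PDFix implementation; `compare` is a hypothetical function, `$regex` is modelled with Python's `re.search`, and `$in` / `$nin` are left out because the text does not specify how containment is computed for a bbox.

```python
import operator
import re

# Ordering operators are listed with types ["int", "float"].
NUMERIC_OPS = {
    "$lt": operator.lt,
    "$lte": operator.le,
    "$gt": operator.gt,
    "$gte": operator.ge,
}


def compare(op, actual, expected):
    """Apply one comparison operator to an actual and an expected value."""
    if op == "$eq":
        return actual == expected
    if op == "$ne":
        return actual != expected
    if op in NUMERIC_OPS:
        if isinstance(actual, bool) or not isinstance(actual, (int, float)):
            raise TypeError(f"{op} expects an int or float")
        return NUMERIC_OPS[op](actual, expected)
    if op == "$regex":
        if not isinstance(actual, str):
            raise TypeError("$regex expects a string")
        return re.search(expected, actual) is not None
    raise ValueError(f"unsupported operator: {op}")


print(compare("$gte", 12.0, 10))               # prints True
print(compare("$lt", 8, 8))                    # prints False
print(compare("$regex", "Table 1", "^Table"))  # prints True
```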

pds_object_params

List of all pds_object types that can be used as a parameter in QUERY->PARAM.

keys and values:

pds_text

Text page object

keys and values:

pds_struct_elem_params

List of all pds_tag types that can be used as a parameter in QUERY->PARAM.

keys and values:

pdf_annot_params

List of all pdf_annot types that can be used as a parameter in QUERY->PARAM.

keys and values:

pde_element_params

List of all pde_element types that can be used as a parameter in QUERY->PARAM.

keys and values:

general_vars

General variables can be used without parameters. They represent the general state during processing, contain information about the current page and the document, and can be used in any query.

  • type: string

keys and values:

  • "$page_num"
  • "$page_width"
  • "$page_height"
  • "$page_font_size"
  • "$page_min_font_size"
  • "$page_max_font_size"
  • "$page_rotation"
  • "$page_rtl"
  • "$page_anchor"
  • "$doc_num_pages"
  • "$doc_lang"
  • "$doc_title"
  • "$doc_anchor"

$page_num

Page number.

  • type: int

$page_width

Page cropbox width.

  • type: float

$page_height

Page cropbox height.

  • type: float

$page_font_size

Average font size on the page.

  • type: float

$page_min_font_size

Minimal font size on the page.

  • type: float

$page_max_font_size

Maximal font size on the page.

  • type: float

$page_rotation

Page rotation.

  • type: int
  • values:
    • [0, 90, 180, 270]

$page_rtl

Page contains RTL content.

  • type: bool

$page_anchor

Page already detected anchors.

  • type: string

$doc_num_pages

Document number of pages.

  • type: int

$doc_lang

Document language.

  • type: string

$doc_title

Document title.

  • type: string

$doc_anchor

Document already detected anchors.

  • type: string
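
The general variables and their documented types can be collected in one place, which is handy when writing or reviewing queries. The sketch below is illustrative only: `GENERAL_VAR_TYPES` mirrors the type lines above, while `check_general_vars` and the sample values are hypothetical.

```python
GENERAL_VAR_TYPES = {
    "$page_num": int,
    "$page_width": float,
    "$page_height": float,
    "$page_font_size": float,
    "$page_min_font_size": float,
    "$page_max_font_size": float,
    "$page_rotation": int,
    "$page_rtl": bool,
    "$page_anchor": str,
    "$doc_num_pages": int,
    "$doc_lang": str,
    "$doc_title": str,
    "$doc_anchor": str,
}
ALLOWED_ROTATIONS = (0, 90, 180, 270)


def check_general_vars(context):
    """Return a list of problems in a {variable: value} mapping (empty if none)."""
    problems = []
    for name, value in context.items():
        expected = GENERAL_VAR_TYPES.get(name)
        if expected is None:
            problems.append(f"{name}: unknown variable")
        elif type(value) is not expected:
            problems.append(f"{name}: expected {expected.__name__}")
    rotation = context.get("$page_rotation")
    if rotation is not None and rotation not in ALLOWED_ROTATIONS:
        problems.append("$page_rotation: must be 0, 90, 180 or 270")
    return problems


# Hypothetical sample state for the first page of a 12-page document.
context = {
    "$page_num": 1,
    "$page_width": 612.0,
    "$page_height": 792.0,
    "$page_rotation": 90,
    "$page_rtl": False,
    "$doc_num_pages": 12,
    "$doc_lang": "en-US",
}
print(check_general_vars(context))  # prints []
```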

values

General values used in JSON default template.

keys and values:

type

Type.

  • type: string
  • values:
    • ['pds_object', 'pds_text', 'pds_path', 'pds_image', 'pds_shading', 'pds_form', 'pde_element', 'pde_text', 'pde_text_line', 'pde_word', 'pde_text_run', 'pde_image', 'pde_container', 'pde_list', 'pde_line', 'pde_rect', 'pde_table', 'pde_cell', 'pde_toc', 'pde_header', 'pde_footer', 'pde_form_field', 'pde_annot', 'pds_struct_elem', 'pdf_annot']

alt

Alternate description typically used for Figure tags.

  • type: string

actual_text

Actual text.

  • type: string

lang

The language identifier.

  • type: string

id

The unique identifier of the tag.

  • type: string

associated_header

The unique identifier of the associated header. To reference several associated headers, join their identifiers with | into one composed string, for example a|b|c|d.

  • type: string
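A composed associated_header string can be split back into individual header identifiers. The following is a minimal Python sketch, assuming the identifiers themselves never contain the | character; `split_associated_headers` is a hypothetical helper for illustration and not part of PDFix.

```python
def split_associated_headers(value: str) -> list[str]:
    """Split a composed associated_header string such as "a|b|c|d" into ids."""
    # Drop empty pieces so that "" or a trailing "|" does not yield blank ids.
    return [part for part in value.split("|") if part]


print(split_associated_headers("a|b|c|d"))  # ['a', 'b', 'c', 'd']
```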

expansion

The expanded form of an abbreviation.

  • type: string

has_content

A value identifying whether the object or tag has associated page content.

  • type: bool
  • values:
    • ['true', 'false']

tag_type

Tag type defined by a string or regular expression. Use .* to match all tags.

  • type: string
  • values:
    • ['Annot', 'Art', 'Artifact', 'Aside', 'BibEntry', 'BlockQuote', 'Caption', 'Code', 'Div', 'Document', 'DocumentFragment', 'Em', 'FENote', 'Figure', 'Form', 'Formula', 'H', 'H1', 'H2', 'H3', 'H4', 'H5', 'H6', 'Index', 'L', 'Lbl', 'LBody', 'LI', 'Link', 'NonStruct', 'Note', 'P', 'Part', 'Private', 'Quote', 'RB', 'Reference', 'RP', 'RT', 'Ruby', 'Sect', 'Span', 'Strong', 'Sub', 'Table', 'TBody', 'TD', 'TFoot', 'TH', 'THead', 'Title', 'TOC', 'TOCI', 'TR', 'Warichu', 'WP', 'WT']
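To illustrate how a tag_type pattern selects tags, here is a small Python sketch. It uses Python's `re` module and assumes the pattern must match the whole tag name; the regular expression dialect and matching mode that PDFix actually uses are not stated here, and `tag_matches` is a hypothetical helper.

```python
import re


def tag_matches(pattern: str, tag: str) -> bool:
    """Return True if the whole tag name matches the tag_type pattern."""
    # Assumption: the pattern has to cover the entire tag name, not a substring.
    return re.fullmatch(pattern, tag) is not None


tags = ["H1", "H2", "P", "Table", "TH"]
print([t for t in tags if tag_matches("H[1-6]", t)])  # ['H1', 'H2']
print([t for t in tags if tag_matches("P", t)])       # ['P']
print([t for t in tags if tag_matches(".*", t)])      # every tag in the list
```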

annot_type

Annotation type defined by a string or regular expression. Use .* to match all annotations.

  • type: string
  • values:
    • ['Text', 'Link', 'FreeText', 'Line', 'Square', 'Circle', 'Polygon', 'PolyLine', 'Highlight', 'Underline', 'Squiggly', 'StrikeOut', 'Stamp', 'Caret', 'Ink', 'Popup', 'FileAttachment', 'Sound', 'Movie', 'Widget', 'Screen', 'PrinterMark', 'TrapNet', 'Watermark', '3D', 'Redact', 'Projection', 'RichMedia']

contents

A string value specifying the annotation contents.

  • type: string

annot_flag

A comma-delimited string value specifying the annotation flags.

  • type: string
  • values:
    • ['invisible', 'hidden', 'print', 'no_zoom', 'no_rotate', 'no_view', 'read_only', 'locked', 'toggle', 'contents']
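A comma-delimited annot_flag value can be checked against the flag names listed above before it is written into a template. This is a sketch only: `parse_annot_flags` is a hypothetical helper, and tolerating spaces around the commas is an assumption rather than documented behavior.

```python
ANNOT_FLAGS = {
    "invisible", "hidden", "print", "no_zoom", "no_rotate",
    "no_view", "read_only", "locked", "toggle", "contents",
}


def parse_annot_flags(value: str) -> list[str]:
    """Split a comma-delimited annot_flag string and reject unknown names."""
    # Assumption: surrounding whitespace is ignored and empty items are skipped.
    flags = [item.strip() for item in value.split(",") if item.strip()]
    unknown = [flag for flag in flags if flag not in ANNOT_FLAGS]
    if unknown:
        raise ValueError(f"unknown annotation flags: {unknown}")
    return flags


print(parse_annot_flags("print,read_only"))  # ['print', 'read_only']
```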

title

Title.

  • type: string

name

Unique name to identify element later.

  • type: string

angle

Angle.

  • type: float

bbox

Parameter that represents the bounding box of an object, formatted as an array: [left, bottom, right, top]. Each coordinate can be defined by a float number, general variables, anchor variables, or mathematical functions with previously defined variables. Each bounding box can be associated with only one anchor.

  • type: bbox
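The sketch below shows one way to picture a bbox whose coordinates mix plain numbers with general variables. It is an illustration under assumptions: the variables are written as strings such as "$page_width", the page values are made up, and anchor variables and mathematical functions are left out. `resolve_bbox` is a hypothetical helper, not a PDFix function.

```python
def resolve_bbox(bbox, variables):
    """Replace general-variable names in a bbox with their numeric values."""
    if len(bbox) != 4:
        raise ValueError("bbox must be [left, bottom, right, top]")
    resolved = []
    for coord in bbox:
        if isinstance(coord, str):
            # A string coordinate is looked up as a variable such as "$page_width".
            resolved.append(float(variables[coord]))
        else:
            resolved.append(float(coord))
    return resolved


# Hypothetical page state: a US Letter page measured in points.
page_vars = {"$page_width": 612.0, "$page_height": 792.0}

# The whole page, written with general variables instead of fixed numbers.
print(resolve_bbox([0, 0, "$page_width", "$page_height"], page_vars))
# [0.0, 0.0, 612.0, 792.0]
```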

keys and values:

cell_column

The column number of the cell in the table.

  • type: int

cell_row

The row number of the cell in the table.

  • type: int

cell_row_span

The cell row span.

  • type: int

cell_column_span

The cell column span.

  • type: int

cell_scope

The cell scope.

  • type: string
  • values:
    • ['row', 'column', 'both']

col_num

Number of columns in the table.

  • type: int

children_num

Number of associated child objects.

  • type: int

object_num

Number of associated page objects.

  • type: int

artifact

True if object has content mark Artifact, false otherwise.

  • type: bool
  • values:
    • ['true', 'false']

mcid

MCID content mark number if it exists, -1 otherwise.

  • type: int

has_fill

True if fill color is set.

  • type: bool
  • values:
    • ['true', 'false']

fill_color

The fill color of an object.

  • type: rgb

keys and values:

has_stroke

True if stroke color is set.

  • type: bool
  • values:
    • ['true', 'false']

stroke_color

The stroke color of an object.

  • type: rgb

keys and values:

flag

The flag value defines a specific property for an object, which is essential for further processing.

  • type: string
  • values:
    • ['no_join', 'no_split', 'artifact', 'header', 'footer', 'splitter', 'no_table', 'no_image', 'no_expand', 'continuous', 'anchor']

numbering

Set the list numbering attribute.

  • type: string
  • values:
    • ['None', 'Unordered', 'Disc', 'Circle', 'Square', 'Ordered', 'Decimal', 'UpperRoman', 'LowerRoman', 'UpperAlpha', 'LowerAlpha', 'Description']

single_instance

Properties to compare, delimited by |. If an element with the same properties already exists, only the first instance is tagged.

  • type: string
  • values:
    • ['type', 'width', 'height', 'left', 'right', 'top', 'bottom', 'bbox', 'font_size', 'font_name', 'text', 'fill_color', 'stroke_color', 'angle', 'alt', 'actual_text', 'flag', 'word_flag', 'text_line_flag', 'text_flag', 'lang', 'cell_column', 'cell_row', 'cell_column_span', 'cell_row_span', 'cell_scope', 'row_num', 'col_num']
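The effect of single_instance can be pictured as de-duplication on the listed properties. The Python sketch below is one reading of the description, not the PDFix implementation: elements are modelled as plain dictionaries, `first_instances` is a hypothetical helper, and whether the comparison runs per page or per document is not stated here.

```python
def first_instances(elements, single_instance):
    """Keep only the first element for each combination of compared properties."""
    keys = single_instance.split("|")
    seen = set()
    kept = []
    for element in elements:
        # The signature holds the values of the compared properties, in order.
        signature = tuple(element.get(key) for key in keys)
        if signature in seen:
            continue  # an element with the same properties was already kept
        seen.add(signature)
        kept.append(element)
    return kept


elements = [
    {"type": "pde_text", "text": "Confidential", "font_size": 9.0},
    {"type": "pde_text", "text": "Confidential", "font_size": 9.0},
    {"type": "pde_text", "text": "Chapter 1", "font_size": 18.0},
]
kept = first_instances(elements, "type|text|font_size")
print([element["text"] for element in kept])  # ['Confidential', 'Chapter 1']
```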

word_space

Updates the word spacing for the font, in points.

  • type: float

font_name

The name of the font used in the text object.

  • type: string

font_size

The size of the font used in the text object.

  • type: float

red

The red component of an RGB color.

  • type: int

green

The green component of an RGB color.

  • type: int

blue

The blue component of an RGB color.

  • type: int

cell_header

Marks the object as a table header.

  • type: bool
  • values:
    • ['true', 'false']

cell_associated_header

Cell associated headers delimited by |.

  • type: string

heading

Sets the text heading style.

  • type: string
  • values:
    • ['normal', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'h7', 'h8', 'note', 'title']

width

The object's width dimension.

  • type: float

height

The object's height dimension.

  • type: float

label

Marks the element as a list label.

  • type: string
  • values:
    • ['label', 'li_1', 'li_2', 'li_3', 'li_4', 'label_no']

left

The left coordinate of the object.

  • type: float

right

The right coordinate of the object.

  • type: float

top

The top coordinate of the object.

  • type: float

bottom

The bottom coordinate of the object.

  • type: float

baseline_x

The baseline x coordinate of the text object.

  • type: float

baseline_y

The baseline y coordinate of the text object.

  • type: float

pdf_rect

Parameter that represents the bounding box of an object, formatted as an array: [left, bottom, right, top].

  • type: rec

keys and values:

pdf_rgb

Parameter that represents the RGB color of an object, formatted as an array: [red, green, blue].

  • type: rgb

keys and values:

reflow

Text reflow. If set to false, each line is treated as a new line.

  • type: bool
  • values:
    • ['true', 'false']

row_num

The number of rows in the table.

  • type: int

table_type

The table type represented as a value from the PdfTableType enum.

  • type: string
  • values:
    • ['graphic', 'isolated', 'row', 'col', 'form']

text

The text to be used as a value.

  • type: string

text_flag

The flag to be used for the text element, specifying a value similar to the regex flags.

  • type: string
  • values:
    • ['table_caption', 'image_caption', 'chart_caption', 'note_caption', 'filling', 'uppercase', 'new_line', 'no_new_line']

text_line_flag

The flag to be used for the text line element, specifying a value similar to the regex flags.

  • type: string
  • values:
    • ['hyphen', 'new_line', 'indent', 'terminal', 'drop_cap', 'filling', 'uppercase', 'no_new_line']

text_state_flag

The flag to be used for the text state.

  • type: string
  • values:
    • ['underline', 'strikeout', 'highlight', 'subscript', 'superscript', 'no_unicode', 'white_space', 'unicode']

word_flag

The flag to be used for the word element, specifying a value similar to the regex flags.

  • type: string
  • values:
    • ['hyphen', 'bullet', 'colon', 'number', 'subscript', 'superscript', 'terminal', 'capital', 'image', 'decimal_num', 'roman_num', 'letter_num', 'page_num', 'filling', 'uppercase', 'comma', 'no_unicode']

suffix

Container holding all unique suffixes used for naming in the JSON default template.

keys and values:

condition

Condition types used in the query.

keys and values:

condition_value

{0_width : 100}

comparison

{0_width : {$lt : 100}}

keys and values:

  • "$eq"

comparison_array

{0_width : [{$lt : 100}, {$gt : 100}, ...]}

keys and values:

  • "$gt"
  • "$lt"
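The three condition shapes (a plain value, a single comparison, and an array of comparisons) can be modelled with a small evaluator. This Python sketch rests on assumptions: a plain value is treated as an equality test, only the operators listed in this section are supported, and because the way array results are combined is not stated here, the function returns one boolean per comparison and leaves the combination to the caller. `check` is a hypothetical helper.

```python
import operator

# Only the operators named in this section.
OPERATORS = {"$eq": operator.eq, "$gt": operator.gt, "$lt": operator.lt}


def check(condition, actual):
    """Evaluate a condition against a value; return one boolean per comparison."""
    if isinstance(condition, list):
        # comparison_array: evaluate each comparison in order.
        return [result for item in condition for result in check(item, actual)]
    if isinstance(condition, dict):
        # comparison: {"$lt": 100} means actual < 100.
        return [OPERATORS[op](actual, expected) for op, expected in condition.items()]
    # condition_value: assumed to mean plain equality.
    return [actual == condition]


width = 80
print(check(100, width))                          # [False]
print(check({"$lt": 100}, width))                 # [True]
print(check([{"$lt": 100}, {"$gt": 50}], width))  # [True, True]
```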

keywords

Container holding all unique keywords used in the JSON default template.

keys and values:

general

Holds general data such as version, date, id, SDK version, ...

template

Holds all functions.

query

Can be used in all functions. Each QUERY must have a child PARAM, which holds an array of parameters specifying the query objects.

param

Child of the QUERY. Each QUERY must include a PARAM that specifies the object types used for evaluation.

  • type: array_param

statement

The if statement should be used in function nodes. Based on the statement, the query evaluation stops upon pass or fail. If the if statement is not present, the condition is considered disabled.

  • type: string
  • values:
    • ['$if', '$elif', '$else']

disable

Can be used in all main function nodes. If the value is true, the node is not executed. The default value is false.

  • type: bool
  • values:
    • ['true', 'false']

purpose

Describes the user-defined purpose or description of the QUERY.

  • type: string

insert

Values to be added as the default for the node.

keys and values:

math_expressions

Mathematical functions used to define custom variables.

  • type: string
  • values:
    • ['SUM()', 'MINUS()', 'ABS()', 'MULTIPLY()', 'DIVIDE()', 'MIN()', 'MAX()', 'MOD()', 'FLOOR()', 'CEILING()']
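As a rough guide to the arithmetic behind these names, the Python sketch below maps each function to an equivalent operation. The argument syntax and arity used in a template are not described here, so the arities below are assumptions, and `MATH_FUNCTIONS`, `page_width`, and `margin` are hypothetical names used only for illustration.

```python
import math

# Assumed arities: two operands for the arithmetic functions, one for
# ABS/FLOOR/CEILING, any number for MIN/MAX. The template syntax may differ.
MATH_FUNCTIONS = {
    "SUM": lambda a, b: a + b,
    "MINUS": lambda a, b: a - b,
    "ABS": abs,
    "MULTIPLY": lambda a, b: a * b,
    "DIVIDE": lambda a, b: a / b,
    "MIN": min,
    "MAX": max,
    "MOD": lambda a, b: a % b,
    "FLOOR": math.floor,
    "CEILING": math.ceil,
}

page_width = 612.0  # hypothetical value of $page_width
margin = 72.0       # hypothetical margin in points

# Width of the text area after removing the margin on both sides.
text_width = MATH_FUNCTIONS["MINUS"](page_width, MATH_FUNCTIONS["MULTIPLY"](2, margin))
print(text_width)  # 468.0
```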