README.md: 3 changes (2 additions, 1 deletion)
@@ -1532,6 +1532,7 @@ The output looks like this:
| Option (usage example) | Description |
|-----------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| .option("string_trimming_policy", "both") | Specifies if and how string fields should be trimmed. Available options: `both` (default), `none`, `left`, `right`, `keep_all`. `keep_all` - keeps control characters when decoding ASCII text files |
| .option("display_pic_always_string", "false") | If `true` fields that have `DISPLAY` format will always be converted to `string` type, even if such fields contain numbers, retaining leading and trailing zeros. Cannot be used together with `strict_integral_precision`. |
| .option("ebcdic_code_page", "common") | Specifies a code page for EBCDIC encoding. Currently supported values: `common` (default), `common_extended`, `cp037`, `cp037_extended`, and others (see "Currently supported EBCDIC code pages" section. |
| .option("ebcdic_code_page_class", "full.class.specifier") | Specifies a user provided class for a custom code page to UNICODE conversion. |
| .option("field_code_page:cp825", "field1, field2") | Specifies the code page for selected fields. You can add mo than 1 such option for multiple code page overrides. |
@@ -1541,7 +1542,7 @@ The output looks like this:
| .option("occurs_mapping", "{\"FIELD\": {\"X\": 1}}") | If specified, as a JSON string, allows for String `DEPENDING ON` fields with a corresponding mapping. |
| .option("strict_sign_overpunching", "true") | If `true` (default), sign overpunching will only be allowed for signed numbers. If `false`, overpunched positive sign will be allowed for unsigned numbers, but negative sign will result in null. |
| .option("improved_null_detection", "true") | If `true`(default), values that contain only 0x0 ror DISPLAY strings and numbers will be considered `null`s instead of empty strings. |
| .option("strict_integral_precision", "true") | If `true`, Cobrix will not generate `short`/`integer`/`long` Spark data types, and always use `decimal(n)` with the exact precision that matches the copybook. |
| .option("strict_integral_precision", "true") | If `true`, Cobrix will not generate `short`/`integer`/`long` Spark data types, and always use `decimal(n)` with the exact precision that matches the copybook. Cannot be used together with `display_pic_always_string`. |
| .option("binary_as_hex", "false") | By default fields that have `PIC X` and `USAGE COMP` are converted to `binary` Spark data type. If this option is set to `true`, such fields will be strings in HEX encoding. |

##### Modifier options
@@ -107,24 +107,25 @@ object CopybookParser extends Logging {
* Tokenizes the contents of a COBOL copybook and returns the AST.
*
* @param dataEncoding Encoding of the data file (either ASCII/EBCDIC). The encoding of the copybook is expected to be ASCII.
- * @param copyBookContents      A string containing all lines of a copybook
- * @param dropGroupFillers      Drop groups marked as fillers from the output AST
- * @param dropValueFillers      Drop primitive fields marked as fillers from the output AST
- * @param fillerNamingPolicy    Specifies a naming policy for fillers
+ * @param copyBookContents      A string containing all lines of a copybook.
+ * @param dropGroupFillers      Drop groups marked as fillers from the output AST.
+ * @param dropValueFillers      Drop primitive fields marked as fillers from the output AST.
+ * @param fillerNamingPolicy    Specifies a naming policy for fillers.
* @param segmentRedefines A list of redefined fields that correspond to various segments. This needs to be specified for automatically
* resolving segment redefines.
- * @param fieldParentMap        A segment fields parent mapping
- * @param stringTrimmingPolicy  Specifies if and how strings should be trimmed when parsed
- * @param strictSignOverpunch   If true sign overpunching is not allowed for unsigned numbers
+ * @param fieldParentMap        A segment fields parent mapping.
+ * @param stringTrimmingPolicy  Specifies if and how strings should be trimmed when parsed.
+ * @param isDisplayAlwaysString If true, all fields having DISPLAY format will remain strings and won't be converted to numbers.
+ * @param strictSignOverpunch   If true, sign overpunching is not allowed for unsigned numbers.
* @param improvedNullDetection If true, string values that contain only zero bytes (0x0) will be considered null.
- * @param commentPolicy         Specifies a policy for comments truncation inside a copybook
- * @param ebcdicCodePage        A code page for EBCDIC encoded data
- * @param asciiCharset          A charset for ASCII encoded data
+ * @param commentPolicy         Specifies a policy for comments truncation inside a copybook.
+ * @param ebcdicCodePage        A code page for EBCDIC encoded data.
+ * @param asciiCharset          A charset for ASCII encoded data.
* @param isUtf16BigEndian      If true, UTF-16 strings are considered big-endian.
- * @param floatingPointFormat   A format of floating-point numbers (IBM/IEEE754)
- * @param nonTerminals          A list of non-terminals that should be extracted as strings
+ * @param floatingPointFormat   A format of floating-point numbers (IBM/IEEE754).
+ * @param nonTerminals          A list of non-terminals that should be extracted as strings.
* @param debugFieldsPolicy     Specifies if debugging fields need to be added and what they should contain (false, hex, raw).
- * @return Seq[Group] where a group is a record inside the copybook
+ * @return Seq[Group] where a group is a record inside the copybook.
*/
def parse(copyBookContents: String,
dataEncoding: Encoding = EBCDIC,
@@ -134,6 +135,7 @@
segmentRedefines: Seq[String] = Nil,
fieldParentMap: Map[String, String] = HashMap[String, String](),
stringTrimmingPolicy: StringTrimmingPolicy = StringTrimmingPolicy.TrimBoth,
+ isDisplayAlwaysString: Boolean = false,
commentPolicy: CommentPolicy = CommentPolicy(),
strictSignOverpunch: Boolean = true,
improvedNullDetection: Boolean = false,
@@ -155,6 +157,7 @@
segmentRedefines,
fieldParentMap,
stringTrimmingPolicy,
+ isDisplayAlwaysString,
commentPolicy,
strictSignOverpunch,
improvedNullDetection,
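For illustration, a minimal sketch of calling the updated entry point with the new flag. The copybook literal is hypothetical, all other parameters keep the defaults shown in the signature above, and the package name is assumed from the published Cobrix modules:

```scala
import za.co.absa.cobrix.cobol.parser.CopybookParser

// A hypothetical copybook with two DISPLAY (zoned/text) numeric fields.
val copybook =
  """       01  RECORD.
    |          05  ID      PIC 9(4).
    |          05  AMOUNT  PIC S9(6).
    |""".stripMargin

// With the flag set, the decoders attached to ID and AMOUNT keep the
// original string form of the values, retaining leading zeros.
val parsedCopybook = CopybookParser.parse(copybook, isDisplayAlwaysString = true)
```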
@@ -180,6 +183,7 @@ object CopybookParser extends Logging {
* @param segmentRedefines A list of redefined fields that correspond to various segments. This needs to be specified for automatically
* @param fieldParentMap A segment fields parent mapping
* @param stringTrimmingPolicy Specifies if and how strings should be trimmed when parsed
+ * @param isDisplayAlwaysString If true, all fields having DISPLAY format will remain strings and won't be converted to numbers
* @param commentPolicy Specifies a policy for comments truncation inside a copybook
* @param strictSignOverpunch If true sign overpunching is not allowed for unsigned numbers
* @param improvedNullDetection If true, string values that contain only zero bytes (0x0) will be considered null.
@@ -198,6 +202,7 @@
segmentRedefines: Seq[String] = Nil,
fieldParentMap: Map[String, String] = HashMap[String, String](),
stringTrimmingPolicy: StringTrimmingPolicy = StringTrimmingPolicy.TrimBoth,
+ isDisplayAlwaysString: Boolean = false,
commentPolicy: CommentPolicy = CommentPolicy(),
strictSignOverpunch: Boolean = true,
improvedNullDetection: Boolean = false,
@@ -219,6 +224,7 @@
segmentRedefines,
fieldParentMap,
stringTrimmingPolicy,
+ isDisplayAlwaysString,
commentPolicy,
strictSignOverpunch,
improvedNullDetection,
@@ -265,6 +271,7 @@
segmentRedefines: Seq[String],
fieldParentMap: Map[String, String],
stringTrimmingPolicy: StringTrimmingPolicy,
+ isDisplayAlwaysString: Boolean,
commentPolicy: CommentPolicy,
strictSignOverpunch: Boolean,
improvedNullDetection: Boolean,
@@ -279,7 +286,7 @@
debugFieldsPolicy: DebugFieldsPolicy,
fieldCodePageMap: Map[String, String]): Copybook = {

- val schemaANTLR: CopybookAST = ANTLRParser.parse(copyBookContents, enc, stringTrimmingPolicy, commentPolicy, strictSignOverpunch, improvedNullDetection, strictIntegralPrecision, decodeBinaryAsHex, ebcdicCodePage, asciiCharset, isUtf16BigEndian, floatingPointFormat, fieldCodePageMap)
+ val schemaANTLR: CopybookAST = ANTLRParser.parse(copyBookContents, enc, stringTrimmingPolicy, isDisplayAlwaysString, commentPolicy, strictSignOverpunch, improvedNullDetection, strictIntegralPrecision, decodeBinaryAsHex, ebcdicCodePage, asciiCharset, isUtf16BigEndian, floatingPointFormat, fieldCodePageMap)

val nonTerms: Set[String] = (for (id <- nonTerminals)
yield transformIdentifier(id)
@@ -54,6 +54,7 @@ object ANTLRParser extends Logging {
def parse(copyBookContents: String,
enc: Encoding,
stringTrimmingPolicy: StringTrimmingPolicy,
+ isDisplayAlwaysString: Boolean,
commentPolicy: CommentPolicy,
strictSignOverpunch: Boolean,
improvedNullDetection: Boolean,
@@ -64,7 +65,7 @@
isUtf16BigEndian: Boolean,
floatingPointFormat: FloatingPointFormat,
fieldCodePageMap: Map[String, String]): CopybookAST = {
- val visitor = new ParserVisitor(enc, stringTrimmingPolicy, ebcdicCodePage, asciiCharset, isUtf16BigEndian, floatingPointFormat, strictSignOverpunch, improvedNullDetection, strictIntegralPrecision, decodeBinaryAsHex, fieldCodePageMap)
+ val visitor = new ParserVisitor(enc, stringTrimmingPolicy, isDisplayAlwaysString, ebcdicCodePage, asciiCharset, isUtf16BigEndian, floatingPointFormat, strictSignOverpunch, improvedNullDetection, strictIntegralPrecision, decodeBinaryAsHex, fieldCodePageMap)

val strippedContents = filterSpecialCharacters(copyBookContents).split("\\r?\\n").map(
line =>
@@ -41,6 +41,7 @@ sealed trait Expr

class ParserVisitor(enc: Encoding,
stringTrimmingPolicy: StringTrimmingPolicy,
+ isDisplayAlwaysString: Boolean,
ebcdicCodePage: CodePage,
asciiCharset: Charset,
isUtf16BigEndian: Boolean,
@@ -854,7 +855,7 @@
Map(),
isDependee = false,
identifier.toUpperCase() == Constants.FILLER,
- DecoderSelector.getDecoder(pic.value, stringTrimmingPolicy, effectiveEbcdicCodePage, effectiveAsciiCharset, isUtf16BigEndian, floatingPointFormat, strictSignOverpunch, improvedNullDetection, strictIntegralPrecision)
+ DecoderSelector.getDecoder(pic.value, stringTrimmingPolicy, isDisplayAlwaysString, effectiveEbcdicCodePage, effectiveAsciiCharset, isUtf16BigEndian = isUtf16BigEndian, floatingPointFormat, strictSignOverpunch = strictSignOverpunch, improvedNullDetection = improvedNullDetection, strictIntegralPrecision = strictIntegralPrecision)
) (Some(parent))

parent.children.append(prim)
@@ -73,7 +73,7 @@ class NonTerminalsAdder(
)
val sz = g.binaryProperties.actualSize
val dataType = AlphaNumeric(s"X($sz)", sz, enc = Some(enc))
- val decode = DecoderSelector.getDecoder(dataType, stringTrimmingPolicy, ebcdicCodePage, asciiCharset, isUtf16BigEndian, floatingPointFormat, strictSignOverpunch, improvedNullDetection)
+ val decode = DecoderSelector.getDecoder(dataType, stringTrimmingPolicy, isDisplayAlwaysString = false, ebcdicCodePage, asciiCharset, isUtf16BigEndian, floatingPointFormat, strictSignOverpunch, improvedNullDetection)
val newName = getNonTerminalName(g.name, g.parent.get)
newChildren.append(
Primitive(
@@ -56,6 +56,7 @@ object DecoderSelector {
*/
def getDecoder(dataType: CobolType,
stringTrimmingPolicy: StringTrimmingPolicy = TrimBoth,
+ isDisplayAlwaysString: Boolean = false,
ebcdicCodePage: CodePage = new CodePageCommon,
asciiCharset: Charset = StandardCharsets.US_ASCII,
isUtf16BigEndian: Boolean = true,
@@ -66,6 +67,7 @@
val decoder = dataType match {
case alphaNumeric: AlphaNumeric => getStringDecoder(alphaNumeric.enc.getOrElse(EBCDIC), stringTrimmingPolicy, ebcdicCodePage, asciiCharset, isUtf16BigEndian, improvedNullDetection)
case decimalType: Decimal => getDecimalDecoder(decimalType, floatingPointFormat, strictSignOverpunch, improvedNullDetection)
+ case integralType: Integral if isDisplayAlwaysString => getDisplayDecoderAsString(integralType, improvedNullDetection, strictSignOverpunch)
case integralType: Integral => getIntegralDecoder(integralType, strictSignOverpunch, improvedNullDetection, strictIntegralPrecision)
case _ => throw new IllegalStateException("Unknown AST object")
}
@@ -251,6 +253,29 @@
}
}

+  private[parser] def getDisplayDecoderAsString(integralType: Integral,
+                                                improvedNullDetection: Boolean,
+                                                strictSignOverpunch: Boolean): Decoder = {
+    val encoding = integralType.enc.getOrElse(EBCDIC)
+    val isSigned = integralType.signPosition.isDefined
+    val allowedSignOverpunch = isSigned || !strictSignOverpunch
+
+    val isEbcdic = encoding match {
+      case EBCDIC => true
+      case _ => false
+    }
+
+    if (isEbcdic) {
+      bytes: Array[Byte] => {
+        StringDecoders.decodeEbcdicNumber(bytes, !isSigned, allowedSignOverpunch, improvedNullDetection)
+      }
+    } else {
+      bytes: Array[Byte] => {
+        StringDecoders.decodeAsciiNumber(bytes, !isSigned, allowedSignOverpunch, improvedNullDetection)
+      }
+    }
+  }
+
/** Gets a decoder function for a binary encoded integral data type. A direct conversion from array of bytes to the target type is used where possible. */
private def getBinaryEncodedIntegralDecoder(compact: Option[Usage], precision: Int, signPosition: Option[Position] = None, isBigEndian: Boolean, strictIntegralPrecision: Boolean): Decoder = {
val isSigned = signPosition.nonEmpty
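For intuition about the new decoder path, a small hypothetical illustration. It assumes the scope of DecoderSelector, where StringDecoders is already available, and the argument order mirrors the `decodeEbcdicNumber` call in `getDisplayDecoderAsString` above:

```scala
// EBCDIC bytes for the DISPLAY value "0042": digits '0'..'9' are 0xF0..0xF9.
val displayBytes = Array(0xF0, 0xF0, 0xF4, 0xF2).map(_.toByte)

// For an unsigned field with strictSignOverpunch = true, the method above
// passes: isUnsigned = true, sign overpunching not allowed, improved null
// detection off.
val decoded = StringDecoders.decodeEbcdicNumber(displayBytes, true, false, false)
// Expected result: the string "0042" with leading zeros retained, where the
// regular integral decoder would have produced the number 42.
```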
@@ -49,6 +49,7 @@ import za.co.absa.cobrix.cobol.reader.policies.SchemaRetentionPolicy.SchemaReten
* @param generateRecordBytes Generate 'record_bytes' field containing raw bytes of the original record
* @param schemaRetentionPolicy A copybook usually has a root group struct element that acts like a rowtag in XML. This can be retained in Spark schema or can be collapsed
* @param stringTrimmingPolicy Specify if and how strings should be trimmed when parsed
+ * @param isDisplayAlwaysString If true, all fields having DISPLAY format will remain strings and won't be converted to numbers
* @param allowPartialRecords If true, partial ASCII records can be parsed (for example, when the LF character is missing)
* @param multisegmentParams Parameters for reading multisegment mainframe files
* @param improvedNullDetection If true, string values that contain only zero bytes (0x0) will be considered null.
@@ -87,6 +88,7 @@
generateRecordBytes: Boolean,
schemaRetentionPolicy: SchemaRetentionPolicy,
stringTrimmingPolicy: StringTrimmingPolicy,
+ isDisplayAlwaysString: Boolean,
allowPartialRecords: Boolean,
multisegmentParams: Option[MultisegmentParameters],
commentPolicy: CommentPolicy,
@@ -59,6 +59,7 @@ import za.co.absa.cobrix.cobol.reader.policies.SchemaRetentionPolicy.SchemaReten
* @param generateRecordBytes Generate 'record_bytes' field containing raw bytes of the original record
* @param schemaPolicy Specifies a policy to transform the input schema. The default policy is to keep the schema exactly as it is in the copybook.
* @param stringTrimmingPolicy Specifies if and how strings should be trimmed when parsed.
+ * @param isDisplayAlwaysString If true, all fields having DISPLAY format will remain strings and won't be converted to numbers.
* @param allowPartialRecords If true, partial ASCII records can be parsed (for example, when the LF character is missing)
* @param multisegment Parameters specific to reading multisegment files
* @param commentPolicy A comment truncation policy
@@ -108,6 +109,7 @@
generateRecordBytes: Boolean = false,
schemaPolicy: SchemaRetentionPolicy = SchemaRetentionPolicy.CollapseRoot,
stringTrimmingPolicy: StringTrimmingPolicy = StringTrimmingPolicy.TrimBoth,
+ isDisplayAlwaysString: Boolean = false,
allowPartialRecords: Boolean = false,
multisegment: Option[MultisegmentParameters] = None,
commentPolicy: CommentPolicy = CommentPolicy(),
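Finally, a minimal construction sketch: the Spark option `display_pic_always_string` ultimately sets this flag on the reader parameters. The import path and the availability of defaults for all other fields are assumptions based on the case class shown above, not confirmed by this diff:

```scala
// Assumed package, matching the reader imports shown above.
import za.co.absa.cobrix.cobol.reader.parameters.ReaderParameters

// All other fields are assumed to keep their declared defaults.
val readerParams = ReaderParameters(isDisplayAlwaysString = true)
```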