README.md: 3 changes (2 additions, 1 deletion)
@@ -1532,6 +1532,7 @@ The output looks like this:
| Option (usage example) | Description |
|-----------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| .option("string_trimming_policy", "both") | Specifies if and how string fields should be trimmed. Available options: `both` (default), `none`, `left`, `right`, `keep_all`. `keep_all` - keeps control characters when decoding ASCII text files |
| .option("display_pic_always_string", "false") | If `true` fields that have `DISPLAY` format will always be converted to `string` type, even if such fields contain numbers, retaining leading and trailing zeros. Cannot be used together with `strict_integral_precision`. |
| .option("ebcdic_code_page", "common") | Specifies a code page for EBCDIC encoding. Currently supported values: `common` (default), `common_extended`, `cp037`, `cp037_extended`, and others (see "Currently supported EBCDIC code pages" section. |
| .option("ebcdic_code_page_class", "full.class.specifier") | Specifies a user provided class for a custom code page to UNICODE conversion. |
| .option("field_code_page:cp825", "field1, field2") | Specifies the code page for selected fields. You can add mo than 1 such option for multiple code page overrides. |
@@ -1541,7 +1542,7 @@ The output looks like this:
| .option("occurs_mapping", "{\"FIELD\": {\"X\": 1}}") | If specified, as a JSON string, allows for String `DEPENDING ON` fields with a corresponding mapping. |
| .option("strict_sign_overpunching", "true") | If `true` (default), sign overpunching will only be allowed for signed numbers. If `false`, overpunched positive sign will be allowed for unsigned numbers, but negative sign will result in null. |
| .option("improved_null_detection", "true") | If `true`(default), values that contain only 0x0 ror DISPLAY strings and numbers will be considered `null`s instead of empty strings. |
| .option("strict_integral_precision", "true") | If `true`, Cobrix will not generate `short`/`integer`/`long` Spark data types, and always use `decimal(n)` with the exact precision that matches the copybook. |
| .option("strict_integral_precision", "true") | If `true`, Cobrix will not generate `short`/`integer`/`long` Spark data types, and always use `decimal(n)` with the exact precision that matches the copybook. Cannot be used together with `display_pic_always_string`. |
| .option("binary_as_hex", "false") | By default fields that have `PIC X` and `USAGE COMP` are converted to `binary` Spark data type. If this option is set to `true`, such fields will be strings in HEX encoding. |

##### Modifier options
@@ -107,24 +107,25 @@ object CopybookParser extends Logging {
* Tokenizes the contents of a COBOL copybook and returns the AST.
*
* @param dataEncoding Encoding of the data file (either ASCII/EBCDIC). The encoding of the copybook is expected to be ASCII.
- * @param copyBookContents      A string containing all lines of a copybook
- * @param dropGroupFillers      Drop groups marked as fillers from the output AST
- * @param dropValueFillers      Drop primitive fields marked as fillers from the output AST
- * @param fillerNamingPolicy    Specifies a naming policy for fillers
+ * @param copyBookContents      A string containing all lines of a copybook.
+ * @param dropGroupFillers      Drop groups marked as fillers from the output AST.
+ * @param dropValueFillers      Drop primitive fields marked as fillers from the output AST.
+ * @param fillerNamingPolicy    Specifies a naming policy for fillers.
* @param segmentRedefines A list of redefined fields that correspond to various segments. This needs to be specified for automatically
* resolving segment redefines.
- * @param fieldParentMap        A segment fields parent mapping
- * @param stringTrimmingPolicy  Specifies if and how strings should be trimmed when parsed
- * @param strictSignOverpunch   If true sign overpunching is not allowed for unsigned numbers
+ * @param fieldParentMap        A segment fields parent mapping.
+ * @param stringTrimmingPolicy  Specifies if and how strings should be trimmed when parsed.
+ * @param isDisplayAlwaysString If true, all fields having DISPLAY format will remain strings and won't be converted to numbers.
+ * @param strictSignOverpunch   If true, sign overpunching is not allowed for unsigned numbers.
* @param improvedNullDetection If true, string values that contain only zero bytes (0x0) will be considered null.
- * @param commentPolicy         Specifies a policy for comments truncation inside a copybook
- * @param ebcdicCodePage        A code page for EBCDIC encoded data
- * @param asciiCharset          A charset for ASCII encoded data
+ * @param commentPolicy         Specifies a policy for comments truncation inside a copybook.
+ * @param ebcdicCodePage        A code page for EBCDIC encoded data.
+ * @param asciiCharset          A charset for ASCII encoded data.
* @param isUtf16BigEndian      If true, UTF-16 strings are considered big-endian.
- * @param floatingPointFormat   A format of floating-point numbers (IBM/IEEE754)
- * @param nonTerminals          A list of non-terminals that should be extracted as strings
+ * @param floatingPointFormat   A format of floating-point numbers (IBM/IEEE754).
+ * @param nonTerminals          A list of non-terminals that should be extracted as strings.
* @param debugFieldsPolicy     Specifies if debugging fields need to be added and what they should contain (false, hex, raw).
- * @return Seq[Group] where a group is a record inside the copybook
+ * @return Seq[Group] where a group is a record inside the copybook.
*/
def parse(copyBookContents: String,
dataEncoding: Encoding = EBCDIC,
@@ -134,6 +135,7 @@
segmentRedefines: Seq[String] = Nil,
fieldParentMap: Map[String, String] = HashMap[String, String](),
stringTrimmingPolicy: StringTrimmingPolicy = StringTrimmingPolicy.TrimBoth,
+ isDisplayAlwaysString: Boolean = false,
commentPolicy: CommentPolicy = CommentPolicy(),
strictSignOverpunch: Boolean = true,
improvedNullDetection: Boolean = false,
@@ -155,6 +157,7 @@
segmentRedefines,
fieldParentMap,
stringTrimmingPolicy,
+ isDisplayAlwaysString,
commentPolicy,
strictSignOverpunch,
improvedNullDetection,
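For illustration, a minimal sketch of calling the updated entry point with the new flag. The copybook literal is hypothetical, all other parameters keep the defaults shown in the signature above, and the package name is assumed from the published Cobrix modules:

```scala
import za.co.absa.cobrix.cobol.parser.CopybookParser

// A hypothetical copybook with two DISPLAY (zoned/text) numeric fields.
val copybook =
  """       01  RECORD.
    |          05  ID      PIC 9(4).
    |          05  AMOUNT  PIC S9(6).
    |""".stripMargin

// With the flag set, the decoders attached to ID and AMOUNT keep the
// original string form of the values, retaining leading zeros.
val parsedCopybook = CopybookParser.parse(copybook, isDisplayAlwaysString = true)
```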
@@ -180,6 +183,7 @@ object CopybookParser extends Logging {
* @param segmentRedefines A list of redefined fields that correspond to various segments. This needs to be specified for automatically
* @param fieldParentMap A segment fields parent mapping
* @param stringTrimmingPolicy Specifies if and how strings should be trimmed when parsed
+ * @param isDisplayAlwaysString If true, all fields having DISPLAY format will remain strings and won't be converted to numbers
* @param commentPolicy Specifies a policy for comments truncation inside a copybook
* @param strictSignOverpunch If true sign overpunching is not allowed for unsigned numbers
* @param improvedNullDetection If true, string values that contain only zero bytes (0x0) will be considered null.
@@ -198,6 +202,7 @@
segmentRedefines: Seq[String] = Nil,
fieldParentMap: Map[String, String] = HashMap[String, String](),
stringTrimmingPolicy: StringTrimmingPolicy = StringTrimmingPolicy.TrimBoth,
+ isDisplayAlwaysString: Boolean = false,
commentPolicy: CommentPolicy = CommentPolicy(),
strictSignOverpunch: Boolean = true,
improvedNullDetection: Boolean = false,
@@ -219,6 +224,7 @@
segmentRedefines,
fieldParentMap,
stringTrimmingPolicy,
+ isDisplayAlwaysString,
commentPolicy,
strictSignOverpunch,
improvedNullDetection,
@@ -265,6 +271,7 @@
segmentRedefines: Seq[String],
fieldParentMap: Map[String, String],
stringTrimmingPolicy: StringTrimmingPolicy,
+ isDisplayAlwaysString: Boolean,
commentPolicy: CommentPolicy,
strictSignOverpunch: Boolean,
improvedNullDetection: Boolean,
@@ -279,7 +286,7 @@
debugFieldsPolicy: DebugFieldsPolicy,
fieldCodePageMap: Map[String, String]): Copybook = {

- val schemaANTLR: CopybookAST = ANTLRParser.parse(copyBookContents, enc, stringTrimmingPolicy, commentPolicy, strictSignOverpunch, improvedNullDetection, strictIntegralPrecision, decodeBinaryAsHex, ebcdicCodePage, asciiCharset, isUtf16BigEndian, floatingPointFormat, fieldCodePageMap)
+ val schemaANTLR: CopybookAST = ANTLRParser.parse(copyBookContents, enc, stringTrimmingPolicy, isDisplayAlwaysString, commentPolicy, strictSignOverpunch, improvedNullDetection, strictIntegralPrecision, decodeBinaryAsHex, ebcdicCodePage, asciiCharset, isUtf16BigEndian, floatingPointFormat, fieldCodePageMap)

val nonTerms: Set[String] = (for (id <- nonTerminals)
yield transformIdentifier(id)
@@ -54,6 +54,7 @@ object ANTLRParser extends Logging {
def parse(copyBookContents: String,
enc: Encoding,
stringTrimmingPolicy: StringTrimmingPolicy,
+ isDisplayAlwaysString: Boolean,
commentPolicy: CommentPolicy,
strictSignOverpunch: Boolean,
improvedNullDetection: Boolean,
@@ -64,7 +65,7 @@
isUtf16BigEndian: Boolean,
floatingPointFormat: FloatingPointFormat,
fieldCodePageMap: Map[String, String]): CopybookAST = {
- val visitor = new ParserVisitor(enc, stringTrimmingPolicy, ebcdicCodePage, asciiCharset, isUtf16BigEndian, floatingPointFormat, strictSignOverpunch, improvedNullDetection, strictIntegralPrecision, decodeBinaryAsHex, fieldCodePageMap)
+ val visitor = new ParserVisitor(enc, stringTrimmingPolicy, isDisplayAlwaysString, ebcdicCodePage, asciiCharset, isUtf16BigEndian, floatingPointFormat, strictSignOverpunch, improvedNullDetection, strictIntegralPrecision, decodeBinaryAsHex, fieldCodePageMap)

val strippedContents = filterSpecialCharacters(copyBookContents).split("\\r?\\n").map(
line =>
@@ -41,6 +41,7 @@ sealed trait Expr

class ParserVisitor(enc: Encoding,
stringTrimmingPolicy: StringTrimmingPolicy,
+ isDisplayAlwaysString: Boolean,
ebcdicCodePage: CodePage,
asciiCharset: Charset,
isUtf16BigEndian: Boolean,
@@ -854,7 +855,7 @@
Map(),
isDependee = false,
identifier.toUpperCase() == Constants.FILLER,
- DecoderSelector.getDecoder(pic.value, stringTrimmingPolicy, effectiveEbcdicCodePage, effectiveAsciiCharset, isUtf16BigEndian, floatingPointFormat, strictSignOverpunch, improvedNullDetection, strictIntegralPrecision)
+ DecoderSelector.getDecoder(pic.value, stringTrimmingPolicy, isDisplayAlwaysString, effectiveEbcdicCodePage, effectiveAsciiCharset, isUtf16BigEndian = isUtf16BigEndian, floatingPointFormat, strictSignOverpunch = strictSignOverpunch, improvedNullDetection = improvedNullDetection, strictIntegralPrecision = strictIntegralPrecision)
) (Some(parent))

parent.children.append(prim)
@@ -73,7 +73,7 @@ class NonTerminalsAdder(
)
val sz = g.binaryProperties.actualSize
val dataType = AlphaNumeric(s"X($sz)", sz, enc = Some(enc))
- val decode = DecoderSelector.getDecoder(dataType, stringTrimmingPolicy, ebcdicCodePage, asciiCharset, isUtf16BigEndian, floatingPointFormat, strictSignOverpunch, improvedNullDetection)
+ val decode = DecoderSelector.getDecoder(dataType, stringTrimmingPolicy, isDisplayAlwaysString = false, ebcdicCodePage, asciiCharset, isUtf16BigEndian, floatingPointFormat, strictSignOverpunch, improvedNullDetection)
val newName = getNonTerminalName(g.name, g.parent.get)
newChildren.append(
Primitive(
@@ -56,6 +56,7 @@ object DecoderSelector {
*/
def getDecoder(dataType: CobolType,
stringTrimmingPolicy: StringTrimmingPolicy = TrimBoth,
+ isDisplayAlwaysString: Boolean = false,
ebcdicCodePage: CodePage = new CodePageCommon,
asciiCharset: Charset = StandardCharsets.US_ASCII,
isUtf16BigEndian: Boolean = true,
@@ -66,6 +67,7 @@
val decoder = dataType match {
case alphaNumeric: AlphaNumeric => getStringDecoder(alphaNumeric.enc.getOrElse(EBCDIC), stringTrimmingPolicy, ebcdicCodePage, asciiCharset, isUtf16BigEndian, improvedNullDetection)
case decimalType: Decimal => getDecimalDecoder(decimalType, floatingPointFormat, strictSignOverpunch, improvedNullDetection)
+ case integralType: Integral if isDisplayAlwaysString => getDisplayDecoderAsString(integralType, improvedNullDetection, strictSignOverpunch)
case integralType: Integral => getIntegralDecoder(integralType, strictSignOverpunch, improvedNullDetection, strictIntegralPrecision)
case _ => throw new IllegalStateException("Unknown AST object")
}
@@ -251,6 +253,29 @@
}
}

+  private[parser] def getDisplayDecoderAsString(integralType: Integral,
+                                                improvedNullDetection: Boolean,
+                                                strictSignOverpunch: Boolean): Decoder = {
+    val encoding = integralType.enc.getOrElse(EBCDIC)
+    val isSigned = integralType.signPosition.isDefined
+    val allowedSignOverpunch = isSigned || !strictSignOverpunch
+
+    val isEbcdic = encoding match {
+      case EBCDIC => true
+      case _ => false
+    }
+
+    if (isEbcdic) {
+      bytes: Array[Byte] => {
+        StringDecoders.decodeEbcdicNumber(bytes, !isSigned, allowedSignOverpunch, improvedNullDetection)
+      }
+    } else {
+      bytes: Array[Byte] => {
+        StringDecoders.decodeAsciiNumber(bytes, !isSigned, allowedSignOverpunch, improvedNullDetection)
+      }
+    }
+  }
+
/** Gets a decoder function for a binary encoded integral data type. A direct conversion from array of bytes to the target type is used where possible. */
private def getBinaryEncodedIntegralDecoder(compact: Option[Usage], precision: Int, signPosition: Option[Position] = None, isBigEndian: Boolean, strictIntegralPrecision: Boolean): Decoder = {
val isSigned = signPosition.nonEmpty
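For intuition about the new decoder path, a small hypothetical illustration. It assumes the scope of DecoderSelector, where StringDecoders is already available, and the argument order mirrors the `decodeEbcdicNumber` call in `getDisplayDecoderAsString` above:

```scala
// EBCDIC bytes for the DISPLAY value "0042": digits '0'..'9' are 0xF0..0xF9.
val displayBytes = Array(0xF0, 0xF0, 0xF4, 0xF2).map(_.toByte)

// For an unsigned field with strictSignOverpunch = true, the method above
// passes: isUnsigned = true, sign overpunching not allowed, improved null
// detection off.
val decoded = StringDecoders.decodeEbcdicNumber(displayBytes, true, false, false)
// Expected result: the string "0042" with leading zeros retained, where the
// regular integral decoder would have produced the number 42.
```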
@@ -49,6 +49,7 @@ import za.co.absa.cobrix.cobol.reader.policies.SchemaRetentionPolicy.SchemaReten
* @param generateRecordBytes Generate 'record_bytes' field containing raw bytes of the original record
* @param schemaRetentionPolicy A copybook usually has a root group struct element that acts like a rowtag in XML. This can be retained in Spark schema or can be collapsed
* @param stringTrimmingPolicy Specify if and how strings should be trimmed when parsed
+ * @param isDisplayAlwaysString If true, all fields having DISPLAY format will remain strings and won't be converted to numbers
* @param allowPartialRecords If true, partial ASCII records can be parsed (for example, when the LF character is missing)
* @param multisegmentParams Parameters for reading multisegment mainframe files
* @param improvedNullDetection If true, string values that contain only zero bytes (0x0) will be considered null.
@@ -87,6 +88,7 @@
generateRecordBytes: Boolean,
schemaRetentionPolicy: SchemaRetentionPolicy,
stringTrimmingPolicy: StringTrimmingPolicy,
+ isDisplayAlwaysString: Boolean,
allowPartialRecords: Boolean,
multisegmentParams: Option[MultisegmentParameters],
commentPolicy: CommentPolicy,
@@ -59,6 +59,7 @@ import za.co.absa.cobrix.cobol.reader.policies.SchemaRetentionPolicy.SchemaReten
* @param generateRecordBytes Generate 'record_bytes' field containing raw bytes of the original record
* @param schemaPolicy Specifies a policy to transform the input schema. The default policy is to keep the schema exactly as it is in the copybook.
* @param stringTrimmingPolicy Specifies if and how strings should be trimmed when parsed.
+ * @param isDisplayAlwaysString If true, all fields having DISPLAY format will remain strings and won't be converted to numbers.
* @param allowPartialRecords If true, partial ASCII records can be parsed (for example, when the LF character is missing)
* @param multisegment Parameters specific to reading multisegment files
* @param commentPolicy A comment truncation policy
@@ -108,6 +109,7 @@
generateRecordBytes: Boolean = false,
schemaPolicy: SchemaRetentionPolicy = SchemaRetentionPolicy.CollapseRoot,
stringTrimmingPolicy: StringTrimmingPolicy = StringTrimmingPolicy.TrimBoth,
+ isDisplayAlwaysString: Boolean = false,
allowPartialRecords: Boolean = false,
multisegment: Option[MultisegmentParameters] = None,
commentPolicy: CommentPolicy = CommentPolicy(),
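Finally, a minimal construction sketch: the Spark option `display_pic_always_string` ultimately sets this flag on the reader parameters. The import path and the availability of defaults for all other fields are assumptions based on the case class shown above, not confirmed by this diff:

```scala
// Assumed package, matching the reader imports shown above.
import za.co.absa.cobrix.cobol.reader.parameters.ReaderParameters

// All other fields are assumed to keep their declared defaults.
val readerParams = ReaderParameters(isDisplayAlwaysString = true)
```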