-
Notifications
You must be signed in to change notification settings - Fork 1.5k
GH-3235: Row count limit for each row group #3236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -48,8 +48,8 @@ class InternalParquetRecordWriter<T> { | |
| private final WriteSupport<T> writeSupport; | ||
| private final MessageType schema; | ||
| private final Map<String, String> extraMetaData; | ||
| private final long rowGroupSize; | ||
|
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. unused field |
||
| private long rowGroupSizeThreshold; | ||
| private final int rowGroupRecordCountThreshold; | ||
| private long nextRowGroupSize; | ||
| private final BytesInputCompressor compressor; | ||
| private final boolean validating; | ||
|
|
@@ -91,8 +91,8 @@ public InternalParquetRecordWriter( | |
| this.writeSupport = Objects.requireNonNull(writeSupport, "writeSupport cannot be null"); | ||
| this.schema = schema; | ||
| this.extraMetaData = extraMetaData; | ||
| this.rowGroupSize = rowGroupSize; | ||
| this.rowGroupSizeThreshold = rowGroupSize; | ||
| this.rowGroupRecordCountThreshold = props.getRowGroupRowCountLimit(); | ||
| this.nextRowGroupSize = rowGroupSizeThreshold; | ||
| this.compressor = compressor; | ||
| this.validating = validating; | ||
|
|
@@ -166,9 +166,16 @@ public long getDataSize() { | |
| } | ||
|
|
||
| private void checkBlockSizeReached() throws IOException { | ||
| if (recordCount | ||
| >= recordCountForNextMemCheck) { // checking the memory size is relatively expensive, so let's not do it | ||
| // for every record. | ||
| if (recordCount >= rowGroupRecordCountThreshold) { | ||
| LOG.debug("record count reaches threshold: flushing {} records to disk.", recordCount); | ||
|
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe better to escalate the log level to INFO? It should not be noisy.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe it would not be noisy to raise the level to INFO but what purpose does it serve? If we want to answer why a Parquet file looks like how it does (the number/size of row groups etc.), the logs of the file creation are probably long gone.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For a typical Hadoop YARN cluster, which serves Spark workloads, application logs are preserved for a few days by collecting and aggregating to HDFS. Anyway, it's not a big deal, we can always use parquet-cli to analyze the suspicious Parquet files :)
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, what I was thinking of, you write a Parquet file, and then, you read it somewhere else months/years later. You won't have the related logs for sure. It is a very separate topic, but it would be a good idea to write these properties directly into the footer so if we have any issues with a file, at least we would know, how it was created. |
||
| flushRowGroupToStore(); | ||
| initStore(); | ||
| recordCountForNextMemCheck = min( | ||
| max(props.getMinRowCountForPageSizeCheck(), recordCount / 2), | ||
| props.getMaxRowCountForPageSizeCheck()); | ||
| this.lastRowGroupEndPos = parquetFileWriter.getPos(); | ||
| } else if (recordCount >= recordCountForNextMemCheck) { | ||
| // checking the memory size is relatively expensive, so let's not do it for every record. | ||
| long memSize = columnStore.getBufferedSize(); | ||
| long recordSize = memSize / recordCount; | ||
| // flush the row group if it is within ~2 records of the limit | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since there is an existing
parquet.block.size