Skip to content

Conversation

@ostinru
Copy link
Collaborator

@ostinru ostinru commented Apr 1, 2025

Hi Cloudberry team!
This is cherry-pick of commits from original greenplum/pxf. In this PR I picked all commits from v6.4.0 to v6.4.2.

Useful parts:

## 6.4.2 (09/19/2022
- #870 Relax the requirement that C string length matches Java string length

## 6.4.1 (09/16/2022)
- #858 Add JsonProtocolHandler to use HdfsFileFragmenter for multi-line JSON

+ minor documentation updates

It also includes changes in concourse code (do you run concourse tests?) and (test)automation framework. I keep these changes to make it easier to cherry-pick next portions of commits.

denalex and others added 14 commits April 1, 2025 23:21
PXF's automation test framework includes a helper method
(`addPathToPxfClassPath`) that modifies pxf-env.sh to set
`PXF_LOADER_PATH`. The problem is that the methods reads the content of
pxf-env.sh from $PXF_BASE/conf on the local host which may not be the
same as the file on a multinode cluster. Since the remote configuration
is copied down from the remote cluster to a temporary working directory,
the helper method should load pxf-env.sh from this location.

Additionally, this commit increases the size of PXF heap for multinode
tests. The original heap size of 512 MB was too small for some of our
tests but this wasn't noticed until now because the automation framework
modified pxf-env.sh without preserving any changes made to it during the
installation and setup of PXF on the multinode cluster.

Authored-by: Bradford D. Boyle <[email protected]>
… (#858)

This commit fixes an issue that could occur with multi-line JSON files. 

Previously, it was possible for a fragment to improperly parse a JSON object if the split started in the middle of a string, causing wrong results. This commit now uses the HdfsFileFragmenter for multi-line JSON files.
* docs - misc updates/additions to the orc write docs

* reword per bradford

* more rewording per bradford
…(#868)

In the case that there is no pivnet artifact to pull down, do not error
out. Instead, keep whatever artifact was already present in the bucket.
… (#870)

Commit 94f8cca added a check that the length of received strings (as
determined by strnlen) matched the length of the string as determined by
the Java-based PXF service; if they did not, this was treated as a fatal
error. Some users have reported that they have ORC/Parquet files with
strings that contain ASCCII NUL-bytes. These strings would not have a
strnlen calculate length that matches with the Java string lenght.

This commit removes the requirements that the lengths be equal and
instead logs a debug message when they do not match.

Authored-by: Bradford D. Boyle <[email protected]>
@ostinru ostinru marked this pull request as ready for review April 1, 2025 19:24
@tuhaihe
Copy link
Member

tuhaihe commented Apr 2, 2025

Cool, thanks @ostinru!

@tuhaihe tuhaihe requested a review from gfphoenix78 April 2, 2025 03:31
@tuhaihe
Copy link
Member

tuhaihe commented Apr 2, 2025

Hi @gfphoenix78 could you please help review this PR? Thanks!

@ostinru
Copy link
Collaborator Author

ostinru commented Apr 2, 2025

@tuhaihe do you have concourse-ci running on premise? How do you test PXF PRs?

@tuhaihe
Copy link
Member

tuhaihe commented Apr 3, 2025

Hi @ostinru, There is still no public cicd workflow for PXF.

Hey @edespino, Any ideas on setting up the cicd workflow on PXF? Love to hear your thoughts.

@ostinru
Copy link
Collaborator Author

ostinru commented Apr 3, 2025

Hi @ostinru, There is still no public cicd workflow for PXF.

Added PR with fast ci: #6

And I am working on heavy ci: open-gpdb/pxf#3
Plan is following:

  1. build deb/rpm package for cloudberry, and put it into /downloads directory (as concourse ci did before)
  2. build deb/rpm package for cloudberry-pxf, and put it into /downloads directory
  3. run automation/pg_regress tests in docker-compose

By doing this I am expecting that developers will be able to run tests & debug it locally in docker.
Also, usage of actions/cache@v4 (in github actions) and /downloads (locally) should reduce build time significantly.

@ostinru
Copy link
Collaborator Author

ostinru commented Apr 7, 2025

@gfphoenix78 - I was trying to pick all commits (so, next cherry-picks will be easier to do). Here is a list of commits I picked:

git merge-base cloudberry-pxf/main archive/main
08a30ac3990a7d147c48dda9683883d11d705cce
git log  08a30ac3990a7d147c48dda9683883d11d705cce..2f2f2b942d513ff033ac693b6101b81f88589819 --oneline
[pick] 2f2f2b94 Bump version to 6.4.2 (#871)
[pick] f2ca30c0 Relax the requirement that C string length matches Java string length (#870)
[skip] b5805e26 Bump version to 6.4.2-SNAPSHOT [skip ci]
[pick] d211b6d0 Bump version to 6.4.1 (#869)
[pick] 356b85be added specs for gpdb7-rhel8-test-pxf docker image (#867)
[pick] e1f40c85 ci: allow existing artifacts to remain if there is no pivnet product (#868)
[pick] 4eb0069f docs - command block formatting fix (#866)
[pick] 4acf1258 docs - misc updates/additions to the orc write docs (#864)
[pick] 0d9cad91 docs - update jdbc driver version number (#865)
[pick] 2e2d559f docs - clarify new prepared statement property (#856)
[pick] 14678acb Add JsonProtocolHandler to use HdfsFileFragmenter for multi-line JSON (#858)
[pick] bfbbea65 Fix bug when modifying PXF's classpath in automation (#861)
[pick] 5163e5f8 removed jsoup dependency and unused class from the automation code (#863)
[pick] 555d21a5 install Java 8 JDK in gp7 centos7 docker image (#862)
[pick] 4cfe89eb added Maven, Java 8 and pip install for GPDB7 docker images (#860)

@ostinru
Copy link
Collaborator Author

ostinru commented Apr 9, 2025

@reshke

@reshke
Copy link

reshke commented Apr 9, 2025

Hi @gfphoenix78 could you please help review this PR? Thanks!

can we merge this already? or should we wait for CI setup?

@ostinru
Copy link
Collaborator Author

ostinru commented Apr 10, 2025

Hi @ostinru, There is still no public cicd workflow for PXF.

Hey @edespino, Any ideas on setting up the cicd workflow on PXF? Love to hear your thoughts.

Finally I have merged our heavy CI pipeline [1]. Currently it checks that we can produce deb packages (this includes: java tests, compilation of go, java and C code).
This is a base for next steps: run /regression and /automation tests. I think that it will take some time to migrate it from concourse to docker-compose. But, I think it will enable developers to run any test locally.

[1] https://github.com/open-gpdb/pxf/pull/3/files

@tuhaihe
Copy link
Member

tuhaihe commented Apr 11, 2025

Hi @gfphoenix78 could you please help review this PR? Thanks!

can we merge this already? or should we wait for CI setup?

IMO, the CI setup needs more time. We can merge the PR first. If @gfphoenix78 doesn't respond for a while, we can proceed further.

Copy link
Contributor

@gfphoenix78 gfphoenix78 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tuhaihe tuhaihe merged commit 0cede2d into apache:main Apr 15, 2025
@tuhaihe
Copy link
Member

tuhaihe commented Apr 15, 2025

Hi @ostinru thanks for your great work!

@ostinru ostinru deleted the cherry-to-6-4-2 branch April 17, 2025 10:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants