Skip to content

Conversation

@sharath1709
Copy link
Contributor

@sharath1709 sharath1709 commented Jan 9, 2025

What is the purpose of the change

This pull request makes changes to replace the usage of SQL connection with DataSource to improve the multithreaded performance of the JDBC plugin.

Brief change log

  • *Replace all usages of SQL connection with DataSource *

Verifying this change

This change is already covered by existing unit tests. The test cases have been modified appropriately to use DataSource rather than Connection everywhere

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changes to the CustomResourceDescriptors: no
  • Core observer or reconciler logic that is regularly executed: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@sharath1709 sharath1709 changed the title [FLINK-36696] Switch sql connection usages to datasource [FLINK-36696] [flink-autoscaler-plugin-jdbc] Switch sql connection usages to datasource Jan 9, 2025
@1996fanrui 1996fanrui self-assigned this Jan 10, 2025
Copy link
Member

@1996fanrui 1996fanrui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sharath1709 for the contribution!

Overall LGTM! And I still have some comments:

  1. The first question is related to connection leak for testing. This PR will create lots of new dataSources, but the close isn't called. IIUC, it will have connection leak.
  2. The CI is failed, please take a look in your free time.
  3. We encountered a bug related to this fix recently. The database master is crashed, the slave is promoted to be the master. The jdbc state store and event handler won't work due to they are using the fixed connection.
    • As I understand, DataSource could solve this problem as well.
    • When old connection is crashed, DataSource will create a new physical connection to connect the new database master server.
    • Would you mind adding a test to check it? In the ITCase, we only have one databse server, I'm not sure could it be reproduced when the database server is restarted.

Comment on lines 59 to 64
@AfterEach
void afterEach() throws SQLException {
if (conn != null) {
conn.close();
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't close dataSource here?

As I understand, each test will create a new dataSource. The old dataSource will leak some connections if we don't close dataSource here.

try (var conn = getConnection()) {
var jdbcStateInteractor = new JdbcStateInteractor(conn);
assertThat(jdbcStateInteractor.queryData(jobKey)).isEmpty();
var dataSource = getDataSource();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's similar with the last comment, we didn't close the dataSource.

@sharath1709
Copy link
Contributor Author

Thanks a lot @1996fanrui for the quick review. Please find my replies below

  1. Typically, DataSource connections will be relinquished automatically after a time out and DataSource isn't Closeable. However, some DataSource implementations like HikariDataSource implement the Closeable interface to close the ConnectionPool. We can add a check to see if the DataSource is Closeable and then close it. Additionally, I've attempted to make the classes that hold the datasource objects Autocloseable as well so that they could be created in a try-with-resources block
  2. Thanks, I've tried to fix most of the issues but might need a couple more iterations.
  3. Great point, I will try to add a test case for this scenario soon

Copy link
Member

@1996fanrui 1996fanrui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sharath1709 , thanks for the update!

I left some comments, and the CI is still failed (It cannot be compiled for now. ).

You could try to execute mvn clean install -DskipTests in your local before next push, thanks~


try {
conf.set(JDBC_URL, String.format("%s;shutdown=true", jdbcUrl));
HikariJDBCUtil.getConnection(conf).close();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, conf.set(JDBC_URL, String.format("%s;shutdown=true", jdbcUrl)) means shutdown the derby server. Don't we need to call it here?

Copy link
Contributor Author

@sharath1709 sharath1709 Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I wasn't familiar with this and assumed it only closes the connection. We can continue to keep the same logic and additionally close the datasource as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both close connection and datasource should be wrapped inside of try to prevent test failure.

We can check the comment: database shutdown ignored exception, this catch means the test won't be failed if database shutdown throws exception.

@sharath1709 sharath1709 force-pushed the FLINK-36696 branch 3 times, most recently from 350dfdf to 7d29111 Compare January 22, 2025 00:40
Copy link
Member

@1996fanrui 1996fanrui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sharath1709 for the update, overall LGTM. I left 2 minor comments, please take a look in your free time, thanks~

  1. We encountered a bug related to this fix recently. The database master is crashed, the slave is promoted to be the master. The jdbc state store and event handler won't work due to they are using the fixed connection.

    • As I understand, DataSource could solve this problem as well.
    • When old connection is crashed, DataSource will create a new physical connection > to connect the new database master server.
    • Would you mind adding a test to check it? In the ITCase, we only have one databse server, I'm not sure could it be reproduced when the database server is restarted.

I have checked it manually on my Mac, it works well with this PR. Also, as we discussed before, it's better to add a test to check it.

Comment on lines 82 to 83
var dataSource = getDataSource();
try (var conn = dataSource.getConnection();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
var dataSource = getDataSource();
try (var conn = dataSource.getConnection();
try (var dataSource = getDataSource();
var conn = dataSource.getConnection();


try {
conf.set(JDBC_URL, String.format("%s;shutdown=true", jdbcUrl));
HikariJDBCUtil.getConnection(conf).close();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both close connection and datasource should be wrapped inside of try to prevent test failure.

We can check the comment: database shutdown ignored exception, this catch means the test won't be failed if database shutdown throws exception.

@sharath1709
Copy link
Contributor Author

sharath1709 commented Jan 23, 2025

Thanks, @1996fanrui for your patience with the review. I addressed the remaining comments and added a test case for db restart to AbstractJdbcStateInteractorITCase. The same could be added to AbstractJdbcEventInteractorITCase if needed but it's a bit redundant. I also verified that the test case failed without the change in this PR.

@1996fanrui
Copy link
Member

I addressed the remaining comments and added a test case for db restart to AbstractJdbcStateInteractorITCase. The same could be added to AbstractJdbcEventInteractorITCase if needed but it's a bit redundant.

Sounds make sense to me, testing restart in AbstractJdbcStateInteractorITCase is enough, but the test is failed.
https://github.com/apache/flink-kubernetes-operator/actions/runs/12939940745/job/36097810802?pr=929#step:8:1025

Also, I may need to do another round of overall review after the test passes.

@sharath1709
Copy link
Contributor Author

sharath1709 commented Jan 24, 2025

@1996fanrui It appears that MySQL/Postgresql DB restart test cases are failing in CI with SQLTransientException. Let me update the test to ignore transient exception as typically subsequent connection should be successful. Before this PR, the test case produces SQLNonTransientConnectionException. We may also consider adding retries but I feel it's overkill. lmk your thoughts

@1996fanrui
Copy link
Member

1996fanrui commented Jan 24, 2025

@1996fanrui It appears that MySQL/Postgresql DB restart test cases are failing in CI with SQLTransientException. Let me update the test to ignore transient exception as typically subsequent connection should be successful. Before this PR, the test case produces SQLNonTransientConnectionException. We may also consider adding retries but I feel it's overkill. lmk your thoughts

After I debug it on my Mac, I don't think catch SQLTransientException is reasonable. Please check the following figure, the bbbbb is not executed. It means we cannot ensure the connection works after restarting. So I think retry mechanism is needed, we need to retry queryData from data base until the result is expected.

Also, I added the retry mechanism on my Mac, but it still doesn't work, I found after restart, the port of MySQL Container is changed. It means the original data source won't be used.

  • I asked chatgpt for this situation, it suggests using pauseContainer and unpauseContainer to test instead of restart container.
  • I'm not the expert, so I'm not sure.
  • Also, I don't think this test is required for this PR, If you are not mind, we could remove it from this PR, then I or you could add it in the next PR. (This test is important in the long term.)
  • I don't want this test block this PR since next week is Chinese new year, and I will be on vacation.

@sharath1709
Copy link
Contributor Author

That sounds great, let me remove the test. I'm not the expert on this situation either and ChatGPT originally suggested starting and stopping the container to simulate a restart. I will try to explore it in my free time and add the test case later

@1996fanrui
Copy link
Member

One test fails since too many clients already, I'm not sure is there any client or connection leak?

Error:  Tests run: 17, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 33.745 s <<< FAILURE! - in org.apache.flink.autoscaler.jdbc.event.PostgreSQLJdbcAutoscalerEventHandlerITCase
Error:  org.apache.flink.autoscaler.jdbc.event.PostgreSQLJdbcAutoscalerEventHandlerITCase.testEventIntervalWithoutMessageKey  Time elapsed: 1.024 s  <<< ERROR!
com.zaxxer.hikari.pool.HikariPool$PoolInitializationException: Failed to initialize pool: FATAL: sorry, too many clients already
Caused by: org.postgresql.util.PSQLException: FATAL: sorry, too many clients already

https://github.com/apache/flink-kubernetes-operator/actions/runs/12958019250/job/36164630541?pr=929#step:8:2293

public void afterEach(ExtensionContext extensionContext) throws Exception {
try (var conn = getConnection();
try (var dataSource = getDataSource();
var conn = getDataSource().getConnection();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
var conn = getDataSource().getConnection();
var conn = dataSource.getConnection();

@sharath1709 I found a bug here, it should use data source instead of getDataSource() here.

The second getDataSource() is not closed. That's why connection is leaked.

#929 (comment)

Copy link
Member

@1996fanrui 1996fanrui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sharath1709 for the patience!

LGTM. I will rebase master later, and merge it if the CI is green.

@1996fanrui 1996fanrui merged commit d5d027e into apache:main Jan 26, 2025
107 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants