
@jolivier23

Nasf-Fan and others added 20 commits July 22, 2025 21:14
…16536)

If the current transaction is aborted by a race while dtx_refresh()
yields, return a non-zero value to the sponsor to trigger a client-side
RPC retry. This leaves the related transaction in a cleaner state.

Add more checks after dtx_refresh() to avoid re-initializing an aborted
DTX.

The patch also cleans up the usage of vos_dtx_validation() to handle
the various DTX abort cases (possibly followed by a resend).

Signed-off-by: Fan Yong <[email protected]>
This avoids parent and child processes generating the same DTX ID.

It also changes the vos_dtx logic to avoid an assertion failure when a
client reuses a DTX ID.

Signed-off-by: Fan Yong <[email protected]>
#16617)

Address an issue where snapshot load fails because of an inconsistency
in the Member address-to-UUID map. Avoid duplicate UUID member entries
by fixing the removeMember function.

Signed-off-by: Tom Nabarro <[email protected]>
Signed-off-by: Kris Jacque <[email protected]>
Address a use-after-free in the JSON parsing code that extracts
daos_data from the SPDK engine-bootstrap config file. Avoid freeing the
JSON context until the relevant objects have been read and stored
elsewhere.

Signed-off-by: Tom Nabarro <[email protected]>
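
As an illustration of the ordering fix in this commit, here is a minimal C sketch, assuming a simplified context that owns the decoded strings. The names json_ctx, json_ctx_parse(), json_ctx_free(), copy_str(), and extract_daos_data() are hypothetical and are not SPDK or DAOS APIs; the sketch only shows the value being copied out before its owning context is freed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a parsed JSON context that owns the decoded text. */
struct json_ctx {
	char       *buf;       /* owned buffer holding the decoded values */
	const char *daos_data; /* borrowed pointer into buf, not separately owned */
};

/* Duplicate a NUL-terminated string into a new heap allocation. */
static char *
copy_str(const char *s)
{
	size_t len = strlen(s) + 1;
	char  *out = malloc(len);

	if (out != NULL)
		memcpy(out, s, len);
	return out;
}

/* Hypothetical "parse": the context copies the text and points daos_data into it. */
static struct json_ctx *
json_ctx_parse(const char *text)
{
	struct json_ctx *ctx = malloc(sizeof(*ctx));

	if (ctx == NULL)
		return NULL;
	ctx->buf = copy_str(text);
	if (ctx->buf == NULL) {
		free(ctx);
		return NULL;
	}
	ctx->daos_data = ctx->buf;
	return ctx;
}

/* Release the context and every buffer it owns. */
static void
json_ctx_free(struct json_ctx *ctx)
{
	free(ctx->buf);
	free(ctx);
}

/*
 * Safe ordering: copy the value while the context still owns valid memory,
 * then free the context. Returning ctx->daos_data after json_ctx_free()
 * would be a use-after-free.
 */
static char *
extract_daos_data(const char *text)
{
	struct json_ctx *ctx = json_ctx_parse(text);
	char            *value;

	if (ctx == NULL)
		return NULL;
	value = copy_str(ctx->daos_data);
	json_ctx_free(ctx);
	return value;
}
```

The caller owns the returned copy and must free() it.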
#16645)

Add the ORF_FETCH_EPOCH_EC_AGG_BOUNDARY flag for rebuild fetch. The
container's sc_ec_agg_eph_boundary may differ between the initiator and
target engines of the rebuild fetch, so the fetch epoch selected by the
initiator may be lower than the readable epoch on the target engine if
VOS aggregation merged adjacent extents into a higher epoch. In this
case, increase the fetch epoch to sc_ec_agg_eph_boundary.

Signed-off-by: Xuezhao Liu <[email protected]>
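
A minimal C sketch of the epoch adjustment, assuming the target engine applies it when the flag is set. FETCH_FLAG_EC_AGG_BOUNDARY stands in for ORF_FETCH_EPOCH_EC_AGG_BOUNDARY and the ec_agg_boundary parameter for sc_ec_agg_eph_boundary; the type alias, function name, and signature are hypothetical, not the actual DAOS code.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t epoch_t; /* hypothetical alias for an epoch value */

/* Hypothetical stand-in for ORF_FETCH_EPOCH_EC_AGG_BOUNDARY. */
#define FETCH_FLAG_EC_AGG_BOUNDARY 0x1u

/*
 * Raise the initiator-selected fetch epoch to the local EC aggregation
 * boundary when the flag is set and the requested epoch is below it;
 * otherwise keep the requested epoch unchanged.
 */
static epoch_t
adjust_fetch_epoch(epoch_t req_epoch, uint32_t flags, epoch_t ec_agg_boundary)
{
	if ((flags & FETCH_FLAG_EC_AGG_BOUNDARY) && req_epoch < ec_agg_boundary)
		return ec_agg_boundary;
	return req_epoch;
}
```

Without the flag, the requested epoch is returned untouched, so fetches that are not rebuild fetches keep their original semantics.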
When a faulty SSD is replaced, reintegration is triggered automatically
once local setup has completed (ds_pool_child started).

However, an admin could manually run "dmg pool reintegrate" before local
setup is done, so we need to return a retryable error to make
reintegration keep retrying until the local ds_pool_child has started.

Signed-off-by: Niu Yawei <[email protected]>
…#16665)

When a new access point was added to the config and the system was restarted, the member was updated rather than added, so it was not considered a voter in the MS leader election.

Signed-off-by: Kris Jacque <[email protected]>
Backport of PR-16586 and updated with:

ci/provisioning/post_provision_config_nodes_LEAP.sh:
  Something in Leap-15.6 added an additional dependency to the
  distro-provided lua-lmod; that dependency is not removed when lua-lmod
  is removed, and it blocks installation of the newer lua-lmod.

Signed-off-by: John E. Malmberg <[email protected]>
Add documentation on how to add or remove MS replicas.

Also remove an unused variable in a unit test.

Signed-off-by: Kris Jacque <[email protected]>
…- b26 (#16669)

When the container is closed, the dtx_flush_on_close logic tries to
commit pending committable DTX entries. If that flush fails for some
reason, it asks the async batched-commit logic to do it later. But if
the container is stopping, do not re-add the container to the async
batched-commit list; otherwise the stop logic may be blocked for a long
time (or forever).

Handle similar cases when opening or closing the container.

Also clean up some DTX code.

Signed-off-by: Fan Yong <[email protected]>
Tag second release candidate for 2.6.4.

Signed-off-by: Dalton Bohning <[email protected]>
… - b26 (#16726)

The DTX logic maintains a batched commit list. Each opened container has
its own 'dtx_batched_cont_args' (dbca) item in that list. If a container
is already in the list, do not re-add it; otherwise the list may be
corrupted.

Signed-off-by: Fan Yong <[email protected]>
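
The guard described in this commit, together with the "do not re-add a stopping container" rule from the flush-on-close commit, can be sketched in C with a simplified circular doubly linked list. struct dbca_item, its linked and stopping fields, and the helper functions are hypothetical stand-ins for dtx_batched_cont_args and the real DTX list code; the sketch only shows why a second insertion must be refused.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a per-container batched-commit item. */
struct dbca_item {
	struct dbca_item *prev, *next; /* links in a circular doubly linked list */
	bool              linked;      /* true while the item is on the list */
	bool              stopping;    /* hypothetical flag: container is stopping */
};

/* Initialize a node (or list head) as an unlinked, self-pointing element. */
static void
list_init(struct dbca_item *item)
{
	item->prev = item->next = item;
	item->linked = false;
	item->stopping = false;
}

/*
 * Append an item unless it is already linked (re-linking would overwrite
 * its prev/next pointers and corrupt the list) or its container is
 * stopping. Returns true only if the item was added.
 */
static bool
dbca_list_add(struct dbca_item *head, struct dbca_item *item)
{
	if (item->linked || item->stopping)
		return false;
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
	item->linked = true;
	return true;
}

/* Unlink an item that is currently on the list. */
static void
dbca_list_del(struct dbca_item *item)
{
	assert(item->linked);
	item->prev->next = item->next;
	item->next->prev = item->prev;
	item->prev = item->next = item;
	item->linked = false;
}

/* Count the items currently on the list. */
static size_t
list_len(const struct dbca_item *head)
{
	size_t n = 0;

	for (const struct dbca_item *it = head->next; it != head; it = it->next)
		n++;
	return n;
}
```

Keeping an explicit "linked" flag makes the membership check O(1) instead of requiring a walk of the list before every insertion.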
)

Give create_release.yml write permission so it can create tags.
Also exit on error.

Signed-off-by: Dalton Bohning <[email protected]>
…16723)

This test failed if the name of the host running it contained a
capital letter, because the fault domain code normalizes names to
lowercase.

Signed-off-by: Kris Jacque <[email protected]>
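
A minimal C sketch of the mismatch and its fix, assuming the test compares a raw host name against one the fault domain code has already lowercased. normalize_host() and host_matches() are hypothetical helpers, not DAOS functions.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Lowercase a host name in place, mirroring the normalization described above. */
static void
normalize_host(char *name)
{
	for (; *name != '\0'; name++)
		*name = (char)tolower((unsigned char)*name);
}

/*
 * Compare an expected host name against one reported by the (already
 * normalized) fault domain code, lowercasing the expected value first so
 * mixed-case host names still match. Names too long for the scratch
 * buffer are treated as a mismatch.
 */
static bool
host_matches(const char *expected, const char *reported)
{
	char   buf[256];
	size_t len = strlen(expected);

	if (len >= sizeof(buf))
		return false;
	memcpy(buf, expected, len + 1);
	normalize_host(buf);
	return strcmp(buf, reported) == 0;
}
```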
* DAOS-17828 vos: fix a pointer misuse (#16635)

The handle passed to evt_iter_probe() is an EVT context, not a VOS iterator.

Signed-off-by: Jan Michalski <[email protected]>
Add aggregation debug information about the pool state to help diagnose unexpected ENOSPACE errors.

Signed-off-by: Cedric Koch-Hofer <[email protected]>
Suppress an additional case of tsan::TraceRestartMemoryAccess with a
slightly different call stack. This is a false positive coming from the
Go runtime.

Also moved another tsan suppression to be near similar ones, and named
them more descriptively.

Signed-off-by: Kris Jacque <[email protected]>
The current DTX resync mechanism performs DTX-leader-sponsored scanning
of the specified container. But if the current DTX leader is dead, DTX
leadership switches to another target, on which the related entry may
not exist or may already have been committed. In that case, DTX resync
on the new DTX leader does not handle that DTX entry, so the
corresponding DTX entries on other non-leaders may become "orphans".

Such orphan DTX entries may affect subsequent rebuild. This patch
introduces a DTX orphan cleanup mechanism that handles them before
rebuild scans the related container.

Signed-off-by: Fan Yong <[email protected]>
…le/2.6

Change-Id: Ia2ca4e64b86cdd8b7641e9c15ad9ada56585b5f9
Signed-off-by: Jeff Olivier <[email protected]>

github-actions bot commented Sep 3, 2025

Errors are component not formatted correctly,Ticket number prefix incorrect,PR title is malformatted. See https://daosio.atlassian.net/wiki/spaces/DC/pages/11133911069/Commit+Comments,Unable to load ticket data
https://daosio.atlassian.net/browse/Merge

@jolivier23 jolivier23 merged commit dd8e8a1 into google/2.6 Sep 5, 2025
55 of 58 checks passed
@jolivier23 jolivier23 deleted the jeffolivier/google/2.6 branch September 5, 2025 15:09