Skip to content

feat(io): add bulk delete API to FileIO.#659

Merged
wgtmac merged 5 commits into
apache:mainfrom
slfan1989:add-fileio-bulk-delete-api
May 21, 2026
Merged

feat(io): add bulk delete API to FileIO.#659
wgtmac merged 5 commits into
apache:mainfrom
slfan1989:add-fileio-bulk-delete-api

Conversation

@slfan1989
Copy link
Copy Markdown
Contributor

@slfan1989 slfan1989 commented May 18, 2026

Summary

Add a new FileIO::DeleteFiles(...) API as a bulk deletion entry point.

The default implementation deletes files sequentially by calling the existing
DeleteFile(...) method and returns the first deletion error encountered.

This PR only adds the API and backward-compatible fallback behavior. It does not
yet update ExpireSnapshots to use DeleteFiles(...), and it does not introduce
parallel deletion.

Fixed: #658

Motivation

ExpireSnapshots and other cleanup flows may need to delete many files. A
bulk deletion API gives FileIO implementations a common extension point for
future optimized deletion strategies, such as storage-native batch deletion or
parallel fallback deletion.

@slfan1989
Copy link
Copy Markdown
Contributor Author

@wgtmac Could you please help review this PR? This is my first code contribution to the iceberg-cpp module, and I would really appreciate your guidance and help. Thank you very much!

Comment thread src/iceberg/file_io.h
Comment thread src/iceberg/file_io.h Outdated
///
/// \param file_locations The locations of the files to delete.
/// \return void if all deletes succeeded, an error code if any delete failed.
virtual Status DeleteFiles(std::span<const std::string> file_locations) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better if you could make some changes that utilize this new API. I saw you mentioned ExpireSnapshots.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that makes sense. I originally planned to split this into two PRs: first add the FileIO::DeleteFiles API, then update ExpireSnapshots to use it.

I agree that this PR would be more useful if it also adopted the new API. I’ll take a look at updating ExpireSnapshots to use DeleteFiles for grouped file cleanup.

#include "iceberg/test/matchers.h"

namespace iceberg {
namespace {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this or its counterpart. If you want to wrap RecordingFileIO in an anonymous namespace, you should shrink the scope instead.

Suggested change
namespace {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I’ll shrink the anonymous namespace scope to cover only the helper RecordingFileIO and keep the tests outside of the iceberg namespace.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not putting all cases in the iceberg namespace so we don't need to use iceberg:: prefix everywhere?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. I moved the test cases into the iceberg namespace and kept RecordingFileIO in an anonymous namespace, so the iceberg:: prefixes are no longer needed.

Comment thread src/iceberg/file_io.h Outdated
/// \return void if all deletes succeeded, an error code if any delete failed.
virtual Status DeleteFiles(std::span<const std::string> file_locations) {
for (const auto& file_location : file_locations) {
auto status = DeleteFile(file_location);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: put the implementation to file_io.cc and use ICEBERG_RETURN_UNEXPECTED to save some lines.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update it. I'll move the default DeleteFiles implementation to file_io.cc and use ICEBERG_RETURN_UNEXPECTED for error propagation.

Comment thread src/iceberg/file_io.h Outdated
Comment thread src/iceberg/file_io.h Outdated
///
/// \param file_locations The locations of the files to delete.
/// \return void if all deletes succeeded, an error code if any delete failed.
virtual Status DeleteFiles(std::span<const std::string> file_locations) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be complete, let's override this function in arrow_io_internal.h by adapting to https://github.com/apache/arrow/blob/eab3d11b484df5bb6ddd75630f9bfa17c0a3f491/cpp/src/arrow/filesystem/filesystem.h#L274

A relevant question is whether to use const std::vector<std::string>& as the parameter so no conversion is needed to call the Arrow equivalent.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed FileIO::DeleteFiles to take const std::vectorstd::string& to match Arrow FileSystem::DeleteFiles, and added an ArrowFileSystemFileIO override that resolves Iceberg file locations and delegates to arrow_fs_->DeleteFiles(paths). I also added a LocalFileIO test for the bulk delete path.

#include "iceberg/test/matchers.h"

namespace iceberg {
namespace {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not putting all cases in the iceberg namespace so we don't need to use iceberg:: prefix everywhere?

@slfan1989 slfan1989 force-pushed the add-fileio-bulk-delete-api branch from 04cef13 to 1e963a0 Compare May 21, 2026 03:20
Co-authored-by: Gang Wu <ustcwg@gmail.com>

class RecordingFileIO : public FileIO {
public:
explicit RecordingFileIO(std::string failure_path = "")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name RecordingFileIO is not that obvious about its purpose but this is fine in the test.

@wgtmac wgtmac merged commit 6c781c7 into apache:main May 21, 2026
15 checks passed
@wgtmac
Copy link
Copy Markdown
Member

wgtmac commented May 21, 2026

Thanks @slfan1989 for working on this and @zhjwpku for the review!

@slfan1989
Copy link
Copy Markdown
Contributor Author

Thanks @slfan1989 for working on this and @zhjwpku for the review!

@wgtmac @zhjwpku Thank you for the reviews and guidance! I really appreciate your time and help throughout the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support bulk and parallel file deletion for ExpireSnapshots

3 participants