Skip to content

Conversation

@gty404
Copy link
Contributor

@gty404 gty404 commented Apr 17, 2025

  • Added NamespaceContainer class to manage namespace hierarchy and table metadata
  • Implemented MemoryCatalog methods: ListTables, TableExists, DropTable, RegisterTable

@gty404 gty404 force-pushed the memory-catalog branch 3 times, most recently from 2774dc9 to e47caa0 Compare April 17, 2025 09:04
* the children of the root are returned.
* @return A vector of child namespace names.
*/
std::vector<std::string> ListChildrenNamespaces(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not return std::vector<Namespace>? Does it return next level or all levels?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is returning the namespace of the next level, returning std::string instead of Namespace, I think this makes it easier to get the name.

@Fokko
Copy link
Contributor

Fokko commented Apr 20, 2025

I know that you already took the time and effort to write this, but I did want to put this out here. In PyIceberg we've removed the MemoryCatalog from the codebase, mostly because the SQLCatalog (JDBCCatalog in Java) was proven to be much better. The SQLCatalog also allows running in-memory using a SQLite in-memory instance, but also provides the option to easily keep the state after the process terminates. Next to that, it also provides locking out of the box.

@wgtmac
Copy link
Member

wgtmac commented Apr 21, 2025

@Fokko AFAIK, the C++ ecosystem does not yet have a good ORM framework so it might be challenging to implement a SqlCatalog at the moment. @gty404 is an expert on this and he may have insight in it.

@gty404
Copy link
Contributor Author

gty404 commented Apr 21, 2025

@Fokko When implementing JDBCCatalog, ORM and SQLite need to be introduced, as @wgtmac also mentioned. It's difficult to find a suitable ORM framework in the C++ system, which might be due to the license or ease of use.

@Fokko
Copy link
Contributor

Fokko commented Apr 23, 2025

@gty404 @wgtmac I see, got it. Thanks for considering it, then the InMemoryCatalog is the next best option 👍

@gty404
Copy link
Contributor Author

gty404 commented Apr 24, 2025

@yingcai-cy @wgtmac @lidavidm @zhjwpku Could you help me review it again?

Copy link
Collaborator

@zhjwpku zhjwpku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have two minor review comments, other than that, LGTM.

std::unique_ptr<iceberg::TableBuilder> BuildTable(const TableIdentifier& identifier,
const Schema& schema) const override;

private:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: what about using pimpl idiom to hide the implementation detail like InMemoryNamespace and reduce compile time?

class ICEBERG_EXPORT InMemoryCatalog : public Catalog {
public:
  ...

private:
  std::unique_ptr<class InMemoryCatalogImpl> impl_;
};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about placing InMemoryCatalogImpl and InMemoryNamespace in another header file? Their unit tests are also easy to add.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why unit tests are required to have access to InMemoryCatalogImpl? Isn't the Catalog interface enough to test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't InMemoryCatalogImpl and InMemoryNamespace require corresponding unit tests? The interfaces provided by these two may not be completely covered by the InMemoryCatalog test

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's reasonable.

* the children of the root are returned.
* \return A vector of child namespace names.
*/
Result<std::vector<std::string>> ListChildrenNamespaces(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use std::span

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The returned Namespace list is a temporary variable.

* tables, and child namespaces. This structure enables a tree-like representation
* of nested namespaces.
*/
class ICEBERG_EXPORT InMemoryNamespace {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was simply mirroring the Java impl when I added the Catalog without notice of SupportsNamespace: https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/catalog/SupportsNamespaces.java. Does it make sense to add all interfaces from SupportsNamespace to our Catalog API? iceberg-python did the same thing: https://github.com/apache/iceberg-python/blob/main/pyiceberg/catalog/__init__.py#L588 This aligns with the rest catalog api.

@Fokko Do you have an opinion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @wgtmac I think it makes sense to add all of them to the Catalog API. Keep in mind that some implementations (Rest, SQL) support nested namespaces, and some don't (Hive, Glue etc)


namespace iceberg {

class ICEBERG_EXPORT InMemoryCatalog : public Catalog {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some comments to describe its purpose. Especially to clarify that it is not supposed for production use.

* tables, and child namespaces. This structure enables a tree-like representation
* of nested namespaces.
*/
class ICEBERG_EXPORT InMemoryNamespace {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in #80 (comment), we need to move some of these functions (which are included in the rest catalog spec) as a part of the public Catalog API.

* \param[in] namespace_ident The namespace to check.
* \return Status indicating success or failure.
*/
Status NamespaceExists(const Namespace& namespace_ident) const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Result<bool>

* the children of the root are returned.
* \return A vector of child namespace names.
*/
Result<std::vector<std::string>> ListChildrenNamespaces(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If parent can be optional, is it better to rename it to ListNamespaces?

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating it! I've left some minor comments.

/// \return Status::OK if updated successfully;
/// ErrorKind::kNoSuchNamespace if the namespace does not exist;
/// ErrorKind::kNotSupported if the operation is not supported
virtual Status SetNamespaceProperties(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about merging SetNamespaceProperties and RemoveNamespaceProperties into a single UpdateNamespaceProperties(updates, removals)? This is a single call in the rest catalog spec.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am referring to the implementation in java implementations, which has two interfaces. Should we follow the REST catalog spec?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fokko WDYT?

Comment on lines +25 to +34
/**
* @brief An in-memory implementation of the Iceberg Catalog interface.
*
* This catalog stores all metadata purely in memory, with no persistence to disk
* or external systems. It is primarily intended for unit tests, prototyping, or
* demonstration purposes.
*
* @note This class is **not** suitable for production use.
* All data will be lost when the process exits.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/**
* @brief An in-memory implementation of the Iceberg Catalog interface.
*
* This catalog stores all metadata purely in memory, with no persistence to disk
* or external systems. It is primarily intended for unit tests, prototyping, or
* demonstration purposes.
*
* @note This class is **not** suitable for production use.
* All data will be lost when the process exits.
*/
/// \brief An in-memory implementation of the Iceberg Catalog interface.
///
/// This catalog stores all metadata purely in memory, with no persistence to disk
/// or external systems. It is primarily intended for unit tests, prototyping, or
/// demonstration purposes.
///
/// \note This class is **not** suitable for production use.
/// All data will be lost when the process exits.

@gty404
Copy link
Contributor Author

gty404 commented May 26, 2025

@Xuanwo @Fokko Could you help review this? Thanks!

@Fokko Fokko merged commit 5aec271 into apache:main May 27, 2025
6 checks passed
@Fokko
Copy link
Contributor

Fokko commented May 27, 2025

Yes, looks good @gty404 thanks!

@gty404 gty404 deleted the memory-catalog branch May 28, 2025 01:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants