Conversation

@TwentyPast4
Contributor

Type

  • Bug fix (non-breaking change which fixes an issue): Fixes #
  • New feature (non-breaking change which adds functionality): Resolves #
  • Breaking change (fix or feature that would cause existing functionality to not work as expected): Resolves #

Motivation and Context

Gram and row Gram matrix computations (i.e. A.T @ A and A @ A.T) are relatively common in linear algebra (e.g. least squares, linear-independence tests, ML kernels, ...).
If you execute A.T().Matmul(A), at least one of the two matrices in the matmul is non-contiguous, so matmul first makes a contiguous copy, which can be a noticeable performance loss.
Gram() and RowGram() are implemented on a single matrix, with the transposition handled inside the gemm call. This means that if A is contiguous, no copy is performed, which is not true for A.T().Matmul(A).
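The contiguity issue can be illustrated with NumPy (a sketch of the same memory-layout behavior, not Open3D code): a transpose is a strided view, so the transpose of a C-contiguous matrix is not itself C-contiguous, and a BLAS gemm call that wants contiguous row-major input must copy it first. Passing a transpose flag to gemm instead, as Gram()/RowGram() do, avoids that copy.

```python
import numpy as np

# A is C-contiguous (row-major), but its transpose is only a strided view.
A = np.arange(6, dtype=np.float64).reshape(2, 3)
print(A.flags["C_CONTIGUOUS"])    # True
print(A.T.flags["C_CONTIGUOUS"])  # False

# Both the Gram matrix A.T @ A and the row Gram matrix A @ A.T can be
# computed from the single contiguous buffer of A by telling gemm to
# treat one operand as transposed.
gram = A.T @ A       # shape (3, 3)
row_gram = A @ A.T   # shape (2, 2)
print(np.allclose(gram, gram.T))  # True: a Gram matrix is symmetric
```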

Checklist:

  • I have run python util/check_style.py --apply to apply Open3D code style
    to my code.
  • This PR changes Open3D behavior or adds new functionality.
    • Both C++ (Doxygen) and Python (Sphinx / Google style) documentation is
      updated accordingly.
    • I have added or updated C++ and / or Python unit tests OR included test
      results
      (e.g. screenshots or numbers) here.
  • I will follow up and update the code if CI fails.
  • For fork PRs, I have selected Allow edits from maintainers.

Description

Added Gram() and RowGram() functions to Tensor. These are intended for <=2D tensors, similar to how T() is implemented.

@update-docs

update-docs bot commented Nov 28, 2025

Thanks for submitting this pull request! The maintainers of this repository would appreciate if you could update the CHANGELOG.md based on your changes.

@ssheorey
Member

ssheorey commented Jan 5, 2026

Hi @TwentyPast4, thanks for adding this interesting new operation. Where is this used in Open3D, or a 3D workflow? How much of a performance gain does it provide to that workflow?

@TwentyPast4
Contributor Author

TwentyPast4 commented Jan 6, 2026

> Hi @TwentyPast4, thanks for adding this interesting new operation. Where is this used in Open3D, or a 3D workflow? How much of a performance gain does it provide to that workflow?

The current uses of this operation that I could find in the Open3D codebase are:

  • for Gram in RGBDOdometry testing, Ttrans.T().Matmul(Ttrans)
  • for RowGram in FilterGaussian, mask.Matmul(mask.T())

The workflow is solving least-squares problems, for example fitting a conic to point clouds. This is useful when working with point clouds that contain objects of a known shape whose properties you want to measure.
My specific use case is fitting a cone to point cloud scans of tree trunks.
I would love to also include a full workflow with pre-made code for fitting a conic (or other 3D shapes) to point cloud data, but I'm afraid I do not have the time for that at the moment.
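To illustrate where the Gram matrix appears in that workflow, here is a NumPy sketch (not Open3D code) of ordinary least squares via the normal equations, using a hypothetical quadratic fit; the matrix `A.T @ A` in the normal equations is exactly what a copy-free Gram() would compute.

```python
import numpy as np

# Hypothetical example: fit y = c0 + c1*x + c2*x^2 by least squares.
# The normal equations (A.T @ A) c = A.T @ y use the Gram matrix of the
# design matrix A.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 - 1.0 * x + 0.5 * x**2 + 0.01 * rng.standard_normal(50)

A = np.stack([np.ones_like(x), x, x**2], axis=1)  # design matrix, shape (50, 3)
gram = A.T @ A                                    # Gram matrix, shape (3, 3)
coeffs = np.linalg.solve(gram, A.T @ y)
print(coeffs)  # close to [2.0, -1.0, 0.5]
```

(For ill-conditioned problems a QR or SVD solver is preferred over the normal equations, but the Gram-matrix form is the one that benefits from this PR.)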

The performance difference is similar for Gram and RowGram: on my hardware it is a 6-7x speedup on CPU and about 2x on GPU (3.60 s vs 22.25 s on CPU, 14.01 s vs 24.37 s on GPU).
This is the perf test I ran for this data:

TEST_P(LinalgPermuteDevices, GramPerf) {
    core::Device device = GetParam();

    // Gram test.
    core::Tensor A = core::Tensor::Init<float>({{1, 2, 3}, {4, 5, 6}}, device);

    auto start = std::chrono::steady_clock::now();
    core::Tensor B;
    for (int i = 0; i < 1000000; ++i) {
        B = A.Gram();
    }
    auto after_gram = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000000; ++i) {
        B = A.T().Matmul(A);
    }
    auto finish = std::chrono::steady_clock::now();

    double elapsed_gram =
            std::chrono::duration_cast<std::chrono::microseconds>(
                    after_gram - start).count() * 1e-6;
    double elapsed_matmul =
            std::chrono::duration_cast<std::chrono::microseconds>(
                    finish - after_gram).count() * 1e-6;
    EXPECT_LT(elapsed_gram, elapsed_matmul);
}

It should be noted that compiler optimizations may skew this kind of benchmark (e.g. the unused intermediate results could in principle be optimized away), but the numbers should be ballpark-accurate.
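For reference, the totals above convert to per-call times as follows (simple arithmetic on the reported numbers over 1,000,000 iterations):

```python
# Convert the reported benchmark totals to per-call times and speedups.
iters = 1_000_000
for label, gram_s, matmul_s in [("CPU", 3.60, 22.25), ("GPU", 14.01, 24.37)]:
    per_gram_us = gram_s / iters * 1e6
    per_matmul_us = matmul_s / iters * 1e6
    print(f"{label}: {per_gram_us:.2f} us/call (Gram) vs "
          f"{per_matmul_us:.2f} us/call (T+Matmul), "
          f"{matmul_s / gram_s:.1f}x speedup")
```

This works out to roughly 3.6 us vs 22.3 us per call on CPU (about 6.2x) and 14.0 us vs 24.4 us on GPU (about 1.7x).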
