
Conversation

Contributor

@Alexandr-Solovev Alexandr-Solovev commented Mar 5, 2025

Changes Summary

  1. Added noise_variance computation
    Implemented calculation of the noise variance as part of the PCA result options. This enables estimating the variance left unexplained by the retained components (a minimal sketch of the idea follows this list).

  2. Added syevr-based eigen decomposition function
    Introduced a new function that uses LAPACK's syevr routine to improve the performance of eigenvector and eigenvalue computations for symmetric matrices.
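
For context, a minimal standalone sketch of the noise-variance idea, assuming the probabilistic-PCA convention (mean of the discarded eigenvalues). All names here are hypothetical; the actual implementation in this PR differs:

#include <cstddef>

// Hypothetical sketch, not oneDAL's API: noise variance as the mean of the
// eigenvalues discarded after keeping the first n_components.
double compute_noise_variance(const double* eigenvalues, // sorted descending
                              std::size_t n_features,
                              std::size_t n_components) {
    if (n_components >= n_features) return 0.0; // nothing was discarded
    double sum = 0.0;
    for (std::size_t i = n_components; i < n_features; ++i)
        sum += eigenvalues[i];
    return sum / static_cast<double>(n_features - n_components);
}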


The PR should start as a draft, then move to the ready-for-review state after CI has passed and all applicable checkboxes are closed.
This approach ensures that reviewers don't spend extra time asking for regular requirements.

You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way.
For example, a docs-only PR doesn't require the performance checkboxes, while a PR with any change to actual code should keep them and justify how the change is expected to affect performance (or the justification should be self-evident).

Checklist to comply with before moving PR from draft:

PR completeness and readability

  • I have reviewed my changes thoroughly before submitting this pull request.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes, or created a separate PR with the update and provided its number in the description, if necessary.
  • The Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have added respective label(s) to the PR if I have permission for that.
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green, or I have provided justification why they aren't.
  • I have extended the testing suite if new functionality was introduced in this PR.

Performance

  • I have measured performance for affected algorithms using scikit-learn_bench and provided at least a summary table with the measured data, if a performance change is expected.
  • I have provided justification why performance has changed or why changes are not expected.
  • I have provided justification why quality metrics have changed or why changes are not expected.
  • I have extended the benchmarking suite and provided a corresponding scikit-learn_bench PR if new measurable functionality was introduced in this PR.

@Alexandr-Solovev Alexandr-Solovev added the dpc++ Issue/PR related to DPC++ functionality label Jun 19, 2025
@Alexandr-Solovev
Contributor Author

/intelci: run

Contributor

@icfaust icfaust left a comment


Hopefully this review is useful. I see some basic for loops that could use pragmas, and some places where the work is done on the CPU where a GPU solution could probably also be added without too much work.

DAAL_INT iu = static_cast<DAAL_INT>(nFeatures);
DAAL_INT m;
DAAL_INT info;
// Could be modified to be a function parameter
Contributor

I'd add a little snippet referring to the application notes for values < 0 (https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/2025-0/syevx.html)
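
For reference, a rough sketch of what such error handling could look like, reusing the snippet's own names and following the LAPACK documentation's meaning of info (info < 0: the (-info)-th argument had an illegal value; info > 0: an internal error occurred):

if (info != 0)
{
    // info < 0 indicates an illegal argument (a programming error on our
    // side); info > 0 indicates a numerical failure inside syevr.
    return services::Status(services::ErrorPCAFailedToComputeCorrelationEigenvalues);
}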

return services::Status(services::ErrorPCAFailedToComputeCorrelationEigenvalues);
}

for (size_t i = 0; i < nComponents; ++i)
Contributor

Any way to add vectorization pragmas here? (thoughts @Vika-F ?)

Contributor

I haven't merged the PR with the pragma changes yet :-\ I need to collect the perf data first.
But I think the inner loop should be easily vectorizable, maybe even without pragmas; still, pragmas can help guide the compiler.

Contributor Author

In my opinion, the vectorization pragmas could be added later.

Contributor

Make/share a ticket for that. I don't think it would be too bad if @Vika-F 's PR is merged soon.
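
As an illustration of the pragma idea discussed above (not this PR's code): guiding the compiler on a simple accumulation loop could look like the sketch below. The DAAL sources provide macros such as PRAGMA_IVDEP / PRAGMA_VECTOR_ALWAYS for the same purpose; a portable equivalent is OpenMP SIMD.

#include <cstddef>

// Sketch: an OpenMP SIMD hint on a reduction loop. The pragma asks the
// compiler to vectorize and to treat `sum` as a reduction variable.
float sum_floats(const float* data, std::size_t n)
{
    float sum = 0.0f;
#pragma omp simd reduction(+ : sum)
    for (std::size_t i = 0; i < n; ++i)
    {
        sum += data[i];
    }
    return sum;
}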

auto eigvals = arr_eigval.get_data();
auto total_variance = arr_vars.get_data();

double noiseVariance = 0.0;
Contributor

pragmas here too?

Contributor Author

I think no

@@ -126,6 +126,21 @@ static result_t call_daal_kernel(const context_cpu& ctx,
result.set_singular_values(homogen_table::wrap(arr_singular_values, 1, component_count));
}

if (desc.get_result_options().test(result_options::noise_variance)) {
Contributor

Is this done twice, once for batch and once for incremental?

Contributor Author

No, in general it should be done once for each method:
batch: cov, svd, precomputed
incremental: svd, cov

std::int64_t rows_count_ = rows_count_global;
auto range = std::min(rows_count_, column_count) - component_count;
auto noise_variance =
compute_noise_variance_on_host(q, eigvals_host_tmp, range, { syevd_event });
Contributor

If I am reading this right, this means compute_noise_variance is only done on the host? If so, please note this in the code and in the description (possibly as a follow-up), or explain why it is done this way.

auto eigvals_ptr = eigenvalues.get_data();

Float sum = 0;
for (std::int64_t i = 0; i < range; ++i) {
Contributor

Is there a reason not to have an equivalent add-reduce call on GPU?

Contributor Author

I guess it's a performance question: usually the range is fairly small, and a reduce (with its parallel_for kernel initialization) takes more time than moving the data to the CPU and computing there.

Contributor

I'd leave a comment behind that design choice then, as that should help guide developers who may see this in the future.
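
For comparison, a device-side sum sketched with SYCL 2020's sycl::reduction (this assumes a SYCL 2020 implementation and USM device-accessible data; the helper name is illustrative):

#include <sycl/sycl.hpp>
#include <cstddef>

// Sketch: sum `n` eigenvalues on the device. As noted above, for the small
// ranges typical here the kernel-launch overhead can outweigh simply copying
// the values to the host and summing there.
float device_sum(sycl::queue& q, const float* eigvals, std::size_t n)
{
    float sum = 0.0f;
    {
        sycl::buffer<float> sum_buf(&sum, 1);
        q.submit([&](sycl::handler& h) {
            auto red = sycl::reduction(sum_buf, h, sycl::plus<float>());
            h.parallel_for(sycl::range<1>(n), red,
                           [=](sycl::id<1> i, auto& acc) { acc += eigvals[i]; });
        });
    } // the buffer's destructor waits and copies the result back into `sum`
    return sum;
}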

Comment on lines 377 to 383
Float sum = 0;
for (std::int64_t i = 0; i < column_count; ++i) {
sum += vars_ptr[i];
}

ONEDAL_ASSERT(sum > 0);
const Float inverse_sum = 1.0 / sum;
Contributor

Is there a reason why this was done on CPU?

copyArray(nFeatures * nComponents, fullEigenvectorsArray, eigenvectorsArray);
copyArray(nComponents, fullEigenvaluesArray, eigenvaluesArray);
// SYEVR branch
// In this case, we compute only nComponents eigenvectors and then sort them in descending order
Contributor

Sorting is mentioned in the comment, but there is no sorting in the code. Is that OK?

Contributor Author

The sorting for syevr is already included in computeEigenvectorsInplaceSyevr, but I can split the functions.

Contributor

In my opinion, it would be better to align the behavior of computeEigenvectorsInplace and computeEigenvectorsInplaceSyevr: either include sorting in both, or exclude it from both.

Otherwise, it is harder to get what's happening in the code.

Contributor

All LAPACK functions for symmetric eigenvalue problems produce the results in sorted order by design.

Contributor Author

I mean the sort direction: descending vs. ascending ordering.

@Alexandr-Solovev
Contributor Author

/intelci: run

@icfaust icfaust requested a review from Copilot June 23, 2025 04:56
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR introduces noise variance computation into oneDAL's PCA implementation and adds a new syevr-based eigen decomposition function to enhance performance and improve variance estimation.

  • Added noise variance getter/setter and corresponding computation in both CPU and GPU backends.
  • Updated tests and common option identifiers to support noise variance.
  • Removed legacy sign-flip code where it was replaced with GPU-specific implementations.

Reviewed Changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 1 comment.

Summary per file:

  • examples/oneapi/dpc/source/pca/pca_cor_dense_batch.cpp: Added console output for noise variance in the PCA train result.
  • cpp/oneapi/dal/algo/pca/train_types.{hpp,cpp}: Introduced noise variance getter/setter and implementation functions.
  • cpp/oneapi/dal/algo/pca/common.{hpp,cpp}: Added the noise_variance result option and updated indices consistently.
  • cpp/oneapi/dal/algo/pca/backend/* (multiple files): Integrated noise variance computation and replaced sign_flip usage.
  • cpp/daal/src/algorithms/pca/*: Updated DAAL integration to compute noise variance and added the syevr branch.

@@ -91,6 +91,60 @@ auto syevd_computation(sycl::queue& queue,
return std::make_tuple(eigenvalues, syevd_event);
}

template <typename Float>
auto prepare_eigenvectors_svd(sycl::queue& queue,
Contributor

Please add a description.

// SYEVR branch
// In this case, we compute only nComponents eigenvectors and then sort them in descending order
// inside the 'computeEigenvectorsInplaceSyevr' function
if (nComponents < nFeatures)
Contributor

Could it also limit the components by the number of rows in the data by this point? Is that info available here?

eigenvalues[i] = temp_eigenvalues[idx];
for (size_t j = 0; j < nFeatures; ++j)
{
eigenvectors[j + i * nFeatures] = temp_eigenvectors[j + idx * nFeatures];
Contributor

There's a dedicated LAPACK function to reorder columns:
https://www.netlib.org/lapack/explore-3.2.1-html/dlapmt.f.html

Plus there's C++ std::reverse for vectors:
https://en.cppreference.com/w/cpp/algorithm/reverse.html
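
A rough sketch of the std::reverse approach (the buffer names and row-major layout are assumed for illustration): LAPACK returns eigenvalues in ascending order, so reversing yields the descending order, and the eigenvector rows are swapped to match.

#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch: convert ascending LAPACK output to descending order.
void to_descending(std::vector<double>& eigenvalues,
                   std::vector<double>& eigenvectors, // row-major, one eigenvector per row
                   std::size_t n_features)
{
    std::reverse(eigenvalues.begin(), eigenvalues.end());
    const std::size_t n = eigenvalues.size();
    // Swap whole eigenvector rows so they follow the reversed eigenvalues.
    for (std::size_t i = 0; i < n / 2; ++i)
    {
        std::swap_ranges(eigenvectors.begin() + i * n_features,
                         eigenvectors.begin() + (i + 1) * n_features,
                         eigenvectors.begin() + (n - 1 - i) * n_features);
    }
}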

for (size_t i = 0; i < nComponents; ++i)
{
size_t idx = nComponents - 1 - i;
eigenvalues[i] = temp_eigenvalues[idx];
Contributor

Shouldn't this be discarding components with too small eigenvalues?


Float max_val = row[0];
Float abs_max = std::abs(row[0]);
for (std::int64_t j = 1; j < column_count; j++) {
Contributor

Would it perhaps be faster to use idamax from BLAS?
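
For reference, a minimal sketch of the idamax idea, assuming a CBLAS interface is available (the helper name is illustrative):

#include <cblas.h>
#include <cstddef>

// Sketch: index of the element with the largest absolute value in a row.
std::size_t argmax_abs(const double* row, std::size_t n)
{
    return static_cast<std::size_t>(cblas_idamax(static_cast<int>(n), row, /*incx=*/1));
}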

Contributor Author

I will add a TODO for investigation. For now it looks like it would make the PR bigger.

auto explained_variances_ratio_ptr = explained_variances_ratio.get_mutable_data();

Float sum = 0;
for (std::int64_t i = 0; i < column_count; ++i) {
Contributor

Could oneDPL be used for this kind of thing?

Contributor Author

I will add a TODO mark, but it may be done in the next PR.
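
For reference, the oneDPL version of such a sum could look roughly like this (assumes oneDPL is available and the data is USM device-accessible; the helper name is illustrative):

#include <oneapi/dpl/execution>
#include <oneapi/dpl/numeric>
#include <sycl/sycl.hpp>

// Sketch: a device-side sum via oneapi::dpl::reduce.
float dpl_sum(sycl::queue& q, const float* first, const float* last)
{
    auto policy = oneapi::dpl::execution::make_device_policy(q);
    return oneapi::dpl::reduce(policy, first, last, 0.0f);
}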

auto eigvals_ptr = eigenvalues.get_data();
auto singular_values_ptr = singular_values.get_mutable_data();

const Float factor = row_count - 1;
Contributor

Why '-1' here? Wouldn't this make the result not meet the necessary property that $(\mathbf{X} \mathbf{Z})^T (\mathbf{X} \mathbf{Z}) = \mathbf{I}$

auto compute_event = queue.submit([&](sycl::handler& h) {
h.depends_on(deps);
h.parallel_for(sycl::range<1>(component_count), [=](sycl::id<1> i) {
singular_values_ptr[i] = sycl::sqrt(factor * eigvals_ptr[i]);
Contributor

@david-cortes-intel david-cortes-intel Jun 23, 2025


syevr only generates results that are valid up to numerical tolerance. So a very small eigenvalue that should in theory be positive or zero could still end up as a very small negative number.
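
A minimal variant of the kernel above with that guard applied, reusing the snippet's own names (illustrative, not the PR's code):

auto compute_event = queue.submit([&](sycl::handler& h) {
    h.depends_on(deps);
    h.parallel_for(sycl::range<1>(component_count), [=](sycl::id<1> i) {
        // Clamp tiny negative eigenvalues to zero before the square root,
        // since syevr may return slightly negative values for eigenvalues
        // that are zero in exact arithmetic.
        const Float ev = sycl::fmax(eigvals_ptr[i], Float(0));
        singular_values_ptr[i] = sycl::sqrt(factor * ev);
    });
});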

@@ -45,6 +45,8 @@ class PCADenseBase : public Kernel
services::Status computeExplainedVariancesRatio(const data_management::NumericTable & eigenvalues,
const data_management::NumericTable & variances,
data_management::NumericTable & explained_variances_ratio);
services::Status computeNoiseVariances(const data_management::NumericTable & eigenvalues, const data_management::NumericTable & variances,
double & noise_variance);
Contributor Author

@Vika-F does it make sense to name the variable noiseVariance to match the DAAL-like style?

algorithmFPType tmp = eigenvalues[i];
algorithmFPType tmp_rev = eigenvalues[nComponents - 1 - i];

if (tmp < 0) tmp = algorithmFPType(0);
Contributor

When a singular value should in theory be zero when calculated to infinite precision, lapack will very likely output a number (either positive or negative) in the neighborhood of zero. The approach of discarding values below machine epsilon is more sensible given the way lapack works.

Contributor Author

Can you please clarify, in terms of code, what your suggestion is?

Contributor

The suggestion is to compare the (signed) values against a (positive) threshold, e.g.:

if (tmp < thr) {
    // discard
} else {
    // take
}