Hi, as I understand it, cusolver's geqr ignores batching when called from JAX. Manually implementing QR with Householder reflections (as in cublas) improves performance by a factor of 100 on an A100 for larger batches of matrices, e.g. 64×32768×256. However, cublas already ships an optimized batched routine, geqrfBatched, which would be a better fit. Is it possible to ask JAX or the XLA compiler to use the cublas backend for QR decomposition?
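For context, a minimal sketch of batching QR over the leading axis with jax.vmap, the pattern the question is about. The shapes here are small stand-ins for the 64×32768×256 case so the example runs quickly; they are illustrative, not from the original post.

```python
import jax
import jax.numpy as jnp

# vmap maps jnp.linalg.qr over the leading (batch) axis; jit compiles it.
batched_qr = jax.jit(jax.vmap(jnp.linalg.qr))

# Stand-in for a batch of tall-skinny matrices (original: 64 x 32768 x 256).
a = jax.random.normal(jax.random.PRNGKey(0), (4, 128, 32))
q, r = batched_qr(a)
print(q.shape, r.shape)  # (4, 128, 32) (4, 32, 32)
```

The default "reduced" mode returns per-matrix factors q (m×n) and r (n×n); which cusolver/cublas kernel actually runs underneath is decided by the backend, which is what the rest of this thread discusses.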
Replies: 2 comments 3 replies
I think that JAX does actually call geqrfBatched for batched systems: jax/jaxlib/gpu/solver_kernels_ffi.cc, lines 299 to 305 in ea22c3b. However, there is a heuristic that avoids the batched path for small batches of large matrices. That decision is made at runtime, not during lowering, so you won't see the cublas name embedded in the HLO. Hope this helps!
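A quick way to see the point about lowering: inspecting the lowered program for a vmapped QR shows a generic custom-call target rather than a cublas-vs-cusolver choice. This is a hedged sketch; the exact custom-call target name in the text depends on the backend and JAX version.

```python
import jax
import jax.numpy as jnp

fn = jax.jit(jax.vmap(jnp.linalg.qr))
a = jnp.ones((8, 64, 64))

# The lowered StableHLO text names a generic geqrf custom-call target;
# the cublas/cusolver dispatch happens later, at runtime.
hlo = fn.lower(a).as_text()
print("cublas" in hlo.lower())
```

If the heuristic mentioned above picked geqrfBatched at lowering time instead, the string "cublas" would appear here; because the choice is deferred to runtime, it does not.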
QR under JAX's vmap
Benchmark
I benchmarked SVD, QR, and Cholesky on 64×64 matrices across different batch sizes. The results show that:
[Table: Total Time (milliseconds) per batch size]
[Table: Scaling Ratio relative to batch=1 (ideal = 1.0)]
Analysis
An nsys profile of vmapped QR shows that the factorization itself does use a batched kernel (…).
Inspecting the repo confirms that there is currently no batched dispatch for the relevant routine:
jax/jaxlib/gpu/solver_kernels_ffi.cc, lines 375 to 381 in accb719
jax/jaxlib/gpu/solver_kernels_ffi.cc, lines 345 to 350 in accb719
This might explain the scaling behavior observed for batched QR compared to Cholesky.
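For reproducibility, a hedged sketch of the kind of benchmark described above: timing vmapped QR and Cholesky on 64×64 matrices across batch sizes. The helper `bench` and the chosen batch sizes are illustrative, not from the original post, and wall-clock timings on CPU will not match the GPU numbers discussed here.

```python
import time
import jax
import jax.numpy as jnp

def bench(fn, x, iters=10):
    """Average wall-clock time per call, after a compile/warm-up run."""
    jax.block_until_ready(fn(x))  # trigger compilation first
    t0 = time.perf_counter()
    for _ in range(iters):
        out = fn(x)
    jax.block_until_ready(out)  # wait for async dispatch to finish
    return (time.perf_counter() - t0) / iters

key = jax.random.PRNGKey(0)
for batch in (1, 8, 64):
    a = jax.random.normal(key, (batch, 64, 64))
    # Make a symmetric positive-definite batch for Cholesky.
    spd = a @ a.transpose(0, 2, 1) + 64 * jnp.eye(64)
    t_qr = bench(jax.jit(jax.vmap(jnp.linalg.qr)), a)
    t_chol = bench(jax.jit(jax.vmap(jnp.linalg.cholesky)), spd)
    print(f"batch={batch:3d}  qr={t_qr*1e3:.3f} ms  chol={t_chol*1e3:.3f} ms")
```

Comparing how each column grows with batch size (against the ideal ratio of 1.0 for fully batched kernels) is what surfaces the missing batched dispatch described above.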