Size computation slows bulk insert significantly

The size computation requires a small memcpy from device to host followed by a stream synchronization. **Each one** causes serious performance degradation.

https://github.com/NVIDIA/cuCollections/blob/8786234a38a4d1283c6dc2011e45d14801510725/include/cuco/detail/static_map.inl#L149-L151

The synchronization is bad because it blocks the host thread until the stream drains, so the host cannot enqueue work on other, unrelated streams in the meantime.

The memcpy is bad because future copies are queued behind this one on architectures that have a limited number of CUDA copy engines.

I was able to get a significant performance improvement by deleting these lines.

There ought to be a better way to compute size. Perhaps a lazy method that only does the copy and synchronization when the size is actually requested. If this is too difficult, you might consider using templates to let the user choose not to maintain `size_` at all: use a template parameter to change the type of `size_` from int to a struct that has no members, so that it takes up essentially no space. Provide no methods on this struct so that `size_` doesn't get used accidentally. It will still use a little space on the host, but that seems like no big deal.
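To make the template suggestion concrete, here is a minimal host-only sketch, assuming C++17. The names `no_size` and `bulk_inserter` are hypothetical and are not part of cuCollections. The `insert` body is a stand-in that just counts the keys passed in; in the real code that branch is where the device-to-host copy and the synchronization would live.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical tag type: no members and no methods, so a size can never be
// read from it by accident.
struct no_size {};

// Hypothetical stand-in for a map with a bulk insert; not a cuCollections type.
template <typename SizeT = std::size_t>
class bulk_inserter {
 public:
  static constexpr bool tracks_size = !std::is_same_v<SizeT, no_size>;

  void insert(std::vector<int> const& keys) {
    // ... the insert kernel would be launched here in the real code ...
    if constexpr (tracks_size) {
      // Only this branch would need the cudaMemcpyAsync + cudaStreamSynchronize.
      // Here we simply count the keys as a placeholder for the success counter.
      size_ += static_cast<SizeT>(keys.size());
    }
  }

  // size() only participates in overload resolution when the size is tracked.
  template <typename S = SizeT,
            typename = std::enable_if_t<!std::is_same_v<S, no_size>>>
  S size() const {
    return size_;
  }

 private:
  SizeT size_{};
};
```

With `bulk_inserter<>` the size is maintained as before; with `bulk_inserter<no_size>` the counting branch is compiled out and calling `size()` is a compile error. An empty struct member still occupies at least one byte on the host, which matches the "some space on the host" caveat above.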


Referenced lines from `static_map.inl` (L149-L151):

	CUCO_CUDA_TRY(cudaMemcpyAsync(
	  &h_num_successes, num_successes_, sizeof(atomic_ctr_type), cudaMemcpyDeviceToHost, stream));
	CUCO_CUDA_TRY(cudaStreamSynchronize(stream));
