Skip to content

Conversation

@Engininja2
Copy link

The current version of CUDA allows you to access the component halfs of half2 through half2.x and half2.y, but in HIP x and y are unsigned shorts and not half or _float16. This replaces accessing those variables with intrinsic functions.

With the changes I can use --force_half2 and still get sensible text with my AMD GPU.

@ardfork
Copy link
Contributor

ardfork commented Jul 10, 2023

This doesn't work for me. Still spouting gibberish with -fh2.
I also do not understand your change for __compat_h2rcp, it's just a backport from ROCm 5.6 for those that use outdated version of ROCm, it can just be removed if using ROCm >= 5.6.

@Engininja2
Copy link
Author

That sucks. I thought I had figured it out. I'm using a 5700XT with rocblas and pytorch 2.0.1 recompiled on ROCm 5.4.3, so it's possible it's avoiding something else that happens related to feature differences like the dot product opcodes it doesn't have.

With __compat_h2rcp as from ROCM5.6, I got gibberish with silu_mul_cuda_kernel with half2.
A basic test kernel that calls that __compat_h2rcp gives the wrong result, and has a v_cvt_f16_u16_e32 in its disassembly so it's converting from a short to a 16bit float. Compiling it in a rocm 5.6 docker has the same result, including for the h2rcp() that's builtin to rocm. Targeting gfx1030 produces the same object code as gfx1010 so I suspect there's a bug with AMD's h2rcp.

@ardfork
Copy link
Contributor

ardfork commented Jul 10, 2023

I spent a bit more time testing your patch. It seems to be a bit more coherent or at least different than without it.

Without patch:

 -- Testing 8 chunks.
 ** Perplexity (reconstruct): 32000.1813
 -- Testing 8 chunks.
 ** Perplexity (quant, token): 31321.9178
 ** Generation: 'To be or not to be, that is theMDbrnлия Sud Beauoin Burg stick BourbourgrourutobreFD Stock ReserveMDbMDbMDbMDb'

With patch:

 -- Testing 8 chunks.
 ** Perplexity (reconstruct): 5.4959
 -- Testing 8 chunks.
 ** Perplexity (quant, token): 292330.3358
 ** Generation: 'To be or not to be, that is the............\n – – – –– – noiseokal drag next kv Ron Ep Brook future'

As you can see Perplexity (reconstruct) seem correct with your patch. I looked a bit more, and while using your patch, disabling fused MLP with --fused_mlp_thd 0 make it work.

So, there are still some half2 problem in fused MLP.

@turboderp
Copy link
Owner

So, there are still some half2 problem in fused MLP.

Do they actually matter, though? Maybe you could benchmark it with and without --silu_no_half2 to see if it's even worthwhile to keep fighting with h2exp and h2rcp. The activation kernel is very memory-bound anyway.

@Engininja2
Copy link
Author

Engininja2 commented Jul 11, 2023

@ardfork can you try clearing the contents of the exllama_ext cache? In my case it was in ~/.cache/torch_extensions/py311_cpu/exllama_ext

I tried first running exllama without the patch and with the cache deleted, and then ran with the patch without deleting the cache and got gibberish. After clearing the cache it worked.

With using functions to access the component halfs of half2, undefining HIP_NO_HALF_CONVERSIONS doesn't seem to be necessary anymore.

@ardfork
Copy link
Contributor

ardfork commented Jul 11, 2023

That work now. Look like the change in hip_compat.cuh weren't taken into account when rebuilding. I wonder if adding a useless change in q4_mlp to force it to be rebuild would be useful (or find another way to force a rebuild of it), since people are going to have the same problem as I had.

As for speed, on my machine it is the same using half2 or not, but it's nice to be able to use everything.

It might be a good idea to raise an issue or a PR upstream to get that h2rcp fixed, as they already tried to fix it in 5.6.

If they are no performance drop on most HIP supported platform, I think we should revert #afc8b7cd. And also revert oobabooga/text-generation-webui@3c076c3 once it use a version of exllama with your patch merged.

If they are no problem with using __low2half and __high2half on CUDA, I think it would be nice to merge that PR, having all the paths working on HIP is better.

@Engininja2
Copy link
Author

Changing extra_cuda_cflags caused ninja to compile everything again too, so I don't think there's a need to change q4_mlp this time.

I raised an issue with hipamd for h2rcp(). ROCm/clr#8

Stop defaulting to --no_half2 with hip, and whitespace added to
q4_mlp to help make sure it gets recompiled.
@Engininja2
Copy link
Author

I went and added a newline to q4_mlp.cu anyways in case someone using exllama downstream is using their own code for loading the extension. I removed hip defaulting to no_half2 like you suggested.

I would assume that performance should be fine for anyone with an RDNA, CDNA, or Vega GPU. Anything from the RX 580 era lacks the packed math instructions so half2 might be slower for them depending on what the compiler does.

@ardfork
Copy link
Contributor

ardfork commented Jul 21, 2023

I raised an issue with hipamd for h2rcp()

They have moved hipamd to https://github.com/ROCm-Developer-Tools/clr, might be a good idea to also put your issue on that new repo, I fear that it might get forgotten on the old one.

This was referenced Jul 28, 2023
__device__ __forceinline__ __half2 __compat_h2rcp(__half2 x) {
return _Float16_2{static_cast<_Float16>(__builtin_amdgcn_rcph(x.x)),
static_cast<_Float16>(__builtin_amdgcn_rcph(x.y))};
return _Float16_2{static_cast<_Float16>(__builtin_amdgcn_rcph(static_cast<__half2_raw>(x).data.x)),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know where I can read about this __builtin_amdgcn_rcph?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants