
Conversation

stduhpf (Contributor) commented Jan 29, 2026

Tested with https://civitai.green/models/344873/plana-blue-archivelokr

HIP (ROCm 6.2):

Master:

| | 512x512 | 1024x1024 | 1024x1536 |
|---|---|---|---|
| UNet compute buffer | 3295.83 MB | 3993.52 MB | 4860.33 MB |
| Average time per step | 0.89 s | 2.14 s | 3.2 s |

PR:

| | 512x512 | 1024x1024 | 1024x1536 |
|---|---|---|---|
| UNet compute buffer | 137.05 MB | 830.86 MB | 1701.55 MB |
| Average time per step | 1.3 s | 4.8 s | ~~7.44 s~~ 3.84 s |

Vulkan (AMD proprietary driver):

Master:

| | 512x512 | 1024x1024 | 1024x1536 |
|---|---|---|---|
| UNet compute buffer | 3363.80 MB | 4056.49 MB | 4986.61 MB |
| Average time per step | 1.02 s | 2.38 s | 3.57 s |

PR:

| | 512x512 | 1024x1024 | 1024x1536 |
|---|---|---|---|
| UNet compute buffer | 137.05 MB | 830.86 MB | 1746.55 MB |
| Average time per step | 0.92 s | 2.98 s | ~~4.5 s~~ 4.04 s |

TL;DR: significant VRAM savings across the board. For some reason there is a big performance hit at all resolutions on the ROCm backend (that needs more investigation); the Vulkan backend is faster at smaller resolutions but slower at high resolutions.

EDIT:

Not long after taking these measurements, I found a way to massively reduce the performance gap (with the same compute buffer size). It's now consistently better than master on Vulkan.
I'm too lazy to redo all the tests for now, but for example, 1024x1536 now takes 3.84s per step on ROCm and 3.07s on Vulkan.
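
For context on where the compute-buffer savings come from: a LoKr delta has the form ΔW = A ⊗ B (a Kronecker product of two small factors, sometimes with B factored further, ignored here), and the identity (A ⊗ B)·vec(X) = vec(B·X·Aᵀ) lets that delta be applied from the small factors without ever materializing a W-sized buffer. The sketch below is standalone C++, not sd.cpp/ggml code, and all sizes in it are illustrative; it just checks the identity and compares the storage the two paths need.

```cpp
// Standalone illustration (not sd.cpp/ggml code): the Kronecker identity
// (A ⊗ B) vec(X) == vec(B * X * A^T) means a LoKr delta can be applied from
// its small factors without materializing the full (m*p) x (n*q) matrix.
// Sizes below are arbitrary; real LoKr may also factor B further (ignored here).
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

// column-major accessor: M(i, j) for a matrix with `rows` rows
static float& at(std::vector<float>& M, int rows, int i, int j) {
    return M[(std::size_t)i + (std::size_t)rows * j];
}

int main() {
    const int m = 4, n = 3, p = 5, q = 2;  // A: m x n, B: p x q, X: q x n
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> U(-1.0f, 1.0f);
    std::vector<float> A(m * n), B(p * q), X(q * n);
    for (auto& v : A) v = U(rng);
    for (auto& v : B) v = U(rng);
    for (auto& v : X) v = U(rng);

    // Path 1: apply (A ⊗ B) to vec(X) element-wise -- equivalent to building
    // the full Kronecker product first, which would need m*p*n*q floats.
    std::vector<float> y_full((std::size_t)m * p, 0.0f);
    for (int i = 0; i < m; ++i)
        for (int r = 0; r < p; ++r)
            for (int j = 0; j < n; ++j)
                for (int s = 0; s < q; ++s)
                    y_full[(std::size_t)i * p + r] += at(A, m, i, j) * at(B, p, r, s) * at(X, q, s, j);

    // Path 2: Y = B * X * A^T, using only the small factors.
    std::vector<float> BX((std::size_t)p * n, 0.0f), Y((std::size_t)p * m, 0.0f);
    for (int r = 0; r < p; ++r)
        for (int j = 0; j < n; ++j)
            for (int s = 0; s < q; ++s)
                at(BX, p, r, j) += at(B, p, r, s) * at(X, q, s, j);
    for (int r = 0; r < p; ++r)
        for (int i = 0; i < m; ++i)
            for (int j = 0; j < n; ++j)
                at(Y, p, r, i) += at(BX, p, r, j) * at(A, m, i, j);  // A^T(j, i) == A(i, j)

    float max_diff = 0.0f;
    for (std::size_t k = 0; k < y_full.size(); ++k)
        max_diff = std::max(max_diff, std::fabs(y_full[k] - Y[k]));
    std::printf("max |direct - factored| = %g\n", max_diff);
    std::printf("full Kronecker product: %d floats, factors only: %d floats\n",
                m * p * n * q, m * n + p * q);
    return 0;
}
```

On GPU this trades one matmul against a merged W-sized weight for a couple of small extra matmuls per adapted layer, which is presumably where both the compute-buffer savings and the speed tradeoff in the tables above come from.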


stduhpf (Contributor, Author) commented Jan 30, 2026

Seems to work with more LoKrs now on the Vulkan backend. I still can't figure out why it's behaving so strangely on ROCm.

leejet (Owner) commented Feb 1, 2026

When I use the CUDA backend, the program runs abnormally slowly, and when I use --clip-on-cpu it throws an error:

```
sd.cpp\ggml\src\ggml-cpu\ggml-cpu.c:1249: GGML_ASSERT(nb10 == ggml_type_size(src1->type)) failed
sd.cpp\ggml\src\ggml-cpu\ggml-cpu.c:1249: GGML_ASSERT(nb10 == ggml_type_size(src1->type)) failed
```

stduhpf (Contributor, Author) commented Feb 1, 2026

With the latest commit, I no longer experience the extreme slowdown on the ROCm backend, and I'm pretty sure it also fixes it for CUDA. The CPU backend doesn't crash anymore either.

The conv2D implementation is pretty slow for now (and I'm not 100% sure it's correct). If I fail to optimize it, I may give up and compute the full weight diff for convs instead (the compute buffer goes from 830 MB to 880 MB with the model I'm testing, but speed goes from 14 s/it to 3 s/it).
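
For context, the "full weight diff for convs" fallback can be pictured like this: build ΔW = kron(A, B) once, view it as the flattened conv kernel, and add it (scaled) into the base weight, rather than applying the factors inside the convolution itself. The sketch below is plain C++ under assumed shapes and a row-major out_ch x (in_ch*kh*kw) flattening, with an assumed scale parameter; it is not the PR's actual code.

```cpp
// Sketch of the conv fallback described above (assumed shapes, not the PR's code):
// materialize the weight diff kron(A, B) once and add it into the base conv
// kernel, viewed as a row-major (out_ch) x (in_ch*kh*kw) matrix.
#include <cassert>
#include <cstddef>
#include <vector>

// A: m x n, B: p x q (both row-major), with m*p == out_ch and n*q == cols.
void add_lokr_delta_to_conv(std::vector<float>& W, int out_ch, int cols,
                            const std::vector<float>& A, int m, int n,
                            const std::vector<float>& B, int p, int q,
                            float scale) {
    assert(m * p == out_ch && n * q == cols);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            const float a = scale * A[(std::size_t)i * n + j];
            for (int r = 0; r < p; ++r)
                for (int s = 0; s < q; ++s) {
                    const int row = i * p + r;  // output channel
                    const int col = j * q + s;  // flattened in_ch * kh * kw index
                    W[(std::size_t)row * cols + col] += a * B[(std::size_t)r * q + s];
                }
        }
}
```

The tradeoff described above then falls out naturally: this path needs an extra buffer on the order of the full kernel (hence the compute buffer growing from roughly 830 MB to 880 MB), but the convolution itself runs against a single ordinary weight tensor, so no extra work happens per conv call.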

stduhpf (Contributor, Author) commented Feb 1, 2026

@leejet Can you test again?

stduhpf (Contributor, Author) commented Feb 2, 2026

Good news is that it seems to work pretty well now, especially on Vulkan. Bad news is that on the Vulkan backend, with some model and LoKr combos, I'm getting crashes with:

```
H:\stable-diffusion.cpp\ggml\src\ggml-vulkan\ggml-vulkan.cpp:6184: GGML_ASSERT(wg0 <= ctx->device->properties.limits.maxComputeWorkGroupCount[0] && wg1 <= ctx->device->properties.limits.maxComputeWorkGroupCount[1] && wg2 <= ctx->device->properties.limits.maxComputeWorkGroupCount[2]) failed
```

I tried something in the last commit that seems to lower the chances of it happening, but it still happens in some cases. I don't know what I can do about it.
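
For what it's worth, the usual way around this class of assert (not necessarily what the last commit does) is to split one oversized dispatch into several smaller ones, so that no axis exceeds maxComputeWorkGroupCount. A generic standalone sketch, with the actual dispatch call stubbed out:

```cpp
// Generic sketch of chunking an oversized dispatch (standalone C++, not
// ggml-vulkan code): split the total workgroup count along one axis so each
// submitted dispatch stays within the device limit.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <functional>

void dispatch_in_chunks(uint32_t wg_total, uint32_t wg_limit,
                        const std::function<void(uint32_t offset, uint32_t count)>& dispatch) {
    for (uint32_t off = 0; off < wg_total; off += wg_limit)
        dispatch(off, std::min(wg_limit, wg_total - off));  // each chunk <= wg_limit
}

int main() {
    // e.g. 200000 workgroups needed on an axis whose device limit is 65535
    dispatch_in_chunks(200000, 65535, [](uint32_t off, uint32_t n) {
        std::printf("dispatch: offset=%u, count=%u\n", off, n);
    });
    return 0;
}
```

The shader would also need the chunk offset (e.g. via a push constant) so each piece works on the right region, which is the fiddly part in practice.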

stduhpf (Contributor, Author) commented Feb 2, 2026

@leejet I don't see any remaining issues on either Vulkan or ROCm. I've tried a few SDXL and ZIT LoKrs, and they all seem to work fine now.
