
Conversation

stduhpf (Contributor) commented Jan 29, 2026

Tested with https://civitai.green/models/344873/plana-blue-archivelokr

HIP (ROCm 6.2):

Master:

| | 512x512 | 1024x1024 | 1024x1536 |
|---|---|---|---|
| UNet compute buffer | 3295.83 MB | 3993.52 MB | 4860.33 MB |
| Average time per step | 0.89 s | 2.14 s | 3.2 s |

PR:

| | 512x512 | 1024x1024 | 1024x1536 |
|---|---|---|---|
| UNet compute buffer | 137.05 MB | 830.86 MB | 1701.55 MB |
| Average time per step | 1.3 s | 4.8 s | ~~7.44 s~~ 3.84 s |

Vulkan (AMD proprietary driver):

Master:

| | 512x512 | 1024x1024 | 1024x1536 |
|---|---|---|---|
| UNet compute buffer | 3363.80 MB | 4056.49 MB | 4986.61 MB |
| Average time per step | 1.02 s | 2.38 s | 3.57 s |

PR:

| | 512x512 | 1024x1024 | 1024x1536 |
|---|---|---|---|
| UNet compute buffer | 137.05 MB | 830.86 MB | 1746.55 MB |
| Average time per step | 0.92 s | 2.98 s | ~~4.5 s~~ 4.04 s |

TL;DR: significant VRAM savings across the board. For some reason there is a big performance hit at all resolutions on the ROCm backend (that needs more investigation); the Vulkan backend is faster at smaller resolutions but slower at high resolutions.

EDIT:

Not long after taking these measurements, I found a way to massively reduce the performance gap (with the same compute buffer size). It's now consistently better than master on Vulkan.
I'm too lazy to redo all the tests for now, but for example, 1024x1536 now takes 3.84s per step on ROCm and 3.07s on Vulkan.
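
For context on where the compute-buffer savings come from: a LoKr delta has the form ΔW = A ⊗ B (a Kronecker product of two small factors, sometimes with B factored further, ignored here), and the identity (A ⊗ B)·vec(X) = vec(B·X·Aᵀ) lets that delta be applied from the small factors without ever materializing a W-sized buffer. The sketch below is standalone C++, not sd.cpp/ggml code, and all sizes in it are illustrative; it just checks the identity and compares the storage the two paths need.

```cpp
// Standalone illustration (not sd.cpp/ggml code): the Kronecker identity
// (A ⊗ B) vec(X) == vec(B * X * A^T) means a LoKr delta can be applied from
// its small factors without materializing the full (m*p) x (n*q) matrix.
// Sizes below are arbitrary; real LoKr may also factor B further (ignored here).
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

// column-major accessor: M(i, j) for a matrix with `rows` rows
static float& at(std::vector<float>& M, int rows, int i, int j) {
    return M[(std::size_t)i + (std::size_t)rows * j];
}

int main() {
    const int m = 4, n = 3, p = 5, q = 2;  // A: m x n, B: p x q, X: q x n
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> U(-1.0f, 1.0f);
    std::vector<float> A(m * n), B(p * q), X(q * n);
    for (auto& v : A) v = U(rng);
    for (auto& v : B) v = U(rng);
    for (auto& v : X) v = U(rng);

    // Path 1: apply (A ⊗ B) to vec(X) element-wise -- equivalent to building
    // the full Kronecker product first, which would need m*p*n*q floats.
    std::vector<float> y_full((std::size_t)m * p, 0.0f);
    for (int i = 0; i < m; ++i)
        for (int r = 0; r < p; ++r)
            for (int j = 0; j < n; ++j)
                for (int s = 0; s < q; ++s)
                    y_full[(std::size_t)i * p + r] += at(A, m, i, j) * at(B, p, r, s) * at(X, q, s, j);

    // Path 2: Y = B * X * A^T, using only the small factors.
    std::vector<float> BX((std::size_t)p * n, 0.0f), Y((std::size_t)p * m, 0.0f);
    for (int r = 0; r < p; ++r)
        for (int j = 0; j < n; ++j)
            for (int s = 0; s < q; ++s)
                at(BX, p, r, j) += at(B, p, r, s) * at(X, q, s, j);
    for (int r = 0; r < p; ++r)
        for (int i = 0; i < m; ++i)
            for (int j = 0; j < n; ++j)
                at(Y, p, r, i) += at(BX, p, r, j) * at(A, m, i, j);  // A^T(j, i) == A(i, j)

    float max_diff = 0.0f;
    for (std::size_t k = 0; k < y_full.size(); ++k)
        max_diff = std::max(max_diff, std::fabs(y_full[k] - Y[k]));
    std::printf("max |direct - factored| = %g\n", max_diff);
    std::printf("full Kronecker product: %d floats, factors only: %d floats\n",
                m * p * n * q, m * n + p * q);
    return 0;
}
```

On GPU this trades one matmul against a merged W-sized weight for a couple of small extra matmuls per adapted layer, which is presumably where both the compute-buffer savings and the speed tradeoff in the tables above come from.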


stduhpf (Contributor, Author) commented Jan 30, 2026

Seems to work with more LoKrs now on the Vulkan backend. I still can't figure out why it's behaving so strangely on ROCm.

leejet (Owner) commented Feb 1, 2026

When I use the CUDA backend, the program runs abnormally slowly, and when I use --clip-on-cpu it throws an error:

```
sd.cpp\ggml\src\ggml-cpu\ggml-cpu.c:1249: GGML_ASSERT(nb10 == ggml_type_size(src1->type)) failed
sd.cpp\ggml\src\ggml-cpu\ggml-cpu.c:1249: GGML_ASSERT(nb10 == ggml_type_size(src1->type)) failed
```

stduhpf (Contributor, Author) commented Feb 1, 2026

With the latest commit, I no longer experience the extreme slowdown on the ROCm backend, and I'm pretty sure it also fixes it for CUDA. The CPU backend doesn't crash anymore either.

The conv2D implementation is pretty slow for now (and I'm not 100% sure it's correct). If I fail to optimize it, I may give up and compute the full weight diff for convs instead (the compute buffer goes from 830 MB to 880 MB with the model I'm testing, but speed goes from 14 s/it to 3 s/it).
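
For context, the "full weight diff for convs" fallback can be pictured like this: build ΔW = kron(A, B) once, view it as the flattened conv kernel, and add it (scaled) into the base weight, rather than applying the factors inside the convolution itself. The sketch below is plain C++ under assumed shapes and a row-major out_ch x (in_ch*kh*kw) flattening, with an assumed scale parameter; it is not the PR's actual code.

```cpp
// Sketch of the conv fallback described above (assumed shapes, not the PR's code):
// materialize the weight diff kron(A, B) once and add it into the base conv
// kernel, viewed as a row-major (out_ch) x (in_ch*kh*kw) matrix.
#include <cassert>
#include <cstddef>
#include <vector>

// A: m x n, B: p x q (both row-major), with m*p == out_ch and n*q == cols.
void add_lokr_delta_to_conv(std::vector<float>& W, int out_ch, int cols,
                            const std::vector<float>& A, int m, int n,
                            const std::vector<float>& B, int p, int q,
                            float scale) {
    assert(m * p == out_ch && n * q == cols);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            const float a = scale * A[(std::size_t)i * n + j];
            for (int r = 0; r < p; ++r)
                for (int s = 0; s < q; ++s) {
                    const int row = i * p + r;  // output channel
                    const int col = j * q + s;  // flattened in_ch * kh * kw index
                    W[(std::size_t)row * cols + col] += a * B[(std::size_t)r * q + s];
                }
        }
}
```

The tradeoff described above then falls out naturally: this path needs an extra buffer on the order of the full kernel (hence the compute buffer growing from roughly 830 MB to 880 MB), but the convolution itself runs against a single ordinary weight tensor, so no extra work happens per conv call.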

stduhpf (Contributor, Author) commented Feb 1, 2026

@leejet Can you test again?

stduhpf (Contributor, Author) commented Feb 2, 2026

Good news is that it seems to work pretty well now, especially on Vulkan. Bad news is that on the Vulkan backend, with some model and LoKr combos, I'm getting crashes with:

```
H:\stable-diffusion.cpp\ggml\src\ggml-vulkan\ggml-vulkan.cpp:6184: GGML_ASSERT(wg0 <= ctx->device->properties.limits.maxComputeWorkGroupCount[0] && wg1 <= ctx->device->properties.limits.maxComputeWorkGroupCount[1] && wg2 <= ctx->device->properties.limits.maxComputeWorkGroupCount[2]) failed
```

I tried something in the last commit that seems to lower the chances of it happening, but it still happens in some cases. I don't know what I can do about it.
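
For what it's worth, the usual way around this class of assert (not necessarily what the last commit does) is to split one oversized dispatch into several smaller ones, so that no axis exceeds maxComputeWorkGroupCount. A generic standalone sketch, with the actual dispatch call stubbed out:

```cpp
// Generic sketch of chunking an oversized dispatch (standalone C++, not
// ggml-vulkan code): split the total workgroup count along one axis so each
// submitted dispatch stays within the device limit.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <functional>

void dispatch_in_chunks(uint32_t wg_total, uint32_t wg_limit,
                        const std::function<void(uint32_t offset, uint32_t count)>& dispatch) {
    for (uint32_t off = 0; off < wg_total; off += wg_limit)
        dispatch(off, std::min(wg_limit, wg_total - off));  // each chunk <= wg_limit
}

int main() {
    // e.g. 200000 workgroups needed on an axis whose device limit is 65535
    dispatch_in_chunks(200000, 65535, [](uint32_t off, uint32_t n) {
        std::printf("dispatch: offset=%u, count=%u\n", off, n);
    });
    return 0;
}
```

The shader would also need the chunk offset (e.g. via a push constant) so each piece works on the right region, which is the fiddly part in practice.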

stduhpf (Contributor, Author) commented Feb 2, 2026

@leejet I don't see any remaining issues on either Vulkan or ROCm. I've tried a few SDXL and ZIT LoKrs, and they all seem to work fine now.
