
[v1] Support multiple KV cache groups in GPU model runner #17945


Open

heheda12345 wants to merge 8 commits into main from multi_group_worker

Conversation

heheda12345 (Collaborator) commented May 10, 2025

Should be merged after #17483

This PR finishes the hybrid allocator support on the worker side. It does the following things:

  1. Change block_ids in SchedulerOutput to list[list[int]], where the outer list indexes the KV cache groups and the inner list holds the blocks of one group.
  2. Create a BlockTable instance for each KV cache group.
  3. Build separate attention metadata for each KV cache group.
  4. The TPU backend still supports only one KV cache group after this PR.

Split from #16101.
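
To make the first three items concrete, here is a minimal, self-contained sketch of the per-group bookkeeping; SimpleBlockTable and the example block ids are illustrative stand-ins, not the actual BlockTable class or vLLM code:

```python
# Sketch: block_ids goes from list[int] to list[list[int]].
# The outer index selects the KV cache group; the inner list holds the
# block ids allocated to a request within that group.
block_ids: list[list[int]] = [
    [0, 1, 2],  # group 0, e.g. full-attention layers
    [7, 8],     # group 1, e.g. sliding-window layers
]


class SimpleBlockTable:
    """Toy stand-in for the per-group BlockTable introduced in this PR."""

    def __init__(self) -> None:
        self.rows: dict[int, list[int]] = {}  # request index -> block ids

    def add_request(self, req_index: int, blocks: list[int]) -> None:
        self.rows[req_index] = list(blocks)


# One block table per KV cache group; attention metadata is then built
# separately for each group from its own table.
block_tables = [SimpleBlockTable() for _ in block_ids]
for group_id, group_blocks in enumerate(block_ids):
    block_tables[group_id].add_request(req_index=0, blocks=group_blocks)
```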

github-actions bot commented May 10, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run the other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

mergify bot added the v1 and tpu (Related to Google TPUs) labels May 10, 2025
mergify bot commented May 10, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @heheda12345.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

heheda12345 force-pushed the multi_group_worker branch from 5ef5bed to f65b904 on May 11, 2025 at 02:26
mergify bot removed the needs-rebase label May 11, 2025
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Comment on lines 87 to 92
# Some layers may be treated as full attention layers by the KV cache manager
# (blocks are allocated for all tokens), while being computed as sliding window
# attention in the model runner. In that case, we use FullAttentionSpec and
# record the sliding window size. Defaults to None when sliding window
# attention is not used.
sliding_window: Optional[int] = None
Collaborator:
Is this for the case where the hybrid allocator is disabled? If so, please leave a comment.

Collaborator (Author):
Yeah. I've updated the comment.
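
For readers following along, a hedged sketch of the case being discussed; FullAttentionSpecSketch and its fields (other than sliding_window, which comes from the diff above) are illustrative, not the real vLLM classes. When the hybrid allocator is disabled, a sliding-window layer is still managed as full attention (blocks allocated for all tokens), but the window size is recorded so the model runner can still apply it:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FullAttentionSpecSketch:
    """Toy version of a full-attention KV cache spec, for illustration."""
    block_size: int
    # None: a genuine full-attention layer.
    # Set: the layer is computed as sliding-window attention in the model
    # runner, but (with the hybrid allocator disabled) the KV cache manager
    # still allocates blocks for all tokens.
    sliding_window: Optional[int] = None


full_layer = FullAttentionSpecSketch(block_size=16)
swa_layer = FullAttentionSpecSketch(block_size=16, sliding_window=4096)
```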

Comment on lines +271 to +280
batch_reordered = self.attn_metadata_builders[0].reorder_batch(
    self.input_batch, scheduler_output)

# For models with multiple KV cache groups, the groups should agree on
# the same order of requests. We ensure this by only allowing the first
# group to reorder the batch and asserting that all other groups do not
# reorder the batch.
for i in range(1, len(self.kv_cache_config.kv_cache_groups)):
    assert not self.attn_metadata_builders[i].reorder_batch(
        self.input_batch, scheduler_output)
Collaborator:
QQ: What if the first group is full attn and the second group is MLA? IIUC, the current code will fail in this case. Is this intended?

Collaborator (Author):
You are right, but it's fine because no model currently contains both full attention and MLA. I'd prefer to raise an error here and find a solution when such a model is released.
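
A minimal sketch of what that explicit error could look like; the standalone function and its error message are illustrative, and the builders are only assumed to expose the reorder_batch interface used in the diff above, so this is not the actual implementation:

```python
def reorder_batch_across_groups(attn_metadata_builders, input_batch,
                                scheduler_output) -> bool:
    """Sketch: only the first KV cache group may reorder the batch."""
    batch_reordered = attn_metadata_builders[0].reorder_batch(
        input_batch, scheduler_output)
    for i, builder in enumerate(attn_metadata_builders[1:], start=1):
        if builder.reorder_batch(input_batch, scheduler_output):
            # e.g. group 0 is full attention and group i is MLA: the MLA
            # builder wants its own request order, which is unsupported.
            raise NotImplementedError(
                f"KV cache group {i} requested a batch reordering that "
                "differs from group 0; mixing such backends is not "
                "supported yet.")
    return batch_reordered
```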

Comment on lines +59 to +67
@classmethod
def merge(cls, specs: list[Self]) -> Self:
    """
    Merge a list of KVCacheSpec objects into a single KVCacheSpec object.
    """
    assert all(spec.type_id == specs[0].type_id for spec in specs[1:]), (
        "All layers in the same KV cache group must share the same "
        "type_id.")
    return copy.deepcopy(specs[0])
Collaborator:
Do we really want to inherit and override this? What about defining this as a utility function outside the class?

Collaborator (Author):
I prefer to keep the function inside the class. If it were a utility function, it is quite likely that people would forget to update it when extending the KVCacheSpec subclasses.

Comment on lines +102 to +119
@classmethod
def merge(cls, specs: list[Self]) -> Self:
    """
    Merge a list of FullAttentionSpec objects into a single
    FullAttentionSpec object.
    """
    merged_spec = super().merge(specs)
    sliding_window = set(spec.sliding_window for spec in specs
                         if spec.sliding_window is not None)
    if len(sliding_window) == 0:
        merged_spec.sliding_window = None
    elif len(sliding_window) == 1:
        merged_spec.sliding_window = sliding_window.pop()
    else:
        raise ValueError(
            "All sliding window layers in the same KV cache group "
            "must have the same window size.")
    return merged_spec
Collaborator:
Don't we need similar logic in SlidingWindowSpec as well?

Collaborator (Author):
We don't need it, as SlidingWindowSpec.type_id contains the sliding window size, which ensures that layers with different sliding window sizes end up in different KV cache groups.
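
For illustration, a type_id that bakes in the window size might look like the sketch below; the exact string format and field set are assumptions for this sketch, not the actual SlidingWindowSpec definition:

```python
from dataclasses import dataclass


@dataclass
class SlidingWindowSpecSketch:
    """Toy sliding-window KV cache spec, for illustration only."""
    block_size: int
    sliding_window: int

    @property
    def type_id(self) -> str:
        # Because the window size is part of type_id, layers with different
        # window sizes can never share a KV cache group, so merge() needs
        # no extra window-size check for this spec.
        return f"sliding_window_{self.sliding_window}_{self.block_size}"
```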

Signed-off-by: Chen Zhang <[email protected]>
WoosukKwon added the ready (ONLY add when PR is ready to merge/full CI is needed) label May 11, 2025
mergify bot commented May 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @heheda12345.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Labels: needs-rebase, ready (ONLY add when PR is ready to merge/full CI is needed), tpu (Related to Google TPUs), v1
Projects: None yet
2 participants