[v1] Support multiple KV cache groups in GPU model runner #17945
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: …
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Chen Zhang <[email protected]>
Force-pushed from 5ef5bed to f65b904 (Compare)
vllm/v1/kv_cache_interface.py Outdated
# Some layers may be regarded as full attention layers in the KV cache manager
# (blocks are allocated for all tokens), while being computed as sliding window
# attention in the model runner. In this case, we use FullAttentionSpec and
# record the sliding window size. Defaults to None when sliding window
# attention is not used.
sliding_window: Optional[int] = None
Is this for the case where the hybrid allocator is disabled? If so, please leave a comment.
Yeah. I've updated the comment.
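For readers following the thread, here is a minimal sketch of the scenario being discussed: when the hybrid allocator is disabled, a sliding-window layer is still allocated as if it were full attention, and only its window size is carried over. The class fields and the as_full_attention helper below are simplified assumptions for illustration, not the actual vLLM definitions.

from dataclasses import dataclass
from typing import Optional


@dataclass
class SlidingWindowSpec:
    # Simplified stand-in; the real class carries more fields.
    block_size: int
    sliding_window: int


@dataclass
class FullAttentionSpec:
    block_size: int
    # Recorded so the model runner can still apply the window even though
    # the KV cache manager allocates blocks for all tokens.
    sliding_window: Optional[int] = None


def as_full_attention(spec: SlidingWindowSpec) -> FullAttentionSpec:
    """Hypothetical helper: with the hybrid allocator disabled, treat a
    sliding-window layer as full attention but remember its window size."""
    return FullAttentionSpec(block_size=spec.block_size,
                             sliding_window=spec.sliding_window)


print(as_full_attention(SlidingWindowSpec(block_size=16, sliding_window=4096)))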
batch_reordered = self.attn_metadata_builders[0].reorder_batch(
    self.input_batch, scheduler_output)
# For models with multiple KV cache groups, the groups should agree on
# the same order of requests. We ensure this by only allowing the first
# group to reorder the batch and asserting that all other groups do not
# reorder the batch.
for i in range(1, len(self.kv_cache_config.kv_cache_groups)):
    assert not self.attn_metadata_builders[i].reorder_batch(
        self.input_batch, scheduler_output)
QQ: What if the first group is full attn and the second group is MLA? IIUC, the current code will fail in this case. Is this intended?
You are right, but it's fine for now since no model contains both full attention and MLA. I'd prefer to raise an error here and find a solution when such a model is released.
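For illustration, a standalone sketch of the invariant that assertion enforces; the builder class here is a stub for the discussion, not a real vLLM attention backend:

class _StubBuilder:
    """Stand-in for an attention metadata builder of one KV cache group."""

    def __init__(self, reorders: bool) -> None:
        self._reorders = reorders

    def reorder_batch(self, input_batch, scheduler_output) -> bool:
        # Returns True iff this builder changed the order of requests.
        return self._reorders


def reorder_batch_consistently(builders, input_batch, scheduler_output) -> bool:
    """Only the first KV cache group may reorder the batch; every other group
    must leave the order unchanged, otherwise the groups would disagree on
    which row corresponds to which request."""
    batch_reordered = builders[0].reorder_batch(input_batch, scheduler_output)
    for builder in builders[1:]:
        if builder.reorder_batch(input_batch, scheduler_output):
            raise AssertionError(
                "KV cache groups disagree on the request order")
    return batch_reordered


# Full attention group + sliding window group, neither reorders: fine.
print(reorder_batch_consistently(
    [_StubBuilder(False), _StubBuilder(False)], None, None))
# A builder that reorders in a non-first group (e.g. an MLA-style backend)
# would trip the check, matching the reviewer's observation above.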
@classmethod
def merge(cls, specs: list[Self]) -> Self:
    """
    Merge a list of KVCacheSpec objects into a single KVCacheSpec object.
    """
    assert all(spec.type_id == specs[0].type_id for spec in specs[1:]), (
        "All layers in the same KV cache group must share the same "
        "type_id.")
    return copy.deepcopy(specs[0])
Do we really want to inherit and override this? What about defining this as a utility function outside the class?
I prefer to keep the function inside the class. If it were a standalone utility function, it's quite likely people would forget to update it when extending the KVCacheSpec classes.
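To illustrate the design choice, a self-contained sketch (simplified stand-ins, not the real vLLM hierarchy) of why keeping merge as a classmethod lets a subclass extend the base check via super() while callers stay agnostic of the concrete spec type:

import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class KVCacheSpec:
    block_size: int

    @property
    def type_id(self) -> str:
        return f"base_{self.block_size}"

    @classmethod
    def merge(cls, specs: list["KVCacheSpec"]) -> "KVCacheSpec":
        # Base check: all layers in one group must share a type_id.
        assert all(s.type_id == specs[0].type_id for s in specs[1:])
        return copy.deepcopy(specs[0])


@dataclass
class FullAttentionSpec(KVCacheSpec):
    sliding_window: Optional[int] = None

    @property
    def type_id(self) -> str:
        return f"full_attention_{self.block_size}"

    @classmethod
    def merge(cls, specs: list["FullAttentionSpec"]) -> "FullAttentionSpec":
        # Subclass extends the base check instead of replacing it.
        merged = super().merge(specs)
        windows = {s.sliding_window for s in specs
                   if s.sliding_window is not None}
        if len(windows) > 1:
            raise ValueError("window sizes in one group must match")
        merged.sliding_window = windows.pop() if windows else None
        return merged


layer_specs = [FullAttentionSpec(16, sliding_window=4096),
               FullAttentionSpec(16, sliding_window=4096)]
# Dispatch on the concrete class; the caller never names the subclass.
print(type(layer_specs[0]).merge(layer_specs).sliding_window)  # 4096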
@classmethod
def merge(cls, specs: list[Self]) -> Self:
    """
    Merge a list of FullAttentionSpec objects into a single
    FullAttentionSpec object.
    """
    merged_spec = super().merge(specs)
    sliding_window = set(spec.sliding_window for spec in specs
                         if spec.sliding_window is not None)
    if len(sliding_window) == 0:
        merged_spec.sliding_window = None
    elif len(sliding_window) == 1:
        merged_spec.sliding_window = sliding_window.pop()
    else:
        raise ValueError(
            "All sliding window layers in the same KV cache group "
            "must have the same window size.")
    return merged_spec
Don't we need similar logic in SlidingWindowSpec as well?
We don't need it, as SlidingWindowSpec.type_id includes the sliding window size, which ensures that layers with different sliding window sizes end up in different KV cache groups.
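A rough sketch of that argument, assuming the layers are bucketed into KV cache groups by type_id (the exact type_id format and the location of this grouping step are assumptions for illustration):

from collections import defaultdict


class _Spec:
    """Stub carrying only a type_id, standing in for a real KVCacheSpec."""

    def __init__(self, type_id: str) -> None:
        self.type_id = type_id


def group_layers_by_type_id(layer_specs: dict) -> list[list[str]]:
    """Layers whose specs share a type_id go into one KV cache group.
    If the type_id encodes the window size, layers with different window
    sizes can never share a group, so their merge never has to reconcile
    conflicting window sizes."""
    groups = defaultdict(list)
    for layer_name, spec in layer_specs.items():
        groups[spec.type_id].append(layer_name)
    return list(groups.values())


print(group_layers_by_type_id({
    "layer.0": _Spec("sliding_window_4096_16"),  # assumed type_id format
    "layer.1": _Spec("sliding_window_1024_16"),
    "layer.2": _Spec("sliding_window_4096_16"),
}))  # -> [['layer.0', 'layer.2'], ['layer.1']]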
Signed-off-by: Chen Zhang <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Should be merged after #17483
This PR finishes the hybrid allocator support on the worker side. It does the following things:
- Change block_ids in SchedulerOutput to list[list[int]], where the outer list is for the KV cache groups and each inner list is for the blocks in one group.
- One BlockTable class for each KV cache group.
Split from #16101
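As a rough sketch of the data layout described above (the BlockTable here is a stub, not the actual vLLM worker class, and the request/block ids are made up):

class BlockTable:
    """Stand-in for the per-group block table on the worker side."""

    def __init__(self) -> None:
        self._rows: dict[str, list[int]] = {}  # request id -> block ids

    def add_row(self, req_id: str, block_ids: list[int]) -> None:
        self._rows[req_id] = block_ids


# block_ids in SchedulerOutput is now list[list[int]]:
# outer index = KV cache group, inner list = blocks allocated for that group.
block_ids: list[list[int]] = [
    [0, 1, 2],  # group 0, e.g. full attention layers
    [7, 8],     # group 1, e.g. sliding window layers
]

# One BlockTable per KV cache group in the model runner.
block_tables = [BlockTable() for _ in block_ids]
for group_id, group_blocks in enumerate(block_ids):
    block_tables[group_id].add_row("req-0", group_blocks)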