Compatibility wrapper for DeepGEMM API changes.
Users of vLLM should always import only these wrappers.
 __all__ = [
    "calc_diff",
    "fp8_gemm_nt",
    "m_grouped_fp8_gemm_nt_contiguous",
    "fp8_m_grouped_gemm_nt_masked",
    "fp8_mqa_logits",
    "fp8_paged_mqa_logits",
    "get_paged_mqa_logits_metadata",
    "per_block_cast_to_fp8",
    "is_deep_gemm_e8m0_used",
    "is_deep_gemm_supported",
    "get_num_sms",
    "should_use_deepgemm_for_fp8_linear",
    "get_col_major_tma_aligned_tensor",
    "get_mk_alignment_for_contiguous_layout",
]
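
Since callers should never import `deep_gemm` directly, here is a minimal sketch of the intended import pattern (purely illustrative; the guard helper is documented further down this page):

```python
# Import the compatibility wrappers, never `deep_gemm` itself, so callers
# are insulated from upstream DeepGEMM API changes.
from vllm.utils.deep_gemm import fp8_gemm_nt, is_deep_gemm_supported

if is_deep_gemm_supported():
    # Safe to dispatch to the DeepGEMM-backed kernels here
    # (currently Hopper- and Blackwell-class GPUs only).
    ...
```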
_lazy_init()
Import deep_gemm and resolve symbols on first use.

_missing(*_, **__)
Placeholder for unavailable DeepGEMM backend.

calc_diff(x: Tensor, y: Tensor)
Return a global difference metric for unit tests.
DeepGEMM kernels on Blackwell/B200 currently exhibit noticeable per-element error, causing torch.testing.assert_close to fail. Instead of checking every element, we compute a cosine-style similarity over the whole tensor and report 1 - sim. Once kernel accuracy improves, this helper can be removed.
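A minimal sketch of the metric described above, with the hypothetical name `global_diff` standing in for the wrapper's internals:

```python
import torch

def global_diff(x: torch.Tensor, y: torch.Tensor) -> float:
    # Whole-tensor similarity: sim = 2*<x, y> / (<x, x> + <y, y>),
    # which equals 1.0 iff x == y elementwise; report 1 - sim.
    x, y = x.double(), y.double()
    sim = 2 * (x * y).sum() / (x * x + y * y).sum()
    return (1 - sim).item()
```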
  
 fp8_mqa_logits(
    q: Tensor,
    kv: tuple[Tensor, Tensor],
    weights: Tensor,
    cu_seqlen_ks: Tensor,
    cu_seqlen_ke: Tensor,
) -> Tensor
Compute FP8 MQA logits for a single sequence without KV paging.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| q | Tensor | Query tensor of shape [M, H, D]. Cast to `torch.float8_e4m3fn`. | required |
| kv | tuple[Tensor, Tensor] | Tuple (k_fp8, k_scales): FP8 key tensor of shape [N, D] with dtype `torch.float8_e4m3fn`, plus its `torch.float32` scale factors. | required |
| weights | Tensor | Weights of shape [M, H], dtype `torch.float32`. | required |
| cu_seqlen_ks | Tensor | Start indices (inclusive) for valid K per query position, shape [M], dtype int32. | required | 
| cu_seqlen_ke | Tensor | End indices (exclusive) for valid K per query position, shape [M], dtype int32. | required | 
Returns:
| Type | Description | 
|---|---|
| Tensor | Logits tensor of shape [M, N], dtype `torch.float32`. |
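A hedged call sketch assembled from the parameter table above; the concrete sizes (M=4, H=8, D=64, N=16), the all-ones scales, and the full-range K windows are illustrative assumptions:

```python
import torch
from vllm.utils.deep_gemm import fp8_mqa_logits

M, H, D, N = 4, 8, 64, 16
device = "cuda"

q = torch.randn(M, H, D, device=device).to(torch.float8_e4m3fn)
k_fp8 = torch.randn(N, D, device=device).to(torch.float8_e4m3fn)
k_scales = torch.ones(N, device=device, dtype=torch.float32)
weights = torch.randn(M, H, device=device, dtype=torch.float32)

# Each query position may attend to the full key range [0, N).
cu_seqlen_ks = torch.zeros(M, device=device, dtype=torch.int32)
cu_seqlen_ke = torch.full((M,), N, device=device, dtype=torch.int32)

logits = fp8_mqa_logits(q, (k_fp8, k_scales), weights,
                        cu_seqlen_ks, cu_seqlen_ke)
assert logits.shape == (M, N)
```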
  
 fp8_paged_mqa_logits(
    q_fp8: Tensor,
    kv_cache_fp8: Tensor,
    weights: Tensor,
    context_lens: Tensor,
    block_tables: Tensor,
    schedule_metadata: Tensor,
    max_model_len: int,
) -> Tensor
Compute FP8 MQA logits using paged KV-cache.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| q_fp8 | Tensor | Query tensor of shape [B, next_n, H, D]. Cast to `torch.float8_e4m3fn`. | required |
| kv_cache_fp8 | Tensor | Paged KV-cache in packed FP8+scale layout with shape [num_blocks, block_size, 1, D+4], dtype `torch.uint8`; the trailing 4 bytes of each entry hold the dequantization scale. | required |
| weights | Tensor | Tensor of shape [B * next_n, H], dtype `torch.float32`. | required |
| context_lens | Tensor | Tensor of shape [B], dtype int32; effective context length for each batch element. | required | 
| block_tables | Tensor | Tensor of shape [B, max_blocks], dtype int32; maps logical block indices to physical blocks in the paged cache. | required | 
| schedule_metadata | Tensor | Returned by `get_paged_mqa_logits_metadata`; used to schedule work across SMs. | required |
| max_model_len | int | Maximum sequence length used to size the logits output. | required | 
Returns:
| Type | Description | 
|---|---|
| Tensor | Logits tensor of shape [B * next_n, max_model_len], dtype `torch.float32`. |
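A hedged end-to-end sketch for the paged variant. The packed [num_blocks, block_size, 1, D+4] uint8 cache layout follows the table above; the concrete sizes and the all-zeros cache/block-table contents are placeholder assumptions:

```python
import torch
from vllm.utils.deep_gemm import (
    fp8_paged_mqa_logits,
    get_num_sms,
    get_paged_mqa_logits_metadata,
)

B, next_n, H, D = 2, 1, 8, 64
block_size, num_blocks, max_model_len = 64, 32, 1024
device = "cuda"

q_fp8 = torch.randn(B, next_n, H, D, device=device).to(torch.float8_e4m3fn)
kv_cache_fp8 = torch.zeros(
    num_blocks, block_size, 1, D + 4, device=device, dtype=torch.uint8
)
weights = torch.randn(B * next_n, H, device=device, dtype=torch.float32)
context_lens = torch.tensor([128, 256], device=device, dtype=torch.int32)
max_blocks = max_model_len // block_size
block_tables = torch.zeros(B, max_blocks, device=device, dtype=torch.int32)

# Build the SM schedule once per batch, then run the kernel.
metadata = get_paged_mqa_logits_metadata(context_lens, block_size, get_num_sms())
logits = fp8_paged_mqa_logits(
    q_fp8, kv_cache_fp8, weights, context_lens, block_tables,
    metadata, max_model_len,
)
assert logits.shape == (B * next_n, max_model_len)
```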
  
get_col_major_tma_aligned_tensor(x: Tensor) -> Tensor
Wrapper for DeepGEMM's get_mn_major_tma_aligned_tensor.

get_paged_mqa_logits_metadata(context_lens: Tensor, block_size: int, num_sms: int) -> Tensor
Build scheduling metadata for paged MQA logits.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| context_lens | Tensor | Tensor of shape [B], dtype int32; effective context length per batch element. | required | 
| block_size | int | KV-cache block size in tokens (e.g., 64). | required | 
| num_sms | int | Number of SMs available (e.g., 132 on Hopper/H100). | required |
Returns:
| Type | Description | 
|---|---|
| Tensor | Backend-specific tensor consumed by `fp8_paged_mqa_logits` to schedule work across SMs. |

is_deep_gemm_e8m0_used() -> bool  (cached)
Return True if vLLM is configured to use DeepGEMM E8M0 scales on a Hopper- or Blackwell-class GPU.

is_deep_gemm_supported() -> bool  (cached)
Return True if DeepGEMM is supported on the current platform. Currently, only Hopper and Blackwell GPUs are supported.
  
    
 per_block_cast_to_fp8(
    x: Tensor,
    block_size: list[int] = DEFAULT_BLOCK_SIZE,
    use_ue8m0: bool = False,
) -> tuple[Tensor, Tensor]
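No docstring survives for this entry, so the following is only an illustrative sketch; the 128×128 default block size and the FP8 output dtype are assumptions, and the UE8M0 toggle reuses `is_deep_gemm_e8m0_used()` documented above:

```python
import torch
from vllm.utils.deep_gemm import is_deep_gemm_e8m0_used, per_block_cast_to_fp8

w = torch.randn(1024, 2048, device="cuda")
# Quantize block-wise; pick UE8M0 scales when the platform expects them.
w_fp8, w_scales = per_block_cast_to_fp8(w, use_ue8m0=is_deep_gemm_e8m0_used())
# Assumed: w_fp8 has w's shape in FP8; w_scales holds one float32 scale
# per block (shape [8, 16] for 128x128 blocks on a 1024x2048 input).
```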
  
 should_use_deepgemm_for_fp8_linear(
    output_dtype: dtype,
    weight: Tensor,
    supports_deep_gemm: bool | None = None,
)
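
This entry also carries no docstring; here is a hypothetical dispatch sketch, assuming the helper returns a truthy value when the DeepGEMM path is preferable for the given output dtype and weight:

```python
import torch
from vllm.utils.deep_gemm import should_use_deepgemm_for_fp8_linear

weight = torch.empty(4096, 4096, device="cuda", dtype=torch.float8_e4m3fn)
if should_use_deepgemm_for_fp8_linear(torch.bfloat16, weight):
    ...  # take the DeepGEMM-backed FP8 GEMM path
else:
    ...  # fall back to another FP8 linear kernel
```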