vllm.model_executor.layers.quantization.utils.quant_utils ¶
This file is used for /tests and /benchmarks
kDynamic128Scale module-attribute ¶
kDynamic128Scale = ScaleDesc(
float32, False, GroupShape(1, 128)
)
kFp8Dynamic128Sym module-attribute ¶
kFp8Dynamic128Sym = QuantKey(
FP8_DTYPE, kDynamic128Scale, symmetric=True
)
kFp8Dynamic64Sym module-attribute ¶
kFp8Dynamic64Sym = QuantKey(
FP8_DTYPE, kDynamic64Scale, symmetric=True
)
kFp8DynamicTensorSym module-attribute ¶
kFp8DynamicTensorSym = QuantKey(
FP8_DTYPE, kDynamicTensorScale, symmetric=True
)
kFp8DynamicTokenSym module-attribute ¶
kFp8DynamicTokenSym = QuantKey(
FP8_DTYPE, kDynamicTokenScale, symmetric=True
)
kFp8Static128BlockSym module-attribute ¶
kFp8Static128BlockSym = QuantKey(
FP8_DTYPE, kStatic128BlockScale, symmetric=True
)
kFp8StaticChannelSym module-attribute ¶
kFp8StaticChannelSym = QuantKey(
FP8_DTYPE, kStaticChannelScale, symmetric=True
)
kFp8StaticTensorSym module-attribute ¶
kFp8StaticTensorSym = QuantKey(
FP8_DTYPE, kStaticTensorScale, symmetric=True
)
kFp8StaticTokenSym module-attribute ¶
kFp8StaticTokenSym = QuantKey(
FP8_DTYPE, kStaticTokenScale, symmetric=True
)
kNvfp4Dynamic module-attribute ¶
kNvfp4Dynamic = QuantKey(
FP4_DTYPE,
scale=kNvfp4DynamicGroupScale,
scale2=kStaticTensorScale,
)
kNvfp4DynamicGroupScale module-attribute ¶
kNvfp4DynamicGroupScale = ScaleDesc(
FP8_DTYPE, False, GroupShape(1, 16)
)
kNvfp4Static module-attribute ¶
kNvfp4Static = QuantKey(
FP4_DTYPE,
scale=kNvfp4StaticGroupScale,
scale2=kStaticTensorScale,
)
kNvfp4StaticGroupScale module-attribute ¶
kNvfp4StaticGroupScale = ScaleDesc(
FP8_DTYPE, True, GroupShape(1, 16)
)
kStatic128BlockScale module-attribute ¶
kStatic128BlockScale = ScaleDesc(
float32, True, GroupShape(128, 128)
)
GroupShape ¶
Bases: _GroupShape
This class describes the quantization group shape. It includes static members for common shapes (per-tensor, per-token).
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
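A minimal usage sketch, mirroring the group shapes used by the module constants above; the `PER_TENSOR`/`PER_TOKEN` member names are assumptions based on the description of static members:

```python
# Sketch: constructing quantization group shapes (mirrors the module
# constants above; PER_TENSOR/PER_TOKEN member names are assumptions).
from vllm.model_executor.layers.quantization.utils.quant_utils import GroupShape

per_group = GroupShape(1, 128)    # one scale per 128 elements, as in kDynamic128Scale
per_block = GroupShape(128, 128)  # one scale per 128x128 block, as in kStatic128BlockScale

# Assumed static members for the common shapes mentioned above:
# GroupShape.PER_TENSOR, GroupShape.PER_TOKEN
```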
QuantKey dataclass ¶
Class for identifying the type of quantization.

- `dtype`: quantized data type
- `scale`: scale descriptor
- `scale2`: second-level scale descriptor
- `symmetric`: symmetric if True, asymmetric if False
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
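A minimal sketch reproducing `kFp8Dynamic128Sym` from the constants above; treating `FP8_DTYPE` as importable from this module is an assumption:

```python
# Sketch: a QuantKey equivalent to kFp8Dynamic128Sym, built from fields
# documented on this page (FP8_DTYPE import location is an assumption).
from vllm.model_executor.layers.quantization.utils.quant_utils import (
    FP8_DTYPE,
    QuantKey,
    kDynamic128Scale,
)

key = QuantKey(FP8_DTYPE, kDynamic128Scale, symmetric=True)
```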
ScaleDesc dataclass ¶
Class for describing a single quantization scaling factor.

- `dtype`: data type of the scale
- `static`: static scale if True, dynamic if False
- `group_shape`: group shape of the scale
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
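A minimal sketch mirroring the `ScaleDesc` constants defined above (dtype, static flag, group shape):

```python
# Sketch: describing scales, mirroring the module constants above.
import torch

from vllm.model_executor.layers.quantization.utils.quant_utils import (
    GroupShape,
    ScaleDesc,
)

# Dynamic fp32 scales over 1x128 groups, as in kDynamic128Scale:
dyn_128 = ScaleDesc(torch.float32, False, GroupShape(1, 128))
# Static fp32 scales over 128x128 blocks, as in kStatic128BlockScale:
static_block = ScaleDesc(torch.float32, True, GroupShape(128, 128))
```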
__str__ ¶
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
_GroupShape ¶
_normalize_quant_group_shape ¶
_normalize_quant_group_shape(
x: Tensor, group_shape: GroupShape
)
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
awq_pack ¶
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
convert_bf16_scales_to_fp8 ¶
Convert a BF16 scale tensor into the pair of (fp8_scales, channel_scales) expected by W4A8 GEMM kernels.
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
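A hypothetical call sketch; only the `(fp8_scales, channel_scales)` return pair is documented here, so the single-argument form and the example shape are assumptions:

```python
# Hypothetical usage; the argument list is an assumption (only the
# (fp8_scales, channel_scales) return pair is documented here).
import torch

from vllm.model_executor.layers.quantization.utils.quant_utils import (
    convert_bf16_scales_to_fp8,
)

bf16_scales = torch.rand(4096, 32, dtype=torch.bfloat16)
fp8_scales, channel_scales = convert_bf16_scales_to_fp8(bf16_scales)
```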
convert_packed_uint4b8_to_signed_int4_inplace ¶
Convert packed uint4b8 values (stored in int32) to signed int4, in place.
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
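The uint4b8 encoding stores each 4-bit value with a bias of 8, so the signed value is the stored nibble minus 8. A standalone illustration of that per-nibble mapping (not the in-place kernel itself):

```python
# Standalone illustration of the uint4b8 -> signed int4 mapping
# (bias of 8); this is the arithmetic, not the in-place kernel.
def uint4b8_nibble_to_int4(nibble: int) -> int:
    assert 0 <= nibble <= 0xF
    return nibble - 8  # stored u in [0, 15] represents u - 8 in [-8, 7]

assert uint4b8_nibble_to_int4(0) == -8
assert uint4b8_nibble_to_int4(8) == 0
assert uint4b8_nibble_to_int4(15) == 7
```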
cutlass_fp4_supported ¶
cutlass_fp4_supported() -> bool
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
get_and_maybe_dequant_weights ¶
get_and_maybe_dequant_weights(
layer: LinearBase, out_dtype: dtype = float32
)
Return the layer's unquantized weights in [out, in] layout.
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
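A minimal sketch using the signature above; `layer` stands in for any `LinearBase` instance:

```python
# Sketch: fetch a layer's weights in full precision; `layer` is a
# placeholder for any LinearBase instance in the model.
import torch

from vllm.model_executor.layers.quantization.utils.quant_utils import (
    get_and_maybe_dequant_weights,
)

w = get_and_maybe_dequant_weights(layer, out_dtype=torch.float32)
# w is [out_features, in_features], dequantized if the layer was quantized.
```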
get_attribute_fallback ¶
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
get_fp8_min_max ¶
Get the min and max values for FP8 quantization.
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
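For the common `float8_e4m3fn` dtype these bounds match `torch.finfo`; a standalone illustration of the values involved (not this helper's own signature, which is not shown here):

```python
# Standalone illustration via torch.finfo; the actual helper may pick
# the FP8 dtype per platform, and its signature is not documented here.
import torch

info = torch.finfo(torch.float8_e4m3fn)
print(info.min, info.max)  # -448.0 448.0
```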
get_pack_factor ¶
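The name suggests the number of quantized elements that fit in a 32-bit word; a hypothetical call under that assumption:

```python
# Assumed semantics: elements per 32-bit word, i.e. 32 // num_bits.
# The actual signature is not documented on this page.
from vllm.model_executor.layers.quantization.utils.quant_utils import (
    get_pack_factor,
)

pack_factor = get_pack_factor(4)  # hypothetically 8 nibbles per int32
```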
gptq_pack ¶
gptq_quantize_weights ¶
gptq_quantize_weights(
w: Tensor,
quant_type: ScalarType,
group_size: int,
act_order: bool,
test_perm: Tensor | None = None,
)
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
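A minimal sketch using the signature above; `scalar_types.uint4b8` as the quant type and its import path are assumptions, and the result is left unpacked because its layout is not documented here:

```python
# Sketch per the signature above; the return layout is not documented
# here, so the result is left unpacked.
import torch

from vllm.model_executor.layers.quantization.utils.quant_utils import (
    gptq_quantize_weights,
)
from vllm.scalar_type import scalar_types  # assumed import path

w = torch.randn(4096, 4096, dtype=torch.half)
result = gptq_quantize_weights(
    w, quant_type=scalar_types.uint4b8, group_size=128, act_order=False
)
```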
group_broadcast ¶
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
is_layer_skipped ¶
is_layer_skipped(
prefix: str,
ignored_layers: list[str],
fused_mapping: Mapping[
str, list[str]
] = MappingProxyType({}),
*,
skip_with_substr: bool = False,
) -> bool
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
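A minimal sketch using the signature above; reading `skip_with_substr` as substring matching is an assumption from the name:

```python
# Sketch: deciding whether a layer bypasses quantization.
from vllm.model_executor.layers.quantization.utils.quant_utils import (
    is_layer_skipped,
)

skip = is_layer_skipped("lm_head", ["lm_head"])  # exact prefix match
# With skip_with_substr=True, matching is assumed to be by substring:
skip_sub = is_layer_skipped(
    "model.layers.0.mlp.gate_proj", ["gate_proj"], skip_with_substr=True
)
```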
pack_cols ¶
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
pack_quantized_values_into_int32 ¶
pack_quantized_values_into_int32(
w_q: Tensor, wtype: ScalarType, packed_dim: int = 0
)
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
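A round-trip sketch pairing this with `unpack_quantized_values_into_int32` (documented below); `scalar_types.uint4b8` and its import path are assumptions:

```python
# Sketch: pack 4-bit values along dim 0 into int32 words, then unpack.
import torch

from vllm.model_executor.layers.quantization.utils.quant_utils import (
    pack_quantized_values_into_int32,
    unpack_quantized_values_into_int32,
)
from vllm.scalar_type import scalar_types  # assumed import path

w_q = torch.randint(0, 16, (16, 32), dtype=torch.int32)
packed = pack_quantized_values_into_int32(w_q, scalar_types.uint4b8, packed_dim=0)
restored = unpack_quantized_values_into_int32(packed, scalar_types.uint4b8, packed_dim=0)
assert torch.equal(restored, w_q)
```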
pack_rows ¶
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
permute_rows ¶
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
prep_scale_for_group_broadcast ¶
prep_scale_for_group_broadcast(
scale: Tensor, x: Tensor, group_shape: GroupShape | None
) -> Tensor
Prepare the input quantization scale for group broadcasting.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `scale` | `Tensor` | The scale tensor (scalar or 1D). | *required* |
| `x` | `Tensor` | Target tensor whose shape determines broadcast dimensions. | *required* |
| `group_shape` | `GroupShape \| None` | GroupShape to broadcast over. | *required* |

Returns:

| Type | Description |
|---|---|
| `Tensor` | `scale` reshaped for correct broadcasting. |
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
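A minimal sketch using the signature and parameters above:

```python
# Sketch: reshape a 1-D per-row scale so it broadcasts against x.
import torch

from vllm.model_executor.layers.quantization.utils.quant_utils import (
    GroupShape,
    prep_scale_for_group_broadcast,
)

x = torch.randn(8, 128)
scale = torch.rand(8)  # one scale per row of 128 elements
scale_b = prep_scale_for_group_broadcast(scale, x, GroupShape(1, 128))
# scale_b is shaped so that expressions like x * scale_b broadcast correctly.
```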
quantize_weights ¶
quantize_weights(
w: Tensor,
quant_type: ScalarType,
group_size: int | None,
zero_points: bool = False,
ref_zero_points_after_scales: bool = False,
)
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
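A minimal sketch using the signature above; `scalar_types.uint4b8` and its import path are assumptions, and the result is left unpacked because its layout is not documented here:

```python
# Sketch per the signature above (scalar_types import path assumed);
# the return layout is not documented here, so it is left unpacked.
import torch

from vllm.model_executor.layers.quantization.utils.quant_utils import (
    quantize_weights,
)
from vllm.scalar_type import scalar_types  # assumed import path

w = torch.randn(4096, 4096, dtype=torch.half)
result = quantize_weights(w, scalar_types.uint4b8, group_size=128, zero_points=False)
```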
scaled_dequantize ¶
scaled_dequantize(
x_q: Tensor,
x_s: Tensor,
group_shape: GroupShape | None = None,
out_dtype: dtype = float32,
) -> Tensor
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
scaled_quantize ¶
scaled_quantize(
x: Tensor,
group_shape: GroupShape,
quant_dtype: dtype,
compute_dtype: dtype | None = None,
) -> tuple[Tensor, Tensor]
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `Tensor` | Input tensor to quantize | *required* |
| `group_shape` | `GroupShape` | Shape of quantization groups | *required* |
| `quant_dtype` | `dtype` | Target quantized dtype (e.g., `torch.float8_e4m3fn`) | *required* |
| `compute_dtype` | `dtype \| None` | Optional dtype for intermediate computations. If `None`, uses the input dtype. Use `torch.float32` for higher precision. | `None` |
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
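A round-trip sketch exercising `scaled_quantize` together with `scaled_dequantize` above, using only the two documented signatures:

```python
# Round trip: quantize x to FP8 with 1x128 groups, then dequantize.
import torch

from vllm.model_executor.layers.quantization.utils.quant_utils import (
    GroupShape,
    scaled_dequantize,
    scaled_quantize,
)

x = torch.randn(256, 512, dtype=torch.float32)
x_q, x_s = scaled_quantize(
    x, GroupShape(1, 128), torch.float8_e4m3fn, compute_dtype=torch.float32
)
x_dq = scaled_dequantize(x_q, x_s, GroupShape(1, 128), out_dtype=torch.float32)
# x_dq approximates x up to FP8 rounding error.
```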
sort_weights ¶
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
swizzle_blockscale ¶
Pad and block-interleave the FP4 block-scales so that they match the data layout expected by the CUTLASS / FlashInfer kernels.
Parameters¶
- `scale` (`torch.Tensor`): the block-scales to swizzle.

Returns¶
- `torch.Tensor`: the swizzled tensor with the same logical shape as `scale`.
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
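A minimal call sketch; the example scale shape and FP8 dtype are assumptions (the page only states that the input is a `torch.Tensor` of FP4 block-scales):

```python
# Sketch: swizzle FP4 block-scales into the kernel-expected layout.
# The shape and FP8 dtype of the scales here are assumptions.
import torch

from vllm.model_executor.layers.quantization.utils.quant_utils import (
    swizzle_blockscale,
)

scales = torch.rand(128, 16).to(torch.float8_e4m3fn)
swizzled = swizzle_blockscale(scales)  # same logical shape, interleaved layout
```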
unpack_cols ¶
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
unpack_quantized_values_into_int32 ¶
unpack_quantized_values_into_int32(
w_q: Tensor, wtype: ScalarType, packed_dim: int = 0
)