vllm.model_executor.layers.fused_moe.triton_cutlass_moe ¶
TritonOrCutlassExperts ¶
Bases: FallbackExperts
CUTLASS experts with a fallback to Triton for low-latency shapes on SM100.
Source code in vllm/model_executor/layers/fused_moe/triton_cutlass_moe.py
__init__ ¶
__init__(
    moe_config: FusedMoEConfig,
    quant_config: FusedMoEQuantConfig,
)
Source code in vllm/model_executor/layers/fused_moe/triton_cutlass_moe.py
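A minimal construction sketch, assuming the FusedMoEConfig and FusedMoEQuantConfig objects have already been built by the surrounding MoE layer setup (their own construction is model-specific and not shown here):

# Sketch: constructing the wrapper from pre-built config objects.
from vllm.model_executor.layers.fused_moe.triton_cutlass_moe import (
    TritonOrCutlassExperts,
)

def build_experts(moe_config, quant_config):
    # Only the two documented config objects are required; the choice
    # between the CUTLASS and Triton paths happens later, per batch shape.
    return TritonOrCutlassExperts(
        moe_config=moe_config,
        quant_config=quant_config,
    )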
_select_experts_impl ¶
_select_experts_impl(
    hidden_states: Tensor, w1: Tensor, w2: Tensor
) -> FusedMoEPermuteExpertsUnpermute
Source code in vllm/model_executor/layers/fused_moe/triton_cutlass_moe.py
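The dispatch heuristic itself is not reproduced on this page. The snippet below is only an illustrative sketch of shape-based selection between the two implementations; the threshold _SMALL_M and the helper name pick_impl are invented for illustration and are not part of the vLLM API:

# Hypothetical illustration of the fallback idea, not the real heuristic.
_SMALL_M = 128  # made-up cutoff for "low latency" (small-batch) shapes

def pick_impl(hidden_states, cutlass_experts, triton_experts):
    # For small token counts (latency-bound decode shapes) fall back to
    # the Triton experts; otherwise keep the CUTLASS experts.
    m = hidden_states.shape[0]
    return triton_experts if m <= _SMALL_M else cutlass_experts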
get_clses staticmethod ¶
get_clses() -> tuple[
    type[FusedMoEPermuteExpertsUnpermute],
    type[FusedMoEPermuteExpertsUnpermute],
]
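Since get_clses is a staticmethod, it can be called without an instance to inspect the two concrete expert classes this wrapper dispatches between. Which element is the CUTLASS class and which is the Triton fallback is an assumption here, not stated on this page:

from vllm.model_executor.layers.fused_moe.triton_cutlass_moe import (
    TritonOrCutlassExperts,
)

# Both entries are FusedMoEPermuteExpertsUnpermute subclasses; the
# primary/fallback ordering of the tuple is assumed, not documented.
primary_cls, fallback_cls = TritonOrCutlassExperts.get_clses()
print(primary_cls.__name__, fallback_cls.__name__)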
workspace_shapes ¶
workspace_shapes(
    M: int,
    N: int,
    K: int,
    topk: int,
    global_num_experts: int,
    local_num_experts: int,
    expert_tokens_meta: ExpertTokensMetadata | None,
    activation: str,
) -> tuple[
    tuple[int, ...], tuple[int, ...], tuple[int, ...]
]
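A usage sketch for workspace_shapes, assuming experts is an already-constructed TritonOrCutlassExperts instance. The dimension values are arbitrary placeholders, and the names given to the three returned shape tuples are assumptions about their roles:

# Sketch: querying buffer shapes for a hypothetical batch; numbers are arbitrary.
ws1_shape, ws2_shape, out_shape = experts.workspace_shapes(
    M=256,                    # number of tokens
    N=4096,                   # per-expert intermediate size
    K=1024,                   # hidden size
    topk=2,                    # experts activated per token
    global_num_experts=8,
    local_num_experts=8,      # assumes no expert parallelism
    expert_tokens_meta=None,  # the annotated type allows None
    activation="silu",
)
# Each value is a tuple[int, ...]; the workspace/output roles named above are assumed.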