astrlrd committed on
Commit 20587ff · verified · 1 Parent(s): 34674c6

Update README.md

Files changed (1)
  1. README.md +4 -2
README.md CHANGED
@@ -76,6 +76,8 @@ Average acceptance length (τ) measured across MT-bench, HumanEval, and GSM8K wi
 
 *Measured at temperature = 1 with K = 7*
 
+> **Note:** Earlier vLLM versions sampled draft tokens greedily regardless of temperature, which underestimated acceptance rates at temperature > 0. Stochastic draft sampling was introduced in **v0.18.0**, and from **v0.21.0** it can be enabled via `speculative_config` using `rejection_sample_method` and `draft_sample_method`. The acceptance metrics reported above were measured under standard rejection sampling and are reproducible with the configuration below.
+
 ## Usage with vLLM
 ```python
 from vllm import LLM, SamplingParams
@@ -86,6 +88,8 @@ llm = LLM(
         "method": "eagle3",
         "model": "nebius/EAGLE3-Llama-3.3-70B-Instruct",
         "num_speculative_tokens": 6,
+        "rejection_sample_method": "standard",
+        "draft_sample_method": "gumbel",
     },
 )
 
@@ -93,8 +97,6 @@ sampling_params = SamplingParams(temperature=0.7)
 outputs = llm.generate(["Explain speculative decoding in simple terms."], sampling_params)
 ```
 
-> **Note**: The current vLLM implementation samples draft tokens greedily regardless of temperature settings, which can underestimate acceptance rates at temperature > 0. A community fix is under development (see [vllm-project/vllm#20459](https://github.com/vllm-project/vllm/pull/20459)). The acceptance metrics reported above were measured with proper rejection sampling.
-
 ## License
 
 [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
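
The note changed in this commit refers to "standard rejection sampling" with stochastically sampled draft tokens. For readers unfamiliar with that scheme, here is a minimal pure-Python sketch of the textbook accept/reject step for a single draft token. It is not vLLM's implementation: the function names (`speculative_accept`, `sample_token`, `empirical_distribution`) and the toy three-token distributions in the usage note are hypothetical, chosen only for illustration.

```python
import random


def speculative_accept(p_target, q_draft, draft_token, rng):
    """One accept/reject step of standard speculative sampling.

    p_target, q_draft: probability lists over the same vocabulary.
    draft_token: index proposed by the draft model (sampled from q_draft).
    Returns (token, accepted).
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]
    # Accept the draft token with probability min(1, p / q).
    if q > 0.0 and rng.random() < min(1.0, p / q):
        return draft_token, True
    # Otherwise resample from the normalized residual max(p - q, 0).
    residual = [max(pt - qd, 0.0) for pt, qd in zip(p_target, q_draft)]
    if sum(residual) == 0.0:
        # Degenerate case (p equals q): fall back to the target distribution.
        residual = list(p_target)
    vocab = range(len(p_target))
    return rng.choices(vocab, weights=residual)[0], False


def sample_token(p_target, q_draft, rng):
    """Draw a draft token stochastically from q_draft, then verify it."""
    draft = rng.choices(range(len(q_draft)), weights=q_draft)[0]
    token, _ = speculative_accept(p_target, q_draft, draft, rng)
    return token


def empirical_distribution(p_target, q_draft, n, seed=0):
    """Estimate the output distribution of sample_token over n draws."""
    rng = random.Random(seed)
    counts = [0] * len(p_target)
    for _ in range(n):
        counts[sample_token(p_target, q_draft, rng)] += 1
    return [c / n for c in counts]
```

Calling `empirical_distribution([0.5, 0.3, 0.2], [0.2, 0.5, 0.3], 20000)` should return frequencies close to the target distribution. That is the property this scheme is designed to preserve when draft tokens are actually sampled from `q_draft`.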