<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>

<rfc xmlns:xi="http://www.w3.org/2001/XInclude"
     category="info"
     docName="draft-gaikwad-llm-benchmarking-methodology-00"
     ipr="trust200902"
     obsoletes=""
     updates=""
     submissionType="IETF"
     xml:lang="en"
     version="3">

  <front>
    <title abbrev="LLM Benchmarking Methodology">Benchmarking Methodology for Large Language Model Serving</title>
    
    <seriesInfo name="Internet-Draft" value="draft-gaikwad-llm-benchmarking-methodology-00"/>
    
    <author fullname="Madhava Gaikwad" initials="M." surname="Gaikwad">
      <organization>Independent Researcher</organization>
      <address>
        <email>gaikwad.madhav@gmail.com</email>
      </address>
    </author>
    
    <date year="2026" month="January"/>
    
    <area>Operations and Management</area>
    <workgroup>Network Working Group</workgroup>
    
    <keyword>LLM</keyword>
    <keyword>benchmarking</keyword>
    <keyword>inference</keyword>
    <keyword>serving</keyword>
    <keyword>methodology</keyword>
    
    <abstract>
      <t>This document defines benchmarking methodologies for Large Language
      Model (LLM) inference serving systems. It provides test procedures,
      setup parameters, measurement specifications, and reporting formats
      for evaluating latency, throughput, scheduling, and resource
      management characteristics. This document is a companion to
      "Benchmarking Terminology for Large Language Model Serving", which
      defines the metrics measured here and should be consulted alongside
      this document.</t>
    </abstract>
  </front>

  <middle>
    <section anchor="introduction">
      <name>Introduction</name>
      
      <t>This document provides benchmarking methodologies for Large Language
      Model inference serving systems. It defines test procedures,
      measurement specifications, and reporting formats that enable
      meaningful performance comparison.</t>
      
      <t>A companion document, "Benchmarking Terminology for Large Language
      Model Serving" <xref target="LLM-TERMS"/>, defines the metrics
      referenced in this methodology and SHOULD be consulted before
      applying it.</t>
      
      <t>LLM serving systems present unique benchmarking challenges:</t>
      
      <dl>
        <dt>Streaming responses:</dt>
        <dd>Output tokens arrive incrementally over seconds or minutes,
        requiring timing measurements at multiple points within a single
        request.</dd>
        
        <dt>Phase separation:</dt>
        <dd>The prefill phase (processing input) and decode phase
        (generating output) have distinct computational profiles and
        optimization targets.</dd>
        
        <dt>Memory-bound decoding:</dt>
        <dd>The decode phase is limited by memory bandwidth rather than
        compute, creating different bottlenecks than traditional neural
        network inference.</dd>
        
        <dt>Dynamic batching:</dt>
        <dd>Continuous batching systems interleave requests, causing
        per-request performance to depend on concurrent load.</dd>
        
        <dt>Context-dependent performance:</dt>
        <dd>Request latency varies with input length, output length, and
        cache state, making workload specification critical.</dd>
      </dl>
      
      <t>These characteristics require methodology beyond traditional
      throughput and latency measurement. This document addresses these
      challenges by specifying:</t>
      
      <ul>
        <li>Test configurations for different system boundaries</li>
        <li>Reference workloads with defined characteristics</li>
        <li>Measurement procedures for streaming responses</li>
        <li>Statistical requirements for reliable percentile estimation</li>
        <li>Reporting formats enabling meaningful comparison</li>
      </ul>
      
      <t>This document does not specify acceptance thresholds or recommend
      particular systems. It provides methodology for fair comparison.</t>
    </section>

    <section anchor="requirements">
      <name>Requirements Language</name>
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
      NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED",
      "MAY", and "OPTIONAL" in this document are to be interpreted as
      described in BCP 14 <xref target="RFC2119"/> <xref target="RFC8174"/>
      when, and only when, they appear in all capitals, as shown here.</t>
      
      <t>An implementation is not compliant if it fails to satisfy one or more
      of the MUST requirements for a given test. An implementation that
      satisfies all the MUST and all the SHOULD requirements for a test is
      said to be "unconditionally compliant" for that test; one that
      satisfies all the MUST requirements but not all the SHOULD
      requirements is said to be "conditionally compliant."</t>
    </section>

    <section anchor="scope">
      <name>Scope</name>
      
      <t>This document covers benchmarking methodology for transformer-based
      autoregressive language models deployed as network services. The
      methodology applies to:</t>
      
      <ul>
        <li>Inference engines executing model forward passes</li>
        <li>Application gateways providing API endpoints</li>
        <li>Compound systems with retrieval or tool execution</li>
      </ul>
      
      <t>The following are out of scope:</t>
      
      <ul>
        <li>Model training or fine-tuning performance</li>
        <li>Model quality or accuracy evaluation</li>
        <li>Non-autoregressive models (diffusion, encoder-only)</li>
        <li>Edge deployment or on-device inference</li>
        <li>Specific vendor implementations or products</li>
      </ul>
    </section>

    <section anchor="test-setup">
      <name>Test Setup</name>
      
      <section anchor="sut-configurations">
        <name>System Under Test Configurations</name>
        
        <t>The System Under Test (SUT) boundary MUST be declared before
        benchmarking. This document defines three standard configurations.</t>
        
        <section anchor="model-engine-config">
          <name>Model Engine Configuration</name>
          
          <t>The Model Engine configuration measures raw inference capability.</t>
          
          <figure anchor="fig-model-engine">
            <name>Model Engine Configuration</name>
            <artwork type="ascii-art"><![CDATA[
                +------------------+
                |   Load Generator |
                +--------+---------+
                         |
              Internal API (gRPC/HTTP)
                         |
                +--------v---------+
                |   Model Engine   |
                |  (SUT Boundary)  |
                +------------------+
]]></artwork>
          </figure>
          
          <t>Included components:</t>
          <ul>
            <li>Model weights and inference runtime</li>
            <li>Batching and scheduling logic</li>
            <li>KV cache management</li>
            <li>Tensor operations and kernels</li>
          </ul>
          
          <t>Excluded components:</t>
          <ul>
            <li>External network transport</li>
            <li>Authentication and authorization</li>
            <li>Rate limiting</li>
            <li>Input/output safety filtering</li>
            <li>Load balancing</li>
          </ul>
          
          <t>This configuration is appropriate for comparing inference engines
          (vLLM, TensorRT-LLM, SGLang) independent of deployment stack.</t>
        </section>
        
        <section anchor="gateway-config">
          <name>Application Gateway Configuration</name>
          
          <t>The Application Gateway configuration measures user-observable API
          performance.</t>
          
          <figure anchor="fig-gateway">
            <name>Application Gateway Configuration</name>
            <artwork type="ascii-art"><![CDATA[
                +------------------+
                |   Load Generator |
                +--------+---------+
                         |
              External API (HTTPS)
                         |
                +--------v---------+
                | Application GW   |
                |  (SUT Boundary)  |
                |  +------------+  |
                |  |   Engine   |  |
                |  +------------+  |
                +------------------+
]]></artwork>
          </figure>
          
          <t>Included components (in addition to Model Engine):</t>
          <ul>
            <li>TLS termination</li>
            <li>Authentication and session management</li>
            <li>Rate limiting and quota enforcement</li>
            <li>Input validation and output filtering</li>
            <li>Safety guardrails</li>
          </ul>
          
          <t>This configuration is appropriate for comparing API providers or
          evaluating production deployment performance.</t>
        </section>
        
        <section anchor="compound-config">
          <name>Compound System Configuration</name>
          
          <t>The Compound System configuration measures end-to-end task
          completion for agentic or retrieval-augmented workloads.</t>
          
          <figure anchor="fig-compound">
            <name>Compound System Configuration</name>
            <artwork type="ascii-art"><![CDATA[
                +------------------+
                |   Task Driver    |
                +--------+---------+
                         |
                +--------v---------+
                |  Compound System |
                |  (SUT Boundary)  |
                |  +------------+  |
                |  | Retrieval  |  |
                |  +------------+  |
                |  +------------+  |
                |  |   Tools    |  |
                |  +------------+  |
                |  +------------+  |
                |  |  Gateway   |  |
                |  +------------+  |
                +------------------+
]]></artwork>
          </figure>
          
          <t>Included components (in addition to Application Gateway):</t>
          <ul>
            <li>Retrieval pipeline (embedding, vector search, reranking)</li>
            <li>Tool execution environment</li>
            <li>Orchestration logic</li>
            <li>Multi-turn conversation state</li>
          </ul>
          
          <t>This configuration is appropriate for evaluating RAG systems or
          agentic applications.</t>
        </section>
      </section>
      
      <section anchor="load-generator">
        <name>Load Generator Requirements</name>
        
        <t>The load generator produces requests and measures responses. It
        MUST satisfy the following requirements.</t>
        
        <section anchor="timing-resolution">
          <name>Timing Resolution</name>
          <t>The load generator MUST measure time with a resolution of
          1 millisecond or better. Microsecond resolution is RECOMMENDED
          for Inter-Token Latency (ITL) measurement.</t>
        </section>
        
        <section anchor="streaming-support">
          <name>Streaming Support</name>
          <t>The load generator MUST support streaming response protocols (SSE,
          WebSocket, or gRPC streaming). It MUST record the arrival time of
          each token or chunk, not only the complete response.</t>
        </section>
        
        <section anchor="open-loop">
          <name>Open-Loop Load Generation</name>
          <t>The load generator MUST support open-loop load generation where
          request arrival times are determined by a specified distribution
          independent of response times. Poisson arrivals MUST be supported.
          Uniform and bursty arrival patterns are RECOMMENDED.</t>
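          <t>As a non-normative sketch, Poisson arrivals can be generated by
          drawing exponential inter-arrival gaps from a seeded RNG; the
          machinery that actually submits requests at these offsets is
          assumed to exist elsewhere:</t>
          <sourcecode type="python"><![CDATA[
import random

def poisson_arrival_offsets(rate_rps, n_requests, seed=0):
    """Cumulative submission offsets (seconds) for an open-loop
    Poisson arrival process at rate_rps requests per second."""
    rng = random.Random(seed)  # deterministic seed for reproducibility
    offsets, t = [], 0.0
    for _ in range(n_requests):
        t += rng.expovariate(rate_rps)  # exponential inter-arrival gap
        offsets.append(t)
    return offsets

# 10 req/s for 1000 requests; mean gap is approximately 0.1 s.
offsets = poisson_arrival_offsets(10.0, 1000, seed=42)
]]></sourcecode>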
        </section>
        
        <section anchor="closed-loop">
          <name>Closed-Loop Load Generation</name>
          <t>The load generator MUST support closed-loop load generation where
          a fixed number of concurrent requests are maintained. When a
          request completes, a new request is immediately submitted.</t>
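          <t>A minimal closed-loop driver can be sketched with asyncio
          workers; <tt>send_request</tt> is a placeholder for the actual
          client call, not part of this specification:</t>
          <sourcecode type="python"><![CDATA[
import asyncio

async def closed_loop(send_request, concurrency, total):
    """Keep exactly `concurrency` requests in flight; each worker
    submits a new request as soon as its previous one completes."""
    pending = iter(range(total))  # shared iterator doles out request IDs
    results = []

    async def worker():
        for req_id in pending:
            results.append(await send_request(req_id))

    await asyncio.gather(*(worker() for _ in range(concurrency)))
    return results

async def fake_request(req_id):  # stand-in for a real client call
    await asyncio.sleep(0)
    return req_id

out = asyncio.run(closed_loop(fake_request, concurrency=4, total=20))
]]></sourcecode>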
        </section>
        
        <section anchor="request-isolation">
          <name>Request Isolation</name>
          <t>The load generator MUST NOT allow slow responses to delay the
          submission of subsequent requests in open-loop mode. Asynchronous
          or multi-threaded implementation is REQUIRED.</t>
        </section>
        
        <section anchor="output-recording">
          <name>Output Recording</name>
          <t>The load generator MUST record for each request:</t>
          <ul>
            <li>Request submission timestamp</li>
            <li>First token arrival timestamp</li>
            <li>Each subsequent token arrival timestamp</li>
            <li>Final token arrival timestamp</li>
            <li>Total input token count</li>
            <li>Total output token count</li>
            <li>Request success/failure status</li>
          </ul>
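          <t>The record above maps to a small structure from which TTFT and
          ITL can be derived; the field names in this sketch are
          illustrative, not normative:</t>
          <sourcecode type="python"><![CDATA[
from dataclasses import dataclass, field

@dataclass
class RequestRecord:
    """Per-request data required by this methodology (times in seconds)."""
    submit_ts: float
    token_ts: list = field(default_factory=list)  # each token's arrival
    input_tokens: int = 0
    output_tokens: int = 0
    success: bool = True

    @property
    def ttft(self):
        """Time to first token: first arrival minus submission."""
        return self.token_ts[0] - self.submit_ts

    @property
    def itls(self):
        """Inter-token latencies between consecutive arrivals."""
        return [b - a for a, b in zip(self.token_ts, self.token_ts[1:])]

rec = RequestRecord(submit_ts=0.0, token_ts=[0.25, 0.30, 0.36],
                    input_tokens=128, output_tokens=3)
]]></sourcecode>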
        </section>
      </section>
      
      <section anchor="reference-workloads">
        <name>Reference Workloads</name>
        
        <t>Workload specification is critical for reproducible benchmarking.
        This document defines reference workloads with fixed characteristics.
        Testers MAY use custom workloads but MUST fully specify them.</t>
        
        <section anchor="workload-params">
          <name>Workload Parameters</name>
          
          <t>Each workload MUST specify:</t>
          
          <dl>
            <dt>Input length distribution:</dt>
            <dd><t>Distribution type (fixed, uniform, normal, empirical),
            parameters (mean, std, min, max, or histogram), and
            unit (tokens using specified tokenizer).</t></dd>
            
            <dt>Output length distribution:</dt>
            <dd><t>Distribution type (fixed, uniform, normal, empirical),
            parameters (mean, std, min, max, or histogram),
            control method (max_tokens parameter, stop sequence, or both), and
            unit (tokens using specified tokenizer).</t></dd>
            
            <dt>Content characteristics:</dt>
            <dd><t>Domain (general, code, conversation, instruction),
            language (English, multilingual, code languages), and
            system prompt presence and typical length.</t></dd>
            
            <dt>Prefix sharing:</dt>
            <dd><t>Fraction of requests sharing common prefix and
            shared prefix length distribution.</t></dd>
          </dl>
        </section>
        
        <section anchor="standard-workloads">
          <name>Standard Workloads</name>
          
          <t>This document defines five standard workloads. Full specifications
          appear in <xref target="workload-specs"/>.</t>
          
          <section anchor="synthetic-uniform">
            <name>Synthetic-Uniform</name>
            <t>Purpose: Baseline comparison with controlled variability</t>
            <ul>
              <li>Input length: Uniform(128, 512) tokens</li>
              <li>Output length: Uniform(64, 256) tokens</li>
              <li>Content: Random token sequences (no semantic meaning)</li>
              <li>Prefix sharing: None</li>
            </ul>
            <t>This workload isolates inference performance from content effects.
            It is REQUIRED for Model Engine benchmarking.</t>
          </section>
          
          <section anchor="synthetic-skewed">
            <name>Synthetic-Skewed</name>
            <t>Purpose: Test behavior under realistic length variation</t>
            <ul>
              <li>Input length: Log-normal(mu=5.5, sigma=1.0) tokens, capped at 4096</li>
              <li>Output length: Log-normal(mu=4.5, sigma=1.2) tokens, capped at 2048</li>
              <li>Content: Random token sequences</li>
              <li>Prefix sharing: None</li>
            </ul>
            <t>This workload tests scheduling fairness with high length variance.</t>
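            <t>A seeded sketch of the capped log-normal sampling above; the
            rounding and floor-at-one choices are this sketch's, not
            requirements of this document:</t>
            <sourcecode type="python"><![CDATA[
import random

def sample_lengths(mu, sigma, cap, n, seed):
    """Token lengths from Log-normal(mu, sigma), rounded, floored at 1,
    and capped, using a deterministic seed for reproducibility."""
    rng = random.Random(seed)
    return [min(cap, max(1, round(rng.lognormvariate(mu, sigma))))
            for _ in range(n)]

# Synthetic-Skewed inputs: median near exp(5.5) ~ 245 tokens, cap 4096.
inputs = sample_lengths(5.5, 1.0, 4096, 1000, seed=7)
]]></sourcecode>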
          </section>
          
          <section anchor="conversation">
            <name>Conversation</name>
            <t>Purpose: Simulate interactive chat workloads</t>
            <ul>
              <li>Input length: Empirical distribution from ShareGPT dataset</li>
              <li>Output length: Empirical distribution from ShareGPT dataset</li>
              <li>Content: Natural language conversation</li>
              <li>Prefix sharing: 50% share 200-token system prompt</li>
            </ul>
            <t>This workload is RECOMMENDED for Application Gateway benchmarking.</t>
          </section>
          
          <section anchor="code-completion">
            <name>Code Completion</name>
            <t>Purpose: Simulate coding assistant workloads</t>
            <ul>
              <li>Input length: Empirical from code completion datasets</li>
              <li>Output length: Log-normal(mu=4.0, sigma=1.5) tokens</li>
              <li>Content: Source code in Python, JavaScript, TypeScript</li>
              <li>Prefix sharing: 80% share repository context prefix</li>
            </ul>
            <t>This workload tests prefix caching effectiveness.</t>
          </section>
          
          <section anchor="long-context">
            <name>Long Context</name>
            <t>Purpose: Test long-context behavior</t>
            <ul>
              <li>Input length: Uniform(8192, 32768) tokens</li>
              <li>Output length: Fixed at 256 tokens</li>
              <li>Content: Document + question format</li>
              <li>Prefix sharing: None</li>
            </ul>
            <t>This workload is REQUIRED for Long Context Scaling tests.</t>
          </section>
        </section>
        
        <section anchor="workload-reproducibility">
          <name>Workload Reproducibility</name>
          <t>For reproducible benchmarking:</t>
          <ul>
            <li>Testers MUST use deterministic random seeds for workload
            generation. The seed MUST be reported.</li>
            <li>Testers SHOULD publish the exact request sequences used, or
            provide generation code with fixed seeds.</li>
            <li>When using dataset-derived workloads (ShareGPT, HumanEval),
            testers MUST specify the dataset version, subset selection
            method, and any preprocessing applied.</li>
          </ul>
        </section>
      </section>
      
      <section anchor="tokenization">
        <name>Tokenization</name>
        
        <t>Token counts depend on the tokenizer. Different tokenizers produce
        different counts for identical text, making cross-system comparison
        challenging.</t>
        
        <section anchor="tokenizer-spec">
          <name>Tokenizer Specification</name>
          <t>The test report MUST specify:</t>
          <ul>
            <li>Tokenizer name and version (e.g., "cl100k_base", "Llama-3 tokenizer")</li>
            <li>Vocabulary size</li>
            <li>Source (Hugging Face model ID, tiktoken name, or custom)</li>
          </ul>
        </section>
        
        <section anchor="token-counting">
          <name>Token Counting Method</name>
          <t>For cross-system comparison where systems use different tokenizers:</t>
          
          <dl>
            <dt>Option A - Native tokenizer:</dt>
            <dd>Count tokens using each system's native tokenizer. Report results
            separately with tokenizer identified. This method reflects actual
            system behavior but complicates comparison.</dd>
            
            <dt>Option B - Reference tokenizer:</dt>
            <dd>Count tokens using a declared reference tokenizer for all systems.
            This enables direct comparison but may not reflect actual system
            token counts.</dd>
          </dl>
          
          <t>The test report MUST declare which option is used. Option B with
          cl100k_base (GPT-4 tokenizer) as reference is RECOMMENDED for
          cross-system comparison.</t>
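          <t>A non-normative Option B sketch follows. In practice the
          reference would be cl100k_base via the third-party tiktoken
          package; a whitespace splitter stands in here so the example is
          self-contained:</t>
          <sourcecode type="python"><![CDATA[
def whitespace_tokenizer(text):
    """Stand-in for a real reference tokenizer such as cl100k_base."""
    return text.split()

def count_tokens(text, tokenizer):
    return len(tokenizer(text))

# Option B: traffic for EVERY system under test is counted with one
# declared reference tokenizer, even if each system tokenizes
# differently internally.
reference = whitespace_tokenizer
n = count_tokens("The quick brown fox jumps", reference)
]]></sourcecode>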
        </section>
        
        <section anchor="special-tokens">
          <name>Special Token Handling</name>
          <t>The test report MUST specify handling of:</t>
          <ul>
            <li>BOS/EOS tokens (included or excluded from counts)</li>
            <li>System prompt tokens (counted separately or included)</li>
            <li>Tool/function call formatting tokens</li>
          </ul>
        </section>
      </section>
      
      <section anchor="warmup">
        <name>Warm-up Procedures</name>
        
        <t>LLM serving systems require warm-up before reaching steady-state
        performance. Warm-up effects include JIT compilation, memory
        allocator initialization, prefix cache population, and batch
        size ramp-up.</t>
        
        <section anchor="warmup-requirements">
          <name>Warm-up Requirements</name>
          <t>Before measurement begins, testers MUST:</t>
          <ol>
            <li>Load the model fully into accelerator memory</li>
            <li>Process at least 100 requests or 10,000 output tokens,
            whichever is greater</li>
            <li>Wait for request queue to drain completely</li>
            <li>If prefix caching is enabled and being tested, populate the
            cache with representative prefixes</li>
          </ol>
        </section>
        
        <section anchor="warmup-verification">
          <name>Warm-up Verification</name>
          <t>Testers SHOULD verify warm-up completion by:</t>
          <ol>
            <li>Measuring latency for a probe request before and after warm-up</li>
            <li>Confirming latency stabilization (less than 10% variation
            across consecutive probe requests)</li>
          </ol>
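          <t>One way to implement the stabilization check above; the
          three-probe window is this sketch's choice, while the 10% bound
          comes from the requirement:</t>
          <sourcecode type="python"><![CDATA[
def warmed_up(probe_latencies, window=3, tolerance=0.10):
    """True when each of the last `window` probe latencies deviates
    from their mean by less than `tolerance` (10%)."""
    if len(probe_latencies) < window:
        return False
    recent = probe_latencies[-window:]
    mean = sum(recent) / window
    return all(abs(x - mean) / mean < tolerance for x in recent)

# Probe latency falls during JIT compilation, then stabilizes.
history = [2.10, 1.40, 0.52, 0.50, 0.51]
]]></sourcecode>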
        </section>
        
        <section anchor="cold-start">
          <name>Cold Start Measurement</name>
          <t>When cold start performance is being measured (Model Load Time,
          Cold Start Latency), warm-up MUST be skipped. The test report
          MUST clearly indicate cold start measurement.</t>
        </section>
      </section>
      
      <section anchor="streaming-protocol">
        <name>Streaming Protocol</name>
        
        <t>LLM serving systems deliver tokens via streaming protocols. The
        choice of protocol affects timing measurement.</t>
        
        <section anchor="supported-protocols">
          <name>Supported Protocols</name>
          <t>This methodology supports:</t>
          <dl>
            <dt>Server-Sent Events (SSE):</dt>
            <dd>HTTP-based streaming. Each event contains one or more tokens.
            RECOMMENDED for Application Gateway testing.</dd>
            
            <dt>WebSocket:</dt>
            <dd>Bidirectional streaming. Each message contains one or more tokens.</dd>
            
            <dt>gRPC streaming:</dt>
            <dd>Binary streaming protocol. Each message contains one or more tokens.
            RECOMMENDED for Model Engine testing.</dd>
          </dl>
        </section>
        
        <section anchor="token-chunking">
          <name>Token Chunking</name>
          <t>Streaming protocols may deliver multiple tokens per chunk due to
          batching or network buffering. The test report MUST specify:</t>
          <ul>
            <li>Protocol used</li>
            <li>Whether each chunk contains exactly one token or potentially
            multiple tokens</li>
            <li>How multi-token chunks are handled for ITL calculation</li>
          </ul>
        </section>
        
        <section anchor="itl-chunked">
          <name>ITL Calculation with Chunked Delivery</name>
          <t>When chunks contain multiple tokens:</t>
          <dl>
            <dt>Option A - Chunk timing:</dt>
            <dd>Measure inter-chunk latency. Report as "Time Between Chunks"
            rather than ITL. Note chunk size distribution.</dd>
            
            <dt>Option B - Distributed timing:</dt>
            <dd>Distribute the inter-chunk interval evenly across tokens. If
            a chunk with N tokens arrives at time T and the previous chunk
            arrived at time T_prev, assign each of the N tokens an ITL of
            (T - T_prev)/N. This smoothing understates ITL variance.</dd>
            
            <dt>Option C - Server-side timing:</dt>
            <dd>Use server-reported per-token timestamps if available. This
            measures ITL independent of network effects.</dd>
          </dl>
          <t>The test report MUST declare which option is used. Option C is
          RECOMMENDED when available.</t>
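          <t>A non-normative sketch of chunk timing (Option A) alongside an
          even-spread per-token smoothing; input is a list of
          (arrival_time_s, tokens_in_chunk) pairs:</t>
          <sourcecode type="python"><![CDATA[
def interchunk_gaps(chunks):
    """Option A: time between consecutive chunk arrivals."""
    times = [t for t, _ in chunks]
    return [b - a for a, b in zip(times, times[1:])]

def spread_itls(chunks):
    """Spread each inter-chunk gap evenly over the later chunk's
    tokens; this smoothing understates true ITL variance."""
    itls = []
    for (t0, _), (t1, n1) in zip(chunks, chunks[1:]):
        itls.extend([(t1 - t0) / n1] * n1)
    return itls

chunks = [(0.00, 1), (0.12, 3), (0.20, 2)]
]]></sourcecode>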
        </section>
      </section>
      
      <section anchor="clock-sync">
        <name>Clock Synchronization</name>
        
        <t>Accurate timing requires synchronized clocks between load generator
        and SUT, and between distributed SUT components.</t>
        
        <section anchor="single-machine">
          <name>Single-Machine Testing</name>
          <t>When load generator and SUT run on the same machine, clock
          synchronization is inherent. This configuration is RECOMMENDED
          for Model Engine testing.</t>
        </section>
        
        <section anchor="distributed-testing">
          <name>Distributed Testing</name>
          <t>When load generator and SUT are on different machines:</t>
          <ul>
            <li>NTP synchronization MUST achieve accuracy of 10ms or better</li>
            <li>PTP synchronization SHOULD be used when sub-millisecond
            accuracy is required</li>
            <li>The test report MUST state the synchronization method and
            estimated accuracy</li>
          </ul>
        </section>
        
        <section anchor="network-latency">
          <name>Network Latency Measurement</name>
          <t>For Application Gateway testing where network latency is significant:</t>
          <ul>
            <li>Testers SHOULD measure and report network RTT separately</li>
            <li>Testers MAY subtract estimated network latency from TTFT to
            isolate server-side processing time</li>
            <li>Any latency adjustment MUST be documented in the test report</li>
          </ul>
        </section>
        
        <section anchor="timestamp-format">
          <name>Timestamp Format</name>
          <t>All timestamps MUST be recorded in a format with at least
          millisecond precision. ISO 8601 with milliseconds
          (YYYY-MM-DDTHH:MM:SS.sssZ) or Unix epoch with milliseconds is
          RECOMMENDED.</t>
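          <t>A formatting sketch using the Python standard library; the
          trailing-Z substitution converts the "+00:00" offset form to the
          Z-suffixed form shown above:</t>
          <sourcecode type="python"><![CDATA[
from datetime import datetime, timezone

def iso_ms(ts):
    """Render an aware datetime as ISO 8601 UTC with milliseconds."""
    return (ts.astimezone(timezone.utc)
              .isoformat(timespec="milliseconds")
              .replace("+00:00", "Z"))

stamp = iso_ms(datetime(2026, 1, 15, 12, 30, 45, 123000,
                        tzinfo=timezone.utc))
]]></sourcecode>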
        </section>
      </section>
      
      <section anchor="guardrail-config">
        <name>Safety and Guardrail Configuration</name>
        
        <t>Production LLM deployments include safety systems that affect
        performance. Benchmarking MUST account for these systems.</t>
        
        <section anchor="guardrail-disclosure">
          <name>Guardrail Disclosure</name>
          <t>The test report MUST disclose:</t>
          <ul>
            <li>Whether input content filtering is enabled</li>
            <li>Whether output content filtering is enabled</li>
            <li>Names of safety systems if known (e.g., "Llama Guard")</li>
            <li>Whether any requests were refused during testing</li>
          </ul>
        </section>
        
        <section anchor="production-representative">
          <name>Production-Representative Testing</name>
          <t>For Application Gateway benchmarking intended to represent
          production performance:</t>
          <ul>
            <li>Safety systems SHOULD be enabled in their default configuration</li>
            <li>The test report MUST note if safety systems are disabled</li>
            <li>Testers SHOULD run comparative tests with safety enabled and
            disabled to quantify overhead</li>
          </ul>
        </section>
      </section>
    </section>

    <section anchor="benchmarking-tests">
      <name>Benchmarking Tests</name>
      
      <t>This section defines benchmarking tests. Each test includes:
      objective, setup parameters, procedure, measurements, and
      reporting format.</t>
      
      <section anchor="test-ttft">
        <name>Time to First Token</name>
        
        <section anchor="ttft-objective">
          <name>Objective</name>
          <t>To determine the latency from request submission to first token
          receipt under varying load conditions. TTFT measures perceived
          responsiveness for interactive applications.</t>
        </section>
        
        <section anchor="ttft-setup">
          <name>Setup Parameters</name>
          <t>The following parameters MUST be defined:</t>
          
          <section anchor="ttft-workload-params">
            <name>Workload Parameters</name>
            <dl>
              <dt>Workload:</dt>
              <dd>One of the standard workloads (<xref target="standard-workloads"/>)
              or a fully specified custom workload.</dd>
              
              <dt>Request count:</dt>
              <dd>Total number of requests to execute. MUST be at least 1000
              for P99 measurement and at least 10000 for P99.9.</dd>
            </dl>
          </section>
          
          <section anchor="ttft-load-params">
            <name>Load Parameters</name>
            <dl>
              <dt>Load model:</dt>
              <dd>Open-loop or closed-loop.</dd>
            </dl>
            
            <t>For open-loop:</t>
            <dl>
              <dt>Arrival rate:</dt>
              <dd>Requests per second.</dd>
              <dt>Arrival distribution:</dt>
              <dd>Poisson (REQUIRED), uniform, or bursty.</dd>
            </dl>
            
            <t>For closed-loop:</t>
            <dl>
              <dt>Concurrency:</dt>
              <dd>Number of concurrent requests maintained.</dd>
            </dl>
          </section>
          
          <section anchor="ttft-system-params">
            <name>System Parameters</name>
            <dl>
              <dt>SUT configuration:</dt>
              <dd>Model Engine, Application Gateway, or Compound System.</dd>
              
              <dt>Model identifier:</dt>
              <dd>Model name, version, and quantization if applicable.</dd>
              
              <dt>Hardware:</dt>
              <dd>Accelerator type, count, and memory.</dd>
              
              <dt>Prefix caching:</dt>
              <dd>Enabled or disabled.</dd>
            </dl>
          </section>
        </section>
        
        <section anchor="ttft-procedure">
          <name>Procedure</name>
          <ol>
            <li>Configure the SUT with specified parameters.</li>
            <li>Complete warm-up procedure (<xref target="warmup"/>).</li>
            <li>Begin load generation at the specified arrival rate or concurrency.</li>
            <li><t>For each request:</t>
              <ol type="a">
                <li>Record submission timestamp (T_submit)</li>
                <li>Record first token arrival timestamp (T_first)</li>
                <li>Calculate TTFT = T_first - T_submit</li>
                <li>Record input token count</li>
              </ol>
            </li>
            <li>Continue until request count is reached.</li>
            <li>Compute distribution statistics.</li>
          </ol>
          
          <section anchor="first-token-def">
            <name>First Token Definition</name>
            <t>The first token is defined as the first content token received,
            excluding:</t>
            <ul>
              <li>Empty tokens or whitespace-only tokens</li>
              <li>Protocol overhead (SSE event markers, JSON framing)</li>
              <li>Metadata tokens (token IDs, logprobs if requested separately)</li>
            </ul>
            <t>If the system emits non-content tokens before content, the test
            report MUST note this and specify whether TTFT measures time to
            any token or time to first content token.</t>
          </section>
        </section>
        
        <section anchor="ttft-measurements">
          <name>Measurements</name>
          
          <section anchor="ttft-primary">
            <name>Primary Measurements</name>
            <dl>
              <dt>TTFT Percentiles:</dt>
              <dd>P50, P90, P95, P99, and P99.9 of TTFT distribution.
              All percentiles MUST be reported.</dd>
              
              <dt>TTFT Mean:</dt>
              <dd>Arithmetic mean of TTFT values.</dd>
              
              <dt>TTFT Minimum:</dt>
              <dd>Smallest TTFT observed.</dd>
              
              <dt>TTFT Maximum:</dt>
              <dd>Largest TTFT observed.</dd>
            </dl>
          </section>
          
          <section anchor="ttft-conditional">
            <name>Conditional Measurements</name>
            <dl>
              <dt>TTFT by input length:</dt>
              <dd>When workload has variable input length, report TTFT percentiles
              bucketed by input length ranges. RECOMMENDED buckets: [0-256),
              [256-512), [512-1024), [1024-2048), [2048-4096), [4096+) tokens.</dd>
              
              <dt>Queue wait time:</dt>
              <dd>If measurable (i.e., via server-side instrumentation), report
              the queue wait component of TTFT separately.</dd>
              
              <dt>Prefill latency:</dt>
              <dd>If measurable, report the prefill computation component of
              TTFT separately.</dd>
            </dl>
          </section>
          
          <section anchor="ttft-statistical">
            <name>Statistical Requirements</name>
            <t>For P99 accuracy within 10% relative error at 95% confidence,
            at least 1000 samples are required; for P99.9, at least 10000
            samples are required. The test report MUST state the sample
            count.</t>
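            <t>One way to check that a collected sample set actually supports
            the reported tail percentile is a bootstrap confidence interval.
            The sketch below is illustrative; this document does not mandate
            a particular estimation method.</t>

```python
import random

def bootstrap_p99_ci(samples, iterations=200, seed=0):
    """95% bootstrap confidence interval for the P99 of `samples`.

    Returns (low, high, relative_half_width); the last value can be
    compared against a target such as 0.10 (10% relative error).
    """
    rng = random.Random(seed)                  # fixed seed for repeatability
    n = len(samples)
    estimates = []
    for _ in range(iterations):
        resample = sorted(rng.choices(samples, k=n))   # sample with replacement
        estimates.append(resample[min(n - 1, int(0.99 * n))])
    estimates.sort()
    low = estimates[int(0.025 * iterations)]
    high = estimates[int(0.975 * iterations)]
    point = sorted(samples)[min(n - 1, int(0.99 * n))]
    return low, high, (high - low) / (2.0 * point)
```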
          </section>
        </section>
        
        <section anchor="ttft-reporting">
          <name>Reporting Format</name>
          <t>The test report MUST include:</t>
          
          <section anchor="ttft-config-summary">
            <name>Configuration Summary</name>
            <ul>
              <li>SUT configuration and boundary</li>
              <li>Model identifier and hardware</li>
              <li>Workload name or full specification</li>
              <li>Load model and parameters</li>
              <li>Request count and test duration</li>
              <li>Warm-up procedure followed</li>
              <li>Prefix caching state</li>
              <li>Guardrail configuration</li>
            </ul>
          </section>
          
          <section anchor="ttft-results-table">
            <name>Results Table</name>
            <t>The results SHOULD be reported in tabular format:</t>
            <table anchor="ttft-example-table">
              <name>TTFT Results Example</name>
              <thead>
                <tr><th>Metric</th><th>Value</th></tr>
              </thead>
              <tbody>
                <tr><td>Requests</td><td>10000</td></tr>
                <tr><td>TTFT P50</td><td>127 ms</td></tr>
                <tr><td>TTFT P90</td><td>245 ms</td></tr>
                <tr><td>TTFT P95</td><td>312 ms</td></tr>
                <tr><td>TTFT P99</td><td>524 ms</td></tr>
                <tr><td>TTFT P99.9</td><td>891 ms</td></tr>
                <tr><td>TTFT Mean</td><td>156 ms</td></tr>
                <tr><td>TTFT Min</td><td>89 ms</td></tr>
                <tr><td>TTFT Max</td><td>1243 ms</td></tr>
              </tbody>
            </table>
          </section>
          
          <section anchor="ttft-by-length">
            <name>TTFT by Input Length</name>
            <t>If applicable:</t>
            <table anchor="ttft-length-table">
              <name>TTFT by Input Length Example</name>
              <thead>
                <tr><th>Input Tokens</th><th>P50 (ms)</th><th>P95 (ms)</th><th>P99 (ms)</th></tr>
              </thead>
              <tbody>
                <tr><td>0-256</td><td>95</td><td>198</td><td>312</td></tr>
                <tr><td>256-512</td><td>142</td><td>287</td><td>445</td></tr>
                <tr><td>512-1024</td><td>198</td><td>412</td><td>623</td></tr>
                <tr><td>1024-2048</td><td>312</td><td>587</td><td>891</td></tr>
                <tr><td>2048+</td><td>523</td><td>912</td><td>1243</td></tr>
              </tbody>
            </table>
          </section>
          
          <section anchor="ttft-visualization">
            <name>Distribution Visualization</name>
            <t>Testers SHOULD include a histogram or CDF plot of the TTFT
            distribution.</t>
          </section>
        </section>
      </section>
      
      <section anchor="test-throughput">
        <name>Output Token Throughput</name>
        
        <section anchor="throughput-objective">
          <name>Objective</name>
          <t>To determine the maximum rate at which the SUT can generate output
          tokens while maintaining acceptable latency. This test measures
          system capacity under load.</t>
        </section>
        
        <section anchor="throughput-setup">
          <name>Setup Parameters</name>
          <t>The following parameters MUST be defined:</t>
          
          <section anchor="throughput-workload">
            <name>Workload Parameters</name>
            <dl>
              <dt>Workload:</dt>
              <dd>One of the standard workloads or fully specified custom workload.</dd>
              
              <dt>Test duration:</dt>
              <dd>Minimum 60 seconds; 300 seconds is RECOMMENDED for stable measurement.</dd>
            </dl>
          </section>
          
          <section anchor="throughput-load">
            <name>Load Parameters</name>
            <dl>
              <dt>Load model:</dt>
              <dd>Open-loop or closed-loop.</dd>
            </dl>
            
            <t>For open-loop:</t>
            <dl>
              <dt>Arrival rate range:</dt>
              <dd>Minimum and maximum request rates to test.</dd>
              <dt>Rate increment:</dt>
              <dd>Step size for iterative search.</dd>
            </dl>
            
            <t>For closed-loop:</t>
            <dl>
              <dt>Concurrency range:</dt>
              <dd>Minimum and maximum concurrent requests.</dd>
              <dt>Concurrency increment:</dt>
              <dd>Step size for iterative search.</dd>
            </dl>
          </section>
          
          <section anchor="throughput-latency-constraint">
            <name>Latency Constraint (Optional)</name>
            <dl>
              <dt>TTFT SLO:</dt>
              <dd>Maximum acceptable P99 TTFT.</dd>
              <dt>TPOT SLO:</dt>
              <dd>Maximum acceptable P99 TPOT.</dd>
            </dl>
            <t>When specified, throughput is measured as the maximum rate
            achieving these SLOs.</t>
          </section>
        </section>
        
        <section anchor="throughput-procedure">
          <name>Procedure</name>
          <t>This test employs an iterative search to find maximum throughput.</t>
          <ol>
            <li>Configure the SUT with specified parameters.</li>
            <li>Complete warm-up procedure.</li>
            <li><t>For each load level (arrival rate or concurrency):</t>
              <ol type="a">
                <li>Run load for the specified test duration.</li>
                <li>Record all request timings.</li>
                <li>Compute throughput as total output tokens divided by test duration.</li>
                <li>Compute TTFT and TPOT percentiles.</li>
                <li>If latency constraint specified, check SLO compliance.</li>
              </ol>
            </li>
            <li><t>Use binary search to find maximum throughput:</t>
              <ol type="a">
                <li>If no latency constraint: find load level where queue grows
                unboundedly (system saturation).</li>
                <li>If latency constraint: find highest load level meeting SLO.</li>
              </ol>
            </li>
            <li>Report throughput at the maximum sustainable load level.</li>
          </ol>
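          <t>The search in step 4 can be sketched as below. The run_at_load
          and meets_slo arguments are hypothetical callables standing in for
          a full measurement run at one load level and for the SLO (or
          saturation) check.</t>

```python
def find_max_load(run_at_load, lo, hi, meets_slo, tolerance=1.0):
    """Binary search for the highest sustainable load level.

    run_at_load(load) runs the workload at `load` and returns measured
    results; meets_slo(result) returns True when the SLO (or, absent an
    SLO, the non-saturation condition) is satisfied.  Both are
    hypothetical placeholders for the procedures in this section.
    """
    best = None
    while hi - lo > tolerance:
        mid = (lo + hi) / 2.0
        result = run_at_load(mid)
        if meets_slo(result):
            best, lo = result, mid      # sustainable: search higher
        else:
            hi = mid                    # not sustainable: search lower
    return lo, best
```

          <t>The tolerance parameter bounds the search granularity; each
          probe still requires a full test-duration run, so coarse
          tolerances keep total test time manageable.</t>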
          
          <section anchor="saturation-detection">
            <name>Saturation Detection</name>
            <t>System saturation is detected when:</t>
            <ul>
              <li>Queue depth grows continuously during test duration, OR</li>
              <li>Request completion rate is less than 90% of arrival rate, OR</li>
              <li>P99 latency exceeds 10x the P50 latency at lower load</li>
            </ul>
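            <t>The three criteria can be combined into a single predicate.
            The sketch below uses illustrative argument names and treats
            "grows continuously" as the second half of the queue-depth
            series averaging above the first half, which is one possible
            operationalization, not the only valid one.</t>

```python
def is_saturated(queue_depths, completed, arrived, p99, baseline_p50):
    """Apply the three saturation criteria from this section.

    queue_depths is a time-ordered series of queue-depth samples;
    baseline_p50 is the P50 latency measured at a lower load level.
    """
    half = len(queue_depths) // 2
    # Criterion 1: queue grows over the test (second half mean > first half).
    queue_growing = (sum(queue_depths[half:]) / max(1, len(queue_depths) - half)
                     > sum(queue_depths[:half]) / max(1, half))
    # Criterion 2: completion rate below 90% of arrival rate.
    incomplete = completed < 0.9 * arrived
    # Criterion 3: P99 exceeds 10x the lower-load P50.
    tail_blowup = p99 > 10 * baseline_p50
    return queue_growing or incomplete or tail_blowup
```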
          </section>
          
          <section anchor="steady-state">
            <name>Steady State Verification</name>
            <t>At each load level, verify steady state by:</t>
            <ul>
              <li>Confirming queue depth is stable (not growing)</li>
              <li>Confirming throughput is stable across test duration</li>
              <li>Excluding initial ramp-up period (first 10% of duration)</li>
            </ul>
          </section>
        </section>
        
        <section anchor="throughput-measurements">
          <name>Measurements</name>
          
          <section anchor="throughput-primary">
            <name>Primary Measurements</name>
            <dl>
              <dt>Maximum output token throughput:</dt>
              <dd>Output tokens per second at maximum sustainable load.
              Report with or without latency constraint as specified.</dd>
              
              <dt>Request throughput:</dt>
              <dd>Requests completed per second at maximum load.</dd>
              
              <dt>Input token throughput:</dt>
              <dd>Input tokens processed per second (measures prefill capacity).</dd>
            </dl>
          </section>
          
          <section anchor="throughput-efficiency">
            <name>Efficiency Measurements</name>
            <dl>
              <dt>Tokens per GPU-second:</dt>
              <dd>Output tokens per second divided by GPU count. Enables comparison
              across different hardware configurations.</dd>
              
              <dt>Batch utilization:</dt>
              <dd>If measurable, report average batch size divided by maximum batch size.</dd>
            </dl>
          </section>
          
          <section anchor="throughput-latency-at-max">
            <name>Latency at Maximum Throughput</name>
            <t>At the maximum sustainable load level, report:</t>
            <ul>
              <li>TTFT P50, P95, P99</li>
              <li>TPOT P50, P95, P99</li>
              <li>End-to-end latency P50, P95, P99</li>
            </ul>
          </section>
        </section>
        
        <section anchor="throughput-reporting">
          <name>Reporting Format</name>
          
          <section anchor="throughput-summary">
            <name>Summary Results</name>
            <table anchor="throughput-summary-table">
              <name>Throughput Summary Example</name>
              <thead>
                <tr><th>Metric</th><th>Value</th></tr>
              </thead>
              <tbody>
                <tr><td>Max Output Throughput</td><td>2847 tok/s</td></tr>
                <tr><td>Max Request Throughput</td><td>18.2 req/s</td></tr>
                <tr><td>Max Input Throughput</td><td>5123 tok/s</td></tr>
                <tr><td>Sustainable Load</td><td>20 req/s</td></tr>
                <tr><td>Tokens per GPU-second</td><td>356 tok/s/GPU</td></tr>
              </tbody>
            </table>
          </section>
          
          <section anchor="throughput-latency-table">
            <name>Latency at Maximum Throughput</name>
            <table anchor="latency-at-max-table">
              <name>Latency at Maximum Throughput Example</name>
              <thead>
                <tr><th>Metric</th><th>P50</th><th>P95</th><th>P99</th></tr>
              </thead>
              <tbody>
                <tr><td>TTFT</td><td>312 ms</td><td>687 ms</td><td>1124 ms</td></tr>
                <tr><td>TPOT</td><td>42 ms</td><td>78 ms</td><td>134 ms</td></tr>
                <tr><td>End-to-End</td><td>6.2 s</td><td>11.4 s</td><td>18.7 s</td></tr>
              </tbody>
            </table>
          </section>
        </section>
      </section>
      
      <section anchor="test-tradeoff">
        <name>Throughput-Latency Tradeoff</name>
        
        <section anchor="tradeoff-objective">
          <name>Objective</name>
          <t>To characterize the relationship between throughput and latency
          across the operating range of the SUT. This test produces a
          throughput-latency curve that reveals system behavior more
          completely than point measurements can.</t>
        </section>
        
        <section anchor="tradeoff-setup">
          <name>Setup Parameters</name>
          <dl>
            <dt>Workload:</dt>
            <dd>One of the standard workloads or fully specified custom workload.</dd>
            
            <dt>Test duration per point:</dt>
            <dd>Minimum 60 seconds per load level.</dd>
            
            <dt>Load levels:</dt>
            <dd>At least 10 load levels spanning from low load (10% of estimated
            capacity) to saturation.</dd>
            
            <dt>Load model:</dt>
            <dd>Open-loop is REQUIRED for this test. Closed-loop load
            generation self-limits at the system's service rate and therefore
            cannot reveal behavior beyond capacity.</dd>
          </dl>
        </section>
        
        <section anchor="tradeoff-procedure">
          <name>Procedure</name>
          <ol>
            <li>Estimate system capacity using a preliminary throughput test
            or published specifications.</li>
            <li>Define load levels: 10%, 20%, 30%, ..., 100%, 110%, 120% of
            estimated capacity.</li>
            <li><t>For each load level in ascending order:</t>
              <ol type="a">
                <li>Run load for specified duration.</li>
                <li>Record all request timings.</li>
                <li>Compute achieved throughput (may differ from offered load at saturation).</li>
                <li>Compute latency percentiles.</li>
              </ol>
            </li>
            <li>Plot throughput vs latency curves.</li>
          </ol>
        </section>
        
        <section anchor="tradeoff-measurements">
          <name>Measurements</name>
          <t>For each load level, record:</t>
          <ul>
            <li>Offered load (request rate)</li>
            <li>Achieved throughput (output tokens per second)</li>
            <li>TTFT: P50, P95, P99</li>
            <li>TPOT: P50, P95, P99</li>
            <li>End-to-end latency: P50, P95, P99</li>
            <li>Request success rate</li>
            <li>Queue growth indicator (stable/growing)</li>
          </ul>
          
          <t>Derived metrics:</t>
          <dl>
            <dt>Optimal operating point:</dt>
            <dd>Load level achieving highest throughput while meeting specified SLO.</dd>
            
            <dt>Knee point:</dt>
            <dd>Load level where P99 latency exceeds 2x the minimum P99 latency observed.</dd>
            
            <dt>Saturation point:</dt>
            <dd>Load level where achieved throughput first decreases from previous level.</dd>
          </dl>
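          <t>The knee and saturation points can be derived mechanically from
          the per-level results. The dict keys in the sketch below are
          illustrative; the points list is assumed ordered by ascending
          load.</t>

```python
def analyze_sweep(points):
    """Locate knee and saturation points in a load sweep.

    points is a list of dicts with keys "load", "throughput" (achieved
    output tokens/s), and "p99" (latency), ordered by ascending load.
    Returns (knee_load, saturation_load); either may be None if the
    sweep never crossed the corresponding threshold.
    """
    min_p99 = min(p["p99"] for p in points)
    # Knee: first level whose P99 exceeds 2x the minimum P99 observed.
    knee = next((p["load"] for p in points if p["p99"] > 2 * min_p99), None)
    # Saturation: first level where achieved throughput decreases.
    saturation = None
    for prev, cur in zip(points, points[1:]):
        if cur["throughput"] < prev["throughput"]:
            saturation = cur["load"]
            break
    return knee, saturation
```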
        </section>
        
        <section anchor="tradeoff-reporting">
          <name>Reporting Format</name>
          <table anchor="tradeoff-table">
            <name>Throughput-Latency Table Example</name>
            <thead>
              <tr>
                <th>Offered (r/s)</th>
                <th>Achieved (tok/s)</th>
                <th>TTFT P50</th>
                <th>TTFT P99</th>
                <th>TPOT P50</th>
                <th>TPOT P99</th>
                <th>Success</th>
              </tr>
            </thead>
            <tbody>
              <tr><td>2</td><td>284</td><td>95</td><td>142</td><td>32</td><td>41</td><td>100%</td></tr>
              <tr><td>6</td><td>852</td><td>102</td><td>178</td><td>34</td><td>48</td><td>100%</td></tr>
              <tr><td>10</td><td>1420</td><td>128</td><td>267</td><td>38</td><td>62</td><td>100%</td></tr>
              <tr><td>14</td><td>1988</td><td>198</td><td>512</td><td>48</td><td>98</td><td>100%</td></tr>
              <tr><td>18</td><td>2534</td><td>378</td><td>1234</td><td>72</td><td>198</td><td>99.8%</td></tr>
              <tr><td>22</td><td>2712</td><td>823</td><td>3456</td><td>142</td><td>523</td><td>94.1%</td></tr>
            </tbody>
          </table>
          <t>Knee point: 14 req/s (TTFT P99 first exceeds 2x the minimum observed)</t>
          <t>Saturation point: 22 req/s (achieved throughput plateaus and the
          success rate falls below 100%)</t>
        </section>
      </section>
      
      <section anchor="test-itl">
        <name>Inter-Token Latency Distribution</name>
        
        <section anchor="itl-objective">
          <name>Objective</name>
          <t>To characterize the variability of token delivery during the decode
          phase. ITL distribution determines streaming smoothness experienced
          by users.</t>
        </section>
        
        <section anchor="itl-setup">
          <name>Setup Parameters</name>
          <dl>
            <dt>Workload:</dt>
            <dd>Synthetic-Uniform or Conversation workload RECOMMENDED.</dd>
            
            <dt>Minimum output length:</dt>
            <dd>Requests MUST generate at least 50 output tokens to provide
            meaningful ITL samples.</dd>
            
            <dt>Request count:</dt>
            <dd>At least 100 requests for per-request statistics, yielding
            5000+ ITL samples.</dd>
            
            <dt>Load level:</dt>
            <dd>Specify as percentage of maximum throughput. Multiple load
            levels RECOMMENDED: 25%, 50%, 75%, 90% of saturation.</dd>
            
            <dt>Measurement method:</dt>
            <dd>Specify per <xref target="itl-chunked"/> (chunk timing,
            distributed timing, or server-side timing).</dd>
          </dl>
        </section>
        
        <section anchor="itl-procedure">
          <name>Procedure</name>
          <ol>
            <li>Configure SUT and complete warm-up.</li>
            <li><t>For each load level:</t>
              <ol type="a">
                <li>Generate requests at specified load.</li>
                <li>For each request, record arrival time of each token after the first.</li>
                <li>Calculate ITL_i = T(token_i) - T(token_{i-1}) for each consecutive token pair.</li>
                <li>Aggregate ITL samples across all requests.</li>
                <li>Calculate per-request jitter (standard deviation of ITL within each request).</li>
                <li>Record maximum pause duration per request.</li>
              </ol>
            </li>
          </ol>
          <t>The interval between request submission and first token (TTFT)
          MUST NOT be included in ITL calculation.</t>
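          <t>The per-request calculations in steps c through f can be
          sketched as follows. Each inner list holds the arrival timestamps
          of one request's tokens, starting at the first token, so the TTFT
          interval never enters the differences.</t>

```python
from statistics import pstdev

def itl_stats(token_times):
    """Compute ITL samples, per-request jitter, and max pause.

    token_times is a list of per-request token-arrival timestamp lists,
    each beginning at the first token (TTFT is thereby excluded).
    """
    all_itls, jitters, max_pauses = [], [], []
    for times in token_times:
        itls = [b - a for a, b in zip(times, times[1:])]  # consecutive gaps
        if not itls:
            continue                       # single-token request: no ITL
        all_itls.extend(itls)              # aggregate across requests
        jitters.append(pstdev(itls))       # per-request jitter (std dev)
        max_pauses.append(max(itls))       # longest stall in this request
    return all_itls, jitters, max_pauses
```

          <t>Percentiles of the three returned series yield the aggregate
          and per-request statistics required below.</t>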
        </section>
        
        <section anchor="itl-measurements">
          <name>Measurements</name>
          
          <section anchor="itl-aggregate">
            <name>Aggregate ITL Statistics</name>
            <dl>
              <dt>ITL Percentiles:</dt>
              <dd>P50, P90, P95, P99, P99.9 across all ITL samples.</dd>
              
              <dt>ITL Mean:</dt>
              <dd>Arithmetic mean of all ITL samples.</dd>
              
              <dt>ITL Standard Deviation:</dt>
              <dd>Standard deviation across all samples.</dd>
            </dl>
          </section>
          
          <section anchor="itl-per-request">
            <name>Per-Request Statistics</name>
            <dl>
              <dt>Jitter Distribution:</dt>
              <dd>P50, P95, P99 of per-request standard deviation.</dd>
              
              <dt>Maximum Pause Distribution:</dt>
              <dd>P50, P95, P99 of per-request maximum ITL.</dd>
            </dl>
          </section>
          
          <section anchor="itl-shape">
            <name>Distribution Shape</name>
            <dl>
              <dt>Modality:</dt>
              <dd>Whether ITL distribution is unimodal or multimodal. Multimodal
              distributions indicate distinct operating regimes (e.g., batching effects).</dd>
              
              <dt>Tail behavior:</dt>
              <dd>Characterize tail (exponential, heavy-tailed). Report the ratio
              P99/P50 as a tail heaviness indicator.</dd>
            </dl>
          </section>
        </section>
        
        <section anchor="itl-reporting">
          <name>Reporting Format</name>
          <table anchor="itl-table">
            <name>ITL Results Example</name>
            <thead>
              <tr><th>Metric</th><th>Value</th></tr>
            </thead>
            <tbody>
              <tr><td>ITL Samples</td><td>15234</td></tr>
              <tr><td>ITL P50</td><td>38 ms</td></tr>
              <tr><td>ITL P90</td><td>52 ms</td></tr>
              <tr><td>ITL P95</td><td>67 ms</td></tr>
              <tr><td>ITL P99</td><td>124 ms</td></tr>
              <tr><td>ITL P99.9</td><td>312 ms</td></tr>
              <tr><td>ITL Mean</td><td>42 ms</td></tr>
              <tr><td>ITL Std Dev</td><td>28 ms</td></tr>
              <tr><td>P99/P50 Ratio</td><td>3.26</td></tr>
            </tbody>
          </table>
        </section>
      </section>
      
      <section anchor="test-capacity">
        <name>Concurrent Request Capacity</name>
        
        <section anchor="capacity-objective">
          <name>Objective</name>
          <t>To determine the maximum number of concurrent requests the SUT can
          maintain while meeting latency objectives. This test measures
          memory capacity and scheduling limits.</t>
        </section>
        
        <section anchor="capacity-setup">
          <name>Setup Parameters</name>
          <dl>
            <dt>Workload:</dt>
            <dd>Synthetic-Uniform RECOMMENDED for controlled testing.</dd>
            
            <dt>Fixed output length:</dt>
            <dd>Use fixed output length (e.g., 256 tokens) to ensure all
            requests have similar duration.</dd>
            
            <dt>Initial concurrency:</dt>
            <dd>Starting number of concurrent requests (e.g., 8).</dd>
            
            <dt>Maximum concurrency:</dt>
            <dd>Upper bound for search (e.g., 512).</dd>
            
            <dt>Success criteria:</dt>
            <dd><t>Request completion rate &gt;= 99%, TTFT P99 &lt;= specified threshold,
            and no out-of-memory errors.</t></dd>
          </dl>
        </section>
        
        <section anchor="capacity-procedure">
          <name>Procedure</name>
          <t>This test employs binary search to find maximum concurrent capacity.</t>
          <ol>
            <li>Configure SUT and complete warm-up.</li>
            <li>Set concurrency = initial concurrency.</li>
            <li><t>For each concurrency level:</t>
              <ol type="a">
                <li>Submit [concurrency] requests simultaneously.</li>
                <li>Maintain concurrency: when a request completes, immediately submit a replacement.</li>
                <li>Run for at least 60 seconds or 100 request completions per slot, whichever is longer.</li>
                <li>Record completion rate, latency percentiles, and any errors.</li>
                <li>Check success criteria.</li>
              </ol>
            </li>
            <li><t>Binary search:</t>
              <ol type="a">
                <li>If success criteria met: increase concurrency toward maximum.</li>
                <li>If success criteria not met: decrease concurrency.</li>
                <li>Continue until convergence.</li>
              </ol>
            </li>
            <li>Report maximum concurrency meeting success criteria.</li>
          </ol>
        </section>
        
        <section anchor="capacity-measurements">
          <name>Measurements</name>
          <dl>
            <dt>Maximum concurrent requests:</dt>
            <dd>Highest concurrency meeting success criteria.</dd>
            
            <dt>Achieved throughput at maximum:</dt>
            <dd>Output tokens per second at maximum concurrency.</dd>
            
            <dt>Tokens in flight at maximum:</dt>
            <dd>Approximate total tokens (input + output so far) across all
            concurrent requests.</dd>
          </dl>
        </section>
        
        <section anchor="capacity-reporting">
          <name>Reporting Format</name>
          <table anchor="capacity-table">
            <name>Capacity Search Results Example</name>
            <thead>
              <tr>
                <th>Concurrency</th>
                <th>Completion</th>
                <th>TTFT P99</th>
                <th>TPOT P99</th>
                <th>Errors</th>
                <th>Status</th>
              </tr>
            </thead>
            <tbody>
              <tr><td>8</td><td>100%</td><td>142 ms</td><td>38 ms</td><td>0</td><td>Pass</td></tr>
              <tr><td>16</td><td>100%</td><td>178 ms</td><td>42 ms</td><td>0</td><td>Pass</td></tr>
              <tr><td>32</td><td>100%</td><td>267 ms</td><td>52 ms</td><td>0</td><td>Pass</td></tr>
              <tr><td>64</td><td>99.7%</td><td>523 ms</td><td>78 ms</td><td>0</td><td>Pass</td></tr>
              <tr><td>128</td><td>97.2%</td><td>1234 ms</td><td>156 ms</td><td>3</td><td>Fail</td></tr>
            </tbody>
          </table>
          <t>Maximum concurrent requests meeting criteria: 64</t>
        </section>
      </section>
      
      <section anchor="test-fairness">
        <name>Scheduling Fairness</name>
        
        <section anchor="fairness-objective">
          <name>Objective</name>
          <t>To evaluate how equitably the SUT allocates resources across
          concurrent requests with different characteristics. This test
          reveals head-of-line blocking, starvation, and priority effects.</t>
        </section>
        
        <section anchor="fairness-setup">
          <name>Setup Parameters</name>
          <dl>
            <dt>Workload:</dt>
            <dd>Synthetic-Skewed REQUIRED. The high length variance creates
            fairness-sensitive conditions.</dd>
            
            <dt>Request classes:</dt>
            <dd><t>Define two or more request classes:</t>
              <ul>
                <li>Short requests: Input [64, 256] tokens, output [32, 128] tokens</li>
                <li>Long requests: Input [1024, 4096] tokens, output [256, 1024] tokens</li>
              </ul>
            </dd>
            
            <dt>Class mix:</dt>
            <dd>Ratio of request classes (e.g., 80% short, 20% long).</dd>
            
            <dt>Load level:</dt>
            <dd>70-90% of saturation throughput RECOMMENDED to create contention.</dd>
            
            <dt>Request count:</dt>
            <dd>At least 500 requests per class.</dd>
          </dl>
        </section>
        
        <section anchor="fairness-procedure">
          <name>Procedure</name>
          <ol>
            <li>Configure SUT and complete warm-up.</li>
            <li>Measure a baseline: the performance of each class in isolation at the same total load.</li>
            <li>Generate mixed workload with specified class ratio.</li>
            <li>Run at specified load level for at least 300 seconds.</li>
            <li>For each request, record class membership, submission time, first token time, completion time.</li>
            <li>Compute per-class statistics and fairness metrics.</li>
          </ol>
        </section>
        
        <section anchor="fairness-measurements">
          <name>Measurements</name>
          <dl>
            <dt>Per-class latency:</dt>
            <dd>TTFT P50, P95, P99 for each request class.</dd>
            
            <dt>Latency inflation:</dt>
            <dd>(Mixed workload TTFT) / (Isolated TTFT) per class.</dd>
            
            <dt>Jain's Fairness Index:</dt>
            <dd>J = (sum(x_i))^2 / (n * sum(x_i^2)) where x_i is normalized latency.
            J = 1.0 indicates perfect fairness. J &lt; 0.9 indicates significant unfairness.</dd>
            
            <dt>Starvation rate:</dt>
            <dd>Fraction of requests waiting longer than 5x the median wait time for their class.</dd>
          </dl>
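          <t>The last two metrics map directly to code. The sketch below is
          illustrative; how the per-class latencies are normalized (e.g.,
          dividing each class's mixed-workload latency by its isolated
          baseline) is left to the tester.</t>

```python
def jains_index(latencies):
    """Jain's Fairness Index over normalized per-class latencies:
    J = (sum x_i)^2 / (n * sum x_i^2); J = 1.0 means perfect fairness."""
    n = len(latencies)
    total = sum(latencies)
    return total * total / (n * sum(x * x for x in latencies))

def starvation_rate(waits):
    """Fraction of requests waiting longer than 5x the class median wait."""
    median = sorted(waits)[len(waits) // 2]
    return sum(1 for w in waits if w > 5 * median) / len(waits)
```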
        </section>
        
        <section anchor="fairness-reporting">
          <name>Reporting Format</name>
          <table anchor="fairness-class-table">
            <name>Per-Class Results Example</name>
            <thead>
              <tr><th>Class</th><th>Count</th><th>TTFT P50</th><th>TTFT P99</th><th>TPOT P50</th><th>TPOT P99</th></tr>
            </thead>
            <tbody>
              <tr><td>Short</td><td>4012</td><td>89 ms</td><td>234 ms</td><td>35 ms</td><td>67 ms</td></tr>
              <tr><td>Long</td><td>988</td><td>312 ms</td><td>1234 ms</td><td>42 ms</td><td>89 ms</td></tr>
            </tbody>
          </table>
          
          <table anchor="fairness-metrics-table">
            <name>Fairness Metrics Example</name>
            <thead>
              <tr><th>Metric</th><th>Value</th></tr>
            </thead>
            <tbody>
              <tr><td>Jain's Fairness Index</td><td>0.87</td></tr>
              <tr><td>Short Class Starvation</td><td>0.3%</td></tr>
              <tr><td>Long Class Starvation</td><td>2.1%</td></tr>
            </tbody>
          </table>
        </section>
      </section>
      
      <section anchor="test-cache">
        <name>Prefix Cache Effectiveness</name>
        
        <section anchor="cache-objective">
          <name>Objective</name>
          <t>To evaluate the performance benefit of prefix caching under
          workloads with shared prefixes. This test quantifies TTFT
          reduction from cache hits.</t>
        </section>
        
        <section anchor="cache-setup">
          <name>Setup Parameters</name>
          <dl>
            <dt>Workload:</dt>
            <dd>Code Completion workload RECOMMENDED (high prefix sharing).</dd>
            
            <dt>Shared prefix:</dt>
            <dd>Define a prefix shared across requests.</dd>
            
            <dt>Prefix length:</dt>
            <dd>Length in tokens of shared prefix.</dd>
            
            <dt>Sharing fraction:</dt>
            <dd>Percentage of requests sharing the prefix.</dd>
            
            <dt>Comparison mode:</dt>
            <dd>Test MUST run in two configurations: cache disabled (baseline)
            and cache enabled.</dd>
          </dl>
        </section>
        
        <section anchor="cache-procedure">
          <name>Procedure</name>
          <ol>
            <li>Configure SUT with cache disabled.</li>
            <li>Complete warm-up (without populating prefix cache).</li>
            <li>Run workload, record TTFT for all requests.</li>
            <li>Enable prefix cache.</li>
            <li>Optionally pre-populate cache with shared prefix.</li>
            <li>Run identical workload, record TTFT for all requests.</li>
            <li>Compare results.</li>
          </ol>
        </section>
        
        <section anchor="cache-measurements">
          <name>Measurements</name>
          <dl>
            <dt>TTFT without cache:</dt>
            <dd>P50, P95, P99 with caching disabled.</dd>
            
            <dt>TTFT with cache:</dt>
            <dd>P50, P95, P99 with caching enabled.</dd>
            
            <dt>TTFT reduction:</dt>
            <dd>(TTFT_no_cache - TTFT_cache) / TTFT_no_cache, expressed as a percentage.</dd>
            
            <dt>Cache hit rate:</dt>
            <dd>Fraction of prefix tokens served from cache.</dd>
            
            <dt>Throughput improvement:</dt>
            <dd>Percentage increase from caching.</dd>
          </dl>
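          <t>A minimal sketch of the reduction and hit-rate arithmetic
          above; the helper names and input values are illustrative, not
          part of this methodology.</t>
          <sourcecode type="python"><![CDATA[
def ttft_reduction(ttft_no_cache, ttft_cache):
    # Percentage TTFT reduction, per the formula above.
    return 100.0 * (ttft_no_cache - ttft_cache) / ttft_no_cache

def cache_hit_rate(cached_prefix_tokens, total_prefix_tokens):
    # Fraction of prefix tokens served from cache.
    return cached_prefix_tokens / total_prefix_tokens

# Illustrative values only:
print(round(ttft_reduction(312.0, 98.0), 1))  # 68.6
print(cache_hit_rate(48000, 60000))           # 0.8
]]></sourcecode>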
        </section>
        
        <section anchor="cache-reporting">
          <name>Reporting Format</name>
          <table anchor="cache-comparison-table">
            <name>Cache Effectiveness Example</name>
            <thead>
              <tr><th>Configuration</th><th>TTFT P50</th><th>TTFT P95</th><th>TTFT P99</th></tr>
            </thead>
            <tbody>
              <tr><td>Cache Disabled</td><td>312 ms</td><td>423 ms</td><td>534 ms</td></tr>
              <tr><td>Cache (Cold)</td><td>134 ms</td><td>198 ms</td><td>267 ms</td></tr>
              <tr><td>Cache (Warm)</td><td>98 ms</td><td>156 ms</td><td>212 ms</td></tr>
            </tbody>
          </table>
        </section>
      </section>
      
      <section anchor="test-memory">
        <name>Memory Pressure Behavior</name>
        
        <section anchor="memory-objective">
          <name>Objective</name>
          <t>To characterize SUT behavior when memory resources are constrained,
          including preemption, swapping, and degradation patterns.</t>
        </section>
        
        <section anchor="memory-setup">
          <name>Setup Parameters</name>
          <dl>
            <dt>Workload:</dt>
            <dd>Long Context workload RECOMMENDED to create memory pressure.</dd>
            
            <dt>Oversubscription level:</dt>
            <dd>Percentage above maximum capacity (e.g., 110%, 125%, 150%).</dd>
          </dl>
        </section>
        
        <section anchor="memory-procedure">
          <name>Procedure</name>
          <ol>
            <li>Determine maximum concurrent capacity from <xref target="test-capacity"/>.</li>
            <li>Configure SUT and complete warm-up.</li>
            <li><t>For each oversubscription level:</t>
              <ol type="a">
                <li>Submit requests at concurrency exceeding capacity.</li>
                <li>Run for at least 120 seconds.</li>
                <li>Monitor request completions, preemption events, latency.</li>
                <li>Record any OOM errors or system failures.</li>
              </ol>
            </li>
            <li>Analyze degradation patterns.</li>
          </ol>
        </section>
        
        <section anchor="memory-measurements">
          <name>Measurements</name>
          <dl>
            <dt>Completion rate:</dt>
            <dd>Percentage of requests completing successfully at each level.</dd>
            
            <dt>Preemption rate:</dt>
            <dd>Fraction of requests preempted at least once.</dd>
            
            <dt>Preemption recovery rate:</dt>
            <dd>Fraction of preempted requests that eventually complete.</dd>
            
            <dt>Preemption loss:</dt>
            <dd>Average tokens discarded per preemption event.</dd>
          </dl>
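          <t>The completion, preemption, and recovery rates above can be
          computed from per-request records as sketched below; the record
          field names are assumptions for illustration.</t>
          <sourcecode type="python"><![CDATA[
def memory_pressure_metrics(records):
    # Each record: {'completed': bool, 'preemptions': int}
    # (field names are illustrative).
    n = len(records)
    completed = sum(1 for r in records if r['completed'])
    preempted = [r for r in records if r['preemptions'] > 0]
    recovered = sum(1 for r in preempted if r['completed'])
    return {
        'completion_rate': completed / n,
        'preemption_rate': len(preempted) / n,
        'recovery_rate': recovered / len(preempted) if preempted else 1.0,
    }

sample = [
    {'completed': True,  'preemptions': 0},
    {'completed': True,  'preemptions': 1},
    {'completed': False, 'preemptions': 2},
    {'completed': True,  'preemptions': 0},
]
print(memory_pressure_metrics(sample))
]]></sourcecode>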
        </section>
        
        <section anchor="memory-reporting">
          <name>Reporting Format</name>
          <table anchor="memory-degradation-table">
            <name>Memory Pressure Degradation Example</name>
            <thead>
              <tr><th>Oversub Level</th><th>Complete</th><th>Preempt</th><th>Fail Rate</th><th>TTFT P99</th></tr>
            </thead>
            <tbody>
              <tr><td>100% (base)</td><td>99.7%</td><td>0%</td><td>0.3%</td><td>523 ms</td></tr>
              <tr><td>110%</td><td>98.2%</td><td>5.2%</td><td>1.8%</td><td>789 ms</td></tr>
              <tr><td>125%</td><td>94.5%</td><td>18.7%</td><td>5.5%</td><td>1456 ms</td></tr>
              <tr><td>150%</td><td>82.3%</td><td>42.1%</td><td>17.7%</td><td>3234 ms</td></tr>
            </tbody>
          </table>
        </section>
      </section>
      
      <section anchor="test-long-context">
        <name>Long Context Scaling</name>
        
        <section anchor="long-context-objective">
          <name>Objective</name>
          <t>To characterize how latency and throughput scale with context
          length.</t>
        </section>
        
        <section anchor="long-context-setup">
          <name>Setup Parameters</name>
          <dl>
            <dt>Workload:</dt>
            <dd>Long Context workload REQUIRED.</dd>
            
            <dt>Context length range:</dt>
            <dd>Sequence of lengths to test (e.g., 1K, 2K, 4K, 8K, 16K, 32K,
            64K, 128K tokens).</dd>
            
            <dt>Fixed output length:</dt>
            <dd>Use consistent short output (256 tokens) to isolate prefill impact.</dd>
            
            <dt>Load model:</dt>
            <dd>Closed-loop with low concurrency (1-4).</dd>
            
            <dt>Requests per length:</dt>
            <dd>At least 20 requests per context length.</dd>
          </dl>
        </section>
        
        <section anchor="long-context-procedure">
          <name>Procedure</name>
          <ol>
            <li>Configure SUT and complete warm-up with short-context requests.</li>
            <li><t>For each context length in ascending order:</t>
              <ol type="a">
                <li>Generate requests with specified input length.</li>
                <li>Submit requests at low concurrency.</li>
                <li>Record TTFT and total latency for each request.</li>
              </ol>
            </li>
            <li>Analyze scaling behavior and fit to scaling models.</li>
          </ol>
        </section>
        
        <section anchor="long-context-measurements">
          <name>Measurements</name>
          <dl>
            <dt>Per-length latency:</dt>
            <dd>TTFT Mean, P50, P95 for each context length.</dd>
            
            <dt>Prefill scaling:</dt>
            <dd>Time per input token (TTFT / input_length).</dd>
            
            <dt>Scaling exponent:</dt>
            <dd>Fit exponent k where TTFT proportional to context_length^k.</dd>
            
            <dt>Throughput at length:</dt>
            <dd>Maximum throughput achievable at each context length.</dd>
          </dl>
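          <t>The scaling exponent can be obtained by an ordinary
          least-squares fit in log-log space, as in this sketch (pure
          Python; no fitting library is assumed):</t>
          <sourcecode type="python"><![CDATA[
import math

def scaling_exponent(context_lengths, ttft_ms):
    # Fit k in TTFT ~ C * length^k via least squares on logs.
    xs = [math.log(c) for c in context_lengths]
    ys = [math.log(t) for t in ttft_ms]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Sanity check with perfectly linear scaling (68 us/token):
lengths = [1024, 4096, 16384, 65536]
ttfts = [0.068 * c for c in lengths]
print(round(scaling_exponent(lengths, ttfts), 3))  # 1.0
]]></sourcecode>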
        </section>
        
        <section anchor="long-context-reporting">
          <name>Reporting Format</name>
          <table anchor="long-context-table">
            <name>Long Context Scaling Example</name>
            <thead>
              <tr><th>Context (tokens)</th><th>TTFT Mean</th><th>TTFT P95</th><th>ms/1K tokens</th></tr>
            </thead>
            <tbody>
              <tr><td>1024</td><td>89 ms</td><td>112 ms</td><td>76</td></tr>
              <tr><td>4096</td><td>289 ms</td><td>367 ms</td><td>63</td></tr>
              <tr><td>16384</td><td>1023 ms</td><td>1287 ms</td><td>59</td></tr>
              <tr><td>65536</td><td>4234 ms</td><td>5123 ms</td><td>62</td></tr>
              <tr><td>131072</td><td>9123 ms</td><td>11234 ms</td><td>68</td></tr>
            </tbody>
          </table>
          <t>Best fit: linear (R^2 = 0.9987), ~68 microseconds per input token.</t>
        </section>
      </section>
      
      <section anchor="test-guardrail">
        <name>Guardrail Overhead</name>
        
        <section anchor="guardrail-objective">
          <name>Objective</name>
          <t>To quantify the latency impact of safety systems and content
          filtering.</t>
        </section>
        
        <section anchor="guardrail-setup">
          <name>Setup Parameters</name>
          <dl>
            <dt>Workload:</dt>
            <dd>Conversation workload RECOMMENDED.</dd>
            
            <dt>Content mix:</dt>
            <dd>Use benign content to measure processing overhead.</dd>
            
            <dt>Configurations to compare:</dt>
            <dd><t>The following configurations should be tested:</t>
              <ul>
                <li>Baseline: All guardrails disabled (if possible)</li>
                <li>Input filtering only</li>
                <li>Output filtering only</li>
                <li>Full filtering: All production guardrails enabled</li>
              </ul>
            </dd>
            
            <dt>Load levels:</dt>
            <dd>Test at 25%, 50%, 75% of capacity.</dd>
          </dl>
        </section>
        
        <section anchor="guardrail-procedure">
          <name>Procedure</name>
          <ol>
            <li>Configure SUT with baseline (no guardrails).</li>
            <li>Complete warm-up and run workload at each load level.</li>
            <li>Enable each guardrail configuration and repeat.</li>
            <li>Compare results across configurations.</li>
          </ol>
        </section>
        
        <section anchor="guardrail-measurements">
          <name>Measurements</name>
          <dl>
            <dt>Per-configuration latency:</dt>
            <dd>TTFT P50, P95, P99 and End-to-end latency for each configuration.</dd>
            
            <dt>Input filter overhead:</dt>
            <dd>TTFT(input_filter) - TTFT(baseline)</dd>
            
            <dt>Total guardrail overhead:</dt>
            <dd>End-to-end(full) - End-to-end(baseline)</dd>
            
            <dt>Throughput reduction:</dt>
            <dd>Percentage reduction from guardrails.</dd>
          </dl>
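          <t>The overhead deltas reduce to simple differences; a sketch
          follows, with illustrative dictionary keys and values.</t>
          <sourcecode type="python"><![CDATA[
def guardrail_overhead(baseline, config):
    # Inputs: {'ttft_p50': ms, 'e2e_p50': s, 'throughput': tok/s}
    # (keys are illustrative).
    return {
        'ttft_overhead_ms': config['ttft_p50'] - baseline['ttft_p50'],
        'e2e_overhead_s': config['e2e_p50'] - baseline['e2e_p50'],
        'throughput_reduction_pct':
            100.0 * (baseline['throughput'] - config['throughput'])
            / baseline['throughput'],
    }

base = {'ttft_p50': 98, 'e2e_p50': 4.2, 'throughput': 2867}
full = {'ttft_p50': 118, 'e2e_p50': 5.0, 'throughput': 2289}
print(guardrail_overhead(base, full))
]]></sourcecode>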
        </section>
        
        <section anchor="guardrail-reporting">
          <name>Reporting Format</name>
          <table anchor="guardrail-latency-table">
            <name>Guardrail Overhead Example</name>
            <thead>
              <tr><th>Configuration</th><th>TTFT P50</th><th>TTFT P99</th><th>E2E P50</th><th>E2E P99</th></tr>
            </thead>
            <tbody>
              <tr><td>Baseline</td><td>98 ms</td><td>234 ms</td><td>4.2 s</td><td>8.7 s</td></tr>
              <tr><td>Input Filter</td><td>112 ms</td><td>267 ms</td><td>4.3 s</td><td>8.9 s</td></tr>
              <tr><td>Output Filter</td><td>101 ms</td><td>242 ms</td><td>4.8 s</td><td>9.8 s</td></tr>
              <tr><td>Full Filter</td><td>118 ms</td><td>289 ms</td><td>5.0 s</td><td>10.2 s</td></tr>
            </tbody>
          </table>
          
          <table anchor="guardrail-throughput-table">
            <name>Throughput Impact Example</name>
            <thead>
              <tr><th>Configuration</th><th>Max Throughput</th><th>Reduction</th></tr>
            </thead>
            <tbody>
              <tr><td>Baseline</td><td>2867 tok/s</td><td>-</td></tr>
              <tr><td>Input Filter</td><td>2756 tok/s</td><td>-3.9%</td></tr>
              <tr><td>Output Filter</td><td>2412 tok/s</td><td>-15.9%</td></tr>
              <tr><td>Full Filter</td><td>2289 tok/s</td><td>-20.2%</td></tr>
            </tbody>
          </table>
        </section>
      </section>
    </section>
    
    <section anchor="comparison-guidelines">
      <name>Multi-System Comparison Guidelines</name>
      
      <t>When comparing multiple SUTs, testers should apply the following
      guidelines.</t>
      
      <section anchor="equivalence">
        <name>Equivalence Requirements</name>
        <t>Testers MUST ensure:</t>
        <ul>
          <li>Identical workload (same requests in same order with same seeds)</li>
          <li>Equivalent SUT boundary (all systems at same boundary)</li>
          <li>Comparable hardware (or normalize by hardware capability)</li>
          <li>Same load model and parameters</li>
        </ul>
      </section>
      
      <section anchor="normalization">
        <name>Normalization</name>
        <t>When hardware differs:</t>
        <ul>
          <li>Report tokens per GPU-second (normalized by GPU count)</li>
          <li>Report cost-normalized throughput (tokens per dollar-hour)</li>
          <li>Clearly state normalization method</li>
        </ul>
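        <t>A sketch of both normalizations; the GPU count and hourly
        price below are illustrative inputs, not reference values.</t>
        <sourcecode type="python"><![CDATA[
def normalized_throughput(tokens_per_s, gpu_count, dollars_per_hour):
    # Hardware- and cost-normalized throughput, as listed above.
    return {
        'tok_per_gpu_s': tokens_per_s / gpu_count,
        'tok_per_dollar_hour': tokens_per_s * 3600.0 / dollars_per_hour,
    }

print(normalized_throughput(2867.0, 8, 24.0))
# {'tok_per_gpu_s': 358.375, 'tok_per_dollar_hour': 430050.0}
]]></sourcecode>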
      </section>
      
      <section anchor="statistical-significance">
        <name>Statistical Significance</name>
        <t>For comparative claims:</t>
        <ul>
          <li>Report confidence intervals for key metrics</li>
          <li>Conduct multiple independent runs (at least 3)</li>
          <li>Use appropriate statistical tests for comparison</li>
        </ul>
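        <t>One common way to obtain confidence intervals is the
        percentile bootstrap, sketched below for the sample mean; other
        statistical methods are equally acceptable.</t>
        <sourcecode type="python"><![CDATA[
import random

def bootstrap_ci(samples, n_boot=2000, alpha=0.05, seed=42):
    # Percentile bootstrap CI for the sample mean.
    rng = random.Random(seed)
    n = len(samples)
    boots = sorted(
        sum(samples[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_boot)
    )
    return (boots[int(alpha / 2 * n_boot)],
            boots[int((1 - alpha / 2) * n_boot) - 1])

ttft_p50_ms = [98, 102, 95, 110, 99, 105, 97, 101, 103, 96]
lo, hi = bootstrap_ci(ttft_p50_ms)
print(f"95% CI for mean TTFT: [{lo:.1f}, {hi:.1f}] ms")
]]></sourcecode>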
      </section>
      
      <section anchor="fair-comparison-checklist">
        <name>Fair Comparison Checklist</name>
        <t>Before publishing comparative results, verify:</t>
        <ul>
          <li>Same workload specification</li>
          <li>Same test duration</li>
          <li>Same warm-up procedure</li>
          <li>Same success criteria</li>
          <li>Both systems tested at the same time (if using shared resources)</li>
          <li>Both systems in production-representative configuration</li>
          <li>Differences in configuration explicitly noted</li>
        </ul>
      </section>
    </section>

    <section anchor="security">
      <name>Security Considerations</name>
      
      <t>Benchmarking methodology intersects with security in several ways.</t>
      
      <section anchor="side-channel-risks">
        <name>Side-Channel Risks</name>
        <t>Benchmark results may reveal:</t>
        <ul>
          <li>System capacity limits useful for DoS planning</li>
          <li>Timing patterns enabling cache probing attacks</li>
          <li>Memory pressure thresholds for resource exhaustion</li>
        </ul>
        <t>Operators SHOULD consider whether to publish detailed capacity
        information publicly.</t>
      </section>
      
      <section anchor="benchmark-gaming">
        <name>Benchmark Gaming</name>
        <t>Systems may be optimized specifically for benchmark workloads in
        ways that do not generalize:</t>
        <ul>
          <li>Detecting benchmark patterns and applying special handling</li>
          <li>Caching benchmark-specific prefixes</li>
          <li>Prioritizing benchmark-like requests</li>
        </ul>
        <t>Testers SHOULD vary workloads and verify results with production
        traffic samples.</t>
      </section>
      
      <section anchor="adversarial-workloads">
        <name>Adversarial Workloads</name>
        <t>This methodology uses benign workloads. Adversarial inputs
        (jailbreak attempts, prompt injections) may have different
        performance characteristics due to guardrail processing.</t>
        <t>Testing with adversarial workloads requires additional ethical
        and safety considerations not covered here.</t>
      </section>
      
      <section anchor="resource-exhaustion">
        <name>Resource Exhaustion</name>
        <t>Memory pressure tests (<xref target="test-memory"/>) intentionally
        push systems beyond capacity. Testers SHOULD:</t>
        <ul>
          <li>Conduct such tests on isolated systems</li>
          <li>Have recovery procedures ready</li>
          <li>Monitor for cascading failures</li>
        </ul>
      </section>
    </section>
  </middle>

  <back>
    <references>
      <name>References</name>
      
      <references>
        <name>Normative References</name>
        
        <reference anchor="RFC2119" target="https://www.rfc-editor.org/info/rfc2119">
          <front>
            <title>Key words for use in RFCs to Indicate Requirement Levels</title>
            <author fullname="S. Bradner" initials="S." surname="Bradner"/>
            <date month="March" year="1997"/>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="2119"/>
          <seriesInfo name="DOI" value="10.17487/RFC2119"/>
        </reference>
        
        <reference anchor="RFC8174" target="https://www.rfc-editor.org/info/rfc8174">
          <front>
            <title>Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words</title>
            <author fullname="B. Leiba" initials="B." surname="Leiba"/>
            <date month="May" year="2017"/>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="8174"/>
          <seriesInfo name="DOI" value="10.17487/RFC8174"/>
        </reference>
        
        <reference anchor="LLM-TERMS" target="https://datatracker.ietf.org/doc/draft-gaikwad-llm-benchmarking-terminology/">
          <front>
            <title>Benchmarking Terminology for Large Language Model Serving</title>
            <author fullname="Madhava Gaikwad" initials="M." surname="Gaikwad"/>
            <date month="January" year="2026"/>
          </front>
          <seriesInfo name="Internet-Draft" value="draft-gaikwad-llm-benchmarking-terminology-00"/>
        </reference>
      </references>
      
      <references>
        <name>Informative References</name>
        
        <reference anchor="RFC1242" target="https://www.rfc-editor.org/info/rfc1242">
          <front>
            <title>Benchmarking Terminology for Network Interconnection Devices</title>
            <author fullname="S. Bradner" initials="S." surname="Bradner"/>
            <date month="July" year="1991"/>
          </front>
          <seriesInfo name="RFC" value="1242"/>
          <seriesInfo name="DOI" value="10.17487/RFC1242"/>
        </reference>
        
        <reference anchor="RFC2544" target="https://www.rfc-editor.org/info/rfc2544">
          <front>
            <title>Benchmarking Methodology for Network Interconnect Devices</title>
            <author fullname="S. Bradner" initials="S." surname="Bradner"/>
            <author fullname="J. McQuaid" initials="J." surname="McQuaid"/>
            <date month="March" year="1999"/>
          </front>
          <seriesInfo name="RFC" value="2544"/>
          <seriesInfo name="DOI" value="10.17487/RFC2544"/>
        </reference>
        
        <reference anchor="RFC3511" target="https://www.rfc-editor.org/info/rfc3511">
          <front>
            <title>Benchmarking Methodology for Firewall Performance</title>
            <author fullname="B. Hickman" initials="B." surname="Hickman"/>
            <author fullname="D. Newman" initials="D." surname="Newman"/>
            <author fullname="S. Tadjudin" initials="S." surname="Tadjudin"/>
            <author fullname="T. Martin" initials="T." surname="Martin"/>
            <date month="April" year="2003"/>
          </front>
          <seriesInfo name="RFC" value="3511"/>
          <seriesInfo name="DOI" value="10.17487/RFC3511"/>
        </reference>
        
        <reference anchor="VLLM">
          <front>
            <title>Efficient Memory Management for Large Language Model Serving with PagedAttention</title>
            <author fullname="Woosuk Kwon" initials="W." surname="Kwon"/>
            <date year="2023"/>
          </front>
          <seriesInfo name="Proceedings of" value="SOSP 2023"/>
          <seriesInfo name="DOI" value="10.1145/3600006.3613165"/>
        </reference>
        
        <reference anchor="SARATHI">
          <front>
            <title>Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve</title>
            <author fullname="Amey Agrawal" initials="A." surname="Agrawal"/>
            <date year="2024"/>
          </front>
          <seriesInfo name="Proceedings of" value="OSDI 2024"/>
        </reference>
      </references>
    </references>
    
    <section anchor="workload-specs">
      <name>Reference Workload Specifications</name>
      
      <t>This appendix provides complete specifications for standard workloads.</t>
      
      <section anchor="synthetic-uniform-spec">
        <name>Synthetic-Uniform Workload</name>
        
        <t>Purpose: Controlled baseline with minimal variance</t>
        
        <section anchor="uniform-input">
          <name>Input Specification</name>
          <dl>
            <dt>Distribution:</dt><dd>Uniform</dd>
            <dt>Minimum:</dt><dd>128 tokens</dd>
            <dt>Maximum:</dt><dd>512 tokens</dd>
            <dt>Mean:</dt><dd>320 tokens</dd>
            <dt>Content:</dt><dd>Random token IDs from vocabulary</dd>
          </dl>
        </section>
        
        <section anchor="uniform-output">
          <name>Output Specification</name>
          <dl>
            <dt>Distribution:</dt><dd>Uniform</dd>
            <dt>Minimum:</dt><dd>64 tokens</dd>
            <dt>Maximum:</dt><dd>256 tokens</dd>
            <dt>Mean:</dt><dd>160 tokens</dd>
            <dt>Control:</dt><dd>max_tokens parameter</dd>
          </dl>
        </section>
        
        <section anchor="uniform-other">
          <name>Other Parameters</name>
          <dl>
            <dt>System prompt:</dt><dd>None</dd>
            <dt>Prefix sharing:</dt><dd>None</dd>
            <dt>Temperature:</dt><dd>0.0 (deterministic)</dd>
            <dt>Stop sequences:</dt><dd>None</dd>
          </dl>
        </section>
        
        <section anchor="uniform-generation">
          <name>Generation Method</name>
          <t>Reference implementation (Python):</t>
          <sourcecode type="python"><![CDATA[
import random

def generate_synthetic_uniform(n_requests, seed=42):
    rng = random.Random(seed)
    requests = []
    for i in range(n_requests):
        input_len = rng.randint(128, 512)
        output_len = rng.randint(64, 256)
        input_tokens = [rng.randint(0, 100255)
                        for _ in range(input_len)]
        requests.append({
            'input_tokens': input_tokens,
            'max_tokens': output_len,
            'temperature': 0.0
        })
    return requests
]]></sourcecode>
        </section>
      </section>
      
      <section anchor="synthetic-skewed-spec">
        <name>Synthetic-Skewed Workload</name>
        
        <t>Purpose: Test scheduling with high length variance</t>
        
        <section anchor="skewed-input">
          <name>Input Specification</name>
          <dl>
            <dt>Distribution:</dt><dd>Log-normal</dd>
            <dt>mu:</dt><dd>5.5 (in log space)</dd>
            <dt>sigma:</dt><dd>1.0 (in log space)</dd>
            <dt>Minimum:</dt><dd>32 tokens (floor)</dd>
            <dt>Maximum:</dt><dd>4096 tokens (cap)</dd>
            <dt>Median:</dt><dd>~245 tokens</dd>
            <dt>Mean:</dt><dd>~405 tokens</dd>
          </dl>
        </section>
        
        <section anchor="skewed-output">
          <name>Output Specification</name>
          <dl>
            <dt>Distribution:</dt><dd>Log-normal</dd>
            <dt>mu:</dt><dd>4.5 (in log space)</dd>
            <dt>sigma:</dt><dd>1.2 (in log space)</dd>
            <dt>Minimum:</dt><dd>16 tokens (floor)</dd>
            <dt>Maximum:</dt><dd>2048 tokens (cap)</dd>
          </dl>
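          <t>Lengths for this workload can be drawn with the standard
          library's log-normal sampler and clamped to the floor and cap
          above, as in this sketch:</t>
          <sourcecode type="python"><![CDATA[
import random

def sample_length(rng, mu, sigma, floor, cap):
    # Log-normal draw clamped to [floor, cap].
    return max(floor, min(cap, round(rng.lognormvariate(mu, sigma))))

rng = random.Random(42)
inputs = sorted(sample_length(rng, 5.5, 1.0, 32, 4096)
                for _ in range(10000))
outputs = [sample_length(rng, 4.5, 1.2, 16, 2048)
           for _ in range(10000)]
print(inputs[5000])  # sample median, near exp(5.5) ~ 245
]]></sourcecode>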
        </section>
      </section>
      
      <section anchor="conversation-spec">
        <name>Conversation Workload</name>
        
        <t>Purpose: Realistic interactive chat patterns</t>
        
        <section anchor="conv-source">
          <name>Data Source</name>
          <dl>
            <dt>Dataset:</dt><dd>ShareGPT (vicuna_cleaned subset)</dd>
            <dt>Version:</dt><dd>2023-04-12</dd>
            <dt>Preprocessing:</dt><dd>Retain conversations with at least one assistant turn</dd>
          </dl>
        </section>
        
        <section anchor="conv-stats">
          <name>Length Statistics (Reference)</name>
          <t>Input tokens:</t>
          <ul>
            <li>P50: 156</li>
            <li>P95: 892</li>
            <li>P99: 2134</li>
          </ul>
          <t>Output tokens:</t>
          <ul>
            <li>P50: 234</li>
            <li>P95: 789</li>
            <li>P99: 1567</li>
          </ul>
        </section>
      </section>
      
      <section anchor="code-spec">
        <name>Code Completion Workload</name>
        
        <t>Purpose: Test prefix caching with code context</t>
        
        <section anchor="code-source">
          <name>Data Source</name>
          <dl>
            <dt>Dataset:</dt><dd>The Stack (Python, JavaScript, TypeScript subset)</dd>
            <dt>Preprocessing:</dt><dd>Extract function-level completions</dd>
          </dl>
        </section>
        
        <section anchor="code-prefix">
          <name>Prefix Sharing Pattern</name>
          <ul>
            <li>10 unique repository contexts</li>
            <li>Each 512-1024 tokens</li>
            <li>80% of requests share one of these prefixes</li>
            <li>Distribution: Zipf with s=1.5</li>
          </ul>
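          <t>A sketch of the sharing pattern above: 80% of requests draw
          one of ten prefixes with Zipf (s=1.5) weights, and the rest
          carry no shared prefix. The function name is illustrative.</t>
          <sourcecode type="python"><![CDATA[
import random

def assign_prefix(rng, n_prefixes=10, share_frac=0.8, s=1.5):
    # Returns a prefix index, or None for unshared requests.
    # Zipf weights: P(rank r) proportional to r**-s.
    if rng.random() >= share_frac:
        return None
    weights = [r ** -s for r in range(1, n_prefixes + 1)]
    u, acc = rng.random() * sum(weights), 0.0
    for i, w in enumerate(weights):
        acc += w
        if u < acc:
            return i
    return n_prefixes - 1

rng = random.Random(7)
picks = [assign_prefix(rng) for _ in range(10000)]
shared = [p for p in picks if p is not None]
print(round(len(shared) / len(picks), 2))  # ~0.8
]]></sourcecode>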
        </section>
      </section>
      
      <section anchor="long-context-spec">
        <name>Long Context Workload</name>
        
        <t>Purpose: Test long-context handling</t>
        
        <section anchor="long-input">
          <name>Input Specification</name>
          <dl>
            <dt>Distribution:</dt><dd>Uniform over target lengths</dd>
            <dt>Target lengths:</dt><dd>[8192, 16384, 32768, 65536, 131072] tokens</dd>
            <dt>Structure:</dt><dd>[document][question]</dd>
            <dt>Document:</dt><dd>Fills target length minus 100 tokens</dd>
            <dt>Question:</dt><dd>Fixed ~100 token question about document</dd>
          </dl>
        </section>
        
        <section anchor="long-output">
          <name>Output Specification</name>
          <dl>
            <dt>Distribution:</dt><dd>Fixed</dd>
            <dt>Length:</dt><dd>256 tokens</dd>
            <dt>Control:</dt><dd>max_tokens = 256</dd>
          </dl>
        </section>
      </section>
    </section>
    
    <section anchor="timing-reference">
      <name>Timing Measurement Reference</name>
      
      <t>This appendix provides detailed guidance for timing measurements.</t>
      
      <section anchor="ttft-measurement-points">
        <name>TTFT Measurement Points</name>
        
        <section anchor="http-sse">
          <name>HTTP/SSE Measurement</name>
          <t>Client-side TTFT:</t>
          <dl>
            <dt>T_submit:</dt><dd>time of sending final byte of HTTP request</dd>
            <dt>T_first:</dt><dd>time of receiving first data event with content token</dd>
          </dl>
          <t>T_first is when the complete "data:" line is received and parsed,
          not when the first byte of the response arrives.</t>
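          <t>The client-side measurement can be sketched
          transport-agnostically; here send_request and the line iterator
          are stand-ins for a real HTTP client, not a specific API.</t>
          <sourcecode type="python"><![CDATA[
import time

def measure_ttft(send_request, sse_lines):
    # T_submit: after the final request byte is sent.
    # T_first: first complete "data:" line carrying content,
    # not the first byte of the response.
    send_request()
    t_submit = time.monotonic()
    for line in sse_lines():
        if not line.startswith("data:"):
            continue  # SSE comments / keep-alives
        payload = line[len("data:"):].strip()
        if payload and payload != "[DONE]":
            return time.monotonic() - t_submit
    return None

def fake_stream():  # stub transport for illustration
    yield ": keep-alive"
    yield "data: "
    yield 'data: {"token": "Hi"}'

ttft = measure_ttft(lambda: None, fake_stream)
print(ttft is not None)  # True
]]></sourcecode>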
        </section>
        
        <section anchor="grpc-measurement">
          <name>gRPC Streaming Measurement</name>
          <dl>
            <dt>T_submit:</dt><dd>time of sending request message</dd>
            <dt>T_first:</dt><dd>time of receiving first response message with token</dd>
          </dl>
        </section>
        
        <section anchor="server-side-measurement">
          <name>Server-Side Measurement</name>
          <t>If server-side instrumentation is available:</t>
          <dl>
            <dt>T_submit:</dt><dd>time request enters inference queue</dd>
            <dt>T_first:</dt><dd>time first token exits model forward pass</dd>
          </dl>
          <t>Server-side measurement excludes network latency but may include
          internal queue time.</t>
        </section>
      </section>
      
      <section anchor="itl-sse-measurement">
        <name>ITL Measurement with SSE</name>
        
        <t>SSE delivery may batch multiple tokens per event due to server-side
        batching, TCP buffering, or client-side buffering.</t>
        
        <section anchor="recommended-approach">
          <name>Recommended Approach</name>
          <ol>
            <li>First, characterize the delivery pattern (tokens per chunk)</li>
            <li>If single-token chunks dominate (&gt;90%): use direct measurement</li>
            <li>If multi-token chunks are common: prefer server timestamps</li>
            <li>If server timestamps are unavailable: use chunk timing and
            document the limitation</li>
          </ol>
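          <t>Step 1 reduces to a simple profile of token counts per
          received chunk, as sketched here with illustrative data:</t>
          <sourcecode type="python"><![CDATA[
def single_token_fraction(chunks):
    # 'chunks' lists the token count carried by each SSE event.
    return sum(1 for c in chunks if c == 1) / len(chunks)

observed = [1] * 19 + [3]  # illustrative delivery pattern
frac = single_token_fraction(observed)
print(f"single-token chunks: {frac:.0%}")     # 95%
print("direct measurement OK:", frac > 0.90)  # True
]]></sourcecode>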
        </section>
      </section>
      
      <section anchor="clock-sync-methods">
        <name>Clock Synchronization Methods</name>
        
        <section anchor="ntp-sync">
          <name>NTP Synchronization</name>
          <ol>
            <li>Both machines sync to same NTP server</li>
            <li>Verify offset: ntpq -p (check offset column)</li>
            <li>Acceptable offset: &lt; 10ms for most LLM benchmarking</li>
            <li>Document NTP server and measured offset</li>
          </ol>
        </section>
        
        <section anchor="ptp-sync">
          <name>PTP Synchronization</name>
          <t>For sub-millisecond accuracy:</t>
          <ol>
            <li>Use PTP-capable network hardware</li>
            <li>Configure ptp4l on Linux systems</li>
            <li>Acceptable offset: &lt; 1 microsecond</li>
          </ol>
        </section>
        
        <section anchor="single-machine-testing">
          <name>Single-Machine Alternative</name>
          <t>Recommended for Model Engine testing:</t>
          <ol>
            <li>Run load generator on same machine as SUT</li>
            <li>Use loopback network interface</li>
            <li>Clock synchronization inherent</li>
            <li>Eliminates network latency from measurement</li>
          </ol>
        </section>
      </section>
    </section>
    
    <section anchor="reporting-templates">
      <name>Reporting Templates</name>
      
      <section anchor="minimum-report">
        <name>Minimum Viable Report</name>
        <t>For quick comparisons, include at minimum:</t>
        <artwork type="ascii-art"><![CDATA[
=== LLM Benchmark Report (Minimum) ===

System Identification:
- Model: [model name and version]
- Hardware: [GPU type] x [count]
- Software: [inference engine and version]
- SUT Boundary: [Model Engine | Gateway | Compound]

Test Configuration:
- Workload: [workload name]
- Load Model: [open-loop rate | closed-loop concurrency]
- Request Count: [N]
- Test Duration: [seconds]

Key Results:
- TTFT P50: [value] ms
- TTFT P99: [value] ms
- TPOT P50: [value] ms
- TPOT P99: [value] ms
- Max Throughput: [value] tok/s
- Throughput at P99 TTFT < 500ms: [value] tok/s

Notes:
- [Any deviations from methodology]
- [Guardrail configuration]

=== End Report ===
]]></artwork>
      </section>
      
      <section anchor="full-report">
        <name>Full Report Template</name>
        <t>A complete benchmark report should include the following sections:</t>
        <ol>
          <li>System Identification (model, hardware, software)</li>
          <li>Test Configuration (workload, load, execution parameters)</li>
          <li>Results (latency summary, throughput summary, success metrics)</li>
          <li>Detailed Results (per-test tables and visualizations)</li>
          <li>Methodology Compliance (tests performed, deviations, limitations)</li>
          <li>Reproduction Information (test harness, configuration, data)</li>
        </ol>
      </section>
    </section>
    
    <section numbered="false" anchor="acknowledgements">
      <name>Acknowledgements</name>
      <t>This document draws on the structure and approach established by
      RFC 3511 for firewall benchmarking methodology. The author thanks
      the Benchmarking Methodology Working Group for their foundational
      work in network device benchmarking.</t>
    </section>
  </back>
</rfc>
