<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>

<rfc xmlns:xi="http://www.w3.org/2001/XInclude"
     category="info"
     docName="draft-gaikwad-llm-benchmarking-methodology-00"
     ipr="trust200902"
     obsoletes=""
     updates=""
     submissionType="IETF"
     xml:lang="en"
     version="3">

  <front>
    <title abbrev="LLM Benchmarking Methodology">Benchmarking Methodology for Large Language Model Serving</title>
    
    <seriesInfo name="Internet-Draft" value="draft-gaikwad-llm-benchmarking-methodology-00"/>
    
    <author fullname="Madhava Gaikwad" initials="M." surname="Gaikwad">
      <organization>Independent Researcher</organization>
      <address>
        <email>gaikwad.madhav@gmail.com</email>
      </address>
    </author>
    
    <date year="2026" month="January"/>
    
    <area>Operations and Management</area>
    <workgroup>Network Working Group</workgroup>
    
    <keyword>LLM</keyword>
    <keyword>benchmarking</keyword>
    <keyword>inference</keyword>
    <keyword>serving</keyword>
    <keyword>methodology</keyword>
    
    <abstract>
      <t>This document defines benchmarking methodologies for Large Language
      Model (LLM) inference serving systems. It provides test procedures,
      setup parameters, measurement specifications, and reporting formats
      for evaluating latency, throughput, scheduling, and resource
      management characteristics. This document is a companion to
      "Benchmarking Terminology for Large Language Model Serving", which
      defines the metrics measured here and should be consulted alongside
      this document.</t>
    </abstract>
  </front>

  <middle>
    <section anchor="introduction">
      <name>Introduction</name>
      
      <t>This document provides benchmarking methodologies for Large Language
      Model inference serving systems. It defines test procedures,
      measurement specifications, and reporting formats that enable
      meaningful performance comparison.</t>
      
      <t>A companion document, "Benchmarking Terminology for Large Language
      Model Serving" <xref target="LLM-TERMS"/>, defines the metrics
      referenced in this methodology and SHOULD be consulted before
      applying it.</t>
      
      <t>LLM serving systems present unique benchmarking challenges:</t>
      
      <dl>
        <dt>Streaming responses:</dt>
        <dd>Output tokens arrive incrementally over seconds or minutes,
        requiring timing measurements at multiple points within a single
        request.</dd>
        
        <dt>Phase separation:</dt>
        <dd>The prefill phase (processing input) and decode phase
        (generating output) have distinct computational profiles and
        optimization targets.</dd>
        
        <dt>Memory-bound decoding:</dt>
        <dd>The decode phase is limited by memory bandwidth rather than
        compute, creating different bottlenecks than traditional neural
        network inference.</dd>
        
        <dt>Dynamic batching:</dt>
        <dd>Continuous batching systems interleave requests, causing
        per-request performance to depend on concurrent load.</dd>
        
        <dt>Context-dependent performance:</dt>
        <dd>Request latency varies with input length, output length, and
        cache state, making workload specification critical.</dd>
      </dl>
      
      <t>These characteristics require methodology beyond traditional
      throughput and latency measurement. This document addresses these
      challenges by specifying:</t>
      
      <ul>
        <li>Test configurations for different system boundaries</li>
        <li>Reference workloads with defined characteristics</li>
        <li>Measurement procedures for streaming responses</li>
        <li>Statistical requirements for reliable percentile estimation</li>
        <li>Reporting formats enabling meaningful comparison</li>
      </ul>
      
      <t>This document does not specify acceptance thresholds or recommend
      particular systems. It provides methodology for fair comparison.</t>
    </section>

    <section anchor="requirements">
      <name>Requirements Language</name>
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
      NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED",
      "MAY", and "OPTIONAL" in this document are to be interpreted as
      described in BCP 14 <xref target="RFC2119"/> <xref target="RFC8174"/>
      when, and only when, they appear in all capitals, as shown here.</t>
      
      <t>An implementation is not compliant if it fails to satisfy one or more
      of the MUST requirements for a given test. An implementation that
      satisfies all the MUST and all the SHOULD requirements for a test is
      said to be "unconditionally compliant" for that test; one that
      satisfies all the MUST requirements but not all the SHOULD
      requirements is said to be "conditionally compliant."</t>
    </section>

    <section anchor="scope">
      <name>Scope</name>
      
      <t>This document covers benchmarking methodology for transformer-based
      autoregressive language models deployed as network services. The
      methodology applies to:</t>
      
      <ul>
        <li>Inference engines executing model forward passes</li>
        <li>Application gateways providing API endpoints</li>
        <li>Compound systems with retrieval or tool execution</li>
      </ul>
      
      <t>The following are out of scope:</t>
      
      <ul>
        <li>Model training or fine-tuning performance</li>
        <li>Model quality or accuracy evaluation</li>
        <li>Non-autoregressive models (diffusion, encoder-only)</li>
        <li>Edge deployment or on-device inference</li>
        <li>Specific vendor implementations or products</li>
      </ul>
    </section>

    <section anchor="test-setup">
      <name>Test Setup</name>
      
      <section anchor="sut-configurations">
        <name>System Under Test Configurations</name>
        
        <t>The System Under Test (SUT) boundary MUST be declared before
        benchmarking. This document defines three standard configurations.</t>
        
        <section anchor="model-engine-config">
          <name>Model Engine Configuration</name>
          
          <t>The Model Engine configuration measures raw inference capability.</t>
          
          <figure anchor="fig-model-engine">
            <name>Model Engine Configuration</name>
            <artwork type="ascii-art"><![CDATA[
                +------------------+
                |   Load Generator |
                +--------+---------+
                         |
              Internal API (gRPC/HTTP)
                         |
                +--------v---------+
                |   Model Engine   |
                |  (SUT Boundary)  |
                +------------------+
]]></artwork>
          </figure>
          
          <t>Included components:</t>
          <ul>
            <li>Model weights and inference runtime</li>
            <li>Batching and scheduling logic</li>
            <li>KV cache management</li>
            <li>Tensor operations and kernels</li>
          </ul>
          
          <t>Excluded components:</t>
          <ul>
            <li>External network transport</li>
            <li>Authentication and authorization</li>
            <li>Rate limiting</li>
            <li>Input/output safety filtering</li>
            <li>Load balancing</li>
          </ul>
          
          <t>This configuration is appropriate for comparing inference engines
          (vLLM, TensorRT-LLM, SGLang) independent of deployment stack.</t>
        </section>
        
        <section anchor="gateway-config">
          <name>Application Gateway Configuration</name>
          
          <t>The Application Gateway configuration measures user-observable API
          performance.</t>
          
          <figure anchor="fig-gateway">
            <name>Application Gateway Configuration</name>
            <artwork type="ascii-art"><![CDATA[
                +------------------+
                |   Load Generator |
                +--------+---------+
                         |
              External API (HTTPS)
                         |
                +--------v---------+
                | Application GW   |
                |  (SUT Boundary)  |
                |  +------------+  |
                |  |   Engine   |  |
                |  +------------+  |
                +------------------+
]]></artwork>
          </figure>
          
          <t>Included components (in addition to Model Engine):</t>
          <ul>
            <li>TLS termination</li>
            <li>Authentication and session management</li>
            <li>Rate limiting and quota enforcement</li>
            <li>Input validation and output filtering</li>
            <li>Safety guardrails</li>
          </ul>
          
          <t>This configuration is appropriate for comparing API providers or
          evaluating production deployment performance.</t>
        </section>
        
        <section anchor="compound-config">
          <name>Compound System Configuration</name>
          
          <t>The Compound System configuration measures end-to-end task
          completion for agentic or retrieval-augmented workloads.</t>
          
          <figure anchor="fig-compound">
            <name>Compound System Configuration</name>
            <artwork type="ascii-art"><![CDATA[
                +------------------+
                |   Task Driver    |
                +--------+---------+
                         |
                +--------v---------+
                |  Compound System |
                |  (SUT Boundary)  |
                |  +------------+  |
                |  | Retrieval  |  |
                |  +------------+  |
                |  +------------+  |
                |  |   Tools    |  |
                |  +------------+  |
                |  +------------+  |
                |  |  Gateway   |  |
                |  +------------+  |
                +------------------+
]]></artwork>
          </figure>
          
          <t>Included components (in addition to Application Gateway):</t>
          <ul>
            <li>Retrieval pipeline (embedding, vector search, reranking)</li>
            <li>Tool execution environment</li>
            <li>Orchestration logic</li>
            <li>Multi-turn conversation state</li>
          </ul>
          
          <t>This configuration is appropriate for evaluating RAG systems or
          agentic applications.</t>
        </section>
      </section>
      
      <section anchor="load-generator">
        <name>Load Generator Requirements</name>
        
        <t>The load generator produces requests and measures responses. It
        MUST satisfy the following requirements.</t>
        
        <section anchor="timing-resolution">
          <name>Timing Resolution</name>
          <t>The load generator MUST measure time with a resolution of
          1 millisecond or better. Microsecond resolution is RECOMMENDED
          for Inter-Token Latency (ITL) measurement.</t>
        </section>
        
        <section anchor="streaming-support">
          <name>Streaming Support</name>
          <t>The load generator MUST support streaming response protocols (SSE,
          WebSocket, or gRPC streaming). It MUST record the arrival time of
          each token or chunk, not only the complete response.</t>
        </section>
        
        <section anchor="open-loop">
          <name>Open-Loop Load Generation</name>
          <t>The load generator MUST support open-loop load generation where
          request arrival times are determined by a specified distribution
          independent of response times. Poisson arrivals MUST be supported.
          Uniform and bursty arrival patterns are RECOMMENDED.</t>
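          <t>As a non-normative sketch, Poisson arrivals can be generated by
          drawing exponential inter-arrival gaps from a seeded RNG; the
          machinery that actually submits requests at these offsets is
          assumed to exist elsewhere:</t>
          <sourcecode type="python"><![CDATA[
import random

def poisson_arrival_offsets(rate_rps, n_requests, seed=0):
    """Cumulative submission offsets (seconds) for an open-loop
    Poisson arrival process at rate_rps requests per second."""
    rng = random.Random(seed)  # deterministic seed for reproducibility
    offsets, t = [], 0.0
    for _ in range(n_requests):
        t += rng.expovariate(rate_rps)  # exponential inter-arrival gap
        offsets.append(t)
    return offsets

# 10 req/s for 1000 requests; mean gap is approximately 0.1 s.
offsets = poisson_arrival_offsets(10.0, 1000, seed=42)
]]></sourcecode>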
        </section>
        
        <section anchor="closed-loop">
          <name>Closed-Loop Load Generation</name>
          <t>The load generator MUST support closed-loop load generation where
          a fixed number of concurrent requests are maintained. When a
          request completes, a new request is immediately submitted.</t>
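          <t>A minimal closed-loop driver can be sketched with asyncio
          workers; <tt>send_request</tt> is a placeholder for the actual
          client call, not part of this specification:</t>
          <sourcecode type="python"><![CDATA[
import asyncio

async def closed_loop(send_request, concurrency, total):
    """Keep exactly `concurrency` requests in flight; each worker
    submits a new request as soon as its previous one completes."""
    pending = iter(range(total))  # shared iterator doles out request IDs
    results = []

    async def worker():
        for req_id in pending:
            results.append(await send_request(req_id))

    await asyncio.gather(*(worker() for _ in range(concurrency)))
    return results

async def fake_request(req_id):  # stand-in for a real client call
    await asyncio.sleep(0)
    return req_id

out = asyncio.run(closed_loop(fake_request, concurrency=4, total=20))
]]></sourcecode>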
        </section>
        
        <section anchor="request-isolation">
          <name>Request Isolation</name>
          <t>The load generator MUST NOT allow slow responses to delay the
          submission of subsequent requests in open-loop mode. Asynchronous
          or multi-threaded implementation is REQUIRED.</t>
        </section>
        
        <section anchor="output-recording">
          <name>Output Recording</name>
          <t>The load generator MUST record for each request:</t>
          <ul>
            <li>Request submission timestamp</li>
            <li>First token arrival timestamp</li>
            <li>Each subsequent token arrival timestamp</li>
            <li>Final token arrival timestamp</li>
            <li>Total input token count</li>
            <li>Total output token count</li>
            <li>Request success/failure status</li>
          </ul>
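          <t>The record above maps to a small structure from which TTFT and
          ITL can be derived; the field names in this sketch are
          illustrative, not normative:</t>
          <sourcecode type="python"><![CDATA[
from dataclasses import dataclass, field

@dataclass
class RequestRecord:
    """Per-request data required by this methodology (times in seconds)."""
    submit_ts: float
    token_ts: list = field(default_factory=list)  # each token's arrival
    input_tokens: int = 0
    output_tokens: int = 0
    success: bool = True

    @property
    def ttft(self):
        """Time to first token: first arrival minus submission."""
        return self.token_ts[0] - self.submit_ts

    @property
    def itls(self):
        """Inter-token latencies between consecutive arrivals."""
        return [b - a for a, b in zip(self.token_ts, self.token_ts[1:])]

rec = RequestRecord(submit_ts=0.0, token_ts=[0.25, 0.30, 0.36],
                    input_tokens=128, output_tokens=3)
]]></sourcecode>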
        </section>
      </section>
      
      <section anchor="reference-workloads">
        <name>Reference Workloads</name>
        
        <t>Workload specification is critical for reproducible benchmarking.
        This document defines reference workloads with fixed characteristics.
        Testers MAY use custom workloads but MUST fully specify them.</t>
        
        <section anchor="workload-params">
          <name>Workload Parameters</name>
          
          <t>Each workload MUST specify:</t>
          
          <dl>
            <dt>Input length distribution:</dt>
            <dd><t>Distribution type (fixed, uniform, normal, empirical),
            parameters (mean, std, min, max, or histogram), and
            unit (tokens using specified tokenizer).</t></dd>
            
            <dt>Output length distribution:</dt>
            <dd><t>Distribution type (fixed, uniform, normal, empirical),
            parameters (mean, std, min, max, or histogram),
            control method (max_tokens parameter, stop sequence, or both), and
            unit (tokens using specified tokenizer).</t></dd>
            
            <dt>Content characteristics:</dt>
            <dd><t>Domain (general, code, conversation, instruction),
            language (English, multilingual, code languages), and
            system prompt presence and typical length.</t></dd>
            
            <dt>Prefix sharing:</dt>
            <dd><t>Fraction of requests sharing common prefix and
            shared prefix length distribution.</t></dd>
          </dl>
        </section>
        
        <section anchor="standard-workloads">
          <name>Standard Workloads</name>
          
          <t>This document defines five standard workloads. Full specifications
          appear in <xref target="workload-specs"/>.</t>
          
          <section anchor="synthetic-uniform">
            <name>Synthetic-Uniform</name>
            <t>Purpose: Baseline comparison with controlled variability</t>
            <ul>
              <li>Input length: Uniform(128, 512) tokens</li>
              <li>Output length: Uniform(64, 256) tokens</li>
              <li>Content: Random token sequences (no semantic meaning)</li>
              <li>Prefix sharing: None</li>
            </ul>
            <t>This workload isolates inference performance from content effects.
            It is REQUIRED for Model Engine benchmarking.</t>
          </section>
          
          <section anchor="synthetic-skewed">
            <name>Synthetic-Skewed</name>
            <t>Purpose: Test behavior under realistic length variation</t>
            <ul>
              <li>Input length: Log-normal(mu=5.5, sigma=1.0) tokens, capped at 4096</li>
              <li>Output length: Log-normal(mu=4.5, sigma=1.2) tokens, capped at 2048</li>
              <li>Content: Random token sequences</li>
              <li>Prefix sharing: None</li>
            </ul>
            <t>This workload tests scheduling fairness with high length variance.</t>
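            <t>A seeded sketch of the capped log-normal sampling above; the
            rounding and floor-at-one choices are this sketch's, not
            requirements of this document:</t>
            <sourcecode type="python"><![CDATA[
import random

def sample_lengths(mu, sigma, cap, n, seed):
    """Token lengths from Log-normal(mu, sigma), rounded, floored at 1,
    and capped, using a deterministic seed for reproducibility."""
    rng = random.Random(seed)
    return [min(cap, max(1, round(rng.lognormvariate(mu, sigma))))
            for _ in range(n)]

# Synthetic-Skewed inputs: median near exp(5.5) ~ 245 tokens, cap 4096.
inputs = sample_lengths(5.5, 1.0, 4096, 1000, seed=7)
]]></sourcecode>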
          </section>
          
          <section anchor="conversation">
            <name>Conversation</name>
            <t>Purpose: Simulate interactive chat workloads</t>
            <ul>
              <li>Input length: Empirical distribution from ShareGPT dataset</li>
              <li>Output length: Empirical distribution from ShareGPT dataset</li>
              <li>Content: Natural language conversation</li>
              <li>Prefix sharing: 50% share 200-token system prompt</li>
            </ul>
            <t>This workload is RECOMMENDED for Application Gateway benchmarking.</t>
          </section>
          
          <section anchor="code-completion">
            <name>Code Completion</name>
            <t>Purpose: Simulate coding assistant workloads</t>
            <ul>
              <li>Input length: Empirical from code completion datasets</li>
              <li>Output length: Log-normal(mu=4.0, sigma=1.5) tokens</li>
              <li>Content: Source code in Python, JavaScript, TypeScript</li>
              <li>Prefix sharing: 80% share repository context prefix</li>
            </ul>
            <t>This workload tests prefix caching effectiveness.</t>
          </section>
          
          <section anchor="long-context">
            <name>Long Context</name>
            <t>Purpose: Test long-context behavior</t>
            <ul>
              <li>Input length: Uniform(8192, 32768) tokens</li>
              <li>Output length: Fixed at 256 tokens</li>
              <li>Content: Document + question format</li>
              <li>Prefix sharing: None</li>
            </ul>
            <t>This workload is REQUIRED for Long Context Scaling tests.</t>
          </section>
        </section>
        
        <section anchor="workload-reproducibility">
          <name>Workload Reproducibility</name>
          <t>For reproducible benchmarking:</t>
          <ul>
            <li>Testers MUST use deterministic random seeds for workload
            generation. The seed MUST be reported.</li>
            <li>Testers SHOULD publish the exact request sequences used, or
            provide generation code with fixed seeds.</li>
            <li>When using dataset-derived workloads (ShareGPT, HumanEval),
            testers MUST specify the dataset version, subset selection
            method, and any preprocessing applied.</li>
          </ul>
        </section>
      </section>
      
      <section anchor="tokenization">
        <name>Tokenization</name>
        
        <t>Token counts depend on the tokenizer. Different tokenizers produce
        different counts for identical text, making cross-system comparison
        challenging.</t>
        
        <section anchor="tokenizer-spec">
          <name>Tokenizer Specification</name>
          <t>The test report MUST specify:</t>
          <ul>
            <li>Tokenizer name and version (e.g., "cl100k_base", "Llama-3 tokenizer")</li>
            <li>Vocabulary size</li>
            <li>Source (Hugging Face model ID, tiktoken name, or custom)</li>
          </ul>
        </section>
        
        <section anchor="token-counting">
          <name>Token Counting Method</name>
          <t>For cross-system comparison where systems use different tokenizers:</t>
          
          <dl>
            <dt>Option A - Native tokenizer:</dt>
            <dd>Count tokens using each system's native tokenizer. Report results
            separately with tokenizer identified. This method reflects actual
            system behavior but complicates comparison.</dd>
            
            <dt>Option B - Reference tokenizer:</dt>
            <dd>Count tokens using a declared reference tokenizer for all systems.
            This enables direct comparison but may not reflect actual system
            token counts.</dd>
          </dl>
          
          <t>The test report MUST declare which option is used. Option B with
          cl100k_base (GPT-4 tokenizer) as reference is RECOMMENDED for
          cross-system comparison.</t>
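          <t>A non-normative Option B sketch follows. In practice the
          reference would be cl100k_base via the third-party tiktoken
          package; a whitespace splitter stands in here so the example is
          self-contained:</t>
          <sourcecode type="python"><![CDATA[
def whitespace_tokenizer(text):
    """Stand-in for a real reference tokenizer such as cl100k_base."""
    return text.split()

def count_tokens(text, tokenizer):
    return len(tokenizer(text))

# Option B: traffic for EVERY system under test is counted with one
# declared reference tokenizer, even if each system tokenizes
# differently internally.
reference = whitespace_tokenizer
n = count_tokens("The quick brown fox jumps", reference)
]]></sourcecode>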
        </section>
        
        <section anchor="special-tokens">
          <name>Special Token Handling</name>
          <t>The test report MUST specify handling of:</t>
          <ul>
            <li>BOS/EOS tokens (included or excluded from counts)</li>
            <li>System prompt tokens (counted separately or included)</li>
            <li>Tool/function call formatting tokens</li>
          </ul>
        </section>
      </section>
      
      <section anchor="warmup">
        <name>Warm-up Procedures</name>
        
        <t>LLM serving systems require warm-up before reaching steady-state
        performance. Warm-up effects include JIT compilation, memory
        allocator initialization, prefix cache population, and batch
        size ramp-up.</t>
        
        <section anchor="warmup-requirements">
          <name>Warm-up Requirements</name>
          <t>Before measurement begins, testers MUST:</t>
          <ol>
            <li>Load the model fully into accelerator memory</li>
            <li>Process at least 100 requests or 10,000 output tokens,
            whichever is greater</li>
            <li>Wait for request queue to drain completely</li>
            <li>If prefix caching is enabled and being tested, populate the
            cache with representative prefixes</li>
          </ol>
        </section>
        
        <section anchor="warmup-verification">
          <name>Warm-up Verification</name>
          <t>Testers SHOULD verify warm-up completion by:</t>
          <ol>
            <li>Measuring latency for a probe request before and after warm-up</li>
            <li>Confirming latency stabilization (less than 10% variation
            across consecutive probe requests)</li>
          </ol>
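          <t>One way to implement the stabilization check above; the
          three-probe window is this sketch's choice, while the 10% bound
          comes from the requirement:</t>
          <sourcecode type="python"><![CDATA[
def warmed_up(probe_latencies, window=3, tolerance=0.10):
    """True when each of the last `window` probe latencies deviates
    from their mean by less than `tolerance` (10%)."""
    if len(probe_latencies) < window:
        return False
    recent = probe_latencies[-window:]
    mean = sum(recent) / window
    return all(abs(x - mean) / mean < tolerance for x in recent)

# Probe latency falls during JIT compilation, then stabilizes.
history = [2.10, 1.40, 0.52, 0.50, 0.51]
]]></sourcecode>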
        </section>
        
        <section anchor="cold-start">
          <name>Cold Start Measurement</name>
          <t>When cold start performance is being measured (Model Load Time,
          Cold Start Latency), warm-up MUST be skipped. The test report
          MUST clearly indicate cold start measurement.</t>
        </section>
      </section>
      
      <section anchor="streaming-protocol">
        <name>Streaming Protocol</name>
        
        <t>LLM serving systems deliver tokens via streaming protocols. The
        choice of protocol affects timing measurement.</t>
        
        <section anchor="supported-protocols">
          <name>Supported Protocols</name>
          <t>This methodology supports:</t>
          <dl>
            <dt>Server-Sent Events (SSE):</dt>
            <dd>HTTP-based streaming. Each event contains one or more tokens.
            RECOMMENDED for Application Gateway testing.</dd>
            
            <dt>WebSocket:</dt>
            <dd>Bidirectional streaming. Each message contains one or more tokens.</dd>
            
            <dt>gRPC streaming:</dt>
            <dd>Binary streaming protocol. Each message contains one or more tokens.
            RECOMMENDED for Model Engine testing.</dd>
          </dl>
        </section>
        
        <section anchor="token-chunking">
          <name>Token Chunking</name>
          <t>Streaming protocols may deliver multiple tokens per chunk due to
          batching or network buffering. The test report MUST specify:</t>
          <ul>
            <li>Protocol used</li>
            <li>Whether each chunk contains exactly one token or potentially
            multiple tokens</li>
            <li>How multi-token chunks are handled for ITL calculation</li>
          </ul>
        </section>
        
        <section anchor="itl-chunked">
          <name>ITL Calculation with Chunked Delivery</name>
          <t>When chunks contain multiple tokens:</t>
          <dl>
            <dt>Option A - Chunk timing:</dt>
            <dd>Measure inter-chunk latency. Report as "Time Between Chunks"
            rather than ITL. Note chunk size distribution.</dd>
            
            <dt>Option B - Distributed timing:</dt>
            <dd>Distribute the inter-chunk interval evenly across tokens. If
            a chunk with N tokens arrives at time T and the previous chunk
            arrived at time T_prev, assign each of the N tokens an ITL of
            (T - T_prev)/N. This smoothing understates ITL variance.</dd>
            
            <dt>Option C - Server-side timing:</dt>
            <dd>Use server-reported per-token timestamps if available. This
            measures ITL independent of network effects.</dd>
          </dl>
          <t>The test report MUST declare which option is used. Option C is
          RECOMMENDED when available.</t>
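          <t>A non-normative sketch of chunk timing (Option A) alongside an
          even-spread per-token smoothing; input is a list of
          (arrival_time_s, tokens_in_chunk) pairs:</t>
          <sourcecode type="python"><![CDATA[
def interchunk_gaps(chunks):
    """Option A: time between consecutive chunk arrivals."""
    times = [t for t, _ in chunks]
    return [b - a for a, b in zip(times, times[1:])]

def spread_itls(chunks):
    """Spread each inter-chunk gap evenly over the later chunk's
    tokens; this smoothing understates true ITL variance."""
    itls = []
    for (t0, _), (t1, n1) in zip(chunks, chunks[1:]):
        itls.extend([(t1 - t0) / n1] * n1)
    return itls

chunks = [(0.00, 1), (0.12, 3), (0.20, 2)]
]]></sourcecode>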
        </section>
      </section>
      
      <section anchor="clock-sync">
        <name>Clock Synchronization</name>
        
        <t>Accurate timing requires synchronized clocks between load generator
        and SUT, and between distributed SUT components.</t>
        
        <section anchor="single-machine">
          <name>Single-Machine Testing</name>
          <t>When load generator and SUT run on the same machine, clock
          synchronization is inherent. This configuration is RECOMMENDED
          for Model Engine testing.</t>
        </section>
        
        <section anchor="distributed-testing">
          <name>Distributed Testing</name>
          <t>When load generator and SUT are on different machines:</t>
          <ul>
            <li>NTP synchronization MUST achieve accuracy of 10ms or better</li>
            <li>PTP synchronization SHOULD be used when sub-millisecond
            accuracy is required</li>
            <li>The test report MUST state the synchronization method and
            estimated accuracy</li>
          </ul>
        </section>
        
        <section anchor="network-latency">
          <name>Network Latency Measurement</name>
          <t>For Application Gateway testing where network latency is significant:</t>
          <ul>
            <li>Testers SHOULD measure and report network RTT separately</li>
            <li>Testers MAY subtract estimated network latency from TTFT to
            isolate server-side processing time</li>
            <li>Any latency adjustment MUST be documented in the test report</li>
          </ul>
        </section>
        
        <section anchor="timestamp-format">
          <name>Timestamp Format</name>
          <t>All timestamps MUST be recorded in a format with at least
          millisecond precision. ISO 8601 with milliseconds
          (YYYY-MM-DDTHH:MM:SS.sssZ) or Unix epoch with milliseconds is
          RECOMMENDED.</t>
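          <t>A formatting sketch using the Python standard library; the
          trailing-Z substitution converts the "+00:00" offset form to the
          Z-suffixed form shown above:</t>
          <sourcecode type="python"><![CDATA[
from datetime import datetime, timezone

def iso_ms(ts):
    """Render an aware datetime as ISO 8601 UTC with milliseconds."""
    return (ts.astimezone(timezone.utc)
              .isoformat(timespec="milliseconds")
              .replace("+00:00", "Z"))

stamp = iso_ms(datetime(2026, 1, 15, 12, 30, 45, 123000,
                        tzinfo=timezone.utc))
]]></sourcecode>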
        </section>
      </section>
      
      <section anchor="guardrail-config">
        <name>Safety and Guardrail Configuration</name>
        
        <t>Production LLM deployments include safety systems that affect
        performance. Benchmarking MUST account for these systems.</t>
        
        <section anchor="guardrail-disclosure">
          <name>Guardrail Disclosure</name>
          <t>The test report MUST disclose:</t>
          <ul>
            <li>Whether input content filtering is enabled</li>
            <li>Whether output content filtering is enabled</li>
            <li>Names of safety systems if known (e.g., "Llama Guard")</li>
            <li>Whether any requests were refused during testing</li>
          </ul>
        </section>
        
        <section anchor="production-representative">
          <name>Production-Representative Testing</name>
          <t>For Application Gateway benchmarking intended to represent
          production performance:</t>
          <ul>
            <li>Safety systems SHOULD be enabled in their default configuration</li>
            <li>The test report MUST note if safety systems are disabled</li>
            <li>Testers SHOULD run comparative tests with safety enabled and
            disabled to quantify overhead</li>
          </ul>
        </section>
      </section>
    </section>

    <section anchor="benchmarking-tests">
      <name>Benchmarking Tests</name>
      
      <t>This section defines benchmarking tests. Each test includes:
      objective, setup parameters, procedure, measurements, and
      reporting format.</t>
      
      <section anchor="test-ttft">
        <name>Time to First Token</name>
        
        <section anchor="ttft-objective">
          <name>Objective</name>
          <t>To determine the latency from request submission to first token
          receipt under varying load conditions. TTFT measures perceived
          responsiveness for interactive applications.</t>
        </section>
        
        <section anchor="ttft-setup">
          <name>Setup Parameters</name>
          <t>The following parameters MUST be defined:</t>
          
          <section anchor="ttft-workload-params">
            <name>Workload Parameters</name>
            <dl>
              <dt>Workload:</dt>
              <dd>One of the standard workloads (<xref target="standard-workloads"/>)
              or a fully specified custom workload.</dd>
              
              <dt>Request count:</dt>
              <dd>Total number of requests to execute. MUST be at least 1000
              for P99 measurement and at least 10000 for P99.9.</dd>
            </dl>
          </section>
          
          <section anchor="ttft-load-params">
            <name>Load Parameters</name>
            <dl>
              <dt>Load model:</dt>
              <dd>Open-loop or closed-loop.</dd>
            </dl>
            
            <t>For open-loop:</t>
            <dl>
              <dt>Arrival rate:</dt>
              <dd>Requests per second.</dd>
              <dt>Arrival distribution:</dt>
              <dd>Poisson (REQUIRED), uniform, or bursty.</dd>
            </dl>
            
            <t>For closed-loop:</t>
            <dl>
              <dt>Concurrency:</dt>
              <dd>Number of concurrent requests maintained.</dd>
            </dl>
          </section>
          
          <section anchor="ttft-system-params">
            <name>System Parameters</name>
            <dl>
              <dt>SUT configuration:</dt>
              <dd>Model Engine, Application Gateway, or Compound System.</dd>
              
              <dt>Model identifier:</dt>
              <dd>Model name, version, and quantization if applicable.</dd>
              
              <dt>Hardware:</dt>
              <dd>Accelerator type, count, and memory.</dd>
              
              <dt>Prefix caching:</dt>
              <dd>Enabled or disabled.</dd>
            </dl>
          </section>
        </section>
        
        <section anchor="ttft-procedure">
          <name>Procedure</name>
          <ol>
            <li>Configure the SUT with specified parameters.</li>
            <li>Complete warm-up procedure (<xref target="warmup"/>).</li>
            <li>Begin load generation at the specified arrival rate or concurrency.</li>
            <li><t>For each request:</t>
              <ol type="a">
                <li>Record submission timestamp (T_submit)</li>
                <li>Record first token arrival timestamp (T_first)</li>
                <li>Calculate TTFT = T_first - T_submit</li>
                <li>Record input token count</li>
              </ol>
            </li>
            <li>Continue until request count is reached.</li>
            <li>Compute distribution statistics.</li>
          </ol>
          
          <section anchor="first-token-def">
            <name>First Token Definition</name>
            <t>The first token is defined as the first content token received,
            excluding:</t>
            <ul>
              <li>Empty tokens or whitespace-only tokens</li>
              <li>Protocol overhead (SSE event markers, JSON framing)</li>
              <li>Metadata tokens (token IDs, logprobs if requested separately)</li>
            </ul>
            <t>If the system emits non-content tokens before content, the test
            report MUST note this and specify whether TTFT measures time to
            any token or time to first content token.</t>
          </section>
        </section>
        
        <section anchor="ttft-measurements">
          <name>Measurements</name>
          
          <section anchor="ttft-primary">
            <name>Primary Measurements</name>
            <dl>
              <dt>TTFT Percentiles:</dt>
              <dd>P50, P90, P95, P99, and P99.9 of TTFT distribution.
              All percentiles MUST be reported.</dd>
              
              <dt>TTFT Mean:</dt>
              <dd>Arithmetic mean of TTFT values.</dd>
              
              <dt>TTFT Minimum:</dt>
              <dd>Smallest TTFT observed.</dd>
              
              <dt>TTFT Maximum:</dt>
              <dd>Largest TTFT observed.</dd>
            </dl>
          </section>
          
          <section anchor="ttft-conditional">
            <name>Conditional Measurements</name>
            <dl>
              <dt>TTFT by input length:</dt>
              <dd>When workload has variable input length, report TTFT percentiles
              bucketed by input length ranges. RECOMMENDED buckets: [0-256),
              [256-512), [512-1024), [1024-2048), [2048-4096), [4096+) tokens.</dd>
              
              <dt>Queue wait time:</dt>
              <dd>If measurable (i.e., via server-side instrumentation), report
              the queue wait component of TTFT separately.</dd>
              
              <dt>Prefill latency:</dt>
              <dd>If measurable, report the prefill computation component of
              TTFT separately.</dd>
            </dl>
          </section>
          
          <section anchor="ttft-statistical">
            <name>Statistical Requirements</name>
            <t>For P99 accuracy within 10% relative error at 95% confidence,
            at least 1000 samples are required; for P99.9, at least 10000
            samples are required. The test report MUST state the sample
            count.</t>
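            <t>One way to check that a collected sample set actually supports
            the reported tail percentile is a bootstrap confidence interval.
            The sketch below is illustrative; this document does not mandate
            a particular estimation method.</t>

```python
import random

def bootstrap_p99_ci(samples, iterations=200, seed=0):
    """95% bootstrap confidence interval for the P99 of `samples`.

    Returns (low, high, relative_half_width); the last value can be
    compared against a target such as 0.10 (10% relative error).
    """
    rng = random.Random(seed)                  # fixed seed for repeatability
    n = len(samples)
    estimates = []
    for _ in range(iterations):
        resample = sorted(rng.choices(samples, k=n))   # sample with replacement
        estimates.append(resample[min(n - 1, int(0.99 * n))])
    estimates.sort()
    low = estimates[int(0.025 * iterations)]
    high = estimates[int(0.975 * iterations)]
    point = sorted(samples)[min(n - 1, int(0.99 * n))]
    return low, high, (high - low) / (2.0 * point)
```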
          </section>
        </section>
        
        <section anchor="ttft-reporting">
          <name>Reporting Format</name>
          <t>The test report MUST include:</t>
          
          <section anchor="ttft-config-summary">
            <name>Configuration Summary</name>
            <ul>
              <li>SUT configuration and boundary</li>
              <li>Model identifier and hardware</li>
              <li>Workload name or full specification</li>
              <li>Load model and parameters</li>
              <li>Request count and test duration</li>
              <li>Warm-up procedure followed</li>
              <li>Prefix caching state</li>
              <li>Guardrail configuration</li>
            </ul>
          </section>
          
          <section anchor="ttft-results-table">
            <name>Results Table</name>
            <t>The results SHOULD be reported in tabular format:</t>
            <table anchor="ttft-example-table">
              <name>TTFT Results Example</name>
              <thead>
                <tr><th>Metric</th><th>Value</th></tr>
              </thead>
              <tbody>
                <tr><td>Requests</td><td>10000</td></tr>
                <tr><td>TTFT P50</td><td>127 ms</td></tr>
                <tr><td>TTFT P90</td><td>245 ms</td></tr>
                <tr><td>TTFT P95</td><td>312 ms</td></tr>
                <tr><td>TTFT P99</td><td>524 ms</td></tr>
                <tr><td>TTFT P99.9</td><td>891 ms</td></tr>
                <tr><td>TTFT Mean</td><td>156 ms</td></tr>
                <tr><td>TTFT Min</td><td>89 ms</td></tr>
                <tr><td>TTFT Max</td><td>1243 ms</td></tr>
              </tbody>
            </table>
          </section>
          
          <section anchor="ttft-by-length">
            <name>TTFT by Input Length</name>
            <t>If applicable:</t>
            <table anchor="ttft-length-table">
              <name>TTFT by Input Length Example</name>
              <thead>
                <tr><th>Input Tokens</th><th>P50 (ms)</th><th>P95 (ms)</th><th>P99 (ms)</th></tr>
              </thead>
              <tbody>
                <tr><td>0-256</td><td>95</td><td>198</td><td>312</td></tr>
                <tr><td>256-512</td><td>142</td><td>287</td><td>445</td></tr>
                <tr><td>512-1024</td><td>198</td><td>412</td><td>623</td></tr>
                <tr><td>1024-2048</td><td>312</td><td>587</td><td>891</td></tr>
                <tr><td>2048+</td><td>523</td><td>912</td><td>1243</td></tr>
              </tbody>
            </table>
          </section>
          
          <section anchor="ttft-visualization">
            <name>Distribution Visualization</name>
            <t>Testers SHOULD include a histogram or CDF plot of the TTFT
            distribution.</t>
          </section>
        </section>
      </section>
      
      <section anchor="test-throughput">
        <name>Output Token Throughput</name>
        
        <section anchor="throughput-objective">
          <name>Objective</name>
          <t>To determine the maximum rate at which the SUT can generate output
          tokens while maintaining acceptable latency. This test measures
          system capacity under load.</t>
        </section>
        
        <section anchor="throughput-setup">
          <name>Setup Parameters</name>
          <t>The following parameters MUST be defined:</t>
          
          <section anchor="throughput-workload">
            <name>Workload Parameters</name>
            <dl>
              <dt>Workload:</dt>
              <dd>One of the standard workloads or fully specified custom workload.</dd>
              
              <dt>Test duration:</dt>
              <dd>Minimum 60 seconds; 300 seconds is RECOMMENDED for stable measurement.</dd>
            </dl>
          </section>
          
          <section anchor="throughput-load">
            <name>Load Parameters</name>
            <dl>
              <dt>Load model:</dt>
              <dd>Open-loop or closed-loop.</dd>
            </dl>
            
            <t>For open-loop:</t>
            <dl>
              <dt>Arrival rate range:</dt>
              <dd>Minimum and maximum request rates to test.</dd>
              <dt>Rate increment:</dt>
              <dd>Step size for iterative search.</dd>
            </dl>
            
            <t>For closed-loop:</t>
            <dl>
              <dt>Concurrency range:</dt>
              <dd>Minimum and maximum concurrent requests.</dd>
              <dt>Concurrency increment:</dt>
              <dd>Step size for iterative search.</dd>
            </dl>
          </section>
          
          <section anchor="throughput-latency-constraint">
            <name>Latency Constraint (Optional)</name>
            <dl>
              <dt>TTFT SLO:</dt>
              <dd>Maximum acceptable P99 TTFT.</dd>
              <dt>TPOT SLO:</dt>
              <dd>Maximum acceptable P99 TPOT.</dd>
            </dl>
            <t>When specified, throughput is measured as the maximum rate
            achieving these SLOs.</t>
          </section>
        </section>
        
        <section anchor="throughput-procedure">
          <name>Procedure</name>
          <t>This test employs an iterative search to find maximum throughput.</t>
          <ol>
            <li>Configure the SUT with specified parameters.</li>
            <li>Complete warm-up procedure.</li>
            <li><t>For each load level (arrival rate or concurrency):</t>
              <ol type="a">
                <li>Run load for the specified test duration.</li>
                <li>Record all request timings.</li>
                <li>Compute throughput as total output tokens divided by test duration.</li>
                <li>Compute TTFT and TPOT percentiles.</li>
                <li>If latency constraint specified, check SLO compliance.</li>
              </ol>
            </li>
            <li><t>Use binary search to find maximum throughput:</t>
              <ol type="a">
                <li>If no latency constraint: find load level where queue grows
                unboundedly (system saturation).</li>
                <li>If latency constraint: find highest load level meeting SLO.</li>
              </ol>
            </li>
            <li>Report throughput at the maximum sustainable load level.</li>
          </ol>
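          <t>The search in step 4 can be sketched as below. The run_at_load
          and meets_slo arguments are hypothetical callables standing in for
          a full measurement run at one load level and for the SLO (or
          saturation) check.</t>

```python
def find_max_load(run_at_load, lo, hi, meets_slo, tolerance=1.0):
    """Binary search for the highest sustainable load level.

    run_at_load(load) runs the workload at `load` and returns measured
    results; meets_slo(result) returns True when the SLO (or, absent an
    SLO, the non-saturation condition) is satisfied.  Both are
    hypothetical placeholders for the procedures in this section.
    """
    best = None
    while hi - lo > tolerance:
        mid = (lo + hi) / 2.0
        result = run_at_load(mid)
        if meets_slo(result):
            best, lo = result, mid      # sustainable: search higher
        else:
            hi = mid                    # not sustainable: search lower
    return lo, best
```

          <t>The tolerance parameter bounds the search granularity; each
          probe still requires a full test-duration run, so coarse
          tolerances keep total test time manageable.</t>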
          
          <section anchor="saturation-detection">
            <name>Saturation Detection</name>
            <t>System saturation is detected when:</t>
            <ul>
              <li>Queue depth grows continuously during test duration, OR</li>
              <li>Request completion rate is less than 90% of arrival rate, OR</li>
              <li>P99 latency exceeds 10x the P50 latency at lower load</li>
            </ul>
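            <t>The three criteria can be combined into a single predicate.
            The sketch below uses illustrative argument names and treats
            "grows continuously" as the second half of the queue-depth
            series averaging above the first half, which is one possible
            operationalization, not the only valid one.</t>

```python
def is_saturated(queue_depths, completed, arrived, p99, baseline_p50):
    """Apply the three saturation criteria from this section.

    queue_depths is a time-ordered series of queue-depth samples;
    baseline_p50 is the P50 latency measured at a lower load level.
    """
    half = len(queue_depths) // 2
    # Criterion 1: queue grows over the test (second half mean > first half).
    queue_growing = (sum(queue_depths[half:]) / max(1, len(queue_depths) - half)
                     > sum(queue_depths[:half]) / max(1, half))
    # Criterion 2: completion rate below 90% of arrival rate.
    incomplete = completed < 0.9 * arrived
    # Criterion 3: P99 exceeds 10x the lower-load P50.
    tail_blowup = p99 > 10 * baseline_p50
    return queue_growing or incomplete or tail_blowup
```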
          </section>
          
          <section anchor="steady-state">
            <name>Steady State Verification</name>
            <t>At each load level, verify steady state by:</t>
            <ul>
              <li>Confirming queue depth is stable (not growing)</li>
              <li>Confirming throughput is stable across test duration</li>
              <li>Excluding initial ramp-up period (first 10% of duration)</li>
            </ul>
          </section>
        </section>
        
        <section anchor="throughput-measurements">
          <name>Measurements</name>
          
          <section anchor="throughput-primary">
            <name>Primary Measurements</name>
            <dl>
              <dt>Maximum output token throughput:</dt>
              <dd>Output tokens per second at maximum sustainable load.
              Report with or without latency constraint as specified.</dd>
              
              <dt>Request throughput:</dt>
              <dd>Requests completed per second at maximum load.</dd>
              
              <dt>Input token throughput:</dt>
              <dd>Input tokens processed per second (measures prefill capacity).</dd>
            </dl>
          </section>
          
          <section anchor="throughput-efficiency">
            <name>Efficiency Measurements</name>
            <dl>
              <dt>Tokens per GPU-second:</dt>
              <dd>Output tokens per second divided by GPU count. Enables comparison
              across different hardware configurations.</dd>
              
              <dt>Batch utilization:</dt>
              <dd>If measurable, report average batch size divided by maximum batch size.</dd>
            </dl>
          </section>
          
          <section anchor="throughput-latency-at-max">
            <name>Latency at Maximum Throughput</name>
            <t>At the maximum sustainable load level, report:</t>
            <ul>
              <li>TTFT P50, P95, P99</li>
              <li>TPOT P50, P95, P99</li>
              <li>End-to-end latency P50, P95, P99</li>
            </ul>
          </section>
        </section>
        
        <section anchor="throughput-reporting">
          <name>Reporting Format</name>
          
          <section anchor="throughput-summary">
            <name>Summary Results</name>
            <table anchor="throughput-summary-table">
              <name>Throughput Summary Example</name>
              <thead>
                <tr><th>Metric</th><th>Value</th></tr>
              </thead>
              <tbody>
                <tr><td>Max Output Throughput</td><td>2847 tok/s</td></tr>
                <tr><td>Max Request Throughput</td><td>18.2 req/s</td></tr>
                <tr><td>Max Input Throughput</td><td>5123 tok/s</td></tr>
                <tr><td>Sustainable Load</td><td>20 req/s</td></tr>
                <tr><td>Tokens per GPU-second</td><td>356 tok/s/GPU</td></tr>
              </tbody>
            </table>
          </section>
          
          <section anchor="throughput-latency-table">
            <name>Latency at Maximum Throughput</name>
            <table anchor="latency-at-max-table">
              <name>Latency at Maximum Throughput Example</name>
              <thead>
                <tr><th>Metric</th><th>P50</th><th>P95</th><th>P99</th></tr>
              </thead>
              <tbody>
                <tr><td>TTFT</td><td>312 ms</td><td>687 ms</td><td>1124 ms</td></tr>
                <tr><td>TPOT</td><td>42 ms</td><td>78 ms</td><td>134 ms</td></tr>
                <tr><td>End-to-End</td><td>6.2 s</td><td>11.4 s</td><td>18.7 s</td></tr>
              </tbody>
            </table>
          </section>
        </section>
      </section>
      
      <section anchor="test-tradeoff">
        <name>Throughput-Latency Tradeoff</name>
        
        <section anchor="tradeoff-objective">
          <name>Objective</name>
          <t>To characterize the relationship between throughput and latency
          across the operating range of the SUT. This test produces a
          throughput-latency curve that reveals system behavior more
          completely than point measurements can.</t>
        </section>
        
        <section anchor="tradeoff-setup">
          <name>Setup Parameters</name>
          <dl>
            <dt>Workload:</dt>
            <dd>One of the standard workloads or fully specified custom workload.</dd>
            
            <dt>Test duration per point:</dt>
            <dd>Minimum 60 seconds per load level.</dd>
            
            <dt>Load levels:</dt>
            <dd>At least 10 load levels spanning from low load (10% of estimated
            capacity) to saturation.</dd>
            
            <dt>Load model:</dt>
            <dd>Open-loop is REQUIRED for this test. Closed-loop load
            generation self-limits at the system's service rate and therefore
            cannot reveal behavior beyond capacity.</dd>
          </dl>
        </section>
        
        <section anchor="tradeoff-procedure">
          <name>Procedure</name>
          <ol>
            <li>Estimate system capacity using a preliminary throughput test
            or published specifications.</li>
            <li>Define load levels: 10%, 20%, 30%, ..., 100%, 110%, 120% of
            estimated capacity.</li>
            <li><t>For each load level in ascending order:</t>
              <ol type="a">
                <li>Run load for specified duration.</li>
                <li>Record all request timings.</li>
                <li>Compute achieved throughput (may differ from offered load at saturation).</li>
                <li>Compute latency percentiles.</li>
              </ol>
            </li>
            <li>Plot throughput vs latency curves.</li>
          </ol>
        </section>
        
        <section anchor="tradeoff-measurements">
          <name>Measurements</name>
          <t>For each load level, record:</t>
          <ul>
            <li>Offered load (request rate)</li>
            <li>Achieved throughput (output tokens per second)</li>
            <li>TTFT: P50, P95, P99</li>
            <li>TPOT: P50, P95, P99</li>
            <li>End-to-end latency: P50, P95, P99</li>
            <li>Request success rate</li>
            <li>Queue growth indicator (stable/growing)</li>
          </ul>
          
          <t>Derived metrics:</t>
          <dl>
            <dt>Optimal operating point:</dt>
            <dd>Load level achieving highest throughput while meeting specified SLO.</dd>
            
            <dt>Knee point:</dt>
            <dd>Load level where P99 latency exceeds 2x the minimum P99 latency observed.</dd>
            
            <dt>Saturation point:</dt>
            <dd>Load level where achieved throughput first decreases from previous level.</dd>
          </dl>
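          <t>The knee and saturation points can be derived mechanically from
          the per-level results. The dict keys in the sketch below are
          illustrative; the points list is assumed ordered by ascending
          load.</t>

```python
def analyze_sweep(points):
    """Locate knee and saturation points in a load sweep.

    points is a list of dicts with keys "load", "throughput" (achieved
    output tokens/s), and "p99" (latency), ordered by ascending load.
    Returns (knee_load, saturation_load); either may be None if the
    sweep never crossed the corresponding threshold.
    """
    min_p99 = min(p["p99"] for p in points)
    # Knee: first level whose P99 exceeds 2x the minimum P99 observed.
    knee = next((p["load"] for p in points if p["p99"] > 2 * min_p99), None)
    # Saturation: first level where achieved throughput decreases.
    saturation = None
    for prev, cur in zip(points, points[1:]):
        if cur["throughput"] < prev["throughput"]:
            saturation = cur["load"]
            break
    return knee, saturation
```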
        </section>
        
        <section anchor="tradeoff-reporting">
          <name>Reporting Format</name>
          <table anchor="tradeoff-table">
            <name>Throughput-Latency Table Example</name>
            <thead>
              <tr>
                <th>Offered (r/s)</th>
                <th>Achieved (tok/s)</th>
                <th>TTFT P50</th>
                <th>TTFT P99</th>
                <th>TPOT P50</th>
                <th>TPOT P99</th>
                <th>Success</th>
              </tr>
            </thead>
            <tbody>
              <tr><td>2</td><td>284</td><td>95</td><td>142</td><td>32</td><td>41</td><td>100%</td></tr>
              <tr><td>6</td><td>852</td><td>102</td><td>178</td><td>34</td><td>48</td><td>100%</td></tr>
              <tr><td>10</td><td>1420</td><td>128</td><td>267</td><td>38</td><td>62</td><td>100%</td></tr>
              <tr><td>14</td><td>1988</td><td>198</td><td>512</td><td>48</td><td>98</td><td>100%</td></tr>
              <tr><td>18</td><td>2534</td><td>378</td><td>1234</td><td>72</td><td>198</td><td>99.8%</td></tr>
              <tr><td>22</td><td>2712</td><td>823</td><td>3456</td><td>142</td><td>523</td><td>94.1%</td></tr>
            </tbody>
          </table>
          <t>Knee point: 14 req/s (TTFT P99 first exceeds 2x the minimum observed)</t>
          <t>Saturation point: 22 req/s (achieved throughput plateaus and the
          success rate falls below 100%)</t>
        </section>
      </section>
      
      <section anchor="test-itl">
        <name>Inter-Token Latency Distribution</name>
        
        <section anchor="itl-objective">
          <name>Objective</name>
          <t>To characterize the variability of token delivery during the decode
          phase. ITL distribution determines streaming smoothness experienced
          by users.</t>
        </section>
        
        <section anchor="itl-setup">
          <name>Setup Parameters</name>
          <dl>
            <dt>Workload:</dt>
            <dd>Synthetic-Uniform or Conversation workload RECOMMENDED.</dd>
            
            <dt>Minimum output length:</dt>
            <dd>Requests MUST generate at least 50 output tokens to provide
            meaningful ITL samples.</dd>
            
            <dt>Request count:</dt>
            <dd>At least 100 requests for per-request statistics, yielding
            5000+ ITL samples.</dd>
            
            <dt>Load level:</dt>
            <dd>Specify as percentage of maximum throughput. Multiple load
            levels RECOMMENDED: 25%, 50%, 75%, 90% of saturation.</dd>
            
            <dt>Measurement method:</dt>
            <dd>Specify per <xref target="itl-chunked"/> (chunk timing,
            distributed timing, or server-side timing).</dd>
          </dl>
        </section>
        
        <section anchor="itl-procedure">
          <name>Procedure</name>
          <ol>
            <li>Configure SUT and complete warm-up.</li>
            <li><t>For each load level:</t>
              <ol type="a">
                <li>Generate requests at specified load.</li>
                <li>For each request, record arrival time of each token after the first.</li>
                <li>Calculate ITL_i = T(token_i) - T(token_{i-1}) for each consecutive token pair.</li>
                <li>Aggregate ITL samples across all requests.</li>
                <li>Calculate per-request jitter (standard deviation of ITL within each request).</li>
                <li>Record maximum pause duration per request.</li>
              </ol>
            </li>
          </ol>
          <t>The interval between request submission and first token (TTFT)
          MUST NOT be included in ITL calculation.</t>
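          <t>The per-request calculations in steps c through f can be
          sketched as follows. Each inner list holds the arrival timestamps
          of one request's tokens, starting at the first token, so the TTFT
          interval never enters the differences.</t>

```python
from statistics import pstdev

def itl_stats(token_times):
    """Compute ITL samples, per-request jitter, and max pause.

    token_times is a list of per-request token-arrival timestamp lists,
    each beginning at the first token (TTFT is thereby excluded).
    """
    all_itls, jitters, max_pauses = [], [], []
    for times in token_times:
        itls = [b - a for a, b in zip(times, times[1:])]  # consecutive gaps
        if not itls:
            continue                       # single-token request: no ITL
        all_itls.extend(itls)              # aggregate across requests
        jitters.append(pstdev(itls))       # per-request jitter (std dev)
        max_pauses.append(max(itls))       # longest stall in this request
    return all_itls, jitters, max_pauses
```

          <t>Percentiles of the three returned series yield the aggregate
          and per-request statistics required below.</t>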
        </section>
        
        <section anchor="itl-measurements">
          <name>Measurements</name>
          
          <section anchor="itl-aggregate">
            <name>Aggregate ITL Statistics</name>
            <dl>
              <dt>ITL Percentiles:</dt>
              <dd>P50, P90, P95, P99, P99.9 across all ITL samples.</dd>
              
              <dt>ITL Mean:</dt>
              <dd>Arithmetic mean of all ITL samples.</dd>
              
              <dt>ITL Standard Deviation:</dt>
              <dd>Standard deviation across all samples.</dd>
            </dl>
          </section>
          
          <section anchor="itl-per-request">
            <name>Per-Request Statistics</name>
            <dl>
              <dt>Jitter Distribution:</dt>
              <dd>P50, P95, P99 of per-request standard deviation.</dd>
              
              <dt>Maximum Pause Distribution:</dt>
              <dd>P50, P95, P99 of per-request maximum ITL.</dd>
            </dl>
          </section>
          
          <section anchor="itl-shape">
            <name>Distribution Shape</name>
            <dl>
              <dt>Modality:</dt>
              <dd>Whether ITL distribution is unimodal or multimodal. Multimodal
              distributions indicate distinct operating regimes (e.g., batching effects).</dd>
              
              <dt>Tail behavior:</dt>
              <dd>Characterize tail (exponential, heavy-tailed). Report the ratio
              P99/P50 as a tail heaviness indicator.</dd>
            </dl>
          </section>
        </section>
        
        <section anchor="itl-reporting">
          <name>Reporting Format</name>
          <table anchor="itl-table">
            <name>ITL Results Example</name>
            <thead>
              <tr><th>Metric</th><th>Value</th></tr>
            </thead>
            <tbody>
              <tr><td>ITL Samples</td><td>15234</td></tr>
              <tr><td>ITL P50</td><td>38 ms</td></tr>
              <tr><td>ITL P90</td><td>52 ms</td></tr>
              <tr><td>ITL P95</td><td>67 ms</td></tr>
              <tr><td>ITL P99</td><td>124 ms</td></tr>
              <tr><td>ITL P99.9</td><td>312 ms</td></tr>
              <tr><td>ITL Mean</td><td>42 ms</td></tr>
              <tr><td>ITL Std Dev</td><td>28 ms</td></tr>
              <tr><td>P99/P50 Ratio</td><td>3.26</td></tr>
            </tbody>
          </table>
        </section>
      </section>
      
      <section anchor="test-capacity">
        <name>Concurrent Request Capacity</name>
        
        <section anchor="capacity-objective">
          <name>Objective</name>
          <t>To determine the maximum number of concurrent requests the SUT can
          maintain while meeting latency objectives. This test measures
          memory capacity and scheduling limits.</t>
        </section>
        
        <section anchor="capacity-setup">
          <name>Setup Parameters</name>
          <dl>
            <dt>Workload:</dt>
            <dd>Synthetic-Uniform RECOMMENDED for controlled testing.</dd>
            
            <dt>Fixed output length:</dt>
            <dd>Use fixed output length (e.g., 256 tokens) to ensure all
            requests have similar duration.</dd>
            
            <dt>Initial concurrency:</dt>
            <dd>Starting number of concurrent requests (e.g., 8).</dd>
            
            <dt>Maximum concurrency:</dt>
            <dd>Upper bound for search (e.g., 512).</dd>
            
            <dt>Success criteria:</dt>
            <dd><t>Request completion rate &gt;= 99%, TTFT P99 &lt;= specified threshold,
            and no out-of-memory errors.</t></dd>
          </dl>
        </section>
        
        <section anchor="capacity-procedure">
          <name>Procedure</name>
          <t>This test employs binary search to find maximum concurrent capacity.</t>
          <ol>
            <li>Configure SUT and complete warm-up.</li>
            <li>Set concurrency = initial concurrency.</li>
            <li><t>For each concurrency level:</t>
              <ol type="a">
                <li>Submit [concurrency] requests simultaneously.</li>
                <li>Maintain concurrency: when a request completes, immediately submit a replacement.</li>
                <li>Run for at least 60 seconds or 100 request completions per slot, whichever is longer.</li>
                <li>Record completion rate, latency percentiles, and any errors.</li>
                <li>Check success criteria.</li>
              </ol>
            </li>
            <li><t>Binary search:</t>
              <ol type="a">
                <li>If success criteria met: increase concurrency toward maximum.</li>
                <li>If success criteria not met: decrease concurrency.</li>
                <li>Continue until convergence.</li>
              </ol>
            </li>
            <li>Report maximum concurrency meeting success criteria.</li>
          </ol>
        </section>
        
        <section anchor="capacity-measurements">
          <name>Measurements</name>
          <dl>
            <dt>Maximum concurrent requests:</dt>
            <dd>Highest concurrency meeting success criteria.</dd>
            
            <dt>Achieved throughput at maximum:</dt>
            <dd>Output tokens per second at maximum concurrency.</dd>
            
            <dt>Tokens in flight at maximum:</dt>
            <dd>Approximate total tokens (input + output so far) across all
            concurrent requests.</dd>
          </dl>
        </section>
        
        <section anchor="capacity-reporting">
          <name>Reporting Format</name>
          <table anchor="capacity-table">
            <name>Capacity Search Results Example</name>
            <thead>
              <tr>
                <th>Concurrency</th>
                <th>Completion</th>
                <th>TTFT P99</th>
                <th>TPOT P99</th>
                <th>Errors</th>
                <th>Status</th>
              </tr>
            </thead>
            <tbody>
              <tr><td>8</td><td>100%</td><td>142 ms</td><td>38 ms</td><td>0</td><td>Pass</td></tr>
              <tr><td>16</td><td>100%</td><td>178 ms</td><td>42 ms</td><td>0</td><td>Pass</td></tr>
              <tr><td>32</td><td>100%</td><td>267 ms</td><td>52 ms</td><td>0</td><td>Pass</td></tr>
              <tr><td>64</td><td>99.7%</td><td>523 ms</td><td>78 ms</td><td>0</td><td>Pass</td></tr>
              <tr><td>128</td><td>97.2%</td><td>1234 ms</td><td>156 ms</td><td>3</td><td>Fail</td></tr>
            </tbody>
          </table>
          <t>Maximum concurrent requests meeting criteria: 64</t>
        </section>
      </section>
      
      <section anchor="test-fairness">
        <name>Scheduling Fairness</name>
        
        <section anchor="fairness-objective">
          <name>Objective</name>
          <t>To evaluate how equitably the SUT allocates resources across
          concurrent requests with different characteristics. This test
          reveals head-of-line blocking, starvation, and priority effects.</t>
        </section>
        
        <section anchor="fairness-setup">
          <name>Setup Parameters</name>
          <dl>
            <dt>Workload:</dt>
            <dd>Synthetic-Skewed REQUIRED. The high length variance creates
            fairness-sensitive conditions.</dd>
            
            <dt>Request classes:</dt>
            <dd><t>Define two or more request classes:</t>
              <ul>
                <li>Short requests: Input [64, 256] tokens, output [32, 128] tokens</li>
                <li>Long requests: Input [1024, 4096] tokens, output [256, 1024] tokens</li>
              </ul>
            </dd>
            
            <dt>Class mix:</dt>
            <dd>Ratio of request classes (e.g., 80% short, 20% long).</dd>
            
            <dt>Load level:</dt>
            <dd>70-90% of saturation throughput RECOMMENDED to create contention.</dd>
            
            <dt>Request count:</dt>
            <dd>At least 500 requests per class.</dd>
          </dl>
        </section>
        
        <section anchor="fairness-procedure">
          <name>Procedure</name>
          <ol>
            <li>Configure SUT and complete warm-up.</li>
            <li>Measure a baseline: the performance of each class in isolation at the same total load.</li>
            <li>Generate mixed workload with specified class ratio.</li>
            <li>Run at specified load level for at least 300 seconds.</li>
            <li>For each request, record class membership, submission time, first token time, completion time.</li>
            <li>Compute per-class statistics and fairness metrics.</li>
          </ol>
        </section>
        
        <section anchor="fairness-measurements">
          <name>Measurements</name>
          <dl>
            <dt>Per-class latency:</dt>
            <dd>TTFT P50, P95, P99 for each request class.</dd>
            
            <dt>Latency inflation:</dt>
            <dd>(Mixed workload TTFT) / (Isolated TTFT) per class.</dd>
            
            <dt>Jain's Fairness Index:</dt>
            <dd>J = (sum(x_i))^2 / (n * sum(x_i^2)) where x_i is normalized latency.
            J = 1.0 indicates perfect fairness. J &lt; 0.9 indicates significant unfairness.</dd>
            
            <dt>Starvation rate:</dt>
            <dd>Fraction of requests waiting longer than 5x the median wait time for their class.</dd>
          </dl>
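          <t>The last two metrics map directly to code. The sketch below is
          illustrative; how the per-class latencies are normalized (e.g.,
          dividing each class's mixed-workload latency by its isolated
          baseline) is left to the tester.</t>

```python
def jains_index(latencies):
    """Jain's Fairness Index over normalized per-class latencies:
    J = (sum x_i)^2 / (n * sum x_i^2); J = 1.0 means perfect fairness."""
    n = len(latencies)
    total = sum(latencies)
    return total * total / (n * sum(x * x for x in latencies))

def starvation_rate(waits):
    """Fraction of requests waiting longer than 5x the class median wait."""
    median = sorted(waits)[len(waits) // 2]
    return sum(1 for w in waits if w > 5 * median) / len(waits)
```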
        </section>
        
        <section anchor="fairness-reporting">
          <name>Reporting Format</name>
          <table anchor="fairness-class-table">
            <name>Per-Class Results Example</name>
            <thead>
              <tr><th>Class</th><th>Count</th><th>TTFT P50</th><th>TTFT P99</th><th>TPOT P50</th><th>TPOT P99</th></tr>
            </thead>
            <tbody>
              <tr><td>Short</td><td>4012</td><td>89 ms</td><td>234 ms</td><td>35 ms</td><td>67 ms</td></tr>
              <tr><td>Long</td><td>988</td><td>312 ms</td><td>1234 ms</td><td>42 ms</td><td>89 ms</td></tr>
            </tbody>
          </table>
          
          <table anchor="fairness-metrics-table">
            <name>Fairness Metrics Example</name>
            <thead>
              <tr><th>Metric</th><th>Value</th></tr>
            </thead>
            <tbody>
              <tr><td>Jain's Fairness Index</td><td>0.87</td></tr>
              <tr><td>Short Class Starvation</td><td>0.3%</td></tr>
              <tr><td>Long Class Starvation</td><td>2.1%</td></tr>
            </tbody>
          </table>
        </section>
      </section>
      
      <section anchor="test-cache">
        <name>Prefix Cache Effectiveness</name>
        
        <section anchor="cache-objective">
          <name>Objective</name>
          <t>To evaluate the performance benefit of prefix caching under
          workloads with shared prefixes. This test quantifies TTFT
          reduction from cache hits.</t>
        </section>
        
        <section anchor="cache-setup">
          <name>Setup Parameters</name>
          <dl>
            <dt>Workload:</dt>
            <dd>Code Completion workload RECOMMENDED (high prefix sharing).</dd>
            
            <dt>Shared prefix:</dt>
            <dd>Define a prefix shared across requests.</dd>
            
            <dt>Prefix length:</dt>
            <dd>Length in tokens of shared prefix.</dd>
            
            <dt>Sharing fraction:</dt>
            <dd>Percentage of requests sharing the prefix.</dd>
            
            <dt>Comparison mode:</dt>
            <dd>Test MUST run in two configurations: cache disabled (baseline)
            and cache enabled.</dd>
          </dl>
        </section>
        
        <section anchor="cache-procedure">
          <name>Procedure</name>
          <ol>
            <li>Configure SUT with cache disabled.</li>
            <li>Complete warm-up (without populating prefix cache).</li>
            <li>Run workload, record TTFT for all requests.</li>
            <li>Enable prefix cache.</li>
            <li>Optionally pre-populate cache with shared prefix.</li>
            <li>Run identical workload, record TTFT for all requests.</li>
            <li>Compare results.</li>
          </ol>
        </section>
        
        <section anchor="cache-measurements">
          <name>Measurements</name>
          <dl>
            <dt>TTFT without cache:</dt>
            <dd>P50, P95, P99 with caching disabled.</dd>
            
            <dt>TTFT with cache:</dt>
            <dd>P50, P95, P99 with caching enabled.</dd>
            
            <dt>TTFT reduction:</dt>
            <dd>(TTFT_no_cache - TTFT_cache) / TTFT_no_cache, expressed as a percentage.</dd>
            
            <dt>Cache hit rate:</dt>
            <dd>Fraction of prefix tokens served from cache.</dd>
            
            <dt>Throughput improvement:</dt>
            <dd>Percentage increase from caching.</dd>
          </dl>
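          <t>A minimal sketch of the reduction and hit-rate arithmetic
          above; the helper names and input values are illustrative, not
          part of this methodology.</t>
          <sourcecode type="python"><![CDATA[
def ttft_reduction(ttft_no_cache, ttft_cache):
    # Percentage TTFT reduction, per the formula above.
    return 100.0 * (ttft_no_cache - ttft_cache) / ttft_no_cache

def cache_hit_rate(cached_prefix_tokens, total_prefix_tokens):
    # Fraction of prefix tokens served from cache.
    return cached_prefix_tokens / total_prefix_tokens

# Illustrative values only:
print(round(ttft_reduction(312.0, 98.0), 1))  # 68.6
print(cache_hit_rate(48000, 60000))           # 0.8
]]></sourcecode>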
        </section>
        
        <section anchor="cache-reporting">
          <name>Reporting Format</name>
          <table anchor="cache-comparison-table">
            <name>Cache Effectiveness Example</name>
            <thead>
              <tr><th>Configuration</th><th>TTFT P50</th><th>TTFT P95</th><th>TTFT P99</th></tr>
            </thead>
            <tbody>
              <tr><td>Cache Disabled</td><td>312 ms</td><td>423 ms</td><td>534 ms</td></tr>
              <tr><td>Cache (Cold)</td><td>134 ms</td><td>198 ms</td><td>267 ms</td></tr>
              <tr><td>Cache (Warm)</td><td>98 ms</td><td>156 ms</td><td>212 ms</td></tr>
            </tbody>
          </table>
        </section>
      </section>
      
      <section anchor="test-memory">
        <name>Memory Pressure Behavior</name>
        
        <section anchor="memory-objective">
          <name>Objective</name>
          <t>To characterize SUT behavior when memory resources are constrained,
          including preemption, swapping, and degradation patterns.</t>
        </section>
        
        <section anchor="memory-setup">
          <name>Setup Parameters</name>
          <dl>
            <dt>Workload:</dt>
            <dd>Long Context workload RECOMMENDED to create memory pressure.</dd>
            
            <dt>Oversubscription level:</dt>
            <dd>Percentage above maximum capacity (e.g., 110%, 125%, 150%).</dd>
          </dl>
        </section>
        
        <section anchor="memory-procedure">
          <name>Procedure</name>
          <ol>
            <li>Determine maximum concurrent capacity from <xref target="test-capacity"/>.</li>
            <li>Configure SUT and complete warm-up.</li>
            <li><t>For each oversubscription level:</t>
              <ol type="a">
                <li>Submit requests at concurrency exceeding capacity.</li>
                <li>Run for at least 120 seconds.</li>
                <li>Monitor request completions, preemption events, latency.</li>
                <li>Record any OOM errors or system failures.</li>
              </ol>
            </li>
            <li>Analyze degradation patterns.</li>
          </ol>
        </section>
        
        <section anchor="memory-measurements">
          <name>Measurements</name>
          <dl>
            <dt>Completion rate:</dt>
            <dd>Percentage of requests completing successfully at each level.</dd>
            
            <dt>Preemption rate:</dt>
            <dd>Fraction of requests preempted at least once.</dd>
            
            <dt>Preemption recovery rate:</dt>
            <dd>Fraction of preempted requests that eventually complete.</dd>
            
            <dt>Preemption loss:</dt>
            <dd>Average tokens discarded per preemption event.</dd>
          </dl>
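          <t>The completion, preemption, and recovery rates above can be
          computed from per-request records as sketched below; the record
          field names are assumptions for illustration.</t>
          <sourcecode type="python"><![CDATA[
def memory_pressure_metrics(records):
    # Each record: {'completed': bool, 'preemptions': int}
    # (field names are illustrative).
    n = len(records)
    completed = sum(1 for r in records if r['completed'])
    preempted = [r for r in records if r['preemptions'] > 0]
    recovered = sum(1 for r in preempted if r['completed'])
    return {
        'completion_rate': completed / n,
        'preemption_rate': len(preempted) / n,
        'recovery_rate': recovered / len(preempted) if preempted else 1.0,
    }

sample = [
    {'completed': True,  'preemptions': 0},
    {'completed': True,  'preemptions': 1},
    {'completed': False, 'preemptions': 2},
    {'completed': True,  'preemptions': 0},
]
print(memory_pressure_metrics(sample))
]]></sourcecode>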
        </section>
        
        <section anchor="memory-reporting">
          <name>Reporting Format</name>
          <table anchor="memory-degradation-table">
            <name>Memory Pressure Degradation Example</name>
            <thead>
              <tr><th>Oversub Level</th><th>Complete</th><th>Preempt</th><th>Fail Rate</th><th>TTFT P99</th></tr>
            </thead>
            <tbody>
              <tr><td>100% (base)</td><td>99.7%</td><td>0%</td><td>0.3%</td><td>523 ms</td></tr>
              <tr><td>110%</td><td>98.2%</td><td>5.2%</td><td>1.8%</td><td>789 ms</td></tr>
              <tr><td>125%</td><td>94.5%</td><td>18.7%</td><td>5.5%</td><td>1456 ms</td></tr>
              <tr><td>150%</td><td>82.3%</td><td>42.1%</td><td>17.7%</td><td>3234 ms</td></tr>
            </tbody>
          </table>
        </section>
      </section>
      
      <section anchor="test-long-context">
        <name>Long Context Scaling</name>
        
        <section anchor="long-context-objective">
          <name>Objective</name>
          <t>To characterize how latency and throughput scale with context
          length.</t>
        </section>
        
        <section anchor="long-context-setup">
          <name>Setup Parameters</name>
          <dl>
            <dt>Workload:</dt>
            <dd>Long Context workload REQUIRED.</dd>
            
            <dt>Context length range:</dt>
            <dd>Sequence of lengths to test (e.g., 1K, 2K, 4K, 8K, 16K, 32K,
            64K, 128K tokens).</dd>
            
            <dt>Fixed output length:</dt>
            <dd>Use consistent short output (256 tokens) to isolate prefill impact.</dd>
            
            <dt>Load model:</dt>
            <dd>Closed-loop with low concurrency (1-4).</dd>
            
            <dt>Requests per length:</dt>
            <dd>At least 20 requests per context length.</dd>
          </dl>
        </section>
        
        <section anchor="long-context-procedure">
          <name>Procedure</name>
          <ol>
            <li>Configure SUT and complete warm-up with short-context requests.</li>
            <li><t>For each context length in ascending order:</t>
              <ol type="a">
                <li>Generate requests with specified input length.</li>
                <li>Submit requests at low concurrency.</li>
                <li>Record TTFT and total latency for each request.</li>
              </ol>
            </li>
            <li>Analyze scaling behavior and fit to scaling models.</li>
          </ol>
        </section>
        
        <section anchor="long-context-measurements">
          <name>Measurements</name>
          <dl>
            <dt>Per-length latency:</dt>
            <dd>TTFT Mean, P50, P95 for each context length.</dd>
            
            <dt>Prefill scaling:</dt>
            <dd>Time per input token (TTFT / input_length).</dd>
            
            <dt>Scaling exponent:</dt>
            <dd>Fit exponent k where TTFT proportional to context_length^k.</dd>
            
            <dt>Throughput at length:</dt>
            <dd>Maximum throughput achievable at each context length.</dd>
          </dl>
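          <t>The scaling exponent can be obtained by an ordinary
          least-squares fit in log-log space, as in this sketch (pure
          Python; no fitting library is assumed):</t>
          <sourcecode type="python"><![CDATA[
import math

def scaling_exponent(context_lengths, ttft_ms):
    # Fit k in TTFT ~ C * length^k via least squares on logs.
    xs = [math.log(c) for c in context_lengths]
    ys = [math.log(t) for t in ttft_ms]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Sanity check with perfectly linear scaling (68 us/token):
lengths = [1024, 4096, 16384, 65536]
ttfts = [0.068 * c for c in lengths]
print(round(scaling_exponent(lengths, ttfts), 3))  # 1.0
]]></sourcecode>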
        </section>
        
        <section anchor="long-context-reporting">
          <name>Reporting Format</name>
          <table anchor="long-context-table">
            <name>Long Context Scaling Example</name>
            <thead>
              <tr><th>Context (tokens)</th><th>TTFT Mean</th><th>TTFT P95</th><th>ms/1K tokens</th></tr>
            </thead>
            <tbody>
              <tr><td>1024</td><td>89 ms</td><td>112 ms</td><td>76</td></tr>
              <tr><td>4096</td><td>289 ms</td><td>367 ms</td><td>63</td></tr>
              <tr><td>16384</td><td>1023 ms</td><td>1287 ms</td><td>59</td></tr>
              <tr><td>65536</td><td>4234 ms</td><td>5123 ms</td><td>62</td></tr>
              <tr><td>131072</td><td>9123 ms</td><td>11234 ms</td><td>68</td></tr>
            </tbody>
          </table>
          <t>Best fit: linear (R^2 = 0.9987), ~68 microseconds per input token.</t>
        </section>
      </section>
      
      <section anchor="test-guardrail">
        <name>Guardrail Overhead</name>
        
        <section anchor="guardrail-objective">
          <name>Objective</name>
          <t>To quantify the latency impact of safety systems and content
          filtering.</t>
        </section>
        
        <section anchor="guardrail-setup">
          <name>Setup Parameters</name>
          <dl>
            <dt>Workload:</dt>
            <dd>Conversation workload RECOMMENDED.</dd>
            
            <dt>Content mix:</dt>
            <dd>Use benign content to measure processing overhead.</dd>
            
            <dt>Configurations to compare:</dt>
            <dd><t>The following configurations should be tested:</t>
              <ul>
                <li>Baseline: All guardrails disabled (if possible)</li>
                <li>Input filtering only</li>
                <li>Output filtering only</li>
                <li>Full filtering: All production guardrails enabled</li>
              </ul>
            </dd>
            
            <dt>Load levels:</dt>
            <dd>Test at 25%, 50%, 75% of capacity.</dd>
          </dl>
        </section>
        
        <section anchor="guardrail-procedure">
          <name>Procedure</name>
          <ol>
            <li>Configure SUT with baseline (no guardrails).</li>
            <li>Complete warm-up and run workload at each load level.</li>
            <li>Enable each guardrail configuration and repeat.</li>
            <li>Compare results across configurations.</li>
          </ol>
        </section>
        
        <section anchor="guardrail-measurements">
          <name>Measurements</name>
          <dl>
            <dt>Per-configuration latency:</dt>
            <dd>TTFT P50, P95, P99 and End-to-end latency for each configuration.</dd>
            
            <dt>Input filter overhead:</dt>
            <dd>TTFT(input_filter) - TTFT(baseline)</dd>
            
            <dt>Total guardrail overhead:</dt>
            <dd>End-to-end(full) - End-to-end(baseline)</dd>
            
            <dt>Throughput reduction:</dt>
            <dd>Percentage reduction from guardrails.</dd>
          </dl>
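          <t>The overhead deltas reduce to simple differences; a sketch
          follows, with illustrative dictionary keys and values.</t>
          <sourcecode type="python"><![CDATA[
def guardrail_overhead(baseline, config):
    # Inputs: {'ttft_p50': ms, 'e2e_p50': s, 'throughput': tok/s}
    # (keys are illustrative).
    return {
        'ttft_overhead_ms': config['ttft_p50'] - baseline['ttft_p50'],
        'e2e_overhead_s': config['e2e_p50'] - baseline['e2e_p50'],
        'throughput_reduction_pct':
            100.0 * (baseline['throughput'] - config['throughput'])
            / baseline['throughput'],
    }

base = {'ttft_p50': 98, 'e2e_p50': 4.2, 'throughput': 2867}
full = {'ttft_p50': 118, 'e2e_p50': 5.0, 'throughput': 2289}
print(guardrail_overhead(base, full))
]]></sourcecode>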
        </section>
        
        <section anchor="guardrail-reporting">
          <name>Reporting Format</name>
          <table anchor="guardrail-latency-table">
            <name>Guardrail Overhead Example</name>
            <thead>
              <tr><th>Configuration</th><th>TTFT P50</th><th>TTFT P99</th><th>E2E P50</th><th>E2E P99</th></tr>
            </thead>
            <tbody>
              <tr><td>Baseline</td><td>98 ms</td><td>234 ms</td><td>4.2 s</td><td>8.7 s</td></tr>
              <tr><td>Input Filter</td><td>112 ms</td><td>267 ms</td><td>4.3 s</td><td>8.9 s</td></tr>
              <tr><td>Output Filter</td><td>101 ms</td><td>242 ms</td><td>4.8 s</td><td>9.8 s</td></tr>
              <tr><td>Full Filter</td><td>118 ms</td><td>289 ms</td><td>5.0 s</td><td>10.2 s</td></tr>
            </tbody>
          </table>
          
          <table anchor="guardrail-throughput-table">
            <name>Throughput Impact Example</name>
            <thead>
              <tr><th>Configuration</th><th>Max Throughput</th><th>Reduction</th></tr>
            </thead>
            <tbody>
              <tr><td>Baseline</td><td>2867 tok/s</td><td>-</td></tr>
              <tr><td>Input Filter</td><td>2756 tok/s</td><td>-3.9%</td></tr>
              <tr><td>Output Filter</td><td>2412 tok/s</td><td>-15.9%</td></tr>
              <tr><td>Full Filter</td><td>2289 tok/s</td><td>-20.2%</td></tr>
            </tbody>
          </table>
        </section>
      </section>
    </section>
    
    <section anchor="comparison-guidelines">
      <name>Multi-System Comparison Guidelines</name>
      
      <t>When comparing multiple SUTs, testers should apply the following
      guidelines.</t>
      
      <section anchor="equivalence">
        <name>Equivalence Requirements</name>
        <t>Testers MUST ensure:</t>
        <ul>
          <li>Identical workload (same requests in same order with same seeds)</li>
          <li>Equivalent SUT boundary (all systems at same boundary)</li>
          <li>Comparable hardware (or normalize by hardware capability)</li>
          <li>Same load model and parameters</li>
        </ul>
      </section>
      
      <section anchor="normalization">
        <name>Normalization</name>
        <t>When hardware differs:</t>
        <ul>
          <li>Report tokens per GPU-second (normalized by GPU count)</li>
          <li>Report cost-normalized throughput (tokens per dollar-hour)</li>
          <li>Clearly state normalization method</li>
        </ul>
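        <t>A sketch of both normalizations; the GPU count and hourly
        price below are illustrative inputs, not reference values.</t>
        <sourcecode type="python"><![CDATA[
def normalized_throughput(tokens_per_s, gpu_count, dollars_per_hour):
    # Hardware- and cost-normalized throughput, as listed above.
    return {
        'tok_per_gpu_s': tokens_per_s / gpu_count,
        'tok_per_dollar_hour': tokens_per_s * 3600.0 / dollars_per_hour,
    }

print(normalized_throughput(2867.0, 8, 24.0))
# {'tok_per_gpu_s': 358.375, 'tok_per_dollar_hour': 430050.0}
]]></sourcecode>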
      </section>
      
      <section anchor="statistical-significance">
        <name>Statistical Significance</name>
        <t>For comparative claims:</t>
        <ul>
          <li>Report confidence intervals for key metrics</li>
          <li>Conduct multiple independent runs (at least 3)</li>
          <li>Use appropriate statistical tests for comparison</li>
        </ul>
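        <t>One common way to obtain confidence intervals is the
        percentile bootstrap, sketched below for the sample mean; other
        statistical methods are equally acceptable.</t>
        <sourcecode type="python"><![CDATA[
import random

def bootstrap_ci(samples, n_boot=2000, alpha=0.05, seed=42):
    # Percentile bootstrap CI for the sample mean.
    rng = random.Random(seed)
    n = len(samples)
    boots = sorted(
        sum(samples[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_boot)
    )
    return (boots[int(alpha / 2 * n_boot)],
            boots[int((1 - alpha / 2) * n_boot) - 1])

ttft_p50_ms = [98, 102, 95, 110, 99, 105, 97, 101, 103, 96]
lo, hi = bootstrap_ci(ttft_p50_ms)
print(f"95% CI for mean TTFT: [{lo:.1f}, {hi:.1f}] ms")
]]></sourcecode>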
      </section>
      
      <section anchor="fair-comparison-checklist">
        <name>Fair Comparison Checklist</name>
        <t>Before publishing comparative results, verify:</t>
        <ul>
          <li>Same workload specification</li>
          <li>Same test duration</li>
          <li>Same warm-up procedure</li>
          <li>Same success criteria</li>
          <li>Both systems tested at the same time (if using shared resources)</li>
          <li>Both systems in production-representative configuration</li>
          <li>Differences in configuration explicitly noted</li>
        </ul>
      </section>
    </section>

    <section anchor="security">
      <name>Security Considerations</name>
      
      <t>Benchmarking methodology intersects with security in several ways.</t>
      
      <section anchor="side-channel-risks">
        <name>Side-Channel Risks</name>
        <t>Benchmark results may reveal:</t>
        <ul>
          <li>System capacity limits useful for DoS planning</li>
          <li>Timing patterns enabling cache probing attacks</li>
          <li>Memory pressure thresholds for resource exhaustion</li>
        </ul>
        <t>Operators SHOULD consider whether to publish detailed capacity
        information publicly.</t>
      </section>
      
      <section anchor="benchmark-gaming">
        <name>Benchmark Gaming</name>
        <t>Systems may be optimized specifically for benchmark workloads in
        ways that do not generalize:</t>
        <ul>
          <li>Detecting benchmark patterns and applying special handling</li>
          <li>Caching benchmark-specific prefixes</li>
          <li>Prioritizing benchmark-like requests</li>
        </ul>
        <t>Testers SHOULD vary workloads and verify results with production
        traffic samples.</t>
      </section>
      
      <section anchor="adversarial-workloads">
        <name>Adversarial Workloads</name>
        <t>This methodology uses benign workloads. Adversarial inputs
        (jailbreak attempts, prompt injections) may have different
        performance characteristics due to guardrail processing.</t>
        <t>Testing with adversarial workloads requires additional ethical
        and safety considerations not covered here.</t>
      </section>
      
      <section anchor="resource-exhaustion">
        <name>Resource Exhaustion</name>
        <t>Memory pressure tests (<xref target="test-memory"/>) intentionally
        push systems beyond capacity. Testers SHOULD:</t>
        <ul>
          <li>Conduct such tests on isolated systems</li>
          <li>Have recovery procedures ready</li>
          <li>Monitor for cascading failures</li>
        </ul>
      </section>
    </section>
  </middle>

  <back>
    <references>
      <name>References</name>
      
      <references>
        <name>Normative References</name>
        
        <reference anchor="RFC2119" target="https://www.rfc-editor.org/info/rfc2119">
          <front>
            <title>Key words for use in RFCs to Indicate Requirement Levels</title>
            <author fullname="S. Bradner" initials="S." surname="Bradner"/>
            <date month="March" year="1997"/>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="2119"/>
          <seriesInfo name="DOI" value="10.17487/RFC2119"/>
        </reference>
        
        <reference anchor="RFC8174" target="https://www.rfc-editor.org/info/rfc8174">
          <front>
            <title>Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words</title>
            <author fullname="B. Leiba" initials="B." surname="Leiba"/>
            <date month="May" year="2017"/>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="8174"/>
          <seriesInfo name="DOI" value="10.17487/RFC8174"/>
        </reference>
        
        <reference anchor="LLM-TERMS" target="https://datatracker.ietf.org/doc/draft-gaikwad-llm-benchmarking-terminology/">
          <front>
            <title>Benchmarking Terminology for Large Language Model Serving</title>
            <author fullname="Madhava Gaikwad" initials="M." surname="Gaikwad"/>
            <date month="January" year="2026"/>
          </front>
          <seriesInfo name="Internet-Draft" value="draft-gaikwad-llm-benchmarking-terminology-00"/>
        </reference>
      </references>
      
      <references>
        <name>Informative References</name>
        
        <reference anchor="RFC1242" target="https://www.rfc-editor.org/info/rfc1242">
          <front>
            <title>Benchmarking Terminology for Network Interconnection Devices</title>
            <author fullname="S. Bradner" initials="S." surname="Bradner"/>
            <date month="July" year="1991"/>
          </front>
          <seriesInfo name="RFC" value="1242"/>
          <seriesInfo name="DOI" value="10.17487/RFC1242"/>
        </reference>
        
        <reference anchor="RFC2544" target="https://www.rfc-editor.org/info/rfc2544">
          <front>
            <title>Benchmarking Methodology for Network Interconnect Devices</title>
            <author fullname="S. Bradner" initials="S." surname="Bradner"/>
            <author fullname="J. McQuaid" initials="J." surname="McQuaid"/>
            <date month="March" year="1999"/>
          </front>
          <seriesInfo name="RFC" value="2544"/>
          <seriesInfo name="DOI" value="10.17487/RFC2544"/>
        </reference>
        
        <reference anchor="RFC3511" target="https://www.rfc-editor.org/info/rfc3511">
          <front>
            <title>Benchmarking Methodology for Firewall Performance</title>
            <author fullname="B. Hickman" initials="B." surname="Hickman"/>
            <author fullname="D. Newman" initials="D." surname="Newman"/>
            <author fullname="S. Tadjudin" initials="S." surname="Tadjudin"/>
            <author fullname="T. Martin" initials="T." surname="Martin"/>
            <date month="April" year="2003"/>
          </front>
          <seriesInfo name="RFC" value="3511"/>
          <seriesInfo name="DOI" value="10.17487/RFC3511"/>
        </reference>
        
        <reference anchor="VLLM">
          <front>
            <title>Efficient Memory Management for Large Language Model Serving with PagedAttention</title>
            <author fullname="Woosuk Kwon" initials="W." surname="Kwon"/>
            <date year="2023"/>
          </front>
          <seriesInfo name="Proceedings of" value="SOSP 2023"/>
          <seriesInfo name="DOI" value="10.1145/3600006.3613165"/>
        </reference>
        
        <reference anchor="SARATHI">
          <front>
            <title>Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve</title>
            <author fullname="Amey Agrawal" initials="A." surname="Agrawal"/>
            <date year="2024"/>
          </front>
          <seriesInfo name="Proceedings of" value="OSDI 2024"/>
        </reference>
      </references>
    </references>
    
    <section anchor="workload-specs">
      <name>Reference Workload Specifications</name>
      
      <t>This appendix provides complete specifications for standard workloads.</t>
      
      <section anchor="synthetic-uniform-spec">
        <name>Synthetic-Uniform Workload</name>
        
        <t>Purpose: Controlled baseline with minimal variance</t>
        
        <section anchor="uniform-input">
          <name>Input Specification</name>
          <dl>
            <dt>Distribution:</dt><dd>Uniform</dd>
            <dt>Minimum:</dt><dd>128 tokens</dd>
            <dt>Maximum:</dt><dd>512 tokens</dd>
            <dt>Mean:</dt><dd>320 tokens</dd>
            <dt>Content:</dt><dd>Random token IDs from vocabulary</dd>
          </dl>
        </section>
        
        <section anchor="uniform-output">
          <name>Output Specification</name>
          <dl>
            <dt>Distribution:</dt><dd>Uniform</dd>
            <dt>Minimum:</dt><dd>64 tokens</dd>
            <dt>Maximum:</dt><dd>256 tokens</dd>
            <dt>Mean:</dt><dd>160 tokens</dd>
            <dt>Control:</dt><dd>max_tokens parameter</dd>
          </dl>
        </section>
        
        <section anchor="uniform-other">
          <name>Other Parameters</name>
          <dl>
            <dt>System prompt:</dt><dd>None</dd>
            <dt>Prefix sharing:</dt><dd>None</dd>
            <dt>Temperature:</dt><dd>0.0 (deterministic)</dd>
            <dt>Stop sequences:</dt><dd>None</dd>
          </dl>
        </section>
        
        <section anchor="uniform-generation">
          <name>Generation Method</name>
          <t>Reference implementation (Python):</t>
          <sourcecode type="python"><![CDATA[
import random

def generate_synthetic_uniform(n_requests, seed=42):
    rng = random.Random(seed)
    requests = []
    for i in range(n_requests):
        input_len = rng.randint(128, 512)
        output_len = rng.randint(64, 256)
        input_tokens = [rng.randint(0, 100255)
                        for _ in range(input_len)]
        requests.append({
            'input_tokens': input_tokens,
            'max_tokens': output_len,
            'temperature': 0.0
        })
    return requests
]]></sourcecode>
        </section>
      </section>
      
      <section anchor="synthetic-skewed-spec">
        <name>Synthetic-Skewed Workload</name>
        
        <t>Purpose: Test scheduling with high length variance</t>
        
        <section anchor="skewed-input">
          <name>Input Specification</name>
          <dl>
            <dt>Distribution:</dt><dd>Log-normal</dd>
            <dt>mu:</dt><dd>5.5 (in log space)</dd>
            <dt>sigma:</dt><dd>1.0 (in log space)</dd>
            <dt>Minimum:</dt><dd>32 tokens (floor)</dd>
            <dt>Maximum:</dt><dd>4096 tokens (cap)</dd>
            <dt>Median:</dt><dd>~245 tokens</dd>
            <dt>Mean:</dt><dd>~405 tokens</dd>
          </dl>
        </section>
        
        <section anchor="skewed-output">
          <name>Output Specification</name>
          <dl>
            <dt>Distribution:</dt><dd>Log-normal</dd>
            <dt>mu:</dt><dd>4.5 (in log space)</dd>
            <dt>sigma:</dt><dd>1.2 (in log space)</dd>
            <dt>Minimum:</dt><dd>16 tokens (floor)</dd>
            <dt>Maximum:</dt><dd>2048 tokens (cap)</dd>
          </dl>
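          <t>Lengths for this workload can be drawn with the standard
          library's log-normal sampler and clamped to the floor and cap
          above, as in this sketch:</t>
          <sourcecode type="python"><![CDATA[
import random

def sample_length(rng, mu, sigma, floor, cap):
    # Log-normal draw clamped to [floor, cap].
    return max(floor, min(cap, round(rng.lognormvariate(mu, sigma))))

rng = random.Random(42)
inputs = sorted(sample_length(rng, 5.5, 1.0, 32, 4096)
                for _ in range(10000))
outputs = [sample_length(rng, 4.5, 1.2, 16, 2048)
           for _ in range(10000)]
print(inputs[5000])  # sample median, near exp(5.5) ~ 245
]]></sourcecode>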
        </section>
      </section>
      
      <section anchor="conversation-spec">
        <name>Conversation Workload</name>
        
        <t>Purpose: Realistic interactive chat patterns</t>
        
        <section anchor="conv-source">
          <name>Data Source</name>
          <dl>
            <dt>Dataset:</dt><dd>ShareGPT (vicuna_cleaned subset)</dd>
            <dt>Version:</dt><dd>2023-04-12</dd>
            <dt>Preprocessing:</dt><dd>Retain conversations with at least one assistant turn</dd>
          </dl>
        </section>
        
        <section anchor="conv-stats">
          <name>Length Statistics (Reference)</name>
          <t>Input tokens:</t>
          <ul>
            <li>P50: 156</li>
            <li>P95: 892</li>
            <li>P99: 2134</li>
          </ul>
          <t>Output tokens:</t>
          <ul>
            <li>P50: 234</li>
            <li>P95: 789</li>
            <li>P99: 1567</li>
          </ul>
        </section>
      </section>
      
      <section anchor="code-spec">
        <name>Code Completion Workload</name>
        
        <t>Purpose: Test prefix caching with code context</t>
        
        <section anchor="code-source">
          <name>Data Source</name>
          <dl>
            <dt>Dataset:</dt><dd>The Stack (Python, JavaScript, TypeScript subset)</dd>
            <dt>Preprocessing:</dt><dd>Extract function-level completions</dd>
          </dl>
        </section>
        
        <section anchor="code-prefix">
          <name>Prefix Sharing Pattern</name>
          <ul>
            <li>10 unique repository contexts</li>
            <li>Each 512-1024 tokens</li>
            <li>80% of requests share one of these prefixes</li>
            <li>Distribution: Zipf with s=1.5</li>
          </ul>
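          <t>A sketch of the sharing pattern above: 80% of requests draw
          one of ten prefixes with Zipf (s=1.5) weights, and the rest
          carry no shared prefix. The function name is illustrative.</t>
          <sourcecode type="python"><![CDATA[
import random

def assign_prefix(rng, n_prefixes=10, share_frac=0.8, s=1.5):
    # Returns a prefix index, or None for unshared requests.
    # Zipf weights: P(rank r) proportional to r**-s.
    if rng.random() >= share_frac:
        return None
    weights = [r ** -s for r in range(1, n_prefixes + 1)]
    u, acc = rng.random() * sum(weights), 0.0
    for i, w in enumerate(weights):
        acc += w
        if u < acc:
            return i
    return n_prefixes - 1

rng = random.Random(7)
picks = [assign_prefix(rng) for _ in range(10000)]
shared = [p for p in picks if p is not None]
print(round(len(shared) / len(picks), 2))  # ~0.8
]]></sourcecode>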
        </section>
      </section>
      
      <section anchor="long-context-spec">
        <name>Long Context Workload</name>
        
        <t>Purpose: Test long-context handling</t>
        
        <section anchor="long-input">
          <name>Input Specification</name>
          <dl>
            <dt>Distribution:</dt><dd>Uniform over target lengths</dd>
            <dt>Target lengths:</dt><dd>[8192, 16384, 32768, 65536, 131072] tokens</dd>
            <dt>Structure:</dt><dd>[document][question]</dd>
            <dt>Document:</dt><dd>Fills target length minus 100 tokens</dd>
            <dt>Question:</dt><dd>Fixed ~100 token question about document</dd>
          </dl>
        </section>
        
        <section anchor="long-output">
          <name>Output Specification</name>
          <dl>
            <dt>Distribution:</dt><dd>Fixed</dd>
            <dt>Length:</dt><dd>256 tokens</dd>
            <dt>Control:</dt><dd>max_tokens = 256</dd>
          </dl>
        </section>
      </section>
    </section>
    
    <section anchor="timing-reference">
      <name>Timing Measurement Reference</name>
      
      <t>This appendix provides detailed guidance for timing measurements.</t>
      
      <section anchor="ttft-measurement-points">
        <name>TTFT Measurement Points</name>
        
        <section anchor="http-sse">
          <name>HTTP/SSE Measurement</name>
          <t>Client-side TTFT:</t>
          <dl>
            <dt>T_submit:</dt><dd>time of sending final byte of HTTP request</dd>
            <dt>T_first:</dt><dd>time of receiving first data event with content token</dd>
          </dl>
          <t>T_first is when the complete "data:" line is received and parsed,
          not when the first byte of the response arrives.</t>
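          <t>The client-side measurement can be sketched
          transport-agnostically; here send_request and the line iterator
          are stand-ins for a real HTTP client, not a specific API.</t>
          <sourcecode type="python"><![CDATA[
import time

def measure_ttft(send_request, sse_lines):
    # T_submit: after the final request byte is sent.
    # T_first: first complete "data:" line carrying content,
    # not the first byte of the response.
    send_request()
    t_submit = time.monotonic()
    for line in sse_lines():
        if not line.startswith("data:"):
            continue  # SSE comments / keep-alives
        payload = line[len("data:"):].strip()
        if payload and payload != "[DONE]":
            return time.monotonic() - t_submit
    return None

def fake_stream():  # stub transport for illustration
    yield ": keep-alive"
    yield "data: "
    yield 'data: {"token": "Hi"}'

ttft = measure_ttft(lambda: None, fake_stream)
print(ttft is not None)  # True
]]></sourcecode>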
        </section>
        
        <section anchor="grpc-measurement">
          <name>gRPC Streaming Measurement</name>
          <dl>
            <dt>T_submit:</dt><dd>time of sending request message</dd>
            <dt>T_first:</dt><dd>time of receiving first response message with token</dd>
          </dl>
        </section>
        
        <section anchor="server-side-measurement">
          <name>Server-Side Measurement</name>
          <t>If server-side instrumentation is available:</t>
          <dl>
            <dt>T_submit:</dt><dd>time request enters inference queue</dd>
            <dt>T_first:</dt><dd>time first token exits model forward pass</dd>
          </dl>
          <t>Server-side measurement excludes network latency but may include
          internal queue time.</t>
        </section>
      </section>
      
      <section anchor="itl-sse-measurement">
        <name>ITL Measurement with SSE</name>
        
        <t>SSE delivery may batch multiple tokens per event due to server-side
        batching, TCP buffering, or client-side buffering.</t>
        
        <section anchor="recommended-approach">
          <name>Recommended Approach</name>
          <ol>
            <li>First, characterize the delivery pattern (tokens per chunk)</li>
            <li>If single-token chunks dominate (&gt;90%): use direct measurement</li>
            <li>If multi-token chunks are common: prefer server timestamps</li>
            <li>If server timestamps are unavailable: use chunk timing and
            document the limitation</li>
          </ol>
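          <t>Step 1 reduces to a simple profile of token counts per
          received chunk, as sketched here with illustrative data:</t>
          <sourcecode type="python"><![CDATA[
def single_token_fraction(chunks):
    # 'chunks' lists the token count carried by each SSE event.
    return sum(1 for c in chunks if c == 1) / len(chunks)

observed = [1] * 19 + [3]  # illustrative delivery pattern
frac = single_token_fraction(observed)
print(f"single-token chunks: {frac:.0%}")     # 95%
print("direct measurement OK:", frac > 0.90)  # True
]]></sourcecode>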
        </section>
      </section>
      
      <section anchor="clock-sync-methods">
        <name>Clock Synchronization Methods</name>
        
        <section anchor="ntp-sync">
          <name>NTP Synchronization</name>
          <ol>
            <li>Both machines sync to same NTP server</li>
            <li>Verify offset: ntpq -p (check offset column)</li>
            <li>Acceptable offset: &lt; 10ms for most LLM benchmarking</li>
            <li>Document NTP server and measured offset</li>
          </ol>
        </section>
        
        <section anchor="ptp-sync">
          <name>PTP Synchronization</name>
          <t>For sub-millisecond accuracy:</t>
          <ol>
            <li>Use PTP-capable network hardware</li>
            <li>Configure ptp4l on Linux systems</li>
            <li>Acceptable offset: &lt; 1 microsecond</li>
          </ol>
        </section>
        
        <section anchor="single-machine-testing">
          <name>Single-Machine Alternative</name>
          <t>Recommended for Model Engine testing:</t>
          <ol>
            <li>Run load generator on same machine as SUT</li>
            <li>Use loopback network interface</li>
            <li>Clock synchronization inherent</li>
            <li>Eliminates network latency from measurement</li>
          </ol>
        </section>
      </section>
    </section>
    
    <section anchor="reporting-templates">
      <name>Reporting Templates</name>
      
      <section anchor="minimum-report">
        <name>Minimum Viable Report</name>
        <t>For quick comparisons, include at minimum:</t>
        <artwork type="ascii-art"><![CDATA[
=== LLM Benchmark Report (Minimum) ===

System Identification:
- Model: [model name and version]
- Hardware: [GPU type] x [count]
- Software: [inference engine and version]
- SUT Boundary: [Model Engine | Gateway | Compound]

Test Configuration:
- Workload: [workload name]
- Load Model: [open-loop rate | closed-loop concurrency]
- Request Count: [N]
- Test Duration: [seconds]

Key Results:
- TTFT P50: [value] ms
- TTFT P99: [value] ms
- TPOT P50: [value] ms
- TPOT P99: [value] ms
- Max Throughput: [value] tok/s
- Throughput at P99 TTFT < 500ms: [value] tok/s

Notes:
- [Any deviations from methodology]
- [Guardrail configuration]

=== End Report ===
]]></artwork>
      </section>
      
      <section anchor="full-report">
        <name>Full Report Template</name>
        <t>A complete benchmark report should include the following sections:</t>
        <ol>
          <li>System Identification (model, hardware, software)</li>
          <li>Test Configuration (workload, load, execution parameters)</li>
          <li>Results (latency summary, throughput summary, success metrics)</li>
          <li>Detailed Results (per-test tables and visualizations)</li>
          <li>Methodology Compliance (tests performed, deviations, limitations)</li>
          <li>Reproduction Information (test harness, configuration, data)</li>
        </ol>
      </section>
    </section>
    
    <section numbered="false" anchor="acknowledgements">
      <name>Acknowledgements</name>
      <t>This document draws on the structure and approach established by
      RFC 3511 for firewall benchmarking methodology. The author thanks
      the Benchmarking Methodology Working Group for their foundational
      work in network device benchmarking.</t>
    </section>
  </back>
</rfc>
