Inference Cards



Why

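When someone says “I run Qwen 3.6 at 25 tokens per second”, or makes any similar performance claim about their self-hosted LLM setup, this is only meaningful if we know several other things: which weights and quantization they run, on what hardware, and with which inference engine and version.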

Knowing only the generation (i.e. decode) speed at a shallow context depth is also not enough to tell whether agentic workloads will be usable on a given setup. Prefill (i.e. prompt processing) speed matters because agents spend a lot of time reading stuff. It also matters how speeds change as context depth increases, because agents do most of their work with tens of thousands of tokens in the context window. Finally, if you’re trying to serve multi-agent (or multi-user) workloads, it matters how these numbers change with multiple concurrent requests. (And no, you cannot guesstimate any of these other numbers from “25 TPS generation speed”, because different hardware and inference engines all have different performance characteristics in this several-dimensional space.)

With this fuller picture, we can more reasonably compare your computer to my computer. We can talk about which workloads are usable interactively and which will crawl at “run overnight” speed. We can spot when something is broken, and reasonably ask “Does this change make it faster?”, knowing what “it” even was to begin with. We also get a sense of what quality of output to expect from the LLM.

In online communities for self-hosted inference, most people don’t bother to communicate most of this information, and the quality of discussion suffers! We need a compact, easy way to share it, so that more people will.

Now I follow in some big footsteps to propose a deliberately under-specified plaintext markup format. I hope it is highly readable and easy for new people to pick up.

Inference Cards

Think of baseball cards, but for computers running LLMs. An inference card shows the most important information for understanding a setup and its performance. You can share them in a code block, or as a screenshot if you hate searching / accessibility. Put inference cards in your pull requests, Reddit posts, or wherever you talk about your LLM life.

Here is the world’s first inference card, for my own slop machine.

+----------------Inference Card v1-----------------+
| Who+when: cmart.blog, 2026-06-25                 |
| Weights repo: hf.co/unsloth/Qwen3.6-27B-GGUF     |
| Quantization: UD-Q4_K_XL                         |
| Platform: Thinkpad T480, Debian 13, eGPU dock    |
| Accelerator+mem: AMD Radeon AI Pro 9700, 32 GB   |
| Engine+ver: llama.cpp b9733                      |
| GPU runtime+ver: ROCm 7.2.4                      |
|----------------------Tok/s-----------------------|
| Concurrency  |           Context depth           |
|  ↓ | Stage   |  Empty |   4096 |  16384 |  65536 |
|----|---------|--------|--------|--------|--------|
|  1 | prefill |    667 |    669 |    640 |    474 |
|  1 | decode  |   32.1 |   24.8 |   26.6 |   22.9 |
|  2 | prefill |    519 |    588 |        |        |
|  2 | decode  |   23.3 |   16.2 |        |        |
|  4 | prefill |    526 |    537 |        |        |
|  4 | decode  |   16.4 |   9.80 |        |        |
+------------------Config / Notes------------------+

Serving with:

./llama-server \
    --hf-repo unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
    --gpu-layers all \
    --spec-type draft-mtp \
    --spec-draft-n-max 4 \
    --chat-template-file ~/Qwen-Fixed-Chat-Templates/chat_template.jinja

Measuring with:

uv run llama-benchy \
    --base-url http://localhost:8080/v1 \
    --api-key "" \
    --model "unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL" \
    --tokenizer "Qwen/Qwen3-32B" \
    --pp 4096 \
    --tg 128 \
    --depth 0 4096 16384 65536 \
    --concurrency 1 2 4 \
    --runs 1 \
    --latency-mode generation

GPU is under-volted with increased power cap via https://github.com/kyuz0/amd-r9700-vllm-toolboxes/blob/main/TUNING.md

+----------------End Config / Notes----------------+
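As a rough sketch of how to read the numbers on a card like this one: treat each cell as a steady rate, and estimate how long a single request spends reading new tokens and then writing a reply. The function name `turn_seconds` is made up for illustration, the figures are the concurrency-1 rows of the card above, and the 4096 / 128 token counts mirror the `--pp` and `--tg` values used for measuring. Real requests will vary (caching, speculative decoding, thermal state), so take this as back-of-envelope arithmetic only.

```python
def turn_seconds(new_prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Estimated seconds for one request: read the new tokens, then generate the reply.

    Assumes both rates stay constant for the whole request, which is a simplification.
    """
    return new_prompt_tokens / prefill_tps + output_tokens / decode_tps


# Concurrency-1 rows from the card above: depth -> (prefill tok/s, decode tok/s).
# Depth 0 is the "Empty" column.
CONCURRENCY_1 = {
    0: (667, 32.1),
    4096: (669, 24.8),
    16384: (640, 26.6),
    65536: (474, 22.9),
}

for depth, (prefill_tps, decode_tps) in CONCURRENCY_1.items():
    secs = turn_seconds(4096, 128, prefill_tps, decode_tps)
    print(f"depth {depth:>6}: ~{secs:.1f} s to read 4096 tokens and write 128")
```

The same function with a longer reply (say 2000 output tokens) shows how quickly decode speed starts to dominate the wait.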

FAQ

How do you make an inference card? You copy mine from this page and edit the fields. If you hate overtype mode, paste my card into your LLM and ask it to fill in your details.
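If you would rather not count spaces by hand while editing, here is a minimal sketch of a helper that pads a field line so the right-hand border lines up. The name `card_line` is hypothetical, and the width of 52 characters is taken from the example card on this page; nothing about the format requires it.

```python
CARD_WIDTH = 52  # width of the example card's lines, borders included


def card_line(label, value, width=CARD_WIDTH):
    """Format one 'Label: value' field line with a left and right border.

    Short lines are padded with spaces; an over-long value just makes the
    line wider than the rest of the card.
    """
    text = f"| {label}: {value}"
    return text.ljust(width - 2) + " |"


for label, value in [
    ("Quantization", "UD-Q4_K_XL"),
    ("Engine+ver", "llama.cpp b9733"),
]:
    print(card_line(label, value))
```

An over-long value is left alone instead of being truncated, in keeping with "make the card wider" being a perfectly fine answer.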

You’re using, e.g., a fork of vLLM? Then specify the repo URL and commit hash instead of the release version.

You ran out of space on the card? Add another line or make the card wider. There are no rules.

The config / notes section is just free text. Include whatever someone needs to know in order to make sense of your setup, perhaps starting with the commands you use to run it.

Should someone be able to read an inference card, buy some hardware, and reproduce the numbers on the card? Ideally yes, but in practice, maybe not. Reproducibility is still a hard problem in high-performance computing. For example: what’s the temperature of the air that your GPU is sucking in to cool itself? That affects performance, but I’m not expecting folks to measure it. Inference cards are a low rung on the reproducibility ladder.

Can someone mislead or lie on their inference card? Of course, it’s the internet! Each of us is only as good as our word.

The speed table is a parameter sweep across concurrency and context depth. If your setup doesn’t handle concurrent requests well, maybe don’t include any lines for concurrency greater than 1.
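Because the sweep has two axes plus a stage, one natural way to hold it in code is a dictionary keyed by (concurrency, stage, depth). The sketch below parses the table rows from the card on this page; `parse_speed_table` is a made-up name, and the parsing rules (pipe-separated cells, “Empty” meaning depth 0, blank cells meaning “not measured”) are my reading of this one card, not a specification.

```python
CARD_TABLE = """\
| Concurrency  |           Context depth           |
|  ↓ | Stage   |  Empty |   4096 |  16384 |  65536 |
|----|---------|--------|--------|--------|--------|
|  1 | prefill |    667 |    669 |    640 |    474 |
|  1 | decode  |   32.1 |   24.8 |   26.6 |   22.9 |
|  2 | prefill |    519 |    588 |        |        |
|  2 | decode  |   23.3 |   16.2 |        |        |
|  4 | prefill |    526 |    537 |        |        |
|  4 | decode  |   16.4 |   9.80 |        |        |
"""


def parse_speed_table(text):
    """Return {(concurrency, stage, depth): tok/s}, skipping blank cells."""
    depths = None
    speeds = {}
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) < 3:
            continue  # the two-cell title row above the column headers
        if cells[1] == "Stage":
            # Column header row; "Empty" is treated as a depth of 0 tokens.
            depths = [0 if c == "Empty" else int(c) for c in cells[2:]]
            continue
        if depths is None or cells[1] not in ("prefill", "decode"):
            continue  # separator rows and anything unrecognised
        concurrency = int(cells[0])
        for depth, cell in zip(depths, cells[2:]):
            if cell:
                speeds[(concurrency, cells[1], depth)] = float(cell)
    return speeds


speeds = parse_speed_table(CARD_TABLE)
print(speeds[(1, "decode", 65536)])
print(sorted({c for c, _, _ in speeds}))
```

Missing combinations are simply absent from the dictionary, which matches leaving cells blank on the card.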

Should this be, e.g., JSON? Sure, maybe, but that’s less readable. This is a sloppy first attempt. If you feel inspired to fork and improve this idea, or build tooling around it, please do.
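If you do want JSON for tooling, the plaintext card converts easily. Here is a minimal sketch that turns the header lines of the card on this page into a JSON object. The function name `header_fields` and the “split on the first colon-space” rule are assumptions for illustration; the format itself does not define them.

```python
import json

CARD_HEADER = """\
| Who+when: cmart.blog, 2026-06-25                 |
| Weights repo: hf.co/unsloth/Qwen3.6-27B-GGUF     |
| Quantization: UD-Q4_K_XL                         |
| Platform: Thinkpad T480, Debian 13, eGPU dock    |
| Accelerator+mem: AMD Radeon AI Pro 9700, 32 GB   |
| Engine+ver: llama.cpp b9733                      |
| GPU runtime+ver: ROCm 7.2.4                      |
"""


def header_fields(text):
    """Map each 'Label: value' header line to a dict entry."""
    fields = {}
    for line in text.splitlines():
        body = line.strip().strip("|").strip()
        # Split on the first ": " only, so values may contain colons.
        key, sep, value = body.partition(": ")
        if sep:
            fields[key] = value
    return fields


print(json.dumps(header_fields(CARD_HEADER), indent=2))
```

The card stays the thing humans read and paste; the JSON is a derived view for scripts.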

Is “Inference Card” a confusing name because people refer to GPUs as ‘cards’? Maybe, but the phrase is semi-novel.

I welcome any feedback or complaints. Send them to inference-cards at cmart dot blog. Heck, send me your inference card.