Benchmarking
Kortex ships a benchmark harness that measures the cost and routing-quality advantage of Kortex's heuristic router over static model assignment strategies. Run it before deploying to production to quantify the routing benefit for your specific workload.
Quick Start
import asyncio

from kortex.benchmark.harness import BenchmarkHarness
from kortex.core.router import ProviderModel

# Candidate models, described purely by cost, latency, and capability metadata.
models = [
    ProviderModel(
        provider="openai", model="gpt-4o-mini",
        cost_per_1k_input_tokens=0.00015, cost_per_1k_output_tokens=0.0006,
        avg_latency_ms=250, capabilities=["reasoning", "analysis"], tier="fast",
    ),
    ProviderModel(
        provider="anthropic", model="claude-sonnet-4-20250514",
        cost_per_1k_input_tokens=0.003, cost_per_1k_output_tokens=0.015,
        avg_latency_ms=800,
        capabilities=["reasoning", "analysis", "code_generation", "content_generation"],
        tier="balanced",
    ),
    ProviderModel(
        provider="anthropic", model="claude-opus-4-20250514",
        cost_per_1k_input_tokens=0.015, cost_per_1k_output_tokens=0.075,
        avg_latency_ms=2000,
        capabilities=["reasoning", "analysis", "code_generation", "content_generation"],
        tier="powerful",
    ),
]


async def main():
    harness = BenchmarkHarness(models)
    report = await harness.full_benchmark()
    print(report.to_markdown())
    print(report.summary)


asyncio.run(main())
No API keys or network access are required: the harness works entirely from the cost and latency metadata attached to each model.
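To make "works entirely from metadata" concrete, here is a minimal sketch of how a per-task cost could be estimated from the per-1k-token prices alone. It does not import Kortex; `ModelPricing` and `estimate_cost` are hypothetical names used only for illustration, and the harness's actual cost model may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    # Hypothetical stand-in for the pricing fields of ProviderModel.
    model: str
    cost_per_1k_input_tokens: float
    cost_per_1k_output_tokens: float
    avg_latency_ms: int


def estimate_cost(m: ModelPricing, input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of one task from token counts and list prices."""
    input_cost = (input_tokens / 1000) * m.cost_per_1k_input_tokens
    output_cost = (output_tokens / 1000) * m.cost_per_1k_output_tokens
    return input_cost + output_cost


# Prices copied from the Quick Start model list.
mini = ModelPricing("gpt-4o-mini", 0.00015, 0.0006, 250)
opus = ModelPricing("claude-opus-4-20250514", 0.015, 0.075, 2000)

# Assumed task size: 2000 input tokens and 500 output tokens.
for m in (mini, opus):
    cost = estimate_cost(m, input_tokens=2000, output_tokens=500)
    print(f"{m.model}: ${cost:.5f} at ~{m.avg_latency_ms}ms")
```

Because nothing here calls a provider, the estimate is deterministic and free to compute, which is what makes an offline benchmark of this kind possible.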
Task Datasets
Three pre-built datasets cover different real-world workload profiles:
| Dataset | Tasks | Profile |
|---|---|---|
| `mixed` | 100 | 40% simple / 35% moderate / 25% complex |
| `cost_sensitive` | 100 | All have cost ceiling constraints |
| `latency_sensitive` | 100 | All have tight latency SLA requirements |
from kortex.benchmark.harness import TaskDataset
mixed = TaskDataset.mixed_workload(n=100)
cost = TaskDataset.cost_sensitive(n=100)
latency = TaskDataset.latency_sensitive(n=100)
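As a rough illustration of what a 40/35/25 mix means in practice, the sketch below builds a list of labelled tasks in those proportions. `SketchTask` and `mixed_workload_sketch` are hypothetical and are not the Kortex task type or the implementation of `TaskDataset.mixed_workload`; only the proportions come from the table above.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchTask:
    # Hypothetical task record used only for this illustration.
    task_id: int
    complexity: str


def mixed_workload_sketch(n: int = 100, seed: int = 0) -> list[SketchTask]:
    """Build n tasks split roughly 40% simple / 35% moderate / 25% complex."""
    shares = [("simple", 40), ("moderate", 35), ("complex", 25)]
    labels: list[str] = []
    for name, pct in shares:
        labels.extend([name] * (n * pct // 100))
    # Integer division can leave a shortfall; pad with simple tasks up to n.
    labels = (labels + ["simple"] * n)[:n]
    # Shuffle with a fixed seed so repeated runs see the same task order.
    random.Random(seed).shuffle(labels)
    return [SketchTask(i, label) for i, label in enumerate(labels)]


tasks = mixed_workload_sketch(100)
print(Counter(t.complexity for t in tasks))
```

Seeding the shuffle keeps benchmark runs comparable across invocations, which matters when comparing routing strategies on the same workload.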
Baseline Strategies
The harness compares Kortex routing against three static baselines:
| Strategy | Behaviour |
|---|---|
| `cheapest` | Always routes to the cheapest registered model |
| `strongest` | Always routes to the most powerful registered model |
| `random` | Random model selection (lower bound) |
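The three baselines in the table can be sketched as plain selection functions. This is an illustration under stated assumptions, not the harness's code: `SketchModel`, `TIER_RANK`, and the `pick_*` helpers are hypothetical, "cheapest" is assumed to mean the lowest combined input and output price, and "strongest" is assumed to follow the tier order fast < balanced < powerful.

```python
import random
from dataclasses import dataclass

# Assumed tier ordering; the real harness may rank models differently.
TIER_RANK = {"fast": 0, "balanced": 1, "powerful": 2}


@dataclass(frozen=True)
class SketchModel:
    # Hypothetical stand-in for the relevant ProviderModel fields.
    model: str
    cost_per_1k_input_tokens: float
    cost_per_1k_output_tokens: float
    tier: str


def pick_cheapest(models):
    """Static baseline: always the model with the lowest combined list price."""
    return min(
        models,
        key=lambda m: m.cost_per_1k_input_tokens + m.cost_per_1k_output_tokens,
    )


def pick_strongest(models):
    """Static baseline: always the model in the highest tier."""
    return max(models, key=lambda m: TIER_RANK[m.tier])


def pick_random(models, rng):
    """Lower-bound baseline: a uniformly random model."""
    return rng.choice(models)


MODELS = [
    SketchModel("gpt-4o-mini", 0.00015, 0.0006, "fast"),
    SketchModel("claude-sonnet-4-20250514", 0.003, 0.015, "balanced"),
    SketchModel("claude-opus-4-20250514", 0.015, 0.075, "powerful"),
]

print(pick_cheapest(MODELS).model)
print(pick_strongest(MODELS).model)
print(pick_random(MODELS, random.Random(0)).model)
```

None of the baselines look at the task itself, which is the point of the comparison: any gain the router shows over them comes from per-task decisions.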
CLI Commands
Run the full benchmark suite
kortex benchmark run
kortex benchmark run --dataset cost_sensitive
kortex benchmark run --dataset latency_sensitive
kortex benchmark run --output results.json
Compare routing vs. a baseline under a specific policy
kortex benchmark compare --policy examples/policies/cost_optimized.toml --baseline cheapest
kortex benchmark compare --policy examples/policies/quality_first.toml --baseline strongest
Report Format
BenchmarkReport.to_markdown() produces a table like:
| Dataset | Kortex Cost | Baseline Cost | Savings | Kortex P95 | Baseline P95 |
|------------------|-------------|---------------|---------|------------|--------------|
| mixed | $0.0412 | $0.1850 | 77.7% | 800ms | 2000ms |
| cost_sensitive | $0.0087 | $0.0450 | 80.7% | 250ms | 800ms |
| latency_sensitive| $0.0031 | $0.1850 | 98.3% | 250ms | 2000ms |
BenchmarkReport.summary is a single human-readable sentence suitable for logging.
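The Savings column in the sample table is consistent with the usual relative-savings formula, (baseline − Kortex) / baseline. The sketch below recomputes it from the sample costs and also shows one common way to take a P95, the nearest-rank method. Both helpers are hypothetical; the percentile method the harness actually uses is not stated here, so treat that part as an assumption.

```python
def savings_pct(kortex_cost: float, baseline_cost: float) -> float:
    """Relative cost savings of Kortex routing versus a baseline, in percent."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return (baseline_cost - kortex_cost) / baseline_cost * 100.0


def p95_nearest_rank(latencies_ms: list[int]) -> int:
    """95th percentile by the nearest-rank method (assumed, not confirmed)."""
    if not latencies_ms:
        raise ValueError("latencies_ms must not be empty")
    ordered = sorted(latencies_ms)
    # ceil(0.95 * n) computed with integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Costs copied from the sample report table above.
sample_rows = {
    "mixed": (0.0412, 0.1850),
    "cost_sensitive": (0.0087, 0.0450),
    "latency_sensitive": (0.0031, 0.1850),
}
for name, (kortex_cost, baseline_cost) in sample_rows.items():
    print(f"{name}: {savings_pct(kortex_cost, baseline_cost):.1f}%")

# Hypothetical latency sample: 60 tasks on a 250ms model, 40 on an 800ms model.
print(p95_nearest_rank([250] * 60 + [800] * 40))
```

Because per-task latency comes from each model's `avg_latency_ms`, a P95 computed this way always equals the latency of one of the registered models, which matches the values shown in the sample table.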
Programmatic API
import asyncio

from kortex.benchmark.harness import BaselineStrategy, BenchmarkHarness, TaskDataset


async def main():
    # `models` is the ProviderModel list defined in Quick Start.
    harness = BenchmarkHarness(models)
    dataset = TaskDataset.mixed_workload(n=100)

    # Run individual phases
    kortex_run = await harness.run_kortex(dataset)
    baseline_run = await harness.run_baseline(dataset, BaselineStrategy.CHEAPEST)

    # Compare
    comparison = harness.compare(kortex_run, baseline_run)
    print(f"Cost savings: {comparison.cost_savings_pct:.1f}%")
    print(f"Routing failures: {comparison.kortex_routing_failures}")


asyncio.run(main())
Running the Example
This registers 6 models across 3 tiers, runs the full suite, and prints a markdown report.