Multi-LLM routing practice: How to save 60% of AI API costs (Sonnet 4.6 + Haiku 4.5 offload)

AI tools in action Claude LLM routing AI cost SME AI

Multi-LLM routing allocates tasks to different models according to complexity, and solves the problem of out-of-control AI API monthly fees through the cost difference between Haiku 4.5, Sonnet 4.6, and Opus 4.7. 5M tokens per month can save about 60% in actual testing.

Why does the AI bill at the end of the month get longer and uglier?

When AI API costs explode, it’s usually not because the models become more expensive, but because you hand everything over to the same big model.

We have seen the AI introduction path of many small and medium-sized enterprises: the first month is just writing copywriting, summarizing meeting minutes, and organizing customer service messages, and the monthly fee is about NT$3,000. In the second month, I started to access CRM, customer service mailbox, community scheduling, and internal knowledge base, and the token cost went up to NT$10,000. In the third month, with the addition of automated agent, batch rewriting, and list cleaning, the monthly fee will directly reach NT$30,000+.

The problem isn’t that the AI isn’t worth the money, it’s that the tasks aren’t graded.

Should I use Claude Sonnet 4.6 for a customer service letter? Maybe. Should I use Sonnet to divide the 3,000 list into “B2B / B2C / Unsure”? Not used in most cases. Changing the product title to a fixed format, extracting the company name, and filling in the JSON fields are not high-level reasoning questions.

Most people who do AI automation will step into this trap: first connect to the API, and when they see good results, they will throw all the processes to the strongest model. It went smoothly in the short term, but at the end of the month I discovered that every small task was priced at the premium labor rate.

The token cost can be thought of as AI’s “usage electricity bill.” Token is the text unit that the model reads and outputs, input token is the content you throw in, and output token is the content the model returns to you. The longer the mission, the more runs, and the more expensive models used, the higher the bill will be.

If the company is already working on Claude / Codex / Gemini tool combination, the next step should not just add tools, but break down the model usage rights: which tasks only require cheap models, and which tasks require high-end models.

The core value of multi-LLM routing is here: not to use less AI, but to use AI in the right place.

What is multi-LLM routing? Finished in one sentence

Multi-LLM routing is to “judge the difficulty of the task first, and then send the task to the corresponding model.”

Simple tasks will give you Haiku 4.5, medium tasks will give you Sonnet 4.6, and complex tasks will give you Opus 4.7. You can think of it like the work allocation in a company: data collection is handed over to assistants, standard analysis is handed over to senior specialists, and consultants are called in for major decisions.

This is not to pursue technical beauty, but to achieve a reasonable return on every token cost.

According to Anthropic’s official pricing and model documents, the Claude series models have different prices, speeds and capabilities. Please refer to Anthropic pricing and Claude models docs. rel="noopener" of external links is handled uniformly by the front-end template.

The cost formula is simple:

節省 % = 1 - (Σ 各模型費用) / (全 Sonnet 費用)

If all the original 5M tokens per month go to Sonnet, the cost will be about NT$15,000. After changing to 70% Haiku, 25% Sonnet, 5% Opus, the cost will be about NT$5,400 to NT$6,000. Save about NT$9,000 to NT$9,600, or about 60%.

That 60% isn’t magic, it comes mainly from one thing: 70%+ of the missions don’t actually require Sonnet.

We actually tested tasks such as list classification, copy format rewriting, field extraction, FAQ summaries, and customer service letter drafts. The results are clear: as long as the tasks have a fixed format, are low-risk, and have verifiable answers, Haiku 4.5’s performance is usually sufficient. What really requires Sonnet 4.6 are tasks that require judging tone, integrating multiple pieces of information, and producing content that can be directly used externally.

<img src=“https://images.aicycle.cc/2026-05/LLM-routing-reduce-api-cost/body-1.webp” alt=""Three-layer task offloading” title with Haiku Sonnet Opus three-layer routing diagram” loading=“lazy” decoding=“async” width=“1024” height=“1024” />

Task complexity is divided into three levels: Haiku / Sonnet / Opus. How to choose?

Model shunting is most afraid of relying on feeling. The correct approach is to first divide the task into three layers, and then write each layer into the workflow.

The key here is not “which model is the smartest”, but “which model is just enough.” For small and medium-sized enterprises, just enough is more important than the most powerful, because when there are tens of thousands of automated calls per month, the price difference per time will be magnified.

ClassificationExampleModelMonthly token estimateMonthly cost estimate
Tier 1 (trivial)Classification, drawing fields, rewriting format, simple translation, label judgmentHaiku 4.53.5M tokensAbout NT$900-1,200
Tier 2 (standard)Summary, reply, code patch, community copywriting, sales letter draftSonnet 4.61.25M tokensAbout NT$3,600-4,000
Tier 3 (strategic/long context)Architecture decisions, cross-file refactoring, complex reviews, strategic judgmentsOpus 4.70.25M tokensAbout NT$900-1,200

The tokens and fees in the above table are estimated using 5M tokens per month. The actual bill will be affected by the input/output ratio, exchange rate, cache hit rate, and official model pricing changes, so you need to recalculate it with your own logs before officially importing it.

Tier 1 (Trivial tasks) → Haiku 4.5

The judgment criteria for Tier 1 are very straightforward: the answer format is fixed, mistakes are easy to catch, and the task does not require in-depth reasoning.

For example:

These tasks are perfect for Haiku 4.5 because you want speed and cost, not deep thinking. As long as the prompt clearly writes the classification rules, outputs JSON schema, and returns unknown when it fails, the quality can usually be monitored.

We actually tested a batch of 1,000 form clues, and originally all of them were thrown into Sonnet for classification. The cost was not high but it was very wasteful. After switching to Haiku, the classification accuracy remained within the acceptable range and the cost dropped to a fraction of the original value. What really needs to be looked at manually are unknown and low confidence score samples.

Tier 2 (Standard Reasoning) → Sonnet 4.6

Tier 2 is the workhorse tier for most AI workflows. The task requires understanding context, selecting information, and controlling tone, but it does not fall short of strategic decision-making.

For example:

It is recommended to use Sonnet 4.6 for this layer because it has a better balance between quality, speed and cost. Especially for external content, customer service replies, and business letters, as long as the tone is inaccurate, it will cause brand costs. You cannot just look at the token price.

If you are planning a low-cost AI import path, Tier 2 will usually be the first process to go live. It can bring significant labor savings, and it is easier to calculate ROI.

Tier 3 (strategy / code review final review) → Opus 4.7

Tier 3 is a small but high-risk mission. The cost of one failure of these tasks may be much higher than the API fee saved.

For example:

This layer can be reserved for Opus 4.7, or whatever your company internally deems to be the highest-order model. You don’t need to use too much, the key is to put it where it’s most worthwhile.

A healthy cost structure is not “not using large models at all”, but “big models only do what large models should do”.

3 types of routing architecture: rules, LLM-as-Router, hybrid

Multi-LLM routing can be simple or complex. Small and medium-sized enterprises do not need to build a complete model governance platform from the beginning. They need to start with an observable, rollable, and explainable architecture.

Rule routing uses if-else, task type, token length, and risk level to determine the model.

Example rules:

if task_type in ["分類", "欄位抽取", "格式改寫"]:
  model = "Haiku 4.5"
elif token_count > 8000 or risk_level == "high":
  model = "Sonnet 4.6"
elif task_type in ["架構決策", "複雜 review"]:
  model = "Opus 4.7"
else:
  model = "Sonnet 4.6"

This method is the most stable, cheapest, and easiest to debug. You can directly look at the log: why a certain task was assigned to Haiku because task_type=category; why a certain task was upgraded to Sonnet because the input exceeded 8,000 tokens.

We recommend using rule routing first in 80% of scenarios. Especially for content production, customer service diversion, CRM data cleaning, and community copywriting reuse, the task types are relatively fixed, and there is no need to let another LLM judge each time.

B: LLM-as-Router (suitable for dynamic input)

LLM-as-Router first uses a cheap model as a classifier to determine which model should be used for the task.

For example, using Haiku 4.5 to read user input first, output:

{
  "tier": "tier_2",
  "model": "Sonnet 4.6",
  "reason": "需要整合多段客訴內容並產出對外回覆",
  "confidence": 0.86
}

This architecture is suitable for scenarios with very irregular input, such as customer service mailboxes, open forms, internal SLAck instructions, and agent task dispatch. It is more flexible, but it also requires one more model call, so you cannot cover all tasks without thinking.

Most people will encounter a pitfall when trying to implement LLM-as-Router: the router prompt is too abstract. Don’t ask “Is this task difficult?” but give clear rubrics, such as “whether multi-step reasoning is required”, “whether it is sent externally”, “whether it involves amount, law, and security” and “whether it exceeds 6,000 tokens.”

The most recommended hybrid architecture for production environments is rule-based and LLM fallback.

The method is to use clear rules to handle 70% to 80% of the tasks; only when the rules cannot be judged, confidence is insufficient, or the input is abnormal, Haiku 4.5 is called as the router. If the router is still not sure, upgrade to Sonnet 4.6.

A practical process is as follows:

StageJudgment methodResult
Layer 1task_type, token_count, risk_levelDirectly assigned to Haiku / Sonnet / Opus
Layer 2When the rules cannot be determined, use Haiku as the routerReturn tier, confidence, reason
Tier 3confidence < 0.75 or high riskUpgrade Sonnet
Tier 4Sonnet flags uncertain or high-impact decisionsUpgrade Opus or manual review

The advantage of the hybrid architecture is that the cost is controllable and it is not stuck by rigid rules. It is also easier to write into the AI-team workflow: each task has a preset model first, and then upgrade conditions.

How much is the actual saving? Monthly 5M tokens full table calculation

We use 5M tokens per month to create a common scenario for small and medium-sized enterprises. This company has customer service summaries, list categories, community copywriting, internal SOP Q&A, simple code patches, and runs hundreds to thousands of API calls every day.

Before is to complete all tasks in Sonnet 4.6:

ProjectModelToken proportionMonthly tokensMonthly cost estimate
All tasksSonnet 4.6100%5MAbout NT$15,000
Total-100%5MAbout NT$15,000

After is the rule route 70 / 25 / 5 split:

Task levelModelToken proportionMonthly tokensMonthly cost estimate
Tier 1 Trivial TasksHaiku 4.570%3.5MAbout NT$900-1,200
Tier 2 Standard TaskSonnet 4.625%1.25MAbout NT$3,600-4,000
Tier 3 Strategic MissionOpus 4.75%0.25MAbout NT$900-1,200
Total-100%5MAbout NT$5,400-6,000

Before / After comparison:

MetricsBefore: Full SonnetAfter: Multiple LLM routing
Monthly tokens5M5M
Model configurationSonnet 100%Haiku 70% / Sonnet 25% / Opus 5%
Monthly costAbout NT$15,000About NT$5,400-6,000
Monthly savings-About NT$9,000-9,600
Cost reduction-About 60%

Let’s look at it in business language:

ProjectNumbers
Monthly SavingsNT$9,000-9,600
Annual savingsNT$108,000-115,200
Can be exchanged forcontent assistant hours, ad testing budget, CRM cleanup project
Introduction cost recoveryIf the construction cost is NT$30,000, it will take about 3-4 months to recover

This is why AI costs should be viewed together with ROI. Looking at API costs alone, you may think you are only saving a few thousand dollars, but if you are splitting AI team ROI, the monthly fixed API cost savings will directly improve gross profit.

There is another important point: the prompt cache is not included here. If your workflow often loses the same system prompts, brand tone rules, and knowledge base summaries repeatedly, the prompt cache may further reduce costs.

<img src=“https://images.aicycle.cc/2026-05/LLM-routing-reduce-api-cost/body-2.webp” alt=""Cost Before/After” title with two bars showing the comparison from 15,000 to 6,000” loading=“lazy” decoding=“async” width=“1024” height=“1024” />

3 common mistakes in routing implementation

Multi-LLM routing is not about stuffing Haiku everywhere. The real difficulty is not in diversion, but in knowing when not to save.

Mistake No. 1: Over-routing Haiku.

Haiku 4.5 is very suitable for classification, column drawing, and formatting, but complex reasoning can cause problems. Saving a small amount of money can lead to big things, usually occurring in these scenarios: external customer service replies, summary of contract terms, technical decisions, cross-file analysis, and tasks that require long context windows.

The solution is to write quality indicators into the rules. As long as the task has high-risk, irreversible, external sending, amount judgment, legal or security content, do not go directly to Haiku. Even if you go to Haiku first, you still have to send Sonnet for review.

The second mistake: not doing fallback.

Many teams only wrote “Haiku for classification tasks” but did not write “What to do if Haiku fails”. The result is that when the JSON format is wrong, the confidence score is low, the input is too long, and the answer is blank, the process hangs directly.

A basic fallback rule should look like this:

Haiku 回傳格式錯誤 → retry 1 次
retry 後仍失敗 → 升級 Sonnet
Sonnet 仍不確定 → 標記人工 review
高風險任務 → 不自動發送,只產生草稿

Mistake No. 3: Ignore prompt cache.

Some teams spend a lot of time doing model distribution, but re-send 5,000 tokens of brand rules, product knowledge, and customer service SOPs each time. At this time, the savings of prompt cache may be greater than model offloading.

Especially in situations where there are a large number of repeated calls within a 5-minute TTL (time to live, cache validity time), such as rewriting 200 product descriptions in batches, answering 500 questions from the same knowledge base, and producing 100 social copywriting pieces from the same set of brand rules, the cache hit rate will directly affect the bill.

It is recommended to design cache, routing, and fallback together instead of splitting them into three islands.

<img src=“https://images.aicycle.cc/2026-05/LLM-routing-reduce-api-cost/body-3.webp” alt=""Fallback Mechanism” title with Haiku Sonnet Opus model complementation process diagram” loading=“lazy” decoding=“async” width=“1024” height=“1024” />

How to control quality? Monthly sampling review formula

Saving costs is only the first step. If quality is not controlled, the money saved will eventually be used to make up for mistakes.

We recommend sampling 50 Haiku routing results every month for manual review. The sampling should cover different task types, such as 20 classifications, 10 field extractions, 10 format rewritings, and 10 simple translations.

Quality formula:

錯答率 = 錯誤樣本數 / 抽樣樣本數

Judgment rules:

Wrong answer rateProcessing method
< 5%Continue routing to Haiku
5%-10%Adjust prompt, add examples, and observe next month
> 10%Upgrade to Sonnet and redefine mission boundaries

Why 5%? Because AI automation for most small and medium-sized enterprises is not a research project, but an operational process. A 5% wrong answer rate means that 5 corrections will be required for every 100 times. This is acceptable in low-risk tasks, but not necessarily in external information, quotations, contracts, medical, legal, and information security scenarios.

Quality control should not only look at “right or wrong”, but also look at three indicators:

IndicatorsDefinitionRisk Signals
Format success rateWhether it conforms to JSON / Markdown / field schemaLower than 98%, you need to adjust prompt
Upgrade rateThe proportion of Haiku being fallbacked to SonnetA sudden increase means that the task input has changed
Manual correction timeHow long does it take for personnel to correct each correctionIf the time saved exceeds the time saved, the route is not cost-effective

Putting these metrics into monthly operational reviews is more useful than just looking at API bills. You’ll know which tasks are really suitable for cheap models and which ones just look cheap.

If you are already calculating AI import ROI, it is recommended to put “API cost savings”, “manual correction time” and “rework caused by wrong answers” into the same table. Only in this way can the net benefit be seen.

Frequently Asked Questions (FAQ) - at least 4 questions (trigger FAQPage schema)

Q1 Can Haiku 4.5 really replace Sonnet?

cannot be completely replaced. Haiku 4.5 is suitable for low-risk tasks with fixed formats and verifiable answers, such as classification, field extraction, short article rewriting, and simple translation.

Sonnet 4.6 is still suitable for tasks such as standard inference, external content, customer service responses, and code patches. The correct approach is not to replace Sonnet with Haiku, but to let Haiku eat 60% to 70% of the menial tasks.

Q2 How to determine which model to use for rule routing?

First look at the four fields: task type, token length, risk level, and whether to send it externally.

For classification, slot drawing, and format rewriting, Haiku is usually used. Abstracts, replies, copywriting, and code patches are all handled by Sonnet. For long context, multi-file reasoning, architectural decisions, and complex reviews, use Opus or manual review.

The simplest way to start is to establish a task_type → default_model comparison table, plus upgrade conditions, such as tokens exceeding 8,000, risk high, and confidence lower than 0.75.

Q3 What happens if the routing is wrong? How to set fallback?

Wrong routing may have three consequences: quality degradation, process failure, and external output errors. Low-risk tasks can be solved by retry, while high-risk tasks must be upgraded or manually reviewed.

The recommended fallback chain is Haiku → Sonnet → Opus / human review. When Haiku returns a wrong format, lacks confidence, the input is too long, or the content involves amounts or laws, directly upgrade to Sonnet. If Sonnet still flags uncertainty, do not send automatically.

Q4 Do I need to write my own router for multiple LLMs? Are there any ready-made tools?

You don’t have to start from scratch. Small teams can first write rule routing in n8n, Make, Zapier, LangChain, LlamaIndex or their own backend. The focus is not on the tool name, but on the completeness of logs, fallback, and quality sampling.

If your process already has fixed task types, it is fastest to write if-else yourself. If the input is very dynamic, add LLM-as-Router.

Q5 When should you not do long LLM routing?

If your monthly API cost is less than NT$1,000, don’t do complex routing yet. At this time, you should prioritize organizing prompts, reducing unnecessary input, and using cache.

When the monthly fee stably exceeds NT$5,000, or the same type of tasks are run hundreds of times a day, multiple LLM routes will be significantly recycled.

Further reading

Write multiple LLM routing rules into your AI-team workflow and change it from “Full Sonnet” to “Haiku / Sonnet / Opus layered”. If you want to import into customer service, content, CRM or internal knowledge base processes, please see AIcycle Service or contact the team, we will use your actual token logs to calculate a version of Before / After.