Every new technology arrives with its own unique set of challenges, and that fact still holds today. As Generative AI (GenAI) begins to take root in organizations, significant obstacles must be overcome to make these systems functional, efficient, and profitable — especially when scaling.
I recently delivered a pilot for a client with 37,000 employees. Large organizations like this face an overwhelming number of repetitive IT inquiries every day. Questions like “How do I reset my password?” or “Where do I submit an incident ticket?” are common, yet handling them manually, at a volume of over 10,000 per month, leads to inefficiencies, longer response times, a heavier workload for IT support staff, and, ultimately, a negative financial impact.
My goal was to enable the client on their GenAI journey and build an IT system capable of efficiently answering these common questions. While the solution was achieved with a straightforward RAG (Retrieval-Augmented Generation) implementation, a realization quickly set in: scaling this system was going to be expensive, especially as more use cases emerged.
There are numerous ways to cut costs here (e.g., Q&A caching), but one solution caught my attention during my daily morning reading of Medium articles: LLM routing.
At its core, routing is simply the process of selecting a path for traffic. But in this context, the decision-making needs to be far smarter.
The question is: do we really need to call GPT-4 every time a user asks a question?
The answer, more often than not, is no — and that’s where RouteLLM comes in.
RouteLLM: The Open-Source Solution

The latest and greatest Large Language Models (LLMs) excel at answering complex questions using context chunks provided via semantic search. However, users often ask mundane and simple questions when interacting with your GenAI solution.
Take our IT chatbot example: the question “Who do I call if I lose my work phone?” doesn’t require a sophisticated LLM to generate a good response. A simpler, more cost-effective model could handle this just as well.
That’s the premise behind RouteLLM: simpler questions should be handled by weaker models, while more complex queries are routed to stronger (and more expensive) models. The goal is to minimize costs while maintaining high-quality responses through efficient routing.
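Here’s a toy sketch of that idea in Python. The function names and the word-count heuristic are purely illustrative (they’re mine, not RouteLLM’s); the point is simply that a difficulty score and a threshold decide which model gets the call.

```python
# Toy illustration of LLM routing: cheap heuristic score + threshold.
# RouteLLM replaces the heuristic below with a learned router.

def estimate_difficulty(prompt: str) -> float:
    # Placeholder heuristic: longer, multi-part prompts count as "harder".
    return min(len(prompt.split()) / 20.0, 1.0)

def route(prompt: str, threshold: float = 0.5) -> str:
    # Only pay for the strong model when the prompt looks hard enough.
    return "gpt-4-turbo" if estimate_difficulty(prompt) >= threshold else "mixtral-8x7b"

print(route("How do I reset my password?"))  # simple question -> mixtral-8x7b
print(route("Compare our SSO options and draft a migration plan "
            "covering risks, costs, and a rollout timeline."))  # -> gpt-4-turbo
```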
But how does RouteLLM determine which model to use? The framework currently offers four different router implementations to tackle this problem. I’ll focus on the one that caught my eye: Matrix Factorization.
ML Is Still King
My father always told me, “Son, remember that Mathematics is the queen of sciences. Nothing happens without her knowing. She is everywhere.” Boy, was he right.
Reading “Matrix Factorization” at 6 am brought back memories of my 2019 recommender systems class. We were learning about Netflix, and movies, and predicting what star rating Franco would give Amazon’s “Rings of Power” adaptation (okay, I made that last part up — but if you know, you know).
While the term might sound intimidating, the concept is straightforward.

Matrix Factorization (MF) is a powerful ML technique with significant computational benefits, including storage efficiency and speed of execution.
Here’s an extremely simplistic breakdown:
1. You start with a large matrix V.
2. Matrices can be decomposed (factorized) into smaller matrices, whose dot product reconstitutes the original matrix, either exactly or approximately.
3. In this case, matrix V is decomposed into matrices W and H.
In the Netflix example:
1. Matrix V (target) represents user ratings for movies.
2. Matrix W captures how much each user enjoys each latent feature (e.g., fantasy, comedy).
3. Matrix H captures how strongly each movie exhibits those same features.
The idea is that you can train an ML model to approximate matrix V’s values by tweaking matrix W’s and matrix H’s values to minimize the difference between their dot product and the target values in matrix V. This enables the ML model to predict ratings users might give to movies they haven’t seen yet. (By the way, Franco would give “Rings of Power” 1 star at best.)
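To make this concrete, here’s a toy, self-contained numpy sketch (my own illustration, not anything from RouteLLM): gradient descent nudges W and H until their dot product matches the ratings we actually observed in V, and the reconstructed matrix fills in predictions for the ratings we haven’t seen.

```python
import numpy as np

# Rows = users, columns = movies, values = star ratings; 0 means "not rated yet".
V = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)
mask = V > 0                        # only fit the ratings we actually observed

n_users, n_movies, k = V.shape[0], V.shape[1], 2   # k latent features
rng = np.random.default_rng(0)
W = rng.random((n_users, k))        # how much each user likes each latent feature
H = rng.random((k, n_movies))       # how strongly each movie exhibits each feature

lr = 0.01
for _ in range(5000):
    diff = (W @ H - V) * mask       # error on observed entries only
    grad_W = diff @ H.T
    grad_H = W.T @ diff
    W -= lr * grad_W                # nudge both factors to shrink the error
    H -= lr * grad_H

print(np.round(W @ H, 1))           # reconstruction, including predicted ratings
```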
Now that we’ve set the stage, let’s see how this applies to saving what my daughter calls “a million-dollar bucks.” Thanks, Bluey.
Matrix Factorization in RouteLLM
For testing, the LMSYS Org team used GPT-4 Turbo as the strong model and Mixtral 8x7B as the weak one.
Following the factorization example we covered above, it’s easier to understand what they did next, isn’t it?
– Movies = LLMs
– Star ratings = which LLM wins on a given prompt
– Users = prompts
They trained an MF model on LLM pairs using preference data. This allowed the model to learn the strengths and weaknesses of different models and how they relate to user queries.
Preference data is curated public data where each point consists of a prompt and a comparison between the response quality of two models on that prompt — this could be a win for the first model, a win for the second model, or a tie.
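To give a feel for what that looks like, here’s a simplified sketch (the field names and the scoring function are my own; the real curated datasets are published on Hugging Face, and RouteLLM’s actual implementation differs in its details). Each record is essentially a prompt plus a verdict, and the router learns prompt and model embeddings whose score difference, passed through a sigmoid, predicts the strong model’s chance of winning:

```python
import numpy as np

# A simplified picture of preference data points (field names are hypothetical).
preference_data = [
    {"prompt": "How do I reset my password?", "winner": "tie"},
    {"prompt": "Write a formal proof that sqrt(2) is irrational.", "winner": "strong"},
]

def p_strong_wins(prompt_emb: np.ndarray,
                  strong_emb: np.ndarray,
                  weak_emb: np.ndarray) -> float:
    """Bradley-Terry-style win probability from learned embeddings.

    Training adjusts the model and prompt embeddings so these probabilities
    match the wins, losses, and ties recorded in the preference data.
    """
    score_strong = float(prompt_emb @ strong_emb)  # how well the strong model "fits" this prompt
    score_weak = float(prompt_emb @ weak_emb)
    return 1.0 / (1.0 + np.exp(-(score_strong - score_weak)))
```

At inference time, that predicted win probability is the score the router compares against a cost threshold to decide where to send the prompt.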
To improve performance, they used data augmentation techniques (golden-label datasets and an LLM judge) before training their router model. The results were impressive.
So, to summarize: an MF model was trained on a curated dataset to predict which LLM model would be best suited to answer a prompt.

As I stated before, we will keep focusing on the MF implementation. “Random” in the figures above represents a… well, random router, which chooses between GPT-4 and Mixtral at random. MT Bench is a benchmarking framework used to evaluate the performance of LLMs across multiple tasks.
Cheaper And (Almost) Just As Good.
On the non-augmented dataset, the MF approach achieved 95% of GPT-4’s performance using only 26% of GPT-4 calls — approximately 48% cheaper compared to a random baseline. With augmented data, the results improved further, halving the GPT-4 calls needed to achieve 95% performance, making it 75% cheaper than the random baseline.
Even better, the MF router performed well with other LLM pairs, matching commercial routers like Martian and Unify AI while being over 40% cheaper.
You don’t even need to train your own MF router; you’ll see these cost savings using their out-of-the-box solution. But if you want to customize it, you can use their router serving and evaluation framework on GitHub, and explore their artifacts and datasets on Hugging Face.
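For reference, here’s roughly what using the out-of-the-box MF router looks like, based on the usage pattern in the project’s README at the time of writing (the model names, provider keys, and threshold below come from their examples and may have changed since):

```python
import os
from routellm.controller import Controller  # pip install "routellm[serve,eval]"

os.environ["OPENAI_API_KEY"] = "sk-..."         # strong-model provider
os.environ["ANYSCALE_API_KEY"] = "esecret_..."  # weak-model provider (example)

client = Controller(
    routers=["mf"],                               # the Matrix Factorization router
    strong_model="gpt-4-1106-preview",
    weak_model="anyscale/mistralai/Mixtral-8x7B-Instruct-v0.1",
)

response = client.chat.completions.create(
    # "router-mf-<threshold>": prompts whose predicted win probability for the
    # strong model clears the threshold go to GPT-4, everything else to Mixtral.
    model="router-mf-0.11593",
    messages=[{"role": "user", "content": "Who do I call if I lose my work phone?"}],
)
print(response.choices[0].message.content)
```

The Controller mimics the OpenAI client interface, so dropping it into an existing chatbot is mostly a matter of swapping the client object; raising the threshold sends more traffic to the cheap model, while lowering it favors GPT-4.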
Conclusion
As companies grapple with the scaling and integration challenges of implementing LLM solutions, I firmly believe that traditional ML techniques will remain vital for solving specific problems, and in some cases, can enhance systems that utilize sophisticated models.
If you’re facing cost issues, LLM routing might be the solution you’ve been looking for.