
This package routes chat requests between small and large LLMs based on prompt classification. It dynamically selects the most suitable model for the complexity of each user prompt, so simple requests are handled by the small model while more demanding ones escalate to the large model, and the conversation context is carried over between them.
Installation
You can install the package using pip:
pip install llm-predictive-router
Example Usage
from llm_predictive_router import LLMRouter

# Define the model configuration
config = {
    "classifier": {
        "model_id": "DevQuasar/roberta-prompt_classifier-v0.1"
    },
    # The entity name should match the predicted label from your prompt classifier
    "small_llm": {
        "escalation_order": 0,
        "url": "http://localhost:1234/v1",
        "api_key": "lm-studio",
        "model_id": "lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf",
        "max_ctx": 4096
    },
    # The entity name should match the predicted label from your prompt classifier
    "large_llm": {
        "escalation_order": 1,
        "url": "http://localhost:1234/v1",
        "api_key": "lm-studio",
        "model_id": "lmstudio-community/Meta-Llama-3-70B-Instruct-GGUF/Meta-Llama-3-70B-Instruct-Q4_K_M.gguf",
        "max_ctx": 8192
    }
}

router = LLMRouter(config)

# Simple prompt -> routed to "small_llm"
response, context, selected_model = router.chat(
    "Hello",
    temperature=0.5,  # Lower temperature for more focused responses
    max_tokens=100,   # Limit the response length
    verbose=True
)

# Another simple prompt -> routed to "small_llm"
response, context, selected_model = router.chat(
    "Tell me a story about a cat",
    curr_ctx=context,                  # Carry over the chat history
    model_store_entry=selected_model,
    temperature=0.5,                   # Lower temperature for more focused responses
    max_tokens=512,                    # Limit the response length
    verbose=True
)

# The default prompt classifier still treats this as a generic, simple prompt -> routed to "small_llm"
response, context, selected_model = router.chat(
    "Now explain the biology of the cat",
    curr_ctx=context,
    model_store_entry=selected_model,
    temperature=0.5,  # Lower temperature for more focused responses
    max_tokens=512,   # Limit the response length
    verbose=True
)

# This escalates the model -> routed to "large_llm", as we are getting into specific domain details
response, context, selected_model = router.chat(
    "Get into the details of his metabolism, especially interested in the detailed role of the liver",
    curr_ctx=context,
    model_store_entry=selected_model,
    temperature=0.5,  # Lower temperature for more focused responses
    max_tokens=512,   # Limit the response length
    verbose=True
)
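The entity names in the configuration ("small_llm", "large_llm") are expected to match the labels predicted by the prompt classifier. If you want to check which label a given prompt maps to before routing it, you can query the classifier directly with the Hugging Face transformers library. The snippet below is a minimal sketch under that assumption; the exact label strings and output format depend on the classifier model.

from transformers import pipeline

# Load the same prompt classifier referenced in the router config
prompt_classifier = pipeline(
    "text-classification",
    model="DevQuasar/roberta-prompt_classifier-v0.1"
)

# Inspect the label assigned to a prompt before routing it.
# Assumption: the labels line up with the entity names in the config,
# e.g. "small_llm" for simple prompts and "large_llm" for complex ones.
print(prompt_classifier("Hello"))
print(prompt_classifier("Explain the detailed role of the liver in feline metabolism"))

If the printed labels differ from the entity names in your config, rename the config entries (or swap in your own classifier) so the two line up.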
PyPI
https://pypi.org/project/llm-predictive-router
GitHub
https://github.com/csabakecskemeti/llm_predictive_router_package