# Best Practices

Follow these guidelines to maximize performance and reliability when running models on your Cube.
## Batching

- **Llama:** Send up to 8 concurrent chat requests per RPC.
- **Whisper:** Send up to 4 simultaneous audio streams.
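One way to stay within the concurrency limit is to gate requests behind a semaphore. This is a minimal sketch using Python's `asyncio`; `send_chat_request` is a hypothetical stand-in for your actual Cube chat RPC call, not a real client API.

```python
import asyncio

MAX_CONCURRENT = 8  # Llama limit: up to 8 concurrent chat requests per RPC


async def send_chat_request(prompt: str) -> str:
    """Hypothetical stand-in for the real Cube chat RPC."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"reply:{prompt}"


async def batched_chat(prompts):
    # The semaphore caps in-flight requests at MAX_CONCURRENT;
    # extra requests wait until a slot frees up.
    sem = asyncio.Semaphore(MAX_CONCURRENT)

    async def limited(prompt):
        async with sem:
            return await send_chat_request(prompt)

    return await asyncio.gather(*(limited(p) for p in prompts))


replies = asyncio.run(batched_chat([f"q{i}" for i in range(20)]))
```

For Whisper streams, the same pattern applies with `asyncio.Semaphore(4)`.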
## Prompt Size

- **Llama:** Keep the total context (prompt + history) under 32,000 tokens. Truncate the oldest messages first to stay within the limit.
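Truncating from the oldest end can be sketched as below. The ~4 characters-per-token heuristic is an assumption for illustration only; use your model's real tokenizer to count tokens in production.

```python
MAX_CONTEXT_TOKENS = 32_000  # Llama context limit (prompt + history)


def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token.
    # Replace with a real tokenizer count in production.
    return max(1, len(text) // 4)


def trim_history(messages, prompt, limit=MAX_CONTEXT_TOKENS):
    """Drop the oldest messages until prompt + history fits under the limit."""
    budget = limit - approx_tokens(prompt)
    kept, total = [], 0
    # Walk newest-to-oldest so the most recent messages survive.
    for msg in reversed(messages):
        cost = approx_tokens(msg)
        if total + cost > budget:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))  # restore chronological order
```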
## Audio Length

- **Whisper:** Split audio into chunks of 30 seconds or less for the best speed and accuracy.
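For raw PCM audio, 30-second chunking is a byte-slicing exercise. A minimal sketch, assuming mono PCM with a fixed sample width (adapt for your container format or use an audio library):

```python
CHUNK_SECONDS = 30  # Whisper works best on chunks of 30 s or less


def chunk_pcm(samples: bytes, sample_rate: int, bytes_per_sample: int = 2):
    """Split raw mono PCM audio into chunks of at most CHUNK_SECONDS each."""
    chunk_bytes = CHUNK_SECONDS * sample_rate * bytes_per_sample
    return [samples[i:i + chunk_bytes]
            for i in range(0, len(samples), chunk_bytes)]
```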
## Error Handling

- **HTTP 429 / 503:** Retry with exponential back-off (1 s → 2 s → 4 s).
- **Timeouts:**
  - Llama: set the client timeout to 60 seconds.
  - Whisper: set the client timeout to 120 seconds.
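The retry policy above can be sketched as a generic wrapper. `HTTPError` and `request` are hypothetical placeholders for whatever your HTTP client raises and calls; only the retry logic and the 1 s → 2 s → 4 s schedule are taken from the guidance above.

```python
import time

RETRYABLE_STATUSES = {429, 503}
BACKOFF_SECONDS = [1, 2, 4]   # exponential back-off: 1s -> 2s -> 4s
LLAMA_TIMEOUT_S = 60          # pass as your client's per-request timeout
WHISPER_TIMEOUT_S = 120


class HTTPError(Exception):
    """Placeholder for your HTTP client's status-code exception."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_retry(request, sleep=time.sleep):
    """Call request(); retry on 429/503 with back-off, re-raise otherwise."""
    for attempt in range(len(BACKOFF_SECONDS) + 1):
        try:
            return request()
        except HTTPError as exc:
            # Give up on non-retryable statuses or after the last back-off.
            if exc.status not in RETRYABLE_STATUSES or attempt == len(BACKOFF_SECONDS):
                raise
            sleep(BACKOFF_SECONDS[attempt])
```

Injecting `sleep` keeps the wrapper testable; in real code the default `time.sleep` is used.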