Best Practices

Follow these guidelines to maximize performance and reliability when running models on your Cube.

Batching

  • Llama: Send up to 8 concurrent chat requests per RPC.

  • Whisper: Send up to 4 simultaneous audio streams.

Prompt Size

  • Llama: Keep the total context (prompt + history) under 32,000 tokens. Truncate older messages to stay within limits.
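One way to truncate older messages is to walk the history from newest to oldest and keep only what fits the budget. This is a sketch, not the official client API; `count_tokens` here is a crude whitespace-split stand-in, and in practice you should use the tokenizer that matches your model.

```python
def truncate_history(messages, max_tokens=32_000,
                     count_tokens=lambda m: len(m["content"].split())):
    # Keep the most recent messages whose combined token count
    # stays under the budget; drop everything older.
    kept, total = [], 0
    for msg in reversed(messages):
        tokens = count_tokens(msg)
        if total + tokens > max_tokens:
            break
        kept.append(msg)
        total += tokens
    return list(reversed(kept))
```

Dropping from the oldest end preserves the recent turns the model needs most; if you rely on a system prompt, pin it separately so it is never truncated away.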

Audio Length

  • Whisper: Split audio into chunks of 30 seconds or less for optimal speed and accuracy.

Error Handling

  • HTTP 429 / 503: Retry with exponential back-off (1s → 2s → 4s).

  • Timeouts:

    • Llama: Set the client timeout to 60 seconds.

    • Whisper: Set the client timeout to 120 seconds.
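The retry policy above can be sketched as a small wrapper. This is illustrative only: `request_fn` is a hypothetical callable returning an HTTP status and body, and the `sleep` parameter is injectable so the back-off is testable.

```python
import time

RETRYABLE = {429, 503}  # statuses the guide says to retry


def call_with_retry(request_fn, *, attempts=4, base_delay=1.0, sleep=time.sleep):
    # Retry retryable statuses with exponential back-off: 1s, 2s, 4s
    # between the four attempts; return the last response otherwise.
    for attempt in range(attempts):
        status, body = request_fn()
        if status not in RETRYABLE:
            return status, body
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return status, body
```

Combine this with the timeouts above (60 s for Llama, 120 s for Whisper) so a hung request fails fast enough for the retries to matter.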
