Best Practices

Follow these guidelines to maximize performance and reliability when running models on your Cube.

Batching

  • Llama: Send up to 8 concurrent chat requests per RPC.

  • Whisper: Send up to 4 simultaneous audio streams.

Prompt Size

  • Llama: Keep the total context (prompt + history) under 32,000 tokens. Truncate older messages to stay within limits.
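One way to truncate older messages is to walk the history from newest to oldest and keep only what fits the budget. This is a sketch, not the official client API; `count_tokens` here is a crude whitespace-split stand-in, and in practice you should use the tokenizer that matches your model.

```python
def truncate_history(messages, max_tokens=32_000,
                     count_tokens=lambda m: len(m["content"].split())):
    # Keep the most recent messages whose combined token count
    # stays under the budget; drop everything older.
    kept, total = [], 0
    for msg in reversed(messages):
        tokens = count_tokens(msg)
        if total + tokens > max_tokens:
            break
        kept.append(msg)
        total += tokens
    return list(reversed(kept))
```

Dropping from the oldest end preserves the recent turns the model needs most; if you rely on a system prompt, pin it separately so it is never truncated away.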

Audio Length

  • Whisper: Split audio into chunks of 30 seconds or less for optimal speed and accuracy.

Error Handling

  • HTTP 429 / 503: Retry with exponential back-off (1s → 2s → 4s).

  • Timeouts:

    • Llama: Set the client timeout to 60 seconds.

    • Whisper: Set the client timeout to 120 seconds.
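The retry policy above can be sketched as a small wrapper. This is illustrative only: `request_fn` is a hypothetical callable returning an HTTP status and body, and the `sleep` parameter is injectable so the back-off is testable.

```python
import time

RETRYABLE = {429, 503}  # statuses the guide says to retry


def call_with_retry(request_fn, *, attempts=4, base_delay=1.0, sleep=time.sleep):
    # Retry retryable statuses with exponential back-off: 1s, 2s, 4s
    # between the four attempts; return the last response otherwise.
    for attempt in range(attempts):
        status, body = request_fn()
        if status not in RETRYABLE:
            return status, body
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return status, body
```

Combine this with the timeouts above (60 s for Llama, 120 s for Whisper) so a hung request fails fast enough for the retries to matter.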
