uncloseai.
How We Run Inference
This section is optional: follow it only if you want to contribute idle GPU time to the project or reproduce the whole stack in your own cluster.
vLLM Setup
We use vLLM to run models, generally with full FP16 safetensors, and we keep the dependencies in a virtualenv.
Note: For Hermes, we use an FP8 quant by adamo1139 (adamo1139/Hermes-3-Llama-3.1-8B-FP8-Dynamic), which is optimized for 4090 and 3090 GPUs.
We are considering supporting Ollama for broader quantization support.
To stand up a replica cluster on a new domain:
sudo apt-get install gcc python3.12-dev python3.12-venv
cd ~
python3 -m venv env
source env/bin/activate
pip install vllm
python -m vllm.entrypoints.openai.api_server --model adamo1139/Hermes-3-Llama-3.1-8B-FP8-Dynamic --host 0.0.0.0 --port 18888 --max-model-len 82000
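Once the server is up, you can sanity-check it with a request to vLLM's OpenAI-compatible chat completions endpoint. A minimal smoke test, assuming the server from the command above is reachable on localhost at port 18888:

# Quick smoke test against the local vLLM OpenAI-compatible API.
# Assumes the server started above is listening on localhost:18888.
curl -s http://localhost:18888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "adamo1139/Hermes-3-Llama-3.1-8B-FP8-Dynamic",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'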
TTS Setup
The speech (TTS) endpoint uses openedai-speech running in Docker.
We maintain a fork at git.unturf.com/engineering/unturf/openedai-speech.
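To verify the container, you can hit the OpenAI-style /v1/audio/speech endpoint that openedai-speech emulates. A minimal check, assuming the container publishes its default port 8000 on localhost (the "tts-1" model and "alloy" voice are the OpenAI-style defaults it mimics, not something specific to our fork):

# Request a short MP3 clip from the local openedai-speech container.
# Assumes port 8000 is published on localhost.
curl -s http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "voice": "alloy", "input": "Hello from uncloseai."}' \
  -o speech.mp3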
Proxy Setup
If you want to see how we set up the proxy, here is our /etc/caddy/Caddyfile:
ai.unturf.com {
    root * /opt/www
    file_server
    log {
        output file /var/log/caddy/ai.unturf.com.log {
            roll_size 50mb
            roll_keep 5
        }
    }
    tls {
        on_demand
    }
}

hermes.ai.unturf.com {
    reverse_proxy <removed>:18888
    log {
        output file /var/log/caddy/hermes.ai.unturf.com.log {
            roll_size 50mb
            roll_keep 5
        }
    }
    tls {
        on_demand
    }
}

speech.ai.unturf.com {
    reverse_proxy <removed>:8000
    log {
        output file /var/log/caddy/speech.ai.unturf.com.log {
            roll_size 50mb
            roll_keep 5
        }
    }
    tls {
        on_demand
    }
}
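After editing the Caddyfile, validate it and reload the server. A short sketch, assuming Caddy was installed as a systemd service (e.g. from the official apt package):

# Check the config parses before touching the running server.
caddy validate --config /etc/caddy/Caddyfile
# Apply the new config without dropping existing connections.
sudo systemctl reload caddy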
Rate Limiting
Rate limiting is configured based on client IP address: 3 requests per second per IP per endpoint.
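The Caddyfile above doesn't show the rate-limiting directives. One way to express 3 requests per second per IP per endpoint in Caddy is the mholt/caddy-ratelimit plugin; since the document doesn't name the exact mechanism, treat this as an illustrative sketch rather than our actual config (the zone name hermes_per_ip is made up, and the plugin is not bundled with stock Caddy):

{
    # Plugin directives need an explicit position in the middleware chain.
    order rate_limit before reverse_proxy
}

hermes.ai.unturf.com {
    # Requires a Caddy build with github.com/mholt/caddy-ratelimit compiled in.
    rate_limit {
        zone hermes_per_ip {
            key    {remote_host}   # bucket requests by client IP
            events 3               # allow 3 requests...
            window 1s              # ...per 1-second window
        }
    }
    reverse_proxy <removed>:18888
}

Each site block would carry its own zone, which is what makes the limit per IP per endpoint rather than global.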