uncloseai.

How We Run Inference

This section is optional. It is only relevant if you want to contribute idle GPU time to the project or reproduce the full stack in your own cluster.

vLLM Setup

We use vLLM to serve models, generally with full FP16 safetensors. We use a virtualenv to isolate the dependencies.

Note: For Hermes, we use an FP8 quant by adamo1139 (adamo1139/Hermes-3-Llama-3.1-8B-FP8-Dynamic), which is optimized for 4090 and 3090 GPUs.

We are considering adding Ollama support for a wider range of quantization formats.

To stand up a replica cluster on a new domain:

# Install build tools and Python 3.12 headers/venv support
sudo apt-get install gcc python3.12-dev python3.12-venv
cd ~
# Create and activate a virtualenv to isolate the dependencies
python3 -m venv env
source env/bin/activate
pip install vllm
# Serve the FP8 Hermes quant on port 18888 with an 82k-token context window
python -m vllm.entrypoints.openai.api_server --model adamo1139/Hermes-3-Llama-3.1-8B-FP8-Dynamic --host 0.0.0.0 --port 18888 --max-model-len 82000
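
Once the server is running, any OpenAI-compatible client can talk to it. Here is a minimal sketch using the openai Python package against a local instance; the base URL and model name follow the command above, and the API key is a placeholder since vLLM accepts any key unless one is set with --api-key.

# Minimal sketch: query the vLLM OpenAI-compatible server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:18888/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="adamo1139/Hermes-3-Llama-3.1-8B-FP8-Dynamic",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)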

TTS Setup

The speech (TTS) endpoint uses openedai-speech running via Docker.

We maintain a fork at git.unturf.com/engineering/unturf/openedai-speech.
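
openedai-speech exposes an OpenAI-compatible /v1/audio/speech route, so the same Python client works for TTS. Here is a minimal sketch against a local instance on port 8000 (matching the reverse_proxy target below); "tts-1" and "alloy" are the OpenAI-style defaults that openedai-speech emulates.

# Minimal sketch: synthesize speech from a local openedai-speech instance.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Stream the generated audio straight to a file.
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="Hello from uncloseai.",
) as response:
    response.stream_to_file("hello.mp3")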

Proxy Setup

If you want to see how we set up the proxy, here is our /etc/caddy/Caddyfile:

ai.unturf.com {
    root * /opt/www
    file_server
    log {
        output file /var/log/caddy/ai.unturf.com.log {
            roll_size 50mb
            roll_keep 5
        }
    }
    tls {
        on_demand
    }
}

hermes.ai.unturf.com {
    reverse_proxy <removed>:18888
    log {
        output file /var/log/caddy/hermes.ai.unturf.com.log {
            roll_size 50mb
            roll_keep 5
        }
    }
    tls {
        on_demand
    }
}

speech.ai.unturf.com {
    reverse_proxy <removed>:8000
    log {
        output file /var/log/caddy/speech.ai.unturf.com.log {
            roll_size 50mb
            roll_keep 5
        }
    }
    tls {
        on_demand
    }
}

Rate Limiting

Rate limiting is enforced per client IP address: 3 requests per second, per IP, per endpoint.
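
If you script against the endpoints, it is worth backing off when you hit the limit. Here is a minimal client-side sketch that retries with exponential backoff; it assumes over-limit requests are answered with HTTP 429, so adjust if the proxy drops or delays connections instead.

# Minimal sketch: retry with exponential backoff when the per-IP limit
# (3 requests per second per endpoint) is exceeded.
import time
import requests

def get_with_backoff(url, retries=5):
    delay = 1.0
    for attempt in range(retries):
        response = requests.get(url)
        if response.status_code != 429:  # not rate limited; hand it back
            return response
        time.sleep(delay)  # wait before retrying, doubling each time
        delay *= 2
    raise RuntimeError(f"still rate limited after {retries} attempts")

resp = get_with_backoff("https://hermes.ai.unturf.com/v1/models")
print(resp.json())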