Introducing Hermes AI and TTS Speech Endpoints
At ai.unturf.com, we offer free AI services powered by the NousResearch/Hermes-3-Llama-3.1-8B model and a TTS (Text-to-Speech) endpoint. Our mission is to provide accessible AI tools for everyone, embodying the principles of both free as in beer & free as in freedom. You can interact with our models without any cost, and you are encouraged to contribute and build upon the open-source code & models that we use.
We intend to be a drop-in replacement: you can use the existing open-source OpenAI client to communicate with us.
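In practice, the only changes from a stock OpenAI setup are the base_url and a placeholder API key; a minimal sketch in Python (the full examples below cover the rest):
from openai import OpenAI

# Any api_key value is accepted; only the base_url matters.
client = OpenAI(base_url="https://hermes.ai.unturf.com/v1", api_key="choose-any-value")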
Web Client-Only Solution: Interact with AI Services Directly from Static Sites or CDNs
Because we don't require a valid API key, we don't have any real need for a server.
Add this LLM to any static site or CDN.
This web client-only solution uses uncloseai.js to make the browser act as the client, talking to the API directly without an intermediary server. Since no valid API key is required, the API handles requests directly on behalf of the browser client, which keeps things efficient and accessible for thin clients, especially battery-powered devices like phones & laptops.
This static site has a live LLM demonstration above!
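Under the hood, the browser (via uncloseai.js) simply POSTs to the standard OpenAI-compatible chat completions route. As a rough sketch of that request, here it is in Python with the requests library; the path and payload follow the OpenAI chat completions convention, and any key value works:
import requests

# The same request any thin client (including the browser) sends to the endpoint.
resp = requests.post(
    "https://hermes.ai.unturf.com/v1/chat/completions",
    json={
        "model": "NousResearch/Hermes-3-Llama-3.1-8B",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 50,
    },
    headers={"Authorization": "Bearer choose-any-value"},  # any value is accepted
)
print(resp.json()["choices"][0]["message"]["content"])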
Installing the OpenAI Client
Python
To install the OpenAI package for Python, use pip:
pip install openai
Node.js
To install the OpenAI package for Node.js, declare it in your package.json (the version below was current at the time of writing; use the latest available):
{
  "dependencies": {
    "openai": "^4.67.3"
  }
}
Run the following command to install it:
npm install
Using the Hermes AI Model
Python Example
Non-Streaming
# Python Fizzbuzz Example
from openai import OpenAI
client = OpenAI(base_url="https://hermes.ai.unturf.com/v1", api_key="choose-any-value")
MODEL = "NousResearch/Hermes-3-Llama-3.1-8B"
messages = [{"role": "user", "content": "Give a Python Fizzbuzz solution in one line of code?"}]
response = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    temperature=0.5,
    max_tokens=150,
)
print(response.choices[0].message.content)
Streaming
# Streaming response in Python
from openai import OpenAI
client = OpenAI(base_url="https://hermes.ai.unturf.com/v1", api_key="choose-any-value")
MODEL = "NousResearch/Hermes-3-Llama-3.1-8B"
messages = [
    {"role": "user", "content": "Give a Python Fizzbuzz solution in one line of code?"}
]
response = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    temperature=0.5,
    max_tokens=150,
    stream=True,  # Enable streaming
)
for chunk in response:
    content = chunk.choices[0].delta.content
    if content is not None:  # the final chunk may have no content
        print(content, end="", flush=True)
Node.js Example
Non-Streaming
const OpenAI = require('openai');
const client = new OpenAI({
  baseURL: "https://hermes.ai.unturf.com/v1",
  apiKey: "dummy-api-key",
});
const MODEL = "NousResearch/Hermes-3-Llama-3.1-8B";
const messages = [{"role": "user", "content": "Give a Python Fizzbuzz solution in one line of code?"}];
async function getResponse() {
  try {
    const response = await client.chat.completions.create({
      model: MODEL,
      messages: messages,
      temperature: 0.5,
      max_tokens: 150,
    });
    console.log(response.choices[0].message.content);
  } catch (error) {
    console.error("Error:", error.response ? error.response.data : error.message);
  }
}
getResponse();
Streaming
const OpenAI = require('openai');
const client = new OpenAI({
  baseURL: "https://hermes.ai.unturf.com/v1",
  apiKey: "dummy-api-key",
});
const MODEL = "NousResearch/Hermes-3-Llama-3.1-8B";
const messages = [{"role": "user", "content": "Give a Python Fizzbuzz solution in one line of code?"}];
async function streamResponse() {
  try {
    const stream = await client.chat.completions.create({
      model: MODEL,
      messages: messages,
      temperature: 0.5,
      max_tokens: 150,
      stream: true, // Enable streaming
    });
    // Use an async iterator to read each chunk as it arrives
    for await (const chunk of stream) {
      const msg = chunk.choices[0].delta.content || ""; // the final chunk may have no content
      process.stdout.write(msg);
    }
  } catch (error) {
    console.error("Error:", error.response ? error.response.data : error.message);
  }
}
streamResponse();
Using the Text To Speech Endpoint
Python Example
# TTS Speech Example in Python
import openai
client = openai.OpenAI(
    api_key="YOLO",
    base_url="https://speech.ai.unturf.com/v1",
)
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    speed=0.9,
    input="I think so therefore, Today is a wonderful day to build something people love!"
) as response:
    response.stream_to_file("speech.mp3")
Node.js Example
const fs = require('fs');
const OpenAI = require('openai');
const client = new OpenAI({
  baseURL: "https://speech.ai.unturf.com/v1",
  apiKey: "YOLO",
});
async function getSpeech() {
  try {
    const response = await client.audio.speech.create({
      model: "tts-1",
      voice: "alloy",
      speed: 0.9,
      input: "I think so therefore, Today is a wonderful day to build something people love!"
    });
    // Write the binary audio response to a file
    const buffer = Buffer.from(await response.arrayBuffer());
    fs.writeFileSync("speech.mp3", buffer);
  } catch (error) {
    console.error("Error:", error.response ? error.response.data : error.message);
  }
}
getSpeech();
How we run inference
This section is optional; it is only relevant if you want to contribute idle GPU time to the project or reproduce everything in your own cluster.
We use vLLM to run the models, currently with the full fp16 safetensors, and we keep the dependencies in a virtualenv.
We are considering supporting Ollama for better quantization support.
To stand up a replica cluster on a new domain:
cd ~
python3 -m venv env
source env/bin/activate
pip install vllm
python -m vllm.entrypoints.openai.api_server --model NousResearch/Hermes-3-Llama-3.1-8B --host 0.0.0.0 --port 18888 --max-model-len 16000
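To sanity-check the replica, you can point the same OpenAI client at the local vLLM server; a minimal sketch, assuming the server above is reachable on localhost port 18888:
from openai import OpenAI

# Point the client at the local vLLM replica instead of hermes.ai.unturf.com
client = OpenAI(base_url="http://localhost:18888/v1", api_key="choose-any-value")

response = client.chat.completions.create(
    model="NousResearch/Hermes-3-Llama-3.1-8B",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=10,
)
print(response.choices[0].message.content)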
The speech (TTS) endpoint uses openedai-speech running via Docker.
If you want to see how we set up the proxy, check out /etc/caddy/Caddyfile:
ai.unturf.com {
    root * /opt/www
    file_server
    log {
        output file /var/log/caddy/ai.unturf.com.log {
            roll_size 50mb
            roll_keep 5
        }
    }
    tls {
        on_demand
    }
}

hermes.ai.unturf.com {
    reverse_proxy :18888
    log {
        output file /var/log/caddy/hermes.ai.unturf.com.log {
            roll_size 50mb
            roll_keep 5
        }
    }
    tls {
        on_demand
    }
}

speech.ai.unturf.com {
    reverse_proxy :8000
    log {
        output file /var/log/caddy/speech.ai.unturf.com.log {
            roll_size 50mb
            roll_keep 5
        }
    }
    tls {
        on_demand
    }
}
We will likely implement a rate limit based on client IP address.