Introducing Hermes AI and TTS Speech Endpoints
At ai.unturf.com, we offer free AI services powered by the NousResearch/Hermes-3-Llama-3.1-8B model and a TTS (Text-to-Speech) endpoint. Our mission is to provide accessible AI tools for everyone, embodying the principles of both free as in beer & free as in freedom. You can interact with our models without any cost, and you are encouraged to contribute and build upon the open-source code & models that we use.
We intend to be a drop-in replacement: you can use the existing open-source OpenAI client to communicate with us.
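In practice, the only changes from a stock OpenAI setup are the base_url and a placeholder API key; a minimal sketch in Python (the full examples below cover the rest):
from openai import OpenAI

# Any api_key value is accepted; only the base_url matters.
client = OpenAI(base_url="https://hermes.ai.unturf.com/v1", api_key="choose-any-value")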
Web Client-Only Solution: Interact with AI Services Directly from Static Sites or CDNs
Because we don't require a valid API key, we don't have any real need for a server.
Add this LLM to any static site or CDN.
This web client-only solution uses uncloseai.js to make the browser act as the client, talking to the API directly without an intermediary server. Since no valid API key is required, the API handles requests directly on behalf of the browser client, which keeps things efficient and accessible for thin clients, especially battery-powered devices like phones & laptops.
This static site has a live LLM demonstration above!
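Under the hood, the browser (via uncloseai.js) simply POSTs to the standard OpenAI-compatible chat completions route. As a rough sketch of that request, here it is in Python with the requests library; the path and payload follow the OpenAI chat completions convention, and any key value works:
import requests

# The same request any thin client (including the browser) sends to the endpoint.
resp = requests.post(
    "https://hermes.ai.unturf.com/v1/chat/completions",
    json={
        "model": "NousResearch/Hermes-3-Llama-3.1-8B",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 50,
    },
    headers={"Authorization": "Bearer choose-any-value"},  # any value is accepted
)
print(resp.json()["choices"][0]["message"]["content"])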
Installing the OpenAI Client
Python
To install the OpenAI package for Python, use pip:
pip install openai
Node.js
To install the OpenAI package for Node.js, declare it in your package.json (the version below was current at the time of writing; use the latest available):
{
  "dependencies": {
    "openai": "^4.67.3"
  }
}
Run the following command to install it:
npm install
Using the Hermes AI Model
Python Example
Non-Streaming
# Python Fizzbuzz Example
from openai import OpenAI
client = OpenAI(base_url="https://hermes.ai.unturf.com/v1", api_key="choose-any-value")
MODEL = "NousResearch/Hermes-3-Llama-3.1-8B"
messages = [{"role": "user", "content": "Give a Python Fizzbuzz solution in one line of code?"}]
response = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    temperature=0.5,
    max_tokens=150,
)
print(response.choices[0].message.content)
Streaming
# Streaming response in Python
from openai import OpenAI
client = OpenAI(base_url="https://hermes.ai.unturf.com/v1", api_key="choose-any-value")
MODEL = "NousResearch/Hermes-3-Llama-3.1-8B"
messages = [
    {"role": "user", "content": "Give a Python Fizzbuzz solution in one line of code?"}
]
response = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    temperature=0.5,
    max_tokens=150,
    stream=True,  # Enable streaming
)
for chunk in response:
    content = chunk.choices[0].delta.content
    if content is not None:  # the final chunk may have no content
        print(content, end="", flush=True)
Node.js Example
Non-Streaming
const OpenAI = require('openai');
const client = new OpenAI({
  baseURL: "https://hermes.ai.unturf.com/v1",
  apiKey: "dummy-api-key",
});
const MODEL = "NousResearch/Hermes-3-Llama-3.1-8B";
const messages = [{"role": "user", "content": "Give a Python Fizzbuzz solution in one line of code?"}];
async function getResponse() {
  try {
    const response = await client.chat.completions.create({
      model: MODEL,
      messages: messages,
      temperature: 0.5,
      max_tokens: 150,
    });
    console.log(response.choices[0].message.content);
  } catch (error) {
    console.error("Error:", error.response ? error.response.data : error.message);
  }
}
getResponse();
Streaming
const OpenAI = require('openai');
const client = new OpenAI({
  baseURL: "https://hermes.ai.unturf.com/v1",
  apiKey: "dummy-api-key",
});
const MODEL = "NousResearch/Hermes-3-Llama-3.1-8B";
const messages = [{"role": "user", "content": "Give a Python Fizzbuzz solution in one line of code?"}];
async function streamResponse() {
  try {
    const stream = await client.chat.completions.create({
      model: MODEL,
      messages: messages,
      temperature: 0.5,
      max_tokens: 150,
      stream: true, // Enable streaming
    });
    // Use an async iterator to read each chunk as it arrives
    for await (const chunk of stream) {
      const msg = chunk.choices[0].delta.content || ""; // the final chunk may have no content
      process.stdout.write(msg);
    }
  } catch (error) {
    console.error("Error:", error.response ? error.response.data : error.message);
  }
}
streamResponse();
Using the Text To Speech Endpoint
Python Example
# TTS Speech Example in Python
import openai
client = openai.OpenAI(
    api_key="YOLO",
    base_url="https://speech.ai.unturf.com/v1",
)
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    speed=0.9,
    input="I think so therefore, Today is a wonderful day to build something people love!"
) as response:
    response.stream_to_file("speech.mp3")
Node.js Example
const fs = require('fs');
const OpenAI = require('openai');
const client = new OpenAI({
  baseURL: "https://speech.ai.unturf.com/v1",
  apiKey: "YOLO",
});
async function getSpeech() {
  try {
    const response = await client.audio.speech.create({
      model: "tts-1",
      voice: "alloy",
      speed: 0.9,
      input: "I think so therefore, Today is a wonderful day to build something people love!"
    });
    // Write the binary audio response to a file
    const buffer = Buffer.from(await response.arrayBuffer());
    fs.writeFileSync("speech.mp3", buffer);
  } catch (error) {
    console.error("Error:", error.response ? error.response.data : error.message);
  }
}
getSpeech();
How we run inference
This section is optional; it is only relevant if you want to contribute idle GPU time to the project or reproduce everything in your own cluster.
We use vLLM to run the models, currently with the full fp16 safetensors, and we keep the dependencies in a virtualenv.
We are considering supporting Ollama for better quantization support.
To stand up a replica cluster on a new domain:
cd ~
python3 -m venv env
source env/bin/activate
pip install vllm
python -m vllm.entrypoints.openai.api_server --model NousResearch/Hermes-3-Llama-3.1-8B --host 0.0.0.0 --port 18888 --max-model-len 16000
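To sanity-check the replica, you can point the same OpenAI client at the local vLLM server; a minimal sketch, assuming the server above is reachable on localhost port 18888:
from openai import OpenAI

# Point the client at the local vLLM replica instead of hermes.ai.unturf.com
client = OpenAI(base_url="http://localhost:18888/v1", api_key="choose-any-value")

response = client.chat.completions.create(
    model="NousResearch/Hermes-3-Llama-3.1-8B",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=10,
)
print(response.choices[0].message.content)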
The speech (TTS) endpoint uses openedai-speech running via Docker.
If you want to see how we set up the proxy, check out /etc/caddy/Caddyfile:
ai.unturf.com {
    root * /opt/www
    file_server
    log {
        output file /var/log/caddy/ai.unturf.com.log {
            roll_size 50mb
            roll_keep 5
        }
    }
    tls {
        on_demand
    }
}

hermes.ai.unturf.com {
    reverse_proxy :18888
    log {
        output file /var/log/caddy/hermes.ai.unturf.com.log {
            roll_size 50mb
            roll_keep 5
        }
    }
    tls {
        on_demand
    }
}

speech.ai.unturf.com {
    reverse_proxy :8000
    log {
        output file /var/log/caddy/speech.ai.unturf.com.log {
            roll_size 50mb
            roll_keep 5
        }
    }
    tls {
        on_demand
    }
}
We will likely implement a rate limit based on client IP address.