fast generative yield

While you focus on building, we handle the cache.

FGY Cache is your inference caching department from day one. Drop it in front of OpenAI, Anthropic, or any OpenRouter model and start recovering costs immediately — exact hash matching, pgvector semantic similarity, request coalescing, and full streaming support. We charge 15% of what we save you. Misses are free.

fgy_charge = tokens_saved × provider_rate × 0.15
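The formula above can be sketched directly; the per-token rate here is a hypothetical stand-in for your provider's list price:

```python
def fgy_charge(tokens_saved: int, provider_rate_per_token: float) -> float:
    # 15% of the provider cost avoided; a miss saves nothing, so it costs nothing
    return tokens_saved * provider_rate_per_token * 0.15

# hypothetical rate: $0.60 per 1M tokens
rate = 0.60 / 1_000_000
print(f"${fgy_charge(1024, rate):.6f}")  # charge for 1,024 tokens saved
```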
client.py
from openai import OpenAI
import os
 
- client = OpenAI(api_key="sk-...")
+ client = OpenAI(
+   api_key="fgy_...",                  # cache tenant key
+   base_url="https://api.fgy.ai/v1",
+   default_headers={
+     "X-Provider-Auth": f"Bearer {os.environ['OPENAI_API_KEY']}",  # forwarded on miss
+   },
+ )
 
response = client.chat.completions.create(
  model="gpt-4o-mini", messages=[...]
)
on a hit:
x-fgy-cache: exact
x-fgy-tokens-saved: 1024
x-fgy-version: 1

on a miss:
x-fgy-cache: miss

FGY does not store your provider key. The X-Provider-Auth header travels with each request and is forwarded upstream only on cache misses. On a hit, it is never read.
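Client code can branch on these headers. The header names are from above; the helper itself is hypothetical (with the OpenAI Python SDK, raw response headers are reachable via `client.chat.completions.with_raw_response.create(...)`):

```python
def cache_status(headers: dict) -> tuple[str, int]:
    """Interpret FGY cache headers from a response (hypothetical helper)."""
    status = headers.get("x-fgy-cache", "miss")
    saved = int(headers.get("x-fgy-tokens-saved", "0"))
    return status, saved

print(cache_status({"x-fgy-cache": "exact", "x-fgy-tokens-saved": "1024"}))
print(cache_status({"x-fgy-cache": "miss"}))
```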

~0 μs
exact cache latency
<10 ms
semantic hit latency
N→1
concurrent deduplication
$0
fgy charge on miss

why this exists

Prompt repetition is structural, not an edge case.

Every production LLM application has a class of traffic where the same or semantically equivalent prompt arrives repeatedly — support bots, semantic search, document Q&A, classification pipelines, code completion with shared context. That traffic is invisible money.

FGY sits in front of your provider and captures that traffic. Exact matches return in microseconds from ETS. Near-matches resolve against a pgvector store. Concurrent identical in-flight requests collapse into a single upstream call. The savings accumulate from the first request.

You keep 85 cents of every dollar FGY saves you. We take 15. If nothing is saved, nothing is charged.

cache paths

Three paths. Two never touch your provider; the third costs no more than going direct.

01

Exact match

The request is normalized and hashed. If the tenant's ETS shard has a matching entry, the cached response returns in microseconds. No embedding call, no database query, no upstream hop.
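A minimal sketch of what exact-key construction could look like. The normalization shown here (canonical JSON with sorted keys, tenant mixed into the hash) is an assumption for illustration, not FGY's actual scheme:

```python
import hashlib
import json

def exact_key(tenant: str, payload: dict) -> str:
    # canonical JSON so equivalent request bodies hash identically,
    # with the tenant id mixed in to keep shards isolated per tenant
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{tenant}:{canonical}".encode()).hexdigest()

a = exact_key("t1", {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "hi"}]})
b = exact_key("t1", {"messages": [{"role": "user", "content": "hi"}], "model": "gpt-4o-mini"})
print(a == b)  # True: key order does not affect the cache key
```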

provider: $0  ·  fgy: $0
02

Semantic match

On an ETS miss, the prompt is embedded and checked against prior responses via pgvector cosine distance. Entries within your configured similarity threshold (default 0.92) serve the stored payload. No provider call. You control the threshold per tenant from the dashboard.
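The threshold check is plain cosine arithmetic. Note that pgvector's `<=>` operator returns cosine *distance*, so a similarity threshold of 0.92 corresponds to a distance cutoff of 0.08; a sketch:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def is_semantic_hit(query_vec, cached_vec, threshold: float = 0.92) -> bool:
    # equivalent to pgvector's (query <=> cached) <= 1 - threshold,
    # since distance = 1 - similarity for normalized cosine
    return cosine_similarity(query_vec, cached_vec) >= threshold
```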

provider: $0  ·  fgy: 15% of avoided cost
03

Miss + coalescing

True misses go upstream using your forwarded provider key. Concurrent callers with the same prompt collapse into one upstream request via the OTP GenServer coalescer. All waiters receive the broadcast simultaneously.

provider: normal rate  ·  fgy: $0

stack

Built on the right runtime for this problem.

A cache is a concurrency problem. Elixir and OTP were built for exactly this: millions of lightweight processes, preemptive scheduling, per-process garbage collection with no stop-the-world pauses, and message passing as the concurrency primitive. The entire cache layer is a natural expression of the BEAM.

elixir / otp

The runtime

Elixir on the BEAM gives each request its own process. Supervision trees mean crashes are isolated and recovered automatically. The scheduler handles thousands of concurrent cache lookups without blocking.

ets

In-memory exact store

16 shards with read_concurrency: true. Keyed via :erlang.phash2. Lookup, validation, and TTL check happen without touching any external process. Exact hits never leave the BEAM VM.
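The same sharding idea in Python terms. FGY's actual key derivation via `:erlang.phash2` is BEAM-specific; this is only an illustrative analogue using a stable stdlib hash:

```python
import hashlib

NUM_SHARDS = 16

def shard_for(key: str) -> int:
    # stable hash of the key reduced mod the shard count,
    # analogous to :erlang.phash2(key, 16)
    digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % NUM_SHARDS

print(shard_for("tenant-1:prompt-hash"))  # same key always lands on the same shard
```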

pgvector

Semantic similarity

Prompt embeddings stored per tenant and model. Nearest-neighbour search via the <=> cosine distance operator directly in Postgres. Hit counts incremented via Task.start, never blocking the response path.
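In SQL terms, the lookup has roughly this shape. The table and column names are hypothetical; the `<=>` operator and the `ORDER BY ... LIMIT 1` nearest-neighbour pattern are standard pgvector usage:

```python
# hypothetical schema: entries(tenant_id, model, embedding vector(1536), response jsonb)
NEAREST_SQL = """
SELECT response, 1 - (embedding <=> %(query_embedding)s) AS similarity
FROM entries
WHERE tenant_id = %(tenant_id)s
  AND model = %(model)s
ORDER BY embedding <=> %(query_embedding)s
LIMIT 1;
"""
```

The caller would then compare `similarity` against the tenant's configured threshold before serving the stored payload.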

fly.io

Global distribution

Deployed across Fly.io regions. Requests route to the nearest instance. The BEAM cluster handles state coordination across nodes. Low-latency cache access regardless of where your traffic originates.

genserver coalescer

N concurrent requests → 1 upstream call.

This is not a debounce or a queue. The first process to register for an in-flight key gets :execute. Every subsequent arrival gets :wait and blocks on receive. When the executing process completes, GenServer.cast broadcasts to all waiters simultaneously. Your provider sees one request, billed once, regardless of how many clients triggered it.

pid 1
:execute
pid 2
:wait
pid 3
:wait
...
n waiters
1 upstream request → N simultaneous responses
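The same pattern outside the BEAM, sketched with Python threads and an event standing in for the broadcast. FGY's version is an OTP GenServer; this is only an illustration of the mechanics:

```python
import threading

class Coalescer:
    """Sketch of N-to-1 coalescing: first caller executes, the rest wait."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (event, result box)

    def fetch(self, key, upstream):
        with self._lock:
            entry = self._inflight.get(key)
            if entry is None:
                entry = (threading.Event(), {})  # first arrival: :execute
                self._inflight[key] = entry
                execute = True
            else:
                execute = False                  # later arrivals: :wait
        event, box = entry
        if execute:
            box["value"] = upstream()            # one upstream call, billed once
            with self._lock:
                del self._inflight[key]
            event.set()                          # broadcast to all waiters at once
        else:
            event.wait()
        return box["value"]
```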

pricing model

We take a cut of what we save you.

No platform fee. No minimum spend. Misses pass through for free — your provider key is forwarded, and FGY charges nothing for the proxy hop. On hits, FGY bills 15% of the avoided provider cost at list price.

saved $10 you keep $8.50 · fgy $1.50
full pricing and formula breakdown

what's shipping

v1

This is not a prototype.

V1 ships with everything you need to drop FGY in front of your inference stack today: three-tier cache, full streaming, multi-provider support, request coalescing, prepaid billing with audit ledger, credit codes, auto top-up, and a dashboard to manage it all.

Streaming response cache
Set stream: true and cache hits replay as real SSE chunks. Misses pipe upstream tokens through in real time and store the assembled response for future hits.
Multi-provider routing
OpenAI, Anthropic, and OpenRouter models through one endpoint. Pass your own provider key via header or let FGY route to the right upstream. Cache the response regardless of who fulfilled it.
Multi-node cache coherence
Cache writes broadcast across the BEAM cluster via PubSub. Every node holds a full copy of the hot cache for local sub-microsecond reads, no matter which instance stored the entry.
API versioning
Every response includes x-fgy-version. Pin your integration to a version and upgrade when you're ready. Breaking changes will never land without a new version number.
Prepaid billing + audit ledger
Top up via Stripe checkout or credit codes. Per-model cost tracking with exact provider rates. Append-only ledger for every credit and debit. No negative balances — auto top-up available when balance drops below your threshold.
Credit codes + auto top-up
Single-use credit codes for onboarding, testing, and promotions. Auto top-up charges your saved payment method when balance drops below a configurable threshold. All transactions auditable via the ledger.
Embedding retry + resilience
Embedding calls retry with exponential backoff on rate limits and transient failures. Semantic cache degrades gracefully to exact-only — your request always goes through.
Per-tenant similarity thresholds
Configure the pgvector cosine threshold per key from the dashboard. Tighter thresholds for precise use cases, looser for high-repetition pipelines. Presets for strict (0.97), balanced (0.92), and loose (0.85).
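A toy illustration of the streaming-cache mechanics described above: on a miss, chunks pass through to the caller and are assembled for storage; on a later hit, the stored text is replayed as chunks. The chunking scheme here is hypothetical, not FGY's actual SSE framing:

```python
def assemble(chunks: list[str]) -> str:
    # miss path: collect the upstream chunks while piping them through,
    # then store the assembled response for future hits
    return "".join(chunks)

def replay(stored: str, size: int = 4) -> list[str]:
    # hit path: slice the stored response back into chunks for replay
    return [stored[i:i + size] for i in range(0, len(stored), size)]

stored = assemble(["Hel", "lo, ", "world"])
print(stored)
print(replay(stored))
```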

on the roadmap

Cache warming API
Pre-seed the cache from your prompt corpus before traffic arrives. Deploy knowing your hit rate is already primed.
SDK packages
First-class Python, TypeScript, and Go SDKs with built-in header inspection and fallback helpers.
round 0

We're building this in the open.

FGY is at round zero. If you're spending meaningful money on LLM inference and want to be involved early, as a design partner, pilot user, or in an investment conversation, we want to hear from you. We're looking for teams for whom the math on caching is obvious and who want to shape what the product becomes.

[email protected]