While you focus on building,
we handle the cache.
FGY Cache is your inference caching department from day one. Drop it in front of OpenAI, Anthropic, or any OpenRouter model and start recovering costs immediately — exact hash matching, pgvector semantic similarity, request coalescing, and full streaming support. We charge 15% of what we save you. Misses are free.
import os
from openai import OpenAI

client = OpenAI(
    api_key="fgy_...",  # cache tenant key
    base_url="https://api.fgy.ai/v1",
    default_headers={
        "X-Provider-Auth": f"Bearer {os.environ['OPENAI_API_KEY']}",  # forwarded on miss
    },
)
FGY does not store your provider key. The X-Provider-Auth header travels with each request and is forwarded upstream only on cache misses. On a hit, it is never read.
why this exists
Prompt repetition is structural, not an edge case.
Every production LLM application has a class of traffic where the same or semantically equivalent prompt arrives repeatedly — support bots, semantic search, document Q&A, classification pipelines, code completion with shared context. That traffic is invisible money.
FGY sits in front of your provider and captures that traffic. Exact matches return in microseconds from ETS. Near-matches resolve against a pgvector store. Concurrent identical in-flight requests collapse into a single upstream call. The savings accumulate from the first request.
You keep 85 cents of every dollar FGY saves you. We take 15. If nothing is saved, nothing is charged.
cache paths
Three paths. Each one cheaper than sending every request upstream.
Exact match
The request is normalized and hashed. If the tenant's ETS shard has a matching entry, the cached response returns in microseconds. No embedding call, no database query, no upstream hop.
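The normalize-then-hash step can be sketched in a few lines of Python. The exact canonicalization rules (sorted keys, compact separators, SHA-256) are assumptions for illustration, not FGY's documented algorithm.

```python
import hashlib
import json

def cache_key(tenant: str, request: dict) -> str:
    """Build a deterministic exact-match key for a request.

    The normalization here (sorted keys, compact separators) is
    illustrative; FGY's actual canonicalization may differ.
    """
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{tenant}:{digest}"

# Two requests that differ only in key order produce the same key,
# so both resolve to the same cached entry.
a = {"model": "gpt-4o", "messages": [{"role": "user", "content": "hi"}]}
b = {"messages": [{"role": "user", "content": "hi"}], "model": "gpt-4o"}
```

Because the key is deterministic, the lookup is a single in-memory read with no embedding or database work on this path.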
Semantic match
On an ETS miss, the prompt is embedded and checked against prior responses via pgvector cosine distance. Entries within your configured similarity threshold (default 0.92) serve the stored payload. No provider call. You control the threshold per tenant from the dashboard.
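The threshold gate reduces to a cosine-similarity comparison. A minimal sketch follows, with toy three-dimensional vectors standing in for real embeddings; note that pgvector's <=> operator returns cosine *distance* (1 minus similarity), while this sketch works in similarity space for readability.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_hit(query_emb, stored, threshold=0.92):
    """Return the closest stored entry within the threshold, else None.

    `stored` entries are dicts with "embedding" and "response" keys;
    the shape is illustrative, not FGY's schema.
    """
    best = max(stored, key=lambda e: cosine_similarity(query_emb, e["embedding"]),
               default=None)
    if best and cosine_similarity(query_emb, best["embedding"]) >= threshold:
        return best
    return None

store = [
    {"embedding": [0.99, 0.1, 0.0], "response": "cached answer"},
    {"embedding": [0.0, 1.0, 0.0], "response": "unrelated"},
]
```

Raising the per-tenant threshold toward 1.0 trades hit rate for stricter equivalence; lowering it serves more requests from cache at the risk of near-but-not-quite matches.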
Miss + coalescing
True misses go upstream using your forwarded provider key. Concurrent callers with the same prompt collapse into one upstream request via the OTP GenServer coalescer. All waiters receive the broadcast simultaneously.
stack
Built on the right runtime for this problem.
A cache is a concurrency problem. Elixir and OTP were built for exactly this: millions of lightweight processes, preemptive scheduling, per-process garbage collection with no stop-the-world pauses, and message passing as the concurrency primitive. The entire cache layer is a natural expression of the BEAM.
The runtime
Elixir on the BEAM gives each request its own process. Supervision trees mean crashes are isolated and recovered automatically. The scheduler handles thousands of concurrent cache lookups without blocking.
In-memory exact store
16 shards with read_concurrency: true. Keyed via :erlang.phash2. Lookup, validation, and TTL check happen without touching any external process. Exact hits never leave the BEAM VM.
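The shard-and-TTL mechanics can be approximated in Python with plain dicts standing in for ETS tables. The hash function is a portable stand-in for :erlang.phash2, and the eviction-on-read behavior is an assumption about how expired entries are handled.

```python
import hashlib

NUM_SHARDS = 16
shards = [dict() for _ in range(NUM_SHARDS)]  # stand-ins for per-shard ETS tables

def shard_for(key: str) -> dict:
    # Stable hash modulo shard count, the same shape as
    # :erlang.phash2(key, 16); sha256 is only a portable stand-in.
    h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return shards[h % NUM_SHARDS]

def put(key, value, ttl_s, now):
    shard_for(key)[key] = (value, now + ttl_s)

def get(key, now):
    entry = shard_for(key).get(key)
    if entry is None:
        return None
    value, expires = entry
    if now >= expires:          # TTL validated inline, no external process
        del shard_for(key)[key]
        return None
    return value
```

Sharding spreads hot keys across independent tables so concurrent readers rarely contend on the same structure.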
Semantic similarity
Prompt embeddings stored per tenant and model. Nearest-neighbour search via the <=> cosine distance operator directly in Postgres. Hit counts incremented via Task.start, never blocking the response path.
Global distribution
Deployed across Fly.io regions. Requests route to the nearest instance. The BEAM cluster handles state coordination across nodes. Low-latency cache access regardless of where your traffic originates.
N concurrent requests → 1 upstream call.
This is not a debounce or a queue. The first process to register for an in-flight key gets :execute. Every subsequent arrival gets :wait and blocks on receive. When the executing process completes, GenServer.cast broadcasts to all waiters simultaneously. Your provider sees one request, billed once, regardless of how many clients triggered it.
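The register/execute/wait/broadcast protocol can be approximated with threads and an event; the real coalescer uses GenServer messages on the BEAM, but the shape is the same. Everything below is a toy analogue, not FGY's implementation.

```python
import threading
import time

class Coalescer:
    """Collapse concurrent identical requests into one upstream call.

    The first caller to register an in-flight key takes the "execute"
    role; every later arrival takes "wait" and blocks until the
    executor broadcasts the result.
    """
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (event, result holder)

    def fetch(self, key, upstream):
        with self._lock:
            entry = self._inflight.get(key)
            if entry is None:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
                role = "execute"
            else:
                role = "wait"
        event, holder = entry
        if role == "execute":
            holder["value"] = upstream(key)  # the single real upstream call
            with self._lock:
                del self._inflight[key]
            event.set()                      # broadcast to every waiter at once
        else:
            event.wait()
        return holder["value"]

calls = []
barrier = threading.Barrier(5)

def upstream(key):
    time.sleep(0.1)  # simulated provider latency keeps all five in flight
    calls.append(key)
    return f"response:{key}"

co = Coalescer()
results = []

def worker():
    barrier.wait()  # make all five callers arrive concurrently
    results.append(co.fetch("same prompt", upstream))

threads = [threading.Thread(target=worker) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Five callers block on the same event, the upstream function runs once, and every caller returns the same response.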
pricing model
We take a cut of what we save you.
No platform fee. No minimum spend. Misses pass through for free: your provider key is forwarded, and FGY charges nothing for the proxy hop. On hits, FGY bills 15% of the avoided provider cost at list price.
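The billing rule reduces to simple arithmetic. A sketch follows; the function name and event shape are illustrative, not FGY's billing API.

```python
def fgy_fee(events):
    """Bill 15% of the provider cost avoided by cache hits.

    Each event is (kind, list_price_usd). A "hit" avoids the full
    list-price provider charge; a "miss" passes through free.
    Illustrative only, not FGY's actual billing code.
    """
    saved = sum(price for kind, price in events if kind == "hit")
    return {
        "saved": round(saved, 6),
        "fee": round(saved * 0.15, 6),    # FGY's 15% cut
        "kept": round(saved * 0.85, 6),   # your 85 cents on the dollar
    }

# 100 requests at $0.01 list price with a 60% hit rate:
bill = fgy_fee([("hit", 0.01)] * 60 + [("miss", 0.01)] * 40)
```

At a 0% hit rate the fee is zero, which is the "if nothing is saved, nothing is charged" guarantee stated above.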
what's shipping
This is not a prototype.
V1 ships with everything you need to drop FGY in front of your inference stack today: three-tier cache, full streaming, multi-provider support, request coalescing, prepaid billing with audit ledger, credit codes, auto top-up, and a dashboard to manage it all.
Streaming
Set stream: true and cache hits replay as real SSE chunks. Misses pipe upstream tokens through in real time and store the assembled response for future hits.
x-fgy-version
Pin your integration to a version and upgrade when you're ready. Breaking changes will never land without a new version number.
on the roadmap
We're building this in the open.
FGY is at round zero. If you're spending meaningful money on LLM inference and want to be involved early — as a design partner, pilot user, or in an investment conversation — we want to hear from you. We're looking for teams where the math on caching is obvious and who want to shape what the product becomes.