How Cosmo works
Cosmo is small, but it's built like something much bigger should be. This page takes the whole system apart, one piece at a time — what each part does, how it connects to the rest, and why it's built that way — so that by the end you could rebuild it yourself.
Each chapter starts with a problem, shows you the contract before the code, then explains the one decision that made it work. Read top to bottom, or jump around with the menu.
Every part, on one canvas
How the pieces actually wire together: your input, the IPC line, the four boundaries, the brain, voice, memory, and what reaches the outside world.
The shape of the whole thing
Before any code: what are the big pieces, and what rule governs each?
Cosmo is an Electron app, so it has two halves. The main process (Node.js, CommonJS) is the grown-up: it owns timing, state, the AI, tools, the microphone loop, and everything that touches your Mac. The renderer (a sandboxed browser window, bundled with esbuild) owns exactly two display-side jobs — drawing the face and running the on-device microphone voice-activity detector. Between them sits a deliberately narrow IPC boundary with a tiny shared vocabulary.
Diagram: MoodState and ActivityState cross the IPC line. Main thinks and decides; the renderer shows and senses; the line between them is the whole privacy story.
Almost every capability hides behind the same shape: a contract (a types.ts interface) plus a registry (the thing that holds the implementations). Adding a feature means adding a folder — never editing a neighbour. Three boundaries follow this pattern, plus a fourth for sensing:
- Expression packs: how Cosmo looks (renderer/packs)
- Tools: what Cosmo can do (main/tools)
- LLM providers: which brain Cosmo uses (main/ai/providers)
- Watchers: how Cosmo senses your day (main/watchers)
Three reusable ideas carry the whole codebase: contract + registry (swap implementations without touching the engine), single-owner state (one file decides mood), and facts vs. judgment (sensors report, one place decides). Spot these three and the rest is detail.
Getting a face on screen
How does a little floating character even appear, always-on-top, without a window frame?
Main creates a small, frameless, transparent BrowserWindow pinned above other windows, then hands the renderer its HTML. The catch: browser audio APIs (the on-device voice-activity detector, the TTS WASM) refuse to run from a bare file:// origin. So main serves the renderer through a custom app:// protocol, giving it a real origin where AudioWorklet and WASM behave.
The window isn't a fixed box, either. One owner in main resizes it on the fly — growing rightward from just his face, to a slim dock the moment your cursor is over him, to a full panel when the chat is open — so he only ever takes the space he's actually using. The catch that shaped it: a non-activating always-on-top panel never receives hover events of its own, so that one owner watches the cursor position from main rather than waiting on the window to report it. One place decides the size; nothing else fights over it.
The two halves never share objects. They speak only through a preload bridge that exposes a tiny allow-listed API — window.cosmo.on() / send() / invoke() — and rejects any channel not on the list. That allow-list is the contract for what main and renderer are allowed to say to each other.
The renderer is sandboxed and context-isolated. No Node in the window; no direct IPC. Everything crosses through the preload allow-list, so the attack surface is a handful of named messages — not "the whole app."
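The allow-list idea can be sketched in a few lines. This is a minimal illustration, not Cosmo's actual preload script: the `Transport` interface is a hypothetical stand-in for the IPC object a preload script would wrap, and the split of channels into "push" and "request" sets is an assumption made for the example.

```typescript
// Hypothetical stand-in for the IPC transport a preload script would wrap.
interface Transport {
  send(channel: string, payload: unknown): void;
  invoke(channel: string, payload?: unknown): Promise<unknown>;
  on(channel: string, handler: (payload: unknown) => void): void;
}

// Assumed allow-lists for illustration; the real list lives in the shared types.
const PUSH_CHANNELS = new Set(["mood:set", "activity:set", "chat:message", "voice:status"]);
const REQUEST_CHANNELS = new Set(["settings:get", "chat:submit"]);

function assertAllowed(allowed: Set<string>, channel: string): void {
  if (!allowed.has(channel)) {
    throw new Error(`channel not allow-listed: ${channel}`);
  }
}

// Builds the object a preload script would expose as window.cosmo.
function createBridge(transport: Transport) {
  return {
    on(channel: string, handler: (payload: unknown) => void): void {
      assertAllowed(PUSH_CHANNELS, channel);
      transport.on(channel, handler);
    },
    send(channel: string, payload: unknown): void {
      assertAllowed(REQUEST_CHANNELS, channel);
      transport.send(channel, payload);
    },
    invoke(channel: string, payload?: unknown): Promise<unknown> {
      assertAllowed(REQUEST_CHANNELS, channel);
      return transport.invoke(channel, payload);
    },
  };
}
```

Because every method checks the channel before touching the transport, a renderer that asks for an unlisted channel gets an error instead of a message on the wire.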
Who owns how Cosmo feels
If a watcher, a tool, and the brain can all change the mood, who's actually in charge?
Exactly one file: state.ts in main. The StateManager is the single owner of MoodState. Anyone who wants a mood change calls setState(mood, durationMs?); the manager pushes it to the renderer over IPC and, for transient moods like happy or listening, sets a timer to revert to idle. The renderer never decides a mood — it receives one and draws it.
shared/types.ts — the vocabulary that crosses the boundary
type MoodState =
| 'idle' | 'listening' | 'thinking' | 'speaking'
| 'happy' | 'bored' | 'annoyed' | 'sleeping';
type ActivityState =
| { type: 'music'; nowPlaying: { track: string; artist: string } }
| { type: 'searching' }
| { type: 'timer'; remainingSec: number; label: string }
| null;
Diagram: setState() → mood:set. Every mood change funnels through one owner, so two sources can never fight over the face.
Mood is a property with one owner, not a state machine scattered across files. Timing and physics (blink rhythm, idle escalation, how long "happy" lasts) live in main. Packs decide how things look, never when they happen — which is what lets you reskin Cosmo without touching a line of logic.
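A single-owner mood manager might look roughly like this. It is a sketch under assumptions: the `PushMood` callback stands in for the IPC send of mood:set, and the `current` getter and `dispose()` method are hypothetical additions for the example rather than anything the real state.ts is known to expose.

```typescript
type MoodState =
  | "idle" | "listening" | "thinking" | "speaking"
  | "happy" | "bored" | "annoyed" | "sleeping";

// Hypothetical push function; in the app this would be an IPC send of mood:set.
type PushMood = (mood: MoodState) => void;

class StateManager {
  private mood: MoodState = "idle";
  private revertTimer: ReturnType<typeof setTimeout> | null = null;
  private readonly push: PushMood;

  constructor(push: PushMood) {
    this.push = push;
  }

  get current(): MoodState {
    return this.mood;
  }

  // The single entry point for mood changes. Passing a duration makes the
  // mood transient: it reverts to idle unless another change arrives first.
  setState(mood: MoodState, durationMs?: number): void {
    this.cancelRevert();
    this.mood = mood;
    this.push(mood);
    if (durationMs !== undefined) {
      this.revertTimer = setTimeout(() => {
        this.revertTimer = null;
        this.setState("idle");
      }, durationMs);
    }
  }

  // Cancels any pending revert (useful on shutdown and in tests).
  dispose(): void {
    this.cancelRevert();
  }

  private cancelRevert(): void {
    if (this.revertTimer !== null) {
      clearTimeout(this.revertTimer);
      this.revertTimer = null;
    }
  }
}
```

The detail worth noticing is that a new setState() call always cancels the previous revert timer, so a stale "happy" timeout can never drag a later mood back to idle.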
How the face knows what to draw
You want to add a whole new look — anime eyes, a chibi character — without editing the engine. How?
An expression pack is anything that can render a mood. The contract is small on purpose, and the registry just picks one by name. Adding a look means adding a folder under renderer/packs/<name>/ and registering it — the brain, the watchers, and main don't change at all.
renderer/packs/types.ts
interface ExpressionPack {
init(container: HTMLElement, opts: { reducedMotion: boolean }): void;
setState(state: MoodState): void; // render a mood
pulse(event: PulseEvent): void; // one-shot reaction (blink, heart…)
setActivity(a: ActivityState | null): void;
setGaze?(dx: number, dy: number): void; // optional cursor-follow
dispose(): void;
}
The shipped classic pack is pure DOM + CSS: two dark circles, a glossy white shine offset up-left, two clustered catch-lights, cheek blush, and a tweened mouth — exactly the eyes blinking at the top of this page. A second chibi pack swaps in illustrated characters. Both satisfy the same five methods, so main treats them identically.
This same "small contract + registry + one folder per implementation" shape repeats three more times below — for tools, for providers, for watchers. Learn it once here and the rest of the codebase reads itself.
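The registry half of the pattern is small enough to sketch generically. The `Registry` class and the trimmed-down `MiniPack` interface below are hypothetical; in particular, keying implementations by a `name` field is an assumption, since the pack contract shown above does not include one.

```typescript
// Generic registry: holds named implementations of one contract.
class Registry<T extends { name: string }> {
  private readonly items = new Map<string, T>();

  register(item: T): void {
    if (this.items.has(item.name)) {
      throw new Error(`duplicate registration: ${item.name}`);
    }
    this.items.set(item.name, item);
  }

  get(name: string): T {
    const item = this.items.get(name);
    if (item === undefined) {
      throw new Error(`unknown implementation: ${name}`);
    }
    return item;
  }

  names(): string[] {
    return Array.from(this.items.keys());
  }
}

// Trimmed-down stand-in for the pack contract (DOM parameters omitted).
interface MiniPack {
  name: string;
  setState(mood: string): string;
}

const packs = new Registry<MiniPack>();
packs.register({ name: "classic", setState: (mood) => `classic:${mood}` });
packs.register({ name: "chibi", setState: (mood) => `chibi:${mood}` });

// The engine asks for a pack by name and only ever sees the contract.
const selected = packs.get("chibi");
```

Adding a third look in this sketch is one more register() call; nothing that consumes `selected` changes.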
The line nothing crosses
Cosmo notices your day — when you've gone idle, what app you're in. So what stops it from being spyware?
The architecture, not a promise. Privacy here isn't a setting you trust — it's enforced by where the boundaries are drawn. Cosmo has no camera. It senses your day through a few narrow, local channels: system idle-time (how long since any input), and the frontmost app name, browser domain, and window title, all read on your machine via AppleScript. From those it keeps a private record of how your time was spent — stored in ~/.pixel and mirrored to your own Obsidian vault, and sent nowhere. What actually crosses into the rest of the app is a tiny fact:
// what the focus watcher reports to the judge: a fact, no title, no URL
{ source: 'focus', cls: 'work' | 'distraction' | 'meeting' | 'neutral', secs: 30 }
// the richer sample is written only to the local activity log on disk
{ app: 'Xcode', title: 'main.ts', domain: '', category: 'dev', secs: 30 }
No keylogging. Cosmo knows that you typed from idle-time deltas only — never key codes, never content.
Activity stays on your machine. App name, window title, and domain feed a local activity log (~/.pixel) and a mirrored Activity.md in your own vault — for nudges and your own recaps, never uploaded.
By default, nothing implicit reaches the LLM. Window titles, URLs, app names, and idle facts are never put in a model request — the AI sees only what you explicitly typed or said. The one exception is opt-in: Smart Focus (off by default) may send a single app name + window title to your configured model to label an app the local heuristic can't place — never URLs, never your activity history, and cached so it runs rarely.
Because these are structural, you can reason about them by reading the boundary, not by auditing every feature. No tool can see your keystrokes, and nothing ships your activity history to a model — the one app-context path that can reach the model, Smart Focus, stays off until you switch it on.
Noticing you, honestly
How does Cosmo know you've gone idle, or fallen down a distraction hole — without surveilling you?
Small polling loops called watchers. Idle watches system idle-time; focus classifies the frontmost app as work / distraction / meeting / neutral; others watch battery and screen-time. Crucially, a watcher's job is to report a fact ("idle for 12 minutes", "distraction app for 15 of the last 30") and nothing more.
main/watchers/types.ts
interface Watcher {
name: string;
start(ctx: WatcherContext): void; // emits facts via ctx, never opinions
stop(): void;
}
Detecting which app you're in uses AppleScript through a single chokepoint (osascript.ts, covered in "The safe machine"): app name, browser domain, and window title, all kept on your machine. No vision, no keystrokes.
If each watcher could set the mood directly, the idle watcher and the focus watcher could scold you twice for the same lull. Keeping them dumb — pure sensors — means the judgment lives in exactly one tunable place. That place is the next chapter.
Turning facts into feeling
Several sensors are firing facts. Who decides Cosmo should actually look bored — and just once?
The clean answer is one file: workSignal.ts. It is the only place allowed to translate the combined stream of watcher facts into a decision: drift to bored, escalate to annoyed, fire a spoken callout, or do nothing because it's outside your work hours. Every "should Cosmo react?" rule lives there, which means tuning the personality is editing one file, not hunting through four.
And that's how it actually works. A watcher's context exposes exactly one mood-related method — ctx.report(fact) — and nothing else; none of them can touch the mood. workSignal is the single consumer of those facts: it's the only place that calls setMood, fires a callout, applies the cooldowns, and respects your work hours — and it won't stomp a mood you caused yourself (it only nudges Cosmo toward bored or annoyed when he's already idling). The judgment really does live in one tunable file.
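The facts-in, decision-out shape can be sketched as a pure function. Everything specific here is an assumption for illustration: the idle fact's shape, the `JudgeInput` fields, and both thresholds are invented, and the real workSignal.ts also handles callouts and cooldowns that this sketch leaves out.

```typescript
type Mood = "idle" | "bored" | "annoyed" | "thinking" | "speaking";

type Fact =
  | { source: "focus"; cls: "work" | "distraction" | "meeting" | "neutral"; secs: number }
  | { source: "idle"; idleSecs: number }; // assumed shape for the idle watcher

interface JudgeInput {
  fact: Fact;
  currentMood: Mood;
  withinWorkHours: boolean;
  distractionSecs: number; // running total kept by the judge (assumed)
}

type Decision =
  | { kind: "none" }
  | { kind: "mood"; mood: "bored" | "annoyed" };

// Illustrative thresholds only.
const BORED_AFTER_IDLE_SECS = 10 * 60;
const ANNOYED_AFTER_DISTRACTION_SECS = 15 * 60;

function judge(input: JudgeInput): Decision {
  // Outside work hours the judge stays silent.
  if (!input.withinWorkHours) return { kind: "none" };
  // Never overwrite a mood the user caused (thinking, speaking, ...).
  if (input.currentMood !== "idle") return { kind: "none" };

  const { fact } = input;
  if (fact.source === "idle") {
    return fact.idleSecs >= BORED_AFTER_IDLE_SECS
      ? { kind: "mood", mood: "bored" }
      : { kind: "none" };
  }
  if (fact.cls === "distraction") {
    const total = input.distractionSecs + fact.secs;
    if (total >= ANNOYED_AFTER_DISTRACTION_SECS) {
      return { kind: "mood", mood: "annoyed" };
    }
  }
  return { kind: "none" };
}
```

Because the function takes facts and returns a decision without side effects, retuning the personality in this sketch means editing two constants and re-running a handful of table-style checks.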
many dumb sensors fan in · one smart judge fans out
Talking to any LLM
A 7B model on your laptop and a frontier model in the cloud should be interchangeable. How do you pull that off?
Every brain hides behind one contract. The registry returns whichever provider your config selects, and the brain code above never knows which one it got.
main/ai/providers/types.ts
interface LLMProvider {
name: string;
capabilities: { offline: boolean; nativeTools?: boolean };
chat(req: ChatRequest): Promise<ChatResponse>;
}
Six of the seven cloud providers are thin presets over a single shared transport, openaiCompat.ts — they differ by a base URL, an API-key name, and the odd per-vendor quirk (OpenAI's max_completion_tokens, Gemini's empty tool-call ids). Anthropic gets its own adapter because its API shape differs. Adding a new OpenAI-compatible vendor is roughly fifteen lines.
Diagram: the OpenAI-compatible presets share one transport, plus Anthropic as a standalone adapter; all satisfy the same chat().
The tool protocol is plain text (next chapter), not vendor function-calling, so a tiny local model and a frontier model are genuinely swappable. A heavyweight framework would bloat the bundle and couple us to one vendor's calling convention — the opposite of the goal.
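The "thin preset over a shared transport" idea might be sketched like this. The `CompatPreset` fields and the `buildRequest` helper are hypothetical, and the second vendor is a deliberately fake placeholder; only the OpenAI base URL and the max_completion_tokens parameter name are real.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Per-vendor differences captured as data (field names are illustrative).
interface CompatPreset {
  name: string;
  baseUrl: string;
  apiKeyName: string;
  // Vendors disagree on what the output-token cap is called.
  maxTokensField: "max_tokens" | "max_completion_tokens";
}

const openaiPreset: CompatPreset = {
  name: "openai",
  baseUrl: "https://api.openai.com/v1",
  apiKeyName: "OPENAI_API_KEY",
  maxTokensField: "max_completion_tokens",
};

// Placeholder vendor, not a real service.
const examplePreset: CompatPreset = {
  name: "example-vendor",
  baseUrl: "https://llm.example.invalid/v1",
  apiKeyName: "EXAMPLE_API_KEY",
  maxTokensField: "max_tokens",
};

// One shared request builder; presets only supply the differences.
function buildRequest(
  preset: CompatPreset,
  model: string,
  messages: ChatMessage[],
  maxTokens: number,
): { url: string; body: Record<string, unknown> } {
  const body: Record<string, unknown> = { model, messages };
  body[preset.maxTokensField] = maxTokens;
  return { url: `${preset.baseUrl}/chat/completions`, body };
}
```

A new OpenAI-compatible vendor in this sketch is one more preset object; the builder and everything above it stay untouched.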
Actually doing things
The model decided to search the web. How does that wish become a real action — safely, and without crashing on a malformed reply?
A tool is a name, a description, a zod schema for its arguments, and an execute function. The registry validates arguments against the schema before anything runs, then races execution against a timeout, so a tool can't receive bad input or hang the app.
main/tools/types.ts
interface Tool<A> {
name: string;
description: string;
schema: z.ZodType<A>; // args validated BEFORE execute
availableOffline: boolean;
execute(args: A, ctx: ToolContext): Promise<ToolResult>;
}
type ToolResult =
| { ok: true; summary: string; data?: unknown }
| { ok: false; error: string; userMessage: string };
Tools never reach into the app directly. They get a ToolContext — ctx.speak(), ctx.setMood(), ctx.setActivity(), ctx.config, ctx.log — so they stay testable and can't poke at internals they shouldn't.
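The validate-then-race boundary could look roughly like the sketch below. To keep it dependency-free, a hand-rolled `Schema` interface stands in for the zod schema, the `Tool` shape is trimmed (no description or ToolContext), and `runTool` and `fail` are hypothetical names.

```typescript
type ToolResult =
  | { ok: true; summary: string; data?: unknown }
  | { ok: false; error: string; userMessage: string };

// Stand-in for a zod schema: parse() returns typed args or throws.
interface Schema<A> {
  parse(raw: unknown): A;
}

interface Tool<A> {
  name: string;
  schema: Schema<A>;
  execute(args: A): Promise<ToolResult>;
}

function fail(error: string): ToolResult {
  return { ok: false, error, userMessage: "Sorry, that didn't work." };
}

async function runTool<A>(tool: Tool<A>, rawArgs: unknown, timeoutMs: number): Promise<ToolResult> {
  let args: A;
  try {
    args = tool.schema.parse(rawArgs); // validate BEFORE execute
  } catch (err) {
    return fail(`invalid args for ${tool.name}: ${String(err)}`);
  }

  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<ToolResult>((resolve) => {
    timer = setTimeout(() => resolve(fail(`${tool.name} timed out`)), timeoutMs);
  });

  try {
    // Whichever settles first wins; a thrown error is wrapped, not propagated.
    return await Promise.race([tool.execute(args), timeout]);
  } catch (err) {
    return fail(`${tool.name} threw: ${String(err)}`);
  } finally {
    clearTimeout(timer);
  }
}
```

Note that the race only stops waiting for a slow tool; it does not cancel the work already in flight, which a real registry may want to handle separately.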
The tool-call protocol
Tool calls are fenced JSON blocks in plain text — deliberately not vendor function-calling. The dispatcher parses them out of the model's prose; capable cloud models additionally use native function-calling, but both paths converge in the same loop. Unknown tool or broken JSON? It falls back to a plain text answer. It never crashes.
// the model writes this; the dispatcher extracts it
```search.web
{ "query": "papers on large language models" }
```
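Extracting such a block from prose can be sketched with one regular expression. This is an illustrative parser, not the actual dispatcher: `parseToolCall` and `ParsedCall` are hypothetical names, and returning null to signal "treat this as a plain answer" is an assumption about how the fallback is wired.

```typescript
interface ParsedCall {
  tool: string;
  args: unknown;
}

// Built from pieces so this sample contains no literal fence of its own.
const FENCE = "`".repeat(3);
const BLOCK = new RegExp(FENCE + "([\\w.]+)[ \\t]*\\n([\\s\\S]*?)\\n?" + FENCE);

// Returns the first well-formed call to a known tool, or null so the caller
// can fall back to treating the whole reply as plain text.
function parseToolCall(reply: string, knownTools: Set<string>): ParsedCall | null {
  const match = BLOCK.exec(reply);
  if (match === null) return null;

  const tool = match[1] ?? "";
  const body = match[2] ?? "";
  if (!knownTools.has(tool)) return null; // unknown tool: plain answer

  try {
    return { tool, args: JSON.parse(body) };
  } catch {
    return null; // broken JSON: plain answer, never a crash
  }
}
```

The arguments come back as `unknown` on purpose: schema validation is a separate step, so the parser never has to know what any tool expects.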
Reason → Act → Observe
The brain runs a bounded ReAct loop (up to four steps): the model reasons, optionally calls one tool, sees the result as an observation, and continues — so it can chain search → read and recover when the first guess is wrong. Identical calls are de-duplicated so it can't spin.
- You speak or type: the only thing that enters the brain is what you explicitly said
- Model replies with prose + maybe a fenced tool block: "On it," then search.web {…}
- Dispatcher parses & validates: zod-checks args; unknown/garbled → plain answer, no crash
- Registry executes with a timeout: tool runs with an injected ToolContext, capped at a few seconds
- Result fed back as an observation: loop continues, or the model writes its final reply
- Cosmo speaks the answer: through the serialized speech queue (Chapter 10)
Timeouts and error-wrapping live in the registry, not in individual tools. A tool author writes the happy path; the boundary makes it safe. That's why a malformed model reply degrades to text instead of taking down the app.
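The bounded loop itself can be sketched with the model and the tool runner injected as functions. The `AskModel` and `RunTool` types, the transcript format, and the choice to stop on a repeated call are all assumptions for illustration; the text above only says identical calls are de-duplicated.

```typescript
interface ToolCall {
  tool: string;
  args: unknown;
}

interface ModelTurn {
  reply: string;
  call: ToolCall | null; // null means the reply is the final answer
}

// Hypothetical injected dependencies.
type AskModel = (transcript: string[]) => Promise<ModelTurn>;
type RunTool = (call: ToolCall) => Promise<string>;

const MAX_STEPS = 4;

async function react(userText: string, askModel: AskModel, runTool: RunTool): Promise<string> {
  const transcript = [`user: ${userText}`];
  const seen = new Set<string>();
  let lastReply = "";

  for (let step = 0; step < MAX_STEPS; step++) {
    const turn = await askModel(transcript);
    lastReply = turn.reply;
    if (turn.call === null) return turn.reply; // final answer

    // An identical repeated call would only spin, so stop here (assumed policy).
    const key = `${turn.call.tool}:${JSON.stringify(turn.call.args)}`;
    if (seen.has(key)) return turn.reply;
    seen.add(key);

    const observation = await runTool(turn.call);
    transcript.push(`assistant: ${turn.reply}`);
    transcript.push(`observation: ${observation}`);
  }
  return lastReply; // step budget exhausted
}
```

Both exits are cheap to reason about: the loop ends when the model stops asking for tools, repeats itself, or runs out of its four steps.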
Hearing and speaking
From "Cosmo…" out loud to a spoken answer back — without cutting you off, talking over itself, or shipping your audio anywhere it shouldn't go.
The renderer runs an on-device voice-activity detector (Silero VAD) and ships finished speech segments to main. Main transcribes them — locally by default (Moonshine / Whisper via transformers.js, in a forked worker process), or via a cloud STT provider if you choose one. A wake-word gate opens a short window when it hears "Cosmo", and smart end-of-turn detection decides you've actually finished talking instead of just pausing.
- VAD catches a speech segment: renderer → main, 16 kHz audio · paused while Cosmo talks (echo control)
- STT transcribes: local Whisper/Moonshine worker by default; cloud optional
- Wake gate + smart turn: fuzzy-matches "Cosmo", rejects "cosmos/costco", waits for a real end-of-turn
- The brain runs: the ReAct loop from Chapter 08
- Speech queue speaks: serialized TTS (local Kokoro or cloud) with a watchdog so a stuck voice can't mute the mic
Tap the mic mid-sentence and Cosmo stops instantly — barge-in aborts synthesis, clears the queue, and drops straight into listening. The whole thing is glued together by one primitive, the speech queue, so tools just call ctx.speak() and never worry about ordering.
The ONNX speech models run in a forked system-node worker, not in Electron's bundled Node — onnxruntime crashes under Electron's runtime. A small platform quirk, but it's the kind of thing the architecture has to make room for, so it's isolated behind a worker boundary.
How the parts talk — and why Cosmo never talks over itself
Two processes, a handful of background loops, and tools all wanting to say something. What carries those messages — and what stops Cosmo from blurting four things at once?
The line between the two halves
Main and the renderer never share an object. They pass messages across the preload allow-list from Chapter 01, and the traffic comes in two distinct shapes:
- Pushes: fire-and-forget, main → renderer. Main decides something changed and tells the face: mood:set, activity:set, chat:message, voice:status. The renderer subscribes with window.cosmo.on(…) and redraws.
- Requests: ask-and-wait, renderer → main. window.cosmo.invoke('settings:get') returns a value; chat:submit hands main what you typed and waits for the reply. Anything that needs an answer uses invoke; anything fire-once uses send.
In the draw direction only MoodState and ActivityState ever cross — the same narrow vocabulary from Chapter 02. Every channel name is an explicit enum in shared/types.ts, so both sides agree on exactly which messages exist; a channel that isn't on the list is rejected by the bridge.
One mouth, one queue
Speaking is the one place a floating companion can embarrass itself. A reply lands just as the idle watcher gets bored, the battery dips, and a timer ends — four voices, all at once, talking over each other. Cosmo routes every spoken line through a two-stage pipeline so that can't happen.
main/watchers/calloutManager.ts — the gate in front of the queue
requestCallout(text, config) {
if (!config.voice.proactiveSpeech) return; // opt-in: Cosmo is quiet by default
if (this.paused || this.meetingQuiet) return; // never mid-meeting
if (Date.now() - this.lastCalloutAt < cooldownMs) return; // ≤ 1 nudge per cooldown (def. 20 min)
this.lastCalloutAt = Date.now();
speechQueue.enqueue(text); // → the one serialized queue
this.onSpeak?.(text); // + a visual nudge if he's off-screen / muted
}
Stage 1: the callout gate. The watchers' proactive nudges ("you've gone quiet…") go through calloutManager first. It lets a line through only if you've opted into proactive speech, Cosmo isn't paused or in a meeting, and a cooldown has elapsed since the last nudge. So even when idle, focus, and battery all fire in the same second, at most one nudge survives; the rest are silently swallowed.
Stage 2: the speech queue. Everything that does speak (the answers to questions you asked, plus the surviving nudge) is enqueued in one FIFO speechQueue that speaks one utterance at a time, never overlapping. Each line is sanitized for the synth, capped so a runaway reply can't synthesize minutes of audio, and guarded by a 30-second watchdog so a wedged audio device can't freeze the mic. Tap the mic to interrupt, and barge-in aborts the current line and empties the queue.
And one more guarantee: that you actually notice. A spoken nudge is no use if Cosmo is buried behind a window or muted in the tray, so every proactive line (a surviving callout, a daily recap, a reminder coming due) also fires a silent visual nudge through the same hook (onSpeak above): he bounces in place to catch your eye, and if he's hidden or muted, raises a native notification instead. The gate decides whether to speak up; this makes sure the moment lands even when the voice can't.
Diagram: enqueue(). Proactive nudges must pass the cooldown gate; your direct replies skip it, but everything still serializes through the one queue.
They solve different problems. The cooldown controls how often Cosmo speaks up on its own — that's taste, and it's tunable. The queue controls whether two sounds ever overlap — that's a hard guarantee, because ctx.speak() is the only way anything makes sound and it always lands in the same single-consumer queue. Separate the "how often" from the "never at once" and each stays simple.
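The "never at once" half can be sketched as a single-consumer queue. This is a simplified illustration: the `Speak` callback is a hypothetical synth function, the abort signal is a plain flag rather than whatever the real speechQueue.ts uses, and sanitizing and length caps are left out.

```typescript
// Hypothetical synth function: resolves when the utterance finishes playing,
// and is expected to stop early if signal.aborted becomes true.
type Speak = (text: string, signal: { aborted: boolean }) => Promise<void>;

class SpeechQueue {
  private readonly pending: string[] = [];
  private readonly speak: Speak;
  private readonly watchdogMs: number;
  private draining = false;
  private current: { aborted: boolean } | null = null;

  constructor(speak: Speak, watchdogMs = 30000) {
    this.speak = speak;
    this.watchdogMs = watchdogMs;
  }

  enqueue(text: string): void {
    this.pending.push(text);
    if (!this.draining) void this.drain();
  }

  // Barge-in: drop everything queued and flag the utterance in progress.
  abort(): void {
    this.pending.length = 0;
    if (this.current !== null) this.current.aborted = true;
  }

  // Single consumer: only one drain loop ever runs, so utterances never overlap.
  private async drain(): Promise<void> {
    this.draining = true;
    while (this.pending.length > 0) {
      const text = this.pending.shift() as string;
      const signal = { aborted: false };
      this.current = signal;
      let timer: ReturnType<typeof setTimeout> | undefined;
      const watchdog = new Promise<void>((resolve) => {
        timer = setTimeout(() => { signal.aborted = true; resolve(); }, this.watchdogMs);
      });
      try {
        await Promise.race([this.speak(text, signal), watchdog]);
      } catch {
        // A failed utterance must not stall the ones behind it.
      } finally {
        clearTimeout(timer);
      }
    }
    this.current = null;
    this.draining = false;
  }
}
```

The watchdog is what keeps a wedged audio device from blocking the loop forever: if an utterance does not finish in time, the queue flags it and moves on.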
Remembering, locally
Cosmo should recall what matters from past chats — without dumping your whole history into every prompt, and without a cloud database.
Memory is two cooperating ideas. Semantic recall keeps a plain-JSON vector store on disk; a forked worker runs a small local embedding model (all-MiniLM, 384-dim). When you ask something, Cosmo embeds the query, cosine-searches the store, and injects only the handful of relevant chunks. If the embedder isn't ready, it falls back to injecting the memory files whole — it degrades, it doesn't fail.
A file-based store has zero native dependencies, re-indexes on file mtime (cheap and self-healing), and is trivial to inspect. For one user's memory it's more than fast enough — and it survives packaging quirks that a compiled vector extension wouldn't.
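The recall step is plain cosine similarity over stored vectors, which can be sketched as follows. The `Chunk` shape and the `recall` helper are assumptions for illustration, and tiny 3-dimensional vectors stand in for the 384-dimensional embeddings.

```typescript
interface Chunk {
  id: string;
  text: string;
  vector: number[];
}

// Cosine similarity: dot product divided by the product of the norms.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Scores every chunk against the query and keeps the topK best matches.
function recall(store: Chunk[], query: number[], topK: number): Chunk[] {
  return store
    .map((chunk) => ({ chunk, score: cosine(chunk.vector, query) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((hit) => hit.chunk);
}

const store: Chunk[] = [
  { id: "a", text: "likes espresso", vector: [1, 0, 0] },
  { id: "b", text: "owns a bike", vector: [0, 1, 0] },
  { id: "c", text: "drinks coffee daily", vector: [0.9, 0.1, 0] },
];
const hits = recall(store, [1, 0, 0], 2);
```

A linear scan like this is the simplest thing that works; for one person's notes it avoids needing any index structure at all.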
The second idea is the Obsidian vault mirror: every note, task, and reminder is also written as plain Markdown into a vault folder you own. The canonical stores live elsewhere; the vault is a readable projection you can browse or edit. Redundant on purpose — your data is never trapped inside Cosmo.
Your conversation is kept the same honest way. Every message is appended to a plain transcript file in ~/.pixel; the chat window shows only the last few and lazy-loads older ones as you scroll up, so a long history never bloats memory or a prompt. Local, capped, and yours to clear in a click — like everything else here, nothing about your past is locked in a format only Cosmo can read.
The safe machine
Tools and watchers all want to run AppleScript, speak, log, and read API keys. How do you give them that without scattering risk everywhere?
Shared low-level modules in main/core/, injected through context — tools and watchers use them, never import them directly:
- osascript.ts: the only place AppleScript runs. Always execFile with an argument array, never a shell string with your text interpolated in. One chokepoint to audit for injection.
- speechQueue.ts: serialized TTS with abort and a 30-second watchdog, so a bad audio device can't leave the mic muted forever.
- secrets.ts: API keys sealed via the OS keychain (safeStorage), never written as plaintext.
- log.ts: a rotating logger to ~/.pixel/logs.
One chokepoint per dangerous capability. AppleScript only through osascript.ts; TTS only through the speech queue; keys only through secrets.ts. Centralizing the risk is what makes the rest of the code boring — and boring is safe.
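The "argument array, never an interpolated string" rule might be sketched like this. The `notifyCommand` helper and `Command` type are hypothetical; the sketch only builds the command, on the assumption that the caller hands `file` and `args` to Node's child_process.execFile, which passes each array entry to the program as-is without a shell.

```typescript
interface Command {
  file: string;
  args: string[];
}

// The script is a fixed template. User text travels as a separate argv entry
// and is read inside AppleScript via `on run argv`, so it is treated as data
// and never parsed as script source.
function notifyCommand(userText: string): Command {
  const script = [
    "on run argv",
    "display notification (item 1 of argv)",
    "end run",
  ].join("\n");
  return { file: "osascript", args: ["-e", script, userText] };
}

const cmd = notifyCommand('Timer done"; do shell script "echo oops');
```

However hostile the text is, it stays in its own slot of the array; the script string that gets interpreted never changes.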
How Cosmo comes alive — and where his soul lives
Before the face ever appears, something has to decide who Cosmo is. Where does his personality come from — and what order does everything else switch on in?
Cosmo's personality isn't compiled in. It lives as plain Markdown you can open and edit, under ~/.pixel/workspace/. On the very first run main seeds those files with the built-in persona and operating rules, then leaves them alone forever after — so the instant you change a line, your version is what Cosmo reads.
~/.pixel/workspace/ — editable, plain markdown
SOUL.md # personality, voice, values — the character sheet
AGENTS.md # operating rules: when to use a tool, the output contract
USER.md # durable facts about you ("remember X" appends here)
MEMORY.md # curated long-term memory across sessions
memory/YYYY-MM-DD.md # daily notes — compaction folds into these
The seed is write-if-missing: ensureWorkspace() runs every boot but only creates a file that isn't there yet, so your edits survive every update. Each turn the prompt builder loads them back — context.ts reads SOUL (the persona), AGENTS (the rules), and USER + MEMORY (what Cosmo knows about you) and folds them into the system prompt, shaped per model family. Personality is data, not code: change the file, change Cosmo, no rebuild.
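Write-if-missing seeding can be sketched against a tiny file-system interface so it runs without touching disk. The `FileStore` interface, the seed contents, and this `ensureWorkspace` signature are assumptions for illustration; only the function name and the four file names come from the text above.

```typescript
// Minimal file-system surface so the sketch runs without touching disk.
interface FileStore {
  exists(path: string): boolean;
  write(path: string, content: string): void;
}

// Hypothetical built-in defaults; real seed text would be much longer.
const SEEDS: Array<[string, string]> = [
  ["SOUL.md", "# Soul\n"],
  ["AGENTS.md", "# Agents\n"],
  ["USER.md", "# User\n"],
  ["MEMORY.md", "# Memory\n"],
];

// Safe to call on every boot: only creates files that are not there yet.
// Returns the names it created, which is handy for logging.
function ensureWorkspace(store: FileStore, dir: string): string[] {
  const created: string[] = [];
  for (const [name, content] of SEEDS) {
    const path = `${dir}/${name}`;
    if (store.exists(path)) continue; // an existing file always wins
    store.write(path, content);
    created.push(name);
  }
  return created;
}
```

On a real disk, Node's fs.writeFileSync with the "wx" flag gives the same create-only behaviour in a single call, since that flag fails when the path already exists.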
With the soul in place, the rest comes up in a deliberate order. Startup splits in two halves: a synchronous stretch that runs the moment main loads — environment, config, and the workspace + vault seeds — and then everything inside app.whenReady(), where the window, the four registries, and the voice worker switch on. A loader overlay holds the face back until voice, ears, and turn-detection all report ready, with a 25-second safety net so one stuck subsystem can't trap it forever.
Diagram (boot sequence): backfillDefaults() (your saved choices win; new defaults fill the gaps) → ensureWorkspace() (seed if missing: SOUL.md · AGENTS.md · USER.md · MEMORY.md; your edits always survive) → warmRecall() (fold legacy memory into USER.md; build the embedder index in the background) → app.whenReady() → app:// → boot:ready, where the loader lifts once voice, ears, and turn detection are all green (25s safety net). Seed the editable soul before the face appears, then warm the senses; the loader waits for all three.
Hardcoding the persona in TypeScript means only a rebuild can retune the voice — and you can never actually see what Cosmo "is". Plain files make the personality inspectable, diffable, and yours to rewrite. Same local-and-open principle as the vault and the chat transcript: nothing about who Cosmo is, or what he knows, is locked in a format only Cosmo can read.
Build your own Cosmo
You've now seen every load-bearing piece. The same three patterns held the whole way down:
- Contract + registry: packs, tools, providers, watchers. Add a folder, never edit a neighbour.
- Single-owner state: state.ts owns mood; workSignal.ts owns judgment, turning watcher facts into a mood or a nudge; one owner in main even owns the window's size. One owner each, no fights.
- Facts vs. judgment: sensors report, one place decides. Privacy and personality both fall out of this.
If you wanted to rebuild it from nothing, the order writes itself — and each step is a working app:
- A frameless always-on-top Electron window: serve the renderer over app://
- One mood, one owner: StateManager pushes MoodState to a CSS-eyes pack
- One LLM provider + the fenced-JSON ReAct loop: start with a single cloud model
- A couple of tools: search, open URL, timer, all behind the zod-validated registry
- Voice: cloud STT/TTS first, then a local Whisper worker, then wake word + barge-in
- Watchers → workSignal: watchers report idle and focus facts; workSignal turns them into mood and nudges
- Memory: a JSON vector store + a local embedder worker, plus the vault mirror
- Harden the boundaries: the privacy line, the AppleScript chokepoint, sealed keys
Bonus: how this very site takes payments
You just took Cosmo apart. Here's a second, much smaller system — the "buy me a coffee" button on this very page — and it leans on the same instincts: one owner per job, a contract you can swap behind, and a hard line that secrets never cross.
The goal is modest: let a visitor leave an optional tip in their own currency, then hand back the GitHub link — on a plain PHP site with no payment SDK bundled in. The whole thing reduces to three ideas and one pattern you've already met. The three ideas: the browser is never trusted, the secret never leaves the server, and every payment is confirmed twice. The one pattern: each payment processor hides behind a single contract — the same contract-plus-registry shape that carried all of Part I.
No routes, no file names, no copy-paste endpoints — on purpose. This is the shape of a safe integration, the part that's the same whether you use Razorpay, PayPal, or Stripe, so you can build your own rather than clone ours.
A safe checkout, end to end
A visitor wants to leave a $5 tip. Where does the money actually change hands — and which parts of this is the browser allowed to decide? (Almost none of it.)
The browser is hostile territory: anyone can open dev-tools and change a number. So the rule is strict — the browser may start a payment and display the provider's checkout, but the amount, the verification, and the reward are all decided on your server. The payment provider (Razorpay for ₹, PayPal for other currencies) is the only thing that actually moves money; your server is the only thing trusted to say it happened.
- Browser asks your server to start a tip: it sends an amount + currency, but the server re-checks that amount against a minimum; the page's number is never trusted on its own
- Server creates an order with the provider: server-to-server, using the secret key; the browser gets back only an order id + your public identifier, never the secret
- The provider's checkout opens in the browser: card form or wallet, hosted by the provider; your site never sees card details
- The visitor pays the provider directly: money moves between the visitor and the provider, not through your server
- Server confirms with the provider: it verifies a signature, or captures the order server-side, and believes only the provider's own answer, never a "success" flag from the page
- Confirmed → server unlocks the reward: flip the record to paid, hand back the GitHub link
the browser may start and display a payment · only the server may decide it succeeded
It's the same lesson as Cosmo's privacy line (Chapter 04): trust is a property of where you draw the boundary, not a promise you make. Treat every value from the page as a request to be re-checked, and a tampered amount or a faked "it worked" simply can't get past the server.
Keys, secrets & signatures
Every provider hands you two strings that look alike: a publishable key and a secret key. Swap them by accident and you either break checkout or hand the world your account.
Three credentials, and the whole integration's safety is just keeping them in the right place:
- Publishable key / client id — a token that's safe in public. It only identifies your account to the provider. It can start a checkout but can't move money or read anything. This is the one value allowed into the browser.
- Secret key — never leaves the server. It authenticates your server when it creates or captures an order. In the browser it would let anyone charge as you, so it lives only in an environment file the web server can read and the public can't.
- Webhook secret — proves the provider is the one calling. Used to confirm a server-to-server notification genuinely came from the provider and wasn't forged (full webhooks next chapter).
where each credential is allowed to live
PUBLIC · in the page → publishable key / client id // identify only
SECRET · on the server → secret key // create & capture orders
SECRET · on the server → webhook secret // verify provider callbacks
Confirmation comes in two flavours, and a provider gives you one or the other:
- Signature check. After payment, the provider hands the browser a signature. Your server recomputes it from the order id + payment id using the secret key, and believes the payment only if they match — proof nothing was tampered with in transit.
- Server-side capture. Instead, your server asks the provider directly: "did this order complete, and for how much?" You trust only that reply, and cross-check the amount and currency against what you stored — so a tampered amount can't unlock the reward.
confirming a signed payment — runs on the server, with the secret key
expected = hmac_sha256(order_id + "|" + payment_id, secret_key)
if (expected !== signature_from_browser) reject() // forged or tampered
markPaidOnce(order_id) // verified → release the reward
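In TypeScript on Node, the same check might be sketched with the built-in crypto module. The `signaturesMatch` name is hypothetical, and the hex encoding of the signature is an assumption; check your provider's documentation for the exact message format and encoding it signs with.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Recomputes the HMAC on the server and compares it to what the browser
// relayed. Assumes the signature arrives hex-encoded.
function signaturesMatch(
  orderId: string,
  paymentId: string,
  signatureHex: string,
  secretKey: string,
): boolean {
  const expected = createHmac("sha256", secretKey)
    .update(`${orderId}|${paymentId}`)
    .digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on a length mismatch, so check length first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

Using timingSafeEqual instead of a plain string comparison avoids leaking, through response timing, how many leading bytes of a forged signature happened to be right.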
The secret key and the webhook secret never appear in HTML, JavaScript, or any file the public can fetch — only the publishable token ships to the browser. If a secret ever lands in front-end code or a screenshot, treat it as burned and rotate it immediately.
Webhooks — the backstop when the tab closes
The visitor pays, then closes the tab before the "it worked" call reaches your server. Did you just lose the payment?
No — because the browser was never the only path. A webhook is the provider calling your server directly, server-to-server, the moment a payment settles. It doesn't depend on the tab staying open, the network holding, or the page not being refreshed. And it's signed with the webhook secret from the last chapter, so your server can tell a real event from a forged one.
The catch is that confirmation can now arrive twice — the browser's callback and the webhook, or a provider retrying. So the write that marks a tip paid must be idempotent: flip the record from pending → paid exactly once, and quietly ignore any later copy. Releasing the reward twice is precisely the bug this prevents.
two independent confirmations, one idempotent write — a closed tab never loses a payment, a retry never double-counts it
They do different jobs. The browser callback is instant feedback for the person who just paid; the webhook is the truth that survives closed tabs and flaky networks. Build both, make the write idempotent, and the edge cases stop being edge cases.
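The idempotent write can be sketched with an in-memory ledger. The `TipLedger` class is hypothetical and stands in for whatever store the site really uses; in a database, the same effect usually comes from a conditional update that only touches rows still marked pending.

```typescript
type TipStatus = "pending" | "paid";

class TipLedger {
  private readonly status = new Map<string, TipStatus>();

  // Called when the server creates the order with the provider.
  open(orderId: string): void {
    if (!this.status.has(orderId)) this.status.set(orderId, "pending");
  }

  // Returns true only for the one call that actually flips pending -> paid.
  // Later copies (browser callback, webhook, provider retry) get false.
  markPaidOnce(orderId: string): boolean {
    if (this.status.get(orderId) !== "pending") return false;
    this.status.set(orderId, "paid");
    return true;
  }
}

const ledger = new TipLedger();
ledger.open("order_1");
const fromBrowser = ledger.markPaidOnce("order_1"); // first confirmation wins
const fromWebhook = ledger.markPaidOnce("order_1"); // duplicate is ignored
```

The caller releases the reward only when markPaidOnce() returns true, so it does not matter which confirmation arrives first or how many times the provider retries.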
One checkout, many currencies
An Indian visitor should pay in ₹ through one provider; everyone else in their own currency through another. How do you do that without four copies of the checkout?
With the exact boundary pattern from Part I — a contract plus a registry. A small pure function turns a visitor's rough region into a currency, a minimum tip, and which processor handles it. Each processor — one for ₹, one for everything else — hides behind a single shared interface: create an order · verify it · capture it. A registry hands back whichever one the currency selected, and the checkout code above never knows which it got. Adding a third processor tomorrow is a new adapter, not a rewrite.
one pure router · interchangeable processors behind one contract · the same shape as Cosmo's packs, tools, and providers
This is the contract-plus-registry boundary from Chapters 03, 07, and 08 — now applied to money. Learn the shape once and it keeps paying off: swap an implementation, never edit a neighbour. A safe checkout isn't a special skill; it's the same architecture, pointed at payments.
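The router-plus-registry shape might be sketched like this. All specifics are assumptions for illustration: the region codes, the minimum amounts, collapsing "everyone else" into a single USD route, and the trimmed `Processor` interface with only createOrder are simplifications of the three-method contract described above.

```typescript
type Currency = "INR" | "USD";

// Trimmed contract: the full shape would also verify and capture.
interface Processor {
  name: string;
  createOrder(amountMinor: number, currency: Currency): { orderId: string };
}

interface Route {
  currency: Currency;
  minAmountMinor: number; // smallest allowed tip, in minor units (paise, cents)
  processor: string;
}

// Pure function: region in, routing decision out. Minimums are illustrative.
function routeFor(region: string): Route {
  return region === "IN"
    ? { currency: "INR", minAmountMinor: 5000, processor: "razorpay" }
    : { currency: "USD", minAmountMinor: 100, processor: "paypal" };
}

// The checkout re-checks the amount on the server, then asks whichever
// processor the route selected. It never names a vendor itself.
function startTip(
  region: string,
  amountMinor: number,
  processors: Map<string, Processor>,
): { orderId: string } {
  const route = routeFor(region);
  if (!Number.isInteger(amountMinor) || amountMinor < route.minAmountMinor) {
    throw new Error("amount below minimum");
  }
  const processor = processors.get(route.processor);
  if (processor === undefined) {
    throw new Error(`no processor registered: ${route.processor}`);
  }
  return processor.createOrder(amountMinor, route.currency);
}
```

Because routeFor() is pure, the routing table can be tested without any network, and a third processor is one more map entry plus one more branch in the router.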
That's the whole system: three ideas — an untrusted browser, server-only secrets, confirmation that arrives twice and writes once — and one pattern you already knew. The "Get Cosmo" button at the top runs exactly this flow.