Under the hood · build-it-yourself

How Cosmo works

Cosmo is small, but it's built like something much bigger should be. This page takes the whole system apart, one piece at a time — what each part does, how it connects to the rest, and why it's built that way — so that by the end you could rebuild it yourself.

Each chapter starts with a problem, shows you the contract before the code, then explains the one decision that made it work. Read top to bottom, or jump around with the menu.

The whole system, connected

Every part, on one canvas

How the pieces actually wire together: your input, the IPC line, the four boundaries, the brain, voice, memory, and what reaches the outside world.

System map · Cosmo (diagram)
RENDERER · draws & senses: mic voice activity (detects speech on-device) · expression pack (the eyes you see)
IPC: MoodState · ActivityState
MAIN PROCESS · thinks & decides: voice pipeline (STT worker, wake gate, smart turn, echo control) · brain (ReAct: reason → act → observe; dispatcher, fenced + native) · LLM providers (openaiCompat + Anthropic) · watchers (idle, focus, battery) · tools (registry, zod, ToolContext) · StateManager (single mood owner) · workSignal (facts → mood + callout) · memory (vector store, embedder) · core (osascript, secrets, log) · speech queue (replies + nudges → one voice)
Outside world: LLM (cloud or local) · your Mac (AppleScript, Music) · Obsidian vault (your Markdown)
Chapter 00 · Orientation

The shape of the whole thing

Before any code: what are the big pieces, and what rule governs each?

Cosmo is an Electron app, so it has two halves. The main process (Node.js, CommonJS) is the grown-up: it owns timing, state, the AI, tools, the microphone loop, and everything that touches your Mac. The renderer (a sandboxed browser window, bundled with esbuild) owns exactly two display-side jobs — drawing the face and running the on-device microphone voice-activity detector. Between them sits a deliberately narrow IPC boundary with a tiny shared vocabulary.

Renderer: expression packs · microphone VAD (draws the face and listens for speech, decides nothing)
IPC boundary: only MoodState & ActivityState cross here
Main process: state · brain · tools · providers · watchers · voice · memory · core primitives

main thinks & decides · renderer shows & senses · the line between them is the whole privacy story

Almost every capability hides behind the same shape: a contract (a types.ts interface) plus a registry (the thing that holds the implementations). Adding a feature means adding a folder — never editing a neighbour. Three boundaries follow this pattern, plus a fourth for sensing:

  • Expression packs — how Cosmo looks (renderer/packs)
  • Tools — what Cosmo can do (main/tools)
  • LLM providers — which brain Cosmo uses (main/ai/providers)
  • Watchers — how Cosmo senses your day (main/watchers)
Why this way

Three reusable ideas carry the whole codebase: contract + registry (swap implementations without touching the engine), single-owner state (one file decides mood), and facts vs. judgment (sensors report, one place decides). Spot these three and the rest is detail.
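
To make the contract + registry idea concrete, here is a minimal sketch. The names Named, Registry, and Greeter are hypothetical illustrations, not Cosmo's source; the point is only that the engine looks an implementation up by name and never imports it directly.

```typescript
// Hypothetical names throughout: this illustrates the pattern, not Cosmo's code.
interface Named {
  name: string;
}

class Registry<T extends Named> {
  private readonly items = new Map<string, T>();

  // Each implementation registers itself once; a duplicate name is a bug.
  register(item: T): void {
    if (this.items.has(item.name)) {
      throw new Error(`duplicate registration: ${item.name}`);
    }
    this.items.set(item.name, item);
  }

  // The engine asks by name and never learns which folder the code came from.
  get(name: string): T {
    const item = this.items.get(name);
    if (!item) throw new Error(`unknown implementation: ${name}`);
    return item;
  }

  names(): string[] {
    return Array.from(this.items.keys());
  }
}

// A toy contract standing in for a pack, tool, provider, or watcher.
interface Greeter extends Named {
  greet(): string;
}

const greeters = new Registry<Greeter>();
greeters.register({ name: "classic", greet: () => "hello" });
greeters.register({ name: "chibi", greet: () => "hiya" });
```

Swapping behaviour is then a matter of asking for a different name, for example greeters.get("chibi").greet().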

Chapter 01 · The shell

Getting a face on screen

How does a little floating character even appear, always-on-top, without a window frame?

Main creates a small, frameless, transparent BrowserWindow pinned above other windows, then hands the renderer its HTML. The catch: browser audio APIs (the on-device voice-activity detector, the TTS WASM) refuse to run from a bare file:// origin. So main serves the renderer through a custom app:// protocol, giving it a real origin where AudioWorklet and WASM behave.

The window isn't a fixed box, either. One owner in main resizes it on the fly — growing rightward from just his face, to a slim dock the moment your cursor is over him, to a full panel when the chat is open — so he only ever takes the space he's actually using. The catch that shaped it: a non-activating always-on-top panel never receives hover events of its own, so that one owner watches the cursor position from main rather than waiting on the window to report it. One place decides the size; nothing else fights over it.

The two halves never share objects. They speak only through a preload bridge that exposes a tiny allow-listed API — window.cosmo.on() / send() / invoke() — and rejects any channel not on the list. That allow-list is the contract for what main and renderer are allowed to say to each other.

Critical rule

The renderer is sandboxed and context-isolated. No Node in the window; no direct IPC. Everything crosses through the preload allow-list, so the attack surface is a handful of named messages — not "the whole app."
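
The allow-list idea can be sketched without Electron at all. In this hedged example the transport is injected (in a real preload it would be Electron's ipcRenderer, exposed through contextBridge), and makeBridge, Transport, and the exact contents of ALLOWED are assumptions; the channel names themselves are ones this page mentions.

```typescript
// Sketch only: makeBridge and Transport are hypothetical stand-ins.
const ALLOWED = ["mood:set", "activity:set", "chat:submit", "settings:get"] as const;
type Channel = (typeof ALLOWED)[number];

type Transport = {
  send(channel: string, payload: unknown): void;
};

// Type guard: true only for channels that appear on the list above.
function isAllowed(channel: string): channel is Channel {
  return (ALLOWED as readonly string[]).indexOf(channel) !== -1;
}

function makeBridge(transport: Transport) {
  return {
    send(channel: string, payload: unknown): void {
      // Anything not on the list is refused before it reaches the transport.
      if (!isAllowed(channel)) {
        throw new Error(`channel not on allow-list: ${channel}`);
      }
      transport.send(channel, payload);
    },
  };
}
```

Because the check sits in one function, auditing what the renderer may say comes down to reading one array.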

Chapter 02 · State

Who owns how Cosmo feels

If a watcher, a tool, and the brain can all change the mood, who's actually in charge?

Exactly one file: state.ts in main. The StateManager is the single owner of MoodState. Anyone who wants a mood change calls setState(mood, durationMs?); the manager pushes it to the renderer over IPC and, for transient moods like happy or listening, sets a timer to revert to idle. The renderer never decides a mood — it receives one and draws it.

shared/types.ts — the vocabulary that crosses the boundary

type MoodState =
  | 'idle' | 'listening' | 'thinking' | 'speaking'
  | 'happy' | 'bored' | 'annoyed' | 'sleeping';

type ActivityState =
  | { type: 'music'; nowPlaying: { track: string; artist: string } }
  | { type: 'searching' }
  | { type: 'timer'; remainingSec: number; label: string }
  | null;

watcher · tool · brain: "set mood to thinking"
↓ setState()
StateManager: clears revert timer · single source of truth
↓ IPC mood:set
Renderer / pack: draws the mood, decides nothing

every mood change funnels through one owner — so two sources can never fight over the face

Why this way

Mood is a property with one owner, not a state machine scattered across files. Timing and physics (blink rhythm, idle escalation, how long "happy" lasts) live in main. Packs decide how things look, never when they happen — which is what lets you reskin Cosmo without touching a line of logic.
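
A single owner with a revert timer might look roughly like this. It is a sketch, not the real state.ts: the Scheduler type is a hypothetical seam that lets the timer be faked in tests, and push stands in for the IPC mood:set message.

```typescript
type MoodState =
  | "idle" | "listening" | "thinking" | "speaking"
  | "happy" | "bored" | "annoyed" | "sleeping";

// Hypothetical: timers are injected so behaviour can be checked without waiting.
type Scheduler = {
  set(fn: () => void, ms: number): number;
  clear(id: number): void;
};

class StateManager {
  private mood: MoodState = "idle";
  private revertTimer: number | null = null;
  private readonly push: (mood: MoodState) => void; // stands in for IPC mood:set
  private readonly timers: Scheduler;

  constructor(push: (mood: MoodState) => void, timers: Scheduler) {
    this.push = push;
    this.timers = timers;
  }

  setState(mood: MoodState, durationMs?: number): void {
    // A new mood always cancels a pending revert, so two sources cannot fight.
    if (this.revertTimer !== null) {
      this.timers.clear(this.revertTimer);
      this.revertTimer = null;
    }
    this.mood = mood;
    this.push(mood);
    // Transient moods carry a duration and fall back to idle when it elapses.
    if (durationMs !== undefined) {
      this.revertTimer = this.timers.set(() => {
        this.revertTimer = null;
        this.setState("idle");
      }, durationMs);
    }
  }

  current(): MoodState {
    return this.mood;
  }
}
```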

Chapter 03 · Boundary #1 — Expression packs

How the face knows what to draw

You want to add a whole new look — anime eyes, a chibi character — without editing the engine. How?

An expression pack is anything that can render a mood. The contract is small on purpose, and the registry just picks one by name. Adding a look means adding a folder under renderer/packs/<name>/ and registering it — the brain, the watchers, and main don't change at all.

renderer/packs/types.ts

interface ExpressionPack {
  init(container: HTMLElement, opts: { reducedMotion: boolean }): void;
  setState(state: MoodState): void;        // render a mood
  pulse(event: PulseEvent): void;          // one-shot reaction (blink, heart…)
  setActivity(a: ActivityState | null): void;
  setGaze?(dx: number, dy: number): void;  // optional cursor-follow
  dispose(): void;
}

The shipped classic pack is pure DOM + CSS: two dark circles, a glossy white shine offset up-left, two clustered catch-lights, cheek blush, and a tweened mouth, exactly the eyes blinking at the top of this page. A second chibi pack swaps in illustrated characters. Both satisfy the same five required methods (setGaze is optional), so main treats them identically.

The boundary pattern

This same "small contract + registry + one folder per implementation" shape repeats three more times below — for tools, for providers, for watchers. Learn it once here and the rest of the codebase reads itself.

Chapter 04 · The privacy line

The line nothing crosses

Cosmo notices your day — when you've gone idle, what app you're in. So what stops it from being spyware?

The architecture, not a promise. Privacy here isn't a setting you trust — it's enforced by where the boundaries are drawn. Cosmo has no camera. It senses your day through a few narrow, local channels: system idle-time (how long since any input), and the frontmost app name, browser domain, and window title, all read on your machine via AppleScript. From those it keeps a private record of how your time was spent — stored in ~/.pixel and mirrored to your own Obsidian vault, and sent nowhere. What actually crosses into the rest of the app is a tiny fact:

// what the focus watcher reports to the judge — a fact, no title, no URL
{ source: 'focus', cls: 'work' | 'distraction' | 'meeting' | 'neutral', secs: 30 }
// the richer sample is written only to the local activity log on disk
{ app: 'Xcode', title: 'main.ts', domain: '', category: 'dev', secs: 30 }
Invariants baked in

No keylogging. Cosmo knows that you typed from idle-time deltas only — never key codes, never content.
Activity stays on your machine. App name, window title, and domain feed a local activity log (~/.pixel) and a mirrored Activity.md in your own vault — for nudges and your own recaps, never uploaded.
By default, nothing implicit reaches the LLM. Window titles, URLs, app names, and idle facts are never put in a model request — the AI sees only what you explicitly typed or said. The one exception is opt-in: Smart Focus (off by default) may send a single app name + window title to your configured model to label an app the local heuristic can't place — never URLs, never your activity history, and cached so it runs rarely.

Because these are structural, you can reason about them by reading the boundary, not by auditing every feature. No tool can see your keystrokes, and nothing ships your activity history to a model — the one app-context path that can reach the model, Smart Focus, stays off until you switch it on.

Chapter 05 · Boundary #4 — Watchers

Noticing you, honestly

How does Cosmo know you've gone idle, or fallen down a distraction hole — without surveilling you?

Small polling loops called watchers. Idle watches system idle-time; focus classifies the frontmost app as work / distraction / meeting / neutral; others watch battery and screen-time. Crucially, a watcher's job is to report a fact ("idle for 12 minutes", "distraction app for 15 of the last 30") and nothing more.

main/watchers/types.ts

interface Watcher {
  name: string;
  start(ctx: WatcherContext): void;  // emits facts via ctx, never opinions
  stop(): void;
}

Detecting which app you're in uses AppleScript through a single chokepoint (Chapter 12): app name, browser domain, and window title, all kept on your machine. No vision, no keystrokes.

Why facts, not moods

If each watcher could set the mood directly, the idle watcher and the focus watcher could scold you twice for the same lull. Keeping them dumb — pure sensors — means the judgment lives in exactly one tunable place. That place is the next chapter.

Chapter 06 · Judgment

Turning facts into feeling

Several sensors are firing facts. Who decides Cosmo should actually look bored — and just once?

The clean answer is one file: workSignal.ts — the only place allowed to translate the combined stream of watcher facts into a decision — drift to bored, escalate to annoyed, fire a spoken callout, or do nothing because it's outside your work hours. Every "should Cosmo react?" rule lives there, which means tuning the personality is editing one file, not hunting through four.

And that's how it actually works. A watcher's context exposes exactly one mood-related method — ctx.report(fact) — and nothing else; none of them can touch the mood. workSignal is the single consumer of those facts: it's the only place that calls setMood, fires a callout, applies the cooldowns, and respects your work hours — and it won't stomp a mood you caused yourself (it only nudges Cosmo toward bored or annoyed when he's already idling). The judgment really does live in one tunable file.

idle: idle-time
focus: app class
battery: charge state
workSignal: the one judge · cooldowns · work-hours · no double-scold
setMood(): bored / annoyed
callout: a gentle nudge

many dumb sensors fan in · one smart judge fans out
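
A single judge of this kind can be sketched in a few lines. Everything specific here is an assumption made for illustration: the idle fact shape, the thresholds (600 and 900 seconds), the callout text, and the WorkSignal class itself. Only the rules (work hours, cooldown, never stomping a non-idle mood) come from the description above.

```typescript
// Hypothetical sketch of the one judge; not the real workSignal.ts.
type Fact =
  | { source: "idle"; secs: number }
  | { source: "focus"; cls: "work" | "distraction" | "meeting" | "neutral"; secs: number };

type Mood = "idle" | "thinking" | "bored" | "annoyed";

interface JudgeDeps {
  getMood(): Mood;
  setMood(mood: Mood): void;
  callout(text: string): void;
  now(): number; // milliseconds, injected so tests control the clock
  inWorkHours(): boolean;
}

class WorkSignal {
  private lastReactionAt = -Infinity;
  private readonly deps: JudgeDeps;
  private readonly cooldownMs: number;

  constructor(deps: JudgeDeps, cooldownMs: number) {
    this.deps = deps;
    this.cooldownMs = cooldownMs;
  }

  // The only entry point watchers get: they hand over a fact, nothing more.
  report(fact: Fact): void {
    if (!this.deps.inWorkHours()) return; // outside work hours: do nothing
    if (this.deps.getMood() !== "idle") return; // never stomp a mood the user caused
    if (this.deps.now() - this.lastReactionAt < this.cooldownMs) return; // no double-scold

    if (fact.source === "idle" && fact.secs >= 600) {
      this.react("bored");
    } else if (fact.source === "focus" && fact.cls === "distraction" && fact.secs >= 900) {
      this.react("annoyed", "Still with me?");
    }
  }

  private react(mood: Mood, line?: string): void {
    this.lastReactionAt = this.deps.now();
    this.deps.setMood(mood);
    if (line !== undefined) this.deps.callout(line);
  }
}
```

Tuning the personality in this sketch means changing the numbers and branches inside report, and nowhere else.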

Chapter 07 · Boundary #2 — Providers

Talking to any LLM

A 7B model on your laptop and a frontier model in the cloud should be interchangeable. How do you pull that off?

Every brain hides behind one contract. The registry returns whichever provider your config selects, and the brain code above never knows which one it got.

main/ai/providers/types.ts

interface LLMProvider {
  name: string;
  capabilities: { offline: boolean; nativeTools?: boolean };
  chat(req: ChatRequest): Promise<ChatResponse>;
}

Six of the seven cloud providers are thin presets over a single shared transport, openaiCompat.ts — they differ by a base URL, an API-key name, and the odd per-vendor quirk (OpenAI's max_completion_tokens, Gemini's empty tool-call ids). Anthropic gets its own adapter because its API shape differs. Adding a new OpenAI-compatible vendor is roughly fifteen lines.

openaiCompat.ts: one HTTP transport · fenced + native tool paths
↑ thin presets ↑
OpenAI · Groq · Gemini · DeepSeek · xAI · Cerebras · Ollama

+ Anthropic as a standalone adapter · all satisfy the same chat()

Why no LangChain

The tool protocol is plain text (next chapter), not vendor function-calling, so a tiny local model and a frontier model are genuinely swappable. A heavyweight framework would bloat the bundle and couple us to one vendor's calling convention — the opposite of the goal.
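
Here is a hedged sketch of what "a thin preset over one shared transport" could look like. The Preset fields, makeOpenAICompat, the ChatRequest and ChatResponse shapes, and the placeholder URL are all assumptions; the request and response layout follows the widely used OpenAI-compatible chat-completions format. HTTP is injected so the example never touches the network.

```typescript
// Assumed shapes: the page shows LLMProvider but not ChatRequest/ChatResponse.
interface ChatRequest {
  model: string;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
}
interface ChatResponse {
  text: string;
}
interface LLMProvider {
  name: string;
  capabilities: { offline: boolean; nativeTools?: boolean };
  chat(req: ChatRequest): Promise<ChatResponse>;
}

// Hypothetical HTTP seam: resolves to the parsed JSON body of the reply.
type PostJson = (url: string, headers: Record<string, string>, body: unknown) => Promise<unknown>;

interface Preset {
  name: string;
  baseUrl: string;
  apiKey: string;
  offline: boolean;
}

function makeOpenAICompat(preset: Preset, post: PostJson): LLMProvider {
  return {
    name: preset.name,
    capabilities: { offline: preset.offline },
    async chat(req: ChatRequest): Promise<ChatResponse> {
      const raw = await post(
        `${preset.baseUrl}/chat/completions`,
        { Authorization: `Bearer ${preset.apiKey}` },
        { model: req.model, messages: req.messages },
      );
      // Read defensively: a malformed reply becomes an error, not a crash later on.
      const data = raw as { choices?: { message?: { content?: string } }[] };
      const text = data.choices?.[0]?.message?.content;
      if (typeof text !== "string") throw new Error(`${preset.name}: malformed response`);
      return { text };
    },
  };
}

// A new vendor is then just data. This URL is a placeholder, not a real endpoint.
const exampleVendor: Preset = {
  name: "example-vendor",
  baseUrl: "https://llm.example.invalid/v1",
  apiKey: "test-key",
  offline: false,
};
```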

Chapter 08 · Boundary #3 — Tools & the ReAct loop

Actually doing things

The model decided to search the web. How does that wish become a real action — safely, and without crashing on a malformed reply?

A tool is a name, a description, a zod schema for its arguments, and an execute function. The registry validates arguments against the schema before anything runs, then races execution against a timeout, so a tool can't get bad input or hang the app.

main/tools/types.ts

interface Tool<A> {
  name: string;
  description: string;
  schema: z.ZodType<A>;            // args validated BEFORE execute
  availableOffline: boolean;
  execute(args: A, ctx: ToolContext): Promise<ToolResult>;
}

type ToolResult =
  | { ok: true;  summary: string; data?: unknown }
  | { ok: false; error: string;   userMessage: string };

Tools never reach into the app directly. They get a ToolContext (ctx.speak(), ctx.setMood(), ctx.setActivity(), ctx.config, ctx.log), so they stay testable and can't poke at internals they shouldn't.

The tool-call protocol

Tool calls are fenced JSON blocks in plain text — deliberately not vendor function-calling. The dispatcher parses them out of the model's prose; capable cloud models additionally use native function-calling, but both paths converge in the same loop. Unknown tool or broken JSON? It falls back to a plain text answer. It never crashes.

// the model writes this; the dispatcher extracts it
```search.web
{ "query": "papers on large language models" }
```
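
Extracting such a block from the model's prose could be sketched as follows. parseToolCall and its regular expression are hypothetical, not the actual dispatcher; the fence marker is built with repeat() only so the example can sit inside a fenced block itself.

```typescript
// Hypothetical dispatcher sketch: find the first fenced block, treat its
// info string as the tool name and its body as JSON arguments.
const FENCE = "`".repeat(3);
const TOOL_BLOCK = new RegExp(
  FENCE + "([\\w.-]+)[ \\t]*\\r?\\n([\\s\\S]*?)\\r?\\n?" + FENCE,
);

type ParsedCall = { tool: string; args: unknown };

// Returns a call only for a known tool with parseable JSON; returns null for
// anything else so the caller can treat the reply as a plain text answer.
function parseToolCall(reply: string, knownTools: ReadonlySet<string>): ParsedCall | null {
  const match = TOOL_BLOCK.exec(reply);
  if (!match) return null;
  const tool = match[1] ?? "";
  const body = match[2] ?? "";
  if (!knownTools.has(tool)) return null;
  try {
    return { tool, args: JSON.parse(body) };
  } catch {
    return null; // broken JSON degrades to text, never a crash
  }
}
```

Schema validation of args is deliberately left out here; in the design described on this page that happens in the registry.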

Reason → Act → Observe

The brain runs a bounded ReAct loop (up to four steps): the model reasons, optionally calls one tool, sees the result as an observation, and continues — so it can chain search → read and recover when the first guess is wrong. Identical calls are de-duplicated so it can't spin.

  1. You speak or type: the only thing that enters the brain is what you explicitly said
  2. Model replies with prose + maybe a fenced tool block: "On it —" then search.web {…}
  3. Dispatcher parses & validates: zod-checks args; unknown/garbled → plain answer, no crash
  4. Registry executes with a timeout: tool runs with an injected ToolContext, capped at a few seconds
  5. Result fed back as an observation: loop continues, or the model writes its final reply
  6. Cosmo speaks the answer: through the serialized speech queue (Chapter 10)
Critical rule

Timeouts and error-wrapping live in the registry, not in individual tools. A tool author writes the happy path; the boundary makes it safe. That's why a malformed model reply degrades to text instead of taking down the app.
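
The "boundary makes it safe" idea can be sketched like this. To keep the block dependency-free, a plain parse function stands in for the zod schema; runTool, SketchTool, the user-facing messages, and the 5-second default are assumptions rather than the registry's real values.

```typescript
type ToolResult =
  | { ok: true; summary: string; data?: unknown }
  | { ok: false; error: string; userMessage: string };

// Hypothetical trimmed-down tool: parse() plays the role of the zod schema.
interface SketchTool<A> {
  name: string;
  parse(raw: unknown): A; // throws on invalid args
  execute(args: A): Promise<ToolResult>;
}

async function runTool<A>(tool: SketchTool<A>, raw: unknown, timeoutMs = 5000): Promise<ToolResult> {
  let args: A;
  try {
    args = tool.parse(raw); // validate BEFORE anything runs
  } catch (err) {
    return { ok: false, error: `invalid args: ${String(err)}`, userMessage: "I couldn't understand that request." };
  }

  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<ToolResult>((resolve) => {
    timer = setTimeout(
      () => resolve({ ok: false, error: "timeout", userMessage: "That took too long." }),
      timeoutMs,
    );
  });

  try {
    // Whichever settles first wins; a thrown error is wrapped, not propagated.
    return await Promise.race([tool.execute(args), timeout]);
  } catch (err) {
    return { ok: false, error: String(err), userMessage: "Something went wrong." };
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Note that losing the race does not cancel the tool's work; a fuller version would likely pass an abort signal into execute.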

Chapter 09 · The voice pipeline

Hearing and speaking

From "Cosmo…" out loud to a spoken answer back — without cutting you off, talking over itself, or shipping your audio anywhere it shouldn't go.

The renderer runs an on-device voice-activity detector (Silero VAD) and ships finished speech segments to main. Main transcribes them — locally by default (Moonshine / Whisper via transformers.js, in a forked worker process), or via a cloud STT provider if you choose one. A wake-word gate opens a short window when it hears "Cosmo", and smart end-of-turn detection decides you've actually finished talking instead of just pausing.

  1. VAD catches a speech segment: renderer → main, 16 kHz audio · paused while Cosmo talks (echo control)
  2. STT transcribes: local Whisper/Moonshine worker by default; cloud optional
  3. Wake gate + smart turn: fuzzy-matches "Cosmo", rejects "cosmos/costco", waits for a real end-of-turn
  4. The brain runs: the ReAct loop from Chapter 08
  5. Speech queue speaks: serialized TTS (local Kokoro or cloud) with a watchdog so a stuck voice can't mute the mic

Tap the mic mid-sentence and Cosmo stops instantly — barge-in aborts synthesis, clears the queue, and drops straight into listening. The whole thing is glued together by one primitive, the speech queue, so tools just call ctx.speak() and never worry about ordering.

A decision worth seeing

The ONNX speech models run in a forked system-node worker, not in Electron's bundled Node — onnxruntime crashes under Electron's runtime. A small platform quirk, but it's the kind of thing the architecture has to make room for, so it's isolated behind a worker boundary.

Chapter 10 · Communication

How the parts talk — and why Cosmo never talks over itself

Two processes, a handful of background loops, and tools all wanting to say something. What carries those messages — and what stops Cosmo from blurting four things at once?

The line between the two halves

Main and the renderer never share an object. They pass messages across the preload allow-list from Chapter 01, and the traffic comes in two distinct shapes:

  • Pushes — fire-and-forget, main → renderer. Main decides something changed and tells the face: mood:set, activity:set, chat:message, voice:status. The renderer subscribes with window.cosmo.on(…) and redraws.
  • Requests — ask-and-wait, renderer → main. window.cosmo.invoke('settings:get') returns a value; chat:submit hands main what you typed and waits for the reply. Anything that needs an answer uses invoke; anything fire-once uses send.

In the draw direction only MoodState and ActivityState ever cross — the same narrow vocabulary from Chapter 02. Every channel name is an explicit enum in shared/types.ts, so both sides agree on exactly which messages exist; a channel that isn't on the list is rejected by the bridge.

One mouth, one queue

Speaking is the one place a floating companion can embarrass itself. A reply lands just as the idle watcher gets bored, the battery dips, and a timer ends — four voices, all at once, talking over each other. Cosmo routes every spoken line through a two-stage pipeline so that can't happen.

main/watchers/calloutManager.ts — the gate in front of the queue

requestCallout(text, config) {
  if (!config.voice.proactiveSpeech) return;          // opt-in: Cosmo is quiet by default
  if (this.paused || this.meetingQuiet) return;       // never mid-meeting
  if (Date.now() - this.lastCalloutAt < cooldownMs) return;  // ≤ 1 nudge per cooldown (def. 20 min)
  this.lastCalloutAt = Date.now();
  speechQueue.enqueue(text);                           // → the one serialized queue
  this.onSpeak?.(text);                                // + a visual nudge if he's off-screen / muted
}

Stage 1 — the callout gate. The watchers' proactive nudges ("you've gone quiet…") go through calloutManager first. It drops the line unless you've opted into proactive speech, and then only when Cosmo isn't in a meeting or paused and a cooldown has elapsed since the last nudge. So even when idle, focus, and battery all fire in the same second, at most one nudge survives — the rest are silently swallowed.

Stage 2 — the speech queue. Everything that does speak (the answers to what you asked, plus the surviving nudge) is enqueued in one FIFO speechQueue that speaks one utterance at a time, never overlapping. Each line is sanitized for the synth, capped so a runaway reply can't synthesize minutes of audio, and guarded by a 30-second watchdog so a wedged audio device can't freeze the mic. Tap the mic to interrupt and barge-in aborts the current line and empties the queue.

And one more guarantee: that you actually notice. A spoken nudge is no use if Cosmo is buried behind a window or muted in the tray, so every proactive line (a surviving callout, a daily recap, a reminder coming due) also fires a silent visual nudge through the same hook (onSpeak above): he bounces in place to catch your eye, and if he's hidden or muted, raises a native notification instead. The gate decides whether to speak up; this makes sure the moment lands even when the voice can't.

idle · focus · battery
↓ proactive nudges
callout gate: opt-in · meeting-quiet · cooldown → ≤ 1 nudge per window
surviving nudge: passed the gate
your reply: you asked, so it skips the gate
↓ enqueue()
speech queue: FIFO · one utterance at a time · sanitize · watchdog · barge-in
one voice: lines never overlap

proactive nudges must pass the cooldown gate · your direct replies skip it, but everything still serializes through the one queue

Why two stages

They solve different problems. The cooldown controls how often Cosmo speaks up on its own — that's taste, and it's tunable. The queue controls whether two sounds ever overlap — that's a hard guarantee, because ctx.speak() is the only way anything makes sound and it always lands in the same single-consumer queue. Separate the "how often" from the "never at once" and each stays simple.
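
The "never at once" half can be sketched as a single-consumer queue. This is an illustration under assumptions, not the real speechQueue.ts: Synth stands in for whichever TTS backend is active, and sanitizing, length caps, and the watchdog are left out to keep the core visible.

```typescript
// Hypothetical: a synth speaks one line and resolves when the audio is done.
type Synth = (text: string, signal: AbortSignal) => Promise<void>;

class SpeechQueue {
  private readonly pending: string[] = [];
  private current: AbortController | null = null;
  private draining = false;
  private readonly synth: Synth;

  constructor(synth: Synth) {
    this.synth = synth;
  }

  enqueue(text: string): void {
    this.pending.push(text);
    void this.drain();
  }

  // Barge-in: drop everything waiting and abort the line being spoken.
  bargeIn(): void {
    this.pending.length = 0;
    this.current?.abort();
  }

  private async drain(): Promise<void> {
    if (this.draining) return; // single consumer: only one loop ever runs
    this.draining = true;
    try {
      let next: string | undefined;
      while ((next = this.pending.shift()) !== undefined) {
        this.current = new AbortController();
        try {
          await this.synth(next, this.current.signal);
        } catch {
          // a failed or aborted utterance must not stall the queue
        }
      }
    } finally {
      this.current = null;
      this.draining = false;
    }
  }
}
```

Overlap is impossible by construction here: the next line is only taken off the queue after the previous await has settled.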

Chapter 11 · Memory

Remembering, locally

Cosmo should recall what matters from past chats — without dumping your whole history into every prompt, and without a cloud database.

Memory is two cooperating ideas. Semantic recall keeps a plain-JSON vector store on disk; a forked worker runs a small local embedding model (all-MiniLM, 384-dim). When you ask something, Cosmo embeds the query, cosine-searches the store, and injects only the handful of relevant chunks. If the embedder isn't ready, it falls back to injecting the memory files whole — it degrades, it doesn't fail.
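
The recall step itself is small. In this sketch the Chunk shape, the recall function, and topK are assumptions; real vectors would come from the embedding worker with 384 numbers each, while the example uses tiny ones for readability.

```typescript
// Hypothetical store entry: a piece of text plus its embedding.
interface Chunk {
  text: string;
  vector: number[];
}

// Cosine similarity: dot product divided by the product of the two lengths.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const n = Math.max(a.length, b.length);
  for (let i = 0; i < n; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every chunk against the query and keep the best topK.
function recall(query: number[], store: Chunk[], topK: number): Chunk[] {
  return store
    .map((chunk) => ({ chunk, score: cosine(query, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((scored) => scored.chunk);
}
```

A linear scan like this is the simplest possible search; for one person's notes it is usually enough, which fits the plain-JSON choice.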

Why JSON, not SQLite

A file-based store has zero native dependencies, re-indexes on file mtime (cheap and self-healing), and is trivial to inspect. For one user's memory it's more than fast enough — and it survives packaging quirks that a compiled vector extension wouldn't.

The second idea is the Obsidian vault mirror: every note, task, and reminder is also written as plain Markdown into a vault folder you own. The canonical stores live elsewhere; the vault is a readable projection you can browse or edit. Redundant on purpose — your data is never trapped inside Cosmo.

Your conversation is kept the same honest way. Every message is appended to a plain transcript file in ~/.pixel; the chat window shows only the last few and lazy-loads older ones as you scroll up, so a long history never bloats memory or a prompt. Local, capped, and yours to clear in a click — like everything else here, nothing about your past is locked in a format only Cosmo can read.

Chapter 12 · Core primitives

The safe machine

Tools and watchers all want to run AppleScript, speak, log, and read API keys. How do you give them that without scattering risk everywhere?

Shared low-level modules in main/core/, injected through context — tools and watchers use them, never import them directly:

  • osascript.ts — the only place AppleScript runs. Always execFile with an argument array, never a shell string with your text interpolated in. One chokepoint to audit for injection.
  • speechQueue.ts — serialized TTS with abort and a 30-second watchdog, so a bad audio device can't leave the mic muted forever.
  • secrets.ts — API keys sealed via the OS keychain (safeStorage), never written as plaintext.
  • log.ts — a rotating logger to ~/.pixel/logs.
Critical rule

One chokepoint per dangerous capability. AppleScript only through osascript.ts; TTS only through the speech queue; keys only through secrets.ts. Centralizing the risk is what makes the rest of the code boring — and boring is safe.

Chapter 13 · Waking up

How Cosmo comes alive — and where his soul lives

Before the face ever appears, something has to decide who Cosmo is. Where does his personality come from — and what order does everything else switch on in?

Cosmo's personality isn't compiled in. It lives as plain Markdown you can open and edit, under ~/.pixel/workspace/. On the very first run main seeds those files with the built-in persona and operating rules, then leaves them alone forever after — so the instant you change a line, your version is what Cosmo reads.

~/.pixel/workspace/ — editable, plain markdown

SOUL.md      # personality, voice, values — the character sheet
AGENTS.md    # operating rules: when to use a tool, the output contract
USER.md      # durable facts about you ("remember X" appends here)
MEMORY.md    # curated long-term memory across sessions
memory/YYYY-MM-DD.md   # daily notes — compaction folds into these

The seed is write-if-missing: ensureWorkspace() runs every boot but only creates a file that isn't there yet, so your edits survive every update. Each turn the prompt builder loads them back — context.ts reads SOUL (the persona), AGENTS (the rules), and USER + MEMORY (what Cosmo knows about you) and folds them into the system prompt, shaped per model family. Personality is data, not code: change the file, change Cosmo, no rebuild.
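
Write-if-missing maps neatly onto Node's exclusive-create flag. The sketch below is an assumption about how such a seed could be written, not the actual ensureWorkspace(): its signature, the seed text, and the temp-directory demo are all illustrative.

```typescript
import { mkdirSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// The "wx" flag makes the write fail if the file already exists, which is
// what turns a plain write into write-if-missing.
function ensureWorkspace(dir: string, seeds: Record<string, string>): string[] {
  mkdirSync(dir, { recursive: true });
  const created: string[] = [];
  for (const name of Object.keys(seeds)) {
    try {
      writeFileSync(join(dir, name), seeds[name] ?? "", { flag: "wx" });
      created.push(name);
    } catch (err) {
      // Only "already exists" is expected; anything else is a real failure.
      if ((err as { code?: string }).code !== "EEXIST") throw err;
    }
  }
  return created;
}

// Demo in a throwaway directory: a second boot creates nothing and keeps the edit.
const demoDir = mkdtempSync(join(tmpdir(), "workspace-"));
const firstBoot = ensureWorkspace(demoDir, { "SOUL.md": "built-in persona" });
writeFileSync(join(demoDir, "SOUL.md"), "my edited persona");
const secondBoot = ensureWorkspace(demoDir, { "SOUL.md": "built-in persona" });
const kept = readFileSync(join(demoDir, "SOUL.md"), "utf8");
```

Doing the check and the write in one call also avoids the gap between "does it exist?" and "write it" that a separate existence check would leave.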

With the soul in place, the rest comes up in a deliberate order. Startup splits in two halves: a synchronous stretch that runs the moment main loads — environment, config, and the workspace + vault seeds — and then everything inside app.whenReady(), where the window, the four registries, and the voice worker switch on. A loader overlay holds the face back until voice, ears, and turn-detection all report ready, with a 25-second safety net so one stuck subsystem can't trap it forever.

① the moment main loads · synchronous
load .env: project root in dev · app resources when packaged
config store + backfillDefaults(): your saved choices win · new defaults fill the gaps
ensureWorkspace() · seed if missing: SOUL.md · AGENTS.md · USER.md · MEMORY.md (your edits always survive)
migrate memory · seed vault · warmRecall(): fold legacy memory into USER.md · build the embedder index in the background
↓ then app.whenReady()
② window + tray: frameless always-on-top panel · served over app://
fill the four registries: providers · tools · STT · TTS (each boundary populates itself)
warm the voice worker: local ASR + Smart Turn v3 · wake loop + liveliness begin
boot:ready → the loader lifts: voice · ears · turn all green (25s safety net)

seed the editable soul before the face appears · then warm the senses · the loader waits for all three

Why a folder of Markdown, not constants

Hardcoding the persona in TypeScript means only a rebuild can retune the voice — and you can never actually see what Cosmo "is". Plain files make the personality inspectable, diffable, and yours to rewrite. Same local-and-open principle as the vault and the chat transcript: nothing about who Cosmo is, or what he knows, is locked in a format only Cosmo can read.

Chapter 14 · Epilogue

Build your own Cosmo

You've now seen every load-bearing piece. The same three patterns held the whole way down:

  • Contract + registry: packs, tools, providers, watchers. Add a folder, never edit a neighbour.
  • Single-owner state: state.ts owns mood; workSignal.ts owns judgment, turning watcher facts into a mood or a nudge; one owner in main even owns the window's size. One owner each, no fights.
  • Facts vs. judgment: sensors report, one place decides. Privacy and personality both fall out of this.

If you wanted to rebuild it from nothing, the order writes itself — and each step is a working app:

  1. A frameless always-on-top Electron window: serve the renderer over app://
  2. One mood, one owner: StateManager pushes MoodState to a CSS-eyes pack
  3. One LLM provider + the fenced-JSON ReAct loop: start with a single cloud model
  4. A couple of tools: search, open URL, timer, all behind the zod-validated registry
  5. Voice: cloud STT/TTS first, then a local Whisper worker, then wake word + barge-in
  6. Watchers → workSignal: watchers report idle and focus facts; workSignal turns them into mood and nudges
  7. Memory: a JSON vector store + a local embedder worker, plus the vault mirror
  8. Harden the boundaries: the privacy line, the AppleScript chokepoint, sealed keys
Part II · A different system, same instincts

Bonus: how this very site takes payments

You just took Cosmo apart. Here's a second, much smaller system — the "buy me a coffee" button on this very page — and it leans on the same instincts: one owner per job, a contract you can swap behind, and a hard line that secrets never cross.

The goal is modest: let a visitor leave an optional tip in their own currency, then hand back the GitHub link — on a plain PHP site with no payment SDK bundled in. The whole thing reduces to three ideas and one pattern you've already met. The three ideas: the browser is never trusted, the secret never leaves the server, and every payment is confirmed twice. The one pattern: each payment processor hides behind a single contract — the same contract-plus-registry shape that carried all of Part I.

What you won't see here

No routes, no file names, no copy-paste endpoints — on purpose. This is the shape of a safe integration, the part that's the same whether you use Razorpay, PayPal, or Stripe, so you can build your own rather than clone ours.

Chapter 15 · Payments — the flow

A safe checkout, end to end

A visitor wants to leave a $5 tip. Where does the money actually change hands — and which parts of this is the browser allowed to decide? (Almost none of it.)

The browser is hostile territory: anyone can open dev-tools and change a number. So the rule is strict — the browser may start a payment and display the provider's checkout, but the amount, the verification, and the reward are all decided on your server. The payment provider (Razorpay for ₹, PayPal for other currencies) is the only thing that actually moves money; your server is the only thing trusted to say it happened.

  1. Browser asks your server to start a tip: it sends an amount + currency, but the server re-checks that amount against a minimum; the page's number is never trusted on its own
  2. Server creates an order with the provider: server-to-server, using the secret key; the browser gets back only an order id + your public identifier, never the secret
  3. The provider's checkout opens in the browser: card form or wallet, hosted by the provider; your site never sees card details
  4. The visitor pays the provider directly: money moves between the visitor and the provider, not through your server
  5. Server confirms with the provider: it verifies a signature, or captures the order server-side, and believes only the provider's own answer, never a "success" flag from the page
  6. Confirmed → server unlocks the reward: flip the record to paid, hand back the GitHub link
Browser · untrusted: holds the public identifier only · opens the provider's checkout · can be tampered with, so it decides nothing
the trust line: the secret key and every "is this really paid?" decision stay on the server side
Your server · trusted: holds the secret · creates the order · verifies the payment · releases the reward

the browser may start and display a payment · only the server may decide it succeeded

Why the browser decides nothing

It's the same lesson as Cosmo's privacy line (Chapter 04): trust is a property of where you draw the boundary, not a promise you make. Treat every value from the page as a request to be re-checked, and a tampered amount or a faked "it worked" simply can't get past the server.

Chapter 16 · Payments — credentials

Keys, secrets & signatures

Every provider hands you two strings that look alike: a publishable key and a secret key. Swap them by accident and you either break checkout or hand the world your account.

Add a webhook secret and that makes three credentials, and the whole integration's safety is just keeping them in the right place:

  • Publishable key / client id — a token that's safe in public. It only identifies your account to the provider. It can start a checkout but can't move money or read anything. This is the one value allowed into the browser.
  • Secret key — never leaves the server. It authenticates your server when it creates or captures an order. In the browser it would let anyone charge as you, so it lives only in an environment file the web server can read and the public can't.
  • Webhook secret — proves the provider is the one calling. Used to confirm a server-to-server notification genuinely came from the provider and wasn't forged (full webhooks next chapter).

where each credential is allowed to live

PUBLIC  · in the page    →  publishable key / client id   // identify only
SECRET  · on the server  →  secret key                    // create & capture orders
SECRET  · on the server  →  webhook secret                // verify provider callbacks
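
One way to enforce that placement in code is to load all three values in one place and expose a single function that builds what the browser receives. A minimal sketch, assuming the credentials arrive as environment variables; the variable names (`PAY_PUBLIC_KEY`, `PAY_SECRET_KEY`, `PAY_WEBHOOK_SECRET`) and both function names are hypothetical.

```javascript
// Reads the three credentials from an environment-like object and fails
// loudly if any is missing. Variable names are hypothetical.
function loadCredentials(env) {
  const required = ["PAY_PUBLIC_KEY", "PAY_SECRET_KEY", "PAY_WEBHOOK_SECRET"];
  for (const name of required) {
    if (!env[name]) throw new Error(`missing credential: ${name}`);
  }
  return {
    publicKey: env.PAY_PUBLIC_KEY,         // PUBLIC: may be sent to the page
    secretKey: env.PAY_SECRET_KEY,         // SECRET: server only
    webhookSecret: env.PAY_WEBHOOK_SECRET, // SECRET: server only
  };
}

// The only projection of the credentials that is ever serialized for the browser.
function browserConfig(credentials) {
  return { publicKey: credentials.publicKey };
}
```

On a real server this would be called as `loadCredentials(process.env)`. Building the browser payload by picking one field, rather than by deleting secrets from a copy, means a newly added secret can never ship by default.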

Confirmation comes in two flavours, and a provider gives you one or the other:

  • Signature check. After payment, the provider hands the browser a signature. Your server recomputes it from the order id + payment id using the secret key, and believes the payment only if they match — proof nothing was tampered with in transit.
  • Server-side capture. Instead, your server asks the provider directly: "did this order complete, and for how much?" You trust only that reply, and cross-check the amount and currency against what you stored — so a tampered amount can't unlock the reward.
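
The second flavour, server-side capture, comes down to a comparison like the one below. This is a sketch only: `captureMatches` is a hypothetical helper, and the reply's field names (`status`, `orderId`, `amount`, `currency`) and the `"COMPLETED"` value are assumptions that will differ by provider.

```javascript
// `stored` is the order as your server recorded it when it was created.
// `reply` is the provider's server-to-server answer (field names assumed).
function captureMatches(stored, reply) {
  return (
    reply.status === "COMPLETED" &&
    reply.orderId === stored.orderId &&
    reply.amount === stored.amount &&     // cross-check: same amount as stored
    reply.currency === stored.currency    // cross-check: same currency as stored
  );
}
```

Nothing from the page appears in the comparison: one side is your own record, the other is the provider's reply.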

confirming a signed payment — runs on the server, with the secret key

expected = hmac_sha256(order_id + "|" + payment_id, secret_key)
if (!constant_time_equal(expected, signature_from_browser)) return reject()   // forged or tampered; stop here
markPaidOnce(order_id)                               // verified → release the reward
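
The check above can be written as a runnable function with Node's built-in `crypto` module. A minimal sketch, assuming the provider sends the signature as a hex string; `signatureIsValid` is a hypothetical name, and the `order_id|payment_id` message format is taken from the pseudocode above.

```javascript
const crypto = require("crypto");

// Recomputes the HMAC on the server and compares it in constant time.
function signatureIsValid(orderId, paymentId, signature, secretKey) {
  const expected = crypto
    .createHmac("sha256", secretKey)
    .update(`${orderId}|${paymentId}`)
    .digest("hex");
  const a = Buffer.from(expected, "utf8");
  const b = Buffer.from(String(signature), "utf8");
  // timingSafeEqual throws on unequal lengths, so check the length first.
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}
```

The constant-time comparison avoids leaking, through response timing, how many leading characters of a guessed signature were correct.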
The one rule that matters

The secret key and the webhook secret never appear in HTML, JavaScript, or any file the public can fetch — only the publishable token ships to the browser. If a secret ever lands in front-end code or a screenshot, treat it as burned and rotate it immediately.

Chapter 17 · Payments — webhooks

Webhooks — the backstop when the tab closes

The visitor pays, then closes the tab before the "it worked" call reaches your server. Did you just lose the payment?

No — because the browser was never the only path. A webhook is the provider calling your server directly, server-to-server, the moment a payment settles. It doesn't depend on the tab staying open, the network holding, or the page not being refreshed. And it's signed with the webhook secret from the last chapter, so your server can tell a real event from a forged one.
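
Telling a real event from a forged one can look like the sketch below. Providers differ in how they sign webhooks, so treat the details as assumptions: this version assumes an HMAC-SHA256 of the raw request body, sent as a hex string in a header, and `webhookIsGenuine` is a hypothetical name.

```javascript
const crypto = require("crypto");

// `rawBody` must be the exact text received, before any JSON parsing,
// because re-serializing can change bytes and break the signature.
function webhookIsGenuine(rawBody, signatureHeader, webhookSecret) {
  const expected = crypto
    .createHmac("sha256", webhookSecret)
    .update(rawBody)
    .digest();
  const given = Buffer.from(String(signatureHeader), "hex");
  // Malformed hex decodes to a shorter buffer, which the length check rejects.
  return given.length === expected.length && crypto.timingSafeEqual(given, expected);
}
```

Only after this returns true should the handler parse the body and act on the event.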

The catch is that confirmation can now arrive twice — the browser's callback and the webhook, or a provider retrying. So the write that marks a tip paid must be idempotent: flip the record from pending → paid exactly once, and quietly ignore any later copy. Releasing the reward twice is precisely the bug this prevents.

Browser confirm: fast feedback · may never arrive
Signed webhook: server-to-server · the reliable source of truth
↓ each verified, then ↓
mark paid, once: idempotent · first writer wins · duplicates ignored
reward released: the GitHub link, exactly one time

two independent confirmations, one idempotent write — a closed tab never loses a payment, a retry never double-counts it
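
The idempotent write can be sketched in a few lines. This is an illustration, not the site's real storage: the in-memory `tips` map and `onVerifiedConfirmation` are hypothetical, and a real database would do the same flip as one conditional update so that two simultaneous confirmations cannot both win.

```javascript
// In-memory stand-in for the tips table (hypothetical). A real store would
// use one conditional update, e.g. UPDATE ... WHERE status = 'pending'.
const tips = new Map([["order_1", { status: "pending" }]]);

// Returns true only for the caller that performs the pending → paid flip.
function markPaidOnce(orderId) {
  const tip = tips.get(orderId);
  if (!tip || tip.status !== "pending") return false; // unknown or already paid
  tip.status = "paid";
  return true;
}

// Both paths (browser callback and webhook) call this after verifying.
function onVerifiedConfirmation(orderId, releaseReward) {
  if (markPaidOnce(orderId)) releaseReward(orderId);
}
```

Because the reward is released only when `markPaidOnce` returns true, the second confirmation for the same order does nothing, whichever path it arrives on.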

Why two paths

They do different jobs. The browser callback is instant feedback for the person who just paid; the webhook is the truth that survives closed tabs and flaky networks. Build both, make the write idempotent, and the edge cases stop being edge cases.

Chapter 18 · Payments — one checkout, many currencies

One checkout, many currencies

An Indian visitor should pay in ₹ through one provider; everyone else in their own currency through another. How do you do that without four copies of the checkout?

With the exact boundary pattern from Part I — a contract plus a registry. A small pure function turns a visitor's rough region into a currency, a minimum tip, and which processor handles it. Each processor — one for ₹, one for everything else — hides behind a single shared interface: create an order · verify it · capture it. A registry hands back whichever one the currency selected, and the checkout code above never knows which it got. Adding a third processor tomorrow is a new adapter, not a rewrite.

visitor's region: resolved server-side
↓ one pure function
currency · minimum · processor: ₹ floor · $/€/£ floor · which adapter
↓ registry picks the adapter
₹ → processor A: same interface
others → processor B: same interface

one pure router · interchangeable processors behind one contract · the same shape as Cosmo's packs, tools, and providers
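
The router and registry above can be sketched as follows. Everything named here is hypothetical: the region codes, the floor values, the `processorA` / `processorB` keys, and the stub adapters, which only imitate the shared create · verify · capture interface.

```javascript
// Pure router: region in, payment plan out. Tables and floors are hypothetical.
const CURRENCY_BY_REGION = new Map([["IN", "INR"], ["GB", "GBP"], ["DE", "EUR"]]);
const FLOORS = new Map([["INR", 5000], ["GBP", 100], ["EUR", 100], ["USD", 100]]);

function planFor(region) {
  const currency = CURRENCY_BY_REGION.get(region) ?? "USD"; // fallback is an assumption
  const processor = currency === "INR" ? "processorA" : "processorB";
  return { currency, minimum: FLOORS.get(currency), processor };
}

// Each adapter exposes the same three methods; the bodies here are stubs.
function makeStubAdapter(name) {
  return {
    createOrder: (amount, currency) => ({ id: `${name}-order`, amount, currency }),
    verify: () => true,
    capture: (orderId) => ({ orderId, status: "COMPLETED" }),
  };
}

// Registry: the checkout asks by key and never learns which adapter it got.
const registry = new Map([
  ["processorA", makeStubAdapter("A")],
  ["processorB", makeStubAdapter("B")],
]);

function processorFor(region) {
  const plan = planFor(region);
  return { plan, adapter: registry.get(plan.processor) };
}
```

Adding a third processor would mean one more `registry` entry and one more branch in `planFor`; the code that calls `processorFor` stays as it is.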

You've seen this before

This is the contract-plus-registry boundary from Chapters 03, 07, and 08 — now applied to money. Learn the shape once and it keeps paying off: swap an implementation, never edit a neighbour. A safe checkout isn't a special skill; it's the same architecture, pointed at payments.

That's the whole system: three ideas — an untrusted browser, server-only secrets, confirmation that arrives twice and writes once — and one pattern you already knew. The "Get Cosmo" button at the top runs exactly this flow.