You have an unsaved session. Pick up where you left off?
Pinned Thumbnails
Composite
1280 × 720
Selected Frame
4px
100%
Faces in frame:
Face detection
Face identification
Emotional match
Click a frame to select · hover and click × to exclude
Topical pattern (left half)
A patterned graphic that speaks to the topic of this video. Generated by gpt-image-1, ranked best-to-worst.
Color tint
A colour tint that captures the emotional tenor of the video. Multiplied over the pattern.
(Generate to see suggested tint and rationale.)
Platypus brand:
Topic subject (cutout)
The central subject of the story — typically a person or object. We'll search the web and remove the background.
Fade between halves
Where the left composite blends into the right scene. Drag both handles to widen or narrow the transition.
Choose background
Tune background
Fine tune background
Apply the grading values from your settings cog.
How well the face reads against the background
Adds a soft halo of light behind the face. "All around" gives an even glow; the directional options simulate studio rim lighting. White is most versatile.
Draws a visible border around the face — the classic YouTube thumbnail "sticker" look. White at 3–5 px is the most popular choice.
Adds a soft dark shadow behind the face so it doesn't look "pasted on." Start with opacity around 0.3.
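For readers curious how these three treatments could be layered on a 2D canvas, here is a minimal TypeScript sketch. It illustrates the general technique only; the function names, colours, offsets, and stroke trick are assumptions, not the app's actual code.

```ts
// Illustrative sketch: layer glow, drop shadow, and a sticker outline behind a
// transparent cut-out, then draw the cut-out itself on top. Values are examples.

// Make a flat-colour copy of the cut-out's silhouette (used for the outline).
function tintSilhouette(cutout: HTMLCanvasElement, colour: string): HTMLCanvasElement {
  const c = document.createElement("canvas");
  c.width = cutout.width;
  c.height = cutout.height;
  const g = c.getContext("2d")!;
  g.drawImage(cutout, 0, 0);
  g.globalCompositeOperation = "source-in"; // keep alpha, replace colour
  g.fillStyle = colour;
  g.fillRect(0, 0, c.width, c.height);
  return c;
}

function drawFaceEffects(ctx: CanvasRenderingContext2D, cutout: HTMLCanvasElement, x: number, y: number) {
  // Glow: soft halo of light all around the silhouette.
  ctx.save();
  ctx.shadowColor = "rgba(255,255,255,0.9)";
  ctx.shadowBlur = 60;
  ctx.drawImage(cutout, x, y);
  ctx.restore();

  // Drop shadow: soft dark copy offset down-right at roughly 0.3 opacity.
  ctx.save();
  ctx.shadowColor = "rgba(0,0,0,0.3)";
  ctx.shadowBlur = 25;
  ctx.shadowOffsetX = 10;
  ctx.shadowOffsetY = 14;
  ctx.drawImage(cutout, x, y);
  ctx.restore();

  // Outline: stamp a white silhouette at offsets around a circle to fake a ~4px stroke.
  const outline = tintSilhouette(cutout, "#ffffff");
  const px = 4;
  for (let a = 0; a < Math.PI * 2; a += Math.PI / 8) {
    ctx.drawImage(outline, x + Math.cos(a) * px, y + Math.sin(a) * px);
  }

  // Finally, the cut-out itself on top.
  ctx.drawImage(cutout, x, y);
}
```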
Background Cleanup
Higher = more aggressive removal of stray pixels. Adjust until floating debris disappears.
Multiple faces detected — click a face on the right canvas to toggle deletion.
20px · 25px
Switch Face
Click a face below to swap it in. Your current size, position, and crop carry over.
Upscale the face cutout for a sharper result. Drag the slider on the right canvas to compare before/after at full composite position.
Ready to upscale.
Provider
The original alpha mask is re-applied after upscaling — no second BG-removal pass.
Enhance lighting, colour, and sharpness on the face. Currently showing automated suggestions.
Brightness
Darker · No change · Brighter
0.5 · 1 · 2
Overall face brightness. Most webcam footage benefits from a small lift.
Contrast
Flatter · No change · Punchier
0.5 · 1 · 2
Pushes highlights brighter and shadows darker for more pop.
Normalize
Off
Stretches the colour range when the face looks washed-out or flat. Auto-suggestions turn this on when it would help.
Saturation
Muted · No change · Vivid
0.5 · 1 · 2
Richness of skin tones and clothing. Push too far and skin turns orange.
Warmth
Cooler · No change · Warmer
−0.3 · 0 · +0.3
Shifts the colour temperature. Most webcam footage benefits from a touch of warmth.
Gamma
No change · Open shadows
1 · 3
Lightens dark areas (under the chin, eye sockets) without washing out the bright parts.
Sharpen
None · Crisp
0 · 3
Crispens edges. A little helps after upscaling; too much shows every pore.
Micro detail
Subtle · Bold
1 · 10
Brings out fine detail (CLAHE). 2–4 gives a polished look; above 6 can look over-processed.
Click a suggestion to use it, or type your own below. The right canvas shows text and logo only.
Lines:
Size:Auto
Accent:
Click words to toggle accent highlight:
Text colour:
Platypus Logo
Add decorative overlays around the subject. Choose a type, method, and style.
Brand colour:
Style
Engine (image generation model)
Settings
What this stage does. Picks the still images of you that the rest of the app turns into thumbnails. ffmpeg grabs 60–108 stills from random points in the middle 90% of your video. At the same time, BlazeFace looks for faces in every still and gives each one a score based on how many faces it found and how big the largest one is — if the biggest face is smaller than Min face size, the score takes a heavy ×0.1 penalty. Meanwhile the audio is sent to OpenAI's transcription model, the resulting transcript goes to the chat model which pulls out 5–10 emotional words describing the video's tone, and finally GPT Vision sees the stills + those emotions and ranks the 12 best by how well your expression matches what you were saying. The 12 ranked stills feed into Cutout; Vision's background guesses feed into Background.
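A small sketch of two of the ideas above: sampling timestamps from the middle 90% of the video, and the ×0.1 penalty when the largest face is under the minimum size. The function names and the exact scoring formula are illustrative assumptions; only the 90% window, the 0.02 default, and the ×0.1 penalty come from the description above.

```ts
// Illustrative sketch, not the app's actual code.

// Scatter N sample points across the middle 90% of the video's duration.
function sampleTimestamps(durationSec: number, count: number): number[] {
  const start = durationSec * 0.05;
  const end = durationSec * 0.95;
  return Array.from({ length: count }, () => start + Math.random() * (end - start))
    .sort((a, b) => a - b);
}

// Score a frame from its detected faces; faceAreas are fractions of frame area.
function scoreFrame(faceAreas: number[], minFaceArea = 0.02): number {
  if (faceAreas.length === 0) return 0;
  const largest = Math.max(...faceAreas);
  let score = faceAreas.length + largest * 10; // assumed formula: more/bigger faces score higher
  if (largest < minFaceArea) score *= 0.1;     // heavy ×0.1 penalty for tiny faces
  return score;
}
```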
Frame extraction
How many still frames ffmpeg pulls from the video, evenly spaced across its duration. Leave at 0 for auto-mode (one frame every 30 seconds, capped at 36). More frames give the vision model more options to choose from but slow extraction and increase API cost.
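A sketch of the auto-mode rule as stated above (0 means auto: one frame every 30 seconds, capped at 36). The function name is illustrative.

```ts
// 0 = auto: one frame per 30 seconds of video, capped at 36.
function resolveFrameCount(setting: number, durationSec: number): number {
  if (setting > 0) return setting;               // an explicit value wins
  return Math.min(36, Math.ceil(durationSec / 30));
}
```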
Minimum face area (as a fraction of total frame area) before a frame is considered usable. 0.02 = the face must occupy at least 2% of the frame. Frames where every detected face is smaller than this — typically wide shots or empty rooms — are dropped before scoring.
Models
OpenAI model used to transcribe and diarize the video's audio track. gpt-4o-transcribe-diarize produces speaker-labelled segments which feed both the title and the emotion-extraction prompts.
Long audio is split into chunks before being sent to the transcription API. Smaller chunks upload more reliably but lose context at chunk boundaries; larger chunks give cleaner diarization but risk timeouts on slow connections.
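A sketch of how chunk boundaries could be computed before cutting the audio (for example with ffmpeg's -ss and -t flags). This is an assumed helper, not the app's splitter.

```ts
// Compute fixed-length chunk boundaries for a long audio track.
function chunkBoundaries(durationSec: number, chunkSec: number) {
  const chunks: Array<{ start: number; length: number }> = [];
  for (let start = 0; start < durationSec; start += chunkSec) {
    chunks.push({ start, length: Math.min(chunkSec, durationSec - start) });
  }
  return chunks;
}

// e.g. chunkBoundaries(1900, 600) -> 4 chunks starting at 0, 600, 1200, 1800 s (last one 100 s long)
```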
OpenAI model that scores each extracted frame against the emotional tone of the transcript and decides which frames feature the target person. Needs vision capability — gpt-4.1, gpt-4o, etc.
Controls how much resolution the vision API uses when ranking frames. Low sends a 512×512 thumbnail (1 token block, fast). High tiles the image at full resolution for finer facial-expression detection. Auto lets OpenAI pick.
Transcription prompt
Vision / Frame Analysis prompts
Emotion extraction prompt
What this stage does. Builds the blurred photo that sits behind your cut-out face. The starting image is one of the four built-in studio/office shots in public/assets/backgrounds/ (or any photo you've uploaded), with GPT Vision sometimes picking the best fit back in Frames. The image is drawn to a canvas and run through a chain of browser canvas filters (ctx.filter): blur for the bokeh, brightness to dim it behind the text, saturate for colour intensity, plus a temperature tint built from a sepia filter (warm) or sepia + a 180° hue-rotate (cool). On top of that, a radial-gradient vignette darkens the edges and a directional linear-gradient overlay adds a coloured wash. The nine named presets (Studio Clean, News Flash, Crisis…) are just preset combinations of those slider values. The graded result is composited under the cut-out in the next stage.
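A minimal sketch of that grading chain using the browser canvas APIs it names (ctx.filter and a radial-gradient vignette). The slider-to-filter mapping shown here is an assumption; only the filter ingredients themselves come from the description above.

```ts
// Illustrative sketch of the background grading chain; parameter mapping is assumed.
function gradeBackground(
  ctx: CanvasRenderingContext2D,
  img: HTMLImageElement,
  opts: { blur: number; brightness: number; saturate: number; warmth: number; vignette: number },
) {
  const { width: w, height: h } = ctx.canvas;

  // Bokeh blur, dimming, saturation, plus a temperature tint built from sepia
  // (warm) or sepia + a 180° hue rotation (cool).
  const temp = opts.warmth >= 0
    ? `sepia(${opts.warmth})`
    : `sepia(${-opts.warmth}) hue-rotate(180deg)`;
  ctx.filter = `blur(${opts.blur}px) brightness(${opts.brightness}) saturate(${opts.saturate}) ${temp}`;
  ctx.drawImage(img, 0, 0, w, h);
  ctx.filter = "none";

  // Radial-gradient vignette darkening the edges.
  const vg = ctx.createRadialGradient(w / 2, h / 2, h / 3, w / 2, h / 2, h);
  vg.addColorStop(0, "rgba(0,0,0,0)");
  vg.addColorStop(1, `rgba(0,0,0,${opts.vignette})`);
  ctx.fillStyle = vg;
  ctx.fillRect(0, 0, w, h);
}
```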
Color Grading Defaults
When a named preset is selected, its values override the individual sliders. "Custom" uses the slider values as-is. "None" resets all grading to neutral.
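A sketch of that precedence rule, assuming a simple record of named presets; the names and neutral values are illustrative, not the app's definitions.

```ts
// Named preset wins over sliders; "Custom" passes sliders through; "None" resets to neutral.
const NEUTRAL = { blur: 0, brightness: 1, saturate: 1, warmth: 0, vignette: 0 };
type Grading = typeof NEUTRAL;

function resolveGrading(preset: string, sliders: Grading, presets: Record<string, Grading>): Grading {
  if (preset === "None") return NEUTRAL;   // reset all grading to neutral
  if (preset === "Custom") return sliders; // use slider values as-is
  return presets[preset] ?? sliders;       // named preset overrides the sliders
}
```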
What this stage does. Cuts you out of the chosen still and stages the glow / outline / shadow effects that sit between you and the background. The picked JPEG goes to BiRefNet — a background-removal neural net that runs locally on the server via the @imgly/background-removal-node wrapper — which returns a per-pixel cut-out mask. The raw mask is then cleaned up in three passes on the server: (1) a flood-fill keeps only the largest connected blob, plus any blob that's at least 70% as big in case there are two people in shot, and erases the rest as floating debris; (2) what's left is converted to fully opaque or fully transparent at the alpha-60 cutoff to kill semi-transparent fringe pixels; (3) a 1-pixel erosion shrinks the mask inward to remove the thin halo of background colour bleeding into the edges. On the canvas, the cleaned cut-out is positioned, scaled and flipped, with optional glow, contact shadow and outline stroke all drawn behind it in that order, and any eraser/smear paint you've added applied to the cut-out itself. The result feeds into Upscale and Relight.
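A sketch of cleanup passes (2) and (3) above, a hard alpha cutoff at 60 and a 1-pixel erosion, operating on a flat RGBA buffer. The largest-blob flood-fill of pass (1) is omitted for brevity, and the function names are assumptions.

```ts
// Pass (2): snap every alpha value to fully opaque or fully transparent at the cutoff.
function hardenAlpha(rgba: Uint8ClampedArray, cutoff = 60) {
  for (let i = 3; i < rgba.length; i += 4) {
    rgba[i] = rgba[i] >= cutoff ? 255 : 0; // kills semi-transparent fringe pixels
  }
}

// Pass (3): 1-pixel erosion. A pixel stays opaque only if all four neighbours are opaque.
function erodeAlpha(rgba: Uint8ClampedArray, width: number, height: number) {
  const a = (x: number, y: number) => rgba[(y * width + x) * 4 + 3];
  const out = rgba.slice();
  for (let y = 1; y < height - 1; y++) {
    for (let x = 1; x < width - 1; x++) {
      if (a(x, y) && (!a(x - 1, y) || !a(x + 1, y) || !a(x, y - 1) || !a(x, y + 1))) {
        out[(y * width + x) * 4 + 3] = 0; // shrink the mask inward by one pixel
      }
    }
  }
  rgba.set(out);
}
```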
Subject
Glow / rim light
Outline / sticker border
Drop shadow
What this stage does. Doubles or quadruples the cut-out's resolution so your face still looks sharp at the final 1280×720 size. First, if the cut-out is bigger than 768 px on its longest side it's downscaled to fit (so the GPU doesn't run out of memory), and any transparent pixels are temporarily filled with the average colour of the visible pixels — this stops the upscaler from drawing dark halos at the alpha edge. The image is then sent to whichever provider you've picked: Sharp/Lanczos3 running locally if Replicate isn't configured, or one of five Replicate models — Real-ESRGAN (cheap and identity-safe; we pass face_enhance: false so it doesn't smooth your features away), Recraft Crisp Upscale (Recraft's restorative model — like Real-ESRGAN it won't invent detail, but it tends to look crisper on skin and eyes; the model picks its own output size, so the scale setting is ignored for this provider), GFPGAN v1.4 (face restoration), SwinIR (transformer super-resolution, slowest but highest fidelity), or Crystal (portrait-tuned diffusion with low creativity and high resemblance so your features don't drift). When the upscaler returns, the original alpha mask is resized with Lanczos3 and joined back onto the upscaled RGB, so any erases or smears you did in Cutout stay perfect. Results are cached per provider, so re-runs are instant.
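A sketch of that alpha round-trip using Sharp calls that exist in the library (extractChannel, flatten, resize with a lanczos3 kernel, joinChannel). The provider call is abstracted away, a flat grey stands in for the average visible colour, and all names here are illustrative, not the app's module.

```ts
import sharp from "sharp";

// Illustrative sketch: split alpha off, upscale the RGB, then resize the original
// mask with Lanczos3 and join it back so Cutout-stage mask edits survive the upscale.
async function upscaleWithMask(
  cutout: Buffer,                               // RGBA cut-out from the Cutout stage
  upscaleRgb: (rgb: Buffer) => Promise<Buffer>, // provider call (Replicate model or local resize)
): Promise<Buffer> {
  // Keep the original alpha mask aside as a greyscale PNG.
  const alphaPng = await sharp(cutout).ensureAlpha().extractChannel(3).png().toBuffer();

  // Flatten transparency onto a neutral fill so the upscaler sees no hard alpha edge.
  // (The description above uses the average visible colour; flat grey keeps this short.)
  const rgb = await sharp(cutout).flatten({ background: "#808080" }).png().toBuffer();

  const upscaled = await upscaleRgb(rgb);
  const { width, height } = await sharp(upscaled).metadata();

  // Resize the saved mask to the new size with Lanczos3 and join it back on.
  const resizedAlpha = await sharp(alphaPng)
    .resize(width, height, { kernel: "lanczos3" })
    .raw()
    .toBuffer();

  return sharp(upscaled)
    .removeAlpha()
    .joinChannel(resizedAlpha, { raw: { width: width!, height: height!, channels: 1 } })
    .png()
    .toBuffer();
}
```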
What this stage does. Corrects the exposure and colour of the upscaled cut-out so your face matches the background it's about to sit on. Hitting Suggest runs a smart-defaults pass on the server: it grabs the face bounding box, builds a 1000-bin brightness histogram of just those pixels, reads off the 70th-percentile L* (a CIELAB lightness value that approximates the lit cheek rather than hair or shadow), and computes a continuous correction — boosting brightness/gamma when L* falls below 45, easing off when it's above 80, with smaller warmth/saturation/contrast/CLAHE nudges based on whole-image stats. When you click Apply, those values (or your slider overrides) run through a Sharp pipeline in colour-science order: normalise → warmth tint → gamma → brightness/saturation → contrast → Reinhard highlight rolloff (a custom lookup table that softly compresses bright skin instead of hard-clipping it) → CLAHE local-contrast equalisation → unsharp mask. The alpha channel is split off before processing and joined back at the end, and the final image is hashed and cached so identical settings return instantly. The values below are the slider defaults — Suggest will nudge them further per frame.
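A sketch of that colour-science ordering expressed as a Sharp pipeline; normalise, tint, gamma, modulate, linear, clahe, and sharpen are real Sharp operations. The warmth-to-tint mapping and the contrast formula are assumptions, the Reinhard rolloff is only indicated by a comment, and alpha handling is omitted (see the Upscale sketch).

```ts
import sharp from "sharp";

// Illustrative relight pipeline; parameter mapping is assumed, ordering follows the text above.
async function relightFace(rgb: Buffer, s: {
  normalize: boolean; warmth: number; gamma: number; brightness: number;
  saturation: number; contrast: number; microDetail: number; sharpen: number;
}): Promise<Buffer> {
  let img = sharp(rgb);

  if (s.normalize) img = img.normalise();              // stretch a washed-out tonal range

  if (s.warmth !== 0) {                                // crude red/blue white-balance tint
    const shift = Math.round(s.warmth * 60);           // positive = warmer, negative = cooler
    img = img.tint({ r: 255 - Math.max(0, -shift), g: 255, b: 255 - Math.max(0, shift) });
  }

  img = img.gamma(Math.min(3, Math.max(1, s.gamma)));  // open shadows (sharp accepts 1.0-3.0)
  img = img.modulate({ brightness: s.brightness, saturation: s.saturation });
  img = img.linear(s.contrast, 128 * (1 - s.contrast)); // contrast pivoting around mid-grey

  // A Reinhard-style highlight rolloff would go here as a lookup table over raw pixels;
  // omitted for brevity.

  if (s.microDetail > 1) img = img.clahe({ width: 64, height: 64, maxSlope: Math.round(s.microDetail) });
  if (s.sharpen > 0) img = img.sharpen({ sigma: 1, m1: 0, m2: s.sharpen });

  return img.png().toBuffer();
}
```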
Linear multiplier on every pixel value. Above 1 lifts the whole face uniformly; below 1 darkens it. The cheapest fix for an under- or over-exposed shot, but it crushes highlights and shadows equally.
Stretches midtone pixels away from grey. Above 1 makes lights lighter and darks darker (punchier face); below 1 flattens the tonal range (softer, more cinematic).
Strength of colour. 0.7 gives a desaturated, documentary look; 1.0 is unchanged; 2.0 is vivid. Useful for warming up sallow studio skin tones.
White-balance shift on the red/blue axis. Negative cools the image (bluer, fluorescent feel); positive warms it (oranger, golden-hour feel). Use small values — ±0.1 is usually plenty.
Gamma curve that brightens dark tones while leaving highlights largely untouched. Best fix for an under-lit face shot against a bright background β opens up the eye sockets and jawline without blowing out the cheeks.
Unsharp-mask amount. Adds edge contrast so eyes, lashes and stubble read as crisper at thumbnail size. Above ~2 tends to look digital and brittle.
CLAHE clip-limit (Contrast-Limited Adaptive Histogram Equalisation). Boosts contrast within small tiles independently, so dim regions get lifted without flattening already-bright ones. Adds depth and "pop" without the global heaviness of pumping the Contrast slider.
What this stage does. Writes the headline that goes on the thumbnail. The diarised transcript (which the app generates in parallel with frame extraction) is sent to OpenAI's chat model (default gpt-4o) via the Responses API, with a structured-JSON schema that forces every suggestion into {text: ALL CAPS, highlightWords: […]} form, so you don't get prose or markdown back. The system prompt frames the model as an expert YouTube content strategist promoting Justin Wolfers' economics videos; the user prompt asks for N punchy 1–6 word options (where N is the Number of text suggestions setting) and tags 1–2 words per option to render in the accent colour. You pick one (or hit Refine, which threads the conversation via the Responses API's previous_response_id so it doesn't re-read the whole transcript), and the chosen text + highlight words are stored on the Composer for canvas rendering. The font, colours and watermark settings below control how that chosen title is drawn onto the canvas.
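A hedged sketch of that Responses API pattern: a structured-JSON schema forcing {text, highlightWords} suggestions, with previous_response_id used for Refine. The prompts, schema details, and helper name are illustrative stand-ins for the app's own settings, not its actual code.

```ts
import OpenAI from "openai";

const client = new OpenAI();

// Illustrative title-suggestion call; prompts and schema are assumptions.
async function suggestTitles(transcript: string, n: number, previousResponseId?: string) {
  const resp = await client.responses.create({
    model: "gpt-4o",
    previous_response_id: previousResponseId, // "Refine" threads the conversation
    input: previousResponseId
      ? "Refine the previous suggestions: make them punchier."
      : `Suggest ${n} thumbnail titles (1-6 words, ALL CAPS) for this transcript:\n${transcript}`,
    text: {
      format: {
        type: "json_schema",
        name: "thumbnail_titles",
        strict: true,
        schema: {
          type: "object",
          properties: {
            suggestions: {
              type: "array",
              items: {
                type: "object",
                properties: {
                  text: { type: "string" },
                  highlightWords: { type: "array", items: { type: "string" } },
                },
                required: ["text", "highlightWords"],
                additionalProperties: false,
              },
            },
          },
          required: ["suggestions"],
          additionalProperties: false,
        },
      },
    },
  });
  return { responseId: resp.id, titles: JSON.parse(resp.output_text) };
}
```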
YOUR TITLE GOES HERE
Used for the big title text on the thumbnail. Heavy condensed fonts read best at small sizes in YouTube search results.
Logo / watermark library
Upload a PNG, JPG, WebP, or SVG to add it to the watermark picker in Stage 6. Uploads are saved to persistent storage. Built-in Platypus marks can't be removed.
Model
Thumbnail title prompts
What this stage does. Adds optional hand-drawn graphics over the finished composite — either kinetic accent lines or topical doodle stickers. For emotional lines, the whole canvas image (plus an optional marker-style reference photo) is sent to Gemini 2.5 Flash Image, and Gemini is asked to edit the photograph by drawing a few loose, hand-drawn HOT PINK #FF1493 Sharpie strokes on top, mostly along your silhouette and clothing edges. The server then chromakey-extracts the pink pixels (bright red + low green + a big red–green gap, to keep skin tones out) into a white mask that the browser tints on the fly with whichever colour you've picked, so you can recolour without calling Gemini again. For topical doodles, gpt-4o-mini is first asked to suggest 8–12 concrete drawable topics from the transcript — roughly 60% literal objects, 40% visual metaphors for abstract ideas — then Flux Schnell (default) or Gemini 2.5 Flash Image (alternative) is asked for a sprite sheet on a pure-black background in your chosen style and colour, using a grid layout that adapts to the topic count (4×3 for 10–12 icons, 3×3 for 9, 4×2 for 7–8, 3×2 for 5–6, etc.). The server then slices each cell into its own PNG sprite using a connected-component blob extractor (4-connected BFS flood-fill, with a 40 px gap merge so disjoint parts of one icon stay together). The result — a tinted overlay or up to 12 movable PNG stickers — is drawn on top of the composite at the configured opacity.
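A sketch of the pink-stroke chromakey rule described above (strong red, weak green, large red-green gap) written as a raw pixel loop. The numeric thresholds and function name are guesses, not the app's values.

```ts
// Keep pixels that read as hot pink and write them into a white mask; everything
// else (including skin tones, which have a much smaller red-green gap) is dropped.
function extractPinkMask(rgba: Uint8ClampedArray): Uint8ClampedArray {
  const mask = new Uint8ClampedArray(rgba.length);
  for (let i = 0; i < rgba.length; i += 4) {
    const r = rgba[i], g = rgba[i + 1];
    const isPink = r > 180 && g < 120 && r - g > 90; // assumed thresholds
    if (isPink) {
      mask[i] = mask[i + 1] = mask[i + 2] = 255;     // white stroke pixel
      mask[i + 3] = 255;
    }
  }
  return mask;
}
// The browser can then tint this white mask to any colour (e.g. a "source-in" fill),
// so recolouring the lines never needs another Gemini call.
```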
General
Doodles & lines
Decoration prompts
What this stage does. Figures out who's in each extracted frame, so you can rank or filter the candidates by the person you actually want. Add a person, drop in one or more clear front-facing reference photos, and face-api.js — running entirely in your browser, loading tinyFaceDetector + faceLandmark68TinyNet + faceRecognitionNet from /assets/face-api-models/ — computes a 128-number face descriptor for each photo and averages them into one stable embedding per person. That averaged descriptor is POSTed back to the server and cached, so the heavy face-recognition step only runs once per photo. After Frames extraction has finished — this is a separate browser-side pass, not part of the Frames pipeline itself — every face detected in every candidate frame is embedded the same way and matched against the cached descriptors using Euclidean distance, with a hard threshold of 0.52: anything closer is the same person, anything further is "unknown".
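A sketch of the averaging and matching steps using face-api.js calls that exist in the library (TinyFaceDetectorOptions, detectSingleFace, withFaceLandmarks(true), withFaceDescriptor, euclideanDistance). The helper names are illustrative; the 0.52 threshold is the one quoted above. The three models would need to be loaded from /assets/face-api-models/ beforehand.

```ts
import * as faceapi from "face-api.js";

// Assumes faceapi.nets.tinyFaceDetector, faceLandmark68TinyNet and faceRecognitionNet
// have already been loaded from /assets/face-api-models/.

// Average the 128-number descriptors of several reference photos into one embedding.
async function averageDescriptor(photos: HTMLImageElement[]): Promise<Float32Array> {
  const sum = new Float32Array(128);
  let count = 0;
  for (const photo of photos) {
    const det = await faceapi
      .detectSingleFace(photo, new faceapi.TinyFaceDetectorOptions())
      .withFaceLandmarks(true) // tiny landmark model
      .withFaceDescriptor();
    if (!det) continue;
    det.descriptor.forEach((v, i) => (sum[i] += v));
    count++;
  }
  return sum.map((v) => v / Math.max(1, count));
}

// Match one detected face against the cached per-person descriptors.
function matchPerson(
  faceDescriptor: Float32Array,
  people: Array<{ name: string; descriptor: Float32Array }>,
  threshold = 0.52,
): string {
  let best = { name: "unknown", dist: Infinity };
  for (const p of people) {
    const dist = faceapi.euclideanDistance(faceDescriptor, p.descriptor);
    if (dist < best.dist) best = { name: p.name, dist };
  }
  return best.dist <= threshold ? best.name : "unknown";
}
```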