Two layers, one number. Layer 1 measures contrast, faces, text coverage, WCAG readability, and vibrancy with real computer vision. Layer 2 hands the image to Claude vision and asks whether it can stand out against the top 3 thumbnails actually winning in your keyword + format + size bracket. Free creators get one full run per cycle.
Free plan unlocks one full analysis · ~25 seconds per run · re-upload revised versions for side-by-side history
Seven Layer 1 components run pixel-level computer vision (60 points). Six Layer 2 dimensions run Claude Sonnet 4.6 vision against your niche feed (40 points). Each one returns a score, a one-sentence verdict referencing exact visual elements, and a concrete fix when below 8.
Dimensions
5 pts · Full credit at 1280×720 (YouTube’s recommended resolution). Partial credit at any other 16:9 size. Zero for off-ratio images, which get cropped or letterboxed in feeds and tank CTR.
File size
5 pts · Full credit under 2 MB. Partial under 4 MB. Zero above 4 MB (loads slowly on mobile data, hurts the first-frame impression).
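A minimal sketch of how the two format checks above could be scored. The partial-credit value (3 points), the exact-ratio test, the 1 MB = 1024 × 1024 bytes convention, and the treatment of a file at exactly 4 MB are all assumptions for illustration; only the full-credit and zero-credit cases are stated above.

```python
MB = 1024 * 1024          # assumption: binary megabytes
PARTIAL_POINTS = 3        # assumption: the partial-credit value is not published


def score_dimensions(width: int, height: int) -> int:
    """0-5 points: full at 1280x720, partial at any other 16:9 size, zero off-ratio."""
    if (width, height) == (1280, 720):
        return 5
    if width * 9 == height * 16:  # exact 16:9 test in integers, no float rounding
        return PARTIAL_POINTS
    return 0


def score_file_size(size_bytes: int) -> int:
    """0-5 points: full under 2 MB, partial under 4 MB, zero otherwise."""
    if size_bytes < 2 * MB:
        return 5
    if size_bytes < 4 * MB:
        return PARTIAL_POINTS
    return 0  # assumption: exactly 4 MB is treated like "above 4 MB"
```

For example, a 1920×1080 export at 3 MB would score partial credit on both checks under these assumptions.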
Contrast (stddev)
15 pts · Greyscale standard deviation across the whole image. >80 wins full credit: high contrast separates the thumbnail from its feed neighbours. <30 means the thumbnail looks washed out.
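To make the contrast metric concrete, here is a small dependency-free sketch. The luma weights are the standard BT.601 greyscale weights; the linear ramp between 30 and 80 and the zero score below 30 are assumptions, since only the two thresholds are given above.

```python
from statistics import pstdev


def grey_stddev(pixels):
    """Population standard deviation of greyscale values for (r, g, b) pixels, 0-255 each."""
    greys = [0.299 * r + 0.587 * g + 0.114 * b for r, g, b in pixels]
    return pstdev(greys)


def score_contrast(stddev, max_points=15):
    """Full credit above 80; assumed zero below 30 and a linear ramp in between."""
    if stddev > 80:
        return max_points
    if stddev < 30:
        return 0  # assumption: "washed out" is scored as zero
    return round(max_points * (stddev - 30) / 50)  # assumption: linear interpolation
```

A half-black, half-white image sits at the top of the scale (standard deviation of about 127.5), while a flat grey image has a standard deviation of 0.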
Face presence
10 pts · Haar cascade face detection. >20% of image area = full credit (faces drive CTR). 10–20% = partial. Detected at all = base credit. Zero faces is OK if the Layer 2 vision pass says the scene compensates.
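The face tiers above could be applied to detector output roughly like this. It is a sketch under stated assumptions: the largest detected face is the one measured, and the partial (6) and base (3) point values are placeholders, since only the area thresholds are given.

```python
def face_points(boxes, image_w, image_h):
    """Score face presence from (x, y, w, h) boxes, e.g. Haar cascade detections."""
    if not boxes:
        return 0  # Layer 1 gives nothing here; Layer 2 may judge that the scene compensates
    largest = max(w * h for _x, _y, w, h in boxes)  # assumption: biggest face counts
    fraction = largest / (image_w * image_h)
    if fraction > 0.20:
        return 10
    if fraction >= 0.10:
        return 6  # assumption: partial-credit value
    return 3      # assumption: base credit for any detected face
```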
Text presence
10 pts · 10–30% of image covered by text wins full credit (the sweet spot: readable on mobile, doesn’t crowd the visual). >30% feels cluttered and gets capped.
Text readability (WCAG)
10 pts · Real WCAG luminance-contrast ratio between text and its background. >7:1 wins full credit (AAA). >4.5:1 is partial (AA). Below 3:1 reads as a smear at 200px.
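The ratio itself follows the WCAG definition of relative luminance and contrast ratio, sketched below for 8-bit sRGB colours. The scoring helper is an assumption: the partial value (5 points) and the zero score at or below 4.5:1 are placeholders, since only the thresholds are given above.

```python
def _linear(channel):
    """Convert one 0-255 sRGB channel to linear light, per the WCAG definition."""
    c = channel / 255
    # 0.04045 is the sRGB threshold; older WCAG text used 0.03928,
    # which gives the same result for 8-bit values.
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    r, g, b = (_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(text_rgb, background_rgb):
    """WCAG contrast ratio, from 1:1 (identical colours) to 21:1 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(text_rgb), relative_luminance(background_rgb)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def readability_points(ratio):
    if ratio > 7:
        return 10
    if ratio > 4.5:
        return 5  # assumption: partial-credit value
    return 0      # assumption: anything at or below 4.5:1 scores zero
```

White text on a black box reaches the 21:1 maximum, so it clears the AAA bar with room to spare.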
Color vibrancy
5 pts · Mean HSV saturation. >120 wins full credit (vivid). K-means dominant color extraction also runs here so Layer 2 can compare against the niche palette.
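A rough, standard-library sketch of the saturation measure. It assumes the 120 threshold is on the 0–255 saturation scale that OpenCV uses for 8-bit images, and the linear partial credit below 120 is an assumption. The k-means palette step is not shown.

```python
import colorsys


def mean_saturation(pixels):
    """Mean HSV saturation on a 0-255 scale for (r, g, b) pixels with 0-255 channels."""
    if not pixels:
        raise ValueError("need at least one pixel")
    total = 0.0
    for r, g, b in pixels:
        _h, s, _v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        total += s * 255  # colorsys returns saturation in 0-1
    return total / len(pixels)


def vibrancy_points(mean_sat, max_points=5):
    if mean_sat > 120:
        return max_points
    return round(max_points * mean_sat / 120)  # assumption: linear partial credit
```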
Facial emotion
10 pts · What specific emotion is expressed? Is it readable at 200px (mobile feed size)? Does it match the video’s promise? If no face, does the scene create equivalent emotional pull?
Text psychology
10 pts · Does the text create curiosity tension without revealing the answer? Does it complement or contradict the image? Is it bold enough for mobile? If there is no text, it is scored on whether the visual is strong enough alone.
Color psychology
10 pts · Are colors emotionally congruent with the topic? Is there a single dominant color that separates this in the feed? Compared directly against the benchmark color palette.
Composition & visual hierarchy
10 pts · Where does the eye go first? Is there visual tension? Is the most important element in a rule-of-thirds power zone? Mobile-first read, since most YouTube viewing is mobile.
Title-thumbnail relationship
10 pts · Do the title and thumbnail tell DIFFERENT parts of the same story (the gold standard), or is the thumbnail just illustrating the title? Scored zero if no title was provided.
Feed distinctiveness
10 pts · Compared against the actual top 3 benchmark thumbnails for your niche. Would this stand out, blend in, or disappear? The verdict names the single most distinctive element, or explains exactly why the thumbnail blends in.
Five stages. Re-upload a revised version anytime. The version-history panel tracks the score across iterations so you can see exactly what moved the needle.
Upload + context
Drop the image, paste the draft title, pick the keyword you’re targeting. Or pull the title and keyword from a video idea you generated in Competitor Analysis.
Layer 1 measures
OpenCV detects faces, pytesseract reads any text, WCAG luminance ratio scores readability, k-means extracts dominant colors, HSV measures vibrancy. 60 points.
Niche pool built
Top 10 thumbnails for your keyword + format + size bracket are fetched + scored. Pool cached 30 days, shared across users. Most runs hit a warm pool.
Layer 2 vision call
Claude Sonnet 4.6 sees your thumbnail alongside the top 3 benchmark thumbnails and scores 6 psychological dimensions in context. 40 points.
Combined result
Score 0–100, per-dimension verdict + fix, biggest win, biggest fix, emotion label, feed-position tag, percentile vs peers, version saved to history.
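As an illustration of the "percentile vs peers" figure, here is one simple way such a number could be derived from the scored niche pool. The definition used here (share of pool thumbnails scoring strictly lower) and the function name are assumptions; the exact formula is not stated.

```python
def percentile_vs_peers(score, peer_scores):
    """Percentage of peer thumbnails that scored strictly below this one."""
    if not peer_scores:
        raise ValueError("peer pool is empty")
    below = sum(1 for s in peer_scores if s < score)
    return 100 * below / len(peer_scores)
```

With a pool scored 50, 60, 70, 80, and 90, a thumbnail scoring 70 beats two of the five peers and lands at the 40th percentile under this definition.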
Layer 1 runs entirely on our infrastructure. No third-party scoring API, no per-image fees. Layer 2 calls Claude Sonnet 4.6 with your thumbnail and the top 3 benchmark images. Benchmark thumbnails come from the official YouTube Data API: the same public images anyone visiting those channels can see. Each analysis spends one credit on paid plans; the free tier gets one full analysis per cycle.
Face detection
OpenCV Haar cascade · frontal-face classifier
Text OCR
pytesseract · sparse-text page mode (psm 11)
Color extraction
OpenCV k-means · k=3 dominant + HSV saturation
Readability ratio
WCAG 2.2 luminance contrast · sampled per text box
Niche benchmark
YouTube Data API · top 10 by view velocity, 30-day cache
Vision model
Claude Sonnet 4.6 · 4-image input · ~12s on warm cache
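For readers curious how the detection pieces listed above fit together, here is a hedged sketch using the public OpenCV and pytesseract APIs. The function names, the `scaleFactor` and `minNeighbors` values, and the overlap handling in `area_fraction` are illustrative assumptions, not the product's actual settings. It needs `opencv-python`, `pytesseract`, `Pillow`, and a local Tesseract install.

```python
def detect_face_boxes(image_path):
    """Frontal-face boxes (x, y, w, h) from OpenCV's bundled Haar cascade."""
    import cv2  # imported lazily so the pure helper below has no dependencies

    grey = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if grey is None:
        raise FileNotFoundError(image_path)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    # Common starting values, assumed here for illustration.
    found = cascade.detectMultiScale(grey, scaleFactor=1.1, minNeighbors=5)
    return [tuple(int(v) for v in box) for box in found]


def ocr_text_boxes(image_path):
    """Boxes (x, y, w, h) for each non-empty word found in sparse-text mode (psm 11)."""
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path),
        config="--psm 11",
        output_type=pytesseract.Output.DICT,
    )
    return [
        (x, y, w, h)
        for text, x, y, w, h in zip(
            data["text"], data["left"], data["top"], data["width"], data["height"]
        )
        if text.strip()
    ]


def area_fraction(boxes, image_w, image_h):
    """Share of the image covered by the boxes (overlaps counted twice, capped at 1.0)."""
    covered = sum(w * h for _x, _y, w, h in boxes)
    return min(1.0, covered / (image_w * image_h))
```

The same `area_fraction` helper works for both the face-area and text-coverage checks, since each reduces to box area over image area.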
Real answers from how the product behaves. The two layers, the niche pool, the size brackets, version history, and what won’t work.