WeMkr · AI Test Suite

8 Test Suites

—Total Scenarios

11 Subject / Logo Categories

—Last Run

—Pass Rate

Active Suites — Ready to Run

Suite F · Conversation Flow

12 scenarios testing agent decision logic: when to generate vs. when to ask a question. Covers multi-turn coherence, bypass phrases, and graceful exits. Free to run, since no images are generated.

Active · Run: npm run test:flow · 12 scenarios · 0 images

Suite RB · Raw vs WeMkr Benchmark

5 test cases targeting the specific failure modes where raw, unconstrained OpenAI image generation is weakest. Each case generates both a RAW image and a WeMkr image from the same input. Grade both; the delta shows where we add value.

Active · Run: npm run test:benchmark · TC1 text spelling · TC2 logo colors · TC3 no gradients · TC4 no-text · TC5 variant consistency
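One way to read the RAW vs WeMkr delta is to map each letter grade to a number and subtract. The sketch below assumes a 4-to-0 point mapping for grades A through F and a hypothetical `results` array; neither the mapping nor the field names come from the actual benchmark script.

```javascript
// Hypothetical point values for the A-F grading scale (an assumption, not part of the suite).
const GRADE_POINTS = new Map([["A", 4], ["B", 3], ["C", 2], ["D", 1], ["F", 0]]);

// Returns WeMkr points minus RAW points for one test case.
// A positive value means the WeMkr image was graded higher than the RAW image.
function gradeDelta(rawGrade, wemkrGrade) {
  if (!GRADE_POINTS.has(rawGrade) || !GRADE_POINTS.has(wemkrGrade)) {
    throw new Error(`Unknown grade: ${rawGrade} / ${wemkrGrade}`);
  }
  return GRADE_POINTS.get(wemkrGrade) - GRADE_POINTS.get(rawGrade);
}

// Illustrative grades only; real values come from grading the generated images.
const results = [
  { id: "TC1", raw: "D", wemkr: "B" },
  { id: "TC2", raw: "C", wemkr: "A" },
];

for (const r of results) {
  console.log(r.id, gradeDelta(r.raw, r.wemkr));
}
```

A delta of zero or below on a test case would suggest the WeMkr constraints are not helping for that failure mode.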

Suite A · Subject Generation

8 subject categories × n samples each: animals, events, company logos, people, places, abstract, sports, and food. Tests whether the agent generates the correct subject from a natural language request.

Ready · Run: npm run test:subjects (n=5, ~$0.80) · 8 categories · pool-sampled
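If each sample produces one image, the figures for this suite imply roughly 8 × 5 = 40 images per run, or about $0.02 per image at ~$0.80 total. A minimal sketch of pool sampling and that cost estimate follows. The `POOLS` object, its prompts, and both function names are hypothetical, and the per-image cost is derived only from the quoted run total, not from any pricing source.

```javascript
// Hypothetical prompt pools; the real suite covers 8 categories defined elsewhere.
const POOLS = {
  animals: ["a red fox", "a bald eagle", "a sea turtle"],
  food: ["a slice of pizza", "a taco", "a cupcake"],
};

// Draw up to n distinct prompts from a pool using a partial Fisher-Yates shuffle.
function samplePool(pool, n, random = Math.random) {
  const copy = [...pool];
  const count = Math.min(n, copy.length);
  for (let i = 0; i < count; i++) {
    const j = i + Math.floor(random() * (copy.length - i));
    [copy[i], copy[j]] = [copy[j], copy[i]];
  }
  return copy.slice(0, count);
}

// Estimated run cost, assuming one image per sample at a flat per-image price.
function estimateCost(categories, n, costPerImage) {
  return categories * n * costPerImage;
}

const perImage = 0.8 / (8 * 5); // derived from "n=5, ~$0.80"
console.log(samplePool(POOLS.animals, 2));
console.log(estimateCost(8, 5, perImage).toFixed(2));
```

Passing a seeded `random` function instead of `Math.random` would make a sampled run reproducible.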

Suite L · Creative Logo Pin — Two-Pass Agent

8 logo inputs × 2 turns each. Validates Pass 1 (creative concepts, no JSON leak) and Pass 2 (correct production action). Covers: mascot emblem, circular badge, wordmark, monogram, sport logo, org logo, corporate wordmark, graphic silhouette.

Active · Run: node scripts/test-creative-logo.mjs · 8 logos · 2 turns · ~6 images generated
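A "no JSON leak" check for Pass 1 can be approximated with a heuristic like the one below. This is a sketch under assumptions: the function name `hasJsonLeak` is hypothetical, and the actual test script may detect leaks differently.

```javascript
// Heuristic: does the reply contain a parseable, non-empty JSON object?
function hasJsonLeak(reply) {
  // Try every "{" as a possible start of a JSON object.
  for (let start = 0; start < reply.length; start++) {
    if (reply[start] !== "{") continue;
    // Try each "}" after it as the end, longest candidate first.
    for (
      let end = reply.lastIndexOf("}");
      end > start;
      end = reply.lastIndexOf("}", end - 1)
    ) {
      try {
        const value = JSON.parse(reply.slice(start, end + 1));
        if (value !== null && typeof value === "object" && Object.keys(value).length > 0) {
          return true;
        }
      } catch {
        // Not valid JSON for this candidate span; keep scanning.
      }
    }
  }
  return false;
}

console.log(hasJsonLeak("Here are three concepts: a fox mascot, a crest, a wordmark."));
console.log(hasJsonLeak('Sure! {"action": "generate", "prompt": "fox"}'));
```

Checking only for objects, rather than arrays as well, avoids flagging ordinary bracketed text such as "[1]" in a creative brief.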

Suite LI · Logo Iteration Depth

3 logos × 3 concept picks each = 9 generated images. Answers: how many distinct, quality pin designs can the agent produce from a single logo input? Tests mascot badge, circular crest, and graphic-only variants per logo.

Active · Run: node scripts/test-logo-iterations.mjs · 3 logos · 3 concepts each · 9 images

Suite B · Logo Input → Pin Output (End-to-End)

Full pipeline test with real logo image uploads. Covers all logo sub-types from the sample set. Two passes: T1 = creative brief (no JSON), T2 = production JSON + DALL-E generation. Text validation checked on every output.

Active · Run: node scripts/test-creative-logo.mjs · To expand, add logos to the TESTS array in scripts/test-creative-logo.mjs

Suite C · Profile Shape Accuracy

Prompt pool testing round, oval, die-cut, star, heart, rectangle, half-circle, and irregular shapes. Visual assertion: does the silhouette match the requested shape?

Pending · Not yet built · 6 shapes · pool-sampled

Suite D · Color Instruction Following

Pools for color commands: "make it red", "change that to black", "make it colorful", "pastel only", "monochrome". Tests whether the generated image matches the requested color treatment.

Pending · Not yet built · 6 color categories · pool-sampled

Suite E · Direction Following

Tests agent ability to follow spatial and compositional instructions: "move this left", "put text on top", "make it bigger", "add a border", "flip it". Multi-turn refinement sequences.

Pending · Not yet built · 4 direction types · multi-turn

Planned — Future Suites

Suite G · Multi-Product Coverage

Runs subject and shape tests across all product types: pins, coins, stickers, patches. Checks that the agent respects the manufacturing constraints of each product type.

Planned · coins · stickers · patches · pins

Suite H · Subject Change

Multi-turn tests where the user replaces the subject entirely mid-conversation: horse → flower, logo → image, abstract → portrait. Tests whether the agent rewrites rather than appends.

Planned · subject-swap pairs · multi-turn
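Since this suite is only planned, the following is a speculative sketch of how "rewrites rather than appends" might be asserted at the text level. It assumes the agent's final generation prompt is available as a string; the function name `classifySubjectSwap` and its return labels are hypothetical.

```javascript
// Classify how a generation prompt handled a subject swap.
// "rewrite": only the new subject appears; "append": both subjects appear;
// "unchanged": only the old subject appears; "missing": neither appears.
function classifySubjectSwap(prompt, oldSubject, newSubject) {
  const text = prompt.toLowerCase();
  const hasOld = text.includes(oldSubject.toLowerCase());
  const hasNew = text.includes(newSubject.toLowerCase());
  if (hasNew && !hasOld) return "rewrite";
  if (hasNew && hasOld) return "append";
  return hasOld ? "unchanged" : "missing";
}

console.log(classifySubjectSwap("Enamel pin of a flower, bold outlines", "horse", "flower"));
console.log(classifySubjectSwap("Enamel pin of a horse with a flower", "horse", "flower"));
```

Plain substring matching is a simplification: it would treat "seahorse" as containing "horse", so a real check would likely need word-boundary matching or a visual assertion on the image itself.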

How to Grade Results

Grading Scale

Click any PASS/FAIL row in the test report to expand it. You'll see the exact prompt sent, and for image tests, the full generated result. Grade each result using the buttons at the bottom of the expanded row.

A: Excellent. Production-ready output.
B: Good. Minor issues only.
C: Partial. Usable but needs prompt work.
D: Poor. Got something but wrong.
F: Fail. Wrong subject, shape, or no output.