8 Test Suites
Total Scenarios
11 Subject / Logo Categories
Last Run
Pass Rate
Active Suites — Ready to Run
Not run
Suite F · Conversation Flow
12 scenarios testing agent decision logic: when to generate an image vs. when to ask a question. Covers multi-turn coherence, bypass phrases, and graceful exits. Free to run: no images are generated.
Active · Run: npm run test:flow · 12 scenarios · 0 images
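A generate-vs-ask check of this kind might be sketched as below. The scenario shape, the `stubDecide` function, and the example phrases are all hypothetical stand-ins, since the actual scenario format and agent interface are not shown here.

```javascript
// Hypothetical scenario shape: the real suite's format is not shown in this document.
const flowScenarios = [
  { name: "vague request", turns: ["I want a pin"], expect: "ask" },
  { name: "bypass phrase", turns: ["I want a pin", "just make something"], expect: "generate" },
];

// Stand-in for the agent's decision step (assumption): returns "generate" or "ask".
function stubDecide(turns) {
  const last = turns[turns.length - 1].toLowerCase();
  return last.includes("just make") ? "generate" : "ask";
}

// Runs one scenario and compares the agent's decision with the expected one.
function runFlowScenario(scenario, decide) {
  const actual = decide(scenario.turns);
  return {
    name: scenario.name,
    expected: scenario.expect,
    actual,
    pass: actual === scenario.expect,
  };
}

const flowResults = flowScenarios.map((s) => runFlowScenario(s, stubDecide));
console.log(flowResults);
```

Because only the decision is compared, a check like this never needs to call image generation, which is consistent with the suite producing 0 images.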
Not run
Suite RB · Raw vs WeMkr Benchmark
5 test cases targeting the specific failure modes where raw, unconstrained OpenAI output is weakest. Each case generates both a RAW image and a WeMkr image from the same input. Grade both; the delta shows where we add value.
Active · Run: npm run test:benchmark · TC1 text spelling · TC2 logo colors · TC3 no gradients · TC4 no-text · TC5 variant consistency
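One way to turn paired grades into a delta is sketched here. The numeric point mapping is an assumption (the grading scale defines letters only, not point values), and the example grades are placeholders for illustration, not real results.

```javascript
// Assumed numeric mapping for the A-F scale; the document defines no point values.
const GRADE_POINTS = { A: 4, B: 3, C: 2, D: 1, F: 0 };

// Positive delta means the WeMkr image was graded higher than the RAW image.
function gradeDelta(rawGrade, wemkrGrade) {
  return GRADE_POINTS[wemkrGrade] - GRADE_POINTS[rawGrade];
}

// Placeholder grades for illustration only, not real results.
const exampleGrades = [
  { testCase: "TC1 text spelling", raw: "D", wemkr: "B" },
  { testCase: "TC3 no gradients", raw: "C", wemkr: "C" },
];

for (const g of exampleGrades) {
  console.log(g.testCase, gradeDelta(g.raw, g.wemkr));
}
```

A delta of zero on a test case would suggest that the WeMkr pipeline adds nothing measurable for that failure mode, at least under this mapping.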
Not run
Suite A · Subject Generation
8 subject categories × n samples each: animals, events, company logos, people, places, abstract, sports, food. Tests whether the agent generates the correct subject from a natural language request.
Ready · Run: npm run test:subjects (n=5, ~$0.80) · 8 categories · pool-sampled
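The run size and pool sampling can be sketched as follows. The `samplePool` helper is hypothetical (the suite's actual sampler is not shown), and the per-image cost is only what the ~$0.80 estimate implies if each sample produces exactly one image.

```javascript
// Draws n distinct prompts from a pool using a partial Fisher-Yates shuffle.
// Hypothetical helper: the suite's real sampling code is not shown in this document.
function samplePool(pool, n, rng = Math.random) {
  const items = [...pool];
  const count = Math.min(n, items.length);
  for (let i = 0; i < count; i++) {
    const j = i + Math.floor(rng() * (items.length - i));
    [items[i], items[j]] = [items[j], items[i]];
  }
  return items.slice(0, count);
}

// Category names come from the suite description; n=5 comes from the run line.
const CATEGORIES = [
  "animals", "events", "company logos", "people",
  "places", "abstract", "sports", "food",
];
const SAMPLES_PER_CATEGORY = 5;
const totalImages = CATEGORIES.length * SAMPLES_PER_CATEGORY; // 8 x 5 = 40

// Implied per-image cost, assuming the ~$0.80 estimate covers one image per sample.
const impliedCostPerImage = 0.8 / totalImages;
console.log(totalImages, impliedCostPerImage.toFixed(2));
```

Passing the random source as a parameter keeps the sampler testable: a fixed `rng` yields a repeatable selection.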
Not run
Suite LI · Logo Iteration Depth
3 logos × 3 concept picks each = 9 generated images. Answers: how many distinct, quality pin designs can the agent produce from a single logo input? Tests mascot badge, circular crest, and graphic-only variants per logo.
Active · Run: node scripts/test-logo-iterations.mjs · 3 logos · 3 concepts each · 9 images
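The 3 × 3 job matrix can be enumerated as in this minimal sketch. The concept names come from the suite description; the logo identifiers are placeholders, since the actual logo files are not named here.

```javascript
// Concept variants named in the suite description.
const CONCEPTS = ["mascot badge", "circular crest", "graphic-only"];
// Placeholder logo identifiers; the actual logo files are not named in this document.
const LOGOS = ["logo-1", "logo-2", "logo-3"];

// One generation job per (logo, concept) pair.
const iterationJobs = LOGOS.flatMap((logo) =>
  CONCEPTS.map((concept) => ({ logo, concept }))
);
console.log(iterationJobs.length);
```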
Not run
Suite B · Logo Input → Pin Output (End-to-End)
Full pipeline test with real logo image uploads. Covers all logo sub-types from the sample set. Two passes: T1 = creative brief (no JSON), T2 = production JSON + DALL-E generation. Text validation checked on every output.
Active · Run: node scripts/test-creative-logo.mjs · To expand, add logos to the TESTS array in scripts/test-creative-logo.mjs
Suite C · Profile Shape Accuracy
Prompt pool testing round, oval, die-cut, star, heart, rectangle, half-circle, and irregular shapes. Visual assertion: does the silhouette match the requested shape?
Pending · Not yet built · 6 shapes · pool-sampled
Suite D · Color Instruction Following
Pools for color commands: "make it red", "change that to black", "make it colorful", "pastel only", "monochrome". Tests whether the generated image matches the requested color treatment.
Pending · Not yet built · 6 color categories · pool-sampled
Suite E · Direction Following
Tests agent ability to follow spatial and compositional instructions: "move this left", "put text on top", "make it bigger", "add a border", "flip it". Multi-turn refinement sequences.
Pending · Not yet built · 4 direction types · multi-turn
Planned — Future Suites
Suite G · Multi-Product Coverage
Runs subject and shape tests across all product types: pins, coins, stickers, patches. Ensures the agent understands product-specific manufacturing constraints per type.
Planned · coins · stickers · patches · pins
Suite H · Subject Change
Multi-turn tests where the user replaces the subject entirely mid-conversation: horse → flower, logo → image, abstract → portrait. Tests whether the agent rewrites rather than appends.
Planned · subject-swap pairs · multi-turn
How to Grade Results
Grading Scale
Click any PASS/FAIL row in the test report to expand it. You'll see the exact prompt sent, and for image tests, the full generated result. Grade each result using the buttons at the bottom of the expanded row.
A: Excellent. Production-ready output.
B: Good. Minor issues only.
C: Partial. Usable but needs prompt work.
D: Poor. Got something but wrong.
F: Fail. Wrong subject, shape, or no output.
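Once results are graded, the letters can be tallied per suite. The sketch below assumes grades are collected as an array of letters; `tallyGrades` is a hypothetical helper, not part of the test report itself.

```javascript
// Grade labels as listed in the grading scale.
const GRADE_LABELS = {
  A: "Excellent",
  B: "Good",
  C: "Partial",
  D: "Poor",
  F: "Fail",
};

// Counts how many results received each grade; rejects letters outside the scale.
function tallyGrades(grades) {
  const counts = { A: 0, B: 0, C: 0, D: 0, F: 0 };
  for (const g of grades) {
    if (!Object.hasOwn(counts, g)) throw new Error(`Unknown grade: ${g}`);
    counts[g] += 1;
  }
  return counts;
}

// Example with made-up grades, for illustration only.
const tally = tallyGrades(["A", "B", "A", "F"]);
for (const [grade, count] of Object.entries(tally)) {
  console.log(`${grade} (${GRADE_LABELS[grade]}): ${count}`);
}
```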