Active Suites — Ready to Run
Not run
Suite F · Conversation Flow
12 scenarios testing the agent's decision logic: when to generate versus when to ask a question. Covers multi-turn coherence, bypass phrases, and graceful exits. Free: no images are generated.
Not run
Suite RB · Raw vs WeMkr Benchmark
5 test cases targeting the specific failure modes where raw, unconstrained OpenAI generation is weakest. Each case generates both a RAW image and a WeMkr image from the same input. Grade both; the delta between the two grades shows where we add value.
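As a rough illustration of the delta described above, the sketch below subtracts the RAW grade from the WeMkr grade for each case. It assumes grades can be recorded as integers; the `BenchmarkCase` class, the `summarize` helper, and the case names are hypothetical and not part of the test tool.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkCase:
    """One Suite RB case with a grade for each image (hypothetical structure)."""
    name: str
    raw_grade: int     # grade given to the RAW image (assumed numeric)
    wemkr_grade: int   # grade given to the WeMkr image (assumed numeric)

    @property
    def delta(self) -> int:
        # Positive means the WeMkr image was graded higher than the RAW image.
        return self.wemkr_grade - self.raw_grade


def summarize(cases: list[BenchmarkCase]) -> dict[str, int]:
    """Map each case name to its WeMkr-minus-RAW grade delta."""
    return {case.name: case.delta for case in cases}
```

A positive delta marks a case where the WeMkr pipeline outperformed the raw model, a zero or negative delta marks one where it did not.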
Not run
Suite A · Subject Generation
8 subject categories × n samples each: animals, events, company logos, people, places, abstract, sports, and food. Tests whether the agent generates the correct subject from a natural language request.
Not run
Suite L · Creative Logo Pin — Two-Pass Agent
8 logo inputs × 2 turns each. Validates Pass 1 (creative concepts, no JSON leak) and Pass 2 (correct production action). Covers: mascot emblem, circular badge, wordmark, monogram, sport logo, org logo, corporate wordmark, graphic silhouette.
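One way the "no JSON leak" check for Pass 1 might be expressed is to scan the agent's reply for any substring that parses as a JSON object. This is a minimal sketch under that assumption; `contains_json_object` is a hypothetical helper, and the suite's real check may differ.

```python
import json


def contains_json_object(text: str) -> bool:
    """Return True if any '{' in the text starts a non-empty, parseable JSON object."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at the given index.
            value, _end = decoder.raw_decode(text, index)
        except json.JSONDecodeError:
            continue
        if isinstance(value, dict) and value:
            return True
    return False
```

Under this sketch, a Pass 1 reply passes when `contains_json_object(reply)` is False, and a Pass 2 reply would be expected to return True.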
Not run
Suite LI · Logo Iteration Depth
3 logos × 3 concept picks each = 9 generated images. Answers: how many distinct, quality pin designs can the agent produce from a single logo input? Tests mascot badge, circular crest, and graphic-only variants per logo.
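The 3 × 3 = 9 run matrix above can be enumerated as a Cartesian product. The logo names below are placeholders, and treating the three listed variants as the three concept picks is an assumption.

```python
from itertools import product

# Placeholder logo identifiers; the real suite uses its own logo inputs.
LOGOS = ["logo_1", "logo_2", "logo_3"]
# The three variants named in the suite description.
VARIANTS = ["mascot badge", "circular crest", "graphic-only"]

# Every (logo, variant) pair corresponds to one generated image.
RUNS = list(product(LOGOS, VARIANTS))
```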
Not run
Suite B · Logo Input → Pin Output (End-to-End)
Full pipeline test with real logo image uploads. Covers all logo sub-types from the sample set. Two passes: T1 = creative brief (no JSON), T2 = production JSON + DALL-E generation. Text validation is checked on every output.
Suite C · Profile Shape Accuracy
Prompt pool testing round, oval, die-cut, star, heart, rectangle, half-circle, and irregular shapes. Visual assertion: does the silhouette match the requested shape?
Suite D · Color Instruction Following
Prompt pools for color commands: "make it red", "change that to black", "make it colorful", "pastel only", "monochrome". Tests whether the generated image matches the requested color treatment.
Suite E · Direction Following
Tests the agent's ability to follow spatial and compositional instructions: "move this left", "put text on top", "make it bigger", "add a border", "flip it". Includes multi-turn refinement sequences.
Planned — Future Suites
Suite G · Multi-Product Coverage
Runs subject and shape tests across all product types: pins, coins, stickers, patches. Checks that the agent understands the manufacturing constraints of each product type.
Suite H · Subject Change
Multi-turn tests where the user replaces the subject entirely mid-conversation: horse → flower, logo → image, abstract → portrait. Tests whether the agent replaces the old subject rather than appending the new one to it.
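A text-level approximation of the Suite H check could look like the sketch below: after a subject change, the final prompt should mention the new subject and no longer mention the old one. The `subject_was_replaced` helper and the simple word match (with an optional plural "s") are assumptions for illustration, not the planned implementation.

```python
import re


def subject_was_replaced(prompt: str, old_subject: str, new_subject: str) -> bool:
    """Return True if the prompt mentions the new subject and not the old one."""

    def mentions(word: str) -> bool:
        # Whole-word, case-insensitive match, allowing a trailing plural "s".
        pattern = rf"\b{re.escape(word)}s?\b"
        return re.search(pattern, prompt, re.IGNORECASE) is not None

    return mentions(new_subject) and not mentions(old_subject)
```

For the horse → flower case, a prompt that still mentions the horse alongside the flower would count as an append and fail the check.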
How to Grade Results
Grading Scale
Click any PASS/FAIL row in the test report to expand it. You'll see the exact prompt sent and, for image tests, the full generated result. Grade each result using the buttons at the bottom of the expanded row.