StudioMI300
Director Agent + vision critic + image, video, music & voice models — all on a single AMD Instinct MI300X.
Live demo paused — hackathon ended
The AMD x lablab hackathon has wrapped, so the on-demand MI300X demo is paused. Every clip in the archive below was generated end-to-end on a single AMD Instinct MI300X during the event (FLUX.2 [klein] 4B keyframe + Wan2.2-I2V-A14B at 81 frames / 16 fps, FBCache 0.08, ~6 minutes per clip).
Not Sora. Not Runway. Not Veo. Every frame here was made by models you can download, weights you can self-host, and code you can fork. No paywall, no waitlist, no usage cap. See the side-by-side comparison below for the full breakdown.
Generations archive
Why this is not another frontier-model clone
Sora, Runway Gen-3, Google Veo, Kling, Pika — all closed weights, all hosted-only, all paid. They produce beautiful clips, and they leave you with zero leverage: you can't fork them, can't host them on your own GPU, can't see their critic logic, can't sell the output under terms you control, can't extend the pipeline for a new use case without their permission.
StudioMI300 is the opposite stack — built so that the work this project produces is owned by the person who runs it, not rented from a vendor.
Side by side
| Dimension | Sora · Runway · Veo · Kling · Pika | StudioMI300 |
|---|---|---|
| Weights | Closed, vendor-only | Apache 2.0 / MIT — every model |
| Output license | Vendor ToS, often non-commercial | Commercial use, no royalties |
| Where it runs | Vendor cloud only | Any MI300X / any ROCm host |
| Pipeline | Black-box single model | 8 stages, every artifact extractable |
| Story planning | Hidden inside the model | Director Agent emits a JSON plan |
| Quality control | None — render once, hope | Vision critic with 10 failure labels, auto-retry |
| Music | Vendor-locked or stock licensing | ACE-Step v1, open weights, royalty-free |
| Narration | Not included | Kokoro-82M, 9 languages, per-shot timing |
| Cost per 30s reel | $0.50 – $4 per render, per attempt | One GPU-hour, fully amortizable |
| Audit & reproducibility | None | Full plan.json + every keyframe + every clip + critic verdicts saved |
| Vendor lock-in | Total | None — fork and ship |
What the open stack uniquely gives you
1. The Director's plan is inspectable. Sora returns an mp4. StudioMI300 returns the mp4 plus the 6-shot plan, the character bibles, the music brief, and the per-shot voice-over script — as structured JSON. Producers can edit the plan and re-render only the shots they changed (see the sketch after this list). Try doing that on Runway.
2. The vision critic is explainable. Every clip carries the critic's verdict — character drift, extras invade frame, walking backwards, etc. — with the retry strategy that fixed it. Sora gives you a frame; this gives you a paper trail.
3. Identity without LoRA training. FLUX.2 [klein] reference editing pins identity by construction — no per-character training step, no dataset prep, no 30-minute fine-tune wait. Sora has no concept of a named character across shots; here it's first-class.
4. Locale-aware narration. Director picks the narration language to match the setting — Tokyo → Japanese, Paris → French, Mumbai → Hindi. Sora narrates in nothing.
5. Sequential single-GPU orchestration. A 35B-MoE director, a 4B diffusion model, a 14B I2V model, a 3.5B music model, and a TTS share one MI300X by loading sequentially. This is the part that only works because of 192 GB HBM3 — and the part that frontier vendors never have to expose, because their cost structure is subsidized by a closed API.
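To make point 1 concrete, here is a minimal sketch of the edit-and-re-render flow in Python. It assumes the plan artifact is the plan_expanded.json named later on this page; the exact re-render entry point depends on your setup.

```python
import json

plan_path = "outputs/my_reel/plan_expanded.json"  # path is illustrative

# Load the Director's plan, rewrite one shot's prompt, save it back.
with open(plan_path) as f:
    plan = json.load(f)

shot = plan["shots"][3]
shot["prompt"] = shot["prompt"].replace(
    "Golden Gate Bridge",            # a landmark Wan2.2 drifts on (see critic logs below)
    "a quiet waterfront promenade",
)

with open(plan_path, "w") as f:
    json.dump(plan, f, indent=2)

# Re-rendering then only needs to redo shot 3's keyframe, clip and critic pass;
# the character masters, music and the other five clips are untouched.
```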
What it deliberately does not try to do
Frontier models invest billions of training-compute into raw photoreal fidelity. StudioMI300 doesn't chase that — it composes the best open weights available right now into a pipeline that delivers the entire creative artifact (story, characters, shots, music, voice, mix) instead of a single isolated clip. The bet: an open, transparent, end-to-end pipeline that ships every month with the latest open weights will outpace any closed vendor on the dimensions that actually matter to a producer — control, auditability, ownership, and cost.
Frontier models give you a clip. This gives you a studio.
Pre-rendered reels from the live pipeline
Each reel is an actual mp4 produced end-to-end by the pipeline on the MI300X droplet — one prompt in, finished reel out. No human selected or trimmed shots. The vision critic ran on every clip.
San Francisco walk - golden hour to blue hour
Logline. A young woman walks alone down a steep Pacific Heights street, past painted Victorians and rolling fog, to a quiet overlook of the Golden Gate Bridge as the light shifts to blue hour.
Prompt.
30-second cinematic reel: a young woman walks alone through San Francisco at golden hour - down a steep Pacific Heights street with bay views, past painted Victorian houses, fog rolling in over the Pacific, ending at a quiet overlook of the Golden Gate Bridge as the light shifts to blue hour
Music. intimate ambient piano with a soft synth pad, 75 BPM, contemplative
Voice-over. American English (Director picked from setting)
Render time. 81 min on 1× MI300X
The pipeline
Eight stages run sequentially on one GPU. Each model loads, runs, unloads — making room for the next. No multi-GPU magic, no separate inference servers, no LoRA training step.
1. Director Agent (Qwen3.5-35B-A3B). Plans 6 cinematic shots with character portraits, music brief, voice-over script and language tag. Same checkpoint doubles as the vision critic in stage 5.
2. Character masters (FLUX.2 [klein]). One canonical image per character + an ABC group composition. These pin identity for every downstream shot.
3. Per-shot keyframes (FLUX.2 [klein] reference edit). Master image goes in as conditioning, shot prompt drives the edit. Identity is preserved by construction — no LoRA training, no per-character setup.
4. Animation (Wan2.2-I2V-A14B). Dual-expert MoE diffusion, 121 frames at 24 fps. ParaAttention FBCache 2× lossless + selective torch.compile on transformer_2 (1.2× compile win).
5. Vision critic (Qwen3.5-35B-A3B). Grades each clip on character_match, scene_match, composition, artifact_free. Below 7/10 → re-render with a bumped seed (max 3 attempts).
6. Music (ACE-Step v1). Audio diffusion produces a 30-second instrumental matching the Director's brief (BPM, mood, instrumentation, no-drums hint).
7. Voice-over (Kokoro-82M). Director picks the language to match the setting (Tokyo→ja, Paris→fr, Mumbai→hi, ...). Script is written in that language, not translated.
8. Final mix. Six clips concatenated, upscaled to 1280×704, audio loudness-normalised; output is a single mp4.
Why research-driven prompts?
The Director's planner and the vision critic system prompts aren't folklore. They distill 16 sources (Alibaba's official Wan2.2 system prompts, the official prompt rewriter, ComfyUI community guides, InstaSD's controlled camera tests, HuggingFace Forums) into hard rules:
- Verbatim Chinese trained negative prompt from `shared_config.py` — umT5 was multilingual-pretrained against those exact tokens; the English translation is observably weaker.
- Positive boundary sentences instead of "EXACTLY N people" — umT5 doesn't ground numerics; Wan2.2 distorts the crowd trying to enforce a count.
- Lens / film tags (`Arri Alexa, anamorphic, 35mm film grain`) instead of `cinematic` — that word triggers Wan2.2's stylization branch and gives the AI look.
- Sentence-case motion verbs described as a process, not ALL-CAPS shouting. The all-caps trick is community folklore with no documented support; Alibaba's own examples use lowercase.
- One camera verb per shot, placed first — multiple verbs in one sentence ("dolly in tracking tilt up") cancel each other out.
Full research write-up lives in the GitHub repo (research/wan22_prompting.md).
The self-correcting render loop
Most generative video pipelines render once and pray. This one re-checks every clip with a 35-billion-parameter vision model, scores it on four 1–10 axes, and re-renders if it fails. The same Qwen3.5-35B that planned the story now grades it.
The critic returns four scores (character_match, scene_match, composition, artifact_free) plus a list of structured failure labels. The labels are machine-readable and feed back into the planner's retry strategy.
Up to three attempts per shot. After that, the best-scoring attempt ships and the issue list goes into the run log. The pipeline is self-correcting, not blind.
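A minimal sketch of that loop, with `render_clip` and `critique` as stand-ins for the Wan2.2 render and the Qwen critic call (the names are illustrative, not the project's actual functions):

```python
def render_with_critic(shot, render_clip, critique, base_seed=0,
                       threshold=7, max_attempts=3):
    """Score-gated retry: re-render with a bumped seed until the critic's
    overall score clears the threshold, else ship the best attempt.
    render_clip(shot, seed) -> clip path; critique(clip, shot) -> verdict dict."""
    best_clip, best_verdict = None, {"overall": -1}
    for attempt in range(max_attempts):
        clip = render_clip(shot, seed=base_seed + attempt)   # bumped seed per retry
        verdict = critique(clip, shot)                       # four 1-10 axes + issue labels
        if verdict["overall"] > best_verdict["overall"]:
            best_clip, best_verdict = clip, verdict
        if verdict["overall"] >= threshold:
            return clip, verdict
    return best_clip, best_verdict  # below threshold: best attempt ships, issues hit the run log
```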
Real verdicts pulled from the run logs
These are actual JSON returns from Qwen3.5-35B critiquing real Wan2.2 clips on this pipeline. The labels feed back into the planner's retry strategy.
{ "shot": 0, "attempt": 1, "score": {
"character_match": 9, "scene_match": 8, "composition": 9, "artifact_free": 7,
"issues": ["STYLIZED_AI_LOOK: skin texture appears slightly plastic/smooth in close-up frames 1-2",
"OBJECT_MORPHING: background bridge structure shifts from Golden Gate to a generic suspension bridge mid-clip"],
"overall": 8 }}
{ "shot": 2, "attempt": 1, "score": {
"character_match": 10, "scene_match": 10, "composition": 10, "artifact_free": 9,
"issues": [],
"overall": 10 }}
{ "shot": 3, "attempt": 2, "score": {
"character_match": 4, "scene_match": 3, "composition": 2, "artifact_free": 5,
"issues": ["CHARACTER_DRIFT: Subject identity changes completely in final frame from long-haired woman in trench coat to bob cut and turtleneck",
"SCENE_MISMATCH: Golden Gate Bridge vanishes in Frame 3, replaced by generic city street",
"CAMERA_IGNORED: Prompt requested 'static camera' but subject rotates 180 degrees and camera zooms",
"STYLIZED_AI_LOOK: Frame 4 plastic skin texture and oversaturated bokeh"],
"overall": 3 }}
The 10/10 was the awning two-shot of Kenji + Mei in v22 - identity locked,
no extras, lighting matches, no STYLIZED_AI_LOOK even at this resolution.
The 3/10 was the Golden Gate Bridge overlook - Wan2.2 can't reliably render
that landmark, drifts to generic suspension bridges. After 3 attempts the
pipeline ships the best one and logs the issues.
Acceleration on AMD MI300X
Cumulative end-to-end speedup: 2.5× lossless vs unoptimised Wan2.2 — 25.9 min → 10.4 min per 720p clip.
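For reference, FBCache is normally switched on through ParaAttention's diffusers adapter. A sketch, assuming the `apply_cache_on_pipe` entry point covers the Wan pipeline as the write-up above implies:

```python
import torch
from diffusers import WanImageToVideoPipeline
from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")  # ROCm exposes the MI300X through the same cuda device strings

# residual_diff_threshold is the FBCache knob in the presets below:
# 0.05 is the lossless setting; 0.08 / 0.10 trade quality for speed.
apply_cache_on_pipe(pipe, residual_diff_threshold=0.05)
```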
Knob presets (config.py)
| preset | num_frames | fps | hero / b-roll steps | FBCache | critic | est. minutes for 30s reel |
|---|---|---|---|---|---|---|
| default | 121 | 24 | 30 / 24 | 0.05 (lossless) | 7/10, 3 attempts | ~50-65 |
| cinematic | 121 | 24 | 30 / 24 | 0.05 | 7/10, 3 attempts | ~50-65 |
| fast | 97 | 24 | 20 / 18 | 0.08 | 6/10, 2 attempts | ~32-40 |
| draft | 81 | 24 | 14 / 14 | 0.10 | 5/10, 1 attempt | ~22-28 |
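Expressed as code, the table above might map to a config.py along these lines (a hypothetical mirror; field names are guesses, not the repo's actual schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Preset:
    num_frames: int
    fps: int
    hero_steps: int        # denoising steps for hero shots
    broll_steps: int       # denoising steps for b-roll shots
    fbcache: float         # FBCache residual-diff threshold (0.05 = lossless)
    critic_threshold: int  # minimum overall critic score out of 10
    max_attempts: int

PRESETS = {
    "default":   Preset(121, 24, 30, 24, 0.05, 7, 3),
    "cinematic": Preset(121, 24, 30, 24, 0.05, 7, 3),
    "fast":      Preset(97,  24, 20, 18, 0.08, 6, 2),
    "draft":     Preset(81,  24, 14, 14, 0.10, 5, 1),
}
```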
STUDIOMI_AITER_FP8=1 is a separate env switch; documented but disabled by
default until ROCm/aiter#2187 closes for the multi-shape Wan2.2 case.
What didn't work (and why)
| Tried | Result | Reason |
|---|---|---|
| MagCache via diffusers 0.38 hooks | dead, calibration empty | dual-transformer step counting confuses `_perform_calibration_step` |
| cache-dit DBCache + TaylorSeer | 22.87 min (slower than baseline) | TaylorSeer adds ~6 min on ROCm; cache-dit's L20 numbers don't reproduce |
| AITER FA3 `set_attention_backend("flash")` | hung 9+ min at step 0 | JIT compile for 81×1280×704 sequence never finishes |
| `guidance_scale_2=1.0` (skip CFG on low-noise) | 10.35 vs 10.36 min | diffusers WanPipeline doesn't actually short-circuit at boundary |
| `torch.compile(mode="max-autotune", fullgraph=True)` | crash | Dynamo error on Wan2.2 (diffusers#12728) |
| `to(memory_format=torch.channels_last)` on transformer_2 | RuntimeError | Wan2.2 transformer is rank-5 (B,C,F,H,W); channels_last is rank-4 only |
| AITER FP8 (gemm_a8w8, gemm_a8w8_CK) | segfault mid-pipeline | AITER#2187 multi-shape crash; standalone shape works on ROCm 7.2, pipeline composition does not |
Field journal
A subset of failures, root causes and fixes from May 6–10, 2026. These are the stories that don't show up in commit messages — the ones where the Wan2.2 prompt did something genuinely surprising, or where a kernel decided to disagree with the docs.
Full incident log is in incidents.md in the GitHub repo.
How the Director thinks
The Director Agent (Qwen3.5-35B-A3B via vLLM) doesn't just write a description. It returns a structured 6-shot plan with named characters, per-shot prompts (written in Wan2.2-friendly language: camera verb first, sentence-case motion, positive boundary phrases), a music brief, a per-shot voice-over array, and the language to narrate in.
{
"characters": {
"A": "Aiko (slim Japanese woman, 27, jet-black chin-length bob, ...)",
"B": "Kenji (Japanese man, 28, tall and lean, ...)",
"C": "Mei (Japanese woman, 26, shoulder-length lavender hair, ...)"
},
"story_logline": "Aiko walks alone through neon-lit Tokyo and reunites with two friends",
"shots": [
{
"index": 0, "is_hero": true, "shot_type": "Wide tracking",
"dominant_subject": "A", "cut": true,
"prompt": "Tracking shot following from behind at hip level. Aiko (slim Japanese woman, 27, jet-black bob, mustard yellow vinyl raincoat) walks down the center of the wet street, head turning slightly. Distant pedestrians stay blurred. Light rain falls steadily, neon signs flicker. shot on Arri Alexa, anamorphic, 35mm film grain, photorealistic"
},
"... 5 more shots ..."
],
"music_style": "intimate ambient piano with warm pad and soft synth bell, 75 BPM, melancholic but hopeful, no drums",
"vo_script_per_shot": [
"She had been walking alone for too long.",
"Tonight, the city felt softer.",
"Two figures waited under an awning.",
"She broke into a quick walk.",
"Their arms found hers.",
"Some places only feel like home because of who is standing in them."
],
"vo_lang": "j"
}
The exact same character description string repeats verbatim in every shot that character appears in — token-level consistency that acts like a character LoRA without the LoRA training.
Six-shot story arc template
| Shot | Role | Cut |
|---|---|---|
| 0 | Hero wide establishing - all main characters visible | true |
| 1 | Setup - protagonist's intent or POV moves the story forward | false |
| 2 | Other element - secondary character solo or detail insert | true if scene changes |
| 3 | Climax - two-character moment or A-with-OBJECT | false |
| 4 | Static medium close-up - face anchor, reduces drift accumulation | false |
| 5 | Closing wide - scene fades or A walks away | false or true |
Voice-over languages (Kokoro-82M)
Director picks the language that matches the setting. Tokyo scene → Japanese, Paris → French, Mumbai → Hindi, Rio → Brazilian Portuguese, anywhere else → American English.

| Code | Language | Default voice |
|---|---|---|
| `a` | American English | af_heart |
| `b` | British English | bf_emma |
| `e` | Spanish | ef_dora |
| `f` | French | ff_siwis |
| `h` | Hindi | hf_alpha |
| `i` | Italian | if_sara |
| `j` | Japanese | jf_alpha |
| `p` | Brazilian Portuguese | pf_dora |
| `z` | Mandarin Chinese | zf_xiaobei |
The vo_script_per_shot array is one line per shot, 6-10 words each (~3-4 seconds
of TTS at 150 wpm). Each Kokoro WAV gets layered onto the music bed at
i * 5.04 s offset via ffmpeg adelay, so the narration lands when the
visual beat lands - no description before or after the action.
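A sketch of that mixing step using ffmpeg's adelay and amix filters (file names are illustrative; the pipeline's actual mixer may differ):

```python
import subprocess

def mix_narration(music_wav, vo_wavs, out_wav, shot_len=5.04):
    """Delay each per-shot VO by i * shot_len seconds, then mix over the music bed."""
    inputs, filters, labels = ["-i", music_wav], [], []
    for i, vo in enumerate(vo_wavs):
        inputs += ["-i", vo]
        ms = round(i * shot_len * 1000)                       # 0, 5040, 10080, ...
        filters.append(f"[{i + 1}:a]adelay={ms}|{ms}[v{i}]")  # delay both channels
        labels.append(f"[v{i}]")
    filters.append("[0:a]" + "".join(labels)
                   + f"amix=inputs={len(vo_wavs) + 1}:duration=first[out]")
    subprocess.run(["ffmpeg", "-y", *inputs, "-filter_complex", ";".join(filters),
                    "-map", "[out]", out_wav], check=True)
```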
Live API server
The pipeline ships as a FastAPI server with an asyncio.Lock backing a strict-FIFO single-GPU queue. SSE event stream + per-artifact endpoints let a frontend render the pipeline phases as they happen, instead of waiting 45 minutes for one mp4.
# on your MI300X droplet
STUDIO_API_TOKEN=secret uvicorn server:app --host 0.0.0.0 --port 8000
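The queueing idea in miniature: an asyncio.Lock serializes GPU work while FastAPI keeps accepting jobs. This is a sketch of the pattern, not the project's server.py.

```python
import asyncio
import uuid

from fastapi import FastAPI

app = FastAPI()
gpu_lock = asyncio.Lock()       # CPython wakes lock waiters in FIFO order
jobs: dict[str, dict] = {}

@app.post("/jobs")
async def submit(body: dict):
    job_id = uuid.uuid4().hex[:12]
    jobs[job_id] = {"status": "queued", "prompt": body["prompt"]}
    jobs[job_id]["task"] = asyncio.create_task(run_job(job_id))  # returns at once
    return {"job_id": job_id, "status": "queued"}

async def run_job(job_id: str):
    async with gpu_lock:        # one job owns the GPU at a time
        jobs[job_id]["status"] = "running"
        # ... the eight pipeline stages would run here, emitting SSE events ...
        jobs[job_id]["status"] = "completed"
```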
Submit a job
curl -X POST https://your-droplet:8000/jobs \
-H "X-API-Token: secret" \
-H "Content-Type: application/json" \
-d '{"prompt": "30s reel: a violinist plays in a Brooklyn subway station at midnight, golden hour light through the platform windows", "use_critic": true}'
# -> {"job_id": "a3f9c1d2b6e8", "status": "queued"}
Watch it happen
curl -N https://your-droplet:8000/jobs/a3f9c1d2b6e8/stream
# (SSE stream)
data: {"stage":"started","ts":1778425000.1,"prompt":"30s reel: ..."}
data: {"stage":"plan_starting","ts":1778425000.5}
data: {"stage":"plan_ready","ts":1778425245.3,"logline":"...","n_shots":6,"characters":["A"],"music_style":"...","shots":[{...}]}
data: {"stage":"master_ready","ts":1778425248.1,"name":"A","path":"...master_A.png","seconds":7.8}
data: {"stage":"keyframe_ready","ts":1778425250.0,"shot":0,"path":"...keyframe_00.png"}
data: {"stage":"clip_started","ts":1778425251.2,"shot":0,"attempt":1,"flow_shift":5.0,"n_steps":30,"flf2v":true}
data: {"stage":"clip_rendered","ts":1778425759.6,"shot":0,"path":"...clip_00.mp4","minutes":8.47}
data: {"stage":"critic_starting","ts":1778425760.1,"shot":0,"frames":[...]}
data: {"stage":"critic_verdict","ts":1778425853.4,"shot":0,"score":{"character_match":8,"scene_match":9,"composition":9,"artifact_free":7,"issues":["STYLIZED_AI_LOOK: ..."],"overall":8}}
data: {"stage":"clip_passed","ts":1778425881.0,"shot":0,"attempts":1,"score":{...}}
data: {"stage":"music_starting","ts":1778428100.0,"style":"..."}
data: {"stage":"music_ready","ts":1778428170.4,"path":"...music.wav"}
data: {"stage":"vo_chunk_ready","ts":1778428172.1,"shot":0,"path":"...vo_00.wav","seconds":3.4,"text":"..."}
data: {"stage":"mix_done","ts":1778428180.0,"path":"...reel_final.mp4"}
data: {"stage":"completed","ts":1778428180.5,"final":"...reel_final.mp4"}
Per-artifact endpoints
While the job runs, fetch any artifact that's already on disk:
| Endpoint | Returns |
|---|---|
| `GET /jobs/{id}` | full status meta with latest event |
| `GET /jobs/{id}/events` | full jsonl event history |
| `GET /jobs/{id}/plan` | director's plan_expanded.json |
| `GET /jobs/{id}/master/{A,B,C,ABC,scene}` | a master keyframe png |
| `GET /jobs/{id}/keyframe/{0..5}` | a per-shot keyframe png |
| `GET /jobs/{id}/clip/{0..5}` | a per-shot mp4 (silent, 5 sec) |
| `GET /jobs/{id}/music` | the 30-second music wav |
| `GET /jobs/{id}/vo/{0..5}` | a per-shot voice-over wav |
| `GET /jobs/{id}/video` | final mixed reel mp4 (404 while running) |
`GET /jobs` returns the most recent 50 jobs. `GET /health` is auth-free for status.
Python client snippet
import requests
import sseclient  # needs the sseclient-py package: SSEClient(resp).events()
API = "https://your-droplet:8000"
H = {"X-API-Token": "secret"}
job = requests.post(f"{API}/jobs", headers=H, json={
"prompt": "30s reel: a cellist on a Brooklyn fire escape at sunset",
"use_critic": True,
}).json()
resp = requests.get(f"{API}/jobs/{job['job_id']}/stream", headers=H, stream=True)
for ev in sseclient.SSEClient(resp).events():
print(ev.data)
Multi-GPU routing
Each pipeline stage can pin to its own device via env vars (defaults to cuda:0):
STUDIOMI_GPU_FLUX=cuda:1 \
STUDIOMI_GPU_WAN=cuda:0 \
STUDIOMI_GPU_ACE=cuda:1 \
STUDIOMI_GPU_TTS=cuda:1 \
uvicorn server:app --host 0.0.0.0 --port 8000
On 2× MI300X you can render the next reel's plan on card 1 while card 0 still animates the current reel. Tested on a single-MI300X rig - the 2-card setup is designed but not yet validated.
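A sketch of how per-stage routing like this can be resolved (the env var names match the ones above; the helper itself is illustrative):

```python
import os

import torch

def stage_device(stage: str) -> torch.device:
    """Map STUDIOMI_GPU_<STAGE> to a device; ROCm exposes HIP GPUs as cuda:N."""
    return torch.device(os.environ.get(f"STUDIOMI_GPU_{stage}", "cuda:0"))

# e.g. wan_pipe.to(stage_device("WAN")); flux_pipe.to(stage_device("FLUX"))
```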
The stack — every model is permissively licensed
Every output is yours to use commercially.
| Stage | Model | Size | License |
|---|---|---|---|
| Planner & Critic | Qwen3.5-35B-A3B | 35B params (3B active) | Apache 2.0 |
| Image (keyframes) | FLUX.2 [klein] 4B | 4B params | Apache 2.0 |
| Video | Wan2.2-I2V-A14B | A14B (dual-expert MoE) | Apache 2.0 |
| Music | ACE-Step v1 | 3.5B params | Apache 2.0 |
| Voice-over | Kokoro-82M | 82M, 9 languages | Apache 2.0 |
| LLM serving | vLLM | — | Apache 2.0 |
| Diffusion cache | ParaAttention FBCache | — | Apache 2.0 |
| AMD kernels | AITER | — | MIT |
| Project code | StudioMI300 | — | MIT |
Why a single MI300X
192 GB HBM3 is overkill for any single model in this stack. The point is sequential diversity — the same card runs four very different model architectures back-to-back in one reel, with no offload to disk in between.
| Phase | VRAM peak | Compute pattern |
|---|---|---|
| 1. Director planning | ~70 GB BF16 | Qwen3.5-35B MoE LLM decode (vLLM + AITER MoE) |
| 2. Character masters | ~8 GB | FLUX.2 [klein] 4B diffusion transformer, 4 steps |
| 3. Wan2.2 animation | ~94 GB BF16 | Dual-expert MoE diffusion, 121 frames |
| 4. Vision critic | ~70 GB BF16 | Qwen3.5-35B re-loaded, vision-conditioned |
| 5. Music | ~12 GB | ACE-Step v1 audio diffusion, 27 steps |
| 6. Voice-over | < 1 GB | Kokoro-82M TTS, fits anywhere |
The ROCm allocator caches ~30 GB on top of any active model. With careful unload
and torch.cuda.empty_cache() between stages, all phases fit on the same 192 GB
card. On a 24 GB consumer GPU you'd need 4–5 separate machines wired together
just to host all of this.
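The load / run / unload discipline the table implies, as a minimal sketch (the stage callables are placeholders):

```python
import gc

import torch

def run_stage(load_model, run, *args):
    """Load one model, run its stage, then hand the VRAM back so the next
    (very different) architecture fits on the same 192 GB card."""
    model = load_model()
    try:
        return run(model, *args)
    finally:
        del model
        gc.collect()
        torch.cuda.empty_cache()  # release the allocator's cached blocks between stages
```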
That's the project's central constraint and its main flex on AMD's headline GPU.
Run it on your own MI300X
A 30-second reel takes ~45 minutes on one MI300X. That's too long for a casual visitor on a public Space, so this Space hosts only the showcase. To run the full pipeline yourself:
- Get an AMD MI300X (e.g. AMD Developer Cloud — $100 starting credits via the AMD AI Developer Program).
- Pull the `rocm/vllm-dev` container.
- Clone the repo and run:
python generate.py \
--prompt "a cellist plays in a Brooklyn subway station at midnight" \
--out outputs/my_reel \
--critic
Walk away for ~45 minutes. The pipeline plans, paints, animates, scores music, narrates and mixes — all autonomously. No prompt engineering per shot, no model swapping, no manual stitching.
→ Full code on GitHub
amd-hackathon-2026