StudioMI300
Director Agent + vision critic + image, video, music & voice models — all on a single AMD Instinct MI300X.
Live demo paused — hackathon ended
The AMD x lablab hackathon has wrapped, so the on-demand MI300X demo is paused. Every clip in the archive below was generated end-to-end on a single AMD Instinct MI300X during the event (FLUX.2 [klein] 4B keyframe + Wan2.2-I2V-A14B at 81 frames / 16 fps, FBCache 0.08, ~6 minutes per clip).
Not Sora. Not Runway. Not Veo. Every frame here was made by models you can download, weights you can self-host, and code you can fork. No paywall, no waitlist, no usage cap. See the side-by-side comparison below for the full breakdown.
Generations archive
Why this is not another frontier-model clone
Sora, Runway Gen-3, Google Veo, Kling, Pika — all closed weights, all hosted-only, all paid. They produce beautiful clips, and they leave you with zero leverage: you can't fork them, can't host them on your own GPU, can't see their critic logic, can't sell the output under terms you control, can't extend the pipeline for a new use case without their permission.
StudioMI300 is the opposite stack — built so that the work this project produces is owned by the person who runs it, not rented from a vendor.
Side by side
| Dimension | Sora · Runway · Veo · Kling · Pika | StudioMI300 |
|---|---|---|
| Weights | Closed, vendor-only | Apache 2.0 / MIT — every model |
| Output license | Vendor ToS, often non-commercial | Commercial use, no royalties |
| Where it runs | Vendor cloud only | Any MI300X / any ROCm host |
| Pipeline | Black-box single model | 8 stages, every artifact extractable |
| Story planning | Hidden inside the model | Director Agent emits a JSON plan |
| Quality control | None — render once, hope | Vision critic with 10 failure labels, auto-retry |
| Music | Vendor-locked or stock licensing | ACE-Step v1, open weights, royalty-free |
| Narration | Not included | Kokoro-82M, 9 languages, per-shot timing |
| Cost per 30s reel | $0.50 – $4 per render, per attempt | One GPU-hour, fully amortizable |
| Audit & reproducibility | None | Full plan.json + every keyframe + every clip + critic verdicts saved |
| Vendor lock-in | Total | None — fork and ship |
What the open stack uniquely gives you
1. The Director's plan is inspectable. Sora returns an mp4. StudioMI300 returns the mp4 plus the 6-shot plan, the character bibles, the music brief, and the per-shot voice-over script — as structured JSON. Producers can edit the plan and re-render only the shots they changed (see the sketch after this list). Try doing that on Runway.
2. The vision critic is explainable. Every clip carries the critic's verdict — character drift, extras invade frame, walking backwards, etc. — with the retry strategy that fixed it. Sora gives you a frame; this gives you a paper trail.
3. Identity without LoRA training. FLUX.2 [klein] reference editing pins identity by construction — no per-character training step, no dataset prep, no 30-minute fine-tune wait. Sora has no concept of a named character across shots; here it's first-class.
4. Locale-aware narration. Director picks the narration language to match the setting — Tokyo → Japanese, Paris → French, Mumbai → Hindi. Sora narrates in nothing.
5. Sequential single-GPU orchestration. A 35B-MoE director, a 4B diffusion model, a 14B I2V model, a 3.5B music model, and a TTS share one MI300X by loading sequentially. This is the part that only works because of 192 GB HBM3 — and the part that frontier vendors never have to expose, because their cost structure is subsidized by a closed API.
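To make point 1 concrete, here is a minimal sketch of the edit-and-re-render flow in Python. It assumes the plan artifact is the plan_expanded.json named later on this page; the exact re-render entry point depends on your setup.

```python
import json

plan_path = "outputs/my_reel/plan_expanded.json"  # path is illustrative

# Load the Director's plan, rewrite one shot's prompt, save it back.
with open(plan_path) as f:
    plan = json.load(f)

shot = plan["shots"][3]
shot["prompt"] = shot["prompt"].replace(
    "Golden Gate Bridge",            # a landmark Wan2.2 drifts on (see critic logs below)
    "a quiet waterfront promenade",
)

with open(plan_path, "w") as f:
    json.dump(plan, f, indent=2)

# Re-rendering then only needs to redo shot 3's keyframe, clip and critic pass;
# the character masters, music and the other five clips are untouched.
```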
What it deliberately does not try to do
Frontier models invest billions of training-compute into raw photoreal fidelity. StudioMI300 doesn't chase that — it composes the best open weights available right now into a pipeline that delivers the entire creative artifact (story, characters, shots, music, voice, mix) instead of a single isolated clip. The bet: an open, transparent, end-to-end pipeline that ships every month with the latest open weights will outpace any closed vendor on the dimensions that actually matter to a producer — control, auditability, ownership, and cost.
Frontier models give you a clip. This gives you a studio.
Pre-rendered reels from the live pipeline
Each reel is an actual mp4 produced end-to-end by the pipeline on the MI300X droplet — one prompt in, finished reel out. No human selected or trimmed shots. The vision critic ran on every clip.
San Francisco walk - golden hour to blue hour
Logline. A young woman walks alone down a steep Pacific Heights street, past painted Victorians and rolling fog, to a quiet overlook of the Golden Gate Bridge as the light shifts to blue hour.
Prompt.
30-second cinematic reel: a young woman walks alone through San Francisco at golden hour - down a steep Pacific Heights street with bay views, past painted Victorian houses, fog rolling in over the Pacific, ending at a quiet overlook of the Golden Gate Bridge as the light shifts to blue hour
Music. intimate ambient piano with a soft synth pad, 75 BPM, contemplative
Voice-over. American English (Director picked from setting)
Render time. 81 min on 1× MI300X
The pipeline
Eight stages run sequentially on one GPU. Each model loads, runs, unloads — making room for the next. No multi-GPU magic, no separate inference servers, no LoRA training step.
1. Director Agent (Qwen3.5-35B-A3B). Plans 6 cinematic shots with character portraits, music brief, voice-over script and language tag. Same checkpoint doubles as the vision critic in stage 5.
2. Character masters (FLUX.2 [klein]). One canonical image per character + an ABC group composition. These pin identity for every downstream shot.
3. Per-shot keyframes (FLUX.2 [klein] reference edit). Master image goes in as conditioning, shot prompt drives the edit. Identity is preserved by construction — no LoRA training, no per-character setup.
4. Animation (Wan2.2-I2V-A14B). Dual-expert MoE diffusion, 121 frames at 24 fps. ParaAttention FBCache 2× lossless + selective torch.compile on transformer_2 (1.2× compile win).
5. Vision critic (Qwen3.5-35B-A3B). Grades each clip on character_match, scene_match, composition, artifact_free. Below 7/10 → re-render with a bumped seed (max 3 attempts).
6. Music (ACE-Step v1). Audio diffusion produces a 30-second instrumental matching the Director's brief (BPM, mood, instrumentation, no-drums hint).
7. Voice-over (Kokoro-82M). Director picks the language to match the setting (Tokyo→ja, Paris→fr, Mumbai→hi, ...). Script is written in that language, not translated.
8. Final mix. Six clips concatenated, upscaled to 1280×704, audio loudness-normalised; output is a single mp4.
Why research-driven prompts?
The Director's planner and the vision critic system prompts aren't folklore. They distill 16 sources (Alibaba's official Wan2.2 system prompts, the official prompt rewriter, ComfyUI community guides, InstaSD's controlled camera tests, HuggingFace Forums) into hard rules:
- Verbatim Chinese trained negative prompt from `shared_config.py` — umT5 was multilingual-pretrained against those exact tokens; the English translation is observably weaker.
- Positive boundary sentences instead of "EXACTLY N people" — umT5 doesn't ground numerics; Wan2.2 distorts the crowd trying to enforce a count.
- Lens / film tags (`Arri Alexa, anamorphic, 35mm film grain`) instead of `cinematic` — that word triggers Wan2.2's stylization branch and gives the AI look.
- Sentence-case motion verbs described as a process, not ALL-CAPS shouting. The all-caps trick is community folklore with no documented support; Alibaba's own examples use lowercase.
- One camera verb per shot, placed first — multiple verbs in one sentence ("dolly in tracking tilt up") cancel each other out.
Full research write-up lives in the GitHub repo (research/wan22_prompting.md).
The self-correcting render loop
Most generative video pipelines render once and pray. This one re-checks every clip with a 35-billion-parameter vision model, scores it on four 1–10 axes, and re-renders if it fails. The same Qwen3.5-35B that planned the story now grades it.
The critic returns four scores (character_match, scene_match, composition, artifact_free) plus a list of structured failure labels. The labels are machine-readable and feed back into the planner's retry strategy.
Up to three attempts per shot. After that, the best-scoring attempt ships and the issue list goes into the run log. The pipeline is self-correcting, not blind.
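A minimal sketch of that loop, with `render_clip` and `critique` as stand-ins for the Wan2.2 render and the Qwen critic call (the names are illustrative, not the project's actual functions):

```python
def render_with_critic(shot, render_clip, critique, base_seed=0,
                       threshold=7, max_attempts=3):
    """Score-gated retry: re-render with a bumped seed until the critic's
    overall score clears the threshold, else ship the best attempt.
    render_clip(shot, seed) -> clip path; critique(clip, shot) -> verdict dict."""
    best_clip, best_verdict = None, {"overall": -1}
    for attempt in range(max_attempts):
        clip = render_clip(shot, seed=base_seed + attempt)   # bumped seed per retry
        verdict = critique(clip, shot)                       # four 1-10 axes + issue labels
        if verdict["overall"] > best_verdict["overall"]:
            best_clip, best_verdict = clip, verdict
        if verdict["overall"] >= threshold:
            return clip, verdict
    return best_clip, best_verdict  # below threshold: best attempt ships, issues hit the run log
```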
Real verdicts pulled from the run logs
These are actual JSON returns from Qwen3.5-35B critiquing real Wan2.2 clips on this pipeline. The labels feed back into the planner's retry strategy.
{ "shot": 0, "attempt": 1, "score": {
"character_match": 9, "scene_match": 8, "composition": 9, "artifact_free": 7,
"issues": ["STYLIZED_AI_LOOK: skin texture appears slightly plastic/smooth in close-up frames 1-2",
"OBJECT_MORPHING: background bridge structure shifts from Golden Gate to a generic suspension bridge mid-clip"],
"overall": 8 }}
{ "shot": 2, "attempt": 1, "score": {
"character_match": 10, "scene_match": 10, "composition": 10, "artifact_free": 9,
"issues": [],
"overall": 10 }}
{ "shot": 3, "attempt": 2, "score": {
"character_match": 4, "scene_match": 3, "composition": 2, "artifact_free": 5,
"issues": ["CHARACTER_DRIFT: Subject identity changes completely in final frame from long-haired woman in trench coat to bob cut and turtleneck",
"SCENE_MISMATCH: Golden Gate Bridge vanishes in Frame 3, replaced by generic city street",
"CAMERA_IGNORED: Prompt requested 'static camera' but subject rotates 180 degrees and camera zooms",
"STYLIZED_AI_LOOK: Frame 4 plastic skin texture and oversaturated bokeh"],
"overall": 3 }}
The 10/10 was the awning two-shot of Kenji + Mei in v22 - identity locked,
no extras, lighting matches, no STYLIZED_AI_LOOK even at this resolution.
The 3/10 was the Golden Gate Bridge overlook - Wan2.2 can't reliably render
that landmark, drifts to generic suspension bridges. After 3 attempts the
pipeline ships the best one and logs the issues.
Acceleration on AMD MI300X
Cumulative end-to-end speedup: 2.5× lossless vs unoptimised Wan2.2 — 25.9 min → 10.4 min per 720p clip.
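For reference, FBCache is normally switched on through ParaAttention's diffusers adapter. A sketch, assuming the `apply_cache_on_pipe` entry point covers the Wan pipeline as the write-up above implies:

```python
import torch
from diffusers import WanImageToVideoPipeline
from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")  # ROCm exposes the MI300X through the same cuda device strings

# residual_diff_threshold is the FBCache knob in the presets below:
# 0.05 is the lossless setting; 0.08 / 0.10 trade quality for speed.
apply_cache_on_pipe(pipe, residual_diff_threshold=0.05)
```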
Knob presets (config.py)
| preset | num_frames | fps | hero / b-roll steps | FBCache | critic | est. minutes for 30s reel |
|---|---|---|---|---|---|---|
| default | 121 | 24 | 30 / 24 | 0.05 (lossless) | 7/10, 3 attempts | ~50-65 |
| cinematic | 121 | 24 | 30 / 24 | 0.05 | 7/10, 3 attempts | ~50-65 |
| fast | 97 | 24 | 20 / 18 | 0.08 | 6/10, 2 attempts | ~32-40 |
| draft | 81 | 24 | 14 / 14 | 0.10 | 5/10, 1 attempt | ~22-28 |
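Expressed as code, the table above might map to a config.py along these lines (a hypothetical mirror; field names are guesses, not the repo's actual schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Preset:
    num_frames: int
    fps: int
    hero_steps: int        # denoising steps for hero shots
    broll_steps: int       # denoising steps for b-roll shots
    fbcache: float         # FBCache residual-diff threshold (0.05 = lossless)
    critic_threshold: int  # minimum overall critic score out of 10
    max_attempts: int

PRESETS = {
    "default":   Preset(121, 24, 30, 24, 0.05, 7, 3),
    "cinematic": Preset(121, 24, 30, 24, 0.05, 7, 3),
    "fast":      Preset(97,  24, 20, 18, 0.08, 6, 2),
    "draft":     Preset(81,  24, 14, 14, 0.10, 5, 1),
}
```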
STUDIOMI_AITER_FP8=1 is a separate env switch; documented but disabled by
default until ROCm/aiter#2187 closes for the multi-shape Wan2.2 case.
What didn't work (and why)
| Tried | Result | Reason |
|---|---|---|
| MagCache via diffusers 0.38 hooks | dead, calibration empty | dual-transformer step counting confuses `_perform_calibration_step` |
| cache-dit DBCache + TaylorSeer | 22.87 min (slower than baseline) | TaylorSeer adds ~6 min on ROCm; cache-dit's L20 numbers don't reproduce |
| AITER FA3 `set_attention_backend("flash")` | hung 9+ min at step 0 | JIT compile for 81×1280×704 sequence never finishes |
| `guidance_scale_2=1.0` (skip CFG on low-noise) | 10.35 vs 10.36 min | diffusers WanPipeline doesn't actually short-circuit at boundary |
| `torch.compile(mode="max-autotune", fullgraph=True)` | crash | Dynamo error on Wan2.2 (diffusers#12728) |
| `to(memory_format=torch.channels_last)` on transformer_2 | RuntimeError | Wan2.2 transformer is rank-5 (B,C,F,H,W); channels_last is rank-4 only |
| AITER FP8 (gemm_a8w8, gemm_a8w8_CK) | segfault mid-pipeline | AITER#2187 multi-shape crash; standalone shape works on ROCm 7.2, pipeline composition does not |
Field journal
A subset of failures, root causes and fixes from May 6–10, 2026. These are the stories that don't show up in commit messages — the ones where the Wan2.2 prompt did something genuinely surprising, or where a kernel decided to disagree with the docs.
Full incident log is in incidents.md in the GitHub repo.
How the Director thinks
The Director Agent (Qwen3.5-35B-A3B via vLLM) doesn't just write a description. It returns a structured 6-shot plan with named characters, per-shot prompts (written in Wan2.2-friendly language: camera verb first, sentence-case motion, positive boundary phrases), a music brief, a per-shot voice-over array, and the language to narrate in.
{
"characters": {
"A": "Aiko (slim Japanese woman, 27, jet-black chin-length bob, ...)",
"B": "Kenji (Japanese man, 28, tall and lean, ...)",
"C": "Mei (Japanese woman, 26, shoulder-length lavender hair, ...)"
},
"story_logline": "Aiko walks alone through neon-lit Tokyo and reunites with two friends",
"shots": [
{
"index": 0, "is_hero": true, "shot_type": "Wide tracking",
"dominant_subject": "A", "cut": true,
"prompt": "Tracking shot following from behind at hip level. Aiko (slim Japanese woman, 27, jet-black bob, mustard yellow vinyl raincoat) walks down the center of the wet street, head turning slightly. Distant pedestrians stay blurred. Light rain falls steadily, neon signs flicker. shot on Arri Alexa, anamorphic, 35mm film grain, photorealistic"
},
"... 5 more shots ..."
],
"music_style": "intimate ambient piano with warm pad and soft synth bell, 75 BPM, melancholic but hopeful, no drums",
"vo_script_per_shot": [
"She had been walking alone for too long.",
"Tonight, the city felt softer.",
"Two figures waited under an awning.",
"She broke into a quick walk.",
"Their arms found hers.",
"Some places only feel like home because of who is standing in them."
],
"vo_lang": "j"
}
The exact same character description string repeats verbatim in every shot that character appears in — token-level consistency that acts like a character LoRA without the LoRA training.
Six-shot story arc template
| Shot | Role | Cut |
|---|---|---|
| 0 | Hero wide establishing - all main characters visible | true |
| 1 | Setup - protagonist's intent or POV moves the story forward | false |
| 2 | Other element - secondary character solo or detail insert | true if scene changes |
| 3 | Climax - two-character moment or A-with-OBJECT | false |
| 4 | Static medium close-up - face anchor, reduces drift accumulation | false |
| 5 | Closing wide - scene fades or A walks away | false or true |
Voice-over languages (Kokoro-82M)
Director picks the language that matches the setting. Tokyo scene → Japanese, Paris → French, Mumbai → Hindi, Rio → Brazilian Portuguese, anywhere else → American English.

| Code | Language | Default voice |
|---|---|---|
| `a` | American English | af_heart |
| `b` | British English | bf_emma |
| `e` | Spanish | ef_dora |
| `f` | French | ff_siwis |
| `h` | Hindi | hf_alpha |
| `i` | Italian | if_sara |
| `j` | Japanese | jf_alpha |
| `p` | Brazilian Portuguese | pf_dora |
| `z` | Mandarin Chinese | zf_xiaobei |
The vo_script_per_shot array is one line per shot, 6-10 words each (~3-4 seconds
of TTS at 150 wpm). Each Kokoro WAV gets layered onto the music bed at
i * 5.04 s offset via ffmpeg adelay, so the narration lands when the
visual beat lands - no description before or after the action.
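A sketch of that mixing step using ffmpeg's adelay and amix filters (file names are illustrative; the pipeline's actual mixer may differ):

```python
import subprocess

def mix_narration(music_wav, vo_wavs, out_wav, shot_len=5.04):
    """Delay each per-shot VO by i * shot_len seconds, then mix over the music bed."""
    inputs, filters, labels = ["-i", music_wav], [], []
    for i, vo in enumerate(vo_wavs):
        inputs += ["-i", vo]
        ms = round(i * shot_len * 1000)                       # 0, 5040, 10080, ...
        filters.append(f"[{i + 1}:a]adelay={ms}|{ms}[v{i}]")  # delay both channels
        labels.append(f"[v{i}]")
    filters.append("[0:a]" + "".join(labels)
                   + f"amix=inputs={len(vo_wavs) + 1}:duration=first[out]")
    subprocess.run(["ffmpeg", "-y", *inputs, "-filter_complex", ";".join(filters),
                    "-map", "[out]", out_wav], check=True)
```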
Live API server
The pipeline ships as a FastAPI server with an asyncio.Lock backing a strict-FIFO single-GPU queue. SSE event stream + per-artifact endpoints let a frontend render the pipeline phases as they happen, instead of waiting 45 minutes for one mp4.
# on your MI300X droplet
STUDIO_API_TOKEN=secret uvicorn server:app --host 0.0.0.0 --port 8000
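The queueing idea in miniature: an asyncio.Lock serializes GPU work while FastAPI keeps accepting jobs. This is a sketch of the pattern, not the project's server.py.

```python
import asyncio
import uuid

from fastapi import FastAPI

app = FastAPI()
gpu_lock = asyncio.Lock()       # CPython wakes lock waiters in FIFO order
jobs: dict[str, dict] = {}

@app.post("/jobs")
async def submit(body: dict):
    job_id = uuid.uuid4().hex[:12]
    jobs[job_id] = {"status": "queued", "prompt": body["prompt"]}
    jobs[job_id]["task"] = asyncio.create_task(run_job(job_id))  # returns at once
    return {"job_id": job_id, "status": "queued"}

async def run_job(job_id: str):
    async with gpu_lock:        # one job owns the GPU at a time
        jobs[job_id]["status"] = "running"
        # ... the eight pipeline stages would run here, emitting SSE events ...
        jobs[job_id]["status"] = "completed"
```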
Submit a job
curl -X POST https://your-droplet:8000/jobs \
-H "X-API-Token: secret" \
-H "Content-Type: application/json" \
-d '{"prompt": "30s reel: a violinist plays in a Brooklyn subway station at midnight, golden hour light through the platform windows", "use_critic": true}'
# -> {"job_id": "a3f9c1d2b6e8", "status": "queued"}
Watch it happen
curl -N https://your-droplet:8000/jobs/a3f9c1d2b6e8/stream
# (SSE stream)
data: {"stage":"started","ts":1778425000.1,"prompt":"30s reel: ..."}
data: {"stage":"plan_starting","ts":1778425000.5}
data: {"stage":"plan_ready","ts":1778425245.3,"logline":"...","n_shots":6,"characters":["A"],"music_style":"...","shots":[{...}]}
data: {"stage":"master_ready","ts":1778425248.1,"name":"A","path":"...master_A.png","seconds":7.8}
data: {"stage":"keyframe_ready","ts":1778425250.0,"shot":0,"path":"...keyframe_00.png"}
data: {"stage":"clip_started","ts":1778425251.2,"shot":0,"attempt":1,"flow_shift":5.0,"n_steps":30,"flf2v":true}
data: {"stage":"clip_rendered","ts":1778425759.6,"shot":0,"path":"...clip_00.mp4","minutes":8.47}
data: {"stage":"critic_starting","ts":1778425760.1,"shot":0,"frames":[...]}
data: {"stage":"critic_verdict","ts":1778425853.4,"shot":0,"score":{"character_match":8,"scene_match":9,"composition":9,"artifact_free":7,"issues":["STYLIZED_AI_LOOK: ..."],"overall":8}}
data: {"stage":"clip_passed","ts":1778425881.0,"shot":0,"attempts":1,"score":{...}}
data: {"stage":"music_starting","ts":1778428100.0,"style":"..."}
data: {"stage":"music_ready","ts":1778428170.4,"path":"...music.wav"}
data: {"stage":"vo_chunk_ready","ts":1778428172.1,"shot":0,"path":"...vo_00.wav","seconds":3.4,"text":"..."}
data: {"stage":"mix_done","ts":1778428180.0,"path":"...reel_final.mp4"}
data: {"stage":"completed","ts":1778428180.5,"final":"...reel_final.mp4"}
Per-artifact endpoints
While the job runs, fetch any artifact that's already on disk:
| Endpoint | Returns |
|---|---|
| `GET /jobs/{id}` | full status meta with latest event |
| `GET /jobs/{id}/events` | full jsonl event history |
| `GET /jobs/{id}/plan` | director's plan_expanded.json |
| `GET /jobs/{id}/master/{A,B,C,ABC,scene}` | a master keyframe png |
| `GET /jobs/{id}/keyframe/{0..5}` | a per-shot keyframe png |
| `GET /jobs/{id}/clip/{0..5}` | a per-shot mp4 (silent, 5 sec) |
| `GET /jobs/{id}/music` | the 30-second music wav |
| `GET /jobs/{id}/vo/{0..5}` | a per-shot voice-over wav |
| `GET /jobs/{id}/video` | final mixed reel mp4 (404 while running) |
`GET /jobs` returns the most recent 50 jobs. `GET /health` is auth-free for status.
Python client snippet
import requests
import sseclient  # needs the sseclient-py package: SSEClient(resp).events()
API = "https://your-droplet:8000"
H = {"X-API-Token": "secret"}
job = requests.post(f"{API}/jobs", headers=H, json={
"prompt": "30s reel: a cellist on a Brooklyn fire escape at sunset",
"use_critic": True,
}).json()
resp = requests.get(f"{API}/jobs/{job['job_id']}/stream", headers=H, stream=True)
for ev in sseclient.SSEClient(resp).events():
print(ev.data)
Multi-GPU routing
Each pipeline stage can pin to its own device via env vars (defaults to cuda:0):
STUDIOMI_GPU_FLUX=cuda:1 \
STUDIOMI_GPU_WAN=cuda:0 \
STUDIOMI_GPU_ACE=cuda:1 \
STUDIOMI_GPU_TTS=cuda:1 \
uvicorn server:app --host 0.0.0.0 --port 8000
On 2× MI300X you can render the next reel's plan on card 1 while card 0 still animates the current reel. Tested on a single-MI300X rig - the 2-card setup is designed but not yet validated.
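A sketch of how per-stage routing like this can be resolved (the env var names match the ones above; the helper itself is illustrative):

```python
import os

import torch

def stage_device(stage: str) -> torch.device:
    """Map STUDIOMI_GPU_<STAGE> to a device; ROCm exposes HIP GPUs as cuda:N."""
    return torch.device(os.environ.get(f"STUDIOMI_GPU_{stage}", "cuda:0"))

# e.g. wan_pipe.to(stage_device("WAN")); flux_pipe.to(stage_device("FLUX"))
```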
The stack — every model is permissively licensed
Every output is yours to use commercially.
| Stage | Model | Size | License |
|---|---|---|---|
| Planner & Critic | Qwen3.5-35B-A3B | 35B params (3B active) | Apache 2.0 |
| Image (keyframes) | FLUX.2 [klein] 4B | 4B params | Apache 2.0 |
| Video | Wan2.2-I2V-A14B | A14B (dual-expert MoE) | Apache 2.0 |
| Music | ACE-Step v1 | 3.5B params | Apache 2.0 |
| Voice-over | Kokoro-82M | 82M, 9 languages | Apache 2.0 |
| LLM serving | vLLM | — | Apache 2.0 |
| Diffusion cache | ParaAttention FBCache | — | Apache 2.0 |
| AMD kernels | AITER | — | MIT |
| Project code | StudioMI300 | — | MIT |
Why a single MI300X
192 GB HBM3 is overkill for any single model in this stack. The point is sequential diversity — the same card runs four very different model architectures back-to-back in one reel, with no offload to disk in between.
| Phase | VRAM peak | Compute pattern |
|---|---|---|
| 1. Director planning | ~70 GB BF16 | Qwen3.5-35B MoE LLM decode (vLLM + AITER MoE) |
| 2. Character masters | ~8 GB | FLUX.2 [klein] 4B diffusion transformer, 4 steps |
| 3. Wan2.2 animation | ~94 GB BF16 | Dual-expert MoE diffusion, 121 frames |
| 4. Vision critic | ~70 GB BF16 | Qwen3.5-35B re-loaded, vision-conditioned |
| 5. Music | ~12 GB | ACE-Step v1 audio diffusion, 27 steps |
| 6. Voice-over | < 1 GB | Kokoro-82M TTS, fits anywhere |
The ROCm allocator caches ~30 GB on top of any active model. With careful unload
and torch.cuda.empty_cache() between stages, all phases fit on the same 192 GB
card. On a 24 GB consumer GPU you'd need 4–5 separate machines wired together
just to host all of this.
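The load / run / unload discipline the table implies, as a minimal sketch (the stage callables are placeholders):

```python
import gc

import torch

def run_stage(load_model, run, *args):
    """Load one model, run its stage, then hand the VRAM back so the next
    (very different) architecture fits on the same 192 GB card."""
    model = load_model()
    try:
        return run(model, *args)
    finally:
        del model
        gc.collect()
        torch.cuda.empty_cache()  # release the allocator's cached blocks between stages
```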
That's the project's central constraint and its main flex on AMD's headline GPU.
Run it on your own MI300X
A 30-second reel takes ~45 minutes on one MI300X. That's too long for a casual visitor on a public Space, so this Space hosts only the showcase. To run the full pipeline yourself:
- Get an AMD MI300X (e.g. AMD Developer Cloud — $100 starting credits via the AMD AI Developer Program).
- Pull the `rocm/vllm-dev` container.
- Clone the repo and run:
python generate.py \
--prompt "a cellist plays in a Brooklyn subway station at midnight" \
--out outputs/my_reel \
--critic
Walk away for ~45 minutes. The pipeline plans, paints, animates, scores music, narrates and mixes — all autonomously. No prompt engineering per shot, no model swapping, no manual stitching.
→ Full code on GitHub
amd-hackathon-2026