T2V · X-ARC

Constraint

For a small studio, producing a single educational video took four to five roles and one to two weeks of human effort. Off-the-shelf tools could write a script. Off-the-shelf tools could generate images. None of them was a pipeline. Each tool solved a slice; nothing tied them together into a system that ran the work without a human between every step.

First principles

A video is a sequence of decisions made under attention constraints. Which topic. Which hook. Which line of the script. Which frame for which beat. Which thumbnail. Each decision is small and reversible. The problem is not that any one decision is hard. The problem is that there are hundreds of them, and a human in the loop at each one makes throughput scale linearly with headcount.

If the decisions can be made automatically against a fixed quality bar, and the failures are cheap to retry, throughput decouples from headcount. Cheap retries are the unlock.
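A minimal sketch of what "cheap retries against a fixed quality bar" could look like in code. The names `retry_until_accepted`, `generate`, `score`, and `quality_bar` are hypothetical and chosen for illustration; the actual T2V validators and thresholds are not described on this page.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Attempt(Generic[T]):
    value: Optional[T]  # accepted candidate, or None if every try failed
    tries: int          # how many candidates were generated
    accepted: bool      # True if a candidate met the quality bar


def retry_until_accepted(
    generate: Callable[[], T],
    score: Callable[[T], float],
    quality_bar: float,
    max_tries: int = 5,
) -> Attempt[T]:
    """Generate candidates until one scores at or above the bar.

    The retry budget is capped so a stage that keeps failing is handed
    back to a human instead of looping forever.
    """
    for tries in range(1, max_tries + 1):
        candidate = generate()
        if score(candidate) >= quality_bar:
            return Attempt(candidate, tries, True)
    return Attempt(None, max_tries, False)


# Usage with stand-in scores in place of a real generator and judge.
fake_scores = iter([0.2, 0.5, 0.9])
result = retry_until_accepted(lambda: next(fake_scores), lambda s: s, quality_bar=0.8)
```

The point of the cap is that a failed decision costs a bounded number of retries, not a person's attention.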

Pipeline

The operator triggers a new production from Telegram. The pipeline runs to completion and returns the cut for review.

01

Topic

Audience-scoped angle generation under a strict topic boundary. Off-target topics are rejected at brief stage.

02

Script

Full script with retention architecture. First ten seconds engineered to hold attention. Concreteness rule: every claim has a specific example.

03

Voice

TTS pass against the script. Voice timing locked before any visuals render.

04

Visuals

Storyboard derived from the script audio. Grid-based batch generation, followed by validated selection. Frames that miss the visual standard are regenerated.

05

Edit

Audio, visuals, and timing compiled through ffmpeg 8.0.1 into the final cut. Three thumbnail options ranked.

06

Publish

YouTube-optimised metadata produced. Uploaded as a private draft for operator review and publication.
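The stages above lock voice timing first and let it drive the visuals, with ffmpeg compiling the result. One way this could be wired, as a sketch only: each storyboard beat carries its start and end time in the voice track, the beats are written out as an ffmpeg concat-demuxer list, and a single ffmpeg call muxes frames and audio. The `Beat` type, the file names, and the codec choices are assumptions for illustration, not the pipeline's actual settings.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Beat:
    image_path: str  # selected frame for this beat (hypothetical path)
    start: float     # seconds into the locked voice track
    end: float


def concat_list(beats: List[Beat]) -> str:
    """Render beats as an ffmpeg concat-demuxer list file."""
    lines = []
    for beat in beats:
        lines.append(f"file '{beat.image_path}'")
        lines.append(f"duration {beat.end - beat.start:.3f}")
    # The concat demuxer tends to ignore the duration of the final entry,
    # so the last image is commonly listed a second time.
    if beats:
        lines.append(f"file '{beats[-1].image_path}'")
    return "\n".join(lines) + "\n"


def ffmpeg_argv(list_path: str, voice_path: str, out_path: str) -> List[str]:
    """Build the ffmpeg command line; nothing is executed here."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0", "-i", list_path,  # timed still frames
        "-i", voice_path,                                # locked voice track
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-shortest",                                     # stop at the shorter input
        out_path,
    ]


beats = [Beat("beat_01.png", 0.0, 2.5), Beat("beat_02.png", 2.5, 4.0)]
listing = concat_list(beats)
argv = ffmpeg_argv("frames.txt", "voice.wav", "cut.mp4")
```

Writing `listing` to `frames.txt` and passing `argv` to `subprocess.run` would produce the cut. Paths containing single quotes would need escaping, which this sketch does not handle.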

Frame-chaining

The experimental motion track runs Seedance 2.0 plus GPT Image 2 with explicit frame-chaining. Chunk N's end frame becomes chunk N+1's start frame. This is what avoids the identity drift that killed earlier face-based attempts. Gemini 3 Pro runs as the video judge against the rendered output.
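The chaining rule can be sketched in a few lines. Here `generate_chunk` is a hypothetical stand-in for whatever call renders one motion chunk and returns a reference to its final frame; it is not the Seedance or GPT Image API, and `first_frame` is likewise an assumed input.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Chunk:
    prompt: str
    start_frame: Optional[str]  # frame the chunk was conditioned on
    end_frame: str              # last frame of the rendered chunk


# Hypothetical signature: (prompt, start_frame) -> end_frame reference.
GenerateChunk = Callable[[str, Optional[str]], str]


def chain_chunks(
    prompts: List[str],
    generate_chunk: GenerateChunk,
    first_frame: Optional[str] = None,
) -> List[Chunk]:
    """Render chunks in order, seeding each one with the previous end frame."""
    chunks: List[Chunk] = []
    start = first_frame
    for prompt in prompts:
        end = generate_chunk(prompt, start)
        chunks.append(Chunk(prompt, start, end))
        start = end  # chunk N's end frame becomes chunk N+1's start frame
    return chunks


# Usage with a fake generator that only labels frames.
fake = lambda prompt, start: f"{prompt}:end"
chain = chain_chunks(["intro", "demo", "outro"], fake, first_frame="keyframe_0")
```

Because each chunk is conditioned on a frame that was actually rendered, not on a fresh description of the subject, appearance has less room to wander between chunks.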

Economics

$13 per finished video at current draw. The cost-down work targets $6 by replacing two paid steps with in-house alternatives. The numbers matter because they decide what is worth doing at all.
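For scale, the arithmetic behind the target, using only the two figures stated above. The helper name `cost_down` is hypothetical.

```python
def cost_down(current: float, target: float) -> tuple:
    """Return (absolute saving per video, saving as a fraction of current cost)."""
    saving = current - target
    return saving, saving / current


saving, fraction = cost_down(13.0, 6.0)
print(f"${saving:.0f} saved per video, {fraction:.1%} below current cost")
# prints: $7 saved per video, 53.8% below current cost
```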

Where it came from

T2V was built for the lab's own marketing surface. The first autonomous video shipped on 2026·03·24 with a running time of 8:49. The channel has been running continuously since then. The case study on the channel is documented under deployments.

Contact

If something on this page is relevant to work you are running, write to us. The form is on the landing page. We reply within two working days.
