Supernal Family

Wise Songs — Video Pipeline Spec

Two parallel tracks. Track A is our own sequential scene pipeline (better visual coherence than Revid, cheaper, full control). Track B is Revid automation (for movie-mode content where Revid genuinely wins).


Track A — Sequential Scene Pipeline (--level scene)

The Core Idea

Every existing tool (Revid included) generates each scene image independently from a text prompt. This produces visual incoherence: different styles, random palettes, jarring cuts.

Our approach: img2img chaining. Each scene is generated from the previous scene using a new prompt at a controlled strength. The result is a visually continuous world that evolves — same palette, same style, same general aesthetic — while the content changes per verse.

Revid cannot do this. It is a genuine differentiator.
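
The chaining idea above can be sketched independently of any image provider. This is a minimal illustration, not the pipeline's actual code: `chain_scenes`, `text_to_image` and `img_to_img` are hypothetical names, and the two callables stand in for whatever model calls are used.

```python
from typing import Callable, List


def chain_scenes(
    prompts: List[str],
    text_to_image: Callable[[str], str],
    img_to_img: Callable[[str, str, float], str],
    strength: float = 0.55,
) -> List[str]:
    """Return one image reference per prompt, each derived from the one before."""
    if not prompts:
        return []
    # Scene 0 has no predecessor, so it is generated from text alone.
    images = [text_to_image(prompts[0])]
    for prompt in prompts[1:]:
        # Feed the previous output back in so palette and style carry forward.
        images.append(img_to_img(images[-1], prompt, strength))
    return images
```

Because each scene depends on the previous one, the chain is inherently sequential; scenes within a song cannot be generated in parallel, although separate songs can.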


Pipeline Steps

1. INGEST
   song_data (id, title, lyrics, category, style_prompt, image_url)
   mp3 path

2. VISUAL WORLD GENERATION  [GPT-4o, ~$0.002/song]
   Input: title + full lyrics + category + style_prompt
   Output (JSON):
     {
       "art_style": "warm watercolor, amber and indigo palette, soft rounded edges",
       "setting": "forest clearing, ancient library, ocean horizon...",
       "mood": "hopeful, tense, playful",
       "character_anchors": "small fox with orange scarf, wise owl in glasses",
       "scenes": [
         {
           "verse_index": 0,
           "section": "verse",
           "cinematic_description": "Wide shot: fox standing at crossroads, golden light...",
           "transition_to_next": "dissolve"
         },
         ...
       ]
     }
   Prompt template:
     "You are a visual director. Given these song lyrics, produce a JSON visual world for a
      music video. Art style should match the category ({category}). Output valid JSON only."
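
Since the model is asked for "valid JSON only", it may help to validate the reply before spending money on images. The sketch below assumes the field names shown in the sample output; the `Scene` and `VisualWorld` classes and `parse_visual_world` are hypothetical helpers, and defaulting a missing transition to "fade" mirrors the fallback rule in step 4.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Scene:
    verse_index: int
    section: str
    cinematic_description: str
    transition_to_next: str = "fade"


@dataclass
class VisualWorld:
    art_style: str
    setting: str
    mood: str
    character_anchors: str
    scenes: List[Scene]


REQUIRED_KEYS = ("art_style", "setting", "mood", "character_anchors", "scenes")


def parse_visual_world(raw: str) -> VisualWorld:
    """Parse the model's reply, raising ValueError if it is unusable."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"visual world missing keys: {missing}")
    if not data["scenes"]:
        raise ValueError("visual world has no scenes")
    scenes = [
        Scene(
            verse_index=int(s["verse_index"]),
            section=s["section"],
            cinematic_description=s["cinematic_description"],
            transition_to_next=s.get("transition_to_next", "fade"),
        )
        for s in data["scenes"]
    ]
    return VisualWorld(
        art_style=data["art_style"],
        setting=data["setting"],
        mood=data["mood"],
        character_anchors=data["character_anchors"],
        scenes=scenes,
    )
```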

3. IMAGE GENERATION  [Replicate, ~$0.003–$0.25/song depending on scene count]
   Scene 0:
     FLUX-schnell text-to-image
     prompt = "{art_style}, {setting}, {scene[0].cinematic_description}, no text, 16:9"
     cost: $0.003

   Scene N (for N > 0):
     FLUX-dev img2img
     image = scene[N-1] output URL (pass directly to Replicate, no re-download needed)
     prompt = "{art_style}, {scene[N].cinematic_description}, no text, 16:9"
     prompt_strength = 0.55  (tune per category: fables 0.6, GRE 0.45, cerebral 0.65)
     cost: $0.025/image

   Total cost example (7 scenes): $0.003 + 6 × $0.025 = $0.153/song
   Total cost example (5 scenes): $0.003 + 4 × $0.025 = $0.103/song
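
The two prompt templates in step 3 could be assembled as follows. This is a sketch under assumptions: `scene_request` is a hypothetical helper, the model slugs are the ones published on Replicate, and the input field names (`aspect_ratio`, `image`, `prompt_strength`) should be checked against each model's current schema.

```python
from typing import Dict, Optional, Tuple

TEXT_TO_IMAGE_MODEL = "black-forest-labs/flux-schnell"  # scene 0
IMG_TO_IMG_MODEL = "black-forest-labs/flux-dev"         # scenes 1..N


def scene_request(
    art_style: str,
    setting: str,
    description: str,
    prev_image_url: Optional[str] = None,
    prompt_strength: float = 0.55,
) -> Tuple[str, Dict[str, object]]:
    """Return (model, input payload) for one scene."""
    if prev_image_url is None:
        # Scene 0: text-to-image, so the setting is spelled out in the prompt.
        prompt = f"{art_style}, {setting}, {description}, no text, 16:9"
        return TEXT_TO_IMAGE_MODEL, {"prompt": prompt, "aspect_ratio": "16:9"}
    # Scene N > 0: the previous image carries the setting; only style and
    # the new description go into the prompt.
    prompt = f"{art_style}, {description}, no text, 16:9"
    return IMG_TO_IMG_MODEL, {
        "prompt": prompt,
        "image": prev_image_url,
        "prompt_strength": prompt_strength,
    }
```

With the `replicate` Python client, each pair would typically be passed as `replicate.run(model, input=payload)`; that call needs network access and an API token, so it is left out of the sketch.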

4. VIDEO ASSEMBLY  [ffmpeg + moviepy, free]
   Per image: Ken Burns effect (zoompan filter)
     - Alternate zoom-in / zoom-out per scene
     - Pan direction varies: center, left-drift, right-drift, diagonal
     - Duration = verse duration + 1.5s overlap for crossfade

   Transitions (xfade filter, 1.0s):
     Selected per scene pair from visual world "transition_to_next" field:
       verse→verse:   "fade" or "dissolve"
       verse→chorus:  "wipeleft" or "circleopen"
       chorus→verse:  "wiperight" or "fadeblack"
       bridge/outro:  "pixelize" or "radial"
     Fallback: "fade" for any unrecognised type

   ffmpeg's xfade filter supports 40+ transition types, so this mapping can be extended later.
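
Chaining xfade across more than two clips requires an offset for each transition, measured on the running output. The sketch below builds the filtergraph string only; `pick_transition` and `xfade_filtergraph` are hypothetical helper names, and the clips are assumed to be separate ffmpeg inputs in scene order.

```python
from typing import List

# Transition names from the mapping above.
KNOWN_TRANSITIONS = {
    "fade", "dissolve", "wipeleft", "circleopen",
    "wiperight", "fadeblack", "pixelize", "radial",
}


def pick_transition(requested: str) -> str:
    """Fall back to "fade" for any unrecognised type."""
    return requested if requested in KNOWN_TRANSITIONS else "fade"


def xfade_filtergraph(
    durations: List[float], transitions: List[str], fade: float = 1.0
) -> str:
    """durations[i] is clip i's length in seconds; transitions[i] joins clip i to i+1."""
    if len(durations) < 2 or len(transitions) != len(durations) - 1:
        raise ValueError("need N >= 2 clips and exactly N - 1 transitions")
    parts = []
    prev_label = "[0:v]"
    offset = 0.0
    for i, name in enumerate(transitions):
        # Each crossfade starts `fade` seconds before the running output ends.
        offset += durations[i] - fade
        last = i == len(transitions) - 1
        out_label = "[vout]" if last else f"[v{i + 1}]"
        parts.append(
            f"{prev_label}[{i + 1}:v]xfade=transition={pick_transition(name)}"
            f":duration={fade:g}:offset={offset:g}{out_label}"
        )
        prev_label = out_label
    return ";".join(parts)
```

The resulting string would be passed to ffmpeg via `-filter_complex`, with `[vout]` mapped as the video output. xfade expects all inputs to share resolution, pixel format and frame rate, so the Ken Burns clips should be rendered with identical settings.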

5. KARAOKE TEXT OVERLAY  [Whisper + PIL + moviepy, ~$0.006/min audio]
   Whisper word timestamps → yellow highlight on active word
   Verse text fades in/slides up at segment boundary (0.18s)
   Semi-transparent text panel so scene imagery shows through
   Font: Impact 72px for lyrics, 44px for title bar
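
The highlight logic reduces to finding which word is being sung at a given playback time. A minimal sketch, assuming the word timestamps arrive as a list of dicts with `word`, `start` and `end` keys sorted by start time; `active_word_index` is a hypothetical helper name.

```python
from bisect import bisect_right
from typing import Dict, List, Optional


def active_word_index(words: List[Dict], t: float) -> Optional[int]:
    """Return the index of the word being sung at time t, or None in a gap."""
    starts = [w["start"] for w in words]
    # Last word whose start time is <= t.
    i = bisect_right(starts, t) - 1
    if i < 0:
        return None  # t is before the first word
    # Only highlight while the word is still sounding.
    return i if t < words[i]["end"] else None
```

For per-frame rendering the `starts` list would be built once per song rather than on every call.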

6. HOOK EXTRACTION  [ffmpeg, free]
   Trim first 30s → 9:16 reformat (crop center) → same karaoke + scene imagery
   Outputs to hooks/{channel}/{slug}_hook.mp4
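
The trim and center crop can be expressed as a single ffmpeg invocation. The sketch below only builds the argument list; `hook_command` is a hypothetical helper, and it assumes the input is the scene layer without burned-in karaoke text, since a center crop of a finished 16:9 frame would cut the lyrics panel.

```python
from pathlib import Path
from typing import List


def hook_command(src: str, channel: str, slug: str, seconds: int = 30) -> List[str]:
    """Build an ffmpeg command that trims the start of src and crops it to 9:16."""
    dest = Path("hooks") / channel / f"{slug}_hook.mp4"
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-t", str(seconds),        # keep only the first `seconds` of output
        "-vf", "crop=ih*9/16:ih",  # crop is centered when x/y are omitted
        "-c:a", "copy",            # audio is trimmed but not re-encoded
        dest.as_posix(),
    ]
```

The list can be handed to `subprocess.run(cmd, check=True)` after creating the destination directory.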

7. DB UPDATE
   video_assets record: path, format, duration, scene_count, cost_usd, whisper_words
   pipeline_stage → "video_ready"
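
A sketch of the DB update using SQLite. The schema is an assumption made for illustration: the spec names the `video_assets` fields and the `pipeline_stage` value, but the `songs` table, the `song_id` column and the JSON encoding of `whisper_words` are hypothetical.

```python
import json
import sqlite3

# Hypothetical schema; only the video_assets field names come from the spec.
SCHEMA = """
CREATE TABLE IF NOT EXISTS songs (
    id INTEGER PRIMARY KEY,
    title TEXT,
    pipeline_stage TEXT
);
CREATE TABLE IF NOT EXISTS video_assets (
    id INTEGER PRIMARY KEY,
    song_id INTEGER NOT NULL REFERENCES songs(id),
    path TEXT, format TEXT, duration REAL,
    scene_count INTEGER, cost_usd REAL, whisper_words TEXT
);
"""


def record_video(conn, song_id, path, fmt, duration, scene_count, cost_usd, whisper_words):
    """Insert the asset row and advance the song's stage in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO video_assets "
            "(song_id, path, format, duration, scene_count, cost_usd, whisper_words) "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            (song_id, path, fmt, duration, scene_count, cost_usd,
             json.dumps(whisper_words)),
        )
        conn.execute(
            "UPDATE songs SET pipeline_stage = 'video_ready' WHERE id = ?",
            (song_id,),
        )
```

Doing both writes in one transaction avoids a song being marked "video_ready" without a matching asset record.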

Cost Table

Songs   Scenes avg   Image cost   Whisper   GPT     Total
10      7            $1.53        $0.06     $0.02   ~$1.61
50      7            $7.65        $0.30     $0.10   ~$8.05
100     7            $15.30       $0.60     $0.20   ~$16.10

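$0.16/song fully automated. Revid charges $39/mo for ~100–400 videos depending on tier, which works out to ~$0.10–$0.39 per video against our $0.16. Parity is at ~250 videos/month ($39 / $0.16 ≈ 244): a tier that allows more than that is cheaper per video than our pipeline, and a smaller tier costs more. Either way, Revid is competitive on price.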

The reason to build our own anyway:

  • Visual coherence (img2img chaining) is genuinely better output
  • Full control over style per channel
  • No Revid account dependency
  • Can integrate motion video clips (Wan/LTX) as scene inserts when budget allows
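
The per-song figures in the cost table can be reproduced from the unit prices in the pipeline steps. This is a sketch: the function names are hypothetical, and the default of one minute of audio per song is an assumption inferred from the Whisper column ($0.06 for 10 songs at $0.006/min).

```python
GPT_PER_SONG = 0.002       # visual world generation
WHISPER_PER_MIN = 0.006    # word timestamps
SCHNELL_PER_IMAGE = 0.003  # scene 0, text-to-image
DEV_PER_IMAGE = 0.025      # scenes 1..N, img2img


def image_cost(scenes: int) -> float:
    """One text-to-image call plus (scenes - 1) img2img calls."""
    if scenes < 1:
        return 0.0
    return SCHNELL_PER_IMAGE + (scenes - 1) * DEV_PER_IMAGE


def song_cost(scenes: int, audio_minutes: float = 1.0) -> float:
    """Total API cost for one song; audio_minutes=1.0 is an assumed default."""
    return image_cost(scenes) + WHISPER_PER_MIN * audio_minutes + GPT_PER_SONG


def break_even_videos(subscription_usd: float, per_song_usd: float) -> float:
    """Monthly volume at which a flat subscription matches per-song spend."""
    return subscription_usd / per_song_usd
```

For example, `song_cost(7)` gives the 7-scene per-song total, and `break_even_videos(39, song_cost(7))` gives the monthly volume at which the $39 subscription costs the same.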

Video Generation (Future: Actual Motion Per Scene)

Replicate has motion video models if we want actual video clips per scene instead of Ken Burns stills:

Model          Cost         Quality     Speed      Notes
LTX-Video      ~$0.04/sec   Good        Fast       Best cost/quality for our use case
Wan 2.1 480p   $0.09/sec    Good        Moderate   Open source, controllable
Wan 2.1 720p   $0.25/sec    Excellent   Slow       Worth it for hero content
Kling          varies       Excellent   Slow       Revid's movie mode uses this or similar

For a 3s clip per scene (7 scenes × 3s): LTX = $0.84/song, Wan 480p = $1.89/song, Wan 720p = $5.25/song.
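
The per-song figures above are rate × scenes × seconds per scene. A small sketch for trying other clip lengths or scene counts; `motion_cost` is a hypothetical helper and the rates are the ones listed in the table (Kling is omitted because its price varies).

```python
# Per-second rates from the table above.
RATE_PER_SEC = {
    "LTX-Video": 0.04,
    "Wan 2.1 480p": 0.09,
    "Wan 2.1 720p": 0.25,
}


def motion_cost(model: str, scenes: int = 7, seconds_per_scene: float = 3.0) -> float:
    """Cost in USD of generating one motion clip per scene for a song."""
    return RATE_PER_SEC[model] * scenes * seconds_per_scene
```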

Recommendation: Use Ken Burns stills for standard production. Reserve Wan/LTX video clips for "hero" content (first GRE song, featured Aesop fable) where we want maximum quality.


prompt_strength Tuning by Channel

On Replicate's FLUX-dev img2img, a prompt_strength of 1.0 discards the input image entirely, so higher values change more between scenes and lower values preserve more of the previous scene. The ranges below should be checked against that direction during the 3-song test.

Channel                    Recommended strength   Reason
GRE Word Wizards           0.40–0.45              Each word is a new concept; more visual change desired
Aesop's Fables             0.55–0.65              Narrative continuity; scenes should feel like the same world
STEM Nursery Rhymes        0.50                   Balance: new concept per verse but consistent style
Cerebral / Mental Models   0.60–0.70              Abstract visuals benefit from slow evolution

Track B — Revid Automation

Why Revid Still Matters

Revid's movie mode (likely Kling or equivalent under the hood) generates actual motion video — not Ken Burns on stills. For Aesop's fables and high-production cerebral songs, this is meaningfully better. The subscription is already paid.

The problem: no public API. Automation must go through the browser UI.


Automation Approach: Playwright

revid_automation.py
  class RevidSession:
    - login(email, password)         # cookie-based, persist session
    - create_video(mp3_path, config) # upload + configure
    - poll_status(job_id)            # wait for completion
    - download_result(job_id, dest)  # save to wise-songs/videos/

Config object:

{
  "title": "The Dog and His Reflection — Aesop's Fable Song",
  "style": "cinematic",
  "duration": "auto",
  "captions": True,
  "aspect_ratio": "16:9",
  "hook_clip": True,
}

Session flow:

1. Load saved cookies → skip login if valid
2. Navigate to create page
3. Upload mp3
4. Set title, style, aspect ratio, captions
5. Click generate
6. Poll job status (every 30s, timeout 20min)
7. Download mp4 when complete
8. Save to ~/sai-workspace/content/wise-songs/videos/{channel}/
9. Update content.db: video_assets, pipeline_stage → "video_ready"
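
Step 6 of the session flow (poll every 30s, time out after 20min) can be written so it is testable without a browser. This is a sketch: `poll_until_done` and `RevidTimeout` are hypothetical names, and the status strings "complete" and "failed" are placeholders for whatever the Revid UI actually reports.

```python
import time
from typing import Callable


class RevidTimeout(Exception):
    """Raised when a job does not finish within the allowed time."""


def poll_until_done(
    get_status: Callable[[], str],
    interval_s: float = 30.0,
    timeout_s: float = 20 * 60,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call get_status until it reports a terminal state or the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in ("complete", "failed"):  # placeholder terminal states
            return status
        if clock() + interval_s > deadline:
            raise RevidTimeout(f"job still '{status}' after {timeout_s:.0f}s")
        sleep(interval_s)
```

In `RevidSession.poll_status`, `get_status` would wrap the Playwright page query for the job's state. Injecting `sleep` and `clock` keeps the loop unit-testable without waiting in real time.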

Revid Config Per Channel

Channel               Style                   Use Revid?             Notes
Aesop's Fables        cinematic or anime      Yes (movie mode)       Stories benefit most from actual motion
Cerebral              cinematic or artistic   Yes for hero content   Philosophical visuals
Mental Models         cinematic               Optional               Scene pipeline acceptable
GRE Word Wizards      -                       No                     Kinetic vocab cards; our pipeline is better
STEM Nursery Rhymes   -                       No                     Diagram style; our pipeline is better

Make.com as Alternative

If Playwright proves fragile (UI changes breaking selectors), Make.com is the fallback:

Webhook → Make.com scenario:
  1. Receive {slug, mp3_url, title, style}
  2. Revid module: create video
  3. Wait for completion
  4. HTTP module: POST result URL back to our webhook receiver
  5. Our receiver downloads + updates DB

Cost: Make.com free tier = 1000 ops/month. Each video ≈ 5 ops → covers ~200 videos/month free.


Decision Matrix — Which Pipeline Per Song

song arrives at video_gen stage
    ↓
channel?
    ├── GRE Word Wizards      → scene pipeline (img2img, strength 0.40–0.45)
    ├── STEM Nursery Rhymes   → scene pipeline (strength 0.50)
    ├── Aesop's Fables        → Revid (movie mode) + scene pipeline as backup
    ├── Mental Models         → scene pipeline (strength 0.60) or Revid
    └── Cerebral              → Revid preferred; scene pipeline acceptable
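
The decision matrix maps naturally onto a lookup table. A minimal sketch: `Route`, `ROUTES` and `route_for` are hypothetical names, the strength ranges combine this matrix with the tuning table, and rejecting unknown channels with an error is an assumption, since the spec does not say how they should be handled.

```python
from typing import NamedTuple, Optional, Tuple


class Route(NamedTuple):
    primary: str                     # "scene" or "revid"
    fallback: Optional[str]          # pipeline to use if the primary fails
    strength: Tuple[float, float]    # prompt_strength range for the scene pipeline


ROUTES = {
    "GRE Word Wizards": Route("scene", None, (0.40, 0.45)),
    "STEM Nursery Rhymes": Route("scene", None, (0.50, 0.50)),
    "Aesop's Fables": Route("revid", "scene", (0.55, 0.65)),
    "Mental Models": Route("scene", "revid", (0.60, 0.60)),
    "Cerebral": Route("revid", "scene", (0.60, 0.70)),
}


def route_for(channel: str) -> Route:
    """Look up the pipeline for a channel; unknown channels are rejected."""
    try:
        return ROUTES[channel]
    except KeyError:
        raise ValueError(f"no video route configured for channel: {channel!r}")
```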

Implementation Order

  1. scene mode in video_pipeline.py — GPT-4o visual world + FLUX img2img chain + ffmpeg assembly
  2. Test on 3 songs — one GRE, one Aesop, one cerebral — compare against viral mode
  3. revid_automation.py — Playwright session, login, upload, poll, download
  4. Content DB integration — track source, cost, style per video asset
  5. Review workflow — never auto-publish; human review before upload
  6. youtube_upload.py batch mode — queue reviewed videos for upload with metadata
