How To Use Seedance 2.5 — Online Generator Guide

Building on the Seed ecosystem, Seedance 2.5 uses a dual-branch transformer architecture to generate audio and video together, with native multimodal input and output. It operates as a unified multimodal diffusion system: text, images, video clips, audio, and 3D white-model blockouts can all be supplied at once to guide the output. The model handles complex scene planning and real-world physics, and character performances are generated in sync with native sound. Compared with Seedance 2.0, it improves overall visual realism, multi-shot cuts, and narrative control, so an idea can go from prompt to finished clip in a single pipeline. For prompt-writing best practices and tips, see our Seedance 2.5 Prompt Guide.

Seedance 2.5 vs 2.0 Capabilities

Core Capability	Seedance 2.0	Seedance 2.5
Max Duration	Up to 15s	Up to 30s
Reference budget	Up to 12 files	Up to 50 references
3D blockout input	Not supported	Supported
Native 4K	Up to 4K export	Native 4K pipeline
Prompt adherence	Baseline	~20% better
Multi-Modal Input	Text + Image + Video + Audio	Text + Image + Video + Audio + 3D

Seedance 2.5 Model Highlights

1. Multi-Shot: Structured Scene Planning in One Generation

Rather than producing one short continuous clip, Seedance 2.5 plans each scene as a sequence of shots. The model handles natural cuts and transitions automatically within a single generation, so a 30-second output can feel like an edited sequence instead of one unbroken take.

2. Universal Reference: Unprecedented Control Over Every Element

Seedance 2.5 is built as a multimodal video creation platform. You can upload a reference video showing specific camera movements or choreography, an image for character consistency, and an audio clip for rhythm. The model then reproduces those elements, whether cinematographic style or motion, using your own content.

3. Joint Audio-Video Generation: Cinema-Grade Sound, Built In

Most AI video generators produce silent clips, but Seedance 2.5 uses a dual-branch transformer to generate video and audio simultaneously. Because both streams are generated together, sound and picture stay in sync without a separate post-production alignment step.

4. Hyper-Real World Physics Simulation

One of the biggest giveaways of AI video used to be implausible physics. Seedance 2.5 models spatial relationships, gravity, and material interactions, which makes its output look considerably more realistic.

5. 30-Second Generation: Complete Narrative Sequences

The new model generates up to 30 seconds of highly polished video per output. Within that duration, the model comfortably accommodates complex action sequences, multi-shot editing, and synchronized dialogue or sound effects.

Seedance 2.5 New Capabilities Guide

1. Multimodal Reference

You can upload reference images, videos, and audio to guide the model, whether for character consistency, camera movement, or rhythm. Each reference type plays a different role:

Image Reference: Lock character appearance, style, or scene elements with reference images. The model accurately replicates these elements in your output.
Video Reference: Upload a reference video for camera movement, choreography, or motion patterns. The model seamlessly transfers these dynamics to your generation.
Audio Reference: Use an audio clip to sync rhythm, beats, or dialogue timing. The model generates video and audio in sync with your reference.

Example inputs: 6 reference videos, 1 reference image.

Prompt:

Bright and colorful advertising film style, fruity biscuits as the protagonist, including strawberry, apple, grape, orange four flavors, strawberry flavor reference @Image1, biscuits and corresponding fruits are arranged in a geometric array with a strong sense of order, the overall picture is clean, advanced, and rhythmic. The opening fruit quickly establishes visual focus, refers to the composition of @Video1, and the music reshoots cut in. Then the biscuits of different flavors are neatly arranged, cut close-up, and refer to the dynamics and movement of @Video2. A biscuit is broken in the climax, and it enters slow motion in an instant. The fruity sandwich explodes, the debris splashes, the juice feeling and the particle impact are enlarged and displayed, and the impact feeling of @Video3 is referred to. Horizontal array, forming a sense of rhythm parabolism, referring to the movement of @Video4, highlighting the beauty of order and product richness. Then quickly return to fast-paced editing. Ending English text One bit of crispness, a heart full of delight Quick word segmentation switches into the painting, with strong rhythm text movement and product freeze frame, referring to @Video5, the final brand sense ends, biscuits and fruits diverge around, referring to @Video6, the picture is full of young, energetic, delicious, and want to share the advertising atmosphere.

2. First & Last Frame Control

Seedance 2.5 allows you to lock the narrative structure by defining the exact start and end points of your video. The AI will seamlessly interpolate the physics, lighting, and camera movement between the two frames.

Example inputs: 2 reference images.

Prompt:

Create a 30-second educational short video about the 3,000-year evolution of football. The entire film uses a single ball as the visual throughline: the ball rolls, traverses, and transforms from ancient times onward, linking different civilizations and eras. The overall pacing is tight, the visuals are premium, combining historical educational content with artistic transitions, emphasizing the feeling of one ball spanning three thousand years. Voiceover is concise and impactful. Opening: An ancient ball slowly emerges from a black background, its surface bearing textures of age. It then rolls into a Warring States period Cuju scene in China. The visual shifts to an ink-wash painting style @Image1. Figures in ancient attire play Cuju in a courtyard with elegant movements, the ball bouncing at their feet. Next, the ball continues rolling forward, and the scene naturally transitions to an ancient Greek ball game. The visual adopts a classical oil painting style @Image2. A plaza and stone columns are prominent in the background; people in ancient Greek robes kick the ball. The image feels weighty and historically rich. Then the ball rolls into medieval Europe. The visual maintains the oil painting style: a village, muddy ground, common folk chasing a leather ball. The atmosphere is lively and raw, like the ember of folk football being kept alive. Ending: The ball rests at center field in a modern stadium. Crowds from around the world and cheering voices merge in the background, creating the feeling of "one ball connecting the world." The visuals are grand and epic in scale.

3. Native Audio & Synchronization

Seedance 2.5 breaks the boundary between visuals and sound. You can upload an audio clip (up to 30s), and the model will generate motion that matches the rhythm, or animate characters to lip-sync perfectly with the dialogue.

Example inputs: 3 reference videos, 13 reference images, 1 audio clip.

Prompt:

Core Directive: A 26-second one-shot narrative short film throughout, interweaving steady tracking @Video1 with smooth orbital camera movement @Video2. Fluid forward momentum. Day-to-night transitions and seasonal shifts are achieved within the single continuous take. The protagonist is a European woman @Image1, immersed in bustling crowds full of everyday life, highlighting an extreme sense of solitude with cinematic photography quality. Segmented Camera Movement & Scene Descriptions:

0-3s (Steady back-follow): An old wooden door @Image2 creaks open. The camera closely follows the European woman's back as she steps out, dressed in the outfit from @Image3. She pauses briefly at the threshold: dappled light and shadow fill the alley ahead, vendor calls and crowd noise rush toward her. Her expression is distant; she slowly steps forward, merging into the street.

3-6s (Back-side tracking): The camera maintains smooth follow. She enters a crowded morning market @Video3. Both sides are packed with vibrantly colored fruit stalls and spice shops. A troupe of street performers blows fire dragons @Image4, flames illuminating the crowd, yet she doesn't glance sideways, passing through at an even pace.

6-9s (Side smooth orbit): The camera begins a smooth orbit toward the front-side, capturing the protagonist's profile. She passes a noisy butcher shop @Image5. A young mother holding a baby @Image6 brushes past her; the baby stares curiously, but she merely lowers her gaze to avoid eye contact, never pausing for a moment.

9-12s (Front-facing reverse tracking): The camera continues orbiting to directly face the protagonist, tracking in reverse. The crowd ahead parts naturally like the Red Sea, and a massive elephant draped in ornate red cloth @Image7 emerges from the right side of frame with steady strides, occupying most of the image.

12-15s (Gap threading & orbit back): At the instant the woman and elephant are about to collide, the camera deftly slides through the narrow gap between them, orbiting back to her rear. The elephant passes, enormous and silent. Children cheer and chase after it. Elephant bells and laughter erupt around her, yet she never once slows her pace.

15-18s (Ambient light shift): As she walks, the light within the continuous shot transforms magically: the piercing summer sun softens instantly, a breeze sweeps up a sky full of golden autumn leaves @Image8. The season seamlessly transitions to deep autumn within the same unbroken take. Falling leaves brush across her shoulders.

18-21s (360-degree immersive orbit): Ahead, a grand street festival erupts @Image9. Streamers and confetti burst into the air; vendors lean out cheering. The camera executes a continuous 360-degree orbital movement at this moment, creating an intensely powerful visual contrast between the quiet, solitary protagonist and the frenetic surroundings.

21-24s (Orbit returns to rear-side): As the orbit completes its full revolution back to her rear-side, the falling streamers have silently transformed into drifting snowflakes: winter arrives instantly @Image10. Pedestrians open umbrellas or pull up hoods. The woman shivers slightly, turns up her coat collar (now wearing the outfit from @Image11) and continues walking alone through the snow.

24-26s (Slow push tracking): As she walks toward the end of the long street, daylight visibly fades in real time during her stride, seamlessly sinking from day into night. Warm yellowish streetlamps and stall bulbs flicker on one by one along both sides @Image12. Vendors pack up their goods; the din seems to be slowly absorbed and pulled away by the heavy snow. Her footsteps gradually slow. Fireworks suddenly burst magnificently across the night sky @Image13 (fireworks sound reference: Audio 1). Colorful light speckles dance and flicker across building walls and in her eyes @Audio1. The world remains lively as ever, while she gazes up in quiet stillness. The camera slowly pulls back, gently concluding here.

4. 30-Second Cinematic Generation

Generate up to 30 seconds of continuous, multi-shot, audio-synced video, providing maximum flexibility for your narrative workflows.

Example inputs: 8 reference images.

Prompt:

One continuous shot: the camera steadily follows a person wearing a black coat (reference @Image1) moving from left to right through six interconnected rooms, each with a different color tone and atmosphere. Every room shares the same structure: white walls, light-colored herringbone wood flooring, French double floor-to-ceiling windows, and white sheer curtains (reference @Image2), but the view outside the window and the interior atmosphere are completely different. The protagonist walks at a constant pace throughout, passing through each open doorway in the walls.

0–5s, Room One, Theme: American Comic-Style Fight. The protagonist enters the room and engages in combat with a character (@Image3). The opponent is defeated.

5–10s, Room Two, Theme: Warmth, Felt-Craft Style. The view outside the window is a sunflower field (@Image4). The interior is bathed in warm orange soft light, with a painter working on a sunflower painting (@Image5). As the protagonist enters, they also transform into felt-craft style.

10–15s, Room Three, Theme: Sadness, Black-and-White Comic Stop-Motion Style. The entire frame is rendered in black-and-white comic stop-motion animation. Outside the window, it rains. The interior light is cold, gray, and somber. A person sits alone on the floor in the center of the empty room, head down, hugging their knees, with a phone beside them glowing with an unanswered call screen. The protagonist enters the room and switches off the light, then immediately turns it back on. The room bursts into full color, and flowers instantly bloom and fill the entire space.

15–20s, Room Four, Theme: Joy, Underwater Room. The entire scene is a room submerged in the ocean (reference @Image6). The protagonist swims into the room, surrounded by beautiful coral reefs and schools of fish.

20–25s, Room Five, Theme: Surprise. Outside the window, fireworks fill the night sky (reference @Image7). Inside, colorful flickering light reflections dance across the room. The protagonist is swept up in a celebratory atmosphere.

25–30s, Final Room. The protagonist arrives in a completely blank white room, stands in the center, and snaps their fingers, accompanied by the crisp sound effect of a finger snap. The entire screen cuts to black, with the word "seedance" appearing in the center (reference @Image8).

Overall cinematic quality, high-fashion advertising style. Lighting is entirely determined by the scene outside each window, creating strong emotional contrast between rooms. No text appears on screen.
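Timed prompts like the one above depend on segments (0–5s, 5–10s, and so on) that line up without gaps and end at the chosen clip length. A minimal sketch of a pre-flight check follows. It is not part of the Seedance generator: the function name `timeline_problems` and the regular expression are hypothetical, and the sketch assumes segments are written as `start-ends` or `start–ends` in whole seconds.

```python
import re

# Matches ranges such as "0-3s" or "0–5s" (hyphen or en dash).
# Hypothetical helper pattern; the generator itself may parse prompts differently.
SEGMENT = re.compile(r"(\d+)\s*[-–]\s*(\d+)s")


def timeline_problems(prompt, total_seconds):
    """Return human-readable problems found in the timed segments of a prompt."""
    spans = [(int(a), int(b)) for a, b in SEGMENT.findall(prompt)]
    problems = []
    expected_start = 0
    for start, end in spans:
        if start != expected_start:
            problems.append(
                f"segment {start}-{end}s should start at {expected_start}s"
            )
        if end <= start:
            problems.append(f"segment {start}-{end}s has no positive length")
        expected_start = end
    if spans and expected_start != total_seconds:
        problems.append(
            f"timeline ends at {expected_start}s, expected {total_seconds}s"
        )
    return problems


if __name__ == "__main__":
    draft = "0–5s, Room One. 5–10s, Room Two. 12–15s, Room Three."
    for problem in timeline_problems(draft, total_seconds=15):
        print(problem)
```

Running a draft through a check like this before spending credits catches the most common timing slip, a segment that starts later than the previous one ended.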

Frequently Asked Questions

What input materials and limits does Seedance 2.5 support?

Image Input: Supports jpeg, png, webp, bmp, tiff, and gif formats for character look, style, props, and scene plates.

Video Input: Supports mp4 and mov reference clips for camera movement, motion style, and scene continuity.

Audio Input: Supports mp3 and wav for rhythm, beats, dialogue, and lip-sync performance.

Text Input: Natural language prompts.

Output Duration: 4–30 seconds, user-selectable.

Audio Output: Native sound effects and background music.

Total Reference Limit: Up to 50 multimodal references per job on Seedance 2.5 (vs 12 mixed files on 2.0). There is no fixed per-type cap—mix images, videos, and audio within the 50-reference budget. Prioritize uploading materials that have the greatest impact on the visuals or rhythm, and allocate reasonably across different modalities.
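The limits above can be checked locally before uploading. The following is a minimal sketch, not an official client: the function `classify_references` and its return shape are hypothetical, and only the extensions and the 50-reference budget come from the list above. The extension `.jpg` and any 3D blockout formats are not named in this guide, so the sketch rejects them; whether the generator accepts them is not stated here.

```python
from pathlib import PurePath

# Extensions and the reference budget are taken from the FAQ above.
# Everything else in this block is an illustrative assumption.
ALLOWED = {
    "image": {"jpeg", "png", "webp", "bmp", "tiff", "gif"},
    "video": {"mp4", "mov"},
    "audio": {"mp3", "wav"},
}
MAX_REFERENCES = 50


def classify_references(filenames):
    """Group filenames by modality.

    Raises ValueError when the total exceeds the reference budget or when
    a file has an extension that is not listed above.
    """
    if len(filenames) > MAX_REFERENCES:
        raise ValueError(
            f"{len(filenames)} references exceeds the budget of {MAX_REFERENCES}"
        )
    grouped = {kind: [] for kind in ALLOWED}
    for name in filenames:
        ext = PurePath(name).suffix.lower().lstrip(".")
        for kind, extensions in ALLOWED.items():
            if ext in extensions:
                grouped[kind].append(name)
                break
        else:
            # No modality matched this extension.
            raise ValueError(f"unsupported reference format: {name}")
    return grouped


if __name__ == "__main__":
    print(classify_references(["hero.png", "dolly_move.mp4", "beat.wav"]))
```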

How do I use multimodal references with @Video, @Image, and @Audio tags?

In the Seedance 2.5 generator, upload images, video clips, and audio beds, then reference them inline in your prompt with @Video1, @Image2, @Audio1-style tags. Each tag maps to a specific slot so the model keeps character look, camera motion, color palette, and rhythm separate. Assign clear roles—hero character, location plate, motion reference, dialogue track—instead of stacking similar assets. A well-organized brief within the 50-reference budget usually beats a chaotic full upload.
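A quick way to keep a brief organized is to confirm that every tag in the prompt points at an uploaded slot and that no upload goes unused. The sketch below is illustrative only: `check_tags` is a hypothetical helper, and it assumes slots are numbered from 1 separately for each type, as the @Video1 to @Video6 and @Image1 examples in this guide suggest.

```python
import re

# Tag style follows the @Video1 / @Image2 / @Audio1 examples in this guide.
TAG_PATTERN = re.compile(r"@(Video|Image|Audio)(\d+)")


def check_tags(prompt, uploaded):
    """Compare tags used in a prompt with the uploaded slot counts.

    uploaded: dict such as {"Video": 2, "Image": 1, "Audio": 1}.
    Returns (missing, unused): tags with no matching upload, and
    uploaded slots that the prompt never mentions.
    """
    used = {(kind, int(num)) for kind, num in TAG_PATTERN.findall(prompt)}
    available = {
        (kind, index)
        for kind, count in uploaded.items()
        for index in range(1, count + 1)
    }
    missing = sorted(used - available)
    unused = sorted(available - used)
    return missing, unused


if __name__ == "__main__":
    prompt = "Follow the camera path of @Video1 with the hero from @Image1, cut on the beat of @Audio1, then echo @Video3."
    missing, unused = check_tags(prompt, {"Video": 2, "Image": 1, "Audio": 1})
    print("missing:", missing)
    print("unused:", unused)
```

An unused slot is not an error, but it spends part of the 50-reference budget without a stated role, which is the situation the advice above warns against.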

What are Reference, First & Last Frame, and Text to Video modes?

Reference mode blends up to 50 multimodal inputs—images, clips, audio, and 3D blockouts—into one coherent Seedance 2.5 shot. First & Last Frame mode locks your opening and closing stills; the model interpolates physics, lighting, and camera travel between them. Text to Video mode needs only a written brief for fast concepting. Use Reference for branded or character-heavy work, First & Last Frame for precise narrative beats, and Text to Video when you want the fastest draft.

How is Seedance 2.5 different from Seedance 2.0 in this tool?

Both models run in the same browser workflow. Seedance 2.5 doubles native clip length to 30 seconds (vs 15 on 2.0), raises the reference budget to 50 multimodal slots (vs 12), adds 3D white-model blockout input, and delivers roughly 20% better prompt adherence for complex briefs. Seedance 2.0 remains available for cheaper, shorter drafts when you are iterating on ideas before a final 2.5 render.

Does Seedance 2.5 output native 4K and 30-second clips?

Yes. Seedance 2.5 renders up to 30 seconds of continuous, multi-shot footage in a single pass—no separate extend or upscale step. Output resolutions include 480p, 720p, 1080p, and native 4K on paid tiers, with aspect ratios 16:9, 9:16, 1:1, and 4:3. For runtimes beyond 30 seconds, chain extensions: render a 30-second clip, then use it as the seed for the next segment or stitch several 2.5 clips that share the same references in post.
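For runtimes beyond 30 seconds, it helps to plan segment lengths before rendering. The sketch below splits a target runtime into chained segments that each respect the 4 to 30 second output range stated earlier in this FAQ. The function `plan_segments` is a hypothetical planning helper, not a generator feature, and it says nothing about how the extension itself is invoked.

```python
# Output duration range from this guide: 4 to 30 seconds per generation.
MIN_SECONDS = 4
MAX_SECONDS = 30


def plan_segments(total_seconds):
    """Split a runtime into segment lengths, each between 4 and 30 seconds.

    Uses full-length segments where possible and shortens one segment
    when the leftover tail would otherwise fall below the minimum.
    """
    if total_seconds < MIN_SECONDS:
        raise ValueError(f"runtime must be at least {MIN_SECONDS} seconds")
    segments = []
    remaining = total_seconds
    while remaining > 0:
        take = min(MAX_SECONDS, remaining)
        # Avoid leaving a tail shorter than the minimum duration.
        if 0 < remaining - take < MIN_SECONDS:
            take = remaining - MIN_SECONDS
        segments.append(take)
        remaining -= take
    return segments


if __name__ == "__main__":
    print(plan_segments(62))  # three segments that sum to 62 seconds
```

Keeping the same reference set across all planned segments, as recommended above, is what holds character and style consistent across the joins.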

Can I upload audio for lip-sync and rhythm-matched motion?

Yes. Seedance 2.5 accepts mp3 and wav reference audio. The model can animate characters to lip-sync dialogue, match cuts to a beat, or drive motion from an uploaded soundtrack. Tag the track with @Audio1 in your prompt and describe how motion should follow—dialogue performance, dance timing, or ambient mood. Native audio output also includes generated sound effects and background music when you do not supply a reference track.

Can I edit or extend a video after generation?

Yes. Seedance 2.5 supports smooth video extension: prompt the model to keep shooting the next sequence from an existing clip, or refine characters and trim scenes with follow-up edits. For long-form work, render 30-second segments that share the same reference set, then extend or stitch in the generator before exporting. Use Seedance 2.0 for low-cost draft extensions when final 4K quality is not required yet.

How many credits does a Seedance 2.5 video cost?

Credits scale with duration, resolution, Fast vs Standard quality, and whether video references are attached. On Standard without long video input, expect roughly 2 credits per second at 480p, 4 at 720p, 10 at 1080p, and 15 at 4K; Fast mode and reference-video jobs use higher per-second rates. A 30-second 1080p clip costs more than a 5-second 480p silent draft. New accounts receive starter credits to try the tool. Check our Pricing Page for detailed credit packages.
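As a worked example of the Standard rates above, a 30-second 1080p clip comes to roughly 30 × 10 = 300 credits, while a 5-second 480p draft comes to roughly 5 × 2 = 10 credits. The sketch below encodes only those approximate Standard rates. `estimate_standard_credits` is a hypothetical helper; Fast mode and reference-video surcharges are not modeled because this guide does not give their rates, so treat the result as a rough lower bound and check the Pricing Page for actual figures.

```python
# Approximate Standard-mode rates (credits per second) quoted in this FAQ,
# for jobs without long video input. Other modes are not modeled.
STANDARD_CREDITS_PER_SECOND = {"480p": 2, "720p": 4, "1080p": 10, "4K": 15}


def estimate_standard_credits(seconds, resolution):
    """Rough credit estimate for a Standard job without video references."""
    if not 4 <= seconds <= 30:
        raise ValueError("duration must be between 4 and 30 seconds")
    if resolution not in STANDARD_CREDITS_PER_SECOND:
        raise ValueError(f"unknown resolution: {resolution}")
    return seconds * STANDARD_CREDITS_PER_SECOND[resolution]


if __name__ == "__main__":
    print(estimate_standard_credits(30, "1080p"))  # 300
    print(estimate_standard_credits(5, "480p"))    # 10
```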

Create Professional AI Videos with Seedance 2.5

Create cinematic AI videos with realistic motion, immersive sound, and director-level control—without complex production.