
SyncTalk_2D

SyncTalk_2D is a 2D lip-sync video generation model based on SyncTalk and Ultralight-Digital-Human. It produces high-quality lip-sync videos with low latency and also supports real-time lip-sync video generation.

Compared to Ultralight-Digital-Human, we have improved the audio feature encoder and increased the resolution to 328 to accommodate higher-resolution input video. This version can produce high-definition, commercial-grade digital humans.


Architecture

SyncTalk_2D uses a modular architecture following SOLID design principles:

  • Single Responsibility: Each component (model service, TTS service, frame generation) has a distinct responsibility
  • Open/Closed: Components are extensible without requiring modification (e.g., the TTSService interface)
  • Liskov Substitution: Derived classes (like the enhanced frame generator) can be substituted for their base classes
  • Interface Segregation: Clean interfaces between components with minimal dependencies
  • Dependency Inversion: High-level modules depend on abstractions, not concrete implementations

The system is divided into these main components:

  1. ModelService: Loads and manages neural network models (U-Net and Audio Encoder)
  2. TTSService: Handles text-to-speech synthesis with graceful fallback mechanisms
  3. FrameGenerationService: Generates animated frames from text or audio input
  4. API Layer: Exposes functionality through FastAPI endpoints with proper error handling
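
A minimal sketch of how these components could be wired together under the dependency-inversion principle. The class and method names here are illustrative only, not the repo's exact signatures; see services/models.py, services/tts.py, and services/generator.py for the real implementations.

# Illustrative wiring only; the real classes live under services/ and their signatures may differ.
from abc import ABC, abstractmethod

class TTSService(ABC):
    """Abstraction that frame generation depends on (Dependency Inversion)."""
    @abstractmethod
    def synthesize(self, text: str) -> bytes:
        """Return raw audio bytes for the given text."""

class PiperTTSService(TTSService):
    """Concrete engine; could be swapped without touching FrameGenerationService (Open/Closed)."""
    def synthesize(self, text: str) -> bytes:
        return b""  # a real implementation would call piper-tts here

class FrameGenerationService:
    """High-level module: its dependencies are injected rather than constructed internally."""
    def __init__(self, model_service, tts_service: TTSService):
        self.model_service = model_service  # wrapper around the U-Net and audio encoder
        self.tts_service = tts_service

    def frames_from_text(self, text: str):
        audio = self.tts_service.synthesize(text)
        # ...encode audio features and run the U-Net to produce lip-synced frames...
        return audio

# Composition root (in the real repo, api.py assembles the services at startup).
generator = FrameGenerationService(model_service=None, tts_service=PiperTTSService())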

Setting up

Set up the environment

conda create -n synctalk_2d python=3.10
conda activate synctalk_2d
# install dependencies
conda install pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=12.1 -c pytorch -c nvidia
conda install -c conda-forge ffmpeg  #very important
pip install opencv-python transformers soundfile librosa onnxruntime-gpu configargparse
pip install numpy==1.23.5
pip install fastapi uvicorn pydantic python-multipart piper-tts
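
Once the environment is set up, a quick sanity check (a sketch, assuming the packages above installed cleanly) confirms that PyTorch sees the GPU and that the ONNX Runtime CUDA provider is available:

# Sanity check for the synctalk_2d environment; run inside `conda activate synctalk_2d`.
import torch
import onnxruntime as ort

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())              # expect True on a CUDA 12.1 machine
print("ONNX Runtime providers:", ort.get_available_providers())  # expect CUDAExecutionProvider with onnxruntime-gpu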

Prepare your data

  1. Record a 5-minute video with your head facing the camera and without significant head movement. Keep the camera fixed and the background lighting constant throughout the recording.
  2. Don't worry about the frame rate; the code automatically converts the video to 25 fps.
  3. Make sure no other person's voice appears in the recording, and leave a 5-second silent clip at the beginning and end of the video.
  4. Avoid clothing with strong textures; plain, single-color clothing works best.
  5. The video should be recorded in a well-lit environment.
  6. The audio should be clear and without background noise.
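
Before training, you can sanity-check a recording against the points above with OpenCV (a sketch; the example path follows the dataset layout used in the Train section below):

# Optional pre-flight check of a recording; adjust the path to your own dataset/name/name.mp4.
import cv2

cap = cv2.VideoCapture("dataset/May/May.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
frame_count = cap.get(cv2.CAP_PROP_FRAME_COUNT)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
cap.release()

print(f"{width}x{height} @ {fps:.2f} fps, {frame_count / fps / 60:.1f} minutes")
# The pipeline resamples to 25 fps automatically; the duration should be roughly 5 minutes.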

Train

  1. Put your video in 'dataset/name/name.mp4'
  • example: dataset/May/May.mp4
  2. Run the processing and training script
bash training_328.sh name gpu_id
  • example: bash training_328.sh May 0

  • Wait for training to complete; it takes approximately 5 hours

  • If an out-of-memory (OOM) error occurs, try reducing batch_size

Inference

python inference_328.py --name data_name --audio_path path_to_audio.wav
  • example: python inference_328.py --name May --audio_path demo/talk_hb.wav

  • the result will be saved in the 'result' folder

Web Interface & API

SyncTalk_2D now includes a web interface and API for real-time talking head animation. You can generate animations in the browser either from text input (which is converted to speech) or by uploading an audio file directly.

Starting the Web Server

python api.py

This will start a FastAPI server on port 8000. You can access the web interface by visiting http://localhost:8000 in your browser.

Features

  • Text-to-Speech Animation: Enter text and watch the avatar speak it in real-time
  • Audio File Animation: Upload a WAV file (16kHz, 16-bit PCM recommended) and see the avatar speak along with it
  • Streaming API: All animations are streamed as multipart MJPEG for real-time viewing (see the client sketch after this list)
  • Debug Mode: Save generated frames and audio for troubleshooting and optimization
  • Audio-Video Synchronization: Precise timing to ensure lips match the audio perfectly
  • Idle Mode: Built-in idle endpoint streams a natural loop using the first ~1s of source frames
  • Live Mic (WebSocket): Real-time mic input over WebSocket with binary JPEG frames output
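
As referenced above, the streaming endpoints emit multipart MJPEG. Below is a minimal client sketch that decodes the /idle stream with OpenCV, assuming the server from "Starting the Web Server" is running locally on port 8000:

# Minimal MJPEG reader for the /idle endpoint (illustrative client, not part of the repo).
import cv2
import numpy as np
import requests

stream = requests.get("http://localhost:8000/idle", stream=True, timeout=10)
buffer = b""
for chunk in stream.iter_content(chunk_size=4096):
    buffer += chunk
    start = buffer.find(b"\xff\xd8")  # JPEG start-of-image marker
    end = buffer.find(b"\xff\xd9")    # JPEG end-of-image marker
    if start != -1 and end != -1 and end > start:
        jpeg = buffer[start:end + 2]
        buffer = buffer[end + 2:]
        frame = cv2.imdecode(np.frombuffer(jpeg, dtype=np.uint8), cv2.IMREAD_COLOR)
        if frame is not None:
            cv2.imshow("SyncTalk_2D idle", frame)
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break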

API Endpoints

| Endpoint             | Method | Description                                               |
|----------------------|--------|-----------------------------------------------------------|
| /                    | GET    | Web interface                                             |
| /generate            | GET    | Generate animation from text (streaming)                  |
| /generate            | POST   | Generate animation from text (JSON body)                  |
| /generate-from-audio | POST   | Generate animation from uploaded audio                    |
| /generate-from-audio | GET    | Initializer stream for <img> while POST upload processes  |
| /avatar              | GET    | Get a static avatar frame                                 |
| /idle                | GET    | Stream idle animation frames                              |
| /ws/live             | WS     | Live mic audio in; JPEG frames out                        |
| /get-tts-audio       | GET    | Generate and download TTS audio                           |
| /health              | GET    | Check service health                                      |
| /debug/status        | GET    | Get current debug settings                                |
| /debug/config        | POST   | Update debug settings                                     |
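
A hedged example of exercising a few of these endpoints with the requests library. The query and form field names (text, file) and the JSON health response are assumptions; check api.py for the exact schemas.

# Hypothetical client calls against the endpoint table above; field names are assumptions.
import requests

BASE = "http://localhost:8000"

# Health check (assumes a JSON response)
print(requests.get(f"{BASE}/health", timeout=5).json())

# Generate and download TTS audio (the 'text' query parameter name is an assumption)
audio = requests.get(f"{BASE}/get-tts-audio", params={"text": "Hello from SyncTalk_2D"}, timeout=60)
print("X-Audio-Duration:", audio.headers.get("X-Audio-Duration"))
with open("tts_output.wav", "wb") as f:
    f.write(audio.content)

# Generate animation from an uploaded WAV file (the 'file' form field name is an assumption)
with open("demo/talk_hb.wav", "rb") as f:
    resp = requests.post(f"{BASE}/generate-from-audio", files={"file": f}, stream=True, timeout=300)
print("Stream content type:", resp.headers.get("Content-Type"))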

Configuration

The application uses a centralized configuration system defined in config.py:

from config import settings

# Access configuration
device = settings.models.device
tts_model_path = settings.tts.model_path

Configuration sections include:

  • models: Neural network models and device settings
  • tts: Text-to-speech service configuration
  • debug: Debug and diagnostics settings
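
The real config.py is the source of truth; the following is only a hypothetical sketch of what the Pydantic-based settings object could look like, consistent with the attribute access shown above. Field names and default values are assumptions.

# Hypothetical shape of config.py; the actual fields and defaults may differ.
from pydantic import BaseModel, Field

class ModelSettings(BaseModel):
    device: str = "cuda"                         # accessed as settings.models.device

class TTSSettings(BaseModel):
    model_path: str = "models/piper/voice.onnx"  # accessed as settings.tts.model_path (path is an assumption)

class DebugSettings(BaseModel):
    enabled: bool = False
    save_frames: bool = False
    save_audio: bool = False

class Settings(BaseModel):
    models: ModelSettings = Field(default_factory=ModelSettings)
    tts: TTSSettings = Field(default_factory=TTSSettings)
    debug: DebugSettings = Field(default_factory=DebugSettings)

settings = Settings()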

Debug Features

SyncTalk_2D includes debug features to help troubleshoot issues:

  1. Save Frames: When enabled, each generated frame will be saved to the debug/frames directory
  2. Save Audio: When enabled, generated or uploaded audio will be saved to the debug/audio directory
  3. Chunked Processing: The system now processes audio in smaller chunks to enable real-time streaming

To enable debug features:

  1. Use the debug panel in the web interface (click on "Debug Controls" at the bottom)
  2. Or update settings via API: POST /debug/config with JSON body:
    {
      "enabled": true,
      "save_frames": true,
      "save_audio": true
    }
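
The same settings can also be toggled from a script (a sketch, assuming the server runs on localhost:8000):

# Toggle debug options via the API; mirrors the JSON body above.
import requests

BASE = "http://localhost:8000"
update = requests.post(f"{BASE}/debug/config",
                       json={"enabled": True, "save_frames": True, "save_audio": True},
                       timeout=5)
print("update:", update.status_code, update.text)

status = requests.get(f"{BASE}/debug/status", timeout=5)
print("status:", status.text)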

Notes

  • The UI reads X-Audio-Duration from /get-tts-audio to synchronize return-to-idle when speech finishes.
  • For best latency, the WebSocket path uses moderately reduced JPEG quality.

Code Structure

├── api.py                       # FastAPI web server and endpoints
├── config.py                    # Central configuration management
├── docs/                        # Documentation
│   ├── architecture.md          # System architecture overview
│   ├── model-details.md         # Detailed model architecture and specifications
│   ├── data-flow.md             # Data processing pipelines and flow
│   ├── training-guide.md        # Comprehensive training guide
│   └── api-reference.md         # API endpoints and usage
├── services/
│   ├── models.py                # Neural network model management
│   ├── tts.py                   # Text-to-speech synthesis
│   └── generator.py             # Frame generation and animation
├── static/
│   └── index.html               # Web interface
├── utils.py                     # Utility functions
├── unet_328.py                  # U-Net neural network architecture
├── syncnet_328.py               # SyncNet for audio-visual synchronization
├── inference_328.py             # Offline inference script
├── train_328.py                 # Training script with checkpointing
└── training_328.sh              # Training automation script

Documentation

For detailed information about the system, see the docs/ directory (architecture.md, model-details.md, data-flow.md, training-guide.md, and api-reference.md).

Visual Diagrams

The documentation includes detailed PlantUML diagrams that illustrate:

  • System architecture and component relationships
  • U-Net neural network structure
  • Training and inference data flows
  • Audio processing pipelines
  • API architecture

View the diagrams at PlantUML Online by copying the code from docs/diagrams/ or install a PlantUML viewer in your IDE.

Best Practices Implemented

  • Type Hints: Extensive use of Python type annotations
  • Error Handling: Graceful fallbacks and detailed error logging
  • Modularity: Clear separation of concerns between components
  • Documentation: Comprehensive docstrings and comments
  • Configuration: Centralized, type-safe configuration with Pydantic
  • Asynchronous Processing: Efficient streaming with FastAPI async endpoints
  • Dependency Injection: Services use DI pattern for better testability
  • Code Reuse: Inheritance and composition for shared functionality
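
As an illustration of the dependency-injection and async-streaming patterns listed above, here is a small FastAPI sketch. It is not the repo's actual api.py; the endpoint path, class, and function names are placeholders.

# Minimal illustration of DI via Depends plus an async MJPEG streaming endpoint.
from fastapi import Depends, FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

class FrameGenerationService:
    """Placeholder standing in for services/generator.py."""
    async def frames(self):
        for _ in range(3):
            yield b"...jpeg bytes..."

def get_generator() -> FrameGenerationService:
    # The real composition root would return a configured, model-backed instance.
    return FrameGenerationService()

@app.get("/idle-demo")
async def idle_demo(gen: FrameGenerationService = Depends(get_generator)):
    async def mjpeg():
        async for jpeg in gen.frames():
            yield b"--frame\r\nContent-Type: image/jpeg\r\n\r\n" + jpeg + b"\r\n"
    return StreamingResponse(mjpeg(), media_type="multipart/x-mixed-replace; boundary=frame")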

Acknowledgements

This code is based on Ultralight-Digital-Human and SyncTalk. We thank the authors for their excellent work.

