Nene AI is an advanced voice-based assistant designed for seamless real-time interactions. It can:
- Act as a VTuber AI with a lively and affectionate personality
- Read and respond to live chat messages from YouTube Live
- Accept text input for conversation
- Record and process audio for real-time voice-based interactions
- Convert speech to text using Whisper
- Process text input and generate responses using DeepSeek-R1 14B via Ollama
- Convert text responses to speech using TTS (Text-to-Speech)
- Tune and play back audio using pydub
- Speak in a consistently warm, kind, and loving tone
Hardware
- GPU: Recommended RTX 3070 or higher for optimal performance
- RAM: Minimum 16GB, recommended 32GB+
- Storage: At least ~30GB free space
- OS: Windows 10/11, macOS, or Linux
Ensure you have the following installed:
- Python 3.8+ and < 3.10
- Dependencies:

```bash
pip install openai-whisper ollama TTS pydub torch
```

(OpenAI's Whisper is published on PyPI as `openai-whisper`; the package named `whisper` is unrelated.)
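The project tree includes a requirements.txt; a matching file might look like the following. This is an illustrative sketch of the dependency list, not the project's actual pinned file:

```text
openai-whisper
ollama
TTS
pydub
torch
```

The DeepSeek model itself is fetched separately through Ollama, e.g. `ollama pull deepseek-r1:14b`.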
```
project_root/
├── core/
│   └── audio_utils.py
├── voice/
│   ├── input-th.m4a
│   ├── idle/
│   │   ├── en_idle_1.wav
│   │   ├── en_idle_2.wav
│   │   ├── jp_idle_1.wav
│   │   ├── jp_idle_2.wav
│   │   ├── th_idle_1.wav
│   │   └── th_idle_2.wav
│   └── think/
│       ├── en_think_1.wav
│       ├── en_think_2.wav
│       ├── jp_think_1.wav
│       ├── jp_think_2.wav
│       ├── th_think_1.wav
│       └── th_think_2.wav
├── output/
│   └── ro-th.wav
├── target/
│   ├── speaker-en.wav
│   ├── speaker-jp.wav
│   └── speaker-th.wav
├── other/
│   ├── Nene.png
│   └── Terminal.png
├── run.py
├── requirements.txt
├── README.md
└── .env
```
The assistant is configured with the following personality (TH):
- Name: Nene
- Personality: Sweet, caring, playful, and affectionate
- Response style: Uses polite Thai language with the particles "ค่ะ" and "คะ" to sound gentle
- Restrictions: Cannot use "ครับ", as it is a masculine particle
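The code below references a `setup_role` dictionary holding the model name and system prompt. A minimal sketch of what it might contain follows; the exact prompt wording here is an assumption, not the project's actual prompt:

```python
# Hypothetical setup_role configuration; the prompt text is illustrative.
setup_role = {
    "model": "deepseek-r1:14b",
    "setup-role": (
        "You are Nene, a sweet, caring, playful, and affectionate VTuber AI. "
        "Respond in polite Thai, ending sentences with the feminine particles "
        '"ค่ะ" or "คะ". Never use the masculine particle "ครับ".'
    ),
}
```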
```python
import whisper

def speech_to_text(audio_path):
    # Load the base Whisper model and transcribe the input audio
    model = whisper.load_model("base")
    result = model.transcribe(audio_path, fp16=False)
    return result["text"]
```

Converts input audio to text using OpenAI's Whisper model.
```python
import ollama

def get_response_from_deepseek(text):
    # setup_role holds the model name and the system prompt
    response = ollama.chat(
        model=setup_role["model"],
        messages=[
            {"role": "system", "content": setup_role["setup-role"]},
            {"role": "user", "content": text},
        ],
    )
    return response["message"]["content"]
```

Uses DeepSeek-R1 14B via Ollama to generate a response.
```python
from TTS.api import TTS

def text_to_speech(name, lang, text):
    # Load a Fairseq VITS model for the target language, then clone
    # the reference speaker's voice onto the generated speech
    tts = TTS(model_name=f"tts_models/{lang}/fairseq/vits")
    tts.tts_with_vc_to_file(
        text,
        speaker_wav="./target/speaker-en.wav",
        file_path=f"./output/{name}.wav",
    )
```

Converts text to speech using TTS with voice cloning.

```python
play_audio(f"output/{name}.wav")
```

Plays the generated voice response.
```bash
python Talk_EN.py
```
The program will:
- Take an input audio file (input-th.m4a)
- Convert the speech to text
- Generate a response using DeepSeek-R1 14B
- Convert the response into a voice output
- Play the generated voice
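The steps above can be sketched as a single pipeline function. The wiring below is a hypothetical illustration (the project's actual script may differ); the processing callables are passed in as parameters so the flow is easy to follow:

```python
def run_pipeline(audio_path, name, lang, stt, respond, tts, play):
    """Hypothetical glue tying the steps together; names are illustrative."""
    text = stt(audio_path)        # 1. speech -> text (Whisper)
    reply = respond(text)         # 2. text -> response (DeepSeek-R1 via Ollama)
    tts(name, lang, reply)        # 3. response -> cloned voice (TTS)
    play(f"output/{name}.wav")    # 4. play the generated audio
    return reply
```

With the functions shown earlier, this could be called as `run_pipeline("voice/input-th.m4a", "ro-th", "th", speech_to_text, get_response_from_deepseek, text_to_speech, play_audio)`.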
- The voice tuning applies pitch and filter modifications for a natural Thai accent.
- The response is always in a cheerful, affectionate style.
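The pitch side of that tuning boils down to a resampling ratio: raising pitch by n semitones multiplies the playback rate by 2 ** (n / 12). A small sketch of the arithmetic (the +2 semitone figure is an illustrative assumption, not the project's actual setting):

```python
def shifted_frame_rate(frame_rate, semitones):
    # Raising pitch by n semitones multiplies frequency by 2 ** (n / 12)
    return int(frame_rate * (2 ** (semitones / 12.0)))

# e.g. a +2 semitone shift of 44.1 kHz audio resamples it to 49500 Hz
```

In pydub this rate is typically applied by re-spawning the segment with a `frame_rate` override and then calling `set_frame_rate()` to restore the original rate, which shifts the pitch while keeping the file playable.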
Planned features:
- Support for more languages
- Enhanced voice customization
- Integration with real-time voice input/output
This project is open-source and free to use under the MIT License.