update voice
parent 815f1dfbe5
commit 102ef4a0fe

@@ -0,0 +1 @@
sudo apt-get update && sudo apt-get install -y portaudio19-dev

@@ -0,0 +1,146 @@

# Design Doc: PocketFlow Voice Chat

> Please DON'T remove notes for AI

## Requirements

> Notes for AI: Keep it simple and clear.
> If the requirements are abstract, write concrete user stories.

- **Goal**: Enable users to interact with an LLM via voice in a continuous conversation, receiving spoken responses.
- **User Story 1**: As a user, I want to speak my query into a microphone so that the application can understand what I'm asking.
- **User Story 2**: As a user, I want the application to send my spoken query to an LLM for processing.
- **User Story 3**: As a user, I want to hear the LLM's response spoken back to me.
- **User Story 4**: As a user, after hearing the response, I want the application to be ready for my next spoken query without restarting.
- **Core Functionalities**:
  1. Capture audio input.
  2. Convert speech to text (STT).
  3. Process text with an LLM (maintaining conversation history).
  4. Convert the LLM's text response to speech (TTS).
  5. Play back the synthesized audio.
  6. Loop back to capture new audio input for a continuous conversation.

## Flow Design

> Notes for AI:
> 1. Consider the design patterns of agent, map-reduce, RAG, and workflow. Apply them if they fit.
> 2. Present a concise, high-level description of the workflow.

### Applicable Design Pattern

- **Workflow**: A sequential workflow with a loop is most appropriate. Each step (audio capture, STT, LLM query, TTS, audio playback) directly follows the previous, and after playback, the flow returns to the audio capture stage.

### Flow High-Level Design

The application will operate in a loop to allow for continuous conversation:
1. **`CaptureAudioNode`**: Records audio from the user's microphone when triggered.
2. **`SpeechToTextNode`**: Converts the recorded audio into text.
3. **`QueryLLMNode`**: Sends the transcribed text (with history) to an LLM and gets a text response.
4. **`TextToSpeechNode`**: Converts the LLM's text response into in-memory audio data and then plays it. After completion, the flow transitions back to the `CaptureAudioNode`.

```mermaid
flowchart TD
    CaptureAudio[Capture Audio] --> SpeechToText[Speech to Text]
    SpeechToText --> QueryLLM[Query LLM]
    QueryLLM --> TextToSpeech[Text to Speech & Play]
    TextToSpeech -- "Next Turn" --> CaptureAudio
```
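
To make the loop concrete, here is a minimal wiring sketch. It assumes PocketFlow's `Flow` API with `>>` for the default transition and `- "action" >>` for named actions; the node classes (and the hypothetical `nodes` module they live in) are described in the Node Design section below.

```python
# Hypothetical flow wiring for the diagram above (assumes the pocketflow package
# and a nodes.py containing the four node classes from the Node Design section).
from pocketflow import Flow
from nodes import CaptureAudioNode, SpeechToTextNode, QueryLLMNode, TextToSpeechNode


def create_voice_chat_flow():
    capture_audio = CaptureAudioNode()
    speech_to_text = SpeechToTextNode()
    query_llm = QueryLLMNode()
    text_to_speech = TextToSpeechNode()

    # Default transitions follow the diagram from top to bottom.
    capture_audio >> speech_to_text >> query_llm >> text_to_speech

    # "next_turn" loops back for a continuous conversation; "end_conversation"
    # has no successor, so returning it simply ends the flow.
    text_to_speech - "next_turn" >> capture_audio

    return Flow(start=capture_audio)


if __name__ == "__main__":
    shared = {"chat_history": [], "continue_conversation": True}
    create_voice_chat_flow().run(shared)
```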

## Utility Functions

> Notes for AI:
> 1. Understand the utility function definition thoroughly by reviewing the doc.
> 2. Include only the necessary utility functions, based on nodes in the flow.

1. **`record_audio()`** (`utils/audio_utils.py`)
    - *Input*: (Optional) `silence_threshold` (float, e.g., RMS energy), `min_silence_duration_ms` (int), `chunk_size_ms` (int), `sample_rate` (int, Hz), `channels` (int).
    - *Output*: A tuple `(audio_data, sample_rate)` where `audio_data` is in-memory audio (e.g., bytes or NumPy array) and `sample_rate` is the recording sample rate (int).
    - *Description*: Records audio from the microphone. Starts recording when sound is detected above `silence_threshold` (optional, or starts immediately) and stops after `min_silence_duration_ms` of sound below the threshold.
    - *Necessity*: Used by `CaptureAudioNode` to get the user's voice input.

2. **`speech_to_text_api(audio_data, sample_rate)`** (`utils/speech_to_text.py`)
    - *Input*: `audio_data` (bytes or NumPy array), `sample_rate` (int).
    - *Output*: `transcribed_text` (str).
    - *Necessity*: Used by `SpeechToTextNode` to convert in-memory audio data to text (see the sketch after this list).

3. **`call_llm(prompt, history)`** (`utils/llm_service.py`)
    - *Input*: `prompt` (str), `history` (list of dicts, e.g., `[{"role": "user", "content": "..."}]`).
    - *Output*: `llm_response_text` (str).
    - *Necessity*: Used by `QueryLLMNode` to get an intelligent response.

4. **`text_to_speech_api(text_to_synthesize)`** (`utils/text_to_speech.py`)
    - *Input*: `text_to_synthesize` (str).
    - *Output*: A tuple `(audio_data, sample_rate)` where `audio_data` is in-memory audio (e.g., NumPy array) and `sample_rate` is the audio sample rate (int).
    - *Necessity*: Used by `TextToSpeechNode` to convert LLM text to speakable in-memory audio data (see the sketch after this list).

5. **`play_audio_data(audio_data, sample_rate)`** (`utils/audio_utils.py`)
    - *Input*: `audio_data` (NumPy array), `sample_rate` (int).
    - *Output*: None.
    - *Necessity*: Used by `TextToSpeechNode` (in its `post` method) to play the in-memory synthesized speech.
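
`speech_to_text_api` and `text_to_speech_api` are the only utilities above that are not implemented in the files shown in this commit. Below is a hedged sketch of both, assuming OpenAI's `whisper-1` transcription and `tts-1` speech endpoints (any STT/TTS provider with equivalent calls would do); the in-memory WAV round-trip through `soundfile` is an implementation assumption, not something the design doc prescribes.

```python
# Illustrative sketches for utils/speech_to_text.py and utils/text_to_speech.py.
# Assumes the OpenAI audio endpoints; swap in any provider with equivalent calls.
import io
import os

import soundfile as sf
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))


def speech_to_text_api(audio_data, sample_rate):
    """Transcribes in-memory audio (NumPy float32 array) and returns the text."""
    buf = io.BytesIO()
    sf.write(buf, audio_data, sample_rate, format="WAV")  # encode to WAV in memory
    buf.seek(0)
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=("speech.wav", buf),  # (filename, file-like) so the API can infer the format
    )
    return transcript.text


def text_to_speech_api(text_to_synthesize):
    """Synthesizes speech for the given text; returns (audio_data, sample_rate)."""
    response = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=text_to_synthesize,
        response_format="wav",  # request WAV so soundfile can decode it
    )
    audio_data, sample_rate = sf.read(io.BytesIO(response.content), dtype="float32")
    return audio_data, sample_rate
```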

## Node Design

### Shared Memory

> Notes for AI: Try to minimize data redundancy

The shared memory structure is organized as follows:

```python
shared = {
    "user_audio_data": None,          # In-memory audio data (bytes or NumPy array) from user
    "user_audio_sample_rate": None,   # int: Sample rate of the user audio
    "user_text_query": None,          # str: Transcribed user text
    "llm_text_response": None,        # str: Text response from LLM
    # "llm_audio_data" and "llm_audio_sample_rate" are handled as exec_res within TextToSpeechNode's post method
    "chat_history": [],               # list: Conversation history [{"role": "user/assistant", "content": "..."}]
    "continue_conversation": True     # boolean: Flag to control the main conversation loop
}
```

### Node Steps

> Notes for AI: Carefully decide whether to use Batch/Async Node/Flow.

1. **`CaptureAudioNode`**
    - *Purpose*: Record audio input from the user using VAD.
    - *Type*: Regular
    - *Steps*:
        - *prep*: Check `shared["continue_conversation"]`. (Potentially load VAD parameters from `shared["config"]` if dynamic.)
        - *exec*: Call `utils.audio_utils.record_audio()` (passing VAD parameters if configured).
        - *post*: `audio_data, sample_rate = exec_res`. Write `audio_data` to `shared["user_audio_data"]` and `sample_rate` to `shared["user_audio_sample_rate"]`. Returns `"default"`.

2. **`SpeechToTextNode`**
    - *Purpose*: Convert the recorded in-memory audio to text.
    - *Type*: Regular
    - *Steps*:
        - *prep*: Read `shared["user_audio_data"]` and `shared["user_audio_sample_rate"]`. Return `(user_audio_data, user_audio_sample_rate)`.
        - *exec*: `audio_data, sample_rate = prep_res`. Call `utils.speech_to_text.speech_to_text_api(audio_data, sample_rate)`.
        - *post*:
            - Write `exec_res` (transcribed text) to `shared["user_text_query"]`.
            - Append `{"role": "user", "content": exec_res}` to `shared["chat_history"]`.
            - Clear `shared["user_audio_data"]` and `shared["user_audio_sample_rate"]` as they are no longer needed.
            - Returns `"default"`.

3. **`QueryLLMNode`**
    - *Purpose*: Get a response from the LLM based on the user's query and conversation history.
    - *Type*: Regular
    - *Steps*:
        - *prep*: Read `shared["user_text_query"]` and `shared["chat_history"]`. Return `(user_text_query, chat_history)`.
        - *exec*: Call `utils.llm_service.call_llm(prompt=prep_res[0], history=prep_res[1])`.
        - *post*:
            - Write `exec_res` (LLM text response) to `shared["llm_text_response"]`.
            - Append `{"role": "assistant", "content": exec_res}` to `shared["chat_history"]`.
            - Returns `"default"`.

4. **`TextToSpeechNode`**
    - *Purpose*: Convert the LLM's text response into speech and play it.
    - *Type*: Regular
    - *Steps*:
        - *prep*: Read `shared["llm_text_response"]`.
        - *exec*: Call `utils.text_to_speech.text_to_speech_api(prep_res)`. This returns `(llm_audio_data, llm_sample_rate)`.
        - *post*: `llm_audio_data, llm_sample_rate = exec_res`.
            - Call `utils.audio_utils.play_audio_data(llm_audio_data, llm_sample_rate)`.
            - (Optional) Log completion.
            - If `shared["continue_conversation"]` is `True`, return `"next_turn"` to loop back.
            - Otherwise, return `"end_conversation"`.
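
As a concrete example of the steps above, here is a sketch of `TextToSpeechNode` written against PocketFlow's `Node` base class (assuming the usual `prep`/`exec`/`post` signatures used in this doc); the other three nodes follow the same shape.

```python
# Sketch of step 4 as a PocketFlow node; CaptureAudioNode, SpeechToTextNode and
# QueryLLMNode follow the same prep/exec/post pattern.
from pocketflow import Node

from utils.audio_utils import play_audio_data
from utils.text_to_speech import text_to_speech_api


class TextToSpeechNode(Node):
    def prep(self, shared):
        # Read the LLM's text response from shared memory.
        return shared["llm_text_response"]

    def exec(self, prep_res):
        # Synthesize speech; returns (llm_audio_data, llm_sample_rate).
        return text_to_speech_api(prep_res)

    def post(self, shared, prep_res, exec_res):
        llm_audio_data, llm_sample_rate = exec_res
        play_audio_data(llm_audio_data, llm_sample_rate)
        # Loop back for the next turn, or end the conversation.
        if shared.get("continue_conversation", True):
            return "next_turn"
        return "end_conversation"
```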

@@ -0,0 +1,5 @@
openai
sounddevice
numpy
scipy
soundfile

@@ -0,0 +1,132 @@
import sounddevice as sd
import numpy as np
import time
# import wave # No longer needed for dummy file saving in main for play_audio_file
# import tempfile # No longer needed for dummy file saving in main
# import os # No longer needed for dummy file saving in main
# import soundfile as sf # No longer needed as play_audio_file is removed

DEFAULT_SAMPLE_RATE = 44100
DEFAULT_CHANNELS = 1
DEFAULT_CHUNK_SIZE_MS = 50  # Process audio in 50ms chunks for VAD
DEFAULT_SILENCE_THRESHOLD_RMS = 0.01  # RMS value, needs tuning
DEFAULT_MIN_SILENCE_DURATION_MS = 1000  # 1 second of silence to stop
DEFAULT_MAX_RECORDING_DURATION_S = 15  # Safety cap for recording
DEFAULT_PRE_ROLL_CHUNKS = 3  # Number of chunks to keep before speech starts

def record_audio(sample_rate=DEFAULT_SAMPLE_RATE,
                 channels=DEFAULT_CHANNELS,
                 chunk_size_ms=DEFAULT_CHUNK_SIZE_MS,
                 silence_threshold_rms=DEFAULT_SILENCE_THRESHOLD_RMS,
                 min_silence_duration_ms=DEFAULT_MIN_SILENCE_DURATION_MS,
                 max_recording_duration_s=DEFAULT_MAX_RECORDING_DURATION_S,
                 pre_roll_chunks_count=DEFAULT_PRE_ROLL_CHUNKS):
    """
    Records audio from the microphone with silence-based VAD.
    Returns in-memory audio data (NumPy array of float32) and sample rate.
    Returns (None, sample_rate) if recording fails or max duration is met without speech.
    """
    chunk_size_frames = int(sample_rate * chunk_size_ms / 1000)
    min_silence_chunks = int(min_silence_duration_ms / chunk_size_ms)
    max_chunks = int(max_recording_duration_s * 1000 / chunk_size_ms)

    print(f"Listening... (max {max_recording_duration_s}s). Speak when ready.")
    print(f"(Silence threshold RMS: {silence_threshold_rms}, Min silence duration: {min_silence_duration_ms}ms)")

    recorded_frames = []
    pre_roll_frames = []
    is_recording = False
    silence_counter = 0
    chunks_recorded = 0

    stream = None
    try:
        stream = sd.InputStream(samplerate=sample_rate, channels=channels, dtype='float32')
        stream.start()

        for i in range(max_chunks):
            audio_chunk, overflowed = stream.read(chunk_size_frames)
            if overflowed:
                print("Warning: Audio buffer overflowed!")

            rms = np.sqrt(np.mean(audio_chunk**2))

            if is_recording:
                recorded_frames.append(audio_chunk)
                chunks_recorded += 1
                if rms < silence_threshold_rms:
                    silence_counter += 1
                    if silence_counter >= min_silence_chunks:
                        print("Silence detected, stopping recording.")
                        break
                else:
                    silence_counter = 0  # Reset silence counter on sound
            else:
                pre_roll_frames.append(audio_chunk)
                if len(pre_roll_frames) > pre_roll_chunks_count:
                    pre_roll_frames.pop(0)

                if rms > silence_threshold_rms:
                    print("Speech detected, starting recording.")
                    is_recording = True
                    for frame_to_add in pre_roll_frames:
                        recorded_frames.append(frame_to_add)
                    chunks_recorded = len(recorded_frames)
                    pre_roll_frames.clear()

            if i == max_chunks - 1 and not is_recording:
                print("No speech detected within the maximum recording duration.")
                stream.stop()
                stream.close()
                return None, sample_rate

        if not recorded_frames and is_recording:
            print("Recording started but captured no frames before stopping. This might be due to immediate silence.")

    except Exception as e:
        print(f"Error during recording: {e}")
        return None, sample_rate
    finally:
        if stream and not stream.closed:
            stream.stop()
            stream.close()

    if not recorded_frames:
        print("No audio was recorded.")
        return None, sample_rate

    audio_data = np.concatenate(recorded_frames)
    print(f"Recording finished. Total duration: {len(audio_data)/sample_rate:.2f}s")
    return audio_data, sample_rate

def play_audio_data(audio_data, sample_rate):
    """Plays in-memory audio data (NumPy array)."""
    try:
        print(f"Playing in-memory audio data (Sample rate: {sample_rate} Hz, Duration: {len(audio_data)/sample_rate:.2f}s)")
        sd.play(audio_data, sample_rate)
        sd.wait()
        print("Playback from memory finished.")
    except Exception as e:
        print(f"Error playing in-memory audio: {e}")

if __name__ == "__main__":
    print("--- Testing audio_utils.py ---")

    # Test 1: record_audio() and play_audio_data() (in-memory)
    print("\n--- Test: Record and Play In-Memory Audio ---")
    print("Please speak into the microphone. Recording will start on sound and stop on silence.")
    recorded_audio, rec_sr = record_audio(
        sample_rate=DEFAULT_SAMPLE_RATE,
        silence_threshold_rms=0.02,
        min_silence_duration_ms=1500,
        max_recording_duration_s=10
    )

    if recorded_audio is not None and rec_sr is not None:
        print(f"Recorded audio data shape: {recorded_audio.shape}, Sample rate: {rec_sr} Hz")
        play_audio_data(recorded_audio, rec_sr)
    else:
        print("No audio recorded or recording failed.")

    print("\n--- audio_utils.py tests finished. ---")

@@ -0,0 +1,45 @@
import os
from openai import OpenAI


def call_llm(prompt, history=None):
    """
    Calls the OpenAI API to get a response from an LLM.

    Args:
        prompt: The user's current prompt.
        history: A list of previous messages in the conversation, where each message
            is a dict with "role" and "content" keys. E.g.,
            [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi there!"}]

    Returns:
        The LLM's response content as a string.
    """
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "your-api-key"))  # Default if not set

    messages = []
    if history:
        messages.extend(history)
    messages.append({"role": "user", "content": prompt})

    r = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    return r.choices[0].message.content


if __name__ == "__main__":
    # Ensure you have OPENAI_API_KEY set in your environment for this test to work
    print("Testing LLM call...")

    # Test with a simple prompt
    response = call_llm("Tell me a short joke")
    print(f"LLM (Simple Joke): {response}")

    # Test with history
    chat_history = [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."}
    ]
    follow_up_prompt = "And what is a famous landmark there?"
    response_with_history = call_llm(follow_up_prompt, history=chat_history)
    print(f"LLM (Follow-up with History): {response_with_history}")