update voice

zachary62 2025-05-13 14:48:30 -04:00
parent 815f1dfbe5
commit 102ef4a0fe
6 changed files with 329 additions and 0 deletions


@@ -0,0 +1 @@
sudo apt-get update && sudo apt-get install -y portaudio19-dev


@@ -0,0 +1,146 @@
# Design Doc: PocketFlow Voice Chat
> Please DON'T remove notes for AI
## Requirements
> Notes for AI: Keep it simple and clear.
> If the requirements are abstract, write concrete user stories
- **Goal**: Enable users to interact with an LLM via voice in a continuous conversation, receiving spoken responses.
- **User Story 1**: As a user, I want to speak my query into a microphone so that the application can understand what I'm asking.
- **User Story 2**: As a user, I want the application to send my spoken query to an LLM for processing.
- **User Story 3**: As a user, I want to hear the LLM's response spoken back to me.
- **User Story 4**: As a user, after hearing the response, I want the application to be ready for my next spoken query without restarting.
- **Core Functionalities**:
  1. Capture audio input.
  2. Convert speech to text (STT).
  3. Process text with an LLM (maintaining conversation history).
  4. Convert LLM text response to speech (TTS).
  5. Play back synthesized audio.
  6. Loop back to capture new audio input for a continuous conversation.
## Flow Design
> Notes for AI:
> 1. Consider the design patterns of agent, map-reduce, rag, and workflow. Apply them if they fit.
> 2. Present a concise, high-level description of the workflow.
### Applicable Design Pattern:
- **Workflow**: A sequential workflow with a loop is most appropriate. Each step (audio capture, STT, LLM query, TTS, audio playback) directly follows the previous, and after playback, the flow returns to the audio capture stage.
### Flow High-Level Design:
The application will operate in a loop to allow for continuous conversation:
1. **`CaptureAudioNode`**: Records audio from the user's microphone when triggered.
2. **`SpeechToTextNode`**: Converts the recorded audio into text.
3. **`QueryLLMNode`**: Sends the transcribed text (with history) to an LLM and gets a text response.
4. **`TextToSpeechNode`**: Converts the LLM's text response into in-memory audio data and then plays it. After completion, the flow transitions back to the `CaptureAudioNode`.
```mermaid
flowchart TD
CaptureAudio[Capture Audio] --> SpeechToText[Speech to Text]
SpeechToText --> QueryLLM[Query LLM]
QueryLLM --> TextToSpeech[Text to Speech & Play]
TextToSpeech -- "Next Turn" --> CaptureAudio
```
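
To make the loop concrete, here is a minimal wiring sketch, assuming PocketFlow's `Node`/`Flow` API (`>>` for the default transition, `- "action" >>` for a named action). The `nodes` module and the `create_voice_chat_flow` helper are illustrative names, not part of this commit.

```python
# flow.py (illustrative): wire the four nodes into a looping conversation flow.
from pocketflow import Flow
from nodes import CaptureAudioNode, SpeechToTextNode, QueryLLMNode, TextToSpeechNode  # hypothetical module


def create_voice_chat_flow():
    capture = CaptureAudioNode()
    stt = SpeechToTextNode()
    llm = QueryLLMNode()
    tts = TextToSpeechNode()

    # Default transitions follow the diagram above.
    capture >> stt >> llm >> tts
    # "Next Turn": loop back to capture the next utterance.
    tts - "next_turn" >> capture

    return Flow(start=capture)
```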
## Utility Functions
> Notes for AI:
> 1. Understand the utility function definition thoroughly by reviewing the doc.
> 2. Include only the necessary utility functions, based on nodes in the flow.
1. **`record_audio()`** (`utils/audio_utils.py`)
   - *Input*: (Optional) `silence_threshold` (float, e.g., RMS energy), `min_silence_duration_ms` (int), `chunk_size_ms` (int), `sample_rate` (int, Hz), `channels` (int).
   - *Output*: A tuple `(audio_data, sample_rate)` where `audio_data` is in-memory audio (e.g., bytes or NumPy array) and `sample_rate` is the recording sample rate (int).
   - *Description*: Records audio from the microphone. Starts recording when sound is detected above `silence_threshold` (optional, or starts immediately) and stops after `min_silence_duration_ms` of sound below the threshold.
   - *Necessity*: Used by `CaptureAudioNode` to get the user's voice input.
2. **`speech_to_text_api(audio_data, sample_rate)`** (`utils/speech_to_text.py`)
   - *Input*: `audio_data` (bytes or NumPy array), `sample_rate` (int).
   - *Output*: `transcribed_text` (str).
   - *Necessity*: Used by `SpeechToTextNode` to convert in-memory audio data to text (a sketch appears after this list).
3. **`call_llm(prompt, history)`** (`utils/llm_service.py`)
   - *Input*: `prompt` (str), `history` (list of dicts, e.g., `[{"role": "user", "content": "..."}]`).
   - *Output*: `llm_response_text` (str).
   - *Necessity*: Used by `QueryLLMNode` to get an intelligent response.
4. **`text_to_speech_api(text_to_synthesize)`** (`utils/text_to_speech.py`)
   - *Input*: `text_to_synthesize` (str).
   - *Output*: A tuple `(audio_data, sample_rate)` where `audio_data` is in-memory audio (e.g., NumPy array) and `sample_rate` is the audio sample rate (int).
   - *Necessity*: Used by `TextToSpeechNode` to convert LLM text to speakable in-memory audio data (a sketch appears after this list).
5. **`play_audio_data(audio_data, sample_rate)`** (`utils/audio_utils.py`)
   - *Input*: `audio_data` (NumPy array), `sample_rate` (int).
   - *Output*: None.
   - *Necessity*: Used by `TextToSpeechNode` (in its `post` method) to play the in-memory synthesized speech.
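
The STT and TTS utilities (items 2 and 4) are not included in this commit. The following is a minimal sketch of how they could be implemented with OpenAI's audio endpoints (`whisper-1` for transcription, `tts-1` for synthesis); the model choices and the WAV round-tripping via `soundfile` are assumptions, not the project's actual implementation.

```python
# utils/speech_to_text.py and utils/text_to_speech.py (illustrative sketch).
import io
import os

import soundfile as sf
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))


def speech_to_text_api(audio_data, sample_rate):
    """Transcribes in-memory audio (NumPy float32 array) to text."""
    buf = io.BytesIO()
    sf.write(buf, audio_data, sample_rate, format="WAV")
    buf.seek(0)
    buf.name = "speech.wav"  # the SDK uses the name to infer the file format
    transcript = client.audio.transcriptions.create(model="whisper-1", file=buf)
    return transcript.text


def text_to_speech_api(text_to_synthesize):
    """Synthesizes text into (audio_data, sample_rate) as an in-memory NumPy array."""
    response = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=text_to_synthesize,
        response_format="wav",
    )
    audio_data, sample_rate = sf.read(io.BytesIO(response.content), dtype="float32")
    return audio_data, sample_rate
```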
## Node Design
### Shared Memory
> Notes for AI: Try to minimize data redundancy
The shared memory structure is organized as follows:
```python
shared = {
    "user_audio_data": None,        # In-memory audio data (bytes or NumPy array) from user
    "user_audio_sample_rate": None, # int: Sample rate of the user audio
    "user_text_query": None,        # str: Transcribed user text
    "llm_text_response": None,      # str: Text response from LLM
    # "llm_audio_data" and "llm_audio_sample_rate" are handled as exec_res within TextToSpeechNode's post method
    "chat_history": [],             # list: Conversation history [{"role": "user/assistant", "content": "..."}]
    "continue_conversation": True   # boolean: Flag to control the main conversation loop
}
```
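
For context, a minimal (hypothetical) `main.py` could initialize this shared store and start the conversation loop in a single call, assuming the `create_voice_chat_flow` helper sketched earlier and PocketFlow's `Flow.run(shared)`:

```python
# main.py (illustrative): initialize the shared store and run the flow.
from flow import create_voice_chat_flow  # hypothetical helper sketched above


def main():
    shared = {
        "user_audio_data": None,
        "user_audio_sample_rate": None,
        "user_text_query": None,
        "llm_text_response": None,
        "chat_history": [],
        "continue_conversation": True,
    }
    flow = create_voice_chat_flow()
    flow.run(shared)  # the "next_turn" action keeps the loop going until it ends


if __name__ == "__main__":
    main()
```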
### Node Steps
> Notes for AI: Carefully decide whether to use Batch/Async Node/Flow.
1. **`CaptureAudioNode`**
   - *Purpose*: Record audio input from the user using VAD.
   - *Type*: Regular
   - *Steps*:
     - *prep*: Check `shared["continue_conversation"]`. (Potentially load VAD parameters from `shared["config"]` if dynamic).
     - *exec*: Call `utils.audio_utils.record_audio()` (passing VAD parameters if configured).
     - *post*: `audio_data, sample_rate = exec_res`. Write `audio_data` to `shared["user_audio_data"]` and `sample_rate` to `shared["user_audio_sample_rate"]`. Returns `"default"`.
2. **`SpeechToTextNode`**
   - *Purpose*: Convert the recorded in-memory audio to text.
   - *Type*: Regular
   - *Steps*:
     - *prep*: Read `shared["user_audio_data"]` and `shared["user_audio_sample_rate"]`. Return `(user_audio_data, user_audio_sample_rate)`.
     - *exec*: `audio_data, sample_rate = prep_res`. Call `utils.speech_to_text.speech_to_text_api(audio_data, sample_rate)`.
     - *post*:
       - Write `exec_res` (transcribed text) to `shared["user_text_query"]`.
       - Append `{"role": "user", "content": exec_res}` to `shared["chat_history"]`.
       - Clear `shared["user_audio_data"]` and `shared["user_audio_sample_rate"]` as they are no longer needed.
       - Returns `"default"`.
3. **`QueryLLMNode`**
   - *Purpose*: Get a response from the LLM based on the user's query and conversation history.
   - *Type*: Regular
   - *Steps*:
     - *prep*: Read `shared["user_text_query"]` and `shared["chat_history"]`. Return `(user_text_query, chat_history)`.
     - *exec*: Call `utils.llm_service.call_llm(prompt=prep_res[0], history=prep_res[1])`.
     - *post*:
       - Write `exec_res` (LLM text response) to `shared["llm_text_response"]`.
       - Append `{"role": "assistant", "content": exec_res}` to `shared["chat_history"]`.
       - Returns `"default"`.
4. **`TextToSpeechNode`**
   - *Purpose*: Convert the LLM's text response into speech and play it.
   - *Type*: Regular
   - *Steps*:
     - *prep*: Read `shared["llm_text_response"]`.
     - *exec*: Call `utils.text_to_speech.text_to_speech_api(prep_res)`. This returns `(llm_audio_data, llm_sample_rate)`.
     - *post*: `llm_audio_data, llm_sample_rate = exec_res`.
       - Call `utils.audio_utils.play_audio_data(llm_audio_data, llm_sample_rate)`.
       - (Optional) Log completion.
       - If `shared["continue_conversation"]` is `True`, return `"next_turn"` to loop back.
       - Otherwise, return `"end_conversation"`.
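
As a concrete illustration of the steps above, here is a minimal sketch of two of these nodes, assuming PocketFlow's `Node` base class with `prep`/`exec`/`post` methods and string actions. The `utils.speech_to_text` and `utils.text_to_speech` imports refer to utility modules that are not part of this commit.

```python
# nodes.py (illustrative): two of the four nodes described above.
from pocketflow import Node

from utils.audio_utils import play_audio_data
from utils.speech_to_text import speech_to_text_api   # not in this commit
from utils.text_to_speech import text_to_speech_api   # not in this commit


class SpeechToTextNode(Node):
    def prep(self, shared):
        return shared["user_audio_data"], shared["user_audio_sample_rate"]

    def exec(self, prep_res):
        audio_data, sample_rate = prep_res
        return speech_to_text_api(audio_data, sample_rate)

    def post(self, shared, prep_res, exec_res):
        shared["user_text_query"] = exec_res
        shared["chat_history"].append({"role": "user", "content": exec_res})
        # The raw audio is no longer needed once it has been transcribed.
        shared["user_audio_data"] = None
        shared["user_audio_sample_rate"] = None
        return "default"


class TextToSpeechNode(Node):
    def prep(self, shared):
        return shared["llm_text_response"]

    def exec(self, prep_res):
        return text_to_speech_api(prep_res)

    def post(self, shared, prep_res, exec_res):
        llm_audio_data, llm_sample_rate = exec_res
        play_audio_data(llm_audio_data, llm_sample_rate)
        # Loop back for the next utterance unless the conversation was ended.
        return "next_turn" if shared["continue_conversation"] else "end_conversation"
```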


@@ -0,0 +1,5 @@
openai
sounddevice
numpy
scipy
soundfile


@@ -0,0 +1,132 @@
import sounddevice as sd
import numpy as np
import time

# import wave # No longer needed for dummy file saving in main for play_audio_file
# import tempfile # No longer needed for dummy file saving in main
# import os # No longer needed for dummy file saving in main
# import soundfile as sf # No longer needed as play_audio_file is removed

DEFAULT_SAMPLE_RATE = 44100
DEFAULT_CHANNELS = 1
DEFAULT_CHUNK_SIZE_MS = 50  # Process audio in 50ms chunks for VAD
DEFAULT_SILENCE_THRESHOLD_RMS = 0.01  # RMS value, needs tuning
DEFAULT_MIN_SILENCE_DURATION_MS = 1000  # 1 second of silence to stop
DEFAULT_MAX_RECORDING_DURATION_S = 15  # Safety cap for recording
DEFAULT_PRE_ROLL_CHUNKS = 3  # Number of chunks to keep before speech starts


def record_audio(sample_rate=DEFAULT_SAMPLE_RATE,
                 channels=DEFAULT_CHANNELS,
                 chunk_size_ms=DEFAULT_CHUNK_SIZE_MS,
                 silence_threshold_rms=DEFAULT_SILENCE_THRESHOLD_RMS,
                 min_silence_duration_ms=DEFAULT_MIN_SILENCE_DURATION_MS,
                 max_recording_duration_s=DEFAULT_MAX_RECORDING_DURATION_S,
                 pre_roll_chunks_count=DEFAULT_PRE_ROLL_CHUNKS):
    """
    Records audio from the microphone with silence-based VAD.
    Returns in-memory audio data (NumPy array of float32) and sample rate.
    Returns (None, sample_rate) if recording fails or max duration is met without speech.
    """
    chunk_size_frames = int(sample_rate * chunk_size_ms / 1000)
    min_silence_chunks = int(min_silence_duration_ms / chunk_size_ms)
    max_chunks = int(max_recording_duration_s * 1000 / chunk_size_ms)

    print(f"Listening... (max {max_recording_duration_s}s). Speak when ready.")
    print(f"(Silence threshold RMS: {silence_threshold_rms}, Min silence duration: {min_silence_duration_ms}ms)")

    recorded_frames = []
    pre_roll_frames = []
    is_recording = False
    silence_counter = 0
    chunks_recorded = 0
    stream = None

    try:
        stream = sd.InputStream(samplerate=sample_rate, channels=channels, dtype='float32')
        stream.start()
        for i in range(max_chunks):
            audio_chunk, overflowed = stream.read(chunk_size_frames)
            if overflowed:
                print("Warning: Audio buffer overflowed!")
            rms = np.sqrt(np.mean(audio_chunk**2))

            if is_recording:
                recorded_frames.append(audio_chunk)
                chunks_recorded += 1
                if rms < silence_threshold_rms:
                    silence_counter += 1
                    if silence_counter >= min_silence_chunks:
                        print("Silence detected, stopping recording.")
                        break
                else:
                    silence_counter = 0  # Reset silence counter on sound
            else:
                pre_roll_frames.append(audio_chunk)
                if len(pre_roll_frames) > pre_roll_chunks_count:
                    pre_roll_frames.pop(0)
                if rms > silence_threshold_rms:
                    print("Speech detected, starting recording.")
                    is_recording = True
                    for frame_to_add in pre_roll_frames:
                        recorded_frames.append(frame_to_add)
                    chunks_recorded = len(recorded_frames)
                    pre_roll_frames.clear()

            if i == max_chunks - 1 and not is_recording:
                print("No speech detected within the maximum recording duration.")
                return None, sample_rate

        if not recorded_frames and is_recording:
            print("Recording started but captured no frames before stopping. This might be due to immediate silence.")
    except Exception as e:
        print(f"Error during recording: {e}")
        return None, sample_rate
    finally:
        if stream and not stream.closed:
            stream.stop()
            stream.close()

    if not recorded_frames:
        print("No audio was recorded.")
        return None, sample_rate

    audio_data = np.concatenate(recorded_frames)
    print(f"Recording finished. Total duration: {len(audio_data)/sample_rate:.2f}s")
    return audio_data, sample_rate


def play_audio_data(audio_data, sample_rate):
    """Plays in-memory audio data (NumPy array)."""
    try:
        print(f"Playing in-memory audio data (Sample rate: {sample_rate} Hz, Duration: {len(audio_data)/sample_rate:.2f}s)")
        sd.play(audio_data, sample_rate)
        sd.wait()
        print("Playback from memory finished.")
    except Exception as e:
        print(f"Error playing in-memory audio: {e}")


if __name__ == "__main__":
    print("--- Testing audio_utils.py ---")

    # Test 1: record_audio() and play_audio_data() (in-memory)
    print("\n--- Test: Record and Play In-Memory Audio ---")
    print("Please speak into the microphone. Recording will start on sound and stop on silence.")
    recorded_audio, rec_sr = record_audio(
        sample_rate=DEFAULT_SAMPLE_RATE,
        silence_threshold_rms=0.02,
        min_silence_duration_ms=1500,
        max_recording_duration_s=10
    )
    if recorded_audio is not None and rec_sr is not None:
        print(f"Recorded audio data shape: {recorded_audio.shape}, Sample rate: {rec_sr} Hz")
        play_audio_data(recorded_audio, rec_sr)
    else:
        print("No audio recorded or recording failed.")

    print("\n--- audio_utils.py tests finished. ---")


@@ -0,0 +1,45 @@
import os
from openai import OpenAI


def call_llm(prompt, history=None):
    """
    Calls the OpenAI API to get a response from an LLM.

    Args:
        prompt: The user's current prompt.
        history: A list of previous messages in the conversation, where each message
                 is a dict with "role" and "content" keys. E.g.,
                 [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi there!"}]

    Returns:
        The LLM's response content as a string.
    """
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "your-api-key"))  # Default if not set
    messages = []
    if history:
        messages.extend(history)
    messages.append({"role": "user", "content": prompt})
    r = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    return r.choices[0].message.content


if __name__ == "__main__":
    # Ensure you have OPENAI_API_KEY set in your environment for this test to work
    print("Testing LLM call...")

    # Test with a simple prompt
    response = call_llm("Tell me a short joke")
    print(f"LLM (Simple Joke): {response}")

    # Test with history
    chat_history = [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."}
    ]
    follow_up_prompt = "And what is a famous landmark there?"
    response_with_history = call_llm(follow_up_prompt, history=chat_history)
    print(f"LLM (Follow-up with History): {response_with_history}")