finish voice chat

This commit is contained in:
zachary62 2025-05-13 16:34:28 -04:00
parent c4898bf307
commit 76b5886dd0
8 changed files with 321 additions and 35 deletions

View File

@ -76,12 +76,12 @@ From there, it's easy to implement popular design patterns like ([Multi-](https:
| [Batch](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-batch) | ☆☆☆ <br> *Dummy* | A batch processor that translates markdown content into multiple languages |
| [Streaming](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-llm-streaming) | ☆☆☆ <br> *Dummy* | A real-time LLM streaming demo with user interrupt capability |
| [Chat Guardrail](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-chat-guardrail) | ☆☆☆ <br> *Dummy* | A travel advisor chatbot that only processes travel-related queries |
| [Map-Reduce](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-map-reduce) | ★☆☆ <br> *Beginner* | A resume qualification processor using map-reduce pattern for batch evaluation |
| [Majority Vote](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-majority-vote) | ☆☆☆ <br> *Dummy* | Improve reasoning accuracy by aggregating multiple solution attempts |
| [Map-Reduce](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-map-reduce) | ☆☆☆ <br> *Dummy* | A resume qualification processor using map-reduce pattern for batch evaluation |
| [Multi-Agent](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-multi-agent) | ★☆☆ <br> *Beginner* | A Taboo word game for asynchronous communication between two agents |
| [Supervisor](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-supervisor) | ★☆☆ <br> *Beginner* | Research agent is getting unreliable... Let's build a supervision process |
| [Parallel](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-parallel-batch) | ★☆☆ <br> *Beginner* | A parallel execution demo that shows 3x speedup |
| [Parallel Flow](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-parallel-batch-flow) | ★☆☆ <br> *Beginner* | A parallel image processing demo showing 8x speedup with multiple filters |
| [Majority Vote](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-majority-vote) | ★☆☆ <br> *Beginner* | Improve reasoning accuracy by aggregating multiple solution attempts |
| [Thinking](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-thinking) | ★☆☆ <br> *Beginner* | Solve complex reasoning problems through Chain-of-Thought |
| [Memory](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-chat-memory) | ★☆☆ <br> *Beginner* | A chat bot with short-term and long-term memory |
| [Text2SQL](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-text2sql) | ★☆☆ <br> *Beginner* | Convert natural language to SQL queries with an auto-debug loop |
@ -89,6 +89,7 @@ From there, it's easy to implement popular design patterns like ([Multi-](https:
| [A2A](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-a2a) | ★☆☆ <br> *Beginner* | Agent wrapped with Agent-to-Agent protocol for inter-agent communication |
| [Streamlit HITL](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-streamlit-hitl) | ★☆☆ <br> *Beginner* | Streamlit app for human-in-the-loop review |
| [FastAPI HITL](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-fastapi-hitl) | ★☆☆ <br> *Beginner* | FastAPI app for async human review loop with SSE |
| [Voice Chat](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-voice-chat) | ★☆☆ <br> *Beginner* | An interactive voice chat application with VAD, STT, LLM, and TTS |
</div>

View File

@ -1 +1,83 @@
sudo apt-get update && sudo apt-get install -y portaudio19-dev
# PocketFlow Voice Chat
This project demonstrates a voice-based interactive chat application built with PocketFlow. Users can speak their queries, and the system will respond with spoken answers from an LLM, maintaining conversation history.
## Features
- **Voice Activity Detection (VAD)**: Automatically detects when the user starts and stops speaking.
- **Speech-to-Text (STT)**: Converts spoken audio into text using OpenAI.
- **LLM Interaction**: Processes the transcribed text with an LLM (e.g., GPT-4o), maintaining conversation history.
- **Text-to-Speech (TTS)**: Converts the LLM's text response back into audible speech using OpenAI.
- **Continuous Conversation**: Loops back to listen for the next user query after responding, allowing for an ongoing dialogue.
## How to Run
1. **Set your OpenAI API key**:
```bash
export OPENAI_API_KEY="your-api-key-here"
```
Ensure this environment variable is set, as the utility scripts for STT, LLM, and TTS rely on it.
You can test individual utility functions (e.g., `python utils/call_llm.py`, `python utils/text_to_speech.py`) to help verify your API key and setup; a minimal sketch of these utilities appears after this list.
2. **Install dependencies**:
Make sure you have Python installed. Then, install the required libraries using pip:
```bash
pip install -r requirements.txt
```
This will install libraries such as `openai`, `pocketflow`, `sounddevice`, `numpy`, `scipy`, and `soundfile`.
**Note for Linux users**: `sounddevice` may require PortAudio. If you encounter issues, you might need to install it first:
```bash
sudo apt-get update && sudo apt-get install -y portaudio19-dev
```
3. **Run the application**:
```bash
python main.py
```
Follow the console prompts. The application will start listening when you see "Listening for your query...".
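If you want to sanity-check the utilities before running the full app, here is a rough sketch of what `utils/call_llm.py` and `utils/text_to_speech.py` typically look like with the OpenAI Python SDK. These files are not shown in this commit, so treat the function bodies (and the `alloy` voice choice) as illustrative assumptions rather than the shipped implementation:
```python
# Hypothetical minimal shapes for utils/call_llm.py and utils/text_to_speech.py.
# Model names follow the design doc; the "alloy" voice is an assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_llm(messages):
    """messages: full chat history, e.g. [{"role": "user", "content": "..."}]."""
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content

def text_to_speech_api(text_to_synthesize):
    """Returns (mp3_bytes, sample_rate); OpenAI TTS output is 24 kHz."""
    response = client.audio.speech.create(
        model="gpt-4o-mini-tts", voice="alloy", input=text_to_synthesize)
    return response.content, 24000  # raw MP3 bytes

if __name__ == "__main__":
    print(call_llm([{"role": "user", "content": "Say hello in one sentence."}]))
```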
## How It Works
The application uses a PocketFlow workflow to manage the conversation steps:
```mermaid
flowchart TD
CaptureAudio[Capture Audio] --> SpeechToText[Speech to Text]
SpeechToText --> QueryLLM[Query LLM]
QueryLLM --> TextToSpeech[Text to Speech & Play]
TextToSpeech -- "Next Turn" --> CaptureAudio
```
Here's what each node in the flow does:
1. **`CaptureAudioNode`**: Records audio from the user's microphone. It uses Voice Activity Detection (VAD) to start recording when speech is detected and stop when silence is detected.
2. **`SpeechToTextNode`**: Takes the recorded audio data, converts it to a suitable format, and sends it to OpenAI's STT API (gpt-4o-transcribe) to get the transcribed text.
3. **`QueryLLMNode`**: Takes the transcribed text from the user, along with the existing conversation history, and sends it to an LLM (OpenAI's GPT-4o model) to generate an intelligent response.
4. **`TextToSpeechNode`**: Receives the text response from the LLM, converts it into audio using OpenAI's TTS API (gpt-4o-mini-tts), and plays the audio back to the user. If the conversation is set to continue, it transitions back to the `CaptureAudioNode`.
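Wiring these four nodes together takes only a few lines of PocketFlow code; the condensed sketch below mirrors the `flow.py` added in this commit:
```python
from pocketflow import Flow
from nodes import CaptureAudioNode, SpeechToTextNode, QueryLLMNode, TextToSpeechNode

def create_voice_chat_flow() -> Flow:
    capture_audio = CaptureAudioNode()
    speech_to_text = SpeechToTextNode()
    query_llm = QueryLLMNode()
    text_to_speech = TextToSpeechNode()

    # Default transitions follow the diagram above
    capture_audio >> speech_to_text
    speech_to_text >> query_llm
    query_llm >> text_to_speech
    # "next_turn" loops back for the next utterance
    text_to_speech - "next_turn" >> capture_audio

    return Flow(start=capture_audio)
```
The `"next_turn"` action is what keeps the conversation looping; `"end_conversation"` has no successor defined, so returning it simply ends the flow.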
## Example Interaction
When you run `main.py`:
1. The console will display:
```
Starting PocketFlow Voice Chat...
Speak your query after 'Listening for your query...' appears.
...
```
2. When you see `Listening for your query...`, speak clearly into your microphone.
3. After you stop speaking, the console will show updates:
```
Audio captured (X.XXs), proceeding to STT.
Converting speech to text...
User: [Your transcribed query will appear here]
Sending query to LLM...
LLM: [The LLM's response text will appear here]
Converting LLM response to speech...
Playing LLM response...
```
4. You will hear the LLM's response spoken aloud.
5. The application will then loop back, and you'll see `Listening for your query...` again, ready for your next input.
The conversation continues in this manner. To stop the application, you typically need to interrupt it (e.g., Ctrl+C in the terminal), as it's designed to loop continuously.

View File

@ -53,28 +53,31 @@ flowchart TD
> 2. Include only the necessary utility functions, based on nodes in the flow.
1. **`record_audio()`** (`utils/audio_utils.py`)
- *Input*: (Optional) `silence_threshold` (float, e.g., RMS energy), `min_silence_duration_ms` (int), `chunk_size_ms` (int), `sample_rate` (int, Hz), `channels` (int).
- *Output*: A tuple `(audio_data, sample_rate)` where `audio_data` is in-memory audio (e.g., bytes or NumPy array) and `sample_rate` is the recording sample rate (int).
- *Description*: Records audio from the microphone. Starts recording when sound is detected above `silence_threshold` (optional, or starts immediately) and stops after `min_silence_duration_ms` of sound below the threshold.
- *Input*: (Optional) `sample_rate` (int, Hz, e.g., `DEFAULT_SAMPLE_RATE`), `channels` (int, e.g., `DEFAULT_CHANNELS`), `chunk_size_ms` (int, e.g., `DEFAULT_CHUNK_SIZE_MS`), `silence_threshold_rms` (float, e.g., `DEFAULT_SILENCE_THRESHOLD_RMS`), `min_silence_duration_ms` (int, e.g., `DEFAULT_MIN_SILENCE_DURATION_MS`), `max_recording_duration_s` (int, e.g., `DEFAULT_MAX_RECORDING_DURATION_S`), `pre_roll_chunks_count` (int, e.g., `DEFAULT_PRE_ROLL_CHUNKS`).
- *Output*: A tuple `(audio_data, sample_rate)` where `audio_data` is a NumPy array of float32 audio samples, and `sample_rate` is the recording sample rate (int). Returns `(None, sample_rate)` if no speech is detected or recording fails.
- *Description*: Records audio from the microphone using silence-based Voice Activity Detection (VAD). Buffers `pre_roll_chunks_count` of audio and starts full recording when sound is detected above `silence_threshold_rms`. Stops after `min_silence_duration_ms` of sound below the threshold or if `max_recording_duration_s` is reached.
- *Necessity*: Used by `CaptureAudioNode` to get the user's voice input. (A minimal `record_audio`/`play_audio_data` sketch follows this utility list.)
2. **`speech_to_text_api(audio_data, sample_rate)`** (`utils/speech_to_text.py`)
- *Input*: `audio_data` (bytes or NumPy array), `sample_rate` (int).
- *Input*: `audio_data` (bytes), `sample_rate` (int, though the API might infer this from the audio format).
- *Output*: `transcribed_text` (str).
- *Necessity*: Used by `SpeechToTextNode` to convert in-memory audio data to text.
- *Example Model*: OpenAI `gpt-4o-transcribe`.
3. **`call_llm(prompt, history)`** (`utils/llm_service.py`)
- *Input*: `prompt` (str), `history` (list of dicts, e.g., `[{"role": "user", "content": "..."}]`)
3. **`call_llm(messages)`** (`utils/call_llm.py`)
- *Input*: `messages` (list of dicts, e.g., `[{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]`). This should be the complete conversation history including the latest user query.
- *Output*: `llm_response_text` (str)
- *Necessity*: Used by `QueryLLMNode` to get an intelligent response.
- *Example Model*: OpenAI `gpt-4o`.
4. **`text_to_speech_api(text_to_synthesize)`** (`utils/text_to_speech.py`)
- *Input*: `text_to_synthesize` (str).
- *Output*: A tuple `(audio_data, sample_rate)` where `audio_data` is in-memory audio (e.g., NumPy array) and `sample_rate` is the audio sample rate (int).
- *Output*: A tuple `(audio_data, sample_rate)` where `audio_data` is in-memory audio as bytes (e.g., MP3 format from OpenAI) and `sample_rate` is the audio sample rate (int, e.g., 24000 Hz for OpenAI `gpt-4o-mini-tts`).
- *Necessity*: Used by `TextToSpeechNode` to convert LLM text to speakable in-memory audio data.
- *Example Model*: OpenAI `gpt-4o-mini-tts`.
5. **`play_audio_data(audio_data, sample_rate)`** (`utils/audio_utils.py`)
- *Input*: `audio_data` (NumPy array), `sample_rate` (int).
- *Input*: `audio_data` (NumPy array of float32 audio samples), `sample_rate` (int).
- *Output*: None
- *Necessity*: Used by `TextToSpeechNode` (in its `post` method) to play the in-memory synthesized speech.
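`utils/audio_utils.py` is not included in this change set, so the sketch below is only an assumed shape for `record_audio()` and `play_audio_data()` built on `sounddevice`; parameter defaults are illustrative and the real module may differ:
```python
# Assumed sketch of utils/audio_utils.py (not shown in this diff).
import numpy as np
import sounddevice as sd

def record_audio(sample_rate=16000, channels=1, chunk_size_ms=50,
                 silence_threshold_rms=0.01, min_silence_duration_ms=800,
                 max_recording_duration_s=30, pre_roll_chunks_count=5):
    chunk_frames = int(sample_rate * chunk_size_ms / 1000)
    max_chunks = int(max_recording_duration_s * 1000 // chunk_size_ms)
    silence_chunks_needed = max(1, min_silence_duration_ms // chunk_size_ms)
    pre_roll, recorded = [], []
    recording, silent_chunks = False, 0

    with sd.InputStream(samplerate=sample_rate, channels=channels, dtype="float32") as stream:
        for _ in range(max_chunks):
            chunk, _ = stream.read(chunk_frames)           # float32 array, shape (frames, channels)
            rms = float(np.sqrt(np.mean(chunk ** 2)))
            if not recording:
                pre_roll.append(chunk)                      # keep a short pre-roll buffer
                pre_roll = pre_roll[-pre_roll_chunks_count:]
                if rms > silence_threshold_rms:             # speech detected: start recording
                    recorded.extend(pre_roll)
                    recording = True
            else:
                recorded.append(chunk)
                silent_chunks = silent_chunks + 1 if rms < silence_threshold_rms else 0
                if silent_chunks >= silence_chunks_needed:  # trailing silence: stop
                    break

    if not recorded:
        return None, sample_rate                            # no speech detected
    return np.concatenate(recorded), sample_rate

def play_audio_data(audio_data, sample_rate):
    sd.play(audio_data, sample_rate)                        # start playback
    sd.wait()                                               # block until playback finishes
```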
@ -88,11 +91,8 @@ The shared memory structure is organized as follows:
```python
shared = {
"user_audio_data": None, # In-memory audio data (bytes or NumPy array) from user
"user_audio_data": None, # In-memory audio data (NumPy array) from user
"user_audio_sample_rate": None, # int: Sample rate of the user audio
"user_text_query": None, # str: Transcribed user text
"llm_text_response": None, # str: Text response from LLM
# "llm_audio_data" and "llm_audio_sample_rate" are handled as exec_res within TextToSpeechNode's post method
"chat_history": [], # list: Conversation history [{"role": "user/assistant", "content": "..."}]
"continue_conversation": True # boolean: Flag to control the main conversation loop
}
@ -107,40 +107,41 @@ shared = {
- *Type*: Regular
- *Steps*:
- *prep*: Check `shared["continue_conversation"]`. (Potentially load VAD parameters from `shared["config"]` if dynamic).
- *exec*: Call `utils.audio_utils.record_audio()` (passing VAD parameters if configured).
- *post*: `audio_data, sample_rate = exec_res`. Write `audio_data` to `shared["user_audio_data"]` and `sample_rate` to `shared["user_audio_sample_rate"]`. Returns `"default"`.
- *exec*: Call `utils.audio_utils.record_audio()` (passing VAD parameters if configured). This returns a NumPy array and sample rate.
- *post*: `audio_numpy_array, sample_rate = exec_res`. Write `audio_numpy_array` to `shared["user_audio_data"]` and `sample_rate` to `shared["user_audio_sample_rate"]`. Returns `"default"`.
2. **`SpeechToTextNode`**
- *Purpose*: Convert the recorded in-memory audio to text.
- *Type*: Regular
- *Steps*:
- *prep*: Read `shared["user_audio_data"]` and `shared["user_audio_sample_rate"]`. Return `(user_audio_data, user_audio_sample_rate)`.
- *exec*: `audio_data, sample_rate = prep_res`. Call `utils.speech_to_text.speech_to_text_api(audio_data, sample_rate)`.
- *prep*: Read `shared["user_audio_data"]` (NumPy array) and `shared["user_audio_sample_rate"]`. Return `(user_audio_data_numpy, user_audio_sample_rate)`.
- *exec*: `audio_numpy_array, sample_rate = prep_res`. **Convert `audio_numpy_array` to audio `bytes` (e.g., in WAV format using `scipy.io.wavfile.write` to an `io.BytesIO` object).** Call `utils.speech_to_text.speech_to_text_api(audio_bytes, sample_rate)`.
- *post*:
- Write `exec_res` (transcribed text) to `shared["user_text_query"]`.
- Append `{"role": "user", "content": exec_res}` to `shared["chat_history"]`.
- Let `transcribed_text = exec_res`.
- Append `{"role": "user", "content": transcribed_text}` to `shared["chat_history"]`.
- Clear `shared["user_audio_data"]` and `shared["user_audio_sample_rate"]` as they are no longer needed.
- Returns `"default"`.
- Returns `"default"` (assuming STT is successful as per simplification).
3. **`QueryLLMNode`**
- *Purpose*: Get a response from the LLM based on the user\'s query and conversation history.
- *Purpose*: Get a response from the LLM based on the user's query and conversation history.
- *Type*: Regular
- *Steps*:
- *prep*: Read `shared["user_text_query"]` and `shared["chat_history"]`. Return `(user_text_query, chat_history)`.
- *exec*: Call `utils.llm_service.call_llm(prompt=prep_res[0], history=prep_res[1])`.
- *prep*: Read `shared["chat_history"]`. Return `chat_history`.
- *exec*: `history = prep_res`. Call `utils.call_llm.call_llm(messages=history)`.
- *post*:
- Write `exec_res` (LLM text response) to `shared["llm_text_response"]`.
- Append `{"role": "assistant", "content": exec_res}` to `shared["chat_history"]`.
- Returns `"default"`.
- Let `llm_response = exec_res`.
- Append `{"role": "assistant", "content": llm_response}` to `shared["chat_history"]`.
- Returns `"default"` (assuming LLM call is successful).
4. **`TextToSpeechNode`**
- *Purpose*: Convert the LLM\'s text response into speech and play it.
- *Purpose*: Convert the LLM's text response into speech and play it.
- *Type*: Regular
- *Steps*:
- *prep*: Read `shared["llm_text_response"]`.
- *exec*: Call `utils.text_to_speech.text_to_speech_api(prep_res)`. This returns `(llm_audio_data, llm_sample_rate)`.
- *post*: `llm_audio_data, llm_sample_rate = exec_res`.
- Call `utils.audio_utils.play_audio_data(llm_audio_data, llm_sample_rate)`.
- *prep*: Read `shared["chat_history"]`. Identify the last message, which should be the LLM's response. Return its content.
- *exec*: `text_to_synthesize = prep_res`. Call `utils.text_to_speech.text_to_speech_api(text_to_synthesize)`. This returns `(llm_audio_bytes, llm_sample_rate)`.
- *post*: `llm_audio_bytes, llm_sample_rate = exec_res`.
- **Convert `llm_audio_bytes` (e.g., MP3 bytes from TTS API) to a NumPy array of audio samples (e.g., using a library like `pydub` or `soundfile` to decode); see the conversion sketch after this node list.**
- Call `utils.audio_utils.play_audio_data(llm_audio_numpy_array, llm_sample_rate)`.
- (Optional) Log completion.
- If `shared["continue_conversation"]` is `True`, return `"next_turn"` to loop back.
- Otherwise, return `"end_conversation"`.
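Both in-memory conversions highlighted above come down to a few lines each; the helpers below (hypothetical names) mirror what the `nodes.py` added in this commit does with `scipy.io.wavfile` and `soundfile`:
```python
import io
import numpy as np
import scipy.io.wavfile
import soundfile

# SpeechToTextNode.exec: NumPy float32 samples -> WAV bytes for the STT API
def numpy_to_wav_bytes(audio: np.ndarray, sample_rate: int) -> bytes:
    buf = io.BytesIO()
    scipy.io.wavfile.write(buf, sample_rate, audio)
    return buf.getvalue()

# TextToSpeechNode.post: compressed TTS bytes (e.g., MP3) -> NumPy samples for playback
def decode_tts_bytes(audio_bytes: bytes):
    samples, sample_rate = soundfile.read(io.BytesIO(audio_bytes))
    return samples, sample_rate
```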

View File

@ -0,0 +1,25 @@
from pocketflow import Flow
from nodes import CaptureAudioNode, SpeechToTextNode, QueryLLMNode, TextToSpeechNode
def create_voice_chat_flow() -> Flow:
"""Creates and returns the voice chat flow."""
# Create nodes
capture_audio = CaptureAudioNode()
speech_to_text = SpeechToTextNode()
query_llm = QueryLLMNode()
text_to_speech = TextToSpeechNode()
# Define transitions
capture_audio >> speech_to_text
speech_to_text >> query_llm
query_llm >> text_to_speech
# Loop back for next turn or end
text_to_speech - "next_turn" >> capture_audio
# "end_conversation" action from any node will terminate the flow naturally
# if no transition is defined for it from the current node.
# Alternatively, one could explicitly transition to an EndNode if desired.
# Create flow starting with the capture audio node
voice_chat_flow = Flow(start=capture_audio)
return voice_chat_flow

View File

@ -0,0 +1,28 @@
from flow import create_voice_chat_flow
def main():
"""Runs the PocketFlow Voice Chat application."""
print("Starting PocketFlow Voice Chat...")
print("Speak your query after 'Listening for your query...' appears.")
print("The conversation will continue until an error occurs or the loop is intentionally stopped.")
print("To attempt to stop, you might need to cause an error (e.g., silence during capture if not handled by VAD to end gracefully) or modify shared[\"continue_conversation\"] if a mechanism is added.")
shared = {
"user_audio_data": None,
"user_audio_sample_rate": None,
"user_text_query": None,
"llm_text_response": None,
"chat_history": [],
"continue_conversation": True # Flag to control the main conversation loop
}
# Create the flow
voice_chat_flow = create_voice_chat_flow()
# Run the flow
# The flow will loop based on the "next_turn" action from TextToSpeechNode
# and the continue_conversation flag checked within nodes or if an error action is returned.
voice_chat_flow.run(shared)
if __name__ == "__main__":
main()

View File

@ -0,0 +1,148 @@
import numpy as np
import scipy.io.wavfile
import io
import soundfile # For converting MP3 bytes to NumPy array
from pocketflow import Node
from utils.audio_utils import record_audio, play_audio_data
from utils.speech_to_text import speech_to_text_api
from utils.call_llm import call_llm
from utils.text_to_speech import text_to_speech_api
class CaptureAudioNode(Node):
"""Records audio input from the user using VAD."""
def exec(self, _): # prep_res is not used as per design
print("\nListening for your query...")
audio_data, sample_rate = record_audio()
if audio_data is None:
return None, None
return audio_data, sample_rate
def post(self, shared, prep_res, exec_res):
audio_numpy_array, sample_rate = exec_res
if audio_numpy_array is None:
shared["user_audio_data"] = None
shared["user_audio_sample_rate"] = None
print("CaptureAudioNode: Failed to capture audio.")
return "end_conversation"
shared["user_audio_data"] = audio_numpy_array
shared["user_audio_sample_rate"] = sample_rate
print(f"Audio captured ({len(audio_numpy_array)/sample_rate:.2f}s), proceeding to STT.")
class SpeechToTextNode(Node):
"""Converts the recorded in-memory audio to text."""
def prep(self, shared):
user_audio_data = shared.get("user_audio_data")
user_audio_sample_rate = shared.get("user_audio_sample_rate")
if user_audio_data is None or user_audio_sample_rate is None:
print("SpeechToTextNode: No audio data to process.")
return None # Signal to skip exec
return user_audio_data, user_audio_sample_rate
def exec(self, prep_res):
if prep_res is None:
return None # Skip if no audio data
audio_numpy_array, sample_rate = prep_res
# Convert NumPy array to WAV bytes for the API
byte_io = io.BytesIO()
scipy.io.wavfile.write(byte_io, sample_rate, audio_numpy_array)
wav_bytes = byte_io.getvalue()
print("Converting speech to text...")
transcribed_text = speech_to_text_api(audio_data=wav_bytes, sample_rate=sample_rate)
return transcribed_text
def post(self, shared, prep_res, exec_res):
if exec_res is None:
print("SpeechToTextNode: STT API returned no text.")
return "end_conversation"
transcribed_text = exec_res
print(f"User: {transcribed_text}")
if "chat_history" not in shared:
shared["chat_history"] = []
shared["chat_history"].append({"role": "user", "content": transcribed_text})
shared["user_audio_data"] = None
shared["user_audio_sample_rate"] = None
return "default"
class QueryLLMNode(Node):
"""Gets a response from the LLM."""
def prep(self, shared):
chat_history = shared.get("chat_history", [])
if not chat_history:
print("QueryLLMNode: Chat history is empty. Skipping LLM call.")
return None
return chat_history
def exec(self, prep_res):
if prep_res is None:
return None
chat_history = prep_res
print("Sending query to LLM...")
llm_response_text = call_llm(messages=chat_history)
return llm_response_text
def post(self, shared, prep_res, exec_res):
if exec_res is None:
print("QueryLLMNode: LLM API returned no response.")
return "end_conversation"
llm_response_text = exec_res
print(f"LLM: {llm_response_text}")
shared["chat_history"].append({"role": "assistant", "content": llm_response_text})
return "default"
class TextToSpeechNode(Node):
"""Converts the LLM's text response into speech and plays it."""
def prep(self, shared):
chat_history = shared.get("chat_history", [])
if not chat_history:
print("TextToSpeechNode: Chat history is empty. No LLM response to synthesize.")
return None
last_message = chat_history[-1]
if last_message.get("role") == "assistant" and last_message.get("content"):
return last_message.get("content")
else:
print("TextToSpeechNode: Last message not from assistant or no content. Skipping TTS.")
return None
def exec(self, prep_res):
if prep_res is None:
return None, None
llm_text_response = prep_res
print("Converting LLM response to speech...")
llm_audio_bytes, llm_sample_rate = text_to_speech_api(llm_text_response)
return llm_audio_bytes, llm_sample_rate
def post(self, shared, prep_res, exec_res):
if exec_res is None or exec_res[0] is None:
print("TextToSpeechNode: TTS failed or was skipped.")
return "next_turn"
llm_audio_bytes, llm_sample_rate = exec_res
print("Playing LLM response...")
try:
audio_segment, sr_from_file = soundfile.read(io.BytesIO(llm_audio_bytes))
play_audio_data(audio_segment, sr_from_file)
except Exception as e:
print(f"Error playing TTS audio: {e}")
return "next_turn"
if shared.get("continue_conversation", True):
return "next_turn"
else:
print("Conversation ended by user flag.")
return "end_conversation"

View File

@ -1,5 +1,6 @@
openai
sounddevice
pocketflow
numpy
sounddevice
scipy
soundfile

View File

@ -8,7 +8,7 @@ def speech_to_text_api(audio_data: bytes, sample_rate: int):
# The API expects a file-like object. We can use io.BytesIO for in-memory bytes.
# We also need to give it a name, as if it were a file upload.
audio_file = io.BytesIO(audio_data)
audio_file.name = "audio.mp3" # Provide a dummy filename with a common audio extension
audio_file.name = "audio.wav" # Corrected to WAV format
transcript = client.audio.transcriptions.create(
model="gpt-4o-transcribe",