diff --git a/README.md b/README.md
index c924faf..b76b739 100644
--- a/README.md
+++ b/README.md
@@ -76,12 +76,12 @@ From there, it's easy to implement popular design patterns like ([Multi-](https:
| [Batch](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-batch) | ☆☆☆ *Dummy* | A batch processor that translates markdown content into multiple languages |
| [Streaming](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-llm-streaming) | ☆☆☆ *Dummy* | A real-time LLM streaming demo with user interrupt capability |
| [Chat Guardrail](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-chat-guardrail) | ☆☆☆ *Dummy* | A travel advisor chatbot that only processes travel-related queries |
-| [Map-Reduce](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-map-reduce) | ★☆☆ *Beginner* | A resume qualification processor using map-reduce pattern for batch evaluation |
+| [Majority Vote](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-majority-vote) | ☆☆☆ *Dummy* | Improve reasoning accuracy by aggregating multiple solution attempts |
+| [Map-Reduce](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-map-reduce) | ☆☆☆ *Dummy* | A resume qualification processor using map-reduce pattern for batch evaluation |
| [Multi-Agent](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-multi-agent) | ★☆☆ *Beginner* | A Taboo word game for asynchronous communication between two agents |
| [Supervisor](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-supervisor) | ★☆☆ *Beginner* | Research agent is getting unreliable... Let's build a supervision process |
| [Parallel](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-parallel-batch) | ★☆☆ *Beginner* | A parallel execution demo that shows 3x speedup |
| [Parallel Flow](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-parallel-batch-flow) | ★☆☆ *Beginner* | A parallel image processing demo showing 8x speedup with multiple filters |
-| [Majority Vote](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-majority-vote) | ★☆☆ *Beginner* | Improve reasoning accuracy by aggregating multiple solution attempts |
| [Thinking](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-thinking) | ★☆☆ *Beginner* | Solve complex reasoning problems through Chain-of-Thought |
| [Memory](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-chat-memory) | ★☆☆ *Beginner* | A chat bot with short-term and long-term memory |
| [Text2SQL](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-text2sql) | ★☆☆ *Beginner* | Convert natural language to SQL queries with an auto-debug loop |
@@ -89,6 +89,7 @@ From there, it's easy to implement popular design patterns like ([Multi-](https:
| [A2A](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-a2a) | ★☆☆ *Beginner* | Agent wrapped with Agent-to-Agent protocol for inter-agent communication |
| [Streamlit HITL](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-streamlit-hitl) | ★☆☆ *Beginner* | Streamlit app for human-in-the-loop review |
| [FastAPI HITL](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-fastapi-hitl) | ★☆☆ *Beginner* | FastAPI app for async human review loop with SSE |
+| [Voice Chat](https://github.com/The-Pocket/PocketFlow/tree/main/cookbook/pocketflow-voice-chat) | ★☆☆ *Beginner* | An interactive voice chat application with VAD, STT, LLM, and TTS |
diff --git a/cookbook/pocketflow-voice-chat/README.md b/cookbook/pocketflow-voice-chat/README.md
index 167acd1..d4e5ed2 100644
--- a/cookbook/pocketflow-voice-chat/README.md
+++ b/cookbook/pocketflow-voice-chat/README.md
@@ -1 +1,83 @@
-sudo apt-get update && sudo apt-get install -y portaudio19-dev
\ No newline at end of file
+# PocketFlow Voice Chat
+
+This project demonstrates a voice-based interactive chat application built with PocketFlow. Users can speak their queries, and the system will respond with spoken answers from an LLM, maintaining conversation history.
+
+## Features
+
+- **Voice Activity Detection (VAD)**: Automatically detects when the user starts and stops speaking.
+- **Speech-to-Text (STT)**: Converts spoken audio into text using OpenAI.
+- **LLM Interaction**: Processes the transcribed text with an LLM (e.g., GPT-4o), maintaining conversation history.
+- **Text-to-Speech (TTS)**: Converts the LLM's text response back into audible speech using OpenAI.
+- **Continuous Conversation**: Loops back to listen for the next user query after responding, allowing for an ongoing dialogue.
+
+## How to Run
+
+1. **Set your OpenAI API key**:
+ ```bash
+ export OPENAI_API_KEY="your-api-key-here"
+ ```
+ Ensure this environment variable is set, as the utility scripts for STT, LLM, and TTS rely on it.
+ You can test individual utility functions (e.g., `python utils/call_llm.py`, `python utils/text_to_speech.py`) to help verify your API key and setup.
+
+2. **Install dependencies**:
+ Make sure you have Python installed. Then, install the required libraries using pip:
+ ```bash
+ pip install -r requirements.txt
+ ```
+ This will install libraries such as `openai`, `pocketflow`, `sounddevice`, `numpy`, `scipy`, and `soundfile`.
+
+ **Note for Linux users**: `sounddevice` may require PortAudio. If you encounter issues, you might need to install it first:
+ ```bash
+ sudo apt-get update && sudo apt-get install -y portaudio19-dev
+ ```
+
+3. **Run the application**:
+ ```bash
+ python main.py
+ ```
+ Follow the console prompts. The application will start listening when you see "Listening for your query...".
+
+## How It Works
+
+The application uses a PocketFlow workflow to manage the conversation steps:
+
+```mermaid
+flowchart TD
+ CaptureAudio[Capture Audio] --> SpeechToText[Speech to Text]
+ SpeechToText --> QueryLLM[Query LLM]
+ QueryLLM --> TextToSpeech[Text to Speech & Play]
+ TextToSpeech -- "Next Turn" --> CaptureAudio
+```
+
+Here's what each node in the flow does:
+
+1. **`CaptureAudioNode`**: Records audio from the user's microphone. It uses Voice Activity Detection (VAD) to start recording when speech is detected and stop when silence is detected.
+2. **`SpeechToTextNode`**: Takes the recorded audio data, converts it to a suitable format, and sends it to OpenAI's STT API (gpt-4o-transcribe) to get the transcribed text.
+3. **`QueryLLMNode`**: Takes the transcribed text from the user, along with the existing conversation history, and sends it to an LLM (OpenAI's GPT-4o model) to generate an intelligent response.
+4. **`TextToSpeechNode`**: Receives the text response from the LLM, converts it into audio using OpenAI's TTS API (gpt-4o-mini-tts), and plays the audio back to the user. If the conversation is set to continue, it transitions back to the `CaptureAudioNode`.
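+
+These transitions are wired in `flow.py` using PocketFlow's action syntax (excerpted from this example's `create_voice_chat_flow`):
+
+```python
+capture_audio >> speech_to_text                  # default transition
+speech_to_text >> query_llm
+query_llm >> text_to_speech
+text_to_speech - "next_turn" >> capture_audio    # loop back for the next turn
+```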
+
+## Example Interaction
+
+When you run `main.py`:
+
+1. The console will display:
+ ```
+ Starting PocketFlow Voice Chat...
+ Speak your query after 'Listening for your query...' appears.
+ ...
+ ```
+2. When you see `Listening for your query...`, speak clearly into your microphone.
+3. After you stop speaking, the console will show updates:
+ ```
+ Audio captured (X.XXs), proceeding to STT.
+ Converting speech to text...
+ User: [Your transcribed query will appear here]
+ Sending query to LLM...
+ LLM: [The LLM's response text will appear here]
+ Converting LLM response to speech...
+ Playing LLM response...
+ ```
+4. You will hear the LLM's response spoken aloud.
+5. The application will then loop back, and you'll see `Listening for your query...` again, ready for your next input.
+
+The conversation continues in this manner. To stop the application, interrupt it with Ctrl+C in the terminal, as it is designed to loop continuously.
\ No newline at end of file
diff --git a/cookbook/pocketflow-voice-chat/docs/design.md b/cookbook/pocketflow-voice-chat/docs/design.md
index d4d22c5..b2c9a14 100644
--- a/cookbook/pocketflow-voice-chat/docs/design.md
+++ b/cookbook/pocketflow-voice-chat/docs/design.md
@@ -53,28 +53,31 @@ flowchart TD
> 2. Include only the necessary utility functions, based on nodes in the flow.
1. **`record_audio()`** (`utils/audio_utils.py`)
- - *Input*: (Optional) `silence_threshold` (float, e.g., RMS energy), `min_silence_duration_ms` (int), `chunk_size_ms` (int), `sample_rate` (int, Hz), `channels` (int).
- - *Output*: A tuple `(audio_data, sample_rate)` where `audio_data` is in-memory audio (e.g., bytes or NumPy array) and `sample_rate` is the recording sample rate (int).
- - *Description*: Records audio from the microphone. Starts recording when sound is detected above `silence_threshold` (optional, or starts immediately) and stops after `min_silence_duration_ms` of sound below the threshold.
+ - *Input*: (Optional) `sample_rate` (int, Hz, e.g., `DEFAULT_SAMPLE_RATE`), `channels` (int, e.g., `DEFAULT_CHANNELS`), `chunk_size_ms` (int, e.g., `DEFAULT_CHUNK_SIZE_MS`), `silence_threshold_rms` (float, e.g., `DEFAULT_SILENCE_THRESHOLD_RMS`), `min_silence_duration_ms` (int, e.g., `DEFAULT_MIN_SILENCE_DURATION_MS`), `max_recording_duration_s` (int, e.g., `DEFAULT_MAX_RECORDING_DURATION_S`), `pre_roll_chunks_count` (int, e.g., `DEFAULT_PRE_ROLL_CHUNKS`).
+ - *Output*: A tuple `(audio_data, sample_rate)` where `audio_data` is a NumPy array of float32 audio samples, and `sample_rate` is the recording sample rate (int). Returns `(None, sample_rate)` if no speech is detected or recording fails.
+   - *Description*: Records audio from the microphone using silence-based Voice Activity Detection (VAD). Buffers `pre_roll_chunks_count` chunks of audio and starts full recording when sound is detected above `silence_threshold_rms`. Stops after `min_silence_duration_ms` of sound below the threshold or when `max_recording_duration_s` is reached. (A runnable sketch of this loop follows the list below.)
-   - *Necessity*: Used by `CaptureAudioNode` to get user\'s voice input.
+   - *Necessity*: Used by `CaptureAudioNode` to get the user's voice input.
2. **`speech_to_text_api(audio_data, sample_rate)`** (`utils/speech_to_text.py`)
- - *Input*: `audio_data` (bytes or NumPy array), `sample_rate` (int).
+ - *Input*: `audio_data` (bytes), `sample_rate` (int, though the API might infer this from the audio format).
- *Output*: `transcribed_text` (str).
- *Necessity*: Used by `SpeechToTextNode` to convert in-memory audio data to text.
+ - *Example Model*: OpenAI `gpt-4o-transcribe`.
-3. **`call_llm(prompt, history)`** (`utils/llm_service.py`)
- - *Input*: `prompt` (str), `history` (list of dicts, e.g., `[{"role": "user", "content": "..."}]`)
+3. **`call_llm(messages)`** (`utils/call_llm.py`)
+ - *Input*: `messages` (list of dicts, e.g., `[{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]`). This should be the complete conversation history including the latest user query.
- *Output*: `llm_response_text` (str)
- *Necessity*: Used by `QueryLLMNode` to get an intelligent response.
+ - *Example Model*: OpenAI `gpt-4o`.
4. **`text_to_speech_api(text_to_synthesize)`** (`utils/text_to_speech.py`)
- *Input*: `text_to_synthesize` (str).
- - *Output*: A tuple `(audio_data, sample_rate)` where `audio_data` is in-memory audio (e.g., NumPy array) and `sample_rate` is the audio sample rate (int).
+ - *Output*: A tuple `(audio_data, sample_rate)` where `audio_data` is in-memory audio as bytes (e.g., MP3 format from OpenAI) and `sample_rate` is the audio sample rate (int, e.g., 24000 Hz for OpenAI `gpt-4o-mini-tts`).
- *Necessity*: Used by `TextToSpeechNode` to convert LLM text to speakable in-memory audio data.
+ - *Example Model*: OpenAI `gpt-4o-mini-tts`.
5. **`play_audio_data(audio_data, sample_rate)`** (`utils/audio_utils.py`)
- - *Input*: `audio_data` (NumPy array), `sample_rate` (int).
+ - *Input*: `audio_data` (NumPy array of float32 audio samples), `sample_rate` (int).
- *Output*: None
- *Necessity*: Used by `TextToSpeechNode` (in its `post` method) to play the in-memory synthesized speech.
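+
+A minimal sketch of the silence-based VAD loop described above for `record_audio()` (the helper name and parameter defaults here are illustrative assumptions, not the actual `utils/audio_utils.py` implementation):
+
+```python
+import numpy as np
+import sounddevice as sd
+
+def record_audio_sketch(sample_rate=16000, chunk_size_ms=100,
+                        silence_threshold_rms=0.01, min_silence_duration_ms=800,
+                        max_recording_duration_s=30, pre_roll_chunks_count=3):
+    # Illustrative sketch: defaults are assumptions, not the shipped DEFAULT_* values.
+    chunk_frames = int(sample_rate * chunk_size_ms / 1000)
+    pre_roll, recorded = [], []
+    recording, silent_ms, elapsed_ms = False, 0, 0
+    with sd.InputStream(samplerate=sample_rate, channels=1, dtype="float32") as stream:
+        while elapsed_ms < max_recording_duration_s * 1000:
+            chunk, _ = stream.read(chunk_frames)
+            elapsed_ms += chunk_size_ms
+            rms = float(np.sqrt(np.mean(chunk ** 2)))
+            if not recording:
+                # Rolling pre-roll buffer so the first syllable isn't clipped.
+                pre_roll.append(chunk)
+                if len(pre_roll) > pre_roll_chunks_count:
+                    pre_roll.pop(0)
+                if rms >= silence_threshold_rms:
+                    recorded, recording = list(pre_roll), True
+            else:
+                recorded.append(chunk)
+                silent_ms = silent_ms + chunk_size_ms if rms < silence_threshold_rms else 0
+                if silent_ms >= min_silence_duration_ms:
+                    break
+    if not recorded:
+        return None, sample_rate  # no speech detected
+    return np.concatenate(recorded).flatten(), sample_rate
+```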
@@ -88,11 +91,8 @@ The shared memory structure is organized as follows:
```python
shared = {
- "user_audio_data": None, # In-memory audio data (bytes or NumPy array) from user
+ "user_audio_data": None, # In-memory audio data (NumPy array) from user
"user_audio_sample_rate": None, # int: Sample rate of the user audio
- "user_text_query": None, # str: Transcribed user text
- "llm_text_response": None, # str: Text response from LLM
- # "llm_audio_data" and "llm_audio_sample_rate" are handled as exec_res within TextToSpeechNode's post method
"chat_history": [], # list: Conversation history [{"role": "user/assistant", "content": "..."}]
"continue_conversation": True # boolean: Flag to control the main conversation loop
}
@@ -107,40 +107,41 @@ shared = {
- *Type*: Regular
- *Steps*:
- *prep*: Check `shared["continue_conversation"]`. (Potentially load VAD parameters from `shared["config"]` if dynamic).
- - *exec*: Call `utils.audio_utils.record_audio()` (passing VAD parameters if configured).
- - *post*: `audio_data, sample_rate = exec_res`. Write `audio_data` to `shared["user_audio_data"]` and `sample_rate` to `shared["user_audio_sample_rate"]`. Returns `"default"`.
+ - *exec*: Call `utils.audio_utils.record_audio()` (passing VAD parameters if configured). This returns a NumPy array and sample rate.
+ - *post*: `audio_numpy_array, sample_rate = exec_res`. Write `audio_numpy_array` to `shared["user_audio_data"]` and `sample_rate` to `shared["user_audio_sample_rate"]`. Returns `"default"`.
2. **`SpeechToTextNode`**
- *Purpose*: Convert the recorded in-memory audio to text.
- *Type*: Regular
- *Steps*:
- - *prep*: Read `shared["user_audio_data"]` and `shared["user_audio_sample_rate"]`. Return `(user_audio_data, user_audio_sample_rate)`.
- - *exec*: `audio_data, sample_rate = prep_res`. Call `utils.speech_to_text.speech_to_text_api(audio_data, sample_rate)`.
+ - *prep*: Read `shared["user_audio_data"]` (NumPy array) and `shared["user_audio_sample_rate"]`. Return `(user_audio_data_numpy, user_audio_sample_rate)`.
+   - *exec*: `audio_numpy_array, sample_rate = prep_res`. **Convert `audio_numpy_array` to audio `bytes` (e.g., in WAV format using `scipy.io.wavfile.write` to an `io.BytesIO` object; see the conversion sketch after this list).** Call `utils.speech_to_text.speech_to_text_api(audio_bytes, sample_rate)`.
- *post*:
- - Write `exec_res` (transcribed text) to `shared["user_text_query"]`.
- - Append `{"role": "user", "content": exec_res}` to `shared["chat_history"]`.
+ - Let `transcribed_text = exec_res`.
+ - Append `{"role": "user", "content": transcribed_text}` to `shared["chat_history"]`.
- Clear `shared["user_audio_data"]` and `shared["user_audio_sample_rate"]` as they are no longer needed.
- - Returns `"default"`.
+     - Returns `"default"` (assuming STT succeeds, as a simplification).
3. **`QueryLLMNode`**
- - *Purpose*: Get a response from the LLM based on the user\'s query and conversation history.
+ - *Purpose*: Get a response from the LLM based on the user's query and conversation history.
- *Type*: Regular
- *Steps*:
- - *prep*: Read `shared["user_text_query"]` and `shared["chat_history"]`. Return `(user_text_query, chat_history)`.
- - *exec*: Call `utils.llm_service.call_llm(prompt=prep_res[0], history=prep_res[1])`.
+ - *prep*: Read `shared["chat_history"]`. Return `chat_history`.
+ - *exec*: `history = prep_res`. Call `utils.call_llm.call_llm(messages=history)`.
- *post*:
- - Write `exec_res` (LLM text response) to `shared["llm_text_response"]`.
- - Append `{"role": "assistant", "content": exec_res}` to `shared["chat_history"]`.
- - Returns `"default"`.
+ - Let `llm_response = exec_res`.
+ - Append `{"role": "assistant", "content": llm_response}` to `shared["chat_history"]`.
+ - Returns `"default"` (assuming LLM call is successful).
4. **`TextToSpeechNode`**
- - *Purpose*: Convert the LLM\'s text response into speech and play it.
+ - *Purpose*: Convert the LLM's text response into speech and play it.
- *Type*: Regular
- *Steps*:
- - *prep*: Read `shared["llm_text_response"]`.
- - *exec*: Call `utils.text_to_speech.text_to_speech_api(prep_res)`. This returns `(llm_audio_data, llm_sample_rate)`.
- - *post*: `llm_audio_data, llm_sample_rate = exec_res`.
- - Call `utils.audio_utils.play_audio_data(llm_audio_data, llm_sample_rate)`.
+ - *prep*: Read `shared["chat_history"]`. Identify the last message, which should be the LLM's response. Return its content.
+ - *exec*: `text_to_synthesize = prep_res`. Call `utils.text_to_speech.text_to_speech_api(text_to_synthesize)`. This returns `(llm_audio_bytes, llm_sample_rate)`.
+ - *post*: `llm_audio_bytes, llm_sample_rate = exec_res`.
+     - **Convert `llm_audio_bytes` (e.g., MP3 bytes from the TTS API) to a NumPy array of audio samples (e.g., by decoding with a library like `pydub` or `soundfile`; see the conversion sketch after this list).**
+ - Call `utils.audio_utils.play_audio_data(llm_audio_numpy_array, llm_sample_rate)`.
- (Optional) Log completion.
- If `shared["continue_conversation"]` is `True`, return `"next_turn"` to loop back.
- Otherwise, return `"end_conversation"`.
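+
+The two audio-format conversions highlighted in the node steps above, as small helper sketches (the helper names are illustrative; the nodes inline equivalent calls):
+
+```python
+import io
+import numpy as np
+import scipy.io.wavfile
+import soundfile
+
+def numpy_to_wav_bytes(audio: np.ndarray, sample_rate: int) -> bytes:
+    """Encode float32 samples as WAV bytes for the STT API (SpeechToTextNode.exec)."""
+    buffer = io.BytesIO()
+    scipy.io.wavfile.write(buffer, sample_rate, audio)
+    return buffer.getvalue()
+
+def tts_bytes_to_numpy(audio_bytes: bytes):
+    """Decode TTS output bytes (e.g., MP3) into samples for playback (TextToSpeechNode.post)."""
+    samples, sample_rate = soundfile.read(io.BytesIO(audio_bytes))
+    return samples, sample_rate
+```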
diff --git a/cookbook/pocketflow-voice-chat/flow.py b/cookbook/pocketflow-voice-chat/flow.py
new file mode 100644
index 0000000..5c0dec5
--- /dev/null
+++ b/cookbook/pocketflow-voice-chat/flow.py
@@ -0,0 +1,25 @@
+from pocketflow import Flow
+from nodes import CaptureAudioNode, SpeechToTextNode, QueryLLMNode, TextToSpeechNode
+
+def create_voice_chat_flow() -> Flow:
+ """Creates and returns the voice chat flow."""
+ # Create nodes
+ capture_audio = CaptureAudioNode()
+ speech_to_text = SpeechToTextNode()
+ query_llm = QueryLLMNode()
+ text_to_speech = TextToSpeechNode()
+
+ # Define transitions
+ capture_audio >> speech_to_text
+ speech_to_text >> query_llm
+ query_llm >> text_to_speech
+
+ # Loop back for next turn or end
+ text_to_speech - "next_turn" >> capture_audio
+ # "end_conversation" action from any node will terminate the flow naturally
+ # if no transition is defined for it from the current node.
+ # Alternatively, one could explicitly transition to an EndNode if desired.
+
+ # Create flow starting with the capture audio node
+ voice_chat_flow = Flow(start=capture_audio)
+ return voice_chat_flow
\ No newline at end of file
diff --git a/cookbook/pocketflow-voice-chat/main.py b/cookbook/pocketflow-voice-chat/main.py
new file mode 100644
index 0000000..4122982
--- /dev/null
+++ b/cookbook/pocketflow-voice-chat/main.py
@@ -0,0 +1,28 @@
+from flow import create_voice_chat_flow
+
+def main():
+ """Runs the PocketFlow Voice Chat application."""
+ print("Starting PocketFlow Voice Chat...")
+ print("Speak your query after 'Listening for your query...' appears.")
+ print("The conversation will continue until an error occurs or the loop is intentionally stopped.")
+ print("To attempt to stop, you might need to cause an error (e.g., silence during capture if not handled by VAD to end gracefully) or modify shared[\"continue_conversation\"] if a mechanism is added.")
+
+ shared = {
+ "user_audio_data": None,
+ "user_audio_sample_rate": None,
+ "user_text_query": None,
+ "llm_text_response": None,
+ "chat_history": [],
+ "continue_conversation": True # Flag to control the main conversation loop
+ }
+
+ # Create the flow
+ voice_chat_flow = create_voice_chat_flow()
+
+    # Run the flow. It loops via the "next_turn" action from TextToSpeechNode
+    # until continue_conversation is False, a node returns an end action,
+    # or the user interrupts with Ctrl+C.
+    try:
+        voice_chat_flow.run(shared)
+    except KeyboardInterrupt:
+        print("\nExiting PocketFlow Voice Chat.")
+
+if __name__ == "__main__":
+ main()
\ No newline at end of file
diff --git a/cookbook/pocketflow-voice-chat/nodes.py b/cookbook/pocketflow-voice-chat/nodes.py
new file mode 100644
index 0000000..a43ba38
--- /dev/null
+++ b/cookbook/pocketflow-voice-chat/nodes.py
@@ -0,0 +1,148 @@
+import numpy as np
+import scipy.io.wavfile
+import io
+import soundfile # For converting MP3 bytes to NumPy array
+
+from pocketflow import Node
+from utils.audio_utils import record_audio, play_audio_data
+from utils.speech_to_text import speech_to_text_api
+from utils.call_llm import call_llm
+from utils.text_to_speech import text_to_speech_api
+
+class CaptureAudioNode(Node):
+ """Records audio input from the user using VAD."""
+ def exec(self, _): # prep_res is not used as per design
+ print("\nListening for your query...")
+ audio_data, sample_rate = record_audio()
+ if audio_data is None:
+ return None, None
+ return audio_data, sample_rate
+
+ def post(self, shared, prep_res, exec_res):
+ audio_numpy_array, sample_rate = exec_res
+ if audio_numpy_array is None:
+ shared["user_audio_data"] = None
+ shared["user_audio_sample_rate"] = None
+ print("CaptureAudioNode: Failed to capture audio.")
+ return "end_conversation"
+
+ shared["user_audio_data"] = audio_numpy_array
+ shared["user_audio_sample_rate"] = sample_rate
+ print(f"Audio captured ({len(audio_numpy_array)/sample_rate:.2f}s), proceeding to STT.")
+
+class SpeechToTextNode(Node):
+ """Converts the recorded in-memory audio to text."""
+ def prep(self, shared):
+ user_audio_data = shared.get("user_audio_data")
+ user_audio_sample_rate = shared.get("user_audio_sample_rate")
+ if user_audio_data is None or user_audio_sample_rate is None:
+ print("SpeechToTextNode: No audio data to process.")
+ return None # Signal to skip exec
+ return user_audio_data, user_audio_sample_rate
+
+ def exec(self, prep_res):
+ if prep_res is None:
+ return None # Skip if no audio data
+
+ audio_numpy_array, sample_rate = prep_res
+
+ # Convert NumPy array to WAV bytes for the API
+ byte_io = io.BytesIO()
+ scipy.io.wavfile.write(byte_io, sample_rate, audio_numpy_array)
+ wav_bytes = byte_io.getvalue()
+
+ print("Converting speech to text...")
+ transcribed_text = speech_to_text_api(audio_data=wav_bytes, sample_rate=sample_rate)
+ return transcribed_text
+
+ def post(self, shared, prep_res, exec_res):
+ if exec_res is None:
+ print("SpeechToTextNode: STT API returned no text.")
+ return "end_conversation"
+
+ transcribed_text = exec_res
+ print(f"User: {transcribed_text}")
+
+ if "chat_history" not in shared:
+ shared["chat_history"] = []
+ shared["chat_history"].append({"role": "user", "content": transcribed_text})
+
+ shared["user_audio_data"] = None
+ shared["user_audio_sample_rate"] = None
+ return "default"
+
+class QueryLLMNode(Node):
+ """Gets a response from the LLM."""
+ def prep(self, shared):
+ chat_history = shared.get("chat_history", [])
+
+ if not chat_history:
+ print("QueryLLMNode: Chat history is empty. Skipping LLM call.")
+ return None
+
+ return chat_history
+
+ def exec(self, prep_res):
+ if prep_res is None:
+ return None
+
+ chat_history = prep_res
+ print("Sending query to LLM...")
+ llm_response_text = call_llm(messages=chat_history)
+ return llm_response_text
+
+ def post(self, shared, prep_res, exec_res):
+ if exec_res is None:
+ print("QueryLLMNode: LLM API returned no response.")
+ return "end_conversation"
+
+ llm_response_text = exec_res
+ print(f"LLM: {llm_response_text}")
+
+ shared["chat_history"].append({"role": "assistant", "content": llm_response_text})
+ return "default"
+
+class TextToSpeechNode(Node):
+ """Converts the LLM's text response into speech and plays it."""
+ def prep(self, shared):
+ chat_history = shared.get("chat_history", [])
+ if not chat_history:
+ print("TextToSpeechNode: Chat history is empty. No LLM response to synthesize.")
+ return None
+
+ last_message = chat_history[-1]
+ if last_message.get("role") == "assistant" and last_message.get("content"):
+ return last_message.get("content")
+ else:
+ print("TextToSpeechNode: Last message not from assistant or no content. Skipping TTS.")
+ return None
+
+ def exec(self, prep_res):
+ if prep_res is None:
+ return None, None
+
+ llm_text_response = prep_res
+ print("Converting LLM response to speech...")
+ llm_audio_bytes, llm_sample_rate = text_to_speech_api(llm_text_response)
+ return llm_audio_bytes, llm_sample_rate
+
+ def post(self, shared, prep_res, exec_res):
+ if exec_res is None or exec_res[0] is None:
+ print("TextToSpeechNode: TTS failed or was skipped.")
+ return "next_turn"
+
+ llm_audio_bytes, llm_sample_rate = exec_res
+
+ print("Playing LLM response...")
+ try:
+ audio_segment, sr_from_file = soundfile.read(io.BytesIO(llm_audio_bytes))
+ play_audio_data(audio_segment, sr_from_file)
+ except Exception as e:
+ print(f"Error playing TTS audio: {e}")
+ return "next_turn"
+
+ if shared.get("continue_conversation", True):
+ return "next_turn"
+ else:
+ print("Conversation ended by user flag.")
+ return "end_conversation"
\ No newline at end of file
diff --git a/cookbook/pocketflow-voice-chat/requirements.txt b/cookbook/pocketflow-voice-chat/requirements.txt
index 4d2e0ce..40d9e05 100644
--- a/cookbook/pocketflow-voice-chat/requirements.txt
+++ b/cookbook/pocketflow-voice-chat/requirements.txt
@@ -1,5 +1,6 @@
openai
-sounddevice
+pocketflow
numpy
+sounddevice
scipy
soundfile
\ No newline at end of file
diff --git a/cookbook/pocketflow-voice-chat/utils/speech_to_text.py b/cookbook/pocketflow-voice-chat/utils/speech_to_text.py
index 35d533d..3db720d 100644
--- a/cookbook/pocketflow-voice-chat/utils/speech_to_text.py
+++ b/cookbook/pocketflow-voice-chat/utils/speech_to_text.py
@@ -8,7 +8,7 @@ def speech_to_text_api(audio_data: bytes, sample_rate: int):
# The API expects a file-like object. We can use io.BytesIO for in-memory bytes.
# We also need to give it a name, as if it were a file upload.
audio_file = io.BytesIO(audio_data)
- audio_file.name = "audio.mp3" # Provide a dummy filename with a common audio extension
+ audio_file.name = "audio.wav" # Corrected to WAV format
transcript = client.audio.transcriptions.create(
model="gpt-4o-transcribe",