diff --git a/cookbook/pocketflow-voice-chat/README.md b/cookbook/pocketflow-voice-chat/README.md
new file mode 100644
index 0000000..167acd1
--- /dev/null
+++ b/cookbook/pocketflow-voice-chat/README.md
@@ -0,0 +1 @@
+sudo apt-get update && sudo apt-get install -y portaudio19-dev
\ No newline at end of file
diff --git a/cookbook/pocketflow-voice-chat/docs/design.md b/cookbook/pocketflow-voice-chat/docs/design.md
new file mode 100644
index 0000000..d4d22c5
--- /dev/null
+++ b/cookbook/pocketflow-voice-chat/docs/design.md
@@ -0,0 +1,146 @@
+# Design Doc: PocketFlow Voice Chat
+
+> Please DON'T remove notes for AI
+
+## Requirements
+
+> Notes for AI: Keep it simple and clear.
+> If the requirements are abstract, write concrete user stories
+
+- **Goal**: Enable users to interact with an LLM via voice in a continuous conversation, receiving spoken responses.
+- **User Story 1**: As a user, I want to speak my query into a microphone so that the application can understand what I'm asking.
+- **User Story 2**: As a user, I want the application to send my spoken query to an LLM for processing.
+- **User Story 3**: As a user, I want to hear the LLM's response spoken back to me.
+- **User Story 4**: As a user, after hearing the response, I want the application to be ready for my next spoken query without restarting.
+- **Core Functionalities**:
+  1. Capture audio input.
+  2. Convert speech to text (STT).
+  3. Process text with an LLM (maintaining conversation history).
+  4. Convert LLM text response to speech (TTS).
+  5. Play back synthesized audio.
+  6. Loop back to capture new audio input for a continuous conversation.
+
+## Flow Design
+
+> Notes for AI:
+> 1. Consider the design patterns of agent, map-reduce, rag, and workflow. Apply them if they fit.
+> 2. Present a concise, high-level description of the workflow.
+
+### Applicable Design Pattern:
+
+- **Workflow**: A sequential workflow with a loop is most appropriate. Each step (audio capture, STT, LLM query, TTS, audio playback) directly follows the previous one, and after playback, the flow returns to the audio capture stage.
+
+### Flow high-level Design:
+
+The application will operate in a loop to allow for continuous conversation:
+
+1. **`CaptureAudioNode`**: Records audio from the user's microphone when triggered.
+2. **`SpeechToTextNode`**: Converts the recorded audio into text.
+3. **`QueryLLMNode`**: Sends the transcribed text (with history) to an LLM and gets a text response.
+4. **`TextToSpeechNode`**: Converts the LLM's text response into in-memory audio data and then plays it. After completion, the flow transitions back to the `CaptureAudioNode`.
+
+```mermaid
+flowchart TD
+    CaptureAudio[Capture Audio] --> SpeechToText[Speech to Text]
+    SpeechToText --> QueryLLM[Query LLM]
+    QueryLLM --> TextToSpeech[Text to Speech & Play]
+    TextToSpeech -- "Next Turn" --> CaptureAudio
+```
+
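+For illustration only, the loop above could be assembled with PocketFlow's action-based transitions roughly as follows. This is a minimal sketch, not code shipped in this PR: it assumes the four node classes are implemented in a `nodes.py` module and that `TextToSpeechNode` returns the `"next_turn"` action described later in this doc.
+
+```python
+from pocketflow import Flow
+
+from nodes import CaptureAudioNode, SpeechToTextNode, QueryLLMNode, TextToSpeechNode
+
+def create_voice_chat_flow():
+    capture = CaptureAudioNode()
+    stt = SpeechToTextNode()
+    llm = QueryLLMNode()
+    tts = TextToSpeechNode()
+
+    # "default" transitions chain the four steps in order.
+    capture >> stt
+    stt >> llm
+    llm >> tts
+
+    # The "next_turn" action loops back for another round of conversation.
+    tts - "next_turn" >> capture
+
+    return Flow(start=capture)
+```
+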
+## Utility Functions
+
+> Notes for AI:
+> 1. Understand the utility function definition thoroughly by reviewing the doc.
+> 2. Include only the necessary utility functions, based on nodes in the flow.
+
+1. **`record_audio()`** (`utils/audio_utils.py`)
+   - *Input*: (Optional) `silence_threshold` (float, e.g., RMS energy), `min_silence_duration_ms` (int), `chunk_size_ms` (int), `sample_rate` (int, Hz), `channels` (int).
+   - *Output*: A tuple `(audio_data, sample_rate)` where `audio_data` is in-memory audio (e.g., bytes or NumPy array) and `sample_rate` is the recording sample rate (int).
+   - *Description*: Records audio from the microphone. Starts recording when sound is detected above `silence_threshold` (optional, or starts immediately) and stops after `min_silence_duration_ms` of sound below the threshold.
+   - *Necessity*: Used by `CaptureAudioNode` to get the user's voice input.
+
+2. **`speech_to_text_api(audio_data, sample_rate)`** (`utils/speech_to_text.py`)
+   - *Input*: `audio_data` (bytes or NumPy array), `sample_rate` (int).
+   - *Output*: `transcribed_text` (str).
+   - *Necessity*: Used by `SpeechToTextNode` to convert in-memory audio data to text.
+
+3. **`call_llm(prompt, history)`** (`utils/call_llm.py`)
+   - *Input*: `prompt` (str), `history` (list of dicts, e.g., `[{"role": "user", "content": "..."}]`).
+   - *Output*: `llm_response_text` (str).
+   - *Necessity*: Used by `QueryLLMNode` to get an intelligent response.
+
+4. **`text_to_speech_api(text_to_synthesize)`** (`utils/text_to_speech.py`)
+   - *Input*: `text_to_synthesize` (str).
+   - *Output*: A tuple `(audio_data, sample_rate)` where `audio_data` is in-memory audio (e.g., NumPy array) and `sample_rate` is the audio sample rate (int).
+   - *Necessity*: Used by `TextToSpeechNode` to convert LLM text to speakable in-memory audio data.
+
+5. **`play_audio_data(audio_data, sample_rate)`** (`utils/audio_utils.py`)
+   - *Input*: `audio_data` (NumPy array), `sample_rate` (int).
+   - *Output*: None.
+   - *Necessity*: Used by `TextToSpeechNode` (in its `post` method) to play the in-memory synthesized speech.
+
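+The STT and TTS utilities are not part of this change, so the following is only a rough sketch of how they could look if built on OpenAI's audio endpoints (`whisper-1` for transcription, `tts-1` for synthesis) with `soundfile` handling in-memory WAV conversion. Treat the model names, voice, and serialization details as assumptions, not as the final implementation.
+
+```python
+import io
+
+import soundfile as sf
+from openai import OpenAI
+
+def speech_to_text_api(audio_data, sample_rate):
+    """Transcribe in-memory float32 audio to text (illustrative sketch)."""
+    client = OpenAI()
+    buffer = io.BytesIO()
+    sf.write(buffer, audio_data, sample_rate, format="WAV")  # serialize to WAV in memory
+    buffer.seek(0)
+    transcript = client.audio.transcriptions.create(
+        model="whisper-1",
+        file=("speech.wav", buffer),  # (filename, file-like) so the API knows the format
+    )
+    return transcript.text
+
+def text_to_speech_api(text_to_synthesize):
+    """Synthesize speech and return (audio_data, sample_rate) (illustrative sketch)."""
+    client = OpenAI()
+    response = client.audio.speech.create(
+        model="tts-1",
+        voice="alloy",
+        input=text_to_synthesize,
+        response_format="wav",  # WAV so soundfile can decode it without extra codecs
+    )
+    audio_data, sample_rate = sf.read(io.BytesIO(response.content), dtype="float32")
+    return audio_data, sample_rate
+```
+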
+## Node Design
+
+### Shared Memory
+
+> Notes for AI: Try to minimize data redundancy
+
+The shared memory structure is organized as follows:
+
+```python
+shared = {
+    "user_audio_data": None,         # In-memory audio data (bytes or NumPy array) from user
+    "user_audio_sample_rate": None,  # int: Sample rate of the user audio
+    "user_text_query": None,         # str: Transcribed user text
+    "llm_text_response": None,       # str: Text response from LLM
+    # "llm_audio_data" and "llm_audio_sample_rate" are handled as exec_res within TextToSpeechNode's post method
+    "chat_history": [],              # list: Conversation history [{"role": "user/assistant", "content": "..."}]
+    "continue_conversation": True    # boolean: Flag to control the main conversation loop
+}
+```
+
+### Node Steps
+
+> Notes for AI: Carefully decide whether to use Batch/Async Node/Flow.
+
+1. **`CaptureAudioNode`**
+   - *Purpose*: Record audio input from the user using VAD.
+   - *Type*: Regular
+   - *Steps*:
+     - *prep*: Check `shared["continue_conversation"]`. (Potentially load VAD parameters from `shared["config"]` if dynamic.)
+     - *exec*: Call `utils.audio_utils.record_audio()` (passing VAD parameters if configured).
+     - *post*: `audio_data, sample_rate = exec_res`. Write `audio_data` to `shared["user_audio_data"]` and `sample_rate` to `shared["user_audio_sample_rate"]`. Returns `"default"`.
+
+2. **`SpeechToTextNode`**
+   - *Purpose*: Convert the recorded in-memory audio to text.
+   - *Type*: Regular
+   - *Steps*:
+     - *prep*: Read `shared["user_audio_data"]` and `shared["user_audio_sample_rate"]`. Return `(user_audio_data, user_audio_sample_rate)`.
+     - *exec*: `audio_data, sample_rate = prep_res`. Call `utils.speech_to_text.speech_to_text_api(audio_data, sample_rate)`.
+     - *post*:
+       - Write `exec_res` (transcribed text) to `shared["user_text_query"]`.
+       - Append `{"role": "user", "content": exec_res}` to `shared["chat_history"]`.
+       - Clear `shared["user_audio_data"]` and `shared["user_audio_sample_rate"]` as they are no longer needed.
+       - Returns `"default"`.
+
+3. **`QueryLLMNode`**
+   - *Purpose*: Get a response from the LLM based on the user's query and conversation history.
+   - *Type*: Regular
+   - *Steps*:
+     - *prep*: Read `shared["user_text_query"]` and `shared["chat_history"]`. Return `(user_text_query, chat_history)`.
+     - *exec*: Call `utils.call_llm.call_llm(prompt=prep_res[0], history=prep_res[1])`.
+     - *post*:
+       - Write `exec_res` (LLM text response) to `shared["llm_text_response"]`.
+       - Append `{"role": "assistant", "content": exec_res}` to `shared["chat_history"]`.
+       - Returns `"default"`.
+
+4. **`TextToSpeechNode`**
+   - *Purpose*: Convert the LLM's text response into speech and play it.
+   - *Type*: Regular
+   - *Steps*:
+     - *prep*: Read `shared["llm_text_response"]`.
+     - *exec*: Call `utils.text_to_speech.text_to_speech_api(prep_res)`. This returns `(llm_audio_data, llm_sample_rate)`.
+     - *post*: `llm_audio_data, llm_sample_rate = exec_res`.
+       - Call `utils.audio_utils.play_audio_data(llm_audio_data, llm_sample_rate)`.
+       - (Optional) Log completion.
+       - If `shared["continue_conversation"]` is `True`, return `"next_turn"` to loop back.
+       - Otherwise, return `"end_conversation"`.
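+
+As a concrete illustration of the prep/exec/post contract above, `QueryLLMNode` might look like the sketch below. This is not code included in this change; it assumes PocketFlow's `Node` base class and imports `call_llm` from the `utils/call_llm.py` file added in this PR.
+
+```python
+from pocketflow import Node
+from utils.call_llm import call_llm
+
+class QueryLLMNode(Node):
+    def prep(self, shared):
+        # Gather the transcribed query and the running conversation history.
+        return shared["user_text_query"], shared["chat_history"]
+
+    def exec(self, prep_res):
+        user_text_query, chat_history = prep_res
+        return call_llm(prompt=user_text_query, history=chat_history)
+
+    def post(self, shared, prep_res, exec_res):
+        # Store the response and extend the history for the next turn.
+        shared["llm_text_response"] = exec_res
+        shared["chat_history"].append({"role": "assistant", "content": exec_res})
+        return "default"
+```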
diff --git a/cookbook/pocketflow-voice-chat/requirements.txt b/cookbook/pocketflow-voice-chat/requirements.txt
new file mode 100644
index 0000000..4d2e0ce
--- /dev/null
+++ b/cookbook/pocketflow-voice-chat/requirements.txt
@@ -0,0 +1,5 @@
+openai
+sounddevice
+numpy
+scipy
+soundfile
\ No newline at end of file
diff --git a/cookbook/pocketflow-voice-chat/utils/__init__.py b/cookbook/pocketflow-voice-chat/utils/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/cookbook/pocketflow-voice-chat/utils/audio_utils.py b/cookbook/pocketflow-voice-chat/utils/audio_utils.py
new file mode 100644
index 0000000..accffc8
--- /dev/null
+++ b/cookbook/pocketflow-voice-chat/utils/audio_utils.py
@@ -0,0 +1,132 @@
+import sounddevice as sd
+import numpy as np
+import time
+# import wave # No longer needed for dummy file saving in main for play_audio_file
+# import tempfile # No longer needed for dummy file saving in main
+# import os # No longer needed for dummy file saving in main
+# import soundfile as sf # No longer needed as play_audio_file is removed
+
+DEFAULT_SAMPLE_RATE = 44100
+DEFAULT_CHANNELS = 1
+DEFAULT_CHUNK_SIZE_MS = 50  # Process audio in 50ms chunks for VAD
+DEFAULT_SILENCE_THRESHOLD_RMS = 0.01  # RMS value, needs tuning
+DEFAULT_MIN_SILENCE_DURATION_MS = 1000  # 1 second of silence to stop
+DEFAULT_MAX_RECORDING_DURATION_S = 15  # Safety cap for recording
+DEFAULT_PRE_ROLL_CHUNKS = 3  # Number of chunks to keep before speech starts
+
+def record_audio(sample_rate=DEFAULT_SAMPLE_RATE,
+                 channels=DEFAULT_CHANNELS,
+                 chunk_size_ms=DEFAULT_CHUNK_SIZE_MS,
+                 silence_threshold_rms=DEFAULT_SILENCE_THRESHOLD_RMS,
+                 min_silence_duration_ms=DEFAULT_MIN_SILENCE_DURATION_MS,
+                 max_recording_duration_s=DEFAULT_MAX_RECORDING_DURATION_S,
+                 pre_roll_chunks_count=DEFAULT_PRE_ROLL_CHUNKS):
+    """
+    Records audio from the microphone with silence-based VAD.
+    Returns in-memory audio data (NumPy array of float32) and sample rate.
+    Returns (None, sample_rate) if recording fails or max duration is met without speech.
+    """
+    chunk_size_frames = int(sample_rate * chunk_size_ms / 1000)
+    min_silence_chunks = int(min_silence_duration_ms / chunk_size_ms)
+    max_chunks = int(max_recording_duration_s * 1000 / chunk_size_ms)
+
+    print(f"Listening... (max {max_recording_duration_s}s). Speak when ready.")
+    print(f"(Silence threshold RMS: {silence_threshold_rms}, Min silence duration: {min_silence_duration_ms}ms)")
+
+    recorded_frames = []
+    pre_roll_frames = []
+    is_recording = False
+    silence_counter = 0
+    chunks_recorded = 0
+
+    stream = None
+    try:
+        stream = sd.InputStream(samplerate=sample_rate, channels=channels, dtype='float32')
+        stream.start()
+
+        for i in range(max_chunks):
+            audio_chunk, overflowed = stream.read(chunk_size_frames)
+            if overflowed:
+                print("Warning: Audio buffer overflowed!")
+
+            rms = np.sqrt(np.mean(audio_chunk**2))
+
+            if is_recording:
+                recorded_frames.append(audio_chunk)
+                chunks_recorded += 1
+                if rms < silence_threshold_rms:
+                    silence_counter += 1
+                    if silence_counter >= min_silence_chunks:
+                        print("Silence detected, stopping recording.")
+                        break
+                else:
+                    silence_counter = 0  # Reset silence counter on sound
+            else:
+                # Keep a short pre-roll buffer so the very start of speech is not clipped.
+                pre_roll_frames.append(audio_chunk)
+                if len(pre_roll_frames) > pre_roll_chunks_count:
+                    pre_roll_frames.pop(0)
+
+                if rms > silence_threshold_rms:
+                    print("Speech detected, starting recording.")
+                    is_recording = True
+                    for frame_to_add in pre_roll_frames:
+                        recorded_frames.append(frame_to_add)
+                    chunks_recorded = len(recorded_frames)
+                    pre_roll_frames.clear()
+
+            if i == max_chunks - 1 and not is_recording:
+                print("No speech detected within the maximum recording duration.")
+                stream.stop()
+                stream.close()
+                return None, sample_rate
+
+        if not recorded_frames and is_recording:
+            print("Recording started but captured no frames before stopping. This might be due to immediate silence.")
+
+    except Exception as e:
+        print(f"Error during recording: {e}")
+        return None, sample_rate
+    finally:
+        if stream and not stream.closed:
+            stream.stop()
+            stream.close()
+
+    if not recorded_frames:
+        print("No audio was recorded.")
+        return None, sample_rate
+
+    audio_data = np.concatenate(recorded_frames)
+    print(f"Recording finished. Total duration: {len(audio_data)/sample_rate:.2f}s")
+    return audio_data, sample_rate
+
+def play_audio_data(audio_data, sample_rate):
+    """Plays in-memory audio data (NumPy array)."""
+    try:
+        print(f"Playing in-memory audio data (Sample rate: {sample_rate} Hz, Duration: {len(audio_data)/sample_rate:.2f}s)")
+        sd.play(audio_data, sample_rate)
+        sd.wait()  # Block until playback completes
+        print("Playback from memory finished.")
+    except Exception as e:
+        print(f"Error playing in-memory audio: {e}")
+
+
+if __name__ == "__main__":
+    print("--- Testing audio_utils.py ---")
+
+    # Test 1: record_audio() and play_audio_data() (in-memory)
+    print("\n--- Test: Record and Play In-Memory Audio ---")
+    print("Please speak into the microphone. Recording will start on sound and stop on silence.")
+    recorded_audio, rec_sr = record_audio(
+        sample_rate=DEFAULT_SAMPLE_RATE,
+        silence_threshold_rms=0.02,
+        min_silence_duration_ms=1500,
+        max_recording_duration_s=10
+    )
+
+    if recorded_audio is not None and rec_sr is not None:
+        print(f"Recorded audio data shape: {recorded_audio.shape}, Sample rate: {rec_sr} Hz")
+        play_audio_data(recorded_audio, rec_sr)
+    else:
+        print("No audio recorded or recording failed.")
+
+    print("\n--- audio_utils.py tests finished. ---")
\ No newline at end of file
diff --git a/cookbook/pocketflow-voice-chat/utils/call_llm.py b/cookbook/pocketflow-voice-chat/utils/call_llm.py
new file mode 100644
index 0000000..5f7dc09
--- /dev/null
+++ b/cookbook/pocketflow-voice-chat/utils/call_llm.py
@@ -0,0 +1,45 @@
+import os
+from openai import OpenAI
+
+def call_llm(prompt, history=None):
+    """
+    Calls the OpenAI API to get a response from an LLM.
+
+    Args:
+        prompt: The user's current prompt.
+        history: A list of previous messages in the conversation, where each message
+                 is a dict with "role" and "content" keys. E.g.,
+                 [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi there!"}]
+
+    Returns:
+        The LLM's response content as a string.
+    """
+    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "your-api-key"))  # Default if not set
+
+    messages = []
+    if history:
+        messages.extend(history)
+    messages.append({"role": "user", "content": prompt})
+
+    r = client.chat.completions.create(
+        model="gpt-4o",
+        messages=messages
+    )
+    return r.choices[0].message.content
+
+if __name__ == "__main__":
+    # Ensure you have OPENAI_API_KEY set in your environment for this test to work
+    print("Testing LLM call...")
+
+    # Test with a simple prompt
+    response = call_llm("Tell me a short joke")
+    print(f"LLM (Simple Joke): {response}")
+
+    # Test with history
+    chat_history = [
+        {"role": "user", "content": "What is the capital of France?"},
+        {"role": "assistant", "content": "The capital of France is Paris."}
+    ]
+    follow_up_prompt = "And what is a famous landmark there?"
+    response_with_history = call_llm(follow_up_prompt, history=chat_history)
+    print(f"LLM (Follow-up with History): {response_with_history}")
\ No newline at end of file