# Design Doc: PocketFlow Voice Chat

> Please DON'T remove notes for AI
## Requirements

> Notes for AI: Keep it simple and clear. If the requirements are abstract, write concrete user stories.
- Goal: Enable users to interact with an LLM via voice in a continuous conversation, receiving spoken responses.
- User Story 1: As a user, I want to speak my query into a microphone so that the application can understand what I'm asking.
- User Story 2: As a user, I want the application to send my spoken query to an LLM for processing.
- User Story 3: As a user, I want to hear the LLM's response spoken back to me.
- User Story 4: As a user, after hearing the response, I want the application to be ready for my next spoken query without restarting.
- Core Functionalities:
  - Capture audio input.
  - Convert speech to text (STT).
  - Process text with an LLM (maintaining conversation history).
  - Convert LLM text response to speech (TTS).
  - Play back synthesized audio.
  - Loop back to capture new audio input for a continuous conversation.
## Flow Design

> Notes for AI:
> - Consider the design patterns of agent, map-reduce, RAG, and workflow. Apply them if they fit.
> - Present a concise, high-level description of the workflow.

### Applicable Design Pattern:
- Workflow: A sequential workflow with a loop is most appropriate. Each step (audio capture, STT, LLM query, TTS, audio playback) directly follows the previous, and after playback, the flow returns to the audio capture stage.
### Flow high-level Design:
The application will operate in a loop to allow for continuous conversation:
1. **`CaptureAudioNode`**: Records audio from the user's microphone when triggered.
2. **`SpeechToTextNode`**: Converts the recorded audio into text.
3. **`QueryLLMNode`**: Sends the transcribed text (with history) to an LLM and gets a text response.
4. **`TextToSpeechNode`**: Converts the LLM's text response into in-memory audio data and then plays it. After completion, the flow transitions back to the `CaptureAudioNode`.
```mermaid
flowchart TD
    CaptureAudio[Capture Audio] --> SpeechToText[Speech to Text]
    SpeechToText --> QueryLLM[Query LLM]
    QueryLLM --> TextToSpeech[Text to Speech & Play]
    TextToSpeech -- "Next Turn" --> CaptureAudio
```
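As a quick illustration of this loop in code, here is a minimal wiring sketch. It assumes PocketFlow's `Node`/`Flow` API, where `>>` wires the default transition and `- "action" >>` wires a named one; the node classes are specified in the Node Design section below.

```python
from pocketflow import Flow

capture = CaptureAudioNode()
stt = SpeechToTextNode()
llm = QueryLLMNode()
tts = TextToSpeechNode()

capture >> stt >> llm >> tts      # "default" transitions through one turn
tts - "next_turn" >> capture      # loop back for the next turn

flow = Flow(start=capture)
# flow.run(shared) keeps looping until TextToSpeechNode returns "end_conversation"
```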
## Utility Functions

> Notes for AI:
> - Understand the utility function definition thoroughly by reviewing the doc.
> - Include only the necessary utility functions, based on nodes in the flow.
1. **`record_audio()`** (`utils/audio_utils.py`)
   - Input: (Optional) `sample_rate` (int, Hz, e.g., `DEFAULT_SAMPLE_RATE`), `channels` (int, e.g., `DEFAULT_CHANNELS`), `chunk_size_ms` (int, e.g., `DEFAULT_CHUNK_SIZE_MS`), `silence_threshold_rms` (float, e.g., `DEFAULT_SILENCE_THRESHOLD_RMS`), `min_silence_duration_ms` (int, e.g., `DEFAULT_MIN_SILENCE_DURATION_MS`), `max_recording_duration_s` (int, e.g., `DEFAULT_MAX_RECORDING_DURATION_S`), `pre_roll_chunks_count` (int, e.g., `DEFAULT_PRE_ROLL_CHUNKS`).
   - Output: A tuple `(audio_data, sample_rate)` where `audio_data` is a NumPy array of float32 audio samples and `sample_rate` is the recording sample rate (int). Returns `(None, sample_rate)` if no speech is detected or recording fails.
   - Description: Records audio from the microphone using silence-based Voice Activity Detection (VAD). Buffers `pre_roll_chunks_count` chunks of audio and starts full recording when sound is detected above `silence_threshold_rms`. Stops after `min_silence_duration_ms` of sound below the threshold or if `max_recording_duration_s` is reached.
   - Necessity: Used by `CaptureAudioNode` to get the user's voice input.
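A possible sketch of `record_audio()`, assuming `sounddevice` as the capture backend. The numeric defaults are illustrative stand-ins for the `DEFAULT_*` constants, which this doc does not pin down:

```python
import collections
import numpy as np
import sounddevice as sd  # assumed capture backend

def record_audio(sample_rate=16000, channels=1, chunk_size_ms=50,
                 silence_threshold_rms=0.01, min_silence_duration_ms=800,
                 max_recording_duration_s=30, pre_roll_chunks_count=5):
    chunk_frames = int(sample_rate * chunk_size_ms / 1000)
    max_chunks = int(max_recording_duration_s * 1000 / chunk_size_ms)
    silence_chunks_needed = int(min_silence_duration_ms / chunk_size_ms)

    pre_roll = collections.deque(maxlen=pre_roll_chunks_count)
    recorded, recording, silent_chunks = [], False, 0

    with sd.InputStream(samplerate=sample_rate, channels=channels, dtype="float32") as stream:
        for _ in range(max_chunks):
            chunk, _overflowed = stream.read(chunk_frames)   # (frames, channels) float32
            rms = float(np.sqrt(np.mean(chunk ** 2)))
            if not recording:
                pre_roll.append(chunk)                       # keep a little context before speech
                if rms >= silence_threshold_rms:
                    recorded.extend(pre_roll)                # speech detected: start recording
                    recording = True
            else:
                recorded.append(chunk)
                silent_chunks = silent_chunks + 1 if rms < silence_threshold_rms else 0
                if silent_chunks >= silence_chunks_needed:   # enough trailing silence: stop
                    break

    if not recorded:
        return None, sample_rate                             # no speech detected
    return np.concatenate(recorded, axis=0).squeeze(), sample_rate
```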
2. **`speech_to_text_api(audio_data, sample_rate)`** (`utils/speech_to_text.py`)
   - Input: `audio_data` (bytes), `sample_rate` (int, though the API might infer this from the audio format).
   - Output: `transcribed_text` (str).
   - Necessity: Used by `SpeechToTextNode` to convert in-memory audio data to text.
   - Example Model: OpenAI `gpt-4o-transcribe`.
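A minimal sketch of `speech_to_text_api()` using the OpenAI Python SDK; passing the bytes as a `("speech.wav", ...)` tuple and assuming the audio is already a WAV container are choices made here, not requirements from this doc:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def speech_to_text_api(audio_data: bytes, sample_rate: int) -> str:
    # audio_data is expected to be a complete audio file (e.g., WAV) in memory;
    # the API infers the sample rate from the container, so sample_rate is unused here.
    result = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=("speech.wav", audio_data),
    )
    return result.text
```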
3. **`call_llm(messages)`** (`utils/call_llm.py`)
   - Input: `messages` (list of dicts, e.g., `[{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]`). This should be the complete conversation history including the latest user query.
   - Output: `llm_response_text` (str).
   - Necessity: Used by `QueryLLMNode` to get an intelligent response.
   - Example Model: OpenAI `gpt-4o`.
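A minimal sketch of `call_llm()` with the OpenAI chat completions API:

```python
from openai import OpenAI

client = OpenAI()

def call_llm(messages):
    # messages is the full conversation history, latest user turn last
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content
```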
4. **`text_to_speech_api(text_to_synthesize)`** (`utils/text_to_speech.py`)
   - Input: `text_to_synthesize` (str).
   - Output: A tuple `(audio_data, sample_rate)` where `audio_data` is in-memory audio as bytes (e.g., MP3 format from OpenAI) and `sample_rate` is the audio sample rate (int, e.g., 24000 Hz for OpenAI `gpt-4o-mini-tts`).
   - Necessity: Used by `TextToSpeechNode` to convert LLM text to speakable in-memory audio data.
   - Example Model: OpenAI `gpt-4o-mini-tts`.
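A minimal sketch of `text_to_speech_api()`; the `voice="alloy"` choice and the hard-coded 24000 Hz return value are assumptions consistent with the note above, not API guarantees:

```python
from openai import OpenAI

client = OpenAI()

def text_to_speech_api(text_to_synthesize: str):
    # Returns (audio_bytes, sample_rate); the bytes are a compressed format such as MP3
    response = client.audio.speech.create(
        model="gpt-4o-mini-tts",
        voice="alloy",          # assumed voice choice
        input=text_to_synthesize,
    )
    return response.content, 24000  # assumed fixed output rate for this model
```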
5. **`play_audio_data(audio_data, sample_rate)`** (`utils/audio_utils.py`)
   - Input: `audio_data` (NumPy array of float32 audio samples), `sample_rate` (int).
   - Output: None.
   - Necessity: Used by `TextToSpeechNode` (in its `post` method) to play the in-memory synthesized speech.
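A sketch of `play_audio_data()`, again assuming `sounddevice`:

```python
import sounddevice as sd

def play_audio_data(audio_data, sample_rate):
    # Blocks until playback of the float32 sample array has finished
    sd.play(audio_data, samplerate=sample_rate)
    sd.wait()
```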
## Node Design

### Shared Memory

> Notes for AI: Try to minimize data redundancy
The shared memory structure is organized as follows:
```python
shared = {
    "user_audio_data": None,        # In-memory audio data (NumPy array) from the user
    "user_audio_sample_rate": None, # int: Sample rate of the user audio
    "chat_history": [],             # list: Conversation history [{"role": "user/assistant", "content": "..."}]
    "continue_conversation": True   # bool: Flag to control the main conversation loop
}
```
### Node Steps

> Notes for AI: Carefully decide whether to use Batch/Async Node/Flow.
1. **`CaptureAudioNode`**
   - Purpose: Record audio input from the user using VAD.
   - Type: Regular
   - Steps:
     - prep: Check `shared["continue_conversation"]`. (Potentially load VAD parameters from `shared["config"]` if dynamic.)
     - exec: Call `utils.audio_utils.record_audio()` (passing VAD parameters if configured). This returns a NumPy array and sample rate.
     - post: `audio_numpy_array, sample_rate = exec_res`. Write `audio_numpy_array` to `shared["user_audio_data"]` and `sample_rate` to `shared["user_audio_sample_rate"]`. Return `"default"`.
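A sketch of this node, assuming PocketFlow's `prep`/`exec`/`post` Node interface:

```python
from pocketflow import Node
from utils.audio_utils import record_audio

class CaptureAudioNode(Node):
    def prep(self, shared):
        # Check the loop flag; VAD parameters could also be pulled from shared["config"]
        return shared.get("continue_conversation", True)

    def exec(self, prep_res):
        return record_audio()  # -> (numpy_array_or_None, sample_rate)

    def post(self, shared, prep_res, exec_res):
        audio_numpy_array, sample_rate = exec_res
        shared["user_audio_data"] = audio_numpy_array
        shared["user_audio_sample_rate"] = sample_rate
        return "default"
```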
2. **`SpeechToTextNode`**
   - Purpose: Convert the recorded in-memory audio to text.
   - Type: Regular
   - Steps:
     - prep: Read `shared["user_audio_data"]` (NumPy array) and `shared["user_audio_sample_rate"]`. Return `(user_audio_data_numpy, user_audio_sample_rate)`.
     - exec: `audio_numpy_array, sample_rate = prep_res`. Convert `audio_numpy_array` to audio `bytes` (e.g., in WAV format using `scipy.io.wavfile.write` to an `io.BytesIO` object). Call `utils.speech_to_text.speech_to_text_api(audio_bytes, sample_rate)`.
     - post:
       - Let `transcribed_text = exec_res`.
       - Append `{"role": "user", "content": transcribed_text}` to `shared["chat_history"]`.
       - Clear `shared["user_audio_data"]` and `shared["user_audio_sample_rate"]`, as they are no longer needed.
       - Return `"default"` (assuming STT is successful, per the simplification).
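A sketch of this node; the float32-to-int16 conversion before `scipy.io.wavfile.write` is an implementation choice made here:

```python
import io
import numpy as np
from scipy.io import wavfile
from pocketflow import Node
from utils.speech_to_text import speech_to_text_api

class SpeechToTextNode(Node):
    def prep(self, shared):
        return shared["user_audio_data"], shared["user_audio_sample_rate"]

    def exec(self, prep_res):
        audio_numpy_array, sample_rate = prep_res
        # Encode the float32 samples as 16-bit PCM WAV bytes, entirely in memory
        buffer = io.BytesIO()
        pcm16 = (np.clip(audio_numpy_array, -1.0, 1.0) * 32767).astype(np.int16)
        wavfile.write(buffer, sample_rate, pcm16)
        return speech_to_text_api(buffer.getvalue(), sample_rate)

    def post(self, shared, prep_res, exec_res):
        shared["chat_history"].append({"role": "user", "content": exec_res})
        shared["user_audio_data"] = None          # no longer needed
        shared["user_audio_sample_rate"] = None
        return "default"
```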
3. **`QueryLLMNode`**
   - Purpose: Get a response from the LLM based on the user's query and conversation history.
   - Type: Regular
   - Steps:
     - prep: Read `shared["chat_history"]`. Return `chat_history`.
     - exec: `history = prep_res`. Call `utils.call_llm.call_llm(messages=history)`.
     - post:
       - Let `llm_response = exec_res`.
       - Append `{"role": "assistant", "content": llm_response}` to `shared["chat_history"]`.
       - Return `"default"` (assuming the LLM call is successful).
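A sketch of this node:

```python
from pocketflow import Node
from utils.call_llm import call_llm

class QueryLLMNode(Node):
    def prep(self, shared):
        return shared["chat_history"]

    def exec(self, prep_res):
        # prep_res is the full history, with the latest user turn appended by SpeechToTextNode
        return call_llm(messages=prep_res)

    def post(self, shared, prep_res, exec_res):
        shared["chat_history"].append({"role": "assistant", "content": exec_res})
        return "default"
```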
4. **`TextToSpeechNode`**
   - Purpose: Convert the LLM's text response into speech and play it.
   - Type: Regular
   - Steps:
     - prep: Read `shared["chat_history"]`. Identify the last message, which should be the LLM's response. Return its content.
     - exec: `text_to_synthesize = prep_res`. Call `utils.text_to_speech.text_to_speech_api(text_to_synthesize)`. This returns `(llm_audio_bytes, llm_sample_rate)`.
     - post:
       - `llm_audio_bytes, llm_sample_rate = exec_res`.
       - Convert `llm_audio_bytes` (e.g., MP3 bytes from the TTS API) to a NumPy array of audio samples (e.g., using a library like `pydub` or `soundfile` to decode).
       - Call `utils.audio_utils.play_audio_data(llm_audio_numpy_array, llm_sample_rate)`.
       - (Optional) Log completion.
       - If `shared["continue_conversation"]` is `True`, return `"next_turn"` to loop back. Otherwise, return `"end_conversation"`.
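A sketch of this node, using `soundfile` to decode the TTS bytes (this assumes the installed libsndfile build can decode the returned format; `pydub` is the alternative mentioned above):

```python
import io
import soundfile as sf
from pocketflow import Node
from utils.text_to_speech import text_to_speech_api
from utils.audio_utils import play_audio_data

class TextToSpeechNode(Node):
    def prep(self, shared):
        # The last history entry is the assistant reply appended by QueryLLMNode
        return shared["chat_history"][-1]["content"]

    def exec(self, prep_res):
        return text_to_speech_api(prep_res)  # -> (audio_bytes, sample_rate)

    def post(self, shared, prep_res, exec_res):
        llm_audio_bytes, _llm_sample_rate = exec_res
        # Decode the compressed bytes into float32 samples; use the rate reported by the decoder
        samples, decoded_rate = sf.read(io.BytesIO(llm_audio_bytes), dtype="float32")
        play_audio_data(samples, decoded_rate)
        return "next_turn" if shared.get("continue_conversation", True) else "end_conversation"
```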