87 lines
4.1 KiB
Markdown
87 lines
4.1 KiB
Markdown
# PocketFlow Voice Chat
|
|
|
|
This project demonstrates a voice-based interactive chat application built with PocketFlow. Users can speak their queries, and the system will respond with spoken answers from an LLM, maintaining conversation history.
|
|
|
|
- Check out the [Substack Post Tutorial](https://pocketflow.substack.com/p/build-your-own-voice-chatbot-from) for more!
|
|
|
|
|
|
## Features
|
|
|
|
- **Voice Activity Detection (VAD)**: Automatically detects when the user starts and stops speaking.
|
|
- **Speech-to-Text (STT)**: Converts spoken audio into text using OpenAI.
|
|
- **LLM Interaction**: Processes the transcribed text with an LLM (e.g., GPT-4o), maintaining conversation history.
|
|
- **Text-to-Speech (TTS)**: Converts the LLM's text response back into audible speech using OpenAI.
|
|
- **Continuous Conversation**: Loops back to listen for the next user query after responding, allowing for an ongoing dialogue.
|
|
|
|
## How to Run
|
|
|
|
1. **Set your OpenAI API key**:
|
|
```bash
|
|
export OPENAI_API_KEY="your-api-key-here"
|
|
```
|
|
Ensure this environment variable is set, as the utility scripts for STT, LLM, and TTS rely on it.
|
|
You can test individual utility functions (e.g., `python utils/call_llm.py`, `python utils/text_to_speech.py`) to help verify your API key and setup.
|
|
|
|
2. **Install dependencies**:
|
|
Make sure you have Python installed. Then, install the required libraries using pip:
|
|
```bash
|
|
pip install -r requirements.txt
|
|
```
|
|
This will install libraries such as `openai`, `pocketflow`, `sounddevice`, `numpy`, `scipy`, and `soundfile`.
|
|
|
|
**Note for Linux users**: `sounddevice` may require PortAudio. If you encounter issues, you might need to install it first:
|
|
```bash
|
|
sudo apt-get update && sudo apt-get install -y portaudio19-dev
|
|
```
|
|
|
|
3. **Run the application**:
|
|
```bash
|
|
python main.py
|
|
```
|
|
Follow the console prompts. The application will start listening when you see "Listening for your query...".
|
|
|
|
## How It Works
|
|
|
|
The application uses a PocketFlow workflow to manage the conversation steps:
|
|
|
|
```mermaid
|
|
flowchart TD
|
|
CaptureAudio[Capture Audio] --> SpeechToText[Speech to Text]
|
|
SpeechToText --> QueryLLM[Query LLM]
|
|
QueryLLM --> TextToSpeech[Text to Speech & Play]
|
|
TextToSpeech -- "Next Turn" --> CaptureAudio
|
|
```
|
|
|
|
Here's what each node in the flow does:
|
|
|
|
1. **`CaptureAudioNode`**: Records audio from the user's microphone. It uses Voice Activity Detection (VAD) to start recording when speech is detected and stop when silence is detected.
|
|
2. **`SpeechToTextNode`**: Takes the recorded audio data, converts it to a suitable format, and sends it to OpenAI's STT API (gpt-4o-transcribe) to get the transcribed text.
|
|
3. **`QueryLLMNode`**: Takes the transcribed text from the user, along with the existing conversation history, and sends it to an LLM (OpenAI's GPT-4o model) to generate an intelligent response.
|
|
4. **`TextToSpeechNode`**: Receives the text response from the LLM, converts it into audio using OpenAI's TTS API (gpt-4o-mini-tts), and plays the audio back to the user. If the conversation is set to continue, it transitions back to the `CaptureAudioNode`.
|
|
|
|
## Example Interaction
|
|
|
|
When you run `main.py`:
|
|
|
|
1. The console will display:
|
|
```
|
|
Starting PocketFlow Voice Chat...
|
|
Speak your query after 'Listening for your query...' appears.
|
|
...
|
|
```
|
|
2. When you see `Listening for your query...`, speak clearly into your microphone.
|
|
3. After you stop speaking, the console will show updates:
|
|
```
|
|
Audio captured (X.XXs), proceeding to STT.
|
|
Converting speech to text...
|
|
User: [Your transcribed query will appear here]
|
|
Sending query to LLM...
|
|
LLM: [The LLM's response text will appear here]
|
|
Converting LLM response to speech...
|
|
Playing LLM response...
|
|
```
|
|
4. You will hear the LLM's response spoken aloud.
|
|
5. The application will then loop back, and you'll see `Listening for your query...` again, ready for your next input.
|
|
|
|
The conversation continues in this manner. To stop the application, you typically need to interrupt it (e.g., Ctrl+C in the terminal), as it's designed to loop continuously.
|