
FastAPI Websocket runs with worse sound quality on Windows PC vs. Mac #957

Open
jonnyjohnson1 opened this issue Jan 10, 2025 · 3 comments

@jonnyjohnson1

Description

I am running the same code on Windows and on a Mac, and it performs differently on the two platforms.

Environment

  • pipecat-ai version: 0.0.52
  • python version: 3.12
  • OS: Windows, macOS

Issue description

Sound output is noticeably worse on Windows machines than on Macs.

Repro steps

import asyncio
import os
import sys

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMMessagesFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.elevenlabs import ElevenLabsTTSService
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.openai import OpenAILLMService
from pipecat.transports.network.fastapi_websocket import (
    FastAPIWebsocketParams,
    FastAPIWebsocketTransport
)
from pipecat.serializers.protobuf import ProtobufFrameSerializer

from loguru import logger

from dotenv import load_dotenv

load_dotenv(override=True)

logger.remove(0)
logger.add(sys.stderr, level="DEBUG")


async def run_bot(websocket_client):
    transport = FastAPIWebsocketTransport(
        websocket=websocket_client,
        params=FastAPIWebsocketParams(
            audio_out_sample_rate=16000,
            audio_out_enabled=True,
            add_wav_header=True,
            vad_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),
            vad_audio_passthrough=True,
            serializer=ProtobufFrameSerializer()
        )
    )

    llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"), model="gpt-4o")

    stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))

    tts = ElevenLabsTTSService(
            api_key=os.getenv("ELEVENLABS_API_KEY", ""),
            voice_id=os.getenv("ELEVENLABS_VOICE_ID", ""),
            output_format="pcm_16000",
            params=ElevenLabsTTSService.InputParams(
                inactivity_timeout=180
            )
        )

    messages = [
        {
            "role": "system",
            "content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
        },
    ]

    context = OpenAILLMContext(messages)
    context_aggregator = llm.create_context_aggregator(context)

    pipeline = Pipeline(
        [
            transport.input(),  # Websocket input from client
            stt,  # Speech-To-Text
            context_aggregator.user(),
            llm,  # LLM
            tts,  # Text-To-Speech
            transport.output(),  # Websocket output to client
            context_aggregator.assistant(),
        ]
    )

    task = PipelineTask(pipeline, params=PipelineParams(allow_interruptions=True))

    @transport.event_handler("on_client_connected")
    async def on_client_connected(transport, client):
        # Kick off the conversation.
        messages.append({"role": "system", "content": "Please introduce yourself to the user."})
        await task.queue_frames([LLMMessagesFrame(messages)])

    runner = PipelineRunner(handle_sigint=False)

    await runner.run(task)
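
The repro above only shows run_bot(). The FastAPI app that serves it would look roughly like the sketch below; the module layout, the /ws route, and port 8765 are assumptions inferred from the index.html client further down, not part of the original report:

# server.py -- hypothetical wiring for run_bot(); adjust module names and ports to your setup.
import uvicorn
from fastapi import FastAPI, WebSocket

from bot import run_bot  # hypothetical module name for the repro code above

app = FastAPI()


@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    # Accept the browser connection, then hand it off to the pipecat pipeline.
    await websocket.accept()
    await run_bot(websocket)


if __name__ == "__main__":
    # Port 8765 matches the ws://localhost:8765/ws URL used by index.html below.
    uvicorn.run(app, host="0.0.0.0", port=8765)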

index.html used to connect to the above example

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <script src="https://cdn.jsdelivr.net/npm/[email protected]/dist/protobuf.min.js"></script>
    <title>Pipecat WebSocket Client Example</title>
</head>

<body>
    <h1>Pipecat WebSocket Client Example</h1>
    <h3>
        <div id="progressText">Loading, wait...</div>
    </h3>
        <button id="startAudioBtn">Start Audio</button>
        <button id="stopAudioBtn">Stop Audio</button>
        <script>
            const SAMPLE_RATE = 16000;
            const NUM_CHANNELS = 1;
            const PLAY_TIME_RESET_THRESHOLD_MS = 1.0;

            // The protobuf type. We will load it later.
            let Frame = null;

            // The websocket connection.
            let ws = null;

            // The audio context
            let audioContext = null;

            // The audio context media stream source
            let source = null;

            // The microphone stream from getUserMedia. Should be sampled to the
            // proper sample rate.
            let microphoneStream = null;

            // Script processor to get data from microphone.
            let scriptProcessor = null;

            // AudioContext play time.
            let playTime = 0;

            // Last time we received a websocket message.
            let lastMessageTime = 0;

            // Whether we should be playing audio.
            let isPlaying = false;

            let startBtn = document.getElementById('startAudioBtn');
            let stopBtn = document.getElementById('stopAudioBtn');

            const proto = protobuf.load('frames.proto', (err, root) => {
                if (err) {
                    throw err;
                }
                Frame = root.lookupType('pipecat.Frame');
                const progressText = document.getElementById('progressText');
                progressText.textContent = 'We are ready! Make sure to run the server and then click `Start Audio`.';

                startBtn.disabled = false;
                stopBtn.disabled = true;
            });

            function initWebSocket() {
                ws = new WebSocket('ws://localhost:8765/ws');
                // This is so `event.data` is already an ArrayBuffer.
                ws.binaryType = 'arraybuffer';

                ws.addEventListener('open', handleWebSocketOpen);
                ws.addEventListener('message', handleWebSocketMessage);
                ws.addEventListener('close', (event) => {
                    console.log('WebSocket connection closed.', event.code, event.reason);
                    stopAudio(false);
                });
                ws.addEventListener('error', (event) => console.error('WebSocket error:', event));
            }

            function handleWebSocketOpen(event) {
                console.log('WebSocket connection established.', event)

                navigator.mediaDevices.getUserMedia({
                    audio: {
                        sampleRate: SAMPLE_RATE,
                        channelCount: NUM_CHANNELS,
                        autoGainControl: true,
                        echoCancellation: true,
                        noiseSuppression: true,
                    }
                }).then((stream) => {
                    microphoneStream = stream;
                    // 512-sample buffer (~32 ms at 16 kHz).
                    scriptProcessor = audioContext.createScriptProcessor(512, 1, 1);
                    source = audioContext.createMediaStreamSource(stream);
                    source.connect(scriptProcessor);
                    scriptProcessor.connect(audioContext.destination);

                    scriptProcessor.onaudioprocess = (event) => {
                        if (!ws) {
                            return;
                        }

                        const audioData = event.inputBuffer.getChannelData(0);
                        const pcmS16Array = convertFloat32ToS16PCM(audioData);
                        const pcmByteArray = new Uint8Array(pcmS16Array.buffer);
                        const frame = Frame.create({
                            audio: {
                                audio: Array.from(pcmByteArray),
                                sampleRate: SAMPLE_RATE,
                                numChannels: NUM_CHANNELS
                            }
                        });
                        const encodedFrame = new Uint8Array(Frame.encode(frame).finish());
                        ws.send(encodedFrame);
                    };
                }).catch((error) => console.error('Error accessing microphone:', error));
            }

            function handleWebSocketMessage(event) {
                const arrayBuffer = event.data;
                if (isPlaying) {
                    enqueueAudioFromProto(arrayBuffer);
                }
            }

            function enqueueAudioFromProto(arrayBuffer) {
                const parsedFrame = Frame.decode(new Uint8Array(arrayBuffer));
                if (!parsedFrame?.audio) {
                    return false;
                }

                // Reset play time if we haven't played anything for a while.
                const diffTime = audioContext.currentTime - lastMessageTime;
                if ((playTime == 0) || (diffTime > PLAY_TIME_RESET_THRESHOLD_MS)) {
                    playTime = audioContext.currentTime;
                }
                lastMessageTime = audioContext.currentTime;

                // We should be able to use parsedFrame.audio.audio.buffer but for
                // some reason that contains all the bytes from the protobuf message.
                const audioVector = Array.from(parsedFrame.audio.audio);
                const audioArray = new Uint8Array(audioVector);

                audioContext.decodeAudioData(audioArray.buffer, function (buffer) {
                    const source = new AudioBufferSourceNode(audioContext);
                    source.buffer = buffer;
                    source.start(playTime);
                    source.connect(audioContext.destination);
                    playTime = playTime + buffer.duration;
                });
            }

            function convertFloat32ToS16PCM(float32Array) {
                let int16Array = new Int16Array(float32Array.length);

                for (let i = 0; i < float32Array.length; i++) {
                    let clampedValue = Math.max(-1, Math.min(1, float32Array[i]));
                    int16Array[i] = clampedValue < 0 ? clampedValue * 32768 : clampedValue * 32767;
                }
                return int16Array;
            }

            function startAudioBtnHandler() {
                if (!navigator.mediaDevices || !navigator.mediaDevices.getUserMedia) {
                    alert('getUserMedia is not supported in your browser.');
                    return;
                }

                startBtn.disabled = true;
                stopBtn.disabled = false;

                audioContext = new (window.AudioContext || window.webkitAudioContext)({
                    latencyHint: 'interactive',
                    sampleRate: SAMPLE_RATE
                });

                isPlaying = true;

                initWebSocket();
            }

            function stopAudio(closeWebsocket) {
                playTime = 0;
                isPlaying = false;
                startBtn.disabled = false;
                stopBtn.disabled = true;

                if (ws && closeWebsocket) {
                    ws.close();
                    ws = null;
                }

                if (scriptProcessor) {
                    scriptProcessor.disconnect();
                }
                if (source) {
                    source.disconnect();
                }
            }

            function stopAudioBtnHandler() {
                stopAudio(true);
            }

            startBtn.addEventListener('click', startAudioBtnHandler);
            stopBtn.addEventListener('click', stopAudioBtnHandler);
            startBtn.disabled = true;
            stopBtn.disabled = true;
        </script>
</body>

</html>

Expected behavior

The code should behave the same on both operating systems.

Actual behavior

Audio runs smoothly on macOS.
On Windows it sometimes starts off fine, but a small disturbance eventually appears and turns into constant static in the audio output.

@Animeshkr9044

Hey, I am facing a similar kind of issue, but on Mac the voice is breaking up.
Did you find a solution, or a reason why this is happening?

@jonnyjohnson1
Author

No solution yet. Things do appear to work with the Websocket class, so I am exploring using that instead, building it into a Flask app, and forgoing the FastAPI option.
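
A rough sketch of that swap, for anyone trying the same thing. This assumes pipecat's standalone WebsocketServerTransport and WebsocketServerParams; the exact import path and constructor arguments may differ in 0.0.52, so check your installed version:

# Hypothetical swap to pipecat's standalone websocket server transport.
# Import path and host/port keyword arguments are assumptions, not confirmed against 0.0.52.
from pipecat.transports.network.websocket_server import (
    WebsocketServerParams,
    WebsocketServerTransport,
)

transport = WebsocketServerTransport(
    host="localhost",
    port=8765,
    params=WebsocketServerParams(
        audio_out_sample_rate=16000,
        audio_out_enabled=True,
        add_wav_header=True,
        vad_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),
        vad_audio_passthrough=True,
        serializer=ProtobufFrameSerializer(),
    ),
)
# The rest of the pipeline (stt, llm, tts, context aggregators) stays the same as in the repro.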

@jonnyjohnson1
Author

If I mute the mic, the bot's audio output seems to work better, which makes me think the issue is in the VAD analyzer.

I made the following changes in the silero.py file, with no improvement.

❌ Tried increasing the number of threads to 4 when the platform is Windows.

        if platform.system() == "Windows":
            num_threads = min(multiprocessing.cpu_count(), 4)
            opts.inter_op_num_threads = num_threads
            opts.intra_op_num_threads = num_threads
        else:
            num_threads = 1
            opts.inter_op_num_threads = num_threads
            opts.intra_op_num_threads = num_threads
        print(f"\t[ Running VAD with {num_threads} threads. ]")

❌ Another thing I did was to check for zero before performing division:

        # Check for zero before performing division:
        if np.shape(x)[1] == 0 or sr / np.shape(x)[1] > 31.25:
            raise ValueError("Input audio chunk is too short")
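
One way to confirm whether the Silero VAD is actually the culprit is to run the exact same transport with VAD disabled and see if the static still shows up on Windows. A minimal sketch, reusing the params from the repro above; with VAD off, frames pass straight through, so interruption handling will behave differently during the test:

# Hypothetical A/B test: same transport as the repro, but with VAD turned off,
# to check whether the Windows static is tied to SileroVADAnalyzer.
transport = FastAPIWebsocketTransport(
    websocket=websocket_client,
    params=FastAPIWebsocketParams(
        audio_out_sample_rate=16000,
        audio_out_enabled=True,
        add_wav_header=True,
        vad_enabled=False,  # disable VAD entirely for this test
        vad_audio_passthrough=True,
        serializer=ProtobufFrameSerializer(),
    ),
)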
