Technical Tutorial
15 min read
February 1, 2024

Next.js + Voice AI: Building Modern Conversational Interfaces

Step-by-step guide to integrating voice AI capabilities into Next.js applications with real code examples and best practices.

dokkodo-services Team

Modern web applications are evolving beyond traditional click-and-type interfaces. This comprehensive guide shows you how to integrate voice AI capabilities into Next.js applications, creating seamless conversational experiences.

Why Next.js for Voice AI?

Next.js provides the perfect foundation for voice AI applications:

  • Server-side rendering: Better SEO and performance
  • API routes: Built-in backend functionality
  • Edge runtime: Low-latency voice processing (see the config sketch after this list)
  • TypeScript support: Type-safe voice AI development
  • Streaming: Real-time audio processing
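
To illustrate the Edge runtime point: any individual route handler can opt into the Edge runtime with a one-line segment config. A minimal sketch (the route path is illustrative; routes that rely on Node-only APIs such as Buffer still belong on the Node.js runtime):

// app/api/ping/route.ts (illustrative)
import { NextResponse } from 'next/server';

// Opt this route into the Edge runtime for lower cold-start latency
export const runtime = 'edge';

export async function GET() {
  return NextResponse.json({ status: 'ok' });
}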

Architecture Overview

Our voice AI architecture consists of:

  1. Frontend: React components for voice capture and playback
  2. API Layer: Next.js API routes for voice processing
  3. AI Services: Integration with OpenAI, Anthropic, or custom models
  4. Real-time Communication: WebSockets for live transcription
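
A single round trip through these layers looks like this: the browser captures audio, POSTs it to /api/transcribe, sends the resulting text to /api/chat, and optionally plays the reply through /api/speak. Sketched as TypeScript shapes (illustrative only; they mirror the payloads used in the steps below):

// Illustrative request/response shapes for one voice round trip
interface TranscribeResponse {
  transcription: string; // returned by POST /api/transcribe (multipart audio upload)
}

interface ChatRequest {
  message: string; // latest user utterance
  context: { role: 'user' | 'assistant'; content: string }[]; // recent turns
}

interface ChatResponse {
  response: string; // assistant text, optionally forwarded to POST /api/speak
}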

Implementation Guide

Step 1: Project Setup

npx create-next-app@latest voice-ai-app --typescript --tailwind --app
cd voice-ai-app
npm install @types/node openai ws @types/ws
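
The components in the following steps import Button and Card from @/components/ui/*, which assumes shadcn/ui primitives are already set up. If your project does not have them yet, one way to add them (assuming the standard shadcn CLI) is:

npx shadcn@latest init
npx shadcn@latest add button card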

Step 2: Voice Capture Component

// components/VoiceCapture.tsx
'use client';

import { useState, useRef, useEffect } from 'react';
import { Button } from '@/components/ui/button';

interface VoiceCaptureProps {
  onTranscription: (text: string) => void;
  onAudioData: (blob: Blob) => void;
}

export default function VoiceCapture({ onTranscription, onAudioData }: VoiceCaptureProps) {
  const [isRecording, setIsRecording] = useState(false);
  const [isProcessing, setIsProcessing] = useState(false);
  const mediaRecorderRef = useRef<MediaRecorder | null>(null);
  const chunksRef = useRef<Blob[]>([]);

  const startRecording = async () => {
    try {
      const stream = await navigator.mediaDevices.getUserMedia({ 
        audio: {
          echoCancellation: true,
          noiseSuppression: true,
          sampleRate: 16000
        } 
      });

      // Prefer WebM/Opus where supported; Safari, for example, throws on this
      // mimeType, so fall back to the browser default there
      const mimeType = MediaRecorder.isTypeSupported('audio/webm;codecs=opus')
        ? 'audio/webm;codecs=opus'
        : undefined;
      const mediaRecorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);

      mediaRecorderRef.current = mediaRecorder;
      chunksRef.current = [];

      mediaRecorder.ondataavailable = (event) => {
        if (event.data.size > 0) {
          chunksRef.current.push(event.data);
        }
      };

      mediaRecorder.onstop = async () => {
        const audioBlob = new Blob(chunksRef.current, { type: mediaRecorder.mimeType || 'audio/webm' });
        onAudioData(audioBlob);
        await processAudio(audioBlob);
        
        // Clean up
        stream.getTracks().forEach(track => track.stop());
      };

      mediaRecorder.start(1000); // Collect data every second
      setIsRecording(true);
    } catch (error) {
      console.error('Error starting recording:', error);
    }
  };

  const stopRecording = () => {
    if (mediaRecorderRef.current && isRecording) {
      mediaRecorderRef.current.stop();
      setIsRecording(false);
      setIsProcessing(true);
    }
  };

  const processAudio = async (audioBlob: Blob) => {
    try {
      const formData = new FormData();
      formData.append('audio', audioBlob, 'recording.webm');

      const response = await fetch('/api/transcribe', {
        method: 'POST',
        body: formData,
      });

      if (response.ok) {
        const { transcription } = await response.json();
        onTranscription(transcription);
      } else {
        console.error('Transcription failed');
      }
    } catch (error) {
      console.error('Error processing audio:', error);
    } finally {
      setIsProcessing(false);
    }
  };

  return (
    <div className="flex flex-col items-center space-y-4">
      <Button
        onClick={isRecording ? stopRecording : startRecording}
        disabled={isProcessing}
        className={isRecording ? 'bg-red-500 hover:bg-red-600' : ''}
      >
        {isProcessing ? 'Processing...' : isRecording ? 'Stop Recording' : 'Start Recording'}
      </Button>
      
      {isRecording && (
        <div className="flex items-center space-x-2">
          <div className="w-3 h-3 bg-red-500 rounded-full animate-pulse"></div>
          <span className="text-sm text-gray-600">Recording...</span>
        </div>
      )}
    </div>
  );
}

Step 3: Transcription API Route

// app/api/transcribe/route.ts
import { NextRequest, NextResponse } from 'next/server';
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

export async function POST(request: NextRequest) {
  try {
    const formData = await request.formData();
    const audioFile = formData.get('audio') as File;

    if (!audioFile) {
      return NextResponse.json({ error: 'No audio file provided' }, { status: 400 });
    }

    // The Web File object from formData can be passed straight to the SDK
    const transcription = await openai.audio.transcriptions.create({
      file: audioFile,
      model: 'whisper-1',
      language: 'en',
      response_format: 'json',
      temperature: 0.2,
    });

    // Whisper's json response format only includes the transcribed text
    return NextResponse.json({ transcription: transcription.text });

  } catch (error) {
    console.error('Transcription error:', error);
    return NextResponse.json({ error: 'Transcription failed' }, { status: 500 });
  }
}
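
Some deployment targets reject the Web File object from formData when it is handed to the SDK. In that case the openai package's toFile helper can wrap the raw bytes instead; a minimal variant of the call above, assuming openai v4 and the Node.js runtime:

// Alternative upload path using the SDK's toFile helper
import OpenAI, { toFile } from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function transcribeUpload(audioFile: File): Promise<string> {
  // Re-wrap the browser File as an SDK upload with an explicit filename
  const upload = await toFile(Buffer.from(await audioFile.arrayBuffer()), 'recording.webm');

  const transcription = await openai.audio.transcriptions.create({
    file: upload,
    model: 'whisper-1',
  });

  return transcription.text;
}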

Step 4: Conversational AI Integration

// app/api/chat/route.ts
import { NextRequest, NextResponse } from 'next/server';
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

export async function POST(request: NextRequest) {
  try {
    const { message, context } = await request.json();

    const completion = await openai.chat.completions.create({
      model: 'gpt-4',
      messages: [
        {
          role: 'system',
          content:
            'You are a helpful voice AI assistant for dokkodo-services, a European Voice AI consultancy. ' +
            'Provide concise, actionable responses about voice AI, automation, and business solutions.'
        },
        // Strip client-side fields such as timestamps down to role/content
        ...(context ?? []).map((m: { role: 'user' | 'assistant'; content: string }) => ({
          role: m.role,
          content: m.content,
        })),
        { role: 'user', content: message }
      ],
      max_tokens: 150,
      temperature: 0.7,
    });

    const response =
      completion.choices[0]?.message?.content ??
      "I apologize, but I couldn't process that request.";

    return NextResponse.json({ response });

  } catch (error) {
    console.error('Chat completion error:', error);
    return NextResponse.json({ error: 'Chat completion failed' }, { status: 500 });
  }
}
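
The route above waits for the full completion before replying. If you want the interface to start speaking sooner, the same call can be streamed and the text forwarded as it arrives; a minimal sketch of a standalone streaming variant (not wired into the rest of the tutorial):

// app/api/chat-stream/route.ts (illustrative streaming variant)
import { NextRequest } from 'next/server';
import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function POST(request: NextRequest) {
  const { message } = await request.json();

  const stream = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [{ role: 'user', content: message }],
    stream: true,
  });

  // Re-expose the token stream as a plain text response
  const encoder = new TextEncoder();
  const body = new ReadableStream({
    async start(controller) {
      for await (const chunk of stream) {
        const delta = chunk.choices[0]?.delta?.content ?? '';
        if (delta) controller.enqueue(encoder.encode(delta));
      }
      controller.close();
    },
  });

  return new Response(body, {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' },
  });
}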

Step 5: Text-to-Speech Integration

// app/api/speak/route.ts
import { NextRequest, NextResponse } from 'next/server';
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

export async function POST(request: NextRequest) {
  try {
    const { text } = await request.json();

    const mp3 = await openai.audio.speech.create({
      model: 'tts-1',
      voice: 'alloy',
      input: text,
      response_format: 'mp3',
      speed: 1.0,
    });

    const buffer = Buffer.from(await mp3.arrayBuffer());

    return new NextResponse(buffer, {
      headers: {
        'Content-Type': 'audio/mpeg',
        'Content-Length': buffer.length.toString(),
      },
    });

  } catch (error) {
    console.error('Text-to-speech error:', error);
    return NextResponse.json({ error: 'Speech generation failed' }, { status: 500 });
  }
}
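
Buffering the whole MP3 works well for short replies. Because the SDK returns a fetch Response here, longer replies can also be piped through without buffering; a small standalone variant, assuming the body stream is exposed as in current openai v4:

// app/api/speak-stream/route.ts (illustrative streaming variant)
import { NextRequest } from 'next/server';
import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function POST(request: NextRequest) {
  const { text } = await request.json();

  const mp3 = await openai.audio.speech.create({
    model: 'tts-1',
    voice: 'alloy',
    input: text,
  });

  // Forward the MP3 body stream directly instead of buffering it in memory
  return new Response(mp3.body, {
    headers: { 'Content-Type': 'audio/mpeg' },
  });
}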

Step 6: Complete Voice Interface Component

// components/VoiceInterface.tsx
'use client';

import { useState } from 'react';
import VoiceCapture from './VoiceCapture';
import { Button } from '@/components/ui/button';
import { Card, CardContent, CardHeader, CardTitle } from '@/components/ui/card';

interface Message {
  role: 'user' | 'assistant';
  content: string;
  timestamp: Date;
}

export default function VoiceInterface() {
  const [messages, setMessages] = useState<Message[]>([]);
  const [isPlaying, setIsPlaying] = useState(false);

  const handleTranscription = async (transcription: string) => {
    const userMessage: Message = {
      role: 'user',
      content: transcription,
      timestamp: new Date(),
    };

    setMessages(prev => [...prev, userMessage]);

    // Send to chat API
    try {
      const response = await fetch('/api/chat', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          message: transcription,
          context: messages.slice(-5) // Last 5 messages for context
        }),
      });

      if (response.ok) {
        const { response: aiResponse } = await response.json();
        
        const assistantMessage: Message = {
          role: 'assistant',
          content: aiResponse,
          timestamp: new Date(),
        };

        setMessages(prev => [...prev, assistantMessage]);
        
        // Optional: Auto-play response
        await playResponse(aiResponse);
      }
    } catch (error) {
      console.error('Chat error:', error);
    }
  };

  const playResponse = async (text: string) => {
    try {
      setIsPlaying(true);
      
      const response = await fetch('/api/speak', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ text }),
      });

      if (response.ok) {
        const audioBlob = await response.blob();
        const audioUrl = URL.createObjectURL(audioBlob);
        const audio = new Audio(audioUrl);
        
        audio.onended = () => {
          setIsPlaying(false);
          URL.revokeObjectURL(audioUrl);
        };
        
        await audio.play();
      }
    } catch (error) {
      console.error('Speech playback error:', error);
      setIsPlaying(false);
    }
  };

  return (
    <div className="max-w-2xl mx-auto p-6">
      <Card className="mb-6">
        <CardHeader>
          <CardTitle>Voice AI Assistant</CardTitle>
        </CardHeader>
        <CardContent>
          <VoiceCapture 
            onTranscription={handleTranscription}
            onAudioData={() => {}} // Handle audio data if needed
          />
        </CardContent>
      </Card>

      <div className="space-y-4">
        {messages.map((message, index) => (
          <Card key={index} className={message.role === 'user' ? 'ml-12' : 'mr-12'}>
            <CardContent className="p-4">
              <div className="flex justify-between items-start">
                <p className={message.role === 'user' ? 'text-foreground' : 'text-muted-foreground'}>
                  <strong>{message.role === 'user' ? 'You' : 'AI'}:</strong> {message.content}
                </p>
                {message.role === 'assistant' && (
                  <Button
                    variant="outline"
                    size="sm"
                    onClick={() => playResponse(message.content)}
                    disabled={isPlaying}
                  >
                    🔊
                  </Button>
                )}
              </div>
              <span className="text-xs text-gray-500">
                {message.timestamp.toLocaleTimeString()}
              </span>
            </CardContent>
          </Card>
        ))}
      </div>
    </div>
  );
}
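
To try the interface end to end, mount it in a page; a minimal example, assuming it should live on the app's home route:

// app/page.tsx
import VoiceInterface from '@/components/VoiceInterface';

export default function Home() {
  return (
    <main className="min-h-screen py-12">
      <VoiceInterface />
    </main>
  );
}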

Advanced Features

Real-time Streaming

For real-time transcription, implement WebSocket connections:

// lib/websocket-server.ts
// Note: this runs as a separate Node process (e.g. with tsx or ts-node),
// not inside a Next.js route handler.
import { WebSocketServer } from 'ws';

// Placeholder: connect this to your streaming speech-to-text provider
async function processAudioChunk(chunk: Buffer): Promise<string> {
  throw new Error('processAudioChunk is not implemented yet');
}

export function setupWebSocketServer() {
  const wss = new WebSocketServer({ port: 3001 });

  wss.on('connection', (ws) => {
    ws.on('message', async (data) => {
      // Normalize the incoming frame (Buffer, Buffer[], or ArrayBuffer) to one Buffer
      const chunk = Buffer.isBuffer(data)
        ? data
        : Array.isArray(data)
          ? Buffer.concat(data)
          : Buffer.from(data);

      // Process audio chunks in real-time
      const transcription = await processAudioChunk(chunk);

      ws.send(JSON.stringify({
        type: 'transcription',
        data: transcription,
      }));
    });
  });
}
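
On the client, the MediaRecorder chunks from Step 2 can be forwarded over the socket as they are produced; a browser-side sketch (NEXT_PUBLIC_WS_URL matches the variable defined under Production Deployment):

// lib/streaming-client.ts (browser-side sketch)
export function streamMicrophone(onTranscription: (text: string) => void) {
  const ws = new WebSocket(process.env.NEXT_PUBLIC_WS_URL ?? 'ws://localhost:3001');

  ws.onmessage = (event) => {
    const message = JSON.parse(event.data);
    if (message.type === 'transcription') onTranscription(message.data);
  };

  ws.onopen = async () => {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    const recorder = new MediaRecorder(stream);

    // Ship each chunk to the server as soon as it is available
    recorder.ondataavailable = (event) => {
      if (event.data.size > 0 && ws.readyState === WebSocket.OPEN) {
        ws.send(event.data);
      }
    };

    recorder.start(250); // small chunks keep transcription latency low
  };

  // Return a cleanup function so the caller can close the connection
  return () => ws.close();
}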

Performance Optimization

  1. Audio Compression: Use WebM with Opus codec
  2. Debouncing: Batch audio chunks for efficiency
  3. Caching: Cache common responses (see the sketch after this list)
  4. CDN: Serve audio files from CDN
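
For the caching point above, even a module-level map in front of the /api/speak call avoids regenerating audio for repeated phrases; a minimal in-memory sketch (per-instance only, cleared on every cold start):

// lib/tts-cache.ts (illustrative in-memory cache)
const ttsCache = new Map<string, Buffer>();

export async function cachedSpeech(
  text: string,
  generate: (text: string) => Promise<Buffer>
): Promise<Buffer> {
  const cached = ttsCache.get(text);
  if (cached) return cached;

  const audio = await generate(text);
  ttsCache.set(text, audio);
  return audio;
}

In the speak route, wrap the openai.audio.speech.create call in cachedSpeech and return the resulting buffer as before.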

Security Considerations

  1. Rate Limiting: Prevent API abuse (see the sketch after this list)
  2. Input Validation: Sanitize audio inputs
  3. CORS: Configure proper CORS policies
  4. Authentication: Implement user authentication
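
As a starting point for rate limiting, a fixed-window counter keyed by IP can guard each route handler; a minimal in-memory sketch (a shared store such as Redis is needed once more than one instance is running):

// lib/rate-limit.ts (illustrative fixed-window limiter)
const hits = new Map<string, { count: number; windowStart: number }>();

export function isRateLimited(ip: string, limit = 20, windowMs = 60_000): boolean {
  const now = Date.now();
  const entry = hits.get(ip);

  // Start a new window if none exists or the previous one has expired
  if (!entry || now - entry.windowStart > windowMs) {
    hits.set(ip, { count: 1, windowStart: now });
    return false;
  }

  entry.count += 1;
  return entry.count > limit;
}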

Production Deployment

Environment Variables

# .env.local
OPENAI_API_KEY=your_openai_key
NEXT_PUBLIC_WS_URL=wss://your-domain.com

Vercel Deployment

npm run build
vercel deploy --prod
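
Remember to set the API key in the Vercel project as well, for example via the Vercel CLI:

vercel env add OPENAI_API_KEY production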

Conclusion

Integrating voice AI with Next.js creates powerful, modern web applications. The combination of Next.js's full-stack capabilities and OpenAI's voice technologies enables sophisticated conversational interfaces.

Key benefits:

  • Seamless user experience: Natural voice interactions
  • High performance: Edge-optimized processing
  • Scalable architecture: Built for production
  • Type safety: Full TypeScript support

Need help implementing voice AI in your Next.js project? Contact our team for expert consultation.

Tags

Next.js, Voice AI, OpenAI, React, TypeScript, Web Development
