Architecture Document

Personal AI Twin

Complete system architecture for writing style LoRA + voice clone TTS on M4 Pro Mac Mini

1. High-Level Architecture

The system has two independent AI pipelines that share infrastructure:

Pipeline	Input	Model	Output
Writing Style	Prompt (draft request)	Qwen 2.5 7B + LoRA via Ollama	Styled text
Voice Clone	Text + Reference audio	OpenVoice V2 + MeloTTS via FastAPI	Audio waveform (WAV/MP3)
Unified Pipeline	Prompt + Reference audio	Both above in sequence	Styled text + spoken audio

2. Component Architecture

2.1 Training Pipeline

📧 Email Export
Formats: .mbox, .eml, Gmail Takeout JSON
Personal emails exported from Apple Mail, Gmail, or Outlook. Only 1:1 personal emails; distribution lists, auto-replies, and BCC excluded.

↓ Parse locally

💬 WhatsApp Export
Format: Plain text .txt · Pattern: timestamp + sender + message
Chat export from WhatsApp iOS/Android. Grouped by consecutive sender into conversational turns.

↓ Parse + format

🧠 Dataset Construction Pipeline

Tool: Python · Format: ChatML JSONL · Deduplication: MinHash LSH

Python scripts parse raw exports → instruction pairs with system prompt, user prompt, and Thota's actual response as assistant output. Deduplicated at 0.85 similarity threshold. Quality filtered (20-char min length, langdetect for English).

↓ 500–1,000 curated samples

🔧 LLaMA Factory (Training)
Platform: macOS + Metal GPU (M4 Pro) · Framework: PyTorch MPS + Unsloth kernels

        QLoRA fine-tuning of Qwen 2.5 7B Instruct. Rank=16, targets q_proj+v_proj, LR=2e-4, 1–3 epochs on 500–1,000 samples. Output: LoRA adapter weights in GGUF or .safetensors format.
      

↓ LoRA adapter weights

🎯 Ollama Model (Inference-Ready)
File: thota-style-lora.gguf · Port: 11434 · API: OpenAI-compatible REST
Modelfile with base Qwen 2.5 7B + ADAPTER directive pointing to LoRA weights. Registered as "thota-writing" model in Ollama. Ready for inference.

2.2 Voice Recording Pipeline

🎙️ Voice Recording Sessions

Duration: ~1 hour total · Emotions: 6–10 contexts · Format: 24kHz WAV

Thota records 5–10 minute sessions per emotional context (neutral, happy, sad, angry, surprised, whispered, authoritative, tired, playful). Same microphone, same room. Raw audio files stored in FileVault-encrypted directory.

↓ Normalize + clean

🔊 Audio Pre-Processing

Tool: Python (librosa, scipy.signal) · Sample rate: 24kHz → 22050Hz normalized

Level normalization, silence removal, breathing artifact removal, consistent sample rate. Organized by emotion tag in separate folders.

↓ Cleaned + tagged audio

🗣️ OpenVoice V2 (Instant Clone)
Checkpoint: checkpoints_v2_0417.zip · Base TTS: MeloTTS · License: MIT

        Reference audio (10–30 sec) passed to OpenVoice tone color cloner. No fine-tuning required for basic clone — instant voice embedding from reference. Fine-tuning mode available for enhanced quality (2–4 hours on 1hr audio).
      

2.3 Inference Stack (Production)

🌐 SvelteKit + Deno Backend

Framework: SvelteKit with Deno adapter · Port: 3000 · Routes: /api/*

Server-side web backend. Handles HTTP requests, orchestrates Ollama + FastAPI calls, returns JSON or audio responses. All AI calls happen server-side (localhost) — no secrets exposed to client.

↓ server-side fetch (localhost)

🧠 Ollama Server
Model: thota-writing (Qwen 2.5 7B + LoRA) · Port: 11434
Metal GPU-accelerated inference via llama.cpp. Receives styled prompts, generates text in Thota's voice. Streaming support via SSE. Zero outbound network calls.

↑ styled text

🔊 FastAPI (Python) — TTS Server

Port: 8000 · Workers: 1 · Framework: FastAPI + uvicorn

Wraps OpenVoice V2 + MeloTTS. Receives text + reference audio path, synthesizes speech with cloned voice. Keeps models warm in memory. Serves WAV/MP3 responses.

↓ audio data

🎧 Audio Output

Formats: WAV (lossless, 22050 Hz) · MP3 (compressed)

Served as file download or streamed via chunked transfer encoding. Browser audio player or downloadable file.

3. Data Flow Diagrams

3.1 Writing Style Generation

┌─────────────────────────────────────────────────────────────────┐
│  Client                                                         │
│  "Draft a reply thanking my colleague"                          │
└────────────────────────┬────────────────────────────────────────┘
                         │ POST /api/tts/lora/generate
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│  SvelteKit Backend (Deno)                                      │
│  1. Validate request                                          │
│  2. Build messages array with system prompt + user prompt     │
│  3. Server-side fetch to Ollama localhost:11434               │
└────────────────────────┬────────────────────────────────────────┘
                         │ POST /api/chat  { model:"thota-writing" }
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│  Ollama (Metal GPU, M4 Pro)                                    │
│  Base: Qwen 2.5 7B Q4_KM  +  LoRA: thota-style-lora.gguf        │
│  System: "You are Thota's writing assistant..."                │
│  Output: "Nice one — thanks for flagging this..."              │
└────────────────────────┬────────────────────────────────────────┘
                         │ JSON { message.content }
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│  SvelteKit Backend                                              │
│  Return styled text to client                                  │
└─────────────────────────────────────────────────────────────────┘

3.2 Voice Clone TTS Pipeline

┌─────────────────────────────────────────────────────────────────┐
│  Client                                                         │
│  POST /api/tts/pipeline  { text, referenceAudio }              │
└────────────────────────┬────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│  SvelteKit Backend (Deno)                                      │
│  1. Call Ollama → get styled text                              │
│  2. Base64-encode reference audio (or use stored voice ID)     │
│  3. Call FastAPI localhost:8000/tts                           │
└────────────────────────┬────────────────────────────────────────┘
                         │
            ┌────────────┴──────────────────┐
            │ POST /tts { text, reference }  │
            ▼
┌─────────────────────────────────────────────────────────────────┐
│  FastAPI Python Server (Mac Mini, port 8000)                  │
│  1. Load reference audio                                      │
│  2. MeloTTS: synthesize base audio (text → waveform)         │
│  3. OpenVoice: clone tone color from reference                 │
│  4. Return WAV audio                                          │
└────────────────────────┬────────────────────────────────────────┘
                         │ audio/wav
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│  SvelteKit Backend                                              │
│  Stream audio back to client as chunked response              │
└─────────────────────────────────────────────────────────────────┘

4. Network & Access Architecture

┌───────────────────────────────────────────────────────────────────┐
│  Mac Mini M4 Pro (Home Network)                                  │
│                                                                   │
│  ┌─────────────┐    ┌──────────────┐    ┌──────────────────────┐   │
│  │  Ollama    │    │  FastAPI    │    │  SvelteKit/Deno    │   │
│  │  :11434    │    │  :8000      │    │  :3000              │   │
│  │  (localhost)│    │  (localhost)│    │  (localhost)        │   │
│  └──────┬─────┘    └──────┬───────┘    └──────────┬───────────┘   │
│         │                │                        │               │
│         └────────────────┴────────────────────────┘               │
│                           │                                      │
│                     localhost only                               │
└───────────────────────────┼─────────────────────────────────────┘
                            │ outbound tunnel
                            ▼
┌───────────────────────────────────────────────────────────────────┐
│  Cloudflare Edge Network                                         │
│  Tunnel: cloudflared (persistent, outbound-only)               │
│  Public URL: voice-api.yourdomain.com → Mac Mini :3000          │
└───────────────────────────────────────────────────────────────────┘

All services bind to localhost only. Only Cloudflare Tunnel connects outward from the Mac Mini. No inbound ports opened on router.

5. Storage Architecture

/Users/
└── thota/
    ├── ai-models/                          # Model weights (50GB+ free space)
    │   ├── qwen2.5-7b/                    # Base Qwen 2.5 7B Instruct
    │   ├── thota-style-lora/              # Trained LoRA adapter
    │   └── openvoice-v2/                  # OpenVoice V2 checkpoints
    │       └── checkpoints_v2/
    ├── voice-references/                   # 🔒 FileVault encrypted
    │   ├── neutral/
    │   ├── happy/
    │   ├── authoritative/
    │   └── ... (by emotion)
    ├── datasets/
    │   ├── email-parsed.jsonl             # Parsed email instruction pairs
    │   ├── whatsapp-parsed.jsonl          # Parsed WhatsApp pairs
    │   └── combined-dataset.jsonl         # Deduplicated + merged
    └── scripts/
        ├── parse_emails.py
        ├── parse_whatsapp.py
        └── train_lora.py

6. Model Files & Checkpoints

Model / File	Size	Location	Source
Qwen 2.5 7B Instruct (Q4_KM)	~4–5 GB	~/.cache/huggingface/	HuggingFace
thota-style-lora.gguf	100–500 MB	ai-models/thota-style-lora/	Trained output
OpenVoice V2 checkpoints	~400 MB	openvoice-v2/checkpoints_v2/	myshell-ai/S3
MeloTTS (EN base)	~300 MB	openvoice-v2/	myshell-ai/MeloTTS
Reference audio (1hr)	~1 GB (24kHz)	voice-references/	Thota's recordings

7. Process Management

Service	Manager	Restart Policy	Command / Config
Ollama server	launchd or tmux	Restart on crash, start on boot	`launchctl start com.ollama.server`
FastAPI TTS server	launchd or tmux	Restart on crash, start on boot	`uvicorn main:app --host 0.0.0.0 --port 8000 --workers 1`
SvelteKit/Deno	launchd	Restart on crash, start on boot	`deno task start`
Cloudflare Tunnel	launchd	Restart on crash, reconnect on network	`cloudflared tunnel run --token <TOKEN>`

8. Privacy Architecture

🔒 Defense in Depth

Layer 1 — Local only: All training and inference happens on Mac Mini. Ollama makes zero outbound requests during inference.
Layer 2 — Encrypted storage: FileVault full-disk encryption. Encrypted DMG for sensitive voice samples.
Layer 3 — Network isolation: All services on localhost. Cloudflare Tunnel is outbound-only. No ports on router.
Layer 4 — SSH hardening: Key-only auth, no password auth, non-standard port optional.
Layer 5 — Dataset curation: Deduplication + quality filtering prevents memorization of exact phrasing.
Layer 6 — API auth: FastAPI middleware adds API key check for any external tunnel requests.