Requirements Document

Personal AI Twin

Build a local AI system that writes and speaks like Thota — using LoRA fine-tuning and voice cloning on an M4 Pro Mac Mini

Status: Planning & Research Complete · Next: Data Collection & Setup

1. Project Overview

Two independent AI capabilities — one trained to write in Thota's voice, one to speak with Thota's voice — both running locally on a personal Mac Mini. No cloud services, no subscriptions, no data leaving the house.

| Component | Approach |
| --- | --- |
| Writing Style Clone | LoRA fine-tuning of Qwen 2.5 7B Instruct on personal emails and WhatsApp messages |
| Voice Clone TTS | OpenVoice V2 instant voice cloning from ~1 hour of reference recordings |
| Inference Platform | Ollama + Metal GPU on M4 Pro Mac Mini (24GB) |
| Backend | SvelteKit + Deno + FastAPI (Python) |

2. Writing Style LoRA — Requirements

2.1 Training Data

Requirement
The system shall accept email exports in .mbox and .eml formats from Gmail, Outlook, and Apple Mail, and WhatsApp chat exports in plain text .txt format.
Requirement
Training data shall be parsed locally using Python scripts without any cloud-connected service. No personal data shall leave the Mac Mini during processing.
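A sketch of such local parsing using only the Python standard library's mailbox module; the function names and the From-header filter are illustrative assumptions, not the project's actual script:

```python
import mailbox

def plain_text_body(msg):
    """Return the first text/plain part of a message, decoded to str."""
    if msg.is_multipart():
        for part in msg.walk():
            if part.get_content_type() == "text/plain":
                payload = part.get_payload(decode=True)
                if payload:
                    return payload.decode(part.get_content_charset() or "utf-8", "replace")
        return ""
    payload = msg.get_payload(decode=True)
    return payload.decode(msg.get_content_charset() or "utf-8", "replace") if payload else ""

def sent_messages(mbox_path, author):
    """Yield (subject, body) pairs for messages whose From header mentions the author."""
    for msg in mailbox.mbox(mbox_path):
        if author in (msg.get("From") or ""):
            yield msg.get("Subject") or "", plain_text_body(msg)
```

Everything runs in-process on the Mac Mini; no network access is involved at any point.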
Requirement
The final curated dataset shall contain 500–1,000 well-formatted instruction pairs derived from emails and WhatsApp messages, with a minimum of 200 samples for a viable first run.
Requirement
Dataset entries shall follow ChatML format (system, user, assistant message structure) and be serialized as JSONL files.
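An illustrative JSONL record in that shape; the system prompt text and message contents are assumptions for the example:

```python
import json

# One ChatML-style training record: system, user, assistant.
record = {
    "messages": [
        {"role": "system", "content": "You write replies in Thota's personal style."},
        {"role": "user", "content": "Draft a short reply thanking a colleague."},
        {"role": "assistant", "content": "Thanks a ton for jumping in yesterday. Saved me hours."},
    ]
}

# Append one JSON object per line to the dataset file.
with open("dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```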
Requirement
The system shall deduplicate near-identical samples using MinHash LSH at similarity threshold 0.85 before training begins.
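A toy, stdlib-only MinHash similarity check illustrating the 0.85 threshold; a production run would likely use a library implementation (e.g. datasketch's MinHashLSH) rather than this sketch:

```python
import hashlib

def shingles(text, k=3):
    """Split text into overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def minhash_signature(shingle_set, num_perm=64):
    """One min-hash per seeded hash function, approximating a permutation."""
    return [
        min(
            int.from_bytes(hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in shingle_set
        )
        for seed in range(num_perm)
    ]

def estimated_similarity(a, b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    sa, sb = minhash_signature(shingles(a)), minhash_signature(shingles(b))
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

def is_near_duplicate(a, b, threshold=0.85):
    return estimated_similarity(a, b) >= threshold
```

The LSH part (banding signatures so candidate pairs are found without all-pairs comparison) is what makes this scale past a few thousand samples, and is the main reason to prefer a library.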

2.2 Model Selection

Selected: Qwen2.5-7B-Instruct with 4-bit QLoRA fine-tuning

2.3 Training Configuration

| Parameter | Value | Notes |
| --- | --- | --- |
| LoRA Rank | 8–16 | Rank 8 for style-only; cap at 16 for style+task. No higher than 32. |
| Target Modules | q_proj, v_proj | Minimum. Adding k_proj + o_proj is optional. |
| Learning Rate | 2e-4 | Cosine scheduler, 5–10% warmup steps |
| Dropout | 0.05 | Mild dropout to prevent overfitting |
| Optimizer | AdamW 8-bit | bitsandbytes for memory savings |
| Batch Size | 4–8 | Per device; use gradient accumulation of 4–8 |
| Sequence Length | 512–1024 tokens | 2048+ risks OOM on 24GB |
| Epochs | 1–3 | Style-only: overtraining causes mimicry, not adaptation |
| Training Steps | 300–600 | Or 1–3 epochs on 500 samples |
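The table above might translate into a peft/transformers configuration roughly like the sketch below; argument names assume current peft and transformers APIs, and the trainer wiring is omitted:

```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=8,                                  # style-only; raise to 16 for style+task
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # minimum set; k_proj/o_proj optional
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="thota-style-lora",        # hypothetical output path
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,                    # 5–10% warmup
    optim="adamw_bnb_8bit",               # AdamW 8-bit via bitsandbytes
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=2,                   # or cap with max_steps=300–600
)
```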

2.4 Inference

Requirement
The writing style LoRA shall be served via Ollama with a custom Modelfile that loads the base Qwen 2.5 7B model and applies the LoRA adapter weights. The service shall run on localhost and respond to OpenAI-compatible API calls.
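A hypothetical Modelfile along those lines; the model tag, adapter path, and system prompt wording are assumptions:

```
FROM qwen2.5:7b-instruct
ADAPTER ./thota-style-lora
PARAMETER temperature 0.7
SYSTEM "You write in Thota's personal style. Return only the draft text."
```

Registered via `ollama create thota-style -f Modelfile`, the model is then reachable through Ollama's OpenAI-compatible endpoint on localhost.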
Requirement
The API shall accept a writing task prompt (e.g., "Draft a reply to my colleague thanking them for their help") and return text written in Thota's style — concise, direct, dry humor, no hedging, no corporate fluff.
Requirement
The Ollama server shall make zero outbound network requests during inference. All processing shall happen locally on the Mac Mini.
Requirement
The system prompt shall guide the model to return only the draft text — no preamble such as "Here's a draft:", no explanations, no "[DRAFT]" markers.
Requirement
Throughput target: 30–50 tokens/second on M4 Pro Metal GPU with Q4 quantization.
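A minimal client-side sketch of such a call, assuming Ollama's default port 11434 and a hypothetical `thota-style` model name:

```python
import json
import urllib.request

def build_draft_request(prompt, model="thota-style"):
    """Build an OpenAI-compatible chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def draft(prompt, host="http://localhost:11434"):
    """POST the prompt to the local Ollama server and return the styled text."""
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(build_draft_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Because the host is localhost only, the call satisfies the zero-outbound-requests requirement by construction.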

3. Voice Clone TTS — Requirements

3.1 Training / Cloning Data

Requirement
Thota shall record approximately 1 hour of audio across 6–10 distinct emotional contexts, drawn from: neutral/calm, happy/excited, sad/contemplative, angry/frustrated, surprised/curious, whispered/soft, authoritative/strong, tired/fatigued, and playful/teasing.
Requirement
Audio recordings shall be captured at 16kHz minimum (24kHz recommended), in a consistent environment with the same microphone. Files shall be recorded in 5–10 minute segments per emotional context to avoid vocal fatigue.
Requirement
Audio shall be pre-processed: normalize levels, remove long silences and breathing artifacts, ensure consistent sample rate across all recordings.
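A stdlib-only sketch of the normalization and silence-trimming steps; the helper names, silence threshold, and 0.5 s cap are illustrative assumptions (a real pipeline would more likely use ffmpeg, librosa, or pydub):

```python
def peak_normalize(samples, target_peak=0.95):
    """Scale float samples in [-1.0, 1.0] so the loudest hits target_peak."""
    peak = max(abs(s) for s in samples) or 1.0
    return [s * target_peak / peak for s in samples]

def trim_silence(samples, sample_rate, threshold=0.01, max_silence_s=0.5):
    """Collapse runs of near-silent samples longer than max_silence_s."""
    max_run = int(max_silence_s * sample_rate)
    out, run = [], 0
    for s in samples:
        if abs(s) < threshold:
            run += 1
            if run <= max_run:   # keep at most max_run silent samples in a row
                out.append(s)
        else:
            run = 0
            out.append(s)
    return out
```

Resampling to a single consistent rate (e.g. 24kHz) would be done once, before these passes, so every clip feeds the cloner identically.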
Requirement
All reference audio shall be stored locally in a FileVault-encrypted directory on the Mac Mini. No audio data shall be uploaded to any cloud service.

3.2 Voice Model Selection

Selected: OpenVoice V2 (MIT License)

Runners-up: XTTS v2 (higher quality ceiling, but released under the Coqui Public Model License — not fully open source) and Parler-TTS Mini (Apache 2.0, description-driven style control, 880M parameters).

3.3 TTS Inference

Requirement
The TTS engine shall produce audio at 22,050 Hz sample rate in WAV format (lossless) and optionally MP3 for streaming. Output shall be returned as a file download or streaming response.
Requirement
End-to-end latency shall not exceed 2 seconds for short texts (under 100 characters) including reference audio processing and synthesis.
Requirement
The TTS API shall be served via FastAPI (Python) on the Mac Mini, callable from the SvelteKit backend via a single REST endpoint.

4. Unified API — Requirements

Requirement
A single SvelteKit backend shall expose these API routes:
| Route | Method | Description |
| --- | --- | --- |
| /api/tts/pipeline | POST | Unified pipeline: LoRA text gen → TTS synthesis in one call |
| /api/tts/clone | POST | TTS clone endpoint with reference audio |
| /api/tts/lora/generate | POST | LoRA text generation with writing style |
| /api/voice/upload | POST | Upload reference audio; returns a voice ID |
Requirement
The pipeline endpoint shall accept a prompt, optional style reference text, and a reference audio clip, and return styled text plus synthesized speech in a single request.
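An illustrative request body for the pipeline endpoint; the field names are assumptions, not a fixed contract:

```json
{
  "prompt": "Draft a reply to my colleague thanking them for their help",
  "style_reference": "optional example text in the target style",
  "reference_audio": "<base64-encoded WAV clip>"
}
```

A matching response would carry the styled text plus either a URL to the synthesized WAV or the audio inline as base64.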

5. Hardware & Performance Targets

| Metric | Target |
| --- | --- |
| Target Platform | Mac Mini M4 Pro · 24GB unified RAM · macOS |
| Writing Inference Speed | 30–50 tokens/sec (Q4 quantized) |
| TTS Synthesis Speed | 0.3–0.8× real-time (faster than speech) |
| Memory Usage — Writing | ~4–6GB (Qwen QLoRA + LoRA adapter) |
| Memory Usage — TTS | ~2–4GB (OpenVoice + MeloTTS) |
| Combined RAM Target | Stay under ~18GB of 24GB (leave headroom for macOS) |
| Training Time (Writing LoRA) | 1.5–6 hours per epoch on 500–1,000 samples |
| TTS Fine-tune Time (1 hr audio) | 2–4 hours on M4 Pro Metal |
| Storage Required | 50GB+ free SSD for models, datasets, and outputs |

6. Privacy & Security

Requirement
All training and inference shall run entirely on the Mac Mini. No personal emails, messages, or voice recordings shall be sent to any external service.
Requirement
The TTS API shall be exposed only via Cloudflare Tunnel (outbound-only connection) or Tailscale VPN. No ports shall be opened directly on the router.
Requirement
SSH access shall use key-based authentication only. Password authentication shall be disabled.
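The corresponding lines in /etc/ssh/sshd_config would be:

```
PasswordAuthentication no
KbdInteractiveAuthentication no
PubkeyAuthentication yes
```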
Requirement
Training data shall be deduplicated and drawn from a diverse pool of at least 200 samples (per Section 2.1) to prevent the model from memorizing exact phrasing from personal messages.
Requirement
Reference audio storage shall be protected by FileVault full-disk encryption. Optionally, sensitive voice samples shall be stored in an encrypted DMG container.

7. Deployment & Remote Access

Requirement
The Mac Mini shall be configured to start automatically after a power failure (Energy Saver setting) and run as an always-on home server.
Requirement
Cloudflare Tunnel shall be used to provide a persistent public URL for API access without opening router ports. The tunnel shall use an outbound-only connection.
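A hypothetical cloudflared ingress configuration; the hostname, tunnel ID, and backend port are placeholders:

```yaml
# ~/.cloudflared/config.yml
tunnel: <TUNNEL-ID>
credentials-file: /Users/thota/.cloudflared/<TUNNEL-ID>.json
ingress:
  - hostname: twin.example.com
    service: http://localhost:5173   # SvelteKit backend
  - service: http_status:404         # catch-all for unmatched hosts
```

Only the SvelteKit backend is exposed through the tunnel; Ollama and the FastAPI TTS server stay on localhost behind it.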
Requirement
Ollama shall be configured to listen on localhost only (default). The FastAPI TTS server shall bind to localhost and only accept connections through the SvelteKit backend.
Requirement
Process management shall use launchd (macOS native) or tmux to ensure services restart automatically after a crash or reboot.
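A minimal launchd agent sketch for the Ollama service; the label and binary path are assumptions:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.thota.ollama</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/ollama</string>
        <string>serve</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
</dict>
</plist>
```

Placed in ~/Library/LaunchAgents and loaded with `launchctl load`, KeepAlive restarts the process after a crash and RunAtLoad covers reboots.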

8. Out of Scope