From adb36c0c484b498263a107e98cf80d164b891c7a Mon Sep 17 00:00:00 2001
From: Martin
Date: Sat, 21 Mar 2026 10:53:11 +0000
Subject: [PATCH] =?UTF-8?q?feat:=20Phase=200=20bootstrap=20=E2=80=94=20Qdr?=
 =?UTF-8?q?ant=20deploy=20e=20schema=20PostgreSQL?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- README.md: ALPHA_PROJECT context, multi-agent architecture, infrastructure stack
- CHANGELOG.md: documents the Qdrant v1.17.0 deployment and the creation of the pompeo database
- db/postgres.sql: DDL schema for the pompeo database (user_profile, memory_facts, finance_documents, behavioral_context, agent_messages) with user_id multi-tenancy
- db/qdrant.sh: script to create/restore the Qdrant collections (episodes, knowledge, preferences) with payload indexes

Design decisions:
- Multi-tenancy via user_id on both Qdrant and PostgreSQL (extensible to new users with no infrastructure changes)
- agent_messages as a persistent blackboard for the Proactive Arbiter

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 CHANGELOG.md    | 154 +++++++++++++++++++++++++++++++++++++++++++
 README.md       |  41 ++++++------
 db/postgres.sql | 170 ++++++++++++++++++++++++++++++++++++++++++++++++
 db/qdrant.sh    |  91 ++++++++++++++++++++++++++
 4 files changed, 437 insertions(+), 19 deletions(-)
 create mode 100644 CHANGELOG.md
 create mode 100644 db/postgres.sql
 create mode 100644 db/qdrant.sh

diff --git a/CHANGELOG.md b/CHANGELOG.md
new file mode 100644
index 0000000..541b788
--- /dev/null
+++ b/CHANGELOG.md
@@ -0,0 +1,154 @@
+# ALPHA_PROJECT — Changelog
+
+All significant changes to ALPHA_PROJECT are documented here.
+
+---
+
+## [2026-03-21] PostgreSQL — Database "pompeo" and ALPHA_PROJECT schema
+
+### Overview
+
+Created the `pompeo` database on the Patroni cluster (namespace `persistence`) and applied the initial schema for Pompeo's structured memory. Second milestone of Phase 0 — Infrastructure Bootstrap.
+
+---
+
+### Patroni manifest change
+
+Added `pompeo: martin` to the `databases` section of `infra/cluster/persistence/patroni/postgres.yaml`. The database was created automatically by the Zalando Operator, with no downtime for the other databases.
+
+Idempotent DDL script available at: `alpha/db/postgres.sql`
+
+---
+
+### Design decision — Multi-tenancy in PostgreSQL as well
+
+Consistent with the choice made for Qdrant, every table includes the column `user_id TEXT NOT NULL DEFAULT 'martin'`. The values `'martin'` and `'shared'` are seeded into `user_profile` as the system's initial users.
+
+Adding a new user later requires no schema changes — just insert a row into `user_profile` and use the new `user_id` in the INSERTs.
+
+---
+
+### Design decision — agent_messages as a persistent blackboard
+
+The `agent_messages` table implements the message broker's **blackboard pattern**: each n8n agent inserts its observations with `arbiter_decision = NULL` (pending). The Proactive Arbiter reads the queued messages, decides (`notify` / `defer` / `discard`) and updates `arbiter_decision`, `arbiter_reason` and `processed_at`.
+
+Compared to using NATS/Redis alone as the broker, this approach guarantees a **permanent audit log** of all observations and decisions, queryable via SQL for debugging, tuning and historical analysis.
+
+---
+
+### Schema created
+
+**5 tables** in the `pompeo` database:
+
+| Table | Role |
+|---|---|
+| `user_profile` | Static per-user preferences (language, timezone, notification style, quiet hours). Seed: `martin`, `shared` |
+| `memory_facts` | Episodic facts produced by all agents, with TTL (`expires_at`) and a reference to the Qdrant point (`qdrant_id`) |
+| `finance_documents` | Structured financial documents: utility bills, invoices, payslips. Includes `raw_text` for embedding |
+| `behavioral_context` | IoT/behavioral context for the Arbiter: DND, home presence, event type |
+| `agent_messages` | Message-broker blackboard — agent observations + Arbiter decisions |
+
+**15 indexes** in total (the 5 primary keys plus the 10 secondary indexes below):
+
+| Index | Table | Type |
+|---|---|---|
+| `idx_memory_facts_user_source_cat` | `memory_facts` | `(user_id, source, category)` |
+| `idx_memory_facts_expires` | `memory_facts` | `(expires_at)` WHERE NOT NULL |
+| `idx_memory_facts_action` | `memory_facts` | `(user_id, action_required)` WHERE true |
+| `idx_finance_docs_user_date` | `finance_documents` | `(user_id, doc_date DESC)` |
+| `idx_finance_docs_correspondent` | `finance_documents` | `(user_id, correspondent)` |
+| `idx_behavioral_ctx_user_time` | `behavioral_context` | `(user_id, start_at, end_at)` |
+| `idx_behavioral_ctx_dnd` | `behavioral_context` | `(user_id, do_not_disturb)` WHERE true |
+| `idx_agent_msgs_pending` | `agent_messages` | `(user_id, priority, created_at)` WHERE pending |
+| `idx_agent_msgs_agent_type` | `agent_messages` | `(agent, event_type, created_at)` |
+| `idx_agent_msgs_expires` | `agent_messages` | `(expires_at)` WHERE pending AND NOT NULL |
+
+---
+
+### Phase 0 — Updated status
+
+- [x] ~~Deploy **Qdrant** on the cluster~~ ✅ 2026-03-21
+- [x] ~~Qdrant collections with `user_id` multi-tenancy~~ ✅ 2026-03-21
+- [x] ~~Qdrant payload indexes~~ ✅ 2026-03-21
+- [x] ~~Database `pompeo` + PostgreSQL schema~~ ✅ 2026-03-21
+- [ ] Verify embedding endpoint via Copilot (`text-embedding-3-small`)
+- [ ] Migration to Ollama `nomic-embed-text` (once the LLM server is online)
+
+---
+
+## [2026-03-21] Qdrant — Deploy and collections setup (Phase 0)
+
+### Overview
+
+Completed the deployment of **Qdrant v1.17.0** on the Kubernetes cluster (namespace `persistence`) and the creation of the collections for Pompeo's semantic memory. This is the first milestone of Phase 0 — Infrastructure Bootstrap.
+
+---
+
+### Infrastructure deployment
+
+Qdrant deployed via the official Helm chart (`qdrant/qdrant`) in the `persistence` namespace, consistent with the existing infrastructure pattern (Longhorn storage, Sealed Secrets, Prometheus ServiceMonitor).
+
+**Resources created:**
+
+| Resource | Detail |
+|---|---|
+| StatefulSet `qdrant` | 1/1 pod Running, image `qdrant/qdrant:v1.17.0` |
+| PVC `qdrant-storage-qdrant-0` | 20Gi Longhorn RWO |
+| Service `qdrant` | ClusterIP — ports 6333 (REST), 6334 (gRPC), 6335 (p2p) |
+| SealedSecret `qdrant-api-secret` | Encrypted API key, namespace `persistence` |
+| ServiceMonitor `qdrant` | Prometheus scraping on `:6333/metrics`, label `release: monitoring` |
+
+**Internal endpoint:** `qdrant.persistence.svc.cluster.local:6333`
+
+Manifests in: `infra/cluster/persistence/qdrant/`
+
+---
+
+### Design decision — Multi-tenant collections (Option B)
+
+**Problem addressed**: naming the collections `martin_episodes`, `martin_knowledge`, `martin_preferences` would have constrained Pompeo to being exclusively a single-person assistant, making it impossible — without a migration — to extend the system to other family members later.
+
+**Choice adopted**: multi-tenant architecture with 3 shared collections and isolation via a `user_id` field in the payload of every vector point.
+
+```
+episodes    ← user_id: "martin" | "shared"
+knowledge   ← user_id: "martin" | "shared"
+preferences ← user_id: "martin" | "shared"
+```
+
+The value `"shared"` is reserved for house/family data visible to all users (e.g. shared calendar, household documents, common finances). The n8n queries use a `should: [user_id=martin, user_id=shared]` filter to retrieve both personal and shared context.
+
+**Benefits**: adding a new user tomorrow requires no infrastructure change — only including the new `user_id` in upserts and queries.
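The `should` filter just described can be sketched as a request body. A minimal Python sketch follows; the `build_filter` helper and the optional `category` narrowing are illustrative assumptions, while the payload field names (`user_id`, `category`) come from the schema in this patch.

```python
# Illustrative sketch (not project code): builds the multi-tenant Qdrant
# filter described above. Field names match the payload schema in this patch;
# the helper itself is a hypothetical convenience.

def build_filter(user_id, category=None):
    """Match points owned by `user_id` OR shared household points."""
    qf = {
        "should": [
            {"key": "user_id", "match": {"value": user_id}},
            {"key": "user_id", "match": {"value": "shared"}},
        ]
    }
    if category is not None:
        # Optional narrowing by semantic domain, e.g. "finance"
        qf["must"] = [{"key": "category", "match": {"value": category}}]
    return qf

print(build_filter("martin", "finance"))
```

In a search request this dict goes under the `filter` key alongside the query vector, so the `should` clause is applied before the similarity scan.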
+
+---
+
+### Collections created
+
+All 3 collections are operational (status `green`):
+
+| Collection | Content |
+|---|---|
+| `episodes` | Episodic facts with timestamp (email, IoT, calendar, conversations) |
+| `knowledge` | Documents, Outline notes, newsletters, knowledge base |
+| `preferences` | Per-user preferences, habits and behavioral patterns |
+
+**Common payload schema** (5 indexes on each collection):
+
+| Field | Type | Purpose |
+|---|---|---|
+| `user_id` | keyword | Multi-tenant filter (`"martin"`, `"shared"`) |
+| `source` | keyword | Data origin (`"email"`, `"calendar"`, `"iot"`, `"paperless"`, …) |
+| `category` | keyword | Semantic domain (`"finance"`, `"work"`, `"personal"`, …) |
+| `date` | datetime | Timestamp of the fact — filterable by range |
+| `action_required` | bool | Flag for the Proactive Arbiter |
+
+**Vector size**: 1536 (compatible with `text-embedding-3-small` via GitHub Copilot — bootstrap phase). To be revisited at the migration to `nomic-embed-text` on Ollama.
+
+---
+
+### Phase 0 — Status at the time of the Qdrant deploy
+
+- [x] ~~Deploy **Qdrant** on the cluster~~
+- [x] ~~Create collections with `user_id` multi-tenancy~~
+- [x] ~~Payload indexes: `user_id`, `source`, `category`, `date`, `action_required`~~
+- [x] ~~Run **PostgreSQL migrations** on Patroni~~ ✅ completed in the same session
diff --git a/README.md b/README.md
index ce14267..2e02ecb 100644
--- a/README.md
+++ b/README.md
@@ -42,7 +42,7 @@ Production-grade self-hosted stack.
Key components relevant to ALPHA_PROJECT: | **n8n** | Primary orchestrator and workflow engine for all agents | | **Node-RED** | Event-driven automation, Home Assistant bridge | | **Patroni / PostgreSQL** | Persistent structured memory store | -| **Qdrant** | Vector store for semantic/episodic memory *(to be deployed)* | +| **Qdrant** | Vector store for semantic/episodic memory — `qdrant.persistence.svc.cluster.local:6333` | | **NATS / Redis Streams** | Message broker between agents *(to be chosen and deployed)* | | **Authentik** | SSO / IAM (OIDC) | | **Home Assistant** | IoT hub — device tracking, automations, sensors | @@ -160,17 +160,17 @@ CREATE TABLE behavioral_context ( ); ``` -**2. Semantic memory — Qdrant** +**2. Semantic memory — Qdrant** — `qdrant.persistence.svc.cluster.local:6333` -Vector embeddings for similarity search. Three collections: +Vector embeddings for similarity search. Three collections with **multi-tenant design**: isolation via `user_id` payload field (`"martin"`, `"shared"`, future users). | Collection | Content | |---|---| -| `martin_episodes` | Conversations, episodic facts with timestamp | -| `martin_knowledge` | Documents, Outline notes, newsletters, knowledge base | -| `martin_preferences` | Preferences, habits, behavioral patterns | +| `episodes` | Conversations, episodic facts with timestamp | +| `knowledge` | Documents, Outline notes, newsletters, knowledge base | +| `preferences` | Preferences, habits, behavioral patterns | -Each Qdrant point includes a metadata payload for pre-filtering (source, date, category, action_required) to avoid full-scan similarity searches. +Each Qdrant point includes a metadata payload for pre-filtering (`user_id`, `source`, `date`, `category`, `action_required`) to avoid full-scan similarity searches. **3. 
Profile memory — PostgreSQL (static table)**

@@ -266,12 +266,15 @@ Notification is sent via **Amazon Echo / Pompeo** (TTS) for voice, and **Telegra

 ### Phase 0 — Infrastructure Bootstrap *(prerequisite for everything)*

-- [ ] Deploy **Qdrant** on the Kubernetes cluster
-  - Create collections: `martin_episodes`, `martin_knowledge`, `martin_preferences`
-  - Configure payload indexes on: `source`, `category`, `date`, `action_required`
-- [ ] Run **PostgreSQL migrations** on Patroni
-  - Create tables: `memory_facts`, `finance_documents`, `behavioral_context`
-  - Add index on `memory_facts(source, category, expires_at)`
+- [x] ~~Deploy **Qdrant** on the Kubernetes cluster~~ ✅ 2026-03-21
+  - Collections: `episodes`, `knowledge`, `preferences` (multi-tenant via `user_id` payload field)
+  - Payload indexes: `user_id`, `source`, `category`, `date`, `action_required`
+  - Endpoint: `qdrant.persistence.svc.cluster.local:6333`
+- [x] ~~Run **PostgreSQL migrations** on Patroni~~ ✅ 2026-03-21
+  - Database `pompeo` created (Zalando Operator)
+  - Tables: `user_profile`, `memory_facts`, `finance_documents`, `behavioral_context`, `agent_messages`
+  - Multi-tenancy: `user_id` column on every table, seeded with `martin` + `shared`
+  - DDL script: `alpha/db/postgres.sql`
 - [ ] Verify embedding endpoint via Copilot (`text-embedding-3-small`) as bootstrap fallback
 - [ ] Plan migration to local Ollama embedding model once LLM server is online

@@ -281,13 +284,13 @@ Notification is sent via **Amazon Echo / Pompeo** (TTS) for voice, and **Telegra

 - [ ] **Daily Digest**: after `Parse risposta GPT-4.1`, add:
   - Postgres INSERT into `memory_facts` (source=email, category, subject, detail JSONB, action_required, expires_at)
-  - Embedding generation (Copilot endpoint) → Qdrant upsert into `martin_episodes`
+  - Embedding generation (Copilot endpoint) → Qdrant upsert into `episodes` (user_id=martin)
   - Thread dedup: use `thread_id` as logical key, update existing Qdrant point if thread already exists
 - [
] **Upload Bolletta** + **Upload Documento (Telegram)**: after `Paperless - Patch Metadati`, add: - Postgres INSERT into `finance_documents` (correspondent, amount, doc_date, doc_type, tags, paperless_doc_id) - Postgres INSERT into `memory_facts` (source=paperless, category=finance, cross-reference) - - Embedding of OCR text chunks → Qdrant upsert into `martin_knowledge` + - Embedding of OCR text chunks → Qdrant upsert into `knowledge` (user_id=martin) --- @@ -316,7 +319,7 @@ Notification is sent via **Amazon Echo / Pompeo** (TTS) for voice, and **Telegra - [ ] **Newsletter Agent** - Separate Gmail label for newsletters (excluded from Daily Digest main flow) - - Morning cron: summarize + extract relevant articles → `martin_knowledge` + - Morning cron: summarize + extract relevant articles → `knowledge` --- @@ -351,9 +354,9 @@ Notification is sent via **Amazon Echo / Pompeo** (TTS) for voice, and **Telegra - PDF → FileWizard OCR → GPT-4.1 metadata extraction (month, gross, net, deductions) - Paperless upload with tag `Cedolino` - Persist structured data to `finance_documents` (custom fields for payslip) - - Trend embedding in `martin_knowledge` for finance agent queries -- [ ] Behavioral habit modeling: aggregate `behavioral_context` records over time, generate periodic "habit summary" embeddings in `martin_preferences` -- [ ] Outline → Qdrant pipeline: sync selected Outline documents into `martin_knowledge` on edit/publish event + - Trend embedding in `knowledge` for finance agent queries +- [ ] Behavioral habit modeling: aggregate `behavioral_context` records over time, generate periodic "habit summary" embeddings in `preferences` +- [ ] Outline → Qdrant pipeline: sync selected Outline documents into `knowledge` on edit/publish event - [ ] Chrome browsing history ingestion (privacy-filtered): evaluate browser extension or local export → embedding pipeline for interest/preference modeling - [ ] "Posti e persone" graph: structured contact/location model in Postgres, 
populated from email senders, calendar attendees, Home Assistant presence data
- [ ] Local embedding model: migrate from Copilot `text-embedding-3-small` to Ollama-served model (e.g. `nomic-embed-text`) once LLM server is stable
diff --git a/db/postgres.sql b/db/postgres.sql
new file mode 100644
index 0000000..9cf0140
--- /dev/null
+++ b/db/postgres.sql
@@ -0,0 +1,170 @@
+-- =============================================================================
+-- ALPHA_PROJECT — Database "pompeo" — Initial schema
+-- =============================================================================
+-- Apply to: postgresql://martin@postgres.persistence.svc.cluster.local:5432/pompeo
+--
+-- Run from the cluster:
+--   sudo microk8s kubectl run psql-pompeo --rm -it \
+--     --image=postgres:17-alpine --namespace=persistence \
+--     --env="PGPASSWORD=" --restart=Never \
+--     -- psql "postgresql://martin@postgres:5432/pompeo" -f /dev/stdin < postgres.sql
+--
+-- Run via port-forward:
+--   sudo microk8s kubectl port-forward svc/postgres -n persistence 5432:5432
+--   psql "postgresql://martin@localhost:5432/pompeo" -f postgres.sql
+-- =============================================================================
+
+\c pompeo
+
+-- ---------------------------------------------------------------------------
+-- Extensions
+-- ---------------------------------------------------------------------------
+CREATE EXTENSION IF NOT EXISTS "uuid-ossp"; -- kept for compatibility; gen_random_uuid() is built in since PostgreSQL 13
+CREATE EXTENSION IF NOT EXISTS "pg_trgm";   -- full-text similarity search on subject/detail
+
+
+-- =============================================================================
+-- 1. USER_PROFILE
+-- Static per-user preferences. Updated manually or via agent action.
+-- user_id 'shared' = household preferences (visible to everyone).
+-- =============================================================================
+CREATE TABLE IF NOT EXISTS user_profile (
+    user_id            TEXT PRIMARY KEY,
+    display_name       TEXT,
+    language           TEXT NOT NULL DEFAULT 'it',
+    timezone           TEXT NOT NULL DEFAULT 'Europe/Rome',
+    notification_style TEXT NOT NULL DEFAULT 'concise', -- 'concise' | 'verbose'
+    quiet_start        TIME NOT NULL DEFAULT '23:00',
+    quiet_end          TIME NOT NULL DEFAULT '07:00',
+    preferences        JSONB,                 -- freeform: thresholds, extra per-agent preferences
+    updated_at         TIMESTAMP NOT NULL DEFAULT now()
+);
+
+-- Initial users
+INSERT INTO user_profile (user_id, display_name) VALUES
+    ('martin', 'Martin'),
+    ('shared', 'Shared')
+ON CONFLICT (user_id) DO NOTHING;
+
+
+-- =============================================================================
+-- 2. MEMORY_FACTS
+-- Episodic facts produced by all agents. TTL via expires_at.
+-- qdrant_id: reference to the matching vector point in the "episodes" collection.
+-- =============================================================================
+CREATE TABLE IF NOT EXISTS memory_facts (
+    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    user_id         TEXT NOT NULL DEFAULT 'martin',
+    source          TEXT NOT NULL,        -- 'email' | 'calendar' | 'iot' | 'paperless' | 'n8n' | ...
+    category        TEXT,                 -- 'finance' | 'personal' | 'work' | 'health' | ...
+    subject         TEXT,
+    detail          JSONB,                -- flexible per-source payload
+    action_required BOOLEAN NOT NULL DEFAULT false,
+    action_text     TEXT,
+    created_at      TIMESTAMP NOT NULL DEFAULT now(),
+    expires_at      TIMESTAMP,            -- NULL = permanent
+    qdrant_id       UUID                  -- logical FK → "episodes" collection
+);
+
+CREATE INDEX IF NOT EXISTS idx_memory_facts_user_source_cat
+    ON memory_facts(user_id, source, category);
+
+CREATE INDEX IF NOT EXISTS idx_memory_facts_expires
+    ON memory_facts(expires_at)
+    WHERE expires_at IS NOT NULL;
+
+CREATE INDEX IF NOT EXISTS idx_memory_facts_action
+    ON memory_facts(user_id, action_required)
+    WHERE action_required = true;
+
+
+-- =============================================================================
+-- 3. FINANCE_DOCUMENTS
+-- Structured financial documents (utility bills, invoices, payslips).
+-- paperless_doc_id: reference to the document in Paperless-ngx.
+-- =============================================================================
+CREATE TABLE IF NOT EXISTS finance_documents (
+    id               UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    user_id          TEXT NOT NULL DEFAULT 'martin',
+    paperless_doc_id INT,                 -- document ID in Paperless-ngx
+    correspondent    TEXT,
+    amount           NUMERIC(10,2),
+    currency         TEXT NOT NULL DEFAULT 'EUR',
+    doc_date         DATE,
+    doc_type         TEXT,                -- 'bolletta' | 'fattura' | 'cedolino' | ...
+    tags             TEXT[],
+    raw_text         TEXT,                -- raw OCR text (for embedding)
+    created_at       TIMESTAMP NOT NULL DEFAULT now()
+);
+
+CREATE INDEX IF NOT EXISTS idx_finance_docs_user_date
+    ON finance_documents(user_id, doc_date DESC);
+
+CREATE INDEX IF NOT EXISTS idx_finance_docs_correspondent
+    ON finance_documents(user_id, correspondent);
+
+
+-- =============================================================================
+-- 4. BEHAVIORAL_CONTEXT
+-- Behavioral context produced by the IoT Agent and the Calendar Agent.
+-- Used by the Proactive Arbiter to honor DND and estimate presence.
+-- =============================================================================
+CREATE TABLE IF NOT EXISTS behavioral_context (
+    id                     UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    user_id                TEXT NOT NULL DEFAULT 'martin',
+    event_type             TEXT,          -- 'sport_event' | 'dog_walk' | 'work_session' | 'commute' | ...
+    start_at               TIMESTAMP,
+    end_at                 TIMESTAMP,
+    do_not_disturb         BOOLEAN NOT NULL DEFAULT false,
+    home_presence_expected BOOLEAN,
+    notes                  TEXT,
+    created_at             TIMESTAMP NOT NULL DEFAULT now()
+);
+
+CREATE INDEX IF NOT EXISTS idx_behavioral_ctx_user_time
+    ON behavioral_context(user_id, start_at, end_at);
+
+CREATE INDEX IF NOT EXISTS idx_behavioral_ctx_dnd
+    ON behavioral_context(user_id, do_not_disturb)
+    WHERE do_not_disturb = true;
+
+
+-- =============================================================================
+-- 5. AGENT_MESSAGES
+-- Blackboard: every agent publishes its observations here.
+-- The Proactive Arbiter reads, decides (notify/defer/discard) and updates.
+-- Matches the message schema defined in alpha/README.md.
+-- =============================================================================
+CREATE TABLE IF NOT EXISTS agent_messages (
+    id               UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    agent            TEXT NOT NULL,       -- 'mail' | 'calendar' | 'iot' | 'finance' | 'infra' | ...
+    priority         TEXT NOT NULL,       -- 'low' | 'high'
+    event_type       TEXT NOT NULL,       -- 'new_fact' | 'reminder' | 'alert' | 'behavioral_observation'
+    user_id          TEXT NOT NULL DEFAULT 'martin',
+    subject          TEXT,
+    detail           JSONB,
+    source_ref       TEXT,                -- Postgres record ID or external ref
+    expires_at       TIMESTAMP,
+    arbiter_decision TEXT,                -- NULL (pending) | 'notify' | 'defer' | 'discard'
+    arbiter_reason   TEXT,
+    created_at       TIMESTAMP NOT NULL DEFAULT now(),
+    processed_at     TIMESTAMP
+);
+
+CREATE INDEX IF NOT EXISTS idx_agent_msgs_pending
+    ON agent_messages(user_id, priority, created_at)
+    WHERE arbiter_decision IS NULL;
+
+CREATE INDEX IF NOT EXISTS idx_agent_msgs_agent_type
+    ON agent_messages(agent, event_type, created_at);
+
+CREATE INDEX IF NOT EXISTS idx_agent_msgs_expires
+    ON agent_messages(expires_at)
+    WHERE expires_at IS NOT NULL AND arbiter_decision IS NULL;
+
+
+-- =============================================================================
+-- End of script
+-- =============================================================================
+\echo '✅ pompeo schema applied successfully.'
+\echo '   Tables: user_profile, memory_facts, finance_documents, behavioral_context, agent_messages'
diff --git a/db/qdrant.sh b/db/qdrant.sh
new file mode 100644
index 0000000..7e76ab1
--- /dev/null
+++ b/db/qdrant.sh
@@ -0,0 +1,91 @@
+#!/usr/bin/env bash
+# =============================================================================
+# ALPHA_PROJECT — Qdrant — Collections and payload indexes setup
+# =============================================================================
+# Collections already created on 2026-03-21. Script kept for traceability
+# +# Prerequisiti: +# sudo microk8s kubectl port-forward svc/qdrant -n persistence 6333:6333 +# +# Esecuzione: +# bash alpha/db/qdrant.sh +# ============================================================================= + +set -euo pipefail + +QDRANT_URL="${QDRANT_URL:-http://localhost:6333}" +QDRANT_API_KEY="${QDRANT_API_KEY:-__Montecarlo00!}" + +# Dimensione vettori: 1536 = text-embedding-3-small (Copilot, bootstrap phase) +# Da aggiornare a 768 alla migrazione verso nomic-embed-text su Ollama +VECTOR_SIZE=1536 + +header_key="api-key: ${QDRANT_API_KEY}" + +echo "==> Connessione a ${QDRANT_URL}" +curl -sf "${QDRANT_URL}/" -H "${header_key}" | grep -o '"version":"[^"]*"' +echo "" + +# ----------------------------------------------------------------------------- +# Collections +# Architettura multi-tenant: isolamento via campo user_id nel payload. +# Valori user_id: "martin" | "shared" | +# ----------------------------------------------------------------------------- +for COL in episodes knowledge preferences; do + echo "==> Creazione collection: ${COL}" + curl -sf -X PUT "${QDRANT_URL}/collections/${COL}" \ + -H "${header_key}" \ + -H "Content-Type: application/json" \ + -d "{ + \"vectors\": { \"size\": ${VECTOR_SIZE}, \"distance\": \"Cosine\" }, + \"optimizers_config\": { \"default_segment_number\": 2 }, + \"replication_factor\": 1 + }" | grep -o '"status":"[^"]*"' +done + +echo "" + +# ----------------------------------------------------------------------------- +# Payload indexes (per pre-filtering efficiente prima della ricerca vettoriale) +# ----------------------------------------------------------------------------- +for COL in episodes knowledge preferences; do + echo "==> Indexes per collection: ${COL}" + + for FIELD in user_id source category; do + printf " %-20s (keyword) → " "${FIELD}" + curl -sf -X PUT "${QDRANT_URL}/collections/${COL}/index" \ + -H "${header_key}" \ + -H "Content-Type: application/json" \ + -d "{\"field_name\": \"${FIELD}\", 
\"field_schema\": \"keyword\"}" \
+      | grep -o '"status":"[^"]*"'
+  done
+
+  printf "    %-20s (datetime) → " "date"
+  curl -sf -X PUT "${QDRANT_URL}/collections/${COL}/index" \
+    -H "${header_key}" \
+    -H "Content-Type: application/json" \
+    -d '{"field_name": "date", "field_schema": "datetime"}' \
+    | grep -o '"status":"[^"]*"'
+
+  printf "    %-20s (bool)     → " "action_required"
+  curl -sf -X PUT "${QDRANT_URL}/collections/${COL}/index" \
+    -H "${header_key}" \
+    -H "Content-Type: application/json" \
+    -d '{"field_name": "action_required", "field_schema": "bool"}' \
+    | grep -o '"status":"[^"]*"'
+done
+
+echo ""
+
+# -----------------------------------------------------------------------------
+# Final check
+# -----------------------------------------------------------------------------
+echo "==> Active collections:"
+curl -sf "${QDRANT_URL}/collections" -H "${header_key}" \
+  | python3 -c "import sys,json; [print('  -', c['name']) for c in json.load(sys.stdin)['result']['collections']]"
+
+echo ""
+echo "✅ Qdrant setup complete."
+echo "   Collections: episodes, knowledge, preferences"
+echo "   Payload indexes: user_id, source, category, date, action_required"
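
To make the blackboard cycle documented in this patch concrete (agents insert with `arbiter_decision = NULL`; the Proactive Arbiter reads pending rows, decides, and stamps them), here is a minimal self-contained sketch. It uses `sqlite3` in place of PostgreSQL with a trimmed column set from `db/postgres.sql`; the notify-high/defer-low policy is an illustrative assumption, not the Arbiter's real logic.

```python
import sqlite3

# Trimmed-down agent_messages table (sqlite3 stand-in for PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE agent_messages (
        id               INTEGER PRIMARY KEY,
        agent            TEXT NOT NULL,
        priority         TEXT NOT NULL,   -- 'low' | 'high'
        event_type       TEXT NOT NULL,
        user_id          TEXT NOT NULL DEFAULT 'martin',
        subject          TEXT,
        arbiter_decision TEXT,            -- NULL = pending
        arbiter_reason   TEXT,
        processed_at     TEXT
    )
""")

# An agent publishes an observation; arbiter_decision stays NULL (pending).
conn.execute(
    "INSERT INTO agent_messages (agent, priority, event_type, subject) "
    "VALUES ('mail', 'high', 'alert', 'Invoice due tomorrow')"
)

# The Arbiter drains the pending queue, high priority first
# (same predicate as idx_agent_msgs_pending: arbiter_decision IS NULL).
pending = conn.execute(
    "SELECT id, priority FROM agent_messages "
    "WHERE arbiter_decision IS NULL "
    "ORDER BY (priority = 'high') DESC, id"
).fetchall()

for msg_id, priority in pending:
    decision = "notify" if priority == "high" else "defer"  # assumed policy
    conn.execute(
        "UPDATE agent_messages "
        "SET arbiter_decision = ?, arbiter_reason = ?, processed_at = datetime('now') "
        "WHERE id = ?",
        (decision, "auto: " + priority + " priority", msg_id),
    )

remaining = conn.execute(
    "SELECT COUNT(*) FROM agent_messages WHERE arbiter_decision IS NULL"
).fetchone()[0]
print("pending after arbiter pass:", remaining)  # 0
```

Against the real schema the cycle is the same two statements (the agent's INSERT and the Arbiter's UPDATE); because the rows are never deleted, they form the SQL-queryable audit log the changelog describes.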