feat(alexa-skill): implement aws-lambda and alexa-skill for n8n webhook invocation
1
.gitignore
vendored
Normal file
@@ -0,0 +1 @@
.venv/
353
CHANGELOG.md
@@ -4,6 +4,359 @@ All significant changes to the ALPHA_PROJECT project are documented here.

---

## [2026-03-21] AWS Lambda Bridge for Alexa Skill "Pompeo"

Completed the planning and implementation of the AWS Lambda function that acts as a bridge between the Alexa skill "Pompeo" and the n8n backend. This concludes the local development part of "Phase 4: Voice Interface (Pompeo)".

### Planning and Design

- **Created `aws-lambda/README.md`**: a detailed deploy plan that includes:
  - Analysis of functional and non-functional requirements (zero cost, private distribution).
  - Architecture design (Lambda as a "thin bridge").
  - Specification of the IAM role (`AWSLambdaBasicExecutionRole`).
  - Design of the function logic in Python.
  - A step-by-step guide for creating the resources on AWS (IAM, Lambda) and in the Alexa Developer Console.
- **Research and Technical Choices**:
  - Confirmed that the AWS Lambda free tier is sufficient to keep the cost at zero.
  - Established that private distribution is achieved by keeping the skill in the "In Development" state, without publishing it.
  - Settled on a two-word "Invocation Name" (e.g. "maggiordomo Pompeo") to comply with Alexa policies.

### Implementation

- **Project Structure**: created the `aws-lambda/src` subfolder for the source code.
- **Source Code**: implemented the `lambda_handler` function in `aws-lambda/src/index.py`, which securely forwards Alexa requests to the n8n webhook.
- **Dependencies**: declared the dependencies (`requests`) in `aws-lambda/src/requirements.txt`.
- **Deploy Package**: created the Python virtual environment and installed the dependencies directly into the `src` folder to prepare the deploy package.
- **Final Artifact**: generated `aws-lambda/pompeo-alexa-bridge.zip`, ready to be uploaded to the AWS Lambda console.

---

## [2026-03-21] Jellyfin Playback Agent: Block A completed

### New n8n workflow

- **`🎬 Pompeo — Jellyfin Playback [Webhook]`** (`AyrKWvboPldzZPsM`): receives webhooks from Jellyfin (PlaybackStart / PlaybackStop), filters for user `martin` (userId whitelist), and writes to Postgres:
  - **PlaybackStart** → INSERT into `behavioral_context` (`event_type=watching_media`, `do_not_disturb=true`, notes with item/device/session_id) + INSERT into `agent_messages` (subject `▶️ <title> (<device>)`)
  - **PlaybackStop** → UPDATE on the most recent open row (`end_at=now()`, `do_not_disturb=false`) + INSERT into `agent_messages` (subject `⏹️ ...`)

### Bugs fixed (n8n infrastructure)

- **n8n v2 webhook path**: to register a webhook with a static path via the API, the `webhookId` field must be set as a top-level attribute of the node (not inside `parameters`). Without it, n8n generates the dynamic path `{workflowId}/{nodeName}/{path}`, which the webhook pod does not load correctly in queue mode.
- **Postgres SSL / Patroni**: Postgres credentials created via the API used SSL with `rejectUnauthorized=true` by default, which is incompatible with Patroni's self-signed certificate. Fix: added `NODE_TLS_REJECT_UNAUTHORIZED=0` to the `n8n-app` and `n8n-app-worker` deployments.
- **Postgres node queryParams**: `additionalFields.queryParams` with `$json.*` expressions does not work correctly in n8n v2.5.2. Fix: values inlined in the SQL via n8n expressions `{{ $json.field }}`.
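
The `webhookId` placement described in the first fix can be illustrated with a minimal sketch. It only builds the node object that would go into a workflow payload; the node name, `typeVersion`, `position` and the reuse of the path as `webhookId` are assumptions made for illustration, not values taken from the actual workflow.

```python
import json
import uuid
from typing import Optional


def build_webhook_node(path: str, webhook_id: Optional[str] = None) -> dict:
    """Return a Webhook node dict with `webhookId` at the top level of the node."""
    return {
        "name": "Webhook",                    # hypothetical node name
        "type": "n8n-nodes-base.webhook",
        "typeVersion": 2,                     # assumed version
        "position": [0, 0],                   # placeholder canvas position
        # Top-level attribute of the node, NOT nested inside "parameters".
        "webhookId": webhook_id or str(uuid.uuid4()),
        "parameters": {"httpMethod": "POST", "path": path},
    }


node = build_webhook_node("jellyfin-playback", webhook_id="jellyfin-playback")
print(json.dumps(node, indent=2, ensure_ascii=False))
```

Keeping `webhookId` next to `name` and `type` rather than under `parameters` is the whole point of the fix; everything else in the dict is scaffolding.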

### Jellyfin configuration

- Jellyfin webhook plugin configured to call `http://n8n-app-webhook.automation.svc.cluster.local/webhook/jellyfin-playback` (POST, events: PlaybackStart + PlaybackStop)

---

## [2026-03-21] Daily Digest: Postgres memory integration

### Changes to the workflow `📬 Gmail — Daily Digest [Schedule]` (`1lIKvVJQIcva30YM`)

Added a parallel branch that saves facts to memory after the GPT classification:

```
Parse risposta GPT-4.1 ──┬──> Telegram - Invia Report (unchanged)
                         ├──> Dividi Email (unchanged)
                         └──> 🧠 Estrai Fatti ──> 🔀 Ha Fatti? ──> 💾 Upsert Memoria
```

**`🧠 Estrai Fatti` (Code):**

- Filters emails with `action != 'trash'` and a non-empty summary
- Calls GPT-4.1 in batch to extract, for each email: `fact`, `category`, `ttl_days`, `pompeo_note`, `entity_refs`
- Computes `expires_at` from `ttl_days` (14 days for bookings, 45 days for bills, 90 days for work/condominium, 30 days by default)
- Returns one item for each fact to persist
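
The `expires_at` computation in the list above can be sketched as follows. This is a Python illustration, not the code of the n8n Code node; the category keys (`booking`, `bill`, `work`, `condominium`) are assumed labels, and only the day counts come from the changelog.

```python
from datetime import datetime, timedelta, timezone

# Assumed category labels; the day counts are the ones listed in the changelog.
TTL_DAYS_BY_CATEGORY = {"booking": 14, "bill": 45, "work": 90, "condominium": 90}
DEFAULT_TTL_DAYS = 30


def compute_expires_at(category, ttl_days=None, now=None):
    """Return the expiry timestamp for a fact.

    Uses the `ttl_days` proposed by the model when it is a positive integer,
    otherwise falls back to the per-category default (30 days if unknown).
    """
    now = now or datetime.now(timezone.utc)
    if not isinstance(ttl_days, int) or ttl_days <= 0:
        ttl_days = TTL_DAYS_BY_CATEGORY.get(category, DEFAULT_TTL_DAYS)
    return now + timedelta(days=ttl_days)


reference = datetime(2026, 3, 21, tzinfo=timezone.utc)
print(compute_expires_at("bill", now=reference).date())
```

Passing `now` explicitly keeps the function deterministic, which makes it easy to check the fallback rules in isolation.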

**`💾 Upsert Memoria` (Postgres node → `mRqzxhSboGscolqI`):**

- `INSERT INTO memory_facts` with `source='email'`, `source_ref=threadId`
- `ON CONFLICT ON CONSTRAINT memory_facts_dedup_idx DO UPDATE` → updates the row if the same thread is reprocessed
- Saved fields: `category`, `subject`, `detail` (JSONB), `action_required`, `action_text`, `pompeo_note`, `entity_refs`, `expires_at`

### Related fix

- Added `newer_than:1d` to the Gmail query on both fetch nodes; this avoids re-fetching months-old emails that were never marked `Processed`

---

## [2026-03-21] DB Schema v2: contacts, memory_facts_archive, entity_refs

### New tables

- **`contacts`**: multi-tenant graph of people. Each row models a `user_id → subject` relationship with `relation`, `city`, `country`, `profession`, `aliases[]`, `born_year`, `details` (free-form narrative for the LLM) and `metadata` JSONB. It can be traversed recursively to infer second-degree relationships (e.g. Martin → uncle Mujsi → son Euris → first cousin on the mother's side). GIN indexes on `subject` (trigram) and `aliases` for similarity search.
- **`memory_facts_archive`**: destination of the weekly cleanup of expired facts. Same structure as `memory_facts` plus `archived_at` and `archive_reason` (`expired` | `superseded` | `merged`). Archived facts are then condensed into a weekly Qdrant episode.

### Columns added to `memory_facts`

- **`pompeo_note TEXT`**: the LLM's inner monologue at insert time, i.e. the "why" behind the fact (already used by the Calendar Agent, now standardized across all sources).
- **`entity_refs JSONB`**: entities extracted from the structured fact: `{people: [], places: [], products: [], amounts: []}`. Enables SQL queries on people/places without a full-text scan (e.g. `entity_refs->'people' ? 'euris vruzhaj'`).

### Applied to

- `alpha/db/postgres.sql` updated (schema v2)
- Live on the Patroni primary (`postgres-1`, namespace `persistence`, DB `pompeo`)

---

## [2026-03-21] Actual Budget: bank statement import via Telegram

### New workflows

- **`💰 Actual — Import Estratto Conto [Telegram]`** (`qtvB3r0cgejyCxUp`): imports the Banca Sella bank statement (CSV) into Actual Budget via Telegram.
  - Trigger: Telegram document with caption `Estratto conto`
  - Parses the Banca Sella CSV (`;` separator, `dd/mm/yyyy` dates, amounts with `.` as decimal separator)
  - Automatically skips `SALDO FINALE` and `SALDO INIZIALE`
  - GPT-4.1 classification in batches of 30 transactions: assigns payee and category, automatically creating missing ones in Actual
  - Import via `/transactions/import` with native dedup through `imported_id` (pattern `banca-sella-{Id}` or hash fallback)
  - Telegram report with newly imported transactions, already present ones, and the CSV total
- **`⏰ Actual — Reminder Estratto Conto [Schedule]`** (`w0oJ1i6sESvaB5W1`): daily reminder (09:00) on Telegram if the Google task "Actual - Estratto conto" in the "Finanze" list is overdue.
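
The parsing and dedup rules listed for the import workflow can be sketched in Python as below. The column names (`Id`, `Data`, `Descrizione`, `Importo`), the output keys and the exact shape of the hash fallback are assumptions for illustration; the changelog only specifies the separator, the date format, the skipped rows and the `banca-sella-{Id}` pattern.

```python
import csv
import hashlib
import io
from datetime import datetime

SKIP_DESCRIPTIONS = {"SALDO FINALE", "SALDO INIZIALE"}


def parse_statement(csv_text):
    """Parse a ';'-separated statement into transaction dicts with an imported_id."""
    transactions = []
    for row in csv.DictReader(io.StringIO(csv_text), delimiter=";"):
        description = (row.get("Descrizione") or "").strip()
        if description.upper() in SKIP_DESCRIPTIONS:
            continue  # balance rows are not transactions
        date = datetime.strptime(row["Data"], "%d/%m/%Y").date().isoformat()
        amount = float(row["Importo"])  # '.' is already the decimal separator
        row_id = (row.get("Id") or "").strip()
        if row_id:
            imported_id = f"banca-sella-{row_id}"
        else:
            # Hypothetical fallback: a stable hash of the row's key fields.
            digest = hashlib.sha256(f"{date}|{amount}|{description}".encode()).hexdigest()
            imported_id = f"banca-sella-{digest[:16]}"
        transactions.append(
            {"date": date, "amount": amount, "description": description, "imported_id": imported_id}
        )
    return transactions


sample = (
    "Id;Data;Descrizione;Importo\n"
    ";01/03/2026;SALDO INIZIALE;1000.00\n"
    "123;02/03/2026;Supermercato;-42.50\n"
    ";03/03/2026;Bonifico;100.00\n"
)
transactions = parse_statement(sample)
for tx in transactions:
    print(tx["imported_id"], tx["date"], tx["amount"])
```

Because the fallback hash depends only on the row content, re-importing the same statement yields the same `imported_id` values, which is what lets the import endpoint deduplicate.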

### Technical notes

- Binary data read with `getBinaryDataBuffer()` (compatible with n8n's filesystem binary mode)
- GPT loop handled with internal iteration in the Code node (no `splitInBatches`, which is unstable with multiple inputs)
- Missing payees/categories are created on the fly and reused in subsequent batches of the same run
- Actual dedup: `added` = new, `updated` = already present

---

## [2026-03-21] Calendar Agent: sync and schedule fixes

### Problems fixed

- **`ON CONFLICT DO NOTHING` → `DO UPDATE`**: modified events (time, title) were being ignored. They are now updated in Postgres.
- **Cleanup of cancelled events**: added the step `🗑️ Cleanup Cancellati`, which runs `DELETE FROM memory_facts WHERE source_ref NOT IN (current UIDs from HA)` for the 7-day window. If Martin cancels a meeting, it disappears from Postgres at the next run.
- **Schedule `*/30 * * * *`**: changed from a daily 06:30 cron to every 30 minutes, so the Postgres calendar stays aligned with the source of truth (HA/Google Calendar).

### Updated flow

```
... → 📋 Parse GPT → 🗑️ Cleanup Cancellati → 🔀 Riemetti → 💾 Upsert → 📦 → 📱
```

---

## [2026-03-20] Calendar Agent: first Pompeo workflow in production

### What was done

First Pompeo agent deployed and active on n8n: `📅 Pompeo — Calendar Agent [Schedule]` (ID `4ZIEGck9n4l5qaDt`).

### Design

- **Data source**: Home Assistant REST API used as a Google Calendar proxy; this avoids direct Google OAuth in n8n and works for all 25 calendars registered in HA.
- **Tracked calendars** (12): Lavoro, Famiglia, Spazzatura, Pulizie, Formula 1, WEC, Inter, Compleanni, Varie, Festività Italia, Films (Radarr), Serie TV (Sonarr).
- **LLM enrichment**: GPT-4.1 (via Copilot) classifies each event: category, action_required, do_not_disturb, priority, behavioral_context, pompeo_note.
- **Dedup**: `memory_facts.source_ref` = HA event UID; `ON CONFLICT DO NOTHING` on a partial unique index.
- **Telegram briefing**: every morning at 06:30, a summary of the events for the next 7 days, grouped by calendar.

### DB migrations applied

- `ALTER TABLE memory_facts ADD COLUMN source_ref TEXT`: column for the external dedup ID
- `CREATE UNIQUE INDEX memory_facts_dedup_idx ON memory_facts (user_id, source, source_ref) WHERE source_ref IS NOT NULL`
- `CREATE INDEX idx_memory_facts_source_ref ON memory_facts (source_ref) WHERE source_ref IS NOT NULL`

### n8n credentials created

| ID | Name | Type |
|---|---|---|
| `u0JCseXGnDG5hS9F` | Home Assistant API | HTTP Header Auth |
| `mRqzxhSboGscolqI` | Pompeo — PostgreSQL | Postgres (pompeo/martin) |

### Workflow flow

```
⏰ Schedule (06:30) → 📅 Range → 🔑 Token Copilot
  → 📋 Calendari (12 items) → 📡 HA Fetch (×12) → 🏷️ Estrai + Tag
  → 📝 Prompt (dedup) → 🤖 GPT-4.1 → 📋 Parse
  → 💾 Postgres Upsert (memory_facts) → 📦 Aggrega → 📱 Telegram
```

---

## [2026-03-21] ADR: Message Broker, no dedicated broker

### Decision

**No dedicated message broker will be deployed** (neither NATS JetStream nor Redis Streams). The blackboard pattern is implemented entirely on PostgreSQL + n8n webhooks.

### Rationale

In the initial design, the broker was needed to decouple the agents from the Arbiter. With the introduction of the `agent_messages` table in the `pompeo` database, that goal is already met:

```
n8n agent → INSERT agent_messages (arbiter_decision = NULL)
Arbiter   → SELECT WHERE arbiter_decision IS NULL  (cron polling)
          → UPDATE arbiter_decision = 'notify' | 'defer' | 'discard'
```

The high-priority flow (immediate bypass of the Arbiter) is handled by the agent calling the **Arbiter's n8n webhook** directly, with zero additional infrastructure.

### Alternatives considered

| Option | Outcome | Rationale |
|---|---|---|
| `agent_messages` on PostgreSQL | ✅ **Adopted** | Already deployed, persistent, queryable, free audit log |
| Redis Streams | ⏸ Deferred | Already in the cluster; can be reconsidered if volume grows |
| NATS JetStream | ❌ Rejected | New component to operate, overkill for the current volume (a few msgs/hour) and for the single-household use case |

### Impact on README.md

The "Message Broker (Blackboard Pattern)" section remains conceptually valid. The `agent` field and the message schema defined in the README are respected in the `agent_messages` table; only the transport changes (Postgres instead of NATS/Redis).

---

## [2026-03-21] PostgreSQL: "pompeo" database and ALPHA_PROJECT schema

### Overview

Created the `pompeo` database on the Patroni cluster (namespace `persistence`) and applied the initial schema for Pompeo's structured memory. Second milestone of Phase 0 (Infrastructure Bootstrap).

---

### Patroni manifest change

Added `pompeo: martin` to the `databases` section of `infra/cluster/persistence/patroni/postgres.yaml`. The database was created automatically by the Zalando Operator with no downtime for the other databases.

Idempotent DDL script available at: `alpha/db/postgres.sql`

---

### Design decision: multi-tenancy in PostgreSQL too

Consistent with the choice made for Qdrant, all tables include the field `user_id TEXT NOT NULL DEFAULT 'martin'`. The values `'martin'` and `'shared'` are seeded in `user_profile` as the initial users of the system.

Adding a new user in the future requires no schema changes: it is enough to insert a row into `user_profile` and use the new `user_id` in the INSERTs.

---

### Design decision: agent_messages as a persistent blackboard

The `agent_messages` table implements the message broker's **blackboard pattern**: each n8n agent inserts its observations with `arbiter_decision = NULL` (pending). The Proactive Arbiter reads the queued messages, decides (`notify` / `defer` / `discard`) and updates `arbiter_decision`, `arbiter_reason` and `processed_at`.

Compared with using only NATS/Redis as the broker, this approach guarantees a **permanent audit log** of all observations and decisions, queryable via SQL for debugging, tuning and historical analysis.
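
The insert / poll / decide cycle described above can be sketched with an in-memory SQLite table standing in for Postgres. The reduced column set, the `subject` column and the demo decision rule are assumptions for illustration; the real table lives in the `pompeo` database and the real Arbiter is an n8n workflow.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE agent_messages (
           id INTEGER PRIMARY KEY,
           user_id TEXT NOT NULL DEFAULT 'martin',
           agent TEXT NOT NULL,
           subject TEXT NOT NULL,
           arbiter_decision TEXT,   -- NULL means "pending"
           arbiter_reason TEXT,
           processed_at TEXT
       )"""
)


def post_observation(agent, subject):
    """Agent side: write an observation and leave the decision pending."""
    conn.execute("INSERT INTO agent_messages (agent, subject) VALUES (?, ?)", (agent, subject))


def arbiter_poll(decide):
    """Arbiter side: decide on every pending row and return how many were handled."""
    pending = conn.execute(
        "SELECT id, agent, subject FROM agent_messages "
        "WHERE arbiter_decision IS NULL ORDER BY id"
    ).fetchall()
    for msg_id, agent, subject in pending:
        decision, reason = decide(agent, subject)
        conn.execute(
            "UPDATE agent_messages SET arbiter_decision = ?, arbiter_reason = ?, "
            "processed_at = ? WHERE id = ?",
            (decision, reason, datetime.now(timezone.utc).isoformat(), msg_id),
        )
    return len(pending)


def demo_rule(agent, subject):
    # Hypothetical rule: defer media events, notify everything else.
    return ("defer", "media playback") if agent == "jellyfin" else ("notify", "default")


post_observation("calendar", "Meeting at 15:00")
post_observation("jellyfin", "Playback started")
handled = arbiter_poll(demo_rule)
print(handled)
```

Processed rows stay in the table with their decision and reason, which is the audit-log property the design decision relies on.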

---

### Schema created

**5 tables** in the `pompeo` database:

| Table | Role |
|---|---|
| `user_profile` | Static per-user preferences (language, timezone, notification style, quiet hours). Seed: `martin`, `shared` |
| `memory_facts` | Episodic facts produced by all agents, with TTL (`expires_at`) and a reference to the Qdrant point (`qdrant_id`) |
| `finance_documents` | Structured financial documents: bills, invoices, payslips. Includes `raw_text` for embedding |
| `behavioral_context` | IoT/behavioral context for the Arbiter: DND, home presence, event type |
| `agent_messages` | Message broker blackboard: agent observations + Arbiter decisions |

**15 indexes** in total:

| Index | Table | Definition |
|---|---|---|
| `idx_memory_facts_user_source_cat` | `memory_facts` | `(user_id, source, category)` |
| `idx_memory_facts_expires` | `memory_facts` | `(expires_at)` WHERE NOT NULL |
| `idx_memory_facts_action` | `memory_facts` | `(user_id, action_required)` WHERE true |
| `idx_finance_docs_user_date` | `finance_documents` | `(user_id, doc_date DESC)` |
| `idx_finance_docs_correspondent` | `finance_documents` | `(user_id, correspondent)` |
| `idx_behavioral_ctx_user_time` | `behavioral_context` | `(user_id, start_at, end_at)` |
| `idx_behavioral_ctx_dnd` | `behavioral_context` | `(user_id, do_not_disturb)` WHERE true |
| `idx_agent_msgs_pending` | `agent_messages` | `(user_id, priority, created_at)` WHERE pending |
| `idx_agent_msgs_agent_type` | `agent_messages` | `(agent, event_type, created_at)` |
| `idx_agent_msgs_expires` | `agent_messages` | `(expires_at)` WHERE pending AND NOT NULL |

---

### Phase 0: updated status

- [x] ~~Deploy **Qdrant** on the cluster~~ ✅ 2026-03-21
- [x] ~~Qdrant collections with `user_id` multi-tenancy~~ ✅ 2026-03-21
- [x] ~~Qdrant payload indexes~~ ✅ 2026-03-21
- [x] ~~`pompeo` database + PostgreSQL schema~~ ✅ 2026-03-21
- [ ] Verify embedding endpoint via Copilot (`text-embedding-3-small`)
- [ ] Migration to Ollama `nomic-embed-text` (once the LLM server is online)

---

## [2026-03-21] Qdrant: deploy and collections setup (Phase 0)

### Overview

Completed the deploy of **Qdrant v1.17.0** on the Kubernetes cluster (namespace `persistence`) and the creation of the collections for Pompeo's semantic memory. This is the first milestone of Phase 0 (Infrastructure Bootstrap).

---

### Infrastructure deploy

Qdrant deployed via the official Helm chart (`qdrant/qdrant`) in the `persistence` namespace, consistent with the existing infrastructure pattern (Longhorn storage, Sealed Secrets, Prometheus ServiceMonitor).

**Resources created:**

| Resource | Detail |
|---|---|
| StatefulSet `qdrant` | 1/1 pod Running, image `qdrant/qdrant:v1.17.0` |
| PVC `qdrant-storage-qdrant-0` | 20Gi Longhorn RWO |
| Service `qdrant` | ClusterIP, ports 6333 (REST), 6334 (gRPC), 6335 (p2p) |
| SealedSecret `qdrant-api-secret` | Encrypted API key, namespace `persistence` |
| ServiceMonitor `qdrant` | Prometheus scraping on `:6333/metrics`, label `release: monitoring` |

**Internal endpoint:** `qdrant.persistence.svc.cluster.local:6333`

Manifests in: `infra/cluster/persistence/qdrant/`

---

### Design decision: multi-tenant collections (Option B)

**Problem addressed**: naming the collections `martin_episodes`, `martin_knowledge`, `martin_preferences` would have tied Pompeo to being a single-person assistant only, making it impossible to extend the system to other family members in the future without a migration.

**Choice adopted**: multi-tenant architecture with 3 shared collections and isolation via the `user_id` field in the payload of each vector point.

```
episodes    ← user_id: "martin" | "shared" | <future users>
knowledge   ← user_id: "martin" | "shared" | <future users>
preferences ← user_id: "martin" | "shared" | <future users>
```

The value `"shared"` is reserved for household/family data visible to all users (e.g. shared calendar, household documents, joint finances). The n8n queries use a filter `should: [user_id=martin, user_id=shared]` to retrieve both the personal and the shared context.

**Advantages**: adding a new user tomorrow requires no infrastructure change, only including the new `user_id` in upserts and queries.
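
The `should` filter mentioned above can be written out as a request body. The sketch below only builds the JSON and sends nothing; it assumes the classic Qdrant search body shape (`vector`, `filter`, `limit`, `with_payload`), and the helper names are hypothetical.

```python
import json


def tenant_filter(user_id, include_shared=True):
    """Match points owned by `user_id` or, optionally, by the "shared" tenant."""
    user_ids = [user_id]
    if include_shared and user_id != "shared":
        user_ids.append("shared")
    return {"should": [{"key": "user_id", "match": {"value": uid}} for uid in user_ids]}


def search_body(vector, user_id, limit=5):
    """Build a search request body restricted to one tenant plus shared data."""
    return {
        "vector": vector,
        "limit": limit,
        "with_payload": True,
        "filter": tenant_filter(user_id),
    }


# 1536 matches the vector size configured for the collections.
body = search_body([0.0] * 1536, "martin")
print(json.dumps(body["filter"]))
```

Adding a new household member then only means passing a different `user_id`; the collections and indexes stay untouched.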

---

### Collections created

All 3 collections are operational (status `green`):

| Collection | Content |
|---|---|
| `episodes` | Episodic facts with timestamps (email, IoT, calendar, conversations) |
| `knowledge` | Documents, Outline notes, newsletters, knowledge base |
| `preferences` | Per-user preferences, habits and behavioral patterns |

**Common payload schema** (5 indexes on each collection):

| Field | Type | Purpose |
|---|---|---|
| `user_id` | keyword | Multi-tenant filter (`"martin"`, `"shared"`) |
| `source` | keyword | Origin of the data (`"email"`, `"calendar"`, `"iot"`, `"paperless"`, …) |
| `category` | keyword | Semantic domain (`"finance"`, `"work"`, `"personal"`, …) |
| `date` | datetime | Timestamp of the fact, filterable by range |
| `action_required` | bool | Flag for the Proactive Arbiter |

**Vector size**: 1536 (compatible with `text-embedding-3-small` via GitHub Copilot, bootstrap phase). To be revisited when migrating to `nomic-embed-text` on Ollama.

---

### Phase 0: status at the time of the Qdrant deploy

- [x] ~~Deploy **Qdrant** on the cluster~~
- [x] ~~Creation of collections with `user_id` multi-tenancy~~
- [x] ~~Payload indexes: `user_id`, `source`, `category`, `date`, `action_required`~~
- [x] ~~Run **PostgreSQL migrations** on Patroni~~ ✅ completed in the same session


## [2026-03-21] Jellyfin Playback Agent: Block A completed

### New n8n workflow

206
aws-lambda/README.md
Normal file
@@ -0,0 +1,206 @@
# Development Plan for the "Pompeo" Lambda

This document describes the analysis, design, implementation and deploy plan for the AWS Lambda function that acts as a bridge between the Alexa skill "Pompeo" and the n8n instance of the ALPHA_PROJECT project.

## 1. Goal

The Lambda function has a single purpose: to act as an ultra-light, fast bridge. Its job is to receive the requests sent by the Alexa skill, forward them securely to the n8n webhook that contains the agent logic, and return to Alexa the text response (TTS) generated by n8n.

---

## 2. Analysis and Requirements

### Functional Requirements

1. **Request Reception:** It must be able to receive and interpret the JSON object sent by the Alexa service.
2. **Forwarding to n8n:** It must forward the body of the Alexa request to a predefined n8n webhook.
3. **Authentication (Optional but Recommended):** The call to n8n should include a `secret token` to ensure that only the Lambda can trigger the workflow.
4. **Receiving the n8n Response:** It must wait for the response from n8n, which will contain the text to speak.
5. **Alexa Response Formatting:** It must build a valid JSON object for Alexa, containing the TTS string.
6. **Error Handling:** It must reply to Alexa with a polite error message if n8n is unreachable or returns an error.

### Non-Functional Requirements

1. **Zero Cost:** The whole infrastructure must operate within the AWS Free Tier.
   * **AWS Lambda:** The free tier includes 1 million invocations/month and 400,000 GB-seconds, which is more than enough.
   * **API Gateway (if used):** The free tier includes 1 million API calls/month.
2. **Private Distribution:** The Alexa skill must not be published to the public store. It must remain in the "In Development" (`development`) state, which automatically makes it available only on the Echo devices associated with the developer's Amazon account.
3. **Low Latency:** The function must be written in a language that performs well for I/O (e.g. Python, Node.js) and keep its logic minimal so as not to introduce delays.
4. **Security:** Communication between Lambda and n8n must take place over HTTPS. The webhook URL and the secret token must be managed through the Lambda's environment variables, not hardcoded in the code.

---

## 3. Design

### Architecture

```
+--------------+    1. User speaks     +-----------------+   3. Forward request   +---------------+
|              | --------------------> |                 | ---------------------> |               |
| Echo Device  |                       |   AWS Lambda    |                        |  n8n Webhook  |
|              | <-------------------- |                 | <--------------------- |               |
+--------------+    5. TTS response    +-----------------+    4. TTS response     +---------------+
                    (Text-to-Speech)
                                                |
                                                | 2. Trigger
                                                |
                                      +---------------------+
                                      |  Alexa Skills Kit   |
                                      +---------------------+
```

### Technical Details

* **Runtime:** **Python 3.11**. A good fit for I/O-bound tasks like this one, with solid integration in the AWS environment.
* **Trigger:** The Lambda trigger will be **"Alexa Skills Kit"**. For security, it will be configured to accept calls only from the specific Skill ID of "Pompeo".
* **IAM Role:** An IAM role will be created with two policies:
  1. **Trust Policy:** Allows the `lambda.amazonaws.com` service to assume this role.
  2. **Permissions Policy:** The AWS managed policy `AWSLambdaBasicExecutionRole` will be used; it grants the permissions needed to write logs to Amazon CloudWatch (`logs:CreateLogGroup`, `logs:CreateLogStream`, `logs:PutLogEvents`). No other permissions are needed.
* **Code Logic (Python):**
  1. The `lambda_handler(event, context)` function will be the entry point.
  2. It will read the webhook URL and the secret token from environment variables (`os.environ.get('N8N_WEBHOOK_URL')`).
  3. It will perform an HTTP POST request (using the `requests` or `urllib3` library) to the n8n URL.
  4. The POST `body` will contain the whole `event` object received from Alexa. The `header` will contain the `secret token`.
  5. It will wait for the n8n response. If the response is 200 OK and contains a JSON with a `tts_response` field, it will proceed.
  6. It will build the response object for Alexa, as per the official documentation.
* **Error Handling:** On timeout, non-200 HTTP status code, or malformed JSON from n8n, the Lambda will build a standard error response for Alexa (e.g. "Mi dispiace, si è verificato un problema. Riprova più tardi.", i.e. "Sorry, something went wrong. Please try again later.").

---

## 4. Implementation (Code)

The Lambda requires a single `index.py` file and a `requirements.txt` file for its dependencies.

**`requirements.txt`:**
```
requests
```

**`index.py` (Boilerplate Example):**
```python
import os
import json
import requests

# Read the configuration from environment variables
N8N_WEBHOOK_URL = os.environ.get('N8N_WEBHOOK_URL')
N8N_SECRET_TOKEN = os.environ.get('N8N_SECRET_TOKEN')


def build_alexa_response(text):
    """Build the JSON response for Alexa."""
    return {
        'version': '1.0',
        'response': {
            'outputSpeech': {
                'type': 'PlainText',
                'text': text
            },
            'shouldEndSession': True
        }
    }


def lambda_handler(event, context):
    """Entry point of the Lambda function."""
    print(f"Request received from Alexa: {json.dumps(event)}")

    if not N8N_WEBHOOK_URL or not N8N_SECRET_TOKEN:
        print("Error: environment variables not configured.")
        # Spoken strings stay in Italian, the language of the skill.
        return build_alexa_response("Errore di configurazione del server.")

    headers = {
        'Content-Type': 'application/json',
        'X-N8N-Webhook-Secret': N8N_SECRET_TOKEN
    }

    try:
        response = requests.post(
            N8N_WEBHOOK_URL,
            headers=headers,
            data=json.dumps(event),
            timeout=8  # Must finish before Alexa's response timeout (assumed here to be 10 seconds)
        )
        response.raise_for_status()  # Raises an exception for 4xx/5xx status codes

        n8n_data = response.json()
        tts_text = n8n_data.get('tts_response', 'Nessuna risposta ricevuta da Pompeo.')

        print(f"Response from n8n: {tts_text}")
        return build_alexa_response(tts_text)

    except requests.exceptions.RequestException as e:
        # Network error, timeout, or HTTP error status from n8n
        print(f"Error calling n8n: {e}")
        return build_alexa_response("Mi dispiace, non riesco a contattare Pompeo in questo momento.")
    except Exception as e:
        # Anything else, e.g. an unexpected response shape
        print(f"Unexpected error: {e}")
        return build_alexa_response("Si è verificato un errore inaspettato.")
```

The code must be zipped together with the locally installed dependencies (`pip install -r requirements.txt -t .`).

---

## 5. Administrative Deploy and Configuration Procedure

This is the step-by-step checklist to get everything running.

### Step 1: Create the IAM Role

1. Go to the AWS console -> **IAM** -> **Roles**.
2. Click **Create role**.
3. **Trusted entity type**: select **AWS service**.
4. **Use case**: select **Lambda**, then click **Next**.
5. On the **Add permissions** page, search for and select the `AWSLambdaBasicExecutionRole` policy. Click **Next**.
6. **Role name**: enter a name, e.g. `PompeoAlexaLambdaRole`.
7. Click **Create role**.

### Step 2: Create the Lambda Function

1. Prepare the deploy package:
   * Create a folder, e.g. `lambda_package`.
   * Save the Python code as `index.py` in that folder.
   * Save `requirements.txt`.
   * From a terminal, inside the folder, run: `pip install -r requirements.txt --target .`
   * Zip the entire contents of the `lambda_package` folder (not the folder itself).
2. Go to the AWS console -> **Lambda**.
3. Click **Create function**.
4. Select **Author from scratch**.
5. **Function name**: `pompeo-alexa-bridge`.
6. **Runtime**: select **Python 3.11**.
7. **Architecture**: `x86_64`.
8. **Permissions**: expand "Change default execution role", select "Use an existing role" and choose the `PompeoAlexaLambdaRole` role created earlier.
9. Click **Create function**.
10. On the function page, go to **Code source** and click **Upload from -> .zip file**. Upload the zip you created. Because the code lives in `index.py`, set the handler under **Runtime settings** to `index.lambda_handler` (the console default is `lambda_function.lambda_handler`).
11. Go to **Configuration -> Environment variables** and add:
    * `N8N_WEBHOOK_URL`: the URL of your n8n webhook.
    * `N8N_SECRET_TOKEN`: the secret token you will configure in n8n.
12. In **Configuration -> General configuration**, raise the function **Timeout** above the 8-second HTTP timeout used in the code (the Lambda default is 3 seconds).
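
The "zip the contents, not the folder" rule from the packaging step can be sketched with the standard library. This is an illustrative helper, not part of the project; `build_lambda_zip` and the temporary demo folder are hypothetical names, and the demo writes stub files instead of the real dependencies.

```python
import tempfile
import zipfile
from pathlib import Path


def build_lambda_zip(package_dir, zip_path):
    """Zip the *contents* of package_dir so that index.py sits at the archive root."""
    package_dir = Path(package_dir)
    zip_path = Path(zip_path)
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file in sorted(package_dir.rglob("*")):
            if file.is_file() and file.resolve() != zip_path.resolve():
                # Paths are stored relative to the folder, so the folder name is dropped.
                arcname = file.relative_to(package_dir).as_posix()
                archive.write(file, arcname)
                names.append(arcname)
    return names


with tempfile.TemporaryDirectory() as tmp:
    package = Path(tmp) / "lambda_package"
    (package / "requests").mkdir(parents=True)
    (package / "index.py").write_text("def lambda_handler(event, context):\n    return {}\n")
    (package / "requests" / "__init__.py").write_text("")
    archive_names = build_lambda_zip(package, Path(tmp) / "pompeo-alexa-bridge.zip")

print(archive_names)
```

If `index.py` ended up under a `lambda_package/` prefix inside the archive, Lambda would not find the handler module, which is why the relative path matters.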
|
||||
|
||||
### Passo 3: Creazione della Skill Alexa "Pompeo"
|
||||
|
||||
1. Vai su [Alexa Developer Console](https://developer.amazon.com/alexa/console/ask).
|
||||
2. Clicca **Create Skill**.
|
||||
3. **Skill name**: `Pompeo`.
|
||||
4. **Choose a model**: `Custom`.
|
||||
5. **Choose a method to host**: `Provision your own`.
|
||||
6. Clicca **Create skill**.
|
||||
7. Una volta nella dashboard della skill, vai nel menu a sinistra su **Endpoint**.
|
||||
8. **Copia il tuo Skill ID**. Sarà una stringa simile a `amzn1.ask.skill.xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx`.
|
||||
|
||||
### Passo 4: Collegamento tra Skill e Lambda
|
||||
|
||||
1. Torna alla pagina della funzione **Lambda** su AWS.
|
||||
2. Nella sezione **Function overview**, clicca **Add trigger**.
|
||||
3. Come sorgente, seleziona **Alexa Skills Kit**.
|
||||
4. Abilita **Skill ID verification** e incolla lo **Skill ID** copiato dal portale Alexa.
|
||||
5. Clicca **Add**.
|
||||
6. Ora, nella pagina della Lambda, in alto a destra, **copia l'ARN (Amazon Resource Name)** della funzione.
|
||||
7. Return to the skill's **Endpoint** page in the Alexa Developer Console.
|
||||
8. Select **AWS Lambda ARN** as the Service Endpoint Type.
|
||||
9. Paste the Lambda ARN into the **Default Region** field.
|
||||
10. Click **Save Endpoints** at the top.
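
Skill ID verification on the trigger already rejects invocations from other skills. A handler can repeat the same check as defense in depth. The sketch below shows one way to do it, relying on the `applicationId` field that Alexa request envelopes carry. The function name `is_expected_skill` is hypothetical, and where the expected ID comes from (for example an extra environment variable) is an assumption not covered by this guide.

```python
def is_expected_skill(event, expected_skill_id):
    """Return True when the Alexa request was issued for the expected skill."""
    # The skill ID appears under context.System.application; session-based
    # requests also carry it under session.application.
    system_app = event.get("context", {}).get("System", {}).get("application", {})
    session_app = event.get("session", {}).get("application", {})
    application_id = system_app.get("applicationId") or session_app.get("applicationId")
    return application_id is not None and application_id == expected_skill_id
```

A handler could return early, without contacting n8n, whenever this check fails.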
|
||||
|
||||
### Step 5: Testing
|
||||
|
||||
1. In the Alexa Developer Console, go to the **Test** tab.
|
||||
2. In the text box, type "apri pompeo" or any other launch phrase.
|
||||
3. Check the Lambda function's logs in **AWS CloudWatch** to see the incoming request and the response.
|
||||
4. The skill is now active in "Development" mode and will work on all Echo devices associated with your Amazon account, while remaining completely private.
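
Before going through the console, the shape of the exchange can also be checked locally. The sketch below is a rough illustration only: it builds a minimal Alexa response envelope and feeds a trimmed-down `LaunchRequest` to a stand-in handler. `build_speech_response`, `fake_handler`, the sample event, and the spoken texts are all hypothetical; the real `lambda_handler` forwards the request to n8n instead of answering by itself.

```python
def build_speech_response(text, end_session=True):
    """Build a minimal Alexa response envelope with plain-text speech."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,
        },
    }


# Trimmed-down LaunchRequest, similar in shape to what "apri pompeo" produces.
# The skill ID is a placeholder.
SAMPLE_LAUNCH_EVENT = {
    "version": "1.0",
    "session": {"new": True, "application": {"applicationId": "amzn1.ask.skill.placeholder"}},
    "request": {"type": "LaunchRequest", "locale": "it-IT"},
}


def fake_handler(event, context=None):
    """Stand-in for lambda_handler that answers locally instead of calling n8n."""
    if event["request"]["type"] == "LaunchRequest":
        return build_speech_response("Pompeo is ready.", end_session=False)
    return build_speech_response("Request received.")
```

A similar event can be saved as a test event in the Lambda console, so the function can be exercised without speaking to a device.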
|
||||
BIN
aws-lambda/pompeo-alexa-bridge.zip
Normal file
Binary file not shown.
BIN
aws-lambda/src/81d243bd2c585b0f4821__mypyc.cp313-win_amd64.pyd
Normal file
Binary file not shown.
BIN
aws-lambda/src/bin/normalizer.exe
Normal file
Binary file not shown.
1
aws-lambda/src/certifi-2026.2.25.dist-info/INSTALLER
Normal file
@@ -0,0 +1 @@
|
||||
pip
|
||||
78
aws-lambda/src/certifi-2026.2.25.dist-info/METADATA
Normal file
@@ -0,0 +1,78 @@
|
||||
Metadata-Version: 2.4
|
||||
Name: certifi
|
||||
Version: 2026.2.25
|
||||
Summary: Python package for providing Mozilla's CA Bundle.
|
||||
Home-page: https://github.com/certifi/python-certifi
|
||||
Author: Kenneth Reitz
|
||||
Author-email: me@kennethreitz.com
|
||||
License: MPL-2.0
|
||||
Project-URL: Source, https://github.com/certifi/python-certifi
|
||||
Classifier: Development Status :: 5 - Production/Stable
|
||||
Classifier: Intended Audience :: Developers
|
||||
Classifier: License :: OSI Approved :: Mozilla Public License 2.0 (MPL 2.0)
|
||||
Classifier: Natural Language :: English
|
||||
Classifier: Programming Language :: Python
|
||||
Classifier: Programming Language :: Python :: 3
|
||||
Classifier: Programming Language :: Python :: 3 :: Only
|
||||
Classifier: Programming Language :: Python :: 3.7
|
||||
Classifier: Programming Language :: Python :: 3.8
|
||||
Classifier: Programming Language :: Python :: 3.9
|
||||
Classifier: Programming Language :: Python :: 3.10
|
||||
Classifier: Programming Language :: Python :: 3.11
|
||||
Classifier: Programming Language :: Python :: 3.12
|
||||
Classifier: Programming Language :: Python :: 3.13
|
||||
Classifier: Programming Language :: Python :: 3.14
|
||||
Requires-Python: >=3.7
|
||||
License-File: LICENSE
|
||||
Dynamic: author
|
||||
Dynamic: author-email
|
||||
Dynamic: classifier
|
||||
Dynamic: description
|
||||
Dynamic: home-page
|
||||
Dynamic: license
|
||||
Dynamic: license-file
|
||||
Dynamic: project-url
|
||||
Dynamic: requires-python
|
||||
Dynamic: summary
|
||||
|
||||
Certifi: Python SSL Certificates
|
||||
================================
|
||||
|
||||
Certifi provides Mozilla's carefully curated collection of Root Certificates for
|
||||
validating the trustworthiness of SSL certificates while verifying the identity
|
||||
of TLS hosts. It has been extracted from the `Requests`_ project.
|
||||
|
||||
Installation
|
||||
------------
|
||||
|
||||
``certifi`` is available on PyPI. Simply install it with ``pip``::
|
||||
|
||||
$ pip install certifi
|
||||
|
||||
Usage
|
||||
-----
|
||||
|
||||
To reference the installed certificate authority (CA) bundle, you can use the
|
||||
built-in function::
|
||||
|
||||
>>> import certifi
|
||||
|
||||
>>> certifi.where()
|
||||
'/usr/local/lib/python3.7/site-packages/certifi/cacert.pem'
|
||||
|
||||
Or from the command line::
|
||||
|
||||
$ python -m certifi
|
||||
/usr/local/lib/python3.7/site-packages/certifi/cacert.pem
|
||||
|
||||
Enjoy!
|
||||
|
||||
.. _`Requests`: https://requests.readthedocs.io/en/master/
|
||||
|
||||
Addition/Removal of Certificates
|
||||
--------------------------------
|
||||
|
||||
Certifi does not support any addition/removal or other modification of the
|
||||
CA trust store content. This project is intended to provide a reliable and
|
||||
highly portable root of trust to python deployments. Look to upstream projects
|
||||
for methods to use alternate trust.
|
||||
14
aws-lambda/src/certifi-2026.2.25.dist-info/RECORD
Normal file
@@ -0,0 +1,14 @@
|
||||
certifi-2026.2.25.dist-info/INSTALLER,sha256=zuuue4knoyJ-UwPPXg8fezS7VCrXJQrAP7zeNuwvFQg,4
|
||||
certifi-2026.2.25.dist-info/METADATA,sha256=4NMuGXdg_hBiRA3paKVXYcDmE3VXEBWxTvCL2xlDyPU,2474
|
||||
certifi-2026.2.25.dist-info/RECORD,,
|
||||
certifi-2026.2.25.dist-info/WHEEL,sha256=YCfwYGOYMi5Jhw2fU4yNgwErybb2IX5PEwBKV4ZbdBo,91
|
||||
certifi-2026.2.25.dist-info/licenses/LICENSE,sha256=6TcW2mucDVpKHfYP5pWzcPBpVgPSH2-D8FPkLPwQyvc,989
|
||||
certifi-2026.2.25.dist-info/top_level.txt,sha256=KMu4vUCfsjLrkPbSNdgdekS-pVJzBAJFO__nI8NF6-U,8
|
||||
certifi/__init__.py,sha256=c9eaYufv1pSLl0Q8QNcMiMLLH4WquDcxdPyKjmI4opY,94
|
||||
certifi/__main__.py,sha256=xBBoj905TUWBLRGANOcf7oi6e-3dMP4cEoG9OyMs11g,243
|
||||
certifi/__pycache__/__init__.cpython-313.pyc,,
|
||||
certifi/__pycache__/__main__.cpython-313.pyc,,
|
||||
certifi/__pycache__/core.cpython-313.pyc,,
|
||||
certifi/cacert.pem,sha256=_JFloSQDJj5-v72te-ej6sD6XTJdPHBGXyjTaQByyig,272441
|
||||
certifi/core.py,sha256=XFXycndG5pf37ayeF8N32HUuDafsyhkVMbO4BAPWHa0,3394
|
||||
certifi/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
|
||||
5
aws-lambda/src/certifi-2026.2.25.dist-info/WHEEL
Normal file
@@ -0,0 +1,5 @@
|
||||
Wheel-Version: 1.0
|
||||
Generator: setuptools (82.0.0)
|
||||
Root-Is-Purelib: true
|
||||
Tag: py3-none-any
|
||||
|
||||
20
aws-lambda/src/certifi-2026.2.25.dist-info/licenses/LICENSE
Normal file
@@ -0,0 +1,20 @@
|
||||
This package contains a modified version of ca-bundle.crt:
|
||||
|
||||
ca-bundle.crt -- Bundle of CA Root Certificates
|
||||
|
||||
This is a bundle of X.509 certificates of public Certificate Authorities
|
||||
(CA). These were automatically extracted from Mozilla's root certificates
|
||||
file (certdata.txt). This file can be found in the mozilla source tree:
|
||||
https://hg.mozilla.org/mozilla-central/file/tip/security/nss/lib/ckfw/builtins/certdata.txt
|
||||
It contains the certificates in PEM format and therefore
|
||||
can be directly used with curl / libcurl / php_curl, or with
|
||||
an Apache+mod_ssl webserver for SSL client authentication.
|
||||
Just configure this file as the SSLCACertificateFile.#
|
||||
|
||||
***** BEGIN LICENSE BLOCK *****
|
||||
This Source Code Form is subject to the terms of the Mozilla Public License,
|
||||
v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain
|
||||
one at http://mozilla.org/MPL/2.0/.
|
||||
|
||||
***** END LICENSE BLOCK *****
|
||||
@(#) $RCSfile: certdata.txt,v $ $Revision: 1.80 $ $Date: 2011/11/03 15:11:58 $
|
||||
1
aws-lambda/src/certifi-2026.2.25.dist-info/top_level.txt
Normal file
@@ -0,0 +1 @@
|
||||
certifi
|
||||
4
aws-lambda/src/certifi/__init__.py
Normal file
@@ -0,0 +1,4 @@
|
||||
from .core import contents, where
|
||||
|
||||
__all__ = ["contents", "where"]
|
||||
__version__ = "2026.02.25"
|
||||
12
aws-lambda/src/certifi/__main__.py
Normal file
@@ -0,0 +1,12 @@
|
||||
import argparse
|
||||
|
||||
from certifi import contents, where
|
||||
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("-c", "--contents", action="store_true")
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.contents:
|
||||
print(contents())
|
||||
else:
|
||||
print(where())
|
||||
BIN
aws-lambda/src/certifi/__pycache__/__init__.cpython-313.pyc
Normal file
Binary file not shown.
BIN
aws-lambda/src/certifi/__pycache__/__main__.cpython-313.pyc
Normal file
Binary file not shown.
BIN
aws-lambda/src/certifi/__pycache__/core.cpython-313.pyc
Normal file
Binary file not shown.
4494
aws-lambda/src/certifi/cacert.pem
Normal file
File diff suppressed because it is too large
83
aws-lambda/src/certifi/core.py
Normal file
@@ -0,0 +1,83 @@
|
||||
"""
|
||||
certifi.py
|
||||
~~~~~~~~~~
|
||||
|
||||
This module returns the installation location of cacert.pem or its contents.
|
||||
"""
|
||||
import sys
|
||||
import atexit
|
||||
|
||||
def exit_cacert_ctx() -> None:
|
||||
_CACERT_CTX.__exit__(None, None, None) # type: ignore[union-attr]
|
||||
|
||||
|
||||
if sys.version_info >= (3, 11):
|
||||
|
||||
from importlib.resources import as_file, files
|
||||
|
||||
_CACERT_CTX = None
|
||||
_CACERT_PATH = None
|
||||
|
||||
def where() -> str:
|
||||
# This is slightly terrible, but we want to delay extracting the file
|
||||
# in cases where we're inside of a zipimport situation until someone
|
||||
# actually calls where(), but we don't want to re-extract the file
|
||||
# on every call of where(), so we'll do it once then store it in a
|
||||
# global variable.
|
||||
global _CACERT_CTX
|
||||
global _CACERT_PATH
|
||||
if _CACERT_PATH is None:
|
||||
# This is slightly janky, the importlib.resources API wants you to
|
||||
# manage the cleanup of this file, so it doesn't actually return a
|
||||
# path, it returns a context manager that will give you the path
|
||||
# when you enter it and will do any cleanup when you leave it. In
|
||||
# the common case of not needing a temporary file, it will just
|
||||
# return the file system location and the __exit__() is a no-op.
|
||||
#
|
||||
# We also have to hold onto the actual context manager, because
|
||||
# it will do the cleanup whenever it gets garbage collected, so
|
||||
# we will also store that at the global level as well.
|
||||
_CACERT_CTX = as_file(files("certifi").joinpath("cacert.pem"))
|
||||
_CACERT_PATH = str(_CACERT_CTX.__enter__())
|
||||
atexit.register(exit_cacert_ctx)
|
||||
|
||||
return _CACERT_PATH
|
||||
|
||||
def contents() -> str:
|
||||
return files("certifi").joinpath("cacert.pem").read_text(encoding="ascii")
|
||||
|
||||
else:
|
||||
|
||||
from importlib.resources import path as get_path, read_text
|
||||
|
||||
_CACERT_CTX = None
|
||||
_CACERT_PATH = None
|
||||
|
||||
def where() -> str:
|
||||
# This is slightly terrible, but we want to delay extracting the
|
||||
# file in cases where we're inside of a zipimport situation until
|
||||
# someone actually calls where(), but we don't want to re-extract
|
||||
# the file on every call of where(), so we'll do it once then store
|
||||
# it in a global variable.
|
||||
global _CACERT_CTX
|
||||
global _CACERT_PATH
|
||||
if _CACERT_PATH is None:
|
||||
# This is slightly janky, the importlib.resources API wants you
|
||||
# to manage the cleanup of this file, so it doesn't actually
|
||||
# return a path, it returns a context manager that will give
|
||||
# you the path when you enter it and will do any cleanup when
|
||||
# you leave it. In the common case of not needing a temporary
|
||||
# file, it will just return the file system location and the
|
||||
# __exit__() is a no-op.
|
||||
#
|
||||
# We also have to hold onto the actual context manager, because
|
||||
# it will do the cleanup whenever it gets garbage collected, so
|
||||
# we will also store that at the global level as well.
|
||||
_CACERT_CTX = get_path("certifi", "cacert.pem")
|
||||
_CACERT_PATH = str(_CACERT_CTX.__enter__())
|
||||
atexit.register(exit_cacert_ctx)
|
||||
|
||||
return _CACERT_PATH
|
||||
|
||||
def contents() -> str:
|
||||
return read_text("certifi", "cacert.pem", encoding="ascii")
|
||||
0
aws-lambda/src/certifi/py.typed
Normal file
@@ -0,0 +1 @@
|
||||
pip
|
||||
798
aws-lambda/src/charset_normalizer-3.4.6.dist-info/METADATA
Normal file
@@ -0,0 +1,798 @@
|
||||
Metadata-Version: 2.4
|
||||
Name: charset-normalizer
|
||||
Version: 3.4.6
|
||||
Summary: The Real First Universal Charset Detector. Open, modern and actively maintained alternative to Chardet.
|
||||
Author-email: "Ahmed R. TAHRI" <tahri.ahmed@proton.me>
|
||||
Maintainer-email: "Ahmed R. TAHRI" <tahri.ahmed@proton.me>
|
||||
License: MIT
|
||||
Project-URL: Changelog, https://github.com/jawah/charset_normalizer/blob/master/CHANGELOG.md
|
||||
Project-URL: Documentation, https://charset-normalizer.readthedocs.io/
|
||||
Project-URL: Code, https://github.com/jawah/charset_normalizer
|
||||
Project-URL: Issue tracker, https://github.com/jawah/charset_normalizer/issues
|
||||
Keywords: encoding,charset,charset-detector,detector,normalization,unicode,chardet,detect
|
||||
Classifier: Development Status :: 5 - Production/Stable
|
||||
Classifier: Intended Audience :: Developers
|
||||
Classifier: Operating System :: OS Independent
|
||||
Classifier: Programming Language :: Python
|
||||
Classifier: Programming Language :: Python :: 3
|
||||
Classifier: Programming Language :: Python :: 3.7
|
||||
Classifier: Programming Language :: Python :: 3.8
|
||||
Classifier: Programming Language :: Python :: 3.9
|
||||
Classifier: Programming Language :: Python :: 3.10
|
||||
Classifier: Programming Language :: Python :: 3.11
|
||||
Classifier: Programming Language :: Python :: 3.12
|
||||
Classifier: Programming Language :: Python :: 3.13
|
||||
Classifier: Programming Language :: Python :: 3.14
|
||||
Classifier: Programming Language :: Python :: 3 :: Only
|
||||
Classifier: Programming Language :: Python :: Implementation :: CPython
|
||||
Classifier: Programming Language :: Python :: Implementation :: PyPy
|
||||
Classifier: Topic :: Text Processing :: Linguistic
|
||||
Classifier: Topic :: Utilities
|
||||
Classifier: Typing :: Typed
|
||||
Requires-Python: >=3.7
|
||||
Description-Content-Type: text/markdown
|
||||
License-File: LICENSE
|
||||
Provides-Extra: unicode-backport
|
||||
Dynamic: license-file
|
||||
|
||||
<h1 align="center">Charset Detection, for Everyone 👋</h1>
|
||||
|
||||
<p align="center">
|
||||
<sup>The Real First Universal Charset Detector</sup><br>
|
||||
<a href="https://pypi.org/project/charset-normalizer">
|
||||
<img src="https://img.shields.io/pypi/pyversions/charset_normalizer.svg?orange=blue" />
|
||||
</a>
|
||||
<a href="https://pepy.tech/project/charset-normalizer/">
|
||||
<img alt="Download Count Total" src="https://static.pepy.tech/badge/charset-normalizer/month" />
|
||||
</a>
|
||||
<a href="https://bestpractices.coreinfrastructure.org/projects/7297">
|
||||
<img src="https://bestpractices.coreinfrastructure.org/projects/7297/badge">
|
||||
</a>
|
||||
</p>
|
||||
<p align="center">
|
||||
<sup><i>Featured Packages</i></sup><br>
|
||||
<a href="https://github.com/jawah/niquests">
|
||||
<img alt="Static Badge" src="https://img.shields.io/badge/Niquests-Most_Advanced_HTTP_Client-cyan">
|
||||
</a>
|
||||
<a href="https://github.com/jawah/wassima">
|
||||
<img alt="Static Badge" src="https://img.shields.io/badge/Wassima-Certifi_Replacement-cyan">
|
||||
</a>
|
||||
</p>
|
||||
<p align="center">
|
||||
<sup><i>In other language (unofficial port - by the community)</i></sup><br>
|
||||
<a href="https://github.com/nickspring/charset-normalizer-rs">
|
||||
<img alt="Static Badge" src="https://img.shields.io/badge/Rust-red">
|
||||
</a>
|
||||
</p>
|
||||
|
||||
> A library that helps you read text from an unknown charset encoding.<br /> Motivated by `chardet`,
|
||||
> I'm trying to resolve the issue by taking a new approach.
|
||||
> All IANA character set names for which the Python core library provides codecs are supported.
|
||||
> You can also register your own set of codecs, and yes, it would work as-is.
|
||||
|
||||
<p align="center">
|
||||
>>>>> <a href="https://charsetnormalizerweb.ousret.now.sh" target="_blank">👉 Try Me Online Now, Then Adopt Me 👈 </a> <<<<<
|
||||
</p>
|
||||
|
||||
This project offers you an alternative to **Universal Charset Encoding Detector**, also known as **Chardet**.
|
||||
|
||||
| Feature | [Chardet](https://github.com/chardet/chardet) | Charset Normalizer | [cChardet](https://github.com/PyYoshi/cChardet) |
|
||||
|--------------------------------------------------|:---------------------------------------------:|:-----------------------------------------------------------------------------------------------:|:-----------------------------------------------:|
|
||||
| `Fast` | ✅ | ✅ | ✅ |
|
||||
| `Universal`[^1] | ❌ | ✅ | ❌ |
|
||||
| `Reliable` **without** distinguishable standards | ✅ | ✅ | ✅ |
|
||||
| `Reliable` **with** distinguishable standards | ✅ | ✅ | ✅ |
|
||||
| `License` | _Disputed_[^2]<br>_restrictive_ | MIT | MPL-1.1<br>_restrictive_ |
|
||||
| `Native Python` | ✅ | ✅ | ❌ |
|
||||
| `Detect spoken language` | ✅ | ✅ | N/A |
|
||||
| `UnicodeDecodeError Safety` | ✅ | ✅ | ❌ |
|
||||
| `Whl Size (min)` | 500 kB | 150 kB | ~200 kB |
|
||||
| `Supported Encoding` | 99 | [99](https://charset-normalizer.readthedocs.io/en/latest/user/support.html#supported-encodings) | 40 |
|
||||
| `Can register custom encoding` | ❌ | ✅ | ❌ |
|
||||
|
||||
<p align="center">
|
||||
<img src="https://i.imgflip.com/373iay.gif" alt="Reading Normalized Text" width="226"/><img src="https://media.tenor.com/images/c0180f70732a18b4965448d33adba3d0/tenor.gif" alt="Cat Reading Text" width="200"/>
|
||||
</p>
|
||||
|
||||
[^1]: They are clearly using specific code for a specific encoding even if covering most of used one.
|
||||
[^2]: Chardet 7.0+ was relicensed from LGPL-2.1 to MIT following an AI-assisted rewrite. This relicensing is disputed on two independent grounds: **(a)** the original author [contests](https://github.com/chardet/chardet/issues/327) that the maintainer had the right to relicense, arguing the rewrite is a derivative work of the LGPL-licensed codebase since it was not a clean room implementation; **(b)** the copyright claim itself is [questionable](https://github.com/chardet/chardet/issues/334) given the code was primarily generated by an LLM, and AI-generated output may not be copyrightable under most jurisdictions. Either issue alone could undermine the MIT license. Beyond licensing, the rewrite raises questions about responsible use of AI in open source: key architectural ideas pioneered by charset-normalizer - notably decode-first validity filtering (our foundational approach since v1) and encoding pairwise similarity with the same algorithm and threshold — surfaced in chardet 7 without acknowledgment. The project also imported test files from charset-normalizer to train and benchmark against it, then claimed superior accuracy on those very files. Charset-normalizer has always been MIT-licensed, encoding-agnostic by design, and built on a verifiable human-authored history.
|
||||
|
||||
## ⚡ Performance
|
||||
|
||||
This package offer better performances (99th, and 95th) against Chardet. Here are some numbers.
|
||||
|
||||
| Package | Accuracy | Mean per file (ms) | File per sec (est) |
|
||||
|---------------------------------------------------|:--------:|:------------------:|:------------------:|
|
||||
| [chardet 7.1](https://github.com/chardet/chardet) | 89 % | 3 ms | 333 file/sec |
|
||||
| charset-normalizer | **97 %** | 3 ms | 333 file/sec |
|
||||
|
||||
| Package | 99th percentile | 95th percentile | 50th percentile |
|
||||
|---------------------------------------------------|:---------------:|:---------------:|:---------------:|
|
||||
| [chardet 7.1](https://github.com/chardet/chardet) | 32 ms | 17 ms | < 1 ms |
|
||||
| charset-normalizer | 16 ms | 10 ms | 1 ms |
|
||||
|
||||
_updated as of March 2026 using CPython 3.12, Charset-Normalizer 3.4.6, and Chardet 7.1.0_
|
||||
|
||||
~Chardet's performance on larger file (1MB+) are very poor. Expect huge difference on large payload.~ No longer the case since Chardet 7.0+
|
||||
|
||||
> Stats are generated using 400+ files using default parameters. More details on used files, see GHA workflows.
|
||||
> And yes, these results might change at any time. The dataset can be updated to include more files.
|
||||
> The actual delays heavily depends on your CPU capabilities. The factors should remain the same.
|
||||
> Chardet claims on his documentation to have a greater accuracy than us based on the dataset they trained Chardet on(...)
|
||||
> Well, it's normal, the opposite would have been worrying. Whereas charset-normalizer don't train on anything, our solution
|
||||
> is based on a completely different algorithm, still heuristic through, it does not need weights across every encoding tables.
|
||||
|
||||
## ✨ Installation
|
||||
|
||||
Using pip:
|
||||
|
||||
```sh
|
||||
pip install charset-normalizer -U
|
||||
```
|
||||
|
||||
## 🚀 Basic Usage
|
||||
|
||||
### CLI
|
||||
This package comes with a CLI.
|
||||
|
||||
```
|
||||
usage: normalizer [-h] [-v] [-a] [-n] [-m] [-r] [-f] [-t THRESHOLD]
|
||||
file [file ...]
|
||||
|
||||
The Real First Universal Charset Detector. Discover originating encoding used
|
||||
on text file. Normalize text to unicode.
|
||||
|
||||
positional arguments:
|
||||
files File(s) to be analysed
|
||||
|
||||
optional arguments:
|
||||
-h, --help show this help message and exit
|
||||
-v, --verbose Display complementary information about file if any.
|
||||
Stdout will contain logs about the detection process.
|
||||
-a, --with-alternative
|
||||
Output complementary possibilities if any. Top-level
|
||||
JSON WILL be a list.
|
||||
-n, --normalize Permit to normalize input file. If not set, program
|
||||
does not write anything.
|
||||
-m, --minimal Only output the charset detected to STDOUT. Disabling
|
||||
JSON output.
|
||||
-r, --replace Replace file when trying to normalize it instead of
|
||||
creating a new one.
|
||||
-f, --force Replace file without asking if you are sure, use this
|
||||
flag with caution.
|
||||
-t THRESHOLD, --threshold THRESHOLD
|
||||
Define a custom maximum amount of chaos allowed in
|
||||
decoded content. 0. <= chaos <= 1.
|
||||
--version Show version information and exit.
|
||||
```
|
||||
|
||||
```bash
|
||||
normalizer ./data/sample.1.fr.srt
|
||||
```
|
||||
|
||||
or
|
||||
|
||||
```bash
|
||||
python -m charset_normalizer ./data/sample.1.fr.srt
|
||||
```
|
||||
|
||||
🎉 Since version 1.4.0 the CLI produce easily usable stdout result in JSON format.
|
||||
|
||||
```json
|
||||
{
|
||||
"path": "/home/default/projects/charset_normalizer/data/sample.1.fr.srt",
|
||||
"encoding": "cp1252",
|
||||
"encoding_aliases": [
|
||||
"1252",
|
||||
"windows_1252"
|
||||
],
|
||||
"alternative_encodings": [
|
||||
"cp1254",
|
||||
"cp1256",
|
||||
"cp1258",
|
||||
"iso8859_14",
|
||||
"iso8859_15",
|
||||
"iso8859_16",
|
||||
"iso8859_3",
|
||||
"iso8859_9",
|
||||
"latin_1",
|
||||
"mbcs"
|
||||
],
|
||||
"language": "French",
|
||||
"alphabets": [
|
||||
"Basic Latin",
|
||||
"Latin-1 Supplement"
|
||||
],
|
||||
"has_sig_or_bom": false,
|
||||
"chaos": 0.149,
|
||||
"coherence": 97.152,
|
||||
"unicode_path": null,
|
||||
"is_preferred": true
|
||||
}
|
||||
```
|
||||
|
||||
### Python
|
||||
*Just print out normalized text*
|
||||
```python
|
||||
from charset_normalizer import from_path
|
||||
|
||||
results = from_path('./my_subtitle.srt')
|
||||
|
||||
print(str(results.best()))
|
||||
```
|
||||
|
||||
*Upgrade your code without effort*
|
||||
```python
|
||||
from charset_normalizer import detect
|
||||
```
|
||||
|
||||
The above code will behave the same as **chardet**. We ensure that we offer the best (reasonable) BC result possible.
|
||||
|
||||
See the docs for advanced usage : [readthedocs.io](https://charset-normalizer.readthedocs.io/en/latest/)
|
||||
|
||||
## 😇 Why
|
||||
|
||||
When I started using Chardet, I noticed that it was not suited to my expectations, and I wanted to propose a
|
||||
reliable alternative using a completely different method. Also! I never back down on a good challenge!
|
||||
|
||||
I **don't care** about the **originating charset** encoding, because **two different tables** can
|
||||
produce **two identical rendered string.**
|
||||
What I want is to get readable text, the best I can.
|
||||
|
||||
In a way, **I'm brute forcing text decoding.** How cool is that ? 😎
|
||||
|
||||
Don't confuse package **ftfy** with charset-normalizer or chardet. ftfy goal is to repair Unicode string whereas charset-normalizer to convert raw file in unknown encoding to unicode.
|
||||
|
||||
## 🍰 How
|
||||
|
||||
- Discard all charset encoding table that could not fit the binary content.
|
||||
- Measure noise, or the mess once opened (by chunks) with a corresponding charset encoding.
|
||||
- Extract matches with the lowest mess detected.
|
||||
- Additionally, we measure coherence / probe for a language.
|
||||
|
||||
**Wait a minute**, what is noise/mess and coherence according to **YOU ?**
|
||||
|
||||
*Noise :* I opened hundred of text files, **written by humans**, with the wrong encoding table. **I observed**, then
|
||||
**I established** some ground rules about **what is obvious** when **it seems like** a mess (aka. defining noise in rendered text).
|
||||
I know that my interpretation of what is noise is probably incomplete, feel free to contribute in order to
|
||||
improve or rewrite it.
|
||||
|
||||
*Coherence :* For each language there is on earth, we have computed ranked letter appearance occurrences (the best we can). So I thought
|
||||
that intel is worth something here. So I use those records against decoded text to check if I can detect intelligent design.
|
||||
|
||||
## ⚡ Known limitations
|
||||
|
||||
- Language detection is unreliable when text contains two or more languages sharing identical letters. (eg. HTML (english tags) + Turkish content (Sharing Latin characters))
|
||||
- Every charset detector heavily depends on sufficient content. In common cases, do not bother run detection on very tiny content.
|
||||
|
||||
## ⚠️ About Python EOLs
|
||||
|
||||
**If you are running:**
|
||||
|
||||
- Python >=2.7,<3.5: Unsupported
|
||||
- Python 3.5: charset-normalizer < 2.1
|
||||
- Python 3.6: charset-normalizer < 3.1
|
||||
|
||||
Upgrade your Python interpreter as soon as possible.
|
||||
|
||||
## 👤 Contributing
|
||||
|
||||
Contributions, issues and feature requests are very much welcome.<br />
|
||||
Feel free to check [issues page](https://github.com/ousret/charset_normalizer/issues) if you want to contribute.
|
||||
|
||||
## 📝 License
|
||||
|
||||
Copyright © [Ahmed TAHRI @Ousret](https://github.com/Ousret).<br />
|
||||
This project is [MIT](https://github.com/Ousret/charset_normalizer/blob/master/LICENSE) licensed.
|
||||
|
||||
Characters frequencies used in this project © 2012 [Denny Vrandečić](http://simia.net/letters/)
|
||||
|
||||
## 💼 For Enterprise
|
||||
|
||||
Professional support for charset-normalizer is available as part of the [Tidelift
|
||||
Subscription][1]. Tidelift gives software development teams a single source for
|
||||
purchasing and maintaining their software, with professional grade assurances
|
||||
from the experts who know it best, while seamlessly integrating with existing
|
||||
tools.
|
||||
|
||||
[1]: https://tidelift.com/subscription/pkg/pypi-charset-normalizer?utm_source=pypi-charset-normalizer&utm_medium=readme
|
||||
|
||||
[](https://www.bestpractices.dev/projects/7297)
|
||||
|
||||
# Changelog
|
||||
All notable changes to charset-normalizer will be documented in this file. This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
||||
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
|
||||
|
||||
## [3.4.6](https://github.com/Ousret/charset_normalizer/compare/3.4.5...3.4.6) (2026-03-15)
|
||||
|
||||
### Changed
|
||||
- Flattened the logic in `charset_normalizer.md` for higher performance. Removed `eligible(..)` and `feed(...)`
|
||||
in favor of `feed_info(...)`.
|
||||
- Raised upper bound for mypy[c] to 1.20, for our optimized version.
|
||||
- Updated `UNICODE_RANGES_COMBINED` using Unicode blocks v17.
|
||||
|
||||
### Fixed
|
||||
- Edge case where noise difference between two candidates can be almost insignificant. (#672)
|
||||
- CLI `--normalize` writing to wrong path when passing multiple files in. (#702)
|
||||
|
||||
### Misc
|
||||
- Freethreaded pre-built wheels now shipped in PyPI starting with 3.14t. (#616)
|
||||
|
||||
## [3.4.5](https://github.com/Ousret/charset_normalizer/compare/3.4.4...3.4.5) (2026-03-06)
|
||||
|
||||
### Changed
|
||||
- Update `setuptools` constraint to `setuptools>=68,<=82`.
|
||||
- Raised upper bound of mypyc for the optional pre-built extension to v1.19.1
|
||||
|
||||
### Fixed
|
||||
- Add explicit link to lib math in our optimized build. (#692)
|
||||
- Logger level not restored correctly for empty byte sequences. (#701)
|
||||
- TypeError when passing bytearray to from_bytes. (#703)
|
||||
|
||||
### Misc
|
||||
- Applied safe micro-optimizations in both our noise detector and language detector.
|
||||
- Rewrote the `query_yes_no` function (inside CLI) to avoid using ambiguous licensed code.
|
||||
- Added `cd.py` submodule into mypyc optional compilation to reduce further the performance impact.
|
||||
|
||||
## [3.4.4](https://github.com/Ousret/charset_normalizer/compare/3.4.2...3.4.4) (2025-10-13)
|
||||
|
||||
### Changed
|
||||
- Bound `setuptools` to a specific constraint `setuptools>=68,<=81`.
|
||||
- Raised upper bound of mypyc for the optional pre-built extension to v1.18.2
|
||||
|
||||
### Removed
|
||||
- `setuptools-scm` as a build dependency.
|
||||
|
||||
### Misc
|
||||
- Enforced hashes in `dev-requirements.txt` and created `ci-requirements.txt` for security purposes.
|
||||
- Additional pre-built wheels for riscv64, s390x, and armv7l architectures.
|
||||
- Restore ` multiple.intoto.jsonl` in GitHub releases in addition to individual attestation file per wheel.
|
||||
|
||||
## [3.4.3](https://github.com/Ousret/charset_normalizer/compare/3.4.2...3.4.3) (2025-08-09)
|
||||
|
||||
### Changed
|
||||
- mypy(c) is no longer a required dependency at build time if `CHARSET_NORMALIZER_USE_MYPYC` isn't set to `1`. (#595) (#583)
|
||||
- automatically lower confidence on small bytes samples that are not Unicode in `detect` output legacy function. (#391)
|
||||
|
||||
### Added
|
||||
- Custom build backend to overcome inability to mark mypy as an optional dependency in the build phase.
|
||||
- Support for Python 3.14
|
||||
|
||||
### Fixed
|
||||
- sdist archive contained useless directories.
|
||||
- automatically fallback on valid UTF-16 or UTF-32 even if the md says it's noisy. (#633)
|
||||
|
||||
### Misc
|
||||
- SBOM are automatically published to the relevant GitHub release to comply with regulatory changes.
|
||||
Each published wheel comes with its SBOM. We choose CycloneDX as the format.
|
||||
- Prebuilt optimized wheel are no longer distributed by default for CPython 3.7 due to a change in cibuildwheel.
|
||||
|
||||
## [3.4.2](https://github.com/Ousret/charset_normalizer/compare/3.4.1...3.4.2) (2025-05-02)
|
||||
|
||||
### Fixed
|
||||
- Addressed the DeprecationWarning in our CLI regarding `argparse.FileType` by backporting the target class into the package. (#591)
|
||||
- Improved the overall reliability of the detector with CJK Ideographs. (#605) (#587)
|
||||
|
||||
### Changed
|
||||
- Optional mypyc compilation upgraded to version 1.15 for Python >= 3.8
|
||||
|
||||
## [3.4.1](https://github.com/Ousret/charset_normalizer/compare/3.4.0...3.4.1) (2024-12-24)
|
||||
|
||||
### Changed
|
||||
- Project metadata are now stored using `pyproject.toml` instead of `setup.cfg` using setuptools as the build backend.
|
||||
- Enforce delayed annotation loading for simpler and more consistent types in the project.
|
||||
- Optional mypyc compilation upgraded to version 1.14 for Python >= 3.8
|
||||
|
||||
### Added
|
||||
- pre-commit configuration.
|
||||
- noxfile.
|
||||
|
||||
### Removed
|
||||
- `build-requirements.txt` as per using `pyproject.toml` native build configuration.
|
||||
- `bin/integration.py` and `bin/serve.py` in favor of downstream integration test (see noxfile).
|
||||
- `setup.cfg` in favor of `pyproject.toml` metadata configuration.
|
||||
- Unused `utils.range_scan` function.
|
||||
|
||||
### Fixed
|
||||
- Converting content to Unicode bytes may insert `utf_8` instead of preferred `utf-8`. (#572)
|
||||
- Deprecation warning "'count' is passed as positional argument" when converting to Unicode bytes on Python 3.13+
|
||||
|
||||
## [3.4.0](https://github.com/Ousret/charset_normalizer/compare/3.3.2...3.4.0) (2024-10-08)
|
||||
|
||||
### Added
|
||||
- Argument `--no-preemptive` in the CLI to prevent the detector from searching for hints.
|
||||
- Support for Python 3.13 (#512)
|
||||
|
||||
### Fixed
|
||||
- Relax the TypeError exception thrown when trying to compare a CharsetMatch with anything other than a CharsetMatch.
|
||||
- Improved the general reliability of the detector based on user feedback. (#520) (#509) (#498) (#407) (#537)
|
||||
- Declared charset in content (preemptive detection) not changed when converting to utf-8 bytes. (#381)
|
||||
|
||||
## [3.3.2](https://github.com/Ousret/charset_normalizer/compare/3.3.1...3.3.2) (2023-10-31)
|
||||
|
||||
### Fixed
|
||||
- Unintentional memory usage regression when using a large payload that matches several encodings (#376)
|
||||
- Regression on some detection cases showcased in the documentation (#371)
|
||||
|
||||
### Added
|
||||
- Noise (md) probe that identifies malformed Arabic representation due to the presence of letters in isolated form (credit to my wife)
|
||||
|
||||
## [3.3.1](https://github.com/Ousret/charset_normalizer/compare/3.3.0...3.3.1) (2023-10-22)
|
||||
|
||||
### Changed
|
||||
- Optional mypyc compilation upgraded to version 1.6.1 for Python >= 3.8
|
||||
- Improved the general detection reliability based on reports from the community
|
||||
|
||||
## [3.3.0](https://github.com/Ousret/charset_normalizer/compare/3.2.0...3.3.0) (2023-09-30)
|
||||
|
||||
### Added
|
||||
- Allow to execute the CLI (e.g. normalizer) through `python -m charset_normalizer.cli` or `python -m charset_normalizer`
|
||||
- Support for 9 forgotten encodings that are supported by Python but unlisted in `encoding.aliases` as they have no alias (#323)
|
||||
|
||||
### Removed
|
||||
- (internal) Redundant utils.is_ascii function and unused function is_private_use_only
|
||||
- (internal) charset_normalizer.assets is moved inside charset_normalizer.constant
|
||||
|
||||
### Changed
|
||||
- (internal) Unicode code blocks in constants are updated using the latest v15.0.0 definition to improve detection
|
||||
- Optional mypyc compilation upgraded to version 1.5.1 for Python >= 3.8
|
||||
|
||||
### Fixed
|
||||
- Unable to properly sort CharsetMatch when both chaos/noise and coherence were close due to an unreachable condition in \_\_lt\_\_ (#350)
|
||||
|
||||
## [3.2.0](https://github.com/Ousret/charset_normalizer/compare/3.1.0...3.2.0) (2023-06-07)
|
||||
|
||||
### Changed
|
||||
- Typehint for function `from_path` no longer enforces `PathLike` as its first argument
|
||||
- Minor improvement over the global detection reliability
|
||||
|
||||
### Added
|
||||
- Introduce function `is_binary` that relies on main capabilities, and optimized to detect binaries
|
||||
- Propagate the `enable_fallback` argument throughout `from_bytes`, `from_path`, and `from_fp`, allowing deeper control over the detection (default True)
|
||||
- Explicit support for Python 3.12
|
||||
|
||||
### Fixed
|
||||
- Edge case detection failure where a file would contain a 'very-long' camel-cased word (Issue #289)
|
||||
|
||||
## [3.1.0](https://github.com/Ousret/charset_normalizer/compare/3.0.1...3.1.0) (2023-03-06)
|
||||
|
||||
### Added
|
||||
- Argument `should_rename_legacy` for legacy function `detect` and disregard any new arguments without errors (PR #262)
|
||||
|
||||
### Removed
|
||||
- Support for Python 3.6 (PR #260)
|
||||
|
||||
### Changed
|
||||
- Optional speedup provided by mypy/c 1.0.1
|
||||
|
||||
## [3.0.1](https://github.com/Ousret/charset_normalizer/compare/3.0.0...3.0.1) (2022-11-18)
|
||||
|
||||
### Fixed
|
||||
- Multi-bytes cutter/chunk generator did not always cut correctly (PR #233)
|
||||
|
||||
### Changed
|
||||
- Speedup provided by mypy/c 0.990 on Python >= 3.7
|
||||
|
||||
## [3.0.0](https://github.com/Ousret/charset_normalizer/compare/2.1.1...3.0.0) (2022-10-20)
|
||||
|
||||
### Added
|
||||
- Extend the capability of explain=True: when cp_isolation contains at most two entries (min one), the Mess-detector results are logged in detail
|
||||
- Support for alternative language frequency set in charset_normalizer.assets.FREQUENCIES
|
||||
- Add parameter `language_threshold` in `from_bytes`, `from_path` and `from_fp` to adjust the minimum expected coherence ratio
|
||||
- `normalizer --version` now specifies whether the current version provides the extra speedup (meaning a mypyc-compiled wheel)
|
||||
|
||||
### Changed
|
||||
- Build with static metadata using 'build' frontend
|
||||
- Make the language detection stricter
|
||||
- Optional: Module `md.py` can be compiled using Mypyc to provide an extra speedup up to 4x faster than v2.1
|
||||
|
||||
### Fixed
|
||||
- CLI with option --normalize fails when using a full path for files
|
||||
- TooManyAccentuatedPlugin induces a false positive on the mess detection when too few alpha characters have been fed to it
|
||||
- Sphinx warnings when generating the documentation
|
||||
|
||||
### Removed
|
||||
- Coherence detector no longer returns 'Simple English'; it returns 'English' instead
|
||||
- Coherence detector no longer returns 'Classical Chinese'; it returns 'Chinese' instead
|
||||
- Breaking: Method `first()` and `best()` from CharsetMatch
|
||||
- UTF-7 will no longer appear as "detected" without a recognized SIG/mark (is unreliable/conflict with ASCII)
|
||||
- Breaking: Class aliases CharsetDetector, CharsetDoctor, CharsetNormalizerMatch and CharsetNormalizerMatches
|
||||
- Breaking: Top-level function `normalize`
|
||||
- Breaking: Properties `chaos_secondary_pass`, `coherence_non_latin` and `w_counter` from CharsetMatch
|
||||
- Support for the backport `unicodedata2`
|
||||
|
||||
## [3.0.0rc1](https://github.com/Ousret/charset_normalizer/compare/3.0.0b2...3.0.0rc1) (2022-10-18)
|
||||
|
||||
### Added
|
||||
- Extend the capability of explain=True: when cp_isolation contains at most two entries (min one), the Mess-detector results are logged in detail
|
||||
- Support for alternative language frequency set in charset_normalizer.assets.FREQUENCIES
|
||||
- Add parameter `language_threshold` in `from_bytes`, `from_path` and `from_fp` to adjust the minimum expected coherence ratio
|
||||
|
||||
### Changed
|
||||
- Build with static metadata using 'build' frontend
|
||||
- Make the language detection stricter
|
||||
|
||||
### Fixed
|
||||
- CLI with option --normalize fails when using a full path for files
|
||||
- TooManyAccentuatedPlugin induces a false positive on the mess detection when too few alpha characters have been fed to it
|
||||
|
||||
### Removed
|
||||
- Coherence detector no longer returns 'Simple English'; it returns 'English' instead
|
||||
- Coherence detector no longer returns 'Classical Chinese'; it returns 'Chinese' instead
|
||||
|
||||
## [3.0.0b2](https://github.com/Ousret/charset_normalizer/compare/3.0.0b1...3.0.0b2) (2022-08-21)
|
||||
|
||||
### Added
|
||||
- `normalizer --version` now specifies whether the current version provides the extra speedup (meaning a mypyc-compiled wheel)
|
||||
|
||||
### Removed
|
||||
- Breaking: Method `first()` and `best()` from CharsetMatch
|
||||
- UTF-7 will no longer appear as "detected" without a recognized SIG/mark (is unreliable/conflict with ASCII)
|
||||
|
||||
### Fixed
|
||||
- Sphinx warnings when generating the documentation
|
||||
|
||||
## [3.0.0b1](https://github.com/Ousret/charset_normalizer/compare/2.1.0...3.0.0b1) (2022-08-15)
|
||||
|
||||
### Changed
|
||||
- Optional: Module `md.py` can be compiled using Mypyc to provide an extra speedup up to 4x faster than v2.1
|
||||
|
||||
### Removed
|
||||
- Breaking: Class aliases CharsetDetector, CharsetDoctor, CharsetNormalizerMatch and CharsetNormalizerMatches
|
||||
- Breaking: Top-level function `normalize`
|
||||
- Breaking: Properties `chaos_secondary_pass`, `coherence_non_latin` and `w_counter` from CharsetMatch
|
||||
- Support for the backport `unicodedata2`
|
||||
|
||||
## [2.1.1](https://github.com/Ousret/charset_normalizer/compare/2.1.0...2.1.1) (2022-08-19)
|
||||
|
||||
### Deprecated
|
||||
- Function `normalize` scheduled for removal in 3.0
|
||||
|
||||
### Changed
|
||||
- Removed useless call to decode in fn is_unprintable (#206)
|
||||
|
||||
### Fixed
|
||||
- Third-party library (i18n xgettext) crashing not recognizing utf_8 (PEP 263) with underscore from [@aleksandernovikov](https://github.com/aleksandernovikov) (#204)
|
||||
|
||||
## [2.1.0](https://github.com/Ousret/charset_normalizer/compare/2.0.12...2.1.0) (2022-06-19)
|
||||
|
||||
### Added
|
||||
- Output the Unicode table version when running the CLI with `--version` (PR #194)
|
||||
|
||||
### Changed
|
||||
- Re-use decoded buffer for single byte character sets from [@nijel](https://github.com/nijel) (PR #175)
|
||||
- Fixing some performance bottlenecks from [@deedy5](https://github.com/deedy5) (PR #183)
|
||||
|
||||
### Fixed
|
||||
- Workaround potential bug in cpython with Zero Width No-Break Space located in Arabic Presentation Forms-B, Unicode 1.1 not acknowledged as space (PR #175)
|
||||
- CLI default threshold aligned with the API threshold from [@oleksandr-kuzmenko](https://github.com/oleksandr-kuzmenko) (PR #181)
|
||||
|
||||
### Removed
|
||||
- Support for Python 3.5 (PR #192)
|
||||
|
||||
### Deprecated
|
||||
- Use of backport unicodedata from `unicodedata2` as Python is quickly catching up, scheduled for removal in 3.0 (PR #194)
|
||||
|
||||
## [2.0.12](https://github.com/Ousret/charset_normalizer/compare/2.0.11...2.0.12) (2022-02-12)
|
||||
|
||||
### Fixed
|
||||
- ASCII misdetection in rare cases (PR #170)
|
||||
|
||||
## [2.0.11](https://github.com/Ousret/charset_normalizer/compare/2.0.10...2.0.11) (2022-01-30)
|
||||
|
||||
### Added
|
||||
- Explicit support for Python 3.11 (PR #164)
|
||||
|
||||
### Changed
|
||||
- The logging behavior has been completely reviewed, now using only TRACE and DEBUG levels (PR #163 #165)
|
||||
|
||||
## [2.0.10](https://github.com/Ousret/charset_normalizer/compare/2.0.9...2.0.10) (2022-01-04)
|
||||
|
||||
### Fixed
|
||||
- Fallback match entries might lead to UnicodeDecodeError for large bytes sequence (PR #154)
|
||||
|
||||
### Changed
|
||||
- Skipping the language-detection (CD) on ASCII (PR #155)
|
||||
|
||||
## [2.0.9](https://github.com/Ousret/charset_normalizer/compare/2.0.8...2.0.9) (2021-12-03)
|
||||
|
||||
### Changed
|
||||
- Moderating the logging impact (since 2.0.8) for specific environments (PR #147)
|
||||
|
||||
### Fixed
|
||||
- Wrong logging level applied when setting kwarg `explain` to True (PR #146)
|
||||
|
||||
## [2.0.8](https://github.com/Ousret/charset_normalizer/compare/2.0.7...2.0.8) (2021-11-24)
|
||||
### Changed
|
||||
- Improvement over Vietnamese detection (PR #126)
|
||||
- MD improvement on trailing data and long foreign (non-pure latin) data (PR #124)
|
||||
- Efficiency improvements in cd/alphabet_languages from [@adbar](https://github.com/adbar) (PR #122)
|
||||
- call sum() without an intermediary list following PEP 289 recommendations from [@adbar](https://github.com/adbar) (PR #129)
|
||||
- Code style as refactored by Sourcery-AI (PR #131)
|
||||
- Minor adjustment on the MD around european words (PR #133)
|
||||
- Remove and replace SRTs from assets / tests (PR #139)
|
||||
- Initialize the library logger with a `NullHandler` by default from [@nmaynes](https://github.com/nmaynes) (PR #135)
|
||||
- Setting kwarg `explain` to True will add provisionally (bounded to function lifespan) a specific stream handler (PR #135)
|
||||
|
||||
### Fixed
|
||||
- Fix large (misleading) sequence giving UnicodeDecodeError (PR #137)
|
||||
- Avoid using too insignificant chunk (PR #137)
|
||||
|
||||
### Added
|
||||
- Add and expose function `set_logging_handler` to configure a specific StreamHandler from [@nmaynes](https://github.com/nmaynes) (PR #135)
|
||||
- Add `CHANGELOG.md` entries, format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) (PR #141)
|
||||
|
||||
## [2.0.7](https://github.com/Ousret/charset_normalizer/compare/2.0.6...2.0.7) (2021-10-11)
|
||||
### Added
|
||||
- Add support for Kazakh (Cyrillic) language detection (PR #109)
|
||||
|
||||
### Changed
|
||||
- Further, improve inferring the language from a given single-byte code page (PR #112)
|
||||
- Vainly trying to leverage PEP263 when PEP3120 is not supported (PR #116)
|
||||
- Refactoring for potential performance improvements in loops from [@adbar](https://github.com/adbar) (PR #113)
|
||||
- Various detection improvement (MD+CD) (PR #117)
|
||||
|
||||
### Removed
|
||||
- Remove redundant logging entry about detected language(s) (PR #115)
|
||||
|
||||
### Fixed
|
||||
- Fix a minor inconsistency between Python 3.5 and other versions regarding language detection (PR #117 #102)
|
||||
|
||||
## [2.0.6](https://github.com/Ousret/charset_normalizer/compare/2.0.5...2.0.6) (2021-09-18)
|
||||
### Fixed
|
||||
- Unforeseen regression with the loss of backward compatibility with some older minor versions of Python 3.5.x (PR #100)
|
||||
- Fix CLI crash when using --minimal output in certain cases (PR #103)
|
||||
|
||||
### Changed
|
||||
- Minor improvement to the detection efficiency (less than 1%) (PR #106 #101)
|
||||
|
||||
## [2.0.5](https://github.com/Ousret/charset_normalizer/compare/2.0.4...2.0.5) (2021-09-14)
|
||||
### Changed
|
||||
- The project now complies with flake8, mypy, isort and black to ensure a better overall quality (PR #81)
|
||||
- The BC-support with v1.x was improved, the old staticmethods are restored (PR #82)
|
||||
- The Unicode detection is slightly improved (PR #93)
|
||||
- Add syntax sugar \_\_bool\_\_ for results CharsetMatches list-container (PR #91)
|
||||
|
||||
### Removed
|
||||
- The project no longer raises a warning on tiny content given for detection; it is simply logged as a warning instead (PR #92)
|
||||
|
||||
### Fixed
|
||||
- In some rare case, the chunks extractor could cut in the middle of a multi-byte character and could mislead the mess detection (PR #95)
|
||||
- Some rare 'space' characters could trip up the UnprintablePlugin/Mess detection (PR #96)
|
||||
- The MANIFEST.in was not exhaustive (PR #78)
|
||||
|
||||
## [2.0.4](https://github.com/Ousret/charset_normalizer/compare/2.0.3...2.0.4) (2021-07-30)
|
||||
### Fixed
|
||||
- The CLI no longer raises an unexpected exception when no encoding has been found (PR #70)
|
||||
- Fix accessing the 'alphabets' property when the payload contains surrogate characters (PR #68)
|
||||
- The logger could mislead (explain=True) on detected languages and the impact of one MBCS match (PR #72)
|
||||
- Submatch factoring could be wrong in rare edge cases (PR #72)
|
||||
- Multiple files given to the CLI were ignored when publishing results to STDOUT. (After the first path) (PR #72)
|
||||
- Fix line endings from CRLF to LF for certain project files (PR #67)
|
||||
|
||||
### Changed
|
||||
- Adjust the MD to lower the sensitivity, thus improving the global detection reliability (PR #69 #76)
|
||||
- Allow fallback on specified encoding if any (PR #71)
|
||||
|
||||
## [2.0.3](https://github.com/Ousret/charset_normalizer/compare/2.0.2...2.0.3) (2021-07-16)
|
||||
### Changed
|
||||
- Part of the detection mechanism has been improved to be less sensitive, resulting in more accurate detection results. Especially ASCII. (PR #63)
|
||||
- According to the community wishes, the detection will fall back on ASCII or UTF-8 in a last-resort case. (PR #64)
|
||||
|
||||
## [2.0.2](https://github.com/Ousret/charset_normalizer/compare/2.0.1...2.0.2) (2021-07-15)
|
||||
### Fixed
|
||||
- Empty/too small JSON payload misdetection fixed. Report from [@tseaver](https://github.com/tseaver) (PR #59)
|
||||
|
||||
### Changed
|
||||
- Don't inject unicodedata2 into sys.modules from [@akx](https://github.com/akx) (PR #57)
|
||||
|
||||
## [2.0.1](https://github.com/Ousret/charset_normalizer/compare/2.0.0...2.0.1) (2021-07-13)
|
||||
### Fixed
|
||||
- Make it work where there isn't a filesystem available, dropping assets frequencies.json. Report from [@sethmlarson](https://github.com/sethmlarson). (PR #55)
|
||||
- Using explain=False permanently disabled the verbose output in the current runtime (PR #47)
|
||||
- One log entry (language target preemptive) was not shown in logs when using explain=True (PR #47)
|
||||
- Fix undesired exception (ValueError) on getitem of instance CharsetMatches (PR #52)
|
||||
|
||||
### Changed
|
||||
- Public function normalize default args values were not aligned with from_bytes (PR #53)
|
||||
|
||||
### Added
|
||||
- You may now use charset aliases in cp_isolation and cp_exclusion arguments (PR #47)
|
||||
|
||||
## [2.0.0](https://github.com/Ousret/charset_normalizer/compare/1.4.1...2.0.0) (2021-07-02)
|
||||
### Changed
|
||||
- 4 to 5 times faster than the previous 1.4.0 release. At least 2x faster than Chardet.
|
||||
- Emphasis has been placed on UTF-8 detection, which should now be nearly instantaneous.
|
||||
- The backward compatibility with Chardet has been greatly improved. The legacy detect function returns an identical charset name whenever possible.
|
||||
- The detection mechanism has been slightly improved, now Turkish content is detected correctly (most of the time)
|
||||
- The program has been rewritten to ease readability and maintainability. (Using static typing)
|
||||
- utf_7 detection has been reinstated.
|
||||
|
||||
### Removed
|
||||
- This package no longer requires anything when used with Python 3.5 (Dropped cached_property)
|
||||
- Removed support for these languages: Catalan, Esperanto, Kazakh, Basque, Volapük, Azeri, Galician, Nynorsk, Macedonian, and Serbocroatian.
|
||||
- The exception hook on UnicodeDecodeError has been removed.
|
||||
|
||||
### Deprecated
|
||||
- Methods coherence_non_latin, w_counter, chaos_secondary_pass of the class CharsetMatch are now deprecated and scheduled for removal in v3.0
|
||||
|
||||
### Fixed
|
||||
- The CLI output used the relative path of the file(s). Should be absolute.
|
||||
|
||||
## [1.4.1](https://github.com/Ousret/charset_normalizer/compare/1.4.0...1.4.1) (2021-05-28)
|
||||
### Fixed
|
||||
- Logger configuration/usage no longer conflict with others (PR #44)
|
||||
|
||||
## [1.4.0](https://github.com/Ousret/charset_normalizer/compare/1.3.9...1.4.0) (2021-05-21)
|
||||
### Removed
|
||||
- Using standard logging instead of using the package loguru.
|
||||
- Dropping nose test framework in favor of the maintained pytest.
|
||||
- Choose to not use dragonmapper package to help with gibberish Chinese/CJK text.
|
||||
- Require cached_property only for Python 3.5 due to constraint. Dropping for every other interpreter version.
|
||||
- Stop support for UTF-7 that does not contain a SIG.
|
||||
- Dropping PrettyTable, replaced with pure JSON output in CLI.
|
||||
|
||||
### Fixed
|
||||
- BOM marker in a CharsetNormalizerMatch instance could be False in rare cases even if obviously present. Due to the sub-match factoring process.
|
||||
- Not searching properly for the BOM when trying utf32/16 parent codec.
|
||||
|
||||
### Changed
|
||||
- Improving the package final size by compressing frequencies.json.
|
||||
- Huge improvement on the largest payloads.
|
||||
|
||||
### Added
|
||||
- CLI now produces JSON consumable output.
|
||||
- Return ASCII if given sequences fit. Given reasonable confidence.
|
||||
|
||||
## [1.3.9](https://github.com/Ousret/charset_normalizer/compare/1.3.8...1.3.9) (2021-05-13)
|
||||
|
||||
### Fixed
|
||||
- In some very rare cases, you may end up getting encode/decode errors due to a bad bytes payload (PR #40)
|
||||
|
||||
## [1.3.8](https://github.com/Ousret/charset_normalizer/compare/1.3.7...1.3.8) (2021-05-12)
|
||||
|
||||
### Fixed
|
||||
- Empty given payload for detection may cause an exception if trying to access the `alphabets` property. (PR #39)
|
||||
|
||||
## [1.3.7](https://github.com/Ousret/charset_normalizer/compare/1.3.6...1.3.7) (2021-05-12)
|
||||
|
||||
### Fixed
|
||||
- The legacy detect function should return UTF-8-SIG if sig is present in the payload. (PR #38)
|
||||
|
||||
## [1.3.6](https://github.com/Ousret/charset_normalizer/compare/1.3.5...1.3.6) (2021-02-09)
|
||||
|
||||
### Changed
|
||||
- Amend the previous release to allow prettytable 2.0 (PR #35)
|
||||
|
||||
## [1.3.5](https://github.com/Ousret/charset_normalizer/compare/1.3.4...1.3.5) (2021-02-08)
|
||||
|
||||
### Fixed
|
||||
- Fix error while using the package with a python pre-release interpreter (PR #33)
|
||||
|
||||
### Changed
|
||||
- Dependencies refactoring, constraints revised.
|
||||
|
||||
### Added
|
||||
- Add python 3.9 and 3.10 to the supported interpreters
|
||||
|
||||
MIT License
|
||||
|
||||
Copyright (c) 2025 TAHRI Ahmed R.
|
||||
|
||||
Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||
of this software and associated documentation files (the "Software"), to deal
|
||||
in the Software without restriction, including without limitation the rights
|
||||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||
copies of the Software, and to permit persons to whom the Software is
|
||||
furnished to do so, subject to the following conditions:
|
||||
|
||||
The above copyright notice and this permission notice shall be included in all
|
||||
copies or substantial portions of the Software.
|
||||
|
||||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
||||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
||||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
||||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
||||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
||||
SOFTWARE.
|
||||
36
aws-lambda/src/charset_normalizer-3.4.6.dist-info/RECORD
Normal file
@@ -0,0 +1,36 @@
|
||||
../../bin/normalizer.exe,sha256=uY57Qx6M5YGU_c0zRMEtN02pv6-80LYhpGggR0-sPPE,108438
|
||||
81d243bd2c585b0f4821__mypyc.cp313-win_amd64.pyd,sha256=rocMwTSyn1MsT2POch5hU59FyTp_OMExLtRy07yhNQI,209920
|
||||
charset_normalizer-3.4.6.dist-info/INSTALLER,sha256=zuuue4knoyJ-UwPPXg8fezS7VCrXJQrAP7zeNuwvFQg,4
|
||||
charset_normalizer-3.4.6.dist-info/METADATA,sha256=5gXi0-a3mdA2Fe0k_nTWRax6j6VrMC_CE3v3x1PECTU,41354
|
||||
charset_normalizer-3.4.6.dist-info/RECORD,,
|
||||
charset_normalizer-3.4.6.dist-info/WHEEL,sha256=Xr-hSQu17ZxKorLWItir4Mz0GplQpPFz9u2i9sztbpM,101
|
||||
charset_normalizer-3.4.6.dist-info/entry_points.txt,sha256=ADSTKrkXZ3hhdOVFi6DcUEHQRS0xfxDIE_pEz4wLIXA,65
|
||||
charset_normalizer-3.4.6.dist-info/licenses/LICENSE,sha256=GFd0hdNwTxpHne2OVzwJds_tMV_S_ReYP6mI2kwvcNE,1092
|
||||
charset_normalizer-3.4.6.dist-info/top_level.txt,sha256=c_vZbitqecT2GfK3zdxSTLCn8C-6pGnHQY5o_5Y32M0,47
|
||||
charset_normalizer/__init__.py,sha256=0NT8MHi7SKq3juMqYfOdrkzjisK0L73lneNHH4qaUAs,1638
|
||||
charset_normalizer/__main__.py,sha256=2sj_BS6H0sU25C1bMqz9DVwa6kOK9lchSEbSU-_iu7M,115
|
||||
charset_normalizer/__pycache__/__init__.cpython-313.pyc,,
|
||||
charset_normalizer/__pycache__/__main__.cpython-313.pyc,,
|
||||
charset_normalizer/__pycache__/api.cpython-313.pyc,,
|
||||
charset_normalizer/__pycache__/cd.cpython-313.pyc,,
|
||||
charset_normalizer/__pycache__/constant.cpython-313.pyc,,
|
||||
charset_normalizer/__pycache__/legacy.cpython-313.pyc,,
|
||||
charset_normalizer/__pycache__/md.cpython-313.pyc,,
|
||||
charset_normalizer/__pycache__/models.cpython-313.pyc,,
|
||||
charset_normalizer/__pycache__/utils.cpython-313.pyc,,
|
||||
charset_normalizer/__pycache__/version.cpython-313.pyc,,
|
||||
charset_normalizer/api.py,sha256=9MQY-Yxk3o_7INO7_95Cp0LoMBPlS7bLwd2Xmatx53o,38934
|
||||
charset_normalizer/cd.cp313-win_amd64.pyd,sha256=cnp2QZy3mf-VrHmEabh1lP3vPI5holac7219PUzG6ls,10752
|
||||
charset_normalizer/cd.py,sha256=oxPRzNqGqGMm6V4Os7lI2jQK_Pe8vgzEOWtIH4FWYhQ,15628
|
||||
charset_normalizer/cli/__init__.py,sha256=d9MUx-1V_qD3x9igIy4JT4oC5CU0yjulk7QyZWeRFhg,144
|
||||
charset_normalizer/cli/__main__.py,sha256=8TfOJ7pE3_USMlquyzg1TnEvSWdshj3Vmycsmsd0EJA,12302
|
||||
charset_normalizer/cli/__pycache__/__init__.cpython-313.pyc,,
|
||||
charset_normalizer/cli/__pycache__/__main__.cpython-313.pyc,,
|
||||
charset_normalizer/constant.py,sha256=LZKBEFlORPukohph-72Z6xMzPNk5LtM3EFUjiwajW7w,46481
|
||||
charset_normalizer/legacy.py,sha256=MCz6fcVj-h_VG9lMgnMUibESu4iAQqznE6HbMdleQIM,2740
|
||||
charset_normalizer/md.cp313-win_amd64.pyd,sha256=rcs4vBX9AW57dBV0uTo44kwF7g14wQxgAnNGfkMtxaM,10752
|
||||
charset_normalizer/md.py,sha256=_Sv-zOZTD64bz7LsMcs5GEOBdJdvYreOiVjtUiEcVuU,31377
|
||||
charset_normalizer/models.py,sha256=pxN8ssKWCQy5DHJkHpCRri7XDrPRMXjwjzpS80LJY-k,12719
|
||||
charset_normalizer/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
|
||||
charset_normalizer/utils.py,sha256=NzMBuvIt5IZ5_sYqboqJs7L1X_uWhwz5RfPWBYaDPX0,12704
|
||||
charset_normalizer/version.py,sha256=3A-eWTw8y-IdqpnvH7bB772EHzrl49SpSvdMGsXkhAM,123
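The hash column in the RECORD rows above follows the installed-project RECORD convention: the algorithm name, an `=` sign, then the URL-safe Base64 encoding of the digest with the trailing `=` padding stripped, followed by the file size in bytes. A minimal sketch of recomputing such a row with the standard library only is shown here; the helper names `record_hash` and `record_line` are hypothetical and belong to neither pip nor charset_normalizer.

```python
import base64
import hashlib


def record_hash(data: bytes) -> str:
    """Return the sha256 field in the form used by a RECORD row."""
    digest = hashlib.sha256(data).digest()
    # URL-safe Base64, with the trailing '=' padding removed.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"sha256={encoded}"


def record_line(path: str, data: bytes) -> str:
    """Build a 'path,hash,size' row for file content held in memory."""
    return f"{path},{record_hash(data)},{len(data)}"


# py.typed is an empty marker file (size 0), so its row can be rebuilt from b"".
print(record_line("charset_normalizer/py.typed", b""))
```

Running the same helper over the bytes of any other vendored file and comparing the result with its RECORD row is one way to check that the copy under `aws-lambda/src` was not altered after installation.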
|
||||
5
aws-lambda/src/charset_normalizer-3.4.6.dist-info/WHEEL
Normal file
@@ -0,0 +1,5 @@
|
||||
Wheel-Version: 1.0
|
||||
Generator: setuptools (82.0.0)
|
||||
Root-Is-Purelib: false
|
||||
Tag: cp313-cp313-win_amd64
|
||||
|
||||
@@ -0,0 +1,2 @@
|
||||
[console_scripts]
|
||||
normalizer = charset_normalizer.cli:cli_detect
|
||||
@@ -0,0 +1,21 @@
|
||||
MIT License
|
||||
|
||||
Copyright (c) 2025 TAHRI Ahmed R.
|
||||
|
||||
Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||
of this software and associated documentation files (the "Software"), to deal
|
||||
in the Software without restriction, including without limitation the rights
|
||||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||
copies of the Software, and to permit persons to whom the Software is
|
||||
furnished to do so, subject to the following conditions:
|
||||
|
||||
The above copyright notice and this permission notice shall be included in all
|
||||
copies or substantial portions of the Software.
|
||||
|
||||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
||||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
||||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
||||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
||||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
||||
SOFTWARE.
|
||||
@@ -0,0 +1,2 @@
|
||||
81d243bd2c585b0f4821__mypyc
|
||||
charset_normalizer
|
||||
48
aws-lambda/src/charset_normalizer/__init__.py
Normal file
@@ -0,0 +1,48 @@
|
||||
"""
|
||||
Charset-Normalizer
|
||||
~~~~~~~~~~~~~~
|
||||
The Real First Universal Charset Detector.
|
||||
A library that helps you read text from an unknown charset encoding.
|
||||
Motivated by chardet, This package is trying to resolve the issue by taking a new approach.
|
||||
All IANA character set names for which the Python core library provides codecs are supported.
|
||||
|
||||
Basic usage:
|
||||
>>> from charset_normalizer import from_bytes
|
||||
>>> results = from_bytes('Bсеки човек има право на образование. Oбразованието!'.encode('utf_8'))
|
||||
>>> best_guess = results.best()
|
||||
>>> str(best_guess)
|
||||
'Bсеки човек има право на образование. Oбразованието!'
|
||||
|
||||
Others methods and usages are available - see the full documentation
|
||||
at <https://github.com/Ousret/charset_normalizer>.
|
||||
:copyright: (c) 2021 by Ahmed TAHRI
|
||||
:license: MIT, see LICENSE for more details.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
|
||||
from .api import from_bytes, from_fp, from_path, is_binary
|
||||
from .legacy import detect
|
||||
from .models import CharsetMatch, CharsetMatches
|
||||
from .utils import set_logging_handler
|
||||
from .version import VERSION, __version__
|
||||
|
||||
__all__ = (
|
||||
"from_fp",
|
||||
"from_path",
|
||||
"from_bytes",
|
||||
"is_binary",
|
||||
"detect",
|
||||
"CharsetMatch",
|
||||
"CharsetMatches",
|
||||
"__version__",
|
||||
"VERSION",
|
||||
"set_logging_handler",
|
||||
)
|
||||
|
||||
# Attach a NullHandler to the top level logger by default
|
||||
# https://docs.python.org/3.3/howto/logging.html#configuring-logging-for-a-library
|
||||
|
||||
logging.getLogger("charset_normalizer").addHandler(logging.NullHandler())
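The `NullHandler` attached above keeps the library silent unless the application opts in to its logs. A minimal sketch of that pattern from both sides follows, using only the standard library. The logger name `my_library` and both function names are hypothetical, chosen for illustration, and this is not how charset_normalizer's own `explain` option is implemented.

```python
import io
import logging

# Library side: attach a NullHandler so no output is forced on users.
# "my_library" is a hypothetical logger name used only in this sketch.
library_logger = logging.getLogger("my_library")
library_logger.addHandler(logging.NullHandler())


def emit_from_library() -> None:
    """Stand-in for library code that logs during normal operation."""
    library_logger.debug("probing candidate encodings")


def capture_library_logs() -> str:
    """Application side: attach a real handler, run, then detach it."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setFormatter(logging.Formatter("%(levelname)s | %(message)s"))
    previous_level = library_logger.level
    library_logger.addHandler(handler)
    library_logger.setLevel(logging.DEBUG)
    try:
        emit_from_library()
    finally:
        # Restore the logger so later calls are silent again.
        library_logger.removeHandler(handler)
        library_logger.setLevel(previous_level)
    return buffer.getvalue()


print(capture_library_logs())
```

Scoping the handler to a single call, and removing it in a `finally` block, avoids leaving a stream handler attached for the lifetime of the process, which matters in a long-lived runtime such as a warm Lambda container.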
|
||||
6
aws-lambda/src/charset_normalizer/__main__.py
Normal file
@@ -0,0 +1,6 @@
|
||||
from __future__ import annotations
|
||||
|
||||
from .cli import cli_detect
|
||||
|
||||
if __name__ == "__main__":
|
||||
cli_detect()
|
||||
Binary file not shown.
Binary file not shown.
Binary file not shown.
BIN
aws-lambda/src/charset_normalizer/__pycache__/cd.cpython-313.pyc
Normal file
Binary file not shown.
Binary file not shown.
Binary file not shown.
BIN
aws-lambda/src/charset_normalizer/__pycache__/md.cpython-313.pyc
Normal file
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
974
aws-lambda/src/charset_normalizer/api.py
Normal file
@@ -0,0 +1,974 @@
from __future__ import annotations

import logging
from os import PathLike
from typing import BinaryIO

from .cd import (
    coherence_ratio,
    encoding_languages,
    mb_encoding_languages,
    merge_coherence_ratios,
)
from .constant import (
    IANA_SUPPORTED,
    IANA_SUPPORTED_SIMILAR,
    TOO_BIG_SEQUENCE,
    TOO_SMALL_SEQUENCE,
    TRACE,
)
from .md import mess_ratio
from .models import CharsetMatch, CharsetMatches
from .utils import (
    any_specified_encoding,
    cut_sequence_chunks,
    iana_name,
    identify_sig_or_bom,
    is_multi_byte_encoding,
    should_strip_sig_or_bom,
)

logger = logging.getLogger("charset_normalizer")
explain_handler = logging.StreamHandler()
explain_handler.setFormatter(
    logging.Formatter("%(asctime)s | %(levelname)s | %(message)s")
)

# Pre-compute a reordered encoding list: multibyte first, then single-byte.
# This allows the mb_definitive_match optimization to fire earlier, skipping
# all single-byte encodings for genuine CJK content. Multibyte codecs
# hard-fail (UnicodeDecodeError) on single-byte data almost instantly, so
# testing them first costs negligible time for non-CJK files.
_mb_supported: list[str] = []
_sb_supported: list[str] = []

for _supported_enc in IANA_SUPPORTED:
    try:
        if is_multi_byte_encoding(_supported_enc):
            _mb_supported.append(_supported_enc)
        else:
            _sb_supported.append(_supported_enc)
    except ImportError:
        _sb_supported.append(_supported_enc)

IANA_SUPPORTED_MB_FIRST: list[str] = _mb_supported + _sb_supported


def from_bytes(
    sequences: bytes | bytearray,
    steps: int = 5,
    chunk_size: int = 512,
    threshold: float = 0.2,
    cp_isolation: list[str] | None = None,
    cp_exclusion: list[str] | None = None,
    preemptive_behaviour: bool = True,
    explain: bool = False,
    language_threshold: float = 0.1,
    enable_fallback: bool = True,
) -> CharsetMatches:
    """
    Given a raw bytes sequence, return the best possible charsets usable to render str objects.
    If there are no results, it is a strong indicator that the source is binary/not text.
    By default, the process will extract 5 blocks of 512 bytes each to assess the mess and coherence of a given sequence.
    It will give up on a particular code page after 20% of measured mess. Those criteria are customizable at will.

    The preemptive behavior DOES NOT replace the traditional detection workflow; it prioritizes a particular code page
    but never takes it for granted. It can improve performance.

    You may want to focus on some code pages and/or exclude others; use cp_isolation and cp_exclusion for that
    purpose.

    This function will strip the SIG in the payload/sequence every time except on UTF-16, UTF-32.
    By default the library does not set up any handler other than the NullHandler. If you choose to set the 'explain'
    toggle to True, it will alter the logger configuration to add a StreamHandler that is suitable for debugging.
    Custom logging format and handler can be set manually.
    """

    if not isinstance(sequences, (bytearray, bytes)):
        raise TypeError(
            "Expected object of type bytes or bytearray, got: {}".format(
                type(sequences)
            )
        )

    if explain:
        previous_logger_level: int = logger.level
        logger.addHandler(explain_handler)
        logger.setLevel(TRACE)

    length: int = len(sequences)

    if length == 0:
        logger.debug("Encoding detection on empty bytes, assuming utf_8 intention.")
        if explain:  # Defensive: ensure exit path clean handler
            logger.removeHandler(explain_handler)
            logger.setLevel(previous_logger_level)
        return CharsetMatches([CharsetMatch(sequences, "utf_8", 0.0, False, [], "")])

    if cp_isolation is not None:
        logger.log(
            TRACE,
            "cp_isolation is set. use this flag for debugging purpose. "
            "limited list of encoding allowed : %s.",
            ", ".join(cp_isolation),
        )
        cp_isolation = [iana_name(cp, False) for cp in cp_isolation]
    else:
        cp_isolation = []

    if cp_exclusion is not None:
        logger.log(
            TRACE,
            "cp_exclusion is set. use this flag for debugging purpose. "
            "limited list of encoding excluded : %s.",
            ", ".join(cp_exclusion),
        )
        cp_exclusion = [iana_name(cp, False) for cp in cp_exclusion]
    else:
        cp_exclusion = []

    if length <= (chunk_size * steps):
        logger.log(
            TRACE,
            "override steps (%i) and chunk_size (%i) as content does not fit (%i byte(s) given) parameters.",
            steps,
            chunk_size,
            length,
        )
        steps = 1
        chunk_size = length

    if steps > 1 and length / steps < chunk_size:
        chunk_size = int(length / steps)

    is_too_small_sequence: bool = len(sequences) < TOO_SMALL_SEQUENCE
    is_too_large_sequence: bool = len(sequences) >= TOO_BIG_SEQUENCE

    if is_too_small_sequence:
        logger.log(
            TRACE,
            "Trying to detect encoding from a tiny portion of ({}) byte(s).".format(
                length
            ),
        )
    elif is_too_large_sequence:
        logger.log(
            TRACE,
            "Using lazy str decoding because the payload is quite large, ({}) byte(s).".format(
                length
            ),
        )

    prioritized_encodings: list[str] = []

    specified_encoding: str | None = (
        any_specified_encoding(sequences) if preemptive_behaviour else None
    )

    if specified_encoding is not None:
        prioritized_encodings.append(specified_encoding)
        logger.log(
            TRACE,
            "Detected declarative mark in sequence. Priority +1 given for %s.",
            specified_encoding,
        )

    tested: set[str] = set()
    tested_but_hard_failure: list[str] = []
    tested_but_soft_failure: list[str] = []
    soft_failure_skip: set[str] = set()
    success_fast_tracked: set[str] = set()

    # Cache for decoded payload deduplication: hash(decoded_payload) -> (mean_mess_ratio, cd_ratios_merged, passed)
    # When multiple encodings decode to the exact same string, we can skip the expensive
    # mess_ratio and coherence_ratio analysis and reuse the results from the first encoding.
    payload_result_cache: dict[int, tuple[float, list[tuple[str, float]], bool]] = {}

    # When a definitive result (chaos=0.0 and good coherence) is found after testing
    # the prioritized encodings (ascii, utf_8), we can significantly reduce the remaining
    # work. Encodings that target completely different language families (e.g., Cyrillic
    # when the definitive match is Latin) are skipped entirely.
    # Additionally, for same-family encodings that pass chaos probing, we reuse the
    # definitive match's coherence ratios instead of recomputing them. This is a major
    # saving since coherence_ratio accounts for ~30% of total time on slow Latin files.
    definitive_match_found: bool = False
    definitive_target_languages: set[str] = set()
    # After the definitive match fires, we cap the number of additional same-family
    # single-byte encodings that pass chaos probing. Once we've accumulated enough
    # good candidates (N), further same-family SB encodings are unlikely to produce
    # a better best() result and just waste mess_ratio + coherence_ratio time.
    # The first encoding to trigger the definitive match is NOT counted (it's already in).
    post_definitive_sb_success_count: int = 0
    POST_DEFINITIVE_SB_CAP: int = 7

    # When a non-UTF multibyte encoding passes chaos probing with significant multibyte
    # content (decoded length < 98% of raw length), skip all remaining single-byte encodings.
    # Rationale: multi-byte decoders (CJK) have strict byte-sequence validation. If they
    # decode without error AND pass chaos probing with substantial multibyte content, the
    # data is genuinely multibyte encoded. Single-byte encodings will always decode (every
    # byte maps to something) but waste time on mess_ratio before failing.
    # The 98% threshold prevents false triggers on files that happen to have a few valid
    # multibyte pairs (e.g., cp424/_ude_1.txt where big5 decodes with 99% ratio).
    mb_definitive_match_found: bool = False

    fallback_ascii: CharsetMatch | None = None
    fallback_u8: CharsetMatch | None = None
    fallback_specified: CharsetMatch | None = None

    results: CharsetMatches = CharsetMatches()

    early_stop_results: CharsetMatches = CharsetMatches()

    sig_encoding, sig_payload = identify_sig_or_bom(sequences)

    if sig_encoding is not None:
        prioritized_encodings.append(sig_encoding)
        logger.log(
            TRACE,
            "Detected a SIG or BOM mark on first %i byte(s). Priority +1 given for %s.",
            len(sig_payload),
            sig_encoding,
        )

    prioritized_encodings.append("ascii")

    if "utf_8" not in prioritized_encodings:
        prioritized_encodings.append("utf_8")

    for encoding_iana in prioritized_encodings + IANA_SUPPORTED_MB_FIRST:
        if cp_isolation and encoding_iana not in cp_isolation:
            continue

        if cp_exclusion and encoding_iana in cp_exclusion:
            continue

        if encoding_iana in tested:
            continue

        tested.add(encoding_iana)

        decoded_payload: str | None = None
        bom_or_sig_available: bool = sig_encoding == encoding_iana
        strip_sig_or_bom: bool = bom_or_sig_available and should_strip_sig_or_bom(
            encoding_iana
        )

        if encoding_iana in {"utf_16", "utf_32"} and not bom_or_sig_available:
            logger.log(
                TRACE,
                "Encoding %s won't be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.",
                encoding_iana,
            )
            continue
        if encoding_iana in {"utf_7"} and not bom_or_sig_available:
            logger.log(
                TRACE,
                "Encoding %s won't be tested as-is because detection is unreliable without BOM/SIG.",
                encoding_iana,
            )
            continue

        # Skip encodings similar to ones that already soft-failed (high mess ratio).
        # Checked BEFORE the expensive decode attempt.
        if encoding_iana in soft_failure_skip:
            logger.log(
                TRACE,
                "%s is deemed too similar to a code page that was already considered unsuited. Continuing!",
                encoding_iana,
            )
            continue

        # Skip encodings that were already fast-tracked from a similar successful encoding.
        if encoding_iana in success_fast_tracked:
            logger.log(
                TRACE,
                "Skipping %s: already fast-tracked from a similar successful encoding.",
                encoding_iana,
            )
            continue

        try:
            is_multi_byte_decoder: bool = is_multi_byte_encoding(encoding_iana)
        except (ModuleNotFoundError, ImportError):  # Defensive:
            logger.log(
                TRACE,
                "Encoding %s does not provide an IncrementalDecoder",
                encoding_iana,
            )
            continue

        # When we've already found a definitive match (chaos=0.0 with good coherence)
        # after testing the prioritized encodings, skip encodings that target
        # completely different language families. This avoids running expensive
        # mess_ratio + coherence_ratio on clearly unrelated candidates (e.g., Cyrillic
        # when the definitive match is Latin-based).
        if definitive_match_found:
            if not is_multi_byte_decoder:
                enc_languages = set(encoding_languages(encoding_iana))
            else:
                enc_languages = set(mb_encoding_languages(encoding_iana))
            if not enc_languages.intersection(definitive_target_languages):
                logger.log(
                    TRACE,
                    "Skipping %s: definitive match already found, this encoding targets different languages (%s vs %s).",
                    encoding_iana,
                    enc_languages,
                    definitive_target_languages,
                )
                continue

        # After the definitive match, cap the number of additional same-family
        # single-byte encodings that pass chaos probing. This avoids testing the
        # tail of rare, low-value same-family encodings (mac_iceland, cp860, etc.)
        # that almost never change best() but each cost ~1-2ms of mess_ratio + coherence.
        if (
            definitive_match_found
            and not is_multi_byte_decoder
            and post_definitive_sb_success_count >= POST_DEFINITIVE_SB_CAP
        ):
            logger.log(
                TRACE,
                "Skipping %s: already accumulated %d same-family results after definitive match (cap=%d).",
                encoding_iana,
                post_definitive_sb_success_count,
                POST_DEFINITIVE_SB_CAP,
            )
            continue

        # When a multibyte encoding with significant multibyte content has already
        # passed chaos probing, skip all single-byte encodings. They will either fail
        # chaos probing (wasting mess_ratio time) or produce inferior results.
        if mb_definitive_match_found and not is_multi_byte_decoder:
            logger.log(
                TRACE,
                "Skipping single-byte %s: multi-byte definitive match already found.",
                encoding_iana,
            )
            continue

        try:
            if is_too_large_sequence and is_multi_byte_decoder is False:
                str(
                    (
                        sequences[: int(50e4)]
                        if strip_sig_or_bom is False
                        else sequences[len(sig_payload) : int(50e4)]
                    ),
                    encoding=encoding_iana,
                )
            else:
                decoded_payload = str(
                    (
                        sequences
                        if strip_sig_or_bom is False
                        else sequences[len(sig_payload) :]
                    ),
                    encoding=encoding_iana,
                )
        except (UnicodeDecodeError, LookupError) as e:
            if not isinstance(e, LookupError):
                logger.log(
                    TRACE,
                    "Code page %s does not fit given bytes sequence at ALL. %s",
                    encoding_iana,
                    str(e),
                )
            tested_but_hard_failure.append(encoding_iana)
            continue

        r_ = range(
            0 if not bom_or_sig_available else len(sig_payload),
            length,
            int(length / steps),
        )

        multi_byte_bonus: bool = (
            is_multi_byte_decoder
            and decoded_payload is not None
            and len(decoded_payload) < length
        )

        if multi_byte_bonus:
            logger.log(
                TRACE,
                "Code page %s is a multi byte encoding table and it appear that at least one character "
                "was encoded using n-bytes.",
                encoding_iana,
            )

        # Payload-hash deduplication: if another encoding already decoded to the
        # exact same string, reuse its mess_ratio and coherence results entirely.
        # This is strictly more general than the old IANA_SUPPORTED_SIMILAR approach
        # because it catches ALL identical decoding, not just pre-mapped ones.
        if decoded_payload is not None and not is_multi_byte_decoder:
            payload_hash: int = hash(decoded_payload)
            cached = payload_result_cache.get(payload_hash)
            if cached is not None:
                cached_mess, cached_cd, cached_passed = cached
                if cached_passed:
                    # The previous encoding with identical output passed chaos probing.
                    fast_match = CharsetMatch(
                        sequences,
                        encoding_iana,
                        cached_mess,
                        bom_or_sig_available,
                        cached_cd,
                        (
                            decoded_payload
                            if (
                                is_too_large_sequence is False
                                or encoding_iana
                                in [specified_encoding, "ascii", "utf_8"]
                            )
                            else None
                        ),
                        preemptive_declaration=specified_encoding,
                    )
                    results.append(fast_match)
                    success_fast_tracked.add(encoding_iana)
                    logger.log(
                        TRACE,
                        "%s fast-tracked (identical decoded payload to a prior encoding, chaos=%f %%).",
                        encoding_iana,
                        round(cached_mess * 100, ndigits=3),
                    )

                    if (
                        encoding_iana in [specified_encoding, "ascii", "utf_8"]
                        and cached_mess < 0.1
                    ):
                        if cached_mess == 0.0:
                            logger.debug(
                                "Encoding detection: %s is most likely the one.",
                                fast_match.encoding,
                            )
                            if explain:
                                logger.removeHandler(explain_handler)
                                logger.setLevel(previous_logger_level)
                            return CharsetMatches([fast_match])
                        early_stop_results.append(fast_match)

                    if (
                        len(early_stop_results)
                        and (specified_encoding is None or specified_encoding in tested)
                        and "ascii" in tested
                        and "utf_8" in tested
                    ):
                        probable_result: CharsetMatch = early_stop_results.best()  # type: ignore[assignment]
                        logger.debug(
                            "Encoding detection: %s is most likely the one.",
                            probable_result.encoding,
                        )
                        if explain:
                            logger.removeHandler(explain_handler)
                            logger.setLevel(previous_logger_level)
                        return CharsetMatches([probable_result])

                    continue
                else:
                    # The previous encoding with identical output failed chaos probing.
                    tested_but_soft_failure.append(encoding_iana)
                    logger.log(
                        TRACE,
                        "%s fast-skipped (identical decoded payload to a prior encoding that failed chaos probing).",
                        encoding_iana,
                    )
                    # Prepare fallbacks for special encodings even when skipped.
                    if enable_fallback and encoding_iana in [
                        "ascii",
                        "utf_8",
                        specified_encoding,
                        "utf_16",
                        "utf_32",
                    ]:
                        fallback_entry = CharsetMatch(
                            sequences,
                            encoding_iana,
                            threshold,
                            bom_or_sig_available,
                            [],
                            decoded_payload,
                            preemptive_declaration=specified_encoding,
                        )
                        if encoding_iana == specified_encoding:
                            fallback_specified = fallback_entry
                        elif encoding_iana == "ascii":
                            fallback_ascii = fallback_entry
                        else:
                            fallback_u8 = fallback_entry
                    continue

        max_chunk_gave_up: int = int(len(r_) / 4)

        max_chunk_gave_up = max(max_chunk_gave_up, 2)
        early_stop_count: int = 0
        lazy_str_hard_failure = False

        md_chunks: list[str] = []
        md_ratios = []

        try:
            for chunk in cut_sequence_chunks(
                sequences,
                encoding_iana,
                r_,
                chunk_size,
                bom_or_sig_available,
                strip_sig_or_bom,
                sig_payload,
                is_multi_byte_decoder,
                decoded_payload,
            ):
                md_chunks.append(chunk)

                md_ratios.append(
                    mess_ratio(
                        chunk,
                        threshold,
                        explain is True and 1 <= len(cp_isolation) <= 2,
                    )
                )

                if md_ratios[-1] >= threshold:
                    early_stop_count += 1

                if (early_stop_count >= max_chunk_gave_up) or (
                    bom_or_sig_available and strip_sig_or_bom is False
                ):
                    break
        except (
            UnicodeDecodeError
        ) as e:  # Lazy str loading may have missed something there
            logger.log(
                TRACE,
                "LazyStr Loading: After MD chunk decode, code page %s does not fit given bytes sequence at ALL. %s",
                encoding_iana,
                str(e),
            )
            early_stop_count = max_chunk_gave_up
            lazy_str_hard_failure = True

        # We might want to check the sequence again with the whole content
        # Only if initial MD tests passes
        if (
            not lazy_str_hard_failure
            and is_too_large_sequence
            and not is_multi_byte_decoder
        ):
            try:
                sequences[int(50e3) :].decode(encoding_iana, errors="strict")
            except UnicodeDecodeError as e:
                logger.log(
                    TRACE,
                    "LazyStr Loading: After final lookup, code page %s does not fit given bytes sequence at ALL. %s",
                    encoding_iana,
                    str(e),
                )
                tested_but_hard_failure.append(encoding_iana)
                continue

        mean_mess_ratio: float = sum(md_ratios) / len(md_ratios) if md_ratios else 0.0
        if mean_mess_ratio >= threshold or early_stop_count >= max_chunk_gave_up:
            tested_but_soft_failure.append(encoding_iana)
            if encoding_iana in IANA_SUPPORTED_SIMILAR:
                soft_failure_skip.update(IANA_SUPPORTED_SIMILAR[encoding_iana])
            # Cache this soft-failure so identical decoding from other encodings
            # can be skipped immediately.
            if decoded_payload is not None and not is_multi_byte_decoder:
                payload_result_cache.setdefault(
                    hash(decoded_payload), (mean_mess_ratio, [], False)
                )
            logger.log(
                TRACE,
                "%s was excluded because of initial chaos probing. Gave up %i time(s). "
                "Computed mean chaos is %f %%.",
                encoding_iana,
                early_stop_count,
                round(mean_mess_ratio * 100, ndigits=3),
            )
            # Preparing those fallbacks in case we got nothing.
            if (
                enable_fallback
                and encoding_iana
                in ["ascii", "utf_8", specified_encoding, "utf_16", "utf_32"]
                and not lazy_str_hard_failure
            ):
                fallback_entry = CharsetMatch(
                    sequences,
                    encoding_iana,
                    threshold,
                    bom_or_sig_available,
                    [],
                    decoded_payload,
                    preemptive_declaration=specified_encoding,
                )
                if encoding_iana == specified_encoding:
                    fallback_specified = fallback_entry
                elif encoding_iana == "ascii":
                    fallback_ascii = fallback_entry
                else:
                    fallback_u8 = fallback_entry
            continue

        logger.log(
            TRACE,
            "%s passed initial chaos probing. Mean measured chaos is %f %%",
            encoding_iana,
            round(mean_mess_ratio * 100, ndigits=3),
        )

        if not is_multi_byte_decoder:
            target_languages: list[str] = encoding_languages(encoding_iana)
        else:
            target_languages = mb_encoding_languages(encoding_iana)

        if target_languages:
            logger.log(
                TRACE,
                "{} should target any language(s) of {}".format(
                    encoding_iana, str(target_languages)
                ),
            )

        cd_ratios = []

        # Run coherence detection on all chunks. We previously tried limiting to
        # 1-2 chunks for post-definitive encodings to save time, but this caused
        # coverage regressions by producing unrepresentative coherence scores.
        # The SB cap and language-family skip optimizations provide sufficient
        # speedup without sacrificing coherence accuracy.
        if encoding_iana != "ascii":
            # We shall skip the CD when its about ASCII
            # Most of the time its not relevant to run "language-detection" on it.
            for chunk in md_chunks:
                chunk_languages = coherence_ratio(
                    chunk,
                    language_threshold,
                    ",".join(target_languages) if target_languages else None,
                )

                cd_ratios.append(chunk_languages)
            cd_ratios_merged = merge_coherence_ratios(cd_ratios)
        else:
            cd_ratios_merged = merge_coherence_ratios(cd_ratios)

        if cd_ratios_merged:
            logger.log(
                TRACE,
                "We detected language {} using {}".format(
                    cd_ratios_merged, encoding_iana
                ),
            )

        current_match = CharsetMatch(
            sequences,
            encoding_iana,
            mean_mess_ratio,
            bom_or_sig_available,
            cd_ratios_merged,
            (
                decoded_payload
                if (
                    is_too_large_sequence is False
                    or encoding_iana in [specified_encoding, "ascii", "utf_8"]
                )
                else None
            ),
            preemptive_declaration=specified_encoding,
        )

        results.append(current_match)

        # Cache the successful result for payload-hash deduplication.
        if decoded_payload is not None and not is_multi_byte_decoder:
            payload_result_cache.setdefault(
                hash(decoded_payload),
                (mean_mess_ratio, cd_ratios_merged, True),
            )

        # Count post-definitive same-family SB successes for the early termination cap.
        # Only count low-mess encodings (< 2%) toward the cap. High-mess encodings are
        # marginal results that shouldn't prevent better-quality candidates from being
        # tested. For example, iso8859_4 (mess=0%) should not be skipped just because
        # 7 high-mess Latin encodings (cp1252 at 8%, etc.) were tried first.
        if (
            definitive_match_found
            and not is_multi_byte_decoder
            and mean_mess_ratio < 0.02
        ):
            post_definitive_sb_success_count += 1

        if (
            encoding_iana in [specified_encoding, "ascii", "utf_8"]
            and mean_mess_ratio < 0.1
        ):
            # If md says nothing to worry about, then... stop immediately!
            if mean_mess_ratio == 0.0:
                logger.debug(
                    "Encoding detection: %s is most likely the one.",
                    current_match.encoding,
                )
                if explain:  # Defensive: ensure exit path clean handler
                    logger.removeHandler(explain_handler)
                    logger.setLevel(previous_logger_level)
                return CharsetMatches([current_match])

            early_stop_results.append(current_match)

        if (
            len(early_stop_results)
            and (specified_encoding is None or specified_encoding in tested)
            and "ascii" in tested
            and "utf_8" in tested
        ):
            probable_result = early_stop_results.best()  # type: ignore[assignment]
            logger.debug(
                "Encoding detection: %s is most likely the one.",
                probable_result.encoding,  # type: ignore[union-attr]
            )
            if explain:  # Defensive: ensure exit path clean handler
                logger.removeHandler(explain_handler)
                logger.setLevel(previous_logger_level)

            return CharsetMatches([probable_result])

        # Once we find a result with good coherence (>= 0.5) after testing the
        # prioritized encodings (ascii, utf_8), activate "definitive mode": skip
        # encodings that target completely different language families. This avoids
        # running expensive mess_ratio + coherence_ratio on clearly unrelated
        # candidates (e.g., Cyrillic encodings when the match is Latin-based).
        # We require coherence >= 0.5 to avoid false positives (e.g., cp1251 decoding
        # Hebrew text with 0.0 chaos but wrong language detection at coherence 0.33).
        if not definitive_match_found and not is_multi_byte_decoder:
            best_coherence = (
                max((v for _, v in cd_ratios_merged), default=0.0)
                if cd_ratios_merged
                else 0.0
            )
            if best_coherence >= 0.5 and "ascii" in tested and "utf_8" in tested:
                definitive_match_found = True
                definitive_target_languages.update(target_languages)
                logger.log(
                    TRACE,
                    "Definitive match found: %s (chaos=%.3f, coherence=%.2f). Encodings targeting different language families will be skipped.",
                    encoding_iana,
                    mean_mess_ratio,
                    best_coherence,
                )

        # When a non-UTF multibyte encoding passes chaos probing with significant
        # multibyte content (decoded < 98% of raw), activate mb_definitive_match.
        # This skips all remaining single-byte encodings which would either soft-fail
        # (running expensive mess_ratio for nothing) or produce inferior results.
        if (
            not mb_definitive_match_found
            and is_multi_byte_decoder
            and multi_byte_bonus
            and decoded_payload is not None
            and len(decoded_payload) < length * 0.98
            and encoding_iana
            not in {
                "utf_8",
                "utf_8_sig",
                "utf_16",
                "utf_16_be",
                "utf_16_le",
                "utf_32",
                "utf_32_be",
                "utf_32_le",
                "utf_7",
            }
            and "ascii" in tested
            and "utf_8" in tested
        ):
            mb_definitive_match_found = True
            logger.log(
                TRACE,
                "Multi-byte definitive match: %s (chaos=%.3f, decoded=%d/%d=%.1f%%). Single-byte encodings will be skipped.",
                encoding_iana,
                mean_mess_ratio,
                len(decoded_payload),
                length,
                len(decoded_payload) / length * 100,
            )

        if encoding_iana == sig_encoding:
            logger.debug(
                "Encoding detection: %s is most likely the one as we detected a BOM or SIG within "
                "the beginning of the sequence.",
                encoding_iana,
            )
            if explain:  # Defensive: ensure exit path clean handler
                logger.removeHandler(explain_handler)
                logger.setLevel(previous_logger_level)
            return CharsetMatches([results[encoding_iana]])

    if len(results) == 0:
        if fallback_u8 or fallback_ascii or fallback_specified:
            logger.log(
                TRACE,
                "Nothing got out of the detection process. Using ASCII/UTF-8/Specified fallback.",
            )

        if fallback_specified:
            logger.debug(
                "Encoding detection: %s will be used as a fallback match",
                fallback_specified.encoding,
            )
            results.append(fallback_specified)
        elif (
            (fallback_u8 and fallback_ascii is None)
            or (
                fallback_u8
                and fallback_ascii
                and fallback_u8.fingerprint != fallback_ascii.fingerprint
            )
            or (fallback_u8 is not None)
        ):
            logger.debug("Encoding detection: utf_8 will be used as a fallback match")
            results.append(fallback_u8)
        elif fallback_ascii:
            logger.debug("Encoding detection: ascii will be used as a fallback match")
            results.append(fallback_ascii)

    if results:
        logger.debug(
            "Encoding detection: Found %s as plausible (best-candidate) for content. With %i alternatives.",
            results.best().encoding,  # type: ignore
            len(results) - 1,
        )
    else:
        logger.debug("Encoding detection: Unable to determine any suitable charset.")

    if explain:
        logger.removeHandler(explain_handler)
        logger.setLevel(previous_logger_level)

    return results


def from_fp(
    fp: BinaryIO,
    steps: int = 5,
    chunk_size: int = 512,
    threshold: float = 0.20,
    cp_isolation: list[str] | None = None,
    cp_exclusion: list[str] | None = None,
    preemptive_behaviour: bool = True,
    explain: bool = False,
    language_threshold: float = 0.1,
    enable_fallback: bool = True,
) -> CharsetMatches:
    """
    Same as from_bytes, but reads from a file pointer that is already open.
    Will not close the file pointer.
    """
    return from_bytes(
        fp.read(),
        steps,
        chunk_size,
        threshold,
        cp_isolation,
        cp_exclusion,
        preemptive_behaviour,
        explain,
        language_threshold,
        enable_fallback,
    )


def from_path(
    path: str | bytes | PathLike,  # type: ignore[type-arg]
    steps: int = 5,
    chunk_size: int = 512,
    threshold: float = 0.20,
    cp_isolation: list[str] | None = None,
    cp_exclusion: list[str] | None = None,
    preemptive_behaviour: bool = True,
    explain: bool = False,
    language_threshold: float = 0.1,
    enable_fallback: bool = True,
) -> CharsetMatches:
    """
    Same as from_bytes, with one extra step: opening and reading the given file path in binary mode.
    Can raise IOError.
    """
    with open(path, "rb") as fp:
        return from_fp(
            fp,
            steps,
            chunk_size,
            threshold,
            cp_isolation,
            cp_exclusion,
            preemptive_behaviour,
            explain,
            language_threshold,
            enable_fallback,
        )


def is_binary(
    fp_or_path_or_payload: PathLike | str | BinaryIO | bytes,  # type: ignore[type-arg]
    steps: int = 5,
    chunk_size: int = 512,
    threshold: float = 0.20,
    cp_isolation: list[str] | None = None,
    cp_exclusion: list[str] | None = None,
    preemptive_behaviour: bool = True,
    explain: bool = False,
    language_threshold: float = 0.1,
    enable_fallback: bool = False,
) -> bool:
    """
    Detect if the given input (file, bytes, or path) is binary content, i.e. not text.
    Based on the same main heuristic algorithms and default kwargs, with the sole exception that fallback matches
    are disabled, to be stricter with content that is ASCII-compatible but unlikely to be text.
    """
    if isinstance(fp_or_path_or_payload, (str, PathLike)):
        guesses = from_path(
            fp_or_path_or_payload,
            steps=steps,
            chunk_size=chunk_size,
            threshold=threshold,
            cp_isolation=cp_isolation,
            cp_exclusion=cp_exclusion,
            preemptive_behaviour=preemptive_behaviour,
            explain=explain,
            language_threshold=language_threshold,
            enable_fallback=enable_fallback,
        )
    elif isinstance(
        fp_or_path_or_payload,
        (
            bytes,
            bytearray,
        ),
    ):
        guesses = from_bytes(
            fp_or_path_or_payload,
            steps=steps,
            chunk_size=chunk_size,
            threshold=threshold,
            cp_isolation=cp_isolation,
            cp_exclusion=cp_exclusion,
            preemptive_behaviour=preemptive_behaviour,
            explain=explain,
            language_threshold=language_threshold,
            enable_fallback=enable_fallback,
        )
    else:
        guesses = from_fp(
            fp_or_path_or_payload,
            steps=steps,
            chunk_size=chunk_size,
            threshold=threshold,
            cp_isolation=cp_isolation,
            cp_exclusion=cp_exclusion,
            preemptive_behaviour=preemptive_behaviour,
            explain=explain,
            language_threshold=language_threshold,
            enable_fallback=enable_fallback,
        )

    return not guesses
|
||||
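The three-way input dispatch used by `is_binary` can be sketched in isolation. The following is a minimal, standard-library-only illustration and not the library's own code: `classify_input` is a hypothetical helper that only reports which loader (`from_path`, `from_bytes` or `from_fp`) a given input would be routed to, assuming the same `isinstance` checks in the same order.

```python
from __future__ import annotations

import io
from os import PathLike
from pathlib import Path
from typing import BinaryIO


def classify_input(obj: PathLike | str | BinaryIO | bytes | bytearray) -> str:
    """Hypothetical helper: name the loader that would handle `obj`."""
    # Text paths and os.PathLike objects are opened and read from disk.
    if isinstance(obj, (str, PathLike)):
        return "from_path"
    # Raw payloads are analysed directly.
    if isinstance(obj, (bytes, bytearray)):
        return "from_bytes"
    # Anything else is assumed to be an already-open binary file object.
    return "from_fp"


if __name__ == "__main__":
    for sample in ("notes.txt", Path("notes.txt"), b"\x00\x01", io.BytesIO(b"abc")):
        print(type(sample).__name__, "->", classify_input(sample))
```

Note that a `bytes` value is treated as a payload here, never as a path, which mirrors the order of the checks in the function above.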
BIN
aws-lambda/src/charset_normalizer/cd.cp313-win_amd64.pyd
Normal file
Binary file not shown.
454
aws-lambda/src/charset_normalizer/cd.py
Normal file
@@ -0,0 +1,454 @@
from __future__ import annotations

import importlib
from codecs import IncrementalDecoder
from collections import Counter
from functools import lru_cache
from typing import Counter as TypeCounter

from .constant import (
    FREQUENCIES,
    KO_NAMES,
    LANGUAGE_SUPPORTED_COUNT,
    TOO_SMALL_SEQUENCE,
    ZH_NAMES,
    _FREQUENCIES_SET,
    _FREQUENCIES_RANK,
)
from .md import is_suspiciously_successive_range
from .models import CoherenceMatches
from .utils import (
    is_accentuated,
    is_latin,
    is_multi_byte_encoding,
    is_unicode_range_secondary,
    unicode_range,
)


def encoding_unicode_range(iana_name: str) -> list[str]:
    """
    Return associated unicode ranges in a single byte code page.
    """
    if is_multi_byte_encoding(iana_name):
        raise OSError(  # Defensive:
            "Function not supported on multi-byte code page"
        )

    decoder = importlib.import_module(f"encodings.{iana_name}").IncrementalDecoder

    p: IncrementalDecoder = decoder(errors="ignore")
    seen_ranges: dict[str, int] = {}
    character_count: int = 0

    for i in range(0x40, 0xFF):
        chunk: str = p.decode(bytes([i]))

        if chunk:
            character_range: str | None = unicode_range(chunk)

            if character_range is None:
                continue

            if is_unicode_range_secondary(character_range) is False:
                if character_range not in seen_ranges:
                    seen_ranges[character_range] = 0
                seen_ranges[character_range] += 1
                character_count += 1

    return sorted(
        [
            character_range
            for character_range in seen_ranges
            if seen_ranges[character_range] / character_count >= 0.15
        ]
    )


def unicode_range_languages(primary_range: str) -> list[str]:
    """
    Return inferred languages used with a unicode range.
    """
    languages: list[str] = []

    for language, characters in FREQUENCIES.items():
        for character in characters:
            if unicode_range(character) == primary_range:
                languages.append(language)
                break

    return languages


@lru_cache()
def encoding_languages(iana_name: str) -> list[str]:
    """
    Single-byte encoding language association. Some code pages are heavily linked to particular language(s).
    This function does the correspondence.
    """
    unicode_ranges: list[str] = encoding_unicode_range(iana_name)
    primary_range: str | None = None

    for specified_range in unicode_ranges:
        if "Latin" not in specified_range:
            primary_range = specified_range
            break

    if primary_range is None:
        return ["Latin Based"]

    return unicode_range_languages(primary_range)


@lru_cache()
def mb_encoding_languages(iana_name: str) -> list[str]:
    """
    Multi-byte encoding language association. Some code pages are heavily linked to particular language(s).
    This function does the correspondence.
    """
    if (
        iana_name.startswith("shift_")
        or iana_name.startswith("iso2022_jp")
        or iana_name.startswith("euc_j")
        or iana_name == "cp932"
    ):
        return ["Japanese"]
    if iana_name.startswith("gb") or iana_name in ZH_NAMES:
        return ["Chinese"]
    if iana_name.startswith("iso2022_kr") or iana_name in KO_NAMES:
        return ["Korean"]

    return []


@lru_cache(maxsize=LANGUAGE_SUPPORTED_COUNT)
def get_target_features(language: str) -> tuple[bool, bool]:
    """
    Determine the main features of a supported language: whether it contains accents and whether it is pure Latin.
    """
    target_have_accents: bool = False
    target_pure_latin: bool = True

    for character in FREQUENCIES[language]:
        if not target_have_accents and is_accentuated(character):
            target_have_accents = True
        if target_pure_latin and is_latin(character) is False:
            target_pure_latin = False

    return target_have_accents, target_pure_latin


def alphabet_languages(
    characters: list[str], ignore_non_latin: bool = False
) -> list[str]:
    """
    Return the languages associated with the given characters.
    """
    languages: list[tuple[str, float]] = []

    characters_set: frozenset[str] = frozenset(characters)
    source_have_accents = any(is_accentuated(character) for character in characters)

    for language, language_characters in FREQUENCIES.items():
        target_have_accents, target_pure_latin = get_target_features(language)

        if ignore_non_latin and target_pure_latin is False:
            continue

        if target_have_accents is False and source_have_accents:
            continue

        character_count: int = len(language_characters)

        character_match_count: int = len(_FREQUENCIES_SET[language] & characters_set)

        ratio: float = character_match_count / character_count

        if ratio >= 0.2:
            languages.append((language, ratio))

    languages = sorted(languages, key=lambda x: x[1], reverse=True)

    return [compatible_language[0] for compatible_language in languages]


def characters_popularity_compare(
    language: str, ordered_characters: list[str]
) -> float:
    """
    Determine if an ordered list of characters (by occurrence, from most frequent to rarest) matches a particular language.
    The result is a ratio between 0. (absolutely no correspondence) and 1. (near perfect fit).
    Beware that this function is not strict on the match, in order to ease the detection. (Meaning a close match is 1.)
    """
    if language not in FREQUENCIES:
        raise ValueError(f"{language} not available")  # Defensive:

    character_approved_count: int = 0
    frequencies_language_set: frozenset[str] = _FREQUENCIES_SET[language]
    lang_rank: dict[str, int] = _FREQUENCIES_RANK[language]

    ordered_characters_count: int = len(ordered_characters)
    target_language_characters_count: int = len(FREQUENCIES[language])

    large_alphabet: bool = target_language_characters_count > 26

    expected_projection_ratio: float = (
        target_language_characters_count / ordered_characters_count
    )

    # Pre-built rank dict for ordered_characters (avoids repeated list slicing).
    ordered_rank: dict[str, int] = {
        char: rank for rank, char in enumerate(ordered_characters)
    }

    # Pre-compute characters common to both orderings.
    # Avoids repeated `c in ordered_rank` dict lookups in the inner counts.
    common_chars: list[tuple[int, int]] = [
        (lr, ordered_rank[c]) for c, lr in lang_rank.items() if c in ordered_rank
    ]

    # Pre-extract lr and orr arrays for faster iteration in the inner loop.
    # Plain integer loops with local arrays are much faster under mypyc than
    # generator expression sums over a list of tuples.
    common_count: int = len(common_chars)
    common_lr: list[int] = [p[0] for p in common_chars]
    common_orr: list[int] = [p[1] for p in common_chars]

    for character, character_rank in zip(
        ordered_characters, range(0, ordered_characters_count)
    ):
        if character not in frequencies_language_set:
            continue

        character_rank_in_language: int = lang_rank[character]
        character_rank_projection: int = int(character_rank * expected_projection_ratio)

        if (
            large_alphabet is False
            and abs(character_rank_projection - character_rank_in_language) > 4
        ):
            continue

        if (
            large_alphabet is True
            and abs(character_rank_projection - character_rank_in_language)
            < target_language_characters_count / 3
        ):
            character_approved_count += 1
            continue

        # Count how many characters appear "before" in both orderings,
        # and how many appear "at or after" in both orderings.
        # Single pass over pre-extracted arrays; much faster under mypyc
        # than two generator expression sums.
        before_match_count: int = 0
        after_match_count: int = 0
        for i in range(common_count):
            lr_i: int = common_lr[i]
            orr_i: int = common_orr[i]
            if lr_i < character_rank_in_language:
                if orr_i < character_rank:
                    before_match_count += 1
            else:
                if orr_i >= character_rank:
                    after_match_count += 1

        after_len: int = target_language_characters_count - character_rank_in_language

        if character_rank_in_language == 0 and before_match_count <= 4:
            character_approved_count += 1
            continue

        if after_len == 0 and after_match_count <= 4:
            character_approved_count += 1
            continue

        if (
            character_rank_in_language > 0
            and before_match_count / character_rank_in_language >= 0.4
        ) or (after_len > 0 and after_match_count / after_len >= 0.4):
            character_approved_count += 1
            continue

    return character_approved_count / len(ordered_characters)


def alpha_unicode_split(decoded_sequence: str) -> list[str]:
    """
    Given a decoded text sequence, return a list of str. Unicode range / alphabet separation.
    Ex. a text containing English/Latin with a bit of Hebrew will return two items in the resulting list;
    one containing the Latin letters and the other the Hebrew ones.
    """
    layers: dict[str, list[str]] = {}

    # Fast path: track single-layer key to skip dict iteration for single-script text.
    single_layer_key: str | None = None
    multi_layer: bool = False

    # Cache the last character_range and its resolved layer to avoid repeated
    # is_suspiciously_successive_range calls for consecutive same-range chars.
    prev_character_range: str | None = None
    prev_layer_target: str | None = None

    for character in decoded_sequence:
        if character.isalpha() is False:
            continue

        # ASCII fast-path: a-z and A-Z are always "Basic Latin".
        # Avoids unicode_range() function call overhead for the most common case.
        character_ord: int = ord(character)
        if character_ord < 128:
            character_range: str | None = "Basic Latin"
        else:
            character_range = unicode_range(character)

        if character_range is None:
            continue

        # Fast path: same range as previous character → reuse cached layer target.
        if character_range == prev_character_range:
            if prev_layer_target is not None:
                layers[prev_layer_target].append(character)
            continue

        layer_target_range: str | None = None

        if multi_layer:
            for discovered_range in layers:
                if (
                    is_suspiciously_successive_range(discovered_range, character_range)
                    is False
                ):
                    layer_target_range = discovered_range
                    break
        elif single_layer_key is not None:
            if (
                is_suspiciously_successive_range(single_layer_key, character_range)
                is False
            ):
                layer_target_range = single_layer_key

        if layer_target_range is None:
            layer_target_range = character_range

        if layer_target_range not in layers:
            layers[layer_target_range] = []
            if single_layer_key is None:
                single_layer_key = layer_target_range
            else:
                multi_layer = True

        layers[layer_target_range].append(character)

        # Cache for next iteration
        prev_character_range = character_range
        prev_layer_target = layer_target_range

    return ["".join(chars).lower() for chars in layers.values()]


def merge_coherence_ratios(results: list[CoherenceMatches]) -> CoherenceMatches:
    """
    This function merges results previously given by the function coherence_ratio.
    The return type is the same as coherence_ratio.
    """
    per_language_ratios: dict[str, list[float]] = {}
    for result in results:
        for sub_result in result:
            language, ratio = sub_result
            if language not in per_language_ratios:
                per_language_ratios[language] = [ratio]
                continue
            per_language_ratios[language].append(ratio)

    merge = [
        (
            language,
            round(
                sum(per_language_ratios[language]) / len(per_language_ratios[language]),
                4,
            ),
        )
        for language in per_language_ratios
    ]

    return sorted(merge, key=lambda x: x[1], reverse=True)


def filter_alt_coherence_matches(results: CoherenceMatches) -> CoherenceMatches:
    """
    We shall NOT return "English—" in CoherenceMatches because it is an alternative
    of "English". This function only keeps the best match and removes the em-dash from it.
    """
    index_results: dict[str, list[float]] = dict()

    for result in results:
        language, ratio = result
        no_em_name: str = language.replace("—", "")

        if no_em_name not in index_results:
            index_results[no_em_name] = []

        index_results[no_em_name].append(ratio)

    if any(len(index_results[e]) > 1 for e in index_results):
        filtered_results: CoherenceMatches = []

        for language in index_results:
            filtered_results.append((language, max(index_results[language])))

        return filtered_results

    return results


@lru_cache(maxsize=2048)
def coherence_ratio(
    decoded_sequence: str, threshold: float = 0.1, lg_inclusion: str | None = None
) -> CoherenceMatches:
    """
    Detect ANY language that can be identified in given sequence. The sequence will be analysed by layers.
    A layer = Character extraction by alphabets/ranges.
    """

    results: list[tuple[str, float]] = []
    ignore_non_latin: bool = False

    sufficient_match_count: int = 0

    lg_inclusion_list = lg_inclusion.split(",") if lg_inclusion is not None else []
    if "Latin Based" in lg_inclusion_list:
        ignore_non_latin = True
        lg_inclusion_list.remove("Latin Based")

    for layer in alpha_unicode_split(decoded_sequence):
        sequence_frequencies: TypeCounter[str] = Counter(layer)
        most_common = sequence_frequencies.most_common()

        character_count: int = len(layer)

        if character_count <= TOO_SMALL_SEQUENCE:
            continue

        popular_character_ordered: list[str] = [c for c, o in most_common]

        for language in lg_inclusion_list or alphabet_languages(
            popular_character_ordered, ignore_non_latin
        ):
            ratio: float = characters_popularity_compare(
                language, popular_character_ordered
            )

            if ratio < threshold:
                continue
            elif ratio >= 0.8:
                sufficient_match_count += 1

            results.append((language, round(ratio, 4)))

            if sufficient_match_count >= 3:
                break

    return sorted(
        filter_alt_coherence_matches(results), key=lambda x: x[1], reverse=True
    )
|
||||
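The averaging step performed by `merge_coherence_ratios` can be tried out without the rest of the package. The sketch below is a standalone approximation under stated assumptions: `merge_ratios` is a hypothetical name, and plain lists of `(language, ratio)` tuples stand in for the `CoherenceMatches` type, whose definition is not shown in this diff.

```python
from __future__ import annotations


def merge_ratios(
    results: list[list[tuple[str, float]]],
) -> list[tuple[str, float]]:
    """Average the ratios seen for each language, best average first."""
    per_language: dict[str, list[float]] = {}

    # Collect every ratio reported for a language across all result lists.
    for result in results:
        for language, ratio in result:
            per_language.setdefault(language, []).append(ratio)

    # Mean per language, rounded to 4 decimals as in the function above.
    merged = [
        (language, round(sum(ratios) / len(ratios), 4))
        for language, ratios in per_language.items()
    ]

    return sorted(merged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    chunks = [
        [("English", 1.0), ("French", 0.25)],
        [("English", 0.5)],
    ]
    print(merge_ratios(chunks))
```

A language that appears in only one input list keeps its single ratio unchanged, so a language seen once with a high score can outrank one seen several times with mixed scores.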
8
aws-lambda/src/charset_normalizer/cli/__init__.py
Normal file
@@ -0,0 +1,8 @@
from __future__ import annotations

from .__main__ import cli_detect, query_yes_no

__all__ = (
    "cli_detect",
    "query_yes_no",
)
362
aws-lambda/src/charset_normalizer/cli/__main__.py
Normal file
@@ -0,0 +1,362 @@
from __future__ import annotations

import argparse
import sys
import typing
from json import dumps
from os.path import abspath, basename, dirname, join, realpath
from platform import python_version
from unicodedata import unidata_version

import charset_normalizer.md as md_module
from charset_normalizer import from_fp
from charset_normalizer.models import CliDetectionResult
from charset_normalizer.version import __version__


def query_yes_no(question: str, default: str = "yes") -> bool:  # Defensive:
    """Ask a yes/no question via input() and return the answer as a bool."""
    prompt = " [Y/n] " if default == "yes" else " [y/N] "

    while True:
        choice = input(question + prompt).strip().lower()
        if not choice:
            return default == "yes"
        if choice in ("y", "yes"):
            return True
        if choice in ("n", "no"):
            return False
        print("Please respond with 'y' or 'n'.")


class FileType:
    """Factory for creating file object types

    Instances of FileType are typically passed as type= arguments to the
    ArgumentParser add_argument() method.

    Keyword Arguments:
        - mode -- A string indicating how the file is to be opened. Accepts the
            same values as the builtin open() function.
        - bufsize -- The file's desired buffer size. Accepts the same values as
            the builtin open() function.
        - encoding -- The file's encoding. Accepts the same values as the
            builtin open() function.
        - errors -- A string indicating how encoding and decoding errors are to
            be handled. Accepts the same value as the builtin open() function.

    Backported from CPython 3.12
    """

    def __init__(
        self,
        mode: str = "r",
        bufsize: int = -1,
        encoding: str | None = None,
        errors: str | None = None,
    ):
        self._mode = mode
        self._bufsize = bufsize
        self._encoding = encoding
        self._errors = errors

    def __call__(self, string: str) -> typing.IO:  # type: ignore[type-arg]
        # the special argument "-" means sys.std{in,out}
        if string == "-":
            if "r" in self._mode:
                return sys.stdin.buffer if "b" in self._mode else sys.stdin
            elif any(c in self._mode for c in "wax"):
                return sys.stdout.buffer if "b" in self._mode else sys.stdout
            else:
                msg = f'argument "-" with mode {self._mode}'
                raise ValueError(msg)

        # all other arguments are used as file names
        try:
            return open(string, self._mode, self._bufsize, self._encoding, self._errors)
        except OSError as e:
            message = f"can't open '{string}': {e}"
            raise argparse.ArgumentTypeError(message)

    def __repr__(self) -> str:
        args = self._mode, self._bufsize
        kwargs = [("encoding", self._encoding), ("errors", self._errors)]
        args_str = ", ".join(
            [repr(arg) for arg in args if arg != -1]
            + [f"{kw}={arg!r}" for kw, arg in kwargs if arg is not None]
        )
        return f"{type(self).__name__}({args_str})"


def cli_detect(argv: list[str] | None = None) -> int:
    """
    CLI assistant using ARGV and ArgumentParser
    :param argv:
    :return: 0 if everything is fine, anything else signals trouble
    """
    parser = argparse.ArgumentParser(
        description="The Real First Universal Charset Detector. "
        "Discover originating encoding used on text file. "
        "Normalize text to unicode."
    )

    parser.add_argument(
        "files", type=FileType("rb"), nargs="+", help="File(s) to be analysed"
    )
    parser.add_argument(
        "-v",
        "--verbose",
        action="store_true",
        default=False,
        dest="verbose",
        help="Display complementary information about file if any. "
        "Stdout will contain logs about the detection process.",
    )
    parser.add_argument(
        "-a",
        "--with-alternative",
        action="store_true",
        default=False,
        dest="alternatives",
        help="Output complementary possibilities if any. Top-level JSON WILL be a list.",
    )
    parser.add_argument(
        "-n",
        "--normalize",
        action="store_true",
        default=False,
        dest="normalize",
        help="Permit to normalize input file. If not set, program does not write anything.",
    )
    parser.add_argument(
        "-m",
        "--minimal",
        action="store_true",
        default=False,
        dest="minimal",
        help="Only output the charset detected to STDOUT. Disabling JSON output.",
    )
    parser.add_argument(
        "-r",
        "--replace",
        action="store_true",
        default=False,
        dest="replace",
        help="Replace file when trying to normalize it instead of creating a new one.",
    )
    parser.add_argument(
        "-f",
        "--force",
        action="store_true",
        default=False,
        dest="force",
        help="Replace file without asking if you are sure, use this flag with caution.",
    )
    parser.add_argument(
        "-i",
        "--no-preemptive",
        action="store_true",
        default=False,
        dest="no_preemptive",
        help="Disable looking at a charset declaration to hint the detector.",
    )
    parser.add_argument(
        "-t",
        "--threshold",
        action="store",
        default=0.2,
        type=float,
        dest="threshold",
        help="Define a custom maximum amount of noise allowed in decoded content. 0. <= noise <= 1.",
    )
    parser.add_argument(
        "--version",
        action="version",
        version="Charset-Normalizer {} - Python {} - Unicode {} - SpeedUp {}".format(
            __version__,
            python_version(),
            unidata_version,
            "OFF" if md_module.__file__.lower().endswith(".py") else "ON",
        ),
        help="Show version information and exit.",
    )

    args = parser.parse_args(argv)

    if args.replace is True and args.normalize is False:
        if args.files:
            for my_file in args.files:
                my_file.close()
        print("Use --replace in addition of --normalize only.", file=sys.stderr)
        return 1

    if args.force is True and args.replace is False:
        if args.files:
            for my_file in args.files:
                my_file.close()
        print("Use --force in addition of --replace only.", file=sys.stderr)
        return 1

    if args.threshold < 0.0 or args.threshold > 1.0:
        if args.files:
            for my_file in args.files:
                my_file.close()
        print("--threshold VALUE should be between 0. AND 1.", file=sys.stderr)
        return 1

    x_ = []

    for my_file in args.files:
        matches = from_fp(
            my_file,
            threshold=args.threshold,
            explain=args.verbose,
            preemptive_behaviour=args.no_preemptive is False,
        )

        best_guess = matches.best()

        if best_guess is None:
            print(
                'Unable to identify originating encoding for "{}". {}'.format(
                    my_file.name,
                    (
                        "Maybe try increasing maximum amount of chaos."
                        if args.threshold < 1.0
                        else ""
                    ),
                ),
                file=sys.stderr,
            )
            x_.append(
                CliDetectionResult(
                    abspath(my_file.name),
                    None,
                    [],
                    [],
                    "Unknown",
                    [],
                    False,
                    1.0,
                    0.0,
                    None,
                    True,
                )
            )
        else:
            cli_result = CliDetectionResult(
                abspath(my_file.name),
                best_guess.encoding,
                best_guess.encoding_aliases,
                [
                    cp
                    for cp in best_guess.could_be_from_charset
                    if cp != best_guess.encoding
                ],
                best_guess.language,
                best_guess.alphabets,
                best_guess.bom,
                best_guess.percent_chaos,
                best_guess.percent_coherence,
                None,
                True,
            )
            x_.append(cli_result)

            if len(matches) > 1 and args.alternatives:
                for el in matches:
                    if el != best_guess:
                        x_.append(
                            CliDetectionResult(
                                abspath(my_file.name),
                                el.encoding,
                                el.encoding_aliases,
                                [
                                    cp
                                    for cp in el.could_be_from_charset
                                    if cp != el.encoding
                                ],
                                el.language,
                                el.alphabets,
                                el.bom,
                                el.percent_chaos,
                                el.percent_coherence,
                                None,
                                False,
                            )
                        )

            if args.normalize is True:
                if best_guess.encoding.startswith("utf") is True:
                    print(
                        '"{}" file does not need to be normalized, as it already came from unicode.'.format(
                            my_file.name
                        ),
                        file=sys.stderr,
                    )
                    if my_file.closed is False:
                        my_file.close()
                    continue

                dir_path = dirname(realpath(my_file.name))
                file_name = basename(realpath(my_file.name))

                o_: list[str] = file_name.split(".")

                if args.replace is False:
                    o_.insert(-1, best_guess.encoding)
                    if my_file.closed is False:
                        my_file.close()
                elif (
                    args.force is False
                    and query_yes_no(
                        'Are you sure to normalize "{}" by replacing it ?'.format(
                            my_file.name
                        ),
                        "no",
                    )
                    is False
                ):
                    if my_file.closed is False:
                        my_file.close()
                    continue

                try:
                    cli_result.unicode_path = join(dir_path, ".".join(o_))

                    with open(cli_result.unicode_path, "wb") as fp:
                        fp.write(best_guess.output())
                except OSError as e:  # Defensive:
                    print(str(e), file=sys.stderr)
                    if my_file.closed is False:
                        my_file.close()
                    return 2

        if my_file.closed is False:
            my_file.close()

    if args.minimal is False:
        print(
            dumps(
                [el.__dict__ for el in x_] if len(x_) > 1 else x_[0].__dict__,
                ensure_ascii=True,
                indent=4,
            )
        )
    else:
        for my_file in args.files:
            print(
                ", ".join(
                    [
                        el.encoding or "undefined"
                        for el in x_
                        if el.path == abspath(my_file.name)
                    ]
                )
            )

    return 0


if __name__ == "__main__":  # Defensive:
    cli_detect()
|
||||
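The flag-dependency checks at the top of `cli_detect` (`--replace` needs `--normalize`, `--force` needs `--replace`, and the threshold must lie in [0, 1]) can be exercised without opening any files. The sketch below is a reduced, hypothetical re-creation: `build_parser` and `check_flags` are illustrative names that are not part of the package, and only the four relevant options are declared.

```python
from __future__ import annotations

import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare only the options involved in the dependency checks."""
    parser = argparse.ArgumentParser(prog="normalizer-sketch")
    parser.add_argument("-n", "--normalize", action="store_true")
    parser.add_argument("-r", "--replace", action="store_true")
    parser.add_argument("-f", "--force", action="store_true")
    parser.add_argument("-t", "--threshold", type=float, default=0.2)
    return parser


def check_flags(argv: list[str]) -> int:
    """Return 0 when the flag combination is acceptable, 1 otherwise."""
    args = build_parser().parse_args(argv)

    # --replace only makes sense together with --normalize.
    if args.replace and not args.normalize:
        return 1
    # --force only makes sense together with --replace.
    if args.force and not args.replace:
        return 1
    # The noise threshold is a ratio and must stay within [0, 1].
    if not 0.0 <= args.threshold <= 1.0:
        return 1
    return 0


if __name__ == "__main__":
    for argv in (["-r"], ["-n", "-r", "-f"], ["-t", "1.5"]):
        print(argv, "->", check_flags(argv))
```

Returning an exit code instead of raising keeps the checks easy to test, which is the same design the function above follows with its `return 1` branches.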
Binary file not shown.
Binary file not shown.
2050
aws-lambda/src/charset_normalizer/constant.py
Normal file
File diff suppressed because it is too large
79
aws-lambda/src/charset_normalizer/legacy.py
Normal file
@@ -0,0 +1,79 @@
from __future__ import annotations

from typing import TYPE_CHECKING, Any
from warnings import warn

from .api import from_bytes
from .constant import CHARDET_CORRESPONDENCE, TOO_SMALL_SEQUENCE

if TYPE_CHECKING:
    from typing import TypedDict

    class ResultDict(TypedDict):
        encoding: str | None
        language: str
        confidence: float | None


def detect(
    byte_str: bytes, should_rename_legacy: bool = False, **kwargs: Any
) -> ResultDict:
    """
    chardet legacy method
    Detect the encoding of the given byte string. It should be mostly backward-compatible.
    Encoding names will match chardet's own spelling whenever possible (except for encoding names it does not support).
    This function is deprecated and is meant to ease the migration of your project; consult the documentation for
    further information. Not planned for removal.

    :param byte_str:     The byte sequence to examine.
    :param should_rename_legacy:  Should we rename legacy encodings
                                  to their more modern equivalents?
    """
    if len(kwargs):
        warn(
            f"charset-normalizer disregard arguments '{','.join(list(kwargs.keys()))}' in legacy function detect()"
        )

    if not isinstance(byte_str, (bytearray, bytes)):
        raise TypeError(  # pragma: nocover
            f"Expected object of type bytes or bytearray, got: {type(byte_str)}"
        )

    if isinstance(byte_str, bytearray):
        byte_str = bytes(byte_str)

    r = from_bytes(byte_str).best()

    encoding = r.encoding if r is not None else None
    language = r.language if r is not None and r.language != "Unknown" else ""
    confidence = 1.0 - r.chaos if r is not None else None

    # automatically lower confidence
    # on small bytes samples.
    # https://github.com/jawah/charset_normalizer/issues/391
    if (
        confidence is not None
        and confidence >= 0.9
        and encoding
        not in {
            "utf_8",
            "ascii",
        }
        and r.bom is False  # type: ignore[union-attr]
        and len(byte_str) < TOO_SMALL_SEQUENCE
    ):
        confidence -= 0.2

    # Note: CharsetNormalizer does not return 'UTF-8-SIG' because the signature is stripped during detection/normalization,
    # but chardet does return 'utf-8-sig' and it is a valid codec name.
    if r is not None and encoding == "utf_8" and r.bom:
        encoding += "_sig"

    if should_rename_legacy is False and encoding in CHARDET_CORRESPONDENCE:
        encoding = CHARDET_CORRESPONDENCE[encoding]

    return {
        "encoding": encoding,
        "language": language,
        "confidence": confidence,
    }
|
||||
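The confidence adjustment in `detect` is easier to reason about when isolated from the detector. The sketch below is a simplified stand-in, not the library's code: `legacy_confidence` is a hypothetical function, and the value 32 for `TOO_SMALL_SEQUENCE` is an assumption, since the diff of `constant.py` is suppressed on this page.

```python
from __future__ import annotations

# Assumed value for illustration; the real constant lives in constant.py.
TOO_SMALL_SEQUENCE = 32


def legacy_confidence(
    chaos: float, encoding: str, has_bom: bool, sample_length: int
) -> float:
    """Derive a chardet-style confidence from a chaos (noise) ratio."""
    confidence = 1.0 - chaos

    # A very confident guess on a tiny sample is penalised, unless the
    # encoding is utf_8/ascii or a BOM confirmed the guess.
    if (
        confidence >= 0.9
        and encoding not in {"utf_8", "ascii"}
        and has_bom is False
        and sample_length < TOO_SMALL_SEQUENCE
    ):
        confidence -= 0.2

    return confidence


if __name__ == "__main__":
    print(legacy_confidence(0.0, "cp1252", False, 10))
    print(legacy_confidence(0.0, "utf_8", False, 10))
```

The penalty only applies to guesses that would otherwise score 0.9 or more, so an already uncertain result on a short sample is left unchanged.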
BIN
aws-lambda/src/charset_normalizer/md.cp313-win_amd64.pyd
Normal file
Binary file not shown.
936
aws-lambda/src/charset_normalizer/md.py
Normal file
@@ -0,0 +1,936 @@
from __future__ import annotations

import sys
from functools import lru_cache
from logging import getLogger

if sys.version_info >= (3, 8):
    from typing import final
else:
    try:
        from typing_extensions import final
    except ImportError:

        def final(cls):  # type: ignore[misc,no-untyped-def]
            return cls


from .constant import (
    COMMON_CJK_CHARACTERS,
    COMMON_SAFE_ASCII_CHARACTERS,
    TRACE,
    UNICODE_SECONDARY_RANGE_KEYWORD,
    _ACCENTUATED,
    _ARABIC,
    _ARABIC_ISOLATED_FORM,
    _CJK,
    _HANGUL,
    _HIRAGANA,
    _KATAKANA,
    _LATIN,
    _THAI,
)
from .utils import (
    _character_flags,
    is_emoticon,
    is_punctuation,
    is_separator,
    is_symbol,
    remove_accent,
    unicode_range,
)

# Combined bitmask for CJK/Hangul/Katakana/Hiragana/Thai glyph detection.
_GLYPH_MASK: int = _CJK | _HANGUL | _KATAKANA | _HIRAGANA | _THAI


@final
class CharInfo:
    """Pre-computed character properties shared across all detectors.

    Instantiated once and reused via :meth:`update` on every character
    in the hot loop so that redundant calls to str methods
    (``isalpha``, ``isupper``, …) and cached utility functions
    (``_character_flags``, ``is_punctuation``, …) are avoided when
    several plugins need the same information.
    """

    __slots__ = (
        "character",
        "printable",
        "alpha",
        "upper",
        "lower",
        "space",
        "digit",
        "is_ascii",
        "case_variable",
        "flags",
        "accentuated",
        "latin",
        "is_cjk",
        "is_arabic",
        "is_glyph",
        "punct",
        "sym",
    )

    def __init__(self) -> None:
        self.character: str = ""
        self.printable: bool = False
        self.alpha: bool = False
        self.upper: bool = False
        self.lower: bool = False
        self.space: bool = False
|
||||
self.digit: bool = False
|
||||
self.is_ascii: bool = False
|
||||
self.case_variable: bool = False
|
||||
self.flags: int = 0
|
||||
self.accentuated: bool = False
|
||||
self.latin: bool = False
|
||||
self.is_cjk: bool = False
|
||||
self.is_arabic: bool = False
|
||||
self.is_glyph: bool = False
|
||||
self.punct: bool = False
|
||||
self.sym: bool = False
|
||||
|
||||
def update(self, character: str) -> None:
|
||||
"""Update all properties for *character* (called once per character)."""
|
||||
self.character = character
|
||||
|
||||
# ASCII fast-path: for characters with ord < 128, we can skip
|
||||
# _character_flags() entirely and derive most properties from ord.
|
||||
o: int = ord(character)
|
||||
if o < 128:
|
||||
self.is_ascii = True
|
||||
self.accentuated = False
|
||||
self.is_cjk = False
|
||||
self.is_arabic = False
|
||||
self.is_glyph = False
|
||||
# ASCII alpha: a-z (97-122) or A-Z (65-90)
|
||||
if 65 <= o <= 90:
|
||||
# Uppercase ASCII letter
|
||||
self.alpha = True
|
||||
self.upper = True
|
||||
self.lower = False
|
||||
self.space = False
|
||||
self.digit = False
|
||||
self.printable = True
|
||||
self.case_variable = True
|
||||
self.flags = _LATIN
|
||||
self.latin = True
|
||||
self.punct = False
|
||||
self.sym = False
|
||||
elif 97 <= o <= 122:
|
||||
# Lowercase ASCII letter
|
||||
self.alpha = True
|
||||
self.upper = False
|
||||
self.lower = True
|
||||
self.space = False
|
||||
self.digit = False
|
||||
self.printable = True
|
||||
self.case_variable = True
|
||||
self.flags = _LATIN
|
||||
self.latin = True
|
||||
self.punct = False
|
||||
self.sym = False
|
||||
elif 48 <= o <= 57:
|
||||
# ASCII digit 0-9
|
||||
self.alpha = False
|
||||
self.upper = False
|
||||
self.lower = False
|
||||
self.space = False
|
||||
self.digit = True
|
||||
self.printable = True
|
||||
self.case_variable = False
|
||||
self.flags = 0
|
||||
self.latin = False
|
||||
self.punct = False
|
||||
self.sym = False
|
||||
elif o == 32 or (9 <= o <= 13):
|
||||
# Space, tab, newline, etc.
|
||||
self.alpha = False
|
||||
self.upper = False
|
||||
self.lower = False
|
||||
self.space = True
|
||||
self.digit = False
|
||||
self.printable = o == 32
|
||||
self.case_variable = False
|
||||
self.flags = 0
|
||||
self.latin = False
|
||||
self.punct = False
|
||||
self.sym = False
|
||||
else:
|
||||
# Other ASCII (punctuation, symbols, control chars)
|
||||
self.printable = character.isprintable()
|
||||
self.alpha = False
|
||||
self.upper = False
|
||||
self.lower = False
|
||||
self.space = False
|
||||
self.digit = False
|
||||
self.case_variable = False
|
||||
self.flags = 0
|
||||
self.latin = False
|
||||
self.punct = is_punctuation(character) if self.printable else False
|
||||
self.sym = is_symbol(character) if self.printable else False
|
||||
else:
|
||||
# Non-ASCII path
|
||||
self.is_ascii = False
|
||||
self.printable = character.isprintable()
|
||||
self.alpha = character.isalpha()
|
||||
self.upper = character.isupper()
|
||||
self.lower = character.islower()
|
||||
self.space = character.isspace()
|
||||
self.digit = character.isdigit()
|
||||
self.case_variable = self.lower != self.upper
|
||||
|
||||
# Flag-based classification (single unicodedata.name() call, lru-cached)
|
||||
flags: int
|
||||
if self.alpha:
|
||||
flags = _character_flags(character)
|
||||
else:
|
||||
flags = 0
|
||||
self.flags = flags
|
||||
self.accentuated = bool(flags & _ACCENTUATED)
|
||||
self.latin = bool(flags & _LATIN)
|
||||
self.is_cjk = bool(flags & _CJK)
|
||||
self.is_arabic = bool(flags & _ARABIC)
|
||||
self.is_glyph = bool(flags & _GLYPH_MASK)
|
||||
|
||||
# Eagerly compute punct and sym (avoids property dispatch overhead
|
||||
# on 300K+ accesses in the hot loop).
|
||||
self.punct = is_punctuation(character) if self.printable else False
|
||||
self.sym = is_symbol(character) if self.printable else False
|
||||
|
||||
|
||||
class MessDetectorPlugin:
|
||||
"""
|
||||
Base abstract class used for mess detection plugins.
|
||||
All detectors MUST extend and implement given methods.
|
||||
"""
|
||||
|
||||
__slots__ = ()
|
||||
|
||||
def feed_info(self, character: str, info: CharInfo) -> None:
|
||||
"""
|
||||
The main routine to be executed upon character.
|
||||
Insert the logic by which the text would be considered chaotic.
|
||||
"""
|
||||
raise NotImplementedError # Defensive:
|
||||
|
||||
def reset(self) -> None: # Defensive:
|
||||
"""
|
||||
Reset the plugin to its initial state.
|
||||
"""
|
||||
raise NotImplementedError
|
||||
|
||||
@property
|
||||
def ratio(self) -> float:
|
||||
"""
|
||||
Compute the chaos ratio based on what feed_info() has seen.
|
||||
Must NOT be lower than 0.0; there is no upper bound.
|
||||
"""
|
||||
raise NotImplementedError # Defensive:
|
||||
|
||||
|
||||
@final
|
||||
class TooManySymbolOrPunctuationPlugin(MessDetectorPlugin):
|
||||
__slots__ = (
|
||||
"_punctuation_count",
|
||||
"_symbol_count",
|
||||
"_character_count",
|
||||
"_last_printable_char",
|
||||
"_frenzy_symbol_in_word",
|
||||
)
|
||||
|
||||
def __init__(self) -> None:
|
||||
self._punctuation_count: int = 0
|
||||
self._symbol_count: int = 0
|
||||
self._character_count: int = 0
|
||||
|
||||
self._last_printable_char: str | None = None
|
||||
self._frenzy_symbol_in_word: bool = False
|
||||
|
||||
def feed_info(self, character: str, info: CharInfo) -> None:
|
||||
"""Optimized feed using pre-computed character info."""
|
||||
self._character_count += 1
|
||||
|
||||
if (
|
||||
character != self._last_printable_char
|
||||
and character not in COMMON_SAFE_ASCII_CHARACTERS
|
||||
):
|
||||
if info.punct:
|
||||
self._punctuation_count += 1
|
||||
elif not info.digit and info.sym and not is_emoticon(character):
|
||||
self._symbol_count += 2
|
||||
|
||||
self._last_printable_char = character
|
||||
|
||||
def reset(self) -> None: # Abstract
|
||||
self._punctuation_count = 0
|
||||
self._character_count = 0
|
||||
self._symbol_count = 0
|
||||
|
||||
@property
|
||||
def ratio(self) -> float:
|
||||
if self._character_count == 0:
|
||||
return 0.0
|
||||
|
||||
ratio_of_punctuation: float = (
|
||||
self._punctuation_count + self._symbol_count
|
||||
) / self._character_count
|
||||
|
||||
return ratio_of_punctuation if ratio_of_punctuation >= 0.3 else 0.0
|
||||
|
||||
|
||||
@final
|
||||
class TooManyAccentuatedPlugin(MessDetectorPlugin):
|
||||
__slots__ = ("_character_count", "_accentuated_count")
|
||||
|
||||
def __init__(self) -> None:
|
||||
self._character_count: int = 0
|
||||
self._accentuated_count: int = 0
|
||||
|
||||
def feed_info(self, character: str, info: CharInfo) -> None:
|
||||
"""Optimized feed using pre-computed character info."""
|
||||
self._character_count += 1
|
||||
|
||||
if info.accentuated:
|
||||
self._accentuated_count += 1
|
||||
|
||||
def reset(self) -> None: # Abstract
|
||||
self._character_count = 0
|
||||
self._accentuated_count = 0
|
||||
|
||||
@property
|
||||
def ratio(self) -> float:
|
||||
if self._character_count < 8:
|
||||
return 0.0
|
||||
|
||||
ratio_of_accentuation: float = self._accentuated_count / self._character_count
|
||||
return ratio_of_accentuation if ratio_of_accentuation >= 0.35 else 0.0
|
||||
|
||||
|
||||
@final
|
||||
class UnprintablePlugin(MessDetectorPlugin):
|
||||
__slots__ = ("_unprintable_count", "_character_count")
|
||||
|
||||
def __init__(self) -> None:
|
||||
self._unprintable_count: int = 0
|
||||
self._character_count: int = 0
|
||||
|
||||
def feed_info(self, character: str, info: CharInfo) -> None:
|
||||
"""Optimized feed using pre-computed character info."""
|
||||
if (
|
||||
not info.space
|
||||
and not info.printable
|
||||
and character != "\x1a"
|
||||
and character != "\ufeff"
|
||||
):
|
||||
self._unprintable_count += 1
|
||||
self._character_count += 1
|
||||
|
||||
def reset(self) -> None: # Abstract
|
||||
self._unprintable_count = 0
self._character_count = 0
|
||||
|
||||
@property
|
||||
def ratio(self) -> float:
|
||||
if self._character_count == 0: # Defensive:
|
||||
return 0.0
|
||||
|
||||
return (self._unprintable_count * 8) / self._character_count
|
||||
|
||||
|
||||
@final
|
||||
class SuspiciousDuplicateAccentPlugin(MessDetectorPlugin):
|
||||
__slots__ = (
|
||||
"_successive_count",
|
||||
"_character_count",
|
||||
"_last_latin_character",
|
||||
"_last_was_accentuated",
|
||||
)
|
||||
|
||||
def __init__(self) -> None:
|
||||
self._successive_count: int = 0
|
||||
self._character_count: int = 0
|
||||
|
||||
self._last_latin_character: str | None = None
|
||||
self._last_was_accentuated: bool = False
|
||||
|
||||
def feed_info(self, character: str, info: CharInfo) -> None:
|
||||
"""Optimized feed using pre-computed character info."""
|
||||
self._character_count += 1
|
||||
if (
|
||||
self._last_latin_character is not None
|
||||
and info.accentuated
|
||||
and self._last_was_accentuated
|
||||
):
|
||||
if info.upper and self._last_latin_character.isupper():
|
||||
self._successive_count += 1
|
||||
if remove_accent(character) == remove_accent(self._last_latin_character):
|
||||
self._successive_count += 1
|
||||
self._last_latin_character = character
|
||||
self._last_was_accentuated = info.accentuated
|
||||
|
||||
def reset(self) -> None: # Abstract
|
||||
self._successive_count = 0
|
||||
self._character_count = 0
|
||||
self._last_latin_character = None
|
||||
self._last_was_accentuated = False
|
||||
|
||||
@property
|
||||
def ratio(self) -> float:
|
||||
if self._character_count == 0:
|
||||
return 0.0
|
||||
|
||||
return (self._successive_count * 2) / self._character_count
|
||||
|
||||
|
||||
@final
|
||||
class SuspiciousRange(MessDetectorPlugin):
|
||||
__slots__ = (
|
||||
"_suspicious_successive_range_count",
|
||||
"_character_count",
|
||||
"_last_printable_seen",
|
||||
"_last_printable_range",
|
||||
)
|
||||
|
||||
def __init__(self) -> None:
|
||||
self._suspicious_successive_range_count: int = 0
|
||||
self._character_count: int = 0
|
||||
self._last_printable_seen: str | None = None
|
||||
self._last_printable_range: str | None = None
|
||||
|
||||
def feed_info(self, character: str, info: CharInfo) -> None:
|
||||
"""Optimized feed using pre-computed character info."""
|
||||
self._character_count += 1
|
||||
|
||||
if info.space or info.punct or character in COMMON_SAFE_ASCII_CHARACTERS:
|
||||
self._last_printable_seen = None
|
||||
self._last_printable_range = None
|
||||
return
|
||||
|
||||
if self._last_printable_seen is None:
|
||||
self._last_printable_seen = character
|
||||
self._last_printable_range = unicode_range(character)
|
||||
return
|
||||
|
||||
unicode_range_a: str | None = self._last_printable_range
|
||||
unicode_range_b: str | None = unicode_range(character)
|
||||
|
||||
if is_suspiciously_successive_range(unicode_range_a, unicode_range_b):
|
||||
self._suspicious_successive_range_count += 1
|
||||
|
||||
self._last_printable_seen = character
|
||||
self._last_printable_range = unicode_range_b
|
||||
|
||||
def reset(self) -> None: # Abstract
|
||||
self._character_count = 0
|
||||
self._suspicious_successive_range_count = 0
|
||||
self._last_printable_seen = None
|
||||
self._last_printable_range = None
|
||||
|
||||
@property
|
||||
def ratio(self) -> float:
|
||||
if self._character_count <= 13:
|
||||
return 0.0
|
||||
|
||||
ratio_of_suspicious_range_usage: float = (
|
||||
self._suspicious_successive_range_count * 2
|
||||
) / self._character_count
|
||||
|
||||
return ratio_of_suspicious_range_usage
|
||||
|
||||
|
||||
@final
|
||||
class SuperWeirdWordPlugin(MessDetectorPlugin):
|
||||
__slots__ = (
|
||||
"_word_count",
|
||||
"_bad_word_count",
|
||||
"_foreign_long_count",
|
||||
"_is_current_word_bad",
|
||||
"_foreign_long_watch",
|
||||
"_character_count",
|
||||
"_bad_character_count",
|
||||
"_buffer_length",
|
||||
"_buffer_last_char",
|
||||
"_buffer_last_char_accentuated",
|
||||
"_buffer_accent_count",
|
||||
"_buffer_glyph_count",
|
||||
"_buffer_upper_count",
|
||||
)
|
||||
|
||||
def __init__(self) -> None:
|
||||
self._word_count: int = 0
|
||||
self._bad_word_count: int = 0
|
||||
self._foreign_long_count: int = 0
|
||||
|
||||
self._is_current_word_bad: bool = False
|
||||
self._foreign_long_watch: bool = False
|
||||
|
||||
self._character_count: int = 0
|
||||
self._bad_character_count: int = 0
|
||||
|
||||
self._buffer_length: int = 0
|
||||
self._buffer_last_char: str | None = None
|
||||
self._buffer_last_char_accentuated: bool = False
|
||||
self._buffer_accent_count: int = 0
|
||||
self._buffer_glyph_count: int = 0
|
||||
self._buffer_upper_count: int = 0
|
||||
|
||||
def feed_info(self, character: str, info: CharInfo) -> None:
|
||||
"""Optimized feed using pre-computed character info."""
|
||||
if info.alpha:
|
||||
self._buffer_length += 1
|
||||
self._buffer_last_char = character
|
||||
|
||||
if info.upper:
|
||||
self._buffer_upper_count += 1
|
||||
|
||||
self._buffer_last_char_accentuated = info.accentuated
|
||||
|
||||
if info.accentuated:
|
||||
self._buffer_accent_count += 1
|
||||
if (
|
||||
not self._foreign_long_watch
|
||||
and (not info.latin or info.accentuated)
|
||||
and not info.is_glyph
|
||||
):
|
||||
self._foreign_long_watch = True
|
||||
if info.is_glyph:
|
||||
self._buffer_glyph_count += 1
|
||||
return
|
||||
if not self._buffer_length:
|
||||
return
|
||||
if info.space or info.punct or is_separator(character):
|
||||
self._word_count += 1
|
||||
buffer_length: int = self._buffer_length
|
||||
|
||||
self._character_count += buffer_length
|
||||
|
||||
if buffer_length >= 4:
|
||||
if self._buffer_accent_count / buffer_length >= 0.5:
|
||||
self._is_current_word_bad = True
|
||||
elif (
|
||||
self._buffer_last_char_accentuated
|
||||
and self._buffer_last_char.isupper() # type: ignore[union-attr]
|
||||
and self._buffer_upper_count != buffer_length
|
||||
):
|
||||
self._foreign_long_count += 1
|
||||
self._is_current_word_bad = True
|
||||
elif self._buffer_glyph_count == 1:
|
||||
self._is_current_word_bad = True
|
||||
self._foreign_long_count += 1
|
||||
if buffer_length >= 24 and self._foreign_long_watch:
|
||||
probable_camel_cased: bool = (
|
||||
self._buffer_upper_count > 0
|
||||
and self._buffer_upper_count / buffer_length <= 0.3
|
||||
)
|
||||
|
||||
if not probable_camel_cased:
|
||||
self._foreign_long_count += 1
|
||||
self._is_current_word_bad = True
|
||||
|
||||
if self._is_current_word_bad:
|
||||
self._bad_word_count += 1
|
||||
self._bad_character_count += buffer_length
|
||||
self._is_current_word_bad = False
|
||||
|
||||
self._foreign_long_watch = False
|
||||
self._buffer_length = 0
|
||||
self._buffer_last_char = None
|
||||
self._buffer_last_char_accentuated = False
|
||||
self._buffer_accent_count = 0
|
||||
self._buffer_glyph_count = 0
|
||||
self._buffer_upper_count = 0
|
||||
elif (
|
||||
character not in {"<", ">", "-", "=", "~", "|", "_"}
|
||||
and not info.digit
|
||||
and info.sym
|
||||
):
|
||||
self._is_current_word_bad = True
|
||||
self._buffer_length += 1
|
||||
self._buffer_last_char = character
|
||||
self._buffer_last_char_accentuated = False
|
||||
|
||||
def reset(self) -> None: # Abstract
|
||||
self._buffer_length = 0
|
||||
self._buffer_last_char = None
|
||||
self._buffer_last_char_accentuated = False
|
||||
self._is_current_word_bad = False
|
||||
self._foreign_long_watch = False
|
||||
self._bad_word_count = 0
|
||||
self._word_count = 0
|
||||
self._character_count = 0
|
||||
self._bad_character_count = 0
|
||||
self._foreign_long_count = 0
|
||||
self._buffer_accent_count = 0
|
||||
self._buffer_glyph_count = 0
|
||||
self._buffer_upper_count = 0
|
||||
|
||||
@property
|
||||
def ratio(self) -> float:
|
||||
if self._word_count <= 10 and self._foreign_long_count == 0:
|
||||
return 0.0
|
||||
|
||||
return self._bad_character_count / self._character_count
|
||||
|
||||
|
||||
@final
|
||||
class CjkUncommonPlugin(MessDetectorPlugin):
|
||||
"""
|
||||
Detect messy CJK text that probably means nothing.
|
||||
"""
|
||||
|
||||
__slots__ = ("_character_count", "_uncommon_count")
|
||||
|
||||
def __init__(self) -> None:
|
||||
self._character_count: int = 0
|
||||
self._uncommon_count: int = 0
|
||||
|
||||
def feed_info(self, character: str, info: CharInfo) -> None:
|
||||
"""Optimized feed using pre-computed character info."""
|
||||
self._character_count += 1
|
||||
|
||||
if character not in COMMON_CJK_CHARACTERS:
|
||||
self._uncommon_count += 1
|
||||
|
||||
def reset(self) -> None: # Abstract
|
||||
self._character_count = 0
|
||||
self._uncommon_count = 0
|
||||
|
||||
@property
|
||||
def ratio(self) -> float:
|
||||
if self._character_count < 8:
|
||||
return 0.0
|
||||
|
||||
uncommon_form_usage: float = self._uncommon_count / self._character_count
|
||||
|
||||
# We can be pretty sure it's garbage when uncommon characters are widely
|
||||
# used. Otherwise it could just be Traditional Chinese, for example.
|
||||
return uncommon_form_usage / 10 if uncommon_form_usage > 0.5 else 0.0
|
||||
|
||||
|
||||
@final
|
||||
class ArchaicUpperLowerPlugin(MessDetectorPlugin):
|
||||
__slots__ = (
|
||||
"_buf",
|
||||
"_character_count_since_last_sep",
|
||||
"_successive_upper_lower_count",
|
||||
"_successive_upper_lower_count_final",
|
||||
"_character_count",
|
||||
"_last_alpha_seen",
|
||||
"_last_alpha_seen_upper",
|
||||
"_last_alpha_seen_lower",
|
||||
"_current_ascii_only",
|
||||
)
|
||||
|
||||
def __init__(self) -> None:
|
||||
self._buf: bool = False
|
||||
|
||||
self._character_count_since_last_sep: int = 0
|
||||
|
||||
self._successive_upper_lower_count: int = 0
|
||||
self._successive_upper_lower_count_final: int = 0
|
||||
|
||||
self._character_count: int = 0
|
||||
|
||||
self._last_alpha_seen: str | None = None
|
||||
self._last_alpha_seen_upper: bool = False
|
||||
self._last_alpha_seen_lower: bool = False
|
||||
self._current_ascii_only: bool = True
|
||||
|
||||
def feed_info(self, character: str, info: CharInfo) -> None:
|
||||
"""Optimized feed using pre-computed character info."""
|
||||
is_concerned: bool = info.alpha and info.case_variable
|
||||
chunk_sep: bool = not is_concerned
|
||||
|
||||
if chunk_sep and self._character_count_since_last_sep > 0:
|
||||
if (
|
||||
self._character_count_since_last_sep <= 64
|
||||
and not info.digit
|
||||
and not self._current_ascii_only
|
||||
):
|
||||
self._successive_upper_lower_count_final += (
|
||||
self._successive_upper_lower_count
|
||||
)
|
||||
|
||||
self._successive_upper_lower_count = 0
|
||||
self._character_count_since_last_sep = 0
|
||||
self._last_alpha_seen = None
|
||||
self._buf = False
|
||||
self._character_count += 1
|
||||
self._current_ascii_only = True
|
||||
|
||||
return
|
||||
|
||||
if self._current_ascii_only and not info.is_ascii:
|
||||
self._current_ascii_only = False
|
||||
|
||||
if self._last_alpha_seen is not None:
|
||||
if (info.upper and self._last_alpha_seen_lower) or (
|
||||
info.lower and self._last_alpha_seen_upper
|
||||
):
|
||||
if self._buf:
|
||||
self._successive_upper_lower_count += 2
|
||||
self._buf = False
|
||||
else:
|
||||
self._buf = True
|
||||
else:
|
||||
self._buf = False
|
||||
|
||||
self._character_count += 1
|
||||
self._character_count_since_last_sep += 1
|
||||
self._last_alpha_seen = character
|
||||
self._last_alpha_seen_upper = info.upper
|
||||
self._last_alpha_seen_lower = info.lower
|
||||
|
||||
def reset(self) -> None: # Abstract
|
||||
self._character_count = 0
|
||||
self._character_count_since_last_sep = 0
|
||||
self._successive_upper_lower_count = 0
|
||||
self._successive_upper_lower_count_final = 0
|
||||
self._last_alpha_seen = None
|
||||
self._last_alpha_seen_upper = False
|
||||
self._last_alpha_seen_lower = False
|
||||
self._buf = False
|
||||
self._current_ascii_only = True
|
||||
|
||||
@property
|
||||
def ratio(self) -> float:
|
||||
if self._character_count == 0: # Defensive:
|
||||
return 0.0
|
||||
|
||||
return self._successive_upper_lower_count_final / self._character_count
|
||||
|
||||
|
||||
@final
|
||||
class ArabicIsolatedFormPlugin(MessDetectorPlugin):
|
||||
__slots__ = ("_character_count", "_isolated_form_count")
|
||||
|
||||
def __init__(self) -> None:
|
||||
self._character_count: int = 0
|
||||
self._isolated_form_count: int = 0
|
||||
|
||||
def reset(self) -> None: # Abstract
|
||||
self._character_count = 0
|
||||
self._isolated_form_count = 0
|
||||
|
||||
def feed_info(self, character: str, info: CharInfo) -> None:
|
||||
"""Optimized feed using pre-computed character info."""
|
||||
self._character_count += 1
|
||||
|
||||
if info.flags & _ARABIC_ISOLATED_FORM:
|
||||
self._isolated_form_count += 1
|
||||
|
||||
@property
|
||||
def ratio(self) -> float:
|
||||
if self._character_count < 8:
|
||||
return 0.0
|
||||
|
||||
isolated_form_usage: float = self._isolated_form_count / self._character_count
|
||||
|
||||
return isolated_form_usage
|
||||
|
||||
|
||||
@lru_cache(maxsize=1024)
|
||||
def is_suspiciously_successive_range(
|
||||
unicode_range_a: str | None, unicode_range_b: str | None
|
||||
) -> bool:
|
||||
"""
|
||||
Determine whether two Unicode ranges seen next to each other can be considered suspicious.
|
||||
"""
|
||||
if unicode_range_a is None or unicode_range_b is None:
|
||||
return True
|
||||
|
||||
if unicode_range_a == unicode_range_b:
|
||||
return False
|
||||
|
||||
if "Latin" in unicode_range_a and "Latin" in unicode_range_b:
|
||||
return False
|
||||
|
||||
if "Emoticons" in unicode_range_a or "Emoticons" in unicode_range_b:
|
||||
return False
|
||||
|
||||
# Latin characters can be accompanied with a combining diacritical mark
|
||||
# eg. Vietnamese.
|
||||
if ("Latin" in unicode_range_a or "Latin" in unicode_range_b) and (
|
||||
"Combining" in unicode_range_a or "Combining" in unicode_range_b
|
||||
):
|
||||
return False
|
||||
|
||||
keywords_range_a, keywords_range_b = (
|
||||
unicode_range_a.split(" "),
|
||||
unicode_range_b.split(" "),
|
||||
)
|
||||
|
||||
for el in keywords_range_a:
|
||||
if el in UNICODE_SECONDARY_RANGE_KEYWORD:
|
||||
continue
|
||||
if el in keywords_range_b:
|
||||
return False
|
||||
|
||||
# Japanese Exception
|
||||
range_a_jp_chars, range_b_jp_chars = (
|
||||
unicode_range_a
|
||||
in (
|
||||
"Hiragana",
|
||||
"Katakana",
|
||||
),
|
||||
unicode_range_b in ("Hiragana", "Katakana"),
|
||||
)
|
||||
if (range_a_jp_chars or range_b_jp_chars) and (
|
||||
"CJK" in unicode_range_a or "CJK" in unicode_range_b
|
||||
):
|
||||
return False
|
||||
if range_a_jp_chars and range_b_jp_chars:
|
||||
return False
|
||||
|
||||
if "Hangul" in unicode_range_a or "Hangul" in unicode_range_b:
|
||||
if "CJK" in unicode_range_a or "CJK" in unicode_range_b:
|
||||
return False
|
||||
if unicode_range_a == "Basic Latin" or unicode_range_b == "Basic Latin":
|
||||
return False
|
||||
|
||||
# Chinese/Japanese use dedicated range for punctuation and/or separators.
|
||||
if ("CJK" in unicode_range_a or "CJK" in unicode_range_b) or (
|
||||
unicode_range_a in ["Katakana", "Hiragana"]
|
||||
and unicode_range_b in ["Katakana", "Hiragana"]
|
||||
):
|
||||
if "Punctuation" in unicode_range_a or "Punctuation" in unicode_range_b:
|
||||
return False
|
||||
if "Forms" in unicode_range_a or "Forms" in unicode_range_b:
|
||||
return False
|
||||
if unicode_range_a == "Basic Latin" or unicode_range_b == "Basic Latin":
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
|
||||
@lru_cache(maxsize=2048)
|
||||
def mess_ratio(
|
||||
decoded_sequence: str, maximum_threshold: float = 0.2, debug: bool = False
|
||||
) -> float:
|
||||
"""
|
||||
Compute a mess ratio for a decoded byte sequence. Reaching the maximum threshold stops the computation early.
|
||||
"""
|
||||
|
||||
seq_len: int = len(decoded_sequence)
|
||||
|
||||
if seq_len < 511:
|
||||
step: int = 32
|
||||
elif seq_len < 1024:
|
||||
step = 64
|
||||
else:
|
||||
step = 128
|
||||
|
||||
# Create each detector as a named local variable (unrolled from the generic loop).
|
||||
# This eliminates per-character iteration over the detector list and
|
||||
# per-character eligible() virtual dispatch, while keeping every plugin class
|
||||
# intact and fully readable.
|
||||
d_sp: TooManySymbolOrPunctuationPlugin = TooManySymbolOrPunctuationPlugin()
|
||||
d_ta: TooManyAccentuatedPlugin = TooManyAccentuatedPlugin()
|
||||
d_up: UnprintablePlugin = UnprintablePlugin()
|
||||
d_sda: SuspiciousDuplicateAccentPlugin = SuspiciousDuplicateAccentPlugin()
|
||||
d_sr: SuspiciousRange = SuspiciousRange()
|
||||
d_sw: SuperWeirdWordPlugin = SuperWeirdWordPlugin()
|
||||
d_cu: CjkUncommonPlugin = CjkUncommonPlugin()
|
||||
d_au: ArchaicUpperLowerPlugin = ArchaicUpperLowerPlugin()
|
||||
d_ai: ArabicIsolatedFormPlugin = ArabicIsolatedFormPlugin()
|
||||
|
||||
# Local references for feed_info methods called in the hot loop.
|
||||
d_sp_feed = d_sp.feed_info
|
||||
d_ta_feed = d_ta.feed_info
|
||||
d_up_feed = d_up.feed_info
|
||||
d_sda_feed = d_sda.feed_info
|
||||
d_sr_feed = d_sr.feed_info
|
||||
d_sw_feed = d_sw.feed_info
|
||||
d_cu_feed = d_cu.feed_info
|
||||
d_au_feed = d_au.feed_info
|
||||
d_ai_feed = d_ai.feed_info
|
||||
|
||||
# Single reusable CharInfo object (avoids per-character allocation).
|
||||
info: CharInfo = CharInfo()
|
||||
info_update = info.update
|
||||
|
||||
mean_mess_ratio: float
|
||||
|
||||
for block_start in range(0, seq_len, step):
|
||||
for character in decoded_sequence[block_start : block_start + step]:
|
||||
# Pre-compute all character properties once (shared across all plugins).
|
||||
info_update(character)
|
||||
|
||||
# Detectors with eligible() == always True
|
||||
d_up_feed(character, info)
|
||||
d_sw_feed(character, info)
|
||||
d_au_feed(character, info)
|
||||
|
||||
# Detectors with eligible() == isprintable
|
||||
if info.printable:
|
||||
d_sp_feed(character, info)
|
||||
d_sr_feed(character, info)
|
||||
|
||||
# Detectors with eligible() == isalpha
|
||||
if info.alpha:
|
||||
d_ta_feed(character, info)
|
||||
# SuspiciousDuplicateAccent: isalpha() and is_latin()
|
||||
if info.latin:
|
||||
d_sda_feed(character, info)
|
||||
# CjkUncommon: is_cjk()
|
||||
if info.is_cjk:
|
||||
d_cu_feed(character, info)
|
||||
# ArabicIsolatedForm: is_arabic()
|
||||
if info.is_arabic:
|
||||
d_ai_feed(character, info)
|
||||
|
||||
mean_mess_ratio = (
|
||||
d_sp.ratio
|
||||
+ d_ta.ratio
|
||||
+ d_up.ratio
|
||||
+ d_sda.ratio
|
||||
+ d_sr.ratio
|
||||
+ d_sw.ratio
|
||||
+ d_cu.ratio
|
||||
+ d_au.ratio
|
||||
+ d_ai.ratio
|
||||
)
|
||||
|
||||
if mean_mess_ratio >= maximum_threshold:
|
||||
break
|
||||
else:
|
||||
# Flush last word buffer in SuperWeirdWordPlugin via trailing newline.
|
||||
info_update("\n")
|
||||
d_sw_feed("\n", info)
|
||||
d_au_feed("\n", info)
|
||||
d_up_feed("\n", info)
|
||||
|
||||
mean_mess_ratio = (
|
||||
d_sp.ratio
|
||||
+ d_ta.ratio
|
||||
+ d_up.ratio
|
||||
+ d_sda.ratio
|
||||
+ d_sr.ratio
|
||||
+ d_sw.ratio
|
||||
+ d_cu.ratio
|
||||
+ d_au.ratio
|
||||
+ d_ai.ratio
|
||||
)
|
||||
|
||||
if debug: # Defensive:
|
||||
logger = getLogger("charset_normalizer")
|
||||
|
||||
logger.log(
|
||||
TRACE,
|
||||
"Mess-detector extended-analysis start. "
|
||||
f"intermediary_mean_mess_ratio_calc={step} mean_mess_ratio={mean_mess_ratio} "
|
||||
f"maximum_threshold={maximum_threshold}",
|
||||
)
|
||||
|
||||
if seq_len > 16:
|
||||
logger.log(TRACE, f"Starting with: {decoded_sequence[:16]}")
|
||||
logger.log(TRACE, f"Ending with: {decoded_sequence[-16::]}")
|
||||
|
||||
for dt in [d_sp, d_ta, d_up, d_sda, d_sr, d_sw, d_cu, d_au, d_ai]:
|
||||
logger.log(TRACE, f"{dt.__class__}: {dt.ratio}")
|
||||
|
||||
return round(mean_mess_ratio, 3)
|
||||
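The control flow of `mess_ratio` above (feed one block of characters, read the detector ratios, stop once the threshold is reached) can be sketched with a single stand-in detector. `UnprintableCounter` and `blockwise_ratio` are hypothetical names introduced for illustration; the sketch uses a fixed block size and leaves out the other eight detectors and the trailing-newline flush.

```python
class UnprintableCounter:
    """Stand-in detector using the same weighting as UnprintablePlugin."""

    def __init__(self):
        self.unprintable = 0
        self.total = 0

    def feed(self, character):
        if (
            not character.isspace()
            and not character.isprintable()
            and character not in {"\x1a", "\ufeff"}
        ):
            self.unprintable += 1
        self.total += 1

    @property
    def ratio(self):
        # Each unprintable character weighs eight times a regular one.
        return (self.unprintable * 8) / self.total if self.total else 0.0


def blockwise_ratio(text, maximum_threshold=0.2, step=32):
    """Feed the text block by block and stop as soon as the threshold is hit."""
    detector = UnprintableCounter()
    ratio = 0.0
    for start in range(0, len(text), step):
        for character in text[start:start + step]:
            detector.feed(character)
        ratio = detector.ratio
        if ratio >= maximum_threshold:
            break  # early exit: the rest of the text is never inspected
    return round(ratio, 3)
```

Checking the ratio once per block rather than once per character keeps the threshold test cheap, at the cost of reading up to one extra block of input after the text has already become too messy.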
359
aws-lambda/src/charset_normalizer/models.py
Normal file
@@ -0,0 +1,359 @@
|
||||
from __future__ import annotations
|
||||
|
||||
from encodings.aliases import aliases
|
||||
from json import dumps
|
||||
from re import sub
|
||||
from typing import Any, Iterator, List, Tuple
|
||||
|
||||
from .constant import RE_POSSIBLE_ENCODING_INDICATION, TOO_BIG_SEQUENCE
|
||||
from .utils import iana_name, is_multi_byte_encoding, unicode_range
|
||||
|
||||
|
||||
class CharsetMatch:
|
||||
def __init__(
|
||||
self,
|
||||
payload: bytes | bytearray,
|
||||
guessed_encoding: str,
|
||||
mean_mess_ratio: float,
|
||||
has_sig_or_bom: bool,
|
||||
languages: CoherenceMatches,
|
||||
decoded_payload: str | None = None,
|
||||
preemptive_declaration: str | None = None,
|
||||
):
|
||||
self._payload: bytes | bytearray = payload
|
||||
|
||||
self._encoding: str = guessed_encoding
|
||||
self._mean_mess_ratio: float = mean_mess_ratio
|
||||
self._languages: CoherenceMatches = languages
|
||||
self._has_sig_or_bom: bool = has_sig_or_bom
|
||||
self._unicode_ranges: list[str] | None = None
|
||||
|
||||
self._leaves: list[CharsetMatch] = []
|
||||
self._mean_coherence_ratio: float = 0.0
|
||||
|
||||
self._output_payload: bytes | None = None
|
||||
self._output_encoding: str | None = None
|
||||
|
||||
self._string: str | None = decoded_payload
|
||||
|
||||
self._preemptive_declaration: str | None = preemptive_declaration
|
||||
|
||||
def __eq__(self, other: object) -> bool:
|
||||
if not isinstance(other, CharsetMatch):
|
||||
if isinstance(other, str):
|
||||
return iana_name(other) == self.encoding
|
||||
return False
|
||||
return self.encoding == other.encoding and self.fingerprint == other.fingerprint
|
||||
|
||||
def __lt__(self, other: object) -> bool:
|
||||
"""
|
||||
Implemented so that sorted() works on CharsetMatches items.
|
||||
"""
|
||||
if not isinstance(other, CharsetMatch):
|
||||
raise ValueError
|
||||
|
||||
chaos_difference: float = abs(self.chaos - other.chaos)
|
||||
coherence_difference: float = abs(self.coherence - other.coherence)
|
||||
|
||||
# Below 0.5% difference --> Use Coherence
|
||||
if chaos_difference < 0.005 and coherence_difference > 0.02:
|
||||
return self.coherence > other.coherence
|
||||
elif chaos_difference < 0.005 and coherence_difference <= 0.02:
|
||||
# When having a difficult decision, use the result that decoded as many multi-byte as possible.
|
||||
# preserve RAM usage!
|
||||
if len(self._payload) >= TOO_BIG_SEQUENCE:
|
||||
return self.chaos < other.chaos
|
||||
return self.multi_byte_usage > other.multi_byte_usage
|
||||
|
||||
return self.chaos < other.chaos
|
||||
|
||||
@property
|
||||
def multi_byte_usage(self) -> float:
|
||||
return 1.0 - (len(str(self)) / len(self.raw))
|
||||
|
||||
def __str__(self) -> str:
|
||||
# Lazy Str Loading
|
||||
if self._string is None:
|
||||
self._string = str(self._payload, self._encoding, "strict")
|
||||
return self._string
|
||||
|
||||
def __repr__(self) -> str:
|
||||
return f"<CharsetMatch '{self.encoding}' fp({self.fingerprint})>"
|
||||
|
||||
def add_submatch(self, other: CharsetMatch) -> None:
|
||||
if not isinstance(other, CharsetMatch) or other == self:
|
||||
raise ValueError(
|
||||
"Unable to add instance <{}> as a submatch of a CharsetMatch".format(
|
||||
other.__class__
|
||||
)
|
||||
)
|
||||
|
||||
other._string = None # Unload RAM usage; dirty trick.
|
||||
self._leaves.append(other)
|
||||
|
||||
@property
|
||||
def encoding(self) -> str:
|
||||
return self._encoding
|
||||
|
||||
@property
|
||||
def encoding_aliases(self) -> list[str]:
|
||||
"""
|
||||
Encodings are known by many names; this can help when searching for IBM855 when it is listed as CP855.
|
||||
"""
|
||||
also_known_as: list[str] = []
|
||||
for u, p in aliases.items():
|
||||
if self.encoding == u:
|
||||
also_known_as.append(p)
|
||||
elif self.encoding == p:
|
||||
also_known_as.append(u)
|
||||
return also_known_as
|
||||
|
||||
@property
|
||||
def bom(self) -> bool:
|
||||
return self._has_sig_or_bom
|
||||
|
||||
@property
|
||||
def byte_order_mark(self) -> bool:
|
||||
return self._has_sig_or_bom
|
||||
|
||||
@property
|
||||
def languages(self) -> list[str]:
|
||||
"""
|
||||
Return the complete list of possible languages found in decoded sequence.
|
||||
Usually not really useful. The returned list may be empty even if the 'language' property returns something other than 'Unknown'.
|
||||
"""
|
||||
return [e[0] for e in self._languages]
|
||||
|
||||
@property
|
||||
def language(self) -> str:
|
||||
"""
|
||||
Most probable language found in decoded sequence. If none were detected or inferred, the property will return
|
||||
"Unknown".
|
||||
"""
|
||||
if not self._languages:
|
||||
# Trying to infer the language based on the given encoding
|
||||
# It's either English or, in certain cases, we should not commit to an answer.
|
||||
if "ascii" in self.could_be_from_charset:
|
||||
return "English"
|
||||
|
||||
# doing it there to avoid circular import
|
||||
from charset_normalizer.cd import encoding_languages, mb_encoding_languages
|
||||
|
||||
languages = (
|
||||
mb_encoding_languages(self.encoding)
|
||||
if is_multi_byte_encoding(self.encoding)
|
||||
else encoding_languages(self.encoding)
|
||||
)
|
||||
|
||||
if len(languages) == 0 or "Latin Based" in languages:
|
||||
return "Unknown"
|
||||
|
||||
return languages[0]
|
||||
|
||||
return self._languages[0][0]
|
||||
|
||||
@property
|
||||
def chaos(self) -> float:
|
||||
return self._mean_mess_ratio
|
||||
|
||||
@property
|
||||
def coherence(self) -> float:
|
||||
if not self._languages:
|
||||
return 0.0
|
||||
return self._languages[0][1]
|
||||
|
||||
@property
|
||||
def percent_chaos(self) -> float:
|
||||
return round(self.chaos * 100, ndigits=3)
|
||||
|
||||
@property
|
||||
def percent_coherence(self) -> float:
|
||||
return round(self.coherence * 100, ndigits=3)
|
||||
|
||||
@property
|
||||
def raw(self) -> bytes | bytearray:
|
||||
"""
|
||||
Original untouched bytes.
|
||||
"""
|
||||
return self._payload
|
||||
|
||||
@property
|
||||
def submatch(self) -> list[CharsetMatch]:
|
||||
return self._leaves
|
||||
|
||||
@property
|
||||
def has_submatch(self) -> bool:
|
||||
return len(self._leaves) > 0
|
||||
|
||||
@property
|
||||
def alphabets(self) -> list[str]:
|
||||
if self._unicode_ranges is not None:
|
||||
return self._unicode_ranges
|
||||
# list detected ranges
|
||||
detected_ranges: list[str | None] = [unicode_range(char) for char in str(self)]
|
||||
# filter and sort
|
||||
self._unicode_ranges = sorted(list({r for r in detected_ranges if r}))
|
||||
return self._unicode_ranges
|
||||
|
||||
@property
|
||||
def could_be_from_charset(self) -> list[str]:
|
||||
"""
|
||||
The complete list of encodings that output the exact SAME str result and therefore could be the originating
|
||||
encoding.
|
||||
This list does include the encoding available in property 'encoding'.
|
||||
"""
|
||||
return [self._encoding] + [m.encoding for m in self._leaves]
|
||||
|
||||
def output(self, encoding: str = "utf_8") -> bytes:
|
||||
"""
|
||||
Return the payload re-encoded with the given target encoding. Defaults to UTF-8.
|
||||
Characters that cannot be encoded are replaced by the encoder, not ignored.
|
||||
"""
|
||||
if self._output_encoding is None or self._output_encoding != encoding:
|
||||
self._output_encoding = encoding
|
||||
decoded_string = str(self)
|
||||
if (
|
||||
self._preemptive_declaration is not None
|
||||
and self._preemptive_declaration.lower()
|
||||
not in ["utf-8", "utf8", "utf_8"]
|
||||
):
|
||||
patched_header = sub(
|
||||
RE_POSSIBLE_ENCODING_INDICATION,
|
||||
lambda m: m.string[m.span()[0] : m.span()[1]].replace(
|
||||
m.groups()[0],
|
||||
iana_name(self._output_encoding).replace("_", "-"), # type: ignore[arg-type]
|
||||
),
|
||||
decoded_string[:8192],
|
||||
count=1,
|
||||
)
|
||||
|
||||
decoded_string = patched_header + decoded_string[8192:]
|
||||
|
||||
self._output_payload = decoded_string.encode(encoding, "replace")
|
||||
|
||||
return self._output_payload # type: ignore
|
||||
|
||||
@property
|
||||
def fingerprint(self) -> int:
|
||||
"""
|
||||
Retrieve a hash fingerprint of the decoded payload, used for deduplication.
|
||||
"""
|
||||
return hash(str(self))
|
||||
|
||||
|
||||
class CharsetMatches:
|
||||
"""
|
||||
Container holding every CharsetMatch item, ordered by default from the most probable to the least probable.
|
||||
Acts like a list (iterable) but does not implement all related methods.
|
||||
"""
|
||||
|
||||
def __init__(self, results: list[CharsetMatch] | None = None):
|
||||
self._results: list[CharsetMatch] = sorted(results) if results else []
|
||||
|
||||
def __iter__(self) -> Iterator[CharsetMatch]:
|
||||
yield from self._results
|
||||
|
||||
def __getitem__(self, item: int | str) -> CharsetMatch:
|
||||
"""
|
||||
Retrieve a single item either by its position or encoding name (alias may be used here).
|
||||
Raises IndexError for an invalid position and KeyError when the encoding is not present in results.
|
||||
"""
|
||||
if isinstance(item, int):
|
||||
return self._results[item]
|
||||
if isinstance(item, str):
|
||||
item = iana_name(item, False)
|
||||
for result in self._results:
|
||||
if item in result.could_be_from_charset:
|
||||
return result
|
||||
raise KeyError
|
||||
|
||||
def __len__(self) -> int:
|
||||
return len(self._results)
|
||||
|
||||
def __bool__(self) -> bool:
|
||||
return len(self._results) > 0
|
||||
|
||||
def append(self, item: CharsetMatch) -> None:
|
||||
"""
|
||||
Insert a single match. The results are re-sorted afterwards to preserve ordering.
|
||||
Can be inserted as a submatch.
|
||||
"""
|
||||
if not isinstance(item, CharsetMatch):
|
||||
raise ValueError(
|
||||
"Cannot append instance '{}' to CharsetMatches".format(
|
||||
str(item.__class__)
|
||||
)
|
||||
)
|
||||
# We should disable the submatch factoring when the input file is too heavy (conserve RAM usage)
|
||||
if len(item.raw) < TOO_BIG_SEQUENCE:
|
||||
for match in self._results:
|
||||
if match.fingerprint == item.fingerprint and match.chaos == item.chaos:
|
||||
match.add_submatch(item)
|
||||
return
|
||||
self._results.append(item)
|
||||
self._results = sorted(self._results)
|
||||
|
||||
def best(self) -> CharsetMatch | None:
|
||||
"""
|
||||
Simply return the first match. Strict equivalent to matches[0].
|
||||
"""
|
||||
if not self._results:
|
||||
return None
|
||||
return self._results[0]
|
||||
|
||||
def first(self) -> CharsetMatch | None:
|
||||
"""
|
||||
Redundant method that calls best(). Kept for backward-compatibility reasons.
|
||||
"""
|
||||
return self.best()
|
||||
|
||||
|
||||
CoherenceMatch = Tuple[str, float]
|
||||
CoherenceMatches = List[CoherenceMatch]
|
||||
|
||||
|
||||
class CliDetectionResult:
|
||||
def __init__(
|
||||
self,
|
||||
path: str,
|
||||
encoding: str | None,
|
||||
encoding_aliases: list[str],
|
||||
alternative_encodings: list[str],
|
||||
language: str,
|
||||
alphabets: list[str],
|
||||
has_sig_or_bom: bool,
|
||||
chaos: float,
|
||||
coherence: float,
|
||||
unicode_path: str | None,
|
||||
is_preferred: bool,
|
||||
):
|
||||
self.path: str = path
|
||||
self.unicode_path: str | None = unicode_path
|
||||
self.encoding: str | None = encoding
|
||||
self.encoding_aliases: list[str] = encoding_aliases
|
||||
self.alternative_encodings: list[str] = alternative_encodings
|
||||
self.language: str = language
|
||||
self.alphabets: list[str] = alphabets
|
||||
self.has_sig_or_bom: bool = has_sig_or_bom
|
||||
self.chaos: float = chaos
|
||||
self.coherence: float = coherence
|
||||
self.is_preferred: bool = is_preferred
|
||||
|
||||
@property
|
||||
def __dict__(self) -> dict[str, Any]: # type: ignore
|
||||
return {
|
||||
"path": self.path,
|
||||
"encoding": self.encoding,
|
||||
"encoding_aliases": self.encoding_aliases,
|
||||
"alternative_encodings": self.alternative_encodings,
|
||||
"language": self.language,
|
||||
"alphabets": self.alphabets,
|
||||
"has_sig_or_bom": self.has_sig_or_bom,
|
||||
"chaos": self.chaos,
|
||||
"coherence": self.coherence,
|
||||
"unicode_path": self.unicode_path,
|
||||
"is_preferred": self.is_preferred,
|
||||
}
|
||||
|
||||
def to_json(self) -> str:
|
||||
return dumps(self.__dict__, ensure_ascii=True, indent=4)
|
||||
0
aws-lambda/src/charset_normalizer/py.typed
Normal file
422
aws-lambda/src/charset_normalizer/utils.py
Normal file
@@ -0,0 +1,422 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import importlib
|
||||
import logging
|
||||
import unicodedata
|
||||
from bisect import bisect_right
|
||||
from codecs import IncrementalDecoder
|
||||
from encodings.aliases import aliases
|
||||
from functools import lru_cache
|
||||
from re import findall
|
||||
from typing import Generator
|
||||
|
||||
from _multibytecodec import ( # type: ignore[import-not-found,import]
|
||||
MultibyteIncrementalDecoder,
|
||||
)
|
||||
|
||||
from .constant import (
|
||||
ENCODING_MARKS,
|
||||
IANA_SUPPORTED_SIMILAR,
|
||||
RE_POSSIBLE_ENCODING_INDICATION,
|
||||
UNICODE_RANGES_COMBINED,
|
||||
UNICODE_SECONDARY_RANGE_KEYWORD,
|
||||
UTF8_MAXIMAL_ALLOCATION,
|
||||
COMMON_CJK_CHARACTERS,
|
||||
_LATIN,
|
||||
_CJK,
|
||||
_HANGUL,
|
||||
_KATAKANA,
|
||||
_HIRAGANA,
|
||||
_THAI,
|
||||
_ARABIC,
|
||||
_ARABIC_ISOLATED_FORM,
|
||||
_ACCENT_KEYWORDS,
|
||||
_ACCENTUATED,
|
||||
)
|
||||
|
||||
|
||||
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||
def _character_flags(character: str) -> int:
|
||||
"""Compute all name-based classification flags with a single unicodedata.name() call."""
|
||||
try:
|
||||
desc: str = unicodedata.name(character)
|
||||
except ValueError:
|
||||
return 0
|
||||
|
||||
flags: int = 0
|
||||
|
||||
if "LATIN" in desc:
|
||||
flags |= _LATIN
|
||||
if "CJK" in desc:
|
||||
flags |= _CJK
|
||||
if "HANGUL" in desc:
|
||||
flags |= _HANGUL
|
||||
if "KATAKANA" in desc:
|
||||
flags |= _KATAKANA
|
||||
if "HIRAGANA" in desc:
|
||||
flags |= _HIRAGANA
|
||||
if "THAI" in desc:
|
||||
flags |= _THAI
|
||||
if "ARABIC" in desc:
|
||||
flags |= _ARABIC
|
||||
if "ISOLATED FORM" in desc:
|
||||
flags |= _ARABIC_ISOLATED_FORM
|
||||
|
||||
for kw in _ACCENT_KEYWORDS:
|
||||
if kw in desc:
|
||||
flags |= _ACCENTUATED
|
||||
break
|
||||
|
||||
return flags
|
||||
|
||||
|
||||
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||
def is_accentuated(character: str) -> bool:
|
||||
return bool(_character_flags(character) & _ACCENTUATED)
|
||||
|
||||
|
||||
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||
def remove_accent(character: str) -> str:
|
||||
decomposed: str = unicodedata.decomposition(character)
|
||||
if not decomposed:
|
||||
return character
|
||||
|
||||
codes: list[str] = decomposed.split(" ")
|
||||
|
||||
return chr(int(codes[0], 16))
|
||||
|
||||
|
||||
# Pre-built sorted lookup table for O(log n) binary search in unicode_range().
|
||||
# Each entry is (range_start, range_end_exclusive, range_name).
|
||||
_UNICODE_RANGES_SORTED: list[tuple[int, int, str]] = sorted(
|
||||
(ord_range.start, ord_range.stop, name)
|
||||
for name, ord_range in UNICODE_RANGES_COMBINED.items()
|
||||
)
|
||||
_UNICODE_RANGE_STARTS: list[int] = [e[0] for e in _UNICODE_RANGES_SORTED]
|
||||
|
||||
|
||||
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||
def unicode_range(character: str) -> str | None:
|
||||
"""
|
||||
Retrieve the Unicode range official name from a single character.
|
||||
"""
|
||||
character_ord: int = ord(character)
|
||||
|
||||
# Binary search: find the rightmost range whose start <= character_ord
|
||||
idx = bisect_right(_UNICODE_RANGE_STARTS, character_ord) - 1
|
||||
if idx >= 0:
|
||||
start, stop, name = _UNICODE_RANGES_SORTED[idx]
|
||||
if character_ord < stop:
|
||||
return name
|
||||
|
||||
return None
|
||||
|
||||
|
||||
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||
def is_latin(character: str) -> bool:
|
||||
return bool(_character_flags(character) & _LATIN)
|
||||
|
||||
|
||||
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||
def is_punctuation(character: str) -> bool:
|
||||
character_category: str = unicodedata.category(character)
|
||||
|
||||
if "P" in character_category:
|
||||
return True
|
||||
|
||||
character_range: str | None = unicode_range(character)
|
||||
|
||||
if character_range is None:
|
||||
return False
|
||||
|
||||
return "Punctuation" in character_range
|
||||
|
||||
|
||||
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||
def is_symbol(character: str) -> bool:
|
||||
character_category: str = unicodedata.category(character)
|
||||
|
||||
if "S" in character_category or "N" in character_category:
|
||||
return True
|
||||
|
||||
character_range: str | None = unicode_range(character)
|
||||
|
||||
if character_range is None:
|
||||
return False
|
||||
|
||||
return "Forms" in character_range and character_category != "Lo"
|
||||
|
||||
|
||||
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||
def is_emoticon(character: str) -> bool:
|
||||
character_range: str | None = unicode_range(character)
|
||||
|
||||
if character_range is None:
|
||||
return False
|
||||
|
||||
return "Emoticons" in character_range or "Pictographs" in character_range
|
||||
|
||||
|
||||
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||
def is_separator(character: str) -> bool:
|
||||
if character.isspace() or character in {"|", "+", "<", ">"}:
|
||||
return True
|
||||
|
||||
character_category: str = unicodedata.category(character)
|
||||
|
||||
return "Z" in character_category or character_category in {"Po", "Pd", "Pc"}
|
||||
|
||||
|
||||
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||
def is_case_variable(character: str) -> bool:
|
||||
return character.islower() != character.isupper()
|
||||
|
||||
|
||||
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||
def is_cjk(character: str) -> bool:
|
||||
return bool(_character_flags(character) & _CJK)
|
||||
|
||||
|
||||
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||
def is_hiragana(character: str) -> bool:
|
||||
return bool(_character_flags(character) & _HIRAGANA)
|
||||
|
||||
|
||||
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||
def is_katakana(character: str) -> bool:
|
||||
return bool(_character_flags(character) & _KATAKANA)
|
||||
|
||||
|
||||
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||
def is_hangul(character: str) -> bool:
|
||||
return bool(_character_flags(character) & _HANGUL)
|
||||
|
||||
|
||||
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||
def is_thai(character: str) -> bool:
|
||||
return bool(_character_flags(character) & _THAI)
|
||||
|
||||
|
||||
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||
def is_arabic(character: str) -> bool:
|
||||
return bool(_character_flags(character) & _ARABIC)
|
||||
|
||||
|
||||
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||
def is_arabic_isolated_form(character: str) -> bool:
|
||||
return bool(_character_flags(character) & _ARABIC_ISOLATED_FORM)
|
||||
|
||||
|
||||
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||
def is_cjk_uncommon(character: str) -> bool:
|
||||
return character not in COMMON_CJK_CHARACTERS
|
||||
|
||||
|
||||
@lru_cache(maxsize=len(UNICODE_RANGES_COMBINED))
|
||||
def is_unicode_range_secondary(range_name: str) -> bool:
|
||||
return any(keyword in range_name for keyword in UNICODE_SECONDARY_RANGE_KEYWORD)
|
||||
|
||||
|
||||
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||
def is_unprintable(character: str) -> bool:
|
||||
return (
|
||||
character.isspace() is False # includes \n \t \r \v
|
||||
and character.isprintable() is False
|
||||
and character != "\x1a" # Why? It's the ASCII substitute character.
|
||||
and character != "\ufeff" # bug discovered in Python,
|
||||
# Zero Width No-Break Space located in Arabic Presentation Forms-B, Unicode 1.1 not acknowledged as space.
|
||||
)
|
||||
|
||||
|
||||
def any_specified_encoding(
|
||||
sequence: bytes | bytearray, search_zone: int = 8192
|
||||
) -> str | None:
|
||||
"""
|
||||
Extract any encoding declared in the first n bytes, using an ASCII-only decoder.
|
||||
"""
|
||||
if not isinstance(sequence, (bytes, bytearray)):
|
||||
raise TypeError
|
||||
|
||||
seq_len: int = len(sequence)
|
||||
|
||||
results: list[str] = findall(
|
||||
RE_POSSIBLE_ENCODING_INDICATION,
|
||||
sequence[: min(seq_len, search_zone)].decode("ascii", errors="ignore"),
|
||||
)
|
||||
|
||||
if len(results) == 0:
|
||||
return None
|
||||
|
||||
for specified_encoding in results:
|
||||
specified_encoding = specified_encoding.lower().replace("-", "_")
|
||||
|
||||
encoding_alias: str
|
||||
encoding_iana: str
|
||||
|
||||
for encoding_alias, encoding_iana in aliases.items():
|
||||
if encoding_alias == specified_encoding:
|
||||
return encoding_iana
|
||||
if encoding_iana == specified_encoding:
|
||||
return encoding_iana
|
||||
|
||||
return None
|
||||
|
||||
|
||||
@lru_cache(maxsize=128)
|
||||
def is_multi_byte_encoding(name: str) -> bool:
|
||||
"""
|
||||
Verify whether a specific encoding is a multi-byte one, based on its IANA name.
|
||||
"""
|
||||
return name in {
|
||||
"utf_8",
|
||||
"utf_8_sig",
|
||||
"utf_16",
|
||||
"utf_16_be",
|
||||
"utf_16_le",
|
||||
"utf_32",
|
||||
"utf_32_le",
|
||||
"utf_32_be",
|
||||
"utf_7",
|
||||
} or issubclass(
|
||||
importlib.import_module(f"encodings.{name}").IncrementalDecoder,
|
||||
MultibyteIncrementalDecoder,
|
||||
)
|
||||
|
||||
|
||||
def identify_sig_or_bom(sequence: bytes | bytearray) -> tuple[str | None, bytes]:
|
||||
"""
|
||||
Identify and extract SIG/BOM in given sequence.
|
||||
"""
|
||||
|
||||
for iana_encoding in ENCODING_MARKS:
|
||||
marks: bytes | list[bytes] = ENCODING_MARKS[iana_encoding]
|
||||
|
||||
if isinstance(marks, bytes):
|
||||
marks = [marks]
|
||||
|
||||
for mark in marks:
|
||||
if sequence.startswith(mark):
|
||||
return iana_encoding, mark
|
||||
|
||||
return None, b""
|
||||
|
||||
|
||||
def should_strip_sig_or_bom(iana_encoding: str) -> bool:
|
||||
return iana_encoding not in {"utf_16", "utf_32"}
|
||||
|
||||
|
||||
def iana_name(cp_name: str, strict: bool = True) -> str:
|
||||
"""Returns the Python normalized encoding name (Not the IANA official name)."""
|
||||
cp_name = cp_name.lower().replace("-", "_")
|
||||
|
||||
encoding_alias: str
|
||||
encoding_iana: str
|
||||
|
||||
for encoding_alias, encoding_iana in aliases.items():
|
||||
if cp_name in [encoding_alias, encoding_iana]:
|
||||
return encoding_iana
|
||||
|
||||
if strict:
|
||||
raise ValueError(f"Unable to retrieve IANA for '{cp_name}'")
|
||||
|
||||
return cp_name
|
||||
|
||||
|
||||
def cp_similarity(iana_name_a: str, iana_name_b: str) -> float:
|
||||
if is_multi_byte_encoding(iana_name_a) or is_multi_byte_encoding(iana_name_b):
|
||||
return 0.0
|
||||
|
||||
decoder_a = importlib.import_module(f"encodings.{iana_name_a}").IncrementalDecoder
|
||||
decoder_b = importlib.import_module(f"encodings.{iana_name_b}").IncrementalDecoder
|
||||
|
||||
id_a: IncrementalDecoder = decoder_a(errors="ignore")
|
||||
id_b: IncrementalDecoder = decoder_b(errors="ignore")
|
||||
|
||||
character_match_count: int = 0
|
||||
|
||||
for i in range(256):
|
||||
to_be_decoded: bytes = bytes([i])
|
||||
if id_a.decode(to_be_decoded) == id_b.decode(to_be_decoded):
|
||||
character_match_count += 1
|
||||
|
||||
return character_match_count / 256
|
||||
|
||||
|
||||
def is_cp_similar(iana_name_a: str, iana_name_b: str) -> bool:
|
||||
"""
|
||||
Determine if two code pages are at least 80% similar. The IANA_SUPPORTED_SIMILAR dict was generated using
|
||||
the function cp_similarity.
|
||||
"""
|
||||
return (
|
||||
iana_name_a in IANA_SUPPORTED_SIMILAR
|
||||
and iana_name_b in IANA_SUPPORTED_SIMILAR[iana_name_a]
|
||||
)
|
||||
|
||||
|
||||
def set_logging_handler(
|
||||
name: str = "charset_normalizer",
|
||||
level: int = logging.INFO,
|
||||
format_string: str = "%(asctime)s | %(levelname)s | %(message)s",
|
||||
) -> None:
|
||||
logger = logging.getLogger(name)
|
||||
logger.setLevel(level)
|
||||
|
||||
handler = logging.StreamHandler()
|
||||
handler.setFormatter(logging.Formatter(format_string))
|
||||
logger.addHandler(handler)
|
||||
|
||||
|
||||
def cut_sequence_chunks(
|
||||
sequences: bytes | bytearray,
|
||||
encoding_iana: str,
|
||||
offsets: range,
|
||||
chunk_size: int,
|
||||
bom_or_sig_available: bool,
|
||||
strip_sig_or_bom: bool,
|
||||
sig_payload: bytes,
|
||||
is_multi_byte_decoder: bool,
|
||||
decoded_payload: str | None = None,
|
||||
) -> Generator[str, None, None]:
|
||||
if decoded_payload and is_multi_byte_decoder is False:
|
||||
for i in offsets:
|
||||
chunk = decoded_payload[i : i + chunk_size]
|
||||
if not chunk:
|
||||
break
|
||||
yield chunk
|
||||
else:
|
||||
for i in offsets:
|
||||
chunk_end = i + chunk_size
|
||||
if chunk_end > len(sequences) + 8:
|
||||
continue
|
||||
|
||||
cut_sequence = sequences[i : i + chunk_size]
|
||||
|
||||
if bom_or_sig_available and strip_sig_or_bom is False:
|
||||
cut_sequence = sig_payload + cut_sequence
|
||||
|
||||
chunk = cut_sequence.decode(
|
||||
encoding_iana,
|
||||
errors="ignore" if is_multi_byte_decoder else "strict",
|
||||
)
|
||||
|
||||
# multi-byte bad cutting detector and adjustment
|
||||
# not the cleanest way to perform that fix but clever enough for now.
|
||||
if is_multi_byte_decoder and i > 0:
|
||||
chunk_partial_size_chk: int = min(chunk_size, 16)
|
||||
|
||||
if (
|
||||
decoded_payload
|
||||
and chunk[:chunk_partial_size_chk] not in decoded_payload
|
||||
):
|
||||
for j in range(i, i - 4, -1):
|
||||
cut_sequence = sequences[j:chunk_end]
|
||||
|
||||
if bom_or_sig_available and strip_sig_or_bom is False:
|
||||
cut_sequence = sig_payload + cut_sequence
|
||||
|
||||
chunk = cut_sequence.decode(encoding_iana, errors="ignore")
|
||||
|
||||
if chunk[:chunk_partial_size_chk] in decoded_payload:
|
||||
break
|
||||
|
||||
yield chunk
|
||||
8
aws-lambda/src/charset_normalizer/version.py
Normal file
@@ -0,0 +1,8 @@
"""
Expose version
"""

from __future__ import annotations

__version__ = "3.4.6"
VERSION = __version__.split(".")
1
aws-lambda/src/idna-3.11.dist-info/INSTALLER
Normal file
@@ -0,0 +1 @@
|
||||
pip
|
||||
209
aws-lambda/src/idna-3.11.dist-info/METADATA
Normal file
@@ -0,0 +1,209 @@
|
||||
Metadata-Version: 2.4
|
||||
Name: idna
|
||||
Version: 3.11
|
||||
Summary: Internationalized Domain Names in Applications (IDNA)
|
||||
Author-email: Kim Davies <kim+pypi@gumleaf.org>
|
||||
Requires-Python: >=3.8
|
||||
Description-Content-Type: text/x-rst
|
||||
License-Expression: BSD-3-Clause
|
||||
Classifier: Development Status :: 5 - Production/Stable
|
||||
Classifier: Intended Audience :: Developers
|
||||
Classifier: Intended Audience :: System Administrators
|
||||
Classifier: Operating System :: OS Independent
|
||||
Classifier: Programming Language :: Python
|
||||
Classifier: Programming Language :: Python :: 3
|
||||
Classifier: Programming Language :: Python :: 3 :: Only
|
||||
Classifier: Programming Language :: Python :: 3.8
|
||||
Classifier: Programming Language :: Python :: 3.9
|
||||
Classifier: Programming Language :: Python :: 3.10
|
||||
Classifier: Programming Language :: Python :: 3.11
|
||||
Classifier: Programming Language :: Python :: 3.12
|
||||
Classifier: Programming Language :: Python :: 3.13
|
||||
Classifier: Programming Language :: Python :: 3.14
|
||||
Classifier: Programming Language :: Python :: Implementation :: CPython
|
||||
Classifier: Programming Language :: Python :: Implementation :: PyPy
|
||||
Classifier: Topic :: Internet :: Name Service (DNS)
|
||||
Classifier: Topic :: Software Development :: Libraries :: Python Modules
|
||||
Classifier: Topic :: Utilities
|
||||
License-File: LICENSE.md
|
||||
Requires-Dist: ruff >= 0.6.2 ; extra == "all"
|
||||
Requires-Dist: mypy >= 1.11.2 ; extra == "all"
|
||||
Requires-Dist: pytest >= 8.3.2 ; extra == "all"
|
||||
Requires-Dist: flake8 >= 7.1.1 ; extra == "all"
|
||||
Project-URL: Changelog, https://github.com/kjd/idna/blob/master/HISTORY.rst
|
||||
Project-URL: Issue tracker, https://github.com/kjd/idna/issues
|
||||
Project-URL: Source, https://github.com/kjd/idna
|
||||
Provides-Extra: all
|
||||
|
||||
Internationalized Domain Names in Applications (IDNA)
|
||||
=====================================================
|
||||
|
||||
Support for `Internationalized Domain Names in
|
||||
Applications (IDNA) <https://tools.ietf.org/html/rfc5891>`_
|
||||
and `Unicode IDNA Compatibility Processing
|
||||
<https://unicode.org/reports/tr46/>`_.
|
||||
|
||||
The latest versions of these standards supplied here provide
|
||||
more comprehensive language coverage and reduce the potential of
|
||||
allowing domains with known security vulnerabilities. This library
|
||||
is a suitable replacement for the “encodings.idna”
|
||||
module that comes with the Python standard library, but which
|
||||
only supports an older superseded IDNA specification from 2003.
|
||||
|
||||
Basic functions are simply executed:
|
||||
|
||||
.. code-block:: pycon
|
||||
|
||||
>>> import idna
|
||||
>>> idna.encode('ドメイン.テスト')
|
||||
b'xn--eckwd4c7c.xn--zckzah'
|
||||
>>> print(idna.decode('xn--eckwd4c7c.xn--zckzah'))
|
||||
ドメイン.テスト
|
||||
|
||||
|
||||
Installation
|
||||
------------
|
||||
|
||||
This package is available for installation from PyPI via the
|
||||
typical mechanisms, such as:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
$ python3 -m pip install idna
|
||||
|
||||
|
||||
Usage
|
||||
-----
|
||||
|
||||
For typical usage, the ``encode`` and ``decode`` functions will take a
|
||||
domain name argument and perform a conversion to ASCII compatible encoding
|
||||
(known as A-labels), or to Unicode strings (known as U-labels)
|
||||
respectively.
|
||||
|
||||
.. code-block:: pycon
|
||||
|
||||
>>> import idna
|
||||
>>> idna.encode('ドメイン.テスト')
|
||||
b'xn--eckwd4c7c.xn--zckzah'
|
||||
>>> print(idna.decode('xn--eckwd4c7c.xn--zckzah'))
|
||||
ドメイン.テスト
|
||||
|
||||
Conversions can be applied at a per-label basis using the ``ulabel`` or
|
||||
``alabel`` functions if necessary:
|
||||
|
||||
.. code-block:: pycon
|
||||
|
||||
>>> idna.alabel('测试')
|
||||
b'xn--0zwm56d'
|
||||
|
||||
|
||||
Compatibility Mapping (UTS #46)
|
||||
+++++++++++++++++++++++++++++++
|
||||
|
||||
This library provides support for `Unicode IDNA Compatibility
|
||||
Processing <https://unicode.org/reports/tr46/>`_ which normalizes input from
|
||||
different potential ways a user may input a domain prior to performing the IDNA
|
||||
conversion operations. This functionality, known as a
|
||||
`mapping <https://tools.ietf.org/html/rfc5895>`_, is considered by the
|
||||
specification to be a local user-interface issue distinct from IDNA
|
||||
conversion functionality.
|
||||
|
||||
For example, “Königsgäßchen” is not a permissible label as *LATIN
|
||||
CAPITAL LETTER K* is not allowed (nor are capital letters in general).
|
||||
UTS 46 will convert this into lower case prior to applying the IDNA
|
||||
conversion.
|
||||
|
||||
.. code-block:: pycon
|
||||
|
||||
>>> import idna
|
||||
>>> idna.encode('Königsgäßchen')
|
||||
...
|
||||
idna.core.InvalidCodepoint: Codepoint U+004B at position 1 of 'Königsgäßchen' not allowed
|
||||
>>> idna.encode('Königsgäßchen', uts46=True)
|
||||
b'xn--knigsgchen-b4a3dun'
|
||||
>>> print(idna.decode('xn--knigsgchen-b4a3dun'))
|
||||
königsgäßchen
|
||||
|
||||
|
||||
Exceptions
|
||||
----------
|
||||
|
||||
All errors raised during the conversion following the specification
should raise an exception derived from the ``idna.IDNAError`` base
class.

More specific exceptions that may be generated are ``idna.IDNABidiError``,
when the error reflects an illegal combination of left-to-right and
right-to-left characters in a label; ``idna.InvalidCodepoint``, when
a specific codepoint is an illegal character in an IDN label (i.e.
INVALID); and ``idna.InvalidCodepointContext``, when the codepoint is
illegal based on its position in the string (i.e. it is CONTEXTO or CONTEXTJ
but the contextual requirements are not satisfied).
|
||||
|
||||
Building and Diagnostics
|
||||
------------------------
|
||||
|
||||
The IDNA and UTS 46 functionality relies upon pre-calculated lookup
|
||||
tables for performance. These tables are derived from computing against
|
||||
eligibility criteria in the respective standards using the command-line
|
||||
script ``tools/idna-data``.
|
||||
|
||||
This tool will fetch relevant codepoint data from the Unicode repository
|
||||
and perform the required calculations to identify eligibility. There are
|
||||
three main modes:
|
||||
|
||||
* ``idna-data make-libdata``. Generates ``idnadata.py`` and
|
||||
``uts46data.py``, the pre-calculated lookup tables used for IDNA and
|
||||
UTS 46 conversions. Implementers who wish to track this library against
|
||||
a different Unicode version may use this tool to manually generate a
|
||||
different version of the ``idnadata.py`` and ``uts46data.py`` files.
|
||||
|
||||
* ``idna-data make-table``. Generate a table of the IDNA disposition
|
||||
(e.g. PVALID, CONTEXTJ, CONTEXTO) in the format found in Appendix
|
||||
B.1 of RFC 5892 and the pre-computed tables published by `IANA
|
||||
<https://www.iana.org/>`_.
|
||||
|
||||
* ``idna-data U+0061``. Prints debugging output on the various
|
||||
properties associated with an individual Unicode codepoint (in this
|
||||
case, U+0061), that are used to assess the IDNA and UTS 46 status of a
|
||||
codepoint. This is helpful in debugging or analysis.
|
||||
|
||||
The tool accepts a number of arguments, described using ``idna-data
|
||||
-h``. Most notably, the ``--version`` argument allows the specification
|
||||
of the version of Unicode to be used in computing the table data. For
|
||||
example, ``idna-data --version 9.0.0 make-libdata`` will generate
|
||||
library data against Unicode 9.0.0.
|
||||
|
||||
|
||||
Additional Notes
|
||||
----------------
|
||||
|
||||
* **Packages**. The latest tagged release version is published in the
|
||||
`Python Package Index <https://pypi.org/project/idna/>`_.
|
||||
|
||||
* **Version support**. This library supports Python 3.8 and higher.
|
||||
As this library serves as a low-level toolkit for a variety of
|
||||
applications, many of which strive for broad compatibility with older
|
||||
Python versions, there is no rush to remove older interpreter support.
|
||||
Support for older versions are likely to be removed from new releases
|
||||
as automated tests can no longer easily be run, i.e. once the Python
|
||||
version is officially end-of-life.
|
||||
|
||||
* **Testing**. The library has a test suite based on each rule of the
|
||||
IDNA specification, as well as tests that are provided as part of the
|
||||
Unicode Technical Standard 46, `Unicode IDNA Compatibility Processing
|
||||
<https://unicode.org/reports/tr46/>`_.
|
||||
|
||||
* **Emoji**. It is an occasional request to support emoji domains in
|
||||
this library. Encoding of symbols like emoji is expressly prohibited by
|
||||
the technical standard IDNA 2008 and emoji domains are broadly phased
|
||||
out across the domain industry due to associated security risks. For
|
||||
now, applications that need to support these non-compliant labels
|
||||
may wish to consider trying the encode/decode operation in this library
|
||||
first, and then falling back to using `encodings.idna`. See `the Github
|
||||
project <https://github.com/kjd/idna/issues/18>`_ for more discussion.
|
||||
|
||||
* **Transitional processing**. Unicode 16.0.0 removed transitional
|
||||
processing so the `transitional` argument for the encode() method
|
||||
no longer has any effect and will be removed at a later date.
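
The fallback mentioned in the Emoji note can be sketched roughly as follows. This is only an illustration under stated assumptions, not part of the library: ``encode_with_fallback``, its ``primary`` parameter and ``always_reject`` are hypothetical names. ``primary`` stands in for a strict encoder such as ``idna.encode``, and the fallback is the standard library's IDNA 2003 codec. The sketch relies on ``IDNAError`` deriving from ``UnicodeError``.

```python
from typing import Callable, Optional


def encode_with_fallback(
    domain: str,
    primary: Optional[Callable[[str], bytes]] = None,
) -> bytes:
    """Try a strict encoder first, then the standard-library IDNA 2003 codec."""
    if primary is not None:
        try:
            return primary(domain)
        except UnicodeError:
            # IDNAError subclasses UnicodeError, so a rejection by the
            # strict encoder lands here and we fall through to the codec.
            pass
    # Standard-library codec (IDNA 2003), registered under the name "idna".
    return domain.encode("idna")


def always_reject(domain: str) -> bytes:
    """Hypothetical stand-in for a strict encoder that refuses the input."""
    raise UnicodeError("label not allowed")


print(encode_with_fallback("bücher.example", primary=always_reject))
```

In real use, ``primary`` would presumably be ``idna.encode``; the injected callable keeps the sketch runnable without the package installed.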
|
||||
|
||||
22
aws-lambda/src/idna-3.11.dist-info/RECORD
Normal file
@@ -0,0 +1,22 @@
|
||||
idna-3.11.dist-info/INSTALLER,sha256=zuuue4knoyJ-UwPPXg8fezS7VCrXJQrAP7zeNuwvFQg,4
|
||||
idna-3.11.dist-info/METADATA,sha256=fCwSww9SuiN8TIHllFSASUQCW55hAs8dzKnr9RaEEbA,8378
|
||||
idna-3.11.dist-info/RECORD,,
|
||||
idna-3.11.dist-info/WHEEL,sha256=G2gURzTEtmeR8nrdXUJfNiB3VYVxigPQ-bEQujpNiNs,82
|
||||
idna-3.11.dist-info/licenses/LICENSE.md,sha256=t6M2q_OwThgOwGXN0W5wXQeeHMehT5EKpukYfza5zYc,1541
|
||||
idna/__init__.py,sha256=MPqNDLZbXqGaNdXxAFhiqFPKEQXju2jNQhCey6-5eJM,868
|
||||
idna/__pycache__/__init__.cpython-313.pyc,,
|
||||
idna/__pycache__/codec.cpython-313.pyc,,
|
||||
idna/__pycache__/compat.cpython-313.pyc,,
|
||||
idna/__pycache__/core.cpython-313.pyc,,
|
||||
idna/__pycache__/idnadata.cpython-313.pyc,,
|
||||
idna/__pycache__/intranges.cpython-313.pyc,,
|
||||
idna/__pycache__/package_data.cpython-313.pyc,,
|
||||
idna/__pycache__/uts46data.cpython-313.pyc,,
|
||||
idna/codec.py,sha256=M2SGWN7cs_6B32QmKTyTN6xQGZeYQgQ2wiX3_DR6loE,3438
|
||||
idna/compat.py,sha256=RzLy6QQCdl9784aFhb2EX9EKGCJjg0P3PilGdeXXcx8,316
|
||||
idna/core.py,sha256=P26_XVycuMTZ1R2mNK1ZREVzM5mvTzdabBXfyZVU1Lc,13246
|
||||
idna/idnadata.py,sha256=SG8jhaGE53iiD6B49pt2pwTv_UvClciWE-N54oR2p4U,79623
|
||||
idna/intranges.py,sha256=amUtkdhYcQG8Zr-CoMM_kVRacxkivC1WgxN1b63KKdU,1898
|
||||
idna/package_data.py,sha256=_CUavOxobnbyNG2FLyHoN8QHP3QM9W1tKuw7eq9QwBk,21
|
||||
idna/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
|
||||
idna/uts46data.py,sha256=H9J35VkD0F9L9mKOqjeNGd2A-Va6FlPoz6Jz4K7h-ps,243725
|
||||
4
aws-lambda/src/idna-3.11.dist-info/WHEEL
Normal file
@@ -0,0 +1,4 @@
|
||||
Wheel-Version: 1.0
|
||||
Generator: flit 3.12.0
|
||||
Root-Is-Purelib: true
|
||||
Tag: py3-none-any
|
||||
31
aws-lambda/src/idna-3.11.dist-info/licenses/LICENSE.md
Normal file
@@ -0,0 +1,31 @@
|
||||
BSD 3-Clause License
|
||||
|
||||
Copyright (c) 2013-2025, Kim Davies and contributors.
|
||||
All rights reserved.
|
||||
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are
|
||||
met:
|
||||
|
||||
1. Redistributions of source code must retain the above copyright
|
||||
notice, this list of conditions and the following disclaimer.
|
||||
|
||||
2. Redistributions in binary form must reproduce the above copyright
|
||||
notice, this list of conditions and the following disclaimer in the
|
||||
documentation and/or other materials provided with the distribution.
|
||||
|
||||
3. Neither the name of the copyright holder nor the names of its
|
||||
contributors may be used to endorse or promote products derived from
|
||||
this software without specific prior written permission.
|
||||
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
|
||||
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
|
||||
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
|
||||
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
|
||||
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
|
||||
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
|
||||
TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
|
||||
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
|
||||
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
|
||||
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
|
||||
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
45
aws-lambda/src/idna/__init__.py
Normal file
@@ -0,0 +1,45 @@
|
||||
from .core import (
|
||||
IDNABidiError,
|
||||
IDNAError,
|
||||
InvalidCodepoint,
|
||||
InvalidCodepointContext,
|
||||
alabel,
|
||||
check_bidi,
|
||||
check_hyphen_ok,
|
||||
check_initial_combiner,
|
||||
check_label,
|
||||
check_nfc,
|
||||
decode,
|
||||
encode,
|
||||
ulabel,
|
||||
uts46_remap,
|
||||
valid_contextj,
|
||||
valid_contexto,
|
||||
valid_label_length,
|
||||
valid_string_length,
|
||||
)
|
||||
from .intranges import intranges_contain
|
||||
from .package_data import __version__
|
||||
|
||||
__all__ = [
|
||||
"__version__",
|
||||
"IDNABidiError",
|
||||
"IDNAError",
|
||||
"InvalidCodepoint",
|
||||
"InvalidCodepointContext",
|
||||
"alabel",
|
||||
"check_bidi",
|
||||
"check_hyphen_ok",
|
||||
"check_initial_combiner",
|
||||
"check_label",
|
||||
"check_nfc",
|
||||
"decode",
|
||||
"encode",
|
||||
"intranges_contain",
|
||||
"ulabel",
|
||||
"uts46_remap",
|
||||
"valid_contextj",
|
||||
"valid_contexto",
|
||||
"valid_label_length",
|
||||
"valid_string_length",
|
||||
]
|
||||
BIN
aws-lambda/src/idna/__pycache__/__init__.cpython-313.pyc
Normal file
Binary file not shown.
BIN
aws-lambda/src/idna/__pycache__/codec.cpython-313.pyc
Normal file
Binary file not shown.
BIN
aws-lambda/src/idna/__pycache__/compat.cpython-313.pyc
Normal file
Binary file not shown.
BIN
aws-lambda/src/idna/__pycache__/core.cpython-313.pyc
Normal file
Binary file not shown.
BIN
aws-lambda/src/idna/__pycache__/idnadata.cpython-313.pyc
Normal file
Binary file not shown.
BIN
aws-lambda/src/idna/__pycache__/intranges.cpython-313.pyc
Normal file
Binary file not shown.
BIN
aws-lambda/src/idna/__pycache__/package_data.cpython-313.pyc
Normal file
Binary file not shown.
BIN
aws-lambda/src/idna/__pycache__/uts46data.cpython-313.pyc
Normal file
Binary file not shown.
122
aws-lambda/src/idna/codec.py
Normal file
@@ -0,0 +1,122 @@
|
||||
import codecs
|
||||
import re
|
||||
from typing import Any, Optional, Tuple
|
||||
|
||||
from .core import IDNAError, alabel, decode, encode, ulabel
|
||||
|
||||
_unicode_dots_re = re.compile("[\u002e\u3002\uff0e\uff61]")
|
||||
|
||||
|
||||
class Codec(codecs.Codec):
|
||||
def encode(self, data: str, errors: str = "strict") -> Tuple[bytes, int]:
|
||||
if errors != "strict":
|
||||
raise IDNAError('Unsupported error handling "{}"'.format(errors))
|
||||
|
||||
if not data:
|
||||
return b"", 0
|
||||
|
||||
return encode(data), len(data)
|
||||
|
||||
def decode(self, data: bytes, errors: str = "strict") -> Tuple[str, int]:
|
||||
if errors != "strict":
|
||||
raise IDNAError('Unsupported error handling "{}"'.format(errors))
|
||||
|
||||
if not data:
|
||||
return "", 0
|
||||
|
||||
return decode(data), len(data)
|
||||
|
||||
|
||||
class IncrementalEncoder(codecs.BufferedIncrementalEncoder):
|
||||
def _buffer_encode(self, data: str, errors: str, final: bool) -> Tuple[bytes, int]:
|
||||
if errors != "strict":
|
||||
raise IDNAError('Unsupported error handling "{}"'.format(errors))
|
||||
|
||||
if not data:
|
||||
return b"", 0
|
||||
|
||||
labels = _unicode_dots_re.split(data)
|
||||
trailing_dot = b""
|
||||
if labels:
|
||||
if not labels[-1]:
|
||||
trailing_dot = b"."
|
||||
del labels[-1]
|
||||
elif not final:
|
||||
# Keep potentially unfinished label until the next call
|
||||
del labels[-1]
|
||||
if labels:
|
||||
trailing_dot = b"."
|
||||
|
||||
result = []
|
||||
size = 0
|
||||
for label in labels:
|
||||
result.append(alabel(label))
|
||||
if size:
|
||||
size += 1
|
||||
size += len(label)
|
||||
|
||||
# Join with U+002E
|
||||
result_bytes = b".".join(result) + trailing_dot
|
||||
size += len(trailing_dot)
|
||||
return result_bytes, size
|
||||
|
||||
|
||||
class IncrementalDecoder(codecs.BufferedIncrementalDecoder):
|
||||
def _buffer_decode(self, data: Any, errors: str, final: bool) -> Tuple[str, int]:
|
||||
if errors != "strict":
|
||||
raise IDNAError('Unsupported error handling "{}"'.format(errors))
|
||||
|
||||
if not data:
|
||||
return ("", 0)
|
||||
|
||||
if not isinstance(data, str):
|
||||
data = str(data, "ascii")
|
||||
|
||||
labels = _unicode_dots_re.split(data)
|
||||
trailing_dot = ""
|
||||
if labels:
|
||||
if not labels[-1]:
|
||||
trailing_dot = "."
|
||||
del labels[-1]
|
||||
elif not final:
|
||||
# Keep potentially unfinished label until the next call
|
||||
del labels[-1]
|
||||
if labels:
|
||||
trailing_dot = "."
|
||||
|
||||
result = []
|
||||
size = 0
|
||||
for label in labels:
|
||||
result.append(ulabel(label))
|
||||
if size:
|
||||
size += 1
|
||||
size += len(label)
|
||||
|
||||
result_str = ".".join(result) + trailing_dot
|
||||
size += len(trailing_dot)
|
||||
return (result_str, size)
|
||||
|
||||
|
||||
class StreamWriter(Codec, codecs.StreamWriter):
|
||||
pass
|
||||
|
||||
|
||||
class StreamReader(Codec, codecs.StreamReader):
|
||||
pass
|
||||
|
||||
|
||||
def search_function(name: str) -> Optional[codecs.CodecInfo]:
|
||||
if name != "idna2008":
|
||||
return None
|
||||
return codecs.CodecInfo(
|
||||
name=name,
|
||||
encode=Codec().encode,
|
||||
decode=Codec().decode, # type: ignore
|
||||
incrementalencoder=IncrementalEncoder,
|
||||
incrementaldecoder=IncrementalDecoder,
|
||||
streamwriter=StreamWriter,
|
||||
streamreader=StreamReader,
|
||||
)
|
||||
|
||||
|
||||
codecs.register(search_function)
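
The incremental encoder and decoder above split the input on four Unicode dot characters and hold back a possibly unfinished last label until more data arrives. That splitting step can be sketched on its own as below. This is a simplified illustration, not the codec's actual interface: ``split_complete_labels`` is a hypothetical helper that returns the held-back text directly, whereas the real classes report a consumed length and let the buffered base class keep the remainder.

```python
import re
from typing import List, Tuple

# The same four separators as in the codec's regex: full stop, ideographic
# full stop, fullwidth full stop and halfwidth ideographic full stop.
UNICODE_DOTS = re.compile("[\u002e\u3002\uff0e\uff61]")


def split_complete_labels(data: str, final: bool) -> Tuple[List[str], str]:
    """Return (complete labels, text held back for the next call)."""
    labels = UNICODE_DOTS.split(data)
    if not labels[-1]:
        # Input ended on a separator, so every label is complete.
        return labels[:-1], ""
    if not final:
        # The last label may continue in the next chunk; hold it back.
        return labels[:-1], labels[-1]
    return labels, ""


print(split_complete_labels("www\u3002exam", final=False))
print(split_complete_labels("www.example.", final=True))
```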
|
||||
15
aws-lambda/src/idna/compat.py
Normal file
@@ -0,0 +1,15 @@
|
||||
from typing import Any, Union
|
||||
|
||||
from .core import decode, encode
|
||||
|
||||
|
||||
def ToASCII(label: str) -> bytes:
|
||||
return encode(label)
|
||||
|
||||
|
||||
def ToUnicode(label: Union[bytes, bytearray]) -> str:
|
||||
return decode(label)
|
||||
|
||||
|
||||
def nameprep(s: Any) -> None:
|
||||
raise NotImplementedError("IDNA 2008 does not utilise nameprep protocol")
|
||||
437
aws-lambda/src/idna/core.py
Normal file
@@ -0,0 +1,437 @@
|
||||
import bisect
|
||||
import re
|
||||
import unicodedata
|
||||
from typing import Optional, Union
|
||||
|
||||
from . import idnadata
|
||||
from .intranges import intranges_contain
|
||||
|
||||
_virama_combining_class = 9
|
||||
_alabel_prefix = b"xn--"
|
||||
_unicode_dots_re = re.compile("[\u002e\u3002\uff0e\uff61]")
|
||||
|
||||
|
||||
class IDNAError(UnicodeError):
|
||||
"""Base exception for all IDNA-encoding related problems"""
|
||||
|
||||
pass
|
||||
|
||||
|
||||
class IDNABidiError(IDNAError):
|
||||
"""Exception when bidirectional requirements are not satisfied"""
|
||||
|
||||
pass
|
||||
|
||||
|
||||
class InvalidCodepoint(IDNAError):
|
||||
"""Exception when a disallowed or unallocated codepoint is used"""
|
||||
|
||||
pass
|
||||
|
||||
|
||||
class InvalidCodepointContext(IDNAError):
|
||||
"""Exception when the codepoint is not valid in the context it is used"""
|
||||
|
||||
pass
|
||||
|
||||
|
||||
def _combining_class(cp: int) -> int:
|
||||
v = unicodedata.combining(chr(cp))
|
||||
if v == 0:
|
||||
if not unicodedata.name(chr(cp)):
|
||||
raise ValueError("Unknown character in unicodedata")
|
||||
return v
|
||||
|
||||
|
||||
def _is_script(cp: str, script: str) -> bool:
|
||||
return intranges_contain(ord(cp), idnadata.scripts[script])
|
||||
|
||||
|
||||
def _punycode(s: str) -> bytes:
|
||||
return s.encode("punycode")
|
||||
|
||||
|
||||
def _unot(s: int) -> str:
|
||||
return "U+{:04X}".format(s)
|
||||
|
||||
|
||||
def valid_label_length(label: Union[bytes, str]) -> bool:
|
||||
if len(label) > 63:
|
||||
return False
|
||||
return True
|
||||
|
||||
|
||||
def valid_string_length(label: Union[bytes, str], trailing_dot: bool) -> bool:
|
||||
if len(label) > (254 if trailing_dot else 253):
|
||||
return False
|
||||
return True
|
||||
|
||||
|
||||
def check_bidi(label: str, check_ltr: bool = False) -> bool:
|
||||
# Bidi rules should only be applied if string contains RTL characters
|
||||
bidi_label = False
|
||||
for idx, cp in enumerate(label, 1):
|
||||
direction = unicodedata.bidirectional(cp)
|
||||
if direction == "":
|
||||
# String likely comes from a newer version of Unicode
|
||||
raise IDNABidiError("Unknown directionality in label {} at position {}".format(repr(label), idx))
|
||||
if direction in ["R", "AL", "AN"]:
|
||||
bidi_label = True
|
||||
if not bidi_label and not check_ltr:
|
||||
return True
|
||||
|
||||
# Bidi rule 1
|
||||
direction = unicodedata.bidirectional(label[0])
|
||||
if direction in ["R", "AL"]:
|
||||
rtl = True
|
||||
elif direction == "L":
|
||||
rtl = False
|
||||
else:
|
||||
raise IDNABidiError("First codepoint in label {} must be directionality L, R or AL".format(repr(label)))
|
||||
|
||||
valid_ending = False
|
||||
number_type: Optional[str] = None
|
||||
for idx, cp in enumerate(label, 1):
|
||||
direction = unicodedata.bidirectional(cp)
|
||||
|
||||
if rtl:
|
||||
# Bidi rule 2
|
||||
if direction not in [
|
||||
"R",
|
||||
"AL",
|
||||
"AN",
|
||||
"EN",
|
||||
"ES",
|
||||
"CS",
|
||||
"ET",
|
||||
"ON",
|
||||
"BN",
|
||||
"NSM",
|
||||
]:
|
||||
raise IDNABidiError("Invalid direction for codepoint at position {} in a right-to-left label".format(idx))
|
||||
# Bidi rule 3
|
||||
if direction in ["R", "AL", "EN", "AN"]:
|
||||
valid_ending = True
|
||||
elif direction != "NSM":
|
||||
valid_ending = False
|
||||
# Bidi rule 4
|
||||
if direction in ["AN", "EN"]:
|
||||
if not number_type:
|
||||
number_type = direction
|
||||
else:
|
||||
if number_type != direction:
|
||||
raise IDNABidiError("Can not mix numeral types in a right-to-left label")
|
||||
else:
|
||||
# Bidi rule 5
|
||||
if direction not in ["L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"]:
|
||||
raise IDNABidiError("Invalid direction for codepoint at position {} in a left-to-right label".format(idx))
|
||||
# Bidi rule 6
|
||||
if direction in ["L", "EN"]:
|
||||
valid_ending = True
|
||||
elif direction != "NSM":
|
||||
valid_ending = False
|
||||
|
||||
if not valid_ending:
|
||||
raise IDNABidiError("Label ends with illegal codepoint directionality")
|
||||
|
||||
return True
|
||||
|
||||
|
||||
def check_initial_combiner(label: str) -> bool:
|
||||
if unicodedata.category(label[0])[0] == "M":
|
||||
raise IDNAError("Label begins with an illegal combining character")
|
||||
return True
|
||||
|
||||
|
||||
def check_hyphen_ok(label: str) -> bool:
|
||||
if label[2:4] == "--":
|
||||
raise IDNAError("Label has disallowed hyphens in 3rd and 4th position")
|
||||
if label[0] == "-" or label[-1] == "-":
|
||||
raise IDNAError("Label must not start or end with a hyphen")
|
||||
return True
|
||||
|
||||
|
||||
def check_nfc(label: str) -> None:
|
||||
if unicodedata.normalize("NFC", label) != label:
|
||||
raise IDNAError("Label must be in Normalization Form C")
|
||||
|
||||
|
||||
def valid_contextj(label: str, pos: int) -> bool:
|
||||
cp_value = ord(label[pos])
|
||||
|
||||
if cp_value == 0x200C:
|
||||
if pos > 0:
|
||||
if _combining_class(ord(label[pos - 1])) == _virama_combining_class:
|
||||
return True
|
||||
|
||||
ok = False
|
||||
for i in range(pos - 1, -1, -1):
|
||||
joining_type = idnadata.joining_types.get(ord(label[i]))
|
||||
if joining_type == ord("T"):
|
||||
continue
|
||||
elif joining_type in [ord("L"), ord("D")]:
|
||||
ok = True
|
||||
break
|
||||
else:
|
||||
break
|
||||
|
||||
if not ok:
|
||||
return False
|
||||
|
||||
ok = False
|
||||
for i in range(pos + 1, len(label)):
|
||||
joining_type = idnadata.joining_types.get(ord(label[i]))
|
||||
if joining_type == ord("T"):
|
||||
continue
|
||||
elif joining_type in [ord("R"), ord("D")]:
|
||||
ok = True
|
||||
break
|
||||
else:
|
||||
break
|
||||
return ok
|
||||
|
||||
if cp_value == 0x200D:
|
||||
if pos > 0:
|
||||
if _combining_class(ord(label[pos - 1])) == _virama_combining_class:
|
||||
return True
|
||||
return False
|
||||
|
||||
else:
|
||||
return False
|
||||
|
||||
|
||||
def valid_contexto(label: str, pos: int, exception: bool = False) -> bool:
|
||||
cp_value = ord(label[pos])
|
||||
|
||||
if cp_value == 0x00B7:
|
||||
if 0 < pos < len(label) - 1:
|
||||
if ord(label[pos - 1]) == 0x006C and ord(label[pos + 1]) == 0x006C:
|
||||
return True
|
||||
return False
|
||||
|
||||
elif cp_value == 0x0375:
|
||||
if pos < len(label) - 1 and len(label) > 1:
|
||||
return _is_script(label[pos + 1], "Greek")
|
||||
return False
|
||||
|
||||
elif cp_value == 0x05F3 or cp_value == 0x05F4:
|
||||
if pos > 0:
|
||||
return _is_script(label[pos - 1], "Hebrew")
|
||||
return False
|
||||
|
||||
elif cp_value == 0x30FB:
|
||||
for cp in label:
|
||||
if cp == "\u30fb":
|
||||
continue
|
||||
if _is_script(cp, "Hiragana") or _is_script(cp, "Katakana") or _is_script(cp, "Han"):
|
||||
return True
|
||||
return False
|
||||
|
||||
elif 0x660 <= cp_value <= 0x669:
|
||||
for cp in label:
|
||||
if 0x6F0 <= ord(cp) <= 0x06F9:
|
||||
return False
|
||||
return True
|
||||
|
||||
elif 0x6F0 <= cp_value <= 0x6F9:
|
||||
for cp in label:
|
||||
if 0x660 <= ord(cp) <= 0x0669:
|
||||
return False
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
|
||||
def check_label(label: Union[str, bytes, bytearray]) -> None:
|
||||
if isinstance(label, (bytes, bytearray)):
|
||||
label = label.decode("utf-8")
|
||||
if len(label) == 0:
|
||||
raise IDNAError("Empty Label")
|
||||
|
||||
check_nfc(label)
|
||||
check_hyphen_ok(label)
|
||||
check_initial_combiner(label)
|
||||
|
||||
for pos, cp in enumerate(label):
|
||||
cp_value = ord(cp)
|
||||
if intranges_contain(cp_value, idnadata.codepoint_classes["PVALID"]):
|
||||
continue
|
||||
elif intranges_contain(cp_value, idnadata.codepoint_classes["CONTEXTJ"]):
|
||||
try:
|
||||
if not valid_contextj(label, pos):
|
||||
raise InvalidCodepointContext(
|
||||
"Joiner {} not allowed at position {} in {}".format(_unot(cp_value), pos + 1, repr(label))
|
||||
)
|
||||
except ValueError:
|
||||
raise IDNAError(
|
||||
"Unknown codepoint adjacent to joiner {} at position {} in {}".format(
|
||||
_unot(cp_value), pos + 1, repr(label)
|
||||
)
|
||||
)
|
||||
elif intranges_contain(cp_value, idnadata.codepoint_classes["CONTEXTO"]):
|
||||
if not valid_contexto(label, pos):
|
||||
raise InvalidCodepointContext(
|
||||
"Codepoint {} not allowed at position {} in {}".format(_unot(cp_value), pos + 1, repr(label))
|
||||
)
|
||||
else:
|
||||
raise InvalidCodepoint(
|
||||
"Codepoint {} at position {} of {} not allowed".format(_unot(cp_value), pos + 1, repr(label))
|
||||
)
|
||||
|
||||
check_bidi(label)
|
||||
|
||||
|
||||
def alabel(label: str) -> bytes:
|
||||
try:
|
||||
label_bytes = label.encode("ascii")
|
||||
ulabel(label_bytes)
|
||||
if not valid_label_length(label_bytes):
|
||||
raise IDNAError("Label too long")
|
||||
return label_bytes
|
||||
except UnicodeEncodeError:
|
||||
pass
|
||||
|
||||
check_label(label)
|
||||
label_bytes = _alabel_prefix + _punycode(label)
|
||||
|
||||
if not valid_label_length(label_bytes):
|
||||
raise IDNAError("Label too long")
|
||||
|
||||
return label_bytes
|
||||
|
||||
|
||||
def ulabel(label: Union[str, bytes, bytearray]) -> str:
|
||||
if not isinstance(label, (bytes, bytearray)):
|
||||
try:
|
||||
label_bytes = label.encode("ascii")
|
||||
except UnicodeEncodeError:
|
||||
check_label(label)
|
||||
return label
|
||||
else:
|
||||
label_bytes = bytes(label)
|
||||
|
||||
label_bytes = label_bytes.lower()
|
||||
if label_bytes.startswith(_alabel_prefix):
|
||||
label_bytes = label_bytes[len(_alabel_prefix) :]
|
||||
if not label_bytes:
|
||||
raise IDNAError("Malformed A-label, no Punycode eligible content found")
|
||||
if label_bytes.decode("ascii")[-1] == "-":
|
||||
raise IDNAError("A-label must not end with a hyphen")
|
||||
else:
|
||||
check_label(label_bytes)
|
||||
return label_bytes.decode("ascii")
|
||||
|
||||
try:
|
||||
label = label_bytes.decode("punycode")
|
||||
except UnicodeError:
|
||||
raise IDNAError("Invalid A-label")
|
||||
check_label(label)
|
||||
return label
|
||||
|
||||
|
||||
def uts46_remap(domain: str, std3_rules: bool = True, transitional: bool = False) -> str:
|
||||
"""Re-map the characters in the string according to UTS46 processing."""
|
||||
from .uts46data import uts46data
|
||||
|
||||
output = ""
|
||||
|
||||
for pos, char in enumerate(domain):
|
||||
code_point = ord(char)
|
||||
try:
|
||||
uts46row = uts46data[code_point if code_point < 256 else bisect.bisect_left(uts46data, (code_point, "Z")) - 1]
|
||||
status = uts46row[1]
|
||||
replacement: Optional[str] = None
|
||||
if len(uts46row) == 3:
|
||||
replacement = uts46row[2]
|
||||
if (
|
||||
status == "V"
|
||||
or (status == "D" and not transitional)
|
||||
or (status == "3" and not std3_rules and replacement is None)
|
||||
):
|
||||
output += char
|
||||
elif replacement is not None and (
|
||||
status == "M" or (status == "3" and not std3_rules) or (status == "D" and transitional)
|
||||
):
|
||||
output += replacement
|
||||
elif status != "I":
|
||||
raise IndexError()
|
||||
except IndexError:
|
||||
raise InvalidCodepoint(
|
||||
"Codepoint {} not allowed at position {} in {}".format(_unot(code_point), pos + 1, repr(domain))
|
||||
)
|
||||
|
||||
return unicodedata.normalize("NFC", output)
|
||||
|
||||
|
||||
def encode(
|
||||
s: Union[str, bytes, bytearray],
|
||||
strict: bool = False,
|
||||
uts46: bool = False,
|
||||
std3_rules: bool = False,
|
||||
transitional: bool = False,
|
||||
) -> bytes:
|
||||
if not isinstance(s, str):
|
||||
try:
|
||||
s = str(s, "ascii")
|
||||
except UnicodeDecodeError:
|
||||
raise IDNAError("should pass a unicode string to the function rather than a byte string.")
|
||||
if uts46:
|
||||
s = uts46_remap(s, std3_rules, transitional)
|
||||
trailing_dot = False
|
||||
result = []
|
||||
if strict:
|
||||
labels = s.split(".")
|
||||
else:
|
||||
labels = _unicode_dots_re.split(s)
|
||||
if not labels or labels == [""]:
|
||||
raise IDNAError("Empty domain")
|
||||
if labels[-1] == "":
|
||||
del labels[-1]
|
||||
trailing_dot = True
|
||||
for label in labels:
|
||||
s = alabel(label)
|
||||
if s:
|
||||
result.append(s)
|
||||
else:
|
||||
raise IDNAError("Empty label")
|
||||
if trailing_dot:
|
||||
result.append(b"")
|
||||
s = b".".join(result)
|
||||
if not valid_string_length(s, trailing_dot):
|
||||
raise IDNAError("Domain too long")
|
||||
return s
|
||||
|
||||
|
||||
def decode(
|
||||
s: Union[str, bytes, bytearray],
|
||||
strict: bool = False,
|
||||
uts46: bool = False,
|
||||
std3_rules: bool = False,
|
||||
) -> str:
|
||||
try:
|
||||
if not isinstance(s, str):
|
||||
s = str(s, "ascii")
|
||||
except UnicodeDecodeError:
|
||||
raise IDNAError("Invalid ASCII in A-label")
|
||||
if uts46:
|
||||
s = uts46_remap(s, std3_rules, False)
|
||||
trailing_dot = False
|
||||
result = []
|
||||
if not strict:
|
||||
labels = _unicode_dots_re.split(s)
|
||||
else:
|
||||
labels = s.split(".")
|
||||
if not labels or labels == [""]:
|
||||
raise IDNAError("Empty domain")
|
||||
if not labels[-1]:
|
||||
del labels[-1]
|
||||
trailing_dot = True
|
||||
for label in labels:
|
||||
s = ulabel(label)
|
||||
if s:
|
||||
result.append(s)
|
||||
else:
|
||||
raise IDNAError("Empty label")
|
||||
if trailing_dot:
|
||||
result.append("")
|
||||
return ".".join(result)
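
The core of ``alabel`` and ``ulabel`` above is a Punycode round trip with the ``xn--`` prefix and a 63-octet limit per label. The sketch below shows only that part, using the standard-library ``punycode`` codec. It deliberately skips every validity check (NFC, hyphens, bidi, contextual rules), so it is not a substitute for the functions in this file; the ``*_unchecked`` names are hypothetical.

```python
ALABEL_PREFIX = b"xn--"
MAX_LABEL_LENGTH = 63


def to_alabel_unchecked(label: str) -> bytes:
    """Punycode-encode a label WITHOUT any of the IDNA validity checks."""
    try:
        # Pure-ASCII labels are passed through unchanged.
        encoded = label.encode("ascii")
    except UnicodeEncodeError:
        encoded = ALABEL_PREFIX + label.encode("punycode")
    if len(encoded) > MAX_LABEL_LENGTH:
        raise ValueError("Label too long")
    return encoded


def from_alabel_unchecked(label: bytes) -> str:
    """Reverse the encoding above, again without validity checks."""
    lowered = label.lower()
    if lowered.startswith(ALABEL_PREFIX):
        return lowered[len(ALABEL_PREFIX):].decode("punycode")
    return lowered.decode("ascii")


encoded = to_alabel_unchecked("bücher")
print(encoded, from_alabel_unchecked(encoded))
```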
|
||||
4309
aws-lambda/src/idna/idnadata.py
Normal file
File diff suppressed because it is too large
Load Diff
57
aws-lambda/src/idna/intranges.py
Normal file
@@ -0,0 +1,57 @@
"""
Given a list of integers, made up of (hopefully) a small number of long runs
of consecutive integers, compute a representation of the form
((start1, end1), (start2, end2) ...). Then answer the question "was x present
in the original list?" in time O(log(# runs)).
"""

import bisect
from typing import List, Tuple


def intranges_from_list(list_: List[int]) -> Tuple[int, ...]:
    """Represent a list of integers as a sequence of ranges:
    ((start_0, end_0), (start_1, end_1), ...), such that the original
    integers are exactly those x such that start_i <= x < end_i for some i.

    Ranges are encoded as single integers (start << 32 | end), not as tuples.
    """

    sorted_list = sorted(list_)
    ranges = []
    last_write = -1
    for i in range(len(sorted_list)):
        if i + 1 < len(sorted_list):
            if sorted_list[i] == sorted_list[i + 1] - 1:
                continue
        current_range = sorted_list[last_write + 1 : i + 1]
        ranges.append(_encode_range(current_range[0], current_range[-1] + 1))
        last_write = i

    return tuple(ranges)


def _encode_range(start: int, end: int) -> int:
    return (start << 32) | end


def _decode_range(r: int) -> Tuple[int, int]:
    return (r >> 32), (r & ((1 << 32) - 1))


def intranges_contain(int_: int, ranges: Tuple[int, ...]) -> bool:
    """Determine if `int_` falls into one of the ranges in `ranges`."""
    tuple_ = _encode_range(int_, 0)
    pos = bisect.bisect_left(ranges, tuple_)
    # we could be immediately ahead of a tuple (start, end)
    # with start < int_ <= end
    if pos > 0:
        left, right = _decode_range(ranges[pos - 1])
        if left <= int_ < right:
            return True
    # or we could be immediately behind a tuple (int_, end)
    if pos < len(ranges):
        left, _ = _decode_range(ranges[pos])
        if left == int_:
            return True
    return False
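
The packed-range idea used in this module (half-open ranges stored as ``start << 32 | end`` and searched with ``bisect``) can be tried in isolation with the sketch below. It is a variant written for illustration, not the module's code: ``pack``, ``build_ranges`` and ``contains`` are hypothetical names, and the lookup uses a single probe at ``value + 1`` instead of the two-sided check above. It assumes all values fit in 32 bits.

```python
import bisect
from typing import Iterable, Tuple


def pack(start: int, end: int) -> int:
    """Pack a half-open range [start, end) into one integer."""
    return (start << 32) | end


def build_ranges(values: Iterable[int]) -> Tuple[int, ...]:
    """Collapse integers into sorted, packed half-open ranges."""
    ranges = []
    run_start = prev = None
    for v in sorted(set(values)):
        if prev is not None and v == prev + 1:
            prev = v  # still inside the current run
            continue
        if run_start is not None:
            ranges.append(pack(run_start, prev + 1))
        run_start = prev = v  # start a new run
    if run_start is not None:
        ranges.append(pack(run_start, prev + 1))
    return tuple(ranges)


def contains(value: int, ranges: Tuple[int, ...]) -> bool:
    """Binary-search the packed ranges for `value`."""
    # Every range whose start is <= value sorts before pack(value + 1, 0),
    # so the candidate is the entry just left of the insertion point.
    pos = bisect.bisect_left(ranges, pack(value + 1, 0))
    if pos == 0:
        return False
    packed = ranges[pos - 1]
    start, end = packed >> 32, packed & 0xFFFFFFFF
    return start <= value < end


ranges = build_ranges([1, 2, 3, 7, 8, 20])
print(len(ranges), contains(3, ranges), contains(4, ranges))
```

Packing each range into one integer keeps the table a flat tuple of ints, which compares and bisects cheaply.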
|
||||
1
aws-lambda/src/idna/package_data.py
Normal file
@@ -0,0 +1 @@
|
||||
__version__ = "3.11"
|
||||
0
aws-lambda/src/idna/py.typed
Normal file
8841
aws-lambda/src/idna/uts46data.py
Normal file
File diff suppressed because it is too large
Load Diff
55
aws-lambda/src/index.py
Normal file
@@ -0,0 +1,55 @@
import os
import json
import requests

# Read configuration from environment variables
N8N_WEBHOOK_URL = os.environ.get('N8N_WEBHOOK_URL')
N8N_SECRET_TOKEN = os.environ.get('N8N_SECRET_TOKEN')

def build_alexa_response(text):
    """Build the JSON response for Alexa."""
    return {
        'version': '1.0',
        'response': {
            'outputSpeech': {
                'type': 'PlainText',
                'text': text
            },
            'shouldEndSession': True
        }
    }

def lambda_handler(event, context):
    """Entry point of the Lambda function."""
    print(f"Richiesta ricevuta da Alexa: {json.dumps(event)}")

    if not N8N_WEBHOOK_URL or not N8N_SECRET_TOKEN:
        print("Errore: Variabili d'ambiente non configurate.")
        return build_alexa_response("Errore di configurazione del server.")

    headers = {
        'Content-Type': 'application/json',
        'X-N8N-Webhook-Secret': N8N_SECRET_TOKEN
    }

    try:
        response = requests.post(
            N8N_WEBHOOK_URL,
            headers=headers,
            data=json.dumps(event),
            timeout=8  # Alexa waits at most 10 seconds
        )
        response.raise_for_status()  # Raise an exception for non-2xx status codes

        n8n_data = response.json()
        tts_text = n8n_data.get('tts_response', 'Nessuna risposta ricevuta da Pompeo.')

        print(f"Risposta da n8n: {tts_text}")
        return build_alexa_response(tts_text)

    except requests.exceptions.RequestException as e:
        print(f"Errore nella chiamata a n8n: {e}")
        return build_alexa_response("Mi dispiace, non riesco a contattare Pompeo in questo momento.")
    except Exception as e:
        print(f"Errore generico: {e}")
        return build_alexa_response("Si è verificato un errore inaspettato.")
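
One way the handler's logic might be exercised locally, without AWS or a live n8n instance, is to inject the HTTP call as a function. The sketch below is an assumption-laden illustration, not the committed code: ``handle``, ``PostFn``, ``fake_post`` and ``failing_post`` are hypothetical names, and the real handler calls ``requests.post`` directly. The response shape and the ``tts_response`` key are taken from the file above.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical transport type: receives the serialized Alexa event and
# returns the decoded JSON body of the n8n reply.
PostFn = Callable[[str], Dict[str, Any]]


def build_alexa_response(text: str) -> Dict[str, Any]:
    """Same response shape as in index.py."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": True,
        },
    }


def handle(event: Dict[str, Any], post: PostFn) -> Dict[str, Any]:
    """Forward the event through `post` and wrap the reply for Alexa."""
    try:
        reply = post(json.dumps(event))
    except Exception:
        # Any transport failure becomes a spoken apology.
        return build_alexa_response(
            "Mi dispiace, non riesco a contattare Pompeo in questo momento."
        )
    text = reply.get("tts_response", "Nessuna risposta ricevuta da Pompeo.")
    return build_alexa_response(text)


def fake_post(body: str) -> Dict[str, Any]:
    """Stand-in for a successful n8n call."""
    return {"tts_response": "Ciao"}


def failing_post(body: str) -> Dict[str, Any]:
    """Stand-in for an unreachable webhook."""
    raise ConnectionError("unreachable")


event = {"request": {"type": "LaunchRequest"}}
print(handle(event, fake_post)["response"]["outputSpeech"]["text"])
print(handle(event, failing_post)["response"]["outputSpeech"]["text"])
```

Passing the transport in as a parameter keeps the sketch free of network access and of the ``requests`` dependency, at the cost of diverging from the structure of the committed handler.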
|
||||
1
aws-lambda/src/requests-2.32.5.dist-info/INSTALLER
Normal file
@@ -0,0 +1 @@
|
||||
pip
|
||||
133
aws-lambda/src/requests-2.32.5.dist-info/METADATA
Normal file
@@ -0,0 +1,133 @@
Metadata-Version: 2.4
Name: requests
Version: 2.32.5
Summary: Python HTTP for Humans.
Home-page: https://requests.readthedocs.io
Author: Kenneth Reitz
Author-email: me@kennethreitz.org
License: Apache-2.0
Project-URL: Documentation, https://requests.readthedocs.io
Project-URL: Source, https://github.com/psf/requests
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Web Environment
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: charset_normalizer<4,>=2
Requires-Dist: idna<4,>=2.5
Requires-Dist: urllib3<3,>=1.21.1
Requires-Dist: certifi>=2017.4.17
Provides-Extra: security
Provides-Extra: socks
Requires-Dist: PySocks!=1.5.7,>=1.5.6; extra == "socks"
Provides-Extra: use-chardet-on-py3
Requires-Dist: chardet<6,>=3.0.2; extra == "use-chardet-on-py3"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: project-url
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Requests

**Requests** is a simple, yet elegant, HTTP library.

```python
>>> import requests
>>> r = requests.get('https://httpbin.org/basic-auth/user/pass', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
'{"authenticated": true, ...'
>>> r.json()
{'authenticated': True, ...}
```

Requests allows you to send HTTP/1.1 requests extremely easily. There’s no need to manually add query strings to your URLs, or to form-encode your `PUT` & `POST` data — but nowadays, just use the `json` method!

Requests is one of the most downloaded Python packages today, pulling in around `30M downloads / week`— according to GitHub, Requests is currently [depended upon](https://github.com/psf/requests/network/dependents?package_id=UGFja2FnZS01NzA4OTExNg%3D%3D) by `1,000,000+` repositories. You may certainly put your trust in this code.

[](https://pepy.tech/project/requests)
[](https://pypi.org/project/requests)
[](https://github.com/psf/requests/graphs/contributors)

## Installing Requests and Supported Versions

Requests is available on PyPI:

```console
$ python -m pip install requests
```

Requests officially supports Python 3.9+.

## Supported Features & Best–Practices

Requests is ready for the demands of building robust and reliable HTTP–speaking applications, for the needs of today.

- Keep-Alive & Connection Pooling
- International Domains and URLs
- Sessions with Cookie Persistence
- Browser-style TLS/SSL Verification
- Basic & Digest Authentication
- Familiar `dict`–like Cookies
- Automatic Content Decompression and Decoding
- Multi-part File Uploads
- SOCKS Proxy Support
- Connection Timeouts
- Streaming Downloads
- Automatic honoring of `.netrc`
- Chunked HTTP Requests

## API Reference and User Guide available on [Read the Docs](https://requests.readthedocs.io)

[](https://requests.readthedocs.io)

## Cloning the repository

When cloning the Requests repository, you may need to add the `-c
fetch.fsck.badTimezone=ignore` flag to avoid an error about a bad commit timestamp (see
[this issue](https://github.com/psf/requests/issues/2690) for more background):

```shell
git clone -c fetch.fsck.badTimezone=ignore https://github.com/psf/requests.git
```

You can also apply this setting to your global Git config:

```shell
git config --global fetch.fsck.badTimezone ignore
```

---

[](https://kennethreitz.org) [](https://www.python.org/psf)
43
aws-lambda/src/requests-2.32.5.dist-info/RECORD
Normal file
@@ -0,0 +1,43 @@
requests-2.32.5.dist-info/INSTALLER,sha256=zuuue4knoyJ-UwPPXg8fezS7VCrXJQrAP7zeNuwvFQg,4
requests-2.32.5.dist-info/METADATA,sha256=ZbWgjagfSRVRPnYJZf8Ut1GPZbe7Pv4NqzZLvMTUDLA,4945
requests-2.32.5.dist-info/RECORD,,
requests-2.32.5.dist-info/REQUESTED,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
requests-2.32.5.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
requests-2.32.5.dist-info/licenses/LICENSE,sha256=CeipvOyAZxBGUsFoaFqwkx54aPnIKEtm9a5u2uXxEws,10142
requests-2.32.5.dist-info/top_level.txt,sha256=fMSVmHfb5rbGOo6xv-O_tUX6j-WyixssE-SnwcDRxNQ,9
requests/__init__.py,sha256=4xaAERmPDIBPsa2PsjpU9r06yooK-2mZKHTZAhWRWts,5072
requests/__pycache__/__init__.cpython-313.pyc,,
requests/__pycache__/__version__.cpython-313.pyc,,
requests/__pycache__/_internal_utils.cpython-313.pyc,,
requests/__pycache__/adapters.cpython-313.pyc,,
requests/__pycache__/api.cpython-313.pyc,,
requests/__pycache__/auth.cpython-313.pyc,,
requests/__pycache__/certs.cpython-313.pyc,,
requests/__pycache__/compat.cpython-313.pyc,,
requests/__pycache__/cookies.cpython-313.pyc,,
requests/__pycache__/exceptions.cpython-313.pyc,,
requests/__pycache__/help.cpython-313.pyc,,
requests/__pycache__/hooks.cpython-313.pyc,,
requests/__pycache__/models.cpython-313.pyc,,
requests/__pycache__/packages.cpython-313.pyc,,
requests/__pycache__/sessions.cpython-313.pyc,,
requests/__pycache__/status_codes.cpython-313.pyc,,
requests/__pycache__/structures.cpython-313.pyc,,
requests/__pycache__/utils.cpython-313.pyc,,
requests/__version__.py,sha256=QKDceK8K_ujqwDDc3oYrR0odOBYgKVOQQ5vFap_G_cg,435
requests/_internal_utils.py,sha256=nMQymr4hs32TqVo5AbCrmcJEhvPUh7xXlluyqwslLiQ,1495
requests/adapters.py,sha256=8nX113gbb123aUtx2ETkAN_6IsYX-M2fRoLGluTEcRk,26285
requests/api.py,sha256=_Zb9Oa7tzVIizTKwFrPjDEY9ejtm_OnSRERnADxGsQs,6449
requests/auth.py,sha256=kF75tqnLctZ9Mf_hm9TZIj4cQWnN5uxRz8oWsx5wmR0,10186
requests/certs.py,sha256=Z9Sb410Anv6jUFTyss0jFFhU6xst8ctELqfy8Ev23gw,429
requests/compat.py,sha256=J7sIjR6XoDGp5JTVzOxkK5fSoUVUa_Pjc7iRZhAWGmI,2142
requests/cookies.py,sha256=bNi-iqEj4NPZ00-ob-rHvzkvObzN3lEpgw3g6paS3Xw,18590
requests/exceptions.py,sha256=jJPS1UWATs86ShVUaLorTiJb1SaGuoNEWgICJep-VkY,4260
requests/help.py,sha256=gPX5d_H7Xd88aDABejhqGgl9B1VFRTt5BmiYvL3PzIQ,3875
requests/hooks.py,sha256=CiuysiHA39V5UfcCBXFIx83IrDpuwfN9RcTUgv28ftQ,733
requests/models.py,sha256=MjZdZ4k7tnw-1nz5PKShjmPmqyk0L6DciwnFngb_Vk4,35510
requests/packages.py,sha256=_g0gZ681UyAlKHRjH6kanbaoxx2eAb6qzcXiODyTIoc,904
requests/sessions.py,sha256=Cl1dpEnOfwrzzPbku-emepNeN4Rt_0_58Iy2x-JGTm8,30503
requests/status_codes.py,sha256=iJUAeA25baTdw-6PfD0eF4qhpINDJRJI-yaMqxs4LEI,4322
requests/structures.py,sha256=-IbmhVz06S-5aPSZuUthZ6-6D9XOjRuTXHOabY041XM,2912
requests/utils.py,sha256=WqU86rZ3wvhC-tQjWcjtH_HEKZwWB3iWCZV6SW5DEdQ,33213
0
aws-lambda/src/requests-2.32.5.dist-info/REQUESTED
Normal file
5
aws-lambda/src/requests-2.32.5.dist-info/WHEEL
Normal file
@@ -0,0 +1,5 @@
Wheel-Version: 1.0
Generator: setuptools (80.9.0)
Root-Is-Purelib: true
Tag: py3-none-any
175
aws-lambda/src/requests-2.32.5.dist-info/licenses/LICENSE
Normal file
@@ -0,0 +1,175 @@

                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.
1
aws-lambda/src/requests-2.32.5.dist-info/top_level.txt
Normal file
@@ -0,0 +1 @@
requests
184
aws-lambda/src/requests/__init__.py
Normal file
@@ -0,0 +1,184 @@
#   __
#  /__)  _  _     _   _ _/   _
# / (   (- (/ (/ (- _)  /  _)
#          /

"""
Requests HTTP Library
~~~~~~~~~~~~~~~~~~~~~

Requests is an HTTP library, written in Python, for human beings.
Basic GET usage:

   >>> import requests
   >>> r = requests.get('https://www.python.org')
   >>> r.status_code
   200
   >>> b'Python is a programming language' in r.content
   True

... or POST:

   >>> payload = dict(key1='value1', key2='value2')
   >>> r = requests.post('https://httpbin.org/post', data=payload)
   >>> print(r.text)
   {
     ...
     "form": {
       "key1": "value1",
       "key2": "value2"
     },
     ...
   }

The other HTTP methods are supported - see `requests.api`. Full documentation
is at <https://requests.readthedocs.io>.

:copyright: (c) 2017 by Kenneth Reitz.
:license: Apache 2.0, see LICENSE for more details.
"""

import warnings

import urllib3

from .exceptions import RequestsDependencyWarning

try:
    from charset_normalizer import __version__ as charset_normalizer_version
except ImportError:
    charset_normalizer_version = None

try:
    from chardet import __version__ as chardet_version
except ImportError:
    chardet_version = None


def check_compatibility(urllib3_version, chardet_version, charset_normalizer_version):
    urllib3_version = urllib3_version.split(".")
    assert urllib3_version != ["dev"]  # Verify urllib3 isn't installed from git.

    # Sometimes, urllib3 only reports its version as 16.1.
    if len(urllib3_version) == 2:
        urllib3_version.append("0")

    # Check urllib3 for compatibility.
    major, minor, patch = urllib3_version  # noqa: F811
    major, minor, patch = int(major), int(minor), int(patch)
    # urllib3 >= 1.21.1
    assert major >= 1
    if major == 1:
        assert minor >= 21

    # Check charset_normalizer for compatibility.
    if chardet_version:
        major, minor, patch = chardet_version.split(".")[:3]
        major, minor, patch = int(major), int(minor), int(patch)
        # chardet_version >= 3.0.2, < 6.0.0
        assert (3, 0, 2) <= (major, minor, patch) < (6, 0, 0)
    elif charset_normalizer_version:
        major, minor, patch = charset_normalizer_version.split(".")[:3]
        major, minor, patch = int(major), int(minor), int(patch)
        # charset_normalizer >= 2.0.0 < 4.0.0
        assert (2, 0, 0) <= (major, minor, patch) < (4, 0, 0)
    else:
        warnings.warn(
            "Unable to find acceptable character detection dependency "
            "(chardet or charset_normalizer).",
            RequestsDependencyWarning,
        )


def _check_cryptography(cryptography_version):
    # cryptography < 1.3.4
    try:
        cryptography_version = list(map(int, cryptography_version.split(".")))
    except ValueError:
        return

    if cryptography_version < [1, 3, 4]:
        warning = "Old version of cryptography ({}) may cause slowdown.".format(
            cryptography_version
        )
        warnings.warn(warning, RequestsDependencyWarning)


# Check imported dependencies for compatibility.
try:
    check_compatibility(
        urllib3.__version__, chardet_version, charset_normalizer_version
    )
except (AssertionError, ValueError):
    warnings.warn(
        "urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
        "version!".format(
            urllib3.__version__, chardet_version, charset_normalizer_version
        ),
        RequestsDependencyWarning,
    )

# Attempt to enable urllib3's fallback for SNI support
# if the standard library doesn't support SNI or the
# 'ssl' library isn't available.
try:
    try:
        import ssl
    except ImportError:
        ssl = None

    if not getattr(ssl, "HAS_SNI", False):
        from urllib3.contrib import pyopenssl

        pyopenssl.inject_into_urllib3()

        # Check cryptography version
        from cryptography import __version__ as cryptography_version

        _check_cryptography(cryptography_version)
except ImportError:
    pass

# urllib3's DependencyWarnings should be silenced.
from urllib3.exceptions import DependencyWarning

warnings.simplefilter("ignore", DependencyWarning)

# Set default logging handler to avoid "No handler found" warnings.
import logging
from logging import NullHandler

from . import packages, utils
from .__version__ import (
    __author__,
    __author_email__,
    __build__,
    __cake__,
    __copyright__,
    __description__,
    __license__,
    __title__,
    __url__,
    __version__,
)
from .api import delete, get, head, options, patch, post, put, request
from .exceptions import (
    ConnectionError,
    ConnectTimeout,
    FileModeWarning,
    HTTPError,
    JSONDecodeError,
    ReadTimeout,
    RequestException,
    Timeout,
    TooManyRedirects,
    URLRequired,
)
from .models import PreparedRequest, Request, Response
from .sessions import Session, session
from .status_codes import codes

logging.getLogger(__name__).addHandler(NullHandler())

# FileModeWarnings go off per the default.
warnings.simplefilter("default", FileModeWarning, append=True)
BIN
aws-lambda/src/requests/__pycache__/__init__.cpython-313.pyc
Normal file
Binary file not shown.
BIN
aws-lambda/src/requests/__pycache__/__version__.cpython-313.pyc
Normal file
Binary file not shown.
Binary file not shown.
BIN
aws-lambda/src/requests/__pycache__/adapters.cpython-313.pyc
Normal file
Binary file not shown.
BIN
aws-lambda/src/requests/__pycache__/api.cpython-313.pyc
Normal file
Binary file not shown.
BIN
aws-lambda/src/requests/__pycache__/auth.cpython-313.pyc
Normal file
Binary file not shown.
BIN
aws-lambda/src/requests/__pycache__/certs.cpython-313.pyc
Normal file
Binary file not shown.
BIN
aws-lambda/src/requests/__pycache__/compat.cpython-313.pyc
Normal file
Binary file not shown.
BIN
aws-lambda/src/requests/__pycache__/cookies.cpython-313.pyc
Normal file
Binary file not shown.
BIN
aws-lambda/src/requests/__pycache__/exceptions.cpython-313.pyc
Normal file
Binary file not shown.
BIN
aws-lambda/src/requests/__pycache__/help.cpython-313.pyc
Normal file
Binary file not shown.
BIN
aws-lambda/src/requests/__pycache__/hooks.cpython-313.pyc
Normal file
Binary file not shown.
BIN
aws-lambda/src/requests/__pycache__/models.cpython-313.pyc
Normal file
Binary file not shown.
BIN
aws-lambda/src/requests/__pycache__/packages.cpython-313.pyc
Normal file
Binary file not shown.
BIN
aws-lambda/src/requests/__pycache__/sessions.cpython-313.pyc
Normal file
Binary file not shown.
Some files were not shown because too many files have changed in this diff.