# outsend — Full documentation bundle

Every page of the outsend public documentation, concatenated for ingestion by LLMs. Each page is delimited by `<!-- doc: <slug> -->`.


<!-- doc: index -->

---
title: outsend documentation
slug: 
section: 
summary: Technical reference for outsend — modules, pipelines, monitoring, API. Built for developers and AI assistants.
---

This documentation describes the **public contracts** of every outsend module — what each one accepts as input, what it returns, how it behaves over time, and how modules chain into pipelines.

The goal is twofold:

1. **Help integrators and power users** understand what each module does and how to drive it from the UI or the API.
2. **Be readable by AI assistants** — every page is plain markdown, downloadable in bulk, exposed through the `llms.txt` standard.

## How to read it

- **Concepts** — start here if you're new. Covers what a *job*, a *pipeline*, and a *veille* are, plus the lifecycle and the events they emit.
- **Modules** — one page per module (19 active + 4 on-demand). Each page is structured: Purpose → Inputs → Outputs → Lifecycle → Limits → Errors.
- **API reference** — every REST endpoint, grouped by domain.
- **Integration** — bring-your-own-key (BYOK), MCP server (planned), `llms.txt`.

## Copy everything in one click

The **Copy** button in the top-right corner of every page lets you grab:

- The current page (raw markdown)
- The current section (e.g. all module pages)
- **The entire documentation** — a single concatenated markdown bundle, ready to paste into Claude, ChatGPT, Cursor, or any AI assistant.

There is also a stable LLM index at [`/docs/llms.txt`](/docs/llms.txt) and the full bundle at [`/docs/llms-full.txt`](/docs/llms-full.txt) — both follow the [llms.txt](https://llmstxt.org) standard, so most AI tools detect them automatically.

## Scope

This documentation describes **what outsend exposes**, not how it is built internally. Implementation details — scraping stack, proxy infrastructure, DOM selectors, timing heuristics, exact success rates — are intentionally omitted. They are not stable contracts and would not help you integrate.

If something you need is missing, write to [support@outsend.xyz](mailto:support@outsend.xyz).

## Quick links

- [What is outsend](/docs/what-is-outsend)
- [Quickstart](/docs/quickstart)
- [Jobs & lifecycle](/docs/concepts/jobs-lifecycle)
- [Module registry](/docs/concepts/module-registry)
- [API overview](/docs/api/overview)


<!-- doc: api/auth -->

---
title: Authentication
slug: api/auth
section: API
summary: Session cookie issuance, credential management, email verification, and GDPR self-service endpoints under /api/auth.
---

# Authentication

The Authentication API issues and revokes session cookies, manages credentials, verifies email ownership, and exposes the GDPR self-service endpoints. All routes are mounted under `/api/auth` and respond with JSON unless noted.

## Session cookie

Successful `signup`, `login`, and `password/change` calls set an `outsend_session` cookie:

| Attribute | Value |
|-----------|-------|
| Name | `outsend_session` |
| TTL | 7 days (`SESSION_DURATION_DAYS = 7`) |
| `HttpOnly` | true |
| `Secure` | true (production) |
| `SameSite` | `Lax` |
| `Path` | `/` |

The cookie is a signed token bound to a row in `sessions`. Revoking a session (logout, password change, account deletion) deletes that row server-side, so a replayed cookie is rejected.

## Rate limits and errors

Each endpoint applies per-IP and per-identity windows (see [Limits](/docs/concepts/limits)). Exhaustion returns `429 Too Many Requests` with a French message containing the retry-after delay in seconds.

All errors follow FastAPI's `{ "detail": "<message>" }` shape. Generic codes: `400` (invalid payload, expired token, wrong current password, captcha failure), `401` (bad credentials or missing session on protected routes), `429` (rate limit). Endpoint-specific `detail` messages are listed inline below.

---

## POST /api/auth/signup

Creates a user, sends the welcome + verification email, and opens a session. No auth. Rate limit: 3 / hour / IP.

### Request body

| Field | Type | Notes |
|-------|------|-------|
| `email` | string (email) | Required. |
| `password` | string | 8 to 128 chars, must contain a letter AND a digit/symbol. |
| `invitation_code` | string | 1 to 64 chars. Alpha is invite-only. |
| `accept_responsibility` | boolean | Must be `true`. |
| `hcaptcha_token` | string or null | Required when `HCAPTCHA_SECRET` is configured. |

```json
{
  "email": "ada@example.com",
  "password": "lovelace-1843",
  "invitation_code": "ALPHA-7K2",
  "accept_responsibility": true,
  "hcaptcha_token": "10000000-aaaa-bbbb-cccc-000000000001"
}
```
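A client can pre-check a password before calling the endpoint. The sketch below is one reading of the rule in the table above (8 to 128 characters, at least one letter and at least one digit or symbol). The function name `password_looks_valid` is hypothetical, and treating any non-alphanumeric, non-whitespace character as a "symbol" is an assumption; the server-side validator remains the authority.

```python
def password_looks_valid(password: str) -> bool:
    """Client-side pre-check mirroring the documented signup password rule."""
    # Documented length bounds: 8 to 128 characters.
    if not 8 <= len(password) <= 128:
        return False
    has_letter = any(ch.isalpha() for ch in password)
    # Assumption: a "symbol" is any character that is neither
    # alphanumeric nor whitespace.
    has_digit_or_symbol = any(
        ch.isdigit() or not (ch.isalnum() or ch.isspace()) for ch in password
    )
    return has_letter and has_digit_or_symbol
```

A `False` result only means the request would probably be rejected; a `True` result does not guarantee acceptance.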

### Response — `200 OK`

```json
{
  "ok": true,
  "user": {
    "id": 42,
    "email": "ada@example.com",
    "is_admin": false,
    "is_active": true,
    "email_verified": false,
    "created_at": "2026-05-27T09:14:00Z"
  }
}
```

Sets the `outsend_session` cookie.

### Specific errors

| Status | Detail |
|--------|--------|
| `400` | `Captcha invalide. Réessaie.` |
| `400` | `Code invitation invalide` |
| `400` | `Email existe déjà` |

---

## POST /api/auth/login

Validates credentials and opens a session. No auth. Rate limit: 5 / 15 min / IP and 5 / 15 min / email.

### Request body

| Field | Type | Notes |
|-------|------|-------|
| `email` | string (email) | Required. |
| `password` | string | 1 to 128 chars. |

```json
{ "email": "ada@example.com", "password": "lovelace-1843" }
```

### Response — `200 OK`

```json
{ "ok": true, "user": { "id": 42, "email": "ada@example.com", "is_admin": false, "is_active": true, "email_verified": true, "created_at": "2026-05-27T09:14:00Z" } }
```

Sets the `outsend_session` cookie.
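As a rough illustration, the login call can be driven from the Python standard library with a cookie jar that keeps `outsend_session` for later requests. The helper names (`build_login_request`, `open_session`, `login`) are hypothetical, and the base URL is the one listed in the Jobs API conventions.

```python
import json
import urllib.request
from http.cookiejar import CookieJar

BASE_URL = "https://outsend.xyz"


def build_login_request(email: str, password: str) -> urllib.request.Request:
    """Build the POST /api/auth/login request without sending it."""
    body = json.dumps({"email": email, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/auth/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def open_session():
    """Return an opener that stores cookies in a jar, plus the jar itself."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login(opener, email: str, password: str) -> dict:
    """Send the login request; urllib raises HTTPError on 401 or 429."""
    with opener.open(build_login_request(email, password)) as response:
        return json.load(response)
```

After a successful `login(opener, ...)`, reusing the same `opener` for later calls sends the session cookie automatically.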

### Specific errors

| Status | Detail |
|--------|--------|
| `401` | `Email ou mot de passe incorrect` |
| `401` | `Compte désactivé` |

---

## POST /api/auth/logout

Revokes the current session and clears the cookie. Auth optional. Empty body. Response: `200 OK` `{ "ok": true }`.

---

## GET /api/auth/me

Returns the currently authenticated user.

```json
{
  "id": 42,
  "email": "ada@example.com",
  "is_admin": false,
  "is_active": true,
  "email_verified": true,
  "created_at": "2026-05-27T09:14:00Z"
}
```

---

## POST /api/auth/password/reset-request

Sends a reset link to the email if (and only if) it matches an active user. The response is identical in every case to prevent account enumeration. No auth. Rate limit: 3 / hour / IP and 3 / hour / email (silent when exhausted).

### Request body

```json
{ "email": "ada@example.com" }
```

### Response — `200 OK`

```json
{ "ok": true }
```

---

## POST /api/auth/password/reset-confirm

Consumes a single-use reset token and sets the new password. Revokes every existing session for the user. No auth (the token is the credential).

### Request body

| Field | Type | Notes |
|-------|------|-------|
| `token` | string | 10 to 256 chars, delivered by email. |
| `new_password` | string | 8 to 128 chars, letter + digit/symbol. |

```json
{ "token": "eyJ...", "new_password": "babbage-1822" }
```

### Response — `200 OK`

```json
{ "ok": true }
```

### Specific errors

| Status | Detail |
|--------|--------|
| `400` | `Lien invalide ou expiré` |
| `422` | Password complexity rejected by validator. |

---

## POST /api/auth/password/change

Rotates the password for a logged-in user. Requires the current password, revokes other sessions, issues a fresh cookie. Rate limit: 5 / hour / user.

### Request body

| Field | Type | Notes |
|-------|------|-------|
| `current_password` | string | 1 to 200 chars. |
| `new_password` | string | 8 to 200 chars, must differ from current. |

```json
{ "current_password": "lovelace-1843", "new_password": "babbage-1822" }
```

### Response — `200 OK`

```json
{ "ok": true }
```

Sets a refreshed `outsend_session` cookie.

### Specific errors

| Status | Detail |
|--------|--------|
| `400` | `Mot de passe actuel incorrect` |
| `400` | `Le nouveau mot de passe doit être différent de l'actuel` |

---

## POST /api/auth/email/verify

Consumes a single-use verification token and flips `email_verified` to `true`. No auth.

### Request body

```json
{ "token": "eyJ..." }
```

### Response — `200 OK`

```json
{ "ok": true }
```

Specific error: `400 Lien de vérification invalide ou expiré`.

---

## POST /api/auth/email/resend-verify

Re-sends the verification email to the authenticated user. Idempotent when the address is already verified. Empty body. Rate limit: 3 / hour / user.

### Response — `200 OK`

```json
{ "ok": true }
```

or, if already verified:

```json
{ "ok": true, "already_verified": true }
```

---

## DELETE /api/auth/me

Permanently deletes the account and every owned record (jobs, pipelines, surveillances, sessions, tokens). Job files on disk are purged after the cascading DB delete. Feedback threads are anonymised rather than removed.

### Request body

| Field | Type | Notes |
|-------|------|-------|
| `confirm_email` | string | Must equal the user's email (case-insensitive). |

```json
{ "confirm_email": "ada@example.com" }
```

### Response — `204 No Content`

Empty body. Clears the `outsend_session` cookie.

Specific error: `400 Confirmation email incorrecte`.

---

## GET /api/auth/me/export

GDPR portability endpoint. Streams a ZIP archive containing every record owned by the user.

### Response — `200 OK`

`Content-Type: application/zip`
`Content-Disposition: attachment; filename="outsend-export-<local>-<YYYY-MM-DD>.zip"`

Archive layout:

| Entry | Contents |
|-------|----------|
| `account.json` | Account metadata, no secrets. |
| `jobs.json` | All jobs with metadata. |
| `jobs/<job_id>/*` | CSV/JSON outputs for every `done` job. |
| `pipelines.json` | Pipeline definitions. |
| `veille.json` | `recurring_scraps` + run history. |
| `manifest.txt` | Human-readable summary. |
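
Once downloaded, the archive can be inspected with the standard `zipfile` module. This is a minimal sketch: the function name `summarize_export` is hypothetical, and it assumes `account.json` holds a single JSON object, which the layout above suggests but does not state.

```python
import io
import json
import zipfile


def summarize_export(archive_bytes: bytes) -> dict:
    """Return account metadata and the per-job output files found in an export ZIP."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        # Assumption: account.json is a JSON object.
        account = json.loads(archive.read("account.json"))
        # Entries under jobs/<job_id>/ are the outputs of `done` jobs.
        job_outputs = sorted(
            name
            for name in archive.namelist()
            if name.startswith("jobs/") and not name.endswith("/")
        )
    return {"account": account, "job_outputs": job_outputs}
```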


<!-- doc: api/feedback -->

---
title: Feedback API
slug: api/feedback
section: API
summary: In-app chat with the platform admin and entry point for on-demand module activation requests.
---

# Feedback API

The Feedback API powers the in-app chat between an authenticated user and the platform admin. It also doubles as the entry point for on-demand module activation requests: clicking "Request" on a stub module (email, SMS, WhatsApp, phone carrier) opens a feedback thread with a dedicated `topic`, which surfaces in the admin dashboard's "On demand" inbox.

A thread is a stable conversation pinned to a `topic`. Every reply is a `feedback_message` row scoped to that thread. Read state is tracked per role (user, admin) so each side sees only its own unread badge.

All endpoints require an authenticated caller. Generic errors: `401` (not authenticated), `404` (thread does not exist). Endpoint-specific causes are listed inline.

## Topic conventions

The `topic` field on a thread is a free-form string capped at 64 chars, but the product follows a small set of conventions:

| Topic value              | Meaning                                          |
| ------------------------ | ------------------------------------------------ |
| `general`                | Default. Catch-all chat.                         |
| `feedback`               | Generic product feedback.                        |
| `bug`                    | Bug report.                                      |
| `feature`                | Feature request.                                 |
| `on_demand_email`        | Activation request for the email-campaign stub.  |
| `on_demand_sms`          | Activation request for the SMS-campaign stub.    |
| `on_demand_whatsapp`     | Activation request for the WhatsApp stub.        |
| `on_demand_phone_carrier`| Activation request for the phone-carrier stub.   |

Any `topic` matching `on_demand_*` is picked up by the admin endpoint `GET /api/admin/feedback/on-demand`, which groups threads by topic and exposes open counts. The on-demand stubs are listed in the module registry under `on_demand`; a client can read the registry and build `topic = "on_demand_" + slug`.

The shorter `type` field (`bug`, `feature`, `other`) is independent of topic and only carries the coarse intent for sorting.
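
The topic convention above can be sketched as a small helper. The names `on_demand_topic` and `is_on_demand` are hypothetical; the prefix and the 64-character cap come from this page.

```python
MAX_TOPIC_LENGTH = 64
ON_DEMAND_PREFIX = "on_demand_"


def on_demand_topic(slug: str) -> str:
    """Build the thread topic for an on-demand module slug from the registry."""
    topic = ON_DEMAND_PREFIX + slug
    if len(topic) > MAX_TOPIC_LENGTH:
        raise ValueError(f"topic exceeds {MAX_TOPIC_LENGTH} characters: {topic!r}")
    return topic


def is_on_demand(topic: str) -> bool:
    """True when a topic follows the on_demand_* convention."""
    return topic.startswith(ON_DEMAND_PREFIX)
```

For example, `on_demand_topic("whatsapp")` yields `on_demand_whatsapp`, the value used in the request example on this page.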

---

## POST /api/feedback/threads

Create a new thread together with its first message. Rate limit: 20 threads per user per hour.

| Field          | Type     | Notes                                             |
| -------------- | -------- | ------------------------------------------------- |
| `type`         | string   | `bug`, `feature`, or `other`. Defaults to `other`.|
| `message`      | string   | 3 to 5000 chars. The first message body.          |
| `topic`        | string   | Optional. Defaults to `general`. Max 64 chars.    |

### Request

```
POST /api/feedback/threads
{
  "type": "feature",
  "topic": "on_demand_whatsapp",
  "message": "Sending WhatsApp follow-ups to scraped leads would be useful."
}
```

### Response — 201 Created

```json
{
  "id": 142,
  "user_id": 7,
  "user_email": "user@example.com",
  "type": "feature",
  "status": "open",
  "created_at": "2026-05-27 10:11:12",
  "last_read_user": "2026-05-27 10:11:12",
  "last_read_admin": null,
  "messages": [
    {
      "id": 991,
      "author_role": "user",
      "author_user_id": 7,
      "message": "Sending WhatsApp follow-ups to scraped leads would be useful.",
      "created_at": "2026-05-27 10:11:12"
    }
  ],
  "preview": "Sending WhatsApp follow-ups to scraped leads would be useful.",
  "last_message_at": "2026-05-27 10:11:12",
  "unread_for_me": 0
}
```

Specific causes: `400` `type` not in `{bug, feature, other}`; `422` `message` shorter than 3 or longer than 5000; `429` more than 20 threads in the last hour.

---

## POST /api/feedback/threads/{thread_id}/messages

Append a reply to an existing thread. The caller must own the thread, and the thread must not be `closed`. Posting a message also marks the thread as read for the user side.

### Request

```
POST /api/feedback/threads/142/messages
{
  "message": "Adding more context: opt-out tracking would also be required."
}
```

### Response — 201 Created

Returns the full serialized thread, identical in shape to the `POST /threads` response, with the appended message included.

Specific causes: `400` thread is `closed`; `403` caller does not own the thread; `422` message empty or longer than 5000.

---

## GET /api/feedback/threads

List the caller's threads, newest first. Capped at 100 rows. Each entry embeds the full message list so the client can render previews and unread counts without a second round trip.

### Response — 200 OK

```json
[
  {
    "id": 142,
    "user_id": 7,
    "user_email": "user@example.com",
    "type": "feature",
    "status": "open",
    "created_at": "2026-05-27 10:11:12",
    "last_read_user": "2026-05-27 10:11:12",
    "last_read_admin": null,
    "messages": [ /* ... */ ],
    "preview": "Sending WhatsApp follow-ups...",
    "last_message_at": "2026-05-27 10:11:12",
    "unread_for_me": 0
  }
]
```

The `unread_for_me` counter reflects admin replies not yet seen, computed from `last_read_user`. The companion endpoint `GET /api/feedback/unread` returns the same number aggregated across every thread, ready to bind to a header badge.
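
A client that wants to recompute the badge locally could do so from the embedded messages. This is a sketch of one plausible derivation, not the server's actual logic: it assumes admin replies carry `author_role == "admin"`, that timestamps use the `YYYY-MM-DD HH:MM:SS` format shown in the examples, and that a `null` `last_read_user` means nothing has been read. The name `unread_for_user` is hypothetical.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def unread_for_user(thread: dict) -> int:
    """Count admin messages created after the user's last read marker."""
    # Assumption: admin replies are tagged with author_role == "admin".
    admin_messages = [m for m in thread["messages"] if m["author_role"] == "admin"]
    last_read = thread.get("last_read_user")
    if last_read is None:
        return len(admin_messages)
    seen_until = datetime.strptime(last_read, TIMESTAMP_FORMAT)
    return sum(
        1
        for m in admin_messages
        if datetime.strptime(m["created_at"], TIMESTAMP_FORMAT) > seen_until
    )
```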


<!-- doc: api/jobs -->

---
title: Jobs API
slug: api/jobs
section: API
summary: Unified surface for every workload Outsend runs — source acquisition, enrichment, verification, reporting.
---

# Jobs API

The Jobs API is the unified surface for every workload Outsend runs on a tenant's behalf: source acquisition (`scrap`) and the enrichment, verification and reporting modules that operate on the resulting items. A job is the only billable unit.

See also:

- [Jobs lifecycle](/docs/concepts/jobs-lifecycle) — pending → running → done | failed | cancelled | expired
- [States and events](/docs/concepts/states-and-events) — SSE event payload reference
- [Limits](/docs/concepts/limits) — EF quota, per-job caps, retention

All endpoints require an authenticated session cookie. Endpoints that create or mutate jobs additionally require an active user; `POST /api/jobs` and `POST /api/jobs/{id}/resume` also require a verified email address. Admin-only routes (`/api/admin/*`, `/api/jobs/queue`) are not documented here.

## Conventions

| Item | Value |
|---|---|
| Base URL | `https://outsend.xyz` |
| Auth | Session cookie (`outsend_session`) |
| Content-Type | `application/json` for POST bodies |
| Job identifier | Opaque string (`job.id`), stable for the lifetime of the job |
| Timestamps | ISO 8601 UTC |

### The `JobPublic` object

Every endpoint that returns a job returns the same shape:

```json
{
  "id": "j_01HXYZ...",
  "job_type": "scrap",
  "queries": ["dentiste"],
  "zones": ["Paris", "75015"],
  "include_reviews": false,
  "status": "running",
  "grid_points_count": 412,
  "processed_points": 87,
  "results_count": 64,
  "error_count": 0,
  "ef_cost": 0.041,
  "created_at": "2026-05-27T09:12:03Z",
  "started_at": "2026-05-27T09:12:05Z",
  "completed_at": null,
  "expires_at": "2026-06-26T09:12:03Z",
  "error_message": null,
  "output_filename": null,
  "download_available": false,
  "source_job_id": null,
  "pipeline_id": null,
  "email_mode": null,
  "breakdown": { "by_query": {"dentiste": 64}, "by_zone": {"Paris": 64} },
  "dead_queries": [],
  "flagged_tiles_count": 0,
  "total_attempts_count": 87,
  "query_stats": { "dentiste": { "tiles": 87, "with_results": 71 } }
}
```

`status` is one of `pending | running | done | failed | cancelled | expired`.
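
When polling, two small helpers over a `JobPublic` dict are usually enough to decide whether to keep waiting. The names `is_terminal` and `progress_ratio` are hypothetical. Treating every status other than `pending` and `running` as terminal follows the lifecycle linked above; using `processed_points / grid_points_count` as a progress measure is an assumption that fits `scrap` jobs.

```python
TERMINAL_STATUSES = frozenset({"done", "failed", "cancelled", "expired"})


def is_terminal(job: dict) -> bool:
    """True once the job can no longer change status."""
    return job["status"] in TERMINAL_STATUSES


def progress_ratio(job: dict) -> float:
    """Fraction of grid points processed, clamped to [0.0, 1.0]."""
    total = job.get("grid_points_count") or 0
    if total <= 0:
        return 0.0
    done = job.get("processed_points") or 0
    return min(done / total, 1.0)
```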

### Errors

All endpoints return `{"detail": "..."}` (or `{"detail": {"message": ..., "errors": [...]}}` for validation errors). Generic codes: `401` not authenticated, `403` not authorised (other tenant or unverified email), `404` not found, `422` Pydantic validation. Endpoint-specific causes are listed inline.

---

## Create a job (generic)

```
POST /api/jobs
```

Creates a `scrap` job — the canonical source acquisition workload that runs queries across a geographic grid. For every other workload, use the typed shortcut described below; passing a `type` field to `POST /api/jobs` is **not** supported.

**Request body**

```json
{
  "queries": ["dentiste", "orthodontiste"],
  "zones": ["Paris", "75015", "Lyon 2e"],
  "include_reviews": false,
  "extra_columns": ["gps", "departement", "region"]
}
```

| Field | Type | Notes |
|---|---|---|
| `queries` | `string[]` (1..20) | Each item ≤ 200 chars, trimmed, deduplicated |
| `zones` | `string[]` (1..50) | City names, postal codes, or arrondissements; resolved server-side |
| `include_reviews` | `boolean` | If `true`, fetches the latest reviews per POI (raises EF cost) |
| `extra_columns` | `string[]` | Optional output columns, off by default. Allowed: `gps` (adds exact `lat`/`lon`), `departement`, `region`. Unknown values are ignored. See [the scrap module](/docs/modules/scrap). |

**Response** — `200 OK`, a `JobPublic` in `pending` status.

Specific causes: `400` zone parsing failed / EF quota exceeded / empty grid; `403` email not verified.
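
The field constraints in the table above can be checked client-side before submitting. The sketch below uses hypothetical helper names (`normalize_strings`, `build_scrap_body`); the case-sensitive deduplication is an assumption, and zone resolution still happens server-side.

```python
from typing import Iterable, List

ALLOWED_EXTRA_COLUMNS = {"gps", "departement", "region"}


def normalize_strings(values: Iterable[str], max_items: int, label: str) -> List[str]:
    """Trim, drop empties and duplicates, then enforce the documented count."""
    seen, result = set(), []
    for value in values:
        value = value.strip()
        if value and value not in seen:  # assumption: case-sensitive dedup
            seen.add(value)
            result.append(value)
    if not 1 <= len(result) <= max_items:
        raise ValueError(f"{label} must contain 1 to {max_items} entries")
    return result


def build_scrap_body(queries, zones, include_reviews=False, extra_columns=()):
    """Assemble a POST /api/jobs body that respects the documented limits."""
    queries = normalize_strings(queries, 20, "queries")
    if any(len(q) > 200 for q in queries):
        raise ValueError("each query must be at most 200 characters")
    return {
        "queries": queries,
        "zones": normalize_strings(zones, 50, "zones"),
        "include_reviews": bool(include_reviews),
        # Unknown columns would be ignored by the server; drop them early.
        "extra_columns": [c for c in extra_columns if c in ALLOWED_EXTRA_COLUMNS],
    }
```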

---

## Create a job (typed shortcut)

Every enrichment, verification and report module has a dedicated endpoint that accepts the items it operates on. Each shortcut returns a `JobPublic` whose `job_type` is fixed to the module slug.

```
POST /api/jobs/{type}
```

| `type` | Purpose | Module doc |
|---|---|---|
| `reviews` | Pull the latest reviews for each POI | [reviews](/docs/modules/reviews) |
| `emails` | Discover contact emails from each POI's website | [emails](/docs/modules/emails) |
| `verify-emails` | Anti-bounce verification (no VPN) | [verify-emails](/docs/modules/verify-emails) |
| `socials` | Detect linked social network profiles | [socials](/docs/modules/socials) |
| `phones-extra` | Find additional phone numbers beyond the Maps listing | [phones-extra](/docs/modules/phones-extra) |
| `legal-ids` | Extract SIRET / SIREN from the website | [legal-ids](/docs/modules/legal-ids) |
| `legal-mentions` | Parse the legal-notice page (capital, RCS, …) | [legal-mentions](/docs/modules/legal-mentions) |
| `legal-data` | Enrich via SIRENE / INPI (`api.gouv.fr`) | [legal-data](/docs/modules/legal-data) |
| `pricing` | Extract SaaS / B2B pricing | [pricing](/docs/modules/pricing) |
| `techstack` | Detect CMS, frameworks, analytics, payment, CRM | [techstack](/docs/modules/techstack) |
| `pagespeed` | Score via Google PSI API v5 | [pagespeed](/docs/modules/pagespeed) |
| `ads-intelligence` | Marketing/ads profiling (pixels, CMP, retargeting) | [ads-intelligence](/docs/modules/ads-intelligence) |
| `brand-assets` | Logo, favicon, palette, optional screenshot | [brand-assets](/docs/modules/brand-assets) |
| `dead-check` | Flag dead sites (DNS, parking, default-server, SSL) | [dead-check](/docs/modules/dead-check) |
| `delivery-check` | Gmail Inbox / Promotions / Spam placement test | [delivery-check](/docs/modules/delivery-check) |

**Request body (shape shared by every item-driven module)**

```json
{
  "items": [
    { "nom": "Cabinet Dupont", "site_web": "https://dupont-dentiste.fr", "ville": "Paris" }
  ],
  "source_job_id": "j_01HXYZ..."
}
```

| Field | Type | Notes |
|---|---|---|
| `items` | `dict[]` (1..10 000) | Module-specific keys; usually a subset of a previous job's CSV |
| `source_job_id` | `string?` | Chains the new job to a previous job, used for traceability and billing display |

**Module-specific overrides**

- `POST /api/jobs/emails` — accepts `mode: "normal" | "deep"` (default `normal`).
- `POST /api/jobs/brand-assets` — accepts `capture_screenshot: boolean` (default `false`, ~5× slower per item when on).
- `POST /api/jobs/delivery-check` — does **not** take `items`. Body:

  ```json
  { "domain": "example.com", "subject_filter": "outsend" }
  ```

**Response** — `200 OK`, a `JobPublic` in `pending` status. Additional cause: `422` if `items` is empty, too large, or missing keys required by the module.
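
Because `items` is capped at 10 000 entries, a larger result set has to be split across several jobs. A minimal sketch, assuming the caller already holds the rows in memory; the function name `enrichment_bodies` is hypothetical.

```python
from typing import Iterator, List, Optional

MAX_ITEMS_PER_JOB = 10_000


def enrichment_bodies(
    items: List[dict],
    source_job_id: Optional[str] = None,
    chunk_size: int = MAX_ITEMS_PER_JOB,
) -> Iterator[dict]:
    """Yield one request body per chunk, each within the documented item cap."""
    if not 1 <= chunk_size <= MAX_ITEMS_PER_JOB:
        raise ValueError("chunk_size must be between 1 and 10 000")
    for start in range(0, len(items), chunk_size):
        body = {"items": items[start:start + chunk_size]}
        if source_job_id is not None:
            # Keeps every chunk chained to the job the rows came from.
            body["source_job_id"] = source_job_id
        yield body
```

Each yielded body can be posted to the same `POST /api/jobs/{type}` endpoint; every call creates its own job.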

---

## List jobs

```
GET /api/jobs?limit={n}&offset={n}
```

Returns the authenticated user's jobs, most recent first.

| Param | Type | Default | Range |
|---|---|---|---|
| `limit` | `int` | `100` | clamped to `[1, 500]` |
| `offset` | `int` | `0` | `≥ 0` |

**Response** — `200 OK`, `JobPublic[]`.
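
Walking the full history means advancing `offset` until a short page comes back. The sketch below keeps the HTTP call abstract: `fetch_page` is a hypothetical callable that takes `(limit, offset)` and returns the decoded `JobPublic[]` for that page.

```python
from typing import Callable, Iterator, List


def iter_jobs(fetch_page: Callable[[int, int], List[dict]], limit: int = 100) -> Iterator[dict]:
    """Yield every job by paging through GET /api/jobs."""
    # Mirror the documented clamp so offsets stay aligned with page sizes.
    limit = max(1, min(limit, 500))
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page
        if len(page) < limit:
            return  # a short (or empty) page means the listing is exhausted
        offset += limit
```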

---

## Get a job

```
GET /api/jobs/{id}
```

**Response** — `200 OK`, a single `JobPublic`. Includes live counters (`processed_points`, `results_count`, `query_stats`, `breakdown`) that the dashboard polls between SSE events.

---

## Stream live progress (SSE)

```
GET /api/jobs/{id}/stream?since={log_id}
```

Server-Sent Events stream that emits status transitions, log lines and counter updates as the worker progresses. Reconnects honour the `Last-Event-ID` header automatically; the `since` query param is a fallback for clients that don't speak SSE natively. Event taxonomy (`status`, `log`, `progress`, `done`, `error`) and payload shapes are documented in [States and events](/docs/concepts/states-and-events).

**Headers returned**

```
Content-Type: text/event-stream
Cache-Control: no-cache
X-Accel-Buffering: no
```
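
For clients without a native SSE implementation, the wire format can be decoded by hand. The following is a minimal parser for the generic `text/event-stream` framing (fields `id`, `event`, `data`, blank line as event terminator); it does not handle the `retry` field. The function name `parse_sse` is hypothetical, and the payload inside `data` is left as a raw string because its shape is documented on the States and events page.

```python
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Turn decoded text/event-stream lines into {id, event, data} dicts."""
    last_id, name, data = None, "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event, if it carries data.
            if data:
                yield {"id": last_id, "event": name, "data": "\n".join(data)}
            name, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, commonly used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]
        if field == "data":
            data.append(value)
        elif field == "event":
            name = value
        elif field == "id":
            last_id = value  # persists across events, like Last-Event-ID
```

The last `id` seen is the value to send back as `Last-Event-ID`, or as `since`, when reconnecting.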

---

## List a job's items

```
GET /api/jobs/{id}/items?offset={n}&limit={n}
```

Returns the rows of the job's output CSV as JSON, for chaining into an enrichment job. Only available for jobs whose `status == "done"` and whose `job_type` produces a reusable CSV (i.e. not `delivery_check` and not `viewport_test`).

**Response** — `200 OK`

```json
{
  "count": 412,
  "items": [
    { "nom": "Cabinet Dupont", "site_web": "https://...", "telephone": "+33 1 ...", "...": "..." }
  ]
}
```

Specific causes: `400` job not done or job_type has no reusable output; `410` CSV expired or deleted.

---

## Download a job's result

```
GET /api/jobs/{id}/download?format=csv|json|xlsx
```

Downloads the job's output. CSV is the canonical artefact written by the worker (UTF-8 BOM, `;` separator); JSON and XLSX are derived on the fly. All exports are run through a spreadsheet-formula-injection sanitiser.

| `format` | Media type | Filename |
|---|---|---|
| `csv` (default) | `text/csv; charset=utf-8` | `{job.output_filename}` |
| `json` | `application/json; charset=utf-8` | `{base}.json` |
| `xlsx` | `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` | `{base}.xlsx` |

Specific causes: `400` job still pending/running or unsupported `format`; `410` output expired, missing, or job failed before writing a row.
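
Because the CSV carries a UTF-8 BOM and uses `;` as the separator, default CSV readers tend to mangle the first header or collapse each row into one column. A minimal sketch of reading it with the standard library; the function name `parse_job_csv` is hypothetical.

```python
import csv
import io
from typing import List


def parse_job_csv(raw: bytes) -> List[dict]:
    """Decode a downloaded job CSV into a list of row dicts."""
    # "utf-8-sig" strips the BOM so the first header is not prefixed by U+FEFF.
    text = raw.decode("utf-8-sig")
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter=";")
    return [dict(row) for row in reader]
```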

---

## Cancel a job

```
POST /api/jobs/{id}/cancel
```

Requests cancellation of a `pending` or `running` job. Returns `400` if the job is already terminal. If the job belongs to a pipeline, downstream stages are short-circuited.

**Response** — `200 OK`, `{"ok": true}`.

---

## Resume a job

```
POST /api/jobs/{id}/resume
```

Creates a **new** job that picks up a `cancelled` or `failed` `scrap` where it left off. The new job inherits the source's queries, zones and partial CSV; the worker skips coordinates already processed. EF is debited only for the remaining grid points.

**Response** — `200 OK`, a new `JobPublic` (the resume job) in `pending` status. Its `source_job_id` references the original.

Specific causes: `400` source job not resumable (wrong type, not interrupted, or already fully processed); `403` email not verified.

---

## Delete a job

```
DELETE /api/jobs/{id}
```

Permanently removes the job and its output CSV. Refuses to delete a job that is still running — cancel it first.

**Response** — `204 No Content`. Specific cause: `400` job still running.

---

## Estimate EF cost

```
POST /api/estimate
```

Computes the EF cost of a hypothetical `scrap` job without creating one. Drives the live cost meter in the launch form. Estimation is free and unmetered.

**Request body** — same shape as `POST /api/jobs`, but `queries` and `zones` may be empty (returns `valid: false`).

**Response** — `200 OK`, a `JobEstimateResponse`:

```json
{
  "valid": true,
  "grid_points": 412,
  "total_requests": 824,
  "queries_count": 2,
  "ef_cost": 0.041,
  "estimated_duration_seconds": 1380,
  "errors": [],
  "warnings": []
}
```

| Field | Meaning |
|---|---|
| `valid` | `true` iff `errors` is empty |
| `grid_points` | Distinct GPS tiles across the union of zones |
| `total_requests` | `grid_points × len(queries)` — what the worker will actually call |
| `queries_count` | Echoes `len(queries)` for UI display |
| `ef_cost` | France-equivalent units; see [Limits](/docs/concepts/limits) |
| `estimated_duration_seconds` | Best-effort wall-clock estimate |
| `errors` | Hard blockers (over-quota, unparseable zones, empty grid) |
| `warnings` | Soft signals (not currently used) |
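
The relationship between the fields can be checked against the sample response: 412 grid points × 2 queries = 824 requests. A small sketch, with hypothetical function names, that a launch form could use to sanity-check an estimate before enabling the submit button:

```python
from typing import List


def expected_total_requests(grid_points: int, queries: List[str]) -> int:
    """Documented relationship: grid_points multiplied by the number of queries."""
    return grid_points * len(queries)


def can_launch(estimate: dict) -> bool:
    """An estimate is launchable when it is valid and reports no hard blockers."""
    return bool(estimate.get("valid")) and not estimate.get("errors")
```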

---

## Notes on omitted endpoints

The following routes exist but are intentionally not part of the public surface:

- `GET /api/jobs/queue` — anonymised global queue for the public dashboard widget. Tenant-agnostic, scoped separately.
- `/api/admin/*` — operator-only.
- `GET /api/jobs/{id}/breakdown`, `GET /api/jobs/{id}/map`, `GET /api/jobs/{id}/output-columns`, `GET /api/jobs/{id}/delivery-result`, `POST /api/jobs/parse-list`, `GET /api/brand-lookup`, `GET /api/brand-assets/{owner}/{filename}`, `GET /api/delivery-check/seeds` — UI-internal helpers that may change without notice.


<!-- doc: api/overview -->

---
title: API overview
slug: api/overview
section: API
summary: Conventions shared by every Outsend API endpoint — base URL, auth, content types, versioning, errors.
---

# API overview

The Outsend API exposes the same surface that powers the web application. The dashboard and the API share one backend, one authentication scheme, and one set of objects.

## Base URL

```
https://outsend.xyz
```

Endpoints under `/api/` return JSON or stream events. The base URL is stable for the alpha.

## Authentication

Sessions use a cookie named `outsend_session`. Obtain one by posting credentials:

```
POST /api/auth/login
Content-Type: application/json

{ "email": "name@example.com", "password": "..." }
```

The response sets `outsend_session` as `HttpOnly`, `Secure`, `SameSite=Lax`. Subsequent requests must include it. Sessions remain valid until logout (`POST /api/auth/logout`) or expiry. Requests without a valid cookie receive `401` on protected routes.

API tokens scoped per workspace are on the roadmap; cookie sessions are currently the only supported mechanism.

## Content types

| Surface | Content type | Notes |
|---|---|---|
| Read and write endpoints | `application/json` | UTF-8, snake_case fields |
| Event streams | `text/event-stream` | Server-Sent Events |
| Downloads | `application/octet-stream` or a format-specific media type | Endpoints whose path ends in `/download` |
| Tabular exports | `text/csv`, `application/json`, `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` | Selected via `?format=csv|json|xlsx` |

Endpoints that accept a `format` query parameter each document their own default; `GET /api/jobs/{id}/download`, for example, defaults to `csv` when the parameter is omitted.

## Versioning

The API is in alpha. There is no `/v1/` prefix and no version header — the surface evolves in place. Breaking changes are announced in advance through the changelog and, when relevant, through in-app banners. Additive changes ship without notice. A versioned prefix will be introduced before general availability.

## Rate limits

Sensitive endpoints (authentication, contact, job creation) are protected by per-route quotas. Exceeded limits return `429` with a `Retry-After` header. See [/docs/concepts/limits](/docs/concepts/limits).

## Errors

Failures return a JSON body and a conventional HTTP status:

```json
{
  "detail": "Human-readable message",
  "errors": [
    { "field": "email", "message": "Invalid format" }
  ]
}
```

The `errors` array is present only when the failure is tied to specific input fields.

| Status | Meaning |
|---|---|
| 400 | Malformed request, business rule violation |
| 401 | No session, or session expired |
| 403 | Authenticated but not allowed; also returned for deactivated accounts |
| 404 | Resource does not exist, or is not visible to the caller |
| 422 | Request was well-formed but failed validation |
| 429 | Rate limit reached; retry after the header value |
| 5xx | Server-side fault; retries with backoff are safe |

Treat any 5xx as transient and apply exponential backoff.
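
One way to implement this is exponential backoff with full jitter, honouring `Retry-After` on `429`. The sketch below is an illustration, not a prescribed client: the function names, the 1-second base and the 60-second cap are assumptions, and `Retry-After` is assumed to be expressed in seconds.

```python
import random
from typing import Callable, Optional


def should_retry(status: int) -> bool:
    """Retry on rate limiting and on any server-side fault."""
    return status == 429 or 500 <= status <= 599


def retry_delay(
    attempt: int,
    status: int,
    retry_after: Optional[str] = None,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if status == 429 and retry_after is not None:
        # Assumption: the header carries a delay in seconds.
        return float(retry_after)
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick a random point below the exponential ceiling.
    return ceiling * rng()
```

Passing a fixed `rng` makes the delays deterministic, which helps in tests.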

## Endpoint groups

| Group | Docs page | Purpose |
|---|---|---|
| Authentication | [/docs/api/auth](/docs/api/auth) | Login, logout, signup, password reset, email verification |
| Jobs | [/docs/api/jobs](/docs/api/jobs) | Create, list, inspect, control, and export jobs |
| Pipelines | [/docs/api/pipelines](/docs/api/pipelines) | Compose multi-step workflows and run them |
| Veille | [/docs/api/veille](/docs/api/veille) | Continuous monitoring of queries and sources |
| Feedback | [/docs/api/feedback](/docs/api/feedback) | Submit in-product feedback and bug reports |
| Registry | [/docs/api/registry](/docs/api/registry) | Discover available job types and their parameters |

## SSE protocol

Long-running operations expose progress through Server-Sent Events. Event names, payload shape, and the state machine are documented at [/docs/concepts/states-and-events](/docs/concepts/states-and-events).


<!-- doc: api/pipelines -->

---
title: Pipelines API
slug: api/pipelines
section: API
summary: Compose and run DAGs of scraping, enrichment, and transformation steps under /api/pipelines.
---

# Pipelines API

A pipeline is a directed acyclic graph (DAG) of nodes that chains scraping, enrichment, and transformation steps. Submitting a pipeline starts the root jobs synchronously; downstream jobs are spawned as each predecessor reaches `done`.

All endpoints live under `/api/pipelines` and require an authenticated session. Mutating routes additionally require an active (non-suspended) account. Generic errors: `401` no session, `403` not owner / account suspended, `404` pipeline or node id unknown — endpoint-specific causes are listed inline.

See also: [Pipeline orchestration](/docs/concepts/pipeline-orchestration) and [Filter module](/docs/modules/filter).

## Graph shape

The `definition` document describes the DAG. Edges are explicit and reference node ids; they are not inferred from any per-node `inputs` field.

```json
{
  "nodes": [
    {"id": "n1", "type": "scrap",  "config": {"queries": ["dentist"], "zones": ["Paris"]}, "x": 100, "y": 100},
    {"id": "n2", "type": "emails", "config": {"mode": "normal"}, "x": 320, "y": 100},
    {"id": "n3", "type": "verify", "config": {}, "x": 540, "y": 100}
  ],
  "edges": [
    {"id": "e1", "from": "n1", "to": "n2"},
    {"id": "e2", "from": "n2", "to": "n3"}
  ]
}
```

### Node types

| Type              | Role                                     | Accepts (input)   | Produces (output) |
|-------------------|------------------------------------------|-------------------|-------------------|
| `scrap`           | Google Maps scrape (root)                | none              | `pois`            |
| `import`          | CSV/Sheets import (root)                 | none              | `pois`            |
| `reviews`         | Fetch reviews for POIs                   | `pois_any`        | `reviews`         |
| `emails`          | Discover emails from websites            | `pois_any`        | `pois_email`      |
| `verify`          | SMTP verify emails                       | `pois_email`      | `verified`        |
| `socials`         | Discover social profiles                 | `pois_any`        | `pois`            |
| `dead_check`      | Detect inactive POIs                     | `pois_any`        | `pois`            |
| `techstack`       | Detect website tech stack                | `pois_any`        | `pois`            |
| `ads_intelligence`| Detect ad campaigns                      | `pois_any`        | `pois`            |
| `brand_assets`    | Extract logo and brand assets            | `pois_any`        | `pois`            |
| `legal_ids`       | Find SIRET/SIREN from website            | `pois_any`        | `pois`            |
| `legal_data`      | Company data via api.gouv.fr             | `pois_any`        | `pois`            |
| `legal_mentions`  | Parse legal mentions page                | `pois_any`        | `pois`            |
| `phones_extra`    | Find extra phone numbers                 | `pois_any`        | `pois`            |
| `pricing`         | Extract pricing/tariffs                  | `pois_any`        | `pois`            |
| `pagespeed`       | Google PageSpeed scoring                 | `pois_any`        | `pois`            |
| `phone_info`      | Phone line type / carrier (cache)        | `pois_any`        | `pois`            |
| `filter`          | Apply rule-based row filter              | `any_pois`        | passthrough       |
| `sort`            | Reorder rows by a column                 | `any_pois`        | passthrough       |

`filter` and `sort` preserve the upstream type; type compatibility is resolved by walking back to the nearest non-passthrough ancestor.
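
The walk-back can be sketched as follows. This is an illustration, not the server's code: it assumes every non-root node has a single predecessor (fan-in is rejected anyway), uses a reduced copy of the table above, and copies the `compat` map from the `GET /api/pipelines/schema` response shown on this page. `effective_output` and `can_connect` are hypothetical names.

```python
# Subset of the node table above: (accepted input, produced output).
NODE_IO = {
    "scrap": (None, "pois"),
    "emails": ("pois_any", "pois_email"),
    "verify": ("pois_email", "verified"),
    "filter": ("any_pois", "passthrough"),
    "sort": ("any_pois", "passthrough"),
}
# Copied from the `compat` field of GET /api/pipelines/schema.
COMPAT = {
    "pois_any": ["pois", "pois_email"],
    "pois_email": ["pois_email"],
    "any_pois": ["pois", "pois_email", "verified"],
}


def effective_output(node_id, types, parent):
    """Walk back through passthrough nodes to the nearest concrete output type."""
    while NODE_IO[types[node_id]][1] == "passthrough":
        node_id = parent[node_id]
    return NODE_IO[types[node_id]][1]


def can_connect(src_id, dst_type, types, parent):
    """True if dst_type accepts what src_id effectively produces."""
    accepted = NODE_IO[dst_type][0]
    return effective_output(src_id, types, parent) in COMPAT.get(accepted, [])
```

With `scrap → filter`, the filter's effective output is `pois`, so an `emails` successor is accepted and a `verify` successor is not.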

### Column guarantee

No module ever drops a column. Every enrichment and check node outputs **all the columns it received**, in their original order, plus its own columns appended at the end. This holds across an entire chain: a `scrap → legal_ids → legal_data → emails` pipeline ends with the Google Maps columns (`lien_google_maps`, `note`, `nb_avis`, `lat`, `lon`, …), the identifiers, the legal profile, and the emails side by side. Custom columns from an `import` are passed through untouched as well. The explicit transformations a pipeline can ask for (a `filter` rule, a `sort` `top_n` cut) remove rows, never columns.

> The full machine-readable contract for every node — `category`, `input`/`output`, the `needs`/`produces` columns, and each block's `config_schema` — is served live at [`GET /api/pipelines/schema`](#get-apipipelinesschema). That endpoint is the single source of truth; this table is a summary.

### Validation rules

The server rejects a definition with HTTP 400 if any of the following hold:

| Rule                                            | Error message                                   |
|-------------------------------------------------|-------------------------------------------------|
| Empty `nodes` list                              | `Pipeline vide`                                 |
| More than 20 nodes                              | `Trop de nodes (max 20)`                        |
| Duplicate node id                               | `IDs de nodes en doublon`                       |
| Unknown `type`                                  | `Type de node inconnu : ...`                    |
| Edge endpoint references missing node           | `Edge référence un node inexistant`             |
| Self-loop (`from == to`)                        | `Edge vers soi-même interdit`                   |
| Root type connected as a successor              | `Le node '...' ne peut pas avoir de prédécesseur` |
| Incompatible output to input                    | `Connexion X → Y incompatible`                  |
| Node with more than one predecessor (MVP limit) | `Le node ... a plusieurs prédécesseurs`         |
| Missing required `config` field                 | `Node '...' : champ requis manquant « ... »`    |
| Wrong config field type / bad enum value        | `Node '...', champ « ... » : ...`               |

Roots must be one of `scrap` or `import`. Fan-out (one node feeding several successors) is allowed; fan-in is not.
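
For a quick client-side pre-check before calling the API, the structural rules can be approximated as below. This is a sketch under assumptions: it covers only the graph-shape rules (not unknown types, type compatibility, or `config` checks), and the messages are English paraphrases rather than the server's strings. `structural_errors` is a hypothetical helper.

```python
ROOT_TYPES = {"scrap", "import"}
MAX_NODES = 20


def structural_errors(definition: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape checks passed."""
    nodes = definition.get("nodes", [])
    edges = definition.get("edges", [])
    if not nodes:
        return ["empty nodes list"]
    errors = []
    if len(nodes) > MAX_NODES:
        errors.append("more than 20 nodes")
    ids = [n["id"] for n in nodes]
    if len(ids) != len(set(ids)):
        errors.append("duplicate node id")
    type_of = {n["id"]: n["type"] for n in nodes}
    incoming = {}
    for e in edges:
        src, dst = e["from"], e["to"]
        if src not in type_of or dst not in type_of:
            errors.append(f"edge {src}->{dst} references a missing node")
            continue
        if src == dst:
            errors.append(f"self-loop on {src}")
        if type_of[dst] in ROOT_TYPES:
            errors.append(f"root node {dst} cannot have a predecessor")
        incoming[dst] = incoming.get(dst, 0) + 1
    # Fan-in check: a node may have at most one predecessor.
    errors += [f"node {n} has several predecessors" for n, c in incoming.items() if c > 1]
    return errors
```

The server remains authoritative; `POST /api/pipelines/validate` is the way to get the real verdict.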

### Portable envelope

Pipelines export and import as a single self-describing JSON envelope. The same shape is produced by the editor's **Export** button and accepted by **Import** and AI generation:

```json
{
  "schema_version": 1,
  "name": "Dentists Paris",
  "definition": { "nodes": [...], "edges": [...] },
  "meta": { "exported_from": "outsend.xyz", "kind": "pipeline" }
}
```

`POST /api/pipelines` and `POST /api/pipelines/validate` accept either the full envelope or a bare `definition`. Before validation the server **normalizes** the definition: it generates missing edge ids, auto-lays out nodes that have no `x`/`y`, applies `config_schema` defaults, coerces newline-separated strings into `string[]` fields, and strips config fields not in the schema. This means a minimal hand-written or AI-generated `{nodes, edges}` (no coordinates, no edge ids) is accepted as-is.
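
To make the normalization steps concrete, here is a rough client-side approximation. It is a sketch, not the server's implementation: the layout spacing and the `e1`, `e2`, … edge-id scheme are assumptions, and `schema_nodes` stands for the `nodes` object returned by `GET /api/pipelines/schema`.

```python
def normalize(definition: dict, schema_nodes: dict) -> dict:
    """Approximate the normalization pass described above."""
    nodes = []
    for i, node in enumerate(definition.get("nodes", [])):
        fields = schema_nodes.get(node["type"], {}).get("config_schema", {})
        given = node.get("config", {})
        config = {}
        # Iterating over schema keys drops any config field not in the schema.
        for key, spec in fields.items():
            if key in given:
                value = given[key]
                # Coerce "a\nb" into ["a", "b"] for string[] fields.
                if spec["type"] == "string[]" and isinstance(value, str):
                    value = [s.strip() for s in value.splitlines() if s.strip()]
                config[key] = value
            elif "default" in spec:
                config[key] = spec["default"]
        nodes.append({
            "id": node["id"], "type": node["type"], "config": config,
            # Hypothetical auto-layout: a single row, 220 units apart.
            "x": node.get("x", 100 + 220 * i), "y": node.get("y", 100),
        })
    edges = [
        {"id": e.get("id", f"e{i + 1}"), "from": e["from"], "to": e["to"]}
        for i, e in enumerate(definition.get("edges", []))
    ]
    return {"nodes": nodes, "edges": edges}
```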

---

## GET /api/pipelines/schema

Return the canonical, machine-readable pipeline schema — the single source of truth shared by the editor, import, AI generation, and the planned [MCP](/docs/integration/mcp) `create_pipeline` tool. **Public** (no auth): it is format documentation, identical for every caller.

**Response 200**

```json
{
  "schema_version": 1,
  "compat": { "pois_any": ["pois", "pois_email"], "pois_email": ["pois_email"], "any_pois": ["pois", "pois_email", "verified"] },
  "root_types": ["import", "scrap"],
  "nodes": {
    "scrap": {
      "category": "source", "is_root": true,
      "input": null, "output": "pois",
      "needs": [], "produces": ["nom", "site_web", "telephone", "..."],
      "config_schema": {
        "queries": {"type": "string[]", "required": true, "label": "..."},
        "zones":   {"type": "string[]", "required": true, "label": "..."}
      }
    },
    "emails": { "category": "enrich", "input": "pois_any", "output": "pois_email",
                "needs": ["site_web"], "produces": ["email", "email_personal"],
                "config_schema": {"mode": {"type": "enum", "enum": ["normal", "deep"], "default": "normal"}} }
  }
}
```

`config_schema` field types: `string`, `string[]`, `int`, `float`, `bool`, `enum` (with `enum` list), `object`. `required` and `default` are optional per field.

---

## POST /api/pipelines/validate

Normalize and validate a definition **without creating or running anything**. Used by Import (review before launch) and AI generation (check the JSON Claude produced). Requires a session.

**Request body**

```json
{ "definition": { "nodes": [...], "edges": [...] }, "schema_version": 1 }
```

**Response 200 — valid**

```json
{
  "ok": true,
  "definition": { "nodes": [...with ids, x/y, defaults...], "edges": [...] },
  "summary": { "n_nodes": 3, "n_edges": 2, "types": ["scrap", "emails", "verify"] }
}
```

**Response 200 — invalid** (note: still HTTP 200, with `ok: false`)

```json
{ "ok": false, "error": "Connexion scrap → verify incompatible" }
```

The returned `definition` is the normalized form, ready to load into the editor or submit verbatim to `POST /api/pipelines`.
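
Because a rejected definition still comes back as HTTP 200, a client has to inspect `ok` rather than the status code. A small sketch of that handling, where `unwrap_validation` and `PipelineInvalid` are hypothetical names and the HTTP call itself is left to the caller:

```python
class PipelineInvalid(Exception):
    """Raised when the validate endpoint reports ok: false."""


def unwrap_validation(status_code: int, body: dict) -> dict:
    """Return the normalized definition, or raise on any kind of failure."""
    if status_code != 200:
        raise RuntimeError(f"validate call failed with HTTP {status_code}")
    # A rejected definition still arrives as HTTP 200, so check the flag.
    if not body.get("ok"):
        raise PipelineInvalid(body.get("error", "unknown validation error"))
    return body["definition"]
```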

---

## Generate a pipeline with any AI

You don't need to write the JSON by hand. Two ways:

**1. Built-in (inside outsend).** The editor's **🤖 Build with AI** button sends the schema above plus your plain-language description to Claude using your own key ([BYOK](/docs/integration/byok)), then validates and lays out the result.

**2. Bring-your-own assistant (copy/paste anywhere).** Open any AI assistant — claude.ai, Claude Desktop, Cursor, ChatGPT — and:

1. Paste the contract: either this page, or the whole docs bundle at [`/docs/llms-full.txt`](/docs/llms-full.txt), or just the JSON of [`GET /api/pipelines/schema`](#get-apipipelinesschema).
2. Add your request, e.g. *"Compose an outsend pipeline that finds dentists in Berlin, gets their emails, verifies deliverability, and keeps the top-rated. Return only the JSON envelope."*
3. The assistant returns a `{schema_version, name, definition}` envelope. Paste it into the editor's **Import** dialog (it is validated before anything runs), or `POST` it to `/api/pipelines`.

Because the editor, import, and this API all accept the same envelope and the server normalizes it (missing edge ids, coordinates, and config defaults are filled in), a hand-assembled `{nodes, edges}` works without any layout fields.

---

## POST /api/pipelines

Create a pipeline and launch its root jobs.

**Request body**

```json
{
  "name": "Dentists Paris",
  "definition": { "nodes": [...], "edges": [...] }
}
```

`name` is optional (≤ 120 chars, defaults to `"Pipeline"`). The `definition` is normalized (see [Portable envelope](#portable-envelope)) before validation, so a minimal `{nodes, edges}` works.

**Response 201**

```json
{
  "id": "f1a2…-uuid",
  "status": "running",
  "initial_jobs": ["job_abc", "job_def"]
}
```

Specific cause: `400` definition fails any validation rule, or root job creation fails. On root failure the pipeline is persisted with `status = failed`.

---

## GET /api/pipelines

List the caller's pipelines (most recent first, capped at 50).

**Response 200**

```json
[
  {
    "id": "f1a2…",
    "name": "Dentists Paris",
    "status": "running",
    "created_at": "2026-05-27 10:14:02",
    "completed_at": null,
    "nodes_count": 3,
    "done_count": 1,
    "results_count": 187,
    "progress_pct": 42
  }
]
```

`status` is one of `pending | running | done | failed | cancelled`. `nodes_count` is derived from the stored definition. `done_count` is the number of stages already finished, `results_count` the rows aggregated across all stages so far, and `progress_pct` (0–100) a duration-weighted completion estimate — transform stages (`filter`, `sort`) count far less than scraping/enrichment stages, and the in-flight stage contributes its real sub-progress. The same `progress_pct` is also returned by `GET /api/pipelines/{id}`.
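
The idea behind a duration-weighted estimate can be illustrated as below. The weights are invented for the example (the real values are not documented), and `progress_pct` here is a hypothetical re-implementation, not the server's function.

```python
# Hypothetical relative weights; the real values are not documented.
WEIGHTS = {"filter": 1, "sort": 1}
DEFAULT_WEIGHT = 20  # assumed weight of a scraping or enrichment stage


def progress_pct(stages: list[tuple[str, float]]) -> int:
    """stages: (node type, completion in [0, 1]) for every node of the pipeline."""
    total = sum(WEIGHTS.get(t, DEFAULT_WEIGHT) for t, _ in stages)
    if total == 0:
        return 0
    done = sum(WEIGHTS.get(t, DEFAULT_WEIGHT) * frac for t, frac in stages)
    return round(100 * done / total)
```

With these weights, a finished `scrap` and `filter` followed by an unstarted `emails` stage reports about half, not two thirds, because the transform stage barely counts.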

---

## GET /api/pipelines/{id}

Return a single pipeline with its definition and the jobs spawned so far.

**Response 200**

```json
{
  "id": "f1a2…",
  "user_id": 42,
  "name": "Dentists Paris",
  "definition": { "nodes": [...], "edges": [...] },
  "status": "running",
  "created_at": "2026-05-27 10:14:02",
  "completed_at": null,
  "progress_pct": 42,
  "jobs": [
    {
      "id": "job_abc",
      "job_type": "scrap",
      "status": "done",
      "pipeline_node_id": "n1",
      "results_count": 187,
      "error_message": null,
      "created_at": "2026-05-27 10:14:02",
      "completed_at": "2026-05-27 10:18:55"
    }
  ],
  "output_job": {
    "id": "job_xyz",
    "job_type": "verify_emails",
    "results_count": 142,
    "status": "done",
    "download_available": true
  }
}
```

`output_job` is the pipeline's **final dataset**: the output of the most-downstream stage that has actually produced rows. A pipeline filters and reduces rather than sums, so `output_job.results_count` matches the headline count, not the sum of all stages. Download it via [`GET /api/jobs/{id}/download`](/docs/api/jobs#get-apijobsiddownload) using `output_job.id`, in `csv`, `xlsx`, or `json`. `output_job` is `null` while the pipeline has produced nothing downloadable yet. `download_available` reflects whether a CSV is still on disk and unexpired; that CSV may be final **or** partial, so the flag also works for running and stopped pipelines.

---

## PATCH /api/pipelines/{id}

Not implemented. The current API does not expose graph mutation after creation; clone the pipeline by re-issuing `POST /api/pipelines` with an updated definition. Returns `405 Method Not Allowed`.

---

## DELETE /api/pipelines/{id}

Not implemented. Pipelines are immutable once created; deletion will be added once a retention policy is defined. Returns `405 Method Not Allowed`.

---

## POST /api/pipelines/{id}/run

Not implemented. Pipelines start automatically when created via `POST /api/pipelines`; there is no separate run endpoint. To re-execute an existing graph, post it again as a new pipeline.

---

## GET /api/pipelines/{id}/nodes/{node_id}/input-columns

Inspect the schema of the CSV that will feed a given node. Useful for building filter UIs.

**Behaviour.** The endpoint locates the node's most recent predecessor job. If the predecessor is not yet `done`, the response carries an empty `columns` list and a `reason` code. Otherwise the predecessor's output CSV is read (up to 5000 rows) and each column is profiled for type, fill rate, and sample values.

**Response 200 — predecessor done**

```json
{
  "columns": [
    {
      "name": "telephone",
      "type": "phone",
      "fill_rate": 0.92,
      "sample_values": ["+33 1 23 45 67 89", "0612345678"],
      "distinct_count": null
    },
    {
      "name": "categorie",
      "type": "category",
      "fill_rate": 1.0,
      "sample_values": ["dentiste", "orthodontiste"],
      "distinct_count": 4,
      "distinct_values": ["dentiste", "endodontiste", "orthodontiste", "stomatologue"]
    }
  ],
  "row_count": 187,
  "predecessor_job_id": "job_abc"
}
```

`type` is one of `phone | email | url | number | category | text`. A column is tagged `category` only if it has between 1 and 200 distinct non-empty values; otherwise it falls back to `text`. A typed verdict requires ≥ 80% of non-empty values to match the corresponding pattern.
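
The two thresholds can be illustrated with a small profiler. The regular expressions and the order in which types are tried are assumptions made for the example; only the 80% rule and the 1 to 200 distinct-values rule come from the text above.

```python
import re

# Hypothetical patterns; the server's real ones are not documented.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "url": re.compile(r"^https?://\S+$"),
    "phone": re.compile(r"^\+?[\d\s().-]{7,}$"),
    "number": re.compile(r"^-?\d+([.,]\d+)?$"),
}


def infer_type(values: list[str]) -> str:
    """Classify a column from its cell values."""
    filled = [v.strip() for v in values if v and v.strip()]
    if not filled:
        return "text"
    for name, pattern in PATTERNS.items():
        hits = sum(1 for v in filled if pattern.match(v))
        # A typed verdict needs at least 80% of non-empty values to match.
        if hits / len(filled) >= 0.8:
            return name
    # Otherwise: category if 1 to 200 distinct non-empty values, else text.
    if 1 <= len(set(filled)) <= 200:
        return "category"
    return "text"
```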

**Response 200 — no usable input**

```json
{ "columns": [], "reason": "no_predecessor" }
```

| `reason`          | Meaning                                                  |
|-------------------|----------------------------------------------------------|
| `no_predecessor`  | The node is a root, or has no incoming edge yet.         |
| `no_data_yet`     | Predecessor job exists but is not in status `done`.      |
| `no_csv_found`    | Predecessor finished but no output CSV is on disk.       |
| `csv_read_error`  | The CSV file could not be parsed.                        |

---

## POST /api/pipelines/{id}/nodes/{node_id}/filter-preview

Apply a set of filter rules in memory against the upstream CSV and return the match count plus a small sample. No job is created; no state is mutated.

The target node must be of type `filter`. The body uses the same `rules` shape that `filter` nodes persist in their `config.rules`; previews are computed by the same function the worker uses at execution time, so the count is authoritative for the data inspected.

**Request body**

```json
{
  "rules": {
    "logic": "AND",
    "conditions": [
      {"column": "note", "op": ">=", "value": 4.0},
      {"column": "categorie", "op": "in", "value": ["dentiste", "orthodontiste"]}
    ]
  }
}
```

The exact rule grammar is defined by the filter module (see [Filter module](/docs/modules/filter)).

**Response 200**

```json
{
  "total": 187,
  "matched": 73,
  "samples": [
    {"nom": "Cabinet Dupont", "telephone": "0123456789", "categorie": "dentiste"}
  ],
  "predecessor_job_id": "job_abc",
  "fieldnames": ["nom", "telephone", "categorie", "site_web"],
  "capped": false
}
```

`samples` contains up to 5 matched rows with empty fields stripped. `capped` is `true` when the upstream CSV exceeded the 5000-row preview limit — in that case `total` reflects only the inspected window, but the `matched/total` ratio remains representative.

When the predecessor is not ready, the response is the same `{total, matched, samples, reason}` skeleton with all counts at `0`. Possible `reason` codes mirror the input-columns endpoint: `no_predecessor`, `no_data_yet`, `no_csv_found`.

Specific causes: `400` target node is not of type `filter`, or rule application raised; `500` CSV could not be read.
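
To show how the response fields relate to each other, here is a simplified in-memory preview. It is a sketch under assumptions: only `>=` and `in` appear on this page, so the other operators and the `OR` branch are guesses, and the authoritative grammar is the filter module's. `row_matches` and `preview` are hypothetical names.

```python
import operator

# Only ">=" and "in" appear on this page; the other operators are assumptions.
OPS = {
    ">=": operator.ge, "<=": operator.le, "==": operator.eq,
    "in": lambda cell, allowed: cell in allowed,
}


def row_matches(row: dict, rules: dict) -> bool:
    results = []
    for cond in rules["conditions"]:
        cell = row.get(cond["column"], "")
        if isinstance(cond["value"], (int, float)):
            # CSV cells are strings; compare numerically when the rule value is a number.
            try:
                cell = float(cell)
            except ValueError:
                results.append(False)
                continue
        results.append(OPS[cond["op"]](cell, cond["value"]))
    return all(results) if rules.get("logic", "AND") == "AND" else any(results)


def preview(rows: list[dict], rules: dict, limit: int = 5000) -> dict:
    window = rows[:limit]  # only the first `limit` rows are inspected
    matched = [r for r in window if row_matches(r, rules)]
    return {
        "total": len(window), "matched": len(matched),
        # Up to 5 matched rows, with empty fields stripped.
        "samples": [{k: v for k, v in r.items() if v != ""} for r in matched[:5]],
        "capped": len(rows) > limit,
    }
```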

---

## Lifecycle summary

1. `POST /api/pipelines` validates the graph, persists the pipeline as `running`, and spawns one job per root node.
2. As each job reaches `done`, the worker reads its CSV, transforms rows for the successor's input type, and creates the next job. Empty outputs short-circuit the branch.
3. When every spawned job has reached a terminal status (`done`, `failed`, `cancelled`, `expired`), the pipeline is finalized as `done` if all succeeded, otherwise `failed`.
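
Step 3 reduces to a small decision over the statuses of the spawned jobs. A sketch, with `finalize` as a hypothetical name and the empty-list case (no job spawned yet) treated as "not finished", which is an assumption:

```python
from typing import Optional

TERMINAL = {"done", "failed", "cancelled", "expired"}


def finalize(job_statuses: list[str]) -> Optional[str]:
    """Return the pipeline's final status, or None while jobs are still active."""
    if not job_statuses or any(s not in TERMINAL for s in job_statuses):
        return None
    return "done" if all(s == "done" for s in job_statuses) else "failed"
```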


<!-- doc: api/registry -->

---
title: Module registry API
slug: api/registry
section: API
summary: Single source of truth listing every module the platform exposes — active scrapers, on-demand stubs, meta features, coming-soon items.
---

# Module registry API

The Module registry is the single source of truth listing every module the platform exposes: active scrapers, on-demand stubs, meta features, and coming-soon items still gathering interest votes. The frontend reads the registry instead of hardcoding module slugs, so adding a module only takes two files (`frontend/static/job_types.js` and `app/job_registry.py`).

See also: [/docs/concepts/module-registry](/docs/concepts/module-registry).

## Registry entry shape

Each module is described by a small object that the frontend renders in the dashboard tiles, the search palette, and the pricing pages.

| Field          | Type           | Purpose                                                                 |
| -------------- | -------------- | ----------------------------------------------------------------------- |
| `slug`         | string         | Stable identifier. Used as job_type, route param, and registry key.     |
| `category`     | string         | Group bucket (`sources`, `enrich`, `signals`, `outreach`, `tools`).     |
| `label`        | object         | `{ "fr": "...", "en": "..." }`. Bilingual display name.                 |
| `needs`        | string[]       | Upstream artifacts the module consumes (e.g. `["leads"]`).              |
| `produces`     | string[]       | Downstream artifacts it emits (e.g. `["emails"]`).                      |
| `pipelinable`  | boolean        | Whether the module can be chained inside a Pipeline.                    |
| `is_on_demand` | boolean        | Stub module — clicking activate opens a feedback thread, not a job.     |
| `coming_soon`  | boolean        | Listed for interest voting only. No backend execution.                  |
| `alpha_unavailable` | boolean   | Built and listed as active, but frozen during alpha. Its create endpoint returns `503`. |
| `api_endpoint` | string \| null | Path the dashboard calls to start a run, or `null` for stubs.           |

A module is in exactly one state: on-demand (`is_on_demand`), coming-soon (`coming_soon`), alpha-frozen (`alpha_unavailable`), or plain active (none of the three flags set). Active modules have a non-null `api_endpoint`; stubs and coming-soon modules have `api_endpoint = null`. An `alpha_unavailable` module is presented as active and keeps a non-null `api_endpoint`, but that endpoint returns `503` while the alpha freeze is in effect.
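
These invariants are easy to check on the client when consuming registry entries. A sketch, where `module_state` is a hypothetical helper and the entry dictionaries follow the field table above:

```python
def module_state(entry: dict) -> str:
    """Classify a registry entry; raise if its flags contradict the rules above."""
    flags = [f for f in ("is_on_demand", "coming_soon", "alpha_unavailable") if entry.get(f)]
    if len(flags) > 1:
        raise ValueError(f"{entry.get('slug')}: conflicting flags {flags}")
    state = flags[0] if flags else "active"
    has_endpoint = entry.get("api_endpoint") is not None
    # Only active and alpha-frozen modules are expected to carry an endpoint.
    expects_endpoint = state in ("active", "alpha_unavailable")
    if has_endpoint != expects_endpoint:
        raise ValueError(f"{entry.get('slug')}: api_endpoint does not match state {state}")
    return state
```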

---

## GET /api/modules-registry

Public endpoint. Returns the server-side mirror of the JS registry. The response is a flat object with one array per bucket plus a `feature_pages` mapping that points each active module to its published `/features/<slug>` sales page (or `null` if not written yet).

### Response — 200 OK

```json
{
  "active": [
    "ads_intelligence", "brand_assets", "dead_check", "delivery_check",
    "emails", "filter", "import", "legal_data", "legal_ids",
    "legal_mentions", "pagespeed", "phones_extra", "pricing", "reviews",
    "scrap", "socials", "sort", "techstack", "verify_emails",
    "viewport_test"
  ],
  "multi_proxy": [
    "dead_check", "emails", "legal_ids", "legal_mentions", "phones_extra",
    "pricing", "reviews", "scrap", "socials", "techstack"
  ],
  "parallel": [
    "ads_intelligence", "brand_assets", "delivery_check", "filter",
    "import", "legal_data", "pagespeed", "sort", "verify_emails",
    "viewport_test"
  ],
  "on_demand": [
    "email_campaign", "phone_carrier", "sms_campaign", "whatsapp_campaign"
  ],
  "meta": ["pipeline", "veille"],
  "coming_soon": [
    "ai_personalization", "ai_team_members", "bing_places", "campaign",
    "chrome_extension", "crm", "directories", "email_warmup",
    "funding", "hiring", "integrations", "job_changes", "linkedin",
    "mobile_phones", "multichannel", "natural_filter", "pagesjaunes",
    "press_monitoring", "public_api", "review_patterns", "seo_data",
    "tech_adoption", "tracking", "whatsapp", "yelp_tripadvisor"
  ],
  "alpha_unavailable": ["finance"],
  "feature_pages": {
    "scrap": "scraper-google-maps-gratuit-export-csv",
    "emails": "email-finder-pro-rgpd-france",
    "ads_intelligence": null
  }
}
```

The `multi_proxy` set lists scrapers that share the global VPN pool — only one can run at a time platform-wide. `parallel` modules use direct HTTP and may run concurrently. Clients that schedule jobs should check both sets to surface "Will queue" warnings.
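
A client-side hint can be derived directly from the two sets. In this sketch, `queue_hint` and its three return labels are hypothetical; `registry` is the parsed response of `GET /api/modules-registry`.

```python
def queue_hint(slug: str, registry: dict) -> str:
    """Tell whether a module may have to wait for the shared pool."""
    if slug in registry.get("multi_proxy", []):
        return "may_queue"    # shares the global pool: one job at a time platform-wide
    if slug in registry.get("parallel", []):
        return "concurrent"   # direct HTTP, can run alongside other jobs
    return "unlisted"         # not in either set (on-demand, meta, coming-soon, unknown)
```

A UI would typically show the "Will queue" warning for `may_queue` only.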

---

## GET /api/features

Returns the caller's interest state plus a global counter per coming-soon feature. The counts include every allowed feature id, even those with zero votes, so the frontend can render `Needed (N)` labels without a fallback branch.

The list of acceptable feature ids equals `coming_soon` from the registry, plus a tiny legacy set (`company`, `monitoring`, `pagespeed`) kept around to preserve historic votes.

### Response — 200 OK

```json
{
  "voted": ["linkedin", "funding"],
  "counts": {
    "linkedin": 27,
    "funding": 14,
    "hiring": 6,
    "ai_personalization": 3,
    "directories": 0,
    "press_monitoring": 0
  }
}
```

Specific cause: `401` caller is not authenticated.

---

## POST /api/features/{feature_id}/interest

Records an interest vote for `feature_id`. The operation is idempotent — a second call by the same user is a no-op. Use `DELETE` on the same path to retract the vote.

`feature_id` is validated against the allow-list from the registry (coming-soon ids plus legacy ids). Unknown ids return 404 so the endpoint cannot be used as a write-anywhere KV store.

### Request

```http
POST /api/features/linkedin/interest
```

No body. The user is identified by session.

### Response — 204 No Content

Empty body. Re-fetch `GET /api/features` for the updated counter.

Specific causes: `401` not authenticated; `403` authenticated but not active (pending invite); `404` `feature_id` does not match the registry allow-list.

---

## Related

- [Module registry concept](/docs/concepts/module-registry)
- [Feedback API](/docs/api/feedback) — used by on-demand stubs to surface activation requests in the admin dashboard.


<!-- doc: api/veille -->

---
title: Veille API
slug: api/veille
section: API
summary: Recurring monitoring of scrapes and pipelines, with diff buckets and reputation signals.
---

# Veille API

The Veille API manages recurring monitoring jobs. A *veille* (watch) replays a source scrape — or an entire pipeline — on a fixed cadence, then computes a diff against the previous run to surface what changed.

See [Veille monitoring concepts](/docs/concepts/veille-monitoring) for lifecycle, scheduling, and diff model.

All endpoints are mounted under `/api/veille` and require an authenticated, active session. Responses are JSON. Resource ownership is enforced on every request: cross-user access returns `404`. Generic errors are `401` (no session) and `404` (not found / not owned); endpoint-specific causes are listed inline.

## Resource model

A `Veille` object exposes the following fields:

| Field                | Type            | Description                                                  |
| -------------------- | --------------- | ------------------------------------------------------------ |
| `id`                 | integer         | Stable identifier.                                           |
| `name`               | string          | Human-readable label (2–200 chars).                          |
| `source_job_id`      | string \| null  | Source scrape replayed on each tick (mutually exclusive with `source_pipeline_id`). |
| `source_pipeline_id` | string \| null  | Source pipeline replayed on each tick.                       |
| `frequency_days`     | integer         | Cadence in days, between `1` and `365`.                      |
| `status`             | string          | One of `active`, `paused`, `deleted`.                        |
| `next_run_at`        | string (ISO8601)| Next scheduled execution.                                    |
| `last_run_at`        | string \| null  | Timestamp of the most recent completed run.                  |
| `last_run_job_id`    | string \| null  | Job id of the most recent run.                               |
| `run_count`          | integer         | Total successful runs.                                       |
| `created_at`         | string (ISO8601)| Creation timestamp.                                          |

## Endpoints

### List veilles

`GET /api/veille`

Returns the caller's active and paused veilles. Soft-deleted entries are excluded.

**Response** `200 OK` — `{ "items": [Veille, ...] }`.

### Create a veille

`POST /api/veille`

Creates a recurring monitor from a completed scrape or pipeline owned by the caller. Exactly one of `source_job_id` or `source_pipeline_id` is required.

**Request body**

| Field                | Type    | Required | Notes                                  |
| -------------------- | ------- | -------- | -------------------------------------- |
| `name`               | string  | yes      | 2–200 characters.                      |
| `source_job_id`      | string  | one of   | 8–64 characters.                       |
| `source_pipeline_id` | string  | one of   | 8–64 characters.                       |
| `frequency_days`     | integer | yes      | `1` ≤ value ≤ `365`.                   |

```json
{
  "name": "Plombiers Lyon 3",
  "source_job_id": "job_8f2c91a4",
  "frequency_days": 7
}
```

**Response** `200 OK` — the newly created veille.

Specific cause: `400` validation failure (missing/both source fields, source not owned, source not completed, invalid frequency).
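
The field constraints in the table can be checked before sending the request. A sketch with a hypothetical `veille_payload_errors` helper; ownership and completion of the source can only be verified by the server.

```python
def veille_payload_errors(payload: dict) -> list[str]:
    """Return the constraint violations found in a create-veille payload."""
    errors = []
    name = payload.get("name")
    if not isinstance(name, str) or not 2 <= len(name) <= 200:
        errors.append("name must be 2-200 characters")
    sources = [k for k in ("source_job_id", "source_pipeline_id") if payload.get(k) is not None]
    if len(sources) != 1:
        errors.append("exactly one of source_job_id / source_pipeline_id is required")
    for key in sources:
        value = payload[key]
        if not isinstance(value, str) or not 8 <= len(value) <= 64:
            errors.append(f"{key} must be 8-64 characters")
    freq = payload.get("frequency_days")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(freq, int) or isinstance(freq, bool) or not 1 <= freq <= 365:
        errors.append("frequency_days must be an integer between 1 and 365")
    return errors
```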

### Retrieve a veille

`GET /api/veille/{id}`

Returns a single veille owned by the caller. Soft-deleted entries return `404`.

### Update a veille

`PATCH /api/veille/{id}`

Patches mutable fields. Omitted fields are left untouched.

**Request body**

| Field            | Type    | Notes                                                  |
| ---------------- | ------- | ------------------------------------------------------ |
| `name`           | string  | 2–200 characters.                                      |
| `frequency_days` | integer | `1` ≤ value ≤ `365`. Reschedules `next_run_at`.        |
| `status`         | string  | `active`, `paused`, or `deleted`.                      |

```json
{ "status": "paused", "frequency_days": 14 }
```

**Response** `200 OK` — the updated veille. Specific cause: `400` invalid field value.

### Delete a veille

`DELETE /api/veille/{id}`

Soft-deletes the veille. The record is preserved for audit but excluded from all list endpoints and no longer scheduled.

**Response** `200 OK` — `{ "ok": true }`.

## Runs

A *run* is a single execution of the veille plus the diff statistics computed against the previous run. The first run is a *baseline* (`is_baseline: true`) and has no diff counters.

### List runs

`GET /api/veille/{id}/runs`

Returns the run history ordered by `computed_at` descending.

**Response** `200 OK`

```json
{
  "items": [{
    "id": 17,
    "job_id": "job_b71e0d22",
    "prev_job_id": "job_aa44e0f1",
    "is_baseline": false,
    "total_count": 312, "prev_total_count": 305,
    "new_count": 9, "removed_count": 2,
    "modified_count": 24, "unchanged_count": 279,
    "computed_at": "2026-05-27T08:11:04Z",
    "job_status": "done",
    "job_completed_at": "2026-05-27T08:10:48Z"
  }]
}
```

### Retrieve a run

`GET /api/veille/{id}/runs/{run_id}`

Returns the run, including `samples` — capped previews of the rows in each diff bucket.

**Response** `200 OK`

```json
{
  "id": 17,
  "job_id": "job_b71e0d22",
  "is_baseline": false,
  "new_count": 9,
  "removed_count": 2,
  "modified_count": 24,
  "unchanged_count": 279,
  "total_count": 312,
  "computed_at": "2026-05-27T08:11:04Z",
  "samples": {
    "new": [{ "key": "...", "nom": "..." }],
    "removed": [{ "key": "...", "nom": "..." }],
    "modified": [{
      "key": "...", "nom": "...",
      "before": { "note": "4.3", "nb_avis": 42 },
      "after":  { "note": "3.8", "nb_avis": 51 },
      "changed_fields": ["note", "nb_avis"]
    }]
  }
}
```

## Signal categories

Every non-baseline run classifies each row in the dataset into exactly one bucket:

| Category   | Meaning                                                                       |
| ---------- | ----------------------------------------------------------------------------- |
| `new`      | Row present in the current run, absent from the previous run.                 |
| `removed`  | Row present in the previous run, absent from the current run (closed/dropped).|
| `modified` | Row present in both runs with at least one tracked field changed.             |
| `unchanged`| Row present in both runs, identical on tracked fields.                        |

Bucket counts are surfaced as `new_count`, `removed_count`, `modified_count`, and `unchanged_count`. The matching `samples.{new,removed,modified}` arrays hold capped previews suitable for UI display.
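
The bucket definitions translate into a straightforward comparison of two runs. In this sketch the row key and the list of tracked fields are assumptions (the page does not specify how rows are matched), and `diff_runs` is a hypothetical name.

```python
def diff_runs(prev: dict[str, dict], curr: dict[str, dict],
              tracked: list[str]) -> dict[str, list[str]]:
    """prev/curr map a row key to its row; returns the keys in each bucket."""
    buckets = {"new": [], "removed": [], "modified": [], "unchanged": []}
    for key, row in curr.items():
        if key not in prev:
            buckets["new"].append(key)
        elif any(row.get(f) != prev[key].get(f) for f in tracked):
            buckets["modified"].append(key)
        else:
            buckets["unchanged"].append(key)
    # Rows that existed before but are absent from the current run.
    buckets["removed"] = [k for k in prev if k not in curr]
    return buckets
```

The sample run shown earlier is consistent with this model: `new + modified + unchanged` gives 9 + 24 + 279 = 312 (`total_count`), and `removed + modified + unchanged` gives 2 + 24 + 279 = 305 (`prev_total_count`).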

> The `removed` field is the closed/dropped bucket: a record no longer listed at the source.

## Reputation signals

Reputation signals are a derived view of a run's `modified` bucket. They isolate rows whose public reputation moved in a way that is timing-sensitive for outreach — typically Google Maps listings whose rating dropped or whose review volume surged between two runs.

### Ranking logic (high level)

A modified row becomes a signal when at least one of the following holds:

- **Rating drop** — the average rating decreased by at least `0.2` points.
- **Review surge** — the review count grew by at least `3` since the previous run.

Each signal carries a `score` that ranks urgency. Larger rating drops dominate; review surges contribute a smaller, additive boost above a low-volume noise floor. Signals are returned sorted by `score` descending. The exact weighting is an implementation detail and may evolve; do not depend on absolute score values, only on relative order.
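
The two thresholds can be applied client-side to a modified row, for instance to label why a signal fired. The sketch below deliberately does not compute `score`, since its weighting is undocumented; `signal_reasons` and the reason labels are hypothetical.

```python
RATING_DROP = 0.2   # minimum rating decrease, from the rules above
REVIEW_SURGE = 3    # minimum review-count growth, from the rules above


def signal_reasons(note_avant: float, note_apres: float,
                   avis_avant: int, avis_apres: int) -> list[str]:
    """Return which of the two signal conditions hold for a modified row."""
    reasons = []
    # Round to absorb floating-point noise in differences such as 4.3 - 4.1.
    if round(note_avant - note_apres, 2) >= RATING_DROP:
        reasons.append("rating_drop")
    if avis_apres - avis_avant >= REVIEW_SURGE:
        reasons.append("review_surge")
    return reasons
```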

### List signals

`GET /api/veille/{id}/runs/{run_id}/signals`

**Response** `200 OK`

```json
{
  "items": [{
    "nom": "Garage du Centre",
    "adresse": "12 rue Voltaire, 69003 Lyon",
    "telephone": "+33 4 78 00 00 00",
    "site_web": "https://...", "email": "contact@...",
    "lien_google_maps": "https://maps.google.com/...",
    "note_avant": 4.3, "note_apres": 3.8, "delta_note": -0.5,
    "avis_avant": 42, "avis_apres": 51, "delta_avis": 9,
    "score": 12.0
  }],
  "total": 1
}
```

### Export signals

`GET /api/veille/{id}/runs/{run_id}/signals.{fmt}`

Streams the same ranked signal list as a downloadable file.

| Format | Media type                                                                | Extension |
| ------ | ------------------------------------------------------------------------- | --------- |
| `csv`  | `text/csv; charset=utf-8`                                                 | `.csv`    |
| `json` | `application/json`                                                        | `.json`   |
| `xlsx` | `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`       | `.xlsx`   |

The response sets `Content-Disposition: attachment` with a filename of the form `signaux-reputation-veille-{id}-run-{run_id}.{fmt}`.

Specific cause: `400` unsupported `fmt` (must be `csv`, `json`, `xlsx`).
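
A client can validate `fmt` locally and predict the download filename before issuing the request. A small sketch built only from the table and filename pattern above; `signals_export` is a hypothetical helper, and the identifier types are an assumption.

```python
SUPPORTED_FORMATS = {
    "csv": "text/csv; charset=utf-8",
    "json": "application/json",
    "xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def signals_export(veille_id, run_id, fmt: str) -> tuple[str, str, str]:
    """Return (request path, expected media type, expected filename)."""
    if fmt not in SUPPORTED_FORMATS:
        # The server would answer 400 here; failing early saves a round trip.
        raise ValueError(f"unsupported fmt {fmt!r}, expected csv, json or xlsx")
    path = f"/api/veille/{veille_id}/runs/{run_id}/signals.{fmt}"
    filename = f"signaux-reputation-veille-{veille_id}-run-{run_id}.{fmt}"
    return path, SUPPORTED_FORMATS[fmt], filename
```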


<!-- doc: concepts/ai-spending-caps -->

---
title: AI spending caps
slug: concepts/ai-spending-caps
section: Concepts
summary: Hard per-user spend limits ($/request, $/day, $/month) on AI features, with up-front cost estimates and email alerts.
---

AI features in outsend run on **your own provider key** (BYOK): the provider bills you directly, at cost. So that a miscalibrated prompt never runs up your bill, outsend enforces **hard spending caps** on every AI request. The caps are enforced server-side, so they can't be bypassed.

## The three caps

| Cap          | Default | Configurable up to |
|--------------|---------|--------------------|
| Per request  | $10     | $100               |
| Per day      | $10     | $100               |
| Per month    | $100    | $1,000             |

Set them in **Settings → AI spending caps**. Days and months are counted in UTC.

## How it works

1. **Estimate before** — before an AI action runs, outsend shows the worst-case cost (your input tokens + the maximum output tokens) and how much budget you have left today and this month.
2. **Block before overspending** — if a request could push you over a cap, it is refused *before* the provider is ever called. Nothing is spent.
3. **Track the real cost** — after each call, the actual cost (the provider's reported token usage × the model's price) is added to your daily and monthly totals.
4. **Email alerts** — you get an email at 80% of a daily/monthly cap, and again when a cap is reached (AI is paused until it resets).
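
The estimate-then-block sequence can be illustrated as follows. This is a sketch under stated assumptions, not the server logic: the per-million-token prices are placeholders, and `worst_case_cost`, `blocking_cap` and `alert_level` are hypothetical names. Only the default caps and the 80% alert threshold come from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Caps:
    per_request: float = 10.0   # default caps, in USD
    per_day: float = 10.0
    per_month: float = 100.0


def worst_case_cost(input_tokens: int, max_output_tokens: int,
                    input_price_per_mtok: float,
                    output_price_per_mtok: float) -> float:
    """Input tokens plus the maximum output tokens, priced per million tokens."""
    return (input_tokens * input_price_per_mtok
            + max_output_tokens * output_price_per_mtok) / 1_000_000


def blocking_cap(estimate: float, spent_today: float, spent_month: float,
                 caps: Caps) -> Optional[str]:
    """Name of the first cap the request could exceed, or None if it may run."""
    if estimate > caps.per_request:
        return "per_request"
    if spent_today + estimate > caps.per_day:
        return "per_day"
    if spent_month + estimate > caps.per_month:
        return "per_month"
    return None


def alert_level(spent: float, cap: float) -> Optional[str]:
    """Email alert tier for a daily or monthly total."""
    if spent >= cap:
        return "reached"
    if spent >= 0.8 * cap:
        return "warning_80"
    return None
```

With placeholder prices of $3 and $15 per million input and output tokens, a request of 100,000 input tokens and 4,000 maximum output tokens is estimated at $0.36, which would be refused if $9.80 had already been spent that day under the default daily cap.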

## Models without a known price

Prices come from a public catalog of model prices (~2,700 models). If a model isn't in it (some custom or exotic endpoints), outsend can't compute its cost: the request is **allowed and tracked, but not capped**, and the UI flags the price as unknown. Mainstream models from every supported provider are priced.

## Good to know

- Caps are a **safety net on outsend's side** — the real bill is always your provider's, and the estimate is indicative.
- Resets are calendar-based: the daily total resets at 00:00 UTC, the monthly total on the 1st.
- Raising a cap takes effect immediately; AI resumes as soon as you're back under it.
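
Because resets are calendar-based in UTC, a client can compute when a paused cap will lift from the current time alone. A minimal sketch; `next_resets` is a hypothetical helper and expects a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone


def next_resets(now: datetime) -> tuple[datetime, datetime]:
    """Return (next daily reset, next monthly reset), both in UTC."""
    now = now.astimezone(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    next_day = midnight + timedelta(days=1)
    first_of_month = midnight.replace(day=1)
    # Jump past the end of any month (at most 31 days), then snap to the 1st.
    next_month = (first_of_month + timedelta(days=32)).replace(day=1)
    return next_day, next_month
```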


<!-- doc: concepts/jobs-lifecycle -->

---
title: Jobs & lifecycle
slug: concepts/jobs-lifecycle
section: Concepts
summary: A job is one unit of work. This page describes its states, transitions, events, and retry semantics.
---

A **job** is one unit of work. Every module runs as a job. Jobs are isolated, observable, and resumable.

## State machine

```
   ┌─────────┐    queue picks    ┌─────────┐    success    ┌──────┐
   │ pending │ ────────────────► │ running │ ────────────► │ done │
   └─────────┘                   └─────────┘               └──────┘
        │                             │
        │     user cancels            │     fatal error
        ▼                             ▼
   ┌───────────┐                 ┌────────┐
   │ cancelled │                 │ failed │
   └───────────┘                 └────────┘

   done / failed / cancelled  ──── (after 7 days) ────►  expired
```

| State       | Meaning                                                                   |
|-------------|---------------------------------------------------------------------------|
| `pending`   | Created, sitting in the FIFO queue                                        |
| `running`   | Picked by a worker, executing                                             |
| `done`      | Completed successfully, results downloadable                              |
| `failed`    | Errored out (see `error_message`)                                         |
| `cancelled` | Cancelled via the UI or API                                               |
| `expired`   | More than 7 days since terminal state — result files purged               |

Transitions and queue assignment are atomic; a job is never picked twice.
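
For client-side bookkeeping, the state machine can be written as a transition table. This is a sketch read off the diagram and table above; the `running → cancelled` edge is not drawn in the diagram and is inferred from the cancellation endpoint, so treat it as an assumption. `can_transition` and `is_terminal` are hypothetical helpers.

```python
TRANSITIONS: dict[str, set[str]] = {
    "pending": {"running", "cancelled"},
    # "cancelled" from "running" is assumed (cancel keeps partial results).
    "running": {"done", "failed", "cancelled"},
    "done": {"expired"},
    "failed": {"expired"},
    "cancelled": {"expired"},
    "expired": set(),
}

TERMINAL = {"done", "failed", "cancelled", "expired"}


def can_transition(current: str, target: str) -> bool:
    """True if the state machine allows moving from current to target."""
    return target in TRANSITIONS[current]


def is_terminal(state: str) -> bool:
    return state in TERMINAL
```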

## Creation

```
POST /api/jobs             { "queries": [...], "zones": [...] }   # creates a scrap job
POST /api/jobs/{type}      { ...module-specific params }          # typed shortcut
```

See [Jobs API](/docs/api/jobs).

## Observability

```
GET /api/jobs/{id}            # status, counters, metadata
GET /api/jobs/{id}/stream     # SSE: status / log / done
```

The stream closes when the job terminates. Safety timeout: 6 hours. Event payloads: see [States & SSE events](/docs/concepts/states-and-events).

## Results

```
GET /api/jobs/{id}/download?format=csv|json|xlsx
GET /api/jobs/{id}/items?offset=0&limit=200
```

Results live **7 days** after terminal state, then are purged. The job record remains.
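
The `offset`/`limit` parameters lend themselves to a simple paging loop. The sketch below keeps the HTTP call abstract: `fetch_page` is a hypothetical callable that performs `GET /api/jobs/{id}/items` and returns the rows of one page as a list. That response shape, and the "short page means last page" stop condition, are assumptions.

```python
from typing import Any, Callable, Iterator


def iter_items(fetch_page: Callable[[int, int], list[Any]],
               limit: int = 200) -> Iterator[Any]:
    """Yield every item by walking offset = 0, limit, 2 * limit, ..."""
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        yield from page
        if len(page) < limit:  # assumed end-of-results signal
            return
        offset += limit
```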

## Errors & retries

A `failed` job exposes `error_message`. Jobs also report `error_count`, the number of items that errored inside the job; a job can reach `done` with `error_count > 0`.

```
POST /api/jobs/{id}/resume
```

Creates a new attempt resuming from the last successful item.

## Cancellation

```
POST /api/jobs/{id}/cancel    # keeps partial results
DELETE /api/jobs/{id}         # cancels and removes record
```

## Concurrency

- Up to **5 simultaneous jobs per user** (queued beyond)
- Two lanes: **serial** (extraction) and **parallel** (6 slots: verification, pipeline utilities, `delivery_check`)
- Jobs are independent — re-runs do not wait on the original

## What's next

- [States & SSE events](/docs/concepts/states-and-events)
- [Pipeline orchestration](/docs/concepts/pipeline-orchestration)
- [Limits & quotas](/docs/concepts/limits)


<!-- doc: concepts/limits -->

---
title: Limits & quotas
slug: concepts/limits
section: Concepts
summary: Every numeric limit enforced by the platform, in one table.
---

Reference for capacity planning. Platform-wide unless noted per-user.

## Jobs

| Limit                              | Value           | Scope          |
|------------------------------------|-----------------|----------------|
| Concurrent jobs per user           | 5               | per user       |
| Parallel-lane worker slots         | 6               | platform-wide (verify_emails, delivery_check, import, filter, sort) |
| Result file retention              | 7 days          | per job        |
| SSE stream max duration            | 6 hours         | per stream     |
| Max EF per job                     | 1.0             | per job        |

The parallel lane is a pool separate from the serial lane used by extraction modules.

## Veille

| Limit              | Value      |
|--------------------|------------|
| Frequency min      | 1 day      |
| Frequency max      | 365 days   |

## Pipelines

| Limit              | Value      |
|--------------------|------------|
| Max nodes          | 20         |
| Max inputs/node    | 1 (MVP)    |

## AI spending (BYOK)

Hard per-user caps on AI features — billed to your own provider key — with email alerts. Configurable in Settings up to 10× the default.

| Cap          | Default | Max     | Scope               |
|--------------|---------|---------|---------------------|
| Per request  | $10     | $100    | per user            |
| Per day      | $10     | $100    | per user, UTC day   |
| Per month    | $100    | $1,000  | per user, UTC month |

Requests that would exceed a cap are blocked before the provider is ever called; you get an email at 80% and when a cap is reached. See [AI spending caps](/docs/concepts/ai-spending-caps).

## Auth rate limits

Per-endpoint windows. Exceeding returns `429 Too Many Requests`.

| Endpoint                       | Limit       | Window                   |
|--------------------------------|-------------|--------------------------|
| Signup                         | 3 attempts  | per hour, per IP         |
| Login                          | 5 attempts  | per 15 min, per IP+email |
| Password reset request         | 3 attempts  | per hour, per IP+email   |
| Password change (logged-in)    | 5 attempts  | per hour, per user       |
| Resend email verification      | 3 attempts  | per hour, per user       |
| Feedback thread creation       | 20 attempts | per hour, per user       |
| Session lifetime               | 7 days      | sliding window           |

No global API throttle beyond these.

## Module-specific

- **[`scrap`](/docs/modules/scrap)** — max 1.0 EF per job
- **[`emails`](/docs/modules/emails)** — `normal` and `deep` modes with different EF profiles
- All multi-proxy modules — `items` array bounded at 1–10000 per request

## What's next

- [Jobs & lifecycle](/docs/concepts/jobs-lifecycle)
- [API overview](/docs/api/overview)


<!-- doc: concepts/module-registry -->

---
title: Module registry
slug: concepts/module-registry
section: Concepts
summary: A single source of truth describes every module — its inputs, outputs, category, and where it appears in the UI.
---

Every module outsend exposes is declared in a **single registry**. It powers the dashboard module grid, the new-job picker, the pipeline editor, and the landing page listing.

Guarantees: every active module visible in the dashboard has an endpoint (and vice versa); a machine-readable snapshot is published; categories are hints, while `slug`, `needs` and `produces` are stable.

## The endpoint

```
GET /api/modules-registry
```

Returns the full registry as JSON. Each entry:

```json
{
  "slug": "scrap",
  "category": "extraction",
  "label": { "fr": "Scrap Google Maps", "en": "Scrape Google Maps" },
  "needs": null,
  "produces": "poi_list",
  "pipelinable": true,
  "is_on_demand": false,
  "coming_soon": false,
  "api_endpoint": "/api/jobs/scrap"
}
```

| Field           | Meaning                                                                 |
|-----------------|-------------------------------------------------------------------------|
| `slug`          | Stable identifier, used in URLs and API paths                           |
| `category`      | `extraction` \| `enrichment` \| `intelligence` \| `verification` \| `pipeline` \| `meta` |
| `label`         | User-facing display names per language                                  |
| `needs`         | Input shape (`poi_list`, `csv_rows`, …) — `null` if produced from scratch |
| `produces`      | Output shape                                                            |
| `pipelinable`   | Usable as a node in a pipeline                                          |
| `is_on_demand`  | If true, no backend yet — triggers a conversation with the team         |
| `coming_soon`   | If true, listed for visibility only; interest can be voted              |
| `alpha_unavailable` | If true, the module is built and listed as active everywhere, but frozen during alpha — its create endpoint returns `503` |
| `api_endpoint`  | Shortcut to start a job of this type                                    |

## Flexible input matching

`needs` and `produces` describe *canonical* column names (`nom`, `telephone`, `site_web`, `email`, `lien_google_maps`, …). You never have to format your data to match them exactly: inputs are resolved against a shared table of accepted aliases, so columns named `Website`, `url`, `e-mail`, `name` or `raison sociale` map to the right canonical field. Header-less files are auto-detected and columns are inferred from their content.

Every job is transparent about it. Each run reports a non-blocking **`notice`** (shown as an info banner on the job page and as a discreet ⓘ on the dashboard) describing what was auto-mapped, guessed, or ignored — for example rows skipped because they had no website. A job only fails when a required column is genuinely absent (e.g. an enrichment that needs `site_web` finds it on zero rows), and that error explicitly **names the accepted aliases** so you know what header to provide.
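
Alias resolution of this kind can be pictured as a case-insensitive lookup. The table below is an illustrative subset, not the platform's shared alias table: only the alias spellings quoted above are used, and which canonical field each one maps to is an assumption. `resolve_headers` is a hypothetical helper.

```python
# Illustrative subset; the canonical targets are assumed.
ALIASES = {
    "site_web": {"site_web", "website", "url"},
    "email": {"email", "e-mail"},
    "nom": {"nom", "name", "raison sociale"},
}

_LOOKUP = {alias: canon for canon, aliases in ALIASES.items() for alias in aliases}


def resolve_headers(headers: list[str]) -> tuple[dict[str, str], list[str]]:
    """Map input headers to canonical names; return (mapped, ignored)."""
    mapped: dict[str, str] = {}
    ignored: list[str] = []
    for header in headers:
        canonical = _LOOKUP.get(header.strip().lower())
        if canonical is None:
            ignored.append(header)   # the kind of detail a `notice` would report
        else:
            mapped[header] = canonical
    return mapped, ignored
```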

## Categories

| Category       | What it does                                              | Examples                                                    |
|----------------|-----------------------------------------------------------|-------------------------------------------------------------|
| `extraction`   | Produces data from public sources                         | [`scrap`](/docs/modules/scrap)                              |
| `enrichment`   | Augments existing rows with new fields                    | [`emails`](/docs/modules/emails), [`socials`](/docs/modules/socials), [`legal_ids`](/docs/modules/legal_ids) |
| `intelligence` | Computes signals on existing rows                         | [`pricing`](/docs/modules/pricing), [`techstack`](/docs/modules/techstack), [`ads_intelligence`](/docs/modules/ads_intelligence) |
| `verification` | Validates or scores existing rows                         | [`verify_emails`](/docs/modules/verify_emails), [`delivery_check`](/docs/modules/delivery_check) |
| `pipeline`     | Orchestration utilities                                   | [`import`](/docs/modules/import), [`filter`](/docs/modules/filter), [`sort`](/docs/modules/sort) |
| `meta`         | Not a job — describes pipelines or veilles                | (no API endpoint)                                           |

## Lifecycle of a module

1. **Coming soon** — landing page only, no backend, interest votable
2. **On-demand** — listed in the dashboard, CTA opens a conversation, executed manually
3. **Active** — fully backed by an endpoint
4. **Available (alpha-frozen)** — built and presented as an active module across every surface, but not launchable during alpha: the UI shows a maintenance banner with a disabled launch button, and the create endpoint returns `503`. Unlike *coming soon*, it is not a placeholder and carries no interest vote — it is a finished module held back only by alpha capacity.
5. **Deprecated** — still callable but flagged

Phase changes appear in the registry via `coming_soon`, `is_on_demand`, `alpha_unavailable`, and `deprecated_at`.
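
A registry consumer can derive the phase of a module from those four fields. A minimal sketch; the precedence between flags and the returned labels are assumptions, and `module_phase` is a hypothetical helper.

```python
def module_phase(entry: dict) -> str:
    """Derive a lifecycle phase from a registry entry's flags."""
    if entry.get("deprecated_at"):
        return "deprecated"
    if entry.get("coming_soon"):
        return "coming_soon"
    if entry.get("is_on_demand"):
        return "on_demand"
    if entry.get("alpha_unavailable"):
        return "alpha_frozen"   # listed as active, create endpoint returns 503
    return "active"
```

Applied to the `scrap` entry shown earlier on this page, where every flag is `false` or absent, the function returns `"active"`.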

## Adding a module (contributors)

Adding a module takes two files in the codebase: a JS registry entry (UI surfaces) and a Python registry entry (API + worker dispatcher). The runtime then plugs the module in everywhere automatically.

## What's next

- [Jobs & lifecycle](/docs/concepts/jobs-lifecycle)
- [Pipeline orchestration](/docs/concepts/pipeline-orchestration)


<!-- doc: concepts/pipeline-orchestration -->

---
title: Pipeline orchestration
slug: concepts/pipeline-orchestration
section: Concepts
summary: Chain modules into a reusable DAG. Each block consumes the previous block's output, no glue code required.
---

A **pipeline** is a directed acyclic graph of modules. Each node is one module call; each edge declares which output feeds which input. Pipelines save a multi-step recipe once and re-run it.

Pipelines also back [veille](/docs/concepts/veille-monitoring): a recurring scrap is internally a scheduled pipeline.

## Anatomy

```
   ┌──────────┐
   │  scrap   │   queries=["bakery"], zones=["Paris"]
   └────┬─────┘
        │ produces: poi_list
        ▼
   ┌──────────┐      ┌──────────┐
   │  emails  │      │ ads_intel│
   └────┬─────┘      └────┬─────┘
        │                  │
        ▼                  ▼
   ┌────────────────────────────┐
   │          filter            │   rules: emails_present=true, ads_score≥30
   └────────────┬───────────────┘
                ▼
            ┌────────┐
            │  sort  │   sort_by=ads_score, desc, top_n=200
            └────────┘
```

Each node has:

- **type** — module slug (see [module registry](/docs/concepts/module-registry))
- **params** — module config, identical to a standalone job
- **inputs** — references to upstream node(s)
- **id** — local identifier within the pipeline

## Chaining rules

An edge is valid only if the producer's `produces` matches the consumer's `needs` (shapes like `poi_list`, `enriched_list`, `csv_rows`). The editor enforces this at design time, and the server re-validates on submit.
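
The edge rule amounts to comparing two registry fields. The sketch below uses strict equality, which is one reading of "matches"; the server's check may be more permissive. Only the `scrap` entry is taken from the registry example, the other entries are assumed for illustration, and `edge_is_valid` is a hypothetical helper.

```python
from typing import Optional

REGISTRY: dict[str, dict[str, Optional[str]]] = {
    "scrap": {"needs": None, "produces": "poi_list"},
    "emails": {"needs": "poi_list", "produces": "enriched_list"},  # assumed
    "import": {"needs": None, "produces": "csv_rows"},             # assumed
}


def edge_is_valid(producer: str, consumer: str,
                  registry: dict = REGISTRY) -> bool:
    """True if the producer's output shape equals the consumer's input shape."""
    needs = registry[consumer]["needs"]
    if needs is None:
        # A module produced from scratch accepts no upstream edge.
        return False
    return registry[producer]["produces"] == needs
```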

The full set of chainable blocks — their `category`, `input`/`output` buckets, `needs`/`produces` columns, and per-block `config_schema` — is published as a single machine-readable contract at [`GET /api/pipelines/schema`](/docs/api/pipelines). That endpoint is the **single source of truth**: the editor palette, import, AI generation, and the planned MCP `create_pipeline` tool all read it. Every active enrichment module is chainable (scrap, import, reviews, emails, verify, socials, dead_check, techstack, ads_intelligence, brand_assets, legal_ids, legal_data, legal_mentions, phones_extra, pricing, pagespeed, phone_info) plus the `filter`/`sort` transforms.

## Build, export, import, or generate with AI

A pipeline graph is portable JSON. Four ways to obtain one:

- **Build** it visually in the editor (`/pipelines/new`).
- **Export** the current graph to a JSON envelope (`{schema_version, name, definition, meta}`) — the Export button downloads it.
- **Import** an envelope (paste or `.json` file) — it is validated via `POST /api/pipelines/validate` and loaded back into the editor for review before launch (nothing runs on import).
- **Generate with AI** — describe the pipeline in plain language; the editor sends the server schema plus your description to Claude (using your own key via [BYOK](/docs/integration/byok)), parses the returned JSON, validates it, and lays it out on the canvas.

## Limits

| Limit              | Value                                |
|--------------------|--------------------------------------|
| Max nodes          | 20                                   |
| Max inputs/node    | 1 (multi-input merges not yet open)  |
| Max depth          | 20                                   |
| Re-runs allowed    | Unlimited                            |
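
These limits can be pre-checked before calling `POST /api/pipelines/validate`. The sketch assumes a node list where each node is a dict with `id` and `inputs` (a list of upstream ids); the real JSON shape is defined by the schema endpoint. Counting depth in nodes along a chain is also an assumption, and `validate_limits` is a hypothetical helper.

```python
MAX_NODES = 20
MAX_INPUTS_PER_NODE = 1
MAX_DEPTH = 20


def validate_limits(nodes: list[dict]) -> list[str]:
    """Return human-readable violations; an empty list means all limits hold."""
    errors: list[str] = []
    if len(nodes) > MAX_NODES:
        errors.append(f"too many nodes: {len(nodes)} > {MAX_NODES}")
    by_id = {node["id"]: node for node in nodes}
    for node in nodes:
        inputs = node.get("inputs", [])
        if len(inputs) > MAX_INPUTS_PER_NODE:
            errors.append(f"{node['id']}: {len(inputs)} inputs > {MAX_INPUTS_PER_NODE}")
        for ref in inputs:
            if ref not in by_id:
                errors.append(f"{node['id']}: unknown input {ref!r}")
    for node in nodes:
        # Walk upstream through the first input; depth counts nodes on the chain.
        depth, seen, current = 1, {node["id"]}, node
        while current.get("inputs") and current["inputs"][0] in by_id:
            parent = current["inputs"][0]
            if parent in seen:
                errors.append(f"{node['id']}: cycle detected")
                break
            seen.add(parent)
            current = by_id[parent]
            depth += 1
        if depth > MAX_DEPTH:
            errors.append(f"{node['id']}: depth {depth} > {MAX_DEPTH}")
    return errors
```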

## Execution

Pipelines **auto-start at creation**: `POST /api/pipelines` queues the root node, and each remaining node is queued once its predecessor reaches `done`.

Each node runs as a normal job (same lifecycle, observability, retries). The coordinator advances on `done`, stops on the first `failed`. A failed pipeline can be resumed from the failing node. To re-run, create a new pipeline (the graph is JSON — copy and re-post).

## Endpoints

```
GET    /api/pipelines/schema           # canonical node schema (public, source of truth)
POST   /api/pipelines/validate         # normalize + validate, no side effects
POST   /api/pipelines                  # create (also auto-starts)
GET    /api/pipelines                  # list user pipelines
GET    /api/pipelines/{id}             # detail + graph
```

A pipeline is owned by one user.

### Filter preview

```
POST /api/pipelines/{id}/nodes/{node_id}/filter-preview
```

Runs a `filter` node against a sample of the predecessor's output without executing the full pipeline.

## What's next

- [Veille (monitoring)](/docs/concepts/veille-monitoring)
- [`filter`](/docs/modules/filter), [`sort`](/docs/modules/sort), [`import`](/docs/modules/import)


<!-- doc: concepts/scrape-modes -->

---
title: Scrape modes (Fast / Advanced / Ultra)
slug: concepts/scrape-modes
section: Concepts
summary: The three Google Maps scrape modes control adaptive subdivision depth — the trade-off between speed, cost (EF) and contact completeness.
---

The Google Maps scrape offers **three modes** that tune a single knob: **adaptive subdivision depth**. They trade off speed, cost and completeness.

| Mode | For | In one line |
|------|-----|-------------|
| **Fast** *(default)* | Most cases | Fast, cheaper, already captures the bulk of contacts. |
| **Advanced** | When you want to enrich | Balanced: more contacts in dense areas, moderate cost. |
| **Ultra** | Maximum coverage | Subdivides as deep as possible: near-exhaustive recall, slower and costlier. |

## Why three modes: the 120-result cap

Google Maps **caps any search at ~120 results** ("you've reached the end of the list"). To go further, outsend splits a saturated tile into 4 more-zoomed sub-tiles and re-scans each (dedup by Google Maps link). This is **adaptive subdivision**.

But subdividing only pays off if the sub-tile brings **new** contacts: in a low-density area Google widens its radius beyond the tile and often returns the same 120 places → subdividing means 4× the work for 0 new leads.

So each mode sets a **threshold**: a saturated tile is only subdivided if it brought at least *N* new unique contacts.

| Mode | Threshold (new uniques required to subdivide) | Effect |
|------|----------------------------------------------|--------|
| **Fast**   | 15 | Only subdivides genuinely rich areas → few tiles. |
| **Advanced** | 7  | Subdivides more readily → more coverage. |
| **Ultra**  | 1  | Subdivides whenever anything new remains → maximum coverage. |

Subdivision depth is bounded (zoom 13 → 17, i.e. 4 levels: a tile is then ~300 m across, ≈ one city block), so even Ultra stays finite.
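
Put together, the subdivision decision for one tile depends on three things: saturation, the per-mode threshold, and the zoom bound. A sketch of that rule as described here; the lowercase mode keys and the function name are assumptions, and the real saturation test may differ from a plain `>= 120`.

```python
THRESHOLDS = {"fast": 15, "advanced": 7, "ultra": 1}  # new uniques required
SATURATION = 120   # approximate Google Maps result cap
MAX_ZOOM = 17      # deepest subdivision level


def should_subdivide(mode: str, results_in_tile: int,
                     new_uniques: int, zoom: int) -> bool:
    """Decide whether a scanned tile is split into 4 more-zoomed sub-tiles."""
    saturated = results_in_tile >= SATURATION
    return (saturated
            and zoom < MAX_ZOOM
            and new_uniques >= THRESHOLDS[mode])
```

For a saturated tile at zoom 13 that brought 10 new unique contacts, Fast stops while Advanced and Ultra subdivide.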

## Modes only diverge in dense areas

Key point: **a mode only changes anything where tiles saturate** (≥ 120 results).

- **Dense area** (city center, a common query like "plumber" or "restaurant"): tiles saturate, subdivision kicks in → Fast / Advanced / Ultra yield **markedly different** contact volumes.
- **Sparse area** (rural, niche query): nothing saturates, no subdivision → **all three modes return the exact same result**. Picking Ultra there buys nothing (same result, same cost).

That's why the mode is a **per-scrape** choice, not a global setting: it depends on the density of what you're searching for.

## Cost (EF) and duration

**EF** (France-equivalent) is the cost unit of a scrape. The baseline is simple:

> **1 EF = scraping the whole of France, once, in Fast mode.**

So a city or a département costs a small fraction of an EF. Because deeper modes fire **many more** Google Maps requests (they re-subdivide saturated tiles), they cost proportionally more:

| Mode | Relative cost | Relative duration |
|------|:---:|:---:|
| Fast | **×1** (base) | ×1 |
| Advanced | **≈ ×2** | ≈ ×2 |
| Ultra | **≈ ×6** | ≈ ×6 |

These factors are **measured averages** (ratio of tiles processed vs Fast, 2026-06-05 campaign). Real cost depends on the **actual density** of the zone:
- **Sparse area**: nothing saturates → no subdivision → all three modes cost **the same** (the factor barely applies).
- **Dense area**: the gap widens (Ultra can reach ×14 in a very dense city center).

The pre-scrape estimate applies these factors (the EF shown rises when you switch to Advanced/Ultra). During the scrape, the **ETA accounts** for upcoming subdivisions, and **elapsed time** is shown live.
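
As a rough planning aid, the average factors can be applied to a Fast-mode estimate. This sketch assumes the multiplier applies linearly and that the result is what counts against the 1.0 EF per-job limit; both are assumptions, and the real estimate depends on actual density. `estimate_ef` and `fits_in_one_job` are hypothetical helpers.

```python
COST_FACTOR = {"fast": 1.0, "advanced": 2.0, "ultra": 6.0}  # measured averages
MAX_EF_PER_JOB = 1.0  # per-job limit from Limits & quotas


def estimate_ef(fast_ef: float, mode: str) -> float:
    """Scale a Fast-mode EF estimate by the average factor of the mode."""
    return fast_ef * COST_FACTOR[mode]


def fits_in_one_job(fast_ef: float, mode: str) -> bool:
    return estimate_ef(fast_ef, mode) <= MAX_EF_PER_JOB
```

A zone estimated at 0.05 EF in Fast comes out at about 0.3 EF in Ultra, while a 0.2 EF zone in Ultra would exceed a single job under these assumptions.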

## Measurements

> **Methodology.** Three queries of differing density: "plumber" (clusters), "pharmacy" (numerous and spread out), and "cobbler" (niche). All three are categories that **display a phone**; consumer categories such as restaurant or hairdresser show ~0 phones, get wrongly filtered by the anti-bot, and are therefore untestable. Three zones (dense / medium / rural), all three modes each, **every scrape run to full completion** (no timeout). We measure unique contacts, tiles processed (≈ cost/requests), and real duration. Percentages are relative to Fast.

**Full matrix (campaign 2026-06-05, "plumber", all to completion)**

| Zone | Density | Mode | Contacts | Tiles | Time | vs Fast | Contacts/tile |
|------|---------|------|---------:|------:|-----:|--------:|--------------:|
| Lyon 6 km | dense | Fast | 606 | 53 | 50 min | — | 11.4 |
| Lyon 6 km | dense | Advanced | 627 | 89 | 84 min | +3.5 % | 7.0 |
| Lyon 6 km | dense | Ultra | 647 | 193 | 180 min | +6.8 % | 3.4 |
| Tours 10 km | medium | Fast | 311 | 14 | 8 min | — | 22.2 |
| Tours 10 km | medium | Advanced | 351 | 42 | 20 min | +13 % | 8.4 |
| Tours 10 km | medium | Ultra | 377 | 150 | 72 min | +21 % | 2.5 |
| Aurillac 12 km | rural | Fast | 213 | 19 | 7 min | — | 11.2 |
| Aurillac 12 km | rural | Advanced | 211 | 23 | 9 min | −1 % | 9.2 |
| Aurillac 12 km | rural | Ultra | 215 | 83 | 40 min | +1 % | 2.6 |

- **Rural → all three modes are effectively identical** (213 / 211 / 215). Ultra takes 40 min (vs 7 min for Fast) for **+2 contacts**. Going deeper is pointless when nothing saturates.
- **Medium → Ultra +21 %** vs Fast, but at **9× the time** (72 min vs 8 min); Advanced +13 % at 2.5×.
- **Dense → Ultra +6.8 %** vs Fast, at **3.6× the time** (3 h vs 50 min).
- **Efficiency**: Fast is **3–9× more cost-effective per tile** (i.e. per EF/time) than Ultra across all zones.
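
The derived figures in the matrix and bullets can be recomputed from the raw columns. The data below is copied from the "plumber" matrix; `compare` is a hypothetical helper.

```python
# (contacts, tiles, minutes) per zone and mode, from the "plumber" matrix.
RUNS = {
    "Lyon": {"fast": (606, 53, 50), "advanced": (627, 89, 84), "ultra": (647, 193, 180)},
    "Tours": {"fast": (311, 14, 8), "advanced": (351, 42, 20), "ultra": (377, 150, 72)},
    "Aurillac": {"fast": (213, 19, 7), "advanced": (211, 23, 9), "ultra": (215, 83, 40)},
}


def compare(zone: str, mode: str) -> dict[str, float]:
    """Gain, time ratio and efficiency of a mode relative to Fast in one zone."""
    contacts, tiles, minutes = RUNS[zone][mode]
    fast_contacts, _, fast_minutes = RUNS[zone]["fast"]
    return {
        "gain_pct": (contacts - fast_contacts) / fast_contacts * 100,
        "time_ratio": minutes / fast_minutes,
        "contacts_per_tile": contacts / tiles,
    }
```

For example, `compare("Tours", "ultra")` gives a gain of about 21 %, a time ratio of 9, and roughly 2.5 contacts per tile.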

**Two more queries ("pharmacy" = dense and numerous; "cobbler" = niche), Ultra gain vs Fast**

| Query | Lyon (dense) | Tours (medium) | Aurillac (rural) |
|-------|:---:|:---:|:---:|
| plumber (clusters) | +6.8 % | +21 % | +1 % |
| **pharmacy (numerous, spread out)** | **+50 %** | **+44 %** | noise* |
| cobbler (niche) | +16 % | +3 % | +12 % |

<small>Pharmacy detail: Lyon Fast 411 / Ultra 617 (36→157 min); Tours Fast 253 / Ultra 364 (8→110 min). Cobbler Lyon Fast 173 / Ultra 200. *Rural pharmacy = noise: tiles don't saturate consistently (the 120 boundary), so mode order there is random.</small>

> **Takeaway.** Ultra's gain has **no single value: from +1 % to +50 % depending on the category**. **Numerous, spread-out** categories (pharmacies, regular shops) benefit hugely from Ultra (+44 to +50 % — Fast misses half because of the 120 cap). Categories that **cluster** (plumber) or are **rare** (cobbler) gain only +1 to +16 %. In all cases Ultra costs **3–14× the time** of Fast, and in rural/low density all three modes converge.

## Recommendation

- **Default: Fast.** Best speed/cost ratio for a first pass and for categories that cluster (trades, specialized services).
- **Ultra when the target is dense AND numerous** (pharmacies, shops, agencies…) and you want exhaustiveness: the gain is real, up to **+50 %** more contacts. Accept 3–14× the time.
- **Advanced** = middle ground.
- **Niche or sparse area → Fast**, period: modes converge, Ultra just wastes time.

See also: [Jobs & lifecycle](/docs/concepts/jobs-lifecycle), [Limits & quotas](/docs/concepts/limits).


<!-- doc: concepts/states-and-events -->

---
title: States & SSE events
slug: concepts/states-and-events
section: Concepts
summary: Exact payloads for every job state and every event emitted on the SSE stream.
---

The contract for integrating against the job stream — bots, dashboards, alerting, AI assistants.

## States — full enum

| Value       | Terminal | Result files available | Re-runnable |
|-------------|----------|------------------------|-------------|
| `pending`   | no       | no                     | n/a         |
| `running`   | no       | no                     | n/a         |
| `done`      | yes      | yes (7 days)           | yes         |
| `failed`    | yes      | partial                | yes         |
| `cancelled` | yes      | partial                | yes         |
| `expired`   | yes      | no (purged)            | no          |

A `pending` or `running` job cannot be deleted, only **cancelled**.

## SSE stream

```
GET /api/jobs/{id}/stream
Accept: text/event-stream
```

Standard SSE; each event:

```
event: <name>
data: <json-payload>

```
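
A client without an SSE library can parse this framing by hand. The sketch below handles the `event:` and `data:` fields and blank-line dispatch, which is enough for the events on this page; it is not a complete SSE implementation (no `id:` or `retry:` handling). `parse_sse` is a hypothetical helper that takes any iterable of text lines.

```python
import json
from typing import Any, Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[tuple[str, Any]]:
    """Yield (event name, decoded JSON payload) for each complete event."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment / keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip(" "))
```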

### `status` event

Emitted every **2 seconds** while the job is non-terminal, plus once when it reaches a terminal state.

```json
{
  "id": "j_abc123",
  "status": "running",
  "processed_points": 412,
  "grid_points_count": 1280,
  "results_count": 387,
  "error_count": 2,
  "download_available": false,
  "query_stats": {
    "bakery": { "found_pct": 92 },
    "dentist": { "found_pct": 78 }
  }
}
```

| Field                | Type    | Description                                                  |
|----------------------|---------|--------------------------------------------------------------|
| `id`                 | string  | Job id                                                       |
| `status`             | enum    | See table above                                              |
| `processed_points`   | int     | Items finished                                               |
| `grid_points_count`  | int     | Items planned                                                |
| `results_count`      | int     | Result rows so far                                           |
| `error_count`        | int     | Items that failed (job can still reach `done`)               |
| `download_available` | bool    | `true` once the result file is ready                         |
| `query_stats`        | object  | Per-query stats; depends on module                           |

### `log` event

Emitted as new log lines accumulate (bundled, polled internally every 0.5 s).

```json
{
  "message": "Picked up 12 POIs in Lyon centre",
  "level": "info",
  "timestamp": "2026-05-27T14:21:08Z"
}
```

`level` ∈ `debug` · `info` · `warn` · `error`.

### `done` event

Emitted once, then the stream closes. Same event for `failed` and `cancelled` — check `status`.

```json
{
  "id": "j_abc123",
  "status": "done",
  "results_count": 1820,
  "duration_seconds": 1342
}
```

### `error` event

Stream-level errors (auth, not-found). Different from a job ending in `failed` (that one comes via `done` with `status: "failed"`).

```json
{ "code": "forbidden", "message": "Not your job" }
```

## Polling intervals (no SSE)

| Endpoint         | Min poll interval |
|------------------|-------------------|
| `/api/jobs/{id}` | 2 seconds         |
| `/api/jobs`      | 5 seconds         |

Internal state refreshes every 2 s; faster polling brings no benefit.
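
When SSE is not an option, a polling loop that respects the minimum interval might look like the following. The HTTP call is left abstract: `fetch_status` is a hypothetical callable that performs `GET /api/jobs/{id}` and returns the decoded JSON. The `max_polls` guard is an addition for the example.

```python
import time
from typing import Callable, Optional

TERMINAL = {"done", "failed", "cancelled", "expired"}
MIN_INTERVAL = 2.0  # seconds, from the table above


def poll_job(fetch_status: Callable[[], dict],
             interval: float = MIN_INTERVAL,
             sleep: Callable[[float], None] = time.sleep,
             max_polls: Optional[int] = None) -> dict:
    """Poll until the job reaches a terminal state and return its last status."""
    interval = max(interval, MIN_INTERVAL)  # faster polling brings no benefit
    polls = 0
    while True:
        status = fetch_status()
        polls += 1
        if status["status"] in TERMINAL:
            return status
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"job still {status['status']} after {polls} polls")
        sleep(interval)
```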

## Timeouts

| Item                                    | Value      |
|-----------------------------------------|------------|
| SSE stream max duration                 | 6 hours    |
| Job overall timeout                     | 6 hours    |
| Idle worker reconnect window            | 30 seconds |
| Result file retention after `done`      | 7 days     |

## What's next

- [Jobs & lifecycle](/docs/concepts/jobs-lifecycle)
- [Limits & quotas](/docs/concepts/limits)


<!-- doc: concepts/veille-monitoring -->

---
title: Veille (monitoring)
slug: concepts/veille-monitoring
section: Concepts
summary: A recurring scrap that diffs each run against the previous one and surfaces reputation signals.
---

A **veille** (French for "watch") is a scheduled re-run of an existing job or pipeline. Each run is diffed against the previous one, and differences are exposed as **signals**.

A veille is created from an existing **scrap job** (the source). Its query + zones + parameters become the template, cloned at each scheduled run.

```
   source job (one-off scrap)
        │  registered as veille, frequency = 7 days
        ▼
   run 1   ──►   poi_list_v1
        │  7 days later
        ▼
   run 2   ──►   poi_list_v2
        │  diff(v1, v2)
        ▼
   change report:
     - new POIs (opened)
     - closed POIs (no longer found)
     - modified POIs (ratings dropped, contact changed, ...)
```

## Frequency

Frequency is expressed in days, from **1** to **365**. Hourly is intentionally disallowed: prospect data doesn't move that fast, and source rate limits would not survive it. Typical values: 7 (weekly), 30 (monthly), 90 (quarterly).
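
A client can validate a frequency and project the next run date before creating a veille. A small sketch; it assumes the schedule is anchored on the previous run's timestamp, which this page does not specify, and `next_run_at` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

MIN_DAYS, MAX_DAYS = 1, 365


def next_run_at(last_run_at: datetime, frequency_days: int) -> datetime:
    """Return the projected next run, rejecting out-of-range frequencies."""
    if not isinstance(frequency_days, int) or not MIN_DAYS <= frequency_days <= MAX_DAYS:
        raise ValueError(
            f"frequency must be a whole number of days in {MIN_DAYS}..{MAX_DAYS}"
        )
    return last_run_at + timedelta(days=frequency_days)
```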

## Signals

Three categories extracted from each diff:

- **`new`** — in the new run, absent before (newly opened competitors, partners, acquisition targets)
- **`removed`** (closed) — absent from the new run, present before (outreach cleanup; early shutdown signal)
- **`modified`** — present in both, changed:
  - **Rating delta** — Google rating drop = strong "client in trouble" signal
  - **Review count delta** — surging or stalling activity
  - **Contact delta** — phone or website changed (often a relaunch)

Modified rows are scored; the signals endpoint returns them ranked.
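
The three categories can be illustrated with a small client-side diff. This is a minimal sketch, not outsend's implementation: it assumes rows are joined on `place_id`, and the tracked field names `rating`, `review_count` and `phone` are hypothetical stand-ins for whatever the run actually exports. Scoring and ranking are left out because the docs do not describe them.

```python
from typing import Any

# Hypothetical field names: the row schema used by the real diff is not published.
TRACKED_FIELDS = ("rating", "review_count", "phone", "site_web")


def diff_runs(previous: list[dict[str, Any]], current: list[dict[str, Any]]) -> dict[str, list]:
    """Split two runs into new / closed / modified, joined on place_id (assumed key)."""
    prev_by_id = {row["place_id"]: row for row in previous}
    curr_by_id = {row["place_id"]: row for row in current}

    # Present now, absent before.
    new = [row for pid, row in curr_by_id.items() if pid not in prev_by_id]
    # Present before, absent now.
    closed = [row for pid, row in prev_by_id.items() if pid not in curr_by_id]

    # Present in both runs with at least one tracked field changed.
    modified = []
    for pid, row in curr_by_id.items():
        before = prev_by_id.get(pid)
        if before is None:
            continue
        changes = {
            field: (before.get(field), row.get(field))
            for field in TRACKED_FIELDS
            if before.get(field) != row.get(field)
        }
        if changes:
            modified.append({"place_id": pid, "changes": changes})

    return {"new": new, "closed": closed, "modified": modified}


run_1 = [
    {"place_id": "a", "rating": 4.6, "review_count": 120, "phone": "0102030405"},
    {"place_id": "b", "rating": 4.1, "review_count": 40, "phone": "0607080910"},
]
run_2 = [
    {"place_id": "a", "rating": 4.2, "review_count": 131, "phone": "0102030405"},
    {"place_id": "c", "rating": 4.8, "review_count": 3, "phone": "0708091011"},
]
report = diff_runs(run_1, run_2)
```

Here `report["new"]` holds POI `c`, `report["closed"]` holds POI `b`, and `report["modified"]` records the rating and review-count deltas for POI `a`.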

## Endpoints

```
GET    /api/veille                              # list user's veilles
POST   /api/veille                              # create
PATCH  /api/veille/{id}                         # update name, frequency, status
DELETE /api/veille/{id}                         # soft-delete
GET    /api/veille/{id}/runs                    # historical runs
GET    /api/veille/{id}/runs/{run_id}           # one run + diff
GET    /api/veille/{id}/runs/{run_id}/signals   # filtered, scored signals
```

Signals endpoint supports CSV / JSON / XLSX via `?format=…`.
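
As an illustration, the signals URL for one run could be assembled as below. This is a sketch under assumptions: the API is assumed to be served from `https://outsend.xyz` (the host used elsewhere in these docs), and the lowercase values `csv`, `json`, `xlsx` for `format` are a guess at the accepted spelling. `signals_url` is a hypothetical helper, not part of any outsend SDK.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://outsend.xyz"  # assumption: API shares the docs host


def signals_url(veille_id: str, run_id: str, fmt: str = "json") -> str:
    """Build the signals URL for one run; lowercase format values are an assumption."""
    if fmt not in {"csv", "json", "xlsx"}:
        raise ValueError(f"unsupported format: {fmt!r}")
    # Percent-encode the identifiers so they cannot alter the path.
    path = f"/api/veille/{quote(veille_id, safe='')}/runs/{quote(run_id, safe='')}/signals"
    return f"{BASE_URL}{path}?{urlencode({'format': fmt})}"
```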

## States

| State    | Meaning                                                 |
|----------|---------------------------------------------------------|
| `active` | Will run on schedule                                    |
| `paused` | Schedule suspended; existing runs remain available      |
| `deleted`| Soft-deleted; data retained                             |

A veille run is a normal job, with the same workers and the same quotas. It counts against the running-jobs ceiling only while the run is executing.

## What's next

- [Jobs & lifecycle](/docs/concepts/jobs-lifecycle)
- [Pipeline orchestration](/docs/concepts/pipeline-orchestration)
- [`scrap`](/docs/modules/scrap)


<!-- doc: integration/byok -->

---
title: BYOK — Bring your own AI key
slug: integration/byok
section: Integration
summary: Connect a personal API key from any major AI provider (Anthropic, OpenAI, Gemini, Mistral, Groq, DeepSeek, xAI, or any OpenAI-compatible endpoint) and pick a model. The user's key, the user's quota.
---

> **Status: partially live.** Connecting a key, picking a provider, and selecting a model are available now in **Settings → Connect an AI**, and power the AI features shipping today (e.g. Google review summaries, pipeline generation from a description). The broader in-app assistant described below is still on the roadmap.

The BYOK ("bring your own key") integration lets the user paste an AI provider API key into the outsend settings and use an AI assistant directly inside the app — to configure searches, draft filter rules, summarise results, or build pipelines through natural language.

## Why BYOK and not a hosted model

- The user's spend stays on the user's account, billed by the AI provider directly.
- No outsend-side mediation: the assistant sees only what the user grants it.
- Provider choice stays with the user: Anthropic, OpenAI, any other supported provider, or any OpenAI-compatible endpoint.

## Supported providers

The provider and model are chosen in **Settings → Connect an AI**. Models are **detected live** from the provider's own API — there is no fixed model list to maintain, and new models appear automatically as the provider releases them.

| Provider | Key format | Notes |
|----------|------------|-------|
| Anthropic (Claude) | `sk-ant-…` | Native Messages API |
| OpenAI | `sk-…` | Incl. reasoning models (o-series, GPT-5) |
| Google (Gemini) | `AIza…` | OpenAI-compatible endpoint |
| Mistral | — | |
| Groq | `gsk_…` | |
| DeepSeek | `sk-…` | |
| xAI (Grok) | `xai-…` | |
| Any OpenAI-compatible endpoint | — | Paste a custom base URL (Together, Perplexity, OpenRouter, local Ollama / vLLM, …) |
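
The key formats in the table are enough to pre-select a provider when a key is pasted. The sketch below is hypothetical (it is not how outsend's settings page is known to work) and only uses the prefixes listed above. Because `sk-…` is shared by OpenAI and DeepSeek, the helper returns every candidate rather than a single answer.

```python
# Prefixes copied from the provider table. Order matters: "sk-ant-" also starts with "sk-".
KEY_PREFIXES = [
    ("sk-ant-", ["Anthropic"]),
    ("gsk_", ["Groq"]),
    ("xai-", ["xAI"]),
    ("AIza", ["Google"]),
    ("sk-", ["OpenAI", "DeepSeek"]),
]


def guess_providers(api_key: str) -> list[str]:
    """Return the providers whose documented key format matches, or [] if none does."""
    key = api_key.strip()
    for prefix, providers in KEY_PREFIXES:
        if key.startswith(prefix):
            return list(providers)
    # Mistral and custom endpoints have no documented prefix.
    return []
```

An empty result is not an error: Mistral keys and custom OpenAI-compatible endpoints have no documented format, so the user would still pick the provider by hand.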

The key is stored encrypted at rest (Fernet, server secret), scoped to the user's account, and never sent outside the outsend backend except to the chosen provider. A rough cost estimate is shown before AI actions — it is indicative only (best-effort token counting against known public prices) and may differ from the provider's actual billing. AI spending is also protected by **hard caps** — per request, per day and per month — that you set in **Settings**: a request that would exceed a cap is blocked *before* the provider is called, with email alerts at 80% and when a cap is reached. See [AI spending caps](/docs/concepts/ai-spending-caps).
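
The cap logic described above (block before the provider is called, alert at 80%) can be sketched as follows. Everything here is an assumption made for illustration: the `Caps` structure, the function name, the use of a single currency amount, and the choice to raise the 80% alert on the projected spend for the day and month. The actual rules are documented on the AI spending caps page.

```python
from dataclasses import dataclass


@dataclass
class Caps:
    per_request: float
    per_day: float
    per_month: float


def check_caps(estimate: float, spent_today: float, spent_month: float, caps: Caps) -> dict:
    """Decide before calling the provider; return the verdict plus any 80% alerts."""
    blocked_by = []
    if estimate > caps.per_request:
        blocked_by.append("per_request")
    if spent_today + estimate > caps.per_day:
        blocked_by.append("per_day")
    if spent_month + estimate > caps.per_month:
        blocked_by.append("per_month")

    alerts = []
    if not blocked_by:
        # Assumed rule: alert when projected spend reaches 80% of a daily or monthly cap.
        windows = (("per_day", spent_today, caps.per_day), ("per_month", spent_month, caps.per_month))
        for name, spent, cap in windows:
            if spent + estimate >= 0.8 * cap:
                alerts.append(name)
    return {"allowed": not blocked_by, "blocked_by": blocked_by, "alerts": alerts}
```

The point of the design is ordering: the check runs on the estimate, so a blocked request never reaches the provider and never incurs cost.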

## What the assistant can do

The assistant uses the same outsend API surface documented in [API overview](/docs/api/overview). It can:

- Read the user's jobs, pipelines, and veilles
- Start new jobs (with explicit user confirmation for spend)
- Compose pipelines by chaining modules from the [registry](/docs/concepts/module-registry)
- Compute filter rules from natural-language descriptions and preview the result

It cannot:

- Access other users' data
- Modify billing, account settings, or invitation codes
- Run anything outside the user's normal permission scope

## Why not just use Claude.ai with outsend as MCP?

Both options will exist:

- **BYOK** — for users who want the assistant **inside outsend.xyz**, with the UI rendering search forms and tables natively while the model orchestrates.
- **[MCP](/docs/integration/mcp)** — for users who want to drive outsend from their own Claude.ai or Claude Desktop, with their existing subscription.

The two patterns are complementary, not competing.

## What's next

- [MCP integration](/docs/integration/mcp) — drive outsend from your own AI client
- [llms.txt](/docs/integration/llms-txt) — point any AI assistant at the docs


<!-- doc: integration/llms-txt -->

---
title: llms.txt — AI-friendly documentation
slug: integration/llms-txt
section: Integration
summary: A single URL exposes the entire outsend documentation to any AI assistant — no auth, no scraping, no parsing.
---

The outsend documentation is published in the [llms.txt](https://llmstxt.org) format. Any AI assistant — Claude, ChatGPT, Cursor, Perplexity, or a local model — can ingest the full reference in one fetch.

## The two endpoints

| URL                                                                 | Purpose                                                                   |
|---------------------------------------------------------------------|---------------------------------------------------------------------------|
| [`/docs/llms.txt`](/docs/llms.txt)                                  | Flat index — one line per page, with title + URL + one-line summary       |
| [`/docs/llms-full.txt`](/docs/llms-full.txt)                        | Full bundle — every page concatenated, delimited by `<!-- doc: <slug> -->` |

Both endpoints return `text/plain` with no auth, no rate limit, no JS rendering required.
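
Once the full bundle is downloaded, the delimiter makes it easy to recover individual pages. The following is a minimal sketch, assuming each delimiter sits alone on its own line as it does in the bundle itself; `split_bundle` is a hypothetical helper name.

```python
import re

# A delimiter is matched only when it occupies a whole line.
DOC_MARKER = re.compile(r"^<!-- doc: (?P<slug>\S+) -->[ \t]*$", re.MULTILINE)


def split_bundle(bundle: str) -> dict[str, str]:
    """Map each page slug to its markdown body (front matter included)."""
    pages = {}
    markers = list(DOC_MARKER.finditer(bundle))
    for marker, following in zip(markers, markers[1:] + [None]):
        # A page runs from the end of its marker to the start of the next one.
        end = following.start() if following else len(bundle)
        pages[marker.group("slug")] = bundle[marker.end():end].strip()
    return pages
```

Anchoring the pattern to whole lines keeps inline mentions of the delimiter, such as the one in the table above, from being mistaken for page boundaries. Text before the first delimiter (the bundle header) is ignored.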

## Use it from an AI assistant

Most AI clients now detect `llms.txt` automatically when a domain is mentioned. For the ones that don't, paste the URL directly:

```
https://outsend.xyz/docs/llms-full.txt
```

The bundle is ~150 KB and fits comfortably in any modern context window.

## Per-section bundles

For narrower scopes, the per-section endpoints are also available:

| URL                                          | Contains                          |
|----------------------------------------------|-----------------------------------|
| `/docs/_bundle/concepts.txt`                 | Only the Concepts pages           |
| `/docs/_bundle/modules.txt`                  | Only the Modules pages            |
| `/docs/_bundle/api.txt`                      | Only the API reference            |
| `/docs/_bundle/integration.txt`              | Only the Integration pages        |

## The Copy button

Every page in this documentation has a **Copy** button in the top-right corner. It exposes the same bundles, but as a one-click clipboard action:

- Copy this page (raw markdown)
- Copy this section
- Copy entire docs

The "Copy entire docs" action is the recommended path when handing the docs to an AI assistant interactively.

## Why this matters

AI assistants are increasingly used as the integration layer between SaaS products. Documentation that an assistant can ingest cleanly, without scraping, login flows, or HTML parsing, is integrable; documentation that it cannot ingest is not.

outsend's docs are designed to be readable by humans, but their **first audience** is the LLM that will draft the integration code, write the prompt template, or diagnose the misconfigured pipeline.

## What's next

- [API overview](/docs/api/overview) — the surface the assistant will call
- [MCP](/docs/integration/mcp) — the protocol the assistant should prefer


<!-- doc: integration/mcp -->

---
title: MCP — Model Context Protocol
slug: integration/mcp
section: Integration
summary: Drive outsend from your own Claude.ai, Claude Desktop, or any MCP-compatible client. Your subscription, your tokens.
---

> **Status: planned.** The MCP server is on the roadmap; this page describes the intended endpoint shape so AI clients can plan against it. The release will be announced in the changelog.

The MCP integration exposes outsend as a **remote MCP server** that any MCP-compatible client can connect to: Claude.ai (custom connectors), Claude Desktop, Claude Code, Cursor, or any future client that speaks the protocol.

The user signs in once with their outsend account, and from then on the AI client can run searches, build pipelines, and read results — using the user's own LLM subscription (no outsend-side LLM cost).

## How it will work

1. The user opens settings in their MCP client (e.g. Claude.ai → Settings → Connectors → Add custom connector).
2. They paste `https://outsend.xyz/mcp` and authenticate.
3. The MCP server returns the list of available tools (see below).
4. The model can call those tools on the user's behalf; each call hits the outsend API as that user.

## Planned tools

| Tool                      | What it does                                                     |
|---------------------------|------------------------------------------------------------------|
| `list_jobs`               | List the user's recent jobs                                      |
| `get_job`                 | Fetch a job's status, counters, and a sample of its results      |
| `create_scrap_job`        | Start a Google Maps extraction                                   |
| `create_enrich_job`       | Start an enrichment on an existing job (emails, socials, …)      |
| `list_pipelines`          | List the user's pipelines                                        |
| `create_pipeline`         | Compose a pipeline from a description                            |
| `run_pipeline`            | Execute a saved pipeline                                         |
| `list_veilles`            | List recurring veilles                                           |
| `create_veille`           | Register an existing job as a recurring veille                   |
| `get_signals`             | Fetch the latest reputation signals from a veille run            |

Each tool's argument schema mirrors the corresponding [API endpoint](/docs/api/overview). In particular, `create_pipeline` takes the same portable envelope as the REST API (`{schema_version, name, definition}`), and the set of valid blocks plus their per-block `config_schema` is the contract already published at [`GET /api/pipelines/schema`](/docs/api/pipelines) — the MCP server reuses it rather than defining its own.

## Scope and limits

The MCP server inherits the user's normal permissions:

- It cannot access other users' data.
- It respects the same rate limits as the REST API.
- It cannot modify billing, account settings, or invitation codes.

## BYOK vs MCP

| Pattern | Where the chat lives                            | Who pays the LLM tokens          |
|---------|-------------------------------------------------|----------------------------------|
| [BYOK](/docs/integration/byok) | Inside outsend.xyz                  | The user, via a pasted API key   |
| MCP     | Inside the user's existing AI client            | The user, via their subscription |

The two patterns coexist. Pick BYOK if the assistant should live in the outsend UI; pick MCP if it should live wherever the user already works.

## What's next

- [BYOK](/docs/integration/byok) — assistant inside outsend.xyz
- [llms.txt](/docs/integration/llms-txt) — let any AI assistant ingest the docs


<!-- doc: modules/ads_intelligence -->

---
title: Ads profile
slug: modules/ads_intelligence
section: Modules
---

# Ads profile

The `ads_intelligence` module profiles the marketing stack of each POI's website and condenses the findings into a single 0–100 marketing maturity score. It splits a list of prospects into two actionable segments: businesses that already invest in paid acquisition, and businesses still on a cold first-touch.

Detections match the homepage against community-maintained filter lists (uBlock Origin, EasyList, EasyPrivacy) plus a curated outsend signature table, covering advertising pixels, retargeting networks, CMPs, marketing CRMs and chat widgets.

## Inputs

Only items with a non-empty `site_web` are processed.

| Field           | Type   | Required | Notes                                  |
|-----------------|--------|----------|----------------------------------------|
| `site_web`      | string | yes      | Absolute URL of the POI's website      |
| `nom`           | string | no       | Carried through for reporting          |
| `place_id`      | string | no       | Used to join back to the source list   |
| `source_job_id` | string | no       | ID of an upstream `scrap` job to chain |

Batch size: 1 to 10 000 items per job.

## Outputs

One row per processed POI. Paid-media pixels and retargeting weigh the most in the score; chat widgets the least.

| Column            | Type     | Description                                                                 |
|-------------------|----------|-----------------------------------------------------------------------------|
| `ads_score`       | integer  | Marketing maturity score, 0–100                                             |
| `pixels_detected` | string[] | Advertising pixels found on the page (e.g. `meta`, `google_ads`, `tiktok`)  |
| `crm_detected`    | string   | Marketing CRM identified, if any (e.g. `hubspot`, `klaviyo`, `brevo`)       |
| `chat_widget`     | string   | Chat solution identified, if any (e.g. `intercom`, `crisp`, `drift`)        |
| `marketing_tools` | string[] | Other marketing technologies (CMP, CDP, affiliation, retargeting networks)  |

Granular fields also stored: `ads_active`, `ads_networks`, `pixel_meta`, `pixel_google_ads`, `cmp_vendor`, `retargeting`, `crm_marketing`, `chat_widgets`.

## Lifecycle

Standard outsend job lifecycle; see [/docs/concepts/jobs-lifecycle](/docs/concepts/jobs-lifecycle). Progress is reported per item in the `sites` unit.

## Pipeline

| Direction  | Keys                                                                                                              |
|------------|-------------------------------------------------------------------------------------------------------------------|
| `needs`    | `site_web`                                                                                                        |
| `produces` | `ads_active`, `ads_score`, `ads_networks`, `pixel_meta`, `pixel_google_ads`, `cmp_vendor`, `retargeting`, `crm_marketing`, `chat_widgets` |

Any upstream job that emits `site_web` (typically `scrap`) can feed `ads_intelligence`. The job picker defaults to the most recent `scrap` job of the current account.

## Endpoints

### Create job

`POST /api/jobs/ads-intelligence`

```json
{
  "items": [
    { "site_web": "https://example.com", "nom": "Example", "place_id": "..." }
  ],
  "source_job_id": "optional-upstream-job-uuid"
}
```

Response: a `JobPublic` document describing the newly created job (`id`, `status`, `job_type`, `output_filename`, `ef_cost`, timestamps).

| Status | When                                                                |
|--------|---------------------------------------------------------------------|
| `400`  | No item has a `site_web`, or per-job EF quota exceeded              |
| `401`  | Missing or invalid session                                          |
| `403`  | Account inactive                                                    |
| `422`  | Payload does not match the schema (e.g. `items` empty or > 10 000)  |

Job state, progress and results are read through the shared job endpoints (`GET /api/jobs/{id}`, `GET /api/jobs/{id}/results`, SSE stream).
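
Because a request with no usable `site_web` is rejected with `400` and an out-of-range batch with `422`, it can help to shape the payload before posting it. The sketch below is a hypothetical client-side helper (`build_ads_payload` is not an outsend function); it only encodes the rules stated on this page and does not perform the HTTP call.

```python
from typing import Optional

MAX_ITEMS = 10_000  # batch ceiling stated in the Inputs section


def build_ads_payload(pois: list[dict], source_job_id: Optional[str] = None) -> dict:
    """Keep only POIs with a non-empty site_web and enforce the 1 to 10 000 batch size."""
    items = [
        # Copy only the documented input fields, skipping empty values.
        {key: poi[key] for key in ("site_web", "nom", "place_id") if poi.get(key)}
        for poi in pois
        if (poi.get("site_web") or "").strip()
    ]
    if not 1 <= len(items) <= MAX_ITEMS:
        raise ValueError(f"expected 1 to {MAX_ITEMS} items with a site_web, got {len(items)}")
    payload: dict = {"items": items}
    if source_job_id:
        payload["source_job_id"] = source_job_id
    return payload
```

The returned dictionary can be serialised with `json.dumps` and sent as the body of `POST /api/jobs/ads-intelligence`.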

## Limits

See [/docs/concepts/limits](/docs/concepts/limits). Per-item EF cost: ~1 / 3 / 3700 EF. Wall time per item: 0.6 – 6 s.

## Errors

| Error                                  | Cause                                                       |
|----------------------------------------|-------------------------------------------------------------|
| `Aucun établissement avec site web`    | All items were missing `site_web` after normalisation       |
| `Quota dépassé`                        | Estimated EF cost exceeds the per-job ceiling               |
| Item-level fetch failure               | Recorded on the row; the job continues with the next item   |
| Empty homepage / non-HTML response     | Row emitted with `ads_score = 0` and empty detections       |

## What's next

Pair `ads_intelligence` with the following modules to extend the prospect profile:

- [`techstack`](/docs/modules/techstack) — full CMS, framework and hosting fingerprint.
- [`pricing`](/docs/modules/pricing) — surface visible pricing and commercial terms.
- [`pagespeed`](/docs/modules/pagespeed) — Core Web Vitals and performance budget.


<!-- doc: modules/brand_assets -->

---
title: Brand assets
slug: modules/brand_assets
section: Modules
---

# Brand assets

Extracts the visual identity of each prospect from its own website: main logo, logo variants, favicon, dominant brand color, and a harmonic palette derived from the logo. A homepage screenshot is optional. All images are re-hosted in the caller's private storage, so a link never breaks when the prospect rotates its CDN.

The module is read-only against the prospect's site: it submits no forms, never logs in, and does not go behind authentication.

## Inputs

One row per POI. Only `site_web` is required; the rest is passed through.

| Field      | Type   | Required | Notes                                     |
|------------|--------|----------|-------------------------------------------|
| `site_web` | string | yes      | HTTP(S) URL of the prospect's website.    |
| `nom`      | string | no       | Display name, surfaced in the UI.         |

A batch accepts 1 to 10 000 rows. Rows without `site_web` are dropped before the job is enqueued.

Job-level options:

| Option               | Type   | Default | Notes                                          |
|----------------------|--------|---------|------------------------------------------------|
| `source_job_id`      | string | null    | Parent job in the pipeline chain.              |
| `capture_screenshot` | bool   | false   | Adds a homepage screenshot. ~5x slower per row.|

## Outputs

Per-row output. Local URLs point to assets re-hosted under `/api/brand-assets/<owner_user_id>/<sha256>.<ext>` and served only to the owner (or an admin).

| Column                       | Type    | Notes                                                                 |
|------------------------------|---------|-----------------------------------------------------------------------|
| `logo_url`                   | string  | Source URL of the main logo as found on the prospect's site.         |
| `logo_local_url`             | string  | Re-hosted copy of the main logo, stable URL.                          |
| `logo_source`                | string  | Where the logo was picked up (e.g. `og:image`, JSON-LD, apple-touch). |
| `logo_variants_local_urls`   | list    | Re-hosted alternate marks: apple-touch, mask-icon, monochrome, etc.   |
| `favicon_url`                | string  | Source URL of the highest-quality favicon detected.                   |
| `favicon_local_url`          | string  | Re-hosted copy of the favicon.                                        |
| `brand_color`                | string  | Dominant brand color as a hex string.                                 |
| `brand_color_source`         | string  | Origin of the color (theme-color meta, logo sampling, etc.).          |
| `brand_palette`              | list    | Five harmonic hex colors derived from the logo.                       |
| `screenshot_local_url`       | string  | Homepage screenshot. Populated only when `capture_screenshot=true`.   |

Binaries are stored as `data/brand_assets/<user_id>/<sha256>.<ext>`. Allowed extensions: `svg`, `png`, `jpg`, `jpeg`, `webp`, `gif`, `ico`, `avif`. Each asset is hashed for de-duplication across rows of the same owner.

## Lifecycle

Standard job states — see [Jobs lifecycle](/docs/concepts/jobs-lifecycle). Per-row HTTP errors never fail the job: a failed row carries `fetch_error` and a null `logo_local_url`.

## Pipeline

| Needs       | Produces                                                                                                                                                  |
|-------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| `site_web`  | `logo_url`, `logo_local_url`, `logo_variants_local_urls`, `favicon_url`, `favicon_local_url`, `brand_color`, `brand_palette`, `screenshot_local_url`      |

`pipelinable: true`: the module slots in after any step that yields `site_web`, most commonly a `scrap` parent. `supports_veille: false`: brand identity is a one-shot extraction, not a recurring signal.

## Endpoints

### Create a batch job

```
POST /api/jobs/brand-assets
```

Body:

```json
{
  "items": [
    {"nom": "Stripe", "site_web": "https://stripe.com"}
  ],
  "source_job_id": null,
  "capture_screenshot": false
}
```

Returns a `JobPublic` envelope. Authentication: any active user.

### Live single-domain lookup

```
GET /api/brand-lookup?domain=<domain>&refresh=<bool>
```

Single-shot, no batch job created. The first call for a given domain fetches live (~2–3s) and stores the result in a per-user cache. Subsequent calls within seven days return the cached profile instantly. `refresh=true` forces a re-fetch.
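
The caching rule can be read as a single predicate. This is an illustrative sketch of the behaviour described above, not the server's code; the function name is hypothetical and the exact handling of the seven-day boundary is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

CACHE_TTL = timedelta(days=7)  # TTL stated for /api/brand-lookup


def should_fetch_live(cached_at: Optional[datetime], now: datetime, refresh: bool = False) -> bool:
    """True when the lookup has to hit the prospect's site instead of the cache."""
    if refresh or cached_at is None:
        return True
    # Assumption: an entry exactly seven days old counts as expired.
    return now - cached_at >= CACHE_TTL


now = datetime(2025, 1, 10, tzinfo=timezone.utc)
fresh = should_fetch_live(datetime(2025, 1, 8, tzinfo=timezone.utc), now)
stale = should_fetch_live(datetime(2025, 1, 1, tzinfo=timezone.utc), now)
```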

Response shape:

```json
{
  "domain": "stripe.com",
  "cached": false,
  "cached_at": null,
  "profile": {
    "status": "ok",
    "logo_url": "...",
    "logo_local_url": "...",
    "logo_source": "og:image",
    "logo_variants_local_urls": ["..."],
    "favicon_url": "...",
    "favicon_local_url": "...",
    "brand_color": "#635BFF",
    "brand_color_source": "theme-color",
    "brand_palette": ["#635BFF", "..."],
    "http_status": 200,
    "final_url": "https://stripe.com/",
    "fetch_error": null
  }
}
```

### Serve a re-hosted asset

```
GET /api/brand-assets/<owner_user_id>/<sha256>.<ext>
```

Per-owner isolation: only the owner (or an admin) can read assets from the namespace. Filenames are validated against a strict regex; path traversal attempts are rejected with `400`.
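
The exact regex is not published. A plausible shape, given the `<sha256>.<ext>` naming and the allowed extensions listed under Outputs, is sketched below. It assumes the hash is written as 64 lowercase hexadecimal characters, which is the usual hex form of a SHA-256 digest but is not confirmed by this page.

```python
import re

ALLOWED_EXTENSIONS = ("svg", "png", "jpg", "jpeg", "webp", "gif", "ico", "avif")
# Assumption: the hash is 64 lowercase hex characters (a SHA-256 hex digest).
ASSET_NAME = re.compile(r"[0-9a-f]{64}\.(?:%s)" % "|".join(ALLOWED_EXTENSIONS))


def is_valid_asset_name(filename: str) -> bool:
    """Accept only <sha256>.<ext>; anything containing a path separator fails by construction."""
    return ASSET_NAME.fullmatch(filename) is not None
```

Matching the whole filename against an allow-list pattern is what rules out traversal: a name such as `../secret.png` contains characters the pattern never permits.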

## Limits

For global quotas and caps, see [Limits](/docs/concepts/limits). Brand-lookup cache TTL: 7 days, per user, per domain. Screenshot mode multiplies per-row cost by ~5; opt-in only.

## Errors

| Code | Meaning                                                                  |
|------|--------------------------------------------------------------------------|
| 400  | `Aucun établissement avec site web` — no row carried a usable `site_web`. |
| 400  | Invalid domain on `/api/brand-lookup`.                                   |
| 400  | Invalid asset filename on the serve endpoint.                            |
| 403  | Access to an asset owned by another user.                                |
| 502  | Live lookup failed upstream (`Lookup failed: <type>: <message>`).        |

## What's next

- [techstack](/docs/modules/techstack) — detect the CMS, analytics, and frameworks behind the same `site_web`.
- [ads_intelligence](/docs/modules/ads_intelligence) — surface the prospect's paid acquisition footprint to complement the visual identity.


<!-- doc: modules/dead_check -->

---
title: Closed check
slug: modules/dead_check
section: Modules
summary: Check whether each point of interest is still operating, has shut down, or is uncertain — based on real abandonment signals on its website.
---

## Purpose

The `dead_check` module inspects the website attached to each point of interest (POI) and decides whether the underlying business looks alive, dead, or uncertain. It correlates several abandonment signals on the same domain: expiring registration, redirection to an unrelated property (resale or rebranding), parking pages disguised as real sites, and stale or invalid TLS certificates. Directory listings (Doctolib, Pages Jaunes, Yelp, etc.) are recognised as such rather than blindly flagged as a personal site.

## Inputs

A list of POIs, each carrying at least a website. Items without a `site_web` are filtered out at submit time.

| Field | Type | Required | Description |
|---|---|---|---|
| `items` | array of POI objects | yes | 1 to 10 000 entries. Entries without a `site_web` are dropped before execution. |
| `source_job_id` | string | no | Identifier of the upstream list-producing job (typically a `scrap`). Used for lineage in the UI and the pipeline. |

No other tuning parameters: the module runs in a single mode.

## Outputs

Each input POI is augmented with the closed-check verdict for its website. The original POI columns are kept and the following column is appended:

| Column | Type | Description |
|---|---|---|
| `site_alive` | `"open"` \| `"closed"` \| `"uncertain"` | Final verdict. `open` means the site behaves like an active business presence, `closed` means converging signals of abandonment, `uncertain` means signals were too thin to decide. |

The progress unit during execution is `sites`; the result unit is also `sites`.

## Lifecycle

Standard job states — see [Jobs lifecycle](/docs/concepts/jobs-lifecycle). Partial counters stream over SSE so early verdicts can be consumed without waiting for the final export.

## Pipeline

Pipelinable; typically inserted right after the list-producing step, before any expensive enrichment.

| Slot | Value |
|---|---|
| `needs` | `site_web` |
| `produces` | `site_alive` |
| Category | `verify` |
| Typical upstream | `scrap` |
| Typical downstream | `emails`, `techstack`, `ads_intelligence`, `filter` |

A common pattern is `scrap` then `dead_check` then `filter` (keep `site_alive = "open"`) then any enrichment module — avoiding outreach spend on shuttered businesses.
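
When the results are handled outside a pipeline, the same filtering step can be reproduced on the exported rows. This is a client-side stand-in for the `filter` module, written as a sketch: `keep_open` is a hypothetical helper, and only the `site_alive` values documented above are used.

```python
def keep_open(rows: list[dict], include_uncertain: bool = False) -> list[dict]:
    """Keep rows whose site_alive verdict is "open" (optionally "uncertain" too)."""
    wanted = {"open", "uncertain"} if include_uncertain else {"open"}
    return [row for row in rows if row.get("site_alive") in wanted]


rows = [
    {"nom": "A", "site_web": "https://a.example", "site_alive": "open"},
    {"nom": "B", "site_web": "https://b.example", "site_alive": "closed"},
    {"nom": "C", "site_web": "https://c.example", "site_alive": "uncertain"},
]
to_enrich = keep_open(rows)
```

Whether to keep `uncertain` rows is a budget decision: including them costs more enrichment, excluding them may drop live businesses.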

## Endpoints

Create a job:

```
POST /api/jobs/dead-check
Content-Type: application/json

{
  "items": [
    { "site_web": "https://example.com", "nom": "Example Co" }
  ],
  "source_job_id": "…"
}
```

Response: a `JobPublic` object with `id`, `status`, and the standard job metadata. Poll `GET /api/jobs/{id}` or subscribe to the SSE stream for progress; download the final CSV from the job detail page once `status = "done"`.

For the full job API surface (list, detail, cancel, export, events), see [Jobs API](/docs/api/jobs). For per-account quotas, see [Limits](/docs/concepts/limits).

## Errors

| HTTP | `detail` | Cause |
|---|---|---|
| 400 | `Aucun établissement avec site web` | No input item carries a `site_web` field. |
| 400 | `Quota dépassé : …` | Estimated cost exceeds the per-job equivalent-France quota. |
| 401 / 403 | — | Missing or inactive session. |

Errors raised after the job has been created surface on the job detail page and via the SSE `error` event; the job ends in `status = "error"` and partial results, if any, remain downloadable.

## What's next

- [filter](/docs/modules/filter) — keep only POIs whose `site_alive` is `open` (or exclude `closed`) before spending budget on enrichment.
- [reviews](/docs/modules/reviews) — for POIs marked `uncertain`, recent reviews are a strong tiebreaker between an active business and a dormant one.


<!-- doc: modules/delivery_check -->

---
title: Inbox placement
slug: modules/delivery_check
section: Modules
---

# Inbox placement

Tests where a message sent from a given domain actually lands. The module sends nothing on the caller's behalf. The caller sends the real message to fifteen seed inboxes, and the module reports where each copy ended up: primary inbox, a secondary tab (Promotions, Social), or spam.

The result is a snapshot of how a recipient mailbox treated this exact message, from this exact domain, at this exact moment. It is not a simulation, a reputation lookup, or a header inspection. The module answers one question: *if this message is sent from this domain right now, where does it go?*

## Inputs

A test job takes two values.

| Field | Required | Description |
| --- | --- | --- |
| `domain` | yes | The sending domain to test, lowercased, without the `@` (for example `acme.fr`). Must contain a dot and be 3 to 120 characters. |
| `subject_filter` | no | Optional substring matched against the seed inbox subject lines. Useful to disambiguate when multiple tests run in parallel from the same domain. Up to 120 characters. |

The module does not take a recipient list. Inbox placement is a standalone job — it is not part of a pipeline and cannot consume the output of another job.
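
The `domain` constraints in the table above can be checked before the job is created. The helper below is a hypothetical sketch; accepting a full address and keeping only the part after `@` is a convenience added here, not documented API behaviour.

```python
def normalize_sending_domain(raw: str) -> str:
    """Lowercase the value, drop anything up to '@', and apply the documented checks."""
    domain = raw.strip().lower()
    if "@" in domain:
        # Convenience (assumption): accept "user@acme.fr" and keep "acme.fr".
        domain = domain.rsplit("@", 1)[1]
    if "." not in domain or not 3 <= len(domain) <= 120:
        raise ValueError("invalid sending domain")
    return domain
```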

## Outputs

The job writes one row per seed mailbox to `results_delivery.csv`. Fifteen seed inboxes are queried; each row describes what that mailbox observed.

| Column | Description |
| --- | --- |
| `seed_email` | Address of the test inbox the row refers to. |
| `seed_kind` | Provider family for the seed (used to group results by mailbox type). |
| `status` | `received` if the message was found, otherwise an empty or pending state. |
| `placement` | Where the message landed: `Inbox principal`, `Inbox · <tab>` (for example Promotions, Social), `Spam`, or empty if not received. |
| `subject` | Subject line as observed in the seed mailbox. |
| `received_relative` | Human-readable delay between send and observation (for example `2 min`). |

The structured report endpoint aggregates these rows into a summary.

| Field | Description |
| --- | --- |
| `received` | Number of seeds that observed the message. |
| `total` | Total seed inboxes queried (15). |
| `missing` | `total - received`. |
| `primary` | Seeds where placement is `Inbox principal`. |
| `primary_pct` | Primary inbox rate as a percentage of `received`. |
| `inbox_secondary` | Seeds where placement is a non-primary inbox tab. |
| `promotions` | Seeds where placement matched Promotions or Social. |
| `spam` | Seeds where placement is `Spam`. |
| `spam_pct` | Spam rate as a percentage of `received`. |
| `verdict` | Contextual verdict — see below. |
| `seeds` | The per-seed array described in the previous table. |
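
The arithmetic behind the summary can be reproduced from the per-seed rows. The sketch below is an illustration, not the endpoint's code: it relies on the `status` and `placement` values listed above, and the one-decimal rounding is an assumption. The `promotions` count and the `verdict` are left out because their exact rules are not spelled out here.

```python
def summarize(seeds: list[dict]) -> dict:
    """Aggregate per-seed rows into the count and percentage fields of the report."""
    received_rows = [s for s in seeds if s.get("status") == "received"]
    received = len(received_rows)
    primary = sum(1 for s in received_rows if s.get("placement") == "Inbox principal")
    spam = sum(1 for s in received_rows if s.get("placement") == "Spam")
    # Secondary tabs are reported as "Inbox · <tab>".
    secondary = sum(1 for s in received_rows if (s.get("placement") or "").startswith("Inbox · "))

    def pct(count: int) -> float:
        # Percentages are relative to received seeds, not to the total.
        return round(100 * count / received, 1) if received else 0.0

    return {
        "received": received,
        "total": len(seeds),
        "missing": len(seeds) - received,
        "primary": primary,
        "primary_pct": pct(primary),
        "inbox_secondary": secondary,
        "spam": spam,
        "spam_pct": pct(spam),
    }
```

Note that both percentages use `received` as the denominator, so a run with few arrivals can show a high `primary_pct` while `missing` is still large.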

The `verdict` object carries a one-line judgment and an actionable note.

| `verdict.label` | When |
| --- | --- |
| `EXCELLENT` | `primary_pct` ≥ 90. |
| `TRÈS BON` | `primary_pct` ≥ 70. |
| `MOYEN` | `primary_pct` ≥ 50. |
| `MAUVAIS` | `spam_pct` ≥ 50. |
| `INSUFFISANT` | Most messages landed in secondary tabs. |
| `EN ATTENTE` | Nothing received yet. |

## Lifecycle

Standard job states — see [Jobs lifecycle](/docs/concepts/jobs-lifecycle). The runtime workflow is: create the job, fetch seeds via `GET /api/delivery-check/seeds`, send the real message to all fifteen from the domain under test, wait for the worker to poll until all seeds report `received` or a timeout fires, then read the aggregated report.
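
On the client side, the last step of that workflow amounts to reading the report until nothing is missing. The following is a minimal sketch with the HTTP call injected as `fetch_report`, a hypothetical callable that would wrap `GET /api/jobs/{job_id}/delivery-result`. The ten-minute default timeout and fifteen-second interval are assumptions chosen for illustration.

```python
import time
from typing import Callable


def wait_for_report(
    fetch_report: Callable[[], dict],
    timeout_s: float = 600.0,
    interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until every seed has reported or the deadline passes; return the last report."""
    deadline = clock() + timeout_s
    report = fetch_report()
    # A report without a "missing" field is treated as incomplete.
    while report.get("missing", 1) > 0 and clock() < deadline:
        sleep(interval_s)
        report = fetch_report()
    return report
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting; in production the defaults apply.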

The module does not chain. Its output is not reusable as input for another job — Inbox placement is listed in the non-chainable job set alongside `viewport_test`.

## Pipeline

Inbox placement is `standalone_only`.

- **Needs:** nothing. The job takes a domain string, not a list of records.
- **Produces:** no reusable columns. The CSV exists for export but is not exposed to the pipeline graph.
- **Pipelinable:** no.
- **Veille:** not supported.

If a campaign needs to react to a placement result, the report is consumed from the API and branched on in external orchestration — the module will not feed another node directly.

## Endpoints

| Method | Path | Purpose |
| --- | --- | --- |
| `POST` | `/api/jobs/delivery-check` | Create a delivery-check job. Body: `{ "domain": "...", "subject_filter": "..." }`. Returns the public job object. |
| `GET` | `/api/delivery-check/seeds` | List the fifteen seed addresses to send the test message to. |
| `GET` | `/api/jobs/{job_id}/delivery-result` | Aggregated report with summary, verdict, and per-seed rows. |
| `GET` | `/api/jobs/{job_id}` | Standard job status (queued, running, done, failed). |

All endpoints require an authenticated, active user. Reading another user's job returns `403`.

## Limits

Budget per job is fixed at fifteen seed observations; no slider, no override. Inbox placement does not consume scraping credits (`ef_per_item: 0`), though the per-user job quota still applies. A complete run typically lands between two and eight minutes after the seed message is sent. The domain must contain a dot and is lowercased internally. For global caps, see [Limits](/docs/concepts/limits).

## Errors

| Status | Reason |
| --- | --- |
| `400` | `Domaine d'envoi invalide` — the domain is empty or does not contain a dot. |
| `400` | `Quota dépassé` — `MAX_EF_PER_JOB` reached. Inbox placement itself is free, but the quota check still applies. |
| `400` | `Pas un job de test de délivrabilité` — `/delivery-result` was called on a job whose type is not `delivery_check`. |
| `403` | The job belongs to another user. |
| `404` | The job ID does not exist. |
| `410` | The job's CSV has expired and been deleted. |

If the report returns `received: 0` after the worker has run, the seed message never arrived — either it was not sent, was blocked entirely, or the domain is on a complete blocklist. Re-send to the seeds and re-poll before drawing a conclusion.

## What's next

- [Verify emails](/docs/modules/verify_emails) — clean a list of addresses before sending, so that the seed test reflects what the deliverable subset will see.
- [Ads intelligence](/docs/modules/ads_intelligence) — once placement is solid, see which competitors are paying for visibility on the same audience.


<!-- doc: modules/emails -->

---
title: Emails
slug: modules/emails
section: Modules
summary: Find a working email address for each point of interest from a previous list.
---

## Purpose

Enrichment module: infers email addresses from each POI's website. Personal mailboxes rank above generic ones (`info@`, `contact@`, `hello@`). No address is invented: the field stays empty when no candidate qualifies.

## Inputs

The input is a list of POIs, each carrying at least a website. The two execution modes trade coverage against cost.

| Field | Type | Required | Description |
|---|---|---|---|
| `items` | array of POI objects | yes | 1 to 10 000 entries. Entries without a `site_web` are filtered out before execution. |
| `mode` | `"normal"` \| `"deep"` | no, defaults to `"normal"` | `normal` runs the standard extraction. `deep` runs an exhaustive second pass and requires a previously completed `normal` run on the same source. |
| `source_job_id` | string | conditionally | Required when `mode = "deep"`. Must reference a `done` `emails` job in `normal` mode on the same upstream source. |

## Outputs

Each input POI is augmented with up to two email fields. Original POI columns are preserved; the job appends:

| Column | Type | Description |
|---|---|---|
| `email` | string \| null | Best-ranked address for this POI. Empty when no candidate qualifies. |
| `email_personal` | string \| null | Set when the best candidate looks like a person's mailbox rather than a generic role address. |

Ranking is deterministic. Progress unit: `sites`. Result unit: `emails`.

## Lifecycle

Standard job lifecycle: see [Jobs & lifecycle](/docs/concepts/jobs-lifecycle).

## Pipeline

| Slot | Value |
|---|---|
| `needs` | `poi_list` (POIs with a `site_web` field) |
| `produces` | `enriched_list` (POIs augmented with `email`, `email_personal`) |
| Typical upstream | `scrap` |
| Typical downstream | `verify_emails`, `delivery_check`, `filter` |

Default pipeline config: `{ "mode": "normal" }`. `deep` is intended as a manual follow-up on POIs that came back empty from the normal run.
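
A deep follow-up can be prepared from the rows of a finished normal run. The sketch below is a minimal illustration: it assumes the rows are dictionaries read from the normal job's output, that an empty or missing `email` marks a POI worth retrying, and that the normal `emails` job ID is the value to pass as `source_job_id`. The helper name `build_deep_payload` is hypothetical.

```python
def build_deep_payload(normal_job_id: str, rows: list[dict]) -> dict:
    """Build a deep-mode request body for POIs whose `email` came back empty."""
    empty = [row for row in rows if not str(row.get("email") or "").strip()]
    if not empty:
        # Nothing to retry; a job needs at least one item.
        raise ValueError("every POI already has an email")
    return {"items": empty, "mode": "deep", "source_job_id": normal_job_id}
```

The resulting dictionary has the same shape as the `POST /api/jobs/emails` body documented under Endpoints.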

## Endpoints

Create a job:

```
POST /api/jobs/emails
Content-Type: application/json

{
  "items": [
    { "site_web": "https://example.com", "nom": "Example Co" }
  ],
  "mode": "normal"
}
```

Response: a `JobPublic` object with `id`, `status`, and standard metadata. For the full job API surface, see [Jobs API](/docs/api/jobs).

## Limits

Global quotas: see [/docs/concepts/limits](/docs/concepts/limits). Module-specific caps:

| Limit | Value |
|---|---|
| Minimum items per job | 1 |
| Maximum items per job | 10 000 |
| Items kept | only those with a non-empty `site_web` |
| `deep` mode prerequisite | A `done` `normal` `emails` job on the same `source_job_id` |

Items without a website are dropped during normalization. If the filtered list is empty, the job is rejected with `"Aucun établissement avec site web"`.
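
The same checks can be mirrored client-side to avoid a rejected request. This is a sketch under assumptions: the helper name `prepare_normal_payload` is hypothetical, and applying the 10 000 cap to the list that remains after filtering is a guess about how the server counts.

```python
MAX_ITEMS = 10_000  # documented maximum items per job


def prepare_normal_payload(items: list[dict]) -> dict:
    """Drop items without a website and build a normal-mode request body."""
    kept = [item for item in items if str(item.get("site_web") or "").strip()]
    if not kept:
        # Same condition that makes the server reject the job.
        raise ValueError("Aucun établissement avec site web")
    if len(kept) > MAX_ITEMS:
        raise ValueError(f"too many items: {len(kept)} > {MAX_ITEMS}")
    return {"items": kept, "mode": "normal"}
```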

## Errors

| HTTP | `detail` | Cause |
|---|---|---|
| 400 | `Mode email invalide : ... (attendu: normal \| deep)` | `mode` is neither `normal` nor `deep`. |
| 400 | `Aucun établissement avec site web` | No input item carries a non-empty `site_web`. |
| 400 | `Le mode Deep Extract n'est dispo qu'après une extraction normale ...` | `mode = "deep"` submitted without a valid prior normal run on the same source. |
| 400 | `Quota dépassé : ...` | Estimated cost exceeds the per-job equivalent-France quota. |
| 401 / 403 | — | Missing or inactive session. |

Errors raised after creation surface via the SSE `error` event; the job ends in `status = "error"` and partial results remain downloadable.

## What's next

- [verify_emails](/docs/modules/verify_emails) — confirm each address is deliverable before sending.
- [delivery_check](/docs/modules/delivery_check) — measure inbox placement on a real message.
- [filter](/docs/modules/filter) — keep only POIs that have a personal address, exclude disposable domains, or sample the list.


<!-- doc: modules/filter -->

---
title: Filter
slug: modules/filter
section: Modules
---

## Purpose

The `filter` module narrows a dataset to the rows that match a set of rules. It is pipeline-internal (see [/docs/concepts/pipeline-orchestration](/docs/concepts/pipeline-orchestration)): it consumes the CSV produced by an upstream node and emits a strict subset, with the same columns. No new data is fetched and no column is added. Filtering early saves budget on the expensive enrichment steps that follow.

## Inputs

The rules are read from the node's `config` object and applied row-by-row in a fixed order. Every key is optional; an empty rule is a no-op.

### Standard rules

| Key | Type | Behaviour |
| --- | --- | --- |
| `require_phone` | `bool` | Keep rows where `telephone` is non-empty. |
| `require_site` | `bool` | Keep rows where `site_web` is non-empty. |
| `require_email` | `bool` | Keep rows where `email` is non-empty. |
| `exclude_aggregators` | `bool` | Drop rows whose `site_web` points to a known aggregator domain. |
| `alive_only` | `bool` | Keep rows whose dead-check `status` is `alive` or `stale`. |
| `has_personal_email` | `bool` | Keep rows where at least one address in `email` is a personal mailbox (not role-based). |
| `rating_min` | `float` | Keep rows where `note >= rating_min`. |
| `reviews_min` | `int` | Keep rows where `nb_avis >= reviews_min`. |

### Advanced rules

| Key | Shape | Behaviour |
| --- | --- | --- |
| `phone_prefix` | `{ column?, prefixes[], prefix_unparseable_keep? }` | Keep rows whose phone column starts with one of `prefixes` (e.g. `06`, `+33`). Requires the `phonenumbers` library on the worker — otherwise the rule is logged and skipped. |
| `email_domain` | `{ column?, include[], exclude[], reject_disposable? }` | Keep rows whose email domain is in `include` (if set) and not in `exclude`. `reject_disposable` drops known throwaway providers. |
| `category` | `{ column, values[] }` | Keep rows whose `column` value is contained in `values`. |
| `dedup_column` | `string` | Collapse rows that share the same value on this column (first row wins). |
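
As an illustration of how an advanced rule reads, here is a minimal sketch of the `email_domain` predicate. It is not the worker's code: the default column name, the handling of rows without a parseable address, the single-address assumption, and the two-entry `DISPOSABLE` set are all assumptions made for the example.

```python
# Hypothetical sample; the real list of throwaway providers is not documented.
DISPOSABLE = {"mailinator.com", "yopmail.com"}


def passes_email_domain(row: dict, rule: dict) -> bool:
    """Return True when the row's email domain satisfies the rule."""
    column = rule.get("column", "email")  # default column is an assumption
    value = str(row.get(column) or "").strip().lower()
    if "@" not in value:
        return False  # assumption: rows without a parseable address are dropped
    domain = value.rsplit("@", 1)[1]

    include = {d.lower() for d in rule.get("include", [])}
    exclude = {d.lower() for d in rule.get("exclude", [])}
    if include and domain not in include:
        return False
    if domain in exclude:
        return False
    if rule.get("reject_disposable") and domain in DISPOSABLE:
        return False
    return True
```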

### Sampling

| Key | Type | Behaviour |
| --- | --- | --- |
| `sample_type` | `"n" \| "pct" \| ""` | Selects which sampling mode applies after the rules above. |
| `sample_n` | `int` | Keep the first `n` matched rows. |
| `sample_pct` | `0..100` | Keep a percentage of matched rows. |
| `sample_seed` | `int` | Seed for reproducible random sampling. |

Order of application: requirement flags → aggregator/alive/rating/reviews → personal-email → `phone_prefix` → `email_domain` → `category` → `dedup_column` → sampling.
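
The ordering can be sketched on a list of row dictionaries. The example below covers only a subset of the rules (requirement flags, `rating_min`, `reviews_min`, `dedup_column`, sampling) and is an illustration rather than the worker's implementation. Dropping rows with a missing or malformed numeric value, flooring the percentage, and preserving row order in `pct` sampling are assumptions.

```python
import random
from typing import Optional


def _to_float(value) -> Optional[float]:
    """Parse a CSV cell as a number; None when empty or malformed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def apply_filter(rows: list[dict], config: dict) -> list[dict]:
    kept = list(rows)

    # 1. Requirement flags.
    for flag, column in (("require_phone", "telephone"),
                         ("require_site", "site_web"),
                         ("require_email", "email")):
        if config.get(flag):
            kept = [r for r in kept if str(r.get(column) or "").strip()]

    # 2. Numeric thresholds (rows without a usable value are dropped: assumption).
    for key, column in (("rating_min", "note"), ("reviews_min", "nb_avis")):
        minimum = config.get(key)
        if minimum is not None:
            kept = [r for r in kept
                    if (v := _to_float(r.get(column))) is not None and v >= minimum]

    # 3. dedup_column: the first row with a given value wins.
    column = config.get("dedup_column")
    if column:
        seen, unique = set(), []
        for r in kept:
            if r.get(column) not in seen:
                seen.add(r.get(column))
                unique.append(r)
        kept = unique

    # 4. Sampling runs last.
    if config.get("sample_type") == "n":
        kept = kept[: max(0, int(config.get("sample_n", 0)))]
    elif config.get("sample_type") == "pct":
        size = int(len(kept) * config.get("sample_pct", 100) / 100)
        size = max(0, min(len(kept), size))
        rng = random.Random(config.get("sample_seed"))
        picked = sorted(rng.sample(range(len(kept)), size))
        kept = [kept[i] for i in picked]
    return kept
```

Because deduplication runs after the requirement and threshold rules, the "first row" that wins is the first one that survived the earlier steps, not necessarily the first one in the upstream CSV.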

## Outputs

The module writes a CSV with the same columns as the upstream node, containing only the matched rows. It does not produce new fields (`needs: []`, `produces: []`, `pipeline_passthrough: true`).

| Field | Value |
| --- | --- |
| `output_filename` | `results_<label>.csv` (same shape as upstream) |
| `n_items` | Number of rows kept |
| `progress_unit` | `lignes` |
| `results_unit` | `lignes gardées` |

## Lifecycle

Standard job lifecycle — see [/docs/concepts/jobs-lifecycle](/docs/concepts/jobs-lifecycle). Filter jobs run in the `parallel` pool, are created by the pipeline runner, and are not surfaced on the dashboard or in "New job".

## Pipeline

| Attribute | Value |
| --- | --- |
| `category` | `process` |
| `pipelinable` | `true` |
| `pipeline_passthrough` | `true` |
| `needs` | `[]` (works on any upstream type) |
| `produces` | `[]` |
| `hidden_from_new_job` | `true` |
| `hidden_from_dashboard` | `true` |

`filter` accepts any upstream module. The UI only exposes advanced rules whose target field is actually present in the upstream output — for example, the `phone_prefix` block is shown only if an upstream node produces a `phone` field.

## Endpoints

`filter` is a pipeline-internal job type (see [/docs/concepts/pipeline-orchestration](/docs/concepts/pipeline-orchestration)). It has no public `POST /api/jobs/filter` endpoint: filter jobs are created by the pipeline runner and configured through the pipeline definition.

Two groups of endpoints are user-facing:

### Filter preview

`POST /api/pipelines/{pipeline_id}/nodes/{node_id}/filter-preview`

Applies a rule set to the upstream node's CSV in memory and returns how many rows would match — without creating a job. Used by the editor for live feedback while rules are being edited.

Request:

```json
{
  "rules": {
    "require_email": true,
    "phone_prefix": { "column": "telephone", "prefixes": ["06", "+33"] },
    "sample_type": "pct",
    "sample_pct": 25
  }
}
```

Response:

| Field | Type | Notes |
| --- | --- | --- |
| `total` | `int` | Rows read from the upstream CSV. |
| `matched` | `int` | Rows kept after the rules. |
| `samples` | `array` | Up to 5 matched rows, with empty columns stripped. |
| `predecessor_job_id` | `string` | Job whose CSV was previewed. |
| `fieldnames` | `string[]` | Columns of the upstream CSV. |
| `capped` | `bool` | `true` if the read hit the row cap (see Limits). |
| `reason` | `string` | Present only when `total = 0`: `no_predecessor`, `no_data_yet`, or `no_csv_found`. |

Errors: `404` if the pipeline or node is missing, `403` if the caller does not own the pipeline, `400` if the node is not of type `filter`, `400` if the rules are malformed.
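
A caller typically turns the preview response into a short status line. The sketch below is one way to do it, assuming the response has already been parsed into a dictionary with the fields listed above; the helper name `summarize_preview` and the wording of the messages are illustrative.

```python
def summarize_preview(preview: dict) -> str:
    """Render a one-line summary of a filter-preview response."""
    total = preview.get("total", 0)
    if total == 0:
        # `reason` is only present when no rows could be read.
        return f"no rows to preview ({preview.get('reason', 'unknown')})"
    matched = preview.get("matched", 0)
    rate = 100 * matched / total
    note = " (preview capped, full job reads every row)" if preview.get("capped") else ""
    return f"{matched}/{total} rows match ({rate:.1f}%){note}"
```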

### Pipeline job items

Once the pipeline runs the filter node, the resulting CSV is served by the generic job endpoints:

- `GET /api/jobs/{job_id}/items`
- `GET /api/jobs/{job_id}/output-columns`
- `GET /api/jobs/{job_id}/download`

## Limits

Global limits — see [/docs/concepts/limits](/docs/concepts/limits). Module-specific:

| Limit | Value |
| --- | --- |
| Preview row cap | `5000` rows (`_PREVIEW_ROWS_LIMIT`). When the upstream CSV is larger, the preview reads the first 5000 rows and sets `capped: true`. The full filter job, when executed, applies the rules to every row. |
| `phone_prefix` dependency | Requires the `phonenumbers` package on the worker. If missing, the rule is logged and ignored — the rest of the rules still apply. |

## Errors

| Code | Cause |
| --- | --- |
| `400` | The node referenced by the preview is not a `filter` node, or `rules` is not a valid object. |
| `403` | The pipeline does not belong to the caller and the caller is not admin. |
| `404` | The pipeline or the node does not exist. |
| `500` | The upstream CSV could not be read (corrupted file, missing on disk). |

A filter job itself fails only if the upstream CSV is unreadable; invalid rule values are coerced to no-ops rather than raising.

## What's next

- [Sort](/docs/modules/sort) — order the filtered rows and optionally keep the top N.
- [Import](/docs/modules/import) — bring an external CSV into a pipeline so it can be filtered like any other source.


<!-- doc: modules/finance -->

---
title: Financial data
slug: modules/finance
section: Modules
---

> **Frozen during alpha.** This module is built and listed across the product as a regular module, but it cannot be launched while outsend is in alpha — extracting and parsing filed annual accounts is too resource-intensive for the current alpha capacity. The new-job page shows it with a 🛠️ *"unavailable in alpha"* banner and a disabled **Launch** button, and `POST /api/jobs/finance` returns `503`. It will be enabled once capacity allows, with the contract described below.

The `finance` module turns each company in a list into a **financial profile** built from publicly filed annual accounts. Where [`legal_data`](/docs/modules/legal_data) returns an administrative snapshot with headline figures (revenue, net income, capital), `finance` goes deeper: multi-year revenue, profitability, balance-sheet structure, and derived solvency and liquidity ratios with a consolidated risk score.

The module is read-only: no credentials are required, no fees are charged by the upstream sources, and no business is contacted as part of the lookup.

## Purpose

Qualify a list financially before spending outreach effort on it — prioritise solid companies and screen out fragile structures.

Typical use cases:

- Rank a scraped list by revenue trend or solvency before a campaign.
- Drop companies with a deteriorating risk score from a sequence.
- Segment by balance-sheet size (equity, debt, cash) for tiered offers.
- Spot growth signals (multi-year revenue increase) as an opportunity cue.

## Inputs

`finance` is an enrichment module: it consumes an existing list of POIs rather than producing one. The expected input is a `poi_list`, typically the output of a discovery job.

| Field         | Required | Notes                                              |
| ------------- | -------- | -------------------------------------------------- |
| `nom`         | yes      | Company name, used for matching.                   |
| `siren`       | no       | If present, used for an exact match (preferred).   |
| `code_postal` | no       | Disambiguates fuzzy name matches.                  |

Match resolution mirrors [`legal_data`](/docs/modules/legal_data): exact SIREN lookup when available, then a fuzzy match on `nom` + `code_postal`. A row that cannot be resolved — or a company that has not filed accounts — is returned with empty enrichment columns and an error code (see Errors).

## Outputs

Each input row is augmented with the following columns. Empty values are preserved as empty strings — the module never fabricates a value.

| Column                | Type   | Description                                              |
| --------------------- | ------ | ------------------------------------------------------- |
| `ca_3ans`             | object | Revenue for the last three filed fiscal years.          |
| `resultat_net`        | number | Net income for the latest filed year, in EUR.           |
| `ebitda`              | number | Earnings before interest, taxes, D&A, in EUR.           |
| `fonds_propres`       | number | Shareholders' equity, in EUR.                           |
| `dettes_financieres`  | number | Financial debt, in EUR.                                 |
| `tresorerie`          | number | Cash and cash equivalents, in EUR.                      |
| `bfr`                 | number | Working capital requirement, in EUR.                    |
| `ratio_solvabilite`   | number | Solvency ratio (equity / total assets).                 |
| `ratio_liquidite`     | number | Current ratio (current assets / current liabilities).   |
| `marge_nette`         | number | Net margin (net income / revenue), as a percentage.     |
| `score_risque`        | string | Consolidated risk score (`faible`, `modere`, `eleve`).  |
| `date_cloture`        | date   | Closing date of the latest filed fiscal year.           |
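
The three derived ratios follow directly from the definitions in the table. The sketch below shows the arithmetic only; the input names (`total_assets`, `current_assets`, `current_liabilities`, `revenue`) are hypothetical balance-sheet figures that are not part of the output columns, and returning `None` when a figure is missing or a denominator is zero is an assumption in line with the rule that no value is fabricated.

```python
from typing import Optional


def ratio(numerator: Optional[float], denominator: Optional[float],
          scale: float = 1.0) -> Optional[float]:
    """Return scale * numerator / denominator, or None when it cannot be computed."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return scale * numerator / denominator


def derived_ratios(equity, total_assets, current_assets,
                   current_liabilities, net_income, revenue) -> dict:
    return {
        # Solvency ratio: equity / total assets.
        "ratio_solvabilite": ratio(equity, total_assets),
        # Current ratio: current assets / current liabilities.
        "ratio_liquidite": ratio(current_assets, current_liabilities),
        # Net margin: net income / revenue, as a percentage.
        "marge_nette": ratio(net_income, revenue, scale=100.0),
    }
```

For example, 400 000 EUR of equity against 1 000 000 EUR of total assets gives a solvency ratio of 0.4, and 50 000 EUR of net income on 500 000 EUR of revenue gives a net margin of 10.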

## Lifecycle

Standard job lifecycle — see [Jobs lifecycle](/docs/concepts/jobs-lifecycle). Progress is reported per company processed. While the module is frozen in alpha, no job can be created; the section below documents the contract that applies once it is enabled.

## Pipeline

Not pipelinable in alpha. The module appears in the dashboard, the new-job picker, the tools catalogue, and the landing page as an active module, but it is not exposed as a pipeline node while frozen. Once enabled it consumes a `poi_list` and emits an `enriched_list` carrying the original rows plus the columns described in Outputs.

## Endpoints

### Create a job

```
POST /api/jobs/finance
```

**During alpha this endpoint returns `503`** with an explanatory message. The request body below describes the contract that applies once the module is enabled:

```json
{
  "items": [
    { "nom": "Boulangerie Martin", "code_postal": "75011" },
    { "siren": "552120222" }
  ],
  "source_job_id": "job_01HXYZ..."
}
```

Either `items` or `source_job_id` must be provided. When `source_job_id` references a completed discovery job, its rows are used as input directly.

### Retrieve a job

```
GET /api/jobs/{job_id}
```

Returns the current state, progress counters, and — when `done` — the download URL for the enriched CSV.

## Limits

Financial figures depend on the company having filed its accounts. A large share of French SMEs file under a confidentiality option, in which case detailed figures are unavailable and the row is returned with empty financial columns and a `no_accounts_filed` code.

## Errors

Row-level errors are reported in an `error` column on the enriched output. Job-level errors transition the job to `failed`.

| Code                 | Scope | Meaning                                                 |
| -------------------- | ----- | ------------------------------------------------------- |
| `alpha_unavailable`  | job   | Module is frozen during alpha. `POST` returns `503`.    |
| `not_found`          | row   | No registry match for the provided name and postcode.   |
| `no_accounts_filed`  | row   | Company exists but has not filed (or filed confidential) accounts. |
| `ambiguous_match`    | row   | Several candidates with equal score; none selected.     |
| `source_unavailable` | job   | One or more upstream public sources are unreachable.    |
| `invalid_input`      | job   | Input list is empty or missing required fields.         |

## What's next

- [`legal_data`](/docs/modules/legal_data) — administrative snapshot (legal form, capital, executives, headline financials) matched by name. The lighter companion to this module.
- [`legal_mentions`](/docs/modules/legal_mentions) — parse the legal notice page of a website to extract registered name, capital, RCS, postal address, and VAT number.
- [Module registry](/docs/concepts/module-registry) — how a module's state (active, on-demand, coming-soon, alpha-frozen) is published.


<!-- doc: modules/import -->

---
title: Import
slug: modules/import
section: Modules
---

## Purpose

The `import` module brings external data into outsend as a pipeline source. It is pipeline-internal — see [/docs/concepts/pipeline-orchestration](/docs/concepts/pipeline-orchestration) — and produces a normalized POI list that downstream enrichment, verification, or processing nodes can consume. Unlike `scrap`, `import` consumes no extraction quota (EF cost is zero).

## Inputs

The node config exposes a single discriminator, `source`, with three mutually exclusive modes.

| Field | Type | Required | Description |
|---|---|---|---|
| `source` | `"paste"` \| `"url"` \| `"from_job"` | no, defaults to `"paste"` | Selects which of the three input channels below applies. |
| `text` | string | when `source = paste` | Raw CSV content. Read only in `paste` mode. |
| `url` | string | when `source = url` | Public spreadsheet URL. Read only in `url` mode. |
| `from_job_id` | string | when `source = from_job` | UUID of an existing scrap job owned by the caller. Read only in `from_job` mode. |

### `paste` — inline CSV

The `text` payload is parsed as CSV by the shared resolution layer ([`app/column_map.py`](/docs/concepts/module-registry#flexible-input-matching)). The delimiter is auto-detected (comma, semicolon, or tab); UTF-8 is expected, with UTF-8 BOM and Latin-1 / cp1252 accepted as fallbacks. Headers are **not** mandatory: column names are matched flexibly against accepted aliases (a `Website`, `url`, `e-mail` or `raison sociale` header maps to the right canonical column), and a header-less sheet is auto-detected — its columns are then inferred from their content. Either way the import emits a **`notice`** (info banner on the job page, ⓘ on the dashboard) reporting what was auto-mapped, inferred, or ignored, so the mapping is never silent.
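
To make the decoding and delimiter rules concrete, here is a minimal sketch of a comparable parser. It is not the shared resolution layer: the count-based delimiter heuristic on the first non-empty line and the final `latin-1` fallback are simplifications chosen for the example, and alias matching and header inference are left out.

```python
import csv
import io


def decode_csv_bytes(raw: bytes) -> str:
    """Decode as UTF-8 (BOM stripped if present), then fall back to cp1252 / Latin-1."""
    for encoding in ("utf-8-sig", "cp1252"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1")  # maps every byte, so it cannot fail


def detect_delimiter(text: str) -> str:
    """Pick the candidate that occurs most often in the first non-empty line."""
    first_line = next((line for line in text.splitlines() if line.strip()), "")
    return max((",", ";", "\t"), key=first_line.count)


def parse_pasted_csv(raw: bytes) -> tuple[str, list[list[str]]]:
    text = decode_csv_bytes(raw)
    delimiter = detect_delimiter(text)
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # Skip fully empty lines.
    rows = [row for row in reader if any(cell.strip() for cell in row)]
    return delimiter, rows
```

Running pasted content through `POST /api/jobs/parse-list` remains the reliable way to see how outsend itself will read it.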

### `url` — public spreadsheet

The `url` payload points to a publicly readable spreadsheet (typical shape: `https://docs.google.com/spreadsheets/d/.../edit#gid=0`). The sheet must be shared as "anyone with the link can view" — outsend does not authenticate to third-party providers. The fetched content is parsed with the same CSV rules as `paste`.

### `from_job` — recent scrap reuse

The `from_job_id` payload references a previous `scrap` job. The reference is validated server-side at job creation:

| Constraint | Rule |
|---|---|
| Existence | The job ID must resolve to an existing job. |
| Ownership | The caller must own the source job. |
| Job type | Must be `scrap`. Other job types cannot be re-imported through this channel. |
| Availability | The source CSV must still be downloadable (`is_download_available`). |
| Recency | The source job must be less than 7 days old. |

When valid, the resulting import inherits all columns produced by the source scrap.
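
The constraints can be checked in table order. The sketch below is illustrative: the job record is a plain dictionary, the `owner_id` and `created_at` field names are hypothetical, and the returned strings are labels for the failed constraint rather than the server's error messages.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_AGE = timedelta(days=7)  # documented recency window


def check_from_job(job: Optional[dict], caller_id: str, now: datetime) -> Optional[str]:
    """Return None when the source job is usable, else the failed constraint."""
    if job is None:
        return "existence"
    if job["owner_id"] != caller_id:
        return "ownership"
    if job["type"] != "scrap":
        return "job_type"
    if not job["is_download_available"]:
        return "availability"
    if now - job["created_at"] >= MAX_AGE:
        return "recency"  # must be less than 7 days old
    return None
```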

## Outputs

`import` produces a normalized POI list, declared in the pipeline registry as `output: "pois"` — the same shape `scrap` emits. Downstream nodes that accept `pois_any` (reviews, emails, socials, dead-check, techstack, ads-intelligence, brand-assets) chain directly. Nodes that require `pois_email` (verify) chain only if the imported CSV already carries an email column.

The column set is dynamic: it mirrors whatever the source provides. The registry declares `needs: []` and `produces: []` for this reason — the module is permissive on input and propagates the input schema as output.

## Lifecycle

Standard job lifecycle — see [/docs/concepts/jobs-lifecycle](/docs/concepts/jobs-lifecycle). The job is linked to its pipeline via `pipeline_id` and `pipeline_node_id` and runs as soon as the pipeline transitions to `running`.

## Pipeline

`import` is a root node. It accepts no upstream edges. Any node whose `input` is `pois_any`, `any_pois`, or `pois_email` (when the CSV carries emails) can be wired downstream.

| Direction | Compatible types |
|---|---|
| Upstream | none — `import` is a `ROOT_TYPE` alongside `scrap` |
| Downstream | `reviews`, `emails`, `verify` (with email column), `socials`, `dead_check`, `techstack`, `ads_intelligence`, `brand_assets`, `filter`, `sort` |

Registry: `needs: []`, `produces: []`.

## Endpoints

The `import` module is not exposed as a standalone job endpoint — it is pipeline-internal (see [/docs/concepts/pipeline-orchestration](/docs/concepts/pipeline-orchestration)) and created only as a pipeline root. Two adjacent endpoints are useful when assembling an import:

| Method | Path | Purpose |
|---|---|---|
| `POST` | `/api/jobs/parse-list` | Validates CSV input before submission. Accepts either `{"text": "..."}` JSON or a `multipart/form-data` upload with a `file` field. Returns `{count, with_lien_google_maps, with_site_web, sample, items, delimiter}`. |
| `GET` | `/api/jobs/{job_id}/items` | Returns the CSV rows of a finished `scrap` job in a structure suitable for `from_job` reuse. |

The pipeline node payload itself follows this shape:

```json
{
  "type": "import",
  "config": {
    "source": "paste",
    "text": "nom,site_web\n...",
    "url": "",
    "from_job_id": ""
  }
}
```

Exactly one of `text`, `url`, `from_job_id` is read, determined by `source`. The unused fields are persisted as empty strings.
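
A small builder keeps the three channels mutually exclusive. This is a sketch; the helper name `import_node` is hypothetical, and the error text simply echoes the message shape listed under Errors.

```python
def import_node(source: str, value: str) -> dict:
    """Build an import node payload with exactly one populated input field."""
    fields = {"paste": "text", "url": "url", "from_job": "from_job_id"}
    if source not in fields:
        raise ValueError(
            f"Source d'import invalide : {source} (attendu: paste | url | from_job)"
        )
    # Unused fields are persisted as empty strings.
    config = {"source": source, "text": "", "url": "", "from_job_id": ""}
    config[fields[source]] = value
    return {"type": "import", "config": config}
```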

## Limits

Global limits — see [/docs/concepts/limits](/docs/concepts/limits). Module-specific:

| Limit | Value |
|---|---|
| `from_job` recency | 7 days. The source job is rejected past that window. |
| `from_job` source type | `scrap` only. |
| Supported formats | CSV with comma, semicolon, or tab delimiter. Encodings: UTF-8 (preferred), UTF-8 with BOM, Latin-1 / cp1252 (fallback). Headers optional — a header-less sheet is auto-detected and its columns inferred from content. |

## Errors

| Condition | Surface | Message shape |
|---|---|---|
| `source` not in `{paste, url, from_job}` | Pipeline creation | `Source d'import invalide : <value> (attendu: paste \| url \| from_job)` |
| `from_job` without `from_job_id` | Pipeline creation | `Source 'from_job' : aucun job sélectionné` |
| `from_job_id` unknown | Pipeline creation | `Job source introuvable : <id>` |
| `from_job` source not owned by caller | Pipeline creation | `Job source non autorisé pour cet utilisateur` |
| `from_job` source not a scrap | Pipeline creation | `Seuls les scraps Gmaps peuvent être importés via 'from_job'` |
| `from_job` source CSV unavailable | Pipeline creation | `Le CSV du job source n'est pas (ou plus) disponible` |
| `from_job` source older than 7 days | Pipeline creation | `Le job source a plus de 7 jours — relancez un scrap ou collez le CSV.` |
| Empty paste payload | `parse-list` | HTTP 400, `Aucun texte fourni` |
| CSV parse failure | `parse-list` | HTTP 400, `CSV invalide: <detail>` |
| Zero parsed rows | `parse-list` | HTTP 400, `Aucune ligne lue dans le CSV` |
| Multipart upload missing file | `parse-list` | HTTP 400, `Aucun fichier fourni` |
| URL unreachable or non-CSV response | Pipeline execution | The import job transitions to `failed`; the message names the unreachable source. |
| Private spreadsheet (login page returned instead of CSV) | Pipeline execution | The import **fails loudly** with an explanation instead of silently succeeding — the content was HTML (a sign-in page), not CSV. Share the sheet as "anyone with the link can view". |
| Empty, header-only, or nothing exploitable | Pipeline execution / `parse-list` | The import **fails with an explanation** (no usable rows) rather than reporting a misleading success. |

## What's next

| Module | Use it to |
|---|---|
| [filter](/docs/modules/filter) | Narrow the imported list by column predicates before paying for downstream enrichment. |
| [sort](/docs/modules/sort) | Order the imported list — useful when combined with row limits in later steps. |


<!-- doc: modules/legal_data -->

---
title: French legal data
slug: modules/legal_data
section: Modules
---

The `legal_data` module enriches a list of POIs with official records from French public legal data sources. For each input row, the module queries `api.gouv.fr` (SIRENE, INPI, RNCS) and complements the response with BODACC legal notices and Infogreffe public extracts. The result is a structured profile attached to each company: legal form, capital, registered executives, NAF code, headcount band, headline financials, and a consolidated lead status.

The module is read-only: no credentials are required, no fees are charged by the upstream sources, and no business is contacted as part of the lookup.

> **No website needed.** Unlike [`legal_ids`](/docs/modules/legal_ids), which reads identifiers from a website, this module matches each row by **name + address** against SIRENE and **also returns the SIRET/SIREN**. Choose it when your list has no website column — for example a Google Maps scrape with names and map links only.

## Purpose

French B2B prospecting lists tend to start with a name, an address, and maybe a website. `legal_data` turns each row into a qualified company record using only public registries.

Typical use cases:

- Filter a scraped list by capital, headcount band, or NAF code before outreach.
- Detect dead or insolvent entities (`bodacc_procedure_collective`) and drop them from a sequence.
- Identify companies with recent legal events (capital increase, executive change, address change) as opportunity signals.
- Recover named executives to personalize a first-touch email.

## Inputs

`legal_data` is an enrichment module: it consumes an existing list of POIs rather than producing one. The expected input is a `poi_list`, typically the output of a discovery job.

| Field          | Required | Notes                                              |
| -------------- | -------- | -------------------------------------------------- |
| `nom`          | yes      | Company name, used for fuzzy matching.             |
| `siren`        | no       | If present, used for an exact match (preferred).   |
| `code_postal`  | no       | Disambiguates fuzzy name matches.                  |
| `lat`, `lon`   | no       | Geographic fallback when name and SIREN both fail. |

Match resolution follows three tiers, in order:

1. Exact SIREN lookup when the identifier is provided.
2. Fuzzy match on `nom` + `code_postal`.
3. Geographic fallback on coordinates within a small radius.

A row that cannot be resolved is returned with empty enrichment columns and an error code (see Errors).
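
The three tiers can be sketched as a fall-through chain. The lookup callables `by_siren`, `by_name`, and `by_geo` are hypothetical stand-ins for the registry queries, and falling through to the next tier when a lookup returns nothing is an assumption drawn from the field notes above.

```python
from typing import Callable, Optional

# Hypothetical lookup: takes an input row, returns a matched record or None.
Lookup = Callable[[dict], Optional[dict]]


def resolve(row: dict, by_siren: Lookup, by_name: Lookup, by_geo: Lookup) -> Optional[dict]:
    """Try the exact SIREN lookup, then the fuzzy name match, then coordinates."""
    if row.get("siren"):
        match = by_siren(row)
        if match:
            return match
    if row.get("nom"):
        match = by_name(row)  # fuzzy match on nom + code_postal
        if match:
            return match
    if row.get("lat") is not None and row.get("lon") is not None:
        return by_geo(row)  # geographic fallback within a small radius
    return None  # unresolved: the row keeps empty enrichment columns
```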

## Outputs

Each input row is augmented with the following columns. Empty values are preserved as empty strings — the module never fabricates a value.

| Column           | Type    | Description                                              |
| ---------------- | ------- | -------------------------------------------------------- |
| `legal_form`     | string  | Legal form (SAS, SARL, SA, EI, association, etc.).       |
| `capital`        | number  | Registered share capital in EUR.                         |
| `founding_date`  | date    | Date of registration in the company register.            |
| `executives`     | list    | Named executives with role (Président, Gérant, DG).      |
| `financials`     | object  | Last available revenue and net income, with fiscal year. |
| `naf_code`       | string  | Five-character NAF/APE activity code.                    |
| `employees_range`| string  | INSEE headcount band (e.g. `10-19`, `100-199`).          |

A consolidated `lead_status` is also returned, taking one of four values: `mort`, `alerte`, `opportunite`, `actif`. It encodes the combination of administrative state, BODACC signals, and recency of legal events.

## Lifecycle

Standard job lifecycle — see [Jobs lifecycle](/docs/concepts/jobs-lifecycle). Progress is reported per establishment processed. The job is idempotent within a session: re-running on the same input list yields the same enriched columns, modulo upstream registry updates.

## Pipeline

```
needs:     poi_list
produces:  enriched_list
```

`legal_data` consumes a `poi_list` and emits an `enriched_list` carrying the original rows plus the columns described in Outputs. The enriched list can itself be consumed by downstream enrichment modules (`legal_mentions`, `legal_ids`, etc.).

## Endpoints

### Create a job

```
POST /api/jobs/legal-data
```

Request body:

```json
{
  "items": [
    { "nom": "Boulangerie Martin", "code_postal": "75011" },
    { "siren": "552120222" }
  ],
  "source_job_id": "job_01HXYZ..."
}
```

Either `items` or `source_job_id` must be provided. When `source_job_id` references a completed discovery job, its rows are used as input directly.
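The rule above (at least one of `items` or `source_job_id`) can be checked client-side before the request is sent. The following is a minimal sketch, not part of any outsend SDK: the helper name `build_legal_data_payload` is hypothetical, and only the two body fields shown in the example above are assumed.

```python
import json
from typing import Optional


def build_legal_data_payload(
    items: Optional[list[dict]] = None,
    source_job_id: Optional[str] = None,
) -> str:
    """Return a JSON body for POST /api/jobs/legal-data (hypothetical helper)."""
    if not items and not source_job_id:
        # The page requires at least one of the two inputs.
        raise ValueError("provide `items` or `source_job_id`")
    payload: dict = {}
    if items:
        payload["items"] = items
    if source_job_id:
        payload["source_job_id"] = source_job_id
    return json.dumps(payload, ensure_ascii=False)


body = build_legal_data_payload(
    items=[
        {"nom": "Boulangerie Martin", "code_postal": "75011"},
        {"siren": "552120222"},
    ]
)
```

Failing early on an empty request avoids a round trip that would most likely end in `invalid_input`.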

Response: a `Job` resource with `id`, `type`, `status`, and progress fields.

### Retrieve a job

```
GET /api/jobs/{job_id}
```

Returns the current state, progress counters, and — when `done` — the download URL for the enriched CSV.
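A client typically polls this endpoint until the job settles. The sketch below assumes a caller-supplied `fetch_job` function (hypothetical) that wraps `GET /api/jobs/{job_id}` and returns the parsed JSON; the only status values it relies on are `done` and `failed`, as named on this page. Any other terminal state would need to be added.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_job: Callable[[str], dict],
    job_id: str,
    interval_s: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reports `done` or `failed`, then return it."""
    for _ in range(max_polls):
        job = fetch_job(job_id)
        if job.get("status") in ("done", "failed"):
            return job
        sleep(interval_s)  # wait before the next poll
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")
```

The interval and poll ceiling are arbitrary defaults; tune them against the rate limits described in [Limits](/docs/concepts/limits).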

### List jobs

```
GET /api/jobs?type=legal_data
```

## Limits

Maximum **5,000 rows per job**. Larger lists must be split client-side. Global quotas and rate limits: see [Limits](/docs/concepts/limits).
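The client-side split can be as simple as slicing the list. A minimal sketch, where the function name `split_rows` is illustrative and the rows are placeholder data:

```python
from typing import Iterator

MAX_ROWS_PER_JOB = 5_000  # per-job ceiling stated above


def split_rows(rows: list[dict], size: int = MAX_ROWS_PER_JOB) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` rows, preserving order."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Placeholder rows, not real identifiers.
rows = [{"siren": f"{n:09d}"} for n in range(12_000)]
batches = list(split_rows(rows))
print([len(batch) for batch in batches])  # [5000, 5000, 2000]
```

Each batch is then submitted as its own job.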

Financial figures depend on the company having filed its accounts (roughly 60 percent of French SMEs). Executive names reflect the last filing; recent changes may take a few weeks to propagate.

## Errors

Row-level errors are reported in an `error` column on the enriched output. Job-level errors transition the job to `failed`.

| Code                     | Scope | Meaning                                                |
| ------------------------ | ----- | ------------------------------------------------------ |
| `not_found`              | row   | No match in SIRENE for the provided name and postcode. |
| `foreign_business`       | row   | Establishment is not registered in France.             |
| `ambiguous_match`        | row   | Several candidates with equal score; none selected.    |
| `source_unavailable`     | job   | One or more upstream public sources are unreachable.   |
| `quota_exceeded`         | job   | Daily fair-use quota reached; retry the next day.      |
| `invalid_input`          | job   | Input list is empty or missing required fields.        |

A `source_unavailable` failure preserves all rows already enriched before the outage. The job can be re-submitted with the remaining rows once the upstream source recovers.
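To decide which rows are worth re-submitting, it helps to group the enriched output by its `error` column. The sketch below assumes the CSV has a header row and an `error` column as described above; the other column names in the sample are illustrative.

```python
import csv
import io
from collections import defaultdict


def group_by_error(csv_text: str) -> dict[str, list[dict]]:
    """Group enriched rows by their `error` value; rows without one go under "ok"."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        code = (row.get("error") or "").strip() or "ok"
        groups[code].append(row)
    return dict(groups)


sample = (
    "nom,code_postal,legal_form,error\n"
    "Boulangerie Martin,75011,SARL,\n"
    "Cafe Inconnu,69001,,not_found\n"
    "Dupont,33000,,ambiguous_match\n"
)
groups = group_by_error(sample)
print(sorted(groups))  # ['ambiguous_match', 'not_found', 'ok']
```

Rows under `ambiguous_match` are natural candidates for a retry with a more specific input, for example a `siren`.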

## What's next

- [`legal_ids`](/docs/modules/legal_ids) — detect SIREN and SIRET identifiers directly on a company website, with Luhn validation. A useful prerequisite when input rows lack a `siren`.
- [`legal_mentions`](/docs/modules/legal_mentions) — parse the legal notice page of a website to extract registered name, capital, RCS, postal address, and VAT number. Complements `legal_data` when registry filings are sparse.


<!-- doc: modules/legal_ids -->

---
title: Business IDs
slug: modules/legal_ids
section: Modules
---

# Business IDs

The `legal_ids` module extracts French business identifiers from a list of POI that already have a website. For each site, the module locates and validates the **SIRET** (14 digits) and **SIREN** (9 digits), and records the URL where the identifier was found. The Luhn checksum is applied to every candidate, so phone numbers that happen to be nine digits long are rejected before they reach the output.

This module is the standard entry point into the legal-data pipeline: once an item carries a verified SIREN, downstream modules such as `legal_data` can query official registries without a name-matching step.

> **Website required.** This module reads the SIRET/SIREN from each establishment's **website** (the legal-mentions page), so every input row needs a `site_web` value. If your list has only names or Google Maps links (no website), use [French legal data](/docs/modules/legal_data) instead — it returns the same SIRET/SIREN by matching the **name** against the SIRENE registry, no website needed.

## Purpose

- Attach an authoritative French business ID (SIREN/SIRET) to each POI.
- Keep provenance: every identifier ships with the URL it was extracted from.
- Provide a clean, deduplicated key for further enrichment (`legal_data`, `legal_mentions`).

## Inputs

The module consumes an `enriched_list` of POI. The minimum field required on each item is `site_web`; items without a website are filtered out at job creation and counted as ignored.

| Field      | Type   | Required | Notes                                       |
| ---------- | ------ | -------- | ------------------------------------------- |
| `site_web` | string | yes      | Root domain or any URL on the target site.  |
| `nom`      | string | no       | Carried through to the output for context.  |
| `adresse`  | string | no       | Carried through to the output for context.  |

A POI list is typically produced by the `scrap` module, but any `enriched_list` with `site_web` populated is accepted.
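Because items without a website are ignored at job creation, it can be useful to run the same check locally and see how many rows would be dropped. A minimal sketch, assuming items are plain dictionaries; `split_by_website` is a hypothetical helper, and treating a blank string as "no website" is an assumption:

```python
def split_by_website(items: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (kept, ignored): items with a non-blank `site_web`, and the rest."""
    kept, ignored = [], []
    for item in items:
        site = item.get("site_web")
        if isinstance(site, str) and site.strip():
            kept.append(item)
        else:
            ignored.append(item)
    return kept, ignored


kept, ignored = split_by_website([
    {"nom": "Studio Atlas", "site_web": "https://studio-atlas.fr"},
    {"nom": "Boulangerie Martin"},
    {"nom": "Cafe Inconnu", "site_web": "  "},
])
print(len(kept), len(ignored))  # 1 2
```

The ignored rows can be routed to `legal_data`, which matches by name instead.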

## Outputs

Each input item is returned with three new fields. All other input fields are passed through unchanged.

| Column              | Type   | Description                                                                 |
| ------------------- | ------ | --------------------------------------------------------------------------- |
| `siren`             | string | 9-digit business identifier, Luhn-validated. Empty if not found.            |
| `siret`             | string | 14-digit establishment identifier, Luhn-validated. Empty if not found.      |
| `siret_source_url`  | string | URL of the page where the identifier was extracted. Empty if not found.    |

When only a SIREN is detected, `siret` is left empty and downstream modules can still operate on the SIREN alone. Additional company attributes — legal form, RCS number, NAF code, headcount, directors, financials — live in `legal_data`.
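The Luhn validation mentioned in the table can be reproduced locally, for instance to re-check identifiers coming from another source. This is a sketch of the standard Luhn algorithm combined with the length rules (9 digits for SIREN, 14 for SIRET); it does not claim to mirror the module's internal candidate detection.

```python
def luhn_ok(digits: str) -> bool:
    """Return True if an ASCII digit string passes the Luhn checksum."""
    if not (digits.isascii() and digits.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit and double every second one.
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def is_siren(candidate: str) -> bool:
    return len(candidate) == 9 and luhn_ok(candidate)


def is_siret(candidate: str) -> bool:
    return len(candidate) == 14 and luhn_ok(candidate)


print(is_siren("552120222"))  # True
print(is_siren("552120223"))  # False
```

A passing checksum only shows that the digits are internally consistent; it does not prove the identifier exists in the registry.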

## Lifecycle

Standard job lifecycle — see [Jobs lifecycle](/docs/concepts/jobs-lifecycle). Failures on individual items do not stop the job: the corresponding output row simply has empty `siren`/`siret` fields, and the `errors` counter tracks items finished without an identifier.

## Pipeline

```
needs:    poi_list
produces: enriched_list
```

`legal_ids` is positioned in the `enrich` category. Typical chains:

- `scrap` → `legal_ids` → `legal_data`
- `scrap` → `legal_ids` → `legal_mentions`

## Endpoints

### Create a job

```
POST /api/jobs/legal-ids
```

Body:

| Field            | Type                | Required | Description                                                              |
| ---------------- | ------------------- | -------- | ------------------------------------------------------------------------ |
| `items`          | array of POI items  | yes      | Input list. Each item must carry `site_web`.                             |
| `source_job_id`  | string (UUID)       | no       | When chaining from a previous job, the upstream job ID for traceability. |

Response: the standard `Job` object, including `id`, `status`, `output_filename`, and the cost estimate in equivalent-France units.

The created job is then driven by the usual endpoints (`GET /api/jobs/{id}`, `GET /api/jobs/{id}/download`, etc.), shared by every module.

Global quotas and per-job ceilings: see [Limits](/docs/concepts/limits).

## Errors

| Situation                            | Behavior                                                                 |
| ------------------------------------ | ------------------------------------------------------------------------ |
| Item has no `site_web`               | Dropped before the job starts. If no item remains, the job is rejected.  |
| Site reachable, no identifier found  | Item finishes with empty `siren`/`siret`. Counted in the error tally.    |
| POI not registered (no SIREN/SIRET)  | Same as above: no identifier is invented; outputs stay empty.            |
| Foreign business                     | Same as above: only French identifiers are recognized.                   |
| Candidate fails the Luhn check       | Discarded silently; no false positive is written to the output.          |
| No item with a website               | Job fails with an explicit message — e.g. `0 of 120 rows have a website (your list has names, Google Maps links and phones only)` — and recommends running [`legal_data`](/docs/modules/legal_data) to get the SIRET/SIREN by name instead. |

## What's next

- [Company data](/docs/modules/legal_data) — given a SIREN, fetches the full official record: legal form, RCS number, NAF code, headcount, directors, financials.
- [Legal mentions](/docs/modules/legal_mentions) — extracts the legal notice block (publisher, host, contact) from the same sites.


<!-- doc: modules/legal_mentions -->

---
title: Legal mentions
slug: modules/legal_mentions
section: Modules
---

# Legal mentions

The `legal_mentions` module locates the legal-mentions page (also known as *mentions légales* or *Impressum*) on each POI's website and extracts its structured contents. It runs as an enrichment step on top of an existing POI list and returns one row per input site, whether or not a legal page was found.

## Purpose

Most B2B due-diligence workflows hinge on facts only published on a company's own website: registered company name, director, share capital, hosting provider. The module surfaces those facts at list scale, so a downstream campaign or audit can filter, segment, and cross-reference without manual visits.

Typical uses:

- Qualifying a prospect list by company size proxies (capital, legal form).
- Matching a trade name against the registered entity before outreach.
- Building a directors index for personalized messaging.
- Auditing host providers across a sector.

The module never invents values. Fields left empty mean the information could not be located on the target page.

## Inputs

The job consumes a list of POI items. Each item must carry a website URL; items without one are dropped during validation.

| Field           | Type   | Required | Notes                                                |
| --------------- | ------ | -------- | ---------------------------------------------------- |
| `site_web`      | string | yes      | Root URL of the establishment's website.             |
| `name`          | string | no       | Carried through to the output for joining.           |
| `source_job_id` | string | no       | ID of an upstream `scrap` job to inherit items from. |

Submit between 1 and 10,000 items per job. Items are normalized and deduplicated before execution.
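The exact normalization rules are not documented here. To estimate how many distinct sites a list contains before submitting it, one plausible approximation is to key each item on its lowercase host without a leading `www.`. The helpers below are hypothetical and may not match the module's own rules.

```python
from urllib.parse import urlsplit


def site_key(site_web: str) -> str:
    """Reduce a URL or bare domain to a lowercase host without `www.`."""
    raw = site_web.strip()
    if "//" not in raw:
        raw = "//" + raw  # lets urlsplit treat a bare domain as a host
    host = (urlsplit(raw).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def dedupe_by_site(items: list[dict]) -> list[dict]:
    """Keep the first item seen for each site key; skip items with no host."""
    seen: set[str] = set()
    unique = []
    for item in items:
        key = site_key(item.get("site_web") or "")
        if key and key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


items = [
    {"site_web": "https://www.studio-atlas.fr"},
    {"site_web": "studio-atlas.fr/mentions-legales"},
    {"site_web": "http://Example.fr"},
]
print(len(dedupe_by_site(items)))  # 2
```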

## Outputs

One row is produced per input site. Columns:

| Column            | Type   | Description                                                              |
| ----------------- | ------ | ------------------------------------------------------------------------ |
| `raison_sociale`  | string | Registered company name as it appears on the legal-mentions page.        |
| `forme_juridique` | string | Legal form (SAS, SARL, SA, EI, etc.).                                    |
| `capital_social`  | string | Declared share capital, in the currency given on the page.               |
| `rcs`             | string | RCS registration entry (city + identifier).                              |
| `adresse_postale` | string | Postal address of the registered office.                                 |
| `dirigeant`       | string | Publication director or legal representative, when stated.               |
| `tva_intracom`    | string | Intra-community VAT number, validated against the FR format when present. |

Empty cells indicate the field was not present on the parsed page — the module never invents values. The output is delivered as a CSV alongside the input columns.
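For readers who want to re-check `tva_intracom` values themselves: a French VAT number is commonly written as `FR`, a two-character key, then the 9-digit SIREN, and an all-digit key can be derived from the SIREN. The sketch below implements that public rule. It is an assumption that this is what "validated against the FR format" covers, and keys containing letters are not handled.

```python
import re


def fr_vat_key(siren: str) -> str:
    """Two-digit VAT key derived from a 9-digit SIREN."""
    if not re.fullmatch(r"[0-9]{9}", siren):
        raise ValueError("SIREN must be exactly 9 digits")
    return f"{(12 + 3 * (int(siren) % 97)) % 97:02d}"


def is_fr_vat(value: str) -> bool:
    """Check the FR + key + SIREN shape and the numeric key."""
    compact = re.sub(r"[\s.]", "", value).upper()
    match = re.fullmatch(r"FR([0-9]{2})([0-9]{9})", compact)
    return bool(match) and match.group(1) == fr_vat_key(match.group(2))


print(fr_vat_key("552120222"))         # 27
print(is_fr_vat("FR 27 552 120 222"))  # True
```

The same rule allows a cross-check between modules: the last nine digits of a valid `tva_intracom` should equal the `siren` returned by `legal_ids` or `legal_data`.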

## Lifecycle

Standard job lifecycle — see [Jobs lifecycle](/docs/concepts/jobs-lifecycle). Progress is reported per site; partial output is preserved if the job is canceled or fails mid-run.

## Pipeline

The module is an enrichment step. It plugs into the standard list pipeline:

```yaml
needs: [site_web]
produces: [raison_sociale, forme_juridique, capital_social, rcs, adresse_postale, dirigeant, tva_intracom]
```

A typical chain looks like `scrap` → `legal_mentions` → `legal_data`. The source can be selected by uploading items directly or by referencing a recent `scrap` job through `source_job_id`.

## Endpoints

All endpoints require an authenticated, active user.

| Method | Path                          | Body                              | Returns      |
| ------ | ----------------------------- | --------------------------------- | ------------ |
| POST   | `/api/jobs/legal-mentions`    | `{ items: [...], source_job_id }` | `JobPublic`  |
| GET    | `/api/jobs/{id}`              | —                                 | `JobPublic`  |
| GET    | `/api/jobs/{id}/output`       | —                                 | CSV stream   |
| POST   | `/api/jobs/{id}/cancel`       | —                                 | `JobPublic`  |

The create endpoint validates quota up front and returns `400` with a descriptive message if validation fails. Items per job: 1 to 10,000. Global quotas: see [Limits](/docs/concepts/limits).

## Errors

Two outcomes are surfaced as empty rows rather than job failures, because they are expected at list scale:

| Condition                | Behavior                                                           |
| ------------------------ | ------------------------------------------------------------------ |
| No legal page found      | Row returned with every legal field blank.                         |
| Website down             | Row returned with all fields blank; the site is marked unreachable.|

Job-level failures (`status = failed`) are reserved for non-recoverable conditions such as invalid input or quota errors. The error message is exposed on the job record.

## What's next

- [legal_ids](/docs/modules/legal_ids) — detect SIREN/SIRET from the same website set.
- [legal_data](/docs/modules/legal_data) — enrich each identifier with official company data.


<!-- doc: modules/on-demand/email_campaign -->

---
title: Email campaigns
slug: modules/on-demand/email_campaign
section: Modules · on-demand
---

# Email campaigns

Cold-email outreach, set up and operated by the outsend team on your behalf. This module is **on-demand**: there is no self-serve endpoint and no "Run" button. You ask, we build, you receive.

## Purpose

Email campaigns turn your outsend prospect data into sequenced outreach that actually lands in the inbox. The goal is simple: get your offer in front of qualified prospects, track who reads and replies, and feed the answers back into your pipeline — without you touching deliverability plumbing.

You bring the audience (already living in outsend after scraping, enrichment and qualification) and the message intent. We bring the sending infrastructure, the warmup discipline, and the sequence logic.

## How to request it

There is **no public API** to start an email campaign. The module is delivered through a conversation with the team.

Two equivalent ways to open that conversation:

- **From the dashboard.** Open the Email campaigns module card. The primary CTA does not launch a job — it opens a feedback thread with topic `on_demand_email`, pre-filled with the questions we need answered to scope the work.
- **Programmatically.** `POST /api/feedback/threads` with `topic: "on_demand_email"` and your initial message in the body. The team is notified and replies in the same thread, viewable from your dashboard.

Either path lands in the same place: a thread where we agree on volume, sending domain, sequence shape, and timing before any email goes out.
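For the programmatic path, the request could be assembled as follows. This is a sketch under stated assumptions: the host is a placeholder, the `message` field name is a guess (the page only says the initial message goes in the body), and authentication is omitted because its scheme is not described here. The request is built but not sent.

```python
import json
import urllib.request

BASE_URL = "https://outsend.example"  # placeholder host, replace with your own


def feedback_thread_request(message: str, topic: str = "on_demand_email") -> urllib.request.Request:
    """Build (without sending) the POST that opens a feedback thread."""
    # `message` is a hypothetical field name; only `topic` is documented.
    body = json.dumps({"topic": topic, "message": message}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/feedback/threads",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = feedback_thread_request("About 3,000 sends per month, one sequence.")
print(request.get_method(), request.full_url)
```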

## What gets delivered

Every engagement covers the same core capabilities, tuned to your account:

- Configuration of your sending domain, including warmup before any campaign starts so messages don't land in spam on day one.
- Sequence build-out — first-touch email plus spaced follow-ups, written with you, reviewed before send.
- **Per-prospect personalisation** drawn from the data you already have in outsend (company, role, signals, whatever your enrichment captured).
- Open and click tracking on every message.
- Automatic reply detection that pauses follow-ups the moment a prospect responds, so nobody gets chased after they've answered.
- Detailed reporting on open rate, reply rate, and downstream conversion, surfaced back into your dashboard.

The deliverables list is the contract. If something isn't in it, ask in the thread — we'll either fold it in or tell you why we won't.

## Why on-demand

Cold email at any meaningful volume is a deliverability problem dressed up as a copywriting problem. Getting it wrong is expensive: a burned domain takes weeks to recover, and the damage is silent — your messages stop landing long before anyone tells you.

The setup is irreducibly per-customer. Domain warmup, SPF, DKIM, DMARC alignment, dedicated IP allocation when volume justifies it, ramp schedules, bounce and complaint thresholds — none of this generalises into a "click here to send" button without handing customers a footgun.

So we don't ship the button. We ship the outcome instead: a configured sending stack and a sequence that's been pressure-tested before the first prospect sees it.

## Pricing

Bespoke. Pricing depends on monthly send volume, number of sequences, whether you bring an existing domain or need one provisioned and warmed from scratch, and the depth of personalisation. There is no listed rate card because there is no listed configuration.

Open the thread, share your expected volume and sequence shape, and you'll get a quote in the same conversation.

## What's next

- [SMS campaigns](/docs/modules/on-demand/sms_campaign) — same on-demand model, mobile channel.
- [WhatsApp campaigns](/docs/modules/on-demand/whatsapp_campaign) — same on-demand model, conversational channel.


<!-- doc: modules/on-demand/phone_carrier -->

---
title: Phone carrier detection
slug: modules/on-demand/phone_carrier
section: Modules · on-demand
summary: Identify the real carrier behind every mobile number — including portability — for SMS routing, dedup and segmentation.
---

# Phone carrier detection

Identify the carrier of a list of mobile numbers — Orange, SFR, Bouygues, Free, MVNOs — accounting for portability. Useful before sending SMS at scale, deduping a CRM, or segmenting by carrier.

## Two ways to use it

- **Public lookup tool** — instant, free, no signup, no batch limit: [`/en/which-carrier`](/en/which-carrier). Returns the original ARCEP-allocated carrier per number plus name, SIRET, head office and registration date. See [Operator lookup](/docs/modules/operator_lookup).
- **Live-portability batch** — this module. Adds the *current* carrier after portability for a list of numbers, with caching. On-demand during alpha.

## Inputs

| Field | Type | Required | Notes |
|---|---|---|---|
| `items` | array of objects | yes | Each item must include at least `phone` (E.164 or French 10-digit). 1 to 10,000 items per batch. |
| `source_job_id` | string | no | ID of an upstream job for lineage. |
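Since `phone` is accepted in either E.164 or French 10-digit form, a list can be normalized to a single form beforehand so that duplicates collapse. The sketch below assumes metropolitan numbering (`+33`); overseas ranges, which use other country codes, are not handled, and the helper name is hypothetical.

```python
import re
from typing import Optional


def to_e164_fr(phone: str) -> Optional[str]:
    """Normalize a French number to +33XXXXXXXXX, or return None if it is not one."""
    compact = re.sub(r"[\s.\-()]", "", phone)
    if compact.startswith("0033"):
        compact = "+33" + compact[4:]
    if re.fullmatch(r"0[1-9][0-9]{8}", compact):
        return "+33" + compact[1:]  # drop the national trunk prefix 0
    if re.fullmatch(r"\+33[1-9][0-9]{8}", compact):
        return compact
    return None


print(to_e164_fr("06 12 34 56 78"))    # +33612345678
print(to_e164_fr("+44 20 7946 0958"))  # None
```

Numbers that come back as `None` here are the ones likely to be reported as `not_french_number`.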

## Outputs

Each input row is returned with the carrier breakdown appended.

| Column | Type | Description |
|---|---|---|
| `carrier_name` | string | Current carrier after portability (commercial name). |
| `carrier_original` | string | Carrier originally allocated the range by ARCEP. |
| `is_ported` | bool | `true` if the current carrier differs from the original. |
| `is_reachable` | bool | Whether the number is currently active on the network. |
| `line_type` | string | `mobile`, `fixed_line`, `fixed_or_mobile`, or `other`. |

A persistent cache means re-running the same number is free for you — only first-time lookups consume a slot.

## Lifecycle

Standard job lifecycle — see [Jobs lifecycle](/docs/concepts/jobs-lifecycle).

## Limits

- **On-demand during alpha** — no self-service create endpoint yet; we wire it for you. Use the public ARCEP-only tool for unlimited self-service lookups.
- French numbers only. International numbers return `error: "not_french_number"` per row.
- Live enrichment is best-effort. When unavailable, the row falls back to the ARCEP-allocated carrier.

## Why it's on-demand

We tune throughput and concurrency per customer based on volume. Reach out via [Contact](/contact) and we set the right cadence for you.


<!-- doc: modules/on-demand/sms_campaign -->

---
title: SMS campaigns
slug: modules/on-demand/sms_campaign
section: Modules · on-demand
---

# SMS campaigns

Reach your prospects where they actually read: their mobile inbox. Cold SMS outreach is run end-to-end by the outsend team on your behalf, against a list you bring or one we help you assemble.

Mobile messaging routinely posts open rates above 95%, which makes it one of the highest-signal channels in B2B outbound — provided the routing, sender identity and consent layer are handled correctly. That is exactly what this module covers.

## Purpose

The `sms_campaign` module turns a verified contact list into a delivered, compliant SMS campaign with replies routed back into your outsend dashboard.

In practice, it gives you:

- A managed outbound channel that does not require you to sign carrier contracts, register a sender brand, or operate an SMS gateway.
- A reply-aware setup: prospects can answer, and their responses land where the rest of your outsend activity already lives.
- A compliance posture aligned with B2B prospecting rules in your target market, including opt-out handling.

It is designed for teams that already have a clear use case — appointment reminders, time-sensitive offers, follow-up sequences on stalled leads — and want a delivery channel they do not have to build themselves.

## How to request it

There is no self-serve endpoint for this module. SMS campaigns are activated case by case, because each setup depends on volume, sender identity and message type.

To start the conversation:

1. Open the in-app feedback chat from your outsend dashboard.
2. Select the topic **`on_demand_sms`**. A pre-filled brief will help you describe your use case.
3. Share the basics — estimated monthly volume, message type (reminder, promo, follow-up), target audience (B2B or B2C), and the sender identity you would like prospects to see.

The outsend team takes it from there: scoping, routing, sender registration, and a delivery plan tailored to your campaign. You stay in the loop through the same feedback thread until the campaign is live.

## What gets delivered

Each engagement is shaped around your brief, but every SMS campaign run through outsend includes the same operational baseline:

- **Pre-flight number verification.** Numbers are checked upstream so your sends do not vanish into dead or invalid lines.
- **Consent and opt-out handling.** Built-in management of prospect consent and unsubscribe requests, applied automatically across your list.
- **Per-contact personalization.** First name, last name, company and other fields are merged into each message, so the send reads as one-to-one rather than broadcast.
- **Frequency caps per prospect.** Anti-spam guardrails prevent the same contact from being touched more often than you want.
- **Replies routed to your dashboard.** Inbound responses appear inside outsend, next to the lead they came from, so your team can act on them without context-switching.
- **Deliverability reporting.** Per-campaign reporting on delivery outcomes and opt-outs, available once the send completes.

You receive the campaign as an operated service: the outsend team configures the run, executes it, and hands you results inside the product you already use.

## Why on-demand

B2B SMS outreach is not a simple API call. To deliver at meaningful volume without burning your sender reputation, three things must line up:

- **Licensed carrier routing.** Outbound SMS in regulated markets flows through accredited carrier partners, not generic gateways.
- **Validated sender identity.** Short codes and alphanumeric sender IDs are registered and approved by the relevant authority before any traffic flows.
- **Volume-based billing.** Pricing scales with throughput, sender type and destination — there is no one-size-fits-all unit price.

The outsend team handles carrier relationships, sender registration, consent posture and per-volume billing on your behalf. That is why this module is on-demand: the setup work happens once, against your specific use case, and then your campaigns ride on top of it.

## Pricing

Bespoke. Pricing depends on your monthly volume, sender configuration, message type and destination mix. After the initial feedback thread, the outsend team comes back with a quote that reflects your actual setup — no list price, no surprise overage.

## What's next

If SMS is part of a broader multichannel push, two related on-demand modules pair naturally with this one:

- **[Email campaigns](/docs/modules/on-demand/email_campaign)** — managed cold email sequences with deliverability and reply handling baked in.
- **[WhatsApp campaigns](/docs/modules/on-demand/whatsapp_campaign)** — official Meta Business setup with pre-approved templates and two-way conversations.

Mention any of these in your feedback thread and the outsend team will scope them together, so your channels stay coordinated from day one.


<!-- doc: modules/on-demand/whatsapp_campaign -->

---
title: WhatsApp campaigns
slug: modules/on-demand/whatsapp_campaign
section: Modules · on-demand
---

# WhatsApp campaigns

WhatsApp is the most engaging channel in your stack: ~80% open rate and reply rates roughly 5x what email gets. The `whatsapp_campaign` module is an on-demand offering run end-to-end by the outsend team — you bring the leads and the message intent, we handle the Meta plumbing.

There is no self-serve dashboard for this module. Every campaign goes through a short conversation so we can size volume, validate templates with Meta, and wire the WhatsApp Business number to your account before the first send.

## Purpose

Use `whatsapp_campaign` when:

- You have a scraped list of mobile numbers in outsend and want to reach the prospects on the channel they actually read.
- Email open rates have plateaued and you want a higher-signal touchpoint for warm leads, booking reminders, or follow-ups on quotes.
- You need bi-directional conversations — prospects reply, your team picks up the thread, and manual follow-ups remain possible.
- You want personalization per prospect (first name, business name, custom merge fields) using your own tone of voice.

It is not the right tool for cold blasts to unknown numbers: WhatsApp Business policies require an opt-in basis and template-based first contact. The outsend team will help you find a compliant angle before sending anything.

## How to request it

1. Open the outsend dashboard.
2. Find the WhatsApp campaign card in the on-demand modules section.
3. Click the CTA — it opens a feedback chat thread with topic `on_demand_whatsapp`. No backend job endpoint is involved.
4. Send the pre-filled message and complete the four details we need:
   - Estimated monthly volume.
   - Message type (appointment, follow-up, lead-gen, etc.).
   - WhatsApp Business number if you already have one — otherwise we provision it.
   - Draft templates you have in mind (rough French or English copy is fine, we shape them for Meta validation).

A team member picks up the thread within one business day and walks you through setup. You can keep iterating in the same conversation until the first batch is live.

## What gets delivered

Once your account is activated, every campaign run includes:

- An official **WhatsApp Business account** wired through the Meta API in your name, not a grey-area number pool.
- **Meta-validated templates** — we draft, submit, and iterate with Meta until your message types are approved. No risk of mid-campaign blocking.
- A **pre-flight WhatsApp check** on every number in your list: prospects who don't have WhatsApp are filtered out before send, so you don't burn quota.
- **Bi-directional conversations**: replies come back into a shared inbox view, with manual follow-ups handled by your team or ours.
- **Per-prospect personalization** using fields from your outsend leads (business name, owner first name, city, custom tags).
- A **delivery report** at the end of each batch: sent, delivered, read, replied — with per-prospect status so you can route hot leads back into a regular outsend pipeline.

Sends happen in waves we agree on together, never as a single burst, to keep your Meta reputation clean.

## Why on-demand (WhatsApp policy compliance)

WhatsApp Business is the strictest outbound channel outsend touches. The Meta rulebook requires:

- A verified business account tied to a real legal entity (not a personal number).
- Each outbound template reviewed and approved by Meta before it can be sent at scale.
- A documented opt-in basis for the recipient list — cold lists scraped from Google Maps are not automatically compliant.
- Hard rate limits per business tier, with permanent bans for accounts that trigger too many user reports.

Getting any one of those wrong shuts down the account, and Meta does not usually reverse the decision. So instead of shipping a self-serve form that would brick most users on their first send, we keep `whatsapp_campaign` as an on-demand module: the outsend team creates the business account in your name, walks the templates through Meta validation, sets the rate limits to safe values for your tier, and reviews the audience before each wave goes out.

You stay the owner of the account and the data — we just operate the channel for you until you have the volume and the templates stable enough to take it over.

## Pricing

`whatsapp_campaign` is billed on a per-campaign basis, scoped during the intake conversation. The quote depends on three drivers:

- Monthly conversation volume (Meta charges per 24h conversation window opened, with different rates for marketing, utility, and authentication categories).
- Number of templates that need Meta validation in the initial setup.
- Whether outsend provisions the WhatsApp Business number or you bring an existing one.

Setup is one-off; subsequent waves reuse the validated templates and the same account, so cost per send drops sharply after the first campaign. Meta platform fees are passed through at cost — no markup on conversation pricing. You get a written estimate before any work starts in the feedback thread, and nothing is billed until you confirm.

## What's next

- [Email campaigns](/docs/modules/on-demand/email_campaign) — the warmest outbound channel for longer-form pitches, also run by the outsend team.
- [SMS campaigns](/docs/modules/on-demand/sms_campaign) — for short, transactional touches when WhatsApp coverage on your list is thin.


<!-- doc: modules/pagespeed -->

---
title: PageSpeed
slug: modules/pagespeed
section: Modules
---

# PageSpeed

Run an official Google PageSpeed Insights audit on every prospect's website. The output ranks the list by site quality, exposes the weakest performers, and surfaces an angle worth pitching: a slow or broken site is a concrete improvement to offer.

The module never crawls a site directly. It delegates the measurement to Google, which keeps the signal comparable across prospects.

## Inputs

The job accepts an array of POI dictionaries. The only field actually used is `site_web`; any other field is carried through to the output unchanged.

| Field | Type | Required | Notes |
|---|---|---|---|
| `items` | `list[dict]` | yes | 1 to 10,000 POIs. Rows without `site_web` are kept and marked. |
| `source_job_id` | `string` | no | UUID of the upstream job (typically a Google Maps scrap). |

```json
{
  "nom": "Studio Atlas",
  "adresse": "12 rue de Rivoli, 75001 Paris",
  "site_web": "https://studio-atlas.fr"
}
```

## Outputs

One row per input POI. Empty cells signal that PSI could not score the URL (see Errors).

| Column | Type | Description |
|---|---|---|
| `perf_score_mobile` | int 0-100 | Lighthouse performance score, mobile profile. |
| `perf_score_desktop` | int 0-100 | Lighthouse performance score, desktop profile. |
| `lcp_ms` | int | Largest Contentful Paint in milliseconds. Good < 2500. |
| `cls` | float | Cumulative Layout Shift. Good < 0.1. |
| `accessibility_score` | int 0-100 | Lighthouse accessibility score. |
| `seo_score` | int 0-100 | Lighthouse SEO score. |
| `suggestions[]` | list[string] | Top Lighthouse audits flagged as failing, ready to quote in an outreach message. |

The CSV export also carries `best_practices_score`, `fcp_ms`, `tbt_ms`, and `inp_ms` for completeness.
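
As a minimal sketch of how the ranking described above could be reproduced client-side, the snippet below sorts result rows by `perf_score_mobile`, weakest first, and keeps unscored rows at the end. It assumes the rows are already parsed into dictionaries and that an empty cell arrives as `None` or `""`; the helper name `rank_weakest_first` is hypothetical and not part of the outsend API.

```python
from typing import Any


def rank_weakest_first(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Sort rows by mobile performance score, lowest first; unscored rows last."""

    def sort_key(row: dict[str, Any]) -> tuple[int, int]:
        score = row.get("perf_score_mobile")
        if score in (None, ""):
            # Assumption: an empty cell means PSI could not score the URL.
            return (1, 0)
        return (0, int(score))

    return sorted(rows, key=sort_key)


rows = [
    {"nom": "A", "perf_score_mobile": 82},
    {"nom": "B", "perf_score_mobile": ""},
    {"nom": "C", "perf_score_mobile": 31},
]
ranked = rank_weakest_first(rows)
print([r["nom"] for r in ranked])
```

Keeping unscored rows in the list, rather than dropping them, mirrors the "one row per input POI" behaviour of the module.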

## Lifecycle

Standard outsend job lifecycle; see [/docs/concepts/jobs-lifecycle](/docs/concepts/jobs-lifecycle). Progress is reported in `sites`; result count in `audits`.

## Pipeline

```
needs:     site_web
produces:  perf_score, accessibility_score, seo_score,
           best_practices_score, lcp, cls, inp
```

PageSpeed is a `check` module: it consumes a list and returns the same list enriched. Typical pattern: start from a `scrap` job, then run `pagespeed` against the resulting POIs.

## Endpoints

### Create a job

```
POST /api/jobs/pagespeed
Content-Type: application/json
```

```json
{
  "items": [
    { "nom": "Studio Atlas", "site_web": "https://studio-atlas.fr" }
  ],
  "source_job_id": "8b1f...-uuid"
}
```

Response: a `JobPublic` envelope carrying the job UUID, its status, and the EF cost reserved for tracking.
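
The request above might be assembled as follows. This is a sketch only: it builds the request object without sending it, `BASE_URL` is a placeholder host, and authentication is left out because this page does not describe it. The helper name `build_pagespeed_request` is hypothetical.

```python
import json
import urllib.request

# Assumption: placeholder host; replace with the base URL of your account.
BASE_URL = "https://outsend.example"


def build_pagespeed_request(items, source_job_id=None):
    """Build (but do not send) the POST request that creates a PageSpeed job."""
    if not items:
        # The Errors table states that an empty `items` yields 400 Bad Request.
        raise ValueError("items must not be empty")
    payload = {"items": items}
    if source_job_id is not None:
        payload["source_job_id"] = source_job_id
    return urllib.request.Request(
        url=f"{BASE_URL}/api/jobs/pagespeed",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_pagespeed_request(
    [{"nom": "Studio Atlas", "site_web": "https://studio-atlas.fr"}]
)
print(req.get_method(), req.full_url)
```

Passing the object to `urllib.request.urlopen` would send it; any credentials your account requires would have to be added to the headers first.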

### Read a job

```
GET  /api/jobs/{job_id}
GET  /api/jobs/{job_id}/results
GET  /api/jobs/{job_id}/export.csv
```

The CSV export is available once the job is `completed`.

## Limits

See [/docs/concepts/limits](/docs/concepts/limits). Upstream is the Google PSI v5 free tier (one URL per request); bursts above the per-key quota are retried inside the job rather than failed.

## Errors

| Condition | Behaviour |
|---|---|
| POI has no `site_web` | Row kept, audit columns empty, `suggestions[]` set to a single explanatory entry. |
| URL unreachable or returns a non-HTML page | Row kept, scores empty, error noted on the row. |
| PSI quota exhausted | Affected URLs are re-queued inside the job; if the quota stays exhausted the row is marked failed and the rest continues. |
| Site blocks the PSI fetcher | Row marked failed with the upstream error code; the job itself stays healthy. |
| `items` empty | `400 Bad Request`, no job created. |
| Missing `PAGESPEED_API_KEY` server-side | Job created but moves straight to `failed`. |

A failed row never blocks the job: the contract is "every input POI gets one output row, scored or explained".

## What's next

- [`techstack`](/docs/modules/techstack) — detect the CMS, analytics, and frameworks behind each site. A weak PageSpeed score combined with a known-slow stack is a sharper angle than either signal alone.
- [`ads_intelligence`](/docs/modules/ads_intelligence) — check whether the prospect is actively spending on ads. Paying for traffic that lands on a slow page is the highest-conviction pitch this module enables.


<!-- doc: modules/phone_info -->

---
title: Phone info
slug: modules/phone_info
section: Modules
summary: Enrich a list of phone numbers with carrier, line type, portability and operator metadata — from the official ARCEP registry plus a live-portability check.
---

# Phone info

The `phone_info` module takes a list of phone numbers and returns one enriched row per number. For each French number it identifies the carrier to which **ARCEP** originally allocated the range (Orange, SFR, Bouygues, Free, MVNOs), plus the carrier's commercial name, SIRET, head office, RCS, and ARCEP registration date. When available it also returns the current carrier after portability and a reachability flag.

This is a pipelinable enrichment module: chain it after a discovery job, an import, or any list that carries a phone column.

## Inputs

A list of items, each carrying at least one phone field. Other columns are preserved as-is.

| Field | Required | Notes |
|---|---|---|
| `phone` (or `telephone`, `phone_number`, `numero`, `number`) | yes | French or international number. Accepts spaces, a leading `+` or `00`, and bare 10-digit numbers. |
| `nom` (or `name`) | no | Surfaced in the output for display. |
| any other column | no | Preserved unchanged. |

Batch size: 1 to 10 000 items per job.

Job-level options:

| Option | Default | Notes |
|---|---|---|
| `live_mode` | `"cache_only"` | `"cache_only"` reads only what's already in the live-portability cache. `"with_live"` attempts a live check for French mobile numbers not yet in cache. |
| `source_job_id` | none | ID of an upstream job for lineage in pipelines. |
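
For lists longer than the 10 000-item ceiling, one possible approach is to split the input into several job bodies client-side. The sketch below assumes that submitting several independent jobs is acceptable for your workflow; the helper name `build_job_bodies` is hypothetical.

```python
from typing import Iterator

MAX_ITEMS_PER_JOB = 10_000  # Upper bound stated for the batch size.
LIVE_MODES = ("cache_only", "with_live")


def build_job_bodies(items: list[dict], live_mode: str = "cache_only") -> Iterator[dict]:
    """Yield one request body per slice of at most MAX_ITEMS_PER_JOB items."""
    if live_mode not in LIVE_MODES:
        raise ValueError(f"live_mode must be one of {LIVE_MODES}")
    for start in range(0, len(items), MAX_ITEMS_PER_JOB):
        yield {
            "items": items[start:start + MAX_ITEMS_PER_JOB],
            "live_mode": live_mode,
        }


# Illustrative input only: 25,000 synthetic numbers.
numbers = [{"phone": f"+336{n:08d}"} for n in range(25_000)]
bodies = list(build_job_bodies(numbers))
print([len(body["items"]) for body in bodies])
```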

## Outputs

Each input row is returned with the columns below appended.

| Column | Type | Description |
|---|---|---|
| `phone_e164` | string | Canonical international format. |
| `phone_national` | string | Pretty national format. |
| `phone_country` | string | ISO 3166-1 alpha-2. |
| `phone_line_type` | string | `mobile`, `fixed_line`, `fixed_or_mobile`, `other`. |
| `phone_carrier_original` | string | Carrier to which ARCEP originally allocated the range. |
| `phone_operator_siret` | string | SIRET of the original carrier. |
| `phone_operator_rcs` | string | RCS city. |
| `phone_operator_address` | string | Head office address. |
| `phone_operator_registered_since` | string | Date the carrier was registered with ARCEP. |
| `phone_tranche_attribution_date` | string | Day the range was allocated. |
| `phone_territory` | string | Mainland, Réunion, Mayotte, etc. |
| `phone_carrier_current` | string | Carrier currently attached to the number (post-portability) when available. |
| `phone_is_ported` | string | `yes`, `no`, `unknown`. |
| `phone_is_reachable` | string | `yes`, `no`, `unknown`. |
| `phone_is_valid` | string | `yes`, `no`. |
| `phone_status` | string | `ok`, `not_french`, `invalid`, `no_match`. |
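
Because the flag columns are tri-state strings rather than booleans, a consumer may want to map them explicitly. The sketch below, with hypothetical helper names, converts `yes` / `no` / `unknown` to `True` / `False` / `None` and keeps only mobile rows that are confirmed as ported.

```python
from typing import Optional


def tri_state(value: Optional[str]) -> Optional[bool]:
    """Map 'yes'/'no' to True/False; anything else (including 'unknown') to None."""
    return {"yes": True, "no": False}.get(value)


def ported_mobiles(rows: list[dict]) -> list[dict]:
    """Keep rows that resolved cleanly, are mobile lines, and are known to be ported."""
    return [
        row
        for row in rows
        if row.get("phone_status") == "ok"
        and row.get("phone_line_type") == "mobile"
        and tri_state(row.get("phone_is_ported")) is True
    ]


rows = [
    {"phone_e164": "+33612345678", "phone_status": "ok",
     "phone_line_type": "mobile", "phone_is_ported": "yes"},
    {"phone_e164": "+33142868828", "phone_status": "ok",
     "phone_line_type": "fixed_line", "phone_is_ported": "no"},
    {"phone_e164": "+33698765432", "phone_status": "ok",
     "phone_line_type": "mobile", "phone_is_ported": "unknown"},
]
print([row["phone_e164"] for row in ported_mobiles(rows)])
```

Treating `unknown` as `None` instead of `False` avoids silently classifying unchecked numbers as not ported.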

## Pipeline

```yaml
needs:    telephone
produces: phone_e164, phone_carrier_original, phone_carrier_current, phone_is_ported, ...
```

Typical chains:

```
scrap → phone_info → filter (by carrier) → sms_campaign
import → phone_info → emails enrichment …
```

## Endpoints

### `POST /api/jobs/phone-info`

Create a new phone-info job from a list of items.

**Body**

| Field | Type | Required | Description |
|---|---|---|---|
| `items` | array<object> | yes | 1 to 10 000 entries. Each must carry a phone field. |
| `live_mode` | string | no | `"cache_only"` (default) or `"with_live"`. |
| `source_job_id` | string (uuid) | no | Upstream job ID when chaining. |

**Response**

Standard `JobPublic` envelope.

**Example**

```http
POST /api/jobs/phone-info
Content-Type: application/json

{
  "items": [
    { "nom": "Acme",  "phone": "+33612345678" },
    { "nom": "Beta",  "phone": "0142868828"  }
  ],
  "live_mode": "cache_only"
}
```

## Data sources

- **MAJNUM** — ranges allocated to each carrier ([Open data ARCEP](https://www.data.gouv.fr/datasets/ressources-en-numerotation-telephonique))
- **Identifiants CE** — commercial name, SIRET, RCS, head office, and declaration date ([Open data ARCEP](https://www.data.gouv.fr/datasets/identifiants-de-communications-electroniques))
- A weekly cron refreshes the local copy after each ARCEP board meeting.


<!-- doc: modules/phones_extra -->

---
title: Extra phones
slug: modules/phones_extra
section: Modules
---

# Extra phones

The `phones_extra` module digs past the single switchboard line returned by a typical map listing and surfaces the additional voice channels a business exposes on its own site: direct lines, mobile numbers, sales desks, support hotlines, and lingering fax numbers.

## Purpose

Discovery sources publish one canonical phone per location. Real organisations publish several across homepage, landing pages, team bios, and legal notices. `phones_extra` reads the public-facing pages of each website attached to a POI, extracts every phone-shaped token, validates them through Google's `phonenumbers` library, normalises to E.164, and deduplicates against the primary number already on file.

## Inputs

The module operates on an enriched POI list — typically the result of a prior discovery run.

| Field        | Required | Notes                                               |
|--------------|----------|-----------------------------------------------------|
| `site_web`   | yes      | POIs without a website are filtered out at submit.  |
| `name`       | no       | Used for output labelling and audit trail.          |
| `phone`      | no       | When present, used as the dedupe reference.         |
| `address`    | no       | Used to bias country detection for ambiguous formats. |

Items missing `site_web` are silently dropped; if the filtered list is empty the request is rejected with a validation error.
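
Since the drop is silent, it can be useful to compute the same split before submitting. The following is a sketch under the assumption that a missing key, `None`, or a blank string all count as "no website"; the exact server-side rule is not documented here, and `split_by_site_web` is a hypothetical helper.

```python
def split_by_site_web(items: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (eligible, dropped) so the caller can see what would be filtered out."""
    eligible, dropped = [], []
    for item in items:
        site = (item.get("site_web") or "").strip()
        (eligible if site else dropped).append(item)
    return eligible, dropped


items = [
    {"name": "Acme SAS", "site_web": "https://acme.example"},
    {"name": "No Site Ltd"},
    {"name": "Blank Co", "site_web": "  "},
]
eligible, dropped = split_by_site_web(items)
print(len(eligible), len(dropped))
```

If `eligible` comes back empty, the request would be rejected, so the client can skip the call entirely.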

## Outputs

Each input POI is returned with up to three additional phone fields. Empty strings indicate the module ran but found nothing of that kind.

| Column            | Type   | Description                                        |
|-------------------|--------|----------------------------------------------------|
| `phone_secondary` | string | Additional landline distinct from the primary number, E.164. |
| `mobile`          | string | Mobile line detected via national numbering plan, E.164.     |
| `fax`             | string | Fax number when explicitly labelled on the page, E.164.      |

All numbers are validated and normalised. Anything that fails validation is discarded rather than surfaced as a best-effort guess.

## Lifecycle

Standard job lifecycle — see [Jobs lifecycle](/docs/concepts/jobs-lifecycle). Progress is reported in `sites` and final volume in `numéros`.

## Pipeline

`phones_extra` is an **enrichment** module: it augments an existing POI list rather than generating one.

```yaml
needs:    poi_list
produces: enriched_list
```

Typical chain:

```
discovery → phones_extra → verify_emails → filter → campaign
```

## Endpoints

### `POST /api/jobs/phones-extra`

Create a new extra-phones job from a list of POIs.

**Body**

| Field           | Type           | Required | Description                                    |
|-----------------|----------------|----------|------------------------------------------------|
| `items`         | array<object>  | yes      | POIs to enrich, each with at least `site_web`. |
| `source_job_id` | string (uuid)  | no       | Parent job ID when chaining from a prior run.  |

**Response**

Returns the standard `JobPublic` envelope (`id`, `status`, `job_type`, `output_filename`, quota cost, timestamps).

**Example**

```http
POST /api/jobs/phones-extra
Content-Type: application/json

{
  "source_job_id": "f3c2…",
  "items": [
    { "name": "Acme SAS", "site_web": "https://acme.example", "phone": "+33123456789" },
    { "name": "Beta Co",  "site_web": "https://beta.example" }
  ]
}
```

```json
{
  "id": "9a7b…",
  "status": "pending",
  "job_type": "phones_extra",
  "output_filename": "telephones-extra-2-sites.xlsx"
}
```

Global quotas and per-job ceilings: see [Limits](/docs/concepts/limits).

## Errors

| Condition                                       | Response                                  |
|-------------------------------------------------|-------------------------------------------|
| No item carries a `site_web`                    | `400` — *Aucun établissement avec site web*. |
| Estimated cost above the per-job quota          | `400` — quota exceeded, with the numeric overage. |
| Account quota exhausted                         | `400` — quota check failure before insert. |
| Malformed body (missing `items`, wrong types)   | `422` — request validation error.         |

Runtime errors on individual sites do not abort the job: the affected POI is recorded with empty extra-phone fields and the worker moves on.

## What's next

- [Verify emails](/docs/modules/verify_emails) — pair freshly found direct lines with deliverable inboxes.
- [Filter](/docs/modules/filter) — slice the enriched list by presence of a mobile, by country code, or by any combination of the new fields.


<!-- doc: modules/pricing -->

---
title: Pricing
slug: modules/pricing
section: Modules
---

# Pricing

Extract published rate cards from each point of interest's website. The module walks a cascade of structured sources (JSON-LD, Microdata, Open Graph, dedicated pricing routes, homepage fallbacks) and returns a normalized amount, currency, and billing period per site.

It is an enrichment step, not a discovery step. It expects a prior list of POIs with a resolved `site_web` — typically the output of a `scrap` job.

## Inputs

Items missing `site_web` are silently dropped at validation time.

| Field           | Type   | Required | Notes                                        |
| --------------- | ------ | -------- | -------------------------------------------- |
| `items`         | array  | yes      | 1 to 10,000 points of interest               |
| `items[].site_web` | string | yes  | Absolute URL of the vendor's main site        |
| `source_job_id` | string | no       | ID of the upstream job that produced the list |

## Outputs

Rows without signal are kept and flagged with `price_confidence = low` so input cardinality is preserved.

| Column             | Type    | Description                                                                       |
| ------------------ | ------- | --------------------------------------------------------------------------------- |
| `price_amount`     | number  | Lowest visible published price for the vendor, as a numeric value                  |
| `price_currency`   | string  | ISO-4217 currency code, typically `EUR` or `USD`                                  |
| `price_period`     | string  | Billing period attached to the amount: `month`, `year`, `one_time`, or `unknown`  |
| `price_confidence` | string  | Extraction confidence: `high`, `medium`, or `low`                                 |

## Lifecycle

Standard outsend job lifecycle; see [/docs/concepts/jobs-lifecycle](/docs/concepts/jobs-lifecycle). Pricing jobs are serialized against other network-heavy jobs on the same account.

## Pipeline

```
needs:    [site_web]
produces: [price_amount, price_currency, price_period, price_confidence]
```

Typical chain: `scrap -> pricing -> filter("price_amount >= 1000")`.
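
To make the filter step concrete, here is a local equivalent of `price_amount >= 1000`, written as a sketch. It assumes rows are dictionaries and that a row without signal carries `None` or `""` in `price_amount`; how the actual `filter` module treats blanks is not specified on this page, and `at_least` is a hypothetical helper.

```python
def at_least(rows: list[dict], threshold: float) -> list[dict]:
    """Keep rows whose price_amount is present and >= threshold."""
    kept = []
    for row in rows:
        amount = row.get("price_amount")
        if amount in (None, ""):
            # Assumption: rows without an extracted price never match.
            continue
        if float(amount) >= threshold:
            kept.append(row)
    return kept


rows = [
    {"site_web": "https://a.example", "price_amount": 1490, "price_confidence": "high"},
    {"site_web": "https://b.example", "price_amount": "", "price_confidence": "low"},
    {"site_web": "https://c.example", "price_amount": 49, "price_confidence": "medium"},
]
print([row["site_web"] for row in at_least(rows, 1000)])
```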

## Endpoints

| Method | Path                | Purpose                                |
| ------ | ------------------- | -------------------------------------- |
| `POST` | `/api/jobs/pricing` | Create a pricing job                   |
| `GET`  | `/api/jobs`         | List the caller's jobs                 |
| `GET`  | `/api/jobs/{id}`    | Fetch a single job and its progress    |
| `GET`  | `/api/jobs/{id}/download` | Stream results as JSON or CSV      |
| `POST` | `/api/jobs/{id}/cancel` | Stop a running job                 |

### Create a job

```http
POST /api/jobs/pricing
Content-Type: application/json

{
  "items": [
    { "site_web": "https://example-saas.com" },
    { "site_web": "https://another-vendor.io" }
  ],
  "source_job_id": "job_01HX..."
}
```

## Limits

See [/docs/concepts/limits](/docs/concepts/limits). One concurrent pricing job per account.

## Errors

Per-item outcomes rather than whole-job failure.

| Condition             | Behavior                                                                 |
| --------------------- | ------------------------------------------------------------------------ |
| No pricing page found | Row returned with blank `price_amount` and `price_confidence = low`. Job still completes `done`. |
| Login wall            | Row returned with blank `price_amount` and `price_confidence = low`. |

Job-level errors — invalid payload, empty input after filtering, quota exceeded — surface as `400` from `POST /api/jobs/pricing` with a `JobValidationError` message.

## What's next

- [Tech stack](/docs/modules/techstack) — detect the technologies behind the same sites to pair price with build profile.
- [Ads intelligence](/docs/modules/ads_intelligence) — see which of the priced vendors are actively spending on paid acquisition.


<!-- doc: modules/reviews -->

---
title: Reviews
slug: modules/reviews
section: Modules
summary: Enrichment module that extracts the full Google reviews stream for each point of interest in a list.
---

## Purpose

Enrichment module: turns a list of POIs (each with a Google Maps link) into a flat stream of reviews — author, rating, date, text, and owner-reply flag. Typically chained after `scrap`.

## Inputs

JSON body with a list of POI items. Each item must carry a Google Maps link; other columns are preserved as context but ignored for extraction.

| Field           | Type     | Required | Description                                                                  |
| --------------- | -------- | -------- | ---------------------------------------------------------------------------- |
| `items`         | `list`   | yes      | POI rows. Between 1 and 10000 entries per job.                               |
| `source_job_id` | `string` | no       | UUID of the upstream job (typically a `scrap` job). Used for lineage.        |

Each entry in `items` is an object. The single required key is `lien_google_maps`. Other keys are carried through unchanged and reattached on every produced review row.

| Item key            | Type     | Required | Notes                                                       |
| ------------------- | -------- | -------- | ----------------------------------------------------------- |
| `lien_google_maps`  | `string` | yes      | Canonical Google Maps URL of the POI. Items without it are dropped at validation. |
| `nom`               | `string` | no       | Business name, propagated to each review row.               |
| `adresse`           | `string` | no       | Postal address, propagated.                                 |
| `telephone`         | `string` | no       | Propagated.                                                 |
| `site_web`          | `string` | no       | Propagated.                                                 |

Request example:

```json
{
  "source_job_id": "9d2f8e0a-1c4b-4f7e-a2b9-3d6a5e10c8f1",
  "items": [
    {
      "nom": "Boulangerie Centrale",
      "adresse": "12 rue de la Paix, 75002 Paris",
      "lien_google_maps": "https://www.google.com/maps/place/..."
    },
    {
      "nom": "Garage Dupont",
      "lien_google_maps": "https://www.google.com/maps/place/..."
    }
  ]
}
```

## Outputs

One row per review (not per POI). POI context is duplicated on every row.

| Column              | Type      | Description                                                       |
| ------------------- | --------- | ----------------------------------------------------------------- |
| `nom`               | `string`  | Business name (from input).                                       |
| `adresse`           | `string`  | Business address (from input).                                    |
| `lien_google_maps`  | `string`  | Source POI URL.                                                   |
| `reviewer_name`     | `string`  | Public display name of the review author.                         |
| `rating`            | `integer` | Star rating, 1–5.                                                 |
| `date`              | `string`  | Relative date as published by Google (e.g. "il y a 2 mois").      |
| `review_text`       | `string`  | Body of the review.                                               |
| `owner_replied`     | `boolean` | `true` when the business owner posted a public reply.             |
| `owner_reply_text`  | `string`  | Owner reply body, empty when none.                                |

Two export formats: CSV (UTF-8, comma-separated, RFC 4180-quoted) and XLSX. The CSV is canonical.
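
Because the output is one row per review, per-POI figures have to be rebuilt by grouping. The sketch below groups on `lien_google_maps` and computes a review count, an average rating, and an owner reply rate. It assumes rows are already parsed with `rating` as an integer and `owner_replied` as a boolean; `summarise_by_poi` is a hypothetical helper.

```python
from collections import defaultdict


def summarise_by_poi(review_rows: list[dict]) -> dict[str, dict]:
    """Aggregate the flat review stream back to one summary per POI link."""
    buckets: dict[str, list[dict]] = defaultdict(list)
    for row in review_rows:
        buckets[row["lien_google_maps"]].append(row)

    summary = {}
    for link, rows in buckets.items():
        replied = sum(1 for r in rows if r["owner_replied"])
        summary[link] = {
            "reviews": len(rows),
            "avg_rating": sum(r["rating"] for r in rows) / len(rows),
            "reply_rate": replied / len(rows),
        }
    return summary


# Illustrative rows only; the links are placeholders.
review_rows = [
    {"lien_google_maps": "poi-a", "rating": 5, "owner_replied": True},
    {"lien_google_maps": "poi-a", "rating": 2, "owner_replied": False},
    {"lien_google_maps": "poi-b", "rating": 4, "owner_replied": True},
]
summary = summarise_by_poi(review_rows)
print(summary["poi-a"])
```

POIs with no public reviews emit zero rows, so they will not appear in such a summary at all.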

## Lifecycle

Standard job lifecycle: see [Jobs & lifecycle](/docs/concepts/jobs-lifecycle).

## Pipeline

| Slot          | Value                                          |
| ------------- | ---------------------------------------------- |
| `category`    | `enrich`                                       |
| `needs`       | `poi_list` (requires `lien_google_maps`)       |
| `produces`    | `reviews_list`, `owner_replied`                |

Upstream: [`scrap`](/docs/modules/scrap), or any import with a valid `lien_google_maps` column. Downstream: typically [`filter`](/docs/modules/filter) and [`sort`](/docs/modules/sort) to narrow by rating, owner-reply, or recency.

## Endpoints

Module-specific:

```
POST /api/jobs/reviews
```

Body: `ReviewsJobCreateRequest` (see Inputs). Returns the created job in `pending`.

Generic job endpoints:

| Method | Path                          | Purpose                          |
| ------ | ----------------------------- | -------------------------------- |
| GET    | `/api/jobs/{job_id}`          | Fetch job state and metadata.    |
| GET    | `/api/jobs/{job_id}/events`   | SSE stream of progress events.   |
| GET    | `/api/jobs/{job_id}/download` | Download the produced CSV/XLSX.  |
| POST   | `/api/jobs/{job_id}/cancel`   | Request cancellation.            |

## Limits

Global quotas: see [/docs/concepts/limits](/docs/concepts/limits). The `reviews` module consumes roughly `0.4 / 3700` EF per POI. Payload bounds: 1 to 10000 items per job. No client-side cap on reviews per POI.
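
A rough client-side estimate can be derived from the per-POI figure above. This is a sketch that treats `0.4 / 3700` as a plain linear rate, which the word "roughly" suggests is only an approximation; the authoritative cost is the one returned on the job. `estimate_reviews_ef` is a hypothetical helper.

```python
EF_PER_POI = 0.4 / 3700  # Approximate rate quoted in the Limits section.
MAX_ITEMS = 10_000


def estimate_reviews_ef(poi_count: int) -> float:
    """Estimate the EF cost of a reviews job for a given number of POIs."""
    if not 1 <= poi_count <= MAX_ITEMS:
        raise ValueError(f"poi_count must be between 1 and {MAX_ITEMS}")
    return poi_count * EF_PER_POI


print(round(estimate_reviews_ef(3700), 4))
```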

## Errors

| Condition                                | Outcome                                                                              |
| ---------------------------------------- | ------------------------------------------------------------------------------------ |
| No item carries a `lien_google_maps`     | Job creation fails with HTTP 400 and message `Aucun établissement valide`.           |
| Estimated cost exceeds the per-job quota | Job creation fails with HTTP 400 and an explicit `Quota dépassé` message.            |
| POI has no public reviews                | The POI is processed; zero rows are emitted for it. The job still succeeds.          |
| Source temporarily unavailable for a POI | That POI is skipped and surfaced in the job summary; remaining POIs continue.        |
| Source unreachable for the whole run     | Job ends in `failed` with the upstream error surfaced on the job record.             |

## What's next

- [Scrap module](/docs/modules/scrap) — the canonical upstream producer of POI lists.
- [Filter module](/docs/modules/filter) — narrow the review stream by rating, recency, or owner-reply.
- [Jobs lifecycle](/docs/concepts/jobs-lifecycle) — states, events, and cancellation semantics.


<!-- doc: modules/scrap -->

---
title: Scrap (Google Maps)
slug: modules/scrap
section: Modules
summary: Source module that extracts Google Maps listings for a set of search queries across one or more geographic zones.
---

## Purpose

Source module: runs each query over every grid point covering the requested zones and returns a flat CSV of Google Maps establishments (name, contact, location, rating).

## Inputs

| Field | Type | Required | Description |
|---|---|---|---|
| `queries` | `string[]` (1–20) | yes | Search terms run against Google Maps. Each query is trimmed and capped at 200 chars. |
| `zones` | `string[]` (1–50) | yes | Geographic zones. Accepts INSEE codes, department codes, region names, or `"France"`. Each zone is resolved to a grid of points server-side. |
| `include_reviews` | `bool` | no | Kept for backward compatibility. Does not chain a reviews job — use the `reviews` module instead. Defaults to `false`. |

Request body:

```json
{
  "queries": ["plombier", "chauffagiste"],
  "zones": ["75", "92"],
  "include_reviews": false
}
```

Effective Google Maps requests = `len(queries) × grid_points(zones)`. The job is rejected at submit if the estimated cost exceeds the per-job EF ceiling.
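
The bounds from the Inputs table and the request formula can be checked before submitting. The sketch below is client-side only: grid resolution happens server-side, so the grid-point count is passed in as a plain number (the `1200` used here is an invented figure), and raising on over-long queries is a conservative choice rather than documented server behaviour. Both helper names are hypothetical.

```python
MAX_QUERIES = 20
MAX_ZONES = 50
MAX_QUERY_LENGTH = 200


def check_scrap_payload(queries: list[str], zones: list[str]) -> list[str]:
    """Return the trimmed queries, or raise ValueError if a documented bound is broken."""
    trimmed = [q.strip() for q in queries]
    if not 1 <= len(trimmed) <= MAX_QUERIES:
        raise ValueError(f"expected 1 to {MAX_QUERIES} queries, got {len(trimmed)}")
    if not 1 <= len(zones) <= MAX_ZONES:
        raise ValueError(f"expected 1 to {MAX_ZONES} zones, got {len(zones)}")
    if any(len(q) > MAX_QUERY_LENGTH for q in trimmed):
        # Assumption: fail loudly here instead of relying on server-side capping.
        raise ValueError(f"queries must be at most {MAX_QUERY_LENGTH} characters")
    return trimmed


def effective_requests(query_count: int, grid_points: int) -> int:
    """Number of Google Maps requests: one per query per grid point."""
    return query_count * grid_points


queries = check_scrap_payload([" plombier ", "chauffagiste"], ["75", "92"])
print(effective_requests(len(queries), grid_points=1200))
```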

## Outputs

Result file: UTF-8 CSV, semicolon delimiter, BOM (Excel-safe). The same dataset is available in three formats via the download endpoint.

| Column | Type | Description |
|---|---|---|
| `nom` | string | Establishment name as displayed on Google Maps. |
| `site_web` | string | Public website URL if listed. |
| `telephone` | string | Phone number as listed. |
| `adresse` | string | Street address (number + street) as shown in the Maps list. |
| `ville` | string | City, taken from Google's own structured place data (exact, including Paris/Lyon/Marseille arrondissements and multi-postcode cities). Empty for listings Google has no address for. |
| `code_postal` | string | Postal code, from the same Google source as `ville`. |
| `rating` | float | Average star rating (0.0–5.0). |
| `reviews_count` | int | Number of public reviews. |
| `category` | string | Primary Google Maps category. |
| `lien_google_maps` | string | Canonical Google Maps URL for the listing. |
| `aggregator_flag` | bool | True if the listing looks like a directory/aggregator rather than an end business. |
| `query` | string | Source query that produced the row. |
| `lat`, `lon` | float | Grid point at which the row was collected. |

> `ville` and `code_postal` come from the structured place data Google ships with each result, not from reverse-geocoding — so they match Google exactly. The Maps list view only renders the street, which is why `adresse` alone never carried the city.

### Optional columns

Three optional column groups are off by default and enabled per job via `extra_columns` (a list). The default output is street + `ville` + `code_postal`.

| Option in `extra_columns` | Adds column(s) | Description |
|---|---|---|
| `gps` | `lat`, `lon` | Exact latitude/longitude of the business (Google's own coordinates — not a grid approximation). |
| `departement` | `departement` | Department name, derived from `code_postal`. |
| `region` | `region` | Region name, derived from `code_postal`. |

Example request body: `{ "queries": ["plumber"], "zones": ["Paris 10km"], "extra_columns": ["gps", "departement", "region"] }`.

Formats: `csv` (original), `json`, `xlsx`. Selected via `?format=` on the download endpoint.
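
The default `csv` format (UTF-8 with BOM, semicolon-delimited) needs two small adjustments when read outside Excel. A minimal sketch with Python's standard `csv` module follows; the sample bytes are invented for illustration and `parse_scrap_csv` is a hypothetical helper.

```python
import csv
import io


def parse_scrap_csv(raw: bytes) -> list[dict[str, str]]:
    """Decode a scrap CSV export and return its rows as dictionaries."""
    # "utf-8-sig" strips the BOM so the first header is "nom", not "\ufeffnom".
    text = raw.decode("utf-8-sig")
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter=";")
    return list(reader)


sample = (
    "\ufeffnom;ville;code_postal;rating\r\n"
    "Plomberie Martin;Paris;75011;4.6\r\n"
).encode("utf-8")

rows = parse_scrap_csv(sample)
print(rows[0]["nom"], rows[0]["code_postal"])
```

All values come back as strings, so numeric columns such as `rating` still need an explicit conversion.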

## Lifecycle

Standard job lifecycle: see [Jobs & lifecycle](/docs/concepts/jobs-lifecycle). While running, the SSE `status` event carries a `query_stats` payload of shape `{ "<query>": { "tiles": int, "with_results": int } }`, updated in real time to expose per-query hit ratio.
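
Given a `query_stats` payload of the shape above, the per-query hit ratio is `with_results / tiles`. The sketch below assumes the payload has already been decoded from the SSE event into a dictionary and reports `0.0` for a query with no tiles processed yet; `hit_ratios` is a hypothetical helper.

```python
def hit_ratios(query_stats: dict[str, dict[str, int]]) -> dict[str, float]:
    """Share of processed tiles that returned at least one result, per query."""
    ratios = {}
    for query, stats in query_stats.items():
        tiles = stats["tiles"]
        # Guard against division by zero before the first tile is processed.
        ratios[query] = stats["with_results"] / tiles if tiles else 0.0
    return ratios


# Illustrative payload only.
query_stats = {
    "plombier": {"tiles": 40, "with_results": 30},
    "chauffagiste": {"tiles": 0, "with_results": 0},
}
print(hit_ratios(query_stats))
```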

## Pipeline

| Field | Value |
|---|---|
| `needs` | `null` (source module — no input CSV required) |
| `produces` | `poi_list` |

Typical downstream modules chained against a `scrap` output:

- [`emails`](/docs/modules/emails) — find professional and personal emails from `site_web`.
- [`socials`](/docs/modules/socials) — extract social network handles from `site_web`.
- [`legal_ids`](/docs/modules/legal_ids) — extract SIREN/SIRET from the establishment's website (legal-mentions page).
- [`reviews`](/docs/modules/reviews) — collect full review threads from `lien_google_maps`.
- [`techstack`](/docs/modules/techstack), [`dead_check`](/docs/modules/dead_check), [`brand_assets`](/docs/modules/brand_assets), [`ads_intelligence`](/docs/modules/ads_intelligence) — site-level enrichments keyed on `site_web`.

## Endpoints

Dedicated endpoint:

```
POST /api/jobs
Content-Type: application/json

{
  "queries": ["plombier"],
  "zones": ["75"],
  "include_reviews": false
}
```

Generic job endpoint (equivalent — same payload, `job_type` inferred from shape):

```
POST /api/jobs
Content-Type: application/json

{
  "job_type": "scrap",
  "queries": ["plombier"],
  "zones": ["75"]
}
```

Both responses return the created `JobPublic` object including `id`, `status`, `grid_points_count`, `ef_cost` and `output_filename`.

Download:

```
GET /api/jobs/{job_id}/download?format=csv|json|xlsx
```

## Limits

Platform-wide quotas: see [/docs/concepts/limits](/docs/concepts/limits). Module-specific caps:

| Limit | Value |
|---|---|
| Maximum queries per job | 20 |
| Maximum zones per job | 50 |
| Maximum query length | 200 chars |
| Maximum cost per job | `1.0` equivalent-France (EF) |
| Email verification | Required on the account before a `scrap` job can be created. |

## Errors

| Scenario | HTTP | Resolution |
|---|---|---|
| Unrecognised zone string | 400 | Inspect the `errors` array in the response body; use INSEE/department codes or `"France"`. |
| No grid points resolved | 400 | The zone set is empty after resolution — broaden the zone selection. |
| EF quota exceeded | 400 | Reduce the number of queries or shrink the zones until estimated EF ≤ 1.0. |
| Email not verified | 403 | Verify the account email before creating a `scrap` job. |
| No worker available | n/a | The job stays in `pending` until the shared multi-proxy pool is free. Only one multi-proxy job runs at a time platform-wide. |
| Job failed mid-run | n/a | A partial CSV is preserved. A `POST /api/jobs/{id}/resume` creates a follow-up job that skips already-processed grid points and is billed only for the remainder. |
| Download expired | 410 | Result files have a retention window — re-run the job or chain from a fresh source. |

Queries refused by Google Maps surface in `dead_queries` on the job object.

## What's next

- [Jobs lifecycle](/docs/concepts/jobs-lifecycle)
- [Pipelines](/docs/concepts/pipelines)
- [`emails` module](/docs/modules/emails)


<!-- doc: modules/socials -->

---
title: Socials
slug: modules/socials
section: Modules
summary: Enrichment module that attaches public social profile URLs to each point of interest carrying a website.
---

## Purpose

Enrichment module: attaches public social profile URLs (operated by the entity itself) to each POI carrying a website. Parasitic links — share buttons, third-party directories, review aggregators — are discarded. Does not discover entities, verify ownership, or access authenticated content.

## Inputs

JSON body with two fields.

| Field | Type | Required | Description |
|---|---|---|---|
| `items` | array of objects | yes | Between 1 and 10 000 points of interest. Each item must carry a `site_web` field. Items without a website are filtered out before the job is queued. |
| `source_job_id` | string | no | Identifier of an upstream job whose output feeds this one. Used to chain modules inside a pipeline. |

A minimal item shape is:

```json
{
  "name": "Studio Atlas",
  "site_web": "https://studio-atlas.example"
}
```

Any additional keys on input items are preserved untouched alongside the social columns.

## Outputs

Each input item is returned with the same identifying fields plus one column per network. A column is left empty when no profile is found.

| Column | Type | Description |
|---|---|---|
| `social_facebook` | string | Canonical Facebook page URL. |
| `social_instagram` | string | Canonical Instagram profile URL. |
| `social_linkedin` | string | Canonical LinkedIn company or profile URL. |
| `social_twitter` | string | Canonical X (formerly Twitter) profile URL. |
| `social_tiktok` | string | Canonical TikTok profile URL. |
| `social_youtube` | string | Canonical YouTube channel URL. |
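
Since a missing profile is an empty column rather than an absent key, a consumer will usually want to collect only the networks that were found. A minimal sketch, assuming rows are dictionaries and empty means `""` or `None`; `found_profiles` is a hypothetical helper.

```python
SOCIAL_COLUMNS = (
    "social_facebook",
    "social_instagram",
    "social_linkedin",
    "social_twitter",
    "social_tiktok",
    "social_youtube",
)


def found_profiles(row: dict) -> dict[str, str]:
    """Return {network: url} for every non-empty social column on the row."""
    return {
        column.removeprefix("social_"): row[column]
        for column in SOCIAL_COLUMNS
        if row.get(column)
    }


row = {
    "name": "Studio Atlas",
    "site_web": "https://studio-atlas.example",
    "social_facebook": "",
    "social_instagram": "https://www.instagram.com/studioatlas",
    "social_linkedin": "https://www.linkedin.com/company/studio-atlas",
}
print(found_profiles(row))
```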

## Lifecycle

Standard job lifecycle: see [Jobs & lifecycle](/docs/concepts/jobs-lifecycle).

## Pipeline

| Property | Value |
|---|---|
| Category | enrich |
| Needs | `site_web` |
| Produces | `social_facebook`, `social_instagram`, `social_linkedin`, `social_twitter`, `social_tiktok`, `social_youtube` |
| Pipelinable | yes |
| Supports continuous monitoring | no |

Typical chain: `scrap` → `socials` → outreach. `source_job_id` links the run to its parent.

## Endpoints

### Create a socials job

```
POST /api/jobs/socials
```

Request body matches the `Inputs` table. Response: a `Job` resource with identifier, queued status, and estimated EF cost.

```json
{
  "items": [
    { "name": "Studio Atlas", "site_web": "https://studio-atlas.example" },
    { "name": "Atelier Nord",  "site_web": "https://atelier-nord.example" }
  ],
  "source_job_id": "b1f2c3d4-..."
}
```

A successful response returns:

```json
{
  "id": "f9e8d7c6-...",
  "job_type": "socials",
  "status": "pending",
  "items_count": 2,
  "ef_cost": 0.0011
}
```

### Read state, stream progress, download results

Standard job endpoints apply unchanged:

| Method | Path | Purpose |
|---|---|---|
| `GET` | `/api/jobs/{id}` | Snapshot of the job state. |
| `GET` | `/api/jobs/{id}/events` | Server-sent events stream for progress updates. |
| `GET` | `/api/jobs/{id}/results` | Enriched list, JSON or CSV. |

## Limits

Global quotas: see [/docs/concepts/limits](/docs/concepts/limits). Module-specific caps:

| Limit | Value |
|---|---|
| Items per job | 1 to 10 000 |
| Required field per item | `site_web` |

Items without a `site_web` are dropped at validation. If no item remains, the job is rejected.
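The same rules can be mirrored client-side before submitting, which avoids a round trip for payloads that would be rejected anyway. This is a minimal sketch under the limits stated above; the function name and the choice to raise `ValueError` are illustrative, not part of the API.

```python
MAX_ITEMS = 10_000

def build_socials_payload(items, source_job_id=None):
    """Build a request body, dropping items that have no usable site_web."""
    if len(items) > MAX_ITEMS:
        raise ValueError(f"at most {MAX_ITEMS} items per job")
    eligible = [it for it in items if (it.get("site_web") or "").strip()]
    if not eligible:
        # The server would reject this job, so fail early.
        raise ValueError("no item carries a usable site_web")
    payload = {"items": eligible}
    if source_job_id is not None:
        payload["source_job_id"] = source_job_id
    return payload

payload = build_socials_payload([
    {"name": "Studio Atlas", "site_web": "https://studio-atlas.example"},
    {"name": "No website"},
])
print(len(payload["items"]))
```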

## Errors

Standard HTTP error envelope.

| Status | Code | Condition |
|---|---|---|
| 400 | `validation_error` | The payload fails schema validation (missing `items`, more than 10 000 entries, malformed JSON). |
| 400 | `no_eligible_items` | None of the submitted items carries a usable `site_web`. |
| 400 | `quota_exceeded` | The estimated cost exceeds the account ceiling. The response body carries the estimate and the ceiling. |
| 401 | `unauthenticated` | The request is missing a valid session. |
| 403 | `inactive_account` | The account is not active. |
| 429 | `rate_limited` | Too many job creations in a short window. |

Per-item failures never fail the job: the affected item is returned with empty columns, and the failure is surfaced in the job summary.

## What's next

- [Jobs lifecycle](/docs/concepts/jobs-lifecycle) for state transitions and event payloads.
- [Pipelines](/docs/concepts/pipelines) to chain `socials` with upstream and downstream modules.
- [Emails](/docs/modules/emails) to attach contact emails to the same list.
- [Tech stack](/docs/modules/techstack) to profile the technology behind each website.


<!-- doc: modules/sort -->

---
title: Sort
slug: modules/sort
section: Modules
---

## Purpose

Reorder rows produced by a previous pipeline step by a chosen column, ascending or descending, and optionally truncate the result to the top N rows. Sort is pipeline-internal (see [/docs/concepts/pipeline-orchestration](/docs/concepts/pipeline-orchestration)): it consumes the output of the predecessor node and emits the same columns in a new order. A common use is ordering a freshly scraped list by `completeness` descending and keeping the top 200 before handing the result to an email-finder or cold-outreach step. Sort never adds, removes, or rewrites columns — truncation is the only content change it can apply.

## Inputs

Configuration is attached to the pipeline node. There is no standalone request body.

| Field       | Type                | Required | Default        | Description                                              |
|-------------|---------------------|----------|----------------|----------------------------------------------------------|
| `sort_by`   | string (enum)       | yes      | `completeness` | Column to order by.                                      |
| `direction` | `"asc"` \| `"desc"` | yes      | `desc`         | Sort direction.                                          |
| `top_n`     | integer \| `null`   | no       | `null`         | Keep only the first `top_n` rows after sorting. Min `1`. |

Accepted values for `sort_by`:

| Value           | Meaning                                                 |
|-----------------|---------------------------------------------------------|
| `completeness`  | Aggregate fill rate of a row's enrichment fields.       |
| `note`          | Star rating of the establishment (Google Maps).         |
| `nb_avis`       | Number of reviews on the establishment.                 |
| `email_quality` | Quality score of the extracted email (personal > role). |

`top_n` must be a positive integer or `null`. A value of `null` keeps every row.
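The constraints above can be checked before a pipeline is submitted. The sketch below is an illustrative client-side validator, assuming the `config` shape documented on this page; it is not the server's validation logic.

```python
SORT_BY_VALUES = {"completeness", "note", "nb_avis", "email_quality"}
DIRECTIONS = {"asc", "desc"}

def validate_sort_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    if config.get("sort_by") not in SORT_BY_VALUES:
        errors.append("sort_by must be one of: " + ", ".join(sorted(SORT_BY_VALUES)))
    if config.get("direction") not in DIRECTIONS:
        errors.append("direction must be 'asc' or 'desc'")
    top_n = config.get("top_n")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if top_n is not None and (isinstance(top_n, bool)
                              or not isinstance(top_n, int) or top_n < 1):
        errors.append("top_n must be a positive integer or null")
    return errors

print(validate_sort_config({"sort_by": "completeness", "direction": "desc", "top_n": 200}))
```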

## Outputs

Same schema as the predecessor node — sort is declared `passthrough` in the pipeline graph. The downstream node sees the same columns it would have seen without the sort step, only in a different order and possibly with fewer rows.

| Property      | Behaviour                                                                  |
|---------------|----------------------------------------------------------------------------|
| Columns       | Identical to input, byte-for-byte.                                         |
| Row order     | Determined by `sort_by` and `direction`.                                   |
| Row count     | `min(input_count, top_n)` if `top_n` is set, otherwise equal to the input. |
| Pipeline type | Same as the upstream source (resolved transitively across sort/filter).    |
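The behaviour in the table can be modelled in a few lines. This is a sketch of the documented contract, not the runner's code. It assumes the sortable columns hold numeric values, and it places rows with a missing value last, which is an assumption this page does not confirm.

```python
def sort_rows(rows, sort_by, direction="desc", top_n=None):
    """Return a new list ordered by `sort_by`, optionally cut to `top_n` rows."""
    def is_missing(row):
        return row.get(sort_by) in (None, "")

    present = [r for r in rows if not is_missing(r)]
    missing = [r for r in rows if is_missing(r)]  # assumption: missing values go last
    present.sort(key=lambda r: float(r[sort_by]), reverse=(direction == "desc"))
    ordered = present + missing
    # Rows are returned untouched: no column is added, removed, or rewritten.
    return ordered if top_n is None else ordered[:top_n]

rows = [
    {"nom": "A", "completeness": "0.4"},
    {"nom": "B", "completeness": "0.9"},
    {"nom": "C", "completeness": ""},
    {"nom": "D", "completeness": "0.7"},
]
top = sort_rows(rows, "completeness", "desc", top_n=2)
print([r["nom"] for r in top])
```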

## Lifecycle

Standard job lifecycle — see [/docs/concepts/jobs-lifecycle](/docs/concepts/jobs-lifecycle). A sort step is created automatically by the pipeline runner when the predecessor reaches `done`, and is treated as a structural pipeline step rather than a billable action.

## Pipeline

| Property               | Value                       |
|------------------------|-----------------------------|
| Category               | `process`                   |
| Pipelinable            | yes                         |
| Needs                  | none (accepts any input)    |
| Produces               | none (passthrough output)   |
| Pipeline input type    | `any_pois`                  |
| Pipeline output type   | `passthrough`               |

Because the output type is `passthrough`, the effective downstream type is inherited from the closest non-pass-through predecessor. A sort placed after `scrap` exposes the same downstream contract as `scrap` would on its own.
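The inheritance rule can be pictured as a backwards walk over the chain. The sketch below assumes a linear chain and uses a hypothetical type name (`scrap_pois`) for the `scrap` output, since the actual type identifiers are not listed on this page.

```python
from typing import Optional

PASSTHROUGH = "passthrough"

def effective_output_type(chain: list) -> Optional[str]:
    """Return the output type of the closest non-passthrough node, walking backwards."""
    for node in reversed(chain):
        if node["output_type"] != PASSTHROUGH:
            return node["output_type"]
    return None  # the chain contains only passthrough nodes

# Hypothetical chain: scrap -> filter -> sort.
chain = [
    {"type": "scrap", "output_type": "scrap_pois"},
    {"type": "filter", "output_type": PASSTHROUGH},
    {"type": "sort", "output_type": PASSTHROUGH},
]
print(effective_output_type(chain))
```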

## Endpoints

Sort has no public REST endpoint — it is pipeline-internal (see [/docs/concepts/pipeline-orchestration](/docs/concepts/pipeline-orchestration)) and created exclusively by the pipeline runner as the predecessor node finishes, through the internal helper `create_pipeline_internal_job(job_type="sort", …)`.

To use sort, define it as a node inside a pipeline created via the pipelines API:

| Method | Path                          | Purpose                                       |
|--------|-------------------------------|-----------------------------------------------|
| POST   | `/api/pipelines`              | Create a pipeline containing a `sort` node.   |
| GET    | `/api/pipelines/{id}`         | Inspect node configuration and node statuses. |

A node entry for sort looks like:

```json
{
  "type": "sort",
  "config": {
    "sort_by": "completeness",
    "direction": "desc",
    "top_n": 200
  }
}
```

The runner reads `config` at execution time and emits the sorted CSV into the node's job directory.

## Limits

Global limits — see [/docs/concepts/limits](/docs/concepts/limits). Sort is not billed (quota cost `0`), runs in the standard parallel job pool, and its maximum input size is bounded by the predecessor's output, not by sort itself.

## Errors

| Condition                              | Result                                                                 |
|----------------------------------------|------------------------------------------------------------------------|
| `sort_by` references an unknown column | Node transitions to `failed`; downstream nodes stay `pending`.         |
| `direction` is not `asc` or `desc`     | Node transitions to `failed` with a validation error.                  |
| `top_n` is `0` or negative             | Rejected at pipeline creation; the API returns `400`.                  |
| Empty input                            | Node completes as `done` with an empty output CSV.                     |
| Predecessor did not reach `done`       | Sort stays `pending`; it is never scheduled until the parent finishes. |

## What's next

- [Filter](/docs/modules/filter) — drop rows that don't match a rule before, or after, sorting.
- [Import](/docs/modules/import) — bring an external CSV into a pipeline so it can be sorted.


<!-- doc: modules/techstack -->

---
title: Tech stack
slug: modules/techstack
section: Modules
---

# Tech stack

Detect the technologies powering each prospect's website — CMS, JavaScript frameworks, analytics suites, payment processors, hosting and CDN — and return one structured row per site.

Typical use cases:

- Qualify prospects by maturity (custom build vs. drag-and-drop site builder).
- Filter by buying signal (already running paid analytics, already accepting cards).
- Route leads to the right offer based on the platform they operate.

## Inputs

Items without a resolvable `site_web` are dropped before billing.

| Field              | Type   | Required | Notes                                     |
| ------------------ | ------ | -------- | ----------------------------------------- |
| `items`            | array  | yes      | 1 to 10,000 POI rows                      |
| `items[].site_web` | string | yes      | HTTP(S) URL of the website to fingerprint |
| `source_job_id`    | string | no       | UUID of an upstream job to chain from     |

## Outputs

One enriched row per input item. Detection is best-effort: any column may be `null` when the signal is absent or ambiguous.

| Column            | Type            | Description                                                                            |
| ----------------- | --------------- | -------------------------------------------------------------------------------------- |
| `tech_cms`        | string \| null  | Primary content management system or site builder (e.g. `wordpress`, `shopify`, `webflow`). |
| `tech_analytics`  | string \| null  | Analytics products detected on the site (e.g. `ga4`, `matomo`, `plausible`).            |
| `tech_ads_pixels` | string \| null  | Advertising pixels detected (e.g. `meta`, `google_ads`, `linkedin`, `tiktok`).         |

The full per-site report — three-tier hierarchy, business signals, technical metadata — is available on the job detail page; the CSV exposes the flat signals above for filtering and pipeline branching.
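As an example of filtering on the flat signals, the sketch below keeps the rows whose `tech_cms` matches a given value. It assumes a `null` column is exported as an empty CSV cell, and the sample data (including the `nom` column) is invented for illustration.

```python
import csv
import io

# Hypothetical export; real files carry the original POI fields as well.
sample_csv = """nom,site_web,tech_cms,tech_analytics,tech_ads_pixels
Studio Atlas,https://studio-atlas.example,shopify,ga4,meta
Atelier Nord,https://atelier-nord.example,,,
"""

def rows_with_cms(csv_text: str, cms: str) -> list:
    """Return the rows whose tech_cms equals `cms` (case-insensitive)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Assumption: a null signal shows up as an empty cell.
    return [row for row in reader if (row.get("tech_cms") or "").lower() == cms.lower()]

matches = rows_with_cms(sample_csv, "shopify")
print([row["nom"] for row in matches])
```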

## Lifecycle

Standard outsend job lifecycle; see [/docs/concepts/jobs-lifecycle](/docs/concepts/jobs-lifecycle). Progress is reported per site, in the `sites` unit.

## Pipeline

```
needs:    [site_web]
produces: [tech_cms, tech_analytics, tech_ads_pixels]
```

Any upstream module that emits `site_web` — `scrap` being the canonical source — can feed `techstack`.

## Endpoints

### Create a job

```
POST /api/jobs/techstack
```

```json
{
  "items": [
    { "site_web": "https://example.com" },
    { "site_web": "https://another.example" }
  ],
  "source_job_id": "8b3e…optional"
}
```

Returns the created job with its `id` and initial `pending` status.

### Inspect a job

```
GET /api/jobs/{job_id}
```

### Stream progress

```
GET /api/jobs/{job_id}/stream
```

Server-sent events, one event per processed site plus terminal status transitions.
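A server-sent events stream is plain text and can be decoded without a dedicated library. The sketch below follows the standard `text/event-stream` framing (blank line ends an event, `:` starts a comment). The event names and JSON payloads in the sample are hypothetical, since the actual payload schema is documented on the lifecycle page rather than here.

```python
def parse_sse(lines):
    """Yield one dict per event from an iterable of text/event-stream lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event, if any data was collected.
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)

# Hypothetical stream content, for illustration only.
sample = [
    "event: progress\n",
    'data: {"done": 1, "total": 2}\n',
    "\n",
    ": keep-alive\n",
    'data: {"status": "done"}\n',
    "\n",
]
events = list(parse_sse(sample))
print(events)
```

In practice the `lines` iterable would come from a streaming HTTP response that is read line by line.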

### Download results

```
GET /api/jobs/{job_id}/download
```

Returns the enriched list. Each row carries the original POI fields plus the columns documented in [Outputs](#outputs).

## Limits

See [/docs/concepts/limits](/docs/concepts/limits) for global limits. URL scheme must be `http://` or `https://`. A job with zero valid items after `site_web` filtering is rejected with `400`.
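The scheme rule is easy to check before submitting a job. The sketch below uses the standard library's `urllib.parse`. Requiring a non-empty host in addition to the scheme is an extra assumption, not something this page specifies.

```python
from urllib.parse import urlsplit

def has_valid_site_web(item: dict) -> bool:
    """True when site_web is an http:// or https:// URL with a host."""
    url = (item.get("site_web") or "").strip()
    parts = urlsplit(url)
    # Assumption: a URL without a host is not usable either.
    return parts.scheme.lower() in ("http", "https") and bool(parts.netloc)

items = [
    {"site_web": "https://example.com"},
    {"site_web": "ftp://example.com"},
    {"site_web": "example.com"},
    {},
]
print([has_valid_site_web(it) for it in items])
```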

## Errors

| HTTP | Code / message                                                          | When it happens                                          |
| ---- | ----------------------------------------------------------------------- | -------------------------------------------------------- |
| 400  | `Aucun établissement avec site web` ("No establishment with a website") | No item in the payload has a usable `site_web`.          |
| 400  | `Quota dépassé` ("Quota exceeded")                                      | The job's EF cost exceeds `MAX_EF_PER_JOB`.              |
| 400  | Validation error                                                        | Payload shape invalid (size, missing fields, bad types). |

Errors are returned as JSON with a `detail` field. Once a job is running, per-item failures land in the result row (empty columns) rather than as HTTP errors.

## What's next

The `techstack` output composes naturally with:

- [`pricing`](/docs/modules/pricing) — extract published prices and plans from the same sites.
- [`ads_intelligence`](/docs/modules/ads_intelligence) — profile marketing maturity (pixels, retargeting, CMP, CRM).
- [`pagespeed`](/docs/modules/pagespeed) — measure real-world performance and Core Web Vitals.


<!-- doc: modules/verify_emails -->

---
title: Email verification
slug: modules/verify_emails
section: Modules
---

# Email verification

The `verify_emails` module validates the deliverability of a list of email addresses before a campaign is sent. It checks syntax, resolves MX records, opens a probe against the receiving server, and flags addresses that are disposable or shaped like a role mailbox.

It runs after an enrichment step that produced emails, or against an imported list. The job is parallel: it does not consume a slot on the multi-proxy queue and can run alongside extraction jobs.

Disposable and alias domains are detected. This includes alias addresses from modern providers (Apple, DuckDuckGo, ProtonMail), which comparable tools incorrectly reject. Catch-all domains are flagged as cases where deliverability cannot be guaranteed.

## Inputs

| Field         | Required | Source                                      |
|---------------|----------|---------------------------------------------|
| `email`       | yes      | from a previous `emails` job, or imported   |
| `nom`         | no       | passed through to the output                |
| `telephone`   | no       | passed through                              |
| `site_web`    | no       | passed through                              |

`needs: ['email']`. Items missing an `@` are discarded at job creation. Duplicate addresses are collapsed, compared on the lowercased address. The job accepts between `1` and `10000` items per call.
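The input rules above can be reproduced locally to predict how many items a job will actually verify. This is a minimal sketch: trimming whitespace and keeping the first occurrence of a duplicate are assumptions, not documented behaviour.

```python
def prepare_emails(items: list) -> list:
    """Drop entries without '@' and collapse duplicates on the lowercased address."""
    seen, kept = set(), []
    for item in items:
        email = (item.get("email") or "").strip().lower()  # assumption: whitespace is trimmed
        if "@" not in email or email in seen:
            continue
        seen.add(email)
        kept.append({**item, "email": email})  # assumption: the first occurrence wins
    return kept

prepared = prepare_emails([
    {"email": "Alex@Example.com", "nom": "Alex"},
    {"email": "alex@example.com"},
    {"email": "not-an-email"},
    {"email": "contact@example.org"},
])
print([p["email"] for p in prepared])
```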

`source_job_id` is optional and points to the `emails` job whose output is being verified. The UI surfaces this picker as `from_jobs_of_type: 'emails'`.

## Outputs

The job produces a verification report. Each row carries the verified address, the verdict, and a passthrough of identifying fields from the input.

| Column          | Type   | Meaning                                                                                                   |
|-----------------|--------|----------------------------------------------------------------------------------------------------------|
| `email`         | string | Lowercased address as submitted                                                                          |
| `status`        | string | Deliverability verdict (see values below)                                                                |
| `category`      | string | How the verdict was reached: `smtp`, `syntax`, `disposable`, `suspect`, `big_provider`, `no_mx`, `error` |
| `reason`        | string | Human-readable explanation of the verdict                                                                 |
| `suggested_fix` | string | Suggested correction for an obvious typo, when detected (optional)                                        |
| `smtp_code`     | string | Raw SMTP response code from the probe, when an SMTP check ran                                             |
| `catch_all`     | string | `yes` / `no` / `unknown` — whether the domain accepts any address                                         |
| `nom`           | string | Passed through from input                                                                                 |
| `telephone`     | string | Passed through from input                                                                                 |
| `site_web`      | string | Passed through from input                                                                                 |

`status` takes one of:

| Value             | Meaning                                                                                          |
|-------------------|-------------------------------------------------------------------------------------------------|
| `valid`           | The receiving server accepted the address — deliverable                                          |
| `valid_catch_all` | Accepted, but the domain is catch-all so acceptance is not a guarantee                           |
| `invalid`         | The server rejected the address — do not send                                                    |
| `greylisted`      | Temporarily deferred by the server; can be retried later                                         |
| `unknown`         | Could not be determined (timeout, blocked probe…)                                                |
| `filtered`        | Rejected before the SMTP step — bad syntax, disposable, role/suspect, or no MX (see `category`)  |
| `skipped`         | Big free provider (Gmail, Outlook…) that rejects probes; treat as deliverable                    |

The signal columns (`status`, `category`) are declared in the module registry under `produces`. They are the columns campaigns and downstream `filter` / `sort` nodes branch on.
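One way to act on `status` is to map each value to a campaign decision. The grouping below follows the meanings in the table (`skipped` treated as deliverable, `greylisted` retried later). Routing `unknown` to a manual review bucket is a policy assumption, and the bucket names are illustrative.

```python
SEND = {"valid", "valid_catch_all", "skipped"}
RETRY = {"greylisted"}
DROP = {"invalid", "filtered"}

def triage(status: str) -> str:
    """Map a verification status to a campaign decision."""
    if status in SEND:
        return "send"
    if status in RETRY:
        return "retry"
    if status in DROP:
        return "drop"
    return "review"  # assumption: `unknown` and unexpected values get a manual look

report = [
    {"email": "alex@example.com", "status": "valid"},
    {"email": "contact@example.org", "status": "greylisted"},
    {"email": "old@example.net", "status": "invalid"},
]
print([(row["email"], triage(row["status"])) for row in report])
```

A stricter policy could move `valid_catch_all` out of `SEND`, since acceptance on a catch-all domain is not a guarantee.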

## Lifecycle

Standard job states — see [Jobs lifecycle](/docs/concepts/jobs-lifecycle). Progress is reported per email (`progress_unit: 'emails'`).

## Pipeline

The module declares the following contract:

| Property          | Value                |
|-------------------|----------------------|
| `needs`           | `email`              |
| `produces`        | `status`, `category` |
| `category`        | `verify`             |
| `pipelinable`     | `true`               |
| `supports_veille` | `false`              |

In a pipeline, `verify_emails` accepts any upstream node that emits `email` (typically `emails` or `import`). Its `verified` output can be wired into a downstream `filter` or `sort` node — filter on `status` to keep only deliverable rows (e.g. `status` in `valid`, `valid_catch_all`). Enrichment nodes cannot run after `verify_emails`: the verification report does not carry the full POI columns they need.

## Endpoints

### Create a job

```
POST /api/jobs/verify-emails
```

Body:

```json
{
  "items": [
    { "email": "alex@example.com", "nom": "Alex", "site_web": "https://example.com" },
    { "email": "contact@example.org" }
  ],
  "source_job_id": "f3c2…"
}
```

`source_job_id` is optional. `items` is required, with at least one record and at most `10000`.

Response: `JobPublic` (the standard job envelope).

### Read a job

The standard job endpoints apply:

```
GET  /api/jobs/{id}
GET  /api/jobs/{id}/events     # SSE stream
GET  /api/jobs/{id}/download   # CSV
```

## Limits

For per-job and per-account caps, see [Limits](/docs/concepts/limits). Throughput is throttled to roughly five verifications per second so the outbound IP does not get flagged by mail providers. Concurrency: the job runs on the parallel worker pool, so a `verify_emails` job never blocks a scraping job and is never blocked by one.
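The throttle gives a rough lower bound on job duration. The estimate below simply divides the item count by the stated rate of about five verifications per second. Actual run time depends on probe latency and retries, so treat the result as an order of magnitude.

```python
def estimated_duration_s(n_emails: int, rate_per_s: float = 5.0) -> float:
    """Rough duration of a verification job at a constant throughput."""
    return n_emails / rate_per_s

seconds = estimated_duration_s(10_000)
print(f"{seconds:.0f} s, about {seconds / 60:.0f} min")
```

At the maximum job size of 10,000 items, this works out to roughly 2,000 seconds, a little over half an hour.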

## Errors

| Code | Condition                                                    |
|------|--------------------------------------------------------------|
| 400  | `Aucun email valide dans la liste` ("No valid email in the list"): `items` is empty, every entry is missing `@`, or all entries are duplicates |
| 400  | `Quota dépassé` ("Quota exceeded"): estimated cost exceeds `MAX_EF_PER_JOB` |
| 401  | Caller is not an active user                                 |
| 422  | Payload does not match `VerifyEmailsJobCreateRequest`        |

Per-item failures (timeout on MX, refused SMTP probe, etc.) do not fail the job. The row is written with `status = unknown` and the job advances.

## What's next

- [delivery_check](/docs/modules/delivery_check) — confirms the receiving server actually accepted a test delivery, beyond the SMTP handshake.
- [filter](/docs/modules/filter) — keep only deliverable rows by filtering on `status` (`valid` / `valid_catch_all`), then hand the result to a campaign or export.


<!-- doc: quickstart -->

---
title: Quickstart
slug: quickstart
section: Get started
summary: From signup to a first exported CSV in under five minutes.
---

This is the shortest path from a fresh account to a usable list of prospects.

## 1. Get access

outsend is in **invitation-only alpha**. Request access at [outsend.xyz/demander-acces](https://outsend.xyz/demander-acces). When approved, an invitation code arrives by email.

## 2. Sign up

At [outsend.xyz/signup](https://outsend.xyz/signup), the form takes:

- Email
- Password (8 chars min, with at least one letter and one digit or symbol)
- The invitation code (`XXXX-XXXX-XXXX`)

Account creation is immediate. A verification email is sent, but you can use the app before confirming it.

## 3. Run a first scrap

From the dashboard, click **New job → Scrap (Google Maps)**.

Fill in:

- **Queries** — what to look for, e.g. `bakery`, `dentist`, `accounting firm`. Enter multiple queries as chips.
- **Zones** — French regions, departments, or cities. Multiple zones supported.
- **Include reviews** — toggle if review text should be pulled too (slower, richer).

Click **Run**. The job appears on the dashboard with a live log stream.

## 4. Watch it run

Open the job detail page. The status bar progresses through `pending` → `running` → `done`. Logs stream in real time over SSE. See [Jobs & lifecycle](/docs/concepts/jobs-lifecycle) for the full state machine.

Typical durations:

- Small query, single city: a few minutes
- Multi-region scrap with reviews: tens of minutes to a few hours

## 5. Enrich

Once the scrap is done, the **Add module** action on the job detail page chains an enrichment. Common chains:

- **Emails** — find email addresses for each POI
- **Socials** — find LinkedIn, Instagram, Facebook profiles
- **Ads intelligence** — score each POI by marketing maturity (premium signal: budget available)

Each enrichment runs as its own job, sharing the parent job's POI list as input.

## 6. Export

On any done job, click **Download** and pick CSV, JSON, or XLSX. Files are kept for **7 days**, then purged.

## 7. (Optional) Build a pipeline

To run the same chain repeatedly, build it once as a **pipeline**: drag-and-drop the blocks, connect them, save. The pipeline can be re-run on demand or registered as a [veille](/docs/concepts/veille-monitoring).

## What's next

- [Jobs & lifecycle](/docs/concepts/jobs-lifecycle) — how a job moves from creation to result
- [Module registry](/docs/concepts/module-registry) — full list of modules, with categories
- [API reference](/docs/api/overview) — drive everything programmatically


<!-- doc: what-is-outsend -->

---
title: What is outsend
slug: what-is-outsend
section: Get started
summary: A processor for B2B prospect data — extract from Google Maps, enrich, verify, orchestrate, monitor.
---

outsend is a **data processor for B2B prospecting**. It turns a search query and a geographic zone into a workable list of qualified prospects, then keeps that list fresh over time.

The work happens in three layers:

## 1. Extraction modules

These modules pull data from public sources. The most central one is [`scrap`](/docs/modules/scrap) (Google Maps listings). Around it sit modules that enrich each point of interest: [`reviews`](/docs/modules/reviews), [`emails`](/docs/modules/emails), [`socials`](/docs/modules/socials), [`phones_extra`](/docs/modules/phones_extra), [`legal_ids`](/docs/modules/legal_ids), [`legal_mentions`](/docs/modules/legal_mentions), [`legal_data`](/docs/modules/legal_data).

## 2. Intelligence modules

These modules compute signals on a list that already exists: [`pricing`](/docs/modules/pricing), [`techstack`](/docs/modules/techstack), [`pagespeed`](/docs/modules/pagespeed), [`ads_intelligence`](/docs/modules/ads_intelligence), [`brand_assets`](/docs/modules/brand_assets), [`dead_check`](/docs/modules/dead_check). They turn a flat contact list into something segmentable.

## 3. Pipeline & monitoring modules

These don't extract data — they orchestrate it: [`import`](/docs/modules/import), [`filter`](/docs/modules/filter), [`sort`](/docs/modules/sort), plus verification ([`verify_emails`](/docs/modules/verify_emails), [`delivery_check`](/docs/modules/delivery_check)).

Chained together, they form a **pipeline** — a DAG you can edit visually. A pipeline can be run once, or registered as a **veille** that re-runs on a schedule and reports the delta versus the previous run.

## Mental model

```
Search query + zone
       │
       ▼
   [scrap]  ──►  Points of interest (POIs)
       │
       ▼
   [emails] [socials] [legal_ids] ...   ──►  Enriched POIs
       │
       ▼
   [filter] [sort]                       ──►  Curated list
       │
       ▼
   Export (CSV / JSON / XLSX)
```

The same shape, registered as a [veille](/docs/concepts/veille-monitoring), re-runs every N days and diffs against the previous output — surfacing new businesses, closures, and reputation shifts as signals.

## What outsend is not

- **Not a CRM.** Exports go *to* a CRM. outsend keeps the CRM clean by filtering before insertion.
- **Not a cold-email sender.** Campaign modules ([`email_campaign`](/docs/modules/on-demand/email_campaign), [`sms_campaign`](/docs/modules/on-demand/sms_campaign)) are on-demand: the team builds the send for you rather than exposing a deliverability footgun.
- **Not a database of contacts.** Each search runs live. No stale pre-built lists.