Filter

Purpose

The filter module narrows a dataset to the rows that match a set of rules. It is pipeline-internal (see /docs/concepts/pipeline-orchestration): it consumes the CSV produced by an upstream node and emits a subset of its rows with the same columns. No new data is fetched and no column is added. Filtering early saves budget on the expensive enrichment steps that follow.

Inputs

The rules are read from the node's config object and applied row-by-row in a fixed order. Every key is optional; an empty rule is a no-op.

Standard rules

Key	Type	Behaviour
`require_phone`	`bool`	Keep rows where `telephone` is non-empty.
`require_site`	`bool`	Keep rows where `site_web` is non-empty.
`require_email`	`bool`	Keep rows where `email` is non-empty.
`exclude_aggregators`	`bool`	Drop rows whose `site_web` points to a known aggregator domain.
`alive_only`	`bool`	Keep rows whose dead-check `status` is `alive` or `stale`.
`has_personal_email`	`bool`	Keep rows where at least one address in `email` is a personal mailbox (not role-based).
`rating_min`	`float`	Keep rows where `note >= rating_min`.
`reviews_min`	`int`	Keep rows where `nb_avis >= reviews_min`.
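
The row-level checks in this table can be sketched as below. This is only an illustration under stated assumptions: rows are string dictionaries as produced by `csv.DictReader`, the helper names `keep_row` and `_to_float` are hypothetical, and treating an unparseable `note` or `nb_avis` as a failed check is a guess rather than documented behaviour. `exclude_aggregators` and `has_personal_email` are left out because they depend on lookup data not described here.

```python
from typing import Any, Mapping, Optional


def _to_float(value: Optional[str]) -> Optional[float]:
    """Parse a CSV cell as a number; return None when it is missing or invalid."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def keep_row(row: Mapping[str, str], rules: Mapping[str, Any]) -> bool:
    """Return True when the row passes every standard rule that is set."""
    # Requirement flags: each flag names the column that must be non-empty.
    required = {
        "require_phone": "telephone",
        "require_site": "site_web",
        "require_email": "email",
    }
    for key, column in required.items():
        if rules.get(key) and not (row.get(column) or "").strip():
            return False

    # alive_only keeps rows whose dead-check status is alive or stale.
    if rules.get("alive_only") and row.get("status") not in {"alive", "stale"}:
        return False

    # Numeric thresholds. Assumption: an unparseable cell fails the check.
    for key, column in (("rating_min", "note"), ("reviews_min", "nb_avis")):
        threshold = rules.get(key)
        if threshold is None:
            continue
        value = _to_float(row.get(column))
        if value is None or value < float(threshold):
            return False
    return True


rows = [
    {"telephone": "0612345678", "email": "", "note": "4.6", "nb_avis": "120"},
    {"telephone": "", "email": "a@example.com", "note": "4.9", "nb_avis": "8"},
    {"telephone": "0698765432", "email": "", "note": "3.2", "nb_avis": "40"},
]
kept = [r for r in rows if keep_row(r, {"require_phone": True, "rating_min": 4.0})]
```

With these sample rows, only the first one is kept: the second has no phone number and the third falls below `rating_min`.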

Advanced rules

Key	Shape	Behaviour
`phone_prefix`	`{ column?, prefixes[], prefix_unparseable_keep? }`	Keep rows whose phone column starts with one of `prefixes` (e.g. `06`, `+33`). Requires the `phonenumbers` library on the worker — otherwise the rule is logged and skipped.
`email_domain`	`{ column?, include[], exclude[], reject_disposable? }`	Keep rows whose email domain is in `include` (if set) and not in `exclude`. `reject_disposable` drops known throwaway providers.
`category`	`{ column, values[] }`	Keep rows whose `column` value is contained in `values`.
`dedup_column`	`string`	Collapse rows that share the same value on this column (first row wins).
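
As a rough illustration of two of these rules, the sketch below applies `email_domain` and `dedup_column` to a list of row dictionaries. The function names are hypothetical. It also makes several assumptions that are not documented behaviour: `column` defaults to `email`, each cell holds a single address, domains are compared case-insensitively, and rows without a usable address are dropped. `reject_disposable` and `phone_prefix` are omitted because they rely on a provider list and on the `phonenumbers` library respectively.

```python
from typing import Any, Dict, List, Mapping, Sequence


def email_domain_ok(
    email: str, include: Sequence[str] = (), exclude: Sequence[str] = ()
) -> bool:
    """Check one address against the include/exclude domain lists."""
    local, sep, domain = email.strip().lower().rpartition("@")
    if not sep or not local or not domain:
        return False  # assumption: no usable address means the row is dropped
    if include and domain not in {d.lower() for d in include}:
        return False
    return domain not in {d.lower() for d in exclude}


def apply_email_domain(
    rows: Sequence[Dict[str, str]], rule: Mapping[str, Any]
) -> List[Dict[str, str]]:
    """Keep rows whose email column passes the email_domain rule."""
    column = rule.get("column") or "email"  # assumed default column
    include = rule.get("include") or ()
    exclude = rule.get("exclude") or ()
    return [r for r in rows if email_domain_ok(r.get(column) or "", include, exclude)]


def dedup_rows(rows: Sequence[Dict[str, str]], column: str) -> List[Dict[str, str]]:
    """Collapse rows sharing the same value in `column`; the first row wins."""
    seen = set()
    kept = []
    for row in rows:
        key = row.get(column, "")
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept
```

Note that in this sketch all rows with an empty value in the dedup column collapse into one; whether the real module treats empty values that way is not stated here.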

Sampling

Key	Type	Behaviour
`sample_type`	`"n" | "pct" | ""`	Selects which sampling mode applies after the rules above.
`sample_n`	`int`	Keep the first `n` matched rows.
`sample_pct`	`0..100`	Keep a percentage of matched rows.
`sample_seed`	`int`	Seed for reproducible random sampling.
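
One possible reading of the sampling keys is sketched below. The table does not say which mode uses `sample_seed`, so this sketch assumes `"n"` is a deterministic head cut and `"pct"` is a seeded random draw that preserves the upstream row order. The function name `apply_sampling` and the rounding of the sample size are likewise assumptions.

```python
import random
from typing import Any, Dict, List, Mapping, Sequence


def apply_sampling(
    rows: Sequence[Dict[str, Any]], rules: Mapping[str, Any]
) -> List[Dict[str, Any]]:
    """Apply the sampling mode selected by sample_type to the matched rows."""
    mode = rules.get("sample_type") or ""

    if mode == "n":
        n = rules.get("sample_n")
        if n is None:
            return list(rows)  # nothing to cut without a count
        return list(rows[: max(int(n), 0)])

    if mode == "pct":
        # Clamp the percentage to the documented 0..100 range.
        pct = min(max(float(rules.get("sample_pct", 100)), 0.0), 100.0)
        size = round(len(rows) * pct / 100)
        # A fixed seed makes the same rows come back on every run.
        rng = random.Random(rules.get("sample_seed"))
        picked = sorted(rng.sample(range(len(rows)), size))
        return [rows[i] for i in picked]

    return list(rows)  # "" or unknown mode: no sampling
```

For example, `apply_sampling(rows, {"sample_type": "pct", "sample_pct": 25, "sample_seed": 42})` on 20 rows returns 5 rows, and returns the same 5 each time it is called with that seed.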

Order of application: requirement flags → aggregator/alive/rating/reviews → personal-email → phone_prefix → email_domain → category → dedup_column → sampling.

Outputs

The module writes a CSV with the same columns as the upstream node, containing only the matched rows. It does not produce new fields (`needs: []`, `produces: []`, `pipeline_passthrough: true`).

Field	Value
`output_filename`	`results_<label>.csv` (same shape as upstream)
`n_items`	Number of rows kept
`progress_unit`	`lignes` ("rows")
`results_unit`	`lignes gardées` ("rows kept")

Lifecycle

Standard job lifecycle — see /docs/concepts/jobs-lifecycle. Filter jobs run in the parallel pool, are created by the pipeline runner, and are not surfaced on the dashboard or in "New job".

Pipeline

Attribute	Value
`category`	`process`
`pipelinable`	`true`
`pipeline_passthrough`	`true`
`needs`	`[]` (works on any upstream type)
`produces`	`[]`
`hidden_from_new_job`	`true`
`hidden_from_dashboard`	`true`

`filter` accepts any upstream module. The UI only exposes advanced rules whose target field is actually present in the upstream output. For example, the `phone_prefix` block is shown only if an upstream node produces a phone field.

Endpoints

`filter` is a pipeline-internal job type (see /docs/concepts/pipeline-orchestration). It has no public `POST /api/jobs/filter` endpoint: filter jobs are created by the pipeline runner and configured through the pipeline definition.

Two sets of endpoints are user-facing: the filter preview and the generic job endpoints.

Filter preview

POST /api/pipelines/{pipeline_id}/nodes/{node_id}/filter-preview

Applies a rule set to the upstream node's CSV in memory and returns how many rows would match — without creating a job. Used by the editor for live feedback while rules are being edited.

Request:

{
  "rules": {
    "require_email": true,
    "phone_prefix": { "column": "telephone", "prefixes": ["06", "+33"] },
    "sample_type": "pct",
    "sample_pct": 25
  }
}

Response:

Field	Type	Notes
`total`	`int`	Rows read from the upstream CSV.
`matched`	`int`	Rows kept after the rules.
`samples`	`array`	Up to 5 matched rows, with empty columns stripped.
`predecessor_job_id`	`string`	Job whose CSV was previewed.
`fieldnames`	`string[]`	Columns of the upstream CSV.
`capped`	`bool`	`true` if the read hit the row cap (see Limits).
`reason`	`string`	Present only when `total = 0`: `no_predecessor`, `no_data_yet`, or `no_csv_found`.

Errors: 404 if the pipeline or node is missing, 403 if the caller does not own the pipeline, 400 if the node is not of type filter, 400 if the rules are malformed.
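
A client for the preview endpoint might look like the following sketch, which uses only the Python standard library. The helper names `build_preview_request` and `summarize_preview` and the base URL are hypothetical, and authentication is omitted because it is not described in this section. The request is built but not sent, so the example runs without a server.

```python
import json
import urllib.request
from typing import Any, Mapping


def build_preview_request(
    base_url: str, pipeline_id: str, node_id: str, rules: Mapping[str, Any]
) -> urllib.request.Request:
    """Build the POST request for the filter-preview endpoint."""
    url = (
        f"{base_url.rstrip('/')}/api/pipelines/{pipeline_id}"
        f"/nodes/{node_id}/filter-preview"
    )
    body = json.dumps({"rules": dict(rules)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def summarize_preview(payload: Mapping[str, Any]) -> str:
    """Turn a preview response into a one-line message for the user."""
    if payload.get("total", 0) == 0:
        # `reason` is only present when no rows were read.
        return f"no rows to preview ({payload.get('reason', 'unknown reason')})"
    note = " (first rows only, preview capped)" if payload.get("capped") else ""
    return f"{payload['matched']} of {payload['total']} rows match{note}"


request = build_preview_request(
    "https://example.invalid",  # placeholder host, not a real deployment
    "p1",
    "n2",
    {"require_email": True, "sample_type": "pct", "sample_pct": 25},
)
```

Sending the request would then be a matter of passing `request` to `urllib.request.urlopen` with whatever credentials the deployment expects, and feeding the decoded JSON body to `summarize_preview`.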

Pipeline job items

Once the pipeline runs the filter node, the resulting CSV is served by the generic job endpoints:

GET /api/jobs/{job_id}/items
GET /api/jobs/{job_id}/output-columns
GET /api/jobs/{job_id}/download

Limits

Global limits — see /docs/concepts/limits. Module-specific:

Limit	Value
Preview row cap	`5000` rows (`_PREVIEW_ROWS_LIMIT`). When the upstream CSV is larger, the preview reads the first 5000 rows and sets `capped: true`. The full filter job, when executed, applies the rules to every row.
`phone_prefix` dependency	Requires the `phonenumbers` package on the worker. If missing, the rule is logged and ignored — the rest of the rules still apply.
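
The preview row cap could be implemented along these lines. This is a sketch rather than the module's code: `read_preview_rows` is a hypothetical name, the constant mirrors the documented `_PREVIEW_ROWS_LIMIT`, and it assumes `capped` is set only when the file holds more rows than the cap (a file with exactly 5000 rows is not reported as capped).

```python
import csv
import io
from itertools import islice
from typing import Dict, List, TextIO, Tuple

PREVIEW_ROWS_LIMIT = 5000  # mirrors the documented _PREVIEW_ROWS_LIMIT


def read_preview_rows(
    handle: TextIO, limit: int = PREVIEW_ROWS_LIMIT
) -> Tuple[List[Dict[str, str]], bool]:
    """Read at most `limit` rows and report whether more rows were left unread."""
    reader = csv.DictReader(handle)
    rows = list(islice(reader, limit))
    # One extra read tells us whether the file was larger than the cap.
    capped = next(reader, None) is not None
    return rows, capped


# Small in-memory CSV with 7 data rows, previewed with a cap of 5.
sample = "telephone,email\n" + "\n".join(
    f"06{i:08d},user{i}@example.com" for i in range(7)
)
rows, capped = read_preview_rows(io.StringIO(sample), limit=5)
```

Here `rows` holds the first 5 rows and `capped` is `True`, because two rows were left unread.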

Errors

Code	Cause
`400`	The node referenced by the preview is not a `filter` node, or `rules` is not a valid object.
`403`	The pipeline does not belong to the caller and the caller is not an admin.
`404`	The pipeline or the node does not exist.
`500`	The upstream CSV could not be read (corrupted file, missing on disk).

A filter job itself fails only if the upstream CSV is unreadable; invalid rule values are coerced to no-ops rather than raising.
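
The "coerce to no-op" behaviour can be pictured with a small normalisation step like the one below. It is a sketch under assumptions: the function name `coerce_rules` is hypothetical, only the boolean flags and the two numeric thresholds are covered, and requiring a literal `True` for flags is a guess at how strict the real coercion is.

```python
from typing import Any, Dict, Mapping

_FLAGS = (
    "require_phone",
    "require_site",
    "require_email",
    "exclude_aggregators",
    "alive_only",
    "has_personal_email",
)


def coerce_rules(raw: Mapping[str, Any]) -> Dict[str, Any]:
    """Return only the rules with usable values; anything else is dropped."""
    clean: Dict[str, Any] = {}

    # Flags: assumption that only a literal True enables the rule.
    for key in _FLAGS:
        if raw.get(key) is True:
            clean[key] = True

    # Thresholds: a missing or unparseable value turns the rule into a no-op
    # instead of raising.
    for key, cast in (("rating_min", float), ("reviews_min", int)):
        try:
            clean[key] = cast(raw[key])
        except (KeyError, TypeError, ValueError):
            pass
    return clean
```

Given `{"require_email": True, "require_phone": "yes", "rating_min": "abc", "reviews_min": "12"}`, this returns `{"require_email": True, "reviews_min": 12}`: the malformed flag and threshold are silently ignored and the valid ones are kept.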

What's next

Sort — order the filtered rows and optionally keep the top N.
Import — bring an external CSV into a pipeline so it can be filtered like any other source.