
Filter

Purpose

The filter module narrows a dataset to the rows that match a set of rules. It is pipeline-internal (see /docs/concepts/pipeline-orchestration): it consumes the CSV produced by an upstream node and emits a strict subset, with the same columns. No new data is fetched and no column is added. Filtering early saves budget on the expensive enrichment steps that follow.

Inputs

The rules are read from the node's config object and applied row-by-row in a fixed order. Every key is optional; an empty rule is a no-op.

Standard rules

| Key | Type | Behaviour |
| --- | --- | --- |
| require_phone | bool | Keep rows where telephone is non-empty. |
| require_site | bool | Keep rows where site_web is non-empty. |
| require_email | bool | Keep rows where email is non-empty. |
| exclude_aggregators | bool | Drop rows whose site_web points to a known aggregator domain. |
| alive_only | bool | Keep rows whose dead-check status is alive or stale. |
| has_personal_email | bool | Keep rows where at least one address in email is a personal mailbox (not role-based). |
| rating_min | float | Keep rows where note >= rating_min. |
| reviews_min | int | Keep rows where nb_avis >= reviews_min. |
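
To make the row-by-row behaviour concrete, here is a minimal Python sketch of how the requirement flags and the two numeric thresholds might be evaluated. The function name `passes_standard_rules` and the helper `_to_number` are hypothetical, and treating a missing or unparseable note or nb_avis as a failed threshold is an assumption, not documented behaviour.

```python
from typing import Any, Optional


def _to_number(value: Any) -> Optional[float]:
    """Parse a CSV cell as a number; return None when it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def passes_standard_rules(row: dict[str, str], rules: dict[str, Any]) -> bool:
    """Return True when the row satisfies every enabled rule (hypothetical helper)."""
    # Requirement flags: the named column must be non-empty.
    for flag, column in (
        ("require_phone", "telephone"),
        ("require_site", "site_web"),
        ("require_email", "email"),
    ):
        if rules.get(flag) and not (row.get(column) or "").strip():
            return False

    # Numeric thresholds: column value must be >= the configured minimum.
    # Assumption: a missing or unparseable value fails the threshold.
    for key, column in (("rating_min", "note"), ("reviews_min", "nb_avis")):
        minimum = rules.get(key)
        if minimum is None:
            continue
        value = _to_number(row.get(column))
        if value is None or value < float(minimum):
            return False

    return True
```

An empty rules object lets every row through, which matches the "every key is optional" behaviour described above.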

Advanced rules

| Key | Shape | Behaviour |
| --- | --- | --- |
| phone_prefix | { column?, prefixes[], prefix_unparseable_keep? } | Keep rows whose phone column starts with one of prefixes (e.g. 06, +33). Requires the phonenumbers library on the worker; otherwise the rule is logged and skipped. |
| email_domain | { column?, include[], exclude[], reject_disposable? } | Keep rows whose email domain is in include (if set) and not in exclude. reject_disposable drops known throwaway providers. |
| category | { column, values[] } | Keep rows whose column value is contained in values. |
| dedup_column | string | Collapse rows that share the same value on this column (first row wins). |
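
As an illustration of two of these rules, the sketch below shows one way the email_domain include/exclude check and the dedup_column "first row wins" collapse could work. Both function names are hypothetical. The sketch assumes a single address per cell, ignores reject_disposable, and treats empty values like any other value when deduplicating; the real module may differ on all three points.

```python
from typing import Any


def email_domain_ok(address: str, rule: dict[str, Any]) -> bool:
    """Check one address against include/exclude lists (hypothetical helper)."""
    # Everything after the last "@" is taken as the domain; no "@" means no domain.
    domain = address.rsplit("@", 1)[-1].strip().lower() if "@" in address else ""
    include = {d.lower() for d in rule.get("include", [])}
    exclude = {d.lower() for d in rule.get("exclude", [])}
    if include and domain not in include:
        return False
    return domain not in exclude


def dedup_rows(rows: list[dict[str, str]], column: str) -> list[dict[str, str]]:
    """Keep the first row for each distinct value of `column`."""
    seen: set[str] = set()
    kept = []
    for row in rows:
        key = row.get(column, "")
        if key in seen:
            continue  # a previous row already claimed this value
        seen.add(key)
        kept.append(row)
    return kept
```

Because the first row wins, the result of dedup_column depends on the order in which the upstream CSV lists its rows.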

Sampling

| Key | Type | Behaviour |
| --- | --- | --- |
| sample_type | "n" \| "pct" \| "" | Selects which sampling mode applies after the rules above. |
| sample_n | int | Keep the first n matched rows. |
| sample_pct | 0..100 | Keep a percentage of matched rows. |
| sample_seed | int | Seed for reproducible random sampling. |

Order of application: requirement flags → aggregator/alive/rating/reviews → personal-email → phone_prefix → email_domain → category → dedup_column → sampling.
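
Since sampling runs last, it only ever sees rows that already passed the rules. The following sketch shows one plausible reading of the sampling keys. The function name `apply_sampling` is hypothetical, and several details are assumptions: that sample_seed drives a random draw in "pct" mode, that the draw preserves the original row order, and that the row count is rounded to the nearest integer.

```python
import random
from typing import Any, Sequence, TypeVar

T = TypeVar("T")


def apply_sampling(rows: Sequence[T], rules: dict[str, Any]) -> list[T]:
    """Apply the sampling step to rows that already matched the rules."""
    mode = rules.get("sample_type") or ""

    if mode == "n":
        # "Keep the first n matched rows": a plain head, no randomness.
        n = rules.get("sample_n")
        return list(rows) if n is None else list(rows[: max(int(n), 0)])

    if mode == "pct":
        # Assumption: a seeded random draw of pct% of the rows, kept in order.
        pct = min(max(float(rules.get("sample_pct", 100)), 0.0), 100.0)
        count = round(len(rows) * pct / 100)
        rng = random.Random(rules.get("sample_seed"))
        picked = sorted(rng.sample(range(len(rows)), count))
        return [rows[i] for i in picked]

    # Empty sample_type: sampling is disabled.
    return list(rows)
```

With a fixed sample_seed, two runs over the same matched rows select the same subset, which is what makes a sampled pipeline reproducible.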

Outputs

The module writes a CSV with the same columns as the upstream node, containing only the matched rows. It does not produce new fields (needs: [], produces: [], pipeline_passthrough: true).

| Field | Value |
| --- | --- |
| output_filename | results_<label>.csv (same shape as upstream) |
| n_items | Number of rows kept |
| progress_unit | lignes ("rows") |
| results_unit | lignes gardées ("rows kept") |

Lifecycle

Standard job lifecycle — see /docs/concepts/jobs-lifecycle. Filter jobs run in the parallel pool, are created by the pipeline runner, and are not surfaced on the dashboard or in "New job".

Pipeline

| Attribute | Value |
| --- | --- |
| category | process |
| pipelinable | true |
| pipeline_passthrough | true |
| needs | [] (works on any upstream type) |
| produces | [] |
| hidden_from_new_job | true |
| hidden_from_dashboard | true |

filter accepts any upstream module. The UI only exposes advanced rules whose target field is actually present in the upstream output — for example, the phone_prefix block is shown only if an upstream node produces a phone field.

Endpoints

filter is a pipeline-internal job type (see /docs/concepts/pipeline-orchestration). It has no public POST /api/jobs/filter endpoint: filter jobs are created by the pipeline runner and configured through the pipeline definition.

Two endpoints are user-facing:

Filter preview

POST /api/pipelines/{pipeline_id}/nodes/{node_id}/filter-preview

Applies a rule set to the upstream node's CSV in memory and returns how many rows would match — without creating a job. Used by the editor for live feedback while rules are being edited.

Request:

{
  "rules": {
    "require_email": true,
    "phone_prefix": { "column": "telephone", "prefixes": ["06", "+33"] },
    "sample_type": "pct",
    "sample_pct": 25
  }
}

Response:

| Field | Type | Notes |
| --- | --- | --- |
| total | int | Rows read from the upstream CSV. |
| matched | int | Rows kept after the rules. |
| samples | array | Up to 5 matched rows, with empty columns stripped. |
| predecessor_job_id | string | Job whose CSV was previewed. |
| fieldnames | string[] | Columns of the upstream CSV. |
| capped | bool | true if the read hit the row cap (see Limits). |
| reason | string | Present only when total = 0: no_predecessor, no_data_yet, or no_csv_found. |

Errors: 404 if the pipeline or node is missing, 403 if the caller does not own the pipeline, 400 if the node is not of type filter, 400 if the rules are malformed.
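
For readers scripting against the preview endpoint, the sketch below builds the POST request with Python's standard library and turns a response payload into a one-line summary. The helper names and the base URL are hypothetical, and authentication is deliberately left out because this page does not describe it; the path and the response fields are the ones documented above.

```python
import json
import urllib.request
from typing import Any


def build_preview_request(
    base_url: str, pipeline_id: str, node_id: str, rules: dict[str, Any]
) -> urllib.request.Request:
    """Build (but do not send) the filter-preview request."""
    url = (
        f"{base_url.rstrip('/')}/api/pipelines/{pipeline_id}"
        f"/nodes/{node_id}/filter-preview"
    )
    body = json.dumps({"rules": rules}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize_preview(payload: dict[str, Any]) -> str:
    """Turn a preview response into a short human-readable line."""
    total = payload.get("total", 0)
    if total == 0:
        # `reason` is only present when nothing could be read.
        return f"no rows to preview ({payload.get('reason', 'unknown')})"
    suffix = " (first rows only, capped)" if payload.get("capped") else ""
    return f"{payload.get('matched', 0)}/{total} rows match{suffix}"
```

The request can be sent with `urllib.request.urlopen(request)` once the deployment's authentication headers have been added.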

Pipeline job items

Once the pipeline runs the filter node, the resulting CSV is served by the generic job endpoints.

Limits

Global limits — see /docs/concepts/limits. Module-specific:

| Limit | Value |
| --- | --- |
| Preview row cap | 5000 rows (_PREVIEW_ROWS_LIMIT). When the upstream CSV is larger, the preview reads the first 5000 rows and sets capped: true. The full filter job, when executed, applies the rules to every row. |
| phone_prefix dependency | Requires the phonenumbers package on the worker. If missing, the rule is logged and ignored; the rest of the rules still apply. |
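
The preview row cap can be pictured with the following sketch, which reads at most a fixed number of rows from a CSV and reports whether more rows remained. The function name `read_preview_rows` is hypothetical, and the sketch assumes capped is set only when the file holds more rows than the limit, as the table suggests.

```python
import csv
import io
from itertools import islice
from typing import TextIO

PREVIEW_ROWS_LIMIT = 5000  # mirrors the documented _PREVIEW_ROWS_LIMIT value


def read_preview_rows(
    handle: TextIO, limit: int = PREVIEW_ROWS_LIMIT
) -> tuple[list[dict[str, str]], list[str], bool]:
    """Read up to `limit` rows and report whether the CSV had more."""
    reader = csv.DictReader(handle)
    rows = list(islice(reader, limit))
    # Probe for one extra row: its presence means the read was cut short.
    capped = next(reader, None) is not None
    return rows, list(reader.fieldnames or []), capped


# Usage: a 7-row CSV previewed with a cap of 5 rows.
sample = "a,b\n" + "".join(f"{i},x\n" for i in range(7))
rows, fieldnames, capped = read_preview_rows(io.StringIO(sample), limit=5)
```

When capped is true, total and matched describe only the rows that were read, so the matched count is a lower bound for the full job.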

Errors

| Code | Cause |
| --- | --- |
| 400 | The node referenced by the preview is not a filter node, or rules is not a valid object. |
| 403 | The pipeline does not belong to the caller and the caller is not admin. |
| 404 | The pipeline or the node does not exist. |
| 500 | The upstream CSV could not be read (corrupted file, missing on disk). |

A filter job itself fails only if the upstream CSV is unreadable; invalid rule values are coerced to no-ops rather than raising.
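
The "coerce to no-op" behaviour could look roughly like the sketch below, which drops any standard rule whose value cannot be interpreted instead of raising. The function name `coerce_rules` is hypothetical, only the standard keys are handled, and the exact coercion policy (for example, accepting only a literal true for boolean flags) is an assumption.

```python
from typing import Any

BOOL_KEYS = (
    "require_phone",
    "require_site",
    "require_email",
    "exclude_aggregators",
    "alive_only",
    "has_personal_email",
)


def coerce_rules(raw: Any) -> dict[str, Any]:
    """Return only the rules that parse cleanly; anything else becomes a no-op."""
    if not isinstance(raw, dict):
        return {}  # assumption: an unusable config disables all rules

    clean: dict[str, Any] = {}
    for key in BOOL_KEYS:
        if raw.get(key) is True:  # assumption: only a literal true enables a flag
            clean[key] = True

    for key, cast in (("rating_min", float), ("reviews_min", int)):
        value = raw.get(key)
        if value is None or isinstance(value, bool):
            continue  # missing or boolean thresholds are ignored
        try:
            clean[key] = cast(value)
        except (TypeError, ValueError):
            continue  # unparseable threshold: skip the rule instead of raising

    return clean
```

A side effect of this style of handling is that a typo in a rule value silently widens the output, so checking the matched count in the preview before running the pipeline is worthwhile.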
