Filter
Purpose
The filter module narrows a dataset to the rows that match a set of rules. It is pipeline-internal (see /docs/concepts/pipeline-orchestration): it consumes the CSV produced by an upstream node and emits a strict subset, with the same columns. No new data is fetched and no column is added. Filtering early saves budget on the expensive enrichment steps that follow.
Inputs
The rules are read from the node's config object and applied row-by-row in a fixed order. Every key is optional; an empty rule is a no-op.
Standard rules
| Key | Type | Behaviour |
|---|---|---|
| require_phone | bool | Keep rows where telephone is non-empty. |
| require_site | bool | Keep rows where site_web is non-empty. |
| require_email | bool | Keep rows where email is non-empty. |
| exclude_aggregators | bool | Drop rows whose site_web points to a known aggregator domain. |
| alive_only | bool | Keep rows whose dead-check status is alive or stale. |
| has_personal_email | bool | Keep rows where at least one address in email is a personal mailbox (not role-based). |
| rating_min | float | Keep rows where note >= rating_min. |
| reviews_min | int | Keep rows where nb_avis >= reviews_min. |
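The requirement flags and numeric thresholds above can be sketched as a per-row predicate. This is a minimal illustration, not the module's implementation: the names keep_row and _to_float are hypothetical, and dropping rows whose note or nb_avis is missing or non-numeric is an assumption.

```python
from typing import Mapping, Optional


def _to_float(value: str) -> Optional[float]:
    """Parse a CSV cell as a number; return None when it is empty or invalid."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def keep_row(row: Mapping[str, str], rules: Mapping[str, object]) -> bool:
    """Return True if `row` passes the requirement flags and thresholds."""
    # Requirement flags: the named column must be non-empty.
    for flag, column in (
        ("require_phone", "telephone"),
        ("require_site", "site_web"),
        ("require_email", "email"),
    ):
        if rules.get(flag) and not (row.get(column) or "").strip():
            return False

    # Numeric thresholds. Assumption: a missing or unparseable cell fails
    # the rule; the real module may treat such rows differently.
    for key, column in (("rating_min", "note"), ("reviews_min", "nb_avis")):
        minimum = rules.get(key)
        if minimum is None:
            continue  # rule not set: no-op
        value = _to_float(row.get(column) or "")
        if value is None or value < float(minimum):
            return False
    return True
```

With an empty rules object every row is kept, which matches the statement that every key is optional.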
Advanced rules
| Key | Shape | Behaviour |
|---|---|---|
| phone_prefix | { column?, prefixes[], prefix_unparseable_keep? } | Keep rows whose phone column starts with one of prefixes (e.g. 06, +33). Requires the phonenumbers library on the worker; otherwise the rule is logged and skipped. |
| email_domain | { column?, include[], exclude[], reject_disposable? } | Keep rows whose email domain is in include (if set) and not in exclude. reject_disposable drops known throwaway providers. |
| category | { column, values[] } | Keep rows whose column value is contained in values. |
| dedup_column | string | Collapse rows that share the same value on this column (first row wins). |
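Two of the advanced rules, email_domain and dedup_column, can be illustrated as follows. This is a sketch under stated assumptions: the function names are hypothetical, DISPOSABLE_DOMAINS is a stand-in for whatever provider list the module uses, the default column is assumed to be email, each cell is assumed to hold a single address, and rows without a parseable address are assumed to be dropped.

```python
from typing import Iterable, Iterator, Mapping

# Hypothetical stand-in for the module's list of throwaway providers.
DISPOSABLE_DOMAINS = {"mailinator.com", "yopmail.com"}


def email_domain_ok(row: Mapping[str, str], rule: Mapping[str, object]) -> bool:
    """Apply include / exclude / reject_disposable to one row."""
    column = rule.get("column") or "email"  # assumed default column
    address = (row.get(column) or "").strip().lower()
    if "@" not in address:
        return False  # assumption: no parseable address means no match
    domain = address.rsplit("@", 1)[1]
    include = {d.lower() for d in (rule.get("include") or [])}
    exclude = {d.lower() for d in (rule.get("exclude") or [])}
    if include and domain not in include:
        return False
    if domain in exclude:
        return False
    if rule.get("reject_disposable") and domain in DISPOSABLE_DOMAINS:
        return False
    return True


def dedup(rows: Iterable[Mapping[str, str]], column: str) -> Iterator[Mapping[str, str]]:
    """Yield only the first row seen for each value of `column`."""
    seen = set()
    for row in rows:
        key = row.get(column, "")
        if key in seen:
            continue  # a previous row already claimed this value
        seen.add(key)
        yield row
```

Because dedup is a generator that remembers only the keys it has seen, it preserves the input order and keeps the first occurrence, in line with "first row wins".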
Sampling
| Key | Type | Behaviour |
|---|---|---|
| sample_type | "n" \| "pct" \| "" | Selects which sampling mode applies after the rules above. |
| sample_n | int | Keep the first n matched rows. |
| sample_pct | 0..100 | Keep a percentage of matched rows. |
| sample_seed | int | Seed for reproducible random sampling. |
Order of application: requirement flags → aggregator/alive/rating/reviews → personal-email → phone_prefix → email_domain → category → dedup_column → sampling.
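Sampling runs last, on the rows that survived every other rule. The sketch below shows one way the three modes could behave. It assumes that "n" keeps the first rows as listed in the table, that "pct" draws a seeded random sample and preserves the original row order, and that a missing sample_n or sample_pct makes the step a no-op. The function name sample_rows is hypothetical.

```python
import random
from typing import List, Mapping, Sequence


def sample_rows(rows: Sequence[dict], rules: Mapping[str, object]) -> List[dict]:
    """Apply sample_type / sample_n / sample_pct / sample_seed to matched rows."""
    mode = rules.get("sample_type") or ""

    if mode == "n":
        n = rules.get("sample_n")
        if n is None:
            return list(rows)  # assumption: unset value means no sampling
        return list(rows[: max(int(n), 0)])

    if mode == "pct":
        pct = rules.get("sample_pct")
        if pct is None:
            return list(rows)
        pct = min(max(float(pct), 0.0), 100.0)  # clamp to 0..100
        k = round(len(rows) * pct / 100)
        # A dedicated Random instance keeps the draw reproducible for a
        # given seed without touching the global random state.
        rng = random.Random(rules.get("sample_seed"))
        picked = sorted(rng.sample(range(len(rows)), k))
        return [rows[i] for i in picked]

    return list(rows)  # "" or unknown mode: keep everything
```

Calling sample_rows twice with the same sample_seed returns the same subset, which is the property the seed exists to provide.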
Outputs
The module writes a CSV with the same columns as the upstream node, containing only the matched rows. It does not produce new fields (needs: [], produces: [], pipeline_passthrough: true).
| Field | Value |
|---|---|
| output_filename | results_<label>.csv (same shape as upstream) |
| n_items | Number of rows kept |
| progress_unit | lignes |
| results_unit | lignes gardées |
Lifecycle
Standard job lifecycle — see /docs/concepts/jobs-lifecycle. Filter jobs run in the parallel pool, are created by the pipeline runner, and are not surfaced on the dashboard or in "New job".
Pipeline
| Attribute | Value |
|---|---|
| category | process |
| pipelinable | true |
| pipeline_passthrough | true |
| needs | [] (works on any upstream type) |
| produces | [] |
| hidden_from_new_job | true |
| hidden_from_dashboard | true |
filter accepts any upstream module. The UI only exposes advanced rules whose target field is actually present in the upstream output — for example, the phone_prefix block is shown only if an upstream node produces a phone field.
Endpoints
filter is a pipeline-internal job type (see /docs/concepts/pipeline-orchestration). It has no public POST /api/jobs/filter endpoint: filter jobs are created by the pipeline runner and configured through the pipeline definition.
Two groups of endpoints are user-facing:
Filter preview
POST /api/pipelines/{pipeline_id}/nodes/{node_id}/filter-preview
Applies a rule set to the upstream node's CSV in memory and returns how many rows would match — without creating a job. Used by the editor for live feedback while rules are being edited.
Request:
{
  "rules": {
    "require_email": true,
    "phone_prefix": { "column": "telephone", "prefixes": ["06", "+33"] },
    "sample_type": "pct",
    "sample_pct": 25
  }
}
Response:
| Field | Type | Notes |
|---|---|---|
| total | int | Rows read from the upstream CSV. |
| matched | int | Rows kept after the rules. |
| samples | array | Up to 5 matched rows, with empty columns stripped. |
| predecessor_job_id | string | Job whose CSV was previewed. |
| fieldnames | string[] | Columns of the upstream CSV. |
| capped | bool | true if the read hit the row cap (see Limits). |
| reason | string | Present only when total = 0: no_predecessor, no_data_yet, or no_csv_found. |
Errors: 404 if the pipeline or node is missing, 403 if the caller does not own the pipeline and is not an admin, 400 if the node is not of type filter or the rules are malformed.
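A client could prepare the preview call and interpret the response roughly as follows. This is a sketch only: BASE_URL is a placeholder, authentication is not covered by this page and is omitted, and the helper names build_preview_request and summarize_preview are hypothetical. The request is built but not sent.

```python
import json
from urllib.request import Request

BASE_URL = "https://example.invalid"  # placeholder host, not a real deployment


def build_preview_request(pipeline_id: str, node_id: str, rules: dict) -> Request:
    """Build (without sending) the POST request for the filter preview."""
    url = f"{BASE_URL}/api/pipelines/{pipeline_id}/nodes/{node_id}/filter-preview"
    body = json.dumps({"rules": rules}).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def summarize_preview(response: dict) -> str:
    """Turn a decoded preview response into a one-line status message."""
    total = response.get("total", 0)
    if total == 0:
        # `reason` is only present when nothing could be read.
        return f"nothing to preview ({response.get('reason', 'unknown')})"
    text = f"{response.get('matched', 0)} of {total} rows match"
    if response.get("capped"):
        text += " (preview capped; the upstream CSV has more rows)"
    return text
```

Checking capped before presenting the ratio matters because, when the cap is hit, total and matched describe only the rows that were read, not the whole upstream file.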
Pipeline job items
Once the pipeline runs the filter node, the resulting CSV is served by the generic job endpoints:
GET /api/jobs/{job_id}/items
GET /api/jobs/{job_id}/output-columns
GET /api/jobs/{job_id}/download
Limits
Global limits — see /docs/concepts/limits. Module-specific:
| Limit | Value |
|---|---|
| Preview row cap | 5000 rows (_PREVIEW_ROWS_LIMIT). When the upstream CSV is larger, the preview reads the first 5000 rows and sets capped: true. The full filter job, when executed, applies the rules to every row. |
| phone_prefix dependency | Requires the phonenumbers package on the worker. If missing, the rule is logged and ignored; the remaining rules still apply. |
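The preview row cap can be pictured as a bounded read of the upstream CSV. The sketch below is an assumption about how such a cap might work, not the module's code: PREVIEW_ROWS_LIMIT mirrors the documented value, read_preview_rows is a hypothetical name, and capped is set here only when at least one row exists beyond the limit.

```python
import csv
import io
from itertools import islice

PREVIEW_ROWS_LIMIT = 5000  # mirrors the documented cap


def read_preview_rows(handle, limit=PREVIEW_ROWS_LIMIT):
    """Read at most `limit` rows; report the columns and whether rows remain."""
    reader = csv.DictReader(handle)
    rows = list(islice(reader, limit))
    # Peek one row further to learn whether the file continues past the cap.
    capped = next(reader, None) is not None
    return rows, list(reader.fieldnames or []), capped


# Small demonstration with an in-memory CSV and a tiny cap.
demo = io.StringIO("telephone,email\n0611,a@x.fr\n0622,b@x.fr\n0633,c@x.fr\n")
rows, fieldnames, capped = read_preview_rows(demo, limit=2)
print(len(rows), fieldnames, capped)  # 2 ['telephone', 'email'] True
```

Using islice avoids loading the whole file, so the preview cost stays bounded regardless of the upstream size.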
Errors
| Code | Cause |
|---|---|
| 400 | The node referenced by the preview is not a filter node, or rules is not a valid object. |
| 403 | The pipeline does not belong to the caller and the caller is not admin. |
| 404 | The pipeline or the node does not exist. |
| 500 | The upstream CSV could not be read (corrupted file, missing on disk). |
A filter job itself fails only if the upstream CSV is unreadable; invalid rule values are coerced to no-ops rather than raising.
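Coercing invalid values to no-ops could look like the following for the two numeric thresholds. This is an illustrative sketch: coerce_rules is a hypothetical name, and the choice to drop an unparseable key entirely (so the rule never runs) is an assumption about what "no-op" means here.

```python
from typing import Any, Dict


def coerce_rules(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `raw` where invalid numeric thresholds are removed."""
    clean = dict(raw)
    for key, caster in (("rating_min", float), ("reviews_min", int)):
        if key not in clean:
            continue
        value = clean[key]
        try:
            if isinstance(value, bool):
                # bool is a subclass of int in Python; treat it as invalid here.
                raise TypeError("boolean is not a valid threshold")
            clean[key] = caster(value)
        except (TypeError, ValueError):
            # Invalid value: drop the key so the rule becomes a no-op
            # instead of failing the whole job.
            del clean[key]
    return clean
```

For example, coerce_rules({"rating_min": "abc", "reviews_min": "10"}) keeps reviews_min as the integer 10 and silently discards rating_min.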