Business IDs
The legal_ids module extracts French business identifiers from a list of POIs that already have a website. For each site, the module locates and validates the SIRET (14 digits) and SIREN (9 digits), and records the URL where the identifier was found. Every candidate must pass the Luhn checksum, so most stray digit sequences, such as nine-digit phone numbers, are rejected before they reach the output.
This module is the standard entry point into the legal-data pipeline: once an item carries a verified SIREN, downstream modules such as legal_data can query official registries without a name-matching step.
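The Luhn validation mentioned above can be sketched in a few lines. This is a minimal illustration, not the module's actual code: the function names `luhn_valid`, `is_valid_siren` and `is_valid_siret` are hypothetical, and the sketch assumes candidates have already been stripped of spaces and punctuation.

```python
def luhn_valid(number: str) -> bool:
    """Return True if an ASCII digit string passes the Luhn checksum."""
    if not (number.isascii() and number.isdigit()):
        return False
    total = 0
    # Walk the digits right to left, doubling every second one.
    for position, char in enumerate(reversed(number)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


def is_valid_siren(candidate: str) -> bool:
    """Hypothetical helper: 9 digits that pass the Luhn check."""
    return len(candidate) == 9 and luhn_valid(candidate)


def is_valid_siret(candidate: str) -> bool:
    """Hypothetical helper: 14 digits that pass the Luhn check."""
    return len(candidate) == 14 and luhn_valid(candidate)
```

The Luhn check catches every single-digit typo, but roughly one arbitrary digit string in ten still passes by chance, so it works as a filter rather than as proof that an identifier is registered.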
Website required. This module reads the SIRET/SIREN from each establishment's website (the legal-mentions page), so every input row needs a site_web value. If your list has only names or Google Maps links (no website), use French legal data instead: it returns the same SIRET/SIREN by matching the name against the SIRENE registry, no website needed.
Purpose
- Attach an authoritative French business ID (SIREN/SIRET) to each POI.
- Keep provenance: every identifier ships with the URL it was extracted from.
- Provide a clean, deduplicated key for further enrichment (legal_data, legal_mentions).
Inputs
The module consumes an enriched_list of POI. The minimum field required on each item is site_web; items without a website are filtered out at job creation and counted as ignored.
| Field | Type | Required | Notes |
|---|---|---|---|
| site_web | string | yes | Root domain or any URL on the target site. |
| nom | string | no | Carried through to the output for context. |
| adresse | string | no | Carried through to the output for context. |
A POI list is typically produced by the scrap module, but any enriched_list with site_web populated is accepted.
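Because items without a website are dropped at job creation, it can help to check a list before submitting it. The sketch below is a client-side approximation of that filter; `split_by_website` is a hypothetical helper, and it assumes a blank or whitespace-only site_web counts as missing, which the service may or may not treat the same way.

```python
def split_by_website(items):
    """Partition POI items into (usable, ignored) based on site_web.

    Assumption: an absent, empty, or whitespace-only site_web is "no website".
    """
    usable, ignored = [], []
    for item in items:
        site = str(item.get("site_web") or "").strip()
        (usable if site else ignored).append(item)
    return usable, ignored


# Illustrative data only; the names and domains are made up.
items = [
    {"nom": "Boulangerie A", "site_web": "https://example.com"},
    {"nom": "Garage B", "site_web": ""},
    {"nom": "Cafe C"},
]
usable, ignored = split_by_website(items)
print(f"{len(usable)} of {len(items)} rows have a website")
```

If `usable` comes back empty, the list is a better fit for the name-based legal_data module.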
Outputs
Each input item is returned with three new fields. All other input fields are passed through unchanged.
| Column | Type | Description |
|---|---|---|
| siren | string | 9-digit business identifier, Luhn-validated. Empty if not found. |
| siret | string | 14-digit establishment identifier, Luhn-validated. Empty if not found. |
| siret_source_url | string | URL of the page where the identifier was extracted. Empty if not found. |
When only a SIREN is detected, siret is left empty and downstream modules can still operate on the SIREN alone. Additional company attributes — legal form, RCS number, NAF code, headcount, directors, financials — live in legal_data.
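To show how the three output fields can be consumed, here is a small sketch that tallies rows by the identifier they carry and derives a single join key. The helpers `identifier_level`, `siren_key` and `summarize` are hypothetical; the only outside fact relied on is that a SIRET begins with the nine digits of its company's SIREN.

```python
from collections import Counter


def identifier_level(row):
    """Classify an output row as 'siret', 'siren_only', or 'none'."""
    if row.get("siret"):
        return "siret"
    if row.get("siren"):
        return "siren_only"
    return "none"


def siren_key(row):
    """Return the SIREN for a row, falling back to the SIRET prefix, else None."""
    siren = row.get("siren") or ""
    siret = row.get("siret") or ""
    return siren or siret[:9] or None


def summarize(rows):
    """Count output rows per identifier level."""
    return Counter(identifier_level(row) for row in rows)
```

A row classified as `none` corresponds to an item that finished without an identifier.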
Lifecycle
Standard job lifecycle — see Jobs lifecycle. Failures on individual items do not stop the job: the corresponding output row simply has empty siren/siret fields, and the errors counter tracks items finished without an identifier.
Pipeline
needs: poi_list
produces: enriched_list
legal_ids is positioned in the enrich category. Typical chains:
- scrap → legal_ids → legal_data
- scrap → legal_ids → legal_mentions
Endpoints
Create a job
POST /api/jobs/legal-ids
Body:
| Field | Type | Required | Description |
|---|---|---|---|
| items | array of POI items | yes | Input list. Each item must carry site_web. |
| source_job_id | string (UUID) | no | When chaining from a previous job, the upstream job ID for traceability. |
Response: the standard Job object, including id, status, output_filename, and the cost estimate in equivalent-France units.
The created job is then driven by the usual endpoints (GET /api/jobs/{id}, GET /api/jobs/{id}/download, etc.), shared by every module.
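As an illustration of the request body described above, the following sketch builds the POST request with Python's standard library. The base URL `https://api.example.com` is a placeholder, `build_create_job_request` is a hypothetical helper, and authentication is not covered on this page, so none is shown; add whatever headers your account requires.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder, not the real host


def build_create_job_request(items, source_job_id=None, base_url=BASE_URL):
    """Build (but do not send) a POST /api/jobs/legal-ids request."""
    payload = {"items": items}
    if source_job_id is not None:
        # Optional: upstream job ID when chaining from a previous job.
        payload["source_job_id"] = source_job_id
    return urllib.request.Request(
        url=f"{base_url}/api/jobs/legal-ids",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_create_job_request(
    [{"nom": "Boulangerie A", "site_web": "https://example.com"}]
)
```

Sending it with `urllib.request.urlopen(request)` and decoding the JSON response should yield the Job object, whose id can then be used with GET /api/jobs/{id}.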
Global quotas and per-job ceilings: see Limits.
Errors
| Situation | Behavior |
|---|---|
| Item has no site_web | Dropped before the job starts. If no item remains, the job is rejected. |
| Site reachable, no identifier found | Item finishes with empty siren/siret. Counted in the error tally. |
| POI not registered (no SIREN/SIRET) | Same as above: no identifier is invented; outputs stay empty. |
| Foreign business | Same as above: only French identifiers are recognized. |
| Candidate fails the Luhn check | Discarded silently; no false positive is written to the output. |
| No item with a website | Job fails with an explicit message — e.g. 0 of 120 rows have a website (your list has names, Google Maps links and phones only) — and recommends running legal_data to get the SIRET/SIREN by name instead. |
What's next
- Company data — given a SIREN, fetches the full official record: legal form, RCS number, NAF code, headcount, directors, financials.
- Legal mentions — extracts the legal notice block (publisher, host, contact) from the same sites.