Catalogian supports CSV, TSV, and JSONL — plus gzip-compressed variants of all three. This page covers encoding, delimiters, headers, and edge cases.
| Format | Extensions | Notes |
|---|---|---|
| CSV | .csv | Comma-separated values with optional quoting |
| TSV | .tsv | Tab-separated values |
| JSONL | .jsonl | One JSON object per line (newline-delimited JSON) |
| Gzip CSV | .csv.gz | Gzip-compressed CSV |
| Gzip TSV | .tsv.gz | Gzip-compressed TSV |
| Gzip JSONL | .jsonl.gz | Gzip-compressed JSONL |
For CSV and TSV files, the first row is always treated as the header. Column names from the header become field names in the snapshot. Headers are case-sensitive: SKU and sku are different fields.
Catalogian auto-detects the delimiter by sampling the first few rows. Supported delimiters:
- Comma (,): default for .csv
- Tab (\t): default for .tsv
- Pipe (|)
- Semicolon (;)

You can also set the delimiter explicitly when creating a source via the API by passing the delimiter field.
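To make the sampling idea concrete, here is a minimal sketch of delimiter detection in Python. It is not Catalogian's actual algorithm: the function name `guess_delimiter`, the five-row sample size, and the "most consistent column count wins" rule are all assumptions chosen for illustration.

```python
import csv

# The four delimiters listed above.
CANDIDATES = [",", "\t", "|", ";"]

def guess_delimiter(text: str, sample_rows: int = 5) -> str:
    """Hypothetical helper: pick the candidate that splits the sample consistently."""
    lines = [ln for ln in text.splitlines() if ln][:sample_rows]
    best, best_cols = ",", 1  # fall back to comma when nothing splits the rows
    for delim in CANDIDATES:
        rows = list(csv.reader(lines, delimiter=delim))
        widths = {len(row) for row in rows}
        # A plausible delimiter yields the same column count on every sampled row.
        if len(widths) == 1:
            cols = widths.pop()
            if cols > best_cols:
                best, best_cols = delim, cols
    return best
```

For a feed such as `sku|name|price` followed by `SKU-001|Widget|29.99`, this sketch returns the pipe character. When a feed is ambiguous, passing the delimiter explicitly avoids relying on any heuristic.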
Standard CSV quoting rules apply. Fields containing commas, newlines, or double quotes should be wrapped in double quotes. Double quotes within a quoted field are escaped by doubling them:
sku,name,description
SKU-001,"Widget, Large","A ""premium"" widget"
SKU-002,Gadget,Simple gadget
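If you generate feeds programmatically, a standard CSV library usually applies these quoting rules for you. The sketch below uses Python's built-in `csv` module to write the rows shown above and read them back. The variable names are illustrative only.

```python
import csv
import io

rows = [
    ["sku", "name", "description"],
    ["SKU-001", "Widget, Large", 'A "premium" widget'],
    ["SKU-002", "Gadget", "Simple gadget"],
]

# The default dialect quotes only fields that need it and doubles embedded quotes.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(rows)
text = buf.getvalue()

# Reading the text back recovers the original field values.
parsed = list(csv.reader(io.StringIO(text)))
```

Here `text` contains the same three lines as the example above, and `parsed` equals `rows`.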
If a row has fewer columns than the header, missing fields are stored as empty strings. If a row has more columns than the header, extra values are ignored. Catalogian records a column_mismatch warning (up to 50 per ingest) but still processes the row.
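The padding and truncation behavior can be sketched as follows. This illustrates the rule described above and is not Catalogian's implementation. The function name `normalize_rows` and the shape of the warning dictionaries are assumptions.

```python
import csv
import io

MAX_WARNINGS = 50  # per-ingest cap described above

def normalize_rows(text, delimiter=","):
    """Hypothetical helper: align every row to the header width."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    records, warnings = [], []
    for row_no, row in enumerate(reader, start=2):  # the header is row 1
        if len(row) != len(header):
            if len(warnings) < MAX_WARNINGS:
                warnings.append({
                    "type": "column_mismatch",
                    "row": row_no,
                    "expected": len(header),
                    "found": len(row),
                })
            # Pad short rows with empty strings, drop extra values from long rows.
            row = (row + [""] * len(header))[:len(header)]
        records.append(dict(zip(header, row)))
    return records, warnings
```

Every row is still returned, including the mismatched ones. Only the number of recorded warnings is capped.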
JSONL (JSON Lines) files contain one JSON object per line. Each line must be a valid JSON object. The key field must be a top-level property of each object.
{"sku": "SKU-001", "name": "Widget", "price": 29.99, "in_stock": true}
{"sku": "SKU-002", "name": "Gadget", "price": 14.99, "in_stock": false}
{"sku": "SKU-003", "name": "Doohickey", "price": 9.99, "in_stock": true}
Nested objects: Catalogian flattens nested JSON into dot-notation field names. For example, {"dimensions": {"width": 10}} becomes the field dimensions.width.
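To preview which field names a nested record would produce, a small recursive function can mimic the dot-notation rule. This is a sketch under assumptions: the name `flatten` is hypothetical, and arrays are left untouched because the flattening rule above covers only nested objects.

```python
def flatten(obj, prefix=""):
    """Hypothetical helper: flatten nested dicts into dot-notation keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted prefix.
            flat.update(flatten(value, name))
        else:
            # Scalars (and, by assumption, lists) are kept as-is.
            flat[name] = value
    return flat
```

For example, `flatten({"sku": "SKU-001", "dimensions": {"width": 10}})` yields the keys `sku` and `dimensions.width`.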
Malformed JSON lines generate a malformed_json warning and are skipped. The first 50 warnings per ingest are recorded.
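The skip-and-warn behavior can be approximated locally to check a feed before uploading it. The sketch below is an illustration, not Catalogian's parser: the function name `parse_jsonl`, the warning fields, and the decision to ignore blank lines are all assumptions.

```python
import json

MAX_WARNINGS = 50  # only the first 50 warnings per ingest are recorded

def parse_jsonl(lines):
    """Hypothetical helper: keep valid JSON objects, skip and report the rest."""
    records, warnings = [], []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored silently
        try:
            obj = json.loads(line)
            if not isinstance(obj, dict):
                raise ValueError("line is not a JSON object")
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            if len(warnings) < MAX_WARNINGS:
                warnings.append({"type": "malformed_json", "line": line_no, "detail": str(exc)})
            continue
        records.append(obj)
    return records, warnings
```

Running this over a feed returns the parseable records along with the line numbers of the rejected lines, which is usually enough to locate a truncated or hand-edited line.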
Append .gz to any supported extension. Catalogian detects gzip encoding from the file extension and decompresses on the fly during streaming ingest. No configuration needed.
Gzip is recommended for files over 10 MB — it reduces transfer time and bandwidth for HTTP sources.
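Extension-based detection with streaming decompression looks roughly like the sketch below, which uses Python's standard `gzip` module. The helper name `open_feed` and the file name `products.csv.gz` are hypothetical, and the demo writes a small temporary file only so the example is self-contained.

```python
import gzip
import os
import tempfile

def open_feed(path):
    """Hypothetical helper: open a feed as text, decompressing .gz files on the fly."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", newline="")
    return open(path, "r", encoding="utf-8", newline="")

# Demo: write a tiny gzip-compressed CSV, then stream it back line by line.
tmp_dir = tempfile.mkdtemp()
feed_path = os.path.join(tmp_dir, "products.csv.gz")
with gzip.open(feed_path, "wt", encoding="utf-8", newline="") as out:
    out.write("sku,name\nSKU-001,Widget\n")

with open_feed(feed_path) as feed:
    lines = [line.rstrip("\n") for line in feed]
```

Because the file is read as a stream, the full uncompressed content never has to be held in memory at once.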
Catalogian expects UTF-8 encoding. Files with a UTF-8 BOM (byte order mark) are handled automatically — the BOM is stripped before parsing. Other encodings (Latin-1, Windows-1252) may produce garbled field values.
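The example below illustrates both cases in Python: stripping a UTF-8 BOM, and what happens when Latin-1 bytes are decoded as UTF-8. It is a sketch only. How Catalogian handles undecodable bytes internally is not specified here, so the `errors="replace"` behavior is an assumption used to show one form of garbling.

```python
import codecs

def decode_feed(raw: bytes) -> str:
    """Hypothetical helper: decode UTF-8, dropping a leading BOM if present."""
    # "utf-8-sig" behaves like "utf-8" but removes a BOM at the start of the data.
    return raw.decode("utf-8-sig")

with_bom = codecs.BOM_UTF8 + "sku,name\n".encode("utf-8")
clean = decode_feed(with_bom)  # the header starts with "sku", not "\ufeffsku"

# Latin-1 bytes are not valid UTF-8; with replacement, "é" becomes U+FFFD.
latin1_bytes = "café".encode("latin-1")
garbled = latin1_bytes.decode("utf-8", errors="replace")
```

To avoid garbled values, convert the file before uploading, for example with `raw.decode("latin-1").encode("utf-8")` when the source encoding is known to be Latin-1.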
Maximum file size is 500 MB (uncompressed). For large feeds, gzip compression is strongly recommended — a 500 MB CSV typically compresses to under 50 MB.