File
This page contains the setup guide and reference information for the File (CSV, JSON, Excel, Feather, Parquet) source connector.
Prerequisites
- A file hosted on AWS S3, GCS, HTTPS, or an SFTP server
- Dataset Name
- File Format
- URL
- Storage Provider
Setup guide
Set up File (CSV, JSON, Excel, Feather, Parquet)
For Calabi Connect users: Please note that locally stored files cannot be used as a source in Calabi Connect.
Set up the File (CSV, JSON, Excel, Feather, Parquet) connector in Calabi Connect
Step 1: Create the source in Calabi Connect:
- Log into your Calabi Connect account and navigate to the dashboard.
- Click Sources and then click + New source.
- On the Set up the source page, select File (CSV, JSON, Excel, Feather, Parquet) from the Source type dropdown.
- Enter a name for the File (CSV, JSON, Excel, Feather, Parquet) connector.
Step 2: Select the provider and set provider-specific configurations:
- For Storage Provider, use the dropdown menu to select the storage provider or location of the file(s) to be replicated, then configure the provider-specific fields as needed.
Supported sync modes
The File (CSV, JSON, Excel, Feather, Parquet) source connector supports the following sync modes:
| Feature | Supported? |
|---|---|
| Full Refresh Sync | Yes |
| Incremental Sync | No |
| Replicate Incremental Deletes | No |
| Replicate Folders (multiple Files) | No |
| Replicate Glob Patterns (multiple Files) | No |
This source produces a single table for the target file, as it currently replicates only one file at a time. Note that you should provide the dataset_name, which dictates how the table will be identified in the destination (since the URL can contain complex characters).
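For orientation, here is a minimal sketch of such a configuration, written as a Python dict. The field names (dataset_name, format, url, provider, reader_options) mirror the fields and examples on this page, but are assumptions for illustration rather than the connector's exact specification:

```python
# Hypothetical source configuration, shown as a Python dict for illustration.
# The exact field names and structure are assumptions based on the fields and
# examples on this page, not the connector's authoritative specification.
source_config = {
    "dataset_name": "epidemiology",  # controls the table name in the destination
    "format": "csv",
    "url": "https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv",
    "provider": {"storage": "HTTPS"},  # one of the storage providers listed below
    "reader_options": '{"sep": ","}',  # optional JSON string forwarded to the reader
}
```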
Supported Streams
File / Stream Compression
| Compression | Supported? |
|---|---|
| Gzip | Yes |
| Zip | Yes |
| Bzip2 | No |
| Lzma | No |
| Xz | No |
| Snappy | No |
Storage Providers
| Storage Providers | Supported? |
|---|---|
| HTTPS | Yes |
| Google Cloud Storage | Yes |
| Amazon Web Services S3 | Yes |
| SFTP | Yes |
| SSH / SCP | Yes |
| local filesystem | Local use only (inaccessible for Calabi Connect) |
File Formats
| Format | Supported? |
|---|---|
| CSV | Yes |
| JSON/JSONL | Yes |
| HTML | No |
| XML | No |
| Excel | Yes |
| Excel Binary Workbook | Yes |
| Fixed Width File | Yes |
| Feather | Yes |
| Parquet | Yes |
| Pickle | No |
| YAML | Yes |
Changing data types of source columns
Normally, Calabi Connect tries to infer the data type from the source, but you can use reader_options to force specific data types. If you input {"dtype": "string"}, all columns will be parsed as strings. If you only want a specific column parsed as a string, use {"dtype": {"column name": "string"}}.
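As a minimal sketch of how this behaves, assuming reader_options are forwarded as keyword arguments to pandas.read_csv (the column name user_id below is made up for illustration):

```python
import io
import pandas as pd

# reader_options such as {"dtype": {"user_id": "string"}} behave like the
# equivalent pandas.read_csv keyword arguments; "user_id" is a made-up column.
sample = "user_id,score\n007,1.5\n042,2.0\n"
df = pd.read_csv(io.StringIO(sample), dtype={"user_id": "string"})
print(df.dtypes)  # user_id stays a string ("007"); score is still parsed as numeric
```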
Examples
Here is a list of examples of possible file inputs:
| Dataset Name | Storage | URL | Reader Impl | Service Account | Description |
|---|---|---|---|---|---|
| epidemiology | HTTPS | https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv | | | COVID-19 Public dataset on BigQuery |
| hr_and_financials | GCS | gs://calabi-connect-vault/financial.csv | smart_open or gcfs | {"type": "service_account", "private_key_id": "XXXXXXXX", ...} | data from a private bucket, a service account is necessary |
| landsat_index | GCS | gcp-public-data-landsat/index.csv.gz | smart_open | | Using smart_open, we don't need to specify the compression (note the gs:// is optional too, same for other providers) |
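To make the read path concrete, here is a minimal sketch of fetching the public HTTPS dataset from the first row above with smart_open and pandas; no credentials are required for this file:

```python
import pandas as pd
from smart_open import open as smart_open

# Stream the public COVID-19 epidemiology CSV over HTTPS; smart_open also
# handles s3://, gs://, and sftp:// URLs (with the relevant extras installed).
url = "https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv"
with smart_open(url, "rb") as f:
    df = pd.read_csv(f)
print(df.head())
```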
Examples with reader options:
| Dataset Name | Storage | URL | Reader Impl | Reader Options | Description |
|---|---|---|---|---|---|
| landsat_index | GCS | gs://gcp-public-data-landsat/index.csv.gz | GCFS | {"compression": "gzip"} | Additional reader options to specify a compression option to read_csv |
| GDELT | S3 | s3://gdelt-open-data/events/20190914.export.csv | | {"sep": "\t", "header": null} | TSV data separated by tabs, without a header row, from AWS Open Data |
| server_logs | local | /local/logs.log | | {"sep": ";"} | Requires a local text file at /tmp/airbyte_local/logs.log containing server logs delimited by ';' |
Example for SFTP:
| Dataset Name | Storage | User | Password | Host | URL | Reader Options | Description |
|---|---|---|---|---|---|---|---|
| Test Rebex | SFTP | demo | password | test.rebex.net | /pub/example/readme.txt | {"sep": "\r\n", "header": null, "names": ["text"], "engine": "python"} | We use the Python engine for read_csv in order to handle a delimiter of more than one character while providing our own column names. |
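A minimal sketch of what those reader options do, applied directly in pandas (the sample text below is made up):

```python
import io
import pandas as pd

# engine="python" is required because the separator is longer than one
# character; since "\r\n" never occurs inside a line, each line ends up as a
# single value in the "text" column.
sample = "Welcome to the test server.\r\nEnjoy your stay.\r\n"
df = pd.read_csv(
    io.StringIO(sample),
    sep="\r\n",
    header=None,
    names=["text"],
    engine="python",
)
print(df["text"].tolist())
```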
Please see (or add) more at airbyte-integrations/connectors/source-file/integration_tests/integration_source_test.py for further usage examples.
Performance Considerations and Notes
In order to read large files from a remote location, this connector uses the smart_open library. However, it is possible to switch to either the GCSFS or S3FS implementations, which are natively supported by the pandas library. This choice is made through the optional reader_impl parameter.
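As a sketch of the difference, the pandas-native path reads remote URLs directly once gcsfs or s3fs is installed; the anonymous-access option below assumes the bucket is public:

```python
import pandas as pd

# With gcsfs installed, pandas can read gs:// URLs natively; this is the read
# path the GCSFS reader_impl relies on. Compression is inferred from the .gz
# suffix, and token="anon" requests anonymous access to a public bucket.
df = pd.read_csv(
    "gs://gcp-public-data-landsat/index.csv.gz",
    storage_options={"token": "anon"},
)
```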
- Note that for the local filesystem, the file probably has to be stored somewhere in the /tmp/airbyte_local folder, with the same limitations as the CSV Destination, so the URL should also start with /local/.
- Please make sure that Docker Desktop has access to /tmp (and /private on macOS, as /tmp has a symlink that points to /private; it will not work otherwise). You can allow this under "File sharing" in Settings -> Resources -> File sharing: add the one or two folders above and hit the "Apply & restart" button.
- The JSON implementation needs to be tweaked in order to produce a more complex catalog and is still in an experimental state: simple JSON schemas should work at this point but may not be handled well when there are multiple layers of nesting.
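A minimal illustration of the /local/ mapping described above, reusing the server_logs example from the tables (the path is the one those examples assume):

```python
import pandas as pd

# A connector URL of /local/logs.log resolves to /tmp/airbyte_local/logs.log
# on the host, so the file must exist there with the ';' delimiter the
# server_logs example assumes.
df = pd.read_csv("/tmp/airbyte_local/logs.log", sep=";")
```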