page_title	subcategory	description
airbyte_source_s3 Resource - terraform-provider-airbyte		SourceS3 Resource

airbyte_source_s3 (Resource)

SourceS3 Resource

Example Usage

resource "airbyte_source_s3" "my_source_s3" {
  configuration = {
    aws_access_key_id     = "...my_aws_access_key_id..."
    aws_secret_access_key = "...my_aws_secret_access_key..."
    bucket                = "...my_bucket..."
    dataset               = "...my_dataset..."
    delivery_method = {
      copy_raw_files = {
        preserve_directory_structure = false
      }
    }
    endpoint = "my-s3-endpoint.com"
    format = {
      parquet = {
        batch_size  = 6
        buffer_size = 8
        columns = [
          "..."
        ]
      }
    }
    path_pattern = "**"
    provider = {
      aws_access_key_id     = "...my_aws_access_key_id..."
      aws_secret_access_key = "...my_aws_secret_access_key..."
      bucket                = "...my_bucket..."
      endpoint              = "...my_endpoint..."
      path_prefix           = "...my_path_prefix..."
      region_name           = "...my_region_name..."
      role_arn              = "...my_role_arn..."
      start_date            = "2021-01-01T00:00:00Z"
    }
    region_name = "...my_region_name..."
    role_arn    = "...my_role_arn..."
    schema      = "{\"column_1\": \"number\", \"column_2\": \"string\", \"column_3\": \"array\", \"column_4\": \"object\", \"column_5\": \"boolean\"}"
    start_date  = "2021-01-01T00:00:00.000000Z"
    streams = [
      {
        days_to_sync_if_history_is_full = 5
        format = {
          excel_format = {
            # ...
          }
        }
        globs = [
          "..."
        ]
        input_schema                                = "...my_input_schema..."
        legacy_prefix                               = "...my_legacy_prefix..."
        name                                        = "...my_name..."
        primary_key                                 = "...my_primary_key..."
        recent_n_files_to_read_for_schema_discovery = 10
        schemaless                                  = true
        validation_policy                           = "Wait for Discover"
      }
    ]
  }
  definition_id = "07ef8ae4-b6a4-4fd9-99ea-a368c6fc144c"
  name          = "...my_name..."
  secret_id     = "...my_secret_id..."
  workspace_id  = "bba7dce0-5020-4916-bbd7-be8f298d5f78"
}

Schema

Required

configuration (Attributes) NOTE: When this Spec is changed, legacy_config_transformer.py must also be updated to pick up the changes, because it is responsible for converting legacy S3 v3 configs into v4 configs using the File-Based CDK. (see below for nested schema)
name (String) Name of the source e.g. dev-mysql-instance.
workspace_id (String)

Optional

definition_id (String) The UUID of the connector definition. One of configuration.sourceType or definitionId must be provided. Default: "69589781-7828-43c5-9f63-8925b1c1ccc2"; Requires replacement if changed.
secret_id (String) Optional secretID obtained through the public API OAuth redirect flow. Requires replacement if changed.

Read-Only

created_at (Number)
resource_allocation (Attributes) Actor or actor definition specific resource requirements. If default is set, these are the requirements that should be set for ALL jobs run for this actor definition. It is overridden by the job type specific configurations. If not set, the platform will use defaults. These values will be overridden by configuration at the connection level. (see below for nested schema)
source_id (String)
source_type (String)
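
The read-only attributes are populated after apply and can be referenced from other resources or exposed as outputs. A minimal sketch, assuming the `my_source_s3` resource from Example Usage above; the output name `s3_source_id` is hypothetical:

```hcl
# Expose the generated source ID, for example to pass it to a connection resource.
output "s3_source_id" {
  value = airbyte_source_s3.my_source_s3.source_id
}
```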

Nested Schema for `configuration`

Required:

bucket (String) Name of the S3 bucket where the file(s) exist.
streams (Attributes List) Each instance of this configuration defines a stream. Use this to define which files belong in the stream, their format, and how they should be parsed and validated. When sending data to a warehouse destination such as Snowflake or BigQuery, each stream is a separate table. (see below for nested schema)

Optional:

aws_access_key_id (String, Sensitive) In order to access private Buckets stored on AWS S3, this connector requires credentials with the proper permissions. If accessing publicly available data, this field is not necessary.
aws_secret_access_key (String, Sensitive) In order to access private Buckets stored on AWS S3, this connector requires credentials with the proper permissions. If accessing publicly available data, this field is not necessary.
dataset (String) Deprecated and will be removed soon. Please do not use this field anymore and use streams.name instead. The name of the stream you would like this source to output. Can contain letters, numbers, or underscores.
delivery_method (Attributes) (see below for nested schema)
endpoint (String) Endpoint to an S3 compatible service. Leave empty to use AWS. The custom endpoint must be secure, but the 'https' prefix is not required. Default: ""
format (Attributes) Deprecated and will be removed soon. Please do not use this field anymore and use streams.format instead. The format of the files you'd like to replicate (see below for nested schema)
path_pattern (String) Deprecated and will be removed soon. Please do not use this field anymore and use streams.globs instead. A regular expression which tells the connector which files to replicate. All files which match this pattern will be replicated. Use | to separate multiple patterns. See this page to understand pattern syntax (GLOBSTAR and SPLIT flags are enabled). Use pattern ** to pick up all files.
provider (Attributes) Deprecated and will be removed soon. Please do not use this field anymore and use bucket, aws_access_key_id, aws_secret_access_key and endpoint instead. Use this to load files from S3 or S3-compatible services (see below for nested schema)
region_name (String) AWS region where the S3 bucket is located. If not provided, the region will be determined automatically.
role_arn (String) Specifies the Amazon Resource Name (ARN) of an IAM role that you want to use to perform operations requested using this profile. Set the External ID to the Airbyte workspace ID, which can be found in the URL of this page.
schema (String) Deprecated and will be removed soon. Please do not use this field anymore and use streams.input_schema instead. Optionally provide a schema to enforce, as a valid JSON string. Ensure this is a mapping of { "column" : "type" }, where types are valid JSON Schema datatypes. Leave as {} to auto-infer the schema.
start_date (String) UTC date and time in the format 2017-01-25T00:00:00.000000Z. Any file modified before this date will not be replicated.
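
Several of the attributes above are deprecated in favor of `streams`. The following is a minimal sketch of a configuration that uses only the non-deprecated fields. The variable names, resource name, bucket name, and glob are hypothetical placeholders, not values defined by the provider:

```hcl
# Hypothetical input variables; the names are assumptions for this sketch.
variable "workspace_id" {
  type = string
}

variable "aws_access_key_id" {
  type      = string
  sensitive = true
}

variable "aws_secret_access_key" {
  type      = string
  sensitive = true
}

resource "airbyte_source_s3" "orders" {
  name         = "s3-orders"
  workspace_id = var.workspace_id

  configuration = {
    bucket                = "my-example-bucket"
    aws_access_key_id     = var.aws_access_key_id
    aws_secret_access_key = var.aws_secret_access_key
    start_date            = "2021-01-01T00:00:00.000000Z"

    # streams replaces the deprecated dataset, format, path_pattern and schema fields.
    streams = [
      {
        name  = "orders"
        globs = ["orders/**/*.csv"]
        format = {
          csv_format = {}
        }
      }
    ]
  }
}
```

For publicly readable buckets, the two credential attributes can be omitted, as noted in their descriptions.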

Nested Schema for `configuration.streams`

Required:

format (Attributes) The configuration options that are used to alter how to read incoming files that deviate from the standard formatting. (see below for nested schema)
name (String) The name of the stream.

Optional:

days_to_sync_if_history_is_full (Number) When the state history of the file store is full, syncs will only read files that were last modified in the provided day range. Default: 3
globs (List of String) The pattern used to specify which files should be selected from the file system. For more information on glob pattern matching look here. Default: ["**"]
input_schema (String) The schema that will be used to validate records extracted from the file. This will override the stream schema that is auto-detected from incoming files.
legacy_prefix (String) The path prefix configured in v3 versions of the S3 connector. This option is deprecated in favor of a single glob.
primary_key (String) The column or columns (for a composite key) that serves as the unique identifier of a record. If empty, the primary key will default to the parser's default primary key.
recent_n_files_to_read_for_schema_discovery (Number) The number of recent files which will be used to discover the schema for this stream.
schemaless (Boolean) When enabled, syncs will not validate or structure records against the stream's schema. Default: false
validation_policy (String) The name of the validation policy that dictates sync behavior when a record does not adhere to the stream schema. Default: "Emit Record"; must be one of ["Emit Record", "Skip Record", "Wait for Discover"]
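
As an illustration of the stream-level options above, the sketch below defines one JSON Lines stream with a primary key and a non-default validation policy. The resource name, bucket, glob, and column name `event_id` are hypothetical:

```hcl
# Hypothetical input variable for this sketch.
variable "workspace_id" {
  type = string
}

resource "airbyte_source_s3" "events" {
  name         = "s3-events"
  workspace_id = var.workspace_id

  configuration = {
    bucket = "my-example-bucket"

    streams = [
      {
        name  = "events"
        globs = ["events/**/*.jsonl"]
        format = {
          jsonl_format = {}
        }

        # Assumed column name; use whatever uniquely identifies a record.
        primary_key = "event_id"

        # One of "Emit Record" (default), "Skip Record", "Wait for Discover".
        validation_policy = "Skip Record"

        # Overrides the documented default of 3 days.
        days_to_sync_if_history_is_full = 7

        recent_n_files_to_read_for_schema_discovery = 5
      }
    ]
  }
}
```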

Nested Schema for `configuration.streams.format`

Optional:

avro_format (Attributes) (see below for nested schema)
csv_format (Attributes) (see below for nested schema)
excel_format (Attributes) (see below for nested schema)
jsonl_format (Attributes) (see below for nested schema)
parquet_format (Attributes) (see below for nested schema)
unstructured_document_format (Attributes) Extract text from document formats (.pdf, .docx, .md, .pptx) and emit as one record per file. (see below for nested schema)

Nested Schema for `configuration.streams.format.avro_format`

Optional:

double_as_string (Boolean) Whether to convert double fields to strings. This is recommended if you have decimal numbers with a high degree of precision because there can be a loss of precision when handling floating point numbers. Default: false

Nested Schema for `configuration.streams.format.csv_format`

Optional:

delimiter (String) The character delimiting individual cells in the CSV data. This may only be a 1-character string. For tab-delimited data enter '\t'. Default: ","
double_quote (Boolean) Whether two quotes in a quoted CSV value denote a single quote in the data. Default: true
encoding (String) The character encoding of the CSV data. Leave blank to default to UTF8. See list of python encodings for allowable options. Default: "utf8"
escape_char (String) The character used for escaping special characters. To disallow escaping, leave this field blank.
false_values (List of String) A set of case-sensitive strings that should be interpreted as false values. Default: ["n","no","f","false","off","0"]
header_definition (Attributes) How headers will be defined. User Provided assumes the CSV does not have a header row and uses the headers provided. Autogenerated assumes the CSV does not have a header row, and the CDK will generate headers of the form f{i}, where i is the index starting from 0. Otherwise, the default behavior is to use the header from the CSV file. If a user wants to autogenerate or provide column names for a CSV having headers, they can skip rows. (see below for nested schema)
ignore_errors_on_fields_mismatch (Boolean) Whether to ignore errors that occur when the number of fields in the CSV does not match the number of columns in the schema. Default: false
inference_type (String) How to infer the types of the columns. If none, inference defaults to strings. Must be one of ["None", "Primitive Types Only"]
null_values (List of String) A set of case-sensitive strings that should be interpreted as null values. For example, if the value 'NA' should be interpreted as null, enter 'NA' in this field. Default: []
quote_char (String) The character used for quoting CSV values. To disallow quoting, make this field blank. Default: "\""
skip_rows_after_header (Number) The number of rows to skip after the header row. Default: 0
skip_rows_before_header (Number) The number of rows to skip before the header row. For example, if the header row is on the 3rd row, enter 2 in this field. Default: 0
strings_can_be_null (Boolean) Whether strings can be interpreted as null values. If true, strings that match the null_values set will be interpreted as null. If false, strings that match the null_values set will be interpreted as the string itself. Default: true
true_values (List of String) A set of case-sensitive strings that should be interpreted as true values. Default: ["y","yes","t","true","on","1"]
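
The CSV options above can be combined in a single `csv_format` object. A minimal sketch, assuming a semicolon-delimited file whose header sits on the third row; the resource name, bucket, and glob are hypothetical:

```hcl
# Hypothetical input variable for this sketch.
variable "workspace_id" {
  type = string
}

resource "airbyte_source_s3" "semicolon_csv" {
  name         = "s3-semicolon-csv"
  workspace_id = var.workspace_id

  configuration = {
    bucket = "my-example-bucket"

    streams = [
      {
        name  = "reports"
        globs = ["reports/*.csv"]
        format = {
          csv_format = {
            delimiter = ";"
            encoding  = "utf8"

            # Header is on the 3rd row, so skip the 2 rows before it.
            skip_rows_before_header = 2

            # Treat the literal string NA as null.
            null_values         = ["NA"]
            strings_can_be_null = true

            inference_type = "Primitive Types Only"
          }
        }
      }
    ]
  }
}
```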

Nested Schema for `configuration.streams.format.csv_format.header_definition`

Optional:

autogenerated (Attributes) (see below for nested schema)
from_csv (Attributes) (see below for nested schema)
user_provided (Attributes) (see below for nested schema)

Nested Schema for `configuration.streams.format.csv_format.header_definition.autogenerated`

Nested Schema for `configuration.streams.format.csv_format.header_definition.from_csv`

Nested Schema for `configuration.streams.format.csv_format.header_definition.user_provided`

Required:

column_names (List of String) The column names that will be used while emitting the CSV records
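
For a CSV file without a header row, `user_provided` supplies the column names explicitly. A minimal sketch; the resource name, bucket, glob, and column names are hypothetical:

```hcl
# Hypothetical input variable for this sketch.
variable "workspace_id" {
  type = string
}

resource "airbyte_source_s3" "headerless_csv" {
  name         = "s3-headerless-csv"
  workspace_id = var.workspace_id

  configuration = {
    bucket = "my-example-bucket"

    streams = [
      {
        name  = "measurements"
        globs = ["measurements/*.csv"]
        format = {
          csv_format = {
            header_definition = {
              # Set only one of autogenerated, from_csv, or user_provided.
              user_provided = {
                column_names = ["sensor_id", "recorded_at", "value"]
              }
            }
          }
        }
      }
    ]
  }
}
```

Using `autogenerated = {}` in place of `user_provided` would instead produce the generated `f{i}` names described under `header_definition`.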

Nested Schema for `configuration.streams.format.excel_format`

Nested Schema for `configuration.streams.format.jsonl_format`

Nested Schema for `configuration.streams.format.parquet_format`

Optional:

decimal_as_float (Boolean) Whether to convert decimal fields to floats. There is a loss of precision when converting decimals to floats, so this is not recommended. Default: false

Nested Schema for `configuration.streams.format.unstructured_document_format`

Optional:

processing (Attributes) Processing configuration (see below for nested schema)
skip_unprocessable_files (Boolean) If true, skip files that cannot be parsed and pass the error message along as the _ab_source_file_parse_error field. If false, fail the sync. Default: true
strategy (String) The strategy used to parse documents. fast extracts text directly from the document which doesn't work for all files. ocr_only is more reliable, but slower. hi_res is the most reliable, but requires an API key and a hosted instance of unstructured and can't be used with local mode. See the unstructured.io documentation for more details: https://unstructured-io.github.io/unstructured/core/partition.html#partition-pdf. Default: "auto"; must be one of ["auto", "fast", "ocr_only", "hi_res"]

Nested Schema for `configuration.streams.format.unstructured_document_format.processing`

Optional:

local (Attributes) Process files locally, supporting fast and ocr modes. This is the default option. (see below for nested schema)
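
The options above nest three levels deep under a stream's `format`. A minimal sketch of a stream that extracts text from PDF files using local processing; the resource name, bucket, and glob are hypothetical:

```hcl
# Hypothetical input variable for this sketch.
variable "workspace_id" {
  type = string
}

resource "airbyte_source_s3" "documents" {
  name         = "s3-documents"
  workspace_id = var.workspace_id

  configuration = {
    bucket = "my-example-bucket"

    streams = [
      {
        name  = "contracts"
        globs = ["contracts/**/*.pdf"]
        format = {
          unstructured_document_format = {
            # "fast" extracts text directly; "ocr_only" is slower but more reliable.
            strategy = "fast"

            # Record parse failures in _ab_source_file_parse_error instead of failing the sync.
            skip_unprocessable_files = true

            processing = {
              local = {}
            }
          }
        }
      }
    ]
  }
}
```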

Nested Schema for `configuration.streams.format.unstructured_document_format.processing.local`

Nested Schema for `configuration.delivery_method`

Optional:

copy_raw_files (Attributes) Copy raw files without parsing their contents. Bits are copied into the destination exactly as they appeared in the source. Recommended for use with unstructured text data, non-text and compressed files. (see below for nested schema)
replicate_records (Attributes) Recommended - Extract and load structured records into your destination of choice. This is the classic method of moving data in Airbyte. It allows for blocking and hashing individual fields or files from a structured schema. Data can be flattened, typed and deduped depending on the destination. (see below for nested schema)

Nested Schema for `configuration.delivery_method.copy_raw_files`

Optional:

preserve_directory_structure (Boolean) If enabled, sends subdirectory folder structure along with source file names to the destination. Otherwise, files will be synced by their names only. This option is ignored when file-based replication is not enabled. Default: true
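
A minimal sketch of selecting the raw-file delivery method while keeping the source directory layout. The resource name, bucket, and stream are hypothetical. The schema still requires a `format` on every stream; the choice of `csv_format` here is an assumption for illustration only:

```hcl
# Hypothetical input variable for this sketch.
variable "workspace_id" {
  type = string
}

resource "airbyte_source_s3" "raw_files" {
  name         = "s3-raw-files"
  workspace_id = var.workspace_id

  configuration = {
    bucket = "my-example-bucket"

    delivery_method = {
      # Set either copy_raw_files or replicate_records, not both.
      copy_raw_files = {
        preserve_directory_structure = true
      }
    }

    streams = [
      {
        name  = "archive"
        globs = ["**"]
        # Required by the schema even when files are copied without parsing (assumed choice).
        format = {
          csv_format = {}
        }
      }
    ]
  }
}
```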

Nested Schema for `configuration.delivery_method.replicate_records`

Nested Schema for `configuration.format`

Optional:

avro (Attributes) This connector utilises fastavro for Avro parsing. (see below for nested schema)
csv (Attributes) This connector utilises PyArrow (Apache Arrow) for CSV parsing. (see below for nested schema)
jsonl (Attributes) This connector uses PyArrow for JSON Lines (jsonl) file parsing. (see below for nested schema)
parquet (Attributes) This connector utilises PyArrow (Apache Arrow) for Parquet parsing. (see below for nested schema)

Nested Schema for `configuration.format.avro`

Nested Schema for `configuration.format.csv`

Optional:

additional_reader_options (String) Optionally add a valid JSON string here to provide additional options to the csv reader. Mappings must correspond to options detailed here. 'column_types' is used internally to handle schema so overriding that would likely cause problems.
advanced_options (String) Optionally add a valid JSON string here to provide additional Pyarrow ReadOptions. Specify 'column_names' here if your CSV doesn't have a header, or if you want to use custom column names. 'block_size' and 'encoding' are already used above; specifying them again here will override the values above.
block_size (Number) The chunk size in bytes to process at a time in memory from each file. If your data is particularly wide and failing during schema detection, increasing this should solve it. Beware of raising this too high as you could hit OOM errors. Default: 10000
delimiter (String) The character delimiting individual cells in the CSV data. This may only be a 1-character string. For tab-delimited data enter '\t'. Default: ","
double_quote (Boolean) Whether two quotes in a quoted CSV value denote a single quote in the data. Default: true
encoding (String) The character encoding of the CSV data. Leave blank to default to UTF8. See list of python encodings for allowable options. Default: "utf8"
escape_char (String) The character used for escaping special characters. To disallow escaping, leave this field blank.
infer_datatypes (Boolean) Configures whether a schema for the source should be inferred from the current data or not. If set to false and a custom schema is set, then the manually enforced schema is used. If a schema is not manually set, and this is set to false, then all fields will be read as strings. Default: true
newlines_in_values (Boolean) Whether newline characters are allowed in CSV values. Turning this on may affect performance. Leave blank to default to False. Default: false
quote_char (String) The character used for quoting CSV values. To disallow quoting, make this field blank. Default: "\""

Nested Schema for `configuration.format.jsonl`

Optional:

block_size (Number) The chunk size in bytes to process at a time in memory from each file. If your data is particularly wide and failing during schema detection, increasing this should solve it. Beware of raising this too high as you could hit OOM errors. Default: 0
newlines_in_values (Boolean) Whether newline characters are allowed in JSON values. Turning this on may affect performance. Leave blank to default to False. Default: false
unexpected_field_behavior (String) How JSON fields outside of explicit_schema (if given) are treated. Check PyArrow documentation for details. Default: "infer"; must be one of ["ignore", "infer", "error"]

Nested Schema for `configuration.format.parquet`

Optional:

batch_size (Number) Maximum number of records per batch read from the input files. Batches may be smaller if there aren’t enough rows in the file. This option can help avoid out-of-memory errors if your data is particularly wide. Default: 65536
buffer_size (Number) Perform read buffering when deserializing individual column chunks. By default every group column will be loaded fully to memory. This option can help avoid out-of-memory errors if your data is particularly wide. Default: 2
columns (List of String) If you only want to sync a subset of the columns from the file(s), add the columns you want here as a comma-delimited list. Leave it empty to sync all columns.

Nested Schema for `configuration.provider`

Optional:

aws_access_key_id (String, Sensitive) In order to access private Buckets stored on AWS S3, this connector requires credentials with the proper permissions. If accessing publicly available data, this field is not necessary.
aws_secret_access_key (String, Sensitive) In order to access private Buckets stored on AWS S3, this connector requires credentials with the proper permissions. If accessing publicly available data, this field is not necessary.
bucket (String) Name of the S3 bucket where the file(s) exist.
endpoint (String) Endpoint to an S3 compatible service. Leave empty to use AWS. Default: ""
path_prefix (String) By providing a path-like prefix (e.g. myFolder/thisTable/) under which all the relevant files sit, we can optimize finding these in S3. This is optional but recommended if your bucket contains many folders/files which you don't need to replicate. Default: ""
region_name (String) AWS region where the S3 bucket is located. If not provided, the region will be determined automatically.
role_arn (String) Specifies the Amazon Resource Name (ARN) of an IAM role that you want to use to perform operations requested using this profile. Set the External ID to the Airbyte workspace ID, which can be found in the URL of this page.
start_date (String) UTC date and time in the format 2017-01-25T00:00:00Z. Any file modified before this date will not be replicated.
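
Because `provider` is deprecated, its fields can be expressed with the top-level `configuration` attributes instead. The sketch below shows one plausible mapping; folding `path_prefix` into a stream glob is an assumption based on the `legacy_prefix` description, and all names and values are hypothetical:

```hcl
# Hypothetical input variables for this sketch.
variable "workspace_id" {
  type = string
}

variable "role_arn" {
  type = string
}

resource "airbyte_source_s3" "migrated" {
  name         = "s3-migrated"
  workspace_id = var.workspace_id

  configuration = {
    # Previously provider.bucket, provider.region_name, provider.role_arn, provider.start_date.
    bucket      = "my-example-bucket"
    region_name = "us-east-1"
    role_arn    = var.role_arn
    start_date  = "2021-01-01T00:00:00.000000Z"

    streams = [
      {
        name = "this_table"
        # Previously provider.path_prefix = "myFolder/thisTable/" (assumed equivalent glob).
        globs = ["myFolder/thisTable/**"]
        format = {
          parquet_format = {}
        }
      }
    ]
  }
}
```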

Nested Schema for `resource_allocation`

Read-Only:

default (Attributes) optional resource requirements to run workers (blank for unbounded allocations) (see below for nested schema)
job_specific (Attributes List) (see below for nested schema)

Nested Schema for `resource_allocation.default`

Read-Only:

cpu_limit (String)
cpu_request (String)
ephemeral_storage_limit (String)
ephemeral_storage_request (String)
memory_limit (String)
memory_request (String)

Nested Schema for `resource_allocation.job_specific`

Read-Only:

job_type (String) enum that describes the different types of jobs that the platform runs.
resource_requirements (Attributes) optional resource requirements to run workers (blank for unbounded allocations) (see below for nested schema)

Nested Schema for `resource_allocation.job_specific.resource_requirements`

Read-Only:

cpu_limit (String)
cpu_request (String)
ephemeral_storage_limit (String)
ephemeral_storage_request (String)
memory_limit (String)
memory_request (String)

Import

Import is supported using the following syntax:

In Terraform v1.5.0 and later, the import block can be used with the id attribute, for example:

import {
  to = airbyte_source_s3.my_airbyte_source_s3
  id = "..."
}

The terraform import command can be used, for example:

terraform import airbyte_source_s3.my_airbyte_source_s3 "..."
