GCSToBQLoadRunnable doesn't respect GCS folder #271

@zachary-povey

I have run into a problem when using the GCS->BQ batch mode of the BigQuerySinkConnector. Each connector schedules its own instance of the GCSToBQLoadRunnable, which does not use the GCS folder when listing objects to load into BigQuery.

Because of this, if you have multiple connectors using the same bucket but different folders, each of them loads every object in the bucket, irrespective of the folder it is in, so you receive many duplicates in BQ. In addition, only one instance will successfully delete each object; when the other instances try to delete it and fail, they simply retry again and again.
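
To illustrate the behaviour I would expect, here is a minimal sketch of restricting a bucket listing to one connector's folder. It is not the connector's actual code: the class `FolderFilter`, its methods, and the object names are hypothetical, and it works on plain strings rather than real GCS blobs. In the real runnable the same effect could probably be had server-side with the Cloud Storage Java client's `Storage.BlobListOption.prefix` list option, but I have not checked how that would fit into the existing code.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only (not part of the connector).
final class FolderFilter {

    // Turn a folder name into a prefix ending in "/", so that the folder
    // "connector-a" does not also match a sibling such as "connector-a-old/".
    // An empty or null folder means "whole bucket" (empty prefix).
    static String toPrefix(String folder) {
        if (folder == null || folder.isEmpty()) {
            return "";
        }
        return folder.endsWith("/") ? folder : folder + "/";
    }

    // Keep only the object names that live under the given folder.
    static List<String> inFolder(List<String> objectNames, String folder) {
        String prefix = toPrefix(folder);
        return objectNames.stream()
                .filter(name -> name.startsWith(prefix))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up object names standing in for a shared bucket listing.
        List<String> bucket = List.of(
                "connector-a/batch-1.json",
                "connector-b/batch-1.json",
                "connector-a-old/batch-9.json");
        // Prints: [connector-a/batch-1.json]
        System.out.println(inFolder(bucket, "connector-a"));
    }
}
```

With a filter like this, each connector would only load (and later try to delete) the objects under its own folder, which should avoid both the duplicates and the repeated failed deletes.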

GCSToBQLoadRunnable can be seen here
