Automated schema exploration in input connectors #224

@zxqfd555

Description

Is your feature request related to a problem? Please describe.

All structured input connectors in Pathway that read from sources with a known schema — in particular pw.io.postgres.read, pw.io.mssql.read, pw.io.mysql.read, and similar database connectors — currently require the user to explicitly define a pw.Schema subclass before the pipeline can be started. This creates unnecessary boilerplate for common use cases such as exploratory data work, quick prototyping, and pipelines where the table structure is already fully defined in the source system. The user is forced to manually replicate schema information that the connector could retrieve itself.

Describe the solution you'd like

For connectors that read from sources with queryable metadata (SQL databases, BigQuery, DynamoDB, etc.), make the schema parameter optional. When it is omitted, the connector should perform schema exploration at startup by querying the source's catalog or information schema, derive a pw.Schema automatically from the column names and types found there, and proceed with that schema as if it had been provided explicitly.

The type mapping used during exploration should follow the same conversion rules already documented for each connector (e.g. the PostgreSQL type conversion table in pw.io.postgres). Columns whose source type has no direct Pathway equivalent should fall back to typing.Any.
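As an illustration, the exploration step could drive the conversion with a lookup table mirroring the documented PostgreSQL mapping, falling back to typing.Any for unmapped source types. This is a sketch; the dictionary and function names are hypothetical, not actual Pathway internals:

```python
import typing

# Hypothetical PostgreSQL -> Python type mapping, mirroring the documented
# conversion table in the pw.io.postgres API docs (not Pathway internals).
PG_TYPE_MAP: dict[str, type] = {
    "integer": int,
    "bigint": int,
    "smallint": int,
    "double precision": float,
    "real": float,
    "numeric": float,
    "text": str,
    "character varying": str,
    "boolean": bool,
    "bytea": bytes,
}


def map_pg_type(data_type: str) -> object:
    """Map an information_schema data_type to a Python type,
    falling back to typing.Any for source types with no direct
    Pathway equivalent (e.g. jsonb, interval)."""
    return PG_TYPE_MAP.get(data_type, typing.Any)
```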

The derived schema should be logged at startup so the user can inspect it, copy it into their code, and promote it to an explicit declaration if they need stricter type control or column filtering later.
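The log output could render the derived schema directly as a copy-pasteable class declaration, so promotion to an explicit schema is a paste rather than a rewrite. A minimal sketch (`format_schema` is a hypothetical helper, not part of Pathway):

```python
import typing


def format_schema(class_name: str, columns: dict[str, object]) -> str:
    """Render a derived schema as a copy-pasteable pw.Schema declaration,
    using typing.Any for columns that had no direct type mapping."""
    lines = [f"class {class_name}(pw.Schema):"]
    for name, dtype in columns.items():
        annotation = "typing.Any" if dtype is typing.Any else dtype.__name__
        lines.append(f"    {name}: {annotation}")
    return "\n".join(lines)


# Example startup log line for an explored table:
print(format_schema("ExploredSchema", {"id": int, "payload": typing.Any}))
```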

The existing behavior — passing an explicit pw.Schema — should remain fully supported and unchanged. Schema exploration is purely an opt-in convenience for cases where an explicit schema is not provided.

Describe alternatives you've considered

  • Using typing.Any for all columns manually — possible today via pw.schema_builder, but requires the user to know the column names in advance and provides no type information.
  • Using pw.io.airbyte.read as a schema-free alternative — works for some sources, but introduces a heavy dependency and is not a like-for-like replacement for the native connectors.
  • Querying the information schema manually before constructing the pipeline and building a schema with pw.schema_builder — works, but puts the boilerplate on the user and duplicates logic the connector already has to execute internally (e.g. pw.io.postgres.read already connects to the database at startup).
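For reference, the manual workaround in the last bullet looks roughly like the following. Only the catalog query is concrete; the connection handling and type map are placeholders, and the Pathway calls are sketched in comments:

```python
# Catalog query the user currently has to run by hand before building a
# schema with pw.schema_builder (parameterized for psycopg-style drivers).
INFO_SCHEMA_QUERY = (
    "SELECT column_name, data_type "
    "FROM information_schema.columns "
    "WHERE table_schema = %s AND table_name = %s "
    "ORDER BY ordinal_position"
)

# Sketch of the boilerplate this feature would remove (dsn, table name,
# and TYPE_MAP are placeholders):
#
#   with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
#       cur.execute(INFO_SCHEMA_QUERY, ("public", "events"))
#       columns = {
#           name: pw.column_definition(dtype=TYPE_MAP.get(dt, typing.Any))
#           for name, dt in cur.fetchall()
#       }
#   schema = pw.schema_builder(columns=columns)
```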

Additional context

The connectors best suited for an initial implementation are the ones where the source already exposes a rich, strongly-typed schema at the catalog level:

  • pw.io.postgres.read — column types available via information_schema.columns or the PostgreSQL catalog; the type mapping table is already documented in the API docs.
  • pw.io.mssql.read — column types available via sys.columns / information_schema.columns.
  • pw.io.mysql.read — column types available via information_schema.columns.

For sources where schema inference is more ambiguous (e.g. DynamoDB, MongoDB), the fallback to typing.Any per column is acceptable, and the feature can be introduced incrementally connector by connector.

Primary key information should also be explored where available (e.g. pg_constraint in PostgreSQL, KEY_COLUMN_USAGE in MySQL/MSSQL), since it is required for correct behavior of streaming connectors. If primary key information cannot be reliably determined, the connector should warn the user and fall back to auto-generated row identifiers.
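The primary-key lookup could use catalog queries along these lines (a sketch written from the PostgreSQL and MySQL system-catalog layouts; placeholder styles and exact column names should be checked against each driver):

```python
# PostgreSQL: primary-key columns via pg_constraint joined to pg_attribute;
# conkey holds the attribute numbers of the constrained columns.
PG_PRIMARY_KEY_QUERY = (
    "SELECT a.attname "
    "FROM pg_constraint c "
    "JOIN pg_attribute a "
    "  ON a.attrelid = c.conrelid AND a.attnum = ANY (c.conkey) "
    "WHERE c.contype = 'p' AND c.conrelid = %s::regclass"
)

# MySQL/MSSQL-style: primary-key columns via KEY_COLUMN_USAGE.
MYSQL_PRIMARY_KEY_QUERY = (
    "SELECT column_name "
    "FROM information_schema.KEY_COLUMN_USAGE "
    "WHERE constraint_name = 'PRIMARY' "
    "AND table_schema = %s AND table_name = %s"
)
```

If either query returns no rows, the connector would emit the warning described above and fall back to auto-generated row identifiers.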

Metadata

Labels: enhancement (New feature or request)