Automated schema exploration in input connectors #224

@zxqfd555

Description

Is your feature request related to a problem? Please describe.

All structured input connectors in Pathway that read from sources with a known schema — in particular pw.io.postgres.read, pw.io.mssql.read, pw.io.mysql.read, and similar database connectors — currently require the user to explicitly define a pw.Schema subclass before the pipeline can be started. This creates unnecessary boilerplate for common use cases such as exploratory data work, quick prototyping, and pipelines where the table structure is already fully defined in the source system. The user is forced to manually replicate schema information that the connector could retrieve itself.

Describe the solution you'd like

For connectors that read from sources with queryable metadata (SQL databases, BigQuery, DynamoDB, etc.), make the schema parameter optional. When it is omitted, the connector should perform schema exploration at startup by querying the source's catalog or information schema, derive a pw.Schema automatically from the column names and types found there, and proceed with that schema as if it had been provided explicitly.

The type mapping used during exploration should follow the same conversion rules already documented for each connector (e.g. the PostgreSQL type conversion table in pw.io.postgres). Columns whose source type has no direct Pathway equivalent should fall back to typing.Any.
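As an illustration, the exploration step could drive the conversion with a lookup table mirroring the documented PostgreSQL mapping, falling back to typing.Any for unmapped source types. This is a sketch; the dictionary and function names are hypothetical, not actual Pathway internals:

```python
import typing

# Hypothetical PostgreSQL -> Python type mapping, mirroring the documented
# conversion table in the pw.io.postgres API docs (not Pathway internals).
PG_TYPE_MAP: dict[str, type] = {
    "integer": int,
    "bigint": int,
    "smallint": int,
    "double precision": float,
    "real": float,
    "numeric": float,
    "text": str,
    "character varying": str,
    "boolean": bool,
    "bytea": bytes,
}


def map_pg_type(data_type: str) -> object:
    """Map an information_schema data_type to a Python type,
    falling back to typing.Any for source types with no direct
    Pathway equivalent (e.g. jsonb, interval)."""
    return PG_TYPE_MAP.get(data_type, typing.Any)
```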

The derived schema should be logged at startup so the user can inspect it, copy it into their code, and promote it to an explicit declaration if they need stricter type control or column filtering later.
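The log output could render the derived schema directly as a copy-pasteable class declaration, so promotion to an explicit schema is a paste rather than a rewrite. A minimal sketch (`format_schema` is a hypothetical helper, not part of Pathway):

```python
import typing


def format_schema(class_name: str, columns: dict[str, object]) -> str:
    """Render a derived schema as a copy-pasteable pw.Schema declaration,
    using typing.Any for columns that had no direct type mapping."""
    lines = [f"class {class_name}(pw.Schema):"]
    for name, dtype in columns.items():
        annotation = "typing.Any" if dtype is typing.Any else dtype.__name__
        lines.append(f"    {name}: {annotation}")
    return "\n".join(lines)


# Example startup log line for an explored table:
print(format_schema("ExploredSchema", {"id": int, "payload": typing.Any}))
```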

The existing behavior — passing an explicit pw.Schema — should remain fully supported and unchanged. Schema exploration is purely an opt-in convenience for cases where an explicit schema is not provided.

Describe alternatives you've considered

  • Using typing.Any for all columns manually — possible today via pw.schema_builder, but requires the user to know the column names in advance and provides no type information.
  • Using pw.io.airbyte.read as a schema-free alternative — works for some sources, but introduces a heavy dependency and is not a like-for-like replacement for the native connectors.
  • Querying the information schema manually before constructing the pipeline and building a schema with pw.schema_builder — works, but puts the boilerplate on the user and duplicates logic the connector already has to execute internally (e.g. pw.io.postgres.read already connects to the database at startup).
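For reference, the manual workaround in the last bullet looks roughly like the following. Only the catalog query is concrete; the connection handling and type map are placeholders, and the Pathway calls are sketched in comments:

```python
# Catalog query the user currently has to run by hand before building a
# schema with pw.schema_builder (parameterized for psycopg-style drivers).
INFO_SCHEMA_QUERY = (
    "SELECT column_name, data_type "
    "FROM information_schema.columns "
    "WHERE table_schema = %s AND table_name = %s "
    "ORDER BY ordinal_position"
)

# Sketch of the boilerplate this feature would remove (dsn, table name,
# and TYPE_MAP are placeholders):
#
#   with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
#       cur.execute(INFO_SCHEMA_QUERY, ("public", "events"))
#       columns = {
#           name: pw.column_definition(dtype=TYPE_MAP.get(dt, typing.Any))
#           for name, dt in cur.fetchall()
#       }
#   schema = pw.schema_builder(columns=columns)
```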

Additional context

The connectors best suited for an initial implementation are the ones where the source already exposes a rich, strongly-typed schema at the catalog level:

  • pw.io.postgres.read — column types available via information_schema.columns or the PostgreSQL catalog; the type mapping table is already documented in the API docs.
  • pw.io.mssql.read — column types available via sys.columns / information_schema.columns.
  • pw.io.mysql.read — column types available via information_schema.columns.

For sources where schema inference is more ambiguous (e.g. DynamoDB, MongoDB), the fallback to typing.Any per column is acceptable, and the feature can be introduced incrementally connector by connector.

Primary key information should also be explored where available (e.g. pg_constraint in PostgreSQL, KEY_COLUMN_USAGE in MySQL/MSSQL), since it is required for correct behavior of streaming connectors. If primary key information cannot be reliably determined, the connector should warn the user and fall back to auto-generated row identifiers.
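The primary-key lookup could use catalog queries along these lines (a sketch written from the PostgreSQL and MySQL system-catalog layouts; placeholder styles and exact column names should be checked against each driver):

```python
# PostgreSQL: primary-key columns via pg_constraint joined to pg_attribute;
# conkey holds the attribute numbers of the constrained columns.
PG_PRIMARY_KEY_QUERY = (
    "SELECT a.attname "
    "FROM pg_constraint c "
    "JOIN pg_attribute a "
    "  ON a.attrelid = c.conrelid AND a.attnum = ANY (c.conkey) "
    "WHERE c.contype = 'p' AND c.conrelid = %s::regclass"
)

# MySQL/MSSQL-style: primary-key columns via KEY_COLUMN_USAGE.
MYSQL_PRIMARY_KEY_QUERY = (
    "SELECT column_name "
    "FROM information_schema.KEY_COLUMN_USAGE "
    "WHERE constraint_name = 'PRIMARY' "
    "AND table_schema = %s AND table_name = %s"
)
```

If either query returns no rows, the connector would emit the warning described above and fall back to auto-generated row identifiers.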

Metadata

Labels: enhancement (New feature or request)