Releases · pathwaycom/pathway · GitHub

05 Aug 10:32

v0.14.1

Added

pw.xpacks.llm.embedders.GeminiEmbedder which is a wrapper for Google Gemini Embedding services.

25 Jul 20:50

v0.14.0

Fixed

pw.debug.table_to_pandas now exports int | None columns correctly.

Changed

pw.io.airbyte.read can now be used with Airbyte connectors implemented in Python without requiring Docker.
BREAKING: UDFs now verify the type of returned values at runtime. If a returned value can be cast to the expected type, the value is cast. If the value does not match the expected type and can't be cast, an error is raised.
BREAKING: pw.reducers.ndarray reducer requires the input column to have type float, int, or Array.
pw.xpacks.llm.parsers.OpenParse can now extract and parse images and diagrams from PDFs. This can be enabled by setting the parse_images parameter. processing_pipeline can also be set to customize the post-processing of document elements.
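
The runtime check on UDF return values described in this release can be pictured with a small, library-independent sketch. The helpers `check_return` and `checked` are hypothetical and are not Pathway's implementation; they only model the three outcomes (value already matches, value is cast, error is raised). The casting rule used here (int widened to float, nothing else) is an assumption made for illustration.

```python
def check_return(value, expected_type):
    """Hypothetical model of a runtime return-type check (not Pathway code)."""
    # Compare exact types so that bool is not silently accepted as int.
    if type(value) is expected_type:
        return value
    # Assumed casting rule for this sketch: an int may be widened to float.
    if expected_type is float and type(value) is int:
        return float(value)
    raise TypeError(
        f"expected {expected_type.__name__}, got {type(value).__name__}"
    )


def checked(expected_type):
    """Decorator that applies check_return to a function's result."""
    def wrap(fn):
        def inner(*args, **kwargs):
            return check_return(fn(*args, **kwargs), expected_type)
        return inner
    return wrap


@checked(float)
def half(x):
    return x // 2  # returns an int, which the sketch casts to float


@checked(int)
def label(x):
    return str(x)  # a str cannot be cast to int under the assumed rule
```

In this model, `half(10)` yields a float, while `label(3)` raises `TypeError` because no cast is available.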

08 Jul 20:53

v0.13.2

Added

pw.io.deltalake.read now supports S3 data sources.
pw.xpacks.llm.parsers.ImageParser, which allows parsing images with vision language models.
pw.xpacks.llm.parsers.SlideParser, which enables parsing PDF and PPTX slides with vision language models.
pw.xpacks.llm.parsers.question_answering.RAGClient, Python client for Pathway hosted RAG apps.
pw.xpacks.llm.parsers.question_answering.DeckRetriever, a RAG app that enables searching through slide decks with visual-heavy elements.

Fixed

pw.xpacks.llm.vector_store.VectorStoreServer now uses new indexes.

Changed

pw.xpacks.llm.parsers.OpenParse now supports any vision language model, including local and proprietary models, via LiteLLM.

27 Jun 10:31

v0.13.1

Added

pw.io.kafka.read now accepts an autogenerate_key flag. This flag determines the primary key generation policy to apply when reading raw data from the source. You can either use the key from the Kafka message or have Pathway autogenerate one.
pw.io.deltalake.read input connector that fetches changes from DeltaLake into a Pathway table.
pw.xpacks.llm.parsers.OpenParse which allows parsing tables and images in PDFs.

Fixed

All S3 input connectors (including S3, Min.io, Digital Ocean, and Wasabi) now automatically retry network operations if a failure occurs.
The issue where the connection to the S3 source fails after partially ingesting an object has been resolved by downloading the object in full first.
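
As a rough illustration of the two fixes above (retrying failed network operations, and downloading an object in full before processing it), here is a generic sketch. `with_retries`, `FlakySource`, and `fetch_object` are hypothetical names, and the retry count and backoff schedule are assumptions; none of this reflects the connectors' actual internals.

```python
import time


def with_retries(operation, attempts=3, base_delay=0.0):
    """Call operation(); on ConnectionError retry, up to `attempts` calls."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: propagate the last error
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff


class FlakySource:
    """Stand-in for a remote object store that fails the first N reads."""

    def __init__(self, payload, failures):
        self.payload = payload
        self.failures = failures
        self.calls = 0

    def fetch_object(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated network failure")
        return self.payload  # the whole object, returned in one piece


source = FlakySource(b"line1\nline2\n", failures=2)
# Download the full object first, then parse it locally.
data = with_retries(source.fetch_object)
rows = data.decode().splitlines()
```

Because parsing starts only after the whole payload has arrived, a dropped connection can never leave a half-ingested object behind; the cost is holding the object in memory before processing.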

13 Jun 12:12

v0.13.0

Added

pw.io.deltalake.write now supports S3 destinations.

Changed

pw.debug.compute_and_print now allows passing more than one table.
BREAKING: path parameter in pw.io.deltalake.write renamed to uri.

Fixed

A bug in pw.Table.deduplicate: if persistent_id is not set, it is no longer generated in pw.PersistenceMode.SELECTIVE_PERSISTING mode.

10 Jun 06:06

v0.12.0

Added

pw.PyObjectWrapper, which enables passing Python objects of any type to the engine.
cache_strategy option added for pw.io.http.rest_connector. It enables cache configuration, which is useful for duplicated requests.
allow_misses argument to Table.ix and Table.ix_ref methods which allows for filling rows with missing keys with None values.
pw.io.deltalake.write output connector that streams the changes of a given table into a DeltaLake storage.
pw.io.airbyte.read now supports data extraction with Google Cloud Runs.
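
The effect of allow_misses can be modeled with plain dictionaries. The function `lookup` below is a hypothetical stand-in and does not mirror the Table.ix signature; it assumes that, without allow_misses, a missing key is an error.

```python
def lookup(table, keys, allow_misses=False):
    """Return the row for each key; a model of a keyed lookup (not Pathway code)."""
    result = []
    for key in keys:
        if key in table:
            result.append(table[key])
        elif allow_misses:
            # Missing key: fill every column of the row with None.
            columns = next(iter(table.values())).keys() if table else []
            result.append({column: None for column in columns})
        else:
            raise KeyError(key)
    return result


owners = {
    "alice": {"pet": "cat", "age": 3},
    "bob": {"pet": "dog", "age": 5},
}
found = lookup(owners, ["alice", "carol"], allow_misses=True)
```

Here the row for the unknown key "carol" comes back with None in both columns instead of aborting the lookup.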

Removed

BREAKING: Removed Table.having method.
BREAKING: Removed pw.DATE_TIME_UTC, pw.DATE_TIME_NAIVE and pw.DURATION as dtype markers. Instead, pw.DateTimeUtc, pw.DateTimeNaive and pw.Duration should be used, which are wrappers for corresponding pandas types.
BREAKING: Removed class transformers from public API: pw.ClassArg, pw.attribute, pw.input_attribute, pw.input_method, pw.method, pw.output_attribute and pw.transformer.
BREAKING: Removed several methods from pw.indexing module: binsearch_oracle, filter_cmp_helper, filter_smallest_k and prefix_sum_oracle.

27 May 08:33

v0.11.2

Added

pathway.assert_table_has_schema and pathway.table_transformer now accept an allow_subtype argument, which, if True, allows column types in the Table to be subtypes of the types in the Schema.
next method added to pw.io.python.ConnectorSubject (the Python connector). It enables passing values of any type to the engine, not only values that are JSON-serializable. The next method should be the preferred way of passing values from the Python connector.

Changed

The format argument of pw.io.python.read is deprecated. A data format is inferred from the method used (next_json, next_str, next_bytes) and the provided schema.

Removed

Removed pw.numba_apply and numba dependency.

Fixed

Fixed pw.this desugaring bug, where __getitem__ in .ix context was not working properly.
pw.io.sqlite.read now checks if the data matches the passed schema.

16 May 19:30

v0.11.1

Added

query and query_as_of_now of pathway.stdlib.indexing.data_index.DataIndex now accept a column of type str | None in the metadata_column parameter.
pathway.xpacks.connectors.sharepoint module under Pathway for Business License.

10 May 14:56

v0.11.0

Added

Embedders in the LLM xpack now have a get_embedding_dimension method that returns the number of dimensions used by the chosen embedder.
pathway.stdlib.indexing.nearest_neighbors, with implementations of pathway.stdlib.indexing.data_index.InnerIndex based on k-NN via LSH (implemented in Pathway), and k-NN provided by USearch library.
pathway.stdlib.indexing.vector_document_index, with a few predefined instances of pathway.stdlib.indexing.data_index.DataIndex.
pathway.stdlib.indexing.bm25, with implementations of pathway.stdlib.indexing.data_index.InnerIndex based on BM25 index provided by Tantivy.
pathway.stdlib.indexing.full_text_document_index, with a predefined instance of pathway.stdlib.indexing.data_index.DataIndex.
Introduced the reranker module under xpacks.llm. It includes a few re-ranking strategies and utility functions for RAG applications.

Changed

BREAKING: windowby generates IDs of produced rows differently than in the previous version.
BREAKING: pw.io.csv.write prints printable non-ascii characters as regular text, not \u{xxxx}.
BREAKING: Connector methods pw.io.elasticsearch.read, pw.io.debezium.read, pw.io.fs.read, pw.io.jsonlines.read, pw.io.kafka.read, pw.io.python.read, pw.io.redpanda.read, pw.io.s3.read now check the type of the input data. Previously the type was not checked if the provided format was "json"/"jsonlines". If the data is inconsistent with the provided schema, the row is skipped and an error message is emitted.
BREAKING: query and query_as_of_now methods of pathway.stdlib.indexing.data_index.DataIndex now return pathway.JoinResult, to allow resolving column name conflicts (between columns in the table with queries and table with index data).
BREAKING: DataIndex methods query and query_as_of_now now return score in a column named _pw_index_reply_score (defined as _SCORE variable in pathway.stdlib.indexing.colnames.py).
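
The input type check described above (rows inconsistent with the schema are skipped and an error message is emitted) can be sketched without Pathway. `read_rows` and the schema representation (a dict from column name to Python type) are hypothetical, chosen only to show the skip-and-report behavior.

```python
import json
import logging

logger = logging.getLogger("schema_check_sketch")


def read_rows(lines, schema):
    """Parse JSON lines, keeping only rows whose fields match `schema`.

    `schema` maps column names to Python types (an assumption of this sketch).
    """
    rows = []
    for number, line in enumerate(lines, start=1):
        record = json.loads(line)
        mismatched = [
            name
            for name, expected in schema.items()
            if type(record.get(name)) is not expected
        ]
        if mismatched:
            # Skip the row and report it instead of failing the whole read.
            logger.error("line %d skipped: bad columns %s", number, mismatched)
            continue
        rows.append({name: record[name] for name in schema})
    return rows


lines = [
    '{"name": "a", "count": 1}',
    '{"name": "b", "count": "two"}',
    '{"name": "c", "count": 3}',
]
rows = read_rows(lines, {"name": str, "count": int})
```

The second line carries a string where an int is expected, so it is logged and dropped while the other two rows pass through.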

Removed

BREAKING: pathway.stdlib.indexing.data_index.VectorDocumentIndex class. Some predefined instances are now meant to be obtained via methods provided in pathway.stdlib.indexing.vector_document_index.
BREAKING: with_distances parameter of query and query_as_of_now methods in pathway.stdlib.indexing.data_index.DataIndex. Instead of 'distance', we now operate with a more general term 'score' (higher = better). For distance-based indices, score is usually defined as negative distance. Score is now always included in the answer, as long as the underlying index returns something that indicates the quality of a match.
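
To make the "score is negative distance, higher is better" convention concrete, here is a minimal brute-force sketch. The `query` function and the dictionary-based index are hypothetical and unrelated to the DataIndex implementation; Euclidean distance is an arbitrary choice for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def query(index, vector, k):
    """Return the k best (score, doc_id) pairs, best first.

    Score is defined here as negative distance, so higher means closer.
    """
    scored = [(-euclidean(vector, v), doc_id) for doc_id, v in index.items()]
    scored.sort(reverse=True)  # highest score (smallest distance) first
    return scored[:k]


index = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (1.0, 0.0)}
top = query(index, (0.0, 0.0), k=2)
```

With this convention a distance-based index and a relevance-based index (such as BM25) can be ranked the same way: sort by score in descending order.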

30 Apr 12:25

v0.10.1

Added

query method added to VectorStoreServer for API compatibility with DataIndex.
AdaptiveRAGQuestionAnswerer to xpacks.question_answering. End-to-end pipeline and accompanying code for Private RAG showcase.
