All notable changes to this project will be documented in this file.
This project adheres to Semantic Versioning.
- `pathway.xpacks.llm.splitter.TokenCountSplitter`.
- Introducing new methods for strict conversion of `pw.Json` to desired types within a UDF body: `as_int()`, `as_float()`, `as_str()`, `as_bool()`, `as_list()`, `as_dict()`.
- Added `table.col.dt.utc_from_timestamp` method: Creates `DateTimeUtc` from timestamps represented as `int`s or `float`s.
- Enhanced the `table.col.dt.timestamp` method with a new `unit` argument to specify the unit of the returned timestamp.
- Introduced an experimental xpack with a Microsoft SharePoint input connector.
- Index operator (`[]`) can now be directly applied to `pw.Json` within UDFs to access elements of JSON objects, arrays, and strings.
- Enhanced the `table.col.dt.from_timestamp` method to create `DateTimeNaive` from timestamps represented as `int`s or `float`s.
- Deprecated not specifying the `unit` argument of the `table.col.dt.timestamp` method.
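
The unit-aware timestamp conversions described in the entries above can be illustrated without Pathway. The following is a minimal standard-library sketch, not Pathway's implementation: the helper names `utc_from_timestamp_sketch` and `timestamp_sketch` are hypothetical, and the unit strings `"s"`, `"ms"`, `"us"`, `"ns"` are assumed for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Microseconds per one tick of each unit (unit names are assumed for this sketch).
MICROSECONDS_PER_UNIT = {"s": 1_000_000, "ms": 1_000, "us": 1, "ns": 0.001}


def utc_from_timestamp_sketch(value, unit):
    """Turn an int or float timestamp in the given unit into a UTC datetime."""
    return EPOCH + timedelta(microseconds=value * MICROSECONDS_PER_UNIT[unit])


def timestamp_sketch(moment, unit):
    """Return the time elapsed since the epoch, expressed in the given unit."""
    elapsed_us = (moment - EPOCH) / timedelta(microseconds=1)
    return elapsed_us / MICROSECONDS_PER_UNIT[unit]


moment = utc_from_timestamp_sketch(1_700_000_000, "s")
print(moment.isoformat())
print(timestamp_sketch(moment, "ms"))
```

Passing the unit explicitly removes any ambiguity about whether `1_700_000_000` means seconds or milliseconds. Python's `datetime` only resolves microseconds, so the `"ns"` case in this sketch is lossy.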
- `KNNIndex` now supports returning computed distances.
- Added support for cosine similarity in `KNNIndex`.
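
To make the two `KNNIndex` entries above concrete, here is a brute-force, pure-Python sketch of a nearest-neighbor lookup that ranks by cosine similarity and returns the computed distance alongside each result. It does not use `KNNIndex`: the function names are hypothetical, and defining the distance as one minus the cosine similarity is an assumption of this sketch.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; both vectors must be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_items_sketch(query, items, k):
    """Return the k closest (key, distance) pairs, closest first."""
    scored = [(cosine_distance(query, vec), key) for key, vec in items.items()]
    scored.sort()
    return [(key, dist) for dist, key in scored[:k]]


items = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
for key, dist in nearest_items_sketch([1.0, 0.0], items, k=2):
    print(key, round(dist, 4))
```

Cosine distance ignores vector length, so `[1.0, 0.0]` and `[5.0, 0.0]` would be equally close to the query.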
- The `offset` argument of `pw.stdlib.temporal.sliding` and `pw.stdlib.temporal.tumbling` is deprecated. Use `origin` instead, as it represents a point in time, not a duration.
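
The difference between a duration and a point in time can be shown with a small sketch of tumbling-window assignment. This is plain Python, not Pathway's windowing code: `tumbling_window_sketch` is a hypothetical helper, and it assumes windows lie on a grid aligned to `origin`. How Pathway treats timestamps earlier than `origin` is not stated in the entry above and is not modeled here.

```python
from datetime import datetime, timedelta


def tumbling_window_sketch(t, duration, origin):
    """Return the (start, end) of the window of length `duration` containing `t`.

    Window starts are assumed to be origin + n * duration for integer n.
    """
    steps = (t - origin) // duration
    start = origin + steps * duration
    return start, start + duration


# Works for plain numbers...
print(tumbling_window_sketch(17, 10, origin=5))

# ...and for datetimes, where origin is naturally a point in time.
origin = datetime(2024, 1, 1, 0, 5)
start, end = tumbling_window_sketch(
    datetime(2024, 1, 1, 0, 17), timedelta(minutes=10), origin
)
print(start.time(), end.time())
```

With datetime data, `origin` has the same type as the time column, while the window length stays a duration.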
- Sliding window now works correctly with UTC Datetimes.
- Temporal column in `asof_join` no longer has to be named `t`.
- `asof_join` includes rows with equal times for all values of the `direction` parameter.
- Fixed an issue with `pw.io.gdrive.read`: Shared folders support is now working seamlessly.
- Added `Table.split()` method for splitting a table into two tables based on an expression.
- Columns with datatype duration can now be multiplied and divided by floats.
- Columns with datatype duration now support both true and floor division (`/` and `//`) by integers.
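
The duration arithmetic described above mirrors what Python's own `datetime.timedelta` offers, so a standard-library example shows the difference between true and floor division. This is an analogy only: Pathway's duration columns are not `timedelta` objects, and their resolution and rounding may differ from what is shown here.

```python
from datetime import timedelta

d = timedelta(seconds=20)

scaled = d * 1.5        # multiplication by a float
by_float = d / 2.5      # division by a float
true_div = d / 3        # true division by an integer, rounded to the nearest microsecond
floor_div = d // 3      # floor division by an integer, rounded down

print(scaled, by_float)
print(true_div, floor_div)
```

The two integer divisions differ only in the last microsecond here, which is exactly the rounding-versus-flooring distinction between `/` and `//`.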
- Pathway is better at typing if_else expressions when optional types are involved.
- `table.flatten()` operator now supports Json array.
- Buffers (used to delay outputs, configured via delay in `common_behavior`) now flush the data when the computation is finished. The effect of this change can be seen when run in bounded (batch / multi-revision) mode.
- `pw.io.subscribe()` takes an additional argument `on_time_end`: the callback function to be called on each closed time of computation.
- `pw.io.subscribe()` is now a single-worker operator, guaranteeing that `on_end` is triggered at most once.
- `KNNIndex` now supports metadata filtering. Each query can specify its own filter in the JMESPath format.
- Resolved an optimization bug causing `pw.iterate` to malfunction when handling columns effectively pointing to the same data.
- Pathway now keeps track of `array` column types better: it is able to keep track of the Array dtype and number of dimensions, wherever applicable.
- Fixed issues with standalone panel+Bokeh dashboards to ensure optimal functionality and performance.
- A method `weekday` has been added to the `dt` namespace, that can be called on column expressions containing datetime data. This method returns an integer that represents the day of the week.
- EXPERIMENTAL: Methods `show` and `plot` on Tables, providing visualizations of data using HoloViz Panel.
- Added support for `instance` parameter to `groupby`, `join`, `windowby` and temporal join methods.
- `pw.PersistenceMode.UDF_CACHING` persistence mode enabling automatic caching of `AsyncTransformer` invocations.
- Methods `round` and `floor` on columns with datetimes now accept a string as the duration argument.
- `pw.debug.compute_and_print` and `pw.debug.compute_and_print_update_stream` have a new argument `n_rows` that limits the number of rows printed.
- `pw.debug.table_to_pandas` has a new argument `include_id` (by default `True`). If set to `False`, creates a new index for the Pandas DataFrame, rather than using the keys of the Pathway Table.
- `windowby` function `shard` argument is now deprecated and `instance` should be used.
- Special column name `_pw_shard` is now deprecated, and `_pw_instance` should be used.
- `pw.ReplayMode` now can be accessed as `pw.PersistenceMode`, while the `SPEEDRUN` and `REALTIME` variants are now accessible as `SPEEDRUN_REPLAY` and `REALTIME_REPLAY`.
- EXPERIMENTAL: `pw.io.gdrive.read` has a new argument `with_metadata` (by default `False`). If set to `True`, adds a `_metadata` column containing file metadata to the resulting table.
- Methods `get_nearest_items` and `get_nearest_items_asof_now` of `KNNIndex` allow specifying `k` (number of returned elements) separately in each query.
- Added the ability to create custom reducers using the `pw.reducers.udf_reducer` decorator. Use `pw.BaseCustomAccumulator` as a base class for creating accumulators. Decorating an accumulator returns a reducer following custom logic.
- A function `pw.debug.compute_and_print_update_stream` that computes and prints the update stream of the table.
- SQLite input connector (`pw.io.sqlite`).
- `pw.debug.parse_to_table` is now deprecated, `pw.debug.table_from_markdown` should be used instead.
- `pw.schema_from_csv` now has `quote` and `double_quote_escapes` arguments.
- Schema returned from `pw.schema_from_csv` will have quotes removed from column names, so it will now work properly with `pw.io.csv.read`.
- Experimental Google Drive input connector.
- Stateful deduplication function (`pw.stateful.deduplicate`) allowing alerting on significant changes.
- The ability to split data into batches in `pw.debug.table_from_markdown` and `pw.debug.table_from_pandas`.
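
The idea behind stateful deduplication, as described in the entry above, can be sketched in plain Python: remember the last accepted value and let a user-supplied predicate decide whether a new value differs enough to be emitted. This does not call `pw.stateful.deduplicate`. The names `deduplicate_sketch` and `significant_change`, and the `(new, old)` argument order of the predicate, are assumptions made for illustration.

```python
_NOTHING = object()  # sentinel meaning "no value accepted yet"


def deduplicate_sketch(values, acceptor):
    """Keep a value only if `acceptor(new, last_accepted)` says it matters."""
    accepted = []
    last = _NOTHING
    for value in values:
        if last is _NOTHING or acceptor(value, last):
            accepted.append(value)
            last = value
    return accepted


def significant_change(new, old):
    """Hypothetical rule: alert only when the reading moved by 5 or more."""
    return abs(new - old) >= 5


readings = [20, 21, 22, 26, 27, 19, 18]
print(deduplicate_sketch(readings, significant_change))
```

Each value is compared with the last accepted value rather than with the previous raw reading, so a slow drift still triggers an alert once it accumulates past the threshold.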
- class `Behavior`, a superclass of all behavior classes.
- class `ExactlyOnceBehavior` indicating we want to create a `CommonBehavior` that results in each window producing exactly one output (shifted in time by an optional `shift` parameter).
- function `exactly_once_behavior` creating an instance of `ExactlyOnceBehavior`.
- BREAKING: `WindowBehavior` is now called `CommonBehavior`, as it can be also used with interval joins.
- BREAKING: `window_behavior` is now called `common_behavior`, as it can be also used with interval joins.
- Deprecating parameter `keep_queries` in `pw.io.http.rest_connector`. Now `delete_completed_queries` with an opposite meaning should be used instead. The default is still `delete_completed_queries=True` (equivalent to `keep_queries=False`) but it will soon be required to be set explicitly.
- A flag `with_metadata` for the filesystem-based connectors to attach the source file metadata to the table entries.
- Methods `pw.debug.table_from_list_of_batches` and `pw.debug.table_from_list_of_batches_by_workers` for creating tables with defined data being inserted over time.
- BREAKING: `pw.debug.table_from_pandas` and `pw.debug.table_from_markdown` now will create tables in the streaming mode, instead of static, if the given table definition contains a `_time` column.
- BREAKING: Renamed the parameter `keep_queries` in `pw.io.http.rest_connector` to `delete_queries` with the opposite meaning. It changes the default behavior: it was `keep_queries=False`, now it is `delete_queries=False`.
- A method `get_nearest_items_asof_now` in `KNNIndex` that allows getting nearest neighbors without updating old queries in the future.
- A method `asof_now_join` in `Table` to join rows from the left side of the join with the right side of the join at their processing time. Past rows from the left side are not used when new data appears on the right side.
- `interval_join` now supports forgetting old entries. The configuration can be passed using `behavior` parameter of `interval_join` method.
- Decorator `@table_transformer` for marking that functions take Tables as arguments.
- Namespace for all columns `Table.C.*`.
- Output connectors now provide logs about the number of entries written and time taken.
- Filesystem connectors now support reading whole files as rows.
- Command line option for `pathway spawn` to record data and `pathway replay` command to replay data.
- `select` operates only on consistent states.
- `Schema` method `typehints` that returns dict of mypy-compatible typehints.
- Support for JSON parsing from CSV sources.
- `restrict` method in `Table` to restrict table universe to the universe of the other table.
- Better support for postgresql types in the output connector.
- BREAKING: renamed `Table` method `dtypes` to `typehints`. It now returns a `dict` of mypy-compatible typehints.
- BREAKING: `Schema.__getitem__` returns a data class `ColumnSchema` containing all related information on particular column.
- BREAKING: `tuple` reducer used after intervals_over window now sorts values by time.
- BREAKING: expressions used in `select`, `filter`, `flatten`, `with_columns`, `with_id`, `with_id_from` have to have the same universe as the table. Earlier it was possible to use an expression from a superset of a table universe. To use expressions from wider universes, one can use `restrict` on the expression source table.
- BREAKING: `pw.universes.promise_are_equal(t1, t2)` no longer allows to use references from `t1` and `t2` in a single expression. To change the universe of a table, use `with_universe_of`.
- BREAKING: `ix` and `ix_ref` are temporarily broken inside joins (both temporal and ordinary).
- `select`, `filter`, `concat` keep columns as a single stream. The work for other operators is ongoing.
- Optional types other than string correctly output to PostgreSQL.
- Support for messages compressed with zstd in the Kafka connector.
- Support for JSON data format, including `pw.Json` type.
- Methods `as_int()`, `as_float()`, `as_str()`, `as_bool()` to convert values from `Json`.
- New argument `skip_nones` for `tuple` and `sorted_tuple` reducers.
- New argument `is_outer` for `intervals_over` window.
- `pw.schema_from_dict` and `pw.schema_from_csv` for generating schema based, respectively, on provided definition as a dictionary and CSV file with sample data.
- `generate_class` method in `Schema` class for generating schema class code.
- Method `get()` and `[]` to support accessing elements in Jsons.
- Function `pw.assert_table_has_schema` for writing asserts that check whether a given table has the same schema as the one given as an argument.
- BREAKING: `ix` and `ix_ref` operations are now standalone transformations of `pw.Table` into `pw.Table`. Most of the usages remain the same, but sometimes the user needs to provide a context (e.g. when using them inside `join` or `groupby` operations). `ix` and `ix_ref` are temporarily broken inside temporal joins.
- Fixed a bug where new-style optional types (e.g. `int | None`) were translated to `Any` dtype.
- Incompatible `beartype` version is now excluded from dependencies.
- Module `pathway.dt` to construct and manipulate DTypes.
- New argument `keep_queries` in `pw.io.http.rest_connector`.
- Internal representation of DTypes. Inputting types is compatible backwards.
- Temporal functions now accept arguments of mixed types (ints and floats). For example, `pw.temporal.interval` can use ints while columns it interacts with are floats.
- Single-element arrays are now treated as arrays, not as scalars.
- `to_string()` method on datetimes always prints 9 fractional digits.
- `%f` format code in `strptime()` parses fractional part of a second correctly regardless of the number of digits.
- `Table.cast_to_types()` function that can perform `pathway.cast` on multiple columns.
- `intervals_over` window, which allows to get temporally close data to given times.
- `demo.replay_csv_with_time` function that can replay a CSV file following the timestamps of a given column.
- Static data is now copied to ensure immutability.
- Improved error tracing mechanism to work with any type of error.
- `tuple` reducer, that returns a tuple with values.
- `ndarray` reducer, that returns an array with values.
- `numpy` arrays of `int32`, `uint32` and `float32` are now converted to their 64-bit variants instead of tuples.
- KNNIndex interface to take columns as inputs.
- Reducers now check types of their arguments.
- Fixed delayed reporting of output connector errors.
- Python objects are now freed more often, reducing peak memory usage.
- `@` (matrix multiplication) operator.
- Python version 3.10 or later is now required.
- Type checking is now more strict.
- Immediately forget queries in REST connector.
- Make type annotations mandatory in `Schema`.
- Fixed IDs coming from CSV source.
- Fixed indices of dataframes from pandas transformer.