Releases: MTSWebServices/onetl

0.9.5 (2023-10-10)

10 Oct 11:38
17ed2de

Features

  • Add XML file format support. (#163)
  • Tested compatibility with Spark 3.5.0. MongoDB and Excel are not supported yet, but other connectors are. (#159)

Improvements

  • Add a check to all DB and FileDF connections that the Spark session is alive. (#164)

Bug Fixes

  • Fix Hive.check() behavior when Hive Metastore is not available. (#164)

0.9.4 (2023-09-26)

26 Sep 12:34
6944c4f

Features

  • Add Excel file format support. (#148)
  • Add Samba file connection. It is now possible to download and upload files to Samba shared folders using FileDownloader/FileUploader. (#150)
  • Add if_exists="ignore" and if_exists="error" to Hive.WriteOptions. (#143)
  • Add if_exists="ignore" and if_exists="error" to JDBC.WriteOptions. (#144)
  • Add if_exists="ignore" and if_exists="error" to MongoDB.WriteOptions. (#145)
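
The new if_exists values behave the same way across connections. A minimal pure-Python sketch of these semantics (illustrative only, not onetl's implementation):

```python
def write_with_if_exists(target_exists: bool, if_exists: str = "append") -> str:
    """Sketch of the if_exists semantics described above."""
    if target_exists and if_exists == "error":
        # fail loudly instead of touching existing data
        raise ValueError("target already exists")
    if target_exists and if_exists == "ignore":
        # leave existing data untouched, write nothing
        return "skipped"
    return "written"  # append / create as usual
```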

Improvements

  • Add documentation about different ways of passing packages to Spark session. (#151)

  • Drastically improve Greenplum documentation:

    • Added information about network ports, grants, pg_hba.conf and so on.
    • Added interaction schemas for reading, writing and executing statements in Greenplum.
    • Added recommendations about reading data from views and JOIN results from Greenplum. (#154)
  • Make .fetch and .execute methods of DB connections thread-safe. Each thread works with its own connection. (#156)

  • Call .close() on FileConnection when it is removed by the garbage collector. (#156)
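
The per-thread connection approach mentioned above is a standard pattern; a rough sketch of the idea using threading.local (not onetl's actual code, names are illustrative):

```python
import threading

class ConnectionPerThread:
    """Each thread lazily creates and then reuses its own connection object."""

    def __init__(self, factory):
        self._factory = factory
        self._local = threading.local()

    def get(self):
        # attribute lookup on threading.local() is isolated per thread
        if not hasattr(self._local, "connection"):
            self._local.connection = self._factory()
        return self._local.connection
```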

Bug Fixes

  • Fix issue where stopping the Python interpreter called JDBCMixin.close() and printed exceptions to the log. (#156)
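
Tying .close() to object lifetime, as in the garbage-collector improvement above, can be sketched with weakref.finalize, which also runs pending finalizers safely at interpreter exit (an illustration of the pattern, not onetl's implementation):

```python
import gc
import weakref

class ClosingConnection:
    """Registers cleanup that runs when the object is garbage collected."""

    def __init__(self, events: list):
        # note: do not pass a bound method of self here,
        # that would keep the object alive forever
        weakref.finalize(self, events.append, "closed")

closed_events = []
conn = ClosingConnection(closed_events)
del conn      # drop the last reference
gc.collect()  # ensure collection on any Python implementation
```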

0.9.3 (2023-09-06)

06 Sep 15:03
39c497b

Bug Fixes

  • Fix documentation build

0.9.2 (2023-09-06)

06 Sep 12:57
3b17ce5

Features

  • Add if_exists="ignore" and if_exists="error" to Greenplum.WriteOptions. (#142)

Improvements

  • Improve validation messages while writing dataframe to Kafka. (#131)
  • Improve documentation:
    • Add notes about reading and writing to database connections documentation
    • Add notes about executing statements in JDBC and Greenplum connections

Bug Fixes

  • Fixed validation of the headers column when writing to Kafka with default Kafka.WriteOptions(): the default value was False, but instead of raising an exception, the column value was just silently ignored. (#131)
  • Fix reading data from Oracle with partitioningMode="range" without explicitly set lowerBound / upperBound. (#133)
  • Update Kafka documentation with SSLProtocol usage. (#136)
  • Raise exception if someone tries to read data from Kafka topic which does not exist. (#138)
  • Allow passing Kafka topics with names like some.topic.name to DBReader. The same applies to MongoDB collections. (#139)
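
The fix relaxed name validation so dots inside topic and collection names are accepted. A hedged sketch of such a rule (the regex is illustrative, not the one onetl uses):

```python
import re

# hypothetical pattern: dot-separated segments of word characters and dashes,
# with no leading/trailing dot and no empty segments
NAME_PATTERN = re.compile(r"^[\w-]+(\.[\w-]+)*$")

def is_valid_name(name: str) -> bool:
    return NAME_PATTERN.fullmatch(name) is not None
```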

0.9.1 (2023-08-17)

17 Aug 18:57
5f0f86a

Bug Fixes

  • Fixed a bug where the number of threads created by FileDownloader / FileUploader / FileMover was max(workers, len(files)) instead of min(workers, len(files)), leading to too many workers being created for large file lists.
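
The corrected sizing can be sketched like this (illustrative helper, not onetl's code):

```python
from concurrent.futures import ThreadPoolExecutor

def effective_workers(workers: int, files: list) -> int:
    # never spawn more threads than there are files to process
    return min(workers, len(files)) or 1  # ThreadPoolExecutor needs at least 1

def process_all(files, workers=8):
    with ThreadPoolExecutor(max_workers=effective_workers(workers, files)) as pool:
        return list(pool.map(str.upper, files))  # placeholder for the real transfer
```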

0.9.0 (2023-08-17)

17 Aug 12:47
b448bbd

Breaking Changes

  • Rename methods:

    • DBConnection.read_df -> DBConnection.read_source_as_df
    • DBConnection.write_df -> DBConnection.write_df_to_target (#66)
  • Rename classes:

    • HDFS.slots -> HDFS.Slots
    • Hive.slots -> Hive.Slots

    Old names are left intact, but will be removed in v1.0.0 (#103)

  • Rename options to make them self-explanatory:

    • Hive.WriteOptions(mode="append") -> Hive.WriteOptions(if_exists="append")
    • Hive.WriteOptions(mode="overwrite_table") -> Hive.WriteOptions(if_exists="replace_entire_table")
    • Hive.WriteOptions(mode="overwrite_partitions") -> Hive.WriteOptions(if_exists="replace_overlapping_partitions")
    • JDBC.WriteOptions(mode="append") -> JDBC.WriteOptions(if_exists="append")
    • JDBC.WriteOptions(mode="overwrite") -> JDBC.WriteOptions(if_exists="replace_entire_table")
    • Greenplum.WriteOptions(mode="append") -> Greenplum.WriteOptions(if_exists="append")
    • Greenplum.WriteOptions(mode="overwrite") -> Greenplum.WriteOptions(if_exists="replace_entire_table")
    • MongoDB.WriteOptions(mode="append") -> MongoDB.WriteOptions(if_exists="append")
    • MongoDB.WriteOptions(mode="overwrite") -> MongoDB.WriteOptions(if_exists="replace_entire_collection")
    • FileDownloader.Options(mode="error") -> FileDownloader.Options(if_exists="error")
    • FileDownloader.Options(mode="ignore") -> FileDownloader.Options(if_exists="ignore")
    • FileDownloader.Options(mode="overwrite") -> FileDownloader.Options(if_exists="replace_file")
    • FileDownloader.Options(mode="delete_all") -> FileDownloader.Options(if_exists="replace_entire_directory")
    • FileUploader.Options(mode="error") -> FileUploader.Options(if_exists="error")
    • FileUploader.Options(mode="ignore") -> FileUploader.Options(if_exists="ignore")
    • FileUploader.Options(mode="overwrite") -> FileUploader.Options(if_exists="replace_file")
    • FileUploader.Options(mode="delete_all") -> FileUploader.Options(if_exists="replace_entire_directory")
    • FileMover.Options(mode="error") -> FileMover.Options(if_exists="error")
    • FileMover.Options(mode="ignore") -> FileMover.Options(if_exists="ignore")
    • FileMover.Options(mode="overwrite") -> FileMover.Options(if_exists="replace_file")
    • FileMover.Options(mode="delete_all") -> FileMover.Options(if_exists="replace_entire_directory")

    Old names are left intact, but will be removed in v1.0.0 (#108)
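
    The old mode values map one-to-one onto the new if_exists values; for the file classes the mapping can be written as a small migration helper (illustrative, not part of onetl):

```python
# mapping for FileDownloader.Options / FileUploader.Options / FileMover.Options
MODE_TO_IF_EXISTS = {
    "error": "error",
    "ignore": "ignore",
    "overwrite": "replace_file",
    "delete_all": "replace_entire_directory",
}

def migrate_mode(mode: str) -> str:
    """Translate a deprecated mode value to the new if_exists value."""
    return MODE_TO_IF_EXISTS[mode]
```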

  • Rename onetl.log.disable_clients_logging() to onetl.log.setup_clients_logging(). (#120)

Features

  • Add new methods returning Maven packages for a specific connection class:

    • Clickhouse.get_packages()
    • MySQL.get_packages()
    • Postgres.get_packages()
    • Teradata.get_packages()
    • MSSQL.get_packages(java_version="8")
    • Oracle.get_packages(java_version="8")
    • Greenplum.get_packages(scala_version="2.12")
    • MongoDB.get_packages(scala_version="2.12")
    • Kafka.get_packages(spark_version="3.4.1", scala_version="2.12")

    Deprecate old syntax:

    • Clickhouse.package
    • MySQL.package
    • Postgres.package
    • Teradata.package
    • MSSQL.package
    • Oracle.package
    • Greenplum.package_spark_2_3
    • Greenplum.package_spark_2_4
    • Greenplum.package_spark_3_2
    • MongoDB.package_spark_3_2
    • MongoDB.package_spark_3_3
    • MongoDB.package_spark_3_4 (#87)
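
    A get_packages()-style classmethod typically builds Maven coordinates from version arguments, and the resulting list is joined into the spark.jars.packages option. A rough sketch with hypothetical coordinates (not the actual artifacts onetl returns):

```python
def get_packages(spark_version: str = "3.4.1", scala_version: str = "2.12") -> list:
    # hypothetical group and artifact names, for illustration only
    return [f"org.example:connector_{scala_version}:{spark_version}"]

# Spark expects a single comma-separated string
packages_conf = ",".join(get_packages())
```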
  • Allow setting the log level of client modules in onetl.log.setup_clients_logging().

    Allow enabling logging of underlying client modules in onetl.log.setup_logging() by providing the additional argument enable_clients=True. This is useful for debugging. (#120)

  • Added support for reading and writing data to Kafka topics.

    For these operations, new classes were added.

    Currently, Kafka does not support incremental read strategies; this will be implemented in future releases.

  • Added support for reading files as a Spark DataFrame and saving a DataFrame as files.

    For these operations, new classes were added.

    FileDFConnections:

    High-level classes:

    • FileDFReader (#73)
    • FileDFWriter (#81)

    File formats:

Improvements

  • Remove redundant checks for driver availability in Greenplum and MongoDB connections. (#67)
  • Moved the check of Java class availability from the .check() method to the connection constructor. (#97)

0.8.1 (2023-07-10)

10 Jul 09:08
4ac9e3b

Features

  • Add @slot decorator to public methods of:

    • DBConnection
    • FileConnection
    • DBReader
    • DBWriter
    • FileDownloader
    • FileUploader
    • FileMover (#49)
  • Add workers field to FileDownloader.Options / FileUploader.Options / FileMover.Options classes.

    This allows speeding up all file operations using parallel threads. (#57)

Improvements

  • Add documentation for HWM store .get and .save methods. (#49)
  • Improve Readme:
    • Move Quick start section from documentation
    • Add Non-goals section
    • Fix code blocks indentation (#50)
  • Improve Contributing guide:
    • Move Develop section from Readme
    • Move docs/changelog/README.rst content
    • Add Limitations section
    • Add instructions for creating a fork and building documentation (#50)
  • Remove duplicated checks for source file existence in FileDownloader / FileMover. (#57)
  • Update default logging format to include thread name. (#57)

Bug Fixes

  • Fix S3.list_dir('/') returning an empty list on the latest Minio version. (#58)

0.8.0 (2023-05-31)

31 May 12:34
9c6b44c

Breaking Changes

  • Rename methods of FileConnection classes:

    • get_directory -> resolve_dir
    • get_file -> resolve_file
    • listdir -> list_dir
    • mkdir -> create_dir
    • rmdir -> remove_dir

    The new naming should be more consistent.
    These methods were undocumented in previous versions, but someone could have used them, so this is a breaking change. (#36)

  • Deprecate onetl.core.FileFilter class, replace it with new classes:

    • onetl.file.filter.Glob
    • onetl.file.filter.Regexp
    • onetl.file.filter.ExcludeDir

    Old class will be removed in v1.0.0. (#43)

  • Deprecate onetl.core.FileLimit class, replace it with new class onetl.file.limit.MaxFilesCount.

    Old class will be removed in v1.0.0. (#44)

  • Change behavior of BaseFileLimit.reset method.

    This method should now return self instead of None. Return value could be the same limit object or a copy, this is an implementation detail. (#44)
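
    Returning self from reset() lets callers reuse or chain a limit object. A minimal toy class following the contract described above:

```python
class CountLimitSketch:
    """Toy limit illustrating the BaseFileLimit.reset contract."""

    def __init__(self, max_count: int):
        self.max_count = max_count
        self.seen = 0

    def stops_at(self, path) -> bool:
        # returns True once the limit is exceeded
        self.seen += 1
        return self.seen > self.max_count

    def reset(self):
        self.seen = 0
        return self  # returning self (or a copy) allows chaining and reuse
```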

  • Replaced FileDownloader.filter and .limit with new options .filters and .limits:

    # old behavior (deprecated)
    FileDownloader(
        ...,
        filter=FileFilter(glob="*.txt", exclude_dir="/path"),
        limit=FileLimit(count_limit=10),
    )

    # new behavior
    FileDownloader(
        ...,
        filters=[Glob("*.txt"), ExcludeDir("/path")],
        limits=[MaxFilesCount(10)],
    )

    This allows developers to implement their own filter and limit classes, and combine them with existing ones.

    The old behavior is still supported, but will be removed in v1.0.0. (#45)
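
    Composable filters are straightforward to emulate; a self-contained sketch of the "all filters must match" idea using fnmatch (illustrative classes, not onetl's):

```python
from fnmatch import fnmatch

class GlobSketch:
    def __init__(self, pattern: str):
        self.pattern = pattern

    def match(self, path: str) -> bool:
        return fnmatch(path, self.pattern)

class ExcludeDirSketch:
    def __init__(self, directory: str):
        self.directory = directory.rstrip("/")

    def match(self, path: str) -> bool:
        return not path.startswith(self.directory + "/")

def match_all(filters, path: str) -> bool:
    # a file is accepted only if every filter accepts it
    return all(f.match(path) for f in filters)
```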

  • Removed the default value for FileDownloader.limits; users should pass the limits list explicitly. (#45)

  • Move classes from module onetl.core:

    from onetl.core import DBReader
    from onetl.core import DBWriter
    from onetl.core import FileDownloader
    from onetl.core import FileUploader
    from onetl.core import FileResult
    from onetl.core import FileSet

    with new modules onetl.db and onetl.file:

    from onetl.db import DBReader
    from onetl.db import DBWriter
    
    from onetl.file import FileDownloader
    from onetl.file import FileUploader
    
    # not a public interface
    from onetl.file.file_result import FileResult
    from onetl.file.file_set import FileSet

    Imports from the old module onetl.core can still be used, but are marked as deprecated. The module will be removed in v1.0.0. (#46)

Features

  • Add rename_dir method.

    The method was added to the following connections:

    • FTP
    • FTPS
    • HDFS
    • SFTP
    • WebDAV

    It allows renaming/moving a directory to a new path with all its content.

    S3 does not have directories, so there is no such method in that class. (#40)

  • Add onetl.file.FileMover class.

    It allows moving files between directories of a remote file system. The signature is almost the same as in FileDownloader, but without HWM support. (#42)

Improvements

  • Document all public methods in FileConnection classes:

    • download_file
    • resolve_dir
    • resolve_file
    • get_stat
    • is_dir
    • is_file
    • list_dir
    • create_dir
    • path_exists
    • remove_file
    • rename_file
    • remove_dir
    • upload_file
    • walk (#39)
  • Update documentation of the check method of all connections: add a usage example and document the result type. (#39)

  • Add new exception type FileSizeMismatchError.

    Methods connection.download_file and connection.upload_file now raise the new exception type instead of RuntimeError if the target file has a different size than the source after download/upload. (#39)
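
    The check itself amounts to comparing sizes after the transfer; a hedged sketch of the described behavior:

```python
class FileSizeMismatchError(RuntimeError):
    """Raised when a transferred file's size differs from the source size."""

def verify_transfer(source_size: int, target_size: int) -> None:
    # sketch of the post-download/upload verification described above
    if source_size != target_size:
        raise FileSizeMismatchError(
            f"expected {source_size} bytes, got {target_size}"
        )
```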

  • Add new exception type DirectoryExistsError - it is raised if target directory already exists. (#40)

  • Improved FileDownloader / FileUploader exception logging.

    If DEBUG logging is enabled, the exception is printed with a stacktrace instead of only the exception message. (#42)

  • Updated documentation of FileUploader.

    • The class does not support read strategies; a note was added to the documentation.
    • Added examples of using the run method with an explicit file list, with both absolute and relative paths.
    • Fix outdated imports and class names in examples. (#42)
  • Updated documentation of DownloadResult class - fix outdated imports and class names. (#42)

  • Improved file filters documentation section.

    Document interface class onetl.base.BaseFileFilter and function match_all_filters. (#43)

  • Improved file limits documentation section.

    Document interface class onetl.base.BaseFileLimit and functions limits_stop_at / limits_reached / reset_limits. (#44)

  • Added changelog.

    Changelog is generated from separate news files using towncrier. (#47)

Misc

  • Improved CI workflow for tests.
    • If a developer hasn't changed the source code of a specific connector or its dependencies, run tests only against the maximum supported versions of Spark, Python, Java and db/file server.
    • If a developer made changes in a specific connector, in core classes, or in dependencies, run tests for both minimal and maximum versions.
    • Once a week, run all tests against both minimal and latest versions to detect breaking changes in dependencies.
    • Minimal tested Spark version is 2.3.1 instead of 2.4.8. (#32)

Full Changelog: 0.7.2...0.8.0

0.7.2 (2023-05-24)

31 May 08:56
12ddaac

Dependencies

  • Limited typing-extensions version.

    typing-extensions==4.6.0 release contains some breaking changes causing errors like:

    Traceback (most recent call last):
    File "/Users/project/lib/python3.9/typing.py", line 852, in __subclasscheck__
        return issubclass(cls, self.__origin__)
    TypeError: issubclass() arg 1 must be a class
    

    typing-extensions==4.6.1 was causing another error:

    Traceback (most recent call last):
    File "/home/maxim/Repo/typing_extensions/1.py", line 33, in <module>
        isinstance(file, ContainsException)
    File "/home/maxim/Repo/typing_extensions/src/typing_extensions.py", line 599, in __instancecheck__
        if super().__instancecheck__(instance):
    File "/home/maxim/.pyenv/versions/3.7.8/lib/python3.7/abc.py", line 139, in __instancecheck__
        return _abc_instancecheck(cls, instance)
    File "/home/maxim/Repo/typing_extensions/src/typing_extensions.py", line 583, in __subclasscheck__
        return super().__subclasscheck__(other)
    File "/home/maxim/.pyenv/versions/3.7.8/lib/python3.7/abc.py", line 143, in __subclasscheck__
        return _abc_subclasscheck(cls, subclass)
    File "/home/maxim/Repo/typing_extensions/src/typing_extensions.py", line 661, in _proto_hook
        and other._is_protocol
    AttributeError: type object 'PathWithFailure' has no attribute '_is_protocol'
    

    We updated the requirements with typing-extensions<4.6 until the compatibility issues are fixed.

Full Changelog: 0.7.1...0.7.2

0.7.1 (2023-05-23)

31 May 08:55
befa06d

Bug Fixes

  • Fixed setup_logging function.

    In onETL==0.7.0 calling onetl.log.setup_logging() broke the logging:

    Traceback (most recent call last):
    File "/opt/anaconda/envs/py39/lib/python3.9/logging/__init__.py", line 434, in format
        return self._format(record)
    File "/opt/anaconda/envs/py39/lib/python3.9/logging/__init__.py", line 430, in _format
        return self._fmt % record.dict
    KeyError: 'levelname:8s'
    
  • Fixed installation examples.

    In onETL==0.7.0 there were examples of installing onETL with extras:

    pip install onetl[files, kerberos, spark]

    But pip failed to install such a package:

    ERROR: Invalid requirement: 'onet[files,'
    

    This is because of the spaces in the extras clause. Fixed:

    pip install onetl[files,kerberos,spark]

Full Changelog: 0.7.0...0.7.1