Releases: MTSWebServices/onetl

0.9.5 (2023-10-10)

10 Oct 11:38
17ed2de

Features

  • Add XML file format support. (#163)
  • Tested compatibility with Spark 3.5.0. MongoDB and Excel are not supported yet, but other connectors are. (#159)

Improvements

  • Add a check to all DB and FileDF connections that the Spark session is alive. (#164)

Bug Fixes

  • Fix Hive.check() behavior when Hive Metastore is not available. (#164)

0.9.4 (2023-09-26)

26 Sep 12:34
6944c4f

Features

  • Add Excel file format support. (#148)
  • Add Samba file connection. It is now possible to download and upload files to Samba shared folders using FileDownloader/FileUploader. (#150)
  • Add if_exists="ignore" and if_exists="error" to Hive.WriteOptions. (#143)
  • Add if_exists="ignore" and if_exists="error" to JDBC.WriteOptions. (#144)
  • Add if_exists="ignore" and if_exists="error" to MongoDB.WriteOptions. (#145)
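
The new if_exists values behave the same way across connections. A minimal pure-Python sketch of these semantics (illustrative only, not onetl's implementation):

```python
def write_with_if_exists(target_exists: bool, if_exists: str = "append") -> str:
    """Sketch of the if_exists semantics described above."""
    if target_exists and if_exists == "error":
        # fail loudly instead of touching existing data
        raise ValueError("target already exists")
    if target_exists and if_exists == "ignore":
        # leave existing data untouched, write nothing
        return "skipped"
    return "written"  # append / create as usual
```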

Improvements

  • Add documentation about different ways of passing packages to Spark session. (#151)

  • Drastically improve Greenplum documentation:

    • Added information about network ports, grants, pg_hba.conf and so on.
    • Added interaction schemas for reading, writing and executing statements in Greenplum.
    • Added recommendations about reading data from views and JOIN results from Greenplum. (#154)
  • Make .fetch and .execute methods of DB connections thread-safe. Each thread works with its own connection. (#156)

  • Call .close() on FileConnection when it is removed by the garbage collector. (#156)
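
The per-thread connection approach mentioned above is a standard pattern; a rough sketch of the idea using threading.local (not onetl's actual code, names are illustrative):

```python
import threading

class ConnectionPerThread:
    """Each thread lazily creates and then reuses its own connection object."""

    def __init__(self, factory):
        self._factory = factory
        self._local = threading.local()

    def get(self):
        # attribute lookup on threading.local() is isolated per thread
        if not hasattr(self._local, "connection"):
            self._local.connection = self._factory()
        return self._local.connection
```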

Bug Fixes

  • Fix issue where stopping the Python interpreter called JDBCMixin.close() and printed exceptions to the log. (#156)
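
Tying .close() to object lifetime, as in the garbage-collector improvement above, can be sketched with weakref.finalize, which also runs pending finalizers safely at interpreter exit (an illustration of the pattern, not onetl's implementation):

```python
import gc
import weakref

class ClosingConnection:
    """Registers cleanup that runs when the object is garbage collected."""

    def __init__(self, events: list):
        # note: do not pass a bound method of self here,
        # that would keep the object alive forever
        weakref.finalize(self, events.append, "closed")

closed_events = []
conn = ClosingConnection(closed_events)
del conn      # drop the last reference
gc.collect()  # ensure collection on any Python implementation
```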

0.9.3 (2023-09-06)

06 Sep 15:03
39c497b

Bug Fixes

  • Fix documentation build

0.9.2 (2023-09-06)

06 Sep 12:57
3b17ce5

Features

  • Add if_exists="ignore" and if_exists="error" to Greenplum.WriteOptions. (#142)

Improvements

  • Improve validation messages while writing dataframe to Kafka. (#131)
  • Improve documentation:
    • Add notes about reading and writing to database connections documentation
    • Add notes about executing statements in JDBC and Greenplum connections

Bug Fixes

  • Fixed validation of the headers column when writing to Kafka with default Kafka.WriteOptions(): the default value was False, but instead of raising an exception, the column value was just silently ignored. (#131)
  • Fix reading data from Oracle with partitioningMode="range" without explicitly set lowerBound / upperBound. (#133)
  • Update Kafka documentation with SSLProtocol usage. (#136)
  • Raise exception if someone tries to read data from Kafka topic which does not exist. (#138)
  • Allow passing Kafka topics with names like some.topic.name to DBReader. The same applies to MongoDB collections. (#139)
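
The fix relaxed name validation so dots inside topic and collection names are accepted. A hedged sketch of such a rule (the regex is illustrative, not the one onetl uses):

```python
import re

# hypothetical pattern: dot-separated segments of word characters and dashes,
# with no leading/trailing dot and no empty segments
NAME_PATTERN = re.compile(r"^[\w-]+(\.[\w-]+)*$")

def is_valid_name(name: str) -> bool:
    return NAME_PATTERN.fullmatch(name) is not None
```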

0.9.1 (2023-08-17)

17 Aug 18:57
5f0f86a

Bug Fixes

  • Fixed a bug where the number of threads created by FileDownloader / FileUploader / FileMover was max(workers, len(files)) instead of min(workers, len(files)), leading to too many workers being created for large file lists.
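
The corrected sizing can be sketched like this (illustrative helper, not onetl's code):

```python
from concurrent.futures import ThreadPoolExecutor

def effective_workers(workers: int, files: list) -> int:
    # never spawn more threads than there are files to process
    return min(workers, len(files)) or 1  # ThreadPoolExecutor needs at least 1

def process_all(files, workers=8):
    with ThreadPoolExecutor(max_workers=effective_workers(workers, files)) as pool:
        return list(pool.map(str.upper, files))  # placeholder for the real transfer
```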

0.9.0 (2023-08-17)

17 Aug 12:47
b448bbd

Breaking Changes

  • Rename methods:

    • DBConnection.read_df -> DBConnection.read_source_as_df
    • DBConnection.write_df -> DBConnection.write_df_to_target (#66)
  • Rename classes:

    • HDFS.slots -> HDFS.Slots
    • Hive.slots -> Hive.Slots

    Old names are left intact, but will be removed in v1.0.0 (#103)

  • Rename options to make them self-explanatory:

    • Hive.WriteOptions(mode="append") -> Hive.WriteOptions(if_exists="append")
    • Hive.WriteOptions(mode="overwrite_table") -> Hive.WriteOptions(if_exists="replace_entire_table")
    • Hive.WriteOptions(mode="overwrite_partitions") -> Hive.WriteOptions(if_exists="replace_overlapping_partitions")
    • JDBC.WriteOptions(mode="append") -> JDBC.WriteOptions(if_exists="append")
    • JDBC.WriteOptions(mode="overwrite") -> JDBC.WriteOptions(if_exists="replace_entire_table")
    • Greenplum.WriteOptions(mode="append") -> Greenplum.WriteOptions(if_exists="append")
    • Greenplum.WriteOptions(mode="overwrite") -> Greenplum.WriteOptions(if_exists="replace_entire_table")
    • MongoDB.WriteOptions(mode="append") -> MongoDB.WriteOptions(if_exists="append")
    • MongoDB.WriteOptions(mode="overwrite") -> MongoDB.WriteOptions(if_exists="replace_entire_collection")
    • FileDownloader.Options(mode="error") -> FileDownloader.Options(if_exists="error")
    • FileDownloader.Options(mode="ignore") -> FileDownloader.Options(if_exists="ignore")
    • FileDownloader.Options(mode="overwrite") -> FileDownloader.Options(if_exists="replace_file")
    • FileDownloader.Options(mode="delete_all") -> FileDownloader.Options(if_exists="replace_entire_directory")
    • FileUploader.Options(mode="error") -> FileUploader.Options(if_exists="error")
    • FileUploader.Options(mode="ignore") -> FileUploader.Options(if_exists="ignore")
    • FileUploader.Options(mode="overwrite") -> FileUploader.Options(if_exists="replace_file")
    • FileUploader.Options(mode="delete_all") -> FileUploader.Options(if_exists="replace_entire_directory")
    • FileMover.Options(mode="error") -> FileMover.Options(if_exists="error")
    • FileMover.Options(mode="ignore") -> FileMover.Options(if_exists="ignore")
    • FileMover.Options(mode="overwrite") -> FileMover.Options(if_exists="replace_file")
    • FileMover.Options(mode="delete_all") -> FileMover.Options(if_exists="replace_entire_directory")

    Old names are left intact, but will be removed in v1.0.0 (#108)
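
    The old mode values map one-to-one onto the new if_exists values; for the file classes the mapping can be written as a small migration helper (illustrative, not part of onetl):

```python
# mapping for FileDownloader.Options / FileUploader.Options / FileMover.Options
MODE_TO_IF_EXISTS = {
    "error": "error",
    "ignore": "ignore",
    "overwrite": "replace_file",
    "delete_all": "replace_entire_directory",
}

def migrate_mode(mode: str) -> str:
    """Translate a deprecated mode value to the new if_exists value."""
    return MODE_TO_IF_EXISTS[mode]
```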

  • Rename onetl.log.disable_clients_logging() to onetl.log.setup_clients_logging(). (#120)

Features

  • Add new methods returning Maven packages for a specific connection class:

    • Clickhouse.get_packages()
    • MySQL.get_packages()
    • Postgres.get_packages()
    • Teradata.get_packages()
    • MSSQL.get_packages(java_version="8")
    • Oracle.get_packages(java_version="8")
    • Greenplum.get_packages(scala_version="2.12")
    • MongoDB.get_packages(scala_version="2.12")
    • Kafka.get_packages(spark_version="3.4.1", scala_version="2.12")

    Deprecate old syntax:

    • Clickhouse.package
    • MySQL.package
    • Postgres.package
    • Teradata.package
    • MSSQL.package
    • Oracle.package
    • Greenplum.package_spark_2_3
    • Greenplum.package_spark_2_4
    • Greenplum.package_spark_3_2
    • MongoDB.package_spark_3_2
    • MongoDB.package_spark_3_3
    • MongoDB.package_spark_3_4 (#87)
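
    A get_packages()-style classmethod typically builds Maven coordinates from version arguments, and the resulting list is joined into the spark.jars.packages option. A rough sketch with hypothetical coordinates (not the actual artifacts onetl returns):

```python
def get_packages(spark_version: str = "3.4.1", scala_version: str = "2.12") -> list:
    # hypothetical group and artifact names, for illustration only
    return [f"org.example:connector_{scala_version}:{spark_version}"]

# Spark expects a single comma-separated string
packages_conf = ",".join(get_packages())
```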
  • Allow setting the log level of client modules in onetl.log.setup_clients_logging().

    Allow enabling logging of underlying client modules in onetl.log.setup_logging() by providing the additional argument enable_clients=True. This is useful for debugging. (#120)

  • Added support for reading and writing data to Kafka topics.

    For these operations, new classes were added.

    Currently, Kafka does not support incremental read strategies; this will be implemented in future releases.

  • Added support for reading files as a Spark DataFrame and saving a DataFrame as files.

    For these operations, new classes were added.

    FileDFConnections:

    High-level classes:

    • FileDFReader (#73)
    • FileDFWriter (#81)

    File formats:

Improvements

  • Remove redundant checks for driver availability in Greenplum and MongoDB connections. (#67)
  • Moved the check of Java class availability from the .check() method to the connection constructor. (#97)

0.8.1 (2023-07-10)

10 Jul 09:08
4ac9e3b

Features

  • Add @slot decorator to public methods of:

    • DBConnection
    • FileConnection
    • DBReader
    • DBWriter
    • FileDownloader
    • FileUploader
    • FileMover (#49)
  • Add workers field to FileDownloader.Options / FileUploader.Options / FileMover.Options classes.

    This allows speeding up all file operations using parallel threads. (#57)

Improvements

  • Add documentation for HWM store .get and .save methods. (#49)
  • Improve Readme:
    • Move Quick start section from documentation
    • Add Non-goals section
    • Fix code blocks indentation (#50)
  • Improve Contributing guide:
    • Move Develop section from Readme
    • Move docs/changelog/README.rst content
    • Add Limitations section
    • Add instructions for creating a fork and building documentation (#50)
  • Remove duplicated checks for source file existence in FileDownloader / FileMover. (#57)
  • Update default logging format to include thread name. (#57)

Bug Fixes

  • Fix S3.list_dir('/') returning an empty list on the latest Minio version. (#58)

0.8.0 (2023-05-31)

31 May 12:34
9c6b44c

Breaking Changes

  • Rename methods of FileConnection classes:

    • get_directory -> resolve_dir
    • get_file -> resolve_file
    • listdir -> list_dir
    • mkdir -> create_dir
    • rmdir -> remove_dir

    The new naming should be more consistent.
    These methods were undocumented in previous versions, but someone could have used them, so this is a breaking change. (#36)

  • Deprecate onetl.core.FileFilter class, replace it with new classes:

    • onetl.file.filter.Glob
    • onetl.file.filter.Regexp
    • onetl.file.filter.ExcludeDir

    Old class will be removed in v1.0.0. (#43)

  • Deprecate onetl.core.FileLimit class, replace it with new class onetl.file.limit.MaxFilesCount.

    Old class will be removed in v1.0.0. (#44)

  • Change behavior of BaseFileLimit.reset method.

    This method should now return self instead of None. Return value could be the same limit object or a copy, this is an implementation detail. (#44)
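
    Returning self from reset() lets callers reuse or chain a limit object. A minimal toy class following the contract described above:

```python
class CountLimitSketch:
    """Toy limit illustrating the BaseFileLimit.reset contract."""

    def __init__(self, max_count: int):
        self.max_count = max_count
        self.seen = 0

    def stops_at(self, path) -> bool:
        # returns True once the limit is exceeded
        self.seen += 1
        return self.seen > self.max_count

    def reset(self):
        self.seen = 0
        return self  # returning self (or a copy) allows chaining and reuse
```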

  • Replaced FileDownloader.filter and .limit with new options .filters and .limits:

    # old behavior (deprecated)
    FileDownloader(
        ...,
        filter=FileFilter(glob="*.txt", exclude_dir="/path"),
        limit=FileLimit(count_limit=10),
    )

    # new behavior
    FileDownloader(
        ...,
        filters=[Glob("*.txt"), ExcludeDir("/path")],
        limits=[MaxFilesCount(10)],
    )

    This allows developers to implement their own filter and limit classes, and combine them with existing ones.

    The old behavior is still supported, but will be removed in v1.0.0. (#45)
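
    Composable filters are straightforward to emulate; a self-contained sketch of the "all filters must match" idea using fnmatch (illustrative classes, not onetl's):

```python
from fnmatch import fnmatch

class GlobSketch:
    def __init__(self, pattern: str):
        self.pattern = pattern

    def match(self, path: str) -> bool:
        return fnmatch(path, self.pattern)

class ExcludeDirSketch:
    def __init__(self, directory: str):
        self.directory = directory.rstrip("/")

    def match(self, path: str) -> bool:
        return not path.startswith(self.directory + "/")

def match_all(filters, path: str) -> bool:
    # a file is accepted only if every filter accepts it
    return all(f.match(path) for f in filters)
```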

  • Removed the default value for FileDownloader.limits; users should pass the limits list explicitly. (#45)

  • Move classes from module onetl.core:

    from onetl.core import DBReader
    from onetl.core import DBWriter
    from onetl.core import FileDownloader
    from onetl.core import FileUploader
    from onetl.core import FileResult
    from onetl.core import FileSet

    with new modules onetl.db and onetl.file:

    from onetl.db import DBReader
    from onetl.db import DBWriter
    
    from onetl.file import FileDownloader
    from onetl.file import FileUploader
    
    # not a public interface
    from onetl.file.file_result import FileResult
    from onetl.file.file_set import FileSet

    Imports from the old module onetl.core can still be used, but are marked as deprecated. The module will be removed in v1.0.0. (#46)

Features

  • Add rename_dir method.

    The method was added to the following connections:

    • FTP
    • FTPS
    • HDFS
    • SFTP
    • WebDAV

    It allows renaming/moving a directory to a new path with all its content.

    S3 does not have directories, so there is no such method in that class. (#40)

  • Add onetl.file.FileMover class.

    It allows moving files between directories of a remote file system. The signature is almost the same as in FileDownloader, but without HWM support. (#42)

Improvements

  • Document all public methods in FileConnection classes:

    • download_file
    • resolve_dir
    • resolve_file
    • get_stat
    • is_dir
    • is_file
    • list_dir
    • create_dir
    • path_exists
    • remove_file
    • rename_file
    • remove_dir
    • upload_file
    • walk (#39)
  • Update documentation of the check method of all connections: add a usage example and document the result type. (#39)

  • Add new exception type FileSizeMismatchError.

    Methods connection.download_file and connection.upload_file now raise the new exception type instead of RuntimeError if the target file has a different size than the source after download/upload. (#39)
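
    The check itself amounts to comparing sizes after the transfer; a hedged sketch of the described behavior:

```python
class FileSizeMismatchError(RuntimeError):
    """Raised when a transferred file's size differs from the source size."""

def verify_transfer(source_size: int, target_size: int) -> None:
    # sketch of the post-download/upload verification described above
    if source_size != target_size:
        raise FileSizeMismatchError(
            f"expected {source_size} bytes, got {target_size}"
        )
```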

  • Add new exception type DirectoryExistsError - it is raised if target directory already exists. (#40)

  • Improved FileDownloader / FileUploader exception logging.

    If DEBUG logging is enabled, the exception is printed with a stacktrace instead of only the exception message. (#42)

  • Updated documentation of FileUploader.

    • The class does not support read strategies; a note was added to the documentation.
    • Added examples of using the run method with an explicit file list, with both absolute and relative paths.
    • Fix outdated imports and class names in examples. (#42)
  • Updated documentation of DownloadResult class - fix outdated imports and class names. (#42)

  • Improved file filters documentation section.

    Document interface class onetl.base.BaseFileFilter and function match_all_filters. (#43)

  • Improved file limits documentation section.

    Document interface class onetl.base.BaseFileLimit and functions limits_stop_at / limits_reached / reset_limits. (#44)

  • Added changelog.

    Changelog is generated from separate news files using towncrier. (#47)

Misc

  • Improved CI workflow for tests.
    • If a developer hasn't changed the source code of a specific connector or its dependencies, run tests only against the maximum supported versions of Spark, Python, Java and db/file server.
    • If a developer made changes in a specific connector, in core classes, or in dependencies, run tests for both minimal and maximum versions.
    • Once a week, run all tests against both minimal and latest versions to detect breaking changes in dependencies.
    • Minimal tested Spark version is 2.3.1 instead of 2.4.8. (#32)

Full Changelog: 0.7.2...0.8.0

0.7.2 (2023-05-24)

31 May 08:56
12ddaac

Dependencies

  • Limited typing-extensions version.

    typing-extensions==4.6.0 release contains some breaking changes causing errors like:

    Traceback (most recent call last):
    File "/Users/project/lib/python3.9/typing.py", line 852, in __subclasscheck__
        return issubclass(cls, self.__origin__)
    TypeError: issubclass() arg 1 must be a class
    

    typing-extensions==4.6.1 was causing another error:

    Traceback (most recent call last):
    File "/home/maxim/Repo/typing_extensions/1.py", line 33, in <module>
        isinstance(file, ContainsException)
    File "/home/maxim/Repo/typing_extensions/src/typing_extensions.py", line 599, in __instancecheck__
        if super().__instancecheck__(instance):
    File "/home/maxim/.pyenv/versions/3.7.8/lib/python3.7/abc.py", line 139, in __instancecheck__
        return _abc_instancecheck(cls, instance)
    File "/home/maxim/Repo/typing_extensions/src/typing_extensions.py", line 583, in __subclasscheck__
        return super().__subclasscheck__(other)
    File "/home/maxim/.pyenv/versions/3.7.8/lib/python3.7/abc.py", line 143, in __subclasscheck__
        return _abc_subclasscheck(cls, subclass)
    File "/home/maxim/Repo/typing_extensions/src/typing_extensions.py", line 661, in _proto_hook
        and other._is_protocol
    AttributeError: type object 'PathWithFailure' has no attribute '_is_protocol'
    

    We updated the requirements with typing-extensions<4.6 until the compatibility issues are fixed.

Full Changelog: 0.7.1...0.7.2

0.7.1 (2023-05-23)

31 May 08:55
befa06d

Bug Fixes

  • Fixed setup_logging function.

    In onETL==0.7.0 calling onetl.log.setup_logging() broke the logging:

    Traceback (most recent call last):
    File "/opt/anaconda/envs/py39/lib/python3.9/logging/__init__.py", line 434, in format
        return self._format(record)
    File "/opt/anaconda/envs/py39/lib/python3.9/logging/__init__.py", line 430, in _format
        return self._fmt % record.dict
    KeyError: 'levelname:8s'
    
  • Fixed installation examples.

    In onETL==0.7.0 there were examples of installing onETL with extras:

    pip install onetl[files, kerberos, spark]

    But pip failed to install such a package:

    ERROR: Invalid requirement: 'onet[files,'
    

    This is because of the spaces in the extras clause. Fixed:

    pip install onetl[files,kerberos,spark]

Full Changelog: 0.7.0...0.7.1