
Commit 7c9c708

Merge branch 'develop'
2 parents: 2335d7f + 90ba720

File tree

32 files changed: +303 -120 lines

Lines changed: 3 additions & 2 deletions

@@ -1,5 +1,6 @@
 .github/workflows/data/db/**
-onetl/db_connection/db_connection.py
-onetl/db_connection/jdbc*.py
+onetl/db_connection/db_connection/*
 onetl/db_connection/dialect_mixins/*
+onetl/db_connection/jdbc_connection/*
+onetl/db_connection/jdbc_mixin/*
 onetl/db/**

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion

@@ -52,7 +52,7 @@ repos:
         - --no-extra-eol
 
   - repo: https://github.com/codespell-project/codespell
-    rev: v2.2.6
+    rev: v2.3.0
    hooks:
      - id: codespell
        args: [-w]

README.rst

Lines changed: 1 addition & 1 deletion

@@ -341,7 +341,7 @@ Read data from MSSQL, transform & write to Hive.
     database="Telecom",
     spark=spark,
     # These options are passed to MSSQL JDBC Driver:
-    extra={"ApplicationIntent": "ReadOnly"},
+    extra={"applicationIntent": "ReadOnly"},
 ).check()
 
 # >>> INFO:|MSSQL| Connection is available
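The README fix above only changes key casing. A minimal sketch of why that matters (assumed behavior for illustration; ``build_properties`` is a hypothetical helper, not onETL's internals): entries from ``extra`` are forwarded verbatim as JDBC connection properties, so keys must match the driver's documented spelling.

```python
# Hypothetical helper (not onETL's actual code): "extra" options are merged
# into the JDBC connection properties unchanged, so the MSSQL driver only
# recognizes its documented spelling, "applicationIntent".
def build_properties(user, password, extra):
    # Base credentials plus driver-specific options, passed through as-is.
    return {"user": user, "password": password, **extra}


props = build_properties("etl_user", "secret", {"applicationIntent": "ReadOnly"})
print(props["applicationIntent"])  # ReadOnly
```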

docs/changelog/0.11.0.rst

Lines changed: 41 additions & 41 deletions

@@ -28,41 +28,6 @@ most of users will not see any differences.
 
 This brings few bugfixes with datetime format handling.
 
-- Serialize ``ColumnDatetimeHWM`` to Clickhouse's ``DateTime64(6)`` (precision up to microseconds) instead of ``DateTime`` (precision up to seconds) (:github:pull:`267`).
-
-  In previous onETL versions, ``ColumnDatetimeHWM`` value was rounded to the second, and thus reading some rows that were read in previous runs,
-  producing duplicates.
-
-  For Clickhouse versions below 21.1 comparing column of type ``DateTime`` with a value of type ``DateTime64`` is not supported, returning an empty dataframe.
-  To avoid this, replace:
-
-  .. code:: python
-
-      DBReader(
-          ...,
-          hwm=DBReader.AutoDetectHWM(
-              name="my_hwm",
-              expression="hwm_column",  # <--
-          ),
-      )
-
-  with:
-
-  .. code:: python
-
-      DBReader(
-          ...,
-          hwm=DBReader.AutoDetectHWM(
-              name="my_hwm",
-              expression="CAST(hwm_column AS DateTime64)",  # <-- add explicit CAST
-          ),
-      )
-
-- Pass JDBC connection extra params as ``properties`` dict instead of URL with query part (:github:pull:`268`).
-
-  This allows passing custom connection parameters like ``Clickhouse(extra={"custom_http_options": "option1=value1,option2=value2"})``
-  without need to apply urlencode to parameter value, like ``option1%3Dvalue1%2Coption2%3Dvalue2``.
-
 - For JDBC connections add new ``SQLOptions`` class for ``DB.sql(query, options=...)`` method (:github:pull:`272`).
 
   Firsly, to keep naming more consistent.

@@ -166,6 +131,41 @@ most of users will not see any differences.
   For now, ``DB.fetch(query, options=...)`` and ``DB.execute(query, options=...)`` can accept ``JDBCOptions``, to keep backward compatibility,
   but emit a deprecation warning. The old class will be removed in ``v1.0.0``.
 
+- Serialize ``ColumnDatetimeHWM`` to Clickhouse's ``DateTime64(6)`` (precision up to microseconds) instead of ``DateTime`` (precision up to seconds) (:github:pull:`267`).
+
+  In previous onETL versions, ``ColumnDatetimeHWM`` value was rounded to the second, and thus reading some rows that were read in previous runs,
+  producing duplicates.
+
+  For Clickhouse versions below 21.1 comparing column of type ``DateTime`` with a value of type ``DateTime64`` is not supported, returning an empty dataframe.
+  To avoid this, replace:
+
+  .. code:: python
+
+      DBReader(
+          ...,
+          hwm=DBReader.AutoDetectHWM(
+              name="my_hwm",
+              expression="hwm_column",  # <--
+          ),
+      )
+
+  with:
+
+  .. code:: python
+
+      DBReader(
+          ...,
+          hwm=DBReader.AutoDetectHWM(
+              name="my_hwm",
+              expression="CAST(hwm_column AS DateTime64)",  # <-- add explicit CAST
+          ),
+      )
+
+- Pass JDBC connection extra params as ``properties`` dict instead of URL with query part (:github:pull:`268`).
+
+  This allows passing custom connection parameters like ``Clickhouse(extra={"custom_http_options": "option1=value1,option2=value2"})``
+  without need to apply urlencode to parameter value, like ``option1%3Dvalue1%2Coption2%3Dvalue2``.
+
 Features
 --------
 

@@ -188,14 +188,14 @@ Improve user experience with Kafka messages and Database tables with serialized
   * ``CSV.parse_column(col, schema=...)`` (:github:pull:`258`).
   * ``XML.parse_column(col, schema=...)`` (:github:pull:`269`).
 
-This allows parsing data in ``value`` field of Kafka message or string/binary column of some table as a nested Spark structure.
+  This allows parsing data in ``value`` field of Kafka message or string/binary column of some table as a nested Spark structure.
 
 - Add ``FileFormat.serialize_column(...)`` method to several classes:
-* ``Avro.serialize_column(col)`` (:github:pull:`265`).
-* ``JSON.serialize_column(col)`` (:github:pull:`257`).
-* ``CSV.serialize_column(col)`` (:github:pull:`258`).
+  * ``Avro.serialize_column(col)`` (:github:pull:`265`).
+  * ``JSON.serialize_column(col)`` (:github:pull:`257`).
+  * ``CSV.serialize_column(col)`` (:github:pull:`258`).
 
-This allows saving Spark nested structures or arrays to ``value`` field of Kafka message or string/binary column of some table.
+  This allows saving Spark nested structures or arrays to ``value`` field of Kafka message or string/binary column of some table.
 
 Improvements
 ------------

@@ -220,7 +220,7 @@ Few documentation improvements.
 
 - Add note about connecting to Clickhouse cluster. (:github:pull:`280`).
 
-- Add notes about versions when specific class/method/attribute/argument was added, renamed or changed behavior (:github:`282`).
+- Add notes about versions when specific class/method/attribute/argument was added, renamed or changed behavior (:github:pull:`282`).
 
 
 Bug Fixes
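The changelog entry about passing JDBC extra params as a ``properties`` dict mentions the urlencoded form ``option1%3Dvalue1%2Coption2%3Dvalue2``. A short standard-library sketch of the manual encoding that is no longer needed:

```python
from urllib.parse import quote

# Before this change, extra params were appended to the JDBC URL query string,
# so values containing "=" or "," had to be percent-encoded by hand:
value = "option1=value1,option2=value2"
encoded = quote(value, safe="")
print(encoded)  # option1%3Dvalue1%2Coption2%3Dvalue2

# With extra params passed as a properties dict, the raw value is used as-is:
properties = {"custom_http_options": value}
```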

docs/changelog/0.11.1.rst

Lines changed: 13 additions & 0 deletions

@@ -0,0 +1,13 @@
+0.11.1 (2024-05-29)
+===================
+
+Features
+--------
+
+- Change ``MSSQL.port`` default from ``1433`` to ``None``, allowing use of ``instanceName`` to detect port number. (:github:pull:`287`)
+
+
+Bug Fixes
+---------
+
+- Remove ``fetchsize`` from ``JDBC.WriteOptions``. (:github:pull:`288`)
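The ``MSSQL.port`` change can be illustrated with a hypothetical URL builder (``build_mssql_url`` is illustrative, not onETL's actual code): in Microsoft's JDBC URL format, a named instance lets the driver resolve the port via SQL Server Browser, so an explicit port can be omitted.

```python
# Hypothetical sketch (not onETL internals) of why ``port`` may now be None:
# with instanceName set, the MSSQL JDBC driver resolves the port itself.
def build_mssql_url(host, port=None, instance=None):
    url = f"jdbc:sqlserver://{host}"
    if port is not None:
        url += f":{port}"
    if instance is not None:
        url += f";instanceName={instance}"
    return url


print(build_mssql_url("mssql.example.com", port=1433))
# jdbc:sqlserver://mssql.example.com:1433
print(build_mssql_url("mssql.example.com", instance="PROD"))
# jdbc:sqlserver://mssql.example.com;instanceName=PROD
```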

docs/changelog/index.rst

Lines changed: 1 addition & 0 deletions

@@ -3,6 +3,7 @@
    :caption: Changelog
 
    DRAFT
+   0.11.1
    0.11.0
    0.10.2
    0.10.1

docs/connection/db_connection/hive/write.rst

Lines changed: 9 additions & 2 deletions

@@ -21,8 +21,15 @@ Examples
     # Use the Hive partitioning columns to group the data. Create only 20 files per Hive partition.
     # Also sort the data by column which most data is correlated with, reducing files size.
     write_df = df.repartition(
-        20, "country", "business_date", "user_id"
-    ).sortWithinPartitions("country", "business_date", "user_id")
+        20,
+        "country",
+        "business_date",
+        "user_id",
+    ).sortWithinPartitions(
+        "country",
+        "business_date",
+        "user_id",
+    )
 
     writer = DBWriter(
         connection=hive,

docs/connection/db_connection/kafka/prerequisites.rst

Lines changed: 1 addition & 0 deletions

@@ -46,6 +46,7 @@ Authentication mechanism
 ~~~~~~~~~~~~~~~~~~~~~~~~
 
 Kafka can support different authentication mechanism (also known as `SASL <https://en.wikipedia.org/wiki/Simple_Authentication_and_Security_Layer>`_).
+
 List of currently supported mechanisms:
 * :obj:`PLAIN <onetl.connection.db_connection.kafka.kafka_basic_auth.KafkaBasicAuth>`. To no confuse this with ``PLAINTEXT`` connection protocol, onETL uses name ``BasicAuth``.
 * :obj:`GSSAPI <onetl.connection.db_connection.kafka.kafka_kerberos_auth.KafkaKerberosAuth>`. To simplify naming, onETL uses name ``KerberosAuth``.

docs/connection/db_connection/mssql/prerequisites.rst

Lines changed: 3 additions & 1 deletion

@@ -27,9 +27,11 @@ Connecting to MSSQL
 Connection port
 ~~~~~~~~~~~~~~~
 
-Connection is usually performed to port 1443. Port may differ for different MSSQL instances.
+Connection is usually performed to port 1433. Port may differ for different MSSQL instances.
 Please ask your MSSQL administrator to provide required information.
 
+For named MSSQL instances (``instanceName`` option), `port number is optional <https://learn.microsoft.com/en-us/sql/connect/jdbc/building-the-connection-url?view=sql-server-ver16#named-and-multiple-sql-server-instances>`_, and could be omitted.
+
 Connection host
 ~~~~~~~~~~~~~~~
 

docs/connection/db_connection/mssql/types.rst

Lines changed: 1 addition & 1 deletion

@@ -35,7 +35,7 @@ This is how MSSQL connector performs this:
   it will be populated by MSSQL.
 
 .. [2]
-  This is true only if DataFrame column is a ``StringType()``, because text value is parsed automatically to tagret column type.
+  This is true only if DataFrame column is a ``StringType()``, because text value is parsed automatically to target column type.
 
   But other types cannot be silently converted, like ``int -> text``. This requires explicit casting, see `DBWriter`_.
