Conversation
```python
select_query = transformation_result.query()

# TODO: why? don't we prevent empty column schemas above?
all_columns = {**computed_columns, **(columns or {})}
```
I have added a test for columns merging, and the columns yielded from here overwrite all incoming column definitions from the decorator, which I think generally makes sense in dlt, but not for the transformations. I would maybe keep this topic open for now until we decide how this should work exactly.

w8: you are overwriting computed_columns with columns which come from the decorator. so how are they not overwritten at the end?

I am not sure what you are asking. My goal would be that if the user sets a column hint on the transformation decorator, this should overwrite whatever is discovered during lineage. At least I think it should, but I can also just remove this line and users will have to deal with this in some other way if they need to change precision or something like that.
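The merge precedence under discussion boils down to Python dict unpacking, where later entries win. A minimal sketch with hypothetical column hints (the names mirror the snippet above, the values are invented):

```python
# Hypothetical hints: computed_columns comes from lineage,
# decorator_columns from @dlt.transformation(columns=...).
computed_columns = {"price": {"data_type": "decimal", "precision": 38, "scale": 9}}
decorator_columns = {"price": {"precision": 20, "scale": 2}}

# Later unpacking wins, so decorator hints overwrite computed hints per key.
all_columns = {**computed_columns, **(decorator_columns or {})}
print(all_columns["price"])  # → {'precision': 20, 'scale': 2}
```

Note that the merge replaces the whole per-column dict, so the `data_type` discovered during lineage is dropped, not merged, which may itself be part of the open question here.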
```python
QUERY_KNOWN_TABLE_STAR_SELECT = "SELECT * FROM table_1"
QUERY_UNKNOWN_TABLE_STAR_SELECT = "SELECT * FROM table_unknown"
QUERY_ANONYMOUS_SELECT = "SELECT LEN(col_varchar) FROM table_1"
QUERY_GIBBERISH = "%&/ GIBBERISH"
```
this cannot happen anymore, as non-parseable sql strings are already caught in the dataset.
```python
def inventory_original(dataset: SupportsReadableDataset[Any]) -> Any:
    return dataset["inventory"]


@dlt.transformation(columns={"price": {"precision": 20, "scale": 2}})
```
I think this is the correct behavior, we should allow setting the precision for an existing column here. Or would you rather do this in the function body? I am not sure.
Will be translated to

```sql
INSERT INTO
```
this is the actually generated insert statement. I realized we were not quoting identifiers, so I added this. We are doing 2 subqueries here now, one stemming from the normalization in the extract and one from @anuunchin's normalizer work. We can probably still optimize here?
```diff
  assert (
      normalized_query
-     == 'SELECT _dlt_subquery."a" AS a, _dlt_subquery."b" AS b, UUID() AS _dlt_id FROM'
+     == 'SELECT _dlt_subquery."a" AS "a", _dlt_subquery."b" AS "b", UUID() AS "_dlt_id" FROM'
```
the model item normalizer will now always use quoted identifiers on the outer queries, regardless of whether the inner select has quoted identifiers. Since all queries arriving here now go through query normalization in the dataset where quotes are applied, this should actually be fine.
```python
def _query(self) -> sge.Query:
    from dlt.helpers.ibis import duckdb_compiler

    select_query = duckdb_compiler.to_sqlglot(self._ibis_object)
```
For converting an ibis expression into sqlglot, you always need a compiler, which is destination dependent. I think the compiler will fail if you use functions that are known not to exist on a specific destination, and certain materializations will also occur in this step. If you have an expression that counts rows, the alias for the result will be created here, which will be CountStar(*) for duckdb, but which is not a valid identifier for bigquery, for example (see comment in query normalizer).
dlt/destinations/dataset/utils.py (Outdated)

```python
if node.db != expanded_path[0]:
    node.set("catalog", sqlglot.to_identifier(expanded_path[0], quoted=False))
if isinstance(node, sge.Alias):
    node.set("alias", naming.normalize_identifier(node.alias))
```
Here I make sure aliases that are created by the user or automatically by ibis in the compile step above adhere to the naming convention of the current destination. The alternative would be to use a different compiler for every destination. Normalizing the alias in the ModelNormalizer step is too late, since this code is also used for just accessing the data. Without this change, `test_row_counts` will fail for bigquery because an alias with invalid symbols remains in the query. All other destinations seem to accept this.
I have reverted this change and am now using the bigquery compiler only for bigquery destinations. Maybe we should be using the correct compiler for each destination type, but it seems duckdb works for all. I am not sure about this one.
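For illustration, a hypothetical `normalize_identifier` along the lines of what a snake_case naming convention would do to an ibis-generated alias such as `CountStar(*)`; dlt's real `NamingConvention` implementations are more involved:

```python
import re

def normalize_identifier(name: str) -> str:
    # hypothetical sketch: replace invalid characters with underscores,
    # split camel case, then lowercase the result
    name = re.sub(r"[^0-9a-zA-Z_]+", "_", name)
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return name.strip("_").lower()

print(normalize_identifier("CountStar(*)"))  # → count_star
```

An alias like `count_star` is a valid identifier on every destination, including BigQuery, which rejects `(`, `)` and `*` in unquoted identifiers.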
Force-pushed 211cbfc to 1f5a787
Force-pushed 848233e to 28b7361
```python
# NOTE: We can use the duckdb compiler for all dialects except for bigquery
# as bigquery is more strict about identifier naming and will not accept
# identifiers generated by the duckdb compiler for anonymous columns
destination_dialect = self._dataset.sql_client.capabilities.sqlglot_dialect
```
I'll keep the old notes:

```python
# NOTE: ibis is optimized for reading a real schema from db and pushing back optimized sql
# - it quotes all identifiers, there's no option to get an unquoted query
# - it converts full lists of column names back into star
# - it optimizes the query inside, adds meaningless aliases to all tables etc.
```

For example: I bet that the BigQuery problem is a quotation problem where `"` quotes are not accepted.
dlt/destinations/dataset/relation.py (Outdated)

```python
    "Must be an SQL SELECT statement."
)

return query.sql(dialect=self._dataset.sql_client.capabilities.sqlglot_dialect)
```
why is `provided_dialect` not used here?
dlt/destinations/dataset/utils.py (Outdated)

```python
sqlglot_schema: SQLGlotSchema,
qualified_query: sge.Query,
sql_client: SqlClientBase[Any],
naming: NamingConvention,
```
Not needed anymore, and I think it should not be needed here.
```python
    + "Please run with strict lineage or provide data_type hints "
    + f"for following columns: {unknown_column_types}",
)
yield dlt.mark.with_hints(
```
this is something I wrote in the old PR: could we place all_columns in the SqlModel? or better: could we place a relation in it? then:

- user is able to use `dlt.mark` on the model (i.e. their own custom hints or table name)
- `all_columns = {**computed_columns, **(columns or {})}` is not needed. columns will be applied by `compute_table_schema` in `DltResource`
- you can use relation/computed_columns in SqlModel to apply to the resource in extractor. Look how we call this method:

```python
root_table_schema = resource.compute_table_schema(items, meta)
```

and

```python
def compute_table_schema(self, item: TDataItem = None, meta: Any = None) -> TTableSchema:
    """Computes the table schema based on hints and column definitions passed during resource creation.
    `item` parameter is used to resolve table hints based on data.
    `meta` parameter is taken from Pipe and may further specify table name if variant is to be used
    """
```

so SqlModel will be passed there and you can use it to extract computed hints. Or you can implement `_compute_schema` on `ModelExtractor` for fully custom logic
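A rough sketch of the flow being proposed, with hypothetical stand-in classes and hint shapes (not dlt's real `DltResource`/`SqlModel`): `compute_table_schema` receives the item, pulls computed hints off a SqlModel-like object, and applies decorator columns on top:

```python
from typing import Any, Dict, Optional

class SqlModelStub:
    """Hypothetical stand-in carrying columns computed during lineage."""
    def __init__(self, computed_columns: Dict[str, Any]) -> None:
        self.computed_columns = computed_columns

class ResourceStub:
    """Hypothetical resource: decorator hints win over computed hints."""
    def __init__(self, columns: Optional[Dict[str, Any]] = None) -> None:
        self.columns = columns or {}

    def compute_table_schema(self, item: Any = None, meta: Any = None) -> Dict[str, Any]:
        # extract computed hints from the SqlModel-like item, if one was passed
        computed = item.computed_columns if isinstance(item, SqlModelStub) else {}
        # decorator columns are applied last, so they override computed ones
        return {"columns": {**computed, **self.columns}}

model = SqlModelStub({"a": {"data_type": "bigint"}})
resource = ResourceStub(columns={"a": {"precision": 20}})
print(resource.compute_table_schema(model))
```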
I kind of did that now, but I also have to unwrap and rewrap the data item in the transform function, and I am not fully convinced that the way I did it is the right way. Also, I don't think it works for materialized queries...
Force-pushed 6ae764d to 3584402
Force-pushed cc89745 to 941dae0
```python
from dlt.extract.pipe_iterator import DataItemWithMeta


class MaterializableSqlModel(SqlModel, WithComputableHints):
```
Note: find a better name, or merge this with relation.
```python
    relation = unwrapped_item
# we see if the string is a valid sql query, if so we need a dataset
elif isinstance(unwrapped_item, str):
    try:
```
this will allow a malformed sql query to be interpreted as a string, which is possibly not what we want here; maybe single strings should always strictly be interpreted as a sql query and fail this transformation, otherwise the user will get very strange results.
```diff
- for chunk in datasets[0](select_query).iter_arrow(chunk_size=config.buffer_max_items):
-     yield dlt.mark.with_hints(chunk, hints=make_hints(columns=all_columns))
+ for chunk in relation.iter_arrow(chunk_size=config.buffer_max_items):
+     yield dlt.mark.with_hints(chunk, hints=sql_model.compute_hints())
```
here we discard the hints coming from the outside; for the sqlmodel they are merged during extract, but since we have pure arrow tables here, we can't do it. We had this idea of putting our schema into the user data area of the arrow table. I have not looked at this yet, but should we do it?
Force-pushed 36e2e96 to 93c97e2
…t of unneeded transpiling clean up make_transformation function tests still pending
I will review tests later

There are a bunch of relevant tests in
-> this allows to have multiple relations with different schemas in the relation, so this is allowed now too
I would add a note in the docs, I think it's quite unlikely that users would use tables from the output dataset as input in the same load.
rudolfix left a comment:

LGTM! just a few more questions
```python
fruit_p.run(fruitshop_source())


@dlt.transformation()
def multiple_transformations(dataset: Dataset) -> Any:
```
would be cool to enumerate a resource without a pipeline and also to have mixed lazy and eager transformations yielded
Resources are enumerated and schema access is tested in `test_extract_without_source_name_or_pipeline`, `test_materializable_sql_model` and `test_enumerate_and_access_relation`.
For the mixed-type function, a small fix in the normalizer was needed; it previously prevented mixed `loader_file_type` tables in general.
```python
    ["id", "customer_id", "inventory_id", "quantity", "name"]
]

model = list(materializable_sql_model(fruit_p.dataset()))[0]
```
do we have a specific test for resource iteration?
I'm not sure what this means, this is tested in the line above, no?
* Min max, filter with expr_or_string
* Fix in min max test
* Overload fix and docs
* Test read interfaces partially uses default relation max
…metrized types and don't supply hints if none are present in dlt schema
Force-pushed d04e962 to 089f5ec
```python
if table_name in self._filtered_tables:
    continue
# allow to cache dynamic table hints if only table name is dynamic
if table_name not in self._table_contracts or resource._table_has_other_dynamic_hints:
```
@rudolfix I had to revert this change that you made, because yielding multiple transformations with different schemas would not work without it. Relevant test: `test_multiple_transformations_in_function`. Maybe you should review this once again.
right! but this prevents caching of the computed dynamic table hints, which is 99% of cases we handle. my take: convert `_table_has_other_dynamic_hints` into `has_other_dynamic_hints`, which for transformation resources is always `True` and for others falls back to the current behavior
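The suggested shape, sketched with hypothetical classes (the real `DltResource` is more involved): a `has_other_dynamic_hints` property that is always `True` for transformation resources and otherwise falls back to the existing flag:

```python
class ResourceBase:
    """Hypothetical stand-in for a regular resource."""
    def __init__(self, table_has_other_dynamic_hints: bool = False) -> None:
        self._table_has_other_dynamic_hints = table_has_other_dynamic_hints

    @property
    def has_other_dynamic_hints(self) -> bool:
        # default: fall back to the current behavior
        return self._table_has_other_dynamic_hints

class TransformationResource(ResourceBase):
    @property
    def has_other_dynamic_hints(self) -> bool:
        # transformations may yield multiple relations with different schemas,
        # so their computed table hints must never be cached
        return True

print(ResourceBase().has_other_dynamic_hints)            # → False
print(TransformationResource().has_other_dynamic_hints)  # → True
```

This keeps hint caching (the 99% case) for regular resources while always recomputing for transformations.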
Force-pushed 9e72cf7 to 19822bc
Force-pushed 19822bc to c5f9b33
# Conflicts: # dlt/common/destination/capabilities.py
input_data (Outdated)
I believe this shouldn't be here?
```python
raise NotImplementedError("`fetchone()` method is not supported for this relation")


class Relation(DataAccess):
```
AFAIU, `DataAccess` and `Relation` are implemented as abstract base classes (`abc.ABC`) rather than via duck-typing (`Protocol`). The class `Relation` inheriting the `DataAccess` protocol should implement the methods directly (otherwise, make `Relation` the protocol).
For instance, the type inspection from the IDE complains that all methods don't return anything while their signatures indicate they do.
What's the practical use case here? I thought `IbisRelation` was removed to have a single `Dataset` and `Relation` implementation. This has the benefit of shedding this abstract machinery
We have interfaces (protocols or abstract classes) in common and implementations in the other parts. This enables us to have these interfaces as return types in places where the implementation is not known or can not be imported due to circular dependencies
Also, wrt the protocol vs abc: ideally Relation would also be a Protocol; the problem is that I am checking variables for types in the transformations to see what the transformations are yielding, and you can't isinstance-check protocols (or you can with the `runtime_checkable` decorator, but apparently that is very slow). I also don't see any problems in my IDE, maybe we are using different linters? The CI passes.
> This enables us to have these interfaces as return types in places where the implementation is not known or can not be imported due to circular dependencies

For this purpose, we should simply use

```python
from __future__ import annotations
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # import from other modules
```

We should gradually introduce levels of abstraction. If we have a single `dlt.Dataset` and `dlt.Relation`, the additional layers make reading and editing the code harder IMO. For instance, I still struggle with the many layers of proxying and wrapping on `Relation.df()`, which could be written in a very straightforward manner
dlt/common/data_writers/writers.py (Outdated)

```python
normalized_query = parsed_query.sql(dialect=dialect)
self._f.write("dialect: " + (dialect or "") + "\n" + normalized_query + "\n")
dialect = item.query_dialect() or (self._caps.sqlglot_dialect if self._caps else None)
```
my question was: can `item.query_dialect()` NOT return a dialect in the current impl? so we can remove the expression after the `or`
```python
def __init__(self, *args: Any, **kwds: Any) -> None:
    super().__init__(*args, **kwds)


def compute_table_schema(self, item: TDataItem = None, meta: Any = None) -> TTableSchema:
```
please add background on why we do the schema caching and when
dlt/destinations/dataset/relation.py (Outdated)

```python
    return self._columns_schema


@columns_schema.setter
def columns_schema(self, new_value: TTableSchemaColumns) -> None:
```
* rename flag for executing raw queries to "execute_raw_query"
* return sge queries from the internal _query method which removes a lot of unneeded transpiling; clean up make_transformation function; tests still pending
* adds some tests to readable dataset and a test for column hint merging
* allows any dialect when writing queries and fixes tests
* update docs and set correct quoting to queries in normalization and load stage
* fixes normalizer tests
* fix limit on mssql; normalize aliases in normalization step
* add missing quote to alias
* revert identifier normalization step in normalizer_query and use bigquery compiler for bigquery destinations
* post rebase fix
* smallish pr fixes
* add materializable sqlmodel and handle hints in extractor
* add and test always_materialize setting
* add test for sql transformation type
* convert transformation functions to need yield instead of return
* migrate tests and docs snippets to yield in transformations
* add simple test for materializable model
* use correct compiler for converting ibis into sqlglot for each dialect; fixes on transformation test
* add first simple version of using unbound ibis tables in transformations
* skip ibis test on python 3.9
* fix query building in new relation
* return a "real" relation from a transformation
* add ibis option when getting table from dataset; natively support unbound ibis tables in transformations and when getting relations from dataset
* update model item format tests to use relation
* remove one unneeded test (same thing is already tested in transformations)
* fix wei conversion in lineage
* adds support for adding resource hints to pyarrow items
* switch most read access tests to default dataset
* update datasets and transformations docs pages
* separate ibis and default dbapi datasets and fix typing
* update transformation tests and small typing fixes for updated datasets
* fix default dataset type
* fix wei sqlglot conversion
* add sqlglot dialect type and some cleanup
* fix dataset snippets
* fix sqlglot schema test
* removes ibis relation and dataset; consolidates relation and dataset baseclasses with implementations; updates interfaces/protocols for relation and dataset and makes those the publicly available interface with "Relation" and "Dataset"; remove query method from relation interface
* fix one doc snippet
* rename dataset and relation interfaces
* fix relationship between cursor and relation, remove function wiring hack in favor of explicit forwarding for better typing
* clean up readablerelation (no actual code changes)
* fix str test to assume pretty sql (which it is now); fix one transformation snippet
* small changes from review comments:
  * query method on dataset
  * typing update of table method
  * rename query method to "to_sql" on relation
  * clean up transform function a bit (could maybe be even better)
  * reject non-sql strings in transformation to not shadow errors
* add support for "non-generator" transformations
* move hints computation into resource class
* smallish PR fixes
* add support for dynamic hints in transformations -> this allows to have multiple relations with different schemas in the relation, so this is allowed now too
* fixes dynamic table caching
* Enhances ReadableDBAPIRelation: min/max, filter with expression (#2833)
* Min max, filter with expr_or_string
* Fix in min max test
* Overload fix and docs
* Test read interfaces partially uses default relation max
* prevent sqlglot schema from adding default hints info, only allow parametrized types and don't supply hints if none are present in dlt schema
* make multi schema transformations work again
* move model item format tests to transformations folder
* re-order interface tests and fix playground dataset access
* PR review test updated
* update dataset and transformation pages
* update transformations tests to new fruitshop
* Last PR fixes
* update columns_schema property

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
Co-authored-by: anuunchin <88698977+anuunchin@users.noreply.github.com>
Description
This PR includes the following changes:
Open questions: