
Commit 7c462df

Update docs (#77)
* refactored examples for with_sql
* Fixed unit tests
* Updated readme
* Updated display description
1 parent b04defb commit 7c462df

10 files changed (+39, -39 lines)
README.md

Lines changed: 6 additions & 7 deletions
@@ -48,8 +48,8 @@ As an illustration, consider the scenario where you need to retrieve a single ro
 
 ```
 dx.from_tables("dev_*.*.*sample*")\
-.apply_sql("SELECT to_json(struct(*)) AS row FROM {full_table_name} LIMIT 1")\
-.execute()
+.with_sql("SELECT to_json(struct(*)) AS row FROM {full_table_name} LIMIT 1")\
+.apply()
 ```
 
 ## Available functionality
@@ -59,7 +59,7 @@ The available `dx` functions are
 * `from_tables("<catalog>.<schema>.<table>")` selects tables based on the specified pattern (use `*` as a wildcard). Returns a `DataExplorer` object with methods
 * `having_columns` restricts the selection to tables that have the specified columns
 * `with_concurrency` defines how many queries are executed concurrently (10 by default)
-* `apply_sql` applies a SQL template to all tables. After this command you can apply an [action](#from_tables-actions). See in-depth documentation [here](docs/Arbitrary_multi-table_SQL.md).
+* `with_sql` applies a SQL template to all tables. After this command you can apply an [action](#from_tables-actions). See in-depth documentation [here](docs/Arbitrary_multi-table_SQL.md).
 * `unpivot_string_columns` returns a melted (unpivoted) dataframe with all string columns from the selected tables. After this command you can apply an [action](#from_tables-actions)
 * `scan` (experimental) scans the lakehouse with regex expressions defined by the rules to power the semantic classification.
 * `intro` gives an introduction to the library
@@ -72,12 +72,11 @@ The available `dx` functions are
 
 ### from_tables Actions
 
-After a `apply_sql` or `unpivot_string_columns` command, you can apply the following actions:
+After a `with_sql` or `unpivot_string_columns` command, you can apply the following actions:
 
 * `explain` explains the queries that would be executed
-* `execute` executes the queries and shows the result in a unioned dataframe
-* `to_union_dataframe` unions all the dataframes that result from the queries
-
+* `display` executes the queries and shows the first 1000 rows of the result in a unioned dataframe
+* `apply` returns a unioned dataframe with the result from the queries
 
 ## Requirements
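For orientation, here is how the renamed pieces chain together after this change. This is a minimal sketch, not part of the diff; it assumes `dx` has been initialized as the DiscoverX entry point (typically `from discoverx import DX; dx = DX()`) and that the pattern matches existing tables. `explain()` only prints the generated SQL, while `apply()` executes it and returns the unioned dataframe.

```
from discoverx import DX

# Assumed initialization; see the README for setup details
dx = DX()

# Dry run: print the per-table SQL without executing anything
dx.from_tables("dev_*.*.*sample*")\
    .with_sql("SELECT to_json(struct(*)) AS row FROM {full_table_name} LIMIT 1")\
    .explain()

# Execute and keep the unioned result for further processing
result_df = dx.from_tables("dev_*.*.*sample*")\
    .with_sql("SELECT to_json(struct(*)) AS row FROM {full_table_name} LIMIT 1")\
    .apply()
```

`display()` can take the place of `apply()` when only the first 1000 rows need to be inspected.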

discoverx/dx.py

Lines changed: 1 addition & 1 deletion
@@ -68,7 +68,7 @@ def intro(self):
 <p>
 Then you can apply the following operations
 <ul>
-<li><code>.apply_sql(...)</code> - Runs a SQL template on each table</li>
+<li><code>.with_sql(...)</code> - Runs a SQL template on each table</li>
 <li><code>.scan(...)</code> - Scan your lakehouse for columns matching the given rules</li>
 <li><code>.search(...)</code> - Search your lakehouse for columns matching the given search term</li>
 </ul>

discoverx/explorer.py

Lines changed: 1 addition & 1 deletion
@@ -113,7 +113,7 @@ def unpivot_string_columns(self, sample_size=None) -> "DataExplorerActions":
         if sample_size is not None:
             sql_query_template += f"TABLESAMPLE ({sample_size} ROWS)"
 
-        return self.apply_sql(sql_query_template)
+        return self.with_sql(sql_query_template)
 
     def scan(
         self,
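The method touched here only builds the unpivot SQL (appending a `TABLESAMPLE` clause when a sample size is given) and delegates to `with_sql`, so it supports the same downstream actions. A brief usage sketch, assuming the same `dx` entry point as above and an illustrative `dev_*.*.*` pattern:

```
# Melt the string columns of every matched table, sampling 100 rows per table,
# then union the per-table results into a single DataFrame
unpivoted_df = dx.from_tables("dev_*.*.*")\
    .unpivot_string_columns(sample_size=100)\
    .apply()
```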

docs/Arbitrary_multi-table_SQL.md

Lines changed: 7 additions & 7 deletions
@@ -10,8 +10,8 @@ For example, to vacuum all the tables in "default" catalog:
 
 ```
 dx.from_tables("default.*.*")\
-.apply_sql("VACUUM {full_table_name}")\
-.execute()
+.with_sql("VACUUM {full_table_name}")\
+.display()
 ```
 
 That will apply the SQL template `VACUUM {full_table_name}` to all tables matched by the pattern `default.*.*`.
@@ -26,7 +26,7 @@ You can use the `explain()` command to see the SQL that would be executed.
 
 ```
 dx.from_tables("default.*.*")\
-.apply_sql("VACUUM {full_table_name}")\
+.with_sql("VACUUM {full_table_name}")\
 .explain()
 ```
 
@@ -35,14 +35,14 @@ You can also filter tables that have a specific column name.
 ```
 dx.from_tables("default.*.*")\
 .having_columns("device_id")\
-.apply_sql("OPTIMIZE {full_table_name} ZORDER BY (`device_id`)")\
-.execute()
+.with_sql("OPTIMIZE {full_table_name} ZORDER BY (`device_id`)")\
+.display()
 ```
 
 ## Select entire rows as json
 
 ```
 dx.from_tables("default.*.*")\
-.apply_sql("SELECT to_json(struct(*)) AS json_row FROM {full_table_name}")\
-.execute()
+.with_sql("SELECT to_json(struct(*)) AS json_row FROM {full_table_name}")\
+.display()
 ```
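Since `apply()` returns an ordinary unioned dataframe, the template output can also be shaped for reporting. The sketch below is illustrative only; it relies on the `{full_table_name}` placeholder used throughout this page, embedded in a string literal so each result row is tagged with its source table (the `table_name` and `row_count` aliases are just example names):

```
row_counts = dx.from_tables("default.*.*")\
    .with_sql("SELECT '{full_table_name}' AS table_name, COUNT(*) AS row_count FROM {full_table_name}")\
    .apply()

# Largest tables first
row_counts.orderBy("row_count", ascending=False).display()
```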

docs/GDPR_RoA.md

Lines changed: 7 additions & 3 deletions
@@ -9,6 +9,10 @@ For example, if you want to get all data for user `1` from all tables that have
 ```
 df = dx.from_tables("*.*.*")\
 .having_columns("user_id")\
-.apply_sql("SELECT `user_id`, to_json(struct(*)) AS row_content FROM {full_table_name} WHERE `user_id` = 1")\
-.to_union_dataframe()
-```
+.with_sql("SELECT `user_id`, to_json(struct(*)) AS row_content FROM {full_table_name} WHERE `user_id` = 1")\
+.apply()
+```
+
+### Limitations
+
+The current approach only selects tables that contain the specified column, and does not recursively follow the relationships with other tables.
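Once the rows are collected they still have to be handed over to the requester. A minimal follow-up sketch, assuming the `df` produced above; the output path is a hypothetical placeholder rather than anything defined by DiscoverX:

```
# Persist the collected rows so they can be packaged for the data subject.
# "/tmp/user_1_right_of_access" is a placeholder; point it at your delivery location.
df.write.mode("overwrite").json("/tmp/user_1_right_of_access")
```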

docs/GDPR_RoE.md

Lines changed: 3 additions & 3 deletions
@@ -9,9 +9,9 @@ For example, if you want to delete users `1`, `2`, and `3` from all tables that
 ```
 dx.from_tables("*.*.*")\
 .having_columns("user_id")\
-.apply_sql("DELETE FROM {full_table_name} WHERE `user_id` IN (1, 2, 3)")\
-.execute()
-# You can use .explain() instead of .execute() to preview the generated SQL
+.with_sql("DELETE FROM {full_table_name} WHERE `user_id` IN (1, 2, 3)")\
+.display()
+# You can use .explain() instead of .display() to preview the generated SQL
 ```
 
 ## Vacuum

docs/Vacuum.md

Lines changed: 2 additions & 2 deletions
@@ -8,8 +8,8 @@ With DiscoverX you can vacuum all the tables at once with the command:
 
 ```
 dx.from_tables("*.*.*")\
-.apply_sql("VACUUM {full_table_name}")\
-.execute()
+.with_sql("VACUUM {full_table_name}")\
+.display()
 ```
 
 You can schedule [this example notebook](https://raw.githubusercontent.com/databrickslabs/discoverx/master/examples/vacuum_multiple_tables.py) in your Databricks workflows to run vacuum periodically.

examples/detect_small_files.py

Lines changed: 9 additions & 12 deletions
@@ -7,7 +7,7 @@
 # MAGIC As a rule of thumb, if a table has more than `100` files and average file size smaller than `10 MB`, then we can consider it having too many small files.
 # MAGIC
 # MAGIC Some common causes of too many small files are:
-# MAGIC * Overpartitioning: the cardinality of the partition columns is too high
+# MAGIC * Overpartitioning: the cardinality of the partition columns is too high
 # MAGIC * Lack of scheduled maintenance operations like `OPTIMIZE`
 # MAGIC * Missing auto optimize on write
 # MAGIC
@@ -38,16 +38,13 @@
 
 from pyspark.sql.functions import col, lit
 
-dx.from_tables(from_tables)\
-.apply_sql("DESCRIBE DETAIL {full_table_name}")\
-.to_union_dataframe()\
-.withColumn("average_file_size_MB", col("sizeInBytes") / col("numFiles") / 1024 / 1024)\
-.withColumn("has_too_many_small_files",
-(col("average_file_size_MB") < small_file_max_size_MB) &
-(col("numFiles") > min_file_number))\
-.filter("has_too_many_small_files")\
-.display()
+dx.from_tables(from_tables).with_sql("DESCRIBE DETAIL {full_table_name}").apply().withColumn(
+    "average_file_size_MB", col("sizeInBytes") / col("numFiles") / 1024 / 1024
+).withColumn(
+    "has_too_many_small_files",
+    (col("average_file_size_MB") < small_file_max_size_MB) & (col("numFiles") > min_file_number),
+).filter(
+    "has_too_many_small_files"
+).display()
 
 # COMMAND ----------
-
-
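The flagged tables could then be fed back into a maintenance pass. A hedged follow-up sketch, keeping the result of the chain above as a dataframe instead of displaying it; it assumes the matched tables are Delta tables, so that `DESCRIBE DETAIL` exposes the full table name in its `name` column, and that `spark` is the notebook's SparkSession:

```
small_file_tables = (
    dx.from_tables(from_tables)
    .with_sql("DESCRIBE DETAIL {full_table_name}")
    .apply()
    .withColumn("average_file_size_MB", col("sizeInBytes") / col("numFiles") / 1024 / 1024)
    .filter((col("average_file_size_MB") < small_file_max_size_MB) & (col("numFiles") > min_file_number))
)

# Compact each table that was flagged as having too many small files
for row in small_file_tables.select("name").collect():
    spark.sql(f"OPTIMIZE {row['name']}")
```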

examples/pii_detection_presidio.py

Lines changed: 1 addition & 1 deletion
@@ -67,7 +67,7 @@
 unpivoted_df = (
     dx.from_tables(from_tables)
     .unpivot_string_columns(sample_size=sample_size)
-    .to_union_dataframe()
+    .apply()
     .localCheckpoint()  # Checkpointing to reduce the query plan size
 )

examples/vacuum_multiple_tables.py

Lines changed: 2 additions & 2 deletions
@@ -28,7 +28,7 @@
 
 # COMMAND ----------
 
-dx.from_tables(from_tables).apply_sql("VACUUM {full_table_name}").explain()
+dx.from_tables(from_tables).with_sql("VACUUM {full_table_name}").explain()
 
 # COMMAND ----------
 
@@ -37,4 +37,4 @@
 
 # COMMAND ----------
 
-(dx.from_tables(from_tables).apply_sql("VACUUM {full_table_name}").execute())
+(dx.from_tables(from_tables).with_sql("VACUUM {full_table_name}").display())
