
Commit f7d793d

Added basic documentation for linter message codes (#2536)
1 parent d6a92df commit f7d793d

File tree

2 files changed: +331 −0 lines changed

CONTRIBUTING.md

Lines changed: 29 additions & 0 deletions
@@ -277,6 +277,35 @@ Before every commit, apply the consistent styleguide and formatting of the code,
make fmt test
```

## Getting an overview of linter message codes

To get an overview of the linter message codes, run the following command:

```shell
$ python tests/integration/source_code/message_codes.py
cannot-autofix-table-reference
catalog-api-in-shared-clusters
changed-result-format-in-uc
dbfs-read-from-sql-query
dbfs-usage
default-format-changed-in-dbr8
dependency-not-found
direct-filesystem-access
implicit-dbfs-usage
jvm-access-in-shared-clusters
legacy-context-in-shared-clusters
not-supported
notebook-run-cannot-compute-value
python-udf-in-shared-clusters
rdd-in-shared-clusters
spark-logging-in-shared-clusters
sql-parse-error
sys-path-cannot-compute-value
table-migrated-to-uc
to-json-in-shared-clusters
unsupported-magic-line
```

## First contribution

Here are the example steps to submit your first contribution:

README.md

Lines changed: 302 additions & 0 deletions
@@ -60,6 +60,28 @@ See [contributing instructions](CONTRIBUTING.md) to help improve this project.
* [<b>Always run this workflow AFTER the assessment has finished</b>](#balways-run-this-workflow-after-the-assessment-has-finishedb)
* [[EXPERIMENTAL] Migrate tables in mounts Workflow](#experimental-migrate-tables-in-mounts-workflow)
* [Jobs Static Code Analysis Workflow](#jobs-static-code-analysis-workflow)
  * [Linter message codes](#linter-message-codes)
    * [`cannot-autofix-table-reference`](#cannot-autofix-table-reference)
    * [`catalog-api-in-shared-clusters`](#catalog-api-in-shared-clusters)
    * [`changed-result-format-in-uc`](#changed-result-format-in-uc)
    * [`dbfs-read-from-sql-query`](#dbfs-read-from-sql-query)
    * [`dbfs-usage`](#dbfs-usage)
    * [`default-format-changed-in-dbr8`](#default-format-changed-in-dbr8)
    * [`dependency-not-found`](#dependency-not-found)
    * [`direct-filesystem-access`](#direct-filesystem-access)
    * [`implicit-dbfs-usage`](#implicit-dbfs-usage)
    * [`jvm-access-in-shared-clusters`](#jvm-access-in-shared-clusters)
    * [`legacy-context-in-shared-clusters`](#legacy-context-in-shared-clusters)
    * [`not-supported`](#not-supported)
    * [`notebook-run-cannot-compute-value`](#notebook-run-cannot-compute-value)
    * [`python-udf-in-shared-clusters`](#python-udf-in-shared-clusters)
    * [`rdd-in-shared-clusters`](#rdd-in-shared-clusters)
    * [`spark-logging-in-shared-clusters`](#spark-logging-in-shared-clusters)
    * [`sql-parse-error`](#sql-parse-error)
    * [`sys-path-cannot-compute-value`](#sys-path-cannot-compute-value)
    * [`table-migrated-to-uc`](#table-migrated-to-uc)
    * [`to-json-in-shared-clusters`](#to-json-in-shared-clusters)
    * [`unsupported-magic-line`](#unsupported-magic-line)
* [Utility commands](#utility-commands)
  * [`logs` command](#logs-command)
  * [`ensure-assessment-run` command](#ensure-assessment-run-command)
@@ -678,6 +700,286 @@ in the Migration dashboard.
[[back to top](#databricks-labs-ucx)]

### Linter message codes

Here is a detailed explanation of the linter message codes:

#### `cannot-autofix-table-reference`

This indicates that the linter has found a table reference that cannot be automatically fixed. The user must manually
update the table reference to point to the correct table in Unity Catalog. This mostly occurs when the table name is
computed dynamically and is too complex for our static code analysis to resolve. We detect this problem anywhere
a table name could be used: `spark.sql`, `spark.catalog.*`, `spark.table`, `df.write.*` and many more. Code examples
that trigger this problem:

```python
spark.table(f"foo_{some_table_name}")
# ..
df = spark.range(10)
df.write.saveAsTable(f"foo_{some_table_name}")
# .. or even
df.write.insertInto(f"foo_{some_table_name}")
```

Here the `some_table_name` variable is not defined anywhere in the visible scope. However, the analyser successfully
detects the table name if it is defined:

```python
some_table_name = 'bar'
spark.table(f"foo_{some_table_name}")
```

We even detect string constants coming either from `dbutils.widgets.get` (via job named parameters) or through
loop variables. If the `old.things` table is migrated to `brand.new.stuff` in Unity Catalog, the following code
triggers two messages: [`table-migrated-to-uc`](#table-migrated-to-uc) for the first query, as its contents are clearly
analysable, and `cannot-autofix-table-reference` for the second query.

```python
# ucx[table-migrated-to-uc:+4:4:+4:20] Table old.things is migrated to brand.new.stuff in Unity Catalog
# ucx[cannot-autofix-table-reference:+3:4:+3:20] Can't migrate table_name argument in 'spark.sql(query)' because its value cannot be computed
table_name = f"table_{index}"
for query in ["SELECT * FROM old.things", f"SELECT * FROM {table_name}"]:
    spark.sql(query).collect()
```

[[back to top](#databricks-labs-ucx)]

#### `catalog-api-in-shared-clusters`

`spark.catalog.*` functions require Databricks Runtime 14.3 LTS or above on Unity Catalog clusters in Shared access
mode, so if your code has `spark.catalog.tableExists("table")` or `spark.catalog.listDatabases()`, you need to ensure
that your cluster is running the correct runtime version and data security mode.

[[back to top](#databricks-labs-ucx)]

#### `changed-result-format-in-uc`

Calls to these `spark.catalog.*` functions return a list of `<catalog>.<database>.<table>` names instead of
`<database>.<table>`. So if you have code like this:

```python
for table in spark.catalog.listTables():
    do_stuff_with_table(table)
```

you need to make sure that `do_stuff_with_table` can handle the new format.
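A minimal, Spark-free sketch of the kind of defensive parsing such a function may need (the helper name is hypothetical, not part of UCX):

```python
def split_table_name(full_name):
    """Split a table name that may or may not include a catalog prefix."""
    parts = full_name.split(".")
    if len(parts) == 3:
        # Unity Catalog format: <catalog>.<database>.<table>
        return parts[0], parts[1], parts[2]
    if len(parts) == 2:
        # legacy Hive metastore format: <database>.<table>
        return None, parts[0], parts[1]
    raise ValueError(f"unexpected table name: {full_name!r}")
```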
[[back to top](#databricks-labs-ucx)]

#### `dbfs-read-from-sql-query`

DBFS access is not allowed in Unity Catalog, so if you have code like this:

```python
df = spark.sql("SELECT * FROM parquet.`/mnt/foo/path/to/file`")
```

you need to change it to use UC tables.

[[back to top](#databricks-labs-ucx)]

#### `dbfs-usage`

DBFS does not work in Unity Catalog, so if you have code like this:

```python
display(spark.read.csv('/mnt/things/e/f/g'))
```

you need to change it to use UC tables or UC volumes.

[[back to top](#databricks-labs-ucx)]

#### `dependency-not-found`

This message indicates that the linter has found a dependency, such as a Python source file or a notebook, that is not
available in the workspace. The user must ensure that the dependency is available in the workspace. This usually
means an error in the user code.
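One way to catch this early in your own code is to verify that a dependency exists before referencing it. A sketch using plain `pathlib` (the helper name and path are hypothetical):

```python
from pathlib import Path


def check_dependency(path_str):
    """Return the path if the dependency exists, otherwise raise an error."""
    path = Path(path_str)
    if not path.exists():
        raise FileNotFoundError(f"dependency not found: {path}")
    return path
```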
[[back to top](#databricks-labs-ucx)]

#### `direct-filesystem-access`

Accessing the filesystem directly is not allowed in Unity Catalog, so if you have code like this:

```python
spark.read.csv("s3://bucket/path")
```

you need to change it to use UC tables or UC volumes.
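For illustration, one way to centralize such a rewrite is a lookup from cloud-storage locations to UC volume paths. The mapping and helper below are entirely hypothetical, not UCX APIs:

```python
# hypothetical mapping from direct cloud-storage locations to UC volumes
VOLUME_MAPPING = {
    "s3://bucket/path": "/Volumes/main/default/landing/path",
}


def to_volume_path(cloud_path):
    """Translate a direct cloud-storage path into its UC volume equivalent."""
    try:
        return VOLUME_MAPPING[cloud_path]
    except KeyError:
        raise ValueError(f"no UC volume configured for {cloud_path!r}")


# spark.read.csv(to_volume_path("s3://bucket/path"))  # instead of the direct path
```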
810+
811+
[[back to top](#databricks-labs-ucx)]
812+
813+
#### `implicit-dbfs-usage`
814+
815+
The use of DBFS is not allowed in Unity Catalog, so if you have code like this:
816+
817+
```python
818+
display(spark.read.csv('/mnt/things/e/f/g'))
819+
```
820+
821+
you need to change it to use UC tables or UC volumes.
822+
823+
[[back to top](#databricks-labs-ucx)]
824+
825+
#### `jvm-access-in-shared-clusters`
826+
827+
You cannot access Spark Driver JVM on Unity Catalog clusters in Shared Access mode. If you have code like this:
828+
829+
```python
830+
spark._jspark._jvm.com.my.custom.Name()
831+
```
832+
833+
or like this:
834+
835+
```python
836+
log4jLogger = sc._jvm.org.apache.log4j
837+
LOGGER = log4jLogger.LogManager.getLogger(__name__)
838+
```
839+
840+
you need to change it to use Python equivalents.
[[back to top](#databricks-labs-ucx)]

#### `legacy-context-in-shared-clusters`

SparkContext (`sc`) is not supported on Unity Catalog clusters in Shared access mode. Rewrite it using SparkSession
(`spark`). Example code that triggers this message:

```python
df = spark.createDataFrame(sc.emptyRDD(), schema)
```

or this:

```python
sc.parallelize([1, 2, 3])
```

[[back to top](#databricks-labs-ucx)]

#### `not-supported`

Installing eggs is no longer supported on Databricks 14.0 or higher.

[[back to top](#databricks-labs-ucx)]

#### `notebook-run-cannot-compute-value`

The path passed to `dbutils.notebook.run` cannot be computed, so the notebook path needs to be adjusted. Automated
code analysis cannot determine where the notebook is located, so you need to simplify code like this:

```python
b = some_function()
dbutils.notebook.run(b)
```

to something like this:

```python
a = "./leaf1.py"
dbutils.notebook.run(a)
```

[[back to top](#databricks-labs-ucx)]

#### `python-udf-in-shared-clusters`

`applyInPandas` requires DBR 14.3 LTS or above on Unity Catalog clusters in Shared access mode. Example:

```python
df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double").show()
```

Arrow UDFs also require DBR 14.3 LTS or above on Unity Catalog clusters in Shared access mode:

```python
@udf(returnType='int', useArrow=True)
def arrow_slen(s):
    return len(s)
```

It is not possible to register a Java UDF from Python code on Unity Catalog clusters in Shared access mode. Use a
`%scala` cell to register the Scala UDF using `spark.udf.register`. Example code that triggers this message:

```python
spark.udf.registerJavaFunction("func", "org.example.func", IntegerType())
```

[[back to top](#databricks-labs-ucx)]

#### `rdd-in-shared-clusters`

RDD APIs are not supported on Unity Catalog clusters in Shared access mode. Use `mapInArrow()` or Pandas UDFs instead.

```python
df.rdd.mapPartitions(myUdf)
```

[[back to top](#databricks-labs-ucx)]

#### `spark-logging-in-shared-clusters`

You cannot set the Spark log level directly from code on Unity Catalog clusters in Shared access mode. Remove the call
and set the cluster Spark conf `spark.log.level` instead:

```python
sc.setLogLevel("INFO")
setLogLevel("WARN")
```

Another example could be:

```python
log4jLogger = sc._jvm.org.apache.log4j
LOGGER = log4jLogger.LogManager.getLogger(__name__)
```

or

```python
sc._jvm.org.apache.log4j.LogManager.getLogger(__name__).info("test")
```

[[back to top](#databricks-labs-ucx)]

#### `sql-parse-error`

This is a generic message indicating that the SQL query could not be parsed. The user must manually check the SQL query.

[[back to top](#databricks-labs-ucx)]

#### `sys-path-cannot-compute-value`

The path passed to `sys.path.append` cannot be computed, so the path needs to be adjusted. Automated code analysis
cannot determine where the path is located.
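In practice the fix is to replace the computed value with a literal that static analysis can follow. A minimal before/after sketch (the path is hypothetical):

```python
import sys

# before: the appended path cannot be computed statically
# sys.path.append(some_function())

# after: a literal path that automated code analysis can resolve
sys.path.append("/Workspace/Repos/project/utils")
```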
[[back to top](#databricks-labs-ucx)]

#### `table-migrated-to-uc`

This message indicates that the linter has found a table that has been migrated to Unity Catalog. The user must ensure
that the table is available in Unity Catalog.
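Conceptually, the rename this message suggests can be sketched as a simple lookup; the mapping below is hypothetical, and UCX's own migration index is more involved:

```python
# hypothetical migration index: Hive metastore table -> Unity Catalog table
MIGRATION_INDEX = {
    "old.things": "brand.new.stuff",
}


def migrated_name(table_name):
    """Return the UC name for a migrated table, or the original name otherwise."""
    return MIGRATION_INDEX.get(table_name, table_name)
```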
[[back to top](#databricks-labs-ucx)]

#### `to-json-in-shared-clusters`

`toJson()` is not available on Unity Catalog clusters in Shared access mode. Use `toSafeJson()` on DBR 13.3 LTS or
above to get a subset of command context information. Example code that triggers this message:

```python
dbutils.notebook.entry_point.getDbutils().notebook().getContext().toSafeJson()
```

[[back to top](#databricks-labs-ucx)]

#### `unsupported-magic-line`

This message indicates code that could not be analysed by UCX. The user must check the code manually.

[[back to top](#databricks-labs-ucx)]

# Utility commands

## `logs` command