Commit 74dbd37

Author: Sumedh Wale (committed)

[SNAPPYDATA] honor existing PYSPARK_PYTHON in build
- more fixes to URL references in sparkr-vignettes and others
- updated copyright to 2022 in UI information

1 parent 1fb3673 commit 74dbd37

File tree: 6 files changed (+30, -27 lines)

R/pkg/DESCRIPTION

Lines changed: 2 additions & 2 deletions
@@ -11,8 +11,8 @@ Authors@R: c(person("Shivaram", "Venkataraman", role = c("aut", "cre"),
                     email = "[email protected]"),
              person(family = "The Apache Software Foundation", role = c("aut", "cph")))
 License: Apache License (== 2.0)
-URL: http://www.apache.org/ http://spark.apache.org/
-BugReports: http://spark.apache.org/contributing.html
+URL: https://www.apache.org/ https://spark.apache.org/
+BugReports: https://spark.apache.org/contributing.html
 Depends:
     R (>= 3.0),
     methods

R/pkg/vignettes/sparkr-vignettes.Rmd

Lines changed: 11 additions & 11 deletions
@@ -46,7 +46,7 @@ Sys.setenv("_JAVA_OPTIONS" = paste("-XX:-UsePerfData", old_java_opt, sep = " "))

 ## Overview

-SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. With Spark `r packageVersion("SparkR")`, SparkR provides a distributed data frame implementation that supports data processing operations like selection, filtering, aggregation etc. and distributed machine learning using [MLlib](http://spark.apache.org/mllib/).
+SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. With Spark `r packageVersion("SparkR")`, SparkR provides a distributed data frame implementation that supports data processing operations like selection, filtering, aggregation etc. and distributed machine learning using [MLlib](https://spark.apache.org/mllib/).

 ## Getting Started

@@ -132,7 +132,7 @@ sparkR.session.stop()

 Different from many other R packages, to use SparkR, you need an additional installation of Apache Spark. The Spark installation will be used to run a backend process that will compile and execute SparkR programs.

-After installing the SparkR package, you can call `sparkR.session` as explained in the previous section to start and it will check for the Spark installation. If you are working with SparkR from an interactive shell (eg. R, RStudio) then Spark is downloaded and cached automatically if it is not found. Alternatively, we provide an easy-to-use function `install.spark` for running this manually. If you don't have Spark installed on the computer, you may download it from [Apache Spark Website](http://spark.apache.org/downloads.html).
+After installing the SparkR package, you can call `sparkR.session` as explained in the previous section to start and it will check for the Spark installation. If you are working with SparkR from an interactive shell (eg. R, RStudio) then Spark is downloaded and cached automatically if it is not found. Alternatively, we provide an easy-to-use function `install.spark` for running this manually. If you don't have Spark installed on the computer, you may download it from [Apache Spark Website](https://spark.apache.org/downloads.html).

 ```{r, eval=FALSE}
 install.spark()
@@ -147,7 +147,7 @@ sparkR.session(sparkHome = "/HOME/spark")
 ### Spark Session {#SetupSparkSession}


-In addition to `sparkHome`, many other options can be specified in `sparkR.session`. For a complete list, see [Starting up: SparkSession](http://spark.apache.org/docs/latest/sparkr.html#starting-up-sparksession) and [SparkR API doc](http://spark.apache.org/docs/latest/api/R/sparkR.session.html).
+In addition to `sparkHome`, many other options can be specified in `sparkR.session`. For a complete list, see [Starting up: SparkSession](https://spark.apache.org/docs/latest/sparkr.html#starting-up-sparksession) and [SparkR API doc](https://spark.apache.org/docs/2.1.3/api/R/sparkR.session.html).

 In particular, the following Spark driver properties can be set in `sparkConfig`.

@@ -169,15 +169,15 @@ sparkR.session(spark.sql.warehouse.dir = spark_warehouse_path)


 #### Cluster Mode
-SparkR can connect to remote Spark clusters. [Cluster Mode Overview](http://spark.apache.org/docs/latest/cluster-overview.html) is a good introduction to different Spark cluster modes.
+SparkR can connect to remote Spark clusters. [Cluster Mode Overview](https://spark.apache.org/docs/latest/cluster-overview.html) is a good introduction to different Spark cluster modes.

 When connecting SparkR to a remote Spark cluster, make sure that the Spark version and Hadoop version on the machine match the corresponding versions on the cluster. Current SparkR package is compatible with
 ```{r, echo=FALSE, tidy = TRUE}
 paste("Spark", packageVersion("SparkR"))
 ```
 It should be used both on the local computer and on the remote cluster.

-To connect, pass the URL of the master node to `sparkR.session`. A complete list can be seen in [Spark Master URLs](http://spark.apache.org/docs/latest/submitting-applications.html#master-urls).
+To connect, pass the URL of the master node to `sparkR.session`. A complete list can be seen in [Spark Master URLs](https://spark.apache.org/docs/latest/submitting-applications.html#master-urls).
 For example, to connect to a local standalone Spark master, we can call

 ```{r, eval=FALSE}
@@ -208,7 +208,7 @@ The general method for creating `SparkDataFrame` from data sources is `read.df`.
 sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")
 ```

-We can see how to use data sources using an example CSV input file. For more information please refer to SparkR [read.df](https://spark.apache.org/docs/latest/api/R/read.df.html) API documentation.
+We can see how to use data sources using an example CSV input file. For more information please refer to SparkR [read.df](https://spark.apache.org/docs/2.1.3/api/R/read.df.html) API documentation.
 ```{r, eval=FALSE}
 df <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "NA")
 ```
@@ -297,7 +297,7 @@ printSchema(carsDF)

 #### Selecting rows, columns

-SparkDataFrames support a number of functions to do structured data processing. Here we include some basic examples and a complete list can be found in the [API](https://spark.apache.org/docs/latest/api/R/index.html) docs:
+SparkDataFrames support a number of functions to do structured data processing. Here we include some basic examples and a complete list can be found in the [API](https://spark.apache.org/docs/2.1.3/api/R/index.html) docs:

 You can also pass in column name as strings.
 ```{r}
@@ -842,7 +842,7 @@ perplexity

 #### Alternating Least Squares

-`spark.als` learns latent factors in [collaborative filtering](https://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering) via [alternating least squares](http://dl.acm.org/citation.cfm?id=1608614).
+`spark.als` learns latent factors in [collaborative filtering](https://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering) via [alternating least squares](https://dl.acm.org/doi/10.1109/MC.2009.263).

 There are multiple options that can be configured in `spark.als`, including `rank`, `reg`, and `nonnegative`. For a complete list, refer to the help file.

@@ -979,11 +979,11 @@ env | map

 ## References

-* [Spark Cluster Mode Overview](http://spark.apache.org/docs/latest/cluster-overview.html)
+* [Spark Cluster Mode Overview](https://spark.apache.org/docs/latest/cluster-overview.html)

-* [Submitting Spark Applications](http://spark.apache.org/docs/latest/submitting-applications.html)
+* [Submitting Spark Applications](https://spark.apache.org/docs/latest/submitting-applications.html)

-* [Machine Learning Library Guide (MLlib)](http://spark.apache.org/docs/latest/ml-guide.html)
+* [Machine Learning Library Guide (MLlib)](https://spark.apache.org/docs/latest/ml-guide.html)

 * [SparkR: Scaling R Programs with Spark](https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf), Shivaram Venkataraman, Zongheng Yang, Davies Liu, Eric Liang, Hossein Falaki, Xiangrui Meng, Reynold Xin, Ali Ghodsi, Michael Franklin, Ion Stoica, and Matei Zaharia. SIGMOD 2016. June 2016.


build.gradle

Lines changed: 13 additions & 10 deletions
@@ -280,18 +280,21 @@ allprojects {
 }

 // set python2 for pyspark if python3 version is an unsupported one
-String sparkPython = 'python'
-def checkResult = exec {
-  ignoreExitValue = true
-  commandLine 'sh', '-c', 'python --version 2>/dev/null | grep -Eq "( 3\\.[0-7])|( 2\\.)"'
-}
-if (checkResult.exitValue != 0) {
-  checkResult = exec {
+String sparkPython = System.getenv('PYSPARK_PYTHON')
+if (sparkPython == null || sparkPython.isEmpty()) {
+  sparkPython = 'python'
+  def checkResult = exec {
     ignoreExitValue = true
-    commandLine 'sh', '-c', 'python2 --version >/dev/null 2>&1'
+    commandLine 'sh', '-c', 'python --version 2>/dev/null | grep -Eq "( 3\\.[0-7])|( 2\\.)"'
   }
-  if (checkResult.exitValue == 0) {
-    sparkPython = 'python2'
+  if (checkResult.exitValue != 0) {
+    checkResult = exec {
+      ignoreExitValue = true
+      commandLine 'sh', '-c', 'python2 --version >/dev/null 2>&1'
+    }
+    if (checkResult.exitValue == 0) {
+      sparkPython = 'python2'
+    }
   }
 }

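With this change, a non-empty PYSPARK_PYTHON environment variable is honored as-is, and the python/python2 probing above runs only when the variable is unset or empty. A minimal usage sketch follows; the gradle wrapper invocation, task name, and interpreter path are illustrative assumptions, not taken from this commit:

```sh
# Hypothetical invocation from the repository root.
export PYSPARK_PYTHON=/usr/bin/python3.6   # taken as-is, no version probing
./gradlew build

# If PYSPARK_PYTHON is unset or empty, the build keeps plain `python` when
# `python --version` matches 2.x or 3.0-3.7, and otherwise falls back to
# `python2` if that interpreter is available.
unset PYSPARK_PYTHON
./gradlew build
```
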
core/src/main/scala/org/apache/spark/ui/UIUtils.scala

Lines changed: 2 additions & 2 deletions
@@ -636,7 +636,7 @@ private[spark] object UIUtils extends Logging {
       <p>
         <strong>Project SnappyData<sup>&trade;</sup>
           - Enterprise Edition</strong> <br />
-        <br />&copy; 2017-2020 TIBCO<sup>&reg;</sup> Software Inc. All rights reserved.
+        <br />&copy; 2017-2022 TIBCO<sup>&reg;</sup> Software Inc. All rights reserved.
         <br />This program is protected by copyright law.
       </p>
       <p>
@@ -659,7 +659,7 @@ private[spark] object UIUtils extends Logging {
     } else {
       <p>
         <strong>Project SnappyData<sup>&trade;</sup> - Community Edition </strong> <br />
-        <br />&copy; 2017-2020 TIBCO<sup>&reg;</sup> Software Inc. All rights reserved.
+        <br />&copy; 2017-2022 TIBCO<sup>&reg;</sup> Software Inc. All rights reserved.
         <br />This program is protected by copyright law.
       </p>
       <p>


docs/ml-collaborative-filtering.md

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ missing entries of a user-item association matrix. `spark.ml` currently support
 model-based collaborative filtering, in which users and products are described
 by a small set of latent factors that can be used to predict missing entries.
 `spark.ml` uses the [alternating least squares
-(ALS)](http://dl.acm.org/citation.cfm?id=1608614)
+(ALS)](https://dl.acm.org/doi/10.1109/MC.2009.263)
 algorithm to learn these latent factors. The implementation in `spark.ml` has the
 following parameters:

docs/mllib-collaborative-filtering.md

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ missing entries of a user-item association matrix. `spark.mllib` currently supp
 model-based collaborative filtering, in which users and products are described
 by a small set of latent factors that can be used to predict missing entries.
 `spark.mllib` uses the [alternating least squares
-(ALS)](http://dl.acm.org/citation.cfm?id=1608614)
+(ALS)](https://dl.acm.org/doi/10.1109/MC.2009.263)
 algorithm to learn these latent factors. The implementation in `spark.mllib` has the
 following parameters:

