Move the downloading of example data files into the build scripts and point users to where the files are hosted, instead of adding urllib requests to the Python examples, so the examples can focus on what matters most to the user

timsaucer · timsaucer · commit eba8f6cf1edb · 2024-11-24T08:53:37.000-05:00
diff --git a/.github/workflows/docs.yaml b/.github/workflows/docs.yaml
@@ -75,6 +75,8 @@ jobs:
           set -x
           source venv/bin/activate
           cd docs
+          curl -O https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv
+          curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet
           make html
 
       - name: Copy & push the generated HTML
diff --git a/docs/.gitignore b/docs/.gitignore
@@ -1,2 +1,4 @@
 pokemon.csv
 yellow_trip_data.parquet
+yellow_tripdata_2021-01.parquet
+
diff --git a/docs/build.sh b/docs/build.sh
@@ -19,8 +19,17 @@
 #
 
 set -e
+
+if [ ! -f pokemon.csv ]; then
+    curl -O https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv
+fi
+
+if [ ! -f yellow_tripdata_2021-01.parquet ]; then
+    curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet
+fi
+
 rm -rf build 2> /dev/null
 rm -rf temp 2> /dev/null
 mkdir temp
 cp -rf source/* temp/
-make SOURCEDIR=`pwd`/temp html
+make SOURCEDIR=`pwd`/temp html
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -43,27 +43,13 @@ Example
 
 .. ipython:: python
 
-    import datafusion
-    from datafusion import col
-    import pyarrow
-
-    # create a context
-    ctx = datafusion.SessionContext()
-
-    # create a RecordBatch and a new DataFrame from it
-    batch = pyarrow.RecordBatch.from_arrays(
-        [pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
-        names=["a", "b"],
-    )
-    df = ctx.create_dataframe([[batch]], name="batch_array")
-
-    # create a new statement
-    df = df.select(
-        col("a") + col("b"),
-        col("a") - col("b"),
-    )
-
-    df
+    from datafusion import SessionContext
+
+    ctx = SessionContext()
+
+    df = ctx.read_csv("pokemon.csv")
+
+    df.show()
 
 
 .. _toc.links:
diff --git a/docs/source/user-guide/basics.rst b/docs/source/user-guide/basics.rst
@@ -25,7 +25,7 @@ source file as described in the :ref:`Introduction <guide>`, the Pokemon data se
 
 .. ipython:: python
 
-    from datafusion import SessionContext, functions as F
+    from datafusion import SessionContext, col, functions as F
 
     ctx = SessionContext()
 
diff --git a/docs/source/user-guide/common-operations/aggregations.rst b/docs/source/user-guide/common-operations/aggregations.rst
@@ -26,16 +26,10 @@ to form a single summary value. For performing an aggregation, DataFusion provid
 
 .. ipython:: python
 
-    import urllib.request
     from datafusion import SessionContext
     from datafusion import col, lit
     from datafusion import functions as f
 
-    urllib.request.urlretrieve(
-        "https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv",
-        "pokemon.csv",
-    )
-
     ctx = SessionContext()
     df = ctx.read_csv("pokemon.csv")
 
diff --git a/docs/source/user-guide/common-operations/functions.rst b/docs/source/user-guide/common-operations/functions.rst
@@ -25,14 +25,8 @@ We'll use the pokemon dataset in the following examples.
 
 .. ipython:: python
 
-    import urllib.request
     from datafusion import SessionContext
 
-    urllib.request.urlretrieve(
-    "https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv",
-    "pokemon.csv",
-    )
-
     ctx = SessionContext()
     ctx.register_csv("pokemon", "pokemon.csv")
     df = ctx.table("pokemon")
diff --git a/docs/source/user-guide/common-operations/select-and-filter.rst b/docs/source/user-guide/common-operations/select-and-filter.rst
@@ -21,18 +21,15 @@ Column Selections
 Use :py:func:`~datafusion.dataframe.DataFrame.select`  for basic column selection.
 
 DataFusion can work with several file types, to start simple we can use a subset of the 
-`TLC Trip Record Data <https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page>`_
+`TLC Trip Record Data <https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page>`_,
+which you can download `here <https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet>`_.
 
 .. ipython:: python
-    
-    import urllib.request
-    from datafusion import SessionContext
 
-    urllib.request.urlretrieve("https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet",
-                               "yellow_trip_data.parquet")
+    from datafusion import SessionContext
     
     ctx = SessionContext()
-    df = ctx.read_parquet("yellow_trip_data.parquet")
+    df = ctx.read_parquet("yellow_tripdata_2021-01.parquet")
     df.select("trip_distance", "passenger_count")
 
 For mathematical or logical operations use :py:func:`~datafusion.col` to select columns, and give meaningful names to the resulting
diff --git a/docs/source/user-guide/common-operations/windows.rst b/docs/source/user-guide/common-operations/windows.rst
@@ -30,16 +30,10 @@ We'll use the pokemon dataset (from Ritchie Vink) in the following examples.
 
 .. ipython:: python
 
-    import urllib.request
     from datafusion import SessionContext
     from datafusion import col
     from datafusion import functions as f
 
-    urllib.request.urlretrieve(
-        "https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv",
-        "pokemon.csv",
-    )
-
     ctx = SessionContext()
     df = ctx.read_csv("pokemon.csv")
 
diff --git a/docs/source/user-guide/introduction.rst b/docs/source/user-guide/introduction.rst
@@ -52,10 +52,6 @@ options for data sources. For our first example, we demonstrate using a Pokemon
 can download
 `here <https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv>`_.
 
-.. code-block:: shell
-
-    curl -O https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv
-
 With that file in place you can use the following python example to view the DataFrame in
 DataFusion.