Commit d47c613

author
Anushriya Jain
committed
changes added
1 parent 6cf07f2 commit d47c613

1 file changed

+28
-29
lines changed

samples/04_gis_analysts_data_scientists/forecasting_air_temperature_in_california.ipynb

Lines changed: 28 additions & 29 deletions
@@ -47,9 +47,9 @@
4747
"cell_type": "markdown",
4848
"metadata": {},
4949
"source": [
50-
"The rise in air temperature is directly correlated with Global warming and change in climatic conditions. Air temperature is one of the main factors in predicting other meteorological variables like streamflow, evapotranspiration and solar radiation. Hence accurate forecasting of this variable is prime to mitigate environmental and economic destruction. Including the dependency of air temperature in other variables like wind speed, precipitation, etc. is helping in more precise prediction. In this study, the deep learning TimeSeriesModel from arcgis.learn is used to predict monthly air temperature for two years at a ground station in the Fresno Yosemite International Airport, California, USA. The dataset ranges from 1948-2015. Data from January 2014 to November 2015 is used to validate the quality of the forecast.\n",
50+
"A rise in air temperature is directly correlated with global warming and changes in climatic conditions and is one of the main factors in predicting other meteorological variables, like streamflow, evapotranspiration, and solar radiation. As such, accurate forecasting of this variable is vital in pursuing the mitigation of environmental and economic destruction. Including the dependency of air temperature on other variables, like wind speed or precipitation, helps in deriving more precise predictions. In this study, the deep learning TimeSeriesModel from arcgis.learn is used to predict monthly air temperature for two years at a ground station at the Fresno Yosemite International Airport in California, USA. The dataset ranges from 1948-2015. Data from January 2014 to November 2015 is used to validate the quality of the forecast.\n",
5151
"\n",
52-
"Univariate time series modeling is one of the popular applications of time series analysis. This study includes multivariate time series analysis which is a bit more convoluted i.e. dataset contains more than one time-dependent variable. The TimeSeriesModel from arcgis.learn includes backbones such as InceptionTime, ResCNN, ResNet and FCN which does not need fine-tuning of multiple hyperparameters before fitting the model. Here is the schematic flow chart of the methodology:"
52+
"Univariate time series modeling is one of the more popular applications of time series analysis. This study includes multivariate time series analysis, which is a bit more convoluted, as the dataset contains more than one time-dependent variable. The TimeSeriesModel from arcgis.learn includes backbones, such as InceptionTime, ResCNN, ResNet and FCN, which do not need fine-tuning of multiple hyperparameters before fitting the model. Here is the schematic flow chart of the methodology:"
5353
]
5454
},
5555
{
@@ -337,7 +337,7 @@
337337
"cell_type": "markdown",
338338
"metadata": {},
339339
"source": [
340-
"Above dataframe contains columns of station ID (STATION), station name (NAME), Date (DATE), Wind speed (AWND), precipitation (PRCP), possible sunshine (PSUN), snow cover (SNOW), average temperature (TAVG), maximum temperature (TMAX), minimum temperature (TMIN), total sunshine (TSUN), and peak wind gust speed (WSFG). "
340+
"The dataframe above contains columns for station ID (STATION), station name (NAME), date (DATE), wind speed (AWND), precipitation (PRCP), possible sunshine (PSUN), snow cover (SNOW), average temperature (TAVG), maximum temperature (TMAX), minimum temperature (TMIN), total sunshine (TSUN), and peak wind gust speed (WSFG)."
341341
]
342342
},
343343
{
@@ -364,7 +364,7 @@
364364
"cell_type": "markdown",
365365
"metadata": {},
366366
"source": [
367-
"Next, the dataset is prepared wherein the variables of station, possible sunshine, snow cover, maximum temperature, minimum temperature, total sunshine, and peak wind gust speed are dropped, and then the dataset is chosen from 1987 onwards, to avoid missing values."
367+
"Next, the dataset is prepared by dropping the variables for station, possible sunshine, snow cover, maximum temperature, minimum temperature, total sunshine, and peak wind gust speed. Then, the dataset is narrowed to the data from 1987 on, to avoid missing values."
368368
]
369369
},
370370
{
@@ -496,7 +496,7 @@
496496
"cell_type": "markdown",
497497
"metadata": {},
498498
"source": [
499-
"Here **TAVG** is our variable to be predicted, the predictors used are **PRCP** and **AWND** showing their dependency over temperature."
499+
"Here, **TAVG** is our variable to be predicted, with **PRCP** and **AWND** being the predictors used, showing their influence on temperature."
500500
]
501501
},
502502
{
@@ -524,15 +524,15 @@
524524
"metadata": {},
525525
"source": [
526526
"## Time series data preprocessing<a class=\"anchor\" id=\"5\"></a> \n",
527-
"The preprocessing of the data for multivariate time series modeling includes the following steps:"
527+
"The preprocessing of the data for multivariate time series modeling involves the following steps:"
528528
]
529529
},
530530
{
531531
"cell_type": "markdown",
532532
"metadata": {},
533533
"source": [
534534
"### Converting into time series format<a class=\"anchor\" id=\"6\"></a>\n",
535-
"The dataset is now transformed into a time series data format by creating a new index that is to be used by the model for processing the sequential data."
535+
"The dataset is now transformed into a time series data format by creating a new index that will be used by the model for processing the sequential data."
536536
]
537537
},
538538
{
@@ -632,7 +632,7 @@
632632
"metadata": {},
633633
"source": [
634634
"### Data types of time series variables<a class=\"anchor\" id=\"7\"></a> \n",
635-
"Checking the data type of the variables."
635+
"Here we check the data types of the variables."
636636
]
637637
},
638638
{
@@ -666,7 +666,7 @@
666666
"cell_type": "markdown",
667667
"metadata": {},
668668
"source": [
669-
"The time-dependent variables should be in float. If the variable is not of a float data type, then it needs to be changed to float. Here, Windspeed (AWND) is converted from object dtype to float64 as shown in the next cell."
669+
"The time-dependent variables should be of the type float. If a time-dependent variable is not of a float data type, then it needs to be changed to float. Here, Windspeed (AWND) is converted from object dtype to float64, as shown in the next cell."
670670
]
671671
},
672672
{
@@ -765,7 +765,7 @@
765765
"metadata": {},
766766
"source": [
767767
"### Checking autocorrelation of time dependent variables<a class=\"anchor\" id=\"8\"></a> \n",
768-
"This step is to determine if the time series sequence is autocorrelated. To ensure that our time series data can be modeled well, the strength of correlation of the variable with its past data must be estimated."
768+
"The next step will determine if the time series sequence is autocorrelated. To ensure that our time series data can be modeled well, the strength of correlation of the variable with its past data must be estimated."
769769
]
770770
},
771771
{
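As a rough illustration of the autocorrelation check described in this hunk, the sketch below computes the Pearson correlation between a series and a lagged copy of itself. The helper name `lag_correlation` and the synthetic `monthly_temps` values are assumptions made for this example; the notebook's own cell may use a different routine.

```python
import math

def lag_correlation(values, lag):
    """Pearson correlation between a series and itself shifted by `lag` steps."""
    if lag <= 0 or lag >= len(values) - 1:
        raise ValueError("lag must be between 1 and len(values) - 2")
    head, tail = values[:-lag], values[lag:]
    mean_head = sum(head) / len(head)
    mean_tail = sum(tail) / len(tail)
    cov = sum((a - mean_head) * (b - mean_tail) for a, b in zip(head, tail))
    var_head = sum((a - mean_head) ** 2 for a in head)
    var_tail = sum((b - mean_tail) ** 2 for b in tail)
    return cov / math.sqrt(var_head * var_tail)

# Synthetic monthly temperatures with a 12-month cycle (made-up values).
monthly_temps = [60 + 15 * math.sin(2 * math.pi * m / 12) for m in range(120)]

lag_12 = lag_correlation(monthly_temps, 12)  # same month, one year earlier
lag_6 = lag_correlation(monthly_temps, 6)    # opposite season
print(f"lag 12: {lag_12:.3f}, lag 6: {lag_6:.3f}")
```

A strong positive value at lag 12 and a strong negative value at lag 6 is the pattern expected from a seasonal monthly series, which is what makes past values useful predictors.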
@@ -830,7 +830,7 @@
830830
"metadata": {},
831831
"source": [
832832
"### Creating dataset for prediction<a class=\"anchor\" id=\"9\"></a> \n",
833-
"Here in the original dataset the variable predict column of Average Temperature (TAVG) is allocated with NaNs, for the forecasting period of 2014-2015. This format is required for the `model.predict()` function in time series analysis, which will fill up the NaN values with forecasted temperatures."
833+
"Here, in the original dataset, the variable predict column of Average Temperature (TAVG) is populated with NaNs for the forecasting period of 2014-2015. This format is required for the `model.predict()` function in time series analysis, which will fill up the NaN values with forecasted temperatures."
834834
]
835835
},
836836
{
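The NaN layout described in this hunk can be sketched in plain Python as follows. The records, their values, and the helper `blank_target` are hypothetical and only show the shape of the data: explanatory variables stay filled, while the target is blanked from the forecast start date onward.

```python
import math
from datetime import date

# Hypothetical monthly records; the values are made up for illustration.
records = [
    {"DATE": date(2013, 11, 1), "AWND": 4.9, "PRCP": 0.61, "TAVG": 55.2},
    {"DATE": date(2013, 12, 1), "AWND": 4.3, "PRCP": 0.48, "TAVG": 46.1},
    {"DATE": date(2014, 1, 1), "AWND": 4.0, "PRCP": 0.23, "TAVG": 51.7},
    {"DATE": date(2014, 2, 1), "AWND": 5.6, "PRCP": 1.95, "TAVG": 55.4},
]

forecast_start = date(2014, 1, 1)

def blank_target(rows, target, start):
    """Return copies of rows with `target` set to NaN on or after `start`."""
    return [
        {**row, target: math.nan} if row["DATE"] >= start else dict(row)
        for row in rows
    ]

# The original records are left untouched so they can be used for validation.
prediction_rows = blank_target(records, "TAVG", forecast_start)
```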
@@ -930,7 +930,7 @@
930930
"metadata": {},
931931
"source": [
932932
"### Train - Test split of time series dataset<a class=\"anchor\" id=\"10\"></a> \n",
933-
"Out of these 27 years(1987-2015), 25 years of data is used for training the model and the 2 years (2014-2015) or a total of 23 months of data is used for forecasting and validation. Splitting timeseries data by keeping shuffle=False to keep the sequence intact, and test size of 12 months for validation."
933+
"Out of these 27 years (1987-2015), 25 years of data is used for training the model, with the remaining 23 months (2014-2015) being used for forecasting and validation. As we are splitting time series data, we set shuffle=False to keep the sequence intact, and we set a test size of 23 months for validation."
934934
]
935935
},
936936
{
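A minimal sketch of an unshuffled split that holds out the last 23 months is shown below. The helpers `month_range` and `chronological_split` are hypothetical names, and the January 1987 start month is an assumption based on the text's "from 1987 on". The slicing mirrors what `shuffle=False` is meant to guarantee: the test rows are strictly the most recent ones.

```python
def month_range(start_year, start_month, end_year, end_month):
    """List (year, month) pairs from start to end, inclusive."""
    months = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        months.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months

def chronological_split(rows, test_size):
    """Keep the original order and hold out the final `test_size` rows."""
    if not 0 < test_size < len(rows):
        raise ValueError("test_size must be between 1 and len(rows) - 1")
    return rows[:-test_size], rows[-test_size:]

# Assumed span: January 1987 through November 2015.
months = month_range(1987, 1, 2015, 11)
train, test = chronological_split(months, test_size=23)
print(train[-1], test[0], test[-1])
```

With these assumed dates, the held-out block runs from January 2014 to November 2015, matching the validation period named in the introduction.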
@@ -1103,16 +1103,15 @@
11031103
"cell_type": "markdown",
11041104
"metadata": {},
11051105
"source": [
1106-
"In this example, the dataset contains 'AWND' (Windspeed), 'PRCP' (Precipitation), and 'TAVG' (Average Air temperature) as time-dependent variables leading to multivariate time series analysis at monthly time scale. These variables are used to forecast the next 23 months of air temperature for the months after the last date in the training data, or, these multiple explanatory variables are used to predict the future values of the air temperature.\n",
1107-
"\n",
1108-
"Once the variables are identified, the preprocessing of the data is performed by the `prepare_tabulardata` method from the `arcgis.learn` module in the ArcGIS API for Python. This function takes either a non-spatial data frame, a feature layer, or a spatial data frame containing the dataset as input, and returns a TabularDataObject that can be fed into the model. By default, `prepare_tabulardata` scales/normalizes the numerical columns in dataset using StandardScaler.\n",
1106+
"In this example, the dataset contains 'AWND' (Windspeed), 'PRCP' (Precipitation), and 'TAVG' (Average Air temperature) as time-dependent variables leading to a multivariate time series analysis at a monthly time scale. These variables are used to forecast the next 23 months of air temperature for the months after the last date in the training data, or, in other words, these multiple explanatory variables are used to predict the future values of the dependent air temperature variable.\n",
11091107
"\n",
1108+
"Once the variables are identified, the preprocessing of the data is performed by the `prepare_tabulardata` method from the `arcgis.learn` module in the ArcGIS API for Python. This function takes either a non-spatial data frame, a feature layer, or a spatial data frame containing the dataset as input and returns a TabularDataObject that can be fed into the model. By default, `prepare_tabulardata` scales/normalizes the numerical columns in a dataset using StandardScaler.\n",
11101109
"The primary input parameters required for the tool are:\n",
11111110
"\n",
1112-
"- <span style='background :lightgrey' >input_features</span> : It takes the spatially enabled dataframe as a feature layer in this model\n",
1111+
"- <span style='background :lightgrey' >input_features</span> : Takes the spatially enabled dataframe as a feature layer in this model\n",
11131112
"- <span style='background :lightgrey' >variable_predict</span> : The field name of the forecasting variable\n",
1114-
"- <span style='background :lightgrey' >explanatory_variables</span> : list of the field names which are used as time-dependent variables in multivariate time series\n",
1115-
"- <span style='background :lightgrey' >index_field</span> : field name containing the timestamp which will be used as index field for the data and to visualize values on the x-axis in time series"
1113+
"- <span style='background :lightgrey' >explanatory_variables</span> : A list of the field names that are used as time-dependent variables in multivariate time series\n",
1114+
"- <span style='background :lightgrey' >index_field</span> : The field name containing the timestamp that will be used as the index field for the data and to visualize values on the x-axis in the time series"
11161115
]
11171116
},
11181117
{
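The text above notes that `prepare_tabulardata` scales numerical columns with StandardScaler by default. As a sketch of what that kind of standardization does to a single column (not the library's implementation), the hypothetical helper `standard_scale` below subtracts the mean and divides by the population standard deviation. The `precipitation` values are made up.

```python
import math

def standard_scale(values):
    """Standardize values to zero mean and unit (population) variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        raise ValueError("cannot scale a constant column")
    return [(v - mean) / std for v in values], mean, std

# Made-up precipitation values for illustration.
precipitation = [0.0, 0.5, 1.0, 1.5, 2.0]
scaled, mean, std = standard_scale(precipitation)

# Keeping the mean and std allows predictions to be mapped back to original units.
restored = [s * std + mean for s in scaled]
```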
@@ -1211,7 +1210,7 @@
12111210
"source": [
12121211
"### Model initialization <a class=\"anchor\" id=\"13\"></a>\n",
12131212
"\n",
1214-
"This is the important step for fitting a time series model. Here, along with the input dataset, the backbone for training the model and the sequence length is passed as parameters. Out of these three, the sequence length has to be selected carefully. The sequence length is usually the cycle of the data, which in this case is 12, as it is monthly data and the pattern repeats after 12 months. In model initialization, the data and the backbone is selected from the available set of InceptionTime, ResCNN, Resnet, and FCN."
1213+
"This is an important step for fitting a time series model. Here, along with the input dataset, the backbone for training the model and the sequence length are passed as parameters. Out of these three, the sequence length has to be selected carefully. The sequence length is usually the cycle of the data, which in this case is 12, as it is monthly data and the pattern repeats after 12 months. In model initialization, the data and the backbone are selected from the available set of InceptionTime, ResCNN, ResNet, and FCN."
12151214
]
12161215
},
12171216
{
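To make the role of the sequence length more concrete, the sketch below pairs each run of 12 consecutive values with the value that follows it. This is only an illustration of the windowing idea under the assumption of a one-step-ahead target; it is not a description of how `TimeSeriesModel` builds its training samples internally, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(series, sequence_length):
    """Pair each run of `sequence_length` values with the value that follows it."""
    if sequence_length >= len(series):
        raise ValueError("sequence_length must be shorter than the series")
    return [
        (series[i:i + sequence_length], series[i + sequence_length])
        for i in range(len(series) - sequence_length)
    ]

# Fifteen stand-in monthly values; real data would be temperatures.
monthly = list(range(1, 16))
samples = sliding_windows(monthly, sequence_length=12)

first_window, first_target = samples[0]
```

With a sequence length of 12 on monthly data, every window spans one full seasonal cycle, which is the reason the text recommends matching the sequence length to the cycle of the data.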
@@ -1228,7 +1227,7 @@
12281227
"metadata": {},
12291228
"source": [
12301229
"### Learning rate search<a class=\"anchor\" id=\"14\"></a>\n",
1231-
"Finding the learning rate for training the model"
1230+
"Here, we find the optimal learning rate for training the model."
12321231
]
12331232
},
12341233
{
@@ -1259,7 +1258,7 @@
12591258
"source": [
12601259
"### Model training <a class=\"anchor\" id=\"15\"></a>\n",
12611260
"\n",
1262-
"The model is now ready for training. To train the model, the `model.fit` method is used and is provided with the number of epochs for training and the estimated learning rate suggested by `lr_find` in the previous step:"
1261+
"The model is now ready for training. To train the model, the `model.fit` method is used and is provided with the number of epochs for training and the learning rate suggested above as parameters:"
12631262
]
12641263
},
12651264
{
@@ -1899,7 +1898,7 @@
18991898
"cell_type": "markdown",
19001899
"metadata": {},
19011900
"source": [
1902-
"To check the quality of the trained model or whether the model needs more training, train vs valid loss plot is shown below"
1901+
"To check the quality of the trained model and whether the model needs more training, we generate a train vs validation loss plot below:"
19031902
]
19041903
},
19051904
{
@@ -1928,7 +1927,7 @@
19281927
"cell_type": "markdown",
19291928
"metadata": {},
19301929
"source": [
1931-
"The predicted vs the actual values by the trained model is printed for the training dataset."
1930+
"Next, the predicted values of the model and the actual values are printed for the training dataset."
19321931
]
19331932
},
19341933
{
@@ -1965,7 +1964,7 @@
19651964
"metadata": {},
19661965
"source": [
19671966
"### Forecasting using the trained TimeSeriesModel <a class=\"anchor\" id=\"17\"></a>\n",
1968-
"During forecasting, the model uses dataset prepared above with NaN values as input with `prediction_type` as `dataframe`."
1967+
"During forecasting, the model uses the dataset prepared above with NaN values as input, with the `prediction_type` set as `dataframe`."
19691968
]
19701969
},
19711970
{
@@ -2370,7 +2369,7 @@
23702369
"cell_type": "markdown",
23712370
"metadata": {},
23722371
"source": [
2373-
"Formating the result into actual vs the predicted columns"
2372+
"Next, we format the results into actual vs predicted columns."
23742373
]
23752374
},
23762375
{
@@ -2575,7 +2574,7 @@
25752574
"metadata": {},
25762575
"source": [
25772576
"### Estimate model metrics for validation <a class=\"anchor\" id=\"18\"></a>\n",
2578-
"The accuracy of the forecasted values is measured by comparing the forecasted values against the actual values for 23 months."
2577+
"The accuracy of the forecasted values is measured by comparing the forecasted values against the actual values for the 23 months chosen for testing."
25792578
]
25802579
},
25812580
{
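The two metrics discussed in the next hunk, R-squared and RMSE, can be computed as in the sketch below. The `actual` and `forecast` lists are made-up numbers, not the notebook's results, and the function names are chosen for this example.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up values for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
forecast = [1.1, 1.9, 3.2, 3.8]

error = rmse(actual, forecast)
score = r_squared(actual, forecast)
print(f"RMSE: {error:.4f}, R-squared: {score:.4f}")
```

RMSE is expressed in the units of the target variable, so whether a given value counts as low depends on the range of the temperatures being forecast.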
@@ -2615,15 +2614,15 @@
26152614
"cell_type": "markdown",
26162615
"metadata": {},
26172616
"source": [
2618-
"A considerably high r-square value indicates a high similarity between the forecasted and the actual values. And, RMSE error is quite low showing good fit by model."
2617+
"A considerably high r-square value of 0.91 indicates a high similarity between the forecasted values and the actual values. Furthermore, the RMSE of 3.661 is quite low, indicating a good fit by the model."
26192618
]
26202619
},
26212620
{
26222621
"cell_type": "markdown",
26232622
"metadata": {},
26242623
"source": [
26252624
"## Result visualization<a class=\"anchor\" id=\"19\"></a>\n",
2626-
"Finally, the actual and forecasted values are plotted to visualize their distribution over the validation period, with the orange lines indicating forecasted values and the blue line showing the actual values."
2625+
"Finally, the actual and forecasted values are plotted to visualize their distribution over the validation period, with the orange line representing the forecasted values and the blue line representing the actual values."
26272626
]
26282627
},
26292628
{
@@ -2664,7 +2663,7 @@
26642663
"cell_type": "markdown",
26652664
"metadata": {},
26662665
"source": [
2667-
"The study conducted multivariate time series analysis using Deep learning TimeSeriesModel from the arcgis.learn and forecasted monthly Air temperature for a station in California. The model was trained with 25 years (1987-2013) of data and forecasted for a period of 2 years (2014-2015) with high accuracy. The other dependent variables were wind speed and precipitation. The methodology included preparing a times series dataset using the prepare_tabulardata() method, followed by modeling, predicting, and then validating the test dataset. Usually, time series modeling requires fine-tuning several hyperparameters for properly fitting the data, most of which has been internalized in this Model, leaving the user responsible for configuring only a few significant parameters, like the sequence length."
2666+
"The study conducted a multivariate time series analysis using the deep learning TimeSeriesModel from the arcgis.learn library and forecasted the monthly air temperature for a station in California. The model was trained with 25 years of data (1987-2013) that was used to forecast a period of 2 years (2014-2015) with high accuracy. The independent variables were wind speed and precipitation. The methodology included preparing a time series dataset using the prepare_tabulardata() method, followed by modeling, predicting, and validating the test dataset. Usually, time series modeling requires fine-tuning several hyperparameters for properly fitting the data, most of which have been internalized in this model, leaving the user responsible for configuring only a few significant parameters, like the sequence length."
26682667
]
26692668
},
26702669
{
