Feature selection #170

lekoenig · 2022-10-26T13:42:49Z

lekoenig
Oct 26, 2022
Maintainer

Our "base" configuration of the LSTM model relied on 11 input features to make predictions of daily min/max/mean DO. Dynamic features included daily precipitation, air temperature (min/max), and solar radiation, and static features included catchment slope, catchment area, catchment impervious cover, stream slope, riparian canopy cover, and topographic wetness index in the catchment.

In #165, we expanded the list of static features that we download and process so that they're available for modeling. The new static input variables were selected based on hypothesized relationships between these variables and DO. The expanded feature set includes 108 static attributes. This number seems high because 1) some features are "pseudo-dynamic," e.g., the number of dams built on or before 1960, 1970, 1980, etc., and 2) we compile the values aggregated for both the local catchment and the full upstream watershed area. The features fall under a few general categories, including river/basin characteristics, climate, hydrologic modification, soils, and land cover.

Not all of the 108 static features should necessarily be added to the model. For example, some features are highly correlated across our sites and so include redundant information. My general approach to feature selection includes the following steps, and I'll include results in the discussion thread below.

Remove features that we're not interested in (mostly due to the "scale," i.e., catchment versus total upstream).
Inspect correlated features using Pearson's correlation coefficients and correlation plots, and remove redundant features.
For the features remaining after this first manual cut, use global permutation feature importance to assess how much each feature contributes to models containing those features.
For the features remaining after the manual cut, successively run models after dropping individual features to assess how important each feature is to model predictive accuracy.
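As a rough illustration of the last step, here is a minimal sketch of a drop-one-feature loop. Everything in it is an assumption for illustration: `drop_one_ablation`, `fake_train_and_score`, the feature names, and the skill numbers are hypothetical stand-ins, not our pipeline or results. In practice the scoring function would train the LSTM (probably several replicates) on the given feature subset and return a validation RMSE.

```python
def drop_one_ablation(features, train_and_score):
    """Change in RMSE when each feature is left out of the model in turn.

    `train_and_score` is a hypothetical callable that trains a model on the
    given list of feature names and returns its RMSE.
    """
    full_rmse = train_and_score(features)
    delta_rmse = {}
    for dropped in features:
        subset = [f for f in features if f != dropped]
        # Positive values mean the model got worse without this feature.
        delta_rmse[dropped] = train_and_score(subset) - full_rmse
    return delta_rmse


# Stand-in scorer with made-up numbers (NOT real results): each feature
# lowers a fictional RMSE of 2.0 by a fixed amount when it is included.
FAKE_SKILL = {"feature_a": 0.5, "feature_b": 0.25, "feature_c": 0.0}


def fake_train_and_score(subset):
    return 2.0 - sum(FAKE_SKILL[name] for name in subset)


ablation = drop_one_ablation(list(FAKE_SKILL), fake_train_and_score)
for name, delta in ablation.items():
    print(f"dropping {name}: delta RMSE = {delta:+.2f}")
```

Unlike permutation importance, this approach retrains the model for every dropped feature, so it costs one full training run (times the number of replicates) per feature.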

lekoenig · 2022-10-26T14:09:16Z

lekoenig
Oct 26, 2022
Maintainer Author

First-round cut of features

To winnow the 108 static features to a list of those that should be included in a "full" model, I first inspected the features and eliminated some that we're not interested in. These included:

For the variables CAT_NDAMS_YYYY, CAT_NID_STORAGEYYYY, CAT_NORM_STORAGEYYYY (1930 - 2013, mostly in decadal increments), I omitted all years except 2010 because there was no change in these values at the catchment scale during our modeling period, i.e., after 1980. I also eliminated the upstream watershed equivalent of these attributes because I assumed we are more interested in local-scale hydrologic modification. Note that if we ever decide we are interested in the cumulative upstream dams/dam storage, these pseudo-dynamic values do change over time and so we would have to decide how to treat those snapshot values.
Similar to upstream dams/dam storage, I dropped the cumulative upstream values for baseflow index (BFI), potential evapotranspiration (PET), average depth to water table (EWT), topographic wetness (TWI), riparian canopy cover, and percent clay/silt/sand. For these variables I assumed that the hypotheses linking them to DO were driven more by local, catchment-scale dynamics.

I then used Pearson correlation coefficients and correlation plots to see which of the remaining features are highly correlated. For example, medium-high development in the catchment is highly correlated with percent imperviousness.

Because of high correlations with other features, I ended up dropping 17 additional static attributes from the list. These are:

total upstream road-to-stream crossings (TOT_RDX), which was highly correlated with upstream watershed area (TOTDASQKM)
max dam storage in the catchment (CAT_NID_STORAGE2010), which was highly correlated with our normalized dam storage metric (CAT_NORM_STORAGE2010).
proportion clay and silt in catchment soils (CAT_CLAYAVE and CAT_SILTAVE), which were highly correlated with CAT_SANDAVE.
land cover metrics that were highly correlated with catchment and upstream-accumulated percent imperviousness (CAT_NLCD11_developed_medhi, TOT_NLCD11_developed_medhi, CAT_NLCD11_developed_low, TOT_NLCD11_developed_low, CAT_NLCD11_forest, TOT_NLCD11_forest, CAT_TOTAL_ROAD_DENS, TOT_TOTAL_ROAD_DENS).
upstream-accumulated 30-yr avg annual runoff (TOT_RUN7100), which was highly correlated with the catchment-scale avg annual runoff.
upstream-accumulated number of major NPDES sites (TOT_NPDES_MAJ), which was highly correlated with upstream watershed area (TOTDASQKM).
upstream-accumulated mean elevation (TOT_ELEV_MEAN), which was highly correlated with watershed slope (TOT_BASIN_SLOPE).
30-yr avg potential evapotranspiration in the catchment (CAT_PET), which was highly correlated with the catchment average rainfall-runoff factor (CAT_RFACT).
upstream-accumulated rainfall-runoff factor (TOT_RFACT), which was highly correlated with mean watershed slope (TOT_BASIN_SLOPE).

These steps leave 26 static attributes to be added to the LSTM.
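For anyone who wants to reproduce the correlation screen, here is a minimal sketch of how highly correlated pairs could be flagged for visual inspection. It is an illustration under assumptions, not the code used for this analysis: the function names (`pearson_r`, `flag_correlated_pairs`), the attribute names, and the values are all made up, and the 0.8 default follows the cutoff mentioned in the replies to this comment. It only flags pairs; deciding which member of a pair to drop was a manual judgment call.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant feature.
        return float("nan")
    return cov / sqrt(var_x * var_y)


def flag_correlated_pairs(features, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold.

    `features` maps an attribute name to its per-site values.
    """
    flagged = []
    for name_a, name_b in combinations(features, 2):
        r = pearson_r(features[name_a], features[name_b])
        if abs(r) >= threshold:  # NaN never passes this comparison
            flagged.append((name_a, name_b, r))
    # Strongest correlations first, so they are reviewed first.
    return sorted(flagged, key=lambda item: -abs(item[2]))


# Made-up values for four sites, for illustration only (not our data).
toy_features = {
    "attr_a": [1.0, 2.0, 3.0, 4.0],
    "attr_b": [2.1, 3.9, 6.2, 8.0],    # nearly proportional to attr_a
    "attr_c": [1.0, -1.0, -1.0, 1.0],  # unrelated to attr_a
}

pairs = flag_correlated_pairs(toy_features)
for name_a, name_b, r in pairs:
    print(f"{name_a} ~ {name_b}: r = {r:.2f}")
```

With only a handful of sites, a single outlier can push a pair above or below any fixed threshold, which is one reason to treat the flagged list as a prompt for inspecting the scatterplots rather than as an automatic filter.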

3 replies

galengorski Nov 23, 2022
Maintainer

I think this all makes sense. Was there a threshold for the correlation that you used to make the cutoff? Also, thinking about a feature selection figure for the manuscript, we could make a table of all possible attributes, and then label them as "used", "removed due to high correlation", "removed due to domain knowledge" or something like that as a summary. Probably best in the SI.

lekoenig Nov 23, 2022
Maintainer Author

Good question - because our dataset is relatively small (n = 16 reaches), I visually inspected all correlations that had a Pearson correlation coefficient >= 0.8. I like that idea for a supplemental table/figure!

jsadler2 Nov 29, 2022
Maintainer

I also like that idea for an SI figure/table.

lekoenig · 2022-10-26T14:27:40Z

lekoenig
Oct 26, 2022
Maintainer Author

Global permutation feature importance

I ran 5 model replicates of the baseline LSTM with 4 dynamic features (daily precip, min/max temperature, and incoming solar radiation) and 26 static attributes.

Overall, it seems like adding more static attributes to the model improves predictive performance across sites and observations. In the plots below, "baseline" (green) is the original model with 11 total features and "baseline_w_features" (blue) is the model with the expanded feature set (30 total features):

What perhaps feels familiar at this point is that the gains largely come from one site, 01481500, Brandywine Creek at Wilmington, DE.

The model version w/ the expanded set of attributes seems to do much better at predicting the summer trough of DO-mean. The original model contains a lot of variability among reps and some reps also really struggled with the seasonal transition from summer lows to winter highs. Interestingly, this also seems much better in the model with more attributes:

I calculated global permutation feature importance (ΔRMSE) across the model reps for each feature. Not surprisingly, daily min/max air temperature (tmmn and tmmx, respectively) are the most important to model RMSE across all observations:

Six of the 30 features had a mean ΔRMSE that was negative, suggesting that the information contained in these variables is no better than noise. I take this to mean that these variables could be omitted from the model x_vars:

number of major NPDES sites in the catchment (CAT_NPDES_MAJ)
average soil permeability in the catchment (CAT_PERMAVE) and the upstream watershed (TOT_PERMAVE)
mean watershed slope (TOT_BASIN_SLOPE)
proportion wetland cover in the upstream watershed (TOT_NLCD11_wetland)
proportion sand in catchment soils (CAT_SANDAVE)
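For reference, the ΔRMSE calculation can be sketched roughly as below. This is a simplified illustration, not the code behind the numbers above: `permutation_importance` and `toy_predict` are hypothetical names, the data are synthetic, and the toy model is a fixed function rather than a trained LSTM. How static attributes and time series were shuffled in the real workflow is not shown here.

```python
import random
from math import sqrt


def rmse(y_true, y_pred):
    """Root-mean-square error between observations and predictions."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def permutation_importance(predict, X, y, feature_names, n_repeats=5, seed=0):
    """Mean change in RMSE (permuted minus baseline) for each feature.

    `predict` takes a list of rows and returns one prediction per row.
    """
    rng = random.Random(seed)
    baseline = rmse(y, predict(X))
    importance = {}
    for j, name in enumerate(feature_names):
        deltas = []
        for _ in range(n_repeats):
            # Shuffle one column, breaking its link to the target while
            # leaving every other column untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            deltas.append(rmse(y, predict(X_perm)) - baseline)
        importance[name] = sum(deltas) / n_repeats
    return importance


# Synthetic example: the target depends only on the first column.
X = [[float(i), float((i * 7) % 5)] for i in range(20)]
y = [2.0 * row[0] for row in X]


def toy_predict(rows):
    """Stand-in for a trained model; it ignores the second feature."""
    return [2.0 * row[0] for row in rows]


imp = permutation_importance(toy_predict, X, y, ["used", "ignored"])
for name, delta in imp.items():
    print(f"{name}: mean delta RMSE = {delta:.3f}")
```

In this toy case the ignored feature gets a ΔRMSE of exactly zero. With a trained model and noisy data, an uninformative feature will instead scatter around zero, so small negative means like the ones listed above are consistent with the feature contributing little or nothing.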

1 reply

jsadler2 Nov 29, 2022
Maintainer

Interesting that those 6 features were adding no information. In the manuscript when we talk about this, we'll have to point out that that doesn't mean wetlands and NPDES sites, for example, are not important to DO, but that for these sites and for these time periods, that information did not help the model make any better predictions.
