tidymodels · franceslinyc · Sep 5, 2025 · Aug 12, 2025 · Aug 18, 2025 · Aug 18, 2025
diff --git a/_freeze/learn/develop/filtro/index/execute-results/html.json b/_freeze/learn/develop/filtro/index/execute-results/html.json
@@ -0,0 +1,15 @@
+{
+  "hash": "72caae83bac60ced40baca72547a173e",
+  "result": {
+    "engine": "knitr",
+    "markdown": "---\ntitle: \"Create your own score class object\"\ncategories:\n  - developer tools\ntype: learn-subsection\nweight: 1\ndescription: | \n Create a new score class object for feature selection.\ntoc: true\ntoc-depth: 3\ninclude-after-body: ../../../resources.html\n---\n\n\n\n\n\n## Introduction\n\nTo use the code in this article, you will need to install the following packages: filtro and modeldata.\n\nCurrently, there are 6 filters in filtro and many existing score objects. A list of existing scoring objects [can be found here](https://filtro.tidymodels.org/articles/filtro.html#available-score-objects-and-filter-methods). However, you might need to define your own scoring objects. This article serves as a guide to creating new scoring objects and computing feature scores before performing ranking and selection. \n\nFor reference, filtro provides tidy tools for applying filter-based supervised feature selection methods. It provides functions to rank and select a specified proportion or a fixed number of features using built-in methods and the desirability function. \n\nRegarding scoring objects: \n\nThere is a parent class `class_score`, which defines the fixed properties that are shared across all subclasses. The parent class is already implemented, and serves as the infrastructure we build on when we make our own scoring class object.\n\nTherefore, the general procedure is to:\n\n1. Create a subclass `class_score_*` that inherits from `class_score`. This subclass introduces additional, method- or score-specific properties, as opposed to the general characteristics already defined in the parent class. \n\n2. Implement the scoring method in `fit()`, which computes feature scores. The `fit()` generic dispatches on the subclass from step 1 to select the appropriate `fit()` method. 
\n\nThe hierarchy can be visualized as:\n\n```\nclass_score\n└─> class_score_* \n └─> fit()\n```\n\nAdditionally, we provide guidance on documenting an S7 method.\n\n## Parent class (General scoring object)\n\nAll the subclasses (custom scoring objects) share the same parent class named `class_score`. The parent class is already implemented, and we need this to build our own scoring class object: \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Call the parent class\nlibrary(filtro) \nclass_score\n```\n:::\n\n\nThese are the fixed properties (attributes) for this object:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nargs(class_score)\n#> function (outcome_type = c(\"numeric\", \"factor\"), predictor_type = c(\"numeric\", \n#> \"factor\"), case_weights = logical(0), range = integer(0), inclusive = logical(0), \n#>     fallback_value = integer(0), score_type = character(0), transform_fn = function() NULL, \n#>     direction = character(0), deterministic = logical(0), tuning = logical(0), \n#>     calculating_fn = function() NULL, label = character(0), packages = character(0), \n#>     results = data.frame()) \n#> NULL\n```\n:::\n\n\nFor example: \n\n-   `case_weights`: Does the method accept case weights? It is `TRUE` or `FALSE`.\n\n-   `fallback_value`: What is a value that can be used for the statistic so that it will never be eliminated? For example, `0` or `Inf`.\n\n-   `direction`: What direction of values indicates the most important values? For example, `minimize` or `maximize`.\n\n-   `results`: A slot for the results once the method is fitted. Initially, this is an empty data frame.\n\nFor details on its constructor and its remaining properties, please refer to the package documentation or [check it out here](https://github.com/tidymodels/filtro/blob/main/R/class_score.R). 
\n\n## Subclass (Custom scoring object)\n\nAll custom scoring objects implemented in filtro are subclasses of the parent class `class_score`, meaning that they all inherit the parent class's fixed properties. When creating a new scoring object, we do so by defining another subclass of this parent class.\n\n```\nclass_score\n└─> class_score_aov (example shown)\n└─> class_score_cor\n└─> ... \n```\n\nAs an example, we demonstrate how to create a custom scoring object for ANOVA F-test named `class_score_aov`. \n\nFor reference, the ANOVA F-test filter computes feature score using analysis of variance (ANOVA) hypothesis tests, powered by `lm()`. The `lm()` function fits a linear model and returns a summary containing the F-statistic and p-value, which can be used to evaluate feature importance. \n\nBy setting `parent = class_score`, the subclass `class_score_aov` inherits all of the fixed properties from the parent class. Additional, implementation-specific properties can be added using the `properties =` argument. For example:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Create a subclass named 'class_score_aov'\nclass_score_aov <- S7::new_class(\n  \"class_score_aov\",\n  parent = class_score,\n  properties = list(\n    neg_log10 = S7::new_property(S7::class_logical, default = TRUE)\n  )\n)\n```\n:::\n\n\nIn addition to the properties inherited from the parent class (discussed in the previous section), `class_score_aov` also includes the following property:\n\n- `neg_log10`: Represent the score as `-log10(p_value)`? It is `TRUE` or `FALSE`.\n\nFor the ANOVA F-test filter, users can represent the score using either the \n\n- p-value or \n\n- F-statistic. \n\nWe demonstrate how to create these instances (objects) accordingly. 
\n\n`score_aov_pval` is created as an instance of the `class_score_aov` subclass by calling its constructor and specifying its properties:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# ANOVA p-value\nscore_aov_pval <-\n  class_score_aov(\n    outcome_type = c(\"numeric\", \"factor\"),\n    predictor_type = c(\"numeric\", \"factor\"),\n    case_weights = TRUE,\n    range = c(0, Inf),\n    inclusive = c(FALSE, FALSE),\n    fallback_value = Inf,\n    score_type = \"aov_pval\",\n    direction = \"maximize\",\n    deterministic = TRUE,\n    tuning = FALSE,\n    label = \"ANOVA p-values\"\n  )\n```\n:::\n\n\nOnce instantiated, individual properties can be accessed via `object@`. For example: \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nscore_aov_pval@case_weights\n#> [1] TRUE\nscore_aov_pval@fallback_value\n#> [1] Inf\nscore_aov_pval@direction\n#> [1] \"maximize\"\n```\n:::\n\n\nNote that by default, the returned p-value is transformed to `-log10(p_value)`, which means larger values correspond to more important predictors. This is why the fallback value is set to `Inf` and the direction is set to `\"maximize\"`. \n\n`score_aov_fstat` is another instance of the `class_score_aov` subclass:  \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# ANOVA F-statistic\nscore_aov_fstat <-\n  class_score_aov(\n    outcome_type = c(\"numeric\", \"factor\"),\n    predictor_type = c(\"numeric\", \"factor\"),\n    case_weights = TRUE,\n    range = c(0, Inf),\n    inclusive = c(FALSE, FALSE),\n    fallback_value = Inf,\n    score_type = \"aov_fstat\",\n    direction = \"maximize\",\n    deterministic = TRUE,\n    tuning = FALSE,\n    label = \"ANOVA F-statistics\"\n  )\n```\n:::\n\n\nThe F-statistic is not transformed, nor does it provide an option for transformation. Nevertheless, it also uses a fallback value of `Inf` with the direction set to `\"maximize\"`, since larger F-statistic values indicate more important predictors. 
\n\n## Fitting (or estimating) feature score\n\nSo far, we have covered the parent class, how to create a custom subclass, and how to instantiate objects for the ANOVA F-test filter. \n\nWe now discuss the dual role of `fit()`: it functions both as a *generic* and as the *methods* used to fit (or estimate) feature scores. \n\n1. The `fit()` generic is re-exported from generics. It inspects the class of the object passed and dispatches to the appropriate method. \n\n2. We also define multiple methods named `fit()`. Each `fit()` method performs the actual fitting or score estimation for a specific class of object. \n\nIn other words, when `fit()` is called, the generic refers to the custom scoring object `class_score_*` to determine which method to dispatch. The actual scoring computation is performed within the dispatched method. \n\n```\nclass_score\n└─> class_score_aov (example shown)\n └─> fit()\n└─> class_score_cor\n └─> fit()\n└─> ... \n```\n\nLet’s use the ANOVA F-test filter again as an example: \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# User-level example: Check the class of the object\nclass(score_aov_pval)\n#> [1] \"filtro::class_score_aov\" \"filtro::class_score\"    \n#> [3] \"S7_object\"\nclass(score_aov_fstat)\n#> [1] \"filtro::class_score_aov\" \"filtro::class_score\"    \n#> [3] \"S7_object\"\n```\n:::\n\n\nBoth instances (objects) belong to the subclass `class_score_aov`. Therefore, when `fit()` is called, the method for `class_score_aov` is dispatched, performing the actual fitting using the ANOVA F-test:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# User-level example: Method dispatch for objects of class `class_score_aov`\nscore_aov_pval |>\n  fit(Sale_Price ~ ., data = ames)\nscore_aov_fstat |>\n  fit(Sale_Price ~ ., data = ames)\n```\n:::\n\n\n## Defining S7 methods \n\nFor users to use the `fit()` method described above, we need to define an S7 method that implements the scoring logic. 
\n\nThe following code defines the `fit()` method specifically for the `class_score_aov` subclass, specifying how feature scores should be computed using the ANOVA F-test:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Define the scoring method for `class_score_aov`\nS7::method(fit, class_score_aov) <- function(\n  object,\n  formula,\n  data,\n  case_weights = NULL,\n  ...\n) {\n  # This is where you add the rest of the code for this implementation \n\n  object@results <- res\n  object\n}\n```\n:::\n\n\nWe would do something similar to define an S7 method for any other `class_score_*` subclass. \n\n## Documenting S7 methods \n\nDocumentation for S7 methods is still a work in progress, but our current best approach is as follows: \n\n- We re-export the `fit()` generic from generics. \n\n- Instead of documenting each `fit()` method, we document it in the \"Details\" section and the \"Estimating the scores\" subsection of the documentation for the corresponding object (instance) `score_*`. 
\n\nThe code below opens the help page for the `fit()` generic: \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# User-level example: Help page for `fit()` generic\n?fit\n```\n:::\n\n\nThe code below opens the help page for a specific `fit()` method: \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# User-level example: Help page for `fit()` method along with the documentation for the specific object\n?score_aov_pval\n?score_aov_fstat\n```\n:::\n\n\nFor users to access the help page using `?` as described above, the `fit()` method needs to be exported using `#' @export`, but it is not documented directly.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n#' @export\nS7::method(fit, class_score_aov) <- function(\n  object,\n  ...\n) {\n  # Include the rest of the function here\n\n  object@results <- res\n  object\n}\n```\n:::\n\n\nInstead, documentation is provided in the \"Details\" section and the \"Estimating the scores\" subsection of the documentation for the `score_aov_pval` object.  \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n#' Scoring via analysis of variance hypothesis tests\n#'\n#' @description\n#' \n#' @name score_aov_pval\n#' @family class score metrics\n#'\n#' @details\n#'\n#' These objects are used when either:\n#'\n#' ...\n#'\n#' ## Estimating the scores\n#'\n#' In \\pkg{filtro}, the `score_*` objects define a scoring method (e.g., data\n#' input requirements, package dependencies, etc). To compute the scores for\n#' a specific data set, the `fit()` method is used. The main arguments for\n#' these functions are:\n#'\n#'   \\describe{\n#'     \\item{`object`}{A score class object (e.g., `score_aov_pval`).}\n#'     \\item{`formula`}{A standard R formula with a single outcome on the left-hand side and one or more predictors (or `.`) on the right-hand side. 
The data are processed via [stats::model.frame()]}\n#'     \\item{`data`}{A data frame containing the relevant columns defined by the formula.}\n#'     \\item{`...`}{Further arguments passed to or from other methods.}\n#'     \\item{`case_weights`}{A quantitative vector of case weights that is the same length as the number of rows in `data`. The default of `NULL` indicates that there are no case weights.}\n#'   }\n#'\n#' ...\n#' \n#' @export\nscore_aov_pval <-\n  class_score_aov(\n    outcome_type = c(\"numeric\", \"factor\"),\n    predictor_type = c(\"numeric\", \"factor\"),\n    case_weights = TRUE,\n    range = c(0, Inf),\n    inclusive = c(FALSE, FALSE),\n    fallback_value = Inf,\n    score_type = \"aov_pval\",\n    transform_fn = function(x) x,\n    direction = \"maximize\",\n    deterministic = TRUE,\n    tuning = FALSE,\n    label = \"ANOVA p-values\"\n  )\n```\n:::\n\n\nWe can have the `score_aov_fstat` object share the same help page as `score_aov_pval` by using `#' @name`. This avoids repeated documentation for similar or related objects.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n#' @name score_aov_pval\n#' @export\nscore_aov_fstat <-\n  class_score_aov(\n    outcome_type = c(\"numeric\", \"factor\"),\n    predictor_type = c(\"numeric\", \"factor\"),\n    case_weights = TRUE,\n    range = c(0, Inf),\n    inclusive = c(FALSE, FALSE),\n    fallback_value = Inf,\n    score_type = \"aov_fstat\",\n    transform_fn = function(x) x,\n    direction = \"maximize\",\n    deterministic = TRUE,\n    tuning = FALSE,\n    label = \"ANOVA F-statistics\"\n  )\n```\n:::\n\n\n## Accessing results after fitting\n\nOnce the method has been fitted via `fit()`, the data frame of results can be accessed via `object@results`. \n\nWe use a subset of the Ames data set from the {modeldata} package for demonstration. The goal is to predict housing sale price. `Sale_Price` is the outcome and is numeric. 
\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(modeldata)\names_subset <- modeldata::ames |>\n  # Use a subset of data for demonstration\n  dplyr::select(\n    Sale_Price,\n    MS_SubClass,\n    MS_Zoning,\n    Lot_Frontage,\n    Lot_Area,\n    Street\n  )\names_subset <- ames_subset |>\n  dplyr::mutate(Sale_Price = log10(Sale_Price))\n```\n:::\n\n\nNext, we fit the score as discussed earlier: \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Specify ANOVA p-value and fit score\names_aov_pval_res <-\n  score_aov_pval |>\n  fit(Sale_Price ~ ., data = ames_subset)\n```\n:::\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Specify ANOVA F-statistic and fit score\names_aov_fstat_res <-\n  score_aov_fstat |>\n  fit(Sale_Price ~ ., data = ames_subset)\n```\n:::\n\n\nRecall that individual properties of an object can be accessed using `object@`. Once the method has been fitted, the resulting data frame can be accessed via `object@results`:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\names_aov_pval_res@results\n#> # A tibble: 5 × 4\n#>   name      score outcome    predictor   \n#>   <chr>     <dbl> <chr>      <chr>       \n#> 1 aov_pval 237.   Sale_Price MS_SubClass \n#> 2 aov_pval 130.   Sale_Price MS_Zoning   \n#> 3 aov_pval  NA    Sale_Price Lot_Frontage\n#> 4 aov_pval  NA    Sale_Price Lot_Area    \n#> 5 aov_pval   5.75 Sale_Price Street\n```\n:::\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\names_aov_fstat_res@results\n#> # A tibble: 5 × 4\n#>   name      score outcome    predictor   \n#>   <chr>     <dbl> <chr>      <chr>       \n#> 1 aov_fstat  94.5 Sale_Price MS_SubClass \n#> 2 aov_fstat 115.  
Sale_Price MS_Zoning   \n#> 3 aov_fstat  NA   Sale_Price Lot_Frontage\n#> 4 aov_fstat  NA   Sale_Price Lot_Area    \n#> 5 aov_fstat  22.9 Sale_Price Street\n```\n:::\n\n\n## Session information {#session-info}\n\n\n::: {.cell layout-align=\"center\"}\n\n```\n#> \n#> Attaching package: 'dplyr'\n#> The following objects are masked from 'package:stats':\n#> \n#>     filter, lag\n#> The following objects are masked from 'package:base':\n#> \n#>     intersect, setdiff, setequal, union\n#> ─ Session info ─────────────────────────────────────────────────────\n#>  version  R version 4.5.0 (2025-04-11)\n#>  language (EN)\n#>  date     2025-08-29\n#>  pandoc   3.6.3\n#>  quarto   1.7.32\n#> \n#> ─ Packages ─────────────────────────────────────────────────────────\n#>  package     version date (UTC) source\n#>  dplyr       1.1.4   2023-11-17 CRAN (R 4.5.0)\n#>  filtro      0.2.0   2025-08-26 CRAN (R 4.5.0)\n#>  modeldata   1.5.1   2025-08-22 CRAN (R 4.5.0)\n#>  purrr       1.1.0   2025-07-10 CRAN (R 4.5.0)\n#>  rlang       1.1.6   2025-04-11 CRAN (R 4.5.0)\n#>  tibble      3.3.0   2025-06-08 CRAN (R 4.5.0)\n#> \n#> ────────────────────────────────────────────────────────────────────\n```\n:::\n\n",
+    "supporting": [],
+    "filters": [
+      "rmarkdown/pagebreak.lua"
+    ],
+    "includes": {},
+    "engineDependencies": {},
+    "preserve": {},
+    "postProcess": true
+  }
+}
diff --git a/installs.R b/installs.R
@@ -18,6 +18,7 @@ packages <- c(
   "doParallel",
   "dotwhisker",
   "embed",
+  "filtro",
   "forecast",
   "fs",
   "furrr",