This repository was archived by the owner on Oct 8, 2019. It is now read-only.

Support feature selection #338

@amaya382

Description

Feature selection

Feature selection is the process of selecting a subset of influential features from the full feature set. It is an important technique for improving results, shortening training time, and making features human-understandable.

The following is the current (temporary) interface.

Candidates for internal selecting methods

  • chi2 (for non-negative data only)
  • SNR
  • mRMR

Common

[UDAF] transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>

Input

| array X | array Y |
| --- | --- |
| a row of matrix | a row of matrix |

Output

| array<array<double>> dotted |
| --- |
| dot(X.T, Y), shape = (X.#cols, Y.#cols) |
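A minimal pure-Python sketch of the intended semantics (the function name mirrors the UDAF, but this is an illustration, not Hivemall's implementation): each aggregated row contributes the outer product of its `X` row and `Y` row, so the final accumulator equals dot(X.T, Y).

```python
def transpose_and_dot(x_rows, y_rows):
    """Accumulate outer products row by row; result is dot(X.T, Y),
    shape = (X.#cols, Y.#cols)."""
    n_xcols = len(x_rows[0])
    n_ycols = len(y_rows[0])
    acc = [[0.0] * n_ycols for _ in range(n_xcols)]
    for x, y in zip(x_rows, y_rows):
        for i, xi in enumerate(x):
            for j, yj in enumerate(y):
                acc[i][j] += xi * yj
    return acc

X = [[1.0, 2.0], [3.0, 4.0]]    # shape (2, 2)
Y = [[1.0], [0.0]]              # shape (2, 1)
print(transpose_and_dot(X, Y))  # dot(X.T, Y), shape (2, 1)
```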

[UDF] select_k_best(X::array<number>, importance_list::array<int>, k::int)::array<double>

Input

| array X | array importance_list | int k |
| --- | --- | --- |
| a row of matrix | the larger, the more important | number of features to select (top-k) |

Output

| array<double> k-best elements |
| --- |
| top-k elements of X, chosen by the indices of the k largest values in importance_list |
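A hypothetical pure-Python sketch of the selection logic, assuming the selected elements keep their original feature order (that ordering is an assumption, not confirmed by the interface above):

```python
def select_k_best(x, importance_list, k):
    """Keep the k elements of x at the indices of the k largest importances."""
    # indices of the k largest importance values
    top = sorted(range(len(importance_list)),
                 key=lambda i: importance_list[i], reverse=True)[:k]
    top.sort()  # assumption: preserve the original feature order
    return [float(x[i]) for i in top]

# features 1 and 3 have the largest importances, so they are kept
print(select_k_best([10, 20, 30, 40], [0.1, 0.9, 0.3, 0.7], 2))
```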

/***********************************************************************

Note

  • The current implementation assumes that _every row passes the identical importance_list and k_. This may be confusing.
    • Possible future workaround: add an option that declares a single shared importance_list and k

***********************************************************************/

chi2

[UDF] chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>

Input

Both observed and expected have shape = (#classes, #features).

| array<array<number>> observed | array<array<number>> expected |
| --- | --- |
| observed features | expected features, dot(class_prob.T, feature_count) |

Output

| struct<array<double>, array<double>> importance lists |
| --- |
| chi2 values and p-values of each feature; each shape = (1, #features) |
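A pure-Python sketch of the per-feature chi-squared test (illustrative only; Hivemall's implementation may differ). The statistic sums (observed − expected)²/expected over classes for each feature; the p-value comes from the chi-squared survival function with df = #classes − 1, for which the `erfc` closed form below is valid only in the binary case (df = 1):

```python
import math

def chi2(observed, expected):
    """Per-feature chi-squared statistics and p-values.

    observed, expected: lists with shape (n_classes, n_features).
    """
    n_classes = len(observed)
    n_features = len(observed[0])
    stats, pvals = [], []
    for j in range(n_features):
        stat = sum((observed[c][j] - expected[c][j]) ** 2 / expected[c][j]
                   for c in range(n_classes))
        stats.append(stat)
        # chi2 survival function; this closed form holds only for df == 1
        pvals.append(math.erfc(math.sqrt(stat / 2.0)))
    return stats, pvals

stats, pvals = chi2([[10.0, 20.0], [20.0, 10.0]],
                    [[15.0, 15.0], [15.0, 15.0]])
print(stats, pvals)
```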

Example - chi2

CREATE TABLE input (
  X array<double>, -- features
  Y array<int> -- binarized label
);

WITH stats AS (
  SELECT
    -- [UDAF] transpose_and_dot(Y::array<number>, X::array<number>)::array<array<double>>
    transpose_and_dot(Y, X) AS observed, -- array<array<double>>, shape = (n_classes, n_features)
    array_sum(X) AS feature_count, -- array<double>, shape = (1, n_features)
    array_avg(Y) AS class_prob -- array<double>, shape = (1, n_classes)
  FROM
    input
),
test AS (
  SELECT
    transpose_and_dot(class_prob, feature_count) AS expected -- array<array<double>>, shape = (n_classes, n_features)
  FROM
    stats
),
chi2 AS (
  SELECT
    -- [UDF] chi2(observed::array<array<double>>, expected::array<array<double>>)::struct<array<double>, array<double>>
    chi2(observed, expected) AS chi2s -- struct<array<double>, array<double>>, each shape = (1, n_features)
  FROM
    test JOIN stats
)
SELECT
  -- [UDF] select_k_best(X::array<number>, importance_list::array<int>, k::int)::array<double>
  select_k_best(X, chi2s.chi2, 2) -- top-2 feature selection based on chi2 score
FROM
  input JOIN chi2;
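The same pipeline can be walked through in pure Python on a toy dataset (variable names mirror the SQL aliases; the data is invented for illustration):

```python
# toy dataset: 4 samples, 2 features, 2 one-hot classes
X = [[3.0, 1.0], [4.0, 1.0], [1.0, 2.0], [0.0, 2.0]]  # features
Y = [[1, 0], [1, 0], [0, 1], [0, 1]]                  # one-hot labels
n_samples, n_classes, n_features = len(X), len(Y[0]), len(X[0])

# stats CTE: observed = transpose_and_dot(Y, X), feature_count, class_prob
observed = [[sum(Y[r][c] * X[r][j] for r in range(n_samples))
             for j in range(n_features)] for c in range(n_classes)]
feature_count = [sum(row[j] for row in X) for j in range(n_features)]
class_prob = [sum(row[c] for row in Y) / n_samples for c in range(n_classes)]

# test CTE: expected = transpose_and_dot(class_prob, feature_count)
expected = [[class_prob[c] * feature_count[j] for j in range(n_features)]
            for c in range(n_classes)]

# chi2 CTE: per-feature chi-squared statistic
stats = [sum((observed[c][j] - expected[c][j]) ** 2 / expected[c][j]
             for c in range(n_classes)) for j in range(n_features)]

# final SELECT: top-1 feature by chi2 score
best = max(range(n_features), key=lambda j: stats[j])
print(observed, expected, stats, best)
```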

SNR

[UDAF] snr(X::array<number>, Y::array<int>)::array<double>

Input

| array X | array Y |
| --- | --- |
| a row of matrix, overall shape = (#samples, #features) | a row of one-hot matrix, overall shape = (#samples, #classes) |

Output

| array<double> importance list |
| --- |
| snr values of each feature, shape = (1, #features) |

Note

  • One-hot vectorizing Y is not strictly necessary, but it keeps snr's interface consistent with chi2's
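A pure-Python sketch of one common SNR feature score for the binary case, |mean₊ − mean₋| / (std₊ + std₋); this definition is an assumption for illustration and may not match Hivemall's exact formula:

```python
import math

def snr(X, Y):
    """Per-feature signal-to-noise ratio for binary one-hot labels:
    |mean_pos - mean_neg| / (std_pos + std_neg)."""
    pos = [x for x, y in zip(X, Y) if y[0] == 1]
    neg = [x for x, y in zip(X, Y) if y[1] == 1]

    def mean(rows, j):
        return sum(r[j] for r in rows) / len(rows)

    def std(rows, j):  # population standard deviation
        m = mean(rows, j)
        return math.sqrt(sum((r[j] - m) ** 2 for r in rows) / len(rows))

    return [abs(mean(pos, j) - mean(neg, j)) / (std(pos, j) + std(neg, j))
            for j in range(len(X[0]))]

# feature 0 separates the classes well; feature 1 barely does
print(snr([[5.0, 1.0], [6.0, 2.0], [1.0, 1.5], [2.0, 1.6]],
          [[1, 0], [1, 0], [0, 1], [0, 1]]))
```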

Example - snr

CREATE TABLE input (
  X array<double>, -- features
  Y array<int> -- binarized label
);

WITH snr AS (
  -- [UDAF] snr(features::array<number>, labels::array<int>)::array<double>
  SELECT snr(X, Y) AS snr FROM input -- aggregated SNR as array<double>, shape = (1, #features)
)
SELECT select_k_best(X, snr, ${k}) FROM input JOIN snr;
