
Commit 4bbb1b8

ooples and claude authored
Work Session Planning (#411)
* feat: Implement MaxAbsScaler and QuantileTransformer normalizers (#317)

  Implements two new specialized data normalization techniques:

  **MaxAbsScaler (13 points)**
  - Scales features to the [-1, 1] range based on the maximum absolute value
  - Preserves zeros and maintains the sign of values (important for sparse data)
  - Formula: scaled_value = value / max(|values|)
  - Includes comprehensive unit tests covering:
    - Dense and sparse data
    - Positive, negative, and mixed values
    - Edge cases (all zeros, single values)
    - Matrix and Tensor support
    - Float and double type support
    - Round-trip normalization/denormalization

  **QuantileTransformer (21 points)**
  - Non-linear transformation mapping data to uniform or normal distributions
  - Robust against outliers thanks to quantile computation
  - Configurable output distribution (uniform/normal) and number of quantiles
  - Formula: maps values through the empirical CDF to the target distribution
  - Includes comprehensive unit tests covering:
    - Uniform and normal output distributions
    - Skewed data and outliers
    - Column-wise matrix normalization
    - Rank-order preservation
    - Handling of repeated values
    - Float and double type support

  **Architecture Updates**
  - Added MaxAbsScaler and QuantileTransformer to the NormalizationMethod enum
  - Extended NormalizationParameters with:
    - MaxAbs property for MaxAbsScaler
    - Quantiles list for QuantileTransformer
    - OutputDistribution property for the target distribution
  - All implementations follow project patterns:
    - Use INumericOperations<T> for arithmetic
    - Use NumOps.Zero instead of default(T)
    - Generic inheritance pattern
    - Complete XML documentation with "For Beginners" sections
    - Support for Vector, Matrix, and Tensor data structures

  Resolves #317

* fix: replace linear search with binary search and add division-by-zero protection

  Resolves review comments on QuantileTransformer.cs:
  - Lines 406-414: Replaced the O(n) linear search with an O(log n) binary search for
    finding the quantile position. With the default 1000 quantiles, this improves
    performance from 1000 comparisons to ~10 comparisons per value.
  - Lines 431-450: Added division-by-zero protection for the case where consecutive
    quantiles have equal values (occurs with duplicate values in the data). Returns
    the midpoint percentile when upperValue == lowerValue.

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude <[email protected]>

* fix: correct INumericOperations method names in QuantileTransformer

  Fixes pre-existing build errors in QuantileTransformer.cs by correcting method
  names to match the actual INumericOperations interface:
  - Replace NumOps.Compare (doesn't exist) with an inline comparator using
    LessThan/GreaterThan for Array.Sort calls (lines 106-111, 144-149, 213-218, 267-272)
  - Replace NumOps.LessThanOrEqual with NumOps.LessThanOrEquals (note the 's')
  - Replace NumOps.GreaterThanOrEqual with NumOps.GreaterThanOrEquals (note the 's')
  - Replace NumOps.ToDouble (doesn't exist) with Convert.ToDouble((object)value!) for
    T-to-double conversions (lines 508, 530, 622)

  These errors were blocking the build and are now fixed, allowing the
  QuantileTransformer to compile successfully.

* refactor: fix 9 unresolved review comments in PR #411

  Resolves all remaining unresolved review comments.

  Test file improvements (7 fixes):
  - MaxAbsScalerTests.cs:223,260: Replace unused `normalized` with `_` discard
  - QuantileTransformerTests.cs:113,282,296,336,354: Replace unused variables with `_` discard
  - Remove the now-redundant test for an invalid outputDistribution (enforced by enum type safety)

  Source file improvements (2 fixes):
  - QuantileTransformer.cs:473: Simplify if/else to a ternary operator for the output distribution
  - QuantileTransformer.cs:481: Simplify if/else to a ternary operator for the percentile calculation

  Note: one test case actually uses `normalized`, so it was not discarded (MaxAbsScalerTests line 109).

* feat: replace string outputDistribution with type-safe enum

  Improves code quality and production readiness by replacing the string-based
  outputDistribution parameter with a type-safe enum.

  Changes:
  - Created the OutputDistribution enum with Uniform and Normal values
  - Changed NormalizationParameters.OutputDistribution from string to the enum
  - Updated the QuantileTransformer constructor to accept the enum instead of a string
  - Updated all string comparisons to use enum comparisons
  - Removed redundant validation code (the enum provides compile-time type safety)
  - Updated all test files to use OutputDistribution.Uniform/Normal

  Benefits:
  - Compile-time type safety (prevents typos like "unifrom")
  - IntelliSense support for valid values
  - Better refactoring support
  - Self-documenting code
  - No runtime string validation needed

* fix: handle degenerate distributions and tensor constructors

  - Add a degenerate-distribution check in QuantileTransformer for when all quantiles are identical
  - Fix Tensor constructor calls in tests to use Vector instead of double[]
  - Map constant features to the midpoint (0.5) to avoid skewing them to the extreme tails

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude <[email protected]>

* refactor: move OutputDistribution enum to Enums folder

  - Move OutputDistribution.cs from src/Normalizers to src/Enums
  - Update the namespace from AiDotNet.Normalizers to AiDotNet.Enums
  - Add `using AiDotNet.Enums` to NormalizationParameters.cs and QuantileTransformer.cs
  - Update property type references to use the unqualified OutputDistribution

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude <[email protected]>

---------

Co-authored-by: Claude <[email protected]>
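The binary-search lookup and the division-by-zero guard described in the commits above can be sketched as follows. This is a hypothetical Python analogue for illustration only; the actual implementation is in C# against `INumericOperations<T>`, and the function name `value_to_percentile` is invented here:

```python
from bisect import bisect_left

def value_to_percentile(value, quantiles):
    """Map a value to its percentile via binary search over sorted quantiles.

    Mirrors the O(log n) lookup described in the commit: with the default
    1000 quantiles, roughly 10 comparisons per value instead of 1000.
    """
    n = len(quantiles)
    # Clamp values outside the observed range.
    if value <= quantiles[0]:
        return 0.0
    if value >= quantiles[-1]:
        return 1.0
    # Binary search for the first quantile >= value.
    i = bisect_left(quantiles, value)
    lower, upper = quantiles[i - 1], quantiles[i]
    lower_pct = (i - 1) / (n - 1)
    upper_pct = i / (n - 1)
    # Division-by-zero guard from the commit: if the bracketing quantiles
    # are equal (duplicate values in the data), return the midpoint percentile.
    if upper == lower:
        return (lower_pct + upper_pct) / 2
    # Linear interpolation between the two bracketing quantiles.
    frac = (value - lower) / (upper - lower)
    return lower_pct + frac * (upper_pct - lower_pct)

print(value_to_percentile(5.0, [0.0, 10.0]))  # 0.5
```

The clamping at the ends is one plausible convention; the library may handle out-of-range values differently.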
1 parent f99b0d2 commit 4bbb1b8

File tree

7 files changed: +1887 -13 lines changed

src/Enums/NormalizationMethod.cs

Lines changed: 34 additions & 5 deletions
```diff
@@ -148,13 +148,42 @@ public enum NormalizationMethod
     /// </summary>
     /// <remarks>
     /// <para>
-    /// <b>For Beginners:</b> Robust Scaling is designed to handle data with outliers (extreme values that don't
-    /// follow the pattern). Instead of using the minimum and maximum values (which can be skewed by outliers),
-    /// it uses the median and quartiles. It's like saying "ignore the extremely tall and short people when
-    /// standardizing heights." This is useful when your data contains unusual values that shouldn't influence
+    /// <b>For Beginners:</b> Robust Scaling is designed to handle data with outliers (extreme values that don't
+    /// follow the pattern). Instead of using the minimum and maximum values (which can be skewed by outliers),
+    /// it uses the median and quartiles. It's like saying "ignore the extremely tall and short people when
+    /// standardizing heights." This is useful when your data contains unusual values that shouldn't influence
     /// the overall scaling.
     /// Formula: (x - median) / (Q3 - Q1) where Q1 is the 25th percentile and Q3 is the 75th percentile
     /// </para>
     /// </remarks>
-    RobustScaling
+    RobustScaling,
+
+    /// <summary>
+    /// Scales features to the range [-1, 1] by dividing by the maximum absolute value.
+    /// </summary>
+    /// <remarks>
+    /// <para>
+    /// <b>For Beginners:</b> MaxAbsScaler is like MinMax scaling, but instead of using both the minimum and
+    /// maximum values, it only uses the maximum absolute value (the largest value ignoring signs). This
+    /// method preserves zeros and the sign of values, which is important for sparse data where many values
+    /// are zero. For example, if your largest value is 100 and smallest is -50, everything gets divided by 100,
+    /// so results fall between -1 and 1.
+    /// Formula: x / max(|x|)
+    /// </para>
+    /// </remarks>
+    MaxAbsScaler,
+
+    /// <summary>
+    /// Transforms features to follow a uniform or normal distribution using quantiles.
+    /// </summary>
+    /// <remarks>
+    /// <para>
+    /// <b>For Beginners:</b> QuantileTransformer is a powerful technique that changes your data's distribution
+    /// to be either uniform (flat, where all ranges have equal numbers of values) or normal (bell-shaped).
+    /// It works by ranking values and mapping them to a target distribution. This is extremely effective at
+    /// handling outliers because it spreads them out across the distribution. Think of it as redistributing
+    /// your data so that it matches a desired pattern, regardless of the original distribution.
+    /// </para>
+    /// </remarks>
+    QuantileTransformer
 }
```
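As a quick numeric illustration of the MaxAbsScaler formula documented in the diff above (a hypothetical Python sketch, not the library's C# API):

```python
def max_abs_scale(values):
    """Scale values to [-1, 1] by dividing by the maximum absolute value.

    Preserves zeros and signs, matching the MaxAbsScaler documentation:
    scaled = x / max(|x|).
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)  # an all-zero column stays unchanged
    return [v / max_abs for v in values]

print(max_abs_scale([100.0, -50.0, 0.0, 25.0]))  # [1.0, -0.5, 0.0, 0.25]
```

Note how the zero stays exactly zero and the negative value keeps its sign, which is the property the doc comment calls out for sparse data.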

src/Enums/OutputDistribution.cs

Lines changed: 52 additions & 0 deletions
```diff
@@ -0,0 +1,52 @@
+namespace AiDotNet.Enums;
+
+/// <summary>
+/// Specifies the target distribution for quantile transformation.
+/// </summary>
+/// <remarks>
+/// <para>
+/// This enum defines the available output distributions for the QuantileTransformer.
+/// Each distribution has different characteristics and use cases in machine learning.
+/// </para>
+/// <para><b>For Beginners:</b> Think of this as choosing the shape you want your data to take:
+/// - Uniform: Spreads values evenly across the range (like a flat distribution)
+/// - Normal: Creates a bell curve pattern (most values in the middle, fewer at extremes)
+/// </para>
+/// </remarks>
+public enum OutputDistribution
+{
+    /// <summary>
+    /// Maps data to a uniform distribution where all values are equally likely.
+    /// Values are spread evenly across the [0, 1] range.
+    /// </summary>
+    /// <remarks>
+    /// <para><b>For Beginners:</b> This makes your data look like a flat line - every value
+    /// is equally common. Good for algorithms that don't assume any particular distribution.
+    /// </para>
+    /// <para>
+    /// Use this when:
+    /// - You want to reduce the impact of outliers
+    /// - Your algorithm works best with uniformly distributed features
+    /// - You want a simple, predictable transformation
+    /// </para>
+    /// </remarks>
+    Uniform,
+
+    /// <summary>
+    /// Maps data to a normal (Gaussian) distribution with mean 0 and standard deviation 1.
+    /// Values follow a bell curve pattern.
+    /// </summary>
+    /// <remarks>
+    /// <para><b>For Beginners:</b> This makes your data look like a bell curve - most values
+    /// are near the middle, with fewer values at the extremes. Many statistical methods
+    /// assume this distribution.
+    /// </para>
+    /// <para>
+    /// Use this when:
+    /// - Your algorithm assumes normally distributed features (e.g., linear regression, LDA)
+    /// - You want to reduce the impact of outliers while maintaining statistical properties
+    /// - You need compatibility with methods that expect Gaussian distributions
+    /// </para>
+    /// </remarks>
+    Normal
+}
```

src/Models/NormalizationParameters.cs

Lines changed: 103 additions & 8 deletions
```diff
@@ -1,3 +1,5 @@
+using AiDotNet.Enums;
+
 namespace AiDotNet.Models;
 
 /// <summary>
@@ -70,8 +72,10 @@ public NormalizationParameters(INumericOperations<T>? numOps = null)
 {
     _numOps = numOps ?? MathHelper.GetNumericOperations<T>();
     Method = NormalizationMethod.None;
-    Min = Max = Mean = StdDev = Scale = Shift = Median = IQR = P = _numOps.Zero;
+    Min = Max = Mean = StdDev = Scale = Shift = Median = IQR = P = MaxAbs = _numOps.Zero;
     Bins = [];
+    Quantiles = [];
+    OutputDistribution = OutputDistribution.Uniform;
 }
 
 /// <summary>
@@ -372,29 +376,120 @@ public NormalizationParameters(INumericOperations<T>? numOps = null)
 /// <value>The power parameter, used for power transformations.</value>
 /// <remarks>
 /// <para>
-/// This property stores a power parameter that can be used for certain normalization methods, such as power transformations
-/// like Box-Cox or Yeo-Johnson transformations. These transformations can help make skewed data more normally distributed
-/// by raising values to a certain power. The optimal power parameter is typically determined during training to maximize
+/// This property stores a power parameter that can be used for certain normalization methods, such as power transformations
+/// like Box-Cox or Yeo-Johnson transformations. These transformations can help make skewed data more normally distributed
+/// by raising values to a certain power. The optimal power parameter is typically determined during training to maximize
 /// the normality of the transformed data.
 /// </para>
 /// <para><b>For Beginners:</b> This stores a power value used for certain advanced normalization techniques.
-///
+///
 /// The power parameter:
 /// - Is used for power transformations like Box-Cox or Yeo-Johnson
 /// - Helps make skewed data more normally distributed
 /// - Can be optimized to find the best transformation
-///
+///
 /// For example, a value of 0.5 would correspond to a square root transformation,
 /// which can help normalize right-skewed data.
-///
+///
 /// This parameter is useful when:
 /// - Your data has a skewed distribution
 /// - You want to make the data more normally distributed
 /// - Standard normalization methods don't work well
-///
+///
 /// Power transformations are more advanced techniques but can significantly
 /// improve model performance with certain types of data.
 /// </para>
 /// </remarks>
 public T P { get; set; }
+
+/// <summary>
+/// Gets or sets the maximum absolute value observed in the data.
+/// </summary>
+/// <value>The maximum absolute value, used for MaxAbsScaler normalization.</value>
+/// <remarks>
+/// <para>
+/// This property stores the maximum absolute value observed in the data for the feature or target variable.
+/// It is used for MaxAbsScaler normalization, which scales data to the range [-1, 1] by dividing each value
+/// by the maximum absolute value. This method preserves the sign of values and maintains zeros (which is
+/// important for sparse data). The maximum absolute value is typically calculated during training based on
+/// the training data.
+/// </para>
+/// <para><b>For Beginners:</b> This stores the largest absolute value (ignoring the sign) in your data.
+///
+/// The maximum absolute value:
+/// - Is used for MaxAbsScaler normalization
+/// - Represents the farthest distance from zero in either direction
+/// - Is used as a divisor to scale values to the range [-1, 1]
+///
+/// For example, if your data ranges from -75 to 100, the maximum absolute value would be 100,
+/// and all values would be divided by 100 to scale them to [-0.75, 1.0].
+///
+/// This parameter is important because:
+/// - It preserves the sign of values (positive stays positive, negative stays negative)
+/// - It keeps zero values as zero (important for sparse data)
+/// - It's simpler than min-max scaling but still effective
+/// </para>
+/// </remarks>
+public T MaxAbs { get; set; }
+
+/// <summary>
+/// Gets or sets the quantile values used for quantile transformation.
+/// </summary>
+/// <value>A list of quantile values representing the empirical distribution.</value>
+/// <remarks>
+/// <para>
+/// This property stores the quantile values calculated from the training data for QuantileTransformer.
+/// These quantiles represent the empirical cumulative distribution function (CDF) of the data and are
+/// used to map values to either a uniform or normal distribution. The number of quantiles determines
+/// the granularity of the transformation.
+/// </para>
+/// <para><b>For Beginners:</b> This stores the distribution pattern learned from your training data.
+///
+/// The quantiles list:
+/// - Is used for QuantileTransformer normalization
+/// - Contains values that divide your data into equal-sized groups
+/// - Helps map your data to a target distribution (uniform or normal)
+///
+/// For example, with 100 quantiles:
+/// - The 25th quantile is the value below which 25% of the data falls
+/// - The 50th quantile is the median
+/// - The 75th quantile is the value below which 75% of the data falls
+///
+/// This approach is powerful because:
+/// - It can handle any input distribution
+/// - It's very robust to outliers
+/// - It can transform data to match a desired distribution shape
+/// </para>
+/// </remarks>
+public List<T> Quantiles { get; set; }
+
+/// <summary>
+/// Gets or sets the target output distribution for quantile transformation.
+/// </summary>
+/// <value>An OutputDistribution enum indicating either Uniform or Normal distribution.</value>
+/// <remarks>
+/// <para>
+/// This property specifies whether the QuantileTransformer should map data to a uniform distribution
+/// (where all ranges have equal probability) or a normal distribution (bell-shaped curve). This setting
+/// determines how the quantiles are mapped during transformation.
+/// </para>
+/// <para><b>For Beginners:</b> This specifies what shape you want your data to have after transformation.
+///
+/// The output distribution:
+/// - Can be Uniform (flat distribution) or Normal (bell curve)
+/// - Affects how values are redistributed
+/// - Depends on what your machine learning algorithm expects
+///
+/// Uniform distribution:
+/// - All value ranges have equal numbers of data points
+/// - Values are spread evenly across the range
+/// - Good for algorithms that don't assume any particular distribution
+///
+/// Normal distribution:
+/// - Creates a bell-shaped curve
+/// - Most values cluster around the center
+/// - Good for algorithms that work best with normally-distributed data
+/// </para>
+/// </remarks>
+public OutputDistribution OutputDistribution { get; set; }
 }
```
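Fitting the values stored in these parameters (MaxAbs for MaxAbsScaler, the Quantiles list for QuantileTransformer) can be sketched in Python. This is a hypothetical analogue for illustration: the dictionary keys mirror the C# property names, but `fit_normalization_parameters` and the interpolation scheme are assumptions of this sketch, not the library's API:

```python
def fit_normalization_parameters(values, n_quantiles=5):
    """Compute the MaxAbs value and an evenly spaced quantile list from data."""
    sorted_vals = sorted(values)
    n = len(sorted_vals)
    quantiles = []
    for k in range(n_quantiles):
        # Position of the k-th quantile in the sorted data, with
        # linear interpolation between neighboring observations.
        pos = k * (n - 1) / (n_quantiles - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        quantiles.append(sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac)
    return {
        "MaxAbs": max(abs(v) for v in values),   # divisor for MaxAbsScaler
        "Quantiles": quantiles,                  # empirical CDF sample points
    }

params = fit_normalization_parameters([3.0, -8.0, 1.0, 5.0, 7.0])
print(params["MaxAbs"])     # 8.0
print(params["Quantiles"])  # [-8.0, 1.0, 3.0, 5.0, 7.0]
```

With five quantiles over five observations, the quantile list is simply the sorted data; with the default of 1000 quantiles over a large dataset, it becomes a compact summary of the empirical distribution.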
