Merged
24 changes: 18 additions & 6 deletions documentation/tmva/UsersGuide/DataPreprocessing.tex
@@ -13,16 +13,17 @@ \section{Data Preprocessing}
decomposition are available for input and target variables, gaussianization, uniformization and decorrelation
discussed below can only be used for input variables.

Apart from five variable transformation methods mentioned above, an unsupervised variable selection method Variance Threshold is also implemented in TMVA. It follows a completely different processing pipeline. It is discussed in detail in section \ref{sec:varianceThreshold}.
Apart from the six variable transformation methods mentioned above, an unsupervised variable selection method, Variance Threshold, is also implemented in TMVA. It follows a completely different processing pipeline and is discussed in detail in Section~\ref{sec:varianceThreshold}.

\subsection{Transforming input variables}
\label{sec:variableTransform}

Currently five preprocessing\index{Discriminating variables!preprocessing of}
Currently six preprocessing\index{Discriminating variables!preprocessing of}
transformations\index{Discriminating variables!transformation of}
are implemented in TMVA:
\begin{itemize}
\item variable normalisation;
\item variable scaling;
\item decorrelation via the square-root of the covariance matrix;
\item decorrelation via a principal component decomposition;
\item transformation of the variables into Uniform distributions (``Uniformization'').
@@ -100,6 +101,17 @@ \subsubsection{Variable normalisation\index{Discriminating variables!normalisati
Normalisation may also render minimisation processes, such as the adjustment of
neural network weights, more effective.

\subsubsection{Variable scaling\index{Discriminating variables!scaling of}}
\label{sec:scaling}

The larger of the absolute values of the minimum and the maximum is determined from the training
events and used to scale the dataset to lie within $[-1,1]$. No offset is added, so the original
sign of the input is maintained. For example, input data with range $[x,y]$, where $|y|>|x|$,
transform to the range $[x/|y|,1]$.
As with normalisation, this may render minimisation processes, such as the adjustment of
neural network weights, more effective, especially when sign-sensitive activation
functions such as the rectified linear unit (ReLU) are used.

\subsubsection{Variable decorrelation\index{Discriminating variables!decorrelation of}}
\label{sec:decorrelation}

@@ -225,10 +237,10 @@ \subsubsection{Booking and chaining transformations for some or all input variab
Variable transformations to be applied prior to the MVA training (and application)
can be defined independently for each MVA method with the booking option
{\tt VarTransform=<type>}, where {\tt <type>} denotes the desired transformation
(or chain of transformations). The available transformation types are normalisation,
(or chain of transformations). The available transformation types are normalisation, scaling,
decorrelation, principal component analysis, uniformisation and Gaussianisation, which are labelled by
\code{Norm}, \code{Deco}, \code{PCA}, \code{Uniform}, \code{Gauss}, respectively, or, equivalently,
by the short-hand notations \code{N}, \code{D}, \code{P}, \code{U} , \code{G}.
\code{Norm}, \code{Scale}, \code{Deco}, \code{PCA}, \code{Uniform}, \code{Gauss}, respectively, or, equivalently,
by the short-hand notations \code{N}, \code{S}, \code{D}, \code{P}, \code{U}, \code{G}.

Transformations can be {\em chained} allowing the consecutive application of all defined
transformations to the variables for each event.
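For illustration, booking a method with the new scaling transformation, alone or chained, could look as follows (a sketch assuming a `Factory` and `DataLoader` have been set up as in earlier sections; the method names are illustrative):

```cpp
// Scale only (short-hand "S"):
factory->BookMethod( dataloader, TMVA::Types::kMLP, "MLP_S",
                     "VarTransform=S" );

// Chained: scale, then decorrelate, then gaussianise:
factory->BookMethod( dataloader, TMVA::Types::kBDT, "BDT_SDG",
                     "VarTransform=S,D,G" );
```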
@@ -326,7 +338,7 @@ \subsection{Variable selection based on variance}
\label{eq:meanecalculation}
\mu_j = \frac{\sum_{i=1}^N w_i x_{j}(i)}{\sum_{i=1}^N w_i}
\eeq
Unlike above five variable transformation method, this Variance Threshold method is implemented in DataLoader class. After loading dataset in the DataLoader object, we can apply this method. It returns a new DataLoader with the selected variables which have variance strictly greater than the threshold value passed by user. Default value of threshold is zero i.e. remove the variables which have same value in all the events.
Unlike the six variable transformation methods above, the Variance Threshold method is implemented in the DataLoader class. After loading a dataset into the DataLoader object, this method can be applied. It returns a new DataLoader containing only the variables whose variance is strictly greater than the threshold value passed by the user. The default threshold is zero, i.e.\ variables that have the same value in all events are removed.

\begin{codeexample}
\begin{tmvacode}
4 changes: 3 additions & 1 deletion tmva/tmva/inc/TMVA/VariableNormalizeTransform.h
@@ -51,7 +51,7 @@ namespace TMVA {

typedef std::vector<Float_t> FloatVector;
typedef std::vector< FloatVector > VectorOfFloatVectors;
VariableNormalizeTransform( DataSetInfo& dsi );
VariableNormalizeTransform( DataSetInfo& dsi, TString strcor="" );
virtual ~VariableNormalizeTransform( void );

void Initialize() override;
@@ -77,6 +77,8 @@ namespace TMVA {

private:

Bool_t fNoOffset;

void CalcNormalizationParams( const std::vector< Event*>& events);

// mutable Event* fTransformedEvent;
51 changes: 41 additions & 10 deletions tmva/tmva/src/VariableNormalizeTransform.cxx
@@ -54,9 +54,13 @@ Linear interpolation class
////////////////////////////////////////////////////////////////////////////////
/// constructor

TMVA::VariableNormalizeTransform::VariableNormalizeTransform( DataSetInfo& dsi )
: VariableTransformBase( dsi, Types::kNormalized, "Norm" )
TMVA::VariableNormalizeTransform::VariableNormalizeTransform( DataSetInfo& dsi, TString strcor )
: VariableTransformBase( dsi, Types::kNormalized, "Norm" ),
fNoOffset(kFALSE)
{
if (strcor == "Scale") {
   fNoOffset = kTRUE;
   SetName("Scale");
}
}

////////////////////////////////////////////////////////////////////////////////
@@ -143,10 +147,16 @@ const TMVA::Event* TMVA::VariableNormalizeTransform::Transform( const TMVA::Even

min = minVector.at(iidx);
max = maxVector.at(iidx);
Float_t offset = min;
Float_t scale = 1.0/(max-min);

Float_t valnorm = (val-offset)*scale * 2 - 1;
Float_t valnorm;
if (!fNoOffset) {
Float_t offset = min;
Float_t scale = 1.0/(max-min);
valnorm = (val-offset)*scale * 2 - 1;
} else {
valnorm = (fabs(max) > fabs(min)) ? val/fabs(max) : val/fabs(min);
}

output.push_back( valnorm );

++iidx;
@@ -188,10 +198,16 @@ const TMVA::Event* TMVA::VariableNormalizeTransform::InverseTransform(const TMVA

min = minVector.at(iidx);
max = maxVector.at(iidx);
Float_t offset = min;
Float_t scale = 1.0/(max-min);

Float_t valnorm = offset+((val+1)/(scale * 2));
Float_t valnorm;
if (!fNoOffset) {
Float_t offset = min;
Float_t scale = 1.0/(max-min);
valnorm = offset+((val+1)/(scale * 2));
} else {
valnorm = (fabs(max) > fabs(min)) ? val*fabs(max) : val*fabs(min);
}

output.push_back( valnorm );

++iidx;
@@ -282,8 +298,15 @@ std::vector<TString>* TMVA::VariableNormalizeTransform::GetTransformationStrings

Char_t type = (*itGet).first;
UInt_t idx = (*itGet).second;
Float_t offset = min;
Float_t scale = 1.0/(max-min);
Float_t offset;
Float_t scale;
if (!fNoOffset) {
offset = min;
scale = 1.0/(max-min);
} else {
offset = 0.;
scale = (fabs(max) > fabs(min)) ? .5/fabs(max) : .5/fabs(min);
}
TString str("");
VariableInfo& varInfo = (type=='v'?fDsi.GetVariableInfo(idx):(type=='t'?fDsi.GetTargetInfo(idx):fDsi.GetSpectatorInfo(idx)));

@@ -329,6 +352,7 @@ void TMVA::VariableNormalizeTransform::AttachXMLTo(void* parent)
{
void* trfxml = gTools().AddChild(parent, "Transform");
gTools().AddAttr(trfxml, "Name", "Normalize");
gTools().AddAttr(trfxml, "UseOffsetOrNot", (fNoOffset?"NoOffset":"UseOffset") );
VariableTransformBase::AttachXMLTo( trfxml );

Int_t numC = (GetNClasses()<= 1)?1:GetNClasses()+1;
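For reference, the attribute written by the added line above would appear in the weight-file XML roughly as follows (a sketch inferred from the code; surrounding content elided):

```xml
<Transform Name="Normalize" UseOffsetOrNot="NoOffset">
  <!-- per-class variable ranges elided -->
</Transform>
```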
@@ -353,6 +377,13 @@

void TMVA::VariableNormalizeTransform::ReadFromXML( void* trfnode )
{
TString UseOffsetOrNot;

gTools().ReadAttr(trfnode, "UseOffsetOrNot", UseOffsetOrNot );
Contributor:

I am fairly certain this breaks the reading of existing TMVA files that have not been written with the UseOffsetOrNot tag.

I currently get messages like:

<FATAL>                          : Trying to read non-existing attribute 'UseOffsetOrNot' from xml node 'Transform'

I am currently trying a simple fix locally and will open a PR once I have validated that works.

Contributor:

Thanks a lot! That is very kind.

if (UseOffsetOrNot == "NoOffset") fNoOffset = kTRUE;
else fNoOffset = kFALSE;

Bool_t newFormat = kFALSE;

void* inpnode = NULL;
4 changes: 4 additions & 0 deletions tmva/tmva/src/VariableTransform.cxx
@@ -158,6 +158,10 @@ void CreateVariableTransforms(const TString& trafoDefinitionIn,
if (variables.Length() == 0) variables = "_V_,_T_";
transformation = new VariableNormalizeTransform(dataInfo);
}
else if (trName == "S" || trName == "Scale" || trName == "ScaleNorm" ) {
if (variables.Length() == 0) variables = "_V_,_T_";
transformation = new VariableNormalizeTransform(dataInfo,"Scale");
}
else
log << kFATAL << Form("Dataset[%s] : ",dataInfo.GetName())
<< "<ProcessOptions> Variable transform '"