|
| 1 | + |
| 2 | +%\VignetteIndexEntry{Mapping levels of a factor} |
| 3 | +%\VignettePackage{gdata} |
| 4 | +%\VignetteKeywords{levels, factor, manip} |
| 5 | + |
| 6 | +\documentclass[a4paper]{report} |
| 7 | +\usepackage{Rnews} |
| 8 | +\usepackage[round]{natbib} |
| 9 | +\bibliographystyle{abbrvnat} |
| 10 | + |
| 11 | +\usepackage{Sweave} |
| 12 | +\SweaveOpts{strip.white=all, keep.source=TRUE} |
| 13 | + |
| 14 | +\begin{document} |
| 15 | +\SweaveOpts{concordance=TRUE} |
| 16 | + |
| 17 | +\begin{article} |
| 18 | + |
| 19 | +\title{Mapping levels of a factor} |
| 20 | +\subtitle{The \pkg{gdata} package} |
| 21 | +\author{by Gregor Gorjanc} |
| 22 | + |
| 23 | +\maketitle |
| 24 | + |
| 25 | +\section{Introduction} |
| 26 | + |
| 27 | +Factors use the levels attribute to store the mapping between internal |
| 28 | +integer codes and character values, i.e. levels. The first level is mapped |
| 29 | +to internal integer code 1 and so on. Although some users do not like |
| 30 | +factors, they are more efficient in terms of storage than character |
| 31 | +vectors. Additionally, many functions in base \R{} provide additional |
| 32 | +value for factors. Sometimes users need to work with internal integer |
| 33 | +codes and map them back to a factor, especially when interfacing with |
| 34 | +external programs. Mapping information is also of interest if there are |
| 35 | +many factors that should have the same set of levels. This note describes |
| 36 | +the \code{mapLevels} function, a utility function for mapping the levels |
| 37 | +of a factor in the \pkg{gdata}\footnote{from version 2.3.1} package |
| 38 | +\citep{WarnesGdata}. |
| 39 | + |
| 40 | +\section{Description with examples} |
| 41 | + |
| 42 | +Function \code{mapLevels()} is an (S3) generic function and works on |
| 43 | +\code{factor} and \code{character} atomic classes. It also works on |
| 44 | +\code{list} and \code{data.frame} objects whose components are of these |
| 45 | +atomic classes. Function \code{mapLevels} produces a so-called ``map'' with |
| 46 | +names and values. Names are levels, while values can be internal integer |
| 47 | +codes or (possibly other) levels. This will be clarified later on. The |
| 48 | +class of this ``map'' is \code{levelsMap} if \code{x} in \code{mapLevels()} |
| 49 | +was atomic, or \code{listLevelsMap} otherwise, i.e. for \code{list} and |
| 50 | +\code{data.frame} classes. The following example shows the creation and |
| 51 | +printout of such a ``map''. |
| 52 | + |
| 53 | +<<ex01>>= |
| 54 | +library(gdata) |
| 55 | +(fac <- factor(c("B", "A", "Z", "D"))) |
| 56 | +(map <- mapLevels(x=fac)) |
| 57 | +@ |
| 58 | + |
| 59 | +If we have to work with internal integer codes, we can transform a factor |
| 60 | +to integer and still get ``back the original factor'' by using the ``map'' |
| 61 | +as the value in the \code{mapLevels<-} function, as shown below. |
| 62 | +\code{mapLevels<-} is also an (S3) generic function and works on the same |
| 63 | +classes as \code{mapLevels} plus the \code{integer} atomic class. |
| 64 | + |
| 65 | +<<ex02>>= |
| 66 | +(int <- as.integer(fac)) |
| 67 | +mapLevels(x=int) <- map |
| 68 | +int |
| 69 | +identical(fac, int) |
| 70 | +@ |
| 71 | + |
| 72 | +Internally, a ``map'' (\code{levelsMap} class) is a \code{list} (see below), |
| 73 | +but its print method unlists it for ease of inspection. The ``map'' from the |
| 74 | +example has all components of length 1. This is not mandatory, as the |
| 75 | +\code{mapLevels<-} function is only a wrapper around the workhorse function |
| 76 | +\code{levels<-}, and the latter can accept a \code{list} with components of |
| 77 | +various lengths. |
| 78 | + |
| 79 | +<<ex03>>= |
| 80 | +str(map) |
| 81 | +@ |
| 82 | + |
| 83 | +Although not of primary importance, this ``map'' can also be used to remap |
| 84 | +factor levels, as shown below. Components ``later'' in the map take over |
| 85 | +the ``previous'' ones. Since this is not optimal, I would rather recommend |
| 86 | +other approaches for ``remapping'' the levels of a \code{factor}, say |
| 87 | +\code{recode} in the \pkg{car} package \citep{FoxCar}. |
| 88 | + |
| 89 | +<<ex04>>= |
| 90 | +map[[2]] <- as.integer(c(1, 2)) |
| 91 | +map |
| 92 | +int <- as.integer(fac) |
| 93 | +mapLevels(x=int) <- map |
| 94 | +int |
| 95 | +@ |
| 96 | + |
| 97 | +Up to now, the examples showed a ``map'' with internal integer codes for |
| 98 | +values and levels for names. I call this an integer ``map''. On the other |
| 99 | +hand, a character ``map'' uses levels for values and (possibly other) levels |
| 100 | +for names. This feature is a bit odd at first sight, but can be used to |
| 101 | +easily unify levels and internal integer codes across several factors. |
| 102 | +Imagine you have a factor that is for some reason split into two factors |
| 103 | +\code{f1} and \code{f2} and that each factor does not have all levels. This |
| 104 | +is not an uncommon situation. |
| 105 | + |
| 106 | +<<ex05>>= |
| 107 | +(f1 <- factor(c("A", "D", "C"))) |
| 108 | +(f2 <- factor(c("B", "D", "C"))) |
| 109 | +@ |
| 110 | + |
| 111 | +If we work with these factors, we need to be careful as they do not have the |
| 112 | +same set of levels. This can be solved by appropriately specifying the |
| 113 | +\code{levels} argument when creating the factors, i.e. \code{levels=c("A", "B", |
| 114 | + "C", "D")}, or with proper use of the \code{levels<-} function. I say proper |
| 115 | +as it is very tempting to use: |
| 116 | + |
| 117 | +<<ex06>>= |
| 118 | +fTest <- f1 |
| 119 | +levels(fTest) <- c("A", "B", "C", "D") |
| 120 | +fTest |
| 121 | +@ |
| 122 | + |
| 123 | +The above example extends the set of levels, but also changes the level of |
| 124 | +the 2nd and 3rd element in \code{fTest}! Proper use of \code{levels<-} (as |
| 125 | +shown in the \code{levels} help page) would be: |
| 126 | + |
| 127 | +<<ex07>>= |
| 128 | +fTest <- f1 |
| 129 | +levels(fTest) <- list(A="A", B="B", |
| 130 | + C="C", D="D") |
| 131 | +fTest |
| 132 | +@ |
| 133 | + |
| 134 | +Function \code{mapLevels} with a character ``map'' can help us in such |
| 135 | +scenarios to unify levels and internal integer codes across several |
| 136 | +factors. Again, the workhorse behind this process is the \code{levels<-} function |
| 137 | +from base \R{}! Function \code{mapLevels<-} just controls the assignment of an |
| 138 | +(integer or character) ``map'' to \code{x}. Levels in \code{x} that match |
| 139 | +``map'' values (internal integer codes or levels) are changed to ``map'' |
| 140 | +names (possibly other levels), as shown in the \code{levels} help page. Levels |
| 141 | +that do not match are converted to \code{NA}. An integer ``map'' can be |
| 142 | +applied to \code{integer} or \code{factor}, while a character ``map'' can be |
| 143 | +applied to \code{character} or \code{factor}. The result of \code{mapLevels<-} |
| 144 | +is always a \code{factor} with possibly ``remapped'' levels. |
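| | + |
| | +As a minimal sketch of the \code{NA} behaviour described above, the following |
| | +chunk uses only \code{levels<-} from base \R{}, the workhorse named in the |
| | +text. The object \code{g} and the chunk label are hypothetical names |
| | +introduced for illustration only. |
| | + |
| | +<<exNA>>= |
| | +g <- factor(c("A", "D", "C")) |
| | +## the list plays the role of a character "map" without an entry for "D" |
| | +levels(g) <- list(A="A", C="C") |
| | +g            # the element that was "D" has no match and is now NA |
| | +is.na(g) |
| | +@ |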
| 145 | + |
| 146 | +To get one joint character ``map'' for several factors, we need to put the |
| 147 | +factors in a \code{list} or \code{data.frame} and use arguments |
| 148 | +\code{codes=FALSE} and \code{combine=TRUE}. Such a map can then be used to |
| 149 | +unify levels and internal integer codes. |
| 150 | + |
| 151 | +<<ex08>>= |
| 152 | +(bigMap <- mapLevels(x=list(f1, f2), |
| 153 | + codes=FALSE, |
| 154 | + combine=TRUE)) |
| 155 | +mapLevels(f1) <- bigMap |
| 156 | +mapLevels(f2) <- bigMap |
| 157 | +f1 |
| 158 | +f2 |
| 159 | +cbind(as.character(f1), as.integer(f1), |
| 160 | + as.character(f2), as.integer(f2)) |
| 161 | +@ |
| 162 | + |
| 163 | +If we do not specify \code{combine=TRUE} (which is the default behaviour) |
| 164 | +and \code{x} is a \code{list} or \code{data.frame}, \code{mapLevels} |
| 165 | +returns a ``map'' of class \code{listLevelsMap}. This is internally a |
| 166 | +\code{list} of ``maps'' (\code{levelsMap} objects). Both |
| 167 | +\code{listLevelsMap} and \code{levelsMap} objects can be passed to |
| 168 | +\code{mapLevels<-} for \code{list}/\code{data.frame}. Recycling occurs when |
| 169 | +the length of \code{listLevelsMap} is not the same as the number of |
| 170 | +components/columns of a \code{list}/\code{data.frame}. |
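| | + |
| | +The default (non-combined) case might be sketched as follows. This is an |
| | +illustration under the assumption that the \code{list} methods behave as |
| | +described above; the objects \code{g1}, \code{g2}, \code{facs}, \code{llm} |
| | +and \code{ints} and the chunk label are hypothetical names that do not |
| | +appear elsewhere in this note. |
| | + |
| | +<<exList>>= |
| | +library(gdata) |
| | +g1 <- factor(c("A", "D", "C")) |
| | +g2 <- factor(c("B", "D", "C")) |
| | +facs <- list(g1=g1, g2=g2) |
| | +## combine=FALSE is the default: one integer "map" per component |
| | +(llm <- mapLevels(x=facs)) |
| | +is.listLevelsMap(llm) |
| | +## apply the maps component by component to the integer codes |
| | +ints <- lapply(facs, as.integer) |
| | +mapLevels(ints) <- llm |
| | +ints |
| | +@ |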
| 171 | + |
| 172 | +Additional convenience methods are also implemented to ease the work with |
| 173 | +``maps'': |
| 174 | + |
| 175 | +\begin{itemize} |
| 176 | + |
| 177 | +\item \code{is.levelsMap}, \code{is.listLevelsMap}, \code{as.levelsMap} and |
| 178 | + \code{as.listLevelsMap} for testing and coercion of user-defined |
| 179 | + ``maps'', |
| 180 | + |
| 181 | +\item \code{"["} for subsetting, |
| 182 | + |
| 183 | +\item \code{c} for combining \code{levelsMap} or \code{listLevelsMap} |
| 184 | + objects; argument \code{recursive=TRUE} can be used to coerce |
| 185 | + \code{listLevelsMap} to \code{levelsMap}, for example \code{c(llm1, llm2, |
| 186 | + recursive=TRUE)} and |
| 187 | + |
| 188 | +\item \code{unique} and \code{sort} for \code{levelsMap}. |
| 189 | + |
| 190 | +\end{itemize} |
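| | + |
| | +A brief sketch of how some of these methods might be used on character |
| | +``maps'' is given below. The objects \code{m1}, \code{m2} and \code{m} and the |
| | +chunk label are hypothetical names chosen for this illustration. |
| | + |
| | +<<exMethods>>= |
| | +library(gdata) |
| | +m1 <- mapLevels(x=factor(c("A", "D", "C")), codes=FALSE) |
| | +m2 <- mapLevels(x=factor(c("B", "D", "C")), codes=FALSE) |
| | +is.levelsMap(m1) |
| | +(m <- c(m1, m2))   # combine two character "maps" into one |
| | +m[1:2]             # subset the combined "map" |
| | +@ |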
| 191 | + |
| 192 | +\section{Summary} |
| 193 | + |
| 194 | +Functions \code{mapLevels} and \code{mapLevels<-} can help users to map |
| 195 | +internal integer codes to factor levels and unify levels as well as |
| 196 | +internal integer codes among several factors. I welcome any comments or |
| 197 | +suggestions. |
| 198 | + |
| 199 | +% \bibliography{refs} |
| 200 | +\begin{thebibliography}{1} |
| 201 | +\providecommand{\natexlab}[1]{#1} |
| 202 | +\providecommand{\url}[1]{\texttt{#1}} |
| 203 | +\expandafter\ifx\csname urlstyle\endcsname\relax |
| 204 | + \providecommand{\doi}[1]{doi: #1}\else |
| 205 | + \providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi |
| 206 | + |
| 207 | +\bibitem[Fox(2006)]{FoxCar} |
| 208 | +J.~Fox. |
| 209 | +\newblock \emph{car: Companion to Applied Regression}, 2006. |
| 210 | +\newblock URL \url{http://socserv.socsci.mcmaster.ca/jfox/}. |
| 211 | +\newblock R package version 1.1-1. |
| 212 | + |
| 213 | +\bibitem[Warnes(2006)]{WarnesGdata} |
| 214 | +G.~R. Warnes. |
| 215 | +\newblock \emph{gdata: Various R programming tools for data manipulation}, |
| 216 | + 2006. |
| 217 | +\newblock URL |
| 218 | + \url{http://cran.r-project.org/src/contrib/Descriptions/gdata.html}. |
| 219 | +\newblock R package version 2.3.1. Includes R source code and/or documentation |
| 220 | + contributed by Ben Bolker, Gregor Gorjanc and Thomas Lumley. |
| 221 | + |
| 222 | +\end{thebibliography} |
| 223 | + |
| 224 | +\address{Gregor Gorjanc\\ |
| 225 | + University of Ljubljana, Slovenia\\ |
| 226 | +\email{gregor.gorjanc@bfro.uni-lj.si}} |
| 227 | + |
| 228 | +\end{article} |
| 229 | + |
| 230 | +\end{document} |