Kickstarting R
Massaging data in R

Specifying subsets of data

This is known as extraction in R. Most users will use the extraction operator "[]" to select values from a data object. The flexibility of the extraction operator confuses many new users. To extract a single value from a matrix or array of m dimensions, append brackets containing m integers separated by commas to the name of the object, e.g.:

> xmat[2,3]

Easy. What about a data frame? As the columns of a data frame can contain different modes of data, they may be specified differently:

> xdf$age[3]
> xdf[[2]][3]
> xdf[,2][3]
> xdf[3,2]

In the first line, the name of the column is used along with the index of the desired value. In the second line, the index of the column is given in double brackets, followed by the index of the value in single brackets. In the third, the second column is selected by omitting the row index inside single brackets, again followed by the index of the value in single brackets. Finally, the same system shown for matrices and arrays will work, with the row index first and the column index second. All four forms return the same thing. If the column is a factor, each returns a factor with a single value that still carries all of the column's levels; wrap the expression in as.character() to get just the label.
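The four forms can be compared directly. The following is a minimal sketch, assuming a hypothetical data frame xdf whose second column is age; the column names and values are invented for illustration.

```r
# Hypothetical data frame; names and values are invented for illustration.
xdf <- data.frame(id  = 1:4,
                  age = c(23, 35, 41, 58),
                  sex = factor(c("F", "M", "F", "M")))

# Four ways to reach the third value of the second column (age)
by_name   <- xdf$age[3]     # column name, then element index
by_double <- xdf[[2]][3]    # column index in double brackets
by_column <- xdf[, 2][3]    # row index omitted, then element index
by_matrix <- xdf[3, 2]      # matrix style: row first, then column

# Extracting from a factor column gives a one-element factor
# that still carries every level of the column.
sex3     <- xdf$sex[3]
sex3_chr <- as.character(sex3)   # just the label, as a character string
```

Calling identical(by_name, by_matrix) is a quick way to confirm that two forms agree on your own data.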

To extract more than one value, use a vector rather than a single integer.

> xdf$age[3:6]
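> xdf$age[c(FALSE,FALSE,TRUE,TRUE,TRUE,TRUE)]

The vector can be explicit indices, as shown in the first line, or a vector of logicals that will return the elements corresponding to TRUE values. Note that the numbers 0 and 1 are not treated as logicals inside the brackets: they are read as positions, so zeros are ignored and each 1 selects the first element.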
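The difference between index vectors and logical vectors is easy to check on a small vector. This is only a sketch; the vector age and its values are invented for illustration.

```r
age <- c(23, 35, 41, 58, 62, 19)   # invented values

# Explicit positions
by_index <- age[3:6]

# A logical vector of the same length: TRUE keeps, FALSE drops
by_logical <- age[c(FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)]

# Logical vectors usually come from a test on the data itself
by_test <- age[age > 40]

# Numeric 0/1 values are positions, not logicals:
# zeros are dropped and each 1 picks the first element again.
by_zero_one <- age[c(0, 0, 1, 1, 1, 1)]
```

In practice the logical form is most useful when generated by a comparison, as in the third assignment, rather than typed out by hand.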

Changing the value of a variable

The convention of aligning variables as columns and subjects as rows is well established and will be observed here unless specified otherwise. Say that you have one or more variables that you would like to normalize (i.e. transform so that each variable has a mean of 0 and a variance of 1):

> mydata$myvar<-(mydata$myvar-mean(mydata$myvar))/sd(mydata$myvar)

The original myvar has been replaced by the normalized values. Numeric transformations like this are relatively simple, as is generating categories from continuous measurements:

> mydata$tertiaryed<-ifelse(mydata$yearseduc > 12,"Y","N")

or recategorizations of factors:

> mydata$tertiaryed<-ifelse(mydata$education == "UNI" | mydata$education == "COL","Y","N")

Notice how this time, the new values were stored in a new variable rather than overwriting the previous values. You can either append the new variable to the original data frame, as in the example, or just make it a separate variable. Obviously, if you want to save your data in a compact form for further analysis, appending makes it easier to manage.
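The three kinds of change described in this section can be tried together on a small data frame. This is a minimal sketch: the column names myvar, yearseduc and education follow the text, while the values and the new column names zvar and tertiaryed2 are invented for illustration.

```r
# Hypothetical data; values are invented for illustration.
mydata <- data.frame(myvar     = c(2, 4, 6, 8, 10),
                     yearseduc = c(10, 12, 16, 14, 9),
                     education = c("HS", "UNI", "COL", "HS", "UNI"))

# Normalize: subtract the mean, then divide by the standard deviation.
# Stored in a new column (zvar) so the original is kept.
mydata$zvar <- (mydata$myvar - mean(mydata$myvar)) / sd(mydata$myvar)

# Category from a continuous measurement
mydata$tertiaryed <- ifelse(mydata$yearseduc > 12, "Y", "N")

# Recategorization: %in% tests every element against a set of values,
# which is equivalent to chaining == comparisons with the
# element-wise "or" operator |.
mydata$tertiaryed2 <- ifelse(mydata$education %in% c("UNI", "COL"), "Y", "N")
```

The element-wise operators | and & are the ones to use inside ifelse(); the doubled forms || and && are meant for single values, as in an if() condition. The base function scale() performs the same centering and scaling as the first step, though it returns a matrix rather than a plain vector.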

Dealing with NAs (missing values)

The NA (datum Not Available) is R's way of dealing with missing data. NAs can give you trouble unless you explicitly tell functions to ignore them (many summary functions accept the argument na.rm=TRUE), or pass the data through na.omit(), which drops every case that contains an NA, or na.exclude(), which does the same but allows residuals and predictions from a fitted model to be padded back to the original length. In some cases you may wish to give the NAs a specific value. For example, you may know that only non-smokers did not complete a "How many cigarettes?" item, and want to replace the NAs that were generated with zeros.

> mydata$ncigs[is.na(mydata$ncigs)]<-0

Notice here that an equality test is not appropriate for NAs, because comparing anything with NA yields NA rather than TRUE or FALSE. The is.na() function returns a logical vector that is TRUE for each element of mydata$ncigs that is NA. Those elements are then replaced with zeros. You can also replace NAs with potentially more informative values by using a data imputation function.
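The options for handling NAs can be compared on a short vector. This is a sketch only; the vector ncigs and its values are invented for illustration.

```r
ncigs <- c(20, NA, 5, NA, 0)   # invented values

# An equality test against NA yields NA, so it cannot find missing values
eq_test <- (NA == NA)

# is.na() gives a logical vector marking the missing elements
is_missing <- is.na(ncigs)

# Option 1: tell the function to skip NAs
mean_skip <- mean(ncigs, na.rm = TRUE)

# Option 2: drop the missing cases before analysis
complete <- na.omit(ncigs)

# Option 3: replace the NAs with a known value
ncigs[is.na(ncigs)] <- 0
```

Which option is appropriate depends on why the values are missing; replacing NAs with zero only makes sense when, as in the smoking example, a missing answer really does mean zero.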

For more information, see An Introduction to R: Simple manipulations; numbers and vectors
