Introduction
This book has grown out of over 10 years of programming in R, and constantly struggling to understand the best way of doing things. I would particularly like to thank the tireless contributors to R-help. Too many people have helped me over the years to list individually, but I'd particularly like to thank Luke Tierney, John Chambers, and Brian Ripley for correcting countless misunderstandings of mine and helping me to deeply understand R.
R is still a relatively young language, and the resources to help you understand it are still maturing. In my personal journey to understand R, I've found it particularly helpful to refer to resources that describe how other programming languages work. R has aspects of both functional and object-oriented (OO) programming languages. Learning how these aspects are expressed in R will help you translate your existing knowledge from other programming languages, and help you identify areas where you can improve.
Functional
- First class functions
- Pure functions: a goal, not a prerequisite
- Recursion: no tail call elimination. Slow
- Lazy evaluation: but only of function arguments. No infinite streams
- Untyped
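A minimal sketch of two of these points, first-class functions and lazy evaluation of arguments. The helper names (`compose`, `add_one`, `times_two`, `ignore_second`) are made up for illustration:

```r
# Functions are first-class values: they can be passed as arguments
# and returned from other functions.
compose <- function(f, g) function(x) g(f(x))   # hypothetical helper
add_one <- function(x) x + 1
times_two <- function(x) x * 2

add_then_double <- compose(add_one, times_two)
add_then_double(3)   # (3 + 1) * 2 = 8

# Function arguments are evaluated lazily: an argument that is never
# used is never evaluated, so the stop() call below does not run.
ignore_second <- function(a, b) a
ignore_second(1, stop("never evaluated"))   # 1
```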
OO
- Has three distinct OO frameworks built into base R, with more available in add-on packages. Two of the OO styles are built around generic functions, a style of OO that comes from Lisp.
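To give a flavour of the generic-function style, here is a small sketch using S3, one of the base frameworks built around generic functions. The generic `area` and the classes `circle` and `square` are hypothetical names chosen for the example:

```r
# A generic function dispatches on the class of its first argument.
area <- function(shape, ...) UseMethod("area")   # hypothetical generic

# Methods are ordinary functions named generic.class
area.circle <- function(shape, ...) pi * shape$r^2
area.square <- function(shape, ...) shape$side^2

circle <- structure(list(r = 1), class = "circle")
square <- structure(list(side = 2), class = "square")

area(circle)   # pi
area(square)   # 4
```

Note that the methods belong to the generic function rather than to the objects, which is the main difference from the message-passing style of OO found in many other languages.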
I found the following two books particularly helpful:
- Structure and Interpretation of Computer Programs by Harold Abelson and Gerald Jay Sussman.
- Concepts, Techniques, and Models of Computer Programming by Peter Van Roy and Seif Haridi.
It's also very useful to learn a little about Lisp, because many of the ideas in R are adapted from Lisp, and there are often good descriptions of the basic ideas, even if the implementation differs somewhat. Part of the purpose of this book is to save you from having to consult these original sources, but if you want to learn more, this is a great way to develop a deeper understanding of how R works.
Other websites that helped me to understand smaller pieces of R are:
- Getting Started with Dylan for understanding S4.
- Frames, Environments, and Scope in R and S-PLUS. Section 2 is recommended as a good introduction to the formal vocabulary used in much of the R documentation.
- Lexical scope and statistical computing gives more examples of the power and utility of closures.
Other recommendations for becoming a better programmer:
- The Pragmatic Programmer, by Andrew Hunt and David Thomas.
This book describes the skills that I think you need to be an advanced R developer, producing reproducible code that can be used in a wide variety of circumstances.
After reading this book, you will be:
- Familiar with the fundamentals of R, so that you can represent complex data types and simplify the operations performed on them. You will have a deep understanding of the language, and know how to override default behaviours when necessary.
- Able to produce packages to make your work available to a wider audience, and to program efficiently "in the large", so you spend your time solving new problems, not struggling with old code.
- Comfortable reading and understanding the majority of R code, which is important so you can learn from and critique others' code.
This book is aimed at:
- Experienced programmers from other languages who want to learn about the features of R that make it special.
- Existing package developers who want to make package development less work.
- R developers who want to take their skills to the next level and are ready to release their own code into the wild.
To get the most out of this book you should already be familiar with the basics of R as described in the next section, and you should have started developing your R vocabulary.
The basic data structure in R is the vector, which comes in two basic flavours: atomic vectors and lists. The atomic vector types are logical, integer, double (often called numeric), complex, character and raw. Common vector properties are mode, length and names:
x <- 1:10
mode(x)
length(x)
names(x)
names(x) <- letters[1:10]
x
names(x)
Lists are different from atomic vectors in that they can contain any other type of vector. This makes them recursive, because a list can contain other lists.
x <- list(list(list(list())))
x
str(x)
`str` is one of the most important functions in R: it gives a human-readable description of any R data structure.
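The difference between the two flavours can be sketched as follows. The variable names are arbitrary; the functions used (`typeof`, `is.atomic`, `is.list`, `is.recursive`) are all in base R:

```r
v <- c(1.5, 2, 3)                 # atomic vector: every element is a double
l <- list(1:3, "a", list(TRUE))   # list: elements can be of any type

typeof(v)         # "double"
is.atomic(v)      # TRUE
is.list(l)        # TRUE
is.recursive(l)   # TRUE

# Combining mixed types with c() coerces to a common atomic type ...
typeof(c(1, "a"))              # "character"
# ... whereas a list keeps each element's own type.
sapply(list(1, "a"), typeof)   # "double" "character"

str(l)   # compact description of the nested structure
```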
Vectors can be extended into multiple dimensions. Two-dimensional vectors are called matrices; vectors with more than two dimensions are called arrays. Length generalises to `nrow` and `ncol` for matrices, and `dim` for arrays. Names generalises to `rownames` and `colnames` for matrices, and `dimnames` for arrays.
y <- matrix(1:20, nrow = 4, ncol = 5)
z <- array(1:24, dim = c(2, 3, 4))
nrow(y)
rownames(y)
ncol(y)
colnames(y)
dim(z)
dimnames(z)
All vectors can also have additional arbitrary attributes. These can be thought of as a named list (although the names must be unique), and can be accessed individually with `attr` or all at once with `attributes`. `structure` returns a new object with modified attributes.
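A short sketch of these three functions in use. The attribute names `units` and `note` are arbitrary choices for the example:

```r
x <- 1:3
attr(x, "units") <- "cm"   # set a single attribute
attr(x, "units")           # "cm"
attributes(x)              # all attributes, as a named list

# structure() returns a new object with the given attributes attached
y <- structure(1:3, units = "cm", note = "example")
names(attributes(y))       # "units" "note"

# dim is itself stored as an attribute
m <- matrix(1:6, nrow = 2)
attributes(m)$dim          # 2 3
```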
Another extremely important data structure is the data.frame. A data frame is a named list with a restriction that all elements must be vectors of the same length. Each element in the list represents a column, which means that each column must be one type, but a row may contain values of different types.
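The list-of-columns structure can be checked directly. This is only a sketch with a made-up data frame:

```r
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)

typeof(df)           # "list"
length(df)           # 2: one list element per column
names(df)            # "id" "name"
sapply(df, typeof)   # each column has a single type
df[2, ]              # a row can mix types: integer id, character name

# Columns whose lengths cannot be reconciled are rejected with an error
res <- tryCatch(data.frame(a = 1:3, b = 1:2), error = function(e) "error")
res                  # "error"
```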
- Three subsetting operators.
- Five types of subsetting.
- Extensions to more than 1d.
All basic data structures can be teased apart using the subsetting operators: `[`, `[[` and `$`. It's easiest to explain subsetting for 1d first, and then show how it generalises to higher dimensions. You can subset by five different things:
- blank: return everything
- positive integers: return elements at those positions
- negative integers: return all elements except at those positions
- character vector: return elements with matching names
- logical vector: return all elements where the corresponding logical value is `TRUE`
For higher dimensions these are separated by commas.
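- `[`: the `drop` argument controls whether the result is simplified to a lower dimension.
- `[[` returns a single element.
- `x$y` is equivalent to `x[["y"]]`, except that `$` also allows partial matching of names.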
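The five kinds of index and the three operators can be sketched on small, made-up objects:

```r
x <- c(a = 10, b = 20, c = 30, d = 40)

x[]                              # blank: everything
x[c(1, 3)]                       # positive integers: elements a and c
x[-c(1, 3)]                      # negative integers: everything except a and c
x[c("a", "d")]                   # character vector: elements named a and d
x[c(TRUE, FALSE, TRUE, FALSE)]   # logical vector: elements a and c

# Higher dimensions: one index per dimension, separated by commas
m <- matrix(1:6, nrow = 2)
m[1, ]                           # first row, simplified to a vector: 1 3 5
dim(m[1, , drop = FALSE])        # 1 3: drop = FALSE keeps it a matrix

# [ keeps the container, [[ and $ extract the element
l <- list(y = 1:3, z = "a")
l["y"]                           # a list of length one
l[["y"]]                         # the integer vector 1 2 3
l$y                              # same as l[["y"]]
```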
Functions in R are created by `function`. They consist of an argument list (which can include default values), and a body of code to execute when evaluated. In R, arguments behave as if they are passed by value, so modifying an argument inside a function does not change the caller's copy; the function communicates with the outside world through its return value:
f <- function(x) {
  # Modifies only the function's local copy of x
  x$a <- 2
}
x <- list(a = 1)
f(x)
x$a  # still 1: the caller's x is unchanged
Functions can return only a single value, but this is not a limitation in practice because you can always return a list containing any number of objects.
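For instance, a function can bundle several results into a named list. The function name `summarise_range` is hypothetical:

```r
# Return several values at once by wrapping them in a named list
summarise_range <- function(x) {
  list(min = min(x), max = max(x), n = length(x))
}

res <- summarise_range(c(3, 1, 4, 1, 5))
res$min   # 1
res$max   # 5
res$n     # 5
```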
When calling a function you can specify arguments by position, or by name:
mean(1:10)
mean(x = 1:10)
mean(x = 1:10, trim = 0.05)
Arguments are matched first by exact name, then by prefix matching and finally by position.
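The matching order can be seen with a toy function (the function `f` and its argument names are made up for the example). All three calls below produce the same result:

```r
f <- function(alpha, beta, gamma) {
  c(alpha = alpha, beta = beta, gamma = gamma)
}

f(1, 2, 3)            # all matched by position
f(gamma = 3, 1, 2)    # exact name first, the rest by position
f(g = 3, al = 1, 2)   # unique prefixes match gamma and alpha; 2 goes to beta
```

Prefix matching is convenient interactively, but spelling argument names out in full tends to make code easier to read.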
There is a special argument called `...`. This argument will match any arguments not otherwise specifically matched, and can be used to call other functions. This is useful if you want to collect arguments to call another function, but you don't want to prespecify their possible names.
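A small sketch of both uses, forwarding and collecting. The wrappers `shout` and `count_extras` are hypothetical; `paste`, `toupper` and `list` are base R:

```r
# Forward any extra arguments to paste() without naming them in advance
shout <- function(x, ...) {
  toupper(paste(x, ...))
}
shout("a", "b")              # "A B"
shout("a", "b", sep = "-")   # "A-B"

# list(...) captures the extra arguments so they can be inspected
count_extras <- function(...) length(list(...))
count_extras(1, "two", 3)    # 3
```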
You can define new infix operators with a special syntax:
"%+%" <- function(a, b) paste(a, b)
"new" %+% "string"
And replacement functions to modify arguments "in-place":
"second<-" <- function(x, value) {
x[2] <- value
x
}
x <- 1:10
second(x) <- 5
x
But this is really the same as
x <- "second<-"(x, 5)
and actual modification in place should be considered a performance optimisation, not a fundamental property of the language.
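One way to see that replacement functions follow ordinary value semantics is to modify a second name bound to the same vector. This sketch redefines `second<-` so that it is self-contained:

```r
"second<-" <- function(x, value) {
  x[2] <- value
  x
}

x <- 1:10
y <- x             # y and x start out with the same values
second(y) <- 0L    # rebinds y to the modified vector returned by `second<-`

x[2]   # 2: x is unaffected
y[2]   # 0
```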