Bangda Sun

Practice makes perfect

Assignment operator in R

Discussion about the assignment operator preferences in R: <- and =.

R has three main assignment operators: <-, = and <<- (plus the rightward forms -> and ->>). Can you tell the differences among them? If you can, close this page and go get a drink right now; if not, spend 5 to 10 minutes taking a quick look at this post!

1. Motivation

If you had some experience with other programming languages before you learned R, you might feel uncomfortable with the assignment operator <- in R. For = you press a single key, but for <- you need shift + , followed by -. In RStudio you can use the shortcut Alt + -, but that is still one more keystroke.

I used to be a <- user, then I switched to = since it is more consistent with other languages like Python and Java, and it also saves some time… I kept that style for almost half a year. I have also seen several other people using =: my PhD roommate, my machine learning course’s TA, etc.

Today we are going to do some ‘research’ to figure out the difference between them.

2. Experiments

The assignment statements in R could be fancy:

# assignment
> a <- c(1, 3, 5, 7) -> b
> c = c(1, 3, 5, 7) = d
Error: object 'd' not found
> x <- y <- z <- 3
> x = y = z = 5
> x <- y = 5
Error in x <- y = 5 : could not find function "<-<-"
> x = y <- 5

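The first difference is that the arrow can point either way (<- assigns leftward, -> rightward), whereas = only assigns leftward. Chained <- groups from right to left (and -> from left to right). Mixing the two is sensitive to precedence: <- binds more tightly than =, so x <- y = 5 is read as (x <- y) = 5 and fails, while x = y <- 5 is read as x = (y <- 5) and works.

Checking the R help page (?assignOps), we can see that: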
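The precedence point can be checked without running the assignments. The following is a minimal sketch that only parses the two mixed statements from the experiment and inspects the resulting call; the variable names (works, fails, and so on) are made up for illustration.

```r
# Parse (without evaluating) the two mixed statements from the experiment.
works <- parse(text = "x = y <- 5")[[1]]
fails <- parse(text = "x <- y = 5")[[1]]

# In both cases the outermost call is `=`, because `=` binds most loosely.
outer_works <- as.character(works[[1]])
outer_fails <- as.character(fails[[1]])

# What `=` is asked to assign to: a plain name in one case, a call in the other.
target_works <- works[[2]]   # the symbol x
target_fails <- fails[[2]]   # the call x <- y
is_name_works <- is.name(target_works)
is_call_fails <- is.call(target_fails)

# Because the target is a call to `<-`, R looks for a replacement
# function named `<-<-` when evaluating it, which is the error seen above.
inner_fails <- as.character(target_fails[[1]])
```

Here is_name_works and is_call_fails both come out TRUE, which is the structural reason one statement runs and the other does not.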

The operators <- and = assign into the environment in which they are evaluated. The operator <- can be used anywhere, whereas the operator = is only allowed at the top level (e.g., in the complete expression typed at the command prompt) or as one of the subexpressions in a braced list of expressions.

Hmm.. Could you speak in English???

What is the top level? If you have RStudio, open a new script in the editor, and you can see (Top Level) right above the console when your cursor is not inside the body of any function. This is what they mean. In other words, top level means the statement level, rather than the expression level.

Seems like <- is more widely usable… But I can conclude that there is no difference in the most common case, where a single statement assigns a value to a variable: x = … is the same as x <- ….

Then we start our second experiment.

> df <- data.frame(
+ x = rnorm(10),
+ y <- rnorm(10)
+ )
> str(df)
'data.frame': 10 obs. of 2 variables:
$ x : num 0.336 0.955 0.327 -0.907 -0.105 ...
$ y....rnorm.10.: num 0.35 1.463 -0.21 -0.422 -0.463 ...

Is there something wrong? Everything seems OK, but there is something strange: the second column is named y....rnorm.10. instead of y.

What we intended to do was just create a data frame. Let's list the variables in our global environment:

> ls()
[1] "df" "y"

Oops, did we define y? We did not intend to. But it is identical to the second column of our data frame. Now you see: <- is an assignment operator, so it evaluates rnorm(10) and assigns the result to y, which means we create an object y in the global environment. The call itself receives an unnamed argument, so data.frame() builds the column name from the expression text y <- rnorm(10). In contrast, when we use = to set x as rnorm(10), no x is created in the environment.

We can think of it this way: data.frame() is an R function, and when we call a function we need to specify its arguments. Specifying an argument is different from assignment. Here we just want to specify the columns x and y of the data frame, rather than assign values to variables named x and y.
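The same distinction shows up with any function, not only data.frame(). As a small sketch, assuming median() as a stand-in function and a scratch environment named env (both chosen only for illustration), we can check which form leaves an object behind:

```r
# Use a fresh environment so existing objects do not affect the check.
env <- new.env()

# `=` inside a call names the argument x; nothing is assigned in env.
m1 <- evalq(median(x = c(1, 3, 5, 7)), env)
created_by_equals <- exists("x", envir = env, inherits = FALSE)

# `<-` inside a call assigns x in env first, then passes the value along.
m2 <- evalq(median(x <- c(1, 3, 5, 7)), env)
created_by_arrow <- exists("x", envir = env, inherits = FALSE)
```

Both calls return the same median, but only the second one leaves an x in env (created_by_arrow is TRUE, created_by_equals is FALSE).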

There is another very good example to illustrate this.

Finally, we will discuss <<- which looks “hard” to handle for R rookies. Check the help document again:

The operators <<- and ->> are normally only used in functions, and cause a search to be made through parent environments for an existing definition of the variable being assigned. If such a variable is found (and its binding is not locked) then its value is redefined, otherwise assignment takes place in the global environment.
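To make the "search through parent environments" part concrete, here is a minimal sketch using a closure. The names make_counter, counter and count are hypothetical and not from the help page; the point is only that <<- updates the nearest existing definition in an enclosing environment instead of always writing to the global one.

```r
# make_counter() returns a function that remembers how often it was called.
make_counter <- function() {
  count <- 0                 # lives in the environment of this call
  function() {
    count <<- count + 1      # finds count in the enclosing environment
    count
  }
}

counter <- make_counter()
counter()
counter()
third <- counter()

# The updated value sits in the closure's environment.
stored <- environment(counter)$count
```

After three calls, third and stored are both 3: the count defined inside make_counter() was found and redefined, which is the first branch described in the help text.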

> # naive way to calculate the sum from 1 to n
> CalcSum <- function(n) {
+ tsum <- 0
+ for (i in seq(n)) {
+ tsum <- tsum + i
+ print(tsum)
+ }
+ }
> CalcSum(5)
[1] 1
[1] 3
[1] 6
[1] 10
[1] 15
> ls()
[1] "CalcSum"

Now tsum is a local variable: its scope is only CalcSum itself.

> CalcSum <- function(n) {
+ tsum <<- 0
+ for (i in seq(n)) {
+ tsum <- tsum + i
+ print(tsum)
+ }
+ }
> CalcSum(5)
[1] 1
[1] 3
[1] 6
[1] 10
[1] 15
> ls()
[1] "CalcSum" "tsum"

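See the difference? I think that’s the most intuitive way to understand <<-: we can create a global variable with it. Note that only the first line uses <<-; the loop's tsum <- tsum + i creates and updates a separate local tsum, so the global tsum is still 0 after the call. To update the global copy inside the loop, that line would need <<- as well. Sometimes <<- is helpful for debugging since we can “view” the interior of functions.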
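It is worth checking what the global tsum actually holds after such a call. The sketch below uses two hypothetical variants of CalcSum (calc_sum_local and calc_sum_global, named here only for illustration) that differ in the operator used inside the loop.

```r
# Variant 1: `<<-` only for the initialisation, `<-` inside the loop.
calc_sum_local <- function(n) {
  tsum <<- 0                 # creates (or resets) tsum in the global environment
  for (i in seq_len(n)) {
    tsum <- tsum + i         # creates and then updates a local tsum
  }
  tsum                       # returns the local value
}

# Variant 2: `<<-` everywhere, so the global tsum is updated in the loop.
calc_sum_global <- function(n) {
  tsum <<- 0
  for (i in seq_len(n)) {
    tsum <<- tsum + i
  }
  invisible(NULL)
}

local_result <- calc_sum_local(5)
global_after_local <- get("tsum", envir = globalenv())

calc_sum_global(5)
global_after_global <- get("tsum", envir = globalenv())
```

With the first variant the function returns 15 but the global tsum stays at 0; with the second variant the global tsum ends up at 15. So for debugging, the value you want to inspect has to be written with <<- at the point where it changes.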

3. Summary

Check the R style guide of Google, we can see that:

Assignment

Use <-, not =, for assignment.

GOOD:
x <- 5

BAD:
x = 5

Hmm… I think I'm going to abandon = from now on…

OK, one more step: what if I want to convert my ‘bad’ R scripts to ‘good’ scripts? Just replace = with <-? That does not work well. Instead, we can use the formatR package developed by Yihui Xie.

> library(formatR)
Warning message:
package ‘formatR’ was built under R version 3.3.3
> tidy_source(text = 'x = rnorm(10)', arrow = TRUE)
x <- rnorm(10)
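As for why a plain find-and-replace is not good enough, here is a small sketch in base R. The input string src is a made-up one-liner; the replacement touches every = character, including the one that names a function argument and the two that make up ==.

```r
# A made-up line of code using `=` for assignment, argument naming and comparison.
src <- "df = data.frame(x = rnorm(10)); ok = nrow(df) == 10"

# Blindly replace every "=" with "<-".
naive <- gsub("=", "<-", src, fixed = TRUE)

# The argument name x = ... became an assignment x <- ...,
# and == became <-<-, which is no longer valid R.
still_parses <- !inherits(try(parse(text = naive), silent = TRUE), "try-error")
```

The rewritten text no longer parses (still_parses is FALSE), and even without the == it would silently turn the named argument into an assignment, which is the data frame problem from the second experiment. A tool that works on the parsed code, as tidy_source() does, avoids both issues.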

Update on 08/04/2017: Today I found a very useful and highly related blog post from Yihui; he already discussed this five years ago! Here is the post: R的若干基因及争论 (roughly, “Some of R's genes and the debates around them”). His blog posts are really nice.

4. References