The quirky yet venerable programming language R has gained a new lease on life through the resurgence of data science. Based on my experience as both a user and a package developer, I critically evaluate the R language and ecosystem. This essay is colored by my ongoing experience of writing program analysis tools for R. Still, I think many of the points raised are relevant to working R users.
What is good or bad in a programming language is the subject of endless controversy. Needless to say, what follows are my opinions and you may disagree with them. So let’s don our flame suits and get down to business.
At the heart of R is a simple, elegant, functional programming language. It also has a flourishing community of data scientists and engineers. What follows are the greatest strengths of the R computing environment, as I see them.
- Statistics ecosystem
The success of a programming language depends as much on its ecosystem of third-party packages as on the language itself. R is the dominant programming language among professional statisticians. An enormous variety of statistical methods, classical and modern, is available through R packages. If you’re looking for an obscure (or not so obscure) statistical method, there’s a good chance it’s implemented in R, and only in R. Not long ago, I was dismayed to discover that there is no mature, properly maintained Python implementation of k-medoids clustering, a well-known variant of k-means clustering [1]. Meanwhile, k-medoids has been available through the R package cluster for nearly two decades. This situation is fairly common, despite the rapid growth of statistics packages for Python and Julia.
The ascendancy of R does not, however, extend to machine learning, the “other half” of data science. Notably, the mania surrounding deep learning is mostly disconnected from the R community. Python has become dominant in this area instead. Computing environments like R, Python, and MATLAB have all carved out their own niches. I expect this linguistic pluralism to remain a hallmark of the data science community for the foreseeable future, with R maintaining its niche among statisticians.
- Package management
In contrast to Python, the R language has excellent package management. The R package format standardizes the distribution of just about any content you might want to include in a package: source code, API documentation, narrative documentation (called “vignettes”), unit tests, example datasets, and internal package data. Packages on CRAN, the official package repository, are required to pass a suite of automated and manual tests. The tests promote good coding style and catch common errors. All this attention to detail makes for an unusually pleasant package management experience. The CRAN maintainers deserve our gratitude for their careful and professional stewardship of the R package infrastructure.
- Functional programming
Although it’s not obvious from the syntax, R is essentially a functional programming language. The term “functional programming” evokes a variety of language features and conventions. R functions tend to behave like mathematical functions: they don’t mutate data or create side-effects. Mapping and reducing with anonymous functions is encouraged; explicit looping with while loops is discouraged. Most R objects have copy-on-write semantics [3]. That is, any operation that mutates an object first creates a copy of that object. This convention guarantees that other variables that may be pointing to the object are unaffected by the mutation. In short, R is a functional programming language masquerading as an imperative language.
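To make the copy-on-write convention concrete, here is a minimal sketch in base R. The helper name double_first is hypothetical, chosen only for illustration.

```r
# Modifying one binding leaves the other untouched.
x <- c(1, 2, 3)
y <- x            # x and y may share memory until one of them is modified
y[1] <- 100       # the modification applies to y only

# A function works on its own copy, so the caller's vector is unchanged.
# (double_first is a hypothetical helper, not part of base R.)
double_first <- function(v) {
  v[1] <- v[1] * 2
  v
}
z <- double_first(x)

# Mapping and reducing with anonymous functions instead of explicit loops.
squares <- vapply(x, function(v) v^2, numeric(1))
total <- Reduce(function(acc, v) acc + v, squares, 0)
```

After this runs, x is still 1 2 3, even though both y and the function body "modified" it.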
I like the functional style, in R specifically and in programming languages generally, because programs are easier to reason about (by both humans and machines) when objects are immutable and side-effects are controlled. Critics complain that R’s use of copy-on-write semantics can cause excessive memory usage, especially with large datasets. While memory usage can indeed be a problem, the correct solution is not to give up the benefits of functional style in all situations, but to adopt mutable data structures only when demanded by circumstances. That is exactly what R allows, mainly through external packages. For instance, the data.table package provides a mutable variant of the standard data frame. My position on mutability is consistent with my general attitude that human cognitive ease should take precedence over machine efficiency, except when data size or hardware limitations make that impossible.
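As a rough illustration of opting into mutability only where needed, the sketch below uses a base R environment rather than data.table, since environments are the built-in object with reference semantics. The counter functions are hypothetical names invented for this example.

```r
# Hypothetical mutable counter backed by an environment.
new_counter <- function() {
  self <- new.env()
  self$count <- 0
  self
}
increment <- function(counter) {
  counter$count <- counter$count + 1   # modifies the shared environment
  invisible(counter)
}

counter <- new_counter()
alias <- counter        # both names refer to the same environment
increment(counter)
increment(alias)

# The same operation on an ordinary list changes only a local copy.
increment_list <- function(lst) {
  lst$count <- lst$count + 1
  lst
}
plain <- list(count = 0)
increment_list(plain)   # the returned copy is discarded
```

Here counter$count ends up at 2, because both increments hit the same object, while plain$count stays at 0.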
Despite its strengths, the R language has a few serious design defects. The problems raised in this section are fundamental. They cannot be repaired without significantly altering the language, breaking compatibility with thousands of packages accumulated over decades. These problems are here to stay.
- Informal type system
In my view, the most serious defect of the R language is its informal type system. The only types officially recognized by the R interpreter are atomic vectors (integer, etc.), lists, functions, and environments [4]. All other types, from data frames to statistical models to plots, exist only by convention. According to R’s primary system for object-oriented programming, known as S3, the class of an object is determined by a special string-valued attribute called class. Generic functions are implemented by dispatching on the value of this attribute.
Informal types are fragile. Because S3 classes are merely strings, they are not attached to packages and hence exist in a global namespace. Any two packages that happen to use the same class names will conflict. “Fake” instances of a class can be created by attaching the class name to any object one pleases. Introspection is also severely limited. In general, it is impossible to determine from what package an object originates and very cumbersome to determine whether a function is a generic or a method.
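The fragility is easy to demonstrate. In the sketch below, the class name temperature, its constructor, and the generic describe are all hypothetical, invented for illustration.

```r
# A hypothetical S3 class with a constructor that validates its input.
new_temperature <- function(celsius) {
  stopifnot(is.numeric(celsius))
  structure(list(celsius = celsius), class = "temperature")
}

# A hypothetical generic and its method for the class.
describe <- function(x, ...) UseMethod("describe")
describe.temperature <- function(x, ...) {
  paste0(format(x$celsius), " degrees Celsius")
}

genuine <- new_temperature(21.5)
describe(genuine)

# Nothing stops us from bypassing the constructor entirely.
fake <- "not a temperature at all"
class(fake) <- "temperature"
inherits(fake, "temperature")   # the interpreter takes the string at face value

# Dispatch succeeds, and the method then fails on the malformed object.
result <- tryCatch(describe(fake), error = function(e) "method failed")
```

The interpreter never checks that fake has the structure the method expects; the error surfaces only when the method body tries to use it.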
At least three class systems have been proposed as alternatives to S3, namely S4, reference classes (R5), and R6. They are all more formal than S3, among other differences. As a practical matter they are not very important, because, outside of Bioconductor, most packages simply use S3. Still, the mere existence of all these alternative class systems is incredible. I have never encountered another programming language with so many incompatible ways to define a class. This proliferation would not have occurred had R’s original type system been properly designed.
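For contrast, here is a minimal sketch of what “more formal” means in S4, using setClass and new from the methods package that ships with R. The class name Temperature is hypothetical.

```r
library(methods)

# A hypothetical S4 class with one typed slot.
setClass("Temperature", representation(celsius = "numeric"))

ok <- new("Temperature", celsius = 21.5)
ok@celsius

# Slot types are checked when the object is created.
bad <- tryCatch(
  new("Temperature", celsius = "warm"),
  error = function(e) "rejected"
)

# The class of an S4 instance also records where the class was defined.
attr(class(ok), "package")
```

Unlike the S3 case, an object with the wrong slot type cannot be constructed through new in the first place.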
- Lack of modules
The R language makes no provision for modules, which may be loosely defined as independent, reusable units of code, organized at a coarse level of granularity. Generally speaking, a module consists of numerous types and functions, defines a formal public interface, and has a coherent theme or purpose. Experienced developers know that modules are essential for organizing large programs. Their absence from R makes developing large R packages more difficult.
As a solution, R devotees advise making packages as small and as loosely coupled as possible. Each package effectively serves as a module, whose interface is defined by the package’s exported names. Thus John Chambers identifies functions with “small-scale” programming and packages with “medium-scale” programming [5], where it would be more typical to identify functions as small-scale, modules as medium-scale, and complete packages and systems as large-scale.
The advice to design small packages is fine as far as it goes, but it involves the conceptual mistake of conflating packages with modules. Packages are about distributing code to other users and developers; modules are about encapsulating code to hide implementation details and create intelligible layers of abstraction. Both are indispensable and neither is a substitute for the other. Sometimes large packages are preferable to small ones. Even in small packages, modules can be useful for controlling namespaces and organizing functions.
As I’ve argued, R has a great package system. It deserves an equally great module system to accompany it.
- Nonstandard evaluation
Although it can be abused, metaprogramming is a powerful technique, enabling developers to create domain-specific languages without also having to create their own compilers. I generally support metaprogramming as a language feature, provided it is implemented safely and correctly. Hygienic macros with a well-defined expansion phase, as found in Lisp, Scheme, and Julia, are the gold standard. However, when implemented badly, as in C macros and C++ templates, metaprogramming can be disastrous. Metaprogramming in R is far better than in C/C++, but still suffers from significant design defects.
Metaprogramming in R happens through an unusual mechanism called nonstandard evaluation (NSE), whereby functions receive arguments not as values, but as promises to compute a value. Instead of immediately evaluating the promise, a function may choose to first modify the promise expression or evaluation environment. It may even choose not to evaluate the expression at all, manipulating it for some other purpose. In effect, every function in R is, or can be, a macro, which expands dynamically and at run-time.
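The three behaviors just described can be sketched with base R’s substitute and eval. The function names show_expr, eval_in, and ignore are hypothetical, chosen for illustration.

```r
# 1. Capture the argument's expression instead of its value.
show_expr <- function(arg) {
  expr <- substitute(arg)    # the unevaluated expression behind the promise
  deparse(expr)
}
captured <- show_expr(a + b * 2)   # works even if a and b are undefined

# 2. Evaluate the expression in a different environment: here, a data frame.
eval_in <- function(data, arg) {
  eval(substitute(arg), data, parent.frame())
}
df <- data.frame(height = c(1.5, 1.8), weight = c(60, 80))
bmi <- eval_in(df, weight / height^2)   # height and weight are columns of df

# 3. Never evaluate the argument at all.
ignore <- function(arg) "argument never evaluated"
untouched <- ignore(stop("this error is never raised"))
```

From the call site alone, nothing distinguishes these calls from ordinary function calls, which is part of what makes such code hard to reason about.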
This form of metaprogramming makes R programs difficult to reason about. Function calls using NSE cannot be translated into standard R code, unlike macro calls in languages with a compile-time macro expansion phase. Moreover, to the best of my knowledge, it is impossible to even determine whether an R function uses NSE. Thus, a priori, no R function can be assumed to be referentially transparent or to obey the standard semantics of the R programming language. Nonstandard evaluation is the Wild Wild West of metaprogramming.
R is a famously quirky language, by turns charming and frustrating. Here I’ll mention some aspects of R I find downright ugly. In contrast to the previous section, the issues raised here are not fundamental. They can be fixed by dedicated efforts from the R community. And, happily, many of them are being fixed.
- Legacy restrictions
R is an old programming language. The first version of the S language, the precursor to R, was created at Bell Labs in 1976 by a group of five people, among whom Rick Becker and John Chambers played prominent roles [6]. At this time, interactive computing environments for statistics were scarce or nonexistent, with most statistical computing at Bell Labs done directly in Fortran. The first modern version of S, similar to the later S-PLUS and to R itself, appeared in 1988. Ross Ihaka and Robert Gentleman created the R language in 1993. Thus, if we count from the first release of S, then R is 8 years older than MATLAB, 14 years older than Python, 24 years older than the first modern version of Python (i.e. Python 2), and 36 years older than Julia.
In many respects, it is easy to forget that R is such an old language. That is a testament to the creativity and vision of its inventors. However, any programming language with a long history is bound to have some warts and R is no exception. The “strings as factors” convention is notorious. Automatic “simplification” in sapply and other functions leads to type instability and therefore to bugs in degenerate cases. There is no native support for 64-bit integers. The experienced R user can doubtless supply their own favorite examples.
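A short sketch of the simplification problem, using only base R. The helper lengths_of is a hypothetical name; vapply is shown as one base R way to pin down the result type.

```r
# Hypothetical helper: count the elements of each list entry.
lengths_of <- function(xs) sapply(xs, length)

typical <- lengths_of(list(1:3, 1:5))   # an integer vector
degenerate <- lengths_of(list())        # an empty list, not an integer vector

# The shape of the result depends on what the function happens to return.
as_vector <- sapply(1:3, function(i) i * 2)        # plain vector
as_matrix <- sapply(1:3, function(i) c(i, i^2))    # matrix with 2 rows
as_list <- sapply(1:3, function(i) seq_len(i))     # list of vectors

# vapply declares the result type up front and stays stable on empty input.
stable <- vapply(list(), length, integer(1))

# Base integers are 32-bit: overflow yields NA (with a warning).
overflow <- suppressWarnings(.Machine$integer.max + 1L)
```

Code that assumes lengths_of always returns an integer vector works until the day it receives an empty list.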
Fortunately, most legacy problems can be fixed by writing new packages. That is exactly what is happening. Many of the packages in the tidyverse are designed, wholly or in part, to fix defects in the R standard library. For example, the readr package fixes the “strings as factors” problem and the purrr package provides a sane alternative to the indispensable but frustrating family of apply functions in base R.
- Lack of data structures
Many data structures that are standard in general-purpose programming languages are inexplicably missing from R. It is embarrassing that, in 2018, R still has no first-class support for dictionaries (hash maps) [7]. And if you’re looking for slightly more exotic data structures, like ordered dictionaries or sets, or for fast mutable data structures, like linked lists or circular buffers, you can just forget about it.
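The usual workaround, repurposing an environment as a dictionary, looks roughly like this in base R. The variable names are illustrative only.

```r
# An environment used as a string-keyed dictionary.
dict <- new.env(hash = TRUE)
assign("apple", 3, envir = dict)
dict[["banana"]] <- list(colour = "yellow")   # values can be any R object

get("apple", envir = dict)
exists("cherry", envir = dict, inherits = FALSE)   # membership test

# Keys must be strings: an integer key is rejected outright.
rejected <- tryCatch(
  assign(42L, "answer", envir = dict),
  error = function(e) "keys must be strings"
)

# The common hack is to coerce the key to a string first.
assign(as.character(42L), "answer", envir = dict)
dict[["42"]]

ls(dict)   # all keys, as strings
```

The coercion works, but it means the caller has to remember that the key 42 and the key "42" are the same entry.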
Of course, this problem can be remedied by writing new data structures, preferably with C/C++ backends. For instance, the new package dequer adds efficient stacks, queues, and deques. Such contributions may not be flashy, but they are useful and important. The R ecosystem needs more of this foundational work.
- GPL licensing
The GPL license is popular among R package developers and the R language itself is GPL licensed. Due to the complicated web of dependencies on CRAN, it is very difficult to write an R program that does not import a GPL-ed package. For example, the flagship package of the tidyverse, dplyr, is MIT licensed, but that’s irrelevant for practical purposes because it, like most of the tidyverse, depends on the GPL-ed package rlang.
This state of affairs is unfortunate and atypical. More liberal licenses prevail in other corners of data science. The BSD license is very popular among Python developers and the MIT license is the de facto standard in the Julia community. Permissive licenses make it much easier for corporate developers to use and contribute to open source software. Corporate sponsorship has been a major boon to the scientific Python community over the years. Acceptability within industry may partly explain why Python is both more popular and growing more rapidly than R among data scientists. Of course, many other factors are involved and R is itself growing rapidly. Still, I think that establishing a culture of permissive licensing would benefit the R community.
Any movement in this direction will be gradual. Cultural change, if it happens at all, happens slowly. However, there is some evidence that, within the wider open source world, the GPL is declining compared to permissive licenses like Apache and MIT. With any luck this trend will spill over into the R community.
If you’re a data scientist who uses R mainly to write short scripts and notebooks for data analysis, you may be wondering: “Are these complaints about type systems and data structures really relevant to me? R is a tool for statistical computing. If it does that job well—and it does—why worry about these esoteric issues in programming language theory?” It’s true that R is used mainly for statistical computing and that these issues may not directly impact working data scientists. However, they indirectly impact all R users, because they reduce the productivity of the package and tool developers upon whom the whole community depends. Moreover, R is branching out from its roots in statistical methodology. RStudio’s Shiny is an R framework for creating interactive web applications. The cloudyr project aims to simplify cloud computing with R, including on serverless frameworks like AWS Lambda. I expect to see more such nontraditional projects as the R community grows. In general, I reject the dichotomy between general-purpose and domain-specific programming languages. The trouble with the latter is that someone always wants to do more.
The tripartite structure of this essay may create the impression that I subjectively dislike R, when that’s not really the case at all. I genuinely enjoy programming in R. In part, that is because R has a simple, elegant core, reminiscent in many ways of Lisp. R becomes much less mysterious when you realize that it’s basically Lisp, but with more statistics and fewer parentheses. Yet I cannot fully explain why I find R so appealing. I suppose there is a Zen to R programming, an ineffable quality of goodness that transcends the language’s many peculiarities.
The family of languages comprising R and S is now over 40 years old. So R, my old friend, here’s to another 40, you charming, irascible devil.
[1] At the time of this writing, a pull request adding k-medoids clustering to scikit-learn is pending.
[2] Witness the confusing and tortuous path of Python packaging: to distutils2 and back again, on to pip, and from the Egg package format to Wheel. The situation remains complicated, as observed by the inimitable xkcd.
[7] The usual prescription is to repurpose environments as dictionaries. The hash package defines a pleasant interface on top of environments, but it cannot circumvent the deeper problem that environment keys must be strings (except through hacks like coercing integer keys to strings). On the other hand, the hashmap package provides a custom hashmap supporting numeric keys, but its values are restricted to atomic vector types. So, to restate my point more carefully: R still has no efficient dictionary data structure that allows a range of key types and arbitrary value types. Yet such data structures now exist in virtually every high-level programming language.