Sometimes the Best Documentation is Code You've Already Written

When you’ve been writing R code for long enough, you’ll invariably encounter a challenge you’ve faced before. It may be remembering whether to use the color or fill parameter for a line geom in ggplot, what to put in an R notebook chunk header to prevent the code from displaying, or how to convert a character string to a datetime value - or any of a million other things. You sit there, fingers over the keys, knowing full well you’ve done this before, but not remembering how you solved the problem. So what do you do?

One approach, if you’re using RStudio and the problem is about a parameter, is to simply move your cursor over the function in question and hit F1 to bring up the docs. Another is the tried and true Google or Stack Overflow search. Both of these are valid and useful ways to solve coding problems, but they pull you out of the context - your script, RStudio, the inline output itself if you’re in an R notebook - and require you to search through a lot of material that is unrelated to the problem at hand. These approaches may be necessary for many problems you encounter, but for ones you know you’ve solved before, it may be preferable to refer to code you’ve already written.

Mine Your Code

These days, when I encounter some coding conundrum that I know I’ve solved in the past, I can pop open the terminal in RStudio (Alt+Shift+t), type a quick command and a keyword related to the function or parameter I’m having trouble with, and see every line in every R notebook I’ve saved where that word was found. It feels like a superpower every time I do it, and I can’t recommend this approach enough. Having every line of code you’ve ever written at your fingertips, just a short command away, is powerful. However, this does require some familiarity with bash/shell, and some comfort with the blinking cursor and dark void of the terminal, which can be intimidating at first.

Bash/Shell Familiarity

To get a sense of how familiar R coders are with bash, and how narrowly I would be able to tailor the subject matter of this blog post, I decided to look at the 2018 Stack Overflow survey, where respondents were asked which languages they had done extensive development work in over the last year. As you can see below, a fairly large proportion of R coders have been using bash/shell - nearly 55%. This puts them ahead of not only those who use VBA, SQL, and Matlab, but also those using non-data oriented programming languages like Java, JavaScript, and C#.

So, if you don’t have any experience working at the command line, don’t despair, because it is well within the reach of those who tend not to think of themselves as “serious” programmers. For the purposes of this tutorial, some familiarity with the terminal, and bash in particular, will be helpful. With some supplementary googling, though, you should be able to follow along just fine.

To get started, you will need to have bash installed on your computer. This will already be the case if you’re using Linux or macOS, but on Windows, this will require extra steps that are outside the scope of this post. For Windows, I would recommend following this tutorial to install Windows Subsystem for Linux (WSL). Once you have WSL installed, you will need to navigate to Tools -> Global Options -> Terminal and select bash from the list to have bash available when you open a terminal in RStudio.

Building the Command

Once you are able to open bash in a terminal in RStudio, you will have the ability to run commands to do all sorts of things that we won’t cover here - creating and removing directories, editing files, and much more. To build our command to search your code for a keyword, you will use the grep command. When you type grep, then a word and a filename, in the terminal, grep will search for the word in the matching files and print each line where it appears. This is probably its most rudimentary use. We’re going to extend this command with some handy settings to do a bit more than just return matches.
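If you'd like to see that rudimentary form in action first, here is a minimal sketch. It assumes nothing about your own files: it works in a throwaway directory, and the file name plot.Rmd and its contents are made up for illustration.

```shell
# Work in a scratch directory so none of your real projects are touched
demo_dir=$(mktemp -d)

# A small stand-in for a notebook you wrote earlier (hypothetical contents)
printf '%s\n' \
  'library(ggplot2)' \
  'ggplot(df, aes(x, y)) + geom_line(color = "red")' \
  > "$demo_dir/plot.Rmd"

# Most basic form: grep, then the word, then the file to search
basic_matches=$(grep "color" "$demo_dir/plot.Rmd")
echo "$basic_matches"
# prints: ggplot(df, aes(x, y)) + geom_line(color = "red")
```

Only the second line of the file contains the word, so only that line comes back.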

The full command we will construct will look like this:

grep -irn --color --include "*.Rmd"

Breaking Down the Command

The -irn part of the command is a combination of flags, or settings, telling grep to behave in certain ways:

  • The i flag tells it to search case insensitively. That is, to find all words regardless of whether they are upper or lower case, or a combination of the two, that match the word you are searching for.
  • The r flag tells it to recurse through the subdirectories. That is, we will search all files, not only in the current directory, but in any folders within it, and again any within those.
  • The n flag tells it to print the line number. Without this flag, by default, each match is shown with only the name of the file it was found in and the text of the matching line, but with it the line number will appear as well.

The --color flag tells grep to color the word that was matched in the output. This will aid in skimming the output for matches that address your question.

The --include "*.Rmd" flag tells grep to only search within files that end in .Rmd.

When you type this command, followed by a search term, in a directory that has .Rmd files within it, you will see something like this. Each line shows the filename, the line number, and the text in the line with the match highlighted.
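To make that output concrete, here is a small sketch you can paste into a terminal. Everything in it is an assumption made for the demo: the scratch directory, the file names (plots.Rmd, reports/summary.Rmd, helpers.R) and their contents. I also pass an explicit starting directory (the trailing dot), so the example does not depend on how your version of grep behaves when no path is given, and I sort the results only to make the order predictable.

```shell
# Scratch directory with two notebooks and one plain R script (all hypothetical)
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/reports"

printf '%s\n' '---' 'title: "Plots"' '---' 'geom_line(color = "red")' \
  > "$demo_dir/plots.Rmd"
printf '%s\n' 'Set the Color scale here' > "$demo_dir/reports/summary.Rmd"
printf '%s\n' 'color <- "blue"' > "$demo_dir/helpers.R"

# Run the full command from inside the scratch directory (in a subshell,
# so your own working directory is left alone)
matches=$(cd "$demo_dir" && grep -irn --color --include "*.Rmd" "color" . | sort)
echo "$matches"
```

You should get two lines back, one per notebook, each in the form filename:line number:text. The match in reports/summary.Rmd is found thanks to the r flag (it sits in a subfolder) and the i flag (it is spelled "Color"), while helpers.R is skipped because it does not end in .Rmd.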

Now, we have our bash command, but we have to write out a long line of text to perform our search. This will invariably break our flow as we now have to remember “grep” and all of the flags each time we want to perform a search. What would be better would be to have a way of packaging this command into a single, easy to remember keyword. An “alias”, in bash, is a keyword that we can type in place of an often much longer command to get the same effect. This will be a suitable approach for our lengthy grep command.

Bash aliases are stored in a special file that contains many key settings. On macOS, this would be the .bash_profile file in the home (i.e., /Users/yourusername) directory. On Linux, you can use /etc/bash.bashrc where it exists (it applies to all users), or the .bashrc file in your home directory. On Windows, with WSL installed, this will also be the .bashrc file in the home directory. Open up the appropriate file for your operating system and add the following line.

alias rmd='grep -irn --color --include "*.Rmd"'

Back in RStudio, click the “x” in the terminal tab to close the terminal (the changes we made to the bash configuration file will not take effect until it is restarted). Open a new terminal. If you are in a directory with Rmd files in it, you should be able to type rmd and a keyword to search within the files.
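If you would rather try the alias out before touching any configuration file, one option is to define it for the current terminal session only. This is just a sketch of that idea: the alias disappears when the terminal is closed, which is exactly why the configuration file is needed for the permanent version.

```shell
# Define the alias for this terminal session only (nothing is saved to disk)
alias rmd='grep -irn --color --include "*.Rmd"'

# Ask the shell to print the definition back, to confirm it registered
alias rmd
```

From here, typing rmd followed by a keyword at the prompt behaves the same way as typing out the full grep command.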

Modifying the Behavior of the Command

Now that you have the command working, you’ll notice that it only displays one line per match. What if you want to see the lines around the match? Or the line with the match and the line below it? Since the “rmd” command is just an alias for grep and all of the flags we specified earlier, you can just add more flags after “rmd”. For example, if we want to show the line preceding the match, we can do the following.

rmd -B1 keyword

You can add any of the flags below after our new rmd command, each followed by a number of lines, to see the lines around the one the match is found on.

  • -A Show lines after
  • -B Show lines before
  • -C Show surrounding lines
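Here is a short sketch of what the B flag does, written with the full grep command rather than the alias so it can be pasted anywhere. The scratch directory, the file name plot.Rmd and its four lines are all invented for the demo.

```shell
# Scratch notebook with four lines (hypothetical contents)
demo_dir=$(mktemp -d)
printf '%s\n' \
  'library(ggplot2)' \
  'p <- ggplot(df, aes(x, y))' \
  'p + geom_line(color = "red")' \
  'print(p)' \
  > "$demo_dir/plot.Rmd"

# Same flags the rmd alias expands to, plus -B1 for one line of leading context
context=$(grep -irn --color --include "*.Rmd" -B1 "geom_line" "$demo_dir")
echo "$context"
```

Two lines come back: line 2 as context and line 3 as the match. In the output, grep separates the fields of a matching line with colons and the fields of a context line with hyphens, which is a handy way to tell them apart when color is not available.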

To see other flags that change the behavior of grep, type man grep in the terminal at any time. This will show you all available flags and descriptions of their use.

Familiarity with bash can pay off greatly, and being able to search your code to figure out how you accomplished something in the past - sometimes the distant past - can keep you speeding along as you build out an analysis.
