Chapter 2 Starting R
This chapter is a tour of some of the basics of the R language. Please follow along in RStudio as you read and try not to peek at the answers until you have given the activities a good go!
By the end of this tutorial you will know how to:
- Do basic stuff in R like arithmetic.
- Assign results of calculations to variables and refer to these.
- Calculate descriptive statistics like mean and standard deviation (SD).
- Install add-on packages from the internet and include them.
- Open a dataset in R.
This will prepare you for your first analysis of data.
2.1 Arithmetic
You can use R as a calculator:
+
is addition
-
is subtraction
*
is multiplication
/
is division
^
is exponentiation
So, for instance to calculate \(\frac{1}{2}\) you do
1/2
To work out \(2^3\) do:
2^3
The usual mathematical order of precedence rules apply, so
1 + 2 * 3
gives different result to
1 + 2) * 3 (
(Try it!)
2.1.1 Activities
Easing in tremendously gently…
What is the R code to calculate 2 + 2
How do you multiply 6 and 7?
2.1.2 Answers
a. What is the R code to calculate 2 + 2
2 + 2
## [1] 4
b. How do you multiply 6 and 7?
6 * 7
## [1] 42
2.2 Variables
R has a flexible way of creating and naming “objects” of various kinds, so that you can refer to them later. These objects may be things like the answer to a sum, a whole dataset, or the result of a regression analysis.
Let’s start with a sum – we will get to the more complex examples soon!
Look at this R code:
<- 2 + 2
first_bit <- first_bit / 8 second_bit
Read as
- “set
first_bit
to 2 plus 2” - “set
second_bit
tofirst_bit
divided by 8”
The <-
symbol is called an “assignment statement” and first_bit
and second_bit
are variable names.
2.2.1 Activity
When you run the code above, you will see that nothing actually seems to happen. Can you guess what you need to do to get the answer to the second_bit
sum?
2.2.2 Answer
You just have to type and run the name of the variable.
second_bit
## [1] 0.5
2.3 A note on variable names
There are limitations to the names you can use for variables (but ways to get around those – for another time). To keep life simple:
- make sure variable names always start with a character, “a” to “z.”
- they can have a number after the first character
- you can use an underscore, "_", as a sort of space
- remember that variables are cAsE sEnSiTiVe and consider always using lower case
- try to make variable names meaningful
2.4 Vectors
Vectors are a list of values of the same type of information. They usually come from a dataset, but you can also build them directly in R.
Vectors can be of whole numbers…
c(3,1,4,1,3,1,1,3,2,1)
… or (approximations of) real numbers…
c(1.2, 2.4, -2.3)
… or character strings, which are often used to represent categorical information:
<- c("woman", "woman", "man", "woman",
gender "genderqueer", "genderqueer", "gender fluid")
In this last example, I have saved the vector in a variable called gender
.
2.4.1 Activity
Try making a vector of the members of your favourite band (or other grouping of people that makes sense to you) and save the result in a variable with a meaningful name.
2.4.2 Answer
<- c("Françoise Cactus", "Brezel Göring") stereo_total
2.5 Functions
A function in R is a magical computational machine which takes an input and transforms it in some way, giving an output. (This is not a formal definition.)
Here’s a vector:
c(2,4,12)
One of R’s functions calculates the mean of a vector of numbers:
mean(c(2,4,12))
## [1] 6
There is another for the median:
median(c(2,4,12))
## [1] 4
Since these are built into R, there is also help. Try typing:
?mean
We can also apply functions to variables. Another useful function is table
which creates a table of counts.
So, in summary, to use a function you use its name and then tell it the inputs in parentheses.
2.5.1 Activity
Make a table of the gender
variable we created earlier.
2.5.2 Answer
table(gender)
## gender
## gender fluid genderqueer man woman
## 1 2 1 3
Note how this gives the same result as:
table(c("woman", "woman", "man", "woman",
"genderqueer", "genderqueer", "gender fluid"))
##
## gender fluid genderqueer man woman
## 1 2 1 3
But without having to copy and paste the vector.
2.6 More functions
Functions can have zero inputs. Try citation
which tells you how to cite R in write-ups.
citation()
(You could also try without the parentheses – not something you’d generally want to do on purpose – to see what happens.)
Functions can have more than one input. Another handy function, called rnorm
, generates random data from a normal distribution, with a given number of values, mean, and SD.
Here is an example with 50 values with a mean of 20 and an SD of 5:
<- rnorm(50, 20, 5)
random_values random_values
## [1] 17.936898 13.661468 18.299670 22.353483 23.613234 23.489314 20.176084
## [8] 18.991016 20.297336 19.421076 16.228409 9.679832 22.818685 17.784582
## [15] 15.677661 16.682719 15.277020 22.615267 30.458067 18.143680 17.104054
## [22] 19.729769 21.150979 16.615123 23.119212 15.338861 22.717791 19.515446
## [29] 16.541693 20.011065 14.588855 19.882224 25.225578 22.606624 19.320099
## [36] 15.456918 18.440037 20.030812 23.397467 23.629645 20.306022 18.132887
## [43] 20.422348 23.532638 19.367769 17.901262 18.316020 26.738726 17.147467
## [50] 18.189729
When you run it you will probably get different numbers to me.
Here is a histogram of values I got:
hist(random_values)
2.6.1 Activity
Try calculating the mean and standard deviation (the command is called sd
) of the data you generated. What do you notice?
2.6.2 Answer
Here’s what I got:
mean(random_values)
## [1] 19.56165
sd(random_values)
## [1] 3.613532
Unless you were very lucky, you probably discovered like me that the mean was different to 20 and the SD was different to 5. This is something we will return to in another session, but can you guess now why this has happened?
2.7 Data frames
Data frames are “tightly coupled collections of variables… used as the fundamental data structure by most of R’s modeling software.”
In practice, data frames are usually read in from files (next section); however, it can be helpful to see how to make them “by hand” within R.
To make them, use the function data.frame
with named vectors comprising the data.
Here is an example, created with copy and paste from Wikipedia. I have used newlines and space to make it readable.
<- data.frame(year = c(1995,
stereo_total_albums 1997,
1998,
1999,
2001,
2002,
2005,
2007,
2009,
2010,
2011,
2012,
2015,
2016,
2019),
album = c("Oh Ah!",
"Monokini",
"Juke-Box Alarm",
"My Melody",
"Musique Automatique",
"Trésors cachés",
"Do the Bambi",
"Paris-Berlin",
"No Controles",
"Baby ouh!",
"Underwater Love",
"Cactus versus Brezel",
"Yéyé Existentialiste",
"Les hormones",
"Ah! Quel Cinéma!"))
Do not build data frames like this with real data as it is painfully dull. However, it did get the data into R. R will print this in a sensible way if you run the data frame’s name:
stereo_total_albums
## year album
## 1 1995 Oh Ah!
## 2 1997 Monokini
## 3 1998 Juke-Box Alarm
## 4 1999 My Melody
## 5 2001 Musique Automatique
## 6 2002 Trésors cachés
## 7 2005 Do the Bambi
## 8 2007 Paris-Berlin
## 9 2009 No Controles
## 10 2010 Baby ouh!
## 11 2011 Underwater Love
## 12 2012 Cactus versus Brezel
## 13 2015 Yéyé Existentialiste
## 14 2016 Les hormones
## 15 2019 Ah! Quel Cinéma!
A few things you might want to do with data frames include:
Listing the variables:
names(stereo_total_albums)
## [1] "year" "album"
Finding out how many rows it has:
nrow(stereo_total_albums)
## [1] 15
You will often want to refer to a vector within the data frame, which you do with the $
operator:
$year stereo_total_albums
## [1] 1995 1997 1998 1999 2001 2002 2005 2007 2009 2010 2011 2012 2015 2016 2019
This is just like any other vector, so you could draw a histogram of the years of album releases:
hist(stereo_total_albums$year,
xlab = "Year",
main = "Stereo Total albums")
hist
has automatically chosen to group years in fives based on how much data there was and how it’s distributed. Note also how I changed the default settings on x-axis label and title.
2.7.1 Activity
What’s the median of the years albums were released?
2.7.2 Answer
median(stereo_total_albums$year)
## [1] 2007
2.8 Loading data frames from a file
It is easy to read files into R. This command loads an excerpt from the Gapminder dataset (extracted and tidied by Jennifer Bryan), which you can find here. Save the file in the same folder as your R or Rmd file and run:
<- read.csv("gapminder.csv") gap
Here “csv” stands for “comma separated values.” If you open the file in, say, Notepad, then you will see why…
"country","continent","year","lifeExp","pop","gdpPercap"
"Afghanistan","Asia",1952,28.801,8425333,779.4453145
"Afghanistan","Asia",1957,30.332,9240934,820.8530296
"Afghanistan","Asia",1962,31.997,10267083,853.10071
"Afghanistan","Asia",1967,34.02,11537966,836.1971382
"Afghanistan","Asia",1972,36.088,13079460,739.9811058
"Afghanistan","Asia",1977,38.438,14880372,786.11336
...
The values are separated by commas.
If it worked, then the following command will show the top 5 rows:
head(gap, 5)
## country continent year lifeExp pop gdpPercap
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
If it didn’t work, then you’re probably not using R markdown. You can try changing the working directory to the same place as your R “source file.”
The variables in the dataset are as follows:
country: The country
continent: The continent
year: ranges from 1952 to 2007 in increments of 5 years
lifeExp: life expectancy at birth, in years
pop: population
gdpPercap: GDP per capita (US$, inflation-adjusted)
We can plot life expectancy against the GDP per capita as follows:
plot(lifeExp ~ gdpPercap, data = gap)
The “~” is read as “tilde” (or sometimes “twiddle”). The x-axis labels may not be readable at first but we shall return to this in a later session. (If you can’t wait, try this.)
2.8.1 Activity
Plot life expectancy at birth (on the y-axis) against year (on the x-axis).
2.8.2 Answer
plot(lifeExp ~ year,
data = gap,
xlab = "Year",
ylab = "Life expectancy at birth")
Again I have changed the axis label defaults. We will dramatically improve upon this in the next chapter.
2.9 Packages
When you install R, it comes with a small collection of essential packages which consist of functions like mean
, sd
, and hist
that we have used above. But a strength of R is that there are a HUGE number of packages written by people across the world who use R and these can easily be downloaded from the interweb and added using R. These add-ons do things like produce pretty data visualisations (e.g., used by the BBC Data and Visualisation Journalism team) and statistical models required by social scientists. The core recommended text by Fox and Weisberg (2019) also has an R package.
Here is an example, using the praise
package. To download and install the package, use this command:
install.packages("praise")
If this has worked, you will have seen text something like:
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.0/praise_1.0.0.zip'
Content type 'application/zip' length 19729 bytes (19 KB)
downloaded 19 KB
package ‘praise’ successfully unpacked and MD5 sums checked
Now the package is on your local computer but it isn’t available to use yet. You also have to do:
library(praise)
(Note how there were no quotation marks here.)
If this worked, then… well not much will have happened. But R won’t have complained.
2.9.1 Activity
The praise
packages provides a function also called praise
, which requires no input to run. Give it a go – what happens?
2.9.2 Answer
I’ll run it a few times…
praise()
## [1] "You are wonderful!"
praise()
## [1] "You are stunning!"
praise()
## [1] "You are hunky-dory!"
Not a big data science algorithm, but you may find it helpful as the course progresses :)
2.10 The end!
praise()
## [1] "You are awesome!"