Extra exercises inspired by “Getting started” & “Introduction to R programming”.
Challenge.
Recall that we suggested that you create a “standard” directory structure. Write a code snippet or script to create your preferred structure as a utility to be run when you create a new Rstudio project.
Consider how you could test if one or more of the directories already exists in your current working directory.
Hint: You may find it useful to checkout Rs help on ‘files2’ and ‘getwd’.Answer
Answer
curr_dir <- getwd() print(curr_dir) if (!dir.exists("data")) dir.create("data") # Another (and more flexible ) way could use the 'lapply' function (see R help) like this: tst_apply <- c("d1","d2","d3") # a vector of sub-dirs to create lapply(tst_apply, dir.create)#> Warning in FUN(X[[i]], ...): 'd1' already exists#> Warning in FUN(X[[i]], ...): 'd2' already exists#> Warning in FUN(X[[i]], ...): 'd3' already exists
Challenge.
Using two of the functions we introduced in this session, find out how many missing variables are missing in the cereals dataset.
NB You will need to execute “install.packages(”plspm”)” first to install package and its dataset.
#> #> Attaching package: 'plspm'#> The following object is masked from 'package:ggplot2': #> #> alpha
Work through the surveys data-set and find which columns have missing values and how many.
For an extra challenge use the help (or google) to use a for loop to count the number of NAs in each column and the cat() to print out a table.
Extra exercises inspired by “Starting with data”.
The data set contains 77 cereals with 15 variables:
manufacturer (categorical)
* A = American Home Food Products
* G = General Mills
* K = Kelloggs
* N = Nabisco
* P = Post
* Q = Quaker Oats
* R = Ralston Purina
type (categorical).
* C = cold cereal
* H = hot cereal
Nutritional information per serving.
number of calories (numeric).
grams of protein (numeric).
grams of fat (numeric).
milligrams of sodium (numeric).
grams of dietary fiber (numeric).
grams of complex carbohydrates (numeric).
grams of sugars (numeric).
grams of potassium (numeric).
percentage of FDA recommended vitamins (numeric).
display shelf (categorical) 1=bottom shelf to 3=top.
weight in ounces (numeric).
number of cups in one serving (numeric).
rating of the cereal (numeric) - consumer survey?.
Challenges.
.
.
Extra exercises inspired by session 1 of “Data manipulation and visualisation with tidyverse”.
Load ‘tidyverse’ and convert cereals to a tibble (either by saving to
CSV and then reading back into a tibble or by learning to use the
as_tibble function from R help).
Challenge.
It is hypothesised that consumers prefer cereals that contain more
sugar. Plot the relationship between sugar content and rating. Does this
support the hypothesis?
# challenge phrasing suggests using facets
ggplot(data = cereals_t,mapping = aes(x = sugars, y = rating)) +
geom_point() +
geom_smooth()#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

It is possible that market dominance is skewing our sample.
Produce a bar-plot showing the number of items made by each manufacturer
in our data-set.
In section 2 you listed those cereals made by ‘K’ with less than 100
calories per serving using the ‘Base R’ method. Please produce a
Tidyverse version (without creating any intermediate variables or
nesting functions). List the manufacturer, calories and cereal
name.
Extra exercises inspired by session 2 of “Data manipulation and visualisation with tidyverse”.
Challenges.
Using Tidyverse functions, create a summary table of the average
calories by manufacturer.
It should look like this:
| Manufacturer | mean_calories |
|---|---|
| A | 100.00000 |
| G | 111.36364 |
| K | 108.69565 |
| N | 86.66667 |
| P | 108.88889 |
| Q | 95.00000 |
| R | 115.00000 |
Produce a scatterplot of calories vs. sugar grouped by manufacturer.
ggplot(data = cereals_t, mapping = aes(x = sugars, y = calories)) +
geom_point() +
geom_smooth() +
facet_wrap(facets = vars(mfr))
This will be performed on a dataset from a paper by Reaven
& Miller - downloadable from here. This research analyses the
stages of the disease Type 2 Diabetes.
You will need to install the ‘heplots’ package and load it using
library() and this will bring in the data-set we will be working
with.
We have 5 columns with numeric values (relwt, glufast, glutest,
instest, sspg) and one categorical column (group).
Details about these values are to be found here.
From this we find that Group comprises of 3 values: (Normal, Chemical_Diabetic, Overt_Diabetic) corresponding to stages of developing Type 2 diabetes.
Use the Base R hist() command (Consult R help for usage) to plot the distributions of the numeric variables - are they normally distributed (Bell curve)?.
# Plot histograms of all columns distributions
# Does data fit normality criteria for statistical tests?
# e.g. if we wanted to do a t-test. We'll just use base R histogam
# here - As an extra exercise: How would you do this in tidyverse
# as well?
hist(Diabetes$relwt) 
# Plot histograms of all columns distributions
# Does data fit normality criteria for statistical tests?
# e.g. if we wanted to do a t-test. We'll just use base R histogam
# here - As an extra exercise: How would you do this in tidyverse
# as well?
hist(Diabetes$glufast)
# Plot histograms of all columns distributions
# Does data fit normality criteria for statistical tests?
# e.g. if we wanted to do a t-test. We'll just use base R histogam
# here - As an extra exercise: How would you do this in tidyverse
# as well?
hist(Diabetes$glutest) 
# Plot histograms of all columns distributions
# Does data fit normality criteria for statistical tests?
# e.g. if we wanted to do a t-test. We'll just use base R histogam
# here - As an extra exercise: How would you do this in tidyverse
# as well?
hist(Diabetes$instest)
# Plot histograms of all columns distributions
# Does data fit normality criteria for statistical tests?
# e.g. if we wanted to do a t-test. We'll just use base R histogam
# here - As an extra exercise: How would you do this in tidyverse
# as well?
hist(Diabetes$sspg)
Install the “PerformanceAnalytics” package to help us create correlation plots for our data-set.
NB.
- This plot also provides another way of generating the histograms of
the data columns.
- The distribution of each variable is shown on the diagonal.
- On the bottom of the diagonal : the bivariate scatter plots with a
fitted line are displayed.
- On the top of the diagonal : the value of the correlation plus the
significance level as stars.
- Each significance level is associated to a symbol :
p-values(0, 0.001, 0.01, 0.05, 0.1, 1) <=> symbols(“”,
“”, “”, “.”, ” “).
# install.packages("PerformanceAnalytics")
library("PerformanceAnalytics")
my_data <- Diabetes[, c(1,2,3,4,5)]
chart.Correlation(my_data, histogram=TRUE, pch=19)
Lets now replicate Table 1 from the paper. It has columns:
Group, Number in group, Mean Relwt with SD, Mean glucose area with SD,
Mean insulin area with SD, mean SSPG with SD.
# We use our 'old friend' the round() function
# with a digits=2 parameter
Diabetes %>%
group_by(group) %>%
summarise(
grp_size = n(),
mean_relwt = round(mean(relwt),2),
sd_relwt = round(sd(relwt),2),
mean_glucosearea = round(mean(glutest),2),
sd_glucarea = round(sd(glutest),2),
mean_insulinarea = round(mean(instest),2),
sd_insu_area = round(sd(instest),2),
mean_sspg = round(mean(sspg),2),
sd_sspg = round(sd(sspg),2)
)Now we replicate figures 2 & 3 from the paper. Scatterplots of insulin area vs glucose area and sspg vs. insulin area. The group that a data point is from should be appropriately discernable.
# NB We could have used colour=group but shape is closer
# to the published graph. Note usage of \n in title to break line.
ggplot(Diabetes, mapping= aes(x=glutest, y =instest, shape=group)) +
geom_point() +
labs(title = "Fig. 2, Relationship between the plasma glucose \n (mg/100 ml 9hr) and insulin (~tU/ml 9hr) responses ",
x = "Glucose area",
y = "Insulin area")
# Now repeat for Fig 3
ggplot(Diabetes, mapping= aes(x=instest, y =sspg, shape=group)) +
geom_point() +
labs(title = "Fig. 3. Relationship between the plasma insulin \n response(~tU/ml 9hr) and SSPG (rag/100 ml) ",
x = "Insulin area",
y = "SSPG")