Extended activities

Session 1

Extra exercises inspired by “Getting started” & “Introduction to R programming”.

Challenge.
Recall that we suggested that you create a “standard” directory structure. Write a code snippet or script to create your preferred structure as a utility to be run when you create a new Rstudio project.
Consider how you could test if one or more of the directories already exists in your current working directory.
Hint: You may find it useful to checkout Rs help on ‘files2’ and ‘getwd’.
Answer
Answer
curr_dir <- getwd()
print(curr_dir)
if (!dir.exists("data"))  dir.create("data")

# Another (and more flexible ) way could use the 'lapply' function (see R help) like this:    
tst_apply <- c("d1","d2","d3") # a vector of sub-dirs to create
lapply(tst_apply, dir.create)
#> Warning in FUN(X[[i]], ...): 'd1' already exists
#> Warning in FUN(X[[i]], ...): 'd2' already exists
#> Warning in FUN(X[[i]], ...): 'd3' already exists

Challenge.

Using two of the functions we introduced in this session, find out how many missing variables are missing in the cereals dataset.
NB You will need to execute “install.packages(”plspm”)” first to install package and its dataset.

# Install dataset
library(plspm)

#> 
#> Attaching package: 'plspm'

#> The following object is masked from 'package:ggplot2':
#> 
#>     alpha

data("cereals")
# read in surveys as dataframe (assumes portal data in data dir)
surveys <- read.csv("data/portal_data_joined.csv")

Answer

# we can add up the number of TRUEs using the sum() function
sum(is.na(cereals))
# if this is zero then there are no missing values.

Work through the surveys data-set and find which columns have missing values and how many.
For an extra challenge use the help (or google) to use a for loop to count the number of NAs in each column and the cat() to print out a table.
Answer
sum(is.na(surveys))
coln <- colnames(surveys)
length(coln)
coln
# could print sum(is.na(surveys, 1)) 2 etc to print
# tota NAs for each column or....
out_lst <- list()
for( i in 1: length(coln))
  cat(i, coln[i], sum(is.na(surveys[,i])),"\n")

Session 2

Extra exercises inspired by “Starting with data”.
The data set contains 77 cereals with 15 variables:

manufacturer (categorical)
* A = American Home Food Products
* G = General Mills
* K = Kelloggs
* N = Nabisco
* P = Post
* Q = Quaker Oats
* R = Ralston Purina

type (categorical).
* C = cold cereal
* H = hot cereal
Nutritional information per serving.
number of calories (numeric).
grams of protein (numeric).
grams of fat (numeric).
milligrams of sodium (numeric).
grams of dietary fiber (numeric).
grams of complex carbohydrates (numeric).
grams of sugars (numeric).
grams of potassium (numeric).
percentage of FDA recommended vitamins (numeric).
display shelf (categorical) 1=bottom shelf to 3=top.
weight in ounces (numeric).
number of cups in one serving (numeric).
rating of the cereal (numeric) - consumer survey?.

summary(cereals)

Challenges.

List data for the cereals manufactured by firm ‘K’.

Answer

cereals[cereals$mfr == "K",]

Find the first ten cereals that have greater than 100 calories per serving.

Answer

head(cereals[cereals$calories > 100,],10)

Of the cereals that have greater than 100 calories per serving, what is the mean number of calories/serving?

Answer

mean(cereals$calories[cereals$calories > 100])

List those cereals made by ‘K’ with less than 100 calories per serving.

Answer

cereals[cereals$calories < 100 & cereals$mfr == "K",]

Session 3

Extra exercises inspired by session 1 of “Data manipulation and visualisation with tidyverse”.

Load ‘tidyverse’ and convert cereals to a tibble (either by saving to CSV and then reading back into a tibble or by learning to use the as_tibble function from R help).

Answer

# NB dataframe has the cereal name as rownames. To preserve this
# data in our tibble we use "rownames=" parameter to create a
# cer_name column.
library(tidyverse)
cereals_t <- as_tibble(cereals, rownames = "cer_name")

Challenge.
It is hypothesised that consumers prefer cereals that contain more sugar. Plot the relationship between sugar content and rating. Does this support the hypothesis?

Answer

# challenge phrasing suggests using facets
ggplot(data = cereals_t,mapping = aes(x = sugars, y = rating)) +
  geom_point() +
  geom_smooth()

#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

It is possible that market dominance is skewing our sample.
Produce a bar-plot showing the number of items made by each manufacturer in our data-set.

Answer

ggplot(data=cereals_t,mapping = aes(x=mfr)) +
 geom_bar()

In section 2 you listed those cereals made by ‘K’ with less than 100 calories per serving using the ‘Base R’ method. Please produce a Tidyverse version (without creating any intermediate variables or nesting functions). List the manufacturer, calories and cereal name.

Answer

cereals_t %>%
  filter(calories < 100 & mfr == "K") %>%
  select(mfr,calories, type)

Session 4

Extra exercises inspired by session 2 of “Data manipulation and visualisation with tidyverse”.

Challenges.
Using Tidyverse functions, create a summary table of the average calories by manufacturer.
It should look like this:

Manufacturer	mean_calories
A	100.00000
G	111.36364
K	108.69565
N	86.66667
P	108.88889
Q	95.00000
R	115.00000

Answer

# Look up rename() in help - it's a useful dplyr verb
cereals_t %>%
  group_by(mfr) %>%
  rename(Manufacturer = mfr) %>%
  summarise(mean_calories=mean(calories))

Produce a scatterplot of calories vs. sugar grouped by manufacturer.

Answer

ggplot(data = cereals_t, mapping = aes(x = sugars, y = calories)) +
    geom_point() +
    geom_smooth() +
    facet_wrap(facets = vars(mfr))

Capstone project (Putting it all together).

This will be performed on a dataset from a paper by Reaven & Miller - downloadable from here. This research analyses the stages of the disease Type 2 Diabetes.
You will need to install the ‘heplots’ package and load it using library() and this will bring in the data-set we will be working with.

Answer

# list dataset
library(heplots)

#> Loading required package: broom

Diabetes

Gain familiarity with the Diabetes dataset.

We have 5 columns with numeric values (relwt, glufast, glutest, instest, sspg) and one categorical column (group).
Details about these values are to be found here.

Exploring the dataset.

print(summary(Diabetes))

From this we find that Group comprises of 3 values: (Normal, Chemical_Diabetic, Overt_Diabetic) corresponding to stages of developing Type 2 diabetes.

Histogram plots of the data.

Use the Base R hist() command (Consult R help for usage) to plot the distributions of the numeric variables - are they normally distributed (Bell curve)?.

Answer

# Plot histograms of all columns distributions
# Does data fit normality criteria for statistical tests?
# e.g. if we wanted to do a t-test. We'll just use base R histogam
# here - As an extra exercise: How would you do this in tidyverse
# as well?
hist(Diabetes$relwt)

# Plot histograms of all columns distributions
# Does data fit normality criteria for statistical tests?
# e.g. if we wanted to do a t-test. We'll just use base R histogam
# here - As an extra exercise: How would you do this in tidyverse
# as well?
hist(Diabetes$glufast)

# Plot histograms of all columns distributions
# Does data fit normality criteria for statistical tests?
# e.g. if we wanted to do a t-test. We'll just use base R histogam
# here - As an extra exercise: How would you do this in tidyverse
# as well?
hist(Diabetes$glutest)

# Plot histograms of all columns distributions
# Does data fit normality criteria for statistical tests?
# e.g. if we wanted to do a t-test. We'll just use base R histogam
# here - As an extra exercise: How would you do this in tidyverse
# as well?
hist(Diabetes$instest)

# Plot histograms of all columns distributions
# Does data fit normality criteria for statistical tests?
# e.g. if we wanted to do a t-test. We'll just use base R histogam
# here - As an extra exercise: How would you do this in tidyverse
# as well?
hist(Diabetes$sspg)

Explore correlations.

Install the “PerformanceAnalytics” package to help us create correlation plots for our data-set.

NB.
- This plot also provides another way of generating the histograms of the data columns.
- The distribution of each variable is shown on the diagonal.
- On the bottom of the diagonal : the bivariate scatter plots with a fitted line are displayed.
- On the top of the diagonal : the value of the correlation plus the significance level as stars.
- Each significance level is associated to a symbol :
p-values(0, 0.001, 0.01, 0.05, 0.1, 1) <=> symbols(“”, “”, “”, “.”, ” “).

Answer

# install.packages("PerformanceAnalytics")
library("PerformanceAnalytics")
my_data <- Diabetes[, c(1,2,3,4,5)]
chart.Correlation(my_data, histogram=TRUE, pch=19)

Lets now replicate Table 1 from the paper. It has columns:
Group, Number in group, Mean Relwt with SD, Mean glucose area with SD, Mean insulin area with SD, mean SSPG with SD.

Answer

Diabetes %>%
    group_by(group) %>%
    summarise(
        grp_size = n(),
        mean_relwt = mean(relwt),
        sd_relwt = sd(relwt),
        mean_glucosearea = mean(glutest),
        sd_glucarea = sd(glutest),
        mean_insulinarea = mean(instest),
        sd_insu_area = sd(instest),
        mean_sspg = mean(sspg),
        sd_sspg = sd(sspg)
    )

The table in Reaven & Millars paper has values quoted to 2 decimal places, modify your code to better mimic that.

Answer

# We use our 'old friend' the round() function 
# with a digits=2 parameter
Diabetes %>%
    group_by(group) %>%
    summarise(
        grp_size = n(),
        mean_relwt = round(mean(relwt),2),
        sd_relwt = round(sd(relwt),2),
        mean_glucosearea = round(mean(glutest),2),
        sd_glucarea = round(sd(glutest),2),
        mean_insulinarea = round(mean(instest),2),
        sd_insu_area = round(sd(instest),2),
        mean_sspg = round(mean(sspg),2),
        sd_sspg = round(sd(sspg),2)
    )

Now we replicate figures 2 & 3 from the paper. Scatterplots of insulin area vs glucose area and sspg vs. insulin area. The group that a data point is from should be appropriately discernable.

Answer

# NB We could have used colour=group but shape is closer
# to the published graph. Note usage of \n in title to break line.
ggplot(Diabetes, mapping= aes(x=glutest, y =instest, shape=group)) +
  geom_point() +
  labs(title = "Fig. 2, Relationship between the plasma glucose \n (mg/100 ml 9hr) and insulin (~tU/ml 9hr) responses ",
       x = "Glucose area",
       y = "Insulin area")

# Now repeat for Fig 3
ggplot(Diabetes, mapping= aes(x=instest, y =sspg, shape=group)) +
  geom_point() +
  labs(title = "Fig. 3. Relationship between the plasma insulin \n response(~tU/ml 9hr) and SSPG (rag/100 ml) ",
       x = "Insulin area",
       y = "SSPG")

Additional Exercises

Extended activities

Session 1

Answer

Challenge.

Answer

Session 2

Session 3

Session 4

Capstone project (Putting it all together).

Gain familiarity with the Diabetes dataset.

Exploring the dataset.

Histogram plots of the data.

Explore correlations.