1 Descriptive Statistics
1.1 Infant birth weight
In a study of different occupational groups, infant birth weight was recorded for randomly selected babies born to hairdressers who were having their first child. The following table shows the weights in grams (sorted in increasing order) for 10 female births and 10 male births:
Gender | Weights (g) |
---|---|
Females (x) | 2474 2547 2830 3219 3429 3448 3677 3872 4001 4116 |
Males (y) | 2844 2863 2963 3239 3379 3449 3582 3926 4151 4356 |
Solve at least the following questions a)-c) first “manually” and then by the inbuilt functions in R. It is OK to use R as an alternative to your pocket calculator for the “manual” part, but avoid the inbuilt functions that produce the results without forcing you to think about how to compute them.
a) Females
What is the sample mean, variance and standard deviation of the female births? Express in your own words the story told by these numbers. The idea is to force you to interpret what can be learned from these numbers.
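- sample mean - the arithmetic average of all observations; every value counts equally, so it is sensitive to outliers
- variance - the sum of squared deviations from the mean divided by n − 1, i.e. roughly the average squared distance of the values from the mean
- standard deviation - the square root of the variance, expressed in the same unit as the data (grams)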
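As a check on the three definitions above, here is a minimal sketch in R that computes each statistic step by step and compares it with the inbuilt function. The variable names (`x`, `xbar`, `s2`, `s`) are chosen for illustration only.

```r
# Female birth weights in grams, copied from the table above
x <- c(2474, 2547, 2830, 3219, 3429, 3448, 3677, 3872, 4001, 4116)
n <- length(x)

# "Manual" computation, one step at a time
xbar <- sum(x) / n                    # sample mean
s2   <- sum((x - xbar)^2) / (n - 1)   # sample variance (divide by n - 1)
s    <- sqrt(s2)                      # sample standard deviation

# Inbuilt functions for comparison
c(manual = xbar, inbuilt = mean(x))   # both about 3361.3
c(manual = s2,   inbuilt = var(x))    # both about 344920.5
c(manual = s,    inbuilt = sd(x))     # both about 587.2993
```

Read together: a typical female baby in the sample weighs about 3.4 kg, and individual weights typically deviate from that by roughly 0.6 kg.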
b) Males
Compute the same summary statistics of the male births. Compare and explain differences with the results for the female births.
- sample mean
- variance
- standard deviation
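- On average, the male infants are 113.9 g heavier than the female infants (3475.2 g versus 3361.3 g). The male weights also vary less: their standard deviation is 55.1732 g smaller than that of the females (532.1261 g versus 587.2993 g).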
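The comparison between the two groups can be reproduced with a short sketch in R. The vectors `x` (females) and `y` (males) follow the labels in the table; everything else is illustrative naming.

```r
# Birth weights in grams, copied from the table above
x <- c(2474, 2547, 2830, 3219, 3429, 3448, 3677, 3872, 4001, 4116)  # females
y <- c(2844, 2863, 2963, 3239, 3379, 3449, 3582, 3926, 4151, 4356)  # males

# Summary statistics for the males
mean(y)   # about 3475.2
var(y)    # about 283158.2
sd(y)     # about 532.1261

# Differences between the groups
mean_diff <- mean(y) - mean(x)   # males are heavier on average, about 113.9 g
sd_diff   <- sd(x) - sd(y)       # females vary more, about 55.17 g
c(mean_diff = mean_diff, sd_diff = sd_diff)
```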
c) The five quartiles
Find the five quartiles for each sample — and draw the two box plots with pen and paper (i.e. not using R)
- Females
- Males
- Boxplot:
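To check the hand-computed quartiles, a sketch in R follows. It assumes that the course's quantile definition (use the next observation when np is not an integer, average two neighbours when it is) corresponds to `type = 2` in `quantile()`; the default `type = 7` interpolates and gives slightly different values.

```r
# Birth weights in grams, copied from the table above
x <- c(2474, 2547, 2830, 3219, 3429, 3448, 3677, 3872, 4001, 4116)  # females
y <- c(2844, 2863, 2963, 3239, 3379, 3449, 3582, 3926, 4151, 4356)  # males

# The five quartiles: 0%, 25%, 50%, 75%, 100%
# type = 2 is an assumption about which definition the course uses
qx <- quantile(x, type = 2)
qy <- quantile(y, type = 2)

qx   # 2474, 2830, 3438.5, 3872, 4116
qy   # 2844, 2963, 3414, 3926, 4356
```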
d) Interquartile range (IQR)
Are there any “extreme” observations in the two samples (use the modified box plot definition of extremeness)?
- Females
There are no observations lower than 1267 (Q1 − 1.5·IQR = 2830 − 1563) or higher than 5435 (Q3 + 1.5·IQR = 3872 + 1563), so there are no "extreme" observations.
- Males
There are no observations lower than 1518.5 or higher than 5370.5, so there are no "extreme" observations.
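The limits used above can be recomputed with a small sketch in R. The helper `fences()` is a hypothetical name introduced here, and `type = 2` is again an assumption about the course's quartile definition.

```r
# Birth weights in grams, copied from the table above
x <- c(2474, 2547, 2830, 3219, 3429, 3448, 3677, 3872, 4001, 4116)  # females
y <- c(2844, 2863, 2963, 3239, 3379, 3449, 3582, 3926, 4151, 4356)  # males

# Hypothetical helper: modified box plot limits Q1 - 1.5*IQR and Q3 + 1.5*IQR
fences <- function(v) {
  q   <- quantile(v, probs = c(0.25, 0.75), type = 2, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - 1.5 * iqr, upper = q[2] + 1.5 * iqr)
}

fx <- fences(x)   # lower 1267,   upper 5435
fy <- fences(y)   # lower 1518.5, upper 5370.5

# TRUE would mean that at least one observation is "extreme"
extreme_x <- any(x < fx["lower"] | x > fx["upper"])
extreme_y <- any(y < fy["lower"] | y > fy["upper"])
c(females = extreme_x, males = extreme_y)
```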
e) The coefficient of variation
What are the coefficients of variation in the two groups?
- Females: $$ V = \frac{587.2993}{3361.3} = 0.1747 $$
- Males: $$ V = \frac{532.1261}{3475.2} = 0.1531 $$
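The coefficient of variation is the standard deviation divided by the mean, so both groups can be checked with a few lines of R. The helper name `cv()` is hypothetical; base R has no inbuilt function for this.

```r
# Birth weights in grams, copied from the table above
x <- c(2474, 2547, 2830, 3219, 3429, 3448, 3677, 3872, 4001, 4116)  # females
y <- c(2844, 2863, 2963, 3239, 3379, 3449, 3582, 3926, 4151, 4356)  # males

# Hypothetical helper: coefficient of variation V = s / mean
cv <- function(v) sd(v) / mean(v)

cv_x <- cv(x)   # about 0.1747
cv_y <- cv(y)   # about 0.1531
round(c(females = cv_x, males = cv_y), 4)
```

Because V is unitless, it allows the spread of the two groups to be compared relative to their different mean weights.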
1.2 Course grades
To compare the difficulty of 2 different courses at a university, the following grade distributions (given as the number of pupils who achieved each grade) were registered:
Grade | Course 1 | Course 2 | Total |
---|---|---|---|
Grade 12 | 20 | 14 | 34 |
Grade 10 | 14 | 14 | 28 |
Grade 7 | 16 | 27 | 43 |
Grade 4 | 20 | 22 | 42 |
Grade 2 | 12 | 27 | 39 |
Grade 0 | 16 | 17 | 33 |
Grade -3 | 10 | 22 | 32 |
a) Median
What is the median of the 251 achieved grades?
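The median is 4. With n = 251 the median is the 126th observation in sorted order. Counting upward from the bottom row of the table (the lowest grade), the cumulative totals are 32, 65, 104 and 146, so the 126th observation falls among the pupils with grade 4.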
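The median can be verified in R by expanding the frequency table into one value per pupil. This sketch assumes the lowest row of the table is the grade -3 (the lowest grade on the Danish 7-point scale); the vector names are illustrative.

```r
# Grades and the "Total" column of the table (lowest grade assumed to be -3)
grade <- c(12, 10, 7, 4, 2, 0, -3)
total <- c(34, 28, 43, 42, 39, 33, 32)

# One entry per pupil: each grade repeated as many times as it was achieved
all_grades <- rep(grade, times = total)

n_pupils <- length(all_grades)   # 251
med      <- median(all_grades)   # 4
c(n = n_pupils, median = med)
```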
b) Quartiles and IQR
What are the quartiles and the IQR (Inter Quartile Range)?
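A possible way to obtain the quartiles and the IQR in R is sketched below. It assumes the lowest row of the table is the grade -3 and that the course's quantile definition corresponds to `type = 2`; for these data the default `type = 7` happens to give the same quartiles.

```r
# Grades and the "Total" column of the table (lowest grade assumed to be -3)
grade <- c(12, 10, 7, 4, 2, 0, -3)
total <- c(34, 28, 43, 42, 39, 33, 32)
all_grades <- rep(grade, times = total)

# Lower quartile, median and upper quartile
q <- quantile(all_grades, probs = c(0.25, 0.50, 0.75), type = 2, names = FALSE)
q                       # 0, 4, 7

# Interquartile range: Q3 - Q1
iqr <- q[3] - q[1]
iqr                     # 7
```

By hand: 251 · 0.25 = 62.75, so Q1 is the 63rd sorted observation (a 0), and 251 · 0.75 = 188.25, so Q3 is the 189th sorted observation (a 7).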
1.3 Cholesterol
In a clinical trial of a cholesterol-lowering agent, 15 patients’ cholesterol (in mmol/L) was measured before treatment and 3 weeks after starting treatment. Data is listed in the following table:
Patient No. | Before | After |
---|---|---|
1 | 9.1 | 8.2 |
2 | 8.0 | 6.4 |
3 | 7.7 | 6.6 |
4 | 10.0 | 8.5 |
5 | 9.6 | 8.0 |
6 | 7.9 | 5.8 |
7 | 9.0 | 7.8 |
8 | 7.1 | 7.2 |
9 | 8.3 | 6.7 |
10 | 9.6 | 9.8 |
11 | 8.2 | 7.1 |
12 | 9.2 | 7.7 |
13 | 7.3 | 6.0 |
14 | 8.5 | 6.6 |
15 | 9.5 | 8.4 |
a) Medians
What is the median of the cholesterol measurements for the patients before treatment, and similarly after treatment?
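With n = 15 the median is the 8th observation in sorted order. A short sketch in R shows both the sorted-vector approach and the inbuilt function; the vector names `before` and `after` match the table columns.

```r
# Cholesterol in mmol/L, copied from the table above
before <- c(9.1, 8.0, 7.7, 10.0, 9.6, 7.9, 9.0, 7.1, 8.3, 9.6,
            8.2, 9.2, 7.3, 8.5, 9.5)
after  <- c(8.2, 6.4, 6.6, 8.5, 8.0, 5.8, 7.8, 7.2, 6.7, 9.8,
            7.1, 7.7, 6.0, 6.6, 8.4)

# "Manual": the 8th of the 15 sorted values
med_before_manual <- sort(before)[8]   # 8.5
med_after_manual  <- sort(after)[8]    # 7.2

# Inbuilt function
med_before <- median(before)           # 8.5
med_after  <- median(after)            # 7.2
c(before = med_before, after = med_after)
```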
b) Standard deviation
Find the standard deviations of the cholesterol measurements of the patients before and after treatment.
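The two standard deviations can be computed as in the sketch below, first from the definition and then with `sd()`. The helper `sd_manual()` is a hypothetical name used only for this illustration.

```r
# Cholesterol in mmol/L, copied from the table above
before <- c(9.1, 8.0, 7.7, 10.0, 9.6, 7.9, 9.0, 7.1, 8.3, 9.6,
            8.2, 9.2, 7.3, 8.5, 9.5)
after  <- c(8.2, 6.4, 6.6, 8.5, 8.0, 5.8, 7.8, 7.2, 6.7, 9.8,
            7.1, 7.7, 6.0, 6.6, 8.4)

# Hypothetical helper: standard deviation straight from the definition
sd_manual <- function(v) sqrt(sum((v - mean(v))^2) / (length(v) - 1))

s_before <- sd_manual(before)   # about 0.902
s_after  <- sd_manual(after)    # about 1.090

# Inbuilt function for comparison
c(before = sd(before), after = sd(after))
```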
c) Sample covariance
Find the sample covariance between cholesterol measurements of the patients before and after treatment.
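By Definition 1.18, the sample covariance is $$ s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) $$ where x denotes the measurements before treatment and y the measurements after treatment.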
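The sample covariance can be checked with a few lines of R: the sum of the products of paired deviations from the two means, divided by n − 1. The name `s_xy` is illustrative.

```r
# Cholesterol in mmol/L, copied from the table above
before <- c(9.1, 8.0, 7.7, 10.0, 9.6, 7.9, 9.0, 7.1, 8.3, 9.6,
            8.2, 9.2, 7.3, 8.5, 9.5)
after  <- c(8.2, 6.4, 6.6, 8.5, 8.0, 5.8, 7.8, 7.2, 6.7, 9.8,
            7.1, 7.7, 6.0, 6.6, 8.4)
n <- length(before)

# "Manual": products of paired deviations, summed and divided by n - 1
s_xy <- sum((before - mean(before)) * (after - mean(after))) / (n - 1)
s_xy                  # about 0.796

# Inbuilt function for comparison
cov(before, after)
```

The positive value says that patients with high cholesterol before treatment also tend to have high cholesterol after treatment.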
d) Correlation
Find the sample correlation between cholesterol measurements of the patients before and after treatment.
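The sample correlation is the covariance scaled by the two standard deviations, which the following R sketch computes both ways. The name `r_xy` is illustrative.

```r
# Cholesterol in mmol/L, copied from the table above
before <- c(9.1, 8.0, 7.7, 10.0, 9.6, 7.9, 9.0, 7.1, 8.3, 9.6,
            8.2, 9.2, 7.3, 8.5, 9.5)
after  <- c(8.2, 6.4, 6.6, 8.5, 8.0, 5.8, 7.8, 7.2, 6.7, 9.8,
            7.1, 7.7, 6.0, 6.6, 8.4)

# "Manual": covariance divided by the product of the standard deviations
r_xy <- cov(before, after) / (sd(before) * sd(after))
r_xy                  # about 0.81

# Inbuilt function for comparison
cor(before, after)
```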
e) Differences
Compute the 15 differences (Dif = Before − After) and do various summary statistics and plotting of these: sample mean, sample variance, sample standard deviation, boxplot etc.
before <- c(9.1, 8.0, 7.7, 10.0, 9.6, 7.9, 9.0, 7.1, 8.3, 9.6,
8.2, 9.2, 7.3, 8.5, 9.5)
after <- c(8.2, 6.4, 6.6, 8.5, 8.0, 5.8, 7.8, 7.2, 6.7, 9.8,
7.1, 7.7, 6.0, 6.6, 8.4)
diffBeforeAfter <- before - after
mean(diffBeforeAfter)
median(diffBeforeAfter)
var(diffBeforeAfter)
sd(diffBeforeAfter)
quantile(diffBeforeAfter) # quartiles: the 0%, 25%, 50%, 75% and 100% percentiles
boxplot(diffBeforeAfter, col = "red")
text(1.3, quantile(diffBeforeAfter), c("Minimum","Q1","Median","Q3","Maximum"),
col="blue")
f) Formal answer
Observing such data the big question is whether an average decrease in cholesterol level can be “shown statistically”. How to formally answer this question is presented in Chapter 3, but consider now which summary statistics and/or plots you would look at to get some idea of what the answer will be.
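I would answer something like this: the middle half of the patients lowered their cholesterol by between 1.1 and 1.6 mmol/L (the first and third quartiles of the differences, so the interquartile range is 0.5). The extremes are an increase of 0.2 (difference −0.2) and a decrease of 2.1. The mean decrease is 1.21 and the median is 1.3, both clearly above zero, which suggests that the treatment lowers cholesterol on average.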
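One informal way to back up this answer is to count how many patients actually improved, alongside the summary statistics of the differences. The sketch below is only a descriptive check, not the formal test from Chapter 3; the name `dif` follows the exercise text.

```r
# Cholesterol in mmol/L, copied from the table above
before <- c(9.1, 8.0, 7.7, 10.0, 9.6, 7.9, 9.0, 7.1, 8.3, 9.6,
            8.2, 9.2, 7.3, 8.5, 9.5)
after  <- c(8.2, 6.4, 6.6, 8.5, 8.0, 5.8, 7.8, 7.2, 6.7, 9.8,
            7.1, 7.7, 6.0, 6.6, 8.4)
dif <- before - after

# How many of the 15 patients had lower cholesterol after treatment?
n_decreased <- sum(dif > 0)
n_decreased          # 13

# Centre and spread of the differences
mean_dif <- mean(dif)   # about 1.21
sd_dif   <- sd(dif)     # about 0.64
summary(dif)
```

A mean decrease that is almost twice the standard deviation of the differences, with 13 of 15 patients improving, points in the same direction as the quartiles.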