Descriptive Statistics
Updated 20 February 2009
“I have realized that data without statistics are just numbers.”
Amy Greenway, Research Assistant (HU 2005)
You are expected to read the corresponding textbook chapters as we cover them in class. Working the exercises at the end of each chapter will enhance your understanding (click for answers to all exercises).
Statistical Basics
A. Definitions
1. variable - characteristic that may differ (vary) among individuals
a. measured
b. derived
2. data - values of variables for individuals (singular: datum)
3. case/observation - data for an individual; symbolized x1, x2, ... xn (n = sample size)
B. Collection of data
1. population - all individuals of a specified part of the universe (cf. biological definition)
2. sample - subset of population; used to make inferences regarding the population (usually unknown)
3. error - 2 sources
a. physical measurement error
b. difference between the real population value and the estimates (from sample data) of the population value
4. randomness - all individuals have equal probability of being sampled (otherwise the sample is biased)
5. independence - value of one case does not affect the value of other cases (exception: e.g., repeated measures on the same individual, which is dependence by design)
C. Scales of measurement and variable types
1. Categorical scale (Nominal)
a. values not quantitative or ranked; no mathematical or value relationship
b. mutually exclusive categories (e.g., male/female)
c. variable type: categorical
2. Ranked scale (Ordinal)
a. relative differences (e.g., greater than/less than)
b. no mathematical relationship between values (e.g., small/medium/large; highly active/active/not active; health status: 1=healthy, 2=ill, 3=sick, 4=hospitalized, 5=dead)
c. variable type: ranked
3. Ratio scale
a. mathematically defined distance between values; quantitative
b. absolute zero point (e.g., mass)
c. variable types:
· Discrete - may assume only certain values within given range (e.g., 1, 2, 3, 4)
· Continuous - may assume any value within given range (e.g., 1.0, 2.0, 2.34, 2.344, etc.)
d. may convert ratio data to ranked/categorical data (not vice versa)
4. Interval scale
a. mathematically defined distance between values; quantitative
b. arbitrary zero point (e.g., Celsius temperature scale)
c. variable types:
· Discrete - may assume only certain values within given range (e.g., 1, 2, 3, 4)
· Continuous - may assume any value within given range (e.g., 1.0, 2.0, 2.34, 2.344, etc.)
d. may convert interval data to ranked/categorical data (not vice versa)
E. SYSTAT Demo Data files
1. windows (output, data, graph); menus
2. data files
a. columns (variables); numerical vs. string (categorical) variables (e.g., SEX vs. SEX$)
b. rows (values of variables [cases, observations, sample size])
3. creating data files (entering and editing data); opening existing data files
4. frequency distributions (plot frequency against variable; note terminology, i.e., Y against X)
a. discrete (ordinal/categorical) data (Tab 3.1, 3.2; Fig. 3-1); continuous data (Tab 3.3, 3.4; Fig. 3-2)
b. SYSTAT (Graph→Histogram); USOPHEO.SYD
5. selecting cases (Data→Select Cases)
6. by groups (Data→By Groups)
7. creating frequency tables (Analyze→One-Way Frequency Tables; Analyze→Tables→Two-Way)
8. calculating an average (Analyze→Basic Statistics)
9. transforming data (Data→Transform→Let)

Introduction to SYSTAT
Prepare a SYSTAT data file using the data below. These data are measurements taken from 10 specimens of spiny guanotzits from Arkansas and Missouri. The variables are: collection locality (categorical), length of body (continuous), sex (categorical), weight of body (continuous), amount of pigment on the lower jaw (ranked), and number of scales on the chin (discrete).
| Case | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| Locality | AR | AR | MO | MO | MO | AR | AR | MO | AR | MO |
| Length (mm) | 22.5 | 21.4 | 20.8 | 20.6 | 19.8 | 20.1 | 22.3 | 21.7 | 20.4 | 21.1 |
| Sex | m | m | f | f | f | f | m | f | m | f |
| Weight (g) | 333 | 298 | 401 | 257 | 21 | 30 | 478 | 400 | 35 | 288 |
| Pigment | 4 | 5 | 5 | 3 | 2 | 1 | 1 | 5 | 4 | 5 |
| No. scales | 23 | 22 | 14 | 26 | 9 | 21 | 17 | 12 | 15 | 12 |
Name your data file first.syd (the file extensions .syz or .syd identify a SYSTAT data file). After you finish entering the data, proofread the file to make sure that the data are correct, edit if necessary, save the file to your M-drive, and close it. Reopen the file and use it to learn the following menus and functions:
1. File Menu (New, Open, Save, Save As, Print, Exit)
2. Edit Menu (Undo, Cut, Copy, Paste, Copy Graph, Delete, Options)
3. Data Menu (Variable properties, Transform, By Groups, Select Cases)
4. Graph Menu (Histogram)
5. Analyze Menu (One-Way Frequency Tables, Basic Statistics, Tables)
Exercises
1. calculate the mean guanotzit weight
2. calculate the mean guanotzit weight separately for males and females
3. calculate the mean weight for guanotzits from
4. draw a histogram of guanotzit lengths
5. transform weight to the common logarithm of weight
6. how many guanotzits from
7. determine the number of guanotzits by scale number and state
8. after you finish #1-7, your instructor may have additional exercises using lonoke.syd
_________________________________________________
Description of Data (from a frequency distribution)
A. Descriptive statistics
1. measures of central tendency (Fig. 4.1)
a. mode - most frequent class (of frequency distribution)
b. median (ordinal or ratio/interval data) - middle class
c. mean (ratio/interval data) = “average”; calculation: $\sum x / n$
2. measures of dispersion - describe the amount that each observation is likely to vary from the mean/median
a. maximum, minimum (range): sensitive to extreme values
b. interquartile range: break ordered data into four equal sections (quartiles, Q1, Q2, Q3); middle 50% of observations (Q3 – Q1; difference between 25th and 75th percentiles)
c. sum of squares (SS): $\sum (x - \bar{x})^2$
d. variance: SS/n (population) or SS/(n - 1) (sample)
e. standard deviation: √variance (Fig. 6-3)
3. symbols for statistics (sample) and parameters (population)
| | Parameter | Statistic |
| Mean | $\mu = \sum x / n$ | $\bar{x} = \sum x / n$ (= “x-bar”) |
| Variance | $\sigma^2 = \sum (x - \mu)^2 / n$ | $s^2 = \sum (x - \bar{x})^2 / (n - 1)$ |
| Standard Deviation | $\sigma = \sqrt{\sigma^2}$ | $s = \sqrt{s^2}$ (= “SD”) |
4. coefficient of variation (CV) - expresses SD as a percent of the mean
a. CV = (SD/$\bar{x}$) × 100
b. used to compare relative variation in one variable between groups with different means
c. example: Note that group 2 is relatively more variable despite a greater SD in group 1.
| | Mean | SD | CV (%) |
| Group 1 | 14.2 | 2.5 | 17.6 |
| Group 2 | 7.2 | 1.8 | 25.0 |
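The CV comparison in the example can be checked with a few lines of Python. This is only an illustrative sketch; the function name `coefficient_of_variation` is made up here and is not part of SYSTAT or the course materials.

```python
def coefficient_of_variation(sd, mean):
    """Express the standard deviation as a percent of the mean."""
    return sd / mean * 100

# (mean, SD) pairs taken from the example table above
groups = {"Group 1": (14.2, 2.5), "Group 2": (7.2, 1.8)}

cv = {name: coefficient_of_variation(sd, mean)
      for name, (mean, sd) in groups.items()}

for name, value in cv.items():
    print(name, round(value, 1))
```

Group 2 has the smaller SD but the larger CV, which is the point of the example.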
5. calculating descriptive univariate statistics
a. personal calculator - assignments:
1) calculate descriptive stats for (13.4, 13.8, 14.2, 17.0, 15.3, 15.8, 14.9, 12.3, 16.2, 16.4): $\bar{x}$ = 14.9, SD = 1.49 [$\sigma$ = 1.41], $s^2$ = 2.22 [$\sigma^2$ = 2.00]
2) calculate descriptive statistics from a frequency distribution (an essential skill). Some calculators have a frequency limit of 100. Check your calculator!
| No. plants/quadrat | Frequency |
| 0 | 35 |
| 1 | 28 |
| 2 | 15 |
| 3 | 10 |
| 4 | 7 |
| 5 | 5 |
$\bar{x}$ = 1.4, SD = 1.48

b. SYSTAT - calculate descriptive statistics from raw data file; Analyze→Basic Statistics
c. Use CAVESALYS.SYD
Data for the following results were selected according to SELECT (LOC$ = AR) AND (SEX$ = F)
| Variable | N of Cases | Minimum | Maximum | Arithmetic Mean | Standard Deviation |
| SVL | 55 | 32.50 | 62.50 | 52.73 | 8.02 |
d. SYSTAT - calculate descriptive statistics from frequency distribution (use no. plants/quadrat data)
1. Step 1: Data→Case Weighting→By Frequency
2. Step 2: Analyze→Basic Statistics
6. reporting sample means; must include measure of dispersion (= “error”) - mean alone not useful; $\bar{x}$ ± error
a. SYSTAT (bar graph with error bars)
b. how to reduce variance? $\sum (x - \bar{x})^2 / n$
- experimental design (select cases): decrease SS (numerator)
- calculation (increase sample size): increase n (denominator)
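As a cross-check on the calculator assignments in item 5a, the sketch below computes the mean, sum of squares, sample variance (SS/(n - 1)), and SD, first from the raw values and then from the plants-per-quadrat frequency distribution. The function names `describe` and `describe_from_frequencies` are illustrative assumptions, not SYSTAT commands.

```python
import math

def describe(values):
    """Return (mean, SS, sample variance, sample SD) for raw data."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((x - mean) ** 2 for x in values)   # sum of squares
    variance = ss / (n - 1)                     # sample variance uses n - 1
    return mean, ss, variance, math.sqrt(variance)

def describe_from_frequencies(freq):
    """Same statistics from a {value: frequency} table."""
    n = sum(freq.values())
    mean = sum(x * f for x, f in freq.items()) / n
    ss = sum(f * (x - mean) ** 2 for x, f in freq.items())
    variance = ss / (n - 1)
    return mean, ss, variance, math.sqrt(variance)

# Raw data from assignment 1
raw = [13.4, 13.8, 14.2, 17.0, 15.3, 15.8, 14.9, 12.3, 16.2, 16.4]
raw_mean, raw_ss, raw_var, raw_sd = describe(raw)
print(round(raw_mean, 1), round(raw_var, 2), round(raw_sd, 2))

# Frequency distribution from assignment 2 (no. plants per quadrat)
plants = {0: 35, 1: 28, 2: 15, 3: 10, 4: 7, 5: 5}
f_mean, f_ss, f_var, f_sd = describe_from_frequencies(plants)
print(round(f_mean, 2), round(f_sd, 2))
```

Dividing `ss` by `n` instead of `n - 1` gives the bracketed population values quoted in assignment 1.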
_____________________________________________
Exercise: Descriptive Statistics and Graphics
_____________________________________________
Goodness-of-Fit (GOF)
1. How well does one shape mimic (“fit”) another shape?
2. Provide shape (e.g., “Draw a square”)
3. Compare with the expected shape (square)
4. Does the observed shape fit the expected shape? (either it fits or it doesn’t fit)
5. Substitute “frequency distribution” for “shape.”
6. How well does an observed frequency distribution fit an expected frequency distribution? (GOF)
Probability Distributions (“Expected distributions”)
(= frequency distribution with probabilities instead of counts) - why do we need to know? - practical use to biologists (observed vs. expected); cannot know if a result is due to chance alone unless we know what the expected is (hypothesis testing - basis for much of this course!); important theoretical frequency distributions:
A. Discrete probability distribution #1 - Binomial (mutually exclusive categories; either/or); e.g., male/female, red/white, red/not red
1. Example: 1 coin toss - possibilities: 1H, 1T
a. probabilities: no. ways an event (i.e., H or T) can occur / total no. events (i.e., 2) possible; “division” rule; 1H [1/2] = 0.5; 1T [1/2] = 0.5
b. add all probabilities = 1 [0.5 + 0.5 = 1]
c. probability distribution shape (Fig. 5-1)
2. Example: 2 coin toss - possibilities: 2H, 1H1T, 1T1H, 2T (mutually exclusive, independent events)
a. probabilities:
1) simultaneous events (“and” rule, multiply): 2H [0.5 x 0.5] = 0.25; 2T [0.5 x 0.5] = 0.25
2) alternative events (“or” rule, add): 1H1T or 1T1H [0.5 x 0.5] + [0.5 x 0.5] = 0.5
b. add all probabilities [0.25 + 0.5 + 0.25 = 1]
c. probability distribution shape (Fig. 5.3)
3. binomial formula: $P(x) = \frac{n!}{x!(n-x)!} p^x q^{n-x}$
4. terms
p = probability of event of interest = head (“success”)
q = probability of other event (1 - p) = not head (“failure”)
n = number of “simultaneous” events (trials); Note: “n” is signified by “k” in H&H (p 31)
x = number of occurrences of the event of interest (successes)
EXAMPLE: A reproductive physiologist counted the number of males in 247 litters of four siblings each in a species of Dimetrodon (observed frequency in table below). Based on the theory of sex determination in mammals (equal chance of being male or female; Fig. 5-1), calculate the expected frequencies for the number of males in these litters. Note the IF-THEN rationale.
| No. males | Expected proportion | Expected number (frequency) |
| 0 | $\frac{4!}{0!(4-0)!} \times 0.5^0 \times 0.5^{4-0} = 0.0625$ | 0.0625 x 247 = 15.438 |
| 1 | $\frac{4!}{1!(4-1)!} \times 0.5^1 \times 0.5^{4-1} = 0.2500$ | 0.2500 x 247 = 61.750 |
| 2 | $\frac{4!}{2!(4-2)!} \times 0.5^2 \times 0.5^{4-2} = 0.3750$ | 0.3750 x 247 = 92.625 |
| 3 | $\frac{4!}{3!(4-3)!} \times 0.5^3 \times 0.5^{4-3} = 0.2500$ | 0.2500 x 247 = 61.750 |
| 4 | $\frac{4!}{4!(4-4)!} \times 0.5^4 \times 0.5^{4-4} = 0.0625$ | 0.0625 x 247 = 15.438 |
| Total | 1.0 | 247 |
- Calculating probabilities with SYSTAT (Utilities→Probability Calculator→Univariate Discrete)
Question: Is sex ratio in Dimetrodon determined by a mechanism similar to that of mammals?
| No. males | Obs. freq. | Exp. freq. |
| 0 | 7 | 15.438 |
| 1 | 24 | 61.750 |
| 2 | 93 | 92.625 |
| 3 | 99 | 61.750 |
| 4 | 24 | 15.438 |
| Total | 247 | 247 |
Conclusion: because of the large deviations between the expected and observed numbers, we reject the idea of there being equal chances of having each sex.
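The expected frequencies in the Dimetrodon example follow directly from the binomial formula. The sketch below is one way to reproduce them in Python, offered only as an alternative to a hand calculation or the SYSTAT probability calculator; `binomial_probability` is a name chosen for this illustration.

```python
from math import comb

def binomial_probability(x, n, p):
    """P(x) = n!/(x!(n-x)!) * p^x * q^(n-x), with q = 1 - p."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)

litters = 247                    # number of litters observed
n = 4                            # siblings per litter (trials)
p = 0.5                          # IF sex is determined as in mammals
observed = [7, 24, 93, 99, 24]   # litters with 0..4 males

# Expected number of litters = expected proportion x total litters
expected = [binomial_probability(x, n, p) * litters for x in range(n + 1)]

for x, (obs, exp) in enumerate(zip(observed, expected)):
    print(x, obs, round(exp, 3))
```

The same function handles the kaw turtle exercise by changing `litters` to the number of nests.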
5. importance of sample size for observed data (1 coin example, compare to theoretical)
- IF the observed coin is a normal (fair) coin, THEN the larger the n, the more closely the observed distribution approximates the expected
- conversely: the smaller the n, the more the observed tends to deviate from the expected
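The effect of sample size can be illustrated with a small simulation of a fair coin. This is a sketch under stated assumptions: the seed and the sample sizes are arbitrary choices, and `proportion_heads` is a hypothetical helper.

```python
import random

def proportion_heads(n_tosses, rng):
    """Toss a fair coin n_tosses times and return the proportion of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

rng = random.Random(1)  # fixed seed so the run is repeatable (arbitrary value)

# Observed proportion of heads for increasing sample sizes
results = {n: proportion_heads(n, rng) for n in (10, 100, 10_000, 100_000)}

for n, prop in results.items():
    # The expected proportion is 0.5; print the absolute deviation from it
    print(n, prop, abs(prop - 0.5))
```

Any single small sample may happen to land close to 0.5, so the trend is a tendency across many runs rather than a guarantee for each one.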
_______________________________________________
Exercise: Binomial Distribution
Assuming that the sex of hatchling turtles is determined by a particular combination of chromosomes as in mammals (i.e., an XX, XY system), fill in the expected frequencies below:
Number of male hatchlings emerging from 84 nests of kaw turtles (kaw turtles always lay 4 eggs per nest).
| No. of Males | Observed No. Nests | Expected No. Nests |
| 0 | 23 | |
| 1 | 25 | |
| 2 | 19 | |
| 3 | 10 | |
| 4 | 7 | |
ans: exp - 5.25, 21.0, 31.5, 21.0, 5.25
Compare the observed and expected
frequencies. Do these data support the hypothesis
that sex of hatchlings is genetically determined? Support your conclusion.
_____________________________________________
B. Discrete probability distribution #2 - Poisson (expected distribution for rare and random [independent] events)
1. Poisson: $\mu = \sigma^2$ (distribution defined by mean only)
- low mean [relatively rare events]
- probability distribution shape (Fig); moves toward binomial shape at higher means
2. e.g., recapture rates of ectotherm animals
3. Poisson formula: $P(x) = \frac{\bar{x}^x e^{-\bar{x}}}{x!}$
4. terms
- $\bar{x}$ = mean occurrence of event of interest
- e = mathematical constant (2.71828)
- x = number of occurrences of the event of interest
EXAMPLE 1: An ecologist counted the number of maple seedlings in 100 quadrats. Using the mean calculated from the observed frequency distribution of maple seedlings per quadrat in the table below ($\bar{x}$ = 1.41), calculate the expected frequencies assuming that a seedling occurring in a quadrat is a random event. Note the IF-THEN rationale.
| No. seedlings | Expected proportion | Expected number (frequency) |
| 0 | $(1.41^0 e^{-1.41})/0! = 0.244$ | 0.244 x 100 = 24.4 |
| 1 | $(1.41^1 e^{-1.41})/1! = 0.344$ | 0.344 x 100 = 34.4 |
| 2 | $(1.41^2 e^{-1.41})/2! = 0.243$ | 0.243 x 100 = 24.3 |
| 3 | $(1.41^3 e^{-1.41})/3! = 0.114$ | 0.114 x 100 = 11.4 |
| 4 | $(1.41^4 e^{-1.41})/4! = 0.040$ | 0.040 x 100 = 4.0 |
| 5 | $(1.41^5 e^{-1.41})/5! = 0.011$ | 0.011 x 100 = 1.1 |
| Total | 1.0 | 100 |
- Calculating probabilities with SYSTAT (Utilities→Probability Calculator→Univariate Discrete)
Question: Do seedlings occur randomly in quadrats?
| No. plants | Obs. freq. | Exp. freq. |
| 0 | 35 | 24.4 |
| 1 | 28 | 34.4 |
| 2 | 15 | 24.3 |
| 3 | 10 | 11.4 |
| 4 | 7 | 4.0 |
| 5 | 5 | 1.1 |
| Total | 100 | 99.6 |
Conclusions:
1. Rare (low value of mean)
2. Random (compare observed and expected distributions)
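The Poisson expected frequencies for the maple seedling example can be reproduced with the short sketch below. It computes the mean from the observed frequency distribution and then applies the Poisson formula; the function name `poisson_probability` is an illustrative choice, not a SYSTAT command.

```python
from math import exp, factorial

def poisson_probability(x, mean):
    """P(x) = mean^x * e^(-mean) / x!"""
    return mean**x * exp(-mean) / factorial(x)

# Observed number of quadrats containing 0..5 seedlings
observed = {0: 35, 1: 28, 2: 15, 3: 10, 4: 7, 5: 5}

n_quadrats = sum(observed.values())
mean = sum(x * f for x, f in observed.items()) / n_quadrats

# Expected number of quadrats = expected proportion x total quadrats
expected = {x: poisson_probability(x, mean) * n_quadrats for x in observed}

for x in observed:
    print(x, observed[x], round(expected[x], 1))
```

The expected values sum to slightly less than 100 because counts above 5 seedlings have a small but nonzero probability.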
__________________________________________
Exercise: Poisson Distribution
Assuming that being killed by a horse is a rare and random event, fill in the expected frequencies below:
Men killed by being kicked by a horse in the Prussian Army Corps.
| No. killed/yr/corps | Observed Number | Expected Number |
| 0 | 109 | |
| 1 | 65 | |
| 2 | 22 | |
| 3 | 3 | |
| 4 | 1 | |
| Total | 200 | 200 |
$\bar{x}$ = ________ (ans: 0.610)   $s^2$ = ________ (ans: 0.611)   $s^2/\bar{x}$ = _______ (ans: 1.002)
ans: exp - 108.67, 66.29, 20.22, 4.11, 0.63
Compare the observed and expected frequencies. Do these data support the hypothesis that the chance of being killed by a horse in the Prussian Army Corps is a rare and random event? Support your conclusion.
_____________________________________________
Exercise: Testing Your Concept of Randomness
1. Obtain a 10x10 grid
2. Draw 100 dots on grid (keep eyes open, try to draw dots in a random manner)
3. Calculate descriptive statistics on no. of dots per cell
4. Calculate variance/mean ratio
5. $s^2/\bar{x} = 1$ (random); $s^2/\bar{x} < 1$ (evenly spaced); $s^2/\bar{x} > 1$ (clumped)
6. How large must the deviation be before we reject the idea of randomness?
7. Application: ecological dispersion patterns
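The variance/mean ratio from steps 3 to 5 can be sketched as follows. The two grids are hypothetical extremes invented for illustration (they are not class data), and `variance_mean_ratio` is an assumed helper name.

```python
def variance_mean_ratio(counts):
    """Sample variance (SS/(n - 1)) divided by the mean of counts per cell."""
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return variance / mean

# Hypothetical 10x10 grids, flattened to 100 cells, each holding 100 dots
even_grid = [1] * 100                 # exactly one dot in every cell
clumped_grid = [10] * 10 + [0] * 90   # all dots packed into ten cells

even_ratio = variance_mean_ratio(even_grid)
clumped_ratio = variance_mean_ratio(clumped_grid)

print(even_ratio)      # below 1: evenly spaced
print(clumped_ratio)   # above 1: clumped
```

A hand-drawn "random" grid usually falls somewhere between these extremes, which is what step 6 asks you to think about.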
_____________________________________________
Continuous Probability Distribution
“Normal” distribution; very important frequency distribution in statistics for 2 reasons:
A. Data that are influenced by many small and unrelated random effects are approximately normally distributed (“Fuzzy Central Limit Theorem”); extremely widespread and common in nature
B. Forms the conceptual basis of a large number of statistical procedures - one of the most important theoretical distributions in statistics
C. Properties
1. formula: $\frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
2. shape determined by mean and SD
3. symmetrical around the mean (mean = mode = median); Fig. 4.1
4. $\bar{x} \pm 1$ SD = approx. 68% of cases; $\pm 2$ SD = approx. 95% (Fig. 6-3)
D. Standard normal distribution
1. many different “normal” distributions (Fig. 6.2)
2. standardize any normal distribution (directly compare)
3. express individual cases in terms of SND; $z = (x - \bar{x})/s$; “z-score”
4. z-score = distance from mean in standard deviation units; e.g., z = 1 (= 1 SD greater than the mean)
5. Table - areas of normal curve
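The z-score and the "areas of the normal curve" can be explored with Python's standard library. In this sketch the mean and SD are the SVL values reported earlier in these notes (52.73 and 8.02), while the individual measurement 60.75 is a made-up value chosen to sit one SD above the mean.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Distance of x from the mean in standard deviation units."""
    return (x - mean) / sd

# Hypothetical individual measured against the SVL sample (mean 52.73, SD 8.02)
z = z_score(60.75, 52.73, 8.02)
print(round(z, 2))

# Areas under the standard normal curve (mean 0, SD 1)
std_normal = NormalDist()
within_1sd = std_normal.cdf(1) - std_normal.cdf(-1)
within_2sd = std_normal.cdf(2) - std_normal.cdf(-2)
print(round(within_1sd, 3), round(within_2sd, 3))
```

The two areas correspond to the approximate 68% and 95% figures listed under Properties.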
E. Testing observed data for normality; SYSTAT output (TREAT.SYD, EGGWGT)
1. qualitative: Probability plot (Graph→Distribution Plots→Probability Plot); select cases
2. quantitative: Kolmogorov-Smirnov Test; normal if probability > 0.05 (“skewed” if P < 0.05)
a. SYSTAT path: Analyze→Fitting distributions→Continuous (Enter selected variable and distribution)
Kolmogorov-Smirnov One Sample Test using Normal(0.00, 1.00) Distribution
Data for the following results were selected according to SELECT (EGGWGT > 5)
| Variable | N of Cases | Maximum | Lilliefors |
| EGGWGT | 245 | 0.143 | 0.000 |

| Variable | N of Cases | Maximum | Lilliefors |
| EGGWGT | 235 | 0.058 | 0.052 |
F. In some circumstances, a discrete distribution (e.g., binomial) may closely approximate a normal distribution and thus it may be treated as if it were a normal distribution of continuous data (why? easier to work with)
1. rule of thumb for binomial: whenever n (“k” in H&H) is big enough to make the number of expected successes and failures both at least five, i.e., when np ≥ 5 and n(1-p) ≥ 5
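The rule of thumb reduces to a two-part check. The sketch below encodes it literally as np ≥ 5 and n(1-p) ≥ 5; the function name and the trial values are hypothetical.

```python
def normal_approximation_ok(n, p):
    """True when expected successes (np) and failures (n(1-p)) are both >= 5."""
    return n * p >= 5 and n * (1 - p) >= 5

# Illustrative checks
small_litter = normal_approximation_ok(4, 0.5)      # np = 2: too few trials
ten_tosses = normal_approximation_ok(10, 0.5)       # np = 5 and n(1-p) = 5
rare_event = normal_approximation_ok(100, 0.02)     # np = 2: event too rare

print(small_litter, ten_tosses, rare_event)
```

Note that a large n alone is not enough: a rare event (small p) can still fail the check.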

____________________________________________________
Exercise
1. Practice SYSTAT Probability Plot and One-sample KS Test using the variable H2OOUT from file DLWMEANS.SYD. Note that H2OOUT is not normally distributed.
2. Data transformations - many procedures in statistics assume that the data are normally distributed. If data are not normally distributed, one can transform the data to another measurement scale in an effort to normalize them. Deciding which transformation to use is entirely practical, i.e., the “right” transformation is whatever makes the data normally distributed. Trial-and-error applications of various transformations may be necessary to determine which will work. However, some transformations work better in some situations than in others. Examples of transformations commonly used in biology are the logarithmic and arcsine transformations.
a. logarithmic transformation - the logarithmic transformation (either natural logs [base e] or common logs [base 10]) is useful in a wide variety of situations and is by far the most commonly used data transformation in biology
b. arcsine (inverse sine) transformations are used specifically for proportions and percentages (which generally tend to be non-normal if many of the observations fall outside the 30-70% range)
3. Transform the variable H2OOUT above with common logarithms and retest for normality with both Probability Plot and One-Sample KS. Note that the SYSTAT designation for common logs is L10 and that for natural logs is LOG (always use common logs in Biol. 254). After transformation, the new variable logH2OOUT should now be normal (always create a NEW variable name for the transformed variable).
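For readers who want to see the two transformations outside SYSTAT, here is a minimal Python sketch. The data values are invented for illustration (they are not taken from DLWMEANS.SYD), and the arcsine square-root form used in `arcsine_sqrt` is one common convention assumed here; the notes above say only "arcsine (inverse sine)".

```python
import math

# Hypothetical right-skewed measurements (NOT the real H2OOUT data)
h2o_out = [12.0, 15.5, 20.1, 48.0, 150.0]

# Common (base 10) log transform, stored under a NEW variable name
log_h2o_out = [math.log10(v) for v in h2o_out]

def arcsine_sqrt(proportion):
    """Arcsine square-root transform of a proportion between 0 and 1 (radians)."""
    return math.asin(math.sqrt(proportion))

# Hypothetical proportions, several outside the 30-70% range
proportions = [0.05, 0.20, 0.50, 0.90]
transformed = [arcsine_sqrt(p) for p in proportions]

print([round(v, 3) for v in log_h2o_out])
print([round(v, 3) for v in transformed])
```

`math.log10` corresponds to SYSTAT's L10 and `math.log` to SYSTAT's LOG; normality still has to be retested after the transform.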
_______________________________________________
Categories of statistical tests
1. Parametric – ratio/interval, continuous/discrete*; rigid assumptions (including normal distribution of data); “powerful”
- examples: independent samples t-test, paired samples t-test, Bartlett's, analysis of variance, Pearson correlation, regression
2. Non-parametric – all scales and variable types; fewer assumptions; less powerful
- examples: one-sample KS, Mann-Whitney, Wilcoxon, Kruskal-Wallis, goodness-of-fit, Levene’s, Spearman correlation, test of independence
__________________________
Exam 1 to here
__________________________