Descriptive Statistics

 Updated 20 February 2009

  “I have realized that data without statistics are just numbers.”
 
Amy Greenway, Research Assistant (HU 2005)

 

 
 


 

 

 

You are expected to read the corresponding textbook chapters as we cover them in class. 

Working the exercises at the end of each chapter will enhance your understanding (answers to all exercises are available).

    

Statistical Basics

A.   Definitions

1.     variable - characteristics that may differ (vary) among individuals

a.     measured

b.     derived

2.     data - values of variables for individuals (singular datum)

3.     case/observation - data for an individual; symbolize: x1, x2, ...xn (n=sample size)

B.   Collection of data

1.     population - all individuals of a specified part of the universe (cf biological definition)

2.     sample - subset of population; used to make inferences regarding the population (usually unknown)

3.     error - 2 sources

a.     physical measurement error

b.     difference between the real population value and the estimates (from sample data) of the population value

4.     randomness - all individuals have equal probability of being sampled (otherwise the sample is biased)

5.     independence - value of one case does not affect the value of other cases (exception e.g., repeated measures on same individual – dependence by design)

     C. Scales of measurement and variable types

          1. Categorical scale (Nominal)

a.     values not quantitative or ranked; no mathematical or value relationship

b.     mutually exclusive categories (e.g., male/female)

c.      variable type: categorical

2.     Ranked scale (Ordinal)

a.     relative differences (e.g., greater than/less than)

b.     no mathematical relationship between values (e.g., small/medium/large; highly active/active/not active; health status: 1=healthy, 2=ill, 3=sick, 4=hospitalized, 5=dead)

c.      variable type: ranked

3.     Ratio scale

a.     mathematically defined distance between values; quantitative

b.     absolute zero point (e.g., mass)

c.      variable types:

·        Discrete - may assume only certain values within given range (e.g., 1, 2, 3, 4)

·        Continuous - may assume any value within given range (e.g., 1.0, 2.0, 2.34, 2.344, etc.)

d.     may convert ratio data to ranked/categorical data (not vice versa)

4.     Interval scale

a.     mathematically defined distance between values; quantitative

b.     arbitrary zero point (e.g., Celsius temperature scale)

c.      variable types:

·        Discrete - may assume only certain values within given range (e.g., 1, 2, 3, 4)

·        Continuous - may assume any value within given range (e.g., 1.0, 2.0, 2.34, 2.344, etc.)

d.     may convert interval data to ranked/categorical data (not vice versa)

 

  1. Identify variables, variable types, and measurement scale (practice problems)

E.    SYSTAT Demo Data files

1.     windows (output, data, graph); menus

2.     data files

a.     columns (variables); numerical vs. string (categorical) variables (e.g., SEX vs. SEX$)

b.     rows (values of variables [cases, observations, sample size])

3.     creating data files (entering and editing data); opening existing data files

4.     frequency distributions (plot frequency against variable; note terminology, i.e., Y against X)

a.     discrete (ordinal/categorical) data (Tab 3.1, 3.2; Fig. 3-1); continuous data (Tab 3.3, 3.4; Fig. 3-2)

b.     SYSTAT (Graph → Histogram); USOPHEO.SYD

5.     selecting cases (Data → Select Cases)

6.     by groups (Data → By Groups)

7.     creating frequency tables (Analyze → One-Way Frequency Tables; Analyze → Tables → Two-Way)

8.     calculating an average (Analyze → Basic Statistics)

9.     transforming data (Data → Transform → Let)

FYI - commonly used commercial statistical software:
•	SAS (the most widely used and respected statistical software); high learning curve
•	SYSTAT (a nice mix of statistical and graphic capabilities); student version available (=MYSTAT)
•	SPSS (commonly used by social scientists)
•	MINITAB (commonly used in education; poor graphics)
•	JMP (a more visual and user-friendly version of SAS)
•	Some spreadsheet, graphing, and mathematical software packages have limited statistical capabilities (e.g., Excel, SigmaPlot, MATLAB)
•	For further information, go to http://en.wikipedia.org/wiki/Comparison_of_statistical_packages

 

Introduction to SYSTAT

Prepare a SYSTAT data file using the data below.  These data are measurements taken from 10 specimens of spiny guanotzits from Arkansas and Missouri.  The variables are: collection locality (categorical), length of body (continuous), sex (categorical), weight of body (continuous), amount of pigment on the lower jaw (ranked), and number of scales on the chin (discrete).

Case   Locality   Length (mm)   Sex   Weight (g)   Pigment   No. scales

1      AR         22.5          m     333          4         23
2      AR         21.4          m     298          5         22
3      MO         20.8          f     401          5         14
4      MO         20.6          f     257          3         26
5      MO         19.8          f     21           2         9
6      AR         20.1          f     30           1         21
7      AR         22.3          m     478          1         17
8      MO         21.7          f     400          5         12
9      AR         20.4          m     35           4         15
10     MO         21.1          f     288          5         12

 

Name your data file first.syd (the file extensions .syz or .syd identify a SYSTAT data file).  After you finish entering the data, proofread the file to make sure that the data are correct, edit if necessary, save the file to your M-drive, and close it.  Reopen the file and use it to learn the following menus and functions:

 

1.     File Menu (New, Open, Save, Save As, Print, Exit)

2.     Edit Menu (Undo, Cut, Copy, Paste, Copy Graph, Delete, Options)

3.     Data Menu (Variable properties, Transform, By Groups, Select Cases)

4.     Graph Menu (Histogram)

5.     Analyze Menu (One-Way Frequency Tables, Basic Statistics, Tables)

 

Exercises

1.     calculate the mean guanotzit weight

2.     calculate the mean guanotzit weight separately for males and females

3.     calculate the mean weight for guanotzits from Arkansas

4.     draw a histogram of guanotzit lengths

5.     transform weight to the common logarithm of weight

6.     how many guanotzits from Missouri were measured?

7.     determine the number of guanotzits by scale number and state

8.     after you finish #1-7, your instructor may have additional exercises using lonoke.syd
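
The exercises above are meant to be worked in SYSTAT. As a cross-check only, here is a minimal sketch in Python (not part of the course software) for exercises 1-3, 5, and 6. The data are transcribed from the guanotzit table; the names weight_g, mean_where, and so on are illustrative assumptions, not SYSTAT variable names.

```python
import math
from statistics import mean

# Data transcribed from the guanotzit table (10 cases, in case order).
locality = ["AR", "AR", "MO", "MO", "MO", "AR", "AR", "MO", "AR", "MO"]
sex = ["m", "m", "f", "f", "f", "f", "m", "f", "m", "f"]
weight_g = [333, 298, 401, 257, 21, 30, 478, 400, 35, 288]


def mean_where(values, groups, label):
    """Mean of the values whose group label matches (similar to Select Cases)."""
    return mean(v for v, g in zip(values, groups) if g == label)


mean_weight = mean(weight_g)                                          # exercise 1
mean_by_sex = {s: mean_where(weight_g, sex, s) for s in ("m", "f")}   # exercise 2
mean_ar = mean_where(weight_g, locality, "AR")                        # exercise 3
log10_weight = [math.log10(w) for w in weight_g]                      # exercise 5
n_missouri = locality.count("MO")                                     # exercise 6

print(mean_weight, mean_by_sex, mean_ar, n_missouri)
```

Compare the printed values with your SYSTAT output; any disagreement usually points to a data-entry error in first.syd.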

_________________________________________________

 

Description of Data (from a frequency distribution)

     A. Descriptive statistics

1.     measures of central tendency (Fig. 4.1)

              a. mode - most frequent class (of frequency distribution)

              b. median (ordinal or ratio/interval data) - middle class

              c. mean (ratio/interval data) = “average”; calculation: $\bar{x} = \sum x / n$

 

2.     measures of dispersion - describe the amount that each observation is likely to vary from the mean/median

a.     maximum, minimum (range): sensitive to extreme values

b.     interquartile range: break ordered data into four equal sections (quartiles, Q1, Q2, Q3); middle 50% of observations (Q3 – Q1; difference between 25th and 75th percentiles)

c.      sum of squares (SS): $\sum (x - \bar{x})^2$

d.     variance: SS/n (population) or SS/(n-1) (sample)

e.      standard deviation: $\sqrt{\text{variance}}$ (Fig. 6-3)

 

3.     symbols for statistics (sample) and parameters (population)

 

 

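                      Parameter (population)                 Statistic (sample)

Mean                  $\mu = \sum x / n$                      $\bar{x} = \sum x / n$  (= “x-bar”)

Variance              $\sigma^2 = \sum (x - \mu)^2 / n$       $s^2 = \sum (x - \bar{x})^2 / (n - 1)$

Standard Deviation    $\sigma = \sqrt{\sigma^2}$              $s = \sqrt{s^2}$  (= “SD”)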

 

 

4.     coefficient of variation (CV) - expresses SD as a percent of the mean

a. CV = (SD/$\bar{x}$) × 100

b.     used to compare relative variation in one variable between groups with different means

              c. example:

                                mean    SD      CV

                   Group 1      14.2    2.5     17.6

                   Group 2      7.2     1.8     25.0

Note that group 2 is relatively more variable despite a greater SD in group 1.
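
The CV arithmetic in the example can be reproduced with a few lines of Python. This is only a sketch; the function name coefficient_of_variation is an illustrative assumption.

```python
def coefficient_of_variation(sd, mean):
    """CV: the standard deviation expressed as a percent of the mean."""
    return sd / mean * 100


cv_group1 = coefficient_of_variation(sd=2.5, mean=14.2)
cv_group2 = coefficient_of_variation(sd=1.8, mean=7.2)

# Group 2 has the smaller SD but the larger CV.
print(round(cv_group1, 1), round(cv_group2, 1))
```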

 

 

          5. calculating descriptive univariate statistics

a.        personal calculator - assignments:

1)    calculate descriptive stats for: 13.4, 13.8, 14.2, 17.0, 15.3, 15.8, 14.9, 12.3, 16.2, 16.4 (answers: $\bar{x}$ = 14.9, SD = 1.49 [$\sigma$ = 1.41], $s^2$ = 2.22 [$\sigma^2$ = 2.00])

2)    calculate descriptive statistics from a frequency distribution (an essential skill)

 

                             No. plants/
                             quadrat       Frequency

                             0                 35
                             1                 28
                             2                 15
                             3                 10
                             4                 7
                             5                 5

                             $\bar{x}$ = 1.41;  SD = 1.48

Some calculators have a frequency limit of 100.  Check your calculator!
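
Both calculator assignments can be checked with one small routine. The sketch below assumes Python rather than a calculator or SYSTAT; describe is a hypothetical helper that accepts either raw values or values with frequencies, and returns the sample statistics (n - 1) together with the population variance (n).

```python
import math


def describe(values, freqs=None):
    """Return (mean, sample variance, sample SD, population variance).

    If freqs is given, each value is weighted by its frequency, which is
    how statistics are calculated from a frequency distribution.
    """
    if freqs is None:
        freqs = [1] * len(values)
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / n
    ss = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs))  # sum of squares
    return mean, ss / (n - 1), math.sqrt(ss / (n - 1)), ss / n


# Assignment 1: raw data
raw = [13.4, 13.8, 14.2, 17.0, 15.3, 15.8, 14.9, 12.3, 16.2, 16.4]
raw_stats = describe(raw)

# Assignment 2: number of plants per quadrat, with frequencies
plants = [0, 1, 2, 3, 4, 5]
quadrats = [35, 28, 15, 10, 7, 5]
freq_stats = describe(plants, quadrats)

print([round(v, 2) for v in raw_stats])
print([round(v, 2) for v in freq_stats])
```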

 


 

 

 

b.     SYSTAT - calculate descriptive statistics from raw data file; Analyze → Basic Statistics

c.      Use CAVESALYS.SYD

Data for the following results were selected according to SELECT (LOC$ = AR) AND (SEX$ = F)

                        SVL

N of Cases              55
Minimum                 32.50
Maximum                 62.50
Arithmetic Mean         52.73
Standard Deviation      8.02

 

 

 
 

 

 

 

 

 

    

 

 

 

d.     SYSTAT - calculate descriptive statistics from frequency distribution (use no. plants/quadrat data)

1.     Step 1: Data → Case Weighting → By Frequency

2.     Step 2: Analyze → Basic Statistics

 

6.     reporting sample means: must include a measure of dispersion (= “error”); the mean alone is not useful; report as $\bar{x}$ ± error

a.     SYSTAT (bar graph with error bars)

 

 

 

 

 

 

 

b.     how to reduce variance?    $\sum (x - \bar{x})^2 / n$

                   -experimental design (select cases); decrease SS (numerator)

                   -calculation (increase sample size): increase n (denominator)

_____________________________________________

 

 

Exercise: Descriptive Statistics and Graphics

 

_____________________________________________

 

 

Goodness-of-Fit (GOF)

1.     How well does one shape mimic (“fit”) another shape?

2.     Provide shape (e.g., “Draw a square”)

3.     Compare with the expected shape (square)

4.     Does the observed shape fit the expected shape? (either it fits or it doesn’t fit)

5.     Substitute “frequency distribution” for “shape.”

6.     How well does an observed frequency distribution fit an expected frequency distribution? (GOF)

 

Probability Distributions (“Expected distributions”)

(= frequency distribution with probabilities instead of counts)

Why do we need to know? Practical use to biologists (observed vs. expected): we cannot know whether a result is due to chance alone unless we know what is expected (hypothesis testing, the basis for much of this course!).

Important theoretical frequency distributions:

 

A.   Discrete probability distribution #1 - Binomial (mutually exclusive categories; either/or); e.g., male/female, red/white, red/not red

1.     Example: 1 coin toss- possibilities: 1H, 1T

a.        probabilities: no. ways an event (i.e., H or T) can occur /total no events (i.e., 2) possible; “division” rule; 1H [1/2] = 0.5; 1T [1/2] = 0.5

b.       add all possibilities = 1  [0.5 + 0.5 = 1]

c.        probability distribution shape (Fig. 5-1)

 

2.     Example: 2 coin toss- possibilities: 2H, 1H1T, 1T1H, 2T (mutually exclusive, independent events)

a.        probabilities:

1)    simultaneous events (“and” rule, multiply): 2H [0.5 x 0.5] = 0.25; 2T [0.5 x 0.5] = 0.25

2)    alternative events (“or” rule, add): 1H1T in either order [0.5 x 0.5] + [0.5 x 0.5] = 0.5

b.       add all probabilities [0.25 + 0.5 + 0.25 = 1]

c.        probability distribution shape (Fig. 5.3)
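
The one- and two-toss probabilities can also be obtained by listing every equally likely outcome and counting heads, which is the “division” rule applied mechanically. A minimal Python sketch follows; the function name head_count_distribution is an illustrative assumption.

```python
from collections import Counter
from itertools import product


def head_count_distribution(n_tosses):
    """Probability of each number of heads for a fair coin, by enumeration."""
    outcomes = list(product("HT", repeat=n_tosses))  # all equally likely sequences
    counts = Counter(outcome.count("H") for outcome in outcomes)
    return {heads: counts[heads] / len(outcomes) for heads in sorted(counts)}


one_toss = head_count_distribution(1)
two_tosses = head_count_distribution(2)

print(one_toss)
print(two_tosses)
```

For two tosses the enumeration lists HT and TH separately, which is why one head and one tail is twice as probable as two heads.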

                            

3.     binomial formula: $P(x) = \frac{n!}{x!\,(n-x)!}\, p^{x} q^{n-x}$

4.     terms

p = probability of event of interest = head (“success”)

q = probability of other event (1-p) = not head (“failure”)

n = number of “simultaneous” events (trials); Note: “n” is signified by “k” in H&H (p 31)

x = number of occurrences of the event of interest (successes)

 

EXAMPLE: A reproductive physiologist counted the number of males in 247 litters of four siblings each in a species of Dimetrodon (observed frequency in table below).  Based on the theory of sex determination in mammals (equal chance of being male or female; Fig. 5-1), calculate the expected frequencies for the number of males in these litters.  Note the IF-THEN rationale.

                                                                                Expected        Expected
                                                                                proportion      number (frequency)

prop (0 males) =   $\frac{4!}{0!\,(4-0)!} \times 0.5^{0} \times 0.5^{4-0}$ =     0.0625   x   247 = 15.438

prop (1 male)  =   $\frac{4!}{1!\,(4-1)!} \times 0.5^{1} \times 0.5^{4-1}$ =     0.2500   x   247 = 61.750

prop (2 males) =   $\frac{4!}{2!\,(4-2)!} \times 0.5^{2} \times 0.5^{4-2}$ =     0.3750   x   247 = 92.625

prop (3 males) =   $\frac{4!}{3!\,(4-3)!} \times 0.5^{3} \times 0.5^{4-3}$ =     0.2500   x   247 = 61.750

prop (4 males) =   $\frac{4!}{4!\,(4-4)!} \times 0.5^{4} \times 0.5^{4-4}$ =     0.0625   x   247 = 15.438

Total                                                                            1.0            247
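
The expected frequencies in this example follow directly from the binomial formula. As a sketch only (Python 3.8 or later is assumed for math.comb; the function names are illustrative, not SYSTAT commands):

```python
from math import comb


def binomial_probability(x, n, p):
    """P(x) = n! / (x! (n - x)!) * p^x * q^(n - x), with q = 1 - p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def expected_frequencies(n, p, total):
    """Expected number of samples (e.g., litters) with 0..n successes."""
    return [binomial_probability(x, n, p) * total for x in range(n + 1)]


# 247 litters of four siblings, equal chance of male or female
expected_males = expected_frequencies(n=4, p=0.5, total=247)
print(expected_males)
```

Changing total and p adapts the same calculation to other IF-THEN questions of this kind.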

 

- Calculating probabilities with SYSTAT (Utilities → Probability Calculator → Univariate Discrete)

    

          Question: Is sex ratio in Dimetrodon determined by a mechanism similar to that of mammals?

          No.         Obs        Exp
          males       freq.      freq.

          0           7          15.438
          1           24         61.750
          2           93         92.625
          3           99         61.750
          4           24         15.438
          Total       247        247

          Conclusion: because of the large deviations between the expected and observed numbers, we reject the idea of there being equal chances of having each sex.

    

 

5.     importance of sample size for observed data (1 coin example, compare to theoretical) 

-       IF the observed coin is a normal (fair) coin, THEN the larger the n, the more closely the observed proportions approximate the expected; conversely, the smaller the n, the more they deviate from the expected
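
The effect of sample size can be seen by simulating tosses of a fair coin. This is a rough sketch, assuming Python's pseudo-random generator as a stand-in for a real coin; the seed (254) is arbitrary and proportion_heads is a hypothetical helper name.

```python
import random


def proportion_heads(n_tosses, rng):
    """Simulate n_tosses of a fair coin and return the observed proportion of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


rng = random.Random(254)  # arbitrary seed so the run is repeatable
for n in (10, 100, 10_000, 100_000):
    print(n, proportion_heads(n, rng))
```

The expected proportion is 0.5 in every case; the observed proportion will typically wander noticeably at n = 10 and settle close to 0.5 at the largest n.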

_______________________________________________      

 

Exercise: Binomial Distribution

Assuming that the sex of hatchling turtles is determined by a particular combination of chromosomes as in mammals (i.e., an XX, XY system), fill in the expected frequencies below:

 

Number of male hatchlings emerging from 84 nests of kaw turtles (kaw turtles always lay 4 eggs per nest).

No. of       Observed       Expected
Males        No. Nests      No. Nests

0            23             ____
1            25             ____
2            19             ____
3            10             ____
4            7              ____

           ans: exp- 5.25, 21.0, 31.5, 21.0, 5.25

 

Compare the observed and expected frequencies.  Do these data support the hypothesis that sex of hatchlings is genetically determined?  Support your conclusion.

_____________________________________________

 

B.   Discrete probability distribution #2 - Poisson (expected distribution for rare and random [independent] events)

1.     Poisson: $\mu = \sigma^2$ (distribution defined by mean only)

                        -low mean [relatively rare events]

-probability distribution shape (Fig); moves toward binomial shape at higher means

2.     e.g., recapture rates of ectotherm animals

3.    Poisson formula:  $P(x) = \frac{\bar{x}^{x} e^{-\bar{x}}}{x!}$

4.    terms

-  $\bar{x}$ = mean occurrence of event of interest

-  e = mathematical constant (2.71828)

-  x = number of occurrences of the event of interest

 

EXAMPLE 1: An ecologist counted the number of maple seedlings in 100 quadrats.  Using the mean calculated from the observed frequency distribution of maple seedlings per quadrat in the table below ($\bar{x}$ = 1.41), calculate the expected frequencies assuming that the occurrence of a seedling in a quadrat is a random event.  Note the IF-THEN rationale.

 

                                                                  Expected         Expected
                                                                  proportion       number (frequency)

     prop (0 seedlings)     = $(1.41^{0} e^{-1.41})/0!$ =      0.244 x 100 = 24.4

     prop (1 seedling)      = $(1.41^{1} e^{-1.41})/1!$ =      0.344 x 100 = 34.4

     prop (2 seedlings)     = $(1.41^{2} e^{-1.41})/2!$ =      0.243 x 100 = 24.3

     prop (3 seedlings)     = $(1.41^{3} e^{-1.41})/3!$ =      0.114 x 100 = 11.4

     prop (4 seedlings)     = $(1.41^{4} e^{-1.41})/4!$ =      0.040 x 100 = 4.0

     prop (5 seedlings)     = $(1.41^{5} e^{-1.41})/5!$ =      0.011 x 100 = 1.1

     Total                                                     1.0           100
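
The Poisson expected frequencies can be generated the same way as the binomial ones. A minimal sketch in Python, with illustrative function names; the mean (1.41) and number of quadrats (100) come from the example.

```python
from math import exp, factorial


def poisson_probability(x, mean):
    """P(x) = mean^x * e^(-mean) / x!"""
    return mean**x * exp(-mean) / factorial(x)


def poisson_expected(mean, total, max_x):
    """Expected number of samples (e.g., quadrats) with 0..max_x occurrences."""
    return [poisson_probability(x, mean) * total for x in range(max_x + 1)]


expected_seedlings = poisson_expected(mean=1.41, total=100, max_x=5)
print([round(e, 1) for e in expected_seedlings])
```

The expected numbers for 0 to 5 seedlings sum to slightly less than 100 because a Poisson distribution also assigns a small probability to 6 or more seedlings.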

 

- Calculating probabilities with SYSTAT (Utilities → Probability Calculator → Univariate Discrete)

 

     Question: Do seedlings occur randomly in quadrats?

     No.        Obs        Exp
     plants     freq.      freq.

     0          35         24.4
     1          28         34.4
     2          15         24.3
     3          10         11.4
     4          7          4.0
     5          5          1.1
     Total      100        99.6

     Conclusions:
     1. Rare (low value of mean)
     2. Random (compare observed and expected distributions)

__________________________________________      

 

Exercise: Poisson Distribution

Assuming that being killed by a horse is a rare and random event, fill in the expected frequencies below:

 

          Men killed by being kicked by a horse in the Prussian Army Corps.

No. killed/     Observed     Expected
yr/corps        Number       Number

0               109          ____
1               65           ____
2               22           ____
3               3            ____
4               1            ____
Total           200          200

ans: exp- 108.67, 66.29, 20.22, 4.11, 0.63

$\bar{x}$ =   ________ (ans: 0.610)

$s^2$ =   ________ (ans: 0.611)

$s^2/\bar{x}$  = _______ (ans: 1.002)

 

Compare the observed and expected frequencies.  Do these data support the hypothesis that the chance of being killed by a horse in the Prussian Army Corps is a rare and random event?  Support your conclusion.

_____________________________________________

 

Exercise: Testing Your Concept of Randomness

1.     Obtain a 10x10 grid

2.     Draw 100 dots on grid (keep eyes open, try to draw dots in a random manner)

3.     Calculate descriptive statistics on no. of dots per cell

4.     Calculate variance/mean ratio

5.     $s^2/\bar{x}$ = 1 (random); $s^2/\bar{x}$ < 1 (evenly spaced); $s^2/\bar{x}$ > 1 (clumped)

6.     How large must the deviation be before we reject the idea of randomness?

7.     Application: ecological dispersion patterns
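
The variance/mean ratio in steps 3-5 can be computed from a frequency table of counts per cell. The sketch below assumes Python and uses the sample variance (n - 1); variance_to_mean_ratio is a hypothetical helper. It is applied to the two frequency distributions that appear earlier in these notes, since your own grid counts will differ.

```python
def variance_to_mean_ratio(counts, freqs):
    """Sample variance divided by the mean, from a frequency distribution."""
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(counts, freqs)) / n
    ss = sum(f * (x - mean) ** 2 for x, f in zip(counts, freqs))
    variance = ss / (n - 1)
    return variance / mean


# Maple seedlings per quadrat (100 quadrats)
seedling_ratio = variance_to_mean_ratio([0, 1, 2, 3, 4, 5], [35, 28, 15, 10, 7, 5])

# Men killed by horse kick per year per corps (200 corps-years)
horse_ratio = variance_to_mean_ratio([0, 1, 2, 3, 4], [109, 65, 22, 3, 1])

print(round(seedling_ratio, 2), round(horse_ratio, 3))
```

A ratio near 1 is consistent with randomness and a ratio above 1 points toward clumping, but the sketch deliberately does not decide how large a deviation is too large; that is the question raised in step 6.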

_____________________________________________

 

Continuous Probability Distribution

“Normal” distribution; very important frequency distribution in statistics for 2 reasons:

A.   Data that are influenced by many small and unrelated random effects are approximately normally distributed (“Fuzzy Central Limit Theorem”); extremely widespread and common in nature

B.   Forms the conceptual basis of a large number of statistical procedures - one of the most important theoretical distributions in statistics

C.   Properties

1.     formula: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

2.     shape determined by mean and SD

3.     symmetrical around the mean (mean=mode=median); Fig. 4.1

4.     $\bar{x}$ ± 1 SD = approx. 68% of cases; ± 2 SD = approx. 95% (Fig. 6-3)
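
The 68% and 95% figures can be verified numerically. A minimal sketch, assuming Python's math module; coverage_within uses the standard identity that the area within k SD of the mean of a normal distribution equals erf(k/√2), and both function names are illustrative.

```python
import math


def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))


def coverage_within(k):
    """Proportion of a normal distribution lying within k SD of the mean."""
    return math.erf(k / math.sqrt(2))


print(round(coverage_within(1), 4))  # about 0.68
print(round(coverage_within(2), 4))  # about 0.95
```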

 

D.   Standard normal distribution

1.     many different “normal” distributions (Fig. 6.2)

2.     standardize any normal distribution (directly compare)

3.     express individual cases in terms of SND; $z = (x - \bar{x})/s$; “z-score”

4.     z-score = distance from mean in standard deviation units; e.g., z = 1 (=1SD greater than the mean)

5.     Table - areas of normal curve
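
A z-score and the corresponding area under the normal curve can be computed without a printed table. The sketch below assumes Python 3.8 or later (for statistics.NormalDist) and, purely for illustration, reuses the SVL summary from the CAVESALYS.SYD output earlier in these notes (mean 52.73, SD 8.02, maximum 62.50).

```python
from statistics import NormalDist


def z_score(x, mean, sd):
    """Distance of x from the mean in standard deviation units."""
    return (x - mean) / sd


# How far is the largest SVL (62.50) from the sample mean?
z_max = z_score(62.50, mean=52.73, sd=8.02)

# Area of the standard normal curve below that z-score
area_below = NormalDist().cdf(z_max)

print(round(z_max, 2), round(area_below, 3))
```

The area is meaningful only to the extent that SVL is approximately normally distributed, which the summary statistics alone do not establish.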

 

E.    Testing observed data for normality; SYSTAT output (TREAT.SYD, EGGWGT)

1.      qualitative: Probability plot (Graph → Distribution Plots → Probability Plot); select cases

2.     quantitative: Kolmogorov-Smirnov Test; normal if probability >0.05 (“skewed” if P<0.05)

a.        SYSTAT path: Analyze → Fitting distributions → Continuous (Enter selected variable and distribution)

 

   Kolmogorov-Smirnov One Sample Test using Normal(0.00, 1.00) Distribution

   Data for the following results were selected according to SELECT (EGGWGT > 5)

Variable     N of Cases     Maximum Difference     Lilliefors Probability (2-tail)

EGGWGT       245            0.143                  0.000

Variable     N of Cases     Maximum Difference     Lilliefors Probability (2-tail)

EGGWGT       235            0.058                  0.052

 

 

 

 

 

F.      In some circumstances, a discrete distribution (e.g., binomial) may closely approximate a normal distribution and thus it may be treated as if it were a normal distribution of continuous data (why? easier to work with)

1.      rule of thumb for binomial: whenever n (“k” in H&H) is big enough to make the number of expected successes and failures both at least five, i.e., when np ≥ 5  and  n(1-p) ≥ 5
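
The rule of thumb is easy to apply as a quick check. A small sketch in Python; the function name and the minimum parameter are illustrative assumptions, with the default of 5 taken from the rule above.

```python
def normal_approximation_ok(n, p, minimum=5):
    """True when expected successes (np) and failures (n(1-p)) both reach the minimum."""
    return n * p >= minimum and n * (1 - p) >= minimum


print(normal_approximation_ok(4, 0.5))     # litters of four: np = 2
print(normal_approximation_ok(10, 0.5))    # np = n(1-p) = 5
print(normal_approximation_ok(100, 0.01))  # rare event: np = 1
```

Note that a large n alone is not enough: when p is very small (or very large), one of the two expected counts stays below five.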

 


 

____________________________________________________

 

 

Exercise

1.     Practice SYSTAT Probability Plot and One-sample KS Test using the variable H2OOUT from file DLWMEANS.SYD.  Note that H2OOUT is not normally distributed.

 

2.     Data transformations - many procedures in statistics assume that the data are normally distributed.  If data are not normally distributed, one can transform the data to another measurement scale in an effort to normalize them.  Deciding which transformation to use is entirely practical, i.e., the “right” transformation is whatever makes the data normally distributed.  Trial-and-error applications of various transformations may be necessary to determine which will work.  However, some transformations work better in some situations than in others.  Examples of transformations commonly used in biology are the logarithmic and arcsine transformations.

 

a.     logarithmic transformation - the logarithmic transformation (either natural logs [base e] or common logs [base 10]) is useful in a wide variety of situations and is by far the most commonly used data transformation in biology

b.     arcsine (inverse sine) transformations are used specifically for proportions and percentages (which generally tend to be non-normal if many of the observations fall outside the 30-70% range)

 

3.     Transform the variable H2OOUT above with common logarithms and retest for normality with both Probability Plot and One-Sample KS.  Note that the SYSTAT designation for common logs is L10 and that for natural logs is LOG (always use common logs in Biol. 254).  After transformation, the new variable logH2OOUT should now be normal (always create a NEW variable name for the transformed variable).
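
For reference, the two transformations described above can be written as follows. This is a sketch in Python, not SYSTAT syntax. The arcsine version shown is the arcsine of the square root of the proportion, which is the form commonly used for proportions; that specific form is an assumption here, since the notes do not spell it out. The input values are hypothetical.

```python
import math


def log10_transform(values):
    """Common (base 10) logarithm of each value; values must be positive."""
    return [math.log10(v) for v in values]


def arcsine_transform(proportions):
    """Arcsine square-root transform, in radians; proportions must lie in 0..1."""
    return [math.asin(math.sqrt(p)) for p in proportions]


# Hypothetical values, for illustration only
print(log10_transform([1, 10, 100, 2500]))
print(arcsine_transform([0.05, 0.50, 0.95]))
```

As in SYSTAT, store the result under a new variable name and retest the transformed variable for normality rather than assuming the transformation worked.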

_______________________________________________


Categories of statistical tests

1.              Parametric – ratio/interval, continuous/discrete*; rigid assumptions (including normal distribution of data); “powerful”

-                 examples: independent samples t-test, paired samples t-test, Bartlett’s, analysis of variance, Pearson correlation, regression

2.              Non-parametric – all scales and variable types; fewer assumptions; less powerful

-                   examples: one-sample KS, Mann-Whitney, Wilcoxon, Kruskal-Wallis, goodness-of-fit, Levene’s, Spearman correlation, test of independence

 

__________________________

       

             Exam 1 to here

__________________________