Hypothesis Testing - Relationships 

Updated 6 November 2009

  “I’ve learned that you can’t be taken seriously in any scientific discipline without an understanding of statistics.”

 Bethany Williams, PhD, U.S. Geological Survey (HU 2004)


You are expected to read the corresponding textbook chapters as we cover them in class.

Working the exercises at the end of the chapters will enhance your understanding.

 (click for answers to all exercises)

 

Analysis of relationships (among cases of variables)

A.   Correlation (=association); no functional dependence (e.g., wing length vs. tail length)

B.   Example: positive, negative, and no relationship

C.   2 questions:

1.     are two variables related in a linear way? (significant? =reject null of no relationship)

a.     if answer is “no”, do not go further

b.     if answer is “yes”, ask:

2.     what is the strength of the relationship? (value of correlation coefficient)

 

Statistical Tests

Assumptions of parametric tests

·          Data randomly sampled and independent

·          Data measured on ratio/interval scale

·          Variables continuous or discrete (if n and no. of possible values are large)

·          Data normally distributed for each group (residuals in ANOVA/regression)

·          For questions regarding means, variances among groups (residuals in ANOVA/regression) homogeneous

Assumptions of nonparametric tests

·          Data randomly sampled and independent

Guidelines for determining appropriate analyses

·          Read the question carefully; make sure you understand what the question is asking

·          Look for key words: difference, relationship, association, correlation

·          A “v” word (vary, variance, variation) will be present in questions about differences in variances

·          If a “v” word does not appear in a difference question and the question does not concern frequencies, assume the question concerns means

·          The word “affect” (or “effect”) can be used in both difference and relationship questions; you must understand its use in context

Question               Parametric              Nonparametric

Differences
   Means (2)           Indep. samples t        Mann-Whitney
                       Paired samples t        Wilcoxon
   Means               ANOVA                   Kruskal-Wallis
   Variances           Bartlett’s              Levene’s
   Frequencies         -----                   Goodness-of-fit

Relationships
   Cases               Pearson correlation     Spearman correlation
                       Regression              -----
   Frequencies         -----                   Independence

Pearson correlation (Chap. 13)

1.     Purpose:  Test whether the cases of two variables are correlated

2.     Comments:  if variables are related, the relationship is linear

3.     Null hypothesis:  H0: r(var1, var2) = 0

4.     Test statistic (correlation coefficient, r; varies from -1 to +1) and probability source:  Systat/Systat

5.     r^2 (coefficient of determination) - proportion of variation in Y that is explained by variation in X

6.     SYSTAT path:  Analyze → Correlation → Simple (enter variables; Continuous Data)

 

SYSTAT output: (EGGSIZE.SYD; select sex=1 and year=81); pic

 

Pearson correlation matrix

              HWGT        HSVL
 HWGT       1.0000
 HSVL       0.8297      1.0000

Bartlett Chi-square statistic:  36.735 DF=1 Prob= 0.000

Matrix of Probabilities

              HWGT        HSVL
 HWGT       0.0000
 HSVL       0.0000      0.0000

 

Number of observations: 34
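
For readers working outside SYSTAT, the quantities in this output can be checked by hand. The sketch below is an illustration in Python, not part of the course software; the helper names pearson_r and t_from_r are assumptions introduced here. It uses the standard formulas r = Sxy / sqrt(Sxx * Syy) and t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom, and applies them to the values reported above (r = 0.8297, n = 34).

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_from_r(r, n):
    """t statistic (n - 2 degrees of freedom) for H0: no correlation."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Values reported in the SYSTAT output above (HWGT vs. HSVL).
r, n = 0.8297, 34
r_squared = r ** 2          # coefficient of determination
t_stat = t_from_r(r, n)     # compare with a t table at n - 2 = 32 df

# Hypothetical data, for illustration only: a perfectly linear pair gives r = 1.
demo_r = pearson_r([1, 2, 3, 4], [3, 5, 7, 9])
print(round(r_squared, 3), round(t_stat, 2), round(demo_r, 3))
```

The probability itself still has to come from SYSTAT or a t table; the sketch only produces the test statistic.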

Bonferroni probabilities - use when correlating >2 variables from one data set
   (multiple comparisons increase chances of Type I error – must adjust probabilities)
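
A minimal Python sketch of the adjustment, assuming the usual Bonferroni rule: each unadjusted probability is multiplied by the number of pairwise correlations, k(k - 1)/2 for k variables, and capped at 1. The function names are hypothetical and the unadjusted probability is invented for illustration.

```python
def n_pairwise(k):
    """Number of distinct pairwise correlations among k variables."""
    return k * (k - 1) // 2

def bonferroni(p, n_tests):
    """Bonferroni-adjusted probability: p times the number of tests, capped at 1."""
    return min(1.0, p * n_tests)

# Four variables give six pairwise correlations.
tests = n_pairwise(4)
# Hypothetical unadjusted probability, for illustration only.
adjusted = bonferroni(0.02, tests)
print(tests, round(adjusted, 2))
```

With four variables, a correlation that looks significant on its own (P = 0.02) is no longer significant at the 0.05 level once the six comparisons are accounted for.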

SYSTAT output: (SHRIMPS.SYD); pic
 
Kolmogorov-Smirnov One Sample Test using Normal(0.00, 1.00) Distribution

Variable        N of Cases    Maximum Difference    Lilliefors Probability (2-tail)
FEMLEN              68              0.099                    0.097
FEMWGT              68              0.167                    0.000
EGGNO               68              0.224                    0.000
BROODVOL            68              0.214                    0.000
BROODWGT            68              0.261                    0.000
LOGFEMLEN           68              0.095                    0.129
LOGFEMWGT           68              0.138                    0.003
LOGEGGNO            68              0.103                    0.072
LOGBROODVOL         68              0.063                    0.695
LOGBROODWGT         68              0.096                    0.122

Number of Observations: 68

Pearson Correlation Matrix

                FEMLEN    LOGEGGNO    LOGBROODVOL    LOGBROODWGT
FEMLEN           1.000
LOGEGGNO         0.723      1.000
LOGBROODVOL      0.750      0.904        1.000
LOGBROODWGT      0.760      0.902        0.893          1.000

Matrix of Bonferroni Probabilities

                FEMLEN    LOGEGGNO    LOGBROODVOL    LOGBROODWGT
FEMLEN           0.000
LOGEGGNO         0.000      0.000
LOGBROODVOL      0.000      0.000        0.000
LOGBROODWGT      0.000      0.000        0.000          0.000

Example problems

1.      Use the following data on wing length (cm) and tail length (cm) in cowbirds to determine if there is a relationship between the two variables. (Protocol link)

 

wing    10.4     10.8     11.1     10.2     10.3     10.2     10.7     10.45   10.8     11.2     10.6

tail       7.4       7.6       7.9       7.2       7.4       7.1       7.4       7.2       7.8       7.7       7.8

 

2.      Use the following data taken from crabs to determine if there is a relationship between weight of gills (g) and weight of body (g) and between weight of thoracic shield (g) and weight of body. (Protocol link)

 

body    159      179      100      45        384      230      100      320      80        220      320

gill       14.4     15.2     11.3     2.5       22.7     14.9     11.4     15.81   4.19     15.39   17.25

thorax  80.5     85.2     49.9     21.1     195.3   111.5   56.6     156.1   39.0     108.9   160.1

 

________________________________________________________________


 


Spearman correlation (Chap. 13)

1.     Purpose:  Test whether the cases of two variables are correlated

2.     Comments: if variables are related, the relationship is monotonic (consistently increasing or decreasing); it need not be linear

3.     Null hypothesis:  H0: rs(var1, var2) = 0

4.     Test statistic (rs) and probability source:  Systat/Statistical Table

5.     SYSTAT path:  Analyze → Correlation → Simple (enter variables; Rank Order Data)

SYSTAT output: (SHRIMPS.SYD); pic
 
Kolmogorov-Smirnov One Sample Test using Normal(0.00, 1.00) Distribution

Variable        N of Cases    Maximum Difference    Lilliefors Probability (2-tail)
FEMLEN              68              0.099                    0.097
FEMWGT              68              0.167                    0.000
EGGNO               68              0.224                    0.000
BROODVOL            68              0.214                    0.000
BROODWGT            68              0.261                    0.000
LOGFEMLEN           68              0.095                    0.129
LOGFEMWGT           68              0.138                    0.003
LOGEGGNO            68              0.103                    0.072
LOGBROODVOL         68              0.063                    0.695
LOGBROODWGT         68              0.096                    0.122

Number of Observations: 68

Spearman Correlation Matrix

              FEMLEN    FEMWGT    EGGNO    BROODVOL    BROODWGT
FEMLEN         1.000
FEMWGT         0.886     1.000
EGGNO          0.725     0.697    1.000
BROODVOL       0.773     0.750    0.905     1.000
BROODWGT       0.749     0.724    0.897     0.884       1.000

 

Probabilities (not available in SYSTAT, must get from Spearman Table→ P<0.01)
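
Because rs is simply the Pearson coefficient calculated on ranks, it can be sketched in a few lines of Python. This is an illustration only, not the SYSTAT procedure; the function names and the data are hypothetical. Tied values receive the mean of the ranks they occupy.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1   # positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rs(x, y):
    """Spearman coefficient: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y rises with x, but not in a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 100]
rs = spearman_rs(x, y)
print(rs)
```

The calculated rs is then compared with the critical value in the Spearman table for the sample size.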

Example problems

1.      The following data are ranked scores for ten students who took both a math and a biology aptitude examination.  Is there a relationship between math and biology aptitude scores for these students? (Protocol link)

 

math          53  45  72  78  53  63  86  98  59  71

biology      83  37  41  84  56  85  77  87  70  59

 

2.      Test the following data to determine if there is a relationship between the total length of aphid stem mothers and the mean thorax length of their parthenogenetic offspring. (Protocol link)

 

mother       8.7       8.5       9.4       10.0     6.3       7.8       11.9     6.5       6.6       10.6

offspring   5.95     5.65     6.00     5.70     4.40     5.53     6.00     4.18     6.15     5.93

 

____________________________________________________________


Correlation exercise (combination can result from what?)

 

 

                      Strength of Relationship
                      Low (weak)     High (strong)
Significant               ?               ?
Not significant           ?               ?

 

Multiple choice (choose all that apply)

a. biologically important

b. biologically unimportant

c. sufficient power

d. insufficient power

e. high variability

f. other biologically important variables not yet accounted for

 

________________________________________________________________________________


 


Independence (Chap. 14)

1.     Purpose:  Test whether the frequencies of two variables are independent

2.     Comments:  two variables; each observation falls into one of several mutually exclusive categories; use counts, not proportions or percentages; no cell should have an expected frequency <5 (Systat will inform you of violations)

3.     Null hypothesis:  H0: var(row) independent of var(column)

4.     Test statistic (chi-square, X^2) and probability source:  Systat/Systat

5.     SYSTAT path:  Analyze → Tables → Two-Way (enter row and column variables)

SYSTAT output: (GINMOVE.SYD); SLIDES

 

Frequencies
 HAB$ (rows) by SEX$ (columns)

             F       M      Total
        +----------------+
      P |   480     420  |   900
      R |     2      25  |    27
        +----------------+
  Total     482     445      927

Test statistic                Value       DF      Prob
  Pearson Chi-square        22.1511    1.0000    0.0000

6.     Frequency table data - start with table (no raw data)

a.     example 1 - association between the hemoglobin S allele and resistance to malaria:         

 

                         Contracted      Did not contract
                         malaria         malaria
     Heterozygotes            1               14
     Homozygotes             13                2

SYSTAT output:

Frequencies
 MALARIA$ (rows) by GENES$ (columns)

            het     hom     Total
        +----------------+
      n |    14       2  |    16
      y |     1      13  |    14
        +----------------+
  Total      15      15       30

Test statistic                Value       DF      Prob
  Pearson Chi-square         19.286     1.000    0.000

     b. example 2 (supplements link)
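
The Pearson chi-square in the two SYSTAT tables above can be reproduced from the observed counts alone. The following Python sketch is an illustration, not the SYSTAT routine; the function name is hypothetical. It computes each expected frequency as (row total x column total) / N, sums (observed - expected)^2 / expected over all cells, and also reports the smallest expected frequency so that the <5 rule can be checked.

```python
def chi_square_independence(table):
    """Pearson chi-square for a table of observed counts (list of rows).

    Returns (chi_square, degrees_of_freedom, smallest_expected_count).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_sq = 0.0
    min_expected = float("inf")
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            min_expected = min(min_expected, expected)
            chi_sq += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return chi_sq, df, min_expected

# Observed counts copied from the two SYSTAT tables above.
habitat_by_sex = [[480, 420], [2, 25]]      # GINMOVE.SYD
malaria_by_genes = [[14, 2], [1, 13]]       # hemoglobin S example

chi_hab, df_hab, min_exp_hab = chi_square_independence(habitat_by_sex)
chi_mal, df_mal, min_exp_mal = chi_square_independence(malaria_by_genes)
print(round(chi_hab, 4), df_hab)
print(round(chi_mal, 3), df_mal)
```

The results should agree with the Pearson chi-square values SYSTAT reports (22.1511 and 19.286, each with 1 DF).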

 

Example problems

1.      Use the following data on frequency of rabies in skunks collected from three geographic areas to test the hypothesis that incidence of rabies is dependent on geographic area. (Protocol link)

 

                  With          Without

Area          Rabies       Rabies

Ozarks       14              29

Ouachitas  12              38

Delta         11              35

 

2.      The following data are frequency of individuals with different hair colors according to sex.  According to these data, is human hair color dependent on sex? (Protocol link)

 

sex             black          brown        blond         red

male           32              43              16              9

female       55              65              64             16

_________________________

 

 

Correlation vs. causation

  1. spurious correlations; Calvin, Cause and effect
  2. sometimes results from a common correlation with 3rd variable (e.g., B correlated with C because both B&C are functionally correlated with A)

 

 


Regression (Chap. 12)

1.      Purpose:  Test whether the cases of one variable are functionally (mathematically) related to the cases of another variable (i.e., whether one can be predicted from the other)

2.      Comments:  if variables are related, the relationship is linear; normality assumptions are checked on the residuals after the regression analysis; the test is robust

3.      Null hypothesis:  H0: b(vary, varx) = 0

4.      Test statistic (F) and probability source:  Systat/Systat

5.      SYSTAT path:  Analyze → Regression → Linear → Least Squares (enter dependent and independent variables; enter KS on options tab)

 

Procedure

a.     SYSTAT: fit regression line (least squares method: minimize Σ(residuals^2)); test for significance of slope

b.     Student: determine general regression equation: Y = a + bX (a = intercept; b = slope); parametric, Y = α + βX

c.      Student: determine specific equation (insert regression values and variables)

d.     Student: prepare regression plot (use SYSTAT Scatterplot) 

SYSTAT output: (BABIES.SYD; select svl>200 and svl<290 and sex=1); pic

 

Output format (two tables):

·        Regression statistics: intercept (=constant), slope (=regression coefficient); confidence limits (pic)

·        ANOVA table

 

Dep Var: WGT   N: 235   Multiple R: 0.874   Squared multiple R: 0.764

Adjusted squared multiple R: 0.763   Standard error of estimate: 0.769

Effect      Coefficient            Std Error   Std Coef   Tolerance       t       P(2 Tail)
CONSTANT    -16.956 (intercept)      0.991      0.0          .         -17.118     0.000
SVL           0.110 (slope)          0.004      0.874      1.000        27.436     0.000

                     Analysis of Variance

Source        Sum-of-Squares    DF    Mean-Square    F-Ratio      P
Regression        444.780         1      444.780      752.726    0.000
Residual          137.678       233        0.591
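
The least squares calculations behind this output can be sketched as follows. This Python example is an illustration only; the function name and the small data set are hypothetical, and SYSTAT's own routine may differ in detail. The slope is b = Sxy / Sxx, the intercept is a = mean(Y) - b * mean(X), and F is the regression mean square (1 df) divided by the residual mean square (n - 2 df).

```python
def least_squares(x, y):
    """Fit Y = a + bX by least squares; return a, b, r^2 and the F-ratio."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                       # slope
    a = mean_y - b * mean_x             # intercept
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_regression = ss_total - ss_residual
    r_squared = ss_regression / ss_total
    # F = regression mean square (1 df) / residual mean square (n - 2 df)
    f_ratio = ss_regression / (ss_residual / (n - 2))
    return a, b, r_squared, f_ratio

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b, r_squared, f_ratio = least_squares(x, y)
print(round(a, 3), round(b, 3), round(r_squared, 3), round(f_ratio, 3))

# The same F calculation applied to the sums of squares in the ANOVA table above.
f_systat = 444.780 / (137.678 / 233)
print(round(f_systat, 1))
```

The last two lines recompute the F-ratio from the sums of squares and degrees of freedom in the ANOVA table, which is a useful check when reading regression output.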

Regression Plot (SYSTAT Scatterplot)

Example problems

  1. The following data are rate of oxygen consumption (ml/g/hr) in crows at different temperatures (°C).  Does temperature affect oxygen consumption in crows?  Determine the equation for predicting oxygen consumption from temperature. (Protocol link)

 

temp          -18       -15       -10       -5         0          5          10        19

oxygen        5.2       4.7       4.5       3.6       3.4       3.1       2.7       1.8

 

  2. Use the following data on mean adult body weight (mg) and larval density (no./mm^3) of fruit flies to determine if there is a functional relationship between adult body weight and the larval density at which the flies were reared.  Determine the equation for predicting body weight from larval density. (Protocol link)

 

density            1          3          5          6       10        20        40

weight       1.356   1.356   1.284   1.252   0.989   0.664   0.475

 

Extrapolation: linear regressions are valid only within limits of data (independent variable, X); beyond data - do not know if relationship is linear
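
One way to guard against accidental extrapolation is to refuse predictions outside the range of X used to fit the line. The Python sketch below is an illustration under stated assumptions: the function name is hypothetical, the coefficients are those of the BABIES.SYD regression shown earlier, and the selection limits (svl between 200 and 290) stand in for the observed range of X.

```python
def predict_within_range(x, a, b, x_min, x_max):
    """Predict Y = a + bX only when X lies inside the observed range."""
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"X = {x} is outside the observed range [{x_min}, {x_max}]; "
            "linearity is unknown there"
        )
    return a + b * x

# Intercept and slope from the BABIES.SYD output; range from the selection used.
A, B = -16.956, 0.110
SVL_MIN, SVL_MAX = 200, 290

inside = predict_within_range(250, A, B, SVL_MIN, SVL_MAX)

try:
    predict_within_range(400, A, B, SVL_MIN, SVL_MAX)
    refused = False
except ValueError:
    refused = True
print(round(inside, 3), refused)
```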

 


 

 

 “A regression of tooth size on actual body length for the living Carcharodon carcharias indicates by extrapolation (assuming continued linearity) that C. megalodon was “only” 13 m (43 ft) in length!”

 __________________________________________________________

 

 

 Transformation in linear regression (goal: curvilinear → linear)

1.     SYSTAT e.g.: calibrate transmitters; DEMO

2.     Linear vs. log10 data regressions - note increase in r^2 and linearity with log transformation

  

 

 

Dep Var: PI   N: 7   Multiple R: 0.989   Squared multiple R: 0.978

 

Effect      Coefficient    Std Error  Std Coef Tolerance     t   P

CONSTANT    3172.273       97.857     0.000      .      32.417    0.000

TEMP        -65.363        4.390     -0.989     1.000  -14.888    0.000

 

                             Analysis of Variance

Source             Sum-of-Squares   df  Mean-Square     F-ratio   P

Regression           3518606.514     1  3518606.514     221.638   0.000

Residual               79377.200     5    15875.440

 

 

Dep Var: LPI   N: 7   Multiple R: 1.000   Squared multiple R: 0.999

 

Effect      Coefficient  Std Error    Std Coef Tolerance     t   P

CONSTANT    3.540        0.004        0.000      .     834.735    0.000

TEMP       -0.015        0.000       -1.000     1.000  -78.645    0.000

 

                             Analysis of Variance

Source             Sum-of-Squares   df  Mean-Square     F-ratio  P

Regression                 0.184     1        0.184    6185.068  0.000

Residual                   0.000     5        0.000
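
The effect of the transformation can be illustrated with a short Python sketch. The temperatures below are hypothetical, and the Y values are generated from the fitted semilog equation above (log Y = 3.540 - 0.015X), so they are exactly exponential; real data would include scatter. The function name is an assumption introduced here.

```python
import math

def r_squared(x, y):
    """Squared Pearson correlation between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Hypothetical temperatures; Y generated from log Y = 3.540 - 0.015 X.
x_temp = [0, 5, 10, 15, 20, 25, 30]
y_raw = [10 ** (3.540 - 0.015 * t) for t in x_temp]

r2_linear = r_squared(x_temp, y_raw)                          # curvilinear on raw scale
r2_log = r_squared(x_temp, [math.log10(v) for v in y_raw])    # linear after log10
print(round(r2_linear, 4), round(r2_log, 4))
```

The r^2 for the log-transformed values is higher than for the raw values, mirroring the comparison between the PI and LPI regressions.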

Predicting dependent variable Y from independent variable X

 

  Linear (Y, X) equations:   Y = a + bX

Example 1: using the regression equation Y = 14.5 + 2.56X, predict Y when X = 63

 

Y = 14.5 + 2.56(63) = 175.78

                             

____________________________________________

Example 2: inverse prediction (predict X from Y); Y = 14.5 + 2.56X

 

by algebraic manipulation Y-14.5 = 2.56X; (Y-14.5)/2.56 = X

 

predict X when Y = 175.78:

 

X = (175.78-14.5)/2.56 = 63
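
Both examples can be checked with a few lines of Python. This is a minimal sketch; the function names are hypothetical.

```python
A, B = 14.5, 2.56   # intercept and slope from the examples above

def predict_y(x):
    """Forward prediction: Y = a + bX."""
    return A + B * x

def predict_x(y):
    """Inverse prediction: X = (Y - a) / b."""
    return (y - A) / B

y_hat = predict_y(63)
x_back = predict_x(y_hat)
print(round(y_hat, 2), round(x_back, 2))   # 175.78 63.0
```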

Semilog (logY, X) equations:   log Y = log a + bX  (must take the inverse log of log Y to get final answer on linear scale)

Example:   using the regression equation log Y = 1.42234 + 0.047560X, predict Y when X = 12.1

log Y = 1.42234 + 0.047560(12.1) = 1.99782 (calculate intercept and answer to at least 5 decimal places); inverse log 1.99782 = 99.50

 

Note that the intercept (1.42234) is a log value (i.e., log a = 1.42234).  You must not take the log of this value when calculating log Y; that would be the equivalent of taking the log of a log!
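
The semilog calculation can be sketched in Python as follows, using the slope 0.047560 from the worked calculation. The function name is hypothetical. Note that the intercept enters the equation unchanged because it is already a log value.

```python
LOG_A, B = 1.42234, 0.047560   # intercept (already a log value) and slope

def predict_semilog(x):
    """Return (log10 Y, Y) for the semilog equation log Y = log a + bX."""
    log_y = LOG_A + B * x      # do not take the log of LOG_A again
    return log_y, 10 ** log_y  # inverse log returns Y on the linear scale

log_y, y = predict_semilog(12.1)
print(round(log_y, 5), round(y, 2))
```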

Review: logs are exponents

 

      log10(a) = b is the same as 10^b = a

Log rules

log10(ab) = log10(a) + log10(b)

log10(a/b) = log10(a) – log10(b)

log10(a^b) = b log10(a)
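
The rules are easy to confirm numerically. A short Python check, using arbitrary positive values chosen for illustration only:

```python
import math

a, b = 8.0, 3.0   # arbitrary positive values, for illustration only

product_rule = math.isclose(math.log10(a * b), math.log10(a) + math.log10(b))
quotient_rule = math.isclose(math.log10(a / b), math.log10(a) - math.log10(b))
power_rule = math.isclose(math.log10(a ** b), b * math.log10(a))
# Logs are exponents: raising 10 to log10(a) returns a.
inverse = math.isclose(10 ** math.log10(a), a)
print(product_rule, quotient_rule, power_rule, inverse)
```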

Log-log (logY, logX) equations:  log Y = log a + b(log X)

Example 1:  using the regression equation log Y = 2.53403 + 0.72000(log X), predict Y when X = 1.98

 

log Y = 2.53403 + 0.72000(log 1.98) = 2.74763 (calculate intercept and answer to at least 5 decimal places)

inverse log 2.74763 = 559.28

An alternative form of the log-log regression equation, and one that is much easier to use, follows from log Y = log a + b(log X) = log a + log(X^b)

take inverse logs:  Y = aX^b

Example 2:  using the regression equation Y = 342X^0.720, predict Y when X = 1.98 (note that 342 = the inverse log of 2.53403)

Y = 342(1.98^0.720) = 559.28
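
The two forms of the log-log equation give the same prediction, which can be confirmed with a short Python sketch (function names are hypothetical):

```python
import math

LOG_A, B = 2.53403, 0.72000   # intercept (a log value) and slope

def predict_loglog(x):
    """Log form: log Y = log a + b(log X), then take the inverse log."""
    return 10 ** (LOG_A + B * math.log10(x))

def predict_power(x):
    """Power form: Y = aX^b, with a = inverse log of the intercept."""
    a = 10 ** LOG_A
    return a * x ** B

y_log_form = predict_loglog(1.98)
y_power_form = predict_power(1.98)
print(round(10 ** LOG_A, 1), round(y_log_form, 2), round(y_power_form, 2))
```

The first printed value is the inverse log of the intercept, which rounds to the coefficient 342 used in Example 2.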

_______________________________________________________________

 

Examples of important uses of log-log regressions and regression plots in biology
1.     ecology: species-area curves (Island Biogeography Theory)

S = 3.3A^0.30

Slope is similar (~0.3) in some data sets; e.g.,

West Indian snakes: S = 1.19A^0.33

Galapagos land plants: S = 28.6A^0.32
 
2.     physiology: effects of scaling; e.g., metabolic rate and body mass

MR = 70M^0.75

Compare regression lines (ANCOVA); e.g.,

   -marsupials: MR = 0.409 M^0.75

   -eutherians: MR = 0.676 M^0.75 (>60% higher)
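
Two consequences of these power equations can be shown with a short Python sketch. The function name and the body mass value are hypothetical; the coefficients and exponents are those given above.

```python
def power_law(a, x, b):
    """Evaluate Y = aX^b (a straight line, log Y = log a + b log X, on log-log axes)."""
    return a * x ** b

# Species-area curve S = 3.3A^0.30: a tenfold increase in area
# multiplies S by 10^0.30, whatever the value of the coefficient.
area_factor = power_law(3.3, 10.0, 0.30) / power_law(3.3, 1.0, 0.30)

# Metabolic-rate lines with a common slope of 0.75: the ratio of predicted
# rates equals the ratio of the coefficients at every body mass M.
mass = 2.0   # arbitrary body mass, for illustration only
ratio = power_law(0.676, mass, 0.75) / power_law(0.409, mass, 0.75)
print(round(area_factor, 3), round(ratio, 3))
```

A slope near 0.3 therefore means that a tenfold increase in area roughly doubles the number of species, and the ratio of coefficients (about 1.65) is the source of the ">60% higher" comparison between eutherians and marsupials.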

 _____________________________________________________________________________________


 
Analysis of Covariance (ANCOVA) - not available in MYSTAT

1.     Purpose: Detect differences among means of two or more groups when the dependent variable is affected by a third variable (“covariate”); multivariate technique

2.     A covariate is a continuous independent variable that adds unwanted variability to the dependent variable

3.     ANCOVA removes the variability in the dependent variable due to the covariate

4.     ANCOVA combines the use of both ANOVA and regression methods

5.     ANCOVA permits assessing interaction between independent variables

6.     Assumptions (normality and equal variances of residuals) are checked after ANCOVA

7.     SYSTAT path:  Analyze → General Linear Model → Estimate Model → enter dependent and independent variables (categorical, covariate); enter interaction terms; enter KS and Levene on options tab

8.     MYSTAT: not available
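
The idea of removing covariate variability can be sketched in Python. The example below is an illustration only, not the SYSTAT General Linear Model procedure: it assumes the groups share a common slope (no group-by-covariate interaction), estimates that slope from the pooled within-group sums, and adjusts each group mean to the grand mean of the covariate. The function name and data are hypothetical, and no significance test is performed.

```python
def ancova_adjusted_means(groups):
    """Adjust each group's mean Y to the grand mean of the covariate X.

    groups maps a group name to (x_values, y_values). Assumes the groups
    share a common slope (no group-by-covariate interaction).
    """
    sxy = sxx = 0.0
    all_x = []
    means = {}
    for name, (x, y) in groups.items():
        mx, my = sum(x) / len(x), sum(y) / len(y)
        means[name] = (mx, my)
        sxy += sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        sxx += sum((xi - mx) ** 2 for xi in x)
        all_x.extend(x)
    slope = sxy / sxx                      # pooled within-group slope
    grand_x = sum(all_x) / len(all_x)
    adjusted = {name: my - slope * (mx - grand_x)
                for name, (mx, my) in means.items()}
    return slope, adjusted

# Hypothetical data, for illustration only.
groups = {
    "A": ([1, 2, 3], [2, 4, 6]),
    "B": ([4, 5, 6], [9, 11, 13]),
}
slope, adjusted = ancova_adjusted_means(groups)
print(slope, adjusted)
```

In this made-up example the raw group means of Y differ by 7 units, but most of that difference is due to the covariate; after adjustment the means differ by only 1 unit.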

 

Example: (pic) Does residual yolk mass differ between fed and unfed turtle hatchlings? (pdf)

  1. One approach is to measure yolk mass on hatchlings and analyze mass data with a t-test or ANOVA.

   

 

Analysis of Variance (ANOVA)

Source       SS        df    Mean Square    F-ratio    p-value
Treat         0.022     1       0.022         0.063    0.803 ns
Error        28.325    82       0.345

Note the two sources of variation and the relatively large (94%) error variance.
 

2.     Confounding factor: yolk mass is a function of age (r^2 ≈ 80%); incorporate age variation into the analysis (ANCOVA)

 Analysis of Variance (ANCOVA)

Source       SS        df    Mean Square    F-ratio    p-value
Treat         0.002     1       0.002         0.030    0.862 ns
Age          22.608     1      22.608       318.561    0.000 ***
Treat*Age     0.007     1       0.007         0.099    0.754
Error         5.678    80       0.071

Note the additional sources of variation and greatly reduced (<1%) error variance.

3.     Student responsibility for ANCOVA

-         have basic understanding

-         be able to interpret output

-         be able to work and explain the example problem (no protocol sheet)

 

 ANCOVA example problem

  1. A common belief is that men are stronger than women.  Is this belief due to men being bigger or are men actually stronger when compared to women of similar body size?  Test this question on data from a sample of healthy young adults (stronger.syd).  The variables are sex, lean body mass, and a measure of strength called “slow, right extensor knee peak torque.” 

  

 

 

Multivariate statistics (univariate, bivariate, multivariate)

·        Examples: Two-way ANOVA, ANCOVA

·        Other multivariate procedures commonly used in the literature

a.     principal components

b.     MANOVA

c.      factor analysis

d.     logistic regression

e.      multiple regression

 

A.   Multiple regression example (>1 independent variable); DEMO FYI only

1.     Stepwise method (Analyze → Regression → Linear → Least Squares → Options → Stepwise)

-         constructs model with the highest overall r^2

-         adds variables in steps according to the strength of a significant relationship

-         graph and equation of model

 

Dependent Variable: NOCAP

Step #1    R-Square = 0.758; Term entered: MAXAIR

Effect          Coef.          Df      F        P
In
  1 Constant
  2 MAXAIR      4.313           1      1.20    0.000

Out             Part. Corr.
  3 MINAIR      0.464           1      6.24    0.000
  4 SOLRAD      0.212           1      0.98    0.177
-------------------------------------------------

Step #2    R-Square = 0.905; Term entered: MINAIR

Effect          Coef.          Df      F        P
In
  1 Constant
  2 MAXAIR      0.153           1     16.22    0.000
  3 MINAIR      3.262           1      3.17    0.000

Out             Part. Corr.
  3 SOLRAD      0.267           1      1.08    0.081

Stop