Statistics Final Exam Cheat Sheet

Complete statistics cheat sheet for your final exam

Typology: Cheat Sheet

2018/2019
Uploaded on 09/02/2019

zeb
Population - the entire collection of objects or individuals about which information is desired.
➔ It is usually easier to take a sample
◆ Sample - the part of the population that is selected for analysis
◆ Watch out for:
● A limited sample size that might not be representative of the population
◆ Simple Random Sampling - every possible sample of a certain size has the same chance of being selected

Observational Study - there can always be lurking variables affecting the results
➔ e.g., the strong positive association between shoe size and intelligence for boys (both increase with age, the lurking variable)
➔ **should never be used to show causation

Experimental Study - lurking variables can be controlled; can give good evidence for causation

Descriptive Statistics Part I
➔ Summary Measures
➔ Mean - arithmetic average of the data values
◆ **Highly susceptible to extreme values (outliers); it is pulled toward them
◆ The mean can never be larger than the maximum or smaller than the minimum, but it can equal the max/min value
➔ Median - in an ordered array, the median is the middle number
◆ **Not affected by extreme values
➔ Quartiles - split the ranked data into 4 equal groups
◆ Box and Whisker Plot
➔ Range = $X_{\text{maximum}} - X_{\text{minimum}}$
◆ Disadvantages: ignores the way in which the data are distributed; sensitive to outliers
➔ Interquartile Range (IQR) = 3rd quartile - 1st quartile
◆ Not used that much
◆ Not affected by outliers
➔ Variance - the average squared distance from the mean
$$s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
◆ squaring gets rid of the negative values
◆ units are squared
➔ Standard Deviation - shows variation about the mean
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
◆ highly affected by outliers
◆ has the same units as the original data
◆ in finance: a horrible measure of risk (trampoline example)
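
The summary measures above can be checked on a small data set. The sketch below uses only Python's standard library; the data values are hypothetical, and the quartiles follow the library's default ("exclusive") convention, which may differ slightly from the convention used in a given course or in Stata.

```python
import statistics

# Hypothetical sample; the value 100 plays the role of an outlier.
data = [2, 4, 4, 5, 7, 9, 100]

mean = statistics.mean(data)          # pulled toward the outlier
median = statistics.median(data)      # middle value of the ordered data
variance = statistics.variance(data)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)      # square root of the sample variance
data_range = max(data) - min(data)    # X_maximum - X_minimum

# Quartiles with the standard library's default method (an assumption;
# other textbooks and packages interpolate differently).
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(f"mean={mean:.2f} median={median} range={data_range} IQR={iqr}")
print(f"variance={variance:.2f} std_dev={std_dev:.2f}")
```

The mean lands far above the median here because of the single extreme value, while the median and IQR barely notice it.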

Descriptive Statistics Part II
Linear Transformations
➔ Linear transformations change the center and spread of data
➔ $\mathrm{Var}(a + bX) = b^2\,\mathrm{Var}(X)$
➔ Average(a + bX) = a + b[Average(X)]
➔ Effects of Linear Transformations:
◆ $\text{mean}_{new} = a + b \cdot \text{mean}$
◆ $\text{median}_{new} = a + b \cdot \text{median}$
◆ $\text{stdev}_{new} = |b| \cdot \text{stdev}$
◆ $\text{IQR}_{new} = |b| \cdot \text{IQR}$
➔ Z-score - new data set will have mean 0 and variance 1
$$z = \frac{X - \bar{X}}{S}$$
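
As a quick numerical check of the transformation rules and the z-score, here is a minimal sketch. The data and the constants `a` and `b` are made up for illustration.

```python
import statistics

x = [10.0, 12.0, 15.0, 19.0, 24.0]  # hypothetical data
a, b = 32.0, -1.8                    # hypothetical constants in a + b*X
y = [a + b * xi for xi in x]         # linearly transformed data

mean_x, sd_x = statistics.mean(x), statistics.stdev(x)
mean_y, sd_y = statistics.mean(y), statistics.stdev(y)

# The center shifts and scales; the spread only scales, by |b|.
print(mean_y, a + b * mean_x)
print(sd_y, abs(b) * sd_x)

# Z-scores: subtract the mean, divide by the standard deviation.
z = [(xi - mean_x) / sd_x for xi in x]
print(statistics.mean(z), statistics.variance(z))  # close to 0 and 1
```

Note that the sign of `b` drops out of the standard deviation, which is why the rule uses |b|.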

Empirical Rule
➔ Only for mound-shaped data
➔ Approx. 95% of the data is in the interval:
$$(\bar{x} - 2s_x,\ \bar{x} + 2s_x) = \bar{x} \pm 2s_x$$
➔ only use if you just have the mean and std. dev.

Chebyshev's Rule
➔ Use for any set of data and for any number k greater than 1 (1.2, 1.3, etc.)
➔ At least $1 - \frac{1}{k^2}$ of the data falls within k standard deviations of the mean
➔ (Ex) for k = 2 (2 standard deviations), at least 75% of the data falls within 2 standard deviations
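
Chebyshev's bound is easy to tabulate and compare against real data. In this sketch the function names and the sample are illustrative assumptions, not part of the original sheet.

```python
import statistics

def chebyshev_lower_bound(k: float) -> float:
    """Minimum share of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2

def share_within(data, k):
    """Observed share of data within k sample standard deviations."""
    m = statistics.mean(data)
    s = statistics.stdev(data)
    return sum(1 for x in data if abs(x - m) <= k * s) / len(data)

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 30]  # hypothetical, clearly not mound-shaped
print(chebyshev_lower_bound(2))          # 0.75
print(share_within(data, 2))             # observed share, never below the bound
```

The bound is a guarantee, not an estimate: the observed share is usually much higher than $1 - 1/k^2$.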

Detecting Outliers
➔ Classic Outlier Detection
◆ doesn't always work
◆ $|z| = \left|\frac{X - \bar{X}}{S}\right| \geq 2$
➔ The Boxplot Rule
◆ Value X is an outlier if:
$$X < Q_1 - 1.5(Q_3 - Q_1) \quad \text{or} \quad X > Q_3 + 1.5(Q_3 - Q_1)$$
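
Both outlier rules can be sketched in a few lines. The function names and data are hypothetical, and the quartiles again use the standard library's default convention, so fences may differ slightly from other software.

```python
import statistics

def z_score_outliers(data, threshold=2.0):
    """Classic rule: flag values with |z| >= threshold."""
    m = statistics.mean(data)
    s = statistics.stdev(data)
    return [x for x in data if abs((x - m) / s) >= threshold]

def boxplot_outliers(data):
    """Boxplot rule: flag values beyond 1.5 * IQR from the quartiles."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

data = [2, 4, 4, 5, 7, 9, 100]  # hypothetical sample
print(z_score_outliers(data))
print(boxplot_outliers(data))
```

One reason the classic rule "doesn't always work" is that outliers inflate S itself, which shrinks every z-score; the boxplot rule relies on quartiles, which are not affected in the same way.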

Skewness
➔ measures the degree of asymmetry exhibited by data
◆ negative values = skewed left
◆ positive values = skewed right
◆ if |skewness| < 0.8, don't need to transform the data

Measurements of Association
➔ Covariance
◆ Covariance > 0 = larger x, larger y
◆ Covariance < 0 = larger x, smaller y
◆ $s_{xy} = \frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
◆ Units = (units of x) × (units of y)
◆ Only the sign of the covariance (+, -, or 0) is interpreted; the magnitude can be any number
➔ Correlation - measures the strength of a linear relationship between two variables
◆ $r_{xy} = \frac{\text{covariance}_{xy}}{(\text{std. dev.}_x)(\text{std. dev.}_y)}$
◆ correlation is between -1 and 1
◆ Sign: direction of relationship
◆ Absolute value: strength of relationship (-0.6 is a stronger relationship than +0.4)
◆ Correlation doesn't imply causation
◆ The correlation of a variable with itself is one
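
The covariance and correlation formulas translate directly into code. This is a minimal sketch with hypothetical data; the function names are not from the original sheet.

```python
import statistics

def sample_covariance(x, y):
    """s_xy = sum((x_i - x_bar) * (y_i - y_bar)) / (n - 1)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """r_xy = covariance divided by the product of the standard deviations."""
    return sample_covariance(x, y) / (statistics.stdev(x) * statistics.stdev(y))

x = [1, 2, 3, 4, 5]  # hypothetical data
y = [2, 4, 5, 4, 5]

print(sample_covariance(x, y))  # positive: larger x tends to go with larger y
print(correlation(x, y))        # unit-free, between -1 and 1
print(correlation(x, x))        # a variable with itself: 1 (up to rounding)
```

Dividing by the two standard deviations removes the units, which is why correlation is comparable across data sets and covariance is not.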

Combining Data Sets
For $Z = aX + bY$:
➔ Mean(Z) = $\bar{Z} = a\bar{X} + b\bar{Y}$
➔ Var(Z) = $s_z^2 = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X, Y)$

Portfolios
➔ Return on a portfolio:
$$R_p = w_A R_A + w_B R_B$$
◆ weights add up to 1
◆ return = mean
◆ risk = std. deviation
➔ Variance of return of portfolio:
$$s_p^2 = w_A^2 s_A^2 + w_B^2 s_B^2 + 2 w_A w_B (s_{A,B})$$
◆ Risk (variance) is reduced when stocks are negatively correlated (when there's a negative covariance)
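
To see how a negative covariance lowers portfolio risk, the two-asset formulas can be evaluated directly. The weights, standard deviations, and covariances below are invented for illustration.

```python
import math

def portfolio_return(w_a, r_a, r_b):
    """R_p = w_A * R_A + w_B * R_B, with w_B = 1 - w_A."""
    return w_a * r_a + (1 - w_a) * r_b

def portfolio_variance(w_a, s_a, s_b, cov_ab):
    """s_p^2 = w_A^2 s_A^2 + w_B^2 s_B^2 + 2 w_A w_B s_AB."""
    w_b = 1 - w_a
    return w_a**2 * s_a**2 + w_b**2 * s_b**2 + 2 * w_a * w_b * cov_ab

# Hypothetical inputs: equal weights, std. devs of 20% and 30%.
var_neg = portfolio_variance(0.5, 0.20, 0.30, cov_ab=-0.03)
var_pos = portfolio_variance(0.5, 0.20, 0.30, cov_ab=0.03)

print(math.sqrt(var_neg), math.sqrt(var_pos))  # risk as a std. deviation
print(portfolio_return(0.5, 0.08, 0.12))
```

With everything else fixed, only the last term changes sign, so the negatively correlated pair always has the smaller variance.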

Probability
➔ measure of uncertainty
➔ all outcomes have to be exhaustive (all options possible) and mutually exclusive (no 2 outcomes can occur at the same time)
Standard Error and Margin of Error
Example of Sample Proportion Problem

Determining Sample Size
$$n = \frac{(1.96)^2\,\hat{p}(1 - \hat{p})}{e^2}$$
➔ If given a confidence interval, $\hat{p}$ is the middle number of the interval
➔ No confidence interval: use the worst case scenario
◆ $\hat{p} = 0.5$
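
The sample-size formula can be wrapped in a small helper. This sketch assumes that `e` is the desired margin of error expressed as a proportion and that the result is rounded up to the next whole observation; the function name is hypothetical.

```python
import math

def sample_size_for_proportion(margin_of_error, p_hat=0.5, z=1.96):
    """n = z^2 * p_hat * (1 - p_hat) / e^2, rounded up.

    p_hat defaults to 0.5, the worst case when no prior estimate exists.
    """
    n = z**2 * p_hat * (1 - p_hat) / margin_of_error**2
    return math.ceil(n)

print(sample_size_for_proportion(0.03))              # worst case, 3% margin
print(sample_size_for_proportion(0.05))              # worst case, 5% margin
print(sample_size_for_proportion(0.05, p_hat=0.2))   # with a prior estimate
```

Using $\hat{p} = 0.5$ maximizes $\hat{p}(1 - \hat{p})$, so it gives the largest (safest) sample size.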

B. One Sample Mean
For samples n > 30
Confidence Interval:
➔ If n > 30, we can substitute s for σ so that we get:
For samples n < 30
T Distribution used when:
➔ σ is not known, n < 30, and the data is normally distributed
* Stata always uses the t-distribution when computing confidence intervals
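
The interval formulas for this section did not survive in the text. As a hedged sketch, the code below assumes the standard large-sample 95% interval $\bar{x} \pm 1.96\, s/\sqrt{n}$, which is consistent with substituting s for σ when n > 30. The function name and data are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Large-sample interval: x_bar +/- z * s / sqrt(n) (assumed formula)."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)       # s stands in for the unknown sigma
    margin = z * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

sample = list(range(1, 41))            # hypothetical sample with n = 40
low, high = mean_confidence_interval(sample)
print(low, high)
```

For n < 30 the multiplier 1.96 would be replaced by a critical value from the t-distribution with n - 1 degrees of freedom, which the standard library does not provide.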

Hypothesis Testing
➔ Null Hypothesis: $H_0$, a statement of no change that is assumed true until evidence indicates otherwise.
➔ Alternative Hypothesis: $H_a$ is a statement that we are trying to find evidence to support.
➔ Type I error: reject the null hypothesis when the null hypothesis is true (considered the worst error).
➔ Type II error: do not reject the null hypothesis when the alternative hypothesis is true.
Example of Type I and Type II errors

Methods of Hypothesis Testing
1. Confidence Intervals **
2. Test statistic
3. P-values **
➔ C.I. and p-values are always safe to use because you don't need to worry about the size of n (it can be bigger or smaller than 30)

One Sample Hypothesis Tests
1. Confidence Interval (can be used only for two-sided tests)
2. Test Statistic Approach (Population Mean)
3. Test Statistic Approach (Population Proportion)
4. P-Values
➔ a number between 0 and 1
➔ the larger the p-value, the more consistent the data is with the null
➔ the smaller the p-value, the more consistent the data is with the alternative
➔ **If P is low (less than 0.05), $H_0$ must go - reject the null hypothesis
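
The test-statistic formulas for this section were lost from the text. The sketch below therefore assumes the usual large-sample statistic for a population mean, $z = (\bar{x} - \mu_0)/(s/\sqrt{n})$, and a two-sided p-value from the standard normal distribution. All names and numbers are illustrative.

```python
import math
from statistics import NormalDist

def two_sided_p_value(x_bar, mu_0, s, n):
    """Return (z, p) for H0: mu = mu_0 against Ha: mu != mu_0.

    Assumes a large sample so that z is approximately standard normal.
    """
    z = (x_bar - mu_0) / (s / math.sqrt(n))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

z, p = two_sided_p_value(x_bar=52, mu_0=50, s=6, n=36)  # hypothetical numbers
print(f"z = {z:.2f}, p = {p:.4f}")
print("reject H0" if p < 0.05 else "do not reject H0")
```

Here the sample mean is two standard errors from the hypothesized value, which is just past the 1.96 cutoff, so the p-value falls slightly below 0.05.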


➔ Mean for uniform distribution:
$$E(X) = \frac{a + b}{2}$$
➔ Variance for unif. distribution:
$$\mathrm{Var}(X) = \frac{(b - a)^2}{12}$$

Normal Distribution
➔ governed by 2 parameters: μ (the mean) and σ (the standard deviation)
➔ $X \sim N(\mu, \sigma^2)$

Standardize Normal Distribution:
$$Z = \frac{X - \mu}{\sigma}$$
➔ Z-score is the number of standard deviations the related X is from its mean
➔ **Z < some value: the probability found on the table
➔ **Z > some value: (1 - probability) found on the table

Normal Distribution Example

Sums of Normals

Sums of Normals Example:
➔ Cov(X,Y) = 0 b/c they're independent

Central Limit Theorem
➔ as n increases, $\bar{x}$ should get closer to μ (population mean)
➔ mean($\bar{x}$) = μ
➔ variance($\bar{x}$) = $\sigma^2 / n$
➔ $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$
◆ if the population is normally distributed, n can be any value
◆ for any population, n needs to be ≥ 30
➔ $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$

Confidence Intervals = tell us how good our estimate is
**Want high confidence, narrow interval
**As confidence increases, the interval width also increases

A. One Sample Proportion
➔ $\hat{p} = \frac{x}{n} = \frac{\text{number of successes in sample}}{\text{sample size}}$
➔ We are thus 95% confident that the true population proportion is in the interval…
➔ We are assuming that n is large, $n\hat{p} > 5$, and our sample size is less than 10% of the population size.
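
The interval formula for a sample proportion is missing from the text. As a sketch, the code below assumes the standard form $\hat{p} \pm 1.96\sqrt{\hat{p}(1 - \hat{p})/n}$, which matches the sample-size formula $n = (1.96)^2\,\hat{p}(1 - \hat{p})/e^2$ when solved for the margin of error e. The function name and counts are hypothetical.

```python
import math

def proportion_confidence_interval(successes, n, z=1.96):
    """p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n) (assumed standard form)."""
    p_hat = successes / n
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin_of_error = z * standard_error
    return p_hat - margin_of_error, p_hat + margin_of_error

low, high = proportion_confidence_interval(successes=60, n=100)
print(f"95% CI for the population proportion: ({low:.3f}, {high:.3f})")
```

The midpoint of the interval is $\hat{p}$ itself, which is why a reported interval can be used to recover $\hat{p}$ when determining sample size.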

Assumptions of Simple Linear Regression
1. We model the AVERAGE of something rather than something itself
2.
◆ As ε (noise) gets bigger, it's harder to find the line

Estimating $S_e$
➔ $S_e^2 = \frac{SSE}{n - 2}$
➔ $S_e^2$ is our estimate of $\sigma^2$
➔ $S_e = \sqrt{S_e^2}$ is our estimate of σ
➔ 95% of the Y values should lie within the interval $b_0 + b_1 X \pm 1.96\,S_e$

Example of Prediction Intervals:

Standard Errors for $b_1$ and $b_0$
➔ standard errors increase when noise increases
➔ $s_{b_0}$: amount of uncertainty in our estimate of $\beta_0$ (small s good, large s bad)
➔ $s_{b_1}$: amount of uncertainty in our estimate of $\beta_1$

Confidence Intervals for $b_1$ and $b_0$
➔ n small → bad
➔ $s_e$ big → bad
➔ $s_x^2$ small → bad (want the x's spread out for a better guess)

Regression Hypothesis Testing
*always a two-sided test
➔ want to test whether the slope ($\beta_1$) is needed in our model
➔ $H_0$: $\beta_1 = 0$ (don't need x)
$H_a$: $\beta_1 \neq 0$ (need x)
➔ Need X in the model if:
a. 0 isn't in the confidence interval
b. |t| > 1.96
c. P-value < 0.05

Test Statistic for Slope/Y-intercept
➔ can only be used if n > 30
➔ if n < 30, use p-values

Multiple Regression
➔ Variable Importance:
◆ higher t-value, lower p-value = variable is more important
◆ lower t-value, higher p-value = variable is less important (or not needed)

Adjusted R-squared
➔ k = # of X's
➔ Adj. R-squared will decrease as you add junk x variables
➔ Adj. R-squared will only increase if the x you add in is very useful
➔ **want Adj. R-squared to go up and $S_e$ low for a better model

The Overall F Test
➔ Always want to reject the F test (reject the null hypothesis)
➔ Look at the p-value (if < 0.05, reject null)
➔ $H_0$: $\beta_1 = \beta_2 = \beta_3 = \dots = \beta_k = 0$ (don't need any X's)
$H_a$: at least one of $\beta_1, \beta_2, \dots, \beta_k \neq 0$ (need at least 1 X)
➔ If no x variables are needed, then SSR = 0 and SST = SSE

Modeling Regression
Backward Stepwise Regression
1. Start with all variables in the model
2. At each step, delete the least important variable based on the largest p-value above 0.05
3. Stop when you can't delete any more
➔ Will see Adj. R-squared go up and $S_e$ go down

Dummy Variables
➔ An indicator variable that takes on a value of 0 or 1; allows intercepts to change

Interaction Terms
➔ allow the slopes to change
➔ interaction between 2 or more x variables that will affect the Y variable

How to Create Dummy Variables (Nominal Variables)
➔ If C is the number of categories, create (C - 1) dummy variables for describing the variable
➔ One category is always the "baseline", which is included in the intercept

Recoding Dummy Variables
Example: How many hockey sticks sold in the summer (original equation)
$$\text{hockey} = 100 + 10\,\text{Wtr} - 20\,\text{Spr} + 30\,\text{Fall}$$
Write the equation for how many hockey sticks are sold in the winter
$$\text{hockey} = 110 + 20\,\text{Fall} - 30\,\text{Spri} - 10\,\text{Summer}$$
➔ **always need to get the same exact values as from the original equation
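
The recoding rule (every season must get the same prediction under either baseline) can be verified mechanically. The coefficients below are read from the hockey example as reconstructed from the garbled text, including the minus signs, which are an assumption; the function names are hypothetical.

```python
def sticks_summer_baseline(wtr=0, spr=0, fall=0):
    """Original equation: summer is the baseline (all dummies 0)."""
    return 100 + 10 * wtr - 20 * spr + 30 * fall

def sticks_winter_baseline(fall=0, spri=0, summer=0):
    """Recoded equation: winter is the baseline (all dummies 0)."""
    return 110 + 20 * fall - 30 * spri - 10 * summer

# Predicted sales per season under each coding.
original = {
    "summer": sticks_summer_baseline(),
    "winter": sticks_summer_baseline(wtr=1),
    "spring": sticks_summer_baseline(spr=1),
    "fall": sticks_summer_baseline(fall=1),
}
recoded = {
    "summer": sticks_winter_baseline(summer=1),
    "winter": sticks_winter_baseline(),
    "spring": sticks_winter_baseline(spri=1),
    "fall": sticks_winter_baseline(fall=1),
}

for season in original:
    print(season, original[season], recoded[season])
```

Each new coefficient is the old seasonal prediction minus the new baseline's prediction, which is why the new intercept equals the old winter value.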