1 Summary of discussion
1. Quantitative (numerical variable) vs qualitative (categorical variable)
2. Discrete distribution vs continuous distribution
3. Mean vs median
4. Mean vs proportion
5. Population vs sample
6. Some issues that you need to be aware of
(a) Is the sample representative?
(b) Is there measurement error?
(c) Are missing values present with a pattern?
2 Task: learn population, (random) sample
Goal: understand the difference between population and sample.
Reading: Appendix C.1 of the textbook
Discuss
1. What is a population?
2. How should we define the population if our goal is to show the relationship between class size
(number of students in a class) and the rating of the eco201 instructor?
3. What is a sample?
4. Is our sample, which is the Excel file I provided, representative if the population is
all the Miami students that have taken eco201?
5. Is our sample representative if the population is all the Miami econ-major students
that have taken eco201?
Define your population appropriately. Do not over-generalize the result based on your sample.
6. If the population is all the Miami students that have taken eco201, how to design a
new survey so that a more representative sample can be obtained? Comment on the
following ideas
(a) Go to the Rec Center, and do the new survey
(b) Go to the main lobby of FSB building, and do the new survey
(c) Randomly select students using their campus id number, and call them
You can assume the sample is random, or obtain the random sample using random numbers.
Discuss
1. Where can we find random numbers? How about using the phone number or SSN?
2. Can we use a computer to generate random numbers?
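On the last question: a computer can generate (pseudo)random numbers, and these can drive idea (c) above. The following is a minimal Python sketch, not a prescribed procedure. The roster of campus id numbers is hypothetical (made up for illustration), as are the function name and the seed.

```python
import random

def draw_random_sample(campus_ids, sample_size, seed=None):
    """Pick sample_size ids without replacement; every id is equally likely."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(campus_ids, sample_size)

# Hypothetical roster: ids 1000..1999 stand in for students who took eco201.
roster = list(range(1000, 2000))
selected = draw_random_sample(roster, sample_size=30, seed=311)
print(selected)
```

Because every id has the same chance of being drawn, selection does not depend on where a student happens to be (Rec Center, FSB lobby) or on the student's opinion of the instructor.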
Math: population and random sample
Goal: understand the properties of random sample.
Reading: Appendix B.1, B.2, B.3, B.4 of the textbook
1. Statistically speaking, a population is an (unknown) distribution of a certain variable.
2. For example, the population for our purpose may be the distribution of the Miami
econ-major students’ ratings of their eco201 instructors.
3. The population distribution can be characterized by parameters such as population
mean (expected value) denoted by $E(y)$ or $\mu_y$, population variance denoted by $\mathrm{var}(y)$
or $\sigma_y^2$, etc. Usually those parameters are unknown, and we want to estimate them.
4. A sample is a part (portion or subset) of the population.
5. For example, one sample is the ratings provided by the students in this eco311 class
(section). Another class may provide a different sample.
6. Obtaining a sample is much easier than obtaining a population.
7. Statistics is about using the sample to estimate (make inferences about) the unknown
population distribution and its parameters.
8. Intuitively, the estimate is “good” if the sample is “good.” A random sample is such a
good sample.
9. A random sample $\{y_1, y_2, \ldots, y_n\}$, or $\{y_i\}_{i=1}^{n}$, is a special sample with nice properties
(a) $E(y_i) = \mu_y$, $(i = 1, 2, \ldots, n)$. In words, all observations have identical mean.
(b) $\mathrm{var}(y_i) = \sigma_y^2$, $(i = 1, 2, \ldots, n)$. In words, all observations have identical variance.
(c) $\mathrm{cov}(y_i, y_j) = 0$, $i \neq j$. In words, all observations are independent, so they have
zero covariance with each other.
10. Put differently, a random sample is an i.i.d. sample. i.i.d. stands for independently and
identically distributed.
11. A biased (non-random) sample arises typically because people choose (or select) to be
in the sample.
12. One example of a biased sample is the sample of students who finish the online course
evaluations. Those students choose to do so because either they like the instructor
or hate the instructor. This biased sample cannot represent the students who fail to
do the evaluation. Mathematically, the students who finish and who do not finish the
evaluations follow different distributions. So “identically distributed” is violated.
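The course-evaluation example can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the population shares of each rating and the response probabilities below are hypothetical numbers chosen for illustration, not data from any real evaluation.

```python
import random
import statistics

rng = random.Random(201)

# Hypothetical population of 20,000 ratings on a 1-5 scale (assumed shares).
ratings = [1, 2, 3, 4, 5]
shares = [0.05, 0.10, 0.30, 0.35, 0.20]
population = rng.choices(ratings, weights=shares, k=20_000)

# Assumed response rule: students with strong opinions (1 or 5) respond far more often.
response_prob = {1: 0.9, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.9}
self_selected = [r for r in population if rng.random() < response_prob[r]]

# A random sample of the same size, drawn without regard to the rating.
random_sample = rng.sample(population, len(self_selected))

pop_mean = statistics.mean(population)
random_mean = statistics.mean(random_sample)
self_selected_mean = statistics.mean(self_selected)
print("population mean:   ", round(pop_mean, 3))
print("random-sample mean:", round(random_mean, 3))
print("self-selected mean:", round(self_selected_mean, 3))
```

Under these assumptions the random-sample mean stays close to the population mean, while the self-selected mean does not, because the students who respond follow a different distribution from those who do not.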
Critical thinking:
1. How to get the random sample?
2. Is the time series of US GDP from 2001 to 2012 a random sample?
3. Is the sample of econ honor students a random sample for estimating the average GPA
of econ-major students?
4. Is the sample of Miami students a random sample for estimating the average family
income of all US college students?
Check “identically” and “independently” for the sample that you intend to use.
Define or choose the population appropriately. Do not over-generalize your result.
3 Task: Estimation
Goal: understand estimator and property of its sampling distribution.
Reading: Appendix C.2, C.3, C.4, C.5 of the textbook
Key points
1. We use a sample to estimate the population parameter. For instance, we use the sample
mean, sample variance, etc., to estimate the population mean, population variance, etc.
2. The sample mean and sample variance are examples of estimators. The value of the
estimator obtained from a given sample is called an estimate.
3. We do NOT expect the sample mean to be the same as the population mean because
by definition a sample is just part of the population. The difference between sample and
population gives rise to sampling error. The sampling error is random because different
samples can be used.
4. People can use different samples, and obtain different sample means (for the same
population mean). For example, you may use the men in Alabama (sample one), or
the men in Ohio (sample two), to estimate the average height of US men. This fact
highlights that
(a) The sample mean is a random variable. It is random due to the sampling error.
(b) The distribution of the sample mean is called the sampling distribution. Do not
confuse the sampling distribution with the population distribution.
5. Discuss
(a) Compute the sample mean of the family income y using the Excel file I give you.
The Stata command to get the sample mean (and other descriptive statistics) is
sum y
(b) There are other sections of ECO 311 being taught. Do you think the other sections
will get the same sample mean of the family income as our section?
(c) What does “sample mean is a random variable” mean?
(d) Is population mean a random variable?
6. Sample mean is just one estimator for population mean. There are other estimators.
7. The sample mean is the most popular estimator for the population mean because its sampling
distribution has some nice properties
(a) (unbiasedness): the average of (indefinitely many) sample means obtained from
different samples is the same as the population mean.
(b) (efficiency): the variance of the sample mean is smaller than that of some other estimators.
This means the sample mean does not vary much across different samples.
(c) (consistency): as the sample size gets bigger, the sample mean converges to the
population mean. This result is called the law of large numbers.
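Unbiasedness and consistency can be seen in a short simulation. The sketch below assumes a normal population with mean 50 and standard deviation 10. These values, the function name, and the seed are all made up for illustration; any other population with a finite mean and variance should behave similarly.

```python
import random
import statistics

rng = random.Random(311)
MU, SIGMA = 50.0, 10.0  # assumed population mean and standard deviation

def sample_mean(n):
    """Draw an i.i.d. sample of size n and return its sample mean."""
    return statistics.fmean(rng.gauss(MU, SIGMA) for _ in range(n))

# Unbiasedness: average many sample means, each from a sample of size 25.
means_n25 = [sample_mean(25) for _ in range(4000)]
average_of_means = statistics.fmean(means_n25)
print("average of sample means:", round(average_of_means, 2))

# Consistency: a single sample mean tends to get closer to MU as n grows.
for n in (10, 1000, 100_000):
    print(n, round(sample_mean(n), 2))
```

Each individual sample mean differs from 50 (sampling error), but their average is very close to 50, and the sample mean from a very large sample is itself close to 50.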
Math: sample mean and its sampling distribution
Goal: understand the mean and variance of sample mean
Reading: Appendix A.1, C.2, C.3, C.4, C.5 of the textbook
1. We want to show the properties of the sample mean obtained from a random sample
2. Random sample means the observations $y_1, y_2, \ldots, y_n$ are i.i.d.
3. The formula for the sample mean $\bar{y}$ is
$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{y_1 + y_2 + \dots + y_n}{n} \tag{1}$$
where the sigma notation is the shorthand for sum (summation operator).
4. The sample mean obtained from the random sample is an unbiased estimator for the
population mean because
$$E(\bar{y}) = E\left(\frac{y_1 + y_2 + \dots + y_n}{n}\right) = \frac{\mu_y + \mu_y + \dots + \mu_y}{n} = \mu_y \tag{2}$$
where E is the expectation operator. We use the property that the expectation of a sum
is the sum of the expectations:
$$E(y_i + y_j) = E(y_i) + E(y_j) \tag{3}$$
5. Result (2) implies the center (or mean) of the sampling distribution of $\bar{y}$ is the
population mean $\mu_y$. In short, the average of the sample mean is the population mean. In
a particular sample, the sample mean can be different from the population mean.
6. Discuss
(a) Why do we emphasize random sample?
(b) Please find $E(\bar{y})$ if $E(y_1) \neq \mu_y$, $E(y_i) = \mu_y$, $(i \geq 2)$. Is this a random sample? Is
the sample mean unbiased?
7. The variance of the sample mean (based on random sample) is
$$\mathrm{var}(\bar{y}) = \mathrm{var}\left(\frac{y_1 + y_2 + \dots + y_n}{n}\right) = \frac{\sigma_y^2 + \sigma_y^2 + \dots + \sigma_y^2}{n^2} = \frac{\sigma_y^2}{n} \tag{4}$$
Here we use the facts that
$$\mathrm{var}(c y_i) = c^2 \mathrm{var}(y_i) \tag{5}$$
$$\mathrm{var}(y_i + y_j) = \mathrm{var}(y_i) + \mathrm{var}(y_j) + 2\mathrm{cov}(y_i, y_j) \tag{6}$$
$$\mathrm{cov}(y_i, y_j) = 0 \quad \text{(for random sample)} \tag{7}$$
See equation [C.6] on page 760 for more details.
Remarks
(a) Formula (5) shows we need to square the constant if taking it out of the variance.
(b) Formula (6) shows the variance of a sum equals the sum of the variances plus twice the covariance.
(c) Formula (7) shows another reason why random sample is popular. We can drop
the covariance term if observations are independent.
8. Result (4) shows that as the sample size n rises, the variance of the sample mean falls.
Results (2) and (4) jointly explain why the sample mean is a consistent estimator.
9. Discuss. Suppose another (bad) estimator for the population mean is $\tilde{y} = \frac{y_1 + y_2}{2}$.
This estimator only uses the first two observations in the sample (and ignores the other
observations). By contrast, the sample mean uses all observations. Please show
(a) $\tilde{y}$ is an unbiased estimator
(b) $\mathrm{var}(\tilde{y}) > \mathrm{var}(\bar{y})$ when $n > 2$. This fact shows the bad estimator is less efficient
than the sample mean because its variance is bigger.
10. Discuss. What happens to the mean and variance of the sample mean when $n \to \infty$?
$$\lim_{n \to \infty} E(\bar{y}) = $$
$$\lim_{n \to \infty} \mathrm{var}(\bar{y}) = $$
What does the sampling distribution of the sample mean look like when the sample
size rises?
11. An estimator is consistent if (1) it is (asymptotically) unbiased; (2) its variance goes
to zero as n rises. The sample mean is an example of a consistent estimator. A consistent
estimator is desirable because it will get close to the true value of the parameter as the
sample gets larger.
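Result (4), and the comparison with the two-observation estimator from item 9, can be checked numerically. The sketch below is an illustration under assumed values (normal population with mean 0, standard deviation 2, sample size 20, and 5000 repeated samples). None of these numbers come from the textbook or the course data.

```python
import random
import statistics

rng = random.Random(4)
MU, SIGMA, N, REPS = 0.0, 2.0, 20, 5000  # assumed values for illustration

y_bar_draws, y_tilde_draws = [], []
for _ in range(REPS):
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    y_bar_draws.append(statistics.fmean(sample))       # uses all N observations
    y_tilde_draws.append((sample[0] + sample[1]) / 2)  # uses only the first two

# Variance of each estimator across the repeated samples.
var_y_bar = statistics.pvariance(y_bar_draws)
var_y_tilde = statistics.pvariance(y_tilde_draws)
print("var(y_bar):  ", round(var_y_bar, 3), "theory:", SIGMA**2 / N)
print("var(y_tilde):", round(var_y_tilde, 3), "theory:", SIGMA**2 / 2)
```

Raising N in this sketch shrinks the simulated variance of the sample mean toward zero, while the variance of the two-observation estimator stays near SIGMA**2 / 2 no matter how large N is.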
Critical thinking:
1. Does formula (4) hold for time series data? How to modify the formula (4) for the
time series data? The situation where $\mathrm{cov}(y_i, y_j) \neq 0$ for time series data is called
serial correlation.
2. Does formula (4) hold if $\mathrm{var}(y_i) \neq \mathrm{var}(y_j)$? The situation where variances are unequal
(non-constant) is called heteroskedasticity.
3. Please show that $\tilde{y} = \frac{y_1 + y_2}{2}$ is an inconsistent estimator. What is the intuition?
4. Why do we prefer big sample over small sample?
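As a starting point for questions 1 and 2, it may help to write the variance of the sample mean without the random-sample simplifications. The expression below is a sketch that uses only facts (5) and (6), applied repeatedly to all n observations. It assumes nothing about equal variances or zero covariances.

```latex
\mathrm{var}(\bar{y})
  = \frac{1}{n^2}\,\mathrm{var}\left(\sum_{i=1}^{n} y_i\right)
  = \frac{1}{n^2}\left[\sum_{i=1}^{n} \mathrm{var}(y_i)
      + 2\sum_{i<j} \mathrm{cov}(y_i, y_j)\right]
```

When every $\mathrm{var}(y_i) = \sigma_y^2$ and every covariance is zero, the bracket equals $n\sigma_y^2$ and the expression reduces to formula (4). With serial correlation the covariance terms remain, and with heteroskedasticity the variances in the first sum are no longer all equal.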