(d) (8 pt) Fill in the following code, so it will correctly perform a hypothesis test using the TVD as test
statistic and compute the empirical p-value. Assume the code in part (c) correctly computes the TVD.
You may use variables/names defined in part (c).
simulated_tvds = make_array()
for i in np.arange(10000):
    simulated_sample = sample_proportions(100, predicted)
    statistic = tvd(predicted, simulated_sample)
    simulated_tvds = np.append(simulated_tvds, statistic)
empirical_pval = np.count_nonzero(simulated_tvds >= observed_tvd)/10000
empirical_pval
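The solution above relies on names from part (c), which is not reproduced here. As a minimal sketch of the same simulation that runs on its own, the block below uses only NumPy. The values of `predicted` and `observed` are hypothetical, and `tvd` and `sample_proportions` are stand-ins written for this sketch, not the definitions from part (c).

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible

# Hypothetical stand-ins for the names defined in part (c).
predicted = np.array([0.2, 0.3, 0.5])     # assumed claimed distribution
observed = np.array([0.31, 0.30, 0.39])   # assumed observed proportions

def tvd(dist1, dist2):
    """Total variation distance between two distributions."""
    return 0.5 * np.sum(np.abs(dist1 - dist2))

def sample_proportions(sample_size, probabilities):
    """Category proportions in one random sample drawn under the null."""
    return rng.multinomial(sample_size, probabilities) / sample_size

observed_tvd = tvd(predicted, observed)

# Simulate the test statistic under the null hypothesis.
simulated_tvds = np.array([])
for i in np.arange(10000):
    simulated_sample = sample_proportions(100, predicted)
    statistic = tvd(predicted, simulated_sample)
    simulated_tvds = np.append(simulated_tvds, statistic)

# Proportion of simulated statistics at least as large as the observed one.
empirical_pval = np.count_nonzero(simulated_tvds >= observed_tvd) / 10000
```

Large TVD values point toward the alternative, which is why the p-value counts simulated statistics that are greater than or equal to the observed one.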
(e) (5 pt) Suppose that the result of your computation is empirical_pval = 0.0258. Which of the following
conclusions would be justified? Fill in the oval next to all that apply.
If we use a p-value cut-off of 5%, we should reject the null hypothesis.
If we use a p-value cut-off of 1%, we should reject the null hypothesis.
Bees are distributed according to the probability distribution specified by the scientist.
The scientist’s claim is wrong; honey bees occur in this region with a higher probability than the
scientist claimed.
The sample was not chosen randomly, because of a bias in how bees were selected.
The observed distribution in the sample is not consistent with the scientist’s claim. The difference
between the observed vs. claimed distribution is statistically significant (p < 0.05).
The 4th option is not correct. The above code tests whether there is a difference in the overall distribution;
it doesn’t test anything about honeybees specifically. Thus, we are not entitled to conclude (based on
the fact that empirical_pval = 0.0258) anything about honeybees specifically. If we wanted to draw
a conclusion about honeybees, we’d need to do a hypothesis test with a different test statistic that looks
solely at the number of honeybees (rather than the TVD).
The 5th option is not correct. As mentioned in part (b), we are testing the scientist’s claim, not whether
the sample was taken randomly. Also, even if we didn’t know whether the sample was taken randomly
or not, we aren’t entitled to conclude (from the fact that empirical_pval = 0.0258) that the sample
was non-random; it could be that the actual distribution of bees differs from the scientist’s claim, and the
sample was chosen randomly.
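The comparison against the two cut-offs can be written out as a short sketch. The helper `reject_null` is a hypothetical name introduced here, and it assumes the convention that the null is rejected when the p-value is below the cut-off.

```python
empirical_pval = 0.0258  # value given in part (e)

def reject_null(p_value, cutoff):
    """Reject the null hypothesis when the p-value is below the cutoff."""
    return p_value < cutoff

reject_at_5_percent = reject_null(empirical_pval, 0.05)  # 0.0258 < 0.05
reject_at_1_percent = reject_null(empirical_pval, 0.01)  # 0.0258 is not < 0.01
```

Rejecting the null at the 5% cut-off says only that the observed distribution is inconsistent with the claimed one. It does not identify which category differs or why.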
(f) (2 pt) Suppose you rerun the hypothesis testing code above, but this time you replace both instances of
10000 with 100000 (thus, you do 10× as many iterations of the loop). Which of the following describes
what we should expect to happen to empirical_pval? Fill in the oval next to one answer.
It should be about 10× larger (i.e., about 0.258, or a little bigger or a little smaller).
It should be about 10× smaller (i.e., about 0.00258, or a little bigger or a little smaller).
It should be about the same (i.e., it remains at about 0.0258, or a little bigger or a little smaller).
This follows from the law of averages. empirical_pval is the proportion of times that something
happens in many (identical, independent) iterations of a random process; doing more iterations will give you
approximately the same proportion.
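One way to see this is to run the simulation with both repetition counts and compare the results. The sketch below is vectorized with NumPy rather than written as the loop from part (d), and `predicted` and `observed_tvd` are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded so the sketch is reproducible

predicted = np.array([0.2, 0.3, 0.5])  # hypothetical claimed distribution
observed_tvd = 0.11                    # hypothetical observed statistic

def empirical_pval(repetitions):
    """Empirical p-value from `repetitions` samples of 100 bees under the null."""
    counts = rng.multinomial(100, predicted, size=repetitions)
    simulated_tvds = 0.5 * np.abs(counts / 100 - predicted).sum(axis=1)
    return np.count_nonzero(simulated_tvds >= observed_tvd) / repetitions

pval_10k = empirical_pval(10_000)
pval_100k = empirical_pval(100_000)
```

Both the count in the numerator and the divisor grow by about a factor of 10, so the two values should be close. The larger run mainly reduces the random fluctuation in the estimate.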
(g) (4 pt) Suppose we wanted to test only whether bumblebees appear to be as common as the scientist
claimed, and we don’t care about honey bees or carpenter bees. In particular, suppose our null hypothesis
is that each bee has a 30% probability of being a bumblebee, and our alternative hypothesis is that
bumblebees are more common than that. Which of the following would be a good choice of test statistic,
for this null and alternative hypothesis? Assume that the observed sample has 100 bees and numbumbles
is the number of bumblebees in the observed sample. Fill in the oval next to all choices that are good.
numbumbles
numbumbles/100
abs(numbumbles - 10)
abs((numbumbles/100) - 0.10)
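Because the alternative in part (g) is one-sided (bumblebees are more common than 30%), large values of numbumbles point toward the alternative. The sketch below shows how a test with that statistic could be simulated, and checks that dividing by the sample size does not change the p-value. The observed count of 38 is a hypothetical value, not taken from the exam.

```python
import numpy as np

rng = np.random.default_rng(2)  # seeded so the sketch is reproducible

null_probability = 0.30  # from the null hypothesis in part (g)
sample_size = 100
numbumbles = 38          # hypothetical observed count

# Number of bumblebees in each of 10,000 samples drawn under the null.
simulated_counts = rng.binomial(sample_size, null_probability, size=10_000)

# One-sided p-value using the count as the test statistic.
pval_count = np.count_nonzero(simulated_counts >= numbumbles) / 10_000

# Same test using the proportion as the test statistic.
pval_proportion = np.count_nonzero(
    simulated_counts / sample_size >= numbumbles / sample_size
) / 10_000
```

Dividing by 100 preserves the ordering of the values, so the count and the proportion lead to the same p-value. A statistic based on an absolute difference treats unusually low and unusually high counts alike, so it does not match a one-sided alternative.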