Hard Statistics Quiz Essay Example | Topics and Well Written Essays

Credit will be given for answers to ALL questions. There are three sections to this project: A, B and C. Section A is a short Statistics Quiz, and is worth a total of 10 marks. Section B has four questions, which use your own personal datasets, and is worth a total of 70 marks. Section C is a critical appraisal of a technical paper (chosen by you) which has some statistical analysis in it, and is worth 20 marks.

To find your personal dataset number for the questions in section B, enter Blackboard and click on MAR8001. Click on Topic 3: Statistics, click on Project and click the link Personalized Data Set Numbers 2013. Find your name in the list and make a note of your dataset number. You will use this to access your own datasets for the questions in section B. This project is worth 100% of the final mark.

Section A: Statistics Quiz

Answers to questions in this section require no more than one or two sentences each!

1. Quantitative variables can be discrete or continuous. Explain the difference between discrete data and continuous data, and give one example of each.

Answer: A discrete variable can assume only a countable number of values, such as the number of persons in a family, whereas a continuous variable can assume any numerical value over a certain interval or intervals (an uncountable number of values), such as the height of a person.

2. A measure of location is a quantity which is ‘typical’ of the data. Give the names of three such measures, and explain (in words, not formulae) how each is found.

Answer: The most common measures of central tendency or location used to describe data are:
Mode: the most commonly occurring value.
Median: the middle value when all the data are placed in order.
Mean (arithmetic mean): the sum of the scores divided by the number of scores.

3. What is a measure of spread? Give the names of three such measures.
Answer: A measure of spread is a quantity that gives an indication of the dispersion or variability within a set of data. Range, variance and standard deviation are three measures of spread.

4. A random sample of a particular attribute yields the histogram shown in figure 1. Suggest a suitable measure of location and a suitable measure of spread for these data.

Answer: The histogram in figure 1 suggests that the median would be a suitable measure of location and the interquartile range would be a suitable measure of spread for these data.

5. The probability that a ship has a defective radar is 0.05. The probability that a ship has a defective echo is 0.06. Three in one hundred ships have both a defective echo and a defective radar. Find the probability that a randomly chosen ship has either a defective echo or a defective radar.

Answer:
P(def. radar) = 0.05
P(def. echo) = 0.06
P(def. radar and def. echo) = 3/100 = 0.03
P(def. radar or def. echo) = P(def. radar) + P(def. echo) - P(def. radar and def. echo)
P(def. radar or def. echo) = 0.05 + 0.06 - 0.03 = 0.08

6. Under what conditions might we use a binomial distribution as a probability model for our data?

Answer: We use a binomial distribution when the following four conditions are satisfied:
The number of trials ‘n’ is fixed.
Each trial is independent.
Each trial has one of two outcomes ("success" or "failure").
The probability of success ‘p’ is the same for each trial.

7. Under what conditions might we use a normal distribution as a probability model for our data?

Answer: A normal distribution is a suitable model when the data show its characteristic properties:
The mean, median and mode are equal.
The graph is symmetrical about the mean (50% above and 50% below).
Because 100% of the distribution lies below the curve, the total area below the curve is 100% or 1.
About 68% of the sample lies within one standard deviation of the mean: 34% above and 34% below.
About 95% lies within two standard deviations: 47.5% above and 47.5% below.
About 99.7% lies within three standard deviations: 49.85% above and 49.85% below.
The two ends are asymptotic to the horizontal axis.

8. In hypothesis testing, the p-value can be thought of as the chance of obtaining the observed results, or more extreme results, if the alternative hypothesis is correct. TRUE or FALSE?

Answer: False. The statement would be true only if "alternative hypothesis" were replaced by "null hypothesis".

9. Write down the assumptions implicit in ANOVA.

Answer:
The treatment data must be normally distributed.
The variance must be the same for all treatments.
All samples are randomly selected.
All the samples are independent.

10. The scatterplot shown in figure 2 is obtained when observations from two variables are plotted against each other. Choose two words from the following list which might be used to explain the relationship between these two variables: Linear, Negative, Significant, Positive, Regression, Indirect.

Answer: The plot shows a Positive, Linear relationship.

Section B: Data response

Questions 1, 2, 3 and 4 require you to use your own personal dataset. You should have already made a note of your personal dataset number; if not, follow the instructions on the front of this paper. As above, enter Blackboard and click on MAR8001. Click on Topic 3: Statistics, click on ‘Project’ and click the file ‘Project data 2011’. Then, if your personal number is, say, 21, scroll through until you get to the part beginning

Data Display
K1 21.0000

Copy the data that follow, up to the next set of data (K1 22.0000 in the above case), into a file in your workspace.

1. Input the data entitled ship speed and ice thrust into Minitab. The data are the results obtained from 30 ice breaker trials, where the ice thrust, y (in thousand Newtons), was recorded for various ship speeds, x (in metres per second).
We are interested in making predictions of ice thrust based on ship speed.

(a) Use a scatterplot to determine whether there is any association between ice thrust and ship speed. Include the plot in your solutions, and make appropriate comments. (4 marks)

Answer: The trend in the scatterplot shows that y (ice thrust) increases as x (ship speed) increases. The points also lie close to a straight line, which indicates that the variables are linearly related. The scatterplot therefore appears to describe a strong positive linear association between ice thrust and ship speed.

(b) Calculate the correlation coefficient between ship speed and ice thrust. Does your correlation coefficient support what you see in the plot in part (a)? Is your correlation coefficient significantly different from zero? (4 marks)

Answer:
Pearson correlation of Ice Thrust and Ship Speed = 0.888
P-Value = 0.000

The estimated correlation coefficient is 0.888, which is close to 1 and shows that there is a strong positive correlation between ice thrust and ship speed, as the scatterplot also indicated (ice thrust increases as ship speed increases). There is also evidence that the correlation is significantly different from zero, as p < 0.001, indicating a relationship between ice thrust and ship speed.

(c) Perform a regression analysis on these data, and include the regression table in your solutions. State the estimated regression equation, and interpret your R² statistic.
(4 marks)

Answer:

The regression equation is
IceThrust = 129 + 44.1 ShipSpeed

Predictor    Coef  SE Coef      T      P
Constant   128.98    26.55   4.86  0.000
ShipSpeed  44.090    4.323  10.20  0.000

S = 39.3487   R-Sq = 78.8%   R-Sq(adj) = 78.0%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       1  161037  161037  104.01  0.000
Residual Error  28   43353    1548
Total           29  204390

Unusual Observations
Obs  ShipSpeed  IceThrust     Fit  SE Fit  Residual  St Resid
  1       5.54     463.46  373.17    7.36     90.29     2.34R
 22       5.81     469.85  385.24    7.20     84.61     2.19R
 28       1.85     218.37  210.36   18.99      8.01     0.23 X

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

The estimated regression equation is “Ice Thrust = 129 + 44.1 Ship Speed.” The R² value implies that 78.8% of the variation in ice thrust (y) is explained by ship speed (x) in a straight-line model. The model fits the data reasonably well, although the unexplained 21.2% of the variation suggests that other variables are also likely to affect ice thrust.

(d) Using the Minitab output in (c), test the null hypothesis that the population slope, β, is equal to zero. (2 marks)

Answer: To test whether ice thrust and ship speed are linearly related, we test the null hypothesis (H0) that the population slope (β) is equal to 0:

H0: β = 0

According to the Minitab output given in (c):
The sample slope is b = 44.090, the value under Coef in the ShipSpeed row of the output.
The estimated standard error of the sample slope, written s.e.(b), is 4.323, the value under SE Coef.
The test statistic is t = b/s.e.(b) = 44.090/4.323 = 10.20, the value under T.
Minitab calculates the p-value by default. The p-value for the test is p = 0.000 and is given under P.

Since the p-value is < 0.05, the null hypothesis (β = 0) is rejected at the 0.05 significance level, and there is statistically significant evidence that ice thrust and ship speed are related.

(e) Check the assumptions implicit in your regression, i.e.
that the residuals are normally distributed and independent of fitted values. (5 marks)

Answer: The above graphs indicate that the residuals follow an approximately normal distribution: the red points lie very close to the blue line in the normal probability plot (with the exception of two outliers), and a shape resembling a normal distribution could be drawn over the histogram. The top right plot (Versus Fits) shows that the residuals are not randomly scattered around zero throughout the plot. The points are more concentrated towards the centre, so the residuals might not be independent of the fitted values. These observations imply that the model needs some modification before the assumptions implicit in regression are satisfied.

(f) Use your estimated regression equation to predict the ice thrust of a ship traveling at a speed of 6.8 metres per second. (1 mark)

Answer:

The regression equation is
IceThrust = 129 + 44.1 ShipSpeed

Predicted Values for New Observations
New Obs     Fit  SE Fit            95% CI            95% PI
      1  428.79    8.14  (412.11, 445.47)  (346.48, 511.10)

Values of Predictors for New Observations
New Obs  ShipSpeed
      1       6.80

The predicted ice thrust of a ship traveling at a speed of 6.8 m/s is 428.79 kN.

2. Enter the data entitled muzzle into another column in Minitab. The data are the muzzle velocities (in metres per second) of 50 shells tested with a new gunpowder.

(a) Summarise the data numerically using suitable measures of location and spread. Produce appropriate graphical summaries of your data, and include this in your solutions. Comment on the shape of the sample distribution. Does the data appear to be approximately normally distributed?
(6 marks)

Answer:

Descriptive Statistics: MuzzleVelocities

Variable           N  N*    Mean  SE Mean  StDev  Variance  CoefVar  Minimum
MuzzleVelocities  50   0  3017.6     6.40   45.2    2047.2     1.50   2904.6

Variable              Q1  Median      Q3  Maximum  Range
MuzzleVelocities  2982.5  3026.4  3049.8   3102.3  197.7

The normal curve superimposed on the fairly symmetric histogram suggests an approximately normal pattern, so the data may have been obtained from a normal population. However, the pattern is slightly skewed in one direction. The boxplot also shows negative skewness, as the median lies closer to the upper quartile. In the probability plot the points lie close to a straight line, which is also consistent with an approximately normal distribution.

(b) Assuming approximate normality, produce a 95% confidence interval for the mean muzzle velocity. From your interval, is there any evidence of a departure from the target mean of 3000 metres per second? (3 marks)

Answer:

One-Sample Z: MuzzleVelocities
The assumed standard deviation = 45.2

Variable           N     Mean  StDev  SE Mean              95% CI
MuzzleVelocities  50  3017.65  45.25     6.39  (3005.12, 3030.18)

One can be 95% confident that the population mean muzzle velocity falls inside the interval (3005.12, 3030.18). As the target mean of 3000 m/s lies outside the 95% confidence interval, there is evidence that the mean muzzle velocity departs from the target mean.

(c) Use a one-sample t-test to test the null hypothesis that the population mean muzzle velocity is equal to 3000 metres per second. Use your graphs in part (a) (or otherwise) to check the assumption implicit in this test, and clearly state your null and alternative hypotheses. (6 marks)

Answer:
Null Hypothesis: H0: mean muzzle velocity (μ) = 3000 m/s
Alternate Hypothesis: H1: mean muzzle velocity (μ) ≠
3000 m/s

One-Sample T: MuzzleVelocities
Test of mu = 3000 vs not = 3000

Variable           N     Mean  StDev  SE Mean              95% CI     T      P
MuzzleVelocities  50  3017.65  45.25     6.40  (3004.79, 3030.51)  2.76  0.008

As the p-value is < 0.05 and the 95% confidence interval (3004.79, 3030.51) does not contain 3000, sufficient evidence exists to reject the null hypothesis that the mean muzzle velocity = 3000 m/s. The graphs in part (a) also show the sample centred above the hypothesised value of 3000 m/s, which supports rejecting the null hypothesis. Those graphs show some negative skewness, so the assumption of a normally distributed population is only approximately satisfied.

3. In Minitab after the ‘MTB >’, type Read 8 8 M1. Then type in the following numbers:

 0.0   8.7  59.3  50.0  12.5  30.2  53.2  10.1
 8.7   0.0  55.3  44.5  12.0  27.9  49.0  10.1
59.3  55.3   0.0  48.6  60.9  62.7  11.9  55.3
50.0  44.5  48.6   0.0  50.7  26.0  45.3  49.7
12.5  12.0  60.9  50.7   0.0  27.5  56.7  10.0
30.2  27.9  62.7  26.0  27.5   0.0  28.6  26.0
53.2  49.0  11.9  45.3  56.7  28.6   0.0  49.0
10.1  10.1  55.3  49.7  10.0  26.0  49.0   0.0

Now type Print M1. The data in the session window are in the form of a distance matrix, the values in this matrix representing the distance (in metres) between eight colonies of tropical plants found in the Maracay region of Venezuela. For reference, the names of these plants are:

1: Abuta
2: Cascarilla
3: Brazilian Pepper Tree
4: Cedro Rosa
5: Maracuza
6: Zanga Tempo
7: Tiririca
8: Gervao

(a) From the distance matrix in Minitab, (3 marks)

i. write down the distance between the Abuta and Maracuza colonies;
Answer: 12.5 meters

ii. write down the distance between the Gervao and Brazilian Pepper Tree colonies;
Answer: 55.3 meters

iii. write down the names of the two colonies closest together.
Answer: Abuta and Cascarilla are closest, at a distance of 8.7 meters from each other.

(b) Which multivariate analysis technique could be used to recover the co-ordinates of each colony, and so produce a map of the locations of all the colonies?
(don’t attempt this analysis) (1 mark)

Answer: Multidimensional scaling (also known as principal coordinates analysis) could be used to recover the coordinates from the distance matrix and produce a map of the locations of all the colonies.

(c) Perform a nearest neighbour (or single linkage) cluster analysis to identify clusters of colonies at various distances. Include the Minitab output and dendrogram in your solutions. (5 marks)

Answer:

Cluster Analysis of Observations: M1
Single Linkage
Amalgamation Steps

      Number of  Similarity  Distance  Clusters  New      Number of obs.
Step  clusters   level       level     joined    cluster  in new cluster
1     7          86.1244      8.7      1  2      1        2
2     6          84.0510     10.0      5  8      5        2
3     5          83.8915     10.1      1  5      1        4
4     4          81.0207     11.9      3  7      3        2
5     3          58.5327     26.0      4  6      4        2
6     2          58.5327     26.0      1  4      1        6
7     1          54.3860     28.6      1  3      1        8

(d) By referring to the dendrogram in (c),

i. at what distance do the Abuta, Cascarilla, Maracuza and Gervao colonies form a single cluster? (1 mark)

Answer: The Abuta, Cascarilla, Maracuza and Gervao colonies form a single cluster at a distance of 10.1 meters.

ii. if we choose a distance of 15 metres to separate clusters, how many clusters do we have? Name the plant colonies within each cluster. (2 marks)

Answer: We will have four clusters at a separation distance of 15 meters. The plant colonies within each cluster are as follows:
Cluster 1: Abuta, Cascarilla, Maracuza and Gervao
Cluster 2: Cedro Rosa
Cluster 3: Zanga Tempo
Cluster 4: Brazilian Pepper Tree and Tiririca

iii. find the minimum distance that should be chosen in order to obtain exactly two clusters of colonies. Name the plant colonies within each cluster. (3 marks)

Answer: Exactly two clusters of colonies are obtained at a minimum distance of 26 meters. The plant colonies within each cluster are as follows:
Cluster 1: Abuta, Cascarilla, Maracuza, Gervao, Cedro Rosa and Zanga Tempo
Cluster 2: Brazilian Pepper Tree and Tiririca

4. Read the four datasets entitled ‘Palms: …’ into four further columns and name them by where they come from.
These data correspond to the yields (in kilograms per hectare) of the Cocos nucifera, or coconut palm, for plantations across four locations in the Caribbean: Jamaica, Turks & Caicos Islands, Granada and Puerto Rico. Perform a one-way ANOVA in Minitab to test the null hypothesis that there is no difference between population mean yields for the four locations. Remember to check the assumption that each sample is drawn from a normal distribution (by using normal probability plots, for example); check the assumption of equal population variances; clearly state your null and alternative hypotheses; interpret your p-value using table 3 in the lecture notes. If you find that there is a significant difference between the mean yields observed at the four locations, use the follow-up procedure of multiple comparisons (or Tukey’s test) to find out between which pair(s) of locations these differences lie. You should write this question up as a report, with a short introduction and conclusion which summarises your findings. Include any relevant Minitab output, and remember to make appropriate comments as you go along. (20 marks)

Answer:

1- Introduction

We have data corresponding to the yields (in kilograms per hectare) of the Cocos nucifera for plantations across four locations in the Caribbean: Jamaica, Turks & Caicos Islands, Granada and Puerto Rico. We are interested in performing a one-way ANOVA in Minitab to test whether there is a difference between population mean yields for the four locations. However, before proceeding with ANOVA, we also need to check the implicit assumptions that each sample is drawn from a normal distribution and that the population variances are equal. The null hypothesis is that there is no difference between population mean yields for the four locations. The alternate hypothesis is that there is some difference, that is, that not all four population mean yields are equal.
If we reject the null hypothesis, we will use the follow-up procedure of Tukey’s test to find out between which pair(s) of locations these differences lie.

H0: μ1 = μ2 = μ3 = μ4
H1: Not all of μ1, μ2, μ3 and μ4 are equal

2- Checking the Assumptions

The red points follow the blue line in the Normal Probability Plot. They tend to droop at both ends, but not too badly. The histogram is roughly symmetric, with no signs of bimodality and one outlier. It is not too bad for n = 105. The plot of Residuals Versus the Fitted Values shows a similar spread for each group and, in particular, no trend of the spread steadily increasing or decreasing as the fitted values increase. The boxplots show that the data for Puerto Rico and Jamaica seem approximately normally distributed, but not the remaining samples.

Descriptive Statistics: Jamaica, Turks & Caicos, Granada, Puerto Rico

Variable         N  N*    Mean  SE Mean  StDev  Minimum      Q1  Median      Q3
Jamaica         20   0  917.17     2.06   9.21   900.47  910.65  915.86  924.20
Turks & Caicos  25   0  928.72     2.00  10.00   905.84  923.05  927.43  937.02
Granada         30   0  943.62     1.57   8.62   926.76  939.10  942.62  949.29
Puerto Rico     30   0  947.73     1.45   7.94   932.93  942.50  946.89  952.22

Variable        Maximum
Jamaica          934.03
Turks & Caicos   945.31
Granada          958.90

Since the largest sample standard deviation (10.00) is no more than twice as large as the smallest (7.94), and the distribution in the probability plot seems approximately normal, we can proceed with ANOVA.
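The equal-variance rule of thumb used above can be checked with a few lines of Python. This is only a minimal sketch: the standard deviations are copied from the descriptive statistics table, and the names `sample_sd` and `sd_ratio` are illustrative rather than part of the Minitab workflow.

```python
# Sample standard deviations copied from the descriptive statistics above.
sample_sd = {
    "Jamaica": 9.21,
    "Turks & Caicos": 10.00,
    "Granada": 8.62,
    "Puerto Rico": 7.94,
}


def sd_ratio(sds):
    """Return the ratio of the largest to the smallest standard deviation."""
    values = list(sds.values())
    return max(values) / min(values)


ratio = sd_ratio(sample_sd)
# Rule of thumb from the text: proceed if the ratio is no more than 2.
equal_variance_ok = ratio <= 2
print(f"largest/smallest SD = {ratio:.2f}; rule of thumb satisfied: {equal_variance_ok}")
```

The ratio is well below 2, which agrees with the decision to proceed. A formal test of equal variances could be used instead of this rule of thumb, but the extract does not report one.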
3- Performing Analysis

One-way ANOVA: Jamaica, Turks & Caicos, Granada, Puerto Rico

Source   DF       SS      MS      F      P
Factor    3  14286.3  4762.1  60.17  0.000
Error   101   7993.6    79.1
Total   104  22279.9

S = 8.896   R-Sq = 64.12%   R-Sq(adj) = 63.06%

Individual 95% CIs For Mean Based on Pooled StDev

Level            N    Mean  StDev  -------+---------+---------+---------+--
Jamaica         20  917.17   9.21  (---*---)
Turks & Caicos  25  928.72  10.00              (---*--)
Granada         30  943.62   8.62                             (---*--)
Puerto Rico     30  947.73   7.94                                  (--*--)
                                   -------+---------+---------+---------+--
                                        920       930       940       950

Pooled StDev = 8.90

In the above analysis, we can see p …
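As a cross-check, the one-way ANOVA table above can be approximately rebuilt from the group sizes, means and standard deviations alone. The sketch below assumes only those rounded summary values from the Minitab output; the function name `anova_from_summary` and the dictionary layout are illustrative, and the p-value is not computed because that would need an F distribution routine.

```python
import math

# (n, mean, standard deviation) per location, copied from the Minitab output above.
groups = {
    "Jamaica": (20, 917.17, 9.21),
    "Turks & Caicos": (25, 928.72, 10.00),
    "Granada": (30, 943.62, 8.62),
    "Puerto Rico": (30, 947.73, 7.94),
}


def anova_from_summary(summary):
    """Rebuild a one-way ANOVA table from per-group n, mean and SD."""
    n_total = sum(n for n, _, _ in summary.values())
    grand_mean = sum(n * m for n, m, _ in summary.values()) / n_total
    # Between-group (Factor) sum of squares.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in summary.values())
    # Within-group (Error) sum of squares from the sample variances.
    ss_within = sum((n - 1) * s ** 2 for n, _, s in summary.values())
    df_between = len(summary) - 1
    df_within = n_total - len(summary)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "df": (df_between, df_within),
        "ss": (ss_between, ss_within),
        "ms": (ms_between, ms_within),
        "F": ms_between / ms_within,
        "pooled_sd": math.sqrt(ms_within),
    }


table = anova_from_summary(groups)
print(f"df = {table['df']}, F = {table['F']:.2f}, pooled SD = {table['pooled_sd']:.2f}")
```

Because the inputs are rounded to two decimal places, the recomputed sums of squares and F statistic match the Minitab table only approximately; the degrees of freedom (3 and 101) match exactly.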
