Outliers in Statistical Analysis - Coursework Example

In statistics, an observation that lies far from the other observations is known as an outlier. An outlier may arise from variability in the measurement or from an experimental error (Charu, pg.10). Observations that result from experimental error are sometimes excluded from the data set. Outliers can occur by chance in any distribution, but they often indicate either a measurement error or a heavy-tailed population distribution. In the first case, one may wish to discard the affected observations or use statistics that are robust to outliers. In the second case, the distribution has high kurtosis, so one needs to be cautious with tools or intuitions that assume a normal distribution.

Outliers frequently arise when two distributions are mixed. The two may be distinct sub-populations, or they may represent correct trials versus measurement errors; such data are described by a mixture model. In large samples, a few data points will lie farther from the sample mean than seems reasonable. This can be due to incidental systematic error, to flaws in the theory that generated the assumed family of distributions, or simply to some observations being far from the centre of the data. Outlier points can therefore indicate faulty data, erroneous procedures, or areas where a theory is not valid. Nevertheless, in large samples a small number of outliers is to be expected, and not because of any anomalous condition.

Because outliers are the most extreme observations, they may include the sample maximum, the sample minimum, or both, depending on whether they are extremely high or low. However, the sample maximum and minimum are not always outliers, since they may not be unusually far from the other observations. Naive interpretation of statistics derived from data sets that include outliers may be misleading.
For instance, suppose one is computing the average temperature of ten objects in a room. Nine of them are between 20 and 25 degrees Celsius, but an oven is at 170 degrees Celsius. The median of the data will lie between 20 and 25 degrees Celsius, whereas the mean temperature will lie between 35 and 40 degrees Celsius. In this case, the median reflects the temperature of a randomly sampled object better than the mean does. Naively treating the mean as a typical sample, equivalent to the median, is incorrect. As this example shows, outliers may indicate data points that belong to a different population from the rest of the sample. Estimators that can cope with outliers are said to be robust: the median is a robust statistic, whereas the mean is not.

Causes and Occurrence

For normally distributed data, the three-sigma rule implies that roughly one in 22 observations will differ from the mean by twice the standard deviation (SD) or more, and one in 370 will differ by three times the SD or more. In a sample of 1,000 observations, the presence of up to five observations that deviate from the mean by more than three times the SD is within the range of what can be expected: it is less than twice the expected number of about 2.7, and within about 1.4 standard deviations of that number under the Poisson distribution, so it does not indicate an anomaly. If the sample size is only 100, however, just three such outliers are already a cause for concern, being more than 11 times the expected number.

In general, if the nature of the population distribution is known a priori, it is possible to test whether the number of outliers differs significantly from what can be expected. For a given cutoff, samples fall beyond the cutoff with a probability determined by the distribution, so the number of outliers follows a binomial distribution, which can generally be approximated well by the Poisson distribution.
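The figures quoted above for the normal distribution can be checked numerically. The following is a minimal sketch, not part of the original essay: the function names `two_sided_tail` and `poisson_tail` are illustrative, and the Poisson distribution is used as the approximation to the binomial count of outliers that the text describes.

```python
import math

def two_sided_tail(k: float) -> float:
    """P(|Z| >= k) for a standard normal Z, via the complementary error function."""
    return math.erfc(k / math.sqrt(2))

def poisson_tail(lam: float, m: int) -> float:
    """P(N >= m) for N ~ Poisson(lam), computed as 1 - P(N <= m - 1)."""
    below = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(m))
    return 1.0 - below

p2 = two_sided_tail(2)  # roughly 1 in 22 observations
p3 = two_sided_tail(3)  # roughly 1 in 370 observations
print(f"beyond 2 SD: 1 in {1 / p2:.0f}; beyond 3 SD: 1 in {1 / p3:.0f}")

# Compare an observed count of 3-SD outliers with the expected count n * p3.
for n, observed in [(1000, 5), (100, 3)]:
    expected = n * p3
    ratio = observed / expected
    chance = poisson_tail(expected, observed)
    print(f"n={n}: expected {expected:.2f}, observed {observed}, "
          f"ratio {ratio:.1f}, P(at least {observed}) = {chance:.4f}")
```

Under these assumptions, five such outliers in 1,000 observations have a tail probability well above conventional significance levels, while three in 100 observations have a very small one, which matches the contrast drawn in the text.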
Thus, if one takes a normal distribution with a cutoff three SDs from the mean, the probability of falling beyond the cutoff is approximately 0.3 percent, and so for 1,000 trials the number of samples that deviate by more than three sigmas can be approximated by a Poisson distribution with a mean of about 3.

Causes

Outliers have many anomalous causes. A measuring apparatus may have suffered a transient malfunction. There may have been an error in data transmission or transcription. Outliers also arise from changes in system behaviour, fraudulent behaviour, human error, instrument error, or simply natural deviations in populations. A sample may have been contaminated with elements from outside the population being examined. Alternatively, an outlier may be the result of a flaw in the assumed theory, calling for further investigation by the researcher. In addition, outliers of a particular pathological form appear in a variety of data sets, indicating that the mechanism generating the data may differ at the extremes.

Identifying Outliers

There is no rigid mathematical definition of what constitutes an outlier; deciding whether an observation is an outlier is ultimately a subjective exercise. Many methods can be used to identify outliers. Some are graphical, such as normal probability plots; others are model-based; box plots are a hybrid (Charu, pg.249). The most commonly used methods are model-based: they assume that the data come from a normal distribution and flag observations that are deemed unlikely given the mean and SD. Examples are Peirce's criterion, Grubbs' test and Chauvenet's criterion. Peirce's criterion, for example, proposes to determine, in a series of N observations, the limit of error beyond which all observations involving so great an error may be rejected, provided there are as many as a stated number of such observations.
The principle on which this problem is solved is that the proposed observations should be rejected when the probability of the system of errors obtained by retaining them is less than that of the system of errors obtained by rejecting them, multiplied by the probability of making so many, and no more, abnormal observations.

Other methods flag observations on the basis of measures such as the interquartile range. For instance, if Q1 and Q3 are the first and third quartiles, an outlier may be defined as any observation that falls outside the range [Q1 - k(Q3 - Q1), Q3 + k(Q3 - Q1)] for some non-negative constant k; a common choice is k = 1.5. The remaining approaches are distance-based and density-based; these typically use the distance to an observation's nearest neighbours to label it as an outlier or a non-outlier.

Modified Thompson test

The modified Thompson tau test is a method for determining whether an outlier exists in a data set. Its strength is that it takes into account the mean and standard deviation of the data set and provides a statistically determined rejection region, and thus an objective way of deciding whether a data point is an outlier.

How does it work?

First, the mean of the data set is computed. Next, the absolute deviation between each data point and the mean is computed. Third, the rejection region is calculated using a formula. To decide whether a value X is an outlier, compute δ = |(X - mean) / s|, where s is the sample standard deviation. If δ is greater than the rejection region, the data point is an outlier. If δ is less than or equal to the rejection region, the data point is not an outlier.

The modified Thompson tau test finds one outlier at a time. That is, if a data point is found to be an outlier, it is immediately removed from the data set, and the test is applied again with a new mean and rejection region. This process is repeated until no outliers remain in the data set.

Some work has also examined outliers in categorical data.
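The iterative procedure of the modified Thompson tau test described above can be sketched as follows. The essay does not state the rejection-region formula, so this sketch assumes the commonly published form tau = t (n - 1) / (sqrt(n) sqrt(n - 2 + t^2)), where t is the critical value of Student's t distribution at alpha/2 with n - 2 degrees of freedom. All function names are hypothetical, and the t quantile is computed with a simple numerical integration so that only the standard library is needed; a statistics package would normally supply it.

```python
import math
import statistics

def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x: float, df: int, steps: int = 2000) -> float:
    """CDF at x >= 0: 0.5 plus Simpson's rule applied to the density on [0, x]."""
    if x == 0:
        return 0.5
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_quantile(p: float, df: int) -> float:
    """Quantile for 0.5 < p < 1, found by bisection on t_cdf (assumes it is below 200)."""
    lo, hi = 0.0, 200.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def thompson_tau(n: int, alpha: float = 0.05) -> float:
    """Rejection region for a sample of size n (assumed formula, see lead-in)."""
    t = t_quantile(1 - alpha / 2, n - 2)
    return t * (n - 1) / (math.sqrt(n) * math.sqrt(n - 2 + t * t))

def remove_outliers(data, alpha: float = 0.05):
    """Remove one outlier at a time until the most extreme point passes the test."""
    kept, removed = list(data), []
    while len(kept) >= 3:
        mean = statistics.mean(kept)
        s = statistics.stdev(kept)  # sample standard deviation
        if s == 0:
            break
        suspect = max(kept, key=lambda x: abs(x - mean))
        delta = abs(suspect - mean) / s
        if delta <= thompson_tau(len(kept), alpha):
            break  # the most extreme point is not an outlier, so stop
        kept.remove(suspect)
        removed.append(suspect)
    return kept, removed

# Invented temperatures: nine room-temperature objects and one oven.
kept, removed = remove_outliers([20, 21, 21, 22, 22, 23, 23, 24, 25, 170])
print("kept:", kept)
print("removed:", removed)
```

Only the most extreme point is tested in each pass, and the mean, standard deviation and rejection region are recomputed after every removal, as the procedure requires.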
In the context of a set of instances, instance hardness measures the probability that an instance will be misclassified. Ideally, instance hardness would be calculated by summing over the set of all possible hypotheses H. In practice this formulation is infeasible, because H is potentially infinite and the required probabilities are unknown for many algorithms. Instance hardness is therefore approximated using a diverse subset of hypotheses.

Working with Outliers

The choice of how to deal with an outlier should depend on its cause.

Retention

Even when a normal distribution model is appropriate for the data being analysed, outliers are to be expected in large samples and should not automatically be discarded if that is the case. When data contain naturally occurring outliers, an application should use a classification algorithm that is robust to them.

Exclusion

Deleting outlier data is a controversial practice, frowned upon by many scientists and science instructors. Although mathematical criteria provide an objective and quantitative method for rejecting data, they do not make the practice more scientifically or methodologically sound, especially in small data sets or where a normal distribution cannot be assumed. Rejection of outliers is more acceptable where the underlying model of the process being measured and the usual distribution of measurement error are known. An outlier that results from an instrument reading error may be excluded, but the reading should at least be verified.

In regression analysis, an alternative approach is to exclude only those points that have a large degree of influence on the estimated parameters, as measured by Cook's distance. If data points are excluded from the analysis, this should be clearly stated in any subsequent report.

Non-normal distributions

The possibility should be considered that the underlying distribution of the data is not approximately normal but has fat tails.
For example, when sampling from a Cauchy distribution, the sample variance increases with the sample size, the sample mean fails to converge as the sample size increases, and outliers are expected at far larger rates than for a normal distribution.

Uncertainties

A set membership approach considers that the uncertainty corresponding to each measurement of an unknown random vector x is represented by a set. If no outliers occur, x should belong to the intersection of all these sets. When outliers occur, this intersection may be empty, and a small number of the sets should be relaxed in order to avoid any inconsistency. This can be done using the notion of a q-relaxed intersection, which is the set of all x that belong to all the sets except q of them. Sets that do not intersect the q-relaxed intersection can be suspected of being outliers. When the cause of an outlier is known, it may be possible to incorporate its effect into the model structure, for instance by using a mixture model or a Bayesian model.

Conclusion

An outlier may be a measurement error, a chance phenomenon, or the result of an experimental error. Outliers may also stem from a heavy-tailed distribution, in which case the assumption of a normal distribution does not apply. Some statistical estimators are robust and can cope with outliers, while others cannot. One of the best examples is the median, which handles outliers well: it is unaffected by whether the extreme points lie far from or close to the rest of the data, as long as the central value is undisturbed. The mean, by contrast, is always affected by an outlier: it increases or decreases depending on the outlier's position. Anyone dealing with outliers should take great care to avoid mistakes arising from exceptions or experimental errors. An outlier may also exhibit a different property from the rest of the data, which indicates that it does not belong to the same population.
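The contrast between the median and the mean can be illustrated with the temperature example used earlier in this essay. This is a small sketch only: the essay says that nine objects lie between 20 and 25 degrees Celsius and that an oven is at 170 degrees Celsius, but the nine individual readings below are invented for illustration.

```python
import statistics

# Nine invented room-temperature readings in degrees Celsius (assumed values).
room = [20, 21, 21, 22, 22, 23, 23, 24, 25]
oven = 170
temps = room + [oven]

mean_with = statistics.mean(temps)
median_with = statistics.median(temps)
mean_without = statistics.mean(room)
median_without = statistics.median(room)

print(f"with oven:    mean {mean_with:.1f}, median {median_with:.1f}")
print(f"without oven: mean {mean_without:.1f}, median {median_without:.1f}")

# How far each statistic moves when the single outlier is added.
print(f"shift in mean:   {mean_with - mean_without:.1f}")
print(f"shift in median: {median_with - median_without:.1f}")
```

With these readings the median stays inside the 20 to 25 degree range, while the single oven reading pulls the mean into the 35 to 40 degree range, as the essay describes.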
In most cases, outliers should be treated with great care until their cause is established, rather than being dismissed as chance or random variation. Hence, a study should be carried out before an outlier is deleted. Statistical outliers are more evident in distributions that are not normal. For instance, in a long-tailed distribution, statistical outliers are more common than in a normal distribution. Under a normal distribution, approximately one in 370 observations will deviate from the mean by more than three times the SD purely by chance. The rate falls off rapidly for more distant values. Therefore, if many data points lie far from the mean, the outliers should be examined. For instance, if there are 1,000 data points and five of them lie four or more SDs from the mean, the outliers should be examined immediately.

As noted earlier, when the nature of the population distribution is known a priori, it is possible to test whether the number of outliers differs significantly from what is expected. The number of observations beyond a given cutoff follows a binomial distribution that is well approximated by the Poisson distribution, and for a cutoff of three SDs under a normal distribution the probability of exceeding the cutoff is approximately 0.3 percent.

Works cited

Charu C. A. Outlier Analysis. Springer Science & Business Media, 2013.