Bias when Dividing the Database

Introduction

Most studies are designed so that conclusions drawn from a sample can be generalized to a population. It is therefore imperative that the chosen sample be representative of that population. To ensure that the sample is representative, it is important to consider carefully how it was collected. If a sample is chosen for convenience alone, it becomes difficult to interpret the data with confidence (Der and Everitt, 25). Bias refers to the tendency of selected samples to differ from the corresponding population in some systematic manner. Bias can arise from the way the sample is chosen or from the way information is acquired once the sample has been selected (Peck, Olsen and Devore, 58). In sampling, the most common types of bias are selection bias, response (or measurement) bias and nonresponse bias.

During statistical modelling, it is common for the sample to be split into a development sample and a validation sample. The development sample is used to build the model, while the validation sample is used to test the quality of the developed model. It is very important to create both samples so that each represents the total population, because unrepresentative samples introduce bias that can cause serious problems (Faraway, 54). One way to check the representativeness of the two samples is to test, for different variables of the database, whether the distribution of the modalities differs significantly between the DEV and VAL samples. It is therefore important to understand why representativeness should be checked, as well as the consequences that may arise if the validation and development datasets are not representative.

1.1 Model Validation

A statistical model is usually constructed either for research or academic use or for business scenarios. Usually, a statistical model is used to forecast. In these cases, the effectiveness of the model can be gauged by its ability to predict future events accurately. The main problem that affects modelling is that sampling errors may be mistaken for actual trends and thus become part of the model. As a result, when the model is applied to new datasets, its performance can differ greatly from its performance on the development data (Giancristofaro and Salmaso, 375). Techniques used to determine the validity of statistical models include comparing the model coefficients and predictions with existing theory, collecting new data to test model predictions, and cross-validation or data splitting, in which part of the collected sample is used to estimate the required coefficients while the rest is used to calculate the prediction accuracy of the model. According to Snee (416), data splitting is the most effective method of model validation when it is impossible to collect new data to examine the model.

To understand what happens when the development and validation datasets are constructed so that they are not representative of the population, this paper looks at different instances of the samples as analysed with SAS software.

2. Data Splitting

The STRATA statement in SAS is used to designate the variables that define the strata, or nested sets, in a case-control study (Stokes, Davis and Koch, 158). The predictive accuracy obtained from data splitting differs depending on whether simple random sampling or stratified sampling is used. Taking an example constructed from a real database, we look at the risk rate distribution. The database is large enough to provide a good platform for examining the impact of stratified sampling and random sampling on the predictability of the chosen samples, both the validation and the development databases.
Number of Observations        12665
Number of Bad Observations    647
Risk Rate                     5.11%

This database then needs to be split into two datasets: the validation dataset and the development dataset. The split assigns 70% of the data to the development dataset and 30% to the validation dataset. It is carried out in two ways:

i. Using a random sample that maintains the proportion of good and bad contracts
ii. Using a random selection that does not maintain the good/bad proportion

2.1. Random sample with condition to keep the same good/bad proportion

In SAS, PROC SURVEYSELECT with a STRATA statement on the good/bad variable can be used to split the database while keeping the same good/bad proportion. With this method, the results are:

              Number of Observations   Number of Bad   Risk Rate
Validation    3799                     194             5.11%
Development   8866                     453             5.11%
Total         12665                    647             5.11%

In this type of data splitting the sample is split into two, but the good and bad observations already collected are taken into account. The bad observations are divided in the same ratio as the sample as a whole: 30% of the bad observations go to the validation dataset and 70% go to the development dataset. We therefore have 194 bad contracts that can be used to validate the model. Because the proportion of bad contracts is preserved, the datasets selected in this way are more representative of the population being tested.

2.2. Random Sample without the Good/Bad option

In this method, it is assumed that the selected sample is completely random and that no bias occurs in the data. However, failure to identify a serious bias in the sample can result in inaccurate test statistics and standard errors. Looking at the database described above, we find that if we do not use the STRATA option, the validation sample takes different values from one split to the next. The split was repeated 1000 times in order to calculate a confidence interval.
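The two splitting schemes described in this section can be sketched numerically. The following is a minimal Python sketch, not the SAS procedure used in the paper: it relies only on the counts quoted above (12665 observations, 647 bad, 30% validation fraction), and the function names stratified_counts and simulate_unstratified, the fixed seed and the choice of an empirical 95% range are all illustrative assumptions. It first computes the bad count obtained when each stratum is split separately, then repeats an unstratified random split 1000 times to see how far the number of bad contracts in the validation sample can drift.

```python
import random

N_TOTAL = 12665       # observations in the example database
N_BAD = 647           # bad contracts (risk rate 5.11%)
VAL_FRACTION = 0.30   # share of the data sent to the validation sample


def stratified_counts(n_total, n_bad, val_fraction):
    """Validation size and bad count when good and bad are each split separately."""
    val_bad = round(n_bad * val_fraction)
    val_good = round((n_total - n_bad) * val_fraction)
    return val_good + val_bad, val_bad


def simulate_unstratified(n_total, n_bad, n_val, n_reps=1000, seed=0):
    """Bad counts in the validation sample over repeated simple random splits."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_reps):
        # Indices 0 .. n_bad-1 stand for the bad contracts (an arbitrary labelling);
        # n_val rows are drawn without replacement, ignoring the good/bad flag.
        drawn = rng.sample(range(n_total), n_val)
        counts.append(sum(1 for i in drawn if i < n_bad))
    return counts


n_val, val_bad = stratified_counts(N_TOTAL, N_BAD, VAL_FRACTION)
counts = sorted(simulate_unstratified(N_TOTAL, N_BAD, n_val))

# Empirical range covering roughly the central 95% of the simulated splits.
tail = len(counts) // 40
low, high = counts[tail], counts[-tail - 1]

print(f"stratified split: {val_bad} bad out of {n_val} ({val_bad / n_val:.2%})")
print(f"unstratified split, central 95% of runs: {low} to {high} bad in validation")
```

The stratified figures are deterministic and reproduce the 194 bad contracts out of 3799 reported in section 2.1. The simulated range depends on the seed and on how the interval is defined, so it need not coincide exactly with the bounds reported in the paper.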
Across the 1000 random splits, the results were as follows. From the graphs, the development database has between 426 (4.80%) and 481 (5.43%) bad contracts, while the validation database has between 166 (4.37%) and 221 (5.82%). In both cases there is an interval of 55 contracts between the lower and upper confidence limits. For the development database this difference matters little, since it corresponds to a deviation of about ±6% around the 453 bad contracts. The impact is much greater for the validation database, where the same interval corresponds to a deviation of about ±14.5% around the 194 bad contracts.

              Number of Observations   Number of Bad   Risk Rate
Validation    3799                     166 to 221      4.37% to 5.82%
Development   8866                     426 to 481      4.80% to 5.43%
Total         12665                    647             5.11%

These figures show that, in extreme cases, 14.5% of the bad contracts could be lost for validating the model. This proportion is very high because the overall percentage of bad contracts (5.11%) is quite low and the validation database is quite small (30% of the total sample). As such, the choice to keep or drop the good/bad condition when building the validation database has significant consequences. Stratifying on the QUAL variable with the STRATA statement in SAS is thus important in order to avoid this problem when the number of bad contracts is low.

3. Consequences of a Bias on the Variable QUAL

To verify the findings of the above sections, we carry out two further tests, each consisting of a study of demand stability, a study of risk stability and a gap analysis. In the first test, we split the whole sample so that the DEV and VAL samples are of roughly equal size but have different proportions of bad contracts. The distribution is as follows:

              Number of Observations   Number of Bad   Risk Rate
Validation    6329                     258             4.08%
Development   6336                     389             6.14%
Total         12665                    647             5.11%

3.1. Demand Stability

The stability of demand shows no impact whatsoever between the DEV and VAL samples. The values obtained in this analysis are similar to our initial values above, showing that the bias does not affect demand stability.

3.2. Risk Stability

When we look at the risk stability, however, we find that the risk is not equal between the DEV and VAL samples. This is expected, since by construction the risk is lower in the validation sample than in the development sample: the risk rate was 6.14% for the development sample and 4.08% for the validation sample. The values obtained from this test are consistent with those original values, and the risk between the two samples is stable.

3.3. Gap Analysis

The gap analysis revealed that the gap is inconsequential and does not have a large impact on the model. In this test, a bias induced on QUAL poses no problem. However, a problem could arise when the number of bad contracts is low, as in the earlier test. Keeping the same good/bad proportion in the development and validation databases therefore proves advantageous.

4. The consequences of a bias on a variable used in the score

In this test, we have a database with a bias on a variable (X_PRDURDEN) used in the score. The distribution is as follows:

Development Sample
                           Number of Observations   Number of Bad   Risk Rate
X_PRDURDEN b_] 36 ; 48]    4366                     57              1.31%
X_PRDURDEN c_] 48]         1965                     327             16.64%
Total                      6331                     384             6.07%

Validation Sample
                           Number of Observations   Number of Bad   Risk Rate
X_PRDURDEN b_] 36 ; 48]    4550                     187             4.11%
X_PRDURDEN c_] 48]         1784                     76              4.26%
Total                      6334                     263             4.15%

4.1. Demand Stability

In this study, the stability of demand has not been affected, as the SAS output for this test shows. The analysis reveals no change in demand stability when compared with our initial values above.

4.2. Risk Stability

The risk between the DEV and VAL samples is very unstable, which raises a first alert about a problem of representativeness. The risk stability shows that a bias on a variable used in the score undermines confidence that the samples represent the population.

4.3. Gap Analysis

The gap analysis shows large gaps between the observed and the predicted values. Once again, this alerts the statistician to a problem of either correlation or instability.

5. Analysis

In the above sections, we have been looking at the difference between simple random sampling and stratified random sampling. Simple random sampling ensures that every possible sample of a given size has the same chance of being selected. The fact that each individual observation has an equal chance of selection is not, by itself, sufficient to ensure that the sample is a simple random sample.

Let B denote the good/bad outcome and A the covariates, with the logistic regression model as given by Hosmer, Lemeshow and Sturdivant (82). The data needed to fit this model can be collected either by selecting random observations from a given population or by selecting the data based on the covariates (A) and recording which values of the good/bad outcome (B) occur (Hosmer, Lemeshow and Sturdivant, 82). For our model we chose the second option: the dataset was split into two, and the instances of bad accounts were noted in each of the VAL and DEV samples. Considering that our initial dataset contained all the relevant data, there was no good reason for further stratification, so a random selection that keeps the same good/bad proportion was effective and the best course of action. In the first instance, where the random sample took the proportion of bad accounts into account, the ratio of positive to negative patterns is basically irrelevant.
Complications could arise only if the minority class is composed of too few samples.

Stratified sampling involves dividing the target population into different strata, or subsets, and then randomly selecting the final samples proportionally from the different strata (Das, 43). The strata should never overlap, in order to avoid giving some observations a higher chance of being selected. Stratified sampling is used when a researcher needs to highlight a given subset within the population database. It can also be used to observe the interaction of two or more subsets (Pfeffermann, 321). Stratified sampling guarantees that observations from each stratum are included in the final sample, unlike simple random sampling, where there is no guarantee that the subgroups will be represented proportionately or equally (Babbie, 206). Stratified sampling can also require a smaller sample size, which in turn saves time and money.

To draw a stratified random sample from a given population database, we use PROC SURVEYSELECT. This SAS procedure allows one to take a sample from a given dataset while maintaining the proportions already present in the data, in our case the good/bad responses. The STRATA statement defines the stratification variable used to draw the sample. It is important to note that the stratum parameters are of interest in their own right and offer insight into the sampling database. When estimating the overall population proportion (p), two quantities are pertinent: the point estimate and the variance estimate. In the example database used above, the bad contracts act as our stratification variable. As these estimates indicate, the estimate of the general population parameter (p) is also vital in stratified sampling (Das, 52). Non-response may be an issue that could result in bias, as seen in section 2.1 above.
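For the point and variance estimates mentioned above, the standard textbook forms for a population proportion under stratified random sampling can be written as follows. This is a sketch under assumed notation, which may differ from that used by Das: H strata, stratum population sizes N_h summing to N, stratum sample sizes n_h, and stratum sample proportions \hat{p}_h.

```latex
\hat{p}_{st} = \sum_{h=1}^{H} W_h \,\hat{p}_h, \qquad W_h = \frac{N_h}{N}

\widehat{\operatorname{Var}}(\hat{p}_{st}) = \sum_{h=1}^{H} W_h^{2}
  \left(1 - \frac{n_h}{N_h}\right) \frac{\hat{p}_h \,(1 - \hat{p}_h)}{n_h - 1}
```

Under proportional allocation, such as a 70/30 split applied within each stratum, the sampling fraction n_h / N_h is the same in every stratum.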
Stratified sampling offers various advantages over simple random sampling. First, it increases the efficiency of estimators of overall population parameters through the selection of strata that are homogeneous within each subset (Picard and Cook, 577). Comparing the simple random selection and the stratified random selection carried out in section 2, we find that simple random selection shows a deviation of ±14.5% relative to the values obtained with stratified random sampling, meaning that there is a considerable risk of response values (good/bad) being lost from the final sample. Stratified sampling is also advantageous because it focuses on subpopulations of special interest, such as the bad contracts in our case. This type of sample selection is also convenient and can be used when dealing with small sample sizes.

6. Conclusion

During statistical modelling, it is common for researchers to split the sample database into validation and development datasets. The representativeness of these samples should be checked to ensure that they do not introduce bias into the final sample chosen for a given study. This paper looked at what occurs when the STRATA statement is used on a given sample, and at the effect of a representativeness problem on the score when the problem occurs on the QUAL variable and when it occurs on a variable used in the score. We note that stratification with the STRATA statement reduces the loss of important observations and ensures that confidence in the sample is high. Without stratification, a large variation exists in the response variable (good/bad in our case), and confidence is low when dealing with a small sample size. With a simple random sample, the split has to be carried out numerous times in order to establish the confidence interval, unlike when the STRATA statement is used.
Finally, we find that if a bias exists on the variable QUAL, no problem arises: the demand stability is not affected, the risk between the DEV and VAL samples is quite stable, and there are no serious gaps in the gap analysis. However, if a bias exists on a variable used in the score, the risk between the DEV and VAL samples is very unstable and there are large gaps in the gap analysis, showing that a serious problem exists.

Works Cited

Babbie, Earl. The Practice of Social Research. 9th ed. Belmont, CA: Wadsworth Thomson, 2001.
Das, N. G. Statistical Methods. New Delhi: Tata McGraw-Hill, 2009. Print.
Der, Geoff and Brian Everitt. Handbook of Statistical Analysis using SAS. 2nd ed. London: CRC Press, 2005.
Faraway, Julian. “Data Splitting Strategies for Reducing the Effect of Model Selection on Inference.” Technometrics 53.4 (2006): 54-64.
Giancristofaro, Arboretti and L. Salmaso. “Model Performance Analysis and Model Validation in Logistic Regression.” Statistica 63.2 (2006): 375-397.
Hosmer, David, Stanley Lemeshow and Rodney Sturdivant. Applied Logistic Regression. New Jersey: John Wiley and Sons, 2013.
Peck, Roxy, Chris Olsen and Jay Devore. Introduction to Statistics and Data Analysis. New York: Cengage Learning, 2011.
Pfeffermann, Danny. “The Role of Sampling Weights when Modelling Survey Data.” International Statistical Review 61.2 (1993): 317-337.
Picard, Richard and R. D. Cook. “Cross-validation of Regression Models.” Journal of the American Statistical Association 79 (1984): 575-583.
Snee, Robert. “Validation of Regression Models: Methods and Examples.” Technometrics 19.4 (1977): 415-431.
Stokes, Maura, Charles Davis and Gary Koch. Categorical Data Analysis Using SAS. 3rd ed. Washington D.C.: SAS Institute, 2004.
