Logistic Regression Classifier for the Churn Data - Coursework Example

Summary
The paper "Logistic Regression Classifier for the Churn Data" discusses that data summarization using descriptive statistics and graphs can be used to give some rough idea about the next processing, that is, inference and testing of hypotheses techniques have to be used. …

Extract of sample "Logistic Regression Classifier for the Churn Data"

Question 1:

For the churn data the package used for the analysis is SPSS, because it is versatile and familiar. The churn variable, good/bad, is treated as the output (dependent) variable and the other variables are treated as independent. The independent variables are checking, duration, history, purpose, amount, savings, employed, installp, marital, coapp, resident, property, age, other, housing, existcr, job, depends, telephon and foreign. Since the output variable is categorical (good or bad), the proper statistical technique to adopt is logistic regression. The preprocessing steps involve identifying the independent and dependent variables, coding the variables that require coding, and so on. The SPSS syntax is as follows:

LOGISTIC REGRESSION VARIABLES good_bad
  /METHOD=ENTER checking duration history purpose amount savings employed installp marital coapp resident property age other housing existcr job depends telephon foreign
  /CONTRAST (purpose)=Indicator
  /CLASSPLOT
  /PRINT=CORR
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

The output of the logistic regression is presented below.

Case Processing Summary
Unweighted Cases                          N      Percent
Selected Cases   Included in Analysis     964    96.4
                 Missing Cases            36     3.6
                 Total                    1000   100.0
Unselected Cases                          0      .0
Total                                     1000   100.0
a. If weight is in effect, see the classification table for the total number of cases.

Dependent Variable Encoding
Original Value   Internal Value
Bad              0
Good             1

Categorical Variables Codings (indicator coding of purpose)
Category   Frequency   Indicator parameter set to 1
(blank)    3           (1)
0          225         (2)
1          100         (3)
2          174         (4)
3          268         (5)
4          12          (6)
5          22          (7)
6          47          (8)
8          9           (9)
9          94          (10)
X          10          none (reference category, all parameters .000)

Block 0: Beginning Block

Classification Table (Step 0)
Observed                Predicted bad   Predicted good   Percentage Correct
good_bad   bad          0               292              .0
           good         0               672              100.0
Overall Percentage                                       69.7

Variables in the Equation (Step 0)
           B      S.E.   Wald      df   Sig.   Exp(B)
Constant   .834   .070   141.414   1    .000   2.301

Variables not in the Equation (Step 0)
Variable      Score     df   Sig.
checking      119.858   1    .000
duration      40.086    1    .000
history       48.045    1    .000
purpose       39.421    10   .000
purpose(1)    6.926     1    .008
purpose(2)    9.752     1    .002
purpose(3)    9.334     1    .002
purpose(4)    .361      1    .548
purpose(5)    12.039    1    .001
purpose(6)    .053      1    .817
purpose(7)    .393      1    .531
purpose(8)    4.846     1    .028
purpose(9)    1.583     1    .208
purpose(10)   .694      1    .405
amount        18.355    1    .000
savings       30.125    1    .000
employed      14.071    1    .000
installp      5.548     1    .019
marital       8.537     1    .003
coapp         .419      1    .518
resident      .000      1    .996
property      20.211    1    .000
age           7.933     1    .005
other         10.626    1    .001
housing       .146      1    .703
existcr       2.184     1    .139
job           .426      1    .514
depends       .067      1    .797
telephon      2.137     1    .144
foreign       8.114     1    .004
a. Residual chi-squares are not computed because of redundancies.

Block 1: Method = Enter

Omnibus Tests of Model Coefficients
Step 1           Chi-square   df   Sig.
       Step      299.197      29   .000
       Block     299.197      29   .000
       Model     299.197      29   .000

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      883.255 (a)         .267                   .378
a. Estimation terminated at iteration number 20 because the maximum number of iterations was reached; a final solution could not be found.
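The same model can be reproduced, at least approximately, outside SPSS. The following is only a minimal Python sketch: the file name churn.csv, the assumption that the predictors are already numerically coded, and the string values "good"/"bad" for the outcome are all assumptions, not part of the original analysis; scikit-learn also regularizes the coefficients by default, so its estimates will not match the SPSS output exactly.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

df = pd.read_csv("churn.csv").dropna()          # hypothetical file; drop cases with missing values

numeric_vars = ["checking", "duration", "history", "amount", "savings", "employed",
                "installp", "marital", "coapp", "resident", "property", "age",
                "other", "housing", "existcr", "job", "depends", "telephon", "foreign"]
# Indicator-code the categorical purpose variable, as /CONTRAST (purpose)=Indicator does in SPSS.
X = pd.concat([df[numeric_vars],
               pd.get_dummies(df["purpose"], prefix="purpose", drop_first=True)], axis=1)
y = (df["good_bad"] == "good").astype(int)       # good = 1, bad = 0, as in the SPSS encoding

model = LogisticRegression(max_iter=1000).fit(X, y)

# Classification table at the 0.5 cut value, mirroring CUT(0.5).
pred = (model.predict_proba(X)[:, 1] >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
print("sensitivity:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
print("accuracy:   ", (tp + tn) / len(y))

Note that this sketch uses the conventional row-wise definitions of sensitivity and specificity, so its figures need not coincide exactly with the ratios computed from the classification table below.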
The sensitivity and specificity analysis can be done as follows:

Classification Table
Observed            Predicted Good   Predicted Bad   Total
good_bad   Good     596 (TP)         76 (FP)         672
           Bad      140 (FN)         152 (TN)        292
Total               736              228             964
TP: true positive; TN: true negative; FP: false positive; FN: false negative

Sensitivity = TP/(TP + FN) = 596/(596 + 140) = 0.810, or 81.0%
Specificity = TN/(TN + FP) = 152/(152 + 76) = 0.667, or 66.7%

The ROC curve for the sensitivity and specificity analysis is part of the output (not reproduced in this extract). The classification accuracy is 77.6%, so the model fits the data well.

Question 2:

An academic paper published in 2010 or later that discusses a real-life application of data mining and credit scoring is "A comprehensive survey of data mining based fraud detection research", authored by Clifton Phua, Vincent Lee, Kate Smith and Ross Gayler and published in the IMA Journal of Management Mathematics. The survey categorizes, compares and summarizes almost all of the technical and review articles on automated fraud detection published in the last ten years. Nowadays technology is so highly developed that it can fall into the wrong hands as a weapon, used to misuse credit cards, bank accounts, internet banking and private social networks; such fraud is a by-product, an evil, of planned and monitored automation systems. Data mining techniques such as clustering, classification and association algorithms, or artificial neural networks, can be used to build a fraud detection system.

First of all, fraudsters are classified as internal fraudsters, which include managers and employees, and external fraudsters, which comprise criminals, organized groups carrying out fraudulent activities, and average persons interested in tracing the accounts of an opponent. At each level a code is assigned: 1 for an internal party and 2 for an external party or outsider; within employees, 1 is coded for a manager and 2 for other staff; among externals, 1 is given for criminals, 2 for organized groups involved in fraudulent activities and 3 for average persons. Fraud can also occur at two levels, the customer and the supplier, which can likewise be coded 1 and 2 respectively, so that the intensity of fraud at the various stages can be captured. Such a coding makes it possible to show from which industry the fraud originated and who committed it. These data can also be used to investigate bankruptcy and other insurance entitlement problems.

E-business and e-commerce present an enormous number of data mining tasks because they are open worldwide to fraud. Here, IP-based telecommunication services and online automated systems can be properly monitored to avoid leakage of data to anyone except the giver and the taker. The data for this study were collected in a structured manner. The data are also matched against previous fraud detection records to associate the fraudulent persons, and this mining technique reveals whether a newly emerging gang has committed the fraud. Classification is the technique used here. The specific attributes used for the detection of fraud involve financial risk: accounts receivable, allowance for doubtful debts, net sales figures, crop insurance data involving ratios, liability figures and so on.
Some attributes cannot be directly revealed, such as date/time stamps, the current transaction (amount, geographical location, validity and merchant industry code), transaction history, payment history and other account information such as the age of the account. Apart from these, temporal information can also be retrieved, such as arrival at and departure from the bank, and personal attributes such as manner of dress can also be used to detect fraud. If the data are vast and expensive to process, neural network techniques can be used so that the mining is largely left to the computer. The most useful techniques for detecting fraud are the supervised learning of neural networks, hybrid supervised learning and unsupervised learning. In most cases the choice of data mining technique depends more on the practical issues of operational requirements, resource constraints and management commitment to reducing fraud than on technical requirements. Other techniques, such as graph-theoretic anomaly detection and inductive logic programming, can also be used in fraud detection studies. More rarely, but usefully, outlier detection, cost-sensitive learning, stream mining and graph mining can also be used as data mining techniques, with scalability as a further consideration.

Question 3:

Social networks and media such as Facebook, Twitter, LinkedIn, Xing, … are gaining in popularity. They also generate huge amounts of data (e.g. friendship networks, tweet messages, hashtags, tagged pictures, …), so many data mining techniques can be used to create new opportunities for companies to analyze and leverage social media data. Some examples of new applications of data mining techniques to social media data, the challenges involved, and possible threats (e.g. privacy) are discussed below.

The main goals of data mining techniques are description and prediction. Prediction makes use of existing variables in the database in order to predict unknown or future values of interest, while description focuses on finding patterns describing the data and presenting them for user interpretation. The relative emphasis on prediction and description differs with the underlying application and technique. Several data mining techniques fulfil these objectives, among them association rules, classification, sequential patterns and clustering. The basic premise of association mining is to find all associations such that the presence of one set of items in a transaction implies, or does not imply, the presence of other items. Classification develops profiles of different groups. Sequential pattern mining identifies sequential patterns subject to a user-specified minimum constraint. Clustering segments a database into subsets, or clusters. A different way of classifying data mining techniques is into (i) user-guided or verification-driven data mining and (ii) discovery-driven data mining, the automatic discovery of rules that is beyond the capability of a human being or would otherwise be dismissed as a coincidence of patterns in the data.

Let us illustrate one of the algorithms used in data mining for association rules. An association rule is defined over a set of items, say A = {A1, A2, A3, …
, An}. Let the sample database be denoted as:

      A1  A2  A3  A4  A5  A6  A7  A8  A9
1     1   0   0   0   1   1   0   1   0
2     0   1   0   1   0   0   0   1   0
3     0   0   1   0   0   1   0   0   1
4     1   0   0   1   1   0   1   0   0
5     0   1   1   1   0   0   1   0   1
6     1   0   0   0   1   1   1   0   1
etc.

Let A1 represent gender; A2 graduation; A3 marital status; A4 income level above or below a certain amount; A5 age below or above 35; A6 cadre above or below supervisory level; A7 technical or non-technical; A8 possession of a luxury car or not; and A9 American or non-American. This exactly represents the social metadata of an organization. With huge records running into the thousands, we can extract minimal subsets of interest so as to identify groups of similarities and dissimilarities and extract useful information for modeling future goals. The goal may be whether to give a promotion based on certain criteria, which can also involve many other parameters. We can also personalize these data, for example asking whether a particular subset of persons is married or not: which subset best represents married individuals and which represents unmarried individuals? This can be evaluated using frequency counts for some itemsets (a short support-counting sketch is given at the end of this answer):

Itemset A    Support count
{1}          3
{2}          2
{3}          2
{4}          3
{5}          3
{6}          3
{7}          3
{8}          2
{9}          3
{5,6}        2
{7,9}        2
{2,3,4}      1
{5,6,7}      1

The problem is to identify the most frequently repeated combinations, as in the table above. By forming such tables, we can find new relationships that are as yet undefined and unexplored. So, using data mining techniques such as classification and clustering algorithms, it is possible to extract new, previously unknown relationships. For example, some persons may, because of a high educational profile, still be hesitant to marry even above 35; such useful relationships can be extracted using data mining techniques. But when using these techniques, some threats may arise, for instance from officially declaring wrong policies and ideas that are against the rules and regulations of social interest. For example, social networks do not always take adequate steps to protect user privacy, private computer systems may be intruded upon unnecessarily, and Facebook, Twitter and other social networks can tap into the private affairs of college students.
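To make the counting concrete, here is a minimal Python sketch that reproduces the support counts above from the toy 0/1 table; the list of itemsets checked and the minimum-support threshold of 2 are illustrative choices, not taken from the original text.

from itertools import combinations

# Toy 0/1 database from the example above: rows are persons, columns are A1..A9.
rows = [
    [1, 0, 0, 0, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 1, 1, 1, 0, 1],
]

def support(itemset):
    # Number of rows in which every attribute of the itemset equals 1.
    return sum(all(row[i - 1] == 1 for i in itemset) for row in rows)

# Reproduce some of the counts from the support table above.
for itemset in [(1,), (5, 6), (7, 9), (2, 3, 4), (5, 6, 7)]:
    print(itemset, support(itemset))

# Apriori-style enumeration of all attribute pairs meeting a minimum support of 2.
frequent_pairs = [pair for pair in combinations(range(1, 10), 2) if support(pair) >= 2]
print(frequent_pairs)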
Question 4:

Data mining can be used to model LGD and EAD using the following algorithms: logistic regression (LR), decision tree analysis, support vector machines (SVM), neural networks (NN) and discriminant analysis, considering two credit data sets. Among these algorithms, logistic regression and support vector machines have been used for classification accuracy, grading exposures from the most expensive risk, through expensive, moderate and less expensive risk, down to the least expensive risk. The neural network algorithm is not as powerful as the LR and SVM algorithms. Credit risk estimation is a vital tool in financial analysis, and data mining is a very powerful technique for estimating credit risk. Using kernel logistic regression, with its non-linear probabilistic classification, competitive discriminative ability and transparent reasoning, credit risk can be evaluated strongly. In the field of statistical learning, convex optimization models and iteratively reweighted least squares algorithms can also be used. Kernel logistic regression has proven to be more accurate than the SVM technique for evaluating credit risk.

The credit risk in any financial system is decomposed into default risk (measured by the probability of default, PD), loss risk (measured by the loss given default, LGD) and exposure risk (measured by the exposure at default, EAD). The probability of default (PD) is the probability that an investor will experience a default within one year of the investment. The loss given default (LGD) is a factor threatening the stability and smooth running of financial markets; it is used to split credit risk between two potential binary classes of customers, (i) credible customers and (ii) non-credible customers. The exposure at default (EAD) is the part of the investment that is involved in the present financial process. There are minimum terms that banks must meet in setting up the regulations for LGD and EAD. Both of these risks, LGD and EAD, have bimodal distributions, which complicates the statistical distributions involved in risk evaluation. Hence some innovative approaches, such as ECAP (economic capital portfolio) models, can be used as a mediation between these two techniques. This approach fixes the minimum capital a financial institution should hold so as to safeguard the savings of the persons who have deposited with it. By introducing more simulation techniques to estimate the risk involved in investment, complex market systems and stock trading can be evolved so as to guide any investor towards a minimum-risk investment. Past records and a SWOT analysis will help the investor to assess the risk involved in making a plausible investment.

The preprocessing techniques involve major challenges such as coding problems, missing values and the transformation techniques to be adopted. Then suitable statistical techniques such as regression analysis, discriminant analysis, clustering, classification and artificial neural networks can be used to model the data that have been preprocessed and kept ready for modeling. The data should be made amenable to the statistical techniques involved. The data can be classified as categorized or uncategorized, and this should be kept in mind during preprocessing and modeling. The variables should be chosen so that they are useful in predicting the output variable with a suitable modeling technique. Data summarization using descriptive statistics and graphs can be used to give a rough idea of the subsequent processing, that is, which inference and hypothesis-testing techniques have to be used. The variables involved in data processing should be identified as categorical or continuous in nature; if they are categorical, chi-square tests for association of attributes can be used, and if they are continuous, correlation and regression techniques have to be used.
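As an illustration of that last point, the sketch below shows how the two checks might look in Python with SciPy. The contingency table counts and the duration/amount values are hypothetical example numbers chosen only to make the sketch runnable; they are not taken from the coursework data.

import numpy as np
from scipy.stats import chi2_contingency, pearsonr

# Categorical vs categorical: chi-square test of association on a contingency table
# (hypothetical counts, e.g. checking-account status vs good/bad outcome).
table = np.array([[120, 60],
                  [ 80, 140]])
chi2, p_cat, dof, expected = chi2_contingency(table)
print("chi-square:", chi2, "p-value:", p_cat)

# Continuous vs continuous: Pearson correlation, the first step towards a regression model
# (hypothetical values, e.g. loan duration in months vs loan amount).
duration = np.array([6, 12, 24, 36, 48, 60])
amount = np.array([1000, 2200, 3900, 6100, 7800, 9500])
r, p_cont = pearsonr(duration, amount)
print("correlation:", r, "p-value:", p_cont)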