How many fraudulent records need to be resampled if we would like the proportion of fraudulent records in the balanced data set to be 20%?

Assignment Question

Part 1. Use R to solve the following problems from the textbook when appropriate, and submit a Word document with your solutions.

3. Should we strive for the highest possible accuracy for the training set? Why or why not?

6. Suppose we are running a fraud classification model, with a training set of 10,000 records of which only 400 are fraudulent.
a) How many fraudulent records need to be resampled if we would like the proportion of fraudulent records in the balanced data set to be 20%?
b) How many non-fraudulent records need to be set aside if we would like the proportion of fraudulent records in the balanced data set to be 20%?
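The balancing arithmetic in problem 6 can be sketched in a few lines of R. This is only one reading of the question: it assumes that "resampled" means adding x duplicated fraudulent records on top of the original 10,000, and that "set aside" means removing y non-fraudulent records while keeping all 400 fraudulent ones. The variable names are illustrative and do not come from the textbook.

```r
# Training set described in the exercise.
n_total <- 10000
n_fraud <- 400
target  <- 0.20   # desired proportion of fraudulent records

# (a) Oversampling (assumed interpretation): add x resampled fraudulent
#     records so that (n_fraud + x) / (n_total + x) = target.
#     Solving for x gives x = (target * n_total - n_fraud) / (1 - target).
x_resample <- (target * n_total - n_fraud) / (1 - target)

# (b) Undersampling (assumed interpretation): set aside y non-fraudulent
#     records so that n_fraud / (n_total - y) = target.
#     Solving for y gives y = n_total - n_fraud / target.
y_set_aside <- n_total - n_fraud / target

# Sanity check: both balanced sets should have the target proportion.
prop_a <- (n_fraud + x_resample) / (n_total + x_resample)
prop_b <- n_fraud / (n_total - y_set_aside)

print(c(x_resample = x_resample, y_set_aside = y_set_aside))
print(c(prop_a = prop_a, prop_b = prop_b))
```

Under these assumptions the algebra gives x = 2,000 resampled fraudulent records for part a) and y = 8,000 non-fraudulent records set aside for part b). A different definition of resampling would change the equations.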

8. Explain why we should always report a baseline performance, rather than merely citing the uncalibrated result from our model.

Part 2. Use R to solve the following problems from the textbook when appropriate. Refer to RZONE in the textbook for possible hints for R commands. Furthermore, you may need to consult the online documentation of the relevant R packages. Submit your solutions with your R code.

For the following exercises, use the churn data set.

Normalize the numerical data and deal with the correlated variables.

Generate a CART decision tree to classify churn.

Generate a C4.5/C5.0-type decision tree to classify churn.

Compare the two decision trees and discuss the benefits and drawbacks of each.

Generate the full set of decision rules for the CART decision tree.

Generate the full set of decision rules for the C4.5/C5.0-type decision tree.

Compare the two sets of decision rules and discuss the benefits and drawbacks of each.
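The workflow in these exercises might be sketched as follows. Because the churn file is not included here, the sketch builds a small synthetic stand-in, so every column name (`day_mins`, `day_charge`, `cust_calls`, `intl_plan`, `churn`) and the 0.95 correlation cutoff are assumptions rather than properties of the real data set. It uses `rpart` for the CART tree and, only if the `C50` package happens to be installed, `C50::C5.0` for the C5.0 tree and rule set. Min-max scaling is one of several reasonable normalizations.

```r
library(rpart)  # CART implementation; a recommended package bundled with most R installs

set.seed(1)
n <- 500

# Synthetic stand-in for the churn data; all column names are hypothetical.
day_mins <- runif(n, 0, 350)
toy <- data.frame(
  day_mins   = day_mins,
  day_charge = 0.17 * day_mins,   # deliberately a perfect linear copy of day_mins
  cust_calls = rpois(n, 1.5),
  intl_plan  = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.9, 0.1)))
)
toy$churn <- factor(ifelse(toy$day_mins > 260 | toy$cust_calls >= 4, "yes", "no"))

# Step 1: min-max normalize every numeric column to [0, 1].
num_cols <- names(toy)[sapply(toy, is.numeric)]
toy[num_cols] <- lapply(toy[num_cols], function(v) (v - min(v)) / (max(v) - min(v)))

# Step 2: find numeric columns that are highly correlated with an earlier
# column and drop them (the 0.95 cutoff is an arbitrary choice).
cm <- cor(toy[num_cols])
cm[upper.tri(cm, diag = TRUE)] <- 0
drop_cols <- rownames(cm)[apply(abs(cm) > 0.95, 1, any)]
toy <- toy[, !(names(toy) %in% drop_cols)]

# Step 3: CART tree. In the printed tree, each path from the root to a
# leaf (marked with *) reads as one decision rule.
cart_fit <- rpart(churn ~ ., data = toy, method = "class")
print(cart_fit)

# Step 4: C5.0 tree and rule set, attempted only when C50 is available.
if (requireNamespace("C50", quietly = TRUE)) {
  c50_tree  <- C50::C5.0(churn ~ ., data = toy)
  c50_rules <- C50::C5.0(churn ~ ., data = toy, rules = TRUE)
  print(summary(c50_tree))
  print(summary(c50_rules))
}
```

To apply the sketch to the actual assignment, replace the synthetic `toy` data frame with the churn data read from file and adjust the column names accordingly. The comparison and discussion parts of the exercises still have to be written from the resulting trees and rules.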