We provide an introductory demo of the usage of this package, which implements two multi-class Neyman-Pearson classification algorithms proposed in Tian and Feng (2021). The npcs package can be installed from CRAN:
# install.packages("npcs_0.1.1.tar.gz", repos=NULL, type='source')
# install.packages("npcs", repos = "http://cran.us.r-project.org")
Then we can load the package:
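library(npcs)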
Suppose there are $K$ classes ($K \geq 2$), which we denote as classes $1$ to $K$. The training samples $\{(x_i, y_i)\}_{i=1}^n$ are i.i.d. copies of $(X, Y) \in \mathcal{X} \times \{1, \ldots, K\}$, where $\mathcal{X} \subseteq \mathbb{R}^p$. As suggested by Mossman (1999) and Dreiseitl, Ohno-Machado, and Binder (2000), we consider $\mathbb{P}_{X|Y=k}(\phi(X) \neq k)$ as the $k$-th error rate of a classifier $\phi$ for any $k \in \{1, \ldots, K\}$. We focus on the problem of minimizing a weighted sum of $\{\mathbb{P}_{X|Y=k}(\phi(X) \neq k)\}_{k=1}^K$ while controlling $\mathbb{P}_{X|Y=k}(\phi(X) \neq k)$ for $k \in \mathcal{A}$, where $\mathcal{A} \subseteq \{1, \ldots, K\}$. The Neyman-Pearson multi-class classification (NPMC) problem can be formally presented as

$$\min_{\phi} \; \sum_{k=1}^K w_k \, \mathbb{P}_{X|Y=k}(\phi(X) \neq k) \quad \text{s.t.} \quad \mathbb{P}_{X|Y=k}(\phi(X) \neq k) \leq \alpha_k, \; k \in \mathcal{A},$$

where $\phi: \mathcal{X} \rightarrow \{1, \ldots, K\}$ is a classifier, $\alpha_k \in [0, 1)$, $w_k \geq 0$, and $\mathcal{A} \subseteq \{1, \ldots, K\}$ (Tian and Feng (2021)).
This package implements two NPMC algorithms, NPMC-CX and NPMC-ER, which are motivated by cost-sensitive learning. For details about them, please refer to Tian and Feng (2021). Here we only show how to call the relevant functions to solve NP problems via these two algorithms.
We take Example 1 in Tian and Feng (2021) as an example. Consider three classes with independent Gaussian conditional distributions $X|Y=k \sim N(\mu_k, I_p)$, where $p = 5$, $\mu_1 = (-1, 2, 1, 1, 1)^T$, $\mu_2 = (1, 1, 0, 2, 0)^T$, $\mu_3 = (2, -1, -1, 0, 0)^T$, and $I_p$ is the $p$-dimensional identity matrix. The marginal distribution of $Y$ is $\mathbb{P}(Y=1) = \mathbb{P}(Y=2) = 0.3$ and $\mathbb{P}(Y=3) = 0.4$. The training sample size is $n = 1000$ and the test sample size is 2000.
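For intuition, this model could also be simulated by hand as in the following minimal sketch (simulate_example1 is a hypothetical helper; the packaged generate_data function used below produces this design for us):

simulate_example1 <- function(n) {
  mu <- rbind(c(-1, 2, 1, 1, 1),
              c(1, 1, 0, 2, 0),
              c(2, -1, -1, 0, 0))
  # P(Y = 1) = P(Y = 2) = 0.3, P(Y = 3) = 0.4
  y <- sample(1:3, n, replace = TRUE, prob = c(0.3, 0.3, 0.4))
  # X | Y = k ~ N(mu_k, I_5)
  x <- mu[y, ] + matrix(rnorm(n * 5), n, 5)
  list(x = x, y = y)
}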
We would like to solve the following NP problem:

$$\min_{\phi} \; \mathbb{P}_{X|Y=2}(\phi(X) \neq 2) \quad \text{s.t.} \quad \mathbb{P}_{X|Y=1}(\phi(X) \neq 1) \leq 0.05, \quad \mathbb{P}_{X|Y=3}(\phi(X) \neq 3) \leq 0.01.$$
Now let's first generate the training data by calling the function generate_data. We also create the variables alpha and w, representing the target level to control and the weight in the objective function for each error rate. Here the target levels for classes 1 and 3 are 0.05 and 0.01. There is no need to control the error rate of class 2, so we set the corresponding level to NA. The weights for the three classes are 0, 1, and 0, respectively.
set.seed(123, kind = "L'Ecuyer-CMRG")
# training set of size 1000 from model 1 (Example 1)
train.set <- generate_data(n = 1000, model.no = 1)
x <- train.set$x
y <- train.set$y
# test set of size 2000 from the same model
test.set <- generate_data(n = 2000, model.no = 1)
x.test <- test.set$x
y.test <- test.set$y
# per-class target levels (NA = unconstrained) and objective weights
alpha <- c(0.05, NA, 0.01)
w <- c(0, 1, 0)
We first examine the test error rates of a vanilla logistic regression model fitted on the training data. We fit the multinomial logistic regression via the function multinom in the package nnet. Note that our package provides the function error_rate, which calculates the error rate per class given the predicted and true response vectors. The results show that the vanilla logistic regression fails to control the error rates of classes 1 and 3.
library(nnet)
# fit a vanilla multinomial logistic regression on the training data
fit.vanilla <- multinom(y ~ ., data = data.frame(x = x, y = factor(y)),
                        trace = FALSE)
y.pred.vanilla <- predict(fit.vanilla, newdata = data.frame(x = x.test))
error_rate(y.pred.vanilla, y.test)
## 1 2 3
## 0.08517888 0.16264090 0.04924242
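As a sanity check, the same quantity can be computed directly in base R; this sketch assumes error_rate reports the per-class misclassification proportion, as described above:

# per-class misclassification proportion, computed without the package
tapply(as.character(y.pred.vanilla) != as.character(y.test), y.test, mean)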
Then we conduct NPMC-CX and NPMC-ER based on multinomial logistic regression by calling the function npcs. The user can indicate which algorithm to use via the parameter algorithm. Also note that we need to input the target levels alpha and the weights w. For NPMC-ER, the user can decide whether to refit the model using all training data via the boolean parameter refit. Compared to the previous results of vanilla logistic regression, both NPMC-CX-logistic and NPMC-ER-logistic successfully control $\mathbb{P}_{X|Y=1}(\phi(X) \neq 1)$ and $\mathbb{P}_{X|Y=3}(\phi(X) \neq 3)$ around the levels 0.05 and 0.01, respectively.
fit.npmc.CX.logistic <- try(npcs(x, y, algorithm = "CX", classifier = "multinom",
w = w, alpha = alpha))
fit.npmc.ER.logistic <- try(npcs(x, y, algorithm = "ER", classifier = "multinom",
w = w, alpha = alpha, refit = TRUE))
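The calls above are wrapped in try() because the NP problem can be infeasible at the requested levels, in which case npcs() throws an error (note the feasibility column in the cv.npcs output later). A defensive check before predicting might look like this (a sketch, not required by the package):

if (inherits(fit.npmc.CX.logistic, "try-error")) {
  stop("NPMC-CX appears infeasible at the requested alpha levels")
}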
# test error of NPMC-CX-logistic
y.pred.CX.logistic <- predict(fit.npmc.CX.logistic, x.test)
error_rate(y.pred.CX.logistic, y.test)
## 1 2 3
## 0.056218058 0.333333333 0.007575758
# test error of NPMC-ER-logistic
y.pred.ER.logistic <- predict(fit.npmc.ER.logistic, x.test)
error_rate(y.pred.ER.logistic, y.test)
## 1 2 3
## 0.07666099 0.26731079 0.01262626
We can use different base models to run NPMC-CX and NPMC-ER. Our framework is built on top of the caret::train function from the caret (Classification And Regression Training) package, which enables users to plug in a wide range of classification methods via the classifier parameter. For instance, both LDA and the gradient boosting machine (GBM) effectively control $\mathbb{P}_{X|Y=1}(\phi(X) \neq 1)$ and $\mathbb{P}_{X|Y=3}(\phi(X) \neq 3)$ at levels of approximately 0.05 and 0.01, respectively.
fit.npmc.CX.lda <- try(npcs(x, y, algorithm = "CX", classifier = "lda",
w = w, alpha = alpha))
fit.npmc.ER.lda <- try(npcs(x, y, algorithm = "ER", classifier = "lda",
w = w, alpha = alpha, refit = TRUE))
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
fit.npmc.CX.gbm <- try(npcs(x, y, algorithm = "CX", classifier = "gbm",
w = w, alpha = alpha))
fit.npmc.ER.gbm <- try(npcs(x, y, algorithm = "ER", classifier = "gbm",
w = w, alpha = alpha, refit = TRUE))
# test error of NPMC-CX-LDA
y.pred.CX.lda <- predict(fit.npmc.CX.lda, x.test)
error_rate(y.pred.CX.lda, y.test)
## 1 2 3
## 0.056218058 0.341384863 0.007575758
# test error of NPMC-ER-LDA
y.pred.ER.lda <- predict(fit.npmc.ER.lda, x.test)
error_rate(y.pred.ER.lda, y.test)
## 1 2 3
## 0.02555366 0.32045089 0.02398990
# test error of NPMC-CX-GBM
y.pred.CX.gbm <- predict(fit.npmc.CX.gbm, x.test)
error_rate(y.pred.CX.gbm, y.test)
## 1 2 3
## 0.034071550 0.410628019 0.008838384
# test error of NPMC-ER-GBM
y.pred.ER.gbm <- predict(fit.npmc.ER.gbm, x.test)
error_rate(y.pred.ER.gbm, y.test)
## 1 2 3
## 0.100511073 0.394524960 0.003787879
By default, the base model may tune certain hyperparameters before the NPMC-CX and NPMC-ER algorithms are run. However, we can modify the set of tuning parameter candidates via the tuneGrid parameter and specify the resampling method via the trControl parameter. The tuneGrid parameter accepts a list that is transformed into a data.frame with all possible combinations of tuning parameters and passed to the base model's tuneGrid parameter. Meanwhile, the values in the trControl parameter are provided to the caret::trainControl function and passed to the base model's trControl parameter. For more information on hyperparameters in different models, please see sections 5.5.2 and 5.5.4 of the caret documentation.
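As a quick illustration before the knn example below, a tuning grid for the gbm classifier used earlier could be supplied as follows (a sketch; the parameter names follow caret's standard gbm grid, and the candidate values here are arbitrary):

# hypothetical gbm tuning grid passed through npcs to caret::train
fit.npmc.CX.gbm.tuned <- try(npcs(x, y, algorithm = "CX", classifier = "gbm",
                                  w = w, alpha = alpha,
                                  trControl = list(method = "cv", number = 3),
                                  tuneGrid = list(n.trees = c(50, 100),
                                                  interaction.depth = 1:2,
                                                  shrinkage = 0.1,
                                                  n.minobsinnode = 10)))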
In the example below, to implement an NPMC-CX model using knn as the base model, we specify the candidate k values as 5, 7, and 9 and perform three-fold cross-validation.
# 3-fold cross-validation with tuning parameter candidates k = 5, 7, 9
fit.npmc.CX.knn <- npcs(x, y, algorithm = "CX", classifier = "knn", w = w,
                        alpha = alpha, seed = 1,
                        trControl = list(method = "cv", number = 3),
                        tuneGrid = list(k = c(5, 7, 9)))
# the optimal hyperparameter is k = 9
fit.npmc.CX.knn$fit
## k-Nearest Neighbors
##
## 1000 samples
## 5 predictor
## 3 classes: '1', '2', '3'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 666, 666, 668
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8910071 0.8356724
## 7 0.8990092 0.8477220
## 9 0.9020152 0.8522763
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
# test error of NPMC-CX-knn
y.pred.CX.knn <- predict(fit.npmc.CX.knn, x.test)
error_rate(y.pred.CX.knn, y.test)
## 1 2 3
## 0.040885860 0.475040258 0.006313131
We can use the cv.npcs function to automatically compare the performance of NPMC-CX, NPMC-ER, and the vanilla model through cross-validation or bootstrapping. K-fold cross-validation can be specified by setting the fold parameter and setting resample to "cv". Alternatively, bootstrapping can be specified via the fold and partition_ratio parameters together with resample set to "bootstrapping". Here, fold represents the number of iterations and partition_ratio represents the proportion of data used to construct the model in each iteration. Additionally, we can use the stratified argument to enable stratified sampling.
The cv.npcs function produces several outputs. First, a summary of model evaluation is generated, which includes various evaluation metrics such as accuracy, class-specific error rates, the Matthews correlation coefficient, micro-F1, macro-F1, Kappa, and balanced accuracy. Class-specific error rates are displayed using a combination of violin and box plots, which can be disabled with plotit = FALSE. Other outputs include the training and testing error rates of the two algorithms and the base model for each fold, the validation-set data indices for each fold, and the arithmetic mean of the confusion matrix and sample size in the validation set.
cv.npcs.knn <- cv.npcs(x, y, classifier = "knn", w = w, alpha = alpha,
# fold=5, stratified=TRUE, partition_ratio = 0.7
# resample=c("bootstrapping", "cv"),seed = 1,
# plotit=TRUE, trControl=list(), tuneGrid=list(),
# verbose=TRUE
)
## Data splitting complete.
## Settings: stratified = TRUE , method = bootstrapping , classifier = knn
## current progress: split 5 of method knn
## algorithm training_1 training_2 training_3 testing_1 testing_2
## vanilla vanilla 0.06540284 0.1089286 0.046616541 0.08666667 0.1541667
## cx cx 0.03317536 0.3607143 0.006015038 0.04888889 0.4250000
## er er 0.04620853 0.4520089 0.002819549 0.05833333 0.5442708
## testing_3 feasibility accuracy mcc microF1 macroF1
## vanilla 0.05663717 1.0 0.9030100 0.8538280 0.9028717 0.9007137
## cx 0.01415929 1.0 0.8434783 0.7776806 0.8328596 0.8317244
## er 0.01106195 0.8 0.8035117 0.7244628 0.7829562 0.7803219
## Kappa BAC
## vanilla 0.8538119 0.9008432
## cx 0.7624099 0.8373173
## er 0.7013660 0.7954446
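To see exactly what cv.npcs returns, the top-level structure of the object can be inspected with base R (no package-specific accessor assumed):

# list the components of the cv.npcs result at depth 1
str(cv.npcs.knn, max.level = 1)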