Title: Neyman-Pearson Classification via Cost-Sensitive Learning
Description: We connect the multi-class Neyman-Pearson classification (NP) problem to the cost-sensitive learning (CS) problem, and propose two algorithms (NPMC-CX and NPMC-ER) to solve the multi-class NP problem through cost-sensitive learning tools. Under certain conditions, the two algorithms are shown to satisfy multi-class NP properties. More details are available in the paper "Neyman-Pearson Multi-class Classification via Cost-sensitive Learning" (Ye Tian and Yang Feng, 2021).
Authors: Ye Tian [aut], Ching-Tsung Tsai [aut, cre], Yang Feng [aut]
Maintainer: Ching-Tsung Tsai <[email protected]>
License: GPL-2
Version: 0.1.1
Built: 2025-02-02 05:58:02 UTC
Source: https://github.com/cran/npcs
Compare the performance of the NPMC-CX, NPMC-ER, and vanilla models through cross-validation or bootstrapping. The function returns an evaluation summary with various metrics and visualizes the class-specific error rates.
cv.npcs(
  x,
  y,
  classifier,
  alpha,
  w,
  fold = 5,
  stratified = TRUE,
  partition_ratio = 0.7,
  resample = c("bootstrapping", "cv"),
  seed = 1,
  verbose = TRUE,
  plotit = TRUE,
  trControl = list(),
  tuneGrid = list()
)
x: matrix; the predictor matrix of the complete data.
y: numeric/factor/string; the response vector of the complete data.
classifier: string; the classifier to use in the underlying npcs fit.
alpha: the levels at which to control the error rate of each class. The length must equal the number of classes; use NA for classes without error control.
w: the weights in the objective function. Should be a vector of length K, where K is the number of classes.
fold: integer; the number of CV folds or bootstrapping iterations. Default = 5.
stratified: logical; if TRUE, the sample is split into groups that preserve the class proportions of the response vector.
partition_ratio: numeric; the proportion of data used for model construction when resample == "bootstrapping".
resample: string; the resampling method, either "bootstrapping" or "cv".
seed: random seed.
verbose: logical; if TRUE, cv.npcs prints its progress. If FALSE, the function remains silent.
plotit: logical; if TRUE, the output list includes a box plot summarizing the error rates of the vanilla and NPMC models.
trControl: list; the resampling method used within each fold.
tuneGrid: list; hyperparameter tuning grid or fixed hyperparameter settings.
# data generation: case 1 in Tian, Y., & Feng, Y. (2021) with n = 1000
set.seed(123, kind = "L'Ecuyer-CMRG")
train.set <- generate_data(n = 1000, model.no = 1)
x <- train.set$x
y <- train.set$y

test.set <- generate_data(n = 2000, model.no = 1)
x.test <- test.set$x
y.test <- test.set$y

alpha <- c(0.05, NA, 0.01)
w <- c(0, 1, 0)

# construct the multi-class NP problem
cv.npcs.knn <- cv.npcs(x, y, classifier = "knn", w = w, alpha = alpha)

# result summary and visualization
cv.npcs.knn$summaries
cv.npcs.knn$plot
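The resampling behavior can be adjusted through the documented arguments; a minimal sketch (the argument values below are illustrative only):

# illustrative only: 10-fold stratified cross-validation instead of the
# default bootstrapping scheme, with printing and plotting turned off
cv.npcs.knn.cv <- cv.npcs(x, y, classifier = "knn", w = w, alpha = alpha,
                          resample = "cv", fold = 10, stratified = TRUE,
                          verbose = FALSE, plotit = FALSE)
cv.npcs.knn.cv$summaries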
Calculate the error rate for each class given the predicted labels and true labels.
error_rate(y.pred, y, class.names = NULL)
y.pred: the predicted labels.
y: the true labels.
class.names: the names of the classes. Should be a string vector. Default = NULL, in which case the names are set to 1, ..., K, where K is the number of classes.
A vector of the error rate for each class. The vector names are the same as class.names.
Tian, Y., & Feng, Y. (2021). Neyman-Pearson Multi-class Classification via Cost-sensitive Learning. Submitted. Available soon on arXiv.
See also: npcs, predict.npcs, generate_data, gamma_smote.
# data generation
set.seed(123, kind = "L'Ecuyer-CMRG")
train.set <- generate_data(n = 1000, model.no = 1)
x <- train.set$x
y <- train.set$y

test.set <- generate_data(n = 1000, model.no = 1)
x.test <- test.set$x
y.test <- test.set$y

library(nnet)
fit.vanilla <- multinom(y~., data = data.frame(x = x, y = factor(y)), trace = FALSE)
y.pred.vanilla <- predict(fit.vanilla, newdata = data.frame(x = x.test))
error_rate(y.pred.vanilla, y.test)
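As a quick illustration of the return value (a toy example, not taken from the package documentation), each entry is the misclassification rate among observations whose true label is that class:

# class 1: 0 of 1 wrong, class 2: 0 of 1 wrong, class 3: 1 of 2 wrong
y.true <- c(1, 2, 3, 3)
y.hat  <- c(1, 2, 2, 3)
error_rate(y.hat, y.true)  # under the definition above: 0, 0, 0.5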
gamma-SMOTE with gamma in [0,1] is a variant of the original SMOTE proposed by Chawla, N. V. et al. (2002). It can be combined with the NPMC methods proposed in Tian, Y., & Feng, Y. (2021); see Section 5.2.3 of that paper for more details.
gamma_smote(x, y, dup_rate = 1, gamma = 0.5, k = 5)
x: the predictor matrix, where each row and column represents an observation and a predictor, respectively.
y: the response vector. Must be integers from 1 to K for some K >= 2. Can be either a numerical or factor vector.
dup_rate: the duplicate rate of the original data. Default = 1, which yields a new data set with twice the original sample size.
gamma: the upper bound of the uniform distribution used when generating synthetic data points in SMOTE. Can be any number between 0 and 1. Default = 0.5. When it equals 1, gamma-SMOTE is equivalent to the original SMOTE (Chawla, N. V. et al. (2002)).
k: the number of nearest neighbors used in the SMOTE sampling process. Default = 5.
A list consisting of merged original and synthetic data, with two components x and y. x is the predictor matrix and y is the label vector.
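For intuition, the following sketch (a conceptual illustration, not the package implementation) shows how a single synthetic point could be formed under the gamma-SMOTE idea described above: interpolate from an observation toward one of its same-class nearest neighbors, with an interpolation coefficient drawn from Unif(0, gamma):

# conceptual sketch only; gamma_smote() handles neighbor search and
# duplication internally
set.seed(1)
x.i   <- c(0, 0)    # an observation
x.nn  <- c(1, 2)    # one of its k nearest neighbors in the same class
gamma <- 0.5
u <- runif(1, min = 0, max = gamma)  # gamma = 1 recovers the original SMOTE
x.syn <- x.i + u * (x.nn - x.i)      # synthetic point on the connecting segment
x.syn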
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16, 321-357.
Tian, Y., & Feng, Y. (2021). Neyman-Pearson Multi-class Classification via Cost-sensitive Learning. Submitted. Available soon on arXiv.
See also: npcs, predict.npcs, error_rate, and generate_data.
## Not run:
set.seed(123, kind = "L'Ecuyer-CMRG")
train.set <- generate_data(n = 200, model.no = 1)
x <- train.set$x
y <- train.set$y

test.set <- generate_data(n = 1000, model.no = 1)
x.test <- test.set$x
y.test <- test.set$y

# construct the multi-class NP problem: case 1 in Tian, Y., & Feng, Y. (2021)
alpha <- c(0.05, NA, 0.01)
w <- c(0, 1, 0)

## try NPMC-CX, NPMC-ER based on multinomial logistic regression, and vanilla multinomial
## logistic regression without SMOTE. NPMC-ER outputs the infeasibility error information.
fit.npmc.CX <- try(npcs(x, y, algorithm = "CX", classifier = "multinom", w = w, alpha = alpha))
fit.npmc.ER <- try(npcs(x, y, algorithm = "ER", classifier = "multinom", w = w, alpha = alpha, refit = TRUE))
fit.vanilla <- nnet::multinom(y~., data = data.frame(x = x, y = factor(y)), trace = FALSE)

# test error of NPMC-CX based on multinomial logistic regression without SMOTE
y.pred.CX <- predict(fit.npmc.CX, x.test)
error_rate(y.pred.CX, y.test)

# test error of vanilla multinomial logistic regression without SMOTE
y.pred.vanilla <- predict(fit.vanilla, newdata = data.frame(x = x.test))
error_rate(y.pred.vanilla, y.test)

## create synthetic data by 0.5-SMOTE
D.syn <- gamma_smote(x, y, dup_rate = 1, gamma = 0.5, k = 5)
x <- D.syn$x
y <- D.syn$y

## try NPMC-CX, NPMC-ER based on multinomial logistic regression, and vanilla multinomial
## logistic regression with SMOTE. NPMC-ER can successfully find a solution after SMOTE.
fit.npmc.CX <- try(npcs(x, y, algorithm = "CX", classifier = "multinom", w = w, alpha = alpha))
fit.npmc.ER <- try(npcs(x, y, algorithm = "ER", classifier = "multinom", w = w, alpha = alpha, refit = TRUE))
fit.vanilla <- nnet::multinom(y~., data = data.frame(x = x, y = factor(y)), trace = FALSE)

# test error of NPMC-CX based on multinomial logistic regression with SMOTE
y.pred.CX <- predict(fit.npmc.CX, x.test)
error_rate(y.pred.CX, y.test)

# test error of NPMC-ER based on multinomial logistic regression with SMOTE
y.pred.ER <- predict(fit.npmc.ER, x.test)
error_rate(y.pred.ER, y.test)

# test error of vanilla multinomial logistic regression with SMOTE
y.pred.vanilla <- predict(fit.vanilla, newdata = data.frame(x = x.test))
error_rate(y.pred.vanilla, y.test)
## End(Not run)
Generate the data from two simulation cases in Tian, Y., & Feng, Y. (2021).
generate_data(n = 1000, model.no = 1)
n: the generated sample size. Default = 1000.
model.no: the model number in Tian, Y., & Feng, Y. (2021). Can be 1 or 2. Default = 1.
A list with two components x and y. x is the predictor matrix and y is the label vector.
Tian, Y., & Feng, Y. (2021). Neyman-Pearson Multi-class Classification via Cost-sensitive Learning. Submitted. Available soon on arXiv.
See also: npcs, predict.npcs, error_rate, and gamma_smote.
set.seed(123, kind = "L'Ecuyer-CMRG")
train.set <- generate_data(n = 1000, model.no = 1)
x <- train.set$x
y <- train.set$y
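The returned components can be inspected directly, for example:

# x is the predictor matrix, y is the label vector
str(train.set)
dim(train.set$x)
table(train.set$y)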
Fit a multi-class Neyman-Pearson classifier with error controls via cost-sensitive learning. This function implements the two algorithms proposed in Tian, Y. & Feng, Y. (2021). The problem is to minimize a weighted combination of the per-class error rates P(hat(Y)(X) != k | Y = k) over some classes k, while controlling P(hat(Y)(X) != k | Y = k) at prescribed levels for the remaining classes. See Tian, Y. & Feng, Y. (2021) for more details.
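In symbols, a sketch of the formulation implied by the description above (see the paper for the precise statement), with \mathcal{A} denoting the set of classes whose entries in alpha are not NA:

\min_{\hat{Y}} \; \sum_{k=1}^{K} w_k \, P\big(\hat{Y}(X) \neq k \mid Y = k\big)
\quad \text{subject to} \quad P\big(\hat{Y}(X) \neq k \mid Y = k\big) \le \alpha_k \ \text{for all } k \in \mathcal{A}.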
npcs(
  x,
  y,
  algorithm = c("CX", "ER"),
  classifier,
  seed = 1,
  w,
  alpha,
  trControl = list(),
  tuneGrid = list(),
  split.ratio = 0.5,
  split.mode = c("by-class", "merged"),
  tol = 1e-06,
  refit = TRUE,
  protect = TRUE,
  opt.alg = c("Hooke-Jeeves", "Nelder-Mead")
)
x: the predictor matrix of training data, where each row and column represents an observation and predictor, respectively.
y: the response vector of training data. Must be integers from 1 to K for some K >= 2. Can be either a numerical or factor vector.
algorithm: the NPMC algorithm to use. String only. Can be either "CX" or "ER", which implements NPMC-CX or NPMC-ER in Tian, Y. & Feng, Y. (2021).
classifier: which model to use for estimating the posterior distribution P(Y|X = x). String only.
seed: random seed.
w: the weights in the objective function. Should be a vector of length K, where K is the number of classes.
alpha: the levels at which to control the error rate of each class. Should be a vector of length K, where K is the number of classes. Use NA if no error control is imposed for a specific class.
trControl: list; the resampling method.
tuneGrid: list; hyperparameter tuning grid or fixed hyperparameter settings.
split.ratio: the proportion of data used in searching lambda (the cost parameters). Should be between 0 and 1. Default = 0.5. Only useful when algorithm = "ER".
split.mode: how to split the data for NPMC-ER. String only. Can be either "by-class" or "merged". Default = "by-class". Only useful when algorithm = "ER".
tol: the convergence tolerance. Default = 1e-06. Used in the lambda-searching step; the optimization terminates when the step length of the main loop becomes smaller than tol.
refit: whether to refit the classifier using all data after finding lambda. Boolean value. Default = TRUE. Only useful when algorithm = "ER".
protect: whether to threshold near-zero lambdas. Boolean value. Default = TRUE. This guards against extreme cases in which some lambdas are set to zero due to limited computational accuracy; when protect = TRUE, such near-zero lambdas are thresholded to a small positive value.
opt.alg: the optimization method to use when searching lambdas. String only. Can be either "Hooke-Jeeves" or "Nelder-Mead". Default = "Hooke-Jeeves".
An object with S3 class "npcs", containing the following components.
lambda: the estimated lambda vector, which consists of Lagrangian multipliers. It is related to the cost. See Section 2 of Tian, Y. & Feng, Y. (2021) for details.
fit: the fitted classifier.
classifier: the classifier used for estimating the posterior distribution P(Y|X = x).
algorithm: the NPMC algorithm used.
alpha: the levels at which the error rate of each class is controlled.
w: the weights in the objective function.
pik: the estimated marginal probability for each class.
Tian, Y., & Feng, Y. (2021). Neyman-Pearson Multi-class Classification via Cost-sensitive Learning. Submitted. Available soon on arXiv.
See also: predict.npcs, error_rate, generate_data, gamma_smote.
# data generation: case 1 in Tian, Y., & Feng, Y. (2021) with n = 1000
set.seed(123, kind = "L'Ecuyer-CMRG")
train.set <- generate_data(n = 1000, model.no = 1)
x <- train.set$x
y <- train.set$y

test.set <- generate_data(n = 1000, model.no = 1)
x.test <- test.set$x
y.test <- test.set$y

# construct the multi-class NP problem: case 1 in Tian, Y., & Feng, Y. (2021)
alpha <- c(0.05, NA, 0.01)
w <- c(0, 1, 0)

# try NPMC-CX, NPMC-ER, and vanilla multinomial logistic regression
fit.vanilla <- nnet::multinom(y~., data = data.frame(x = x, y = factor(y)), trace = FALSE)
fit.npmc.CX <- try(npcs(x, y, algorithm = "CX", classifier = "multinom", w = w, alpha = alpha))
fit.npmc.ER <- try(npcs(x, y, algorithm = "ER", classifier = "multinom", w = w, alpha = alpha, refit = TRUE))

# test error of vanilla multinomial logistic regression
y.pred.vanilla <- predict(fit.vanilla, newdata = data.frame(x = x.test))
error_rate(y.pred.vanilla, y.test)

# test error of NPMC-CX
y.pred.CX <- predict(fit.npmc.CX, x.test)
error_rate(y.pred.CX, y.test)

# test error of NPMC-ER
y.pred.ER <- predict(fit.npmc.ER, x.test)
error_rate(y.pred.ER, y.test)
Predict new labels from new data based on the fitted NPMC classifier, which belongs to S3 class "npcs".
## S3 method for class 'npcs'
predict(object, newx, ...)
object: the fitted npcs model object used for prediction.
newx: the input feature data.
...: additional arguments to pass down.
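A minimal usage sketch, reusing the data-generation and fitting steps from the npcs examples above:

# fit an NPMC-CX classifier and predict labels for new observations
set.seed(123, kind = "L'Ecuyer-CMRG")
train.set <- generate_data(n = 1000, model.no = 1)
test.set  <- generate_data(n = 1000, model.no = 1)
alpha <- c(0.05, NA, 0.01)
w <- c(0, 1, 0)
fit.npmc.CX <- npcs(train.set$x, train.set$y, algorithm = "CX",
                    classifier = "multinom", w = w, alpha = alpha)
y.pred.CX <- predict(fit.npmc.CX, test.set$x)
error_rate(y.pred.CX, test.set$y)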
Print the cv.npcs object.
## S3 method for class 'cv.npcs'
print(x, ...)
x: the fitted cv.npcs object returned by cv.npcs.
...: additional arguments.
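A minimal usage sketch, continuing the cv.npcs example above:

# print the evaluation summary of a fitted cv.npcs object
cv.npcs.knn <- cv.npcs(x, y, classifier = "knn", w = w, alpha = alpha)
print(cv.npcs.knn)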