Title: | Identifying Interactions Between Binary Predictors |
---|---|
Description: | A statistical learning method that tries to find the best set of predictors and interactions between predictors for modeling binary or quantitative response data in a decision tree. Several search algorithms and ensembling techniques are implemented allowing for finetuning the method to the specific problem. Interactions with quantitative covariables can be properly taken into account by fitting local regression models. Moreover, a variable importance measure for assessing marginal and interaction effects is provided. Implements the procedures proposed by Lau et al. (2024, <doi:10.1007/s10994-023-06488-6>). |
Authors: | Michael Lau [aut, cre] |
Maintainer: | Michael Lau <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.5 |
Built: | 2024-10-24 03:43:02 UTC |
Source: | https://github.com/cran/logicDT |
This function can be used to compute the ideal number of boosting iterations
for the fitted logic.boosted
model using independent validation data.
bestBoostingIter(model, X, y, Z = NULL, consec.iter = 5, scoring_rule = "auc")
bestBoostingIter(model, X, y, Z = NULL, consec.iter = 5, scoring_rule = "auc")
model |
Fitted |
X |
Matrix or data frame of binary validation input data. This object should correspond to the binary matrix for fitting the model. |
y |
Validation response vector. 0-1 coding for binary outcomes. |
Z |
Optional quantitative covariables supplied as a matrix or data frame. Only used (and required) if the model was fitted using them. |
consec.iter |
Number of consecutive boosting iterations that do not increase the validation performance for determining the ideal number of iterations |
scoring_rule |
Scoring rule computing the validation performance.
This can either be |
If the model performance (on the validation data) cannot be increased for
consec.iter
consecutive boosting iterations, the last iteration
which increased the validation performance induces the ideal number of
boosting iterations.
The ideal number of boosting iterations
This function computes the area under the receiver operating characteristic curve.
calcAUC(preds, y, fast = TRUE, sorted = FALSE)
calcAUC(preds, y, fast = TRUE, sorted = FALSE)
preds |
Numeric vector of predicted scores |
y |
True binary outcomes coded as 0 or 1. Must be an integer vector. |
fast |
Shall the computation be as fast as possible? |
sorted |
Are the predicted scores already sorted increasingly? If so, this can slightly speed up the computation. |
The AUC between 0 and 1
Computation of the Brier score, i.e., the mean squared error for risk estimates in a binary classification problem.
calcBrier(preds, y)
calcBrier(preds, y)
preds |
Numeric vector of predictions |
y |
True outcomes |
The Brier score
Computation of the deviance, i.e., two times the negative log likelihood for risk estimates in a binary classification problem.
calcDev(preds, y)
calcDev(preds, y)
preds |
Numeric vector of predictions |
y |
True outcomes |
The deviance
Computation of the misclassification rate for risk estimates in a binary classification problem.
calcMis(preds, y, cutoff = 0.5)
calcMis(preds, y, cutoff = 0.5)
preds |
Numeric vector of predictions |
y |
True outcomes |
cutoff |
Classification cutoff. By default, scores above 50 otherwise. |
The misclassification rate
Computation of the mean squared error.
calcMSE(preds, y)
calcMSE(preds, y)
preds |
Numeric vector of predictions |
y |
True outcomes |
The MSE
This function computes the normalized cross entropy (NCE) which is given by
where (for )
are the true classes,
are the risk/probability predictions and
is total unrestricted
empirical risk estimate.
calcNCE(preds, y)
calcNCE(preds, y)
preds |
Numeric vector of risk estimates |
y |
Vector of true binary outcomes |
Smaller values towards zero are generally prefered. A NCE of one or above would indicate that the used model yields comparable or worse predictions than the naive mean model.
The normalized cross entropy
He, X., Pan, J., Jin, O., Xu, T., Liu, B., Xu, T., Shi, Y., Atallah, A., Herbrich, R., Bowers, S., Candela, J. Q. (2014). Practical Lessons from Predicting Clicks on Ads at Facebook. Proceedings of the Eighth International Workshop on Data Mining for Online Advertising 1-9. doi:10.1145/2648584.2648589
Computation of the normalized root mean squared error.
calcNRMSE(preds, y, type = "sd")
calcNRMSE(preds, y, type = "sd")
preds |
Numeric vector of predictions |
y |
True outcomes |
type |
|
The NRMSE
This function should be used to configure a search with simulated annealing.
cooling.schedule( type = "adaptive", start_temp = 1, end_temp = -1, lambda = 0.01, total_iter = 2e+05, markov_iter = 1000, markov_leave_frac = 1, acc_type = "probabilistic", frozen_def = "acc", frozen_acc_frac = 0.01, frozen_markov_count = 5, frozen_markov_mode = "total", start_temp_steps = 10000, start_acc_ratio = 0.95, auto_start_temp = TRUE, remember_models = TRUE, print_iter = 1000 )
cooling.schedule( type = "adaptive", start_temp = 1, end_temp = -1, lambda = 0.01, total_iter = 2e+05, markov_iter = 1000, markov_leave_frac = 1, acc_type = "probabilistic", frozen_def = "acc", frozen_acc_frac = 0.01, frozen_markov_count = 5, frozen_markov_mode = "total", start_temp_steps = 10000, start_acc_ratio = 0.95, auto_start_temp = TRUE, remember_models = TRUE, print_iter = 1000 )
type |
Type of cooling schedule. |
start_temp |
Start temperature on a log10 scale.
Only used if |
end_temp |
End temperature on a log10 scale.
Only used if |
lambda |
Cooling parameter for the adaptive schedule. Values between 0.01 and 0.1 are recommended such that in total, several hundred thousand iterations are performed. Lower values lead to a more fine search with more iterations while higher values lead to a more coarse search with less total iterations. |
total_iter |
Total number of iterations that should be performed. Only used for the geometric cooling schedule. |
markov_iter |
Number of iterations for each Markov chain. The standard value does not need to be tuned, since the temperature steps and number of iterations per chain act complementary to each other, i.e., less iterations can be compensated by smaller temperature steps. |
markov_leave_frac |
Fraction of accepted moves
leading to an early temperature reduction. This is
primarily used at (too) high temperatures lowering
the temperature if essentially a random walk is
performed. E.g., a value of 0.5 together with
|
acc_type |
Type of acceptance function. The
standard |
frozen_def |
How to define a frozen chain.
|
frozen_acc_frac |
If |
frozen_markov_count |
Number of frozen chains that need to be observed for finishing the search. |
frozen_markov_mode |
Do the frozen chains
have to occur consecutively ( |
start_temp_steps |
Number of iterations that should be used for
estimating the ideal start temperature if |
start_acc_ratio |
Acceptance ratio that should be achieved with the automatically configured start temperature. |
auto_start_temp |
Should the start
temperature be configured automatically?
|
remember_models |
Should already evaluated models be saved in a 2-dimensional hash table to prevent fitting the same trees multiple times? |
print_iter |
Number of iterations after which a progress report shall be printed. |
type = "adapative"
(default)
automatically choses the temperature steps by using the
standard deviation of the scores in a Markov chain
together with the current temperature to
evaluate if equilibrium is achieved. If the standard
deviation is small or the temperature is high,
equilibrium can be assumed leading
to a strong temperature reduction. Otherwise, the
temperature is only merely lowered.
The parameter lambda
is essential to control
how fast the schedule will be executed and, thus,
how many total iterations will be performed.
type = "geometric"
is the conventional
approach which requires more finetuning. Here,
temperatures are uniformly lowered on a log10 scale.
Thus, a start and an end temperature have to be
supplied.
An object of class cooling.schedule
which is a list of all necessary cooling parameters.
Using a fitted logicDT
model, its logic decision tree can be
optimally (post-)pruned utilizing k-fold cross-validation.
cv.prune( model, nfolds = 10, scoring_rule = "deviance", choose = "1se", simplify = TRUE )
cv.prune( model, nfolds = 10, scoring_rule = "deviance", choose = "1se", simplify = TRUE )
model |
A fitted |
nfolds |
Number of cross-validation folds |
scoring_rule |
The scoring rule for evaluating the cross-validation
error and its standard error. For classification tasks, |
choose |
Model selection scheme. If the model that minimizes the
cross-validation error should be chosen, |
simplify |
Should the pruned model be simplified with regard to the input terms, i.e., should terms that are no longer in the tree contained be removed from the model? |
Similar to Breiman et al. (1984), we implement post-pruning by first computing the optimal pruning path and then using cross-validation for identifying the best generalizing model.
In order to handle continuous covariables with fitted regression models in
each leaf, similar to the likelihood-ratio splitting criterion in
logicDT
, we propose using the log-likelihood as the impurity
criterion in this case for computing the pruning path.
In particular, for each node , the weighted node impurity
has to be calculated and the inequality
has to be fulfilled for each possible split splitting
into
two subnodes
and
. Here,
describes the
impurity of a node
,
the proportion of data points falling
into
, and
the proportion of data points falling
from
into
.
Since the regression models are fitted using maximum likelihood, the
maximum likelihood criterion fulfills this property and can also be seen as
an extension of the entropy impurity criterion in the case of classification
or an extension of the MSE impurity criterion in the case of regression.
The default model selection is done by choosing the most parsimonious model
that yields a cross-validation error in the range of
for the minimal cross-validation error
and its
corresponding standard error
.
For a more robust standard error estimation, the scores are calculated per
training observation such that the AUC is no longer an appropriate choice
and the deviance or the Brier score should be used in the case of
classification.
A list containing
model |
The new |
cv.res |
A data frame containing the penalties, the cross-validation scores and the corresponding standard errors |
best.beta |
The ideal penalty value |
Breiman, L., Friedman, J., Stone, C. J. & Olshen, R. A. (1984). Classification and Regression Trees. CRC Press. doi:10.1201/9781315139470
Method for fitting four parameter logistic models. In the fashion of this package, only binary and quantitative outcomes are supported.
fit4plModel(y, Z)
fit4plModel(y, Z)
y |
Response vector. 0-1 coding for binary outcomes, otherwise conventional regression is performed. |
Z |
Numeric vector of (univariate) input samples. |
4pL models are non-linear regression models of the shape
with being a random error term.
An object of class "4pl"
which contains a numeric
vector of the fitted parameters b, c, d, and e.
This function uses a fitted logic.boosted
model for fitting
a linear or logistic (depending on the type of outcome) regression model.
fitLinearBoostingModel(model, n.iter, type = "standard", s = NULL, ...)
fitLinearBoostingModel(model, n.iter, type = "standard", s = NULL, ...)
model |
Fitted |
n.iter |
Number of boosting iterations to be used |
type |
Type of linear model to be fitted. Either |
s |
Regularization parameter. Only used if |
... |
Additional parameters passed to |
In this procedure, the logic terms are extracted from the individual
logicDT
models and the set of unique terms are used as predictors
in a regression model. For incorporating a continuous covariable
the covariable itself as well as products of the covariable with the
extracted logic terms are included as predictors in the regression model.
For more details on the possible types of linear models, see
fitLinearLogicModel
.
A linear.logic
model. This is a list containing
the logic terms used as predictors in the model and the fitted glm
model.
This function fits a linear or logistic regression model (based on the
type of outcome) using the supplied logic terms, e.g., $disj
from
a fitted logicDT
model.
fitLinearLogicModel( X, y, Z = NULL, disj, Z.interactions = TRUE, type = "standard", s = NULL, ... )
fitLinearLogicModel( X, y, Z = NULL, disj, Z.interactions = TRUE, type = "standard", s = NULL, ... )
X |
Matrix or data frame of binary input data. This object should correspond to the binary matrix for fitting the model. |
y |
Response vector. 0-1 coding for binary outcomes. |
Z |
Optional quantitative covariables supplied as a matrix or data frame. Only used (and required) if the model was fitted using them. |
disj |
Integer matrix of logic terms. As in |
Z.interactions |
Shall interactions with the continuous covariable
|
type |
Type of linear model to be fitted. Either |
s |
Regularization parameter. Only used if |
... |
Additional parameters passed to |
For creating sparse final models, the LASSO can be used for shrinking
unnecessary term coefficients down to zero (type = "lasso"
).
If the complexity penalty s
shall be automatically tuned,
cross-validation can be employed (type = "cv.lasso"
).
However, since other hyperparameters also have to be tuned when fitting
a linear boosting model such as the complexity penalty for restricting
the number of variables in the terms, manually tuning the LASSO penalty
together with the other hyperparameters is recommended.
For every hyperparameter setting of the boosting itself, the best
corresponding LASSO penalty s
can be identified by, e.g., choosing
the s
that minimizes the validation data error.
Thus, this hyperparameter does not have to be explicitly tuned via a grid
search but is induced by the setting of the other hyperparameters.
For finding the ideal value of s
using independent validation data,
the function get.ideal.penalty
can be used.
A linear.logic
model. This is a list containing
the logic terms used as predictors in the model and the fitted glm
model.
Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267–288. doi:10.1111/j.2517-6161.1996.tb02080.x
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of statistical software, 33(1), 1–22. doi:10.18637/jss.v033.i01
Method for fitting linear models. In the fashion of this package, only binary and quantitative outcomes are supported.
fitLinearModel(y, Z, logistic = TRUE)
fitLinearModel(y, Z, logistic = TRUE)
y |
Response vector. 0-1 coding for binary outcomes, otherwise conventional regression is performed. |
Z |
Numeric vector of (univariate) input samples. |
logistic |
Logical indicating whether, in the case of a binary
outcome, a logistic regression model should be fitted
( |
For binary outcomes, predictions are cut at 0 or 1 for generating proper probability estimates.
An object of class "linear"
which contains a numeric
vector of the fitted parameters b and c.
This function takes a fitted linear.logic
model and independent
validation data as input for finding the ideal LASSO complexity penalty
s
.
get.ideal.penalty( model, X, y, Z = NULL, scoring_rule = "deviance", choose = "min" )
get.ideal.penalty( model, X, y, Z = NULL, scoring_rule = "deviance", choose = "min" )
model |
A fitted |
X |
Matrix or data frame of binary input data. This object should correspond to the binary matrix for fitting the model. |
y |
Response vector. 0-1 coding for binary outcomes. |
Z |
Optional quantitative covariables supplied as a matrix or data frame. Only used (and required) if the model was fitted using them. |
scoring_rule |
The scoring rule for evaluating the validation
error and its standard error. For classification tasks, |
choose |
Model selection scheme. If the model that minimizes the
validation error should be chosen, |
A list containing
val.res |
A data frame containing the penalties, the validation scores and the corresponding standard errors |
best.s |
The ideal penalty value |
Transform the original predictor matrix X into the conjunction design matrix which contains for each conjunction a corresponding column.
getDesignMatrix(X, disj)
getDesignMatrix(X, disj)
X |
The original (binary) predictor matrix. This has to be of type
|
disj |
The conjunction matrix which can, e.g., be extracted from a
fitted |
The transformed design matrix.
Using a fitted logicDT
model, a general GxE interaction
test can be performed.
gxe.test(model, X, y, Z, perm.test = TRUE, n.perm = 10000)
gxe.test(model, X, y, Z, perm.test = TRUE, n.perm = 10000)
model |
A fitted |
X |
Binary predictor data for testing the interaction effect. This can be equal to the training data. |
y |
Response vector for testing the interaction effect. This can be equal to the training data. |
Z |
Quantitative covariable for testing the interaction effect. This can be equal to the training data. |
perm.test |
Should additionally permutation testing be performed? Useful if likelihood ratio test asymptotics cannot be justified. |
n.perm |
Number of random permutations for permutation testing |
The testing is done by fitting one shared 4pL model
for all tree branches with different offsets, i.e., allowing main effects
of SNPs. This shared model is compared to the individual 4pL models fitted
in the logicDT
procedure using a likelihood ratio test which
is asymptotically distributed. The degrees of freedom are
equal to the difference in model parameters.
For regression tasks, alternatively, a F-test can be utilized.
The shared 4pL model is given by
with being a grouping variable,
being the offsets for the different
groups, and
being a random error term.
Note that the last group
does not have an offset parameter, since
the model is calibrated such that the curve without any
's
fits to the last group.
The likelihood ratio test statistic is given by
for the log likelihoods of the shared and full 4pL models, respectively. In the regression case, the test statistic can be calculated as
with being the residual sum of squares for the
respective model.
For regression tasks, the alternative F test statistic is given by
with
and being the number of identified predictor
scenarios/groups by
logicDT
.
Alternatively, if linear models were fitted in the supplied logicDT
model, shared linear models can be used to test for a GxE interaction.
For continuous outcomes, the shared linear model is given by
For binary outcomes, LDA (linear discriminant analysis) models are fitted. In contrast to the 4pL-based test for binary outcomes, varying offsets for the individual groups are injected to the linear predictor instead of to the probability (response) scale.
If only few samples are available and the asymptotics of likelihood ratio
tests cannot be justified, alternatively, a permutation test approach
can be employed by setting perm.test = TRUE
and specifying an
appropriate number of random permutations via n.perm
.
For this approach, computed likelihoods of the shared and (paired) full
likelihood groups are randomly interchanged approximating the null
distribution of equal likelihoods. A p-value can be computed by determining
the fraction of more extreme null samples compared to the original
likelihood ratio test statistic, i.e., using the fraction of higher
likelihood ratios in the null distribution than the original likelihood
ratio.
A list containing
p.chisq |
The p-value of the chi-squared test statistic. |
p.f |
The p-value of the F test statistic. |
p.perm |
The p-value of the optional permutation test. |
ll.shared |
Log likelihood of the shared parameters 4pL model. |
ll.full |
Log likelihood of the full |
rss.shared |
Residual sum of squares of the shared parameters 4pL model. |
rss.full |
Residual sum of squares of the full |
This function takes a fitted linear.logic
model and independent test
data as input for testing if there is a general GxE interaction.
This hypothesis test is based on a likelihood-ratio test.
gxe.test.boosting(model, X, y, Z)
gxe.test.boosting(model, X, y, Z)
model |
A fitted |
X |
Matrix or data frame of binary input data. This object should correspond to the binary matrix for fitting the model. |
y |
Response vector. 0-1 coding for binary outcomes. |
Z |
Quantitative covariable supplied as a matrix or data frame |
In detail, the null hypothesis
using the supplied linear model
is tested.
A list containing
Deviance |
The deviance used for performing the likelihood-ratio test |
p.value |
The p-value of the test |
This function takes a fitted linear.logic
model and independent test
data as input for testing if the included terms are influential with respect
to the outcome.
This hypothesis test is based on a likelihood-ratio test.
importance.test.boosting(model, X, y, Z, Z.interactions = TRUE)
importance.test.boosting(model, X, y, Z, Z.interactions = TRUE)
model |
A fitted |
X |
Matrix or data frame of binary input data. This object should correspond to the binary matrix for fitting the model. |
y |
Response vector. 0-1 coding for binary outcomes. |
Z |
Optional quantitative covariables supplied as a matrix or data frame. Only used (and required) if the model was fitted using them. |
Z.interactions |
A Boolean value determining whether interactions with
quantitative covaraible |
In detail, the null hypotheses
using the linear model
are tested for each
if
Z.interactions
is set to TRUE
.
Otherwise, the null hypotheses
using the linear model
are tested.
A data frame consisting of three columns,
var |
The tested term, |
vim |
The associated variable importance, and |
p.value |
The corresponding p-value for testing if the term is influential. |
Main function for fitting logicDT models.
## Default S3 method: logicDT( X, y, max_vars = 3, max_conj = 3, Z = NULL, search_algo = "sa", cooling_schedule = cooling.schedule(), scoring_rule = "auc", tree_control = tree.control(), gamma = 0, simplify = "vars", val_method = "none", val_frac = 0.5, val_reps = 10, allow_conj_removal = TRUE, conjsize = 1, randomize_greedy = FALSE, greedy_mod = TRUE, greedy_rem = FALSE, max_gen = 10000, gp_sigma = 0.15, gp_fs_interval = 1, ... ) ## S3 method for class 'formula' logicDT(formula, data, ...)
## Default S3 method: logicDT( X, y, max_vars = 3, max_conj = 3, Z = NULL, search_algo = "sa", cooling_schedule = cooling.schedule(), scoring_rule = "auc", tree_control = tree.control(), gamma = 0, simplify = "vars", val_method = "none", val_frac = 0.5, val_reps = 10, allow_conj_removal = TRUE, conjsize = 1, randomize_greedy = FALSE, greedy_mod = TRUE, greedy_rem = FALSE, max_gen = 10000, gp_sigma = 0.15, gp_fs_interval = 1, ... ) ## S3 method for class 'formula' logicDT(formula, data, ...)
X |
Matrix or data frame of binary predictors coded as 0 or 1. |
y |
Response vector. 0-1 coding for binary responses. Otherwise, a regression task is assumed. |
max_vars |
Maximum number of predictors in the set of predictors.
For the set |
max_conj |
Maximum number of conjunctions/input variables for the
decision trees. For the set |
Z |
Optional matrix or data frame of quantitative/continuous covariables. Multiple covariables allowed for splitting the trees. If leaf regression models (such as four parameter logistic models) shall be fitted, only the first given covariable is used. |
search_algo |
Search algorithm for guiding the global search.
This can either be |
cooling_schedule |
Cooling schedule parameters if simulated
annealing is used. The required object should be created via
the function |
scoring_rule |
Scoring rule for guiding the global search.
This can either be |
tree_control |
Parameters controlling the fitting of
decision trees. This should be configured via the
function |
gamma |
Complexity penalty added to the score.
If |
simplify |
Should the final fitted model be simplified?
This means, that unnecessary terms as a whole ( |
val_method |
Inner validation method. |
val_frac |
Only used if |
val_reps |
Number of inner validation partitionings. |
allow_conj_removal |
Should it be allowed to remove
complete terms/conjunctions in the search?
If a model with the specified exact number of terms is desired,
this should be set to |
conjsize |
The minimum of training samples that have to belong to a conjunction. This parameters prevents including unnecessarily complex conjunctions that rarely occur. |
randomize_greedy |
Should the greedy search be randomized
by only considering |
greedy_mod |
Should modifications of conjunctions be
considered in a greedy search?
|
greedy_rem |
Should the removal of conjunctions be
considered in a greedy search?
|
max_gen |
Maximum number of generations for genetic programming. |
gp_sigma |
Parameter |
gp_fs_interval |
Interval for fitness sharing in
genetic programming. The fitness calculation can be
computationally expensive if many models exist in one
generation. |
... |
Arguments passed to |
formula |
An object of type |
data |
A data frame containing the data for the corresponding
|
logicDT is a method for finding response-associated interactions between binary predictors. A global search for the best set of predictors and interactions between predictors is performed trying to find the global optimal decision trees. On the one hand, this can be seen as a variable selection. On the other hand, Boolean conjunctions between binary predictors can be identified as impactful which is particularly useful if the corresponding marginal effects are negligible due to the greedy fashion of choosing splits in decision trees.
Three search algorithms are implemented:
Simulated annealing. An exhaustive stochastic optimization procedure. Recommended for single models (without [outer] bagging or boosting).
Greedy search. A very fast search always looking for the best possible improvement. Recommended for ensemble models.
Genetic programming. A more or less intensive search holding several competetive models at each generation. Niche method which is only recommended if multiple (simple) models do explain the variation in the response.
Furthermore, the option of a so-called "inner validation" is available. Here, the search is guided using several train-validation-splits and the average of the validation performance. This approach is computationally expensive but can lead to more robust single models.
For minimizing the computation time, two-dimensional hash tables are used saving evaluated models. This is irrelevant for the greedy search but can heavily improve the fitting times when employing a search with simulated annealing or genetic programming, especially when choosing an inner validation.
An object of class logicDT
. This is a list
containing
disj |
A matrix of the identified set of predictors
and conjunctions of predictors. Each row corresponds to one term.
Each entry corresponds to the column index in |
real_disj |
Human readable form of |
score |
Score of the best model. Smaller values are prefered. |
pet |
Decision tree fitted on the best set of
input terms. This is a list containing the pointer to the
|
ensemble |
List of decision trees. Only relevant if inner validation was used. |
total_iter |
The total number of search iterations, i.e., tested configurations by fitting a tree (ensemble) and evaluating it. |
prevented_evals |
The number of prevented tree fittings by using the two-dimensional hash table. |
... |
Supplied parameters of the functional call
to |
logicDT models can be saved and loaded using save(...)
and
load(...)
. The internal C
structures will not be saved
but rebuilt from the R
representations if necessary.
Lau, M., Schikowski, T. & Schwender, H. (2024). logicDT: A procedure for identifying response-associated interactions between binary predictors. Machine Learning 113(2):933–992. doi:10.1007/s10994-023-06488-6
Breiman, L., Friedman, J., Stone, C. J. & Olshen, R. A. (1984). Classification and Regression Trees. CRC Press. doi:10.1201/9781315139470
Kirkpatrick, S., Gelatt C. D. & Vecchi M. P. (1983). Optimization by Simulated Annealing. Science 220(4598):671–680. doi:10.1126/science.220.4598.671
# Generate toy data set.seed(123) maf <- 0.25 n.snps <- 50 N <- 2000 X <- matrix(sample(0:2, n.snps * N, replace = TRUE, prob = c((1-maf)^2, 1-(1-maf)^2-maf^2, maf^2)), ncol = n.snps) colnames(X) <- paste("SNP", 1:n.snps, sep="") X <- splitSNPs(X) Z <- matrix(rnorm(N, 20, 10), ncol = 1) colnames(Z) <- "E" Z[Z < 0] <- 0 y <- -0.75 + log(2) * (X[,"SNP1D"] != 0) + log(4) * Z/20 * (X[,"SNP2D"] != 0 & X[,"SNP3D"] == 0) + rnorm(N, 0, 1) # Fit and evaluate single logicDT model model <- logicDT(X[1:(N/2),], y[1:(N/2)], Z = Z[1:(N/2),,drop=FALSE], max_vars = 3, max_conj = 2, search_algo = "sa", tree_control = tree.control( nodesize = floor(0.05 * nrow(X)/2) ), simplify = "vars", allow_conj_removal = FALSE, conjsize = floor(0.05 * nrow(X)/2)) calcNRMSE(predict(model, X[(N/2+1):N,], Z = Z[(N/2+1):N,,drop=FALSE]), y[(N/2+1):N]) plot(model) print(model) # Fit and evaluate bagged logicDT model model.bagged <- logicDT.bagging(X[1:(N/2),], y[1:(N/2)], Z = Z[1:(N/2),,drop=FALSE], bagging.iter = 50, max_vars = 3, max_conj = 3, search_algo = "greedy", tree_control = tree.control( nodesize = floor(0.05 * nrow(X)/2) ), simplify = "vars", conjsize = floor(0.05 * nrow(X)/2)) calcNRMSE(predict(model.bagged, X[(N/2+1):N,], Z = Z[(N/2+1):N,,drop=FALSE]), y[(N/2+1):N]) print(model.bagged) # Fit and evaluate boosted logicDT model model.boosted <- logicDT.boosting(X[1:(N/2),], y[1:(N/2)], Z = Z[1:(N/2),,drop=FALSE], boosting.iter = 50, learning.rate = 0.01, subsample.frac = 0.75, replace = FALSE, max_vars = 3, max_conj = 3, search_algo = "greedy", tree_control = tree.control( nodesize = floor(0.05 * nrow(X)/2) ), simplify = "vars", conjsize = floor(0.05 * nrow(X)/2)) calcNRMSE(predict(model.boosted, X[(N/2+1):N,], Z = Z[(N/2+1):N,,drop=FALSE]), y[(N/2+1):N]) print(model.boosted) # Calculate VIMs (variable importance measures) vims <- vim(model.bagged) plot(vims) print(vims) # Single greedy model model <- logicDT(X[1:(N/2),], y[1:(N/2)], Z = Z[1:(N/2),,drop=FALSE], max_vars = 3, max_conj = 2, search_algo = "greedy", tree_control = tree.control( nodesize = floor(0.05 * nrow(X)/2) ), simplify = "vars", allow_conj_removal = FALSE, conjsize = floor(0.05 * nrow(X)/2)) calcNRMSE(predict(model, X[(N/2+1):N,], Z = Z[(N/2+1):N,,drop=FALSE]), y[(N/2+1):N]) plot(model) print(model)
# Generate toy data set.seed(123) maf <- 0.25 n.snps <- 50 N <- 2000 X <- matrix(sample(0:2, n.snps * N, replace = TRUE, prob = c((1-maf)^2, 1-(1-maf)^2-maf^2, maf^2)), ncol = n.snps) colnames(X) <- paste("SNP", 1:n.snps, sep="") X <- splitSNPs(X) Z <- matrix(rnorm(N, 20, 10), ncol = 1) colnames(Z) <- "E" Z[Z < 0] <- 0 y <- -0.75 + log(2) * (X[,"SNP1D"] != 0) + log(4) * Z/20 * (X[,"SNP2D"] != 0 & X[,"SNP3D"] == 0) + rnorm(N, 0, 1) # Fit and evaluate single logicDT model model <- logicDT(X[1:(N/2),], y[1:(N/2)], Z = Z[1:(N/2),,drop=FALSE], max_vars = 3, max_conj = 2, search_algo = "sa", tree_control = tree.control( nodesize = floor(0.05 * nrow(X)/2) ), simplify = "vars", allow_conj_removal = FALSE, conjsize = floor(0.05 * nrow(X)/2)) calcNRMSE(predict(model, X[(N/2+1):N,], Z = Z[(N/2+1):N,,drop=FALSE]), y[(N/2+1):N]) plot(model) print(model) # Fit and evaluate bagged logicDT model model.bagged <- logicDT.bagging(X[1:(N/2),], y[1:(N/2)], Z = Z[1:(N/2),,drop=FALSE], bagging.iter = 50, max_vars = 3, max_conj = 3, search_algo = "greedy", tree_control = tree.control( nodesize = floor(0.05 * nrow(X)/2) ), simplify = "vars", conjsize = floor(0.05 * nrow(X)/2)) calcNRMSE(predict(model.bagged, X[(N/2+1):N,], Z = Z[(N/2+1):N,,drop=FALSE]), y[(N/2+1):N]) print(model.bagged) # Fit and evaluate boosted logicDT model model.boosted <- logicDT.boosting(X[1:(N/2),], y[1:(N/2)], Z = Z[1:(N/2),,drop=FALSE], boosting.iter = 50, learning.rate = 0.01, subsample.frac = 0.75, replace = FALSE, max_vars = 3, max_conj = 3, search_algo = "greedy", tree_control = tree.control( nodesize = floor(0.05 * nrow(X)/2) ), simplify = "vars", conjsize = floor(0.05 * nrow(X)/2)) calcNRMSE(predict(model.boosted, X[(N/2+1):N,], Z = Z[(N/2+1):N,,drop=FALSE]), y[(N/2+1):N]) print(model.boosted) # Calculate VIMs (variable importance measures) vims <- vim(model.bagged) plot(vims) print(vims) # Single greedy model model <- logicDT(X[1:(N/2),], y[1:(N/2)], Z = Z[1:(N/2),,drop=FALSE], max_vars = 3, max_conj = 2, search_algo = "greedy", tree_control = tree.control( nodesize = floor(0.05 * nrow(X)/2) ), simplify = "vars", allow_conj_removal = FALSE, conjsize = floor(0.05 * nrow(X)/2)) calcNRMSE(predict(model, X[(N/2+1):N,], Z = Z[(N/2+1):N,,drop=FALSE]), y[(N/2+1):N]) plot(model) print(model)
Function for fitting bagged logicDT models.
## Default S3 method: logicDT.bagging(X, y, Z = NULL, bagging.iter = 500, ...) ## S3 method for class 'formula' logicDT.bagging(formula, data, ...)
## Default S3 method: logicDT.bagging(X, y, Z = NULL, bagging.iter = 500, ...) ## S3 method for class 'formula' logicDT.bagging(formula, data, ...)
X |
Matrix or data frame of binary predictors coded as 0 or 1. |
y |
Response vector. 0-1 coding for binary responses. Otherwise, a regression task is assumed. |
Z |
Optional matrix or data frame of quantitative/continuous covariables. Multiple covariables allowed for splitting the trees. If leaf regression models (such as four parameter logistic models) shall be fitted, only the first given covariable is used. |
bagging.iter |
Number of bagging iterations |
... |
Arguments passed to |
formula |
An object of type |
data |
A data frame containing the data for the corresponding
|
Details on single logicDT models can be found in logicDT
.
An object of class logic.bagged
. This is a list
containing
models |
A list of fitted |
bags |
A list of observation indices which were used to train each model |
... |
Supplied parameters of the functional call
to |
Function for fitting gradient boosted logicDT models.
## Default S3 method: logicDT.boosting( X, y, Z = NULL, boosting.iter = 500, learning.rate = 0.01, subsample.frac = 1, replace = TRUE, line.search = "min", ... ) ## S3 method for class 'formula' logicDT.boosting(formula, data, ...)
## Default S3 method: logicDT.boosting( X, y, Z = NULL, boosting.iter = 500, learning.rate = 0.01, subsample.frac = 1, replace = TRUE, line.search = "min", ... ) ## S3 method for class 'formula' logicDT.boosting(formula, data, ...)
X |
Matrix or data frame of binary predictors coded as 0 or 1. |
y |
Response vector. 0-1 coding for binary responses. Otherwise, a regression task is assumed. |
Z |
Optional matrix or data frame of quantitative/continuous covariables. Multiple covariables allowed for splitting the trees. If leaf regression models (such as four parameter logistic models) shall be fitted, only the first given covariable is used. |
boosting.iter |
Number of boosting iterations |
learning.rate |
Learning rate for boosted models. Values between 0.001 and 0.1 are recommended. |
subsample.frac |
Subsample fraction for each boosting iteration. E.g., 0.5 means that are random draw of 50 is used in each iteration. |
replace |
Should the random draws with subsample.frac
in boosted models be performed with or without
replacement? |
line.search |
Type of line search for gradient boosting.
|
... |
Arguments passed to |
formula |
An object of type |
data |
A data frame containing the data for the corresponding
|
Details on single logicDT models can be found in logicDT
.
An object of class logic.boosted
. This is a list
containing
models |
A list of fitted |
rho |
A vector of boosting coefficient corresponding to each model |
initialModel |
Initial model which is usually the observed mean |
... |
Supplied parameters of the functional call
to |
Lau, M., Schikowski, T. & Schwender, H. (2024). logicDT: A procedure for identifying response-associated interactions between binary predictors. Machine Learning 113(2):933–992. doi:10.1007/s10994-023-06488-6
Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, 29(5), 1189–1232. doi:10.1214/aos/1013203451
Alternative prediction function for logic.boosted
models
using up to n.iter
boosting iterations.
An array of predictions for every number of boosting iterations
up to n.iter
is returned.
partial.predict(model, X, Z = NULL, n.iter = 1, ...)
partial.predict(model, X, Z = NULL, n.iter = 1, ...)
model |
Fitted |
X |
Matrix or data frame of binary input data. This object should correspond to the binary matrix for fitting the model. |
Z |
Optional quantitative covariables supplied as a matrix or data frame. Only used (and required) if the model was fitted using them. |
n.iter |
Maximum number of boosting iterations for prediction |
... |
Parameters supplied to |
The main purpose of this function is to retrieve the optimal number of boosting iterations (early stopping) using a validation data set and to restrict future predictions on this number of iterations.
An array of dimension (N, n.iter)
containing the partial
predictions
This function plots a logicDT model on the active graphics device.
fancy.plot(x, cdot = FALSE, ...) ## S3 method for class 'logicDT' plot( x, fancy = TRUE, x_scaler = 0.5, margin_scaler = 0.2, cex = 1, cdot = FALSE, ... )
fancy.plot(x, cdot = FALSE, ...) ## S3 method for class 'logicDT' plot( x, fancy = TRUE, x_scaler = 0.5, margin_scaler = 0.2, cex = 1, cdot = FALSE, ... )
x |
An object of the class |
cdot |
Should a centered dot be used instead of a logical and for depicting interactions? |
... |
Arguments passed to fancy plotting function |
fancy |
Should the fancy mode be used for plotting? Default is
|
x_scaler |
Scaling factor on the horizontal axis for deeper trees,
i.e., |
margin_scaler |
Margin factor. Smaller values lead to smaller margins. |
cex |
Scaling factor for the plotted text elements. |
There are two plotting modes:
fancy = FALSE
which draws a tree with direct edges between
the nodes. Leaves are represented by their prediction value which is
obtained by the (observed) conditional mean.
fancy = TRUE
plots a tree similar to those in the rpart
(Therneau and Atkinson, 2019) and splinetree
(Neufeld and Heggeseth,
2019) R
packages. The trees are drawn in an angular manner and
if leaf regression models were fitted, appropriate plots of the fitted
curves are depicted in the leaves. Otherwise, the usual prediction values
are shown.
No return value, called for side effects
Therneau, T. & Atkinson, B. (2019). rpart: Recursive Partitioning and Regression Trees. https://CRAN.R-project.org/package=rpart
Neufeld, A. & Heggeseth, B. (2019). splinetree: Longitudinal Regression Trees and Forests. https://CRAN.R-project.org/package=splinetree
This function plots variable importance measures yielded by the function
vim
in a dotchart.
## S3 method for class 'vim' plot(x, p = 10, ...)
## S3 method for class 'vim' plot(x, p = 10, ...)
x |
An object of the class |
p |
The number of most important terms which will be included in the plot. A value of 0 leads to plotting all terms. |
... |
Ignored additional parameters |
No return value, called for side effects
Use new input data and a fitted four parameter logistic model to predict corresponding outcomes.
## S3 method for class ''4pl'' predict(object, Z, ...)
## S3 method for class ''4pl'' predict(object, Z, ...)
object |
Fitted |
Z |
Numeric vector of new input samples |
... |
Ignored additional parameters |
A numeric vector of predictions. For binary outcomes,
this is a vector with estimates for
.
Use new input data and a fitted linear model to predict corresponding outcomes.
## S3 method for class 'linear' predict(object, Z, ...)
## S3 method for class 'linear' predict(object, Z, ...)
object |
Fitted |
Z |
Numeric vector of new input samples |
... |
Ignored additional parameters |
For binary outcomes, predictions are cut at 0 or 1 for generating proper probability estimates.
A numeric vector of predictions. For binary outcomes,
this is a vector with estimates for
.
linear.logic
modelsUse new input data and a fitted linear.logic
model to
predict corresponding outcomes.
## S3 method for class 'linear.logic' predict(object, X, Z = NULL, s = NULL, ...)
## S3 method for class 'linear.logic' predict(object, X, Z = NULL, s = NULL, ...)
object |
Fitted |
X |
Matrix or data frame of binary input data. This object should correspond to the binary matrix for fitting the model. |
Z |
Optional quantitative covariables supplied as a matrix or data frame. Only used (and required) if the model was fitted using them. |
s |
Regularization parameter. Only used if |
... |
Ignored additional parameters |
A numeric vector of predictions. For binary outcomes,
this is a vector with estimates for
.
Supply new input data for predicting the outcome with a fitted logicDT model.
## S3 method for class 'logic.bagged' predict(object, X, Z = NULL, type = "prob", ...) ## S3 method for class 'logic.boosted' predict(object, X, Z = NULL, type = "prob", ...) ## S3 method for class 'logicDT' predict( object, X, Z = NULL, type = "prob", ensemble = FALSE, leaves = "4pl", ... ) ## S3 method for class 'genetic.logicDT' predict( object, X, Z = NULL, models = "best", n_models = 10, ensemble = NULL, leaves = "4pl", ... )
## S3 method for class 'logic.bagged' predict(object, X, Z = NULL, type = "prob", ...) ## S3 method for class 'logic.boosted' predict(object, X, Z = NULL, type = "prob", ...) ## S3 method for class 'logicDT' predict( object, X, Z = NULL, type = "prob", ensemble = FALSE, leaves = "4pl", ... ) ## S3 method for class 'genetic.logicDT' predict( object, X, Z = NULL, models = "best", n_models = 10, ensemble = NULL, leaves = "4pl", ... )
object |
Fitted |
X |
Matrix or data frame of binary input data. This object should correspond to the binary matrix for fitting the model. |
Z |
Optional quantitative covariables supplied as a matrix or data frame. Only used (and required) if the model was fitted using them. |
type |
Prediction type. This can either be |
... |
Parameters supplied to |
ensemble |
If the model was fitted using the inner
validation approach, shall the prediction be constructed
using the final validated ensemble ( |
leaves |
If leaf regression models (such as four parameter logistic
models) were fitted, shall these models be used for the prediction
( |
models |
Which logicDT models fitted via
genetic programming shall be used for prediction?
|
n_models |
How many models shall be used if
|
A numeric vector of predictions. For binary outcomes,
this is a vector with estimates for .
Using a fitted logicDT
model and a fixed complexity penalty
alpha
, its logic decision tree can be (post-)pruned.
prune(model, alpha, simplify = TRUE)
prune(model, alpha, simplify = TRUE)
model |
A fitted |
alpha |
A fixed complexity penalty value. This value should be determined out-of-sample, e.g., performing hyperparameter optimization on independent validation data. |
simplify |
Should the pruned model be simplified with regard to the input terms, i.e., should terms that are no longer in the tree contained be removed from the model? |
Similar to Breiman et al. (1984), we implement post-pruning by first computing the optimal pruning path and then choosing the tree that is pruned according to the specified complexity penalty.
If no validation data is available or if the tree shall be automatically
optimally pruned, cv.prune
should be used instead which
employs k-fold cross-validation for finding the best complexity penalty
value.
The new logicDT
model containing the pruned tree
Using a single fitted logic decision tree, the cost-complexity pruning path containing the ideal subtree for a certain complexity penalty can be computed.
prune.path(pet, y, Z)
prune.path(pet, y, Z)
pet |
A fitted logic decision tree. This can be extracted from a
|
y |
Training outcomes for potentially refitting regression models in
the leaves. This can be extracted from a |
Z |
Continuous training predictors for potentially refitting regression
models in the leaves. This can be extracted from a |
This is mainly a helper function for cv.prune
and should only
be used by the user if manual pruning is preferred.
More details are given in cv.prune
.
Two lists. The first contains the sequence of complexity penalties
. The second list contains the corresponding logic decision
trees which can then be substituted in an already fitted
logicDT
model, e.g., using
model$pet <- result[[2]][[i]]
where result
is the returned
object from this function and i
is the chosen tree index.
Newly fit the decision trees in the logicDT
model using
the supplied tree control parameters.
This is especially useful if, e.g., the model was initially trained
without utilizing a continuous covariable or fitting linear models and
now 4pL model shall be fitted.
refitTrees(model, tree_control)
refitTrees(model, tree_control)
model |
A fitted |
tree_control |
Tree control parameters. This object should be
constructed using the function |
The logicDT
model with newly fitted trees
This function takes a matrix or data frame of SNPs coded as
0, 1, 2 or 1, 2, 3 and returns a data frame with twice as many
columns. SNPs are splitted into dominant and recessive modes,
i.e., for a , two variables
and
are generated.
splitSNPs(data)
splitSNPs(data)
data |
A matrix or data frame only consisting of SNPs to be splitted |
A data frame of the splitted SNPs
Configure the fitting process of individual decision trees.
tree.control( nodesize = 10, split_criterion = "gini", alpha = 0.05, cp = 0.001, smoothing = "none", mtry = "none", covariable = "final_4pl" )
tree.control( nodesize = 10, split_criterion = "gini", alpha = 0.05, cp = 0.001, smoothing = "none", mtry = "none", covariable = "final_4pl" )
nodesize |
Minimum number of samples contained in a terminal node. This parameter ensures that enough samples are available for performing predictions which includes fitting regression models such as 4pL models. |
split_criterion |
Splitting criterion for deciding
when and how to split. The default is |
alpha |
Significance threshold for the likelihood ratio
tests when using |
cp |
Complexity parameter. This parameter determines
by which amount the impurity has to be reduced to further
split a node. Here, the total tree impurity is considered.
See details for a specific formula. Only used if
|
smoothing |
Shall the leaf predictions for risk
estimation be smoothed? |
mtry |
Shall the tree fitting process be randomized
as in random forests? Currently, only |
covariable |
How shall optional quantitative covariables
be handled? |
For the Gini or MSE splitting criterion,
if any considered split leads to
for a node , the empirical node probability
and the impurity reduction
,
then the node is further splitted. If not, the node is
declared as a leaf.
For continuous outcomes,
cp
will be scaled by the
empirical variance of y
to ensure the right scaling,
i.e., cp <- cp * var(y)
. Since the impurity measure
for continuous outcomes is the mean squared error, this can
be interpreted as controlling the minimum reduction of the
normalized mean squared error (NRMSE to the power of two).
If one chooses the 4pL or linear splitting criterion, likelihood
ratio tests testing the alternative of better fitting individual
models are employed. The corresponding test statistic
asymptotically follows a distribution where
the degrees of freedom are given by the difference in the
number of model parameters, i.e., leading to
degrees of freedom in the case of 4pL
models and to
degrees of freedom in
the case of linear models.
For binary outcomes, choosing to fit linear models for evaluating the splits or for modeling the leaves actually leads to fitting LDA (linear discriminant analysis) models.
An object of class tree.control
which is a list
of all necessary tree parameters.
Calculate variable importance measures (VIMs) based on different approaches.
vim( model, scoring_rule = "auc", vim_type = "logic", adjust = TRUE, interaction_order = 3, nodesize = NULL, alpha = 0.05, X_oob = NULL, y_oob = NULL, Z_oob = NULL, leaves = "4pl", ... )
vim( model, scoring_rule = "auc", vim_type = "logic", adjust = TRUE, interaction_order = 3, nodesize = NULL, alpha = 0.05, X_oob = NULL, y_oob = NULL, Z_oob = NULL, leaves = "4pl", ... )
model |
The fitted |
scoring_rule |
The scoring rule for assessing the model
performance. As in |
vim_type |
The type of VIM to be calculated. This can
either be |
adjust |
Shall adjusted interaction VIMs be additionally (to the VIMs of identified terms) computed? See below for details. |
interaction_order |
If |
nodesize |
If |
alpha |
If |
X_oob |
The predictor data which should be used for calculating the VIMs. Preferably some type of validation data independent of the training data. |
y_oob |
The outcome data for computing the VIMs. Preferably some type of validation data independent of the training data. |
Z_oob |
The optional covariable data for computing the VIMs. Preferably some type of validation data independent of the training data. |
leaves |
The prediction mode if regression models (such as 4pL models)
were fitted in the leaves. As in |
... |
Parameters passed to the different VIM type functions.
For |
Three different VIM methods are implemented:
Permutation VIMs: Random permutations of the respective identified logic terms
Removal VIMs: Removing single logic terms
Logic VIMs: Prediction with both possible outcomes of a logic term
Details on the calculation of these VIMs are given below.
By variable importance, importance of identified logic terms is meant. These terms can be single predictors or conjunctions between predictors in the spirit of this software package.
A data frame with two columns:
var |
Short descriptions of the terms for which the
importance was measured. For example |
vim |
The actual calculated VIM values. |
The rows of such a data frame are sorted decreasingly by the VIM values.
Permutation VIMs are computed by comparing the the model's performance using the original data and data with random permutations of single terms.
Removal VIMs are constructed by removing specific logic
terms from the set of predictors, refitting the decision
tree and comparing the performance to the original model.
Thus, this approach requires that at least two terms were
found by the algorithm. Therefore, no VIM will be
calculated if empty.model = "none"
was specified.
Alternatively, empty.model = "mean"
can be set to
use the constant mean response model for approximating
the empty model.
Logic VIMs use the fact that Boolean conjunctions are
Boolean variables themselves and therefore are equal to
0 or 1. To compute the VIM for a specific term,
predictions are performed once for this term fixed to
0 and once for this term fixed to 1. Then, the arithmetic
mean of these two (risk or regression) predictions is
used for calculating the performance. This performance
is then compared to the original one as in the other
VIM approaches (average = "before"
). Alternatively,
predictions for each fixed 0-1 scenario of the considered
term can be performed leading to individual performances
which then are averaged and compared to the original
performance (average = "after"
).
Validation data sets which
were not used in the fitting of the model are prefered
preventing an overfitting of the VIMs themselves.
These should be specified by the _oob
arguments,
if neither bagging nor inner validation was used for fitting
the model.
For the bagging version, out-of-bag (OOB) data are naturally used for the calculation of VIMs.
Since decision trees can naturally include interactions
between single predictors (especially when strong marginal
effects are present as well), logicDT models might, e.g.,
include the single input variables and
but
not their interaction
although an interaction
effect is present. We, therefore, developed and implemented an
adjustment approach for calculating VIMs for such
unidentified interactions nonetheless.
For predictors
, this interaction
importance is given by
and can basically be applied to all black-box models.
By , the VIM of
considering the predictor set excluding the variables in
is meant, i.e., the improvement of additionally considering
while regarding only the predictors in
.
The proposed interaction VIM can be recursively calculated through
for .
This leads to the relationship
The aforementioned VIM adjustment approach only captures the importance
of a general definition of interactions, i.e., it just considers
the question whether some variables do interact in any way.
Since logicDT is aimed at identifying specific conjunctions (and also assigns
them VIMs if they were identified by logicDT
), a further
adjustment approach is implemented which tries to identify the specific
conjunction leading to an interaction effect.
The idea of this method is to consider the response for each possible
scenario of the interacting variables, e.g., for
where the second term
was identified by
logicDT
and, thus, two interacting terms are regarded,
the possible scenarios
are considered. For each setting, the corresponding response is compared with
outcome values of the complementary set. For continuous outcomes, a two sample
t-test (with Welch correction for potentially unequal variances) is performed
comparing the means between these two groups. For binary outcomes, Fisher's exact
test is performed testing different underlying case probabilities.
If at least one test rejects the null hypothesis of equal outcomes (without adjusting
for multiple testing), the combination with the lowest p-value is chosen as the
explanatory term for the interaction effect. For example, if the most significant
deviation results from
and
from the example
above, the term
is chosen.
Lau, M., Schikowski, T. & Schwender, H. (2024). logicDT: A procedure for identifying response-associated interactions between binary predictors. Machine Learning 113(2):933–992. doi:10.1007/s10994-023-06488-6
Breiman, L. (2001). Random Forests. Machine Learning 45(1):5-32. doi:10.1023/A:1010933404324
Breiman, L. & Cutler, A. (2003). Manual on Setting Up, Using, and Understanding Random Forests V4.0. University of California, Berkeley, Department of Statistics. https://www.stat.berkeley.edu/~breiman/Using_random_forests_v4.0.pdf