Title: | Association Rule Classification |
---|---|
Description: | Implements the Classification-based on Association Rules (CBA) algorithm for association rule classification. The package, also described in Hahsler et al. (2019) <doi:10.32614/RJ-2019-048>, contains several convenience methods that automatically set CBA parameters (minimum confidence, minimum support), and it natively handles numeric attributes by integrating a pre-discretization step. The rule generation phase is handled by the 'arules' package. To further decrease the size of the CBA models produced by the 'arc' package, postprocessing by the 'qCBA' package is suggested. |
Authors: | Tomas Kliegr [aut, cre] |
Maintainer: | Tomas Kliegr <[email protected]> |
License: | GPL-3 |
Version: | 1.4.1 |
Built: | 2025-01-06 05:18:09 UTC |
Source: | https://github.com/kliegr/arc |
Applies cut points to vector.
applyCut(col, cuts, infinite_bounds, labels)
col |
input vector with data. |
cuts |
vector with cutpoints.
There are several special values defined:
|
infinite_bounds |
a logical indicating how the bounds on the extremes should look.
If set to |
labels |
a logical indicating whether the bins of the discretized data should be represented by integer codes or as interval notation using (a;b] when set to TRUE. |
Vector with discretized data.
applyCut(datasets::iris[[1]], c(3,6), TRUE, TRUE)
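To make the binning described above concrete, here is a minimal base R sketch. It is not the package's implementation: `bin_with_cuts` is a hypothetical helper, and it assumes that infinite bounds mean the outermost intervals extend to -Inf and Inf and that intervals are right-closed.

```r
# Hypothetical helper (not part of arc): bins a numeric vector at the given
# cut points, with the outermost intervals extended to -Inf and Inf.
bin_with_cuts <- function(x, cuts, labels = TRUE) {
  breaks <- c(-Inf, sort(unique(cuts)), Inf)
  binned <- cut(x, breaks = breaks)  # right-closed intervals
  if (!labels) {
    return(as.integer(binned))       # integer bin codes
  }
  # Switch the interval separator from "," to ";" as in the (a;b] notation.
  levels(binned) <- gsub(",", ";", levels(binned), fixed = TRUE)
  binned
}

sepal_bins <- bin_with_cuts(datasets::iris[[1]], c(3, 6))
sepal_codes <- bin_with_cuts(datasets::iris[[1]], c(3, 6), labels = FALSE)
table(sepal_bins)
```

With two cut points this yields three bins; the first one stays empty for this column because no sepal length in iris is 3 or below.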
Applies cut points to input data frame.
applyCuts(df, cutp, infinite_bounds, labels)
df |
input data frame. |
cutp |
a list of vectors with cutpoints (for more information see |
infinite_bounds |
a logical indicating how the bounds on the extremes should look (for more information see |
labels |
a logical indicating whether the bins of the discretized data should be represented by integer codes or as interval notation using (a;b] when set to TRUE. |
discretized data. If there was no discretization specified for some columns, these are returned as is.
applyCut
applyCuts(datasets::iris, list(c(5,6), c(2,3), "All", NULL, NULL), TRUE, TRUE)
Learns a CBA rule set from the supplied data frame.
cba(train, classAtt, rulelearning_options = NULL, pruning_options = NULL)
train |
a data frame with data. |
classAtt |
the name of the class attribute. |
rulelearning_options |
custom options for the rule learning algorithm overriding the default values.
If not specified, the topRules function is called and the defaults specified there are used. |
pruning_options |
custom options for the pruning algorithm overriding the default values. |
Object of class CBARuleModel.
# Example using automatic threshold detection
cba(datasets::iris, "Species", rulelearning_options = list(target_rule_count = 50000))
# Example using manually set confidence and support thresholds
rm <- cba(datasets::iris, "Species", rulelearning_options = list(minsupp=0.01,
  minconf=0.5, minlen=1, maxlen=5, maxtime=1000, target_rule_count=50000,
  trim=TRUE, find_conf_supp_thresholds=FALSE))
inspect(rm@rules)
Learns a CBA rule set from supplied rules.
cba_manual( train_raw, rules, txns, rhs, classAtt, cutp, pruning_options = list(input_list_sorted_by_length = FALSE) )
train_raw |
a data frame with raw data (numeric attributes are not discretized). |
rules |
rules class instance output by the apriori function of the arules package |
txns |
Transactions class instance passed to the arules method invocation. Transactions are created over discretized data frame - numeric values are replaced with intervals such as "(13;45]". |
rhs |
character vectors giving the labels of the items which can appear in the RHS ($rhs element of the APappearance class instance passed to the arules call) |
classAtt |
the name of the class attribute. |
cutp |
list of cutpoints used to discretize data (required for application of the model on continuous data) |
pruning_options |
custom options for the pruning algorithm overriding the default values. |
Object of class CBARuleModel.
data(humtemp)
data_raw <- humtemp
data_discr <- humtemp
# custom discretization
data_discr[,1] <- cut(humtemp[,1], breaks=seq(from=15,to=45,by=5))
data_discr[,2] <- cut(humtemp[,2], breaks=c(0,40,60,80,100))
# change interval syntax from (15,20] to (15;20], which is required by MARC
data_discr[,1] <- as.factor(unlist(lapply(data_discr[,1], function(x) {gsub(",", ";", x)})))
data_discr[,2] <- as.factor(unlist(lapply(data_discr[,2], function(x) {gsub(",", ";", x)})))
data_discr[,3] <- as.factor(humtemp[,3])
# mine rules
classAtt <- "Class"
appearance <- getAppearance(data_discr, classAtt)
txns_discr <- as(data_discr, "transactions")
rules <- apriori(txns_discr, parameter = list(confidence = 0.5,
  support = 3/nrow(data_discr), minlen=1, maxlen=5), appearance=appearance)
inspect(rules)
rmCBA <- cba_manual(data_raw, rules, txns_discr, appearance$rhs, classAtt,
  cutp = list(), pruning_options=NULL)
inspect(rmCBA@rules)
prediction <- predict(rmCBA, data_discr, discretize=FALSE)
acc <- CBARuleModelAccuracy(prediction, data_discr[[classAtt]])
print(paste("Accuracy:", acc))
Learns a CBA rule set and saves the resulting rule set to a CSV file.
cbaCSV( path, outpath = NULL, classAtt = NULL, idcolumn = NULL, rulelearning_options = NULL, pruning_options = NULL )
path |
path to csv file with data. |
outpath |
path to write the rule set to. |
classAtt |
the name of the class attribute. |
idcolumn |
the name of the id column in the data file. |
rulelearning_options |
custom options for the rule learning algorithm overriding the default values. |
pruning_options |
custom options for the pruning algorithm overriding the default values. |
Object of class CBARuleModel
# cbaCSV("path-to-.csv")
Test workflow on iris dataset: learns a cba classifier on one "train set" fold, and applies it to the second "test set" fold.
cbaIris()
Accuracy.
Test workflow on iris dataset: learns a cba classifier on one "train set" fold, and applies it to the second "test set" fold.
cbaIrisNumeric()
Accuracy.
This class represents a rule-based classifier.
rules
an object of class rules from arules package
cutp
list of cutpoints
classAtt
name of the target class attribute
attTypes
attribute types
Compares predictions with true labels and outputs accuracy.
CBARuleModelAccuracy(prediction, groundtruth)
prediction |
vector with predictions |
groundtruth |
vector with true labels |
Accuracy
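The comparison described above amounts to the share of predictions that equal the true label. The following base R sketch illustrates that computation; `accuracy_sketch` is a hypothetical name, and the package function may differ in details such as input checking.

```r
# Hypothetical stand-in for an accuracy computation: the fraction of
# positions where the predicted label equals the true label.
accuracy_sketch <- function(prediction, groundtruth) {
  stopifnot(length(prediction) == length(groundtruth))
  # Compare as character so factors with different level sets still work.
  mean(as.character(prediction) == as.character(groundtruth))
}

acc_demo <- accuracy_sketch(c("a", "b", "a", "b"), c("a", "b", "b", "b"))
acc_demo  # 0.75: three of the four labels agree
```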
Discretizes provided numeric vector.
discretizeUnsupervised( data, labels = FALSE, infinite_bounds = FALSE, categories = 3, method = "cluster" )
data |
input numeric vector. |
labels |
a logical indicating whether the bins of the discretized data should be represented by integer codes or as interval notation using (a;b] when set to TRUE. |
infinite_bounds |
a logical indicating how the bounds on the extremes should look. |
categories |
number of categories (bins) to produce. |
method |
clustering method, one of "interval" (equal interval width), "frequency" (equal frequency), "cluster" (k-means clustering). See also documentation of the |
Discretized data. If there was no discretization specified for some columns, these are returned as is.
discretizeUnsupervised(datasets::iris[[1]])
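The three strategies named for the method argument can be illustrated in base R. This is only a sketch of the general ideas (equal width, equal frequency, k-means), not the code the package runs; all variable names are made up for the illustration, and placing the k-means cut points midway between neighbouring cluster centres is an assumption.

```r
x <- datasets::iris[[1]]
k <- 3  # number of bins, matching the default of the categories argument

# "interval": k bins of equal width between the minimum and the maximum.
interval_breaks <- seq(min(x), max(x), length.out = k + 1)

# "frequency": breaks at quantiles, so bins hold roughly equal counts.
frequency_breaks <- quantile(x, probs = seq(0, 1, length.out = k + 1),
                             names = FALSE)

# "cluster": one-dimensional k-means; cut points are placed midway between
# neighbouring cluster centres (an assumption made for this sketch).
set.seed(1)
centers <- sort(kmeans(x, centers = k)$centers[, 1])
cluster_breaks <- c(min(x),
                    (head(centers, -1) + tail(centers, -1)) / 2,
                    max(x))

rbind(interval_breaks, frequency_breaks, cluster_breaks)
```

Each row of the printed matrix holds the k + 1 bin boundaries for one strategy, which makes it easy to see how the choice of method moves the inner cut points.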
Can discretize both predictor columns in data frame – using supervised algorithm MDLP (Fayyad & Irani, 1993) – and the target class – using unsupervised algorithm (k-Means). This R file contains fragments of code from the GPL-licensed R discretization package by HyunJi Kim.
discrNumeric( df, classatt, min_distinct_values = 3, unsupervised_bins = 3, discretize_class = FALSE )
df |
a data frame with data. |
classatt |
name of the class attribute in df |
min_distinct_values |
the minimum number of unique values a column needs to have to be subject to supervised discretization. |
unsupervised_bins |
number of target bins for discretizing the class attribute. Ignored when the class attribute is not numeric or when |
discretize_class |
logical value indicating whether the class attribute should be discretized. Ignored when the class attribute is not numeric. |
A list with two slots: $cutp with cutpoints and $Disc.data with discretization results.
Fayyad, U. M. and Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning, Artificial intelligence 13, 1022–1027
discrNumeric(datasets::iris, "Species")
Method that generates items for values in given data frame column.
getAppearance(df, classAtt)
df |
a data frame contain column |
classAtt |
name of the column in |
appearance object for mining classification rules
getAppearance(datasets::iris,"Species")
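For orientation, an appearance specification for classification rule mining restricts the rule consequent to class items and leaves every other item for the antecedent. The sketch below builds such a list by hand in base R, in the same form as the hand-written appearance list in the topRules examples; `appearance_sketch` is a hypothetical helper and is not claimed to match what getAppearance returns internally.

```r
# Hypothetical helper: builds "attribute=value" items for every distinct
# value of the class column and allows them only on the right-hand side.
appearance_sketch <- function(df, classAtt) {
  class_values <- unique(as.character(df[[classAtt]]))
  list(rhs = paste0(classAtt, "=", class_values), default = "lhs")
}

iris_appearance <- appearance_sketch(datasets::iris, "Species")
iris_appearance$rhs
```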
Methods for computing ROC curves require a vector of confidences of the positive class, whereas in CBA, predict with outputConfidenceScores = TRUE returns the confidence for the predicted class. This method converts the values to confidences for the positive class.
getConfVectorForROC(confidences, predictedClass, positiveClass)
confidences |
Vector of confidences |
predictedClass |
Vector with predicted classes |
positiveClass |
Positive class (String) |
Vector of confidence values
predictedClass = c("setosa","virginica")
confidences = c(0.9,0.6)
baseClass="setosa"
getConfVectorForROC(confidences,predictedClass,baseClass)
# Further examples showing how ROC curve and AUC values can be computed
# using this function are available at project's GitHub homepage.
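One plausible reading of this conversion for a binary problem is sketched below: a confidence is kept when the positive class was predicted and replaced by its complement otherwise. The function name `conf_for_positive` is hypothetical, and the complement rule is an assumption about the behaviour, not a statement of the package's code.

```r
# Hypothetical sketch for a binary problem: keep the confidence when the
# positive class was predicted, otherwise use one minus the confidence.
conf_for_positive <- function(confidences, predictedClass, positiveClass) {
  stopifnot(length(confidences) == length(predictedClass))
  is_positive <- as.character(predictedClass) == positiveClass
  ifelse(is_positive, confidences, 1 - confidences)
}

roc_scores <- conf_for_positive(c(0.9, 0.6), c("setosa", "virginica"), "setosa")
roc_scores  # 0.9 0.4
```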
A synthetic toy dataset. The variables are as follows:
data(humtemp)
A data frame with 34 rows and 3 variables
Temperature.
Humidity.
Class. Comfort level
Performs supervised discretization of numeric columns, except class, on the provided data frame. Uses the Minimum Description Length Principle algorithm (Fayyad and Irani, 1993) as implemented in the discretization package.
mdlp2( df, cl_index = NULL, handle_missing = FALSE, labels = FALSE, skip_nonnumeric = FALSE, infinite_bounds = FALSE, min_distinct_values = 3 )
df |
input data frame. |
cl_index |
index of the class variable. If not specified, the last column is used as the class variable. |
handle_missing |
Setting to TRUE activates the following behaviour: if there are any missing observations in the column processed, the input for discretization is a subset of data containing this column and target with rows containing missing values excluded. |
labels |
A logical indicating whether the bins of the discretized data should be represented by integer codes or as interval notation using (a;b] when set to TRUE. |
skip_nonnumeric |
If set to TRUE, any non-numeric columns will be skipped. |
infinite_bounds |
A logical indicating how the bounds on the extremes should look. |
min_distinct_values |
If a column contains fewer than the specified number of distinct values, it is not discretized. |
Discretized data. If there were any non-numeric input columns they are returned as is. All returned columns except class are factors.
Fayyad, U. M. and Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning, Artificial intelligence 13, 1022–1027
mdlp2(datasets::iris) # gives the same result as mdlp(datasets::iris) from the discretization package
# uses Sepal.Length as target variable
mdlp2(df=datasets::iris, cl_index = 1, handle_missing = TRUE, labels = TRUE,
  skip_nonnumeric = TRUE, infinite_bounds = TRUE, min_distinct_values = 30)
Method that matches rule model against test data.
## S3 method for class 'CBARuleModel'
predict( object, data, discretize = TRUE, outputFiringRuleIDs = FALSE, outputConfidenceScores = FALSE, confScoreType = "ordered", positiveClass = NULL, ... )
object |
a CBARuleModel class instance |
data |
a data frame with data |
discretize |
boolean indicating whether the passed data should be discretized using information in the @cutp slot of the object argument. |
outputFiringRuleIDs |
if set to TRUE, instead of predictions, the function will return one-based IDs of rules used to classify each instance (one rule per instance). |
outputConfidenceScores |
if set to TRUE, instead of predictions, the function will return confidences of the firing rule |
confScoreType |
applicable only if 'outputConfidenceScores=TRUE', possible values 'ordered' for confidence computed only for training instances reaching this rule, or 'global' for standard rule confidence computed from the complete training data |
positiveClass |
This setting is only used if 'outputConfidenceScores=TRUE'. It should be used only for binary problems. In this case, the confidence values are recalculated so that these are not confidence values of the predicted class (default behaviour of 'outputConfidenceScores=TRUE') but rather confidence values associated with the class designated as positive |
... |
other arguments (currently not used) |
A vector with predictions.
set.seed(101)
allData <- datasets::iris[sample(nrow(datasets::iris)),]
trainFold <- allData[1:100,]
testFold <- allData[101:nrow(allData),]
# increase for more accurate results in longer time
target_rule_count <- 1000
classAtt <- "Species"
rm <- cba(trainFold, classAtt, list(target_rule_count = target_rule_count))
prediction <- predict(rm, testFold)
acc <- CBARuleModelAccuracy(prediction, testFold[[classAtt]])
message(acc)
# get rules responsible for each prediction
firingRuleIDs <- predict(rm, testFold, outputFiringRuleIDs=TRUE)
# show rule responsible for prediction of test instance no. 28
inspect(rm@rules[firingRuleIDs[28]])
# get prediction confidence (three different versions)
rm@rules[firingRuleIDs[28]]@quality$confidence
rm@rules[firingRuleIDs[28]]@quality$orderedConf
rm@rules[firingRuleIDs[28]]@quality$cumulativeConf
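The matching step can be pictured as walking an ordered rule list and letting the first rule whose antecedent matches decide the class, with a default rule at the end catching everything else. The base R sketch below uses hand-written rules and a hypothetical `predict_first_match` function; the thresholds are illustrative only and the code does not reproduce the package internals.

```r
# Hand-written ordered rule list (illustrative thresholds). Each rule has a
# predicate over the data frame rows and the class it predicts.
rule_list <- list(
  list(matches = function(d) d$Petal.Length <= 2.5, predicts = "setosa"),
  list(matches = function(d) d$Petal.Width <= 1.7,  predicts = "versicolor"),
  list(matches = function(d) rep(TRUE, nrow(d)),    predicts = "virginica") # default rule
)

# Hypothetical first-match classifier: returns the one-based ID of the
# firing rule and the resulting prediction for every row.
predict_first_match <- function(rule_list, data) {
  fired <- rep(NA_integer_, nrow(data))
  for (i in seq_along(rule_list)) {
    hit <- is.na(fired) & rule_list[[i]]$matches(data)
    fired[hit] <- i  # only rows not claimed by an earlier rule
  }
  classes <- vapply(rule_list, function(r) r$predicts, character(1))
  list(ruleIDs = fired, prediction = classes[fired])
}

res <- predict_first_match(rule_list, datasets::iris)
table(res$ruleIDs)
mean(res$prediction == as.character(datasets::iris$Species))
```

Returning the firing rule IDs next to the predictions mirrors the idea behind outputFiringRuleIDs: each instance is explained by exactly one rule.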
An implementation of the CBA-CB M1 algorithm (Liu et al, 1998) adapted for R and arules package apriori implementation in place of CBA-RG.
prune( rules, txns, classitems, default_rule_pruning = TRUE, rule_window = 50000, greedy_pruning = FALSE, input_list_sorted_by_length = TRUE, debug = FALSE )
rules |
object of class rules from arules package |
txns |
input object with transactions. |
classitems |
a list of items to appear in the consequent (rhs) of the rules. |
default_rule_pruning |
boolean indicating whether default pruning should be performed. If set to TRUE, default pruning is performed as in the CBA algorithm. If set to FALSE, default pruning is not performed i.e. all rules surviving data coverage pruning are kept. In either case, a default rule is added to the end of the classifier. |
rule_window |
the number of rules to precompute for CBA data coverage pruning. The default value can be adjusted to decrease runtime. |
greedy_pruning |
setting to TRUE activates early stopping condition: pruning will be stopped on first rule on which total error increases. |
input_list_sorted_by_length |
logical indicating whether the input rule list is already sorted by antecedent length (as output by arules), which is the default. If set to FALSE, the list is re-sorted. |
debug |
output debug messages. |
Returns an object of class rules. Note that the 'rules@quality' slot has been extended
with additional measures, specifically 'orderedConf', 'orderedSupp', and 'cumulativeConf'. The rules are output in the order
in which they are assumed to be applied in classification. Only the first applicable rule is used to
classify the instance. As a result, in addition to rule confidence – which is computed over the
whole training dataset – it makes sense to define order-sensitive confidence, which is computed
only from instances reaching the given rule as $a/(a+b)$, where $a$ is the number of instances
matching both the antecedent and consequent (available in slot 'orderedSupp') and
$b$ is the number of instances matching the antecedent, but
not matching the consequent of the given rule. The cumulative confidence is an experimental measure,
which is computed as the accuracy of the rule list comprising the given rule and all higher priority
rules (rules with lower index) with uncovered instances excluded from the computation.
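The order-sensitive confidence can be checked by hand on a toy rule list. In the sketch below, a counts the instances that reach a rule and match both its antecedent and its consequent, b counts those that reach it and match only the antecedent, and the ordered confidence is a/(a+b). The data, the rules, and the `ordered_conf` function are all made up for illustration and do not reproduce the package code.

```r
# Toy data: one categorical predictor and a class label.
toy <- data.frame(
  colour = c("red", "red", "red", "blue", "blue", "green"),
  class  = c("yes", "yes", "no",  "no",   "yes",  "no"),
  stringsAsFactors = FALSE
)

# Ordered rule list: rule 1 is applied first, rule 2 only sees what is left.
toy_rules <- list(
  list(matches = function(d) d$colour == "red",              predicts = "yes"),
  list(matches = function(d) d$colour %in% c("red", "blue"), predicts = "no")
)

ordered_conf <- function(rule_list, data) {
  remaining <- rep(TRUE, nrow(data))  # instances not covered by earlier rules
  result <- data.frame(orderedSupp = integer(0), orderedConf = numeric(0))
  for (rule in rule_list) {
    fires <- remaining & rule$matches(data)
    a <- sum(fires & data$class == rule$predicts)  # antecedent and consequent
    b <- sum(fires & data$class != rule$predicts)  # antecedent only
    result <- rbind(result,
                    data.frame(orderedSupp = a, orderedConf = a / (a + b)))
    remaining <- remaining & !fires
  }
  result
}

oc <- ordered_conf(toy_rules, toy)
oc
```

The second rule has a confidence of 2/5 over the whole toy data set, but an ordered confidence of 1/2, because the three red instances are already consumed by the first rule.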
Liu, B., Hsu, W. and Ma, Y. (1998). Integrating classification and association rule mining. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining.
# Example 1
txns <- as(discrNumeric(datasets::iris, "Species")$Disc.data, "transactions")
appearance <- getAppearance(datasets::iris, "Species")
rules <- apriori(txns, parameter = list(confidence = 0.5, support = 0.01,
  minlen = 2, maxlen = 4), appearance = appearance)
prune(rules, txns, appearance$rhs)
inspect(rules)
# Example 2
utils::data(Adult) # this dataset comes with the arules package
classitems <- c("income=small","income=large")
rules <- apriori(Adult, parameter = list(supp = 0.3, conf = 0.5, target = "rules"),
  appearance=list(rhs=classitems, default="lhs"))
# produces 25 rules
rulesP <- prune(rules, Adult, classitems)
rulesP@quality # inspect rule quality measures including the new additions
# Rules after data coverage pruning: 8
# Performing default rule pruning.
# Final rule list size: 6
A wrapper for the apriori method from the arules package that iteratively changes mining parameters until a desired number of rules is obtained, all options are exhausted or a preset time limit is reached. Within the arc package, this function serves as a replacement for the CBA Rule Generation algorithm (Liu et al, 1998) – without pessimistic pruning – with general apriori implementation provided by existing fast R package arules.
topRules( txns, appearance = list(), target_rule_count = 1000, init_support = 0, init_conf = 0.5, conf_step = 0.05, supp_step = 0.05, minlen = 2, init_maxlen = 3, iteration_timeout = 2, total_timeout = 100, max_iterations = 30, trim = TRUE, debug = FALSE )
txns |
input transactions. |
appearance |
a named list or APappearance object (refer to arules package) |
target_rule_count |
the main stopping criterion, mining stops when the resulting rule set contains this number of rules. |
init_support |
initial support. |
init_conf |
initial confidence. |
conf_step |
confidence will be changed by steps defined by this parameter. |
supp_step |
support will be changed by steps defined by this parameter. |
minlen |
minimum length of rules, minlen=1 corresponds to rule with empty antecedent and one item in consequent. In general, rules with empty antecedent are not desirable for the subsequent pruning algorithm, therefore the value of this parameter should be set at least to value 2. |
init_maxlen |
maximum length of rules, should be equal or higher than minlen. A higher value may decrease the number of iterations to obtain target_rule_count rules, but it also increases the risk of initial combinatorial explosion and subsequent memory crash of the apriori rule learner. |
iteration_timeout |
maximum number of seconds it should take apriori to obtain rules with current configuration. |
total_timeout |
maximum number of seconds the mining should take. |
max_iterations |
maximum number of iterations. |
trim |
if set to TRUE and more than |
debug |
boolean indicating whether to output debug messages. |
Returns an object of class rules.
Liu, B., Hsu, W. and Ma, Y. (1998). Integrating classification and association rule mining. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining.
# Example 1
utils::data(Adult)
rules <- topRules(Adult, appearance = list(), target_rule_count = 100,
  init_support = 0.5, init_conf = 0.9, minlen = 1, init_maxlen = 10)
# Example 2
rules <- topRules(as(discrNumeric(datasets::iris, "Species")$Disc.data, "transactions"),
  getAppearance(datasets::iris, "Species"))
# Example 3
utils::data("iris", package = "datasets")
appearance <- list(rhs = c("Species=setosa", "Species=versicolor",
  "Species=virginica"), default="lhs")
data <- sapply(datasets::iris, as.factor)
data <- data.frame(data, check.names=FALSE)
txns <- as(data, "transactions")
rules <- topRules(txns, appearance)