Create a confusion matrix and calculate compatible yardstick metrics. Supports class, class-probability, and numeric metrics. For multiclass classification problems (n_classes > 2), this function computes both overall (macro/multiclass) metrics and one-vs-rest (OvR) metrics per class label. It also adds a human-readable model interpretation.

confusion_matrix(data, ...)

# Default S3 method
confusion_matrix(data, truth, estimate, na.rm = getOption("na.rm", FALSE), ...)

Arguments

data

Either a data.frame containing the columns specified by the truth and estimate arguments, or a table/matrix where the true class results are in the columns of the table (see the sketch after this argument list).

...

Not currently used.

truth

The column identifier for the true class results (that is a factor). This should be an unquoted column name although this argument is passed by expression and supports quasiquotation (you can unquote column names). For _vec() functions, a factor vector.

estimate

The column identifier for the predicted class results (that is also a factor). As with truth this can be specified in different ways but the primary method is to use an unquoted variable name. For _vec() functions, a factor vector.

na.rm

A logical indicating whether missing values should be removed before metrics are computed. Defaults to getOption("na.rm", FALSE).
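
As a hedged illustration of the table input described under data (a sketch, not one of the package's own examples), a base table with the true classes in the columns could be passed directly:

obs  <- factor(c("Yes", "No", "Yes", "No", "No"))
pred <- factor(c("Yes", "No", "No",  "No", "Yes"))
tab  <- table(Predicted = pred, Actual = obs)  # true classes in the columns
confusion_matrix(tab)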

Details

This is a function-agnostic wrapper around yardstick. It automatically discovers metric functions exported by yardstick, filters them by compatibility with the input type (class labels, class probabilities, or numeric regression), and applies all applicable metrics.
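
One plausible way such discovery could work (a sketch under assumptions, not necessarily this package's implementation) is to scan yardstick's exports and keep the functions carrying yardstick's metric classes, e.g. class_metric for class-label metrics:

library(yardstick)
exported <- getNamespaceExports("yardstick")
fns <- mget(exported, envir = asNamespace("yardstick"), ifnotfound = list(NULL))
# keep only exports that are class-label metric functions
class_metrics <- names(fns)[vapply(fns, inherits, logical(1), what = "class_metric")]
head(sort(class_metrics))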

For multiclass classification (i.e., where the truth and estimate are factors with more than two levels), both macro/multiclass metrics and per-class OvR versions are computed. Each per-class column treats the class as the positive label and all others as negative.
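
As a sketch of the one-vs-rest idea (the function and argument names here are illustrative, not the package's internals), a per-class metric can be computed by collapsing every other level into a single negative level:

ovr_recall <- function(truth, estimate, positive) {
  lvls <- c(positive, ".other")
  to_ovr <- function(x) factor(ifelse(x == positive, positive, ".other"), levels = lvls)
  # event_level = "first" makes `positive` the event (positive) class
  yardstick::recall_vec(to_ovr(truth), to_ovr(estimate), event_level = "first")
}
ovr_recall(iris$Species, iris$Species, positive = "setosa")  # 1 by construction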

In classification settings, this includes metrics such as accuracy, precision, recall, specificity, F1, MCC, Kappa, and others, depending on compatibility.
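
A comparable set of class metrics could be assembled by hand with a yardstick metric set; this is a hedged approximation of the listed metrics, not the wrapper's internal code:

library(yardstick)
cls_metrics <- metric_set(accuracy, bal_accuracy, f_meas, j_index, kap, mcc,
                          npv, ppv, precision, recall, sensitivity, specificity)
# e.g. cls_metrics(two_class_example, truth = truth, estimate = predicted)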

For binary classification or regression, no per-class columns are added; the output contains only the metrics applicable to that input type.

Examples

# From aggregated counts
df <- data.frame(name = c("Yes", "No"),
                 Yes = c(123, 26),
                 No  = c(13, 834))
confusion_matrix(df)
#> 
#> ── Confusion Matrix ────────────────────────────────────────────────────────────
#> 
#>       Predicted
#> Actual Yes  No
#>    Yes 123  26
#>    No   13 834
#> 
#> ── Model Metrics ───────────────────────────────────────────────────────────────
#>                                              
#>  Accuracy                               0.961
#>  Balanced Accuracy                      0.937
#>  F1 Score                               0.863
#>  J-Index                                0.874
#>  Kappa                                  0.840
#>  Matthews Correlation Coefficient (MCC) 0.842
#>  Negative Predictive Value (NPV)        0.985
#>  Positive Predictive Value (PPV)        0.826
#>  Precision                              0.826
#>  Prevalence                             0.150
#>  Recall                                 0.904
#>  Sensitivity                            0.904
#>  Specificity                            0.970
#> 
#> ── Model Interpretation ────────────────────────────────────────────────────────
#> 
#> Overall performance is good. Accuracy (96.1%) and balanced accuracy (93.7%)
#> indicate consistent separation between classes. Agreement between predicted and
#> true classes is strong (Cohen's Kappa = 84.0%, MCC = 84.2%). These account for
#> chance agreement and are robust to class imbalance. Recall (90.4%) exceeds
#> precision (82.6%), meaning the model moderately prioritises detecting true
#> cases at the cost of more false positives. The macro-averaged F1 score is
#> 86.3%, indicating balanced harmonic performance across classes. The model's
#> ability to rule out incorrect classes is very strong, with specificity at 97.0%
#> and negative predictive value at 98.5%. Most misclassifications are
#> concentrated between a small number of class pairs, indicating overlap between
#> specific categories rather than random error. Class imbalance is present
#> (max:minor support ratio = 5.68). While macro-averaging mitigates this, some
#> metrics may still overestimate performance on minority classes.

# From predictions on known labels
iris |>
  ml_decision_trees(Species, quiet = TRUE) |>
  confusion_matrix()
#> 
#> ── Confusion Matrix ────────────────────────────────────────────────────────────
#> 
#>             Predicted
#> Actual       setosa versicolor virginica
#>   setosa          9          0         0
#>   versicolor      0         14         0
#>   virginica       0          1        14
#> 
#> ── Model Metrics ───────────────────────────────────────────────────────────────
#> 
#>                                                  overall setosa versicolor virginica
#>  Accuracy                                          0.974  1.000      0.974     0.974
#>  Balanced Accuracy                                 0.982  1.000      0.979     0.967
#>  F1 Score                                          0.977  1.000      0.979     0.979
#>  J-Index                                           0.964  1.000      0.958     0.933
#>  Kappa                                             0.960  1.000      0.944     0.944
#>  Matthews Correlation Coefficient (MCC)            0.961  1.000      0.946     0.946
#>  Negative Predictive Value (NPV)                   0.986  1.000      0.933     1.000
#>  Positive Predictive Value (PPV)                   0.978  1.000      1.000     0.958
#>  Precision                                         0.978  1.000      1.000     0.958
#>  Prevalence                                        0.333  0.763      0.605     0.632
#>  Recall                                            0.978  1.000      0.958     1.000
#>  Sensitivity                                       0.978  1.000      0.958     1.000
#>  Specificity                                       0.986  1.000      1.000     0.933
#>  Area under the Precision Recall Curve (AUCPR)     0.967
#>  Area under the Receiver Operator Curve (AUROC)    0.989
#>  Brier Score for Classification Models             0.025
#>  Costs Function for Poor Classification            0.073
#>  Gain Capture                                      0.973
#>  Mean log Loss for Multinomial Data (MLMD)         0.111
#> 
#> ── Model Interpretation ────────────────────────────────────────────────────────
#> 
#> Overall performance is very strong. Accuracy (97.4%) and balanced accuracy
#> (98.2%) indicate highly consistent separation between classes. Agreement
#> between predicted and true classes is strong (Cohen's Kappa = 96.0%, MCC =
#> 96.1%). These account for chance agreement and are robust to class imbalance.
#> Precision and recall (both 96.7%) are perfectly aligned, indicating an ideally
#> balanced trade-off between false positives and missed true cases. The
#> macro-averaged F1 score is 97.7%, indicating balanced harmonic performance
#> across classes. The model's ability to rule out incorrect classes is very
#> strong, with specificity at 98.6% and negative predictive value at 98.6%. Most
#> misclassifications are concentrated between a small number of class pairs,
#> indicating overlap between specific categories rather than random error. Class
#> imbalance is present (max:minor support ratio = 1.67). While macro-averaging
#> mitigates this, some metrics may still overestimate performance on minority
#> classes. The confusion matrix is sparsely populated; many class pairs have zero
#> observed errors. Interpret per-class metrics cautiously, as sparse data may
#> inflate estimates.
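
# From raw observed/predicted columns (hedged sketch; `obs` and `pred` are
# hypothetical column names, passed unquoted to `truth` and `estimate`)
preds <- data.frame(obs  = factor(c("Yes", "No", "Yes", "No")),
                    pred = factor(c("Yes", "No", "No",  "No")))
confusion_matrix(preds, truth = obs, estimate = pred)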