Detect disease clusters with early_warning_cluster(). Use has_clusters() to test whether its output contains any clusters (returns TRUE or FALSE), or use format() to return the detected clusters as a table.

early_warning_cluster(
  df,
  column_date = NULL,
  column_patientid = NULL,
  based_on_historic_maximum = FALSE,
  period_length_months = 12,
  minimum_cases = 5,
  minimum_days = 0,
  minimum_case_days = 2,
  minimum_case_fraction_in_period = 0.02,
  threshold_percentile = 97.5,
  remove_outliers = TRUE,
  remove_outliers_coefficient = 1.5,
  moving_average_days = 7,
  moving_average_side = "left",
  case_free_days = 14,
  ...
)

n_clusters(x)

has_clusters(x, n = 1)

has_ongoing_cluster(x, dates = Sys.Date() - 1)

has_cluster_before(x, date)

has_cluster_after(x, date)

Arguments

df

Data set. It must contain positive results only. At a minimum, it needs a date column and a patient column. Do not summarize on patient IDs; this is handled automatically.

column_date

Name of the column to use for dates. If left blank, the first date column will be used.

column_patientid

Name of the column to use for patient IDs. If left blank, the first column resembling "patient|patid" will be used.

based_on_historic_maximum

A logical to indicate whether the cluster detection should be based on the maximum of previous years. The default is FALSE, which uses all historic data points.

period_length_months

Number of months per period.

minimum_cases

Minimum number of cases that a cluster requires to be considered a cluster.

minimum_days

Minimum number of days that a cluster requires to be considered a cluster.

minimum_case_days

Minimum number of days with cases that a cluster requires to be considered a cluster.

minimum_case_fraction_in_period

Minimum fraction of cluster cases in a period that a cluster requires to be considered a cluster.

threshold_percentile

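Percentile to use as the threshold for cluster detection, on a 0-100 scale. The default is 97.5. In the Examples, a value of 0.75 is reported as 75 in the printed parameters.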

remove_outliers

A logical to indicate whether outliers should be removed before determining the threshold.

remove_outliers_coefficient

Coefficient used for outlier determination.

moving_average_days

Number of days to set in moving_average(). Defaults to a whole week (7).

moving_average_side

Side of days to set in moving_average(). Defaults to "left" for retrospective analysis.

case_free_days

Number of days to set in get_episode().

...

Not used at the moment.

x

Output of early_warning_cluster().

n

Minimum number of clusters required for has_clusters() to return TRUE. Defaults to 1.

dates

Date(s) to test for an ongoing cluster, i.e., whether any of the clusters contains the date. Defaults to yesterday.

date

Date to test against: has_cluster_before() checks for clusters before this date, has_cluster_after() for clusters after it.
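
The outlier and percentile arguments can be read as a two-step threshold calculation. The sketch below is a minimal illustration in base R, not the package's implementation. It assumes that outliers are removed with the common interquartile-range rule (values further than remove_outliers_coefficient times the IQR from the quartiles are dropped) and that the threshold is then the requested percentile of the remaining values. The function sketch_threshold() and the historic vector are hypothetical.

```r
# Hypothetical helper, not part of the package: one plausible reading of
# remove_outliers, remove_outliers_coefficient and threshold_percentile.
sketch_threshold <- function(values,
                             threshold_percentile = 97.5,
                             remove_outliers = TRUE,
                             remove_outliers_coefficient = 1.5) {
  if (remove_outliers) {
    # Assumed rule: keep values within coefficient * IQR of the quartiles
    q <- stats::quantile(values, probs = c(0.25, 0.75), names = FALSE)
    spread <- remove_outliers_coefficient * (q[2] - q[1])
    values <- values[values >= q[1] - spread & values <= q[2] + spread]
  }
  # quantile() expects a probability, so convert the percentile first
  stats::quantile(values, probs = threshold_percentile / 100, names = FALSE)
}

# Made-up historic values; 25 is an obvious outlier
historic <- c(0, 0, 1, 1, 1, 2, 2, 3, 3, 25)

with_removal    <- sketch_threshold(historic, threshold_percentile = 75)
without_removal <- sketch_threshold(historic, threshold_percentile = 75,
                                    remove_outliers = FALSE)
c(with_removal = with_removal, without_removal = without_removal)
```

Under these assumptions, removing the outlier lowers the threshold, which makes it easier for a recent rise in cases to be flagged.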

Details

A (disease) cluster is defined as an unusually large aggregation of disease events in time or space (ATSDR, 2008). Clusters are common, particularly in large populations. From a statistical standpoint, it is nearly inevitable that some clusters of chronic diseases will emerge within various communities, be it schools, church groups, social circles, or neighborhoods. At first, such clusters are often perceived as the product of a specific, predictable process rather than as a random occurrence in a particular location, like the outcome of a coin toss.

Whether a (suspected) cluster corresponds to an actual increase of disease in the area needs to be assessed by an epidemiologist or biostatistician (ATSDR, 2008).

The function has_ongoing_cluster() returns a logical vector with the same length as dates, so dates can have any length.
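
To make the vectorised behaviour concrete, here is a minimal base R sketch of such a check. It is not the package's code: sketch_ongoing() is a hypothetical stand-in, and it assumes that a date counts as "ongoing" when it lies between a cluster's first and last day, inclusive. The cluster boundaries are copied from the format() output in the Examples.

```r
# Cluster boundaries copied from the format() output in the Examples
clusters <- data.frame(
  first_day = as.Date(c("2022-01-24", "2022-03-09", "2022-05-18", "2022-07-01",
                        "2022-08-02", "2022-10-09", "2022-11-07")),
  last_day  = as.Date(c("2022-02-06", "2022-03-29", "2022-05-28", "2022-07-19",
                        "2022-08-29", "2022-10-12", "2022-11-17"))
)

# Hypothetical stand-in for has_ongoing_cluster(): one logical per date
sketch_ongoing <- function(clusters, dates) {
  dates <- as.Date(dates)
  vapply(seq_along(dates), function(i) {
    # Assumption: both boundaries are inclusive
    any(dates[i] >= clusters$first_day & dates[i] <= clusters$last_day)
  }, logical(1))
}

sketch_ongoing(clusters, c("2022-06-01", "2022-06-20"))
#> [1] FALSE FALSE
sketch_ongoing(clusters, "2022-07-10")
#> [1] TRUE
```

Both June dates fall in the gap between the third and fourth cluster, which agrees with the FALSE FALSE result shown in the Examples.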

Examples

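# Simulated data: 300 positive results on random dates for 26 possible
# patient IDs. No seed is set, so results differ between runs.
cases <- data.frame(date = sample(seq(as.Date("2015-01-01"),
                                      as.Date("2022-12-31"),
                                      "1 day"),
                                  size = 300),
                    patient = sample(LETTERS, size = 300, replace = TRUE))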

# -----------------------------------------------------------

check <- early_warning_cluster(cases, threshold_percentile = 0.99)
#> Using column 'date' for dates
#> Using column 'patient' for patient IDs

has_clusters(check)
#> [1] FALSE
check
#> => Detected no disease clusters over the last 12-month period


check2 <- early_warning_cluster(cases,
                                minimum_cases = 1,
                                threshold_percentile = 0.75)
#> Using column 'date' for dates
#> Using column 'patient' for patient IDs

check2
#> => Detected 7 disease clusters over the last 12-month period with a total of 29
#> cases.
#> 
#> ── Disease Clusters ──
#> 
#> These disease clusters were found:
#> 1. Between 24 January and 6 February 2022: 6 cases
#> 2. Between 9 and 29 March 2022: 3 cases
#> 3. Between 18 and 28 May 2022: 3 cases
#> 4. Between 1 and 19 July 2022: 5 cases
#> 5. Between 2 and 29 August 2022: 6 cases
#> 6. Between 9 and 12 October 2022: 2 cases
#> 7. Between 7 and 17 November 2022: 4 cases
#> 
#> ── Parameters Used ──
#> 
#> • based_on_historic_maximum: FALSE
#> • minimum_case_days: 2
#> • minimum_case_fraction_in_period: 0.02
#> • minimum_cases: 1
#> • minimum_days: 0
#> • moving_average_days: 7
#> • period_length_months: 12
#> • remove_outliers: TRUE
#> • remove_outliers_coefficient: 1.5
#> • threshold_percentile: 75
#> 
#> ── Summary ──
#> 
#> In total 29 cases between 24 January and 17 November 2022, spread over 7
#> cluster(s).
#> Use `certeplot2::plot2()` to plot the results.
check2 |> format()
#> # A tibble: 7 × 8
#>   cluster first_day  last_day   first_day_in_period last_day_in_period cases
#>     <int> <date>     <date>                   <int>              <int> <int>
#> 1       1 2022-01-24 2022-02-06                  29                 42     6
#> 2       2 2022-03-09 2022-03-29                  73                 93     3
#> 3       3 2022-05-18 2022-05-28                 143                153     3
#> 4       4 2022-07-01 2022-07-19                 187                205     5
#> 5       5 2022-08-02 2022-08-29                 219                246     6
#> 6       6 2022-10-09 2022-10-12                 287                290     2
#> 7       7 2022-11-07 2022-11-17                 316                326     4
#> # ℹ 2 more variables: days <int>, case_days <int>

check2 |> n_clusters()
#> [1] 7
check2 |> has_clusters()
#> [1] TRUE
check2 |> has_clusters(n = 15)
#> [1] FALSE

check2 |> has_ongoing_cluster("2022-06-01")
#> [1] FALSE
check2 |> has_ongoing_cluster(c("2022-06-01", "2022-06-20"))
#> [1] FALSE FALSE
check2 |> has_cluster_before("2022-06-01")
#> [1] TRUE
check2 |> has_cluster_after("2022-06-01")
#> [1] TRUE

check2 |> unclass()
#> $clusters
#> # A tibble: 108 × 6
#>    date       cases cluster day_in_period  days case_days
#>    <date>     <int>   <int>         <int> <int>     <int>
#>  1 2022-01-24     1       1            29     1         1
#>  2 2022-01-25     0       1            NA     2         1
#>  3 2022-01-26     0       1            NA     3         1
#>  4 2022-01-27     1       1            32     4         2
#>  5 2022-01-28     0       1            NA     5         2
#>  6 2022-01-29     1       1            34     6         3
#>  7 2022-01-30     0       1            NA     7         3
#>  8 2022-01-31     0       1            NA     8         3
#>  9 2022-02-01     1       1            37     9         4
#> 10 2022-02-02     0       1            NA    10         4
#> # ℹ 98 more rows
#> 
#> $details
#> # A tibble: 2,911 × 10
#>     year date       period_date day_in_period period cases moving_avg
#>    <int> <date>     <date>              <int>  <int> <int>      <dbl>
#>  1  2015 2015-01-07 2000-01-07              1      7     1      1    
#>  2  2015 2015-01-08 2000-01-08              2      7     0      0.5  
#>  3  2015 2015-01-09 2000-01-09              3      7     0      0.333
#>  4  2015 2015-01-10 2000-01-10              4      7     0      0.25 
#>  5  2015 2015-01-11 2000-01-11              5      7     0      0.2  
#>  6  2015 2015-01-12 2000-01-12              6      7     0      0.167
#>  7  2015 2015-01-13 2000-01-13              7      7     1      0.286
#>  8  2015 2015-01-14 2000-01-14              8      7     0      0.143
#>  9  2015 2015-01-15 2000-01-15              9      7     0      0.143
#> 10  2015 2015-01-16 2000-01-16             10      7     0      0.143
#> # ℹ 2,901 more rows
#> # ℹ 3 more variables: moving_avg_max <dbl>, moving_avg_pctile <dbl>,
#> #   moving_avg_limit <dbl>
#> 
#> attr(,"threshold_percentile")
#> [1] 75
#> attr(,"based_on_historic_maximum")
#> [1] FALSE
#> attr(,"remove_outliers")
#> [1] TRUE
#> attr(,"remove_outliers_coefficient")
#> [1] 1.5
#> attr(,"moving_average_days")
#> [1] 7
#> attr(,"minimum_cases")
#> [1] 1
#> attr(,"minimum_days")
#> [1] 0
#> attr(,"minimum_case_days")
#> [1] 2
#> attr(,"minimum_case_fraction_in_period")
#> [1] 0.02
#> attr(,"period_length_months")
#> [1] 12
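
The moving_avg column in the $details table above appears consistent with a left-sided 7-day moving average that uses a shorter window for the first days of the series. The sketch below reproduces the first ten values in base R. It is a hypothetical re-implementation for illustration only, not the package's moving_average(), and the partial-window behaviour is inferred from the printed output.

```r
# Hypothetical re-implementation for illustration; not the package's
# moving_average().
sketch_left_moving_average <- function(x, days = 7) {
  vapply(seq_along(x), function(i) {
    # Current day plus up to (days - 1) preceding days
    window <- x[max(1, i - days + 1):i]
    mean(window)
  }, numeric(1))
}

# `cases` column of the first ten rows of the $details table
daily_cases <- c(1, 0, 0, 0, 0, 0, 1, 0, 0, 0)

round(sketch_left_moving_average(daily_cases), 3)
#> [1] 1.000 0.500 0.333 0.250 0.200 0.167 0.286 0.143 0.143 0.143
```

A left-sided window only looks backwards, which suits retrospective analysis: the value for a given day never depends on cases that were reported later.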