Detect disease clusters with detect_disease_clusters(). Use has_clusters() to test whether its output contains any clusters (returns TRUE or FALSE), or use format() to return the detected clusters as a table.

detect_disease_clusters(
  df,
  column_date = NULL,
  column_patientid = NULL,
  based_on_historic_maximum = FALSE,
  period_length_months = 12,
  minimum_cases = 5,
  minimum_days = 0,
  minimum_case_days = 2,
  minimum_case_fraction_in_period = 0.02,
  threshold_percentile = 97.5,
  remove_outliers = TRUE,
  remove_outliers_coefficient = 1.5,
  moving_average_days = 7,
  moving_average_side = "left",
  case_free_days = 14,
  ...
)

n_clusters(x)

has_clusters(x, n = 1)

has_ongoing_cluster(x, dates = Sys.Date() - 1)

has_cluster_before(x, date)

has_cluster_after(x, date)

Arguments

df

Data set: This must consist of only positive results. The minimal data set should include a date column and a patient column. Do not summarize on patient IDs; this will be handled automatically.

column_date

Name of the column to use for dates. If left blank, the first date column will be used.

column_patientid

Name of the column to use for patient IDs. If left blank, the first column resembling "patient|patid" will be used.

based_on_historic_maximum

A logical to indicate whether the cluster detection should be based on the maximum of previous years. The default is FALSE, which uses all historic data points.

period_length_months

Number of months per period.

minimum_cases

Minimum number of cases that a cluster requires to be considered a cluster.

minimum_days

Minimum number of days that a cluster requires to be considered a cluster.

minimum_case_days

Minimum number of days with cases that a cluster requires to be considered a cluster.

minimum_case_fraction_in_period

Minimum fraction of cluster cases in a period that a cluster requires to be considered a cluster.

threshold_percentile

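Percentile to use as the threshold for cluster detection. Defaults to 97.5. In the examples, a value below 1 such as 0.99 is reported as 99 in the parameters used.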

remove_outliers

A logical to indicate whether outliers should be removed before determining the threshold.

remove_outliers_coefficient

Coefficient used for outlier determination.

moving_average_days

Number of days to set in moving_average(). Defaults to a whole week (7).

moving_average_side

Side of days to set in moving_average(). Defaults to "left" for retrospective analysis.

case_free_days

Number of days to set in get_episode().

...

Not used at the moment.

x

Output of detect_disease_clusters().

n

Number of clusters that must at least be present for has_clusters() to return TRUE. Defaults to 1.

dates

Date(s) to test: for each date, whether any of the clusters contains that date. Defaults to yesterday.

date

Date to test against: has_cluster_before() tests whether there are any clusters until this date, has_cluster_after() whether there are any clusters since this date.
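The arguments moving_average_days, moving_average_side, remove_outliers, remove_outliers_coefficient and threshold_percentile suggest a pipeline of smoothing, outlier removal and percentile thresholding. The following base R sketch illustrates that idea on a toy series. It is not the package's implementation: left_moving_average() and drop_outliers() are hypothetical helpers, and the outlier rule (Tukey fences, quartiles plus or minus the coefficient times the interquartile range) is an assumption.

```r
# Hypothetical helper, not part of the package: left-sided moving average,
# i.e. the mean of the current day and up to (days - 1) preceding days.
left_moving_average <- function(x, days = 7) {
  vapply(seq_along(x), function(i) {
    mean(x[max(1, i - days + 1):i])
  }, numeric(1))
}

# Hypothetical helper: keep only values within the Tukey fences
# (Q1 - coefficient * IQR, Q3 + coefficient * IQR). Assumed outlier rule.
drop_outliers <- function(x, coefficient = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- coefficient * (q[2] - q[1])
  x[x >= q[1] - fence & x <= q[2] + fence]
}

# Toy daily case counts: a low baseline of 90 days with a burst on days 61 to 67
daily_cases <- rep(c(0, 1, 0, 2, 0, 1, 1, 0, 0, 1), times = 9)
daily_cases[61:67] <- c(3, 4, 5, 4, 6, 5, 4)

moving_avg <- left_moving_average(daily_cases, days = 7)

# A percentile of 97.5 corresponds to a probability of 0.975
threshold <- quantile(drop_outliers(moving_avg), probs = 0.975, names = FALSE)

# Days whose moving average exceeds the threshold are cluster candidates
which(moving_avg > threshold)
#> [1] 61 62 63 64 65 66 67 68 69 70 71 72 73
```

Because the moving average looks back over the preceding week, the flagged range extends a few days beyond the burst itself.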

Details

A (disease) cluster is defined as an unusually large aggregation of disease events in time or space (ATSDR, 2008). Clusters are common, particularly in large populations. From a statistical standpoint, it is nearly inevitable that some clusters of chronic diseases will emerge within communities such as schools, church groups, social circles, or neighborhoods. Such clusters are often initially perceived as the product of a specific, predictable process rather than as random occurrences in a particular location, akin to the outcome of a coin toss.
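That apparent clusters arise by chance alone can be illustrated with a small base R simulation. It spreads 300 cases at random over the same date range as the examples and compares the busiest 7-day window with the average window. The seed and the window length of 7 days are arbitrary choices for this sketch.

```r
set.seed(1)  # arbitrary seed, only to make the sketch repeatable

days <- seq(as.Date("2015-01-01"), as.Date("2022-12-31"), by = "1 day")
case_dates <- sample(days, size = 300, replace = TRUE)

# Cases per day, including days without any cases
daily_cases <- as.integer(table(factor(as.character(case_dates),
                                       levels = as.character(days))))

# Number of cases in every window of 7 consecutive days
window_counts <- vapply(
  seq_len(length(days) - 6),
  function(i) sum(daily_cases[i:(i + 6)]),
  integer(1)
)

# Average number of cases per window if cases were spread perfectly evenly
expected_per_window <- 300 * 7 / length(days)
expected_per_window

# The busiest week, which exceeds the average although no process drives it
max(window_counts)
```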

Whether a (suspected) cluster corresponds to an actual increase of disease in the area needs to be assessed by an epidemiologist or biostatistician (ATSDR, 2008).

The function has_ongoing_cluster() returns a logical vector with the same length as dates, so dates can have any length.
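The one-result-per-date behaviour can be mimicked in base R. The sketch below uses the cluster boundaries printed by format() in the Examples section. date_in_any_cluster() is a hypothetical stand-in for has_ongoing_cluster(), and treating the first and last day as inclusive is an assumption.

```r
# Cluster boundaries copied from the format() output in the Examples section
clusters <- data.frame(
  first_day = as.Date(c("2022-08-01", "2022-11-26")),
  last_day  = as.Date(c("2022-08-05", "2022-12-29"))
)

# Hypothetical stand-in for has_ongoing_cluster(): one logical per date,
# TRUE when the date falls inside at least one cluster (bounds inclusive)
date_in_any_cluster <- function(clusters, dates) {
  dates <- as.Date(dates)
  vapply(seq_along(dates), function(i) {
    any(dates[i] >= clusters$first_day & dates[i] <= clusters$last_day)
  }, logical(1))
}

date_in_any_cluster(clusters, c("2022-06-01", "2022-08-03", "2022-12-29"))
#> [1] FALSE  TRUE  TRUE
```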

Examples

cases <- data.frame(date = sample(seq(as.Date("2015-01-01"),
                                      as.Date("2022-12-31"),
                                      "1 day"),
                                  size = 300),
                    patient = sample(LETTERS, size = 300, replace = TRUE))

# -----------------------------------------------------------

check <- detect_disease_clusters(cases, threshold_percentile = 0.99)
#> Using column 'date' for dates
#> Using column 'patient' for patient IDs

has_clusters(check)
#> [1] TRUE
check
#> => Detected 1 disease cluster over the last 12-month period with a total of 7
#> cases.
#> 
#> ── Disease Clusters ──
#> 
#> These disease clusters were found:
#> 1. Between 28 november and 7 december 2022: 7 cases
#> 
#> ── Parameters Used ──
#> 
#> • based_on_historic_maximum: FALSE
#> • minimum_case_days: 2
#> • minimum_case_fraction_in_period: 0.02
#> • minimum_cases: 5
#> • minimum_days: 0
#> • moving_average_days: 7
#> • period_length_months: 12
#> • remove_outliers: TRUE
#> • remove_outliers_coefficient: 1.5
#> • threshold_percentile: 99
#> 
#> ── Summary ──
#> 
#> In total 7 cases between 28 november and 7 december 2022, spread over 1
#> cluster(s).
#> Use `plot2::plot2()` to plot the results.


check2 <- detect_disease_clusters(cases,
                                  minimum_cases = 1,
                                  threshold_percentile = 0.75)
#> Using column 'date' for dates
#> Using column 'patient' for patient IDs

check2
#> => Detected 2 disease clusters over the last 12-month period with a total of 13
#> cases.
#> 
#> ── Disease Clusters ──
#> 
#> These disease clusters were found:
#> 1. Between 1 and 5 augustus 2022: 2 cases
#> 2. Between 26 november and 29 december 2022: 11 cases
#> 
#> ── Parameters Used ──
#> 
#> • based_on_historic_maximum: FALSE
#> • minimum_case_days: 2
#> • minimum_case_fraction_in_period: 0.02
#> • minimum_cases: 1
#> • minimum_days: 0
#> • moving_average_days: 7
#> • period_length_months: 12
#> • remove_outliers: TRUE
#> • remove_outliers_coefficient: 1.5
#> • threshold_percentile: 75
#> 
#> ── Summary ──
#> 
#> In total 13 cases between 1 augustus and 29 december 2022, spread over 2
#> cluster(s).
#> Use `plot2::plot2()` to plot the results.
check2 |> format()
#> # A tibble: 2 × 8
#>   cluster first_day  last_day   first_day_in_period last_day_in_period cases
#>     <int> <date>     <date>                   <int>              <int> <int>
#> 1       1 2022-08-01 2022-08-05                 215                219     2
#> 2       2 2022-11-26 2022-12-29                 332                365    11
#> # ℹ 2 more variables: days <int>, case_days <int>

check2 |> n_clusters()
#> [1] 2
check2 |> has_clusters()
#> [1] TRUE
check2 |> has_clusters(n = 15)
#> [1] FALSE

check2 |> has_ongoing_cluster("2022-06-01")
#> [1] FALSE
check2 |> has_ongoing_cluster(c("2022-06-01", "2022-06-20"))
#> [1] FALSE FALSE
check2 |> has_cluster_before("2022-06-01")
#> [1] FALSE
check2 |> has_cluster_after("2022-06-01")
#> [1] TRUE

check2 |> unclass()
#> $clusters
#> # A tibble: 39 × 6
#>    date       cases cluster day_in_period  days case_days
#>    <date>     <int>   <int>         <int> <int>     <int>
#>  1 2022-08-01     1       1           215     1         1
#>  2 2022-08-02     0       1            NA     2         1
#>  3 2022-08-03     0       1            NA     3         1
#>  4 2022-08-04     0       1            NA     4         1
#>  5 2022-08-05     1       1           219     5         2
#>  6 2022-11-26     1       2           332     1         1
#>  7 2022-11-27     0       2            NA     2         1
#>  8 2022-11-28     1       2           334     3         2
#>  9 2022-11-29     0       2            NA     4         2
#> 10 2022-11-30     1       2           336     5         3
#> # ℹ 29 more rows
#> 
#> $details
#> # A tibble: 2,914 × 10
#>     year date       period_date day_in_period period cases moving_avg
#>    <int> <date>     <date>              <int>  <int> <int>      <dbl>
#>  1  2015 2015-01-07 2000-01-07              1      7     1      1    
#>  2  2015 2015-01-08 2000-01-08              2      7     0      0.5  
#>  3  2015 2015-01-09 2000-01-09              3      7     0      0.333
#>  4  2015 2015-01-10 2000-01-10              4      7     0      0.25 
#>  5  2015 2015-01-11 2000-01-11              5      7     0      0.2  
#>  6  2015 2015-01-12 2000-01-12              6      7     0      0.167
#>  7  2015 2015-01-13 2000-01-13              7      7     0      0.143
#>  8  2015 2015-01-14 2000-01-14              8      7     0      0    
#>  9  2015 2015-01-15 2000-01-15              9      7     0      0    
#> 10  2015 2015-01-16 2000-01-16             10      7     1      0.143
#> # ℹ 2,904 more rows
#> # ℹ 3 more variables: moving_avg_max <dbl>, moving_avg_pctile <dbl>,
#> #   moving_avg_limit <dbl>
#> 
#> attr(,"threshold_percentile")
#> [1] 75
#> attr(,"based_on_historic_maximum")
#> [1] FALSE
#> attr(,"remove_outliers")
#> [1] TRUE
#> attr(,"remove_outliers_coefficient")
#> [1] 1.5
#> attr(,"moving_average_days")
#> [1] 7
#> attr(,"minimum_cases")
#> [1] 1
#> attr(,"minimum_days")
#> [1] 0
#> attr(,"minimum_case_days")
#> [1] 2
#> attr(,"minimum_case_fraction_in_period")
#> [1] 0.02
#> attr(,"period_length_months")
#> [1] 12

# plot the results
# check2 |> plot2()
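
The minimum_cases, minimum_days and minimum_case_days arguments can be read against the per-day rows of the `$clusters` element printed above. The base R sketch below recomputes the totals for cluster 1 from those rows and compares them with the minimums. meets_minimums() is a hypothetical helper, and the assumption that each minimum is applied as "at least this many" is inferred from the argument descriptions rather than taken from the package source.

```r
# Rows of cluster 1, copied from the `$clusters` output above
cluster1 <- data.frame(
  date  = as.Date("2022-08-01") + 0:4,
  cases = c(1L, 0L, 0L, 0L, 1L)
)

# Hypothetical helper: summarise one cluster and test it against the minimums
meets_minimums <- function(cluster, minimum_cases = 5, minimum_days = 0,
                           minimum_case_days = 2) {
  n_cases     <- sum(cluster$cases)                              # total cases
  n_days      <- as.integer(diff(range(cluster$date))) + 1L      # first to last day
  n_case_days <- sum(cluster$cases > 0)                          # days with cases
  n_cases >= minimum_cases &&
    n_days >= minimum_days &&
    n_case_days >= minimum_case_days
}

meets_minimums(cluster1, minimum_cases = 1)  # settings used for check2
#> [1] TRUE
meets_minimums(cluster1)                     # default minimum_cases = 5
#> [1] FALSE
```

With 2 cases on 2 case days over 5 days, cluster 1 passes the settings used for check2 but would not pass the default minimum_cases of 5.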