31  Reshaping our event logs for bupaR

bupaR is an R package for process mining.

While we’ve primarily worked in Python throughout, sometimes we come across an R package that suits our needs better than the Python equivalent. The excellent pm4py process mining package exists for Python, but bupaR’s visuals are top notch.

There are a few different ways we could get our Python event logs to work with bupaR:

  1. the reticulate package (which runs Python from R) - though due to the complexity of our code, this is likely to run into issues
  2. the rpy2 package (which runs R from Python) - as we only want a little bit of R in a primarily Python project, this might be a better option
  3. Quarto’s features for passing objects like dataframes between R and Python cells
  4. exporting our event log as a csv, importing this into R, and saving the resulting bupaR visuals
    • the visuals we export can then be imported back into Streamlit apps or Quarto reports
    • we could take this even further by using a column-type preserving data format that is interoperable between R and Python, like Feather or Parquet.

In this chapter, we will use option 4.
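
CSV is the simplest route, but it is worth seeing what the type-preserving formats mentioned under option 4 would protect against. The following is a minimal sketch using a small made-up log (the column names and values are illustrative only, not taken from our sample file). It round-trips the data through CSV in memory and shows that a datetime column comes back as plain text.

```python
import io

import pandas as pd

# A tiny made-up log with a datetime column (illustrative values only)
toy_log = pd.DataFrame({
    "entity_id": [1, 2],
    "timestamp": pd.to_datetime(["2025-01-01 00:00:00", "2025-01-01 00:12:00"]),
})

# Round-trip through CSV in memory instead of writing to disk
buffer = io.StringIO()
toy_log.to_csv(buffer, index=False)
buffer.seek(0)
round_tripped = pd.read_csv(buffer)

# The original column is a datetime; after the CSV round trip it is text
print(toy_log["timestamp"].dtype)
print(round_tripped["timestamp"].dtype)
```

For our purposes this is not a problem, because the log we export only contains simulation times in minutes and the timestamps are built on the R side.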

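The first thing we are going to do is add the extra columns that bupaR requires: an ID for each activity instance, and a lifecycle ID marking the start and completion of each activity.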

import pandas as pd

event_log = pd.read_csv("resources/sample_event_log_10_day_10_run.csv")

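# Keep only the queue and resource-use events
df = event_log[event_log["event_type"].isin(["queue", "resource_use"])].copy()

# Number each event within its run; bupaR will use this as the activity instance ID
df["activity_id"] = df.groupby("run").cumcount() + 1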

# Duplicate rows and modify them
df_start = df.copy()
df_start["lifecycle_id"] = "start"

df_end = df.copy()
df_end["lifecycle_id"] = "complete"

# Each 'complete' row takes its timestamp from the next row in the log
df_end["time"] = df_end["time"].shift(-1)

# Combine and sort so each 'start' row is followed by its 'complete' row
df_combined = pd.concat([df_start, df_end]).sort_index(kind="stable")

# Drop the final 'complete' row (there is no next row to take a timestamp from)
df_combined = df_combined[:-1]

df_combined.to_csv("resources/bupar_log.csv", index=False)

df_combined.head(30)
    entity_id   pathway    event_type  ... run  activity_id  lifecycle_id
1           1  Simplest         queue  ...   0            1         start
1           1  Simplest         queue  ...   0            1      complete
2           1  Simplest  resource_use  ...   0            2         start
2           1  Simplest  resource_use  ...   0            2      complete
4           2  Simplest         queue  ...   0            3         start
4           2  Simplest         queue  ...   0            3      complete
5           2  Simplest  resource_use  ...   0            4         start
5           2  Simplest  resource_use  ...   0            4      complete
7           3  Simplest         queue  ...   0            5         start
7           3  Simplest         queue  ...   0            5      complete
8           3  Simplest  resource_use  ...   0            6         start
8           3  Simplest  resource_use  ...   0            6      complete
10          4  Simplest         queue  ...   0            7         start
10          4  Simplest         queue  ...   0            7      complete
11          4  Simplest  resource_use  ...   0            8         start
11          4  Simplest  resource_use  ...   0            8      complete
13          5  Simplest         queue  ...   0            9         start
13          5  Simplest         queue  ...   0            9      complete
15          6  Simplest         queue  ...   0           10         start
15          6  Simplest         queue  ...   0           10      complete
17          7  Simplest         queue  ...   0           11         start
17          7  Simplest         queue  ...   0           11      complete
19          8  Simplest         queue  ...   0           12         start
19          8  Simplest         queue  ...   0           12      complete
21          9  Simplest         queue  ...   0           13         start
21          9  Simplest         queue  ...   0           13      complete
24          5  Simplest  resource_use  ...   0           14         start
24          5  Simplest  resource_use  ...   0           14      complete
26         10  Simplest         queue  ...   0           15         start
26         10  Simplest         queue  ...   0           15      complete

[30 rows x 9 columns]
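
One thing to be aware of in the code above: `shift(-1)` runs over the whole dataframe, so the final event of one run takes its end time from the first event of the next run. Whether that matters depends on your log. The sketch below uses a small made-up log (not the sample file) to compare the plain shift with a per-run shift using `groupby`. It illustrates the difference rather than changing the approach used in this chapter.

```python
import pandas as pd

# Made-up log with two runs (illustrative values only)
toy = pd.DataFrame({
    "run": [0, 0, 1, 1],
    "event": ["wait", "treat", "wait", "treat"],
    "time": [0.0, 5.0, 1.0, 4.0],
})

# Plain shift: the last row of run 0 borrows the first timestamp of run 1
toy["end_plain"] = toy["time"].shift(-1)

# Per-run shift: the last row of each run has no following row, so it is NaN
toy["end_by_run"] = toy.groupby("run")["time"].shift(-1)

print(toy)
```

With the per-run version, every run ends with a missing end time, so those rows would need dropping or filling before export rather than only the very last row.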
library(readr)
library(dplyr)

library(lubridate)

library(bupaverse)
Warning: package 'bupaverse' was built under R version 4.3.3

── Attaching packages ─────────────────────────────────────── bupaverse 0.1.0 ──
✔ bupaR         0.5.4     ✔ processcheckR 0.1.4
✔ edeaR         0.9.4     ✔ processmapR   0.5.6
✔ eventdataR    0.3.1     
Warning: package 'bupaR' was built under R version 4.3.3
Warning: package 'processcheckR' was built under R version 4.3.3
── Conflicts ────────────────────────────────────────── bupaverse_conflicts() ──
✖ processcheckR::contains() masks dplyr::contains()
✖ bupaR::filter()           masks dplyr::filter(), stats::filter()
✖ processmapR::frequency()  masks stats::frequency()
✖ edeaR::setdiff()          masks lubridate::setdiff(), dplyr::setdiff(), base::setdiff()
✖ bupaR::timestamp()        masks utils::timestamp()
✖ processcheckR::xor()      masks base::xor()
library(processanimateR)
Warning: package 'processanimateR' was built under R version 4.3.3
simulation_start <- ymd_hms("2025-01-01 00:00:00")

data <- readr::read_csv("resources/bupar_log.csv")
Rows: 86059 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): pathway, event_type, event, lifecycle_id
dbl (5): entity_id, time, resource_id, run, activity_id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
activity_log <- data |>
    dplyr::filter(run==1) |>
    dplyr::mutate(timestamp_dt = simulation_start + lubridate::dminutes(time)) |>
    bupaR::convert_timestamps("timestamp_dt", ymd_hms) |>
    bupaR::eventlog(
        case_id = "entity_id",
        activity_id = "event",
        activity_instance_id = "activity_id",
        lifecycle_id = "lifecycle_id",
        timestamp = "timestamp_dt",
        resource_id = "resource_id"
        )
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `timestamp_dt = (function (..., quiet = FALSE, tz = "UTC",
  locale = Sys.getlocale("LC_TIME"), ...`.
Caused by warning:
!  4 failed to parse.
# Note: the bupaR documentation recommends calling to_activitylog() at the
# end of this set of steps. In testing, doing so caused significant errors
# with this code, so it is omitted here.
head(activity_log, 20)
# Log of 20 events consisting of:
2 traces 
6 cases 
10 instances of 2 activities 
5 resources 
Events occurred from NA until NA 
 
# Variables were mapped as follows:
Case identifier:        entity_id 
Activity identifier:        event 
Resource identifier:        resource_id 
Activity instance identifier:   activity_id 
Timestamp:          timestamp_dt 
Lifecycle transition:       lifecycle_id 

# A tibble: 20 × 11
   entity_id pathway  event_type   event      time resource_id   run activity_id
       <dbl> <chr>    <chr>        <chr>     <dbl>       <dbl> <dbl>       <dbl>
 1         1 Simplest queue        treatmen…   0            NA     1           1
 2         1 Simplest queue        treatmen…   0            NA     1           1
 3         1 Simplest resource_use treatmen…   0             1     1           2
 4         1 Simplest resource_use treatmen…  12.0           1     1           2
 5         2 Simplest queue        treatmen…  12.0          NA     1           3
 6         2 Simplest queue        treatmen…  12.0          NA     1           3
 7         2 Simplest resource_use treatmen…  12.0           2     1           4
 8         2 Simplest resource_use treatmen…  23.7           2     1           4
 9         3 Simplest queue        treatmen…  23.7          NA     1           5
10         3 Simplest queue        treatmen…  23.7          NA     1           5
11         3 Simplest resource_use treatmen…  23.7           3     1           6
12         3 Simplest resource_use treatmen…  35.6           3     1           6
13         4 Simplest queue        treatmen…  35.6          NA     1           7
14         4 Simplest queue        treatmen…  35.6          NA     1           7
15         4 Simplest resource_use treatmen…  35.6           4     1           8
16         4 Simplest resource_use treatmen…  37.0           4     1           8
17         5 Simplest queue        treatmen…  37.0          NA     1           9
18         5 Simplest queue        treatmen…  37.5          NA     1           9
19         6 Simplest queue        treatmen…  37.5          NA     1          10
20         6 Simplest queue        treatmen…  41.2          NA     1          10
# ℹ 3 more variables: lifecycle_id <chr>, timestamp_dt <dttm>, .order <int>
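
The conversion from simulation minutes to real timestamps happens on the R side above, but it could also be done in Python before export. The following is a minimal sketch under that assumption, using made-up times. The `timestamp_dt` column name simply mirrors the R code, `timestamp_text` is a hypothetical extra column, and the start date matches `simulation_start`. Formatting with an explicit time component means a value at exactly midnight is still written in full.

```python
import pandas as pd

simulation_start = pd.Timestamp("2025-01-01 00:00:00")

# Made-up simulation times in minutes (illustrative only)
toy = pd.DataFrame({"time": [0.0, 12.0, 90.5]})

# Add the elapsed minutes to the simulation start
toy["timestamp_dt"] = simulation_start + pd.to_timedelta(toy["time"], unit="min")

# Write every timestamp with a full date and time, including midnight
toy["timestamp_text"] = toy["timestamp_dt"].dt.strftime("%Y-%m-%d %H:%M:%S")

print(toy)
```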

31.1 Exploring bupaR Outputs

31.1.1 bupaR Static Outputs

31.1.1.1 Frequency Maps

31.1.1.1.1 Absolute (counts)
activity_log %>%
    process_map(frequency("absolute"))
Warning in process_map.eventlog(., frequency("absolute")): Some of the
timestamps in the supplied event log are missing (NA values). This may result
in a invalid process map!
31.1.1.1.2 Absolute case
activity_log %>%
    process_map(frequency("absolute-case"))
Warning in process_map.eventlog(., frequency("absolute-case")): Some of the
timestamps in the supplied event log are missing (NA values). This may result
in a invalid process map!
31.1.1.1.3 Relative
activity_log %>%
    process_map(frequency("relative"))
Warning in process_map.eventlog(., frequency("relative")): Some of the
timestamps in the supplied event log are missing (NA values). This may result
in a invalid process map!
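
As a rough guide to what the three frequency maps count, the sketch below reproduces the node-level idea in pandas on a small made-up log: "absolute" counts activity instances, "absolute-case" counts distinct cases, and "relative" expresses instances as a share of the total. This is our reading of the options, shown for intuition only. The bupaR documentation is the authority on exactly how nodes and edges are calculated.

```python
import pandas as pd

# Made-up activity instances: one row per activity instance (illustrative only)
instances = pd.DataFrame({
    "entity_id": [1, 1, 2, 3],
    "event": ["triage", "triage", "triage", "treatment"],
})

summary = instances.groupby("event").agg(
    absolute=("entity_id", "size"),          # number of activity instances
    absolute_case=("entity_id", "nunique"),  # number of distinct cases
)

# Share of all activity instances that belong to each activity
summary["relative"] = summary["absolute"] / summary["absolute"].sum()

print(summary)
```

Here entity 1 goes through triage twice, so triage has three instances but only two cases.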

31.1.1.2 Performance Maps

31.1.1.2.1 Mean Times
activity_log %>%
    process_map(performance())
Warning in process_map.eventlog(., performance()): Some of the timestamps in
the supplied event log are missing (NA values). This may result in a invalid
process map!
31.1.1.2.2 Max Times
activity_log %>%
    process_map(performance(FUN = max))
Warning in process_map.eventlog(., performance(FUN = max)): Some of the
timestamps in the supplied event log are missing (NA values). This may result
in a invalid process map!
Warning: There was 1 warning in `summarize()`.
ℹ In argument: `label = do.call(...)`.
ℹ In group 5: `ACTIVITY_CLASSIFIER_ = NA` and `from_id = NA`.
Caused by warning in `type()`:
! no non-missing arguments to max; returning -Inf
Warning: There were 2 warnings in `summarize()`.
The first warning was:
ℹ In argument: `value = do.call(...)`.
ℹ In group 1: `ACTIVITY_CLASSIFIER_ = "ARTIFICIAL_END"`, `next_act = NA`,
  `from_id = 1`, `to_id = NA`.
Caused by warning in `type()`:
! no non-missing arguments to max; returning -Inf
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.

31.1.1.3 Common Routes

activity_log %>%
    trace_explorer(n_traces = 10)
Warning: Fewer traces (2) found than specified `n_traces` (10).

31.1.1.4 Activity Presence

activity_log %>%
    activity_presence() %>%
    plot()

31.1.1.5 Processing Time

activity_log %>%
    processing_time("resource-activity", units = "mins") %>%
    plot()
Warning: Removed 3 rows containing non-finite values (`stat_boxplot()`).
Warning: The following aesthetics were dropped during statistical transformation: colour
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

activity_log %>%
    processing_time("activity", units = "mins") %>%
    plot()
Warning: Removed 3 rows containing non-finite values (`stat_boxplot()`).
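
Processing time here is derived from the paired start and complete rows we created in Python. As a sanity check on the reshaped log, a similar quantity can be approximated in pandas by pivoting on the lifecycle column. The sketch below uses made-up rows with the same column names as our bupaR log and is not a replacement for bupaR's own calculation.

```python
import pandas as pd

# Made-up rows in the same shape as the reshaped log (illustrative only)
log = pd.DataFrame({
    "activity_id": [1, 1, 2, 2],
    "event": ["wait", "wait", "treat", "treat"],
    "lifecycle_id": ["start", "complete", "start", "complete"],
    "time": [0.0, 0.0, 0.0, 12.0],
})

# One row per activity instance, with a column for each lifecycle stage
wide = log.pivot(
    index=["activity_id", "event"],
    columns="lifecycle_id",
    values="time",
)

# Duration of each activity instance in simulation minutes
wide["duration_mins"] = wide["complete"] - wide["start"]

print(wide)
```

Instances with a missing start or complete time produce a missing duration, which is one way to spot rows that would later be dropped from a plot.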

31.1.1.6 Idle Time

activity_log %>%
    idle_time("resource", units = "mins") %>%
    plot()

31.1.2 bupaR Animated Outputs

activity_log %>%
    animate_process()
Warning in process_map.eventlog(eventlog, render = F, ...): Some of the
timestamps in the supplied event log are missing (NA values). This may result
in a invalid process map!

31.2 Working with a more advanced simulation log

31.2.1 Generating a flexible Python function for moving from event logs to process logs

Let’s turn our code for processing our event log into its own reusable function.

To help our visualisations distinguish between different resources, we’ll also prefix each resource ID with the name of the event it is used in.

def process_event_log_for_bupar(event_log_path):
    event_log = pd.read_csv(event_log_path)

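    # Keep only the queue and resource-use events
    df = event_log[event_log["event_type"].isin(["queue", "resource_use"])].copy()

    # Number each event within its run; bupaR will use this as the activity instance ID
    df["activity_id"] = df.groupby("run").cumcount() + 1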

    # Duplicate rows and modify them
    df_start = df.copy()
    df_start["lifecycle_id"] = "start"

    df_end = df.copy()
    df_end["lifecycle_id"] = "complete"

    # Each 'complete' row takes its timestamp from the next row in the log
    df_end["time"] = df_end["time"].shift(-1)

    # Combine and sort so each 'start' row is followed by its 'complete' row
    df_combined = pd.concat([df_start, df_end]).sort_index(kind="stable")

    # Drop the final 'complete' row (there is no next row to take a timestamp from)
    df_combined = df_combined[:-1]

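    # Prefix each resource ID with its event name so that resources used at
    # different steps can be told apart (a missing resource ID is formatted as "nan")
    df_combined["resource_id"] = df_combined.apply(lambda x: f"{x['event']}_{x['resource_id']:.0f}", axis=1)

    # Save alongside the input file, e.g. "log.csv" becomes "log_bupar_log.csv"
    df_combined.to_csv(f"{event_log_path.replace('.csv', '')}_bupar_log.csv", index=False)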

    return df_combined

bupar_log_complex = process_event_log_for_bupar("resources/complex_event_log.csv")

bupar_log_complex.head(30)
    entity_id     pathway  ... activity_id lifecycle_id
1           1  Non-Trauma  ...           1        start
1           1  Non-Trauma  ...           1     complete
2           1  Non-Trauma  ...           2        start
2           1  Non-Trauma  ...           2     complete
4           2  Non-Trauma  ...           3        start
4           2  Non-Trauma  ...           3     complete
5           2  Non-Trauma  ...           4        start
5           2  Non-Trauma  ...           4     complete
7           1  Non-Trauma  ...           5        start
7           1  Non-Trauma  ...           5     complete
8           1  Non-Trauma  ...           6        start
8           1  Non-Trauma  ...           6     complete
10          2  Non-Trauma  ...           7        start
10          2  Non-Trauma  ...           7     complete
11          2  Non-Trauma  ...           8        start
11          2  Non-Trauma  ...           8     complete
13          1  Non-Trauma  ...           9        start
13          1  Non-Trauma  ...           9     complete
14          1  Non-Trauma  ...          10        start
14          1  Non-Trauma  ...          10     complete
16          2  Non-Trauma  ...          11        start
16          2  Non-Trauma  ...          11     complete
17          2  Non-Trauma  ...          12        start
17          2  Non-Trauma  ...          12     complete
22          2  Non-Trauma  ...          13        start
22          2  Non-Trauma  ...          13     complete
23          2  Non-Trauma  ...          14        start
23          2  Non-Trauma  ...          14     complete
25          3  Non-Trauma  ...          15        start
25          3  Non-Trauma  ...          15     complete

[30 rows x 9 columns]
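
The row-wise f-string in the function above formats a missing `resource_id` as the text `nan`, so rows without a resource end up with labels ending in `_nan`. If that is not wanted, one option is to label only the rows that actually have a resource. The sketch below shows both behaviours on a made-up two-row log; the event names and variable names are illustrative and are not taken from the real log.

```python
import numpy as np
import pandas as pd

# Made-up log: the first row has no resource (illustrative only)
toy = pd.DataFrame({
    "event": ["example_wait", "example_treat"],
    "resource_id": [np.nan, 2.0],
})

# Row-wise f-string: NaN is formatted as the text "nan"
row_wise = toy.apply(lambda x: f"{x['event']}_{x['resource_id']:.0f}", axis=1)

# Alternative: only build a label where a resource ID is present
has_resource = toy["resource_id"].notna()
labelled = pd.Series(pd.NA, index=toy.index, dtype="object")
labelled.loc[has_resource] = (
    toy.loc[has_resource, "event"]
    + "_"
    + toy.loc[has_resource, "resource_id"].astype(int).astype(str)
)

print(row_wise.tolist())
print(labelled.tolist())
```

Keeping the missing values missing means they are written as empty cells in the CSV, which R would then read as `NA`.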

31.2.2 Writing a flexible R function for generating a bupaR activity log

Let’s also take this opportunity to turn our conversion of the logs into a function.

library(readr)
library(dplyr)
library(lubridate)
library(bupaverse)
library(rlang)

Attaching package: 'rlang'
The following object is masked from 'package:edeaR':

    trace_length
library(glue)

create_activity_log <- function(filepath,
                                run_id = 1,
                                simulation_start = ymd_hms("2025-01-01 00:00:00"),
                                case_id = "entity_id",
                                activity_id = "event",
                                activity_instance_id = "activity_id",
                                lifecycle_id = "lifecycle_id",
                                resource_id = "resource_id",
                                time_column = "time",
                                run_column = "run") {
  # Read the data
  data <- readr::read_csv(filepath)

  # Validate that all required columns exist
  required_cols <- c(case_id, activity_id, activity_instance_id, lifecycle_id,
                     resource_id, time_column, run_column)
  missing_cols <- setdiff(required_cols, names(data))
  if (length(missing_cols) > 0) {
    stop(glue::glue("Missing required columns: {paste(missing_cols, collapse=', ')}"))
  }

  # Tidy evaluation symbols
  time_col_sym <- rlang::sym(time_column)
  run_col_sym <- rlang::sym(run_column)

  # Validate that the requested run_id exists
  available_runs <- unique(dplyr::pull(data, !!run_col_sym))
  if (!run_id %in% available_runs) {
    stop(glue::glue("Run ID {run_id} not found. Available run IDs: {paste(available_runs, collapse=', ')}"))
  }

  # Filter, create timestamp, and build event log
  activity_log <- data |>
    dplyr::filter(!!run_col_sym == run_id) |>
    dplyr::mutate(timestamp_dt = simulation_start + lubridate::dminutes(!!time_col_sym)) |>
    bupaR::eventlog(
      case_id = case_id,
      activity_id = activity_id,
      activity_instance_id = activity_instance_id,
      lifecycle_id = lifecycle_id,
      timestamp = "timestamp_dt",
      resource_id = resource_id
    )

  return(activity_log)
}
# Example usage:
activity_log <- create_activity_log("resources/complex_event_log_bupar_log.csv")
Rows: 163533 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): pathway, event, event_type, resource_id, lifecycle_id
dbl (4): entity_id, time, run, activity_id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View a few entries
head(activity_log, 20)
# Log of 20 events consisting of:
2 traces 
2 cases 
10 instances of 6 activities 
8 resources 
Events occurred from 2025-01-01 00:37:35 until 2025-01-01 01:12:25 
 
# Variables were mapped as follows:
Case identifier:        entity_id 
Activity identifier:        event 
Resource identifier:        resource_id 
Activity instance identifier:   activity_id 
Timestamp:          timestamp_dt 
Lifecycle transition:       lifecycle_id 

# A tibble: 20 × 11
   entity_id pathway    event     event_type  time resource_id   run activity_id
       <dbl> <chr>      <chr>     <chr>      <dbl> <chr>       <dbl>       <dbl>
 1         1 Non-Trauma triage_w… queue       37.6 triage_wai…     1           1
 2         1 Non-Trauma triage_w… queue       37.6 triage_wai…     1           1
 3         1 Non-Trauma triage_b… resource_…  37.6 triage_beg…     1           2
 4         1 Non-Trauma triage_b… resource_…  51.8 triage_beg…     1           2
 5         2 Non-Trauma triage_w… queue       51.8 triage_wai…     1           3
 6         2 Non-Trauma triage_w… queue       51.8 triage_wai…     1           3
 7         2 Non-Trauma triage_b… resource_…  51.8 triage_beg…     1           4
 8         2 Non-Trauma triage_b… resource_…  52.0 triage_beg…     1           4
 9         1 Non-Trauma MINORS_r… queue       52.0 MINORS_reg…     1           5
10         1 Non-Trauma MINORS_r… queue       52.0 MINORS_reg…     1           5
11         1 Non-Trauma MINORS_r… resource_…  52.0 MINORS_reg…     1           6
12         1 Non-Trauma MINORS_r… resource_…  60.3 MINORS_reg…     1           6
13         1 Non-Trauma MINORS_e… queue       60.3 MINORS_exa…     1           7
14         1 Non-Trauma MINORS_e… queue       60.3 MINORS_exa…     1           7
15         1 Non-Trauma MINORS_e… resource_…  60.3 MINORS_exa…     1           8
16         1 Non-Trauma MINORS_e… resource_…  65.9 MINORS_exa…     1           8
17         2 Non-Trauma MINORS_r… queue       65.9 MINORS_reg…     1           9
18         2 Non-Trauma MINORS_r… queue       65.9 MINORS_reg…     1           9
19         2 Non-Trauma MINORS_r… resource_…  65.9 MINORS_reg…     1          10
20         2 Non-Trauma MINORS_r… resource_…  72.4 MINORS_reg…     1          10
# ℹ 3 more variables: lifecycle_id <chr>, timestamp_dt <dttm>, .order <int>

31.2.3 bupaR Static Outputs

31.2.3.1 Frequency Maps

31.2.3.1.1 Absolute (counts)
activity_log %>%
    process_map(frequency("absolute"))
31.2.3.1.2 Absolute case
activity_log %>%
    process_map(frequency("absolute-case"))
31.2.3.1.3 Relative
activity_log %>%
    process_map(frequency("relative"))

31.2.3.2 Performance Maps

31.2.3.2.1 Mean Times
activity_log %>%
    process_map(performance())
31.2.3.2.2 Max Times
activity_log %>%
    process_map(performance(FUN = max))
Warning: There was 1 warning in `summarize()`.
ℹ In argument: `label = do.call(...)`.
ℹ In group 15: `ACTIVITY_CLASSIFIER_ = NA` and `from_id = NA`.
Caused by warning in `type()`:
! no non-missing arguments to max; returning -Inf
Warning: There were 2 warnings in `summarize()`.
The first warning was:
ℹ In argument: `value = do.call(...)`.
ℹ In group 1: `ACTIVITY_CLASSIFIER_ = "ARTIFICIAL_END"`, `next_act = NA`,
  `from_id = 1`, `to_id = NA`.
Caused by warning in `type()`:
! no non-missing arguments to max; returning -Inf
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.

31.2.3.3 Common Routes

activity_log %>%
    trace_explorer(n_traces = 10)
Warning: Fewer traces (9) found than specified `n_traces` (10).

31.2.3.4 Activity Presence

activity_log %>%
    activity_presence() %>%
    plot()

31.2.3.5 Processing Time

activity_log %>%
    processing_time("resource-activity", units = "mins") %>%
    plot()

activity_log %>%
    processing_time("activity", units = "mins") %>%
    plot()

31.2.3.6 Idle Time

activity_log %>%
    idle_time("resource", units = "mins") %>%
    plot()

31.2.4 bupaR Animated Outputs

activity_log %>%
    animate_process(style = "background-color: white;")