Before you start any detailed analysis of a dataset, especially if it is a new dataset or one you are not familiar with, it is a good idea to spend a bit of time getting a feel for the data. This will save you time in the longer term and help you get better overall insights from the analysis.
Early on in my career, just after I left clinical practice and started as a management consultant, Excel was my go-to data analysis tool (still very useful!).
When I got hold of a new dataset, I used to (and still do) take some basic steps to get an initial view of the data: applying some filters, building a pivot table, doing some basic calculations and drawing simple graphs to spot any obvious trends.
The same applies to process mining. I wanted to share what I think are the 4 steps that would be most useful to take.
[I am using the great open-source R package bupaR: https://www.bupar.net/]
[As an example, I have used the sepsis event log dataset. This dataset is based on real data from a hospital for patients who suffered from severe infections (sepsis)]
- Summary
See the high level overview.
sepsis %>% summary
An extract from the output below highlights some of the key information you would see, including the number of cases (i.e. patients), the number of distinct activities (e.g. register the patient, give IV fluids, give IV antibiotics, admit to hospital, discharge patient) and the timeframe the dataset covers.
Number of events: 15214
Number of cases: 1050
Number of traces: 846
Number of distinct activities: 16
Average trace length: 14.48952
Start eventlog: 2013-11-07 08:18:29
End eventlog: 2015-06-05 12:25:11
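To make these figures less of a black box, here is a minimal base R sketch of how they can be derived from a plain table of events. The toy log and its column names (case_id, activity, timestamp) are made up for illustration and are not the sepsis data, and this is not how bupaR computes its summary internally.

```r
# Toy event log (hypothetical data, not the sepsis log): one row per event.
log <- data.frame(
  case_id  = c(1, 1, 1, 2, 2, 2, 3, 3),
  activity = c("Register", "IV Fluids", "Discharge",
               "Register", "IV Fluids", "Discharge",
               "Register", "Discharge"),
  timestamp = as.POSIXct(c(
    "2024-01-01 08:00:00", "2024-01-01 09:00:00", "2024-01-02 10:00:00",
    "2024-01-03 08:30:00", "2024-01-03 09:15:00", "2024-01-04 11:00:00",
    "2024-01-05 07:45:00", "2024-01-05 12:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Order events within each case, then collapse each case into its trace
# (the ordered sequence of activities for that case).
log <- log[order(log$case_id, log$timestamp), ]
traces <- tapply(log$activity, log$case_id, paste, collapse = " > ")

n_events         <- nrow(log)                     # number of events
n_cases          <- length(unique(log$case_id))   # number of cases
n_traces         <- length(unique(traces))        # number of distinct traces
n_activities     <- length(unique(log$activity))  # number of distinct activities
avg_trace_length <- n_events / n_cases            # average events per case

range(log$timestamp)                              # start and end of the log
```

The same arithmetic holds for the sepsis output above: 15214 events across 1050 cases gives the average trace length of about 14.49.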
- Default path or happy path
Visualise a basic process flow model built from only the most frequent traces, covering a small share of cases (I chose 10% below), to get a feel for the common ‘variants’.
happy_path <- sepsis %>% filter_trace_frequency(percentage=0.1)
happy_path %>% process_map(type = frequency(value = "absolute_case"))
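If you want to see the idea behind a trace frequency filter, the base R sketch below keeps the most frequent variants until a chosen share of cases is covered. The case ids, traces and the 60% threshold are hypothetical, and the selection rule is my assumption about how such a filter behaves, not a copy of the bupaR implementation.

```r
# Hypothetical trace for each case (case id -> ordered activities).
case_traces <- c(
  c1 = "Register > IV Fluids > Discharge",
  c2 = "Register > IV Fluids > Discharge",
  c3 = "Register > IV Fluids > Discharge",
  c4 = "Register > Discharge",
  c5 = "Register > Discharge",
  c6 = "Register > IV Antibiotics > Admit > Discharge"
)

# Count cases per trace variant, most frequent first.
variant_counts <- sort(table(case_traces), decreasing = TRUE)

# Cumulative share of cases covered as variants are added one by one.
coverage <- cumsum(variant_counts) / length(case_traces)

# Keep the most frequent variants until at least `threshold` of cases is covered
# (assumed rule, chosen for illustration).
threshold <- 0.6
n_keep <- which(coverage >= threshold)[1]
kept_variants <- names(variant_counts)[seq_len(n_keep)]

# Cases that follow one of the kept variants form the 'happy path' subset.
kept_cases <- names(case_traces)[case_traces %in% kept_variants]
```

Here the two most common variants are kept and the one-off trace of case c6 is dropped, which is what makes the resulting process map easier to read.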

- Activity presence
To identify in what percentage of cases a particular activity occurs.
sepsis %>% activity_presence() %>% plot()
In the example below, 78% of patients receive IV antibiotics and 28% return to the Emergency Room.
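Activity presence is simply the share of cases in which an activity shows up at least once. A minimal base R sketch on a made-up log (hypothetical data and column names, not the sepsis log) could look like this:

```r
# Toy event log (hypothetical): case 3 receives IV antibiotics twice.
log <- data.frame(
  case_id  = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4),
  activity = c("Register", "IV Antibiotics", "Discharge",
               "Register", "Discharge",
               "Register", "IV Antibiotics", "IV Antibiotics",
               "Register", "IV Antibiotics"),
  stringsAsFactors = FALSE
)

n_cases <- length(unique(log$case_id))

# Drop repeats so each activity counts at most once per case.
case_activity <- unique(log[, c("case_id", "activity")])

# Share of cases in which each activity occurs, highest first.
counts   <- table(case_activity$activity)
presence <- setNames(as.numeric(counts), names(counts)) / n_cases
presence <- sort(presence, decreasing = TRUE)
```

Note that the repeated IV antibiotics event in case 3 does not inflate the figure: presence is counted per case, not per event.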

- Precedence matrix
This shows which activity (consequent) follows which other activity (antecedent) and how many times that occurs.
sepsis %>% precedence_matrix(type = "absolute") %>% plot()
In the example below, it’s interesting to note that a white blood cell count (Leucocytes) and a CRP test (a marker for inflammation, which can point to infection) are done almost interchangeably in terms of which test was done first.
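The counting behind a precedence matrix can be sketched in base R by pairing each event with the next event in the same case. The toy log below is hypothetical and already ordered within each case; unlike the bupaR plot, this sketch does not add artificial start and end activities.

```r
# Toy event log (hypothetical), already ordered by case and time.
log <- data.frame(
  case_id  = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  activity = c("Register", "Leucocytes", "CRP", "Discharge",
               "Register", "CRP", "Leucocytes", "Discharge",
               "Register", "Leucocytes", "CRP"),
  stringsAsFactors = FALSE
)

n <- nrow(log)

# Compare each row with the next one; keep pairs that stay within one case.
same_case  <- log$case_id[-n] == log$case_id[-1]
antecedent <- log$activity[-n][same_case]   # the activity that comes first
consequent <- log$activity[-1][same_case]   # the activity that directly follows

# Absolute counts: rows are antecedents, columns are consequents.
precedence <- table(antecedent, consequent)
precedence
```

In this toy log Leucocytes is followed by CRP twice and CRP by Leucocytes once, the same kind of two-way pattern described for the sepsis data.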

Hope this is useful.
In terms of overall methodology and approach to process mining, I read a great paper recently and will share more about it in an upcoming post!
[Thank you to Gert Janssenswillen for developing bupaR and for the excellent course on DataCamp.]
Very happy to hear your comments below or feel free to email me to share ideas – janak@usehealthdata.com