BPI Challenges: 10 years of real-life datasets


Presented by Boudewijn van Dongen

My name is Boudewijn van Dongen and I am a professor in Process Analytics at the Eindhoven University of Technology. In 2011, I organized the first BPI Challenge, at the time co-located with the Business Process Intelligence (BPI) workshop, a satellite event of the Business Process Management conference. This year, we celebrate the 10th edition of the BPI Challenge. The challenge has become a key activity of the IEEE Task Force on Process Mining and has been co-located with the International Conference on Process Mining since 2019. Since the first edition in 2011, the BPI Challenge has grown into a well-known challenge and a source of datasets used by many in the process mining community to validate their ideas.

The idea of organizing the BPI Challenge came from the realization that many researchers were having difficulties finding case studies. While companies were interested in working with universities to gain insights into their processes, they often kept their data confidential, and therefore it was impossible to compare one technique to another in a reproducible way. Hence, the BPI Challenge was the start of a community effort to publicly share real-life data for process mining.

The aim of the challenge was twofold from the beginning. On the one hand, companies that shared their data would get free access to experts in process mining; on the other hand, the research community would build a collection of benchmark datasets containing all sorts of process data.

To allow this benchmark collection to grow, the datasets were published in the 4TU Centre for Research Data. This center is an initiative of the universities of technology in The Netherlands and it serves as a library for storing data. As it is a library, the data is static (once it’s there, it’s there forever) and gets a DOI for reference. At the time, the center was mainly used for storing sensor data which individual researchers could not store locally. I recall the first discussions we had with the representatives, who asked about the size of our datasets. When I replied that our log files could grow to gigabytes, their response was: “per minute?”, as they were used to storing live streams from huge sensor arrays. The 4TU Centre for Research Data has been supportive of our initiative from the start. Not only do they publish the BPI Challenge datasets, but over time many other process mining benchmark datasets have been published there by researchers all over the world. And our community has an excellent reputation within the center, as their funding depends on how much the data they publish is used. In the period 2011 to 2018, 16 out of the 20 most downloaded datasets from the 4TU Centre for Research Data were process mining datasets. Furthermore, all datasets in the top 10 were process mining datasets and 7 of them were BPI Challenge datasets.

The first dataset, from 2011, was provided by a Dutch academic hospital. It contains 1143 traces and a little over 150 thousand events. From a process point of view, this dataset turned out to be fairly complex and none of the mining algorithms at the time could make much sense of it. However, this changed drastically in 2012. The BPI 2012 dataset is arguably the most analyzed dataset in process mining. This dataset was provided to the community by a Dutch financial institute and it contains data from a loan application process. Since 2012 it has appeared in hundreds of papers (albeit often without reference to the DOI, much to the regret of the data center).
The data for the BPI Challenge 2012 came from a very structured real-life process. Over time, models have been developed that quite accurately describe this process (although few process mining algorithms can actually discover these models automatically) and the dataset has served as a benchmark for many papers as it contains resource information, lifecycle information and many more interesting elements.
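To give an impression of how such a benchmark log is typically used, below is a minimal sketch based on the open-source pm4py library: it loads the log, discovers a Petri net with the Inductive Miner and checks how well that model replays the data. The file name is an assumption; the log would first have to be downloaded from the 4TU Centre for Research Data.

```python
# Minimal sketch (assumptions: pm4py is installed and the BPI Challenge 2012
# log has been downloaded from 4TU under the hypothetical file name below).
import pm4py

# Read the XES event log; pm4py handles both .xes and .xes.gz files.
log = pm4py.read_xes("BPI_Challenge_2012.xes.gz")

# Discover a Petri net with the Inductive Miner.
net, initial_marking, final_marking = pm4py.discover_petri_net_inductive(log)

# Check how well the discovered model replays the original log.
fitness = pm4py.fitness_token_based_replay(log, net, initial_marking, final_marking)
print(fitness)
```

Whether such an automatically discovered model comes close to the hand-made models mentioned above is exactly the kind of question this log has been used to benchmark.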

To ensure companies would be willing to contribute data, the BPI Challenge purposely did not focus on the technical side of process mining. Instead, the challenge has always been centered around creating insights into the process for the company. Any insight that brings value is appreciated, both by the Jury and by the data owners, and this focus on business insights has led to companies being open and eager to contribute data. We’ve had the hospital in 2011, Volvo IT in 2013, Rabobank ICT Services in 2014, a group of Dutch municipalities in 2015 and the Dutch unemployment agency in 2016. Then, in 2017, the financial institute that provided data in 2012 volunteered to again provide an event log from the same process. More events, more cases and much more data made this dataset even richer than the one from 2012, and it’s not surprising that in 2017 this was the most resolved DOI of the entire 4TU Centre for Research Data.

The process of collecting data for the BPI Challenge is a lengthy one. Usually, I start looking for datasets directly after the challenge winners are announced and the aim is always to have the data ready before the start of the second academic semester, that is, before February 1. Once a company is interested, it typically takes 4 to 6 months to get approval for the publication of a fully anonymized dataset from the right people in the organization, and sometimes this process involves the funniest of activities. In one instance, I had written approval from the highest management layer in the organization to publish the data, as long as I did so in accordance with their internal policies, which stated that “all data has to be encrypted before it leaves the building”. Hence, I went there with a hard disk onto which we copied a password-protected, encrypted archive. I then went back to the university, decrypted the archive and uploaded it to the Internet.

The BPI Challenge 2018 was the first edition of the challenge where the data owner reached out to me asking to participate, which sped up the approval process considerably. Unfortunately, around the same time, the European Union implemented the General Data Protection Regulation. This piece of legislation made companies much more worried about sharing data. Even though the BPI Challenge datasets have always been properly anonymized and no data can be traced back to individuals, companies became reluctant to share data.

In 2019 and 2020, the process of obtaining a dataset and the necessary permission to publish it was harder than ever. Several companies that were interested in the idea did not follow through in the end, usually because higher management was worried about the risks. In both years, I had to follow up on several leads before obtaining the right data and the right permissions. I apologize to those of you who rely on these datasets for your teaching in the second semester. If any of you have a company contact who you think may be interested in sharing data for the BPI Challenge 2021 or beyond, please do not hesitate to contact me. I’m always looking for data! Contributing data to the BPI Challenge means that the company gets access to the best expertise in process mining.

This year’s dataset is provided to you by our own university: https://icpmconference.org/2020/bpi-challenge/. It contains event data related to the handling of travel permits and declarations, and the organizational entities involved want to know if there is anything they can do to improve the process. Although I’m quite sure that you will find it efficient already compared with your local processes, I’m looking forward to seeing your recommendations. I expect that I will be able to get an updated dataset from my university for 2022, containing the data of travel permits and declarations for the entire period 2017 to 2021. Especially for those of you working on concept drift, this would be an extremely interesting dataset, with a sudden drift in early 2020 due to COVID-19 and a gradual drift in 2021 as the world returns to normal.
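Should such an updated log become available, a very rough first check for the sudden drift would be to count case starts per month and look for a break around March 2020. Below is a hedged sketch along those lines; the file name is hypothetical and standard XES attribute names are assumed.

```python
# Rough sketch: count case starts per month to eyeball a sudden drift.
# Assumptions: a future log covering 2017-2021 with standard XES columns;
# the file name is hypothetical.
import pandas as pd
import pm4py

log = pm4py.read_xes("travel_permits_2017_2021.xes.gz")
df = pm4py.convert_to_dataframe(log)

# Take the first event of every case as an approximation of the case start.
starts = df.sort_values("time:timestamp").groupby("case:concept:name").first()

# Count case starts per calendar month; a sharp drop around March 2020
# would be a first hint of the sudden drift caused by COVID-19.
per_month = starts.groupby(pd.Grouper(key="time:timestamp", freq="M")).size()
print(per_month)
```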