Process mining in Python: pm4py

challenge and study developers point

A chat with Sebastiaan (Bas) van Zelst

Sebastiaan (Bas) van Zelst is the Head of Process Mining Research at Fraunhofer FIT and founder of pm4py, among very many other things. He concedes to us an interview about his mixed academic/industrial experience in process mining and the renowned process mining platform for Python.

Tell us a bit about yourself and your research institute, Sebastiaan!

My name is Sebastiaan, Bas in short. I am a researcher in the field of process mining. I obtained my Ph.D. in 2019 at the Eindhoven University of Technology under the supervision of prof. Wil van der Aalst (my Ph.D. topic is process mining on real-time event data streams). I started working as a post-doc in 2018 at the Fraunhofer Institute for Applied Information Systems (Fraunhofer FIT). Initially, prof. van der Aalst was leading the research group. However, I gradually took over the group’s leadership and have been leading the group since 2021. Since 2022 I am also serving as the deputy head of our overarching department, i.e., the Data Science and Intelligence department. Fraunhofer FIT is part of the Fraunhofer Gesellschaft, a German research organization consisting of 76 research institutes (of which FIT is one). The main goal of any Fraunhofer Institute is to transfer knowledge generated in academia into the industry. As such, in my view, process mining is a good match with the Fraunhofer mission, i.e., it is a very application-oriented field of research. Generally, the projects we run at Fraunhofer are either government-funded research projects (also EU-level projects) or joint projects with the industry. The general division is 70/30 for research vs. industry projects. Hence, whereas we often collaborate with industrial partners, a fair share of our work is still devoted to pure research. Together with Prof. Maximilian Röglinger, I founded the Center for Process Intelligence, in which we support organizations in the general digitization of their processes. Next to my appointment at Fraunhofer, I am also affiliated with the Process and Data Science Chair (PADS) of prof. van der Aalst at the RWTH Achen University. In this context, I teach the “Advanced Process Mining” course every year (we are just finishing our first “live” lecture series since the start of the COVID pandemic). Finally, together with my colleague Daniel Schuster, I am starting a spin-off based on our Fraunhofer-developed PMTK software (https://pmtk.io).

While you were defending your PhD thesis, the first publications mentioning pm4py started popping up. How did it all start, and what drove you in this endeavour?

Indeed, although I only obtained my Ph.D. in early 2019, I finished writing the thesis around the middle of 2018. This is pretty normal, as some time usually passes between the finalization of the thesis and its actual defence. Yet, in my case, this may have been a bit longer than usual. In any case, I started working at Fraunhofer FIT in the middle of 2018. One of my first (minor) projects at the time was pm4py. During my Ph.D., most of the research tools in process mining were implemented in the ProM framework. In my view, there are two major problems with scientific software development in ProM. Firstly, the framework is rather complicated and does not lend itself well for rapid prototyping (which, I believe, is a core aspect of scientific software development). Secondly, because ProM is written in Java (which is much more suitable for large-scale enterprise software), connecting it with other data science techniques such as deep learning is much harder – compared to Python. These two problems were my primary motivation to start developing pm4py at the time.

What is pm4py about, and what is the core mission behind it?

The “Process Mining for Python” (pm4py) library is a software library written in the Python programming language that implements various process mining algorithms. Its core goal is indeed to allow users to invoke (complicated) process mining algorithms with just a few lines of code. This characteristic ultimately enables users (that is, researchers, data scientists, consultants, etc.) to reduce the average cycle time of their prototyping activities.

So, pm4py is meant to be used by both members of academia and industry. What are the main similarities and differences you have noticed in these years?

Users that are active in the industry focus more on using the software to obtain output (a process model, conformance checking statistics, and so on) which they can use for further analysis. For them, the outcome is the main goal (after some parameter tuning). Scientists use it much more to develop their new ideas (i.e., using existing algorithms as building blocks) and to compare their new algorithms to existing ones. As such, they focus more on the “numbers” that the output can generate for larger sets of parameters (and data), yet less on the actual implications that these outputs may have for the application domain.

If you were to characterise your tool in one sentence without superlative adjectives and not mentioning competitors, what would this sentence be?

The pm4py library lets users quickly develop custom process mining analyses and algorithms and enables integration with various alternative data science applications.

How do you see the future of process mining and what can we expect from PM4Py in the next few years?

I believe process mining has a lot of potential that still needs to be unlocked. The “traditional” focus of many commercial vendors has been on highly standardized processes, typically of a financial nature, e.g., “Order to Cash”, “Purchase to Pay”. On the one hand, streamlining these processes enables a swift execution of a company’s core processes, yet, on the other hand, an inefficient core process still generates a lot of waste (economically and environmentally). I expect (and hope) that more and more vendors will turn their focus (possibly by specialization of their software) to start supporting the analysis of core processes. If we manage to help organizations do so effectively, the field has years of fruitful work ahead of it. However, at the same time, we (both scientists and process mining technology developers) need to realize that the value of process mining is in the eventual improvement of the processes (and less in the artifacts our algorithms/software generate). As such, we should critically assess to what end our solutions help business owners to really enhance their processes.

As for pm4py, we are now in a consolidation phase. We are soon releasing a new version of the library which simplifies the overall method invocation even more. We have additionally completely revised the documentation, which is a significant step forward. Further, a core focus will be to improve the overall performance of the current algorithms in pm4py. Regarding my point on supporting process improvement, we have recently built a web-based front-end on top of pm4py: PMTK (http://pmtk.io). This is part of an incubation project for which we are we are starting a spin-off company, i.e., enabling the power of pm4py in a user-friendly interface.

Articles in this newsletter

Info about this article

This article has been updated on September 2 2022, 15:52.
A chat with Sebastiaan (Bas) van Zelst