Distributed stream processing and event-based systems are an increasingly critical component of contemporary large-scale data processing applications and are often subject to strict latency and reliability requirements. To meet scalability demands, however, they are typically deployed on distributed clusters of heterogeneous nodes, which leads to unpredictable runtime performance and complex fault characteristics.
The behaviour of these systems is poorly understood, and existing performance and dependability evaluation techniques are ill-equipped to handle the challenges introduced by the complex and distributed nature of event-based systems.
We develop a dynamic code-injection approach to evaluate the performance and dependability of stream processing and event-based systems. Our approach supports fine-grained instrumentation of applications and their runtime infrastructure, and the dynamic injection of code mutations and faults into a production system at runtime. We demonstrate the proposed approach by performing instrumentation and code injection on a distributed Apache Spark cluster.