Data Pipeline Segmentation

This post details the stages of data processing within a local pipeline managed by the Substrates runtime, from a client call to an Instrument interface method to dispatching an Event to an Outlet registered by a Subscriber.

Stage 1

When an activity occurs within a process, instrumentation in the code (possibly injected by an agent) calls a method on an Instrument, such as Counter::inc or Timer::start, to capture the phenomenon to be routed and relayed.

The purpose of an Instrument is to capture data and forward it to the Inlet it was given at construction.

Sometimes the caller provides the data to the Instrument as an argument to a method call, as when logging an Exception.

Other times the Instrument must interact with the application runtime to collect the data to be emitted. An example of this would be a Timer accessing the current Clock time or an Instrument capturing the current Thread call stack.

This stage executes within the context of the calling Thread, so it must be efficient to avoid perturbing the activity.

Based on the captured data, an Instrument can opt out of forwarding it to the Inlet, short-circuiting the pipeline.
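To make the capture, forward, and opt-out steps concrete, here is a minimal sketch in Java. It is not the Substrates API: the Inlet interface, the SketchTimer class, the clock supplier, and the threshold parameter are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

public class StageOneSketch {

  // Hypothetical stand-in for an Inlet; not the Substrates interface.
  interface Inlet<E> {
    void emit(E value);
  }

  // Hypothetical timer-like Instrument; the Inlet is handed over at construction.
  static final class SketchTimer {
    private final Inlet<Long> inlet;
    private final LongSupplier clock;
    private final long threshold;

    SketchTimer(Inlet<Long> inlet, LongSupplier clock, long threshold) {
      this.inlet = inlet;
      this.clock = clock;
      this.threshold = threshold;
    }

    // Data is collected from the runtime (a clock), not passed in by the caller.
    long start() {
      return clock.getAsLong();
    }

    void stop(long startedAt) {
      long elapsed = clock.getAsLong() - startedAt;
      if (elapsed < threshold) {
        return; // opt out: nothing is forwarded, the pipeline is short-circuited
      }
      inlet.emit(elapsed);
    }
  }

  static List<Long> demo() {
    List<Long> received = new ArrayList<>();
    long[] now = {0L}; // manually advanced clock keeps the example deterministic
    SketchTimer timer = new SketchTimer(received::add, () -> now[0], 10L);

    long first = timer.start();
    now[0] += 3;
    timer.stop(first); // 3 < 10, dropped before reaching the Inlet

    long second = timer.start();
    now[0] += 25;
    timer.stop(second); // 25 >= 10, forwarded to the Inlet
    return received;
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints [25]
  }
}
```

Everything in stop runs on the caller's thread, which is why the sketch does nothing beyond a subtraction, a comparison, and a single call.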


Stage 2

When an Inlet receives data to be emitted, it immediately queues work within the Circuit for execution, provided both the Conduit and the Circuit are still open.

The enqueuing is done within the calling scope of the Instrument interface method, so again, it needs to be efficient.
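The hand-off described in this stage might look roughly like the following sketch. The SketchCircuit, SketchConduit, and SketchInlet classes and their open flags are assumptions made for illustration; the Substrates runtime may model open and closed state quite differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

public class StageTwoSketch {

  // Hypothetical Circuit: a queue of work plus an open flag.
  static final class SketchCircuit {
    final AtomicBoolean open = new AtomicBoolean(true);
    private final Queue<Runnable> queue = new ConcurrentLinkedQueue<>();

    boolean enqueue(Runnable work) {
      return open.get() && queue.offer(work);
    }

    // Stands in for the Circuit's own executor running the queued work.
    void drain() {
      Runnable work;
      while ((work = queue.poll()) != null) {
        work.run();
      }
    }
  }

  // Hypothetical Conduit: only its open flag matters here.
  static final class SketchConduit {
    final AtomicBoolean open = new AtomicBoolean(true);
  }

  // Hypothetical Inlet: on the caller's thread it only checks two flags and enqueues.
  static final class SketchInlet<E> {
    private final SketchConduit conduit;
    private final SketchCircuit circuit;
    private final List<E> sink;

    SketchInlet(SketchConduit conduit, SketchCircuit circuit, List<E> sink) {
      this.conduit = conduit;
      this.circuit = circuit;
      this.sink = sink;
    }

    boolean emit(E value) {
      return conduit.open.get() && circuit.enqueue(() -> sink.add(value));
    }
  }

  static List<String> demo() {
    SketchCircuit circuit = new SketchCircuit();
    SketchConduit conduit = new SketchConduit();
    List<String> sink = new ArrayList<>();
    SketchInlet<String> inlet = new SketchInlet<>(conduit, circuit, sink);

    inlet.emit("a");         // both open: work is queued
    conduit.open.set(false); // close the Conduit
    inlet.emit("b");         // rejected, nothing is queued
    circuit.drain();         // queued work runs later, off the emit path
    return sink;
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints [a]
  }
}
```

The sketch ignores the race between checking a flag and enqueuing; a real runtime would have to decide what happens to work queued just as a Circuit closes.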


Stage 3

An Event is created only when the work is executed, and only if a Subscriber has registered Outlets for the Subject.

It is important to note that changes in Subscriptions are queued up within the Circuit along with Event publications. Queuing such changes maintains linearization (sequential ordering): Events published before a Subscriber's Subscription are not emitted to its Outlets. This dramatically simplifies the processing model and aids diagnostics.

The final step in this stage is to dispatch the Event to Outlets. Circuit implementations can perform this step from within the executor of the Circuit itself or offload to another queue while maintaining order, depending on runtime capacities, processing characteristics, environmental configuration, and performance constraints.
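The ordering guarantee can be illustrated with a single queue that carries both subscription changes and publications. This is a single-threaded sketch under stated assumptions: SketchCircuit, Event, subscribe, publish, and run are hypothetical names, and dispatch happens inline rather than on a separate queue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

public class StageThreeSketch {

  // Hypothetical Event carrying a subject name and a value.
  static final class Event {
    final String subject;
    final int value;

    Event(String subject, int value) {
      this.subject = subject;
      this.value = value;
    }
  }

  // Hypothetical Circuit: subscriptions and publications share one ordered queue.
  static final class SketchCircuit {
    private final Queue<Runnable> queue = new ArrayDeque<>();
    private final List<Consumer<Event>> outlets = new ArrayList<>();
    int eventsCreated = 0;

    void subscribe(Consumer<Event> outlet) {
      queue.add(() -> outlets.add(outlet)); // takes effect only when executed
    }

    void publish(String subject, int value) {
      queue.add(() -> {
        if (outlets.isEmpty()) {
          return; // no Outlets registered: no Event is created
        }
        Event event = new Event(subject, value);
        eventsCreated++;
        for (Consumer<Event> outlet : outlets) {
          outlet.accept(event); // dispatch in queue order
        }
      });
    }

    void run() {
      Runnable work;
      while ((work = queue.poll()) != null) {
        work.run();
      }
    }
  }

  static String demo() {
    SketchCircuit circuit = new SketchCircuit();
    List<Integer> delivered = new ArrayList<>();

    circuit.publish("latency", 1);                     // queued before the subscription
    circuit.subscribe(e -> delivered.add(e.value));
    circuit.publish("latency", 2);
    circuit.publish("latency", 3);
    circuit.run();
    return "delivered=" + delivered + " created=" + circuit.eventsCreated;
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints delivered=[2, 3] created=2
  }
}
```

Because the subscription sits in the queue behind the first publication, the value 1 never reaches the new Outlet, and no Event object is allocated for it.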


Stage 4

An Outlet is the egress point of the data pipeline, except when it links one Instrument’s output to another Instrument’s input. This is an expected, if not encouraged, use case for Outlets: far more processing of Events, such as augmentation and aggregation, is done locally, though potentially across multiple Circuits.
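As a rough illustration of an Outlet that links one stage to the next, the sketch below aggregates upstream values and forwards a sum downstream. The SummingOutlet class and its window parameter are hypothetical, and java.util.function.Consumer stands in for both the Outlet and the downstream input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class StageFourSketch {

  // Hypothetical linking Outlet: sums every `window` upstream values
  // and forwards each total to the downstream input.
  static final class SummingOutlet implements Consumer<Long> {
    private final int window;
    private final Consumer<Long> downstream;
    private long sum;
    private int count;

    SummingOutlet(int window, Consumer<Long> downstream) {
      this.window = window;
      this.downstream = downstream;
    }

    @Override
    public void accept(Long value) {
      sum += value;
      if (++count == window) {
        downstream.accept(sum); // aggregated value enters the next stage
        sum = 0;
        count = 0;
      }
    }
  }

  static List<Long> demo() {
    List<Long> egress = new ArrayList<>();
    SummingOutlet link = new SummingOutlet(3, egress::add);
    for (long v = 1; v <= 7; v++) {
      link.accept(v); // values 1..7 arrive from the upstream pipeline
    }
    return egress; // 7 is still pending in the open window
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints [6, 15]
  }
}
```

Seven upstream values become two downstream ones, which is the kind of local reduction that keeps data volumes small before anything leaves the process.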

Outlets and similar extensions offered by Substrates provide the means to efficiently and transparently mirror pipelines to other processes and to integrate with remote messaging systems such as Apache Kafka, MQTT, Apache Pulsar, and Apache Flink.