Improve performance of ma_node_uninit().

This changes the way ma_node_uninit() waits for the audio thread. Now, instead of waiting for the *entire* graph to complete, it only waits for it's local processing to complete.
2026-06-06 06:12:44 +02:00 · 2021-01-05 22:21:56 +10:00
parent 61c6e5b12a
commit c540366921
1 changed files with 215 additions and 186 deletions
@@ -583,43 +583,39 @@ data from the graph. Locking in these areas are achieved by means of spinlocks.

 The main complication with keeping `ma_node_graph_read_pcm_frames()` lock-free stems from the fact
 that a node can be uninitialized, and it's memory potentially freed, while in the middle of being
-processed by `ma_node_graph_read_pcm_frames()` on the audio thread. To solve this, uninitialization
-of a node with `ma_node_uninit()` will spin until `ma_node_graph_read_pcm_frames()` finishes. This
-is the most significant compromise employed by miniaudio as it can, in the worst case, result in
-`ma_node_uninit()` spinning for as long as it takes to process an entire period of audio. For
-example, if `ma_node_graph_read_pcm_frames()` takes a millisecond to process, then
-`ma_node_uninit()` will spin for 1 millisecond in the worst case. You should take this into
-consideration when designing your node management strategy - you may want to have some kind of
-threaded cleanup routines.
+processed on the audio thread. There are times when the audio thread will be referencing a node,
+which means the uninitialization process of a node needs to make sure it delays returning until the
+audio thread is finished so that control is not handed back to the caller thereby giving them a
+chance to free the node's memory.

-It's intuitive to think that instead of waiting for the *entire* graph to finish reading it would
-be possible to just wait for the node itself to finish, thereby making it a bit more efficient to
-uninitialize the node. This doesn't work because of how nodes are connected to each other. Nodes
-are connected to other nodes, which means there's always hole when doing things on a local level.
-A higher level structure that encompasses the entire node graph is required for things to work
-without holes. This is where the `ma_node_graph` object comes into it. This object encompasses the
-entire node graph which enables us to do a waiting system without holes.
+When the audio thread is processing a node, it does so by reading from each of the output buses of
+the node. In order for a node to process data for one of it's output buses, it needs to read from
+each of it's input buses, and so on an so forth. It follows that once all output buses of a node
+are detached, the node as a whole will be disconnected and no further processing will occur unless
+it's output buses are reattached, which won't be happening when the node is being uninitialized.
+By having `ma_node_detach_output_bus()` wait until the audio thread is finished with it, we can
+simplify a few things, at the expense of making `ma_node_detach_output_bus()` a bit slower. By
+doing this, the implementation of `ma_node_uninit()` becomes trivial - just detach all output
+nodes, followed by each of the attachments to each of it's input nodes, and then do any final clean
+up.

-The `ma_node_graph` has two members that are of relevance for implementing the waiting mechanism in
-`ma_node_uninit()`. One is a flag that keeps track of whether or not we are reading from the graph.
-That is, whether or not we are executing `ma_node_graph_read_pcm_frames()`. This assumes that this
-function is never called simultaneously by multiple threads, which wouldn't actually make sense
-anyway. The other variable is a counter which increments at the end of every call to
-`ma_node_graph_read_pcm_frames()`. In `ma_node_uninit()`, the node is first detached from the node
-graph with `ma_node_detach()`. It is safe to detach the node from the graph if it's still in the
-middle of processing, so long as the node remains valid. Once the node is detached, it can be
-guaranteed that future calls to `ma_node_graph_read_pcm_frames()` will never iterate over the node.
-Then, a copy of the current read counter is retrieved followed by a check against the flag that
-tracks whether or not a read is currently being performed. If so, the `ma_node_uninit()` will spin
-until the counter we retrieved earlier does not equal the current counter. Once the counters do not
-match, waiting is over and the uninitialization process can continue without needing to worry about
-thread-safety. See the implementation of `ma_node_uninit()` for more details on this.
+With the above design, the worst-case scenario is `ma_node_detach_output_bus()` taking as long as
+it takes to process the output bus being detached. This will happen if it's called at just the
+wrong moment where the audio thread has just iterated it and has just started processing. The
+caller of `ma_node_detach_output_bus()` will stall until the audio thread is finished, which
+includes the cost of recursively processing it's inputs. This is the biggest compromise made with
+the approach taken by miniaudio for it's lock-free processing system. The cost of detaching nodes
+earlier in the pipeline (data sources, for example) will be cheaper than the cost of detaching
+higher level nodes, such as some kind of final post-processing endpoint. If you need to do mass
+detachments, detach starting from the lowest level nodes and work your way towards the final
+endpoint node (but don't try detaching the node graph's endpoint). If the audio thread is not
+running, detachment will be fast and detachment in any order will be the same.

 Another compromise, albeit less significant, is locking when attaching and detaching nodes. This
-locking is achieved by means of a spinlock in order to reduce overhead. A lock is present for each
-input bus and output bus, totaling 4 for each node. When an output bus is connected to an input
-bus, both the output bus and input bus is locked. This locking is specifically for attaching and
-detaching across different threads and does not affect `ma_node_graph_read_pcm_frames()` in any
+locking is achieved by means of a spinlock in order to reduce memory overhead. A lock is present
+for each input bus and output bus, totaling 4 for each node. When an output bus is connected to an
+input bus, both the output bus and input bus is locked. This locking is specifically for attaching
+and detaching across different threads and does not affect `ma_node_graph_read_pcm_frames()` in any
 way. The locking and unlocking is mostly self-explanatory, but slightly less intuitive part comes
 into it when considering that iterating over attachments must not break as a result of attaching or
 detaching a node while iteration is occuring.
@@ -656,16 +652,6 @@ same call to `ma_node_graph_read_pcm_frames()`. This is an unusual scenario and
 go unnoticed by the majority of people, but it's still an issue to consider.
 */

-/*
-Open Questions
-==============
- What should the maximum bus count be for each node? This is defined at compile time and each node
-  has a fixed sized array of structs representing the input and output buses. It's therefore
-  important to keep this within reason. It's currently set to 2 for each side (maximum of 2 input
-  buses and 2 output buses). The limit can be increased with MA_MAX_NODE_BUS_COUNT, so I'm thinking
-  that this should be set to the maximum needed by the stock nodes and if an application needs more
-  to support their custom effects they can just increase it themselves.
-*/

 /* Must never exceed 255. */
 #ifndef MA_MAX_NODE_BUS_COUNT
@@ -759,8 +745,9 @@ struct ma_node_output_bus
    /* Mutable via multiple threads. The weird ordering here is for packing reasons. */
    volatile ma_uint8 inputNodeInputBusIndex;   /* The index of the input bus on the input. Required for detaching. */
    volatile ma_uint8 flags;                    /* Some state flags for tracking the read state of the output buffer. */
+    volatile ma_uint16 refCount;                /* Reference count for some thread-safety when detaching. */
+    volatile ma_bool8 isAttached;               /* This is used to prevent iteration of nodes that are in the middle of being detached. Used for thread safety. */
    volatile ma_spinlock lock;                  /* Unfortunate lock, but significantly simplifies the implementation. Required for thread-safe attaching and detaching. */
-    ma_uint8 padding[3];
    volatile float volume;                      /* Linear 0..1 */
    volatile ma_node_output_bus* pNext;         /* If null, it's the tail node or detached. */
    volatile ma_node_output_bus* pPrev;         /* If null, it's the head node or detached. */
@@ -775,12 +762,12 @@ typedef struct ma_node_input_bus ma_node_input_bus;
 struct ma_node_input_bus
 {
    /* Mutable via multiple threads. */
-    volatile ma_node_output_bus* pHead; /* If null, the list is empty. */
+    ma_node_output_bus head;        /* Dummy head node for simplifying some lock-free thread-safety stuff. */
    volatile ma_spinlock lock;      /* Unfortunate lock, but significantly simplifies the implementation. Required for thread-safe attaching and detaching. */
+    volatile ma_uint16 nextCounter; /* This is used to determine whether or not the input bus is finding the next node in the list. Used for thread safety when detaching output buses. */

    /* Set once at startup. */
    ma_uint8 channels;                      /* The number of channels in the audio stream for this bus. */
-    ma_uint8 padding[2];
 };


@@ -2792,6 +2779,17 @@ static ma_bool32 ma_node_output_bus_has_read(ma_node_output_bus* pOutputBus)
 }


+static void ma_node_output_bus_set_is_attached(ma_node_output_bus* pOutputBus, ma_bool8 isAttached)
+{
+    c89atomic_exchange_8(&pOutputBus->isAttached, isAttached);
+}
+
+static ma_bool8 ma_node_output_bus_is_attached(ma_node_output_bus* pOutputBus)
+{
+    return c89atomic_load_8(&pOutputBus->isAttached);
+}
+
+
 static ma_result ma_node_output_bus_set_volume(ma_node_output_bus* pOutputBus, float volume)
 {
    MA_ASSERT(pOutputBus != NULL);
@@ -2819,7 +2817,6 @@ static ma_result ma_node_input_bus_init(ma_uint32 channels, ma_node_input_bus* p
    MA_ASSERT(channels < 256);

    MA_ZERO_OBJECT(pInputBus);
-    pInputBus->pHead    = NULL;
    pInputBus->channels = (ma_uint8)channels;

    return MA_SUCCESS;
@@ -2836,6 +2833,22 @@ static void ma_node_input_bus_unlock(ma_node_input_bus* pInputBus)
 }


+static void ma_node_input_bus_next_begin(ma_node_input_bus* pInputBus)
+{
+    c89atomic_fetch_add_16(&pInputBus->nextCounter, 1);
+}
+
+static void ma_node_input_bus_next_end(ma_node_input_bus* pInputBus)
+{
+    c89atomic_fetch_sub_16(&pInputBus->nextCounter, 1);
+}
+
+static ma_uint16 ma_node_input_bus_get_next_counter(ma_node_input_bus* pInputBus)
+{
+    return c89atomic_load_16(&pInputBus->nextCounter);
+}
+
+
 static ma_uint32 ma_node_input_bus_get_channels(const ma_node_input_bus* pInputBus)
 {
    return pInputBus->channels;
@@ -2847,6 +2860,12 @@ static void ma_node_input_bus_detach__no_output_bus_lock(ma_node_input_bus* pInp
    MA_ASSERT(pInputBus  != NULL);
    MA_ASSERT(pOutputBus != NULL);

+    /*
+    Mark the output bus as detached first. This will prevent future iterations on the audio thread
+    from iterating this output bus.
+    */
+    ma_node_output_bus_set_is_attached(pOutputBus, MA_FALSE);
+
    /*
    We cannot use the output bus lock here since it'll be getting used at a higher level, but we do
    still need to use the input bus lock since we'll be updating pointers on two different output
@@ -2878,6 +2897,35 @@ static void ma_node_input_bus_detach__no_output_bus_lock(ma_node_input_bus* pInp
    c89atomic_exchange_ptr(&pOutputBus->pPrev, NULL);   /* As above. */
    pOutputBus->pInputNode             = NULL;
    pOutputBus->inputNodeInputBusIndex = 0;
+
+
+    /*
+    For thread-safety reasons, we don't want to be returning from this straight away. We need to
+    wait for the audio thread to finish with the output bus. There's two things we need to wait
+    for. The first is the part that selects the next output bus in the list, and the other is the
+    part that reads from the output bus. Basically all we're doing is waiting for the input bus
+    to stop referencing the output bus.
+
+    We're doing this part last because we want the section above to run while the audio thread
+    is finishing up with the output bus, just for efficiency reasons. We marked the output bus as
+    detached right at the top of this function which is going to prevent the audio thread from
+    iterating the output bus again.
+    */
+
+    /* Part 1: Wait for the current iteration to complete. */
+    while (ma_node_input_bus_get_next_counter(pInputBus) > 0) {
+        ma_yield();
+    }
+
+    /* Part 2: Wait for any reads to complete. */
+    while (c89atomic_load_16(&pOutputBus->refCount) > 0) {
+        ma_yield();
+    }
+
+    /*
+    At this point we're done detaching and we can be guaranteed that the audio thread is not going
+    to attempt to reference this output bus again (until attached again).
+    */
 }

 #if 0   /* Not used at the moment, but leaving here in case I need it later. */
@@ -2931,14 +2979,14 @@ static void ma_node_input_bus_attach(ma_node_input_bus* pInputBus, ma_node_outpu
        ma_node_input_bus_lock(pInputBus);
        {
            ma_node_output_bus* pNewPrev = NULL;
-            ma_node_output_bus* pNewNext = (ma_node_output_bus*)c89atomic_load_ptr(&pInputBus->pHead);
+            ma_node_output_bus* pNewNext = (ma_node_output_bus*)c89atomic_load_ptr(&pInputBus->head.pNext);

            /* Update the local output bus. */
            c89atomic_exchange_ptr(&pOutputBus->pPrev, pNewPrev);
            c89atomic_exchange_ptr(&pOutputBus->pNext, pNewNext);

            /* Update the other output buses to point back to the local output bus. */
-            c89atomic_exchange_ptr(&pInputBus->pHead, pOutputBus); /* <-- This is where the output bus is actually attached to the input bus. */
+            c89atomic_exchange_ptr(&pInputBus->head.pNext, pOutputBus); /* <-- This is where the output bus is actually attached to the input bus. */

            /* Do the previous pointer last. This is only used for detachment. */
            if (pNewNext != NULL) {
@@ -2946,24 +2994,59 @@ static void ma_node_input_bus_attach(ma_node_input_bus* pInputBus, ma_node_outpu
            }
        }
        ma_node_input_bus_unlock(pInputBus);
+
+        /*
+        Mark the node as attached last. This is used to controlling whether or the output bus will be
+        iterated on the audio thread. Mainly required for detachment purposes.
+        */
+        ma_node_output_bus_set_is_attached(pOutputBus, MA_TRUE);
    }
    ma_node_output_bus_unlock(pOutputBus);
 }

 static ma_node_output_bus* ma_node_input_bus_next(ma_node_input_bus* pInputBus, ma_node_output_bus* pOutputBus)
 {
+    ma_node_output_bus* pNext;
+
    MA_ASSERT(pInputBus != NULL);

    if (pOutputBus == NULL) {
        return NULL;
    }

-    return (ma_node_output_bus*)c89atomic_load_ptr(&pOutputBus->pNext);
+    ma_node_input_bus_next_begin(pInputBus);
+    {
+        pNext = pOutputBus;
+        for (;;) {
+            pNext = (ma_node_output_bus*)c89atomic_load_ptr(&pNext->pNext);
+            if (pNext == NULL) {
+                break;      /* Reached the end. */
+            }
+
+            if (ma_node_output_bus_is_attached(pNext) == MA_FALSE) {
+                continue;   /* The node is not attached. Keep checking. */
+            }
+
+            /* The next node has been selected. */
+            break;
+        }
+
+        /* We need to increment the reference count of the selected node. */
+        if (pNext != NULL) {
+            c89atomic_fetch_add_16(&pNext->refCount, 1);
+        }
+        
+        /* The previous node is no longer being referenced. */
+        c89atomic_fetch_sub_16(&pOutputBus->refCount, 1);
+    }
+    ma_node_input_bus_next_end(pInputBus);
+
+    return pNext;
 }

 static ma_node_output_bus* ma_node_input_bus_first(ma_node_input_bus* pInputBus)
 {
-    return (ma_node_output_bus*)c89atomic_load_ptr(&pInputBus->pHead);
+    return ma_node_input_bus_next(pInputBus, &pInputBus->head);
 }


@@ -2972,6 +3055,7 @@ static ma_result ma_node_input_bus_read_pcm_frames(ma_node* pInputNode, ma_node_
 {
    ma_result result = MA_SUCCESS;
    ma_node_output_bus* pOutputBus;
+    ma_node_output_bus* pFirst;
    ma_uint32 inputChannels;

    /*
@@ -2994,42 +3078,17 @@ static ma_result ma_node_input_bus_read_pcm_frames(ma_node* pInputNode, ma_node_
    inputChannels = ma_node_input_bus_get_channels(pInputBus);

    /*
-    A few optimizations here:
-                
-        1) If there's only a single attachment we just write directly to the output buffer.
-        2) For the first attachment, we do not mix. Instead we just overwrite.
+    We need to be careful with how we call ma_node_input_bus_first() and ma_node_input_bus_next(). They
+    are both critical to our lock-free thread-safety system. We can only call ma_node_input_bus_first()
+    once per iteration, however we have an optimization to checks whether or not it's the first item in
+    the list. We therefore need to store a pointer to the first item rather than repeatedly calling
+    ma_node_input_bus_first(). It's safe to keep hold of this point, so long as we don't dereference it
+    after calling ma_node_input_bus_next(), which we won't be.
    */
-    if (ma_node_input_bus_next(pInputBus, ma_node_input_bus_first(pInputBus)) == NULL) {
-        /* Fast path. There's only one attachment Read straight into the output buffer. */
-        pOutputBus = ma_node_input_bus_first(pInputBus);
-        if (pOutputBus != NULL) {
-            /* Fast path. Input and output channel counts are the same. No conversion necessary. */
-            ma_uint32 framesRead;
-            float volume = ma_node_output_bus_get_volume(pOutputBus);
+    pFirst = ma_node_input_bus_first(pInputBus);
    
-            if (volume > 0) {
-                result = ma_node_read_pcm_frames(pOutputBus->pNode, pOutputBus->outputBusIndex, pFramesOut, frameCount, &framesRead);
-            } else {
-                result = ma_node_read_pcm_frames(pOutputBus->pNode, pOutputBus->outputBusIndex, NULL,       frameCount, &framesRead);   /* Volume is muted. This resolves to a seek. */
-            }
-
-            if (result == MA_SUCCESS) {
-                /* Any frames that were not read need to be filled with silence. */
-                if (framesRead < frameCount) {
-                    ma_silence_pcm_frames(ma_offset_pcm_frames_ptr(pFramesOut, framesRead, ma_format_f32, inputChannels), (frameCount - framesRead), ma_format_f32, inputChannels);
-                }
-            }
-
-            /* Apply volume, if necessary. */
-            if (volume > 0 && volume != 1) {
-                ma_apply_volume_factor_f32(pFramesOut, framesRead * inputChannels, volume);
-            }
-
-            *pFramesRead = framesRead;
-        }
-    } else {
    /* Slow path. There's multiple output buses or channel conversion is required. We need to mix before outputting. pFramesOut is the accumulation buffer. */
-        for (pOutputBus = ma_node_input_bus_first(pInputBus); pOutputBus != NULL; pOutputBus = ma_node_input_bus_next(pInputBus, pOutputBus)) {
+    for (pOutputBus = pFirst; pOutputBus != NULL; pOutputBus = ma_node_input_bus_next(pInputBus, pOutputBus)) {
        ma_uint32 framesProcessed = 0;

        MA_ASSERT(pOutputBus->pNode != NULL);
@@ -3052,7 +3111,7 @@ static ma_result ma_node_input_bus_read_pcm_frames(ma_node* pInputNode, ma_node_

                pRunningFramesOut = ma_offset_pcm_frames_ptr_f32(pFramesOut, framesProcessed, inputChannels);

-                    if (pOutputBus == ma_node_input_bus_first(pInputBus)) {
+                if (pOutputBus == pFirst) {
                    /* Fast path. First attachment. We just read straight into the output buffer (no mixing required). */
                    result = ma_node_read_pcm_frames(pOutputBus->pNode, pOutputBus->outputBusIndex, pRunningFramesOut, framesToRead, &framesJustRead);
                } else {
@@ -3072,7 +3131,7 @@ static ma_result ma_node_input_bus_read_pcm_frames(ma_node* pInputNode, ma_node_
            }

            /* If it's the first attachment we didn't do any mixing. Any leftover samples need to be silenced. */
-                if (pOutputBus == ma_node_input_bus_first(pInputBus) && framesProcessed < frameCount) {
+            if (pOutputBus == pFirst && framesProcessed < frameCount) {
                ma_silence_pcm_frames(ma_offset_pcm_frames_ptr(pFramesOut, framesProcessed, ma_format_f32, inputChannels), (frameCount - framesProcessed), ma_format_f32, inputChannels);
            }

@@ -3088,7 +3147,6 @@ static ma_result ma_node_input_bus_read_pcm_frames(ma_node* pInputNode, ma_node_

    /* In this path we always "process" the entire amount. */
    *pFramesRead = frameCount;
-    }

    return result;
 }
@@ -3248,7 +3306,6 @@ MA_API ma_result ma_node_init(ma_node_graph* pNodeGraph, const ma_node_config* p
 MA_API void ma_node_uninit(ma_node* pNode, const ma_allocation_callbacks* pAllocationCallbacks)
 {
    ma_node_base* pNodeBase = (ma_node_base*)pNode;
-    volatile ma_uint32 readCounter; /* <-- Marking this as volatile just to make it clear that a copy of this variable must be made, and the compiler must not optimize it away. */

    if (pNodeBase == NULL) {
        return;
@@ -3256,50 +3313,12 @@ MA_API void ma_node_uninit(ma_node* pNode, const ma_allocation_callbacks* pAlloc

    /*
    The first thing we need to do is fully detach the node. This will detach all inputs and
-    outputs. We need to do this first because the next section, where we wait for any current reads
-    to complete, depends on the property of this node being guaranteed not to be enumerated in any
-    future reads that might happen on another thread while this function is running.
-
-    There's a possibility that the node will be in the middle of a read at this point. This is OK
-    because thread-safety on detachment is handled for this case.
+    outputs. We need to do this first because it will sever the connection with the node graph and
+    allow us to complete uninitialization without needing to worry about thread-safety with the
+    audio thread. The detachment process will wait for any local processing of the node to finish.
    */
    ma_node_detach(pNode);

-    /*
-    Now that the node has been detached need to wait for any active reads to complete. If we don't
-    do this, the caller may end up freeing memory allocating for this node while the node graph is
-    in the middle of reading.
-
-    If at this point the audio thread calls `ma_node_graph_read_pcm_frames()`, we will be safe
-    because we've already detached the node.
-    */
-
-    /*
-    We need to retrieve the read counter *before* checking whether or not a read is in progress.
-    */
-    readCounter = ma_node_graph_get_read_counter(pNodeBase->pNodeGraph);    /* <-- This copy must happen. */
-    c89atomic_compiler_fence(); /* <-- We must extract the read counter before checking ma_node_graph_is_reading(). */
-
-    /*
-    Once we've got a local copy of the read counter we can check whether or not the node graph is
-    in the middle of a read. If it is, we need to wait for it to complete. This is where the read
-    counter we retrieved just above comes into it. Every time a read completes, the counter is
-    incremented. While the read counter is equal to the counter we just extracted above, the read
-    is still happening.
-
-    If we were to not check if the graph is currently reading (`ma_node_graph_is_reading()`) we'd
-    end up stuck in an infinite loop because the counter would never end up incrementing. This is
-    why we need to extract the read counter before performing this check. If we were to do the
-    reading check first, it would be possible to end up in a situation where the reading process
-    completes *just* before we extract the counter and enter the wait loop, thereby creating an
-    infinite loop.
-    */
-    if (ma_node_graph_is_reading(pNodeBase->pNodeGraph)) {
-        while (readCounter == ma_node_graph_get_read_counter(pNodeBase->pNodeGraph)) {
-            ma_yield();
-        }
-    }
-
    /*
    At this point the node should be completely unreferenced by the node graph and we can finish up
    the uninitialization process without needing to worry about thread-safety.
@@ -3403,15 +3422,25 @@ MA_API ma_result ma_node_detach(ma_node* pNode)
        ma_node_detach_output_bus(pNode, iOutputBus);
    }

-    /* Any nodes attached to this node as an input need to be detached as well. */
+    /*
+    At this point all output buses will have been detached from the graph and we can be guaranteed
+    that none of it's input nodes will be getting processed by the graph. We can detach these
+    without needing to worry about the audio thread touching them.
+    */
    for (iInputBus = 0; iInputBus < ma_node_get_input_bus_count(pNode); iInputBus += 1) {
        ma_node_input_bus* pInputBus;
        ma_node_output_bus* pOutputBus;

        pInputBus = &pNodeBase->inputBuses[iInputBus];

-        for (pOutputBus = ma_node_input_bus_first(pInputBus); pOutputBus != NULL; pOutputBus = ma_node_input_bus_next(pInputBus, pOutputBus)) {
-            ma_node_detach_output_bus(pOutputBus->pNode, pOutputBus->outputBusIndex);
+        /*
+        This is important. We cannot be using ma_node_input_bus_first() or ma_node_input_bus_next(). Those
+        functions are specifically for the audio thread. We'll instead just manually iterate using standard
+        linked list logic. We don't need to worry about the audio thread referencing these because the step
+        above severed the connection to the graph.
+        */
+        for (pOutputBus = (ma_node_output_bus*)c89atomic_load_ptr(&pInputBus->head.pNext); pOutputBus != NULL; pOutputBus = (ma_node_output_bus*)c89atomic_load_ptr(&pOutputBus->pNext)) {
+            ma_node_detach_output_bus(pOutputBus->pNode, pOutputBus->outputBusIndex);   /* This won't do any waiting in practice and should be efficient. */
        }
    }