Drpc: `pool.Close` Should Make The Pool Unusable
Introduction
In distributed systems, connection pooling is a crucial mechanism for improving performance and reducing latency, but the pool itself must behave correctly. In this article, we discuss a critical issue with the `pool.Close` method in the drpc library that can lead to unexpected behavior: connections can be returned to, and reused from, a pool that has already been closed.
The Issue
We recently encountered a case where calling `pool.Close()` did not fully drain or shut down the pool. Specifically, we observed that `pool.Len()` was non-zero even after `pool.Close()` returned. This was unexpected and highlighted a critical flaw in the current implementation.
Root Cause
Further investigation revealed that the issue arises when a client calls `stream.CloseSend()` but does not wait on `stream.Context().Done()`. If we immediately call `pool.Close()` after that, the pool gets reset, but in parallel, `poolConn.manageStream` (via `monitorStream`) may still run and put the connection back into the pool.
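This interleaving can be reproduced deterministically with a toy pool. Everything below (`Pool`, `Put`, `Len`, `Close`) is an illustrative stand-in for the real drpc types, not its actual API:

```go
package main

import "fmt"

// Toy pool standing in for drpc's connection pool (illustrative, not the real API).
type Pool struct{ conns []string }

func (p *Pool) Put(conn string) { p.conns = append(p.conns, conn) }
func (p *Pool) Len() int        { return len(p.conns) }
func (p *Pool) Close()          { p.conns = nil } // resets, but does not disable, the pool

// race sequences the problematic interleaving deterministically and
// returns the pool length observed after Close.
func race() int {
	p := &Pool{}
	done := make(chan struct{})     // stands in for stream.Context().Done()
	finished := make(chan struct{}) // monitor goroutine has returned

	// monitorStream stand-in: waits for the stream to end, then
	// unconditionally returns the connection to the pool.
	go func() {
		<-done
		p.Put("conn-1")
		close(finished)
	}()

	p.Close()   // the caller closes the pool first...
	close(done) // ...then the stream finishes
	<-finished
	return p.Len()
}

func main() {
	fmt.Println(race()) // prints 1: the connection came back after Close
}
```

Waiting on the done channel before calling Close (as the commented-out line in the repro below does) avoids the race, but relying on every caller to do so is fragile.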
Simplified Repro
To demonstrate this issue, we created a simplified repro:
```go
for i := range numStreams {
	streamID := i
	stream, err := client.PingS(ctx)
	require.NoError(t, err, "stream %d failed to start", streamID)
	defer func() { _ = stream.Close() }()

	for k := range numMessages {
		payload := fmt.Appendf(nil, "stream-%d-msg-%d", streamID, k)
		err = stream.Send(&pb.Ping{Payload: payload})
		require.NoError(t, err, "stream %d failed to send", streamID)

		pong, err := stream.Recv()
		require.NoError(t, err, "stream %d failed to recv", streamID)
		require.True(t, bytes.Equal(pong.Payload, payload),
			"stream %d payload mismatch: got %q, want %q",
			streamID, pong.Payload, payload)
	}

	err = stream.CloseSend()
	require.NoError(t, err, "stream %d CloseSend error", streamID)
	// <-stream.Context().Done() // <-- this would make the test pass
	require.Equal(t, 1, pool.Len())
}

require.Equal(t, 1, pool.Len())
require.NoError(t, pool.Close())
require.Equal(t, 0, pool.Len()) // <-- this assert can fail
```
Relevant Code
The relevant code from the pool is as follows:
```go
func (p *poolConn[K, V]) monitorStream(stream drpc.Stream, conn V, done *drpcsignal.Chan) {
	<-stream.Context().Done()
	p.pool.Put(p.key, conn)
	done.Close()
}
```
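A tempting fix is to gate the `Put` at the call site, e.g. behind a hypothetical `Closed()` accessor (an assumed name, not drpc's real API). Note, however, that a bare check-then-put still leaves a window in which `Close` can run between the two steps. A toy sketch of that window, sequenced deterministically with channels:

```go
package main

import (
	"fmt"
	"sync"
)

// Toy pool with a hypothetical Closed() accessor (not drpc's real API).
type Pool struct {
	mu     sync.Mutex
	closed bool
	conns  []string
}

func (p *Pool) Put(c string) { p.mu.Lock(); defer p.mu.Unlock(); p.conns = append(p.conns, c) }
func (p *Pool) Len() int     { p.mu.Lock(); defer p.mu.Unlock(); return len(p.conns) }
func (p *Pool) Close()       { p.mu.Lock(); defer p.mu.Unlock(); p.closed = true; p.conns = nil }
func (p *Pool) Closed() bool { p.mu.Lock(); defer p.mu.Unlock(); return p.closed }

// gateDemo sequences Close between the gate check and the Put, and
// returns the pool length observed afterwards.
func gateDemo() int {
	p := &Pool{}
	checked := make(chan struct{})
	proceed := make(chan struct{})
	finished := make(chan struct{})

	// Gated monitor: the check and the Put are separate critical sections.
	go func() {
		if !p.Closed() { // gate passes: pool is still open here...
			checked <- struct{}{}
			<-proceed // ...but Close can run in this window
			p.Put("conn-1")
		}
		close(finished)
	}()

	<-checked
	p.Close() // pool closes between the gate and the Put
	close(proceed)
	<-finished
	return p.Len()
}

func main() {
	fmt.Println(gateDemo()) // prints 1: the gate alone is not enough
}
```

So if the gate lives in `monitorStream`, the check and the `Put` must happen under the pool's lock as one step; otherwise the guard belongs inside `Put()` itself.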
Conclusion
The issue of `pool.Close` not making the pool unusable is a critical flaw that can lead to unexpected behavior and resource leaks. To fix it, we need to gate the `Put` call in `monitorStream` behind a check that the pool is still valid/open, or add some kind of guard in `Put()` itself. Either way, `Close()` should make it impossible for future `Put()`s to succeed.
Recommendations
To address this issue, we recommend the following:
- Gate the `Put` call in `monitorStream`: add a check that the pool is still valid/open before calling `p.pool.Put(p.key, conn)`.
- Add a guard in `Put()` itself: modify `Put()` to check that the pool is still valid/open before putting the connection back into the pool.
- Ensure `Close()` makes the pool unusable: modify `Close()` so that no future `Put()` can succeed.
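A minimal sketch of the guarded design, assuming a simple mutex-plus-flag pool (illustrative names, not drpc's actual implementation): `Put` and `Close` share one lock, so once `Close` has run, any late `Put` fails and the caller can dispose of the connection instead.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errClosed = errors.New("pool: closed")

// Toy pool whose Put is guarded by the same lock Close takes.
type Pool struct {
	mu     sync.Mutex
	closed bool
	conns  []string
}

func (p *Pool) Put(conn string) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.closed {
		return errClosed // caller should close the connection instead
	}
	p.conns = append(p.conns, conn)
	return nil
}

func (p *Pool) Len() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.conns)
}

func (p *Pool) Close() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.closed = true
	p.conns = nil
}

// guardDemo replays the racy interleaving against the guarded pool and
// reports whether the late Put was rejected and the final pool length.
func guardDemo() (rejected bool, length int) {
	p := &Pool{}
	done := make(chan struct{})
	finished := make(chan struct{})

	// monitorStream stand-in: the late Put now fails instead of
	// resurrecting the connection in a closed pool.
	go func() {
		<-done
		rejected = p.Put("conn-1") != nil
		close(finished)
	}()

	p.Close()
	close(done)
	<-finished
	return rejected, p.Len()
}

func main() {
	rejected, length := guardDemo()
	fmt.Println(rejected, length) // prints: true 0
}
```

Because the closed check and the append happen under the same lock, there is no window between them, which is what makes `Close()` genuinely final.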
Q&A: Addressing the Critical Issue with pool.Close
Q: What is the issue with `pool.Close` in the drpc library?
A: The issue arises when a client calls `stream.CloseSend()` but does not wait on `stream.Context().Done()`. If `pool.Close()` is called immediately after that, the pool gets reset, but in parallel, `poolConn.manageStream` (via `monitorStream`) may still run and put the connection back into the pool.
Q: What is the root cause of this issue?
A: The root cause is that once `stream.Context().Done()` fires, `monitorStream` unconditionally puts the connection back into the pool without checking whether the pool has been closed in the meantime. This allows a connection to be returned to the pool even after `pool.Close()` has been called.
Q: What is the impact of this issue?
A: The pool is not fully drained or shut down even after calling `pool.Close()`, which can lead to unexpected behavior and resource leaks.
Q: How can this issue be fixed?
A: To fix this issue, we need to gate the `Put` call in `monitorStream` behind a check that the pool is still valid/open, or add some kind of guard in `Put()` itself. Either way, `Close()` should make it impossible for future `Put()`s to succeed.
Q: What are the recommendations for addressing this issue?
A: The same three outlined above: gate the `Put` call in `monitorStream` behind an open check, add a guard in `Put()` itself so it rejects connections once the pool is closed, and ensure `Close()` makes it impossible for future `Put()`s to succeed.
Q: Why is it important to ensure `pool.Close` makes the pool unusable?
A: Making the pool unusable prevents connections from being put back into the pool after `pool.Close()` has been called, which avoids unexpected behavior and resource leaks.
Q: How can developers ensure they are not affected by this issue?
A: Developers can ensure they are not affected by this issue by following the recommendations outlined above. Additionally, they can test their code thoroughly to verify that `pool.Close` works as expected.
Q: What is the next step in addressing this issue?
A: The next step is to implement the recommendations outlined above and test the code thoroughly to verify that `pool.Close` works as expected. This will help prevent unexpected behavior and resource leaks.