Drpc: `pool.Close` Should Make The Pool Unusable
Introduction
In distributed systems, connection pooling is a crucial mechanism for improving performance and reducing latency, but the pool itself must behave correctly. In this article, we discuss a critical issue with the `pool.Close` method in the drpc library that can lead to unexpected behavior: connections can be returned to, and reused from, a pool that has already been closed.
The Issue
We recently encountered a case where calling `pool.Close()` did not fully drain or shut down the pool. Specifically, we observed that `pool.Len()` was non-zero even after `pool.Close()` returned. This was unexpected and highlighted a critical flaw in the current implementation.
Root Cause
Further investigation revealed that the issue arises when a client calls `stream.CloseSend()` but does not wait on `stream.Context().Done()`. If we immediately call `pool.Close()` after that, the pool gets reset, but in parallel, `poolConn.manageStream` (via `monitorStream`) may still run and put the connection back into the pool.
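This interleaving can be reproduced deterministically with a toy pool. Everything below (`Pool`, `Put`, `Len`, `Close`) is an illustrative stand-in for the real drpc types, not its actual API:

```go
package main

import "fmt"

// Toy pool standing in for drpc's connection pool (illustrative, not the real API).
type Pool struct{ conns []string }

func (p *Pool) Put(conn string) { p.conns = append(p.conns, conn) }
func (p *Pool) Len() int        { return len(p.conns) }
func (p *Pool) Close()          { p.conns = nil } // resets, but does not disable, the pool

// race sequences the problematic interleaving deterministically and
// returns the pool length observed after Close.
func race() int {
	p := &Pool{}
	done := make(chan struct{})     // stands in for stream.Context().Done()
	finished := make(chan struct{}) // monitor goroutine has returned

	// monitorStream stand-in: waits for the stream to end, then
	// unconditionally returns the connection to the pool.
	go func() {
		<-done
		p.Put("conn-1")
		close(finished)
	}()

	p.Close()   // the caller closes the pool first...
	close(done) // ...then the stream finishes
	<-finished
	return p.Len()
}

func main() {
	fmt.Println(race()) // prints 1: the connection came back after Close
}
```

Waiting on the done channel before calling Close (as the commented-out line in the repro below does) avoids the race, but relying on every caller to do so is fragile.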
Simplified Repro
To demonstrate this issue, we created a simplified repro:
```go
for i := range numStreams {
	streamID := i
	stream, err := client.PingS(ctx)
	require.NoError(t, err, "stream %d failed to start", streamID)
	defer func() { _ = stream.Close() }()

	for k := range numMessages {
		payload := fmt.Appendf(nil, "stream-%d-msg-%d", streamID, k)
		err = stream.Send(&pb.Ping{Payload: payload})
		require.NoError(t, err, "stream %d failed to send", streamID)

		pong, err := stream.Recv()
		require.NoError(t, err, "stream %d failed to recv", streamID)
		require.True(t, bytes.Equal(pong.Payload, payload),
			"stream %d payload mismatch: got %q, want %q",
			streamID, pong.Payload, payload)
	}

	err = stream.CloseSend()
	require.NoError(t, err, "stream %d CloseSend error", streamID)
	// <-stream.Context().Done() // <-- this would make the test pass
	require.Equal(t, 1, pool.Len())
}

require.Equal(t, 1, pool.Len())
require.NoError(t, pool.Close())
require.Equal(t, 0, pool.Len()) // <-- this assert can fail
```
Relevant Code
The relevant code from the pool is as follows:
```go
func (p *poolConn[K, V]) monitorStream(stream drpc.Stream, conn V, done *drpcsignal.Chan) {
	<-stream.Context().Done()
	p.pool.Put(p.key, conn)
	done.Close()
}
```
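A tempting fix is to gate the `Put` at the call site, e.g. behind a hypothetical `Closed()` accessor (an assumed name, not drpc's real API). Note, however, that a bare check-then-put still leaves a window in which `Close` can run between the two steps. A toy sketch of that window, sequenced deterministically with channels:

```go
package main

import (
	"fmt"
	"sync"
)

// Toy pool with a hypothetical Closed() accessor (not drpc's real API).
type Pool struct {
	mu     sync.Mutex
	closed bool
	conns  []string
}

func (p *Pool) Put(c string) { p.mu.Lock(); defer p.mu.Unlock(); p.conns = append(p.conns, c) }
func (p *Pool) Len() int     { p.mu.Lock(); defer p.mu.Unlock(); return len(p.conns) }
func (p *Pool) Close()       { p.mu.Lock(); defer p.mu.Unlock(); p.closed = true; p.conns = nil }
func (p *Pool) Closed() bool { p.mu.Lock(); defer p.mu.Unlock(); return p.closed }

// gateDemo sequences Close between the gate check and the Put, and
// returns the pool length observed afterwards.
func gateDemo() int {
	p := &Pool{}
	checked := make(chan struct{})
	proceed := make(chan struct{})
	finished := make(chan struct{})

	// Gated monitor: the check and the Put are separate critical sections.
	go func() {
		if !p.Closed() { // gate passes: pool is still open here...
			checked <- struct{}{}
			<-proceed // ...but Close can run in this window
			p.Put("conn-1")
		}
		close(finished)
	}()

	<-checked
	p.Close() // pool closes between the gate and the Put
	close(proceed)
	<-finished
	return p.Len()
}

func main() {
	fmt.Println(gateDemo()) // prints 1: the gate alone is not enough
}
```

So if the gate lives in `monitorStream`, the check and the `Put` must happen under the pool's lock as one step; otherwise the guard belongs inside `Put()` itself.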
Conclusion
The issue of `pool.Close` not making the pool unusable is a critical flaw that can lead to unexpected behavior and resource leaks. To fix it, we need to gate the `Put` call in `monitorStream` behind a check that the pool is still valid/open, or add some kind of guard in `Put()` itself. Either way, `Close()` should make it impossible for future `Put()`s to succeed.
Recommendations
To address this issue, we recommend the following:
- Gate the `Put` call in `monitorStream`: add a check that the pool is still valid/open before calling `p.pool.Put(p.key, conn)`.
- Add a guard in `Put()` itself: modify `Put()` to check that the pool is still valid/open before putting the connection back into the pool.
- Ensure `Close()` makes the pool unusable: modify `Close()` so that no future `Put()` can succeed.
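A minimal sketch of the guarded design, assuming a simple mutex-plus-flag pool (illustrative names, not drpc's actual implementation): `Put` and `Close` share one lock, so once `Close` has run, any late `Put` fails and the caller can dispose of the connection instead.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errClosed = errors.New("pool: closed")

// Toy pool whose Put is guarded by the same lock Close takes.
type Pool struct {
	mu     sync.Mutex
	closed bool
	conns  []string
}

func (p *Pool) Put(conn string) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.closed {
		return errClosed // caller should close the connection instead
	}
	p.conns = append(p.conns, conn)
	return nil
}

func (p *Pool) Len() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.conns)
}

func (p *Pool) Close() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.closed = true
	p.conns = nil
}

// guardDemo replays the racy interleaving against the guarded pool and
// reports whether the late Put was rejected and the final pool length.
func guardDemo() (rejected bool, length int) {
	p := &Pool{}
	done := make(chan struct{})
	finished := make(chan struct{})

	// monitorStream stand-in: the late Put now fails instead of
	// resurrecting the connection in a closed pool.
	go func() {
		<-done
		rejected = p.Put("conn-1") != nil
		close(finished)
	}()

	p.Close()
	close(done)
	<-finished
	return rejected, p.Len()
}

func main() {
	rejected, length := guardDemo()
	fmt.Println(rejected, length) // prints: true 0
}
```

Because the closed check and the append happen under the same lock, there is no window between them, which is what makes `Close()` genuinely final.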
Q&A: Addressing the Critical Issue with pool.Close
Q: What is the issue with `pool.Close` in the drpc library?
A: The issue arises when a client calls `stream.CloseSend()` but does not wait on `stream.Context().Done()`. If `pool.Close()` is called immediately after that, the pool gets reset, but in parallel, `poolConn.manageStream` (via `monitorStream`) may still run and put the connection back into the pool.
Q: What is the root cause of this issue?
A: The root cause is that once `stream.Context().Done()` fires, `monitorStream` unconditionally puts the connection back into the pool without checking whether the pool has been closed in the meantime. This allows a connection to be returned to the pool even after `pool.Close()` has been called.
Q: What is the impact of this issue?
A: The pool is not fully drained or shut down even after calling `pool.Close()`, which can lead to unexpected behavior and resource leaks.
Q: How can this issue be fixed?
A: To fix this issue, we need to gate the `Put` call in `monitorStream` behind a check that the pool is still valid/open, or add some kind of guard in `Put()` itself. Either way, `Close()` should make it impossible for future `Put()`s to succeed.
Q: What are the recommendations for addressing this issue?
A: The same three outlined above: gate the `Put` call in `monitorStream` behind an open check, add a guard in `Put()` itself so it rejects connections once the pool is closed, and ensure `Close()` makes it impossible for future `Put()`s to succeed.
Q: Why is it important to ensure `pool.Close` makes the pool unusable?
A: Making the pool unusable prevents connections from being put back into the pool after `pool.Close()` has been called, which avoids unexpected behavior and resource leaks.
Q: How can developers ensure they are not affected by this issue?
A: Developers can ensure they are not affected by this issue by following the recommendations outlined above. Additionally, they can test their code thoroughly to verify that `pool.Close` works as expected.
Q: What is the next step in addressing this issue?
A: The next step is to implement the recommendations outlined above and test the code thoroughly to verify that `pool.Close` works as expected. This will help prevent unexpected behavior and resource leaks.