Make find_actor() delete stale sockaddr entries from registrar on OSError#366

Open
goodboy wants to merge 7 commits into main from dereg_on_oserror

Conversation


@goodboy goodboy commented Aug 28, 2023

All in the title 😎

@goodboy goodboy changed the title Make find_actor() delete stale sockaddr entries from registar actor on OSError Make find_actor() delete stale sockaddr entries from registrar on OSError Aug 28, 2023

goodboy commented Aug 29, 2023

Lul gotta add the bidict dep?


goodboy commented Jul 15, 2025

This needs to be bumped to integrate better with the new multi-ipc-proto stuff landed in #375 !1

Since stale addrs can be leaked when an actor's transport server task
crashes but doesn't (successfully) unregister from the registrar, we
need a remote way to remove such entries; hence this new (registrar)
method.

To implement this, make use of the `bidict` lib for the `._registry`
table, thus making it super simple to do reverse uuid lookups from an
input socket-address.
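
The reverse-lookup idea can be sketched without any dependencies. The snippet below is only an illustration, not the code in this PR: it stands in for `bidict` with two plain dicts, and the `Registry` class, its `register()` method, the `delete_addr()` signature, and the sample uid and port are all assumptions made up for the example.

```python
from typing import Dict, Optional, Tuple

Uid = Tuple[str, str]   # (actor name, uuid)
Addr = Tuple[str, int]  # (host, port)


class Registry:
    """Hypothetical two-way table of actor uid <-> socket address."""

    def __init__(self) -> None:
        self._by_uid: Dict[Uid, Addr] = {}
        self._by_addr: Dict[Addr, Uid] = {}

    def register(self, uid: Uid, addr: Addr) -> None:
        # Drop any previous pairing on either side so both maps stay 1:1.
        old_addr = self._by_uid.pop(uid, None)
        if old_addr is not None:
            self._by_addr.pop(old_addr, None)
        old_uid = self._by_addr.pop(addr, None)
        if old_uid is not None:
            self._by_uid.pop(old_uid, None)
        self._by_uid[uid] = addr
        self._by_addr[addr] = uid

    def delete_addr(self, addr: Addr) -> Optional[Uid]:
        # Reverse lookup: find the uid that owns this address, if any,
        # and remove the pairing from both directions.
        uid = self._by_addr.pop(addr, None)
        if uid is not None:
            del self._by_uid[uid]
        return uid


reg = Registry()
reg.register(("emsd", "1234"), ("127.0.0.1", 6116))
removed = reg.delete_addr(("127.0.0.1", 6116))
removed_again = reg.delete_addr(("127.0.0.1", 6116))
```

With a real `bidict` the second dict goes away, since the inverse mapping is maintained by the library itself.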

In cases where an actor's transport server task (by default handling new
TCP connections) terminates early but does not de-register from the
registry (aka the registrar) actor's address table, a client actor
trying to connect will get a connection error on that address. When the
client handles a (local) `OSError` (meaning the target actor address is
likely being contacted over `localhost`), make a further call to the
registrar to delete the stale entry and `yield None`, gracefully
indicating to calling code that no `Portal` can be delivered to the
target address.

This issue was originally discovered in `piker` where the `emsd`
(clearing engine) actor would sometimes crash on rapid client
re-connects and then leave a stale entry in `pikerd`. With this fix, new
clients will attempt to connect via an endpoint which will re-spawn
`emsd` when a `None` portal is delivered (via `maybe_spawn_em()`).
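
The connect, fail, de-register, `yield None` flow described above might look roughly like the following. This is a simplified sketch under stated assumptions, not the actual `find_actor()` code: it is synchronous, it uses a plain dict in place of a remote registrar, and the names `find_peer`, `connect`, and `refuse` are hypothetical. The `connect` hook exists only so the failure path can be exercised without a real network.

```python
import socket
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, Optional, Tuple

Addr = Tuple[str, int]


@contextmanager
def find_peer(
    registry: Dict[str, Addr],
    name: str,
    connect: Callable[[Addr], socket.socket] = socket.create_connection,
) -> Iterator[Optional[socket.socket]]:
    """Yield a connected socket for `name`, or None if it is unreachable."""
    sock: Optional[socket.socket] = None
    addr = registry.get(name)
    if addr is not None:
        try:
            sock = connect(addr)
        except OSError:
            # Nothing is listening at the registered address, so treat
            # the entry as stale and remove it from the table.
            registry.pop(name, None)
    try:
        # Callers must handle None: it means "no peer available".
        yield sock
    finally:
        if sock is not None:
            sock.close()


def refuse(addr: Addr) -> socket.socket:
    # Stand-in for a dead transport server (hypothetical test helper).
    raise ConnectionRefusedError(f"nothing listening on {addr}")


registry = {"emsd": ("127.0.0.1", 6116)}
with find_peer(registry, "emsd", connect=refuse) as peer:
    result = peer
```

After the `with` block, `result` is `None` and the `"emsd"` entry is gone from `registry`, which is the signal a caller would use to re-spawn the missing actor.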

By spawning an actor task that immediately shuts down the transport
server and then sleeps, verify that attempting to connect via the
`._discovery.find_actor()` helper delivers `None` for the `Portal`
value.

Relates to #184 and #216

goodboy commented Sep 30, 2025

I think CI is running clean now?

Only thing left is to maybe also do the .delete_addr() call on ConnectionErrors with UDS as well?

@goodboy goodboy changed the base branch from asyncio_debugger_support to main September 30, 2025 05:26