-
Notifications
You must be signed in to change notification settings - Fork 38
Description
Is this a support request?
No
Describe the bug
When using the Redis backing store, network errors can take down that section of the supervisor tree without crashing the whole client. This leads to an irrecoverable state where every lookup will return the supplied default value - reconnecting to Redis will never be attempted. I observed this when seeing timeouts to Redis, not sure if the same behavior happens for other network issues but I suspect it does.
When eredis starts returning an error tuple of {:error, :no_connection}, the ldclient_storage_redis_server_default process fails to pattern match on it and crashes. This causes the supervising process, ldclient_storage_redis_sup_default, to restart it.
The supervisors are aggressive about giving up on crash recovery. The ldclient_storage_redis_sup process is set to terminate itself if the child process ever crashes - so both will never be restarted. Its parent, ldclient_instance_sup, will recover from a single crash in 5s - any more than that and it will also give up.
When ldclient_instance_sup gives up and terminates itself, its brought back up - but doesn't try to start redis again. The problem seems to be that redis tree is added as a child to ldclient_instance_sup in the start/3 function of ldclient_instance. That method is then never called again - when ldclient_instance_sup is restarted, only the children defined in its init/1 function are created.
Making this worse, ldclient_eval will discard errors coming from the lookup and return the supplied default. This means my application can't detect the failure and terminate itself or otherwise react to it.
I could be missing something (I found the various supervisors with names not matching their modules hard to reason about) but that's my best understanding of whats happening.
To reproduce
- Configure the sdk to use redis with
use_ldd: true - Inject latency greater than the 5s timeout. I have redis running in Docker and did this with pumba:
pumba netem -d 5m delay -t 6000 redis - Query a flag,
:ldclient.variation(flag, ld_context, false) - Immediately repeat the query when the first one times out
- Wait for the second timeout/crash
- The third+ attempts will immediately return the default with no errors
Expected behavior
I would expect the supervisor to restart all children. Failing that, if the whole system is unusable the error should be propagated up instead of being silently discarded.
Logs
iex(6)> mods = [:ldclient_instance, :ldclient_instance_sup, :ldclient_sup, :ldclient_storage_redis]
[:ldclient_instance, :ldclient_instance_sup, :ldclient_sup,
:ldclient_storage_redis]
iex(7)> context = :ldclient_context.new("abc")
%{key: "abc", kind: "user"}
# Query running successfully
iex(8)> :ldclient.variation("example-integer", context, false)
500
#
# Started network latency injection here
#
iex(9)> mods |> Enum.map(fn m -> {m, :_, :return_trace} end) |> :recon_trace.calls({100, 1000}, scope: :local)
35
iex(10)> :ldclient.variation("example-integer", context, false)
17:3:01.046082 <0.1149.0> ldclient_instance:update_processor_initialized(default)
17:3:01.046374 <0.1149.0> ldclient_instance:update_processor_initialized/1 --> false
17:3:01.046578 <0.1149.0> ldclient_instance:feature_store_initialized(default)
17:3:01.046755 <0.1149.0> ldclient_instance:feature_store_initialized/1 --> true
17:3:01.046873 <0.1149.0> ldclient_storage_redis:get(default, features, <<"example-integer">>)
17:3:01.047021 <0.1149.0> ldclient_storage_redis:get_local_reg_name(worker, default)
17:3:01.047155 <0.1149.0> ldclient_storage_redis:get_local_reg_name/2 --> ldclient_storage_redis_server_default
false
[error] module=gen_server GenServer :ldclient_storage_redis_server_default terminating
** (stop) exited in: :gen_server.call(#PID<0.601.0>, {:request, [["*", ~c"3", "\r\n"], [["$", ~c"4", "\r\n", "HGET", "\r\n"], ["$", ~c"23", "\r\n", "ld_Development:features", "\r\n"], ["$", ~c"15", "\r\n", "example-integer", "\r\n"]]]}, 5000)
** (EXIT) time out
(stdlib 6.2.2.2) gen_server.erl:1222: :gen_server.call/3
(ldclient 3.8.0) /my-repo/deps/ldclient/src/ldclient_storage_redis_server.erl:268: :ldclient_storage_redis_server.lookup_key/5
(ldclient 3.8.0) /my-repo/deps/ldclient/src/ldclient_storage_redis_server.erl:163: :ldclient_storage_redis_server.handle_call/3
(stdlib 6.2.2.2) gen_server.erl:2381: :gen_server.try_handle_call/4
(stdlib 6.2.2.2) gen_server.erl:2410: :gen_server.handle_msg/6
(stdlib 6.2.2.2) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
Last message (from #PID<0.1149.0>): {:get, :features, "example-integer"}
State: %{tag: :default, prefix: ~c"ld_Development", client: #PID<0.601.0>, buckets: [:segments, :features]}
Client #PID<0.1149.0> is alive
(iex 1.18.4) lib/iex/evaluator.ex:186: IEx.Evaluator.loop/1
(iex 1.18.4) lib/iex/evaluator.ex:34: IEx.Evaluator.init/5
(stdlib 6.2.2.2) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
[notice] Starting redis server with name :ldclient_storage_redis_server_default
iex(11)> :ldclient.variation("example-integer", context, false)
17:3:07.445872 <0.1149.0> ldclient_instance:update_processor_initialized(default)
17:3:07.446192 <0.1149.0> ldclient_instance:update_processor_initialized/1 --> false
17:3:07.446423 <0.1149.0> ldclient_instance:feature_store_initialized(default)
17:3:07.446579 <0.1149.0> ldclient_instance:feature_store_initialized/1 --> false
17:3:07.446759 <0.1149.0> ldclient_storage_redis:get_init(default)
17:3:07.446928 <0.1149.0> ldclient_storage_redis:get_local_reg_name(worker, default)
17:3:07.447058 <0.1149.0> ldclient_storage_redis:get_local_reg_name/2 --> ldclient_storage_redis_server_default
[warning] module=eredis_client eredis: Initial connect failed: {:authentication_error, :timeout}
[error] module=gen_server GenServer :ldclient_storage_redis_server_default terminating
** (MatchError) no match of right hand side value: {:error, :no_connection}
(ldclient 3.8.0) /my-repo/deps/ldclient/src/ldclient_storage_redis_server.erl:340: :ldclient_storage_redis_server.get_init/2
(ldclient 3.8.0) /my-repo/deps/ldclient/src/ldclient_storage_redis_server.erl:175: :ldclient_storage_redis_server.handle_call/3
(stdlib 6.2.2.2) gen_server.erl:2381: :gen_server.try_handle_call/4
(stdlib 6.2.2.2) gen_server.erl:2410: :gen_server.handle_msg/6
(stdlib 6.2.2.2) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
Last message (from #PID<0.1149.0>): {:get_init}
State: %{tag: :default, prefix: ~c"ld_Development", client: #PID<0.1160.0>, buckets: []}
Client #PID<0.1149.0> is alive
(stdlib 6.2.2.2) gen.erl:260: :gen.do_call/4
(stdlib 6.2.2.2) gen_server.erl:1138: :gen_server.call/2
(ldclient 3.8.0) /my-repo/deps/ldclient/src/ldclient_storage_redis.erl:92: :ldclient_storage_redis.get_init/1
(ldclient 3.8.0) /my-repo/deps/ldclient/src/ldclient_eval.erl:293: :ldclient_eval.get_initialization_state/2
(ldclient 3.8.0) /my-repo/deps/ldclient/src/ldclient_eval.erl:69: :ldclient_eval.flag_key_for_context/4
(ldclient 3.8.0) /my-repo/deps/ldclient/src/ldclient.erl:144: :ldclient.variation/4
(elixir 1.18.4) elixir.erl:386: :elixir.eval_external_handler/3
(stdlib 6.2.2.2) erl_eval.erl:919: :erl_eval.do_apply/7
(elixir 1.18.4) elixir.erl:364: :elixir.eval_forms/4
(elixir 1.18.4) lib/module/parallel_checker.ex:120: Module.ParallelChecker.verify/1
(iex 1.18.4) lib/iex/evaluator.ex:336: IEx.Evaluator.eval_and_inspect/3
(iex 1.18.4) lib/iex/evaluator.ex:310: IEx.Evaluator.eval_and_inspect_parsed/3
(iex 1.18.4) lib/iex/evaluator.ex:299: IEx.Evaluator.parse_eval_inspect/4
(iex 1.18.4) lib/iex/evaluator.ex:189: IEx.Evaluator.loop/1
(iex 1.18.4) lib/iex/evaluator.ex:34: IEx.Evaluator.init/5
(stdlib 6.2.2.2) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
false
17:3:11.092550 <0.491.0> ldclient_instance_sup:start_link(ldclient_instance_default, ldclient_instance_stream_default, ldclient_update_null_server, ldclient_instance_events_default, default)
[notice] Terminating event service, reason: :shutdown
[notice] Terminating, reason: :shutdown; Pid none
17:3:11.095185 <0.1161.0> ldclient_instance_sup:init([ldclient_instance_stream_default,ldclient_update_null_server,
ldclient_instance_events_default,default])
[notice] Starting instance supervisor for :default with name :ldclient_instance_default
[notice] Starting event supervisor for :default with name :ldclient_instance_events_default
17:3:11.095501 <0.1161.0> ldclient_instance_sup:children(ldclient_instance_stream_default, ldclient_update_null_server, ldclient_instance_events_default, default)
[notice] Starting event storage server for :default with name :ldclient_event_server_default
[notice] Starting event processor for :default with name :ldclient_event_process_server_default
17:3:11.096708 <0.1161.0> ldclient_instance_sup:children/4 --> [{ldclient_update_sup,
{ldclient_update_sup,start_link,
[ldclient_instance_stream_default,ldclient_update_null_server]},
permanent,5000,supervisor,
[ldclient_update_sup]},
{ldclient_event_sup,
{ldclient_event_sup,start_link,
[ldclient_instance_events_default,default]},
permanent,5000,supervisor,
[ldclient_event_sup]}]
17:3:11.096988 <0.1161.0> ldclient_instance_sup:init/1 --> {ok,{{one_for_one,1,5},
[{ldclient_update_sup,
{ldclient_update_sup,start_link,
[ldclient_instance_stream_default,ldclient_update_null_server]},
permanent,5000,supervisor,
[ldclient_update_sup]},
{ldclient_event_sup,
{ldclient_event_sup,start_link,
[ldclient_instance_events_default,default]},
permanent,5000,supervisor,
[ldclient_event_sup]}]}}
17:3:11.098170 <0.491.0> ldclient_instance_sup:start_link/5 --> {ok,<0.1161.0>}
#
# All calls now return the default immediately
#
iex(12)> :ldclient.variation("example-integer", context, false)
17:3:12.229785 <0.1149.0> ldclient_instance:update_processor_initialized(default)
false
17:3:12.230080 <0.1149.0> ldclient_instance:update_processor_initialized/1 --> false
17:3:12.230291 <0.1149.0> ldclient_instance:feature_store_initialized(default)
17:3:12.230491 <0.1149.0> ldclient_instance:feature_store_initialized/1 --> false
17:3:12.230623 <0.1149.0> ldclient_storage_redis:get_init(default)
17:3:12.230738 <0.1149.0> ldclient_storage_redis:get_local_reg_name(worker, default)
17:3:12.230860 <0.1149.0> ldclient_storage_redis:get_local_reg_name/2 --> ldclient_storage_redis_server_default
iex(13)>
SDK version
Running from git, bb5bc81
Language version, developer tools
Elixir 1.18.4 / OTP 27