Affected Stackable version
Any up to and including SDP 26.3.0
Affected Apache HDFS version
N/A
Current and expected behavior
In the HDFS operator (and perhaps any operator based on kube-rs), when the Reflector watch resets, the Store has to be rebuilt.
Reconciliations that run before the Store is fully consistent can lead to service accounts being dropped from (Cluster)RoleBindings. This leads to the Topology Provider being unable to determine the topology (or possibly building an incorrect topology?)
The expected behaviour is that the above doesn't happen 😅.
Possible solution
We can requeue reconciliations (at least for some operations) until the store is fully consistent.
Eg:
On error: log the error and return early.
watcher::Event::Init -> the store is empty, waiting for InitApply events; requeue/return early.
watcher::Event::InitApply -> store is partially populated; requeue/return early until InitDone.
watcher::Event::InitDone -> store is populated, continue with reconcile.
watcher::Event::Apply -> store is populated, continue with reconcile.
watcher::Event::Delete -> store is populated, continue with reconcile.
Caution
I haven't checked whether we can and should requeue directly, or whether we should just return an error that bubbles up as a Result to the error_policy handler, which logs and requeues.
Regardless, we need to make sure the object is eventually reconciled and not just ignored.
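The event handling above could be sketched roughly as follows. This is a minimal, self-contained model: `WatchEvent` and `Next` are local stand-ins mirroring the kube-rs `watcher::Event` variants listed above, not the real `kube::runtime` types, and `StoreReadiness` is a hypothetical helper.

```rust
// Sketch: gate reconciliation on store readiness, following the event
// list above. The enum is a local stand-in for kube-rs watcher::Event.
#[derive(Debug, Clone, Copy, PartialEq)]
enum WatchEvent {
    Init,
    InitApply,
    InitDone,
    Apply,
    Delete,
    Error,
}

#[derive(Debug, PartialEq)]
enum Next {
    RequeueEarly,
    Reconcile,
}

/// Tracks whether the reflector store has seen a full initial list.
struct StoreReadiness {
    ready: bool,
}

impl StoreReadiness {
    fn new() -> Self {
        Self { ready: false }
    }

    /// Update readiness from a watch event and decide what to do next.
    fn observe(&mut self, ev: WatchEvent) -> Next {
        match ev {
            // A watch reset begins: the store is being rebuilt.
            WatchEvent::Init => {
                self.ready = false;
                Next::RequeueEarly
            }
            // Store is only partially repopulated so far.
            WatchEvent::InitApply => Next::RequeueEarly,
            // Initial list complete: the store is consistent again.
            WatchEvent::InitDone => {
                self.ready = true;
                Next::Reconcile
            }
            // Steady-state changes: the store is already populated.
            WatchEvent::Apply | WatchEvent::Delete => Next::Reconcile,
            // On error: log and return early.
            WatchEvent::Error => {
                eprintln!("watch error, returning early");
                Next::RequeueEarly
            }
        }
    }
}

fn main() {
    let mut s = StoreReadiness::new();
    assert_eq!(s.observe(WatchEvent::Init), Next::RequeueEarly);
    assert_eq!(s.observe(WatchEvent::InitApply), Next::RequeueEarly);
    assert_eq!(s.observe(WatchEvent::InitDone), Next::Reconcile);
    assert_eq!(s.observe(WatchEvent::Apply), Next::Reconcile);
    println!("store ready: {}", s.ready);
}
```

The key property is that a reset (`Init`) flips readiness back to false, so reconciliations that would rewrite (Cluster)RoleBindings are held off until `InitDone` confirms the store is consistent again.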
Additional context
My understanding of the problem/solution should be double checked with someone else.
Tip
This might only occur when the topology provider is used, but it also seems like something that could affect other SDP products with components that interact with the Kubernetes API.
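The "bubble up an error to error_policy" alternative from the caution above could look roughly like this. `Action`, `Error`, `reconcile` and `error_policy` here are local stand-ins modelled loosely on the kube-rs controller pattern, not the real `kube::runtime` API.

```rust
// Sketch: let reconcile fail fast while the store is inconsistent, and
// let an error_policy-style handler log and schedule the requeue.
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum Action {
    Requeue(Duration),
    AwaitChange,
}

#[derive(Debug)]
enum Error {
    StoreNotReady,
}

/// Refuse to act until the reflector store is consistent, so a
/// half-built store can never drop subjects from a (Cluster)RoleBinding.
fn reconcile(store_ready: bool) -> Result<Action, Error> {
    if !store_ready {
        return Err(Error::StoreNotReady);
    }
    // ... apply RoleBindings derived from the fully populated store ...
    Ok(Action::AwaitChange)
}

/// On error: log and schedule a retry, so the object is eventually
/// reconciled rather than silently ignored.
fn error_policy(err: &Error) -> Action {
    eprintln!("reconcile failed: {err:?}, requeueing");
    Action::Requeue(Duration::from_secs(5))
}

fn main() {
    // Store still initialising: the error bubbles up and we requeue.
    let action = reconcile(false).unwrap_or_else(|e| error_policy(&e));
    println!("{action:?}");
}
```

Either way (direct requeue or error_policy), the retry interval guarantees the object is eventually reconciled once the store reaches InitDone.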
Environment
No response
Would you like to work on fixing this bug?
yes