Not receiving kafka notifications causes components to get stuck in "Converting" and "Loading preview"
@lathrops1 @Kireev @deniskar @John.kopanas @yannis @danjela @lathrops1 @andynicholson
There is a known problem of book components getting 'stuck' in the Converting or Loading preview status. This problem occurs because BCMS does not receive the kafka notification (for an error or successful preview).
BCMS doesn't receive the kafka notifications because the connection to the kafka server is lost. This might be because we're connected to a single kafka server at https://test.ncbi.nlm.nih.gov/books/kafka for all three instances: the coko testing site, the ncbi testing site, and dev (locally run apps). Having one kafka server per instance would help us debug other possible reasons for the "stuck" problem, but this would require NCBI to set up the three kafka servers.
We're not sure why the kafka server goes down so frequently in the first place. It could be because we're connecting via a proxy server (due to NCBI's security requirements). Additionally, the library NCBI uses is deprecated (see #203 (closed)).
At the moment all we can do is keep trying to reconnect (every 80 seconds) until a connection is established, but any notifications sent while there is no connection are lost. This isn't a sufficient solution.
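To make the current behaviour concrete, here is a minimal sketch of a fixed-interval reconnect loop. It is not the BCMS code: `connect` is a hypothetical callable that returns `True` once the kafka connection is up, and only the 80-second interval comes from the description above.

```python
import time
from typing import Callable

RECONNECT_INTERVAL_SECONDS = 80  # interval quoted in this issue


def reconnect_until_connected(
    connect: Callable[[], bool],  # hypothetical: returns True when connected
    interval: float = RECONNECT_INTERVAL_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call connect() until it succeeds and return the number of attempts."""
    attempts = 0
    while True:
        attempts += 1
        if connect():
            return attempts
        # Nothing is consumed while we wait here, which is the window
        # in which notifications can be missed.
        sleep(interval)
```

The `sleep` parameter is only there so the loop can be exercised without actually waiting. The sketch also shows why retrying alone isn't enough: nothing in the loop recovers notifications sent during the waits.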
We need to debug the issue asap as it's slowing down development and testing on both sides. I suggest that whenever anyone in the team experiences this problem, they comment here for @deniskar's attention with the time they noticed it and a link to the book component. NCBI will need to share the logs and help debug. Does that work for you, @deniskar?
Assuming the lost connection issue is resolved, this alone won't move a book component out of the Converting or Loading preview status. We should scope out the use cases for when a book component should be placed in a "failed" state so that the user can "submit" again or "reload preview". One suggestion was to fail after a certain number of kafka connection attempts; however, it would be better for the failed state to relate to specific jobs rather than to a general BCMS/Kafka problem. So, for example, we could say: if BCMS does not receive a notification for chapter 1 in x amount of time, then put the book component into a failed state.
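As a starting point for scoping, the per-job timeout rule could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `BookComponent` shape, the status strings, and the idea of running the check periodically. The timeout is passed in as a parameter because the value of x is still to be decided.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List

# Hypothetical status names; the real BCMS values may differ.
PENDING_STATUSES = {"converting", "loading-preview"}


@dataclass
class BookComponent:
    id: str
    status: str
    submitted_at: datetime  # when the conversion/preview job was sent


def fail_timed_out(
    components: Iterable[BookComponent],
    now: datetime,
    timeout: timedelta,
) -> List[str]:
    """Mark pending components older than `timeout` as failed; return their ids."""
    failed = []
    for component in components:
        waited = now - component.submitted_at
        if component.status in PENDING_STATUSES and waited > timeout:
            component.status = "failed"
            failed.append(component.id)
    return failed
```

Because the check is tied to each component's own submission time, a failure reflects a specific job that never reported back, not a general BCMS/Kafka connection problem.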