BCMS site continually crashes and NCBI needs confidence these issues are being monitored and addressed
User Story
BCMS users frequently see the application show a red pop-up error, go blank, or return a Gateway error, sometimes one after another. Users usually respond by refreshing the browser repeatedly to try to recover what they were doing. We need a reliable, systematic way to detect the root causes of this suspected application instability, and to confirm that the changes we make to address those root causes are effective.
Acceptance Criteria
- Reduction of reports from BCMS users about issues described in this bug ticket
- Logs created in #1493 (closed) demonstrate the BCMS application is stable without any crashes when we try to reproduce the user behaviors reported in this issue
Bug Description: Expected behaviour
- BCMS site should not crash
- Coko should monitor logs for crashes and have a documented process to identify the source of the crash and to fix that source problem
Bug Description: Current behaviour
We are repeatedly experiencing BCMS site crashes.
First we get a lot of red pop-up errors. I could not get a screenshot, but the message said, "Can not reload preview".
Then we see this:
And then we get this page:
But when we try to log in (enter our credentials and click Login), nothing happens.
Usually it is back up in 15-20 minutes, but we get many complaints and this can't happen in production. Update 2023-02-06: Coko confirmed in #1469 (closed) that this temporary login issue occurs because the server is being restarted automatically, which can take some time.
Steps to reproduce
We don't know what is causing this, so we can't say how to reproduce it. We would like Coko to propose a process for how they will monitor and troubleshoot such issues, whether past, current, or future.
Case 1
Additional information from @lathrops1 in #1461 (comment 107648)
I did ask these questions and the users could be doing anything:
- uploading files
- submitting, loading, publishing files
OR as with myself when I have experienced crashes, even just scrolling and navigating the BCMS to find something.
These issues have been happening regularly, so they are hard to track, but I know it happened exactly when I reported it on Mattermost on December 19, 2022, slightly before 3pm ET.
I will ask folks to report to me exactly what they were doing and at what time, but I think everyone is now pretty trained to ignore and try again, so that behavior might be hard to change.
The site did not fully crash, but in our instance on Tuesday Jan 17 between 4 and 5pm, I was trying to publish a chapter and got a JSON red error that would not permit me to finish my transaction. Denis thought this might be because the site crashed, but the behavior is different than things going blank, getting a login page, and not being able to log in.
Case 2
@John.kopanas @DioneMentis We do still have frequent crashes:
For example, today we had crashes at these times: (19:19, 18:12, 14:44, 10:57, 00:32, 23:32)
They typically show the same Postgres error: "Connection terminated unexpectedly".
It could be related to the "excessive pg connections but no query" issue.
Here are some of the logs surrounding these crashes:
January_25th_2023__19.19.14.582_2023-01-.txt
January_25th_2023__18.12.32.208_2023-01-.txt
January_25th_2023__18.12.29.144_2023-01-.txt
January_17th_2023__09.18.33.551_2023-01-.txt
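As a starting point for the systematic monitoring requested in this issue, the crash signature quoted above could be searched for across exported log files and grouped by hour. This is a minimal sketch, not Coko's tooling: the function names are hypothetical, and it assumes each log line begins with an ISO-style timestamp followed by the message, which this issue does not confirm.

```typescript
// Error text taken from the crash reports in this issue.
const CRASH_SIGNATURE = "Connection terminated unexpectedly";

interface CrashHit {
  timestamp: string; // first whitespace-delimited token of the line (assumed to be a timestamp)
  line: string;      // the full matching log line
}

// Return every log line that contains the crash signature.
function findCrashHits(logText: string, signature: string = CRASH_SIGNATURE): CrashHit[] {
  const hits: CrashHit[] = [];
  for (const line of logText.split(/\r?\n/)) {
    if (!line.includes(signature)) continue;
    // Assumption: the line starts with its timestamp.
    const timestamp = line.trim().split(/\s+/)[0] ?? "";
    hits.push({ timestamp, line });
  }
  return hits;
}

// Count hits per hour, keyed by the first 13 characters of the
// timestamp ("YYYY-MM-DDTHH" for ISO-style timestamps).
function countByHour(hits: CrashHit[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const hit of hits) {
    const hour = hit.timestamp.slice(0, 13);
    counts.set(hour, (counts.get(hour) ?? 0) + 1);
  }
  return counts;
}
```

Comparing the per-hour counts before and after a fix is deployed would give a rough check on whether that fix reduced this particular error.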
Bug Description: Environment
Possible solutions to resolving bug / technical proposal
Coko will receive an email notification whenever the server goes down and will investigate the logs. Known investigations are listed below:
- increase idleTimeoutMillis !1099 (merged) and monitor
- refactor permissions query (estimated in #1471 (closed))
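For readers unfamiliar with the first item, `idleTimeoutMillis` is a connection-pool option: it sets how long an unused database connection may sit idle before the pool closes it. The sketch below only illustrates where such a setting lives. The `PoolSettings` type and `withIdleTimeout` helper are hypothetical, the values are placeholders rather than those merged in !1099, and this issue does not say which pooling layer the change touched.

```typescript
// Field names match pool options found in both node-postgres and Knex.
// All values are placeholders, not the ones merged in !1099.
interface PoolSettings {
  max: number;               // upper bound on open connections
  idleTimeoutMillis: number; // how long an unused connection may stay idle before it is closed
}

const baseline: PoolSettings = { max: 10, idleTimeoutMillis: 10000 };

// Return a copy of the settings with a new idle timeout, rejecting invalid values.
function withIdleTimeout(settings: PoolSettings, idleTimeoutMillis: number): PoolSettings {
  if (!Number.isInteger(idleTimeoutMillis) || idleTimeoutMillis < 0) {
    throw new RangeError("idleTimeoutMillis must be a non-negative integer");
  }
  return { ...settings, idleTimeoutMillis };
}

// Example: raise the idle timeout from 10 seconds to 60 seconds.
const candidate = withIdleTimeout(baseline, 60000);
```

The resulting object would be passed to whichever pool the server actually constructs. Whether a longer idle timeout reduces the "Connection terminated unexpectedly" errors is exactly what the "and monitor" part of the item is meant to establish.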
QA Steps
[To be completed by Coko once dev is done]