-
Notifications
You must be signed in to change notification settings - Fork 524
Log new metric for save failure due to conflict or unauthorized #70086
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| saveSourcesErrorCount, | ||
| err.message | ||
| ).finally(() => utils.reload()); | ||
| MetricsReporter.incrementCounter('LegacyLab.ProjectSaveFailure', [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could log this and the 409 as some other metric. They aren't failures on our end, but on the user's end. Especially for the 409, that is an expected behavior, we will reload the page if we see a conflict. This is the case when a user has the same project open in multiple places, and we don't reconcile it. The unauthorized error could be us, but is more likely a user getting logged out and needing to log in again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I purposefully did not include these in the original failure count 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha! That makes sense - thanks! Do you remember reading about the page reloading automatically in some cases in Zendesk reports that have to do with save failures?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do remember one instance of page reloading in a report, but I think only one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think adding a new metric for these errors would be good! I'l update to do this instead.
| 'conflict-save-sources-reload', | ||
| saveSourcesErrorCount, | ||
| err.message | ||
| ).finally(() => utils.reload()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noticing that we automatically reload page if we get this kind of save error - going to go through zendesk tickets next to see if users included that the page would reload automatically and the project wasn't saved.
| MetricsReporter.incrementCounter( | ||
| 'LegacyLab.ProjectSaveFailureClient', | ||
| [ | ||
| {name: 'AppName', value: this.getStandaloneApp()}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll need to use this method for the app name, we had an incident because getStandaloneApp is sometimes null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - updated!
| // Log 413s as warnings. The save fail listener should handle these errors and labs should | ||
| // show a reasonable error message to the user. | ||
| this.metricsReporter.logWarning('Project too large to save'); | ||
| this.metricsReporter.publishMetric( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced we should log this one; we give a message to the user in this case and consider it a warning since it's on the user to decrease their project size. We also don't have a similar metric for legacy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense - removed!
| MetricsReporter.incrementCounter('LegacyLab.ProjectSaveFailure', [ | ||
| {name: 'AppName', value: this.getStandaloneAppForMetrics()}, | ||
| {name: 'SaveType', value: 'sources'}, | ||
| {name: 'ErrorType', value: 'unknown'}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we add this dimension it will mess up our existing logging (we'll have to start over from the point this gets deployed, because of the way aws handles the dimensions). Since this is a separate metric and the type will always be 'unknown' is it worth adding this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll go ahead and remove!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll remove for lab2 as well in ProjectManager.
This follows up #70005. Comparing counts between Cloudwatch and Redshift, I noticed a lot of 'conflict-save-sources-reload' errors. However, this error is currently not incrementing the count for
LegacyLab.ProjectSaveFailuresince it's a save failure due to the user, so adding a NEW metric in legacy and lab2 projects for this event and 'unauthorized-save-sources-reload' (not as common of an error).Comparing Firehose and Cloudwatch data for legacy lab save errors logged in
projects.jsin this spreadsheet, I am a little mystified why firehose counts are higher than cloudwatch for 'unauthorized-save-sources-reload'.For the other events, the firehose counts are lower as expected since as @cat5inthecradle explained, "events sent to firehose are sent in batches, if any event in a batch throws an error when we try to load it into redshift, the whole batch is skipped. The error rate is pretty low, but nonzero."
Links
Testing story
Deployment strategy
Follow-up work
Privacy
Security
Caching
PR Creation Checklist: