Skip to content

Conversation

@fisher-alice
Copy link
Contributor

@fisher-alice fisher-alice commented Dec 17, 2025

This follows up #70005. Comparing counts between Cloudwatch and Redshift, I noticed a lot of 'conflict-save-sources-reload' errors. However, this error is currently not incrementing the count for LegacyLab.ProjectSaveFailure since it's a save failure due to the user, so adding a NEW metric in legacy and lab2 projects for this event and 'unauthorized-save-sources-reload' (not as common of an error).

Comparing Firehose and Cloudwatch data for legacy lab save errors logged in projects.js in this spreadsheet, I am a little mystified why firehose counts are higher than cloudwatch for 'unauthorized-save-sources-reload'.

Screenshot 2025-12-17 at 3 40 42 PM

For the other events, the firehose counts are lower as expected since as @cat5inthecradle explained, "events sent to firehose are sent in batches, if any event in a batch throws an error when we try to load it into redshift, the whole batch is skipped. The error rate is pretty low, but nonzero."

Links

Testing story

Deployment strategy

Follow-up work

Privacy

Security

Caching

PR Creation Checklist:

  • Tests provide adequate coverage
  • Privacy impacts have been documented
  • Security impacts have been documented
  • Code is well-commented
  • New features are translatable or updates will not break translations
  • Relevant documentation has been added or updated
  • User impact is well-understood and desirable
  • Follow-up work items (including potential tech debt) are tracked and linked

@fisher-alice fisher-alice marked this pull request as ready for review December 17, 2025 21:41
@fisher-alice fisher-alice requested review from a team and molly-moen December 17, 2025 21:41
saveSourcesErrorCount,
err.message
).finally(() => utils.reload());
MetricsReporter.incrementCounter('LegacyLab.ProjectSaveFailure', [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could log this and the 409 as some other metric. They aren't failures on our end, but on the user's end. Especially for the 409, that is an expected behavior, we will reload the page if we see a conflict. This is the case when a user has the same project open in multiple places, and we don't reconcile it. The unauthorized error could be us, but is more likely a user getting logged out and needing to log in again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I purposefully did not include these in the original failure count 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha! That makes sense - thanks! Do you remember reading about the page reloading automatically in some cases in Zendesk reports that have to do with save failures?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do remember one instance of page reloading in a report, but I think only one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think adding a new metric for these errors would be good! I'l update to do this instead.

'conflict-save-sources-reload',
saveSourcesErrorCount,
err.message
).finally(() => utils.reload());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticing that we automatically reload page if we get this kind of save error - going to go through zendesk tickets next to see if users included that the page would reload automatically and the project wasn't saved.

@fisher-alice fisher-alice marked this pull request as draft December 17, 2025 21:54
@fisher-alice fisher-alice changed the title Log metrics for save failure due to conflict or unauthorized Log new metric for save failure due to conflict or unauthorized Dec 17, 2025
@fisher-alice fisher-alice marked this pull request as ready for review December 18, 2025 23:55
MetricsReporter.incrementCounter(
'LegacyLab.ProjectSaveFailureClient',
[
{name: 'AppName', value: this.getStandaloneApp()},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need to use this method for the app name, we had an incident because getStandaloneApp is sometimes null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - updated!

// Log 413s as warnings. The save fail listener should handle these errors and labs should
// show a reasonable error message to the user.
this.metricsReporter.logWarning('Project too large to save');
this.metricsReporter.publishMetric(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced we should log this one; we give a message to the user in this case and consider it a warning since it's on the user to decrease their project size. We also don't have a similar metric for legacy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense - removed!

MetricsReporter.incrementCounter('LegacyLab.ProjectSaveFailure', [
{name: 'AppName', value: this.getStandaloneAppForMetrics()},
{name: 'SaveType', value: 'sources'},
{name: 'ErrorType', value: 'unknown'},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we add this dimension it will mess up our existing logging (we'll have to start over from the point this gets deployed, because of the way aws handles the dimensions). Since this is a separate metric and the type will always be 'unknown' is it worth adding this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll go ahead and remove!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove for lab2 as well in ProjectManager.

@fisher-alice fisher-alice merged commit e3d43f7 into staging Dec 19, 2025
6 checks passed
@fisher-alice fisher-alice deleted the alice/log-more-save-errors branch December 19, 2025 18:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants