Uniform Handling of Errors throughout COSMOS #1136

Open: wants to merge 1 commit into base `dev`

48 changes: 48 additions & 0 deletions docs/architecture-decisions/README.md
@@ -0,0 +1,48 @@
# Architecture Decision Records (ADRs)

This directory contains Architecture Decision Records (ADRs) documenting key architectural decisions made in this project.

## Index of ADRs (update whenever an ADR is added or its status changes)

- [Uniform Error Handling](uniform-error-handling.md)
**Status**: Proposed | **Date**: 2024-12-12
**Related Links**: [Issue #1112](#)

## Maintenance Guidelines

1. Keep Index Updated: Always update the index above when a new ADR is added or its status changes.
2. Use Consistent Formatting: Follow the provided template to ensure clarity and uniformity.
3. Cross-Reference Decisions: Link to related Issues, PRs, or other ADRs for better traceability.

## Format for New ADRs

To add a new ADR to this directory:

1. Create a new markdown file in this directory with a descriptive filename (e.g., `use-graphql.md`).
2. Use the following template for the ADR content:

```markdown
# [Title of the Decision]

## Status
[Proposed | Accepted | Deprecated | Rejected]

## Context
[Explain why this decision is being made. Provide background information, such as the problem to be solved, goals, and relevant constraints.]

## Decision
[Clearly describe the decision made. Include details about what was chosen and how it will be implemented.]

## Consequences
### Positive
[Describe the benefits of the decision.]

### Negative
[Describe the trade-offs, risks, or potential issues resulting from this decision.]

## Alternatives Considered
1. [Alternative 1]: [Brief description of the alternative, its pros, and cons.]
2. [Alternative 2]: [Brief description of the alternative, its pros, and cons.]

## References
[Provide links to relevant documents, discussions, RFCs, PRs, Issues or resources that support this decision.]
```
76 changes: 76 additions & 0 deletions docs/architecture-decisions/uniform-error-handling.md
@@ -0,0 +1,76 @@
# Uniform Error Handling

## Status
Proposed

## Context
The current error handling system only prints errors to the terminal. These messages are not preserved, so users are never informed of the issues they run into and developers have nothing to debug from afterwards. A consistent error-handling strategy is required to improve the user experience and simplify debugging.

## Decision
We propose adopting the **Error Dashboard** approach for handling system errors. It consolidates real-time and asynchronous errors in one central location, which makes them easier to track and resolve.

### Why This Decision Was Taken
Some tasks run asynchronously (e.g., Celery tasks), so their errors cannot be shown to the user in real time. The dashboard was chosen so that both real-time and asynchronous errors are recorded and displayed.

- **Real-Time Errors**: These will appear in the Error Dashboard as soon as they occur.
- **Asynchronous Errors**: Errors encountered during asynchronous tasks (e.g., Celery tasks) will be recorded and shown in the dashboard once the task completes.

To achieve this, a targeted script will be developed to monitor asynchronous tasks. Details of this script are outlined in the **Monitoring Asynchronous Tasks** section of this ADR.
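
The split between the two error kinds can be sketched as a small store that the dashboard reads from. This is only an illustration under stated assumptions: `ErrorRecord`, `ErrorStore`, and their fields are hypothetical names, and the in-memory list stands in for whatever persistence the project ends up choosing.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ErrorRecord:
    """One dashboard entry (hypothetical shape, not taken from the COSMOS code)."""
    message: str
    source: str  # "real-time" or "async"
    task_id: Optional[str] = None
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class ErrorStore:
    """In-memory stand-in for the dashboard's eventual storage backend."""

    def __init__(self) -> None:
        self._records: List[ErrorRecord] = []

    def record_realtime(self, message: str) -> ErrorRecord:
        # Called directly from the request path, so the entry is visible at once.
        rec = ErrorRecord(message=message, source="real-time")
        self._records.append(rec)
        return rec

    def record_async(self, task_id: str, message: str) -> ErrorRecord:
        # Called once a background task (e.g. a Celery task) has ended in an error.
        rec = ErrorRecord(message=message, source="async", task_id=task_id)
        self._records.append(rec)
        return rec

    def latest(self, limit: int = 50) -> List[ErrorRecord]:
        # Most recently recorded entries first.
        return list(reversed(self._records))[:limit]
```

A dashboard endpoint would then only need to call `latest()`, regardless of whether an entry came from a request or from a finished background task.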

## Consequences

### Positive
- **Centralization**: All errors are consolidated in one place, reducing the chances of overlooked issues.
- **Improved User Experience**: Users have a clear view of errors affecting their operations without relying on backend logs or email notifications.
- **Scalability**: Suitable for large-scale operations involving asynchronous and real-time tasks.

### Negative
- **Navigation Overhead**: Users must navigate to the dashboard, which could be less convenient than inline error notifications.
- **Resource Requirements**: Developing and maintaining the dashboard requires additional frontend and backend resources.

## Alternatives Considered

### 1. **Logging System**
- **Approach**: Use Python’s built-in logging library to save structured logs with details like timestamps and severity levels.
- **Advantages**:
  - Preserves logs for debugging.
  - Differentiates critical and minor errors for better prioritization.
  - Efficient for monitoring asynchronous tasks.
- **Disadvantage**:
  - Primarily benefits developers; does not directly improve user interaction.
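
For comparison, a minimal sketch of what this alternative could look like with the standard `logging` module. The logger name `cosmos.errors`, the default file name, and the helper `build_error_logger` are assumptions made for illustration, not existing project code.

```python
import logging


def build_error_logger(log_path: str = "cosmos_errors.log") -> logging.Logger:
    """Return a logger that appends timestamped, levelled records to a file."""
    logger = logging.getLogger("cosmos.errors")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Calling `logger.warning(...)` for minor issues and `logger.error(...)` or `logger.critical(...)` for serious ones gives the severity distinction mentioned above, but the output still lives in a file that only developers are likely to read.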

### 2. **Frontend Notifications**
- **Approach**: Display error messages directly on the frontend.
- **Advantages**:
  - Enhances user interaction by providing immediate feedback.
- **Disadvantages**:
  - Resource-intensive to implement.
  - Real-time feedback is challenging for asynchronous tasks.

### 3. **Email Notifications**
- **Approach**: Send email alerts for critical failures.
- **Advantages**:
  - Simple to implement.
  - Suitable for asynchronous task monitoring.
- **Disadvantages**:
  - Users need to register their email addresses.
  - Requires users to check their email, which delays feedback.

### 4. **Error Dashboard**
- **Approach**: Display errors on a dedicated frontend dashboard.
- **Advantages**:
  - Consolidates both real-time and asynchronous errors.
  - Provides a centralized location for error tracking.
- **Disadvantages**:
  - Requires navigation away from the operational page to view errors.

## Monitoring Asynchronous Tasks
A targeted script will be developed to monitor asynchronous tasks for errors during import operations:
- **Trigger**: Activated at the start of an import operation.
- **Duration**: Runs for 10 minutes, polling the Flower API every minute.
- **Functionality**:
  - Detects failed tasks related to the import.
  - Notifies developers or users as appropriate.
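
The polling behaviour described above could look roughly like the following sketch. It assumes Flower is reachable at its default port and that its `/api/tasks` endpoint is queried with a `state=FAILURE` filter. The function names, the `is_import_task` predicate, and the `notify` callback are hypothetical placeholders. Ten polls one minute apart approximate the ten-minute window.

```python
import json
import time
import urllib.request
from typing import Callable, Dict, List

FLOWER_URL = "http://localhost:5555"  # assumption: Flower's default port


def fetch_failed_tasks(base_url: str = FLOWER_URL) -> Dict[str, dict]:
    """Ask Flower for tasks in the FAILURE state; returns {task_id: task_info}."""
    url = f"{base_url}/api/tasks?state=FAILURE"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def monitor_import(
    is_import_task: Callable[[dict], bool],
    fetch: Callable[[], Dict[str, dict]] = fetch_failed_tasks,
    notify: Callable[[str, dict], None] = lambda task_id, info: print(task_id, info),
    polls: int = 10,
    interval_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll `polls` times, `interval_s` apart; report each failed import task once."""
    seen: List[str] = []
    for i in range(polls):
        for task_id, info in fetch().items():
            if task_id not in seen and is_import_task(info):
                seen.append(task_id)
                notify(task_id, info)
        if i < polls - 1:  # no need to wait after the final poll
            sleep(interval_s)
    return seen
```

Passing `fetch`, `notify`, and `sleep` as parameters keeps the loop testable without a running Flower instance. A predicate such as `lambda info: (info.get("name") or "").startswith("import")` is one possible way to pick out import-related tasks, though the actual task naming is not specified here.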

## References
- Issue: [#1112](#)