Home: nocNetIntel

Welcome to the mysoftware-nocNetIntel wiki! This page outlines a plan for predicting site outages over three horizons (rest of today, tomorrow, and rest of the week) from the available data sources: alarm data, alarm classifications, site details, site data availability, and ticket data with root cause analysis (RCA). It describes the role of each source and a structured approach to building the predictive model.

Explanation of the Approach

Data Sources and Their Roles

Alarm Data:
- Contains timestamp, alarm_id, site_code, alarm_name, etc.
- Used to identify patterns of alarms leading to outages.
Alarm Classifications:
- Maps alarm_name to categories (site_outage, cell_outage, service_affecting, passive, power, power_event).
- Critical for labeling alarms as site outage-related, which is the target for prediction.
Site Details:
- Includes site_code, region, equipment_type, capacity, etc.
- Provides contextual features to differentiate site-specific outage risks.
Site Data Availability:
- Tracks timestamp, site_code, RNA availability_percentage, etc.
- Indicates site health, where low availability may precede outages.
Ticket Data:
- Contains ticket_id, site_code, alarm_id, timestamp, rca, resolution_status, etc.
- RCA provides insights into historical outage causes, enhancing feature engineering and model interpretability.
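
As an illustration of how the alarm classifications can turn raw alarms into a prediction target, here is a minimal sketch. The alarm names and the "unclassified" fallback are hypothetical; the real mapping would come from the Alarm Classifications source.

```python
# Hypothetical alarm_name -> category mapping; real names come from the
# Alarm Classifications source.
ALARM_CATEGORIES = {
    "NE_DOWN": "site_outage",
    "CELL_UNAVAILABLE": "cell_outage",
    "MAINS_FAILURE": "power",
    "DOOR_OPEN": "passive",
}


def label_alarms(alarms, categories=ALARM_CATEGORIES):
    """Attach a category and a binary site-outage target to each alarm record."""
    labeled = []
    for alarm in alarms:
        # Alarms without a known classification get a placeholder category.
        category = categories.get(alarm["alarm_name"], "unclassified")
        labeled.append({
            **alarm,
            "category": category,
            "is_site_outage": int(category == "site_outage"),
        })
    return labeled


# Made-up sample records using the alarm fields listed above.
alarms = [
    {"alarm_id": 1, "site_code": "S001", "alarm_name": "MAINS_FAILURE"},
    {"alarm_id": 2, "site_code": "S001", "alarm_name": "NE_DOWN"},
    {"alarm_id": 3, "site_code": "S002", "alarm_name": "FAN_FAULT"},
]
labeled = label_alarms(alarms)
```

Only the second record is marked as a site outage; the third has no classification and falls back to the placeholder category.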

How to Achieve Site Outage Prediction

Data Integration:
- Merge all data sources on common keys (site_code, timestamp, alarm_id) to create a unified dataset.
- Handle missing values (e.g., assume 100% availability if missing, default RCA to "unknown").
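
The integration step could be sketched with pandas as follows. The sample rows, the hourly `hour` join key for availability, and the choice of left joins are assumptions made for illustration; they are not taken from the project script.

```python
import pandas as pd

# Made-up sample frames; column names follow the data sources described above.
alarms = pd.DataFrame({
    "alarm_id": [1, 2, 3],
    "site_code": ["S001", "S001", "S002"],
    "alarm_name": ["MAINS_FAILURE", "NE_DOWN", "DOOR_OPEN"],
    "timestamp": pd.to_datetime(
        ["2024-01-01 10:05", "2024-01-01 10:40", "2024-01-01 11:15"]
    ),
})
classes = pd.DataFrame({
    "alarm_name": ["MAINS_FAILURE", "NE_DOWN", "DOOR_OPEN"],
    "category": ["power", "site_outage", "passive"],
})
sites = pd.DataFrame({"site_code": ["S001", "S002"], "region": ["north", "south"]})
availability = pd.DataFrame({
    "site_code": ["S001"],
    "hour": pd.to_datetime(["2024-01-01 10:00"]),  # assumed hourly granularity
    "availability_percentage": [92.5],
})
tickets = pd.DataFrame({"alarm_id": [2], "rca": ["power failure"]})

# Left joins keep every alarm even when a lookup has no matching row.
merged = (
    alarms.assign(hour=alarms["timestamp"].dt.floor("h"))
    .merge(classes, on="alarm_name", how="left")
    .merge(sites, on="site_code", how="left")
    .merge(availability, on=["site_code", "hour"], how="left")
    .merge(tickets, on="alarm_id", how="left")
)

# Missing-value defaults as described above.
merged["availability_percentage"] = merged["availability_percentage"].fillna(100.0)
merged["rca"] = merged["rca"].fillna("unknown")
```

Alarm timestamps are floored to the hour so they can be matched against availability samples, which are assumed here to be hourly.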
Feature Engineering:
- Temporal Features: Extract hour, day of week, and holiday indicators from timestamps to capture time-based patterns.
- Rolling Statistics: Compute per-site alarm counts over the trailing 24 hours (alarm_count_24h) to detect spikes.
- Categorical Encoding: Encode alarm classifications and RCA using LabelEncoder.
- Scaling: Normalize numerical features like availability_percentage and alarm_count_24h.
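
A minimal pandas sketch of these four feature groups is shown below. The sample rows and the holiday calendar are hypothetical. To keep the example dependency-free, pandas category codes stand in for LabelEncoder (both assign integers to the sorted class names), and simple min-max scaling is used as one possible normalization.

```python
import pandas as pd

# Made-up sample rows for a single site.
df = pd.DataFrame({
    "site_code": ["S001"] * 4,
    "timestamp": pd.to_datetime([
        "2024-01-01 01:00", "2024-01-01 12:00",
        "2024-01-01 23:00", "2024-01-02 18:00",
    ]),
    "category": ["power", "passive", "site_outage", "power"],
    "availability_percentage": [100.0, 95.0, 60.0, 90.0],
})
df = df.sort_values(["site_code", "timestamp"]).reset_index(drop=True)

# Temporal features.
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday = 0
HOLIDAYS = pd.to_datetime(["2024-01-01"])  # hypothetical holiday calendar
df["is_holiday"] = df["timestamp"].dt.normalize().isin(HOLIDAYS)

# Rolling statistics: alarms per site within the trailing 24 hours.
parts = []
for _, group in df.groupby("site_code"):
    ones = pd.Series(1.0, index=pd.DatetimeIndex(group["timestamp"]))
    counts = ones.rolling("24h").sum()
    parts.append(pd.Series(counts.to_numpy(), index=group.index))
df["alarm_count_24h"] = pd.concat(parts)

# Categorical encoding: integer code per sorted category name.
df["category_code"] = df["category"].astype("category").cat.codes


def min_max(col):
    """Scale a numeric column to [0, 1]; constant columns map to 0."""
    span = col.max() - col.min()
    return (col - col.min()) / span if span else col * 0.0


# Scaling of the numerical features named above.
for name in ["availability_percentage", "alarm_count_24h"]:
    df[name + "_scaled"] = min_max(df[name])
```

In a real pipeline the scaling parameters would be fitted on training data only and reused at prediction time, so that test data does not leak into the features.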
Model Architecture:
- Use a multi-layer LSTM to model time-series sequences of alarms, availability, and other features.
- Input: Sequences of length 24 (e.g., hourly data over one day).
- Output: Probability of site outage and predicted outage type (classification).
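
To make the input format concrete, the sketch below slices an hourly feature matrix into overlapping sequences of length 24 with NumPy. The helper name `make_sequences` and the choice to label each window with the hour that follows it are assumptions for illustration.

```python
import numpy as np


def make_sequences(features, labels, seq_len=24):
    """Slice hourly data into overlapping windows for a sequence model.

    features: array of shape (n_hours, n_features)
    labels:   array of shape (n_hours,)
    Returns X with shape (n_windows, seq_len, n_features) and y with shape
    (n_windows,), where y[i] is the label of the hour right after window i.
    """
    features = np.asarray(features, dtype=np.float32)
    labels = np.asarray(labels)
    n_windows = len(features) - seq_len
    if n_windows <= 0:
        # Not enough history for a single full window.
        empty = np.empty((0, seq_len, features.shape[1]), dtype=np.float32)
        return empty, labels[:0]
    X = np.stack([features[i:i + seq_len] for i in range(n_windows)])
    y = labels[seq_len:]
    return X, y


# 30 hours of made-up data with 3 features per hour.
hours = 30
features = np.arange(hours * 3, dtype=np.float32).reshape(hours, 3)
labels = np.zeros(hours, dtype=np.int64)
labels[-1] = 1  # an outage in the final hour
X, y = make_sequences(features, labels)
```

With 30 hours of history this yields six windows, and only the last window is followed by an outage hour.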
Prediction Horizons:
- Rest of the Day: Use the latest sequence of data to predict outages until midnight.
- Tomorrow: Shift the prediction window to the next day.
- Rest of the Week: Extend predictions up to 7 days, aggregating daily probabilities.
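
The horizons and the weekly aggregation can be sketched with the standard library. Two assumptions are made here: the week window covers today plus six further calendar days, and daily outage events are treated as independent, so the probability of at least one outage is one minus the product of the daily no-outage probabilities. The function names are hypothetical.

```python
from datetime import datetime, timedelta
from math import prod


def horizon_windows(now):
    """Return (start, end) datetimes for each prediction horizon."""
    next_midnight = datetime.combine(
        now.date() + timedelta(days=1), datetime.min.time()
    )
    return {
        "rest_of_day": (now, next_midnight),
        "tomorrow": (next_midnight, next_midnight + timedelta(days=1)),
        # Assumption: today plus six further calendar days.
        "rest_of_week": (now, next_midnight + timedelta(days=6)),
    }


def any_outage_probability(daily_probs):
    """P(at least one outage), assuming the daily events are independent."""
    return 1.0 - prod(1.0 - p for p in daily_probs)


windows = horizon_windows(datetime(2024, 1, 3, 15, 30))
weekly = any_outage_probability([0.1, 0.2, 0.05])
```

Independence is a simplification: outages on consecutive days often share a cause, in which case this aggregate overstates the weekly risk.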
Evaluation and Output:
- Evaluate the binary classification (site outage vs. others) with F1 and AUC, and the probability estimates with RMSE between predicted probabilities and observed 0/1 outcomes.
- Output predictions in a DataFrame with site_code, prediction_time, outage_probability, and predicted_outage_type.
- Store results in a PostgreSQL table (as per the README schema) for downstream use.
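
For reference, the three metrics can be computed from scratch as in the sketch below. It uses only the standard library, the pairwise AUC is written for clarity rather than speed, and the 0.5 decision threshold is an assumption. A library such as scikit-learn would normally be used in practice.

```python
from math import sqrt


def f1(y_true, y_pred):
    """F1 score for binary labels (1 = site outage)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rmse(y_true, probs):
    """Root mean squared error between probabilities and 0/1 outcomes."""
    return sqrt(sum((p - t) ** 2 for t, p in zip(y_true, probs)) / len(y_true))


# Made-up outcomes and predicted outage probabilities.
y_true = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]
y_pred = [int(p >= 0.5) for p in probs]  # assumed 0.5 threshold

scores = {"f1": f1(y_true, y_pred), "auc": roc_auc(y_true, probs), "rmse": rmse(y_true, probs)}
```

The RMSE computed this way is the square root of the Brier score, so lower values indicate better-calibrated probabilities.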

Implementation Notes

The provided script assumes placeholder data loading. Replace load_data_sources() with actual logic (e.g., reading from CSV, SQL, or APIs).
The LSTM model is configured with 2 layers and 64 hidden units, but these hyperparameters can be tuned.
The script supports multiple prediction horizons by adjusting the time window in predict_outages().
To productionize, integrate with FastAPI (as per the README) for API endpoints and Celery for asynchronous data processing.
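
The configuration described in these notes (2 LSTM layers, 64 hidden units, sequences of length 24) could look like the PyTorch sketch below. The framework choice, the class name `OutageLSTM`, the two output heads, and the feature count are assumptions; the wiki does not specify how the actual script defines the model.

```python
import torch
from torch import nn


class OutageLSTM(nn.Module):
    """Hypothetical two-headed LSTM: outage probability plus outage type."""

    def __init__(self, n_features, n_outage_types, hidden_size=64, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=n_features,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,  # inputs are (batch, seq_len, n_features)
        )
        self.outage_head = nn.Linear(hidden_size, 1)
        self.type_head = nn.Linear(hidden_size, n_outage_types)

    def forward(self, x):
        _, (h_n, _) = self.lstm(x)
        last = h_n[-1]  # final hidden state of the top layer: (batch, hidden)
        outage_prob = torch.sigmoid(self.outage_head(last)).squeeze(-1)
        type_logits = self.type_head(last)  # unnormalized class scores
        return outage_prob, type_logits


# Six outage types matches the six alarm categories listed earlier;
# five input features is an arbitrary placeholder.
model = OutageLSTM(n_features=5, n_outage_types=6)
batch = torch.randn(8, 24, 5)  # 8 sites, 24 hourly steps, 5 features
with torch.no_grad():
    outage_prob, type_logits = model(batch)
```

Training such a model would typically combine a binary cross-entropy loss on the probability head with a cross-entropy loss on the type logits.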

This approach combines the data sources above and follows the nocNetIntel architecture to predict site outages.
