# Best Practices & Troubleshooting

## Appendix A: Best Practices and Troubleshooting

*A collection of expert recommendations, professional tips, and solutions to common problems to help you get the most out of EDCC.*

***

### Overview: Your Quick Reference Guide

This appendix serves two key purposes:

**A Best Practices Checklist**: It consolidates the most important recommendations discussed throughout this manual into a single place, helping you establish an efficient, secure, and scalable management workflow.

**A "First Aid" Troubleshooting Guide**: It provides quick, step-by-step solutions to the most common problems and questions you may encounter. When something goes wrong, start here.

***

### 1. Best Practices Checklist

This section consolidates best practices into a scannable checklist, organized by operational area.

#### **Initial Setup & Planning**

* **Plan Your Hierarchy**: Before adding nodes, map out your Organization and POD structure. Name PODs logically (e.g., "Taipei-IDC-Aisle-5," "Production-SQL-Cluster")
* **Configure System Services Early**: Configure the Mail Server and HTTPS File Server in **System > Application Settings** as a first step. They are prerequisites for user invitations and OS deployment
* **Secure Default Credentials**: Immediately change the default BMC password for your PODs in **CONFIGURE > General Settings**

#### **Daily Monitoring & Operations**

* **Start with the Dashboard**: Begin your daily checks on the **Dashboard**. The **POD Health** widget is your most critical indicator
* **Use Groups for Efficiency**: Organize nodes into Groups in the **Node List** for easier filtering and bulk operations
* **Keep POD View Updated**: Treat **POD View** as your logical "source of truth." Keep it updated to reflect your real-world rack layouts

#### **Maintenance & Updates**

* **Automate Firmware Updates**: Use the **Firmware Provisioning** feature for routine, fleet-wide updates. Schedule the **Maintenance Window** carefully during off-peak hours
* **Back Up Before Changes**: Always create a **Configuration Backup** before making significant changes to a POD. Create a **System Backup** before making platform-level changes
* **Protect Your "Golden Images"**: Use the **Protect** feature on known-good, stable **Configuration Backups** to preserve a reliable rollback point

#### **Security Management**

* **Follow the Principle of Least Privilege**: Grant users the minimum level of permissions required. Use granular POD-level roles instead of Organization-level roles whenever possible
* **Regularly Audit Events**: Periodically review the **System Event** log to audit administrative actions and track changes

***

### 2. Troubleshooting Common Issues

This section provides solutions to the most common questions and problems, grouped by category.

#### **Scope & Permission Issues**

**Q: Why can't I edit settings or perform an operation on my node?**

**Solution**: You are in the wrong management scope. Configuration can only be performed when a POD is selected.

**How to fix**: Open the **Management Tree** and click on the specific POD that contains the node you want to manage.

**Q: I registered a new node, but I can't find it in the Node List.**

**Solution**: The node is waiting in the global **Inventory**. It must be assigned to a POD before it can be managed.

**How to fix**: Go to **System > Inventory**. Find the node, select it, and click the **Assign Device** button to move it to your desired POD.

#### **Hardware & Status Alerts**

**Q: The Dashboard shows a "CRITICAL" status for my POD. What should I do?**

**Solution**: This status directly reflects unresolved events in the BMC SEL (System Event Log) of one or more nodes in the POD.

**Step-by-step fix**:

1. Go to **MANAGE > Services** (**Redfish SEL Health** tab) to quickly identify which node(s) are reporting a **Critical** status
2. Use the **BMC SEL** shortcut for an affected node to jump directly to its event log
3. In the **BMC SEL** tab, identify the hardware event. After resolving the physical issue, toggle the event's status to **Resolved** and click **Apply**
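The triage logic in the steps above can also be sketched outside the UI. The following is a minimal illustration, assuming you already have a log entry collection as parsed JSON in the standard Redfish `LogEntry` shape (a `Members` array whose entries carry `Severity` and, on newer schema versions, `Resolved`). This manual does not state how EDCC's **Resolved** toggle maps to the Redfish `Resolved` property, so treat that mapping as an assumption. The function name and the sample payload are hypothetical.

```python
from typing import Any


def unresolved_critical_entries(log_entries: dict[str, Any]) -> list[dict[str, Any]]:
    """Return members with Severity 'Critical' that are not flagged Resolved."""
    members = log_entries.get("Members", [])
    return [
        entry for entry in members
        if entry.get("Severity") == "Critical" and not entry.get("Resolved", False)
    ]


# Made-up sample payload for illustration only; not captured from a real BMC.
sample = {
    "Members": [
        {"Id": "1", "Severity": "OK", "Message": "Fan 1 speed normal"},
        {"Id": "2", "Severity": "Critical", "Message": "PSU 2 input lost"},
        {"Id": "3", "Severity": "Critical", "Message": "Temp sensor over limit",
         "Resolved": True},
    ]
}

for entry in unresolved_critical_entries(sample):
    print(entry["Id"], entry["Message"])
```

Entries without a `Resolved` field are treated as unresolved, which errs on the side of showing more events rather than hiding them.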

**Q: A specific node appears "Offline" in the Node List.**

**Solution**: EDCC cannot communicate with the node's BMC.

**Step-by-step diagnosis**:

1. **Physical**: Verify the node's BMC/management network port is physically connected
2. **Network**: Ensure the BMC's IP address is reachable from the EDCC host server (e.g., using `ping`). Check for firewalls that may block traffic between the EDCC host and the BMC
3. **Credential**: Go to the **Node Detail > Summary** page for that node and verify that the **BMC Credential** is correct
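A firewall may drop ICMP, so `ping` can fail even when the BMC is reachable. A TCP connection test is a useful complement to the network step above. The sketch below assumes the BMC serves its management interface over HTTPS on TCP port 443, which this manual does not confirm. Adjust the port for your environment. The function name is hypothetical.

```python
import socket


def bmc_port_reachable(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

Run it from the EDCC host server, for example `bmc_port_reachable("192.0.2.10")`, where `192.0.2.10` is a documentation placeholder for the BMC address. A `False` result points to a cabling, routing, or firewall problem rather than a credential problem.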

#### **Feature & Prerequisite Issues**

**Q: I invited a new user, but they never received the invitation email.**

**Solution**: The SMTP server is likely misconfigured.

**How to fix**: Go to **System > Application Settings > Mail Server**. Verify all details are correct and use the **Test** button to confirm it's working.

**Q: The "Mount ISO Image" option doesn't work.**

**Solution**: This feature depends on two other settings.

**Step-by-step fix**:

1. Go to **System > Application Settings > HTTPS File Server**. Ensure **File Sharing** is enabled
2. Go to **CONFIGURE > OS Deployment**. Ensure you have uploaded the necessary ISO file

**Q: My POD-wide Service Profile changes don't affect a particular node.**

**Solution**: The node has an individual configuration override enabled.

**How to fix**: Go to the **Node Detail > Services** page for that node. On the **Subscription** tab, disable the **INDIVIDUAL SERVICES CONFIGURATION ENABLE** toggle to make it inherit the POD policy again.

#### **Performance & Platform Issues**

**Q: The EDCC web interface is slow or unresponsive.**

**Solution**: Check host system resource usage.

**Step-by-step diagnosis**:

1. Go to the **System > System Information > Summary** tab
2. Check the **CPU**, **Memory**, and **Disk** usage meters
3. If any meter shows consistently high usage (>85%), consider upgrading host resources
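If you have shell access to the EDCC host, the same 85% guideline can be cross-checked outside the web interface. This is a minimal sketch using only the Python standard library. It covers disk usage only, since CPU and memory readings need platform-specific tools. It assumes a Unix-like host where `/` is the relevant file system, and all function names are hypothetical.

```python
import shutil

THRESHOLD_PERCENT = 85.0  # mirrors the >85% guideline in the steps above


def percent_used(used: float, total: float) -> float:
    """Return used/total as a percentage, or 0.0 if total is not positive."""
    if total <= 0:
        return 0.0
    return 100.0 * used / total


def is_over_threshold(percent: float, limit: float = THRESHOLD_PERCENT) -> bool:
    """True when usage is strictly above the limit."""
    return percent > limit


def disk_percent_used(path: str = "/") -> float:
    """Percentage of the file system containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return percent_used(usage.used, usage.total)


if __name__ == "__main__":
    pct = disk_percent_used("/")
    status = "above" if is_over_threshold(pct) else "within"
    print(f"Disk usage on / is {pct:.1f}%, {status} the {THRESHOLD_PERCENT:.0f}% guideline")
```

A single reading is only a snapshot. The guideline refers to consistently high usage, so sample several times over a representative period before deciding to upgrade host resources.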

**Q: Firmware updates fail to complete successfully.**

**Solution**: Several potential causes need to be checked in turn.

**Step-by-step diagnosis**:

1. Verify that the **Maintenance Window** is correctly configured in **CONFIGURE > Firmware Provisioning**
2. Check node connectivity and BMC communication status
3. Ensure the firmware file is compatible with the target node model
4. Review the **BMC SEL** on affected nodes for specific error details

#### **Configuration & Setup Issues**

**Q: I can't see the CONFIGURE module in the menu.**

**Solution**: You need to select a POD scope and have appropriate permissions.

**How to fix**:

1. Ensure you have the **POD Admin** or **Organization Admin** role
2. Use the **Management Tree** to select a specific POD (not Organization or Hierarchy View)

**Q: Dashboard health status doesn't update after fixing hardware issues.**

**Solution**: Events in **BMC SEL** need to be marked as resolved.

**How to fix**:

1. Navigate to the affected node's **Node Detail > BMC SEL** tab
2. Find the resolved hardware event and toggle its status to **Resolved**
3. Click **Apply** to save the change
4. The Dashboard status should update within a few minutes

***

### Quick Reference: Permission Requirements

| Task                             | Required Permission             | Access Location           |
| -------------------------------- | ------------------------------- | ------------------------- |
| **View nodes and monitoring**    | POD Viewer or higher            | Any scope with POD access |
| **Configure nodes and PODs**     | POD Admin or Organization Admin | POD scope only            |
| **Manage users and permissions** | Organization Admin only         | Organization level        |
| **System settings and backups**  | Organization Admin only         | System menu               |
| **Device inventory management**  | Organization Admin only         | System > Inventory        |

***

### Emergency Contact Information

If an issue cannot be resolved using this troubleshooting guide, escalate it to support.

**Before contacting support, gather this information**:

* **EDCC Software Version** (from **System > System Information**)
* **Node Serial Numbers** (from **Node Detail > Summary** or **System Information**)
* **Error messages or screenshots** of the problem
* **Steps taken** to reproduce or resolve the issue
* **Recent changes** made to the system before the problem occurred

This information will help support teams diagnose and resolve issues more quickly.
