Troubleshooting Guide - CNOE Azure Reference Implementation
This guide covers common issues and their solutions when using the CNOE Azure Reference Implementation with Taskfile and Helmfile.
Note: Most issues stem from missing prerequisites, authentication, networking, or resource constraints. Start by verifying prerequisites, then work systematically through the troubleshooting steps.
Table of Contents
- Installation Issues
- Configuration Issues
- General Troubleshooting Approach
- Component-Specific Issues
- Performance Issues
- Recovery Procedures
- Getting Help
- Prevention Tips
Installation Issues
Task Installation Fails
Symptoms: task install command fails
Common Causes:
- Missing prerequisite Azure resources (AKS cluster, DNS zone, Key Vault)
- Incorrect configuration in config.yaml
- Azure CLI not authenticated
- Incorrect cluster context
- Missing required tools
Debug Steps:
    # Verify prerequisite Azure resources exist
    az aks show --name $(yq '.cluster_name' config.yaml) --resource-group $(yq '.resource_group' config.yaml)
    az network dns zone show --name $(yq '.domain' config.yaml) --resource-group $(yq '.resource_group' config.yaml)
    az keyvault show --name $(yq '.keyvault' config.yaml) --resource-group $(yq '.resource_group' config.yaml)

    # Verify required tools
    which az kubectl yq jq helm helmfile task

    # Check Azure CLI login
    az account show

    # Verify correct cluster context
    kubectl config current-context

    # Validate configuration file
    yq '.' config.yaml

    # Check cluster OIDC issuer
    az aks show --name $(yq '.cluster_name' config.yaml) \
      --resource-group $(yq '.resource_group' config.yaml) \
      --query "oidcIssuerProfile.issuerUrl" -o tsv
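The required-tools check can also be scripted so that it names only the tools that are missing. The following is a minimal sketch, not part of the reference implementation; the function name missing_tools is hypothetical, and the tool list in the usage comment is the one from the debug steps.

```shell
# Print the name of every listed tool that is not on PATH.
# Returns non-zero if at least one tool is missing.
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      rc=1
    fi
  done
  return "$rc"
}

# Usage (not executed here):
# missing_tools az kubectl yq jq helm helmfile task || echo "install the tools listed above"
```

Unlike which, command -v is specified by POSIX, so the sketch should behave the same way across common shells.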
Helmfile Deployment Issues
Symptoms: Helmfile fails to deploy ArgoCD
Debug Steps:
    # Check Helmfile syntax
    task helmfile:lint

    # View what would be deployed
    task helmfile:diff

    # Check Helm repositories
    helm repo list

    # Manual Helmfile debug
    helmfile --debug diff
Azure Workload Identity Issues
Symptoms: Components can't authenticate to Azure services
Debug Steps:
    # Check managed identity creation
    az identity list --resource-group $(yq '.resource_group' config.yaml)

    # Verify federated credentials (the list subcommand takes --identity-name)
    az identity federated-credential list \
      --identity-name crossplane \
      --resource-group $(yq '.resource_group' config.yaml)

    # Check service account annotations
    kubectl get sa crossplane -n crossplane-system -o yaml

    # Verify workload identity resources
    kubectl get workloadidentities.azure.livewyer.io -A -o yaml
Common Fixes:
    # Update Azure credentials (for demo environments only)
    task azure:creds:delete
    task azure:creds:create

    # Update workload identity configuration
    task update:secret:azure
Important: The azure:creds:* tasks are helper functions for demonstration only. In production, Azure identities should be managed through your organization's infrastructure management approach.
Configuration Issues
GitHub Integration Problems
Symptoms: Backstage cannot connect to GitHub
Solutions:
    # Verify GitHub configuration in config.yaml
    yq '.github' config.yaml

    # Check if configuration was uploaded to Key Vault
    az keyvault secret show --name config --vault-name $(yq '.keyvault' config.yaml)

    # Update configuration
    task update:secret
Important: GitHub integration details are stored in config.yaml, not in private files. All configuration is centralized in this file and stored securely in Azure Key Vault.
Domain and DNS Issues
Symptoms: Services not accessible via domain names
Debug Steps:
    # Check DNS resolution
    nslookup backstage.YOUR_DOMAIN

    # Verify ingress configuration
    kubectl get ingress -A

    # Check external-dns logs
    kubectl logs -n external-dns deployment/external-dns

    # Test load balancer IP
    kubectl get svc -n ingress-nginx ingress-nginx-controller
    curl -H "Host: backstage.YOUR_DOMAIN" http://LOAD_BALANCER_IP
Azure Key Vault Issues
Symptoms: External secrets cannot fetch secrets from Key Vault
Debug Steps:
    # Check Key Vault access
    az keyvault secret list --vault-name $(yq '.keyvault' config.yaml)

    # Verify external-secrets logs
    kubectl logs -n external-secrets deployment/external-secrets

    # Check workload identity for external-secrets
    kubectl get workloadidentity external-secrets -n external-secrets -o yaml

    # Test Key Vault connectivity
    # Note: az inside the pod has no credentials until it logs in, so expect an
    # authentication error unless the pod is given an identity and runs az login first
    kubectl run test-pod --rm -i --tty --image=mcr.microsoft.com/azure-cli -- az keyvault secret list --vault-name $(yq '.keyvault' config.yaml)
General Troubleshooting Approach
1. Check ArgoCD Applications
All components are deployed as ArgoCD applications. Start by checking their status:
    # Get ArgoCD admin password
    kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

    # Port forward to ArgoCD UI
    kubectl port-forward svc/argocd-server -n argocd 8080:80
Navigate to http://localhost:8080 and log in with username admin to view:
- Application sync status
- Resource health
- Event logs
- Sync history
2. Check Taskfile Operations
    # View available tasks
    task --list-all

    # Check configuration
    task diff

    # Verify Helmfile status
    task helmfile:status
3. Common Diagnostic Commands
    # Check cluster connectivity
    kubectl cluster-info

    # View all ArgoCD applications
    kubectl get applications -n argocd

    # Check application sets
    kubectl get applicationsets -n argocd

    # View workload identities
    kubectl get workloadidentities.azure.livewyer.io -A
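When many applications are deployed, it can help to filter the application list down to the ones that need attention. The sketch below is illustrative only: unhealthy_apps is a hypothetical helper that reads rows of name, sync status, and health status on stdin and prints those that are not both Synced and Healthy.

```shell
# Read "NAME SYNC HEALTH" rows on stdin and print only the rows
# that are not both Synced and Healthy.
unhealthy_apps() {
  awk '$2 != "Synced" || $3 != "Healthy" { print $1 ": sync=" $2 ", health=" $3 }'
}

# Usage against a live cluster (not executed here):
# kubectl get applications -n argocd --no-headers \
#   -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status \
#   | unhealthy_apps
```

Keeping the filter separate from the kubectl call means it can be tried on saved output before being pointed at a cluster.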
Common Log Locations
    # ArgoCD logs
    kubectl logs -n argocd deployment/argocd-application-controller
    kubectl logs -n argocd deployment/argocd-server

    # Component logs
    kubectl logs -n NAMESPACE deployment/COMPONENT_NAME

    # System logs (run on the cluster nodes)
    journalctl -u kubelet
Component-Specific Issues
ArgoCD Issues
ArgoCD Not Accessible
Symptoms: Cannot access ArgoCD UI
Debug Steps:
    # Check ArgoCD deployment
    kubectl get pods -n argocd

    # Check ingress
    kubectl get ingress -n argocd

    # Check service
    kubectl get svc -n argocd argocd-server

    # Check logs
    kubectl logs -n argocd deployment/argocd-server
Applications Not Syncing
Symptoms: ArgoCD applications stuck in "OutOfSync" or "Unknown" state
Common Fix:
    # Force a hard refresh of the application
    kubectl annotate app APP_NAME -n argocd argocd.argoproj.io/refresh=hard --overwrite

    # Check repository TLS certificate configuration
    kubectl get configmap -n argocd argocd-tls-certs-cm

    # Verify repository connectivity (requires an authenticated argocd CLI session)
    kubectl exec -n argocd deployment/argocd-server -- argocd repo list
Crossplane Issues
Provider Not Ready
Symptoms: Crossplane Azure provider fails to install
Debug Steps:
    # Check provider status
    kubectl get providers

    # Check provider config
    kubectl get providerconfigs

    # Check crossplane logs
    kubectl logs -n crossplane-system deployment/crossplane

    # Verify workload identity
    kubectl describe workloadidentity crossplane -n crossplane-system
ExternalDNS Issues
DNS Records Not Created
Symptoms: DNS records are not automatically created
Debug Steps:
    # Check external-dns logs
    kubectl logs -n external-dns deployment/external-dns

    # Check workload identity
    kubectl get workloadidentity external-dns -n external-dns -o yaml

    # Verify DNS zone permissions
    az role assignment list --scope "/subscriptions/$(yq '.subscription' config.yaml)/resourceGroups/$(yq '.resource_group' config.yaml)/providers/Microsoft.Network/dnszones/$(yq '.domain' config.yaml)"
Common Fix:
    # Update external-dns workload identity (demo environments)
    task update:secret:external-dns

    # Check domain configuration
    yq '.domain' config.yaml

    # Verify DNS zone exists
    az network dns zone show --name $(yq '.domain' config.yaml) --resource-group $(yq '.resource_group' config.yaml)
Cert-Manager Issues
Certificates Not Issued
Symptoms: TLS certificates remain in "Pending" state
Debug Steps:
    # Check certificate status
    kubectl get certificates -A

    # Check certificate requests
    kubectl get certificaterequests -A

    # Check challenges
    kubectl get challenges -A

    # Check issuer status
    kubectl get clusterissuers

    # Check cert-manager logs
    kubectl logs -n cert-manager deployment/cert-manager
Common Error Messages:
Get "http://example.com/.well-known/acme-challenge/...": dial tcp: lookup example.com: no such host
Solution: This error usually means the DNS record has not propagated yet. Wait 5-10 minutes for DNS to propagate, then check the certificate status again.
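Rather than waiting a fixed time, the check can be polled until it succeeds. The following is a minimal sketch of a generic retry helper; wait_for is a hypothetical name, and the attempt count and delay in the usage comment are assumptions chosen to cover roughly ten minutes.

```shell
# Retry a command until it succeeds or the attempt budget is spent.
# Arguments: <max_attempts> <delay_seconds> <command> [args...]
wait_for() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "gave up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 0
}

# Usage (not executed here): up to 20 attempts, 30 seconds apart.
# wait_for 20 30 nslookup backstage.YOUR_DOMAIN
```

The helper returns the moment the command succeeds, so a record that propagates quickly does not cost the full wait.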
Keycloak Issues
Keycloak Pod Failing
Symptoms: Keycloak pods crash or fail to start
Debug Steps:
    # Check pod status
    kubectl get pods -n keycloak

    # Check logs
    kubectl logs -n keycloak deployment/keycloak

    # Check persistent volume claims
    kubectl get pvc -n keycloak

    # Check secrets
    kubectl get secrets -n keycloak
SSO Authentication Issues
Symptoms: Cannot log into Backstage via Keycloak
Debug Steps:
    # Check Keycloak accessibility
    curl -k https://keycloak.YOUR_DOMAIN/realms/cnoe/.well-known/openid-configuration

    # Check user secrets
    kubectl get secrets -n keycloak keycloak-user-config -o yaml

    # Verify Backstage configuration
    kubectl get configmap -n backstage backstage-config -o yaml
Backstage Issues
Backstage Pod Crashing
Symptoms: Backstage pods fail to start
Debug Steps:
    # Check pod logs
    kubectl logs -n backstage deployment/backstage

    # Check configuration
    kubectl get configmap -n backstage -o yaml

    # Check secrets
    kubectl get secrets -n backstage -o yaml

    # Verify GitHub integration configuration in config.yaml
    yq '.github' config.yaml
Ingress Issues
Load Balancer Not Created
Symptoms: ingress-nginx service has no external IP
Debug Steps:
    # Check service status
    kubectl get svc -n ingress-nginx

    # Check ingress-nginx logs
    kubectl logs -n ingress-nginx deployment/ingress-nginx-controller

    # Check Azure Load Balancer
    az network lb list --resource-group MC_$(yq '.resource_group' config.yaml)_$(yq '.cluster_name' config.yaml)_$(yq '.location' config.yaml)
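The load balancer lives in the AKS node resource group, whose default name is built from the resource group, cluster name, and location. The sketch below makes that composition explicit. The helper name node_resource_group is hypothetical, and the default naming only holds when no custom node resource group was set at cluster creation.

```shell
# Build the default AKS node resource group name:
# MC_<resource_group>_<cluster_name>_<location>
node_resource_group() {
  printf 'MC_%s_%s_%s\n' "$1" "$2" "$3"
}

# Usage (not executed here):
# az network lb list --resource-group "$(node_resource_group \
#   "$(yq '.resource_group' config.yaml)" \
#   "$(yq '.cluster_name' config.yaml)" \
#   "$(yq '.location' config.yaml)")"
#
# If the cluster uses a custom node resource group, read the actual name instead:
# az aks show --name CLUSTER --resource-group RG --query nodeResourceGroup -o tsv
```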
Performance Issues
Slow Installation
Symptoms: Installation takes very long or times out
Common Causes:
- DNS propagation delays
- Certificate issuance delays
- Image pull issues
- Resource constraints
Debug Steps:
    # Check node resources
    kubectl top nodes

    # Check pod resources
    kubectl top pods -A

    # Scale up cluster if needed (using your infrastructure management approach)
    # Example for testing: az aks scale --name CLUSTER --resource-group RG --node-count 3

    # Check image pull status
    kubectl get events -A --sort-by=.metadata.creationTimestamp
High Resource Usage
Symptoms: Cluster running out of resources
Debug Steps:
    # Check resource requests and limits
    kubectl describe nodes

    # Identify resource-hungry pods
    kubectl top pods -A --sort-by=cpu
    kubectl top pods -A --sort-by=memory

    # Check persistent volume usage
    kubectl get pv
Recovery Procedures
Reinstalling Components
    # Reinstall specific component
    kubectl delete app COMPONENT_NAME -n argocd
    task sync

    # Full reinstall
    task uninstall
    task install
Backup and Restore
    # Backup ArgoCD configuration
    kubectl get applications -n argocd -o yaml > argocd-apps-backup.yaml

    # Backup configuration from Key Vault
    az keyvault secret show --name config --vault-name $(yq '.keyvault' config.yaml) > config-backup.json

    # Restore from backup
    kubectl apply -f argocd-apps-backup.yaml
Emergency Access
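    # Direct kubectl access to services
    kubectl port-forward svc/argocd-server -n argocd 8080:80
    kubectl port-forward svc/backstage -n backstage 3000:7007

    # Reset ArgoCD admin password: store a bcrypt hash of the new password in argocd-secret
    # (editing argocd-initial-admin-secret does not change the login password;
    # this command requires the argocd CLI locally)
    kubectl -n argocd patch secret argocd-secret -p '{"stringData":{"admin.password":"'"$(argocd account bcrypt --password 'new-password')"'","admin.passwordMtime":"'"$(date +%FT%T%Z)"'"}}'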
Getting Help
Collecting Diagnostic Information
    # Create diagnostic bundle
    mkdir -p cnoe-diagnostics
    kubectl cluster-info dump --output-directory=cnoe-diagnostics/cluster-info
    kubectl get events -A --sort-by=.metadata.creationTimestamp > cnoe-diagnostics/events.txt
    kubectl get pods -A -o yaml > cnoe-diagnostics/pods.yaml
    task helmfile:status > cnoe-diagnostics/helmfile-status.txt

    # config.yaml holds the GitHub integration details; review it for sensitive values before sharing
    yq '.' config.yaml > cnoe-diagnostics/config.yaml
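If one of the commands above fails partway through, the bundle ends up incomplete. The following is a minimal sketch of a wrapper that records failures and keeps going; collect is a hypothetical helper, and the timestamped directory name in the usage comment is an assumption.

```shell
# Run a command and store its stdout and stderr in <outdir>/<name>.txt.
# A failing command is noted in errors.log but does not abort the collection.
collect() {
  outdir=$1
  name=$2
  shift 2
  mkdir -p "$outdir"
  if ! "$@" >"$outdir/$name.txt" 2>&1; then
    echo "$name: command failed: $*" >>"$outdir/errors.log"
  fi
  return 0
}

# Usage (not executed here):
# bundle="cnoe-diagnostics-$(date +%Y%m%d-%H%M%S)"
# collect "$bundle" events kubectl get events -A --sort-by=.metadata.creationTimestamp
# collect "$bundle" helmfile-status task helmfile:status
# tar -czf "$bundle.tar.gz" "$bundle"
```

Capturing stderr next to stdout keeps the error message of a failed command inside the bundle, which is often the most useful part when asking for help.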
Additional Resources
Prevention Tips
- Proper Prerequisites: Ensure all Azure resources are properly provisioned before installation
- Configuration Management: Keep config.yaml up-to-date and validate before applying changes
- Regular Updates: Use task sync to keep components updated
- Monitor Resources: Set up monitoring for cluster resources
- Backup Strategy: Regular backups of critical configurations
- Testing: Test changes in a separate environment first
- Infrastructure Management: Use proper infrastructure management tools for production Azure resources
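As one way to act on the Configuration Management tip, config.yaml can be checked for required keys before running any task. This is a sketch under assumptions: missing_config_keys is a hypothetical helper, the key names are the ones used by commands elsewhere in this guide, and it assumes a yq that prints null for absent keys, as mikefarah yq v4 does.

```shell
# Print every listed top-level key that is missing (or null) in a config file.
# Returns non-zero if at least one key is missing.
missing_config_keys() {
  file=$1
  shift
  rc=0
  for key in "$@"; do
    # Treat a yq failure (for example, an unreadable file) as a missing key.
    value=$(yq ".$key" "$file" 2>/dev/null) || value=""
    if [ -z "$value" ] || [ "$value" = "null" ]; then
      echo "$key"
      rc=1
    fi
  done
  return "$rc"
}

# Usage (not executed here):
# missing_config_keys config.yaml cluster_name resource_group domain keyvault \
#   || echo "add the keys listed above to config.yaml"
```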