Troubleshooting Guide - CNOE Azure Reference Implementation
This guide covers common issues and their solutions when using the CNOE Azure Reference Implementation with its Kind cluster bootstrap approach.
Note: Most issues are related to missing prerequisites, authentication, networking, or resource constraints. Start with verifying prerequisites and work systematically through the troubleshooting steps.
Table of Contents
- Installation Issues
- Configuration Issues
- General Troubleshooting Approach
- Bootstrap Environment Issues
- Target AKS Cluster Issues
- Component-Specific Issues
- Performance Issues
- Recovery Procedures
- Getting Help
- Prevention Tips
Installation Issues
Task Installation Fails
Symptoms: task install command fails
Common Causes:
- Missing prerequisite Azure resources (AKS cluster, DNS zone)
- Incorrect configuration in config.yaml or private/azure-credentials.json
- Azure CLI not authenticated
- Kind not installed or Docker not running
- Missing required tools
Debug Steps:
# Verify prerequisite Azure resources exist
az aks show --name $(yq '.cluster_name' config.yaml) --resource-group $(yq '.resource_group' config.yaml)
az network dns zone show --name $(yq '.domain' config.yaml) --resource-group $(yq '.resource_group' config.yaml)
# Verify required tools
which az kubectl yq helm helmfile task kind yamale
# Check Docker is running (required for Kind)
docker info
# Check Azure CLI login
az account show
# Validate configuration files
task config:lint
# Check cluster OIDC issuer
az aks show --name $(yq '.cluster_name' config.yaml) \
  --resource-group $(yq '.resource_group' config.yaml) \
  --query "oidcIssuerProfile.issuerUrl" -o tsv
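The tool check above prints the path of each tool it finds, which makes a single missing tool easy to overlook. A minimal sketch that reports only the missing ones is shown here; the helper name missing_tools is hypothetical and not part of the repository's Taskfile, and docker is added to the list on the assumption that it should be checked alongside the other tools.

```shell
#!/bin/sh
# Print the name of every tool in the argument list that is not on PATH.
# missing_tools is a hypothetical helper, not something the repository ships.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tool list taken from the "Verify required tools" step, plus docker.
missing=$(missing_tools az kubectl yq helm helmfile task kind yamale docker)
if [ -n "$missing" ]; then
  printf 'Missing tools:\n%s\n' "$missing"
else
  echo "All required tools found"
fi
```

An empty result means every tool resolved on PATH; it does not confirm versions or that Docker is actually running.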
Kind Cluster Creation Issues
Symptoms: Kind cluster fails to create
Debug Steps:
# Check Docker is running
docker ps
# Check Kind configuration
yq '.' kind.yaml
# Try creating cluster manually
kind create cluster --config kind.yaml --name $(yq '.name' kind.yaml)
# Check for port conflicts
netstat -tulpn | grep -E ':(80|443|30080|30443)'
# Check disk space
df -h
Common Fixes:
# Remove existing Kind cluster
task kind:delete
# Clean up Docker resources
docker system prune
# Recreate cluster
task kind:create
Helmfile Deployment Issues
Symptoms: Helmfile fails to deploy to Kind cluster
Debug Steps:
# Switch to Kind context
task kubeconfig:set-context:kind
# Check Helmfile syntax
task helmfile:lint
# View what would be deployed
task helmfile:diff
# Check Helm repositories
helm repo list
# Manual Helmfile debug
helmfile --debug diff
# Check Kind cluster nodes
kubectl get nodes
Azure Credentials Issues
Symptoms: Crossplane cannot authenticate to Azure
Debug Steps:
# Validate Azure credentials file
task config:lint
# Check credentials format
cat private/azure-credentials.json | yq '.'
# Test Azure authentication manually
az login --service-principal \
  --username $(yq '.clientId' private/azure-credentials.json) \
  --password $(yq '.clientSecret' private/azure-credentials.json) \
  --tenant $(yq '.tenantId' private/azure-credentials.json)
# Check if credentials are loaded in Crossplane
kubectl get secret provider-azure -n crossplane-system -o yaml
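Printing the credentials file with yq echoes the client secret to the terminal. When the goal is only to confirm that the expected fields exist, a rough check like the one below avoids displaying any values. It is a sketch under assumptions: the helper name missing_credential_keys is hypothetical, the key names are the ones the az login command above reads, and the grep is a textual match rather than a real JSON parse.

```shell
#!/bin/sh
# Print each expected key that does not appear in the credentials file.
# Values are never printed. This is a textual check, not a JSON validation.
# missing_credential_keys is a hypothetical helper name.
missing_credential_keys() {
  file=$1
  shift
  for key in "$@"; do
    grep -q "\"$key\"[[:space:]]*:" "$file" || printf '%s\n' "$key"
  done
}

# Only run against the real file when it exists.
if [ -f private/azure-credentials.json ]; then
  missing_credential_keys private/azure-credentials.json clientId clientSecret tenantId
fi
```

No output means all three keys are present; task config:lint remains the authoritative validation.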
Common Fixes:
# Recreate credentials file from template
cp private/azure-credentials.template.json private/azure-credentials.json
# Edit with your actual credentials
# Restart Crossplane provider
kubectl rollout restart deployment/crossplane -n crossplane-system
Configuration Issues
Configuration File Validation
Symptoms: Configuration validation fails
Debug Steps:
# Run configuration validation
task config:lint
# Check config.yaml syntax
yq '.' config.yaml
# Check azure-credentials.json syntax
yq '.' private/azure-credentials.json
# Validate against schema
yamale -s config.schema.yaml config.yaml
yamale -s private/azure-credentials.schema.yaml private/azure-credentials.json
GitHub Integration Problems
Symptoms: ArgoCD cannot connect to GitHub repositories
Debug Steps:
# Verify GitHub configuration in config.yaml
yq '.github' config.yaml
# Check GitHub App credentials
# Ensure GitHub App is installed in your organization
# Test GitHub connectivity from Kind cluster
kubectl run test-pod --rm -i --tty --image=curlimages/curl -- \
  curl -H "Authorization: token YOUR_TOKEN" https://api.github.com/user
Domain and DNS Issues
Symptoms: Local services not accessible via *.local.<domain> addresses
Debug Steps:
# Check DNS resolution for local services
nslookup argocd.local.YOUR_DOMAIN
nslookup crossplane.local.YOUR_DOMAIN
# Check if local DNS record was created
az network dns record-set a show \
  --name "*.local" \
  --zone-name $(yq '.domain' config.yaml) \
  --resource-group $(yq '.resource_group' config.yaml)
# Check ingress configuration in Kind cluster
task kubeconfig:set-context:kind
kubectl get ingress -A
# Test local services directly
curl -H "Host: argocd.local.YOUR_DOMAIN" http://localhost
Azure Resource Creation Issues
Symptoms: Crossplane fails to create Azure resources (Key Vault, Workload Identity)
Debug Steps:
# Switch to Kind context
task kubeconfig:set-context:kind
# Check Crossplane logs
kubectl logs -n crossplane-system deployment/crossplane
# Check Azure provider status
kubectl get providers
# Check managed resources
kubectl get managed -A
# Check specific resources
kubectl get vault -A
kubectl get workloadidentity -A
# Check Azure RBAC permissions
az role assignment list --assignee $(yq '.clientId' private/azure-credentials.json)
General Troubleshooting Approach
1. Check Kind Cluster Status
Start troubleshooting with the bootstrap environment:
# Check Kind cluster exists and is running
kind get clusters
kubectl cluster-info --context kind-$(yq '.name' kind.yaml)
# Check nodes
task kubeconfig:set-context:kind
kubectl get nodes
# Check system pods
kubectl get pods -A
2. Check ArgoCD Applications
Monitor the bootstrap ArgoCD for deployment status:
# Access local ArgoCD UI
# Navigate to: http://argocd.local.<your-domain>
# Get local ArgoCD admin password
task kubeconfig:set-context:kind
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
# Check application status via CLI
kubectl get applications -n argocd
kubectl get applicationsets -n argocd
3. Check Crossplane Resources
Monitor Azure resource creation:
# Access local Crossplane dashboard
# Navigate to: http://crossplane.local.<your-domain>
# Check Crossplane resources via CLI
kubectl get managed -A
kubectl get workloadidentity -A
kubectl get vault -A
4. Common Diagnostic Commands
# Check overall cluster health
kubectl get nodes
kubectl get pods -A --field-selector=status.phase!=Running
# Check events for errors
kubectl get events -A --sort-by=.metadata.creationTimestamp
# Check resource usage
kubectl top nodes
kubectl top pods -A
Common Log Locations
# ArgoCD logs (Kind cluster)
kubectl logs -n argocd deployment/argocd-application-controller
kubectl logs -n argocd deployment/argocd-server
# Crossplane logs (Kind cluster)
kubectl logs -n crossplane-system deployment/crossplane
# Component logs (AKS cluster)
task kubeconfig:set-context:aks
kubectl logs -n NAMESPACE deployment/COMPONENT_NAME
Bootstrap Environment Issues
Local ArgoCD Issues
Symptoms: Cannot access ArgoCD at argocd.local.<domain>
Debug Steps:
# Switch to Kind context
task kubeconfig:set-context:kind
# Check ArgoCD pods
kubectl get pods -n argocd
# Check ingress configuration
kubectl get ingress -n argocd
# Check if DNS record exists
az network dns record-set a show \
  --name "*.local" \
  --zone-name $(yq '.domain' config.yaml) \
  --resource-group $(yq '.resource_group' config.yaml)
# Port forward to bypass ingress
kubectl port-forward svc/argocd-server -n argocd 8080:80
Local Crossplane Issues
Symptoms: Cannot access Crossplane dashboard or resources not creating
Debug Steps:
# Check Crossplane pods
kubectl get pods -n crossplane-system
# Check provider installation
kubectl get providers
# Check provider configuration
kubectl get providerconfigs
# Check crossplane logs
kubectl logs -n crossplane-system deployment/crossplane
# Check Azure provider logs
kubectl logs -n crossplane-system -l pkg.crossplane.io/provider=azure
Local DNS Issues
Symptoms: *.local.<domain> addresses not resolving
Debug Steps:
# Check if DNS record was created by Crossplane
kubectl get dnsarecord -A
# Check DNS record in Azure
az network dns record-set a show \
  --name "*.local" \
  --zone-name $(yq '.domain' config.yaml) \
  --resource-group $(yq '.resource_group' config.yaml)
# Check external-dns logs (if applicable)
kubectl logs -n external-dns deployment/external-dns
Target AKS Cluster Issues
AKS Connection Issues
Symptoms: Cannot connect to or deploy to AKS cluster
Debug Steps:
# Verify AKS cluster credentials
task kubeconfig:set-context:aks
kubectl cluster-info
# Check if cluster is accessible
kubectl get nodes
# Verify OIDC issuer configuration
az aks show --name $(yq '.cluster_name' config.yaml) \
  --resource-group $(yq '.resource_group' config.yaml) \
  --query "oidcIssuerProfile.issuerUrl" -o tsv
Component Deployment Issues
Symptoms: Components not deploying to AKS cluster from Kind-based ArgoCD
Debug Steps:
# Check ArgoCD application status (from Kind cluster)
task kubeconfig:set-context:kind
kubectl get applications -n argocd
# Check if ArgoCD can reach AKS cluster
kubectl get secret cnoe -n argocd -o yaml
# Check logs for deployment issues
kubectl logs -n argocd deployment/argocd-application-controller
Workload Identity Issues
Symptoms: Services on AKS cannot authenticate to Azure
Debug Steps:
# Switch to AKS context
task kubeconfig:set-context:aks
# Check if workload identity was created
az identity list --resource-group $(yq '.resource_group' config.yaml)
# Check service account annotations
kubectl get sa -A -o yaml | grep azure.workload.identity
# Check federated credentials (list takes the managed identity via --identity-name)
az identity federated-credential list \
  --identity-name crossplane \
  --resource-group $(yq '.resource_group' config.yaml)
Component-Specific Issues
ArgoCD Issues
ArgoCD Not Accessible
Symptoms: Cannot access ArgoCD UI on AKS cluster
Debug Steps:
# Switch to AKS context
task kubeconfig:set-context:aks
# Check ArgoCD deployment
kubectl get pods -n argocd
# Check ingress
kubectl get ingress -n argocd
# Get service URLs
task get:urls
# Port forward to bypass ingress
kubectl port-forward svc/argocd-server -n argocd 8080:80
Applications Not Syncing
Symptoms: ArgoCD applications stuck in "OutOfSync" or "Unknown" state
Debug Steps:
# Check application status
kubectl get applications -n argocd
# Check repository connectivity
kubectl exec -n argocd deployment/argocd-server -- argocd repo list
# Force refresh application
kubectl patch app APP_NAME -n argocd --type merge --patch '{"operation":{"initiatedBy":{"automated":true}}}'
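With many applications, the status listing above can bury the few that need attention. The sketch below filters rows of the form "NAME SYNC HEALTH" down to those that are not both Synced and Healthy. The helper name unhealthy_apps and the sample application names are hypothetical; the commented kubectl call assumes the standard Argo CD Application status fields.

```shell
#!/bin/sh
# Read "NAME SYNC HEALTH" rows on stdin and print only the rows that are
# not both Synced and Healthy. unhealthy_apps is a hypothetical helper name.
unhealthy_apps() {
  awk '$2 != "Synced" || $3 != "Healthy" { print }'
}

# Demonstration with made-up rows (application names are placeholders).
printf '%s\n' 'argocd Synced Healthy' 'backstage OutOfSync Degraded' | unhealthy_apps
# prints: backstage OutOfSync Degraded

# Against a live cluster (left commented so the sketch runs without kubectl):
# kubectl get applications -n argocd --no-headers \
#   -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status \
#   | unhealthy_apps
```

Applications with no reported status show up as <none> in the custom-columns output and are therefore listed as well.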
Crossplane Issues
Provider Not Ready
Symptoms: Crossplane Azure provider fails to install
Debug Steps:
# Switch to Kind context (where Crossplane is running)
task kubeconfig:set-context:kind
# Check provider status
kubectl get providers
# Check provider config
kubectl get providerconfigs
# Check azure credentials secret
kubectl get secret provider-azure -n crossplane-system -o yaml
Azure Resource Creation Failures
Symptoms: Azure resources (Key Vault, Workload Identity) not being created
Debug Steps:
# Check managed resources
kubectl get managed -A
# Check specific resource events
kubectl describe vault VAULT_NAME
kubectl describe workloadidentity IDENTITY_NAME
# Check Azure permissions
az role assignment list --assignee $(yq '.clientId' private/azure-credentials.json)
ExternalDNS Issues
DNS Records Not Created
Symptoms: DNS records are not automatically created on AKS cluster
Debug Steps:
# Switch to AKS context
task kubeconfig:set-context:aks
# Check external-dns logs
kubectl logs -n external-dns deployment/external-dns
# Check DNS zone permissions
az role assignment list --scope "/subscriptions/$(yq '.subscription' config.yaml)/resourceGroups/$(yq '.resource_group' config.yaml)/providers/Microsoft.Network/dnszones/$(yq '.domain' config.yaml)"
# Verify DNS zone exists
az network dns zone show --name $(yq '.domain' config.yaml) --resource-group $(yq '.resource_group' config.yaml)
Cert-Manager Issues
Certificates Not Issued
Symptoms: TLS certificates remain in "Pending" state
Debug Steps:
# Switch to AKS context
task kubeconfig:set-context:aks
# Check certificate status
kubectl get certificates -A
# Check certificate requests
kubectl get certificaterequests -A
# Check challenges
kubectl get challenges -A
# Check issuer status
kubectl get clusterissuers
# Check cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager
Keycloak Issues
Keycloak Pod Failing
Symptoms: Keycloak pods crash or fail to start on AKS cluster
Debug Steps:
# Switch to AKS context
task kubeconfig:set-context:aks
# Check pod status
kubectl get pods -n keycloak
# Check logs
kubectl logs -n keycloak deployment/keycloak
# Check persistent volume claims
kubectl get pvc -n keycloak
# Check secrets
kubectl get secrets -n keycloak
SSO Authentication Issues
Symptoms: Cannot log into Backstage via Keycloak
Debug Steps:
# Check Keycloak accessibility
curl -k https://keycloak.YOUR_DOMAIN/realms/cnoe/.well-known/openid-configuration
# Check user secrets
kubectl get secrets -n keycloak keycloak-config -o yaml
# Verify Backstage configuration
kubectl get configmap -n backstage backstage-config -o yaml
Backstage Issues
Backstage Pod Crashing
Symptoms: Backstage pods fail to start on AKS cluster
Debug Steps:
# Switch to AKS context
task kubeconfig:set-context:aks
# Check pod logs
kubectl logs -n backstage deployment/backstage
# Check configuration
kubectl get configmap -n backstage -o yaml
# Check secrets
kubectl get secrets -n backstage -o yaml
# Verify GitHub integration configuration
yq '.github' config.yaml
Ingress Issues
Load Balancer Not Created
Symptoms: ingress-nginx service has no external IP on AKS cluster
Debug Steps:
# Switch to AKS context
task kubeconfig:set-context:aks
# Check service status
kubectl get svc -n ingress-nginx
# Check ingress-nginx logs
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller
# Check Azure Load Balancer
az network lb list --resource-group MC_$(yq '.resource_group' config.yaml)_$(yq '.cluster_name' config.yaml)_$(yq '.location' config.yaml)
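The load balancer lookup above relies on the default AKS naming for the node resource group, MC_<resource-group>_<cluster>_<location>. The sketch below makes that construction explicit. The helper name default_node_resource_group and the sample values are hypothetical. If the cluster was created with a custom node resource group name, the default pattern will not match, and querying the cluster directly (shown commented out) is the safer route.

```shell
#!/bin/sh
# Build the default AKS node resource group name:
#   MC_<resource-group>_<cluster>_<location>
# default_node_resource_group is a hypothetical helper name.
default_node_resource_group() {
  printf 'MC_%s_%s_%s\n' "$1" "$2" "$3"
}

# Example with placeholder values.
default_node_resource_group my-rg my-cluster eastus
# prints: MC_my-rg_my-cluster_eastus

# To read the actual name from Azure instead of assuming the default
# (left commented so the sketch runs without the Azure CLI):
# az aks show --name "$(yq '.cluster_name' config.yaml)" \
#   --resource-group "$(yq '.resource_group' config.yaml)" \
#   --query nodeResourceGroup -o tsv
```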
Performance Issues
Slow Installation
Symptoms: Installation takes a very long time or times out
Common Causes:
- DNS propagation delays
- Certificate issuance delays
- Image pull issues
- Resource constraints on Kind cluster or AKS
Debug Steps:
# Check Kind cluster resources
task kubeconfig:set-context:kind
kubectl top nodes
kubectl top pods -A
# Check AKS cluster resources
task kubeconfig:set-context:aks
kubectl top nodes
# Check image pull status
kubectl get events -A --sort-by=.metadata.creationTimestamp | grep Pull
# Monitor Crossplane resource creation (Crossplane runs on the Kind cluster)
task kubeconfig:set-context:kind
kubectl get managed -A -w
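DNS propagation and certificate issuance usually resolve on their own after some minutes, so polling a check is often more useful than rerunning the installation. A minimal retry helper is sketched below. The name retry and its argument order are assumptions for illustration, not something the repository provides.

```shell
#!/bin/sh
# Run a command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
# retry is a hypothetical helper name.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only when another attempt will follow.
    if [ "$n" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    n=$((n + 1))
  done
  return 1
}

# Trivial demonstration: succeeds on the first attempt.
retry 3 0 true && echo "command succeeded"

# Example: wait up to about five minutes for the local ArgoCD name to resolve
# (left commented; replace YOUR_DOMAIN first):
# retry 30 10 nslookup argocd.local.YOUR_DOMAIN
```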
High Resource Usage
Symptoms: Cluster running out of resources
Debug Steps:
# Check resource requests and limits on both clusters
kubectl describe nodes
# Identify resource-hungry pods
kubectl top pods -A --sort-by=cpu
kubectl top pods -A --sort-by=memory
# Check persistent volume usage
kubectl get pv
Recovery Procedures
Reinstalling Components
# Clean reinstall
task uninstall
task install
# Reinstall only AKS components (keep Kind cluster)
task kubeconfig:set-context:kind
kubectl -n argocd delete app cnoe
task sync
Backup and Restore
# Backup ArgoCD configuration from Kind cluster
task kubeconfig:set-context:kind
kubectl get applications -n argocd -o yaml > argocd-apps-backup.yaml
# Backup configuration
cp config.yaml config-backup.yaml
cp private/azure-credentials.json private/azure-credentials-backup.json
# Restore from backup
kubectl apply -f argocd-apps-backup.yaml
Emergency Access
# Direct kubectl access to services on AKS
task kubeconfig:set-context:aks
kubectl port-forward svc/argocd-server -n argocd 8080:80
kubectl port-forward svc/backstage -n backstage 3000:7007
# Access Kind cluster services
task kubeconfig:set-context:kind
kubectl port-forward svc/argocd-server -n argocd 8080:80
Getting Help
Collecting Diagnostic Information
# Create diagnostic bundle
mkdir cnoe-diagnostics
# Collect Kind cluster information
task kubeconfig:set-context:kind
kubectl cluster-info dump --output-directory=cnoe-diagnostics/kind-cluster-info
kubectl get events -A --sort-by=.metadata.creationTimestamp > cnoe-diagnostics/kind-events.yaml
kubectl get pods -A -o yaml > cnoe-diagnostics/kind-pods.yaml
# Collect AKS cluster information
task kubeconfig:set-context:aks
kubectl cluster-info dump --output-directory=cnoe-diagnostics/aks-cluster-info
kubectl get events -A --sort-by=.metadata.creationTimestamp > cnoe-diagnostics/aks-events.yaml
kubectl get pods -A -o yaml > cnoe-diagnostics/aks-pods.yaml
# Collect configuration
task helmfile:status > cnoe-diagnostics/helmfile-status.txt
yq '.' config.yaml > cnoe-diagnostics/config.yaml
# DO NOT include azure-credentials.json in diagnostic bundle for security reasons
# Collect Azure resources
az resource list --resource-group $(yq '.resource_group' config.yaml) > cnoe-diagnostics/azure-resources.json
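Once the directory is populated, packing it into a single archive makes it easier to attach to an issue. The sketch below does that and refuses to proceed if azure-credentials.json has been copied into the bundle, in line with the warning above. The helper name bundle_diagnostics and the timestamped archive name are assumptions. The check covers only the top level of the directory and does not scan file contents for secrets, so a manual review before sharing is still advisable.

```shell
#!/bin/sh
# Pack a diagnostics directory into a timestamped tarball and print the
# archive name. Refuses to run if the credentials file is in the directory.
# bundle_diagnostics is a hypothetical helper name.
bundle_diagnostics() {
  dir=$1
  if [ -e "$dir/azure-credentials.json" ]; then
    echo "refusing to bundle: $dir contains azure-credentials.json" >&2
    return 1
  fi
  archive="${dir}-$(date +%Y%m%d-%H%M%S).tar.gz"
  tar -czf "$archive" "$dir" && printf '%s\n' "$archive"
}

# Only bundle when the directory from the steps above exists.
if [ -d cnoe-diagnostics ]; then
  bundle_diagnostics cnoe-diagnostics
fi
```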
Additional Resources
- CNOE Community
- ArgoCD Documentation
- Crossplane Documentation
- Backstage Documentation
- Kind Documentation
Prevention Tips
- Proper Prerequisites: Ensure AKS cluster and DNS zone are properly provisioned before installation
- Configuration Management: Keep config.yaml and azure-credentials.json up to date and validate before applying changes
- Regular Updates: Use task sync to keep components updated
- Monitor Resources: Set up monitoring for both Kind and AKS cluster resources
- Backup Strategy: Regular backups of critical configurations
- Testing: Test changes in a separate environment first
- Infrastructure Management: Use proper infrastructure management tools for production Azure resources
- Docker Health: Ensure Docker is running properly for Kind cluster operations
- Network Connectivity: Ensure reliable internet connection for image pulls and Azure API calls
- Azure Permissions: Verify service principal has necessary permissions for resource creation