
Terraform Drift Detection in Azure: Finding and Fixing Configuration Drift


Executive technology leader responsible for platform reliability, cloud operations, security posture, and enterprise technology risk within an investor-backed fintech environment. I lead technology operations at the intersection of engineering execution, governance, and business outcomes — ensuring platforms are scalable, resilient, and trusted by investors, regulators, and clients.

Currently VP of DevOps at InvestorFlow, where I focus on building board-ready technology operations, strengthening risk and resilience, and shaping long-term platform strategy to support growth and regulatory confidence.

Every Terraform codebase I have ever worked on has had drift at some point. You deploy the infrastructure, everything looks great, the plan comes back clean, and life is good. Then six months later someone asks why the production App Service has a setting that is not anywhere in the code, or why the terraform plan on a feature branch is suddenly showing a thirty-resource change when the PR only touched two lines. The answer is almost always the same: someone made a change in the Azure portal, or a script did, or a deployment pipeline you forgot about reached in and tweaked something. The state file and the real world have quietly fallen out of sync.

Drift detection is one of those practices that feels optional until the first time it bites you, and then it becomes a non-negotiable part of the platform. In this post I'll walk through how I approach drift detection on Azure Terraform estates, how to build a scheduled pipeline in both Azure DevOps and GitHub Actions, how to wire up a notification so the team actually sees findings, and what to do when you find drift rather than just staring at the diff.

What drift actually is (and what it isn't)

Drift is the gap between what Terraform thinks is deployed and what is actually running in Azure. It comes in a few flavours, and treating them all the same way is where teams tend to go wrong.

The first kind is the obvious one. Someone logs into the portal, clicks into a storage account, and turns on public network access to debug something on a Friday afternoon. They mean to put it back on Monday. They don't. Terraform still thinks the account has public access disabled, and the next plan shows a change that was not authored in code.

The second kind is silent drift introduced by Azure itself. Azure Policy remediation tasks are a classic one. You have a policy that enforces diagnostic settings on every resource, the policy fires, the diagnostic setting gets added, and now your Terraform state does not know about it. Microsoft also occasionally adds default properties to resources between API versions, which can surface as drift after a provider upgrade.

The third kind is drift that is not really drift. A common example is the tags attribute on a resource. Azure sometimes adds its own tags, like the hidden-link: tags on App Service resources, and these show up as a diff every time you plan. This is not drift you want to fix, it is drift you want to ignore with a lifecycle { ignore_changes = [...] } block on the affected keys. One wrinkle: ignore_changes does not support wildcards in map keys, so you either list each offending key exactly or ignore the whole tags map.

Knowing which type you are dealing with changes what you do about it. The first kind you revert or bring into code. The second kind you add to Terraform or accept. The third kind you teach Terraform to stop caring about.

The simplest drift check: terraform plan

Before you reach for any external tooling, the most honest drift check you have is terraform plan with no code changes. If the plan shows any changes, something has drifted. This is the baseline.

terraform plan -detailed-exitcode

The -detailed-exitcode flag is the one that makes this useful in automation. It returns:

  • 0 if there are no changes

  • 1 if the command errored

  • 2 if there are changes to apply

That exit code is what lets you wrap a plan in a pipeline step and actually fail the build when drift is detected, rather than just printing the output and hoping someone reads it.
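In a script that means capturing the exit code yourself rather than letting the shell abort on the non-zero value. A minimal sketch, where classify_plan_exit is a hypothetical helper rather than anything Terraform ships:

```shell
# Run the plan without aborting on terraform's non-zero exit codes
run_drift_check() {
  set +e
  terraform plan -detailed-exitcode -no-color > plan.txt
  classify_plan_exit $?
}

# Map terraform's -detailed-exitcode result onto a drift verdict
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}
```

The same three-way branch appears in both pipelines below; keeping it explicit is what stops a genuine plan failure being misread as drift.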

The limitation is that terraform plan only tells you about drift for resources Terraform knows about. If someone has created a brand new resource directly in Azure that was never deployed by Terraform, a plan will not flag it because it has no state entry to compare against. For that problem the answer is Azure Policy or resource locks, which I'll come back to later.
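You can partially close that gap yourself by diffing the resource IDs Azure reports against the IDs in state. A rough sketch, assuming you have already exported both lists to files (for example from az resource list on one side and terraform state list with IDs resolved on the other); find_unmanaged is a hypothetical helper:

```shell
# find_unmanaged: print Azure resource IDs that have no Terraform state entry.
# $1 = file of IDs reported by Azure, $2 = file of IDs known to Terraform.
find_unmanaged() {
  comm -23 <(sort -u "$1") <(sort -u "$2")
}
```

comm -23 prints lines unique to the first input, so anything it emits exists in Azure but not in state.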

Building a scheduled drift check in Azure DevOps

The pattern I use on most Azure Terraform repositories is a pipeline that runs a plan against every environment on a cadence, typically daily for production and weekly for non-production. Here is the Azure DevOps version.

schedules:
  - cron: "0 6 * * *"
    displayName: Daily drift check
    branches:
      include:
        - main
    always: true

trigger: none

pool:
  vmImage: ubuntu-latest

variables:
  - group: terraform-prod

stages:
  - stage: DriftCheck
    jobs:
      - job: Plan
        steps:
          - task: TerraformInstaller@1
            inputs:
              terraformVersion: "1.9.5"

          - script: |
              terraform init \
                -backend-config="resource_group_name=$(BACKEND_RG)" \
                -backend-config="storage_account_name=$(BACKEND_SA)" \
                -backend-config="container_name=tfstate" \
                -backend-config="key=prod.tfstate"
            displayName: Terraform init
            env:
              ARM_CLIENT_ID: $(ARM_CLIENT_ID)
              ARM_TENANT_ID: $(ARM_TENANT_ID)
              ARM_SUBSCRIPTION_ID: $(ARM_SUBSCRIPTION_ID)
              ARM_USE_OIDC: true

          - script: |
              set +e
              terraform plan -detailed-exitcode -no-color -out=drift.tfplan > plan.txt
              echo "##vso[task.setvariable variable=planExit]$?"
            displayName: Terraform plan
            env:
              ARM_CLIENT_ID: $(ARM_CLIENT_ID)
              ARM_TENANT_ID: $(ARM_TENANT_ID)
              ARM_SUBSCRIPTION_ID: $(ARM_SUBSCRIPTION_ID)
              ARM_USE_OIDC: true

          - script: |
              if [ "$(planExit)" = "2" ]; then
                # join matched lines with literal \n escapes so the JSON payload stays valid
                RESOURCES=$(grep -E "^\s+#.*will be" plan.txt | sed 's/^\s*# //' | head -20 | sed -z 's/\n/\\n/g')
                BUILD_URL="$(System.CollectionUri)$(System.TeamProject)/_build/results?buildId=$(Build.BuildId)"

                curl -H "Content-Type: application/json" -d "{
                  \"@type\": \"MessageCard\",
                  \"@context\": \"http://schema.org/extensions\",
                  \"themeColor\": \"d9534f\",
                  \"summary\": \"Terraform drift detected in prod\",
                  \"title\": \"Terraform drift detected in prod\",
                  \"text\": \"The following resources have drifted:\n\n${RESOURCES}\n\n[View build](${BUILD_URL})\"
                }" $(TEAMS_WEBHOOK_URL)

                echo "Drift detected, failing build"
                exit 1
              elif [ "$(planExit)" = "1" ]; then
                echo "Plan errored, failing build"
                exit 1
              else
                echo "No drift detected"
              fi
            displayName: Notify and fail on drift

          - publish: plan.txt
            artifact: drift-plan
            condition: always()

A couple of things worth calling out. The always: true on the schedule is important because without it the pipeline only runs on days when the branch has new commits, which defeats the purpose. The set +e on the plan step is there because the script would otherwise exit immediately on the non-zero exit code from -detailed-exitcode, and we never get to check what the code actually was. Storing the exit code in a pipeline variable lets the next step branch on it. And the publish task runs unconditionally so the plan output is always available as an artefact, even when the build fails.
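The grep that builds the notification body is also worth sanity-checking on its own before trusting it in a pipeline. A small sketch of the same extraction, where extract_drifted is a hypothetical helper:

```shell
# Pull drifted resource addresses out of `terraform plan -no-color` output
extract_drifted() {
  grep -E '^\s*#.*will be' "$1" | sed -E 's/^\s*# //; s/ will be.*$//'
}
```

Run it against a saved plan.txt locally and you get one resource address per line, which is exactly what the webhook message embeds.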

The Teams webhook URL goes into the variable group as a secret. Historically you created one through channel Connectors (three dots on the channel, Connectors, then Incoming Webhook), but Microsoft is retiring Office 365 connectors, so on newer tenants you create the incoming webhook through the Workflows app instead. Either way, give it a name and copy the URL it generates.

The same thing in GitHub Actions

If your code lives in GitHub, the equivalent workflow looks like this. The mechanics are identical, just the syntax is different.

name: Terraform Drift Check

on:
  schedule:
    - cron: "0 6 * * *"
  workflow_dispatch:

permissions:
  id-token: write
  contents: read

jobs:
  drift-check:
    runs-on: ubuntu-latest
    environment: prod

    steps:
      - uses: actions/checkout@v4

      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.ARM_CLIENT_ID }}
          tenant-id: ${{ secrets.ARM_TENANT_ID }}
          subscription-id: ${{ secrets.ARM_SUBSCRIPTION_ID }}

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.9.5

      - name: Terraform init
        run: |
          terraform init \
            -backend-config="resource_group_name=${{ vars.BACKEND_RG }}" \
            -backend-config="storage_account_name=${{ vars.BACKEND_SA }}" \
            -backend-config="container_name=tfstate" \
            -backend-config="key=prod.tfstate"
        env:
          ARM_USE_OIDC: true

      - name: Terraform plan
        id: plan
        run: |
          set +e
          terraform plan -detailed-exitcode -no-color -out=drift.tfplan > plan.txt
          echo "exitcode=$?" >> "$GITHUB_OUTPUT"
        env:
          ARM_USE_OIDC: true
          ARM_CLIENT_ID: ${{ secrets.ARM_CLIENT_ID }}
          ARM_TENANT_ID: ${{ secrets.ARM_TENANT_ID }}
          ARM_SUBSCRIPTION_ID: ${{ secrets.ARM_SUBSCRIPTION_ID }}

      - name: Notify on drift
        if: steps.plan.outputs.exitcode == '2'
        run: |
          # join matched lines with literal \n escapes so the JSON payload stays valid
          RESOURCES=$(grep -E "^\s+#.*will be" plan.txt | sed 's/^\s*# //' | head -20 | sed -z 's/\n/\\n/g')
          RUN_URL="${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"

          curl -X POST -H "Content-type: application/json" --data "{
            \"text\": \"*Terraform drift detected in prod*\n\n\`\`\`\n${RESOURCES}\n\`\`\`\n\n<${RUN_URL}|View workflow run>\"
          }" ${{ secrets.SLACK_WEBHOOK_URL }}

      - name: Upload plan output
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: drift-plan
          path: plan.txt

      - name: Fail on drift
        if: steps.plan.outputs.exitcode == '2' || steps.plan.outputs.exitcode == '1'
        run: exit 1

The Slack webhook follows the same pattern as the Teams one. In Slack you create an incoming webhook through the Slack API site, point it at the channel, and store the URL as a repository secret. The message format I have used here is plain text inside a code block, which is enough for the team to see the resource addresses without needing to dig into the workflow run.

If you want fancier formatting in Slack, the Block Kit format gives you headers, divider lines, and action buttons that link straight to the run. I have done it both ways and honestly the simple text version gets read more reliably. The fancy cards look great in a demo and then get scrolled past when there is real noise in the channel.
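For reference, a Block Kit payload for the same message might look something like this (a sketch of the format with placeholder values, not something the workflow above sends):

```json
{
  "blocks": [
    { "type": "header",
      "text": { "type": "plain_text", "text": "Terraform drift detected in prod" } },
    { "type": "section",
      "text": { "type": "mrkdwn", "text": "```\nazurerm_storage_account.logs\n```" } },
    { "type": "divider" },
    { "type": "actions",
      "elements": [
        { "type": "button",
          "text": { "type": "plain_text", "text": "View workflow run" },
          "url": "https://github.com/example/repo/actions/runs/1" } ] }
  ]
}
```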

Why the notification matters more than the pipeline

A pipeline that fails silently is barely better than no pipeline at all. The default behaviour of both Azure DevOps and GitHub Actions is to send an email when a scheduled build fails, and in my experience those emails get filtered into a folder no one looks at. The Teams or Slack message in a channel the team actually reads is what turns drift detection from a tickbox exercise into something that gets acted on.

What I would also recommend is being deliberate about which channel the notifications go to. Posting drift alerts into a busy general channel is the same as posting them into nowhere, because they get scrolled past. A dedicated #platform-alerts or #terraform-drift channel that only the platform team is in tends to work better. The signal-to-noise ratio is what makes the difference.

Handling drift that is not really drift

The first time you run a daily drift check on a real environment, you will almost certainly get a notification on day one for something that is not really drift. The two most common offenders on Azure are the hidden-link tags that App Service adds to its associated resources, and properties on Function Apps that get populated by the runtime after deployment.

The fix is ignore_changes on the specific properties, not on the whole resource.

resource "azurerm_linux_web_app" "app" {
  # ... other properties ...

  lifecycle {
    ignore_changes = [
      # map keys must match exactly; wildcards are not supported here
      tags["hidden-link: /app-insights-resource-id"],
      site_config[0].application_stack[0].docker_image_name,
    ]
  }
}

Be deliberate about ignore_changes. It is a useful tool and it is also how you accidentally end up with a codebase that silently accepts any change to any property. Only use it for the specific properties where Azure is genuinely the source of truth and you do not want Terraform to interfere. A blanket ignore_changes = all is almost never the right answer, even though it makes the drift notification go away.

Preventing drift in the first place with Azure Policy

The best drift detection is the drift that never happens. Azure Policy with deny effects is genuinely useful here, because it stops the portal change at source rather than flagging it after the fact.

The pattern I use is to tag every resource deployed by Terraform with a managed-by = terraform tag, and then have a policy that audits or denies modifications to those resources. One caveat: Azure Policy conditions cannot see who made the request, so a straight Deny blocks the Terraform service principal as well; a principal-aware block is the territory of deny assignments rather than policy, which is one more reason to lean on Audit here.

resource "azurerm_policy_definition" "deny_manual_changes" {
  name         = "deny-changes-to-terraform-managed"
  policy_type  = "Custom"
  mode         = "All"
  display_name = "Deny manual changes to Terraform-managed resources"

  policy_rule = jsonencode({
    if = {
      allOf = [
        {
          field  = "tags['managed-by']"
          equals = "terraform"
        },
        {
          field    = "type"
          notEquals = "Microsoft.Resources/subscriptions/resourceGroups"
        }
      ]
    }
    then = {
      effect = "[parameters('effect')]"
    }
  })

  parameters = jsonencode({
    effect = {
      type = "String"
      allowedValues = ["Audit", "Deny"]
      defaultValue  = "Audit"
    }
  })
}

I always deploy this in Audit mode first and watch the policy compliance view for a couple of weeks before flipping it to Deny. The audit view will flag every person, pipeline, or process that touches a tagged resource, and you will almost certainly find at least one legitimate workflow you did not know about. Breaking that workflow with a Deny on day one is a quick way to lose friends on your platform team.
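The definition on its own does nothing until it is assigned somewhere. A sketch of a subscription-scope assignment that starts in Audit, under the assumption that the definition above is in the same configuration (names here are illustrative):

```hcl
data "azurerm_subscription" "current" {}

resource "azurerm_subscription_policy_assignment" "deny_manual_changes" {
  name                 = "deny-manual-changes"
  subscription_id      = data.azurerm_subscription.current.id
  policy_definition_id = azurerm_policy_definition.deny_manual_changes.id

  # Start in Audit; flip to Deny only after reviewing compliance results
  parameters = jsonencode({
    effect = { value = "Audit" }
  })
}
```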

My thoughts on drift detection

Drift is one of those things where the cost of ignoring it scales non-linearly. A small amount of drift is annoying. A large amount of drift is terrifying, because you stop trusting your own Terraform plan, and the moment you stop trusting the plan you stop running apply in production, and the whole value proposition of Infrastructure as Code starts to unwind.

If your Azure Terraform repository does not currently have any drift detection in place, the first thing I would add is the scheduled terraform plan -detailed-exitcode pipeline with a Teams or Slack notification. It takes an afternoon to set up, catches the vast majority of what you actually care about, and gives you the signal you need to start tightening things up from there. The Azure Policy work and the ignore_changes cleanup can follow once you have the habit of looking at drift at all.
