Terraform Drift Detection in Azure: Finding and Fixing Configuration Drift

Executive technology leader responsible for platform reliability, cloud operations, security posture, and enterprise technology risk within an investor-backed fintech environment. I lead technology operations at the intersection of engineering execution, governance, and business outcomes — ensuring platforms are scalable, resilient, and trusted by investors, regulators, and clients.
Currently VP of DevOps at InvestorFlow, where I focus on building board-ready technology operations, strengthening risk and resilience, and shaping long-term platform strategy to support growth and regulatory confidence.
Every Terraform codebase I have ever worked on has had drift at some point. You deploy the infrastructure, everything looks great, the plan comes back clean, and life is good. Then six months later someone asks why the production App Service has a setting that is not anywhere in the code, or why the terraform plan on a feature branch is suddenly showing a thirty-resource change when the PR only touched two lines. The answer is almost always the same: someone made a change in the Azure portal, or a script did, or a deployment pipeline you forgot about reached in and tweaked something. The state file and the real world have quietly fallen out of sync.
Drift detection is one of those practices that feels optional until the first time it bites you, and then it becomes a non-negotiable part of the platform. In this post I'll walk through how I approach drift detection for Terraform on Azure, how to build a scheduled check pipeline in both Azure DevOps and GitHub Actions, how to wire up a notification so the team actually sees findings, and what to do when you find drift rather than just staring at the diff.
What drift actually is (and what it isn't)
Drift is the gap between what Terraform thinks is deployed and what is actually running in Azure. It comes in a few flavours, and treating them all the same way is where teams tend to go wrong.
The first kind is the obvious one. Someone logs into the portal, clicks into a storage account, and turns on public network access to debug something on a Friday afternoon. They mean to put it back on Monday. They don't. Terraform still thinks the account has public access disabled, and the next plan shows a change that was not authored in code.
The second kind is silent drift introduced by Azure itself. Azure Policy remediation tasks are a classic one. You have a policy that enforces diagnostic settings on every resource, the policy fires, the diagnostic setting gets added, and now your Terraform state does not know about it. Microsoft also occasionally adds default properties to resources between API versions, which can surface as drift after a provider upgrade.
The third kind is drift that is not really drift. A common example is the tags attribute. Azure sometimes adds its own tags, like the hidden-link: tags on App Service related resources, and these show up as a diff every time you plan. This is not drift you want to fix, it is drift you want to teach Terraform to ignore with a lifecycle ignore_changes entry. One caveat: ignore_changes matches exact map keys, not wildcard patterns, so something like tags["hidden-link:*"] will not work; you list the real tag key, or ignore the whole tags attribute.
Knowing which type you are dealing with changes what you do about it. The first kind you revert or bring into code. The second kind you add to Terraform or accept. The third kind you teach Terraform to stop caring about.
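In command terms, those three responses map to roughly the following. The resource address and ID here are illustrative placeholders, not from a real codebase:

```shell
# 1. Revert out-of-band changes: a plain apply pushes Azure back to what the code says
terraform apply

# 2. Adopt a change or resource made outside Terraform: write the block, then import it
terraform import azurerm_monitor_diagnostic_setting.policy_logs <resource-id>

#    ...or accept the remote values into state without reverting them
terraform apply -refresh-only

# 3. Teach Terraform to stop caring: lifecycle ignore_changes on the noisy attribute
```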
The simplest drift check: terraform plan
Before you reach for any external tooling, the most honest drift check you have is terraform plan with no code changes. If the plan shows any changes, something has drifted. This is the baseline.
```shell
terraform plan -detailed-exitcode
```
The -detailed-exitcode flag is the one that makes this useful in automation. It returns:
- 0 if there are no changes
- 1 if the command errored
- 2 if there are changes to apply
That exit code is what lets you wrap a plan in a pipeline step and actually fail the build when drift is detected, rather than just printing the output and hoping someone reads it.
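As a sketch of what that branching looks like in a wrapper script — handle_plan_exit is my own helper name here, not a Terraform feature:

```shell
#!/usr/bin/env bash
# Map the -detailed-exitcode result to a human-readable outcome.
# handle_plan_exit is a hypothetical helper, not part of Terraform.
handle_plan_exit() {
  case "$1" in
    0) echo "no drift" ;;
    2) echo "drift detected" ;;
    *) echo "plan error" ;;
  esac
}

# In a real pipeline this follows the actual command:
#   terraform plan -detailed-exitcode -no-color > plan.txt
#   RESULT=$(handle_plan_exit $?)
handle_plan_exit 0   # prints: no drift
handle_plan_exit 2   # prints: drift detected
```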
The limitation is that terraform plan only tells you about drift for resources Terraform knows about. If someone has created a brand new resource directly in Azure that was never deployed by Terraform, a plan will not flag it because it has no state entry to compare against. For that problem the answer is Azure Policy or resource locks, which I'll come back to later.
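If you want some visibility into unmanaged resources in the meantime, one option — assuming you adopt the managed-by tagging convention I describe later in this post — is an Azure Resource Graph query for everything missing the tag:

```kusto
// Resources carrying no managed-by tag, i.e. likely never touched by Terraform
resources
| where isnull(tags['managed-by'])
| project name, type, resourceGroup, subscriptionId
```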
Building a scheduled drift check in Azure DevOps
The pattern I use on most Azure Terraform repositories is a pipeline that runs a plan against every environment on a cadence, typically daily for production and weekly for non-production. Here is the Azure DevOps version.
```yaml
schedules:
  - cron: "0 6 * * *"
    displayName: Daily drift check
    branches:
      include:
        - main
    always: true

trigger: none

pool:
  vmImage: ubuntu-latest

variables:
  - group: terraform-prod

stages:
  - stage: DriftCheck
    jobs:
      - job: Plan
        steps:
          - task: TerraformInstaller@1
            inputs:
              terraformVersion: "1.9.5"

          - script: |
              terraform init \
                -backend-config="resource_group_name=$(BACKEND_RG)" \
                -backend-config="storage_account_name=$(BACKEND_SA)" \
                -backend-config="container_name=tfstate" \
                -backend-config="key=prod.tfstate"
            displayName: Terraform init
            env:
              ARM_CLIENT_ID: $(ARM_CLIENT_ID)
              ARM_TENANT_ID: $(ARM_TENANT_ID)
              ARM_SUBSCRIPTION_ID: $(ARM_SUBSCRIPTION_ID)
              ARM_USE_OIDC: true

          - script: |
              set +e
              terraform plan -detailed-exitcode -no-color -out=drift.tfplan > plan.txt
              echo "##vso[task.setvariable variable=planExit]$?"
            displayName: Terraform plan
            env:
              ARM_CLIENT_ID: $(ARM_CLIENT_ID)
              ARM_TENANT_ID: $(ARM_TENANT_ID)
              ARM_SUBSCRIPTION_ID: $(ARM_SUBSCRIPTION_ID)
              ARM_USE_OIDC: true

          - script: |
              if [ "$(planExit)" = "2" ]; then
                RESOURCES=$(grep -E "^\s+#.*will be" plan.txt | sed 's/^\s*# //' | head -20)
                BUILD_URL="$(System.CollectionUri)$(System.TeamProject)/_build/results?buildId=$(Build.BuildId)"
                curl -H "Content-Type: application/json" -d "{
                  \"@type\": \"MessageCard\",
                  \"@context\": \"http://schema.org/extensions\",
                  \"themeColor\": \"d9534f\",
                  \"summary\": \"Terraform drift detected in prod\",
                  \"title\": \"Terraform drift detected in prod\",
                  \"text\": \"The following resources have drifted:\n\n${RESOURCES}\n\n[View build](${BUILD_URL})\"
                }" $(TEAMS_WEBHOOK_URL)
                echo "Drift detected, failing build"
                exit 1
              elif [ "$(planExit)" = "1" ]; then
                echo "Plan errored, failing build"
                exit 1
              else
                echo "No drift detected"
              fi
            displayName: Notify and fail on drift

          - publish: plan.txt
            artifact: drift-plan
            condition: always()
```
A couple of things worth calling out. The always: true on the schedule is important because without it the pipeline only runs on days when the branch has new commits, which defeats the purpose. The set +e on the plan step is there because the script would otherwise exit immediately on the non-zero exit code from -detailed-exitcode, and we never get to check what the code actually was. Storing the exit code in a pipeline variable lets the next step branch on it. And the publish task runs unconditionally so the plan output is always available as an artefact, even when the build fails.
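The grep in the notify step is what turns raw plan text into a short resource list for the message. Against a fabricated plan.txt it behaves like this:

```shell
# Fabricated sample of terraform plan -no-color output
cat > plan.txt <<'EOF'
Terraform will perform the following actions:

  # azurerm_storage_account.logs will be updated in-place
  ~ resource "azurerm_storage_account" "logs" {
      ~ public_network_access_enabled = true -> false
    }
EOF

# Same extraction as the pipeline step: keep only the resource address lines
grep -E "^\s+#.*will be" plan.txt | sed 's/^\s*# //' | head -20
# prints: azurerm_storage_account.logs will be updated in-place
```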
The Teams webhook URL goes into the variable group as a secret. If you do not have a webhook configured already, in Teams you go to the channel, click the three dots, then Connectors, then add an Incoming Webhook, give it a name, and copy the URL it generates. Be aware that Microsoft is retiring Office 365 Connectors; on newer tenants the equivalent incoming webhook is created through the Workflows app instead, which expects Adaptive Card rather than MessageCard payloads.
The same thing in GitHub Actions
If your code lives in GitHub, the equivalent workflow looks like this. The mechanics are identical, just the syntax is different.
```yaml
name: Terraform Drift Check

on:
  schedule:
    - cron: "0 6 * * *"
  workflow_dispatch:

permissions:
  id-token: write
  contents: read

jobs:
  drift-check:
    runs-on: ubuntu-latest
    environment: prod
    steps:
      - uses: actions/checkout@v4

      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.ARM_CLIENT_ID }}
          tenant-id: ${{ secrets.ARM_TENANT_ID }}
          subscription-id: ${{ secrets.ARM_SUBSCRIPTION_ID }}

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.9.5
          # Disable the wrapper script so redirecting plan output and
          # capturing $? in the run steps below behaves predictably
          terraform_wrapper: false

      - name: Terraform init
        run: |
          terraform init \
            -backend-config="resource_group_name=${{ vars.BACKEND_RG }}" \
            -backend-config="storage_account_name=${{ vars.BACKEND_SA }}" \
            -backend-config="container_name=tfstate" \
            -backend-config="key=prod.tfstate"
        env:
          ARM_USE_OIDC: true
          # The azurerm backend needs the same ARM_* variables as plan
          ARM_CLIENT_ID: ${{ secrets.ARM_CLIENT_ID }}
          ARM_TENANT_ID: ${{ secrets.ARM_TENANT_ID }}
          ARM_SUBSCRIPTION_ID: ${{ secrets.ARM_SUBSCRIPTION_ID }}

      - name: Terraform plan
        id: plan
        run: |
          set +e
          terraform plan -detailed-exitcode -no-color -out=drift.tfplan > plan.txt
          echo "exitcode=$?" >> $GITHUB_OUTPUT
        env:
          ARM_USE_OIDC: true
          ARM_CLIENT_ID: ${{ secrets.ARM_CLIENT_ID }}
          ARM_TENANT_ID: ${{ secrets.ARM_TENANT_ID }}
          ARM_SUBSCRIPTION_ID: ${{ secrets.ARM_SUBSCRIPTION_ID }}

      - name: Notify on drift
        if: steps.plan.outputs.exitcode == '2'
        run: |
          RESOURCES=$(grep -E "^\s+#.*will be" plan.txt | sed 's/^\s*# //' | head -20)
          RUN_URL="${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
          curl -X POST -H "Content-type: application/json" --data "{
            \"text\": \"*Terraform drift detected in prod*\n\n\`\`\`\n${RESOURCES}\n\`\`\`\n\n<${RUN_URL}|View workflow run>\"
          }" ${{ secrets.SLACK_WEBHOOK_URL }}

      - name: Upload plan output
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: drift-plan
          path: plan.txt

      - name: Fail on drift
        if: steps.plan.outputs.exitcode == '2' || steps.plan.outputs.exitcode == '1'
        run: exit 1
```
The Slack webhook follows the same pattern as the Teams one. In Slack you create an incoming webhook through the Slack API site, point it at the channel, and store the URL as a repository secret. The message format I have used here is plain text inside a code block, which is enough for the team to see the resource addresses without needing to dig into the workflow run.
If you want fancier formatting in Slack, the Block Kit format gives you headers, divider lines, and action buttons that link straight to the run. I have done it both ways and honestly the simple text version gets read more reliably. The fancy cards look great in a demo and then get scrolled past when there is real noise in the channel.
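For reference, a minimal Block Kit version of the same message would look something like this as the webhook payload; the run URL is a placeholder:

```json
{
  "blocks": [
    {
      "type": "header",
      "text": { "type": "plain_text", "text": "Terraform drift detected in prod" }
    },
    {
      "type": "section",
      "text": { "type": "mrkdwn", "text": "```azurerm_storage_account.logs will be updated in-place```" }
    },
    {
      "type": "actions",
      "elements": [
        {
          "type": "button",
          "text": { "type": "plain_text", "text": "View workflow run" },
          "url": "https://example.com/actions/runs/123"
        }
      ]
    }
  ]
}
```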
Why the notification matters more than the pipeline
A pipeline that fails silently is barely better than no pipeline at all. The default behaviour of both Azure DevOps and GitHub Actions is to send an email when a scheduled build fails, and in my experience those emails get filtered into a folder no one looks at. The Teams or Slack message in a channel the team actually reads is what turns drift detection from a tickbox exercise into something that gets acted on.
What I would also recommend is being deliberate about which channel the notifications go to. Posting drift alerts into a busy general channel is the same as posting them into nowhere, because they get scrolled past. A dedicated #platform-alerts or #terraform-drift channel that only the platform team is in tends to work better. The signal-to-noise ratio is what makes the difference.
Handling drift that is not really drift
The first time you run a daily drift check on a real environment, you will almost certainly get a notification on day one for something that is not really drift. The two most common offenders on Azure are the hidden-link tags that App Service adds to its associated resources, and properties on Function Apps that get populated by the runtime after deployment.
The fix is ignore_changes on the specific properties, not on the whole resource.
```hcl
resource "azurerm_linux_web_app" "app" {
  # ... other properties ...

  lifecycle {
    ignore_changes = [
      tags["hidden-link:/subscriptions/"],
      site_config[0].application_stack[0].docker_image_name,
    ]
  }
}
```
Be deliberate about ignore_changes. It is a useful tool and it is also how you accidentally end up with a codebase that silently accepts any change to any property. Only use it for the specific properties where Azure is genuinely the source of truth and you do not want Terraform to interfere. A blanket ignore_changes = all is almost never the right answer, even though it makes the drift notification go away.
Preventing drift in the first place with Azure Policy
The best drift detection is the drift that never happens. Azure Policy with deny effects is genuinely useful here, because it stops the portal change at source rather than flagging it after the fact.
The pattern I use is to tag every resource deployed by Terraform with a managed-by = terraform tag, and then have a policy that audits or denies modifications to those resources. One caveat to know upfront: Azure Policy evaluates the resource change, not the identity making it, so a Deny here would also block the Terraform service principal's own writes. In practice that means running the policy in Audit against tagged resources and leaning on RBAC to keep humans out of write roles, rather than expecting the policy itself to distinguish people from pipelines.
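The tagging half of the pattern is nothing clever, just a shared map merged onto every resource. The names here are illustrative:

```hcl
locals {
  common_tags = {
    "managed-by" = "terraform"
  }
}

resource "azurerm_storage_account" "logs" {
  # ... other properties ...
  tags = merge(local.common_tags, { purpose = "diagnostics" })
}
```

The policy definition that keys off this tag follows.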
```hcl
resource "azurerm_policy_definition" "deny_manual_changes" {
  name         = "deny-changes-to-terraform-managed"
  policy_type  = "Custom"
  mode         = "All"
  display_name = "Deny manual changes to Terraform-managed resources"

  policy_rule = jsonencode({
    if = {
      allOf = [
        {
          field  = "tags['managed-by']"
          equals = "terraform"
        },
        {
          field     = "type"
          notEquals = "Microsoft.Resources/subscriptions/resourceGroups"
        }
      ]
    }
    then = {
      effect = "[parameters('effect')]"
    }
  })

  parameters = jsonencode({
    effect = {
      type          = "String"
      allowedValues = ["Audit", "Deny"]
      defaultValue  = "Audit"
    }
  })
}
```
I always deploy this in Audit mode first and watch the policy compliance view for a couple of weeks. The audit view will flag every person, pipeline, or process that touches a tagged resource, and you will almost certainly find at least one legitimate workflow you did not know about — including, because Azure Policy cannot see who made a request, Terraform's own applies. Flipping to Deny on day one without accounting for all of that is a quick way to lose friends on your platform team.
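A definition on its own does nothing until it is assigned. A minimal subscription-scope assignment in Audit mode might look like this; azurerm_subscription_policy_assignment is the provider resource, and the names and data source usage are illustrative:

```hcl
data "azurerm_subscription" "current" {}

resource "azurerm_subscription_policy_assignment" "deny_manual_changes" {
  name                 = "deny-manual-changes"
  subscription_id      = data.azurerm_subscription.current.id
  policy_definition_id = azurerm_policy_definition.deny_manual_changes.id

  # Start in Audit; review compliance before even considering Deny
  parameters = jsonencode({
    effect = { value = "Audit" }
  })
}
```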
My thoughts on drift detection
Drift is one of those things where the cost of ignoring it scales non-linearly. A small amount of drift is annoying. A large amount of drift is terrifying, because you stop trusting your own Terraform plan, and the moment you stop trusting the plan you stop running apply in production, and the whole value proposition of Infrastructure as Code starts to unwind.
If your Azure Terraform repository does not currently have any drift detection in place, the first thing I would add is the scheduled terraform plan -detailed-exitcode pipeline with a Teams or Slack notification. It takes an afternoon to set up, catches the vast majority of what you actually care about, and gives you the signal you need to start tightening things up from there. The Azure Policy work and the ignore_changes cleanup can follow once you have the habit of looking at drift at all.