Ansible beyond the tutorial: idempotency, drift detection, and the playbook that saved a 3am incident

The demo playbook installs nginx and starts it. It works once on a clean VM and everyone nods in the meeting. What nobody demonstrates is running the same playbook six months later on a server where an engineer manually edited /etc/nginx/nginx.conf to temporarily fix a production problem and then forgot to document it. Or after the nginx package got updated by an unnoticed apt cron job. Or on a server that was never properly converged because someone cancelled the playbook halfway through.

Production Ansible is not about running playbooks. It is about reliably converging infrastructure to known states — including infrastructure that has drifted from whatever Ansible last configured.

Idempotency is a contract, not a feature

Ansible modules are documented as idempotent and most of them are. But "idempotent" in Ansible means "running this module twice with the same arguments produces the same result" — it does not mean "this module is safe to run on a system in an unknown state."

Consider a popular pattern that breaks under drift:

# This looks fine. It is not fine if the service was manually stopped.
- name: Ensure application service is running
  ansible.builtin.service:
    name: myapp
    state: started
    enabled: true

If an engineer ran systemctl disable myapp --now on the server to debug a CPU spike and then forgot about it, this task quietly starts and re-enables the service and reports changed, the same status any routine change produces. Nothing in that output tells you that a manual intervention occurred. The playbook converges the state, but you have lost the signal that drift happened.

The pattern I use instead:

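- name: Check if service has been manually overridden
  ansible.builtin.command: systemctl is-enabled myapp
  register: svc_enabled
  changed_when: false
  failed_when: false   # is-enabled exits non-zero when the unit is disabled
  check_mode: false    # read-only, so run it under --check too; otherwise stdout is undefined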

- name: Warn on manual override
  ansible.builtin.debug:
    msg: "WARNING: myapp service is {{ svc_enabled.stdout }} — expected 'enabled'"
  when: svc_enabled.stdout != 'enabled'

- name: Converge service state
  ansible.builtin.service:
    name: myapp
    state: started
    enabled: true

The warning does not block the playbook. It produces a visible signal that a human made a change that Ansible is now overwriting. In a CI/CD context you parse that output and create an alert.
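
One way to do that parsing, as a sketch under assumptions: run the playbook with a JSON stdout callback (for example ansible.posix.json) and scan the result for debug messages that start with WARNING:. The plays, tasks, hosts nesting below is how I understand that callback's output, so check it against your Ansible version. find_drift_warnings and SAMPLE_REPORT are hypothetical names used only for illustration.

```python
def find_drift_warnings(report: dict, prefix: str = "WARNING:") -> list[tuple[str, str]]:
    """Return (host, message) pairs for task results whose msg starts with prefix."""
    found = []
    for play in report.get("plays", []):
        for task in play.get("tasks", []):
            for host, result in task.get("hosts", {}).items():
                if result.get("skipped"):
                    continue  # the when: condition was false, so no drift on this host
                msg = result.get("msg")
                if isinstance(msg, str) and msg.startswith(prefix):
                    found.append((host, msg))
    return found


# Hypothetical excerpt of callback output, trimmed to the fields used above.
SAMPLE_REPORT = {
    "plays": [
        {
            "tasks": [
                {
                    "task": {"name": "Warn on manual override"},
                    "hosts": {
                        "api1": {"msg": "WARNING: myapp service is disabled, expected 'enabled'"},
                        "api2": {"skipped": True},
                    },
                }
            ]
        }
    ]
}

for host, msg in find_drift_warnings(SAMPLE_REPORT):
    print(f"{host}: {msg}")
```

In a pipeline you would load the real report with json.load and hand the resulting pairs to whatever alerting you already have.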

The 3am playbook

The scenario: production API servers returning 502. Load balancer health checks failing. The on-call engineer has 90 seconds before customers notice. The cause: a deploy job timed out halfway through updating the nginx upstream config, leaving three of eight servers with the old configuration and five with the new.

You write the remediation playbook when you are not under pressure, so that when you are under pressure, you run one command:

---
- name: Emergency nginx config convergence
  hosts: api_servers
  serial: 2               # converge two at a time, keep 6/8 serving traffic
  max_fail_percentage: 25 # evaluated per batch: with serial: 2, one failed host (50%) aborts the play

  tasks:
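    - name: Render the candidate config to a scratch path
      ansible.builtin.template:
        src:  templates/nginx-upstream.conf.j2
        dest: /tmp/nginx-upstream-validate.conf
        mode: '0600'
      changed_when: false

    # nginx -t -c expects a complete main config, and an upstream fragment is not one.
    # Wrap it in a minimal http block (assumes the template contains only upstream blocks).
    - name: Write a minimal wrapper config around the candidate
      ansible.builtin.copy:
        dest: /tmp/nginx-validate-main.conf
        mode: '0600'
        content: |
          events {}
          http {
            include /tmp/nginx-upstream-validate.conf;
          }
      changed_when: false

    - name: Syntax check the rendered config
      ansible.builtin.command: nginx -t -c /tmp/nginx-validate-main.conf
      changed_when: false
      # If nginx -t fails, the play fails here, before touching the live config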

    - name: Deploy nginx upstream config
      ansible.builtin.template:
        src:   templates/nginx-upstream.conf.j2
        dest:  /etc/nginx/conf.d/upstream.conf
        owner: root
        group: root
        mode:  '0644'
        backup: true    # keeps upstream.conf.TIMESTAMP on the server
      notify: reload nginx

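    # Handlers normally run after all tasks; flush them so the reload
    # happens before the health check instead of after it.
    - name: Reload nginx now if the config changed
      ansible.builtin.meta: flush_handlers

    - name: Verify health endpoint responds after reload
      ansible.builtin.uri:
        url:            "http://localhost:{{ app_port }}/health"
        status_code:    200
        timeout:        10
      register: health
      until: health.status == 200
      retries: 3
      delay: 2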

  handlers:
    - name: reload nginx
      ansible.builtin.service:
        name:  nginx
        state: reloaded
      # reloaded, not restarted — zero downtime config update

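serial: 2 is the parameter that matters most. With eight servers and serial: 2, at most two are being changed at any moment, so at least six keep serving traffic, and a failure in one pair stops the play before the next pair is touched. Without it, Ansible runs each task across all eight hosts (up to forks at a time, five by default) before moving on to the next task, so a bad config reaches every server before the first health check runs: faith-based deployment at scale.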
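
The arithmetic behind those numbers, as a small sketch. It assumes, as the Ansible documentation describes, that max_fail_percentage is evaluated per batch when combined with serial and that the percentage must be exceeded rather than equalled. plan_batches and batch_aborts are my own names, not Ansible's.

```python
def plan_batches(hosts: list[str], serial: int) -> list[list[str]]:
    """Split hosts into consecutive batches of at most `serial` hosts."""
    return [hosts[i:i + serial] for i in range(0, len(hosts), serial)]


def batch_aborts(failed: int, batch_size: int, max_fail_percentage: float) -> bool:
    """True when the failed share of one batch exceeds (not equals) the limit."""
    return failed / batch_size * 100 > max_fail_percentage


hosts = [f"api{n}" for n in range(1, 9)]      # eight API servers
batches = plan_batches(hosts, serial=2)

# Worst case: every host in the largest batch is out of rotation at once.
min_serving = len(hosts) - max(len(batch) for batch in batches)

print(len(batches), min_serving)              # 4 6
print(batch_aborts(1, 2, 25))                 # True: 50% of the batch exceeds 25%
```

With batches of two, any single failure is already 50% of the batch, so a limit of 25 behaves as "stop on the first failed host".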

Vault and the secret you accidentally committed

Every team eventually commits a secret to their Ansible repository. The textbook answer is Ansible Vault. The production answer: Ansible Vault for secrets that belong to the playbook, external secrets management (HashiCorp Vault, AWS Secrets Manager) for secrets shared between systems, and no_log: true on every task that handles either.

- name: Set database credentials in application config
  ansible.builtin.template:
    src:  templates/database.php.j2
    dest: /var/www/html/config/database.php
    mode: '0640'
  vars:
    db_password: "{{ lookup('aws_ssm', '/prod/app/db_password', region='eu-west-1') }}"
  no_log: true   # prevents the rendered template (containing the password) from appearing in logs

no_log: true suppresses not just the task output but also the diff output. If you run --diff to review what changed, you will not see the rendered template. That is a feature, not a limitation.

Testing playbooks before they matter

I use two tools for every non-trivial role. The first is Molecule, for role-level testing: it spins up a container or VM, runs the role, then runs a verifier (Testinfra in my case) to check that the desired state was actually achieved, not just that Ansible reported success.

# molecule/default/tests/test_nginx.py
import testinfra

def test_nginx_is_running(host):
    nginx = host.service("nginx")
    assert nginx.is_running
    assert nginx.is_enabled

def test_nginx_config_is_valid(host):
    result = host.run("nginx -t")
    assert result.rc == 0

The second is --check mode with --diff before every production run, which shows what Ansible would change without actually changing it. The diff output on template tasks is particularly useful: you see exactly which lines in the config file would be modified. Limiting the run to one server with --limit 'api_servers[0]' is non-negotiable (quote the pattern, or the shell may treat the brackets as a glob). A --check run across the full production inventory can take minutes; on one representative server it takes seconds.
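
If the dry run is launched from a wrapper script, building the command as an argument list sidesteps the quoting problem entirely. A sketch, with a hypothetical playbook name and helper function:

```python
import shlex


def dry_run_command(playbook: str, group: str) -> list[str]:
    """Build an ansible-playbook dry run limited to the first host of a group."""
    return ["ansible-playbook", playbook, "--check", "--diff", "--limit", f"{group}[0]"]


cmd = dry_run_command("nginx-converge.yml", "api_servers")

# shlex.join shows the command as you would type it in a shell, with quoting added.
print(shlex.join(cmd))
# ansible-playbook nginx-converge.yml --check --diff --limit 'api_servers[0]'
```

Passed as a list to subprocess.run, the pattern reaches ansible-playbook unchanged, with no shell involved.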

What I watch for in Ansible code review

Tasks with no changed_when on command or shell modules report changed every time they run, even if nothing changed. That makes the changed count in the play recap useless as a drift signal. ignore_errors: true on anything infrastructure-related is the equivalent of a bare catch (Exception e) {}: the playbook should stop, not continue with a potentially broken server still in the pool. The third is missing become: false on tasks that do not need root: a playbook where every task runs as root is a playbook where any bug has the blast radius of the entire server.
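
The first two checks are mechanical enough to script. A rough sketch that works on tasks already parsed from YAML into dictionaries; review_tasks and the sample tasks are hypothetical, and ansible-lint's no-changed-when and ignore-errors rules cover the same ground more thoroughly.

```python
COMMAND_MODULES = {"command", "shell", "ansible.builtin.command", "ansible.builtin.shell"}


def review_tasks(tasks: list[dict]) -> list[str]:
    """Flag command/shell tasks without changed_when and any ignore_errors: true."""
    findings = []
    for task in tasks:
        name = task.get("name", "<unnamed>")
        # intersection() iterates the task's keys, i.e. the module and keyword names
        if COMMAND_MODULES.intersection(task) and "changed_when" not in task:
            findings.append(f"{name}: command or shell task without changed_when")
        if task.get("ignore_errors") is True:
            findings.append(f"{name}: ignore_errors: true")
    return findings


# Hypothetical tasks, shaped the way a YAML parser would return them.
SAMPLE_TASKS = [
    {"name": "Run migration", "ansible.builtin.command": "/opt/myapp/migrate"},
    {
        "name": "Check service",
        "ansible.builtin.command": "systemctl is-enabled myapp",
        "changed_when": False,
    },
    {
        "name": "Restart worker",
        "ansible.builtin.service": {"name": "worker", "state": "restarted"},
        "ignore_errors": True,
    },
]

for finding in review_tasks(SAMPLE_TASKS):
    print(finding)
```

The become check is harder to automate because it depends on what each task actually touches, so that one stays a human review item.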