How to Fix Ansible Timeout Error on Ubuntu 20.04


Troubleshooting Guide: Resolving Ansible Timeout Errors on Ubuntu 20.04

Ansible timeout errors are a common problem that can halt your automation workflows. This guide provides a direct approach to diagnosing and resolving them when running Ansible from an Ubuntu 20.04 control node.


1. The Root Cause: Why this happens on Ubuntu 20.04

An Ansible timeout error occurs when the Ansible control node fails to receive a response from a managed node within a specified period. While not unique to Ubuntu 20.04, certain default configurations or common environmental factors on this platform can contribute to the problem.

The primary culprits typically fall into these categories:

  • SSH Connection Issues: The most frequent cause. This includes:
    • Firewall Restrictions: Port 22 is blocked outbound on the control node, inbound on the managed node (more common), or by a cloud provider security group.
    • Network Latency or Congestion: Slow or unreliable network links causing delays beyond the default timeout.
    • SSH Daemon Unresponsive: The sshd service on the managed node might be stopped, overloaded, or misconfigured.
    • DNS Resolution Issues: If using hostnames, a DNS lookup failure or delay can prevent the initial connection.
    • Authentication Failures: While typically resulting in an “authentication failed” message, a very slow or retrying authentication process can sometimes lead to a timeout if it exhausts the connection attempt window.
  • Ansible’s Default Timeout: Ansible’s default SSH connection timeout is 10 seconds. For environments with higher latency, resource-constrained managed nodes, or complex initial SSH handshakes, this is often insufficient.
  • SSH Client Configuration: The underlying SSH client (OpenSSH, commonly used on Ubuntu 20.04) might not be configured to keep the connection alive, especially over long-duration tasks or idle periods.
  • Managed Node Overload: The target server might be experiencing high CPU, memory, or disk I/O, causing its sshd process to respond slowly or not at all.
  • Complex or Long-Running Tasks: Although typically resulting in a task timeout rather than a connection timeout, very early stages of long-running tasks can sometimes hit connection timeouts if initial setup is slow.
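
The categories above usually leave a recognizable message in the SSH client output. As a rough sketch, the helper below maps a failure line to the most likely category. The function name and category labels are made up for illustration, and the mapping is a heuristic rather than a complete diagnosis.

```shell
# Map one failure line (from `ssh -v` or `ansible -vvv` output) to a likely cause.
# Hypothetical helper: the labels are illustrative, not Ansible terminology.
classify_ssh_error() {
  case "$1" in
    *"Connection timed out"*)       echo "network-or-firewall" ;;
    *"Connection refused"*)         echo "sshd-not-listening" ;;
    *"Could not resolve hostname"*) echo "dns" ;;
    *"Permission denied"*)          echo "authentication" ;;
    *)                              echo "unknown" ;;
  esac
}

# Example:
#   classify_ssh_error "ssh: connect to host 10.0.0.5 port 22: Connection timed out"
```

A "Connection timed out" usually means packets are being dropped somewhere on the path, while "Connection refused" means the host answered but nothing is listening on the port.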

2. Quick Fix (CLI)

For immediate testing and temporary resolution, you can adjust settings directly from the command line.

  1. Increase Ansible’s Global Timeout (Environment Variable): This is the quickest way to test if increasing the timeout resolves the issue without modifying configuration files.

    export ANSIBLE_TIMEOUT=30 # Set to 30 seconds, adjust as needed (e.g., 60, 120)
    ansible-playbook your_playbook.yml -i inventory.ini
  2. Pass SSH Arguments for Keepalives: Instruct the SSH client to send “keep-alive” messages to prevent the connection from dropping due to inactivity or network intermediaries.

    ansible-playbook your_playbook.yml -i inventory.ini --ssh-common-args='-o ServerAliveInterval=30 -o ServerAliveCountMax=5'
    • ServerAliveInterval=30: Sends a keep-alive probe through the encrypted channel whenever 30 seconds pass without data from the server.
    • ServerAliveCountMax=5: Disconnects after 5 consecutive probes go unanswered, i.e. after roughly 150 seconds of silence.
  3. Enable Verbose Mode: While not a “fix,” -vvv is crucial for debugging. It provides detailed output on the SSH connection process, which can pinpoint exactly where the timeout is occurring (e.g., “ssh: connect to host … port 22: Connection timed out”).

    ansible-playbook your_playbook.yml -i inventory.ini -vvv
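
When it is unclear how much extra time is needed, the environment-variable approach from step 1 can be wrapped in a small loop. This is only a sketch: the function name and the 30/60/120 second ladder are assumptions, so adjust them to your environment.

```shell
# Run a command with progressively larger ANSIBLE_TIMEOUT values and
# print the first value that succeeds. The ladder of values is an assumption.
try_timeouts() {
  for t in 30 60 120; do
    echo "Trying ANSIBLE_TIMEOUT=${t}" >&2
    if ANSIBLE_TIMEOUT="$t" "$@"; then
      echo "$t"
      return 0
    fi
  done
  echo "Still failing at the largest value; the timeout length is probably not the cause" >&2
  return 1
}

# Example (inventory.ini as used elsewhere in this guide):
#   try_timeouts ansible all -m ping -i inventory.ini
```

If even the largest value fails, the cause is more likely a blocked port or a stopped SSH service than a slow link.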

3. Configuration Check

For persistent solutions, modify the relevant configuration files.

3.1. Ansible Configuration (ansible.cfg)

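Edit your ansible.cfg file. Ansible uses the first one it finds, in this order: the path in the ANSIBLE_CONFIG environment variable, ansible.cfg in the current directory, ~/.ansible.cfg, then /etc/ansible/ansible.cfg. Settings from these files are not merged.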

# Example: /etc/ansible/ansible.cfg or ~/.ansible.cfg

[defaults]
# Increase the default SSH connection timeout from 10 seconds
timeout = 30 

[ssh_connection]
# Optimize SSH connection parameters for reliability and performance
# ServerAliveInterval sends a null packet to keep the connection alive
# ControlMaster and ControlPersist reuse SSH connections, significantly reducing overhead
ssh_args = -o ServerAliveInterval=30 -o ControlMaster=auto -o ControlPersist=60s

Explanation of ssh_args:

  • -o ServerAliveInterval=30: Sends a keep-alive message every 30 seconds.
  • -o ControlMaster=auto: Allows multiple SSH sessions to share a single network connection.
  • -o ControlPersist=60s: Keeps the master connection open for 60 seconds after the last client connection closes, allowing subsequent Ansible tasks to reuse the same connection without re-authenticating. This is critical for performance and reducing timeout chances on subsequent tasks.
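
Because only one ansible.cfg is read, a change made in the wrong file has no effect. The sketch below locates the file that would be picked up and reads a single key from it. Both function names are hypothetical, and the INI parsing is deliberately simple (no multi-line values, no whitespace after the section header).

```shell
# Print the ansible.cfg that would be picked up, following the search order:
# $ANSIBLE_CONFIG, ./ansible.cfg, ~/.ansible.cfg, /etc/ansible/ansible.cfg.
find_ansible_cfg() {
  for f in "${ANSIBLE_CONFIG:-}" ./ansible.cfg "$HOME/.ansible.cfg" /etc/ansible/ansible.cfg; do
    if [ -n "$f" ] && [ -f "$f" ]; then
      echo "$f"
      return 0
    fi
  done
  return 1
}

# Print the value of KEY inside [SECTION] of FILE (first match only).
# Usage: cfg_get FILE SECTION KEY
cfg_get() {
  awk -F '=' -v section="[$2]" -v key="$3" '
    /^\[/ { in_section = ($0 == section); next }
    in_section && $1 ~ ("^[ \t]*" key "[ \t]*$") {
      sub(/^[^=]*=[ \t]*/, "")   # drop everything up to the first "="
      sub(/[ \t]+$/, "")         # drop trailing whitespace
      print
      exit
    }
  ' "$1"
}

# Example:
#   cfg_get "$(find_ansible_cfg)" defaults timeout
```

For an authoritative answer, `ansible --version` reports the config file in use, and `ansible-config dump --only-changed` lists the settings that differ from the defaults.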

3.2. SSH Client Configuration (~/.ssh/config)

For host-specific or user-specific SSH settings, configure your ~/.ssh/config file on the control node.

# Example: ~/.ssh/config

Host *
    # Send a null packet every 30 seconds to keep the connection alive
    ServerAliveInterval 30
    # Disconnect after 5 consecutive keep-alive messages go unanswered
    ServerAliveCountMax 5
    # Use ControlMaster for connection multiplexing
    ControlMaster auto
    # Keep the master connection open for 1 minute after the last client connection
    ControlPersist 60s
    # Disable GSSAPI (Kerberos) authentication; uncomment the next two lines if you do not use Kerberos
    # GSSAPIAuthentication no
    # GSSAPIDelegateCredentials no

Note: The Host * block applies these settings to all SSH connections. You can use a narrower Host pattern (e.g., Host my_servers_*) if you only want these settings for specific targets. If you are not using Kerberos, uncomment the two GSSAPI lines so that they take effect; a misconfigured GSSAPI setup can delay the SSH handshake.
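
To confirm what the SSH client will actually use for a given host after these edits, `ssh -G` prints the fully resolved client configuration without connecting. The filter below is a small sketch (the function name is an assumption) that keeps only the options discussed in this guide.

```shell
# Filter `ssh -G` output down to the keep-alive, multiplexing, and GSSAPI options.
# `ssh -G` prints option names in lowercase, one "name value" pair per line.
show_ssh_timeout_opts() {
  grep -E '^(serveraliveinterval|serveralivecountmax|controlmaster|controlpersist|gssapiauthentication|connecttimeout) '
}

# Example (replace the placeholder with a real managed node):
#   ssh -G your_managed_node_ip_or_hostname | show_ssh_timeout_opts
```

If the values shown differ from what you put in ~/.ssh/config, an earlier Host block or a system-wide setting in /etc/ssh/ssh_config is probably taking precedence.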

3.3. Firewall Configuration

On the Control Node (Ubuntu 20.04): Ensure your local firewall (UFW) allows outgoing SSH connections. UFW’s default policy allows all outgoing traffic, so this is rarely the issue, but it is worth checking.

sudo ufw status verbose
# If outbound SSH is blocked, you might need:
# sudo ufw allow out to any port 22

On the Managed Node(s): Crucially, ensure the managed node’s firewall allows incoming SSH connections on port 22 (or your custom SSH port).

# On the managed node:
sudo ufw status verbose
# If SSH is not allowed:
sudo ufw allow OpenSSH # or sudo ufw allow 22/tcp
sudo ufw enable       # if ufw is inactive

Cloud Provider Firewalls: If your managed nodes are in a cloud environment (AWS EC2, Azure VMs, GCP Compute Engine), verify their security group, network security group, or firewall rules allow inbound TCP port 22 from your Ansible control node’s IP address. This is a very common source of connection timeouts.
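
A quick way to tell a firewall drop from a closed port is to probe the port from the control node with a hard time limit. The sketch below relies on bash's /dev/tcp redirection and the coreutils `timeout` command; the function name and the three labels are assumptions made for illustration.

```shell
# Classify TCP reachability of HOST:PORT from the control node.
#   open     - the TCP handshake completed
#   filtered - no reply within the limit (packets dropped, typical of a firewall or security group)
#   closed   - rejected or failed for another reason (nothing listening, DNS failure, ...)
# Usage: probe_port HOST [PORT] [SECONDS]
probe_port() {
  host=$1
  port=${2:-22}
  limit=${3:-5}
  rc=0
  timeout "$limit" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null || rc=$?
  case $rc in
    0)   echo "open" ;;
    124) echo "filtered" ;;   # exit status 124 means `timeout` killed the attempt
    *)   echo "closed" ;;
  esac
}

# Example:
#   probe_port your_managed_node_ip_or_hostname 22 5
```

A "filtered" result points at UFW or a cloud security group, while "closed" suggests the SSH service itself is not listening.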

3.4. SSH Daemon Status on Managed Node

Confirm the SSH service is running and listening on the expected port on the managed node.

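# On the managed node (the unit is named ssh on Ubuntu; sshd is an alias for it):
sudo systemctl status ssh
sudo ss -tln | grep ':22 '

Expected output includes a LISTEN line for 0.0.0.0:22 (and [::]:22 if IPv6 is enabled). netstat -tuln gives equivalent information, but it comes from the net-tools package, which is often not installed on Ubuntu 20.04.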
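
A plain grep for 22 also matches ports such as 2222 or 8022. As a sketch, the helper below reads the output of `ss -tln` (from iproute2, installed by default on Ubuntu 20.04) and succeeds only when a socket is listening on exactly the given port. The function name is hypothetical.

```shell
# Read `ss -tln` output on stdin; succeed only if something listens on exactly PORT.
# With -t alone, the columns are: State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
has_listener() {
  awk -v port="$1" '
    $1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
    END { exit found ? 0 : 1 }
  '
}

# Example, on the managed node:
#   ss -tln | has_listener 22 && echo "SSH port is listening"
```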


4. Verification

After applying changes, systematically verify your connection.

  1. Test Raw SSH Connectivity: Attempt to SSH directly from your Ansible control node to a managed node that was previously timing out. Use verbose mode to see the connection process.

    ssh -vvv your_user@your_managed_node_ip_or_hostname

    Look for “debug1: Authentication succeeded” and “debug1: channel 0: new [client-session]”. If both appear and you get a shell, your core SSH connectivity is resolved.

  2. Run a Simple Ansible Ping: Execute a basic Ansible ping module to confirm Ansible can establish a connection and communicate.

    ansible all -m ping -i inventory.ini

    (Replace all with your specific host group if needed).

  3. Rerun the Original Playbook: Finally, execute the Ansible playbook that was originally failing.

    ansible-playbook your_playbook.yml -i inventory.ini -vvv # Keep -vvv for detailed insight if it fails again
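
The first two checks can be chained so that the failing layer is obvious. This is a minimal sketch with a hypothetical function name; it assumes key-based authentication (BatchMode refuses password prompts) and that the inventory host name is also a valid SSH destination.

```shell
# Run the checks in order and stop at the first failure.
# Usage: verify_connectivity TARGET INVENTORY
verify_connectivity() {
  target=$1      # inventory host name, also used as the SSH destination (assumption)
  inventory=$2
  if ! ssh -o BatchMode=yes -o ConnectTimeout=10 "$target" true; then
    echo "FAIL: raw SSH"
    return 1
  fi
  if ! ansible "$target" -i "$inventory" -m ping >/dev/null; then
    echo "FAIL: ansible ping"
    return 2
  fi
  echo "OK"
}

# Example:
#   verify_connectivity your_managed_node_ip_or_hostname inventory.ini
```

If raw SSH succeeds but the Ansible ping fails, the problem is more likely in the inventory, ansible.cfg, or the Python setup on the managed node than in the network.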

By checking these configurations systematically and increasing verbosity as needed, you should be able to pinpoint and resolve Ansible timeout errors when running Ansible from an Ubuntu 20.04 control node. Remember that troubleshooting is often an iterative process.