How to Fix MongoDB Segmentation Fault on AWS EC2


Troubleshooting MongoDB Segmentation Fault on AWS EC2

As a Senior DevOps Engineer at WebToolsWiz.com, I’ve encountered my share of critical database issues. A “Segmentation Fault” for MongoDB on an AWS EC2 instance is particularly nasty, often pointing to underlying system resource constraints rather than a direct MongoDB bug. This guide will walk you through diagnosing and resolving this issue efficiently.


1. The Root Cause: Why this happens on AWS EC2

A “Segmentation Fault” (often abbreviated to “segfault”) is a fault raised by hardware with memory protection: it indicates that a program (in this case, mongod) attempted to access a memory location it is not allowed to access, or to access memory in a way that is not permitted. The kernel terminates the process immediately to prevent data corruption or further instability.

On AWS EC2 instances, particularly for memory-intensive applications like MongoDB, the most common culprits for a mongod segfault are:

  1. Insufficient ulimit Settings:

    • nofile (Number of Open File Descriptors): MongoDB’s WiredTiger storage engine utilizes memory-mapped files and a large number of file descriptors. Default ulimit -n values on many Linux distributions (especially on new EC2 instances) are often set to 1024 or 4096, which are insufficient for a production MongoDB deployment under load. When MongoDB exhausts its allowed file descriptors, subsequent attempts to open files can lead to illegal memory access.
    • nproc (Number of Processes/Threads): While less common than nofile for direct segfaults, an inadequate nproc limit can prevent MongoDB from spawning necessary background processes or threads, leading to instability.
    • as (Address Space/Virtual Memory): If MongoDB attempts to allocate more virtual memory than allowed by the as limit, it can result in a segfault.
  2. Out-of-Memory (OOM) Conditions: While the OOM killer typically terminates processes cleanly, an extremely low-memory condition combined with memory mapping activities can sometimes trigger a segfault. This is more prevalent on smaller EC2 instance types with limited RAM.

  3. Corrupted Data Files: Less common but possible. If the underlying data files (.wt files, journaling files) become corrupted due to unexpected shutdowns or storage issues, MongoDB might attempt to read invalid data into memory, leading to a segfault.

  4. Hardware/Virtualization Issues: On rare occasions, a problem with the underlying EC2 host hardware or virtualization layer could manifest as memory corruption, leading to a segfault. This is generally outside your direct control and would require AWS support.

For the vast majority of cases on EC2, ulimit misconfigurations are the primary suspect.
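
A quick way to check whether your environment fits this pattern is to look at the current shell limits and at any OOM killer activity (note that the shell’s limits may differ from those of the running mongod process, which is checked in the Verification section):

    ulimit -n   # current open-file limit for this shell (often only 1024 by default)
    ulimit -u   # current process/thread limit
    sudo dmesg -T | grep -iE "out of memory|killed process"   # any OOM killer activity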


2. Quick Fix (CLI)

The immediate goal is to get MongoDB running. This section focuses on temporary ulimit adjustments and checking logs.

  1. Check System Logs for Crash Details:

    dmesg | grep -i "mongod"
    sudo journalctl -xe | grep -i "mongod"
    sudo tail -n 200 /var/log/syslog # Or /var/log/messages
    sudo tail -n 200 /var/log/mongodb/mongod.log # Or wherever your mongod.log is

    Look for specific messages about “Segmentation fault,” “faulting address,” or “OOM killer” activity related to the mongod process.

  2. Temporarily Increase ulimits (Current Session): Before starting mongod, elevate your session’s ulimits. Replace 64000 with a value appropriate for your workload (MongoDB recommends at least 64000 for nofile).

    ulimit -n 64000 # Number of open file descriptors
    ulimit -u 64000 # Number of user processes

    Note: These changes are only for the current shell session. If MongoDB is managed by systemd or another init system, these ulimit changes might not propagate to the mongod service process unless configured at the service level (covered in the Configuration Check).

  3. Restart MongoDB (with increased ulimits if possible):

    • If starting manually (the data directory and log file must be writable by the user you start mongod as; package installs usually own them as the mongodb user):
      sudo systemctl stop mongod # If it's running as a service
      # Ensure your shell has the elevated ulimits from step 2
      mongod --config /etc/mongod.conf &
    • If restarting via systemd (and your systemd unit has ulimit overrides):
      sudo systemctl restart mongod
      sudo systemctl status mongod
  4. Monitor Immediately:

    sudo tail -f /var/log/mongodb/mongod.log

    Check if it starts successfully or segfaults again. If it segfaults again immediately, the issue might be deeper than just ulimits (e.g., severe data corruption or persistent OOM).
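
    If the crash repeats, a core dump can help pinpoint the cause. On distributions with systemd-coredump installed, for example:

    coredumpctl list mongod    # recorded crashes for mongod, if any
    coredumpctl info mongod    # signal and stack summary for the most recent crash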


3. Configuration Check

To ensure permanent stability, you must persist the ulimit changes.

  1. /etc/security/limits.conf (Persistent ulimits): This file defines resource limits for users and groups. Add or modify the following lines, typically for the mongodb user:

    sudo vim /etc/security/limits.conf

    Add these lines (or adjust if they already exist):

    mongodb soft nofile 64000
    mongodb hard nofile 64000
    mongodb soft nproc 64000
    mongodb hard nproc 64000
    # Optional: If you suspect virtual memory limits
    # mongodb soft as unlimited
    # mongodb hard as unlimited
    • soft: The current enforceable limit.
    • hard: The maximum value the soft limit can be set to.
    • nofile: Number of open files.
    • nproc: Number of user processes.
    • as: Address space (virtual memory). unlimited is recommended by MongoDB for this.
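
    As a quick sanity check after saving the file, you can open a fresh session as the service account and print its limits (this assumes the user is named mongodb; some RHEL-based packages use mongod, and whether su applies limits.conf depends on the distribution’s PAM configuration, so the authoritative check remains /proc/<PID>/limits in the Verification section):

    sudo su - mongodb -s /bin/bash -c 'ulimit -n; ulimit -u'   # should print 64000 twice if the limits apply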
  2. /etc/pam.d/common-session or /etc/pam.d/system-auth: Ensure that the pam_limits.so module is included, which applies the limits defined in limits.conf. Most modern Linux distributions have this enabled by default, but it’s good to verify. Look for a line similar to:

    session required pam_limits.so
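
    A quick check (file names vary: Debian/Ubuntu use common-session, RHEL-family systems use system-auth):

    grep pam_limits /etc/pam.d/common-session /etc/pam.d/system-auth 2>/dev/null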
  3. systemd Unit File (/etc/systemd/system/mongod.service or similar): For services managed by systemd, limits.conf changes might not directly apply to the service process without specific directives in the systemd unit file. Edit the mongod.service file:

    sudo systemctl edit mongod.service # Use 'edit' to create an override file

    Add or modify the [Service] section to explicitly set the limits:

    [Service]
    LimitNOFILE=64000
    LimitNPROC=64000
    LimitAS=infinity # For virtual memory, similar to 'unlimited'

    After editing the systemd unit, you must reload the systemd daemon:

    sudo systemctl daemon-reload
    sudo systemctl restart mongod
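
    You can confirm that systemd has registered the new limits for the unit:

    systemctl show mongod --property=LimitNOFILE,LimitNPROC,LimitAS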
  4. /etc/mongod.conf (MongoDB Configuration): Review your MongoDB configuration, specifically memory settings.

    sudo vim /etc/mongod.conf
    • storage.wiredTiger.engineConfig.cacheSizeGB: Ensure this is set appropriately for your EC2 instance’s RAM. By default, WiredTiger uses the larger of 50% of (RAM - 1 GB) or 256 MB; if you have manually overridden it to something too high for a small instance, it can contribute to OOM pressure. Generally, leave the default in place, or cap it at roughly 50% of RAM so the OS, connections, and filesystem cache have headroom.
    • systemLog.path: Confirm that your mongod.log file path is correct and accessible for future debugging.
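
    For reference, a minimal sketch of the relevant sections for a hypothetical 4 GB instance might look like the following (the values and paths are illustrative, matching a typical Ubuntu package install, not a recommendation for your workload):

    storage:
      dbPath: /var/lib/mongodb
      wiredTiger:
        engineConfig:
          cacheSizeGB: 1.5   # roughly 50% of (RAM - 1 GB) on a 4 GB instance
    systemLog:
      destination: file
      path: /var/log/mongodb/mongod.log
      logAppend: true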
  5. Kernel Parameters (sysctl): While less directly related to segfaults, optimizing kernel parameters for a database server can prevent resource contention that might indirectly lead to instability.

    sudo vim /etc/sysctl.conf

    Add/modify:

    vm.swappiness=1 # Reduce swapping
    vm.dirty_ratio=15 # Allow up to 15% of RAM for dirty pages
    vm.dirty_background_ratio=5 # Start writing dirty pages to disk when 5% of RAM is dirty

    Apply changes:

    sudo sysctl -p
  6. EC2 Instance Type: Verify that your chosen EC2 instance type (e.g., t3.medium, m5.large) provides sufficient RAM and CPU for your MongoDB workload. An undersized instance can lead to constant resource contention and instability.
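
    From inside the instance you can confirm what you are actually running on (the metadata call assumes IMDSv1 is enabled; IMDSv2-only instances require a session token first):

    curl -s http://169.254.169.254/latest/meta-data/instance-type   # e.g. t3.medium
    free -h    # total and available RAM
    nproc      # vCPU count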


4. Verification

After applying the configuration changes, it’s crucial to verify that they have taken effect and that MongoDB is stable.

  1. Restart MongoDB:

    sudo systemctl daemon-reload # If you modified systemd unit files
    sudo systemctl restart mongod
    sudo systemctl status mongod
  2. Check mongod Logs: Immediately after restart, check the MongoDB logs for successful startup messages and no further segfaults.

    sudo tail -f /var/log/mongodb/mongod.log
  3. Verify ulimits for the Running Process: This is the most critical step to ensure your ulimit changes are active for the mongod process itself.

    • Find the mongod process ID (PID):
      pgrep mongod
      # Or: ps aux | grep mongod | grep -v grep
    • Check its effective limits (replace <PID> with the actual PID):
      cat /proc/<PID>/limits
      You should see Max open files (for nofile) and Max processes (for nproc) reflecting the values you set (e.g., 64000). For Max address space, you should see unlimited or a very large number.
  4. Connect to MongoDB:

    mongosh   # use the legacy 'mongo' shell on MongoDB versions that predate mongosh

    Perform some basic operations to ensure connectivity and responsiveness.
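
    For example, a couple of read-only commands run non-interactively with --eval (add the appropriate --username/--password options if authentication is enabled):

    mongosh --eval 'db.runCommand({ ping: 1 })'       # expect { ok: 1 }
    mongosh --eval 'db.serverStatus().connections'    # open vs. available connections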

  5. Monitor System Resources: Use tools like htop, top, and free -h, along with AWS CloudWatch metrics (CPUUtilization, disk and network I/O, and memory metrics if the CloudWatch agent is installed), to monitor the instance under typical load for a period. Look for high memory usage, excessive swapping, or sustained CPU spikes that might indicate deeper performance issues beyond the segfault itself.
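
    A reasonable on-host baseline while the system handles normal traffic (iostat is part of the sysstat package):

    free -h          # memory and swap usage
    vmstat 5         # watch the si/so columns for swapping
    iostat -xz 5     # per-device I/O latency and utilization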

By methodically following these steps, you should be able to diagnose and resolve the MongoDB Segmentation Fault on your AWS EC2 instance, restoring stability to your database operations.