
cleanup is not called as expected when regression tests time out, which leaves an unclean test environment for the next regression test and may cause tests to get stuck. #4610

@chen1195585098

Description

Regression test cases with execution timeouts:

./tests/00-geo-rep/00-georep-verify-non-root-setup.t - 900 seconds
./tests/00-geo-rep/bug-1600145.t - 600 seconds
./tests/00-geo-rep/georep-stderr-hang.t - 500 seconds
./tests/00-geo-rep/georep-basic-tarssh-ec.t - 500 seconds
./tests/00-geo-rep/georep-basic-rsync-ec.t - 500 seconds
./tests/00-geo-rep/georep-basic-dr-tarssh.t - 500 seconds
./tests/00-geo-rep/georep-basic-dr-tarssh-arbiter.t - 500 seconds
./tests/00-geo-rep/georep-basic-dr-rsync.t - 500 seconds
./tests/00-geo-rep/georep-basic-dr-rsync-arbiter.t - 500 seconds
./tests/00-geo-rep/00-georep-verify-setup.t - 400 seconds
./tests/00-geo-rep/georep-config-upgrade.t - 300 seconds
./tests/00-geo-rep/01-georep-glusterd-tests.t - 300 seconds
./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t - 200 seconds
./tests/bugs/rpc/bug-921072.t - 200 seconds
./tests/bugs/replicate/bug-830665.t - 200 seconds
./tests/bugs/nfs/bug-974972.t - 200 seconds
./tests/bugs/cli/bug-1320388.t - 200 seconds
./tests/basic/namespace.t - 200 seconds
./tests/basic/gfapi/gfapi-ssl-test.t - 200 seconds
./tests/basic/gfapi/gfapi-ssl-load-volfile-test.t - 200 seconds
./tests/basic/fencing/afr-lock-heal-basic.t - 200 seconds
./tests/basic/fencing/afr-lock-heal-advanced.t - 200 seconds
./tests/000-flaky/basic_mount-nfs-auth.t - 200 seconds

Steps to reproduce

  1. Run one of the above tests.
  2. The chosen test eventually times out. Afterwards, a residual glusterfs process is still running and the volume used for the test has not been deleted.
  3. Run a new test case.

Phenomenon
The new test case gets stuck as follows:

... GlusterFS Test Framework ...

/home/shard/glusterfs /home/shard/glusterfs
/home/shard/glusterfs
testing 'timeout' command

============================ (784 / 840) ============================
[10:48:54] Running tests in file ./tests/bugs/transport/bug-873367.t
./tests/bugs/transport/bug-873367.t ..

A process in 'D' (uninterruptible sleep) state also appears:

root      293910  0.0  0.0 222756  1824 pts/2    D+   18:48   0:00 mkdir -p /d/backends /mnt/glusterfs/0 /mnt/glusterfs/1 /mnt/glusterfs/2 /mnt/glusterfs/3 /mnt/nfs/0 /mnt/nfs/1 /d/dev

Analysis
When the new test case sources include.rc and initializes its environment, it executes WORKDIRS="$B0 $M0 $M1 $M2 $M3 $N0 $N1 $DEVDIR"; mkdir -p $WORKDIRS. Since $M0 is still occupied by the residual glusterfs process, mkdir gets stuck. The new test case stops making progress and eventually times out as well.
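To illustrate the analysis above, here is a minimal sketch of how a stale mount on one of the work directories could be detected without touching the mount itself. The helper name `is_listed_mount` is hypothetical and is not part of include.rc; it only reads a mounts table (by default /proc/self/mounts), so it should not block on a hung mount point the way mkdir does.

```shell
# Hypothetical helper, not part of include.rc.
# Returns 0 if the given directory is listed as a mount point in the
# mounts table (second argument, default /proc/self/mounts), 1 otherwise.
is_listed_mount () {
        local dir=$1
        local table=${2:-/proc/self/mounts}
        # Field 2 of each line in the mounts table is the mount point.
        awk -v d="$dir" '$2 == d { found = 1 } END { exit !found }' "$table"
}
```

A caller could check $M0 with this before running mkdir -p $WORKDIRS and, if it is still listed, attempt a lazy unmount (umount -l) first. Whether that is enough to recover from a residual glusterfs process is an assumption that would need testing.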

A trap in include.rc is registered for every testcase, so theoretically, the cleanup procedure will be called when a testcase times out and exits.

function force_terminate () {
        local ret=$?;
        1>&2 echo -e "\nreceived external"\
                        "signal --`kill -l $ret`--, calling 'cleanup' ...\n";

        cleanup;
        exit $ret;
}

trap force_terminate INT TERM HUP

In run-tests.sh, each test is executed with the timeout command. In theory, timeout sends at most two signals during execution:

  1. When the target command times out, SIGTERM is sent (with the --foreground option, it is not delivered to the command's child processes).
  2. If the target command is still running, SIGKILL (which cannot be trapped) is sent after <--kill-after> seconds.
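The expected behaviour without --foreground can be observed with a small stand-alone sketch. The function name `demo_trap_on_timeout` and the durations are made up for illustration; the child bash plays the role of a test script with a TERM trap like the one in include.rc.

```shell
# Illustrative sketch only; names and durations are assumptions.
demo_trap_on_timeout () {
        # The child registers a TERM trap and then waits on a background
        # sleep, so the trap can run as soon as the signal arrives.
        timeout -k 5 1 bash -c '
                trap "echo cleanup-called; exit 143" TERM
                sleep 30 &
                wait $!
        '
        # timeout exits with status 124 when the command timed out.
        echo "timeout-exit=$?"
}
```

Running demo_trap_on_timeout should print the trap's message after about one second, which is the behaviour the cleanup in force_terminate relies on.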

However, so that running tests can be stopped with Ctrl-C, the '--foreground' option is used for the timeout command (commit 1f03309).

After this patch, the test script no longer receives SIGTERM when it times out. Each test that times out is eventually terminated by SIGKILL, and hence force_terminate is not called as expected.
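The effect of --foreground on a nested script can be reproduced in isolation with the sketch below. It is an approximation, not the real run-tests.sh setup: the outer bash stands in for the direct child of timeout (prove in the real invocation) and the inner bash stands in for the test script that registers the trap. The function name and durations are hypothetical.

```shell
# Illustrative sketch only; this does not run prove or any real test.
demo_foreground_grandchild () {
        # With --foreground, timeout signals only its direct child (the outer
        # bash). The inner bash, which holds the TERM trap, is not signaled.
        timeout --foreground -k 1 1 bash -c '
                bash -c "trap \"echo inner-cleanup; exit 143\" TERM; sleep 3 & wait \$!"
                echo outer-finished
        '
        echo "timeout-exit=$?"
}
```

In this sketch the inner trap is not expected to fire, which matches the observation that force_terminate is not called once --foreground is in use.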

Meanwhile, terminate_pids may fail to terminate glusterfs-related processes even if force_terminate is called, because the following command redirects stdout to stderr and does not trigger a buffer flush:

        1>&2 echo -e "\nreceived external"\
                        "signal --`kill -l $ret`--, calling 'cleanup' ...\n";

In this case, subsequent stdout output is also redirected to stderr until the buffer fills up, so function calls in cleanup cannot return the expected information.

Temporary solution

After I removed '--foreground' in run-tests.sh and '>&2 echo' in include.rc, cleanup works correctly when tests time out. The original invocation in run-tests.sh is:

timeout --foreground -k ${kill_after_time} ${cmd_timeout} prove -vmfe '/bin/bash' ${t}

So, should --foreground be an optional setting instead of being used by default? When it is specified, cleanup is not called even if tests time out, and residual processes and test volumes are left behind. That seems unreasonable.
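As a rough sketch of what an optional setting could look like, the helper below builds the timeout command line with or without --foreground. The function name `build_timeout_cmd` and its first parameter are hypothetical and do not exist in run-tests.sh; the rest of the command line follows the invocation quoted above.

```shell
# Hypothetical helper, not taken from run-tests.sh.
# $1: "yes" to include --foreground, anything else to omit it
# $2: kill-after time, $3: command timeout, $4: test file
build_timeout_cmd () {
        local foreground=$1 kill_after_time=$2 cmd_timeout=$3 t=$4
        local fg_opt=""
        if [ "$foreground" = "yes" ]; then
                fg_opt="--foreground "
        fi
        echo "timeout ${fg_opt}-k ${kill_after_time} ${cmd_timeout} prove -vmfe '/bin/bash' ${t}"
}
```

For example, build_timeout_cmd no 5 200 ./tests/basic/namespace.t prints the command without --foreground, so SIGTERM would again reach the test script on timeout, at the cost of the Ctrl-C handling that the option was added for.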
