A CTest That Never Ends?


Symptom

A test runs normally by itself, but it never terminates until the timeout is reached when run by CTest.


Sample Code

In order to show the exact problem, I have written a piece of sample code and putted it on my GitHub. There are mainly four files in the source directory:

  • CMakeLists.txt: We need this file to add the sample tests to CTest. It contains only two lines of real code, which simply call a user defined CMake function AddPythonTest() to add the two Python scripts to CTest. The actual implementation of AddPythonTest() is not important to the discussion of this post, for the people that are curious, it is defined in the root level CMake file.
  • install_monetdb_deb.sh: As its name suggest, this is a Bash script that install MonetDB on a Debian or Ubuntu system. Running this script is optional, as the chances are that if you are googling this particular CTest issue, you probably already had a test that can reproduce the problem. So, instead of messing up your environment with my code, you may prefer to experiment with your own tests. However, if you want to run my sample code, then MonetDB needs to be installed and configured properly before running these tests.
  • problematic_test.py: This is the test that will appear to hang with CTest.
  • setup_db_environment.py: This is the final working test, even with CTest.


Environment

The sample code was tested with the following environment:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.2 LTS
Release: 18.04
Codename: bionic

$ cmake --version
cmake version 3.10.2

$ monetdbd -v
MonetDB Database Server v11.33.3 (Apr2019)

$ python --version
Python 2.7.15+


Background

Recently, we are working on a project which use MonetDB as its data store. For a mature project, it is natural to have some tests those require a running MonetDB Database server. Also, based on how MonetDB works, it is preferable to have all the tests sharing the same server, instead of each test starts and stops its own server.

The way we chose to accomplish this is to use CTest’s Test Fixtures. The discussion of how to use test fixtures is generally outside the scope of this post, for information about that, please refer to [1] Test Fixtures With CMake/CTest. The only thing you need to know is that the way how it works requires us to split the code that starts the MonetDB server into a setup program and run it with CTest, as well as make the stop part also as a separate clean-up test. And it is the setup test where we noticed this CTest issue.

Alright, without further ado, let’s look at how we tackled this problem.


Initial Attempt

As an initial attempt, this is how problematic_test.py looks like at first:

DBFARM_PATH="/tmp/dbfarm"
subprocess.call(["monetdbd", "start", DBFARM_PATH])

This code can certainly be written in one line, and this one line code does just what we want, to start the MonetDB server. Of course, this is not a useful test, as it doesn’t really test anything. However, this is intended so that I can show you the minimum code that can reproduce the problem. Also note, in order to save spaces, things like, shebang, imports, and so on, are omitted in purpose, please refer to the source file for the full implementation.

This simple code looks viable, right? So, let’s try it out, without CTest, first:

$ monetdbd create /tmp/dbfarm &> /dev/null; \
monetdbd stop /tmp/dbfarm &> /dev/null; \
./problematic_test.py

Don’t worry, if you don’t understand the first two lines. They basically setup the preconditions for the test. In more details, the first line creates the given DB farm, which is used later in the test. This line should arguably belong to the test script itself, but again I decide to leave it outside of the script in order to show the minimum code. The second line makes sure the given DB farm is stopped before we run the setup script. This line is also not in the test script, because it should belong to the separate clean-up test which runs after all the other tests have finished. In addition, we ignore the return code and outputs of the first two commands. In short, this is just an easy way to run the test that is under developing.

Running the above command should yield no output and it should return immediately after the MonetDB server gets started, which usually takes only a few seconds for an empty DB farm.

Now, let’s try it again with CTest, this time:

$ monetdbd create /tmp/dbfarm &> /dev/null; \
monetdbd stop /tmp/dbfarm &> /dev/null; \
ctest -R problematic_test
Test project /home/yyang/projects/yyangtech_wordpress_com/build
Start 2: Python.problematic_test

As you will see, CTest just hangs there until it gets killed or reaches the default timeout. No needs to mention, you can surely “resolve” this issue by setting the timeout to a small but not too small value so that the test can timeout faster. However, that is not an ideal solution, and is certainly not the solution we are proposing here in this article. So let’s find a better way to solve this problem.


CTest Basics

In order to find the root cause of this problem, we need to understand why CTest seems to be hanging; Or put it in another way, when does CTest terminate?

Normally, CTest ternimates when it can conclude that a test is finished. Where, normal means the test that CTest is running is well functioning, and also there is no bugs in the particular version of CTest you are using, which could potentially cause it to hang.

Since, we already know that the test works fine with itself, and that kind of bugs rarely happen in a stable release, so we can safely assume that they are not the cases, at least for now. Then, the question becomes, how can CTest conclude that a test is finished? Quoting from [2]:

CTest waits for all of the following before concluding that a test is finished:

  • The child process exits
  • The child’s stdout pipe closes
  • The child’s stderr pipe closes

If the child creates a grandchild and then exits or crashes, the grandchildren still hold on to the child’s out/err pipes and can write to them. CTest will keep listening to the pipes until the test timeout is reached. The default test timeout is 1500 seconds (25 minutes), so it can appear to be a hang when in fact it will eventually finish.

This sounds very promising, as it points us to a good direction. So the next step would be figuring out if that is the case, and if so, finding out exactly what CTest is waiting for.


Debugging

Let’s go back to our problematic_test, and first, check if the child and grandchild processes exit properly.

This is easy to do, with the help of the pstree command. Put the following two lines at the end of the Python test script.

os.system("pstree -p $(pgrep ctest)")
os.system("pstree -s -p $(pgrep monetdbd)")

Next, run the test with CTest using the same command expect this time with the -V option, like this:

$ monetdbd create /tmp/dbfarm &> /dev/null; \
monetdbd stop /tmp/dbfarm &> /dev/null; \
ctest -R problematic_test -V

You should find something similar to this, as the output:

...
2: ctest(32739)---problematic_tes(32740)---sh(32746)---pstree(32748)
2: systemd(1)---monetdbd(32742)-+-{monetdbd}(32743)
2:                              |-{monetdbd}(32744)
2:                              `-{monetdbd}(32745)

As you can see, CTest is not seeing monetdbd process as its offspring, even before the child test process exits. That basically rules out the possibility that CTest is waiting for the child process to exit.
Next, let’s check the pipes. By adding the following code to the end of the test script:

COMMAND_GET_PIPE_ID = "ls -l /proc/%d/fd/1 | sed -n -e 's/^.*pipe:\[\(.*\)\]/\\1/p'" % os.getpid()
os.system("lsof -c ctest -c monetdbd -c problematic_ | grep -e $(%s) -e COMMAND" % COMMAND_GET_PIPE_ID)

This one is slightly more complicated. Briefly, the first line is a command to get the pipe ID that is used as the current program’s stdout (fd/1). The ls command shows information that contains the pipe ID, then the sed command parses the pipe ID out of that. Of the second line, the lsof command lists open files associated with the given commands, then grep only open files that are related to the parsed pipe ID. That’s basically what it does. however, you don’t really need to understand those code to be able to use it.
Anyway, after running it with the same ctest command, the output may look something like below:

...
2: COMMAND    PID  USER   FD   TYPE             DEVICE SIZE/OFF    NODE NAME
2: ctest     5746 yyang    6r  FIFO               0,12      0t0  278777 pipe
2: problemat 5747 yyang    1w  FIFO               0,12      0t0  278777 pipe
2: problemat 5747 yyang    2w  FIFO               0,12      0t0  278777 pipe
2: monetdbd  5749 yyang   10w  FIFO               0,12      0t0  278777 pipe

DON’T KILL THE TEST YET. While the CTest is still running, open another terminal and again issue the same enclosed bash command, except putting the pipe ID [278777] directly into the command this time:

$ lsof -c ctest -c monetdbd -c problematic_ | grep -e 278777 -e COMMAND
COMMAND   PID  USER   FD   TYPE             DEVICE SIZE/OFF    NODE NAME
ctest    5855 yyang    6r  FIFO               0,12      0t0  278777 pipe
monetdbd 5858 yyang   10w  FIFO               0,12      0t0  278777 pipe

Aha! Here the issue shows itself! Isn’t it obvious? Not so? Okay, Allow me to explain, briefly. When the first output (Result 1) is produced, the test, problematic_test, is still running. As you can see, the test is using pipe node [278777] as both its stdout (FD 1) and stderr (FD 2). At this point, ctest and monetdbd processes also reference to the same pipe. After the test itself exits, which is why it disappears in the second output (Result 2), monetdbd still holds on to the pipe [278777] and can write to it, while ctest keeps trying to read from the same pipe, thus preventing CTest from terminating.


Solution

As it says [3]:

Understanding your problem is half the solution.

And there is an easy build-in solution to this problem right there in Python. It only requires you to change a few parameters of the subprocess.call() function. Like this:

subprocess.call(["monetdbd", "start", DBFARM_PATH], stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

Based on The Python Standard Library Documentation of the subprocess module, what that means is:

PIPE indicates that a new pipe to the child should be created. With the default settings of None, no redirection will occur; the child’s file handles will be inherited from the parent.

And that’s it. The problem solved. Again, for the final working sample test script, please refer to the source file.


Conclusion

The important take-away from this post is the strategy we use to debug the problem. Although we only present a solution for Python, it is very likely that there are similar mechanisms in other programming languages, too, so the solution should be able to be extended to other domains easily.
It is also interesting to apply the same debugging code to the final working test, then compare the outputs with the one of the problematic test. However, that part is left as an exercise for the reader.




References

[1] Test Fixtures With CMake/CTest
[2] ctest hangs when a test segfaults
[3] Understanding your problem is half the solution (actually the most important half)

 

Advertisements

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s