A random walk in testing research programs

Like many bioinformatics researchers, I am a self-taught programmer, and I only realized the role of software testing years later. I recently started looking into testing as my projects grew in size and in number of dependencies. A quick search led me to Test-Driven Development, but this strict framework turned out to be a poor fit for my research projects, which are full of trial and error. To find testing approaches suited to exploratory research programming, I did some more specific reading. This post focuses on testing research projects, especially data science and machine learning applications. As it is based on my limited reading, please do suggest new resources if you know of any.


1. Testing is crucial for debugging and maintaining code.

2. Implement tests based on your needs, and do it incrementally.

3. Test by Laws.

4. Use validation data sets.

1. Some background on software testing

Testing is crucial for debugging and maintaining programs, and it is widely used in software engineering. Many bioinformatics researchers are aware of testing, or have even written tests in programming classes, but many stop there. Yet popular packages are mostly well tested, and even early-stage research projects can benefit from testing. As Hadley Wickham notes for the testthat package, testing can greatly speed up debugging, refactoring, and maintenance. It increases confidence when making changes, and test cases often double as good usage examples.

At the same time, considerable resources are available to make testing easier. Frameworks such as testthat in R and pytest and unittest in Python simplify writing tests. Services like Travis CI are widely used to automate testing. Many published packages (e.g. stringr, scikit-learn, and PyTorch) are thoroughly tested.

2. Why not just implement testing all the time?

Even though testing is widely used and helpful, it does cost time. Most research projects are at an early stage, where the goal is exploration rather than a stable implementation (as in popular published packages). At this stage, extensive testing can be a waste of time, especially when decisions need to be made promptly based on general patterns. Exploratory data analysis (EDA) is a major part of this early stage: summary statistics are calculated and visualized, a few statistical modeling approaches might be tried informally, and most ideas fail. As EDA is meant to give a quick, cheap, and less stringent view, the benefit of testing may not justify its cost. Much EDA code is never run a second time, so testing it would be wasted effort. Similarly, for researchers who only use existing statistical tools (like the SVM in scikit-learn) rather than implementing new ones, there seems to be little need for testing. There is also little benefit in testing plotting functions.

However, testing becomes crucial when projects expand and the solutions need to be reused. Researchers might port functions to another language, refactor them, add new arguments to existing methods, or build an implementation from a mathematical description. Testing is crucial in these cases, but the trade-off is still relevant: researchers must make the difficult decision of where and how much to test. While truly comprehensive testing looks good, it is often beyond the capacity of a few authors. The following section covers some testing approaches that recur in public projects.

3. How to start testing your data science code

Functions in data science, statistics, and machine learning definitely need testing. However, there are specific difficulties in testing statistical programs (link). The space of possible outcomes of a statistical method is huge, and the outcome depends on the specific input data. Without running the function under test, it is often hard to know the expected result. While testing is possible for a few cases (predefined or random), it is often hard to cover most of them. Additionally, correct and well-tested functions are only half of what a data science project needs to succeed; hyper-parameter tuning and data preprocessing are also crucial. A discussion of proper machine learning practice is beyond the scope of this post, and the following parts focus on testing approaches for the programs themselves.

3.1 Test the Law rather than the output itself. Testing is usually about the behavior of a function (the outputs for given inputs). However, statistical methods often have a huge space of possible outputs, and the expected output often cannot be obtained in an obvious way. Hence, one alternative is to test things that always hold true about the output (Laws) rather than the output itself (link). These include the laws of probability (0 ≤ Pᵢ ≤ 1 and ∑Pᵢ = 1) and known mathematical relationships between multiple outputs of the method. Careful reading of the original mathematical and statistical papers can help uncover such Laws.
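As a minimal sketch of testing by Laws, the snippet below uses only the Python standard library. The softmax function and the helper check_probability_laws are hypothetical names chosen for illustration, not taken from any package discussed here. The exact output for a random input is unknown in advance, but the probability laws and one known relationship (adding a constant to every score should not change the result) can still be checked.

```python
import math
import random


def softmax(scores):
    """Convert real-valued scores into probabilities (hypothetical function under test)."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def check_probability_laws(probs):
    """Check laws that hold for any valid output, whatever the input was."""
    assert all(0.0 <= p <= 1.0 for p in probs)
    assert math.isclose(sum(probs), 1.0, rel_tol=1e-9)


# Random inputs: the exact outputs are unknown, but the Laws must still hold.
rng = random.Random(0)
for _ in range(100):
    scores = [rng.uniform(-50, 50) for _ in range(rng.randint(1, 10))]
    probs = softmax(scores)
    check_probability_laws(probs)

    # Known relationship: shifting every score by a constant leaves softmax unchanged.
    shifted = softmax([s + 3.0 for s in scores])
    assert all(
        math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
        for a, b in zip(probs, shifted)
    )
```

None of these checks needs a hand-computed expected value, which is what makes this style practical for statistical code.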

3.2 Include a validation dataset and a baseline program. Small simulated or real-world datasets (e.g. iris) are often used in tests for sanity checks and performance evaluation. A specific expected output might exist for a small, simple dataset (see the last example for the PCA function), but this is often not the case for real-world datasets. In that case, different internal implementations of the same method are expected to give similar outputs and can be cross-checked (see the second-to-last example for the PCA function). Besides sanity checks, the performance of a new function can also be evaluated: comparing it with a baseline program on a benchmark dataset can demonstrate a performance improvement. Unexpected behavior seen in this comparison can also indicate possible bugs.
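Both ideas can be sketched with the standard library alone. Here streaming_variance is a hypothetical "new" implementation (a one-pass Welford update), and statistics.pvariance from the Python standard library plays the role of the baseline program. The tiny dataset has an answer that can be worked out by hand, while the simulated dataset can only be cross-checked.

```python
import math
import random
import statistics


def streaming_variance(values):
    """Population variance via a one-pass Welford update (hypothetical new implementation)."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    if count == 0:
        raise ValueError("variance requires at least one value")
    return m2 / count


# 1. Tiny dataset whose answer can be worked out by hand:
#    1, 2, 3, 4 have mean 2.5 and population variance 5 / 4 = 1.25.
assert math.isclose(streaming_variance([1.0, 2.0, 3.0, 4.0]), 1.25)

# 2. Larger simulated dataset: no hand-computed answer exists, so the result
#    is cross-checked against the baseline implementation.
rng = random.Random(0)
data = [rng.gauss(10.0, 3.0) for _ in range(1000)]
assert math.isclose(streaming_variance(data), statistics.pvariance(data), rel_tol=1e-9)
```

A disagreement in the second check would not say which implementation is wrong, but it would flag that something deserves a closer look.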

3.3 Treat adding tests as a process. It is difficult to have a reasonably complete list of test cases at the beginning, even for a single function (e.g. PCA). Hence, it is beneficial to treat testing as a process rather than a result. This is the case for scikit-learn, where the tests for PCA were updated from 2011 through 2020 (link). In practice, this means adding tests when fixing bugs (and issues), adding tests with new features, and involving the community (partly indicated by the number of contributors: 16 for _pca.py and 36 for test_pca.py). Even though there are many possible ways to test your program, do have a plan and start with the most important tests.
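The sketch below shows how a test file might grow over time. The median function, the bug it mentions, and the order in which the tests were added are all invented for illustration. The tests follow the pytest naming convention (test_*), and are also called directly at the end so that the snippet runs on its own.

```python
def median(values):
    """Median of a non-empty list (hypothetical function under test)."""
    if not values:
        raise ValueError("median requires at least one value")
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    # Fix for a hypothetical bug report: even-length input used to return ordered[mid].
    return (ordered[mid - 1] + ordered[mid]) / 2


def test_median_odd_length():
    # Written together with the first version of the function.
    assert median([3, 1, 2]) == 2


def test_median_even_length():
    # Added together with the bug fix, so the bug cannot silently return.
    assert median([4, 1, 3, 2]) == 2.5


def test_median_does_not_modify_input():
    # Added later, when a new requirement came up.
    data = [3, 1, 2]
    median(data)
    assert data == [3, 1, 2]


# pytest would collect the test_* functions; they are called here directly
# so that the sketch runs standalone.
test_median_odd_length()
test_median_even_length()
test_median_does_not_modify_input()
```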

3.4 There are many other approaches to testing. A new neural network implementation can be tested through expectations about how its variables change (link1 and link2). Data transformation functions can be tested on relatively deterministic aspects, including dimensions, classes, and values. Extreme cases and expected failures can both be tested.
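For a data transformation, these deterministic aspects might be checked as in the sketch below. The function min_max_scale is hypothetical, and raising ValueError on constant input is a design choice made here purely for illustration.

```python
def min_max_scale(values):
    """Rescale numbers linearly to the range [0, 1] (hypothetical function under test)."""
    if len(values) == 0:
        return []
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("cannot scale constant input")
    return [(v - low) / (high - low) for v in values]


raw = [2, 4, 6, 10]
scaled = min_max_scale(raw)

# Dimensions: one output per input.
assert len(scaled) == len(raw)
# Classes: integers in, floats out.
assert all(isinstance(v, float) for v in scaled)
# Values: the range is [0, 1] and the order is preserved.
assert min(scaled) == 0.0 and max(scaled) == 1.0
assert scaled == sorted(scaled)

# Extreme case: empty input gives empty output.
assert min_max_scale([]) == []

# Expected failure: constant input must raise rather than divide by zero.
try:
    min_max_scale([5, 5, 5])
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError for constant input")
```

With pytest, the last check would normally be written with pytest.raises(ValueError); the try/except form is used here only to keep the sketch free of dependencies.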

4. Examples

Here I go through some testing examples from published packages and explain the concepts mentioned above.

4.1. Testing for str_detect in stringr package

str_detect is a string manipulation function that returns Boolean values indicating whether the input strings match the given patterns. It is a common transformation step in my cleaning and preprocessing pipelines. stringr is a popular R package for common string manipulation, providing easy-to-use functions like str_detect. These string manipulation functions are relatively easy to test compared with statistical ones, as the expected behavior can be determined easily: for a given string, a programmer can readily tell whether it matches a given regex.

library(testthat)
library(stringr)

test_that("special cases are correct", {
  expect_equal(str_detect(NA, "x"), NA)
  expect_equal(str_detect(character(), "x"), logical())
})

test_that("vectorised patterns work", {
  expect_equal(str_detect("ab", c("a", "b", "c")), c(T, T, F))
  expect_equal(str_detect(c("ca", "ab"), c("a", "c")), c(T, F))

  # negation works
  expect_equal(str_detect("ab", c("a", "b", "c"), negate = TRUE), c(F, F, T))
})

This example contains two tests, each with multiple expectations. Each test covers one specific functional case, and each expectation checks one return value.

The first test is about clean behavior for special or extreme inputs. When the input string is NA, the output should be NA; when it is an empty character vector, the output should be an empty logical vector. Ensuring these behaviors helps when constructing and debugging pipelines. Additionally, the test is written in terms of behavior and does not rely on the function's internal mechanisms.

The second test is about vectorization, a useful technique in R whereby a function works in a similar way for single values and for vectors. Vectorization is especially useful for speeding up workflows in R. Multiple possible use cases are presented as expectations[1]. If negate = TRUE, the output is negated. Note that both tests use simple, representative examples rather than going through more comprehensive cases.

4.2. Testing for PCA function in scikit-learn

scikit-learn is a popular Python package for machine learning. It is well documented and well tested. PCA finds successive latent variables (components), each explaining the maximal amount of remaining variance. It is widely used for EDA, visualization, and feature extraction (Details). This function is extensively tested (link). PCA has multiple internal implementations, each involving multistep numeric computation. As with any statistical method, the output of PCA depends on the input, and it is difficult to write down the output for an arbitrary input. A few testing examples are presented here.

import numpy as np
import pytest
from numpy.testing import assert_allclose
from sklearn import datasets
from sklearn.decomposition import PCA

# Module-level setup that the excerpt relies on (the solver list is assumed here).
iris = datasets.load_iris()
PCA_SOLVERS = ['full', 'arpack', 'randomized', 'auto']


@pytest.mark.parametrize('svd_solver', PCA_SOLVERS)
@pytest.mark.parametrize('n_components', range(1, iris.data.shape[1]))
def test_pca(svd_solver, n_components):
    X = iris.data
    pca = PCA(n_components=n_components, svd_solver=svd_solver)

    # check the shape of fit.transform
    X_r = pca.fit(X).transform(X)
    assert X_r.shape[1] == n_components

    # check the equivalence of fit.transform and fit_transform
    X_r2 = pca.fit_transform(X)
    assert_allclose(X_r, X_r2)
    X_r = pca.transform(X)
    assert_allclose(X_r, X_r2)

    # Test get_covariance and get_precision
    cov = pca.get_covariance()
    precision = pca.get_precision()
    assert_allclose(np.dot(cov, precision), np.eye(X.shape[1]), atol=1e-12)

Similar to testthat in R, each test in pytest contains multiple expectations (assertions). At the top, @pytest.mark.parametrize makes the test loop through different arguments: different SVD solvers and predefined numbers of components. assert_allclose plays a role similar to expect_equal in testthat: values are compared within some tolerance. This is crucial because numeric computation can introduce small, irrelevant deviations.
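Why a tolerance is needed can be seen in a few lines of plain Python. This sketch uses math.isclose from the standard library as a stand-in for assert_allclose; the two do not use exactly the same formula, so it only illustrates the idea.

```python
import math

total = 0.1 + 0.2

# Exact comparison fails because of floating-point rounding.
assert total != 0.3

# Comparison within a relative tolerance succeeds.
assert math.isclose(total, 0.3, rel_tol=1e-9)

# Near zero a relative tolerance is of little use, so an absolute tolerance
# is given instead, similar in spirit to atol=1e-12 in the PCA test.
residual = 0.3 - total
assert residual != 0.0
assert math.isclose(residual, 0.0, abs_tol=1e-12)
```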

This is the first test in test_pca.py, and it checks multiple expectations on a validation dataset (iris). First, it tests that the output has the expected dimension: the requested number of components. Second, it tests the agreement between two ways of running the analysis: fitting the model and then transforming the data, versus doing both in one step. Last, a Law is tested: for any input data, the product of the covariance matrix and the precision matrix (its inverse) should be the identity matrix.

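# Requires: numpy as np, pytest, sklearn.decomposition.PCA, numpy.testing.assert_allclose
@pytest.mark.parametrize('svd_solver', ['arpack', 'randomized'])
def test_pca_explained_variance_equivalence_solver(svd_solver):
    rng = np.random.RandomState(0)
    n_samples, n_features = 100, 80
    X = rng.randn(n_samples, n_features)

    pca_full = PCA(n_components=2, svd_solver='full')
    pca_other = PCA(n_components=2, svd_solver=svd_solver, random_state=0)

    # The excerpt stops here; the rest is reconstructed from the description
    # below, and the tolerance is an assumption.
    pca_full.fit(X)
    pca_other.fit(X)
    assert_allclose(pca_full.explained_variance_, pca_other.explained_variance_, rtol=5e-2)
    assert_allclose(pca_full.explained_variance_ratio_, pca_other.explained_variance_ratio_, rtol=5e-2)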



This test checks consistency among different implementations (SVD solvers). The explained variance and the corresponding ratio are calculated with different solvers and compared with the results from the 'full' option. Randomly generated data is used for this test. Even though this argument can often be chosen sensibly (automatically) in practice, the different options should produce similar results.

# Requires: pytest, sklearn.decomposition.PCA, numpy.testing.assert_allclose, and PCA_SOLVERS
@pytest.mark.parametrize("svd_solver", PCA_SOLVERS)
def test_pca_check_projection_list(svd_solver):
    # Test that the projection of data is correct
    X = [[1.0, 0.0], [0.0, 1.0]]
    pca = PCA(n_components=1, svd_solver=svd_solver, random_state=0)
    X_trans = pca.fit_transform(X)
    # "assert X_trans.shape, (2, 1)" always passes, so compare explicitly
    assert X_trans.shape == (2, 1)
    assert_allclose(X_trans.mean(), 0.00, atol=1e-12)
    assert_allclose(X_trans.std(), 0.71, rtol=5e-3)

In this test, a very regular input matrix is supplied, so there is a theoretical expectation for the result: the transformed matrix has a mean of 0 and a standard deviation of approximately 0.71. Such cases are rare but useful for checking that the mathematics is implemented correctly. Agreement with the formula is reassuring, even though real-world datasets are often far more complex.
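The expected values 0 and 0.71 can be worked out by hand, and the arithmetic is sketched below in plain Python. After centering, the two points are (0.5, -0.5) and (-0.5, 0.5), which both lie on the line through (1, -1), so the first principal axis is taken to be (1, -1)/√2. The sign of the axis is arbitrary and does not affect the mean or the standard deviation. This only illustrates the expectation and is not how scikit-learn computes PCA.

```python
import math

X = [[1.0, 0.0], [0.0, 1.0]]

# Centre each column on its mean.
col_means = [sum(col) / len(X) for col in zip(*X)]
centered = [[x - m for x, m in zip(row, col_means)] for row in X]

# First principal axis for this dataset, fixed by symmetry (see the text above).
axis = [1 / math.sqrt(2), -1 / math.sqrt(2)]

# Project every centred point onto the axis.
scores = [sum(c * a for c, a in zip(row, axis)) for row in centered]

mean = sum(scores) / len(scores)
# Population standard deviation (divide by n), the default of numpy's std().
std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))

assert abs(mean) < 1e-12
assert math.isclose(std, math.sqrt(2) / 2)      # about 0.7071
assert math.isclose(std, 0.71, rel_tol=5e-3)    # the tolerance used in the test
```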

Some key take-aways:

1. Testing is crucial for debugging and maintaining code.

2. Implement tests based on your needs, and do it incrementally.

3. Test by Laws.

4. Use validation data sets.


[1] If length(string) == 1 and length(pattern) > 1, the function loops through all the patterns for the same string. If both inputs are vectors, they need to be the same length, and the result is based on pairwise matches.

Acknowledgement: Thanks to Michael Judge and Marcus Hill for their great comments.













bioinformatics Ph.D. Candidate at UGA. Working on metabolomics and ML
