Our manual analysis reveals that inheritance and utility methods are the two major causes of de- pendencies among test cases (inheritance: 79.4%, utilities: 13.2%). While inheritance and utilities are standard in improving code maintenance, modularization, and usability, the necessity and neg- ative impacts of such code reuse in test cases should be revisited more thoroughly by future work. Intuitively, inheritance and sharing utility code may violate the test independence assumption, and
inheritance may even reduce test code readability and increase test execution overhead [? ]. Our manually-uncovered dependency-related test smells confirm such intuition to some extent: Code reuse through inheritance and utility may cause hard-to-maintain test fixtures (Test Smell #3), unstable test environments (Test Smell #2), and flaky tests (Test Smell #4). In short, our work calls for future research efforts to revisit current test design and its impact on test quality, and the effectiveness of test impact analysis. Moreover, automated approaches are needed to refactor test cases for better design and improved effectiveness of test impact analysis.
Chapter 6
Threats to validity
In this chapter, we discuss the threats to validity of this thesis.
6.1
External Validity.
We conduct our study on eleven large-scale open source systems in different domains. We find that the overall findings hold in all the studied systems. Our studied systems are all implemented in Java so the results may not be generalizable to systems in other programming languages. Future studies should validate the generalizability of our findings in systems that are implemented in other pro- gramming languages. Although the studied systems have different levels of development activities, our test impact analysis in CI is limited to a one-year period and does not cover the very beginning of the system development. Future studies can more thoroughly perform test impact analysis and expand it to the entire development lifecycle of software systems.
6.2
Construct Validity.
In this thesis, we use static analysis to uncover the dependencies between test cases and other source code files. During our manual study, we did not find any false positives that are caused by our static analysis approach. We choose static analysis over dynamic analysis for recovering dependencies at the class-level because of the three following reasons: 1) Prior studies [23, 29] found that static test impact analysis have similar performance compared to dynamic-based techniques, and class- level dependency gives better results compared to method-level dependency. 2) A large number of test cases suffer from flakiness [3]. Flaky tests expose different behaviors among multiple runs and may result in differences in code coverage. 3) In each CI execution, there may exist many test cases not being executed due to various reasons (e.g., failures of preceding test cases) [4, 45]. 4)
Using static analysis is shown to have high accuracy on identifying class dependencies [30]. Future studies may use dynamic analysis to re-evaluate our findings. We identify a file as a test case if it contains calls to testing frameworks such as JUnit or TestNG (i.e., contain @Test annotation). Although we did not find any false positives during our manual study, some of the identified test cases may be skipped by developers during the build process. Future studies are needed to verify the accuracy of our test identification approach. Note that, older versions of the testing frameworks (e.g., JUnit 3) use inheritance to define a test case (i.e., by calling extends TestCase) and not @Test annotation. In such cases, our test identification approach may not work properly. However, the studied systems are using newer versions of the testing frameworks, which use @Test annotation to define test cases. Our study focuses on studying Java source code and test cases. Although the majority of the studied systems are written in Java, some of them may contain code that is written in different programming languages. For example, Flink contains 25% non-Java tests, Kafka contains 27% non-Java tests, Hadoop contains 0.8% C++ tests, and Hbase contains 4.2% JavaScript and 1.5% Ruby tests. After some manual investigation of the build script and the executed tests in the CI process, we find that these non-Java tests are either excluded in the daily CI build, or are executed in a separate CI job (i.e., does not affect the CI process of the Java components). Future studies are needed to evaluate the effect of the polyglot nature of a system on its test design and execution.
6.3
Internal Validity.
In this thesis, we use static analysis to uncover the class dependencies. However, there may be some limitations in the static analysis (e.g., difficult to analyze reflection) that cause inaccurate results. Even though we did not find such cases in our manual study, future studies should validate our findings on other systems.
Chapter 7
Conclusion
To reduce test execution overhead, prior studies have proposed and evaluated techniques such as test impact analysis in the traditional software development settings. However, the effectiveness of test impact analysis remains unknown in the continuous integration setting, where developers continu- ously run test cases daily to ensure the quality of the integrated code. The complexity of modern software systems and the frequent code integration may introduce additional code dependencies that affect the effectiveness of test impact analysis. In this thesis, we first study the effectiveness of static test impact analysis on eleven open-source systems in the continuous testing setting. We analyzed one year of software development history. We found that most test cases (around 50% or more) are impacted in the daily test execution due to high code dependencies, although the number of changed files is small. Motivated by our finding, we further studied the code dependencies between the source code files and test cases, and among test cases. We found that most test cases cover the integrated behavior of the system, and many test cases cover more than 20 source code files. We also found that 18% of the test cases have dependencies with other test cases. Finally, we documented four dependency-related test smells that we manually uncovered. In short, our study highlights the needs and provides insights on reducing test execution overheads and improving test design.
Future studies should propose specialized techniques to lessen code dependencies. In particular, they should consider the unique test focus in the continuous integration environment and various properties of changed files to improve test impact analysis effectiveness. We provide four practical guidelines for practitioners and the research community: 1) encouraging using test impact analysis in large-scale projects as these systems have a higher benefit of test impact analysis than smaller systems due to longer test execution time; 2) improving test impact analysis by reducing dependencies between tests and source code and by designing modularized tests; 3) refactoring and removing unnecessary inheritance to maintain test independence; 4) utilizing automatic approaches (e.g., the
proposed test smell detection approach in the thesis) to review test design in order to avoid negative effects caused by test smells.
Bibliography
[1] (2019). Parameterized tests in JUnit. https://github.com/junit-team/junit4/wiki/ parameterized-tests. Last accessed May 2019.
[2] (2019). Three reasons why we should not use inheritance in our tests. https://www.petrikainulainen.net/programming/unit-testing/ 3-reasons-why-we-should-not-use-inheritance-in-our-tests/. Last accessed May 2019.
[3] Bell, J., Legunsen, O., Hilton, M., Eloussi, L., Yung, T., and Marinov, D. (2018). Deflaker: Au- tomatically detecting flaky tests. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE), ICSE ’18, pages 433–444.
[4] Beller, M., Gousios, G., and Zaidman, A. (2017). Oops, my tests broke the build: An explorative analysis of travis ci with github. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR), MSR ’17, pages 356–367.
[5] Boslaugh, S. and Watters, P. (2008). Statistics in a Nutshell: A Desktop Quick Reference. In a Nutshell (O’Reilly).
[6] Bouillon, P., Krinke, J., Meyer, N., and Steimann, F. (2007). Ezunit: A framework for associating failed unit tests with potential programming errors. In International Conference on Extreme Programming and Agile Processes in Software Engineering, XP’07, pages 101–104.
[7] Chen, J., Bai, Y., Hao, D., Xiong, Y., Zhang, H., and Xie, B. (2017a). Learning to prioritize test programs for compiler testing. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), ICSE ’17, pages 700–711.
[8] Chen, T.-H., Syer, M. D., Shang, W., Jiang, Z. M., Hassan, A. E., Nasser, M., and Flora, P. (2017b). Analytics-driven load testing: An industrial experience report on load testing of large-scale systems. In 2017 IEEE/ACM 39th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP), pages 243–252.
[9] Chidamber, S. R. and Kemerer, C. F. (1994). A metrics suite for object oriented design. IEEE Transactions on Software Engineering, 20(6), 476–493.
[10] Daka, E. and Fraser, G. (2014). A survey on unit testing practices and problems. In 2014 IEEE 25th International Symposium on Software Reliability Engineering, ISSRE ’14, pages 201–211. [11] Elbaum, S., Rothermel, G., and Penix, J. (2014). Techniques for improving regression testing
in continuous integration development environments. In Proceedings of the 22Nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 235–245. [12] Engström, E., Runeson, P., and Skoglund, M. (2010). A systematic review on regression test
selection techniques. Information and Software Technology, 52(1), 14–30.
[13] Gambi, A., Bell, J., and Zeller, A. (2018). Practical test dependency detection. In 2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST), pages 1–11.
[14] Ghafari, M., Ghezzi, C., and Rubinov, K. (2015). Automatically identifying focal methods under test in unit test cases. In 2015 IEEE 15th International Working Conference on Source Code Analysis and Manipulation (SCAM), pages 61–70.
[15] Gligoric, M., Eloussi, L., and Marinov, D. (2015). Practical regression test selection with dynamic file dependencies. In Proceedings of the 2015 International Symposium on Software Testing and Analysis, pages 211–222.
[16] Grant, S., Cordy, J. R., and Skillicorn, D. B. (2012). Using topic models to support software maintenance. In 16th European Conference on Software Maintenance and Reengineering, pages 403–408.
[17] Grant, S., Cordy, J. R., and Skillicorn, D. B. (2013). Using heuristics to estimate an appropriate number of latent topics in source code analysis. Science of Computer Programming, 78(9), 1663 – 1678.
[18] Hautus, E. (2002). Improving java software through package structure analysis. In The 6th IASTED International Conference Software Engineering and Applications.
[19] Janzen, D. and Saiedian, H. (2005). Test-driven development: Concepts, taxonomy, and future direction. Computer, 38(9), 43–50.
[20] JavaParser (2019). https://javaparser.org/. Last accessed Feb 1 2019.
[21] Jenkins, A. (2019). Apache Jenkins CI test results. https://builds.apache.org/. Last accessed Nov 2019.
[22] LaToza, T. D., Venolia, G., and DeLine, R. (2006). Maintaining mental models: A study of de- veloper work habits. In Proceedings of the 28th International Conference on Software Engineering, ICSE ’06, pages 492–501.
[23] Legunsen, O., Hariri, F., Shi, A., Lu, Y., Zhang, L., and Marinov, D. (2016). An extensive study of static regression test selection in modern software evolution. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 583–594.
[24] Lei, Y. and Andrews, J. H. (2005). Minimization of randomized unit test cases. In 16th IEEE International Symposium on Software Reliability Engineering (ISSRE’05), ISSRE ’05, pages 267– 276.
[25] Leitner, A., Oriol, M., Zeller, A., Ciupa, I., and Meyer, B. (2007). Efficient unit test case minimization. In Proceedings of the Twenty-second IEEE/ACM International Conference on Au- tomated Software Engineering, ASE ’07, pages 417–420.
[26] Li, Z., Harman, M., and Hierons, R. M. (2007). Search algorithms for regression test case prioritization. IEEE Transactions on Software Engineering, 33(4), 225–237.
[27] Li, Z., Chen, T.-H. P., Yang, J., and Shang, W. (2019). DLFinder: Characterizing and detecting duplicate logging code smells. In Proceedings of the 41th International Conference on Software Engineering, ICSE ’19, pages 1–1.
[28] Luo, Q., Hariri, F., Eloussi, L., and Marinov, D. (2014). An empirical analysis of flaky tests. In Proceedings of the 22Nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 643–653.
[29] Luo, Q., Moran, K., Zhang, L., and Poshyvanyk, D. (2019). How do static and dynamic test case prioritization techniques perform on modern software systems? an extensive study on github projects. IEEE Transactions on Software Engineering, 45(11), 1054–1080.
[30] Lutellier, T., Chollak, D., Garcia, J., Tan, L., Rayside, D., Medvidović, N., and Kroeger, R. (2018). Measuring the impact of code dependencies on software architecture recovery techniques. IEEE Transactions on Software Engineering, 44(2), 159–181.
[31] Marijan, D., Gotlieb, A., and Sen, S. (2013). Test case prioritization for continuous regression testing: An industrial case study. In 2013 IEEE International Conference on Software Mainte- nance, ICSM, pages 540–543.
[32] Mei, H., Hao, D., Zhang, L., Zhang, L., Zhou, J., and Rothermel, G. (2012). A static approach to prioritizing junit test cases. IEEE Transactions on Software Engineering, 38(6), 1258–1275.
[33] Memon, A., Gao, Z., Nguyen, B., Dhanda, S., Nickell, E., Siemborski, R., and Micco, J. (2017). Taming google-scale continuous testing. In Proceedings of the 39th International Conference on Software Engineering: Software Engineering in Practice Track, ICSE-SEIP ’17, pages 233–242. [34] Meszaros, G. (2007). XUnit Test Patterns: Refactoring Test Code. Pearson Education. [35] Microsoft (2019). Test impact analysis in visual studio test. https://docs.microsoft.com/
en-us/azure/devops/pipelines/test/test-impact-analysis?view=azure-devops. Last ac- cessed May 2019.
[36] Moonen, L. and Yamashita, A. (2012). Do code smells reflect important maintainability aspects? In 2012 28th IEEE international conference on software maintenance (ICSM), pages 306–315. [37] Muşlu, K., Brun, Y., and Meliou, A. (2013). Data debugging with continuous testing. In Joint
Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE’13, pages 631–634.
[38] Najafi, A., Shang, W., and Rigby, P. C. (2019). Improving test effectiveness using test executions history: an industrial experience report. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pages 213–222.
[39] Orso, A., Shi, N., and Harrold, M. J. (2004). Scaling regression testing to large software systems. In Proceedings of the 12th ACM SIGSOFT Twelfth International Symposium on Foundations of Software Engineering, FSE’12, pages 241–251.
[40] Palomba, F. and Zaidman, A. (2017). Does refactoring of test smells induce fixing flaky tests? In Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution, ICSME 2017, pages 1–12.
[41] Palomba, F. and Zaidman, A. (2019). The smell of fear: on the relation between test smells and flaky tests. Empirical Software Engineering, 24(5), 2907–2946.
[42] Pinto, L. S., Sinha, S., and Orso, A. (2012). Understanding myths and realities of test-suite evolution. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, FSE ’12, pages 33:1–33:11.
[43] Prechelt, L., Unger, B., Philippsen, M., and Tichy, W. F. (2003). A controlled experiment on inheritance depth as a cost factor for code maintenance. Journal of Systems and Software, 65(2), 115 – 126.
[44] Qusef, A., Oliveto, R., and De Lucia, A. (2010). Recovering traceability links between unit tests and classes under test: An improved method. In 2010 IEEE International Conference on Software Maintenance, pages 1–10.
[45] Rausch, T., Hummer, W., Leitner, P., and Schulte, S. (2017). An empirical analysis of build failures in the continuous integration workflows of java-based open-source software. In Proceedings of the 14th International Conference on Mining Software Repositories, MSR ’17, pages 345–355. [46] Rothermel, G. and Harrold, M. J. (1997). A safe, efficient regression test selection technique.
ACM Transactions on Software Engineering Methodology, 6(2), 173–210.
[47] Rothermel, G., Untch, R. H., Chu, C., and Harrold, M. J. (2001). Prioritizing test cases for regression testing. IEEE Transactions on Software Engineering, 27(10), 929–948.
[48] Saff, D. and Ernst, M. D. (2003). Reducing wasted development time via continuous testing. In 14th International Symposium on Software Reliability Engineering, 2003. ISSRE 2003., pages 281–292.
[49] Saha, R. K., Zhang, L., Khurshid, S., and Perry, D. E. (2015). An information retrieval approach for regression test prioritization based on program changes. In Proceedings of the 37th International Conference on Software Engineering, ICSE ’15, pages 268–279.
[50] Shi, A., Yung, T., Gyori, A., and Marinov, D. (2015). Comparing and combining test-suite reduction and regression test selection. In Proceedings of the 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, pages 237–247.
[51] Shore, J. (2004). Fail fast [software debugging]. IEEE Software, 21(5), 21–25.
[52] Spadini, D., Aniche, M., Bruntink, M., and Bacchelli, A. (2017). To mock or not to mock?: An empirical study on mocking practices. In Proceedings of the 14th International Conference on Mining Software Repositories, MSR ’17, pages 402–412.
[53] Spadini, D., Aniche, M., Bruntink, M., and Bacchelli, A. (2018). Mock objects for testing java systems. Empirical Software Engineering, 24(3), 1461–1498.
[54] Spadini, D., Palomba, F., Zaidman, A., Bruntink, M., and Bacchelli, A. (2018). On the relation of test smells to software code quality. In Proceedings of the International Conference on Software Maintenance and Evolution, ICSM ’18, pages 1–12.
[55] Thomas, S. W., Hemmati, H., Hassan, A. E., and Blostein, D. (2014). Static test case prioriti- zation using topic models. Empirical Software Engineering, 19(1), 182–212.
[56] Vahabzadeh, A., Fard, A. M., and Mesbah, A. (2015). An empirical study of bugs in test code. In Proceedings of the 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME), ICSME ’15, pages 101–110.
[57] Vahabzadeh, A., Stocco, A., and Mesbah, A. (2018). Fine-grained test minimization. In Pro- ceedings of the 40th International Conference on Software Engineering, ICSE ’18, pages 210–221. [58] Van Deursen, A., Moonen, L., Van Den Bergh, A., and Kok, G. (2001). Refactoring test code. In Proceedings of the 2nd international conference on extreme programming and flexible processes in software engineering (XP), pages 92–95.
[59] Vasilescu, B., Yu, Y., Wang, H., Devanbu, P., and Filkov, V. (2015). Quality and productivity outcomes relating to continuous integration in github. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, pages 805–816.
[60] Yoo, S. and Harman, M. (2012). Regression testing minimization, selection and prioritization: A survey. Software Testing, Verification and Reliability, 22(2), 67–120.
[61] Yoo, S., Harman, M., and Clark, D. (2013). Fault localization prioritization: Comparing information-theoretic and coverage-based approaches. ACM Transactions on Software Engineer- ing Methodology, 22(3), 19:1–19:29.
[62] Zhang, L., Marinov, D., Zhang, L., and Khurshid, S. (2011). An empirical study of junit test- suite reduction. In Proceedings of the IEEE 22nd International Symposium on Software Reliability Engineering, ISSRE ’11, pages 170–179.
[63] Zhang, L., Hao, D., Zhang, L., Rothermel, G., and Mei, H. (2013). Bridging the gap between the total and additional test-case prioritization strategies. In Proceedings of the 2013 International Conference on Software Engineering, ICSE ’13, pages 192–201.
[64] Zhang, S., Jalali, D., Wuttke, J., Muslu, K., Lam, W., Ernst, M. D., and Notkin, D. (2014). Empirically revisiting the test independence assumption. In Proceedings of the 2014 International Symposium on Software Testing and Analysis, ISSTA 2014, pages 385–396.
[65] Zhu, Y., Shihab, E., and Rigby, P. C. (2018). Test re-prioritization in continuous testing environments. In 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 69–79.