Chair: Katinka Wolter, Freie Universitaet Berlin, Germany
Chair: Tadashi Dohi, Hiroshima University, Japan
Chair: Hélène Waeselynck, LAAS-CNRS, France
Chair: Michael Lyu, The Chinese University of Hong Kong, Hong Kong
Chair: Roberto Natella, Federico II Univ. of Naples, Italy
Chair: Sagar Sen, Simula Research Laboratory, Norway
Chair: Yvan Labiche, Carleton University, Canada
Chair: Karthik Pattabiraman, University of British Columbia, Canada
Chair: Barbara Gallina, Mälardalen University, Sweden
Chair: Bojan Cukic, West Virginia University, USA
Chair: Brendan Murphy, Microsoft Research, UK
Chair: Karama Kanoun, LAAS/CNRS, France
Chair: Nuno Antunes, University of Coimbra, Portugal
Chair: Michael Grottke, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Chair: Domenico Cotroneo, University of Naples Federico II, Italy
Chair: Irena Bojanova, NIST, USA
David Flater. WAP: Unreasonable distributions of execution time under reasonable conditions
Abstract: Reliability and safety often depend on the execution times of software tasks being reasonably consistent and predictable if not strictly bounded in the real-time sense. Since commodity computers are nominally deterministic machines, one might expect the elapsed and CPU time required to execute a fully-defined, deterministic software task with no complications to satisfy that requirement. But experiments at NIST have produced distributions of elapsed and CPU time which are "unreasonable" enough to invalidate the statistical assumptions and confidence intervals that are ordinarily used to summarize results. If this variability is endemic to the modern hardware and operating systems that are deployed, then increasingly invasive methods of controlling it in the lab are of marginal interest. Instead, it needs to be characterized and dealt with. Some approaches are suggested with the aim of starting a discussion.
Zhiqiang Zhang, Tianyong Wu and Jian Zhang. Boundary value analysis in automatic white-box test generation
Abstract: White-box testing is an effective technique for generating test cases to provide high coverage for programs. Generally, we usually select several execution paths using some strategy, and generate a corresponding test case that follows the path. Each execution path corresponds to an input subspace bounded by constraints. Extreme values in these subspaces are very likely to trigger failures since the constraints bounding input subspaces may be faulty. In this paper, we propose a new way of defining the boundaries of comparison predicates in white-box testing, and apply constrained combinatorial testing to reduce the number of test cases while still covering these boundaries. Our approach can guarantee to cover all possible boundaries for each selected execution path, thus achieving high fault coverage for boundary errors.
Julian Thomé, Lwin Khin Shar and Lionel Briand. Security Slicing for Auditing XML, XPath, and SQL Injection Vulnerabilities
Abstract: XML, XPath, and SQL injection vulnerabilities are among the most common and serious security issues for Web applications and Web services. Thus, it is important for security auditors to ensure that the implemented code is, to the extent possible, free from these vulnerabilities before deployment. Current approaches do not focus on supporting the auditing of these vulnerabilities, especially XML and XPath injections. Although existing taint analysis approaches could automatically detect potential vulnerabilities in source code, they tend to generate many false warnings. Furthermore, the analysis traces, i.e. data-flow paths from input sources to security-sensitive operations produced from existing approaches tend to be incomplete or contain a great deal of irrelevant information. Therefore, based on the information provided by these approaches, it is difficult to identify real vulnerabilities and determine their causes. One suitable approach to support security auditing is to compute a program slice for each security-sensitive operation, since it would contain all the information required for performing security audits (Soundness). A limitation, however, is that such slices may also contain information that is irrelevant to security (Precision), thus raising scalability issues for security audits. In this paper, we propose an approach to assist security auditors by defining and experimenting with pruning techniques to reduce original program slices to what we refer to as security slices, which contain sound and precise information. To evaluate the proposed pruning mechanism by using a number of open source benchmarks, we compared our security slices with the slices generated by a state-of-the-art program slicing tool. On average, our security slices are 80% smaller than the original slices, thus suggesting significant reduction in auditing costs. A manual code inspection revealed that the generated security slices contain all the information relevant to security, thus demonstrating that our security slices are, indeed, sound and precise.
Benjamin Jakobus, Alessandro Garcia, Eiji Adachi Barbosa and Carlos José Pereira de Lucena. Contrasting exception handling code across languages: An analysis of 50 open source projects
Matteo Camilli, Angelo Gargantini and Patrizia Scandurra. Specifying and Verifying Real-Time Self-Adaptive Systems
Abstract: Self-adaptive systems autonomously adapt their behavior at run-time to react to internal dynamics and to uncertain and changing environment conditions. Specification and verification of self-adaptive systems are generally very difficult to carry out due to their high complexity, especially when involving time constraints. In the last case, in fact, the correctness of systems depends also on the time associated with events.
This paper introduces a formal approach to specify and verify the self-adaptive behavior of real-time systems. Our specification formalism is based on Time-Basic Petri nets, a particular timed extension of Petri nets. We propose adaptation models to realize self-adaptation with temporal constraints and we adopt a zone-based modeling approach to support separation of concerns during the modeling phase. Zones identified during the modeling phase can be then used as modules (TB Petri subnets) either in isolation, to verify intra-zone properties, or all together, to verify inter-zone properties over the entire system model and check that all the temporal deadlines are met. We illustrate our approach by modeling and verifying a time-critical Gus Burner system that exhibits a self-healing behavior.
Otavio Lemos, Fabiano Cutigi Ferrari, Fabio Silveira and Alessandro Garcia. Experience Report: Can Software Testing Education Lead to More Reliable Code?
Abstract: Software Testing (ST) is one of the least known aspects of software development. At the same time, researchers and practitioners often argue that it demands more than 50% of the costs of a software project. Therefore, testing education is of paramount importance. In fact, the mere exposition to ST knowledge might have an impact on programming skills. In particular, it can encourage the production of more reliable code. Even though this is intuitive, to the best of our knowledge, there are no empirical studies about the impact of ST education on reliable software. Evidence on this matter is important to motivate classical testing education, all the more when academic curricula tend to underemphasize quality assurance topics in favor of design and implementation. On the other hand, it also appears that some techniques that are taught in ST courses, when these are present, are rarely put into practice by the industry (e.g., mutation and data-flow testing). Concerned with these issues, we have conducted a study to investigate the possible effect of ST knowledge on the production of reliable software. Our controlled experiment involved 28 senior-level Computer Science students, 8 auxiliary functions with 92 test cases, and a total of 112 implementations. Results show that implementations delivered after the exposition to ST knowledge are, on average, 20% more reliable than the ones delivered before (a significant difference at the 0.01 level). Moreover, although implementations are more reliable afterwards, they are not significantly larger in terms of lines of code. This indicates that ST knowledge can make developers produce more reliable implementations with approximately the same amount of code.
Hui Xu, Yangfan Zhou, Yu Kang, Michael Lyu and Cuiyun Gao. SpyAware: Investigating the Privacy Leakage Signatures in App Execution Traces
Abstract: A new security problem on smartphones is the wide spread of spyware nested in apps, which occasionally and silently collects user’s private data in the background. The
state-of-the-art work for privacy leakage detection is dynamic taint analysis, which, however, suffers usability issues because it requires flashing a customized system image to track the taint propagation and consequently incurs great overhead. Through a real-world privacy leakage case study, we observe that the spyware behaviors share some common features during execution, which may further indicate a correlation between the data flow of privacy leakage and some specific features of program execution traces. In this work, we examine such a hypothesis using the newly proposed SpyAware framework, together with a customized TaintDroid as the ground truth. SpyAware includes a profiler to automatically profile app executions in binder calls and system calls, a feature extractor to extract feature vectors from execution traces, and a classifier to train and predict spyware executions based on the feature vectors. We conduct an evaluation experiment with 100 apps downloaded from Google Play. Experimental results show that our approach can achieve promising performance with 67.4% accuracy in detecting device id spyware executions and 78.4% in recognizing location spyware executions.
Yiling Lou, Dan Hao and Zhang Lu. Mutation-based Test-Case Prioritization in Software Evolution
Abstract: During software evolution, to assure the software quality, test cases for an early version tend to be reused by its latter versions. As a large number of test cases may aggregate during software evolution, it becomes necessary to schedule the execution order of test cases so that the faults in the latter version may be detected as early as possible, which is test-case prioritization in software evolution.
In this paper, we proposed a novel test-case prioritization approach for software evolution, which first uses mutation faults on the difference between the early version and the latter version to simulate real faults occurred in software evolution, and then schedules the execution order of test cases based on their fault- detection capability, which is defined based on mutation faults. In particular, we present two models on calculating fault-detection capability, which are statistics-based model and probability-based model.
Moreover, we conducted an experimental study and found that our approach with the statistics-based model outperforms our approach with the probability-based model and the total statement coverage-based approach, and slightly outperforms the additional statement-coverage based approach in many cases. Furthermore, compared with the total or additional statement coverage-based approach, our approach with either the statistics- based model or the probability-based model tends to be stably effective when the difference between the source code of the early version and the source code of the latter version is non-trivial.
Sunil Nair, Neil Walkinshaw, Tim Kelly and Jose Luis de La Vara. An Evidential Reasoning Approach for Assessing Confidence in Safety Evidence
Abstract: Safety cases present the arguments and evidence that can be used to justify the acceptable safety of a system. Many secondary factors such as the tools, used, the techniques applied, and the experience of the people who created the evidence, can affect an assessor’s confidence in the evidence cited by a safety case. One means of reasoning about this confidence and its inherent uncertainties is to present a ‘confidence argument’ that explicitly justifies the provenance of the evidence used. In this paper, we propose a novel approach to automatically construct these confidence arguments by enabling assessors to provide individual judgements concerning the trustworthiness and the appropriateness of the evidence. The approach is based on Evidential Reasoning and enables the derivation of a quantified aggregate of the overall confidence. The proposed approach is supported by a prototype tool (EviCA) and has been evaluated using the Technology Acceptance Model.
Daniel Schneider, Yiannis Papadopoulos, Eric Armengaud, Mario Trapp, Marc Zeller and Kai Höfig. Digital Dependability Identities
Abstract: Cyber-Physical Systems (CPS) provide enormous potential for innovation but a precondition for this is that the issue of dependability has been addressed. This paper presents the concept of a Digital Dependability Identity (DDI) of a component or system as foundation for assuring the dependability of CPS. A DDI is an analyzable and potentially executable model of information about the dependability of a component or system. We argue that DDIs must fulfill a number of properties including being universally useful across supply chains, enabling off-line certification of systems where possible, and providing capabilities for in-field certification of safety of CPS. In this paper, we focus on system safety as one integral part of dependability. As a practical demonstration of the concept we present an initial implementation of DDIs in the form of Conditional Safety Certificates (also known as ConSerts). We explain ConSerts and their practical operationalization based on an illustrative example.
Xingyu Zhao, Bev Littlewood, Andrey Povyakalo and David Wright. Conservative Claims about the Probability of Perfection of Software-based Systems
Abstract: In recent years we have become interested in the problem of assessing the probability of perfection of software-based systems which are sufficiently simple that they are “possibly perfect”. By “perfection” we mean that the software of interest will never fail in a specific operating environment. We can never be certain that it is perfect, so our interest lies in claims for its probability of perfection. Our approach is Bayesian: our aim is to model the changes to this probability of perfection as we see evidence of failure-free working. Much of the paper considers the difficult problem of expressing prior beliefs about the probability of failure on demand (pfd), and representing these mathematically. This requires the assessor to state his prior belief in perfection as a probability, and also to state what he believes are likely values of the pfd in the event that the system is not perfect. We take the view that it will be impractical for an assessor to express these beliefs as a complete distribution for pfd. Our approach to the problem has three threads. Firstly we assume that, although he cannot provide a full probabilistic description of his uncertainty in a single distribution, the assessor can express some precise but partial beliefs about the unknowns. Secondly, we assume that in the inevitable presence of such incompleteness, the Bayesian analysis needs to provide results that are guaranteed to be conservative (because the analyses we have in mind relate to critical systems). Finally, we seek to prune the set of prior distributions that the assessor finds acceptable in order that the conservatism of the results is no greater than it has to be, i.e. we propose, and eliminate, sets of priors that would appear generally unreasonable. We give some illustrative numerical examples of this approach, and note that the numerical values obtained for the posterior probability of perfection in this way seem potentially useful (although we make no claims for the practical realism of the numbers we use). We also note that the general approach here to the problem of expressing and using limited prior belief in a Bayesian analysis may have wider applicability than to the problem we have addressed.
Yongwang Zhao, Zhibin Yang, David Sanan and Yang Liu. Event-based Formalization of Safety-critical Operating System Standards - An Experience Report on ARINC 653 using Event-B
Abstract: Standards play a key role in safety-critical systems. Errors in standards could mislead system developer's understanding and introduce bugs into system implementations. In this paper, we present an Event-B formalization and verification for the ARINC 653 standard, which provides a standardized interface between safety-critical real-time operating systems and application software, as well as a set of functionalities aimed to improve the safety and certification process of such safety-critical systems. The formalization is a complete model of ARINC 653, and provides a necessary foundation for the formal development and verification of ARINC 653 compliant operating systems and applications. 6 hidden errors were discovered from the verification using the Event-B deductive reasoning approach.
Xuan Bach D. Le, Tien-Duy B. Le and David Lo. Should Fixing These Failures be Delegated to Automated Program Repair?
Abstract: Program repair constitutes one of the major components of software maintenance that usually incurs a significant cost in software production. Automated program repair is supposed to help in reducing the software maintenance cost by automatically fixing software defects. Despite the recent advances in automated software repair, it is still very costly to wait for repair tools to produce valid repairs of defects. This paper addresses the following question: ``Will an automated program repair technique find a repair for a defect within a reasonable time?''. To answer this question, we build an oracle that could predict whether fixing a failure should be delegated to an automated repair technique. If the repair technique is predicted to take too long to produce a repair, the bug fixing process should rather be assigned to a developer or other appropriate techniques available.
Our oracle is built for genetic-programming-based automated program repair approaches, which have recently received considerable attention due to its capability to automatically fix real-world bugs. Genetic programming searches for a valid repair over a large number of variants that are syntactically mutated from the original program. At an early stage of running the repair tool, we extract a number of features that are potentially related to the effectiveness of the tool. Leveraging advances in machine learning, we process the values of these features to learn a discriminative model that is able to predict whether continuing the genetic programming search would lead to a repair within a desired time limit. We perform experiments on 105 real bugs, in which 53 bugs can be fixed by GenProg, a state-of-the-art genetic programming tool, with various time budgets (less than a minute to 11.5 hours). Our experiments show that our approach can identify effective cases (i.e., bugs for which GenProg produces correct fixes within a short period of time) with a precision, recall, F-measure, and AUC of 72.22\%, 73.58\%, 72.90\%, and 76.40\% respectively.
Martin Shooman. Software Reliability Growth Model Based on Bohrbugs and Mandelbugs
Abstract: Recent research characterizes two types of program errors. The repeatable class, called Bohrbugs, always cause the same error when the same set of inputs is applied. Prior thinking and reliability growth models were based on Bohrbugs. Another class of errors called Mandelbugs are difficult to reproduce even when the same inputs are applied, , . An explanation of Mandelbugs may be an interaction of inputs with computer storage, where comprehensive computer storage represents the past history of inputs to the computer system.. A small percentage of Mandelbugs increase with time, are called aging-related Mandelbugs, and are ignored in the model. This paper formulates a software reliability growth model which can be used to predict the reliability of the software as development progresses and once it is released. Based on prior experience, constant and exponentially decreasing error removal models are assumed. The inputs to the model are error removal data for both Bohrbugs and Mandelbugs, recorded during development testing and mean time to failure data from simulated operational tests. Simplified techniques for estimating the model parameters are developed based on moment estimates. More sophisticated estimates using least squares and maximum likelihood techniques are discussed. The potential for improving the accuracy of reliability estimates by using models that incorporate information on both Bohrbugs and Mandelbugs is discussed with respect to two examples. The power and utility of early reliability predictions, even if approximate, are compared with later more accurate predictions.
Mahdi Eslamimehr and Mohsen Lesani. AtomChase: Directed Search towards Atomicity Violations
Abstract: Atomicity violation is one of the main sources of concurrency bugs. Empirical studies show that the majority of atomicity violations are instances of the three-access pattern, where two accesses to a shared variable by a thread are interleaved by an access to the same variable by another thread. We present a novel approach to atomicity violation detection that directs the execution towards three-access candidates. The directed search technique comprises two parts: execution schedule synthesis and directed concurrent execution that are based on constraint solving and concolic execution. We have implemented this technique in a tool called AtomChase. In comparison to five previous tools on 22 benchmarks with 4.5 million lines of Java code, AtomChase increased the number of three-access violations found by 24%. To prevent reporting false alarms, we confirm the non-atomicity of the found execution traces. We present and prove sufficient conditions for non-atomicity of traces with the three-access pattern. The conditions could recognize the majority of 89% of the real atomicity violations found by AtomChase. Checking these conditions is two orders of magnitude faster than the exhaustive check.
Mostafa Farshchi, Jean-Guy Schneider, Ingo Weber and John Grundy. Experience Report: Anomaly Detection of Cloud Application Operations Using Log and Cloud Metric Correlation Analysis
Abstract: Failure of application operations is one of the main causes of system-wide outages in cloud environments. This particularly applies to DevOps operations, such as backup, redeployment, upgrade, customized scaling, and migration that are exposed to frequent interference from other concurrent operations, configuration changes, and resources failure. However, current practices fail to provide a reliable assurance of correct execution of these kinds of operations. In this paper, we present an approach to address this problem that adopts a regression-based analysis technique to find the correlation between an operation’s activity logs and the operation activity’s effect on cloud resources. The correlation model is then used to derive assertion specifications, which can be used for runtime verification of running operations and their impact on resources. We evaluated our proposed approach on Amazon EC2 with 22 rounds of rolling upgrade operations while other types of operations were running and random faults were injected. Our experiment shows that our approach successfully managed to report for all the 115 random injected faults, with a precision of 92.3%.
Andreas Ulrich and Anjelika Votintseva. Experience Report: Formal Verification and Testing in the Development of Embedded Software
Abstract: The paper summarizes our experiences in applying formal verification using the explicit-state model checker SPIN and combining it with a model-based testing approach to support the validation of embedded software. The discussed example covers a crucial part of the firmware of the faulttolerant programmable logic controller Siemens SIMATIC S7-400H. The chosen approach is outlined and obstacles that were faced during the project are discussed. The paper advocates why formal verification is not suitable as a standalone method in industrial projects. Rather it must be combined with an appropriate validation method such as testing to maximize the benefits from the combination of both approaches. In this case, formal verification complements code or design model reviews, and testing benefits from the availability of correct formal models provided during verification process.
Hiroyuki Okamura and Tadashi Dohi. Towards Comprehensive Software Reliability Evaluation in Open Source Software
Abstract: This paper proposes a unified modeling framework for software reliability assessment in open source project. We combine the classical non-homogeneous Poisson process-based software reliability growth model (SRGM) with a familiar regression scheme called the generalized linear model (GLM), and develop a novel approach not only to estimate software reliability measures, but also to investigate impacts of software metrics on the fault-detection process. The resulting GLM-based SRGM involves the common SRGMs as well as some existing metrics-based SRMs, such as logitic-regression-based SRGM and Poisson-regression-based SRGM, and possesses a great data fitting ability. We also provide an effective parameter estimation algorithm based on the EM (Expectation-Maximixation) principle. Finally, it is shown through numerical experiments with actual open source project data that our approach can estimate software reliability measures with higher accuracy and can feedback the analysis results to improve the current software development projects.
Guanpeng Li, Karthik Pattabiraman, Chen-Yong Cher and Pradip Bose. An Application-specific Checkpointing Technique for Minimizing Checkpoint Corruption
Abstract: Checkpointing is widely deployed in computer systems to recover from failures due to both hardware and software errors. However, as faults propagate, checkpoints may become corrupted by saving erroneous states and make errors unrecoverable, especially at aggressive checkpoint frequencies. In this paper, we proposed a technique that automatically analyzes a given program to guide checkpoint strategies in order to minimize checkpoint corruptions. To understand checkpoint corruptions, we first perform a large-scale fault injection study across ten benchmark applications. We then classify checkpoint corruptions, and comprehensively characterize the fault propagations leading to these corruptions. Leveraging these findings, we build RECOV, a compiler-based tool that automatically identifies the program locations that have lowest density of fault propagation for placing checkpoints, and combines it with low-overhead protection techniques. Our experimental results shows that RECOV can eliminate nearly 92% of the checkpoint corruptions with about 5% performance overhead. RECOV reduces the unavailability of the system by 8.25 times even at very aggressive checkpoint frequencies, showing that it is effective in practice.
Jeff Rasley, Eleni Gessiou, Tony Ohmann, Yuriy Brun, Shriram Krishnamurthi and Justin Cappos. Detecting Latent Cross-Platform API Violations
Robert J. Walls, Yuriy Brun, Marc Liberatore and Brian Neil Levine. Discovering Specification Violations in Networked Software Systems
Abstract: Publicly released software implementations of network protocols often have latent specification violations, or bugs. We present Ape, a technique that automatically explores program behavior to identify potential specification violations. Ape overcomes the challenge of exploring the very large space of behavior by dynamically inferring precise models of behavior, stimulating unobserved behavior likely to lead to violations, and refining the behavior models with the new, stimulated behavior. Ape can (1) discover new specification violations, (2) verify that violations are removed, (3) identify related violations in other versions and implementations of the protocols, and (4) generate tests. Ape works on binaries and requires a lightweight description of the protocol's network messages and a violation characteristic. We use Ape to rediscover the known heartbleed bug in OpenSSL, and discover one unknown bug and two unexpected uses of three popular BitTorrent clients. Manual inspection of Ape-produced artifacts reveals four additional, previously unknown specification violations in OpenSSL and μTorrent.
João Matos, João Garcia and Paolo Romano. Enhancing Privacy Protection in Fault Replication Systems
Abstract: Error reporting systems are valuable mechanisms for enhancing software reliability. Unfortunately, though, conventional error reporting systems are prone to leak sensitive user information, raising strong privacy concerns.
In this work we introduce RESPA (Recursive Shortest Path-based Anonymizer), a system for generating failure-reproducing, yet anonymized, error reports. R E S PA relies on symbolic execution, executed at client side, in order to identify alternative failure-inducing paths in the program’s execution graph, and derive the logical conditions, called path conditions, that define the set of user inputs reproducing these executions. Anonymized failure-inducing inputs are then synthesized using any (random) solution satisfying the path conditions.
The search for alternative failure-inducing executions is based on an innovative algorithm that exploits three key ideas: i) RESPA relies on binary search to determine, in an efficient way, which portions of the original execution should be preserved in the alternative one; ii) in order to identify alternative execution paths with low information leakage, R E S PA explores the execution graph by leveraging on the Djikstra’s shortest path algorithm with information leakage as the distance metric; iii) RESPA ensures provable non-reversibility of the alternative inputs it produces via a recursive technique, which anonymizes the alternative inputs found after a run of the algorithm till a minimum in the achievable information leakage is found.
We show via an evaluation based on 6 large, widely used applications and real bugs that RESPA reduces information leakage up to 99.76%, and on average by 93.92%. This corresponds to an average increase in privacy by 40% with respect to state of the art systems, with gains that extend up to 2×.
Sandro Morasca. Classifying Faulty Modules with an Extension of the H-index
Abstract: Background. Fault-proneness estimation models provide an estimate of the probability that a software module is faulty. These models can be used to classify modules as faulty or non-faulty, by using a fault-proneness threshold: modules whose fault-proneness exceeds the threshold are classified as faulty and the others as non-faulty. However, the selection of the threshold value is to some extent subjective, and different threshold values may lead to very different results in terms of classification accuracy.
Objective. We introduce and empirically validate a new approach to setting thresholds, based on an extension of the H-index defined in Bibliometrics, called the Fault-proneness H-Index. We define and use this extension to identify the most fault-prone modules, which are candidates for intensive Verification & Validation activities.
Method. We carried out the empirical validation on two data sets with different faultiness characteristics hosted on the PROMISE repository, by using T-times K-fold cross validation. We computed Precision, Recall, the F-measure, and a weighted version of the F-measure for the results obtained with our approach and compared them with the values obtained with other approaches based on several thresholds.
Results. In the empirical validation, our approach provides better classification results than those based on most other thresholds, according to some classification accuracy indicators, in a statistically significant way.
Conclusions. Our approach seems to be able to contribute to accurately classifying modules as faulty or non-faulty.
Xin Xia, David Lo, Pavneet Singh Kochhar, Zhenchang Xing, Xinyu Wang and Shanping Li. Experience Report: An Industrial Experience Report on Test Outsourcing Practices
Abstract: Nowadays, many companies contract their testing functionalities out to third-party IT outsourcing companies. This process referred to as test outsourcing is common in the industry, yet it is rarely studied in the research community. In this paper, to bridge the gap, we performed an empirical study on test outsourcing with 10 interviewees and 140 survey respondents. We investigated various research questions such as the types, the process, and the challenges of test outsourcing, and the differences between test outsourcing and in-house testing. We found customer satisfaction, tight project schedule, and domain unfamiliarity are the top-3 challenges faced by the testers. We also found there are substantial differences between test outsourcing and in-house testing. For example, most of the test outsourcing projects focused on functional test, and rarely did unit test. Also, due to privacy policies of client companies, test outsourcing is performed mainly on the binary distributions of the projects, and rarely the testers can touch the source code. Our findings have implications for future research. For instance, as a starting point, researchers can create automated program comprehension tools which work on binary distributions of projects to help testers better design effective test cases.
Jia-Ju Bai, Yu-Ping Wang, Hu-Qiu Liu and Shi-Min Hu. Automated Resource Release in Device Drivers
Abstract: Device drivers require system resources to control hardware and provide fundamental services for applications. The acquired resources must be explicitly released by drivers. Otherwise, these resources will never be reclaimed by the operating system, and they are not available for other programs any more, causing hard-to-find system problems. We study on Linux driver mailing lists, and find many applied patches handle improper resource-release operations, especially in error handling paths. In order to improve current resource management and avoid resource-release omissions in device drivers, we propose a novel approach named AutoRR, which can automatically and safely release resources based on specification-mining techniques. During execution, we maintain a resource-state table by recording the runtime information of function calls. If the driver fails to release acquired resources during execution, AutoRR will report violations and call corresponding releasing functions with the recorded runtime information to release acquired resources. To fully and safely release acquired resources, a dynamic analysis of resource dependency and allocation hierarchy is also performed, which can avoid dead resources and double frees. AutoRR works in both normal execution and error handling paths for reliable resource management. We implement AutoRR with LLVM, and evaluate it on 8 Ethernet drivers in Linux 3.17.2. The evaluation shows that the overhead of AutoRR is very low, and it has successfully fixed 18 detected resource-release omission violations without side effects. Our work shows a feasible way of using specification-mining results to avoid related violations.
Shuai Wang, Shaukat Ali, Tao Yue and Marius Liaaen. UPMOA: An Improved Search Algorithm to Support User-Preference Multi-Objective Optimization
Abstract: Multi-objective search algorithms (e.g., non-dominated sorting genetic algorithm II (NSGA-II)) have been applied extensively to solve various multi-objective optimization problems in software engineering such as problems in testing and requirements phases. However, existing multi-objective algorithms usually treat all the objectives with equivalent priorities and do not provide a mechanism to reflect various user preferences when guiding search towards finding optimal solutions. The need to have such a mechanism was observed in one of our industrial projects on applying search algorithms for test optimization of a product line of Videoconferencing Systems (VCSs) called Saturn, where user preferences must be incorporated into optimization objectives, based on test requirements at Cisco. To address this, we propose to extend the most commonly-used multi-objective search algorithm NSGA-II that has also shown promising results in several existing works with user preferences and name it as User-Preference Multi-Objective Optimization Algorithm (UPMOA). Its core idea is to extend NSGA-II with a user preference indicator p, based on existing weight assignment strategies.
We empirically evaluated UPMOA with two industrial problems focusing on optimizing the test execution system for Saturn in Cisco. To assess the scalability of UPMOA, in total 1000 artificial problems were created inspired from the two industrial problems. The evaluation includes two aspects: 1) Three weight assignment strategies together with UPMOA were empirically evaluated to identify a best weight assignment strategy for p. Results show that the Uniformly Distributed Weights (UDW) strategy can assist UPMOA in achieving the best performance; 2) UPMOA has been compared (via a comprehensive empirical study) with three representative multi-objective search algorithms (e.g., NSGA-II) and results show that UPMOA can significantly outperform the others and has the ability to solve a wide range of problems with varying complexity.
Cuiyun Gao, Baoxiang Wang, Pinjia He, Jieming Zhu, Yangfan Zhou and Michael R. Lyu. PAID: Prioritizing App Issues for Developers by Tracking User Reviews Over Versions
Abstract: Analyzing user reviews has been an indispensable part during the bug-fixing and version-modification process for app developers, attracting various research efforts in mining user reviews to discover app issues, e.g., low speed, much memory space taken, privacy leakage, etc. Existing exploration of app reviews depends on static collections and therefore ignores the fact that user reviews vary with app versions. Furthermore, the previous studies suffer from the problem that they require the developers consume much time on filtering out meaningless comments and digesting the informative textual data, which would be extremely cumbersome when we deal with a large quantity of reviews.
In the paper, we target towards designing a framework for Prioritizing App Issues for Developers (PAID) with minimal manual power and good accuracy. Intuitively, issues presented in the level of phrase, i.e., a couple of consecutive words, can be more quickly understood by observers than in long sentences. Hence, we aim at recommending phrase-level app issues to developers by tracking reviews over release versions. Finally, to assist developers in better comprehending the app properties, we involve ThemeRiver to visualize the analytical results reported by our method. In addition, when developers want to obtain a deep insight of certain problems, they can also check the most related reviews produced by our framework. Different from traditional evaluation methods such as labeling useful comments or adopting the app discussion forum, we exploit the first-hand information from developers, i.e., app changelogs, to measure the performance on app issue prioritization. In the experiment, we analyze 18 apps with millions of user reviews and 117 app versions. Results show that the prioritized issues match the official changelogs with a high precision.
Zebao Gao, Chunrong Fang and Atif Memon. Pushing the Limits on Automation in GUI Regression Testing
Abstract: Even though there has been much work on automated regression testing of software, full automation continues to elude us. There are two significant impediments to full automation: obtaining (1) test inputs and (2) test oracle. We now push the envelope on full automation of regression testing by fully automatically generating test cases as well as the test oracle, completely eliminating manual work. This allows us to study issues of false positives/negatives in test failure; we provide ways to minimize these. The results of our empirical studies suggest that our approach of using workflow-based test cases, derived from the software under test, may help empower the end user to perform regression testing before applying software patches and updates.
Lijie Xu, Wensheng Dou, Feng Zhu, Chushu Gao, Jie Liu, Hua Zhong and Jun Wei. A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications
Abstract: Out of memory (OOM) errors occur frequently in data-intensive applications that run atop distributed data-parallel frameworks, such as MapReduce and Spark. In these applications, the memory space is shared by the framework and user code. Since the framework hides the details of distributed execution, it is challenging for users to pinpoint the root causes and fix these OOM errors.
This paper presents a comprehensive characteristic study on 123 real-world OOM errors in Hadoop and Spark applications. Our major findings include: (1) Most errors (64%) are caused by memory－consuming user code, which carelessly processes unexpected large data or generates large computing results in memory. (2) 37% of the errors are caused by the unexpected large data, such as large data partition, hotspot key, and large key/value record. (3) 12% of the errors are caused by the large data stored in the framework. This indicates that it is challenging for users to configure the right memory quota to balance the memory usage of the framework and user code. (4) There are three common fix patterns (used in 34% errors), namely changing the memory/dataflow-related configurations, dividing runtime data, and optimizing user code logic. Our findings inspire us to propose potential solutions to avoid the OOM errors: (1) The frameworks can provide dynamic memory management mechanisms to balance the memory usage of the framework and user code at runtime. (2) The frameworks can provide users with memory+disk data structures, since accumulating large computing results in in-memory data structures is a common cause (15% errors).
Tingshan Huang, Nagarajan Kandasamy and Harish Sethu. Anomaly Detection in Computer Systems using Compressed Measurements
Abstract: Online performance monitoring of computer systems incurs a variety of costs: the very act of monitoring a system interferes with its performance and if the information is transmitted to a monitoring station for analysis and logging, this consumes network bandwidth and disk space. Compressive sampling-based schemes can help reduce these costs on the local machine by acquiring data directly from the system in a compressed form, and in a computationally efficient way. This paper focuses on reducing the computational cost associated with recovering the original signal from the transmitted sample set at the monitoring station for anomaly detection. Towards this end, we show that the compressed samples preserve, in a related form, properties such as mean, variance, as well as correlation between data points in the original full-length signal. We then use this result to detect changes in the original signal that could be indicative of an underlying anomaly such as abrupt changes in magnitude and gradual trends without the need to recover the full-length data. We illustrate the usefulness of our approach via case studies involving IBM?s Trade Performance Benchmark using signals from the disk and memory subsystems. Experiments indicate that abrupt changes can be detected using a compressed sample size of 25% with a hit rate of 95% for a fixed false alarm rate of 5%; trends can be detected within a confidence interval of 95% using a sample size of only 6%.
Rahul Gopinath, Mohammad Amin Alipour, Iftekhar Ahmed, Carlos Jensen and Alex Groce. How hard does mutation analysis have to be, anyway?
Abstract: Mutation analysis is considered the best method for measuring the adequacy of test suites. However, the number of test runs required to do a full mutation analysis grows faster than project size, which makes it an unrealistic metric for most real-world software projects, which often have more than a million lines of code. It is in projects of this size, however, that developers most need a method for evaluating the efficacy of a test suite. Another impediment to the adoption of mutation analysis is the equivalent mutant problem, which makes mutation scores less informative. Various strategies have been proposed to deal with the explosion of mutants, including forms of operator selection and x% stratified sampling based on operators and program elements. However, these strategies at best reduce the number of mutants required to a fraction of overall mutants, which still grows with program size. Running, e.g., 5% of all mutants of a two million LOC program usually requires analyzing well over 100,000 mutants. Similarly, while various approaches have been proposed to tackle equivalent mutants, none completely eliminate the problem, and the fraction of equivalent mutants remaining is hard to estimate, often requiring manual analysis of equivalence.
In this paper, we provide both theoretical analysis and empirical evidence that a small constant sample of mutants yields statistically similar results to running a full mutation analysis, regardless of the size of the program being analyzed or the similarity between mutants. We show that a similar approach, using a constant sample of inputs can estimate the degree of stubbornness in mutants remaining to a high degree of statistical confidence. We also provide a mutation analysis framework for Python that incorporates the analysis of stubbornness of mutants.
David Threm, Liguo Yu, Srini Ramaswamy and Sithu Sudarsan. Using Normalized Compression Distance to Measure the Evolutionary Stability of Software Systems
Abstract: Software evolutionary stability has become an important issue in software maintenance and evolution. It is directly related to software reusability, maintainability and evolvability. In reported prior research, software evolutionary stability has been measured with architecture-level metrics, including reference points and program-level metrics, such as number of modules and number of lines of code. In this paper, information-level metrics based on Kolmogorov complexity are used to measure the differences between versions of software products. Using normalized compression distance, various evolutionary stability metrics of software artifacts are defined: version stability, branch stability, structure stability and aggregate stability. Case studies are performed on two open-source products, Apache HTTP server and Apache Ant build tool. The results from this study show that information-level evolutionary stability metrics can be used together with architecture-level metrics and program-level metrics to assist monitoring the software evolution process, as well as identifying stable or unstable software artifacts.
Jie Wang, Wensheng Dou, Chushu Gao and Jun Wei. Fast Reproducing Web Application Errors
Daniel Vecchiato and Eliane Martins. Experience Report: A Field Analysis of User-defined Security Configurations of Android Devices
Abstract: The wide spreading of mobile devices, such as smartphones and tablets, and their always-advancing capabilities, ranging from taking photos to accessing banking accounts, makes them an attractive target for attackers. This, together with the fact that users frequently store critical personal information in such devices and that many organizations currently allow employees to use their personal devices to access the enterprise information infrastructure and applications, turns assessing the security of mobile devices into a key issue. In order to understand the common misconfiguration problems, this practical experience report presents a field analysis of 41 user-defined security settings of more than 500 Android devices. Findings suggest that most users neglect basic security configurations such as login mechanisms and that manufacturers should rethink their policies in terms of the security settings that can be modified by the users. The paper also proposes concrete security countermeasures to mitigate some of the identified misconfigurations.
Domenico Cotroneo, Luigi De Simone, Francesco Fucci and Roberto Natella. MoIO: Run-Time Monitoring for I/O Protocol Violations in Storage Device Drivers
Abstract: Bugs affecting storage device drivers include the so-called *protocol violation bugs*, which silently corrupt data and commands exchanged with I/O devices. Protocol violations are very difficult to prevent, since testing device driver is notoriously difficult. To address them, we present a monitoring approach for device drivers (MoIO) to *detect I/O protocol violations* at run-time. The approach infers a model of the interactions between the storage device driver, the OS kernel, and the hardware (the *device driver protocol*) by analyzing execution traces. The model is then used as a reference for detecting violations in production. The approach has been designed to have a low overhead and to overcome the lack of source code and protocol documentation. We show that the approach is feasible and effective by applying it on the SATA/AHCI storage device driver of the Linux kernel, and by performing fault injection and long-running tests.
Alejandra Ruiz, Garazi Juez Uriagereka, Philipp Schleiss and Gereon Weiss. A safe generic adaptation mechanism for smart cars
Abstract: Today’s vehicles are evolving towards smart cars, which will be able to drive autonomously and adapt to changing contexts. Incorporating self-adaptation in these cyber-physical systems (CPS) promises great benefits, like cheaper software-based redundancy or optimized resource utilization. As promising as these advantages are, a respective proportion of a vehicle’s functionality poses as safety hazards when confronted with faults and failure situations. Consequently, a system’s safety has to be ensured with respect to the availability of multiple software applications, thus often resulting in redundant hardware resources, such as dedicated backup control units. To benefit from self-adaptation by means of creating efficient and safe systems, this work introduces a safety concept in form of a generic adaptation mechanism (GAM). In detail, this generic adaptation mechanism is introduced and analyzed with respect to generally known and newly created safety hazards, in order to determine a minimal set of system properties and architectural limitations required to safely perform adaption. Moreover, the approach is applied to the ICT architecture of a smart e-car, thereby highlighting the soundness, general applicability, and advantages of this safety concept and forming the foundation for the currently ongoing implementation of the GAM within a real prototype vehicle.
Christoph Schulze, Mikael Lindvall, Sigurthor Bjorgvinsson and Robert Wiegand. Model Generation to support Model-based Testing Applied on the NASA DAT Web-application --an Experience Report
Abstract: Model-based Testing (MBT), where a model of the system under test’s (SUT) behavior is used to automatically generate executable test cases, is a promising and versatile testing technology. Nevertheless, adoption of MBT technologies in industry is slow and many testing tasks are performed via manually created executable test cases (i.e. test programs such as JUnit). In order to adopt MBT, testers must learn how to construct models and use these models to generate test cases, which might be a hurdle. An interesting observation in our previous work is that the existing manually created test cases often provided invaluable insights for the manual creation of the testing models of the system. In this paper we present an approach that allows the tester to first create and debug a set of test cases. When the tester is happy with the test cases, the next step is to automatically generate a model from the test cases. The generated model is derived from the test cases, which are actions that the system can perform (e.g. a button clicks) and their expected outputs in form of assert statements (e.g. assert data entered). The model is a Finite State Machine (FSM) model that can be employed with little or no manual changes to generate additional test cases for the SUT. We successfully applied the approach in a feasibility study to the NASA Data Access Toolkit (DAT), which is a web-based GUI. One compelling finding is that the test cases that were generated from the automatically generated models were able to detect issues that were not detected by the original set of manually created test cases. We present the findings from the case study and discuss best practices for incorporating model generation techniques into an existing testing process.
Andreas Löfwenmark and Simin Nadjm-Tehrani. Experience Report: Memory Accesses for Avionic Applications and Operating Systems on a Multi-Core Platform
Abstract: The deployment of multi-core platforms in safety- critical avionic applications is hampered by the lack of means to ensure predictability when processes running on different cores can create interference effects, affecting worst-case execution time, due to shared memory accesses. One way to restrict these interferences is to allocate a budget for different processes prior to run-time and to monitor the adherence to this budget during run-time. While earlier works in adopting this approach seem promising, they focus on application level (user mode) accesses to shared memory and not the operating system accesses. In this paper we construct experiments for studying a multi-core platform running an ARINC 653 compliant operating system, and measure the impact of both application processes and operating system (supervisor mode) activities. In particular, as opposed to earlier works that considered networking applications, we select four avionic processes that exhibit different memory access patterns, namely, a navigation process, a matrix multiplication process, a math library process and an image processing one. The benchmarking on a set of avionic-relevant application processes shows that (a) the potential interference by the operating system cannot be neglected when allocating budgets that are to be monitored at run-time, and (b) the bounds for the allowed number of memory accesses should not always be based on the maximum measured count during profiling, which would lead to overly pessimistic budgets.
Mona Erfani Joorabchi, Mohamed Ali and Ali Mesbah. Detecting Inconsistencies in Multi-Platform Mobile Apps
Abstract: Due to the increasing popularity and diversity of mobile devices, developers write the same mobile app for different platforms. Since each platform requires its own unique environment in terms of programming languages and tools, the teams building these multi-platform mobile apps are usually separate. This in turn can result in inconsistencies in the apps developed. In this paper, we propose an automated technique for detecting inconsistencies in the same native app implemented for iOS and Android platforms. Our technique (1) automatically instruments and traces the app on each platform for given execution scenarios, (2) infers abstract models from each platform execution trace, (3) compares the models using a set of code-based and GUI-based criteria to expose any discrepancies, and finally (4) generates a visualization of the models, highlighting any detected inconsistencies. We have implemented our approach in a tool called CHECKCAMP. CHECKCAMP can help mobile developers in testing their apps across multiple platforms. An evaluation of our approach with a set of 14 industrial and open-source multi-platform native mobile app-pairs indicates that CHECKCAMP can correctly (1) extract and abstract the models of mobile apps from multiple platforms, (2) infer likely mappings between the generated models based on different comparison criteria, and (3) detect inconsistencies at multiple levels of granularity.
Michael Grottke, Alberto Avritzer, Daniel Sadoc Menasché, Javier Alonso, Leandro Aguiar and Sara Alvarez. WAP: Models and Metrics for the Assessment of Critical-Infrastructure-Targeted Malware Campaigns
Abstract: Ensuring system survivability in the wake of advanced persistent threats is a big challenge that the security community is facing to ensure critical infrastructure protection. In this paper, we define metrics and models for the assessment of coordinated massive malware campaigns targeting critical infrastructure sectors. We define a three-phase high-level threat model that represents an abstraction of advanced persistent threats. We introduce a new approach for the implementation of countermeasures that takes advantage of software rejuvenation and quarantine, offering the potential to increase the survivability chances of coordinated attacks to critical infrastructure sectors. We apply our proposed models and metrics to the evaluation of the effectiveness of these new threat responses to counter the dissemination of malware.
First, we develop an analytical model that allows us to capture the effect of neighborhood on different metrics (e.g., infection probability, contagion probability). Then, we assess the impact of putting operational but possibly infected nodes into quarantine. Quarantine decreases contagion probability, but it may reduce node capacity. Finally, we study the implications of scanning nodes for early detection of malware (e.g., worms), accounting for false positives and false negatives.
We evaluate our methodology using a small four-node topology and a more realistic 14-node topology. We find that for the topologies and parameters evaluated in our empirical studies, malware infections can be effectively contained by using quarantine and appropriate rates of scanning for soft impacts.
Tanzeem Bin Noor and Hadi Hemmati. A similarity-based approach for test case prioritization using historical failure data
Abstract: Test case prioritization is a crucial element in software quality assurance in practice, specially, in the context of regression testing. Typically, test cases are prioritized in a way that they detect the potential faults earlier. The effectiveness of test cases, in terms of fault detection, is estimated using quality metrics, such as code coverage, size, and historical fault detection. Prior studies have shown that previously failing test cases are highly likely to fail again in the next releases, therefore, they are highly ranked, while prioritizing. However, in practice, a failing test case may not be exactly the same as a previously failed test case, but quite similar, e.g., when the new failing test is a slightly modified version of an old failing one to catch an undetected fault. In this paper, we define a class of metrics that estimate the test cases quality using their similarity to the previously failing test cases. We have conducted several experiments with five real world open source software systems to evaluate the effectiveness of these quality metrics. The results of our study show that our proposed similarity-based quality measure is significantly more effective for prioritizing test cases compared to existing test case quality measures.
Peter Tröger, Lena Feinbube and Matthias Werner. What activates a bug? A refinement of the Laprie terminology model
Abstract: Dependability modeling and analysis relies on the usage of an unambiguous terminology model, in order to avoid misunderstandings and wrong interpretations. In both academia and industry, the most widely accepted approach is the fault-error-failure model by Laprie et al.
Using this model for describing software faults and errors can help to establish a common vocabulary, but may lead to ambiguities, since the original `fault activation' term relates both to fault-enabling states and the execution of incorrect code. The clear separation of both issues, however, is crucial when talking about software defects.
One traditional solution for this problem is the definition of a dedicated terminology model. In this paper, we propose an alternative approach: The backward-compatible refinement of the original Laprie model for software defects.
Our proposal encourages a more detailed description of fault activation conditions, while keeping the widely accepted basic vocabulary. Our refinement relies on an abstract system model for software-based systems. Both the traditional and our revised version of the Laprie terminology are presented as failure automata, in order to explain relevant events and states in an unambiguous way. The paper discusses how well-known concepts still fit in, and lists some examples for real-world bugs and their activation conditions.
Jean-Marie Mottu, Sagar Sen, Juan Cadavid and Benoit Baudry. Discovering Model Transformation Pre-conditions using Automatically Generated Test Models
Abstract: Specifying a model transformation is challenging as it must be able to give a meaningful output for any input model in a possibly infinite modelling domain. Transformation pre-conditions constrain the input domain by rejecting input models that are not meant to be transformed by a model transformation.
This paper presents a systematic approach to discover such pre-conditions when it is hard for a human developer to foresee complex graphs of objects that are not meant to be transformed.
The approach is based on systematically generating a finite number of test models using our tool, Pramana to first cover the input domain based on input domain partitioning.
Tracing a transformation’s execution reveals why some test models are not correctly transformed.
Using a benchmark transformation from simplified UML class diagram models to RDBMS models we discover new pre-conditions that were not initially specified.
Emily Kowalczyk, Atif Memon and Myra Cohen. Piecing Together App Behavior from Multiple Artifacts: A Case Study
Abstract: Recent research in mobile software analysis has begun to combine information extracted from an app’s source code and marketplace webpage to identify correlated variables and validate an app’s quality properties such as its intended behavior, trust or suspiciousness. Such work typically involves analysis of one or two artifacts such as the GUI text, user ratings, app description keywords, permission requests, and sensitive API calls. However, these studies make assumptions about how the various artifacts are populated and used by developers, which may lead to a gap in the resulting analysis. In this paper, we take a step back and perform an in-depth study of 14 popular apps from the Google Play Store. We have studied a set of 16 different artifacts for each app, and conclude that the output of these must be pieced together to form a complete understanding of the app’s true behavior. We show that (1) developers are inconsistent in where and how they provide descriptions; (2) each artifact alone has incomplete information; (3) different artifacts may contain contradictory pieces of information; (4) there is a need for new analyses, such as those that use image processing; and (5) without including analyses of advertisement libraries, the complete behavior of an app is not defined. In addition, we show that the number of downloads and ratings of an app does not appear to be a strong predictor of overall app quality, as these are propagated through versions and are not necessarily indicative of the current app version’s behavior.
Xianglong Kong, Lingming Zhang, W. Eric Wong and Bixin Li. Experience Report: How Do Techniques, Programs, and Tests Impact Automated Program Repair?
Abstract: Automated program repair can save tremendous manual efforts in software debugging. Therefore, a huge body of research efforts have been dedicated to design and implement automated program repair techniques. Among the existing program repair techniques, genetic-programming-based techniques have shown promising results. Recently, researchers found that random-search-based and adaptive program repair techniques can also produce effective results. In this work, we performed an extensive study for four program repair techniques, including genetic-programming-based, random-search-based, brute-force-based and adaptive program repair techniques. Due to the extremely large time cost of the studied techniques, the study was performed on 129 bugs from 9 small to medium sized programs. In the study, we further investigated the impacts of different test suites on effectiveness and efficiency of program repair techniques. We also computed the false positive rates and discussed the ratio of the explored search space to the whole search space for each studied technique.
Dongjiang You, Sanjai Rayadurgam, Michael Whalen, Mats P.E. Heimdahl and Gregory Gay. Efficient Observability-based Test Generation by Dynamic Symbolic Execution
Abstract: Structural coverage metrics have been widely used to measure test suite adequacy as well as to generate test cases. In previous investigations, we have found that the fault-finding effectiveness of tests satisfying structural coverage criteria is highly dependent on program syntax -- even if the faulty code is exercised, its effect may not be observable at the output. To address these problems, observability-based coverage metrics have been defined. Specifically, observable MC/DC (OMC/DC) is a criterion that appears to be both more effective at detecting faults and more robust to program restructuring than MC/DC. Traditional counterexample-based test generation for OMC/DC, however, can be infeasible on large systems. In this study, we propose an incremental test generation approach that combines the notion of observability with dynamic symbolic execution. We evaluated the efficiency and effectiveness of our approach using seven systems from the avionics and medical device domains. Our results show that the incremental approach requires much lower generation time, while achieving even higher fault finding effectiveness compared with regular OMC/DC generation.
Lucas Amorim, Evandro Costa, Nuno Antunes, Baldoino Fonseca, and Márcio Ribeiro Experience Report: Evaluating the Effectiveness of Decision Trees for Detecting Code Smells
Abstract: Developers continuously maintain software systems to adapt to new requirements and to fix bugs. Due to the complexity of tasks and the time-to-market, developers make poor implementation choices, also known as code smells. Studies indicate that code smells hinder comprehensibility, and possibly increase change- and fault-proneness. Therefore, they must be identified to enable the application of corrections. The challenge is that the inaccurate definitions of code smells make developers disagree whether a piece of code is a smell or not, consequently, making difficult creation of a universal detection solution able to recognize smells in different software projects. Several works have been proposed to identify code smells but they still report inaccurate results and use techniques that do not present to developers a comprehensive explanation how these results have been obtained. In this experimental report we study the ability of the Decision Tree algorithm to recognize code smells. For this, it was applied in a dataset containing 4 open source projects and the results were compared with the manual oracle, with existing detection approaches and with other machine learning algorithms. The results showed that the approach was able to effectively learn rules for the detection of the code smells studied. The results were even better when genetic algorithms are used to pre-select the metrics to use.
Jeff Anderson, Hyunsook Do and Saeed Salem. Experience Report: Mining Test Results for Reasons Other Than Functional Correctness
Abstract: Regression testing is an important part of software development projects, and it is used to ensure software quality. Traditionally a regression test focuses primarily on functional correctness of a modified program and is examined only when it fails, meaning it found a fault that would have otherwise been undetected. For certain application domains, regression tests for non-functional quality aspects such as performance, security, and usability could be just as important. However, those regression tests are much more costly and difficult to create, and thus many applications lack adequate non-functional regression test coverage. This adds risk of regressions in these areas as changes are made over time. In this research, we propose using metrics from passing test cases to predict quality aspects of the software beyond the traditional focus of regression tests. Through an industrial case study, we show that metrics such as test response time from functional regression tests are good predictors of which product areas are likely to contain certain types of non-functional performance faults. Furthermore, we show that this prediction can be improved through environmental perturbation such as the use of synthetic volume datasets or data size variation.
Marllos Prado, Eric Verbeek, Margaret-Anne Storey and Auri Marcelo Rizzo Vincenzi. WAP: Cognitive Aspects in Unit Testing: the Hunting Game and the Hunter's Perspective
Abstract: Humans are hunters and love the chase---they hunt for food, they hunt for bugs in software. In the last decade, testing research has gone deeper and broader to help with the challenging task of catching bugs. Much of the literature approaches the problem from a theoretical-technical perspective and is often oriented to automated solutions. Yet, there is a gap between industry testing problems and research testing solutions. We take a different perspective and consider the human component as a major part of the solution for practical testing problems. Many of these human-related issues are reported in academic surveys of practitioners. We highlight the importance of human factors in testing by introducing a hunting metaphor. We also bring attention to evidence on cognitive support demands in unit test practices. An initial framework is proposed as an effort to bring understanding of cognitive support demands, and provide direction for further research on unit testing tools which support tester skill improvement.
Nariman Mirzaei, Hamid Bagheri, Riyadh Mahmood and Sam Malek. SIG-Droid: Automated System Input Generation for Android Applications
Abstract: Pervasiveness of smartphones and the vast number of corresponding apps is stressing the need for applicable automated software testing techniques. A wealth of research has been focused on either unit or GUI testing of smartphone apps, but little on automated support for end-to-end system testing. This paper presents SIG-Droid, a framework for system testing of Android apps, backed with automated program analysis to extract app models and symbolic execution of source code guided by such models for obtaining test inputs that ensure covering each reachable branch in the program. SIG-Droid leverages two automatically extracted models: Interface Model and Behavior Model. The Interface Model is used to find values that an app can receive through its interfaces. Those values are then exchanged with symbolic values to deal with constraints with the help of a symbolic execution engine. The Behavior Model is used to drive the apps for symbolic execution and generate sequences of events. We provide an efficient implementation of SIG-Droid based in part on Symbolic PathFinder, extended in this work to support automatic testing of Android apps. Our experiments show SIG- Droid is able to achieve significantly higher code coverage than existing automated testing tools targeted for Android.
Paris Kitsos, Dimitris. E. Simos, Jose Torres-Jimenez and Artemios Voyiatzis. Exciting FPGA Cryptographic Trojans using Combinatorial Testing
Abstract: Contemporary hardware design shares many similarities with software development. The injection of malicious functionality (Trojans) in FPGA designs is a realistic threat. The implementation of cryptographic primitives is an attractive target for attack. An undetected modification can have disastrous effects on the provided security. Established techniques for testing correctness do not cope well with Trojans, since these are not captured in the system model. Furthermore, a well-designed Trojan activates under rare conditions and escapes detection during testing. Such conditions cannot be exhaustively searched, especially in the case of cryptographic implementations with hundreds of inputs.
In this paper, we explore the applicability of a prominent combinatorial strategy, namely combinatorial testing, for FPGA Trojan detection and demonstrate that combinatorial testing provides the theoretical guarantees for exciting a Trojan of specific lengths by covering all of its input combinations. Our findings indicate that combinatorial testing constructs can improve the existing FPGA Trojan detection capabilities by reducing the number of tests needed. Besides the foundations of our approach, we also report on first experiments that indicate its practical use.
Shiyu Dong, Oswaldo Olivo, Lingming Zhang and Sarfraz Khurshid. Studying the Influence of Standard Compiler Optimizations on Symbolic Execution
Abstract: Compiler optimizations in the context of traditional program execution is a well-studied research area, and modern compilers support a suite of optimizations. In contrast, the influence of such optimizations on the performance of symbolic execution -- a classic program analysis technique that performs an exhaustive exploration of the program’s bounded execution paths -- is not so well-understood. This paper reports a study of the influence of standard compiler optimizations on
traditional symbolic execution. KLEE, a well-known symbolic execution engine based on the LLVM compiler infrastructure, provides the enabling technology for the study, which focuses on 33 optimization flags of LLVM. Specifically, we study (1) how different optimizations influence the performance of symbolic execution for Unix Coreutils, (2) how the influence varies across two different program classes, and (3) how the influence varies across three different back-end constraint solvers. Some of our findings surprised us. For example, applying the 33 optimizations in a pre-defined order provides a slow down (compared to no optimization) for a majority of the Coreutils when using the basic depth-first search with no constraint caching. We hope our study motivates future research on harnessing the power of symbolic execution more effectively.
Gustavo Ansaldi Oliva and Marco Gerosa. Experience Report: How do Structural Dependencies Influence Change Propagation? An Empirical Study
Abstract: Real world object-oriented systems are composed of hundreds or even thousands of classes that are structurally interconnected in many different ways. In this highly complex scenario, it is unclear how changes propagate. Given the high maintenance cost brought by change propagation, these questions become particularly relevant in practice. In this paper, we set out to investigate the influence of structural dependencies on change propagation. We historically analyzed thousands of code snapshots coming from 4 open-source Java projects of different sizes and domains. Our results indicated that, in general, it is more likely that two artifacts will not co-change just because one depends on the other. However, the rate with which an artifact co-changes with another is higher when the former structurally depends on the latter. This rate becomes higher if we track down dependencies to the low-level entities that are changed in commits. This implies, for instance, that developers should be aware of dependencies on methods that are added or changed, as these dependencies tend to propagate changes more often. Finally, we also found several cases where software changes could not be justified using structural dependencies, meaning that co-changes are induced by other subtler kinds of relationships.
Bo Zhou, Iulian Neamtiu and Rajiv Gupta. Experience Report: How Do Bug Characteristics Differ Across Severity Classes: A Multi-platform Study
Abstract: Bugs of different severities have so far been put into the same category, but their characteristics differ significantly. Moreover, the nature of issues with the same severity, e.g., high, differs markedly between desktops and smartphones. To understand these differences, we perform an empirical study on 72 Android and desktop projects. We first define three bug severity classes: high, medium, and low. Next, we study how severity changes and quantify differences between classes in terms of bug-fixing attributes. Then, we focus on topic differences: using LDA, we extract topics associated with each severity class, and study how these topics differ across classes, platforms, and over time. Our findings include: severity and priority affect bug fixing time; medium-severity contributors are more experienced than high-severity contributors; and there have been more concurrency and cloud-related high-severity bugs on desktop since 2009, while on Android concurrency bugs are prevalent.
Tukaram Muske and Uday P. Khedker. Efficient Elimination of False Positives using static analysis
Abstract: Bug detection using static analysis has been found useful in practice for ensuring software quality and reliability. However, it often requires sifting through a large number of warnings. This can be handled by generating an assertion corresponding to each warning and verifying the assertion using a model checker to classify the warning as an error or a false positive. Since model checking over larger code fragments is non-scalable and expensive, it is useful to model check a given assertion with a small calling context length. For this, the variables receiving values from outside the context are initialized with arbitrary values generated by non-deterministic choice functions. The calling context length is then gradually increased on a need basis to include callers higher up in the call chains. While this aids scalability by keeping the calling context as small as possible, it requires multiple calls to model checker for the same assertion, requiring a considerable amount of time.
We present a static analysis to expedite false positive elimination. It is based on the observation that when the variables involved in an assertion are allowed to take any arbitrary value at the point of assertion, the assertion is bound to be violated for some or the other combination of values. In such cases, usage of a model checker is redundant as it does not aid in resolution of the corresponding warning. Our data flow analysis identifies (an over-approximated set of) such variables using a novel lattice so that model checking of assertions involving such variables can be avoided. Our empirical evaluation demonstrates that, on an average, the proposed static analysis avoids 49.49% of the total model checking calls, and it reduces the false positives elimination time by 39.28%. However, this gain is achieved at the cost of missing 2.78% false positives which could have been eliminated otherwise.