Total Contributed Abstracts: 28
# | Type | Abstract Title | Theme | Abstract |
---|---|---|---|---|
1 | Presentation | Trust Throughout the Artificial Intelligence Lifecycle | Test & Evaluation Methods for Emerging Technology | AI and machine learning have become widespread throughout the defense, government, and commercial sectors. This has led to increased attention on the topic of trust and the role it plays in successfully integrating AI into high-consequence environments where tolerance for risk is low. Driven by recent successes of AI algorithms in a range of applications, users and organizations rely on AI to provide new, faster, and more adaptive capabilities. However, along with those successes have come notable pitfalls, such as bias, vulnerability to adversarial attack, and inability to perform as expected in novel environments. Many types of AI are data-driven, meaning they operate on and learn their internal models directly from data. Therefore, tracking how data were used throughout model building (e.g., for training, validation, and testing) is crucial not only to ensure a high-performing model, but also to understand whether the AI should be trusted. MLOps, an offshoot of DevSecOps, is a set of best practices meant to standardize and streamline the end-to-end lifecycle of machine learning. In addition to supporting the software development and hardware requirements of AI-based systems, MLOps provides a scaffold by which the attributes of trust can be formally and methodically evaluated. Additionally, MLOps encourages reasoning about trust early and often in the development cycle. To this end, we present a framework that encourages the development of AI-based applications that can be trusted to operate as intended and function safely both with and without human interaction. This framework offers guidance for each phase of the AI lifecycle, utilizing MLOps, through a detailed discussion of pitfalls resulting from not considering trust, metrics for measuring attributes of trust, and mitigation strategies for when risk tolerance is low. |
2 | Presentation | Everyday Reproducibility | Data Management / Reproducible Research | Modern data analysis is typically quite computational. Correspondingly, sharing scientific and statistical work now often means sharing code and data in addition to writing papers and giving talks. This type of code sharing faces several challenges. For example, it is often difficult to take code from one computer and run it on another due to software configuration, version, and dependency issues. Even if the code runs, writing code that is easy to understand or interact with can be difficult. This makes it difficult to assess third-party code and its findings, for example, in a review process. In this talk we describe a combination of two computing technologies that help make analyses shareable, interactive, and completely reproducible. These technologies are (1) analysis containerization, which leverages virtualization to fully encapsulate analysis, data, code, and dependencies into an interactive and shareable format, and (2) code notebooks, a literate programming format for interacting with analyses. This talk reviews the problems at a high level and also provides concrete solutions to the challenges faced. In addition to discussing reproducibility and data/code sharing generally, we will touch upon several such issues that arise specifically in the defense and aerospace communities. |
3 | Poster Session | Analysis Apps for the Operational Tester | Analysis Tools and Techniques | In the acquisition and testing world, data analysts repeatedly encounter certain categories of data, such as time or distance until an event (e.g., failure, alert, detection), binary outcomes (e.g., success/failure, hit/miss), and survey responses. Analysts need tools that enable them to produce quality and timely analyses of the data they acquire during testing. This poster presents four web-based apps that can analyze these types of data. The apps are designed to assist analysts and researchers with simple repeatable analysis tasks, such as building summary tables and plots for reports or briefings. Using software tools like these apps can increase reproducibility of results, timeliness of analysis and reporting, attractiveness and standardization of aesthetics in figures, and accuracy of results. The first app models reliability of a system or component by fitting parametric statistical distributions to time-to-failure data. The second app fits a logistic regression model to binary data with one or two independent continuous variables as predictors. The third calculates summary statistics and produces plots of groups of Likert-scale survey question responses. The fourth calculates the system usability scale (SUS) scores for SUS survey responses and enables the app user to plot scores versus an independent variable. These apps are available for public use on the Test Science Interactive Tools webpage https://new.testscience.org/interactive-tools/. (An illustrative sketch of the first two analysis types appears after the table.) |
4 | Presentation | A Decision-Theoretic Framework for Adaptive Simulation Experiments | Design of Experiments | This paper describes a framework for increasing effectiveness of high-performance computing (HPC) to support decision-making in the presence of uncertainty intrinsic to queries, models, and simulation results. Given a mathematically precise query, the framework adaptively chooses where to sample. Unlike conventionally designed simulation experiments, which specify beforehand where to sample, the framework optimally schedules sampling predicated upon four interconnected models: (a) the surrogate model, e.g., a continuous correlated beta process, which globally estimates the response using beta distributions; (b) a value model, e.g., mutual information, for estimating the benefit of candidate runs to answering the query; (c) a cost model for predicting time to execute candidate runs, possibly from multi-fidelity simulation options; and (d) a grid state model. Runs are chosen by maximizing information per cost. A Bayesian perspective is taken to formulate and update each of these models as simulation results arrive for use in the iterative run-selection and scheduling phase. For a precisely stated query, up to an 80 percent reduction in total runs has been observed. The paper illustrates use of the framework with simple examples. A simulation experiment is conducted to answer some question. In order to define precisely how informative a run is for answering the question, the answer must be defined as a random variable. This random variable is called a query and has the general form of p(theta | y), where theta is the query parameter and y is the available data. Example models employed in the framework are briefly described below: 1. The continuous correlated beta process model (CCBP) estimates the proportions of successes and failures using beta-distributed uncertainty at every point in the input space. It combines results using an exponentially decaying correlation function. The output of the CCBP is used to estimate the value of a candidate run. 2. The mutual information model quantifies uncertainty in one random variable that is reduced by observing the other one. The model quantifies the mutual information between any candidate run and the query, thereby scoring the value of running each candidate. 3. The cost model estimates how long future runs will take, based upon past runs using, e.g., a generalized linear model. A given simulation might have multiple fidelity options that require different run times. It may be desirable to balance information with the cost of a mixture of runs using these multi-fidelity options. 4. The grid state model, together with the mutual information model, is used to select the next collection of runs for optimal information per cost, accounting for current grid load. The framework has been applied to multiple use cases involving a variety of queries: (a) assessing compliance with a performance requirement, (b) sensitivity analysis to system input factors, (c) design optimization in the presence of uncertainty, and (d) calibration of simulations using field data, i.e., model verification and validation that includes uncertainty quantification (VVUQ). The paper describes several aspects that emerge when applying the framework to each of these use cases. (A simplified illustration of information-per-cost run selection appears after the table.) |
5 | Presentation | Cloud Computing for Computational Fluid Dynamics (CFD) in T&E | Test & Evaluation Methods for Emerging Technology | In this talk we’ll focus on exploring the motivation for using cloud computing for Computational Fluid Dynamics (CFD) for Federal Government Test & Evaluation. Using examples from automotive, aerospace, and manufacturing, we’ll look at benchmarks for a number of CFD codes using CPUs (x86 & Arm) and GPUs, and we’ll look at how the development of high-fidelity CFD, e.g., WMLES and HRLES, is accelerating the need for access to large-scale HPC. The onset of COVID-19 has also meant a large increase in the need for remote visualization, with greater numbers of researchers and engineers needing to work from home. This has also accelerated the adoption of the same approaches needed for the pre- and post-processing of peta/exa-scale CFD simulation, and we’ll look at how these are more easily accessed via a cloud infrastructure. Finally, we’ll explore perspectives on integrating ML/AI into CFD workflows using data lakes from a range of sources and where the next decade may take us. |
6 | Presentation | A revolutionary approach to software reliability modeling for software quality assurance | Analysis Tools and Techniques | This tutorial introduces a revolutionary approach to software reliability modeling for software quality assurance. It is an integration of traditional software reliability modeling with recent advances in computing science. Software reliability models have been based on a single mathematical curve, i.e., either an S-shaped or an exponential curve, to represent a defect detection process. We have developed an innovative method for automatically generating multiple curves to accurately represent a defect detection process by identifying inflection points. It is a piece-wise application of well-known exponential statistical models based on a non-homogeneous Poisson process, and it is a good representation of the complex nature of current software development and test processes. With the advancement in computing science, the innovative algorithm has been implemented in Python to run in a cloud environment. This approach enables the analytics to run in a real-time environment and share the results with other project team members, making it more practical to use. In addition to the defect detection process, we also address the importance of the defect closure process. In practice, we need to allocate development resources to fix software defects or bugs. We have developed a method for predicting the software defect closure curve based on the defect detection curve. The difference between the detection curve and the closure curve is called the defect open curve, which represents the number of defects still to be fixed. It is an important measure of software quality at a delivery date. By combining the detection and open curves, project management will be able to balance development and test resource allocation for a given delivery date. Next, we address early defect prediction without actual defect data during a planning phase. For this purpose, we use development and test effort data, which are a good representation of software complexity. We then analyze the relationship between defects detected and effort spent from previous releases. Our approach is not only technically sound but also useful for real software development organizations that must deliver a high-quality software product. Key steps, from concept and proof of concept (prototype and trials) to productization of the innovative algorithm, have been demonstrated. We have developed an online tool, called STAR. Its presentation of the output is focused on visualization using charts and tables. It will help project managers to quantitatively strike a balance between software delivery deadlines and quality. STAR can be used for internal test defects or acceptance test defects. It also includes early defect prediction without actual data, using planning data such as development and test effort data during a planning period. It is applicable to various software projects ranging from small-scale to large-scale development. In addition, STAR interactively provides the quality impact of corrective actions such as a delay of the delivery date or additional development resources. Analytics are all based on a zero-touch automation algorithm. User input is basically delivery dates and defect data, plus planning data as an option. The tutorial will cover an online demonstration of STAR as time permits. (A minimal sketch of fitting an exponential NHPP defect-detection curve appears after the table.) |
7 | Poster Session | Risk Comparison and Planning for Bayesian Assurance Tests | Design of Experiments | Designing a Bayesian assurance test plan requires choosing a test plan that guarantees a product of interest is good enough to satisfy the consumer’s criteria but not ‘so good’ that it causes the producer concern about failing the test. Bayesian assurance tests are especially useful because they can incorporate previous product information in the test planning and explicitly control levels of risk for the consumer and producer. We demonstrate an algorithm for efficiently computing a test plan given desired levels of risk in binomial and exponential testing. Numerical comparisons with the Operating Characteristic (OC) curve, Probability Ratio Sequential Test (PRST), and a simulation-based Bayesian sample size determination approach are also considered. (A sketch of computing consumer and producer risks for a binomial test plan appears after the table.) |
8 | Poster Session | Live demo of a revolutionary online tool for software quality assurance | Analysis Tools and Techniques | This poster session presents a revolutionary online tool, STAR, for software quality assurance. STAR implements a world-leading prediction method for software defects with a user-friendly interface and visualization. Key measurements used as input are the detected date and closed date, with severity and impacted component, for each defect during the internal test and customer deployment periods. It will automatically generate output such as how many more defects are expected by delivery, how many more defects are expected after deployment, and how many defects are still open at delivery. It will enable software development companies to quantitatively strike a balance between delivery deadlines and quality. With a direct connection to a customer’s fault database, it will automatically extract the data from the database and update predictions in real time. The results can be shared with other project members via online access. A live demonstration will show the user input upload process and the interpretation of STAR output using demo data sets. The defect data for demonstration have been generated based on over 40 years of experience working with real projects. STAR can be used for internal test defects or acceptance test defects. It is applicable to various software projects ranging from small-scale to large-scale development. It contains several output views with actual vs. prediction: executive summary of current quality assessment and predicted quality metrics at delivery, defect arrival & closure trends, defects by severity & component, release-over-release view, and prediction stability over time. STAR also includes early defect prediction without actual data, using planning data such as development and test effort data during the planning period. In addition, it also provides the quality impact of corrective actions interactively. Examples of corrective actions are a delay of the delivery date and additional developers. Analytics are all based on a zero-touch automation algorithm. User input is basically delivery dates and defect data, plus planning data as an option. This poster session will be helpful for the software reliability community, for both academia and industry practitioners. Academia will be able to understand the need for making current reliability modeling more practical. Practitioners will be able to understand the power of the online analytics tool, STAR, and begin to collect the right sets of data from their own development projects. It will allow them to focus on developing quality improvement plans. An earlier version of this tool has been extensively used at Nokia. |
9 | Presentation | Analysis of Target Location Error using Stochastic Differential Equations | Analysis Tools and Techniques | This paper presents an analysis of target location error (TLE) based on the Cox-Ingersoll-Ross (CIR) model. In brief, this model characterizes TLE as a function of range based on the stochastic differential equation dX(r) = a(b - X(r)) dr + sigma * sqrt(X(r)) dW(r), where X(r) is the TLE at range r, b is the long-term (terminal) mean of the TLE, a is the rate of reversion of X(r) to b, sigma is the process volatility, and W(r) is the standard Wiener process. Multiple flight test runs under the same conditions exhibit different realizations of the TLE process. This approach to TLE analysis models each flight test run as a realization of the CIR process. Fitting a CIR model to multiple data runs then provides a characterization of the TLE of the system under test. This paper presents an example use of the CIR model. Maximum likelihood estimates of the parameters of the CIR model are found from a collection of TLE data runs. The resulting CIR model is then used to characterize overall system TLE performance as a function of range to the target as well as the asymptotic estimate of long-term TLE. (A small simulation sketch of the CIR model appears after the table.) |
10 | Presentation | Sparse Models for Detecting Malicious Behavior in OpTC | Analysis Tools and Techniques | Host-based sensors are standard tools for generating event data to detect malicious activity on a network. There is often interest in detecting activity using as few event classes as possible in order to minimize host processing slowdowns. Using DARPA’s Operationally Transparent Cyber (OpTC) Data Release, we consider the problem of detecting malicious activity using event counts aggregated over five-minute windows. Event counts are categorized by eleven features according to MITRE CAR data model objects. In the supervised setting, we use regression trees with all features to show that malicious activity can be detected at above a 90% true positive rate with a negligible false positive rate. Using forward and exhaustive search techniques, we show the same performance can be obtained using a sparse model with only three features. In the unsupervised setting, we show that the isolation forest algorithm is somewhat successful at detecting malicious activity, and that a sparse three-feature model performs comparably. Finally, we consider various search criteria for identifying sparse models and demonstrate that the RMSE criterion is generally optimal. (A hedged sketch of sparse-subset search and an isolation forest appears after the table.) |
11 | Speed Session / Poster | Testing Abstract Notice | Design of Experiments | Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum |
12 | Presentation | A Framework to Guide Human-Systems Integration and Resilience in Sociotechnical Systems | Test and Evaluation Methods for Emerging Technology | The goal of this work is to support technology development teams in building advanced technologies and transitioning them into existing systems in a way that strengthens the adaptivity and resilience of high-consequence work systems, including military, healthcare, air traffic, petrochemical, and utility systems. These work systems are complex: They feature a large variety of interacting factors; the variety and interactions produce emergent conditions that can challenge or exceed the system’s capability envelope. Furthermore, in these systems there is constant potential for excessive demands, outages and malfunctions, anomalies, threats such as ransomware, and crises. Given the variety and unpredictability of operating conditions as well as the high cost of failure, these work systems must be capable of adapting and evolving, i.e., demonstrate resilience. In today’s work systems, humans are the actuators and guides of system adaptation and evolution. For humans to effectively guide these processes, the system must have certain features, referred to as system resilience sources, which tend to be inherent in all resilient complex systems. We have developed the “Transform with Resilience during Upgrades to Socio-Technical Systems” (TRUSTS) Framework to guide system modernization and advanced-technology acquisition. The framework specifies complex-system resilience sources, allowing system stakeholders to identify modernization strategies that improve system resilience and enabling technology developers to design and transition technologies in ways that contribute to system resilience. Currently, we are translating the framework’s system resilience sources into engineering methods and tools. This presentation will provide an overview of the TRUSTS Framework and describe our efforts towards integrating it into systems development practice. |
13 | Presentation | Method for Evaluating Bayesian Reliability Models for Developmental Testing | Analysis Tools and Techniques | For analysis of military Developmental Test (DT) data, frequentist statistical models are increasingly challenged to meet the needs of analysts and decision-makers. This is largely due to tightening constraints on test resources and schedule that reduce the quantity and increase the complexity of test data. Bayesian models have the potential to address this challenge. However, although there is a substantial body of research on Bayesian reliability estimation, there appears to be a paucity of Bayesian applications to issues of direct interest to DT decision makers. Due to user unfamiliarity with the characteristics and data needs of Bayesian models, the potential for such models appears to be unexploited. To address this deficiency, the purpose of this research is to provide a foundation and best practices for the use of Bayesian reliability analysis in DT. This is accomplished by establishing a generic structure for systematically evaluating relevant statistical Bayesian models. First, reliability issues for DT programs are identified using a structured poll of stakeholders combined with interviews of a selected set of Subject Matter Experts. Second, candidate solutions are identified in the literature, and third, solutions are matched to issues using criteria designed to evaluate the capability of a solution to improve support for decision-makers at critical points in DT programs. The matching process uses a model taxonomy structured according to decisions at each DT phase, plus criteria for model applicability and data availability. The end result is a generic structure that allows an analyst to identify and evaluate a specific model for use with a program and issue of interest. This work includes example applications to models described in the statistical literature. |
14 | Presentation | Building Bridges: a Case Study of Assisting a Program from the Outside | Special Topics | As STAT practitioners, we often find ourselves outsiders to the programs we assist. This session presents a case study that demonstrates some of the obstacles in communication of capabilities, purpose, and expectations that may arise due to approaching the project externally. Incremental value may open the door to greater collaboration in the future, and this presentation discusses potential solutions to provide greater benefit to testing programs in the face of obstacles that arise due to coming from outside the program team. DISTRIBUTION STATEMENT A. Approved for public release; distribution is unlimited. CLEARED on 5 Jan 2022. Case Number: 88ABW-2022-0002 |
15 | Speed Session / Poster | Exploring the behavior of Bayesian adaptive design of experiments | Design of Experiments | Physical experiments in the national security arena, including nuclear deterrence, are often expensive and time-consuming, resulting in small sample sizes that make it difficult to achieve desired statistical properties. Bayesian adaptive design of experiments (BADE) is a sequential design of experiments approach that updates the test design in real time in order to optimally collect data. BADE recommends ending an experiment early when, with sufficiently high probability, the experiment would have ended in efficacy or futility had the testing completely finished. This is done by using data already collected, marginalizing over the remaining uncollected data, and updating the Bayesian posterior distribution in near real time. BADE has seen successes in clinical trials, resulting in quicker and more effective assessments of drug trials while also reducing ethical concerns. BADE has typically only been used in futility studies rather than efficacy studies for clinical trials, although there has been little debate about this current paradigm. BADE has been proposed for testing in the national security space for similar reasons of quicker and cheaper test series. Given the high-consequence nature of the tests performed in the national security space, a strong understanding of new methods is required before they are deployed. The main contribution of this research was to reproduce results seen in previous studies for different aspects of model performance. A large simulation inspired by a real testing problem at Sandia National Laboratories was performed to understand the behavior of BADE under various scenarios, including shifts to the mean, standard deviation, and distributional family, all in addition to the presence of outliers. The results help explain the behavior of BADE under various assumption violations. Using the results of this simulation, combined with previous work related to BADE in this field, it is argued this approach could be used as part of an “evidence package” for deciding to stop testing early due to futility, or with stronger evidence, efficacy. The combination of expert knowledge with statistical quantification provides the stronger evidence necessary for a method in its infancy in a high-consequence, new application area such as national security. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. |
16 | Speed Session / Poster | Profile Monitoring via Eigenvector Perturbation | Test and Evaluation Methods for Emerging Technology | Control charts are often used to monitor the quality characteristics of a process over time to ensure undesirable behavior is quickly detected. The escalating complexity of the processes we wish to monitor spurs the need for more flexible control charts, such as those used in profile monitoring. Additionally, designing a control chart that has an acceptable false alarm rate for a practitioner is a common challenge. Alarm fatigue can occur if the sampling rate is high (say, once a millisecond) and the control chart is calibrated to an average in-control run length (ARL0) of 200 or 370, as is often done in the literature. As alarm fatigue may not just be an annoyance but may result in detrimental effects to the quality of the product, control chart designers should seek to minimize the false alarm rate. Unfortunately, reducing the false alarm rate typically comes at the cost of detection delay, or average out-of-control run length (ARL1). Motivated by recent work on eigenvector perturbation theory, we develop a computationally fast control chart, called the Eigenvector Perturbation Control Chart, for nonparametric profile monitoring. The control chart monitors the l_2 perturbation of the leading eigenvector of a correlation matrix and requires only a sample of known in-control profiles to determine control limits. Through a simulation study we demonstrate that it is able to outperform its competition by achieving an ARL1 close to or equal to 1 even when the control limits result in a large ARL0 on the order of 10^6. Additionally, non-zero false alarm rates with a change point after 10^4 in-control observations were only observed in scenarios that are either pathological or truly difficult for a correlation-based monitoring scheme. (A simplified sketch of the perturbation statistic appears after the table.) |
17 | Presentation | Bayesian Reliability Methods for Developmental Testing of a Generic Complex System | Analysis Tools and Techniques | Improving the statistical methods used to predict and assess defense system reliability in developmental test (DT) programs would benefit Department of Defense (DoD) acquisition. Current methods for reliability planning tend to produce high reliability goals that are difficult to achieve in practice. Moreover, in many current applications, traditionally used frequentist methods for reliability assessment generally do not combine information across DT segments and tend to produce large uncertainty intervals of limited use to program decision makers. Much recent work has demonstrated the advantages of using Bayesian statistical methods for planning and assessing system reliability. This work investigates the application of Bayesian reliability assessment methods to notional DT lifetime data. The notional data are generated with the Bayesian reliability growth planning methodology of Wayne (2018), and the Bayesian assessment methods are based on the methodology of Wayne and Modarres (2015). The system under test is assumed to be a generic complex system, in which the large number of individual failure modes are not described individually. This work explores the sensitivity of the Bayesian results to the choice of the prior distribution and the amount of DT data available. Furthermore, this work compares the Bayesian results for the reliability point estimate, uncertainty interval, and probability of passing a demonstration test to analogous results from traditional reliability assessment methods. Defining point estimates of key reliability metrics, in particular the mean time between failures (MTBF), is discussed. The effect of relaxing the assumption of a generic complex system and attributing failures to individual failure modes is considered. Finally, the results of this study are compared with recent case studies that apply Bayesian reliability assessment methods. Wayne, Martin. 2018. “Modeling Uncertainty in Reliability Growth Plans.” 2018 Annual Reliability and Maintainability Symposium (RAMS). 1-6. Wayne, Martin, and Mohammad Modarres. 2015. “A Bayesian Model for Complex System Reliability Growth Under Arbitrary Corrective Actions.” IEEE Transactions on Reliability 64: 206-220. |
18 | Presentation | A Framework for Using Priors in a Continuum of Testing | Analysis Tools and Techniques | A strength of the Bayesian paradigm is that it allows for the explicit use of all available information—to include subject matter expert (SME) opinion and previous (possibly dissimilar) data. While frequentists are constrained to only including data in an analysis (that is to say, only including information that can be observed), Bayesians can easily consider both data and SME opinion, or any other related information that could be constructed. This can be accomplished through the development and use of priors. When prior development is done well, a Bayesian analysis will not only lead to more direct probabilistic statements about system performance, but can result in smaller standard errors around fitted values when compared to a frequentist approach. Furthermore, by quantifying the uncertainty surrounding a model parameter, through the construct of a prior, Bayesians are able to capture the uncertainty across a test space of consideration. This presentation develops a framework for thinking about how different priors can be used throughout the continuum of testing. In addition to types of priors, how priors can change or evolve across the continuum of testing—especially when a system changes (e.g., is modified or adjusted) during phases of testing—will be addressed. Priors that strive to provide no information (reference priors) will be discussed, and will build up to priors that contain available information (informative priors). Informative priors—both those based on institutional knowledge or summaries from databases, as well as those developed based on previous testing data—will be discussed, with a focus on how to consider previous data that is dissimilar in some way, relative to the current test event. What priors might be more common in various phases of testing, types of information that can be used in priors, and how priors evolve as information accumulates will all be discussed. |
19 | Speed Session / Poster | Bayesian Estimation for Covariate Defect Detection Model Based on Discrete Cox Proportiona | Analysis Tools and Techniques | Traditional methods to assess software characterize the defect detection process as a function of testing time or effort to quantify failure intensity and reliability. More recent innovations include models incorporating covariates that explain defect detection in terms of underlying test activities. These covariate models are elegant and only introduce a single additional parameter per testing activity. However, the model forms typically exhibit a high degree of non-linearity. Hence, stable and efficient model-fitting methods are needed to enable widespread use by the software community, which often lacks mathematical expertise. To overcome this limitation, this poster presents Bayesian estimation methods for covariate models, including the specification of informed priors as well as confidence intervals for the mean value function and failure intensity, which often serves as a metric of software stability. The proposed approach is compared to traditional alternatives such as maximum likelihood estimation. Our results indicate that Bayesian methods with informed priors converge most quickly and achieve the best model fits. Incorporating these methods into tools should therefore encourage widespread use of the models to quantitatively assess software. |
20 | Presentation | Quantifying the Impact of Staged Rollout Policies on Software Process and Product Metrics | Analysis Tools and Techniques | Software processes define specific sequences of activities performed to effectively produce software, whereas tools provide concrete computational artifacts by which these processes are carried out. Tool-independent modeling of processes and related practices enables quantitative assessment of software and competing approaches. This talk presents a framework to assess an approach employed in modern software development known as staged rollout, which releases new or updated software features to a fraction of the user base in order to accelerate defect discovery without imposing the possibility of failure on all users. The framework quantifies process metrics such as delivery time and product metrics, including reliability, availability, security, and safety, enabling tradeoff analysis to objectively assess the quality of software produced by vendors, establish baselines, and guide process and product improvement. Failure data collected during software testing is employed to emulate the approach as if the project were ongoing. The underlying problem is to identify a policy that decides when to perform various stages of rollout based on the software’s failure intensity. The illustrations examine how alternative policies impose tradeoffs between two or more of the process and product metrics. |
21 | Presentation | Likelihood Ratio Test Comparing V50 Values for a Partially Nested Generalized Linear Mixed | Analysis Tools and Techniques | Ballistic limit testing is a type of sensitivity testing in which the stressor is the velocity of a kinetic energy threat (fixed effect) and the response is the penetration result. A generalized linear model (GLM) may be used to analyze the data. If there is an additional random effect (e.g., lot number for a threat), then a generalized linear mixed model (GLMM) may be used. In both cases, the V50 (the velocity at which there is a 50% probability of penetration) is often used as a metric of performance. Surrogates are developed to improve repeatability, to increase the availability of test resources, and to decrease the cost of testing. Examples include a surrogate threat to replace a foreign round and a human skin simulant to replace the need for cadaver testing. Testing is required to ensure that the surrogate’s performance is similar to the material it is replacing. A Wald statistical test compares V50 values between the actual item and a surrogate. Although both tests are based on a large-sample approximation, likelihood ratio tests tend to outperform Wald tests for smaller samples. A likelihood ratio test on V50 has previously been proposed when using a GLM. This work extends this method to the comparison of V50 values for a partially nested GLMM, as would be seen when evaluating the performance of a surrogate. A simulation study is conducted and quantile-quantile plots are used to investigate the performance of this likelihood ratio test. (A simplified fixed-effects sketch of such a comparison appears after the table.) |
22 | Presentation | Assurance Techniques for Learning Enabled Autonomous Systems which Aid Systems Engineering | Test and Evaluation Methods for Emerging Technology | It is widely recognized that the complexity and resulting capabilities of autonomous systems created using machine learning methods, which we refer to as learning enabled autonomous systems (LEAS), pose new challenges to systems engineering test, evaluation, verification, and validation (TEVV) compared to their traditional counterparts. This presentation provides a preliminary attempt to map recently developed technical approaches in the assurance and TEVV of learning enabled autonomous systems (LEAS) literature to a traditional systems engineering v-model. This mapping categorizes such techniques into three main approaches: development, acquisition, and sustainment. It reviews the latest techniques to develop safe, reliable, and resilient learning enabled autonomous systems, without recommending radical and impractical changes to existing systems engineering processes. By performing this mapping, we seek to assist acquisition professionals by (i) informing comprehensive test and evaluation planning, and (ii) objectively communicating risk to leaders. The inability to translate qualitative assessments to quantitative metrics which measure system performance hinders adoption. Without understanding the capabilities and limitations of existing assurance techniques, defining safety and performance requirements that are both clear and testable remains out of reach. We accompany recent literature reviews on autonomy assurance and TEVV by mapping such developments to distinct steps of a well-known systems engineering model chosen due to its prevalence, namely the v-model. For three top-level lifecycle phases: development, acquisition, and sustainment, a section of the presentation has been dedicated to outlining recent technical developments for autonomy assurance. This representation helps identify where the latest methods for TEVV fit in the broader systems engineering process while also enabling systematic consideration of potential sources of defects, faults, and attacks. Note that we use the v-model only to assist the classification of where TEVV methods fit. This is not a recommendation to use a certain software development lifecycle over another. |
23 | Speed Session / Poster | Predicting Trust in Automated Systems: Validation of the Trust of Automated Systems Test | Test and Evaluation Methods for Emerging Technology | Over the past three years, researchers in OED have developed a scale to measure trust in automated systems, called the Trust of Automated Systems Test (TOAST), and provided initial evidence of its validity. This poster will describe how accurate the TOAST scale is at predicting trust in an automated system by measuring the extent to which a civilian will use a provided system. The main question we plan to answer is whether the scale can predict what we consider to be a “steady” level of reliance, that is, the level of reliance that a person reaches that does not vary as time continues. We also have two supporting questions designed to help us understand how people determine levels of trust of high-accuracy and low-accuracy systems and how long it takes for people to reach their steady level of trust. We believe that this scale should be used to evaluate the trust level of any human using any system, including predicting when operators will misuse or disuse complex, automated and autonomous systems. |
24 | Presentation | Utilizing Machine Learning Models to Predict Success in Special Operations Assessment | Data Management and Reproducible Research | The 75th Ranger Regiment is an elite Army unit responsible for some of the most physically and mentally challenging missions. Entry to the unit is based on an assessment process called Ranger Regiment Assessment and Selection (RASP), which consists of a variety of tests and challenges of strength, intellect, and grit. This study explores the psychological and physical profiles of candidates who attempt to pass RASP. Using a random forest machine learning model and a penalized logistic regression model, we identify initial entry characteristics that are predictive of success in RASP. We focus on the differences between racial sub-groups and military occupational specialty (MOS) sub-groups to provide information for recruiters to identify underrepresented groups who are likely to succeed in the selection process. (A notional sketch of these two model types appears after the table.) |
25 | Presentation | USE OF DESIGN & ANALYSIS OF COMPUTER EXPERIMENTS (DACE) IN SPACE MISSION TRAJECTORY DESIGN | Design of Experiments | Numerical astrodynamics simulations are characterized by a large input space and complex, nonlinear input-output relationships. Standard Monte Carlo runs of these simulations are typically time-consuming and numerically costly. We adapt the Design and Analysis of Computer Experiments (DACE) approach to astrodynamics simulations to improve runtimes and increase information gain. Space-filling designs such as the Latin Hypercube Sampling (LHS) methods, Maximin and Maximum Projection Sampling, combined with the surrogate modelling techniques of DACE such as Radial Basis Functions and Gaussian Process Regression, gave significant improvements for astrodynamics simulations, including: reduced run time of Monte Carlo simulations, improved speed of sensitivity analysis, confidence intervals for non-Gaussian behavior, determination of outliers, and identification of extreme output cases not found by standard simulation and sampling methods. Four case studies are presented on novel applications of DACE to mission trajectory design and conjunction assessments with space debris: 1) Gaussian Process regression modelling of maneuvers and navigation uncertainties for commercial cislunar and NASA CLPS lunar missions; 2) development of a surrogate model for predicting collision risk and miss distance volatility between debris and satellites in Low Earth orbit; 3) prediction of the displacement of an object in orbit using laser photon pressure; 4) prediction of eclipse durations for the NASA IBEX-extended mission. The surrogate models are assessed by k-fold cross validation. The relative selection of surrogate model performance is verified by the Root Mean Square Error (RMSE) of predictions at untried points. To improve the sampling of manoeuvre and navigational uncertainties within trajectory design for lunar missions, a maximin LHS was used, in combination with the Gates model for thrusting uncertainty. This led to improvements in simulation efficiency, producing a non-parametric ΔV distribution that was processed with Kernel Density Estimation to resolve a ΔV99.9 prediction with confidence bounds. In a collaboration with the NASA Conjunction Assessment Risk Analysis (CARA) group, the change in probability of collision (Pc) for two objects in LEO was predicted using a network of 13 Gaussian Process Regression-based surrogate models that determined the future trends in covariance and miss distance volatility, given the data provided within a conjunction data message. This allowed for determination of the trend in the probability distribution of Pc up to three days from the time of closest approach, as well as the interpretation of this prediction in the form of an urgency metric that can assist satellite operators in the manoeuvre decision process. The main challenge in adapting the methods of DACE to astrodynamics simulations was to deliver a direct benefit to mission planning and design. This was achieved by delivering improvements in confidence and predictions for metrics including the propellant required to complete a lunar mission (expressed as ΔV); statistical validation of the simulation models used; and advising when a sufficient number of simulation runs have been made to verify convergence to an adequate confidence interval. Future applications of DACE for mission design include determining an optimal tracking schedule plan for a lunar mission and robust trajectory design for low thrust propulsion. (A compact sketch of an LHS design with a Gaussian process surrogate appears after the table.) |
26 | Speed Session / Poster | Machine Learning for Efficient Fuzzing | Test and Evaluation Methods for Emerging Technology | A high level of security in software is a necessity in today’s world; the best way to achieve confidence in security is through comprehensive testing. This paper covers the development of a fuzzer that explores the massively large input space of a program using machine learning to find the inputs most associated with errors. A formal methods model of the software in question is used to generate and evaluate test sets. Using those test sets, a two-part algorithm is used: inputs get modified according to their Hamming distance from error-causing inputs, and then a tree-based model learns the relative importance of each variable in causing errors. This architecture was tested against a model of an aircraft’s thrust reverser, and predefined model properties offered a starting test set. From there, the Hamming-distance algorithm and importance model expand upon the original set to offer a more informed set of test cases. This system has great potential in producing efficient and effective test sets and has further applications in verifying the security of software programs and cyber-physical systems, contributing to national security in the cyber domain. |
27 | Speed Session / Poster | Convolutional Neural Networks and Semantic Segmentation for Cloud and Ice Detection | Special Topics | Recent research shows the effectiveness of machine learning on image classification and segmentation. The use of artificial neural networks (ANNs) on image datasets such as the MNIST dataset of handwritten digits is highly effective. However, when presented with a more complex image, ANNs and other simple computer vision algorithms tend to fail. This research uses Convolutional Neural Networks (CNNs) to determine how we can differentiate between ice and clouds in imagery of the Arctic. Instead of using ANNs, where we analyze the problem in one dimension, CNNs identify features using the spatial relationships between the pixels in an image. This technique allows us to extract spatial features, giving us higher accuracy. Using a CNN named the Cloud-Net Model, we analyze how a CNN performs when analyzing satellite images. First, we examine recent research on the Cloud-Net Model’s effectiveness on satellite imagery, specifically Landsat data, with four channels: red, green, blue, and infrared. We extend and modify this model, allowing us to analyze data from the most common channels used by satellites: red, green, and blue. By training on different combinations of these three channels and then testing on an entirely different data set, GOES imagery, we gain an understanding of the impact of each individual channel in image classification. By selecting GOES images that come from the same geographic locations as the Landsat data and contain both ice and clouds, we assess the CNN’s generalizability. Finally, we present the CNN’s ability to accurately identify clouds and ice in the GOES data versus the Landsat data. |
28 | Presentation | Data Science & ML-Enabled Terminal Effects Optimization | Analysis Tools and Techniques | Warhead design and performance optimization against a range of targets is a foundational aspect of the Department of the Army’s mission on behalf of the warfighter. The existing procedures utilized to perform this basic design task do not fully leverage the exponential growth in data science, machine learning, distributed computing, and computational optimization. Although sound in practice and methodology, existing implementations are laborious and computationally expensive, thus limiting the ability to fully explore the trade space of all potentially viable solutions. An additional complicating factor is the fast-paced nature of many Research and Development programs, which require equally fast-paced conceptualization and assessment of warhead designs. By utilizing methods that take advantage of data analytics, the workflow to develop and assess modern warheads will enable earlier insights, discovery through advanced visualization, and optimal integration of multiple engineering domains. Additionally, a framework built on machine learning would allow for the exploitation of past studies and designs to better inform future developments. Combining these approaches will allow for rapid conceptualization and assessment of new and novel warhead designs. US overmatch capability is quickly eroding across many tactical and operational weapon platforms. Traditional incremental improvement approaches are no longer generating appreciable performance improvements to warrant investment. Novel next-generation techniques are required to find efficiencies in designs and leap-forward technologies to maintain US superiority. The proposed approach seeks to shift the existing design mentality to meet this challenge. |
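The following minimal sketch relates to abstract 3 (Analysis Apps for the Operational Tester). It is not taken from the apps themselves; it only illustrates the first two analysis types they automate, using invented data: fitting a parametric (Weibull) distribution to time-to-failure data with scipy, and fitting a logistic regression to binary hit/miss data with one continuous predictor using statsmodels.

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Analysis type 1: fit a parametric (Weibull) distribution to time-to-failure data.
failure_hours = 200.0 * rng.weibull(1.5, size=40)          # notional failure times (hours)
shape, loc, scale = stats.weibull_min.fit(failure_hours, floc=0)
print(f"Weibull fit: shape = {shape:.2f}, scale = {scale:.1f} hours")

# Analysis type 2: logistic regression of a binary outcome on one continuous factor.
range_km = rng.uniform(1, 10, size=60)                      # notional test condition
p_hit = 1.0 / (1.0 + np.exp(-(4.0 - 0.6 * range_km)))       # notional true relationship
hit = rng.binomial(1, p_hit)
X = sm.add_constant(range_km)
logit_fit = sm.GLM(hit, X, family=sm.families.Binomial()).fit()
print(logit_fit.params)                                     # intercept and slope estimates
```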
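A much-simplified illustration of the "information per cost" selection idea in abstract 4. The actual framework uses a continuous correlated beta process surrogate, mutual information, a run-time cost model, and a grid-state model; in this sketch each candidate setting simply has an independent Beta posterior, the "value" of a run is its expected reduction in posterior variance, and the costs are notional predicted run times. All numbers and names are assumptions for illustration.

```python
import numpy as np

def beta_var(a, b):
    # Variance of a Beta(a, b) distribution.
    return a * b / ((a + b) ** 2 * (a + b + 1))

def expected_var_reduction(a, b):
    # Expected drop in posterior variance from observing one more pass/fail result.
    p = a / (a + b)                                    # predictive probability of a success
    var_next = p * beta_var(a + 1, b) + (1 - p) * beta_var(a, b + 1)
    return beta_var(a, b) - var_next

# Posterior (successes + 1, failures + 1) and predicted cost (hours) for three candidate settings.
posteriors = {"low": (3, 2), "mid": (1, 1), "high": (8, 1)}
costs = {"low": 2.0, "mid": 1.0, "high": 4.0}

scores = {k: expected_var_reduction(*ab) / costs[k] for k, ab in posteriors.items()}
next_run = max(scores, key=scores.get)                 # greedy information-per-cost choice
print(scores, "-> schedule next:", next_run)
```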
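For the piecewise exponential NHPP approach in abstract 6, here is a minimal single-segment sketch of fitting a Goel-Okumoto-style exponential NHPP defect-detection curve, m(t) = a(1 - exp(-b t)), by maximum likelihood. The abstract's method applies such fits piecewise between automatically detected inflection points, which is not reproduced here, and the defect times below are notional.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(params, times, T):
    a, b = np.exp(params)                      # log-parameterization keeps a, b positive
    lam = a * b * np.exp(-b * times)           # failure intensity at each detection time
    m_T = a * (1.0 - np.exp(-b * T))           # expected cumulative defects by time T
    return -(np.sum(np.log(lam)) - m_T)        # NHPP log-likelihood (negated)

# Notional defect-detection times (test days), observed through day T = 100.
times = np.array([2, 5, 9, 14, 20, 27, 33, 41, 50, 58, 66, 75, 83, 92], dtype=float)
T = 100.0

res = minimize(neg_log_lik, x0=np.log([20.0, 0.02]), args=(times, T), method="Nelder-Mead")
a_hat, b_hat = np.exp(res.x)
print(f"estimated total defects a = {a_hat:.1f}, detection rate b = {b_hat:.3f}")
print(f"expected remaining defects: {a_hat * np.exp(-b_hat * T):.1f}")
```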
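A hedged sketch related to abstract 7: evaluating one common formulation of Bayesian consumer and producer risks for a binomial assurance test plan (n trials, pass if at most c failures) under a Beta prior on reliability, by numerical integration. Here "consumer risk" is the posterior probability that reliability falls below the requirement given a passing test, and "producer risk" is the posterior probability that reliability meets the goal given a failing test; the prior, requirement, goal, and these definitions are illustrative assumptions, not the abstract's specific algorithm.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

def risks(n, c, a, b, p_req, p_goal):
    grid = np.linspace(1e-6, 1 - 1e-6, 4001)            # reliability values
    prior = stats.beta.pdf(grid, a, b)                   # Beta(a, b) prior on reliability
    p_pass = stats.binom.cdf(c, n, 1 - grid)             # P(at most c failures | reliability)
    consumer = trapezoid(p_pass * prior * (grid < p_req), grid) / trapezoid(p_pass * prior, grid)
    producer = trapezoid((1 - p_pass) * prior * (grid >= p_goal), grid) / trapezoid((1 - p_pass) * prior, grid)
    return consumer, producer

# Assumed Beta(8, 1) prior; required reliability 0.80, goal 0.95, allow c = 1 failure.
for n in range(5, 61, 5):
    cons, prod = risks(n, c=1, a=8, b=1, p_req=0.80, p_goal=0.95)
    print(f"n = {n:2d}   consumer risk = {cons:.3f}   producer risk = {prod:.3f}")
```

Scanning the printout for the smallest n that keeps both risks below the desired levels mimics the kind of test-plan search the abstract describes, though the authors' algorithm is more efficient than this brute-force sweep.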
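For the CIR model in abstract 9, a small simulation sketch using the Euler-Maruyama scheme for dX(r) = a(b - X(r)) dr + sigma*sqrt(X(r)) dW(r). The parameter values and the interpretation of the range axis are notional, and the maximum likelihood estimation step described in the abstract is not shown.

```python
import numpy as np

def simulate_cir(x0, a, b, sigma, r_max, n_steps, rng):
    """Euler-Maruyama simulation of dX(r) = a(b - X(r)) dr + sigma*sqrt(X(r)) dW(r)."""
    dr = r_max / n_steps
    x = np.empty(n_steps + 1)
    x[0] = x0
    for i in range(n_steps):
        dw = rng.normal(scale=np.sqrt(dr))               # Wiener increment over dr
        drift = a * (b - x[i]) * dr
        diffusion = sigma * np.sqrt(max(x[i], 0.0)) * dw
        x[i + 1] = max(x[i] + drift + diffusion, 0.0)    # keep the simulated TLE non-negative
    return x

rng = np.random.default_rng(0)
runs = [simulate_cir(x0=40.0, a=0.15, b=8.0, sigma=1.2, r_max=50.0, n_steps=500, rng=rng)
        for _ in range(5)]                               # several notional flight-test runs
terminal = [run[-1] for run in runs]
print("simulated TLE at maximum range for each run:", np.round(terminal, 1))
```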
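A hedged sketch in the spirit of abstract 10, run on synthetic event-count windows rather than OpTC data: an exhaustive search over three-feature subsets for a supervised tree model (a classification tree scored by cross-validated AUC stands in for the abstract's regression trees and RMSE criterion), followed by an unsupervised isolation forest on the selected sparse subset.

```python
from itertools import combinations
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n, d = 2000, 11
X = rng.poisson(5, size=(n, d)).astype(float)            # benign five-minute event counts
y = np.zeros(n, dtype=int)
malicious = rng.choice(n, size=60, replace=False)
X[malicious, :3] += rng.poisson(25, size=(60, 3))        # malicious windows inflate 3 features
y[malicious] = 1

# Exhaustive search over all three-feature subsets for the supervised model.
best = max(combinations(range(d), 3),
           key=lambda cols: cross_val_score(DecisionTreeClassifier(max_depth=4),
                                            X[:, list(cols)], y, cv=3, scoring="roc_auc").mean())
print("selected three-feature subset:", best)

# Unsupervised alternative: isolation forest anomaly scores on the same sparse subset.
iso = IsolationForest(random_state=0).fit(X[:, list(best)])
scores = -iso.score_samples(X[:, list(best)])            # higher score = more anomalous
print("mean anomaly score, malicious vs. benign:",
      round(scores[y == 1].mean(), 3), round(scores[y == 0].mean(), 3))
```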
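A simplified sketch of the statistic monitored in abstract 16: the l_2 perturbation of the leading eigenvector of a correlation matrix, recomputed over a moving window of profiles and compared to the in-control direction. The data generation, window length, and absence of formal control limits are all simplifications of the published chart.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 20                                                   # points per profile

def profiles(n, broken=False):
    common = rng.normal(size=(n, 1)) @ np.ones((1, p))   # shared component across the profile
    x = common + 0.5 * rng.normal(size=(n, p))
    if broken:                                           # out of control: extra factor on half the profile
        x[:, : p // 2] += 1.5 * rng.normal(size=(n, 1)) @ np.ones((1, p // 2))
    return x

def leading_eigvec(data):
    _, vecs = np.linalg.eigh(np.corrcoef(data, rowvar=False))
    v = vecs[:, -1]                                      # eigenvector of the largest eigenvalue
    return v if v[np.argmax(np.abs(v))] > 0 else -v      # fix the sign for comparability

v0 = leading_eigvec(profiles(300))                       # in-control reference direction

stream = np.vstack([profiles(200), profiles(200, broken=True)])   # change point at t = 200
window = 50
for t in range(window, stream.shape[0] + 1, 50):
    stat = np.linalg.norm(leading_eigvec(stream[t - window:t]) - v0)
    print(f"t = {t:3d}   eigenvector perturbation = {stat:.3f}")
```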
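A simplified, fixed-effects illustration related to abstract 21: two logistic models share a velocity slope, the full model allows group-specific intercepts (hence different V50 = -intercept/slope for threat and surrogate), the reduced model forces a common curve, and a likelihood ratio test compares them. The abstract's method additionally handles a partially nested random effect (e.g., lot) via a GLMM, which is not reproduced here; data and parameters are invented.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(3)
v = rng.uniform(400, 700, size=120)                      # test velocities (m/s)
group = np.arange(120) % 2                               # 0 = actual threat, 1 = surrogate
true_v50 = np.where(group == 0, 550.0, 565.0)            # notional V50 values
pen = rng.binomial(1, 1 / (1 + np.exp(-0.03 * (v - true_v50))))   # penetration outcomes

X_full = np.column_stack([np.ones_like(v), group, v])    # separate intercepts, common slope
X_red = np.column_stack([np.ones_like(v), v])            # one common curve
fit_full = sm.Logit(pen, X_full).fit(disp=0)
fit_red = sm.Logit(pen, X_red).fit(disp=0)

lrt = 2 * (fit_full.llf - fit_red.llf)                   # likelihood ratio statistic, 1 df
p_val = stats.chi2.sf(lrt, df=1)
v50 = [-(fit_full.params[0] + g * fit_full.params[1]) / fit_full.params[2] for g in (0, 1)]
print(f"V50 estimates: {v50[0]:.0f} vs {v50[1]:.0f} m/s,  LRT = {lrt:.2f},  p = {p_val:.3f}")
```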
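A notional sketch of the two model types named in abstract 24, applied to synthetic candidate data (no real RASP data are used): a penalized (L1) logistic regression and a random forest, each predicting a selection outcome from invented entry characteristics and compared by cross-validated AUC.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1500
X = np.column_stack([
    rng.normal(240, 30, n),        # notional 2-mile run time (seconds)
    rng.normal(60, 12, n),         # notional push-up count
    rng.normal(110, 10, n),        # notional cognitive test score
    rng.integers(0, 2, n),         # notional prior-service indicator
]).astype(float)
logit = -4 + 0.015 * (240 - X[:, 0]) + 0.03 * X[:, 1] + 0.02 * X[:, 2] + 0.4 * X[:, 3]
selected = rng.binomial(1, 1 / (1 + np.exp(-logit)))     # synthetic selection outcome

models = {
    "penalized (L1) logistic regression":
        make_pipeline(StandardScaler(), LogisticRegression(penalty="l1", solver="liblinear", C=0.5)),
    "random forest": RandomForestClassifier(n_estimators=300, random_state=0),
}
for name, model in models.items():
    auc = cross_val_score(model, X, selected, cv=5, scoring="roc_auc").mean()
    print(f"{name}: cross-validated AUC = {auc:.2f}")
```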
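A compact sketch of the DACE workflow summarized in abstract 25: draw a space-filling Latin hypercube design, run the "simulation" at those points, fit a Gaussian process surrogate, assess it with k-fold cross-validated RMSE, and then reuse the surrogate for a cheap Monte Carlo study. The expensive astrodynamics simulation is replaced by a stand-in analytic function, and the bounds, kernel, and sample sizes are illustrative.

```python
import numpy as np
from scipy.stats import qmc
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import KFold

def simulation(x):                                       # stand-in for an expensive trajectory run
    return np.sin(3 * x[:, 0]) * np.exp(-x[:, 1]) + 0.05 * x[:, 2] ** 2

sampler = qmc.LatinHypercube(d=3, seed=0)                # space-filling LHS design
X = qmc.scale(sampler.random(n=80), [0, 0, -2], [2, 3, 2])
y = simulation(X)

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
rmse = []
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    gp.fit(X[train], y[train])
    pred = gp.predict(X[test])
    rmse.append(np.sqrt(np.mean((pred - y[test]) ** 2)))
print(f"5-fold CV RMSE of the surrogate: {np.mean(rmse):.4f}")

# The fitted surrogate can then stand in for the simulation in a large Monte Carlo study.
X_mc = qmc.scale(qmc.LatinHypercube(d=3, seed=1).random(10000), [0, 0, -2], [2, 3, 2])
mean_pred, sd_pred = gp.predict(X_mc, return_std=True)
print(f"surrogate-based output mean = {mean_pred.mean():.3f}")
```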