Taking Down a Turret: Introduction to Cyber Operational Test and Evaluation
Presented by: IDA OED Cyber Lab
Cyberattacks are in the news every day, from data breaches of banks and stores to ransomware attacks shutting down city governments and delaying school years. In this mini-tutorial, we introduce key cybersecurity concepts and methods for conducting cybersecurity test and evaluation. We walk you through a live demonstration of a cyberattack and provide real-world examples of each major step we take. The demonstration shows an attacker gaining command and control of a Nerf turret. We leverage tools commonly used by red teams to explore an attack scenario involving phishing, network scanning, password cracking, pivoting, and finally creating a mission effect. We also provide a defensive view and analytics that show artifacts left by the attack path.
May 15, 2020 1:00 PM-2:30 PM Eastern Standard Time
Connecting Software Reliability Growth Models to Software Defect Tracking – Lance Fiondella, University of Massachusetts
Co-Author: Melanie Luperon.
Most software reliability growth models track only defect discovery. A practical concern, however, is the removal of high-severity defects, which is often assumed to occur instantaneously. More recently, several defect removal models have been formulated as differential equations in terms of the number of defects discovered but not yet resolved and the rate of resolution. The limitation of this approach is that it does not take into consideration the data contained in a defect tracking database.
This talk describes our recent efforts to analyze data from a NASA program. Two methods to model defect resolution are developed, namely (i) distributional and (ii) Markovian approaches. The distributional approach employs the times between defect discovery and resolution to characterize the mean resolution time, and derives a software defect resolution model from the corresponding software reliability growth model used to track defect discovery. The Markovian approach develops a state model from the stages of the software defect lifecycle, as well as a transition probability matrix and the distributions for each transition, providing a semi-Markov model. Both the distributional and Markovian approaches employ a censored estimation technique to identify the maximum likelihood estimates, in order to handle the case where some but not all of the defects discovered have been resolved. Furthermore, we apply a hypothesis test to determine whether a first- or second-order Markov chain best characterizes the defect lifecycle. Our results indicate that a first-order Markov chain was sufficient to describe the data considered and that the Markovian approach achieves only modest improvements in predictive accuracy, suggesting that the simpler distributional approach may be sufficient to characterize the software defect resolution process during test. The practical inferences of such models include an estimate of the time required to discover and remove all defects.
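The censored-likelihood idea behind the distributional approach can be sketched in a few lines. The snippet below is an illustrative simplification, not the talk's actual model: it assumes exponentially distributed resolution times and uses made-up data, estimating the mean resolution time when some defects remain open (right-censored) at the end of observation.

```python
import math

# Illustrative data: times (days) from defect discovery to resolution.
# None marks a defect still open, right-censored at `censor_time`.
resolution_times = [2.0, 5.0, 1.5, 8.0, 3.0, None, None]
censor_time = 10.0  # open defects have been unresolved at least this long

# Exponential MLE with right censoring: rate = (# resolved) / (total exposure),
# where each censored defect contributes its censoring time to the exposure.
resolved = [t for t in resolution_times if t is not None]
n_censored = len(resolution_times) - len(resolved)
exposure = sum(resolved) + n_censored * censor_time
rate_hat = len(resolved) / exposure
mean_resolution = 1.0 / rate_hat

# Log-likelihood at the MLE: density terms for resolved defects,
# survival-function terms for censored ones.
loglik = sum(math.log(rate_hat) - rate_hat * t for t in resolved) \
         - rate_hat * n_censored * censor_time
print(f"estimated mean resolution time: {mean_resolution:.2f} days")
```

Ignoring the censored defects would bias the mean resolution time downward, since the slowest defects are exactly the ones still open; counting their open time as exposure corrects for this.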
Development and Analytic Process Used to Develop a 3-Dimensional Graphical User Interface System for Baggage Screening – Charles McKee, Taverene Analytics LLC
The Transportation Security Administration (TSA) uses several types of screening technologies for the purposes of threat detection at airports and federal facilities across the country. Computed Tomography (CT) systems afford TSA personnel in the Checked Baggage setting a quick and effective method to screen property with less need to physically inspect property due to their advanced imaging capabilities. Recent reductions in size, cost, and processing speed for CT systems spurred an interest in incorporating these advanced imaging systems at the Checkpoint to increase the speed and effectiveness of scanning personal property as well as passenger satisfaction during travel. The increase in speed and effectiveness of scanning personal property with fewer physical property inspections stems from several qualities native to CT imaging that current 2D X-Ray based Advanced Technology 2 (AT2) systems typically found at Checkpoints lack. Specifically, the CT offers rotatable 3D images and advanced identification algorithms that allow TSA personnel to more readily identify items requiring review on-screen without requesting that passengers remove them from their bag.
The introduction of CT systems at domestic airports led to the identification of a few key Human Factors issues, however. Several vendors used divergent strategies to produce the CT systems introduced at domestic airport Checkpoints. Each system offered users different 3D visualizations, informational displays, and identification algorithms, presenting a range of views, tools, layouts, and material colorizations for users to sort through. This lack of similarity across systems, and the potential for multiple systems to operate at a single airport, resulted in unnecessarily complex training, testing, certification, and operating procedures. In response, a group of human factors engineers (HFEs) was tasked with creating requirements for a single common Graphical User Interface (CGUI) for all CT systems that would provide a standard look, feel, and interaction across systems.
We will discuss the development and analytic process used to 1) gain an understanding of the tasks that CT systems must accomplish at the Checkpoint (i.e., focus groups), 2) identify which tools Transportation Security Officers (TSOs) tend to use and why (i.e., focus groups and rank-ordered surveys), and 3) determine how changes during iterative testing affect performance (i.e., A/B testing while collecting response time, accuracy, and tool usage). The data collection effort described here resulted in a set of requirements that produced a highly usable CT interface, as measured by several valid and reliable objective and subjective measures. Perceptions of the CGUI’s usability (e.g., via the System Usability Scale; SUS) were aligned with TSO performance (i.e., Pd, PFA, and Throughput) during use of the CGUI prototype. Iterative testing demonstrated an increase in the SUS score and performance measures for each revision of the requirements used to produce the common CT interface. User perspectives, feedback, and performance data also offered insight toward determining the future efforts needed to increase user acceptance of the redesigned CT interface. Increasing user acceptance offers TSA the opportunity to improve user engagement, reduce errors, and increase the likelihood that the system will stay in service without a mandate.
A HellerVVA Problem: The Catch-22 for Simulated Testing of Fully Autonomous Systems – Daniel Porter, IDA
In order to verify, validate, and accredit (VV&A) a simulation environment for testing the performance of an autonomous system, testers must examine more than just sensor physics—they must also provide evidence that the environmental features which drive system decision making are represented at all. When systems are black boxes, though, these features are fundamentally unknown, necessitating that we first test to discover them. An umbrella of approaches known as “model induction” offers ways to demystify black boxes and obtain models of their decision making, but the current state of the art assumes testers can input large quantities of operationally relevant data. When systems only make passive perceptual decisions or operate in purely virtual environments, these assumptions are typically met. However, this will not be the case for black-box, fully autonomous systems. These systems can make decisions about the information they acquire—which cannot be changed in pre-recorded passive inputs—and a major reason to obtain a decision model is to VV&A the simulation environment—preventing the valid use of a virtual environment to obtain the model. Furthermore, the current consensus is that simulation will be used to obtain limited safety releases for live testing. This creates a catch-22 of needing data to obtain the decision model, but needing the decision model to validly obtain the data. In this talk, we provide a brief overview of this challenge and possible solutions.
KC-46A Adaptive Relevant Testing Strategies to Enable Incremental Evaluation – J. Quinn Stank, AFOTEC
The DoD’s challenge to provide capability at the “Speed of Relevance” has generated many new strategies to adapt to rapid development and acquisition. As a result, Operational Test Agencies (OTA) have had to adjust their test processes to accommodate rapid, but incremental delivery of capability to the warfighter. The Air Force Operational Test and Evaluation Center (AFOTEC) developed the Adaptive Relevant Testing (ART) concept to answer the challenge. In this session, AFOTEC Test Analysts will brief examples and lessons learned from implementing the ART principles on the KC-46A acquisition program to identify problems early and promote the delivery of individual capabilities as they are available to test. The AFOTEC goal is to accomplish these incremental tests while maintaining a rigorous statistical evaluation in a relevant and timely manner. This discussion will explain in detail how the KC-46A Initial Operational Test and Evaluation (IOT&E) was accomplished in a unique way that allowed the test team to discover, report on, and correct major system deficiencies much earlier than traditional methods.
D-Optimally Based Sequential Test Method for Ballistic Limit Testing – Leonard Lombardo, U.S. Army Aberdeen Test Center
Ballistic limit testing of armor is testing in which a kinetic energy threat is shot at armor at varying velocities. The striking velocity, and whether the threat completely or partially penetrated the armor, are recorded. The probability of penetration is modeled as a function of velocity using a generalized linear model. The parameters of the model serve as inputs to MUVES, a DoD software tool used to analyze weapon system vulnerability and munition lethality.
Generally, the probability of penetration is assumed to be monotonically increasing with velocity. However, in cases in which there is a change in penetration mechanism, such as the shatter gap phenomenon, the probability of penetration can no longer be assumed to be monotonically increasing, and a more complex model is necessary. One such model was developed by Chang and Bodt to describe the probability of penetration as a function of velocity over a velocity range in which there are two penetration mechanisms.
This paper proposes a D-optimally based sequential shot selection method to efficiently select threat velocities during testing. Two cases are presented: the case in which the penetration mechanism for each shot is known (via high-speed or post shot x-ray) and the case in which the penetration mechanism is not known. This method may be used to support an improved evaluation of armor performance for cases in which there is a change in penetration mechanism.
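As a rough illustration of the underlying idea (not the authors' actual method), the sketch below selects the next shot velocity by maximizing the determinant of the Fisher information of a simple logistic model of penetration probability versus velocity. The coefficients, candidate velocities, and past shots are all hypothetical.

```python
import math

def info_matrix(velocities, a, b):
    """2x2 Fisher information for a logistic model logit(p) = a + b*v."""
    m00 = m01 = m11 = 0.0
    for v in velocities:
        p = 1.0 / (1.0 + math.exp(-(a + b * v)))
        w = p * (1.0 - p)          # binomial variance weight at this velocity
        m00 += w
        m01 += w * v
        m11 += w * v * v
    return m00, m01, m11

def next_shot(past, candidates, a, b):
    """Pick the candidate velocity maximizing det of the updated information."""
    best_v, best_det = None, -1.0
    for v in candidates:
        m00, m01, m11 = info_matrix(past + [v], a, b)
        det = m00 * m11 - m01 * m01
        if det > best_det:
            best_v, best_det = v, det
    return best_v

# Hypothetical scenario: current estimates put the V50 near 800 m/s.
a, b = -40.0, 0.05          # logit(p) = a + b*v, so p = 0.5 at v = 800
past_shots = [750.0, 850.0]
candidates = [700.0, 760.0, 800.0, 840.0, 900.0]
v_next = next_shot(past_shots, candidates, a, b)
print(f"next shot velocity: {v_next} m/s")
```

In a sequential test the model would be refit after each shot and the candidate set re-evaluated; handling the two-mechanism (shatter gap) model would require a larger information matrix, which this sketch does not attempt.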
A Validation Case Study: The Environment Centric Weapons Analysis Facility – Elliot Bartis, IDA
Reliable modeling and simulation (M&S) allows the undersea warfare community to understand torpedo performance in scenarios that could never be created in live testing, and do so for a fraction of the cost of an in-water test. The Navy hopes to use the Environment Centric Weapons Analysis Facility (ECWAF), a hardware-in-the-loop simulation, to predict torpedo effectiveness and supplement live operational testing. In order to trust the model’s results, the T&E community has applied rigorous statistical design of experiments techniques to both live and simulation testing. As part of ECWAF’s two-phased validation approach, we ran the M&S experiment with the legacy torpedo and developed an empirical emulator of the ECWAF using logistic regression. Comparing the emulator’s predictions to actual outcomes from live test events supported the test design for the upgraded torpedo. This talk overviews the ECWAF’s validation strategy, decisions that have put the ECWAF on a promising path, and the metrics used to quantify uncertainty.
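The emulator-validation idea can be illustrated with a toy version: fit a logistic-regression emulator to simulated outcomes, then score its predictions against held-out "live" outcomes. Everything below (the scenario factor x, the coefficients, the data) is synthetic and is not drawn from ECWAF.

```python
import math
import random

def fit_logistic(xs, ys, lr=0.1, steps=3000):
    """Fit p(success) = sigmoid(b0 + b1*x) by gradient ascent on the
    average log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += (y - p)
            g1 += (y - p) * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

random.seed(0)
# Synthetic "M&S" runs: success probability rises with a scenario factor x.
sim_x = [random.uniform(-2, 2) for _ in range(300)]
sim_y = [1 if random.random() < 1 / (1 + math.exp(-(0.5 + 1.5 * x))) else 0
         for x in sim_x]
b0, b1 = fit_logistic(sim_x, sim_y)

# "Live" outcomes from the same ground truth; score the emulator's
# predicted probabilities with the Brier score (lower is better).
live_x = [random.uniform(-2, 2) for _ in range(50)]
live_y = [1 if random.random() < 1 / (1 + math.exp(-(0.5 + 1.5 * x))) else 0
          for x in live_x]
brier = sum((1 / (1 + math.exp(-(b0 + b1 * x))) - y) ** 2
            for x, y in zip(live_x, live_y)) / len(live_x)
print(f"emulator coefficients: {b0:.2f}, {b1:.2f}; Brier score: {brier:.3f}")
```

A real validation like ECWAF's compares the emulator against structured live test events designed for that purpose; the point of the toy is only that the emulator is judged on data it never saw.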
May 1, 2020 12:00 PM-1:30 PM Eastern Standard Time
The Role of Statistical Engineering in Creating Solutions for Complex Opportunities – Geoff Vining, Virginia Tech
Statistical engineering is the art and science of addressing complex organizational opportunities with data. The span of statistical engineering ranges from the “problems that keep CEOs awake at night” to analysts dealing with the results of the experimentation necessary for the success of their most current project. This talk introduces statistical engineering and its full spectrum of approaches to complex opportunities with data. Its purpose is to set the stage for the two specific case studies that follow it. Too often, people lose sight of the big picture of statistical engineering through too narrow a focus on the specific case studies. Too many people walk away thinking, “This is what I have been doing for years. It is simply good applied statistics.” These people fail to see what we can learn from each other by sharing our experiences, teaching other people how to create solutions more efficiently and effectively. It is this big picture that is the focus of this talk.
Statistical Engineering for Service Life Prediction of Polymers – Adam Pintar, National Institute of Standards and Technology
Economically efficient selection of materials depends on knowledge of not just a material's immediate properties, but also the durability of those properties. For example, when selecting building joint sealant, the initial properties are critical to successful design, but these properties change over time and can result in failure in the application (buildings leak, glass falls). A NIST-led industry consortium has a research focus on developing new measurement science to determine how the properties of the sealant change with environmental exposure. In this talk, the two-decade history of the NIST-led effort will be examined through the lens of Statistical Engineering, specifically its six phases: (1) identify the problem, (2) provide structure, (3) understand the context, (4) develop a strategy, (5) develop and execute tactics, and (6) identify and deploy a solution.
Phases 5 and 6 will be the primary focus of this talk, but all of the phases will be discussed. The tactics of phase 5 were often themselves multi-month or year research problems. Our approach to predicting outdoor degradation based only on accelerated weathering in the laboratory has been revised and improved many times over several years. In phase 6, because of NIST’s unique mission of promoting U.S. innovation and industrial competitiveness, the focus has been outward on technology transfer and the advancement of test standards. This may differ from industry and other government agencies where the focus may be improvement of processes inside of the organization.
Sequential Testing and Simulation Validation for Autonomous Systems – Jim Simpson, JK Analytics
Autonomous systems are expected to play a significant role in the next generation of DoD acquisition programs. New methods need to be developed and vetted, particularly for two groups we know well that will be facing the complexities of autonomy: a) test and evaluation, and b) modeling and simulation. For test and evaluation, statistical methods that are routinely and successfully applied throughout the DoD need to be adapted to be most effective for autonomy, and some of our practices need to be stressed. One such practice is sequential testing and analysis, which we illustrate as a way for testers to learn and improve incrementally. The other group needing to rethink its practices for autonomy is modeling and simulation; we propose some statistical methods appropriate for modeling and simulation validation for autonomous systems. We look forward to your comments and suggestions.
April 24, 2020 1:00 PM-2:30 PM Eastern Standard Time
The Science of Trust of Autonomous Unmanned Systems – Reed Young, Johns Hopkins University Applied Physics Laboratory
The world today is witnessing a significant investment in autonomy and artificial intelligence that most certainly will result in ever-increasing capabilities of unmanned systems. Driverless vehicles are a great example of systems that can make decisions and perform very complex actions. The reality, though, is that while what these systems are doing is well understood, how their intelligence engines generate the decisions behind those actions is not understood well at all. Therein lies the underlying challenge of accomplishing formal test and evaluation of these systems and, relatedly, of engendering trust in their performance. This presentation will outline and define the problem space, discuss those challenges, and offer solution constructs.
Can AI Predict Human Behavior? – Dustin Burns, Exponent
Given the rapid increase of novel machine learning applications in cybersecurity and people analytics, there is significant evidence that these tools can give meaningful and actionable insights. Even so, great care must be taken to ensure that automated decision-making tools are deployed in such a way as to mitigate bias in predictions and promote security of user data. In this talk, Dr. Burns will take a deep dive into an open source data set in the area of people analytics, demonstrating the application of basic machine learning techniques while discussing limitations and potential pitfalls in using an algorithm to predict human behavior. In the end, Dustin will draw a comparison between predicting a person's propensity for behaviors such as becoming an insider threat and the way assisted-diagnosis tools are used in medicine to predict the development or recurrence of illnesses.
April 17, 2020 2:30 PM-3:30 PM Eastern Standard Time
Adoption Challenges in Artificial Intelligence and Machine Learning for Analytic Work Environments – Laura McNamara, Sandia National Laboratories
Session Abstract Coming Soon
The Role of Uncertainty Quantification in Machine Learning – David Stracuzzi, Sandia National Laboratories
Uncertainty is an inherent, yet often under-appreciated, component of machine learning and statistical modeling. Data-driven modeling often begins with noisy data from error-prone sensors collected under conditions for which no ground-truth can be ascertained. Analysis then continues with modeling techniques that rely on a myriad of design decisions and tunable parameters. The resulting models often provide demonstrably good performance, yet they illustrate just one of many plausible representations of the data – each of which may make somewhat different predictions on new data.
This talk provides an overview of recent, application-driven research at Sandia Labs that considers methods for (1) estimating the uncertainty in the predictions made by machine learning and statistical models, and (2) using the uncertainty information to improve both the model and downstream decision making. We begin by clarifying the data-driven uncertainty estimation task and identifying sources of uncertainty in machine learning. We then present results from applications in both supervised and unsupervised settings. Finally, we conclude with a summary of lessons learned and critical directions for future work.
April 10, 2020 1:00 PM-2:30 PM Eastern Standard Time
I Have the Power! Power Calculation in Complex (and Not So Complex) Modeling Situations
Part 1 – Caleb King, JMP Division, SAS Institute Inc.
Invariably, any analyst who has been in the field long enough has heard the dreaded questions: “Is X number of samples enough? How much data do I need for my experiment?” Ulterior motives aside, any investigation involving data must ultimately answer the question of “How many?” to avoid risking either insufficient data to detect a scientifically significant effect or having too much data leading to a waste of valuable resources. This can become particularly difficult when the underlying model is complex (e.g. longitudinal designs with hard-to-change factors, time-to-event response with censoring, binary responses with non-uniform test levels, etc.). Even in the supposedly simpler case of categorical factors, where run size is often chosen using a lower bound power calculation, a simple approach can mask more “powerful” techniques. In this tutorial, we will spend the first half exploring how to use simulation to perform power calculations in complex modeling situations drawn from relevant defense applications. Techniques will be illustrated using both R and JMP Pro. In the second half, we will investigate the case of categorical factors and illustrate how treating the unknown effects as random variables induces a distribution on statistical power, which can then be used as a new way to assess experimental designs.
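The simulation-based power calculation described above can be sketched in plain Python (the tutorial itself uses R and JMP Pro). This toy example estimates the power of a two-sample t-test by repeatedly generating data under an assumed effect size and counting rejections; for simplicity it uses a normal approximation to the critical value.

```python
import random
import statistics

def simulated_power(n, effect, sd=1.0, n_sims=2000, seed=1):
    """Estimate two-sample t-test power by Monte Carlo simulation."""
    random.seed(seed)
    rejections = 0
    for _ in range(n_sims):
        x = [random.gauss(0.0, sd) for _ in range(n)]
        y = [random.gauss(effect, sd) for _ in range(n)]
        # Pooled two-sample t statistic (equal group sizes)
        sp2 = (statistics.variance(x) + statistics.variance(y)) / 2.0
        t = (statistics.mean(y) - statistics.mean(x)) / (2.0 * sp2 / n) ** 0.5
        # Normal approximation to the alpha = 0.05 two-sided critical value
        if abs(t) > 1.96:
            rejections += 1
    return rejections / n_sims

# How many samples per group to detect a one-standard-deviation shift?
for n in (10, 20, 30):
    print(n, simulated_power(n, effect=1.0))
```

The same loop structure extends to the complex cases named in the abstract (censored responses, hard-to-change factors): swap in the appropriate data generator and fitting step, and the rejection rate remains the power estimate.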
Instructor Bio: Caleb King is a Research Statistician Tester for the DOE platform in the JMP software. He received his MS and PhD in Statistics from Virginia Tech and worked for three years as a statistical scientist at Sandia National Laboratories prior to arriving at JMP. His areas of expertise include optimal design of experiments, accelerated testing, reliability analysis, and small-sample theory.
Part 2 – Ryan Lekivetz, JMP Division, SAS Institute Inc.
Instructor Bio: Ryan Lekivetz is a Senior Research Statistician Developer for the JMP Division of SAS where he implements features for the Design of Experiments platforms in JMP software.
April 2, 2020 1:00 PM-2:30 PM Eastern Standard Time
Note: to gain access to the recording, users will be required to register. Once approved by SmartUQ, access will be granted.
Introduction to Uncertainty Quantification for Practitioners and Engineers – Gavin Jones, SmartUQ
Uncertainty is an inescapable reality that can be found in nearly all types of engineering analyses. It arises from sources like measurement inaccuracies, material properties, boundary and initial conditions, and modeling approximations. Uncertainty Quantification (UQ) is a systematic process that puts error bands on results by incorporating real-world variability and probabilistic behavior into engineering and systems analysis. UQ answers the question: what is likely to happen when the system is subjected to uncertain and variable inputs? Answering this question facilitates significant risk reduction, robust design, and greater confidence in engineering decisions. Modern UQ techniques use powerful statistical models to map the input-output relationships of the system, significantly reducing the number of simulations or tests required to get accurate answers.
This tutorial will present common UQ processes that operate within a probabilistic framework. These include statistical design of experiments, statistical emulation methods used to model the relationship between simulation inputs and responses, and statistical calibration for validating and tuning a model to better represent test results. Examples from different industries will be presented to illustrate how the covered processes can be applied to engineering scenarios. This is purely an educational tutorial and will focus on the concepts, methods, and applications of probabilistic analysis and uncertainty quantification; SmartUQ software will only be used to illustrate the methods and examples presented. This is an introductory tutorial designed for practitioners and engineers with little to no formal statistical training. However, statisticians and data scientists may also benefit from seeing the material presented from a practical-use perspective rather than a purely technical one.
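One of the DOE building blocks mentioned above, a Latin hypercube design, can be sketched briefly. This is a generic illustration of the sampling idea, not SmartUQ's implementation: each input is stratified into n equal bins, with exactly one design point per bin, and the bins are paired randomly across dimensions.

```python
import random

def latin_hypercube(n, dims, seed=42):
    """n-point Latin hypercube sample in the unit cube [0, 1]^dims."""
    random.seed(seed)
    cols = []
    for _ in range(dims):
        perm = list(range(n))      # one bin index per point, in random order
        random.shuffle(perm)
        # Jitter each point uniformly within its bin [k/n, (k+1)/n)
        cols.append([(k + random.random()) / n for k in perm])
    # Assemble rows: one design point per row
    return [[cols[d][i] for d in range(dims)] for i in range(n)]

pts = latin_hypercube(8, 2)
for p in pts:
    print([round(c, 3) for c in p])
```

Compared with simple random sampling, this guarantees every input's range is covered evenly even with few runs, which is why it is a common starting point for building simulation emulators.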
There are no prerequisites other than an interest in UQ. Attendees will gain an introductory understanding of Probabilistic Methods and Uncertainty Quantification, basic UQ processes used to quantify uncertainties, and the value UQ can provide in maximizing insight, improving design, and reducing time and resources.
Instructor Bio: Gavin Jones, Sr. SmartUQ Application Engineer, is responsible for performing simulation and statistical work for clients in aerospace, defense, automotive, gas turbine, and other industries. He is also a key contributor in SmartUQ’s Digital Twin/Digital Thread initiative. Mr. Jones received a B.S. in Engineering Mechanics and Astronautics and a B.S. in Mathematics from the University of Wisconsin-Madison.
March 31, 2020 1:00 PM-2:30 PM Eastern Standard Time
A previously recorded version of this seminar is available for viewing:
A Practical Introduction To Gaussian Process Regression – Robert “Bobby” Gramacy, Virginia Tech
Abstract: Gaussian process regression is ubiquitous in spatial statistics, machine learning, and the surrogate modeling of computer simulation experiments. Fortunately, its prowess as an accurate predictor, along with an appropriate quantification of uncertainty, does not derive from difficult-to-understand methodology or cumbersome implementation. We will cover the basics, and provide a practical tool-set ready to be put to work in diverse applications. The presentation will involve accessible slides authored in Rmarkdown, with reproducible examples spanning bespoke implementation to add-on packages.
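A minimal, dependency-free sketch of the core computation (the talk's own examples are in R): with noise-free training data and a squared-exponential kernel, the GP posterior mean at a new point is k*(x)ᵀ K⁻¹ y. All settings below are illustrative.

```python
import math

def rbf(x1, x2, length=1.0):
    """Squared-exponential (RBF) covariance between two scalar inputs."""
    return math.exp(-0.5 * ((x1 - x2) / length) ** 2)

def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    A = [row[:] for row in A]
    rhs = rhs[:]
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(A[r][i]))
        A[i], A[piv] = A[piv], A[i]
        rhs[i], rhs[piv] = rhs[piv], rhs[i]
        for r in range(i + 1, n):
            f = A[r][i] / A[i][i]
            rhs[r] -= f * rhs[i]
            for c in range(i, n):
                A[r][c] -= f * A[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (rhs[i] - sum(A[i][c] * x[c] for c in range(i + 1, n))) / A[i][i]
    return x

def gp_predict(x_train, y_train, x_new, noise=1e-6):
    """Posterior mean of a zero-mean GP with RBF kernel at x_new."""
    K = [[rbf(xi, xj) + (noise if i == j else 0.0)
          for j, xj in enumerate(x_train)] for i, xi in enumerate(x_train)]
    alpha = solve(K, y_train)                 # K^{-1} y
    return sum(rbf(x_new, xi) * a for xi, a in zip(x_train, alpha))

# Noise-free observations of sin(x); predict between the training points.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [math.sin(x) for x in xs]
print(gp_predict(xs, ys, 2.5))    # compare with sin(2.5) ≈ 0.599
```

In practice one would use a maintained package (e.g., the R packages mentioned in the bio, or scikit-learn in Python), which also returns the posterior variance and fits the kernel hyperparameters; the small jitter `noise` is the usual trick for numerical stability.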
Instructor Bio: Robert Gramacy is a Professor of Statistics in the College of Science at Virginia Polytechnic Institute and State University (Virginia Tech). Previously he was an Associate Professor of Econometrics and Statistics at the Booth School of Business, and a fellow of the Computation Institute at The University of Chicago. His research interests include Bayesian modeling methodology, statistical computing, Monte Carlo inference, nonparametric regression, sequential design, and optimization under uncertainty. Professor Gramacy is a computational statistician. He specializes in areas of real-data analysis where the ideal modeling apparatus is impractical, or where the current solutions are inefficient and thus skimp on fidelity. Such endeavors often require new models, new methods, and new algorithms. His goal is to be impactful in all three areas while remaining grounded in the needs of a motivating application. His aim is to release general-purpose software for consumption by the scientific community at large, not only other statisticians. Professor Gramacy is the primary author on six R packages available on CRAN, two of which (tgp and monomvn) have won awards from statistical and practitioner communities.