April 27th Agenda

7:30 AM – 8:30 AM

Registration / Check-in

8:30 AM – 8:40 AM

Room A+B

Virtual Session

Opening Remarks

Norton Schwartz

(IDA)


Norton A. Schwartz serves as President of the Institute for Defense Analyses (IDA), a nonprofit corporation operating in the public interest. IDA manages three Federally Funded Research and Development Centers that answer the most challenging U.S. security and science policy questions with objective analysis, leveraging extraordinary scientific, technical, and analytic expertise. General Schwartz (U.S. Air Force, retired) directs the activities of the more than 1,000 scientists and technologists IDA employs. His career of service and leadership spans more than five decades. He was most recently President and CEO of Business Executives for National Security (BENS); during his six-year tenure at BENS, he was also a member of IDA’s Board of Trustees. Prior to retiring from the U.S. Air Force, General Schwartz served as the 19th Chief of Staff of the U.S. Air Force from 2008 to 2012. He previously held senior joint positions as Director of the Joint Staff and as Commander of the U.S. Transportation Command. He began his service as a pilot with the airlift evacuation out of Vietnam in 1975. General Schwartz is a U.S. Air Force Academy graduate and holds a master’s degree in business administration from Central Michigan University. He is also an alumnus of the Armed Forces Staff College and the National War College. He is a member of the Council on Foreign Relations and a 1994 Fellow of the Massachusetts Institute of Technology’s Seminar XXI. General Schwartz has been married to Suzie since 1981.

8:40 AM – 9:20 AM

Room A+B

Virtual Session

Keynote 1

Wendy Masiello

(President, Wendy Mas Consulting LLC)


Wendy Masiello is an independent consultant, having retired from the United States Air Force as a Lieutenant General. She is president of Wendy Mas Consulting, LLC and serves as an independent director for KBR Inc., EURPAC Service, Inc., and StandardAero (owned by The Carlyle Group). She is also a Director on the Procurement Round Table and the National Rebuilding Together Board, President-elect of the National Contract Management Association (NCMA) Board, Chair of the Rawls Advisory Council for Texas Tech University’s College of Business, and serves on the Air Force Studies Board under the National Academies of Sciences, Engineering, and Medicine.

Prior to her July 2017 retirement, she was Director of the Defense Contract Management Agency, where she oversaw a $1.4 billion budget and 12,000 people worldwide providing oversight of 20,000 contractors performing 340,000 contracts with more than $2 trillion in contract value. During her 36-year career, General Masiello also served as Deputy Assistant Secretary (Contracting), Office of the Assistant Secretary of the Air Force for Acquisition, and as Program Executive Officer for the Air Force’s $65 billion Service Acquisition portfolio. She also commanded the 96th Air Base Wing at Edwards Air Force Base and deployed to Iraq and Afghanistan as Principal Assistant Responsible for Contracting for Forces.

General Masiello’s medals and commendations include the Defense Superior Service Medal, Distinguished Service Medal and the Bronze Star.  She earned her Bachelor of Business Administration degree from Texas Tech University, a Master of Science degree in logistics management from the Air Force Institute of Technology, a Master of Science degree in national resource strategy from the Industrial College of the Armed Forces, Fort Lesley J. McNair, Washington, D.C., and is a graduate of Harvard Kennedy School’s Senior Managers in Government.

General Masiello is a 2017 Distinguished Alum of Texas Tech University, was twice (2015 and 2016) named among Executive Mosaic’s Wash 100, the 2014 Greater Washington Government Contractor “Public Sector Partner of the Year,” and recognized by Federal Computer Week as one of “The 2011 Federal 100”.  She is an NCMA Certified Professional Contract Manager and an NCMA Fellow.

9:20 AM – 10:00 AM

Room A+B

Virtual Session

Keynote 2

Jill Marlowe

(Digital Transformation Officer, NASA)


Jill Marlowe is NASA’s first Digital Transformation Officer, leading the Agency to conceive, architect, and accelerate enterprise digital solutions that transform NASA’s work, workforce, and workplace to achieve bolder missions faster and more affordably than ever before. Her responsibilities include refinement and integration of NASA’s digital transformation strategy, plans, and policies, and coordination of implementation activities across the NASA enterprise in six strategic thrusts: data, collaboration, modeling, artificial intelligence/machine learning, process transformation, and culture and workforce.

Prior to this role, Ms. Marlowe was the Associate Center Director, Technical, at NASA’s Langley Research Center in Hampton, Virginia.  Ms. Marlowe led strategy and transformation of the center’s technical capabilities to assure NASA’s future mission success. In this role, she focused on accelerating Langley’s internal and external collaborations as well as the infusion of digital technologies critical for the center to thrive as a modern federal laboratory in an ever more digitally-enabled, hyper-connected, fast-paced, and globally-competitive world.

In 2008, Ms. Marlowe was selected to the Senior Executive Service as the Deputy Director for Engineering at NASA Langley, and went on to serve as the center’s Engineering Director and Research Director. Through the increasing responsibility and scope of these roles, Ms. Marlowe gained a broad range of leadership experience, including: running large organizations of 500 – 1,000 people to deliver solutions to every one of NASA’s mission directorates; sustaining and morphing a diverse portfolio of technical capabilities spanning aerosciences, structures & materials, intelligent flight systems, space flight instruments, and entry, descent & landing systems; assuring safe operation of more than two million square feet of laboratories and major facilities; architecting partnerships with universities, industry, and other government agencies to leverage and advance NASA’s goals; project management of technology development and flight test experiments; and, throughout all of this, incentivizing innovation in very different organizational cultures spanning foundational research, technology invention, flight design and development engineering, and operations. She began her NASA career in 1990 as a structural analyst supporting the development of numerous space flight instruments to characterize Earth’s atmosphere.

Ms. Marlowe’s formal education includes a Bachelor of Science degree in Aerospace and Ocean Engineering from Virginia Tech in 1988, a Master of Science in Mechanical Engineering from Rensselaer Polytechnic Institute in 1990, and a Degree of Engineer in Civil and Environmental Engineering at George Washington University in 1997.  She serves on advisory boards for Virginia Tech’s Aerospace & Ocean Engineering Department, Sandia National Laboratory’s Engineering Sciences Research Foundation, and Cox Communications’ Digitally Inclusive Communities (regional). She is the recipient of two NASA Outstanding Leadership Medals, was named the 2017 NASA Champion of Innovation, is an AIAA Associate Fellow, and was inducted in 2021 into the Virginia Tech Academy of Aerospace & Ocean Engineering Excellence.  She lives in Yorktown, Virginia, with her husband, Kevin, along with the youngest of their three children and two energetic labradoodles.

10:00 AM – 10:20 AM

Break

10:20 AM – 12:20 PM: Parallel Sessions

Room A



Virtual Session

Session 1A: T&E for AI, ML, and HMT


Session Chair: Chad Bieber, JHU/APL

Test and Evaluation Framework for AI Enabled Systems
Brian Woolley (Joint Artificial Intelligence Center)

Autonomous and artificial intelligence (AI) systems are emerging at a dizzying pace. Such systems promise to expand the capacity and capability of individuals by delegating increasing levels of decision making down to the agent level. In this way, operators can set high-level objectives for multiple vehicles or agents and need only intervene when alerted to anomalous conditions. Test and evaluation efforts at the Joint AI Center are focused on exercising a prescribed test strategy for AI-enabled systems. This new AI T&E Framework recognizes the inherent complexity that follows from incorporating dynamic decision makers into a system (or into a system-of-systems). The AI T&E Framework is composed of four high-level types of testing that examine an AI-enabled system from different angles to provide as complete a picture as possible of the system’s capabilities and limitations: algorithmic, system integration, human-system integration, and operational tests. These testing categories provide stakeholders with appropriate qualitative and quantitative assessments that bound the system’s use cases in a meaningful way. The algorithmic tests characterize the AI models themselves against metrics for effectiveness, security, robustness, and responsible AI principles. The system integration tests exercise the system itself to ensure it operates reliably, functions correctly, and is compatible with other components. The human-machine tests ask what human operators think of the system, whether they understand what the system is telling them, and whether they trust the system under appropriate conditions. All of this culminates in an operational test that evaluates how the system performs in a realistic environment with realistic scenarios and adversaries. Interestingly, counter to traditional approaches, this framework is best applied during and throughout the development of an AI-enabled system. Our experience is that programs that conduct independent T&E alongside development do not suffer delays, but instead benefit from the feedback and insights gained from incremental and iterative testing, which leads to the delivery of a better overall capability.



Test & Evaluation of ML Models
Anna Rubinstein (MORSE Corporation)

Machine Learning models have been incredibly impactful over the past decade; however, testing those models and comparing their performance has remained challenging and complex. In this presentation, I will demonstrate novel methods for measuring the performance of computer vision object detection models, including running those models against still imagery and video. The presentation will start with an introduction to the pros and cons of various metrics, including traditional metrics like precision, recall, average precision, mean average precision, F1, and F-beta. The talk will then discuss more complex topics such as tracking metrics, handling multiple object classes, visualizing multi-dimensional metrics, and linking metrics to operational impact. Anecdotes will be shared discussing different types of metrics that are appropriate for different types of stakeholders, how system testing fits in, best practices for model integration, best practices for data splitting, and cloud vs on-prem compute lessons learned. The presentation will conclude by discussing what software libraries are available to calculate these metrics, including the MORSE-developed library Charybdis.
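
As a concrete illustration of the simpler metrics mentioned above, the short Python sketch below computes precision, recall, and F-beta from hypothetical detection counts; the numbers and function name are illustrative only and are not taken from the talk.

```python
# Minimal sketch: detection metrics from hypothetical confusion counts.
# Values below are illustrative only, not results from the talk.

def detection_metrics(tp: int, fp: int, fn: int, beta: float = 1.0):
    """Precision, recall, and F-beta for a single object class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    b2 = beta ** 2
    f_beta = ((1 + b2) * precision * recall / (b2 * precision + recall)
              if (precision + recall) else 0.0)
    return precision, recall, f_beta

if __name__ == "__main__":
    # e.g., a detector that found 80 of 100 objects with 20 false alarms
    p, r, f2 = detection_metrics(tp=80, fp=20, fn=20, beta=2.0)
    print(f"precision={p:.2f} recall={r:.2f} F2={f2:.2f}")
```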



A Systems Perspective on Bringing Reliability and Prognostics to Machine Learning
Tyler Cody (Virginia Tech National Security Institute)

Machine learning is being deployed into the real world, yet the body of knowledge on testing, evaluating, and maintaining machine learning models is overwhelmingly centered on component-level analysis. But machine learning and engineered systems are tightly coupled, as evidenced by the extreme sensitivity of ML to changes in system structure and behavior. Thus, reliability, prognostics, and other efforts related to test and evaluation for ML cannot be divorced from the system. That is, machine learning and its system go hand-in-hand; any other way makes an unjustified assumption about the existence of an independent variable. This talk explores foundational reasons for this phenomenon and the foundational challenges it poses to existing practice. Cases in machine health monitoring and in cyber defense are used to motivate the position that machine learning is not independent of physical changes to the system with which it interacts, and ML is not independent of the adversaries it defends against. By acknowledging these couplings, systems and mission engineers can better align test and evaluation practices with the fundamental character of ML.



Topological Modeling of Human-Machine Teams
Jay Wilkins and Caitlan Fealing (IDA)

A Human-Machine Team (HMT) is a group of agents consisting of at least one human and at least one machine, all functioning collaboratively towards one or more common objectives. As industry and defense find more helpful, creative, and difficult applications of AI-driven technology, the need to effectively and accurately model, simulate, test, and evaluate HMTs will continue to grow and become even more essential. Going along with that growing need, new methods are required to evaluate whether a human-machine team is performing effectively as a team in testing and evaluation scenarios. You cannot predict team performance from knowledge of the individual team agents, alone; interaction between the humans and machines – and interaction between team agents, in general – increases the problem space and adds a measure of unpredictability. Collective team or group performance, in turn, depends heavily on how a team is structured and organized, as well as the mechanisms, paths, and substructures through which the agents in the team interact with one another – i.e. the team’s topology. With the tools and metrics for measuring team structure and interaction becoming more highly developed in recent years, we will propose and discuss a practical, topological HMT modeling framework that not only takes into account but is actually built around the team’s topological characteristics, while still utilizing the individual human and machine performance measures.




Room B



Virtual Session

Session 1B: Modeling & Simulation


Session Chair: Elizabeth Gregory, NASA

What statisticians should do to improve M&S validation studies
John Haman (IDA)

It is often said that many research findings — from social sciences, medicine, economics, and other disciplines — are false. This fact is trumpeted in the media and by many statisticians. There are several reasons that false research gets published, but to what extent should we be worried about false findings in defense testing, and in particular in modeling and simulation validation studies? In this talk I will present several recommendations for actions that statisticians and data scientists can take to improve the quality of our validations and evaluations.



A Decision-Theoretic Framework for Adaptive Simulation Experiments
Terril Hurst (Raytheon Technologies)

We describe a model-based framework for increasing the effectiveness of simulation experiments in the presence of uncertainty. Unlike conventionally designed simulation experiments, it adaptively chooses where to sample, based on the value of the information obtained. A Bayesian perspective is taken to formulate and update the framework’s four models. A simulation experiment is conducted to answer some question. In order to define precisely how informative a run is for answering the question, the answer must be defined as a random variable. This random variable is called a query and has the general form p(theta | y), where theta is the query parameter and y is the available data. Examples of each of the four models employed in the framework are briefly described below:

1. The continuous correlated beta process model (CCBP) estimates the proportions of successes and failures using beta-distributed uncertainty at every point in the input space. It combines results using an exponentially decaying correlation function. The output of the CCBP is used to estimate the value of a candidate run.

2. The mutual information model quantifies uncertainty in one random variable that is reduced by observing the other one. The model quantifies the mutual information between any candidate run and the query, thereby scoring the value of running each candidate.

3. The cost model estimates how long future runs will take, based upon past runs, using, e.g., a generalized linear model. A given simulation might have multiple fidelity options that require different run times. It may be desirable to balance information with the cost of a mixture of runs using these multi-fidelity options.

4. The grid state model, together with the mutual information model, is used to select the next collection of runs for optimal information per cost, accounting for current grid load.

The framework has been applied to several use cases, including model verification and validation with uncertainty quantification (VVUQ). Given a mathematically precise query, an 80 percent reduction in total runs has been observed.
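
The sketch below is one possible reading of the CCBP idea in the list above: beta-distributed success proportions at each input point, with observed pass/fail results pooled through an exponentially decaying correlation weight. It is an illustrative Python approximation under stated assumptions, not the authors' implementation.

```python
# Minimal sketch (our interpretation, not the authors' code): a continuous
# correlated beta process (CCBP)-style estimate of success probability over a
# 1-D input space, pooling pass/fail results with an exponentially decaying
# correlation weight.
import numpy as np

def ccbp_posterior(x_grid, x_obs, y_obs, length_scale=0.5, a0=1.0, b0=1.0):
    """Return posterior Beta(alpha, beta) parameters at each grid point."""
    x_grid = np.asarray(x_grid, float)[:, None]
    x_obs = np.asarray(x_obs, float)[None, :]
    y_obs = np.asarray(y_obs, float)[None, :]
    w = np.exp(-np.abs(x_grid - x_obs) / length_scale)   # correlation weights
    alpha = a0 + (w * y_obs).sum(axis=1)                  # weighted successes
    beta = b0 + (w * (1.0 - y_obs)).sum(axis=1)           # weighted failures
    return alpha, beta

alpha, beta = ccbp_posterior(np.linspace(0, 1, 5), [0.2, 0.8], [1, 0])
print(alpha / (alpha + beta))   # posterior mean success proportion
```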



Enabling Enhanced Validation of NDE Computational Models and Simulations
William C. Schneck, III (NASA LaRC)

Computer simulations of physical processes are increasingly used in the development, design, deployment, and life-cycle maintenance of many engineering systems [1] [2]. Non-Destructive Evaluation (NDE) and Structural Health Monitoring (SHM) must employ effective methods to inspect increasingly complex structural and material systems developed for new aerospace systems. Reliably and comprehensively interrogating this multidimensional [3] problem domain from a purely experimental perspective can become cost and time prohibitive. The emerging way to confront these new complexities in a timely and cost-effective manner is to utilize computer simulations. These simulations must be Verified and Validated [4] [5] to assure reliable use for these NDE/SHM applications. Beyond the classical use of models for engineering applications for equipment or system design efforts, NDE/SHM are necessarily applied to as-built and as-used equipment. While most structural or CFD models are applied to ascertain performance of as-designed systems, the performance of an NDE/SHM system is necessarily tied to the indications of damage/defects/deviations (collectively, flaws) within as-built and as-used structures and components. Therefore, the models must have sufficient fidelity to determine the influence of these aberrations on the measurements collected during interrogation. To assess the accuracy of these models, the Validation data sets must adequately encompass these flaw states. Due to the extensive parametric spaces that this coverage would entail, this talk proposes an NDE Benchmark Validation Data Repository, which should contain inspection data covering representative structures and flaws. This data can be reused from project to project, amortizing the cost of performing high quality Validation testing.

Works Cited
[1] Director, Modeling and Simulation Coordination Office, “Department of Defense Standard Practice: Documentation of Verification, Validation, and Accreditation (VV&A) for Models and Simulations,” Department of Defense, 2008.
[2] Under Secretary of Defense (Acquisition, Technology and Logistics), “DoD Modeling and Simulation (M&S) Verification, Validation, and Accreditation (VV&A),” Department of Defense, 2003.
[3] R. C. Martin, Clean Architecture: A Craftsman’s Guide to Software Structure and Design, Boston: Prentice Hall, 2018.
[4] C. J. Roy and W. L. Oberkampf, “A Complete Framework for Verification, Validation, and Uncertainty Quantification in Scientific Computing (Invited),” in 48th AIAA Aerospace Sciences Meeting, Orlando, 2010.
[5] ASME Performance Test Code Committee 60, “Guide for Verification and Validation in Computational Solid Mechanics,” ASME International, New York, 2016.



Use of Design & Analysis of Computer Experiments (DACE) in Space Mission Trajectory Design
David Shteinman (Industrial Sciences Group)

Numerical astrodynamics simulations are characterized by a large input space and complex, nonlinear input-output relationships. Standard Monte Carlo runs of these simulations are typically time-consuming and numerically costly. We adapt the Design and Analysis of Computer Experiments (DACE) approach to astrodynamics simulations to improve runtimes and increase information gain. Space-filling designs such as Latin Hypercube Sampling (LHS), Maximin, and Maximum Projection Sampling, combined with the surrogate modelling techniques of DACE such as Radial Basis Functions and Gaussian Process Regression, gave significant improvements for astrodynamics simulations, including: reduced run time of Monte Carlo simulations, improved speed of sensitivity analysis, confidence intervals for non-Gaussian behavior, determination of outliers, and identification of extreme output cases not found by standard simulation and sampling methods. Four case studies are presented on novel applications of DACE to mission trajectory design and conjunction assessments with space debris:

1) Gaussian Process regression modelling of maneuvers and navigation uncertainties for commercial cislunar and NASA CLPS lunar missions;
2) Development of a surrogate model for predicting collision risk and miss distance volatility between debris and satellites in Low Earth orbit;
3) Prediction of the displacement of an object in orbit using laser photon pressure;
4) Prediction of eclipse durations for the NASA IBEX-extended mission.

The surrogate models are assessed by k-fold cross validation. The relative selection of surrogate model performance is verified by the Root Mean Square Error (RMSE) of predictions at untried points. To improve the sampling of manoeuvre and navigational uncertainties within trajectory design for lunar missions, a maximin LHS was used in combination with the Gates model for thrusting uncertainty. This led to improvements in simulation efficiency, producing a non-parametric ΔV distribution that was processed with Kernel Density Estimation to resolve a ΔV99.9 prediction with confidence bounds. In a collaboration with the NASA Conjunction Assessment Risk Analysis (CARA) group, the change in probability of collision (Pc) for two objects in LEO was predicted using a network of 13 Gaussian Process Regression-based surrogate models that determined the future trends in covariance and miss distance volatility, given the data provided within a conjunction data message. This allowed for determination of the trend in the probability distribution of Pc up to three days from the time of closest approach, as well as the interpretation of this prediction in the form of an urgency metric that can assist satellite operators in the manoeuvre decision process. The main challenge in adapting the methods of DACE to astrodynamics simulations was to deliver a direct benefit to mission planning and design. This was achieved by delivering improvements in confidence and predictions for metrics including the propellant required to complete a lunar mission (expressed as ΔV), statistical validation of the simulation models used, and advice on when a sufficient number of simulation runs has been made to verify convergence to an adequate confidence interval. Future applications of DACE for mission design include determining an optimal tracking schedule plan for a lunar mission, and robust trajectory design for low-thrust propulsion.
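
For readers unfamiliar with the DACE workflow described above, the following Python sketch pairs a Latin Hypercube design with a Gaussian Process surrogate and a k-fold RMSE check. The toy objective function stands in for an expensive astrodynamics simulation and is purely hypothetical; scipy and scikit-learn are assumed.

```python
# Minimal sketch (assumes scipy and scikit-learn; the toy objective below is
# hypothetical, not an astrodynamics simulation): Latin Hypercube sampling of
# an input space, a Gaussian Process surrogate, and k-fold RMSE assessment.
import numpy as np
from scipy.stats import qmc
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
sampler = qmc.LatinHypercube(d=2, seed=0)
X = qmc.scale(sampler.random(n=60), l_bounds=[0, 0], u_bounds=[1, 10])

# Stand-in for an expensive simulation run at each design point
y = np.sin(3 * X[:, 0]) + 0.1 * X[:, 1] + rng.normal(0, 0.05, size=60)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=[0.3, 3.0]),
                              normalize_y=True)
rmse = -cross_val_score(gp, X, y, cv=5,
                        scoring="neg_root_mean_squared_error")
print("5-fold RMSE:", rmse.mean())
```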




Room C

Session 1C: Statistical Methods


Session Chair: Tom Donnelly, JMP

Safe Machine Learning Prediction and Optimization via Extrapolation Control
Tom Donnelly and Laura Lancaster (JMP Statistical Discovery LLC)

Uncontrolled model extrapolation leads to two serious kinds of errors: (1) the model may be completely invalid far from the data, and (2) the combinations of variable values may not be physically realizable. Optimizing models that are fit to observational data can lead, without any warning, to extrapolated solutions that are of no practical use. In this presentation we introduce a general approach to identifying extrapolation based on a regularized Hotelling T-squared metric. The metric is robust to certain kinds of messy data and can handle models with both continuous and categorical inputs. The extrapolation model is intended to be used in parallel with a machine learning model to identify when the machine learning model is being applied to data that are not close to that model’s training set, or as a non-extrapolation constraint when optimizing the model. The methodology described was introduced in the JMP Pro 16 Profiler.
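
One plausible form of the extrapolation check described above is sketched below: a regularized Hotelling T-squared distance from the training data, with candidate points flagged when they exceed the largest distance seen in training. This is an illustrative Python approximation, not the JMP Pro 16 implementation; the ridge value and threshold rule are assumptions.

```python
# Minimal sketch (one plausible form of the idea, not the JMP Pro
# implementation): flag candidate points whose regularized Hotelling T-squared
# distance from the training data exceeds the largest distance seen in training.
import numpy as np

def t_squared(X_train, X_new, ridge=1e-3):
    mu = X_train.mean(axis=0)
    S = np.cov(X_train, rowvar=False) + ridge * np.eye(X_train.shape[1])
    S_inv = np.linalg.inv(S)
    d = X_new - mu
    return np.einsum("ij,jk,ik->i", d, S_inv, d)

rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 3))
threshold = t_squared(X_train, X_train).max()       # simple empirical cutoff
X_query = np.array([[0.2, -0.1, 0.4], [6.0, 6.0, 6.0]])
print(t_squared(X_train, X_query) > threshold)      # expect [False  True]
```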



Applications of Equivalence Testing in T&E
Sarah Burke (The Perduco Group)

Traditional hypothesis testing is used extensively in test and evaluation (T&E) to determine if there is a difference between two or more populations. For example, we can analyze a designed experiment using t-tests to determine if a factor affects the response or not. Rejecting the null hypothesis would provide evidence that the factor changes the response value. However, there are many situations in T&E where the goal is to show that things didn’t change: the response is the same (or nearly the same) after some change in the process or system. If we use traditional hypothesis testing to assess this scenario, we would want to “fail to reject” the null hypothesis; however, this doesn’t actually provide evidence that the null hypothesis is true. Instead, we can orient the analysis to the decision that will be made and use equivalence testing. Equivalence testing initially assumes the populations are different; the alternative hypothesis is that they are the same. Rejecting the null hypothesis provides evidence that the populations are the same, matching the objective of the test. This talk provides an overview of equivalence testing with examples demonstrating its applicability in T&E. We also discuss additional considerations for planning a test where equivalence testing will be used, including sample size and what “equivalent” really means.
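
As a minimal illustration of the equivalence-testing logic described above, the Python sketch below implements a two-one-sided-tests (TOST) procedure for two sample means; the data and the equivalence margin delta are hypothetical choices an analyst would need to justify.

```python
# Minimal sketch (illustrative data; the equivalence margin "delta" is an
# analyst choice): a two-one-sided-tests (TOST) procedure for showing two
# configurations give practically equivalent mean responses.
import numpy as np
from scipy import stats

def tost(x, y, delta):
    """Reject the null of non-equivalence if the returned p-value is small."""
    nx, ny = len(x), len(y)
    diff = np.mean(x) - np.mean(y)
    se = np.sqrt(np.var(x, ddof=1) / nx + np.var(y, ddof=1) / ny)
    df = nx + ny - 2  # simple pooled-df approximation
    p_lower = stats.t.sf((diff + delta) / se, df)   # H0: diff <= -delta
    p_upper = stats.t.cdf((diff - delta) / se, df)  # H0: diff >= +delta
    return max(p_lower, p_upper)

rng = np.random.default_rng(2)
old = rng.normal(10.0, 1.0, 30)
new = rng.normal(10.1, 1.0, 30)
print("TOST p-value:", tost(old, new, delta=0.5))
```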



Computing Statistical Tolerance Regions Using the R Package ‘tolerance’
Derek Young (University of Kentucky)

Statistical tolerance intervals of the form (1−α, P) provide bounds to capture at least a specified proportion P of the sampled population with a given confidence level 1−α. The quantity P is called the content of the tolerance interval and the confidence level 1−α reflects the sampling variability. Statistical tolerance intervals are ubiquitous in regulatory documents, especially regarding design verification and process validation. Examples of such regulations are those published by the Food and Drug Administration (FDA), the Environmental Protection Agency (EPA), the International Atomic Energy Agency (IAEA), and the standard 16269-6 of the International Organization for Standardization (ISO). Research and development in the area of statistical tolerance intervals has undoubtedly been guided by the needs and demands of industry experts. Some of the broad applications of tolerance intervals include their use in quality control of drug products, setting process validation acceptance criteria, establishing sample sizes for process validation, assessing biosimilarity, and establishing statistically-based design limits. While tolerance intervals are available for numerous parametric distributions, procedures are also available for regression models, mixed-effects models, and multivariate settings (i.e., tolerance regions). Alternatively, nonparametric procedures can be employed when assumptions of a particular parametric model are not met. Tools for computing such tolerance intervals and regions are a necessity for researchers and practitioners alike. This was the motivation for designing the R package ‘tolerance,’ which not only has the capability of computing a wide range of tolerance intervals and regions for both standard and non-standard settings, but also includes some supplementary visualization tools. This session will provide a high-level introduction to the ‘tolerance’ package and its many features. Relevant data examples will be integrated with the computing demonstration, and specifically designed to engage researchers and practitioners from industry and government. A recently-launched Shiny app corresponding to the package will also be highlighted.
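
The talk itself demonstrates the R package ‘tolerance’; purely to illustrate the underlying idea of a (1−α, P) tolerance interval, the Python sketch below computes an approximate two-sided normal tolerance interval using Howe's k-factor approximation. The data are simulated and the coverage and confidence values are illustrative.

```python
# Minimal sketch (illustration of the underlying idea only; the talk itself
# demonstrates the R package 'tolerance'): an approximate two-sided (1-alpha, P)
# normal tolerance interval using Howe's k-factor approximation.
import numpy as np
from scipy import stats

def normal_tolerance_interval(x, coverage=0.90, confidence=0.95):
    n = len(x)
    nu = n - 1
    z = stats.norm.ppf((1 + coverage) / 2)
    chi2 = stats.chi2.ppf(1 - confidence, nu)       # lower-tail quantile
    k = z * np.sqrt(nu * (1 + 1 / n) / chi2)
    m, s = np.mean(x), np.std(x, ddof=1)
    return m - k * s, m + k * s

rng = np.random.default_rng(3)
x = rng.normal(50.0, 2.0, size=40)
print(normal_tolerance_interval(x, coverage=0.90, confidence=0.95))
```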



Kernel Regression, Bernoulli Trial Responses, and Designed Experiments
John Lipp (Lockheed Martin, Systems Engineering)

Boolean responses are common for both tangible and simulation experiments. Well-known approaches to fit models to Boolean responses include ordinary regression with normal approximations or variance stabilizing transforms, and logistic regression. Less well known is kernel regression. This session will present properties of kernel regression, its application to Bernoulli trial experiments, and other lessons learned from using kernel regression in the wild. Kernel regression is a non-parametric method. This requires modifications to many analyses, such as the required sample size. Unlike ordinary regression, the experiment design and model solution interact with each other. Consequently, the number of experiment samples needed for a desired modeling accuracy depends on the true state of nature. There has been a trend toward increasingly large simulation sample sizes as computing horsepower has grown. With kernel regression there is a point of diminishing return on sample size: once a sufficient sample size is reached, an experiment is better off with more data sites. Confidence interval accuracy is also dependent on the true state of nature. Parsimonious model tuning is required for accurate confidence intervals. Kernel tuning to build a parsimonious model using cross validation methods will be illustrated.
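
A minimal sketch of kernel regression on Bernoulli trial responses is shown below: a Nadaraya-Watson estimate of success probability over a one-dimensional input with a Gaussian kernel. The data, bandwidth, and kernel choice are illustrative assumptions; bandwidth tuning by cross validation, as discussed in the talk, is omitted.

```python
# Minimal sketch (illustrative only): a Nadaraya-Watson kernel regression
# estimate of success probability from Bernoulli trial responses on a 1-D input.
import numpy as np

def nw_probability(x_grid, x_obs, y_obs, bandwidth=0.2):
    x_grid = np.asarray(x_grid, float)[:, None]
    x_obs = np.asarray(x_obs, float)[None, :]
    w = np.exp(-0.5 * ((x_grid - x_obs) / bandwidth) ** 2)   # Gaussian kernel
    return (w * np.asarray(y_obs, float)).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 200)
y = rng.binomial(1, p=1 / (1 + np.exp(-(x - 0.5) * 10)))     # true logistic curve
grid = np.linspace(0, 1, 11)
print(np.round(nw_probability(grid, x, y), 2))
```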




Room D



Virtual Session

Mini-Tutorial 1: Data Integrity For Deep Learning Models


Session Chair: Doug Ray, US Army CCDC Armaments Center

Data Integrity For Deep Learning Models
Victoria Gerardi and John Cilli (US Army, CCDC Armaments Center)

Deep learning models are built from algorithm frameworks that fit parameters over a large set of structured historical examples. Model robustness relies heavily on the accuracy and quality of the input training datasets. This mini-tutorial explores the practical implications of data quality issues when attempting to build reliable and accurate deep learning models. The tutorial will review the basics of neural networks and model building, and then dive into data quality considerations using practical examples. An understanding of data integrity and data quality is pivotal for verification and validation of deep learning models, and this tutorial will provide students with a foundation in this topic.




12:20 PM – 1:30 PM

Lunch

1:30 PM – 3:00 PM: Parallel Sessions

Room A



Virtual Session

Session 2A: Reproducible Research


Session Chair: Eric Chicken, Florida State University

Everyday Reproducibility
Gregory J. Hunt (William & Mary)

Modern data analysis is typically quite computational. Correspondingly, sharing scientific and statistical work now often means sharing code and data in addition to writing papers and giving talks. This type of code sharing faces several challenges. For example, it is often difficult to take code from one computer and run it on another due to software configuration, version, and dependency issues. Even if the code runs, writing code that is easy to understand or interact with can be difficult. This makes it hard to assess third-party code and its findings, for example, in a review process. In this talk we describe a combination of two computing technologies that help make analyses shareable, interactive, and completely reproducible. These technologies are (1) analysis containerization, which leverages virtualization to fully encapsulate analysis, data, code, and dependencies into an interactive and shareable format, and (2) code notebooks, a literate programming format for interacting with analyses. This talk reviews the problems at a high level and also provides concrete solutions to the challenges faced. In addition to discussing reproducibility and data/code sharing generally, we will touch upon several such issues that arise specifically in the defense and aerospace communities.



From Gripe to Flight: Building an End-to-End Picture of DOD Sustainment
Benjamin Ashwell (IDA)

The DOD has to maintain readiness across a staggeringly diverse array of modern weapon systems, yet no single person or organization in the DOD has an end-to-end picture of the sustainment system that supports them. This shortcoming can lead to bad decisions when it comes to allocating resources in a funding-constrained environment. The underlying problem is driven by stovepiped databases, a reluctance to share data even internally, and a reliance on tribal knowledge of often cryptic data sources. Notwithstanding these difficulties, we need to create a comprehensive picture of the sustainment system to be able to answer pressing questions from DOD leaders. To that end, we have created a documented and reproducible workflow that shepherds raw data from DOD databases through cleaning and curation steps, and then applies logical rules, filters, and assumptions to transform the raw data into concrete values and useful metrics. This process gives us accurate, up-to-date data that we use to support quick-turn studies, and to rapidly build (and efficiently maintain) a suite of readiness models for a wide range of complex weapon systems.



Using the R ecosystem to produce a reproducible data analysis pipeline
Andrew Farina (United States Military Academy)

Advances in open-source software have brought powerful machine learning and data analysis tools requiring little more than a few coding basics. Unfortunately, the very nature of rapidly changing software can contribute to legitimate concerns surrounding the reproducibility of research and analysis. Borrowing from current practices in data science and software engineering fields, a more robust process using the R ecosystem to produce a version-controlled data analysis pipeline is proposed. By integrating the data cleaning, model generation, manuscript writing, and presentation scripts, a researcher or data analyst can ensure small changes at any step will automatically be reflected throughout using the Rmarkdown, targets, renv, and xaringan R packages.




Room B



Virtual Session

Session 2B: Trust in AI/ML


Session Chair: Elizabeth Gregory, NASA

Trust Throughout the Artificial Intelligence Lifecycle
Lauren H. Perry (The Aerospace Corporation)

AI and machine learning have become widespread throughout the defense, government, and commercial sectors. This has led to increased attention on the topic of trust and the role it plays in successfully integrating AI into high-consequence environments where tolerance for risk is low. Driven by recent successes of AI algorithms in a range of applications, users and organizations rely on AI to provide new, faster, and more adaptive capabilities. However, along with those successes have come notable pitfalls, such as bias, vulnerability to adversarial attack, and inability to perform as expected in novel environments. Many types of AI are data-driven, meaning they operate on and learn their internal models directly from data. Therefore, tracking how data were used at each stage (e.g., training, validation, and testing) is crucial not only to ensure a high-performing model, but also to understand whether the AI should be trusted. MLOps, an offshoot of DevSecOps, is a set of best practices meant to standardize and streamline the end-to-end lifecycle of machine learning. In addition to supporting the software development and hardware requirements of AI-based systems, MLOps provides a scaffold by which the attributes of trust can be formally and methodically evaluated. Additionally, MLOps encourages reasoning about trust early and often in the development cycle. To this end, we present a framework that encourages the development of AI-based applications that can be trusted to operate as intended and function safely both with and without human interaction. This framework offers guidance for each phase of the AI lifecycle, utilizing MLOps, through a detailed discussion of pitfalls resulting from not considering trust, metrics for measuring attributes of trust, and mitigation strategies for when risk tolerance is low.



Let’s stop talking about “transparency” with regard to AI
David Sparrow and David Tate (IDA)

For AI-enabled and autonomous systems, issues of safety, security, and mission effectiveness are not separable—the same underlying data and software give rise to interrelated risks in all of these dimensions. If treated separately, there is considerable unnecessary duplication (and sometimes mutual interference) among the efforts needed to satisfy commanders, operators, and certification authorities of the systems’ dependability. Assurance cases, pioneered within the safety and cybersecurity communities, provide a structured approach to simultaneously verifying all dimensions of system dependability with minimal redundancy of effort. In doing so, they also provide a more concrete and useful framework for system development and explanation of behavior than is generally seen in discussions of “transparency” and “trust” in AI and autonomy. Importantly, trust generally cannot be “built in” to systems, because the nature of the assurance arguments needed by various stakeholders requires iterative identification of evidence structures that cannot be anticipated by developers.



Machine Learning for Uncertainty Quantification: Trusting the Black Box
James Warner (NASA Langley Research Center)

Adopting uncertainty quantification (UQ) has become a prerequisite for providing credibility in modeling and simulation (M&S) applications. It is well known, however, that UQ can be computationally prohibitive for problems involving expensive high-fidelity models, since a large number of model evaluations is typically required. A common approach for improving efficiency is to replace the original model with an approximate surrogate model (i.e., metamodel, response surface, etc.) using machine learning that makes predictions in a fraction of the time. While surrogate modeling has been commonplace in the UQ field for over a decade, many practitioners still remain hesitant to rely on “black box” machine learning models over trusted physics-based models (e.g., FEA) for their analyses. This talk discusses the role of machine learning in enabling computational speedup for UQ, including traditional limitations and modern efforts to overcome them. An overview of surrogate modeling and its best practices for effective use is first provided. Then, some emerging methods that aim to unify physics-based and data-based approaches for UQ are introduced, including multi-model Monte Carlo simulation and physics-informed machine learning. The use of both traditional surrogate modeling and these more advanced machine learning methods for UQ are highlighted in the context of applications at NASA, including trajectory simulation and spacesuit certification.
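
To make the surrogate-based speedup concrete, the Python sketch below trains a Gaussian Process surrogate on a small number of runs of a toy "expensive" model and then propagates a large Monte Carlo sample of uncertain inputs through the cheap surrogate. The toy model and input distributions are hypothetical, and scikit-learn is assumed; this is not NASA's workflow.

```python
# Minimal sketch (toy "high-fidelity model"; assumes scikit-learn): train a
# Gaussian Process surrogate on a handful of expensive model runs, then push a
# large Monte Carlo sample of uncertain inputs through the cheap surrogate.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def expensive_model(x):                      # stand-in for FEA or trajectory code
    return np.sin(3 * x[:, 0]) * np.exp(-x[:, 1])

rng = np.random.default_rng(9)
X_train = rng.uniform(0, 1, size=(25, 2))    # only 25 "expensive" runs
y_train = expensive_model(X_train)

surrogate = GaussianProcessRegressor(kernel=RBF(0.3), normalize_y=True)
surrogate.fit(X_train, y_train)

X_mc = rng.normal([0.5, 0.5], [0.1, 0.1], size=(100_000, 2))  # uncertain inputs
y_mc = surrogate.predict(X_mc)
print("output mean/std:", y_mc.mean(), y_mc.std())
```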




Room C

Speed Session 1


Session Chair: Denise Edwards, IDA

Analysis Apps for the Operational Tester
William Raymond Whitledge (IDA)

In the acquisition and testing world, data analysts repeatedly encounter certain categories of data, such as time or distance until an event (e.g., failure, alert, detection), binary outcomes (e.g., success/failure, hit/miss), and survey responses. Analysts need tools that enable them to produce quality and timely analyses of the data they acquire during testing. This poster presents four web-based apps that can analyze these types of data. The apps are designed to assist analysts and researchers with simple repeatable analysis tasks, such as building summary tables and plots for reports or briefings. Using software tools like these apps can increase reproducibility of results, timeliness of analysis and reporting, attractiveness and standardization of aesthetics in figures, and accuracy of results. The first app models reliability of a system or component by fitting parametric statistical distributions to time-to-failure data. The second app fits a logistic regression model to binary data with one or two independent continuous variables as predictors. The third calculates summary statistics and produces plots of groups of Likert-scale survey question responses. The fourth calculates the system usability scale (SUS) scores for SUS survey responses and enables the app user to plot scores versus an independent variable. These apps are available for public use on the Test Science Interactive Tools webpage https://new.testscience.org/interactive-tools/.
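
As an example of the kind of calculation the fourth app automates, the sketch below applies the standard System Usability Scale scoring rule to one hypothetical set of ten responses; it is not the app's code.

```python
# Minimal sketch (standard SUS scoring rule; survey responses below are
# hypothetical): compute a System Usability Scale score from ten 1-5 responses.
import numpy as np

def sus_score(responses):
    """responses: length-10 array of 1-5 Likert answers, item order 1..10."""
    r = np.asarray(responses, float)
    odd = r[0::2] - 1          # items 1,3,5,7,9: positively worded
    even = 5 - r[1::2]         # items 2,4,6,8,10: negatively worded
    return 2.5 * (odd.sum() + even.sum())

print(sus_score([4, 2, 5, 1, 4, 2, 5, 1, 4, 2]))   # -> 85.0
```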



Quantifying the Impact of Staged Rollout Policies on Software Process and Product Metrics
Lance Fiondella (University of Massachusetts Dartmouth)

Software processes define specific sequences of activities performed to effectively produce software, whereas tools provide concrete computational artifacts by which these processes are carried out. Tool independent modeling of processes and related practices enable quantitative assessment of software and competing approaches. This talk presents a framework to assess an approach employed in modern software development known as staged rollout, which releases new or updated software features to a fraction of the user base in order to accelerate defect discovery without imposing the possibility of failure on all users. The framework quantifies process metrics such as delivery time and product metrics, including reliability, availability, security, and safety, enabling tradeoff analysis to objectively assess the quality of software produced by vendors, establish baselines, and guide process and product improvement. Failure data collected during software testing is employed to emulate the approach as if the project were ongoing. The underlying problem is to identify a policy that decides when to perform various stages of rollout based on the software’s failure intensity. The illustrations examine how alternative policies impose tradeoffs between two or more of the process and product metrics.



Estimating the time of sudden shift in the location or scale of an ergodic-stationary process
Zhi Wang (Bayer Crop Science)

Autocorrelated sequences arise in many modern-day industrial applications. In this paper, our focus is on estimating the time of sudden shift in the location or scale of a continuous ergodic-stationary sequence following a genuine signal from a statistical process control chart. Our general approach involves “clipping” the continuous sequence at the median or interquartile range (IQR) to produce a binary sequence, and then modeling the joint mass function for the binary sequence using a Bahadur approximation. We then derive a maximum likelihood estimator for the time of sudden shift in the mean of the binary sequence. Performance comparisons are made between our proposed change point estimator and two other viable alternatives. Although the literature contains existing methods for estimating the time of sudden shift in the mean and/or variance of a continuous process, most are derived under strict independence and distributional assumptions. Such assumptions are often too restrictive, particularly when applications involve Industry 4.0 processes where autocorrelation is prevalent and the distribution of the data is likely unknown. The change point estimation strategy proposed in this work easily incorporates autocorrelation and is distribution-free. Consequently, it is widely applicable to modern-day industrial processes.
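
The sketch below illustrates the clipping idea in simplified form: the sequence is clipped at its median and a maximum-likelihood change point is found assuming independent Bernoulli observations. The paper's Bahadur-approximation likelihood, which accounts for autocorrelation, is not reproduced here; the data are simulated.

```python
# Minimal sketch (simplified: independent-Bernoulli likelihood rather than the
# Bahadur approximation used in the paper): clip a sequence at its median and
# take the maximum-likelihood change point of the resulting binary sequence.
import numpy as np

def clipped_change_point(x):
    b = (np.asarray(x, float) > np.median(x)).astype(float)
    n = len(b)
    best_tau, best_ll = None, -np.inf
    for tau in range(1, n - 1):                    # candidate shift times
        p1, p2 = b[:tau].mean(), b[tau:].mean()
        p1, p2 = np.clip([p1, p2], 1e-6, 1 - 1e-6)
        ll = (b[:tau] * np.log(p1) + (1 - b[:tau]) * np.log(1 - p1)).sum() \
           + (b[tau:] * np.log(p2) + (1 - b[tau:]) * np.log(1 - p2)).sum()
        if ll > best_ll:
            best_tau, best_ll = tau, ll
    return best_tau

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(0, 1, 150), rng.normal(1.2, 1, 50)])
print("estimated shift time:", clipped_change_point(x))
```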



Analysis of Target Location Error using Stochastic Differential Equations
James Brownlow (USAF)

This paper presents an analysis of target location error (TLE) based on the Cox-Ingersoll-Ross (CIR) model. In brief, this model characterizes TLE as a function of range based on the stochastic differential equation dX(r) = a(b - X(r))dr + sigma*sqrt(X(r))dW(r), where X(r) is the TLE at range r, b is the long-term (terminal) mean of the TLE, a is the rate of reversion of X(r) to b, sigma is the process volatility, and W(r) is the standard Wiener process. Multiple flight test runs under the same conditions exhibit different realizations of the TLE process. This approach to TLE analysis models each flight test run as a realization of the CIR process. Fitting a CIR model to multiple data runs then provides a characterization of the TLE system under test. This paper presents an example use of the CIR model. Maximum likelihood estimates of the parameters of the CIR model are found from a collection of TLE data runs. The resulting CIR model is then used to characterize overall system TLE performance as a function of range to the target, as well as the asymptotic estimate of long-term TLE.
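
As an illustration of the model, the Python sketch below simulates several realizations of the CIR process with an Euler-Maruyama scheme; the parameter values are hypothetical and are not estimates from flight test data.

```python
# Minimal sketch (illustrative parameter values): Euler-Maruyama simulation of
# the CIR model dX(r) = a*(b - X(r))*dr + sigma*sqrt(X(r))*dW(r) used to
# describe target location error as a function of range.
import numpy as np

def simulate_cir(x0, a, b, sigma, r_max, n_steps, rng):
    dr = r_max / n_steps
    x = np.empty(n_steps + 1)
    x[0] = x0
    for i in range(n_steps):
        dw = rng.normal(0.0, np.sqrt(dr))
        x[i + 1] = x[i] + a * (b - x[i]) * dr + sigma * np.sqrt(max(x[i], 0.0)) * dw
    return x

rng = np.random.default_rng(6)
runs = [simulate_cir(x0=40.0, a=0.8, b=10.0, sigma=2.0,
                     r_max=20.0, n_steps=2000, rng=rng) for _ in range(5)]
print("terminal TLE of 5 simulated runs:", [round(r[-1], 1) for r in runs])
```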



Data Science & ML-Enabled Terminal Effects Optimization
John Cilli (Picatinny Arsenal)

Warhead design and performance optimization against a range of targets is a foundational aspect of the Department of the Army’s mission on behalf of the warfighter. The existing procedures utilized to perform this basic design task do not fully leverage the exponential growth in data science, machine learning, distributed computing, and computational optimization. Although sound in practice and methodology, existing implementations are laborious and computationally expensive, thus limiting the ability to fully explore the trade space of all potentially viable solutions. An additional complicating factor is the fast paced nature of many Research and Development programs which require equally fast paced conceptualization and assessment of warhead designs. By utilizing methods to take advantage of data analytics, the workflow to develop and assess modern warheads will enable earlier insights, discovery through advanced visualization, and optimal integration of multiple engineering domains. Additionally, a framework built on machine learning would allow for the exploitation of past studies and designs to better inform future developments. Combining these approaches will allow for rapid conceptualization and assessment of new and novel warhead designs. US overmatch capability is quickly eroding across many tactical and operational weapon platforms. Traditional incremental improvement approaches are no longer generating appreciable performance improvements to warrant investment. Novel next generation techniques are required to find efficiencies in designs and leap forward technologies to maintain US superiority. The proposed approach seeks to shift existing design mentality to meet this challenge.



Predicting Trust in Automated Systems: Validation of the Trust of Automated Systems Test
Caitlan Fealing (IDA)

The number of people using autonomous systems for everyday tasks has increased steadily since the 1960s and has dramatically increased with the invention of smart devices that can be controlled via smartphone. Within the defense community, automated systems are currently used to perform search and rescue missions and to assume control of aircraft to avoid ground collision. Until recently, researchers have only been able to gain insights on trust levels by observing a human’s reliance on the system, so it was apparent that researchers needed a validated method of quantifying how much an individual trusts the automated system they are using. IDA researchers developed the Trust of Automated Systems Test (TOAST scale) to serve as a validated scale capable of measuring how much an individual trusts a system. This presentation will outline the nine item TOAST scale’s understanding and performance elements, and how it can effectively be used in a defense setting. We believe that this scale should be used to evaluate the trust level of any human using any system, including predicting when operators will misuse or disuse complex, automated and autonomous systems.



Exploring the behavior of Bayesian adaptive design of experiments
Daniel Ries (Sandia National Laboratories)

Physical experiments in the national security arena, including nuclear deterrence, are often expensive and time-consuming resulting in small sample sizes which make it difficult to achieve desired statistical properties. Bayesian adaptive design of experiments (BADE) is a sequential design of experiment approach which updates the test design in real time, in order to optimally collect data. BADE recommends ending experiments early by either concluding that the experiment would have ended in efficacy or futility, had the testing completely finished, with sufficiently high probability. This is done by using data already collected and marginalizing over the remaining uncollected data and updating the Bayesian posterior distribution in near real-time. BADE has seen successes in clinical trials, resulting in quicker and more effective assessments of drug trials while also reducing ethical concerns. BADE has typically only been used in futility studies rather than efficacy studies for clinical trials, although there hasn’t been much debate for this current paradigm. BADE has been proposed for testing in the national security space for similar reasons of quicker and cheaper test series. Given the high-consequence nature of the tests performed in the national security space, a strong understanding of new methods is required before being deployed. The main contribution of this research was to reproduce results seen in previous studies, for different aspects of model performance. A large simulation inspired by a real testing problem at Sandia National Laboratories was performed to understand the behavior of BADE under various scenarios, including shifts to mean, standard deviation, and distributional family, all in addition to the presence of outliers. The results help explain the behavior of BADE under various assumption violations. Using the results of this simulation, combined with previous work related to BADE in this field, it is argued this approach could be used as part of an “evidence package” for deciding to stop testing early due to futility, or with stronger evidence, efficacy. The combination of expert knowledge with statistical quantification provides the stronger evidence necessary for a method in its infancy in a high-consequence, new application area such as national security. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.
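
The Python sketch below illustrates, in a deliberately simplified beta-binomial setting, the kind of quantity BADE monitors: the posterior predictive probability that a test series run to completion would end in success. It is not Sandia's model; the trial counts, success criterion, and prior are assumptions.

```python
# Minimal sketch (simplified beta-binomial setting, not Sandia's model): the
# BADE-style quantity of interest is the predictive probability that the test
# series, if run to completion, would end in success.
import numpy as np
from scipy import stats

def predictive_prob_success(successes, trials_done, trials_total,
                            required_successes, a0=1.0, b0=1.0):
    """P(final successes >= required | data), marginalizing over future trials."""
    a = a0 + successes
    b = b0 + trials_done - successes
    remaining = trials_total - trials_done
    need = required_successes - successes
    if need <= 0:
        return 1.0
    if need > remaining:
        return 0.0
    k = np.arange(need, remaining + 1)
    return stats.betabinom.pmf(k, remaining, a, b).sum()

# After 12 of 20 planned trials with 9 successes, needing 16 total successes:
print(predictive_prob_success(9, 12, 20, 16))
```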



T&E of Responsible AI
Rachel Haga (IDA)

Getting Responsible AI (RAI) right is difficult and demands expertise. All AI-relevant skill sets, including ethics, are in high demand and short supply, especially regarding AI’s intersection with test and evaluation (T&E). Frameworks, guidance, and tools are needed to empower working-level personnel across DOD to generate RAI assurance cases with support from RAI SMEs. At a high level, a framework should address the following points:

1. T&E is a necessary piece of the RAI puzzle. Testing provides a feedback mechanism for system improvement and builds public and warfighter confidence in our systems, and RAI should be treated just like performance, reliability, and safety requirements.

2. We must intertwine T&E and RAI across the cradle-to-grave product life cycle. Programs must embrace T&E and RAI from inception; as development proceeds, these two streams must be integrated in tight feedback loops to ensure effective RAI implementation. Furthermore, many AI systems, along with their operating environments and use cases, will continue to update and evolve and thus will require continued evaluation after fielding.

3. The five DOD RAI principles are a necessary north star, but alone they are not enough to implement or ensure RAI. Programs will have to integrate multiple methodologies and sources of evidence to construct holistic arguments for how much they have reduced RAI risks.

4. RAI must be developed, tested, and evaluated in context. T&E without operationally relevant context will fail to ensure that fielded tools achieve RAI. Mission success depends on technology that must interact with warfighters and other systems in complex environments, while constrained by processes and regulation. AI systems will be especially sensitive to operational context and will force T&E to expand what it considers.



Next Gen Breaching Technology: A Case Study in Deterministic Binary Response Emulation
Eli Golden (US Army DEVCOM Armaments Center)

Combat Capabilities Development Command Armaments Center (DEVCOM AC) is developing the next generation breaching munition, a replacement for the M58 Mine Clearing Line Charge. A series of M&S experiments were conducted to aid with the design of mine-neutralizing submunitions, utilizing space-filling designs, support vector machines, and hyper-parameter optimization. A probabilistic meta-model of the FEA-simulated performance data was generated with Platt Scaling in order to facilitate optimization, which was implemented to generate several candidate designs for follow-up live testing. This paper will detail the procedure used to iteratively explore and extract information from a deterministic process with a binary response.
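
A minimal sketch of the Platt-scaling step described above is shown below: a support vector machine is fit to a deterministic pass/fail response and its margins are converted to calibrated probabilities with sigmoid (Platt) calibration. The toy response function stands in for the FEA simulation and scikit-learn is assumed; this is not the DEVCOM AC meta-model.

```python
# Minimal sketch (toy data; assumes scikit-learn): fit an SVM classifier to
# deterministic pass/fail results and convert its margins to calibrated
# probabilities with Platt scaling (sigmoid calibration), as a stand-in for the
# probabilistic meta-model described above.
import numpy as np
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV

rng = np.random.default_rng(7)
X = rng.uniform(0, 1, size=(300, 2))                 # space-filling design inputs
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)      # stand-in pass/fail response

model = CalibratedClassifierCV(SVC(kernel="rbf", gamma="scale"),
                               method="sigmoid", cv=5)
model.fit(X, y)
print(model.predict_proba([[0.5, 0.5], [0.9, 0.9]])[:, 1])
```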




Room D



Virtual Session

Mini-Tutorial 2: Mixed Models


Session Chair: Keyla Pagan-Rivera, IDA

Mixed Models: A Critical Tool for Dependent Observations
Elizabeth Claassen (JMP Statistical Discovery)

The use of fixed and random effects has a rich history. They often go by other names, including blocking models, variance component models, nested and split-plot designs, hierarchical linear models, multilevel models, empirical Bayes, repeated measures, covariance structure models, and random coefficient models. Mixed models are one of the most powerful and practical ways to analyze experimental data, and investing time to become skilled with them is well worth the effort. Many, if not most, real-life data sets do not satisfy the standard statistical assumption of independent observations. Failure to appropriately model design structure can easily result in biased inferences. With an appropriate mixed model we can estimate the primary effects of interest as well as compare sources of variability using common forms of dependence among sets of observations. Mixed models can readily become one of the handiest methods in your analytical toolbox and provide a foundational framework for understanding statistical modeling in general. In this course we will cover many types of mixed models, including blocking, split-plot, and random coefficient models.
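
As a small worked example of the simplest dependent-observation structure covered in the tutorial, the sketch below fits a mixed model with a fixed treatment effect and a random block effect to simulated data. The Python version with statsmodels is only an illustration under assumed data, not the tutorial's own software or examples.

```python
# Minimal sketch (synthetic blocked experiment; assumes statsmodels and pandas):
# a mixed model with a fixed treatment effect and a random block effect, the
# simplest of the dependent-observation structures discussed in the tutorial.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(10)
blocks = np.repeat(np.arange(8), 6)                   # 8 blocks, 6 runs each
treatment = np.tile([0, 0, 0, 1, 1, 1], 8)
y = 5 + 2 * treatment + rng.normal(0, 1, 8)[blocks] + rng.normal(0, 0.5, 48)
df = pd.DataFrame({"y": y, "treatment": treatment, "block": blocks})

model = smf.mixedlm("y ~ treatment", df, groups=df["block"]).fit()
print(model.summary())
```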




3:00 PM – 3:20 PM

Break

3:20 PM – 4:50 PM: Parallel Sessions

Room A



Virtual Session

Session 3A: Cybersecurity Methods and Concepts


Session Chair: Pete Mancini, IDA

Applying Design of Experiments to Cyber Testing
J. Michael Gilmore (IDA)

We describe a potential framework for applying DOE to cyber testing and provide an example of its application to testing of a hypothetical command and control system.



Sparse Models for Detecting Malicious Behavior in OpTC
Andrew Mastin (Lawrence Livermore National Laboratory)

Host-based sensors are standard tools for generating event data to detect malicious activity on a network. There is often interest in detecting activity using as few event classes as possible in order to minimize host processing slowdowns. Using DARPA’s Operationally Transparent Cyber (OpTC) Data Release, we consider the problem of detecting malicious activity using event counts aggregated over five-minute windows. Event counts are categorized by eleven features according to MITRE CAR data model objects. In the supervised setting, we use regression trees with all features to show that malicious activity can be detected at above a 90% true positive rate with a negligible false positive rate. Using forward and exhaustive search techniques, we show the same performance can be obtained using a sparse model with only three features. In the unsupervised setting, we show that the isolation forest algorithm is somewhat successful at detecting malicious activity, and that a sparse three-feature model performs comparably. Finally, we consider various search criteria for identifying sparse models and demonstrate that the RMSE criteria is generally optimal.
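
To illustrate the unsupervised setting described above, the Python sketch below runs an isolation forest over synthetic five-minute event-count vectors and flags anomalous windows. The counts and feature dimensions are made up and do not come from the OpTC release; scikit-learn is assumed.

```python
# Minimal sketch (synthetic counts, not OpTC data; assumes scikit-learn): an
# isolation forest over five-minute event-count feature vectors, flagging
# windows whose counts look anomalous.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(8)
benign = rng.poisson(lam=[20, 5, 3, 1], size=(500, 4))      # typical windows
malicious = rng.poisson(lam=[20, 5, 60, 15], size=(5, 4))   # unusual activity
X = np.vstack([benign, malicious])

clf = IsolationForest(n_estimators=200, contamination=0.01, random_state=0)
labels = clf.fit_predict(X)          # -1 = flagged as anomalous
print("windows flagged:", np.where(labels == -1)[0])
```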



Adversaries and Airwaves – Compromising Wireless and Radio Frequency Communications
Mark Herrera and Jason Schlup (IDA)

Wireless and radio frequency (RF) technologies are ubiquitous in our daily lives, appearing in laptops, key fobs, remote sensors, and antennas. These devices, while oftentimes portable and convenient, can be susceptible to adversarial attack over the air. This breakout session will provide a short introduction to wireless hacking concepts such as passive scanning, active injection, and the use of software defined radios to flexibly sample the RF spectrum. We will also ground these concepts in live demonstrations of attacks against both wireless and wired systems.




Room B



Virtual Session

Session 3B: Sustainment in the DoD


Session Chair: Benjamin Ashwell, IDA

An Introduction to Sustainment: The Importance and Challenges of Analyzing System Readiness
V. Bram Lillard and Megan L. Gelsinger (IDA)

The Department of Defense (DoD) spends the majority of its annual budget on making sure that systems are ready to perform when called to action. Even with large investments, though, maintaining adequate system readiness poses a major challenge for the DoD. Here, we discuss why readiness is so difficult to maintain and introduce the tools IDA has developed to aid readiness and supply chain analysis and decision-making. Particular emphasis is placed on “honeybee,” the tool developed to clean, assemble, and mine data across a variety of sources in a well-documented and reproducible way. Using a notional example, we demonstrate the utility of this tool and others like it in our suite; these tools lower the barrier to performing meaningful analysis, constructing and estimating input data for readiness models, and aiding the DoD’s ability to tie resources to readiness outcomes.



M&S approach for quantifying readiness impact of sustainment investment scenarios
Andrew C. Flack, Han G. Yi (IDA (OED))

Sustainment for weapon systems involves multiple components that influence readiness outcomes through a complex array of interactions. While military leadership can use simple analytical approaches to gain insights into current metrics (e.g., dashboards for top downtime drivers) or historical trends of a given sustainment structure (e.g., correlative studies between stock sizes and backorders), these approaches are inadequate for guiding decision-making because they cannot quantify the impact on readiness. In this talk, we discuss the power of IDA’s end-to-end modeling and simulation (M&S) approach that estimates time-varying readiness outcomes based on real-world data on operations, supply, and maintenance. These models are designed to faithfully emulate fleet operations at the level of individual components and operational units, as well as to incorporate the multi-echelon inventory system used in military sustainment. We showcase a notional example in which our M&S approach produces a set of recommended component-level investments and divestments in wholesale supply that would improve the readiness of a weapon system. We argue for the urgency of increased end-to-end M&S efforts across the Department of Defense to guide the senior leadership in its data-driven decision-making for readiness initiatives.



Taming the beast: making questions about the supply system tractable by quantifying risk
Joseph Fabritius and Kyle Remley (IDA)

The DoD sustainment system is responsible for managing the supply of millions of different spare parts, most of which are infrequently and inconsistently requisitioned, and many of which have procurement lead times measured in years. The DoD must generally buy items in anticipation of need, yet it simply cannot afford to buy even one copy of every unique part it might be called upon to deliver. Deciding which items to purchase necessarily involves taking risks, both military and financial. However, the huge scale of the supply system makes these risks difficult to quantify. We have developed methods that use raw supply data in new ways to support this decision making process. First, we have created a method to identify areas of potential overinvestment that could safely be reallocated to areas at risk of underinvestment. Second, we have used raw requisition data to create an item priority list for individual weapon systems in terms of importance to mission success. Together, these methods allow DoD decision makers to make better-informed decisions about where to take risks and where to invest scarce resources.




Room C

Speed Session 2


Session Chair: Alyson Wilson, NCSU

Assurance Techniques for Learning Enabled Autonomous Systems which Aid Systems Engineering
Christian Ellis (Army Research Laboratory / University of Mass. Dartmouth)

It is widely recognized that the complexity and resulting capabilities of autonomous systems created using machine learning methods, which we refer to as learning enabled autonomous systems (LEAS), pose new challenges to systems engineering test, evaluation, verification, and validation (TEVV) compared to their traditional counterparts. This presentation provides a preliminary attempt to map recently developed technical approaches in the LEAS assurance and TEVV literature to a traditional systems engineering v-model. The mapping organizes these techniques under three top-level lifecycle phases (development, acquisition, and sustainment) and reviews the latest techniques for developing safe, reliable, and resilient learning enabled autonomous systems, without recommending radical or impractical changes to existing systems engineering processes. By performing this mapping, we seek to assist acquisition professionals by (i) informing comprehensive test and evaluation planning, and (ii) objectively communicating risk to leaders. The inability to translate qualitative assessments into quantitative metrics that measure system performance hinders adoption, and without understanding the capabilities and limitations of existing assurance techniques, defining safety and performance requirements that are both clear and testable remains out of reach. We complement recent literature reviews on autonomy assurance and TEVV by mapping such developments to distinct steps of a well-known systems engineering model, chosen for its prevalence: the v-model. A section of the presentation is dedicated to recent technical developments for autonomy assurance in each of the three lifecycle phases. This representation helps identify where the latest TEVV methods fit in the broader systems engineering process while also enabling systematic consideration of potential sources of defects, faults, and attacks. Note that we use the v-model only to assist the classification of where TEVV methods fit; this is not a recommendation to use a certain software development lifecycle over another.



Optimal Designs for Multiple Response Distributions
Brittany Fischer (Arizona State University)

Designed experiments can be a powerful tool for gaining fundamental understanding of systems and processes or maintaining or optimizing systems and processes. There are usually multiple performance and quality metrics that are of interest in an experiment, and these multiple responses may include data from nonnormal distributions, such as binary or count data. A design that is optimal for a normal response can be very different from a design that is optimal for a nonnormal response. This work includes a two-phase method that helps experimenters identify a hybrid design for a multiple response problem. Mixture and optimal design methods are used with a weighted optimality criterion for a three-response problem that includes a normal, a binary, and a Poisson model, but could be generalized to an arbitrary number and combination of responses belonging to the exponential family. A mixture design is utilized to identify the optimal weights in the criterion presented.



Machine Learning for Efficient Fuzzing
John Richie (USAFA)

A high level of security in software is a necessity in today’s world; the best way to achieve confidence in security is through comprehensive testing. This paper covers the development of a fuzzer that explores the massively large input space of a program using machine learning to find the inputs most associated with errors. A formal methods model of the software in question is used to generate and evaluate test sets. Using those test sets, a two-part algorithm is applied: inputs are modified according to their Hamming distance from error-causing inputs, and then a tree-based model learns the relative importance of each variable in causing errors. This architecture was tested against a model of an aircraft’s thrust reverser, with predefined model properties offering a starting test set. From there, the Hamming algorithm and importance model expand upon the original set to offer a more informed set of test cases. This system has great potential in producing efficient and effective test sets and has further applications in verifying the security of software programs and cyber-physical systems, contributing to national security in the cyber domain.



Nonparametric multivariate profile monitoring using regression trees
Daniel A. Timme (Florida State University)

Monitoring noisy profiles for changes in behavior can be used to validate whether a process is operating under normal conditions over time. Change-point detection and estimation in sequences of multivariate functional observations is a common method utilized in monitoring such profiles. We propose a nonparametric method that uses Classification and Regression Trees (CART) to build a sequence of regression trees and the Kolmogorov-Smirnov statistic to monitor profile behavior. Our novel method compares favorably to existing methods in the literature.
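A simplified illustration of the two ingredients named in the abstract (a regression tree fit to an in-control profile and a Kolmogorov-Smirnov comparison of residuals) is sketched below; it is not the authors' exact procedure, and the simulated profile is invented.

```python
# Simplified illustration: a regression tree fit to an in-control profile and
# a Kolmogorov-Smirnov comparison of residuals for new profiles.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 300).reshape(-1, 1)

def profile(shift=0.0):
    """Simulated noisy profile observed over the fixed grid x."""
    return np.sin(2 * np.pi * x.ravel()) + shift + rng.normal(0, 0.1, x.shape[0])

# Phase I: fit a tree to a known in-control profile and keep its residuals
y0 = profile()
tree = DecisionTreeRegressor(max_depth=5).fit(x, y0)
ref_resid = y0 - tree.predict(x)

def ks_monitor(new_y, alpha=0.01):
    """Flag a new profile whose residual distribution differs from the reference."""
    stat, p_value = ks_2samp(new_y - tree.predict(x), ref_resid)
    return p_value < alpha, round(stat, 3)

print(ks_monitor(profile()))            # in-control profile: should not signal
print(ks_monitor(profile(shift=0.3)))   # shifted profile: should signal
```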



Profile Monitoring via Eigenvector Perturbation
Takayuki Iguchi (Florida State University)

Control charts are often used to monitor the quality characteristics of a process over time to ensure undesirable behavior is quickly detected. The escalating complexity of the processes we wish to monitor spurs the need for more flexible control charts such as those used in profile monitoring. Additionally, designing a control chart that has an acceptable false alarm rate for a practitioner is a common challenge. Alarm fatigue can occur if the sampling rate is high (say, once a millisecond) and the control chart is calibrated to an average in-control run length (ARL0) of 200 or 370, as is often done in the literature. Because alarm fatigue is not just an annoyance but can degrade product quality, control chart designers should seek to minimize the false alarm rate. Unfortunately, reducing the false alarm rate typically comes at the cost of detection delay, or average out-of-control run length (ARL1). Motivated by recent work on eigenvector perturbation theory, we develop a computationally fast control chart called the Eigenvector Perturbation Control Chart for nonparametric profile monitoring. The control chart monitors the l_2 perturbation of the leading eigenvector of a correlation matrix and requires only a sample of known in-control profiles to determine control limits. Through a simulation study, we demonstrate that it is able to outperform its competition by achieving an ARL1 close to or equal to 1 even when the control limits result in a large ARL0 on the order of 10^6. Additionally, non-zero false alarm rates with a change point after 10^4 in-control observations were only observed in scenarios that are either pathological or truly difficult for a correlation based monitoring scheme.
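The monitoring statistic described above can be sketched compactly: compute the leading eigenvector of an in-control correlation matrix, then track the l_2 norm of its perturbation for new samples. The sketch below is illustrative only; the in-control and out-of-control correlation structures and the control limit are invented, not those studied in the talk.

```python
# Minimal sketch of monitoring the l2 perturbation of the leading eigenvector
# of a correlation matrix, with an empirical control limit from in-control data.
import numpy as np

rng = np.random.default_rng(0)
d = 20

def leading_eigvec(sample):
    corr = np.corrcoef(sample, rowvar=False)
    _, vecs = np.linalg.eigh(corr)
    v = vecs[:, -1]                                # eigenvector of the largest eigenvalue
    return v * np.sign(v[np.argmax(np.abs(v))])    # fix the sign so comparisons are meaningful

# In-control process: equicorrelated observations, so the leading eigenvector is stable
cov_in = 0.5 * np.eye(d) + 0.5 * np.ones((d, d))
phase1 = [rng.multivariate_normal(np.zeros(d), cov_in, size=200) for _ in range(50)]
v0 = leading_eigvec(np.vstack(phase1))
baseline = [np.linalg.norm(leading_eigvec(s) - v0) for s in phase1]
ucl = max(baseline)                                # simple empirical control limit from Phase I

# Out-of-control process: correlation confined to the first half of the variables
cov_out = 0.1 * np.eye(d)
cov_out[:10, :10] += 0.9 * np.ones((10, 10))
monitor = leading_eigvec(rng.multivariate_normal(np.zeros(d), cov_out, size=200))
stat = np.linalg.norm(monitor - v0)
print(f"statistic {stat:.3f} vs control limit {ucl:.3f} -> signal: {stat > ucl}")
```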



Risk Comparison and Planning for Bayesian Assurance Tests
Hyoshin Kim (North Carolina State University)

Designing a Bayesian assurance test plan requires choosing a test plan that guarantees a product of interest is good enough to satisfy the consumer’s criteria, but not ‘so good’ that it causes the producer concern should it fail the test. Bayesian assurance tests are especially useful because they can incorporate previous product information in the test planning and explicitly control levels of risk for the consumer and producer. We demonstrate an algorithm for efficiently computing a test plan given desired levels of risk in binomial and exponential testing. Numerical comparisons with the Operating Characteristic (OC) curve, Probability Ratio Sequential Test (PRST), and a simulation-based Bayesian sample size determination approach are also considered.
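As a rough illustration of the risk calculations involved, the sketch below evaluates average consumer and producer risk for a binomial test plan under a Beta prior by Monte Carlo; the prior, requirement, and plan are invented and the talk's actual algorithm is not reproduced.

```python
# Minimal sketch of evaluating average consumer and producer risk for a
# binomial assurance test plan (n trials, accept if at most c failures) under
# a Beta prior on reliability. All numbers are illustrative.
import numpy as np
from scipy.stats import beta, binom

rng = np.random.default_rng(0)
prior = beta(a=8, b=2)           # prior belief about reliability from earlier testing
p_req, p_goal = 0.80, 0.90       # consumer's requirement and producer's design goal
n, c = 30, 3                     # candidate test plan: 30 trials, accept if <= 3 failures

p = prior.rvs(size=100_000, random_state=rng)   # reliabilities drawn from the prior
accept_prob = binom.cdf(c, n, 1 - p)            # P(accept | reliability p)

# Consumer risk: chance a below-requirement product is accepted.
# Producer risk: chance a goal-meeting product is rejected.
consumer_risk = np.mean(accept_prob * (p < p_req)) / np.mean(p < p_req)
producer_risk = np.mean((1 - accept_prob) * (p > p_goal)) / np.mean(p > p_goal)
print(f"consumer risk {consumer_risk:.3f}, producer risk {producer_risk:.3f}")
```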



Utilizing Machine Learning Models to Predict Success in Special Operations Assessment
Anna Vinnedge (United States Military Academy)

The 75th Ranger Regiment is an elite Army unit responsible for some of the most physically and mentally challenging missions. Entry to the unit is based on an assessment process called the Ranger Assessment and Selection Program (RASP), which consists of a variety of tests and challenges of strength, intellect, and grit. This study explores the psychological and physical profiles of candidates who attempt to pass RASP. Using a Random Forest artificial intelligence model and a penalized logistic regression model, we identify initial entry characteristics that are predictive of success in RASP. We focus on the differences between racial sub-groups and military occupational specialty (MOS) sub-groups to provide information for recruiters to identify underrepresented groups who are likely to succeed in the selection process.



Convolutional Neural Networks and Semantic Segmentation for Cloud and Ice Detection
Prarabdha Ojwaswee Yonzon (United States Military Academy (West Point))

Recent research shows the effectiveness of machine learning for image classification and segmentation. The use of artificial neural networks (ANNs) on image datasets such as the MNIST dataset of handwritten digits is highly effective. However, when presented with more complex images, ANNs and other simple computer vision algorithms tend to fail. This research uses Convolutional Neural Networks (CNNs) to determine how we can differentiate between ice and clouds in imagery of the Arctic. Instead of using ANNs, which analyze the problem in one dimension, CNNs identify features using the spatial relationships between the pixels in an image. This technique allows us to extract spatial features, yielding higher accuracy. Using a CNN named the Cloud-Net Model, we analyze how a CNN performs when analyzing satellite images. First, we examine recent research on the Cloud-Net Model’s effectiveness on satellite imagery, specifically Landsat data, with four channels: red, green, blue, and infrared. We extend and modify this model so that we can analyze data from the most common channels used by satellites: red, green, and blue. By training on different combinations of these three channels, we extend this analysis by testing on an entirely different data set: GOES imagery. This gives us an understanding of the impact of each individual channel on image classification. By selecting GOES images that cover the same geographic locations as the Landsat data and contain both ice and clouds, we test the CNN’s generalizability. Finally, we present the CNN’s ability to accurately identify clouds and ice in the GOES data versus the Landsat data.



Bayesian Estimation for Covariate Defect Detection Model Based on Discrete Cox Proportional Hazards Model
Priscila Silva (University of Massachusetts Dartmouth)

Traditional methods to assess software characterize the defect detection process as a function of testing time or effort to quantify failure intensity and reliability. More recent innovations include models incorporating covariates that explain defect detection in terms of underlying test activities. These covariate models are elegant and only introduce a single additional parameter per testing activity. However, the model forms typically exhibit a high degree of non-linearity. Hence, stable and efficient model fitting methods are needed to enable widespread use by the software community, which often lacks mathematical expertise. To overcome this limitation, this poster presents Bayesian estimation methods for covariate models, including the specification of informed priors as well as confidence intervals for the mean value function and failure intensity, which often serves as a metric of software stability. The proposed approach is compared to traditional alternatives such as maximum likelihood estimation. Our results indicate that Bayesian methods with informed priors converge most quickly and achieve the best model fits. Incorporating these methods into tools should therefore encourage widespread use of the models to quantitatively assess software.



Combining data from scanners to inform cadet physical performance
Nicholas Ashby (United States Military Academy)

Digital anthropometry obtained from 3D body scanners has already revolutionized the clothing and fitness industries. Within seconds, these scanners collect hundreds of anthropometric measurements which are used by tailors to customize an article of clothing or by fitness trainers to track their clients’ progress towards a goal. Three-dimensional body scanners have also been used in military applications, such as predicting injuries at Army basic training and checking a soldier’s compliance with body composition standards. In response to this increased demand, several 3D body scanners have become commercially available, each with a proprietary algorithm for measuring specific body parts. Individual scanners may suffice to collect measurements from a small population; however, they are not practical for use in creating the large data sets necessary to train artificial intelligence (AI) or machine learning algorithms. This study fills the gap between these two applications by correlating body circumferences taken from a small population (n = 109) on three different body scanners and creating a standard scale for pooling data from the different scanners into one large AI-ready data set. This data set is then leveraged in a separate application to understand the relationship between body shape and performance on the Army Combat Fitness Test (ACFT).




Room D



Virtual Session

Mini-Tutorial 3: Introducing git for reproducible research


Session Chair: Matthew Avery, IDA

Introducing git for reproducible research
Curtis Miller (IDA)

Version control software manages different versions of files, providing an archive of files, a means to manage multiple versions of a file, and, in many cases, distribution. Perhaps the most popular version control program in the computer science community is git, which serves as the backbone for websites such as GitHub, Bitbucket, and others. In this mini-tutorial we will introduce the basics of version control in general and of git in particular. We explain what role git plays in a reproducible research context. The goal of the course is to get participants started using git. We will create and clone repositories, add and track files in a repository, and manage git branches. We also discuss a few git best practices.
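Because the tutorial is a hands-on how-to, a compact sketch of the workflow it describes is included below. The git commands (init, add, commit, checkout, log) are standard; driving them from Python via subprocess is only to keep this document's examples in a single language, and the repository contents are made up.

```python
# Sketch of the basic git workflow covered in the tutorial, driven from Python
# so the document's examples stay in one language. The git commands themselves
# are standard; the file name and commit message are invented.
import pathlib
import subprocess
import tempfile

def git(*args, cwd):
    """Run a git command in the given repository and fail loudly on errors."""
    subprocess.run(["git", *args], cwd=cwd, check=True)

repo = pathlib.Path(tempfile.mkdtemp())
git("init", cwd=repo)                                   # create a repository
git("config", "user.name", "Analyst", cwd=repo)         # local identity for commits
git("config", "user.email", "analyst@example.com", cwd=repo)

(repo / "analysis.R").write_text("# reproducible analysis script\n")
git("add", "analysis.R", cwd=repo)                      # start tracking the file
git("commit", "-m", "Add analysis script", cwd=repo)    # record a version
git("checkout", "-b", "draft-figures", cwd=repo)        # create and switch to a branch
git("log", "--oneline", "--all", cwd=repo)              # inspect the history
```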




5:00 PM – 7:00 PM

Café

5:00 PM – 7:00 PM

Reception and Poster Session


April 28th Agenda

7:30 AM – 8:30 AM

7:30 AM – 8:30 AM

Registration / Check-in

8:30 AM – 8:40 AM

Room A+B

Virtual Session

8:30 AM – 8:40 AM

Opening Remarks

Bram Lillard

(IDA)


V. Bram Lillard assumed the role of director of the Operational Evaluation Division (OED) in early 2022. In this position, Bram provides strategic leadership, project oversight, and direction for the division’s research program, which primarily supports the Director, Operational Test and Evaluation (DOT&E) within the Office of the Secretary of Defense. He also oversees OED’s contributions to strategic studies, weapon system sustainment analyses, and cybersecurity evaluations for DOD and anti-terrorism technology evaluations for the Department of Homeland Security. Bram joined IDA in 2004 as a member of the research staff. In 2013-14, he was the acting science advisor to DOT&E. He then served as OED’s assistant director in 2014-21, ascending to deputy director in late 2021. Prior to his current position, Bram was embedded in the Pentagon where he led IDA’s analytical support to the Cost Assessment and Program Evaluation office within the Office of the Secretary of Defense. He previously led OED’s Naval Warfare Group in support of DOT&E. In his early years at IDA, Bram was the submarine warfare project lead for DOT&E programs. He is an expert in quantitative data analysis methods, test design, naval warfare systems and operations and sustainment analyses for Defense Department weapon systems. Bram has both a doctorate and a master’s degree in physics from the University of Maryland. He earned his bachelor’s degree in physics and mathematics from State University of New York at Geneseo. Bram is also a graduate of the Harvard Kennedy School’s Senior Executives in National and International Security program, and he was awarded IDA’s prestigious Goodpaster Award for Excellence in Research in 2017.

8:40 AM – 9:20 AM

Room A+B

Virtual Session

8:40 AM – 9:20 AM

Keynote 3

Nickolas Guertin

(Director, Operational Test & Evaluation, OSD/DOT&E)


Nickolas H. Guertin was sworn in as Director, Operational Test and Evaluation on December 20, 2021. A Presidential appointee confirmed by the United States Senate, he serves as the senior advisor to the Secretary of Defense on operational and live fire test and evaluation of Department of Defense weapon systems.

Mr. Guertin has an extensive four-decade combined military and civilian career in submarine operations, ship construction and maintenance, development and testing of weapons, sensors, combat management products including the improvement of systems engineering, and defense acquisition. Most recently, he has performed applied research for government and academia in software-reliant and cyber-physical systems at Carnegie Mellon University’s Software Engineering Institute.

Over his career, he has been in leadership of organizational transformation, improving competition, application of modular open system approaches, as well as prototyping and experimentation. He has also researched and published extensively on software-reliant system design, testing and acquisition. He received a BS in Mechanical Engineering from the University of Washington and an MBA from Bryant University. He is a retired Navy Reserve Engineering Duty Officer, was Defense Acquisition Workforce Improvement Act (DAWIA) certified in Program Management and Engineering, and is also a registered Professional Engineer (Mechanical).

Mr. Guertin is involved with his community as an Assistant Scoutmaster and Merit Badge Counselor for two local Scouts BSA troops as well as being an avid amateur musician. He is a native of Connecticut and now resides in Virginia with his wife and twin children.

9:30 AM – 10:30 AM Panel Session

Room A+B



Virtual Session

Featured Panel: Evolving Data-Centric Organizations

For organizations to make data-driven decisions, they must be able to understand and organize their mission critical data.  Recently, the DoD, NASA, and other federal agencies have declared their intention to become “data-centric” organizations, but transitioning from an existing mode of operation and architecture can be challenging.  Moreover, the DoD is pushing for artificial intelligence enabled systems (AIES) and wide-scale digital transformation.  These concepts seem straightforward in the abstract, but because they can only evolve when people, processes, and technology change together, they have proven challenging in execution.  Since the structure and quality of an organization’s data limit what the organization can do with that data, it is imperative to get data processes right before embarking on other initiatives that depend on quality data. Despite the importance of data quality, many organizations treat data architecture as an emergent phenomenon rather than something to be planned or thought through holistically. In this discussion, panelists will explore what it means to be data-centric, what a data-centric architecture is, how it differs from other data architectures, why an organization might prefer a data-centric approach, and the challenges associated with becoming data-centric.

Panelist 1
Heather Wojton (IDA)
Panelist 2
Laura Freeman (Virginia Tech)
Panelist 3
Jane Pinelis (Joint Artificial Intelligence Center)
Panelist 4
Calvin Robinson (NASA)
Moderator
Matthew Avery (IDA)

10:30 AM – 10:50 AM

10:30 AM – 10:50 AM

Break

10:50 AM – 12:20 PM: Parallel Sessions

Room A



Virtual Session

Session 4A: Statistical Engineering


Session Chair: Pete Parker, NASA

Building Bridges: a Case Study of Assisting a Program from the Outside
Anthony Sgambellone (Huntington Ingalls Industries)

STAT practitioners often find themselves outsiders to the programs they assist. This session presents a case study that demonstrates some of the obstacles in communication of capabilities, purpose, and expectations that may arise due to approaching the project externally. Incremental value may open the door to greater collaboration in the future, and this presentation discusses potential solutions to provide greater benefit to testing programs in the face of obstacles that arise due to coming from outside the program team. DISTRIBUTION STATEMENT A. Approved for public release; distribution is unlimited. CLEARED on 5 Jan 2022. Case Number: 88ABW-2022-0002



STAT and UQ Implementation Lessons Learned
Kelsey Cannon (Lockheed Martin)

David Harrison and Kelsey Cannon from Lockheed Martin Space will present on STAT and UQ implementation lessons learned within Lockheed Martin. Faced with training 60,000 engineers in statistics, David and Kelsey formed a plan to make STAT and UQ processes the standard at Lockheed Martin. The presentation includes a range of information from initial communications plan, to obtaining leader adoption, to training engineers across the corporation. Not all programs initially accepted this process, but implementation lessons have been learned over time as many compounding successes and savings have been recorded. ©2022 Lockheed Martin, all rights reserved



An Overview of NASA’s Low Boom Flight Demonstration
Jonathan Rathsam (NASA Langley Research Center)

NASA will soon begin a series of tests that will collect nationally representative data on how people perceive low noise supersonic overflights. For half a century, civilian aircraft have been required to fly slower than the speed of sound over land to prevent “creating an unacceptable situation” on the ground due to sonic booms. However, new aircraft shaping techniques have led to dramatic changes in how shockwaves from supersonic flight merge together as they travel to the ground. What used to sound like a boom on the ground will be transformed into a thump. NASA is now building a full-scale, piloted demonstration aircraft called the X-59 to demonstrate low noise supersonic flight. In 2024, the X-59 aircraft will commence a national series of community overflight tests to collect data on how people perceive “sonic thumps.” The community response data will be provided to national and international noise regulators as they consider creating new standards that allow supersonic flight over land at acceptably low noise levels.




Room B



Virtual Session

Session 4B: Big Data and Cloud-Based Tools for T&E


Session Chair: Breeana Anderson, IDA

TRMC Big Data Analytics Investments & Technology Review
Edward Powell (Test Resource Management Center)

To properly test and evaluate today’s advanced military systems, the T&E community must utilize big data analytics (BDA) and techniques to quickly process, visualize, understand, and report on massive amounts of data. This presentation will inform the audience how to transform the current T&E data infrastructure and analysis techniques to one employing enterprise BDA and Knowledge Management (BDKM) that supports the current warfighter T&E needs and the developmental and operational testing of future weapon platforms. The TRMC enterprise BDKM will improve acquisition efficiency, keep up with the rapid pace of acquisition technological advancement, and ensure that effective weapon systems are delivered to warfighters at the speed of relevance – all while enabling T&E analysts across the acquisition lifecycle to make better and faster decisions using data previously inaccessible or unusable. This capability encompasses a big data architecture framework – its supporting resources, methodologies, and guidance – to properly address the current and future data needs of systems testing and analysis, as well as an implementation framework, the Cloud Hybrid Edge-to-Enterprise Evaluation and Test Analysis Suite (CHEETAS). In combination with the TRMC’s Joint Mission Environment Test Capability (JMETC), which provides readily-available connectivity to the Services’ distributed test capabilities and simulations, the TRMC has demonstrated that applying enterprise-distributed BDA tools and techniques to distributed T&E leads to faster and more informed decision-making – resulting in reduced overall program cost and risk.



Cloud Computing for Computational Fluid Dynamics (CFD) in T&E
Neil Ashton (Amazon Web Services)

In this talk we’ll focus on exploring the motivation for using cloud computing for Computational Fluid Dynamics (CFD) in Federal Government Test & Evaluation. Using examples from automotive, aerospace, and manufacturing, we’ll look at benchmarks for a number of CFD codes using CPUs (x86 & Arm) and GPUs, and at how the development of high-fidelity CFD, e.g. WMLES and HRLES, is accelerating the need for access to large-scale HPC. The onset of COVID-19 has also meant a large increase in the need for remote visualization, with greater numbers of researchers and engineers needing to work from home. This has also accelerated the adoption of the same approaches needed for the pre- and post-processing of peta/exa-scale CFD simulation, and we’ll look at how these are more easily accessed via a cloud infrastructure. Finally, we’ll explore perspectives on integrating ML/AI into CFD workflows using data lakes from a range of sources, and where the next decade may take us.



Leveraging Data Science and Cloud Tools to Enable Continuous Reporting
Timothy Dawson (AFOTEC Detachment 5)

The DoD’s challenge to provide test results at the “Speed of Relevance” has generated many new strategies to accelerate data collection, adjudication, and analysis. As a result, the Air Force Operational Test and Evaluation Center (AFOTEC), in conjunction with the Air Force Chief Data Office’s Visible, Accessible, Understandable, Linked and Trusted Data Platform (VAULT), is developing a Survey Application. This new cloud-based application will be deployable on any AFNET-connected computer or tablet and merges a variety of tools for collection, storage, analytics, and decision-making into one easy-to-use platform. By placing cloud-computing power in the hands of operators and testers, authorized users can view report-quality visuals and statistical analyses the moment a survey is submitted. Because the data is stored in the cloud, demanding computations such as machine learning are run at the data source to provide even more insight into both quantitative and qualitative metrics. The T-7A Red Hawk will be the first operational test (OT) program to utilize the Survey Application. Over 1000 flying and simulator test points have been loaded into the application, with many more coming from developmental test partners. The Survey app development will continue as USAF testing commences. Future efforts will focus on making the Survey Application configurable to other research and test programs to enhance their analytic and reporting capabilities.




Room C

Session 4C: Machine Learning and Dynamic Programming Topics


Session Chair: Jay Dennis, IDA

Legal, Moral, and Ethical Implications of Machine Learning
Alan B. Gelder (IDA)

Machine learning algorithms can help to distill vast quantities of information to support decision making. However, machine learning also presents unique legal, moral, and ethical concerns – ranging from potential discrimination in personnel applications to misclassifying targets on the battlefield. Building on foundational principles in ethical philosophy, this presentation summarizes key legal, moral, and ethical criteria applicable to machine learning and provides pragmatic considerations and recommendations.



Forecasting with Machine Learning
Akshay Jain (IDA)

The Department of Defense (DoD) has a considerable interest in forecasting key quantities of interest including demand signals, personnel flows, and equipment failure. Many forecasting tools exist to aid in predicting future outcomes, and there are many methods to evaluate the quality and uncertainty in those forecasts. When used appropriately, these methods can facilitate planning and lead to dramatic reductions in costs. This talk explores the application of machine learning algorithms, specifically gradient-boosted tree models, to forecasting and presents some of the various advantages and pitfalls of this approach. We conclude with an example where we use gradient-boosted trees to forecast Air National Guard personnel retention.
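A minimal sketch of the general approach (lagged values as features for a gradient-boosted tree regressor) appears below; the series is synthetic and the talk's retention data and feature set are not reproduced.

```python
# Minimal sketch of forecasting with gradient-boosted trees using lagged
# values as features, evaluated on a simple holdout period.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
t = np.arange(240)
y = 100 + 10 * np.sin(2 * np.pi * t / 12) + 0.1 * t + rng.normal(0, 2, t.size)

df = pd.DataFrame({"y": y})
for lag in (1, 2, 3, 12):                 # lag features carry trend and seasonality
    df[f"lag_{lag}"] = df["y"].shift(lag)
df = df.dropna()

train, test = df.iloc[:-24], df.iloc[-24:]
features = [c for c in df.columns if c.startswith("lag_")]

model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=3)
model.fit(train[features], train["y"])

pred = model.predict(test[features])      # one-step-ahead forecasts on the holdout
mae = np.mean(np.abs(pred - test["y"]))
print(f"holdout MAE: {mae:.2f}")
```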



Structural Dynamic Programming Methods for DOD Research
Mikhail Smirnov (IDA)

Structural dynamic programming models are a powerful tool to help guide policy under uncertainty. By creating a mathematical representation of the intertemporal optimization problem of interest, these models can answer questions that static models cannot address. Applications can be found from military personnel policy (how does future compensation affect retention now?) to inventory management (how many aircraft are needed to meet readiness objectives?). Recent advances in statistical methods and computational algorithms allow us to develop dynamic programming models of complex real-world problems that were previously too difficult to solve.
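To make the idea concrete, the sketch below solves a small, invented inventory problem by value iteration, the basic computational engine behind many structural dynamic programs; the costs, demand probabilities, and discount factor are illustrative only.

```python
# Minimal sketch of a structural dynamic program solved by value iteration:
# how many spare parts to order each period, trading holding and ordering cost
# against the readiness penalty of a stockout.
import numpy as np

max_stock, order_cap = 10, 5
hold_cost, stockout_cost, order_cost = 1.0, 50.0, 4.0
demand_p = np.array([0.3, 0.4, 0.2, 0.1])        # P(demand = 0, 1, 2, 3)
beta = 0.95                                       # discount factor

V = np.zeros(max_stock + 1)
for _ in range(500):                              # value iteration to convergence
    V_new = np.empty_like(V)
    for s in range(max_stock + 1):
        best = -np.inf
        for a in range(min(order_cap, max_stock - s) + 1):
            value = 0.0
            for d, p in enumerate(demand_p):
                next_s = max(s + a - d, 0)
                reward = -(hold_cost * next_s
                           + stockout_cost * max(d - s - a, 0)
                           + order_cost * a)
                value += p * (reward + beta * V[next_s])
            best = max(best, value)
        V_new[s] = best
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print(V)   # value of each starting stock level under the optimal policy
```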




Room D



Virtual Session

Mini-Tutorial 4: Survey Dos and Don’ts


Session Chair: Brian Vickers, IDA

Survey Dos and Don’ts
Gina Sigler & Alex (Mary) McBride (Scientific Test and Analysis Techniques Center of Excellence (STAT COE))

How many surveys have you been asked to fill out? How many did you actually complete? Why those surveys? Did you ever feel like the answer you wanted to mark was missing from the list of possible responses? Surveys can be a great tool for data collection if they are thoroughly planned out and well-designed. They are a relatively inexpensive way to collect a large amount of data from hard to reach populations. However, if they are poorly designed, the test team might end up with a lot of data and little to no information. Join the STAT COE for a short tutorial on the dos and don’ts of survey design and analysis. We’ll point out the five most common survey mistakes, compare and contrast types of questions, discuss the pros and cons for potential analysis methods (such as descriptive statistics, linear regression, principal component analysis, factor analysis, hypothesis testing, and cluster analysis), and highlight how surveys can be used to supplement other sources of information to provide value to an overall test effort. DISTRIBUTION STATEMENT A. Approved for public release; distribution is unlimited. CLEARED on 5 Jan 2022. Case Number: 88ABW-2022-0003




12:20 PM – 1:30 PM

12:20 PM – 1:30 PM

Lunch

1:30 PM – 3:00 PM: Parallel Sessions

Room A



Virtual Session

Session 5A: Analysis Tools for T&E


Session Chair: Denise Edwards, IDA

Orbital Debris Effects Prediction Tool for Satellite Constellations
Joel Williamsen (IDA)

Based on observations gathered from the IDA Forum on Orbital Debris (OD) Risks and Challenges (October 8-9, 2020), DOT&E needed first-order predictive tools to evaluate the effects of orbital debris on mission risk, catastrophic collision, and collateral damage to DOD spacecraft and other orbital assets – either from unintentional or intentional [Anti-Satellite (ASAT)] collisions. This lack of modeling capability hindered DOT&E’s ability to evaluate the risk to operational effectiveness and survivability of individual satellites and large constellations, as well as risks to the overall use of space assets in the future. Part 1 of this presentation describes an IDA-derived Excel-based tool (SatPen) for determining the probability and mission effects of >1mm orbital debris impacts and penetration on individual satellites in low Earth orbit (LEO). IDA estimated the likelihood of satellite mission loss using a Starlink-like satellite as a case study and NASA’s ORDEM 3.1 orbital debris environment as an input, supplemented with typical damage prediction equations to support mission loss predictions. Part 2 of this presentation describes an IDA-derived technique (DebProp) to evaluate the debris-propagating effects of large, trackable debris (>5 cm) or antisatellite weapons colliding with satellites within constellations. IDA researchers again used a Starlink-like satellite as a case study and worked with Stellingwerf Associates to modify the Smooth Particle Hydrodynamic Code (SPHC) in order to predict the number and direction of fragments following a collision by a tracked satellite fragment. The result is a file format that is readable as an input for predicting orbital stability or debris re-entry for the thousands of created particles. By pairing these techniques, IDA can predict additional short-term and long-term OD-induced losses to other satellites in the constellation and conduct long-term debris growth studies.



Using Sensor Stream Data as Both an Input and Output in a Functional Data Analysis
Thomas A Donnelly (JMP Statistical Discovery LLC)

A case study will be presented where patients wearing continuous glycemic monitoring systems provide sensor stream data of their glucose levels before and after consuming 1 of 5 different types of snacks. The goal is to be able to better predict a new patient’s glycemic-response-over-time trace after being given a particular type of snack. Functional Data Analysis (FDA) is used to extract eigenfunctions that capture the longitudinal shape information of the traces and principal component scores that capture the patient-to-patient variation. FDA is used twice. First it is used on the “before” baseline glycemic-response-over-time traces. Then a separate analysis is done on the snack-induced “after” response traces. The before FPC scores and the type of snack are then used to model the after FPC scores. This final FDA model will then be used to predict the glycemic response of new patients given a particular snack and their existing baseline response history. Although the case study is for medical sensor data, the methodology employed would work for any sensor stream where an event perturbs the system thus affecting the shape of the sensor stream post event.
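The two-stage idea described above can be sketched with a simple functional principal component analysis via the singular value decomposition of centered curves, followed by a regression of the after-trace scores on the before-trace scores and snack type. The data below are simulated and the clinical details are not reproduced.

```python
# Minimal sketch: FPC scores for "before" and "after" traces via SVD of the
# centered curves, then a regression of after-scores on before-scores + snack.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_patients, n_times = 60, 96

def fpca_scores(curves, n_components=3):
    """Return FPC scores (rows = patients) from the SVD of centered curves."""
    centered = curves - curves.mean(axis=0)
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

# Simulated baseline traces, snack assignment, and post-snack traces
before = rng.normal(100, 10, (n_patients, 1)) + rng.normal(0, 3, (n_patients, n_times))
snack = rng.integers(0, 5, n_patients)
after = (before
         + 20 * (snack[:, None] + 1) * np.exp(-np.linspace(0, 4, n_times))
         + rng.normal(0, 3, (n_patients, n_times)))

before_scores = fpca_scores(before)
after_scores = fpca_scores(after)

# Model after-trace FPC scores from before-trace scores and a snack indicator
X = np.column_stack([before_scores, np.eye(5)[snack]])
model = LinearRegression().fit(X, after_scores)
print(model.score(X, after_scores))   # overall fit of the score-level model
```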



Measuring training efficacy: Structural validation of the Operational Assessment of Training Scale
Brian Vickers (IDA)

Effective training of the broad set of users/operators of systems has downstream impacts on usability, workload, and ultimate system performance that are related to mission success. In order to measure training effectiveness, we designed a survey called the Operational Assessment of Training Scale (OATS) in partnership with the Army Test and Evaluation Center (ATEC). Two subscales were designed to assess the degrees to which training covered relevant content for real operations (Relevance subscale) and enabled self-rated ability to interact with systems effectively after training (Efficacy subscale). The full list of 15 items were given to over 700 users/operators across a range of military systems and test events (comprising both developmental and operational testing phases). Systems included vehicles, aircraft, C3 systems, and dismounted squad equipment, among other types. We evaluated reliability of the factor structure across these military samples using confirmatory factor analysis. We confirmed that OATS exhibited a two-factor structure for training relevance and training efficacy. Additionally, a shortened, six-item measure of the OATS with three items per subscale continues to fit observed data well, allowing for quicker assessments of training. We discuss various ways that the OATS can be applied to one-off, multi-day, multi-event, and other types of training events. Additional OATS details and information about other scales for test and evaluation are available at the Institute for Defense Analyses’ web site, https://testscience.org/validated-scales-repository/.




Room B



Virtual Session

Session 5B: DOE Applications


Session Chair: Dominik Alder, Lockheed Martin

Experiment Design and Visualization Techniques for an X-59 Low-boom Variability Study
William J Doebler (NASA Langley Research Center)

This presentation outlines the design of experiments approach and data visualization techniques for a simulation study of sonic booms from NASA’s X-59 supersonic aircraft. The X-59 will soon be flown over communities across the contiguous USA as it produces a low-loudness sonic boom, or low-boom. Survey data on human perception of low-booms will be collected to support development of potential future commercial supersonic aircraft noise regulatory standards. The macroscopic atmosphere plays a critical role in the loudness of sonic booms. The extensive sonic boom simulation study presented herein was completed to assess climatological, geographical, and seasonal effects on the variability of the X-59’s low-boom loudness and noise exposure region size in order to inform X-59 community test planning. The loudness and extent of the noise exposure region make up the “sonic boom carpet.” Two spatial and temporal resolutions of atmospheric input data to the simulation were investigated. A Fast Flexible Space-Filling Design was used to select the locations across the USA for the two spatial resolutions. Analysis of simulated X-59 low-boom loudness data within a regional subset of the northeast USA was completed using a bootstrap forest to determine the final spatial and temporal resolution of the countrywide simulation study. Atmospheric profiles from NOAA’s Climate Forecast System Version 2 database were used to generate over one million simulated X-59 carpets at the final selected 138 locations across the USA. Effects of aircraft heading, season, geography, and climate zone on low-boom levels and noise exposure region size were analyzed. Models were developed to estimate loudness metrics throughout the USA for X-59 supersonic cruise overflight, and results were visualized on maps to show geographical and seasonal trends. These results inform regulators and mission planners on expected variations in boom levels and carpet extent from atmospheric variations. Understanding potential carpet variability is important when planning community noise surveys using the X-59.
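As a small illustration of the space-filling idea (and not the Fast Flexible Space-Filling Design actually used in the study), a Latin hypercube selection of candidate study locations over an approximate bounding box could look like the following; the bounds are illustrative and the site count mirrors the 138 locations mentioned above.

```python
# Minimal sketch of a space-filling selection of study locations over a
# longitude/latitude box, using a Latin hypercube as a generic stand-in for
# the Fast Flexible Space-Filling Design named in the abstract.
from scipy.stats import qmc

n_sites = 138
sampler = qmc.LatinHypercube(d=2, seed=7)
unit_points = sampler.random(n=n_sites)

# Scale unit-square points to an approximate contiguous-USA bounding box
lon_lat = qmc.scale(unit_points, l_bounds=[-124.7, 24.5], u_bounds=[-66.9, 49.4])
print(lon_lat[:5])   # candidate (longitude, latitude) study locations
```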



Case Study on Applying Sequential Methods in Operational Testing
Keyla Pagán-Rivera (IDA)

Sequential methods concern statistical evaluation in which the number, pattern, or composition of the data is not determined at the start of the investigation, but instead depends on the information acquired during the investigation. Although sequential methods originated in ballistics testing for the Department of Defense (DoD), they are underutilized in the DoD. Expanding the use of sequential methods may save money and reduce test time. In this presentation, we introduce sequential methods, describe their potential uses in operational test and evaluation (OT&E), and present a method for applying them to the test and evaluation of defense systems. We evaluate the proposed method by performing simulation studies and applying the method to a case study. Additionally, we discuss some of the challenges we might encounter when using sequential analysis in OT&E.
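One classical example of the family of methods discussed above is Wald's sequential probability ratio test, sketched below for a success probability; the hypotheses and error rates are illustrative and are not taken from the case study.

```python
# Minimal sketch of Wald's sequential probability ratio test (SPRT) for a
# success probability: stop as soon as the evidence crosses either boundary.
import numpy as np

p0, p1 = 0.80, 0.90           # unacceptable and acceptable success probabilities
alpha, beta_err = 0.05, 0.10  # allowed type I and type II error rates
A = np.log((1 - beta_err) / alpha)   # upper boundary (decide for p1)
B = np.log(beta_err / (1 - alpha))   # lower boundary (decide for p0)

def sprt(trials):
    """Return the decision and sample size for a stream of 0/1 trial outcomes."""
    llr = 0.0
    for i, x in enumerate(trials, start=1):
        llr += np.log(p1 / p0) if x else np.log((1 - p1) / (1 - p0))
        if llr >= A:
            return "accept p1", i
        if llr <= B:
            return "accept p0", i
    return "continue testing", len(trials)

rng = np.random.default_rng(0)
print(sprt(rng.random(200) < 0.92))   # a system performing near p1 tends to stop early
```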



A New Method for Planning Full-Up System-Level (FUSL) Live Fire Tests
Lindsey Butler (IDA)

Planning Full-Up System-Level (FUSL) Live Fire tests is a complex process that has historically relied solely on subject matter expertise. In particular, there is no established method to determine the appropriate number of FUSL tests necessary for a given program. We developed a novel method that is analogous to the Design of Experiments process that is used to determine the scope of Operational Test events. Our proposed methodology first requires subject matter experts (SMEs) to define all potential FUSL shots. For each potential shot, SMEs estimate the severity of that shot, the uncertainty of that severity estimate, and the similarity of that shot to all other potential shots. We developed a numerical optimization algorithm that uses the SME inputs to generate a prioritized list of FUSL events and a corresponding plot of the total information gained with each successive shot. Together, these outputs can help analysts determine the adequate number of FUSL tests for a given program. We illustrate this process with an example on a notional ground vehicle. Future work is necessary prior to implementation on a program of record.




Room D



Virtual Session

Mini-Tutorial 5: Data Visualization


Session Chair: Jay Wilkins, IDA

An Introduction to Data Visualization
Christina Heinich (NASA)

Data visualization can be used to present findings, explore data, and use the human eye to find patterns that a computer would struggle to locate. Borrowing tools from art, storytelling, data analytics and software development, data visualization is an indispensable part of the analysis process. While data visualization usage spans across multiple disciplines and sectors, most never receive formal training in the subject. As such, this tutorial will introduce key data visualization building blocks and how to best use those building blocks for different scenarios and audiences. We will also go over tips on accessibility, design and interactive elements. While this will by no means be a complete overview of the data visualization field, by building a foundation and introducing some rules of thumb, attendees will be better equipped for communicating their findings to their audience.




3:00 PM – 3:20 PM

3:00 PM – 3:20 PM

Break

3:20 PM – 4:20 PM: Parallel Sessions

Room A



Virtual Session

Session 6A: Applications of Deep Learning and Monte Carlo Analysis


Session Chair: Tom Donnelly, JMP

Deep learning aided inspection of additively manufactured metals
Brendan Croom (JHU Applied Physics Laboratory)

The performance and reliability of additively manufactured (AM) metals is limited by the ubiquitous presence of void- and crack-like defects that form during processing. Many applications require non-destructive evaluation of AM metals to detect potentially critical flaws. To this end, we propose a deep learning approach that can help with the interpretation of inspection reports. Convolutional neural networks (CNNs) are developed to predict the elastic stress fields in images of defect-containing metal microstructures, and therefore directly identify critical defects. A large dataset consisting of the stress response of 100,000 random microstructure images is generated using high-resolution Fast Fourier Transform-based finite element (FFT-FE) calculations, which is then used to train a modified U-Net style CNN model. The trained U-Net model more accurately predicted the stress response compared to previous CNN architectures, exceeded the accuracy of low-resolution FFT-FE calculations, and was evaluated more than 100 times faster than conventional FE techniques. The model was applied to images of real AM microstructures with severe lack-of-fusion defects, and predicted a strong linear increase of maximum stress as a function of pore fraction. This work shows that CNNs can aid the rapid and accurate inspection of defect-containing AM material.



Applications for Monte Carlo Analysis within Job Shop Planning
Dominik Alder (Lockheed Martin, Program Management)

This talk provides a summary overview of discrete event simulation (DES) for optimizing scheduling operations in a high-mix, low-volume job shop environment. The DES model employs Monte Carlo simulation to minimize schedule conflicts and prioritize work, while taking into account competition for limited resources. Iterative simulation balancing to dampen model results and arrive at a globally optimized schedule plan will be contrasted with traditional deterministic scheduling methodologies.
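A minimal sketch of the Monte Carlo idea (simulating uncertain job durations on a shared resource many times and comparing schedule orderings by the risk of missing due dates) follows; the jobs, durations, and due dates are invented and the Lockheed Martin model is not reproduced.

```python
# Minimal sketch: Monte Carlo comparison of two priority orders on a single
# shared resource, scoring each by the probability of missing due dates
# rather than by a single deterministic duration estimate.
import numpy as np

rng = np.random.default_rng(0)
jobs = {                      # (min, mode, max) duration in days, due date in days
    "A": ((2, 3, 6), 8),
    "B": ((1, 2, 4), 6),
    "C": ((3, 5, 9), 15),
}

def miss_rates(order, n_sims=20_000):
    """Probability each job misses its due date under a given priority order."""
    misses = {j: 0 for j in order}
    for _ in range(n_sims):
        clock = 0.0
        for j in order:                      # one job at a time on the shared resource
            (lo, mode, hi), due = jobs[j]
            clock += rng.triangular(lo, mode, hi)
            misses[j] += clock > due
    return {j: m / n_sims for j, m in misses.items()}

print(miss_rates(["A", "B", "C"]))   # deterministic plan order
print(miss_rates(["B", "A", "C"]))   # alternative priority order
```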




Room B



Virtual Session

Session 6B: Bayesian Statistics


Session Chair: John Haman, IDA

A Framework for Using Priors in a Continuum of Testing
Victoria Sieck (Scientific Test & Analysis Techniques Center of Excellence (STAT COE) / Air Force Institute of Technology (AFIT))

A strength of the Bayesian paradigm is that it allows for the explicit use of all available information—to include subject matter expert (SME) opinion and previous (possibly dissimilar) data. While frequentists are constrained to only including data in an analysis (that is to say, only including information that can be observed), Bayesians can easily consider both data and SME opinion, or any other related information that could be constructed. This can be accomplished through the development and use of priors. When prior development is done well, a Bayesian analysis will not only lead to more direct probabilistic statements about system performance, but can result in smaller standard errors around fitted values when compared to a frequentist approach. Furthermore, by quantifying the uncertainty surrounding a model parameter, through the construct of a prior, Bayesians are able to capture the uncertainty across a test space of consideration. This presentation develops a framework for thinking about how different priors can be used throughout the continuum of testing. In addition to types of priors, how priors can change or evolve across the continuum of testing—especially when a system changes (e.g., is modified or adjusted) during phases of testing—will be addressed. Priors that strive to provide no information (reference priors) will be discussed, and will build up to priors that contain available information (informative priors). Informative priors—both those based on institutional knowledge or summaries from databases, as well as those developed based on previous testing data—will be discussed, with a focus on how to consider previous data that is dissimilar in some way, relative to the current test event. What priors might be more common in various phases of testing, types of information that can be used in priors, and how priors evolve as information accumulates will all be discussed.
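To make the progression from reference to informative priors concrete, the sketch below updates both kinds of prior with the same binomial test data under Beta-Binomial conjugacy; the prior parameters and counts are invented for illustration and do not come from the presentation.

```python
# Minimal sketch of how different priors propagate through the same test data:
# a reference (Jeffreys) prior versus an informative prior built from earlier
# testing, both updated by Beta-Binomial conjugacy.
from scipy.stats import beta

successes, failures = 18, 2           # notional current test event

priors = {
    "reference (Jeffreys)": (0.5, 0.5),
    "informative (prior testing, downweighted)": (12.0, 3.0),
}

for name, (a, b) in priors.items():
    post = beta(a + successes, b + failures)
    lo, hi = post.interval(0.80)
    print(f"{name}: posterior mean {post.mean():.3f}, "
          f"80% interval ({lo:.3f}, {hi:.3f})")
```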



Method for Evaluating Bayesian Reliability Models for Developmental Testing
Paul Fanto and David Spalding (IDA)

For analysis of military Developmental Test (DT) data, frequentist statistical models are increasingly challenged to meet the needs of analysts and decision-makers. Bayesian models have the potential to address this challenge. Although there is a substantial body of research on Bayesian reliability estimation, there appears to be a paucity of Bayesian applications to issues of direct interest to DT decision makers. To address this deficiency, this research accomplishes two tasks. First, this work provides a motivating example that analyzes reliability for a notional but representative system. Second, to enable the motivated analyst to apply Bayesian methods, it provides a foundation and best practices for Bayesian reliability analysis in DT. The first task is accomplished by applying Bayesian reliability assessment methods to notional DT lifetime data generated using a Bayesian reliability growth planning methodology (Wayne 2018). The tested system is assumed to be a generic complex system with a large number of failure modes. Starting from the Bayesian assessment methodology of Wayne and Modarres (2015), this work explores the sensitivity of the Bayesian results to the choice of the prior distribution and compares the Bayesian results for the reliability point estimate and uncertainty interval with analogous results from traditional reliability assessment methods. The second task is accomplished by establishing a generic structure for systematically evaluating relevant statistical Bayesian models. First, this structure identifies reliability issues that have been implicit in DT programs, using a structured poll of stakeholders combined with interviews of a selected set of Subject Matter Experts. Second, candidate solutions are identified in the literature. Third, solutions are matched to issues using criteria designed to evaluate the capability of a solution to improve support for decision-makers at critical points in DT programs. The matching process uses a model taxonomy structured according to decisions at each DT phase, plus criteria for model applicability and data availability. The end result is a generic structure that allows an analyst to identify and evaluate a specific model for use with a program and issue of interest. Wayne, Martin. 2018. “Modeling Uncertainty in Reliability Growth Plans.” 2018 Annual Reliability and Maintainability Symposium (RAMS). 1-6. Wayne, Martin, and Mohammad Modarres. 2015. “A Bayesian Model for Complex System Reliability.” IEEE Transactions on Reliability 64: 206-220.




Room D



Virtual Session

Session 6D: Disease and Mental Health


Session Chair: Joe Warfield, JHU/APL

Stochastic Modeling and Characterization of a Wearable-Sensor-Based Surveillance Network
Jane E. Valentine (Johns Hopkins University Applied Physics Laboratory)

Current disease outbreak surveillance practices reflect underlying delays in the detection and reporting of disease cases, relying on individuals who present symptoms to seek medical care and enter the health care system. To accelerate the detection of outbreaks resulting from possible bioterror attacks, we introduce a novel two-tier, human sentinel network (HSN) concept composed of wearable physiological sensors capable of pre-symptomatic illness detection, which prompt individuals to enter a confirmatory stage where diagnostic testing occurs at a certified laboratory. Both the wearable alerts and test results are reported automatically and immediately to a secure online platform via a dedicated application. The platform aggregates the information and makes it accessible to public health authorities. We evaluated the HSN against traditional public health surveillance practices for outbreak detection of 80 Bacillus anthracis (Ba) release scenarios in mid-town Manhattan, NYC. We completed an end-to-end modeling and analysis effort, including the calculation of anthrax exposures and doses based on computational atmospheric modeling of release dynamics, and development of a custom-built probabilistic model to simulate resulting wearable alerts, diagnostic test results, symptom onsets, and medical diagnoses for each exposed individual in the population. We developed a novel measure of network coverage, formulated new metrics to compare the performance of the HSN to public health surveillance practices, completed a Design of Experiments to optimize the test matrix, characterized the performant trade-space, and performed sensitivity analyses to identify the most important engineering parameters. Our results indicate that a network covering greater than ~10% of the population would yield approximately a 24-hour time advantage over public health surveillance practices in identifying outbreak onset, and provide a non-target-specific indication (in the form of a statistically aberrant number of wearable alerts) of approximately 36-hours; these earlier detections would enable faster and more effective public health and law enforcement responses to support incident characterization and decrease morbidity and mortality via post-exposure prophylaxis.



The Mental Health Impact of Local COVID-19 Cases Prior to the Mass Availability of Vaccine
Zachary Szlendak (IDA)

During the COVID-19 pandemic, the majority of Americans experienced many new mental health stressors, including isolation, economic instability, fear of exposure to COVID-19, and the effects of themselves or loved ones catching COVID-19. Service members, veterans, and their families experienced these stressors differently from the general public. In this seminar we examine how local COVID-19 case counts affected mental health outcomes prior to the mass availability of vaccines. We show that households we identify as likely military households, along with TRICARE and Military Health System beneficiaries, reported higher mental health quality than their general population peers, but VA beneficiaries did not. We find that local case counts are an important factor in determining which demographic groups reported drops in mental health during the pandemic.




4:20 PM – 4:40 PM

Room A+B

Virtual Session

4:20 PM – 4:40 PM

Awards

4:40 PM – 4:50 PM

Room A+B

Virtual Session

4:40 PM – 4:50 PM

Closing Remarks

Alyson Wilson

(NCSU)


Dr. Alyson Wilson is the Associate Vice Chancellor for National Security and Special Research Initiatives at North Carolina State University. She is also a professor in the Department of Statistics and Principal Investigator for the Laboratory for Analytic Sciences. Her areas of expertise include statistical reliability, Bayesian methods, and the application of statistics to problems in defense and national security. Dr. Wilson is a leader in developing transformative models for rapid innovation in defense and intelligence. Prior to joining NC State, Dr. Wilson was a jointly appointed research staff member at the IDA Science and Technology Policy Institute and Systems and Analyses Center (2011-2013); associate professor in the Department of Statistics at Iowa State University (2008-2011); Scientist 5 and technical lead for Department of Defense Programs in the Statistical Sciences Group at Los Alamos National Laboratory (1999-2008); and senior statistician and operations research analyst with Cowboy Programming Resources (1995-1999). She is currently serving on the National Academy of Sciences Committee on Applied and Theoretical Statistics and on the Board of Trustees for the National Institute of Statistical Sciences. Dr. Wilson is a Fellow of the American Statistical Association, the American Association for the Advancement of Science, and an elected member of the International Statistics Institute.