April 26th Agenda

7:30 AM - 8:30 AM

Check-in

8:30 AM - 8:40 AM

Room A+B

Virtual Session

Opening Remarks

The Honorable Nickolas H. Guertin

(OSD/DOT&E)


Nickolas H. Guertin was sworn in as Director, Operational Test and Evaluation (DOT&E) on December 20, 2021. A Presidential appointee confirmed by the United States Senate, he serves as the senior advisor to the Secretary of Defense on operational and live fire test and evaluation of Department of Defense weapon systems. Mr. Guertin has an extensive four-decade combined military and civilian career in submarine operations; ship construction and maintenance; development and testing of weapons, sensors, and combat management products, including the improvement of systems engineering; and defense acquisition. Most recently, he has performed applied research for government and academia in software-reliant and cyber-physical systems at Carnegie Mellon University’s Software Engineering Institute. Over his career, he has led organizational transformation, improved competition, and increased application of modular open-system approaches, prototyping, and experimentation. He has also researched and published extensively on software-reliant system design, testing, and acquisition. He received a Bachelor of Science in Mechanical Engineering from the University of Washington and an MBA from Bryant University. He is a retired Navy Reserve Engineering Duty Officer, was Defense Acquisition Workforce Improvement Act (DAWIA) certified in Program Management and Engineering, and is also a licensed Professional Engineer (Mechanical). Mr. Guertin is involved with his community as an Assistant Scoutmaster and Merit Badge Counselor for two local Boy Scouts of America troops, and is an avid amateur musician. He is a native of Connecticut and now resides in Virginia with his wife and twin children.

8:40 AM - 9:20 AM

Room A+B

Virtual Session

Keynote 1

Col Sacha Tomlinson

(United States Space Force)


Colonel Sacha Tomlinson is the Test Enterprise Division Chief for Space Training and Readiness Command (STARCOM), Peterson Space Force Base, Colorado. She is also the STARCOM Commander’s Deputy Operational Test Authority. In these two roles, she is responsible for establishing STARCOM’s test enterprise, managing and executing the continuum of test for space systems and capabilities.

Colonel Tomlinson’s background is principally in operational test, having served two assignments with the Air Force Operational Test and Evaluation Center (AFOTEC), first as the Test Director for the Space Based Infrared System's GEO 1 test campaign, and later as AFOTEC Detachment 4's Deputy Commander. She also had two assignments with the 17th Test Squadron in Air Combat Command, first as the Detachment 4 Commander, responsible for testing the Eastern and Western Range Launch and Test Range Systems, and later as Commander of the 17th Test Squadron, which tested major sustainment upgrades for space systems and conducted Tactics Development & Evaluations and the Space Weapon System Evaluation Program.

9:20 AM - 10:00 AM

Room A+B

Virtual Session

Keynote 2

Mr. Peter Coen

(NASA)


Peter Coen currently serves as the Mission Integration Manager for NASA’s Quesst Mission.  His primary responsibility in this role is to ensure that the X-59 aircraft development, in-flight acoustic validation and community test elements of the Mission stay on track toward delivering on NASA’s Critical Commitment to provide quiet supersonic overflight response data to the FAA and the International Civil Aviation Organization.

Previously, Peter was the manager for the Commercial Supersonic Technology Project in NASA’s Aeronautics Research Mission, where he led a team from the four NASA Aero Research Centers in the development of tools and technologies for a new generation of quiet and efficient supersonic civil transport aircraft.

Peter’s NASA career spans almost four decades. During this time, he has studied technology integration in practical designs for many different types of aircraft and has made technical and management contributions to all of NASA’s supersonics-related programs over the past 30 years. As Project Manager, he led these efforts for 12 years.

Peter is a licensed private pilot who has amassed nearly 30 seconds of supersonic flight time.

10:00 AM - 10:20 AM

Break

10:20 AM - 12:20 PM: Parallel Sessions

Room A

Virtual Session
Session 1A: Cybersecurity in Test & Evaluation
Session Chair: Mark Herrera, IDA
  •

    Planning for Public Sector Test and Evaluation in the Commercial Cloud

    Brian Conway (IDA)


    As the public sector shifts IT infrastructure toward commercial cloud solutions, the government test community needs to adjust its test and evaluation (T&E) methods to provide useful insights into a cloud-hosted system’s cyber posture. Government entities must protect what they develop in the cloud by enforcing strict access controls and deploying securely configured virtual assets. However, publicly available research shows that doing so effectively is difficult, with accidental misconfigurations leading to the most commonly observed exploitations of cloud-hosted systems. Unique deployment configurations and identity and access management across different cloud service providers increase the burden of knowledge on testers. More care must be taken during the T&E planning process to ensure that test teams are poised to succeed in understanding the cyber posture of cloud-hosted systems and finding any vulnerabilities present in those systems. The T&E community must adapt to this new paradigm of cloud-hosted systems to ensure that vulnerabilities are discovered and mitigated before an adversary has the opportunity to use those vulnerabilities against the system.


  •

    Cyber Testing Embedded Systems with Digital Twins

    Michael Thompson (Naval Postgraduate School)


    Dynamic cyber testing and analysis require instrumentation to facilitate measurements, e.g., to determine which portions of code have been executed or to detect anomalous conditions that might not manifest at the system interface. However, instrumenting software causes execution to diverge from the execution of the deployed binaries. Instrumentation also requires mechanisms for storing and retrieving testing artifacts on target systems. RESim is a dynamic testing and analysis platform that does not instrument software. Instead, RESim instruments high-fidelity models of the target hardware upon which the software-under-test executes, providing detailed insight into program behavior. Multiple modeled computer platforms run within a single simulation that can be paused, inspected, and run forward or backward to selected events such as the modification of a specific memory address. Integration of Google’s AFL fuzzer with RESim avoids the need to create fuzzing harnesses because programs are fuzzed in their native execution environment, commencing from selected execution states with data injected directly into simulated memory instead of I/O streams. RESim includes plugins for the IDA Pro and NSA Ghidra disassembler/debuggers to facilitate interactive analysis of individual processes and threads, providing the ability to skip to selected execution states (e.g., a reference to an input buffer) and “reverse execution” to reach a breakpoint by appearing to run backward in time. RESim simulates networks of computers through the use of Wind River’s Simics platform of high-fidelity models of processors, peripheral devices (e.g., network interface cards), and memory. The networked simulated computers load and run firmware and software from images extracted from the physical systems being tested. Instrumenting the simulated hardware allows RESim to observe software behavior from the other side of the hardware, i.e., without affecting its execution. Simics includes tools to extend and create high-fidelity models of processors and devices, providing a clear path to deploying and managing digital twins for use in developmental test and evaluation. The simulations can include optional real-world network and bus interfaces to facilitate integration into networks and test ranges. Simics is a COTS product that runs on commodity hardware and is able to execute several parallel instances of complex multi-component systems on a typical engineering workstation or server. This presentation will describe RESim and strategies for using digital twins for cyber testing of embedded systems. The presentation will also discuss some of the challenges associated with fuzzing non-trivial software systems.


  •

    Test and Evaluation of AI Cyber Defense Systems

    Shing-hon Lau (Software Engineering Institute, Carnegie Mellon University)


    Adoption of Artificial Intelligence and Machine Learning powered cybersecurity defenses (henceforth, AI defenses) has outpaced testing and evaluation (T&E) capabilities. Industrial and governmental organizations around the United States are employing AI defenses to protect their networks in ever increasing numbers, with the commercial market for AI defenses currently estimated at $15 billion and expected to grow to $130 billion by 2030. This adoption of AI defenses is powered by a shortage of over 500,000 cybersecurity staff in the United States, by a need to expeditiously handle routine cybersecurity incidents with minimal human intervention and at machine speeds, and by a need to protect against highly sophisticated attacks. It is paramount to establish, through empirical testing, trust and understanding of the capabilities and risks associated with employing AI defenses. While some academic work exists for performing T&E of individual machine learning models trained using cybersecurity data, we are unaware of any principled method for assessing the capabilities of a given AI defense within an actual network environment. The ability of AI defenses to learn over time poses a significant T&E challenge, above and beyond those faced when considering traditional static cybersecurity defenses. For example, an AI defense may become more (or less) effective at defending against a given cyberattack as it learns over time. Additionally, a sophisticated adversary may attempt to evade the capabilities of an AI defense by obfuscating attacks to maneuver them into its blind spots, by poisoning the training data utilized by the AI defense, or both. Our work provides an initial methodology for performing T&E of on-premises network-based AI defenses on an actual network environment, including the use of a network environment with generated user network behavior, automated cyberattack tools to test the capabilities of AI cyber defenses to detect attacks on that network, and tools for modifying attacks to include obfuscation or data poisoning. Discussion will also center on some of the difficulties with performing T&E on an entire system, instead of just an individual model.


  •

    Test and Evaluation of Systems with Embedded Artificial Intelligence Components

    Michael Smith (Sandia National Laboratories)


    As Artificial Intelligence (AI) continues to advance, it is being integrated into more systems. Often, the AI component represents a significant portion of the system that reduces the burden on the end user or significantly improves the performance of a task. The AI component represents an unknown complex phenomenon that is learned from collected data without the need to be explicitly programmed. Despite the improvement in performance, the models are black boxes. Evaluating the credibility and the vulnerabilities of AI models remains a gap in current test and evaluation practice. For high-consequence applications, the lack of testing and evaluation procedures represents a significant source of uncertainty and risk. To help reduce that risk, we have developed a red-teaming-inspired methodology to evaluate systems embedded with an AI component. This methodology highlights the key expertise and components that are needed beyond what a typical red team generally requires. In contrast to academic evaluations that consider the AI model in isolation, we present a system-level evaluation. We outline three axes along which to evaluate an AI component: 1) Evaluating the performance of the AI component to ensure that the model functions as intended and is developed based on best practices developed by the AI community. This process entails more than simply evaluating the learned model. As the model operates on data used for training as well as perceived by the system, peripheral functions such as feature engineering and the data pipeline need to be included. 2) AI components necessitate supporting infrastructure in deployed systems. The support infrastructure may introduce additional vulnerabilities that are overlooked in traditional test and evaluation processes. Further, the AI component may be subverted by modifying key configuration files or data pipeline components. 3) AI models introduce possible vulnerabilities to adversarial attacks. These could be attacks designed to evade detection by the model, poison the model, steal the model or its data, or misuse the model to act inappropriately. Within the methodology, we highlight tools that may be applicable as well as gaps that need to be addressed by the community. SNL is managed and operated by NTESS under DOE NNSA contract DE-NA0003525.


Room B

Virtual Session
Session 1B: Validation of Statistical Software and Self-Validated Ensemble Models
Session Chair: Tom Donnelly, JMP Statistical Discovery
  •

    On the Validation of Statistical Software

    Ryan Lekivetz (JMP Statistical Discovery)


    Validating statistical software involves a variety of challenges. Of these, the most difficult is the selection of an effective set of test cases, sometimes referred to as the “test case selection problem”. To further complicate matters, for many statistical applications, development and validation are done by individuals who often have limited time to validate their application and may not have formal training in software validation techniques. As a result, it is imperative that the adopted validation method is efficient, as well as effective, and it should also be one that can be easily understood by individuals not trained in software validation techniques. As it turns out, the test case selection problem can be thought of as a design of experiments (DOE) problem. This talk discusses how familiar DOE principles can be applied to validating statistical software.
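
    As a minimal illustration of the idea (base R only; my_ttest is a hypothetical routine under validation), the test cases can be laid out as a designed experiment over characteristics of the input data and each case checked against a trusted reference implementation:

        # Sketch: test-case selection as a designed experiment over data characteristics.
        # A fractional or covering design could replace the full factorial to reduce cases.
        set.seed(1)
        cases <- expand.grid(
          sample_size    = c(5, 30, 200),
          variance_ratio = c(1, 4),
          has_missing    = c(FALSE, TRUE)
        )
        run_case <- function(case) {
          x <- rnorm(case$sample_size)
          y <- rnorm(case$sample_size, sd = sqrt(case$variance_ratio))
          if (case$has_missing) y[1] <- NA
          new <- my_ttest(x, y)                      # hypothetical routine under validation
          ref <- t.test(x, y, var.equal = FALSE)     # trusted reference implementation
          isTRUE(all.equal(new$p.value, ref$p.value, tolerance = 1e-8))
        }
        pass <- sapply(seq_len(nrow(cases)), function(i) run_case(cases[i, ]))  # pass/fail per case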


  •

    Validating the Prediction Profiler with Disallowed Combination: A Case Study

    Yeng Saanchi (JMP Statistical Discovery)


    The prediction profiler is an interactive display in JMP statistical software that allows a user to explore the relationships between multiple factors and responses. A common use case of the profiler is for exploring the predicted model from a designed experiment. For experiments with a constrained design region defined by disallowed combinations, the profiler was recently enhanced to obey such constraints. In this case study, we show how a DOE based approach to validating statistical software was used to validate this enhancement.


  •

    Introducing Self-Validated Ensemble Models (SVEM) – Bringing Machine Learning to DOEs

    Chris Gotwalt (JMP Statistical Discovery)


    DOE methods have evolved over the years, as have the needs and expectations of experimenters. Historically, the focus was on separating effects to reduce bias in effect estimates and on maximizing hypothesis-testing power, priorities that are largely a reflection of the methodological and computational tools of their time. In industry, DOE is often done to predict product or process behavior under possible changes. We introduce Self-Validated Ensemble Models (SVEM), an inherently predictive algorithmic approach to the analysis of DOEs, generalizing the fractional bootstrap to make machine learning and bagging possible for the small datasets common in DOE. In many DOE applications the number of rows is small, and the factor layout is carefully structured to maximize information gain in the experiment. Applying machine learning methods to DOE is generally avoided because such methods begin by partitioning the rows into a training set for model fitting and a holdout set for model selection. This alters the structure of the design in undesirable ways, such as randomly introducing effect aliasing. SVEM avoids this problem by using a variation of the fractionally weighted bootstrap to create training and validation versions of the complete data that differ only in how rows are weighted. The weights are reinitialized and the models refit multiple times, so that our final SVEM model is a model average, much like bagging. We find this allows us to fit models where the number of estimated effects exceeds the number of rows. We will present simulation results showing that in these supersaturated cases SVEM outperforms existing approaches such as forward selection as measured by prediction accuracy.
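
    A minimal sketch of the SVEM idea only (fractionally weighted bootstrap plus model averaging), assuming the glmnet package and exponential anti-correlated weights; the published SVEM algorithm's exact weighting scheme and candidate model set may differ:

        # Each bootstrap reuses the full data twice: once with training weights and once
        # with anti-correlated validation weights; the selected models are then averaged.
        library(glmnet)
        svem_lasso <- function(X, y, n_boot = 200) {   # X is a model matrix of candidate effects
          n <- nrow(X)
          coefs <- vector("list", n_boot)
          for (b in seq_len(n_boot)) {
            u <- runif(n)
            w_train <- -log(u)            # fractional training weights
            w_valid <- -log(1 - u)        # anti-correlated validation weights
            path <- glmnet(X, y, weights = w_train, alpha = 1)   # lasso path on weighted data
            pred  <- predict(path, newx = X)                     # n x n_lambda predictions
            val_err <- colSums(w_valid * (y - pred)^2)           # validation-weighted error
            coefs[[b]] <- coef(path, s = path$lambda[which.min(val_err)])
          }
          Reduce(`+`, coefs) / n_boot     # model-averaged coefficients, as in bagging
        }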


  •

    Effective Application of Self-Validated Ensemble Models in Challenging Test Scenarios

    James Wisnowski (Adsurgo)


    We test the efficacy of SVEM versus alternative variable selection methods in a mixture experiment setting. These designs have built-in dependencies that require modifications of the typical design and analysis methods. The usual design metric of power is not helpful for these tests and analyzing results becomes quite challenging, particularly for factor characterization. We provide some guidance and lessons learned from hypersonic fuel formulation experience. We also show through simulation favorable combinations of design and Generalized Regression analysis options that lead to the best results. Specifically, we quantify the impact of changing run size, including complex design region constraints, using space-filling vs optimal designs, including replicates and/or center runs, and alternative analysis approaches to include full model, backward stepwise, SVEM forward selection, SVEM Lasso, and SVEM neural network.


Room C

Session 1C: Statistical and Systems Engineering Applications in Aerospace
Session Chair: Melissa Hooke
  •

    The Containment Assurance Risk Framework of the Mars Sample Return Program

    Giuseppe Cataldo (NASA)


    The Mars Sample Return campaign aims to bring rock and atmospheric samples from Mars to Earth through a series of robotic missions. These missions would collect the samples being cached and deposited on Martian soil by the Perseverance rover, place them in a container, and launch them into Martian orbit for subsequent capture by an orbiter that would bring them back. Given there exists a non-zero probability that the samples contain biological material, precautions are being taken to design systems that would break the chain of contact between Mars and Earth. These include techniques such as sterilization of Martian particles, redundant containment vessels, and a robust reentry capsule capable of accurate landings without a parachute. Requirements state that the probability of containment not assured (that is, the release of Martian-contaminated material into Earth’s biosphere) must be less than one in a million. To demonstrate compliance with this strict requirement, a statistical framework was developed to assess the likelihood of containment loss during each sample return phase and make a statement about the total combined mission probability of containment not assured. The work presented here describes this framework, which considers failure modes or fault conditions that can initiate failure sequences ultimately leading to containment not assured. Reliability estimates are generated from databases, design heritage, component specifications, or expert opinion in the form of probability density functions or point estimates and provided as inputs to the mathematical models that simulate the different failure sequences. The probabilistic outputs are then combined following the logic of several fault trees to compute the ultimate probability of containment not assured. Given the multidisciplinary nature of the problem and the different types of mathematical models used, the statistical tools needed for analysis are required to be computationally efficient. While standard Monte Carlo approaches are used for fast models, a multi-fidelity approach to rare-event probabilities is proposed for expensive models. In this paradigm, inexpensive low-fidelity models are developed for computational acceleration purposes while the expensive high-fidelity model is kept in the loop to retain accuracy in the results. This work presents an example of end-to-end application of this framework, highlighting the computational benefits of a multi-fidelity approach. The decision to implement Mars Sample Return will not be finalized until NASA’s completion of the National Environmental Policy Act process. This document is being made available for information purposes only.
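
    An illustrative sketch of the general mechanics only (simple fault-tree logic and made-up distributions, not the MSR models or requirement verification): sample uncertain failure probabilities and propagate them through the tree by Monte Carlo.

        # Notional fault tree: containment not assured requires a sterilization failure
        # AND (a containment vessel failure OR a reentry failure). All inputs are invented.
        set.seed(1)
        n        <- 1e6
        p_steril <- rbeta(n, 1, 1e4)     # hypothetical sterilization failure probability
        p_vessel <- rbeta(n, 2, 1e5)     # hypothetical containment vessel failure probability
        p_entry  <- rbeta(n, 2, 1e6)     # hypothetical reentry failure probability
        p_cna <- p_steril * (1 - (1 - p_vessel) * (1 - p_entry))   # OR gate, then AND gate
        quantile(p_cna, c(0.50, 0.95))   # distribution of the combined probability vs. a 1e-6 requirement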


  •

    A Bayesian Optimal Experimental Design for High-dimensional Physics-based Models

    James Oreluk (Sandia National Laboratories)


    Many scientific and engineering experiments are developed to study specific questions of interest. Unfortunately, time and budget constraints make operating these controlled experiments over wide ranges of conditions intractable, thus limiting the amount of data collected. In this presentation, we discuss a Bayesian approach to identify the most informative conditions, based on the expected information gain. We will present a framework for finding optimal experimental designs that can be applied to physics-based models with high-dimensional inputs and outputs. We will study a real-world example where we aim to infer the parameters of a chemically reacting system, but there are uncertainties in both the model and the parameters. A physics-based model was developed to simulate the gas-phase chemical reactions occurring between highly reactive intermediate species in a high-pressure photolysis reactor coupled to a vacuum-ultraviolet (VUV) photoionization mass spectrometer. This time-of-flight mass spectrum evolves in both kinetic time and VUV energy producing a high-dimensional output at each design condition. The high-dimensional nature of the model output poses a significant challenge for optimal experimental design, as a surrogate model is built for each output. We discuss how accurate low-dimensional representations of the high-dimensional mass spectrum are necessary for computing the expected information gain. Bayesian optimization is employed to maximize the expected information gain by efficiently exploring a constrained design space, taking into account any constraint on the operating range of the experiment. Our results highlight the trade-offs involved in the optimization, the advantage of using optimal designs, and provide a workflow for computing optimal experimental designs for high-dimensional physics-based models.
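
    As a toy illustration of the criterion itself (one scalar parameter and an invented nonlinear response, not the reactor model or its surrogates), the expected information gain EIG(d) = E_y[ KL( p(theta | y, d) || p(theta) ) ] can be estimated at each candidate design d by nested Monte Carlo:

        # Nested Monte Carlo estimate of expected information gain for a toy model
        # y = sin(d * theta) + noise with a standard normal prior on theta.
        set.seed(1)
        eig <- function(d, N = 2000, M = 2000, sigma = 0.1) {
          theta_out <- rnorm(N)                              # outer prior draws
          theta_in  <- rnorm(M)                              # inner prior draws for the evidence
          y <- sin(d * theta_out) + rnorm(N, sd = sigma)     # simulated observations at design d
          log_lik <- dnorm(y, mean = sin(d * theta_out), sd = sigma, log = TRUE)
          log_ev  <- sapply(y, function(yi)                  # log p(y | d) by inner Monte Carlo
            log(mean(dnorm(yi, mean = sin(d * theta_in), sd = sigma))))
          mean(log_lik - log_ev)                             # EIG estimate at design d
        }
        designs <- seq(0.1, 3, by = 0.1)
        plot(designs, sapply(designs, eig), type = "b", xlab = "design d", ylab = "estimated EIG")

    In the talk's setting, the same criterion is evaluated through surrogate models of the high-dimensional output and maximized with Bayesian optimization rather than a grid.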


  •

    Systems Engineering Applications of UQ in Space Mission Formulation

    Kelli McCoy and Roger Ghanem (NASA Jet Propulsion Laboratory, University of Southern California)


    In space mission formulation, it is critical to link the scientific phenomenology under investigation directly to the spacecraft design, mission design, and the concept of operations. With many missions of discovery, the large uncertainty in the science phenomenology and the operating environment necessitates mission architecture solutions that are robust and resilient to these unknowns, in order to maximize the probability of achieving the mission objectives. Feasible mission architectures are assessed against performance, cost, and risk, in the context of large uncertainties. For example, despite Cassini observations of Enceladus, significant uncertainties exist in the moon’s jet properties and the surrounding Enceladus environment. Orbilander or any other mission to Enceladus will need to quantify or bound these uncertainties in order to formulate a viable design and operations trade space that addresses a range of mission objectives within the imposed technical and programmatic constraints. Uncertainty quantification (UQ) utilizes a portfolio of stochastic, data science, and mathematical methods to characterize the uncertainty of a system and inform risk and decision-making. This discussion will focus on the formulation of a UQ workflow and an example of an Enceladus mission development use case.


  •

    *Virtual Speaker* Epistemic and Aleatoric Uncertainty Quantification for Gaussian Processes

    Pau Batlle (California Institute of Technology)


    One of the major advantages of using Gaussian Processes for regression and surrogate modeling is the availability of uncertainty bounds for the predictions. However, the validity of these bounds depends on using an appropriate covariance function, which is usually learned from the data out of a parametric family through maximum likelihood estimation or cross-validation methods. In practice, the data might not contain enough information to select the best covariance function, generating uncertainty in the hyperparameters, which translates into an additional layer of uncertainty in the predictions (epistemic uncertainty) on top of the usual posterior covariance (aleatoric uncertainty). In this talk, we discuss considering both uncertainties for UQ, and we quantify them by extending the MLE paradigm using a game theoretical framework that identifies the worst-case prior under a likelihood constraint specified by the practitioner.


Room E

Virtual Session
Mini-Tutorial 1
Session Chair: George Khoury, IDA
  •

    Introduction to Design of Experiments in R: Generating and Evaluating Designs with skpr

    Tyler Morgan-Wall (IDA)


    The Department of Defense requires rigorous testing to support the evaluation of effectiveness and suitability of oversight acquisition programs. These tests are performed in a resource constrained environment and must be carefully designed to efficiently use those resources. The field of Design of Experiments (DOE) provides methods for testers to generate optimal experimental designs taking these constraints into account, and computational tools in DOE can support this process by enabling analysts to create designs tailored specifically for their test program. In this tutorial, I will show how you can run these types of analyses using “skpr”: a free and open source R package developed by researchers at IDA for generating and evaluating optimal experimental designs. This software package allows you to perform DOE analyses entirely in code; rather than using a graphical user interface to generate and evaluate individual designs one-by-one, this tutorial will demonstrate how an analyst can use “skpr” to automate the creation of a variety of different designs using a short and simple R script. Attendees will learn the basics of using the R programming language and how to generate, save, and share their designs. Additionally, “skpr” provides a straightforward interface to calculate statistical power. Attendees will learn how to use built-in parametric and Monte Carlo power evaluation functions to compute power for a variety of models and responses, including linear models, split-plot designs, blocked designs, generalized linear models (including logistic regression), and survival models. Finally, I will demonstrate how you can conduct an end-to-end DOE analysis entirely in R, showing how to generate power versus sample size plots and other design diagnostics to help you design an experiment that meets your program's needs.
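
    A minimal sketch of such a workflow, assuming skpr's documented gen_design(), eval_design(), and eval_design_mc() interface; the factors, levels, and effect sizes below are notional:

        library(skpr)

        # 1. Candidate set: all combinations of notional test factors
        candidates <- expand.grid(
          Range    = c("Short", "Medium", "Long"),
          Altitude = c("Low", "High"),
          Threat   = c("A", "B")
        )

        # 2. D-optimal design with 16 runs for a main-effects model
        design <- gen_design(candidateset = candidates,
                             model = ~ Range + Altitude + Threat,
                             trials = 16)

        # 3. Parametric power for a continuous response at alpha = 0.05
        eval_design(design, model = ~ Range + Altitude + Threat,
                    alpha = 0.05, effectsize = 1)

        # 4. Monte Carlo power for a binary response analyzed with logistic regression;
        #    effectsize gives assumed response probabilities at the low and high levels
        eval_design_mc(design, model = ~ Range + Altitude + Threat,
                       alpha = 0.05, nsim = 1000,
                       glmfamily = "binomial", effectsize = c(0.5, 0.9))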


12:20 PM - 1:30 PM

Lunch

1:30 PM - 3:00 PM: Parallel Sessions

Room A

Virtual Session
Session 2A: Statistical Engineering
Session Chair: Peter Parker, NASA Langley Research Center
  •

    An Overview of the NASA Quesst Community Test Campaign with the X-59 Aircraft

    Jonathan Rathsam (NASA Langley Research Center)


    In its mission to expand knowledge and improve aviation, NASA conducts research to address sonic boom noise, the prime barrier to overland supersonic flight. For half a century civilian aircraft have been required to fly slower than the speed of sound when over land to prevent sonic boom disturbances to communities under the flight path. However, lower noise levels may be achieved via new aircraft shaping techniques that reduce the merging of shockwaves generated during supersonic flight. As part of its Quesst mission, NASA is building a piloted, experimental aircraft called the X-59 to demonstrate low noise supersonic flight. After initial flight testing to ensure the aircraft performs as designed, NASA will begin a national campaign of community overflight tests to collect data on how people perceive the sounds from this new design. The data collected will support national and international noise regulators’ efforts as they consider new standards that would allow supersonic flight over land at low noise levels. This presentation provides an overview of the community test campaign, including the scope, key objectives, stakeholders, and challenges.


  •

    Dose-Response Data Considerations for the NASA Quesst Community Test Campaign

    Aaron Vaughn (NASA Langley Research Center)


    Key outcomes for NASA's Quesst mission are noise dose and perceptual response data to inform regulators on their decisions regarding noise certification standards for the future of overland commercial supersonic flight. Dose-response curves are commonly utilized in community noise studies to describe the annoyance of a community to a particular noise source. The X-59 aircraft utilizes shaped-boom technology to demonstrate low noise supersonic flight. For X-59 community studies, the sound level from X-59 overflights constitutes the dose, while the response is an annoyance rating selected from a verbal scale, e.g., “slightly annoyed” and “very annoyed.” Dose-response data will be collected from individual flyovers (single event dose) and an overall response to the accumulation of single events at the end of the day (cumulative dose). There are quantifiable sources of error in the noise dose due to uncertainty in microphone measurements of the sonic thumps and uncertainty in predicted noise levels at survey participant locations. Assessing and accounting for error in the noise dose is essential to obtain an accurate dose-response model. There is also a potential for error in the perceptual response. This error is due to the ability of participants to provide their response in a timely manner and participant fatigue after responding to up to one hundred surveys over the course of a month. This talk outlines various challenges in estimating noise dose and perceptual response and the methods considered in preparation for X-59 community tests.
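
    A simplified sketch of the basic dose-response piece (simulated data, ordinary logistic regression, no dose measurement-error correction), to make the modeling target concrete:

        # Fit a binary dose-response curve for "highly annoyed" vs. noise dose.
        set.seed(1)
        dose <- runif(500, 60, 90)                 # notional perceived level
        p_true <- plogis((dose - 80) / 3)          # assumed true annoyance probability
        highly_annoyed <- rbinom(500, 1, p_true)

        fit <- glm(highly_annoyed ~ dose, family = binomial)

        # Predicted percent highly annoyed across the dose range, with pointwise CIs
        newdat <- data.frame(dose = seq(60, 90, by = 1))
        pred <- predict(fit, newdata = newdat, type = "link", se.fit = TRUE)
        newdat$p  <- plogis(pred$fit)
        newdat$lo <- plogis(pred$fit - 1.96 * pred$se.fit)
        newdat$hi <- plogis(pred$fit + 1.96 * pred$se.fit)

    Accounting for error in the dose itself, as described above, would require errors-in-variables or Bayesian hierarchical extensions beyond this sketch.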


  •

    Infusing Statistical Thinking into the NASA Quesst Community Test Campaign

    Nathan Cruze (NASA Langley Research Center)


    Statistical thinking permeates many important decisions as NASA plans its Quesst mission, which will culminate in a series of community overflights using the X-59 aircraft to demonstrate low-noise supersonic flight. Month-long longitudinal surveys will be deployed to assess human perception and annoyance to this new acoustic phenomenon. NASA works with a large contractor team to develop systems and methodologies to estimate noise doses, to test and field socio-acoustic surveys, and to study the relationship between the two quantities, dose and response, through appropriate choices of statistical models. This latter dose-response relationship will serve as an important tool as national and international noise regulators debate whether overland supersonic flights could be permitted once again within permissible noise limits. In this presentation we highlight several areas where statistical thinking has come into play, including issues of sampling, classification and data fusion, and analysis of longitudinal survey data that are subject to rare events and the consequences of measurement error. We note several operational constraints that shape the appeal or feasibility of some decisions on statistical approaches, and we identify several important remaining questions to be addressed.


Room B

Virtual Session
Session 2B: Situation Awareness, Autonomous Systems, and Digital Engineering in T&E
Session Chair: Elizabeth Green, IDA
  •

    Towards Scientific Practices for Situation Awareness Evaluation in Operational Testing

    Miriam Armstrong (IDA)


    Situation Awareness (SA) plays a key role in decision making and human performance; higher operator SA is associated with increased operator performance and decreased operator errors. In the most general terms, SA can be thought of as an individual’s “perception of the elements in the environment within a volume of time and space, the comprehension of their meaning, and the projection of their status in the near future.” While “situational awareness” is a common suitability parameter for systems under test, there is no standardized method or metric for quantifying SA in operational testing (OT). This leads to varied and suboptimal treatments of SA across programs and test events. Current measures of SA are exclusively subjective and paint an inadequate picture. Future advances in system connectedness and mission complexity will exacerbate the problem. We believe that technological improvements will necessitate increases in the complexity of the warfighters’ mission, including changes to team structures (e.g., integrating human teams with human-machine teams), command and control (C2) processes (e.g., expanding C2 frameworks toward joint all-domain C2), and battlespaces (e.g., overcoming integration challenges for multi-domain operations). Operational complexity increases the information needed for warfighters to maintain high SA, and assessing SA will become increasingly important and difficult to accomplish. IDA’s Test science team has proposed a piecewise approach to improve the measurement of situation awareness in operational evaluations. The aim of this presentation is to promote a scientific understanding of what SA is (and is not) and encourage discussion amongst practitioners tackling this challenging problem. We will briefly introduce Endsley’s Model of SA, review the trade-offs involved in some existing measures of SA, and discuss a selection of potential ways in which SA measurement during OT may be improved.


  •

    T&E Landscape for Advanced Autonomy

    Kathryn Lahman (Johns Hopkins University Applied Physics Laboratory)


    The DoD is making significant investments in the development of autonomous systems, spanning from basic research, at organizations such as DARPA and ONR, to major acquisition programs, such as PEO USC. In this talk we will discuss advanced autonomous systems as complex, fully autonomous systems and systems of systems, rather than specific subgenres of autonomous functions, e.g., basic path-planning autonomy or vessel controllers for moving vessels from point A to B. As a community, we are still trying to understand how to integrate these systems in the field with the warfighter to fully optimize their added capabilities. A major goal of using autonomous systems is to support multi-domain, distributed operations. We have a vision for how this may work, but we do not know when, or if, these systems will be ready to implement that vision. We must identify trends, analyze bottlenecks, and find scalable approaches to fielding these capabilities, such as identifying certification criteria or optimizing methods of testing and evaluating (T&E) autonomous systems. Traditional T&E methods are not sufficient for cutting-edge autonomy and artificial intelligence (AI). Not only do we have to test the traditional aspects of system performance (speed, endurance, range, etc.) but also the decision-making capabilities that would have previously been performed by humans. This complexity increases when an autonomous system changes based on how it is applied in the real world. Each domain, environment, and platform an autonomy runs on presents unique considerations. Complexity is further compounded when we begin to stack these autonomies and integrate them into a fully autonomous system of systems. Currently, there are no standard processes or procedures for testing these nested, complex autonomies; yet there are numerous areas for growth and improvement in this space. We will dive into the capability gaps we have identified in advanced autonomy T&E and provide approaches for how the DoD may begin to tackle these issues. It is important that we make critical contributions toward testing, trusting, and certifying these complex autonomous systems. Primary focus areas that are addressed include: recommending the use of bulk testing through Modeling and Simulation (M&S), while ensuring that the virtual environment is representative of the operational environment; developing intelligent tests and test selection tools to locate and discriminate areas of interest faster than through traditional Monte Carlo sampling methods; building methods for testing black-box autonomies faster than real time, and with fewer computational requirements; providing data analytics that assess autonomous systems in ways that give human decision makers a means for certification; expanding the concept of what trust means, and how to assess and, subsequently, validate the trustworthiness of these systems across stakeholders; and testing experimental autonomous systems in a safe and structured manner that encourages rapid fielding and iteration on novel autonomy components.


  •

    Digital Transformation Enabled by Enterprise Automation

    Nathan Pond (Edaptive Computing, Inc.)


    Digital transformation is a broad term that means a variety of things to people in many different operational domains, but the underlying theme is consistent: using digital technologies to improve business processes, culture, and efficiency. Digital transformation results in streamlining communications, collaboration, and information sharing while reducing errors. Properly implemented digital processes provide oversight and cultivate accountability to ensure compliance with business processes and timelines. A core tenet of effective digital transformation is automation. The elimination or reduction of human intervention in processes provides significant gains to operational speed, accuracy, and efficiency. DOT&E uses automation to streamline the creation of documents and reports which need to include up-to-date information. By using Smart Documentation capabilities, authors can define and automatically populate sections of documents with the most up-to-date data, ensuring that every published document always has the most current information. This session discusses a framework for driving digital transformation to automate nearly any business process.


Room C

Students & Fellows Speed Session 1

Please note many of the Speed Sessions will also include a poster during the Poster Session

Session Chair: Alyson Wilson, NC State
  •

    Comparing Normal and Binary D-optimal Design of Experiments by Statistical Power

    Addison Adams (IDA / Colorado State University)


    In many Department of Defense (DoD) Test and Evaluation (T&E) applications, binary response variables are unavoidable. Many have considered D-optimal design of experiments (DOEs) for generalized linear models (GLMs). However, little consideration has been given to assessing how these new designs perform in terms of statistical power for a given hypothesis test. Monte Carlo simulations and exact power calculations suggest that normal D-optimal designs generally yield higher power than binary D-optimal designs, despite logistic regression being used in the analysis after the data have been collected. Results from using statistical power to compare designs contradict traditional DOE comparisons, which employ D-efficiency ratios and fractional design space (FDS) plots. Power calculations suggest that practitioners who are primarily interested in the resulting statistical power of a design should use normal D-optimal designs over binary D-optimal designs when logistic regression is to be used in the data analysis after data collection.
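
    A minimal sketch of the kind of Monte Carlo power calculation described, for a notional two-factor, twelve-run design analyzed with logistic regression (factor codings and effect sizes are illustrative):

        # Monte Carlo power for the x1 effect under a logistic-regression analysis.
        set.seed(1)
        mc_power <- function(design, beta, nsim = 2000, alpha = 0.05) {
          X <- model.matrix(~ x1 + x2, design)
          p <- plogis(X %*% beta)                  # true response probabilities
          reject <- replicate(nsim, {
            y <- rbinom(nrow(X), 1, p)
            fit <- glm(y ~ x1 + x2, family = binomial, data = design)
            coef(summary(fit))["x1", "Pr(>|z|)"] < alpha
          })
          mean(reject)
        }
        design <- expand.grid(x1 = c(-1, 1), x2 = c(-1, 1))[rep(1:4, each = 3), ]  # replicated 2^2
        mc_power(design, beta = c(0, 1, 0.5))

    Applying the same function to a normal D-optimal layout and a binary D-optimal layout of equal size gives the head-to-head power comparison described in the abstract.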


  •

    Uncertain Text Classification for Proliferation Detection

    Andrew Hollis (North Carolina State University)


    A key global security concern in the nuclear weapons age is the proliferation and development of nuclear weapons technology, and a crucial part of enforcing non-proliferation policy is developing an awareness of the scientific research being pursued by other nations and organizations. Deep, transformer-based text classification models are an important piece of systems designed to monitor scientific research for this purpose. For applications like proliferation detection involving high-stakes decisions, there has been growing interest in ensuring that we can perform well-calibrated, interpretable uncertainty quantification with such classifier models. However, because modern transformer-based text classification models have hundreds of millions of parameters and the computational cost of uncertainty quantification typically scales with the size of the parameter space, it has been difficult to produce computationally tractable uncertainty quantification for these models. We propose a new variational inference framework that is computationally tractable for large models and meets important uncertainty quantification objectives including producing predicted class probabilities that are well-calibrated and reflect our prior conception of how different classes are related.


  •

    A Data-Driven Approach of Uncertainty Quantification on Reynolds Stress Based on DNS Turbulence Data

    Zheming Gou (University of Southern California)


    High-fidelity simulation capabilities have progressed rapidly over the past decades in computational fluid dynamics (CFD), resulting in abundant high-resolution flow field data. Uncertainty quantification remains an unsolved problem due to the high-dimensional input space and the intrinsic complexity of turbulence. Here we developed an uncertainty quantification method to model the Reynolds stress based on Karhunen-Loeve expansion (KLE) and projection pursuit basis adaptation polynomial chaos expansion (PPA). First, different representative volume elements (RVEs) were randomly drawn from the flow field, and KLE was used to reduce them to a moderate dimension. Then, we built polynomial chaos expansions of the Reynolds stress using PPA. Results show that this method can yield a surrogate model with a test accuracy of up to 90%. PCE coefficients also show that the Reynolds stress strongly depends on second-order KLE random variables instead of first-order terms. Regarding data efficiency, we built another surrogate model using a neural network (NN) and found that our method outperforms the NN in limited-data cases.


  •

    I-TREE: a Tool for Characterizing Research Using Taxonomies

    Aayushi Verma (IDA)


    IDA is developing a Data Strategy to develop solid infrastructures and practices that allow for a rigorous data-centric approach to answering U.S. security and science policy questions. The data strategy implements data governance and data architecture strategies to leverage data to gain trusted insights, and establishes a data-centric culture. One key component of the Data Strategy is a set of research taxonomies that describe and characterize the research done at IDA. These research taxonomies, broadly divided into six categories, are a vital tool to help IDA researchers gain insight into the research expertise of staff and divisions, in terms of the research products that are produced for our sponsors. We have developed an interactive web application which consumes numerous disparate sources of data related to these taxonomies, research products, researchers, and divisions, and unites them to create quantified analytics and visualizations to answer questions about research at IDA. This tool, titled I-TREE (IDA-Taxonomical Research Expertise Explorer), will enable staff to answer questions like ‘Who are the researchers most commonly producing products for a specified research area?’, ‘What is the research profile of a specified author?’, ‘What research topics are most commonly addressed by a specified division?’, ‘Who are the researchers most commonly producing products in a specified division?’, and ‘What divisions are producing products for a specified research topic?’. These are essential questions whose answers allow IDA to identify subject-matter expertise areas, methodologies, and key skills in response to sponsor requests, and to identify common areas of expertise to build a research team with a broad range of skills. I-TREE demonstrates the use of data science and data management techniques that enhance the company’s data strategy while actively enabling researchers and management to make informed decisions.


  •

    An Evaluation Of Periodic Developmental Reviews Using Natural Language Processing

    Dominic Rudakevych (United States Military Academy)


    As an institution committed to developing leaders of character, the United States Military Academy (USMA) holds a vested interest in measuring character growth. One such tool, the Periodic Developmental Review (PDR), has been used by the Academy’s Institutional Effectiveness Office for over a decade. PDRs are written counseling statements evaluating how a cadet is developing with respect to his/her peers. The objective of this research was to provide an alternate perspective on the PDR system by using statistical and natural language processing (NLP) based approaches to find whether certain dimensions of PDR data were predictive of a cadet’s overall rating. This research implemented multiple NLP tasks and techniques, including sentiment analysis, named entity recognition, tokenization, part-of-speech tagging, and word2vec, as well as statistical models such as linear regression and ordinal logistic regression. The ordinal logistic regression model concluded that PDRs with optional written summary statements had more predictable overall scores than those without summary statements. Additionally, the writer of the PDR relative to the cadet (Self, Instructor, Peer, Subordinate) held strong predictive value for the overall rating. When compared to a self-reflecting PDR, instructor-written PDRs were 62.40% more likely to have a higher overall score, while subordinate-written PDRs had a probability of improvement of 61.65%. These values increased to 70.85% and 73.12%, respectively, when considering only those PDRs with summary statements. These findings indicate that different writer demographics have a different understanding of the meaning of each rating level. Recommendations for the Academy would be implementing a forced distribution or providing a deeper explanation of the overall rating in the instructions. Additionally, no written language facets analyzed demonstrated predictive strength, meaning written statements do not introduce unwanted bias and could be made a required field for more meaningful feedback to cadets.


  •

    Optimal Release Policy for Covariate Software Reliability Models

    Ebenezer Yawlui (University of Massachusetts Dartmouth)


    Determining the optimal time to release software is a common problem of broad concern to software engineers, where the goal is to minimize cost by balancing the cost of fixing defects before or after release as well as the cost of testing. However, the vast majority of these models are based on defect discovery models that are a function of time and can therefore only provide guidance on the amount of additional effort required. To overcome this limitation, this paper presents an optimal software release model based on cost criteria, incorporating the covariate software defect detection model based on the Discrete Cox Proportional Hazards Model. The proposed model provides more detailed guidance, recommending the amount of each distinct test activity performed to discover defects. Our results indicate that the approach can be utilized to allocate effort among alternative test activities in order to minimize cost.


  •

    A Stochastic Petri Net Model of Continuous Integration and Continuous Delivery

    Sushovan Bhadra (University of Massachusetts Dartmouth)


    Modern software development organizations rely on continuous integration and continuous delivery (CI/CD), since it allows developers to continuously integrate their code in a single shared repository and automates the delivery process of the product to the user. While modern software practices improve the performance of the software life cycle, they also increase the complexity of this process. Past studies make improvements to the performance of the CI/CD pipeline. However, there are few formal models to quantitatively guide process and product quality improvement or to characterize how automated and human activities compose and interact asynchronously. Therefore, this talk develops a stochastic Petri net model to analyze a CI/CD pipeline to improve process performance in terms of the probability of successfully delivering new or updated functionality by a specified deadline. The utility of the model is demonstrated through a sensitivity analysis to identify stages of the pipeline where improvements would most significantly increase the probability of timely product delivery. In addition, this research provides an enhanced version of the conventional CI/CD pipeline to examine how it can improve process performance in general. The results indicate that the augmented model outperforms the conventional model, and sensitivity analysis suggests that failures in later stages are more important and can impact the delivery of the final product.


  •

    Neural Networks for Quantitative Resilience Prediction

    Karen Alves da Mata (University of Massachusetts Dartmouth)


    System resilience is the ability of a system to survive and recover from disruptive events, which finds applications in several engineering domains, such as cyber-physical systems and infrastructure. Most studies emphasize resilience metrics to quantify system performance, whereas more recent studies propose resilience models to project system recovery time after degradation using traditional statistical modeling approaches. Moreover, past studies are either performed on data collected after recovery or limited to idealized trends. Therefore, this talk considers alternative machine learning approaches such as (i) Artificial Neural Networks (ANN), (ii) Recurrent Neural Networks (RNN), and (iii) Long Short-Term Memory (LSTM) networks to model and predict system performance for alternative trends beyond those previously considered. These approaches include negative and positive factors driving resilience to understand and precisely quantify the impact of disruptive events and restorative activities. A hybrid feature selection approach is also applied to identify the most relevant covariates. Goodness-of-fit measures are calculated to evaluate the models, including (i) mean squared error, (ii) predictive-ratio risk, and (iii) adjusted R-squared. The results indicate that LSTM models outperform ANN and RNN models while requiring fewer neurons in the hidden layer in most of the data sets considered. In many cases, ANN models performed better than RNNs but required more time to be trained. These results suggest that neural network models for predictive resilience are both feasible and accurate relative to traditional statistical methods and may find practical use in many important domains.


  •

    A Generalized Influence Maximization Problem

    Sumit Kumar Kar (University of North Carolina at Chapel Hill)


    The influence maximization problem is a popular topic in social networks with several applications in viral marketing and epidemiology. One possible way to understand the problem is from the perspective of a marketer who wants to achieve the maximum influence on a social network by choosing an optimum set of nodes of a given size as seeds. The marketer actively influences these seeds, followed by a passive viral process based on a certain influence diffusion model, in which influenced nodes influence other nodes without external intervention. Kempe et al. showed that a greedy algorithm-based approach can provide a (1-1/e)-approximation guarantee compared to the optimal solution if the influence spreads according to the Triggering model. In our current work, we consider a much more general problem in which the goal is to maximize the total expected reward obtained from the nodes that are influenced by a given time (which may be finite or infinite). In this setting, the reward obtained by influencing a set of nodes can depend on the set (and is not necessarily a sum of rewards from the individual nodes) as well as on the times at which each node gets influenced; the seeds may be restricted to a chosen subset of the network; multiple units of budget may be assigned to a single node (where the maximum number of budget units that may be assigned to a node can depend on the node); and a seeded node actually gets influenced with a probability that is a non-decreasing function of the number of budget units assigned to that node. We have formulated a greedy algorithm that provides a (1-1/e)-approximation guarantee compared to the optimal solution of this generalized influence maximization problem if the influence spreads according to the Triggering model.
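
    For reference, a compact sketch of the classic greedy seed-selection template of Kempe et al., with influence spread estimated by Monte Carlo over live-edge samples of a small random graph (an independent cascade special case of the Triggering model); the generalized rewards, budgets, and seeding probabilities described above are not implemented here:

        # Greedy influence maximization on a random directed graph.
        set.seed(1)
        n <- 50
        adj <- matrix(rbinom(n * n, 1, 0.05), n, n); diag(adj) <- 0   # random directed graph
        p_edge <- 0.1                                                 # edge activation probability

        spread <- function(seeds, nsim = 200) {                       # expected number influenced
          mean(replicate(nsim, {
            live <- adj * (matrix(runif(n * n), n, n) < p_edge)       # sample a live-edge graph
            active <- rep(FALSE, n); active[seeds] <- TRUE
            frontier <- seeds
            while (length(frontier) > 0) {                            # reachability from the seeds
              nbrs <- which(colSums(live[frontier, , drop = FALSE]) > 0 & !active)
              active[nbrs] <- TRUE
              frontier <- nbrs
            }
            sum(active)
          }))
        }

        greedy_seeds <- function(k) {                                 # add the best node k times
          seeds <- integer(0)
          for (i in seq_len(k)) {
            candidates <- setdiff(1:n, seeds)
            gains <- sapply(candidates, function(v) spread(c(seeds, v)))
            seeds <- c(seeds, candidates[which.max(gains)])
          }
          seeds
        }
        greedy_seeds(3)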


Room D

Contributed Session 1
Session Chair: Elizabeth Gregory, NASA Langley Research Center
  •

    Avoiding Pitfalls in AI/ML Packages

    Justin Krometis (Virginia Tech National Security Institute)


    Recent years have seen an explosion in the application of artificial intelligence and machine learning (AI/ML) to practical problems from computer vision to game playing to algorithm design. This growth has been mirrored and, in many ways, been enabled by the development and maturity of publicly-available software packages such as PyTorch and TensorFlow that make model building, training, and testing easier than ever. While these packages provide tremendous power and flexibility to users, and greatly facilitate learning and deploying AI/ML techniques, they and the models they provide are extremely complicated and as a result can present a number of subtle but serious pitfalls. This talk will present three examples from the presenter's recent experience where obscure settings or bugs in these packages dramatically changed model behavior or performance - one from a classic deep learning application, one from training of a classifier, and one from reinforcement learning. These examples illustrate the importance of thinking carefully about the results that a model is producing and carefully checking each step in its development before trusting its output.


  •

    Reinforcement Learning Approaches to the T&E of AI/ML-based Systems Under Test

    Karen O'Brien (Modern Technology Solutions, Inc)


    Designed experiments provide an efficient way to sample the complex interplay of essential factors and conditions during operational testing. Analysis of these designs provides more detailed and rigorous insight into the system under test’s (SUT) performance than top-level summary metrics provide. The introduction of artificial intelligence and machine learning (AI/ML) capabilities in SUTs creates a challenge for test and evaluation because the factors and conditions that constitute the AI SUT's “feature space” are more complex than those of a mechanical SUT. Executing the equivalent of a full-factorial design quickly becomes infeasible. This presentation will demonstrate an approach to efficient, yet rigorous, exploration of the AI/ML-based SUT’s feature space that achieves many of the benefits of a traditional design of experiments, allowing more operationally meaningful insight into the strengths and limitations of the SUT than top-level AI summary metrics (like ‘accuracy’) provide. The approach uses an algorithmically defined search method within a reinforcement learning-style test harness for AI/ML SUTs. An adversarial AI (or AI critic) efficiently traverses the feature space and maps the resulting performance of the AI/ML SUT. The process identifies interesting areas of performance that would not otherwise be apparent in a roll-up metric. Identifying 'toxic performance regions', in which combinations of factors and conditions result in poor model performance, provides critical operational insights for both testers and evaluators. The process also enables T&E to explore the SUT's sensitivity and robustness to changes in inputs and the boundaries of the SUT's performance envelope. Feedback from the critic can be used by developers to improve the AI/ML SUT and by evaluators to interpret performance in terms of effectiveness, suitability, and survivability. This procedure can be used for white-box, grey-box, and black-box testing.


  •  2 

    Analysis Opportunities for Missile Trajectories

    Kelsey Cannon (Lockheed Martin Space)


    Contractor analysis teams are given 725 Monte Carlo-generated trajectories for thermal heating analysis. The analysis team currently uses a reduced-order model to evaluate all 725 options for the worst cases. The worst cases are run with the full model for handoff of predicted temperatures to the design team. Two months later, the customer arrives with yet another set of trajectories updated for the newest mission options. This presentation will question each step in this analysis process for opportunities to improve the cost, schedule, and fidelity of the effort. Using uncertainty quantification and functional data analysis processes, the team should be able to improve the analysis coverage and power with a reduced (or at least similar) number of model runs.


  •  1 

    *Virtual Speaker* Using Changepoint Detection and Artificial Intelligence to Classify Fuel Pressure Behaviors in Aerial Refueling

    Nelson Walker and Michelle Ouellette (United States Air Force)


    An open question in aerial refueling system test and evaluation is how to classify fuel pressure states and behaviors reproducibly and defensibly when visually inspecting the data stream. This question exists because fuel pressure data streams are highly stochastic, may exhibit multiple types of troublesome behavior simultaneously in a single stream, and may exhibit unique platform-dependent discernable behaviors. These data complexities result in differences in fuel pressure behavior classification determinations between engineers based on experience level and individual judgement. In addition to consuming valuable time, discordant judgements between engineers reduce confidence in metrics and other derived analytic products that are used to evaluate the system's performance. The Pruned Exact Linear Time (PELT) changepoint detection algorithm is an unsupervised machine learning method that, when coupled with an expert system AI, has provided a consistent and reproducible solution in classifying various fuel pressure states and behaviors with adjustable sensitivity.
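
    For context on the underlying method, the sketch below shows a generic PELT changepoint search on a synthetic, pressure-like signal using the open-source ruptures Python package; the signal, cost model, and penalty are illustrative stand-ins, not the authors' configuration or their expert-system layer.

        import numpy as np
        import ruptures as rpt  # open-source changepoint detection package

        # Synthetic stand-in for a fuel pressure stream: three mean levels plus noise.
        rng = np.random.default_rng(0)
        signal = np.concatenate([rng.normal(50, 2, 300),
                                 rng.normal(65, 2, 200),
                                 rng.normal(45, 2, 300)])

        # PELT search; the cost model is a generic choice, and the penalty trades off
        # goodness of fit against the number of detected changepoints (sensitivity).
        algo = rpt.Pelt(model="rbf", min_size=20).fit(signal)
        changepoints = algo.predict(pen=10)
        print(changepoints)  # indices where a new segment begins (last entry is len(signal))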


Room E

Virtual Session
Mini-Tutorial 2
Session Chair: Caleb King, JMP Statistical Discovery
  •  2 

    A Tour of JMP Reliability Platforms and Bayesian Methods for Reliability Data

    Peng Liu (JMP Statistical Discovery)


    JMP is a comprehensive, visual, and interactive statistical discovery software package with a carefully curated graphical user interface. The software is packed with traditional and modern statistical analysis capabilities and many unique innovative features. The software hosts several suites of tools that are especially valuable to the DATAWorks audience, including suites for Design of Experiments, Quality Control, Process Analysis, and Reliability Analysis. JMP has been building its reliability suite for the past fifteen years. The reliability suite in JMP is a comprehensive and mature collection of JMP platforms. The suite empowers reliability engineers with tools for analyzing time-to-event data, accelerated life test data, observational reliability data, competing cause data, warranty data, cumulative damage data, repeated measures degradation data, destructive degradation data, and recurrence data. For each type of data, there are numerous models and one or more methodologies that are applicable based on the nature of the data. In addition to reliability data analysis platforms, the suite also provides reliability engineering capabilities for system reliability from two distinct perspectives, one for non-repairable systems and the other for repairable systems. The JMP reliability suite is also at the frontier of advanced research on reliability data analysis. Inspired by the research of Prof. William Meeker at Iowa State University, we have implemented Bayesian inference methodologies for analyzing the three most important types of reliability data. The tutorial will start with an overall introduction to JMP's reliability platforms. It will then focus on analyzing time-to-event data, accelerated life test data, and repeated measures degradation data. The tutorial will present analyses of these types of reliability data using traditional methods, and highlight when, why, and how to analyze them in JMP using Bayesian methods.


3:00 PM - 3:20 PM

Break

3:20 PM - 4:50 PM: Parallel Sessions

Room A

Virtual Session
Session 3A: Methods & Applications of M&S Validation
Session Chair: Benjamin Ashwell, IDA
  •  1 

    Model Verification in a Digital Engineering Environment: An Operational Test Perspective

    Jo Anna Capp (IDA)


    As the Department of Defense adopts digital engineering strategies for acquisition systems in development, programs are embracing the use of highly federated models to assess the end-to-end performance of weapon systems, including the threat environment. Often, due to resource limitations or political constraints, there is limited live data with which to validate the end-to-end performance of these models. In these cases, careful verification of the model, including from an operational factor-space perspective, early in model development can assist testers in prioritizing resources for model validation later in system development. This presentation will discuss how using Design of Experiments to assess the operational factor space can shape model verification efforts and provide data for model validation focused on the end-to-end performance of the system.


  •  2 

    Recommendations for Statistical Analysis of Modeling and Simulation Environment Outputs

    Curtis Miller (IDA)


    Modeling and simulation (M&S) environments feature frequently in test and evaluation (T&E) of Department of Defense (DoD) systems. Testers may generate outputs from M&S environments more easily than they can collect live test data, but M&S outputs nevertheless take time to generate, cost money, require training to generate, and are accessed directly only by a select group of individuals. Still, many M&S environments do not suffer from the resourcing limitations associated with live testing. We thus recommend testers apply higher resolution output generation and analysis techniques compared to those used for live test data. Doing so will maximize stakeholders' understanding of M&S environments' behavior and help put their outputs to use in activities including M&S verification, validation, and accreditation (VV&A), live test planning, and providing information for non-T&E activities. This presentation provides recommendations for collecting outputs from M&S environments such that a higher resolution analysis can be achieved. Space-filling designs (SFDs) are experimental designs intended to fill the operational space over which M&S predictions are expected. These designs can be coupled with statistical metamodeling techniques that estimate a model that flexibly interpolates or predicts M&S outputs and their distributions at both observed settings and unobserved regions of the operational space. Analysts can use the resulting metamodels as a surrogate for M&S outputs in situations where the M&S environment cannot be deployed. They can also study metamodel properties to decide whether a M&S environment adequately represents the original system. IDA has published papers recommending specific space-filling design and metamodeling techniques; this presentation briefly covers the content of those papers.
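
    As a minimal illustration of the pairing described above (a sketch under assumed factors and bounds, not IDA's specific recommendation), the code below generates a Latin hypercube space-filling design over a notional two-factor operational space and fits a Gaussian process metamodel to placeholder M&S outputs.

        import numpy as np
        from scipy.stats import qmc
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import RBF, WhiteKernel

        # Space-filling design: 40 runs over two scaled factors (hypothetical bounds).
        sampler = qmc.LatinHypercube(d=2, seed=1)
        X_unit = sampler.random(n=40)
        X = qmc.scale(X_unit, l_bounds=[5.0, 0.0], u_bounds=[50.0, 90.0])

        # Stand-in for M&S output at each design point (in practice, simulation runs).
        y = np.sin(X[:, 0] / 10.0) + 0.01 * X[:, 1] + np.random.default_rng(1).normal(0, 0.05, 40)

        # Gaussian process metamodel interpolates/predicts outputs over the whole space.
        gp = GaussianProcessRegressor(kernel=RBF(length_scale=[10.0, 30.0]) + WhiteKernel(1e-3),
                                      normalize_y=True)
        gp.fit(X, y)
        pred_mean, pred_sd = gp.predict([[25.0, 45.0]], return_std=True)
        print(pred_mean, pred_sd)  # surrogate prediction and uncertainty at an untested setting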


  •  2 

    Back to the Future: Implementing a Time Machine to Improve and Validate Model Predictions

    Olivia Gozdz and Kyle Remley (IDA)


    At a time when supply chain problems are challenging even the most efficient and robust supply ecosystems, the DOD faces the additional hurdles of primarily dealing in low volume orders of highly complex components with multi-year procurement and repair lead times. When combined with perennial budget shortfalls, it is imperative that the DOD spend money efficiently by ordering the “right” components at the “right time” to maximize readiness. What constitutes the “right” components at the “right time” depends on model predictions that are based upon historical demand rates and order lead times. Given that the time scales between decisions and results are often years long, even small modeling errors can lead to months-long supply delays or tens of millions of dollars in budget shortfalls. Additionally, we cannot evaluate the accuracy and efficacy of today’s decisions for some years to come. To address this problem, as well as a wide range of similar problems across our Sustainment analysis, we have built “time machines” to pursue retrospective validation – for a given model, we rewind DOD data sources to some point in the past and compare model predictions, using only data available at the time, against known historical outcomes. This capability allows us to explore different decisions and the alternate realities that would manifest in light of those choices. In some cases, this is relatively straightforward, while in others it is made quite difficult by problems familiar to any time-traveler: changing the past can change the future in unexpected ways.
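
    A minimal sketch of the rewind-predict-compare idea, assuming a hypothetical table of monthly demand history; the forecasting model here is a deliberately naive trailing average, intended only to show the retrospective-validation loop rather than the models actually used.

        import pandas as pd

        # Hypothetical demand history: one row per month for a single part.
        demand = pd.DataFrame({
            "part": ["A"] * 36,
            "month": pd.date_range("2019-01-01", periods=36, freq="MS"),
            "qty": [5, 7, 6, 4, 8, 9, 5, 6, 7, 8, 6, 5] * 3,
        })

        def predict_as_of(history, cutoff, horizon=12):
            """Rewind to `cutoff`, fit on data available then, and forecast the next horizon."""
            visible = history[history["month"] < cutoff]   # only data known at the time
            monthly_rate = visible["qty"].tail(12).mean()  # naive trailing-average model
            return monthly_rate * horizon

        cutoff = pd.Timestamp("2021-01-01")
        forecast = predict_as_of(demand, cutoff)
        actual = demand[demand["month"] >= cutoff]["qty"].head(12).sum()
        print(f"forecast={forecast:.1f}, actual={actual}, error={forecast - actual:.1f}")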


Room B

Virtual Session
Session 3B: Software Quality
Session Chair: Sushovan Bhadra, UMASS Dartmouth
  •  2 

    Assessing Predictive Capability and Contribution for Binary Classification Models

    Mindy Hotchkiss (Aerojet Rocketdyne)


    Classification models for binary outcomes are in widespread use across a variety of industries. Results are commonly summarized in a misclassification table, also known as an error or confusion matrix, which indicates correct versus incorrect predictions under different circumstances. Models are developed to minimize both false positive and false negative errors, but the optimization process used to train and obtain the model fit necessarily results in cost-benefit trades. How to obtain an objective assessment of a given model's predictive capability or benefit is less well understood, however, due both to the plethora of options described in the literature and to the largely overlooked influence of noise factors, specifically class imbalance. Many popular measures are susceptible to effects arising from underlying differences in how the data are allocated by condition, which cannot be easily corrected. This talk considers the wide landscape of possibilities from a statistical robustness perspective. Results from sensitivity analyses over a variety of conditions are shown for several popular metrics, highlighting potential concerns with respect to machine learning or ML-enabled systems. Recommendations are provided for correcting for imbalance effects, as well as for conducting a simple statistical comparison that disentangles the beneficial effects of the model itself from those of imbalance. Results are generalizable across model type.
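
    To make the imbalance effect concrete, the illustrative sketch below (synthetic labels, not data from the talk) shows how overall accuracy can look favorable on a heavily imbalanced data set while balanced accuracy, a prevalence-insensitive summary of the same confusion matrix, tells a different story.

        import numpy as np
        from sklearn.metrics import confusion_matrix, accuracy_score, balanced_accuracy_score

        rng = np.random.default_rng(0)
        n = 10_000
        y_true = rng.binomial(1, 0.02, n)            # 2% positive class: heavy imbalance
        # A weak "model": detects 60% of positives, false-alarms on 5% of negatives.
        y_pred = np.where(y_true == 1, rng.binomial(1, 0.60, n), rng.binomial(1, 0.05, n))

        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
        print("confusion matrix:", tn, fp, fn, tp)
        print("accuracy:         ", accuracy_score(y_true, y_pred))           # inflated by imbalance
        print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))  # average of TPR and TNR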


  •  2 

    STAR: A Cloud-based Innovative Tool for Software Quality Analysis

    Kazu Okumoto (Sakura Software Solutions (3S) LLC)


    Traditionally, subject matter experts perform software quality analysis using custom spreadsheets, which produce inconsistent output and are challenging to share and maintain across teams. This talk will introduce and demonstrate STAR, a cloud-based, data-driven tool for software quality analysis. The tool is aimed at practitioners who manage software quality and make decisions based on its readiness for delivery. Being web-based and fully automated allows teams to collaborate on software quality analysis across multiple projects and releases. STAR is an integration of SaaS and automated analytics, and serves as a digital engineering tool for software quality practice. To use the tool, users need only upload their defect and (optionally) development effort data and set a couple of planned release milestones, such as the test start date and the delivery dates for customer trial and deployment. The provided data is then automatically processed and aggregated into a defect growth curve. The core innovation of STAR is its set of statistically sound algorithms used to fit a defect prediction curve to the provided data. This is achieved through the automated identification of inflection points in the original defect data and their use in generating piece-wise exponential models that make up the final prediction curve. Moreover, during the early days of software development, when no defect data is available, STAR can use the development effort plan and learn from previous software releases' defect and effort data to make predictions for the current release. Finally, the tool implements a range of what-if scenarios that enable practitioners to evaluate several potential actions to correct course. Thanks to the use of an earlier version of STAR by a large software development group at Nokia and the current trial collaboration with NASA, the features and accuracy of the tool have improved beyond traditional single-curve fitting. In particular, the defect prediction is stable several weeks before the planned software release, and the multiple metrics provided by the tool make the analysis of software quality straightforward, guiding users in making an intelligent decision about readiness for high-quality software delivery.
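
    As a simplified illustration of fitting a defect growth curve, the sketch below fits a single exponential saturation model to synthetic weekly defect counts; STAR's piece-wise construction with automatically identified inflection points is more elaborate, so this is only a rough analog.

        import numpy as np
        from scipy.optimize import curve_fit

        # Synthetic cumulative defect counts by test week.
        weeks = np.arange(1, 21)
        defects = np.array([5, 12, 20, 27, 35, 41, 47, 52, 56, 60,
                            63, 66, 68, 70, 72, 73, 74, 75, 76, 76])

        def defect_growth(t, a, b):
            """Exponential saturation curve: a = eventual total defects, b = detection rate."""
            return a * (1.0 - np.exp(-b * t))

        (a_hat, b_hat), _ = curve_fit(defect_growth, weeks, defects, p0=[100.0, 0.1])
        print(f"predicted total defects: {a_hat:.1f}, remaining: {a_hat - defects[-1]:.1f}")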


  •  2 

    Covariate Software Vulnerability Discovery Model to Support Cybersecurity T&E

    Lance Fiondella (University of Massachusetts)


    Vulnerability discovery models (VDM) have been proposed as an application of software reliability growth models (SRGM) to software security related defects. VDM model the number of vulnerabilities discovered as a function of testing time, enabling quantitative measures of security. Despite their obvious utility, past VDM have been limited to parametric forms that do not consider the multiple activities software testers undertake in order to identify vulnerabilities. In contrast, covariate SRGM characterize the software defect discovery process in terms of one or more test activities. However, data sets documenting multiple security testing activities suitable for application of covariate models are not readily available in the open literature. To demonstrate the applicability of covariate SRGM to vulnerability discovery, this research identified a web application to target as well as multiple tools and techniques to test for vulnerabilities. The time dedicated to each test activity and the corresponding number of unique vulnerabilities discovered were documented and prepared in a format suitable for application of covariate SRGM. Analysis and prediction were then performed and compared with a flexible VDM without covariates, namely the Alhazmi-Malaiya Logistic Model (AML). Our results indicate that covariate VDM significantly outperformed the AML model on predictive and information theoretic measures of goodness of fit, suggesting that covariate VDM are a suitable and effective method to predict the impact of applying specific vulnerability discovery tools and techniques.


Room C

Students & Fellows Speed Session 2

Please note many of the Speed Sessions will also include a poster during the Poster Session

Session Chair: Jim Starling, United States Military Academy at West Point
  •  3 

    A Bayesian Approach for Nonparametric Multivariate Process Monitoring using Universal Residuals

    Daniel Timme (Florida State University)


    In quality control, monitoring sequential functional observations for characteristic changes through change-point detection is a common practice to ensure that a system or process produces high-quality outputs. Existing methods in this field often focus only on identifying when a process is out of control without quantifying the uncertainty of the underlying decision-making process. To address this issue, we propose using universal residuals under a Bayesian paradigm to determine whether the process is out of control and to assess the uncertainty surrounding that decision. The universal residuals are computed by combining two nonparametric techniques: regression trees and kernel density estimation. These residuals have the key feature of being uniformly distributed when the process is in control. To test whether the residuals are uniformly distributed across time (i.e., that the process is in control), we use a Bayesian approach to hypothesis testing, which outputs posterior probabilities for events such as the process being out of control at the current time, in the past, or in the future. We perform a simulation study and demonstrate that the proposed methodology has strong detection performance and a low false alarm rate.


  •  1 

    Implementing Fast Flexible Space Filling Designs In R

    Christopher Dimapasok (IDA / Johns Hopkins University)


    Modeling and simulation (M&S) can be a useful tool for testers and evaluators when they need to augment the data collected during a test event. During the planning phase, testers use experimental design techniques to determine how much data to collect and which data to collect. When designing a test that involves M&S, testers can use Space-Filling Designs (SFD) to spread out points across the operational space. Fast Flexible Space-Filling Designs (FFSFD) are a type of SFD that is useful for M&S because they work well in nonrectangular design spaces and allow for the inclusion of categorical factors, both of which are recurring features in defense testing. Guidance from the Deputy Secretary of Defense and the Director of Operational Test and Evaluation encourages the use of open and interoperable software and recommends the use of SFD; this project aims to address both. IDA analysts developed a function to create FFSFD using the free statistical software R. To our knowledge, there are no R packages for the creation of an FFSFD that can accommodate a variety of user inputs, such as categorical factors. Moreover, by using this function, users can share their code to make their work reproducible. This presentation starts with background information about M&S and, more specifically, SFD. The briefing uses a notional missile system example to explain FFSFD in more detail and to show the FFSFD R function inputs and outputs. The briefing ends with a summary of future work for this project.


  •  3 

    Development of a Wald-Type Statistical Test to Compare Live Test Data and M&S Predictions

    Carrington Metts (IDA)


    This work describes the development of a statistical test created in support of ongoing verification, validation, and accreditation (VV&A) efforts for modeling and simulation (M&S) environments. The test decides between a null hypothesis of agreement between the simulation and reality, and an alternative hypothesis stating the simulation and reality do not agree. To do so, it generates a Wald-type statistic that compares the coefficients of two generalized linear models that are estimated on live test data and analogous simulated data, then determines whether any of the coefficient pairs are statistically different. The test was applied to two logistic regression models that were estimated from live torpedo test data and simulated data from the Naval Undersea Warfare Center’s (NUWC) Environment Centric Weapons Analysis Facility (ECWAF). The test did not show any significant differences between the live and simulated tests for the scenarios modeled by the ECWAF. While more work is needed to fully validate the ECWAF’s performance, this finding suggests that the facility is adequately modeling the various target characteristics and environmental factors that affect in-water torpedo performance. The primary advantage of this test is that it is capable of handling cases where one or more variables are estimable in one model but missing or inestimable from the other. While it is possible to simply create the linear models on the common set of variables, this results in the omission of potentially useful test data. Instead, this approach identifies the mismatched coefficients and combines them with the model’s intercept term, thus allowing the user to consider models that are created on the entire set of available data. Furthermore, the test was developed in a generalized manner without any references to a specific dataset or system. Therefore, other researchers who are conducting VV&A processes on other operational systems may benefit from using this test for their own purposes.
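
    For readers unfamiliar with the construction, the sketch below computes a generic Wald-type statistic comparing the coefficient vectors of two independently estimated logistic regressions; it covers only the simple case in which both models share the same terms (not the authors' handling of mismatched coefficients), and the data are synthetic.

        import numpy as np
        import statsmodels.api as sm
        from scipy.stats import chi2

        rng = np.random.default_rng(2)

        def fit_logit(n, beta):
            """Simulate a logistic regression data set and fit it."""
            X = sm.add_constant(rng.normal(size=(n, 2)))
            p = 1.0 / (1.0 + np.exp(-X @ beta))
            y = rng.binomial(1, p)
            return sm.Logit(y, X).fit(disp=0)

        live = fit_logit(300, np.array([0.5, 1.0, -0.8]))    # stand-in for live test data
        sim = fit_logit(1000, np.array([0.5, 1.0, -0.8]))    # stand-in for simulated data

        # Wald-type statistic: coefficient differences weighted by the summed covariances.
        d = live.params - sim.params
        V = live.cov_params() + sim.cov_params()
        W = float(d @ np.linalg.solve(V, d))
        p_value = chi2.sf(W, df=len(d))
        print(f"W = {W:.2f}, p = {p_value:.3f}")  # large p: no evidence the models disagree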


  •  1 

    Energetic Defect Characterizations

    Naomi Edegbe (United States Military Academy)


    Energetic defect characterization in munitions is a task requiring further refinement in military manufacturing processes. Convolutional neural networks (CNNs) have shown promise in defect localization and segmentation in recent studies. These studies suggest that a CNN architecture can be used to localize casting defects in X-ray images. The U.S. Armament Center has provided munition images for training to develop a system, measured against MILSPEC requirements, that identifies and categorizes defective munitions. In our approach, we utilize preprocessed munitions images and transfer learning from prior studies' model weights to compare localization accuracy on this dataset for application in the field.


  •  2 

    Covariate Resilience Modeling

    Priscila Silva (additional authors: Andrew Bajumpaa, Drew Borden, and Christian Taylor) (University of Massachusetts Dartmouth)


    Resilience is the ability of a system to respond, absorb, adapt, and recover from a disruptive event. Dozens of metrics to quantify resilience have been proposed in the literature. However, fewer studies have proposed models to predict these metrics or the time at which a system will be restored to its nominal performance level after experiencing degradation. This talk presents three alternative approaches to model and predict performance and resilience metrics with techniques from reliability engineering, including (i) bathtub-shaped hazard functions, (ii) mixture distributions, and (iii) a model incorporating covariates related to the intensity of events that degrade performance as well as efforts to restore performance. Historical data sets on job losses during seven different recessions in the United States, including the recession that began in 2020 due to COVID-19, are used to assess the predictive accuracy of these approaches. Goodness-of-fit measures, confidence intervals, and interval-based resilience metrics are computed to assess how well the models perform on the data sets considered. The results suggest that both bathtub-shaped functions and mixture distributions can produce accurate predictions for data sets exhibiting V-, U-, L-, and J-shaped curves, but that W- and K-shaped curves, which experience multiple shocks, deviate from the assumption of a single decrease and subsequent increase, or suffer a sudden drop in performance, cannot be characterized well by either of those classes. In contrast, the model incorporating covariates tracks all of the types of curves noted above very well, including W- and K-shaped curves such as the two successive shocks the U.S. economy experienced in 1980 and the sharp degradation in 2020. Moreover, the covariate model outperforms the simpler models on all of the goodness-of-fit measures and interval-based resilience metrics computed for all seven data sets considered. These results suggest that classical reliability modeling techniques such as bathtub-shaped hazard functions and mixture distributions are suitable for modeling and predicting some resilience curves possessing a single decrease and subsequent recovery, but that covariate models, which explicitly incorporate explanatory factors and domain-specific information, are much more flexible and achieve higher goodness of fit and greater predictive accuracy. Thus, the covariate modeling approach provides a general framework for data collection and predictive modeling for a variety of resilience curves.


  •  2 

    Application of Software Reliability and Resilience Models to Machine Learning

    Zakaria Faddi (University of Massachusetts Dartmouth)


    Machine learning (ML) systems such as convolutional neural networks (CNNs) are susceptible to adversarial scenarios, in which an attacker attempts to manipulate or deceive a machine learning model by providing it with malicious input. This can result in the model making incorrect predictions or decisions, which can have severe consequences in applications such as security, healthcare, and finance, and it motivates quantitative reliability and resilience evaluation of ML algorithms. Failure in the ML algorithm can lead not only to failures in the application domain but also to failures in the system to which it provides functionality, which may have a performance requirement; hence the need to apply software reliability and resilience methods. This talk demonstrates the applicability of software reliability and resilience tools to ML algorithms, providing an objective approach to assess recovery after a degradation caused by known adversarial attacks. The results indicate that software reliability growth models and tools can be used to monitor the performance and quantify the reliability and resilience of ML models in the many domains in which machine learning algorithms are applied.


  •  2 

    Application of Recurrent Neural Network for Software Defect Prediction

    Fatemeh Salboukh (University of Massachusetts Dartmouth)


    Traditional software reliability growth models (SRGM) characterize software defect detection as a function of testing time. Many SRGM are modeled by the non-homogeneous Poisson process (NHPP). However, those models are parametric in nature and do not explicitly encode the factors driving defect or vulnerability discovery. Moreover, NHPP models are characterized by a mean value function that predicts the average number of defects discovered by a certain point in the testing interval, but this may not capture all of the changes and details present in the data. More recent studies have proposed SRGM incorporating covariates, where defect discovery is a function of one or more test activities documented and recorded during the testing process. These covariate models introduce an additional parameter per testing activity, which adds a high degree of non-linearity to traditional NHPP models, and parameter estimation becomes complex because it is limited to maximum likelihood estimation or expectation maximization. This talk therefore assesses the potential of neural networks to predict software defects, given their ability to remember trends. Three different neural networks are considered: (i) recurrent neural networks (RNN), (ii) long short-term memory (LSTM), and (iii) gated recurrent units (GRU). The neural network approaches are compared with the covariate model to evaluate their predictive ability. Results suggest that GRU and LSTM achieve better goodness-of-fit measures, such as SSE, PSSE, and MAPE, than RNN and covariate models, indicating more accurate predictions.
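
    A minimal sketch, assuming PyTorch, of the kind of recurrent model compared in this work: an LSTM trained to predict the next value of a defect series from a short history window. The architecture, window length, and data are illustrative choices, not the configuration used in the study.

        import torch
        import torch.nn as nn

        # Synthetic cumulative weekly defect counts, scaled to [0, 1].
        counts = torch.tensor([5, 12, 20, 27, 35, 41, 47, 52, 56, 60, 63, 66, 68, 70, 72],
                              dtype=torch.float32)
        scaled = counts / counts.max()

        # Build (window -> next value) training pairs with a window of 3.
        window = 3
        X = torch.stack([scaled[i:i + window] for i in range(len(scaled) - window)]).unsqueeze(-1)
        y = scaled[window:].unsqueeze(-1)

        class DefectLSTM(nn.Module):
            def __init__(self, hidden=16):
                super().__init__()
                self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
                self.head = nn.Linear(hidden, 1)

            def forward(self, x):
                out, _ = self.lstm(x)          # out: (batch, window, hidden)
                return self.head(out[:, -1])   # predict from the last time step

        model = DefectLSTM()
        opt = torch.optim.Adam(model.parameters(), lr=0.01)
        loss_fn = nn.MSELoss()
        for _ in range(300):                   # short training loop on the tiny series
            opt.zero_grad()
            loss = loss_fn(model(X), y)
            loss.backward()
            opt.step()

        next_scaled = model(scaled[-window:].reshape(1, window, 1)).item()
        print("predicted next cumulative count:", next_scaled * counts.max().item())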


  •  2 

    Topological Data Analysis’ involvement in Cyber Security

    Anthony Cappetta and Elie Alhajjar (United States Military Academy)


    The purpose of this research is to examine the use and application of Topological Data Analysis (TDA) in the realm of cyber security. The methods used in this research include an exploration of different Python libraries and C++ Python interfaces for studying the shape of data with TDA. These libraries include, but are not limited to, GUDHI, GIOTTO, and scikit-tda. The project's results will show where the topological holes in cyber security data lie and will offer methods for better analyzing these holes and breaches.
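
    As a small example of the kind of computation these libraries support (not the project's pipeline), the sketch below builds a Vietoris-Rips complex on a synthetic point cloud with GUDHI and extracts its persistence diagram; in practice, feature vectors derived from network or security data would replace the random points.

        import numpy as np
        import gudhi  # GUDHI TDA library

        # Synthetic point cloud standing in for features extracted from cyber data:
        # a noisy circle, which should show one prominent 1-dimensional hole.
        rng = np.random.default_rng(3)
        theta = rng.uniform(0, 2 * np.pi, 100)
        points = np.column_stack([np.cos(theta), np.sin(theta)]) + rng.normal(0, 0.05, (100, 2))

        # Vietoris-Rips complex up to dimension 2, then persistent homology.
        rips = gudhi.RipsComplex(points=points, max_edge_length=2.0)
        simplex_tree = rips.create_simplex_tree(max_dimension=2)
        diagram = simplex_tree.persistence()

        # A long-lived 1-dimensional feature indicates a loop ("hole") in the data.
        loops = [(birth, death) for dim, (birth, death) in diagram if dim == 1]
        print(sorted(loops, key=lambda bd: bd[1] - bd[0], reverse=True)[:3])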


Room D

Contributed Session 2
Session Chair: Jim Warner, NASA Langley Research Center
  •  3 

    Empirical Calibration for a Linearly Extrapolated Lower Tolerance Bound

    Caleb King (JMP Statistical Discovery)


    In many industries, the reliability of a product is often determined by a quantile of a distribution of a product's characteristics meeting a specified requirement. A typical approach to address this is to assume a distribution model and compute a one-sided confidence bound on the quantile. However, this can become difficult if the sample size is too small to reliably estimate a parametric model. Linear interpolation between order statistics is a viable nonparametric alternative if the sample size is sufficiently large. In most cases, linear extrapolation from the extreme order statistics can be used, but can result in inconsistent coverage. In this talk, we'll present an empirical study from our submitted manuscript used to generate calibrated weights for linear extrapolation that greatly improves the accuracy of the coverage across a feasible range of distribution families with positive support. We'll demonstrate this calibration technique using two examples from industry.


  •  2 

    Analysis of Surrogate Strategies and Regularization with Application to High-Speed Flows

    Gregory Hunt (William & Mary)


    Surrogate modeling is an important class of techniques used to reduce the burden of resource-intensive computational models by creating fast and accurate approximations. In aerospace engineering, surrogates have been used to great effect in design, optimization, exploration, and uncertainty quantification (UQ) for a range of problems, like combustor design, spacesuit damage assessment, and hypersonic vehicle analysis. Consequently, the development, analysis, and practice of surrogate modeling is of broad interest. In this talk, several widely used surrogate modeling strategies are studied as archetypes in a discussion on parametric/nonparametric surrogate strategies, local/global model forms, complexity regularization, uncertainty quantification, and relative strengths/weaknesses. In particular, we consider several variants of two widely used classes of methods: polynomial chaos and Gaussian process regression. These surrogate models are applied to several synthetic benchmark test problems and examples of real high-speed flow problems, including hypersonic inlet design, thermal protection systems, and shock-wave/boundary-layer interactions. Through analysis of these concrete examples, we analyze the trade-offs that modelers must navigate to create accurate, flexible, and robust surrogates.


  •  2 

    Case Study on Test Planning and Data Analysis for Comparing Time Series

    Phillip Koshute (Johns Hopkins University Applied Physics Laboratory)


    Several years ago, the US Army Research Institute of Environmental Medicine developed an algorithm to estimate core temperature in military working dogs (MWDs). This canine thermal model (CTM) is based on thermophysiological principles and incorporates environmental factors and acceleration. The US Army Medical Materiel Development Activity is implementing this algorithm in a collar-worn device that includes computing hardware, environmental sensors, and an accelerometer. Among other roles, Johns Hopkins University Applied Physics Laboratory (JHU/APL) is coordinating the test and evaluation of this device. The device’s validation is ultimately tied to field tests involving MWDs. However, to minimize the burden to MWDs and the interruptions to their training, JHU/APL seeks to leverage non-canine laboratory-based testing to the greatest possible extent. For example, JHU/APL is testing the device’s accelerometers with shaker tables that vertically accelerate the device according to specified sinusoidal acceleration profiles. This test yields time series of acceleration and related metrics, which are compared to ground-truth measurements from a reference accelerometer. Statistically rigorous comparisons between the CTM and reference measurements must account for the potential lack of independence between measurements that are close in time. Potentially relevant techniques include downsampling, paired difference tests, hypothesis tests of absolute difference, hypothesis tests of distributions, functional data analysis, and bootstrapping. These considerations affect both test planning and subsequent data analysis. This talk will describe JHU/APL’s efforts to test and evaluate the CTM accelerometers and will outline a range of possible methods for comparing time series.


  •  1 

    Model Validation Levels for Model Authority Quantification

    Kyle Provost (STAT COE)


    Due to the increased use of Modeling & Simulation (M&S) in the development of Department of Defense (DOD) weapon systems, it is critical to assign models appropriate levels of trust. Validation is an assessment process that can help mitigate the risks posed by relying on potentially inaccurate, insufficient, or incorrect models. However, validation criteria are often subjective and inconsistently applied between differing models. Current practice fails to reassess models as requirements change, mission scope is redefined, new data are collected, or models are adapted to a new use. This brief will present Model Validation Levels (MVLs) as a validation paradigm that enables rigorous, objective validation of a model and yields metrics that quantify the amount of trust that can be placed in a model. This validation framework will be demonstrated through a real-world example detailing the construction and interpretation of MVLs.


Room E

Virtual Session
Mini-Tutorial 3
Session Chair: Jason Sheldon, IDA
    Data Management for Research, Development, Test, and Evaluation


    It is important to manage data from research, development, test, and evaluation effectively. Well-managed data makes research more efficient and promotes better analysis and decision-making. At present, numerous federal organizations are engaged in large-scale reforms to improve the way they manage their data, and these reforms are already affecting the way research is executed. Data management affects every part of the research process. Thoughtful, early planning sets research projects on the path to success by ensuring that the resources and expertise required to effectively manage data throughout the research process are in place when they are needed. This interactive tutorial will discuss the planning and execution of data management for research projects. Participants will build a data management plan, considering data security, organization, metadata, reproducibility, and archiving. By the conclusion of the tutorial, participants will be able to define data management and understand its importance, understand how the data lifecycle relates to the research process, and be able to build a data management plan.

    Heather Wojton and Matthew Avery (IDA)




5:00 PM - 7:00 PM: Parallel Sessions

Café

Poster Session and Reception
  •  1 

    Comparison of Magnetic Field Line Tracing Methods

    Dean Thomas (George Mason University)


    At George Mason University, we are developing swmfio, a Python package for processing Space Weather Modeling Framework (SWMF) magnetosphere and ionosphere results; the SWMF is used to study the sun, the heliosphere, and the magnetosphere. The SWMF framework centers around a high-performance magnetohydrodynamic (MHD) model, the Block Adaptive Tree Solar-wind Roe Upwind Scheme (BATS-R-US). This analysis uses swmfio and other methods to trace magnetic field lines, compare the results, and identify why the methods differ. While the earth's magnetic field protects the planet from solar radiation, solar storms can distort the earth's magnetic field, allowing them to damage satellites and electrical grids. Being able to trace magnetic field lines helps us understand space weather. In this analysis, the September 1859 Carrington Event is examined. This event is the most intense geomagnetic storm in recorded history. We use three methods to trace magnetic field lines in the Carrington Event and compare the field lines generated by the different methods. We consider two factors in the analysis. First, we directly compare methods by measuring the distances between field lines generated by different methods. Second, we consider how sensitive the methods are to initial conditions. We note that swmfio's linear interpolation, which is customized for the BATS-R-US adaptive mesh, provides expected results: it is insensitive to small changes in initial conditions and terminates field lines at boundaries. We observe that, for any method, when the mesh size becomes large, results may not be accurate.


  •  1 

    Framework for Operational Test Design: An Example Application of Design Thinking

    Miriam Armstrong (IDA)


    Design thinking is a problem-solving approach that promotes the principles of human-centeredness, iteration, and diversity. The poster provides a five-step framework for how to incorporate these design principles when building an operational test. In the first step, test designers conduct research on test users and the problems they encounter. In the second step, designers articulate specific user needs to address in the test design. In the third step, designers generate multiple solutions to address user needs. In the fourth step, designers create prototypes of their best solutions. In the fifth step, designers refine prototypes through user testing.


  •  2 

    The Component Damage Vector Method: A Statistically Rigorous Method for Validating AJEM

    Tom Johnson (IDA)


    As the Test and Evaluation community increasingly relies on Modeling and Simulation (M&S) to supplement live testing, M&S validation has become critical for ensuring credible weapon system evaluations. System-level evaluations of Armored Fighting Vehicles (AFV) rely on the Advanced Joint Effectiveness Model (AJEM) and Full-Up System Level (FUSL) testing to assess AFV vulnerability. This report reviews one of the primary methods that analysts use to validate AJEM, called the Component Damage Vector (CDV) Method. The CDV method compares components that were damaged in FUSL testing to simulated representations of that damage from AJEM.


  •  1 

    Fully Bayesian Data Imputation using Stan Hamiltonian Monte Carlo

    Melissa Hooke (NASA Jet Propulsion Laboratory)


    When doing multivariate data analysis, one common obstacle is the presence of incomplete observations, i.e., observations for which one or more covariates are missing data. Rather than deleting entire observations that contain missing data, which can lead to small sample sizes and biased inferences, data imputation methods can be used to statistically “fill in” missing data. Imputing data can help combat small sample sizes by using the existing information in partially complete observations, with the end goal of producing less biased and higher confidence inferences.

    In aerospace applications, imputation of missing data is particularly relevant because sample sizes are small and quantifying uncertainty in the model is of utmost importance. In this paper, we outline the benefits of a fully Bayesian imputation approach which samples simultaneously from the joint posterior distribution of model parameters and the imputed values for the missing data. This approach is preferred over multiple imputation approaches because it performs the imputation and modeling steps in one step rather than two, making it more compatible with complex model forms. An example of this imputation approach is applied to the NASA Instrument Cost Model (NICM), a model used widely across NASA to estimate the cost of future spaceborne instruments. The example models are implemented in Stan, a statistical-modeling tool enabling Hamiltonian Monte Carlo (HMC).
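
    The models described here are implemented in Stan; as a rough Python analog (a swapped-in tool for illustration, not the authors' code), the sketch below uses PyMC, where passing a masked array as observed data causes the missing entries to be treated as latent quantities and sampled jointly with the model parameters.

        import numpy as np
        import pymc as pm
        import arviz as az

        # Hypothetical instrument measurements with missing values (np.nan marks the gaps).
        mass = np.array([12.1, 8.4, np.nan, 15.0, 9.7, np.nan, 11.3])

        with pm.Model():
            mu = pm.Normal("mu", mu=10.0, sigma=10.0)
            sigma = pm.HalfNormal("sigma", sigma=5.0)
            # A masked array tells PyMC to impute the missing entries as part of the
            # joint posterior rather than in a separate imputation step.
            pm.Normal("mass", mu=mu, sigma=sigma, observed=np.ma.masked_invalid(mass))
            idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

        # The summary includes posterior estimates for mu, sigma, and the imputed values.
        print(az.summary(idata))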


  •  1 

    Introducing TestScience.org

    Sean Fiorito (IDA / V-Strat, LLC)


    The Test Science Team facilitates data-driven decision-making by disseminating various testing and analysis methodologies. One way they disseminate these methodologies is through the annual workshop, DATAWorks; another is through the website, TestScience.org. The Test Science website includes video training, interactive tools, and a related research library, as well as the DATAWorks Archive. "Introducing TestScience.org", a presentation at DATAWorks, could include a poster and an interactive guided session through the site content. The presentation would inform interested DATAWorks attendees of the additional resources available throughout the year. It could also be used to inform the audience about ways to participate, such as contributing interactive Shiny tools, training content, or research. "Introducing TestScience.org" would highlight the following sections of the website: 1. The DATAWorks Archives; 2. Learn (video training); 3. Tools (interactive tools); 4. Research (library); 5. Team (about and contact). Incorporating an introduction to TestScience.org into DATAWorks would inform attendees of additional valuable resources available to them and could encourage broader participation in TestScience.org, adding value to both the DATAWorks attendees and the TestScience.org efforts.


  •  1 

    Developing a Domain-Specific NLP Topic Modeling Process for Army Experimental Data

    Anders Grau (United States Military Academy)


    Researchers across the U.S. Army are conducting experiments on the implementation of emerging technologies on the battlefield. Key data points from these experiments include text comments on the technologies' performance. Researchers use a range of Natural Language Processing (NLP) tasks to analyze such comments, including text summarization, sentiment analysis, and topic modeling. Building on successful results from research in other domains, this work aims to yield greater insights by using military-specific language as opposed to a generalized corpus. This research is dedicated to developing a methodology to analyze text comments from Army experiments and field tests using topic models trained on an Army domain-specific corpus. The methodology is tested on experimental data agglomerated in the Forge database, an Army Futures Command (AFC) initiative to provide researchers with a common operating picture of AFC research. As a result, this research offers an improved framework for analysis with domain-specific topic models for researchers across the U.S. Army.


  •  2 

    The Application of Semi-Supervised Learning in Image Classification

    Elijah Dabkowski (United States Military Academy)


    In today's Army, one of the fastest growing and most important areas in the effectiveness of our military is data science. One aspect of this field is image classification, which has applications such as target identification. However, one drawback within this field is that when an analyst begins to deal with a multitude of images, it becomes infeasible for an individual to examine all the images and classify them accordingly. My research presents a methodology for image classification which can be used in a military context, utilizing a typical unsupervised classification approach involving K-Means to classify a majority of the images while pairing this with user input to determine the label of designated images. The user input comes in the form of manual classification of certain images which are deliberately selected for presentation to the user, allowing this individual to select which group the image belongs in and refine the current image clusters. This shows how a semi-supervised approach to image classification can efficiently improve the accuracy of the results when compared to a traditional unsupervised classification approach.


  •  1 

    Best Practices for Using Bayesian Reliability Analysis in Developmental Testing

    Paul Fanto (IDA)


    Traditional methods for reliability analysis are challenged in developmental testing (DT) as systems become increasingly complex and DT programs become shorter and less predictable. Bayesian statistical methods, which can combine data across DT segments and use additional data to inform reliability estimates, can address some of these challenges. However, Bayesian methods are not widely used. I will present the results of a study aimed at identifying effective practices for the use of Bayesian reliability analysis in DT programs. The study consisted of interviews with reliability subject matter experts, together with a review of relevant literature on Bayesian methods. This analysis resulted in a set of best practices that can guide an analyst in deciding whether to apply Bayesian methods, in selecting the appropriate Bayesian approach, and in applying the Bayesian method and communicating the results.


  •  2 

    Test and Evaluation Tool for Stealthy Communication

    Olga Chen (U.S. Naval Research Laboratory)


    Stealthy communication allows the transfer of information while hiding not only the content of that information but also the fact that any hidden information was transferred. One way of doing this is embedding information into network covert channels, e.g., timing between packets, header fields, and so forth. We describe our work on an integrated system for the design, analysis, and testing of such communication. The system consists of two main components: the analytical component, the NExtSteP (NRL Extensible Stealthy Protocols) testbed, and the emulation component, consisting of CORE (Common Open Research Emulator), an existing open-source network emulator, and EmDec, a new tool for embedding stealthy traffic in CORE and decoding the result.

    We developed the NExtSteP testbed as a tool to evaluate the performance and stealthiness of embedders and detectors applied to network traffic. NExtSteP includes modules to: generate synthetic traffic data or ingest it from an external source (e.g., emulation or network capture); embed data using an extendible collection of embedding algorithms; classify traffic, using an extendible collection of detectors, as either containing or not containing stealthy communication; and quantify, using multiple metrics, the performance of a detector over multiple traffic samples. This allows us to systematically evaluate the performance of different embedders (and embedder parameters) and detectors against each other. Synthetic data are easy to generate with NExtSteP. We use these data for initial experiments to broadly guide parameter selection and to study asymptotic properties that require numerous long traffic sequences to test. The modular structure of NExtSteP allows us to make our experiments increasingly realistic. We have done this in two ways: by ingesting data from captured traffic and then doing embedding, classification, and detector analysis in NExtSteP, and by using EmDec to produce external traffic data with embedded communication and then using NExtSteP to do the classification and detector analysis.

    The emulation component was developed to build and evaluate proof-of-concept stealthy communications over existing IP networks. The CORE environment provides a full network, consisting of multiple nodes, with minimal hardware requirements and allows testing and orchestration of real protocols. Our testing environment allows for replay of real traffic and generation of synthetic traffic using the MGEN (Multi-Generator) network testing tool. The EmDec software was created with the existing NRL-developed protolib (protocol library). EmDec, running on CORE networks and orchestrated using a set of scripts, generates sets of data that are then evaluated for effectiveness by NExtSteP. In addition to evaluation by NExtSteP, development of EmDec allowed us to discover multiple novelties that were not apparent when using theoretical models. We describe the current status of our work, the results so far, and our future plans.


  •  2 

    Comparison of Bayesian and Frequentist Methods for Regression

    James P Theimer (Homeland Security Community of Best Practices)


    Statistical analysis is typically conducted using either a frequentist or Bayesian approach. But what is the impact of choosing one analysis method over another? This presentation will compare the results of both linear and logistic regression using Bayesian and frequentist methods. The data set combines information on simulated diffusion of material and anticipated background signal to imitate sensor output. The sensor is used to estimate the total concentration of material, and a threshold will be set such that the false alarm rate (FAR) due to the background is a constant. The regression methods are used to relate the probability of detection, for a given FAR, to predictor variables, such as the total amount of material released. The presentation concludes with a comparison of the similarities and differences between the two methods given the results.


  •  2 

    Post-hoc UQ of Deep Learning Models Applied to Remote Sensing Image Scene Classification

    Alexei Skurikhin (Los Alamos National Laboratory)


    Steadily growing quantities of high-resolution UAV, aerial, and satellite imagery provide an exciting opportunity for global transparency and geographic profiling of activities of interest. Advances in deep learning, such as deep convolutional neural networks (CNNs) and transformer models, offer more efficient ways to exploit remote sensing imagery. Transformers, in particular, are capable of capturing contextual dependencies in the data. Accounting for context is important because activities of interest are often interdependent and reveal themselves in the co-occurrence of related image objects or related signatures. However, while transformers and CNNs are powerful models, their predictions are often taken as point estimates, also known as pseudo-probabilities, as they are computed by the softmax function. They do not provide information about how confident the model is in its predictions, which is important in many mission-critical applications and therefore limits their use in this space. Model evaluation metrics can provide information about a predictive model's performance. We present and discuss results of post-hoc uncertainty quantification (UQ) of deep learning models, i.e., UQ applied to trained models. We consider an application of CNN and transformer models to remote sensing image scene classification using satellite imagery, and compare confidence estimates of the scene classification predictions of these models using evaluation metrics such as expected calibration error, reliability diagrams, and the Brier score, in addition to conventional metrics such as accuracy and F1 score. For validation, we use the publicly available and well-characterized Remote Sensing Image Scene Classification (RESISC45) dataset, which contains 31,500 images covering 45 scene categories with 700 images in each category, and with spatial resolution that varies from 30 to 0.2 m per pixel. This dataset was collected over different locations and under different conditions and possesses rich variations in translation, viewpoint, object pose and appearance, spatial resolution, illumination, background, and occlusion.
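
    For readers unfamiliar with the calibration metrics mentioned above, the sketch below computes a basic expected calibration error by binning predicted confidences and comparing per-bin accuracy to per-bin confidence; the confidences and outcomes are synthetic placeholders for a scene-classification model's softmax outputs.

        import numpy as np

        def expected_calibration_error(confidences, correct, n_bins=10):
            """Weighted average gap between confidence and accuracy over equal-width bins."""
            bins = np.linspace(0.0, 1.0, n_bins + 1)
            ece = 0.0
            for lo, hi in zip(bins[:-1], bins[1:]):
                mask = (confidences > lo) & (confidences <= hi)
                if mask.any():
                    gap = abs(correct[mask].mean() - confidences[mask].mean())
                    ece += mask.mean() * gap
            return ece

        # Synthetic stand-ins: top-class confidences and whether each prediction was right.
        rng = np.random.default_rng(4)
        conf = rng.uniform(0.5, 1.0, 5000)
        correct = rng.binomial(1, np.clip(conf - 0.1, 0, 1), 5000)  # model is overconfident

        print(f"ECE = {expected_calibration_error(conf, correct):.3f}")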


  •  2 

    Multimodal Data Fusion: Enhancing Image Classification with Text

    Jack Perreault (United States Military Academy)


    Image classification is a critical part of gathering information on high-value targets. To this end, Convolutional Neural Networks (CNN) have become the standard model for image and facial classification. However, CNNs alone are not entirely effective at image classification, and especially at human classification, due to their limited robustness and susceptibility to bias. Recent advances in CNNs, however, allow for data fusion to help reduce the uncertainty in their predictions. In this project, we describe a multimodal algorithm designed to increase confidence in image classification through a joint fusion model that combines image and text data. Our work utilizes CNNs for image classification and bag-of-words for text categorization on Wikipedia images and captions relating to the same classes as the CIFAR-100 dataset. Using data fusion, we combine the vectors of the CNN and bag-of-words models and apply a fully connected network to the joined data. We measure improvements by comparing the SoftMax layer of the joint fusion model and the image-only CNN.


  •  1 

    Predicting Success and Identifying Key Characteristics in Special Forces Selection

    Mark Bobinski (United States Military Academy)


    The United States Military possesses special forces units that are entrusted to engage in the most challenging and dangerous missions essential to fighting and winning the nation's wars. Entry into special forces is based on a series of assessments called Special Forces Assessment and Selection (SFAS), which consists of numerous challenges that test a soldier's mental toughness, physical fitness, and intelligence. Using logistic regression, random forest classification, and neural network classification, the researchers in this study aim to create a model that both accurately predicts whether a candidate passes SFAS and identifies which variables are significant indicators of passing selection. Logistic regression proved to be the most accurate model, while also highlighting physical fitness, military experience, and intellect as the most significant indicators associated with success.


  •  2 

    The Calculus of Mixed Meal Tolerance Test Trajectories

    Skyler Chauff (United States Military Academy)


    BACKGROUND: The post-prandial glucose response resulting from a mixed meal tolerance test is evaluated from trajectory data of measured glucose, insulin, C-peptide, GLP-1, and other measurements of insulin sensitivity and β-cell function. In order to compare responses between populations or different compositions of mixed meals, the trajectories are collapsed into the area under the curve (AUC) or incremental area under the curve (iAUC) for statistical analysis. Both AUC and iAUC are coarse distillations of the post-prandial curves, and important properties of the curve structure are lost. METHODS: Visual Basic for Applications (VBA) code was written to automatically extract seven key calculus-based curve-shape properties of post-prandial trajectories (glucose, insulin, C-peptide, GLP-1) beyond AUC. Through two-sample t-tests, the calculus-based markers were compared between outcomes (reactive hypoglycemia vs. healthy) and against demographic information. RESULTS: Statistically significant differences (p < .01) were found between the health outcomes of subjects for multiple curve properties, in addition to AUC, for each molecule studied. A model was created that predicts reactive hypoglycemia based on the individual curve properties most associated with outcomes. CONCLUSIONS: Response curve properties have predictive power that is not present when using AUC alone. In future studies, the calculus-based response curve properties will be used to predict diabetes and other health outcomes. In this sense, response-curve properties can predict an individual's susceptibility to illness prior to its onset using solely mixed meal tolerance test results.
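
    To make the AUC and iAUC baselines concrete (the authors' calculus-based curve-shape features go beyond these), the short sketch below computes both for a made-up post-prandial glucose trajectory using the trapezoid rule.

        import numpy as np

        # Hypothetical post-prandial glucose samples: minutes after the meal and mg/dL.
        t = np.array([0, 15, 30, 60, 90, 120])
        glucose = np.array([85, 130, 155, 140, 110, 95], dtype=float)

        auc = np.trapz(glucose, t)                                   # total area under the curve
        iauc = np.trapz(np.clip(glucose - glucose[0], 0, None), t)   # area above the fasting baseline
        peak = glucose.max()
        time_to_peak = t[glucose.argmax()]
        print(auc, iauc, peak, time_to_peak)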


  •  1 

    Using Multi-Linear Regression to Understand Cloud Properties' Impact on Solar Radiance

    Grant Parker (United States Military Academy)


    With solar energy being the most abundant energy source on Earth, it is no surprise that reliance on solar photovoltaics (PV) has grown exponentially in the past decade. The increasing costs of fossil fuels have made solar PV more competitive and renewable energy more attractive, and the International Energy Agency (IEA) forecasts that solar PV's installed power capacity will surpass that of coal by 2027. Crucial to the management of solar PV power is the accurate forecasting of solar irradiance, which is heavily affected by different types and distributions of clouds. Many studies have aimed to develop models that accurately predict global horizontal irradiance (GHI) while accounting for the volatile effects of clouds; in this study, we aim to develop a statistical model that helps explain the relationship between various cloud properties and the solar radiance reflected by the clouds themselves. Using 2020 GOES-16 data from the GOES R-Series Advanced Baseline Imager (ABI), we investigated the effect that cloud optical depth, cloud top temperature, solar zenith angle, and look zenith angle had on cloud solar radiance while accounting for differing longitudes and latitudes. Using these variables as the explanatory variables, we developed a linear model using multi-linear regression that, when tested on untrained data sets from different days (at the same time of day as the training set), yields a coefficient of determination (R^2) between 0.70 and 0.75. Lastly, after analyzing each variable's degree of contribution to the cloud solar radiance, we present error maps that highlight areas where the model succeeds and fails in prediction accuracy.
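
    A hedged sketch of the multi-linear regression setup described above, using synthetic stand-ins for the ABI-derived predictors (cloud optical depth, cloud top temperature, solar zenith angle, look zenith angle); distributions, coefficients, and noise are assumptions for illustration only:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score

    rng = np.random.default_rng(1)

    def make_day(n=2000):
        X = np.column_stack([
            rng.gamma(2.0, 5.0, n),     # cloud optical depth
            rng.normal(250, 20, n),     # cloud top temperature (K)
            rng.uniform(0, 80, n),      # solar zenith angle (deg)
            rng.uniform(0, 70, n),      # look zenith angle (deg)
        ])
        radiance = (3.0 * X[:, 0] - 0.5 * X[:, 1] - 1.2 * X[:, 2]
                    + 0.3 * X[:, 3] + rng.normal(0, 25, n))
        return X, radiance

    X_train, y_train = make_day()       # training day
    X_test, y_test = make_day()         # a different day, same time of day

    model = LinearRegression().fit(X_train, y_train)
    print("held-out R^2:", r2_score(y_test, model.predict(X_test)))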


  •  2 

    Data Fusion: Using Data Science to Facilitate the Fusion of Multiple Streams of Data

    Madison McGovern (United States Military Academy)


    Today there are an increasing number of sensors on the battlefield. These sensors collect data that includes, but is not limited to, images, audio files, videos, and text files. With today’s technology, the data collection process is strong, and there is a growing opportunity to leverage multiple streams of data, each arriving in a different form. This project aims to take multiple types of data, specifically images and audio files, and combine them to increase our ability to detect and recognize objects. The end state of this project is an algorithm that merges voice recordings and images to enable easier recognition. Most research tends to focus on one modality or the other, but here we focus on leveraging both modalities simultaneously for improved entity resolution. For audio files, the most successful deconstruction and dimension-reduction technique is a deep autoencoder. For images, the most successful technique is a convolutional neural network. To combine the two modalities, we focused on two different techniques. The first runs each data source through a neural network and multiplies the resulting class probability vectors to capture the combined result. The second runs each data source through a neural network, extracts a layer from each network, concatenates the layers for paired image and audio samples, and then runs the concatenated object through a fully connected neural network.
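
    A minimal sketch of the first fusion technique described above: each modality's network produces a class-probability vector, and the element-wise product, renormalized, gives the combined prediction. The per-modality networks are placeholders here, and the probability values are invented for illustration:

    import numpy as np

    def fuse_by_product(p_image, p_audio):
        """Element-wise product of per-modality class probabilities, renormalized."""
        joint = p_image * p_audio
        return joint / joint.sum(axis=-1, keepdims=True)

    # Hypothetical softmax outputs from the image CNN and the audio classifier.
    p_img = np.array([[0.70, 0.20, 0.10]])
    p_aud = np.array([[0.40, 0.50, 0.10]])
    print(fuse_by_product(p_img, p_aud))   # approx. [[0.718 0.256 0.026]]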


  •  1 

    Assessing Risk with Cadet Candidates and USMA Admissions

    Daniel Lee (United States Military Academy)


    Though the United States Military Academy (USMA) graduates approximately 1,000 cadets annually, over 100 cadets from the initial cohort fail to graduate and are separated or resign, at great expense to the federal government. Graduation risk among incoming cadet candidates is difficult to measure; based on current research, the strongest predictors of college graduation risk are high school GPA and, to a lesser extent, standardized test scores. Other predictors include socioeconomic factors, demographics, culture, and measures of prolonged and active participation in extracurricular activities. For USMA specifically, a cadet candidate’s Whole Candidate Score (WCS), which includes measures of leadership and physical fitness, has historically proven to be a promising predictor of a cadet’s performance at USMA. However, predicting graduation rates and identifying risk variables remains difficult. Using data from the USMA Admissions Department, we applied logistic regression, k-Nearest Neighbors, random forests, and gradient boosting algorithms to better predict which cadets would be separated or resign, using candidate variables that may relate to graduation risk. Using measures such as p-values for statistical significance, correlation coefficients, and Area Under the Curve (AUC) scores to assess the true positive rate, we found that supplementing the current admissions criteria with data on participation in certain extracurricular activities improves prediction of whether a cadet will graduate.
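
    An illustrative sketch, on synthetic data, of how one might compare ROC AUC for a graduation-risk model with and without an extracurricular-participation feature added to a baseline admissions score; the feature names and effect sizes are assumptions, not the admissions data itself:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(2)
    n = 2000
    wcs = rng.normal(size=n)                  # stand-in for Whole Candidate Score
    extracurricular = rng.integers(0, 2, n)   # stand-in participation indicator
    y = (0.8 * wcs + 0.7 * extracurricular
         + rng.normal(scale=1.0, size=n) > 0.5).astype(int)

    X_base = wcs.reshape(-1, 1)
    X_full = np.column_stack([wcs, extracurricular])
    Xb_tr, Xb_te, Xf_tr, Xf_te, y_tr, y_te = train_test_split(
        X_base, X_full, y, test_size=0.3, random_state=0)

    auc_base = roc_auc_score(y_te, LogisticRegression().fit(Xb_tr, y_tr).predict_proba(Xb_te)[:, 1])
    auc_full = roc_auc_score(y_te, LogisticRegression().fit(Xf_tr, y_tr).predict_proba(Xf_te)[:, 1])
    print(f"AUC baseline: {auc_base:.3f}   AUC with extracurriculars: {auc_full:.3f}")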


  •  1 

    Overarching Tracker of DOT&E Actions

    Buck Thome (IDA)


    OED’s Overarching Tracker of DOT&E Actions distills information from DOT&E’s operational test reports and memoranda on test plan and test strategy approvals to generate informative metrics on the office’s activities. In FY22, DOT&E actions covered 68 test plans, 28 strategies, and 28 reports, relating to 74 distinct programs. This poster presents data from those documents and highlights findings on DOT&E’s effectiveness, suitability, and survivability determinations and other topics related to the state of T&E.