Disruptive innovations in naval technology, such as the shifts from sail to steam power, wooden to steel hulls, surface-only vessels to submarines and naval aviation, fossil fuels to nuclear propulsion, line-of-sight to over-the-horizon targeting, and independent tactical maneuver to network-centric warfare, have favored the nations best equipped to make tremendous pioneering investments. But at only a moderate cost, other innovations or creative uses of existing technology have helped lesser powers to level the playing field, holding superpower military forces at risk. These technologies include advanced mines, torpedo boats, guided missiles, and cyber warfare.
There are many indications that artificial intelligence (AI), coupled with the explosive growth in affordable computing power, may be the next disruption in the latter category. Navies that have not been able to sustain rigorous programs of high-end tactical training may quickly close the gap with automated decision aids. More important, the wide availability of autonomous vehicles makes it dramatically less costly to field new weapon and sensor systems and to cover more ocean with fewer ships (the eternal problem for all navies). AI presents a momentous opportunity for navies that employ it intelligently, and a profound vulnerability for those that get bogged down in the wrong details.
The Department of Defense’s first-ever “Artificial Intelligence Strategy” acknowledges this and outlines the desired future state, but as a strategic document it does not clarify how the procurement, evaluation, and employment of AI systems will be managed.1 More detailed policies will certainly follow, but so, too, will the risks of prematurely or imprudently embracing systems simply because they are “AI.” The Navy would be wise to clarify foundational principles for procuring and employing these systems. Below are four recommendations.
Frame Realistic Expectations
When most people hear the term “artificial intelligence,” they envision fully autonomous systems with superhuman capabilities—self-driving cars or the classic science fiction trope of a devious AI villain. Conversely, when marketing materials refer to a product or service as “AI-based,” they often mean only that parts of the system proved too complex to code by hand, so they were implemented using machine-learning techniques trained to some level of accuracy and reliability. And when computer scientists hear the term in a popular context, they wince and think of AI pioneer John McCarthy’s old joke, “As soon as it works, no one calls it AI anymore.”2
It’s always foolish to predict some absolute limit on the development of future technology—even the laws of physics occasionally contain new wrinkles. Yet it is worth remembering that there is not yet a single example of a truly autonomous system that consistently demonstrates anything like the breadth of human decision-making and awareness across a complex problem domain. To be sure, a multitude of AI-based systems easily exceed human performance in narrow tasks. My father had a 1960s-era AI textbook that purported to “demonstrate” that no deterministic computer system would ever beat the best chess players, and yet systems consistently able to thrash human champions have existed for more than 20 years. They are now so common that, true to McCarthy’s prediction, they no longer are considered significant avenues of AI research. More recently, IBM’s Watson and Google’s AlphaGo and AlphaZero have, with great fanfare, surpassed human performance at trivia contests and the game of Go, both requiring far more sophisticated AI techniques than chess. It has proven very difficult, however, to adapt these systems to surpass people’s decision-making in more complex and uncertain problem domains, and therefore their readiness for life-or-death decision-making remains highly dubious.3
Always Keep People in the Loop
Computers have an undeniable advantage over humans in many tasks, including perfect recall from practically limitless repositories of previous observations, rapid and flawless calculations, and the ability to repeat an operation indefinitely without fatigue. The best systems take advantage of these capabilities to present for human review the most promising courses of action, the most interesting signals, or the most consistent patterns from a vast data set. With modern computational resources, some algorithms can eschew that review and still reliably beat human experts in narrow problem domains, although this might engender overconfidence in the algorithms’ capability in more general situations.
Unfortunately, the same algorithms that build successful pattern-recognition and decision models can be slightly modified to automatically generate “adversarial inputs” on which those same models will systematically fail in surprising ways. One well-known example perturbed a photograph of a giant panda in ways imperceptible to a person, causing a state-of-the-art image-recognition AI to reclassify it from “panda” to “gibbon.”4 This sort of attack could be mounted against any AI classification system, such as one trying to distinguish friend from foe. Once self-driving cars become more common, it is only a matter of time before pranksters discover how a few pieces of masking tape can render a stop sign invisible to their cameras.
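To make the mechanism concrete, here is a sketch of my own (not drawn from the cited paper) of the gradient-sign idea behind such attacks, applied to a toy linear “friend or foe” classifier. Real attacks target deep neural networks, but the principle is the same: a small, systematic nudge to every input feature flips the model’s answer.

```python
# Illustrative sketch only: a toy linear classifier and a gradient-sign
# perturbation in the spirit of the cited adversarial-examples work.
import numpy as np

rng = np.random.default_rng(seed=0)
w = rng.normal(size=1_000)          # hypothetical trained weights

def classify(features):
    """Toy friend-or-foe call based on a single linear score."""
    return "friend" if w @ features > 0 else "foe"

x = rng.normal(size=1_000)          # a legitimate sensor input
score = w @ x

# Nudge each feature a small amount (epsilon) in whichever direction pushes
# the score toward the opposite class; for a linear model that direction is
# simply the sign of the corresponding weight.
epsilon = 0.1
x_adversarial = x - epsilon * np.sign(score) * np.sign(w)

print("original classification:   ", classify(x))
print("adversarial classification:", classify(x_adversarial))
print("largest per-feature change:", np.max(np.abs(x_adversarial - x)))
```

Against a deep network, the nudge follows the sign of the network’s gradient rather than its raw weights, which is why the change can remain imperceptible to a person while being decisive to the model.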
The current state of AI is bound by a paradox: Either decision systems are more or less hardcoded by human domain experts (and thus don’t really “know” anything that people do not), or they are generated by an evolutionary learning process that may eventually outperform human experts but that, almost by definition, cannot effectively explain its preferences, conclusions, or limitations in terms comprehensible to a person. In other words, useful machine learning is often also inexplicable.
Fully entrusting life-or-death decisions to AI systems is fraught with peril, and probably justifiable only when a human operator simply cannot be expected to process the same information and react in sufficient time. It is more practical and safer to use AI techniques to filter through mountains of data and highlight the most important real-time signals, most relevant historical precedents, or most likely predictive models (with some statistical measure of confidence) for evaluation by a well-trained human supervisor, who can then choose an action that complies with human expectations and first principles.
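The shape of that division of labor can be sketched in a few lines of code. The names, labels, and thresholds below are hypothetical, not drawn from any fielded system; the point is that the software only filters and ranks, while the decision, and the record of who made it, stays with the operator.

```python
# Hypothetical decision-support sketch: the model flags and ranks candidate
# tracks with a confidence score; a trained human supervisor makes and owns
# the final call.
from dataclasses import dataclass
from typing import List

@dataclass
class Track:
    track_id: str
    model_label: str      # e.g., "possible fast inshore attack craft"
    confidence: float     # model's statistical confidence, 0.0 to 1.0

def triage(tracks: List[Track], review_threshold: float = 0.6,
           top_k: int = 5) -> List[Track]:
    """Return the most promising candidates for human review; never act on them."""
    flagged = [t for t in tracks if t.confidence >= review_threshold]
    flagged.sort(key=lambda t: t.confidence, reverse=True)
    return flagged[:top_k]

def record_decision(track: Track, operator: str, decision: str) -> dict:
    """Log who decided what, preserving ownership and accountability."""
    return {"track": track.track_id, "model_label": track.model_label,
            "confidence": track.confidence, "operator": operator,
            "decision": decision}

# Example: the system surfaces one contact for review; the operator decides.
watch_list = triage([Track("T-001", "possible threat", 0.92),
                     Track("T-002", "fishing vessel", 0.41)])
log_entry = record_decision(watch_list[0], operator="OOD",
                            decision="continue tracking")
```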
Rather than continue to characterize such systems as artificial forms of intelligence, it is more accurate and prudent to label them “automatic pattern recognizers,” “partially automated decision-support aids,” and the like. This shift in nomenclature keeps AI limitations and fallibility on display and helps human operators remain mindful of their own responsibilities. Ultimately, viewing AI systems as support tools for people rather than as substitutes also clarifies their place within existing doctrine and reinforces the human chains of ownership, responsibility, and accountability that commanders presumably wish to maintain. And it underscores the need to train and evaluate operators in working with AI, because the most useful systems are those that do the most to amplify, rather than replace, people’s capabilities.
Keep Roles and Limitations Clear
In spring 2018, an Uber self-driving test vehicle (with a human supervisor on board) struck and killed a pedestrian in a well-publicized accident. The car’s sensors could not settle on whether the woman walking her bicycle across the road was a bicycle, a vehicle, or something else, and the system did not determine that any course or speed adjustments were needed until just 1.3 seconds before impact. By then an emergency braking maneuver was required, but the system had not been designed to initiate one.5 The human operator did not react in time. The test vehicle was also equipped with a separate sensor and automatic-braking system, Volvo’s City Safety, that might have prevented the collision, but it had been intentionally disabled, perhaps to avoid interference with Uber’s own autonomous control system.
At first glance, this might seem to be an argument against a human in the loop, as the automated system apparently had a quicker reaction time. On deeper inspection, however, there seem to have been gross flaws in the overall system design. According to the National Transportation Safety Board report, the Uber system was “not designed to alert the operator,” apparently even when the system was uncertain about its present situation (up to six seconds before the collision) or knew that emergency maneuvering was required (1.3 seconds before impact).6
If the Navy does not keep clear, conservative design principles and division of responsibilities at the forefront of all autonomous-system development, it easily could find itself in a similarly painful situation with self-piloting ships or aircraft. At a minimum, human watchstanders must remain mindful that the autonomous system is never fully autonomous. It is a decision tool, an automator of simple tasks only, and it requires constant supervision. It is not equipped with an intelligence equal or superior to their own. AI employment doctrine should be integrated into the proven models of supervisory and subordinate watchstations or under-instruction and over-instruction watchstanders, rather than being treated as something entirely new and mostly outside human purview.
Design and Test to Objective Standards
Because of the surprising capability of even mediocre AI, it may be tempting to develop the technology to the point where it consistently surpasses the performance of an “average” human operator on a few metrics, then declare that good enough for the field. After all, better than average must be an improvement, right? There is little doubt, for example, that Uber’s self-driving cars can outperform an attentive person in many other scenarios, or that human drivers already strike and kill a distressing number of pedestrians without any computer assistance. Yet vesting watchstander-like responsibilities and authorities in an automated system that cannot explain or defend itself, or experience accountability in any meaningful way, sidesteps the command principles at the heart of the Navy. Will the software developers appear at a court-martial when their system makes a grave error? Absent any process to enforce accountability for mistakes or reckless decisions, it is critical that systems be made as mistake- and recklessness-free as possible.
That’s an intimidating demand, but one that can be met by identifying a few uncompromising principles that must always be enforced, similar to the Hippocratic oath to “first, do no harm.” For example, a self-driving ship might be specified to keep its speed at less than x when a defined volume in front of the vehicle is not known to be clear of obstacles for at least the next y seconds. An automated weapon system might be inhibited from arming until at least z positive threat indications are received, one of which might be (if practicable) definitive classification by a human in the loop. These requirements should be enforced by human-auditable rules and a rigorous test program, just as with any other military specification.
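To show how modest such rules can be, here is a sketch in code; the thresholds stand in for the x, y, and z above and are placeholders, not doctrine.

```python
# Hypothetical hard limits corresponding to the x, y, and z in the text.
MAX_SPEED_KNOTS = 5.0          # "x": speed cap when the path ahead is uncertain
CLEAR_HORIZON_SECONDS = 30.0   # "y": how long the volume ahead must be known clear
MIN_THREAT_INDICATIONS = 3     # "z": independent positive indications before arming

def allowed_speed(seconds_known_clear: float, requested_knots: float) -> float:
    """Cap speed whenever the volume ahead is not known to be clear long enough."""
    if seconds_known_clear < CLEAR_HORIZON_SECONDS:
        return min(requested_knots, MAX_SPEED_KNOTS)
    return requested_knots

def arming_permitted(positive_indications: int, human_classified: bool) -> bool:
    """Inhibit arming until enough indications are held, one of them a human's call."""
    return positive_indications >= MIN_THREAT_INDICATIONS and human_classified
```

Because the rules are a few lines of ordinary code rather than learned weights, they can be read, argued over, and audited by people who are not machine-learning specialists.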
But the nature of machine learning means that acceptance testing cannot be a static, one-and-done process, as it is for other military specifications. Continual testing in the style pioneered by high-uptime cloud services is a better model. Rival AIs, particularly adversarial-input generators, make excellent test generators, as do expert human red teams. And developers must view this testing as a constructive part of their process rather than a hassle—the best teams already have this mind-set regarding cybersecurity, but they are still a distressing minority. Contract incentives may have to be restructured to reward active participation in evolving test processes rather than passing scripted milestones. On the other hand, failed tests should not be used to weaken system specifications. The integrity of the original guiding principles must be maintained, even when that inevitably means additional cost or delay—the cost of fielding (or leaving in the field) an incompletely tested system ultimately will be far higher.
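Such a regime does not require exotic tooling. The sketch below, with a trivial stand-in for the real autonomy stack, shows the shape of a test that can run on every build: thousands of randomized scenarios, to which adversarially generated inputs and red-team scenarios can be added, all checked against the original guiding principle.

```python
# Hypothetical continual test: treat the autonomy stack as a black box,
# generate randomized scenarios on every run, and assert that the guiding
# principle holds for all of them.
import random

MAX_SPEED_KNOTS = 5.0          # placeholder speed cap ("x")
CLEAR_HORIZON_SECONDS = 30.0   # placeholder clear-path horizon ("y")

def commanded_speed(seconds_known_clear: float, requested_knots: float) -> float:
    """Trivial stand-in for the real autonomy stack under test."""
    if seconds_known_clear < CLEAR_HORIZON_SECONDS:
        return min(requested_knots, MAX_SPEED_KNOTS)
    return requested_knots

def test_speed_principle(trials: int = 100_000) -> None:
    for _ in range(trials):
        clear = random.uniform(0.0, 120.0)      # seconds the path is known clear
        requested = random.uniform(0.0, 40.0)   # knots requested by the planner
        speed = commanded_speed(clear, requested)
        # The original principle is the oracle: a failure here means fixing
        # the system, never relaxing the specification.
        assert clear >= CLEAR_HORIZON_SECONDS or speed <= MAX_SPEED_KNOTS

test_speed_principle()
print("Speed principle held across all randomized trials.")
```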
AI presents perhaps the most promising and dynamic opportunity of the early 21st century. The same shared values and processes that made the U.S. Navy the world leader in nuclear energy, signals intelligence, satellite systems, interservice operations, and distributed command and control also can enable the service to push AI technologies further, and with more consistent success, than have even the most innovative tech companies. But without a firm commitment to the right principles, the Navy could concede lethality and squander precious resources on systems not suitable for high-end conflict. More than anything else, warfighters need to step up and ensure that implementing the new national AI strategy does not overreach or undermine established principles of command.
1. U.S. Department of Defense, “Summary of the 2018 Department of Defense Artificial Intelligence Strategy: Harnessing AI to Advance Our Security and Prosperity,” Washington, DC.
2. Quoted by Bertrand Meyer in “John McCarthy,” blog@CACM, Communications of the Association for Computing Machinery, 28 October 2011.
3. Fink Densford, “Report: IBM Watson Delivered ‘Unsafe and Inaccurate’ Cancer Recommendations,” Mass Device, 25 July 2018.
4. Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy, “Explaining and Harnessing Adversarial Examples,” published as a conference paper at the International Conference on Learning Representations (ICLR), 2015.
5. National Transportation Safety Board, “Preliminary Report: Highway HWY18MH010,” 2018.
6. National Transportation Safety Board, “Preliminary Report: Highway HWY18MH010.”