If the U.S. Navy is going to be led by the best tacticians, officer FitReps must include objective evaluations of warfighting prowess.
For the first time since 1996, senior naval leadership is committed to revamping the Navy's abstruse performance assessment system, the fitness report (FitRep). Chief of Naval Personnel Vice Admiral Robert Burke outlined a few of the much-needed improvements to the outdated program: increased granularity, better performance characteristics, and enhanced mentorship.1 Still, the new system does not appear to address fundamental inequities. The Navy should look for best practices in a seemingly unusual place: baseball sabremetrics.
Sabremetrics, a term coined by baseball statistician Bill James after the Society for American Baseball Research (SABR), has blossomed into widespread use, with baseball commentators referring to a player's WAR (wins above replacement) or OPS (on-base plus slugging percentage) instead of more traditional statistics such as batting average and home runs. The widely read book Moneyball chronicled how the small-market Oakland Athletics, by leveraging sabremetrics throughout the first decade of the 2000s, built a team of cheaper players that proved competitive against larger-market clubs such as the New York Yankees. The key advantage the Athletics' leadership exploited was a reexamination of which traditional baseball performance parameters actually matter, coupled with rigorous statistical analysis of comparative performance.
SABREMETRICS AND RELATIVE PERFORMANCE
Take as an example basic hitting statistics. The traditional metric for hitting performance is batting average, calculated by dividing the number of hits by the number of at-bats. The deficiency in the traditional statistic is that there are more ways to get on base, and thus potentially score a run, than to get a hit: players also can walk or be hit by a pitch. On-base percentage (OBP), calculated by adding the total number of hits, walks, and times hit by pitch (HBP) and dividing by the number of plate appearances, provides a clearer indication of what batting average purports to show: how likely a player is to reach base and help his team score. Baseball announcers, scouts, and coaches, however, were concerned more often with a player's batting average. Major League Baseball awards an annual batting title to the player with the highest batting average; no such award exists for the player with the highest on-base percentage. A sabremetrician could identify a player with a below-average batting average but an above-average OBP and sign him for a bargain price.
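The two formulas are simple enough to sketch in code. The player totals below are hypothetical, and the OBP shown is the simplified version described above (the official formula also adjusts the denominator for sacrifice flies):

```python
def batting_average(hits, at_bats):
    """Traditional metric: hits per official at-bat."""
    return hits / at_bats

def on_base_percentage(hits, walks, hbp, plate_appearances):
    """Simplified OBP: every way of reaching base, per plate appearance."""
    return (hits + walks + hbp) / plate_appearances

# Hypothetical player: modest batting average, but a strong eye for walks.
avg = batting_average(hits=140, at_bats=550)
obp = on_base_percentage(hits=140, walks=85, hbp=5, plate_appearances=640)
print(f"AVG {avg:.3f}  OBP {obp:.3f}")  # AVG 0.255  OBP 0.359
```

A .255 hitter who reaches base 36 percent of the time is exactly the kind of undervalued player the Athletics targeted.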
In the film Moneyball, Brad Pitt and Jonah Hill portray Oakland Athletics executives who use sabremetrics to assess players' skills and revamp the team.
In addition to promoting a deeper statistical analysis of relative baseball performance, sabremetrics rejects the "old boys' club" paradigm that permeated baseball scouting. Professional baseball scouts, who often had minor league careers themselves, tended to judge recruits by what they themselves had been when they started out: tall, athletic, muscular, and fast. With clubs prioritizing the same factors, there were few differences among clubs. Athletics general manager Billy Beane, himself a failed "golden boy," broke with the paradigm by separating appearances from actual performance metrics as indicators of future performance. Fundamentally, it is of little consequence that a player can run a 4.4-second 40-yard dash if he cannot hit, or that a pitcher can throw a 100-mph fastball if he cannot throw a strike. Sabremetricians understood this and looked beneath the surface, creating the possibility of more value for their teams.
What does all this baseball talk have to do with the Navy’s effort to revamp its performance management system? Akin to sabremetrics, the revised FitRep system must ensure that it prioritizes the proper performance measures and that it applies appropriate statistical rigor to those measures to provide effective comparative performance assessments, promote and screen the best officers, and provide transparency to what has been an opaque system.
ASSESS THE RIGHT THINGS
Individual communities must tailor their performance assessment standards to ensure that the proper qualifications are used as performance metrics. The tactical skills a naval aviator needs to win a dogfight in his F/A-18 are substantively different from the skills a tactical action officer uses on an Aegis surface combatant directing a dozen-man combat information center watchteam. Allowing warfighting communities to tailor fitness reports will improve the utility of the report for actual performance assessment and feedback. Care must be taken, however, in choosing the traits to be assessed and the means of assessing performance.
The Navy puts great emphasis on breadth of leadership and management performance but often fails to consider depth of tactical performance. For instance, the surface community places great priority on junior officers earning advanced qualifications ahead of notional career timing. Shipboard tours are limited, however, and the qualification process is often rushed and lacking in experiential depth. The surface community should standardize rigor in its qualifications and make achievement and consistent performance, not timeliness, the standard of excellence.
Part of the problem is that different commanding officers (COs) have different qualification standards and frequently will rush the qualification process for their hot runners to make them competitive for desirable billets, such as flag aide, the Fleet Scholars Education Program, Naval Postgraduate School, and more. Ironically, the most talented officers often have the least amount of actual watchstanding experience. The community would be better served by emphasizing watch-hours stood as an officer of the deck during a first tour, or as a warfare coordinator during a second tour, instead of the number of different qualifications earned.
The reverse is true when it comes to leadership and management, where leadership skills and management acumen are applicable across various projects and circumstances on board a ship. The performance system should reward consistent excellence across a broad range of challenges, and officers should be rewarded for their leadership experience afloat and ashore. An officer who served all of his tours in an engineering department should be competent in shipboard engineering operations and program management, but when that officer assumes command, he may not have the experience in weapons employment or operations planning to be effective.
LETHALITY ABOVE REPLACEMENT
Relative performance and peer comparisons still fundamentally determine an officer's likelihood to screen for career milestones and advanced rank, but each officer's particular circumstances are different. Some reporting seniors grade harder or easier than others, and subjectivity and bias plague fitness reports. This can be mitigated with the kind of relative performance metrics sabremetrics employs.
The most useful sabremetric statistic is wins above replacement, which estimates how much better or worse off a team would be if the player assessed were replaced with a notional league-average player. Pertinent baseball performance metrics are compared to league averages, with the results then scaled appropriately to provide a holistic measurement of a player's performance relative to the notional "average" replacement player. While not perfect, this single aggregate statistic helps clarify the relative difference between a .300 hitter with 20 home runs and a .250 hitter with 30 home runs. A positive WAR indicates that the player assessed is on the whole better than an average replacement player; a negative WAR indicates the contrary.
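Real WAR aggregates batting, baserunning, fielding, and positional adjustments, but the core mechanic of "compare to league average, scale, and sum" can be sketched with a toy z-score aggregate. The league, statistics, and weights below are invented purely for illustration:

```python
from statistics import mean, pstdev

def value_above_average(player, league, weights):
    """Toy aggregate: z-score each stat against the league, weight it,
    and sum. Real WAR is far more involved; this only illustrates the
    'measure against a notional average player' idea."""
    score = 0.0
    for stat, weight in weights.items():
        league_vals = [p[stat] for p in league]
        mu, sigma = mean(league_vals), pstdev(league_vals)
        score += weight * (player[stat] - mu) / sigma
    return score

# Hypothetical three-player league with two weighted stats.
league = [
    {"obp": 0.360, "slg": 0.500},
    {"obp": 0.320, "slg": 0.420},
    {"obp": 0.300, "slg": 0.380},
]
weights = {"obp": 1.8, "slg": 1.0}  # illustrative weights only

print(value_above_average(league[0], league, weights))  # positive: above average
print(value_above_average(league[2], league, weights))  # negative: below average
```

An above-average player scores positive and a below-average player scores negative, mirroring the sign convention of WAR.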
For surface warfare officers (SWOs), long-term career viability depends on performance as an afloat department head. On board a surface combatant, there are typically only three or four unrestricted line peers in a wardroom, generally resulting in either a significantly above-average wardroom or a significantly below-average one. An outstanding, but not the top, officer in a great wardroom will look competitively inferior on paper to the top officer in a poor wardroom. Despite systemic intentions otherwise, the fact that an officer received an "early promote" and the accompanying "soft breakout" in the write-up of a FitRep will outweigh whatever flowery language the CO of the outstanding wardroom uses to "take care of" his officers. This is a disservice to the officer in the outstanding wardroom, a disservice to the Navy, which may not screen him for further milestones, and a disservice to the officer in the poor wardroom, who may be screened for positions for which he is unsuited. The practical consequence of all this is the FitRep and rotation-timing chicanery used to compensate for these realities.
Consider what such practice would look like in baseball. In 2016, outfielder Khris Davis of the Oakland Athletics had the lowest team-best WAR, 2.5, which ranked 156th in the league; comparing each team's best player, Davis was the worst of the best. The best overall player in 2016 was Mike Trout of the Los Angeles Angels, with a WAR of 9.4. On the Angels alone, four players had a higher WAR than Davis. Thus, the fourth-best player on the Angels, although better than Davis, would be viewed as inferior on a Navy FitRep. As good a player as he was, Davis would benefit from being on a poor team, while even finer Angels players would suffer from being behind some of the league's superstars. Were a team of the "best" players to be chosen, taking the top player from each team would be an inefficient way to build that roster.
Were the current FitRep system executed as written, without the present chicanery, the Navy would be doing exactly this: taking the "best" officer from each ship to promote and screen for advanced milestones. The Navy should recognize this fact and use a similar aggregate relative performance metric, "lethality above replacement" (LAR), to serve as a single performance measurement that would transparently compare disparate experiences and allow for a more equitable performance comparison among officers at promotion and screening boards. In essence, LAR would indicate how much better or worse off a ship would be if an officer were replaced with a fleet-average replacement.
EXECUTING FOR LETHALITY
Excluding the necessarily subjective write-up portions of the fitness report, the officer summary record and the officer performance summary record are the naval equivalents of a baseball player’s stat sheet. From these, members of promotion and screening boards primarily are concerned with the following four measures: officer’s performance trait average, the summary group performance trait average, the number of officers rated by the reporting senior, and the reporting senior’s cumulative average (RSCA). Just as with WAR, LAR would compare these values with fleet averages, scale appropriately, and present an aggregate statistic that could serve as a transparent and equitable measure of relative performance. But the measures have significant issues.
Reporting seniors are encumbered by the need to manage their RSCA and thus are incentivized not to grade their subordinates' performance trait averages objectively; individual trait marks are "reverse engineered" to get the proper result. If each community is able to tailor its performance traits, however, each trait will have a corresponding fleet-average value against which an officer can be compared, and these values can feed the final formulation in addition to the overall performance trait average. The summary group average, meanwhile, aggregates reverse-engineered, arbitrary scores that are inequitable from command to command, and problems arise when different reporting seniors apply different standards. Finally, RSCA management arbitrarily bounds an officer's marks based on the factors listed above.
Under the present system, the officer who happens to have a reporting senior who failed to manage his RSCA appropriately could be disadvantaged. For a reporting senior in his first command tour, the first group of department heads observed likely will receive grades that are not truly representative of their performance but instead serve primarily to allow the top performer to receive top marks while balancing the reporting senior's RSCA. In the current paradigm, an RSCA delta of 1.0 is about the best possible score, as officers rarely receive sub-3.0 reports. For every 5.0 officer, however, there must be a 3.0 officer to bring the average down, and in many cases the officer receiving the low marks otherwise would have been marked higher than 3.0. Reporting seniors understand this and thus initially must grade their subordinates so as to manage RSCA until there are enough graded officers in the rating pool that extreme grades no longer cause undesirable variability.
LAR corrects this problem by normalizing RSCA deltas to the notional fleet-average reporting senior. For argument's sake, let this be 4.0 in the current system; whatever the new grading system average turns out to be, the principle holds. Suppose an outstanding department head on a destroyer with a first-time reporting senior receives top marks of 5.0. The reporting senior also has two other competitive officers, grades them accordingly, and uses relatively poorer-performing officers to set his RSCA, resulting in an initial RSCA of 4.20. The delta is thus limited to 0.8. Because the officer is constrained by the reporting senior and the need not to hamper other fine officers, his or her performance on an absolute scale will look "worse" than that of a department head on a cruiser whose more senior CO has a well-managed RSCA and can write a FitRep with an RSCA delta of 1.0. Recall the Khris Davis issue above. We can do better.
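One way to sketch the normalization, purely for illustration, is to standardize each officer's delta against his own reporting senior's rating-pool spread, so that marks from a tight grader and a generous grader land on a common scale. The rating pools and marks below are hypothetical, and any real LAR formula would be defined and promulgated by the Navy, not this sketch:

```python
from statistics import mean, pstdev

def lar_component(officer_pta, rating_pool):
    """Illustrative normalization: express an officer's performance
    trait average as a z-score within his reporting senior's own
    rating pool, so a 0.8 delta under a tight grader and a 1.0 delta
    under a generous grader become directly comparable."""
    rsca = mean(rating_pool)        # the reporting senior's average
    spread = pstdev(rating_pool)    # how widely that senior grades
    return (officer_pta - rsca) / spread

# Hypothetical destroyer pool: first-tour reporting senior, RSCA 4.20.
destroyer_pool = [5.0, 4.6, 4.4, 3.6, 3.4]
# Hypothetical cruiser pool: seasoned reporting senior, RSCA 4.00.
cruiser_pool = [5.0, 4.4, 4.0, 3.6, 3.0]

print(lar_component(5.0, destroyer_pool))  # top officer, tight grader
print(lar_component(5.0, cruiser_pool))    # top officer, generous grader
```

Both top officers score well above zero, and the gap between them narrows once each pool's spread is accounted for, rather than hinging on raw deltas of 0.8 versus 1.0.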
CRITICISM AND RESPONSE
It could be said that LAR would further complicate the Navy’s performance management system. A single number could not reasonably quantify the subjective performance over the course of a year. The formula and accompanying weights of performance traits would simply be another game for reporting seniors and boards to master. With baseball and WAR, everyone is at least playing the same game, and performance is quantified easily so such an aggregate relative comparison is useful. Experiences differ so widely afloat that it would be disingenuous to say that an aggregate measure as suggested by LAR would be any fairer than the current system.
With transparency and candor, the Navy can address these complexity and equity concerns. Performance trait weighting should be reassessed periodically, and any adjustments should be promulgated clearly to the respective communities before their use at promotion and screening boards.
While it is true that athletic performance can be quantified more easily than naval leadership and tactical performance, such subjective quantification already is occurring within the current system. It is good that COs' subjective assessments form the foundation of the performance management system; what LAR could correct are the inherent differences in individual reporting seniors' grading practices and experience levels. Meanwhile, without diminishing the prominence of the CO's assessment, standards can be made objective and uniform.
A clear precedent for such standardization exists in the SWO Command Qualification Examination (CQE) process. Between an officer's first and second department head tours, all aspiring commanding officers must travel to Newport, Rhode Island, for a series of examinations, ship-handling assessments, and tactical evaluations judged by Surface Warfare Officers School staff members. This serves as a standardizing quality-assurance check and ensures equitable rigor in qualifying SWOs for eligibility to serve as afloat commanding officers. The practical assessment has been implemented only in the past few years, and given the notional SWO career timeline, the fleet is just now beginning to benefit from this rigorous additional assessment as CQE-screened executive and commanding officers report to their commands.2 The process by no means diminishes the importance of the reporting senior's subjective assessment; without a FitRep-documented recommendation for command at sea, an officer will not screen for command. With a few practicable changes, such evaluations would demonstrably improve the Navy's ability to screen and promote the most deserving officers at all levels.
This type of standardized assessment process should be expanded to include all officers, with the results forming a significant portion of the annual FitRep. Similar to the arduous tactical games used exhaustively to assess command potential in the popular science-fiction novel "Ender's Game," surface and submarine officers should be assessed on shiphandling prowess and tactical acumen; aviators should be assessed on airmanship and air combat. Simulators exist in fleet concentration areas with expert staff to serve as assessors and could be augmented by experienced type commander staff officers. The scenarios should be tailored to an officer's rank and expected competency: an ensign could work through a straightforward mooring evolution, while a commander could be challenged with a demanding, low-visibility, high-density straits transit. An officer's performance on these evolutions would form the basis for FitRep marks related to tactics and professional ability. Such off-ship evaluations, conducted throughout an officer's career, would standardize tactical, seamanship, and airmanship assessments and provide annual feedback and performance data, both to the officers concerned and to selection boards charged with screening officers for command.
Finally, action at the command level would remain unchanged. All of the additional calculations would be done remotely after submission of the FitRep. There would be significantly less gamesmanship, as there would be little for the reporting senior to do to affect fleet-wide averages. In reality, reporting seniors would be free to grade their officers more accurately without having to worry about effects on personal RSCA. Such a system would promote more transparent performance assessment and mentorship.
CONCLUSION
The Navy should take the opportunity to address a fundamental inequity in previous iterations of the fitness reporting system—that of treating all reporting seniors and all wardrooms the same. By leveraging sabremetric insights and using effective performance measures, the Navy can revolutionize its performance management system to ensure the most capable and deserving officers rise to positions of increased responsibility to meet the demands of a complex future and defeat future adversaries in battle.
1. Mark D. Faram and Andrew Tilghman, “All New Evals and Fitreps Coming Soon,” Navy Times, 7 May 2017, www.navytimes.com/articles/all-new-evals.
2. SWOSCOLCOM Instruction 1412.1B, Surface Warfare Command Qualification Assessment and Examination, 19 January 2017, www.public.navy.mil/bupers-npc/officer/Detailing/surfacewarfare/Documents/1412.1B%20Surface%20Warfare%20Command%20Qualification%20Assessment%20and%20Examination.pdf.
Lieutenant Cordial is currently assigned to the staff of the Chief of Naval Operations, Surface Warfare Directorate. He has served on board the amphibious assault ship USS Iwo Jima (LHD-7).