III. Reliability and Validity of Classification Instruments
Before any classification instrument can be implemented, it is necessary to confirm its reliability and validity. Interview-based classification tools are deemed reliable if different raters assessing the same offender arrive at the same result. This is referred to as interrater reliability. Instruments based on self-reports are deemed reliable if the same offender gives similar answers when assessed more than once, barring the passage of enough time to cause legitimate changes in responses. This is called test-retest reliability.
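Interrater reliability is commonly quantified with a chance-corrected agreement statistic such as Cohen's kappa, which discounts the agreement two raters would reach by guessing alone. The sketch below illustrates the computation; the risk ratings are invented for illustration and do not come from any actual instrument.

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters who each assign
    one category per offender: (observed - expected) / (1 - expected)."""
    n = len(rater1)
    # Proportion of offenders on whom the raters agree outright
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Agreement expected by chance, from each rater's marginal frequencies
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[cat] * c2[cat] for cat in c1) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical risk classifications of six offenders by two raters
rater1 = ["high", "low", "low", "medium", "high", "low"]
rater2 = ["high", "low", "medium", "medium", "high", "low"]
print(cohens_kappa(rater1, rater2))  # 0.75
```

Here the raters agree on five of six offenders (raw agreement .83), but after correcting for chance agreement the kappa is .75, which is the figure an agency would report.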
An instrument is judged to be valid if it really measures what it purports to measure. There are numerous ways of assessing validity, depending upon the classification instrument in question. Personality inventories and typologies are typically examined for construct validity, which uses quantitative means to measure the viability of their various scales or factors.
With respect to risk assessments, the question of predictive accuracy is a key concern. Different statistics are available for summarizing any one tool’s accuracy; currently favored is the area under the curve (AUC) statistic, derived from the receiver operating characteristic (ROC) curve. It is popular because its utility does not depend upon the base rate of the outcome in question (i.e., the proportion of offenders who recidivate). That is, it is equally meaningful whether the behavior being predicted is rare or fairly common (Harris & Rice, 2007). The AUC communicates how well the assessment tool improves over a chance prediction (such as one determined by the toss of a coin). Higher AUC values indicate greater predictive accuracy and improvement over chance. For example, a value of .50 would inform the user that the instrument predicts no better than chance, whereas a value of .80 indicates substantial improvement over chance.
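The logic of the AUC can be made concrete with a short sketch using invented scores. The AUC equals the probability that a randomly chosen recidivist receives a higher risk score than a randomly chosen non-recidivist, which is why .50 corresponds to a coin toss and why the statistic is unaffected by the base rate.

```python
from itertools import product

def auc(recidivist_scores, nonrecidivist_scores):
    """Probability that a randomly drawn recidivist outscores a
    randomly drawn non-recidivist (ties count as half a win)."""
    pairs = list(product(recidivist_scores, nonrecidivist_scores))
    wins = sum(1.0 if r > n else 0.5 if r == n else 0.0 for r, n in pairs)
    return wins / len(pairs)

# Identical score distributions in both groups: chance-level prediction
print(auc([1, 2, 3], [1, 2, 3]))        # 0.5

# A tool that separates the groups well
print(auc([3, 4, 5], [1, 2, 3]))        # about 0.94

# Base-rate independence: a tenfold increase in non-recidivists with the
# same score distribution leaves the AUC unchanged
print(auc([3, 4, 5], [1, 2, 3] * 10))   # still about 0.94
```

The last two calls return the same value, illustrating why the AUC is equally meaningful for rare and common outcomes.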
Researchers can use the AUC not just to assess the performance of any one instrument but also to compare different instruments. The best comparisons are those that report results from administering various tools to the same population of offenders, followed up for the same period of time. Such comparisons are not typical, however. One exception is a study by Barbaree, Seto, Langton, and Peacock (2001), who evaluated the accuracy of six assessment instruments designed for the prediction of general and/or sexual violence, using a sample of 215 sex offenders released from prison and followed up on community supervision for an average of 4.5 years. Outcomes of interest included any new recidivism (measured as either charges or convictions), any serious recidivism, and new sexual offense recidivism. Barbaree and colleagues found that some instruments predicted all three outcomes well; no single instrument was superior at predicting all three; and, of much importance, instruments that were easy to use and score (such as the RRASOR and Static-99) provided good predictions of new sexual violence, the least frequent of the outcomes studied. This kind of information is extremely valuable to agencies seeking to invest in a specific classification instrument when there are several to choose from.
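A head-to-head comparison of the kind Barbaree and colleagues conducted amounts to computing the AUC for each instrument against the same recidivism outcomes. The sketch below shows the idea with two hypothetical instruments and invented scores and outcomes; it is not a reproduction of any published data.

```python
def auc(scores, outcomes):
    """AUC computed pairwise: probability that an offender who
    recidivated (outcome 1) outscores one who did not (outcome 0)."""
    pos = [s for s, o in zip(scores, outcomes) if o == 1]
    neg = [s for s, o in zip(scores, outcomes) if o == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# One follow-up record per offender (1 = reoffended during follow-up)
outcomes     = [1, 0, 1, 0, 0, 1, 0, 0]
# Risk scores from two hypothetical instruments, same offenders,
# same follow-up period -- the condition for a fair comparison
instrument_a = [6, 2, 5, 3, 1, 2, 2, 1]
instrument_b = [4, 3, 5, 4, 2, 2, 1, 2]

for name, scores in [("A", instrument_a), ("B", instrument_b)]:
    print(name, round(auc(scores, outcomes), 2))
```

Because both instruments are scored on the same offenders over the same follow-up, the difference in AUCs reflects the instruments themselves rather than differences in samples or base rates.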
Reliability and validity are necessary but not sufficient conditions for the selection of assessment tools. Agencies usually must also consider how much time they can devote to each offender and the demands particular instruments place on interviewers’ skills. Thus, while “multitasking” instruments such as the VRAG perform well in predicting both new violence and new sexual violence, the high caseloads faced by community corrections agencies, together with the baccalaureate-level education of the typical probation or parole officer, make instruments such as the RRASOR and the Static-99, which have relatively few items and require little clinical skill to score, very attractive tools for assessing sex offender risk.