Reliability and Validity Tests in Psychology

Validity and Reliability Matrix

Test of Reliability | Description, Goal, Application, and Appropriateness | Strengths | Weaknesses
Test: Inter-item consistency
Description, goal, application, and appropriateness: Inter-item consistency is the extent to which items that purport to measure the same construct produce consistent results. The method is appropriate for determining how lengthening or shortening a test affects its overall reliability; the Spearman-Brown formula is used to estimate this effect.
Strengths: The method allows a test developer to estimate how the reliability of a test will change when the number of test items is increased or decreased. According to Cohen (2010), the reliability of a test increases as the number of test items increases.
Weaknesses: The method can only be used with homogeneous test items, that is, items that measure a single construct and are comparable in difficulty. Moreover, Cohen (2010) shows that the Spearman-Brown formula works best with lengthier tests as opposed to halved or shortened tests.
Test: Split-half
Description, goal, application, and appropriateness: In split-half reliability, a test is divided into two halves that are scored separately, and the scores on one half are correlated with the scores on the other half to estimate the test’s reliability. The method is appropriate for longer tests because the more questions asked, the more information the test developer obtains (Cohen, 2010).
Strengths: The method is more convenient than parallel and alternate forms because it eliminates many of the problems associated with them. First, it removes the need to develop two forms of the same test. Second, it eliminates carryover and reactivity effects because the test is administered only once (Kaplan & Saccuzzo, 2009).
Weaknesses: The method tends to lower the reliability estimate because each half contains fewer items than the original test. Furthermore, a test can be split in many ways, and some splits yield a higher correlation than others (Cohen, 2010).
Test: Test-retest
Description, goal, application, and appropriateness: Test-retest reliability entails measuring the same construct in the same group of participants with the same test on two different occasions. The method is therefore appropriate for estimating a test’s stability, particularly when the construct under review has not changed over time.
Strengths: Test-retest reliability is a convenient way to track the performance of test-takers over time. It is also well suited to estimating temporal stability, which gives a clear picture of the consistency of test scores over time (Cohen, 2010).
Weaknesses: The method does not control for practice or carryover effects: what test-takers gain from the first administration can distort their scores on the second and make the test appear unstable. Moreover, as the interval between the two administrations increases, the test’s estimated reliability decreases (Kaplan & Saccuzzo, 2009).
Test: Parallel and alternate forms
Description, goal, application, and appropriateness: This method compares two equivalent versions of a test, which contain different items that measure the same characteristic, in the same group of participants. Reliability is estimated from the correlation and consistency of the results obtained on the two forms. The method is therefore appropriate when two versions of the same test are available.
Strengths: This is among the most rigorous methods of assessing reliability and has found applications across psychology. Moreover, it is one of the most straightforward methods of assessing reliability in different tests (Cohen, 2010).
Weaknesses: Despite these merits, the method is rarely used in practice because test developers find it difficult to construct two alternate forms of the same test. Moreover, practical constraints often make it difficult to test the same group of participants a second time (Kaplan & Saccuzzo, 2009).
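As an illustration of the reliability estimates described above, the following Python sketch computes an odd-even split-half correlation and then applies the Spearman-Brown formula to project the reliability of the full-length test. It is a minimal sketch, not the procedure given by Cohen (2010) or Kaplan and Saccuzzo (2009): the function names, the choice of an odd-even split, and the item scores are all assumptions made for this example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_brown(r, n):
    """Predicted reliability when test length is multiplied by n.

    r is the reliability of the current test; n = 2 doubles the
    length and n = 0.5 halves it.
    """
    return (n * r) / (1 + (n - 1) * r)


def split_half_reliability(item_scores):
    """Odd-even split-half estimate, stepped up with Spearman-Brown.

    item_scores holds one list of item scores per test-taker.
    Returns the half-test correlation and the full-length estimate.
    """
    odd_totals = [sum(row[0::2]) for row in item_scores]
    even_totals = [sum(row[1::2]) for row in item_scores]
    r_half = pearson_r(odd_totals, even_totals)
    return r_half, spearman_brown(r_half, 2)


# Hypothetical data: 5 test-takers, 6 items scored 0 or 1.
scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]
r_half, r_full = split_half_reliability(scores)
print(f"half-test r = {r_half:.2f}, full-length estimate = {r_full:.2f}")
```

For a test whose current reliability is 0.50, spearman_brown(0.50, 2) predicts a reliability of about 0.67 for a test twice as long, in line with the point that reliability rises as items are added. Splitting the items differently (for example, first half versus second half) can produce a different half-test correlation, which is the weakness noted for the split-half method.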
Test of Validity | Description, Application, and Appropriateness | Strengths | Weaknesses
Test: Face validity
Description, application, and appropriateness: Face validity does not measure the actual validity of a test. Rather, it is appropriate for establishing the subjects’ perceptions of the test’s validity. It is not appropriate if the test developer wants to consider the style, formal characteristics, and appropriateness of different test items (Cohen, 2010).
Strengths: The method is useful when the test developer aims to gauge the subjects’ confidence in the test’s validity. The higher the subjects’ confidence in the validity of the test, the greater the likelihood that the test will be administered or taken with little or no difficulty (Kaplan & Saccuzzo, 2009).
Weaknesses: The method does not measure the test’s real validity (Kaplan & Saccuzzo, 2009), because the test-taker does not consider any of the factors that go into determining the actual validity of a well-designed test.
Test: Content validity
Description, application, and appropriateness: Content validity examines whether the test specifications meet the requirements and purpose of the test being developed. The method is appropriate in educational settings, where the test being developed must reflect the content covered in the course syllabus (Kaplan & Saccuzzo, 2009).
Strengths: The method allows the test developer to draw on existing information that has been tested over time when developing a new test that seeks to measure the same characteristics as previous tests (Cohen, 2010).
Weaknesses: The main limitation of the method is that it is subject to cultural and chronological changes: responses to different test items may change from place to place or from time to time (Kaplan & Saccuzzo, 2009).
Test: Criterion-related validity
Description, application, and appropriateness: Criterion-related validity is categorized into concurrent and predictive validity. In concurrent validity, the results of a test are compared with a criterion measured at the same time. In predictive validity, the results of the new test are used to predict a criterion measured at a later time (Cohen, 2010). The method is appropriate for tests that measure well-known characteristics.
Strengths: Criterion-related validity measures the test’s real validity because it uses existing information to validate the subject matter and the criterion of the new test. More specifically, the scores on the test under review are compared with the scores on an existing test to determine the validity of the former (Kaplan & Saccuzzo, 2009).
Weaknesses: The method is not effective when the predictor measure is similar to the criterion measure. For instance, when the same test item serves both as the criterion and as the measure of the test’s validity, the test developer is unlikely to be able to validate the new test (Kaplan & Saccuzzo, 2009).
Test: Construct validity
Description, application, and appropriateness: Construct validity forms the core of validation in psychometrics because it concerns characteristics that cannot be measured directly. The usefulness of a test is determined by the extent to which the test reflects a phenomenon predicted by theory. The method is therefore appropriate for validating tests of characteristics that cannot be measured directly (Cohen, 2010).
Strengths: The method applies the principles of the scientific method to evaluating whether a test measures the characteristic it purports to measure. The test developer begins by formulating a hypothesis, which is then tested against evidence to determine whether it holds (Kaplan & Saccuzzo, 2009).
Weaknesses: Construct validation cannot establish the validity of a test if there is no clear construct or if the construct is not clearly defined (Cohen, 2010).
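Criterion-related validity is commonly summarized as a correlation between test scores and criterion scores. The Python sketch below computes such a validity coefficient. The function name and the two score lists are hypothetical, and the Pearson correlation is assumed as the statistic; the sources cited here do not prescribe this code. For concurrent validity the criterion scores would be collected at the same time as the test scores, and for predictive validity at a later time.

```python
from math import sqrt


def validity_coefficient(test_scores, criterion_scores):
    """Pearson correlation between test scores and criterion scores."""
    n = len(test_scores)
    mean_t = sum(test_scores) / n
    mean_c = sum(criterion_scores) / n
    cov = sum(
        (t - mean_t) * (c - mean_c)
        for t, c in zip(test_scores, criterion_scores)
    )
    var_t = sum((t - mean_t) ** 2 for t in test_scores)
    var_c = sum((c - mean_c) ** 2 for c in criterion_scores)
    return cov / sqrt(var_t * var_c)


# Hypothetical data: scores of 5 participants on a new test and on
# an established criterion measure.
new_test = [10, 12, 15, 18, 20]
criterion = [52, 55, 61, 70, 72]

r_xy = validity_coefficient(new_test, criterion)
print(f"validity coefficient = {r_xy:.2f}")
```

A coefficient close to 1 indicates that the new test ranks participants much as the criterion does. If the criterion is built from the same items as the test, the coefficient is inflated, which is the weakness noted for this method.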

References
Cohen, R. J. (2010). Exercises in psychological testing and assessment (7th ed.). New York, NY: McGraw-Hill Publishers.

Kaplan, R. M., & Saccuzzo, D. P. (2009). Psychological testing: Principles, applications, and issues (7th ed.). Belmont, CA: Wadsworth.