Certification Galore

Adaptive Testing

Certification and Skills Assessment Group

Microsoft, Inc.

Adaptive Testing: A Comparison with Fixed-Length Exams


General Description

A traditional fixed-length computerized (or paper-and-pencil) exam presents the same number of questions to every test taker, regardless of how well the person is doing on the exam. The score from this type of test usually depends on the number of questions answered correctly. The more a person knows, the more questions he or she should be able to answer correctly. Traditional exams have a long and successful history dating back to the second decade of the 20th century; however, for any one person the traditional test presents more questions than are necessary: some questions are far too easy and others far too hard. Answering easy questions correctly doesn’t tell us much about the person, because most people answer the easy ones correctly. Likewise, and for a similar reason, answering the difficult questions incorrectly tells us very little. It would be better if a test could discover the level, on a scale from easy to difficult, at which the person begins to encounter personally challenging questions; a score could then be derived from that level. A computerized adaptive test (CAT) does just that.

A CAT is a test that tailors itself to the ability of the test taker by taking into account how the person answered previous questions. Taking the same CAT, a low-ability examinee and a high-ability examinee will see quite different sets of questions: the low-ability examinee will mainly see relatively easy questions, and the high-ability examinee will see more difficult ones. Both individuals may answer the same percentage of questions correctly, but because the high-ability person can answer more difficult questions correctly, he or she will receive a higher score.

The experience of taking a CAT can be loosely compared to participating in a high-jump event in track-and-field competition. The high-jumper, regardless of ability, quickly reaches a challenging level where there is about an equal chance of clearing the bar or knocking it down. The "score" for the high-jumper is the last height he or she was able to jump over. The high-jumper earns the score without having to jump over every possible lower height, nor is he or she required to try all the higher levels. Similarly, for the person taking a CAT, where test questions are ranked from easy to hard, the score would be based on the point where the person encounters questions which are too difficult.

Using another example, one from education, imagine that you are a teacher giving an oral exam to one of your students. You would probably begin by asking a question of moderate difficulty. If the student answered correctly, you would likely ask a more difficult question; if he or she answered incorrectly, you would probably choose an easier one. You would continue asking questions, selecting subsequent questions based on the student’s responses to earlier questions. Within a short time you would have a good idea of the student’s competence. Throughout the questioning you would be able to avoid asking the many easy and hard questions that would not have helped to determine the person’s competence. Finally, your judgment about the person’s competence would not be based on the absolute number of correct responses, but, instead, on the level of difficulty of the questions he or she was able to answer correctly.

A CAT works like a good oral exam. It first presents a question of moderate difficulty. After the answer is given, the question is scored immediately. If correct, the test statistically estimates the person’s ability as higher than previously estimated. It then finds and presents a question that matches that higher ability. (If the first question is answered incorrectly, the opposite sequence occurs.) The test then presents the second question and waits for the answer. After the answer is given, it scores the second question. If correct, it re-estimates the person’s ability as higher still; if incorrect, it re-estimates the ability as lower. It then searches for a third question to match the new ability estimate. This process continues, with the test gradually locating the person’s competence level. The score that serves as an estimate of competence becomes more accurate with each question given. The test ends when the accuracy of that estimate reaches a statistically acceptable level (or when a maximum number of items has been presented). Figure 1 shows the estimation of a person’s competence after each of 10 test questions. Notice how the ability estimate is lowered after questions are answered incorrectly (Questions 3, 6, 8, and 10). The dotted vertical lines indicate the amount of error associated with the ability estimates (and, correspondingly, the degree of confidence in the score). As more questions are presented and answered, this error amount decreases.


Figure 1. A typical pattern for a CAT.

The CAT usually ends when the amount of measurement error around the ability estimate reaches an acceptable level. Low levels of measurement error are required for high-stakes certification tests and indicate that the test would likely produce a similar score if re-administered right away. Because it is unclear exactly when the test will end, a CAT usually presents a variable number of questions, and minimum and maximum numbers of questions are typically set.
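The select–score–re-estimate loop described above, together with this stopping rule, can be sketched as a small simulation. This is a minimal illustration, not Microsoft's implementation: it assumes a one-parameter (Rasch) logistic response model, a flat prior over ability, and an invented item bank, with the test ending when the standard error falls below a target (bounded by minimum and maximum question counts).

```python
import math
import random

def run_cat(true_theta, bank, min_items=5, max_items=30, se_target=0.4, rng=None):
    """Simulate a Rasch-model CAT: pick the unused item whose difficulty is
    closest to the current ability estimate, score it, re-estimate ability,
    and stop when the standard error is small enough."""
    rng = rng or random.Random(0)
    grid = [g / 10.0 for g in range(-40, 41)]   # ability grid from -4 to +4
    post = [1.0] * len(grid)                    # flat prior over ability
    available = list(bank)                      # item difficulties not yet used
    theta, se = 0.0, float("inf")               # start at moderate difficulty
    administered = 0
    while available and administered < max_items:
        # Select the item whose difficulty best matches the current estimate.
        b = min(available, key=lambda d: abs(d - theta))
        available.remove(b)
        # Simulate the examinee's response under the Rasch model.
        p_true = 1.0 / (1.0 + math.exp(-(true_theta - b)))
        correct = rng.random() < p_true
        # Bayesian update: multiply the posterior by the item likelihood.
        for i, g in enumerate(grid):
            p = 1.0 / (1.0 + math.exp(-(g - b)))
            post[i] *= p if correct else (1.0 - p)
        total = sum(post)
        post = [w / total for w in post]
        # Re-estimate ability (posterior mean) and its standard error.
        theta = sum(g * w for g, w in zip(grid, post))
        se = math.sqrt(sum(w * (g - theta) ** 2 for g, w in zip(grid, post)))
        administered += 1
        if administered >= min_items and se <= se_target:
            break
    return theta, se, administered
```

With a fixed random seed the simulation is deterministic, so it can be used to explore how quickly the error around the estimate shrinks for examinees of different true abilities.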

Test Score. In a CAT it is possible that a person with less competence is able to answer the same number of questions correctly as a more able person. Comparing the questions answered correctly for both persons would reveal that the higher-ability person was able to answer more difficult questions correctly. Therefore he or she should receive a higher score. And that is exactly what happens. The score is not based on the number of questions answered correctly, but instead it is derived from the level of difficulty of the questions answered correctly.

How the score is computed is statistically quite complicated, and is based on the principles of Item Response Theory, proposed initially by Frederic M. Lord (see reference in Bibliography). The formula that calculates the final score converts the ability estimate onto Microsoft’s score scale, which ranges from 0 to 1000. A pass/fail score, which also falls on this scale, is determined as well. As with a traditional computerized test, you pass if your test score is higher than the pass/fail score.
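The actual transformation onto the 0–1000 scale is not published; a linear rescaling of the ability estimate is one plausible form, sketched below with illustrative (assumed) endpoints.

```python
def scale_score(theta, lo=-4.0, hi=4.0):
    """Linearly map an ability estimate, clamped to [lo, hi], onto a
    0-1000 reporting scale. The endpoints lo and hi are illustrative
    assumptions, not Microsoft's actual scaling constants."""
    theta = max(lo, min(hi, theta))
    return round((theta - lo) / (hi - lo) * 1000)
```

Under this mapping a mid-scale ability estimate of 0.0 reports as 500, and the pass/fail score would simply be another point on the same 0–1000 scale.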

Certification tests, whether adaptive or not, are not intended as diagnostic tests and do not provide much help in preparing to retake the test. A CAT simply does not present enough questions to determine a person’s strengths and weaknesses beyond very broad subject-matter areas. Being maximally efficient, a CAT presents the fewest items possible. The sole purpose of a Microsoft CAT certification exam is to make an efficient and accurate pass/fail certification decision.

CAT’s Main Benefit. The main advantage of a CAT over a traditional computerized test design is efficiency. The CAT can determine a person’s score with fewer questions, sometimes reducing the length of the test by 60% or more. As explained above, it avoids presenting questions that provide no help in determining the person’s score (i.e., questions that are too easy or too hard). This efficiency is very important for Microsoft certification candidates, and is the main reason why adaptive tests are overwhelmingly preferred by certification candidates and why Microsoft is adopting this new measurement technology.

Bibliography
* Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of Item Response Theory. Newbury Park: Sage Publications.
Hambleton, R. K., & Zaal, J. N. (Eds.). (1991). Advances in Educational and Psychological Testing. Boston: Kluwer Academic Publishers.
Kingsbury, G. G., & Zara, A. R. (1989). Procedures for selecting items for computerized adaptive tests. Applied Measurement in Education, 2, 359-375.
Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems. New Jersey: Lawrence Erlbaum Associates, Publishers.
McKinley, R. L., & Reckase, M. D. (1980). Computer applications to ability testing. Association for Educational Data Systems Journal, 13, 193-203.
* Reckase, M. D. (1989). Adaptive testing: The evolution of a good idea. Educational Measurement: Issues and Practice, 8, 11-15.
van der Linden, W. J., & Hambleton, R. K. (Eds.). (1996). Handbook of Modern Item Response Theory. New York: Springer-Verlag.
* Wainer, H. (1990). Computerized Adaptive Testing: A Primer. New Jersey: Lawrence Erlbaum Associates, Publishers.
* Weiss, D. J. (1983). New Horizons in Testing: Latent Trait Test Theory and Computerized Adaptive Testing. New York: Academic Press.