Test Item Analysis - Internal Consistency


This is my last post regarding E-Learning Challenge 109 - Cheat Sheets.

This is a spreadsheet I designed to help me calculate the internal consistency of my test items.  I design e-learning and instructor-led training in my current position.  While a lot of our learning contains minor assessments or learning checks, we occasionally have to create a formal assessment to document an employee's proficiency.  I use the attached spreadsheet to analyze my test after it has been completed.

This is a macro-enabled Excel spreadsheet.  I wrote some Visual Basic code that can customize the spreadsheet to handle any number of learners or test questions (as long as you don't exceed Excel's row and column limits).

Before using this spreadsheet, you will need to pull a right/wrong answer report for your test.  (Articulate Online has a convenient report that shows "Correct/Incorrect" responses only.)  Replace the word "Correct" with "1" and the word "Incorrect" with "0" and you are good to go.  Paste this data as text only into the spreadsheet and it will calculate the following items for you:
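If you would rather script that recoding step than do the find-and-replace by hand, it can be sketched in a few lines of Python (the data below is made up for illustration; your actual rows come from the LMS report):

```python
# Hypothetical "Correct"/"Incorrect" report: rows = learners, columns = questions.
rows = [
    ["Correct", "Incorrect", "Correct"],
    ["Correct", "Correct", "Incorrect"],
]

# Recode each cell to 1 (right) or 0 (wrong), ready to paste into the spreadsheet.
scored = [[1 if cell == "Correct" else 0 for cell in row] for row in rows]
print(scored)  # [[1, 0, 1], [1, 1, 0]]
```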

P-Value (Item Difficulty) - Simply the percentage of your population that got the question correct; 5 correct out of 10 learners gives a P-value of 0.50.  For maximum discrimination potential (explained below), you want a difficulty level at or slightly above the midway point between chance and a perfect score.  For example, you would want around 0.75 for a true/false question (the midway point between 0.50 and 1.00).
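The calculation itself is just a column average over the 1/0 data.  A minimal Python sketch with toy data (rows = learners, columns = items; not the spreadsheet's actual formulas):

```python
# 1 = learner answered the item correctly, 0 = incorrectly.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
]

n_learners = len(responses)
n_items = len(responses[0])

# P-value per item: fraction of learners who got that item right.
p_values = [sum(row[i] for row in responses) / n_learners
            for i in range(n_items)]
print(p_values)  # [0.75, 0.75, 0.25]
```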


Point-Biserial Correlation (Item Discrimination) - This is how well the item discriminates between your learners.  An item that everyone gets right will have a discrimination of zero.  A positive score means the item discriminated well (learners who scored above the mean typically got the question right, and learners below the mean got it wrong).  Items that score negatively require investigation, as it's possible the item assessed a different content area than the rest of the quiz.

An example of this would be a very simple word problem on a math test.  If your best mathematicians were poor readers, they might miss the easy question because of their ability to read rather than their ability to calculate.  Meanwhile, your struggling math students would get the question right because it's an easy math problem and they read well.  In this case, the question is performing more as a reading assessment than a math assessment and would need to be revised or removed in future editions of the quiz.
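For the curious, here is a Python sketch of one common (uncorrected) form of the point-biserial statistic, which correlates a single item's 1/0 scores with each learner's total quiz score.  The data is made up; the spreadsheet itself does this with Excel formulas:

```python
import statistics as st

def point_biserial(item_scores, total_scores):
    """Uncorrected point-biserial: (M1 - M0) / sd * sqrt(p * q), where
    M1/M0 are mean totals of learners who got the item right/wrong."""
    p = sum(item_scores) / len(item_scores)   # item P-value
    if p in (0, 1):
        return 0.0  # everyone right (or wrong): the item cannot discriminate
    mean_right = st.mean(t for t, s in zip(total_scores, item_scores) if s == 1)
    mean_wrong = st.mean(t for t, s in zip(total_scores, item_scores) if s == 0)
    sd = st.pstdev(total_scores)              # population standard deviation
    return (mean_right - mean_wrong) / sd * (p * (1 - p)) ** 0.5

# Toy data: one item's 1/0 scores for 4 learners, and their quiz totals.
r = point_biserial([1, 1, 1, 0], [2, 1, 3, 1])
```

A positive `r` here means learners with higher totals tended to get the item right, matching the interpretation above.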

Lastly, the spreadsheet calculates the reliability of the quiz using the Kuder-Richardson 20 (KR-20) formula for internal consistency.  Put as simply as possible, this number estimates how consistently the same population would score if they took the quiz again.  Tests with many positively discriminating items will typically have high reliability, and vice versa.  A solid reliability score is above 0.70; if you manage to reach 0.90 or above, you are in the range of national standardized tests.
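The KR-20 formula can be sketched in Python as follows (toy data again; the spreadsheet's own calculation lives in its VBA and formulas).  It combines the item P-values with the variance of the total scores:

```python
def kr20(responses):
    """KR-20 = k/(k-1) * (1 - sum(p*q) / var_total).
    responses: rows = learners, columns = items, values 1 (right) / 0 (wrong)."""
    k = len(responses[0])                    # number of items
    n = len(responses)                       # number of learners
    totals = [sum(row) for row in responses]
    mean_t = sum(totals) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n   # population variance
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n         # item P-value
        sum_pq += p * (1 - p)                            # p * q for this item
    return (k / (k - 1)) * (1 - sum_pq / var_t)

# 4 learners x 3 items of made-up data; a real quiz would score much higher
# with more items and more consistent responses.
reliability = kr20([[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 1, 0]])
```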

I'm not a fan of huge quizzes or tests.  I'm typically more concerned with "Can you do?" than "Do you know?".  However, when I have to create one, I like to be able to show that there is professional analysis behind what I do.

I know this is long-winded, but I hope this is something that you all find useful!  I'm very nerdy when it comes to the science behind our craft :)