Tutorial IRT

Introduction

Welcome!

Hello and welcome to this introduction to Item Response Theory, or IRT. IRT is used in testing to help with item selection and test validation.

In this tutorial, you will learn the main parameters and models associated with IRT, visualize how changing these parameters affects the models, learn how IRT can help identify item biases, and see how you can fit IRT models to data. There are also some questions at the end, so that you can test what you learned.

Good luck!

Difficulty

Imagine you’re trying to learn someone’s specific ability - for now, let’s take 2nd grade math ability. You develop some questions about math with different levels of difficulty, for example:

\(4 + 3 = \cdots\)
\(6 \times 7 = \cdots\)
\(\frac{1}{2} + \frac{1}{4} = \cdots\)

Let’s assume that your questions directly measure that person’s ability. If a question has a 50% chance of being answered correctly by an average 2nd grader, then you can assume that the difficulty of the item matches the ability of that student. People with a higher level of ability than the average 2nd grader have a better than 50% chance of answering correctly, and people with a lower level have a smaller chance.

This means that the probability that someone correctly responds to a question depends both on that person’s ability and on the difficulty of the question. This relationship is expressed by the following model, with ability denoted \(\theta\) and the difficulty of item \(i\) denoted \(b_i\). This model is called the Rasch model.

\[P_i(\theta) = \frac{e^{\theta - b_i}}{1 + e^{\theta - b_i}}\]
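To make the formula concrete, here is a minimal Python sketch of the Rasch model. The function name `rasch_probability` and the example values are illustrative assumptions, not part of the tutorial app.

```python
import math

def rasch_probability(theta, b):
    """Probability of a correct response under the Rasch model.

    theta: person ability; b: item difficulty (both on the same scale).
    """
    # Algebraically equal to e^(theta - b) / (1 + e^(theta - b)).
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability equals difficulty, the probability is one half.
print(rasch_probability(0.0, 0.0))

# For the same person, an easier item (lower b) is more likely to be solved.
print(rasch_probability(0.0, -1.0) > rasch_probability(0.0, 1.0))
```

Only the difference between ability and difficulty matters: a person one unit above the item difficulty has the same probability of success regardless of where on the scale the item sits.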

Experiment with Difficulty

When you plot the probability of someone correctly answering a question depending on their ability, you get an Item Characteristic Curve (ICC), as seen below. To get an idea of how changing the difficulty of the item affects what you can infer about ability, experiment with the difficulty sliders.

Difficulty item 1:

Difficulty item 2:

Difficulty item 3:

Rasch model

The Rasch model has only one item parameter: difficulty, denoted \(b_i\). As you can see from the graph, the curve for every item can move horizontally, but the shape stays the same. This is because the model assumes that the relation between the item and the latent trait being measured is the same for all items. In practice, however, this relation might vary across items. This is why the Two-Parameter Logistic (2PL) model introduces a new parameter: discrimination.
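The claim that Rasch curves differ only by a horizontal shift can be checked numerically. The sketch below is an illustration under assumed values: the `shift` of 1.5 and the ability grid are hypothetical.

```python
import math

def rasch_probability(theta, b):
    """Rasch model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

thetas = [-3, -2, -1, 0, 1, 2, 3]
shift = 1.5  # hypothetical difference in difficulty between two items

# Curve of an item with difficulty 0, evaluated on the ability grid.
easy_curve = [rasch_probability(t, 0.0) for t in thetas]

# Curve of a harder item, evaluated on the grid moved right by the same amount.
hard_curve = [rasch_probability(t + shift, shift) for t in thetas]

# If the shape is identical, the two lists match point by point.
same_shape = all(abs(e - h) < 1e-12 for e, h in zip(easy_curve, hard_curve))
print(same_shape)
```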

Discrimination

Let’s say that you’re trying to figure out the 2nd grade math ability of a group of kids because it will determine whether certain students are placed in a remedial math class. You want to make sure that the items clearly differentiate students who need the remedial class from those who don’t. The discrimination parameter shows how well an item distinguishes between such groups. In essence, discrimination measures how strongly an item relates to the underlying latent ability, so an item that strongly relates to the ability should have a higher discrimination parameter.

Experiment with Discrimination

In the following graph, the difficulties of the three items are set to -2, 0 and 2. Play around with the discrimination to see how the ICCs of these items change.

Discrimination item 1:

Infinity (if the slider is set below zero, this will be negative infinity)

Discrimination item 2:

Infinity

Discrimination item 3:

Infinity

More about Discrimination

As you can see from the above graph, an item with a higher discrimination has a steeper slope. We can say that such an item discriminates or differentiates better than an item with a lower discrimination, and gives more information about someone’s ability. A curious situation occurs when the discrimination is below 0. In that case, people with a lower ability have a higher probability of answering an item correctly than people with a higher ability. This could indicate that the item was counter-intuitive and needs to be rescored.

2PL Model:

Adding the discrimination parameter \(a_i\) changes the formula for the probability of a correct response into the Two-Parameter Logistic Model:

\[P_i(\theta) = \frac{e^{a_i (\theta - b_i)}}{1 + e^{a_i (\theta - b_i)}}\]
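As a rough illustration of what the discrimination parameter does, the sketch below evaluates the 2PL formula and estimates the slope of the curve at \(\theta = b_i\) with a finite difference. The function names and the chosen values of \(a_i\) are assumptions made for this example. For a logistic curve, the slope at that point equals \(a_i / 4\), so it grows in proportion to the discrimination.

```python
import math

def two_pl_probability(theta, a, b):
    """2PL model: a is discrimination, b is difficulty."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def slope_at_difficulty(a, b, h=1e-5):
    """Central finite-difference estimate of the ICC slope at theta = b."""
    upper = two_pl_probability(b + h, a, b)
    lower = two_pl_probability(b - h, a, b)
    return (upper - lower) / (2 * h)

# Hypothetical discrimination values; the slope at theta = b is a / 4.
for a in (0.5, 1.0, 2.0):
    print(a, round(slope_at_difficulty(a, 0.0), 4))

# With a negative discrimination, lower ability gives the higher probability.
print(two_pl_probability(-1.0, -1.0, 0.0) > two_pl_probability(1.0, -1.0, 0.0))
```

Setting `a` to 1 gives back the Rasch model.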

Guttman model

As you can see when you click the button marked infinity, when the discrimination parameter in the 2PL model goes to infinity, the ICC becomes deterministic: it fully differentiates between those with an ability above the item difficulty and those with an ability below it. This limiting version of the 2PL model is called the Guttman model.
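The limiting behaviour can be checked numerically. This is a sketch under stated assumptions: the function names are hypothetical, the logistic is written in a numerically stable form so that very large discriminations do not overflow, and the value returned when ability exactly equals difficulty is a convention chosen here.

```python
import math

def two_pl_probability(theta, a, b):
    """2PL probability, written so that large |a| does not overflow."""
    z = a * (theta - b)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def guttman_probability(theta, b):
    """Deterministic step: correct if ability is above the difficulty."""
    if theta > b:
        return 1.0
    if theta < b:
        return 0.0
    return 0.5  # the 2PL value at theta == b stays 0.5 for every finite a

# As a grows, the 2PL probabilities just below and above b approach 0 and 1.
for a in (1, 10, 100, 1000):
    below = two_pl_probability(-0.1, a, 0.0)
    above = two_pl_probability(0.1, a, 0.0)
    print(a, round(below, 4), round(above, 4))
```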

Guessing

Almost there! We just need one last ingredient to cover the basics of IRT. So far, as ability decreases, the probability of responding correctly to an item goes towards zero. This means that a person with very low ability will have a near-zero probability of correctly answering the question. But if a question has multiple possible answers, then a person with very low ability can guess one of the possibilities, and so the probability of correctly answering will be larger than zero.

Let’s take an item with four answer options. Without knowing anything about the subject, there is still a 1 in 4 probability of correctly answering the item (given that all options are equally likely). The Three-Parameter Logistic (3PL) model takes this into account by adding a third parameter \(c_i\):

\[P_i(\theta) = c_i + (1 - c_i) \frac{e^{a_i (\theta - b_i)}}{1 + e^{a_i (\theta - b_i)}}\]
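The effect of the guessing parameter is easiest to see at the extremes of the ability scale. The following sketch uses a hypothetical function name and a four-option item with \(c_i = 1/4\), as in the example above.

```python
import math

def three_pl_probability(theta, a, b, c):
    """3PL model: c is the lower asymptote (guessing parameter)."""
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic

c = 1 / 4  # four equally likely answer options

# Very low ability: the probability approaches c instead of zero.
print(three_pl_probability(-10.0, 1.0, 0.0, c))

# Ability equal to difficulty: halfway between c and 1, not 0.5.
print(three_pl_probability(0.0, 1.0, 0.0, c))
```

Note that in the 3PL model the difficulty \(b_i\) is no longer the ability at which the probability is 50%; it is the ability at which the probability is halfway between \(c_i\) and 1. Setting `c` to 0 gives back the 2PL model.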

Experiment with guessing

In this graph, the difficulty parameters are all equal and the discrimination of all items is set to 1. Pay attention to the lower end of the curve (the value that \(P_i(\theta)\) approaches for very low ability) as you change the guessing parameter.

Guessing item 1:

Guessing item 2:

Guessing item 3:

(Note: Although the slider is continuous, the guessing parameter should only take values of 1 divided by the number of answer options, such as .5, .33, .25 etc.)

A note about the 3PL model

This model includes a very unlikely assumption: that all answer options are equally likely to be chosen by someone who is guessing. This is rarely the case in practice, and even less so for all participants. This model is thus generally difficult to fit to data.

Putting it all together

Let’s play!

Here you can experiment with all 3 parameters: item difficulty, discrimination and guessing.

Difficulty:

Discrimination:

Guessing:

DIF

Differential Item Functioning

IRT can also give insight into how different groups might answer items differently.

Let’s take our 2nd graders again. This time, we have two children of equal ability, both at the average 2nd grade level, answering the same item. But let’s say the children belong to different groups, and receive different perspectives on math. For example, a young girl might perform worse on the item than a young boy, because they have both internalized gender-based views on math.

Sometimes, people with the same ability but from different groups have different probabilities of correctly answering an item. Only the measured ability is supposed to impact the probability of successfully answering a question, so if there are differences based on group membership, the test item might be biased. This can be investigated by checking whether the ICCs differ across groups: there can be differences in both difficulty and discrimination. If an item is flagged as showing DIF, it can be removed in an attempt to reduce test bias.

Experiment with DIF: Visualization

Here you can see how an item can show differential functioning that is related to group membership rather than to ability. For example, let’s say this item assesses intelligence, but it does so differently for 2 groups with the same ability level.

Show how the item difficulty can differentially affect groups.

Show how the item discrimination can differentially affect groups.

When an item only shows difference in difficulty for different groups, this is called uniform DIF. When the discrimination is also different, this is called non-uniform DIF. It is rare to observe an item with differential discrimination but equivalent difficulty.
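The distinction between uniform and non-uniform DIF can be sketched by comparing the ICCs of two groups at the same ability levels. Everything below is illustrative: the group labels, the parameter values and the helper `icc_gap` are assumptions, not output of the app.

```python
import math

def two_pl_probability(theta, a, b):
    """2PL model: a is discrimination, b is difficulty."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def icc_gap(theta, reference, focal):
    """Difference in P(correct) between two groups at the same ability.

    reference and focal are (a, b) tuples holding each group's parameters.
    """
    return two_pl_probability(theta, *reference) - two_pl_probability(theta, *focal)

thetas = [x / 2 for x in range(-6, 7)]  # ability grid from -3.0 to 3.0

# Uniform DIF: same discrimination, item is harder for the focal group.
uniform_gaps = [icc_gap(t, (1.0, 0.0), (1.0, 0.5)) for t in thetas]

# Non-uniform DIF: discrimination differs as well, so the curves cross.
nonuniform_gaps = [icc_gap(t, (1.5, 0.0), (0.7, 0.5)) for t in thetas]

# Uniform DIF favours the same group at every ability level.
uniform_same_sign = all(g > 0 for g in uniform_gaps)

# Non-uniform DIF favours different groups at different ability levels.
nonuniform_crosses = min(nonuniform_gaps) < 0 < max(nonuniform_gaps)

print(uniform_same_sign, nonuniform_crosses)
```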

In practice: data sets

So far, we’ve talked about looking at items assessing the ability of one person. In practice, though, IRT is often used to validate questionnaire items, to see what level of ability an item measures and how well it distinguishes between groups. To investigate how test items relate to ability, questionnaires are piloted with a large number of participants, and then an IRT model, such as the Rasch model, is fitted to the data.

TIF (Test Information Function)

Now that we have our items and we know how they behave individually, we would like to understand how the test as a whole actually works.

This information is given by the Test Information Function, or TIF. The TIF tells us how well the test assesses the latent trait, or, in other words, the precision of the test in measuring a specific level of ability. Since precision is the inverse of the variability of the estimated latent trait, greater variability means lower precision in finding the true value of a respondent’s ability.

Don’t be scared: this concept will be illustrated in the practical part of this tutorial, coming up next. Make sure to take a look at both the TIF graph and the Standard Error graph, and you’ll notice the inverse relationship between them. Last but not least, take a look at how changes in the item discrimination and difficulty parameters affect the TIF.
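For the 2PL model, the information of a single item at ability \(\theta\) is \(a_i^2 P_i(\theta)(1 - P_i(\theta))\), the TIF is the sum of the item informations, and the standard error of the ability estimate is \(1 / \sqrt{\text{TIF}}\). The sketch below applies these standard formulas to a hypothetical three-item test; the function names and item parameters are assumptions for illustration.

```python
import math

def two_pl_probability(theta, a, b):
    """2PL model: a is discrimination, b is difficulty."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Information of one 2PL item at ability theta."""
    p = two_pl_probability(theta, a, b)
    return a * a * p * (1.0 - p)

def tif(theta, items):
    """Test information: sum of item informations over (a, b) pairs."""
    return sum(item_information(theta, a, b) for a, b in items)

def standard_error(theta, items):
    """Standard error of the ability estimate at theta."""
    return 1.0 / math.sqrt(tif(theta, items))

# Hypothetical test: three items with equal discrimination.
items = [(1.0, -2.0), (1.0, 0.0), (1.0, 2.0)]

for theta in (-4.0, 0.0, 4.0):
    print(theta, round(tif(theta, items), 3), round(standard_error(theta, items), 3))
```

The test is most informative near the item difficulties, and the standard error rises as the information falls towards the ends of the scale.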

Model fit

Together with the estimated parameters of the model, you’ll find another tab in which Akaike’s Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are reported. These indices describe how well the model fits the data, but keep in mind that they are comparative indices. What does that mean? It means that to interpret them, you have to compare the AICs and BICs of different models fitted to the same data, and the model with the smallest values wins (i.e. it is the best-fitting model among those compared).
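Both indices combine the maximized log-likelihood with a penalty for the number of estimated parameters: AIC is \(2k - 2\ln L\) and BIC is \(k \ln n - 2\ln L\). The sketch below shows the comparison step. The log-likelihoods, parameter counts and sample size are made-up numbers for illustration only, and using the number of participants as \(n\) is an assumption.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike's Information Criterion."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical values, not results from real data.
n_persons = 500
candidates = {
    "1PL": {"log_likelihood": -1210.0, "n_params": 4},
    "2PL": {"log_likelihood": -1204.0, "n_params": 8},
}

scores = {
    name: {
        "AIC": aic(m["log_likelihood"], m["n_params"]),
        "BIC": bic(m["log_likelihood"], m["n_params"], n_persons),
    }
    for name, m in candidates.items()
}

# The model with the smallest value is preferred on each index.
best_by_aic = min(scores, key=lambda name: scores[name]["AIC"])
best_by_bic = min(scores, key=lambda name: scores[name]["BIC"])
print(scores)
print(best_by_aic, best_by_bic)
```

With these made-up numbers the two indices disagree, because BIC penalizes the extra parameters of the 2PL model more heavily than AIC does.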

Now it’s time to play with the data!

To see for yourself how sample size affects model fit, as well as how any change in a model parameter affects the estimation of all of the others, experiment with the simulated data below.

How it works:

The system will generate new data as you change the sample size, the range of the difficulty parameter (1PL model), and the discrimination parameter for each item (2PL model). As you will see, every change in one of the generating parameters will affect the model, but we don’t want to spoil anything... try it yourself!
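The app’s own data generation is not shown here, but the general idea can be sketched as follows. This example is an assumption about how such a simulation might look, not the app’s implementation: abilities are drawn from a standard normal distribution, and each response is a random draw with the 2PL probability. The function names, seed and item parameters are hypothetical.

```python
import math
import random

def two_pl_probability(theta, a, b):
    """2PL model: a is discrimination, b is difficulty."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def simulate_responses(n_persons, items, seed=0):
    """Return a list of 0/1 response rows, one row per simulated person."""
    rng = random.Random(seed)  # fixed seed makes the example reproducible
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)  # assumed ability distribution
        row = [
            1 if rng.random() < two_pl_probability(theta, a, b) else 0
            for a, b in items
        ]
        data.append(row)
    return data

# Four hypothetical items as (a, b); a = 1 for all items gives a 1PL setup.
items = [(1.0, -1.5), (1.0, -0.5), (1.0, 0.5), (1.0, 1.5)]

for n in (50, 5000):
    data = simulate_responses(n, items)
    proportions = [sum(column) / n for column in zip(*data)]
    print(n, [round(p, 2) for p in proportions])
```

With a small sample, the proportion of correct answers per item fluctuates noticeably from one simulated data set to the next, which is one reason parameter estimates are less stable for small samples.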

In practice: 1PL Model

Experiment with the sample size and the difficulty

Choose the number of participants:

Choose the item difficulty range:

In practice: 2PL Model

Experiment with the sample size, difficulty and discrimination

Choose the number of participants:

Choose the item difficulty range:

Choose item 1 discrimination parameter:

Choose item 2 discrimination parameter:

Choose item 3 discrimination parameter:

Choose item 4 discrimination parameter:

Questions

Did you pay attention?

Quiz

Sources and links

This app was developed through the TquanT seminar.

Sources:

DeMars, C. (2010). Item response theory: Understanding statistic measurement. New York, NY: Oxford University Press.