What is the problem?
Whenever
you use humans as a part of your measurement procedure, you have to worry about
whether the results you get are reliable or consistent. People are notorious
for their inconsistency. We are easily distractible. We get tired of doing
repetitive tasks. We daydream. We misinterpret.
Solution
Inter-rater reliability is used to determine whether two observers
are being consistent in their observations. Inter-rater reliability should be
established outside of the context of the measurement in your study. After all,
if you use data from your study to establish reliability, and you find that
reliability is low, you're kind of stuck. Probably it's best to do this as a
side study or pilot study. And, if your study goes on for a long time, you may
want to reestablish inter-rater reliability from time to time to ensure that
your raters aren't changing.
The ways to estimate inter-rater reliability
There
are two major ways to actually estimate inter-rater reliability. If your
measurement consists of categories -- the raters are checking off which
category each observation falls in -- you can calculate the percent of
agreement between the raters.
First way:
Let's
say you had 100 observations that were being rated by two raters. For each
observation, the rater could check one of three categories. Imagine that on 86
of the 100 observations the raters checked the same category. In this case, the
percent of agreement would be 86%. This is a simple statistic, but it gives an
idea of how much agreement exists, and it works no matter how many categories
are used for each observation.
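As a rough illustration (not part of the original article), here is how the
percent-agreement calculation might look in Python, using invented category
codes for the two raters:

    # A minimal sketch of percent agreement between two raters.
    # The category codes below are invented example data, not the article's ratings.
    rater_a = [1, 2, 2, 3, 1, 1, 2, 3, 3, 1]   # categories checked by rater A
    rater_b = [1, 2, 2, 3, 1, 2, 2, 3, 1, 1]   # categories checked by rater B

    agreements = sum(a == b for a, b in zip(rater_a, rater_b))
    percent_agreement = 100 * agreements / len(rater_a)

    # Here the raters agree on 8 of 10 observations, i.e. 80% agreement;
    # 86 agreements out of 100 observations would give the 86% in the example above.
    print(f"Agreed on {agreements} of {len(rater_a)} observations "
          f"({percent_agreement:.0f}% agreement)")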
Second way:
The
other major way to estimate inter-rater reliability is appropriate when the
measure is a continuous one. In that case, all you need to do is calculate the
correlation between the ratings of the two observers. For instance, they might
be rating the overall level of activity in a classroom on a 1-to-7 scale. You
could have them give their rating at regular time intervals (e.g., every 30 seconds).
The correlation between these ratings would give you an estimate of the
reliability or consistency between the raters.
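As another illustrative sketch with invented data (not the article's), the
correlation between the two observers' 1-to-7 ratings could be computed like
this in Python:

    # A minimal sketch of inter-rater reliability for a continuous measure,
    # estimated as the Pearson correlation between two observers' ratings.
    # Each pair of values is one 30-second interval; the data are made up.
    from statistics import correlation  # Pearson's r; requires Python 3.10+

    rater_a = [3, 4, 5, 2, 6, 4, 3, 5, 7, 2]
    rater_b = [3, 5, 5, 2, 6, 3, 4, 5, 6, 2]

    r = correlation(rater_a, rater_b)
    print(f"Inter-rater correlation: r = {r:.2f}")  # values near 1 mean high consistency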
You
might think of this type of reliability as "calibrating" the
observers. There are other things you could do to encourage reliability between
observers, even if you don't estimate it. For instance, I used to work in a
psychiatric unit where every morning a nurse had to do a ten-item rating of
each patient on the unit. Of course, we couldn't count on the same nurse being
present every day, so we had to find a way to ensure that any of the nurses
would give comparable ratings. The way we did it was to hold weekly
"calibration" meetings where we would have all of the nurses ratings
for several patients and discuss why they chose the specific values they did.
If there were disagreements, the nurses would discuss them and attempt to come
up with rules for deciding when they would give a "3" or a
"4" for a rating on a specific item. Although this was not an
estimate of reliability, it probably went a long way toward improving the
reliability between raters.