My research

by larc

  • larc

    My research:

    I am in the process of writing a manuscript on the four methods of measuring the reliability of human judgement and the reliability of psychological tests. It occurred to me several years ago that there was a mistake in the way measurement error was being calculated. This led me to collect some data to test my hypothesis. I submitted my results to the Journal of Applied Psychology, and my manuscript was soundly trounced by the three reviewers and the editor.

    Since then, I have gathered further evidence. I am retired, but am working with a young PhD from the local university who graduated from the same school I did, and even studied under the same advisor I worked with years ago. Even though my young colleague knows the field very well, it took him about a month to grasp my concept, since it is at odds with what is being taught. Once the light came on, he was the one who saw that my concept would apply to tests. I had seen that, but not as clearly as he did; originally, I was more focused on the implications for human judgement.

    At any rate, we have collected very complete data, and are ready to write a much expanded, detailed manuscript.

    So, here are the different measures:

    Alpha coefficient: This measures the degree of consistency of items in a test. For people, this is an index of their consistency of judgement at one point in time.

    Test-retest reliability: This measures the error associated with real change over time.

    Construct validity: This measure contains all sources of error: item inconsistency, changes in people, and nonparallelism in measurement between two tests or two people.

    To correct for item inconsistency in construct validity, the square root of the validity coefficient should equal the test-retest value, since there is no item inconsistency reflected in the test-retest correlation. This correction, though a standard method, has not been considered in the past for this kind of data.
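
    Here is the accounting as a short Python sketch. The function names and the additive form are my own shorthand for what I just described, not standard notation:

    ```python
    # A minimal sketch of the additive error accounting described above.
    import math

    def item_inconsistency(alpha: float) -> float:
        """Error variance from item inconsistency: 1 - alpha."""
        return 1.0 - alpha

    def person_change(test_retest: float) -> float:
        """Error variance from real change in people over time."""
        return 1.0 - test_retest

    def validity_ceiling(alpha: float, test_retest: float) -> float:
        """Highest construct validity once both error sources are removed."""
        return 1.0 - item_inconsistency(alpha) - person_change(test_retest)

    def sqrt_corrected(construct_validity: float) -> float:
        """The square root correction; should land near the test-retest value."""
        return math.sqrt(construct_validity)
    ```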

    Here are the results for cognitive tests:

    Average alpha coefficient = .914. 1.000 - .914 = .086, which means that item inconsistency produces 8.6% error.

    Average test-retest reliability is .892, which means that changes in people account for .108, or 10.8%, of the error variance. The sum of these two sources of error is .194. 1.000 - .194 is .806, which is the highest correlation that could be obtained for construct validity. The average construct validity is .790, only .016 less than this value. Therefore, the degree of nonparallelism between two tests measuring the same construct is less than 2% error, and item inconsistency and changes in people account for most of the error. Furthermore, the square root of construct validity is .888, which is only .004 less than test-retest reliability. So, two different methods, adding up the error terms and correcting with the square root value, produce nearly identical results.
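
    For anyone checking the arithmetic, here it is in a few lines of Python, using the averages quoted above:

    ```python
    import math

    alpha, test_retest, validity = 0.914, 0.892, 0.790

    item_error = 1.000 - alpha            # 0.086 -> 8.6% item inconsistency
    person_error = 1.000 - test_retest    # 0.108 -> 10.8% change in people
    ceiling = 1.000 - item_error - person_error  # 0.806 validity ceiling

    print(f"nonparallelism: {ceiling - validity:.3f}")   # 0.016, under 2% error
    print(f"sqrt(validity): {math.sqrt(validity):.3f}")  # ~0.889, vs .892 retest
    ```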

    This same relationship holds for personality tests and for measures of human judgement. No one in the 120-year history of psychometrics has tied this all together.

    One other measure of reliability is one set of scores correlated against an average from a large set of other scores. This cannot be done with tests or measures of human judgement in most cases. It can be done with simulated data using a method developed by Walter Borman in 1976. He termed his calculation an accuracy score. In a set of data my colleague and I have collected using a human judgement task, we can compare all four measures of reliability. The data is to be put in a database next week, and we should have the analysis done in the near future.
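
    Since our data is not entered yet, here is an illustration of the general idea on simulated numbers: correlate one judge's scores against the average of many other judges. This is my own toy version, not Borman's exact calculation:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n_targets, n_other_judges = 200, 50

    # Every judge sees the same true scores plus independent rating error.
    true_scores = rng.normal(0.0, 1.0, n_targets)
    one_judge = true_scores + rng.normal(0.0, 0.5, n_targets)
    others = true_scores + rng.normal(0.0, 0.5, (n_other_judges, n_targets))

    # Averaging the other judges lets their independent errors cancel out.
    reference = others.mean(axis=0)

    r = np.corrcoef(one_judge, reference)[0, 1]
    print(f"correlation with averaged reference: {r:.3f}")  # roughly 0.9 here
    ```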

    I know this stuff is tiny detail, but it has major implications for some big issues in assessment, such as the degree of bias in testing or judging protected groups.

    Just thought I would lay this all out for a few of my detractors, who think this Gramps is going brain dead.

  • neyank

    Hi larc,

    Are you saying then, that 1 + 1 does indeed equel two?

    Or is it really just a figment of our imagination?

    neyank

  • neyank

    That should be equal.

    neyank

  • larc

    I am saying that total variance = 1.00, and various error components can be calculated, primarily item inconsistency and person inconsistency. What is left over after these components are determined is true score variance.

    For tests of mental ability, the true score variance is about 80%. For personality tests it is about 58%, and for the judgement of job performance it is about 68%.

  • Simon

    woosh ... the sound of things going over my head, lol!

  • larc

    Simon,

    Don't worry about that. I have explained this to PhDs with strong backgrounds in tests and measurements, and they don't get it either. That is the reason the paper was turned down the first time. Since then, I have collected data from three very different sources, and it all agrees with what I originally had with just one data set.

    Two things that are new.

    1. No one has used the square root correction in this situation.

    2. No one has figured out how the different error components fit together before.

  • Francois

    So what is the point of all this?

    FT

  • larc

    Francois,

    One point is that people are much more accurate in making judgements than previously thought; the wrong statistic was being used all along to measure the reliability of judgements. Another point is that tests are more valid in predicting job success than previously thought, for the same reasons. Both of these facts are important in determining whether a procedure is biased and unfairly discriminates against minorities in hiring and promotion.

    This leads to a very important decision for a business, for example: should they use a particular test to screen job candidates?

  • hillary_step

    Larc,

    A very interesting project you are working on.

    Given the difficulty in assessing the 'value' of a person within a given occupation, due to our own 'trigger' preferences and stereotyping, it seems that a more scientific means of reaching such conclusions would actually revolutionize many fields of business, industry and social integration in general.

    Good luck with the paper - HS

  • larc

    Hillary,

    I am also working on some other related research. One finding is particularly interesting. If two supervisors are asked to independently rate one or two aspects of an employee's performance along with an overall evaluation, the correlation between the two supervisors' overall evaluations is low. However, as more aspects of the job are rated, the accuracy of the overall rating goes up. It appears that if you ask supervisors to rate a lot of job-related items, their degree of bias goes down, and they come up with more accurate overall results; bias is more likely when they rate only a few aspects of the job.
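
    The error-averaging part of this is easy to show with a toy simulation (my own illustration, not our actual data): two supervisors each rate k aspects, every rating is the employee's true performance plus independent rater error, and each overall evaluation is the mean of the k aspect ratings. The agreement between the two overall evaluations climbs as k grows:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    n_employees = 500
    true_perf = rng.normal(0.0, 1.0, n_employees)

    def overall(k: int) -> np.ndarray:
        """One supervisor's overall evaluation: mean of k noisy aspect ratings."""
        aspects = true_perf + rng.normal(0.0, 1.0, (k, n_employees))
        return aspects.mean(axis=0)

    for k in (1, 2, 5, 10, 20):
        r = np.corrcoef(overall(k), overall(k))[0, 1]
        print(f"{k:2d} aspects -> inter-rater correlation {r:.2f}")
    ```

    This toy model only captures the averaging of random error, not the drop in bias, but it shows why overall ratings built from many items agree better across raters.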
