Creating a test is a very complicated process! It takes lots of people to build the test and lots of participants to establish norms for it. Start out with a much larger number of items than you need so you can pare it down later (cut the bad items).
Two characteristics of useful tests (some tests out there don’t meet these essentials):
- Reliability: Does the test measure the same thing consistently, for example across different raters? (This depends on rater training and scoring standards: some tests explicitly tell you how to score each item, while others say “use your judgement against XYZ,” which may lead to different results across raters.)
- Test-Retest Reliability: The expectation is that if you retake the test, you get roughly the same score (intelligence doesn’t go up in 2 weeks, so the two scores should at least be in the same ballpark).
- Ideally the correlation between the two administrations is 1 (complete agreement); a correlation of 0 means the scores are completely unrelated (see the reliability sketch after this list).
- Validity: Is it measuring what it purports to measure?
- Content Validity: the items that make up the test are in fact measuring what the test purports to measure; the items have to match up with what the test claims to assess.
- Example: on a math test, an item like “spell the word ‘CAT’” is invalid because it’s unrelated to math.
- Predictive Validity: if the test really measures what it claims to, its scores should predict relevant outcomes.
- Example: take people who score 120+ on an IQ test (roughly the top 10%) and compare them against people at the 50th percentile. If the test is predictive of achievement, the high scorers should do better at reading, writing, and arithmetic (see the predictive-validity sketch after this list).
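
To make the correlation idea concrete, here is a minimal sketch of test-retest reliability, assuming the same ten people take the same test twice, two weeks apart. The scores and the plain-Python Pearson correlation below are purely illustrative, not from any real dataset.

```python
# Illustrative sketch: test-retest reliability as a Pearson correlation.
# The scores below are made-up; a correlation near 1 means the two
# administrations agree almost completely, near 0 means they are unrelated.
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Same ten people tested twice, two weeks apart (hypothetical scores).
time_1 = [98, 104, 121, 87, 110, 95, 132, 101, 115, 90]
time_2 = [101, 102, 118, 90, 112, 93, 130, 99, 117, 94]

print(f"test-retest reliability: r = {pearson_r(time_1, time_2):.2f}")
```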
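
And a similar sketch of the predictive-validity comparison above, assuming hypothetical (IQ, achievement) pairs for the same people: if the IQ test predicts achievement, the 120+ group should average higher achievement scores than the group near the 50th percentile.

```python
# Illustrative sketch: predictive validity as a group comparison.
# Hypothetical (IQ, achievement) pairs; if IQ predicts achievement,
# the 120+ group should average higher achievement than the group
# scoring near the 50th percentile (IQ around 100).
people = [
    (125, 88), (131, 92), (122, 85), (140, 95),   # 120+ group (top ~10%)
    (100, 74), (98, 70), (102, 77), (99, 72),     # ~50th percentile group
]

high_iq = [ach for iq, ach in people if iq >= 120]
average = [ach for iq, ach in people if iq < 120]

print(f"mean achievement, IQ 120+:         {sum(high_iq) / len(high_iq):.1f}")
print(f"mean achievement, 50th percentile: {sum(average) / len(average):.1f}")
```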