1. What the instrument measures
TraitTune estimates where a respondent sits on each of fifteen personality dimensions. Each dimension is bipolar and continuous, a spectrum anchored by two pole labels, and each score is reported on a 0–100 scale together with a per-dimension measurement-precision figure.
The fifteen dimensions organise into four interpretive clusters: Curiosity & Drive, Social Engagement, Structure & Focus, and Emotional Ground. The clusters were derived empirically from the posterior correlation structure of the deployed calibration, not imposed by borrowing a competing framework's taxonomy. The underlying dimensional structure draws on the modern trait-psychology literature — the Five-Factor tradition (Goldberg 1993; McCrae & Costa 2008), the HEXACO extension (Ashton & Lee 2007), and further constructs well-attested across several academic taxonomies — but TraitTune does not inherit any one framework's labels or factor structure wholesale.
A mid-range score on a dimension is informative, not a failure to detect a signal. It means that the respondent's behaviour on that dimension is context-dependent or balanced between the two poles — a real pattern, not an absence of one. Every reported score is paired with a measurement-precision figure, so a mid-range result is read as a substantive finding, not as noise.
2. How respondents interact with the instrument
Every single-format instrument has a characteristic failure mode. Agreement-scale items invite acquiescence bias; forced-choice items produce within-person (ipsative) artefacts that are hard to compare across respondents; scenario items are information-rich but expensive per unit of respondent time. TraitTune deliberately mixes four formats so their strengths compound and their weaknesses cancel (Christiansen, Burns & Montgomery 2005; Brown 2016).
Likert-type agreement items present a statement and ask how much the respondent agrees on a five-point scale; these form the majority of the assessment. Two-option forced-choice items present a pair of paraphrases — both plausible — that anchor opposite ends of a single dimension, and ask which fits better. Scenario items describe a short everyday situation with three plausible behavioural responses corresponding to graded levels of the underlying construct.
Multidimensional forced-choice (MFC) triplets present three statements drawn from different dimensions and ask the respondent first to select the one that fits them most, and then the one that fits least. The sequential Most-then-Least presentation follows the Best-Worst Scaling tradition (Finn & Louviere 1992; Marley & Louviere 2005; Louviere, Flynn & Marley 2015), which itself rests on Thurstone's Law of Comparative Judgment (Thurstone 1927). Presenting the two choices sequentially rather than as a simultaneous grid reduces cognitive load while yielding the same pairwise-ranking information.
Crucially, three of the four formats are normative — each respondent's answer is directly comparable to another respondent's — while MFC is ipsative, meaning answers are within-person comparisons only. Ipsative-only instruments are notoriously difficult to interpret across respondents; normative-only instruments are vulnerable to response-style and social-desirability artefacts. Combining the two is more robust to faking and response styles than either family on its own (Jackson, Wroblewski & Ashton 2000; Christiansen, Burns & Montgomery 2005; Brown 2016).
3. How a score is computed
Every response is scored with an Item Response Theory (IRT) model appropriate to its format. IRT is the measurement framework behind large-scale adaptive testing programmes worldwide, including the GRE, GMAT, and the US Armed Services Vocational Aptitude Battery (Embretson & Reise 2000; van der Linden & Glas 2010). Unlike classical test theory, IRT gives every item its own calibrated difficulty and discrimination parameters, which lets the engine weigh evidence from items of different formats on a single common scale.
Each format is scored with the IRT model appropriate to its response structure. Likert items are modelled with the Graded Response Model (Samejima 1969); two-option forced-choice items with the two-parameter logistic model (Birnbaum 1968); scenario items with the Generalised Partial Credit Model (Muraki 1992) on three ordered categories. MFC triplets are modelled with Thurstonian IRT after Brown & Maydeu-Olivares (2011), in which each statement carries a latent utility U = λ·κ·θ + ε with item-specific loading λ, keying κ ∈ {+1, −1}, and Gaussian residual ε. A Most/Least pick expands to three pairwise probabilities, each a normal-CDF function of the utility differences between the paired statements; this is the same family of models that underpins the Multi-Unidimensional Pairwise Preference approach to adaptive personality testing (Stark, Chernyshenko & Drasgow 2005, 2012).
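For concreteness, the sketch below shows how a single Most/Least pick on a three-statement block expands into the three implied pairwise comparisons and their normal-CDF probabilities under the utility model just described. The function names, parameter values, and example triplet are illustrative, not drawn from the production item pool.

```python
import numpy as np
from scipy.stats import norm

def triplet_pairwise_probs(theta, dims, lam, kappa, resid_sd, most, least):
    """Expand a Most/Least pick on a three-statement MFC block into the three
    implied pairwise comparisons and their probabilities under a Thurstonian
    utility model U_s = lam_s * kappa_s * theta[dims_s] + eps_s."""
    mu = lam * kappa * theta[dims]            # latent utility means, one per statement
    middle = ({0, 1, 2} - {most, least}).pop()
    # "most" beats both other statements; the middle statement beats "least"
    pairs = [(most, middle), (most, least), (middle, least)]
    probs = []
    for i, j in pairs:
        z = (mu[i] - mu[j]) / np.sqrt(resid_sd[i] ** 2 + resid_sd[j] ** 2)
        probs.append(norm.cdf(z))             # P(statement i preferred over statement j)
    return pairs, probs

# Example: a triplet covering dimensions 2, 7 and 11; the respondent picks
# statement 0 as "most like me" and statement 2 as "least like me".
theta = np.zeros(15); theta[2], theta[7], theta[11] = 1.2, -0.4, 0.3
pairs, probs = triplet_pairwise_probs(
    theta,
    dims=np.array([2, 7, 11]),
    lam=np.array([0.9, 1.1, 0.8]),
    kappa=np.array([+1, -1, +1]),
    resid_sd=np.array([1.0, 1.0, 1.0]),
    most=0, least=2,
)
```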
Because one answer can carry information about more than one dimension at once — and because MFC triplets are intrinsically multidimensional — trait estimation is carried out inside a Multidimensional IRT (MIRT) framework on a single fifteen-dimensional latent vector with a shared standard-normal prior (Reckase 2009; Ackerman 1994). After every response the posterior over each dimension is re-computed by an expected a posteriori (EAP) update on a 61-point grid, implemented as a two-pass coordinate-ascent sweep across the fifteen dimensions (Bock & Mislevy 1982). The score, the precision figure, and the interpretive category reported to the respondent are all derived from the resulting per-dimension posterior mean and posterior standard deviation.
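A minimal sketch of the per-dimension EAP step, assuming a 61-point grid spanning −4 to +4 and a handful of Likert responses scored under the Graded Response Model. The production engine sweeps all fifteen dimensions in two coordinate-ascent passes; the grid range, names, and parameter values here are illustrative.

```python
import numpy as np

GRID = np.linspace(-4.0, 4.0, 61)            # 61-point quadrature grid (range assumed)
PRIOR = np.exp(-0.5 * GRID ** 2)             # standard-normal prior, unnormalised

def grm_category_prob(theta, a, thresholds, category):
    """Graded Response Model: P(response == category | theta) for one item,
    with discrimination a and ordered category boundaries b_1..b_{K-1}."""
    cum = [np.ones_like(theta)]               # P(X >= 0) = 1
    for b in thresholds:
        cum.append(1.0 / (1.0 + np.exp(-a * (theta - b))))
    cum.append(np.zeros_like(theta))          # P(X >= K) = 0
    return cum[category] - cum[category + 1]

def eap_update(responses):
    """EAP posterior mean and SD on one dimension, given that dimension's
    responses as (a, thresholds, observed_category) tuples."""
    posterior = PRIOR.copy()
    for a, thresholds, cat in responses:
        posterior *= grm_category_prob(GRID, a, thresholds, cat)
    posterior /= posterior.sum()
    mean = (GRID * posterior).sum()
    sd = np.sqrt(((GRID - mean) ** 2 * posterior).sum())
    return mean, sd

# Example: two five-category Likert items both answered "agree" (category 3).
mean, sd = eap_update([
    (1.4, [-1.5, -0.5, 0.5, 1.5], 3),
    (0.9, [-2.0, -0.7, 0.3, 1.2], 3),
])
```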
Item selection is adaptive. In a fixed-length questionnaire, every respondent sees the same items in the same order; in TraitTune the next item is chosen from what the engine already knows about this respondent. Concretely, each candidate item is scored by its Fisher information at the current posterior mean multiplied by the current posterior variance on the dimension that item targets: score(i) = I_i(θ̂_d) · Var(θ_d | responses so far). Fisher information is the classical optimality criterion for IRT item selection; variance-weighting routes the engine toward dimensions whose estimates are still wide rather than those already resolved. The rule is simple and stationary — there is no separate early-session heuristic and no phase transition. Because variance differs across dimensions, the engine naturally explores broadly at first (when all variances are near 1.0) and focuses narrowly later (when most dimensions have converged and only a few remain wide). This is a practical form of posterior-weighted Fisher selection (Chang & Ying 1996; van der Linden & Pashley 2010).
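The selection rule is small enough to show directly. The sketch below uses the two-parameter-logistic information function as the concrete case; the candidate structure and names are illustrative.

```python
import numpy as np

def fisher_info_2pl(theta, a, b):
    """Fisher information of a two-parameter-logistic item at theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)

def select_next_item(candidates, post_mean, post_var):
    """Pick the candidate with the highest variance-weighted information.
    candidates: dicts with keys 'dim', 'a', 'b';
    post_mean, post_var: per-dimension posterior mean / variance arrays."""
    def score(item):
        d = item["dim"]
        return fisher_info_2pl(post_mean[d], item["a"], item["b"]) * post_var[d]
    return max(candidates, key=score)

# Early in a session all variances are near 1.0, so information alone drives
# selection; later, already-precise dimensions (small variance) are down-weighted
# and the engine concentrates on the dimensions that are still wide.
```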
For MFC content the same rule applies at the block level: candidate triplets are scored by the expected Fisher information they would contribute to each of the three dimensions they cover, weighted per-dimension by current posterior variance, summed across still-active dimensions (Stark, Chernyshenko & Drasgow 2012). Selection terminates on a dimension once its posterior standard deviation crosses a precision threshold; items and blocks that would only supply information to already-converged dimensions are dropped from further consideration. The engine works in three precision zones: a dimension is treated as resolved once its posterior SD falls below a tight threshold, refined with one or two additional high-information items in an intermediate zone, and — if it remains broadly uncertain after the parametric items are exhausted — handed off to an open-ended follow-up exchange (the clarify chat) before a final estimate is reported. The exact threshold values are versioned with each calibration run rather than hard-coded into this page.
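Expressed as code, the per-dimension routing and the block-level scoring look roughly like the sketch below. The numeric thresholds are placeholders, since the deployed values are versioned with each calibration run.

```python
# Placeholder thresholds: the deployed values are versioned per calibration run.
SD_RESOLVED = 0.30    # below this, the dimension is treated as resolved
SD_REFINE = 0.45      # between the two thresholds, give it one or two more items

def precision_zone(posterior_sd):
    if posterior_sd < SD_RESOLVED:
        return "resolved"   # stop selecting items for this dimension
    if posterior_sd < SD_REFINE:
        return "refine"     # a few more high-information items
    return "clarify"        # still wide once the parametric pool is exhausted: clarify chat

def score_mfc_block(block_dims, block_info, post_var, active_dims):
    """Variance-weighted expected information of a triplet, summed over the
    still-active dimensions it covers (illustrative)."""
    return sum(
        info * post_var[d]
        for d, info in zip(block_dims, block_info)
        if d in active_dims
    )
```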
Because different dimensions converge at different rates — a function of item-pool depth on that dimension and of the respondent's position on it — the total number of items a respondent sees is not fixed in advance. A full fifteen-dimension session typically takes 40–70 items, substantially fewer than a comparable fixed-length inventory would require at equivalent precision. The progress bar shows an expected-remaining-item forecast, computed by simulating the selector forward under current posterior estimates rather than counting toward a pre-set target.
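A rough illustration of that forecast, assuming a Gaussian approximation in which each simulated item shrinks its target dimension's posterior variance by its Fisher information. It reuses fisher_info_2pl from the selection sketch above and is not the production simulator.

```python
import numpy as np

def forecast_remaining(candidates, post_mean, post_var, sd_target=0.30, max_items=80):
    """Forecast how many more items the selector is likely to use by running it
    forward under a Gaussian approximation: after each simulated item the target
    dimension's variance shrinks as 1/(1/var + info). Purely illustrative."""
    var = post_var.copy()
    pool = list(candidates)
    count = 0

    def weighted_info(item):
        d = item["dim"]
        return fisher_info_2pl(post_mean[d], item["a"], item["b"]) * var[d]

    while pool and np.sqrt(var.max()) > sd_target and count < max_items:
        item = max(pool, key=weighted_info)   # same rule as the live selector
        pool.remove(item)
        d = item["dim"]
        var[d] = 1.0 / (1.0 / var[d] + fisher_info_2pl(post_mean[d], item["a"], item["b"]))
        count += 1
    return count
```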
Some readers do not need a full fifteen-dimension profile in one sitting. TraitTune offers focused use-case scopes — Career & performance, Dating & partnership, Collaboration style, Self-discovery — each of which targets the five priority dimensions for that scope rather than all fifteen. The same scoring engine runs in both cases; only the candidate pool changes, so a focused session is shorter (roughly 25–40 items) and only the five in-scope dimensions receive a precision-bearing estimate. Out-of-scope dimensions are not reported as zero — they are simply not measured in that session, and the respondent can extend later to the full instrument without retaking what was already answered.
4. Item development
Item quality is the single largest determinant of measurement precision in modern IRT practice (Clark & Watson 1995; DeVellis 2017). The active pool currently contains roughly 885 items across the fifteen dimensions — averaging about 59 items per dimension, with the actual distribution shaped by dimension breadth and format mix. Every item has passed a structured multi-stage review before entering the live instrument.
Items are authored against an operational definition of a single dimension and reviewed both for their primary construct and for any cross-loading onto unrelated constructs; items that load on more than one dimension are either rewritten or retired. Every dimension carries forward-keyed and reverse-keyed items in approximately equal proportion, so that acquiescence bias and careless-responding artefacts cannot systematically bias the estimate in one direction.
Items that appear inside MFC triplets pass an additional structural calibration. Within a block, the three statements are grouped so that their social-desirability ratings are closely matched; matched-desirability grouping is the condition under which Thurstonian IRT produces the least-biased utility estimates, and it is what makes a forced-choice block a genuine discriminative task rather than a social-desirability preference (Jackson, Wroblewski & Ashton 2000; Brown & Maydeu-Olivares 2013). Block assembly additionally guarantees that each triplet draws from three different dimensions, contains at least one reverse-keyed and at least one forward-keyed item, and does not reuse the same item across overlapping active blocks. These constraints are enforced by an automated block-assembly pipeline, following recommendations from the forced-choice-format methodology literature (Hicks 1970; Meade 2004).
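The block-level constraints can be summarised as a simple validity check on a candidate triplet. The desirability-spread tolerance and field names below are illustrative, and the no-reuse constraint is enforced across the assembled set of blocks rather than inside any single check.

```python
def valid_triplet(statements, max_desirability_spread=0.5):
    """Check the structural constraints for one candidate MFC triplet.
    Each statement is a dict with 'dim', 'keyed' (+1 or -1) and 'desirability'.
    The spread tolerance is a placeholder, not the production value."""
    dims = [s["dim"] for s in statements]
    keys = [s["keyed"] for s in statements]
    desir = [s["desirability"] for s in statements]
    return (
        len(statements) == 3
        and len(set(dims)) == 3                                   # three different dimensions
        and (+1 in keys) and (-1 in keys)                         # both keying directions present
        and max(desir) - min(desir) <= max_desirability_spread    # matched social desirability
    )
```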
Every item also passes a content review for double-barrelled wording, culturally-loaded phrasing, ambiguity, and reading-level consistency. When post-deployment data indicates differential item functioning, poor discrimination, or localisation issues, the policy is retire and replace: a new item is authored and calibrated rather than rewriting the old one in place, so that each historical calibration record stays attached to the exact item it was computed on. Items retired under this policy do not return to the active pool. Formal Differential Item Functioning (DIF) analysis across language and demographic strata is on the validation roadmap.
5. Calibration
Item parameters — discriminations, difficulties, response-category thresholds, and the MFC loadings and uniquenesses — are estimated jointly under a single hierarchical Bayesian model in which all fifteen latent traits share a standard multivariate-normal prior and every response format contributes its own likelihood on the same θ vector. Posterior inference uses No-U-Turn Hamiltonian Monte Carlo (NUTS; Hoffman & Gelman 2014) as implemented in PyMC (Salvatier, Wiecki & Fonnesbeck 2016), with standard r-hat and effective-sample-size diagnostics (Gelman et al. 2013). Each deployment is gated on passing those diagnostics on every item parameter. Each calibration run is archived together with its data snapshot, MCMC configuration, diagnostic report, and posterior traces so that every deployed parameter is fully reproducible back to the sample it was fit on. The currently-deployed joint MIRT calibration was finalised in April 2026 (joint MIRT v6.3) under those gates.
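As a hedged illustration of the model family, the sketch below fits a two-parameter-logistic slice of such a joint model in PyMC with NUTS and then pulls the convergence diagnostics that gate a release. It is a simplified stand-in, with placeholder shapes, priors, and sampler settings, not the deployed v6.3 specification.

```python
import numpy as np
import pymc as pm
import arviz as az

# Illustrative 2PL slice of the joint calibration: all fifteen traits share a
# standard multivariate-normal prior, and binary forced-choice responses carry
# the likelihood. The production model adds graded-response, partial-credit and
# Thurstonian likelihoods on the same theta vector.
n_persons, n_items, n_dims = 500, 60, 15
item_dim = np.random.randint(0, n_dims, size=n_items)            # trait each item targets
responses = np.random.randint(0, 2, size=(n_persons, n_items))   # stand-in 0/1 data

with pm.Model() as calibration:
    theta = pm.Normal("theta", mu=0.0, sigma=1.0, shape=(n_persons, n_dims))
    a = pm.LogNormal("a", mu=0.0, sigma=0.5, shape=n_items)      # discriminations
    b = pm.Normal("b", mu=0.0, sigma=1.0, shape=n_items)         # difficulties
    logit_p = a * (theta[:, item_dim] - b)
    pm.Bernoulli("obs", logit_p=logit_p, observed=responses)
    idata = pm.sample(draws=2000, tune=2000, chains=8, target_accept=0.9)

# Release gating: every item parameter must pass r-hat / effective-sample-size checks.
print(az.summary(idata, var_names=["a", "b"])[["r_hat", "ess_bulk", "ess_tail"]])
```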
A note on the current calibration data source. The instrument is in early production. The currently-deployed calibration was fit on 1,199 respondent profiles generated by a validated simulator — 699 used for the parametric formats and 500 for the MFC triplets, with disjoint item support — not yet on 1,199 human respondents. Eight NUTS chains × 2,000 post-warmup draws produced posterior means, standard deviations, and 95% credible intervals for every item parameter; those values drive the live engine. Simulated-persona bootstrap is a defensible strategy for opening a new instrument, and it is standard practice for cold-start IRT calibration, but it is a bootstrap: the calibration will be re-estimated on the live respondent pool as soon as that pool is large enough to support convergence, at which point the simulated calibration retires and a human-data calibration takes its place. We disclose this because a reader has a right to know what the current parameter estimates rest on.
Internal theta-recovery diagnostics — correlation between the engine's posterior-mean estimate and the simulator's ground-truth θ across the fifteen dimensions — currently show a median of 0.77 and a minimum of 0.51 on the same dataset the calibration was fit on. These are engine-internal recovery numbers, not convergent-validity numbers against human inventories; the distinction is material and we flag it explicitly. As the live response pool grows, parameters will be re-estimated on the same joint model against real data, and out-of-sample cross-validation against held-out human respondents will join the release gate. Continuous refinement of this kind is standard practice in large-scale adaptive-testing programmes (van der Linden & Glas 2010).
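The recovery figures are computed dimension by dimension; a minimal version of that computation looks like this, with illustrative array names:

```python
import numpy as np

def recovery_correlations(theta_true, theta_hat):
    """Per-dimension Pearson correlation between the simulator's ground-truth
    traits and the engine's posterior-mean estimates; both arrays are
    (n_respondents, 15)."""
    r = np.array([
        np.corrcoef(theta_true[:, d], theta_hat[:, d])[0, 1]
        for d in range(theta_true.shape[1])
    ])
    return np.median(r), r.min(), r
```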
Parameter values themselves — discriminations, difficulties, utilities — are not exposed in respondent-facing outputs. They are withheld both because they are uninformative to the reader and because they would be a non-trivial input to any attempt at gaming the instrument. Only aggregate measurement outputs — a score, a precision estimate, an interpretive summary — appear in a TraitTune report.
6. Reliability, validity, and what this instrument is not
Each dimension's measurement precision is reported as a posterior standard deviation on θ̂, obtained directly from the EAP update. In the Fisher limit this is equivalent to marginal IRT reliability ρ_d = 1 − 1/(1 + I_d(θ̂_d)), which is how the same quantity is often reported in the adaptive-testing literature. Internal-consistency indicators within a session and test-retest stability across repeat completions are computed alongside marginal reliability. Because every score is reported with its associated precision, the reader never has to guess at how much weight to put on a given result. Reporting a score without its associated precision is, in our view, poor psychometric practice, and it is not something we do.
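Because the prior on each θ_d is standard normal, the posterior variance in the Gaussian limit is approximately 1/(1 + I_d), so the reported SD and marginal reliability are direct transforms of one another; a one-line illustration:

```python
def marginal_reliability(posterior_sd):
    """With a standard-normal prior, posterior variance ~ 1/(1 + I) in the
    Gaussian limit, so rho = 1 - 1/(1 + I) = 1 - posterior_sd**2."""
    return 1.0 - posterior_sd ** 2

# e.g. a posterior SD of 0.35 corresponds to a marginal reliability of about 0.88
```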
Personality measurement is probabilistic. A reported score is not a categorical verdict; it is the posterior mean of a continuous estimate, paired with the standard deviation that quantifies how much the responses constrain it. Two respondents with the same point estimate can have meaningfully different posteriors, and the precision figure is what makes that difference visible. Confidence in a measurement is a function of item information and sample of responses, not of how strong a trait sounds in narrative form.
Convergent-validity work — correlations between TraitTune dimensions and established personality inventories — is part of the ongoing research programme and is one of the first things the human-data pool will support once available. It is re-run whenever a dimension is rewritten or a calibration refreshed.
This instrument is not a clinical diagnostic. It measures self-reported personality on a series of well-defined continua, and its results are intended as input to self-reflection, personal development, and — where the respondent chooses — as context for downstream AI personalisation. It is not appropriate as the sole basis for significant life, hiring, or clinical decisions. Clinical evaluation has a different purpose — diagnosis, case formulation, treatment planning — and is the correct tool for those questions; TraitTune is complementary, not a substitute.
7. Lawful basis and special-category data
Personality scores derived from psychometric responses are treated as inferences about a person's mental and behavioural disposition. Under the EU General Data Protection Regulation (GDPR), inferences of this kind fall within the special categories of personal data defined in Article 9. Processing them therefore requires a specific lawful basis above and beyond the ordinary one used for routine personal data.
TraitTune relies on the respondent's explicit consent under GDPR Article 9(2)(a) as that lawful basis. Consent is recorded at a dedicated gate before any item is shown, is granular to the act of computing a personality profile, and can be withdrawn at any time — at which point processing stops and the underlying response and profile data is deleted on the schedule documented in the Privacy Policy. We do not infer personality scores from any other behavioural signal: only the items the respondent answers in the assessment, knowingly and after consent, are used as input to the engine. The full data-protection picture — controller details, retention windows, sub-processors, transfer mechanisms, and respondent rights — lives in the Privacy Policy and is the authoritative document for those questions; this section exists only to make the lawful-basis story explicit alongside the methodology.
Glossary
Plain-language definitions of the technical terms used above.
- Item Response Theory (IRT)
- A family of statistical models for tests and questionnaires in which the probability of a given response to an item is modelled as a function of the respondent's latent trait and a small number of item parameters (typically discrimination and difficulty). Trait estimates and item parameters are recovered jointly from the response data.
- Multidimensional IRT (MIRT)
- An extension of IRT in which each item can load on more than one latent dimension. Used when traits are correlated and items measure several traits at once, as is the case for personality.
- Computerized Adaptive Testing (CAT)
- A testing format in which the next item shown is selected on the fly using the current trait estimate and the remaining item bank. Each item is chosen to be maximally informative for that particular respondent, so the same precision is reached with fewer items than a fixed-form test.
- Multidimensional Forced-Choice (MFC)
- An item format in which a small block of statements (typically two to four) is presented together and the respondent ranks them or picks the most and least characteristic. Forced-choice blocks reduce response-style and social-desirability artefacts that contaminate single-statement Likert items.
- Thurstonian IRT
- An IRT model for forced-choice and pairwise-preference data, originally proposed for paired comparisons by Thurstone (1927) and re-developed for multidimensional personality assessment by Brown and Maydeu-Olivares (2011, 2013). Recovers latent traits from comparative judgments rather than from absolute Likert ratings.
- Best-Worst Scaling (BWS)
- A judgment task in which the respondent picks the most and least applicable option from a small set. The data are then analyzed under a discrete-choice or Thurstonian model. A best-plus-worst pick yields more pairwise information per block than a single most-preferred choice.
- Fisher information
- A measure of how much a given item is expected to reduce uncertainty about the respondent's trait estimate at the current ability level. Used by adaptive engines to choose the next item — the candidate with the highest expected information at the current trait estimate is selected.
- Hierarchical Bayesian estimation
- An estimation approach in which item parameters and trait estimates are treated as random variables drawn from prior distributions, and posterior distributions are obtained jointly via Markov chain Monte Carlo or variational methods. Gives credible intervals for every parameter and trait, not just point estimates.
- Latent trait
- An unobservable underlying property of a respondent (e.g. extraversion, conscientiousness) that the test attempts to estimate from the observable responses to items. Each respondent has a position on each trait dimension; the test returns an estimate with an associated uncertainty.
- Big Five (Five-Factor Model)
- An empirical taxonomy of personality based on five broad dimensions — typically openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism — that have been recovered repeatedly across languages, cultures, and instruments. Treated by TraitTune as a public-domain dimensional reference, not a commercial product.
- HEXACO
- A six-factor personality taxonomy that adds an honesty-humility dimension to the Big Five and re-organizes the others. Used as an additional reference frame for the dimensional structure.
Selected references
All references are published works in the psychometric and personality-research literature. They underpin the specific methods described above.
Foundations of comparative judgment and latent-trait modelling
- Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34, 273–286.
- Lord, F. M. (1980). Applications of Item Response Theory to practical testing problems. Erlbaum.
- Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Danish Institute for Educational Research.
Item Response Theory models used in the instrument
- Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 17.
- Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In Lord & Novick, Statistical theories of mental test scores. Addison-Wesley.
- Muraki, E. (1992). A generalized partial credit model: application of an EM algorithm. Applied Psychological Measurement, 16, 159–176.
- Reckase, M. D. (2009). Multidimensional Item Response Theory. Springer.
- Ackerman, T. A. (1994). Using multidimensional item response theory to understand what items and tests are measuring. Applied Measurement in Education, 7, 255–278.
Forced-choice and Thurstonian modelling
- Brown, A., & Maydeu-Olivares, A. (2011). Item response modeling of forced-choice questionnaires. Educational and Psychological Measurement, 71, 460–502.
- Brown, A. (2016). Item response models for forced-choice questionnaires: a common framework. Psychometrika, 81, 135–160.
- Maydeu-Olivares, A. (1999). Thurstonian modeling of ranking data via mean and covariance structure analysis. Psychometrika, 64, 325–340.
- Stark, S., Chernyshenko, O. S., & Drasgow, F. (2005). An IRT approach to constructing and scoring pairwise preference items involving stimuli on different dimensions: The Multi-Unidimensional Pairwise Preference model. Applied Psychological Measurement, 29, 184–203.
- Stark, S., Chernyshenko, O. S., & Drasgow, F. (2012). Adaptive testing with multidimensional pairwise preference items: improving the efficiency of personality and other noncognitive assessments. Organizational Research Methods, 15, 463–487.
Best-Worst Scaling
- Finn, A., & Louviere, J. J. (1992). Determining the appropriate response to evidence of public concern. Journal of Public Policy and Marketing, 11, 12–25.
- Marley, A. A. J., & Louviere, J. J. (2005). Some probabilistic models for best, worst, and best-worst choices. Journal of Mathematical Psychology, 49, 464–480.
- Louviere, J. J., Flynn, T. N., & Marley, A. A. J. (2015). Best-Worst Scaling: Theory, Methods and Applications. Cambridge University Press.
Forced-choice format methodology and its motivation
- Hicks, L. E. (1970). Some properties of ipsative, normative, and forced-choice normative measures. Psychological Bulletin, 74, 167–184.
- Meade, A. W. (2004). Psychometric problems and issues involved with creating and using ipsative measures for selection. Journal of Occupational and Organizational Psychology, 77, 531–552.
- Jackson, D. N., Wroblewski, V. R., & Ashton, M. C. (2000). The impact of faking on employment tests: does forced choice offer a solution? Human Performance, 13, 371–388.
- Christiansen, N. D., Burns, G. N., & Montgomery, G. E. (2005). Reconsidering forced-choice item formats for applicant personality assessment. Human Performance, 18, 267–307.
Adaptive testing
- Wainer, H., et al. (2000). Computerized Adaptive Testing: A Primer (2nd ed.). Erlbaum.
- van der Linden, W. J., & Glas, C. A. W. (2010). Elements of Adaptive Testing. Springer.
Item-pool construction and scale quality
- Clark, L. A., & Watson, D. (1995). Constructing validity: basic issues in objective scale development. Psychological Assessment, 7, 309–319.
- DeVellis, R. F. (2017). Scale development: theory and applications (4th ed.). SAGE.
Bayesian estimation
- Fox, J.-P. (2010). Bayesian Item Response Modeling: theory and applications. Springer.
- Gelman, A., et al. (2013). Bayesian Data Analysis (3rd ed.). CRC Press.
Personality frameworks the dimensional structure draws on
- Goldberg, L. R. (1993). The structure of phenotypic personality traits. American Psychologist, 48, 26–34.
- McCrae, R. R., & Costa, P. T. (2008). The Five-Factor Theory of personality. In Handbook of Personality: Theory and Research (3rd ed.). Guilford.
- Ashton, M. C., & Lee, K. (2007). Empirical, theoretical, and practical advantages of the HEXACO model of personality structure. Personality and Social Psychology Review, 11, 150–166.