Concordance exp

Concordance, in the context of statistics, is to study agreements between judges, scales, and measurements. This page provides support for all concordance programs provided by StatTools.

The information on this page is organized in separate panels according to the numerical nature of the measurements, and consists of the following

A commonly used test for agreement, which can be applied to all data, and cope with multiple groups, is Cronbach's Alpha. StatTools provides 3 programs and 1 detailed explanations for Cronbach's Alpha. They are

Cronbach's Alpha Program Page calculates alpha from data
Sample Size for Estimating a Cronbach's Alpha Explained and Tables Page explains sample size estimation for alpha
Sample Size for Estimating a Cronbach's Alpha Program Page calculates sample size required to estimate alpha
Sample Size for Comparing Two Cronbach's Alphas Program Page calculates sample size required for comparing two alphas

Nominal Data : where the numbers used are labels or names, and the scales are unrelated. Typical nominal numbers are 0=no illness, 1=appendicitis, 2=urinary infection, or 1=Caucasian, 2=Asian, 3=African, and so on. StatTools provides 1 program evaluating concordance for nominal data, Kappa

Binary Data : where the numbers used are 0/1, representing binary outcomes of no/yes, false/true, male/female, negative/positive, and so on. StatTools provides 1 program evaluating concordance for binary data, Kuder-Richardson Coefficient

Ordinal Data : where the number and whole integers represents ranks or order, so that 3>2>1, but the distance between the numbers are not specified, 3-2 is not necessarily the same as 2-1. The simplest ordinal scale is a 3 point scale of 0=no, 1=perhaps, 2=yes. The most commonly used ordinal scale is Likert, where 0=strongly disagree, 1=disagree, 2=neutral, 3=agree, and 4=strongly agreee. StatTools provides 3 programs evaluating concordance for ordinal data

Kappa (Cohen and Fleiss) for Ordinal Data Program Page provides 2 programs, Cohen's Kappa for two groups, and Fleiss Kappa for multiple groups
Kendall's W for Ranks Program Page evaluates concordance in ranks between 2 or more groups. This program will accept measurements, as all measurements are ranked before statistical calculations are performed.

Continuous Data : where the numbers are continuous (or near continuous) measurements that are normally distributed. StatTools provides 3 programs evaluating concordance for continuous data

Intraclass Correlation Program Page

Intraclass Correlation Coefficient

Agreement Program Page

Bland Altman Plot

Lin's Concordance Coefficient

Details for each program and procedure are discussed in their individual panels

Historical notes :

In 1937, Kuder and Richardson proposed a coefficient to evaluate the reliability of measurements that composed of multiple binary items. In 1941 Hoyt modified this coefficient, adjusting it for continuity, and name this the Kuder Richardson Hoyt Coefficient. Cronbach in 1951 showed that this coefficient can be used generally in all scaled measurements. As he intended this be a starting point to develop even better indices, he named it Coefficient Alpha. This index is now known as Cronbach's Alpha, and is a widely accepted measurement of internal consistency (reliability) of a multivariate measurement composing of correlated items. If Cronbach's Alpha is applied to binary data, the result is the same as the Kuder Richardson Coefficient (K-R 20).

The initial Cronbach's Alpha, calculated from the covariance matrix, is now known as the Unstandardized Alpha. This value tends to be unstable, and influenced by the scalar measurements used. A better Alpha is considered to be the Standardized Alpha, calculated from the correlation matrix. This is thought to be better as all variables are standardized to a mean of 0 and Standard Deviation of 1, the resulting Alpha is independent of the scales used.

Both indices can be used to measure the internal consistency of multiple-item measurements, representing the averaged correlation between the items. As multiple-item measurements are in theory repeated measurements of the same thing, these indices represents the reliability of the overall set of measurements.

Indices of reliability are often used in the early stages of developing a multiple-items measurement, to ensure that all the items measures a common concept. Items are added, removed, and modified, according to whether the indices of reliability improves, and usually until Alpha is greater than 0.7.

A recent development is the calculations of the Standard Error of Alpha, and from which the confidence interval. This algorithm, by Duhachek and Iacobucci, is now included in StatTools

The development of the Standard Error measurement allows statistical comparison and significance testing. As well as the 95% confidence interval, z=Alpha / SE can be calculated, and the probability that this does not differ from zero follows the normal z distribution.

Example

Data entry and interpretation of results are best demonstrated using the default example data from the Cronbach's Alpha Program Page . In this example, we administered 4 multiple choice questions to 20 students, using 0 or wrong answer and 1 for correct answer. The data is therefore a table of 20 rows, each from a student, and 4 columns, each for one of the tests. We with to know if the tests are similar in difficulty, that is, if the correct and incorect answers agree.

The program first produces the covariance matrix, the diagonal of which is the variance of each measurement (test), and the off diagonal cells the correlation coefficient between the measurements (tests). Please note that the covariance matrix can also be used as a second option for data entry.

The program then calculates the unstandardized Alpha, which is Unstandardized Alpha = 0.61, n=20, SE=0.14, 95%CI=0.33, to 0.89

The program then converts the covariance matrix to a correlation matrix, from which the standardized Alpha is produced. Standardized Alpha = 0.60, n=20, SE=0.16, 95%CI=0.28 to 0.91

Sample Size Calculations for Cronbach's Alpha StatTools provides two sets of sample size programs related to Cronbach's Alpha

The Sample Size for Estimating a Cronbach's Alpha Program Page allows users to estimate the sample size required to calculate a Cronbach's alpha, and also to calculate the 95% confidence interval of Alpha once the data is collected. For a quick reference, tables of sample size requirements are also available in the Sample Size for Estimating a Cronbach's Alpha Explained and Tables Page
The Sample Size for Comparing Two Cronbach's Alphas Program Page allows the user to estimate sample size required to compare two Cronbach's Alphas, and to estimate the 5% confidence interval of the difference from data collected. No table is provided, because the complex combination of sample size difference and number of variables in each Alpha are such that multiple tables will be required. The results required by users are therefore directly obtained from the program

Kappa for nominal data was first described by Fleiss in 1969. Fleiss went on to describe another Kappa for ordinal data, and his name is often associated with this second Kappa. The first Kappa, which is discussed in this panel, is generally known as Kappa for Nominal Data. The program is in the Kappa for Nominal Data Program Page

Kappa is a measurement of concordance or agreement between two or more judges, in the way they classify or categorise subjects into different groups or categories. The following terms are often used

Nominal data These are data sets where the numbers are names and not scales, and they have no mathematical relationship with each other. For example, in the classification of religion, 1 may be Buddhism, 2 may be Ancestor Worship, and 3 may be Hinduism. The difference between 1 and 2, 2 and 3 or 1 and 3 have no particular order, and the only thing that can be said is that they represent things that are different.
Judges are persons or instruments that classify a subject into a particular group or category. For example, a questionnaire that solicit beliefs and activities may be used to classify someone into a particular religion. Similarly, a person, after conversing with a subject, may also perform a classification. These would be two judges.
Subjects are the subjects that are classified. They are patients, school children, members of the public, monkeys, rats, and so on.
Groups or categories are the classification that the judges place the subjects in.
Concordance is a measure of how much classifications produced by different judges agree. Commonly, concordance is expressed as a number between 0 and 1, where 0 represents no agreements at all, and 1 represent complete agreement.

Data entry and interpretation are best described by using the default example provided in the Kappa for Nominal Data Program Page . A school provides 5 councillors which assesses then advises students on future careers, and we wish to evaluate how much the assessments of the students from these 5 councillor agree with each other.

We used a class of 10 students in their final school year, and each of the 5 councillors interviews every student, and classify them into the following categories.

1 = suitable for the caring profession
2 = suitable for engineering
3 = suitable for business

Data entry for Kappa Data entry is as shown in Example of the Kappa for Nominal Data Program Page . The standard data is a table where there are 10 rows, each represents data for a student, and 5 columns, each represents a councillor. The cells contains 1, 2, or 3, according to how each councillor classify that student. For example

The first subject was categorised 1,2,2,2,2 by the 5 judges, suitable for the caring profession by the first judge, and engineering by the other 4.
The second subject categorised 1,1,3,3,3, caring profession by the first two judges, and business by the other 3, and so on.

The alternative data entry is a table of counts. The table has 10 rows, representing the 10 students, and 3 columns representing the 3 classifications. The cell contains the number of time each students is classified in that category. In this example, the first column is for caring profession, the second engineering, and third business, so that

The first student (1,4,0), is classified for caring profession once, and engineering 4 times
The second (2,0 3), caring profession twice, and business 3 times,
and so on

Kappa in this example is 0.41, with a Standard Error of 0.08, and the 95% confidence interval od 0.27 to 0.57

This Kappa is a measurement of agreement between the 5 counsellors. Conventionally, a Kappa of <0.2 is considered poor agreement, 0.21-0.4 fair, 0.41-0.6 moderate, 0.61-0.8 strong, and more than 0.8 near complete agreement.

Given Kappa is an estimate from a sample, the se=Standard Error provides an estimate of error.

The 95% confidence interval is Kappa +/- 1.96 se. If a different confidence interval is required, the table for probability of z in the Probability of z Explained and Table Page can be consulted. Although concordance is usually used as a scalar measurement of agreement, a 95% confidence interval of Kappa that does not cross the zero value does allow a conclusion that significant concordance exists.

The Kuder Richardson Coefficient of reliability (K-R 20) is used to test the reliability of binary measurements such as exam questions, to see if the items within the instruments obtained the same binary (no/yes, right/wrong) results over a population of testing subjects.

The formula for the coefficient can easily be obtained from Wikipedia on the Internet.

Please Note that the K-R 20 was first described in 1937. Hoyt in 1940 modified the formula so that it can be applied to measurements that are not binary. Hoyt's modification eventually was popularised and is now known as Cronbach's Alpha. Cronbach's Alpha, when applied to binary data, will therefore produce the same result as KR-20. Cronbach's Alpha is now much preferred, and will be discussed in its own panel on this page.

Example

Data input and interpretation of results are best demonstrated using the default example in the Kuder Richardson Coefficient for Binary Data Program Page

	T1	T2	T3	T4
Student 1	wrong	correct	correct	wrong
Student 2	correct	correct	correct	correct
Student 3	wrong	correct	wrong	wrong
Student 4	wrong	wrong	correct	wrong
Student 5	correct	correct	correct	correct

We have 4 multiple choice questions (T1 to T4), administered to 5 students. 0 represents wrong answer and 1 correct answer, as shown in the table on the left.

1	1	1	1
0	1	0	0
0	0	1	0
1	1	1	1

The data set to be used is as shown in the table to the right, and the results are K-R 20 = 0.75

The interpretation of the K-R 20 value is similar to that of Kappa. A K-R 20 of <0.2 is considered poor agreement, 0.21-0.4 fair, 0.41-0.6 moderate, 0.61-0.8 strong, and more than 0.8 near complete agreement.

The original descriptions of K-R 20 provided no test of statistical significance or confidence interval, although these can be obtained using the Cronbach's Alpha algorithm.

Cohen & Fleisss Kappa Kendall W

Cohen's Kappa

Cohen's Kappa is a measurement of concordance or agreement between two raters or methods of measurement. The method can be applied to data that are not normally distributed, even binary (no/yes), but is best suited to a close ended ordinal scale, such as the 5 point Likert Scale.

The algorithm is well described by the original papers, text books, and on the Internet (see references). The book by Fleiss is particularly useful as it combines all the developments and enhancements, including the algorithm for estimating variance.

There are two ways of calculating Cohen's Kappa, and these produce different results. The first is by Cohen's original 1960 algorithm, now generally known as the unweighted Kappa. The second is by the weighted method, also described by Cohen but later in 1968, which includes a weighting for each cell, where weight for the cell i,j ( w_ij = 1 - |i-j|/(g-1) ), g being the number of categories of scores. Cohen argued that the weighted Kappa should be used particularly if the variables have more categories than binary (more than yes and no), because the distance from agreement should be taken into consideration. The results of both calculations are presented, and the recommendation is to use the weighted value.

Fless's Kappa

Fleiss's Kappa is an extension of Cohen's kappa to evaluate concordance or agreements between multiple raters, but no weighting is applied. Therefore, Fleiss's Kappa is similar to Cohen's unweighted Kappa (except for rounding errors) if the same data from two raters are submitted to the Fleiss algorithm.

Nomenclature

Ordinal data These are data sets where the numbers are in order, but the distances between numbers are unstated. In other words 3 is bigger than 2 and 2 is bigger than 1, but 3-2 is not necessarily the same as 2-1.

A common ordinal data is the Likert scale, where 1=strongly disagree, 2=disagree, 3=neutral, 4=agree, and 5=strongly agree. Although these numbers are in order, the difference between strongly agree and agree (5-4) is not necessarily the same as between disagree and strongly disagree (2-1).

Instrument is any method of measurement. For example, a ruler, a Likert Scale (5 point scale from strongly disagree to strongly agree), or a machine (e.g. ultrasound measurement of bone length).

raters are one or more of the instruments. This is usually a person, hopefully trained, that determines what score should be given to a subject. A carer evaluating a pain score is a rater, a judge in a beauty contest is a rater.

Subjects are the subjects of the measurements. They are patients, school children, members of the public, monkeys, rats, and so on.

Scores or measurements are the quantities produced by the instruments/raters. Measurements usually means something that are measured physically or chemically. Scores usually is a results produced by human decision.

Concordance This usually means how much scores or measurements produced by different instruments agree. Commonly, concordance is expressed as a number between 0 and 1, where 0 represents no agreements at all, and 1 represent complete replications. In some concordance measurements a negative value may be produced, which signifies opposite results.

the examples from the Kappa (Cohen and Fleiss) for Ordinal Data Program Page will be used to demonstrate data entry and interpretation of results.

Examples 1. Cohen's Kappa : Data entry option 1 The data consists of two obstetricians (raters), palpating the abdomen of 30 pregnant women (subjects), and rated each baby as growth retarded (0), normal (1) or macrosomic (2). The data is a table, where each row represent each pregnant abdomen palpated, the two columns representing the two obstetricians, and the value in each cell the classification given to that baby by that obstetrician.

The result consists firstly the display of the count matrix, with rows representing obstetrician 1's scoring, column obstetrician 2's scoring, and the cell the number of cases so scored by the two obstetrician. The diagonal cells representing where the two agree, the other cells where the two do not agree. Weighted Cohen's Kappa = 0.28, 95%CI = -0.01 to 0.57. As the 95% confidence interval overlaps the null value, the conclusion is that there is no agreement between the two obstetricians.

Example 2. Cohen's Kappa : Data Entry option 2. In this case, the data is a symmetrical matrix of counts. where two midwives reviewed 85 women at the beginning of labour as to how likely the delivery will require a Caesarean Section no risk at all (1), minimal risk (2), high risk (3) and almost certain (4). The data is a symmetrical table, where rows represents the evaluation of midwife 1, and columns the evaluation of midwife 2. The diagonals are where the two agreed (no risk (25), minimal risk (9) high risk (12) certain (21). The cells below the diagonals are the counts where midwife 1 evaluated a higher risk than midwife 2, and those above the diagonal the other way around.

Weighted Cohen's Kappa = 0.82 95% confidence interval = 0.74 to 0.90. As this interval does not overlap the null value (0), the conclusion that the risk assessment of these two midwives significantly agree can be made.

Example 3. Fleiss's Kappa : Data Entry option 1 and 2. The data table is similar to that for option 1 in Cohen's Kappa, except that more than two raters are involved. In this example, we have 5 midwives examining 10 pregnant abdomen, and classify each baby as growth retarded (0), normal (1) or macrosomic (2). The data is therefore a 5 column table, each row representing a baby being assessed each column one of the midwives, and the cell contains the scores.

The program first creates a count array, where the rows represent babies, each column represent the score (in this case 3 scores of 0, 1, and 2), and the cells the number of times that baby received that score. The sum of each row must therefore be 5 for the 5 rating midwives.

In data entry option 2, the counting table can be entered directly.

For both data entry options, Fleiss Kappa for this example is 0.42, 95% confidence interval = 0.28 to 0.56. As this interval does not cross the null (0) value, the conclusion that midwives agree significantly with each other can be made

Introduction Intraclass Correlation Bland Altman Plot Lin's Concordance Index

There are a number of simple statistical methods of evaluating agreements between normally measurements, correlation, regression, and paired differences immediately coming to mind. Each of these method however evaluates only one aspect of the difference or agreement between two sets of measurements.

Most users however requires methods that evaluates agreements in a more nuanced and comprehensive manner. StatTools presents 3 such methods, and they are are

Intraclass Correlation Coefficient
Bland and Altman Plot
Lin's Concordance Coefficient

Details of these methods presented in their separate panels on this page. A comparison of their relative strength and weaknesses are well described by Morgan and Aban (see references), and not repeated here.

Contents of G : 6

Contents of H : 7

Cronbach's Alpha

Cronbach LJ (1951) Coefficient Alpha and the Internal Structure of Tests. Psychometrika 16:297-334
Allinson PD (1975) A Simple Proof of the Spearman-Brown Formula for Continuous Test lengths. Psychometrika 40:4 135-136
Bland JM and Altman DG (1997) Statistics Notes: cronbach's Alpha. BMJ 314(7080) 572
Lopez M (2007) 20th Annual Conference of the National Advisory Committee on Computing Qualifications (NACCQ 2007), Nelson, New Zealand. (www.naccq.ac.nz)
Duhachek A and Iacobucci D (2004) Alpha's Standard Error (ASE): An Accurate and Precise Confidence Interval Estimate Journal of Applied Psychology vol 89 No. 5 p 792-808
http://en.wikipedia.org/wiki/Cronbach%27s_alpha
http://en.wikipedia.org/wiki/Cyril_Hoyt

Sample Size for Cronbach's Alpha

Bonett D G (2002) Sample Size Requirements for Comparing Two Alpha Coefficients Applied Psychological Measurement, Vol. 27 No. 1, January 2003, 72 - 74
Duhachek A and Iacobucci D (2004) Alpha's Standard Error (ASE): An Accurate and Precise Confidence Interval Estimate. Journal of Applied Psychology Vol. 89, No. 5, 792-808
Johanson GA and Brooks GP (2010) Initial Scale Development: Sample Size for Pilot Studies. Educational and Psychological Measurement Vol.70,Iss.3;p.394-400 Contents of References : 7

Kappa for Nominal Data

Fleiss J L (1971) Measuring nominal scale agreement amongst many ratters. Psychological Bulletin 76:378-382
Siegel S and Castellan Jr. N.J. Nonparametric Statistics for the Behavioral Sciences (1988) International Edition. McGraw-Hill Book Company New York. ISBN 0-07-057357-3 p. 284-291

Kuder-Richardson Coefficient

Kuder, G. F. ; M. W. Richardson (1937)The theory of the estimation of test reliability. Psychometrika, 2: 151-60
Brunning JL, Kintz BL (1996) Computational Handbook of Statistics 4th Edition.Addison-Wesley Educational Publishers Inc, New York. ISBN 0-673-99085-0 p 78-81.
Feldt LS. The approximate sampling distribution of Kuder-Richardson reliability coefficient twenty. Psychometrika 1965;30:357-371.
Andrich D. (1982) An Index of Person Separation in Latent Trait Theory, the Traditional KR-20 Index, and the Guttman Scale Response Pattern. Education Research and Perspectives, 9:1, 95-104.
For those who want to write their own program but have difficulties in getting hold of old journal articles or textbooks, the formulae are on the www, K-R 20 on Wiki

Cohen and Fleiss Kappa

Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement. 20:37-46, 1960.
Cohen J. Weighted kappa: nominal scale agreement with provision for scale and disagreement or partial credit. Psychol. Bull. 70:213-20. 1968.
Fleiss, Joseph L.; Cohen, Jacob; Everitt, B. S. (1969) Large sample standard errors of kappa and weighted kappa. Psychological Bulletin, Vol 72(5): p 323-327
Fleiss JL Statistical methods for rates and proportions second edition. Wiley Series in probability and mathematical statistics. Chapter 13 p. 212-236
Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977; 33: 159-74.
wikipedia.org/wiki/Cohen's_kappa
University of York Department of Health Sciences Measurement in Health and Disease Cohen's Kappa a teaching paper with easy to understand and full formulation for Cohen's Kappa, weighted and unweighted, and Standard Error calculations.

Kendall's W

Siegel S and Castellan Jr. N.J. Nonparametric Statistics for the Behavioral Sciences (1988) International Edition. McGraw-Hill Book Company New York. ISBN 0-07-057357-3 p. 262-272
Siegel S and Castellan Jr. N.J. Nonparametric Statistics for the Behavioral Sciences (1988) International Edition. McGraw-Hill Book Company New York. ISBN 0-07-057357-3 Table T. Critical values for Kendall coefficient of concordance W p. 365

Intraclass Correlation Coefficient

Portney LG & Watkins MP (2000) Foundations of clinical research Applications to practice. Prentice Hall Inc. New Jersey ISBN 0-8385-2695-0 p 560-567
Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977; 33: 159-74.

Bland and Altman Plot

Hanneman S K. Design, Analysis and Interpretation of Method-Comparison Studies. AACN Adv Crit Care. 2008 Apr-Jun; 19(2): 223-234. [ https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2944826/]
Altman DG, Bland JM (1983). "Measurement in medicine: the analysis of method comparison studies". The Statistician. 32 (3): 307-317. doi:10.2307/2987937. JSTOR 2987937. [https://www-users.york.ac.uk/~mb55/meas/ba.pdf]
Bland JM, Altman DG (1986). "Statistical methods for assessing agreement between two methods of clinical measurement" (PDF). Lancet. 327 (8476): 307-10. [https://www-users.york.ac.uk/~mb55/meas/ba.pdf]
[Wikipedia for Bland Altman Plot]

Lin's Coefficient of Concordance

McBride GB (2005) A proposal for strength-of-agreement criteria for Lin's Concordance Correlation Coefficient. NIWA Client Report: HAM2005-062. (PDF) [ https://www.medcalc.org/download/pdf/McBride2005.pdf]
Lin L.I-K (1989) A concordance correlation coefficient to evaluate reproducibility. Biometrics 45:255-268. PubMed
Lin L.I-K (2000) A note on the concordance correlation coefficient. Biometrics 56:324-325.
[ https://en.wikipedia.org/wiki/Concordance_correlation_coefficient]
Chapter 812 Lin's Concordance Correlation Coefficient. PASS Sample Size Software NCSS.com 812-1 NCSS, LLC. [ https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/PASS/Lins_Concordance_Correlation_Coefficient.pdf]
[https://services.niwa.co.nz/services/statistical/concordance An on line calculator]
Morgan C J, Aban I. Methods for evaluating the agreement between diagnostic tests. Journal of Nuclear Cardiology (June 2016), Volume 23, Issue 3, pp 511-513 [ https://link.springer.com/article/10.1007/s12350-015-0175-7]