Friday, December 30, 2011

False positive

A false positive occurs when a healthy person is diagnosed as a patient.


A false-positive rate of 0.4-0.5 is rated fail, 0.3-0.4 poor, 0.2-0.3 fair, 0.1-0.2 good, and 0-0.1 excellent.



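王瑋瀚, 花茂棽, 楊啓正, 朱怡娟, 鄭婷文, 葉炳強, . . . 徐文俊. (2008). Clinical applicability of the Arithmetic and Digit Span subtests of the Taiwan WAIS-III Chinese version, and of their combination for estimating the Working Memory Index: a retrospective study [in Chinese]. Chinese Journal of Psychology, 50, 187-199.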

Gyory, A. Z., Hadfield, C., & Lauer, C. S. (1984). Value of urine microscopy in predicting histological changes in the kidney: double blind comparison. British Medical Journal, 288, 819-822.

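Thursday, December 29, 2011

Item response theory

Item response theory (IRT) is a general statistical theory about examinee item and test performance and how that performance relates to the abilities measured by the items in the test.
IRT is a modern test theory and an umbrella term for a family of statistical models in psychology, used mainly to analyze questionnaire data. These models aim to determine whether a latent trait can be revealed through test items, and to describe the interaction between test items and examinees.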
Hambleton RK, Jones RW. Comparison of classical test theory and item response theory and their applications to test development. Educational Measurement: Issues and Practice, 12, 38-47.
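Criteria: IRT applies a mathematical function, the item characteristic function, to the examinee's pattern of responses in order to infer the examinee's ability. After standardization, latent trait values can range from -4 to +4. IRT estimates each examinee's position on the continuous latent trait (that is, the level of ability) and also estimates two important item parameters: item difficulty and discriminating power.
Clinical implications: IRT can distinguish and predict each examinee's ability. In theory, whatever the examinee's level, a score can be computed without completing the whole test, which shortens administration time and spares the subject's energy. In addition, item difficulty and discrimination under IRT are not affected by differences between samples.
Study design: Collect a large sample (single assessment) of subjects with varying levels of severity or function.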
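
The item characteristic function described above can be illustrated with a short sketch. The two-parameter logistic (2PL) form used here is an assumption, chosen because it uses exactly the two item parameters named in the criteria (difficulty and discriminating power). The function name `item_characteristic` and the item values are hypothetical and do not come from the cited papers.

```python
import math


def item_characteristic(theta: float, difficulty: float, discrimination: float = 1.0) -> float:
    """Probability of a correct response under an assumed 2PL logistic model."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# Evaluate one hypothetical item across the standardized ability range -4 to +4.
for theta in (-4, -2, 0, 2, 4):
    p = item_characteristic(theta, difficulty=0.5, discrimination=1.2)
    print(f"theta={theta:+d}  P(correct)={p:.3f}")
```

When ability equals the item difficulty the probability is 0.5, and a larger discrimination value makes the curve steeper around that point.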
1.  McHorney CA. Use of item response theory to link 3 modules of functional status items from the Asset and Health Dynamics among the Oldest Old study. Arch Phys Med Rehabil. 2002, 83, 383-394.
2.  Hays RD, Liu H, Spritzer K, Cella D. Item response theory analyses of physical functioning items in the medical outcomes study. Medical Care. 2007, 45: S32-38.

Wednesday, December 28, 2011

Intra-rater reliability

Intra-rater reliability: The stability of data recorded by one observer across two or more trials in which the variables being rated are fixed and time is the only factor that varies between administrations.

Brink Y, Louw QA. Clinical instruments: reliability and validity critical appraisal. Journal of evaluation in clinical practice 2011

The intraclass correlation coefficient (ICC) can be used to verify it. An ICC ≧0.75 indicates good intra-rater reliability.
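
As a rough sketch of how an ICC can be computed from repeated trials: the text does not say which ICC form to use, so the two-way random-effects, absolute-agreement, single-measurement form, often written ICC(2,1), is assumed here. The function name `icc_2_1` and the scores are hypothetical.

```python
def icc_2_1(scores):
    """ICC(2,1) from a table with one row per subject and one column per trial."""
    n = len(scores)       # number of subjects
    k = len(scores[0])    # number of trials (or raters)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares: subjects, trials, and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical data: five subjects scored twice by the same rater.
trials = [[10, 11], [14, 14], [18, 17], [22, 23], [25, 25]]
value = icc_2_1(trials)
print(f"ICC(2,1) = {value:.3f}; good intra-rater reliability: {value >= 0.75}")
```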

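Study design: One rater assesses the same subject two or more times. The interval between assessments is usually short, for example within one week, and the ability or trait being assessed must not change over that time.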
Kruitwagen-van Reenen ET, Post MW, Mulder-Bouwens K, Visser-Meily JM. A simple bedside test for upper extremity impairment after stroke: validation of the Utrecht Arm/Hand Test. Disability and rehabilitation 2009;31:1338-1343.

Model paper:
Thornton M, Sveistrup H. Intra- and inter-rater reliability and validity of the Ottawa Sitting Scale: a new tool to characterise sitting balance in acute care patients. Disability and rehabilitation 2010;32:1568-1575.

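Clinical implications: The higher the intra-rater reliability, the more precise the instrument's results when assessing a subject's ability or trait.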


Differential item functioning (DIF): a collection of statistical methods used to determine whether examination items are appropriate and fair for testing the knowledge of different groups of examinees (e.g., male vs. female or Caucasian vs. African-American).


Reference: Perrone, M. (2006). Differential item functioning and item bias: critical considerations in test fairness. Applied Linguistics, 6, 1-3.

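Criteria: If the item characteristic curves (ICCs) of an item differ across groups, the item shows DIF. Conversely, if the curves are the same across groups, the item shows no DIF.
DIF detection therefore amounts to testing whether the item characteristic curves differ. Detection methods include statistical tests that compare item parameters (Lord's χ² test), the area between ICCs (the ICC area measure), the likelihood ratio test, the Mantel-Haenszel method, the standardization method, logistic regression, and SIBTEST (simultaneous item bias test).

If individuals from different groups who have the same ability differ in their probability of answering an item correctly, the item is biased and shows DIF. Items with DIF are deleted, because such items affect different groups differently and are interpreted differently across them.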

Model paper: Crane, P. K., van Belle, G., & Larson, E. B. (2004). Test bias in a cognitive test: differential item functioning in the CASI. Statistics in Medicine, 23, 241-256.

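Study design: Divide subjects into groups by a characteristic (for example, sex, ethnicity, or diagnosis), administer the instrument, and compare whether each item of the instrument shows DIF across the groups.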

Friday, December 23, 2011


Receiver operating characteristic curve (ROC curve): Values for sensitivity and for false-positive rates (1-specificity) are plotted on the y and x axis of the curve, respectively.

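Explanation: Known in Chinese as 接受器操作特性曲線 and abbreviated as the ROC curve. The curve is drawn through the points formed by pairing sensitivity with the false-positive rate (1 - specificity). The y axis is sensitivity and the x axis is the false-positive rate.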

Reference: Husted, J. A., Cook, R., J., Farewell, V. T., & Gladman, D. D. (2000). Methods for assessing responsiveness: a critical review and recommendations. Journal of Clinical Epidemiology, 53, 459-468.

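Criteria: The ROC curve is judged against the diagonal as a reference line. If an instrument's ROC curve lies on the diagonal, the instrument has no discriminative ability for the disease (see the figure below). The closer the ROC curve moves toward the upper left of the plot, the higher the instrument's rate of correct positive judgments and the lower its rate of false judgments; that is, the better its discriminative ability.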

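Besides judging an instrument by the shape of the curve, the area under the curve (AUC) can be used to rate diagnostic discrimination. AUC ranges from 0 to 1, and larger values indicate better diagnostic discrimination.
AUC = 0.5                 no discrimination
0.7 ≦ AUC < 0.8           acceptable discrimination
0.8 ≦ AUC < 0.9           excellent discrimination
AUC ≧ 0.9                 outstanding discrimination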
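
A minimal sketch of the AUC calculation, assuming that higher instrument scores point toward disease. It uses the pairwise-ranking interpretation of the AUC (the share of patient/non-patient pairs in which the patient scores higher, with ties counted as one half) rather than integrating a plotted curve. The function names and scores are hypothetical.

```python
def auc(patient_scores, control_scores):
    """AUC as the share of patient/control pairs ranked correctly (ties count 0.5)."""
    wins = 0.0
    for p in patient_scores:
        for c in control_scores:
            if p > c:
                wins += 1.0
            elif p == c:
                wins += 0.5
    return wins / (len(patient_scores) * len(control_scores))


def rate_auc(value):
    """Map an AUC to the labels in the table above; values under 0.7 have no label there."""
    if value >= 0.9:
        return "outstanding discrimination"
    if value >= 0.8:
        return "excellent discrimination"
    if value >= 0.7:
        return "acceptable discrimination"
    return "below the acceptable range"


# Hypothetical screening scores for diagnosed patients and for non-patients.
patients = [7, 9, 6, 8, 5]
controls = [3, 5, 4, 6, 2]
area = auc(patients, controls)
print(f"AUC = {area:.2f} ({rate_auc(area)})")
```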

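1. Medical diagnosis. Once the ROC curve is obtained, the optimal cutoff point can be calculated. For new, unknown cases, this cutoff serves as the criterion for diagnosing whether the person has the disease.
2. Verifying whether a screening tool has good diagnostic discrimination: the higher the AUC, the better the tool's diagnostic discrimination.
3. Analyzing the relationship between an assessment instrument and an external criterion (testing external responsiveness) to determine whether the instrument can detect change in subjects, and establishing a cutoff for score change to judge whether a subject has improved or deteriorated. This use is relevant to OT assessment instruments.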

Model paper: Butler, S. F., Fernandez, K., Benoit, C., Budman, S. H., & Jamison, R. N. (2008). Validation of the revised screener and opioid assessment for patients with pain (SOAPP-R). J Pain, 9, 360-372.

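1. Obtain subjects' scores on the instrument, then calculate sensitivity and specificity against a clinical criterion (for example, the diagnosis made by a clinical specialist) to construct the ROC curve.
2. Validating OT-related assessment instruments. Analyze the relationship between the instrument being validated and an external criterion to establish the instrument's external responsiveness. The external criterion must be dichotomous (improved versus not improved; deteriorated versus not deteriorated). For example, the criterion can split subjects into one group who rate themselves as slightly or much improved and another group who rate themselves as unchanged, slightly worse, or much worse.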

Item reliability

Item reliability index is the estimate of the replicability of item placement within a hierarchy of items along the measured variable if these same items were to be given to another sample of comparable ability.
Kook SH, Varni JW. Validation of the Korean version of the pediatric quality of life inventory 4.0 (PedsQL) generic core scales in school children and adolescents using the Rasch model. Health and quality of life outcomes 2008;6:41.

Criteria: Values range from 0 to 1. The minimum is 0.7 for comparing items as a whole and 0.85 for comparing individual items.
Tennant A, Conaghan PG. The Rasch measurement model in rheumatology: what is it and why use it? When should it be applied, and what should one look for in a Rasch paper? Arthritis and rheumatism 2007;57:1358-1362.

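Study design: Assess a group of subjects (e.g., 200 people) with the newly developed instrument, obtain the difficulty of each item, and use these estimates to calculate item reliability.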

Model paper:
Hou, W. H., Chen, J. H., Wang, Y. H., Wang, C. H., Lin, J. H., Hsueh, I. P., Hsieh, C. L. (2011). Development of a set of functional hierarchical balance short forms for patients with stroke. Arch Phys Med Rehabil, 92(7), 1119-1125.


Thursday, December 22, 2011

Alternate-forms reliability

Alternate-forms reliability is “a form of reliability in which alternate forms of the same test are given to a group of heterogeneous and representative subjects; scores for the two forms are then correlated.”



Criteria: Calculate a correlation coefficient (e.g., Pearson’s r). A coefficient of 0.25-0.5 is fair, 0.5-0.75 is moderate to good, and above 0.75 is good to excellent.
Stigler, Stephen M. Francis Galton's Account of the Invention of Correlation. Statistical Science, 1989, 4,73–79.
Benedict RH, Zgaljardic DJ. Practice effects during repeated administrations of memory tests with and without alternate forms. Journal of Clinical Experimental Neuropsychology. 1998, 20, 339-352.

Schmidt KS, Mattis PJ, Adams J, Nestor P. Alternate-form reliability of Dementia Rating Scale-2. Archives of Clinical Neuropsychology, 2005, 20, 435-441.

Wednesday, December 21, 2011

Negative predictive value

Negative predictive value is the probability that a person diagnosed as a non-patient truly is not a patient.


Criteria: A negative predictive value of 0.5-0.6 is rated fail, 0.6-0.7 poor, 0.7-0.8 fair, 0.8-0.9 good, and 0.9-1.0 excellent.



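Reference: 王瑋瀚, 花茂棽, 楊啓正, 朱怡娟, 鄭婷文, 葉炳強, . . . 徐文俊. (2008). Clinical applicability of the Arithmetic and Digit Span subtests of the Taiwan WAIS-III Chinese version, and of their combination for estimating the Working Memory Index: a retrospective study [in Chinese]. Chinese Journal of Psychology, 50, 187-199.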

Kiyota, Y., Schneeweiss, S., Glynn, R. J., Cannuscio, C. C., Avorn, J., & Solomon, D. H. (2004). Accuracy of Medicare claims-based diagnosis of acute myocardial infarction: Estimating positive predictive value on the basis of review of hospital records. American Heart Journal, 148, 99-104.

Swets, J. A. (1988). Measuring the accuracy of diagnostic systems. Science, 240, 1285-1293.

Thursday, December 15, 2011

Positive predictive value (updated 2011-12-19)

Positive predictive value is the probability that a person diagnosed as a patient truly is a patient.


Criteria: A positive predictive value of 0.5-0.6 is rated fail, 0.6-0.7 poor, 0.7-0.8 fair, 0.8-0.9 good, and 0.9-1.0 excellent.
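
A small sketch of the calculation and the grading bands above. The bands share their boundary values (0.6 belongs to two bands as written), so this sketch assumes each boundary belongs to the higher band; values under 0.5 are not covered by the criteria and are reported as such. The function names and counts are hypothetical.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """Share of people diagnosed as patients who really are patients."""
    return true_positives / (true_positives + false_positives)


def grade(value: float) -> str:
    """Grade a predictive value with the bands above (lower bound inclusive, assumed)."""
    bands = [(0.9, "excellent"), (0.8, "good"), (0.7, "fair"), (0.6, "poor"), (0.5, "fail")]
    for lower_bound, label in bands:
        if value >= lower_bound:
            return label
    return "below the listed range"


# Hypothetical counts: 45 diagnosed patients confirmed, 5 diagnosed in error.
ppv = positive_predictive_value(true_positives=45, false_positives=5)
print(f"PPV = {ppv:.2f} ({grade(ppv)})")
```

The negative predictive value follows the same pattern, with true negatives divided by all negative diagnoses.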




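References: 王瑋瀚, 花茂棽, 楊啓正, 朱怡娟, 鄭婷文, 葉炳強, . . . 徐文俊. (2008). Clinical applicability of the Arithmetic and Digit Span subtests of the Taiwan WAIS-III Chinese version, and of their combination for estimating the Working Memory Index: a retrospective study [in Chinese]. Chinese Journal of Psychology, 50, 187-199.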

Kiyota, Y., Schneeweiss, S., Glynn, R. J., Cannuscio, C. C., Avorn, J., & Solomon, D. H. (2004). Accuracy of Medicare claims-based diagnosis of acute myocardial infarction: Estimating positive predictive value on the basis of review of hospital records. American Heart Journal, 148, 99-104.

Swets, J. A. (1988). Measuring the accuracy of diagnostic systems. Science, 240, 1285-1293.

Wednesday, December 14, 2011

Rasch measurement model

The Rasch model “can examine whether items from a scale measure a unidimensional construct. Rasch analysis transforms ordinal scores to the logit scale and thus to an interval-level measurement.”

1. Hsueh IP, Wang WC, Sheu CF, Hsieh CL. Rasch analysis of combining two indices to assess comprehensive ADL function in stroke patients. Stroke, 2004; 35:721-736.
2. Pallant JF, Tennant A. An introduction to the Rasch measurement model: an example using the Hospital Anxiety and Depression Scale (HADS). Br J Clin Psychol, 2007; 46:1-18.

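One-parameter Rasch model (item difficulty is the only parameter): Rasch analysis mainly tests whether the items of a scale fit the Rasch model. If they fit the model's expectations, the scale can be said to satisfy the assumption of unidimensionality. Moreover, if all items fit the expectations of the Rasch model, the model applies the logit function to the response probabilities to produce an objective interval-level scale.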
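
The logit step can be sketched as follows for the dichotomous Rasch model, which is the standard one-parameter form. The function names and the ability and difficulty values are hypothetical. The point of the sketch is that the log-odds of success equal ability minus difficulty, which is what places persons and items on one interval-level scale.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of success on an item under the dichotomous Rasch model."""
    odds = math.exp(ability - difficulty)
    return odds / (1.0 + odds)


def logit(probability: float) -> float:
    """Log-odds transform that maps a probability onto the logit scale."""
    return math.log(probability / (1.0 - probability))


# Hypothetical person (ability 1.5 logits) and item (difficulty 0.5 logits).
p = rasch_probability(ability=1.5, difficulty=0.5)
print(f"P(success) = {p:.3f}, log-odds = {logit(p):.3f}")
```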

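Clinical implications: If item analysis shows that the data fit the Rasch model, this confirms that all items of the scale measure the same construct (unidimensionality) and that item scores can be summed. Only then can the summed score be used to represent, for example, a subject's ability in activities of daily living.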

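Subjects: Outpatients or inpatients with stroke who meet the selection criteria
       - including varying levels of severity or function
Raters: Therapists familiar with the scale

1. MNSQ: infit/outfit between 0.6 and 1.4
2. ZSTD: within ±2
3. PCA: no single factor explains more than 20% of the variance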


Item discriminant validity: to demonstrate that an item measures what it is supposed to measure, and also to determine the extent to which each item measures other concepts that it is not supposed to measure.


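Criteria: An item's correlation with its own scale should be higher than its correlation with scales it does not belong to, and the difference should exceed the threshold for statistical significance, which is 2 standard errors.
1 standard error = 1/√n
(n: sample size)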

Reference: Ware, J. E., & Gandek, B. (1998). Methods for testing data quality, scaling assumptions, and reliability: The IQOLA project approach. J Clin Epidemiol, 51, 945-952.


Model paper: McHorney, C. A., Ware, J. E., Lu, J. F. R., Sherbourne, C. D. (1994). The Mos 36-item short-form health survey (SF-36): III. Tests of data quality, scaling assumptions, and reliability across diverse patient groups. Medical Care, 32, 40-66


Person reliability

Person reliability is equivalent to the traditional test reliability, which indicates how likely we will be able to get the same ordering of individuals using a repeated test.

Li, J., Liu, H., Feng, T., & Cai, Y. (2011). Psychometric assessment of HIV/STI sexual risk scale among MSM: A Rasch model approach. BMC Public Health, 11, 763.

Criteria: A minimum value of 0.7 is required for group use and 0.85 for individual use.

Tennant, A., & Conaghan, P. G. (2007). The Rasch measurement model in rheumatology: what is it and why use it? When should it be applied, and what should one look for in a Rasch paper? Arthritis Rheum, 57(8), 1358-1362.

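Research design: Assess a group of subjects (e.g., 200 people) with the newly developed instrument. The reliability of the instrument's ability estimates for individual subjects (or for the group) is obtained from the inverse of the error variance (the squared standard error) of each subject's score (or of the group's mean score).
Note: Person reliability can be divided into group level and individual level.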

Model paper:
Hou, W. H., Chen, J. H., Wang, Y. H., Wang, C. H., Lin, J. H., Hsueh, I. P., Hsieh, C. L. (2011). Development of a set of functional hierarchical balance short forms for patients with stroke. Arch Phys Med Rehabil, 92(7), 1119-1125.

Clinical implications: Person reliability tells us whether the instrument estimates the abilities of different subjects with stability (precision).

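Friday, December 2, 2011

For future term entries, please add a "study design" and a model paper

1. That is, how to design the study and collect the data
2. Provide a relevant empirical paper as the model paper
3. Please add the criteria for judging the statistic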

Wednesday, November 30, 2011


Relative reliability is the degree to which individuals maintain their position in a sample over repeated measurements.

Reference:Bruton A, Conway JH, Holgate ST. Reliability: What is it, and how is it measured? Physiotherapy 2000; 86: 94-99.

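*The intraclass correlation coefficient (ICC) is one index of relative reliability.

Criteria: An ICC ≧0.75 indicates good relative reliability.

Study design: Subjects are assessed repeatedly, by the same rater or by different raters depending on the study aim, to examine intra-rater or inter-rater (relative) reliability.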

Model paper:
Liaw LJ, Hsieh CL, Lo SK, Chen HM, Lee S, Lin JH. The relative and absolute reliability of two balance performance measures in chronic stroke patients. Disability and rehabilitation 2008;30:656-661.



Floor effect is a “value that observations cannot fall below, such as zero errors on a learning task.”

Reference: Nunnally JC, Bernstein IH. Psychometric theory. McGraw-Hill, INC; 1994.

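Floor effect: Because of the lower limit of the scale, the abilities of subjects below that limit cannot be observed.

Criteria: The proportion of subjects who obtain the lowest score in the instrument's score distribution; 20% is generally used as the threshold.
Van Der Putten JJ, Hobart JC, Freeman JA, Thompson AJ. Measuring change in disability after inpatient rehabilitation: comparison of responsiveness of the Barthel index and the Functional Independence Measure. J Neurol Neurosurg Psychiatry 1999; 66:480-484.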

Study method: Collect subject data with a single assessment
Martinsson L, Eksborg S. Activity Index-a complementary ADL scale to the Barthel Index in the acute stage in patients with severe stroke. Cerebrovasc Dis. 2006; 22: 231-239.
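
The 20% criterion can be checked with a few lines of code. This is a sketch under two assumptions: an effect is flagged when more than 20% of subjects sit at the extreme score, and the same count at the highest score is used for the ceiling effect described elsewhere on this blog. The function name and scores are hypothetical.

```python
def floor_ceiling(scores, min_score, max_score, threshold=0.20):
    """Proportion of subjects at the lowest and highest possible scores."""
    n = len(scores)
    floor = sum(1 for s in scores if s == min_score) / n
    ceiling = sum(1 for s in scores if s == max_score) / n
    return {
        "floor_proportion": floor,
        "ceiling_proportion": ceiling,
        "floor_effect": floor > threshold,      # assumed: strictly more than 20%
        "ceiling_effect": ceiling > threshold,
    }


# Hypothetical total scores on a 0-20 scale from a single assessment.
scores = [0, 0, 0, 5, 10, 20, 20, 15, 8, 12]
print(floor_ceiling(scores, min_score=0, max_score=20))
```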


Specificity (updated 2012-01-31)

Specificity is the proportion of true non-patients who are diagnosed as non-patients.


A specificity of 0.5-0.6 is rated fail, 0.6-0.7 poor, 0.7-0.8 fair, 0.8-0.9 good, and 0.9-1.0 excellent.
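
For reference, a sketch of how specificity is obtained from a 2x2 table of diagnosis against true status, together with the false-positive rate (1 - specificity) and sensitivity, which come from the same table. The function name and counts are hypothetical.

```python
def diagnostic_accuracy(true_pos: int, false_neg: int, true_neg: int, false_pos: int) -> dict:
    """Sensitivity, specificity, and false-positive rate from 2x2 table counts."""
    return {
        # true patients who are diagnosed as patients
        "sensitivity": true_pos / (true_pos + false_neg),
        # true non-patients who are diagnosed as non-patients
        "specificity": true_neg / (true_neg + false_pos),
        # true non-patients who are wrongly diagnosed as patients
        "false_positive_rate": false_pos / (true_neg + false_pos),
    }


# Hypothetical sample: 50 true patients and 100 true non-patients.
print(diagnostic_accuracy(true_pos=40, false_neg=10, true_neg=90, false_pos=10))
```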



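References: 王瑋瀚, 花茂棽, 楊啓正, 朱怡娟, 鄭婷文, 葉炳強, . . . 徐文俊. (2008). Clinical applicability of the Arithmetic and Digit Span subtests of the Taiwan WAIS-III Chinese version, and of their combination for estimating the Working Memory Index: a retrospective study [in Chinese]. Chinese Journal of Psychology, 50, 187-199.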

Rodstein, M., & Gubner, R. S. (1964). SPECIFICITY + SENSITIVITY OF QRS VOLTAGE CRITERIA OF LEFT VENTRICULAR HYPERTROPHY. American Journal of Cardiology, 13, 619-623.

Swets, J. A. (1988). Measuring the accuracy of diagnostic systems. Science, 240, 1285-1293.  

Tuesday, November 29, 2011


Item internal consistency is tested by examining the correlations between an item and the scale score computed from all other items in that scale (item-scale correlation after correction for overlap).

Reference: Ware, J. E., & Gandek, B. (1998). Methods for testing data quality, scaling assumptions, and reliability: The IQOLA project approach. J Clin Epidemiol, 51, 945-952.


Criteria: The correlation between each item and its own scale should be > 0.4
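
The "correction for overlap" in the definition means that each item is correlated with the sum of the other items in its scale, not with a total that already contains the item. A minimal sketch, using Pearson's r and hypothetical responses (the function names are also hypothetical):

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def corrected_item_scale_correlations(responses):
    """Correlate each item with the sum of the remaining items of the same scale."""
    n_items = len(responses[0])
    correlations = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [sum(row) - row[j] for row in responses]  # scale score without item j
        correlations.append(pearson_r(item, rest))
    return correlations


# Hypothetical responses: rows are respondents, columns are items of one scale.
responses = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [4, 4, 5], [5, 5, 4]]
for index, r in enumerate(corrected_item_scale_correlations(responses), start=1):
    print(f"item {index}: r = {r:.2f}, meets > 0.4: {r > 0.4}")
```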


Model paper: Sullivan, M., Karlsson, J., & Ware, J. E. (1995). The Swedish SF-36 Health Survey-I. Evaluation of data quality, scaling assumptions, reliability and construct validity across general populations in Sweden. Soc Sci Med, 41(10), 1349-1358.


Wednesday, November 23, 2011


Criterion related validity is the degree to which a measure correlates with a gold standard (the criterion).


Reference: Hobart JC, Lamping DL, Thompson AJ. Evaluating neurological outcome measures: the bare essentials. J Neurol Neurosurg Psychiatry 1996; 60: 127-130.

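Concurrent/predictive validity: Tested with a correlation coefficient, Pearson's r or Spearman's ρ:
≧0.75 good; 0.40-0.74 moderate;
<0.40 poor.


Predictive validity: Subjects are assessed at one time point and assessed again after a period of time; this is a follow-up study.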

Model paper:
Hsueh I, Mao H, Huang H, Hsieh C. Clinical applications of balance measures in stroke patients. Formosan Journal of Medicine 2001;5:261-268.

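Clinical implications: When a new assessment instrument is to be used clinically, we look for a widely accepted "gold standard" instrument to serve as the criterion for validation. For balance assessment, for example, the Berg Balance Scale is the accepted gold standard. If the new instrument shows good criterion-related validity against the gold standard, we can say with more confidence that it assesses or predicts the ability or trait we intend to assess.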


Data quality: Indicators of data quality such as item non-response and missing scale scores determine the extent to which an instrument can be used successfully in a clinical setting.

Reference: Hobart, J. C., Riazi, A., Lamping, D. K., Fitzpatrick, R., Thompson, A. J. (2004). Improving the evaluation of therapeutic interventions in multiple sclerosis: development of a patient-based measure of outcome. Health Technology Assessment, 8, 1-48.


Criteria: Missing values < 10% are within the acceptable range.


Model paper: Sullivan, M., Karlsson, J., & Ware, J. E. (1995). The Swedish SF-36 Health Survey-I. Evaluation of data quality, scaling assumptions, reliability and construct validity across general populations in Sweden. Soc Sci Med, 41(10), 1349-1358.


Tuesday, November 22, 2011


Ceiling effect “occurs with measures that are relatively easy, when a substantial proportion of individuals obtain either maximum or near-maximum scores and cannot demonstrate the true extent of their abilities, resulting in score distributions that are compressed at the upper end of performance.”

Reference: Uttl B. Measurement of Individual Differences: Lessons from Memory Assessment in Research and Clinical Practice. Psychological Science, 2005, 16: 460-467.

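Ceiling effect: The effective high-score range of the instrument's measurement scale is not wide enough, so scores cluster at the top of the scale and high-ability subjects cannot be distinguished. It also refers to the situation in which test items are too easy, so that most subjects obtain generally high scores.

Criteria: The proportion of subjects who obtain the highest score in the instrument's score distribution; 20% is generally used as the threshold.
Van Der Putten JJ, Hobart JC, Freeman JA, Thompson AJ. Measuring change in disability after inpatient rehabilitation: comparison of responsiveness of the Barthel index and the Functional Independence Measure. J Neurol Neurosurg Psychiatry 1999; 66:480-484.

Study method: Collect subject data with a single assessment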

Stucki G, Stucki S, Brühlmann P, Michel BA. Ceiling effects of the Health Assessment Questionnaire and its modified version in some ambulatory rheumatoid arthritis patients. Annals of the Rheumatic Diseases 1995; 54: 461-465.

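Clinical implications: The items of an instrument may be too easy for subjects with certain characteristics (for example, high-functioning subjects), so the instrument cannot truly differentiate their abilities. For example, the 10 items of the Barthel Index may be too easy for patients with mild or chronic stroke and fail to show differences in activities of daily living function within this group.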

Sensitivity (updated 2012-01-31)

Sensitivity is the proportion of true patients who are diagnosed as patients.


A sensitivity of 0.5-0.6 is rated fail, 0.6-0.7 poor, 0.7-0.8 fair, 0.8-0.9 good, and 0.9-1.0 excellent.



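王瑋瀚, 花茂棽, 楊啓正, 朱怡娟, 鄭婷文, 葉炳強, . . . 徐文俊. (2008). Clinical applicability of the Arithmetic and Digit Span subtests of the Taiwan WAIS-III Chinese version, and of their combination for estimating the Working Memory Index: a retrospective study [in Chinese]. Chinese Journal of Psychology, 50, 187-199.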

Rodstein, M., & Gubner, R. S. (1964). SPECIFICITY + SENSITIVITY OF QRS VOLTAGE CRITERIA OF LEFT VENTRICULAR HYPERTROPHY. American Journal of Cardiology, 13, 619-623.

Swets, J. A. (1988). Measuring the accuracy of diagnostic systems. Science, 240, 1285-1293.

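Thursday, November 17, 2011

Construct validity (updated 2012/1/31)

Construct validity refers to whether test results can support inferences about the underlying theoretical construct.


Construct validity is judged by the goodness-of-fit indices of confirmatory factor analysis. Four indices are commonly used: the ratio of chi-square to degrees of freedom (< 3.0 is acceptable), Bentler’s comparative fit index (CFI; CFI > 0.95 indicates good fit), the Tucker-Lewis Index (TLI; TLI > 0.95 indicates good fit), and the root mean square error of approximation (RMSEA; RMSEA < 0.05 indicates good fit). Sometimes not all fit indices meet their criteria, so the authors ultimately make a subjective judgment about whether to accept the result.

Clinical implications: A questionnaire with good construct validity ensures that its results truly reflect the concept being measured, which makes the results easier to interpret.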


Cronbach, L. J., & Meehl, P. E. (1955). CONSTRUCT VALIDITY IN PSYCHOLOGICAL TESTS. Psychological Bulletin, 52, 281-302.

Hsueh, I. Ping, Jeng, Jiann-Shing, Lee, Yen, Sheu, Ching-Fan, & Hsieh, Ching-Lin. (2011). Construct validity of the stroke-specific quality of life questionnaire in ischemic stroke patients. Archives of Physical Medicine & Rehabilitation, 92, 1113-1118.

Terwee, C. B., Bot, S. D. M., de Boer, M. R., van der Windt, Dawm, Knol, D. L., Dekker, J., . . . de Vet, H. C. W. (2007). Quality criteria were proposed for measurement properties of health status questionnaires. Journal of Clinical Epidemiology, 60, 34-42.

Wednesday, November 16, 2011


Discriminant validity is “the degree to which concepts that should not be related theoretically are not interrelated in reality.”
Reference: Campbell DT, Fiske DW. (1959). Convergent and discriminant validation by the multitrait- multimethod matrix. Psychological Bulletin, 56, 81-105.

Campbell & Fiske (1959) gave the above definition of discriminant validity in their article. Nunnally & Bernstein (1994) cite the Campbell & Fiske (1959) definition of discriminant validity in their book and call it divergent validity, so the two terms can be used interchangeably.
Divergent validity: "In order to justify novel measures of attributes, a measure should have divergent validity in the sense of measuring something different from existing methods. Measures of different attributes should therefore not correlate to an extremely high degree." (p. 92)
Reference: Nunnally JC, Bernstein IH. Psychometric theory. McGraw-Hill, INC.; 1994.

Discriminant validity: Scores on the instrument should have a low correlation (or none) with scores on instruments that measure different constructs or traits.


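Study design: Subjects are assessed with both the instrument being validated and an instrument that measures a different construct or trait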
Ng TP, Niti M, Chiam PC, Kua EH. Physical and cognitive domains of the instrumental activities of daily living: Validation in a multiethnic population of Asian older adults. J Gerontol A Biol Sci Med Sci. 2006; 61: 726-735.



The standard error of measurement (SEM) is a determination of the amount of variation or spread in the measurement errors for a test.

Harvill LM. NCME Instructional module: standard error of measurement. Educational Measurement: Issues and Practice. 1991;10(2):33-41.


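Study design: Subjects are assessed repeatedly, by the same rater or by different raters depending on the study aim.

Criteria: An SEM smaller than 10% of the mean score at the first assessment indicates small measurement error (high stability of results).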
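
The post does not give a formula for the SEM. A commonly used one, assumed here, is SEM = SD × √(1 - reliability), where the reliability coefficient is typically an ICC from the repeated assessments. The sketch below applies it and expresses the SEM as a percentage of the first-assessment mean so it can be compared with the 10% criterion. The function names and data are hypothetical.

```python
import math
import statistics


def standard_error_of_measurement(scores, reliability):
    """SEM = SD * sqrt(1 - reliability); an assumed, commonly used formula."""
    return statistics.stdev(scores) * math.sqrt(1.0 - reliability)


def sem_percent(first_assessment_scores, reliability):
    """SEM expressed as a percentage of the mean score at the first assessment."""
    sem = standard_error_of_measurement(first_assessment_scores, reliability)
    return 100.0 * sem / statistics.mean(first_assessment_scores)


# Hypothetical first-assessment scores and a hypothetical ICC of 0.91.
first_scores = [40, 50, 60]
percent = sem_percent(first_scores, reliability=0.91)
print(f"SEM% = {percent:.1f}; small measurement error: {percent < 10}")
```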

Model paper:
Flansbjer UB, Holmback AM, Downham D, Patten C, Lexell J. Reliability of gait performance tests in men and women with hemiparesis after stroke. J Rehabil Med 2005; 37: 75-82.



Discriminative validity: An instrument shows "discriminative validity" if a patient group expected to have worse scores has scores worse than those of comparison subjects. The instrument thus "discriminates" between the groups.

Reference: Crowley, T. J., Mikulich, S. K., Ehlers, K. M., Hall, S. K., & Whitmore, E., A. (2003). Discriminative validity and clinical utility of an abuse-neglect interview for adolescents with conduct and substance use problems. Am J Psychiatry, 160, 1461-1469.




Model paper: Hsieh, Y. W., Lin, J. H., Wang, C. H., Sheu, C. F., Hsueh, I. P., & Hsieh, C. L. (2007). Discriminative, predictive, and evaluative properties of the simplified Stroke Rehabilitation Assessment of Movement instrument in patients with stroke. J Rehabil Med, 39(6), 454-460.


Thursday, November 10, 2011

Split-half reliability

Split-half reliability is obtained by dividing a test into two equivalent halves and calculating the correlation between the results of the two halves. It is often used to estimate the stability and internal consistency of a test that has enough items.


A correlation coefficient (Pearson's r or Spearman's ρ) ≧0.75 is good, 0.40-0.74 is moderate, and <0.40 is poor.
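
A sketch of the procedure, assuming an odd/even split of the items as the two "equivalent halves". Because each half contains only half the items, the Spearman-Brown correction 2r / (1 + r) is often applied to estimate the reliability of the full-length test; that step is an addition here and is not mentioned in the post. The function names and responses are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def split_half_reliability(responses):
    """Return (half-test correlation, Spearman-Brown corrected estimate)."""
    first_half = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    second_half = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    r = pearson_r(first_half, second_half)
    return r, 2.0 * r / (1.0 + r)


# Hypothetical responses: rows are respondents, columns are four test items.
responses = [[1, 2, 1, 2], [2, 2, 3, 2], [3, 4, 3, 3], [4, 4, 5, 4], [5, 5, 4, 5]]
r, corrected = split_half_reliability(responses)
print(f"half-test r = {r:.2f}, Spearman-Brown corrected = {corrected:.2f}")
```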




Cronbach, Lee. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.

Andresen, E. M. (2000). Criteria for assessing the tools of disability outcomes research. Archives of Physical Medicine and Rehabilitation, 81, S15-S20.

Klein, C., & Fischer, B. (2005). Instrumental and test-retest reliability of saccadic measures. Biological Psychology, 68, 201-213.

Streiner, D. L. (2003). Starting at the beginning: An introduction to coefficient alpha and internal consistency. Journal of Personality Assessment, 80, 99-103.

Wednesday, November 9, 2011


Ecological validity refers to the degree to which test performance corresponds to real-world performance.

Reference: Chaytor, N., & Schmitter-Edgecombe, M. (2003). The ecological validity of neuropsychological tests: a review of the literature on everyday cognitive skills. Neuropsychology Review, 13, 181-197.


Criteria: The correlation between the instrument being validated and an ADL or outcome measure.
                     ≧ 0.60 excellent;
                     0.31-0.59 adequate;
                     ≦ 0.30 poor


Model paper: Chaytor, N., Temkin, N., Machamer, J., & Dikmen, S. (2007). The ecological validity of neuropsychological assessment and the role of depressive symptoms in moderate to severe traumatic brain injury. J Int Neuropsychol Soc, 13(3), 377-385.

Study design: Administer the instrument being validated and the ADL or outcome measure at the same time, and examine the correlation between them.


Convergent validity is “the degree to which concepts that should be related theoretically are interrelated in reality.”

Campbell DT, Fiske DW. (1959). Convergent and discriminant validation by the multitrait- multimethod matrix. Psychological Bulletin, 56, 81-105.
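Convergent validity: The degree to which the trait an instrument intends to measure is associated with theoretically related traits.

Criteria: Pearson’s r can be used to examine the correlation between the two instruments. r ≧ 0.6 indicates good convergent validity.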
Salter K, Jutai JW, Teasell R, Foley NC, Bitensky J, Bayley M. Issues for selection of outcome measures in stroke rehabilitation: ICF activity. Disabil Rehabil 2005; 27: 315-340.

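Study design: Subjects are assessed with both the instrument being validated and an instrument measuring a theoretically related trait

Clinical implications: Convergent validity is one type of construct validity. Testing it tells clinicians whether the items of an instrument really assess the construct or trait they are theoretically meant to assess. For example, an instrument intended to assess a subject's activities of daily living function should really capture that function rather than the subject's cognitive function.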

Tuesday, November 8, 2011


The minimal clinically important difference (MCID) can be defined as the smallest difference in score in the domain of interest which patients perceive as beneficial and which would mandate, in the absence of troublesome side effects and excessive cost, a change in the patient's management.

Jaeschke R, Singer J, Guyatt GH. Measurement of health status. Ascertaining the minimal clinically important difference. Control Clin Trials 1989; 10: 407-415.

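Explanation: The minimal clinically important difference can be defined as the smallest change in score, within a given domain (the ability or functional performance being assessed), that patients perceive as beneficial to themselves. In the absence of side effects or excessive cost, this change would influence the clinical management of the patient (e.g., treatment planning).


(1) Using a patient self-rated scale (e.g., a 15-point Likert scale) as the external criterion, the difference corresponding to an improvement from -3 to -1 or from 1 to 3 is the MCID.
(2) The mean change score of subjects who rate themselves as somewhat improved minus the mean change score of subjects who rate themselves as not improved is the MCID.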
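
Method (2) above can be sketched as follows. The group labels, the function name, and the change scores are hypothetical; in practice the grouping would come from the self-rated external criterion.

```python
from statistics import mean


def mcid_mean_change(change_scores, self_ratings,
                     improved_label="somewhat improved",
                     unchanged_label="not improved"):
    """Mean change of the improved group minus mean change of the unchanged group."""
    improved = [c for c, g in zip(change_scores, self_ratings) if g == improved_label]
    unchanged = [c for c, g in zip(change_scores, self_ratings) if g == unchanged_label]
    return mean(improved) - mean(unchanged)


# Hypothetical change scores (discharge minus admission) and self-ratings.
changes = [6, 8, 7, 1, 2, 0]
ratings = ["somewhat improved"] * 3 + ["not improved"] * 3
print(f"MCID estimate = {mcid_mean_change(changes, ratings)}")
```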

Wells G, Beaton D, Shea B, Boers M, Simon L, Strand V, Brooks P, Tugwell P. Minimal clinically important differences: review of methods. J Rheumatol 2001;28:406-412.

Model paper:

Iyer LV, Haley SM, Watkins MP, Dumas HM. Establishing minimal clinically important differences for scores on the pediatric evaluation of disability inventory for inpatient rehabilitation. Physical therapy 2003;83:888-898.

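In research, when we examine the responsiveness of an instrument, we usually judge at the group level whether the instrument can detect the effect of a treatment, for example whether the change from before to after treatment is statistically significant. Going further, we also examine the minimal detectable change (MDC) to determine how much an individual patient must change, statistically, for the change not to be attributable to measurement error. However, the smallest statistically significant change is not necessarily the smallest clinically important change. Examining an instrument's minimal clinically important difference therefore reveals the smallest change that a group or an individual subject considers important.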

Thursday, October 27, 2011

Internal consistency (updated 2012-01-31)

Internal consistency reveals the correlation or consistency of the items in the same domain in a questionnaire. It is often revealed by Cronbach's α.

Criteria: A Cronbach's α of 0.7-0.9 indicates good internal consistency. α > 0.95 may indicate that some items are conceptually redundant and could be considered for deletion.



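Cronbach's α is easily affected by the number of items in a questionnaire: the more items, the larger the α.

Invite people from the questionnaire's target population to complete it, then calculate Cronbach's α for each domain from their responses.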
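
The calculation for one domain can be sketched with the standard formula α = k/(k-1) × (1 - Σ item variances / variance of the total score), where k is the number of items. The function name and the responses are hypothetical.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for one domain; rows are respondents, columns are items."""
    k = len(responses[0])
    item_variances = [variance([row[j] for row in responses]) for j in range(k)]
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1.0 - sum(item_variances) / total_variance)


# Hypothetical responses from five people to a three-item domain.
responses = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [4, 4, 5], [5, 5, 4]]
alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

Because k appears in the formula, adding items tends to raise α even when the average correlation between items stays the same, which is the sensitivity to questionnaire length noted above.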


Cronbach, Lee. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.

Rush, A. J., Trivedi, M. H., Ibrahim, H. M., Carmody, T. J., Arnow, B., Klein, D. N., . . . Keller, M. B. (2003). The 16-Item Quick Inventory of Depressive Symptomatology (QIDS), clinician rating (QIDS-C), and self-report (QIDS-SR): a psychometric evaluation in patients with chronic major depression. Biological Psychiatry, 54, 573-583.

Terwee, C. B., Bot, S. D. M., de Boer, M. R., van der Windt, Dawm, Knol, D. L., Dekker, J., . . . de Vet, H. C. W. (2007). Quality criteria were proposed for measurement properties of health status questionnaires. Journal of Clinical Epidemiology, 60, 34-42.

Wednesday, October 26, 2011


External responsiveness reflects the extent to which changes in a measure over a specified time frame relate to corresponding changes in a reference measure of health status.

Husted JA, Cook RJ, Farewell VT, Gladman DD. Methods for assessing responsiveness: a critical review and recommendations. J Clin Epidemiol 2000; 53: 459-468.


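Study design: Follow-up study with repeated assessment. For example, the instrument under study and the reference instrument are each administered once at admission and once at discharge.

Criteria: Tested with the correlation coefficient against the reference instrument, Pearson's r or Spearman's ρ: ≧0.75 good; 0.40-0.74 moderate; <0.40 poor.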

Model Paper:
Hsueh I, Mao H, Huang H, Hsieh C. Clinical applications of balance measures in stroke patients. Formosan Journal of Medicine 2001;5:261-268.


Concurrent validity (updated 11/01)

Concurrent validity is studied when the measurement to be validated is taken together with a previously validated measure at relatively the same time (concurrently), to see how well the two measures are correlated.

Portney LG, Watkins MP. Foundations of clinical research: Applications to practice. Upper Saddle River: Pearson Prentice Hall; 2009.
Hobart, J. C., Lamping D. L., & Thompson, A. J. (1996). Evaluating neurological outcome measures: The bare essentials. J Neurol Neurosurg Psychiatry, 60, 127-130.

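Concurrent validity is the degree of association between an instrument's results and those of the currently accepted gold-standard instrument. It shows whether the instrument assesses the same construct as the accepted standard and allows the accuracy of the instrument's results to be checked.

Criteria: Pearson’s r can be used to examine the correlation between the two instruments. r ≧ 0.6 is the basic standard for good concurrent validity.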
Salter K, Jutai JW, Teasell R, Foley NC, Bitensky J, Bayley M. Issues for selection of outcome measures in stroke rehabilitation: ICF activity. Disabil Rehabil 2005; 27: 315-340.

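Study design: Subjects are assessed with both the instrument being validated and the criterion instrument
Clinical implications: Clinically, this makes it possible to confirm whether, for example, a newly developed activities of daily living scale (short, quick, and comprehensive) assesses the same construct (e.g., activities of daily living function) as effectively as the scale currently in general use. If it does, the new scale may be considered in order to improve the efficiency of assessment.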