Raters’ Assessment Quality in Measuring Teachers’ Competency in Classroom Assessment: Application of Many Facet Rasch Model


  • Rosyafinaz Mohamat Faculty of Education, University of Malaya, Kuala Lumpur, MALAYSIA
  • Bambang Sumintono Faculty of Education, Universitas Islam Internasional Indonesia, INDONESIA
  • Harris Shah Abd Hamid Faculty of Management, Education and Humanities, University College MAIWP International, MALAYSIA




Keywords: Many Facet Rasch Model, Competency, Classroom Assessment, Rater severity, Multi-rater Analysis


This study examines raters’ assessment quality when measuring teachers’ competency in Classroom Assessment (CA) using Many Facet Rasch Model (MFRM) analysis. The instrument consists of 56 items built on three main constructs: knowledge in CA, skills in CA, and attitude towards CA. The study uses a quantitative design with a multi-rater approach, in which a questionnaire was distributed to the raters. The respondents were 68 raters drawn from three roles (Head of the Mathematics and Science Department, Head of the Mathematics Panel, and Mathematics teacher), who assessed 27 ratees. The ratees were 27 secondary school Mathematics teachers from Selangor. The results show that MFRM can determine the severity and consistency of the raters and detect bias interactions between rater and ratee. Although all raters were given the same instrument, the same aspects of evaluation, and the same scale categories, MFRM can estimate and compare the severity level of each rater individually. Furthermore, MFRM can detect measurement biases and makes it easier for researchers to communicate their findings. MFRM provides complete information and contributes to the understanding of the consistency of raters’ judgements, supported by quantitative evidence. This indicates that MFRM is a suitable alternative model for overcoming the limitations of Classical Test Theory (CTT) statistical models in multi-rater analysis.
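To make concrete how a rater's severity enters the model, here is a minimal sketch of the standard three-facet rating scale form of the MFRM, in which the log-odds of category k over category k-1 equal ratee ability minus item difficulty minus rater severity minus the threshold for category k. The function names, the four-category scale, and all logit values are illustrative assumptions; they are not taken from the study's data or from its exact model specification.

```python
import math


def mfrm_category_probs(ability, difficulty, severity, thresholds):
    """Category probabilities (0..K) under a three-facet rating scale MFRM.

    All arguments are in logits. `thresholds` holds F_1..F_K, the step
    thresholds between adjacent rating categories.
    """
    # Combined location: ratee ability minus item difficulty minus rater severity.
    eta = ability - difficulty - severity
    # Log-numerator of category k is the running sum of (eta - F_j) for j <= k;
    # category 0 has log-numerator 0 by convention.
    log_num = [0.0]
    for f in thresholds:
        log_num.append(log_num[-1] + eta - f)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_num)
    exps = [math.exp(v - m) for v in log_num]
    total = sum(exps)
    return [e / total for e in exps]


def expected_rating(probs):
    """Model-expected rating given the category probabilities."""
    return sum(k * p for k, p in enumerate(probs))


# Hypothetical values: one ratee, one item, two raters who differ only in severity.
thresholds = [-1.0, 0.0, 1.0]  # assumed four-category scale (0..3)
lenient = mfrm_category_probs(1.0, 0.0, -0.5, thresholds)
severe = mfrm_category_probs(1.0, 0.0, 0.5, thresholds)

print("lenient rater expected rating:", round(expected_rating(lenient), 3))
print("severe rater expected rating:", round(expected_rating(severe), 3))
```

For the same ratee and item, the more severe rater yields a lower expected rating. This is the effect MFRM estimates and adjusts for when comparing raters who used the same instrument and scale.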








How to Cite

Mohamat, R., Sumintono, B., & Abd Hamid, H. S. (2022). Raters’ Assessment Quality in Measuring Teachers’ Competency in Classroom Assessment: Application of Many Facet Rasch Model. Asian Journal of Assessment in Teaching and Learning, 12(2), 71–88. https://doi.org/10.37134/ajatel.vol12.2.7.2022