Student Ratings of Instruction – Report and Recommendation

Office of Institutional Research

Suffolk County Community College


Over 2,000 research studies on student ratings of instruction have been published. For those interested, researchers have published several major reviews of this body of literature (Aleamoni, 1987; Arreola, 1995; Braskamp & Ory, 1994; Cashin, 1988 & 1995; Centra, 1993; Costin, Greenough, & Menges, 1971; Davis, 1993; Marsh, 1987; McKeachie, 1994 & 1997). Meta-analysts have provided quantitative summaries of the relationship between student ratings and student learning (Cohen, 1981, 1982, 1983; Dowell & Neal, 1982; Abrami, 1984; McCallum, 1984; all cited in d’Apollonia & Abrami, 1997). More than 25 years of published research evidence supports the conclusion that there is "a moderate to large association between student ratings and student learning, indicating that student ratings of general instructional skill are valid measures of instructor-mediated learning in students" (d’Apollonia & Abrami, 1997, p. 1202). McKeachie (1997) summarized the most recent research on the validity of student ratings, stating that "student ratings are the single most valid source of data on teaching effectiveness" (p. 1219). Moreover, according to reviews of the literature conducted by Aleamoni (1987) and Arreola (1995), well-developed, tested student rating forms of teaching effectiveness exhibit both reliability and validity.

This report assesses a group of published student ratings forms currently available for use at an institutional level: the Educational Testing Service's (ETS's) Student Instructional Report II (SIR-II), the Instructional Development and Effectiveness Assessment (IDEA), the University of Arizona's Arizona Teacher-Course Evaluation Questionnaire (AZTEQ), and the Purdue Cafeteria System. The IDEA and the AZTEQ are available in both long and short forms; this report addresses only the long forms, as they are designed to serve both administrative/personnel evaluation and the improvement of teaching effectiveness. The following criteria for assessing these instruments emerged from the literature: the multidimensionality of the teaching effectiveness construct; the reliability and validity of the measure; and cost.



Teaching Effectiveness – A Multidimensional Construct

Researchers agree that teaching effectiveness is a multidimensional construct. Thus student rating forms need to, and in fact do, measure a variety of different aspects of teaching (Abrami & d’Apollonia, 1990; Feldman, 1976b & 1989b; Kulik & McKeachie, 1975; Marsh, 1984; Marsh & Dunkin, 1992). Both Centra (1993) and Braskamp and Ory (1994) identified six factors commonly found in student rating forms:

    1. Course organization and planning
    2. Clarity, communications skills
    3. Teacher student interaction, rapport
    4. Course difficulty, workload
    5. Grading and examinations
    6. Student self-rated learning

Forms must distinguish among the various items and their dimensions to ensure that instructors receive ratings on all of the appropriate dimensions. The importance of addressing the various dimensions cannot be over-emphasized when the purpose is to improve teaching. However, one or a few global or summary items may provide sufficient student rating data for personnel decisions (Abrami, 1989a; Abrami & d’Apollonia, 1991). In either case, the simple averaging of dissimilar items is inappropriate.

The SIR-II edited and updated the original five scales of the SIR (course organization and planning; communication; faculty/student interaction; assignments, exams, and grading; course difficulty, workload, and pace) and added three new scales to reflect new emphases in learning research (course outcomes, student effort and involvement, methods of instruction). The IDEA assesses three major constructs: student learning; difficulty and workload; and teaching methods. Teaching methods is further delineated in terms of communicating content and purpose, involving students, creating enthusiasm, and preparing exams. The AZTEQ was built around four dimensions: instructor's presentation and delivery, instructor's interaction and feedback, course components and integration, and workload and difficulty. Clearly the SIR-II, the IDEA, and the AZTEQ cover most of the major components of the construct of teaching effectiveness. Literature on the Purdue Cafeteria System does not specifically discuss the teaching dimensions it assesses; however, because instructors, departments, and institutions select rating-scale items from a catalog to meet their specific needs, the authors posit that it provides diagnostic evaluation along several dimensions.



Reliability of the Measure

Reliability indicates whether a set of items consistently measures a particular construct or set of constructs. Reliability is a precondition for validity. This paper focuses on three types of reliability. Consistency across raters (inter-rater reliability or agreement) refers to the agreement of all student ratings of one instructor or course. Stability or consistency across time (test-retest reliability) refers to whether an instructor receives similar ratings every semester. Generalizability reflects how well the data assess the instructor's general teaching effectiveness, not just effectiveness in a particular course or term.

Inter-rater reliability provides the most common and appropriate indication of the reliability of student rating forms (Marsh & Roche, 1997). Reliabilities vary with the number of raters; accordingly, Cashin (1995) recommends a minimum of 10 raters for an acceptable reliability of .70 or better. Aleamoni (1987) cites several rating forms with an inter-rater reliability of .9 or better, based on an average class size of 25. These include the SIR-II, the IDEA, and the AZTEQ. The only reliability studies of the Purdue Cafeteria System date from the early 1970s, and those were unpublished.
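The relationship between class size (number of raters) and the reliability of the class-average rating can be sketched with the Spearman-Brown prophecy formula. This is an illustration, not a calculation from any of the instruments reviewed here; the single-rater reliability used below is a hypothetical value chosen so that a 25-student class reaches roughly the .90 reliability Aleamoni (1987) cites.

```python
# Spearman-Brown prophecy formula: reliability of the mean of n ratings
# grows with n. The single-rater reliability here is hypothetical.

def composite_reliability(single_rater_r: float, n_raters: int) -> float:
    """Reliability of the average of n_raters independent ratings."""
    return n_raters * single_rater_r / (1 + (n_raters - 1) * single_rater_r)

single_rater_r = 0.265  # hypothetical reliability of one student's rating

print(round(composite_reliability(single_rater_r, 25), 2))  # ~0.90 at class size 25
print(round(composite_reliability(single_rater_r, 10), 2))  # ~0.78, above Cashin's .70 floor
print(round(composite_reliability(single_rater_r, 5), 2))   # very small classes fall short
```

This is why Cashin's (1995) 10-rater minimum matters: below that point, the reliability of the class average drops quickly.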



Validity of the Measure

The validity of a measure indicates the extent to which student-rating items measure some aspect of teaching effectiveness. Validity coefficients are interpreted differently from reliability coefficients, as seen below:

    .00-.29: not practically useful, even when statistically significant
    .30-.49: practically useful
    .50-.70: very useful; coefficients this large are not common when studying complex phenomena
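As a minimal illustration, the rubric above can be encoded as a small lookup function. The thresholds are the report's; treating the boundaries as half-open intervals is an assumption of this sketch.

```python
# Hypothetical helper encoding the validity-coefficient rubric above.

def interpret_validity(r: float) -> str:
    """Map an (absolute) validity coefficient to the report's interpretive label."""
    r = abs(r)
    if r < 0.30:
        return "not practically useful"
    if r < 0.50:
        return "practically useful"
    return "very useful"

print(interpret_validity(0.25))   # not practically useful
print(interpret_validity(0.45))   # practically useful
print(interpret_validity(0.60))   # very useful
```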

The following discussion of validity includes: content, construct, and criterion validity, as well as the evaluation and control of potential bias.


Content Validity. Content validity incorporates estimates of the extent to which the content of an instrument relates to what it is designed to measure. The items and scales of the SIR-II, IDEA, AZTEQ, and the Purdue Cafeteria System were all designed to reflect the content of what many sources (teachers, students, administrators, conferences, and publications) define as effective teaching. Consequently, all four demonstrate high content validity.


Construct Validity. Construct validity evaluates the degree to which the scores from an instrument correspond to other measures of the underlying theoretical trait. In this case, the student ratings should correspond with the dimensionally specific scales chosen to represent teaching effectiveness. Researchers use factor analysis as one approach to studying construct validity: factors produced from student ratings should closely duplicate the scales employed. This type of analysis was used in the development of the SIR-II (Centra, 1998). The six scales subjected to factor analysis accounted for 88% of the variance among the scales included in the SIR-II, demonstrating the high construct validity of the measure. The AZTEQ was also subjected to factor analysis, revealing that the four factors represented by the questionnaire items accounted for 75% of the total variance of the items, demonstrating the construct validity of this instrument as well. The IDEA demonstrates construct validity differently: it relies upon student ratings of their own learning on objectives chosen for evaluation by the instructor. Thus, a significant correlation (.25) between student learning on course objectives and instructor-designated importance of those objectives (in contrast to .02 for learning and importance on unrelated objectives) supports its claim to construct validity (Hoyt, 1973). The Purdue Cafeteria System has not published or provided validity studies addressing this issue. All of these forms allow the selection of additional items for a more personalized or relevant evaluation, which can increase both construct validity and flexibility.
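For readers unfamiliar with the "percent of variance accounted for" figures above, the sketch below shows how such proportions are computed from the eigenvalues of a correlation matrix of scale scores. The ratings matrix is randomly generated for illustration only; it is not SIR-II or AZTEQ data.

```python
# Illustration of "proportion of variance accounted for" in a
# principal-components/factor analysis, using synthetic data.
import numpy as np

rng = np.random.default_rng(0)
ratings = rng.normal(size=(200, 6))          # 200 students x 6 scale scores (synthetic)
corr = np.corrcoef(ratings, rowvar=False)    # 6 x 6 correlation matrix of the scales
eigvals = np.linalg.eigvalsh(corr)[::-1]     # eigenvalues, largest first
explained = eigvals / eigvals.sum()          # proportion of variance per factor

print(np.round(explained.cumsum(), 2))       # cumulative variance accounted for
```

In a real analysis, a high cumulative proportion for as many factors as there are intended scales (e.g., 88% for the SIR-II's six scales) is the evidence of construct validity.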

The correlation of scores with other, external variables provides alternative evidence of construct validity. Most specifically, low correlation with variables outside the construct indicates lack of bias, or discriminant validity. The original SIR showed little relationship with class size, subject area, course type, expected student grade, class level, and a variety of other studied variables (Centra, 1976). The specific effect of potential biases has not been addressed in the literature on the IDEA, the AZTEQ, or the Purdue Cafeteria System.


Criterion Validity. Criterion validity represents performance in relation to particular tasks or discrete cognitive or behavioral objectives. There are two measures of criterion validity. The first is predictive validity, the degree to which scores predict performance. Reviews (Cohen, 1982; Feldman, 1989b) show that student learning, as represented by scores on an external final exam (across all instructors of the same course), has a moderate to high correlation with student ratings. Classes that on average gave the instructor higher ratings also, on average, scored higher on the exam; that is, they learned more. Only the literature on the SIR-II specifically addresses predictive validity. The original SIR (Centra, 1976) demonstrated that learning gains were related to the students' overall evaluation of the instructor as well as to some of the scale scores. ETS expects similar validity for the SIR-II, but suggests that additional research validate the revisions and additions in the new form.

The second measure of criterion validity focuses on concurrent validity – the degree to which scores on two or more measures directly measure the same thing. The literature presents a variety of measures of teaching effectiveness as parallel to student ratings. Student ratings of effective teaching also have moderate to high correlations with instructor self-ratings (Feldman, 1989a; Marsh, Overall, & Kesler, 1979; and Marsh & Dunkin, 1992), as well as evaluations of teacher effectiveness by colleagues/faculty and administrators (Kulik & McKeachie, 1975; Feldman, 1989a), and alumni (Overall & Marsh, 1980; Braskamp & Ory, 1994). Student ratings on overall instructor effectiveness are also highly correlated with their responses to open ended questions on instructor effectiveness (Ory, Braskamp, & Pieper, 1980; Braskamp, Ory, & Pieper, 1981).

The AZTEQ makes indirect claims of concurrent validity based on its intentional resemblance to "other validated instruments." However, published research on the instruments evaluated here indicates that concurrent validity has not been empirically investigated. Thus none of the instruments assessed in this report has thoroughly demonstrated the concurrent aspect of criterion validity.


Evaluation and control of potential bias. Researchers disagree on the appropriate definition of "bias" (Cashin, 1988, 1995; Marsh, 1984). Some writers have suggested that bias is "anything not under the control of the instructor" (Cashin, 1995). Marsh (1984) argued against this broad definition, stating instead that bias "should be restricted to variables not related to teaching effectiveness." Differing definitions of bias have confused the literature; discussion of bias is therefore often organized more clearly in terms of specific variables that do or do not require control (Cashin, 1995).

According to Marsh's (1984) definition of bias, only those variables that are related to student ratings require control. The literature allows for the exclusion of many variables as not significantly correlated with student ratings and therefore non-biasing. The first group of such variables concerns instructor characteristics and includes age and teaching experience (Marsh & Hocevar, 1991), gender (Costin, 1971; Feldman, 1992; Marsh & Roche, 1997), race (Li, 1993, cited in Cashin, 1995), personality (Aleamoni, 1987; Braskamp & Ory, 1994; Centra, 1993; Murray, 1983), faculty rank (Arreola, 1995; Marsh & Roche, 1997), and research productivity (Arreola, 1995; Centra, 1993; Feldman, 1987). The second group covers student characteristics and consists of student age (Centra, 1993), gender (Costin, 1971; Feldman, 1977, 1993; Marsh & Roche, 1997), level, e.g. freshman (McKeachie, 1979), GPA (Feldman, 1976a), and personality (Abrami, Perry, & Leventhal, 1982). The final set of non-biasing variables includes class size (Aleamoni, 1987; Feldman, 1984), time of day (Aleamoni, 1987; Feldman, 1978), and the time during the second half of the term when ratings are collected (Feldman, 1979). Variables that are correlated with student ratings but enhance learning, such as instructor enthusiasm or expressiveness (Aleamoni, 1987; Marsh & Roche, 1997; Marsh & Ware, 1982) and workload or course difficulty (Marsh, 1987; Marsh & Roche, 1997), are also considered non-biasing and do not require control.

Research has also pointed to several variables requiring control. Student motivation is the most prominent of these. The literature supports the belief that instructors of elective courses receive higher ratings than instructors of required courses (Arreola, 1995; Marsh & Roche, 1997), and that prior interest in the course subject matter contributes to higher ratings (Marsh & Dunkin, 1992). Academic field also affects student ratings. According to Marsh and Roche (1997) and others (Braskamp & Ory, 1994; Cashin, 1990; Centra, 1993; Feldman, 1978; Marsh & Dunkin, 1992), instructors of courses in the sciences tend to be rated lower than instructors of courses in the humanities. This may or may not pose a biasing influence. Cashin (1990) points out that the lower ratings in courses requiring more quantitative reasoning skills may be associated with reduced student competency in those areas, necessitating control of academic field; but control is not appropriate if classes within particular fields are simply poorly taught. The final variable requiring control concerns course level. Higher-level courses, especially graduate-level courses, tend to receive higher ratings (Aleamoni & Hexner, 1980; Braskamp & Ory, 1994; Feldman, 1978; Marsh, 1997), but these differences are small and less relevant to this discussion at a two-year institution.

The effect of expected grades or grading leniency is perhaps the most controversial and most researched of the potential biases in student ratings (Arreola, 1995). To the degree that higher grades reflect greater learning, a positive relationship between grades and ratings is appropriate and should be expected. Research on the grading leniency effect indicates that the effect is both weak and insubstantial (Braskamp & Ory, 1994; Feldman, 1976a; Marsh & Dunkin, 1992; Marsh & Roche, 1997). Most of the correlation between grades and ratings can be accounted for by self-reported student learning (Howard & Maxwell, 1980, 1982), which supports the hypothesis that teaching effectiveness influences both grades and ratings, and therefore that student ratings are valid. However, other hypotheses have been posed to explain this association (Cashin, 1995; Greenwald & Gillmore, 1997): (1) student motivation (general or course-specific) influences both learning and ratings; or (2) students give high ratings in appreciation of lenient grading. Rather than statistical control for possible leniency, Cashin (1995) recommends peer review of the course material, exams, graded samples of essays and projects, and so on, to detect grade inflation.
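The logic of "accounting for" the grades-ratings correlation by self-reported learning can be illustrated with a first-order partial correlation. The three correlations below are hypothetical values chosen for illustration, not estimates from Howard and Maxwell (1980, 1982).

```python
# First-order partial correlation: association of x and y with z held constant.
# All three input correlations below are hypothetical.
from math import sqrt

def partial_corr(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation of x and y after partialling out z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))

r_grades_ratings = 0.30    # hypothetical raw grades-ratings correlation
r_grades_learning = 0.55   # hypothetical grades vs. self-reported learning
r_ratings_learning = 0.50  # hypothetical ratings vs. self-reported learning

# With learning partialled out, little grades-ratings association remains:
print(round(partial_corr(r_grades_ratings, r_grades_learning, r_ratings_learning), 2))
```

A near-zero partial correlation is the pattern consistent with the "teaching effectiveness drives both" hypothesis; the competing explanations above predict different patterns.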

Comparative data provide an option to statistical control of the above variables. If statistical control is used, Cashin (1995) suggests that course level and academic field be controlled for only if these variables maintain significant differences after controlling for student motivation. Under these circumstances it would be necessary to develop level- or field-specific comparative data, for reference and appropriate interpretation of results.

The four forms assessed in this report each use a different method of handling potential biases. The SIR-II controls for the influence of potentially biasing or confounding variables through reference to appropriate comparative data. Ideally, for our purposes, this would consist of data from two-year institutions. ETS most recently published SIR-II comparative data from two-year colleges and universities covering 1995 to 1997 (ETS, 1998). ETS also encourages institutions using the SIR-II to collect data on local norms, which serve as an additional reference in interpreting evaluations. Studies involving the original SIR (Centra, 1976) indicate that potential biases influenced ratings only weakly, even when statistically significant. Student motivation was most highly correlated with student ratings, and ETS recommends that this be taken into account in comparative interpretation of SIR data. The IDEA provides both unadjusted student rating scores and adjusted scores, which reflect the statistical control of ratings for variables that may bias results (including class size, student motivation, course difficulty, student effort, and other motivational influences). Although IDEA does not maintain a comparative database, it does assist in the collection of data for the purpose of establishing local norms. The AZTEQ technical report indicates that biasing factors identified in the research literature have a "small magnitude" of effect on their instrument so long as comparisons among instructors or courses control for the following variables: course discipline and content, class size, course level, and whether the course is required or elective. For the purposes of institutional reporting, AZTEQ provides for the collection of local comparative data. The Purdue Cafeteria System software can be purchased and includes a normative data file for its items. Item norms are based on the performance evaluations of all Purdue faculty who have used the Cafeteria System items since 1974.
The system also provides a routine that collects local data. Comparisons are made based on local or system norms, and no discussion of controlling for biases was provided. Thus either the SIR-II or the IDEA allows relatively unbiased conclusions to be drawn; over time, the AZTEQ may as well.
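The kind of statistical adjustment the IDEA applies can be sketched as a regression adjustment: regress raw ratings on potentially biasing variables and remove their estimated contribution. The data, variable names, and model below are entirely hypothetical and do not represent IDEA's actual adjustment procedure.

```python
# Hypothetical regression adjustment: remove the estimated contribution of
# confounds (here, student motivation and class size) from raw ratings.
# All data are synthetic.
import numpy as np

rng = np.random.default_rng(1)
n = 500
motivation = rng.normal(size=n)          # hypothetical confound
class_size = rng.normal(size=n)          # hypothetical confound (standardized)
true_effectiveness = rng.normal(size=n)  # what we actually want to measure
noise = rng.normal(size=n)
ratings = 3.8 + 0.4 * motivation - 0.1 * class_size + 0.5 * true_effectiveness + 0.3 * noise

# Fit ratings ~ intercept + motivation + class_size by least squares
X = np.column_stack([np.ones(n), motivation, class_size])
beta, *_ = np.linalg.lstsq(X, ratings, rcond=None)
adjusted = ratings - X[:, 1:] @ beta[1:]   # remove the confounds' contribution

# Adjusted scores should track effectiveness more closely than raw ones
print(round(np.corrcoef(adjusted, true_effectiveness)[0, 1], 2))
print(round(np.corrcoef(ratings, true_effectiveness)[0, 1], 2))
```

The same idea underlies the choice between adjusted scores and comparative norms: both aim to prevent an instructor from being penalized (or credited) for variables outside the teaching-effectiveness construct.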



Cost of Measurement

The costs of using the instruments discussed in this report vary widely across a number of variables, including the number of forms, the number of classes, the customization of forms, the bulk of the processing, the number and type of reports requested, the provision of data discs, and the purchase of system software and technical support. Table 2 summarizes current price information.



Limitations of Student Ratings, Students as Raters, and

Application of Ratings Information


Reviewers and meta-analysts clearly agree that evidence supports the assertion that student ratings are related to teaching effectiveness. In fact, "student ratings are the single most valid source of data on teaching effectiveness" (McKeachie, 1997, p. 1219). In addition, well-developed and tested student rating forms of teaching effectiveness, such as those discussed in this report, exhibit both reliability and validity. While these forms provide valuable, useful, and reliable information, some of the limitations of student ratings, of students as raters, and of the application of ratings information bear further discussion.

Ratings in general are inherently subject to two weaknesses, the "error of central tendency" and the "halo effect," which tend to reduce discrimination among individuals and represent subdued estimates of effect (Anastasi & Urbina, 1997). The error of central tendency occurs because most people tend to avoid the extremes in rating, so ratings tend to accumulate in the center of the scale. Thus ratings of "moderately effective" or "somewhat ineffective" may present more modest estimates of teaching effectiveness than are justified. The "halo effect" refers to the tendency of raters to be unduly influenced by a favorable or unfavorable general opinion of the person being rated, and then to let that opinion color all specific ratings. The halo effect causes raters to make less differentiation between the specific strengths and weaknesses of instructors or courses than warranted.

Students’ ratings of their own learning and of the instructor’s techniques (after adjustment for known confounds) have acceptable validity. However, Cashin (1989) concluded that students are not qualified to judge a number of other factors that characterize exemplary instruction:

    1. The appropriateness of the instructor's objectives
    2. The relevance of assignments or readings
    3. The degree to which subject matter content was balanced and up-to-date
    4. The degree to which grading standards were unduly lax or severe

Although these issues can form essential components of a comprehensive evaluation of teaching effectiveness, they may require methods other than student ratings to address them.

Student ratings are valuable indicators of teaching effectiveness. They provide constructive information to help guide the improvement efforts of instructors, departments, and institutions. Meta-analysis (Cohen, 1980) shows that ratings feedback is related to improved teaching. However, the greatest increases in teaching effectiveness were found when instructors received not only feedback on student ratings, but a combination of ratings feedback and consultation (type of consultation varied across the studies in the meta-analysis). Thus student ratings provide the most help when combined in a comprehensive program including a variety of evaluation tools and systematic faculty development.






Abrami, P. C. (1989a). How should we use student ratings to evaluate teaching? Research in Higher Education, 30, 221-227.

Abrami, P. C., & d’Apollonia, S. (1990). The dimensionality of ratings and their use in personnel decisions. In M. Theall, & J. Franklin (Eds.), Student ratings of instruction: Issues for improving practice: New Directions for Teaching and Learning, No. 43 (pp. 97-111). San Francisco: Jossey-Bass.

Abrami, P. C., & d’Apollonia, S. (1991). Multidimensional students’ evaluations of teaching effectiveness: Generalizability of "N = 1" research; comments on Marsh (1991). Journal of Educational Psychology, 83, 411-415.

Abrami, P. C., Perry, & Leventhal (1982). The relationship between student personality characteristics, teacher ratings, and student achievement. Journal of Educational Psychology, 74, 111-125.

Aleamoni, L. M. (1987). Typical faculty concerns about student evaluation of teaching. In L. M. Aleamoni (Ed.), Techniques for evaluation and improving instruction. New directions for teaching and learning, no. 31. San Francisco: Jossey-Bass.

Aleamoni, L. M. & Hexner, P. Z. (1980). A review of the research on student evaluation and a report on the effect of different sets of instructions on student course and instructor evaluation. Instructional Science, 9, 67-84.

Anastasi, A. & Urbina, S. (1997). Psychological Testing, 7th edition. Upper Saddle River, NJ: Prentice Hall.

Arreola, R. A. (1995). Developing a comprehensive faculty evaluation system: A handbook for college faculty and administrators on designing and operating a comprehensive faculty evaluation system. Bolton, MA: Anker Publishing Co.

Braskamp, L. A., & Ory, J. C. (1994). Assessing faculty work: Enhancing individual and institutional performance. San Francisco: Jossey-Bass.

Braskamp, L. A., Ory, J. C., & Pieper, D. M. (1981). Student written comments: Dimensions of instructional quality. Journal of Educational Psychology, 73, 65-70.

Cashin, W. E. (1988). Student ratings of teaching: A summary of the research. IDEA Paper No. 20. Manhattan, KS: Kansas State University, Center for Faculty Evaluation and Development.

Cashin, W. E. (1989). Defining and evaluating college teaching. IDEA Paper No. 21. Manhattan, KS: Kansas State University, Center for Faculty Evaluation and Development.

Cashin, W. E. (1995). Student ratings of teaching: The research revisited. IDEA Paper No. 32. Manhattan, KS: Kansas State University, Center for Faculty Evaluation and Development.

Centra, J. A. (1976). Two studies on the validity of the Student Instructional Report. Student Instructional Report no. 4, Princeton, NJ: Educational Testing Service.

Centra, J. A. (1993). Reflective faculty evaluation: Enhancing teaching and determining faculty effectiveness. San Francisco: Jossey-Bass.

Centra, J. A. (1998). The development of the Student Instructional Report II. Princeton NJ: Educational Testing Service.

Cohen, P. A. (1980). Effectiveness of student-rating feedback for improving college instruction: A meta-analysis of findings. Research in Higher Education, 3, 321-341.

Costin, F., Greenough, W. T., & Menges, R. J. (1971). Student ratings of college teaching: Reliability, validity, and usefulness. Review of Educational Research, 41, 511-535.

d’Apollonia, S., & Abrami, P. C. (1997). Navigating student ratings of instruction. American Psychologist, 52, 1198-1208.

Davis, B. G. (1993). Tools of teaching. San Francisco: Jossey-Bass.

Educational Testing Service (ETS), (1998). Student Instructional Report II Comparative Data. Princeton NJ: Educational Testing Service.

Feldman, K. A. (1976a). Grades and college students’ evaluations of their courses and teachers. Research in Higher Education, 4, 69-111.

Feldman, K. A. (1976b). The superior college teacher from the students’ view. Research in Higher Education, 5, 243-288.

Feldman, K. A. (1977). Consistency and variability among college students in rating their teachers and courses: A review and analysis. Research in Higher Education, 6, 233-274.

Feldman, K. A. (1978). Course characteristics and college students’ ratings of their teachers: What we know and what we don’t. Research in Higher Education, 9, 199-242.

Feldman, K. A. (1979). The significance of circumstances for college students’ ratings of their teachers and courses. Research in Higher Education, 10, 149-172.

Feldman, K. A. (1984). Class size and college students’ evaluations of teachers and courses: A closer look. Research in Higher Education, 21, 45-116.

Feldman, K. A. (1987). Research productivity and scholarly accomplishment of college teachers as related to their instructional effectiveness: A review and exploration. Research in Higher Education, 26, 227-298.

Feldman, K. A. (1989b). The association between student ratings of specific instructional dimensions and student achievement: Refining and extending the synthesis of data from multisection validity studies. Research in Higher Education, 30, 583-645.

Feldman, K. A. (1992). College students’ views of male and female college teachers: Part I-Evidence from the social laboratory and experiments. Research in Higher Education, 33, 317-375.

Feldman, K. A. (1993). College students’ views of male and female college teachers: Part II-Evidence from students’ evaluations of their classroom teachers. Research in Higher Education, 34, 151-211.

Howard, G. S. & Maxwell, S. E. (1980). The correlation between student satisfaction and grades: A case of mistaken causation? Journal of Educational Psychology, 72, 810-820.

Howard, G. S. & Maxwell, S. E. (1982). Do grades contaminate student evaluations of instruction? Research in Higher Education, 16, 175-188.

Hoyt, D. P. (1973). Measurement of instructional effectiveness. Research in Higher Education, 1, 367-378.

Kulik, J. A., & McKeachie, W. J. (1975). The evaluation of teachers in higher education. In F. N. Kerlinger (Ed.), Review of research in education (Vol. 3, pp. 210-240). Itasca, IL: F. E. Peacock.

Marsh, H. W. (1984). Students’ evaluations of university teaching: Dimensionality, reliability, validity, potential biases and utility. Journal of Educational Psychology, 76, 707-754.

Marsh, H. W. (1987). Students’ evaluations of university teaching: Research findings, methodological issues, and directions for future research. International Journal of Educational Research, 11, 253-388.

Marsh, H. W., & Dunkin, M. (1992). Students’ evaluations of university teaching: A multidimensional perspective. In J. C. Smart (Ed.), Higher Education: Handbook of theory and research (Vol. 8, pp. 143-233). New York: Agathon.

Marsh, H. W., & Hocevar, D. (1991). Students’ evaluations of teaching effectiveness: The stability of mean ratings of the same teachers over a 13-year period. Teaching & Teacher Education, 7, 303-314.

Marsh, H. W., Overall, J. U., & Kesler, S. P. (1979). Validity of student evaluations of instructional effectiveness: A comparison of faculty self-evaluations and evaluations by their students. Journal of Educational Psychology, 71, 149-160.

Marsh, H. W., & Roche, L. A. (1997). Making students’ evaluations of teaching effectiveness effective: The critical issues of validity, bias and utility. American Psychologist, 52, 1187-1197.

Marsh, H. W., & Ware, J. A. (1982). Effects of expressiveness, content coverage and incentive on multidimensional student rating scales: New interpretations of the Dr. Fox effect. Journal of Educational Psychology, 74, 126-134.

McKeachie, W. J. (1979). Student ratings of faculty: A reprise. Academe, 65, 384-397.

McKeachie, W. J. (1994). Teaching tips: Strategies, research, and theory for college and university teachers. (9th ed.). Lexington, MA: D. C. Heath.

Murray, H. G. (1983). Low-inference classroom teaching behaviors and student ratings of college teaching effectiveness. Journal of Educational Psychology, 75, 138-149.

Ory, J. C., Braskamp, L. A., & Pieper, D. M. (1980). Congruency of student evaluative information collected by three methods. Journal of Educational Psychology, 72, 321-325.

Overall, J. U., & Marsh, H. W. (1980). Students’ evaluations of instruction: A longitudinal study of their stability. Journal of Educational Psychology, 72, 181-185.

                                          SIR II             IDEA (long form)       AZTEQ (long form)      Purdue Cafeteria

Ordering Information
Custom form (one-time set-up fee)                                                   <$500
Custom form (one-time programming fee)                                              $50.00/hr
Cost/questionnaire                        ($0.26)            $0.30 (<1,000)         $0.15 (1,000-10,000)
                                                             $0.24 (1,000-4,950)    $0.10 (10,000-20,000)
                                                             $0.18 (5,000-9,500)    $0.08 (20,000-50,000)
                                                             $0.15 (10,000-19,500)  $0.05 (>50,000)
                                                             $0.12 (>=20,000)
Questionnaires/pkg                        100 @ $25.50
Order minimum                             3 pkgs
Cover sheet/class                         1                  1
Instructor's guide/instructor             1                  1
Bulk discount                             -10% (>10,000)
Includes shipping/handling?               ?                  not incl.              ?                      ?

Processing Information
Price/questionnaire                       $0.47                                     $0.10                  $0.56
Price/class                                                  $7.25 (1-24)           $2.00
                                                             $6.50 (25-49)
                                                             $5.75 (50-99)
                                                             $5.00 (100-199)
                                                             $4.25 (200-299)
                                                             $3.50 (>=300)
Bulk discount                             -5% (>10,000)
Batch charge (per batch processed)                           $10.00
Clerical correction                                                                 $15.00/hr
Reports/class                             3                                         2
Additional report copies                                                            $0.50
Combined reports/class (included)         $0.60
Data disks (institutional results)        1 @ $75.00
Includes shipping/handling?               ?                  not incl.              20% (not incl.)        not incl.

Purchase of Programs
4 programs, user manual, software                                                                          $955.00
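To make the tiered processing prices concrete, the arithmetic can be sketched in a few lines of Python. This is our own illustrative calculator, not vendor software; the tier boundaries and per-class prices are copied from the IDEA column of the table above, while the function name and structure are assumptions for illustration.

```python
def idea_price_per_class(num_classes):
    """Return the IDEA long-form per-class processing price for an order size.

    Tiers are taken from the vendor pricing table; orders of 300 or more
    classes fall through to the lowest rate.
    """
    tiers = [
        (1, 24, 7.25),
        (25, 49, 6.50),
        (50, 99, 5.75),
        (100, 199, 5.00),
        (200, 299, 4.25),
    ]
    for low, high, price in tiers:
        if low <= num_classes <= high:
            return price
    return 3.50  # >= 300 classes

# Example: processing 150 class sections at $5.00 each totals $750.00.
total = 150 * idea_price_per_class(150)
```

A sketch like this makes it easy to compare total processing cost across instruments for a projected semester's section count before committing to a vendor.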

Student Instructional Report II (SIR-II)
This questionnaire gives students the chance to comment anonymously on a particular course and the way it was taught. Using the rating scale below, students mark the one response for each statement that is closest to their view. (Bubble forms are provided for administration of this questionnaire.)
(5) = Very Effective
(4) = Effective
(3) = Moderately Effective
(2) = Somewhat Effective
(1) = Ineffective
(0) = Not Applicable, not used in the course, or you don't know. In short, the statement does not apply to the course or instructor.
As students respond to each statement, they are asked to think about each practice as it contributed to their learning in the course being evaluated.
A. Course Organization and Planning
1. The instructor's explanation of course requirements.
2. The instructor's preparation for each class period.
3. The instructor's command of the subject matter.
4. The instructor's use of class time.
5. The instructor's way of summarizing or emphasizing important points in class.
B. Communication
6. The instructor's ability to make clear and understandable presentations.
7. The instructor's command of spoken English (or the language used in the course).
8. The instructor's use of examples or illustrations to clarify course material.
9. The instructor's use of challenging questions or problems.
10. The instructor's enthusiasm for the course material.
C. Faculty/Student Interaction
11. The instructor's helpfulness and responsiveness to students.
12. The instructor's respect for students.
13. The instructor's concern for student progress.
14. The availability of extra help for this class (taking class size into account).
15. The instructor's willingness to listen to student questions and opinions.
D. Assignments, Exams, and Grading
16.The information given to students about how they would be graded.
17.The clarity of exam questions.
18. The exams' coverage of important aspects of the course.
19. The instructor's comments on assignments and exams.
20.The overall quality of the textbook(s).
21.The helpfulness of assignments in understanding course material.
Many different teaching practices can be used during a course. In this section (E), students rate only those practices that the instructor included as part of the course being evaluated.
Students are asked to rate the effectiveness of each practice used as it contributed to their learning.
E. Supplementary Instructional Methods
22. Problems or questions presented by the instructor for small group discussions.
23. Term paper(s) or project(s).
24. Laboratory exercises for understanding important course concepts.
25. Assigned projects in which students worked together.
26. Case studies, simulations, or role playing.
27. The instructor's use of computers as aids in instruction.
For the next two sections (F & G), students use the rating scale below. They are asked to mark the one response for each statement that is closest to their view.
(5) = Much More Than most courses
(4) = More Than most courses
(3) = About the Same as other courses
(2) = Less Than most courses
(1) = Much Less Than most courses
(0) = Not Applicable, not used in the course, or you don't know. In short, the statement does not apply to the course or the instructor.

F. Course Outcomes
29. My learning increased in this course.
30. I made progress toward achieving course objectives.
31. My interest in the subject area has increased.
32. This course helped me to think independently about the subject matter.
33. This course actively involved me in what I was learning.

G. Student Effort and Involvement
34. I studied and put effort into the course.
35. I was prepared for each class (writing and reading assignments).
36. I was challenged by this course.

H. Course Difficulty, Workload, and Pace
37. For my preparation and ability, the level of difficulty of this course was:
Very elementary
Somewhat elementary
About right
Somewhat difficult
Very difficult
38. The work load for this course in relation to other courses of equal credit was:
Much lighter
About the same
Much heavier
39. For me, the pace at which the instructor covered the material during the term was:
Very slow
Somewhat slow
Just about right
Somewhat fast
Very fast
I. Student Information
40. Which one of the following best describes this course for you?
A major/minor requirement
A college requirement
An elective
41. What is your class level?
Freshman/1st year
Sophomore/2nd year
42. Sex: Female / Male
43. What grade do you expect to receive in this course? (Responses range from A to Below C.)
44. Do you communicate better in English or in another language?
Better in English
Better in another language
Equally well in English and another language
J. Overall Evaluation
45. Rate the quality of instruction in this course as it contributed to your learning (try to set aside your feelings about the course content). Scale: Ineffective, Somewhat ineffective, Moderately effective, Effective, and Very effective.
K. Supplementary Questions
The SIR-II provides space for instructors to include up to 10 supplementary questions in this section of the questionnaire.
L. Student Comments
If you would like to make additional comments about the course or instruction, use a separate sheet of paper. You might elaborate on the particular aspects you liked most as well as those you liked least, and how the course or the way it was taught can be improved. An additional form may be provided for your comments. Please give these comments to the instructor.
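As a brief illustration of how the rating scales above are typically summarized (this sketch is our own, not part of the SIR-II materials), dimension scores can be computed by averaging a student's 1-5 ratings on the items in a section, with 0 (Not Applicable) responses excluded as the scale definition requires. The function name and the sample data below are hypothetical.

```python
def dimension_mean(responses):
    """Average the 1-5 ratings for one dimension, ignoring 0 (Not Applicable).

    Returns None if every item in the dimension was marked Not Applicable.
    """
    rated = [r for r in responses if 1 <= r <= 5]
    if not rated:
        return None
    return sum(rated) / len(rated)

# Hypothetical ratings on items 1-5 (Course Organization and Planning);
# the 0 is a Not Applicable response and is excluded from the average.
sample = [5, 4, 0, 3, 5]
mean = dimension_mean(sample)  # (5 + 4 + 3 + 5) / 4 = 4.25
```

Excluding Not Applicable responses before averaging keeps an instructor's dimension score from being dragged down by items that simply did not apply to the course.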