Educational Theory 24 (Winter 1974) pp. 52-60

Measurability and Educational Concerns
by Edward G. Rozycki

edited 9/27/14

Introduction: squaring the circle
Criteria of Measurability
One Cannot Write Behavioral Objectives

Voluntary Behavior is not Measurable
To Conclude
Review & Discussion Questions

Introduction: Squaring the Circle

District's failure to write a measurable written language goal was not a fatal procedural error and did not constitute a denial of FAPE -- I.P. and Centennial, DP 00-115 [1a]

The measurability of goals and outcomes has continued over the last half century to be a vexing problem for educators at all levels of the enterprise. But it is not necessarily lack of technical skill that impedes their efforts. Measuring some kinds of educational outcomes is very much like the effort to square the circle with a straightedge and compass: its failure derives from conceptual rather than technical issues.

Thus, this essay is about what can and cannot be measured. It is intended to bear on problems of evaluation. By a measurement procedure I will mean a procedure the outcomes of which -- called data -- can be quantified.[1b] What will be said about measurability will bear on those evaluation procedures in which measurement plays an important role. Our conclusions will be relevant to problems of curriculum and goal evaluation; but they will also be shown to be important in answering the more general questions as to what shall be taught and how. This is because certain curricular options derive from mistaken notions about the nature of human behavior and of educational concern.

To indicate the directions we will take, let us consider a statement from Robert E. Stake's "The Countenance of Educational Evaluation":

. . . the responsibility for describing curricular objectives is (that)... of the (curriculum) evaluator. . . it is his responsibility to transform the behavior of a teacher and the responses of a student into data. . . (Also) it is his responsibility to transform the intentions and expectations of an educator into "data."[2]

Note that Stake distinguishes the "data" of intentions and expectations from the data -- one can almost hear "hard data" -- of teacher-pupil behavior. This is a futile distinction. If by "data" is meant "the outcomes of a measurement procedure," then educationally relevant teacher-pupil behavior does not provide any firmer data than intentions and expectations.

According to Paul Whitmore,

The statement of objectives of a training program must denote measurable attributes observable in the graduate of the program, or otherwise it is impossible to determine whether or not the program is meeting the objectives. [3]

Unhappily, this statement is not merely false; it is preposterous. These introductory remarks are based on the assumption that the quoted authors, when speaking of data and measurement, mean what we have set them out to be above. Only this "hard Science" usage can invest their claims with an interest commensurate with the self-assuredness with which they have been advanced. It will be our endeavor to demonstrate that such claims are founded on grand-scale confusion.

Criteria of Measurability

Our criteria will be simple, in accord with our common sense, but -- where possible -- technically accurate. Roughly stated, our first criterion is this: one must be able to unambiguously classify and tally the simple outcomes of an alleged measurement procedure. We can state criterion #1 more formally as

#1. The Criterion of Partitionability: Simple outcomes must be classifiable into categories -- sometimes called "elementary events" -- which are both mutually exclusive and exhaustive of all possible kinds of outcome. [4]

For example, if we know that an outcome is of type A, we must know that consequently it is of no other type that contrasts with type A. (It could belong to a sub-type X which is completely contained in A.) Also, all outcomes must be classifiable.

Let us consider two examples of alleged measurement procedure, one of which meets criterion #1. For the first, someone is using a yardstick to assign lengths to different tables. The outcomes of his procedure are classified as, say, x-inches, y-inches, etc. Every outcome is some number of inches. No outcome is both x and y inches. Criterion #1 has been met. [5]

The second alleged measurement instrument is the procedure for using the Amidon-Flanders Categories of Interaction Analysis.[6] All teacher verbal behavior in the classroom is categorized into one of eight types:

1, accepts feeling; 2, praises or encourages; 3, accepts or uses ideas of student; 4, asks questions; 5, lectures; 6, gives directions; 7, criticizes or justifies authority; 10, silence or confusion.

But if a teacher says to a student, "Are you always so jumpy?" this, treated as an outcome, may be classified either as 7, or 4, or even 1. Thus, the interaction analysis categories cannot found a measure. (The reader who is familiar with the Amidon-Flanders system may be able to raise some objections here, making reference to certain procedures that are used to resolve this problem. We will re-examine this system later and show that in fact the problem is not resolved).
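Criterion #1 can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the function name `is_partition`, the table labels, the lengths, and the way the one utterance is assigned to categories are all hypothetical, chosen to mirror the yardstick and the "Are you always so jumpy?" examples. It is not a rendering of the Amidon-Flanders procedure itself.

```python
def is_partition(outcomes, categories):
    """Return True if every outcome falls into exactly one category.

    `categories` maps a category label to the set of outcomes it admits.
    An outcome with no category violates exhaustiveness; an outcome with
    more than one violates mutual exclusivity.
    """
    for outcome in outcomes:
        hits = [label for label, members in categories.items() if outcome in members]
        if len(hits) != 1:
            return False
    return True


# Hypothetical yardstick readings: each table is assigned exactly one length.
tables = ["table_a", "table_b"]
lengths = {"30 inches": {"table_a"}, "42 inches": {"table_b"}}

# Hypothetical coding of one utterance that fits categories 1, 4 and 7 at once.
utterances = ["Are you always so jumpy?"]
interaction = {
    "1 accepts feeling": {"Are you always so jumpy?"},
    "4 asks questions": {"Are you always so jumpy?"},
    "7 criticizes or justifies authority": {"Are you always so jumpy?"},
}

yardstick_ok = is_partition(tables, lengths)            # one category per outcome
interaction_ok = is_partition(utterances, interaction)  # three categories for one outcome
```

On these assumed inputs the yardstick classification passes the check and the interaction categories do not, which is all that criterion #1 asks us to notice.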

Criterion #2 -- again, roughly put -- will be that the identity of the object measured must not be destroyed by performance of the alleged measurement procedure, e.g. no identity criterion for the object measured can be incompatible with criterion #1.

#2. The Criterion of Construct Validity: Given an alleged measurement procedure, the identity of the (possibly hypothetical) object to be measured must remain invariant through (repeated) performance of the procedure.[7]

Obviously, measuring the length of a table with a yardstick does not change the table. The way we identify the table in the first place -- although informal (Could it be otherwise?) -- is independent of the procedure for measuring its length. The way we identify teacher-behavior in the Amidon-Flanders system may obscure the identity of what is being measured, for depending on whether "Are you always so jumpy?" is classified syntactically, in terms of teacher intent or in terms of student-uptake, we may get different classifications. What loses its identity here is what the teacher did; that he "category-7-ed" may not tell us. (This is not the discussion promised above; we are not yet finished with Amidon and Flanders.)

Criterion #2 must not be taken to mean that the object to be measured must remain intact physically throughout the measurement procedure; only that its identity -- which is not a "physical" thing -- remain so. The identity is preserved if the proper historical relationship between the outcomes and the object is preserved. The pieces of ash being weighed on the scale must be known to have come from pieces of the specimen to be so measured.[8]

Suppose we wanted to use some form of the Iowa Tests of Basic Skills to measure scholastic achievement for the population of a certain school. However, we devise the following procedure: We give out answer blanks to every student entering the lunchroom telling him that if he hands in a filled-in form at the cafeteria counter, he will receive a free lunch. What is to be measured is the individual pupil's achievement in certain basic skills. Suppose further that we receive back a filled-in answer sheet from each student which in fact correlates well with his grade-point average; repeating the procedure we establish the reliability of the measure. We are nonetheless still disinclined to say we are measuring individual scholastic achievement -- especially if not every student need fill in his own answer sheet. We might with justification say we were measuring something, but it would certainly not be individual scholastic achievement.

Problems of construct validity only arise where we have a tradition of identification that is independent of a particular measurement procedure. We needn't worry about it for, say, the Stanford-Binet if all that such testing were being used for were its predictive validity, e.g. to indicate the probability of future academic success. What Stanford-Binet measures does not seem to be identifiable independent of the test. And for purposes of prediction it does not matter. But, as concerns types of human behavior, we do have ways of identifying them independently of procedures that purport to measure them. And our educational and moral concerns derive entirely from these traditional ways of classifying human behavior.[9]

As educators, we are interested in voluntary rather than reflex behavior. We are concerned with responsibility, self-control and intention. We can identify behavior which exhibits these qualities with fairly high reliability. If certain measurement procedures cannot, this still gives us no cause to disavow our interests. [10] To sum up: if for some object, criterion #1 and criterion #2 cannot both be met, then it is in principle not measurable. Identity conditions must be compatible with partitionability requirements.


There are few concepts more confused -- and confusing -- than behavior. Robert F. Mager, for example, defines it as "overt action" but gives no criterion for overtness.[11] B. F. Skinner insists that behavior "must be described in physical terms," [12] but problems arise exactly at that point where one tries to determine what the word "physical" restricts one to. D. O. Hebb offers perhaps the clearest definition in this tradition: "Behavior is the publicly observable activity of muscles or glands of secretion as manifested in movements of parts of the body or in the appearance of tears, sweat, saliva and so forth."[13] Behavior by this definition meets criterion #1; the instruments used to assign numbers to quantities of sweat, motion, etc., do work with partitionable outcomes. But can it set out distinctions we as educators find relevant, e.g. attempting and feigning an attempt? Can criterion #2 be met?

Mager suggests[14] that it is not merely motion or secretion per se but motion or secretion in certain conditions which defines behavior. But if we want to be able to speak of response generalization we must clearly set out which are the conditions definitive of a type of behavior and which constitute the "new" circumstances to which the behavior has been generalized. And educationally relevant classifications of behavior are determined by considerations other than the movement of the person whose behavior is being classified.

A clear example can be given as follows: suppose we have two persons standing together at normal speaking distance, facing each other. Call them Harry and John. Some noise issues from Harry. Consider the following possible descriptions of Harry's behavior:

a. Harry emitted the sound-sequence: /2aym+ gowing+3 hówm1 /.[15]

b. Harry said, "I'm going home."

c. Harry told John he was going home.

d. Harry informed John that he was going home.

e. Harry surprised John with the statement that he was going home.

We can easily imagine a situation where all of these descriptions are true of what Harry is doing. But given a -- which is the "physical" description of Harry's behavior in b, c, d and e -- neither b nor c nor d nor e need be true. If Harry is a babbling idiot, a might be true and none of the rest. If Harry is reciting aloud a line from a script, a and b might be true and none of the rest. If John already knew that Harry was going home, a, b, and c might be true but none of the rest. If John is never surprised by anything Harry does, but did not already know he was going home, a, b, c, and d but not e might be true. The behavior categories -- verbs to the uninitiated -- which correspond to a, b, c, d and e, respectively, are utter, say, tell (assert), inform and surprise. These are obviously not mutually exclusive categories.
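The cases just described amount to a small truth table, and it may help to see it laid out. The following Python sketch assumes nothing beyond the paragraph above, except for the scenario labels and the helper name `overlapping_pairs`, which are conveniences of ours. It records which of the five verbs are true of Harry in each case and collects every pair of verbs that is true together at least once.

```python
# For each scenario, the set of verbs that truly describe what Harry did.
# Scenario labels are hypothetical; the truth values follow the text.
SCENARIOS = {
    "babbling": {"utter"},
    "reciting a script": {"utter", "say"},
    "John already knew": {"utter", "say", "tell"},
    "John is never surprised": {"utter", "say", "tell", "inform"},
    "all descriptions true": {"utter", "say", "tell", "inform", "surprise"},
}


def overlapping_pairs(scenarios):
    """Return every pair of categories that is true together in some scenario."""
    pairs = set()
    for true_categories in scenarios.values():
        ordered = sorted(true_categories)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs


overlaps = overlapping_pairs(SCENARIOS)
# Mutually exclusive categories would leave this set empty.
mutually_exclusive = len(overlaps) == 0
```

Because one and the same act can fall under several of these verbs, the set of overlapping pairs is not empty, and the five categories cannot serve as the cells of a partition.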

We can distinguish between saying and telling in terms of the intent of the actor to communicate. Informing and surprising depend on hearer-uptake, saying and telling do not. But saying and telling can be distinguished from merely uttering in terms of speaker intent. We have thus three rough classes of categories: motion -- which includes utter; speaker-intent -- which contains say and tell; and hearer-uptake -- which contains inform and surprise.[16] Furthermore, inform depends upon a prior state of hearer ignorance.

Let us focus on the distinction between speaker-intent and hearer-uptake. (In another medium of communication these become writer-intent and reader-uptake.) Note that the subject of the verb remains the same for both classes, e.g. Harry informs John as well as Harry tells John. If we indiscriminately mix categories from the two classes we cannot get a partition, i.e. a set of mutually exclusive, classificationally exhaustive categories. We would therefore -- by criterion #1 -- not have a measure.

On this count Amidon and Flanders fail again. In their system we have overlapping categories. If by redefinition of the concepts an attempt is made to achieve mutual exclusivity, the system cannot meet criterion #2. One who would use their system faces a double problem: if the conclusion is reached that fifty-per-cent of the teacher's behavior in a given period of time was classified as "5", one cannot still say that in fact the teacher lectured half of the time. Furthermore, what could the teacher know to do if we were to direct him not to "lecture," i.e. "emit category-5-type behavior," so much?[17]

One Cannot Write Behavioral Objectives

In much the same manner, a simplistic, naive notion of behavior provides the foundation for the confusion that goes by the name of Performance-Based, or Behavioral, Objectives. A careful examination of Robert F. Mager's book, Preparing Instructional Objectives, leads one to the seemingly paradoxical conclusion that by Mager's own criteria, one cannot write behavioral objectives. One has written a behavioral objective, according to Mager,[18] when one has specified what the learner will be doing when he is demonstrating his newly acquired competence. I, as a student of Mager, will be demonstrating that I have learned to write behavioral objectives by writing behavioral objectives.

But I am writing behavioral objectives only if I am writing statements that "communicate . . . [my] . . . intent"[19] as a writer of the statement of objective. This confuses entirely the distinction between writer-intent and reader-uptake. I cannot per se write statements that communicate my intent because written statements do not per se communicate intent. It is I, the writer, who might communicate my intent through my written statements to a reader. In the absence of a reader, I can only hope that the future communication will in fact take place. "Writing statements that communicate intent" is thus not a description of present writer behavior, but rather an obliquely stated hypothesis about the future effects of present writing behavior.

Communicate is a verb much like inform; it contrasts with tell and write in that it presumes reader (hearer) uptake. To put it again in Mager's terms[20], there is no terminal behavior describable as communicating, therefore there is no terminal behavior describable as writing communicatively. Thus no one can write behavioral objectives as terminal behavior.

We can go on to formulate a general objection to the whole program. Let us restrict the term terminal behavior to that characterized using verbs like say, tell, utter, etc. where the truth of the statement John did such-and-such is not dependent upon other persons or external conditions in the way the truth of John informed Harry ... or John communicated to Harry ... is. Thus, if saying or telling is terminal, informing and communicating are post-terminal.

We may look on the relationship of terminal to post-terminal behavior as roughly that of an attempt to a success. Telling is, under normal circumstances, trying to inform -- or, at least, to communicate. Under normal conditions the attempt succeeds. But testing constitutes a special set of conditions. A test provides a "normalized" environment so that only certain attempts can achieve success. It is only this success, described post-terminally, which can be an educational objective.

We expect the behavior of the student, the attempt, to adjust itself as it needs to the special conditions of the moment. An indeterminable variety of terminal behavior may be a propos of achieving an educational objective. There is no point in describing our objectives in terms of terminal behavior, for this terminal behavior must always be related to some post-terminal behavior in a manner that depends on the recognition of special circumstance. For this recognition we usually rely on the perspicacity and intelligence of the learner.

If we provide her with post-terminally described objectives and practice under varying conditions we can normally expect her to make the adjustment. But if we insist on specifying objectives in terms of terminal behavior, it would seem incumbent on us to teach the learner to identify the general circumstances in which that behavior is appropriate -- unless, of course, we are only teaching her to take tests. But now we are faced with a fantastic burden, if not an impossible one.


The theories of Mager and Amidon and Flanders are but minor variations on a more general theme the most influential proponent of which is B. F. Skinner. We have identified and examined the inconsistencies in Mager and Amidon and Flanders caused by their naive, unexamined notion of behavior. We will now argue that Skinner's entire program founders on the same confusion.

There is an interesting conceptual relationship between the notion of reinforcement and that of measurement. Briefly put, it is that some behavior can be conditioned only if it is measurable. Behavior that is, in principle, not measurable, cannot be conditioned. The general outline of our argument is as follows: Behavior of a given type, B, can be conditioned only if it can be reinforced. It can be reinforced only if the probability of its occurrence can be increased or decreased. But the probability of an event is a special kind of measure of that event; it requires that criterion #1 be met -- B must belong to a partition, or be a complex of partitionable events. Any categories which cannot meet criterion #1 -- by virtue, say, of special identity conditions which hold for them -- cannot be conditioned.
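One way to see what is at stake in the step from probability to partition is to tally. The sketch below is an assumption-laden illustration, not an account of any actual observation protocol: the function name `relative_frequencies` and both sets of codings are invented for the purpose. Relative frequencies over a partition sum to one and so can be read as a probability distribution over the categories; tallies over overlapping categories do not.

```python
from collections import Counter


def relative_frequencies(codings):
    """Tally every category assigned to each observation, then divide by
    the number of observations."""
    counts = Counter(label for labels in codings for label in labels)
    total = len(codings)
    return {label: count / total for label, count in counts.items()}


# Hypothetical codings of four observations, one category each (a partition).
partitioned = [{"4"}, {"5"}, {"5"}, {"6"}]

# The same four observations, but the first fits categories 1, 4 and 7 at once.
overlapping = [{"1", "4", "7"}, {"5"}, {"5"}, {"6"}]

partitioned_sum = sum(relative_frequencies(partitioned).values())
overlapping_sum = sum(relative_frequencies(overlapping).values())
```

With the assumed codings the first sum is one and the second exceeds one, so the second set of tallies cannot be taken as the probabilities of mutually exclusive elementary events.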

But if we try to restrict ourselves to categories which meet criterion #1, we cannot identify educationally important behavior and so criterion #2 is not met. Thus, it will be shown that educationally important behavior cannot be conditioned.[21]

We will begin by trying to construct a system of behavior categories -- in the manner of Amidon and Flanders -- which meet our two criteria. Recalling the discussion of the differences between uttering, saying, telling, informing and surprising, we can set up a few rules of thumb which might help us in the construction of our category-system. If we are successful we will have constructed a behavior-partition (hereafter, BP). We will have a set of "simple acts," as it were, of which the following is true: if someone is X-ing, he is therefore not Y-ing, if X and Y are in the BP and X is not Y; also, no matter what someone is doing, some category in our BP will serve to categorize it (or some combination of categories will).

The following will serve as initial rules:

a. If a person -- call him John -- can be said to be X-ing and in so doing also Y-ing, then not both X and Y can belong to the BP. For example, John can be said to be asserting that he is rich in saying, "I am rich," therefore not both assert and say can be in the BP.

b. If John can be said to be X-ing by Y-ing, then not both X and Y are in the BP. John could be said to be frightening Harry by grimacing, thus not both frighten and grimace can be in the BP.

However, we are not interested in just any behavior, but behavior of a particular kind called "voluntary behavior." It is only this which can manifest self-control, purpose and intent. Reflex and habit formation may be of some interest but they must stand relegated to a minor role in comparison to voluntary behavior and its treatment.

There are two conditions which must be met for some behavior, B, to be voluntary:

1) If John's behavior, B, is voluntary, then he must be able to refrain from it.

If John cannot help but B, this ipso facto removes it from the realm of voluntary behavior. Vis-à-vis our BP this means that for any X in the BP, X is voluntary only if R(X) -- read: "refrain from X" -- is in the BP also.

2) If John's behavior, B, is voluntary, then John must also be trying to B.

If John is actually B-ing then "trying to B" means "sustaining his B-ing." If John is not actually B-ing, then "trying to B" indicates that the goal of John's present activity is to achieve B-ing. (There is a sense of "try" wherein we can say of someone that he is B-ing without trying and we mean that he is B-ing effortlessly. We are not dealing with this sense.) What this second condition means vis-à-vis the BP is that for any X in the BP, X is voluntary only if there is a category, T(X) -- "trying to X" -- in the BP also. Notice that these conditions do not specify the criteria in terms of which we might identify trying and refraining behavior. They merely set out minimal conceptual requirements for identifying voluntary behavior. (Note that trying and refraining are per se voluntary; there is no conceivable description such as involuntarily trying or involuntarily refraining.)

The criteria by which we might identify John's trying and refraining behavior are our knowledge about John and about what he believes and knows; for example, if we know that John cannot whistle, there is nothing John could do that would count as his refraining from whistling. On the other hand, if we know that he believes that crossing his fingers wards off colds, this may warrant our saying on certain occasions of his crossing his fingers that he is trying to ward off a cold. One can only refrain from what is, in fact, possible. But one can try -- albeit futilely -- to do whatsoever one believes is possible.

Voluntary Behavior is Not Measurable

We can now argue conclusively that no system of categories containing categories of voluntary behavior can be a BP and thus meet criterion #1 for measurability.

First: if X is voluntary and in the BP, so is T(X), trying to X. But these can never be mutually exclusive -- for John to be X-ing voluntarily he must also be trying to X. But X and T(X) are not merely two ways of describing the same behavior since one can try to X without actually X-ing. Thus X and T(X) overlap.

Second: depending upon John's beliefs, any given act may be both trying to X and trying to Y -- killing two birds with one stone, as it is said. Thus we have other possibilities of category overlap.
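The first of these two points can be traced mechanically. The Python below is a minimal sketch under our own assumptions: the helper names `try_`, `refrain`, `close_under_voluntariness` and `forced_overlaps` are hypothetical, the category "whistle" is merely an example, and the closure is carried only one step (since trying and refraining are themselves voluntary, it could be repeated without end). It adds T(X) and R(X) for each voluntary X, as the two conditions require, and then lists the pairs that are forced to overlap.

```python
def try_(x):
    """Label for the category 'trying to X'."""
    return f"T({x})"


def refrain(x):
    """Label for the category 'refraining from X'."""
    return f"R({x})"


def close_under_voluntariness(voluntary):
    """Add T(X) and R(X) for each voluntary X (one step only)."""
    categories = set(voluntary)
    for x in voluntary:
        categories.add(try_(x))
        categories.add(refrain(x))
    return categories


def forced_overlaps(categories, voluntary):
    """Pairs (X, T(X)) present together: X-ing voluntarily entails trying to X,
    so the two categories cannot be mutually exclusive."""
    return {
        (x, try_(x))
        for x in voluntary
        if x in categories and try_(x) in categories
    }


voluntary = {"whistle"}  # hypothetical example of a voluntary category
bp_candidate = close_under_voluntariness(voluntary)
conflicts = forced_overlaps(bp_candidate, voluntary)
```

Any candidate BP that admits a voluntary category is thus obliged to admit a second category overlapping the first, whatever the particular categories happen to be.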

The unavoidable conclusion to be reached is that voluntary behavior is not measurable. It makes no sense to talk about the frequency of occurrence of a particular voluntary act. Thus there is no mathematical validity in talking about the probability of its happening. A voluntary act is not a mathematically stable category to which a probability can be assigned. Thus no voluntary act can be either positively or negatively reinforced, i.e. have the probability of its occurrence increased or decreased. One cannot condition voluntary behavior.[22]

I believe that the above is a conclusive demonstration that concepts dealing with responsible, self-controlled behavior -- the kind which is our main concern as educators -- cannot be accommodated to those of measurement. The reader might point out that the criteria I have used to identify voluntary behavior and trying and refraining behavior, in particular, would not be acceptable to a behaviorist -- and many a non-behaviorist -- psychologist. That is so; but they are the only relevant criteria. No characteristic of voluntary behavior is going to be discovered empirically which can replace the conceptual ones I have mentioned above. To think that could be possible is to misunderstand the relationship between conceptual and empirical criteria.

To explain: suppose it were discovered that -- over a long period of experimental trials -- whenever John X-ed voluntarily, a particular instrument reacted in a uniform way. Suppose further that when John X-ed involuntarily the instrument did not register. We might decide to "redefine" -- as it were -- voluntary behavior as "behavior which caused this certain instrument to register" -- call it R-behavior. This would in no way affect the argument I have presented above, because if by "voluntary behavior" were meant "R-behavior," we would be talking about something which would still require proof (not presumption) that it was behavior which one could both try to do and refrain from. (This proof would be particularly important in cases of liability or criminality.) If we did not wish to re-define, then even if we knew that John were R-behaving, it would still make good sense to ask if he were acting voluntarily. Again, the argument above is not affected.

To Conclude

Reinforcement theory has no general applicability; its criteria, its phenomena are not ours. [23] The gulf is a conceptual one and cannot be bridged. Our rewards and punishments are not meant to be -- nor need they be -- reinforcers,[24] for only involuntary behavior can be reinforced and punishing or rewarding such is either pointless or immoral. [25]

We do not present stimuli, we attempt to communicate. We do not seek to elicit responses; but rather, acts. One might object that this is a highly philosophical, heavily value-laden language. It is; but it is the only coherent one we have. The promise of the behavioral laboratories is a sham. Education has gone whoring after Pseudo-Science and ended up anything but pregnant.[26]

Just as there can be no educationally important behavior-modification lacking measurability -- and thus Reinforcement Theory -- so also are we denied that "data" into which Stake would have transformed the behavior and the responses of students and which was to have provided the foundation for evaluation of the educational enterprise.

It is too much to expect that the language of educational evaluation might be purged of references to "data" and "measures"; there is a large component of ritualism which leavens the discourse of educators, which firms up the "objectivity" of their judgment against the raging of the Heathen. We need not forego talk of "data" and "measure" so long as we are aware of what we are not saying and so long as we do not let it confound our theorizing. When it comes to theory, this "ritual leavening" is gas, and no more.



