# Assessing the reliability of differences

Assessing the reliability of differences is often necessary when comparing polar groups, that is, groups distinguished by how strongly a target feature (characteristic) of the phenomenon under study is expressed. The analysis usually begins with calculating the primary statistics of the selected groups; then the significance of the differences is assessed. Quantitative analysis is rarely limited to a single comparison: it often becomes necessary to make additional comparisons and look for new evidence. Choosing new criteria at random is a thankless task. It is better to use the results of correlation analysis for this.

For example, if you are studying how personality conditions the desire to take part in the environmental movement, polar groups can be distinguished by the subjects' self-assessments, by expert assessments, or by behavioral indicators expressed in numerical form. If the indicators of intellectual development correlate only weakly (correlation coefficient below 0.35) with the numerical indicators of this desire, then selecting polar groups by intellectual parameters is unlikely to be successful. Most likely, we will not find significant differences between these groups in the strength of the desire to participate in the environmental movement, and we will obtain no new data to clarify how personality conditions it.

One of the most common tasks in data processing is to assess the reliability of the differences between two or more series of values. Mathematical statistics offers a number of ways to do this. Most of the powerful criteria require additional, usually rather detailed, calculations.

Computer processing of data has now become the most common approach. Many statistical applications have procedures for estimating differences between parameters of the same sample or of different samples. When the material is processed entirely by computer, it is not difficult to run the appropriate procedure at the right time and evaluate the differences of interest. However, most psychologists do not have free and unlimited access to a computer: either there are too few computers available, or the psychologist is not trained as a computer user and can only carry out processing with the help of qualified personnel. In both cases, a typical computer session ends with the psychologist receiving printouts containing the calculated primary statistics, the results of correlation analysis, and sometimes factor (component) analysis.

The main analysis is carried out later, away from the computer. For this reason, we will assume that a psychologist often faces the task of assessing the reliability of differences using previously calculated statistics. When the mean values of a trait are compared, one speaks of the reliability (unreliability) of the differences between the arithmetic means; when the variability of indicators is compared, one speaks of the reliability (unreliability) of the differences between sigmas (dispersion) and between coefficients of variation.

The reliability of differences in arithmetic means can be assessed by a fairly effective parametric criterion, Student's t-test. It is calculated according to the formula

$$t = \frac{M_1 - M_2}{\sqrt{m_1^2 + m_2^2}},$$

where M1 and M2 are the values of the compared arithmetic means, m1 and m2 are the corresponding values of the statistical errors of the arithmetic means. The sign of the calculated difference between the arithmetic means can be ignored, since only the absolute value of the criterion t matters.

The values of Student's test t for three levels of significance (p) are given in Appendix 2. The number of degrees of freedom is determined by the formula d = n1 + n2 - 2, where n1 and n2 are the sizes of the compared samples. With small samples (n < 10), Student's test becomes sensitive to the form of the distribution of the studied trait in the general population. Therefore, in doubtful cases, it is recommended to use non-parametric methods or to compare the obtained values with the tabulated critical values for a stricter significance level.

The differences are judged reliable if the calculated value of t exceeds the tabulated value for the given number of degrees of freedom. In the text of a publication or scientific report, the strictest of the three significance levels (0.05, 0.01, 0.001) that has been reached is indicated. If the tabulated values for 0.05 and 0.01 are exceeded, then p=0.01 or p<0.01 is written (usually in brackets). This means that the probability that the estimated differences are nevertheless random is no more than 1 in 100. If the tabulated values for all three levels are exceeded, then p=0.001 or p<0.001 is indicated, which means that the probability that the revealed differences between the means are random is no more than 1 in 1000.
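The reporting rule above can be sketched in a few lines of Python. This is only an illustration: the function name `strictest_level` is hypothetical, and the critical values for 27 degrees of freedom are two-tailed values taken from standard tables rather than from Appendix 2.

```python
def strictest_level(t_value, critical):
    """Return the strictest significance level whose critical value is exceeded.

    `critical` maps a significance level (e.g. 0.05) to the tabulated
    critical value of t for the relevant number of degrees of freedom.
    Returns None if even the weakest level is not reached.
    """
    t_abs = abs(t_value)  # the sign of the difference is ignored
    passed = [p for p, t_crit in critical.items() if t_abs > t_crit]
    return min(passed) if passed else None


# Two-tailed critical values of Student's t for d = 27
# (standard table values, used here as an assumption).
critical_d27 = {0.05: 2.052, 0.01: 2.771, 0.001: 3.690}

level = strictest_level(2.83, critical_d27)
print("not significant" if level is None else f"p<{level}")  # p<0.01
```

Passing the table row as a dictionary keeps the function independent of any particular table, so the same check works for any number of degrees of freedom.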

Example. M1=113.3, m1=2.4, n1=13; M2=103.3, m2=2.6, n2=16.

$$t = \frac{113.3 - 103.3}{\sqrt{2.4^2 + 2.6^2}} = 2.83;$$

for d=13+16-2=27 the calculated value of 2.83 exceeds the tabulated value of 2.77 for a significance level of p=0.01. Therefore, the differences between the means are significant at the level of 0.01.
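The arithmetic of this example can be checked with a short Python sketch. It assumes that the denominator of the criterion is the square root of the sum of the squared errors of the means, which is the form that reproduces the value 2.83; the function names are illustrative.

```python
import math


def t_from_summary(mean1, err1, mean2, err2):
    """Absolute value of Student's t from two means and their errors."""
    return abs(mean1 - mean2) / math.sqrt(err1 ** 2 + err2 ** 2)


def degrees_of_freedom(n1, n2):
    """Number of degrees of freedom for two independent samples."""
    return n1 + n2 - 2


t = t_from_summary(113.3, 2.4, 103.3, 2.6)
d = degrees_of_freedom(13, 16)
print(f"t = {t:.2f}, d = {d}")  # t = 2.83, d = 27
```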

The above formula is simple. With a pocket calculator that has a memory function, the t criterion can be calculated without writing down intermediate results.

It should be remembered that, whatever its numerical value, the criterion for the significance of the difference between the means does not evaluate the size of the revealed difference (that is estimated by the difference between the means itself), but only its statistical significance, i.e. the right to extend the conclusion about the presence of a difference, obtained by comparing samples, to the phenomenon (process) as a whole. A low calculated value of the criterion cannot serve as evidence that there is no difference between two features (phenomena), because its significance (degree of probability) depends not only on the mean values, but also on the sizes of the compared samples. It says not that there is no difference, but that with the given sample size the difference is statistically unreliable: the chance that the difference observed under the given conditions is random is too great.

The degree, i.e. the magnitude, of the identified difference should preferably be evaluated on the basis of substantive criteria. At the same time, psychological research very typically deals with many indicators that are, in essence, conditional scores, and the validity of assessments based on them has to be specially proved. To avoid arbitrariness, in such cases it is also necessary to rely on statistical parameters.

Sigma is perhaps most commonly used for this purpose. A difference between two arithmetic means of one sigma or more can be considered quite pronounced. If the sigma is calculated for a series of more than 35 values, then a difference of 0.5 sigma can be considered quite pronounced. However, for responsible conclusions about how large the difference between the values is, it is better to use strict criteria.
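As a rough sketch, this rule of thumb can be written as follows. The text does not say which group's sigma should be used, so the function simply takes sigma as an argument; the function names and the numbers in the usage lines are hypothetical.

```python
def difference_in_sigmas(mean1, mean2, sigma):
    """Express the difference between two means in units of sigma."""
    return abs(mean1 - mean2) / sigma


def is_pronounced(mean1, mean2, sigma, n):
    """Rule of thumb: 1 sigma in general, 0.5 sigma for series of more than 35 values."""
    threshold = 0.5 if n > 35 else 1.0
    return difference_in_sigmas(mean1, mean2, sigma) >= threshold


# Hypothetical values: a difference of 3 points with sigma = 4 is 0.75 sigma.
print(difference_in_sigmas(24.0, 21.0, 4.0))
print(is_pronounced(24.0, 21.0, 4.0, n=20))  # short series: threshold is 1 sigma
print(is_pronounced(24.0, 21.0, 4.0, n=50))  # long series: threshold is 0.5 sigma
```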

# Data normalization

Let us illustrate the importance of using norms with the example of the well-known method of K. Thomas. Recall that in it the conclusion about the dominant strategy of behavior in a conflict situation is based on numerical data: after calculating the total score for each scale, one identifies the scale with the highest score, and the strategy corresponding to that scale is interpreted as dominant in a conflict situation. The calculated statistics show that the average values of the scale scores differ in absolute value. They range in men from 5.25 to 7.25 points and in women from 3.71 to 7.65 points (see Table 11).

Table 11. Primary statistics of the scale scores of the Thomas method

| Strategy | Men (n=56): Average | -95% | +95% | Sigma | Women (n=71): Average | -95% | +95% | Sigma |
|---|---|---|---|---|---|---|---|---|
| Assertiveness | 5.25 | 4.45 | 6.05 | 2.99 | 3.71 | 3.04 | 4.37 | 2.83 |
| Cooperation | 6.29 | 5.64 | 6.93 | 2.41 | 6.24 | 5.74 | 6.74 | 2.11 |
| Compromise | 5.32 | 4.71 | 5.93 | 2.27 | 5.62 | 5.10 | 6.14 | 2.19 |
| Avoidance | **7.25** | 6.71 | 7.79 | 2.02 | **7.65** | 7.18 | 8.11 | 1.96 |
| Compliance | 5.82 | 5.19 | 6.46 | 2.37 | 6.70 | 6.20 | 7.20 | 2.11 |

Note. Average: mean values; -95% and +95%: lower and upper bounds of the 95% confidence intervals of the mean values; Sigma: standard deviations. The largest average values are shown in bold.

Thus, if we do not take into account normative data obtained on (or verified on) a Russian sample, the interpretation of the results can lead to incorrect conclusions. Indeed, both men and women tend to prefer the avoidance strategy. The guide to the method does not say that the dominance of one of the five strategies is a transcultural characteristic of the individual. From the context, it can be understood that the author assumes that each of the five strategies is equally likely to be preferred. Since there are statistically significant correlations between the scale indicators, one can hardly speak of an equal probability of following each of the five strategies. In such a situation, when there are no normative data and no information about the nature of the distribution of values, it is more reliable to rely on the statistics calculated for your own sample; in particular, to use sigma and confidence intervals to assess how strongly one of the strategies dominates. We add that it is advisable to calculate the norms separately for men and women. The data presented show that on two of the five scales the indicators differ significantly between the sexes. When groups or subgroups are compared, this sex specificity may turn out to be a variable whose influence cannot be ignored.
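The statement about two of the five scales can be checked from the summary statistics alone. The sketch below rests on two assumptions that are not spelled out in the text: the error of each mean is taken as sigma divided by the square root of n, and the criterion is the difference of the means divided by the square root of the sum of the squared errors. The critical value 1.98 for d = 125 is an approximate two-tailed value from standard tables.

```python
import math

# Means and sigmas from the table of Thomas method scale scores:
# scale -> ((men mean, men sigma), (women mean, women sigma)).
N_MEN, N_WOMEN = 56, 71
SCALES = {
    "Assertiveness": ((5.25, 2.99), (3.71, 2.83)),
    "Cooperation":   ((6.29, 2.41), (6.24, 2.11)),
    "Compromise":    ((5.32, 2.27), (5.62, 2.19)),
    "Avoidance":     ((7.25, 2.02), (7.65, 1.96)),
    "Compliance":    ((5.82, 2.37), (6.70, 2.11)),
}

# Approximate two-tailed critical value of t at p = 0.05 for
# d = 56 + 71 - 2 = 125 (an assumption taken from standard tables).
T_CRITICAL_05 = 1.98


def standard_error(sigma, n):
    """Assumed error of the mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


def t_between_groups(mean1, sigma1, n1, mean2, sigma2, n2):
    """Absolute Student's t from means, sigmas and sample sizes."""
    m1 = standard_error(sigma1, n1)
    m2 = standard_error(sigma2, n2)
    return abs(mean1 - mean2) / math.sqrt(m1 ** 2 + m2 ** 2)


t_values = {
    scale: t_between_groups(men[0], men[1], N_MEN, women[0], women[1], N_WOMEN)
    for scale, (men, women) in SCALES.items()
}
significant = [scale for scale, t in t_values.items() if t > T_CRITICAL_05]

for scale, t in t_values.items():
    print(f"{scale:<14} t = {t:.2f}")
print("Significant at p<0.05:", significant)
```

Under these assumptions only the assertiveness and compliance scales exceed the critical value, which agrees with the two scales mentioned in the text.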

It is expedient to calculate norms in other cases as well. The initial (primary) estimates of the performance of experimental tasks obtained during data collection are not always convenient to use in further work. They are transformed in one way or another. The most common transformations are centering and normalization by standard deviations. Centering is understood as a linear transformation of the values of a feature, in which the average value of the distribution of a certain feature becomes equal to zero. The direction of the scale and its units remain unchanged.

The essence of normalization is the transition to another scale, expressed in standardized units of measurement. When test results are standardized, normalization is most often carried out using standard deviations. Standardization is appropriate when the distribution of test scores is normal or close to normal in shape.

In psychology, there are a number of scales based on the normal distribution with different values of M and s. For example, in the deviation IQ scale M=100, s=15; in the Wechsler scale M=10, s=3. The distributions of various features measured in an experiment have different values of M and s. By translating the obtained primary scores of different features to a distribution with the same M and s, we get more opportunities for estimating and comparing their variation. To do this, we can use the normalized deviation. The normalized deviation shows by how many sigmas a given value deviates from the average level of the variable feature (arithmetic mean), and is expressed by the formula:

$$t = \frac{V - M}{s},$$

where V is the value of the feature (in initial scores).
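A minimal Python sketch of centering and normalization for one sample follows. The raw scores are hypothetical, and the sample standard deviation (with n - 1 in the denominator) is used for sigma, which is an assumption: the text does not say which variant is meant.

```python
import statistics


def normalized_deviations(values):
    """Convert raw scores to normalized deviations (V - M) / s."""
    mean = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / sigma for v in values]


raw_scores = [12, 15, 9, 18, 11]  # hypothetical initial scores
deviations = normalized_deviations(raw_scores)
for raw, dev in zip(raw_scores, deviations):
    print(f"{raw:>3} -> {dev:+.2f}")
```

After the transformation the mean of the values is zero (centering) and their standard deviation is one (normalization), whatever the original units were.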

With the help of the normalized deviation, one can evaluate any obtained value in relation to the group as a whole, weigh its deviation, and at the same time get rid of units of measurement. To get rid of negative numbers, a constant can be added to the resulting value of t. It is also convenient if all the numbers one operates with have the same number of digits. Given these considerations, the T-score scale is very convenient. For this scale the normalized deviation is rescaled to s=10 and a constant equal to 50 is added, so that the resulting distribution has M=50 and s=10. The formula for converting initial scores into T-scores is as follows:

$$T = 50 + 10 \cdot \frac{V - M}{s}$$
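The conversion can be sketched as a one-line Python function. The function name `t_score` is illustrative; the usage lines apply it to the deviation IQ scale mentioned earlier (M=100, s=15) purely as an example.

```python
def t_score(value, mean, sigma):
    """Convert an initial score to a T-score: 50 + 10 * (V - M) / s."""
    return 50 + 10 * (value - mean) / sigma


# Scores one sigma above, at, and one sigma below the mean of a scale
# with M = 100 and s = 15.
print(t_score(115, 100, 15))  # 60.0
print(t_score(100, 100, 15))  # 50.0
print(t_score(85, 100, 15))   # 40.0
```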

Let us consider the meaning of the normalization procedure with an example. Suppose we are interested in connections between the communicative skills of salespeople and the location of a store in a large city. To make an integral assessment of the communicative skill of a particular salesperson, we can, through observation, obtain for each subject a number of parameters characterizing his communication with the buyer. For example, we can measure the average duration of eye contact, the average number of smiles in a fixed amount of time, the number of rude or surly remarks, and so on. One can also characterize the advantages and disadvantages of the store's location in the city (how "busy" the place is, etc.). To do this, one can count the number of public transport routes that have stops in the immediate vicinity of the store, estimate its distance from metro stations, take into account the number of stores of a different profile located nearby, etc.

To derive a generalized communicative indicator, one cannot simply add the number of smiles to the duration of eye contact and subtract from this sum the number of expressions indicating a low speech culture. Nor does it make sense to add the number of bus routes to the number of neighboring shops and subtract the distance to the nearest metro station from the sum. It is better to collect the required amount of quantitative data by conducting research in a number of stores, calculate the primary statistics for all these indicators, and then convert the initial data into T-scores for each indicator.

When normalizing, the arithmetic mean is subtracted from each value obtained during data collection in initial units, and the difference is divided by sigma. The resulting value is multiplied by 10 and then either added to 50 or subtracted from 50. By choosing the last arithmetic operation (addition or subtraction), we set the direction of the contribution that this parameter makes to the calculated integral estimate, i.e. we set the direction of the transformation, taking into account the specifics of the parameter. If a specific value in the initial units exceeds the arithmetic mean and the parameter reflects the quality positively, we add the normalized deviation (the difference divided by sigma) multiplied by 10 to 50. This corresponds to a greater expression of the estimated mental quality in this subject than the average for our sample.

For example, a salesperson whose number of smiles exceeds the average by one sigma would now receive 60 T-points. Normalized deviations for signs of a high speech culture should be added to 50 T-points, and those for signs of a low speech culture should be subtracted from 50 T-points. If, for example, the quantitative assessment of some negative sign (in initial scores) exceeds the average value by half a sigma, then in T-points it will be equal to 45. After such transformations, when calculating the integral indicator of communicative skill for a particular subject, we can add the T-scores together.
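The whole procedure for one salesperson can be sketched as follows. All parameter names, means, sigmas and observed values are hypothetical and were chosen so that the smiles indicator lies one sigma above the mean and the negative indicator half a sigma above the mean, as in the two cases just described.

```python
def directed_t_score(value, mean, sigma, positive=True):
    """T-score whose direction depends on the meaning of the parameter.

    For a positive sign the scaled deviation is added to 50,
    for a negative sign it is subtracted from 50.
    """
    deviation = 10 * (value - mean) / sigma
    return 50 + deviation if positive else 50 - deviation


# Hypothetical group statistics: parameter -> (mean, sigma, positive sign?)
norms = {
    "smiles": (6.0, 2.0, True),
    "eye_contact_sec": (4.0, 1.0, True),
    "rude_remarks": (2.0, 1.0, False),
}

# Hypothetical observations for one salesperson.
seller = {"smiles": 8.0, "eye_contact_sec": 4.0, "rude_remarks": 2.5}

scores = {
    name: directed_t_score(seller[name], mean, sigma, positive)
    for name, (mean, sigma, positive) in norms.items()
}
integral_indicator = sum(scores.values())

print(scores)              # smiles: 60.0, eye contact: 50.0, rude remarks: 45.0
print(integral_indicator)  # 155.0
```

Because every indicator is now on the same scale with the same direction, the T-scores can be summed without mixing smiles, seconds and remarks.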

It is advisable to choose the form of data standardization taking into account the range of the obtained initial scores and the number of gradations. If the initial scores have 7-15 gradations, then stanines may be quite suitable. If the number of gradations reaches 30 or more and the distribution is only slightly skewed, then converting these indicators into stanines will coarsen the scores, i.e. lose some of the accuracy of the measurement. If there is reason to believe that your measurements are sufficiently effective (for example, there is evidence of good retest reliability, of high correlations between the measured indicators and clear, reliable external validation criteria, etc.), then it is justified to use standardized units that have the same or even a slightly larger number of gradations.

# Correlation analysis