**CALCULATION OF CORRELATION COEFFICIENTS**

**The purpose of the work**: to study the basic techniques of correlation and covariance analysis using the built-in functions of *MS Excel* and the Analysis ToolPak tools **“Correlation”** and **“Covariance”**.

*Correlation analysis* is a statistical method for identifying interdependencies between several features.

*A functional dependence* between random variables *X* and *Y* is a dependence in which each value of *X* determines a unique value of *Y*, that is, *Y* is a function of the random argument *X*.

If each value of one variable corresponds to a set of possible values of another variable, i.e. each value of one variable corresponds to a certain (conditional) distribution of another variable, then such a dependence is called *statistical* ( *stochastic* , *probabilistic* ).

A *statistical relationship* between random variables is a relationship in which a change in one of the variables entails a change in the distribution of the other.

A *correlation dependence* between random variables is a statistical dependence in which a change in one of the variables entails a change in the average value of the other.

**Computing Covariance Estimates and Sample Correlation Coefficients**

*Covariance* characterizes both the dispersion and the mutual dependence of random variables; its dimension is equal to the product of the dimensions of the random variables.

Suppose that as a result of *n* trials the system of two random variables (*X*, *Y*) has taken the values (*x _{1}*, *y _{1}*), (*x _{2}*, *y _{2}*), …, (*x _{n}*, *y _{n}*). The statistical estimate of the covariance is determined by the formula:

$$K_{xy}^{*} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}),$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$.

*MS Excel* uses the **COVAR(Array1,Array2)** function, found in the *Statistical* category, to calculate covariance.

**Example 1** . There is a sample from the general population of a system of two random variables X and Y. Determine the covariance of these random variables.

| *x _{i}* | 12.1 | 14.7 | 20.5 | 11.2 | 16.6 | 10.0 | 13.0 | 14.9 | 16.3 | 15.1 |
|---|---|---|---|---|---|---|---|---|---|---|
| *y _{i}* | 53.2 | 44.2 | 51.4 | 57.7 | 45.5 | 42.0 | 53.5 | 68.9 | 57.7 | 63.3 |

*Solution*

Let’s enter the initial data in the range **A1:K2**.

In cell **D3**, enter the formula **=COVAR(B1:K1;B2:K2)**. We get the result: 2.709.
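As a cross-check outside of Excel, the same covariance estimate (with the divisor *n* used by COVAR) can be computed in a short Python sketch using the data of example 1:

```python
# Population covariance estimate, matching Excel's COVAR (divisor n).
x = [12.1, 14.7, 20.5, 11.2, 16.6, 10.0, 13.0, 14.9, 16.3, 15.1]
y = [53.2, 44.2, 51.4, 57.7, 45.5, 42.0, 53.5, 68.9, 57.7, 63.3]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# K*_xy = (1/n) * sum((x_i - mean_x) * (y_i - mean_y))
cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
print(round(cov, 3))  # 2.709
```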

Covariance has a dimension. A more convenient quantity, characterizing only the dependence of the random variables, is the correlation coefficient. *The Pearson sample correlation coefficient* is calculated by the formula:

$$r_{xy} = \frac{K_{xy}^{*}}{\sigma_x \sigma_y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}.$$

To calculate the Pearson sample correlation coefficient in *MS Excel*, use the **CORREL(Array1,Array2)** function, which is in the *Statistical* category.

For the data of example 1, we calculate the correlation coefficient.

In cell **D4**, enter the formula **=CORREL(B1:K1;B2:K2)** and get the result: 0.116.

In *MS Excel*, you can also calculate *the coefficient of determination*, the square of the Pearson correlation coefficient. This coefficient shows the proportion of variation in the dependent variable that is accounted for by the model, i.e. determined by the variation of the factors included in it. To do this, use the function **=RSQ(Known_ys;Known_xs)**.

Let’s calculate the coefficient of determination for the data of example 1. In cell **D5**, enter the formula **=RSQ(B2:K2;B1:K1)**. We get the result: 0.013.
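The CORREL and RSQ results can likewise be reproduced directly from the defining formula; this is an illustrative sketch, not part of the original worksheet:

```python
import math

x = [12.1, 14.7, 20.5, 11.2, 16.6, 10.0, 13.0, 14.9, 16.3, 15.1]
y = [53.2, 44.2, 51.4, 57.7, 45.5, 42.0, 53.5, 68.9, 57.7, 63.3]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

r = sxy / math.sqrt(sxx * syy)  # Pearson sample correlation coefficient
print(round(r, 3))       # 0.116
print(round(r ** 2, 3))  # 0.013 (coefficient of determination)
```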

*To test the significance* of the sample correlation coefficient, the null hypothesis *H* _{0}: *r _{xy}* = 0 should be tested against the alternative hypothesis *H* _{1}: *r _{xy}* ≠ 0. If *H* _{0} is accepted, then there is no correlation between the random variables *X* and *Y*. If *H* _{0} is rejected, then there is a correlation between the random variables *X* and *Y*. To test the null hypothesis at a significance level α, it is necessary to calculate the observed value of the criterion using the formula:

$$T_{obs} = \frac{r_{xy}\sqrt{n-2}}{\sqrt{1 - r_{xy}^{2}}}.$$

For the significance level α and the number of degrees of freedom *k* = *n* − 2, using Student’s distribution, find the critical value *T _{cr}* = *t*(α; *k*).

Compare *T _{obs}* and *T _{cr}*: if |*T _{obs}*| < *T _{cr}*, then there is no reason to reject the hypothesis *H* _{0}; if |*T _{obs}*| > *T _{cr}*, then the hypothesis *H* _{0} is rejected.

Let’s check the significance of the sample correlation coefficient obtained in example 1.

In cell **D6**, enter the formula **=D4*SQRT(10-2)/SQRT(1-D4^2)**. We get the result: 0.33.

In cell **D7**, we calculate the critical value using the formula **=TINV(0.05;8)**. We obtain the critical value for the case of a *two-sided* critical region, equal to 2.306. Since *T _{obs}* < *T _{cr}*, the null hypothesis is accepted.
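The observed value of the criterion can be verified numerically; the critical value 2.306 is the tabulated two-sided Student quantile *t*(0.05; 8) and is simply taken as given here:

```python
import math

r = 0.116  # Pearson coefficient from example 1
n = 10     # number of observation pairs

# T_obs = r * sqrt(n - 2) / sqrt(1 - r^2)
t_obs = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
t_cr = 2.306  # t(0.05; 8), two-sided, from Student's tables

print(round(t_obs, 2))    # 0.33
print(abs(t_obs) < t_cr)  # True -> no reason to reject H0
```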

Along with *Pearson’s* sample correlation coefficient, *Spearman’s* and *Kendall’s* sample rank correlation coefficients are also used. These coefficients can be applied both to data measured on quantitative scales and to data measured on ordinal scales.

Spearman’s *rank correlation coefficient* is calculated by the formula:

$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^{2}}{n(n^2 - 1)},$$

where *d _{i}* = *x _{i}* − *y _{i}*; *x _{i}* and *y _{i}* are the ranks of the *i*-th object in the variables *X* and *Y*; *n* is the number of pairs of observations.

Knowing the sample value *r _{s}*, we can test the hypothesis about the significance of ρ. The null hypothesis is formulated as *H* _{0}: ρ = 0 (the observed random variables *X* and *Y* are uncorrelated), the alternative as *H* _{1}: ρ ≠ 0.

To test the null hypothesis, the statistic

$$T = \frac{r_s\sqrt{n-2}}{\sqrt{1 - r_s^{2}}}$$

is used. The critical value *T _{cr}* = *t*(α; *k*) is calculated using **Student**’s t-distribution, where *k* = *n* − 2 is the number of degrees of freedom.

*Note*: for *n* ≤ 10, exact statistical tables should be used to assess significance.

If |*T*| < *T _{cr}*, then the Spearman rank correlation coefficient is insignificant.

**Example 2** . Colored disks, having the order of shades 1, 2, …, 15, were placed by the subjects in the following order:

7, 4, 2, 3, 10, 1, 6, 8, 9, 5, 11, 15, 14, 12, 13.

Using Spearman’s rank correlation coefficient between the actual and observed results, characterize the subject’s ability to distinguish shades of color.

*Solution*

Let’s enter the initial data in the ranges **A4:A18** , **B4:B18** and **D4:D18** .

Let’s calculate the ranks of the actual and observed results. To do this, in cell **C4**, enter the formula:

**=RANK(B4;$B$4:$B$18;1)** and extend it to cell **C18**.

In cell **E4**, enter the formula:

**=RANK(D4;$D$4:$D$18;1)** and extend it to cell **E18**.

In cells **C19** and **E19**, for verification, we calculate the sums of the ranks. Each sum must be equal to the sum of the ranks found by the formula:

$$\sum_{i=1}^{n} R_i = \frac{n(n+1)}{2} = \frac{15 \cdot 16}{2} = 120.$$

Let’s calculate the difference in ranks. To do this, enter the formula **=C4-E4** in cell **F4** and extend it to cell **F18** .

In the range **G4:G18** , we calculate the squares of the rank difference.

In cell **G19**, we find the sum of squares of the rank differences (equal to 136).

In cell **G20**, enter the number of pairs of observations *n* = 15. In cell **G21**, find the value of the Spearman sample coefficient using the formula

**=1-6/G20/(G20^2-1)*G19**.

We get the result: 0.757.
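The same computation can be sketched in Python; the observed order is the sequence from example 2, and the "actual" ranks are simply 1 through 15:

```python
actual = list(range(1, 16))  # true order of the shades: 1..15
observed = [7, 4, 2, 3, 10, 1, 6, 8, 9, 5, 11, 15, 14, 12, 13]

n = len(actual)
d2 = sum((a - b) ** 2 for a, b in zip(actual, observed))  # sum of squared rank differences
r_s = 1 - 6 * d2 / (n * (n ** 2 - 1))                     # Spearman coefficient

print(d2)             # 136
print(round(r_s, 3))  # 0.757
```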

Let us estimate the significance of the coefficient for the significance level α = 0.05. Since *T* = 4.179 > *T _{cr}* = *t*(0.05; 13) = 2.160, the hypothesis of the insignificance of the Spearman correlation coefficient should be rejected: the subject is able to distinguish shades of color.

*Kendall’s rank correlation coefficient* is calculated as follows. The pairs of ranks are rearranged so that the ranks *x _{i}* form an increasing sequence 1, 2, …, *n*; now *x _{i}* = *i*. For each rank *y _{i}*, denote by *p _{i}* the number of ranks *y _{j}* lying to the right of it (*j* > *i*) for which *y _{j}* > *y _{i}*, and find their sum *P* = Σ *p _{i}*. Similarly, let *q _{i}* denote the number of ranks *y _{j}* with *j* > *i* and *y _{j}* < *y _{i}*, and calculate their sum *Q* = Σ *q _{i}*. Then *S* = *P* − *Q*. The coefficient τ is calculated by one of the equivalent formulas:

$$\tau = \frac{2S}{n(n-1)} = \frac{4P}{n(n-1)} - 1 = 1 - \frac{4Q}{n(n-1)}.$$
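As an illustration (not part of the original text), the count of *P*, *Q*, and τ can be sketched in Python; here it is applied, for variety, to the subject's ranking from example 2:

```python
# Kendall's tau via the P/Q count described above.
# y holds the second ranking after sorting the pairs by the first ranking.
y = [7, 4, 2, 3, 10, 1, 6, 8, 9, 5, 11, 15, 14, 12, 13]
n = len(y)

P = sum(1 for i in range(n) for j in range(i + 1, n) if y[j] > y[i])  # concordant pairs
Q = sum(1 for i in range(n) for j in range(i + 1, n) if y[j] < y[i])  # discordant pairs
S = P - Q

tau = 2 * S / (n * (n - 1))
print(P, Q)           # 81 24
print(round(tau, 3))  # 0.543
```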

To assess the significance of the Kendall rank correlation coefficient for *n* ≥ 10, use the formula:

$$T_{cr} = z_{cr}\sqrt{\frac{2(2n+5)}{9n(n-1)}},$$

where *z _{cr}* is found from the relation:

$$\Phi(z_{cr}) = \frac{1 - \alpha}{2},$$

Ф( *x* ) being the Laplace function.

If |τ| < *T _{cr}*, then Kendall’s rank correlation coefficient is insignificant.

**Example 3** . Two experts independently evaluated 10 student scientific papers submitted for the competition. The test results in points are presented in the table.

| Work | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Expert 1 | | | | | | | | | | |
| Expert 2 | | | | | | | | | | |

Test the hypothesis of complete inconsistency (independence) of the expert assessments using Kendall’s rank correlation coefficient at a significance level of α = 0.05.

*Solution*

Let’s enter the initial data in the ranges **A1:A11** , **B1:B11** and **C1:C11** .

Let’s calculate the ranks of the estimates of the first (range **D2:D11** ) and the second expert (range **E2:E11** ).

Let’s select the range **D1:E11**, which contains the rating ranks, and copy it to the range **F1:G11** using the **Paste Special** command.

In the *Paste Special* dialog box that opens, in the *Paste* group, select the *Values* option and click *OK*.

Select the range **F1:G11**, press the right mouse button, and execute *Sort/Custom Sort* in the context menu. In the *Sort* dialog box that opens, set the following fields: *Sort by*: Rxi; *Sort on*: values; *Order*: ascending, and press *OK*. In the range **F1:G11**, the ranks of the experts’ assessments will appear, sorted in ascending order of the ranks of the first expert’s assessments.

In cell **H2**, enter the array formula **=SUM(IF($G3:$G$11>G2;1;0))**, press **Ctrl+Shift+Enter**, and then copy this formula into cells **H3:H11**.

In cell **H12**, find the sum of the numbers *p _{i}*. This will be the value of *P*.

In cell **H13**, we calculate the value of Kendall’s sample rank correlation coefficient using the formula **=4*H12/(10^2-10)-1**. We get the result: 0.689.

Let us evaluate the significance of the Kendall correlation coefficient. In cell **H14**, calculate *z _{cr}* using the formula **=NORMSINV(1-0.05/2)**. We get 1.96.

In cell **H15**, enter the formula **=H14*SQRT(2*(2*10+5)/9/10/(10-1))**.

We get *T _{cr}* = 0.487. Since |τ| = 0.689 > 0.487, the null hypothesis should be rejected: the experts’ opinions are not independent.
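This critical value can be checked outside of Excel; *z* = 1.96 is the standard normal quantile for α = 0.05 (two-sided):

```python
import math

n = 10
z_cr = 1.96   # standard normal quantile for alpha = 0.05, two-sided
tau = 0.689   # Kendall coefficient from example 3

# T_cr = z_cr * sqrt(2 * (2n + 5) / (9 * n * (n - 1)))
t_cr = z_cr * math.sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))

print(round(t_cr, 3))   # 0.487
print(abs(tau) > t_cr)  # True -> reject H0
```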

**Correlation Analysis**

For a multivariate sample, covariance and correlation *matrices* are calculated.

The initial data are presented in the form of a matrix ( *x _{ij}* ), where *i* = 1, 2, …, *n* is the observation number and *j* = 1, 2, …, *m* is the feature number.

**Example 4**. In order to analyze the relationship between indicators of production efficiency (labor productivity, return on assets, and material consumption of production), a group of ten enterprises of the same type was selected. The data are given in the table:

| Company number | | |
|---|---|---|
| 1 | 6.0 | 2.0 |
| 2 | 4.9 | 0.8 |
| 3 | 7.0 | 2.7 |
| 4 | 6.7 | 3.0 |
| 5 | 5.8 | 1.0 |
| 6 | 6.1 | 2.1 |
| 7 | 5.0 | 0.9 |
| 8 | 6.9 | 2.6 |
| 9 | 6.8 | 3.0 |
| 10 | 5.9 | 1.1 |

Calculate the covariance and correlation matrices of the population under consideration.

Calculate the covariance using the analysis package tool “Covariance”.

Run the command **Data → Data Analysis → Covariance** and enter the following data. As a result, a covariance matrix will be displayed; on its main diagonal are the sample variances, and off the main diagonal are the covariances.

Now we get the correlation matrix. Run the command **Data → Data Analysis → Correlation** and enter the following data.

Press the OK button and get a table of results.

The correlation matrix looks like:

The pair correlation coefficient between features *j* and *m* is calculated by the formula:

$$r_{jm} = \frac{K_{jm}}{\sigma_j \sigma_m},$$

where σ _{j} is the sample standard deviation of feature *x _{j}* and σ _{m} is the sample standard deviation of feature *x _{m}*.

The correlation matrix is always symmetrical, with 1’s on its main diagonal.
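As a small illustration of these properties (using, for brevity, the two variables of example 1 rather than the data of example 4), a correlation matrix can be assembled by normalizing the covariance matrix:

```python
import math

# Two-variable data set from example 1: rows are features, columns are observations.
data = [
    [12.1, 14.7, 20.5, 11.2, 16.6, 10.0, 13.0, 14.9, 16.3, 15.1],
    [53.2, 44.2, 51.4, 57.7, 45.5, 42.0, 53.5, 68.9, 57.7, 63.3],
]

n = len(data[0])
means = [sum(row) / n for row in data]

def cov(j, m):
    """Covariance estimate K_jm (divisor n)."""
    return sum((data[j][i] - means[j]) * (data[m][i] - means[m]) for i in range(n)) / n

sigma = [math.sqrt(cov(j, j)) for j in range(2)]  # sample standard deviations
R = [[cov(j, m) / (sigma[j] * sigma[m]) for m in range(2)] for j in range(2)]

print(round(R[0][0], 3), round(R[0][1], 3))  # 1.0 0.116
print(round(R[1][0], 3), round(R[1][1], 3))  # 0.116 1.0
```

The matrix is symmetric and has 1’s on the main diagonal, as stated above.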

The significance of the pair coefficients can be tested using Student’s *t*-test.

The indicators of closeness of connection can be given a qualitative characterization based on the Chaddock scale:

| Quantitative measure of the closeness of connection | Qualitative characterization of the strength of connection |
|---|---|
| 0.1–0.3 | Weak |
| 0.3–0.5 | Moderate |
| 0.5–0.7 | Noticeable |
| 0.7–0.9 | High |
| 0.9–0.99 | Very high |

A functional connection occurs when the value equals 1, and absence of connection when it equals 0. If the value of the indicator of closeness of connection is less than 0.7, the coefficient of determination will always be below 50%. This means that the share of variation explained by the factor characteristics is smaller than the share due to other factors, not included in the model, that affect the change in the effective indicator. Regression models built under such conditions are of low practical value.
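The scale can be summarized as a small helper function (the function name and the labels are hypothetical; the boundaries are taken directly from the table above):

```python
def chaddock(r):
    """Return the qualitative strength of connection for |r| per the Chaddock scale."""
    a = abs(r)
    if a < 0.1:
        return "no connection"
    if a < 0.3:
        return "weak"
    if a < 0.5:
        return "moderate"
    if a < 0.7:
        return "noticeable"
    if a < 0.9:
        return "high"
    return "very high"

print(chaddock(0.116))  # weak  (Pearson coefficient of example 1)
print(chaddock(0.757))  # high  (Spearman coefficient of example 2)
```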
