**LAB #10**

**INTEGRATED SYSTEM STATISTICA**

**CORRELATION ANALYSIS OF ECONOMIC INFORMATION**

**The purpose of the work:** To acquire practical skills in conducting correlation analysis using specialized modules of the integrated system (IS) STATISTICA.

**THEORETICAL PART**

**General information**

A set of methods for assessing correlation characteristics and testing statistical hypotheses about them using sample data is called correlation analysis. Correlation analysis uses the following basic techniques:

1) construction of a correlation field (scatterplot) for two economic indicators, or of its two-dimensional sections when more than two indicators are involved;

2) determination of sample correlation coefficients or compilation of correlation matrices;

3) testing of statistical hypotheses about the significance of the relationship between the indicators.

The correlation coefficient estimates the tightness of the relationship between the variables under study and is a measure of the linear dependence of the values.

If the relationship between any two quantitative variables is estimated, the paired correlation coefficient is used. To assess the closeness of the relationship between the resulting indicator and the set of input factors, a multiple correlation coefficient is used. To assess the tightness of the relationship between qualitative (ordinal) variables, the rank correlation coefficient is used.
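The difference between the paired (Pearson) coefficient and the rank (Spearman) coefficient can be illustrated with a short sketch. This is a Python illustration, not part of STATISTICA, and the data are invented for demonstration:

```python
# Illustrative sketch: Pearson vs. Spearman rank correlation.
# The data are invented for demonstration, not taken from the lab's tables.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x ** 2  # monotone but nonlinear dependence

r_pearson, _ = stats.pearsonr(x, y)      # measures linear dependence
rho_spearman, _ = stats.spearmanr(x, y)  # measures monotone (rank) dependence

print(round(float(r_pearson), 3))    # below 1: the relationship is not linear
print(round(float(rho_spearman), 3)) # exactly 1: the ranks agree perfectly
```

Because y grows monotonically with x, the ranks match perfectly and Spearman's coefficient equals 1, while Pearson's coefficient stays below 1 since the dependence is not a straight line.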

This coefficient, usually denoted by the Latin letter r, takes values between -1 and +1: the closer its absolute value is to 1, the stronger the relationship; the closer it is to 0, the weaker the relationship.

A negative correlation coefficient indicates an inverse relationship: the higher the value of one variable, the lower the value of the other. The strength of the relationship is characterized by the absolute value of the coefficient. For a verbal description of its magnitude, the following gradations are used (see Table 10.1).

*Table 10.1. Gradations of the correlation coefficient*

| Value of \|r\| | Interpretation |
| --- | --- |
| up to 0.2 | Very weak correlation |
| up to 0.5 | Weak correlation |
| up to 0.7 | Average correlation |
| up to 0.9 | High correlation |
| over 0.9 | Very high correlation |

The value of the correlation coefficient can be estimated roughly from the scatterplot: the closer the points lie to some straight line, the closer the coefficient is to one in absolute value; conversely, the more diffuse the scatterplot, the closer the coefficient is to zero.
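The gradations of Table 10.1 can be sketched as a small helper function and checked on simulated data. This is a Python illustration (not part of STATISTICA); the thresholds follow Table 10.1 and the data are generated for demonstration:

```python
# Sketch: verbal interpretation of |r| according to the gradations of Table 10.1.
import numpy as np

def correlation_strength(r: float) -> str:
    """Return the verbal gradation of a correlation coefficient per Table 10.1."""
    a = abs(r)
    if a <= 0.2:
        return "very weak"
    if a <= 0.5:
        return "weak"
    if a <= 0.7:
        return "average"
    if a <= 0.9:
        return "high"
    return "very high"

# Simulated data with a fairly tight linear relationship.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)

r = float(np.corrcoef(x, y)[0, 1])  # sample Pearson correlation coefficient
print(correlation_strength(r))
```

The tighter the points cluster around a line (smaller noise scale above), the higher the gradation the sample coefficient falls into.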

**The Basic Statistics/Tables module of IS STATISTICA**

The statistical procedures of the STATISTICA system are grouped into several specialized statistical modules. Each module supports a particular kind of data processing without resorting to procedures from other modules.

Work in the STATISTICA system usually begins with the Basic Statistics/Tables module. It can be used to pre-process data, perform exploratory analysis, divide data into groups in various ways, view these groups visually, and determine relationships between variables. This module includes various groups of statistical procedures implementing the methods of exploratory statistical analysis. The system can calculate almost all descriptive statistics, including the median, mode, quartiles, means and standard deviations, confidence intervals for the mean, skewness and kurtosis (with their standard errors), harmonic and geometric means, and many others. A wide range of charts is available to illustrate the exploratory data analysis.

The Correlations subsection includes a large number of tools for exploring the relationships between variables. It can calculate almost all common measures of dependence, including Pearson's correlation coefficient, Spearman's rank correlation coefficient, the contingency coefficient for categorical features, and many others. Correlation matrices can also be calculated for data with missing values, using special methods for handling them.
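Outside STATISTICA, the same pairwise handling of missing values can be sketched with pandas. The column names and numbers below are invented for illustration:

```python
# Sketch: a Pearson correlation matrix computed with pairwise deletion of
# missing values. The column names and numbers are invented for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Y":  [5.1, 4.8, np.nan, 6.0, 5.5, 6.2],
    "X1": [2.0, 2.3, 2.1, np.nan, 1.8, 1.7],
    "X3": [30.0, 28.0, 33.0, 35.0, np.nan, 36.0],
})

# DataFrame.corr ignores NaNs pairwise: each coefficient uses every row
# where both of its two columns are observed.
cm = df.corr(method="pearson")
print(cm.round(2))
```

Pairwise deletion keeps more observations than dropping every row with any gap, at the cost of different coefficients being based on different subsets of cases.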

**PRACTICAL PART**

**Task 1.** Analyze the indicators of economic activity of light industry enterprises [7]. Build a correlation matrix, give a graphical interpretation of the results. As initial data, use the data from Table P1 of Appendix 1.

Y1 – labor productivity, million rubles;

X1 – labor intensity of a unit of production, hours;

X2 – share of workers in the PPP;

X3 – share of components;

X4 – equipment shift factor;

X5 – bonuses and remuneration per employee, million rubles;

X6 – share of losses from defects in the cost of production, %;

X7 – return on assets, rub.;

X8 – average number of PPP;

X9 – average annual cost of the OPS, billion rubles;

X10 – payroll fund of the PPP, million rubles;

X11 – capital-labor ratio, million rubles;

X12 – turnover of normalized working capital;

X13 – turnover of non-normalized working capital;

X14 – unproductive expenses.

Let’s carry out a correlation analysis of the variables X1, X3, X7.

To perform correlation analysis, select the Basic Statistics/Tables module and then the Correlation matrices item.

In the Pearson Product-Moment Correlation window that appears (Fig. 10.1), click the Two lists button and specify which variables go in the rows (First variable list) – the variables X1, X3, X7 – and which in the columns (Second variable list) – the variable Y (Fig. 10.2).

*Fig. 10.1 – Pearson Product-Moment Correlation window*

Click OK to confirm your selection and return to the Pearson Product-Moment Correlation window.

*Fig. 10.2 – Select one or two variable lists window*

In the Pearson Product-Moment Correlation window, click OK as well. A correlation matrix will appear on the screen (Fig. 10.3).

*Fig. 10.3 – Correlation matrix*

There is only one column in this matrix, since we selected only one variable in the second list. The column gives the correlation coefficients between the variable Y and the variables X1, X3, X7. Coefficients significant at the p < 0.05 level are automatically highlighted in red; these are the coefficients that deserve the most attention. In this example, the variable Y depends most strongly on X1 (correlation coefficient -0.82) and X3 (correlation coefficient 0.64).
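The significance check behind the red highlighting can be reproduced outside STATISTICA. A hedged Python sketch with invented data:

```python
# Sketch: testing whether a correlation coefficient is significant at p < 0.05,
# the same threshold STATISTICA uses for red highlighting. Data are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(size=30)
y = -0.8 * x + rng.normal(scale=0.4, size=30)  # strong inverse dependence

# pearsonr returns the coefficient and the two-sided p-value for H0: r = 0.
r, p = stats.pearsonr(x, y)
print(f"r = {r:.2f}, p = {p:.4f}, significant at 0.05: {p < 0.05}")
```

A small p-value means the hypothesis of no linear relationship can be rejected, which is exactly what the red highlighting of a coefficient signals.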

To build a correlation matrix that reflects the closeness of the relationship between all variables, in the Pearson Product-Moment Correlation window, press the One variable list button and select the variables Y, X1, X3 and X7. Then return to the Pearson Product-Moment Correlation window and press the OK button. As a result, a correlation matrix will appear on the screen, reflecting all paired correlation coefficients for the variables under consideration (see Fig. 10.4). Coefficients that are significant at the p<0.05 level are automatically highlighted in red.

*Fig. 10.4 – Correlation matrix*

To display the correlation between any two variables graphically, click the 2D scatterplot button in the Pearson Product-Moment Correlation window and specify which variables go on the horizontal and vertical axes. Let’s build a scatterplot for the variables Y and X1 (see Fig. 10.5).

*Fig. 10.5 – Scatterplot for variables Y and X1*

*Fig. 10.6 – Scatterplot for variables Y and X7*

For comparison, we construct a scatterplot for the variables Y and X7 in a similar way (see Fig. 10.6).
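An equivalent scatterplot can be drawn outside STATISTICA with matplotlib. This is a sketch on invented data; the variable names and the output file name are arbitrary:

```python
# Sketch: a 2D scatterplot analogous to the one produced by the 2D scatterplot
# button. Built with matplotlib on invented data; the file name is arbitrary.
import matplotlib
matplotlib.use("Agg")  # render off-screen, no GUI needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(7)
x1 = rng.uniform(1, 5, size=25)
y = 10 - 1.6 * x1 + rng.normal(scale=0.8, size=25)  # inverse relationship

fig, ax = plt.subplots()
ax.scatter(x1, y)
ax.set_xlabel("X1 (labor intensity)")
ax.set_ylabel("Y (labor productivity)")
fig.savefig("scatter_y_x1.png")
```

With an inverse relationship like this one, the point cloud slopes downward, matching the negative correlation coefficient between Y and X1 in the example.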

In this case, the graph shows that the dependence cannot be considered linear. You can try to graphically select the type of dependence. To do this, in the Graphs menu, select the Stats 2D Graph command, then the Scatterplots command, and then in the Scatterplots window (see Fig. 10.7), use the Variables button to define the variables under study and select the type of dependence, for example, polynomial.

*Fig.10.7 – Scatterplots window*

As a result, a scatterplot built from a fifth-degree polynomial fit will appear on the screen (see Fig. 10.8). Notably, STATISTICA chooses the degree of the polynomial itself (up to a maximum of five), depending on the original data set.

*Fig. 10.8 – Scatterplot for variables Y and X7 when approximated by a polynomial*

As in Fig. 10.5, this graph (Fig. 10.8) shows both the function reflecting the relationship between the variables Y and X7 and the correlation field of points. The graph clearly shows that the fifth-degree polynomial does not describe the initial data properly. In this case, you should use the Nonlinear Estimation module, which allows the user to define their own functions for approximating the data under study.
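The polynomial approximation itself is easy to reproduce. A sketch with numpy.polyfit on invented data (STATISTICA's own fitting procedure may differ in details):

```python
# Sketch: fitting a fifth-degree polynomial to scatterplot data, mirroring the
# polynomial fit option of the Scatterplots window. Data are invented.
import numpy as np

x = np.linspace(0.0, 10.0, 50)
rng = np.random.default_rng(1)
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

coeffs = np.polyfit(x, y, deg=5)   # coefficients, highest degree first
y_hat = np.polyval(coeffs, x)      # fitted values of the polynomial

rss = float(np.sum((y - y_hat) ** 2))  # residual sum of squares of the fit
print(len(coeffs), round(rss, 2))
```

A large residual sum of squares relative to the spread of the data is the numerical counterpart of the visual mismatch seen in Fig. 10.8.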

To evaluate all the relationships visually, you can present the correlation matrix in graphical form (see Fig. 10.9). To do this, in the Pearson Product-Moment Correlation window, press the Matrix button, then select all variables by pressing the Select All button and confirm the selection by pressing OK.

*Fig. 10.9 – Correlation matrix in graphical representation*

To obtain numerical characteristics (descriptive statistics) of the studied features, select the Descriptive Statistics section in the launch panel of the Basic Statistics module and click OK. Use the Variables button to select the variables for which you want to obtain numerical characteristics (Fig. 10.10), then press the More statistics button, determine which characteristics should be calculated, and click OK.

*Fig. 10.10 – Descriptive Statistics window*

Let’s choose the numerical characteristics describing the distribution of the variables: the mean (Mean), median (Median), variance (Variance), and the coefficients of skewness (Skewness) and kurtosis (Kurtosis). Next, click the Detailed descriptive statistics button to obtain the result (see Fig. 10.11).

*Fig. 10.11 – Descriptive statistics of variables X1, X3, X7 and Y*

As can be seen from the table in Fig. 10.11, the distribution of variables is close to normal.
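The same descriptive statistics can be computed directly. A Python sketch on invented data rather than Table P1 (note that scipy.stats reports excess kurtosis, i.e. 0 for a normal distribution):

```python
# Sketch: the descriptive statistics selected in the lab, computed with
# NumPy/SciPy on invented data rather than Table P1.
import numpy as np
from scipy import stats

x = np.array([4.1, 4.5, 5.0, 5.2, 5.4, 5.8, 6.1, 6.3, 6.9, 7.2])

mean_x = float(np.mean(x))
median_x = float(np.median(x))
var_x = float(np.var(x, ddof=1))   # sample variance
skew_x = float(stats.skew(x))
kurt_x = float(stats.kurtosis(x))  # excess kurtosis (normal -> 0)

print(mean_x, median_x, round(var_x, 3), round(skew_x, 3), round(kurt_x, 3))
```

Skewness and excess kurtosis both close to zero, with the mean close to the median, are the numerical signs that a distribution is approximately normal.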

**Tasks for independent work**

**Task 1.** Based on sample data (table 10.2), investigate the influence of factors X1, X2 and X3 on the effective attribute Y.

Having built a correlation field, make an assumption about the presence and type of relationship between the studied factors.

*Table 10.2. Enterprise performance indicators*

| No. | Y – Production output per employee, thous. rub. | x1 – New OPF, % | x2 – Share of highly skilled workers, % | x3 – Equipment utilization rate |
| --- | --- | --- | --- | --- |
| 1 | 3.9 | | | 0.76 |
| 2 | 3.9 | | | 0.78 |
| 3 | 3.7 | | | 0.75 |
| 4 | | | | 0.78 |
| 5 | 3.8 | | | 0.74 |
| 6 | 4.8 | | | 0.81 |
| 7 | 5.4 | | | 0.81 |
| 8 | 4.4 | | | 0.82 |
| 9 | 5.3 | | | 0.82 |
| 10 | 6.8 | | | 0.82 |
| 11 | | | | 0.84 |
| 12 | 6.4 | | | 0.84 |
| 13 | 6.8 | | | 0.8 |
| 14 | 7.2 | | | 0.8 |
| 15 | | | | 0.85 |
| 16 | 8.2 | | | 0.85 |
| 17 | 8.1 | | | 0.88 |
| 18 | 8.5 | | | 0.87 |
| 19 | 9.6 | | | 0.89 |
| 20 | | | | 0.85 |

**Task options**

Options 1-4: resultant sign – Y, factor sign – X1.

Options 5-7: resultant sign – Y, factor sign – X2.

Options 8-10: resultant sign – Y, factor sign – X3.

**Task 2.** Based on the data from Table P1 of Appendix 1

§ For the variables corresponding to your option (Table 10.3), construct a correlation matrix (take the variables X _{i} as the first list of variables, the variable Y as the second list).

§ Perform a graphical analysis of the most correlated variables. Remove cases that negatively affect the correlation of values. Recalculate the correlation matrix. Draw a conclusion about how the matrix has changed after deleting the data.

§ View the correlation matrix graphically.

*Table 10.3. Task options*

| Option | Variables | Option | Variables |
| --- | --- | --- | --- |
| 1 | Y1, Y2, X4, X5 | 6 | Y2, Y3, X10, X11 |
| 2 | Y1, Y2, X6, X7 | 7 | Y2, Y3, X12, X13 |
| 3 | Y1, Y2, X8, X9 | 8 | Y2, Y3, X8, X9 |
| 4 | Y1, Y2, X10, X11 | 9 | Y1, Y3, X4, X5 |
| 5 | Y1, Y2, X12, X13 | 10 | Y1, Y3, X1, X4 |
