Biometric data processing
Purpose of the work: to master the methodology of biometric data processing.
1. Study the formulas used to calculate the main biometric parameters (arithmetic mean, variance, standard deviation, coefficient of variation, error of the arithmetic mean) and learn to assess the statistical significance of the parameters obtained.
2. Perform biometric processing of data obtained in the course of biological monitoring of the environment and environmental and epidemiological work.
The ultimate goal of a study is to find the parameters (indicators) that characterize the properties of the general population. The general population is the set of all objects of the category under study. A sample is a part of the general population, selected according to special rules and intended to characterize it.
Basic biometric parameters
1.1. A variant, or datum (from Latin variantis: changing; modification, variety), is the result of measuring a trait in an individual object of study. It is denoted by the letter V or x.
1.2. Sample size. The size of a population is characterized by the number of observations N (the number of measured objects, variants or data). Samples are divided into small (N ≤ 20...30) and large (N ≥ 20...30).
The results of observations in a large sample can be represented as a variation series. A class is a part of the population that combines all objects (variants) with similar values of the trait being studied. The numerical values of a class (its boundaries) are denoted by the letter W. The number of variants in each individual class is called the frequency and is denoted by the letter n. The sample size is equal to the sum of the frequencies over all classes of the series (N = Σn).
A variation series is a double series of numbers: one row gives the values of the trait in the classes, W, and the other gives the number of objects in each class, n. The classes are arranged in ascending or descending order of the trait value W, and the class interval is the same throughout the series. In our example it is 5: the boundaries of each class differ from those of the neighboring class by 5:
Classes, W        1-5    6-10   11-15   16-20   21-25   26-30
Frequencies, n     2      4       7       5       4       3
N = Σn = 2 + 4 + 7 + 5 + 4 + 3 = 25.
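The construction of a variation series can be sketched in a few lines of Python. The frequencies are those of the example series; the list raw and the helper tally are hypothetical names introduced only for illustration, and the sketch assumes integer variants that are not smaller than the lower boundary of the first class.

```python
from collections import Counter

# Frequencies n copied from the example variation series (classes 1-5 ... 26-30).
frequencies = [2, 4, 7, 5, 4, 3]
N = sum(frequencies)  # sample size N = Σn
print(N)  # 25


def tally(values, first_lower=1, width=5):
    """Count integer variants per class.

    Class i covers first_lower + i*width ... first_lower + (i+1)*width - 1.
    Assumes a non-empty list with every value >= first_lower.
    """
    counts = Counter((v - first_lower) // width for v in values)
    return [counts.get(i, 0) for i in range(max(counts) + 1)]


raw = [3, 7, 8, 12, 14, 15, 19, 22, 30]  # hypothetical measurements
print(tally(raw))  # [1, 2, 3, 1, 1, 1]
```

The sum of the list returned by tally is again the sample size, here 9.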
1.3. The arithmetic mean M is determined by dividing the sum of all values of the trait (the sum of the data) by the size of the population:
M = ΣV / N,
where V is a variant, or datum (the value of the trait for an individual object); Σ (capital sigma) is the summation sign; N is the population size (number of observations).
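A minimal sketch of this calculation in Python, assuming a small hypothetical sample of five measurements (the numbers are invented for illustration):

```python
values = [12, 15, 11, 14, 13]  # hypothetical variants V

N = len(values)      # population size (number of observations)
M = sum(values) / N  # arithmetic mean: sum of the variants divided by N
print(M)  # 13.0
```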
1.4. Limits (lim): lim = Vmin...Vmax. The limits of diversity are determined by the extreme variants, i.e. the minimum and maximum values of the trait in the population.
1.5. The range of diversity is ρ = Vmax – Vmin. The smaller ρ, the less the diversity, and vice versa.
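For a hypothetical sample (invented values), the limits and the range can be found as follows; the variable names are illustrative only:

```python
values = [12, 15, 11, 14, 13]  # hypothetical variants V

v_min, v_max = min(values), max(values)  # extreme variants
rho = v_max - v_min                      # range of diversity
print(f"lim = {v_min}...{v_max}")  # lim = 11...15
print(rho)  # 4
```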
1.6. Dispersion C, or the sum of squares of the central deviations, i.e. the sum of the squared differences between each individual variant V and the arithmetic mean M:
C = Σ(V – M)²
1.7. Variance σ² (sigma squared), or the mean square of the central deviations, is calculated by the formula:
σ² = C / (N – 1) = Σ(V – M)² / (N – 1).
The expression N – 1 is called the number of degrees of freedom and is denoted by υ. The sum of squared central deviations is divided not by the population size N but by N – 1, the number of objects that can vary freely about the arithmetic mean: once M has been calculated from the sample, one of the N values is determined by the others.
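The sum of squares and the variance can be checked on a hypothetical sample (invented values). The sketch divides by N – 1, as described above, and compares the result with statistics.variance from the Python standard library, which uses the same divisor.

```python
import statistics

values = [12, 15, 11, 14, 13]  # hypothetical variants V

N = len(values)
M = sum(values) / N
C = sum((v - M) ** 2 for v in values)  # sum of squared central deviations
variance = C / (N - 1)                 # divided by the degrees of freedom N - 1
print(C, variance)  # 10.0 2.5

# Cross-check against the standard library (sample variance, divisor N - 1).
assert abs(variance - statistics.variance(values)) < 1e-12
```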
1.8. The standard deviation σ (the mean square deviation from the arithmetic mean) is the main indicator of the diversity of trait values in a group. It is used to determine a number of other population parameters (coefficient of variation, error of the arithmetic mean, etc.):
σ = √σ² = √(Σ(V – M)² / (N – 1)).
σ is taken as the positive value of the square root. It is a dimensional quantity (expressed in the units of measurement of the trait) and shows by how much, on average, each variant differs from the arithmetic mean.
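A short sketch for a hypothetical sample (invented values): σ is computed as the positive square root of the sum of squared deviations divided by N – 1 and compared with statistics.stdev from the standard library.

```python
import math
import statistics

values = [12, 15, 11, 14, 13]  # hypothetical variants V

N = len(values)
M = sum(values) / N
sigma = math.sqrt(sum((v - M) ** 2 for v in values) / (N - 1))
print(round(sigma, 4))  # 1.5811

# statistics.stdev also uses the N - 1 divisor.
assert math.isclose(sigma, statistics.stdev(values))
```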
1.9. The coefficient of variation Cv shows what fraction of the arithmetic mean the standard deviation σ amounts to:
Cv = σ / M × 100%.
It is expressed as a fraction of unity or as a percentage, i.e. it is a relative value, which allows it to be used to compare the degree of diversity of different traits and different groups, even when they are expressed in different units of measurement. The diversity of the objects in a population is usually due to variability.
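Because the coefficient of variation is relative, it allows traits measured in different units to be compared. The sketch below does this for two hypothetical traits (all numbers are invented); coefficient_of_variation is an illustrative helper, not a standard library function.

```python
import statistics


def coefficient_of_variation(values):
    """Return sigma / M * 100, i.e. the coefficient of variation in percent."""
    return statistics.stdev(values) / statistics.mean(values) * 100


body_length_cm = [12, 15, 11, 14, 13]    # hypothetical, in centimetres
body_mass_g = [120, 180, 100, 160, 140]  # hypothetical, in grams

print(round(coefficient_of_variation(body_length_cm), 2))  # 12.16
print(round(coefficient_of_variation(body_mass_g), 2))     # 22.59
```

In this invented example body mass is almost twice as variable as body length, although the two standard deviations themselves cannot be compared directly.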
1.10. The normalized deviation t is the deviation of an individual variant from the arithmetic mean, expressed in fractions of sigma. It is determined by the formula:
t = (V – M) / σ.
Replacing the actual values of the trait with normalized deviations and plotting t on the x-axis, we obtain a number series with zero in the center, negative values to the left of zero and positive values to the right.
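The replacement of actual values by normalized deviations can be illustrated on a hypothetical sample (invented values). After the transformation the series is centered on zero and its standard deviation equals one.

```python
import statistics

values = [12, 15, 11, 14, 13]  # hypothetical variants V

M = statistics.mean(values)
sigma = statistics.stdev(values)  # sample standard deviation (divisor N - 1)
t = [(v - M) / sigma for v in values]  # normalized deviations
print([round(x, 2) for x in t])  # [-0.63, 1.26, -1.26, 0.63, 0.0]
```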
The larger the sigma, the wider the normal distribution curve and the lower its maximum, and vice versa. When conducting a study, it is important to find the proportion of all variants that deviate from the mean by no more than a certain amount, i.e. to establish which part of the variants lies in the range from M – tσ to M + tσ (or M ± tσ). There are special tables that give the proportions of variants lying within the specified limits, with a step in t of 0.1 or even 0.01. Thus, in a normally distributed population the range from M – 1σ to M + 1σ contains 0.6827 (68.3%) of all variants; the range from M – 2σ to M + 2σ contains 0.9545 (95.5%); and the range from M – 3σ to M + 3σ contains 0.9973 (99.73%). These values characterize the probability B of finding variants within the given limits.
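The tabulated proportions can be reproduced without a table. For a normal distribution the proportion of variants within M ± tσ equals erf(t / √2), where erf is the error function; the helper name share_within is hypothetical.

```python
import math


def share_within(t):
    """Proportion of a normally distributed population within M - t*sigma ... M + t*sigma."""
    return math.erf(t / math.sqrt(2))


for t in (1, 2, 3):
    print(t, round(share_within(t), 4))
```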
In biological research, three main levels of probability (levels of reliability of conclusions) are used, together with the corresponding values of t, called criteria for the reliability of conclusions. The probability B = 0.95 corresponds to t = 1.96 (rounded, t = 2). This level means that 95 observations out of 100 lie within the limits from M – 1.96σ to M + 1.96σ. The second, higher level of reliability is B = 0.99; it corresponds to t = 2.58 (rounded, 2.6), i.e. 99 observations out of 100 lie within the limits from M – 2.58σ to M + 2.58σ. The third level, B = 0.999, corresponds to t = 3.29 (rounded, 3.3). These probabilities are called confidence probabilities; the limits M ± 2σ, M ± 2.6σ and M ± 3.3σ are called confidence limits, and the intervals between them are called confidence intervals for the corresponding confidence probabilities.
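The values of t for the three probability levels can be recovered from the normal distribution. This is a sketch that assumes a normally distributed trait and a large sample; t_for is a hypothetical helper built on statistics.NormalDist from the standard library (Python 3.8 or later).

```python
from statistics import NormalDist


def t_for(B):
    """Two-sided normal criterion t such that a share B lies within M ± t*sigma."""
    return NormalDist().inv_cdf((1 + B) / 2)


for B in (0.95, 0.99, 0.999):
    print(B, round(t_for(B), 2))
```

The argument (1 + B) / 2 is used because the share 1 – B lying outside the limits is split equally between the two tails of the distribution.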
Sometimes, instead of the confidence probabilities B (i.e. the probabilities that the parameter lies within the given confidence limits), the corresponding significance levels p are used: p < 0.05, p < 0.01 and p < 0.001. These levels give the probability that the parameter lies outside the confidence limits, i.e. the probability of error.
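The correspondence between the two notations is simply p = 1 – B, which can be sketched as follows (significance_level is a hypothetical helper; rounding only removes floating-point noise):

```python
def significance_level(B):
    """Probability of error corresponding to the confidence probability B."""
    return round(1 - B, 3)


for B in (0.95, 0.99, 0.999):
    print(B, significance_level(B))
```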
The English statistician Gosset, who published his work under the pseudonym Student, compiled theoretically substantiated tables (tables of standard values of Student's t-test) that give the values of t depending on the number of objects. It should be noted that the required sample size N becomes larger as the reliability of the conclusions increases (t increases), i.e. as the accuracy of the planned study increases.
Sources of statistical errors. Error of the arithmetic mean
The ultimate goal of research is to find the parameters of the general population. The differences between sample parameters and general parameters are objective in nature, i.e. they arise independently of the researcher whenever the whole (the general population) is characterized by a part (the sample). The sources of statistical errors are the limited size of the sample and the randomness of the selection of objects. They should not be confused with errors of other kinds: errors of typicality (when the sample is drawn incorrectly and is therefore not representative), instrument errors, measurement errors, etc. Such errors are not revealed by biometric methods and must be eliminated in advance. Only when they are absent from the measurement results can the material be processed biometrically to identify statistical errors (errors of representativeness), which cannot be avoided when the sampling method is used and which must be taken into account in order to substantiate the conclusions scientifically.
Representativeness errors show the degree to which sample parameters correspond to the parameters of the general population. The smaller the numerical value of the error, the more accurate the calculated parameter and the closer its value to that of the corresponding parameter of the general population.
Errors are calculated for all sample parameters and are usually denoted by the letter m with a subscript giving the symbol of the parameter for which they are determined: m_M, m_σ, etc. The greater the number of objects in the sample, the smaller the deviation of the sample mean from the general mean. Thus, the size of the error of the arithmetic mean is related to the number of objects in the sample. This relationship is expressed by the formula:
m_M = σ / √N.
The error is expressed in the same units (cm, kg, %, etc.) as the arithmetic mean and is usually written as M ± m_M.
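The error of the arithmetic mean can be calculated for a hypothetical sample (invented values) as the standard deviation divided by the square root of N. The function name error_of_mean is illustrative only.

```python
import math
import statistics


def error_of_mean(values):
    """Error of the arithmetic mean: sigma / sqrt(N), with sigma computed on N - 1."""
    return statistics.stdev(values) / math.sqrt(len(values))


values = [12, 15, 11, 14, 13]  # hypothetical variants V
M = statistics.mean(values)
m_M = error_of_mean(values)
print(f"M ± m_M = {M} ± {m_M:.2f}")  # M ± m_M = 13 ± 0.71
```

With the same σ, a fourfold increase in the number of objects halves the error, since the error is inversely proportional to √N.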
There is a general rule for determining the statistical significance of sample parameters: the ratio of the sample parameter P to its error m_P is compared with the tabulated value t_table from Student's table for the corresponding number of degrees of freedom (υ = N – 1). If P/m_P ≥ t_table, the sample parameter is statistically significant and the confidence limits of the general parameter can be stated. If P/m_P