A. Smooth      05/12/2021

Characteristics of position and dispersion. Variational series and its numerical characteristics: position, dispersion, shape.

However important the average characteristics are, an equally important characteristic of an array of numerical data is the behavior of the remaining members of the array relative to the average: how much they differ from it, and how many of them differ from it significantly. In shooting practice one speaks of the grouping of results; in statistics one studies the characteristics of scattering (spread).

The difference between any value $x_i$ and the average value $\bar{x}$ is called the deviation and is calculated as $x_i - \bar{x}$. The deviation is positive if the number is greater than the average and negative if it is less than the average. However, in statistics it is often important to be able to operate with a single number that characterizes the "accuracy" of all numerical elements of the data array. Summing all the deviations always gives zero, since positive and negative deviations cancel each other out. To avoid this cancellation, squared differences are used to characterize the scattering, more precisely, the arithmetic mean of the squared deviations, $D = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$. This scattering characteristic is called the sample variance.

The greater the variance, the greater the scatter of the values of the random variable. To calculate the variance, the sample mean $\bar{x}$ is taken with one extra digit of precision relative to the members of the data array; otherwise, summing a large number of approximate values accumulates a significant error. Because the values carry units, one drawback of the sample variance as a scattering index should be noted: the unit of measurement of the variance $D$ is the square of the unit of the values $X$ whose scatter it characterizes. To remove this drawback, statistics introduces another scattering characteristic, the sample standard deviation, which is denoted by the symbol $\sigma$ (read "sigma") and is calculated by the formula $\sigma = \sqrt{D}$.

Normally, more than half of the members of the data array differ from the average by less than the standard deviation, i.e. they belong to the segment $[\bar{x} - \sigma;\ \bar{x} + \sigma]$. Put another way: the average indicator, taking into account the spread of the data, is $\bar{x} \pm \sigma$.

One more scattering characteristic is introduced because the members of the data array carry units. All numerical characteristics in statistics are introduced in order to compare the results of studying different numerical arrays that characterize different random variables. However, it is not meaningful to compare standard deviations taken about different averages of different data arrays, especially if the units of these values also differ, for example when comparing the lengths and weights of objects, or the scatter in the manufacture of micro- and macro-products. For this reason a characteristic of relative scattering is introduced, which is called the coefficient of variation and is calculated by the formula $V = \frac{\sigma}{\bar{x}} \cdot 100\%$.

To calculate the numerical characteristics of the scatter of the values of a random variable, it is convenient to use a table (Table 6.9).

Table 6.9

Calculation of the numerical characteristics of the scattering of values of a random variable

$x_i$ | $f_i$ | $x_i f_i$ | $x_i - \bar{x}$ | $(x_i - \bar{x})^2 f_i$

While this table is being filled in, the sample mean $\bar{x}$ is found; it is later used in two forms. As the final average characteristic (for example, in the third column of the table), the sample mean $\bar{x}$ must be rounded to the digit position corresponding to the smallest digit of the members of the data array $x_i$. However, this indicator is also used in the table for further calculations, and in that situation, namely in the fourth column of the table, the sample mean $\bar{x}$ must be kept with one more digit than the smallest digit of the members of the data array $x_i$.

The result of calculations using a table like Table 6.9 is the value of the sample variance; to record the answer, the standard deviation $\sigma$ must then be calculated from the sample variance.

The answer states: a) the average result, taking into account the scatter of the data, in the form $\bar{x} \pm \sigma$; b) the data stability characteristic $V$. The answer should also assess the coefficient of variation as good or bad.

An acceptable coefficient of variation as an indicator of the homogeneity or stability of results in sports research is 10-15%. A coefficient of variation $V = 20\%$ is considered very large in any study. If the sample size $n > 25$, then $V > 32\%$ is a very bad indicator.

For example, for the discrete variational series 1; 5; 4; 4; 5; 3; 3; 1; 1; 1; 1; 1; 1; 3; 3; 5; 3; 5; 4; 4; 3; 3; 3; 3; 3, Table 6.9 is filled in as follows (Table 6.10).

Table 6.10

An example of calculating the numerical characteristics of the dispersion of values

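[Table 6.10 body not recovered; only the totals survive.]

$\bar{x} = \frac{\sum x_i f_i}{n} = 2.92 \approx 2.9$

$D = \frac{\sum (x_i - \bar{x})^2 f_i}{n} = \frac{47.6}{25}$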

Answer: a) the average characteristic, taking into account the scatter of the data, is $\bar{x} \pm \sigma = 3 \pm 1.4$; b) the stability of the obtained measurements is at a low level, since the coefficient of variation V = 48% > 32%.
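The tabular calculation for this series can be sketched in Python as follows. This is an illustrative sketch, not the author's procedure: the variable names are assumptions, and it uses exact (unrounded) arithmetic, so the intermediate sums and the coefficient of variation may differ slightly from the rounded figures quoted in the answer, while still lying well above the 32% threshold.

```python
from collections import Counter
from math import sqrt

# The discrete variational series from the example above.
data = [1, 5, 4, 4, 5, 3, 3, 1, 1, 1, 1, 1, 1,
        3, 3, 5, 3, 5, 4, 4, 3, 3, 3, 3, 3]

n = len(data)
freq = Counter(data)  # absolute frequency f_i of each variant x_i

# Sample mean, sample variance (mean of squared deviations),
# standard deviation and coefficient of variation in percent.
mean = sum(x * f for x, f in freq.items()) / n
variance = sum((x - mean) ** 2 * f for x, f in freq.items()) / n
sigma = sqrt(variance)
cv_percent = sigma / mean * 100

# One printed row per variant, mirroring the columns of the table.
for x in sorted(freq):
    f = freq[x]
    print(f"x={x}  f={f}  x*f={x * f}  dev={x - mean:+.2f}  "
          f"dev^2*f={(x - mean) ** 2 * f:.4f}")
print(f"mean={mean:.2f}  D={variance:.3f}  sigma={sigma:.2f}  V={cv_percent:.0f}%")
```

Using `Counter` keeps the calculation in the same "variant and frequency" form as the table, rather than looping over the raw series.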

An analogue of Table 6.9 can also be used to calculate the scattering characteristics of an interval variation series. In that case the variants $x_i$ are replaced by the representatives of the intervals, and the absolute frequencies of the variants $f_i$ by the absolute frequencies of the intervals.

Based on the above, the following conclusions can be drawn.

The conclusions of mathematical statistics are plausible if information about mass phenomena is processed.

Typically, a sample of objects from the population is studied, and this sample should be representative.

The experimental data obtained by studying any property of the sample objects are values of a random variable, since the researcher cannot predict in advance which number will correspond to a particular object.

To choose one or another algorithm for the description and primary processing of experimental data, it is important to be able to determine the type of random variable: discrete, continuous, or mixed.

Discrete random variables are described by a discrete variational series and its graphical form - a frequency polygon.

Mixed and continuous random variables are described by an interval variation series and its graphical form - a histogram.

When several samples are compared by the level of development of a certain property, the average numerical characteristics and the numerical characteristics of the scatter of the random variable about the average are used.

When calculating the average characteristic, it is important to choose the type of average that suits its area of application. The structural means, the mode and the median, characterize the position of a variant within an ordered array of experimental data. The quantitative mean (the sample mean) makes it possible to judge the average size of a variant.

To calculate the numerical characteristics of scattering - sample variance, standard deviation and coefficient of variation - the tabular method is effective.

Along with the most probable risk value, the spread of possible risk values ​​relative to its central value is important. Accounting for the spread of indicators is also necessary when solving problems of social and hygienic monitoring.

The most common characteristics of the spread of a random variable are the variance and standard deviation.

The variance of a random variable ξ, denoted D(ξ) (the notations V(ξ) and σ²(ξ) are also used), characterizes the mean value of the squared deviation of the random variable from its mathematical expectation.

For a discrete random variable taking the values $x_i$ with probabilities $p_i$, the variance is defined as the weighted sum of the squared deviations of $x_i$ from the mathematical expectation of ξ, with weights equal to the corresponding probabilities:

$D(\xi) = \sum_i (x_i - M(\xi))^2 p_i$

For a continuous random variable ξ with distribution density $f(x)$, the variance is determined by the formula:

$D(\xi) = \int_{-\infty}^{+\infty} (x - M(\xi))^2 f(x)\,dx$

The variance has the following practically important properties:

1. The variance of any random variable is non-negative:

$D(\xi) \ge 0$

2. The variance of a constant is 0:

$D(C) = 0$,

where C is a constant.

3. The variance of a random variable ξ is equal to the difference between the mathematical expectation of its square and the square of its mathematical expectation:

$D(\xi) = M[(\xi - M(\xi))^2] = M(\xi^2) - (M(\xi))^2$.

4. Adding a constant to a random variable does not change the variance; multiplying a random variable by a constant a multiplies the variance by a²:

$D(a\xi + b) = a^2 D(\xi)$,

where a and b are constants.

5. The variance of a sum of independent random variables is equal to the sum of their variances:

$D(\xi + \eta) = D(\xi) + D(\eta)$,

where ξ and η are independent random variables.
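The properties listed above can be checked numerically on small discrete distributions. The sketch below is only an illustration: the two distributions, the constants and the function names are hypothetical, and exact fractions are used so that the equalities hold without rounding error.

```python
from fractions import Fraction as F
from itertools import product

def expectation(values, probs):
    """M(xi): probability-weighted sum of the values."""
    return sum(x * p for x, p in zip(values, probs))

def variance(values, probs):
    """D(xi): probability-weighted sum of squared deviations from M(xi)."""
    m = expectation(values, probs)
    return sum((x - m) ** 2 * p for x, p in zip(values, probs))

# Hypothetical distributions, chosen only for illustration.
xs, ps = [F(0), F(1), F(2), F(5)], [F(1, 10), F(2, 5), F(3, 10), F(1, 5)]
ys, qs = [F(-1), F(1)], [F(1, 2), F(1, 2)]

d_xi = variance(xs, ps)
d_eta = variance(ys, qs)

# Property 3: D(xi) = M(xi^2) - (M(xi))^2
d_via_moments = expectation([x * x for x in xs], ps) - expectation(xs, ps) ** 2

# Property 4: D(a*xi + b) = a^2 * D(xi)
a, b = F(3), F(-7)
d_linear = variance([a * x + b for x in xs], ps)

# Property 5: for independent xi and eta the joint probabilities are products,
# so the distribution of the sum can be enumerated pair by pair.
pairs = list(product(zip(xs, ps), zip(ys, qs)))
d_sum = variance([x + y for (x, _), (y, _) in pairs],
                 [p * q for (_, p), (_, q) in pairs])

print("D(xi) =", d_xi, " D(eta) =", d_eta)
print("via moments:", d_via_moments, " D(a*xi+b):", d_linear, " D(xi+eta):", d_sum)
```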

The standard deviation of a random variable ξ (the term "root-mean-square deviation" is also used) is the number σ(ξ) equal to the square root of the variance of ξ:

$\sigma(\xi) = \sqrt{D(\xi)}$

The standard deviation measures the deviation of a random variable from its mathematical expectation in the same units in which the random variable itself is measured (in contrast to the variance, whose dimension is the square of the dimension of the original random variable). For a normal distribution the standard deviation is equal to the parameter σ. Thus, the mathematical expectation and the standard deviation form a complete set of characteristics of a normal distribution and uniquely determine its distribution density. For distributions that differ from normal, this pair of indicators is a less effective characteristic of the distribution.


The coefficient of variation is also used as a characteristic of the scattering of a random variable. The coefficient of variation of a random variable ξ with a nonzero mathematical expectation is the number V(ξ) equal to the ratio of the standard deviation of ξ to its mathematical expectation:

$V(\xi) = \frac{\sigma(\xi)}{M(\xi)}$

The coefficient of variation measures the dispersion of a random variable in fractions of its mathematical expectation and is often expressed as a percentage of the latter. This characteristic should not be used if the mathematical expectation is close to 0 or significantly less than the standard deviation (in this case, small errors in determining the mathematical expectation lead to a high error for the coefficient of variation), and also if the type of distribution density is significantly different from Gaussian.

The asymmetry coefficient (As) is based on the third power of the deviation of a random variable from its mathematical expectation and is determined by the formula:

$As = \frac{M[(\xi - M(\xi))^3]}{\sigma^3(\xi)}$

In practice, this indicator is used as an estimate of the symmetry of the distribution. For any symmetric distribution, it is equal to 0. If the distribution density is not symmetrical (which can often be the case when assessing the risk of death and the risks associated with water and air pollution), then a positive skewness corresponds to the case when the left shoulder of the density curve is steeper than the right one, and negative - the case when the right shoulder is steeper than the left (Figure 4.17).

For skewed distributions, the standard deviation is not a good measure of the spread of a random variable. In this case, indicators such as quartiles, quantiles, and percentiles can be used to characterize scattering.

The first quartile of a random variable ξ with distribution function F(x) is the number $Q_1$ that solves the equation

$F(Q_1) = 1/4$,

i.e., the number for which the probability that ξ takes values less than $Q_1$ equals 1/4 and the probability that it takes values greater than $Q_1$ equals 3/4.

The second quartile ($Q_2$) of a random variable is its median, and the third ($Q_3$) is the solution of the equation

$F(Q_3) = 3/4$.

The quartiles divide the x-axis into 4 intervals, $(-\infty, Q_1]$, $[Q_1, Q_2]$, $[Q_2, Q_3]$ and $[Q_3, +\infty)$, into each of which the random variable falls with equal probability, and they divide the figure bounded by the abscissa axis and the distribution density graph into 4 regions of equal area. The interval between the first and third quartiles contains 50% of the distribution of the random variable. For symmetric distributions, the first and third quartiles are equally distant from the median.

The quantile of order p of a random variable ξ with distribution function F(x) is the number $x_p$ that solves the equation

$F(x_p) = p$.

Thus, the quartiles are the quantiles of order 0.25, 0.5, and 0.75. If the order p of a quantile is expressed as a percentage, the corresponding values $x_p$ are called percentiles, or p-percent points of the distribution.

Fig. 4.18 shows, along with the quartiles, the 2.5% and 97.5% points of the distribution. Between these points, 95% of the distribution of the random variable is concentrated, so the interval between them is called the 95% confidence interval (in particular, in risk assessment, the 95% confidence interval of risk).
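For a distribution whose function F(x) is known, quartiles and percentage points are obtained by inverting F. The sketch below assumes, purely for illustration, a standard normal distribution and uses `statistics.NormalDist` from the Python standard library; the helper name `quantile` is hypothetical.

```python
from statistics import NormalDist

# Assumption for illustration: a standard normal variable (mean 0, sigma 1).
dist = NormalDist(mu=0.0, sigma=1.0)

def quantile(p):
    """Return x_p, the solution of F(x_p) = p for the chosen distribution."""
    return dist.inv_cdf(p)

q1, q2, q3 = quantile(0.25), quantile(0.50), quantile(0.75)
lo, hi = quantile(0.025), quantile(0.975)  # 2.5% and 97.5% points

print(f"Q1={q1:.4f}  Q2={q2:.4f}  Q3={q3:.4f}")
print(f"2.5% point={lo:.4f}  97.5% point={hi:.4f}")
# Share of the distribution between the two percentage points.
print(f"probability between them: {dist.cdf(hi) - dist.cdf(lo):.3f}")
```

For this symmetric distribution the first and third quartiles come out equally distant from the median, as stated above.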

Task 2. Which of the following information about the random variable ξ allows us to reject the assumption that it is distributed according to the normal law:

a) ξ is a discrete random variable;

b) mathematical expectation ξ is negative;

c) the distribution ξ is unimodal;

d) the mathematical expectation ξ is not equal to its median;

e) the asymmetry coefficient ξ is negative;

f) the standard deviation ξ is greater than its mathematical expectation;

g) ξ characterizes the distribution of the duration of acute respiratory diseases in the study area;

h) ξ characterizes the distribution of life expectancy in the study area;

i) the median ξ does not coincide with the center of the interval between the first and third quartiles.

Answer: The assumption about the normal law of distribution of a random variable is incompatible with statements a), d), e), h), i).

Fig. 4.17. Dependence of the sign of skewness on the shape of the distribution density function

Fig. 4.18. Quartiles and percentiles: illustration using the distribution density function

Scattering characteristics

Sample dispersion measures.

The minimum and maximum of the sample are, respectively, the smallest and largest values of the variable being studied. The difference between the maximum and the minimum is called the range of the sample. All sample data lie between the minimum and the maximum; these indicators outline the boundaries of the sample.

$R_1 = 15.6 - 10 = 5.6$

$R_2 = 0.85 - 0.6 = 0.25$

Sample variance and sample standard deviation are measures of the variability of a variable and characterize the degree of data spread around the center. The standard deviation is the more convenient indicator because it has the same dimension as the data under study. Therefore, the standard deviation is used together with the arithmetic mean of the sample to describe the results of data analysis briefly.

It is more expedient to calculate the sample variance by the formula:

The standard deviation is calculated using the formula:

The coefficient of variation is a relative measure of the spread of a feature.

The coefficient of variation is also used as an indicator of the homogeneity of sample observations. It is believed that if the coefficient of variation does not exceed 10%, then the sample can be considered homogeneous, that is, obtained from one general population.

Since the coefficient of variation in both samples does not exceed 10%, they are homogeneous.

The sample can be represented analytically in the form of a distribution function, as well as in the form of a frequency table consisting of two rows. The upper row lists the elements of the sample (variants), arranged in ascending order; the lower row records the frequencies of the variants.

The frequency of a variant is the number of times that variant is repeated in the sample.

Sample #1 "Mothers"

Type of distribution curve

Asymmetry, or the coefficient of skewness (the term was first introduced by Pearson, 1895), is a measure of the skewness of a distribution. If the skewness is clearly different from 0, the distribution is asymmetric; the density of the normal distribution is symmetric about the mean.

The skewness index is used to characterize the degree of symmetry of the data distribution around the center. Skewness can take both negative and positive values. A positive value indicates that the data are shifted to the left of the center, a negative value that they are shifted to the right. Thus, the sign of the skewness index indicates the direction of the shift, while its magnitude indicates the degree of the shift. Zero skewness indicates that the data are concentrated symmetrically around the center.

Since the skewness is positive, the peak of the curve is shifted to the left of the center.

The kurtosis coefficient is a measure of how tightly the bulk of the data clusters around the center.

With positive kurtosis the curve is sharper; with negative kurtosis it is flatter.

The curve is flattened;

The curve is sharpening.

In descriptive statistics, the estimation of sample parameters is central.

Point estimation of distribution parameters

A point estimate is a quantitative characteristic of the general population expressed as a function of the observed random variables. Below we discuss point estimation of distribution parameters.

Consider the properties of point estimates.

A) An unbiased estimate of the parameter θ is a statistical estimate θ* whose mathematical expectation is equal to θ: M(θ*) = θ.

If M(θ*) > θ (or M(θ*) < θ), a systematic error arises (a non-random error that distorts the measurement results in one direction). An unbiased estimate guards against systematic errors.

B) However, an unbiased estimate does not always give a good approximation of the estimated parameter. Indeed, the possible values of θ* may be widely scattered around their mean (the variance D(θ*) can be large). Then the estimate found from a given sample, for example θ*₁, may be far from M(θ*), and hence from θ. Therefore, after unbiasedness, it is natural to require a small variance.

An estimate is called efficient if, for a given sample size, it has the smallest variance.

C) When large samples are considered, statistical estimates are required to be consistent. An estimate is called consistent if, as n→∞, it tends in probability to the estimated parameter:

$\lim_{n\to\infty} P(|\theta^* - \theta| < \varepsilon) = 1$ for any $\varepsilon > 0$.

For example, if the variance of an unbiased estimate tends to zero as n→∞, then the estimate is consistent.

Let's move on to estimating the distribution parameters.

Distribution parameters are its numerical characteristics. They indicate where, on average, the values of the feature are located (measures of position), how variable the values are (measures of scattering), and how the distribution deviates from the normal (measures of shape). In real research we work not with the parameters themselves but with their approximate values, parameter estimates, which are functions of the observed values. Note that the larger the sample, the closer the parameter estimate can be to its true value.



Let $x_1, x_2, \dots, x_k$ be the variants of the variation series and $n_1, n_2, \dots, n_k$ the frequencies of the corresponding variants; n is the sample size.

Position indicators


If an interval statistical distribution is given, the sample mean is determined from the midpoints of the corresponding intervals:

$\bar{x} = \frac{1}{n}\sum_{i=1}^{k} n_i x_i^{*}$,

where $x_i^{*}$ is the middle of the i-th interval.

The sample mean is an unbiased and consistent estimate.

Median: the value of the feature that falls in the middle of the variation series ordered in ascending order. If the series consists of (2N+1) variants, the median is the (N+1)-th value; if the series consists of 2N variants, the median is half the sum of the N-th and (N+1)-th values.

Mode: the variant with the highest frequency. If there are several such variants (they have the same frequency), the distribution is called polymodal.
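The two median cases and the polymodal case can be illustrated with the Python standard library. The small data sets below are hypothetical and serve only as a sketch of the definitions above.

```python
from statistics import median, multimode

odd_series = [3, 1, 4, 1, 5]       # 2N+1 = 5 values, N = 2
even_series = [3, 1, 4, 1, 5, 9]   # 2N = 6 values, N = 3

# Odd length: the (N+1)-th value of the ordered series.
median_odd = median(odd_series)
# Even length: half the sum of the N-th and (N+1)-th ordered values.
median_even = median(even_series)

# multimode returns every variant that shares the highest frequency.
modes_single = multimode(odd_series)
modes_poly = multimode([1, 1, 2, 2, 3])  # two variants tie: polymodal

print(median_odd, median_even, modes_single, modes_poly)
```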

Variation indicators

Range: the difference between the largest and smallest values of the variants.

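Sample variance (an estimate of the variance): a characteristic of the scatter of the observed values of a quantitative attribute of the sample around its average value. Denote the sample variance by $D_B$:

$D_B = \frac{1}{n}\sum_{i=1}^{k} n_i (x_i - \bar{x})^2$

It can be shown that $M(D_B) = \frac{n-1}{n} D$, where $D$ is the variance of the general population. Therefore, the corrected (unbiased) variance, which we denote by $s^2$, is equal to

$s^2 = \frac{n}{n-1} D_B = \frac{1}{n-1}\sum_{i=1}^{k} n_i (x_i - \bar{x})^2$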
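The relation between the sample variance (division by n) and the corrected variance (division by n-1) can be sketched with the Python standard library, where `pvariance` divides by n and `variance` divides by n-1. The data set is hypothetical and chosen only so that the arithmetic is easy to follow.

```python
from statistics import pvariance, variance

# Hypothetical sample: mean 5, sum of squared deviations 32.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(sample)

d_sample = pvariance(sample)             # divides by n
s2_corrected = variance(sample)          # divides by n - 1
s2_from_sample = n / (n - 1) * d_sample  # corrected variance via the n/(n-1) factor

print(f"n={n}  sample variance={d_sample}  corrected variance={s2_corrected:.4f}")
```

The corrected value is always the larger of the two, and the difference shrinks as n grows.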


In addition to the sample variance, a derived characteristic of scattering is used: the standard deviation σ, equal to the square root of the sample variance.

Sample skewness is a characteristic of the symmetry of the distribution, denoted $A_s$. For symmetric distributions (including the normal distribution) the skewness is zero. If $A_s > 0$, the "long part" of the distribution curve lies to the right of the mathematical expectation; if $A_s < 0$, to the left of the mathematical expectation (Fig. 2).

Sample kurtosis is a characteristic of the steepness of the rise of the distribution curve, denoted $E_k$. For a normal distribution the kurtosis is zero. If $E_k > 0$, the curve has a higher and sharper peak than the normal curve; if $E_k < 0$, the curve has a lower and flatter peak than the normal curve (Fig. 1).
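The formulas for sample skewness and kurtosis are not preserved in the text above. The sketch below assumes the common moment-based definitions, skewness as $m_3 / m_2^{3/2}$ and excess kurtosis as $m_4 / m_2^2 - 3$, where $m_k$ is the k-th central sample moment; the function names and data sets are hypothetical.

```python
def central_moment(data, k):
    """k-th central sample moment: mean of (x - mean)^k, dividing by n."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** k for x in data) / n

def sample_skewness(data):
    # Assumed definition: third central moment over sigma cubed.
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def sample_excess_kurtosis(data):
    # Assumed definition: fourth central moment over sigma^4, minus 3,
    # so that a normal distribution gives zero.
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
long_right_tail = [1, 1, 1, 2, 10]
long_left_tail = [-x for x in long_right_tail]

print("symmetric:", sample_skewness(symmetric), sample_excess_kurtosis(symmetric))
print("long right tail:", sample_skewness(long_right_tail))
print("long left tail:", sample_skewness(long_left_tail))
```

The sign behaves as described above: a long right tail gives a positive value, a long left tail a negative one, and the evenly spread symmetric set gives zero skewness with negative (flat) kurtosis.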

One of the reasons for carrying out statistical analysis is the need to take into account the influence of random factors (perturbations) on the indicator under study, which lead to scatter in the data. Solving problems that involve data scatter is associated with risk, since even when all available information is used, it is impossible to predict exactly what will happen in the future. To work adequately in such situations, it is advisable to understand the nature of the risk and to be able to determine the degree of dispersion of the data set. There are three numerical characteristics that describe the measure of dispersion: standard deviation, range, and coefficient of variation (variability). Unlike the typical indicators (mean, median, mode) that characterize the center, scattering characteristics show how close the individual values of the data set are to this center.

Definition of standard deviation. The standard deviation is a measure of the random deviations of data values from the mean. In real life most data exhibit scatter, i.e. individual values lie at some distance from the average.

A generalizing characteristic of scattering cannot be obtained by simply averaging the deviations of the data, because some deviations are positive and others negative, so the average may turn out to be zero. To get rid of the negative sign, a standard trick is used: first the variance is calculated as the sum of squared deviations divided by (n-1), and then the square root of the resulting value is taken. The formula for calculating the standard deviation is as follows:

$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$

Note 1. The variance carries no additional information compared with the standard deviation, but it is more difficult to interpret, because it is expressed in "units squared", while the standard deviation is expressed in familiar units (for example, in dollars).

Note 2. The above formula calculates the standard deviation of a sample and is more accurately called the sample standard deviation. When calculating the standard deviation of a population (denoted by the symbol σ), the sum is divided by n. The value of the sample standard deviation is somewhat larger (because the sum is divided by n-1), which provides a correction for the randomness of the sample itself.

When the data set has a normal distribution, the standard deviation takes on a special meaning. In the figure below, marks are placed on both sides of the mean at distances of one, two and three standard deviations. The figure shows that approximately two-thirds (about 68%) of all values lie within one standard deviation on either side of the mean, 95% of the values lie within two standard deviations of the mean, and almost all of the data (99.7%) lie within three standard deviations of the mean.


This property of the standard deviation for normally distributed data is called the "two-thirds rule".

In some situations, such as product quality control analysis, limits are often set such that those observations (0.3%) that are more than three standard deviations from the mean are considered as worthy of attention.

Unfortunately, if the data is not normally distributed, then the rule described above cannot be applied.

There is, however, a bound called Chebyshev's rule that can be applied to skewed distributions.
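The rule itself is not spelled out above. In its usual form (stated here as background, not taken from the text), Chebyshev's inequality guarantees that for any distribution with finite variance at least $1 - 1/k^2$ of the values lie within k standard deviations of the mean, for k > 1. The sketch below checks this on a hypothetical right-skewed sample; the function name and the data are assumptions for illustration.

```python
from statistics import fmean, pstdev

def chebyshev_lower_bound(k):
    """Minimum share of values within k standard deviations of the mean (k > 1)."""
    return 1 - 1 / k ** 2

# Hypothetical right-skewed sample, chosen only for illustration.
sample = [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 12, 30]

mu = fmean(sample)
sd = pstdev(sample)  # population form, so the bound applies to this data set exactly

observed = {}
for k in (2, 3):
    inside = sum(abs(x - mu) <= k * sd for x in sample)
    observed[k] = inside / len(sample)
    print(f"k={k}: guaranteed at least {chebyshev_lower_bound(k):.1%}, "
          f"observed {observed[k]:.1%}")
```

The guaranteed shares (75% for k = 2, about 89% for k = 3) are weaker than the normal-distribution figures, which is the price of making no assumption about the shape of the distribution.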

Generate initial data

Table 1 shows the dynamics of changes in daily profit on the stock exchange, recorded on working days for the period from July 31 to October 9, 1987.

Table 1. Dynamics of changes in daily profit on the stock exchange

Date  Daily profit    Date  Daily profit    Date  Daily profit
      -0.006                 0.009                 0.012
      -0.004                -0.015                -0.004
       0.008                -0.006                 0.002
       0.011                 0.002                -0.008
      -0.001                 0.011                -0.010
       0.017                 0.013                -0.013
       0.017                 0.002                 0.009
      -0.004                -0.018                -0.020
       0.008                -0.014                -0.003
      -0.002                -0.001                -0.001
       0.006                -0.001                 0.017
      -0.017                -0.013                 0.001
       0.004                 0.030                -0.000
       0.015                 0.007                -0.035
       0.001                -0.007                 0.001
      -0.005                 0.001                -0.014
Launch Excel.
Create the file. Click the Save button on the Standard toolbar, open the Statistics folder in the dialog box that appears, and name the file Scattering Characteristics.xls.
Set the label. On Sheet1, in cell A1, enter the label Daily profit, and in the range A2:A49 enter the data from Table 1.
Set the AVERAGE function. In cell D1, enter the label Average. In cell D2, calculate the average using the AVERAGE statistical function.
Set the STDEV function. In cell D4, enter the label Standard Deviation. In cell D5, calculate the standard deviation using the STDEV statistical function.
Reduce the displayed precision of the results to four decimal places.
Interpretation of results. The daily profit fell by 0.04% on average (the average daily profit turned out to be -0.0004). This means that the average daily profit over the period considered was approximately zero, i.e. the market held its average level. The standard deviation turned out to be 0.0118. This means that one dollar ($1) invested in the stock market changed by $0.0118 per day on average, i.e. the investment could bring a profit or a loss of $0.0118.
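The same two figures can be reproduced outside Excel. The sketch below is an assumed Python equivalent of the AVERAGE and STDEV steps: `statistics.stdev` divides by n-1, as the sample formula above does, and the list holds the 48 values of Table 1 read column by column (the variable names are hypothetical).

```python
from statistics import mean, stdev

# The 48 daily profit values from Table 1, column by column.
daily_profit = [
    -0.006, -0.004, 0.008, 0.011, -0.001, 0.017, 0.017, -0.004,
    0.008, -0.002, 0.006, -0.017, 0.004, 0.015, 0.001, -0.005,
    0.009, -0.015, -0.006, 0.002, 0.011, 0.013, 0.002, -0.018,
    -0.014, -0.001, -0.001, -0.013, 0.030, 0.007, -0.007, 0.001,
    0.012, -0.004, 0.002, -0.008, -0.010, -0.013, 0.009, -0.020,
    -0.003, -0.001, 0.017, 0.001, 0.000, -0.035, 0.001, -0.014,
]

average = mean(daily_profit)     # counterpart of AVERAGE over A2:A49
std_dev = stdev(daily_profit)    # counterpart of STDEV (sample form, n - 1)

print(f"Average: {average:.4f}")
print(f"Standard deviation: {std_dev:.4f}")
```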
Let's check whether the daily profit values given in Table 1 follow the rules of the normal distribution.

1. Calculate the interval corresponding to one standard deviation on either side of the mean.

2. In cells D7, D8 and F8, set the labels One standard deviation, Lower limit and Upper limit, respectively.

3. In cell D9, enter the formula = -0.0004 - 0.0118, and in cell F9, enter the formula = -0.0004 + 0.0118.

4. Display the result to four decimal places.

5. Determine the number of daily profit values that lie within one standard deviation. First, filter the data, leaving the daily profit values in the interval [-0.0121, 0.0114]. To do this, select any cell in column A with daily profit values and run the command:

Data → Filter → AutoFilter

Open the menu by clicking on the arrow in the Daily profit header, and select (Condition...). In the Custom AutoFilter dialog box, set the options as shown below. Click the OK button.

To count the filtered values, select the range of daily profit values, right-click on an empty space in the status bar, and select the Number of values command from the context menu. Read the result. Now display all the original data by running the command Data → Filter → Show All, and turn off the autofilter using the command Data → Filter → AutoFilter.

6. Calculate the percentage of daily profit values that lie within one standard deviation of the average. To do this, enter the label Percent in cell H8, and in cell H9 enter the formula for calculating the percentage and display the result to one decimal place.

7. Calculate the range of daily profit values within two standard deviations of the mean. In cells D11, D12 and F12, set the labels Two standard deviations, Lower limit and Upper limit, respectively. In cells D13 and F13, enter the calculation formulas and display the result to four decimal places.

8. Determine the number of daily profit values that lie within two standard deviations by first filtering the data.

9. Calculate the percentage of daily profit values that lie within two standard deviations of the average. To do this, enter the label Percent in cell H12, and in cell H13 enter the formula for calculating the percentage and display the result to one decimal place.

10. Calculate the range of daily profit values within three standard deviations of the mean. In cells D15, D16 and F16, set the labels Three standard deviations, Lower limit and Upper limit, respectively. In cells D17 and F17, enter the calculation formulas and display the result to four decimal places.

11. Determine the number of daily profit values that lie within three standard deviations by first filtering the data.

12. Calculate the percentage of daily profit values that lie within three standard deviations of the average. To do this, enter the label Percent in cell H16, and in cell H17 enter the formula for calculating the percentage and display the result to one decimal place.

13. Plot a histogram of the stock's daily earnings on the stock exchange and place it along with the frequency distribution table in the area J1:S20. Show on the histogram the approximate mean and intervals corresponding to one, two, and three standard deviations from the mean, respectively.
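The filtering and percentage steps of this exercise can also be sketched in Python. This is an assumed equivalent of the spreadsheet procedure, not part of it: the helper `share_within` is hypothetical, the bounds are computed from the unrounded mean and standard deviation, and the data are the 48 values of Table 1 read column by column.

```python
from statistics import mean, stdev

# The 48 daily profit values from Table 1, column by column.
daily_profit = [
    -0.006, -0.004, 0.008, 0.011, -0.001, 0.017, 0.017, -0.004,
    0.008, -0.002, 0.006, -0.017, 0.004, 0.015, 0.001, -0.005,
    0.009, -0.015, -0.006, 0.002, 0.011, 0.013, 0.002, -0.018,
    -0.014, -0.001, -0.001, -0.013, 0.030, 0.007, -0.007, 0.001,
    0.012, -0.004, 0.002, -0.008, -0.010, -0.013, 0.009, -0.020,
    -0.003, -0.001, 0.017, 0.001, 0.000, -0.035, 0.001, -0.014,
]

average = mean(daily_profit)
std_dev = stdev(daily_profit)  # sample form (n - 1), like STDEV

def share_within(k):
    """Bounds, count and percentage of values within k standard deviations."""
    lower, upper = average - k * std_dev, average + k * std_dev
    count = sum(lower <= x <= upper for x in daily_profit)
    return lower, upper, count, 100 * count / len(daily_profit)

for k in (1, 2, 3):
    lower, upper, count, percent = share_within(k)
    print(f"{k} sd: [{lower:.4f}, {upper:.4f}]  count={count}  percent={percent:.1f}%")
```

Comparing the three printed percentages with the figures quoted for the normal distribution gives the check that the exercise asks for.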