Personal growth      03/26/2020

Multivariate statistical analysis: essence and types. Basic concepts of the factor analysis method and the essence of the tasks it solves

The introduction of computers into the management of the national economy implies a transition from traditional methods of analyzing the activities of enterprises to more advanced models of economic management that can reveal its underlying processes.

The widespread use of the methods of mathematical statistics in economic research makes it possible to deepen economic analysis and to improve the quality of information used in planning and forecasting production indicators and in analyzing their effectiveness.

The complexity and variety of relationships between economic indicators determine the multidimensionality of features and therefore require a more complex mathematical apparatus: the methods of multidimensional statistical analysis.

The concept of "multivariate statistical analysis" covers a number of methods designed to study sets of interrelated features. At issue is the partitioning of the set under consideration, represented by multidimensional features, into a relatively small number of them.

At the same time, the transition from a large number of features to a smaller one aims to reduce their dimension while increasing informative capacity. This goal is achieved by identifying information that is repeated because it is generated by interrelated features, and by establishing the possibility of aggregating (combining, summing) certain features. The latter transforms the actual model into a model with fewer factor features.

The methods of multidimensional statistical analysis make it possible to identify objectively existing but not explicitly expressed patterns that manifest themselves in certain socio-economic phenomena. One encounters this in a number of practical problems in economics, in particular when it is necessary to record simultaneously the values of several quantitative characteristics (features) for each object of observation under study, and each characteristic is subject to uncontrolled variation across objects despite the homogeneity of the objects of observation.

For example, when examining enterprises that are homogeneous in natural and economic conditions and in type of specialization with respect to a number of production-efficiency indicators, we find that in moving from one object to another almost every selected characteristic takes an unequal numerical value, i.e., exhibits an uncontrolled (random) scatter. Such "random" variation of features tends to follow certain regular tendencies, both in the well-defined magnitudes around which the variation occurs and in the degree and interdependence of the variation itself.

The above leads to the definition of a multidimensional random variable: a set of quantitative features, the value of each of which is subject to uncontrolled scatter across repetitions of a given process of statistical observation, experiment, etc.

As noted earlier, multivariate analysis combines a number of methods, namely: factor analysis, principal component analysis, cluster analysis, pattern recognition, discriminant analysis, etc. The first three of these methods are considered in the following sections.

Like other mathematical and statistical methods, multivariate analysis is effective provided that the initial information is of high quality and that the mass observational data are processed on a computer.

Basic concepts of the factor analysis method, the essence of the tasks it solves

In analyzing (and studying) socio-economic phenomena, one often encounters cases where, among the rich set of parameters of the objects of observation, it is necessary to exclude a share of the parameters, or to replace them with a smaller number of certain functions, without harming the completeness of information. The solution of such a problem makes sense within the framework of a certain model and is determined by its structure. An example of such a model, suitable for many real situations, is the factor analysis model, whose methods concentrate information about the features by "condensing" a large number of them into a smaller, more informative set. The resulting "condensate" of information should then be represented by the most significant and defining quantitative characteristics.

The concept of "factor analysis" should not be confused with the broader concept of the analysis of cause-and-effect relationships, in which the influence of various factors (and their combinations) on a resultant attribute is studied.

The essence of the factor analysis method is to exclude from the description of the phenomenon under study its many characteristics and to replace them with a smaller number of informationally more capacious variables, called factors, which reflect the most significant properties of the phenomenon. Such variables are certain functions of the original features.

Factor analysis, in the words of Ya. Okun', makes it possible to obtain a first approximate characterization of the regularities underlying the phenomenon and to formulate initial, general conclusions about the directions in which further research should be carried out. He further points to the main assumption of factor analysis: that the phenomenon, despite its heterogeneity and variability, can be described by a small number of functional units, parameters or factors. These are referred to by various terms: influences, causes, parameters, functional units, abilities, basic or independent indicators. The use of one term or another depends on the context and on knowledge of the essence of the phenomenon under study.

Okun Ya. Factor Analysis: translated from Polish. Moscow: Statistika, 1974, p. 16.

The stages of factor analysis are sequential comparisons of various sets of factors and of variants of grouping, with the inclusion and exclusion of factors and an assessment of the significance of the differences between groups.

V.M. Zhukovskaya and I.B. Muchnik, speaking about the essence of the tasks of factor analysis, argue that it does not require an a priori subdivision of variables into dependent and independent, since all variables in it are considered equal.

The task of factor analysis reduces to determining the concept, number and nature of the most significant and relatively independent functional characteristics of the phenomenon, its meters or basic parameters: the factors. According to the authors, an important distinctive feature of factor analysis is that it allows one to study a large number of interrelated variables simultaneously, without the assumption of "constancy of all other conditions" that is necessary in a number of other methods of analysis. This is the great advantage of factor analysis as a valuable tool for studying phenomena with a complex diversity and interweaving of relationships.

The analysis relies mainly on observations of the natural variation of variables.

1. When using factor analysis, the set of variables studied in terms of the relationships between them is not chosen arbitrarily: the method itself makes it possible to identify the main factors that have a significant impact in the given area.

2. The analysis does not require preliminary hypotheses; on the contrary, it can itself serve as a method for putting forward hypotheses, as well as a criterion for testing hypotheses based on data obtained by other methods.

3. The analysis does not require a priori guesses as to which variables are independent and which are dependent; it does not exaggerate causality and leaves the question of its measure to further research.

The list of specific tasks solved using factor analysis methods is as follows (according to V.M. Zhukovskaya). Let us name the main ones in the field of socio-economic research:

Zhukovskaya V.M., Muchnik I.B. Factor Analysis in Socio-Economic Research. Moscow: Statistika, 1976, p. 4.

1. Determination of the main aspects of the differences between the objects of observation (minimization of the description).

2. Formulation of hypotheses about the nature of differences between objects.

3. Identification of the structure of relationships between features.

4. Testing hypotheses about the relationship and interchangeability of features.

5. Comparison of structures of feature sets.

6. Partitioning of the objects of observation into typical groups according to their features.

The foregoing indicates the great possibilities of factor analysis in the study of social phenomena, where, as a rule, it is impossible to control (experimentally) the influence of individual factors.

It is quite effective to use the results of factor analysis in multiple regression models.

If one has a pre-formed correlation-regression model of the phenomenon under study in the form of a set of correlated features, factor analysis can turn such a set into a significantly smaller number of features by aggregation. It should be noted that this transformation in no way impairs the quality or completeness of the information about the phenomenon under study. The resulting aggregated features are uncorrelated and are linear combinations of the primary features. From a formal mathematical standpoint, the problem statement here can have an infinite set of solutions. But it must be remembered that, in studying socio-economic phenomena, the obtained aggregated features must have an economically justified interpretation. In other words, whenever the mathematical apparatus is applied, one proceeds first of all from knowledge of the economic essence of the phenomena under study.
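The aggregation described above can be sketched with principal components, one standard way to obtain uncorrelated linear combinations of correlated primary features (a minimal illustration on made-up data; the variable names and the choice of two components are assumptions, not part of the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 enterprises, 5 correlated efficiency indicators.
n = 100
base = rng.normal(size=(n, 2))                  # two latent "factors"
mix = rng.normal(size=(2, 5))                   # made-up loadings
X = base @ mix + 0.1 * rng.normal(size=(n, 5))  # observed, correlated features

# Center the data and take eigenvectors of the sample covariance matrix.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalue order
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]              # keep 2 aggregated features

# Aggregated features: linear combinations of the primary features.
Z = Xc @ components

# They are uncorrelated: the off-diagonal covariance is numerically zero.
Z_cov = Z.T @ Z / (n - 1)
print(np.round(Z_cov, 6))
```

Interpreting the resulting combinations economically remains the analyst's task, exactly as the paragraph above insists.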

Thus, the above allows us to summarize that factor analysis is a specific research method, which is carried out on the basis of an arsenal of methods of mathematical statistics.

Factor analysis first found practical application in psychology. The possibility of reducing a large number of psychological tests to a small number of factors made it possible to explain human intellectual abilities.

In the study of socio-economic phenomena, where it is difficult to isolate the influence of individual variables, factor analysis can be used successfully. Its methods allow one, by means of certain calculations, to "filter out" non-essential features and to continue the research in the direction of deepening it.

The effectiveness of this method is obvious in the study of such problems as, in economics: the specialization and concentration of production, the intensity of farming, workers' family budgets, the construction of various generalizing indicators, etc.

Econometrics

Multivariate statistical analysis


In multivariate statistical analysis, the sample consists of elements of a multidimensional space; hence the name of this branch of econometric methods. Of the many problems of multivariate statistical analysis, let us consider two: dependence recovery and classification.

Linear Predictive Function Estimation

Let's start with the problem of point and confidence estimation of a linear predictive function of one variable.

The initial data are a set of n pairs of numbers (t_k, x_k), k = 1, 2, …, n, where t_k is the independent variable (for example, time) and x_k is the dependent variable (for example, an inflation index, the US dollar exchange rate, monthly output or a retail outlet's daily revenue). The variables are assumed to be related by

x_k = a(t_k − t_avg) + b + e_k, k = 1, 2, …, n,

where a and b are parameters unknown to the statistician and subject to estimation, and the e_k are errors distorting the dependence. The arithmetic mean of the time points

t_avg = (t_1 + t_2 + … + t_n) / n

is introduced into the model to facilitate further calculations.

Usually, the parameters a and b of the linear dependence are estimated using the least squares method. The reconstructed relationship is then used for point and interval prediction.

As is well known, the least squares method was developed by the great German mathematician C. F. Gauss in 1794. According to this method, to compute the best function linearly approximating the dependence of x on t, one should consider the function of two variables

f(a, b) = Σ_k (x_k − a(t_k − t_avg) − b)², k = 1, 2, …, n.


The least squares estimates are those values a* and b* at which the function f(a, b) attains its minimum over all values of the arguments.

To find these estimates, one calculates the partial derivatives of the function f(a, b) with respect to the arguments a and b, equates them to 0, and then finds the estimates from the resulting equations. We have:

∂f/∂a = −2 Σ_k (x_k − a(t_k − t_avg) − b)(t_k − t_avg),
∂f/∂b = −2 Σ_k (x_k − a(t_k − t_avg) − b).

Let us transform the right-hand sides of the relations obtained. Take the common factors 2 and (−1) outside the summation sign, and then consider the terms. Opening the brackets in the first expression, each term splits into three; in the second expression each term is likewise a sum of three. Thus each of the sums splits into three sums. We have:

∂f/∂a = −2 (Σ_k x_k(t_k − t_avg) − a Σ_k (t_k − t_avg)² − b Σ_k (t_k − t_avg)),
∂f/∂b = −2 (Σ_k x_k − a Σ_k (t_k − t_avg) − n b).

We equate the partial derivatives to 0, after which the factor (−2) in the resulting equations can be cancelled. Since

Σ_k (t_k − t_avg) = 0,    (1)

the equations take the form

Σ_k x_k(t_k − t_avg) − a Σ_k (t_k − t_avg)² = 0,    Σ_k x_k − n b = 0.

Therefore, the least squares estimates have the form

a* = Σ_k x_k(t_k − t_avg) / Σ_k (t_k − t_avg)²,    b* = (x_1 + x_2 + … + x_n)/n.    (2)

Due to relation (1), the estimate a* can be written in a more symmetric form:

a* = Σ_k (x_k − b*)(t_k − t_avg) / Σ_k (t_k − t_avg)².

It is not difficult to transform this estimate into the form

a* = (Σ_k t_k x_k − n t_avg b*) / (Σ_k t_k² − n t_avg²).

Therefore, the reconstructed function, which can be used to predict and interpolate, has the form

x*(t) = a*(t − t_avg) + b*.
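The estimates (2) and the predictive function x*(t) can be sketched in a few lines of numpy; the data below are made up for illustration, and t_avg denotes the arithmetic mean of the time points:

```python
import numpy as np

# Made-up observations (t_k, x_k), k = 1, ..., 5.
t = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least squares estimates (2) in the centered model x_k = a(t_k - t_avg) + b + e_k.
t_avg = t.mean()
a_star = np.sum(x * (t - t_avg)) / np.sum((t - t_avg) ** 2)
b_star = x.mean()

def x_star(tt):
    """Reconstructed predictive function x*(t) = a*(t - t_avg) + b*."""
    return a_star * (tt - t_avg) + b_star

print(a_star, b_star, x_star(6.0))
```

Note that b* is simply the arithmetic mean of the x_k, exactly as formula (2) states.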

Note that the use of t_avg in the last formula in no way limits its generality. Compare with the model of the form

x_k = c t_k + d + e_k, k = 1, 2, …, n.

It is clear that

c = a,    d = b − a t_avg.

The parameter estimates are similarly related:

c* = a*,    d* = b* − a* t_avg.

There is no need to invoke any probabilistic model to obtain the parameter estimates and the predictive formula. However, in order to study the errors of the parameter estimates and of the reconstructed function, i.e., to build confidence intervals for a*, b* and x*(t), such a model is needed.

Nonparametric probabilistic model. Let the values of the independent variable t be deterministic, and let the errors e_k, k = 1, 2, …, n, be independent identically distributed random variables with zero mathematical expectation and a variance σ² unknown to the statistician.

In what follows, we will repeatedly use the Central Limit Theorem (CLT) of probability theory for the quantities e_k, k = 1, 2, …, n (with weights); to satisfy its conditions it is necessary to assume, for example, that the errors e_k are bounded or have a finite third absolute moment. However, there is no need to dwell on these intra-mathematical "regularity conditions".

Asymptotic distributions of the parameter estimates. From formula (2) it follows that

b* = b + (e_1 + e_2 + … + e_n)/n.    (5)

According to the CLT, the estimate b* has an asymptotically normal distribution with expectation b and variance σ²/n, the estimation of which is discussed below.

From formulas (2) and (5) it follows that

a* = Σ_k (a(t_k − t_avg) + b + e_k)(t_k − t_avg) / Σ_k (t_k − t_avg)²
   = a + Σ_k e_k(t_k − t_avg) / Σ_k (t_k − t_avg)² + b Σ_k (t_k − t_avg) / Σ_k (t_k − t_avg)².

The last term in the second relation vanishes when summed over k, so it follows from formulas (1) and (2) that

a* = a + Σ_k e_k(t_k − t_avg) / Σ_k (t_k − t_avg)².    (6)

Formula (6) shows that the estimate a* is asymptotically normal with mean a and variance

D a* = σ² / Σ_k (t_k − t_avg)².

Note that asymptotic normality holds when each term in formula (6) is small compared to the entire sum, i.e., when the largest of the weights (t_k − t_avg)² / Σ_j (t_j − t_avg)² is small.
From formulas (5) and (6) and the initial assumptions about the errors, the unbiasedness of the parameter estimates also follows.

The unbiasedness and asymptotic normality of the least squares estimates make it easy to specify asymptotic confidence limits for them (similar to the limits in the previous chapter) and to test statistical hypotheses, for example, that they equal certain values, primarily 0. We leave it to the reader to write out the formulas for calculating the confidence limits and to formulate the rules for testing the mentioned hypotheses.
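One way the reader's exercise might look in code: asymptotic confidence limits built from the asymptotic variances σ²/n for b* and σ²/Σ(t_k − t_avg)² for a*. The residual variance estimate with the n − 2 divisor and the normal quantile 1.96 (for a 95% level) are the usual choices, assumed here rather than taken from the text; the data are made up:

```python
import numpy as np

# Made-up observations.
t = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x = np.array([1.2, 2.9, 5.3, 6.8, 9.1, 11.2, 12.8, 15.3])

n = len(t)
t_avg = t.mean()
s_tt = np.sum((t - t_avg) ** 2)
a_star = np.sum(x * (t - t_avg)) / s_tt
b_star = x.mean()

# Residual variance estimate (n - 2 degrees of freedom: two fitted parameters).
resid = x - a_star * (t - t_avg) - b_star
s2 = np.sum(resid ** 2) / (n - 2)

z = 1.96  # asymptotic normal quantile for a 95% level
a_ci = (a_star - z * np.sqrt(s2 / s_tt), a_star + z * np.sqrt(s2 / s_tt))
b_ci = (b_star - z * np.sqrt(s2 / n), b_star + z * np.sqrt(s2 / n))
print(a_ci, b_ci)
```

A hypothesis such as a = 0 is rejected at the 5% level exactly when 0 falls outside the interval a_ci.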

Asymptotic distribution of the prognostic function. From formulas (5) and (6) it follows that

M x*(t) = a(t − t_avg) + b,

i.e., the estimate of the prognostic function under consideration is unbiased. Therefore

D x*(t) = D a* (t − t_avg)² + D b*.

At the same time, since the errors are mutually independent and D e_k = σ², we have

D x*(t) = σ² ((t − t_avg)² / Σ_k (t_k − t_avg)² + 1/n).

Thus, the estimate x*(t) is asymptotically normal with expectation a(t − t_avg) + b and the variance indicated.
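The forecast standard error implied by the standard variance formula σ²(1/n + (t − t_avg)²/Σ(t_k − t_avg)²) can be sketched as follows (made-up data; the residual variance estimate with the n − 2 divisor is an assumed, conventional choice):

```python
import numpy as np

# Made-up observations.
t = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

n = len(t)
t_avg = t.mean()
s_tt = np.sum((t - t_avg) ** 2)
a_star = np.sum(x * (t - t_avg)) / s_tt
b_star = x.mean()
s2 = np.sum((x - a_star * (t - t_avg) - b_star) ** 2) / (n - 2)

def forecast_se(tt):
    """Standard error of x*(t); it grows as t moves away from t_avg."""
    return np.sqrt(s2 * (1.0 / n + (tt - t_avg) ** 2 / s_tt))

print(forecast_se(3.0), forecast_se(10.0))
```

The widening of the error band away from t_avg is why extrapolation far beyond the observed time points is much less reliable than interpolation.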

MULTIVARIATE STATISTICAL ANALYSIS

A section of mathematical statistics devoted to mathematical methods for constructing optimal plans for the collection, systematization and processing of multidimensional statistical data, aimed at identifying the nature and structure of the relationships between the components of the multidimensional attribute under study and intended for obtaining scientific and practical conclusions. A multidimensional attribute is understood as a p-dimensional vector of indicators (features, variables), among which there may be: quantitative, i.e., measuring on a scale the degree of manifestation of the studied property; ordinal, i.e., allowing one to order the analyzed objects according to the degree of manifestation of the studied property in them; and classificational (nominal), i.e., allowing one to divide the studied set of objects into homogeneous (with respect to the analyzed property) classes that are not amenable to ordering. The results of measuring these indicators

on each of the objects of the studied population form the multidimensional observations, or the initial array of multidimensional data, for conducting MSA. A significant part of MSA serves situations in which the studied multidimensional attribute is interpreted as a multidimensional random variable and, accordingly, the sequence of multidimensional observations (1) as a sample from the general population. In this case, the choice of methods for processing the original statistical data and the analysis of their properties are based on certain assumptions regarding the nature of the multidimensional (joint) probability distribution law.

Multivariate statistical analysis of multivariate distributions and their main characteristics covers only situations in which the processed observations (1) are of a probabilistic nature, i.e., are interpreted as a sample from the corresponding general population. The main tasks of this subsection include: statistical estimation of the studied multivariate distributions and of their main numerical characteristics and parameters; study of the properties of the statistical estimates used; and study of the probability distributions of a number of statistics used to construct statistical criteria for testing various hypotheses about the probabilistic nature of the analyzed multivariate data. The main results relate to the particular case when the attribute under study is subject to the multivariate normal distribution law, whose density function is given by the relation

f(x) = (2π)^(−p/2) |Σ|^(−1/2) exp{−(x − μ)' Σ^(−1) (x − μ)/2},    (2)

where μ is the vector of mathematical expectations of the components of the random vector, and Σ is the covariance matrix of the random vector, i.e., the matrix of covariances of its components (the nondegenerate case |Σ| ≠ 0 is considered; otherwise, i.e., when the rank of Σ is less than p, all the results remain valid, but as applied to a subspace of lower dimension, in which the random vector under study turns out to be concentrated).

Thus, if (1) is a sequence of independent observations forming a random sample from this distribution, then the maximum likelihood estimates of the parameters μ and Σ participating in (2) are, respectively, the statistics

x̄ = (x_1 + x_2 + … + x_n)/n,    (3)

Σ̂ = (1/n) Σ_i (x_i − x̄)(x_i − x̄)',    (4)
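Numerically, the estimates of the form (3) and (4) are the sample mean vector and the covariance matrix with the 1/n divisor; the bias-corrected version uses 1/(n − 1). A sketch on simulated normal data (the specific mean, covariance and sample size are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated sample of n observations of a 3-dimensional normal vector.
n, p = 200, 3
X = rng.multivariate_normal(mean=[1.0, 2.0, 3.0], cov=np.eye(3), size=n)

mu_hat = X.mean(axis=0)               # estimate of the mean vector, form (3)
Xc = X - mu_hat
sigma_mle = Xc.T @ Xc / n             # maximum likelihood covariance, form (4)
sigma_unbiased = Xc.T @ Xc / (n - 1)  # covariance corrected "for unbiasedness"
print(mu_hat)
```

The two covariance estimates differ only by the scalar factor n/(n − 1), which is negligible for large n.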

where the random vector x̄ obeys the p-dimensional normal law with mean μ and covariance matrix Σ/n and does not depend on Σ̂, and the joint distribution of the elements of the matrix nΣ̂ is described by the so-called Wishart distribution.

Within the framework of the same scheme, the distributions and moments of such sample characteristics of a multidimensional random variable as the coefficients of pair, partial and multiple correlation, the generalized variance (i.e., the determinant of the sample covariance matrix), the generalized Hotelling T² statistics, etc., have been studied. In particular, if we define as the sample covariance matrix the estimate corrected "for unbiasedness", namely

S = n Σ̂ / (n − 1),    (5)

then as n → ∞ the distribution of the one-sample statistic T² = n(x̄ − μ)' S^(−1) (x̄ − μ) tends to the chi-square distribution with p degrees of freedom, and the random variables

((n − p)/(p(n − 1))) T²    and    ((n_1 + n_2 − p − 1)/(p(n_1 + n_2 − 2))) T²₂    (6)-(7)

obey F-distributions with the numbers of degrees of freedom (p, n − p) and (p, n_1 + n_2 − p − 1), respectively. In relation (7), n_1 and n_2 are the volumes of two independent samples of the form (1) extracted from the same general population, x̄^(i) and Σ̂^(i) are estimates of the form (3) and (4)-(5) built from the i-th sample, and

S = (n_1 Σ̂^(1) + n_2 Σ̂^(2)) / (n_1 + n_2 − 2)

is the total sample covariance matrix built from the estimates Σ̂^(1) and Σ̂^(2).

Multivariate statistical analysis of the nature and structure of the interrelationships of the components of the multidimensional attribute under study combines the concepts and results that serve such methods and models of MSA as multiple regression, multivariate analysis of variance and covariance analysis, factor analysis and principal component analysis, and canonical correlation analysis. The results that make up the content of this subsection can be roughly divided into two main types.

1) Construction of the best (in a certain sense) statistical estimates for the parameters of the mentioned models and analysis of their properties (accuracy and, in the probabilistic setting, the laws of their distribution, confidence regions, etc.). Thus, let the multidimensional attribute under study be interpreted as a random vector subject to the p-dimensional normal distribution, and let it be divided into two column subvectors of dimensions q and p − q, respectively. This also determines the corresponding partition of the vector of mathematical expectations and of the theoretical and sample covariance matrices, namely:

Then the conditional distribution of the first subvector (given that the second subvector has taken a fixed value) will also be normal. In this case, the maximum likelihood estimates for the matrices of regression coefficients and covariances of this classical multivariate multiple regression model

there will be mutually independent statistics, respectively

here the distribution of the estimate of the matrix of regression coefficients is subject to the normal law, and that of the covariance estimate to the Wishart law with the corresponding parameters (the elements of its covariance matrix are expressed in terms of the elements of the theoretical covariance matrix).

The main results on the construction of parameter estimates and the study of their properties in the models of factor analysis, principal components and canonical correlations relate to the analysis of the probabilistic-statistical properties of the eigenvalues and eigenvectors of various sample covariance matrices.

In schemes that do not fit into the framework of the classical normal model, and still more outside the framework of any probabilistic model, the main results relate to the construction of algorithms (and the study of their properties) for computing parameter estimates that are best from the point of view of some exogenously given quality (or adequacy) functional of the model.

2) Construction of statistical criteria for testing various hypotheses about the structure of the relationships under study. Within the framework of the multivariate normal model (sequences of observations of the form (1) are interpreted as random samples from the corresponding multivariate normal general populations), statistical criteria have been constructed for testing, for example, the following hypotheses.

I. The hypothesis that the vector of mathematical expectations of the studied indicators equals a given specific vector; it is verified using the Hotelling T² statistic, with substitution of this vector into formula (6).

II. The hypothesis that the vectors of mathematical expectations in two populations (with identical but unknown covariance matrices), represented by two samples, are equal; it is verified using the two-sample Hotelling statistic.

III. The hypothesis that the vectors of mathematical expectations in several general populations (with identical but unknown covariance matrices), represented by their samples, are all equal; it is verified with a statistic in which x_ij is the i-th p-dimensional observation in the sample of volume n_j representing the j-th general population, and the estimates of the form (3) are constructed, respectively, separately from each of the samples and from the combined sample of total volume n_1 + n_2 + … + n_k.

IV. The hypothesis of the equivalence of several normal populations represented by their samples; it is verified using a statistic in which the estimate of the form (4) is built separately from the observations of the j-th sample, j = 1, 2, …, k.

V. The hypothesis of the mutual independence of the column subvectors x(1), …, x(m) of given dimensions into which the original p-dimensional vector of the studied indicators is partitioned; it is verified using a statistic in which the sample covariance matrices of the form (4) are computed for the whole vector x and for each of its subvectors x(i), respectively.
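As a concrete instance of hypothesis I above, the one-sample Hotelling T² statistic and its F-scaling can be sketched as follows (simulated data; the sample size and the null vector mu0 are made up):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated sample: n observations of a p-dimensional normal vector whose
# true mean actually equals the hypothesized vector mu0.
n, p = 50, 3
mu0 = np.zeros(p)
X = rng.multivariate_normal(mean=mu0, cov=np.eye(p), size=n)

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)           # covariance corrected "for unbiasedness"
d = xbar - mu0
T2 = n * d @ np.linalg.solve(S, d)    # Hotelling T^2 statistic

# Scaled, T2 follows an F-distribution with (p, n - p) degrees of freedom.
F = (n - p) / (p * (n - 1)) * T2
print(T2, F)
```

The hypothesis is rejected when F exceeds the chosen upper quantile of the F(p, n − p) distribution.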

Multivariate statistical analysis of the geometric structure of the studied set of multivariate observations combines the concepts and results of such models and schemes as discriminant analysis, mixtures of probability distributions, cluster analysis and taxonomy, and multivariate scaling. Central to all these schemes is the concept of distance (a measure of proximity or similarity) between the analyzed elements. The analyzed elements can be either the surveyed objects themselves, on each of which the values of the indicators are recorded (the geometric image of the i-th surveyed object is then a point in the corresponding p-dimensional space), or the indicators themselves (the geometric image of the l-th indicator is then a point in the corresponding n-dimensional space).

The methods and results of discriminant analysis are aimed at the following task. It is known that a certain number of populations exist, and the researcher has one sample from each of them ("training samples"). It is required to construct, from the available training samples, the best (in a certain sense) classifying rule, which makes it possible to assign a certain new element (observation) to its general population in a situation where the researcher does not know in advance to which of the populations this element belongs. A classifying rule is usually understood as a sequence of actions: computing a scalar function of the indicators under study, according to whose values the decision is made to assign the element to one of the classes (construction of a discriminant function); ordering the indicators themselves according to the degree of their informativeness from the point of view of the correct assignment of elements to classes; and computing the corresponding probabilities of misclassification.
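A minimal sketch of such a classifying rule is the two-class Fisher linear discriminant, built from two made-up training samples (the class means, pooled-covariance estimator and midpoint threshold are conventional choices assumed here, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(3)

# Two "training samples", one from each of two well-separated populations.
n = 100
X1 = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=n)  # class 1
X2 = rng.multivariate_normal([3.0, 3.0], np.eye(2), size=n)  # class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# Pooled covariance matrix estimated from both training samples.
S = ((X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)) / (2 * n - 2)
w = np.linalg.solve(S, m1 - m2)   # direction of the discriminant function
c = w @ (m1 + m2) / 2             # threshold at the midpoint of the means

def classify(x):
    """Scalar discriminant function w'x compared with the threshold c."""
    return 1 if w @ x > c else 2

print(classify(np.array([0.1, -0.2])), classify(np.array([3.2, 2.9])))
```

The scalar function w'x is exactly the discriminant function described above: a new observation is assigned to the class on whose side of the threshold its value falls.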

The problem of analyzing mixtures of probability distributions most often (but not always) also arises in connection with the study of the "geometric structure" of the population under consideration. Here the concept of the r-th homogeneous class is formalized by means of a general population described by some (usually unimodal) distribution law, so that the distribution of the general population from which the sample (1) is extracted is described by a mixture of such distributions, where p_r is the a priori probability (the share of elements) of the r-th class in the general population. The task is to obtain "good" statistical estimates (from the sample) of the unknown parameters and, sometimes, of the number of classes. This, in particular, makes it possible to reduce the problem of classifying the elements to a discriminant analysis scheme, even though no training samples were available.

The methods and results of cluster analysis (classification, taxonomy, "unsupervised" pattern recognition) are aimed at solving the following problem. The geometric structure of the analyzed set of elements is specified either by the coordinates of the corresponding points or by a set of geometric characteristics of their relative positions, for example by the matrix of pairwise distances. It is required to partition the set of elements under study into a relatively small number (known in advance or not) of classes so that the elements of one class lie at a small distance from each other, while the different classes are, as far as possible, sufficiently mutually distant and are not themselves divided into parts that are far from one another.
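One common scheme meeting this description is k-means, sketched below on made-up two-dimensional data with k = 2 (the data, the number of clusters and the iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

# Two well-separated made-up groups of points.
A = rng.normal(loc=0.0, scale=0.3, size=(30, 2))
B = rng.normal(loc=5.0, scale=0.3, size=(30, 2))
X = np.vstack([A, B])

k = 2
centers = X[rng.choice(len(X), size=k, replace=False)]  # init from data points
for _ in range(20):
    # Assign each element to the nearest center, then recompute the centers.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])

print(np.round(centers, 2))
```

Each iteration decreases the within-class sum of squared distances, which is precisely the "small distances inside a class" requirement stated above.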

The task of multidimensional scaling refers to the situation when the set of elements under study is specified by a matrix of pairwise distances, and consists in assigning to each of the elements a given number (p) of coordinates in such a way that the structure of the pairwise distances between the elements, measured with these auxiliary coordinates, differs on average as little as possible from the given one. It should be noted that the main results and methods of cluster analysis and multidimensional scaling are usually developed without any assumptions about the probabilistic nature of the initial data.
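The classical (Torgerson) variant of this task can be sketched as a double centering of the squared distance matrix followed by an eigendecomposition; here the pairwise distances are generated from made-up planar points, so p = 2 coordinates recover them exactly:

```python
import numpy as np

# Made-up configuration; in practice only the distance matrix D is given.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
B = -0.5 * J @ (D ** 2) @ J           # double-centered squared distances

# Coordinates from the top-2 eigenpairs of B.
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1]
coords = eigvecs[:, order[:2]] * np.sqrt(np.maximum(eigvals[order[:2]], 0))

# The recovered configuration reproduces the pairwise distances
# (up to rotation and reflection).
D_rec = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
print(np.round(np.abs(D - D_rec).max(), 8))
```

For genuinely non-Euclidean distance data the reproduction is only approximate, which is exactly the "least different on average" criterion stated above.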

The applied purpose of multivariate statistical analysis consists mainly in serving the following three problems.

The problem of the statistical study of dependencies between the analyzed indicators. Assuming that the studied set of statistically recorded indicators x is divided, on the basis of the meaningful sense of these indicators and the final aims of the study, into a q-dimensional subvector of dependent (predicted) variables and a (p − q)-dimensional subvector of independent (predictor) variables, one can say that the problem is to determine, from the sample (1), a q-dimensional vector function from the class of admissible solutions F that gives the best, in a certain sense, approximation of the behavior of the subvector of dependent indicators. Depending on the specific form of the approximation-quality functional and on the nature of the analyzed indicators, one arrives at one or another scheme of multiple regression, dispersion, covariance or confluent analysis.

The problem of classifying elements (objects or indicators), in a general (non-strict) formulation, is to partition the entire analyzed set of elements, statistically presented in the form of a matrix, into a relatively small number of homogeneous (in a certain sense) groups. Depending on the nature of the a priori information and on the specific form of the functional that sets the classification-quality criterion, one arrives at one or another scheme of discriminant analysis, cluster analysis (taxonomy, "unsupervised" pattern recognition), or the splitting of mixtures of distributions.

The problem of reducing the dimension of the factor space under study and selecting the most informative indicators is to determine a set of a relatively small number m of indicators, found in the class of admissible transformations of the original indicators, on which the upper bound of a certain exogenously given measure of informativeness of an m-dimensional system of features is attained. Specifying the functional that sets the measure of auto-informativeness (i.e., aimed at the maximal preservation of the information contained in the statistical array (1) relative to the original features themselves) leads, in particular, to various schemes of factor analysis and principal components and to methods of extreme grouping of features. Functionals that set a measure of external informativeness, i.e., aimed at extracting from (1) the maximum information concerning certain other indicators or phenomena not contained directly in x, lead to various methods of selecting the most informative indicators in the statistical schemes of dependency study and discriminant analysis.

The main mathematical tools of MSA are special methods of the theory of systems of linear equations and of matrix theory (methods for solving the simple and generalized eigenvalue-eigenvector problems; simple inversion and pseudo-inversion of matrices; procedures for diagonalizing matrices, etc.) and certain optimization algorithms (methods of coordinate-wise descent, conjugate gradients, branch and bound, various versions of random search and stochastic approximation, etc.).

References: Anderson T., An Introduction to Multivariate Statistical Analysis, trans. from English, Moscow, 1963; Kendall M. G., Stuart A., Multivariate Statistical Analysis and Time Series, trans. from English, Moscow, 1976; Bolshev L. N., Bull. Int. Stat. Inst., 1969, No. 43, pp. 425-441; Wishart J., Biometrika, 1928, v. 20A, pp. 32-52; Hotelling H., Ann. Math. Stat., 1931, v. 2, pp. 360-378; Kruskal J. B., Psychometrika, 1964, v. 29, pp. 1-27; Ayvazyan S. A., Bezhaeva Z. I., Staroverov O. V., Classification of Multidimensional Observations, Moscow, 1974.

S. A. Ayvazyan.


Mathematical encyclopedia. - M.: Soviet Encyclopedia. I. M. Vinogradov. 1977-1985.



Chapter 1. Multiple Regression Analysis

Chapter 2. Cluster Analysis

Chapter 3. Factor Analysis

Chapter 4. Discriminant Analysis

Bibliography

Introduction

Initial information in socio-economic studies is most often presented as a set of objects, each characterized by a number of features (indicators). Since the number of such objects and features can reach tens or hundreds, and visual analysis of these data is ineffective, problems arise of reducing and concentrating the initial data and of revealing their structure and interrelationships through the construction of generalized characteristics of the set of features and the set of objects. Such problems can be solved by the methods of multivariate statistical analysis.

Multivariate statistical analysis is a section of mathematical statistics devoted to mathematical methods aimed at identifying the nature and structure of relationships between the components of a multivariate feature under study and intended to obtain scientific and practical conclusions.

The main attention in multivariate statistical analysis is paid to mathematical methods for constructing optimal plans for the collection, systematization, and processing of such data.

The initial array of multidimensional data for multivariate analysis usually consists of the results of measuring the components of a multidimensional attribute for each object of the studied population, i.e., a sequence of multivariate observations. The multidimensional attribute is most often interpreted as a multivariate random variable, and the sequence of observations as a sample from the general population. In this case, the choice of method for processing the initial statistical data is made on the basis of certain assumptions about the nature of the distribution law of the studied multidimensional attribute.

1. Multivariate statistical analysis of multivariate distributions and their main characteristics covers situations where the processed observations are of a probabilistic nature, i.e. interpreted as a sample from the corresponding general population. The main tasks of this subsection include: statistical estimation of the studied multivariate distributions and their main parameters; study of the properties of the statistical estimates used; study of probability distributions for a number of statistics, which are used to build statistical criteria for testing various hypotheses about the probabilistic nature of the analyzed multivariate data.
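The basic estimates mentioned here, the mean vector and the covariance matrix of a multivariate distribution, can be sketched in numpy as follows (the data are simulated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated sample: n = 500 observations of a p = 3 dimensional attribute.
n, p = 500, 3
X = rng.normal(size=(n, p))

# Statistical estimates of the main parameters of the distribution.
mean_vector = X.mean(axis=0)                 # estimate of the mean vector
cov_matrix = np.cov(X, rowvar=False)         # unbiased covariance estimate
corr_matrix = np.corrcoef(X, rowvar=False)   # pairwise correlations
```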

2. Multivariate statistical analysis of the nature and structure of the interrelationships of the components of the multidimensional trait under study combines the concepts and results inherent in such methods and models as regression analysis, analysis of variance, analysis of covariance, factor analysis, etc. Methods belonging to this group include both algorithms based on the assumption of the probabilistic nature of the data, and methods that do not fit into the framework of any probabilistic model (the latter are often referred to as data analysis methods).

3. Multidimensional statistical analysis of the geometric structure of the studied set of multivariate observations combines the concepts and results inherent in such models and methods as discriminant analysis, cluster analysis, and multidimensional scaling. Central to these models is the concept of distance, or a measure of proximity, between the analyzed elements regarded as points of some space. In this case, both objects (as points in the feature space) and features (as points in the object space) can be analyzed.
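The notion of distance between objects as points of the feature space can be illustrated with the two most common proximity measures (the feature values are made up):

```python
import numpy as np

# Two objects described by the same three features (illustrative values).
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# Euclidean distance: the straight-line distance in feature space.
euclidean = np.sqrt(np.sum((x - y) ** 2))   # here sqrt(9 + 16 + 0) = 5

# City-block (Manhattan) distance: the sum of coordinate differences.
manhattan = np.sum(np.abs(x - y))           # here 3 + 4 + 0 = 7
```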

The applied value of multivariate statistical analysis consists mainly in solving the following three problems:

    the task of statistical study of the dependencies between the indicators under consideration;

    the task of classifying elements (objects or features);

    the task of reducing the dimension of the feature space under consideration and selecting the most informative features.

Multiple regression analysis is designed to build a model that allows one to obtain, from the values of the independent variables, estimates of the values of the dependent variable.
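In matrix form this amounts to solving a least-squares problem; a minimal numpy sketch with invented data:

```python
import numpy as np

# Invented data: the dependent variable y and two independent variables.
X = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 2.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([3.1, 4.9, 7.2, 9.8, 13.1])

# Add an intercept column and solve min ||A b - y||^2 by least squares.
A = np.column_stack([np.ones(len(y)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Estimates of the dependent variable from the independent variables.
y_hat = A @ coef
```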

Logistic regression is used to solve the classification problem. It is a type of multiple regression whose purpose is to analyze the relationship between several independent variables and a categorical (most often binary) dependent variable.
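A bare-bones version of such a model can be fitted by gradient descent on the log-loss; the one-variable sample below is invented for illustration:

```python
import numpy as np

# Invented sample: one independent variable, binary dependent variable.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0, 0, 0, 1, 1, 1])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit the coefficients w, b by plain gradient descent on the log-loss.
w, b = 0.0, 0.0
learning_rate = 0.5
for _ in range(2000):
    p = sigmoid(w * x + b)            # estimated probabilities of class 1
    w -= learning_rate * np.mean((p - y) * x)
    b -= learning_rate * np.mean(p - y)

# Classify by thresholding the estimated probability at 0.5.
predictions = (sigmoid(w * x + b) >= 0.5).astype(int)
```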

Factor analysis deals with the determination of a relatively small number of hidden (latent) factors, the variability of which explains the variability of all observed indicators. Factor analysis is aimed at reducing the dimension of the problem under consideration.
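The closely related method of principal components can be sketched via the eigen-decomposition of the correlation matrix; the data here are simulated so that a single latent factor drives all four observed indicators:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: 4 observed indicators, all driven by one latent factor.
n = 300
latent = rng.normal(size=n)
X = np.column_stack([latent + 0.3 * rng.normal(size=n) for _ in range(4)])

# Correlation matrix of the observed indicators.
R = np.corrcoef(X, rowvar=False)

# Eigenvalues of R measure the variance captured by each component.
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]
explained = eigenvalues / eigenvalues.sum()
# Because one latent factor generates the data, the first component
# accounts for the bulk of the total variation.
```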

Cluster and discriminant analysis are designed to divide collections of objects into classes, each of which should contain objects that are homogeneous or close in a certain sense. In cluster analysis it is not known in advance how many groups of objects there will be or what size they will be. Discriminant analysis assigns objects to classes that are known in advance.
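The cluster-analysis side of this distinction can be illustrated with a bare-bones k-means procedure (the objects and the choice k = 2 are made up for the example):

```python
import numpy as np

# Invented 2-D objects forming two visibly separated groups.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# k-means: the number of classes k is fixed in advance, but which
# objects belong to which class is not known beforehand.
k = 2
centers = X[[0, 3]].copy()   # crude initialization from the data itself
for _ in range(10):
    # Assign each object to the nearest center (Euclidean distance).
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Recompute each center as the mean of its class.
    for j in range(k):
        centers[j] = X[labels == j].mean(axis=0)
```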

Chapter 1. Multiple Regression Analysis

Assignment: a study of the housing market in Orel (the Sovetsky and Severny districts).

The table shows data on the price of apartments in Orel and on various factors that determine it:

    total area;

    kitchen area;

    living space;

    house type;

    number of rooms (Fig. 1).

Fig. 1. Initial data

In the column "Region" the following designations are used:

3 - Sovetsky (elite, belongs to the central districts);

4 - Severny (peripheral).

In the column "Type of house":

1 - brick;

0 - panel.

Required:

    Analyze the relationship of all factors with the "Price" indicator and among themselves. Select the factors most suitable for building a regression model;

    Construct a dummy variable that reflects the belonging of the apartment to the central and peripheral areas of the city;

    Build a linear regression model for all factors, including a dummy variable. Explain the economic meaning of the parameters of the equation. Evaluate the quality of the model, the statistical significance of the equation and its parameters;

    Rank the factors (except for the dummy variable) by their degree of influence on the "Price" indicator;

    Build a linear regression model for the most influential factors, leaving the dummy variable in the equation. Evaluate the quality and statistical significance of the equation and its parameters;

    Justify the expediency or inexpediency of including a dummy variable in the equation of paragraphs 3 and 5;

    Construct interval estimates of the parameters of the equation with a probability of 95%;

    Determine how much an apartment with a total area of 74.5 m² will cost in the elite and in the peripheral district.
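For the interval-estimation item, an interval estimate of a single coefficient has the form b ± t·SE. The sketch below uses the normal approximation t ≈ 1.96 for 95% confidence; the standard error value is a made-up placeholder, since the real one comes only from the regression output:

```python
# Hypothetical output for one regression coefficient; the standard
# error here is an illustrative placeholder, not a value from the text.
beta_hat = 35.788
std_error = 4.21

# 95% critical value; with a large sample Student's t is close to 1.96.
t_crit = 1.96

lower = beta_hat - t_crit * std_error
upper = beta_hat + t_crit * std_error
```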

Solution:

    After analyzing the relationships of all the factors with the "Price" indicator and with each other, the factors most suitable for building the regression model were selected by the forward inclusion method:

a) total area;

b) number of rooms.

Variables included/excluded (a)

Model 1. Variable included: Total area. Method: inclusion (criterion: probability of F-to-include <= .050).

Model 2. Variable included: Number of rooms. Method: inclusion (criterion: probability of F-to-include <= .050).

a. Dependent variable: Price

    The variable X4 "Region" serves as a dummy variable, since it takes only two values: 3 for the central district ("Sovetsky") and 4 for the peripheral district ("Severny").

    Let's build a linear regression model for all factors (including the dummy variable X4).

The resulting model:

Y = 348.349 + 35.788 X1 - 217.075 X4 + 305.687 X7

Evaluation of the quality of the model.

Coefficient of determination R² = 0.807

It shows the proportion of the variation of the resulting attribute that is due to the influence of the factors under study. Consequently, about 81% of the variation of the dependent variable is accounted for by the factors included in the model.

Multiple correlation coefficient R = 0.898

It shows the closeness of the relationship between the dependent variable Y and all the explanatory factors included in the model.

Standard error = 126.477
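Two of the reported figures can be checked directly: the multiple correlation coefficient is the square root of R², and the fitted equation yields a point forecast for given factor values. The apartment parameters below, including the dummy coding and the number of rooms, are illustrative assumptions:

```python
import math

# The fitted equation from the text:
# Y = 348.349 + 35.788*X1 - 217.075*X4 + 305.687*X7
def predict(total_area, region_dummy, rooms):
    return (348.349 + 35.788 * total_area
            - 217.075 * region_dummy + 305.687 * rooms)

# Hypothetical apartment: 74.5 m2, dummy = 1, 3 rooms (assumed values).
price = predict(74.5, 1, 3)

# R is the square root of the coefficient of determination.
r = math.sqrt(0.807)   # about 0.898, matching the reported R
```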

Social and economic objects are, as a rule, characterized by a rather large number of parameters forming multidimensional vectors, and in economic and social studies particular importance attaches to the problems of studying the interrelationships between the components of these vectors, relationships that must be identified on the basis of a limited number of multidimensional observations.

Multivariate statistical analysis is a section of mathematical statistics that studies methods for the collection, systematization, and processing of multivariate statistical data, with the aim of identifying the nature and structure of the relationships between the components of the studied multidimensional attribute and of drawing practical conclusions.

Note that data collection methods may vary. So, if the world economy is being studied, it is natural to take countries as the objects on which the values of the vector X are observed; if a national economic system is being studied, it is natural to observe the values of the vector X in the same country (the one of interest to the researcher) at different points in time.

Statistical methods such as multiple correlation and regression analysis are traditionally studied in courses on probability theory and mathematical statistics; the applied aspects of regression analysis are the subject of the discipline "Econometrics".

This manual is devoted to other methods of studying multivariate general populations based on statistical data.

Methods for reducing the dimension of a multidimensional space allow, without significant loss of information, to move from the original system of a large number of observed interrelated factors to a system of a significantly smaller number of hidden (unobservable) factors that determine the variation of the initial features. The first chapter describes the methods of component and factor analysis, which can be used to identify objectively existing, but not directly observable patterns using principal components or factors.

Multidimensional classification methods are designed to divide collections of objects (characterized by a large number of features) into classes, each of which should include objects that are homogeneous or similar in a certain sense. Such a classification, based on statistical data on the feature values of the objects, can be carried out by the methods of cluster and discriminant analysis discussed in the second chapter (multivariate statistical analysis using STATISTICA).

The development of computer technology and software promotes the widespread introduction of multivariate statistical methods into practice. Application packages with a convenient user interface, such as SPSS, Statistica, SAS, and others, remove the main difficulties of applying these methods: the complexity of the mathematical apparatus, which rests on linear algebra, probability theory, and mathematical statistics, and the cumbersomeness of the calculations.

However, the use of programs without understanding the mathematical essence of the algorithms used contributes to the development of the researcher's illusion of the simplicity of using multivariate statistical methods, which can lead to incorrect or unreasonable results. Significant practical results can be obtained only on the basis of professional knowledge in the subject area, supported by the knowledge of mathematical methods and application packages in which these methods are implemented.

Therefore, for each of the methods considered in this book, basic theoretical information is given, including algorithms; the implementation of these methods and algorithms in application packages is discussed. The methods under consideration are illustrated by examples of their practical application in economics using the SPSS package.

The manual is based on the experience of teaching the course "Multivariate Statistical Methods" to students of the State University of Management. For a more detailed study of the methods of applied multivariate statistical analysis, the recommended literature may be consulted.

It is assumed that the reader is well acquainted with courses in linear algebra, probability theory, and mathematical statistics.