Basic Introduction
Multivariate analysis is a family of statistical methods. Univariate analysis is the most basic form, and multivariate analysis is an extension of it.
The statistical analysis of data in which several variables (also called factors or indicators) are present at the same time is an important branch of statistics and a development of univariate statistics.
Multivariate statistical analysis originated in medicine and psychology.
Main content
In the social and behavioral sciences, the spread of personal computers has made complex multivariate statistical methods far more widely used, and the opportunities to analyze data with them have increased accordingly. In recent years in particular, the number of graduate students at universities has grown every year, and because they need to write dissertations, the ability to use multivariate statistical methods and statistical software packages has become indispensable.
Chapter 1: Multiple regression
Chapter 2: Canonical correlation analysis
Chapter 3: Discriminant analysis
Chapter 4: Hypothesis tests for means
Chapter 5: Multivariate analysis of variance
Chapter 6: Principal component analysis
Chapter 7: Factor analysis
Chapter 8: Cluster analysis
Chapter 9: Multidimensional scaling
Chapter 10: Structural equation modeling
Chapter 11: Hierarchical linear models
Multivariate statistical analysis (also called multivariable statistical analysis)
Consider, for example, a hypertension survey of 630 cooks. Besides blood pressure, the items examined included age, sex, weight, body fat and others, 15 items (variables) in total. A univariate analysis of the relationship between blood pressure and overweight usually arranges the data in a table. According to Table 1, the prevalence of hypertension among overweight cooks is more than double that among cooks who are not overweight. But if the data are split into two groups, those with excess body fat and those without, and the relationship between overweight and hypertension is examined within each group, no clear link between overweight and the prevalence of hypertension can be found. In other words, univariate analysis ignores the influence of the other factors (body fat and age in this example). When several variables are objectively present and affect one another, a simple univariate analysis is unreasonable. Multivariate statistical analysis takes the intrinsic links and interactions among the variables into account.
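The effect of splitting the data into groups can be illustrated with a small sketch. The counts below are hypothetical, chosen only to reproduce the pattern described above. They are not the survey's data, and the helper name `prevalence` is likewise an assumption made for illustration.

```python
# Hypothetical counts (NOT the survey data):
# (body-fat group, overweight?) -> (hypertension cases, people examined)
counts = {
    ("fat", True): (32, 80),
    ("fat", False): (8, 20),
    ("not fat", True): (2, 20),
    ("not fat", False): (18, 180),
}


def prevalence(overweight, group=None):
    """Hypertension prevalence among overweight (or non-overweight) people,
    pooled over both body-fat groups unless a group is given."""
    cells = [
        value
        for (g, ow), value in counts.items()
        if ow == overweight and (group is None or g == group)
    ]
    cases = sum(c for c, _ in cells)
    people = sum(n for _, n in cells)
    return cases / people


# Pooled (univariate) comparison: overweight appears to more than double the risk.
crude_ratio = prevalence(True) / prevalence(False)
print(f"pooled prevalence ratio: {crude_ratio:.2f}")

# Within each body-fat group the difference disappears.
for group in ("fat", "not fat"):
    print(group, prevalence(True, group), prevalence(False, group))
```

In this constructed example the pooled difference arises only because overweight people are concentrated in the high body-fat group, which is the kind of hidden influence a univariate table cannot show.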
The theoretical tools of multivariate statistics are probability theory and matrix mathematics. For practical applications, however, a good command of a computer and a software package, together with some elementary knowledge of multivariate statistics, is enough to solve real problems. Multivariate statistics covers a great deal of material. From a practical point of view it has six major branches: regression analysis, discriminant analysis, factor analysis, principal component analysis, cluster analysis and survival analysis.
Regression analysis
Regression analysis applies when several variables x1, x2, ..., xm (called regressors or independent variables) influence an index y (called the dependent variable) in a statistically regular way. Its first task is to find how the regressors influence the index y (the regression itself). Its second task is to find which of a large number of regressors actually affect y (often called factor screening or variable selection). Its third task (also known as correlation analysis) is to measure how strongly each regressor is correlated with y after the influence of the other variables has been fixed (or eliminated); this measure is called the partial correlation coefficient. The three tasks are often interrelated and can be carried out at the same time.
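As a minimal sketch of the third task, the code below computes a first-order partial correlation, that is, the correlation between x and y after the linear influence of a single third variable z is removed, using the standard formula based on the three pairwise correlations. The data and the function names `pearson` and `partial_corr` are hypothetical; controlling for several variables at once needs a more general method (for example, correlating regression residuals).

```python
import math


def pearson(u, v):
    """Ordinary (Pearson) correlation coefficient of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y after the linear influence of z is removed."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Hypothetical data in which z drives both x and y.
z = [1, 2, 3, 4, 5, 6]
x = [2, 3, 5, 4, 6, 7]
y = [1, 3, 2, 5, 4, 6]

r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
print("raw correlation of x and y:   ", round(r_xy, 3))
print("partial correlation given z:  ", round(partial_corr(r_xy, r_xz, r_yz), 3))
```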
Two kinds of statistical relationship between the regressors x1, x2, ..., xm and the dependent variable y are most common: linear models and nonlinear models. A linear model assumes that the main part of y (written ŷ here) is a linear function of x1, x2, ..., xm:
ŷ = b0 + b1x1 + b2x2 + ... + bmxm, with y = ŷ + ε
where b0, b1, b2, ..., bm are unknown constants to be estimated from the sample, and ε is the error that remains when ŷ is substituted for y. This is the most commonly used model, known as multiple linear regression. There are many methods for estimating the unknown constants of a linear regression model from the sample. The classical one is the method of least squares: its theory is the most complete, and it is best suited to cases where the correlations among the regressors are weak. Other methods for estimating b0, b1, b2, ..., bm include ridge regression, characteristic root regression and principal components regression, which are commonly used when the regressors are strongly correlated.
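The least squares estimate of b0, b1, ..., bm can be sketched in plain Python by solving the normal equations. This is an illustrative sketch on hypothetical data, not the procedure of any particular software package. The function names and the optional `ridge` argument are assumptions: a positive `ridge` value adds a penalty to the slope coefficients, which is the basic idea behind ridge regression.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: regressors may be collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def fit_linear(rows, y, ridge=0.0):
    """Return [b0, b1, ..., bm] minimising the squared error.

    With ridge > 0 a penalty ridge * (b1^2 + ... + bm^2) is added;
    the intercept b0 is not penalised.
    """
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    for i in range(1, p):
        xtx[i][i] += ridge
    return solve_linear_system(xtx, xty)


# Hypothetical, noise-free data generated from y = 1 + 2*x1 + 3*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

print("least squares:", [round(b, 6) for b in fit_linear(rows, y)])
print("ridge = 1.0:  ", [round(b, 6) for b in fit_linear(rows, y, ridge=1.0)])
```

Because the example data contain no noise, plain least squares recovers the generating constants, while the ridge variant trades some bias for stability.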
In a nonlinear regression model, the main part of y (written ŷ here) is a nonlinear function of x1, x2, ..., xm:
ŷ = f(x1, x2, ..., xm; α1, α2, ...)
where the form of f is known and the unknown constants α1, α2, ... are estimated from the sample. In medicine the most common nonlinear regression is logistic regression, which is widely used in research on disease control and on growth and development.
In the hypertension survey of cooks mentioned above, a linear model was used, the unknown constants were estimated by least squares, and the regressors were then screened. Seven of the 15 variables were found to have a significant influence on the cooks' diastolic blood pressure. Ranked by partial correlation coefficient, they are: age (0.297), degree of body fat (0.253), history of nephritis (0.162), sex (0.117), job category (0.081), family history of hypertension (0.061) and degree of preference for salty food (0.052). Judging by the size of the coefficients, body fat and age have roughly the same effect on diastolic blood pressure. Job category, family history and preference for salty food also affect diastolic blood pressure, but only slightly.
Discriminant analysis
Discriminant analysis assigns a sample to a category on the basis of several indicators measured on it. In medical diagnosis, for example, deciding whether a patient has acute appendicitis is a discrimination problem. Answering it usually requires measuring several indicators (variables) on the patient and, based on the observed values, placing the patient either in the class with acute appendicitis or in the class without it. Discriminant analysis usually begins by establishing discriminant functions. The observed values of the indicators are then substituted for the corresponding variables, and a decision is made according to a discrimination rule (for example, whether the function value exceeds a certain threshold). As an example, to investigate the relationship between gastric cancer and nitrite compounds, six indicators (variables) were measured in three groups of patients, with gastric cancer (H1), atrophic gastritis (H2) and superficial gastritis (H3): sex (x1; 1 = male, 0 = female), age (x2), gastric juice pH (x3), salivary nitrite concentration (x4), nitrite concentration in gastric juice (x5) and dimethylnitrosamine concentration in gastric juice (x6). Discriminant analysis showed that four of the six indicators, x1, x2, x4 and x6, are distributed significantly differently among the three disease groups; the remaining two are distributed in much the same way across the groups. The following discriminant function can be established for each disease group:
u1 = -11.48 + 2.68x1 + 0.37x2 + 0.04x4 + 0.90x6 (H1)
u2 = -14.06 + 3.79x1 + 0.35x2 + 0.50x4 + 1.82x6 (H2)
u3 = -6.36 + 1.84x1 + 0.27x2 + 0.34x4 + 0.84x6 (H3)
To classify a case, its measured values (x1, x2, x4, x6) are substituted into the discriminant functions, giving a set of function values u1, u2, u3. The discrimination rule here is: if u1 is the largest, the case is assigned to disease group H1; if u2 is the largest, to H2; if u3 is the largest, to H3. Diagnosis thus becomes a matter of data processing and analysis, and automatic diagnosis in modern hospitals is based on this principle. Broadly speaking, the doctors' experience and knowledge are put into the computer, where the diagnosis is established empirically in the form of discriminant functions. The coefficients of the variables in the discriminant functions carry important information. The coefficients of x3 and x5 are all 0. The three coefficients of x1 (2.68, 3.79, 1.84) indicate that, relative to women (x1 = 0), men (x1 = 1) are more prone to atrophic gastritis (3.79) or gastric cancer (2.68). The three coefficients of x2 indicate that, at the same age, the tendencies toward gastric cancer, atrophic gastritis and superficial gastritis are in the ratio 0.37 : 0.35 : 0.27, and so on.
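The discrimination rule can be written out as a short sketch. The coefficients are copied from the three discriminant functions above. The function names and the two example cases are hypothetical, and the example values of x4 and x6 are arbitrary because the source does not state their units.

```python
# Coefficients copied from the three discriminant functions in the text.
# Each entry: (constant, coefficient of x1, of x2, of x4, of x6).
DISCRIMINANT = {
    "H1": (-11.48, 2.68, 0.37, 0.04, 0.90),  # gastric cancer
    "H2": (-14.06, 3.79, 0.35, 0.50, 1.82),  # atrophic gastritis
    "H3": (-6.36, 1.84, 0.27, 0.34, 0.84),   # superficial gastritis
}


def scores(x1, x2, x4, x6):
    """Evaluate u1, u2, u3 for one case and return them keyed by group."""
    values = (1.0, x1, x2, x4, x6)  # the leading 1 multiplies the constant
    return {
        group: sum(c * v for c, v in zip(coef, values))
        for group, coef in DISCRIMINANT.items()
    }


def classify(x1, x2, x4, x6):
    """Assign the case to the group whose discriminant function is largest."""
    s = scores(x1, x2, x4, x6)
    return max(s, key=s.get)


# Two hypothetical cases (values invented for illustration only).
for case in [(1, 60, 2.0, 1.0), (0, 30, 1.0, 0.5)]:
    print(case, scores(*case), "->", classify(*case))
```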
Factor analysis
In medicine, biology and all social and natural phenomena, variables (or things) are often correlated with or similar to one another. This is because they often share factors (called common factors) that influence the different variables (or things). The fundamental task of factor analysis is to work from the outside in: starting from many observed variables (or things), find the common factors hidden inside them, identify the main features of those factors, and construct the common factors from the variables (or things) actually measured. Factor analysis is divided into R-type and Q-type: R-type analyzes the factors among variables, and Q-type analyzes the factors among things.
Take R-type factor analysis as an example. Let the variables measured on the sample be x1, x2, ..., xm and the hidden common factors be f1, f2, ..., fk. In theory each variable xi can then be written as:
xi = αi1f1 + αi2f2 + ... + αikfk + ei (i = 1, 2, ..., m)
The first part of the right-hand side is the part of the variable produced by the common factors (f1, f2, ..., fk); the last part, written ei here, is independent of the common factors (it is called the independent part). The fundamental task of factor analysis is to determine f1, f2, ..., fk and their coefficients {αij} from the sample. The coefficient αij is called the factor weight or factor loading. When the sample data are standardized and the common factors are assumed to be mutually uncorrelated, the loading αij is the correlation coefficient between the common factor fj and the variable xi. Factor analysis makes it possible to infer a small number of factors from the observed variables and to explain the observations with as few factors as possible, revealing the intrinsic links among things. The practical interpretation of the factors must draw on subject-matter expertise and be tested in practice. For example, the Chinese scholars Liang Yuehua and Sun Shanggong used factor analysis to identify a common factor f1 hidden in six easily measured physiological indicators (systolic blood pressure, diastolic blood pressure, respiration, heart rate, body temperature and saliva volume). Experiments showed that f1 represents the balance of the sympathetic nervous system well, and f1 was finally used to demonstrate that the essence of "cold and heat" in traditional Chinese medicine (TCM) is sympathetic inhibition or excitation.
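The statement that, for standardized data, a loading equals the correlation between the variable and the common factor can be checked by simulation. The sketch below assumes a single common factor (called `factor` in the code) and hypothetical loadings; all names and values are invented for illustration. In real factor analysis the factor is not observed, so this only verifies the identity and is not an estimation method.

```python
import math
import random


def pearson(u, v):
    """Ordinary (Pearson) correlation coefficient of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


random.seed(0)
n = 20000
loadings = [0.9, 0.6, 0.3]  # hypothetical loadings on one common factor

# One standardized common factor shared by all variables.
factor = [random.gauss(0, 1) for _ in range(n)]

variables = []
for a in loadings:
    # x = a * f + sqrt(1 - a^2) * e keeps the variance of x equal to 1,
    # where e is the part of x that is independent of the common factor.
    unique = [random.gauss(0, 1) for _ in range(n)]
    variables.append(
        [a * f + math.sqrt(1 - a * a) * e for f, e in zip(factor, unique)]
    )

estimated = [pearson(x, factor) for x in variables]
for a, r in zip(loadings, estimated):
    print(f"loading {a:.1f}   correlation with the factor {r:.3f}")
```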
Principal component analysis
Principal component analysis studies how to combine mutually correlated variables into one (or a few) composite indicators (also called principal components) that reflect as much of the information in the observed variables as possible. Let x1, x2, ..., xm be the observed variables. The desired composite indicator Z can generally be written as:
Z = c1x1 + c2x2 + ... + cmxm
where the coefficients c1, c2, ..., cm are determined from the sample. In practice Z can often absorb most of the information shared by the m variables (much like the first common factor in factor analysis). When the observed variables are only weakly correlated with one another, principal component analysis is not appropriate. If the observed variables fall into several groups by correlation, with little correlation between the groups, a single principal component cannot summarize all the variables, and several principal components should be taken instead.
In actual use, principal component analysis and factor analysis are so similar that many statisticians do not distinguish between the two methods, and the names are often used interchangeably.
Principal component analysis has many applications in medical research. For example, five easily measured signs of aging (white spots, age spots, time standing on one leg with eyes closed, arcus senilis and tooth loss) have been combined into a single indicator Z. Calculations show that Z absorbs 43% of all the information in the five signs and can comprehensively reflect the degree of physical aging.
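One way to obtain a single composite indicator and the share of information it absorbs is to take the leading eigenvector of the correlation matrix. The sketch below uses power iteration in plain Python. The correlation matrix is hypothetical (it is not the aging data), the function name `first_component` is an assumption, and statistical packages normally use a full eigendecomposition instead.

```python
import math


def first_component(corr, iterations=1000, tol=1e-12):
    """Return (weights, share) for the first principal component.

    corr  : correlation matrix of the observed variables (list of lists)
    share : fraction of the total variance absorbed by the component,
            i.e. the largest eigenvalue divided by the number of variables
    """
    m = len(corr)
    # Arbitrary non-uniform starting vector, scaled to unit length.
    v = [1.0 / (i + 1) for i in range(m)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(c * c for c in w))
        w = [c / norm for c in w]
        change = max(abs(a - b) for a, b in zip(w, v))
        v = w
        if change < tol:
            break
    # Rayleigh quotient of the unit vector v gives the largest eigenvalue.
    eigenvalue = sum(v[i] * corr[i][j] * v[j] for i in range(m) for j in range(m))
    return v, eigenvalue / m


# Hypothetical correlation matrix of three positively correlated variables.
corr = [
    [1.0, 0.5, 0.4],
    [0.5, 1.0, 0.3],
    [0.4, 0.3, 1.0],
]
weights, share = first_component(corr)
print("weights of Z:", [round(c, 3) for c in weights])
print(f"share of total variance absorbed by Z: {share:.1%}")
```

The weights are the coefficients applied to the standardized variables, and the share plays the same role as the 43% figure quoted above.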
Cluster analysis
Cluster analysis is also called numerical taxonomy. Classical taxonomy arose several hundred years ago, for example in the classification of fossils and plant specimens. Classification in the past relied mostly on a few specific indicators. When no such specific indicator exists, or one is hard to find, things can only be classified by multivariate statistical analysis. Taxonomy that brings in mathematical methods is called "cluster analysis" and dates from the 1960s. Cluster analysis has since developed rapidly and come into wide use, but it is not yet fully mature.
Cluster analysis is divided into R-type and Q-type: R-type classifies variables, and Q-type classifies samples (observation units, things). Classification is based on similarity or distance: two variables (or samples) that are very close or similar to each other naturally fall into the same class. Similarity or distance must therefore be defined before a cluster analysis is performed, and there are many ways to define them. For example, the correlation coefficient is commonly used to represent the similarity between variables, and the Euclidean distance between two points (computed after the variables have been made dimensionless) to represent the geometric distance between two samples. A mathematical formula for classification is then chosen to determine the classes. These formulas are also many and varied, and no single one is optimal. Practitioners often compute with several methods and combine the results with subject-matter expertise to decide on the classification.
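The Q-type distance mentioned above can be sketched as follows: each variable is first standardized so that it becomes dimensionless, and the Euclidean distance between samples is then computed. The data and function names are hypothetical, and standardizing by the sample standard deviation is only one of several possible ways to remove the units.

```python
import math
import statistics


def standardize(samples):
    """Rescale every variable (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*samples))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, sds)]
        for row in samples
    ]


def distance_matrix(samples):
    """Euclidean distances between all pairs of standardized samples."""
    z = standardize(samples)
    return [[math.dist(a, b) for b in z] for a in z]


def closest_pair(samples):
    """Indices of the two samples that a clustering step would merge first."""
    d = distance_matrix(samples)
    n = len(samples)
    return min(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: d[pair[0]][pair[1]],
    )


# Hypothetical samples: (height in cm, weight in kg).
samples = [(170, 60), (172, 63), (155, 80), (180, 75)]
for row in distance_matrix(samples):
    print([round(v, 2) for v in row])
print("closest pair:", closest_pair(samples))
```

Without the standardization step, the variable with the largest numerical range would dominate the distance simply because of its units.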
Survival analysis
Survival analysis originated with the life table. Survival time is affected not only by biological and health factors but also by social factors and living conditions. Survival analysis studies which factors have a significant effect on "lifetime" and how large the associated risk is. In the 20th century, survival analysis came to be used not only for human lifetimes but for every generalized "lifetime" or "death" problem, such as the life of an engine, the survival time of patients after surgery, or the comparison of two treatments. Survival analysis has many models. The most commonly used is the Cox regression model, whose characteristic is that the relative risk under the combined action of m variables can be expressed as the product of the relative risks of the variables acting alone (so it is also called a multiplicative model). Another common model is the additive model, whose characteristic is that the relative risk under the combined action of m variables can be expressed as the sum of the effects of the variables acting alone. Which model to use should be decided by combining subject-matter expertise with the specific problem.
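The difference between the two models can be illustrated with a short sketch. The relative risks are hypothetical and the function names are assumptions. For the additive model the code assumes that the excess risks (relative risk minus 1) add up; the text only states that the individual effects are summed, so this is one possible reading.

```python
import math


def multiplicative_rr(relative_risks):
    """Combined relative risk when the individual relative risks multiply."""
    return math.prod(relative_risks)


def additive_rr(relative_risks):
    """Combined relative risk when the excess risks (RR - 1) add up."""
    return 1 + sum(rr - 1 for rr in relative_risks)


def cox_relative_risk(coefficients, x):
    """exp(b1*x1 + ... + bm*xm): hazard relative to baseline in a Cox model."""
    return math.exp(sum(b * v for b, v in zip(coefficients, x)))


# Two hypothetical risk factors with relative risks 2 and 3 when acting alone.
alone = [2.0, 3.0]
print("multiplicative model:", multiplicative_rr(alone))
print("additive model:      ", additive_rr(alone))

# In a Cox model each coefficient is the log of a relative risk, so adding
# the terms in the exponent multiplies the individual relative risks.
coefficients = [math.log(rr) for rr in alone]
print("Cox, both factors present:", cox_relative_risk(coefficients, [1, 1]))
```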
Besides these six major branches, path analysis and canonical correlation analysis are also very common in multivariate statistical analysis. Ordinary regression analysis can only measure the direct effect of each variable on the index y (with the other variables fixed), whereas path analysis can also measure each variable's indirect effect on y (that is, its effect through other variables correlated with it). Path analysis has many applications in genetic epidemiology. Canonical correlation analysis is a further development of regression analysis. When several indicators (y1, y2, ...) and several independent variables (x1, x2, ...) are measured on each thing at the same time, canonical correlation analysis examines how composites of the independent variables are related to composites of the indicators.