
Sunday, December 22, 2013

2.7 Performing a simple linear regression analysis with R

With a regression analysis you can, just like with a correlation, examine the relationship between different variables. In addition, a regression analysis gives you a formula with which you can predict future outcomes with some degree of reliability. The interpretation of the results is an important part of a regression analysis.

Variables in a regression analysis

In this section the simple linear regression analysis is explained. In a simple linear regression, one variable is predicted from the values of another variable. The predicted variable is called the dependent variable. The variable with which the dependent variable is predicted is called the independent variable.

Command

The command you have to use in R to perform a regression analysis is:
lm(*name of the dependent variable*~*name of the independent variable*, *name of the dataset*). Every numeric variable can be used in a regression analysis, but variables vary in prediction power when serving as the independent variable.

To be able to interpret the regression analysis later on, you have to give the regression a name. You can do this with the assignment operator <-. If you want to execute a regression analysis and interpret it later, you use the command *name of the regression*<-lm(*name of the dependent variable*~*name of the independent variable*, *name of the dataset*).

In the example of Figure 23 the following command is used:
Regression1<-lm(SatisfactionCustomer~DistanceCustomer,Projects).
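
A minimal sketch of the full sequence, assuming Projects.csv is a comma-separated file in the working directory and contains the columns SatisfactionCustomer and DistanceCustomer (in the figure these columns carry their Dutch names):

Projects <- read.csv("Projects.csv")                                   # load the dataset into a data frame
Regression1 <- lm(SatisfactionCustomer ~ DistanceCustomer, Projects)   # fit the simple linear regression
Regression1                                                            # prints the intercept and the slope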

The estimated model

In the example of Figure 23 the satisfaction of the customer is estimated by looking at the distance between the company and the customer. R shows the following formula based on the regression analysis: SatisfactionCustomer = 2.524811 + (-0.003408x), where x stands for the distance in kilometers. This formula is called the estimated model. By typing in only the name of the regression, the coefficients of the estimated model appear.
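
As an illustrative sketch (the distance of 100 kilometers is a made-up value, not taken from the dataset), the estimated model can be used to predict the satisfaction of a customer at a given distance:

2.524811 + (-0.003408 * 100)                                # prediction calculated by hand
predict(Regression1, data.frame(DistanceCustomer = 100))    # the same prediction with R's predict() function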

Figure 23: Performing a regression analysis with R

Executing the regression analysis

With the command summary(*name of the regression*) you get an overview of the regression. In this overview you have to look at the part underneath the text Coefficients. After the rows (Intercept) and DistanceCustomer (in this case the Dutch column name AfstandKlant) you find the same values as in the estimated model.
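
As a small sketch (using Regression1 from Figure 23), the values in the Coefficients table can also be extracted directly:

summary(Regression1)            # full overview, including the Coefficients part
coef(Regression1)               # only the intercept and the slope of the estimated model
coef(summary(Regression1))      # coefficients with standard errors, t-values and p-values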

In the example of Figure 23 these values are indeed the same as the ones shown in the estimated model.

Significance

After the rows, up to three stars (*) can be presented. The stars indicate the level of significance of the independent variable in predicting the dependent variable. One star is already enough to conclude significance. The more stars after a row, the more significant the variable is. If there are no stars after a row, the conclusion is that the specific independent variable is not significant in predicting the dependent variable. This indicates that the variable does not have a lot of prediction power in this regression model and could be removed from the regression analysis.
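
As a hedged sketch, the p-values on which the stars are based can be inspected directly (Pr(>|t|) is the p-value column in the Coefficients table of summary()):

pvalues <- coef(summary(Regression1))[, "Pr(>|t|)"]   # extract the p-values of the intercept and the slope
pvalues < 0.05                                        # TRUE for every coefficient that gets at least one star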

Prediction power of the regression analysis / Adjusted R-squared

In the overview of the regression analysis you find the text Adjusted R-squared followed by a value. This value, expressed as a proportion, indicates how much of the variation in the dependent variable is explained by the current regression model. An Adjusted R-squared of 1 means that 100% of the variation in the outcomes is explained by the current regression model. An Adjusted R-squared of 0.5 means that 50% of the variation is explained by the current regression model.

It is up to the user to judge, based on the Adjusted R-squared value, whether the regression model is good enough to make predictions. Generally, an Adjusted R-squared value of 0.5 (or higher) indicates a reasonably acceptable regression model for making predictions.

In the example given in Figure 23 the Adjusted R-squared is 0.1041. This means that only 10.41% of the variation in customer satisfaction is explained by the model. Because this is quite a low value, the prediction model SatisfactionCustomer = 2.524811 + (-0.003408x) can be judged as a weak prediction model.
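
As a small sketch, the Adjusted R-squared can also be read from the summary object directly, instead of looking it up in the printed overview:

summary(Regression1)$adj.r.squared   # for the model of Figure 23 this is approximately 0.1041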

New regression analysis

Figure 24 shows another regression analysis. In this case a prediction model is sought that predicts the turnover of a project from the hours spent on that specific project. The following command is used: Regression2<-lm(TurnoverProject~HoursProject,Projects).

By typing in the command Regression2, the following prediction model appears: TurnoverProject = 0 + 40x. In this case x is the number of hours spent on the project.

With the command summary(Regression2) the following information about the regression analysis appears (see Figure 24). This shows that only the variable HoursProject (UrenProject) is significant (the intercept can be left out since it has a value of 0).
The information also shows that the value of the Adjusted R-squared is 1. This means that the prediction model explains the outcomes perfectly. This is quite logical, since the company charges $ 40 per hour: the turnover is simply the number of hours multiplied by the fee per hour.
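
As a hedged sketch (assuming the Projects data frame contains the columns TurnoverProject and HoursProject used above), the exact relation behind this perfect fit can be checked in the data itself:

all(Projects$TurnoverProject == 40 * Projects$HoursProject)   # TRUE when turnover is exactly 40 times the number of hours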

Figure 24: Another regression analysis in R





Saturday, December 21, 2013

Part 2: Analyzing data with R with basic prediction models

In the first part of this manual you became more familiar with R. The basic operations in R are explained in Part 1 of this tutorial. It is recommended to practice with different datasets to get more familiar with the commands of R.

In Part 2, the emphasis is on identifying the relations and influences between different variables. With these relations, basic prediction models can be created. In doing this, it is important to be able to interpret the results that R presents.

In Part 2 the files Projects.csv and Projects2.csv are used; these files can be downloaded by clicking on them. These are larger files with more different types of variables. You might have to deal with datasets that are larger than Projects.csv and Projects2.csv, but the basic operations stay the same, which enables you to perform the same analyses on larger datasets.

Good luck with part 2 of this tutorial.

To the first step of Part 2: Setting the working directory