
Sunday, December 22, 2013

2.7 Performing a simple linear regression analysis with R

With a regression analysis you can, just like with a correlation, examine the relationship between different variables. In addition, a regression analysis gives you a formula with which you can predict future outcomes with some degree of reliability. The interpretation of the results is an important part of a regression analysis.

Variables in a regression analysis

In this section the simple linear regression analysis is explained. In a simple linear regression, one variable is predicted from the values of another variable. The predicted variable is called the dependent variable. The variable with which the dependent variable is predicted is called the independent variable.

Command

The command you have to use in R to perform a regression analysis is:
lm(*name of the dependent variable*~*name of the independent variable*, *name of the dataset*). Every numeric variable can be used in a regression analysis, but variables vary in prediction power when serving as the independent variable.

To be able to interpret the regression analysis later on, you have to give the regression a name. You can do this with the assignment operator <-. If you want to execute a regression analysis and interpret it later, you use the command *name of the regression*<-lm(*name of the dependent variable*~*name of the independent variable*, *name of the dataset*).

In the example of Figure 23 the following command is used:
Regression1<-lm(SatisfactionCustomer~DistanceCustomer,Projects).
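
A minimal sketch of the full sequence, assuming Projects.csv is a comma-separated file in the working directory and contains the columns SatisfactionCustomer and DistanceCustomer (in the figure these columns carry their Dutch names):

Projects <- read.csv("Projects.csv")                                   # load the dataset into a data frame
Regression1 <- lm(SatisfactionCustomer ~ DistanceCustomer, Projects)   # fit the simple linear regression
Regression1                                                            # prints the intercept and the slope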

The estimated model

In the example of Figure 23 the satisfaction of the customer is estimated by looking at the distance between the company and the customer. R shows the following formula based on the regression analysis: SatisfactionCustomer = 2.524811 + (-0.003408x), where x stands for the distance in kilometers. This formula is called the estimated model. By typing in only the name of the regression, the coefficients of the estimated model appear.
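
As an illustrative sketch (the distance of 100 kilometers is a made-up value, not taken from the dataset), the estimated model can be used to predict the satisfaction of a customer at a given distance:

2.524811 + (-0.003408 * 100)                                # prediction calculated by hand
predict(Regression1, data.frame(DistanceCustomer = 100))    # the same prediction with R's predict() function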

Figure 23: Performing a regression analysis with R

Executing the regression analysis

With the command summary(*name of the regression*) you get an overview of the regression. In this overview you have to look at the part underneath the text Coefficients. After the rows (Intercept) and DistanceCustomer (in this case the Dutch column name AfstandKlant) you find the same values as in the estimated model.
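
As a small sketch (using Regression1 from Figure 23), the values in the Coefficients table can also be extracted directly:

summary(Regression1)            # full overview, including the Coefficients part
coef(Regression1)               # only the intercept and the slope of the estimated model
coef(summary(Regression1))      # coefficients with standard errors, t-values and p-values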

In the example of Figure 23 these values are indeed the same as the ones shown in the estimated model.

Significance

After the rows, up to three stars (*) can be presented. The stars indicate the level of significance of the independent variable in predicting the dependent variable. One star is already enough to conclude significance. The more stars after a row, the more significant the variable is. If there are no stars after a row, the conclusion is that the specific independent variable is not significant in predicting the dependent variable. This indicates that the variable does not have a lot of prediction power in this regression model and could be removed from the regression analysis.
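
As a hedged sketch, the p-values on which the stars are based can be inspected directly (Pr(>|t|) is the p-value column in the Coefficients table of summary()):

pvalues <- coef(summary(Regression1))[, "Pr(>|t|)"]   # extract the p-values of the intercept and the slope
pvalues < 0.05                                        # TRUE for every coefficient that gets at least one star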

Prediction power of the regression analysis / Adjusted R-squared

In the overview of the regression analysis you find the text Adjusted R-squared followed by a value. This value, expressed as a proportion, indicates how much of the variation in the dependent variable is explained by the current regression model. An Adjusted R-squared of 1 means that 100% of the variation in the outcomes is explained by the current regression model. An Adjusted R-squared of 0.5 means that 50% of the variation is explained by the current regression model.

It is up to the user to judge, based on the Adjusted R-squared value, whether the regression model is good enough to make predictions. Generally, an Adjusted R-squared value of 0.5 (or higher) indicates a reasonably acceptable regression model for making predictions.

In the example given in Figure 23 the Adjusted R-squared is 0.1041. This means that only 10.41% of the variation in customer satisfaction is explained by the model. Because this is quite a low value, the prediction model SatisfactionCustomer = 2.524811 + (-0.003408x) can be judged as a weak prediction model.
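
As a small sketch, the Adjusted R-squared can also be read from the summary object directly, instead of looking it up in the printed overview:

summary(Regression1)$adj.r.squared   # for the model of Figure 23 this is approximately 0.1041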

New regression analysis

Figure 24 shows another regression analysis. In this case a prediction model is sought that predicts the turnover of a project from the hours spent on that specific project. The following command is used: Regression2<-lm(TurnoverProject~HoursProject,Projects).

By typing in the command Regression2, the following prediction model appears: TurnoverProject = 0 + 40x. In this case x is the number of hours spent on the project.

With the command summary(Regression2) the following information about the regression analysis appears (see Figure 24). This shows that only the variable HoursProject (UrenProject) is significant (the intercept can be left out since it has a value of 0).
The information also shows that the value of the Adjusted R-squared is 1. This means that the prediction model explains the outcomes perfectly. This is quite logical, since the company charges $ 40 per hour: the turnover is simply the number of hours multiplied by the fee per hour.
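
As a hedged sketch (assuming the Projects data frame contains the columns TurnoverProject and HoursProject used above), the exact relation behind this perfect fit can be checked in the data itself:

all(Projects$TurnoverProject == 40 * Projects$HoursProject)   # TRUE when turnover is exactly 40 times the number of hours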

Figure 24: Another regression analysis in R





Saturday, December 21, 2013

Part 2: Analyzing data with R with basic prediction models

In the first part of this manual you became more familiar with R. The basic operations in R are explained in Part 1 of this tutorial. It is recommended to practice with different datasets to get more familiar with the commands of R.

In Part 2, the emphasis is on identifying the relations and influences between different variables. With these relations, basic prediction models can be created. In doing this, it is important to be able to interpret the results that R presents.

In Part 2 the files Projects.csv and Projects2.csv are used; these files can be downloaded by clicking on them. These are larger files with more different types of variables. You might have to deal with datasets that are larger than Projects.csv and Projects2.csv, but the basic operations stay the same, which enables you to perform the same analyses on larger datasets.

Good luck with part 2 of this tutorial.

To the first step of Part 2: Setting the working directory