Checking the data set
Data files come in different shapes and sizes. The introduction to this manual demonstrates
how to convert an .xls file to a .csv file. Without the right packages installed, R may not be able to read .xls files, but it can always read .csv files.
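As a minimal sketch of why .csv files need no extra packages: the base function read.csv() reads them directly. The snippet below uses a small inline text instead of a real file so that it runs on its own; the column names follow the Flowersales example, and the numbers are invented purely for illustration.

```r
# Hypothetical CSV content standing in for a real file; the values are invented.
csv_text <- "Months,Roses,Tulips,Violets
January,10,20,30
February,15,25,35"

# read.csv() ships with base R (package utils), so nothing has to be installed.
# With a real file, pass its path as the first argument instead of text = ...
flowers <- read.csv(text = csv_text)

# Show the column names and types that R detected.
str(flowers)
```

The first line of the text is read as the column names, and every following line becomes one row of data.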
Four criteria to check the data set
It is also important to check the content of the .xls or .csv file to determine whether the data set
is well suited for analysis. To judge whether the data is of good
quality, the following four criteria can be used:
Accuracy:
Check that the data set is correct and reliable.
Timeliness:
Check that the data is up to date and covers the right period of time.
Completeness:
Check whether any data is missing and whether the data set is large enough to analyze.
Consistency:
Check that the data uses the same values and terms across different data sets and data sources.
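Some of these criteria can be partly checked in R itself once the data is loaded. The following is only a rough sketch, not a complete quality check: the data frame is built by hand with invented values (including one deliberately missing value), and the column names are assumed to match the Flowersales example.

```r
# Hypothetical data with invented values; one value is missing on purpose.
flowers <- data.frame(
  Months  = c("January", "February", "March"),
  Roses   = c(10, NA, 12),
  Tulips  = c(20, 25, 22),
  Violets = c(30, 35, 31)
)

# Completeness: count the missing values per column and the number of rows.
missing_per_column <- colSums(is.na(flowers))
n_rows <- nrow(flowers)
print(missing_per_column)

# Consistency: list the distinct month labels and flag any that are not
# standard English month names (month.name is a built-in R constant).
month_labels <- unique(flowers$Months)
unexpected_labels <- setdiff(month_labels, month.name)
print(unexpected_labels)

# Accuracy: summary() shows the range of each column, so implausible
# values (for example negative sales) stand out.
summary(flowers)
```

Timeliness usually cannot be verified from the numbers alone; it depends on knowing where the data comes from and which period it describes.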
Figure 5: Simple dataset for analysis with R
Transformation
To analyze a data file with R, it is recommended to keep the file as simple as possible. Explanatory text, colors and images should be removed from the file so that the analysis goes smoothly. This avoids potential errors and other complications in R. Figure 5 shows an example of the simple file Flowersales.csv. The file was converted from an .xls file to a .csv file in the previous section.
Remarks:
- To enable R to read other types of files, additional packages can be installed. The Packages page of this tutorial explains how to install packages. The actions performed with R in this manual do not require any additional packages.
- Important! On this page you can see that the data set contains the totals for the different flowers. R reads the first row of the .csv file as categories (in this case Months, Roses, Tulips and Violets) and the other rows as the data for these categories. R does not recognize the Total row as a summary; it treats Total as if it were a thirteenth month. I therefore recommend removing the row that contains the totals. This avoids problems during the analysis. R can calculate the totals itself when you enter commands in the R console.
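If the Total row is still present after import, it can also be dropped inside R, and the totals can be recalculated from the remaining rows. This is a minimal sketch under assumptions: the data frame is typed in by hand with invented values, and the column names Months, Roses, Tulips and Violets are assumed to match the file.

```r
# Hypothetical data with invented values, including a Total row as in the file.
flowers <- data.frame(
  Months  = c("January", "February", "Total"),
  Roses   = c(10, 15, 25),
  Tulips  = c(20, 25, 45),
  Violets = c(30, 35, 65)
)

# Keep only the rows that are real months, dropping the Total row.
flowers_clean <- flowers[flowers$Months != "Total", ]

# Let R compute the totals from the numeric columns.
totals <- colSums(flowers_clean[, c("Roses", "Tulips", "Violets")])
print(totals)
```

Removing the row in the file itself, as recommended above, has the same effect; doing it in R simply leaves the original file untouched.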
To the next step: 1.3 Importing the dataset into the R console