R Loop Through Subdirectories
Today I finally got around to tackling the many datasets on Brazilian elections published by the Tribunal Superior Eleitoral. I had been putting it off simply to avoid the trouble of handling large, messy text files. The data I collected amount to more than 40GB of plain text files reporting electoral results, candidates’ profiles, campaign revenues and expenditures, and so on. If nothing else, it makes a good example of using R for data management, and it may be useful for students who have to deal with messy datasets from all sorts of sources.
The macro that I want to run creates a report based on data within the sub-folders. I was thinking of having the user save the report in the lots folder, so that I can get the user's file path and use it to tell the program where to begin looping through sub-folders. The user of the sample application is required to enter a main path. The application then loops through the files in that path and any files contained in sub-folders of the main path. The logic for traversing the folders and sub-folders will be implemented in a dedicated iterator class.
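In R, this kind of traversal does not need a dedicated iterator class: base R's list.files() with recursive = TRUE already walks a main path and all of its sub-folders. The sketch below is only an illustration under assumptions. The folder names (main, sub_a, sub_b, deeper) and the file names are hypothetical, and a temporary folder tree stands in for the real data.

```r
# Hypothetical main path; a small temporary folder tree stands in for real data.
main_path <- file.path(tempdir(), "main")
dir.create(file.path(main_path, "sub_a", "deeper"),
           recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(main_path, "sub_b"), showWarnings = FALSE)
writeLines("x", file.path(main_path, "top.txt"))
writeLines("x", file.path(main_path, "sub_a", "a.txt"))
writeLines("x", file.path(main_path, "sub_a", "deeper", "d.txt"))
writeLines("x", file.path(main_path, "sub_b", "b.txt"))

# recursive = TRUE descends into every sub-folder of main_path;
# full.names = TRUE returns paths that can be passed straight to a reader.
all_files <- list.files(main_path, pattern = "\\.txt$",
                        recursive = TRUE, full.names = TRUE)

# Loop over every file found, whatever its depth in the tree
for (f in all_files) {
  cat(basename(dirname(f)), "->", basename(f), "\n")
}

# The sub-folders themselves can be listed with list.dirs()
sub_dirs <- list.dirs(main_path, recursive = TRUE, full.names = FALSE)
```

Replacing the cat() call with a read step turns this into an import loop over the whole tree.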
The task can be stated as follows. Suppose you have a set of data files (data1.txt, data2.txt, […], data27.txt), each holding the data, or a subset of the data, for one state or electoral district. What you want to do is simply stack every data file into a single file, either for more aggregated analyses or just to spare the computer from storing so many slices. In sum, the task is to obtain one table containing all the subsets; more complex cases will be addressed in later posts. This can be done by pointing R to the directory where the files are, then looping through them, importing and appending each one. Finally, the aggregated file can be written back to disk.
The piece of code below does just that. The first line sets the path where R should look for the files. The second line creates an empty data object to hold the imported files, if any. The third line lists the files in that path, and a loop then reads each existing file of type “.txt” as a table. The last line in the loop builds the final table by appending each subset that was imported into memory. Finally, the last part of the program, which sits outside the loop for efficiency, simply writes the final table to disk as a text file, delimiting the columns with a semicolon ‘;’.
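A minimal sketch that follows these steps might look like the code below. It is an illustration under assumptions, not a definitive implementation: the folder name (slices), the column names (state, votes), and the output file name (stacked.txt) are hypothetical, the files are assumed to share the same columns, and two tiny files in a temporary folder stand in for the real data.

```r
# Assumption: all slices live in one folder and share the same columns.
# Two tiny files in a temporary folder stand in for data1.txt ... data27.txt.
path <- file.path(tempdir(), "slices")
dir.create(path, showWarnings = FALSE)
write.table(data.frame(state = "AC", votes = 10),
            file.path(path, "data1.txt"), sep = ";", row.names = FALSE)
write.table(data.frame(state = "AL", votes = 20),
            file.path(path, "data2.txt"), sep = ";", row.names = FALSE)

# Empty object that will receive each imported file
stacked <- NULL

# List every ".txt" file in the folder, then read and append one at a time
files <- list.files(path, pattern = "\\.txt$", full.names = TRUE)
for (f in files) {
  slice <- read.table(f, header = TRUE, sep = ";", stringsAsFactors = FALSE)
  stacked <- rbind(stacked, slice)
}

# Written once, outside the loop, with ';' as the column delimiter.
# The output goes to a different folder so a rerun does not re-import it.
out_file <- file.path(tempdir(), "stacked.txt")
write.table(stacked, out_file, sep = ";", row.names = FALSE)
```

Growing a table with rbind() inside a loop copies it on every pass, so for many large files it may be worth collecting the pieces with lapply() and binding them once with do.call(rbind, ...).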