Most researchers say that the best way to train a prediction model is to use a large data set in which you control every stage of the process (measurement, data preparation, data cleaning, etc.). Examples of such data sets are the Enron dataset, the Stanford SNAP (Stanford Network Analysis Project) datasets, the Netflix dataset, and the movie-ratings dataset, which has thousands of reviews.

But what if you have no access to such a dataset? Or what if you want to use a smaller data set? I would say that a small data set should not be used to train a model. Most data sets are too small, and most of the time they don’t contain relevant information. The performance of the model is likely to decrease.

Example 1: Grep for a zip code from the NO attached data sources dataset:

    # Names of the columns of interest in ds
    zip <- c("zip code", "country", "longitude", "latitude")
    # In R, + does not concatenate strings; paste() joins the columns with "_"
    output <- paste(ds$longitude, ds$latitude, ds$country, ds$zip, sep = "_")

Output:

    output
    [1] -89.254875_45.358859_Scotland_Highland_Orkney
    [2] -89.307571_45.376633_Netherlands_The Hague
    [3] -89.372052_45.492758_Canada_Ancaster_Ontario
    [4] -89.316682_45.499917_USA_San Francisco_CA
    [5] -89.214895_45.373408_USA_Monterey_CA
    [6] -89.200245_45.405842_USA_Portland_OR
    [7] -89.068627_45.381062_USA_Seattle_WA
    [8] -89.041171_45.351477_USA_Portland_OR
    [9] -89.300929_45.339161_USA_New Orleans_LA
    [10] -89.257805_45.203958_USA_Houston_TX
    [11] -89.157775_45.255853_USA_San Jose_CA
    [12] -89.238832_45.318282_USA_Boston_MA



