Why & How to create a good dataset before building a predictive model?

The easy availability of operational data and the development of Data Science and Machine Learning algorithms have made industry and business heavily dependent on data-driven decision making. Predictive modelling is one of the most popular Data Science techniques, frequently used in industry to gain competitive advantage or to forecast vital business parameters accurately.

Historical data are fed into a predictive model built with a combination of conventional statistical techniques and modern AI (Artificial Intelligence) based analysis to gain insights into business events and predict future ones. Predictive modelling has evolved into an advanced analytical tool across business domains and supports business decisions through intelligent business insights.

Whatever the modelling technique, model performance and the generalisability of its results are seriously threatened if the quality of the data is compromised: messy and noisy. A substantial amount of time (around 80% of the total mining time) is spent preprocessing data to make it ready for analysis and model building (Cui et al., 2018). Data preprocessing includes all the data mining techniques used to transform raw data into a form that is relevant and meaningful for analysis. Data quality must be checked thoroughly on the following parameters: accuracy, timeliness, cleanliness, relevancy, completeness and consistency. This article gives a comprehensive review of the major data preprocessing techniques that are a "must do" before applying three popular predictive models: Linear Regression, Logistic Regression and Linear Discriminant Analysis.

Missing value imputation – Most models fail to perform well if there are missing values in the dataset. If only a few observations have missing values, they are dropped before building a predictive model. But if the number of missing values is significant compared to the total sample size, they need to be imputed. Depending on the data type, missing values are imputed with the mean or median (for continuous data) or the mode (for nominal data). In forward and backward imputation, a missing value is replaced by the previous or next observed value of the variable. The moving average method is a very common technique for treating missing values in time series data. A more sophisticated technique is the KNN method: in K-Nearest Neighbours imputation, the K samples most similar to the observation with missing data are found and used to estimate the missing values. The sketch below illustrates these options.
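As an illustration, here is a minimal sketch of these imputation strategies using pandas and scikit-learn. The column names and values are invented for the example, and the choice of median for income, mode for city and KNN for age is purely illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: 'income' and 'age' are continuous, 'city' is nominal.
df = pd.DataFrame({
    "income": [52000, np.nan, 61000, 58000, np.nan, 49500],
    "age":    [34, 41, np.nan, 29, 37, 45],
    "city":   ["Pune", "Delhi", np.nan, "Pune", "Delhi", "Pune"],
})

# Continuous variable: impute with the median (the mean works similarly).
df["income"] = df["income"].fillna(df["income"].median())

# Nominal variable: impute with the mode (most frequent category).
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Forward/backward imputation, common for ordered or time-indexed data:
# df["age"] = df["age"].ffill().bfill()

# KNN imputation: estimate each missing value from the K most similar rows.
knn = KNNImputer(n_neighbors=2)
df[["income", "age"]] = knn.fit_transform(df[["income", "age"]])

print(df)
```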

Dimension Reduction – When numerous independent variables are used to predict a dependent variable, two issues can arise and threaten model accuracy. The first is multicollinearity: some of the "independent" variables are correlated with each other rather than truly independent, which makes prediction and inference challenging. In this case, PCA (Principal Component Analysis) is used to combine related independent variables into a smaller number of components without losing much information or compromising model performance. Working with a large number of features also makes model building and per-feature inference harder. In the extreme case, the number of independent variables or features (columns in a dataset) exceeds the number of observations (rows in a dataset), the so-called "curse of dimensionality", and again dimension reduction helps to overcome this issue. Other commonly used dimension reduction techniques are Factor Analysis, Independent Component Analysis, forward feature selection and backward feature elimination. A short PCA sketch follows below.
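As a rough sketch of PCA-based dimension reduction, the snippet below applies scikit-learn's PCA to synthetic correlated data. The 95% explained-variance threshold is an illustrative assumption, not a universal rule:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic data: 9 predictors built from 3 latent variables, so the
# columns are strongly correlated (multicollinearity by construction).
base = rng.normal(size=(200, 3))
X = np.hstack([base + 0.1 * rng.normal(size=(200, 3)) for _ in range(3)])

# Standardise first: PCA is sensitive to the scale of each variable.
X_std = StandardScaler().fit_transform(X)

# Keep just enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_std)

print(X.shape, "->", X_reduced.shape)
print("explained variance ratios:", pca.explained_variance_ratio_.round(3))
```

With correlated inputs like these, a handful of components typically captures almost all of the variance, which is exactly the reduction the paragraph describes.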

Outlier treatment – Outliers are values that are extremely high or low compared to the other data values and lie far from most of the observations. If outliers are genuine observations rather than errors, they should be kept in model building, even if model performance suffers. When outliers are errors caused by system behaviour, fraudulent behaviour, human error or instrument error, they should be treated, as they create serious challenges for predictive models. For example, in a linear regression model the fitted line shifts to accommodate the outliers, changing its slope and hence the predictions. Outliers in a variable are either dropped or capped at that variable's upper and lower limits, as in the sketch below. Alternatively, modelling techniques that are less affected by outliers, such as Random Forest and Gradient Boosting, can be used for prediction.
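The sketch below illustrates the drop-or-cap choice on synthetic data, using the common Tukey rule of 1.5 × IQR beyond the quartiles to set the limits. That rule is an illustrative assumption; other limits (e.g. percentiles or standard deviations) are equally common:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic variable with a few extreme values injected as "outliers".
s = pd.Series(rng.normal(50, 5, size=500))
s.iloc[:3] = [120, -40, 150]

# Tukey's IQR rule: flag values beyond 1.5 * IQR from the quartiles.
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: drop the outliers entirely.
s_dropped = s[(s >= lower) & (s <= upper)]

# Option 2: cap (winsorise) them at the upper and lower limits instead.
s_capped = s.clip(lower=lower, upper=upper)

print(f"limits: [{lower:.1f}, {upper:.1f}], dropped {len(s) - len(s_dropped)} rows")
```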

Conclusion – Some other preprocessing techniques are essential before building a model: "Scaling of data", "Transformation of Variables", "Feature Selection" and "Data Partitioning". I will continue this discussion in my subsequent blogs. Each data preprocessing method involves multiple statistical techniques and could be discussed at length; this blog gives an overview of the basic preprocessing needed before building any predictive model.

Reference
Cui, Z. G., Cao, Y., Wu, G. F., Liu, H., Qiu, Z. F., and Chen, C. W. (2018). Research on preprocessing technology of building energy consumption monitoring data based on a machine learning algorithm. Building Science, 34(2), 94–99.