Data preprocessing, good dataset & predictive model

March 8, 2022

With the rapid development of Data Science and the easy accessibility of data, data-driven decision making has become an integral part of the managerial process. Data preprocessing is an indispensable step in model building and data analysis. This article describes four more data preprocessing techniques: scaling and standardization of data, transformation of variables, feature selection and data partitioning.

This article is a continuation of my earlier write-up, “Why and How to Create a Good Dataset Before Building a Predictive Model”, in which I discussed the treatment of missing values, dimension reduction techniques, outlier treatment and why preprocessing is important before building any predictive model.

Scaling: Normalization and Standardization of data

Machine learning algorithms do not know the units of each variable and tend to give higher weightage to higher magnitudes irrespective of the unit. For example, a value of ₹1 is given the same weightage as a value of $1, and 300 cm is treated with higher weightage than 5 metres. Scaling is especially important for distance-based models such as SVM, KNN and clustering, because features with larger magnitudes dominate the distance calculations. Scaling brings every feature onto the same scale so that no feature gets extra importance simply because of its magnitude. In Min-Max scaling, data is transformed into a chosen range such as 0 to 1, 0 to 5 or -1 to 1 using the formula,

X_scaled = (X - X_min) / (X_max - X_min), where X_scaled is the normalized value of X.
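
A minimal sketch of Min-Max scaling, assuming a small toy NumPy array (the values are illustrative only) and scikit-learn's MinMaxScaler:

```python
# Min-Max scaling: manual formula versus scikit-learn's MinMaxScaler.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[300.0], [150.0], [500.0], [50.0]])  # toy lengths in cm

# Manual formula: X_scaled = (X - X_min) / (X_max - X_min)
X_manual = (X - X.min()) / (X.max() - X.min())

# Equivalent with scikit-learn; feature_range can be changed, e.g. (-1, 1)
scaler = MinMaxScaler(feature_range=(0, 1))
X_sklearn = scaler.fit_transform(X)

print(X_manual.ravel())
print(X_sklearn.ravel())
```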

Standardization, or Z-score normalization, is another technique used to rescale data so that it has a mean of 0 and a standard deviation of 1.

Z = (X - mean of X) / (standard deviation of X). Standardization is widely used in linear regression, logistic regression, SVM, clustering and ANN models. Scaled features converge faster under gradient descent, which is why scaling is used extensively in neural network models. Min-Max scaling is strongly affected by outliers in a dataset, whereas standardization is much less sensitive to them. Other scaling techniques used in data preprocessing include the Quantile transformer scaler, Robust scaler and MaxAbsScaler.
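
A similar sketch of Z-score standardization, again on made-up values, comparing the manual formula with scikit-learn's StandardScaler:

```python
# Standardization: manual Z-score versus scikit-learn's StandardScaler.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1200.0], [1500.0], [900.0], [2000.0]])  # toy salary values

# Manual formula: Z = (X - mean of X) / (standard deviation of X)
Z_manual = (X - X.mean()) / X.std()

# Equivalent with scikit-learn (also uses the population standard deviation)
scaler = StandardScaler()
Z_sklearn = scaler.fit_transform(X)

print(Z_manual.ravel())
print(Z_sklearn.ravel())
```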

Transformation of variables

To boost model performance, variables are transformed as a data preprocessing step. Transformation helps make the data compatible with the underlying assumptions of a machine learning algorithm and can linearize the relationship between two variables that are non-linearly related. Variable transformations are applied to independent as well as dependent variables, based on the requirement, and to both categorical and numeric data. Converting categorical variables to numeric form is an essential part of the machine learning process, as most models can handle only numeric data. One-hot encoding, label encoding, feature hashing, frequency encoding and target encoding are the most commonly used techniques for transforming categorical data, and an appropriate choice of encoding improves model performance.
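
As a rough illustration, one-hot and label encoding on a made-up DataFrame (the 'city' and 'sales' columns are hypothetical) might look like this with pandas and scikit-learn:

```python
# One-hot and label encoding of a categorical column.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Delhi", "Chennai"],
                   "sales": [250, 300, 180, 220]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df, columns=["city"], prefix="city")

# Label encoding: each category mapped to an integer code
df["city_label"] = LabelEncoder().fit_transform(df["city"])

print(one_hot)
print(df)
```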

Numeric variables are transformed to change their scale or to convert a skewed distribution into a more Gaussian one. One of the most common numeric transformations is the log transformation. Since log(10) = 1 and log(100) = 2, it compresses large values considerably and often makes a skewed distribution, especially a right-skewed one, closer to normal. However, the log transformation cannot be applied directly to data containing zero or negative values, as the logarithm is undefined there; values between 0 and 1 produce negative results, so a constant is often added before transforming. To deal with heteroskedasticity, power transformations are used when building a linear model. A power transformation raises the data to a power lambda (for example, a square root or cube root), and the best value of lambda is found through the Box-Cox transformation, which requires strictly positive data, or the Yeo-Johnson transformation, which also handles zero and negative values.
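
A brief sketch of log and power transformations on an invented right-skewed array, using NumPy and scikit-learn's PowerTransformer for the Box-Cox and Yeo-Johnson variants:

```python
# Log and power transformations of a right-skewed feature.
import numpy as np
from sklearn.preprocessing import PowerTransformer

X = np.array([[1.0], [10.0], [100.0], [1000.0], [5000.0]])

# Log transform (log1p adds 1 first, so zeros are handled safely)
X_log = np.log1p(X)

# Box-Cox needs strictly positive data; Yeo-Johnson also accepts
# zero and negative values. Both estimate the best lambda from the data.
boxcox = PowerTransformer(method="box-cox").fit_transform(X)
yeojohnson = PowerTransformer(method="yeo-johnson").fit_transform(X)

print(X_log.ravel())
print(boxcox.ravel())
print(yeojohnson.ravel())
```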

Data Partitioning

In predictive modelling, the dataset is split into three sets: training, test and validation, often in a ratio such as 70:20:10 or 80:10:10. Partitioning the data helps improve model performance and reduces the chance of overfitting. The training set is fed to the model; the model is built on it and learns from it. Validation on the development set measures the gap between performance on the training data and the development data, and a large gap indicates overfitting. The model is then tested on the test set to verify its efficiency, using metrics such as recall, precision and accuracy. To find the best parameters for a classifier, cross-validation is applied. Proper partitioning of the data, and testing and validating the results before deployment, improves how well the model generalizes.
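
A minimal sketch of a 70:20:10 split, assuming random toy data; scikit-learn's train_test_split does not split three ways in one call, so it is applied twice:

```python
# 70:20:10 train/test/validation split using two calls to train_test_split.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# First keep 70% for training and set aside the remaining 30%
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, random_state=42)

# Then split the remaining 30% into 20% test and 10% validation
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=1/3, random_state=42)

print(len(X_train), len(X_test), len(X_val))  # 700, 200, 100
```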

Feature selection

Predictive models are built on real-life data, and the dataset may contain many redundant features. Too many features add to the complexity of the model and can reduce accuracy. They also create the 'curse of dimensionality', where the number of observations may not be sufficient relative to the number of features included in the model. Filter, wrapper and embedded methods are most commonly used for feature selection, and the correct statistical procedure is chosen based on the data types of the input and output variables.
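
As one example of the filter method, a sketch using scikit-learn's SelectKBest with an ANOVA F-test on the Iris dataset (chosen purely for illustration):

```python
# Filter-method feature selection: keep the k features with the
# highest ANOVA F-scores against the target.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("Original shape:", X.shape)          # (150, 4)
print("Reduced shape:", X_selected.shape)  # (150, 2)
print("F-scores:", selector.scores_)
```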

Each data preprocessing technique must be chosen carefully to enhance model performance, and each one involves multiple statistical techniques that could be discussed at length. This blog gives an overview of the basic preprocessing of data before building any predictive model.