Data Preparation in ML Project

Data is a precious resource for every organization. But that statement needs qualification: data only becomes valuable once it is properly analyzed and prepared.

Businesses use data for various purposes. On a broad level, it is used to make informed business decisions, execute successful sales and marketing campaigns, and so on. But none of this can be done with raw data alone.

On a predictive modeling project, such as classification or regression, raw data typically cannot be used directly.

There are several reasons for this:

  • Machine learning algorithms require data to be in the form of numbers.
  • Some machine learning algorithms impose specific requirements on the data, such as a particular scale or distribution.
  • Errors and statistical noise in the data may need to be corrected.
  • Complex nonlinear relationships may need to be exposed or extracted from the data.
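The first point above can be sketched with a minimal example, assuming scikit-learn is available; the color values are invented for illustration. `OrdinalEncoder` maps each string category to an integer so that a numeric algorithm can consume it:

```python
# Sketch: converting string categories to numbers, a common prerequisite
# for machine learning algorithms. Categories are sorted alphabetically,
# so blue -> 0, green -> 1, red -> 2.
from sklearn.preprocessing import OrdinalEncoder

colors = [["red"], ["green"], ["blue"], ["green"]]
encoder = OrdinalEncoder()
encoded = encoder.fit_transform(colors)
print(encoded.ravel())  # [2. 1. 0. 1.]
```

For categories with no natural order, a one-hot encoding is often preferred so the model does not infer a spurious ranking from the integer codes.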


As a result, before fitting and evaluating a machine learning model, raw data must be pre-processed. The term "data preparation" is used to describe this phase in a predictive modeling project, but it is also known as "data wrangling," "data cleansing," "data pre-processing," and "feature engineering." Any of these names may be better suited to sub-tasks within the larger data preparation process. We can define data preparation as the transformation of raw data into a form that is more suitable for modeling.

Nevertheless, there are common or standard tasks that you may use or explore during the data preparation step in a machine learning project.

These tasks include:

  • Data Cleaning: Identifying and correcting mistakes or errors in the data.
  • Feature Selection: Identifying those input variables that are most relevant to the task.
  • Data Transforms: Changing the scale or distribution of variables.
  • Feature Engineering: Deriving new variables from available data.
  • Dimensionality Reduction: Creating compact projections of the data.
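Several of these tasks can be chained into a single preparation step. The sketch below, assuming scikit-learn is installed and using a small made-up array, combines data cleaning (imputing a missing value), a data transform (standardization), and dimensionality reduction (PCA) in one pipeline:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical toy data: 6 samples, 4 features, one missing value.
X = np.array([
    [1.0, 200.0, 3.0, 40.0],
    [2.0, np.nan, 4.0, 50.0],
    [3.0, 220.0, 5.0, 60.0],
    [4.0, 230.0, 6.0, 70.0],
    [5.0, 240.0, 7.0, 80.0],
    [6.0, 250.0, 8.0, 90.0],
])

prep = Pipeline([
    ("clean", SimpleImputer(strategy="mean")),  # data cleaning: fill missing values
    ("scale", StandardScaler()),                # data transform: standardize scales
    ("reduce", PCA(n_components=2)),            # dimensionality reduction
])
X_prepared = prep.fit_transform(X)
print(X_prepared.shape)  # (6, 2)
```

Wrapping the steps in a `Pipeline` also ensures that, when used with cross-validation, the preparation is fit only on training folds, avoiding data leakage.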


Each of these tasks is a whole field of study with specialized algorithms. Data preparation is not performed blindly. In some cases, variables must be encoded or transformed before we can apply a machine learning algorithm, such as converting strings to numbers. In other cases it is less clear; for example, scaling a variable may or may not be useful to a given algorithm.
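To make the scaling case concrete, here is a minimal sketch, assuming scikit-learn is available and using an invented feature. `MinMaxScaler` maps the values into [0, 1]; whether this helps depends on the algorithm (distance-based models such as k-nearest neighbors often benefit, while tree-based models are generally insensitive to monotonic rescaling):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature with a wide range of values.
X = np.array([[10.0], [20.0], [30.0], [100.0]])
scaled = MinMaxScaler().fit_transform(X)
print(scaled.ravel())  # values mapped into [0, 1]
```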

The broader philosophy of data preparation is to discover how to best expose the underlying structure of the problem to the learning algorithms. This is the guiding light. We don’t know the underlying structure of the problem; if we did, we wouldn’t need a learning algorithm to discover it and learn how to make skillful predictions. Therefore, exposing the unknown underlying structure of the problem is a process of discovery, along with discovering the best-performing learning algorithms for the project.


It's always more complex than it seems at first. Different input variables, for example, can require different data preparation methods. Furthermore, different variables or subsets of input variables may require different data preparation methods applied in different orders. It can feel overwhelming, given the large number of methods, each of which may have its own configuration and requirements. Nevertheless, the steps in the machine learning process before and after data preparation can help inform which techniques to consider.

