
Data can be defined as a collection of facts and statistics that yields meaningful insights when properly analysed. In general, real-world data is a combination of both categorical and numerical data. Many machine learning algorithms do not support categorical data; therefore, to use the data efficiently, categorical data should be converted into numerical data without distorting the data distribution. The process of transforming categorical data into numerical data is called “categorical encoding”. Categorical encoding is one of the crucial steps in data preprocessing, as most machine learning models work better with numerical data. There are many types of categorical encoding techniques; each has trade-offs and a notable influence on the outcome of the analysis, so choosing an optimal technique for a given situation is a challenging task.

The main objective of this paper is to provide insights on choosing a technique that not only converts categorical data into numerical data but also makes the transformed data a better representative of the target variable. In this paper, we analysed and implemented various encoding techniques on the heart disease prediction dataset and took care to select the encoding technique that best meets the main objectives of the paper. To the best of our knowledge, this is the first paper that focuses completely on the analysis of basic categorical encoding techniques based on their correlation with the target variable.
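To make the idea of comparing encodings by their correlation with the target concrete, here is a minimal sketch in plain Python. It rests on assumptions that are not taken from the paper: the `chest_pain` column and `disease` labels are toy values invented for illustration (not the heart disease prediction dataset), and the two encoders shown, label encoding and target (mean) encoding, are simply common basic techniques rather than the specific set the authors evaluated.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def label_encode(values):
    """Map each category to an integer, assigned in alphabetical order."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]


def target_encode(values, target):
    """Map each category to the mean of the target observed for it."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, t in zip(values, target):
        sums[v] += t
        counts[v] += 1
    return [sums[v] / counts[v] for v in values]


# Hypothetical toy data (invented for this sketch, not from the paper).
chest_pain = ["typical", "atypical", "none", "typical",
              "none", "atypical", "typical", "none"]
disease = [1, 0, 0, 1, 0, 1, 1, 0]

r_label = pearson(label_encode(chest_pain), disease)
r_target = pearson(target_encode(chest_pain, disease), disease)

print(f"label encoding:  r = {r_label:.3f}")
print(f"target encoding: r = {r_target:.3f}")
```

On this toy data the target-encoded column correlates more strongly with the label than the label-encoded one, because the integer codes follow alphabetical order rather than any relationship with the outcome. Since target encoding uses the label itself, in practice the category means would be estimated on training data only to avoid leakage.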
