By NATALIA AFEK (Predictive Solutions)
When analyzing data, we consider both quantitative information (such as salary, age, number of products ordered) and qualitative information (such as gender, education, level of satisfaction with service). To make the data easier to work with, or to adapt it to a specific statistical analysis, numerical data sometimes needs to be converted into qualitative categories.
WHAT IS THE PURPOSE OF RECODING?
Recoding quantitative variables into qualitative ones is widely used for several reasons. First, it contributes to a better understanding of the data, because qualitative variables tend to be easier to interpret than quantitative ones. Recoding makes it easier to compare different groups or categories, which simplifies further analysis. It also makes the data visualization stage easier. Many charts, such as bar charts or pie charts, only make sense if the number of categories presented is not too large. Converting quantitative variables to qualitative ones can therefore improve the readability of the visualization. This allows for a better understanding of patterns and trends in the data, and makes the results more approachable and accessible to non-experts in the field.
Another reason may be the anonymization of data. In some cases, especially in medical or personal data analysis, there is a need to protect privacy. By recoding quantitative variables into qualitative ones, exact values can be hidden, for example by reporting earnings or medical test results only as ranges.
Recoding also allows the data to be tailored to a particular method of statistical analysis. Examples include the chi-square test, which works on categorical variables, and binary logistic regression, in which the predicted variable must be a qualitative variable with two categories.
RECODING TO BINS OF EQUAL WIDTHS
One of the simplest ways to recode quantitative variables into qualitative ones is to divide the range of values into specified intervals, or bins. The user can set the width of the bins directly, e.g., when recoding the variable age, each bin can be defined to span 10 consecutive years. Alternatively, the user can specify the number of division points, e.g., 4 points will divide the range of the variable into 5 bins of equal width. In this case, if the variable had a range of values from 0 to 100, after determining 4 division points, it will contain 5 equal bins: 0-20, 21-40, 41-60, 61-80, 81-100. With this approach, it should be borne in mind that the created bins will most likely not contain equal numbers of observations, because the division is based only on the range of values of the variable (Figure 1). However, this preserves to some extent information about the distribution of the variable in the sample. For example, the 78-97 age category is significantly smaller than the earlier age ranges.
Figure 1. Histogram showing the distribution of the variable age in the study group. The colors indicate successive ranges, with a fixed width of 20 years.
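The equal-width division described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in PS IMAGO PRO: the function names `equal_width_edges` and `assign_bin` and the sample ages are hypothetical, and each bin is assumed to include its upper edge, which reproduces the 0-20, 21-40, ... split from the example.

```python
from bisect import bisect_left
from collections import Counter


def equal_width_edges(lo, hi, n_bins):
    """Return n_bins + 1 edges that split [lo, hi] into bins of equal width."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def assign_bin(value, edges):
    """Return the 0-based index of the bin that contains value.

    Assumption: every bin includes its upper edge, and the first bin
    also includes the lower bound of the range.
    """
    if not edges[0] <= value <= edges[-1]:
        raise ValueError(f"{value} is outside [{edges[0]}, {edges[-1]}]")
    # Search only the interior edges (the division points).
    return bisect_left(edges, value, 1, len(edges) - 1) - 1


# 4 division points (20, 40, 60, 80) give 5 bins on the range 0-100.
edges = equal_width_edges(0, 100, 5)
labels = ["0-20", "21-40", "41-60", "61-80", "81-100"]

ages = [5, 20, 21, 33, 35, 38, 47, 52, 80, 99]  # made-up sample
recoded = [labels[assign_bin(age, edges)] for age in ages]

# Equal widths do not imply equal counts.
counts = Counter(recoded)
```

Counting the recoded labels shows that the bins hold different numbers of observations, which is exactly the property the paragraph above warns about.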
RECODING TO BINS OF EQUAL NUMBERS
Another approach is to recode quantitative variables based on the observed distribution of the variable. Such a division is based on calculated quantiles[1], i.e., values of the variable that divide the sample into n parts of equal size. The most common quantiles used for such transformations are quartiles and percentiles. Quartiles divide the sample into four equal parts, while percentiles divide it into 100, which makes it possible to form 4, 5 or 10 groups of equal size later on. This method of recoding is useful when we want to analyze a dataset by groups of equal size, e.g., when we want to present job satisfaction surveys in large cities and rural areas in a simple way. To simplify the analysis and presentation of results, we want to recode one of the variables, earnings, into four categories: very good, good, poor and bad earners. However, we know that the average amount of earnings in urban and rural areas is significantly different (Figure 2). Thus, the same amount may place some respondents in the middle income group and others in the top one. From experience, we know that salary satisfaction can depend on the broader context, including, for example, comparison against others in the community, or cost of living.
Figure 2. PS IMAGO PRO Violin plot showing the distribution of earnings in the study sample. The average earnings of PLN 3040 (marked with a solid line) in relation to the median in each group (marked with red points) represents the value of the second quartile in the group of residents of large cities and the third quartile in the group of rural residents.
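A minimal sketch of quartile-based recoding, using only the Python standard library, is shown here. The earnings values and the helper names `quantile_cut_points` and `assign_group` are hypothetical, and the choice of `method="inclusive"` in `statistics.quantiles` is an assumption; other quantile definitions give slightly different cut points.

```python
from bisect import bisect_left
from collections import Counter
from statistics import quantiles

LABELS = ["bad", "poor", "good", "very good"]


def quantile_cut_points(values, n_groups=4):
    """Return the n_groups - 1 cut points that split values into equal-count groups."""
    # Assumption: the sample is treated as the full population ("inclusive").
    return quantiles(values, n=n_groups, method="inclusive")


def assign_group(value, cut_points):
    """Return the 0-based group index; a value equal to a cut point goes to the lower group."""
    return bisect_left(cut_points, value)


earnings = [2100, 2500, 2800, 3000, 3300, 3900, 4500, 6000]  # made-up values in PLN
cuts = quantile_cut_points(earnings)
recoded = [LABELS[assign_group(e, cuts)] for e in earnings]

# With distinct values, each of the four categories holds the same number of cases.
counts = Counter(recoded)
```

Tied values at a cut point can make the groups slightly unequal, so it is worth checking the resulting counts on real data.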
Having recoded the variable into four groups, we can present job satisfaction, for example, in the form of a Marimekko graph[3]. In this way, the most important information is visualized clearly and readably (Figure 3). In addition, recoding into ranges of equal size based on quartiles calculated separately for urban and rural areas allows us to create a variable whose categories actually represent better and worse earners within each place of residence. In a further step, we could create the Marimekko chart again, separately for the two groups, and present the results with initially equal shares of bad, poor, good and very good earners in each.
Figure 3. PS IMAGO PRO Marimekko graph showing job satisfaction in groups with bad, low, good and very good salaries.
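Recoding by quartiles calculated separately for each place of residence could look roughly like the sketch below. The records, the group names and the function `recode_within_groups` are hypothetical, and the quartiles again assume `method="inclusive"`; the point is only that the same amount can land in different categories depending on the group.

```python
from bisect import bisect_left
from collections import defaultdict
from statistics import quantiles

LABELS = ["bad", "poor", "good", "very good"]


def recode_within_groups(records, labels=LABELS):
    """Recode values by quantiles computed separately within each group.

    records: list of (group, value) pairs.
    Returns a list of (group, value, label) triples in the original order.
    """
    by_group = defaultdict(list)
    for group, value in records:
        by_group[group].append(value)
    # One set of cut points per group (assumption: "inclusive" quantiles).
    cuts = {
        group: quantiles(values, n=len(labels), method="inclusive")
        for group, values in by_group.items()
    }
    return [
        (group, value, labels[bisect_left(cuts[group], value)])
        for group, value in records
    ]


# Made-up earnings in PLN for two places of residence.
city = [3000, 3600, 4200, 5000, 5800, 6500, 7200, 9000]
rural = [1800, 2100, 2400, 2700, 3000, 3300, 3700, 4200]
records = [("city", e) for e in city] + [("rural", e) for e in rural]

recoded = recode_within_groups(records)
label_of = {(group, value): label for group, value, label in recoded}
```

In this made-up sample, earnings of PLN 3000 fall into the lowest category among city residents but into the "good" category among rural residents.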
RECODING WITH THE PURPOSE OF ANALYSIS
Sometimes the way a quantitative variable is recoded into a qualitative one depends on the purpose of the analysis. For example, in market research, quantitative variables such as income may be recoded based on an income threshold relevant to an advertising campaign. Another variable often treated in this way is age: it may make sense for one range to include minors, "0-18 years," and for only the adult records to be divided into narrower ranges of equal width, say, 10 years. A popular division based on age (or rather, year of birth) is also the separation of generations, distinguished in the social sciences. A growing number of sociological studies, but also marketing efforts, are based on the division into Generation X (born in the 1960s and 1970s), Generation Y (Millennials, born in the 1980s and 1990s) and Generation Z (Zoomers, born in the 21st century). While the dividing points of groupings of this kind may vary somewhat from one source to another, they illustrate well that the groups created do not always need to be of equal width or size. However, making such a decision requires some expertise in the subject of the analysis.
RECODING – STEP BY STEP
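As a small illustration of purpose-driven recoding, the age example described above (one range for minors, then 10-year ranges for adults) can be sketched in Python. This is not the PS IMAGO PRO procedure; the function names, the upper age limit of 98 and the assumption that ages are whole numbers are all hypothetical choices made for the example.

```python
from bisect import bisect_left


def make_age_bands(minor_max=18, width=10, adult_max=98):
    """Build upper bounds and labels: one band for minors, then equal-width adult bands."""
    upper_bounds = [minor_max]
    labels = [f"0-{minor_max}"]
    low = minor_max + 1
    while low <= adult_max:
        high = min(low + width - 1, adult_max)
        upper_bounds.append(high)
        labels.append(f"{low}-{high}")
        low = high + 1
    return upper_bounds, labels


def recode_age(age, upper_bounds, labels):
    """Return the label of the band containing age (assumes integer ages)."""
    if age < 0 or age > upper_bounds[-1]:
        raise ValueError(f"age {age} is outside 0-{upper_bounds[-1]}")
    return labels[bisect_left(upper_bounds, age)]


upper_bounds, labels = make_age_bands()
ages = [7, 18, 19, 28, 29, 64, 98]  # made-up sample
recoded = [recode_age(a, upper_bounds, labels) for a in ages]
```

The first band is deliberately wider than the others, which shows that the dividing points come from the purpose of the analysis rather than from the range or distribution of the variable.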
Recoding quantitative variables into qualitative ones is an important tool in data analysis that facilitates understanding of data, adapts data to the specific needs of statistical techniques, and can improve the quality of analysis. There are a variety of recoding methods, such as equal interval division, equal counts, or recoding with the purpose of the analysis in mind, which can be tailored to a specific research situation. Thus, the decision to recode a quantitative variable into a qualitative one, and the choice of the appropriate method, ultimately depends on the research context and the purpose of the data analysis.
The analysis presented in this article was carried out with the help of PS IMAGO PRO.