By PRZEMYSŁAW SOLECKI (Predictive Solutions)

A scatterplot, or scatter graph, is a popular diagnostic tool for associations between quantitative variables. It is invaluable in correlation analysis for readily assessing the nature and the shape of the relationship between two variables.



The scatterplot is also useful in other areas of multidimensional analysis, e.g., in linear regression when diagnosing outliers, or when assessing the quality of derived groups, e.g., in cluster analysis.

The scatterplot also performs important presentational functions. With this tool, it is possible to identify groups of objects of similar value, distinguish segments of interest, and describe regularities (or irregularities) present in the data set. It can, for example, be used to present brand position (e.g., based on brand awareness in a market segment), prepare a visualization of the relationship between two variables, or position target groups.

There are numerous reasons to enrich our two-dimensional scatterplot with additional information. A quick and easy way to achieve this is by using the scatterplot’s big brother, the multidimensional scatterplot available in PS IMAGO PRO.

In the multidimensional scatterplot in Fig. 1 we can distinguish our products from competitive products using color, present segments developed in the course of the cluster analysis using the shape of the data points on the chart, and represent the average product price using the relative size of the data point.

Figure 1. Sample multidimensional scatterplot


In one of our other blog posts, we used the scatterplot and distribution graph to look for the most favorable offers for used cars. Let’s go back to this example and analyze the offers of a sample car dealer. We have a database containing information about the vehicle brand, age, mileage, type of fuel, and engine capacity. Let’s try to analyze how the vehicle’s age relates to its price and mileage.

It is not surprising that a vehicle’s age affects its price. Nor is it surprising that the relationship is negative (after excluding vintage cars from the analysis): a vehicle gets cheaper with age. To assess the nature of this relationship, we choose Multidimensional scatterplot, located in Predictive Solutions > Graphs.

Figure 2. Relationship between vehicle age and price

As we can see in the above visualization, the initial hypothesis has been confirmed: a car’s price decreases with its age. However, the graph also shows that the price does not decrease evenly. This is easier to see thanks to the line of best fit, added using the LOESS fit method in the chart edit menu. For newer vehicles, the yearly price decrease is steeper than for older vehicles. In short, the older a car gets, the more slowly it loses value.
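
To make the idea of a LOESS line more concrete, here is a minimal sketch of a locally weighted linear smoother in plain Python. It is a simplified illustration, not the procedure PS IMAGO PRO uses internally: the function names (`tricube`, `loess`), the `frac` parameter, and the age and price figures are all assumptions invented for this example.

```python
def tricube(u):
    """Tricube kernel: weight is 1 at u = 0 and falls to 0 for |u| >= 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def loess(xs, ys, frac=0.5):
    """Return one smoothed y per x, each from a locally weighted linear fit."""
    n = len(xs)
    k = max(2, int(round(frac * n)))  # size of each local neighbourhood
    fitted = []
    for x0 in xs:
        # The distance to the k-th nearest point sets the local bandwidth.
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1e-12
        w = [tricube((x - x0) / h) for x in xs]
        sw = sum(w)
        # Weighted least-squares line through the neighbourhood of x0.
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


# Hypothetical data: price (in thousands) falling by 15% per year of age.
ages = list(range(1, 16))
prices = [100 * 0.85 ** a for a in ages]
smooth = loess(ages, prices, frac=0.5)

# Year-over-year drop of the smoothed line for young and for old cars.
drop_young = smooth[0] - smooth[1]
drop_old = smooth[-2] - smooth[-1]
```

On data of this shape, `drop_young` comes out larger than `drop_old`, which is the pattern described above: the smoothed line is steep for new cars and flattens for old ones. A smaller `frac` follows the data more closely, while a larger one gives a smoother line.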

Now let’s look in greater detail at the possibilities of this visualization.

Figure 3. Multi-dimensional scatterplot wizard

The multidimensional scatterplot in PS IMAGO PRO allows you to consider up to three additional dimensions compared to the standard scatterplot (that’s five dimensions in total). A qualitative factor can be represented by the shape or the color of the data points: move the variable to the Shape field or to the Color field, respectively. The multidimensional scatterplot also allows you to use an additional quantitative variable in the Size field. Finally, the Options menu lets you define the color palette of the graph, the user template, and the graph title.
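
For readers who want to see what this encoding amounts to outside the wizard, the sketch below maps five variables onto position, color, shape, and size in plain Python. It is only an illustration under stated assumptions: the helper names (`assign_levels`, `scale_sizes`, `encode`), the palette, the marker codes, the size range, and the four sample cars are hypothetical and do not come from PS IMAGO PRO or from the data set discussed here.

```python
PALETTE = ["blue", "orange", "green"]  # assumed colors for qualitative levels
MARKERS = ["o", "s", "^"]              # assumed shapes: circle, square, triangle


def assign_levels(categories, options):
    """Give each distinct category the next unused entry of options."""
    mapping = {}
    for c in categories:
        if c not in mapping:
            mapping[c] = options[len(mapping) % len(options)]
    return mapping


def scale_sizes(values, smallest=20.0, largest=200.0):
    """Rescale a quantitative variable linearly to a range of point sizes."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [(smallest + largest) / 2] * len(values)
    return [smallest + (v - lo) / (hi - lo) * (largest - smallest) for v in values]


def encode(rows, x, y, color=None, shape=None, size=None):
    """Turn records into drawable points: two positions plus up to three extras."""
    colors = assign_levels([r[color] for r in rows], PALETTE) if color else {}
    markers = assign_levels([r[shape] for r in rows], MARKERS) if shape else {}
    sizes = scale_sizes([r[size] for r in rows]) if size else [60.0] * len(rows)
    return [
        {
            "x": r[x],
            "y": r[y],
            "color": colors[r[color]] if color else PALETTE[0],
            "marker": markers[r[shape]] if shape else MARKERS[0],
            "size": s,
        }
        for r, s in zip(rows, sizes)
    ]


# Hypothetical records; the field names are illustrative only.
cars = [
    {"age": 2, "price": 60, "region": "Europe", "fuel": "petrol", "capacity": 1.4},
    {"age": 5, "price": 45, "region": "Asia", "fuel": "diesel", "capacity": 2.0},
    {"age": 9, "price": 20, "region": "America", "fuel": "petrol", "capacity": 3.0},
    {"age": 4, "price": 50, "region": "Asia", "fuel": "petrol", "capacity": 1.6},
]
points = encode(cars, x="age", y="price", color="region", shape="fuel", size="capacity")
```

Each entry of `points` carries everything a plotting library needs to draw one data point. Qualitative variables go to color and shape because those channels have no natural order, while the quantitative variable goes to size, where larger values read as larger points.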

In the example below, I have chosen Region as the color variable. The variable describes the vehicle model’s country of origin.

Figure 4. Relationship between age and price subject to the country of origin

The cars on offer include European, Asian and American models (here we ignore the actual location of the corporation and actual country of production). We notice an interesting relationship: Regardless of age, Asian cars generally have higher prices than vehicles of the same age from other parts of the world. Interestingly, American cars are generally valued slightly lower, so if we are looking for “just a car” and look only at price and the year of production, we can save some money by choosing an American car. Let’s now look deeper into the reasons for such a pricing strategy.


Let’s now analyze the vehicles’ mileage. This is one of the key elements when evaluating the wear level of the car being bought. In general, a vehicle’s mileage depends on its age, though of course there are additional factors that can affect this relationship. Take, for example, the car’s intended use: company cars are used more intensively than private cars or cars serving as the second vehicle in a household. In our data set we do not have information about how the previous owner used the vehicle. However, we do have a variable with information about engine capacity, so let’s formulate the hypothesis that a car’s engine capacity also affects mileage, regardless of age.

Vehicles with large engines, often better equipped and simply more expensive, will more often than not serve as the first car in a household or as a company car, which, in turn, results in higher mileage. In addition, cars with larger engines are more often bought by automotive enthusiasts, who simply drive more. Vehicles with large engines are also frequently large cars, which are better suited to longer journeys. All this may result in such cars being used more than small urban vehicles. Let’s analyze the relationship between these three variables using the matrix scatterplot available in PS IMAGO PRO (menu Graphs > Legacy Dialogs > Scatter/Dot).

Figure 5. Matrix graph: relationship between age, mileage and engine capacity

While car age is not related to engine capacity in our data set, mileage and engine capacity show a strong linear dependence (linear correlation coefficient = 0.785). Mileage is also related to the vehicle’s age but, interestingly, the correlation between these variables is much lower (0.411). The reason for this phenomenon may be the presence in the data set of several relatively young vehicles with high mileage, which we can observe in the matrix scatterplot.
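
The linear correlation coefficients quoted above are Pearson coefficients computed for each pair of variables. As a hedged sketch of that calculation, the plain-Python example below builds a small correlation matrix. The function names and the six sample vehicles are assumptions made up for illustration, so the output will not reproduce the 0.785 and 0.411 reported for the real data set.

```python
from math import sqrt


def pearson(a, b):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)


def correlation_matrix(columns):
    """Correlate every pair of named columns; returns a nested dict."""
    names = list(columns)
    return {r: {c: pearson(columns[r], columns[c]) for c in names} for r in names}


# Hypothetical values for six vehicles (mileage in thousands of km).
data = {
    "age": [2, 3, 5, 6, 8, 10],
    "mileage": [90, 40, 70, 150, 110, 160],
    "capacity": [2.0, 1.2, 1.4, 2.5, 1.8, 2.2],
}
matrix = correlation_matrix(data)

# Print the matrix in the same row and column layout as a matrix scatterplot.
for row in data:
    print(row.ljust(10), "  ".join(f"{matrix[row][col]:6.3f}" for col in data))
```

The matrix is symmetric with ones on the diagonal, so only the cells on one side of the diagonal carry information, exactly as in the matrix graph.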

To illustrate the relationship of the variables being discussed using a single graph, let’s use the Multi-dimensional scatter graph again. This time, let’s move the Capacity variable to the Size field, with the Years variable on the X-axis and the Vehicle Mileage variable on the Y-axis.

Figure 6. Relationship between age, mileage and engine capacity

On the graph we notice a group of vehicles that are relatively young but have a relatively high mileage: they stand out from the general linear dependence between car age and mileage, and, as we can see, these are cars with larger engine capacity. Perhaps these are delivery vehicles or former company cars (e.g., used by sales representatives). We can also observe on the graph that cars with a large engine capacity have a considerably higher mileage than other vehicles of the same age.
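
The group that stands out from the linear trend can also be picked out numerically. One simple way, sketched below in plain Python, is to fit a straight line of mileage against age and flag vehicles whose mileage lies well above it. This is an illustrative approach rather than a feature of PS IMAGO PRO: the function names, the 1.5 threshold, and the ten sample vehicles are assumptions chosen for the example.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def above_trend(xs, ys, threshold=1.5):
    """Indices of points lying more than threshold residual spreads above the line."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    spread = sqrt(sum(r * r for r in residuals) / len(residuals))
    return [i for i, r in enumerate(residuals) if r > threshold * spread]


# Hypothetical vehicles: the last two are young but heavily used.
ages = [1, 2, 3, 4, 5, 6, 7, 8, 2, 3]
mileage = [15, 30, 45, 60, 75, 90, 105, 120, 110, 125]  # thousands of km
capacity = [1.2, 1.4, 1.4, 1.6, 1.2, 1.6, 1.4, 1.6, 2.5, 3.0]

flagged = above_trend(ages, mileage)
flagged_capacity = [capacity[i] for i in flagged]
```

In this made-up sample, `flagged` holds the positions of the two young, high-mileage vehicles, and `flagged_capacity` shows that they are the ones with the largest engines. Note that the unusual points themselves pull the fitted line upward, so with real data the threshold usually needs some tuning.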

To sum up: the additional visualization has allowed us to quickly find further characteristics affecting vehicle price. The multidimensional scatterplot available in PS IMAGO PRO allows you to easily present multidimensional relationships between variables. It can serve both as an interesting form of visualization and as a useful tool to support an analyst working with multidimensional techniques.
