STATISTICAL INFERENCE

By NATALIA GOLONKA (Predictive Solutions)

Statistical inference is the branch of statistics that makes it possible to describe, analyse and draw conclusions about a whole population on the basis of a sample.

Studying an entire population can be very difficult, sometimes even impossible. If we want to study the marshals of the provinces of Poland, for example, collecting data from 16 people (i.e., the entire population of marshals) is quite feasible. Usually, however, we want our conclusions to be more universal and practical, so the population studied will be a much larger group of people. This is where the difficulty arises. If we wanted to analyse, for example, the habits of Poles, then reaching every single citizen with the research tool would consume a gigantic amount of resources. A complete survey can, however, be replaced by a sample survey. Thanks to statistical inference, we can assess how far results obtained from the sample we actually have access to can be generalised to the entire population, and how closely the conclusions drawn correspond to reality.

Methods for generalising results include two broad groups: estimation, in which unknown values of distribution parameters are estimated, and statistical hypothesis verification, in which specific conjectures about the distribution of the variables under study are tested.

STATISTICAL ESTIMATION

Estimation is the process of approximating certain parameters of the distribution of a variable in a population on the basis of data from a sample. Such a parameter may be the mean, the variance or another numerical characteristic. For example, if we know the average time taken to perform a given service in the surveyed sample, we can use it to estimate the average time taken to perform that service in the entire population. This makes it possible, for example, to determine how many customers a given banking institution can handle in one working day while still ensuring that each person is served by an advisor.

Depending on the method chosen, estimation can be divided into point and interval estimation. Point estimation consists in determining a single number (the estimate, computed with a rule called the estimator) that best represents the unknown parameter in the population. In our example, this could be a value of 21 minutes for the average time taken to serve a customer in the population. While it is convenient to obtain a single, specific number, the disadvantage of this method is that we do not know how precise the value obtained is: does the true value plausibly lie between 19 and 23 minutes, or between 2 and 40 minutes?

Interval estimation consists in determining an interval that covers the unknown population parameter with a given probability. The analyst chooses a confidence coefficient (confidence level), 1−α, which is the probability that the procedure yields an interval containing the true value. The resulting interval is called the confidence interval. The larger the confidence coefficient, the wider the confidence interval will be. If, for example, we want to estimate the average age of the recipients of the web content we create, an interval of 0-100 years will give us almost 100% confidence in the result. The trade-off in such a situation is, of course, the precision of the estimate. Although narrowing the confidence interval makes it less likely that the true value lies within it, such an interval is far more useful; knowing that the average recipient is likely to be aged 0-100 is too general to use in practice. If we can narrow this range down to, say, 25-35 years, such information will, among other things, allow us to adopt better-targeted sales strategies.
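The trade-off between confidence and precision can be illustrated with a minimal sketch. The service times below are hypothetical, and the function `mean_confidence_interval` is an illustrative helper that uses a normal approximation; for a sample this small a t-distribution quantile would usually be preferred.

```python
import math
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for a population mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Estimated standard error of the mean: s / sqrt(n)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value for confidence coefficient 1 - alpha
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return mean - z * std_err, mean + z * std_err


# Hypothetical customer service times, in minutes
service_times = [18.5, 22.0, 19.5, 24.0, 21.0, 20.5,
                 23.5, 19.0, 22.5, 20.0, 21.5, 20.0]

for level in (0.90, 0.95, 0.99):
    low, high = mean_confidence_interval(service_times, level)
    print(f"{level:.0%} interval: {low:.2f} to {high:.2f} minutes")
```

Running the loop shows the interval widening as the confidence coefficient grows, which is the trade-off described above.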

LIMITATIONS OF ESTIMATION

Both point and interval estimation are, unfortunately, subject to a certain degree of error: an estimate based on a sample will practically never give 100% certainty of obtaining the correct result.

In the case of point estimation, the result is the estimate: the single value of a given parameter calculated from the sample. If we know the value of this parameter in the population, we can calculate the estimation error by subtracting the true value from the estimate. However, when we do not have such information about the population as a whole, the quality of the point estimate is usually assessed using the standard error. The standard error measures how much the estimates obtained from different samples of the same size would vary; in other words, it is the standard deviation of the sampling distribution of the estimator.
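The idea of the standard error can be sketched with a small simulation. Everything here is an assumption made for illustration: a synthetic population of service times with mean 21 and standard deviation 4 minutes, a sample size of 50, and 2,000 repeated samples. The sketch compares the spread of the sample means with the estimate s/√n obtained from a single sample.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: service times with mean 21 and SD 4 minutes
population = [rng.gauss(21.0, 4.0) for _ in range(100_000)]
n = 50

# Draw many samples and record the point estimate (sample mean) from each
sample_means = [statistics.fmean(rng.sample(population, n))
                for _ in range(2000)]

# Spread of the estimates across repeated samples
empirical_se = statistics.stdev(sample_means)

# In practice only one sample is available, so the standard error
# is estimated from that sample alone as s / sqrt(n)
one_sample = rng.sample(population, n)
estimated_se = statistics.stdev(one_sample) / math.sqrt(n)

print(f"spread of sample means:      {empirical_se:.3f}")
print(f"estimate from single sample: {estimated_se:.3f}")
print(f"theoretical sigma / sqrt(n): {4.0 / math.sqrt(n):.3f}")
```

The three printed values should be of similar magnitude, which is why a single sample is enough to gauge the precision of a point estimate.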

In interval estimation, the magnitude of this error depends on the confidence coefficient mentioned earlier. Typically, confidence intervals are constructed so that there is a 95% probability of the interval containing the true value of the population parameter, but coefficients of 90% or 99% are also common. The choice depends primarily on the nature of the data at hand and on how costly a wrong conclusion would be.

So what should the data be like for the estimate to be as close as possible to the actual value in the population? A key task for the researcher here is appropriate sampling. First of all, the sample must be selected from the population at random. The second very important aspect is that it should be of an appropriate size. Only when the sample is representative will the conclusions drawn from it be subject to fewer errors and closer to the actual results in the population.

VERIFICATION OF STATISTICAL HYPOTHESES

The second branch of statistical inference is the verification of statistical hypotheses. It allows assumptions about a population to be tested against a statistical sample extracted from it.

The first step in the process of verifying statistical hypotheses is, of course, to formulate them properly. It is customary to pose two hypotheses for each inference: the null hypothesis, which assumes that there are no differences (e.g., between groups, measurements or distributions), and the alternative hypothesis, which contradicts it. In the second step, a statistical test should be selected that is appropriate to the hypotheses being tested and the data available. The further steps depend on the chosen approach.

Frequentist inference is the most commonly used approach to the problem of statistical hypothesis verification. Once the significance level has been set, a test statistic is calculated and, based on this, a p-value, which you can read more about in the article on statistical significance. Knowing the p-value, a decision can be made either to reject the null hypothesis or to conclude that there are no grounds for rejecting it. This approach allows us to control the rate of decision errors: by adopting a significance level of α = 0.05 we accept that, when the null hypothesis is in fact true, we will wrongly reject it in about 5 out of 100 inferences (a type I error), while reducing the significance level to α = 0.001 lowers this to about 1 in 1,000.
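The link between the significance level and the error rate can be checked with a short simulation. This is a sketch under stated assumptions: the data are synthetic, the null hypothesis (a mean service time of 21 minutes) is true by construction, and a two-sided z-test with a known standard deviation of 4 minutes is used for simplicity. The function name `z_test_p_value` is illustrative.

```python
import math
import random
from statistics import NormalDist, fmean


def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value of a z-test for H0: population mean == mu0."""
    z = (fmean(sample) - mu0) / (sigma / math.sqrt(len(sample)))
    return 2 * (1 - NormalDist().cdf(abs(z)))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
alpha = 0.05
trials = 10_000
false_rejections = 0

for _ in range(trials):
    # The null hypothesis is true here: the real mean is 21 minutes
    sample = [rng.gauss(21.0, 4.0) for _ in range(30)]
    if z_test_p_value(sample, mu0=21.0, sigma=4.0) < alpha:
        false_rejections += 1  # type I error: a true H0 was rejected

rate = false_rejections / trials
print(f"share of wrongly rejected true hypotheses: {rate:.3f}")
```

The printed share should be close to the chosen α, illustrating that the significance level is the long-run rate of rejecting a null hypothesis that is actually true.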

Another way to verify hypotheses is to use Bayesian inference. This approach goes beyond purely frequentist statistics by adding a subjective element to the process: the prior (a priori) probability. Bayesian statistics allows existing beliefs to be updated on the basis of new data. Prior beliefs can be based on previous research results, but also on expert knowledge or even intuition. Data collected later allow these beliefs to be revised, which yields the posterior (a posteriori) probability. To understand this with an example, let us imagine that we want to predict a person’s life expectancy. Based on life expectancy from the CSO’s 2021 report, we could assume a priori that this age would be 75.6 years. However, if we have additional information about the person, such as their health status, lifestyle or genetic predisposition, we can use Bayesian methods to predict their life expectancy more accurately.

Using Bayesian statistics, we therefore update our belief about a person’s life expectancy on the basis of the data collected, obtaining the posterior probability. Depending on the available information, our estimate may change: the posterior belief may indicate a higher life expectancy, e.g., 80 years, given the person’s healthy lifestyle and no family history of chronic disease. The essence of Bayesian statistics is to continually update our beliefs on the basis of new data, allowing us to make increasingly accurate predictions and better decisions.
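One simple way to picture such an update is the conjugate normal model, sketched below. Only the prior mean of 75.6 years comes from the example above; the prior standard deviation, the observed lifespans of a reference group with a similar profile, and their assumed spread are all hypothetical values chosen for illustration, as is the function name `normal_posterior`.

```python
import math


def normal_posterior(prior_mean, prior_sd, observations, obs_sd):
    """Update a normal prior on a mean, given normal data with known SD."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = len(observations) / obs_sd ** 2
    post_var = 1 / (prior_precision + data_precision)
    # Posterior mean: precision-weighted average of prior mean and data
    post_mean = post_var * (prior_mean * prior_precision
                            + sum(observations) / obs_sd ** 2)
    return post_mean, math.sqrt(post_var)


prior_mean = 75.6  # a priori life expectancy from the example above
prior_sd = 5.0     # hypothetical uncertainty about that prior

# Hypothetical lifespans of people with a similar healthy profile
similar_people = [82.0, 79.0, 84.0, 80.0]
obs_sd = 10.0      # hypothetical spread of individual lifespans

post_mean, post_sd = normal_posterior(prior_mean, prior_sd,
                                      similar_people, obs_sd)
print(f"prior:     {prior_mean:.1f} years (SD {prior_sd:.1f})")
print(f"posterior: {post_mean:.1f} years (SD {post_sd:.1f})")
```

The posterior mean lies between the prior belief and the average of the new data, and the posterior standard deviation is smaller than the prior one: the belief shifts towards the evidence and becomes more certain as data accumulate.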

Although frequentist and Bayesian inference are the most commonly used methods of statistical hypothesis verification, other approaches are also available. These include likelihood-based inference (e.g., the likelihood ratio test), which relies on maximising the likelihood function, and the Akaike information criterion (AIC), which is grounded in information theory and compares statistical models in terms of the balance between fit to the data and model complexity.
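Both ideas can be sketched on the same small data set. The service times and the claimed mean of 25 minutes are hypothetical. Two normal models are compared: one with the mean fixed at the claimed value and one with the mean estimated from the data. The p-value relies on the asymptotic chi-square approximation with one degree of freedom, which is only rough for a sample this small.

```python
import math
from statistics import NormalDist, fmean


def normal_log_likelihood(data, mu, var):
    """Log-likelihood of the data under a normal(mu, var) model."""
    n = len(data)
    squared_dev = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * var) - squared_dev / (2 * var)


def aic(log_likelihood, n_params):
    """Akaike information criterion: lower values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical customer service times, in minutes
service_times = [18.5, 22.0, 19.5, 24.0, 21.0, 20.5,
                 23.5, 19.0, 22.5, 20.0, 21.5, 20.0]
claimed_mean = 25.0  # hypothetical claim to be checked

# Restricted model: mean fixed at the claimed value, variance fitted
var_fixed = fmean([(x - claimed_mean) ** 2 for x in service_times])
ll_fixed = normal_log_likelihood(service_times, claimed_mean, var_fixed)

# Full model: both mean and variance fitted by maximum likelihood
mu_hat = fmean(service_times)
var_free = fmean([(x - mu_hat) ** 2 for x in service_times])
ll_free = normal_log_likelihood(service_times, mu_hat, var_free)

# Likelihood ratio statistic; a chi-square(1) tail equals P(|Z| > sqrt(stat))
lr_stat = max(2 * (ll_free - ll_fixed), 0.0)
p_value = 2 * (1 - NormalDist().cdf(math.sqrt(lr_stat)))

print(f"AIC, mean fixed at {claimed_mean}: {aic(ll_fixed, 1):.2f}")
print(f"AIC, mean estimated:     {aic(ll_free, 2):.2f}")
print(f"likelihood ratio statistic: {lr_stat:.2f}, p-value: {p_value:.2g}")
```

Here the model with the estimated mean has the lower AIC despite its extra parameter, and the likelihood ratio test points the same way, so both criteria speak against the claimed mean for these illustrative data.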

SUMMARY

Statistical inference plays a key role in data analysis, enabling accurate and reliable conclusions to be drawn from a sample. Through appropriate hypothesis testing methods and the construction of confidence intervals, the analyst can make decisions based on a sound numerical basis. Introducing a rigorous statistical approach to analysis contributes to a better understanding of phenomena and supports the development of effective action strategies in a variety of fields, whether in customer satisfaction surveys, evaluation of marketing strategies, prediction of adverse events, or verification of the efficacy of newly developed drugs.