Risk Classification and Prediction: A Logistic Regression
Approach for Analyzing Property Risk Classes in Insurance
Companies
Reem Adel Abdallah
Department of Industrial Engineering and Engineering Management
University of Sharjah, Sharjah 27272
United Arab Emirates
Doraid Dalalah, PhD
Associate Professor, Coordinator of the PhD Program
Department of Industrial Engineering and Engineering Management
University of Sharjah, Sharjah 27272
United Arab Emirates
ddalalah@sharjah.ac.ae
Abstract
Risk exists in every aspect of a business; it cannot be eliminated, only reduced to an attainable level through
effective risk assessment techniques. Attitudes toward risk differ: some people are risk-seeking, while the
majority are risk-averse. In the insurance industry in particular, risk is traded: insurance providers offer a shield
against the consequences of risk and the likelihood of loss, so that risk is transferred from the insured entity to
the insurer for a given premium. In this research, a model for risk classification is proposed for property lines of
insurance. The proposed model is validated using data collected from case studies of an insurance company in the
United Arab Emirates (UAE). The model is expected to serve as a tool that helps provide better estimates of risk for
various properties.
Keywords
Risk, Insurance, Property, Classification and Regression
1. Introduction
Risk in the financial industry refers to the variability in the anticipated outcome of a business activity and the
chances associated with the payoffs resulting from each outcome. It describes the amount of deviation from the
expected and planned situation; the greater the deviation, the greater the risk to be faced or tackled. When
companies face decisions that involve risk, they may either take the risk through further involvement or avoid it
by disengaging from any new action. Avoiding the risk means preferring the status quo to a project with
probabilistic outcomes. Whether a company is risk-averse or risk-taking depends on the resources available, the
variability in payoffs, the possible states of nature, gains, losses, and risk attitudes.
There are two groupings of risk: systematic and unsystematic. Systematic risk refers to the external factors that
influence a company’s investment and put it at risk, while unsystematic risk refers to the assets that may
influence the capabilities of a company or investor when a risk occurs. Risks come from various sources, such as
liquidity, sovereign, business, or insurance risk. This research focuses on risk in insurance, and its various types
are discussed for the purpose of classifying risk in insurance companies.
The insurance industry helps safeguard companies and businesses from the various risks that could occur every day.
It gives the insured entity a financial shield to recover a loss when unpredicted events occur. Moreover, it helps
people manage and predict risks in order to keep them at a minimum. Insurance providers are constantly facing
Proceedings of the International Conference on Industrial Engineering and Operations Management
Nsukka, Nigeria, 5 - 7 April, 2022
IEOM Society International
128
challenges in estimating the required amount of coverage for non-life insurance policies, as many factors
contribute to changing levels of risk. For instance, in the UAE, the majority of insurance companies are
continually seeking a robust technique not only to manage risks but also to predict them beforehand. This study
addresses the problem of classifying risk, particularly in property lines of insurance. The research aims to provide
insurance companies with a reliable model that can be used in the early stages of insurance risk estimation,
leading to efficient decisions that enhance the financial performance of the company while reducing the risks
carried.
1.1 Objectives
This research aims to construct a risk classification model for various properties using binary logistic
regression. Additionally, it describes the best practices employed in the UAE insurance industry, which can
help improve the customer experience and enhance the company’s profits. Finally, it aims to minimize potential
losses through effective risk prediction models.
2. Literature Review
New business opportunities are emerging with the Internet of Things (IoT), which enables the market to gather
better data that can be used to improve risk prediction in the insurance industry (Baecke and Bocca 2017). Data
mining and risk assessment methods were applied in motor insurance, resulting in enhanced risk control and
management for the company, since it was able to modify the policy conditions based on the client’s real needs. In
other words, the best way to manage risk and obtain more efficient results is to customize the insurance coverage
and integrate it with the client’s usage. Additionally, this implementation improved the speed at which insurance
quotations were proposed to the customer, since the method works efficiently without large historical data to
predict risks. Logistic regression is one of the simplest machine learning algorithms. It has a binary dependent
variable (0, 1) and is widely used for classifying cases into binary outputs (A/B, 0/1, P/F) on the basis of one or
more independent variables, or predictors. Each predictor is checked for how well it predicts the outcome while
controlling for the effect of the other predictors in the model.
De Menezes et al. (2017) suggested integrating the logistic regression model with boosting to improve the
accuracy of the model, since logistic regression does not account for noise in the data. Therefore, expert systems
are most suitable for such complex scenarios, where boosting is applied progressively by fitting a classification
model to re-weighted versions of the training data. Their analysis compared the traditional approach, estimated by
maximum likelihood, with logistic regression estimated through boosting for binary classification. This was
implemented on Coronary Heart Disease (CHD) data as a function of multiple biological parameters gathered from
patients, with the intention of determining the presence or absence of CHD. The results showed that the boosted
model was stronger than the traditional approach: it performed better in terms of area under the curve,
responsiveness, and precision, and it reduced the false alarm rate. Furthermore, the application of logistic
regression extends to building prediction models for COVID-19 patients, where it was used to determine mortality
rates in China based on age and the time taken for triage (Josephus et al., 2021). The model demonstrated around
90% accuracy in predicting the probability of mortality, revealing that age is the highest contributing factor.
In the insurance industry, the company offers indemnity to the insured if the insured event occurs, in exchange for
a given premium. In life and health insurance, the risk faced is the amount of money to be paid as a result of an
unfortunate event such as injury or death. Grant et al. (2018) proposed the use of predictive models for enhancing
risk prediction in the healthcare industry. The study applied a logistic regression model to cardiothoracic
surgeries, outlining the probable risks that could materialize and allowing for better prediction prior to their
occurrence. Moreover, as the healthcare system is profoundly dependent on the amounts estimated and paid by
insurers, it is critical to determine such amounts accurately, as they directly impact the healthcare system’s
performance. A recent study addressed the need for a better risk prediction model by putting forward a model that
detects the essential factors in determining life insurance prices, using an Adaptive Neuro-Fuzzy Inference System
(ANFIS) to capture the non-linear relationships in the data (Mladenovic et al., 2020). The outcomes of the study
showed that the most influential factor in life insurance pricing was smoking. Wang et al. (2021) combined
classification with a logistic regression model in order to examine the risk factors for e-bike
users. The study collected relevant risk data over a period of three years, such as users’ behavior, unacceptable
conduct of users, and environmental factors. The users were divided into five categories by the classification tree
analysis; the category of non-professional users above the age of 55 in suburban areas was associated with the
highest probability of damage when compared with the other categories. The logistic regression further revealed
that risk elements such as highways repeatedly affected the severity of damage across the different categories.
When crashes on highways were compared with those on lower speed routes, the probability of damage for e-bike
users increased from approximately 9% to around 42%. Dong and Chan (2013) explored the dynamic modeling of
long-tail loss reserving data, applying a state space mean model along with the Beta distribution to CTP loss
reserving data under the effect of legislation shifts. The paper also discussed the Beta distribution within a
framework that integrates individual loss data, revealing heterogeneous traits and allowing the parameters to
change across groups. In contrast, a model for comparing confidence bands and managing credit risk was proposed
by Kiatsupaibul et al. (2017); it incorporates inferences on the parameters alpha and beta in order to calculate
the upper confidence level.
In summary, the application of logistic regression over the years extends to many applications and purposes.
This review has shown the power of logistic regression in classifying and categorizing risks in various industries,
including the health sector. However, there is a gap in applying this methodology to the property lines of insurance
companies in the UAE, even though the insurance industry needs such an implementation to enhance its own
performance along with the financial health of the country. This paper therefore focuses on constructing and
implementing regression models for predicting risks in insurance companies, property lines in particular.
3. Methods
In this paper, data from a UAE-based insurance company is assessed with the aim of accurately and effectively
classifying properties into risk categories (high risk, low risk), in order to improve financial performance
with respect to how companies assess and predict risks. Moreover, the proposed model is compared against the
traditional approach currently implemented, in order to validate the outcomes and demonstrate the effectiveness of
machine learning in risk prediction and classification models. A binary logistic regression model is used to predict
how risky a property is, given the input parameters provided by the insured to the insurer. The logistic regression
estimates the probability that the property falls within one of the two categories (A being less risky and B being
more risky) and assigns the property to a category accordingly. Furthermore, the risk categories can then be used to
identify the risk profiles and the Mean Damage Ratio (MDR), which aid in forecasting the expected damage under a
risk-neutrality assumption.
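As a minimal sketch of how a predicted probability could feed an expected-damage estimate under risk neutrality, the Python snippet below weights a Mean Damage Ratio for each class by the model's probability of class B. The function name, the MDR values, and the sum insured are hypothetical and are not taken from the company's data; the probability-weighted form is only one possible reading of the risk-neutrality assumption.

```python
def expected_damage(sum_insured, p_risky, mdr_safe, mdr_risky):
    """Probability-weighted expected damage for one property.

    sum_insured : insured value of the property (hypothetical input)
    p_risky     : model probability that the property is class B (risky)
    mdr_safe    : Mean Damage Ratio assumed for class A, as a fraction
    mdr_risky   : Mean Damage Ratio assumed for class B, as a fraction
    """
    if not 0.0 <= p_risky <= 1.0:
        raise ValueError("p_risky must be a probability in [0, 1]")
    # A risk-neutral decision maker looks only at the expected value.
    expected_ratio = p_risky * mdr_risky + (1.0 - p_risky) * mdr_safe
    return sum_insured * expected_ratio


# Hypothetical figures, for illustration only.
example = expected_damage(
    sum_insured=1_000_000, p_risky=0.25, mdr_safe=0.02, mdr_risky=0.10
)
print(round(example, 2))
```

An alternative design would assign each property the MDR of its predicted class only; the weighted form above keeps the information carried by the probability itself.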
3.1 Data Collection
A sample of 100 records was collected from the company’s COPE (Construction, Occupancy, Protection and
Environment) surveys, covering the construction height, material of the property, business activity, fire protection
systems, susceptibility to natural disasters, age of the building, territory, and estimated maximum loss. Additionally,
the risk classification of each record was made available for the purpose of comparing the results of the model
against the actual classification made by the company’s experts. The samples are analyzed using SPSS binary
logistic regression with a cut probability of 0.7 (the SPSS default is 0.5). In this proposed model, A denotes a
less risky property and B a more risky property.
4. Model Analysis and Discussion
4.1 Model Description
The collected data was coded in SPSS. The age of the building is a continuous variable and therefore has a single
coefficient, whereas the other variables, such as the rise of the building or the exposure to natural disasters, are
categorical with levels (low, medium, high). The coding was made in order to make the data suitable for a binary
logistic regression. The case processing summary and the dependent variable coding are presented in Tables 1
and 2, respectively. The coding for the dependent variable indicates that cases labeled A (lower risk) have an
internal value of 0, while cases labeled B (higher risk) have an internal value of 1.
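The coding described above can be sketched in Python as follows. This is an illustration, not the SPSS procedure itself: the field names, the example record, and the choice of "low" as the reference level are assumptions made for the sketch.

```python
# Assumed level ordering; the first level ("low") is treated as the reference.
LEVELS = ("low", "medium", "high")


def encode_level(level, levels=LEVELS):
    """Indicator (dummy) coding: one 0/1 column per non-reference level."""
    if level not in levels:
        raise ValueError(f"unknown level: {level!r}")
    return [1 if level == other else 0 for other in levels[1:]]


def encode_record(record):
    """Turn one survey record into a numeric feature list and a 0/1 target."""
    features = [float(record["age"])]  # continuous: one column
    features += encode_level(record["natural_disasters"])  # categorical: two columns
    features += encode_level(record["building_rise"])  # categorical: two columns
    target = {"A": 0, "B": 1}[record["risk_class"]]  # A -> 0, B -> 1
    return features, target


# Hypothetical record, for illustration only.
sample = {
    "age": 12,
    "natural_disasters": "medium",
    "building_rise": "high",
    "risk_class": "B",
}
features, target = encode_record(sample)
print(features, target)
```

A continuous predictor contributes one coefficient, while a categorical predictor with three levels contributes two, one for each level other than the reference.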
Table 1. Case processing summary
Table 2. Dependent variable coding
Original Value    Internal Value
A                 0
B                 1
Table 3. Chi-square test results
4.2 Model Fit
Several tests were made to determine whether the constructed model is a good fit for the data. For example, the
Hosmer and Lemeshow test, shown in Table 3, is similar to a chi-square test but is interpreted differently: its null
hypothesis is that the model fits the data, so statistical significance would mean that the model does not fit the
data. The model has a P-value of 0.687, which is greater than alpha = 0.05 (P > 0.05); the null hypothesis is
therefore not rejected, and it is concluded that the model fits the data adequately. The class activity of the
properties included in this analysis is illustrated in Figure 1 and Table 4. The most frequent class activities are
Residential Building and Commercial Property, each making up 22% of the sample, while Tower is the least frequent
with only 9%, followed by Warehouse. A low frequency of a class activity may result in greater emphasis on the
classes of higher frequency. Therefore, we may expect more risk to be involved in commercial buildings as compared
to towers.
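To make the logic of the Hosmer and Lemeshow test concrete, the following Python sketch computes the statistic from observed outcomes and predicted probabilities. It is an illustration under stated assumptions, not a reproduction of the SPSS output: the equal-size grouping rule, the function names, and the toy data are hypothetical, and the closed-form tail probability used here is valid only for an even number of degrees of freedom.

```python
import math


def hosmer_lemeshow(y_true, p_pred, groups=10):
    """Return (statistic, degrees_of_freedom) for the Hosmer-Lemeshow test.

    y_true holds observed outcomes coded 0/1 and p_pred the predicted
    probabilities of outcome 1. Cases are sorted by predicted probability
    and split into `groups` bins of nearly equal size (assumed rule).
    """
    pairs = sorted(zip(p_pred, y_true), key=lambda pair: pair[0])
    n = len(pairs)
    statistic = 0.0
    used = 0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        used += 1
        size = len(chunk)
        observed = sum(y for _, y in chunk)  # observed events in the bin
        expected = sum(p for p, _ in chunk)  # expected events in the bin
        if 0.0 < expected < size:
            # Contribution of events plus contribution of non-events.
            statistic += (observed - expected) ** 2 / expected
            statistic += (observed - expected) ** 2 / (size - expected)
    return statistic, used - 2


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; exact for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * terms


# Toy data (hypothetical): 20 cases whose outcomes roughly match the predictions.
toy_p = [0.1] * 10 + [0.9] * 10
toy_y = [1] + [0] * 9 + [1] * 9 + [0]
stat, df = hosmer_lemeshow(toy_y, toy_p, groups=4)
p_value = chi2_sf_even_df(stat, df)
# A p-value above 0.05 means the hypothesis of adequate fit is not rejected.
print(stat, df, p_value, p_value > 0.05)
```

With ten groups the test has eight degrees of freedom, so the even-df shortcut applies to the usual configuration; a general implementation would rely on a statistics library instead.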
Figure 1. Percentage of each class activity.
Table 4. Class activity of property
Class activity         Frequency   Percent   Valid Percent   Cumulative Percent
Commercial Property    22          22.0      22.0            22.0
Factory                18          18.0      18.0            40.0
Hotel                  17          17.0      17.0            57.0
Residential building   22          22.0      22.0            79.0
Tower                  9           9.0       9.0             88.0
Warehouse              12          12.0      12.0            100.0
Total                  100         100.0     100.0
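The percent and cumulative percent columns of Table 4 follow directly from the frequencies. The short Python sketch below recomputes them; the counts are those reported in the table, while the function name is chosen here for illustration.

```python
# Frequencies as reported in Table 4.
counts = {
    "Commercial Property": 22,
    "Factory": 18,
    "Hotel": 17,
    "Residential building": 22,
    "Tower": 9,
    "Warehouse": 12,
}


def frequency_table(counts):
    """Return rows of (label, frequency, percent, cumulative percent)."""
    total = sum(counts.values())
    rows = []
    running = 0
    for label, count in counts.items():
        running += count  # accumulate counts, then convert to a percentage
        rows.append((label, count, 100.0 * count / total, 100.0 * running / total))
    return rows


rows = frequency_table(counts)
for label, count, percent, cumulative in rows:
    print(f"{label:<22}{count:>5}{percent:>8.1f}{cumulative:>8.1f}")
```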
4.3 Model Probability Analysis
A regression model's log-likelihood value is a measure of the model's quality of fit. The greater the
log-likelihood, the better the model matches the dataset. In general, the log-likelihood may range from negative
infinity to positive infinity; for a binary logistic model it cannot exceed zero. The likelihood summary of the
model is shown in Table 5, which summarizes the relationship between the predictors and the outcome. The
Nagelkerke R Square, which ranges from 0 to 1, has a value of 0.697 for this model, indicating a good fit.
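For readers who want to see how these quantities relate, the Python sketch below computes the log-likelihood of a binary model and the Nagelkerke R Square from it, using the standard definitions based on the Cox and Snell statistic. The function names and the toy data are hypothetical, and the small clipping constant is an implementation choice that keeps the logarithm finite.

```python
import math


def log_likelihood(y_true, p_pred, eps=1e-12):
    """Bernoulli log-likelihood of 0/1 outcomes under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def nagelkerke_r2(y_true, p_pred):
    """Nagelkerke R Square: Cox and Snell R Square rescaled to reach 1."""
    n = len(y_true)
    base_rate = sum(y_true) / n
    if base_rate in (0.0, 1.0):
        raise ValueError("both outcome classes must be present")
    # The null model predicts the overall event rate for every case.
    ll_null = log_likelihood(y_true, [base_rate] * n)
    ll_model = log_likelihood(y_true, p_pred)
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell


# Toy data (hypothetical), for illustration only.
toy_y = [0, 0, 0, 1, 1, 1]
toy_p = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
ll = log_likelihood(toy_y, toy_p)
r2 = nagelkerke_r2(toy_y, toy_p)
print(ll, r2)
```

A model that predicts only the overall event rate scores 0, and a model that separates the classes perfectly approaches 1.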
Table 5. Likelihood model summary
The general logistic formula is described as follows:
$$p(X) = \frac{1}{1 + e^{-(a_0 + a_1 X_1 + a_2 X_2 + \dots + a_n X_n)}} \qquad (1)$$
Given that p(X) refers to the probability that an insurance scenario is classified as risky rather than safe, the
input vector X is given by [X_1, X_2, …, X_n], a_0 is the constant term, and the set of coefficients
[a_1, a_2, …, a_n] will be optimized
by utilizing the statistical software. The fitted equation of the model is described in Table 6, which shows the
significant variables in the model: age of property, rate of fire exposure from neighboring buildings, rate of
damage from aircraft crossing over the property, rate of maintenance level, rate of housekeeping level, rate of fire
protection level, and rate of exposure to storms and floods. The variables in the equation show the regression
coefficients (B), that is, the predicted change in the log odds for every unit increase in the predictor. A positive
value of B represents a positive association, where the likelihood of the outcome increases with the predictor, while
a negative value means a decreasing likelihood, or a negative association. For instance, Age has a positive
association and is statistically significant (sig < 0.001), meaning that it has a great influence on the outcome of
the risk classification when compared to variables that are not significant, such as Estimated Maximum Loss of
Property, whose significance of 0.882 is greater than alpha (0.05) and which therefore has the lowest impact on the
decision of risk classification.
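The interpretation of a coefficient as a change in log odds can be illustrated with a small Python sketch. The intercept and the single age coefficient below are hypothetical values chosen for the example; they are not the estimates reported in Table 6, and the presence of a constant term is an assumption of the sketch.

```python
import math


def logistic_probability(intercept, coefficients, x):
    """Probability of the outcome coded 1 for one input vector x."""
    z = intercept + sum(a * xi for a, xi in zip(coefficients, x))
    # Numerically stable evaluation of 1 / (1 + exp(-z)).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def odds_ratio(coefficient):
    """Multiplicative change in the odds for a one-unit predictor increase."""
    return math.exp(coefficient)


# Hypothetical model with a single predictor (age in years).
intercept = -2.0
age_coefficient = 0.08

p_25 = logistic_probability(intercept, [age_coefficient], [25.0])
p_26 = logistic_probability(intercept, [age_coefficient], [26.0])
ratio = (p_26 / (1.0 - p_26)) / (p_25 / (1.0 - p_25))
# One extra year multiplies the odds by exp(0.08), whatever the starting age.
print(p_25, p_26, ratio, odds_ratio(age_coefficient))
```

A positive coefficient gives an odds ratio above 1 and a negative coefficient gives one below 1, which matches the positive and negative associations described above.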
Table 6. Variables in the equation
Table 7. Classification table
Figure 2. Classification plot
4.4 Model Classification and Discussion
The classification table lists the observed versus the predicted risk classification. The diagonal values
represent the hits, where the model predicted the risk classification accurately, while the off-diagonal values are
the misses. Out of the 100 data points, the model correctly predicted 89% of the cases and missed only 11%, as seen
in Table 7, which shows the power of using this model for prediction. The cut probability was set to 0.7, as it was
the value that gave the highest percentage of correct predictions. Lastly, the classification plot is illustrated in
Figure 2, where each symbol represents two cases. This plot gives a graphical representation of the risk classes
taken from the classification table, showing the observed group versus the predicted group and their frequencies
alongside their predicted probabilities. Any point falling beyond the cut probability of 0.7 is considered class B
(risky), while points below 0.7 are considered class A (safe). The 89% of points predicted accurately and the 11%
predicted incorrectly can also be seen from the plot. The points at the extreme left have the lowest
probabilities and are classified as class A (safe), while the ones at the extreme right carry the highest
probabilities and are classified as class B (risky).
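The construction of a classification table from predicted probabilities and a cut probability can be sketched in Python as follows. The function names, the candidate cut values, and the five toy cases are hypothetical; treating a probability strictly greater than the cut value as class B is an assumption that follows the wording above.

```python
def classification_table(observed, p_risky, cutoff=0.7):
    """Cross-tabulate observed classes against classes predicted at `cutoff`.

    Returns a dict keyed by (observed, predicted) and the share of hits.
    """
    table = {(o, p): 0 for o in ("A", "B") for p in ("A", "B")}
    for obs, prob in zip(observed, p_risky):
        predicted = "B" if prob > cutoff else "A"  # beyond the cut -> risky
        table[(obs, predicted)] += 1
    hits = table[("A", "A")] + table[("B", "B")]  # diagonal of the table
    return table, hits / len(observed)


def best_cutoff(observed, p_risky, candidates=(0.3, 0.4, 0.5, 0.6, 0.7, 0.8)):
    """Pick the candidate cut value with the highest share of hits."""
    return max(candidates, key=lambda c: classification_table(observed, p_risky, c)[1])


# Toy data (hypothetical), for illustration only.
toy_observed = ["A", "A", "B", "B", "B"]
toy_probs = [0.10, 0.75, 0.95, 0.72, 0.40]
table, accuracy = classification_table(toy_observed, toy_probs, cutoff=0.7)
chosen = best_cutoff(toy_observed, toy_probs)
print(table, accuracy, chosen)
```

Selecting the cut value on the same cases that are used to report accuracy tends to give an optimistic figure, so a held-out sample would be a natural refinement of this sketch.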
5. Conclusion and Future Research
In summary, the paper evaluated a binary logistic regression model for predicting the risks in the property lines of
an insurance company. The model displayed high performance and accuracy in predicting risk for a given set of input
variables describing the properties. It classified the majority of the cases (89%) correctly when the predicted
risk classes were compared with the observed ones. As a result, it can be concluded that this case study underlines
the benefits of using regression models in risk prediction, assisting companies in making better decisions on
whether to insure a property given the risk classification it falls under. Consequently, this can enhance their
financial performance along with the client experience, since the insurer will be able to give the best cover option
for the insured based on the outputs of the classification model. Future research can refit the model using the
significant variables only, to see how it responds in terms of correct predictions.
References
Baecke, P. and Bocca, L., The value of vehicle telematics data in insurance risk selection processes. Decision
Support Systems, vol. 98, pp. 69-79, 2017.
De Menezes, F., Liska, G., Cirillo, M., and Vivanco, M., Data classification with binary response through the
Boosting algorithm and logistic regression. Expert Systems With Applications, vol. 69, pp. 62-73, 2017.
Dong, A., and Chan, J., Bayesian analysis of loss reserving using dynamic models with generalized beta
distribution. Insurance: Mathematics And Economics, vol. 53, no. 2, pp. 355-365, 2013.
Grant, S., Collins, G. and Nashef, S., Statistical Primer: developing and validating a risk prediction model. European
Journal of Cardio-Thoracic Surgery, vol. 54, no. 2, pp. 203-208, 2018.
Josephus, B., Nawir, A., Wijaya, E., Moniaga, J., and Ohyver, M., Predict Mortality in Patients Infected with
COVID-19 Virus Based on Observed Characteristics of the Patient using Logistic Regression. Procedia
Computer Science, vol. 179, pp. 871-877, 2021.
Kiatsupaibul, S., Hayter, A., and Somsong, S., Confidence sets and confidence bands for a beta distribution with
applications to credit risk management. Insurance: Mathematics And Economics, vol. 75, pp. 98-104, 2017.
Mladenovic, S., Milovancevic, M., Mladenovic, I., Petrovic, J., Milovanovic, D., Petković, B., Resic, S. and
Barjaktarović, M., Identification of the important variables for prediction of individual medical costs billed by
health insurance. Technology in Society, vol. 62, p. 101307, 2020.
Wang, J., Song, H., Fu, T., Behan, M., Jie, L., He, Y., and Shangguan, Q., Crash prediction for freeway work zones
in real time: A comparison between Convolutional Neural Network and Binary Logistic Regression
model. International Journal Of Transportation Science And Technology, 2021.
Wang, Z., Huang, S., Wang, J., Sulaj, D., Hao, W., and Kuang, A., Risk factors affecting crash injury severity for
different groups of e-bike riders: A classification tree-based logistic regression model. Journal Of Safety
Research, vol. 76, pp. 176-183, 2021.
Biographies
Reem Adel Abdallah is a graduate student in the Department of Industrial Engineering and Engineering
Management at the University of Sharjah, UAE. She obtained her B.Sc. degree in Industrial Engineering and
Engineering Management from the University of Sharjah.
Doraid M. Dalalah received his B.Sc. in Mechanical Engineering from Jordan University of Science and Technology.
He received his master’s degree in Industrial Engineering from Jordan University in 1999. He worked as a
maintenance engineer at Jordan Cement Factories until 2000. He received his Ph.D. in industrial and system
engineering from Lehigh University, USA, in 2005. Currently, Dr. Dalalah is an associate professor in
Industrial Engineering and Engineering Management at the University of Sharjah.