MTH551: Writing a Report and Fitting a Markov Chain Model to Simulated Insurance Claims Data Using R or Excel

Investigate 3 themes and write up your findings in the form of a report. You should aim for a length of 1,500 words, suitably illustrated with diagrams produced using a statistical package (R) or Excel.

The report should have an Introduction and a Conclusion; the main body of the report should be divided into sections as appropriate.

There should be 2 files:

1. The report as a Word document (aiming for 1500 words)

2. The code and/or Excel file you used to generate your results, which must be appropriately documented so that a reader can follow what you did and why you did it. If using R, this should be in the form of a file with a .r extension, not the output from an RStudio session.

I have attached the assignment description for more details and the data.

**Assignment in Stochastic Modelling**

**Due Thursday 09 April 2020 at 6pm**

This task requires you to fit a Markov chain model to simulated insurance claims data. The data are in the file ‘Classification Scheme Data.csv’ (posted on Blackboard).

The Mastodon Insurance Company studies a cohort of 600 drivers, who were all below 25 years old at the start of the study. In each year the number of claims made by every driver was noted.

Mastodon operates a classification scheme with six discount levels, from level 0 (no discount, ie the driver pays the full premium) to level 5 (50% discount), with the discount increasing by 10 percentage points at each step. A policyholder who makes no claims in a year moves up one level (unless already at level 5); a policyholder who makes 1 or more claims moves down one level (unless already at level 0).
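The rules of the scheme can be written down in a few lines. The following is a minimal sketch in Python, used here purely for illustration (the same logic carries over directly to R or a spreadsheet formula); the names `N_LEVELS`, `STEP`, `discount` and `next_level` are illustrative and not part of the assignment.

```python
# Illustrative encoding of the existing scheme as described above.
N_LEVELS = 6   # discount levels 0..5
STEP = 0.10    # each level adds 10 percentage points of discount


def discount(level):
    """Discount as a fraction of full premium: level 0 -> 0.0, level 5 -> 0.5."""
    return STEP * level


def next_level(level, claims):
    """Up one level after a claim-free year, down one after one or more claims."""
    if claims == 0:
        return min(level + 1, N_LEVELS - 1)   # already at the top: stay there
    return max(level - 1, 0)                  # already at the bottom: stay there


print(next_level(3, 0), next_level(3, 2), next_level(5, 0), next_level(0, 1))
```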

Before the study began, the drivers were categorised using variables such as age, gender, postcode and miles driven per year. The categories reflect Mastodon’s expectation of the level of risk associated with that driver:

· Category A — very low risk, ie the best drivers

· Category B — low risk

· Category C — medium risk

· Category D — high risk, ie the worst drivers

## The data set

The data set supplied is a .csv file posted on Blackboard.

The top row of the data set provided consists of headers.

The category in which the driver was classified is the first column of the data set, which has the header “Category”.

The discount level in which the driver was located at the start of the study is the second column, which has the header “Initial”.

The remaining 15 columns give the number of claims in each of the years 1 through 15 of the study.
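As a rough sketch of how this layout might be read, the Python snippet below (illustrative only; R's `read.csv` would do the same job) splits each row into category, initial level and claim counts. The function name `read_claims`, the year-column headers in the sample and the sample rows themselves are made up for the example; only the "Category" and "Initial" headers come from the description above.

```python
import csv
import io


def read_claims(handle):
    """Return (categories, initial_levels, claims) from an open CSV handle."""
    reader = csv.reader(handle)
    next(reader)                        # skip the header row
    categories, initial, claims = [], [], []
    for row in reader:
        if not row:                     # ignore blank lines
            continue
        categories.append(row[0].strip())
        initial.append(int(row[1]))
        claims.append([int(v) for v in row[2:]])   # one count per study year
    return categories, initial, claims


# Made-up two-driver sample with only three claim years, for illustration.
sample = io.StringIO("Category,Initial,Y1,Y2,Y3\nA,2,0,0,1\nD,0,1,0,2\n")
categories, initial, claims = read_claims(sample)
print(categories, initial, claims)
```

For the real file, the handle would come from something like `open("Classification Scheme Data.csv", newline="")`, assuming the file sits in the working directory.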

## Notation

Denote by $X_{i,n}$ the discount level of the $i$th policyholder at time $n$, for $n$ = 0, 1, 2, …, 16 (time measured in years). I will refer to this as matrix X below.
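One way to build X is to start each row at the driver's "Initial" level and apply the move-up/move-down rule once per year of claims, so each row has one more entry than there are claim years. The Python sketch below is illustrative only; the function name `level_paths` and the two toy drivers are assumptions made for the example.

```python
def level_paths(initial, claims, n_levels=6):
    """Row i holds driver i's level at time 0 and after each year of claims."""
    paths = []
    for start, yearly in zip(initial, claims):
        path = [start]
        for c in yearly:
            level = path[-1]
            if c == 0:
                level = min(level + 1, n_levels - 1)   # claim-free year: up one
            else:
                level = max(level - 1, 0)              # any claims: down one
            path.append(level)
        paths.append(path)
    return paths


# Toy input: two drivers, three years of claims each (made-up numbers).
X = level_paths([2, 0], [[0, 0, 1], [1, 0, 2]])
print(X)
```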

## The task

The aim of this task is to investigate 3 themes and to write up your findings in the form of a report. You should aim for a length of 1,500 words, suitably illustrated with diagrams produced using a statistical package (R is fine) or Excel.

The report should have an Introduction and a Conclusion; the main body of the report should be divided into sections as appropriate.

When you submit you should upload two files to Blackboard:

1. The report, as a Word document

2. The code and/or Excel file you used to generate your results, which must be appropriately documented so that a reader can follow what you did and why you did it. If using R, this should be in the form of a file with a .r extension, not the output from an RStudio session.

## The 3 Themes

### Theme 1

Assume that the number of claims made by a customer each year forms a sequence of iid random variables, with common distribution being Poisson. You will investigate three possible cases regarding the Poisson parameter (*λ*):

1) λ is the same for all policyholders

2) λ is the same for all policyholders in a given category

3) each policyholder has his/her own value of λ

Use the data to investigate whether these assumptions are reasonable. In each case, choose λ by the method of moments. Then compare the three cases using the sum of squared errors: in each case you will have a 600 x 15 matrix of squared errors, which you sum over the entire matrix. Finally, use a likelihood ratio test of the null hypothesis that the 4 values of λ considered in case 2 are equal to each other.
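The calculations in this theme might be sketched as follows (Python for illustration; the function names and the toy data are assumptions, not part of the brief). For a Poisson distribution the method-of-moments estimate of λ is the sample mean, which is also the maximum likelihood estimate, so the likelihood ratio statistic can be formed from the same fitted values. Under the null hypothesis it is usually compared with a chi-squared distribution whose degrees of freedom equal the number of categories minus one.

```python
import math
from collections import defaultdict


def mean(values):
    return sum(values) / len(values)


def fit_and_score(categories, claims):
    """Return the SSE for cases 1-3 and the likelihood ratio statistic (case 1 vs 2)."""
    pooled = [c for row in claims for c in row]
    lam_all = mean(pooled)                                   # case 1: one lambda

    by_cat = defaultdict(list)
    for cat, row in zip(categories, claims):
        by_cat[cat].extend(row)
    lam_cat = {cat: mean(v) for cat, v in by_cat.items()}    # case 2: per category

    lam_drv = [mean(row) for row in claims]                  # case 3: per driver

    sse1 = sum((c - lam_all) ** 2 for c in pooled)
    sse2 = sum((c - lam_cat[cat]) ** 2
               for cat, row in zip(categories, claims) for c in row)
    sse3 = sum((c - lam) ** 2
               for lam, row in zip(lam_drv, claims) for c in row)

    # 2 * (log-likelihood of case 2 - log-likelihood of case 1);
    # the log(x!) terms are identical in both and cancel.
    lrt = 0.0
    for cat, v in by_cat.items():
        total = sum(v)
        if total > 0:
            lrt += 2.0 * (total * math.log(lam_cat[cat] / lam_all)
                          - len(v) * (lam_cat[cat] - lam_all))
    return (sse1, sse2, sse3), lrt


# Made-up toy data: four drivers in two categories, three years each.
toy_categories = ["A", "A", "D", "D"]
toy_claims = [[0, 0, 1], [0, 0, 0], [1, 2, 0], [1, 0, 1]]
sse, lrt = fit_and_score(toy_categories, toy_claims)
print(sse, lrt)
```

Because the three cases are nested, the sum of squared errors can only fall as more parameters are fitted, so a smaller value for case 3 is not by itself evidence that it is the better model.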

### Theme 2

Use the model from case 1 above, ie *λ* is the same for all policyholders.

For a given policyholder, the discount level at times 0, 1, 2, …, 16 follows a kind of time series that data scientists call a “Markov chain”.

For each test value of *λ* in {0.06, 0.12, 0.18, 0.24, 0.30}:

1) calculate the transition matrix of the Markov chain

2) calculate the equilibrium distribution of the Markov chain

3) calculate the long-run average (over all 600 drivers) % of full premium paid in one year

4) calculate the year-by-year (over all 16 years) average % of full premium paid each year; this will be a vector of 16 numbers (convert each level in X to its discount, ie 10% per level, average each column, then subtract from 1).

5) graph the results from step 4
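Steps 1 to 4 might be sketched as below, in Python for illustration. This is one possible reading rather than a prescribed method: the equilibrium distribution is approximated by repeated multiplication instead of solving the balance equations, and `premium_path` gives the model's expected premium in each year from an assumed starting distribution, whereas the data-based version averages the columns of X as described in step 4. All function names are illustrative.

```python
import math

DISCOUNT_STEP = 0.10   # from the scheme description: 10 points per level


def transition_matrix(lam, n_levels=6):
    """One-step matrix: up one level with P(no claims), otherwise down one."""
    p0 = math.exp(-lam)                      # Poisson probability of zero claims
    P = [[0.0] * n_levels for _ in range(n_levels)]
    for i in range(n_levels):
        P[i][min(i + 1, n_levels - 1)] += p0
        P[i][max(i - 1, 0)] += 1.0 - p0
    return P


def step(dist, P):
    """Advance a row vector of level probabilities by one year."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def equilibrium(P, iterations=2000):
    """Approximate the stationary distribution by repeated multiplication."""
    dist = [1.0 / len(P)] * len(P)
    for _ in range(iterations):
        dist = step(dist, P)
    return dist


def premium_fraction(dist):
    """Expected fraction of full premium paid under a distribution over levels."""
    return sum(p * (1.0 - DISCOUNT_STEP * k) for k, p in enumerate(dist))


def premium_path(initial_dist, P, n_years):
    """Expected premium fraction at times 0..n_years under the model."""
    dist, path = list(initial_dist), []
    for _ in range(n_years + 1):
        path.append(premium_fraction(dist))
        dist = step(dist, P)
    return path


for lam in (0.06, 0.12, 0.18, 0.24, 0.30):
    P = transition_matrix(lam)
    pi = equilibrium(P)
    print(lam, round(100 * premium_fraction(pi), 2))
```

The list returned by `premium_path` (or its data-based counterpart) is what would be plotted against time in step 5, one line per value of λ.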

### Theme 3

Devise a new classification scheme for Mastodon. You can choose the number of discount levels, the size of the discount offered at each level, and the rules for the transitions between levels (how many levels a driver goes down if he/she makes 1 claim in a year, how many if 2 claims, and so on).

Retain the rule that a year with no claims will result in moving up one level (unless already at the top).

Also you must maintain the Markov Property. In other words, your classification scheme cannot look back in time to old claim history.

Repeat the analysis of Theme 2 (especially steps 4 and 5) for your proposed scheme. Explain why you think your scheme is better than the existing scheme.
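To analyse a candidate scheme, the transition matrix needs to allow the drop to depend on the number of claims. The Python sketch below shows one way this could be set up; the function names and the example scheme (7 levels, two levels lost for one claim, four for two or more) are purely hypothetical and are not a recommendation. The Markov property holds because the next level depends only on the current level and the current year's claims.

```python
import math


def poisson_pmf(k, lam):
    """P(exactly k claims in a year) for a Poisson(lam) claim count."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def scheme_matrix(lam, n_levels, drops):
    """Transition matrix for a scheme with claim-dependent drops.

    drops[k-1] is the number of levels lost after k claims; the last entry
    also covers every larger claim count. drops must not be empty.
    """
    P = [[0.0] * n_levels for _ in range(n_levels)]
    top = n_levels - 1
    for i in range(n_levels):
        P[i][min(i + 1, top)] += poisson_pmf(0, lam)   # claim-free year: up one
        tail = 1.0 - poisson_pmf(0, lam)               # probability still unassigned
        for k, drop in enumerate(drops, start=1):
            is_last = (k == len(drops))
            prob = tail if is_last else poisson_pmf(k, lam)
            P[i][max(i - drop, 0)] += prob             # cannot fall below level 0
            tail -= prob
    return P


# Hypothetical scheme: 7 levels, lose 2 levels for one claim, 4 for two or more.
P_new = scheme_matrix(0.12, n_levels=7, drops=[2, 4])
print([round(p, 4) for p in P_new[6]])
```

Calling `scheme_matrix(lam, 6, [1])` reproduces the existing scheme, which gives a convenient check before comparing the two.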

## Assessment criteria

Your assignment will be assessed on the basis of the following criteria:

· Correctness of any calculations

· Interpretation of findings and commentary on illustrations

· Relevance and clarity of plots, including labelling and captioning

· Quality of the submitted report

· Clarity, correctness and documentation of programming code and/or spreadsheet