# Predictive modelling

Predictive modelling is the process by which a model is created or chosen to try to best predict the probability of an outcome.[1] In many cases the model is chosen on the basis of detection theory to try to guess the probability of an outcome given a set amount of input data, for example, given an email, determining how likely it is to be spam.

Models can use one or more classifiers in trying to determine the probability of a set of data belonging to another set, say spam or 'ham'.

## Models and classifiers

Many models exist to try to predict on the basis of input data.

### Majority classifier

The majority classifier takes non-anomalous data and incorporates it within its calculations. This ensures that the results produced by the predictive modelling system are as valid as possible.

### Logistic regression

Logistic regression is a technique in which unknown values of a discrete variable are predicted based on known values of one or more continuous and/or discrete variables. Logistic regression differs from OLS regression in that the dependent variable is binary in nature. This procedure has many applications. In biostatistics, the researcher may be interested in trying to model the probability of a patient being diagnosed with a certain type of cancer based on knowing, say, the incidence of that cancer in his or her family. In business, the marketer may be interested in modelling the probability of an individual purchasing a product based on the price of that product. Both of these are examples of a simple, binary logistic model. The model is "simple" in that each has only one independent, or predictor, variable, and it is "binary" in that the dependent variable can take on only one of two values: cancer or no cancer, and purchase or does not purchase.
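As a minimal sketch of a simple, binary logistic model, the Python below fits P(purchase | price) with one predictor by plain gradient ascent on the log-likelihood. The data, the function names (`fit_simple_logistic`, `predict_proba`) and the learning-rate and epoch settings are all assumptions made for illustration; in practice a statistics package would normally be used instead.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_simple_logistic(xs, ys, lr=0.1, epochs=20000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1*x) by batch gradient ascent.

    xs: values of the single predictor; ys: binary outcomes (0 or 1).
    lr and epochs are illustrative settings, not tuned values.
    """
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # gradient of the log-likelihood
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

def predict_proba(b0, b1, x):
    """Predicted probability that the binary outcome equals 1 at predictor value x."""
    return sigmoid(b0 + b1 * x)

# Hypothetical data: product price and whether the individual purchased (1) or not (0).
prices = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
bought = [1, 1, 1, 0, 1, 0, 0, 0]

b0, b1 = fit_simple_logistic(prices, bought)
p_cheap = predict_proba(b0, b1, 2.0)
p_expensive = predict_proba(b0, b1, 7.0)
```

On this made-up data the fitted slope `b1` is negative, so the predicted purchase probability falls as the price rises. The outcomes deliberately overlap at prices 4 and 5, because perfectly separated data would leave the maximum-likelihood estimate unbounded.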

### Uplift modelling

Uplift modelling is a technique for modelling the change in probability caused by an action. Typically this is a marketing action such as an offer to buy a product, to use a product more or to re-sign a contract. For example, in a retention campaign you wish to predict the change in probability that a customer will remain a customer if they are contacted. A model of the change in probability allows the retention campaign to be targeted at those customers for whom the change in probability will be beneficial. This allows the retention programme to avoid triggering unnecessary churn or customer attrition, and to avoid wasting money contacting people who would act anyway.
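One simple way to illustrate the idea, assuming data from a campaign in which some customers were contacted and a comparable control group was not, is to estimate the uplift per customer segment as the difference between the two retention rates. The segment names, counts and helper functions below are hypothetical, and this is only a sketch of the concept rather than a production uplift model.

```python
from collections import defaultdict

def uplift_by_segment(records):
    """Estimate uplift per segment as retention(contacted) - retention(not contacted).

    records: iterable of (segment, contacted, retained) tuples, where
    contacted and retained are booleans.
    """
    # For each segment, keep [retained_count, total_count] for both groups.
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for segment, contacted, retained in records:
        cell = counts[segment][contacted]
        cell[0] += int(retained)
        cell[1] += 1

    uplift = {}
    for segment, groups in counts.items():
        (r_t, n_t), (r_c, n_c) = groups[True], groups[False]
        if n_t == 0 or n_c == 0:
            continue  # uplift cannot be estimated without both groups
        uplift[segment] = r_t / n_t - r_c / n_c
    return uplift

def make_records(segment, contacted, retained_count, total):
    """Build hypothetical records: the first retained_count customers are retained."""
    return [(segment, contacted, i < retained_count) for i in range(total)]

# Hypothetical campaign results (100 customers per cell).
records = (
    make_records("new", True, 80, 100) + make_records("new", False, 60, 100)
    + make_records("loyal", True, 95, 100) + make_records("loyal", False, 95, 100)
    + make_records("dormant", True, 70, 100) + make_records("dormant", False, 85, 100)
)

uplift = uplift_by_segment(records)
# Target only segments where contact is estimated to raise retention.
targets = sorted(s for s, u in uplift.items() if u > 0)
```

In this made-up example only the "new" segment would be targeted: contacting "loyal" customers changes nothing, and contacting "dormant" customers is estimated to trigger churn.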


## Applications

### Archaeology

Predictive modelling in archaeology gets its foundations from Gordon Willey's mid-fifties work in the Virú Valley of Peru.[2] Complete, intensive surveys were performed, and then the covariability between cultural remains and natural features such as slope and vegetation was determined. Development of quantitative methods and a greater availability of applicable data led to growth of the discipline in the 1960s, and by the late 1980s substantial progress had been made by major land managers worldwide.

Generally, predictive modelling in archaeology establishes statistically valid causal or covariable relationships between natural proxies such as soil types, elevation, slope, vegetation, proximity to water, geology and geomorphology, and the presence of archaeological features. Through analysis of these quantifiable attributes from land that has undergone archaeological survey, the “archaeological sensitivity” of unsurveyed areas can sometimes be anticipated based on the natural proxies in those areas. Large land managers in the United States, such as the Bureau of Land Management (BLM), the Department of Defense (DOD),[3][4] and numerous highway and parks agencies, have successfully employed this strategy. By using predictive modelling in their cultural resource management plans, they are able to make more informed decisions when planning activities that may require ground disturbance and subsequently affect archaeological sites.

### Customer relationship management

Predictive modelling is used extensively in analytical customer relationship management and data mining to produce customer-level models that describe the likelihood that a customer will take a particular action. The actions are usually related to sales, marketing and customer retention.[5]

For example, a large consumer organisation such as a mobile telecommunications operator will have a set of predictive models for product cross-sell, product deep-sell and churn. It is also now more common for such an organisation to have a model of savability using an uplift model. This predicts the likelihood that a customer can be saved at the end of a contract period (the change in churn probability) as opposed to the standard churn prediction model.

### Auto insurance

Predictive modelling is utilised in vehicle insurance to assign risk of incidents to policy holders based on information obtained from them. This is extensively employed in usage-based insurance solutions, where predictive models utilise telemetry-based data to build a model of predictive risk for claim likelihood.[citation needed] Black-box auto insurance predictive models utilise GPS or accelerometer sensor input only.[citation needed] Some models include a wide range of predictive input beyond basic telemetry, including advanced driving behaviour, independent crash records, road history, and user profiles, to provide improved risk models.[citation needed]

### Notable failures of predictive modelling

Although not widely discussed by the mainstream predictive modelling community, predictive modelling is a methodology that has been widely used in the financial industry in the past, and some of its spectacular failures contributed to the financial crisis of 2008. These failures exemplify the danger of relying blindly on models that are essentially backward-looking in nature. The following examples are by no means a complete list:

1) FICO score. A black box algorithm that calculates the credit score of a borrower. The score supposedly predicts the default probability of a borrower based on vast historical credit, default and personal data. During the 2008 mortgage default wave, the FICO score dramatically underestimated the default rate of almost every credit score category, from the prime group (the highest credit score group) to the subprime group (the lowest credit score group).

2) Bond rating. S&P, Moody's and Fitch quantify the probability of default of bonds with a discrete variable called a rating. The rating can take on discrete values from AAA down to D. The rating is a predictor of the risk of default based on a variety of variables associated with the borrower and on macro-economic data drawn from historical records. The rating agencies failed spectacularly with their ratings on the 600 billion USD mortgage-backed CDO market. Almost the entire AAA sector (and the super-AAA sector, a new rating the agencies introduced to represent super-safe investments) of the CDO market defaulted or was severely downgraded during 2008, even though many of these securities had obtained their ratings less than a year earlier.

3) Statistical models that try to predict equity market prices based on historical data. So far, no such model is considered to make correct predictions consistently over the long term. One particularly memorable failure is that of Long-Term Capital Management, a fund that hired highly quantitative specialists, including a Nobel Prize winner in economics, to develop a sophisticated statistical model that predicted the price spreads between different securities. The models produced impressive profits until a spectacular debacle that forced the then Federal Reserve chairman Alan Greenspan to step in to broker a rescue plan by the Wall Street broker-dealers in order to prevent a meltdown of the bond market.

It is worth noting that the FICO score, bond ratings and Long-Term Capital Management all had long and highly successful track records. However, past success seemed in these cases to be only a good "predictor" of how spectacular the later failure could be and how severe the damage it would cause.

### Possible fundamental limitations of predictive models based on data fitting

1) History cannot always predict the future: using relations derived from historical data to predict the future implicitly assumes there are certain steady-state conditions or constants in the complex system. This is almost always wrong when the system involves people.

2) The issue of unknown unknowns: in all data collection, the collector first defines the set of variables for which data is collected. However, no matter how extensive the collector considers the selection of variables to be, there is always the possibility of new variables that have not been considered or even defined, yet are critical to the outcome.

3) Self-defeat of an algorithm: after an algorithm becomes an accepted standard of measurement, it can be taken advantage of by people who understand the algorithm and have the incentive to fool or manipulate the outcome. This is what happened to the CDO rating. The CDO dealers actively shaped the inputs to the rating agencies' models to reach an AAA or super-AAA rating on the CDOs they were issuing, cleverly manipulating variables that were "unknown" to the rating agencies' "sophisticated" models.
