7.30. Regression Data Analysis

In this fictitious example, you sell top-of-the-range beauty products through a complex network of reps throughout the USA. Leads for new territories are generated on your website, and the more promising candidates are trained to sell in their area. Remuneration is by commission only, and most of the sales force are women who work part-time. Each salesperson has to place a minimum order for your products (at discounted prices) and is encouraged to recruit other reps, who also work on a commission basis, paying 3% of their commission to their recruiter. A sliding scale of commissions is paid through the chain of recruitments generated in this way, and the policy has proved very effective. When a sales territory achieves a sufficient sales figure, you organize a special promotion, hiring a small conference center, flying in executives and top salespeople, and distributing samples.

These promotions are your key selling strategy, greatly boosting recruitments and sales, but they are also expensive. In the economic downturn you've found that several have left you little profit. How can you tell when a promotion is worth undertaking? More exactly, how can you use your past sales records to estimate a promotion's effectiveness, so that you don't overspend on it?

Theory

What you need is an algorithm like:

Increased Sales resulting from Special Promotion = $A + ($B x Factor B) + ($C x Factor C) + ($D x Factor D) + . . . + ($X x Factor X)

where Factors B, C, D, etc. are measures you have data on (e.g. number of reps in area, average number of monthly products sold by reps, number attending promotion . . .), $B, $C, $D, etc. are their coefficients (i.e. weightings), and $A is a constant. You derive values for $A and the coefficients by fitting the equation to your past records.
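In textbook notation the same algorithm is usually written as below. This is a sketch for orientation only: the symbols (y, x, beta, epsilon) are standard conventions rather than anything taken from this example, and it assumes the coefficients are found by ordinary least squares, which is the most common method but not the only one.

```latex
% y_i: increased sales for promotion i; x_{ij}: value of factor j for promotion i
% \beta_0 plays the role of $A, and \beta_1 ... \beta_k the roles of $B, $C, ...
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i

% Least squares chooses the coefficients that minimise the sum of squared residuals
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n}
    \Bigl( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij} \Bigr)^{2}
```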

For how you solve the equation (there are several ways) you'll have to refer to textbooks, and there you'll also find measures for goodness of fit (R-squared, analyses of the pattern of residuals and hypothesis testing). Statistical significance can be checked by an F-test of the overall fit, followed by t-tests of individual parameters. Yes, it's complicated, and even the articles referenced below only scratch the surface. But the theory is for statisticians: all you need to know is how to run one of the many regression programs on the market. Assemble the data, key it into the data interface, and the programs will estimate not only all the unknowns, but how significant these unknowns are. Having derived the algorithm from past sales data, you can use it to estimate future sales.

But what sort of factors would be relevant in this case? Initially you don't know, and indeed don't have to. The beauty of regression analysis is that the program will sort through the factors and attach a measure of relevance to each. Most can probably be dispensed with in a simple and robust algorithm, but you won't know which before running the regression program. In this case you might start by assembling, for each special promotion, figures for:

FACTOR                                                          ABBREVIATION   UNITS
Increased annual sales in territory resulting from promotion    SALES          US$ '000
Average USA business confidence over year following promotion   CONFID         -2 to +2
Number of sales reps in territory before promotion              NOREPS         number
Average monthly sales of reps before promotion                  AVSALE         US$ '000
Affluence of sales territory (estimated)                        TETAFF         1 to 5
Total number attending promotion (excluding staff)              ATTEND         number

You consult your records and key the figures into the regression program:

SALES    CONFID   NOREPS   AVSALE   TETAFF   ATTEND
11.5     0        4        0.94     3        82
34.5     2        9        1.24     4        105
7.6      -1       5        0.65     2        74
18.7     1        5        1.01     1        59
31.7     -1       12       0.79     5        89
6.3      2        16       0.44     1        208
22.6     1        2        1.60     5        92
5.4      -1       2        0.65     1        84
12.7     -2       3        1.46     3        158
13.6     0        6        0.75     2        73
45.1     2        7        1.50     5        45
11.8     1        3        0.95     3        83
13.4     -1       8        0.42     5        64
8.9      11       5        0.64     3        82
26.8     1        8        1.39     5        73
19.5     2        3        0.57     1        24
17.4     1        4        0.87     3        64
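What the regression program does with these figures can be sketched in a few lines of Python. This is a minimal, dependency-free illustration of ordinary least squares, not the commercial package the text has in mind; the function names `solve` and `fit` are invented for the example. The data are copied exactly as tabulated, including a CONFID value of 11 that falls outside the stated -2 to +2 scale and may be a transcription error, so the coefficients it prints need not match the program output quoted in this section.

```python
# Past promotions: (SALES, CONFID, NOREPS, AVSALE, TETAFF, ATTEND), as tabulated.
DATA = [
    (11.5, 0, 4, 0.94, 3, 82),   (34.5, 2, 9, 1.24, 4, 105),
    (7.6, -1, 5, 0.65, 2, 74),   (18.7, 1, 5, 1.01, 1, 59),
    (31.7, -1, 12, 0.79, 5, 89), (6.3, 2, 16, 0.44, 1, 208),
    (22.6, 1, 2, 1.60, 5, 92),   (5.4, -1, 2, 0.65, 1, 84),
    (12.7, -2, 3, 1.46, 3, 158), (13.6, 0, 6, 0.75, 2, 73),
    (45.1, 2, 7, 1.50, 5, 45),   (11.8, 1, 3, 0.95, 3, 83),
    (13.4, -1, 8, 0.42, 5, 64),  (8.9, 11, 5, 0.64, 3, 82),
    (26.8, 1, 8, 1.39, 5, 73),   (19.5, 2, 3, 0.57, 1, 24),
    (17.4, 1, 4, 0.87, 3, 64),
]
NAMES = ["Intercept", "CONFID", "NOREPS", "AVSALE", "TETAFF", "ATTEND"]


def solve(a, b):
    """Solve the square system a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit(rows):
    """Least-squares fit; the first value in each row is the quantity to predict.

    Returns (coefficients with intercept first, residual sum of squares,
    total sum of squares).
    """
    y = [row[0] for row in rows]
    x = [[1.0] + list(row[1:]) for row in rows]  # leading 1.0 carries the intercept
    k = len(x[0])
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(xi[i] * xi[j] for xi in x) for j in range(k)] for i in range(k)]
    xty = [sum(xi[i] * yi for xi, yi in zip(x, y)) for i in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, xi)) for xi in x]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return coef, ss_res, ss_tot


coef, ss_res, ss_tot = fit(DATA)
r2 = 1.0 - ss_res / ss_tot
for name, value in zip(NAMES, coef):
    print(f"{name:<10}{value:10.4f}")
print(f"R-squared {r2:.3f}")
```

Dropping a factor is a matter of leaving its column out of each row before calling `fit`, which is how a trial with fewer factors would be run.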

You then run the regression analysis and examine the measures the program provides for each algorithm it generates. Among these are:

Term        Coefficient   t statistic
Intercept   -1.507        -0.39
CONFID      1.213         0.82
NOREPS      2.056         4.54
AVSALE      19.83         3.59
TETAFF      0.3773        0.35
ATTEND      -0.1476       -2.74

Source of Variation   Sum of Squares   F statistic
Model                 1731.5           19.02
Residual              200.3
Total                 1931.8

These figures describe the algorithm:

Sales = -1.507 + 1.213 x CONFID + 2.056 x NOREPS + 19.83 x AVSALE + 0.3773 x TETAFF - 0.1476 x ATTEND
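The F statistic and R-squared follow directly from the sums of squares in the table. The sketch below shows the arithmetic, assuming 17 promotions (the number of rows in the data table) and six fitted terms including the intercept; the function name is illustrative only.

```python
def anova_summary(ss_model, ss_residual, n_obs, n_terms):
    """Return (R-squared, F statistic) for a fitted linear model.

    n_terms counts every fitted coefficient, including the intercept.
    """
    df_model = n_terms - 1          # the factors, excluding the intercept
    df_residual = n_obs - n_terms   # observations left over after fitting
    r_squared = ss_model / (ss_model + ss_residual)
    f_stat = (ss_model / df_model) / (ss_residual / df_residual)
    return r_squared, f_stat


r2, f = anova_summary(ss_model=1731.5, ss_residual=200.3, n_obs=17, n_terms=6)
print(f"R-squared = {r2:.3f}, F = {f:.2f}")  # R-squared = 0.896, F = 19.02
```

An R-squared of about 0.90 means the algorithm accounts for roughly 90% of the variation in SALES across the 17 promotions.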

It's not a bad match, but you'll notice that the weighting for those attending the promotion is negative, i.e. the sales decrease with the number of people attending the promotion. That's hardly what you want, and what does it mean? Perhaps the promotion recruits too many reps lacking the experience and energy to work the territory properly, only spoiling opportunities for others. You might want to cut down on numbers attending by screening out the unsuitable candidates.

Or it might mean that distressed times bring in too many hopefuls looking for supplementary income, i.e. ATTEND is not independent of CONFID. In fact you'd probably do better to rerun the regression package excluding the ATTEND data, when this new and more robust algorithm is generated:

Term        Coefficient   t statistic
Intercept   -4.208        -0.89
CONFID      4.432         3.99
NOREPS      1.167         2.96
AVSALE      8.739         1.87
TETAFF      2.145         2.02

Source of Variation   Sum of Squares   F statistic
Model                 1594.4           14.17
Residual              337.5
Total                 1931.8

Sales = -4.208 + 4.432 x CONFID + 1.167 x NOREPS + 8.739 x AVSALE + 2.145 x TETAFF
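The t statistics in the two tables show which terms carry real weight in each algorithm. The sketch below compares them with the two-sided 5% critical values of Student's t from standard tables (2.201 for 11 residual degrees of freedom, 2.179 for 12). The 5% level and the count of 17 promotions are assumptions made for the illustration, not part of the program output.

```python
# Two-sided 5% critical values of Student's t, keyed by residual degrees of freedom.
T_CRIT = {11: 2.201, 12: 2.179}
N_OBS = 17  # promotions in the data table

# t statistics as quoted in the two tables.
full_model = {"Intercept": -0.39, "CONFID": 0.82, "NOREPS": 4.54,
              "AVSALE": 3.59, "TETAFF": 0.35, "ATTEND": -2.74}
reduced_model = {"Intercept": -0.89, "CONFID": 3.99, "NOREPS": 2.96,
                 "AVSALE": 1.87, "TETAFF": 2.02}


def significant_terms(t_stats, n_obs=N_OBS):
    """Names of the terms whose |t| exceeds the 5% critical value."""
    df_residual = n_obs - len(t_stats)
    critical = T_CRIT[df_residual]
    return [name for name, t in t_stats.items() if abs(t) > critical]


print(significant_terms(full_model))     # ['NOREPS', 'AVSALE', 'ATTEND']
print(significant_terms(reduced_model))  # ['CONFID', 'NOREPS']
```

On this reading CONFID only becomes significant once ATTEND is left out, while TETAFF (t = 2.02) falls just short of the 5% threshold in the reduced algorithm.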


The program gives you a plot of predicted against actual sales figures:
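The same comparison can be made numerically by applying the algorithm above to each past promotion. A minimal sketch, using the published coefficients and the data exactly as tabulated; the names `COEF`, `RECORDS` and `predict` are invented for the example, and it makes no claim to reproduce the program's own plot.

```python
# Coefficients of the algorithm quoted above.
COEF = {"Intercept": -4.208, "CONFID": 4.432, "NOREPS": 1.167,
        "AVSALE": 8.739, "TETAFF": 2.145}

# (SALES, CONFID, NOREPS, AVSALE, TETAFF) for each past promotion, as tabulated.
RECORDS = [
    (11.5, 0, 4, 0.94, 3),   (34.5, 2, 9, 1.24, 4),  (7.6, -1, 5, 0.65, 2),
    (18.7, 1, 5, 1.01, 1),   (31.7, -1, 12, 0.79, 5), (6.3, 2, 16, 0.44, 1),
    (22.6, 1, 2, 1.60, 5),   (5.4, -1, 2, 0.65, 1),  (12.7, -2, 3, 1.46, 3),
    (13.6, 0, 6, 0.75, 2),   (45.1, 2, 7, 1.50, 5),  (11.8, 1, 3, 0.95, 3),
    (13.4, -1, 8, 0.42, 5),  (8.9, 11, 5, 0.64, 3),  (26.8, 1, 8, 1.39, 5),
    (19.5, 2, 3, 0.57, 1),   (17.4, 1, 4, 0.87, 3),
]


def predict(confid, noreps, avsale, tetaff):
    """Predicted increase in annual sales, in US$ '000."""
    return (COEF["Intercept"] + COEF["CONFID"] * confid + COEF["NOREPS"] * noreps
            + COEF["AVSALE"] * avsale + COEF["TETAFF"] * tetaff)


# Pair each actual SALES figure with the figure the algorithm predicts.
comparison = [(actual, predict(*factors)) for actual, *factors in RECORDS]

print(f"{'actual':>8} {'predicted':>10} {'residual':>9}")
for actual, predicted in comparison:
    print(f"{actual:8.1f} {predicted:10.1f} {actual - predicted:9.1f}")
```

The same `predict` function gives the estimate for a territory you have not yet promoted in: supply its CONFID, NOREPS, AVSALE and TETAFF figures and compare the result with the cost of the promotion.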

You'd want to refine your methodology for estimating business confidence and affluence in the sales territory, but even as matters stand, regression analysis has:

1. Identified the key factors in your sales promotions.
2. Given you numerical estimates of their importance.
3. Allowed you to broadly predict sales, and so avoid promotion overspends.

Resources

1. Statistics on the Web. Clay Helberg's useful listing of sites.
2. Free Statistics. Good listing of open source and freeware statistics packages.
3. Wil's Domain. Straightforward listing of statistics software, both free and commercial.
4. Statistical Analysis Software Survey. Useful tables if you're familiar with statistics packages.
5. Numerical Mathematics. Inexpensive linear regression package.
6. FitAll. General purpose, nonlinear regression analysis programs.
7. Sagata Regression. Basic regression packages that work with Excel.
8. StatFi. Regression package with good list of features.
9. AnalyseIt. Multifeatured regression package that works with Excel.

Questions

1. What is regression analysis? Why is it useful?
2. Give a hypothetical example of its use.
3. In what circumstances could regression analysis be more useful than cluster analysis or neural networks?

Sources and Further Reading

1. Regression analysis. Wikibooks. Extensive sets of articles.
2. Programming Collective Intelligence: Building Smart Web 2.0 Applications by Toby Segaran. O'Reilly. August 2007. Includes specimen code in Python.
3. Essentials of Statistics by David Brink. Bookboon. 2010. Clear and rigorous treatment in 103 pp. free ebook.