# A Coverage Theory for Least Squares [by Simone Garatti]

Least squares, in its different forms, is probably the most widely used approach for constructing models from a sample of observations. In the article “A Coverage Theory for Least Squares”, we treat least squares more generally as a methodology for making decisions. As a concrete example, consider the problem of deciding where to locate a station that serves a population, such as a laundry, a library or a supermarket. Once the home locations of a sample of the population have been obtained, a least squares approach places the service station at the barycenter of these locations, a decision that minimizes the sum of squared home-service distances.

In the article, our main interest lies in assessing the quality of least squares decisions: once the location of the service station has been determined from a sample, one can ask how good this decision is for the rest of the population. For example, one can evaluate the satisfaction levels (as measured by the home-service distance) of the individuals in the sample, and ask how representative they are of the satisfaction of the whole population. Say, for instance, that the individual who is the 10th furthest from the station in a sample of 100 has to walk 0.4 miles to reach it: what proportion of the population has to walk more than 0.4 miles? Answering this question and related ones has an important impact on the use of the least squares method, and even on the acceptance of the solution it produces (if too many people have to walk too far, the decision to place the station at the least squares location might be rejected in favor of building two stations instead of one, at extra cost).

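
To make the service location example concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the 100 home locations are drawn from a standard Gaussian in the plane (the article assumes no particular distribution), and the helper names `barycenter` and `sum_squared_distances` are hypothetical. It places the station at the barycenter, checks that moving the station away increases the sum of squared distances, and reads off the distance walked by the 10th furthest individual in the sample.

```python
import math
import random


def barycenter(points):
    """Least squares location: the coordinate-wise mean of the homes."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def sum_squared_distances(station, points):
    """Total cost of a station location: sum of squared home-service distances."""
    sx, sy = station
    return sum((x - sx) ** 2 + (y - sy) ** 2 for x, y in points)


# Assumption: homes are simulated, not real data.
rng = random.Random(0)
homes = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]

station = barycenter(homes)
best = sum_squared_distances(station, homes)

# Any other location costs more; here the station is shifted by (0.1, -0.1).
shifted = (station[0] + 0.1, station[1] - 0.1)
worse = sum_squared_distances(shifted, homes)

# Home-service distances, sorted from furthest to nearest.
distances = sorted(
    (math.hypot(x - station[0], y - station[1]) for x, y in homes),
    reverse=True,
)
tenth_furthest = distances[9]

print(f"station at ({station[0]:.3f}, {station[1]:.3f})")
print(f"cost at barycenter: {best:.3f}, cost at shifted location: {worse:.3f}")
print(f"10th furthest individual walks {tenth_furthest:.3f}")
```

The value `tenth_furthest` plays the role of the 0.4 miles in the example: the question studied in the article is what proportion of the whole population, not just of the sample, lies further from the station than this.
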
This problem has so far received little attention from the statistical community, and this article tries to fill the gap by presenting a new theory that is applicable to least squares decisions across diverse fields. In the service location problem, the empirical proportion of sample members who pay a cost above a given value is not a valid statistic for quantifying the proportion of the whole population whose cost exceeds that value. This is not surprising, since the least squares solution introduces a bias towards making the cost of the members in the sample small. On the other hand, the article shows that, by introducing suitable margins, one obtains valid and tight statistics that hold distribution-free, that is, they can be applied without any extra knowledge of how the population is distributed.
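
The bias of the empirical proportion can be seen in a small simulation. The sketch below is an illustration only and does not implement the margins derived in the article. It assumes a standard Gaussian population in the plane, a deliberately small sample of 10 homes so that the effect is visible, and a fixed threshold of 2.0; the function names are hypothetical. For each trial it fits the station on a sample, then compares the fraction of sample members beyond the threshold with the fraction of fresh individuals beyond the same threshold.

```python
import math
import random


def barycenter(points):
    """Least squares location: the coordinate-wise mean of the homes."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def fraction_beyond(station, points, threshold):
    """Fraction of points whose distance to the station exceeds the threshold."""
    sx, sy = station
    far = sum(math.hypot(x - sx, y - sy) > threshold for x, y in points)
    return far / len(points)


def simulate(n_sample=10, n_fresh=500, threshold=2.0, trials=1000, seed=1):
    """Average in-sample and out-of-sample fractions over repeated trials."""
    rng = random.Random(seed)

    def draw(m):
        # Assumption: the population is standard Gaussian in the plane.
        return [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(m)]

    in_sum = out_sum = 0.0
    for _ in range(trials):
        sample = draw(n_sample)
        station = barycenter(sample)  # decision fitted on the sample
        in_sum += fraction_beyond(station, sample, threshold)
        # Fresh individuals stand in for the rest of the population.
        out_sum += fraction_beyond(station, draw(n_fresh), threshold)
    return in_sum / trials, out_sum / trials


in_sample, out_of_sample = simulate()
print(f"average in-sample fraction beyond threshold:     {in_sample:.3f}")
print(f"average out-of-sample fraction beyond threshold: {out_of_sample:.3f}")
```

Under these assumptions the in-sample fraction is smaller on average than the out-of-sample one, because the station was placed to suit the sample. The gap shrinks as the sample grows, and the simulation relies on knowing the population, which is exactly what a distribution-free result avoids.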

## Author: xi'an

I am a professor of Statistics at Université Paris Dauphine, France, and University of Warwick, United Kingdom, with a definitely unhealthy (but so far not fatal) fascination for mountains and (easy) climbing, in particular for Scotland in Winter, an almost-daily run, and a reading list mainly centred at fantasy books… Plus a blog that often seems to take most of my time. Not that anyone forces me to edit it...!