Does anyone have a literature reference for estimating unknown cases (the "dark figure"), or know how to solve the following problem (if it can be solved at all), or have any ideas?
Here is the example: Two real estate platforms list homes that are for sale (and were subsequently sold). The two platforms are independent of each other and together cover a very large share of the homes sold in a country per year. However, it is not known exactly what percentage of the total market each platform covers. What is known:
A) The number of sold homes per year that were offered on platform X ONLY.
B) The number of sold homes per year that were offered on platform Y ONLY.
C) The number of sold homes per year that were offered on both platforms, X and Y.
Question: How many homes were offered in a specific year (including the homes that were on neither platform X nor platform Y)? Thanks.
Dr. Frank Gaeth
See capture-recapture, https://en.wikipedia.org/wiki/Mark_and_recapture
If you really know the platforms are independent, the formula is simply:
Estimated Total sold homes per year = (A*B)/C
See the Wikipedia page for more references.
Andy: I think you blew it.
=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
See the Wikipedia page, where A and B are /marginal totals/ and not the cell frequencies as defined in the question. The formula works using the marginal totals.

The example on the Wikipedia page has cell frequencies implicitly defined as (10, 5), (10, 5). For the formula to work, the 2x2 table, labeled appropriately for the formula, is

             -       +      total
    -       10       5
    +       10       5=C    15=A
           ------
  total     20      10=B

Then 15*10/5 gives 30, the total for all the cells.

Whether you start from the cells or the marginal totals, you can fill in three of the cells. Then the simple estimate for the other cell is the number that is proportional, so that the correlation is zero: independence.

The page also shows estimators that are less biased, and has links to similar problems.

--
Rich Ulrich

> Date: Wed, 25 May 2016 07:43:10 -0700
> From: [hidden email]
> Subject: Re: Dark figure
> To: [hidden email]
>
> See capture-recapture, https://en.wikipedia.org/wiki/Mark_and_recapture
>
> If you really know the platforms are independent, the formula is simply:
>
> Estimated Total sold homes per year = (A*B)/C
>
> see the wikipedia page for more references.
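A minimal Python sketch of the arithmetic above, using the marginal totals from the table (15, 10, and overlap 5). The function names are made up for illustration. The second function is the Chapman estimator, which is one of the less biased estimators given on the Wikipedia page; whether it is the one meant above is an assumption.

```python
def lincoln_petersen(n_x: int, n_y: int, n_both: int) -> float:
    """Simple two-list estimate from MARGINAL totals.

    n_x    : all cases on list X (including those also on Y)
    n_y    : all cases on list Y (including those also on X)
    n_both : cases on both lists
    """
    if n_both <= 0:
        raise ValueError("the estimate is undefined when the overlap is zero")
    return n_x * n_y / n_both


def chapman(n_x: int, n_y: int, n_both: int) -> float:
    """Chapman's bias-corrected variant; still assumes independent lists."""
    return (n_x + 1) * (n_y + 1) / (n_both + 1) - 1


# Marginal totals from the table: A = 15, B = 10, C = 5
print(lincoln_petersen(15, 10, 5))  # 30.0, the total for all the cells
print(round(chapman(15, 10, 5), 2))
```

Both estimates rest on the independence assumption; neither says anything about how wrong the answer is if the lists are correlated.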
Yep, you're right, good catch. So here, with A, B, and C as defined in the question, it would be
N = [(A+C)*(B+C)]/C
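For anyone who wants to plug in the counts exactly as the question defines them (X only, Y only, both), here is a small Python sketch of that formula. The function names are hypothetical, and independence of the two platforms is assumed throughout. Expanding the formula shows that the estimated number of homes on neither platform is A*B/C.

```python
def unseen_from_cells(only_x: int, only_y: int, both: int) -> float:
    """Estimated count on NEITHER list, assuming the lists are independent."""
    if both <= 0:
        raise ValueError("the estimate is undefined when the overlap is zero")
    return only_x * only_y / both


def total_from_cells(only_x: int, only_y: int, both: int) -> float:
    """N = (A + C) * (B + C) / C, with A, B, C as CELL counts."""
    if both <= 0:
        raise ValueError("the estimate is undefined when the overlap is zero")
    return (only_x + both) * (only_y + both) / both


# Cell counts matching the 2x2 example earlier in the thread:
# 10 on X only, 5 on Y only, 5 on both.
print(total_from_cells(10, 5, 5))   # 30.0
print(unseen_from_cells(10, 5, 5))  # 10.0
```

The two functions agree with each other: the total equals the three observed cells plus the estimated unseen cell.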
In reply to this post by Andy W
Just a couple of suggestions to look at:
(1) See: Chao, A., Tsay, P. K., Lin, S. H., Shau, W. Y., & Chao, D. Y. (2001). The applications of capture-recapture models to epidemiological data. Statistics in Medicine, 20(20), 3123-3157.
This is available at:
https://www.researchgate.net/profile/Anne_Chao/publication/263455563_Population_size_estimation_for_capture-recapture_models_with_applications_to_epidemiological_data/links/55a221fe08ae1c0e046418d5.pdf

(2) See: Amstrup, S. C., McDonald, T. L., & Manly, B. F. (Eds.). (2010). Handbook of capture-recapture analysis. Princeton University Press.
Portions of this can be seen in preview mode on Google Books:
https://books.google.com/books?hl=en&lr=&id=hOJxGNERUKgC&oi=fnd&pg=PP2&ots=-3VFWM9c6F&sig=-TJXIgnQ-yXA7rvJFEBypGKR-5I#v=onepage&q&f=false

-Mike Palij
New York University
[hidden email]

----- Original Message -----
From: "Andy W" <[hidden email]>
To: <[hidden email]>
Sent: Wednesday, May 25, 2016 10:43 AM
Subject: Re: Dark figure

> See capture-recapture,
> https://en.wikipedia.org/wiki/Mark_and_recapture
>
> If you really know the platforms are independent, the formula is
> simply:
>
> Estimated Total sold homes per year = (A*B)/C
>
> see the wikipedia page for more references.
Yeah, there is quite a bit of work on this problem across a few fields. (Statistical Science just published an issue on the topic.)
A less mathy introduction (less than typical journal articles, at least) that I have saved is here: http://granta.com/violence-in-blue/ They also talk about what happens when the samples are correlated.
On Wednesday, May 25, 2016 12:20 PM, Andy W wrote:
> Yeah there is a quite a bit of work on this problem across a few
> fields.
> (Statistical Science just posted an issue on the topic.)
>
> A bit of a less mathy (than typical journal articles at least)
> introduction
> that I have saved is here,
>
> http://granta.com/violence-in-blue/
>
> They also talk about what happens when the samples are correlated.

Nice article. It lays out the basics clearly as well as the problem of dependence and the need for three or more "lists" to measure that dependence. With two lists, like in the situation that Frank provided, the correlation cannot be directly estimated but one can see the effect of different degrees of correlation on the estimates.

And then there is the problem of the "uncatchables", that is, relevant instances that, for one reason or another, cannot be counted (in the article above, areas that refuse to provide the FBI with their number of police-based homicides are an example). This makes the numbers more suspect, but hopefully the dark arts of statistics can help to remedy the situation somewhat. ;-)

-Mike Palij
New York University
[hidden email]
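To make the point about seeing "the effect of different degrees of correlation" concrete, here is a rough Python sketch of a two-list sensitivity check. It is not taken from the article. It assumes dependence is expressed as an odds ratio between the two lists (a swapped-in measure, not the correlation itself), and the function name and the odds-ratio values are purely illustrative. An odds ratio of 1 reproduces the independence estimate.

```python
def total_under_odds_ratio(only_x: int, only_y: int, both: int,
                           odds_ratio: float) -> float:
    """Estimated total if the two lists have the given (ASSUMED) odds ratio.

    For a 2x2 table, odds_ratio = (both * neither) / (only_x * only_y),
    so the unseen cell is neither = odds_ratio * only_x * only_y / both.
    The odds ratio cannot be estimated from two lists; it must be supplied.
    """
    if both <= 0:
        raise ValueError("the estimate is undefined when the overlap is zero")
    neither = odds_ratio * only_x * only_y / both
    return only_x + only_y + both + neither


# Cell counts from the 2x2 example in this thread: 10, 5, and 5 on both.
for theta in (0.5, 1.0, 2.0, 4.0):
    print(theta, total_under_odds_ratio(10, 5, 5, theta))
# 0.5 25.0
# 1.0 30.0
# 2.0 40.0
# 4.0 60.0
```

If homes listed on one platform are more likely to be listed on the other (odds ratio above 1), the independence estimate is too low; with negative dependence it is too high.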