SamplingUtils (Spark 2.0.0 JavaDoc)

Object
- org.apache.spark.util.random.SamplingUtils

public class SamplingUtils
extends Object

Constructor Summary

Constructors
Constructor and Description

SamplingUtils()

Constructors
Constructor and Description
`SamplingUtils()`

Method Summary

Methods
Modifier and Type	Method and Description
`static double`	`computeFractionForSampleSize(int sampleSizeLowerBound, long total, boolean withReplacement)` Returns a sampling rate that guarantees a sample of size >= sampleSizeLowerBound 99.99% of the time.
`static <T> scala.Tuple2<Object,Object>`	`reservoirSampleAndCount(scala.collection.Iterator<T> input, int k, long seed, scala.reflect.ClassTag<T> evidence$1)` Reservoir sampling implementation that also returns the input size.

Methods inherited from class Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - SamplingUtils
```
public SamplingUtils()
```
- Method Detail
  - reservoirSampleAndCount
```
public static <T> scala.Tuple2<Object,Object> reservoirSampleAndCount(scala.collection.Iterator<T> input,
                                                      int k,
                                                      long seed,
                                                      scala.reflect.ClassTag<T> evidence$1)
```
    Reservoir sampling implementation that also returns the input size.
    
    Parameters:
    input - input size
    k - reservoir size
    seed - random seed
    evidence$1 - (undocumented)
    
    Returns:
    (samples, input size)
  - computeFractionForSampleSize
```
public static double computeFractionForSampleSize(int sampleSizeLowerBound,
                                  long total,
                                  boolean withReplacement)
```
    Returns a sampling rate that guarantees a sample of size >= sampleSizeLowerBound 99.99% of the time.
    How the sampling rate is determined: Let p = num / total, where num is the sample size and total is the total number of datapoints in the RDD. We're trying to compute q > p such that - when sampling with replacement, we're drawing each datapoint with prob_i ~ Pois(q), where we want to guarantee Pr[s < num] < 0.0001 for s = sum(prob_i for i from 0 to total), i.e. the failure rate of not having a sufficiently large sample < 0.0001. Setting q = p + 5 * sqrt(p/total) is sufficient to guarantee 0.9999 success rate for num > 12, but we need a slightly larger q (9 empirically determined). - when sampling without replacement, we're drawing each datapoint with prob_i ~ Binomial(total, fraction) and our choice of q guarantees 1-delta, or 0.9999 success rate, where success rate is defined the same as in sampling with replacement.
    The smallest sampling rate supported is 1e-10 (in order to avoid running into the limit of the RNG's resolution).
    
    Parameters:
    sampleSizeLowerBound - sample size
    total - size of RDD
    withReplacement - whether sampling with replacement
    
    Returns:
    a sampling rate that guarantees sufficient sample size with 99.99% success rate

Class SamplingUtils

Constructor Summary

Method Summary

Methods inherited from class Object

Constructor Detail

SamplingUtils

Method Detail

reservoirSampleAndCount

computeFractionForSampleSize