sampling / learning curve

wessel New Altair Community Member
edited November 5 in Community Q&A
Dear all,

Sampling the training set can have a major impact on classification accuracy,
especially when the class distribution is skewed.

Let's say you have a dataset of 100k negative examples and 1k positive examples,
and you wish to experiment with different pos/neg ratios in the training set.

To do this you need:
example filter: select all negative
example filter: absolute amount
example filter: select all positive
example filter: absolute amount
merge
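
For readers who think in code, the filter / absolute amount / merge chain above can be sketched in plain Python. This is only an illustration under stated assumptions, not an operator of the tool discussed here: the function name `sample_two_class`, the `label` key, the random (rather than first-n) selection, and the target counts are all hypothetical.

```python
import random

def sample_two_class(examples, n_neg, n_pos, seed=0):
    """Filter by label, take an absolute amount of each class, then merge."""
    rng = random.Random(seed)
    # "example filter: select all negative" / "select all positive"
    negatives = [e for e in examples if e["label"] == "negative"]
    positives = [e for e in examples if e["label"] == "positive"]
    # "absolute amount" for each class, followed by "merge"
    merged = rng.sample(negatives, n_neg) + rng.sample(positives, n_pos)
    rng.shuffle(merged)  # avoid all negatives preceding all positives
    return merged

# Hypothetical data matching the 100k negative / 1k positive example.
data = [{"id": i, "label": "negative"} for i in range(100_000)]
data += [{"id": 100_000 + i, "label": "positive"} for i in range(1_000)]

# A 5:1 neg/pos training set; change n_neg to try other ratios.
train = sample_two_class(data, n_neg=5_000, n_pos=1_000)
```

Every extra class needs one more filter and one more sample step, which is the bookkeeping a single operator would remove.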


When there are more than two classes, it gets even more cumbersome.


Would be cool if this could be combined into a single operator.

This might also be faster and more memory efficient.

Best regards,

Wessel

Answers

  • fischer
    fischer New Altair Community Member
    Hi,

    Just to make sure I understand: what would the parameters of your operator be? If I get it right, they would be

    - a ratio for each class
    - an absolute number of examples you want as output?

    Cheers,
    Simon
  • wessel
    wessel New Altair Community Member
    Let's see:

    Input: a dataset

    Parameter fields:
    label = class_A  [absolute amount] or [relative amount] and [sampling type]
    label = class_B  [absolute amount] or [relative amount] and [sampling type]
    ...
    label = class_Z  [absolute amount] or [relative amount] and [sampling type]

    Defaults: absolute amount = '', relative amount = 1, sampling type = linear


    Examples:
    Input: a dataset with 2000 examples of class A

    class_A  [1000] or []    and [linear]    Returns a dataset containing the first 1000 instances of class A

    class_A  [1000] or []    and [random]    Returns a dataset containing 1000 randomly sampled instances of class A

    class_A  []     or [0.5] and [linear]    Returns a dataset containing the first 1000 instances of class A

    class_A  []     or [0.5] and [random]    Returns a dataset containing 1000 randomly sampled instances of class A

    class_A  [3000] or []    and [random]    Returns an error?

    class_A  []     or [1.4] and [random]    Returns an error?
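
The parameter fields and examples above could be sketched roughly as follows. This is a hedged Python illustration, not an existing operator: the function `sample_by_class`, the `spec` dictionary layout and the `label` key are assumptions. The two "Returns an error?" cases are left open in the post; this sketch simply chooses to raise a `ValueError` when more examples are requested than a class contains.

```python
import random
from collections import defaultdict

def sample_by_class(examples, spec, label_key="label", seed=0):
    """Per-class sampling.

    spec maps a class label to a dict with optional keys:
      'absolute' (int), 'relative' (float, default 1.0),
      'sampling' ('linear' or 'random', default 'linear').
    Classes missing from spec use the defaults, i.e. are kept whole.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for ex in examples:
        by_class[ex[label_key]].append(ex)

    result = []
    for label, rows in by_class.items():
        params = spec.get(label, {})
        absolute = params.get("absolute")
        relative = params.get("relative", 1.0)
        sampling = params.get("sampling", "linear")
        # The absolute amount wins when given; otherwise use the ratio.
        n = absolute if absolute is not None else int(round(relative * len(rows)))
        if n < 0 or n > len(rows):
            # Assumption: oversampling is treated as an error.
            raise ValueError(f"{label}: requested {n}, only {len(rows)} available")
        if sampling == "linear":
            result.extend(rows[:n])            # first n instances
        elif sampling == "random":
            result.extend(rng.sample(rows, n))  # n instances without replacement
        else:
            raise ValueError(f"unknown sampling type: {sampling!r}")
    return result

# The 2000-example input from the examples above.
data = [{"id": i, "label": "class_A"} for i in range(2000)]
first_half = sample_by_class(data, {"class_A": {"relative": 0.5}})
random_1000 = sample_by_class(data, {"class_A": {"absolute": 1000, "sampling": "random"}})
```

If oversampling should be allowed instead, the random branch could draw with replacement, but that is a different design decision from the one sketched here.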