"Bug in MinimalEntropyPartitioning?"
Legacy User
New Altair Community Member
Hello everybody,
I get strange results when I apply MinimumEntropyPartitioning on some datasets and wonder whether this is due to a bug in the implementation.
Let me illustrate the problem: I have a dataset with one attribute ("X") and one label with two possible values.
There are 6 possible values for X, 1 to 6. In total, I have 1116 rows, with the following target label distributions:
X-value  #negatives  #positives  #rows
1.0      124         62          186
2.0      124         62          186
3.0      0           186         186
4.0      0           186         186
5.0      124         62          186
6.0      124         62          186
Now of course I would expect a discretization into [-∞, 2], ]2, 4], ]4, ∞], each containing 372 rows. Instead, I get:
range1 [-∞ - 2] (372), range2 [2 - 5] (558), range3 [5 - ∞] (186)
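To double-check, here is a quick sketch (my own illustration, not the operator's code) that scores each candidate binary split on the table above by class-weighted entropy; the best first split falls after X = 2 (tied with X = 4), matching the expected cut points:

```java
// Sketch: evaluate candidate binary splits of the distribution above
// by weighted entropy, as minimal-entropy partitioning would.
public class SplitSketch {
    // COUNTS[v] = {negatives, positives} for X-value v+1
    static final double[][] COUNTS = {
        {124, 62}, {124, 62}, {0, 186}, {0, 186}, {124, 62}, {124, 62}
    };

    static double ld(double x) { return Math.log(x) / Math.log(2); }

    // Binary class entropy of a {neg, pos} pair; zero counts contribute 0.
    static double entropy(double neg, double pos) {
        double n = neg + pos, h = 0;
        for (double c : new double[] { neg, pos }) {
            if (c > 0) h -= (c / n) * ld(c / n);
        }
        return h;
    }

    // Weighted entropy of splitting after X-value `cut` (1-based).
    static double splitEntropy(int cut) {
        double ln = 0, lp = 0, rn = 0, rp = 0;
        for (int v = 0; v < COUNTS.length; v++) {
            if (v < cut) { ln += COUNTS[v][0]; lp += COUNTS[v][1]; }
            else         { rn += COUNTS[v][0]; rp += COUNTS[v][1]; }
        }
        double total = ln + lp + rn + rp;
        return (ln + lp) / total * entropy(ln, lp)
             + (rn + rp) / total * entropy(rn, rp);
    }

    public static void main(String[] args) {
        int best = 1;
        for (int cut = 1; cut <= 5; cut++) {
            double e = splitEntropy(cut);
            System.out.printf("split after %d: %.4f%n", cut, e);
            if (e < splitEntropy(best)) best = cut;
        }
        System.out.println("best split after value " + best);
    }
}
```

The splits after 2 and after 4 tie at the minimum (≈ 0.9183), so the correct recursion should end up with cuts at 2 and 4; note this tie is also the non-uniqueness that makes the result sensitive to how the implementation breaks ties.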
It seems like there is a bug in the operator that does not correctly distinguish open and closed interval limits.
Does anybody know of a solution or a workaround?
Best,
Henrik
Answers
Hi Henrik,
this seems to be a problem indeed. Perhaps you could add a tiny bit of noise to your values to resolve the non-uniqueness that causes your problem. But to solve it in general, I will take a look at the code.
Greetings,
Sebastian
Hi Sebastian,
thanks for the reply. I also thought that the problem could be diminished if I had more continuous values. But of course it would be best if you could fix the problem in general.
Best,
Henrik
Hi,
in the meantime I found the bug and fixed it. The bug is in the function
private Double getMinEntropySplitpoint(LinkedList<double[]> truncatedExamples, Attribute label) {
in the class MinimalEntropyDiscretization. It does not consider the case where a split results in 0 examples of one class. Here is the fix:
// Calculate entropies. Frequencies of 0 must be skipped:
// 0 * ld(0) evaluates to NaN, which corrupts the entropy sum.
double entropy1 = 0.0d;
for (int i = 0; i < label.getMapping().size(); i++) {
    if (frequencies1[i] != 0.0d) {
        entropy1 -= frequencies1[i] * MathFunctions.ld(frequencies1[i]);
    }
}
double entropy2 = 0.0d;
for (int i = 0; i < label.getMapping().size(); i++) {
    if (frequencies2[i] != 0.0d) {
        entropy2 -= frequencies2[i] * MathFunctions.ld(frequencies2[i]);
    }
}
Best,
Henrik
Hi Henrik,
thanks for sending this in! We will check and integrate your suggestion as soon as possible.
Cheers,
Ingo