Observations and questions on source code
Hello,
I have been digging deep into the source code for classification in RapidMiner while trying to create a minimal library to embed in a Java app.
I have come across a few inconsistencies, in my mind, and a few questions. This will probably be a long post and I am not sure if this is the correct forum, but here goes.
1) In MemoryExampleTable, why does the constructor take a DataRowReader instead of an Iterator<DataRow>. I have an Iterator<DataRow> in my code, and to pass it to a MemoryTable I have to implement DataRowReader or use the strangely named ListDataRowReader, which seems like and extra unnecessary step. Changing this would make the method more flexible by accepting an interface instead of a class and it could be made without affecting existing code.
2) Similarly, the MemoryExampleTable constructor could be changed to accept an Iterable<Attribute> instead of a List<Attribute>, this would have no affect on existing code. I know this is not a big deal, but it can make code that is working with a MemoryTable much cleaner. For example, I am dealing with lazily evaluated Iterables, to create a MemoryTable, I have to convert the Iterable to a List and pass it into the constructor. The method only uses List as an Iterable anyway.
3) Most Java code is probably going to be dealing with Collections or Lists instead of arrays, or at least I believe they should at that level of abstraction. So it is inconvenient to have to convert all the Collections (or Lists or Iterables) into arrays to pass into the the DataRowFactory to create a DataRow. It would seems easier to also have a method that accepts an Iterable and internally call the method that accepts an array. This way, if you only wan to create a DataRow, you don't need to deal with the low-level array conversions. Even better would be for this method to iterate over both inputs and create the DataRow without having to convert the Iterables into an array. The toArray method of Collections makes a copy of the original values, which is unnecessary in this case.
I feel it is inconsistent, to create a MemoryTable with a List of attributes, but create a DataRow, with an array of the exact same attributes.
I know these changes seem trivial, but they really help to make code that uses the library cleaner, and easier to maintain.