Observations and questions on source code

dacamo76
dacamo76 New Altair Community Member
edited November 5 in Community Q&A
Hello,

I have been digging deep into the source code for classification in RapidMiner while trying to create a minimal library to embed in a Java app.

I have come across a few inconsistencies, in my mind, and a few questions. This will probably be a long post and I am not sure if this is the correct forum, but here goes.

1) In MemoryExampleTable, why does the constructor take a DataRowReader instead of an Iterator<DataRow>. I have an Iterator<DataRow> in my code, and to pass it to a MemoryTable I have to implement DataRowReader or use the strangely named ListDataRowReader, which seems like and extra unnecessary step. Changing this would make the method more flexible by accepting an interface instead of a class and it could be made without affecting existing code.

2) Similarly, the MemoryExampleTable constructor could be changed to accept an Iterable<Attribute> instead of a List<Attribute>, this would have no affect on existing code. I know this is not a big deal, but it can make code that is working with a MemoryTable much cleaner. For example, I am dealing with lazily evaluated Iterables, to create a MemoryTable, I have to convert the Iterable to a List and pass it into the constructor. The method only uses List as an Iterable anyway.

3) Most Java code is probably going to be dealing with Collections or Lists instead of arrays, or at least I believe they should at that level of abstraction. So it is inconvenient to have to convert all the Collections (or Lists or Iterables) into arrays to pass into the the DataRowFactory to create a DataRow. It would seems easier to also have a method that accepts an Iterable and internally call the method that accepts an array. This way, if you only wan to create a DataRow, you don't need to deal with the low-level array conversions. Even better would be for this method to iterate over both inputs and create the DataRow without having to convert the Iterables into an array. The toArray method of Collections makes a copy of the original values, which is unnecessary in this case.

I feel it is inconsistent, to create a MemoryTable with a List of attributes, but create a DataRow, with an array of the exact same attributes.

I know these changes seem trivial, but they really help to make code that uses the library cleaner, and easier to maintain.

Answers

  • dacamo76
    dacamo76 New Altair Community Member
    Another few questions:

    4) The Attribute Interface. I don't understand clone completely, as I've never used it, but from what I gather it should call super.clone(). So I am assuming that the clone() method in the Attribute interface is just a method that returns a copy of the current object. Under this assumption, there is nothing in the class docs that says if it is a deep copy or a shallow copy, so digging deeper I see that it is a mixture of both.
    The NominalAttributes don't copy the mappings, they keep the same references. (There is actually a call to clone the mappings in the copy constructors but it is commented out.)
    None of this behavior is documented, so you can't be sure what is going on by just looking at the javadocs. At worst, you will get unexpected behavior by assuming it is a complete copy. At best, you dig through the source code and see what is really going on.

    This is also the case with the setMapping(NominalMapping) method for the nominal attributes.
    In this case it is worse, as both override the same method but cause different side-effects.
    The BinominalAttribute method just assigns the new mapping, so the calling class keeps a reference to the internal mapping.
    The Polynominal method creates a copy of the new mapping and assigns the copy. The calling class has no reference to the internal mapping.
    This behavior was causing a few problems in my code as I did not understand what was going on for some time.
    I am not sure which of the two is the correct behavior for the method. The javadocs for the method is just a copy of the getMapping method.

    The reason I encountered these errors was while I was trying to create attributes to create an ExampleTable to pass to a model.
    I could not use any of the clone() or copy constructor methods and get them to work. They always kept the original mappings and I got an error while creating a new ExampleTable. So I ended up having to create a whole new set of Attributes using the AttributeFactory method passing in the name and type of the attribute.

    From the equals() method I gather that an Attribute is considered equal if it has the same name and table index. It does not take into account the type of attribute or the mappings if it is a nominal attribute. Is this correct?

    If I can get my head around all this and figure out what the actual intentions of these methods are, I would be more than willing to write the corresponding javadoc to document a strict contract for the methods, this way saving other people the same pain I have gone through.

    I know this is a long rant, but I am trying to figure out what is the expected behavior of these methods.

    Cheers,

    Daniel
  • Marco_Boeck
    Marco_Boeck New Altair Community Member
    Hi,

    the RapidMiner code grew over the past decade or so, and of course there are a few flaws in it as in any other code. I guess we have to deal with it for now ;)
    As the main focus of RapidMiner is not to serve as a library (although it can) but to provide a well working data mining platform, there is no time for a code rewrite in the near future. But you never know what the future may bring ;)

    But it's quite easy to create an exampleSet to use.

    List<Attribute> listOfAttributes;
    listOfAttributes.add(AttributeFactory.createAttribute("helloWorldAttribute", Ontology.NUMERICAL);
    // setup your attribute list
    MemoryExampleTable table = new MemoryExampleTable(listOfAttributes);
    // setup your data rows
    double[] doubleArray = new double[] { 1.0d, 2.0d, 3.0d };
    table.addDataRow(new DoubleArrayDataRow(doubleArray));
    ExampleSet exSet = table.createExampleSet();
    // if you need special roles:
    Attribute att = listOfAttributes.get(0);
    exSet.getAttributes().setSpecialAttribute(att, "label");
    If you however want to dig deeper, you may have to check how RapidMiner uses the code to avoid a trap or two.

    Regards,
    Marco
  • dacamo76
    dacamo76 New Altair Community Member
    Hi Marco,

    Thanks for your reply. My question was more in regards to how I can help with the documentation.
    I can code the changes, I just need to know what the expected behavior is, and I can't get that information from the docs.

    It was not meant as a complaint. In fact I made some changes to be able to load a model from a Reader, thus bypassing the file system and giving a user the ability to deploy a solution to apply models in an environment with no disk access, like Google App Engine.

    I feel I have good grasp of how to create the example sets and pass them through a process and a model and have dealt with many shortcomings trying to make RapidMiner a library API, which as you mention, is not the original intent.
    So I am fully aware I will encounter obstacles, and I hope to contribute and help fix as many as I encounter.

    I also implemented changes in the equals method for Attributes and Nominal Mappings to be able to safely create implementations of those interfaces and have the equals method work correctly. Your simple example below highlights the low level understanding one must have of the RapidMiner internals to embed it in a java program.
    You must know the ExampleTable and ExampleSet relationship as well as the DataRow and Attribute relationship pretty well.

    My intent was to highlight some of these issues Java developers with no knowledge of RM have, and with your help and feedback, implement some of those changes.
    If you however want to dig deeper, you may have to check how RapidMiner uses the code to avoid a trap or two.
    Yup, and this is exactly what I want to help out with and wanted to know where is the best forum to hash out these issues and discuss the RM libraries and interfaces. And what do I need to do to get those changes incorporated into RM. As of now, I need the "Create Object from XML Reader" method to function correctly on my web server instance.

    thanks,

    Daniel
    Marco Boeck wrote:

    Hi,

    the RapidMiner code grew over the past decade or so, and of course there are a few flaws in it as in any other code. I guess we have to deal with it for now ;)
    As the main focus of RapidMiner is not to serve as a library (although it can) but to provide a well working data mining platform, there is no time for a code rewrite in the near future. But you never know what the future may bring ;)

    But it's quite easy to create an exampleSet to use.

    List<Attribute> listOfAttributes;
    listOfAttributes.add(AttributeFactory.createAttribute("helloWorldAttribute", Ontology.NUMERICAL);
    // setup your attribute list
    MemoryExampleTable table = new MemoryExampleTable(listOfAttributes);
    // setup your data rows
    double[] doubleArray = new double[] { 1.0d, 2.0d, 3.0d };
    table.addDataRow(new DoubleArrayDataRow(doubleArray));
    ExampleSet exSet = table.createExampleSet();
    // if you need special roles:
    Attribute att = listOfAttributes.get(0);
    exSet.getAttributes().setSpecialAttribute(att, "label");
    If you however want to dig deeper, you may have to check how RapidMiner uses the code to avoid a trap or two.

    Regards,
    Marco
  • Marco_Boeck
    Marco_Boeck New Altair Community Member
    Hi,

    unfortunately we don't quite have the time to go over a lot of code in the forums and discuss the intention and reason for the implementation, but we appreciate fixes/enhancements following these guidelines: http://rapid-i.com/content/view/51/81/lang,en/

    Regards,
    Marco