"Beginner Machine Learning Question"

Ghostrider
Ghostrider New Altair Community Member
edited November 5 in Community Q&A
Say I want to predict the price of an automobile from its attributes. Assume that I know things such as tire size, date of manufacture, number of doors, etc. I could throw all these attributes into a decision tree learner and hope to find some relation to the cost of the car. But can I get a better result by using relations that I already know about the attributes?

For example, assume that I don't know how much horsepower the engine produces, but I do know attributes that correlate with it, such as the engine displacement, number of cylinders, and number of gears in the transmission. Although I don't know the horsepower, assume that I can roughly calculate it from these parameters. Doesn't it make more sense to isolate these attributes from the others and use them exclusively to build a model for engine horsepower, whose output is then supplied to a higher-layer learner that tries to figure out how horsepower and other factors affect an automobile's price?

Obviously, if I have no idea how the attributes relate, it's probably better to just supply them all to one learning algorithm. But if I know how certain attributes relate, it seems better to isolate the attributes into groups, build a model for what each group represents, and then use these sub-models to train another model. This would be a hierarchy of learning: detailed attributes (number of cylinders, engine displacement, gears in transmission) predict higher-level attributes (horsepower, torque), which finally predict the price of the auto (from horsepower, quality of interior, car maker's reputation, etc.). Is this a good approach? The idea is to use information about relationships that I already know to direct the learning process.
Second question: what if I don't know how to calculate horsepower from those low-level attributes, and only know that those attributes are related?

Answers

  • BalazsBaranyRM
    BalazsBaranyRM New Altair Community Member
    Hello!

    If you are predicting the price as an amount, that is a regression problem. So you can only use learners that support a numerical label, which excludes classification-only learners such as a classification decision tree. Of course, you could put your numeric target variable into classes like "< 15,000", "15,000 - 30,000" etc., so that you have a classification problem and can use most learners.
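
    As a rough illustration of the binning idea (outside RapidMiner), here is a minimal Python sketch. The function name `price_to_class`, the third class "> 30,000", and the sample prices are assumptions made for the example.

```python
def price_to_class(price: float) -> str:
    """Map a numeric price to a nominal class label.

    The first two ranges follow the reply above; the third label is an
    assumed continuation of the "etc.".
    """
    if price < 15_000:
        return "< 15,000"
    if price <= 30_000:
        return "15,000 - 30,000"
    return "> 30,000"


# Hypothetical prices, only to show the mapping.
prices = [9_500, 15_000, 22_750, 41_000]
classes = [price_to_class(p) for p in prices]
print(classes)
```

    The price of the conversion is that the learner no longer sees how far apart two prices within the same class are.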

    Your idea with predicting more attributes in sub-models is interesting. Would you need those attributes later? If not, you can always experiment with variable selection or use a learner that selects the best variables itself.

    It is never certain that additional attributes help. One can only try. In your example, if you have attributes that correlate with higher horsepower, and cars with more horsepower are pricier, those attributes will already have a positive effect on the predicted price. By predicting a helper attribute from them, you might just introduce noise or redundancy into the model.

    With the "Select Subprocess" operator, you can always create alternative paths in your process and put them, e.g., into a parameter optimization in order to see how your submodels perform versus no submodels.
  • Ghostrider
    Ghostrider New Altair Community Member
    Hi balazsb,
    Welcome to the forums! 

    If I know that two or more attributes are related, but are completely independent of the rest of the attributes, I'd like to isolate those attributes from the others to help guide the learning algorithm (and reduce the complexity of the problem through a divide-and-conquer approach).  I think it might be possible to do this by training a model that takes input from a sub-model, but it's not clear how the sub-model(s) would be trained without having labels for the sub models.

    Another example that I thought of while watching the Neural Market Trend tutorials linked from the RM homepage is that often in predicting time series, preceding days are simply treated as another attribute using the time series window operator -- one attribute will be the current value, another attribute will be the value from the previous example, and another attribute will be the value from two examples ago.  But doing so seems like such a waste.  If I was asking another human to look for trends in data, it would certainly be useful to know that attribute 1 was taken on Wednesday, attribute 2 was taken on Tuesday, and attribute 3 was taken on Monday rather than essentially telling the learning algorithm, "here's 3 values from this example, look for a pattern".

    Point is, is there some way we can use knowledge about the problem to guide and improve the efficiency of the learning process?  If so, are there books or good references describing such techniques?  As a newbie to data mining, I think I'd really benefit.
  • BalazsBaranyRM
    BalazsBaranyRM New Altair Community Member
    Ghostrider wrote: "I'd like to isolate those attributes from the others to help guide the learning algorithm"
    There are steps in RapidMiner to determine attribute weights using different methods. Use them to rank your attributes by their contribution to the end result.
    Ghostrider wrote: "but it's not clear how the sub-model(s) would be trained without having labels for the sub models"
    It should be possible to find out things like the car's horsepower for a number of cases (e. g. 50) and use those to build your submodel for the rest of the data.
    Ghostrider wrote: "But doing so seems like such a waste.  If I was asking another human to look for trends in data, it would certainly be useful to know that attribute 1 was taken on Wednesday"
    You never know which attributes will be "waste" and which will be the most significant. In time series, the value of the last day, independent of its weekday, is surely a huge predictor. If you suspect that the weekday can play a role, you are always welcome to create an attribute for it.

    In RapidMiner, just create a copy of your date attribute with "Generate Copy" and then extract the desired property by converting the new attribute with "Date to Nominal".
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.0.10" expanded="true" name="Process">
        <process expanded="true" height="116" width="480">
          <operator activated="true" class="generate_sales_data" compatibility="5.0.10" expanded="true" height="60" name="Generate Sales Data" width="90" x="45" y="30"/>
          <operator activated="true" class="generate_copy" compatibility="5.0.10" expanded="true" height="76" name="Generate Copy" width="90" x="179" y="30">
            <parameter key="attribute_name" value="date"/>
            <parameter key="new_name" value="weekday"/>
          </operator>
          <operator activated="true" class="date_to_nominal" compatibility="5.0.10" expanded="true" height="76" name="Date to Nominal" width="90" x="313" y="30">
            <parameter key="attribute_name" value="weekday"/>
            <parameter key="date_format" value="E"/>
          </operator>
          <connect from_op="Generate Sales Data" from_port="output" to_op="Generate Copy" to_port="example set input"/>
          <connect from_op="Generate Copy" from_port="example set output" to_op="Date to Nominal" to_port="example set input"/>
          <connect from_op="Date to Nominal" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
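
    For readers without RapidMiner at hand, the same idea (derive a weekday attribute from the date and keep it next to the windowed values) can be sketched in plain Python. The helper names `add_weekday` and `window`, the attribute names, and the five-day series are assumptions made for the example.

```python
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def add_weekday(rows):
    """Return copies of the rows with a nominal 'weekday' attribute derived from 'date'."""
    return [{**row, "weekday": WEEKDAYS[row["date"].weekday()]} for row in rows]


def window(rows, size=3):
    """Build windowed examples: the previous `size` values become attributes,
    and the weekday of the day to be predicted is kept as an extra attribute."""
    examples = []
    for i in range(size, len(rows)):
        example = {f"value-{k}": rows[i - k]["value"] for k in range(1, size + 1)}
        example["weekday"] = rows[i]["weekday"]
        example["label"] = rows[i]["value"]
        examples.append(example)
    return examples


# Hypothetical daily series starting on Monday, 2010-11-01.
start = date(2010, 11, 1)
series = [{"date": start + timedelta(days=d), "value": v}
          for d, v in enumerate([10.0, 12.0, 11.0, 13.0, 14.0])]

examples = window(add_weekday(series), size=3)
for example in examples:
    print(example)
```

    Whether the weekday attribute actually helps is something only a validation run on real data can show.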
    Ghostrider wrote: "is there some way we can use knowledge about the problem to guide and improve the efficiency of the learning process?"
    Of course, if you can express this knowledge in the data or in the process.

    Examples:
    • You could introduce a numerical attribute like "manufacturer image" and assign values to it.
    • If you know that two attributes have a numerical relation, you can use "Generate Attributes" to express that relation, e. g. "(att1+att2)/att2"
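
    Both examples amount to adding derived columns before learning. A minimal Python sketch of that step follows; the helper `generate_attribute`, the brand names, and the image scores are hypothetical and only illustrate the mechanics.

```python
def generate_attribute(rows, new_name, func):
    """Return copies of the rows with one extra attribute computed from existing ones."""
    return [{**row, new_name: func(row)} for row in rows]


# Hypothetical hand-assigned scores expressing domain knowledge.
MANUFACTURER_IMAGE = {"BrandA": 0.9, "BrandB": 0.4}

rows = [
    {"manufacturer": "BrandA", "att1": 2.0, "att2": 4.0},
    {"manufacturer": "BrandB", "att1": 3.0, "att2": 1.0},
]

# Known numerical relation between two attributes (att2 must be non-zero).
rows = generate_attribute(rows, "ratio", lambda r: (r["att1"] + r["att2"]) / r["att2"])
# Known background information encoded as a numerical attribute.
rows = generate_attribute(rows, "image", lambda r: MANUFACTURER_IMAGE[r["manufacturer"]])

for row in rows:
    print(row)
```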
  • Ghostrider
    Ghostrider New Altair Community Member
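    [Image: diagram of three attribute groups feeding sub-models for horsepower, manufacturer's image, and interior quality, which in turn feed the Price of Car model]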

    Above is an image of my idea.  Horsepower, manufacturer's image, and interior quality are all qualities that determine the cost of a car.  Each is determined by its own group of attributes.  My question is whether there is any advantage to separating the 3 groups of attributes (assume I know that Att1, Att2, and Att3 are only good for predicting horsepower and have no correlation with the other two categories, mfg. image and interior quality), or whether it would be just as well to feed them all into the Price of Car model directly.  It seems like the learning algorithm would have an easier time with the first case.
  • BalazsBaranyRM
    BalazsBaranyRM New Altair Community Member
    Ghostrider wrote:

    It seems like the learning algorithm would have an easier time with the first case.
    Maybe, maybe not. Most algorithms are quite capable of finding the relevant attributes and their combinations on their own.

    If you work with a two-level model like you described, you either need to find an algorithmic approach (generating attributes as described earlier) or gather "label" data for the sub-models, train those submodels, and then integrate their predictions as additional attributes for the "big" model. But there is always the danger of introducing more noise into the model with this approach.
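
    The second route (gather labels for a few cases, train a sub-model, feed its prediction to the main model) could look roughly like this in plain Python. It is only a sketch: it assumes a single input (engine displacement), a one-variable least-squares line as the sub-model, and made-up numbers; the names `fit_line` and `predicted_hp` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


# Hypothetical cars whose horsepower was looked up by hand:
# (displacement in litres, horsepower).
labelled = [(1.2, 70.0), (1.6, 100.0), (2.0, 130.0), (3.0, 205.0)]
a, b = fit_line([d for d, _ in labelled], [hp for _, hp in labelled])

# Rest of the data: horsepower unknown, so the sub-model's prediction
# is added as an extra attribute for the price model.
cars = [{"displacement": 1.4, "doors": 4}, {"displacement": 2.5, "doors": 2}]
for car in cars:
    car["predicted_hp"] = a * car["displacement"] + b
    print(car)
```

    Any error the sub-model makes is passed on to the price model through `predicted_hp`, which is the noise risk mentioned above.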

    Try visualising your data by plotting the target attribute (price) against the different attributes. If the attributes neatly separate the cases into clusters of differently-colored objects in the graph, the learning algorithms should be able to do that, too.

    Just try a few learners in a Cross-Validation (X-Validation) and see how they perform. If their performance is not good enough, you can start building more complex models until you get the desired accuracy (if that is possible at all).
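
    To make the "try a few learners in a cross-validation" advice concrete outside RapidMiner, here is a small plain-Python sketch. It assumes two toy learners (a mean baseline and a one-variable least-squares line), mean absolute error as the performance measure, and invented data; none of this comes from the thread.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size (no shuffling,
    to keep the example deterministic)."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(fit, xs, ys, k=3):
    """Mean absolute error of the learner `fit`, averaged over all held-out examples."""
    errors = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        predict = fit(train_x, train_y)
        errors += [abs(predict(xs[i]) - ys[i]) for i in test_idx]
    return sum(errors) / len(errors)


def fit_mean(xs, ys):
    """Baseline learner: always predict the mean label of the training data."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """One-variable least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b


# Hypothetical (horsepower, price) pairs.
hp = [70, 85, 100, 130, 167, 205]
price = [9000, 11000, 12500, 16000, 21000, 25500]

mae_mean = cross_validate(fit_mean, hp, price)
mae_linear = cross_validate(fit_linear, hp, price)
print(f"mean baseline MAE: {mae_mean:.0f}, linear MAE: {mae_linear:.0f}")
```

    The same harness can be reused to compare a model with sub-model attributes against one without them, as long as both are scored on the same folds.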