Predicting Unknowns from Knowns via Supervised

Hi,

We are trying to build a revenue-assurance predictive model to identify possible electricity theft. Our approach is to take the known cases (hourly reads from meters with confirmed theft) and predict whether any other meters follow similar usage patterns (anomaly detection and pattern matching to fraud).

We have around 400 known theft meters and 110k unknown meters, so the known cases are a very small fraction of the example set we need to match them against. I have tried k-NN, GBT and Naive Bayes, tracking performance with "Performance (Binominal Classification)", i.e. LABEL = FRAUD = TRUE/FALSE. I also tried SVM, as recommended by most research papers, and its performance was terrible. I am now trying parameter optimization, and it has been running for 2 days :-(
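To make the imbalance concrete, here is a minimal Python sketch (outside RapidMiner) using the two counts from this post. It assumes, for illustration only, that every unknown meter is treated as a negative, and the confusion-matrix counts in the second half are made up. The point is that plain accuracy says very little at this class ratio, so precision and recall on the theft class are the numbers worth tracking.

```python
# Counts taken from the post; everything else below is illustrative.
known_theft = 400
unknown = 110_000
total = known_theft + unknown

# Share of examples that are known theft.
base_rate = known_theft / total

# A model that labels every meter "not theft" is right on all unknowns
# (assuming they are negatives), so plain accuracy looks excellent.
accuracy_all_negative = unknown / total


def precision_recall(tp, fp, fn):
    """Return (precision, recall) for the positive (theft) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical outcome: 300 of 400 thefts caught, 2,000 false alarms.
precision, recall = precision_recall(tp=300, fp=2000, fn=100)

print(f"base rate: {base_rate:.4f}")
print(f"accuracy of 'always not theft': {accuracy_all_negative:.4f}")
print(f"precision: {precision:.3f}, recall: {recall:.3f}")
```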

Below are my questions:

(1) What would be the best supervised machine learning algorithms for this kind of classification problem?

(2) Also, how do we feed the confirmed false-positive meters back to the model as "not theft", so that the model is refined, starts treating them as not theft, and yields better predictions? I would appreciate it if you could share a sample process showing how to feed results back to a model.

Thx for the valuable input.

Telcontar120

You may want to try the one-class SVM approach instead and focus on the characteristics of the known fraud cases. There is a related thread, with a link to a sample process, that you should review: https://community.rapidminer.com/t5/Getting-Started-Forum/One-class-label-learning/m-p/44038#M1350

sunnyal

Thank you. How different is one-class as opposed to C-SVC or radial? The current problem with the other SVM types is that they are terribly slow.

Telcontar120

I suspect the current SVM is so slow because of the large number of examples in the "unknown" class. If you train on only the "known" class, which is much smaller, the SVM algorithm will be much faster.
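As a rough back-of-the-envelope for this point, the sketch below counts the entries of a full kernel matrix for both training set sizes. It assumes one training example per meter and uses the counts from the first post. Actual LibSVM training time also depends on caching, the kernel and the parameters, so treat this as an order-of-magnitude illustration rather than a timing prediction.

```python
# Assumption: one training example per meter, counts from the first post.
n_all = 400 + 110_000   # known theft + unknown meters
n_known = 400           # known theft only


def kernel_pairs(n):
    """Number of entries in a full n-by-n kernel matrix."""
    return n * n


ratio = kernel_pairs(n_all) / kernel_pairs(n_known)
print(f"pairwise kernel evaluations shrink by a factor of about {ratio:,.0f}")
```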

Thomas_Ott

What @Telcontar120 said. Focus on training the 'knowns' and go from there.

sgenzer

sunnyal

Thank you guys. I liked Rumsfeld analogy :-)

I trained on the "Knowns" (True) with C-SVC and then tested on the "Unknowns" (False), and it just predicted everything as True. Misery...

I wanted to try "one-class", but the SVM operator complains that binominal (True/False) and numerical (1/0) labels are not supported.

How do we define a label as "one class"? See my attached process.

Attached sample data

Thomas_Ott

@sunnyal After loading your sample data, you can do something like this. With the "one class" application you train the model on the knowns only and exclude the other class completely. Then, when it scores, it reports how far inside or outside each example is from what it was trained on.

Note that this is just a sample template; I think you're going to have to do some feature generation to make it better. Just make sure to set your Meters to an ID role.

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.002">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.6.002" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.6.002" expanded="true" height="68" name="Retrieve Electric Fraud Sample Data" width="90" x="45" y="34">
        <parameter key="repository_entry" value="../data/Electric Fraud Sample Data"/>
      </operator>
      <operator activated="true" class="nominal_to_date" compatibility="7.6.002" expanded="true" height="82" name="Nominal to Date" width="90" x="179" y="34">
        <parameter key="attribute_name" value="DIM_DT_ID"/>
        <parameter key="date_type" value="date_time"/>
        <parameter key="date_format" value="yyyy-MM-dd"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="7.6.002" expanded="true" height="82" name="Set Label (2)" width="90" x="313" y="34">
        <parameter key="attribute_name" value="METER"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles">
          <parameter key="METER" value="id"/>
        </list>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="7.6.002" expanded="true" height="103" name="Filter Examples" width="90" x="514" y="136">
        <parameter key="invert_filter" value="true"/>
        <list key="filters_list">
          <parameter key="filters_entry_key" value="FAULT_INDICATOR.equals.FALSE"/>
        </list>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.6.002" expanded="true" height="82" name="Select Attributes (2)" width="90" x="916" y="187">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="FAULT_INDICATOR"/>
        <parameter key="invert_selection" value="true"/>
      </operator>
      <operator activated="true" class="guess_types" compatibility="7.6.002" expanded="true" height="82" name="Guess Types" width="90" x="648" y="34"/>
      <operator activated="true" class="set_role" compatibility="7.6.002" expanded="true" height="82" name="Set Label (3)" width="90" x="782" y="34">
        <parameter key="attribute_name" value="FAULT_INDICATOR"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles">
          <parameter key="METER" value="id"/>
        </list>
      </operator>
      <operator activated="true" class="support_vector_machine_libsvm" compatibility="7.6.002" expanded="true" height="82" name="SVM" width="90" x="916" y="34">
        <parameter key="svm_type" value="one-class"/>
        <parameter key="gamma" value="0.001"/>
        <list key="class_weights"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="7.6.002" expanded="true" height="82" name="Apply Model" width="90" x="1117" y="85">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Retrieve Electric Fraud Sample Data" from_port="output" to_op="Nominal to Date" to_port="example set input"/>
      <connect from_op="Nominal to Date" from_port="example set output" to_op="Set Label (2)" to_port="example set input"/>
      <connect from_op="Set Label (2)" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Guess Types" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="original" to_op="Select Attributes (2)" to_port="example set input"/>
      <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Guess Types" from_port="example set output" to_op="Set Label (3)" to_port="example set input"/>
      <connect from_op="Set Label (3)" from_port="example set output" to_op="SVM" to_port="training set"/>
      <connect from_op="SVM" from_port="model" to_op="Apply Model" to_port="model"/>
      <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
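
For readers who want to see the "train on the knowns only, then score inside/outside" idea without RapidMiner, here is a minimal Python sketch. It is deliberately not an SVM: it swaps in a much simpler standardized distance-to-centroid novelty score as a stand-in for the same workflow. The two features (mean kWh and hourly standard deviation), the data values and the 0.95 quantile are all hypothetical.

```python
from statistics import mean, pstdev


def _distance(row, mu, sigma):
    """Euclidean distance from the centroid in standardized units."""
    return sum(((x - m) / s) ** 2 for x, m, s in zip(row, mu, sigma)) ** 0.5


def fit_one_class(rows, quantile=0.95):
    """Fit on the known class only: per-feature mean/std plus a distance cut-off."""
    cols = list(zip(*rows))
    mu = [mean(c) for c in cols]
    sigma = [pstdev(c) or 1.0 for c in cols]  # fall back to 1.0 for constant features
    dists = sorted(_distance(r, mu, sigma) for r in rows)
    # Cut-off chosen so that roughly `quantile` of the training rows fall inside.
    cutoff = dists[min(len(dists) - 1, int(quantile * len(dists)))]
    return {"mu": mu, "sigma": sigma, "cutoff": cutoff}


def score(model, row):
    """Return (label, margin); a non-negative margin means inside the learned region."""
    margin = model["cutoff"] - _distance(row, model["mu"], model["sigma"])
    return ("inside" if margin >= 0 else "outside"), margin


# Hypothetical known-theft profiles: [mean kWh, hourly std].
known_theft = [[2.0, 0.30], [2.2, 0.20], [1.8, 0.25], [2.1, 0.35], [1.9, 0.30]]
model = fit_one_class(known_theft)

print(score(model, [2.05, 0.30]))  # resembles the known theft profiles
print(score(model, [12.0, 2.50]))  # very different usage pattern
```

As in the process above, the other class never enters training; it only appears at scoring time. Standardizing the features before measuring distance matters here because the two features are on very different scales.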

sunnyal

Tom,

Thank you. After modifying my design as per the sample, I get all 400k examples treated as "outside". I guess SVM isn't doing the right thing for me. When I use Naive Bayes or GBT I do get some predictions, but with way too many false positives.

To further refine my other working models, is there a way to feed the confirmed false-positive meters back to the model as additional input data (labelled not theft / false positive), so that the model is refined, starts treating these as not theft, and yields better predictions?

Thx
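
One common way to do the feedback asked about here is simply to add the investigated meters to the training data with their confirmed label and retrain. In RapidMiner terms that would presumably mean appending those examples to the training ExampleSet before the learner. The Python sketch below illustrates the idea with a 1-nearest-neighbour classifier chosen only because it is short; the features ([mean kWh, hourly std]) and all data values are hypothetical.

```python
def nearest_label(training, features):
    """1-nearest-neighbour: return the label of the closest training example."""
    def sq_dist(other):
        return sum((a - b) ** 2 for a, b in zip(features, other))

    _, label = min(training, key=lambda example: sq_dist(example[0]))
    return label


def add_feedback(training, investigated):
    """Return a new training set extended with field-confirmed outcomes."""
    return training + list(investigated)


# Hypothetical training data: (features, is_theft).
training = [
    ([2.0, 0.30], True),
    ([1.8, 0.20], True),
    ([2.2, 0.25], True),
    ([12.0, 2.5], False),
    ([10.0, 2.0], False),
]

# A flat-line meter that looks similar to the theft profiles.
candidate = [3.1, 0.12]
before = nearest_label(training, candidate)

# Field investigation confirmed that similar flat-line meters were not theft.
investigated = [([3.0, 0.10], False), ([2.9, 0.08], False)]
updated = add_feedback(training, investigated)
after = nearest_label(updated, candidate)

print(f"before feedback: {before}, after feedback: {after}")
```

The same pattern applies to Naive Bayes or GBT: the confirmed negatives become ordinary labelled rows, and the model is rebuilt on the extended set.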

MartinLiebig

Hi,

What you describe is Boosting. This is the technique GBTs are using internally.

Did you run a grid optimization for GBT and SVM? Which kernels did you try?

Best,

Martin

sunnyal

Hi Martin,

Thanks for your note.

Yes, I tried optimizing parameters for SVM and it didn't yield much benefit. I used the RBF kernel and tried optimizing gamma and C, but the run went on for 2 days and was still going. I tried limiting the example set and optimizing on only the actual known theft cases, and yet the results were terrible. I also tried GBT, with no better results. Can you suggest which parameters, and which value ranges, one should optimize for GBT? Naive Bayes yielded better results than any other learner: it flagged a few meters with flat-line power consumption (which are plausible candidates). However, all of them turned out to be false positives when we actually investigated those homes. Is there any way we can feed these false positives back to the NB or GBT model so that it stops treating these meters as positives?

Thanks for your support
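
One likely contributor to multi-day optimization runs is that a grid search multiplies quickly: every parameter combination is trained once per validation fold. The Python sketch below only counts the fits for a hypothetical grid over C and gamma. The values and the fold count are placeholders, not recommendations.

```python
from itertools import product

# Hypothetical grid; the values are placeholders, not recommendations.
grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.0001, 0.001, 0.01, 0.1],
}
folds = 10  # assumed number of cross-validation folds

# Every combination of parameter values, as a list of dicts.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
total_fits = len(combinations) * folds

print(f"{len(combinations)} combinations x {folds} folds = {total_fits} model fits")
```

Multiplying the number of fits by the time a single fit takes on the full example set gives a quick estimate of the total run time before starting an optimization.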