How to process categorical type data using unsupervised algorithm in anomaly detection?

liming New Altair Community Member
edited November 5 in Community Q&A
I've run into a problem in anomaly detection. Most methods rely on a distance measured between instances, but my dataset contains categorical features. I see three options. First, I could drop the categorical features, but I think they carry useful information. Second, I could convert them to numeric values with sklearn's LabelEncoder, but the resulting integer codes impose an arbitrary order, so they don't correspond to a meaningful distance measure. Third, I could use sklearn's OneHotEncoder, but that increases the number of feature dimensions, which I think will affect clustering.
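
To make the second and third options concrete, here is a minimal sketch in plain Python (it does not call sklearn; the colour values and helper names are made up for illustration). It compares Euclidean distances between three unordered categories under integer label codes and under one-hot codes.

```python
import math
from itertools import combinations

# Hypothetical categorical feature with three unordered values.
categories = ["red", "green", "blue"]

# Label encoding: each category gets an arbitrary integer (red=0, green=1, blue=2).
label_codes = {c: [float(i)] for i, c in enumerate(categories)}

# One-hot encoding: each category gets its own 0/1 column.
one_hot_codes = {
    c: [1.0 if i == j else 0.0 for j in range(len(categories))]
    for i, c in enumerate(categories)
}

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise(codes):
    """Distance between every pair of categories under the given encoding."""
    return {
        (a, b): euclidean(codes[a], codes[b])
        for a, b in combinations(categories, 2)
    }

label_dist = pairwise(label_codes)
one_hot_dist = pairwise(one_hot_codes)

for pair in label_dist:
    print(pair, "label:", label_dist[pair], "one-hot:", round(one_hot_dist[pair], 3))
```

With label codes, "red" ends up twice as far from "blue" as from "green" purely because of the order in which the integers were assigned. With one-hot codes every pair of distinct categories is equally far apart, at the cost of one extra column per category.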

Answers

  • varunm1, New Altair Community Member
    Hello @liming

    The general preference is to one-hot encode. Yes, that increases the number of feature dimensions, but you can apply PCA to the encoded features to reduce them again. If that doesn't work well, you can cluster with k-modes, which is designed for categorical features, or with its k-prototypes extension, which handles mixed categorical and numeric features. Both have Python implementations.

    K-modes: http://www.cs.ust.hk/~qyang/Teaching/537/Papers/huang98extensions.pdf

    Thanks
  • YYH, Altair Employee
    Have you tried the Anomaly Detection extension from the RapidMiner Marketplace? As far as I know, the k-NN Global Anomaly Score operator can use nominal measures to calculate nearest-neighbor distances. The LOF outlier detector is similar. If you want to use PCA for anomaly scores, you will need to convert nominal attributes to numerical ones first. Here is an example applied to the Titanic data:
    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.2.001" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="112" y="34">
            <parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="9.2.001" expanded="true" height="145" name="Multiply" width="90" x="313" y="34"/>
          <operator activated="true" class="nominal_to_numerical" compatibility="9.2.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="514" y="493">
            <parameter key="return_preprocessing_model" value="false"/>
            <parameter key="create_view" value="false"/>
            <parameter key="attribute_filter_type" value="all"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="coding_type" value="dummy coding"/>
            <parameter key="use_comparison_groups" value="false"/>
            <list key="comparison_groups"/>
            <parameter key="unexpected_value_handling" value="all 0 and warning"/>
            <parameter key="use_underscore_in_name" value="false"/>
          </operator>
          <operator activated="true" class="anomalydetection:Robust Principal Component Analysis Anomaly Score (rPCA)" compatibility="2.4.001" expanded="true" height="68" name="Robust Principal Component Analysis Anomaly Score (rPCA)" width="90" x="648" y="493">
            <parameter key="probability_for_normal_class" value="0.975"/>
            <parameter key="component_usage" value="use all components"/>
            <parameter key="major_components" value="use variance threshold"/>
            <parameter key="cumulative_variance" value="0.5"/>
            <parameter key="number_of_major_pcs" value="1"/>
            <parameter key="minor_components" value="use max eigenvalue"/>
            <parameter key="eigenvalue_threshold_max" value="0.2"/>
            <parameter key="number_of_minor_pcs" value="1"/>
          </operator>
          <operator activated="true" class="anomalydetection:One-Class LIBSVM Anomaly Score" compatibility="2.4.001" expanded="true" height="82" name="One-Class LIBSVM Anomaly Score" width="90" x="648" y="340">
            <parameter key="svm_type" value="one-class"/>
            <parameter key="svm_kernel_type" value="rbf"/>
            <parameter key="degree" value="3"/>
            <parameter key="automatic gamma tuning" value="true"/>
            <parameter key="gamma" value="0.0"/>
            <parameter key="coef0" value="0.0"/>
            <parameter key="nu" value="0.5"/>
            <parameter key="beta" value="0.5"/>
            <parameter key="lambda" value="0.001"/>
            <parameter key="epsilon" value="0.001"/>
            <parameter key="cache_size" value="80"/>
            <parameter key="shrinking" value="true"/>
          </operator>
          <operator activated="true" class="anomalydetection:k-NN Global Anomaly Score" compatibility="2.4.001" expanded="true" height="103" name="k-NN Global Anomaly Score" width="90" x="648" y="34">
            <parameter key="k" value="10"/>
            <parameter key="use k-th neighbor distance only (no average)" value="false"/>
            <parameter key="measure_types" value="MixedMeasures"/>
            <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
            <parameter key="nominal_measure" value="NominalDistance"/>
            <parameter key="numerical_measure" value="EuclideanDistance"/>
            <parameter key="divergence" value="GeneralizedIDivergence"/>
            <parameter key="kernel_type" value="radial"/>
            <parameter key="kernel_gamma" value="1.0"/>
            <parameter key="kernel_sigma1" value="1.0"/>
            <parameter key="kernel_sigma2" value="0.0"/>
            <parameter key="kernel_sigma3" value="2.0"/>
            <parameter key="kernel_degree" value="3.0"/>
            <parameter key="kernel_shift" value="1.0"/>
            <parameter key="kernel_a" value="1.0"/>
            <parameter key="kernel_b" value="0.0"/>
            <parameter key="parallelize evaluation process" value="false"/>
            <parameter key="number of threads" value="4"/>
          </operator>
          <operator activated="true" class="detect_outlier_lof" compatibility="9.2.001" expanded="true" height="82" name="Detect Outlier (LOF)" width="90" x="648" y="187">
            <parameter key="minimal_points_lower_bound" value="10"/>
            <parameter key="minimal_points_upper_bound" value="20"/>
            <parameter key="distance_function" value="euclidian distance"/>
          </operator>
          <connect from_op="Retrieve Titanic Training" from_port="output" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="k-NN Global Anomaly Score" to_port="example set"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Detect Outlier (LOF)" to_port="example set input"/>
          <connect from_op="Multiply" from_port="output 3" to_op="One-Class LIBSVM Anomaly Score" to_port="example set"/>
          <connect from_op="Multiply" from_port="output 4" to_op="Nominal to Numerical" to_port="example set input"/>
          <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Robust Principal Component Analysis Anomaly Score (rPCA)" to_port="example set input"/>
          <connect from_op="Robust Principal Component Analysis Anomaly Score (rPCA)" from_port="example set output" to_port="result 4"/>
          <connect from_op="One-Class LIBSVM Anomaly Score" from_port="example set" to_port="result 3"/>
          <connect from_op="k-NN Global Anomaly Score" from_port="example set" to_port="result 1"/>
          <connect from_op="Detect Outlier (LOF)" from_port="example set output" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
          <portSpacing port="sink_result 5" spacing="0"/>
        </process>
      </operator>
    </process>
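
For readers working in Python rather than RapidMiner, the idea behind the k-NN branch of the process above (a mixed distance measure, with the score taken as the average distance to the k nearest neighbors) can be sketched roughly as follows. This is only an illustration under stated assumptions: the distance here treats a nominal mismatch as 1 and a match as 0 and adds squared differences for numeric values, which may not be exactly what the extension's MixedEuclideanDistance computes. The toy rows and the function names are hypothetical, and the numeric columns are assumed to be already scaled to [0, 1].

```python
import math

def mixed_distance(a, b, nominal_idx):
    """Euclidean-style distance over mixed rows (assumed definition):
    nominal columns contribute 0 on a match and 1 on a mismatch,
    numeric columns contribute their squared difference."""
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in nominal_idx:
            total += 0.0 if x == y else 1.0
        else:
            total += (x - y) ** 2
    return math.sqrt(total)

def knn_scores(rows, nominal_idx, k):
    """Anomaly score per row: average distance to its k nearest neighbors."""
    scores = []
    for i, row in enumerate(rows):
        dists = sorted(
            mixed_distance(row, other, nominal_idx)
            for j, other in enumerate(rows)
            if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical toy rows: (scaled age, scaled fare, sex, passenger class).
rows = [
    (0.30, 0.10, "male", "third"),
    (0.32, 0.12, "male", "third"),
    (0.28, 0.09, "male", "third"),
    (0.31, 0.11, "female", "third"),
    (0.90, 0.95, "female", "first"),
]

scores = knn_scores(rows, nominal_idx={2, 3}, k=2)
for row, score in zip(rows, scores):
    print(row, round(score, 3))
```

In this toy data the last row gets the highest score because it differs from its neighbors in both the numeric and the nominal columns. No one-hot encoding is needed, since the nominal columns enter the distance directly.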