
subsetting in execute R process

User: "juju"
New Altair Community Member
Updated by Jocelyn

Hi, basic subsetting of a data frame does not seem to work when executing R in RapidMiner.

I ran this R code in RStudio, and it yields the correct dimensions after subsetting:

cat('dimension of training x:', dim(as.data.frame(dat[train, x])), '\n')
cat('dimension of training y:', dim(as.data.frame(dat[train, y])), '\n')
# dimension of training x: 138 60
# dimension of training y: 138 1

Then I pasted the same R code into an Execute R process in RapidMiner, and it yields:

Jul 7, 2016 4:57:43 PM INFO: dimension of training x: 62 1 

Jul 7, 2016 4:57:43 PM INFO: dimension of training y: 62 1

I printed many other values to debug, and everything else is identical in the two cases (RStudio vs. RapidMiner). See the complete output below.

Sonar data (208 rows * 61 columns) is used in both cases.

 

--------

 

Complete R code:

 

 

library(mlbench)

rm_main = function(dat, in_rapidminer = T){

  cat('Starting R script now ...\n')


  # find columns of x (attribute) and y (response) ####
  if(in_rapidminer){
    meta = melt(metaData)
    meta$L1 = NULL
    names(meta) = c('value', 'variable', 'column')

    meta = dcast(meta, formula = column ~ variable)
    print(meta)

    y_name = meta[meta$role %in% 'label', 'column']
    x_name = meta[meta$role %in% 'attribute', 'column']

    y = names(dat) %in% y_name
    x = names(dat) %in% x_name
  } else {
    # in R manually specify it:
    y_name = 'Class'
    y = names(dat) %in% y_name
    x = ! names(dat) %in% c(y_name, 'pred_prob', 'pred')
  }

  cat('y column:', which(y), '\n')
  cat('x column(s):', which(x), '\n')

  cat('dimension of data:', dim(dat), '\n')


  # encode y (only work for binary) ####
  f1 = paste0('~', y_name, '- 1')
  dat[[y_name]] = model.matrix(as.formula(f1), data = dat)[ , 1]


  # ####
  n_row = nrow(dat)
  n_fold = 3

  set.seed(123)

  group = (seq_len(n_row) - 1) %% n_fold + 1
  group = sample(group) # random permutation
  print(table(group))

  # n_fold CV ####
  for(ii in seq_len(n_fold)){

    cat('CV round', ii, '\n')
    train = group != ii
    cat('dimension of data:', dim(dat), '\n')
    cat('how many rows in training set:', sum(train), '\n')

    cat('dimension of training x:', dim(as.data.frame(dat[train, x])), '\n')
    cat('dimension of training y:', dim(as.data.frame(dat[train, y])), '\n')

  }

  return(1)
}


data(Sonar)
dat = Sonar

rm_main(dat, in_rapidminer = F)

Complete output:

 

Starting R script now ...
y column: 61
x column(s): 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
dimension of data: 208 61
group
1 2 3
70 69 69
CV round 1
dimension of data: 208 61
how many rows in training set: 138
dimension of training x: 138 60
dimension of training y: 138 1
CV round 2
dimension of data: 208 61
how many rows in training set: 139
dimension of training x: 139 60
dimension of training y: 139 1
CV round 3
dimension of data: 208 61
how many rows in training set: 139
dimension of training x: 139 60
dimension of training y: 139 1

(Highlight in red by me)

 

 

 

Complete Rapidminer code:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.4.000">
<context>
<input>
<location>//_your_path_/Sonar</location>
</input>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.4.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="multiply" compatibility="6.4.000" expanded="true" height="94" name="Multiply" width="90" x="112" y="120"/>
<operator activated="true" class="r_scripting:execute_r" compatibility="6.4.000" expanded="true" height="76" name="CV" width="90" x="313" y="165">
<parameter key="script" value="rm_main = function(dat, in_rapidminer = T){&#10; &#10; cat('Starting R script now ...\n')&#10; &#10;&#10; # find columns of x (attribute) and y (response) ####&#10; if(in_rapidminer){&#10; meta = melt(metaData)&#10; meta$L1 = NULL&#10; names(meta) = c('value', 'variable', 'column')&#10; &#10; meta = dcast(meta, formula = column ~ variable)&#10; print(meta)&#10; &#10; y_name = meta[meta$role %in% 'label', 'column']&#10; x_name = meta[meta$role %in% 'attribute', 'column']&#10; &#10; y = names(dat) %in% y_name&#10; x = names(dat) %in% x_name&#10; } else {&#10; # in R manually specify it:&#10; y_name = 'Class'&#10; y = names(dat) %in% y_name&#10; x = ! names(dat) %in% c(y_name, 'pred_prob', 'pred')&#10; }&#10; &#10; cat('y column:', which(y), '\n')&#10; cat('x column(s):', which(x), '\n')&#10; &#10; cat('dimension of data:', dim(dat), '\n')&#10; &#10; &#10; # encode y (only work for binary) ####&#10; f1 = paste0('~', y_name, '- 1')&#10; dat[[y_name]] = model.matrix(as.formula(f1), data = dat)[ , 1]&#10; &#10; &#10; # ####&#10; n_row = nrow(dat)&#10; n_fold = 3&#10; &#10; set.seed(123)&#10; &#10; group = (seq_len(n_row) - 1) %% n_fold + 1&#10; group = sample(group) # random permutation&#10; print(table(group))&#10; &#10; # n_fold CV ####&#10; for(ii in seq_len(n_fold)){&#10; &#10; cat('CV round', ii, '\n')&#10; train = group != ii&#10; cat('dimension of data:', dim(dat), '\n')&#10; cat('how many rows in training set:', sum(train), '\n')&#10; &#10; cat('dimension of training x:', dim(as.data.frame(dat[train, x])), '\n')&#10; cat('dimension of training y:', dim(as.data.frame(dat[train, y])), '\n')&#10;&#10; }&#10; &#10; return(1)&#10;}"/>
</operator>
<connect from_port="input 1" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_port="result 1"/>
<connect from_op="Multiply" from_port="output 2" to_op="CV" to_port="input 1"/>
<connect from_op="CV" from_port="output 1" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="90"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<description align="center" color="yellow" colored="false" height="58" resized="true" width="214" x="250" y="243">This R code works with binary (two-class) response only</description>
</process>
</operator>
</process>

Complete log in Rapidminer:

Jul 7, 2016 5:15:25 PM INFO: Starting R script now ...

Jul 7, 2016 5:15:25 PM INFO: column role type

Jul 7, 2016 5:15:25 PM INFO: 1 attribute_1 attribute real

Jul 7, 2016 5:15:25 PM INFO: 2 attribute_10 attribute real

Jul 7, 2016 5:15:25 PM INFO: 3 attribute_11 attribute real

<I omit some lines here for clarity>

Jul 7, 2016 5:15:25 PM INFO: 58 attribute_7 attribute real

Jul 7, 2016 5:15:25 PM INFO: 59 attribute_8 attribute real

Jul 7, 2016 5:15:25 PM INFO: 60 attribute_9 attribute real

Jul 7, 2016 5:15:25 PM INFO: 61 class label nominal

Jul 7, 2016 5:15:25 PM INFO: y column: 61

Jul 7, 2016 5:15:25 PM INFO: x column(s): 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Jul 7, 2016 5:15:25 PM INFO: dimension of data: 208 61

Jul 7, 2016 5:15:25 PM INFO: group

Jul 7, 2016 5:15:25 PM INFO: 1 2 3

Jul 7, 2016 5:15:25 PM INFO: 70 69 69

Jul 7, 2016 5:15:25 PM INFO: CV round 1

Jul 7, 2016 5:15:25 PM INFO: dimension of data: 208 61

Jul 7, 2016 5:15:25 PM INFO: how many rows in training set: 138

Jul 7, 2016 5:15:25 PM INFO: dimension of training x: 61 1

Jul 7, 2016 5:15:25 PM INFO: dimension of training y: 61 1


Jul 7, 2016 5:15:25 PM INFO: CV round 2

Jul 7, 2016 5:15:25 PM INFO: dimension of data: 208 61

Jul 7, 2016 5:15:25 PM INFO: how many rows in training set: 139

Jul 7, 2016 5:15:25 PM INFO: dimension of training x: 61 1

Jul 7, 2016 5:15:25 PM INFO: dimension of training y: 61 1


Jul 7, 2016 5:15:25 PM INFO: CV round 3

Jul 7, 2016 5:15:25 PM INFO: dimension of data: 208 61

Jul 7, 2016 5:15:25 PM INFO: how many rows in training set: 139

Jul 7, 2016 5:15:25 PM INFO: dimension of training x: 61 1

Jul 7, 2016 5:15:25 PM INFO: dimension of training y: 61 1


Jul 7, 2016 5:15:25 PM INFO: Saving results.

Jul 7, 2016 5:15:25 PM INFO: Process

Any help is appreciated. Thanks.

    User: "Andrew2"
    New Altair Community Member
    Accepted Answer

    If you add a line like this near the beginning of the code, it seems to fix the issue.

     

    dat <- as.data.frame(dat)

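    It has something to do with the type of the dat variable. In pure R, it's a data.frame. In RapidMiner, it's also a data.table, and a data.table treats the column argument in dat[train, x] differently from a plain data.frame.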
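
To illustrate the point about the type of dat, here is a minimal sketch, assuming the data.table package is installed. The toy table and the names df, dt, rows, and cols are made up for illustration and are not part of the original script. A data.table evaluates the second argument of `[` as an expression rather than as a column selector, so a logical column mask stored in a variable is not used the way a data.frame would use it. The exact result of the direct call depends on the data.table version, so it is wrapped in tryCatch and no output is claimed for it.

```r
library(data.table)  # assumption: data.table is installed

# Toy data (hypothetical, not the Sonar data from the question)
df <- data.frame(a = 1:6, b = 7:12,
                 label = c("p", "q", "p", "q", "p", "q"))
dt <- as.data.table(df)

# A data.table is ALSO a data.frame, which is why the problem is easy to miss
is_both <- inherits(dt, "data.table") && inherits(dt, "data.frame")

rows <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)  # row mask, like `train`
cols <- names(df) %in% c("a", "b")                # column mask, like `x`

# Plain data.frame: the logical mask selects columns a and b
sub_df <- df[rows, cols]

# data.table: `cols` is evaluated as an expression in j. Depending on the
# data.table version this may return the mask itself or raise an error;
# it is not expected to give the 3 x 2 subset.
direct <- tryCatch(dt[rows, cols], error = function(e) conditionMessage(e))

# Fix 1 (the accepted answer): convert to a plain data.frame first
sub_fixed <- as.data.frame(dt)[rows, cols]

# Fix 2: stay in data.table and select columns by position with with = FALSE
sub_dt <- dt[rows, which(cols), with = FALSE]

cat("data.frame subset:      ", dim(sub_df), "\n")
cat("as.data.frame(dt) subset:", dim(sub_fixed), "\n")
cat("with = FALSE subset:    ", dim(sub_dt), "\n")
```

Converting once at the top of rm_main, as suggested above, keeps the rest of the script unchanged, which is probably the simplest option when the same code also has to run outside RapidMiner.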

     

    Andrew