How to declare missing polynominal values
FabianDi
New Altair Community Member
Hey guys, this is probably an easy one for someone who knows where to look.
I'm currently trying out different methods for imputing missing data.
For this, I took an existing (complete) dataset and deleted values using an R script.
However, when I do this to a polynominal attribute, the dataset still shows 0 missing values, and "?" has just been added to the range of the polynominal attribute. I tried the "Declare Missing Value" operator, but that didn't help (or I didn't use it correctly).
Here is a minimal example of the problem:
<?xml version="1.0" encoding="UTF-8"?><process version="9.8.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.8.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.8.000" expanded="true" height="68" name="Retrieve SouthGermanCredit" width="90" x="45" y="85">
        <parameter key="repository_entry" value="//Local Repository/data/SouthGermanCredit"/>
      </operator>
      <operator activated="true" class="r_scripting:execute_r" compatibility="9.6.000" expanded="true" height="103" name="Delete MCAR" width="90" x="313" y="85">
        <parameter key="script" value="
library(utils)
library(caret)

# Creates a missingness matrix (MCAR) with the given parameters
# Caution: more missing entries than values causes an endless loop
mcarSingle <- function(zeilen, ausfaelle){
  x <- sample(1:zeilen, ausfaelle)
  ausfallvektor <- seq(0, 0, length.out = zeilen)
  for(i in 1:length(x)){
    ausfallvektor[x[i]] <- 1
  }
  return(ausfallvektor)
}

marSingle <- function(daten, spalte, ausfaelle){
  koeffizienten <- sample(0:100, ncol(daten)-1)
  # Missingness probability for each row
  ausfallwert <- seq(0, 0, length.out=nrow(daten))
  for(i in 1:nrow(daten)){
    ausfallwert[i] = (koeffizienten%*%(as.matrix(daten)[i,-spalte]))[1,1]
  }
  # Normalize to sum = 1 - possibly unnecessary
  ausfallwert <- ausfallwert/(sum(ausfallwert))
  # Sample the missing entries depending on the missingness values
  ausfallvektor <- seq(0, 0, length.out = nrow(daten))
  for(i in 1:ausfaelle){
    gezogen <- sample(ausfallwert, 1, prob = ausfallwert)
    indices <- which(ausfallwert==gezogen, arr.ind=TRUE)
    ausfallvektor[indices[1]] <- 1
    ausfallwert[indices[1]] <- 0
  }
  return(ausfallvektor)
}

mcar <- function(zeilen, spalten, ausfaelle){
  matrix <- diag(x = 0, nrow = zeilen, ncol = spalten)
  for(i in 1:ausfaelle){
    # To prevent duplicates, take the detour via 'done'
    done <- FALSE
    while(!done){
      x <- sample(1:zeilen, 1)
      y <- sample(1:spalten, 1)
      if(matrix[x,y] == 0){
        matrix[x,y] <- 1
        done <- TRUE
      }
    }
  }
  return(matrix)
}

# Creates a missingness matrix (MAR) with the given parameters
mar <- function(daten, ausfaelle){
  spalten <- ncol(daten) #21
  zeilen <- nrow(daten) #1000
  vektor <- matrix(0L, nrow = spalten, ncol = spalten-1)
  # Generate missingness models for all variables
  for(i in 1:ncol(daten)){
    vektor[i,] <- sample(0:100, ncol(daten)-1, replace=TRUE)/100
  }
  # Generate missingness values for all entries in the dataset
  ausfallwert <- matrix(0L, nrow = zeilen, ncol = spalten)
  for(i in 1:spalten){
    for(j in 1:zeilen){
      ausfallwert[j,i] = sum(vektor[i,]*(daten[j,-i]))
    }
  }
  # Normalize to sum = 1 - possibly unnecessary
  ausfallwert <- ausfallwert/(sum(ausfallwert))
  # Sample the missing entries depending on the missingness values
  ausfallmatrix <- matrix(0L, nrow = zeilen, ncol = spalten)
  for(i in 1:ausfaelle){
    gezogen <- sample(ausfallwert, 1, prob = ausfallwert)
    indices <- which(ausfallwert==gezogen, arr.ind=TRUE)
    ausfallmatrix[indices[1,1], indices[1,2]] <- 1
    ausfallwert[indices[1,1], indices[1,2]] <- 0
  }
  return(ausfallmatrix)
}

# Randomly replaces data with NA
# The label column is not ignored; remove it beforehand if necessary
deletedatamcar <- function(data, spalte, ausfaelle){
  mcarvector <- mcarSingle(nrow(data), ausfaelle)
  # The replacement can surely be done in one line, but this is easier for me to follow
  for(i in 1:nrow(data)){
    if(mcarvector[i] == 1){
      data[i, spalte] <- NA
    }
  }
  return(data)
}

deletedatamar <- function(data, spalte, ausfaelle){
  marvector <- marSingle(data, spalte, ausfaelle)
  # The replacement can surely be done in one line, but this is easier for me to follow
  for(i in 1:nrow(data)){
    if(marvector[i] == 1){
      data[i, spalte] <- NA
    }
  }
  return(data)
}

rm_main = function(data) {
  return(deletedatamcar(data, 3, 100))
}
"/>
        <parameter key="use_default_R" value="true"/>
        <parameter key="Rscript_executable" value="C:/R/Rscript.exe"/>
        <parameter key="use_default_R_LIBS_paths" value="true"/>
        <enumeration key="R_LIBS_paths"/>
      </operator>
      <connect from_op="Retrieve SouthGermanCredit" from_port="output" to_op="Delete MCAR" to_port="input 1"/>
      <connect from_op="Delete MCAR" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
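The comment inside deletedatamcar remarks that the NA replacement could probably be done in one line. A minimal sketch of that idea follows, assuming a plain data frame as input; the function name delete_mcar_vectorised and the toy data are made up for illustration and are not part of the posted process.

```r
# Hypothetical vectorised variant of deletedatamcar(): draw the rows to blank
# and assign NA to all of them at once instead of looping over every row.
delete_mcar_vectorised <- function(data, spalte, ausfaelle) {
  # sample() without replacement yields distinct row indices
  rows <- sample(seq_len(nrow(data)), ausfaelle)
  data[rows, spalte] <- NA
  data
}

# Toy data frame (invented values) to try the function on
set.seed(2001)
toy <- data.frame(
  id = 1:10,
  moral = rep(c("A", "B"), 5),
  stringsAsFactors = FALSE
)
result <- delete_mcar_vectorised(toy, 2, 3)
print(sum(is.na(result$moral)))
```

As in the original, asking for more missing entries than there are rows fails; here sample() raises an error in that case.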
I guess the dataset is not included in the code above, so this is the one I am using. The only relevant thing is that the attribute "moral" is imported as polynominal type:
Sorry, I can't post the link. It's the South German Credit (Update) dataset from UCI.
If it helps, this is the problem I have:
The attribute "moral" has 0 missing values, but contains "?" in its range, which should be the missing ones.
Happy for any suggestions
Answers
Declare Missing Value should do the trick; you might have overlooked one of the settings. I don't know them by heart, but selecting nominal as the mode and entering the question mark in the value field should be enough.
You could also replace the question mark with nothing and follow that with a trim to get the same result.
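To make the idea behind both suggestions concrete, here is a minimal R sketch on a toy data frame. Only the attribute name "moral" and the "?" marker come from the post; the values and the helper name declare_missing are invented. It is not what the RapidMiner operators do internally, just the same idea of treating the literal string "?" as missing.

```r
# Toy stand-in for the dataset; the values are invented for illustration.
credit <- data.frame(
  moral = c("A", "?", "B", " ? ", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: treat a literal marker string as missing.
# trimws() plays the role of the suggested trim step, so " ? " also counts.
declare_missing <- function(x, marker = "?") {
  x <- trimws(as.character(x))
  x[!is.na(x) & x == marker] <- NA_character_
  x
}

credit$moral <- declare_missing(credit$moral)
missing_count <- sum(is.na(credit$moral))
print(missing_count)
```

After the call, the "?" entries are real missing values and no longer part of the attribute's range.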