Is there any way to find common values in records?

smmsamm
smmsamm New Altair Community Member
edited November 5 in Community Q&A

Would you Please guide me to convert my CSV data file:

    1,A,C,Z,F,G
    2,G,Q,R,C,
    3,Z,G,Q,
    4,C,F,
    5,O,P,
    6,O,X,Y,J,
    7,A,P,X,

I have this table with ~1,000,000 records like these 7 records that you see (**In real Database A,B,C,... are words in string**), Records 1 and 2 are common in G and C value and 2,3 and 1,3 and ...

I want to sync records if they have at least two common value like Records 1 & 2,3,4 (but record 5,6,7 haven't at least 2 shared values with others) and generate a list like this:

    1 A C Z F G Q R
    2 G Q R C A Z F
    3 Z G Q A C F R
    4 C F A Z G Q R
    5 O P
    6 O X Y J
    7 A P X

at the end we must have 4 same records if we sort data and one others without sync:

    1 A C F G Q R Z
    2 A C F G Q R Z
    3 A C F G Q R Z
    4 A C F G Q R Z
    5 O P
    6 J O X Y
    7 A P X

Maybe I do not use good term for my meaning, please see:

    1 A C Z F G
    2 G Q R C

record 1 has C and G common with Record 2 now 1 has not R and Q thus we must have 1 A C Z F G + Q and R and Record 2 has not A,Z and F thus we must have: 2 G Q R C + A,Z and F thus at the end we have:

    1 A C Z F G Q R
    2 G Q R C A Z F

I need all records Respectively in the queue from top to bottom.
for other example: record 4:
1,A,C,Z,F,G
2,G,Q,R,C,
3,Z,G,Q,
4,C,F,
5,O,P,
6,O,X,Y,J,
7,A,P,X,

at first we have :
1 A C Z F G Q R
2 G Q R C A Z F
3 Z G Q A C F R
4,C,F,
5,O,P,
6,O,X,Y,J,
7,A,P,X,

Now at 4 we have C and F common with 1
thus
we generate:
4 C F + a z g q r


at first I wrote a delphi code but it is so slow. second Someone suggest me this groovy code:

    def f=[:]
    new File('Data.csv').readLines().each{
    def items=it.split(',')
    def name
    items.eachWithIndex { String entry, int i ->
    if(i==0){
    name=entry
    }
    else if(entry){
    if(!f[entry])
    f[entry]=[]
    f[entry]<<name
    }
    }
    
    }
    f.findAll {it.value.size()>1}

It is very fast (because of using a map file I think), but It only finds the common values.
Now I am here,
Please give me any help or suggestion.
Thank you

Tagged:

Answers

  • rfuentealba
    rfuentealba New Altair Community Member
    edited November 2018
    Hello, @smmsamm

    The best way to do such a thing, considering that A, B, C... etc, are strings, is by using a database, especially a SQL database. I assume that 1, 2, 3, ..., 7 are sentences, hence:

    CREATE TABLE sentences (
    sentence_id int not null,
    sentence_content text not null default '',
    primary key (sentence_id)
    );

    CREATE TABLE words (
    word_id int not null auto_increment,
    word_content varchar(255) not null unique,
    primary key (word_id)
    );

    CREATE TABLE sentences_words (
    sentence_id int not null,
    word_id int not null,
    primary key (sentence_id, word_id),
    foreign key (sentence_id) references sentences (id),
    foreign key (word_id) references words (id)
    );

    CREATE TABLE sentence_pairings (
    sentence_first_id int not null,
    sentence_second_id int not null,
    similarities int not null default 0,
    primary key (sentence_id, word_id),
    foreign key (sentence_first_id) references sentences (id),
    foreign key (sentence_second_id) references sentences (id)
    );
    Why would anyone care about using a database for these things? Piece of cake: because you are trying to find a relation between words and sentences, and there is nothing better to process than a relational database. Once you get your data sorted, you have many choices to find what groups can be together, e.g.:
    • Create a pivot between sentences and words, and run a bubble-sort algorithm through each.
    • Create a pivot between sentences and words, and run some sorts of heuristics to find similarities.
    • Create a pivot between sentences and words, use word counting to quickly solve which words give you better grouping so you don't have to bring all the words to do bubble-sorting or heuristics and invest your time in the most common words only, akin to what you do with NLP when discarding useless terms.
    • Run an unsupervised learning algorithm that can help you group stuff by differences (and I wouldn't recommend this but this is just to keep you on topic with RapidMiner).

    You can also use SQL superpowers, create subqueries to filter pairs of sentences that appear more than once when using words as counters. Since you have a million records, either you aren't on the free tier of RapidMiner or you are still unable to find the right bazooka to address your problem. Not that this is an issue to answer but it will be if you want to apply RapidMiner to this problem with your license.

    All the best,

    Rodrigo.