Automatic Text Signal Finder for Binary Response
I have 2 datasets:
Dataset 1 - this has the response variable and some potential categorical predictors (the response is 1 or 0). Each entity has a unique record (let's call them entities A to Z)
Dataset 2 - this has thousands of records with lots of text for each entity. So each entity could have thousands of rows, each with paragraphs of information
I want to predict the response in Dataset 1 based on the text information in Dataset 2. So here is what I think should happen next:
1) Concatenate the thousands of rows for each entity in Dataset 2 so that the resulting table has one row per entity (with a ton of text per record).
2) Join Dataset 1 with Dataset 2 based on entity ID
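To make the two steps above concrete, here is a minimal sketch of what I have in mind, using pandas. The column names (`entity_id`, `response`, `category`, `text`) and the toy rows are made up for illustration; my real data is much larger.

```python
import pandas as pd

# Hypothetical stand-ins for the two datasets (column names are assumptions).
dataset1 = pd.DataFrame({
    "entity_id": ["A", "B", "C"],
    "response": [1, 0, 1],
    "category": ["x", "y", "x"],
})
dataset2 = pd.DataFrame({
    "entity_id": ["A", "A", "B", "C", "C"],
    "text": ["late payment", "account closed", "paid on time",
             "late fee", "payment missed"],
})

# Step 1: collapse Dataset 2 to one row per entity by joining its text rows.
# Missing text is replaced with an empty string so the join does not fail.
text_per_entity = (
    dataset2.assign(text=dataset2["text"].fillna("").astype(str))
    .groupby("entity_id")["text"]
    .agg(" ".join)
    .reset_index()
)

# Step 2: left join so every entity in Dataset 1 is kept, even without text.
# validate="one_to_one" raises if either side has duplicate entity IDs.
merged = dataset1.merge(
    text_per_entity, on="entity_id", how="left", validate="one_to_one"
)
merged["text"] = merged["text"].fillna("")

print(merged)
```

The left join is a choice on my part: it keeps entities that have no text at all, which may or may not be what is wanted.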
Assuming the above is correct so far (please correct me if there is a better way, as I haven't done this yet), I am wondering if there's an ML algorithm that could find all the words, phrases, or fuzzy combinations that are predictive of the response variable in Dataset 1. Please advise!
Thanks!