"Text Mining: How do I assign/create a macro to reference a group of attributes collectively?"

User: "batstache611"
New Altair Community Member
Updated by Jocelyn

Hi,

Before I begin, I sincerely apologize that I cannot share my process here because of confidentiality restrictions at my school. This is a service-learning project that I am doing for the co-op placements department of my business school, but I will do my best to describe it.

 

I have a process that tries to classify job postings using lexicons. I have one dictionary per job category (about 30 dictionaries for about 30 categories), plus one large custom stopwords dictionary. Each job posting is run against these dictionaries to count how many words from each dictionary it contains. The idea is that whichever dictionary gets the highest word count for a given posting gives the predicted category for that job. The concept itself is simple, but to automate the whole thing and run it at scale I'm using file and repository loops, macros, branches, subprocesses, etc.
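Outside RapidMiner, the counting idea described above can be sketched in a few lines of Python. This is only an illustration, not the actual process: the dictionaries, the stopword list, and the function names (`tokenize`, `score_posting`, `predict_category`) are made up for the example, and it assumes that every occurrence of a dictionary word counts toward the total.

```python
import re
from collections import Counter

# Hypothetical example data; the real dictionaries and stopword list are not shown in the post.
DICTIONARIES = {
    "accounting": {"audit", "ledger", "tax", "invoice"},
    "marketing": {"campaign", "brand", "seo", "social"},
    "it": {"python", "server", "network", "database"},
}
STOPWORDS = {"the", "and", "a", "of", "for", "to", "in", "with"}


def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def score_posting(text, dictionaries=DICTIONARIES):
    """Return {category: number of tokens in the posting found in that dictionary}."""
    tokens = Counter(tokenize(text))
    return {
        category: sum(count for word, count in tokens.items() if word in words)
        for category, words in dictionaries.items()
    }


def predict_category(text, dictionaries=DICTIONARIES):
    """Pick the category with the highest count (ties go to the first category listed)."""
    scores = score_posting(text, dictionaries)
    return max(scores, key=scores.get)
```

Because the categories are just the keys of `DICTIONARIES`, adding a new dictionary changes nothing else in the code, which is the behaviour wanted from the RapidMiner process as well.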

 

The process works fine, except the results are very cluttered. For every job posting, word counts for all 30 dictionaries are returned. I'd like to limit the output to just the highest one, or the top 3. I know that I can use the Max function in Generate Attributes to select the highest count, but that would mean hard-coding the dictionary names into the process. I'd like it to handle new dictionaries on its own in the future, without me having to go into the parameter settings and modify things. Also, if I used attribute names in Max(), the function would be very long, e.g. Max(dictionary_1, dictionary_2, dictionary_3, ...., dictionary_30). Is there a way to use a macro to refer to these dictionary attributes collectively, so that I can write a simple function like Max(%{dictionary}) and have it select the highest count?

 

I've attached a sample CSV with breakpoint results for one row (one document, i.e. one job posting). As you can see, it has word counts for several dictionaries, but I'm only interested in the largest one, and I need to do this for over 5k job postings. I want one or more attributes that pick the top category, or the top three categories, for each document/row using macros and Generate Attributes.
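To show the desired output independently of RapidMiner, here is a minimal Python sketch of the per-row selection. It assumes the count columns share a common name prefix (`dictionary_`, as in the Max() example), so nothing is hard-coded per dictionary. The sample data, the `posting_id` column, and the function names are hypothetical and are not taken from the attached file.

```python
import csv
import io

# Hypothetical sample in the shape described: one row per posting, one count column per dictionary.
SAMPLE_CSV = """posting_id,dictionary_1,dictionary_2,dictionary_3,dictionary_4
101,4,12,0,7
102,9,1,3,3
"""


def top_categories(row, prefix="dictionary_", k=3):
    """Return the k (column name, count) pairs with the highest counts.

    Only columns whose name starts with `prefix` are considered, so a newly
    added dictionary column is picked up without changing the code.
    """
    counts = {name: float(value) for name, value in row.items() if name.startswith(prefix)}
    # sorted() is stable, so tied columns keep their original left-to-right order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


def rank_rows(csv_text, prefix="dictionary_", k=3):
    """Map each posting_id to its top-k dictionary columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["posting_id"]: top_categories(row, prefix, k) for row in reader}
```

Calling `rank_rows(SAMPLE_CSV, k=1)` keeps only the single highest category per posting, while the default `k=3` keeps the top three.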

 

Thank you very much; your help is greatly appreciated.
