negation in clinical note text mining

sleclair
sleclair New Altair Community Member
edited November 5 in Community Q&A

Hi all,

I'm working with a large set of clinical notes and it seems like the clinicians are trained to spend half their time writing down what is NOT going on with the patient. So, in order to apply many text mining techniques I'm having to learn how to handle negation in context.

 

I've seen a brief dialog about this topic in which @mschmitz and @SvenVanPoucke discussed the issue https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Include-Negations-in-Dictionary-based-Sentiment-Approach/m-p/44266/highlight/true#M29247 

 

And, I see that Martin added negation with a word window to the Operator toolkit Dictionary Based Sentiment. I think the way it was implemented is very flexible and I look forward to using it when I focus on sentiment.

 

Right now, I'm attempting to 'tag' my corpus of documents as "Suicide-Mentioned" vs. "Suicide-Deny-Mention" as a way to make our document search a little better. It's difficult to write the regex or Lucene queries needed to reliably find suicide-related notes, so I want to preprocess and tag the notes for the clinicians using Python or RapidMiner's more sophisticated toolsets.

 

There are 2M documents in the corpus, each of which may be as short as 1-2 sentences or as long as several pages. They are typical unstructured text notes, although there are patterns in how the different clinicians discuss suicidality (deny or endorse).

 

My first pass at the task used regex inside of SQL Server and ran for three days to get through the 2M documents. The quality is being reviewed now, but I don't think it will be acceptable to the clinical director. Recall may not be high enough for field use with this approach.

 

There are some medical note negation tools available (NegEx and pyConText) and several papers that address the issue. I'm new to RapidMiner and would like to apply RM to the issue, so I thought I'd ask for advice on how folks here might address a problem like this.

 

Thanks in advance for your help/advice...Steve 

Answers

  • DocMusher
    DocMusher New Altair Community Member
    Hi,
    Indeed, sentences like "probably not a possible tumor" are what you find in the real world. I am very interested in your project. Could you send me a few example texts? In any case, I would like to help you with this.
    Cheers
    Sven
  • sleclair
    sleclair New Altair Community Member

    Hi Sven,

    One of the other important projects on my plate is the de-identification of our medical records, making them easier to share without requiring a Business Associate Agreement (BAA). I will check to see if we can create a sample dataset of de-identified suicide-related sentences as a community dataset. About how many sample sentences of each type do you think would be the minimum number that would allow for some meaningful analysis in a community dataset?

     

    From the regex SQL work project:

    ['Suicide-Deny-Mention']    => 195449 documents tagged
    ['Suicide-Mention']         =>  28395 documents tagged
    ['Suicide-Unclear-Mention'] =>   4231 documents tagged

     

    I attached the SQL that I used to do pattern matching in SQL server. In it you can see the type of negation patterns I used:

       where value like 'never' or value like '%n''t' or  value like 'no' or  value like 'not' or value like 'den[yi]%'

     

    and the key terms in the source data:

      where (AssessmentRaw like '%suicid%' or AssessmentRaw like '%[ /:-,;]si[/ :-,;]%' or AssessmentRaw like '%kill herself%' or AssessmentRaw like '%kill himself%' or AssessmentRaw like '%kill myself%' )
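For comparison, the two pattern sets above can be sketched in Python. This is only an illustration under stated assumptions, not the attached SQL: the five-token look-back window, the sentence splitting, the `No-Mention` label, and the rule that mixed notes become `Suicide-Unclear-Mention` are all hypothetical choices made for the example.

```python
import re

# Key terms mirroring the LIKE patterns above. The start/end-of-text
# alternatives around "si" are an addition for this sketch.
KEY_TERM = re.compile(
    r"suicid\w*|(?:^|[ /:\-,;])si(?=$|[ /:\-,;])|kill (?:herself|himself|myself)",
    re.IGNORECASE,
)
# Negation cues mirroring: never, *n't, no, not, den[yi]*
NEGATION = re.compile(r"(?:never|no|not|\w*n't|den[yi]\w*)", re.IGNORECASE)

WINDOW = 5  # assumption: number of preceding tokens scanned for a negation cue


def is_negated(text, start):
    """True if a negation cue appears shortly before position `start`,
    looking only at the sentence that contains the key term."""
    sentence = re.split(r"[.!?\n]", text[:start])[-1]
    tokens = re.findall(r"[A-Za-z']+", sentence)
    return any(NEGATION.fullmatch(tok) for tok in tokens[-WINDOW:])


def tag_note(text):
    """Tag a note by whether its key-term mentions are negated."""
    flags = [is_negated(text, m.start()) for m in KEY_TERM.finditer(text)]
    if not flags:
        return "No-Mention"
    if all(flags):
        return "Suicide-Deny-Mention"
    if not any(flags):
        return "Suicide-Mention"
    return "Suicide-Unclear-Mention"  # assumption: mixed evidence is "unclear"


for note in ["Denies SI/HI.", "Pt reports suicidal ideation with plan."]:
    print(note, "->", tag_note(note))
```

A cue that follows the key term ("suicidal ideation: denied") is missed by this look-back design, which is one reason dedicated tools such as NegEx define both forward and backward scopes.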

     

    ...Steve

  • DocMusher
    DocMusher New Altair Community Member
    Hi, I integrated this research in a RM process: https://github.com/vmenger/deduce
    De-identification.
    Cheers
    Sven
  • MartinLiebig
    MartinLiebig
    Altair Employee

    Hi,

    I am definitely not a medical expert like you or Sven, but here is my DS view on it.

     

    I think you can treat it in three different ways:

    1. Manual "keyword based filtering". This essentially comes down to either Lucene queries or something like Dict Based Sentiment. It requires a list of either queries or keywords. Based on this you calculate the sum of the matches as a score. This can be used as a "prefilter". Afterwards experts can read through the list and tag false positives. The result can then be used as input for a machine learning algorithm.

    2. One might try unsupervised methods on it. E.g. LDA topic recognition. The idea would be to find "suicide-topics".

    3. I had a client who used external data on medical terms. The idea would be to use not just the text but also, e.g., the Wikipedia result for a disease. This can "enhance" the text itself. You can do something like: if (Wikipedia lookup includes "suicide") then score+0.5. But I think this is the hardest approach of the 3.
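The scoring idea in option 1 can be sketched in a few lines of Python. The keyword list, the weights, and the threshold below are made up for illustration; a real list would come from the clinical experts.

```python
import re

# Hypothetical keyword weights, for illustration only.
KEYWORD_WEIGHTS = {"suicide": 1.0, "suicidal": 1.0, "overdose": 0.5, "hopeless": 0.5}


def keyword_score(text, weights=KEYWORD_WEIGHTS):
    """Sum the weights of all keyword occurrences in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(weights.get(tok, 0.0) for tok in tokens)


def prefilter(notes, threshold=1.0):
    """Return (index, score) pairs at or above the threshold, highest score first.
    These are the candidates an expert would then review for false positives."""
    scored = [(i, keyword_score(note)) for i, note in enumerate(notes)]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


notes = [
    "Patient feels hopeless after overdose.",
    "Suicidal ideation, hopeless.",
    "Routine follow-up.",
]
print(prefilter(notes))
```

The expert-corrected output of such a prefilter is what would feed the machine learning step mentioned in option 1.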

     

    Best,

    Martin

  • SGolbert
    SGolbert New Altair Community Member

    Hi Steve,

     

    Can you retrieve the files or a database dump onto your computer? I think you will have performance issues when retrieving records from the DB and processing them with RapidMiner on the fly. The best case scenario would be when you can load all the data into memory.

     

    This is an important issue, because if you cannot move the data, you will have to bring the analysis to the data (as a procedure on the DB).

     

    Best,

    Sebastian

  • sleclair
    sleclair New Altair Community Member

    Hi Sven,

    I saw that and played with the library a bit this weekend. It would take a bit of work to convert it to English but very doable. There are some other approaches that are referenced in the literature (MIST and others) - pros/cons to each. I haven't settled on an approach yet but the Deduce approach looks viable.

     

    Whatever technique I use will include proofreading as the last pass, so the data set will be small, maybe 600 notes. Even then, the risk/reward ratio may prevent us from publishing.

     

    ...Steve

  • sleclair
    sleclair New Altair Community Member

    Hi Martin,

    Thanks for the suggestions, thought provoking.

     

    1) Keyword based filtering - so, do you think I should try the Sentiment operator you provide on the toolkit to score for endorse vs. deny? I think we can use a regex filter to collect the documents that mention suicidality. Do you think this would be a better technique than using regex to determine endorse vs. deny?

     

    I'm planning on hiring a temp/intern to follow a protocol and label a few thousand notes, so we could then apply various supervised learning techniques to the rest of the corpus. I'm just hoping to give the human a decent set of pre-labeled notes.

     

    2) I've played with LDA on and off over the last month as a way to identify key topics in our notes. The idea I'm pursuing is to collect the key terms in each note and then use D3 to show a network graph for a client, allowing her clinician to visualize how key topics relate to each other (or not) within her medical record. Word clouds are easy and fun to look at but don't convey much useful data - even when we played word clouds over time it didn't drive enough clinical value. My hypothesis is that a graph (nodes/edges) might be a richer visualization tool. The challenge I haven't figured out (yet) is how to discover edges between nodes (keywords) that might actually have value/meaning to clinicians. I have some ideas for the edges (word2vec similarity scores for one) but haven't decided yet how to proceed.
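One way the similarity-score idea for edges could look, as a rough sketch: compute cosine similarity between keyword vectors and keep only pairs above a cutoff, then emit the nodes/links shape commonly fed to D3 force layouts. The three-dimensional vectors and the 0.7 cutoff are invented for the example; real vectors would come from a trained embedding model such as word2vec.

```python
import json
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_graph(vectors, min_similarity=0.7):
    """Build a nodes/links dict; a link is kept only if similarity >= cutoff."""
    nodes = [{"id": term} for term in vectors]
    links = []
    for a, b in combinations(vectors, 2):
        sim = cosine(vectors[a], vectors[b])
        if sim >= min_similarity:
            links.append({"source": a, "target": b, "value": round(sim, 3)})
    return {"nodes": nodes, "links": links}


# Toy vectors, made up for illustration only.
toy_vectors = {
    "insomnia": [0.9, 0.1, 0.0],
    "nightmares": [0.8, 0.3, 0.1],
    "housing": [0.0, 0.2, 0.9],
}
graph = build_graph(toy_vectors)
print(json.dumps(graph, indent=2))
```

Whether embedding similarity produces edges that mean something to clinicians is exactly the open question; the cutoff would need tuning against their feedback.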

     

    I don't yet trust my skill with LDA to reliably find 'suicide-topics'. Do you think this is a likely approach?

     

    3) Hmm, sounds like fun and I would never have considered an approach like this. Sounds like it's beyond my current skillset ;>)

     

    Thanks for your ideas Martin...Steve

     

  • sleclair
    sleclair New Altair Community Member

    Hi Sebastian,

    Good point, not sure.

     

    2M medical records is 11 GB as an Azure Search index and 24 GB as a SQL table. I can create a VM with 192 GB of RAM, so I guess it depends largely on the algorithm we are using.

     

    I'm considering ways of segmenting the data by diagnosis or level of care, so that our models can be trained on much smaller subsets that are still large enough (> 100k records). Of course, suicidality cuts across all diagnoses and care levels. One of the data issues we are facing is how sparse the data is: < 2% of our notes mention people endorsing suicidal ideation. In a typical year we lose maybe 35 people out of the >10k that we serve (and we are focused on driving that number to 0).

     

    ...Steve

  • MartinLiebig
    MartinLiebig
    Altair Employee

    Hi,

    Honestly, if you have an intern doing the labeling - that's the easiest way to do it :)

     

    BR,

    Martin

  • SGolbert
    SGolbert New Altair Community Member

    Hi,

     

    I'm guessing that false positives are not much of an issue here. If you can generate a set of representative features, you could turn it into an anomaly detection problem with a rather conservative model (density < 5% are anomalies).
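A very rough sketch of what such a density-based detector could look like, reading "density < 5%" as "flag the 5% of notes lying in the sparsest regions of feature space". That reading, the use of the distance to the k-th nearest neighbour as a stand-in for density, and the toy two-dimensional features are all assumptions for the example.

```python
import math


def kth_neighbor_distance(point, others, k):
    """Distance from `point` to its k-th nearest neighbour; larger = sparser."""
    distances = sorted(math.dist(point, other) for other in others)
    return distances[k - 1]


def flag_low_density(points, k=3, fraction=0.05):
    """Return indices of the `fraction` of points in the sparsest regions.
    Brute force, O(n^2): fine for a sample, not for 2M notes."""
    scores = []
    for i, point in enumerate(points):
        others = points[:i] + points[i + 1:]
        scores.append(kth_neighbor_distance(point, others, k))
    n_flag = max(1, round(len(points) * fraction))
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_flag])


# Toy features: 20 clustered points plus one far-away point (index 20).
points = [(i * 0.1, 0.0) for i in range(20)] + [(10.0, 10.0)]
print(flag_low_density(points))
```

At corpus scale this brute-force version would have to be replaced by an indexed or approximate neighbour search, or by an off-the-shelf anomaly detection operator.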

     

    Having seen your SQL file I can understand why you want to move to RapidMiner!

     

    If there are no disclosure problems (or the data can be anonymized), I think this is a good candidate for a Kaggle dataset (or competition if you have some funds).

     

    Best regards,

    Sebastian

  • sleclair
    sleclair New Altair Community Member

    Hi Sebastian,

    I agree regarding competition(s). I think a decent-size mental health clinical dataset would benefit a lot of agencies that can't afford to hire the caliber of people that compete, and yet would be fun/challenging for the competitors. I think we could find adequate prize money too.

     

    Helping agencies dedicated to helping others is why I chose to join this org five years ago (a great decision on my part).

     

    My guess is that I won't have an approved anonymized set until early next year, and even then I might only be allowed to share it with select people under an NDA - we'll see, not something we've done before. I'll definitely report back when I know more.

     

    One thing I could use this forum's help with is to 'design' a dataset that maximizes the opportunities for text mining projects. Would you be interested in collaborating on this stage? I can disclose a lot about our data without disclosing PHI, in order to optimize what we anonymize. For instance, I'm just starting to draft a protocol for a temp/intern to start reviewing and manually tagging a subset of our data. I could post that to this forum for feedback.

     

    Thoughts?

    ...Steve

     

     

  • DocMusher
    DocMusher New Altair Community Member

    Hi,

    Looks like an interesting project.

    Sven

  • jacobcybulski
    jacobcybulski New Altair Community Member
    I am not sure if this solution is suitable for handling sensitive medical documents. However, here is a "poor man's negative tagger", which I made for one of my projects in text mining of 800,000 reviews, where I needed to explicitly handle negation and catch negatively tagged terms for further processing.

    The solution is based on window sliding over the tokenized document, identifying negative words from the list, and prefixing the next token with a "neg:" tag. It is not very fast but it is good for a quick-and-dirty negative tagging of terms.

    The included stop list and a list of negatives are samples only.
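The approach described above can be approximated in plain Python. This is not the attached RapidMiner process, only a sketch of the same idea: the two word lists are samples, and removing stop words before tagging and keeping the negative word itself in the output are assumptions.

```python
import re

# Sample lists only, in the spirit of the ones attached to the post.
NEGATIVES = {"no", "not", "never", "denies", "without"}
STOPWORDS = {"a", "an", "the", "any", "of"}


def tag_negations(text, negatives=NEGATIVES, stopwords=STOPWORDS):
    """Lower-case and tokenize, drop stop words, then prefix the token
    that follows each negative word with 'neg:'."""
    tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in stopwords]
    tagged = []
    negate_next = False
    for tok in tokens:
        if tok in negatives:
            tagged.append(tok)      # assumption: the negative word is kept
            negate_next = True
        elif negate_next:
            tagged.append("neg:" + tok)
            negate_next = False
        else:
            tagged.append(tok)
    return tagged


print(tag_negations("Patient denies any suicidal ideation."))
```

Because only the single next token is tagged, multi-word concepts ("suicidal ideation") are only partly marked; widening the window is the obvious extension.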

    Jacob


  • sgenzer
    sgenzer
    Altair Employee
    I like it @jacobcybulski. Adding it to my list of "stuff to add to the community repo" :smile:
  • DocMusher
    DocMusher New Altair Community Member
    Detecting Hedge Cues and their Scope in Biomedical Literature with Conditional Random Fields:
    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2991497/