If the PDFs are articles, is there a way to exclude the References section from being mined? The section often starts with the same term (i.e. 'References'), so I tried to define a Split or a specific Tokenize option to cut the text off at that point, but I could not get it to work.
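To make the goal concrete, here is a minimal sketch of the behaviour I am after, written in plain Python rather than in the tool itself. It assumes the PDF text has already been extracted to a string and that the heading sits on a line of its own; the function name `strip_references` and the heading pattern are hypothetical, not part of any existing operator.

```python
import re

# Assumed heading shape: a line holding only "References", optionally
# numbered ("7. References") or followed by a colon, in any letter case.
REFERENCES_HEADING = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]*)?references[ \t]*:?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_references(text: str) -> str:
    """Return the text up to the last stand-alone 'References' heading.

    If no such heading is found, the text is returned unchanged.
    """
    matches = list(REFERENCES_HEADING.finditer(text))
    if not matches:
        return text
    # Use the last match so an earlier line that happens to consist of
    # "References" (e.g. in a table of contents) does not cut the body.
    return text[: matches[-1].start()].rstrip()


if __name__ == "__main__":
    sample = (
        "Intro text.\n"
        "See references in section 2.\n"
        "\n"
        "References\n"
        "[1] A. Author. Title.\n"
    )
    print(strip_references(sample))
```

The word "references" inside a sentence is left alone because the pattern only matches a line that contains nothing else. Anything after the heading, such as appendices, would also be dropped, which may or may not be wanted.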
I would be grateful for any suggestion.