I have recently started using the text mining extension to extract chemical formulas from PDF files. I use the Process Documents from Files operator and the Tokenize operator with a regular expression.
The PDF files contain many chemical formulas, and I want to extract them. They mostly look like LiCoMnO4, 0.4Li2Mn0.06Ni0.2O4, K1/3Mn2/3Al2/9, H2(g), and so on. Can anyone tell me what kind of regular expression would extract them?
Thank you very much.
So something like \s[^ ]+[A-Z][a-z].+\s
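You'll probably need to tune the boundaries, as right now it just looks for combinations separated by spaces. Also note that the trailing .+ is greedy and can run past the next space; using [^ ]* in its place keeps the match inside a single token.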
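If it helps, here is a rough sketch in Python of a somewhat stricter pattern, so you can try it on sample text outside the process first. It assumes a formula is an optional leading coefficient (0.4), then one or more element symbols each with an optional count (integer, decimal like 0.06, or fraction like 2/3), then an optional state symbol such as (g). The names ELEMENT, FORMULA, ELEMENTS and extract_formulas are made up for this example, and the filtering rule at the end (all symbols must be real elements, and a lone symbol needs a digit) is only my assumption about what counts as a formula. It will still produce some false positives, for example "OK" reads as oxygen plus potassium.

```python
import re

# Known element symbols, used to reject acronyms such as "PDF".
ELEMENTS = set("""
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu
Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs
Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg Tl
Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf Db Sg Bh
Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og
""".split())

# One element group: a symbol plus an optional count, where the count may be
# an integer (4), a decimal (0.06) or a fraction (2/3).
ELEMENT = r"[A-Z][a-z]?(?:\d+(?:[./]\d+)?)?"

FORMULA = re.compile(
    r"(?<![A-Za-z0-9.])"        # do not start inside a word or a number
    r"(?:\d+(?:\.\d+)?)?"       # optional leading coefficient, e.g. 0.4
    rf"(?:{ELEMENT})+"          # one or more element groups
    r"(?:\((?:g|l|s|aq)\))?"    # optional state symbol, e.g. (g)
    r"(?![A-Za-z0-9])"          # do not stop inside a word or a number
)


def extract_formulas(text):
    """Return candidate chemical formulas found in text, in order."""
    found = []
    for match in FORMULA.finditer(text):
        token = match.group(0)
        symbols = re.findall(r"[A-Z][a-z]?", token)
        # Assumption: every symbol must be a real element.
        if not all(symbol in ELEMENTS for symbol in symbols):
            continue
        # Assumption: a lone symbol without any digit ("I", "In", "He")
        # is ordinary text rather than a formula.
        if len(symbols) < 2 and not any(ch.isdigit() for ch in token):
            continue
        found.append(token)
    return found


if __name__ == "__main__":
    sample = ("Samples of LiCoMnO4, 0.4Li2Mn0.06Ni0.2O4, K1/3Mn2/3Al2/9 "
              "and H2(g) were described in the PDF.")
    print(extract_formulas(sample))
```

The regex part only uses lookarounds and non-capturing groups, which Java-style regular expressions support as well, but the element check is a separate step that a single pattern cannot do, so inside the process you would probably still need some extra filtering.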