How to use a regular expression to extract chemical formulas
Lei
I have recently started using the Text Mining extension to extract chemical formulas from PDF files. I use the Process Documents from Files operator and the Tokenize operator with a regular expression.
The PDF files contain many chemical formulas that I want to extract. They mostly look like LiCoMnO4, 0.4Li2Mn0.06Ni0.2O4, K1/3Mn2/3Al2/9, H2(g), and so on. Can anyone tell me what kind of regular expression would extract them?
Thank you very much.
Accepted answers
kayman
One way would be to look for an uppercase letter followed by a lowercase letter inside a word, but not (only) at the beginning. This isn't a combination you see in 'normal' words, so it could work.
So something like \s[^ ]+[A-Z][a-z].+\s
You'll probably need to tune the boundaries, as right now it just looks for combinations delimited by spaces.
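To illustrate what tuning the boundaries could look like, here is a minimal sketch in plain Python rather than in the Tokenize operator. It is not the pattern from the answer above but a stricter variant: it splits the text on whitespace and keeps only tokens that consist entirely of element-like units (one uppercase letter, optionally one lowercase letter) with optional counts such as 4, 0.06 or 2/3, an optional leading coefficient, and an optional state suffix such as (g). The names NUMBER, ELEMENT, FORMULA and extract_formulas are made up for this example, and the pattern does not check symbols against the periodic table, so treat it as a starting point only.

```python
import re

# Assumed building blocks (illustrative, not from the original thread).
NUMBER = r"\d+(?:\.\d+)?(?:/\d+)?"   # 4, 0.06, 2/3
ELEMENT = r"[A-Z][a-z]?"             # Li, O, Mn (not validated as real elements)

FORMULA = re.compile(
    rf"(?:{NUMBER})?"                 # optional leading coefficient, e.g. 0.4
    rf"(?:{ELEMENT}(?:{NUMBER})?)+"   # one or more element units with optional counts
    rf"(?:\((?:g|l|s|aq)\))?"         # optional state suffix, e.g. (g)
)


def extract_formulas(text):
    """Return whitespace-separated tokens that look like chemical formulas."""
    found = []
    for token in text.split():
        token = token.strip(",.;:")   # drop sentence punctuation around the token
        # Heuristic to skip ordinary capitalized words such as "In":
        # require a digit or at least two uppercase letters.
        has_hint = any(c.isdigit() for c in token) or sum(c.isupper() for c in token) >= 2
        if has_hint and FORMULA.fullmatch(token):
            found.append(token)
    return found


sample = ("The cathode LiCoMnO4 and 0.4Li2Mn0.06Ni0.2O4 were compared "
          "with K1/3Mn2/3Al2/9, releasing H2(g).")
print(extract_formulas(sample))
```

Short all-caps words such as "OK" or "NO" still pass this filter, so checking the element units against a list of real symbols would be a reasonable next step.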