I have a web scraper that crawls pages with an average size of 3MB, full of ... elements with all sorts of attributes. From these pages, I usually need to grab just a few pieces of information. For this, I usually use the type of regex below (I know it will fail if the page's developers later change id="xxx" to id = "xxx", but let's not make this regex more complex by adding * in many places for now):
id="email"[^>]*?>(.*?)
.*?id="phone"[^>]*?>(.*?)
The problem is that it is mostly (.*?) that causes a lot of backtracking. After studying this for more than a week, I came to the conclusion that the best way to avoid the backtracking (in my cases) is to use atomic groups, (?>...), which do not backtrack into their contents once they have matched, and possessive quantifiers.
I have successfully used atomic groups and possessive quantifiers in lots of my regexes, and it has indeed helped a ton! But in the case above, I still can't see how a possessive quantifier can help. I could, for example, use a possessive quantifier to change from:
id="email"[^>]*?>(.*?)