Hello, below is the text extracted from a PDF file, from which I'm trying to retrieve some specific elements:
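import PyPDF2
# filename is the path to the PDF; this uses the legacy PyPDF2 (< 3.0) API
pdfFileObj = open(filename, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
pageObj = pdfReader.getPage(0)  # first page only
text = pageObj.extractText()
pdfFileObj.close()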
# Example of data from text
test = "Name : \n \n \n \nENTRP XXXXU\n \nSTREET\nWWWWWWW \nCITY \n USA\n \n \n \n \n \n \n \n \nCompany :\n \n \nENTEPRISE WWWWW\nXXXXXX \n001 STREET XXX XXXX \nCITY - USA\n \n \n \nTel : 02 02 02 02 02\n \n \nMail : [email protected]\n \nCodeq : 13000w0\n \nCodea : X9W9X4\n \n \n Contract n° :\n0 4 4 4 w x 00 \n \n(replace contract 1111111111)\n \n \n \nINFORMATIONS \n\n \nCoden :\n1111111100028\n \nCode :\n9499\n \n \nCONTRACT \n\n \n\nNEW \nCODEIN –––––––––––––––––––––––––––––––––\n A17\n \n\nCODEPN –––––––––––––––––––––––––––––––––\n T00\n \n"
To retrieve specific information from the text, I first remove the line breaks and collapse the long runs of whitespace:
import re
clean = re.sub(r'\n', ' ', test)  # replace line breaks with spaces
clean = ' '.join(filter(len, clean.split(' ')))  # collapse repeated spaces
clean
'Name : ENTRP XXXXU STREET WWWWWWW CITY USA Company : ENTEPRISE WWWWW XXXXXX 001 STREET XXX XXXX CITY - USA Tel : 02 02 02 02 02 Mail : [email protected] Codeq : 13000w0 Codea : X9W9X4 Contract n° : 0 4 4 4 w x 00 (replace contract 1111111111) INFORMATIONS Coden : 1111111100028 Code : 9499 CONTRACT NEW CODEIN ––––––––––––––––––––––––––––––––– A17 CODEPN ––––––––––––––––––––––––––––––––– T00'
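As a side note, the two cleaning steps can probably be collapsed into one, since `str.split()` called without arguments splits on any run of whitespace (including `\n`) and drops empty pieces. A minimal sketch on a shortened sample (the `raw` variable is only an illustration, not the full extracted text):

```python
# Shortened sample in the same shape as the extracted text (illustrative only).
raw = "Name : \n \n \nENTRP XXXXU\n \nSTREET\nWWWWWWW \nCITY \n USA\n"

# split() with no argument treats any whitespace run as one separator,
# so joining the pieces with a single space normalizes the whole string.
clean_sample = " ".join(raw.split())

print(clean_sample)  # Name : ENTRP XXXXU STREET WWWWWWW CITY USA
```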
The aim is to build a DataFrame from some specific fields.
Here is the expected output:
Name | Company | TEL | MAIL | Codeq | Codea | Contract n°
ENTRP XXXXU STREET WWWWWWW CITY USA | ENTEPRISE WWWWW XXXXXX 001 STREET XXX XXXX CITY - USA | 02 02 02 02 02 | [email protected] | 13000w0 | X9W9X4 | 0 4 4 4 w x 00 (replace contract 1111111111)
.....
Coden | Code | CODEIN | CODEPN
1111111100028 | 9499 | A17 | T00
How can I do something like this with a regex (take the value after ':' or after a long run of dashes)? Or is there perhaps a better approach?
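To make the idea concrete, here is a rough sketch of the regex approach, under several assumptions: the field labels are known in advance, the section headings (`INFORMATIONS`, `CONTRACT NEW`) can simply be discarded, and the long separator is a run of en dashes or hyphens. The names `LABELS`, `HEADINGS` and `parse_fields` are hypothetical and only serve the illustration.

```python
import re

# Assumed: the labels are known beforehand (hypothetical list for this sample).
LABELS = ["Name", "Company", "Tel", "Mail", "Codeq", "Codea",
          "Contract n°", "Coden", "Code", "CODEIN", "CODEPN"]
# Assumed: section headings carry no data and can be dropped from values.
HEADINGS = ["INFORMATIONS", "CONTRACT NEW"]

# A label is followed by ':' or by a run of 3+ en dashes / hyphens.
SEP = r"\s*(?::|[\u2013-]{3,})\s*"
label_alt = "|".join(re.escape(l) for l in sorted(LABELS, key=len, reverse=True))
LABEL_RE = re.compile(rf"(?<!\w)({label_alt}){SEP}")

def parse_fields(clean):
    """Map each label to the text between it and the next label."""
    matches = list(LABEL_RE.finditer(clean))
    record = {}
    for cur, nxt in zip(matches, matches[1:] + [None]):
        end = nxt.start() if nxt else len(clean)
        value = clean[cur.end():end]
        for heading in HEADINGS:
            value = value.replace(heading, "")
        record[cur.group(1)] = " ".join(value.split())
    return record

dashes = "\u2013" * 33  # the long separator seen in the cleaned text
clean = ("Name : ENTRP XXXXU STREET WWWWWWW CITY USA "
         "Company : ENTEPRISE WWWWW XXXXXX 001 STREET XXX XXXX CITY - USA "
         "Tel : 02 02 02 02 02 Mail : [email protected] "
         "Codeq : 13000w0 Codea : X9W9X4 "
         "Contract n° : 0 4 4 4 w x 00 (replace contract 1111111111) "
         "INFORMATIONS Coden : 1111111100028 Code : 9499 CONTRACT NEW "
         f"CODEIN {dashes} A17 CODEPN {dashes} T00")

record = parse_fields(clean)
print(record["Contract n°"])  # 0 4 4 4 w x 00 (replace contract 1111111111)
print(record["Code"], record["CODEIN"], record["CODEPN"])  # 9499 A17 T00
```

If pandas is available, `pd.DataFrame([record])` should turn this dictionary into a one-row DataFrame whose columns are the labels. The sketch only looks for a separator directly after a known label, so the lone hyphen in "CITY - USA" is left inside the Company value.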
Any help is welcome!