I have a text file from which I am trying to extract titles into an Excel column. The required titles appear inside a specific comment pattern:
COM *******************
COM * Title 1*
COM *******************
COM ***************************
COM * Sub 1 *
COM ***************************
{
...TEXT DETAILS...
}
COM ***************************
COM * Sub 2 *
COM ***************************
{
...TEXT DETAILS...
}
COM *******************
COM * Title 2*
COM *******************
COM ***************************
COM * T2 Sub 1 *
COM ***************************
{
...TEXT DETAILS...
}
COM ***************************
COM * T2 Sub 2 *
COM ***************************
{
...TEXT DETAILS...
}
The required output of the extraction is a list of title strings:
['Title 1', 'Sub 1',..,'T2 Sub 2']
or an Excel column such as:
CATEGORY
Title 1
Sub 1
Sub 2
Title 2
T2 Sub 1
T2 Sub 2
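For the Excel-column form of the output, one option is a single-column CSV file, which Excel can open directly. The following is only a minimal sketch under that assumption; the function name `write_category_column` and the output path are hypothetical, and a real .xlsx file would need a third-party library instead.

```python
import csv

def write_category_column(titles, path):
    """Write the titles as one column under a CATEGORY header (CSV format)."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["CATEGORY"])      # header row
        for title in titles:
            writer.writerow([title])       # one title per row

# Hypothetical usage:
# write_category_column(['Title 1', 'Sub 1', 'Sub 2'], 'titles.csv')
```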
The part I am unable to implement is detecting the ‘COM *****’ border lines and reading the title from the line between them. I recently extracted strings based on a string pattern, which I think is similar to my current problem.
In that case the input text file was in this format:
CTG 'GEN:LT'
{
TEXT DETAILS....
}
CTG 'GEN:FR'
{
TEXT DETAILS....
}
CTG 'GEN:G_L02'
{
TEXT DETAILS....
}
CTG 'GEN:ER'
{
TEXT DETAILS....
}
CTG 'GEN:C1'
{
TEXT DETAILS....
}
My goal was to extract the string in single quotes that follows CTG.
My idea was to detect the lines containing CTG and take the quoted string next to it. Here is how I implemented it:
import re

def getCtgName(text):
    matches = re.findall(r"'(.+?)'", text)
    return matches

mylines = []                                  # Declare an empty list.
with open('filepath.txt', 'rt') as myfile:    # Open .txt for reading text.
    for myline in myfile:                     # For each line in the file,
        mylines.append(myline.rstrip('\n'))   # strip newline and add to list.

columns = []
substr = "CTG"                                # substring to search for.
for line in mylines:                          # string to be searched
    if substr in line:
        columns.append(getCtgName(line)[0])
print(columns)
And got the output as:
['GEN:LT', 'GEN:FR',..., 'GEN:C1']
I believe similar logic can be implemented to extract the titles between those comment (COM ****) lines. Any help with code, logic, or resources will be appreciated. Thank you!
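One way the same line-by-line idea might carry over is sketched below. It assumes every title line has the form `COM * <title> *`, and that the border lines consist only of `COM` followed by asterisks, so they contain no text to capture. The names `TITLE_RE`, `extract_titles`, and `SAMPLE` are made up for illustration, and the sample is a shortened copy of the input shown above.

```python
import re

# Matches "COM * Some Title *" and captures the text between the asterisks.
# Border lines such as "COM *********" have nothing but asterisks after COM,
# so the capture group (one or more non-asterisk characters) cannot match.
TITLE_RE = re.compile(r"^COM\s*\*\s*([^*]+?)\s*\*\s*$")

def extract_titles(lines):
    """Return the titles found in COM comment lines, in file order."""
    titles = []
    for line in lines:
        match = TITLE_RE.match(line.strip())
        if match:
            titles.append(match.group(1))
    return titles

SAMPLE = """\
COM *******************
COM * Title 1*
COM *******************
COM ***************************
COM * Sub 1 *
COM ***************************
{
...TEXT DETAILS...
}
COM *******************
COM * Title 2*
COM *******************
COM ***************************
COM * T2 Sub 1 *
COM ***************************
{
...TEXT DETAILS...
}
"""

print(extract_titles(SAMPLE.splitlines()))
# ['Title 1', 'Sub 1', 'Title 2', 'T2 Sub 1']
```

For a real file, the list built by the existing reading loop (`mylines`) could be passed to `extract_titles` in place of `SAMPLE.splitlines()`. If titles could themselves contain an asterisk, the pattern would need to be adjusted.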