Sunday, January 17, 2021

Syntax for Conditional Statement With Regex Function

I have written code to parse multiple PDF files and return one line of data from each page. However, some pages in my PDF files do not contain this line. When that happens, my code omits the page entirely; instead, I would like it to record a single 'None' for each page where it cannot find the specified line. I thought this would be a simple fix, but it's proving to be a little more complicated than I expected. Here is an example of the line I am pulling and what I have tried:

import re

# Pattern the code looks for on each page of the PDF
sqft_re = re.compile(r'(\d+(sqft)\s+[$]\d+[.]\d+\s+\d{2}/\d{2})')

# Example of the line I want from each page:

'1600sqft $154.98 10/14' 
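
As a quick sanity check, the pattern can be run against the sample line on its own. This is only a minimal sketch; the non-matching string is made up for illustration.

```python
import re

sqft_re = re.compile(r'(\d+(sqft)\s+[$]\d+[.]\d+\s+\d{2}/\d{2})')

sample = '1600sqft $154.98 10/14'
match = sqft_re.search(sample)

# search() returns a Match object when the pattern is found, otherwise None;
# group(1) is the text captured by the outer parentheses
matched_text = match.group(1) if match else None
print(matched_text)

# Hypothetical line that does not contain the pattern
no_match = sqft_re.search('Total due upon receipt')
print(no_match)
```

Note that `search()` returns a Match object rather than a string, so `group(1)` has to be stored or appended explicitly to keep the text.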

Basically, I want the code to parse every PDF and return the line if it can find it. If it cannot, I want it to return a single 'None' for the page that lacks the line. I collect the lines in a list like so:

lines = []

Here is how I set up my for loop to look through each page of my PDF files:

for files in os.listdir(directory):
    if files.endswith(".pdf"):
        with pdfplumber.open(os.path.join(directory, files)) as pdf:
            for page in pdf.pages:
                text = page.extract_text()
                for line in text.split('\n'):
                    match = sqft_re.search(line)
                    if match:
                        # keep the matched text, not the Match object
                        lines.append(match.group(1))

Example of output:

lines

'1600sqft $154.98 10/14' 
'1450sqft $113.02 07/05' 
'90sqft $60.17 05/12' 
'3000sqft $500.98 09/20' 

This code successfully returns the list of data for pages that contain the line. However, pages without the line are omitted. Here is what I thought would fix the problem and simply record 'None' for pages without the line:

for files in os.listdir(directory):
    if files.endswith(".pdf"):
        with pdfplumber.open(os.path.join(directory, files)) as pdf:
            for page in pdf.pages:
                text = page.extract_text()
                for line in text.split('\n'):
                    line = sqft_re.search(line)
                    if line:
                        line = line.group(1)
                    else:
                        # substitute 'None' when the line does not match
                        line = 'None'
                    lines.append(line)

However, this did not work. Instead of substituting a single 'None' for each page without the value, every line on a PDF page is now recorded as 'None' unless it matches the pattern. So I now have a list that looks like this:

lines

'None'
'None'
'None'
'1600sqft $154.98 10/14' 
'None'
'None'
'None'
'1450sqft $113.02 07/05' #etc.....

I have tried a few other things, such as calling a different function when a line does not match and substituting my own string for the value, but I still get the same problem. In my sample PDF there is only one page without this line, so my list should look like:

'1600sqft $154.98 10/14' 
'1450sqft $113.02 07/05' 
'90sqft $60.17 05/12' 
'3000sqft $500.98 09/20' 
'None'

I am also pretty new to Python (R is what I primarily work with), so I am sure I am overlooking something here, but any guidance on what I am missing would be appreciated!
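
One possible way to get a single entry per page is to decide on 'None' only after all lines of a page have been checked, rather than inside the per-line loop. The following is only a sketch under assumptions: the helper name `match_or_none` and the sample page texts are hypothetical stand-ins for what `page.extract_text()` returns, and an empty or missing page text is treated as having no match.

```python
import re

sqft_re = re.compile(r'(\d+(sqft)\s+[$]\d+[.]\d+\s+\d{2}/\d{2})')

def match_or_none(page_text):
    """Return the first matching line on a page, or 'None' if no line matches."""
    # Assumption: a missing or empty page text counts as "no match"
    for line in (page_text or '').split('\n'):
        match = sqft_re.search(line)
        if match:
            return match.group(1)
    # Reached only when no line on the page matched
    return 'None'

# Hypothetical page texts standing in for page.extract_text() results
page_texts = [
    'Invoice\n1600sqft $154.98 10/14\nThank you',
    'Cover page with no data line',
    None,
]

lines = [match_or_none(text) for text in page_texts]
print(lines)
```

In the original loop, this would correspond to calling `lines.append(match_or_none(page.extract_text()))` once for each page in `pdf.pages`, so each page contributes exactly one entry. The sketch keeps only the first match on a page.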
