Friday, March 26, 2021

Using if-else find-split statements in Python to find and extract HTML tags

The attached text file contains data scraped from a non-English newspaper website. I am trying to extract the title and publication date of each article. I am having trouble with the publication date because it is not wrapped in the same tag everywhere: for some elements, the date is enclosed in a tag that differs from the date tag used in the rest. I want to write code that produces a date value regardless of which of the two tags is present. I roughly understand the logic, but I am unable to write it in Python and would really appreciate some help. Below, I have commented out the section where I am trying to write the code for extracting the date.

Thank you!!

with open('listfile.txt', 'r', encoding='utf8') as my_file:  #contains text and metadata from newspaper articles
    rawData = my_file.read()


rawDataList = rawData.split("]\n[")

rawDataList = list(filter(None, rawDataList)) #remove empty elements.  



#defining function to strip HTML tags

def stripTags(pageContents):
    insideTag = 0
    text = ''

    for char in pageContents:
        if char == '<':
            insideTag = 1
        elif (insideTag == 1 and char == '>'):
            insideTag = 0
        elif insideTag == 1:
            continue
        else:
            text += char
    return text


works_list=[]
for x in rawDataList:
    data=x.split("/span><div></div></div></div>, ")[0]  

    title= data.split("</h1>, <span")[0]
    clean_title=stripTags(title)
    
    
########

#Date Tag Type 1: Some of the dates are enclosed within this tag-->
#<div class="time-social-share-wrapper storyPageMetaData-m__time-social-share-wrapper__2-RAX"><div class="storyPageMetaData-m__publish-time__19bdV storyPageMetaData-m__no-update__3AA06"><time datetime="2021-03-21T23:46:54+06:00">প্রকাশ: ২১ মার্চ ২০২১, ২৩: ৪৬ </time>

#Date Tag Type 2: The rest are enclosed within this tag--->
#<div class="time-social-share-wrapper storyPageMetaData-m__time-social-share-wrapper__2-RAX"><div class="storyPageMetaData-m__publish-time__19bdV"><time datetime="2020-01-12T12:51:00+06:00">আপডেট: ১৬ জানুয়ারি ২০২০, ১১: ১২ </time>

###########
    
    # This is where I am trying to create a variable that takes the value of whichever date is present in the metadata

    #if data.find("") is not None:    # look for one type of tag and assign its value if found
    #    pubDate = data.find("").get_text()
    #else:
    #    pubDate = data.find("").get_text()    # if the tag above is not found, look for the other type of tag and assign that value
    #genres_lst.append(genre_str)
    #clean_pubDate=stripTags(pubDate)

   
    d={}
    d['title']=clean_title
    #d['date']=clean_pubDate
    works_list.append(d)
    
works_list
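
One possible way to handle both date tags is sketched below. This is only a sketch under stated assumptions, not a tested solution for the real file: the helper name `extract_pub_date` and the constants `TAG_TYPE_1` and `TAG_TYPE_2` are hypothetical, the marker strings are copied from the two tag variants quoted in the comments above, and it assumes the scraped text contains those tags exactly as quoted (same attribute order and spacing). It uses an if/elif on plain string membership, then `split` to isolate the `<time>` element.

```python
import re

# Hypothetical marker strings, copied from the two tag variants quoted in the question.
TAG_TYPE_1 = ('<div class="storyPageMetaData-m__publish-time__19bdV '
              'storyPageMetaData-m__no-update__3AA06"><time')
TAG_TYPE_2 = '<div class="storyPageMetaData-m__publish-time__19bdV"><time'


def extract_pub_date(data):
    """Return (datetime attribute, visible date text), or (None, None) if no date tag is found."""
    if TAG_TYPE_1 in data:
        rest = data.split(TAG_TYPE_1, 1)[1]   # everything after the first variant
    elif TAG_TYPE_2 in data:
        rest = data.split(TAG_TYPE_2, 1)[1]   # everything after the second variant
    else:
        return None, None

    # Keep only the part before the closing </time> tag.
    time_tag = rest.split('</time>', 1)[0]

    # Capture the datetime attribute and the text that follows the opening tag.
    match = re.search(r'datetime="([^"]*)"[^>]*>(.*)', time_tag, re.S)
    if match is None:
        return None, None
    return match.group(1), match.group(2).strip()


# Example with a made-up placeholder in place of the real date text.
sample = TAG_TYPE_2 + ' datetime="2020-01-12T12:51:00+06:00">placeholder text </time>'
iso_date, shown_date = extract_pub_date(sample)
print(iso_date, shown_date)
```

Inside the loop, this could be called as `iso_date, pubDate = extract_pub_date(data)` and the result stored with `d['date'] = pubDate`, assuming the date markup is still present in `data` after the first split; if it is not, passing `x` instead may be needed. Since both variants wrap the date in a `<time datetime="...">` element, a single search for that element alone would probably work as well, without distinguishing the two `div` classes.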
