The attached text file contains data scraped from a non-English newspaper website. I am trying to extract each article's title and publication date. The title works, but I am having trouble with the publication date because it is not wrapped in the same tag everywhere: for some elements, the date is enclosed in a tag that differs from the date tag used in the rest. I want to write code that produces a date value by accepting either of the two tags. I more or less understand the logic, but I am unable to write the code in Python and would really appreciate some help. Below, I have commented out the section where I am trying to extract the date.
Thank you!!
with open('listfile.txt', 'r', encoding='utf8') as my_file:  # contains text and metadata from newspaper articles
    rawData = my_file.read()
rawDataList = rawData.split("]\n[")
rawDataList = list(filter(None, rawDataList))  # remove empty elements
#defining function to strip HTML tags
def stripTags(pageContents):
    insideTag = 0
    text = ''
    for char in pageContents:
        if char == '<':
            insideTag = 1
        elif (insideTag == 1 and char == '>'):
            insideTag = 0
        elif insideTag == 1:
            continue
        else:
            text += char
    return text
works_list=[]
for x in rawDataList:
    data=x.split("/span><div></div></div></div>, ")[0]
    title= data.split("</h1>, <span")[0]
    clean_title=stripTags(title)
    ########
    #Date Tag Type 1: Some of the dates are enclosed within this tag-->
    #<div class="time-social-share-wrapper storyPageMetaData-m__time-social-share-wrapper__2-RAX"><div class="storyPageMetaData-m__publish-time__19bdV storyPageMetaData-m__no-update__3AA06"><time datetime="2021-03-21T23:46:54+06:00">প্রকাশ: ২১ মার্চ ২০২১, ২৩: ৪৬ </time>
    #Date Tag Type 2: The rest are enclosed within this tag--->
    #<div class="time-social-share-wrapper storyPageMetaData-m__time-social-share-wrapper__2-RAX"><div class="storyPageMetaData-m__publish-time__19bdV"><time datetime="2020-01-12T12:51:00+06:00">আপডেট: ১৬ জানুয়ারি ২০২০, ১১: ১২ </time>
    ###########
    #####This is where I am trying to create an object which takes the value of whatever date is present in the metadata
    #if data.find("") is not None: #looks for one type of tag, and assigns the value if found
    #    pubDate = data.find("").get_text()
    #else:
    #    pubDate = data.find("").get_text() #if you can't find the tag mentioned above, look for the other type of tag and assign that value
    #genres_lst.append(genre_str)
    #clean_pubDate=stripTags(pubDate)
    d={}
    d['title']=clean_title
    #d['date']=clean_pubDate
    works_list.append(d)
works_list
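
One possible way to handle both tag types, offered as a minimal sketch rather than a definitive solution: in the two samples quoted in the comments, only the class of the enclosing div differs, while the inner time element with its datetime attribute looks the same. A single regular expression that targets the time element could therefore cover both cases. Note also that str.find returns -1 (not None) when nothing is found, and plain strings have no get_text method, so the commented-out attempt would not run as written. The names TIME_TAG and extract_pub_date are hypothetical, and the sketch assumes the time element actually appears in the string being searched (data, or x if the earlier split cuts it off).

```python
import re
from datetime import datetime

# Matches a <time> element and captures its datetime attribute and its visible text.
# Assumption: every record carries the date in a <time datetime="..."> element,
# as in the two samples quoted in the question.
TIME_TAG = re.compile(r'<time[^>]*\bdatetime="([^"]+)"[^>]*>(.*?)</time>', re.DOTALL)

def extract_pub_date(data):
    """Return (iso_string, label_text) for the first <time> tag, or (None, None)."""
    match = TIME_TAG.search(data)
    if match is None:
        return None, None
    return match.group(1), match.group(2).strip()

# Shortened versions of the two tag types from the question.
type1 = ('<div class="storyPageMetaData-m__publish-time__19bdV storyPageMetaData-m__no-update__3AA06">'
         '<time datetime="2021-03-21T23:46:54+06:00">প্রকাশ: ২১ মার্চ ২০২১, ২৩: ৪৬ </time>')
type2 = ('<div class="storyPageMetaData-m__publish-time__19bdV">'
         '<time datetime="2020-01-12T12:51:00+06:00">আপডেট: ১৬ জানুয়ারি ২০২০, ১১: ১২ </time>')

for sample in (type1, type2):
    iso, label = extract_pub_date(sample)
    # fromisoformat accepts the "+06:00" offset on Python 3.7 and later.
    parsed = datetime.fromisoformat(iso)
    print(iso, parsed.date(), label)
```

Inside the loop, this could be used as iso, label = extract_pub_date(data) followed by d['date'] = iso, with a None check for records that have no time element.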