Thursday, March 28, 2019

How to identify whether a 'span' child tag exists in a 'p' tag returned by BeautifulSoup?

I am making a web scraper that scrapes an online novel starting from its index page and creates an EPUB file for each book of the novel. The translator of the novel has set up the chapter pages in two different formats. The first format is a 'p' tag with a 'span' tag inside it. The 'span' tag carries a block of inline CSS for each section of the paragraph, depending on whether the text is normal or italicized. The other format is plain text in the 'p' tag, with no 'span' tag and no CSS. I have been able to use Beautiful Soup to extract just the part of the page that contains the novel text. I am stuck trying to write an if statement that says: if a 'span' exists inside the chapter content, run this code; otherwise, run that code.

I have tried using if chapter.find('span') != []: and if chapter.find_all('span') != []: from Beautiful Soup, but these calls return actual values rather than Booleans. I tested this by printing yes or no depending on whether the chapter had the tag, but the check would only ever say yes or only ever say no, even when I ran it on two chapters that I can confirm have different formats.
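For reference, here is a minimal sketch of what the two calls return when used on a single 'p' tag. The HTML strings are made-up samples, not the real chapter pages. `find` gives back a `Tag` or `None`, and `find_all` gives back a list-like `ResultSet` that is empty when nothing matches, so comparing the result of `find` to `[]` is never equal, which may be one reason the test printed the same answer every time.

```python
from bs4 import BeautifulSoup

# Hypothetical one-paragraph samples of the two formats
with_span = BeautifulSoup(
    '<p dir="ltr"><span style="font-style: italic">stuff</span></p>',
    'html.parser',
).p
without_span = BeautifulSoup('<p>stuff</p>', 'html.parser').p

# find() returns the first matching Tag, or None when there is no match
has_span_a = with_span.find('span') is not None
has_span_b = without_span.find('span') is not None

# find_all() returns a list-like ResultSet; its length is 0 when nothing matches
count_a = len(with_span.find_all('span'))
count_b = len(without_span.find_all('span'))

print(has_span_a, has_span_b, count_a, count_b)
```

Testing `is not None` (or the length of the `find_all` result) turns the returned value into a plain Boolean that an if statement can use.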

This is how I get the HTML for an individual chapter from the novel's website:

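    import requests
    from bs4 import BeautifulSoup

    # get the link for chapter 1 from the index
    r = requests.get(data[1]['link'])
    soup = BeautifulSoup(r.content, 'html.parser')

    # if the webpage has an announcement, change 0 to 1
    # chapter is a list of the <p> tags inside the chosen fr-view div
    chapter = soup.find_all('div', {"class" : "fr-view"})[0].find_all('p')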

After this point, depending on the chapter, chapter will contain HTML similar to:

    #chapter equals this
    [<p dir="ltr"><span style="color: rgb(0, 0, 0); background-color: transparent; font-weight: 400; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap">Chapter 1 - title</span></p>,
    <p dir="ltr"><span style="color: rgb(0, 0, 0); background-color: transparent; font-weight: 400; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap">stuff</span></p>,
    <p dir="ltr"><span style="color: rgb(0, 0, 0); background-color: transparent; font-weight: 400; font-style: italic; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap">italizes</span><span style="color: rgb(0, 0, 0); background-color: transparent; font-weight: 400; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap"> stuff</span></p>]


or it will return

    #chapter equals this
    [<p>Chapter 6 - title</p>,
    <p>stuff</p>]

I'm trying to write an if statement that can read chapter and tell me whether it contains a 'span' tag or not, so I can execute the correct code.
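One possible way to express that check, sketched under assumptions: `chapter` is a list of 'p' tags, so the test has to be applied to each tag rather than to the list itself. The helper name `chapter_has_span` and the shortened HTML strings are hypothetical stand-ins for the two formats shown above.

```python
from bs4 import BeautifulSoup


def chapter_has_span(paragraphs):
    """Return True if any <p> tag in the list contains a <span> descendant."""
    return any(p.find('span') is not None for p in paragraphs)


# Stand-ins for the two formats (inline styles shortened for readability)
styled_html = (
    '<div class="fr-view">'
    '<p dir="ltr"><span style="font-style: normal">Chapter 1 - title</span></p>'
    '<p dir="ltr"><span style="font-style: italic">italicized</span>'
    '<span style="font-style: normal"> stuff</span></p>'
    '</div>'
)
plain_html = '<div class="fr-view"><p>Chapter 6 - title</p><p>stuff</p></div>'

for html in (styled_html, plain_html):
    soup = BeautifulSoup(html, 'html.parser')
    chapter = soup.find('div', {'class': 'fr-view'}).find_all('p')
    if chapter_has_span(chapter):
        print('span format')   # run the code for the styled format here
    else:
        print('plain format')  # run the code for the plain format here
```

Using `any` stops at the first paragraph that contains a 'span', so a chapter counts as the styled format as soon as one such paragraph is found.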
