Tuesday, January 5, 2021

MATLAB: Simple problem using a regular expression

I believe I have a simple problem. Here is a sample of a literary work that I would like to split up:

WholeBook = "Random info - at beginning-man. "+ ...
        "Random info still continues. "+ ...
        "CHAPTER 1 " + ...
        "1 This is sentence one of verse one, "+ ...
        "This still sentence one of verse one. "+ ...
        "2 This is sentence one of verse two. "+ ...
        "This is sentence two of verse two. "+ ...
        "3 This is sentence one of verse three; "+ ...
        "this still sentence one of verse three. "+ ...
        "CHAPTER 2 " + ...
        "Random info in middle two. "+ ...
        "Random info still continues again. "+ ...
        "1 This is sentence four? "+ ...
        "2 This is sentence five, "+ ...
        "3 this still sentence five but verse three!"+ ...
        "Random info at end's end.";

I would like to split the data above into a table like this (this is how the solution should look):

[Image: table showing the desired output]

However, my current solution looks like this:

[Image: table showing the current output]

So row 1 is incorrect, but row 2 is correct. In other words, my solution works when there is text after "CHAPTER #", but not when there is none. This is the code that produced this result:

[tokens, RandomInfoMiddle] = regexp(WholeBook, '(CHAPTER \d)\s*(.*?)1', 'tokens', 'match');
RandomInfoMiddle = RandomInfoMiddle';
RandomInfoMiddle = regexprep(RandomInfoMiddle,'CHAPTER \d+ (.+) \d$','$1'); % Strip the leading "CHAPTER #" and the trailing verse number 1
% To explain the regular expression (CHAPTER \d)\s*(.*?)1:
% (CHAPTER \d) matches CHAPTER followed by a single digit; the parentheses capture the match in the tokens variable.
% \s* matches any whitespace, including none.
% (.*?)1 captures any text up to the next 1 in the text. The question mark makes the match lazy; otherwise it would run to the last 1 in WholeBook.
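
To see where row 1 goes wrong, it may help to run the two calls on the empty case in isolation. The snippet below is only a sketch on a shortened, made-up input (the names `sample`, `m`, and `cleaned` are not from the original code). The first call returns the match `CHAPTER 1 1`, and the cleanup pattern then cannot match it, because `(.+)` needs at least one character followed by a space before the final digit.

```matlab
% Shortened, hypothetical input: no text between the heading and verse 1.
sample = 'CHAPTER 1 1 This is sentence one of verse one.';

% Same pattern as above; the lazy group matches nothing here.
m = regexp(sample, '(CHAPTER \d)\s*(.*?)1', 'match');
disp(m{1})

% Same cleanup as above. '(.+) ' needs at least one character plus a
% space before the final digit, so nothing is replaced in this case.
cleaned = regexprep(m{1}, 'CHAPTER \d+ (.+) \d$', '$1');
disp(isequal(cleaned, m{1}))
```

If that reading is right, the regexprep step is the part that fails for chapters without intermediate text, not the first regexp call.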

Please help me find a solution that produces the first table. (I suspect it needs an if statement combined with the right regular expression.)

All help is appreciated.
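
One possible direction, offered as a sketch and not as a definitive answer: the second capture group already holds the text between the heading and the first verse, so it could be read straight from the tokens, with no if statement and no regexprep cleanup. The example assumes that verse 1 always appears as `1` followed by whitespace and that the desired table has the two columns `Chapter` and `RandomInfoMiddle`; the shortened sample text and the name `T` are made up for illustration.

```matlab
% Shortened, hypothetical sample with the same layout as the question's data.
WholeBook = "Intro text. " + ...
    "CHAPTER 1 " + ...
    "1 Verse one. " + ...
    "2 Verse two. " + ...
    "CHAPTER 2 " + ...
    "Random info in middle two. " + ...
    "1 Verse one again.";

% Capture the heading and, lazily, everything up to the verse marker.
% The lookahead (?=1\s) checks for "1" plus whitespace without consuming it.
% char() keeps the output type predictable: nested cells of char vectors.
tokens = regexp(char(WholeBook), '(CHAPTER \d+)\s*(.*?)\s*(?=1\s)', 'tokens');

% Stack the per-match token pairs into an N-by-2 string array.
tokens = string(vertcat(tokens{:}));
Chapter = tokens(:, 1);
RandomInfoMiddle = tokens(:, 2);

T = table(Chapter, RandomInfoMiddle);
disp(T)
```

A chapter with no intermediate text simply yields an empty string in the second column. One caveat: the lookahead also fires on any earlier number ending in 1 that is followed by whitespace (for example "21 "), so real text may need a stricter verse marker.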
