How shall I perform multiline matching and substitution using awk?























In a text file, ignoring any trailing whitespace at the end of each line, I assume that if a line does not end with a digit, then the line break between it and the next line is spurious. I would like to find these line breaks and join the lines around them into one line. For example:



line 1
li
ne 2


There is a spurious line break between the second and the third lines, so I should modify the file to be:



line 1
line 2


To find such line breaks, I need to do multiline matching. I tried to do it by changing the record separator, but the following doesn't work:



$ awk 'BEGIN{RS="";}; { if (match($0, /[^[:digit:] ] *\n/)) print $0;} ' inputfile


I am also still wondering how to concatenate two lines separated by such a line break.



Thanks.
































  • setting RS to the empty string will turn on paragraph mode (records will be separated by runs of empty lines), not 'multiline matching' which is always on in awk. It's no wonder your script doesn't work, because it will just treat the whole file as a single record and print it, terminated by an extra newline (ORS). Also, there's absolutely no point in using the match() function, if you're not using its return value or the RSTART or RLENGTH variables.
    – mosvy
    Nov 13 at 17:02
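To illustrate the comment above, here is a minimal sketch, assuming a POSIX awk, of what paragraph mode does and of how match() becomes useful once RSTART and RLENGTH are read. The function name show_paragraph_records and the sample input are made up for illustration and are not part of the question.

```shell
# Hypothetical helper: print, for each paragraph-mode record, where the first
# "non-digit followed by newline" occurs (or that it does not occur).
show_paragraph_records() {
    awk 'BEGIN { RS = "" }              # paragraph mode: blank lines separate records
    {
        if (match($0, /[^0-9]\n/)) {    # a regex can match a newline inside a record
            printf "record %d: match at %d, length %d\n", NR, RSTART, RLENGTH
        } else {
            printf "record %d: no match\n", NR
        }
    }'
}

# Two paragraphs separated by one empty line.
printf 'line 1\nli\nne 2\n\nline 3\nline 4\n' | show_paragraph_records
# record 1: match at 9, length 2
# record 2: no match
```

With no empty line in the input, the whole file would arrive as a single record, which is the behaviour the comment describes.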

















text-processing awk gawk
















asked Nov 13 at 16:02 by Tim
























4 Answers

















Score: 1 (accepted)










You could run something along the lines of

awk 'BEGIN{RS=SUBSEP; ORS="" } {print gensub(/([^0-9])\n/,"\\1","g",$0)}' ex

  • RS=SUBSEP sets the record separator to a value that is very unlikely to occur in a text file, so the whole input file is slurped into $0 (note that gensub() is a GNU awk extension)

  • then do your favorite multiline transformation
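The slurp idea above does not depend on gensub(). As a rough sketch, assuming only a POSIX awk, the input can be rebuilt into one string and match() can then locate each suspicious line break without substituting anything. The function name list_spurious_breaks and the output wording are hypothetical; the pattern mirrors the one used in the answer.

```shell
# Hypothetical helper: slurp stdin in plain awk and report every newline that
# directly follows a non-digit character, leaving the text unchanged.
list_spurious_breaks() {
    awk '
    { text = text $0 "\n" }                # rebuild the whole input as one string
    END {
        offset = 0                         # characters already consumed
        while (match(text, /[^0-9]\n/)) {
            # RSTART points at the non-digit; the newline is one character later
            printf "spurious newline at character %d\n", offset + RSTART + 1
            offset += RSTART + RLENGTH - 1
            text = substr(text, RSTART + RLENGTH)
        }
    }'
}

printf 'line 1\nli\nne 2\n' | list_spurious_breaks
# spurious newline at character 10
```

Holding the whole file in memory is the price of this approach; for large inputs the line-by-line answers are likely a better fit.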





























  • Thanks. Do you know matching without substitution for multiline case?
    – Tim
    Nov 13 at 21:29










  • I was wondering if this reply doesn't work well sometimes? Why is this reply downvoted?
    – Tim
    Nov 13 at 22:30












  • Is RS="\f" also a working solution?
    – Tim
    Nov 13 at 22:40






  • This seems to add an empty line at the end of the output. I'm not sure exactly why at the moment.
    – Kusalananda
    Nov 13 at 22:40






  • @JJoao In general, print non-record data with printf and records with print. Since you're operating in "slurp mode" here (so to speak) and therefore do not really operate on records, it would be appropriate to use printf.
    – Kusalananda
    Nov 14 at 9:36
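The difference between print and printf in slurp mode can be seen with a small sketch, assuming a POSIX awk and the default ORS. The helper names with_print and with_printf are hypothetical, and "\034" is written out as the default value of SUBSEP.

```shell
# print appends ORS (a newline by default) after the slurped record;
# printf writes only the bytes it is given.
with_print()  { awk 'BEGIN { RS = "\034" } { print }'; }
with_printf() { awk 'BEGIN { RS = "\034" } { printf "%s", $0 }'; }

# The slurped record already ends with the last newline of the input,
# so print adds one extra byte.
printf 'a\nb\n' | with_print  | wc -c   # 5 bytes
printf 'a\nb\n' | with_printf | wc -c   # 4 bytes
```

This is one plausible explanation for the extra empty line mentioned in the earlier comment; setting ORS to the empty string, as the answer does, has the same effect as switching to printf.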


















Score: 4













I would address it differently: by looping over the input until you find a "line-ending condition":

awk '{
    line = $0
    # keep appending input lines until one ends in a digit (optionally followed by spaces)
    while ($0 !~ /[[:digit:]] *$/ && getline > 0) {
        line = line $0
    }
    print line
}' < input


On an extended input file of:



line 1
li
ne 2
li
ne
number 3
line 4


Or, more verbosely (to see the trailing space):



$ cat -e input
line 1$
li$
ne 2$
li$
ne $
number 3$
line 4$


The output is:



line 1
line 2
line number 3
line 4




























  • Thanks. The script in your reply is very specific to the problem. I would like to see if there is a more general script, which can allow me to specify a multiline pattern and match (and substitute) the matches.
    – Tim
    Nov 13 at 16:50












  • What "multiline patterns" are you thinking of?
    – RudiC
    Nov 13 at 17:26


















Score: 2













$ cat file
line 1
li
ne 2
lo
ng li
ne 3




$ awk 'line ~ /[0-9]$/ { print line; line = "" } { line = line $0 } END { print line }' file
line 1
line 2
long line 3


This accumulates an "output line" in the variable line, and whenever this variable ends with a digit, it is printed and reset. It is also printed at the very end to output the last line (whether complete or not).



Approximate sed equivalent (but with an explicit loop):



$ sed -e ':again' -e '/[0-9]$/{ p; d; }; N; s/\n//' -e 'tagain' file
line 1
line 2
long line 3
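The question asks for trailing whitespace to be ignored when testing for the final digit, which the pattern /[0-9]$/ above does not do. A possible variant of the accumulating awk program, sketched here under the assumption that "whitespace" means spaces and tabs, loosens the test. The function name join_until_digit is hypothetical, and unlike the program above it prints each completed line immediately.

```shell
# Hypothetical helper: join input lines until the accumulated line ends in a
# digit, ignoring trailing spaces or tabs when testing for that digit.
join_until_digit() {
    awk '
    { line = line $0 }                   # append the current input line
    line ~ /[0-9][ \t]*$/ {              # complete: a digit, then optional blanks
        print line
        line = ""
    }
    END { if (line != "") print line }   # flush an incomplete last line
    '
}

# The third input line carries two trailing spaces after the digit.
printf 'line 1\nli\nne 2  \nlo\nng li\nne 3\n' | join_until_digit
```

The trailing blanks themselves are kept in the output; stripping them would take an extra sub() call before the print.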

































Score: 0

Small GNU sed?

sed ':L; /[0-9] *$/!{N; bL;}; s/\n//g' file




























  • doesn't work for me?
    – andrew lorien
    Nov 13 at 23:27











Accepted answer (RS=SUBSEP): answered Nov 13 at 17:24 by JJoao, edited Nov 14 at 9:33












Answer using a getline loop: answered Nov 13 at 16:25 by Jeff Schaller, edited Nov 13 at 19:19 by qubert












Answer accumulating lines in a variable (awk and sed): answered Nov 13 at 22:49 by Kusalananda






















GNU sed answer: answered Nov 13 at 17:25 by RudiC, edited Nov 13 at 22:55 by Kusalananda











