Extracting numbers from text files


I have some text files from which I want to extract specific numbers. In particular, I want to search each file for the first occurrence of string1 and take the number that follows it. That is, I want to take all digits, dots, or minus signs and stop once any other character is reached. Then I want to write those numbers to a separate file.

Preferably I would be able to do this for multiple strings at once (so also look for string2, do the same there, and write the results in some list format, say {numbers1,numbers2}). But this last part is less important.

How would I accomplish this?

I did not include specific data since I was hoping there was a general solution to the question I asked. Such a tool would be useful on numerous occasions. (I tried to piece together a general solution from the various questions on how to extract a number from a specific string, but failed.)

The data would look something like

bla bla bla label1_5234_blablab_some_other_text_and_numbers_23343_blabla_more_text_and_numbers_maybe_label1_again_but_now_I_no_longer_care_about_what_comes_after blabla_label2_34343_this_is_some_other_number_want_to_be_able_to_extract_if_I_look_for_label2_instead_of_label1

label3 = -0.34343 

and_more_text_and_so_on_and_so_forth

The patterns to look for would then be label1_, label2_ or label3 =. (Of course it should work regardless of the exact form of label1. But since that apparently wasn't completely clear let me add another example.
height_2.3 blabla_bla_length_3.4, should give 2.3, 3.4 or {2.3,3.4} depending on whether we ask for height, length or both.)

And the output would be, if given one pattern to look for, say label1_

5234

or when looking for label3 =

-0.34343

In addition, it would be nice if it could search for two things at once and group them. For instance, giving both patterns above would output

{5234,-0.34343}

Finally, it would be nice if it could group the results per file when fed multiple files:

{out1a,out1b}

{out2a,out2b}

edited Feb 18 at 13:01

asked Feb 15 at 11:22

Kvothe

1164


text-processing sed


3 Answers

If you want all the results from a single file grouped together, then it's likely easiest to slurp the whole of each file into memory and process it as one block. You can do that in perl by unsetting the line separator - the conventional way to do that in a perl one-liner is -0777.

Next you need a regular expression that matches a sequence of decimal digits, decimal separators etc. preceded by label[123]_ or label[123] =

Putting it together:

perl -0777nE 'say "{", (join ",", /label[123](?:_| = )\K[0-9.+-]+/g), "}"' file1 file2 [...]

Note: I have not tried to address maybe_label1_again_but_now_I_no_longer_care_about_what_comes_after
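For readers who prefer to see the same idea outside a one-liner, here is a minimal Python sketch of the slurp-and-match approach. It is an illustration only, not part of the answer above: the function names `extract_all` and `format_group` and the shortened sample text are made up for this example, and a "number" is assumed to be any run of the characters in `[0-9.+-]`, as in the perl command.

```python
import re

# Assumption: a "number" is any run of digits, dots, plus and minus signs,
# mirroring the [0-9.+-]+ class used in the perl one-liner.
NUMBER = r"[0-9.+-]+"


def extract_all(text, prefix_regex):
    """Return every number that directly follows a match of prefix_regex.

    prefix_regex should not contain capturing groups of its own.
    """
    pattern = re.compile(f"(?:{prefix_regex})({NUMBER})")
    return pattern.findall(text)


def format_group(values):
    """Render a list of strings as {v1,v2,...}."""
    return "{" + ",".join(values) + "}"


# Shortened, hypothetical version of the sample data from the question.
sample = (
    "bla bla bla label1_5234_blablab_23343_maybe_label1_again "
    "blabla_label2_34343_other\n"
    "\n"
    "label3 = -0.34343 \n"
    "\n"
    "and_more_text\n"
)

# The whole text is searched as one block, like perl's -0777 slurp mode.
print(format_group(extract_all(sample, r"label[123](?:_| = )")))
```

Because the whole text is matched as a single string, results come out in order of occurrence, and a later `label1_again` is skipped simply because no number character follows it.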

answered Feb 15 at 14:38

steeldriver

70.8k11115187

Thank you! Would you mind helping me out with one last thing? I was thinking of the output as an ordered pair so that I know which number corresponds to which label. Another format in which this is still clear would also be great, of course. The problem with the solution above is that if I use /(?:label1|label2)(?:_| = ) , it gives results in order of occurrence and I don't know which result corresponds to which label.

– Kvothe
Feb 18 at 11:37

Is there a way to get the output in a form where I know which number corresponds to which label? (For example, either keeping the same order as the input or formatting it as label1[result],label2[result].)

– Kvothe
Feb 18 at 11:37
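One possible way to keep track of which number belongs to which label is to search for each label separately and store the results by name. The sketch below is in Python rather than perl and is only an illustration: `extract_labelled`, `format_labelled`, and the name-to-prefix mapping are hypothetical, and each prefix is treated as a literal string, not a regex. Note that it takes the first occurrence of a prefix that is actually followed by a number.

```python
import re

# Assumption: digits, dots and minus signs make up a number, as in the question.
NUMBER = r"[-.0-9]+"


def extract_labelled(text, prefixes):
    """Map each label name to the first number found after its literal prefix.

    prefixes maps a display name to the exact text that precedes the number.
    A label that never occurs (followed by a number) maps to None.
    """
    results = {}
    for name, prefix in prefixes.items():
        match = re.search(re.escape(prefix) + "(" + NUMBER + ")", text)
        results[name] = match.group(1) if match else None
    return results


def format_labelled(results):
    """Render results as name1[value1],name2[value2],..."""
    return ",".join(f"{name}[{value}]" for name, value in results.items())


sample = (
    "bla bla bla label1_5234_blablab_23343_maybe_label1_again "
    "blabla_label2_34343_other\n"
    "\n"
    "label3 = -0.34343 \n"
)

# The output follows the order in which the labels were requested,
# not the order in which they occur in the text.
wanted = {"label3": "label3 = ", "label1": "label1_"}
print(format_labelled(extract_labelled(sample, wanted)))
```

Since Python dictionaries preserve insertion order, the output order is the order of the requested labels, which gives the ordered pair asked for here.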


`sed` solution

With $p holding the label regex, e.g. p='label[13]\(_\| = \)':

sed ':a;N;$!ba;s/\n/ /g;s/'"$p"'[-.0-9]\+/&\n/g' |
sed '/.*'"$p"'[-.0-9]\+/!d;s/.*'"$p"'\([-.0-9]\+\)/\2/' |
sed ':a;N;$!ba;s/\n/,/g;s/.*/{&}/'

The first command removes linebreaks and adds a new one after every match, the second one removes lines without a match and extracts the numbers and the third one makes them comma-separated and encloses them in curly brackets.
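To make the three stages easier to follow, here is a rough Python rendering of the same pipeline. It is a sketch under assumptions, not a translation of the exact sed semantics: the function name `sed_like_pipeline` is hypothetical, the prefix is given in Python regex syntax without capturing groups, and the behaviour of printing a single result without braces is copied from the example outputs in this answer.

```python
import re

# Assumption: same number class as the sed commands, [-.0-9]+.
NUMBER = r"[-.0-9]+"


def sed_like_pipeline(text, prefix_regex):
    """Mimic the three sed stages; prefix_regex must have no capturing groups."""
    hit = re.compile(f"(?:{prefix_regex}){NUMBER}")
    tail = re.compile(f".*(?:{prefix_regex})(?P<num>{NUMBER})")

    # Stage 1: join all lines with spaces, then break the line after each match.
    joined = text.replace("\n", " ")
    broken = hit.sub(lambda m: m.group(0) + "\n", joined)

    # Stage 2: drop lines without a match and reduce the rest to the number,
    # which sits at the end of each line thanks to stage 1.
    numbers = [
        tail.sub(lambda m: m.group("num"), line)
        for line in broken.split("\n")
        if hit.search(line)
    ]

    # Stage 3: comma-separate and wrap in braces. A single result is returned
    # bare, matching the single-pattern outputs shown in the examples.
    if len(numbers) < 2:
        return "".join(numbers)
    return "{" + ",".join(numbers) + "}"


sample = (
    "bla bla bla label1_5234_blablab_23343_maybe_label1_again "
    "blabla_label2_34343_other\n"
    "\n"
    "label3 = -0.34343 \n"
    "\n"
    "and_more_text\n"
)

print(sed_like_pipeline(sample, r"label[13](?:_| = )"))
print(sed_like_pipeline(sample, r"label1_"))
```

Python's `re` module uses `(`, `|` and `+` as operators without backslashes, so the prefix here is written unescaped, unlike the sed patterns in this answer.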

$p must hold a valid regex containing exactly one group (otherwise you need to adjust the \2 back-reference on the right-hand side of the third substitution expression), for example:

p='label1\(_\)'

p='label3\( = \)'

p='label[13]\(_\| = \)'

p='\(label1_\|label3 = \)'

p='\(height\|length\)_'

Multiple different strings in the group are to be separated by \|.

Examples

$ <input cat

bla bla bla label1_5234_blablab_some_other_text_and_numbers_23343_blabla_more_text_and_numbers_maybe_label1_again_but_now_I_no_longer_care_about_what_comes_after blabla_label2_34343_this_is_some_other_number_want_to_be_able_to_extract_if_I_look_for_label2_instead_of_label1

label3 = -0.34343 

and_more_text_and_so_on_and_so_forth

$ p='label1\(_\)'
$ <input sed ':a;N;$!ba;s/\n/ /g;s/'"$p"'[-.0-9]\+/&\n/g' | sed '/.*'"$p"'[-.0-9]\+/!d;s/.*'"$p"'\([-.0-9]\+\)/\2/' | sed ':a;N;$!ba;s/\n/,/g;s/.*/{&}/'
5234

$ p='label3\( = \)'
$ <input sed ':a;N;$!ba;s/\n/ /g;s/'"$p"'[-.0-9]\+/&\n/g' | sed '/.*'"$p"'[-.0-9]\+/!d;s/.*'"$p"'\([-.0-9]\+\)/\2/' | sed ':a;N;$!ba;s/\n/,/g;s/.*/{&}/'
-0.34343

$ p='label[13]\(_\| = \)'
$ <input sed ':a;N;$!ba;s/\n/ /g;s/'"$p"'[-.0-9]\+/&\n/g' | sed '/.*'"$p"'[-.0-9]\+/!d;s/.*'"$p"'\([-.0-9]\+\)/\2/' | sed ':a;N;$!ba;s/\n/,/g;s/.*/{&}/'
{5234,-0.34343}

$ echo "height_2.3 blabla_bla_length_3.4" >input2

$ p='\(height\)_'
$ <input2 sed ':a;N;$!ba;s/\n/ /g;s/'"$p"'[-.0-9]\+/&\n/g' | sed '/.*'"$p"'[-.0-9]\+/!d;s/.*'"$p"'\([-.0-9]\+\)/\2/' | sed ':a;N;$!ba;s/\n/,/g;s/.*/{&}/'
2.3

$ p='\(height\|length\)_'
$ <input2 sed ':a;N;$!ba;s/\n/ /g;s/'"$p"'[-.0-9]\+/&\n/g' | sed '/.*'"$p"'[-.0-9]\+/!d;s/.*'"$p"'\([-.0-9]\+\)/\2/' | sed ':a;N;$!ba;s/\n/,/g;s/.*/{&}/'
{2.3,3.4}

edited Feb 18 at 13:27

answered Feb 15 at 14:36

dessert

25.5k674108

Thank you! This is great! I have some questions if you don't mind? Firstly in the real case the labels are really different words so I would need to ask for alternatives in a different way. But somehow both my tries: p='(label1(_)|label3( = ))' and p='(label1|label3)(_| = )' fail (where each individual pattern does find what I want). What am I doing wrong? Also, why does | have to be escaped? Isn't it acting as alternative operator and not a literal character and should thus not be escaped? And finally is it possible to match (or keep) only the first match (of a specific label)?

– Kvothe
Feb 18 at 10:46

@Kvothe If the patterns are in fact different please edit your question post accordingly.

– dessert
Feb 18 at 12:03

I can add an extra example, but of course an example will always just be an example. I tried to describe more generally what should happen in the text. I meant that label1 stands for some text, say height, while label2 stands for some other word say length. Of course the whole idea is that it should work for any label. I will see if I can clarify that in the question.

– Kvothe
Feb 18 at 12:57

@Kvothe I edited to reflect the changes, but my approach works without changes for these labels as well – I added the regex as a further example and showed the effect in the Examples section.

– dessert
Feb 18 at 13:30

Thanks. The thing I was doing wrong is that I wasn't escaping the initial parentheses and the |, as done in p='\(height\|length\)_'. Would you mind explaining why these need to be escaped? I had not expected that, since we want them to act as operators rather than as literal symbols.

– Kvothe
Feb 18 at 13:42


For a single file:

grep -oP "(?<=label1_)[0-9.+-]+[^_ ]+" ./file | head -n 1 >> ./tmpfile

grep -oP "(?<=label3 = )[0-9.+-]+[^_ ]+" ./file | head -n 1 >> ./tmpfile

paste -sd, ./tmpfile | awk '{ print "{"$0"}" }' >> ./newfile

rm ./tmpfile

For multiple files in a folder.

cd to the folder and run:

for file in *; do
    # Skip the output file so it is not scanned on later runs
    if [ "$file" = "newfile" ]; then continue; fi
    grep -oP "(?<=label1_)[0-9.+-]+[^_ ]+" "$file" | head -n 1 >> ./tmpfile
    grep -oP "(?<=label3 = )[0-9.+-]+[^_ ]+" "$file" | head -n 1 >> ./tmpfile
    paste -sd, ./tmpfile | awk '{ print "{"$0"}" }' >> ./newfile
    rm ./tmpfile
done
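The per-file loop above could also be sketched in Python, which avoids the temporary file. This is a hedged illustration, not a replacement for the answer: `first_values`, `group_for_text` and `summarise_files` are hypothetical names, prefixes are treated as literal strings, a number is assumed to be a run of digits, dots and minus signs, and a label that is not found is simply skipped, as in the grep version.

```python
import re
from pathlib import Path

# Assumption: digits, dots and minus signs make up a number, as in the question.
NUMBER = r"[-.0-9]+"


def first_values(text, prefixes):
    """Return, per prefix, the first number that follows it (missing ones are skipped)."""
    values = []
    for prefix in prefixes:
        match = re.search(re.escape(prefix) + "(" + NUMBER + ")", text)
        if match:
            values.append(match.group(1))
    return values


def group_for_text(text, prefixes):
    """Render the first match of every prefix as {v1,v2,...}."""
    return "{" + ",".join(first_values(text, prefixes)) + "}"


def summarise_files(paths, prefixes):
    """Return one {..} line per input file, in the order the paths are given."""
    return [group_for_text(Path(p).read_text(), prefixes) for p in paths]


# In-memory demonstration using the second example from the question.
print(group_for_text("height_2.3 blabla_bla_length_3.4", ["height_", "length_"]))
```

To process a folder, something like `summarise_files(sorted(Path(".").glob("*.txt")), ["label1_", "label3 = "])` would give one line per file; the `*.txt` pattern is only an example.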

edited Feb 19 at 12:39

answered Feb 15 at 14:23

Vijay

2,1041822

Thanks! The second pipe does not seem to do what I wanted. It takes out the minus sign I wanted to keep for example. I edited to a form where I think it answers my question. (Also added the taking of only the first resulting match). sed 's/_/ /g' ./file | grep -oP "(?<=label1 )[^ ]+" | grep -oE -m1 '(-)?[0-9](.)?([0-9]+)?'

– Kvothe
Feb 18 at 11:16

Finally I am confused by the space in (?<=label1 ). There was no space after label1. There was instead an underscore. What is going on here? Why should there be a space here?

– Kvothe
Feb 18 at 11:19

@Kvothe Thanks. Improved, edited and added "for" loop

– Vijay
Feb 19 at 8:55


3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

Next you need a regular expression that matches a sequence of decimal digits, decimal separators etc. preceded by label[123]_ or label[123] =

Putting it together:

perl -0777nE 'say "{", (join ",", /label[123](?:_| = )K[0-9.+-]+/g), "}"' file1 file2 [...]

Note: I have not tried to address maybe_label1_again_but_now_I_no_longer_care_about_what_comes_after

answered Feb 15 at 14:38

steeldriver

70.8k11115187

Thank you! Would you might helping me out with one last thing? I was thinking of the output in terms of an ordered pair so that I know which number corresponds to which label. Another format in which this is still clear would also be great of course. The problem with the solution above is that if I use /(?:label1|label2)(?:_| = ) , it gives results in order of occurrence and I don't know which result corresponds to which label.

– Kvothe
Feb 18 at 11:37

Is there a way so that the output could be in some form where I know which number corresponds to which label. (For example either keeping the same order as the input or maybe formatted as label1[result],label2[result]).

– Kvothe
Feb 18 at 11:37

add a comment |

Next you need a regular expression that matches a sequence of decimal digits, decimal separators etc. preceded by label[123]_ or label[123] =

Putting it together:

perl -0777nE 'say "{", (join ",", /label[123](?:_| = )K[0-9.+-]+/g), "}"' file1 file2 [...]

Note: I have not tried to address maybe_label1_again_but_now_I_no_longer_care_about_what_comes_after

answered Feb 15 at 14:38

steeldriver

70.8k11115187

Thank you! Would you might helping me out with one last thing? I was thinking of the output in terms of an ordered pair so that I know which number corresponds to which label. Another format in which this is still clear would also be great of course. The problem with the solution above is that if I use /(?:label1|label2)(?:_| = ) , it gives results in order of occurrence and I don't know which result corresponds to which label.

– Kvothe
Feb 18 at 11:37

Is there a way so that the output could be in some form where I know which number corresponds to which label. (For example either keeping the same order as the input or maybe formatted as label1[result],label2[result]).

– Kvothe
Feb 18 at 11:37

add a comment |

Next you need a regular expression that matches a sequence of decimal digits, decimal separators etc. preceded by label[123]_ or label[123] =

Putting it together:

perl -0777nE 'say "{", (join ",", /label[123](?:_| = )K[0-9.+-]+/g), "}"' file1 file2 [...]

Note: I have not tried to address maybe_label1_again_but_now_I_no_longer_care_about_what_comes_after

answered Feb 15 at 14:38

steeldriver

70.8k11115187

Next you need a regular expression that matches a sequence of decimal digits, decimal separators etc. preceded by label[123]_ or label[123] =

Putting it together:

perl -0777nE 'say "{", (join ",", /label[123](?:_| = )K[0-9.+-]+/g), "}"' file1 file2 [...]

Note: I have not tried to address maybe_label1_again_but_now_I_no_longer_care_about_what_comes_after

answered Feb 15 at 14:38

steeldriver

70.8k11115187

answered Feb 15 at 14:38

steeldriver

70.8k11115187

answered Feb 15 at 14:38

steeldriver

70.8k11115187

answered Feb 15 at 14:38

steeldriver

70.8k11115187

Thank you! Would you might helping me out with one last thing? I was thinking of the output in terms of an ordered pair so that I know which number corresponds to which label. Another format in which this is still clear would also be great of course. The problem with the solution above is that if I use /(?:label1|label2)(?:_| = ) , it gives results in order of occurrence and I don't know which result corresponds to which label.

– Kvothe
Feb 18 at 11:37

Is there a way so that the output could be in some form where I know which number corresponds to which label. (For example either keeping the same order as the input or maybe formatted as label1[result],label2[result]).

– Kvothe
Feb 18 at 11:37

add a comment |

Thank you! Would you might helping me out with one last thing? I was thinking of the output in terms of an ordered pair so that I know which number corresponds to which label. Another format in which this is still clear would also be great of course. The problem with the solution above is that if I use /(?:label1|label2)(?:_| = ) , it gives results in order of occurrence and I don't know which result corresponds to which label.

– Kvothe
Feb 18 at 11:37

Is there a way so that the output could be in some form where I know which number corresponds to which label. (For example either keeping the same order as the input or maybe formatted as label1[result],label2[result]).

– Kvothe
Feb 18 at 11:37

Thank you! Would you might helping me out with one last thing? I was thinking of the output in terms of an ordered pair so that I know which number corresponds to which label. Another format in which this is still clear would also be great of course. The problem with the solution above is that if I use /(?:label1|label2)(?:_| = ) , it gives results in order of occurrence and I don't know which result corresponds to which label.

– Kvothe
Feb 18 at 11:37

Is there a way so that the output could be in some form where I know which number corresponds to which label. (For example either keeping the same order as the input or maybe formatted as label1[result],label2[result]).

– Kvothe
Feb 18 at 11:37

add a comment |

`sed` solution

With $p holding the label regex, e.g. p='label[13](_| = )':

sed ':a;N;$!ba;s/n/ /g;s/'"$p"'[-.0-9]+/&n/g' | 

sed '/.*'"$p"'[-.0-9]+/!d;s/.*'"$p"'([-.0-9]+)/2/' | 

sed ':a;N;$!ba;s/n/,/g;s/.*/{&}/'

$p must hold a valid regex and exactly one group (or you need to adjust the RHS part of the third substitution expression), for example:

p='label1(_)'

p='label3( = )'

p='label[13](_| = )'

p='(label1_|label3 = )'

p='(height|length)_'

Multiple different strings in the group are to be separated by |.

Examples

$ <input cat

bla bla bla label1_5234_blablab_some_other_text_and_numbers_23343_blabla_more_text_and_numbers_maybe_label1_again_but_now_I_no_longer_care_about_what_comes_after blabla_label2_34343_this_is_some_other_number_want_to_be_able_to_extract_if_I_look_for_label2_instead_of_label1

label3 = -0.34343 

and_more_text_and_so_on_and_so_forth

$ p='label1(_)'

$ <input sed ':a;N;$!ba;s/n/ /g;s/'"$p"'[-.0-9]+/&n/g' | sed '/.*'"$p"'[-.0-9]+/!d;s/.*'"$p"'([-.0-9]+)/2/' | sed ':a;N;$!ba;s/n/,/g;s/.*/{&}/'

5234

$ p='label3( = )'

$ <input sed ':a;N;$!ba;s/n/ /g;s/'"$p"'[-.0-9]+/&n/g' | sed '/.*'"$p"'[-.0-9]+/!d;s/.*'"$p"'([-.0-9]+)/2/' | sed ':a;N;$!ba;s/n/,/g;s/.*/{&}/'

-0.34343

$ p='label[13](_| = )'

$ <input sed ':a;N;$!ba;s/n/ /g;s/'"$p"'[-.0-9]+/&n/g' | sed '/.*'"$p"'[-.0-9]+/!d;s/.*'"$p"'([-.0-9]+)/2/' | sed ':a;N;$!ba;s/n/,/g;s/.*/{&}/'

{5234,-0.34343}

$ echo "height_2.3 blabla_bla_length_3.4" >>input

$ p='(height)_'

$ <input2 sed ':a;N;$!ba;s/n/ /g;s/'"$p"'[-.0-9]+/&n/g' | sed '/.*'"$p"'[-.0-9]+/!d;s/.*'"$p"'([-.0-9]+)/2/' | sed ':a;N;$!ba;s/n/,/g;s/.*/{&}/'

2.3

$ p='(height|length)_'

$ <input2 sed ':a;N;$!ba;s/n/ /g;s/'"$p"'[-.0-9]+/&n/g' | sed '/.*'"$p"'[-.0-9]+/!d;s/.*'"$p"'([-.0-9]+)/2/' | sed ':a;N;$!ba;s/n/,/g;s/.*/{&}/'

{2.3,3.4}

edited Feb 18 at 13:27

answered Feb 15 at 14:36

dessert

25.5k674108

Thank you! This is great! I have some questions if you don't mind? Firstly in the real case the labels are really different words so I would need to ask for alternatives in a different way. But somehow both my tries: p='(label1(_)|label3( = ))' and p='(label1|label3)(_| = )' fail (where each individual pattern does find what I want). What am I doing wrong? Also, why does | have to be escaped? Isn't it acting as alternative operator and not a literal character and should thus not be escaped? And finally is it possible to match (or keep) only the first match (of a specific label)?

– Kvothe
Feb 18 at 10:46

@Kvothe If the patterns are in fact different please edit your question post accordingly.

– dessert
Feb 18 at 12:03

I can add an extra example, but of course an example will always just be an example. I tried to describe more generally what should happen in the text. I meant that label1 stands for some text, say height, while label2 stands for some other word say length. Of course the whole idea is that it should work for any label. I will see if I can clarify that in the question.

– Kvothe
Feb 18 at 12:57

@Kvothe I edited to reflect the changes, but my approach works without changes for these labels as well – I added the regex as a further example and showed the effect in the Examples section.

– dessert
Feb 18 at 13:30

thanks. The thing I was doing wrong is that I wasn't escaping the initial parentheses and the |, as done in p='(height|length)_'. Would you mind explaining why these need to be escaped? I had not expected that since we don't want them to stand for the literal symbol we want them to stand for the operators.

– Kvothe
Feb 18 at 13:42

|
show 1 more comment

`sed` solution

With $p holding the label regex, e.g. p='label[13](_| = )':

sed ':a;N;$!ba;s/n/ /g;s/'"$p"'[-.0-9]+/&n/g' | 

sed '/.*'"$p"'[-.0-9]+/!d;s/.*'"$p"'([-.0-9]+)/2/' | 

sed ':a;N;$!ba;s/n/,/g;s/.*/{&}/'

$p must hold a valid regex and exactly one group (or you need to adjust the RHS part of the third substitution expression), for example:

p='label1(_)'

p='label3( = )'

p='label[13](_| = )'

p='(label1_|label3 = )'

p='(height|length)_'

Multiple different strings in the group are to be separated by |.

Examples

$ <input cat

bla bla bla label1_5234_blablab_some_other_text_and_numbers_23343_blabla_more_text_and_numbers_maybe_label1_again_but_now_I_no_longer_care_about_what_comes_after blabla_label2_34343_this_is_some_other_number_want_to_be_able_to_extract_if_I_look_for_label2_instead_of_label1

label3 = -0.34343 

and_more_text_and_so_on_and_so_forth

$ p='label1(_)'

$ <input sed ':a;N;$!ba;s/n/ /g;s/'"$p"'[-.0-9]+/&n/g' | sed '/.*'"$p"'[-.0-9]+/!d;s/.*'"$p"'([-.0-9]+)/2/' | sed ':a;N;$!ba;s/n/,/g;s/.*/{&}/'

5234

$ p='label3( = )'

$ <input sed ':a;N;$!ba;s/n/ /g;s/'"$p"'[-.0-9]+/&n/g' | sed '/.*'"$p"'[-.0-9]+/!d;s/.*'"$p"'([-.0-9]+)/2/' | sed ':a;N;$!ba;s/n/,/g;s/.*/{&}/'

-0.34343

$ p='label[13](_| = )'

$ <input sed ':a;N;$!ba;s/n/ /g;s/'"$p"'[-.0-9]+/&n/g' | sed '/.*'"$p"'[-.0-9]+/!d;s/.*'"$p"'([-.0-9]+)/2/' | sed ':a;N;$!ba;s/n/,/g;s/.*/{&}/'

{5234,-0.34343}

$ echo "height_2.3 blabla_bla_length_3.4" >>input

$ p='(height)_'

$ <input2 sed ':a;N;$!ba;s/n/ /g;s/'"$p"'[-.0-9]+/&n/g' | sed '/.*'"$p"'[-.0-9]+/!d;s/.*'"$p"'([-.0-9]+)/2/' | sed ':a;N;$!ba;s/n/,/g;s/.*/{&}/'

2.3

$ p='(height|length)_'

$ <input2 sed ':a;N;$!ba;s/n/ /g;s/'"$p"'[-.0-9]+/&n/g' | sed '/.*'"$p"'[-.0-9]+/!d;s/.*'"$p"'([-.0-9]+)/2/' | sed ':a;N;$!ba;s/n/,/g;s/.*/{&}/'

{2.3,3.4}

edited Feb 18 at 13:27

answered Feb 15 at 14:36

dessert

25.5k674108

Thank you! This is great! I have some questions if you don't mind? Firstly in the real case the labels are really different words so I would need to ask for alternatives in a different way. But somehow both my tries: p='(label1(_)|label3( = ))' and p='(label1|label3)(_| = )' fail (where each individual pattern does find what I want). What am I doing wrong? Also, why does | have to be escaped? Isn't it acting as alternative operator and not a literal character and should thus not be escaped? And finally is it possible to match (or keep) only the first match (of a specific label)?

– Kvothe
Feb 18 at 10:46

@Kvothe If the patterns are in fact different please edit your question post accordingly.

– dessert
Feb 18 at 12:03

I can add an extra example, but of course an example will always just be an example. I tried to describe more generally what should happen in the text. I meant that label1 stands for some text, say height, while label2 stands for some other word say length. Of course the whole idea is that it should work for any label. I will see if I can clarify that in the question.

– Kvothe
Feb 18 at 12:57

@Kvothe I edited to reflect the changes, but my approach works without changes for these labels as well – I added the regex as a further example and showed the effect in the Examples section.

– dessert
Feb 18 at 13:30

thanks. The thing I was doing wrong is that I wasn't escaping the initial parentheses and the |, as done in p='(height|length)_'. Would you mind explaining why these need to be escaped? I had not expected that since we don't want them to stand for the literal symbol we want them to stand for the operators.

– Kvothe
Feb 18 at 13:42

|
show 1 more comment

`sed` solution

With $p holding the label regex, e.g. p='label[13](_| = )':

sed ':a;N;$!ba;s/n/ /g;s/'"$p"'[-.0-9]+/&n/g' | 

sed '/.*'"$p"'[-.0-9]+/!d;s/.*'"$p"'([-.0-9]+)/2/' | 

sed ':a;N;$!ba;s/n/,/g;s/.*/{&}/'

$p must hold a valid regex and exactly one group (or you need to adjust the RHS part of the third substitution expression), for example:

p='label1(_)'

p='label3( = )'

p='label[13](_| = )'

p='(label1_|label3 = )'

p='(height|length)_'

Multiple different strings in the group are to be separated by |.

Examples

$ <input cat

bla bla bla label1_5234_blablab_some_other_text_and_numbers_23343_blabla_more_text_and_numbers_maybe_label1_again_but_now_I_no_longer_care_about_what_comes_after blabla_label2_34343_this_is_some_other_number_want_to_be_able_to_extract_if_I_look_for_label2_instead_of_label1

label3 = -0.34343 

and_more_text_and_so_on_and_so_forth

$ p='label1(_)'

$ <input sed ':a;N;$!ba;s/n/ /g;s/'"$p"'[-.0-9]+/&n/g' | sed '/.*'"$p"'[-.0-9]+/!d;s/.*'"$p"'([-.0-9]+)/2/' | sed ':a;N;$!ba;s/n/,/g;s/.*/{&}/'

5234

$ p='label3( = )'

$ <input sed ':a;N;$!ba;s/n/ /g;s/'"$p"'[-.0-9]+/&n/g' | sed '/.*'"$p"'[-.0-9]+/!d;s/.*'"$p"'([-.0-9]+)/2/' | sed ':a;N;$!ba;s/n/,/g;s/.*/{&}/'

-0.34343

$ p='label[13](_| = )'

$ <input sed ':a;N;$!ba;s/n/ /g;s/'"$p"'[-.0-9]+/&n/g' | sed '/.*'"$p"'[-.0-9]+/!d;s/.*'"$p"'([-.0-9]+)/2/' | sed ':a;N;$!ba;s/n/,/g;s/.*/{&}/'

{5234,-0.34343}

$ echo "height_2.3 blabla_bla_length_3.4" >>input

$ p='(height)_'

$ <input2 sed ':a;N;$!ba;s/n/ /g;s/'"$p"'[-.0-9]+/&n/g' | sed '/.*'"$p"'[-.0-9]+/!d;s/.*'"$p"'([-.0-9]+)/2/' | sed ':a;N;$!ba;s/n/,/g;s/.*/{&}/'

2.3

$ p='(height|length)_'

$ <input2 sed ':a;N;$!ba;s/n/ /g;s/'"$p"'[-.0-9]+/&n/g' | sed '/.*'"$p"'[-.0-9]+/!d;s/.*'"$p"'([-.0-9]+)/2/' | sed ':a;N;$!ba;s/n/,/g;s/.*/{&}/'

{2.3,3.4}

edited Feb 18 at 13:27

answered Feb 15 at 14:36

dessert

25.5k674108

`sed` solution

With $p holding the label regex, e.g. p='label[13](_| = )':

sed ':a;N;$!ba;s/n/ /g;s/'"$p"'[-.0-9]+/&n/g' | 

sed '/.*'"$p"'[-.0-9]+/!d;s/.*'"$p"'([-.0-9]+)/2/' | 

sed ':a;N;$!ba;s/n/,/g;s/.*/{&}/'

$p must hold a valid regex and exactly one group (or you need to adjust the RHS part of the third substitution expression), for example:

p='label1(_)'

p='label3( = )'

p='label[13](_| = )'

p='(label1_|label3 = )'

p='(height|length)_'

Multiple different strings in the group are to be separated by |.

Examples

$ <input cat

bla bla bla label1_5234_blablab_some_other_text_and_numbers_23343_blabla_more_text_and_numbers_maybe_label1_again_but_now_I_no_longer_care_about_what_comes_after blabla_label2_34343_this_is_some_other_number_want_to_be_able_to_extract_if_I_look_for_label2_instead_of_label1

label3 = -0.34343 

and_more_text_and_so_on_and_so_forth

$ p='label1(_)'

$ <input sed ':a;N;$!ba;s/n/ /g;s/'"$p"'[-.0-9]+/&n/g' | sed '/.*'"$p"'[-.0-9]+/!d;s/.*'"$p"'([-.0-9]+)/2/' | sed ':a;N;$!ba;s/n/,/g;s/.*/{&}/'

5234

$ p='label3( = )'

$ <input sed ':a;N;$!ba;s/n/ /g;s/'"$p"'[-.0-9]+/&n/g' | sed '/.*'"$p"'[-.0-9]+/!d;s/.*'"$p"'([-.0-9]+)/2/' | sed ':a;N;$!ba;s/n/,/g;s/.*/{&}/'

-0.34343

$ p='label[13](_| = )'

$ <input sed ':a;N;$!ba;s/n/ /g;s/'"$p"'[-.0-9]+/&n/g' | sed '/.*'"$p"'[-.0-9]+/!d;s/.*'"$p"'([-.0-9]+)/2/' | sed ':a;N;$!ba;s/n/,/g;s/.*/{&}/'

{5234,-0.34343}

$ echo "height_2.3 blabla_bla_length_3.4" >>input

$ p='(height)_'

$ <input2 sed ':a;N;$!ba;s/n/ /g;s/'"$p"'[-.0-9]+/&n/g' | sed '/.*'"$p"'[-.0-9]+/!d;s/.*'"$p"'([-.0-9]+)/2/' | sed ':a;N;$!ba;s/n/,/g;s/.*/{&}/'

2.3

$ p='(height|length)_'

$ <input2 sed ':a;N;$!ba;s/n/ /g;s/'"$p"'[-.0-9]+/&n/g' | sed '/.*'"$p"'[-.0-9]+/!d;s/.*'"$p"'([-.0-9]+)/2/' | sed ':a;N;$!ba;s/n/,/g;s/.*/{&}/'

{2.3,3.4}

edited Feb 18 at 13:27

answered Feb 15 at 14:36

dessert

25.5k674108

edited Feb 18 at 13:27

answered Feb 15 at 14:36

dessert

25.5k674108

answered Feb 15 at 14:36

dessert

25.5k674108

answered Feb 15 at 14:36

dessert

25.5k674108

Thank you! This is great! I have some questions if you don't mind? Firstly in the real case the labels are really different words so I would need to ask for alternatives in a different way. But somehow both my tries: p='(label1(_)|label3( = ))' and p='(label1|label3)(_| = )' fail (where each individual pattern does find what I want). What am I doing wrong? Also, why does | have to be escaped? Isn't it acting as alternative operator and not a literal character and should thus not be escaped? And finally is it possible to match (or keep) only the first match (of a specific label)?

– Kvothe
Feb 18 at 10:46

@Kvothe If the patterns are in fact different please edit your question post accordingly.

– dessert
Feb 18 at 12:03

I can add an extra example, but of course an example will always just be an example. I tried to describe more generally what should happen in the text. I meant that label1 stands for some text, say height, while label2 stands for some other word say length. Of course the whole idea is that it should work for any label. I will see if I can clarify that in the question.

– Kvothe
Feb 18 at 12:57

@Kvothe I edited to reflect the changes, but my approach works without changes for these labels as well – I added the regex as a further example and showed the effect in the Examples section.

– dessert
Feb 18 at 13:30

thanks. The thing I was doing wrong is that I wasn't escaping the initial parentheses and the |, as done in p='(height|length)_'. Would you mind explaining why these need to be escaped? I had not expected that since we don't want them to stand for the literal symbol we want them to stand for the operators.

– Kvothe
Feb 18 at 13:42


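For a single file:

# -o prints only the matched text, -P enables the (?<=...) lookbehind;
# head -n 1 keeps only the first match
grep -oP '(?<=label1_)[0-9.+-]+' ./file | head -n 1 > ./tmpfile

grep -oP '(?<=label3 = )[0-9.+-]+' ./file | head -n 1 >> ./tmpfile

# join the collected values with commas and wrap them in braces
paste -sd, ./tmpfile | awk '{ print "{"$0"}" }' >> ./newfile

rm ./tmpfile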
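The same idea can be wrapped in a small function so that the label becomes a parameter. This is only a sketch: the name first_number is hypothetical, the pattern matches only digits, dots and signs, and it assumes GNU grep with -P support and a label that contains no regex metacharacters. The sample data is made up along the lines of the question.

```shell
# first_number LABEL FILE: print the first run of digits, dots, plus or
# minus signs in FILE that directly follows LABEL.
first_number() {
    grep -oP "(?<=$1)[0-9.+-]+" "$2" | head -n 1
}

# Made-up sample data, modelled on the question.
sample=$(mktemp)
printf '%s\n' 'bla label1_5234_text_23343_label1_99' 'label3 = -0.34343' > "$sample"

# Collect one value per label and print them as {a,b}.
printf '{%s,%s}\n' \
    "$(first_number 'label1_' "$sample")" \
    "$(first_number 'label3 = ' "$sample")"
# prints {5234,-0.34343}

rm -f "$sample"
```

This avoids the temporary file, since each value is captured with command substitution.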

For multiple files in a folder:

cd to the folder and run:

for file in *; do

# skip the output file so it is not scanned as input
if [ "$file" = "newfile" ]; then continue; fi

grep -oP '(?<=label1_)[0-9.+-]+' "$file" | head -n 1 > ./tmpfile

grep -oP '(?<=label3 = )[0-9.+-]+' "$file" | head -n 1 >> ./tmpfile

# one {value1,value2} line per input file
paste -sd, ./tmpfile | awk '{ print "{"$0"}" }' >> ./newfile

rm ./tmpfile

done
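If grep -P is not available, a similar loop can be sketched with awk. This is an alternative technique, not the method of the answer: the helper name value_after is hypothetical, it treats the label as a literal string, and it only looks at the first occurrence of the label on each line. The scratch directory and file contents are made up for the demo.

```shell
# value_after LABEL FILE: find the first line where the literal string LABEL
# is directly followed by digits, dots or signs, and print that run.
value_after() {
    awk -v label="$1" '
        {
            i = index($0, label)            # position of the label, 0 if absent
            if (i == 0) next
            rest = substr($0, i + length(label))
            if (match(rest, /^[0-9.+-]+/)) {
                print substr(rest, 1, RLENGTH)
                exit                        # stop after the first hit
            }
        }' "$2"
}

# Demo in a scratch directory with two made-up files.
dir=$(mktemp -d)
printf '%s\n' 'height_2.3 bla_length_3.4' > "$dir/a.txt"
printf '%s\n' 'length_-7 and height_10.5_x' > "$dir/b.txt"

for file in "$dir"/*.txt; do
    printf '{%s,%s}\n' \
        "$(value_after 'height_' "$file")" \
        "$(value_after 'length_' "$file")"
done
# prints {2.3,3.4} then {10.5,-7}

rm -r "$dir"
```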

edited Feb 19 at 12:39

answered Feb 15 at 14:23

Vijay


Thanks! The second pipe does not seem to do what I wanted. It takes out the minus sign I wanted to keep for example. I edited to a form where I think it answers my question. (Also added the taking of only the first resulting match). sed 's/_/ /g' ./file | grep -oP "(?<=label1 )[^ ]+" | grep -oE -m1 '(-)?[0-9](.)?([0-9]+)?'

– Kvothe
Feb 18 at 11:16
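One caveat about the last grep stage in that command: the pattern allows only a single digit before an unescaped dot, which matches any character, so a value such as 12.5 gets split. A stricter pattern is sketched below; it is an assumption about the intended number format (optional minus, digits, optional fractional part), not something taken from the answer.

```shell
# The pattern from the comment: one digit, then any one character, then digits.
printf '%s\n' '12.5' | grep -oE -m1 '(-)?[0-9](.)?([0-9]+)?'
# prints 12 and 5 on separate lines

# A stricter sketch: optional minus, digits, optional fractional part.
# The -- stops grep from reading the leading - of the pattern as an option.
printf '%s\n' '12.5' | grep -oE -m1 -- '-?[0-9]+(\.[0-9]+)?'
# prints 12.5
```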

Finally I am confused by the space in (?<=label1 ). There was no space after label1. There was instead an underscore. What is going on here? Why should there be a space here?

– Kvothe
Feb 18 at 11:19
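The space comes from the sed stage of the pipeline quoted in the previous comment: sed 's/_/ /g' replaces every underscore with a space before grep runs, so grep sees "label1 " rather than "label1_". A small sketch with a made-up input line, assuming GNU grep with -P support:

```shell
line='bla label1_-5.25_more'

# Stage 1: every underscore becomes a space.
printf '%s\n' "$line" | sed 's/_/ /g'
# prints: bla label1 -5.25 more

# Stage 2: the lookbehind therefore has to contain a space, and [^ ]+
# takes everything up to the next space.
printf '%s\n' "$line" | sed 's/_/ /g' | grep -oP '(?<=label1 )[^ ]+'
# prints: -5.25
```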

@Kvothe Thanks. Improved, edited and added "for" loop

– Vijay
Feb 19 at 8:55
