Search for text in files of same name in multiple folders of various depths
Situation: In Linux, I have a parent folder with almost 100 folders of various names. Each folder has a file ResourceParent.xml and hundreds of folders for different version numbers, each of which has its own ResourceVer.xml file. I am interested in both the ResourceParent.xml in the 1st-level folder and the ResourceVer.xml in the LATEST version folder (highest number), e.g. ver548.
I need to search inside each file for 3 tags, .txt|.csv|.xls, and return the information inside these tags into a report.txt file. The tags are usually on the same line, so I think grep is OK.
What I've tried:
grep -nr -E ".txt|.csv|.xls" . > /dir/to/the/ReportFile.txt
This takes way too long as it searches in every one of the thousands of directories and produces a lot of unnecessary duplicated data.
Also, I've tried going into each folder, depending on what I'm looking for, and running this command, which is a bit better in terms of reduced duplicates and more relevant data, but it is still too cumbersome.
Question: How do I run a Linux script to search for tags in a file structure that looks like this?
Tags of interest inside .xml files: ".txt|.csv|.xls"
Current location: /dir
File of interest 1: /dir/par/ResourceParent.xml
File of interest 2 (need the latest ver number): /dir/par/ver###/ResourceVer.xml
Needed output file: ResourceReport.txt
Update
I found that ls | tail -1 selects the folder with the greatest ver number, so I think the answer involves this.
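As a rough, untested sketch of what I mean (par stands for each of the ~100 first-level folders; this assumes the ver folders are zero-padded like ver001 so that ls | tail -1 really returns the highest number, and that names contain no odd characters):
for d in /dir/*/; do
    latest=$(ls -d "$d"ver*/ | tail -1)       # last ver folder in listing order
    grep -nE '.txt|.csv|.xls' "$d"ResourceParent.xml "${latest}ResourceVer.xml"
done > /dir/ResourceReport.txt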
Tags: command-line, text-processing, grep
Can't solve your entire problem, but this should do for your "grep" problem:
find . -type f \( -name "*.csv" -o -name "*.txt" -o -name "*.xls" \)
or rather
find . -type f \( -name "*.csv" -o -name "*.txt" -o -name "*.xls" \) > /dir/to/the/ReportFile.txt
If you want the full path, you could change the . to /dir/to/the/ReportFile/. Hope that helps!
– Patient32Bit
Jan 15 at 7:15
@Patient32Bit I think the suffixes are in the text in the files, not in their names.
– Zanna
Jan 15 at 17:10
@Zanna That is correct - I need to get those ".csv" etc. strings that exist inside the files.
– Joe
Jan 15 at 18:48
@Patient32Bit Another part of my project includes an issue that your suggestion resolves, I think! In that issue there are no consistent levels of folder structure, but there are similarities in folder names and file names at the end of each branch.
– Joe
Jan 15 at 19:14
asked Jan 15 at 6:09 by Joe, edited Jan 15 at 13:45 by Zanna
1 Answer
Perhaps with two commands...
grep --include="ResourceParent.xml" -r -E '.txt|.csv|.xls' > file
for d in par*; do a=("$d"/*); b=($(printf '%s\n' "${a[@]}" | sort -V)); grep -HE '.txt|.csv|.xls' "${b[@]: -1}"/*; done >> file
The second one puts the contents of each directory at the par level into an array sorted by version number, so that you can search just the last item in the array. This seems to work (I am getting the last version number) and only takes a couple of seconds on my test directory structure (the first command takes about twice as long).
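For illustration, with hypothetical contents (the point is that version sort puts ver10 after ver2, while plain glob order does not):
# inside one iteration, if par1 contains ResourceParent.xml and ver1 ... ver10:
# a=(par1/ResourceParent.xml par1/ver1 par1/ver10 par1/ver2 ... par1/ver9)   <- glob (lexical) order
# b=(par1/ResourceParent.xml par1/ver1 par1/ver2 ... par1/ver9 par1/ver10)   <- after sort -V
# "${b[@]: -1}" expands to the last element, par1/ver10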
If your version numbers are padded so that they sort naturally, for the second command you would be able to use simply:
for d in par*; do a=("$d"/*); grep -HE '.txt|.csv|.xls' "${a[@]: -1}"/*; done >> file
I mean, if your numbers are ver1 ver2 ... ver100, you will need to sort the array, but if they are ver001, ver002 ... ver100, you will not need to sort the array because it will be in the right order anyway.
You may need to replace "${b[@]: -1}"/* with "${b[@]: -1}"/ResourceVer.xml. I did not create other files. You will presumably also need to replace par* with something (I think you said you have about 100 directories at this level).
But maybe you wanted the data sorted by the directories at the level of par, so that you get:
data from par1/ResourceParent.xml
data from par1/ver{latest}/ResourceVer.xml
data from par2/ResourceParent.xml
data from par2/ver{latest}/ResourceVer.xml
You could perform some text processing on the output file, but it depends how your par directories are named. Since I named them par1 par2 ... par200,
sort -V file >> betterfile
will do that job, assuming the filenames have no newlines.
You could also trim off the filenames by using grep -h (instead of -H) in the original commands (though that would mean that you could not sort the data afterwards by the above method), or by text processing at the end. For example, if your filenames have no colons or newlines, this would be quite reliable:
sed 's/^[^:]*://' file
You can write to the file instead of stdout by adding the -i flag to sed after testing.
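For example, once you are happy with the output, the same command editing the report in place:
sed -i 's/^[^:]*://' file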
Thanks to John1024, whose answer on U&L provides a great way to get the last filename that doesn't rely on parsing the output of ls or find, or gratuitously looping over the structure to count iterations.
The version folders do have ver001, ver002, etc. There seem to be no more than 999 anywhere, but given enough years it'll get there, I think?
– Joe
Jan 15 at 19:05
@Joe haha I suppose so. I think the shorter command should work until then. But I have assumed that your structure has consistent depth levels and I will have to rethink if it does not... I think recursive globbing should fix it...
– Zanna
Jan 15 at 20:09
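A rough, untested sketch of the recursive-globbing idea mentioned above (assumes bash, no spaces or newlines in names, and that the ver* folders can sit at any depth next to a ResourceParent.xml):
shopt -s globstar nullglob                        # ** matches any depth; unmatched globs expand to nothing
for d in **/; do
    a=("$d"ver*/); (( ${#a[@]} )) || continue     # skip directories with no ver subfolders
    b=($(printf '%s\n' "${a[@]}" | sort -V))      # version-sort so the highest number comes last
    grep -HE '.txt|.csv|.xls' "$d"ResourceParent.xml "${b[@]: -1}"ResourceVer.xml
done >> file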
Actually the structure is not consistent. However, my approach was to start simple and solve one problem at a time, like this one, haha. I was going to just navigate to the folder below which the structure is consistent and run the commands. I'll get into this later for testing.. :-)
– Joe
Jan 15 at 22:25
@Joe If the higher version number directories are newer, you can use find, which looks at the metadata. It might be slow but it will be reliable. I haven't been able to sort the array with various levels and numbers, though I can try to figure it out.
– Zanna
Jan 16 at 7:07
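A possible sketch of that find idea (untested; assumes GNU find, no newlines in paths, and that the most recently modified ResourceVer.xml under each first-level folder really is the latest version):
for d in */; do
    latest=$(find "$d" -name ResourceVer.xml -printf '%T@ %p\n' | sort -n | tail -1 | cut -d' ' -f2-)
    [ -n "$latest" ] && grep -HE '.txt|.csv|.xls' "$d"ResourceParent.xml "$latest"
done >> file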
answered Jan 15 at 13:44 by Zanna, edited Jan 15 at 14:38