Last week we got a request from one of our users with a tricky question:
I need a tool that:
- reads two files (parent and subset)
- I need to pick out random lines from the parent
- but it should ignore lines in the subset. And the number of random lines to be picked should be equal to the number of lines in the subset file.

Two example files are attached:
Parent: abcd bcde cdef defg efgh fghi ghij hijk ijkl jklm klmn lmno mnop nopq opqr pqrs qrst rstu stuv tuvw uvwx vwxy wxyz
And subset (is a subset of the parent file): hijk ijkl jklm opqr pqrs qrst
Any idea how to solve that? It is possible with the normal Galaxy tools :)
We proposed the following solution:
1. “Join two files” to filter out the lines that are in both the subset file and the parent file. You should now have a new parent file without the lines from the subset.
2. “Sort” the new file with the option Random order (-R). Sort does indeed have a random sorting option, which is somewhat unintuitive, but useful.
   Now we have a randomly sorted file without the unwanted lines. What remains is to extract N lines, where N is the number of lines in the subset file.
3. “Add column to an existing dataset” with the iterate option enabled. You need to do this for both files: the initial subset file and the new randomly sorted parent from step 2.
4. Now use “Join two files” again, but this time on the newly created datasets with the additional column. This will give you exactly N lines, where N is the number of lines in the subset file.
5. (Optional) If needed, remove the additional column with the Cut tool.
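
For readers who want to follow the logic outside of Galaxy, the steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not what the Galaxy tools run internally: the function name `pick_random_non_subset` and its `seed` parameter are made up for this example, and the two example files are assumed to hold one entry per line (written here as whitespace-separated strings).

```python
import random


def pick_random_non_subset(parent_lines, subset_lines, seed=None):
    """Hypothetical helper that mirrors the workflow steps on lists of lines."""
    rng = random.Random(seed)
    subset = set(subset_lines)

    # Step 1: keep only the parent lines that do not occur in the subset.
    remaining = [line for line in parent_lines if line not in subset]

    # Step 2: bring the remaining lines into random order.
    rng.shuffle(remaining)

    # Step 3: add an iterating counter column (1, 2, 3, ...) to both datasets.
    numbered_parent = list(enumerate(remaining, start=1))
    numbered_subset = list(enumerate(subset_lines, start=1))

    # Step 4: join on the counter column. Only counters present in both
    # datasets survive, i.e. the first N rows of the shuffled parent.
    subset_ids = {i for i, _ in numbered_subset}
    joined = [(i, line) for i, line in numbered_parent if i in subset_ids]

    # Step 5: cut the counter column away again.
    return [line for _, line in joined]


parent = ("abcd bcde cdef defg efgh fghi ghij hijk ijkl jklm klmn lmno "
          "mnop nopq opqr pqrs qrst rstu stuv tuvw uvwx vwxy wxyz").split()
subset = "hijk ijkl jklm opqr pqrs qrst".split()

for line in pick_random_non_subset(parent, subset):
    print(line)
```

The counter-and-join trick in steps 3 and 4 is simply a way to take the first N rows using only general-purpose tools. Note that if the filtered parent has fewer than N lines, the join returns fewer than N lines as well.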