Last week we got a request from one of our users with a tricky question:
I need a tool that:
- reads two files (parent and subset)
- I need to pick out random lines from the parent
- but it should ignore lines in the subset. And the number of random lines to be picked should be equal to the number of lines in the subset file.

Two example files are attached:
Parent: abcd bcde cdef defg efgh fghi ghij hijk ijkl jklm klmn lmno mnop nopq opqr pqrs qrst rstu stuv tuvw uvwx vwxy wxyz

And subset (is a subset of the parent file): hijk ijkl jklm opqr pqrs qrst
Any idea how to solve that? It is possible with the normal Galaxy tools :)
We proposed the following solution:
1. “Join two files” to filter out the lines that are in both the subset file and the parent file. You should now have a new parent file without the lines from the subset.
2. “Sort” the new file with the option Random order (-R). Sort does indeed have a random sorting option, which is somewhat unintuitive but useful.
   Now we have a randomly sorted file without the unwanted lines. What remains is to extract N lines, where N is the length of the subset file.
3. “Add column to an existing dataset” with the iterate option enabled. Do this for both files: the initial subset file and the new randomly sorted parent from step 2.
4. Now use “Join two files” again, but this time on the newly created datasets with the additional column. Because the iterated column in the subset file only runs up to N, this gives you exactly N lines, where N is the number of lines in the subset file.
5. (Optional) If needed, remove the additional column with the Cut tool.
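
For readers who want to check the logic outside Galaxy, the same idea can be sketched in a few lines of Python. This is only an illustration of the workflow described above, not what the Galaxy tools run internally. The function name `pick_random_lines` and the `seed` parameter are made up for this example, and the sketch assumes one entry per line with no duplicate lines in the parent.

```python
import random


def pick_random_lines(parent, subset, seed=None):
    """Return len(subset) random lines from parent that do not occur in subset.

    Hypothetical helper for illustration; not part of Galaxy.
    """
    excluded = set(subset)

    # "Join two files": keep only the parent lines that are not in the subset.
    candidates = [line for line in parent if line not in excluded]
    if len(candidates) < len(subset):
        raise ValueError("parent has too few lines left after filtering")

    # "Sort" with random order: shuffle the remaining lines.
    rng = random.Random(seed)
    rng.shuffle(candidates)

    # "Add column" (iterate) + "Join two files" + "Cut": numbering both files
    # and joining on that number keeps only the first N shuffled lines,
    # which is the same as slicing here.
    return candidates[:len(subset)]


parent = (
    "abcd bcde cdef defg efgh fghi ghij hijk ijkl jklm klmn lmno "
    "mnop nopq opqr pqrs qrst rstu stuv tuvw uvwx vwxy wxyz"
).split()
subset = "hijk ijkl jklm opqr pqrs qrst".split()

picked = pick_random_lines(parent, subset)
print("\n".join(picked))
```

Passing a fixed `seed` makes the selection reproducible, which is handy when comparing the result against a Galaxy run by hand; the lines picked will still differ from Galaxy's, since the two use different random sources.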