Tengo las siguientes secuencias que están en formato fasta con encabezado de secuencia y sus nucleótidos. ¿Cómo puedo extraer aleatoriamente las secuencias. Por ejemplo, me gustaría seleccionar aleatoriamente 2 secuencias de las secuencias totales. Se proporcionan herramientas para hacerlo, es extraer de acuerdo con el porcentaje, pero no con el número de secuencias. ¿Alguien puede ayudarme?
A.fasta
>chr1:1310706-1310726 GACGGTTTCCGGTTAGTGGAA >chr1:901959-901979 GAGGGCTTTCTGGAGAAGGAG >chr1:983001-983021 GTCCGCTTGCGGGACCTGGGG >chr1:984333-984353 CTGGAATTCCGGGCGCTGGAG >chr1:1154147-1154167 GAGATCGTCCGGGACCTGGGT
Rendimiento esperado
>chr1:1154147-1154167 GAGATCGTCCGGGACCTGGGT >chr1:901959-901979 GAGGGCTTTCTGGAGAAGGAG
Si está trabajando con archivos fasta, use BioPython , para obtener n
secuencias use random.sample :
from Bio import SeqIO from random import sample with open("foo.fasta") as f: seqs = SeqIO.parse(f,"fasta") print(sample(list(seqs), 2))
Salida:
[SeqRecord(seq=Seq('GAGATCGTCCGGGACCTGGGT', SingleLetterAlphabet()), id='chr1:1154147-1154167', name='chr1:1154147-1154167', description='chr1:1154147-1154167', dbxrefs=[]), SeqRecord(seq=Seq('GTCCGCTTGCGGGACCTGGGG', SingleLetterAlphabet()), id='chr1:983001-983021', name='chr1:983001-983021', description='chr1:983001-983021', dbxrefs=[])]
Puedes extraer las cuerdas si es necesario:
print([(seq.name,str(seq.seq)) for seq in sample(list(seqs),2)]) [('chr1:1310706-1310726', 'GACGGTTTCCGGTTAGTGGAA'), ('chr1:983001-983021', 'GTCCGCTTGCGGGACCTGGGG')]
Si las líneas estuvieran siempre en pares y omitiera los metadatos en la parte superior, podría comprimir:
from random import sample with open("foo.fasta") as f: print(sample(list(zip(f, f)), 2))
Lo que te dará pares de líneas en tuplas:
[('>chr1:983001-983021\n', 'GTCCGCTTGCGGGACCTGGGG\n'), ('>chr1:984333-984353\n', 'CTGGAATTCCGGGCGCTGGAG\n')]
Para tener las líneas listas para ser escritas:
from Bio import SeqIO from random import sample with open("foo.fasta") as f: seqs = SeqIO.parse(f, "fasta") samps = ((seq.name, seq.seq) for seq in sample(list(seqs),2)) for samp in samps: print(">{}\n{}".format(*samp))
Salida:
>chr1:1310706-1310726 GACGGTTTCCGGTTAGTGGAA >chr1:983001-983021 GTCCGCTTGCGGGACCTGGGG
Dado el formato de archivo que ha mostrado, y suponiendo que el archivo no es demasiado grande, no necesita ningún módulo externo (por ejemplo, biopión) para hacer esto:
import random with open('A.fasta') as f: data = f.read().splitlines() for i in random.sample(range(0, len(data), 2), 2): print data[i] print data[i+1]
Ejemplo de salida:
> chr1: 984333-984353 CTGGAATTCCGGGCGCTGGAG > chr1: 901959-901979 GAGGGCTTTCTGGAGAAGGAG
Esto simplemente selecciona 2 encabezados de secuencia aleatoria (las líneas de A.fasta con índices pares en los data
) y la línea que lo sigue.
Si su archivo es grande, entonces los módulos externos pueden tener optimizaciones para hacer frente a conjuntos de datos más grandes.
No sé mucho sobre Fasta, pero Python tiene un módulo Fasta (primero debes instalarlo).
>>> from pyfasta import Fasta >>> f = Fasta('tests/test1.fasta') >>> sorted(f.keys()) ['chr1', 'chr2', 'chr3']
Luego puedes usar la función de muestra del módulo Aleatorio de Python y elegir tantas como quieras al azar …
from random import sample sample(f, how_many_you_want)
import sys,random from Bio import SeqIO from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord from Bio.Alphabet import generic_protein # Use: python scriptname.py number_of_random_seq infile.fasta outfile.fasta infile = sys.argv[2] #Name of the input file seq = list(SeqIO.parse(infile,"fasta")) #Create a list with all the sequence records print "Input fasta file = ", infile totseq = len(seq) #Total number of sequences in the input file print "Number of sequences in the original file = ", totseq randseq = int(sys.argv[1]) #Number of random sequences desired print "Number of random sequences desired = ", randseq if randseq > totseq: print "The requested number of random sequences is greater that the total number of input sequences. Exiting." exit() outfile = sys.argv[3] #Name of the output file print "Output fasta file = ", outfile outrandseq = [] outlist = [] print "Randomly chosen output sequences:" for i in range(randseq): choose = random.randint(1,totseq-1) #Choose a random sequence record number for j in range(len(outrandseq)): #Test to see if the random sequence record number has already been chosen if choose == outrandseq[j]: choose = random.randint(1,totseq-1) #Choose a new random sequence record number if the current one has already been chosen outrandseq.append(choose) print choose outseq = seq[choose] outlist.append(outseq) #Append seq record to output list SeqIO.write(outlist, outfile, "fasta") #Write the output list to the outfile exit()
import sys,random from Bio import SeqIO from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord from Bio.Alphabet import generic_protein # I use this from small numbers of sequences (input file up to 10000 sequences) and it works fine. # For very large sequence sets it may be too slow -- I just have not tried. # Use: python scriptname.py number_of_random_seq infile.fasta outfile.fasta infile = sys.argv[2] #Name of the input file seq = list(SeqIO.parse(infile,"fasta")) #Create a list with all the sequence records print "Input fasta file = ", infile totseq = len(seq) #Total number of sequences in the input file print "Number of sequences in the original file = ", totseq numrandseq = int(sys.argv[1]) #Number of random sequences desired print "Number of random sequences desired = ", numrandseq if numrandseq > totseq: print "The requested number of random sequences is greater that the total number of input sequences. Exiting." exit() outfile = sys.argv[3] #Name of the output file print "Output fasta file = ", outfile outrandseqset = [] i = 1 for i in range(numrandseq): #Create a list of random sequence record numbers for output choice = random.randint(1,totseq) outrandseqset.append(choice) i = 1 j = 1 duplicate = 1 while duplicate: #Make sure no sequences are duplicated in the list duplicate = 0 for i in range(numrandseq): for j in range(i+1, numrandseq): if outrandseqset[i] == outrandseqset[j]: outrandseqset[j] = random.randint(1,totseq) duplicate = 1 i = 1 print "Randomly chosen output sequences:" for i in range(numrandseq): print outrandseqset[i] outlist = [] i = 1 for i in range(numrandseq): #Create the list of seq records to be written to the output file seqnum = outrandseqset[i] outseq = seq[seqnum] outlist.append(outseq) SeqIO.write(outlist, outfile, "fasta") #Write the output list to the outfile exit()
Depende si tienes unix sort
o shuf
instalado. Si es así, es muy fácil Seleccionar aleatoriamente 3000 líneas de un archivo con códigos awk
grep ‘>’ A.fasta> ARCHIVO
CONTIGS =
sort -R FILE | head -n2|tr "\n" " "
sort -R FILE | head -n2|tr "\n" " "
o
CONTIGS = shuf
shuf -n2 FILE|tr "\n" " "
Luego, usa samtools para extraer
samtools faidx A.fasta $CONTIGS