Download the genome in FASTA format.
Open this file in Python and read it in, constructing a single DNA string 580076 characters long. Be sure to remove the header information in the file. An easy way to do this is to read in every line, strip the white space, skip the header line, and concatenate the resulting strings together. Print out the strand and its length to make sure this works.
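The reading step above can be sketched as follows. This is a minimal sketch that assumes a single-record FASTA file whose header lines begin with `>`; the function name `read_fasta_sequence` and the file name in the comment are hypothetical, and the demo uses a tiny in-memory stand-in instead of the real genome.

```python
import io

def read_fasta_sequence(handle):
    """Concatenate all non-header lines of a single-record FASTA file."""
    pieces = []
    for line in handle:
        line = line.strip()              # drop the newline and stray spaces
        if not line or line.startswith(">"):
            continue                     # skip blank lines and the header
        pieces.append(line.upper())
    return "".join(pieces)

# Tiny in-memory stand-in for the real file (made-up sequence).
demo_file = io.StringIO(">demo header\nATGAAA\nTAGCCC\n")
genome = read_fasta_sequence(demo_file)
print(genome, len(genome))

# With the downloaded file (hypothetical name) this would be:
# with open("genome.fasta") as fh:
#     genome = read_fasta_sequence(fh)
```

For the real genome the printed length should be 580076; any other value suggests the header or some white space was left in.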
Part A of this project is the analysis of the pattern of start codons and stop codons in the above sequence. Note that a stop codon always stops translation, while a start codon does not always start it. How many start and stop codons are there? Are there more starts than stops, or vice versa? Write a script that counts them. Remember there are three different stops. How many times is the distance from one stop to the next stop greater than 600? Are there multiple starts along these long stretches with no stops? What thoughts do you have about this? Have your program print out results for the six frames (three forward and three complementary) as follows. Large stop blocks are stretches of more than 600 characters without any intervening stops. For each frame print a line like this:
Frame # : startct= # , stopct= # , large stop blocks= #.
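One possible way to produce these six lines is sketched below. It assumes ATG is the only start codon, takes TAA, TAG and TGA as the three stops, and measures the stop-to-stop distance between the first bases of consecutive in-frame stop codons; these are assumptions, not definitions from the assignment. The names `frame_stats`, `reverse_complement` and `report` are hypothetical.

```python
STARTS = {"ATG"}                      # assumption: only ATG counts as a start
STOPS = {"TAA", "TAG", "TGA"}         # the three stop codons
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the complementary strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def frame_stats(seq, offset, min_block=600):
    """Count starts, stops and large stop-to-stop blocks in one frame."""
    start_ct = stop_ct = large_blocks = 0
    last_stop = None                  # index of the previous in-frame stop
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in STARTS:
            start_ct += 1
        elif codon in STOPS:
            stop_ct += 1
            if last_stop is not None and i - last_stop > min_block:
                large_blocks += 1
            last_stop = i
    return start_ct, stop_ct, large_blocks

def report(genome):
    """Build one summary line per frame: 1-3 forward, 4-6 complementary."""
    lines = []
    frame = 0
    for strand in (genome, reverse_complement(genome)):
        for offset in range(3):
            frame += 1
            s, t, b = frame_stats(strand, offset)
            lines.append(f"Frame {frame} : startct= {s} , stopct= {t} , "
                         f"large stop blocks= {b}.")
    return lines

# Made-up test strand: stop, start, 200 filler codons, stop.
demo = "TAA" + "ATG" + "AAA" * 200 + "TAG"
lines = report(demo)
print("\n".join(lines))
```

Scanning the reverse complement with offsets 0, 1 and 2 gives the three complementary frames, so the same counting function serves all six.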
As we discussed in class, build a dict() that associates each stop with all the start codons found just prior to it, that is, since the previous stop. The stop is the key and the list of starts is the value.
We will restrict our analysis to these large stop to stop blocks.
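The stop-to-starts dictionary, restricted to large blocks, might look like the sketch below. The function name `starts_by_stop` is hypothetical. The sketch assumes ATG as the only start, uses 0-based positions within the scanned strand, and skips stops that have no preceding start or no preceding in-frame stop; those choices are illustrative, not part of the assignment.

```python
STARTS = {"ATG"}                      # assumption: only ATG counts as a start
STOPS = {"TAA", "TAG", "TGA"}

def starts_by_stop(seq, offset=0, min_block=600):
    """Map each stop position to the starts seen since the previous stop.

    Only stops that close a block longer than min_block are kept.
    """
    table = {}
    pending = []                      # starts seen since the last stop
    last_stop = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in STARTS:
            pending.append(i)
        elif codon in STOPS:
            is_large = last_stop is not None and i - last_stop > min_block
            if is_large and pending:
                table[i] = pending
            pending = []              # the next block starts empty
            last_stop = i
    return table

# Made-up strand with two starts inside one long block.
demo = "TAA" + "ATG" + "AAA" * 100 + "ATG" + "AAA" * 100 + "TAG"
table = starts_by_stop(demo)
print(table)
```

Each value lists every candidate start for the ORF ending at that stop; the earliest one gives the longest candidate ORF.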
Part B: But before we do this, let's look at the GenBank file for this little guy. It contains the actual genes that the original researchers annotated. Normally, when the sequencing is first performed, this information is not known. They have to look at every ORF and either convert it to a protein sequence and check whether that protein is known, or at least look at the sequence statistically and see whether its pattern resembles known proteins. In this file you will notice CDS entries. The Coding Sequence (CDS) is the actual region of DNA that is supposedly translated to form a protein. Some are hypothetical in the sense that the protein was not observed at the time of the annotation.
While the ORF may contain introns (in eukaryotes), the ORF and the CDS are the same in prokaryotes. Since this is a long file, write a program that extracts the gene information using regular expressions. Put each gene in a list (or some other data structure, e.g. a dict()) with the start location as its first value and the stop as its second. Print out the smallest gene, the largest gene by length, and the number of genes. We can use this list to check whether any of the ORFs we find in the FASTA file are among the annotated genes. I will discuss regular expressions on Monday.
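A regular-expression sketch for the CDS lines follows. It assumes the usual GenBank feature-table layout, in which a feature line starts with five spaces, the key `CDS`, and a location such as `686..1828` or `complement(8551..9183)`; locations written with `join(...)` are not handled. The names `CDS_PATTERN` and `extract_cds` are hypothetical, and the sample text uses made-up coordinates rather than values from the real record.

```python
import re

# Optional "complement(", then start..stop; "<" and ">" mark partial ends.
CDS_PATTERN = re.compile(
    r"^ {5}CDS\s+(complement\()?<?(\d+)\.\.>?(\d+)\)?",
    re.MULTILINE,
)

def extract_cds(text):
    """Return a list of (start, stop, is_complement) tuples."""
    genes = []
    for m in CDS_PATTERN.finditer(text):
        genes.append((int(m.group(2)), int(m.group(3)),
                      m.group(1) is not None))
    return genes

def gene_length(gene):
    """Length in bases; GenBank positions are 1-based and inclusive."""
    return gene[1] - gene[0] + 1

# Made-up excerpt in GenBank feature-table style.
sample = """\
     gene            686..1828
     CDS             686..1828
                     /product="hypothetical protein"
     CDS             complement(8551..9183)
     CDS             2845..4797
"""
genes = extract_cds(sample)
smallest = min(genes, key=gene_length)
largest = max(genes, key=gene_length)
print(len(genes), smallest, largest)
```

For the real file, `text` would be the whole GenBank file read with `open(...).read()`.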
Part C: The final stage of this program is to determine which of the large ORFs that you find in the FASTA file are actual genes in the gb file. Just go through either the FASTA or the GenBank data and see whether each gene or ORF is in the other. Print out the number that you find and the five largest genes. For each, just print its start and stop values and whether or not it is on the complementary strand. Also print out the number of large ORFs that are not found in the gb file.
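The comparison can be sketched as below, under several assumptions: ORF positions are 0-based indices of the first base of the start and stop codons on the scanned strand, GenBank positions are 1-based and inclusive and include the stop codon, and the complementary strand was scanned as the reverse complement. The names `orf_to_genbank` and `match_orfs` and the tuple layouts are hypothetical, and the demo data is made up.

```python
def orf_to_genbank(start, stop, genome_len, is_complement):
    """Convert 0-based codon indices on the scanned strand to
    1-based inclusive forward-strand coordinates."""
    if is_complement:
        # Index p on the reverse complement is forward position genome_len - p.
        return genome_len - (stop + 2), genome_len - start
    return start + 1, stop + 3

def match_orfs(orfs, genes, genome_len):
    """orfs: (stop, starts, is_complement); genes: (start, stop, is_complement).

    Returns the matched genes, largest first, and the count of unmatched ORFs.
    """
    gene_set = set(genes)
    found = []
    missing = 0
    for stop, starts, is_comp in orfs:
        hits = []
        for s in starts:              # try every candidate start for this stop
            lo, hi = orf_to_genbank(s, stop, genome_len, is_comp)
            if (lo, hi, is_comp) in gene_set:
                hits.append((lo, hi, is_comp))
        if hits:
            found.extend(hits)
        else:
            missing += 1
    found.sort(key=lambda g: g[1] - g[0], reverse=True)
    return found, missing

# Made-up data: two ORFs match annotated genes, one does not.
genes = [(31, 99, False), (902, 1000, True)]
orfs = [(96, [0, 30], False), (96, [0], True), (300, [200], False)]
found, missing = match_orfs(orfs, genes, genome_len=1000)
print(len(found), found[:5], missing)
```

Requiring an exact match on both ends is strict; if the annotated start differs from every candidate start, matching on the stop position alone is a looser alternative.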