What programming language you start with really all depends on where you want to go with programming/coding. The great thing about this field is that there are an absolute abundance of smaller fields that you can go into, all using programming in their own unique ways. For web applications, a good start would be with HTML and later moving your way through CSS, JavaScript, JQuery, PHP, SQL, and any of the JavaScript libraries. Ruby is also a popular choice, so I would recommend checking that out too. For more scientific fields or areas with more machine learning and A.I., Python is generally a great place to start as it is widely used in that field of study. C++ is also a very useful language to know for that, but it can be a little more challenging for beginners. For game and application design, languages such as C#, C, Swift, Kotlin, and Java are most often used for that.
Description
DNA sequences are made up of a 4 character alphabet - A, C, T or G, that describe the nucleotide bases in a gene sequence. To ascertain the sequence of DNA, scientists use chemical methods to identify the component nucleotides in a method called DNA sequencing. DNA shotgun sequencing is a method whereby DNA subsequences of the same larger sequence are produced at massive parallel scale by DNA sequencing methods, and the overlap between segments is used to reconstruct the input gene. This is a fast and accurate method, and is dropping in price. Shotgun sequencing was used to perform the first entire sequence of a human's DNA, for example. For additional background information, see Wikipedia on shotgun sequencing.
You're working in a DNA laboratory and you have to reconstruct a gene's sequence from a series of fragments!
Formal Input Description
You'll be given a series of DNA sequence fragments, which include overlaps with neighbor sequences, but not in any specific order - it's random. Your job is to read in the fragments and reconstruct the input DNA sequence that lead to the fragments.
Formal Output Description
Your program should emit the DNA sequence it calculated.
Sample Input
You'll be given the DNA sequence of one of the strands of DNA (e.g. no complementarity calculations or inferences required) using the DNA alphabet of "a,t,c,g". Assume no read errors, also. Example:
tgca
taggcta
gtcatgcttaggcta
agcatgctgcag
tcatgct
Sample Output
Your program should emit the shortest DNA sequence that would contain the above fragments. Example:
agcatgctgcagtcatgcttaggcta
Challenge Input
gatccatctggatcctatagttcatggaaagccgctgc
tatttcaacattaattgttggttttgatacagatggtacacca
aaaagaaattcaaaaagaacaagaaaaatctgaaaaacaacaaaa
ggaatgtcaatcctatagattaactgttgaagattcaccatcagttg
tggaaataaaaatattgaaattgcagtcattagaaataaacaac
tcaagtagaatatgccatggaagcagtaagaaaaggtactgttg
tgcaagatcaattagaaaaatcgttaaattagatgaccacatt
tgtcgttgaagctgaaaaagaaattcaaaaagaacaagaaaaatct
gaaaaacaacaaaaataaattacatcaaattccttttttt
caatcgttttattagatgaacaagaaattgataaattagttgc
aatctttatcaaactgatccatctggatcctatagttcatg
gaaattgcagtcattagaaataaacaaccaatcgttttattagatg
atcgttaaattagatgaccacatttgtttaacctttgctggt
aattatacagacgttagtgaagaggaatcaattaaattagcagtta
tatactcaaagtggtggtgttagaccatttggtatttcaacattaat
ttttaggtgttgaaaagaaagcaaccgctaaacttcaaga
aagaaagcaaccgctaaacttcaagatgcaagatcaattagaaaa
ccccacctttttttttaattatcttcaagtttttttaaaaaaaaaaaaaaaa
gaatttttagaaaagaattatacagacgttagtgaagaggaatc
agtgcaagatacgatagagcaattacagttttctcaccagatg
aattaaattagcagttagagctttattagagattgttgaaag
cagttggtgtacgtggtaaagatgttattgttttaggtgttgaa
ttcaacaacgttatactcaaagtggtggtgttagaccatttgg
ataaattacatcaaattcctttttttccccacctttttttt
aattggtcgtagttcaaagagtgttggtgaatttttagaaaag
aatatatttctaaatttattgctggtattcaacaacgt
aacaagaaattgataaattagttgctgtcgttgaagctga
gagctttattagagattgttgaaagtggaaataaaaatatt
ttaactgccgattcacgtgtattaattagtaaagcattaat
acgatagagcaattacagttttctcaccagatggtcatctttt
aaggtactgttgcagttggtgtacgtggtaaagatgttattg
tgtttaacctttgctggtttaactgccgattcacgtgtattaatt
aataatataatatatatataaatacataataatgtcaagtgcaagat
agtaaagcattaatggaatgtcaatcctatagattaactgt
tgaagattcaccatcagttgaatatatttctaaatttattgctggta
gaaagccgctgcaattggtcgtagttcaaagagtgttggt
gtcatctttttcaagtagaatatgccatggaagcagtaagaa
tgttggttttgatacagatggtacaccaaatctttatcaaact
Challenge Input Solution
aataatataatatatatataaatacataataatgtcaagtgcaagatacgatagagcaattacagttttctcaccagatggtcatctttttcaagtagaatatgccatggaagcagtaagaaaaggtactgttgcagttggtgtacgtggtaaagatgttattgttttaggtgttgaaaagaaagcaaccgctaaacttcaagatgcaagatcaattagaaaaatcgttaaattagatgaccacatttgtttaacctttgctggtttaactgccgattcacgtgtattaattagtaaagcattaatggaatgtcaatcctatagattaactgttgaagattcaccatcagttgaatatatttctaaatttattgctggtattcaacaacgttatactcaaagtggtggtgttagaccatttggtatttcaacattaattgttggttttgatacagatggtacaccaaatctttatcaaactgatccatctggatcctatagttcatggaaagccgctgcaattggtcgtagttcaaagagtgttggtgaatttttagaaaagaattatacagacgttagtgaagaggaatcaattaaattagcagttagagctttattagagattgttgaaagtggaaataaaaatattgaaattgcagtcattagaaataaacaaccaatcgttttattagatgaacaagaaattgataaattagttgctgtcgttgaagctgaaaaagaaattcaaaaagaacaagaaaaatctgaaaaacaacaaaaataaattacatcaaattcctttttttccccacctttttttttaattatcttcaagtttttttaaaaaaaaaaaaaaaa
Solution
in C++
#include <iostream>
#include <fstream>
#include <vector>
#include <algorithm>
#include<array>
bool check(std::string s, std::vector<std::string> r) {
for (auto str: r)
if (s.find(str) == std::string::npos) {
std::cout << str << " not found in " << s << std::endl;
return false;
}
return true;
}
/* Returns pair of number x,y
* x - length of common edge in the end of S and begiinning of T
* y - length of common edge in the end of T and begiinning of S
* based on longest common substring algorithm
* */
std::pair<size_t, size_t> LCSubstr(std::string S, std::string T) {
size_t m = S.length(), n = T.length();
std::vector<std::vector<size_t>> L(m);
for (auto &t : L)
t.resize(n);
std::pair<size_t, size_t> x{0, 0};
for (size_t i = 0; i < m; ++i)
for (size_t j = 0; j < n; ++j) {
if (S[i] == T[j]) {
if (i == 0 || j == 0)
L[i][j] = 1;
else
L[i][j] = L[i - 1][j - 1] + 1;
if (i == m - 1 && j == L[i][j] - 1 && L[i][j] > x.first) {
x.first = L[i][j];
}
if (j == n - 1 && i == L[i][j] - 1 && L[i][j] > x.second)
x.second = L[i][j];
} else
L[i][j] = 0;
}
return x;
}
int main() {
std::vector<std::string> reads, r2;
std::ifstream t;
t.open("dna.txt");
while (t) {
std::string buffer;
std::getline(t, buffer);
if (buffer == "")
continue;
reads.push_back(buffer);
}
t.close();
r2 = reads;
//remove substrings
std::sort(reads.begin(), reads.end(), [](std::string a, std::string b) { return a.length() > b.length(); });
for (size_t i = 0; i < reads.size(); ++i)
for (size_t j = i + 1; j < reads.size(); ++j) {
if (reads[i].find(reads[j]) != std::string::npos) {
reads.erase(reads.begin() + j);
--j;
}
}
std::vector<std::vector<std::pair<size_t, size_t>>> lengths(reads.size());
for (auto &l : lengths)
l.resize(reads.size());
//fill data about common edges lengths
for (size_t i = 0; i < lengths.size(); ++i)
for (size_t j = i + 1; j < lengths[i].size(); ++j)
lengths[i][j] = LCSubstr(reads[i], reads[j]);
while (reads.size() > 1) {
size_t x = 0, y = 1;
bool first = true;
size_t len = lengths[0][1].first;
//find longest edge
for (size_t i = 0; i < lengths.size(); ++i)
for (size_t j = i + 1; j < lengths[i].size(); ++j) {
if (lengths[i][j].first > len) {
len = lengths[i][j].first;
x = i;
y = j;
first = true;
}
if (lengths[i][j].second > len) {
len = lengths[i][j].second;
x = i;
y = j;
first = false;
}
}
//join strings and information
if (first) {
reads[x] += reads[y].substr(len);
for (size_t i = 0; i < lengths[x].size(); ++i)
lengths[x][i].first = lengths[y][i].first;
for (auto &l : lengths)
l[x].second = l[y].second;
} else {
reads[x] = reads[y] + reads[x].substr(len);
for (size_t i = 0; i < lengths[x].size(); ++i)
lengths[x][i].second = lengths[y][i].second;
for (auto &l : lengths)
l[x].first = l[y].first;
}
reads.erase(reads.begin() + y);
lengths.erase(lengths.begin() + y);
for (auto &l : lengths)
l.erase(l.begin() + y);
}
std::cout << reads[0] << std::endl;
std::cout << check(reads[0], r2);
}
DNA sequences are made up of a 4 character alphabet - A, C, T or G, that describe the nucleotide bases in a gene sequence. To ascertain the sequence of DNA, scientists use chemical methods to identify the component nucleotides in a method called DNA sequencing. DNA shotgun sequencing is a method whereby DNA subsequences of the same larger sequence are produced at massive parallel scale by DNA sequencing methods, and the overlap between segments is used to reconstruct the input gene. This is a fast and accurate method, and is dropping in price. Shotgun sequencing was used to perform the first entire sequence of a human's DNA, for example. For additional background information, see Wikipedia on shotgun sequencing.
You're working in a DNA laboratory and you have to reconstruct a gene's sequence from a series of fragments!
Formal Input Description
You'll be given a series of DNA sequence fragments, which include overlaps with neighbor sequences, but not in any specific order - it's random. Your job is to read in the fragments and reconstruct the input DNA sequence that lead to the fragments.
Formal Output Description
Your program should emit the DNA sequence it calculated.
Sample Input
You'll be given the DNA sequence of one of the strands of DNA (e.g. no complementarity calculations or inferences required) using the DNA alphabet of "a,t,c,g". Assume no read errors, also. Example:
tgca
taggcta
gtcatgcttaggcta
agcatgctgcag
tcatgct
Sample Output
Your program should emit the shortest DNA sequence that would contain the above fragments. Example:
agcatgctgcagtcatgcttaggcta
Challenge Input
gatccatctggatcctatagttcatggaaagccgctgc
tatttcaacattaattgttggttttgatacagatggtacacca
aaaagaaattcaaaaagaacaagaaaaatctgaaaaacaacaaaa
ggaatgtcaatcctatagattaactgttgaagattcaccatcagttg
tggaaataaaaatattgaaattgcagtcattagaaataaacaac
tcaagtagaatatgccatggaagcagtaagaaaaggtactgttg
tgcaagatcaattagaaaaatcgttaaattagatgaccacatt
tgtcgttgaagctgaaaaagaaattcaaaaagaacaagaaaaatct
gaaaaacaacaaaaataaattacatcaaattccttttttt
caatcgttttattagatgaacaagaaattgataaattagttgc
aatctttatcaaactgatccatctggatcctatagttcatg
gaaattgcagtcattagaaataaacaaccaatcgttttattagatg
atcgttaaattagatgaccacatttgtttaacctttgctggt
aattatacagacgttagtgaagaggaatcaattaaattagcagtta
tatactcaaagtggtggtgttagaccatttggtatttcaacattaat
ttttaggtgttgaaaagaaagcaaccgctaaacttcaaga
aagaaagcaaccgctaaacttcaagatgcaagatcaattagaaaa
ccccacctttttttttaattatcttcaagtttttttaaaaaaaaaaaaaaaa
gaatttttagaaaagaattatacagacgttagtgaagaggaatc
agtgcaagatacgatagagcaattacagttttctcaccagatg
aattaaattagcagttagagctttattagagattgttgaaag
cagttggtgtacgtggtaaagatgttattgttttaggtgttgaa
ttcaacaacgttatactcaaagtggtggtgttagaccatttgg
ataaattacatcaaattcctttttttccccacctttttttt
aattggtcgtagttcaaagagtgttggtgaatttttagaaaag
aatatatttctaaatttattgctggtattcaacaacgt
aacaagaaattgataaattagttgctgtcgttgaagctga
gagctttattagagattgttgaaagtggaaataaaaatatt
ttaactgccgattcacgtgtattaattagtaaagcattaat
acgatagagcaattacagttttctcaccagatggtcatctttt
aaggtactgttgcagttggtgtacgtggtaaagatgttattg
tgtttaacctttgctggtttaactgccgattcacgtgtattaatt
aataatataatatatatataaatacataataatgtcaagtgcaagat
agtaaagcattaatggaatgtcaatcctatagattaactgt
tgaagattcaccatcagttgaatatatttctaaatttattgctggta
gaaagccgctgcaattggtcgtagttcaaagagtgttggt
gtcatctttttcaagtagaatatgccatggaagcagtaagaa
tgttggttttgatacagatggtacaccaaatctttatcaaact
Challenge Input Solution
aataatataatatatatataaatacataataatgtcaagtgcaagatacgatagagcaattacagttttctcaccagatggtcatctttttcaagtagaatatgccatggaagcagtaagaaaaggtactgttgcagttggtgtacgtggtaaagatgttattgttttaggtgttgaaaagaaagcaaccgctaaacttcaagatgcaagatcaattagaaaaatcgttaaattagatgaccacatttgtttaacctttgctggtttaactgccgattcacgtgtattaattagtaaagcattaatggaatgtcaatcctatagattaactgttgaagattcaccatcagttgaatatatttctaaatttattgctggtattcaacaacgttatactcaaagtggtggtgttagaccatttggtatttcaacattaattgttggttttgatacagatggtacaccaaatctttatcaaactgatccatctggatcctatagttcatggaaagccgctgcaattggtcgtagttcaaagagtgttggtgaatttttagaaaagaattatacagacgttagtgaagaggaatcaattaaattagcagttagagctttattagagattgttgaaagtggaaataaaaatattgaaattgcagtcattagaaataaacaaccaatcgttttattagatgaacaagaaattgataaattagttgctgtcgttgaagctgaaaaagaaattcaaaaagaacaagaaaaatctgaaaaacaacaaaaataaattacatcaaattcctttttttccccacctttttttttaattatcttcaagtttttttaaaaaaaaaaaaaaaa
Solution
in C++
#include <iostream>
#include <fstream>
#include <vector>
#include <algorithm>
#include<array>
bool check(std::string s, std::vector<std::string> r) {
for (auto str: r)
if (s.find(str) == std::string::npos) {
std::cout << str << " not found in " << s << std::endl;
return false;
}
return true;
}
/* Returns pair of number x,y
* x - length of common edge in the end of S and begiinning of T
* y - length of common edge in the end of T and begiinning of S
* based on longest common substring algorithm
* */
std::pair<size_t, size_t> LCSubstr(std::string S, std::string T) {
size_t m = S.length(), n = T.length();
std::vector<std::vector<size_t>> L(m);
for (auto &t : L)
t.resize(n);
std::pair<size_t, size_t> x{0, 0};
for (size_t i = 0; i < m; ++i)
for (size_t j = 0; j < n; ++j) {
if (S[i] == T[j]) {
if (i == 0 || j == 0)
L[i][j] = 1;
else
L[i][j] = L[i - 1][j - 1] + 1;
if (i == m - 1 && j == L[i][j] - 1 && L[i][j] > x.first) {
x.first = L[i][j];
}
if (j == n - 1 && i == L[i][j] - 1 && L[i][j] > x.second)
x.second = L[i][j];
} else
L[i][j] = 0;
}
return x;
}
int main() {
std::vector<std::string> reads, r2;
std::ifstream t;
t.open("dna.txt");
while (t) {
std::string buffer;
std::getline(t, buffer);
if (buffer == "")
continue;
reads.push_back(buffer);
}
t.close();
r2 = reads;
//remove substrings
std::sort(reads.begin(), reads.end(), [](std::string a, std::string b) { return a.length() > b.length(); });
for (size_t i = 0; i < reads.size(); ++i)
for (size_t j = i + 1; j < reads.size(); ++j) {
if (reads[i].find(reads[j]) != std::string::npos) {
reads.erase(reads.begin() + j);
--j;
}
}
std::vector<std::vector<std::pair<size_t, size_t>>> lengths(reads.size());
for (auto &l : lengths)
l.resize(reads.size());
//fill data about common edges lengths
for (size_t i = 0; i < lengths.size(); ++i)
for (size_t j = i + 1; j < lengths[i].size(); ++j)
lengths[i][j] = LCSubstr(reads[i], reads[j]);
while (reads.size() > 1) {
size_t x = 0, y = 1;
bool first = true;
size_t len = lengths[0][1].first;
//find longest edge
for (size_t i = 0; i < lengths.size(); ++i)
for (size_t j = i + 1; j < lengths[i].size(); ++j) {
if (lengths[i][j].first > len) {
len = lengths[i][j].first;
x = i;
y = j;
first = true;
}
if (lengths[i][j].second > len) {
len = lengths[i][j].second;
x = i;
y = j;
first = false;
}
}
//join strings and information
if (first) {
reads[x] += reads[y].substr(len);
for (size_t i = 0; i < lengths[x].size(); ++i)
lengths[x][i].first = lengths[y][i].first;
for (auto &l : lengths)
l[x].second = l[y].second;
} else {
reads[x] = reads[y] + reads[x].substr(len);
for (size_t i = 0; i < lengths[x].size(); ++i)
lengths[x][i].second = lengths[y][i].second;
for (auto &l : lengths)
l[x].first = l[y].first;
}
reads.erase(reads.begin() + y);
lengths.erase(lengths.begin() + y);
for (auto &l : lengths)
l.erase(l.begin() + y);
}
std::cout << reads[0] << std::endl;
std::cout << check(reads[0], r2);
}
Comments
Post a Comment