So I had to write a Perl script that finds all the k-mers of length 15 in the 2L chromosome of Drosophila melanogaster, loads them into a hash, and counts the number of occurrences of each k-mer. A k-mer is a sequence of length k taken from a longer sequence. I need to loop through the hash and print each k-mer on a line, followed by a tab, then the number of occurrences of that k-mer. Then I have to open a file handle for writing output to uniqueKmersEndingGG.fasta, change the window length from 15 to 23, go through the hash of k-mers, print only the first 1000 that occur and end with GG, and put a FASTA header before each k-mer.
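The counting step described above can be sketched roughly as follows. This is only a minimal sketch, assuming the chromosome has already been read into a single string with the FASTA header line removed; the sub name count_kmers and the toy sequence are made up for illustration and are not part of the assignment.

```perl
use strict;
use warnings;

# Count every k-mer (substring of length $k) in $sequence using a sliding
# window that moves one base at a time. Returns a hash reference that maps
# each k-mer to its number of occurrences.
sub count_kmers {
    my ($sequence, $k) = @_;
    my %count;
    my $last_start = length($sequence) - $k;    # last valid window start
    for my $i (0 .. $last_start) {               # empty range if sequence is shorter than $k
        my $kmer = substr($sequence, $i, $k);
        $count{$kmer}++;
    }
    return \%count;
}

# Toy sequence standing in for chromosome 2L (assumption: the real input
# would come from a FASTA file, header removed, lines joined together).
my $sequence = 'ACGTACGTACGTACGTACGG';
my $counts   = count_kmers($sequence, 15);

# One k-mer per line, followed by a tab, then its count.
for my $kmer (sort keys %$counts) {
    print "$kmer\t$counts->{$kmer}\n";
}
```

Sorting the keys is optional; it just makes the output order repeatable, since Perl does not keep hash keys in insertion order.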
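For the FASTA output step, something along these lines might work. It is a hedged sketch, not the assignment's reference solution: the sub name write_kmers_ending_gg, the ">kmer_N" header scheme, and the three example 23-mers are all assumptions for illustration. Because a Perl hash has no defined order, "the first 1000" is taken here to mean the first 1000 in sorted key order.

```perl
use strict;
use warnings;

# Write up to $limit k-mers that end in "GG" to $out_fh in FASTA format.
# $kmers is an array reference giving the order in which to consider them.
# Returns the number of records written.
sub write_kmers_ending_gg {
    my ($out_fh, $kmers, $limit) = @_;
    my $written = 0;
    for my $kmer (@$kmers) {
        last if $written >= $limit;
        next unless $kmer =~ /GG\z/;     # keep only k-mers ending in GG
        $written++;
        # Header text ">kmer_N" is a made-up naming scheme (assumption).
        print {$out_fh} ">kmer_$written\n$kmer\n";
    }
    return $written;
}

# Hypothetical hash of 23-mers and counts, standing in for the real hash
# built from chromosome 2L with a window length of 23.
my %count = (
    ('ACGT' x 5) . 'AGG' => 1,
    'T' x 23             => 4,
    ('C' x 21) . 'GG'    => 2,
);

open(my $out, '>', 'uniqueKmersEndingGG.fasta')
    or die "Cannot open uniqueKmersEndingGG.fasta: $!";
my $n = write_kmers_ending_gg($out, [sort keys %count], 1000);
close($out) or die "Cannot close uniqueKmersEndingGG.fasta: $!";
print "Wrote $n k-mers\n";
```

If "unique" in the file name is meant literally, one possible reading is to also skip any k-mer whose count is greater than 1; the task description does not say so explicitly, so that check is left out here.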
Vocab:
· Drosophila melanogaster: fruit fly
· 2L chromosome of Drosophila melanogaster
· k-mer of length 15
· FASTA
· K-mer length 15 end with GG 2L chromosome of Drosophila melanogaster
Perl References
· http://stackoverflow.com/questions/5948360/perl-read-a-file-into-an-array
· http://www.perlmonks.org/?node_id=73439
Biology References
· http://flybase.org/reports/FBsp00000001.html
· https://www.biostars.org/p/16396/
· mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'select chrom,size from chromInfo limit 5'
· The sequences are long strings: http://genome.ucsc.edu/cgi-bin/hgTracks?db=dm3&chromInfoPage=
· http://blast.ncbi.nlm.nih.gov/Blast.cgi
· FASTA format: http://prodata.swmed.edu/promals/info/fasta_format_file_example.htm
· http://en.wikipedia.org/wiki/FASTA_format
· http://code.izzid.com/2011/10/13/How-to-write-a-fasta-file-in-perl.html
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'select chrom,size from chromInfo limit 5'
--defaults-file=#       Only read default options from the given file #.
--defaults-extra-file=# Read this file after the global files are read.
--defaults-group-suffix=#
                        Also read groups with concat(group, suffix)
--login-path=#          Read this path from the login file.
Variables (--variable-name=value)
and boolean options {FALSE|TRUE}  Value (after reading options)
--------------------------------- ----------------------------------------
auto-rehash TRUE
auto-vertical-output FALSE
bind-address (No default value)
character-sets-dir (No default value)
column-type-info FALSE
comments FALSE
compress FALSE
debug-check FALSE
debug-info FALSE
database hg19
default-character-set auto
delimiter ;
enable-cleartext-plugin FALSE
vertical FALSE
force FALSE
named-commands FALSE
ignore-spaces FALSE
init-command (No default value)
local-infile FALSE
no-beep FALSE
host genome-mysql.cse.ucsc.edu
html FALSE
xml FALSE
line-numbers TRUE
unbuffered FALSE
column-names TRUE
sigint-ignore FALSE
port 0
prompt mysql>
quick FALSE
raw FALSE
reconnect FALSE
shared-memory-base-name (No default value)
socket (No default value)
ssl FALSE
ssl-ca (No default value)
ssl-capath (No default value)
ssl-cert (No default value)
ssl-cipher (No default value)
ssl-key (No default value)
ssl-crl (No default value)
ssl-crlpath (No default value)
ssl-verify-server-cert FALSE
table FALSE
user genome
safe-updates FALSE
i-am-a-dummy FALSE
connect-timeout 0
max-allowed-packet 16777216
net-buffer-length 16384
select-limit 1000
max-join-size 1000000
secure-auth TRUE
show-warnings FALSE
plugin-dir (No default value)
default-auth (No default value)
histignore (No default value)
binary-mode FALSE
connect-expired-password FALSE
C:\Program Files\MySQL\MySQL Workbench CE 6.1.6>mysql --user=genome --host=genome-mysql.cse.ucsc.edu -D hg19
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 14577255
Server version: 5.6.10-log MySQL Community Server (GPL)
Copyright (c) 2000, 2014, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> describe chromeInfo
    -> ;
ERROR 1146 (42S02): Table 'hg19.chromeInfo' doesn't exist
mysql> describe chromInfo
    -> ;
+----------+------------------+------+-----+---------+-------+
| Field    | Type             | Null | Key | Default | Extra |
+----------+------------------+------+-----+---------+-------+
| chrom    | varchar(255)     | NO   | PRI |         |       |
| size     | int(10) unsigned | NO   |     | 0       |       |
| fileName | varchar(255)     | YES  |     | NULL    |       |
+----------+------------------+------+-----+---------+-------+
3 rows in set (0.10 sec)
mysql> select * from chromInfo;
+-----------------------+-----------+----------------------+
| chrom                 | size      | fileName             |
+-----------------------+-----------+----------------------+
| chr1                  | 249250621 | /gbdb/hg19/hg19.2bit |
| chr2                  | 243199373 | /gbdb/hg19/hg19.2bit |
| chr3                  | 198022430 | /gbdb/hg19/hg19.2bit |
| chr4                  | 191154276 | /gbdb/hg19/hg19.2bit |
| chr5                  | 180915260 | /gbdb/hg19/hg19.2bit |
| chr6                  | 171115067 | /gbdb/hg19/hg19.2bit |
| chr7                  | 159138663 | /gbdb/hg19/hg19.2bit |
| chrX                  | 155270560 | /gbdb/hg19/hg19.2bit |
| chr8                  | 146364022 | /gbdb/hg19/hg19.2bit |
| chr9                  | 141213431 | /gbdb/hg19/hg19.2bit |
| chr10                 | 135534747 | /gbdb/hg19/hg19.2bit |
| chr11                 | 135006516 | /gbdb/hg19/hg19.2bit |
| chr12                 | 133851895 | /gbdb/hg19/hg19.2bit |
| chr13                 | 115169878 | /gbdb/hg19/hg19.2bit |
| chr14                 | 107349540 | /gbdb/hg19/hg19.2bit |
| chr15                 | 102531392 | /gbdb/hg19/hg19.2bit |
| chr16                 |  90354753 | /gbdb/hg19/hg19.2bit |
| chr17                 |  81195210 | /gbdb/hg19/hg19.2bit |
| chr18                 |  78077248 | /gbdb/hg19/hg19.2bit |
| chr20                 |  63025520 | /gbdb/hg19/hg19.2bit |
| chrY                  |  59373566 | /gbdb/hg19/hg19.2bit |
| chr19                 |  59128983 | /gbdb/hg19/hg19.2bit |
| chr22                 |  51304566 | /gbdb/hg19/hg19.2bit |
| chr21                 |  48129895 | /gbdb/hg19/hg19.2bit |
| chr6_ssto_hap7        |   4928567 | /gbdb/hg19/hg19.2bit |
| chr6_mcf_hap5         |   4833398 | /gbdb/hg19/hg19.2bit |
| chr6_cox_hap2         |   4795371 | /gbdb/hg19/hg19.2bit |
| chr6_mann_hap4        |   4683263 | /gbdb/hg19/hg19.2bit |
| chr6_apd_hap1         |   4622290 | /gbdb/hg19/hg19.2bit |
| chr6_qbl_hap6         |   4611984 | /gbdb/hg19/hg19.2bit |
| chr6_dbb_hap3         |   4610396 | /gbdb/hg19/hg19.2bit |
| chr17_ctg5_hap1       |   1680828 | /gbdb/hg19/hg19.2bit |
| chr4_ctg9_hap1        |    590426 | /gbdb/hg19/hg19.2bit |
| chr1_gl000192_random  |    547496 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000225        |    211173 | /gbdb/hg19/hg19.2bit |
| chr4_gl000194_random  |    191469 | /gbdb/hg19/hg19.2bit |
| chr4_gl000193_random  |    189789 | /gbdb/hg19/hg19.2bit |
| chr9_gl000200_random  |    187035 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000222        |    186861 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000212        |    186858 | /gbdb/hg19/hg19.2bit |
| chr7_gl000195_random  |    182896 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000223        |    180455 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000224        |    179693 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000219        |    179198 | /gbdb/hg19/hg19.2bit |
| chr17_gl000205_random |    174588 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000215        |    172545 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000216        |    172294 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000217        |    172149 | /gbdb/hg19/hg19.2bit |
| chr9_gl000199_random  |    169874 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000211        |    166566 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000213        |    164239 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000220        |    161802 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000218        |    161147 | /gbdb/hg19/hg19.2bit |
| chr19_gl000209_random |    159169 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000221        |    155397 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000214        |    137718 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000228        |    129120 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000227        |    128374 | /gbdb/hg19/hg19.2bit |
| chr1_gl000191_random  |    106433 | /gbdb/hg19/hg19.2bit |
| chr19_gl000208_random |     92689 | /gbdb/hg19/hg19.2bit |
| chr9_gl000198_random  |     90085 | /gbdb/hg19/hg19.2bit |
| chr17_gl000204_random |     81310 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000233        |     45941 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000237        |     45867 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000230        |     43691 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000242        |     43523 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000243        |     43341 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000241        |     42152 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000236        |     41934 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000240        |     41933 | /gbdb/hg19/hg19.2bit |
| chr17_gl000206_random |     41001 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000232        |     40652 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000234        |     40531 | /gbdb/hg19/hg19.2bit |
| chr11_gl000202_random |     40103 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000238        |     39939 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000244        |     39929 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000248        |     39786 | /gbdb/hg19/hg19.2bit |
| chr8_gl000196_random  |     38914 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000249        |     38502 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000246        |     38154 | /gbdb/hg19/hg19.2bit |
| chr17_gl000203_random |     37498 | /gbdb/hg19/hg19.2bit |
| chr8_gl000197_random  |     37175 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000245        |     36651 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000247        |     36422 | /gbdb/hg19/hg19.2bit |
| chr9_gl000201_random  |     36148 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000235        |     34474 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000239        |     33824 | /gbdb/hg19/hg19.2bit |
| chr21_gl000210_random |     27682 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000231        |     27386 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000229        |     19913 | /gbdb/hg19/hg19.2bit |
| chrM                  |     16571 | /gbdb/hg19/hg19.2bit |
| chrUn_gl000226        |     15008 | /gbdb/hg19/hg19.2bit |
| chr18_gl000207_random |      4262 | /gbdb/hg19/hg19.2bit |
+-----------------------+-----------+----------------------+
93 rows in set (0.10 sec)