***************************************************************************** * SUBROUTINE READSEQ 04-17-00 * ***************************************************************************** * Author: Alex Ropelewski, Pittsburgh Supercomputing Center (PSC) * * This is a rewrite of the classic readsq routine used at the PSC. * ***************************************************************************** * (C)OPYRIGHT 1991, PITTSBURGH SUPERCOMPUTING CENTER * * 2000 Mellon Institute Building * * 4400 Fifth Avenue * * Pittsburgh, PA 15213 * * (412) 268-4960 * ***************************************************************************** * This routine will read in a sequence in NBRF, EMBL, GENBANK, MOLGEN, GCG, * * or FASTA file formats. (The file type must be passed to this routine. * * Also The sequence file must be opened to allow read access. * ***************************************************************************** SUBROUTINE READSEQ(ERROR,UNT,FILFMT,ISEQ,IDIM, + CSEQ,CDIM,LOCUS,ACCNUM,DEFLIN,ALPH,LENGTH) IMPLICIT NONE INTEGER ERROR !--> *OUT* THE ERROR FLAG (SEE CODES BELOW). INTEGER UNT !--> *IN* THE READ UNIT INTEGER FILFMT !--> *IN* THE SEQUENCE FILE FORMAT INTEGER IDIM !--> *IN* THE DIMENSION OF ARRAY ISEQ INTEGER ISEQ(IDIM) !--> *OUT* THE SEQUENCE (INTEGER NUCLEOTIDE CODES) INTEGER CDIM !--> *IN* THE DIMENSION OF ARRAY CSEQ CHARACTER*1 CSEQ(CDIM) !--> *OUT* THE SEQUENCE (CHAR NUCLEOTIDE CODES) INTEGER ALPH(0:127) !--> *OUT* THE ALPHABET TO BE USED. CHARACTER*(*) LOCUS !--> *INOUT* THE LOCUS OR SEQID CHARACTER*(*) ACCNUM !--> *INOUT* THE ACCESSION NUMBER CHARACTER*(*) DEFLIN !--> *OUT* THE DEFINITION LINE INTEGER LENGTH !--> *OUT* THE SEQUENCES LENGTH. integer type,i,j **************************************************************************** * CODE DEFINITIONS: * *---- ERROR=0 -> NO ERROR *---- ERROR=1 -> INVALID MODE UTILIZED *---- ERROR=2 -> INVALID FILE FORMAT UTILIZED *---- ERROR=3 -> EOF DETECTED ON READ *---- ERROR=4 -> INVALAD FILE FORMAT FOR ACCECSSION NUMBER FIND. ****************************************************************************** * Local Vars and parameters * ***************************** LOGICAL EOF INTEGER UNKNOWN, GENBNK, EMBL, NBRF, PROFIL, GCG, FASTA PARAMETER ( unknown=0, GENBNK=1, EMBL=2, NBRF=3 ) PARAMETER (PROFIL=4, GCG=5, FASTA=6) LOGICAL SIMEOF(2) data simeof /.FALSE.,.FALSE./ SAVE SIMEOF EOF=.FALSE. ERROR=0 type=1 DO 10 I=1,len(LOCUS) LOCUS(i:i)=' ' 10 CONTINUE DO 11 I=1,len(ACCNUM) ACCNUM(i:i)=' ' 11 CONTINUE DO 12 I=1,len(DEFLIN) DEFLIN(i:i)=' ' 12 CONTINUE IF(FILFMT.EQ.GENBNK) THEN CALL READGB(UNT,LENGTH,IDIM,ISEQ,CSEQ,ACCNUM, + LOCUS,DEFLIN,EOF,ALPH) IF(EOF) ERROR=3 ELSEIF(FILFMT.EQ.EMBL) THEN CALL RDEMBL(UNT,IDIM,LENGTH,DEFLIN, + LOCUS,ACCNUM,ISEQ,CSEQ,EOF,ALPH) IF(EOF) ERROR=3 ELSEIF(FILFMT.EQ.GCG) THEN IF (SIMEOF(TYPE)) THEN ERROR=3 SIMEOF(TYPE)=.NOT.SIMEOF(TYPE) ELSE CALL RDGCG(UNT,LENGTH,IDIM,ISEQ,CSEQ,ACCNUM, + LOCUS,DEFLIN,EOF,ALPH) IF (EOF) THEN SIMEOF(TYPE)=.TRUE. IF(LENGTH.LE.0) ERROR=3 ENDIF ENDIF ELSEIF(FILFMT.EQ.NBRF) THEN CALL NBRFSQ(UNT,LENGTH,IDIM,ISEQ,CSEQ,LOCUS,DEFLIN, + EOF,ALPH) ACCNUM = LOCUS IF(EOF) ERROR=3 ELSEIF(FILFMT.EQ.FASTA) THEN CALL RFASTA(UNT,LENGTH,IDIM,ISEQ,CSEQ,LOCUS,DEFLIN, + EOF,ALPH ) ACCNUM = LOCUS IF(EOF) ERROR=3 ELSE ERROR=2 ENDIF ************************************ * print*, '--->in readseq' * print*, 'locus=',locus * print*, 'accnum=',accnum * print*, 'length=',length * print*, 'deflin=',deflin(1:60) * print*, 'seq=' * print*, (iseq(i),i=1,10) * print*, 'lseq=' * print*, (cseq(i),i=1,10) ************************************ RETURN END SUBROUTINE WHATFM( SEQUNT, TERMNL, FLFMT, FLAG, HEAD, guessnap ) *************************************************************************** * COPYRIGHT - PITTSBURGH SUPERCOMPUTING CENTER Last modified: Aug. 1995 * *************************************************************************** * Written by: Hugh B. Nicholas December 1988 * * Modified by: Alex Ropelewski * *************************************************************************** C *** Subroutine WHATFM determines whether a sequence file is EMBL format C *** GenBank format, Genbank format without a header, NBRF, MOLGEN, C *** GCG Profile, GCG, FASTA or an undetermined format. C *** The unit number of the file to be checked C *** is passed to the subroutine in SEQUNT. FLAG returns 0 if no error C *** occoured, negative numbers for warnings, positive numbers for fatal C *** errors. HEAD should be true if one wants the GENBANK header C *** information written to unit TERMNL. The file format is C *** returned as an integer value in the variable FLFMT. C C FLFMT = 0 Undetermined file format, not GenBank or EMBL C FLFMT = 1 GenBank file format C FLFMT = 2 EMBL file format C FLFMT = 3 NBRF file format C FLFMT = 4 MolGen/Stanford file format C FLFMT = 5 PROFILE file format C FLFMT = 6 GCG file format C FLFMT = 7 FASTA file format C * FLAG = 0 No error occoured * FLAG = -1 *WARNING* could not find GenBank header info * FLAG = -2 *WARNING* Undetermined file type * FLAG = 1 *FATAL* READ error detected. * FLAG = 2 *FATAL* WRITE error detected. * FLAG = 3 *FATAL* REWIND error detected. * IMPLICIT NONE C C *** Passed variables C INTEGER SEQUNT, TERMNL, FLFMT, FLAG, guessnap LOGICAL HEAD C C *** Local Variables and constants C INTEGER NA, PROT PARAMETER (NA=1, PROT=2) INTEGER UNKNOWN, GENBNK, EMBL, NBRF, PROFIL, GCG, FASTA PARAMETER ( unknown=0, GENBNK=1, EMBL=2, NBRF=3 ) PARAMETER (PROFIL=4, GCG=5, FASTA=6) INTEGER I, HEADER logical star CHARACTER*12 DBANK( 0:7 ) CHARACTER*70 GBHEAD( 9 ) CHARACTER*132 LINE, TEST C DATA DBANK / 'Undetermined', 'GenBank ', 'EMBL ', * 'NBRF ', 'MolGen ', 'PROFIL ', * 'GCG ', 'FASTA ' / C EXTERNAL ALLCAP C 1 FORMAT ( A132 ) 2 FORMAT ( //' The format of sequence data file is: ', A12 ) 3 FORMAT ( /' GenBank file header information',/) 4 FORMAT ( ' ', A70 ) 5 FORMAT ( /' The header information was not present in the', * ' GenBank sequence file.' ) C C FLAG=0 READ( SEQUNT, 1, ERR=999 ) LINE TEST = LINE CALL ALLCAP( TEST, LEN(TEST) ) HEADER = 0 C C STAR=.FALSE. IF( TEST(1:2) .EQ. 'ID' ) THEN FLFMT = EMBL 110 CONTINUE READ( SEQUNT, 1, ERR=999 ) TEST CALL ALLCAP( TEST, LEN(TEST) ) IF (TEST(1:2).NE.'SQ') goto 110 CALL TESTLINE(TEST,GUESSNAP) ELSE IF( TEST(1:5) .EQ. 'LOCUS' ) THEN FLFMT = GENBNK GUESSNAP= NA C ELSE IF( TEST(20:45) .EQ. 'GENETIC SEQUENCE DATA BANK' ) THEN C Changed by AJR bercause NCBI can't get it straight! ELSE IF( index(TEST,'GENETIC SEQUENCE DATA BANK').NE.0) THEN FLFMT = GENBNK GUESSNAP= NA 200 HEADER = HEADER + 1 GBHEAD( HEADER ) = LINE( 1:70 ) READ( SEQUNT, 1,ERR=999 ) LINE CALL ALLCAP( LINE, 5 ) IF( LINE(1:5) .NE. 'LOCUS' .AND. HEADER .LT. 9 ) GO TO 200 ELSE IF(TEST(1:1) .EQ. '>') THEN C........First test for NBRF-PIR file format: IF((TEST(2:4) .EQ. 'P1;').OR. + (TEST(2:4) .EQ. 'F1;').OR. + (TEST(2:4) .EQ. 'DL;').OR. + (TEST(2:4) .EQ. 'DC;').OR. + (TEST(2:4) .EQ. 'RL;').OR. + (TEST(2:4) .EQ. 'RC;').OR. + (TEST(2:4) .EQ. 'N1;').OR. + (TEST(2:4) .EQ. 'N2;').OR. + (TEST(2:4) .EQ. 'N3;')) THEN C.............Assume NBRF-PIR FORMAT: IF((TEST(2:4) .EQ. 'P1;').OR.(TEST(2:4) .EQ. 'F1;')) THEN GUESSNAP= PROT ELSE GUESSNAP= NA ENDIF FLFMT=NBRF ELSE C.............Assume FASTA FORMAT: FLFMT=FASTA CALL TESTLINE(SEQUNT,GUESSNAP) ENDIF C---- Time to find out if profile or GCG format: ELSE FLFMT = unknown REWIND( SEQUNT,ERR=888 ) DO 1000 I=1,1000 READ( SEQUNT, 1, ERR=999 ) LINE TEST = LINE CALL ALLCAP( TEST, LEN(TEST) ) IF (INDEX(TEST,'..').NE.0) THEN IF ( (INDEX(TEST,'CONS').NE.0) .AND. + (INDEX(TEST,'GAP') .NE.0) .AND. + (INDEX(TEST,'LEN') .NE.0) ) THEN FLFMT=PROFIL GOTO 1010 ELSEIF ( (INDEX(TEST,'CHECK:').NE.0) .AND. + (INDEX(TEST,'TYPE:') .NE.0) .AND. + (INDEX(TEST,'LENGTH:') .NE.0) ) THEN FLFMT=GCG READ( SEQUNT, 1, ERR=999 ) LINE CALL TESTLINE(SEQUNT,GUESSNAP) GOTO 1010 ENDIF ENDIF 1000 CONTINUE FLAG=-2 1010 CONTINUE ENDIF REWIND( SEQUNT,ERR=888 ) C WRITE( TERMNL, 2, ERR=9999 ) DBANK( FLFMT ) IF( FLFMT .EQ. GENBNK .AND. HEADER .GT. 0 .AND. HEAD) THEN WRITE( TERMNL, 3, ERR=9999 ) WRITE( TERMNL, 4, ERR=9999 ) (GBHEAD( I ),I = 1, HEADER) ELSE IF( FLFMT .EQ. GENBNK .AND. HEADER .EQ. 0 ) THEN FLAG=-1 ELSE END IF C RETURN ************************ * READ ERROR OCCOURED * ************************ 999 FLAG=1 RETURN ************************ * WRITE ERROR OCCOURED * ************************ 9999 FLAG=2 RETURN ************************ * REWIND ERROR * ************************ 888 FLAG=3 RETURN END C************************************************************************ C TESTLINE -> Send a line of residues, It will determine if the sequence C is Nucleic Acid or Protein C************************************************************************ SUBROUTINE TESTLINE(SEQUNT,GUESSNAP) IMPLICIT NONE INTEGER SEQUNT,GUESSNAP INTEGER LASTCH EXTERNAL LASTCH CHARACTER*512 LINE INTEGER I,COUNT,A,C,G,T,X INTEGER NA, PROT PARAMETER (NA=1, PROT=2) READ(SEQUNT,FMT='(A)') LINE CALL ALLCAP(LINE,LEN(LINE)) COUNT=0 A=0 C=0 G=0 T=0 X=0 DO 100 I=1, LASTCH(LINE,LEN(LINE)) COUNT=COUNT+1 IF (LINE(I:I).EQ.'A') THEN A=A+1; ELSEIF (LINE(I:I).EQ.'C') THEN C=C+1; ELSEIF (LINE(I:I).EQ.'G') THEN G=G+1; ELSEIF (LINE(I:I).EQ.'T') THEN T=T+1; ELSEIF ((LINE(I:I).EQ.'X').OR.(LINE(I:I).EQ.'N')) THEN X=X+1; ELSEIF ((LINE(I:I).EQ.' ').OR.(LINE(I:I).EQ.CHAR(13))) THEN count=count-1 ELSEIF((LINE(I:I).GT.'0').AND.(LINE(I:I).LT.'9')) then count=count-1 ENDIF 100 Continue C print *, 'I=',i C print *, 'A=',A,' C=',C,' G=',G,' T=',T,' X=',X,' Count=',COUNT IF (A+C+G+T+X.EQ.COUNT) THEN GUESSNAP=NA ELSE GUESSNAP=PROT ENDIF C print *, 'IN Testline...Guessing ',guessnap,' Line is below:' C print *, Line RETURN END C C C C SUBROUTINE READGB( NFILE, LENGTH, MAXSQP, SEQ, LSEQ, * ACNUM, LOCUS, SEQDEF, FERTIG, nuc) C IMPLICIT NONE C C *** Passed Variable declarations C INTEGER TERMNL, NFILE, LENGTH, MAXSQP INTEGER SEQ( MAXSQP ) CHARACTER*(*) LSEQ( MAXSQP ) CHARACTER*(*) ACNUM CHARACTER*(*) LOCUS CHARACTER*(*) SEQDEF LOGICAL FERTIG INTEGER NUC( 0:127 ) C C *** Local variable declarations C INTEGER FIRST, I, J, K, LC, LS, NA, NDEND, ND, NP CHARACTER*90 LOCLN, ACC, ORIGIN, TERM, DEFLN, TEMP, TEXT integer itmp1, itmp2, itmp3 integer lastch external lastch CHARACTER*160 SPACES,LABEL CHARACTER*160 PROD CHARACTER*160 GENE CHARACTER*160 NOTE CHARACTER*30 NUMFMT,TMPNUM LOGICAL JEP,CEP,BNUM INTEGER TBEGIN,TEND,LBEGIN,PCNT,BNFT DATA SPACES( 1: 40)/' '/ DATA SPACES( 41: 80)/' '/ DATA SPACES( 81:120)/' '/ DATA SPACES(120:160)/' '/ Logical SAVEL CHARACTER*160 K_ACCN CHARACTER*160 K_LOC CHARACTER*160 K_DEF CHARACTER*160 K_LINE integer k_length, R_length LOGICAL DONE DATA DONE/.TRUE./ save k_accn,K_loc,K_def,k_line,K_length,R_length,done C * SAVE NUC C C *** The NUC array translates the sequences from a letter to a numeric C *** representation. Positive numbers between 1 and 15 are legitimate C *** nucleic acid codes. Note that T and U are translated to the same C *** numeric value. C C A B C D E F G H I J K * DATA NUC / 65 * 0, 8, 7, 4, 11, 0, 0, 2, 13, 0, 0, 3, C C L M N O P Q R S T U V W X Y * 2 0, 12, 15, 0, 0, 0, 10, 6, 1, 1, 14, 9, 15, 5, C C Z a b c d e f g h i j k * 3 0, 6 * 0, 8, 7, 4, 11, 0, 0, 2, 13, 0, 0, 3, C C l m n o p q r s t u v w x * 4 0, 12, 15, 0, 0, 0, 10, 6, 1, 1, 14, 9, 15, C C y z * 5 5, 0, 5 * 0 / C C 1 FORMAT ( A90 ) 2 FORMAT ( 1 ( 9X, 6 ( 1X, 10( A1: ) ) ) ) 3 FORMAT ( //' An ACCESSION line was not found for the entry:', * 2 ( /' ', A80 ), / ) 4 FORMAT (//' No TERNIMATOR "//" record for entry:', /' ', A80, * /' with length = ',I7, ', and Accession number = ', A6, 1 /' The line that should have been the terminator record', 2 ' is:', /' ', A80, // ) 5 FORMAT ( //' A DEFINITION line was not found for the entry:', * /' ', A80, / ) C C C C C AJR - This routine changed in several places to allow for sequence C segnemtation of large sequences with a 60 residue overlap. IF (.NOT.DONE) THEN ACNUM= K_ACCN LOCUS= K_LOC SEQDEF= K_DEF LENGTH= K_LENGTH IF ((K_LENGTH-R_LENGTH+60).GT.MAXSQP) THEN ITMP2=MAXSQP ELSE ITMP2=K_LENGTH-R_length+60 ENDIF itmp1=0 DONE=.TRUE. C.......Copy over last 60 residues DO 15 I=10,MAX(LASTCH(K_LINE,LEN(K_LINE)),71) IF( K_LINE(I:I) .GT. ' ') THEN ITMP1=ITMP1+1 Seq(ITMP1)= NUC( ICHAR( K_LINE(I:I) )) ENDIF 15 CONTINUE C write (6,*) 'SEQUENCE LENGTH=',LENGTH,' ITMP2=',itmp2, C + 'itmp1=',itmp1,' maxsqp=',maxsqp C call flush(6) GOTO 425 ENDIF DONE=.TRUE. r_length=0 200 READ( NFILE, 1, END = 500 ) LOCLN FERTIG = .FALSE. C IF( LOCLN( 1:5 ) .NE. 'LOCUS' ) GOTO 200 READ( LOCLN( 23:29 ), '(I7)' ) LENGTH ! No. of bases in data LOCUS = LOCLN( 13 : 22 ) * do i=1,80 * print *, i,locln(i:i) * enddo C print*, 'length=',length C print*, 'locln(23:29)',locln(23:29) READ( NFILE, 1 ) DEFLN IF( DEFLN(1:10) .EQ. 'DEFINITION' ) THEN SEQDEF = DEFLN( 12:80 ) READ( NFILE, 1 ) ACC IF( ACC(1:3) .EQ. ' ' ) THEN ! More Definition lines DO 230 NP = 70, 1, -1 230 IF( SEQDEF(NP:NP) .NE. ' ' ) GO TO 240 240 FIRST = NP + 2 DO 250 NP = 94 - FIRST, 13, -1 ! 94 = 13 + 80 + 1 250 IF( ACC(NP:NP) .EQ. ' ' ) GO TO 260 260 LC = NP - 1 IF( LC .GT. 12 ) THEN NDEND = FIRST + LC - 13 NA = 12 DO 270 ND = FIRST, NDEND, 1 NA = NA + 1 270 SEQDEF( ND:ND ) = ACC( NA:NA ) ELSE END IF 280 READ( NFILE, 1 ) ACC IF( ACC(1:9) .NE. 'ACCESSION' ) GO TO 280 C ELSE IF( ACC(1:9) .EQ. 'ACCESSION' ) THEN C ELSE ! error WRITE( *, 3 ) LOCLN, SEQDEF STOP ' GBacc' END IF C C *** Process Accession Number(s) C ACNUM = SPACES ACNUM = ACC( 13:13+index(ACC(13:len(ACC)),' ')-1) C ELSE WRITE( *, 5 ) LOCLN STOP ' GBdef' END IF C C *** Search for FEATURE line if it exists... C 300 CONTINUE 301 READ( NFILE, 1 ) TEXT IF( TEXT( 1:6 ) .EQ. 'ORIGIN' ) GO TO 399 IF( TEXT( 1:8 ) .NE. 'FEATURES' ) GO TO 301 C C *** Search for ORIGIN line which marks the beginning of the C *** nucleotide sequence data C 397 READ( NFILE, 1 ) TEXT 398 IF( TEXT( 1:6 ) .NE. 'ORIGIN' ) GO TO 397 399 CONTINUE C C C *** Now read the nucleotide sequence and translate it into numeric form C C Changed 10-94 AJR. This is much faster on Cray than the previous code. C 405 CONTINUE ITMP1=0 IF (LENGTH.GT.MAXSQP) THEN ITMP2=MAXSQP ELSE ITMP2=LENGTH ENDIF 425 CONTINUE READ(NFILE,1) TEMP IF (TEMP(1:2).EQ.'//') THEN length=itmp1 RETURN ENDIF DO 450 I=10,MAX(LASTCH(TEMP,LEN(TEMP)),71) IF( TEMP(I:I) .GT. ' ') THEN IF(ITMP1.LT.ITMP2) THEN ITMP1=ITMP1+1 Seq(ITMP1)= NUC( ICHAR( TEMP(I:I) )) else IF (ITMP1+R_length.NE.ITMP2) THEN write(*,fmt='(a,a,a,i8,a)') + ' *** WARNING *** Sequence ', + ACNUM(1:lastch(acnum,len(acnum))), + ' was larger than ',MAXSQP,'. It was segmented.' write(*,fmt='(17x,a,a,a)') 'Alignment', + ' residue numbering may not', + ' be accurate.' K_ACCN=ACNUM K_LOC=LOCUS K_DEF=SEQDEF K_LENGTH=LENGTH K_LINE=TEMP DONE=.FALSE. length=itmp1 R_LENGTH=R_length+itmp1-60 return ENDIF ENDIF ENDIF 450 continue if (((ITMP1+mod(itmp2,60)).EQ.ITMP2).AND. + (itmp2.ne.maxsqp)) THEN ITMP3=mod(itmp2,60) ELSE ITMP3=60 ENDIF * print *, 'ITMP1=',itmp1,' length=',length IF ((ITMP1+ITMP3).LE.(ITMP2)) GOTO 425 455 continue IF (ITMP2+R_length.NE.LENGTH) THEN write(*,fmt='(a,a,a,i8,a)') + ' *** WARNING *** Sequence ', + ACNUM(1:lastch(acnum,len(acnum))), + ' was larger than ',MAXSQP,'. It was segmented.' write(*,fmt='(17x,a,a,a)') 'Alignment residue ', + 'numbering may not be accurate.' K_ACCN=ACNUM K_LOC=LOCUS K_DEF=SEQDEF K_LENGTH=LENGTH K_LINE=TEMP DONE=.FALSE. length=itmp1 R_LENGTH=R_length+itmp1-60 return ENDIF length=itmp1 C Write(6,*), ' Done reading sequence..."',ACNUM, C + '" Length: ',LENGTH,' maxsp: ',MAXSQP C call flush(6) READ(NFILE,1) TEMP C print *, 'TEMP=',temp IF (TEMP(1:2).EQ.'//') RETURN ******************************************************** * IF (LENGTH.GT.MAXSQP) THEN * PRINT*, ' *** WARNING *** Sequence "',ACNUM, * + '" Length: ',LENGTH,' > ',MAXSQP, * + ' Truncating sequence.' * LENGTH=MAXSQP * ENDIF * READ( NFILE, 2 ) ( LSEQ( I ), I = 1, LENGTH ) * DO 400 LS = 1, LENGTH * SEQ( LS ) = NUC( ICHAR( LSEQ( LS ) ) ) * 00 CONTINUE *C *C *** Find "//" line which marks the end of each data entry, then start *C *** the next entry. *C * READ( NFILE, 1 ) TERM * IF( TERM(1:2) .EQ. '//' ) RETURN ! Normal return point ************************************************************************ C C *** ERROR in GenBank data file, correct and rerun program C WRITE( *, 4 ) LOCUS, LENGTH, ACNUM, TERM CLOSE( UNIT = NFILE, STATUS = 'KEEP' ) STOP ' GBtrm' C 500 FERTIG = .TRUE. RETURN END SUBROUTINE RDGCG( NFILE, LENGTH, MAXSQP, SEQ, LSEQ, * ACNUM, LOCUS, SEQDEF, FERTIG, NUC ) C IMPLICIT NONE C C *** Passed Variable declarations C INTEGER TERMNL, NFILE, LENGTH, MAXSQP INTEGER SEQ( MAXSQP ) CHARACTER*1 LSEQ( MAXSQP ) CHARACTER*(*) ACNUM CHARACTER*(*) LOCUS CHARACTER*(*) SEQDEF LOGICAL FERTIG INTEGER NUC( 0:127 ) C C *** Local variable declarations C INTEGER FIRST, I, LC, LS, NA, NDEND, ND, NP CHARACTER*90 LINE, LINE2 CHARACTER*90 LOCLN, ACC, ORIGIN, TERM integer lastch external lastch C C 1 FORMAT ( A90 ) C C FERTIG = .FALSE. DO 10 I=1,len(LOCUS) LOCUS(i:i)=' ' 10 CONTINUE DO 11 I=1,len(ACNUM) ACNUM(i:i)=' ' 11 CONTINUE DO 12 I=1,len(SEQDEF) SEQDEF(i:i)=' ' 12 CONTINUE 200 CONTINUE READ( NFILE, 1, END = 500 ) LINE LINE2=LINE CALL ALLCAP(LINE,LEN(LINE)) IF ( (INDEX(LINE,'CHECK:').NE.0) .AND. + (INDEX(LINE,'TYPE:') .NE.0) .AND. + (INDEX(LINE,'LENGTH:') .NE.0) .AND. + (INDEX(LINE,'..') .NE.0)) THEN DO 250 I=1,LEN(LINE) IF (LINE(I:I).NE.' ') GOTO 300 250 CONTINUE 300 CONTINUE I=MIN(I,INDEX(LINE,'LENGTH:')) LOCUS = 'GCG-' // LINE(I:INDEX(LINE,'LENGTH:')-1) ACNUM = LINE(I:INDEX(LINE,'LENGTH:')-1) SEQDEF = LINE2(I:lastch(LINE2,len(line2))) C------------Now read the sequence and C------------translate it into numeric form LENGTH=0 350 CONTINUE READ( NFILE, 1, END = 500 ) LINE DO 600 I=1,LEN(LINE) CALL ALLCAP(LINE,LEN(LINE)) IF ((LINE(I:I).GE.'A').AND.(LINE(I:I).LE.'Z')) THEN LENGTH=LENGTH+1 LSEQ(LENGTH) = LINE(I:I) ENDIF 600 CONTINUE GOTO 350 ENDIF GOTO 200 500 CONTINUE FERTIG = .TRUE. DO 400 LS = 1, LENGTH SEQ( LS ) = NUC( ICHAR( LSEQ( LS ) ) ) 400 CONTINUE RETURN END C C C SUBROUTINE NBRFSQ( NFILE, LENGTH, MAXSQP, SEQ, LSEQ, LOCUS, DEFLN, * FERTIG, nuc) C IMPLICIT NONE C C *** Passed Variable declarations C INTEGER NFILE, LENGTH, MAXSQP INTEGER SEQ( MAXSQP ) CHARACTER*1 LSEQ( MAXSQP ) CHARACTER*(*) LOCUS CHARACTER*(*) DEFLN LOGICAL FERTIG INTEGER NUC( 0:127 ) CHARACTER*132 TEXT CHARACTER*8 TMPNUM, NUMFMT LOGICAL BNUM INTEGER I,J,K C C *** Local variable declarations C INTEGER BASE, L, LLEN, LS CHARACTER*80 LOCLN CHARACTER*130 SQLINE C INTEGER LASTCH ! function declaration C EXTERNAL LASTCH CHARACTER*160 SPACES DATA SPACES( 1: 40)/' '/ DATA SPACES( 41: 80)/' '/ DATA SPACES( 81:120)/' '/ DATA SPACES(120:160)/' '/ CHARACTER*160 K_LOC CHARACTER*160 K_DEF CHARACTER*160 K_SQLINE LOGICAL DONE DATA DONE/.TRUE./ save K_loc,K_def,k_sqline,done C * SAVE NUC C C *** The NUC array translates the sequences from a letter to a numeric C *** representation. Positive numbers between 1 and 15 are legitimate C *** nucleic acid codes. Note that T and U are translated to the same C *** numeric value. C C * * DATA NUC / 42 * 0, 100, C C A B C D E F G H I J K * 1 22 * 0, 8, 7, 4, 11, 0, 0, 2, 13, 0, 0, 3, C C L M N O P Q R S T U V W X Y * 2 0, 12, 15, 0, 0, 0, 10, 6, 1, 1, 14, 9, 15, 5, C C Z a b c d e f g h i j k * 3 0, 6 * 0, 8, 7, 4, 11, 0, 0, 2, 13, 0, 0, 3, C C l m n o p q r s t u v w x * 4 0, 12, 15, 0, 0, 0, 10, 6, 1, 1, 14, 9, 15, C C y z * 5 5, 0, 5 * 0 / C C 1 FORMAT ( A80 ) 2 FORMAT ( A130 ) C C IF(.NOT.DONE) THEN LOCLN=K_LOC GOTO 101 ENDIF 100 READ( NFILE, 1, END = 500 ) LOCLN 101 CONTINUE FERTIG = .FALSE. C IF( LOCLN(1:1) .EQ. '>' .AND. LOCLN(4:4) .EQ. ';' ) THEN LOCUS = LOCLN( 5:10 ) IF (DONE) THEN READ( NFILE, 1 ) DEFLN ELSE DEFLN=K_DEF ENDIF C C *** Now read the nucleotide sequence and translate it into numeric form C LS = 0 **************************** * note --> I added end=500 **************************** 200 IF(.NOT.DONE) THEN SQLINE=K_SQLINE DONE=.TRUE. ELSE READ( NFILE, 2, end=500,err=500) SQLINE ENDIF LLEN = LASTCH( SQLINE, 130 ) DO 300 L = 1, LLEN BASE = NUC( ICHAR( SQLINE( L:L ) ) ) IF( BASE .GE. 1 .AND. BASE .LE. 23 ) THEN LS = LS + 1 SEQ( LS ) = BASE LSEQ( LS ) = SQLINE( L:L ) ELSE IF(ICHAR( SQLINE( L:L ) ).eq.ichar('*') ) THEN LENGTH = LS GOTO 400 ELSE END IF 300 CONTINUE IF (LLEN+LS+200.GT.MAXSQP) THEN C...........Segment Sequence here -- Save stuff + return K_SQLINE=SQLINE K_DEF=DEFLN K_LOC=LOCLN LENGTH=LS DONE=.FALSE. write(*,fmt='(a,a,a,i8,a)') + ' *** WARNING *** Sequence ', + LOCUS(1:lastch(LOCUS,len(LOCUS))), + ' was larger than ',MAXSQP,'. It was segmented.' write(*,fmt='(17x,a,a,a)') 'Alignment residue ', + 'numbering may not be accurate.' RETURN ENDIF GO TO 200 C ELSE GO TO 100 ! Skip lines until first line of entry END IF C C.....LENGTH IS SET, SEQUENCE IS READ IN. NOW READ IN F; LINES! C 400 CONTINUE RETURN 500 FERTIG = .TRUE. RETURN END C C C SUBROUTINE RFASTA( NFILE, LENGTH, MAXSQP, SEQ, LSEQ, LOCUS, DEFLN, * FERTIG, nuc) C IMPLICIT NONE C C *** Passed Variable declarations C INTEGER NFILE, LENGTH, MAXSQP INTEGER SEQ( MAXSQP ) CHARACTER*1 LSEQ( MAXSQP ) CHARACTER*(*) LOCUS CHARACTER*(*) DEFLN LOGICAL FERTIG INTEGER NUC( 0:127 ) INTEGER TYPE C C *** Local variable declarations C INTEGER MXUNT,CURUNT PARAMETER (MXUNT=2) LOGICAL FIRST(MXUNT) LOGICAL EOF(MXUNT) INTEGER UNT(MXUNT) INTEGER I, J INTEGER BASE, L, LENG, LS CHARACTER*130 LOCLN(MXUNT) CHARACTER*130 SQLINE C INTEGER LASTCH ! function declaration C EXTERNAL LASTCH C * SAVE NUC C C *** The NUC array translates the sequences from a letter to a numeric C *** representation. Positive numbers between 1 and 15 are legitimate C *** nucleic acid codes. Note that T and U are translated to the same C *** numeric value. C C * * DATA NUC / 42 * 0, 100, C C A B C D E F G H I J K * 1 22 * 0, 8, 7, 4, 11, 0, 0, 2, 13, 0, 0, 3, C C L M N O P Q R S T U V W X Y * 2 0, 12, 15, 0, 0, 0, 10, 6, 1, 1, 14, 9, 15, 5, C C Z a b c d e f g h i j k * 3 0, 6 * 0, 8, 7, 4, 11, 0, 0, 2, 13, 0, 0, 3, C C l m n o p q r s t u v w x * 4 0, 12, 15, 0, 0, 0, 10, 6, 1, 1, 14, 9, 15, C C y z * 5 5, 0, 5 * 0 / C C DATA FIRST /.FALSE.,.FALSE./ DATA EOF /.FALSE.,.FALSE./ DATA UNT /0,0/ SAVE FIRST, UNT, EOF, LOCLN 1 FORMAT ( A80 ) 2 FORMAT ( A130 ) C C DO 25 I=1,MXUNT IF(UNT(i).EQ.0) THEN CURUNT=I UNT(I)=NFILE IF (EOF(CURUNT)) THEN EOF(CURUNT)=.FALSE. FERTIG=.TRUE. RETURN ELSE GOTO 50 ENDIF ELSEIF (UNT(I).EQ.NFILE) THEN CURUNT=I IF (EOF(CURUNT)) THEN EOF(CURUNT)=.FALSE. FERTIG=.TRUE. RETURN ELSE GOTO 50 ENDIF ENDIF 25 CONTINUE PRINT*, 'ERROR in RFASTA - TOO MANY FILE UNITS' STOP 'RFASTA' 50 CONTINUE IF (FIRST(CURUNT)) THEN FIRST(CURUNT)=.FALSE. 100 READ( NFILE, 1, END = 500 ) LOCLN(CURUNT) FERTIG = .FALSE. IF( LOCLN(CURUNT)(1:1) .EQ. '>') GOTO 150 GOTO 100 ENDIF 150 CONTINUE IF( LOCLN(CURUNT)(1:1) .EQ. '>') THEN I = INDEX(locln(CURUNT),'>') J = INDEX(locln(CURUNT),' ') IF (J.NE.0) THEN J = MIN(J-1,LEN(LOCUS),LASTCH(LOCLN(CURUNT), + LEN(LOCLN(CURUNT)))) ELSE J = MIN(LEN(LOCUS),LASTCH(LOCLN(CURUNT), + LEN(LOCLN(CURUNT)))) ENDIF J=MAX(J,I+1) LOCUS = LOCLN(CURUNT)(I+1:J) J=MIN(MAX(LASTCH(LOCLN(CURUNT),LEN(LOCLN(CURUNT))),I), + LEN(DEFLN)) DEFLN = LOCLN(CURUNT)(I:J) C C *** Now read the nucleotide sequence and translate it into numeric form C LS = 0 **************************** * note --> I added end=1000 **************************** 200 READ( NFILE, 2, END=1000) LOCLN(CURUNT) IF( LOCLN(CURUNT)(1:1) .EQ. '>') THEN RETURN ENDIF LENG = LASTCH( LOCLN(CURUNT), 130 ) DO 300 L = 1, LENG BASE = NUC( ICHAR( LOCLN(CURUNT)( L:L ) ) ) IF( BASE .GE. 1 .AND. BASE .LE. 23 ) THEN LS = LS + 1 SEQ( LS ) = BASE LSEQ( LS ) = LOCLN(CURUNT)( L:L ) ENDIF 300 CONTINUE LENGTH=LS GO TO 200 ELSE 120 READ( NFILE, 1, END = 500 ) LOCLN(CURUNT) FERTIG = .FALSE. GO TO 150 ! Skip lines until first line of entry END IF C 500 CONTINUE FERTIG = .TRUE. EOF(CURUNT)=.FALSE. RETURN 1000 CONTINUE FERTIG = .FALSE. EOF(CURUNT)=.TRUE. RETURN END C C CC C SUBROUTINE EMINIT C IMPLICIT NONE C C *** Passed Variable declarations ( entry rdembl {read embl}) C INTEGER TERMNL, KEYBRD, NFILE, MAXSQP, LENGTH INTEGER SEQ( MAXSQP ) CHARACTER*1 LSEQ( MAXSQP ) CHARACTER*(*) ACNUM CHARACTER*(*) SQNAME CHARACTER*(*) SEQDEF LOGICAL FERTIG INTEGER NUC( 0:127 ) C C *** Local Variable declarations C INTEGER NAMEP, NPCHRS C PARAMETER( NAMEP = 10, NPCHRS = 20 ) C INTEGER C1, C2, DELTA, EXTENT, I, L, LASTC, LIMIT, LS, * START, FINISH,J,K CHARACTER*1 PUNCT( 0:NPCHRS ) CHARACTER*31 WORD CHARACTER*80 IDLINE, ACLINE, DELINE, TERM, temp INTEGER itmp1, itmp2, itmp3 LOGICAL EMPTY C INTEGER LASTCH, LSEMBL ! function declaration EXTERNAL GETWRD, LASTCH, LSEMBL C LOGICAL SWISS SAVE PUNCT CHARACTER*160 SPACES,LABEL CHARACTER*160 PROD CHARACTER*160 GENE CHARACTER*160 NOTE CHARACTER*20 NUMFMT,TMPNUM LOGICAL JEP,CEP,BNUM INTEGER TBEGIN,TEND,LBEGIN,PCNT,BNFT DATA SPACES( 1: 40)/' '/ DATA SPACES( 41: 80)/' '/ DATA SPACES( 81:120)/' '/ DATA SPACES(120:160)/' '/ * + ,NUC C C *** The NUC array translates the sequences from a letter to a numeric C *** representation. Positive numbers between 1 and 15 are legitimate C *** nucleic acid codes. Note that T and U are translated to the same C *** numeric value. C C A B C D E F G H I J K * DATA NUC / 65 * 0, 8, 7, 4, 11, 0, 0, 2, 13, 0, 0, 3, C C L M N O P Q R S T U V W X Y * 2 0, 12, 15, 0, 0, 0, 10, 6, 1, 1, 14, 9, 15, 5, C C Z a b c d e f g h i j k * 3 0, 6 * 0, 8, 7, 4, 11, 0, 0, 2, 13, 0, 0, 3, C C l m n o p q r s t u v w x * 4 0, 12, 15, 0, 0, 0, 10, 6, 1, 1, 14, 9, 15, C C y z * 5 5, 0, 5 * 0 / C C *** PUNCT contains the legal punctuation characters for VAX FORTRAN. C *** The first element, 0, is a sentinal needed by the binary search (POSN) C *** subroutine. The punctuation characters are in ASCII collating C *** sequence order. C DATA PUNCT / ' ', ' ', ' ', '!', '"', '$', '%', '&', '''', '(', * ')', '*', '+', ',', '-', '.', '/', ':', '<', '=', 1 '>' / C C 1 FORMAT ( A80 ) 2 FORMAT ( 1 ( 4X, 6 ( 1X, 10(A1:) ) ) ) 3 FORMAT ( //' The expected ACcession line was not found for', * ' entry:', /' ', A73 ) 4 FORMAT ( //' The expected DEfinition line was not found for', * ' entry:', /' ', A73, 1 /' with accession number: ', A6 ) 5 FORMAT ( //' No TERMINATOR record "//" for entry: ', A10, I8, A6, * /' ', A80, //' The line found instead of the expected', 1 ' terminator was:', /' ', A72, // ) C C PUNCT( 0 ) = CHAR( 0 ) PUNCT( 1 ) = CHAR( 9 ) RETURN C C ENTRY RDEMBL( NFILE, MAXSQP, LENGTH, SEQDEF, + SQNAME, ACNUM, SEQ, LSEQ, FERTIG, nuc) C C 200 READ( NFILE, 1, END = 500 ) IDLINE FERTIG = .FALSE. C IF( IDLINE( 1:2 ) .EQ. 'ID' ) THEN SQNAME = IDLINE( 6:15 ) IF (INDEX(IDLINE(15:80),'PRT;').NE.0) THEN SWISS=.TRUE. ELSE SWISS=.FALSE. ENDIF 220 READ( NFILE, 1 ) ACLINE IF( ACLINE( 1:2 ) .EQ. 'AC' ) THEN ELSE IF( ACLINE( 1:2 ) .EQ. 'XX' .OR. * ACLINE( 1:2 ) .EQ. 'CC' ) THEN GO TO 220 ELSE WRITE( *, 3 ) IDLINE( 1:73 ) STOP 'IdLine' END IF C CALL GETWRD( 5, 73, ACLINE, START, FINISH, NPCHRS, EMPTY, WORD, * PUNCT ) ACNUM=SPACES ACNUM = WORD(1:LASTCH(WORD,LEN(WORD))-1) C C *** Skip the first line not of type "AC", it should be one of types C *** "DT", "XX", or "CC". -- Read until line type "DEfinition" is found. C 290 READ( NFILE, 1 ) DELINE IF( DELINE( 1:2 ) .EQ. 'DE' ) THEN SEQDEF = DELINE( 6:72 ) ELSE IF( DELINE( 1:2 ) .EQ. 'XX' .OR. * DELINE( 1:2 ) .EQ. 'CC' .OR. * DELINE( 1:2 ) .EQ. 'NI' .OR. 1 DELINE( 1:2 ) .EQ. 'DT' .OR. 2 DELINE( 1:2 ) .EQ. 'AC' ) THEN GO TO 290 ELSE WRITE( *, 4 ) IDLINE( 1:73 ), ACNUM END IF C C C READ( NFILE, 1 ) DELINE IF( DELINE( 1:2 ) .EQ. 'DE' ) THEN LASTC = LASTCH( SEQDEF, 68 ) DELTA = 79 - LASTC EXTENT = 6 + DELTA DO 310 I = EXTENT, 6, -1 IF( DELINE(I:I) .EQ. ' ' ) THEN LIMIT = I GO TO 320 ELSE END IF 310 CONTINUE LIMIT = EXTENT - 1 320 C1 = LASTC + 2 C2 = C1 + LIMIT - 6 SEQDEF( C1 : C2 ) = DELINE( 6 : LIMIT ) ELSE END IF C C *** Now find the FT line and read the features C 340 IF((DELINE(1:2).NE.'SQ').AND.(DELINE(1:2).NE.'FT')) THEN READ( NFILE, 1 ) DELINE GO TO 340 ELSE END IF C..........Now we have either read in FT or SQ IF (DELINE(1:2) .EQ. 'SQ' ) GOTO 355 C C *** Now find the SeQuence line and read the length of the sequence C 350 IF( DELINE(1:2) .NE. 'SQ' ) THEN READ( NFILE, 1 ) DELINE GO TO 350 ELSE END IF C 355 LENGTH = LSEMBL( DELINE, 72 ) C C C *** Now read the nucleotide sequence and translate it into numeric form C C Changed 11-94 SF. This is much faster on Cray than the previous code. C This change was copied from the READGB subroutine ITMP1=0 IF (LENGTH.GT.MAXSQP) THEN ITMP2=MAXSQP ELSE ITMP2=LENGTH ENDIF 425 CONTINUE READ(NFILE,1) TEMP IF (TEMP(1:2).EQ.'//') RETURN ! Normal return point DO 450 I=5,MIN(LASTCH(TEMP,LEN(TEMP)),71) C IF( (NUC( ICHAR( TEMP(I:I) )) .GT. 0) THEN IF ( TEMP(I:I) .GT. ' ' ) THEN IF(ITMP1.LT.ITMP2) THEN ITMP1=ITMP1+1 Seq(ITMP1)= NUC( ICHAR( TEMP(I:I) )) ENDIF ENDIF 450 continue * print *, 'ITMP1=',itmp1,' length=',length IF (ITMP1.LE.ITMP2) GOTO 425 IF (ITMP2.NE.LENGTH) THEN write(*,*) ' *** WARNING *** Sequence ',ACNUM, + ' was larger than ',MAXSQP,'. It was truncated.' return ENDIF C C************************************************************************* C READ( NFILE, 2 ) ( LSEQ( L ), L = 1, LENGTH ) C C DO 400 LS = 1, LENGTH C SEQ( LS ) = NUC( ICHAR( LSEQ( LS ) ) ) C 400 CONTINUE C C C *** Now check to insure that the next line is a terminator line - then C *** go onto the next entry in the file. C C READ( NFILE, 1 ) TERM C IF( TERM(1:2) .EQ. '//' ) RETURN ! normal return C************************************************************************** C C *** Error in EMBL file - correct and re-run the program C WRITE( *, 5 ) SQNAME, LENGTH, ACNUM, SEQDEF, TERM(1:72) GO TO 500 C ELSE GO TO 200 END IF C 500 FERTIG = .TRUE. RETURN END C C C C C C C C INTEGER FUNCTION LSEMBL( TEXT, LENGTH ) C C **************************************************************************** C ** Copyright Pittsburgh Supercomputing Center -- June 1988 ** C **************************************************************************** C C *** LSEMBL returns the numeric value of the first string of contiguous C *** digits beginning after position 12 in the character variable TEXT. C *** The function returns only positive values since negative values are C *** not expected (the number being translated is the count of bases in C *** the sequence of an entry in the EMBL data file format). Zero is C *** returned if no digits are encountered on the line before either C *** the end of the line or the string BP is encountered. C IMPLICIT NONE C INTEGER ZEROCH C PARAMETER ( ZEROCH = 48 ) ! location of zero "0" in ASCII C collating sequence C INTEGER LENGTH CHARACTER*(*) TEXT C INTEGER L, FIRST, NUMBER C DO 100 L = 13, LENGTH IF( TEXT( L:L ) .EQ. ' ' ) THEN ELSE IF( LGE( TEXT(L:L), '0') .AND. LLE( TEXT(L:L), '9') ) THEN FIRST = L NUMBER = ICHAR( TEXT( L:L ) ) - ZEROCH GO TO 150 ELSE IF(TEXT(L:L+1) .EQ. 'BP' .OR. TEXT(L:L+1) .EQ. 'bp') THEN LSEMBL = 0 RETURN ELSE END IF 100 CONTINUE C LSEMBL = 0 RETURN C 150 DO 200 L = FIRST + 1, LENGTH IF( LGE( TEXT(L:L), '0' ) .AND. LLE( TEXT(L:L), '9' ) ) THEN NUMBER = ( 10 * NUMBER ) + ( ICHAR( TEXT(L:L) ) - ZEROCH ) ELSE LSEMBL = NUMBER RETURN END IF 200 CONTINUE C LSEMBL = NUMBER RETURN END C C