Tutorial 032 Word-count that Text File


The following is the last "screensworth" of output obtained by processing a text file of...

THE GEORGICS  by Virgil translated by James Rhoades .....

     20965   who
     20966   erst
     20967   trilled
     20968   for
     20969   shepherd-wights
     20970   The
     20971   wanton
     20972   ditty
     20973   and
     20974   sang
     20975   in
     20976   saucy
     20977   youth
     20978   Thee
     20979   Tityrus
     20980   neath
     20981   the
     20982   spreading
     20983   beech
     20984   tree
     20985   shade
     20986   THE
     20987   END
 
It's rather a long text file (over 21,000 words' worth), which took a while to process! I suggest you start on something more modest...

Just select/copy some text off the web, or use one of your word-processed documents; paste it into Notepad, then save the text file in the same directory as your copy of the program. Have fun!
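
If you would rather not hunt for a text file, the short sketch below writes a tiny one for you to practise on. It is only an illustration: the file name "sample.txt" and the sentence in it are made up for this example, and the text is written one byte at a time with BPUT# so that nothing but plain characters goes into the file.

      REM: Make a small text file to practise on (sketch only)
      REM: "sample.txt" and the sentence below are invented for this example
      text$="The quick brown fox jumps over the lazy dog."
      fnum=OPENOUT "sample.txt"
      IF fnum=0 THEN PRINT "Could not create sample.txt": END
      FOR i=1 TO LEN(text$)
        BPUT#fnum,ASC(MID$(text$,i,1)) :REM Write one byte
      NEXT
      BPUT#fnum,13 :REM carriage return
      BPUT#fnum,10 :REM line feed, so the last line is properly ended
      CLOSE#fnum

Run this once, then run the word-count program and type sample.txt at the prompt; it should list nine words, from "The" to "dog".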


Program Listing :

      REM: Word-count that text file
      REM: Save the text file in the same directory as this program is stored
      REM: Richard Weston 24th June 2003
      MODE8
      VDU14
      COLOUR2
      *.*.*
      COLOUR15
      INPUT''"File name - including file extension (eg .txt) : "file$
      fnum=OPENIN file$
      IF fnum=0 THEN PRINT "No ";file$;" data": END
      n=0
      REPEAT
        finished=FALSE
        word$=""
        REPEAT
          temp=BGET#fnum :REM Read byte
          PROCprocess
        UNTIL finished
        IF LEN(word$)>1 THEN
          n+=1
          PRINT n;"   ";word$
        ENDIF
      UNTIL  EOF#fnum
      CLOSE#fnum
      END
      :
      DEF PROCprocess
      IF temp>64 AND temp<91 THEN
        word$+=CHR$(temp)
      ENDIF
      :
      IF temp>96 AND temp<123 THEN
        word$+=CHR$(temp)
      ENDIF
      :
      IF temp=45 THEN  word$+=CHR$(temp)
      REM^ hyphen
      IF temp=10 THEN finished=TRUE
      IF temp>31 AND temp<65 THEN finished=TRUE
      IF temp=45 THEN finished=FALSE
      IF temp>90 AND temp<97 THEN finished=TRUE
      IF temp>122 THEN finished=TRUE
      ENDPROC


Annotated Listing :

      REM: Word-count that text file
      REM: Save the text file in the same directory as this program is stored
      REM: Richard Weston 24th June 2003
      MODE8
      VDU14 *** paged mode - press "shift" to move onto the next page of output
***^ you can remove or "REM out" this line if you wish to count a big document as I did. That was just for fun, of course: Word will count the words instantly for you, but you won't know how it does it.

      COLOUR2
      *.*.*              <<********** this gives you the contents of the current directory !!! Neat eh?
      COLOUR15
      INPUT''"File name - including file extension (eg .txt) : "file$
***^ You choose the text file to be opened and have its words counted...
      fnum=OPENIN file$
***^ channel number assigned here
      IF fnum=0 THEN PRINT "No ";file$;" data": END
      n=0
***^ n is used to count the words detected in the file
      REPEAT
***^ keeps on until all words found and file ends
        finished=FALSE
***^ finished=TRUE when the end of a word is detected...
        word$=""
***^ initialise a new word with a "null" (empty) string
        REPEAT
***^ keeps on pulling bytes off the file until the end of a word is detected
          temp=BGET#fnum : REM Read byte
          PROCprocess
***^ has a look at each byte and decides what, if anything, to do with it
        UNTIL finished
        IF LEN(word$)>1 THEN
***^ chucks out empty words (from runs of blank spaces) and single letters such as "a" and "I", which are not of much interest and would put huge gaps into the output if allowed
          n+=1
***^ count another word
          PRINT n;"   ";word$
***^ prints the output for you to read!!!
        ENDIF
      UNTIL  EOF#fnum
***^ carries on until the end of the file is reached
      CLOSE#fnum
***^ Be tidy; close your files after use...
      END
      :
      DEF PROCprocess
      IF temp>64 AND temp<91 THEN
        word$+=CHR$(temp)
      ENDIF
***^ Capital letters allowed and added to string
      :
      IF temp>96 AND temp<123 THEN
        word$+=CHR$(temp)
      ENDIF
***^ Lower case letters allowed and added to string
      :
      IF temp=45 THEN  word$+=CHR$(temp)
      REM^ hyphen
***^ hyphen allowed and added to string - thus hyphenated words are not dehyphenated!

      IF temp=10 THEN finished=TRUE
***^ new line signal detected
      IF temp>31 AND temp<65 THEN finished=TRUE
***^ SPACE, !, ", #, comma, full stop, digits etc disallowed and taken to signal the end of a word
      IF temp=45 THEN finished=FALSE
***^ re-allows the hyphen/minus sign (code 45), which the test above had just disallowed
      IF temp>90 AND temp<97 THEN finished=TRUE
***^ Disallows [, \, ], etc
      IF temp>122 THEN finished=TRUE
***^ {, |, } etc disallowed in words, for heaven's sake!!! Virgil never had to worry about those...
      ENDPROC
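
The listing assumes that the last word in the file is followed by a new line or a punctuation mark; if it is not, finished never becomes TRUE and the inner REPEAT keeps trying to read after the file has run out. It also ignores control codes other than 10, so a TAB between two words runs them together. The variation below is one possible way round both points, offered as a sketch rather than a definitive fix. FNwordchar is a name made up for this example, WHILE...ENDWHILE is assumed to be available (it is in BBC BASIC for Windows), and note that, unlike the original, every byte that is not a letter or a hyphen ends the word.

      REM: Word-count sketch with an end-of-file check in the inner loop
      REM: FNwordchar is a name invented for this sketch
      INPUT "File name - including file extension (eg .txt) : "file$
      fnum=OPENIN file$
      IF fnum=0 THEN PRINT "No ";file$;" data": END
      n=0
      WHILE NOT EOF#fnum
        word$=""
        finished=FALSE
        REPEAT
          temp=BGET#fnum :REM Read byte
          IF FNwordchar(temp) THEN word$+=CHR$(temp) ELSE finished=TRUE
        UNTIL finished OR EOF#fnum
        IF LEN(word$)>1 THEN
          n+=1
          PRINT n;"   ";word$
        ENDIF
      ENDWHILE
      CLOSE#fnum
      END
      :
      DEF FNwordchar(c)
      REM TRUE if byte c may form part of a word
      IF c>64 AND c<91 THEN =TRUE :REM capital letters
      IF c>96 AND c<123 THEN =TRUE :REM lower case letters
      IF c=45 THEN =TRUE :REM hyphen
      =FALSE

Run on the same file, this should give the same list as the original wherever words are separated by spaces, punctuation and new lines. Testing EOF# at the top of the outer loop also means an empty file is never read from at all.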

