Tutorial 032 Word-count that Text File
The following is the last "screensworth" of output obtained by processing
a text file of THE GEORGICS by Virgil, translated by James Rhoades:
20965 who
20966 erst
20967 trilled
20968 for
20969 shepherd-wights
20970 The
20971 wanton
20972 ditty
20973 and
20974 sang
20975 in
20976 saucy
20977 youth
20978 Thee
20979 Tityrus
20980 neath
20981 the
20982 spreading
20983 beech
20984 tree
20985 shade
20986 THE
20987 END
It's rather a long text file (over 21,000 words' worth), which took a while
to process! I suggest you start on something more modest...
Just select/copy some text off the web, or use one of your word-processed
documents; paste it into Notepad, then save the text file
in the same directory as your copy of the program. Have fun!
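If you would rather not hunt for a document, a minimal sketch like the following writes a tiny test file from BASIC itself. The file name "sample.txt" and the sample sentence are assumptions made up for illustration; they are not part of the original program.

```bbcbasic
REM Sketch only: writes a small plain-text test file
REM "sample.txt" and the sentence are made-up examples
text$="The quick brown fox jumps over the lazy dog."
fnum=OPENOUT "sample.txt"
IF fnum=0 THEN PRINT "Could not create sample.txt": END
REM Write the sentence one byte at a time
FOR i%=1 TO LEN(text$)
  BPUT#fnum,ASC(MID$(text$,i%,1))
NEXT
REM Finish with a new-line byte (10), which the word counter
REM treats as an end-of-word signal
BPUT#fnum,10
CLOSE#fnum
PRINT "sample.txt written"
```

Fed to the word counter, this file should give nine words, since every word in the sentence has at least two letters.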
Program Listing :
REM: Word-count that text file
REM: Save the text file in the same directory as this program is stored
REM: Richard Weston 24th June 2003
MODE8
VDU14
COLOUR2
*.*.*
COLOUR15
INPUT''"File name - including file extension (eg .txt) : "file$
fnum=OPENIN file$
IF fnum=0 THEN PRINT "No ";file$;" data": END
n=0
REPEAT
finished=FALSE
word$=""
REPEAT
temp=BGET#fnum :REM Read byte
PROCprocess
UNTIL finished
IF LEN(word$)>1 THEN
n+=1
PRINT n;" ";word$
ENDIF
UNTIL EOF#fnum
CLOSE#fnum
END
:
DEF PROCprocess
IF temp>64 AND temp<91 THEN
word$+=CHR$(temp)
ENDIF
:
IF temp>96 AND temp<123 THEN
word$+=CHR$(temp)
ENDIF
:
IF temp=45 THEN word$+=CHR$(temp)
REM^ hyphen
IF temp=10 THEN finished=TRUE
IF temp>31 AND temp<65 THEN finished=TRUE
IF temp=45 THEN finished=FALSE
IF temp>90 AND temp<97 THEN finished=TRUE
IF temp>122 THEN finished=TRUE
ENDPROC
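One thing the listing leaves open is what happens when the very last byte of the file is a letter: the inner loop only stops when PROCprocess sets finished, so it has no terminator byte to find, and what BGET# does when asked to read past the end depends on the version of BBC BASIC. A minimal sketch of one defensive variant follows. It is a suggestion, not part of the original listing, and it assumes a BBC BASIC that has WHILE...ENDWHILE. PROCprocess is copied unchanged.

```bbcbasic
REM Sketch: word counter whose loops also stop at end of file
INPUT''"File name - including file extension (eg .txt) : "file$
fnum=OPENIN file$
IF fnum=0 THEN PRINT "No ";file$;" data": END
n=0
REM WHILE tests before the first read, so an empty file is skipped
WHILE NOT EOF#fnum
  finished=FALSE
  word$=""
  REPEAT
    temp=BGET#fnum
    PROCprocess
    REM Stop at the end of a word OR when the file runs out
  UNTIL finished OR EOF#fnum
  IF LEN(word$)>1 THEN
    n+=1
    PRINT n;" ";word$
  ENDIF
ENDWHILE
CLOSE#fnum
END
:
DEF PROCprocess
IF temp>64 AND temp<91 THEN
  word$+=CHR$(temp)
ENDIF
IF temp>96 AND temp<123 THEN
  word$+=CHR$(temp)
ENDIF
IF temp=45 THEN word$+=CHR$(temp)
IF temp=10 THEN finished=TRUE
IF temp>31 AND temp<65 THEN finished=TRUE
IF temp=45 THEN finished=FALSE
IF temp>90 AND temp<97 THEN finished=TRUE
IF temp>122 THEN finished=TRUE
ENDPROC
```

With this change a final word that runs right up to the end of the file is still counted, because the inner loop exits as soon as EOF# becomes TRUE.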
Annotated Listing :
REM: Word-count that text file
REM: Save the text file in the same directory as this program is stored
REM: Richard Weston 24th June 2003
MODE8
VDU14
***^ paged mode - press "Shift" to move on to the next page of output
***^ you can remove or "REM out" this line if you wish to count a big document
like I did - for fun, of course, because Word will do this instantly for you,
but you won't know how it does it.
COLOUR2
*.*.*
***^ this gives you the contents of the current directory! Neat, eh?
COLOUR15
INPUT''"File name - including file extension (eg .txt) : "file$
***^ You choose the text file to be opened and have its words counted...
fnum=OPENIN file$
***^ channel number assigned here (fnum is 0 if the file could not be opened)
IF fnum=0 THEN PRINT "No ";file$;" data": END
n=0
***^ n is used to count the words detected in the file
REPEAT
***^ keeps on until all words found and file ends
finished=FALSE
***^ finished=TRUE when the end of a word is detected...
word$=""
***^ initialise a new word with a "null" (empty) string
REPEAT
***^ keeps on pulling bytes off the file until the end of a word is detected
temp=BGET#fnum :REM Read byte
PROCprocess
***^ has a look at each byte and decides what, if anything, to do with it
UNTIL finished
IF LEN(word$)>1 THEN
***^ chucks out empty strings (left by runs of spaces and punctuation) and
single-letter words such as "a" and "I", which are not of much interest and
would put huge gaps into the output if allowed
n+=1
***^ count another word
PRINT n;" ";word$
***^ prints the output for you to read!!!
ENDIF
UNTIL EOF#fnum
***^ EOF# becomes TRUE once the last byte of the file has been read
CLOSE#fnum
***^ Be tidy; close your files after use...
END
:
DEF PROCprocess
IF temp>64 AND temp<91 THEN
word$+=CHR$(temp)
ENDIF
***^ Capital letters allowed and added to string
:
IF temp>96 AND temp<123 THEN
word$+=CHR$(temp)
ENDIF
***^ Lower case letters allowed and added to string
:
IF temp=45 THEN word$+=CHR$(temp)
REM^ hyphen
***^ hyphen allowed and added to string - thus hyphenated words are not
dehyphenated!
IF temp=10 THEN finished=TRUE
***^ new line signal detected
IF temp>31 AND temp<65 THEN finished=TRUE
***^ SPACE, !, ", #, digits, comma, full stop etc. disallowed and taken to
signal the end of a word
IF temp=45 THEN finished=FALSE
***^ re-allows the hyphen / minus sign, which the previous line had just
treated as the end of a word (45 lies between 31 and 65)
IF temp>90 AND temp<97 THEN finished=TRUE
***^ Disallows [, \, ], ^, _ and `
IF temp>122 THEN finished=TRUE
***^ {, |, } etc disallowed in words, for heaven's sake!!! Virgil never
had to worry about those...
ENDPROC
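The range tests in PROCprocess can also be read as a single yes/no question: is this byte part of a word? As a rough sketch, the same letter and hyphen rules can be tried on a string in memory, without opening a file. The function name FNwordchar and the test sentence are invented here for illustration. Note one difference from the listing: this sketch treats every non-word byte as a separator, whereas PROCprocess quietly ignores control codes below 32 other than 10.

```bbcbasic
REM Sketch: the same character rules applied to a string, not a file
REM FNwordchar and test$ are illustrative, not part of the original
test$="A well-known ditty, sung in saucy youth!"
n=0
word$=""
FOR i%=1 TO LEN(test$)+1
  REM One extra pass with a space (32) flushes the final word
  IF i%>LEN(test$) THEN temp=32 ELSE temp=ASC(MID$(test$,i%,1))
  IF FNwordchar(temp) THEN
    word$+=CHR$(temp)
  ELSE
    REM End of a word: count it only if it has two or more characters
    IF LEN(word$)>1 THEN n+=1: PRINT n;" ";word$
    word$=""
  ENDIF
NEXT
END
:
DEF FNwordchar(c%)
REM TRUE for A-Z, a-z and the hyphen (45); FALSE for everything else
IF c%>64 AND c%<91 THEN =TRUE
IF c%>96 AND c%<123 THEN =TRUE
IF c%=45 THEN =TRUE
=FALSE
```

Run as it stands, this should list six words: "well-known" is kept whole because of the hyphen rule, and the single-letter "A" is dropped by the LEN(word$)>1 test.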