ScriptFu: tests of TinyScheme string-ports and string escapes

Add document about string-like objects in TinyScheme.

Add tests of several outstanding issues in TinyScheme.
In preparation for fixing them.
bootchk 2024-03-28 11:42:54 -04:00 committed by Lloyd Konneker
parent 5ac6cf6e6c
commit 1e104769d7
9 changed files with 1082 additions and 60 deletions

View File

@ -0,0 +1,288 @@
# String-like objects in ScriptFu's TinyScheme
!!! Work in progress. This documents what should be, and not what is actually implemented. There are bugs in the current implementation.
## About
This is a language reference for the string-like features of
Script-fu's TinyScheme language.
This may differ from other Scheme languages. TinyScheme is a subset of R5RS Scheme, and ScriptFu's version has more than the original TinyScheme because it has been modified to support unichars and bytes. String-ports, bytes, and unichars are recent additions to Scheme, and are not standardized among Schemes.
This is not a tutorial, but a technical document intended to be testable.
Terminology. We use "read method" to denote the function whose name is "read".
We occasionally use "read" to mean "one of the functions: read, read-byte, or read-char."
## The problem of specification
TinyScheme is a loose subset of the R5RS specification.
These are not part of R5RS, but optional implementations:
- string-ports
- unichar
- byte operations
Racket is a Scheme language that also implements the above.
See Racket for specifications, examples, and tests,
but ScriptFu's TinyScheme may differ.
SRFI-6 is one specification for string-port behavior.
SRFI-6:
- has a reference implementation.
- does not describe testable behavior.
- does not discuss unichar or byte operations
## Overview
Script-fu's TinyScheme has these string-like objects:
- string
- string-port
Both are:
- sequences of chars
- practically infinite
- UTF-8 encoded, i.e. the chars are unichars of multiple bytes
They are related. You can:
- initialize an input string-port from a string
- get a string from an output string-port
However, the passed string is not owned by a string-port but is a separate object with its own lifetime.
Differences are:
- string-ports implement the port API (open, read, write, and close)
- string-ports have byte methods but strings do not
- strings have a length method but string-ports do not
- strings have indexing and substring methods but string-ports do not
- write to a string-port is less expensive than an append to a string
Note that read and write methods to a string-port traffic in objects, not chars.
In other words, they serialize or deserialize, representing objects by text, and parsing text into objects.
Symbols also have string representations, but no string-like methods besides conversion to and from strings.
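The following is a minimal sketch of read and write trafficking in objects. The names in-port, out-port, obj, and text are chosen here for illustration only, and the comments describe the intended behavior, which may differ while known bugs remain.

```scheme
; Deserialize: read parses text from an input string-port into an object.
(define in-port (open-input-string "foo"))
(define obj (read in-port))                  ; the symbol foo, not a string

; Serialize: write puts the text representation of an object
; on an output string-port.
(define out-port (open-output-string))
(write obj out-port)
(define text (get-output-string out-port))   ; the string "foo"
```

Note that the object read is a symbol, so no double quotes appear in the text that write produces.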
## ScriptFu's implementation is in C
This section does not describe the language but helps to ground the discussion.
ScriptFu does not use the reference implementation of SRFI-6.
The reference implementation is in Scheme.
The reference implementation is on top of strings.
*ScriptFu's implementation of string-ports is not on top of strings.*
ScriptFu's implementation is in C language.
(The reason is not recorded, but probably for performance.)
The reference implementation of SRFI-6 requires that READ and WRITE be redefined to use a redefined READ-CHAR etc. that
are string-port aware.
TinyScheme does something similar: inbyte() and backbyte() dispatch on the kind of port.
Internally, TinyScheme terminates all strings and string-ports with the NUL character (the 0 byte.) This is not visible in Scheme.
## Allocations
A main concern of string-like objects is how they are allocated and their lifetimes.
All string-like objects are allocated from the heap and with one cell from TinyScheme's cell pool (which is separately allocated from the heap.)
A string-port and any string used with a string-port are separate objects with separate lifetimes.
The length of string-like objects is limited by the underlying allocator (malloc) and the OS.
### String allocations
Strings and string literals are allocated.
They are allocated exactly.
Any append to a string causes a reallocation.
Any substring of a string causes a new allocation and returns a new string instance.
String literals are allocated but are immutable.
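A small sketch of the allocation rules above, using only standard string functions. The names base, longer, and part are hypothetical. Each call returns a newly allocated string and leaves its arguments unchanged.

```scheme
(define base "foo")

; string-append allocates and returns a new string
(define longer (string-append base "bar"))   ; "foobar"

; substring also allocates and returns a new string
(define part (substring longer 0 3))         ; "foo"

; base and longer are unchanged by the calls above
```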
### String-port allocations
String-ports of kind output have an allocated, internal buffer.
A buffer has a "reserve" or free space.
The buffer can sometimes accommodate writes without a reallocation.
Writes to an output string-port can be less expensive (higher performing)
than appends to a string, which always reallocates.
But note that writes are not the same as appending (see below.)
The write method can write larger than the size that is pre-allocated for the buffer (256 bytes.)
A string-port of kind input is not a buffer.
It is allocated once.
Its size is fixed when opened.
## Byte, char, and object methods
### String methods
Strings are composed of characters.
The method string-ref accesses a character component.
Strings have no byte methods.
Characters can be converted to integers and then to bytes.
See "Support for byte type" at the GIMP developer web site.
Strings have no object methods: read and write.
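As an illustration of the character methods, this sketch accesses one character of a string and converts it to an integer. The names s, first-char, and code are hypothetical. The further conversion from integer to byte is specific to ScriptFu and is not shown.

```scheme
(define s "foo")

; string-ref accesses a character by index, counting from zero
(define first-char (string-ref s 0))       ; #\f

; a character converts to the integer of its codepoint
(define code (char->integer first-char))   ; 102
```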
### Port methods
Ports, and thus string-ports, have byte and char methods:
read-byte, read-char
write-byte, write-char
Ports also have methods trafficking in objects:
read, write
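A minimal sketch of the char methods on string-ports, limited to ASCII characters. The names out-port, in-port, chars-written, and c are assumptions for illustration.

```scheme
; write-char puts one character on an output string-port
(define out-port (open-output-string))
(write-char #\a out-port)
(write-char #\b out-port)
(define chars-written (get-output-string out-port))   ; "ab"

; read-char takes one character from an input string-port
(define in-port (open-input-string "foo"))
(define c (read-char in-port))                         ; #\f
```

Unlike write, write-char does not write a representation, so no quotes or sharp notation appear in the port contents.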
### Methods and port kinds
String-ports are of two kinds:
- input
- output
*There is also an input-output kind of string-port in TinyScheme,
but this document does not describe it, and any use of it is not supported in ScriptFu.*
You should only use the "read" methods on a string-port of kind input.
You should only use the "write" methods on a string-port of kind output.
A call of a read method on a string-port of kind output returns an error, and vice versa.
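One way to avoid the error is to test the kind of a port before calling a method on it. This is only a sketch: safe-read is a hypothetical helper, not part of ScriptFu, and the symbol it returns for the wrong kind of port is an arbitrary choice.

```scheme
(define in-port (open-input-string "foo"))
(define out-port (open-output-string))

; Hypothetical helper: read only when the port is of kind input.
(define (safe-read port)
  (if (input-port? port)
      (read port)
      'not-an-input-port))

(define from-input (safe-read in-port))     ; the symbol foo
(define from-output (safe-read out-port))   ; the symbol not-an-input-port
```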
### Mixing byte, char, and object methods
*You should not mix byte methods with char methods, unless you are careful.*
You must understand UTF-8 encoding to do so.
*You should not mix char methods with read/write methods, unless you are careful.*
You must understand parsing and representation of Scheme to do so.
### The NUL character and byte
Internally, TinyScheme terminates all strings and string-ports
with the NUL character (the 0 byte.)
*You should not write the NUL character to strings or string-ports. The result can be surprising, and is not described here.*
You must understand the role of NUL bytes in C strings.
You cannot read the NUL character or byte from a string or string-port since the interpreter always sees it as a terminator.
Note that a string escape sequence for NUL, which is itself a string without a NUL character, can be inside a string or string-port.
You can read and write the NUL character to file ports that you are treating as binary files and not text files.
## Length methods
### Strings
The length of a string is in units of characters. Remember that each character in UTF-8 encoding may comprise several bytes (up to four).
(string-length "foo") => 3
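A short sketch contrasting an ASCII string with a string holding one non-ASCII unichar. The names are hypothetical, and the sketch assumes the source text is UTF-8 encoded.

```scheme
(define ascii-string "foo")
(define lambda-string "λ")

; three characters, each one byte in UTF-8
(define ascii-length (string-length ascii-string))     ; 3

; one character, although λ is two bytes in UTF-8
(define lambda-length (string-length lambda-string))   ; 1
```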
### Ports
Ports have no methods for obtaining the length, either in characters or byte units. Some other Schemes have such methods.
## String-port and initial strings
### Input ports
The method open-input-string takes an initial string.
The initial string can be large, limited only by malloc.
TinyScheme copies the initial string to the port.
Subsequently, these have no effect on the port:
- the initial string going out of scope
- an append to the initial string
Subsequent calls to read methods progress along the port contents,
until finally EOF is returned.
The initial string can be the empty string and then the first read will return the EOF object.
There are no methods for appending to an input string-port after it is opened.
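The progress of reads toward EOF can be sketched as below. The helper count-chars and the other names are hypothetical, and the comments describe the behavior this document specifies.

```scheme
; An empty initial string: the first read returns the EOF object.
(define empty-port (open-input-string ""))
(define first-read (read empty-port))

; Hypothetical helper: call read-char until EOF, counting characters.
(define (count-chars port)
  (let loop ((n 0))
    (if (eof-object? (read-char port))
        n
        (loop (+ n 1)))))

(define char-count (count-chars (open-input-string "foo")))   ; 3
```

Once a port has returned EOF it stays exhausted, so a test that needs more input opens a new port.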
### Output ports
*The method open-output-string optionally takes an initial string but it is ignored.*
In other Schemes, any initial string is the name of the port.
In version 2 of Script-Fu's TinyScheme, the initial string was loosely speaking the buffer for the string-port.
This document does not describe the version 2 behavior.
An output string-port is initially empty; it does not contain the initial string.
The initial string may go out of scope without effect on an output string-port.
You can write more to an output port than the length of the initial string.
## The get-output-string method
A string-port of kind output is a stream that can be written to
but that can be read only by getting its entire contents.
(get-output-string port)
Returns a string that is the accumulation of all prior writes to the output string-port. This is a loose definition. Chars, bytes, and objects can be written, and it is the representations of objects that accumulate.
The port must be open at the time of the call.
A get-output-string call on a newly opened empty port returns the empty string.
Consecutive calls to get-output-string return two different string objects, but they are equivalent.
The returned string is a distinct object from the port.
These subsequent events have no effect on the returned string:
- writes to the port
- closing the port
- the port subsequently going out of scope
Again, *you should not mix write-byte, write-char, and write to an output string-port, without care.*
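The rules for get-output-string can be sketched as follows, writing symbols so that no quotes appear in the contents. The names port, first-get, second-get, and third-get are hypothetical.

```scheme
(define port (open-output-string))
(write 'foo port)

; Two consecutive gets return distinct but equivalent strings.
(define first-get (get-output-string port))    ; "foo"
(define second-get (get-output-string port))   ; "foo"

; A later write accumulates in the port,
; but has no effect on strings already returned.
(write 'bar port)
(define third-get (get-output-string port))    ; "foobar"
```

Here first-get remains "foo" after the second write, since it is a separate object from the port.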
## Writing and reading strings to a string-port
These are different:
- write a string to an output string-port
- append a string to a string
A string written to an output string-port is written with its enclosing double quotes:
```
> (define aPort (open-output-string))
> (write "foo" aPort)
> (get-output-string aPort)
"\"foo\""
```
That is, writing a string to an output string-port
writes five characters, three for foo and two double quotes.
```
"foo"
12345
```
This is a string, which the REPL prints surrounded by quotes, with each double quote inside it escaped by a backslash:
```
"\"foo\""
```
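To contrast the two operations, this sketch appends to a string and writes to a port, then compares lengths. The names appended, port, and written are hypothetical.

```scheme
; Appending joins the characters of the strings, with no quotes added.
(define appended (string-append "foo" "bar"))   ; "foobar"
(define appended-length (string-length appended))

; Writing a string puts its representation on the port,
; which includes the enclosing double quotes.
(define port (open-output-string))
(write "foo" port)
(define written (get-output-string port))
(define written-length (string-length written))
```

Here appended-length is 6, while written-length is 5: the three characters of foo plus two double quote characters.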

View File

@ -7,6 +7,8 @@ if not stable
] ]
test_scripts = [ test_scripts = [
# test binding to the PDB
'tests' / 'PDB' / 'image' / 'image-new.scm', 'tests' / 'PDB' / 'image' / 'image-new.scm',
'tests' / 'PDB' / 'image' / 'image-precision.scm', 'tests' / 'PDB' / 'image' / 'image-precision.scm',
'tests' / 'PDB' / 'image' / 'image-indexed.scm', 'tests' / 'PDB' / 'image' / 'image-indexed.scm',
@ -76,6 +78,7 @@ if not stable
# comprehensive, total test # comprehensive, total test
'tests' / 'PDB' / 'pdb.scm', 'tests' / 'PDB' / 'pdb.scm',
# test TinyScheme embedded interpreter
'tests' / 'TS' / 'sharp-expr.scm', 'tests' / 'TS' / 'sharp-expr.scm',
'tests' / 'TS' / 'sharp-expr-char.scm', 'tests' / 'TS' / 'sharp-expr-char.scm',
@ -84,7 +87,10 @@ if not stable
'tests' / 'TS' / 'cond-expand.scm', 'tests' / 'TS' / 'cond-expand.scm',
'tests' / 'TS' / 'atom2string.scm', 'tests' / 'TS' / 'atom2string.scm',
'tests' / 'TS' / 'integer2char.scm', 'tests' / 'TS' / 'integer2char.scm',
'tests' / 'TS' / 'string-port.scm', 'tests' / 'TS' / 'string-escape.scm',
'tests' / 'TS' / 'string-port-input.scm',
'tests' / 'TS' / 'string-port-output.scm',
# WIP 'tests' / 'TS' / 'string-port-unichar.scm',
'tests' / 'TS' / 'testing.scm', 'tests' / 'TS' / 'testing.scm',
'tests' / 'TS' / 'vector.scm', 'tests' / 'TS' / 'vector.scm',
'tests' / 'TS' / 'no-memory.scm', 'tests' / 'TS' / 'no-memory.scm',

View File

@ -46,6 +46,19 @@
(assert '(equal? (integer->char 13) #\return)) (assert '(equal? (integer->char 13) #\return))
(assert '(equal? (integer->char 32) #\space)) (assert '(equal? (integer->char 32) #\space))
; Misspelled sharp constants
; Any sequence of chars starting with #\n, up to a delimiter,
; that does not match "newline"
; is parsed as the sharp constant for the lower case ASCII n char.
; Similarly for tab, return, space
(test! "misspelled sharp char constant for newline")
; 110 is the codepoint for lower case n
(assert '(equal? (integer->char 110) #\n))
(assert '(equal? (integer->char 110) #\newlin))
(assert '(equal? (integer->char 110) #\newlines))
; sharp constant character ; sharp constant character
@ -84,7 +97,7 @@
; by a sharp constant hex e.g. #\x1f for 31 ; by a sharp constant hex e.g. #\x1f for 31
; Edge codepoint tests (test! "Edge codepoint tests")
; Tests of edge cases, near a code slightly different ; Tests of edge cases, near a code slightly different
; Codepoint US Unit Separator, edge case to 32, space ; Codepoint US Unit Separator, edge case to 32, space
@ -153,7 +166,7 @@
#\x0)) #\x0))
; sharp constants for delimiter characters (test! "sharp constants for delimiter characters")
; These test the sharp constant notation for characters space and parens ; These test the sharp constant notation for characters space and parens
; These are in the ASCII range ; These are in the ASCII range

View File

@ -0,0 +1,204 @@
; test string escape sequences
; An "escape sequence" is a sequence of characters that,
; when parsing a string, yields a single character.
; All escape sequences start with the backslash.
; TS is unicode: lengths are in unichars, not bytes
; 0xff is the C language notation for a hex constant
; Many tests are lax using string-length
; We can't test certain errors since they terminate
; - Doublequote without trailing doublequote
; - buffer overflow
; - short hex escapes (<2 hex digits)
(test! "escaped doublequote")
(assert `(= (string-length "\"") 1))
; escaped newline, tab, carriage return
(assert `(= (string-length "\n") 1))
(assert `(= (string-length "\t") 1))
(assert `(= (string-length "\r") 1))
(test! "escaped backslash")
; escaped backslash, stands for itself
(assert `(= (string-length "\\") 1))
(test! "escaped other chars, ASCII")
; any other escaped char, that is not an octal digit, stands for itself
(assert `(= (string-length "\a") 1))
(test! "escaped other chars, unichar")
(assert `(= (string-length "\λ") 1))
; !!! Note that readable sequences for sharp constants for control chars
; are not suitable in strings.
; #\tab is not a sharp constant expression, and \tab is not a string escape
(assert `(= (string-length "\tab") 3))
; octal escape sequences
; FUTURE obsolete these: we don't need to support both hex and octal.
(test! "octal escapes")
(test! "octal NUL")
; one digit octal sequence
; NUL character, a zero byte, yields a string, but empty
(assert `(= (string-length "\0") 0))
; two digit octal sequence
; 0o11 is tab
(assert `(string=? "\11" "\t"))
(test! "octal escaped characters match non-escaped ASCII characters")
; A is 65 is 0o101
(assert `(string=? "\101" "A"))
; Three digit octal sequences that don't fit in a byte.
; Comments in the code say they should yield an error.
; So values up to 255, which is 0o377, should work.
(test! "octal 377")
; Yields a sequence of bytes that is not proper UTF-8 encoded, string length 0
(assert `(= (string-length "\377") 0))
; (test! "octal 400 yields error")
; In v2 the max value is 255, that fits in a byte.
; FIXME: the code comments say 0o400 should yield an error
;(assert-error `(string-length "\400")
; "Error: Error reading string")
; !!! But in UTF-8 0o377==255 is encoded in two bytes
; and yields LATIN SMALL LETTER Y WITH DIAERESIS
; !!! length in chars is 1, length in bytes is 2.
; FUTURE (assert `(= (string-length "\377") 1))
; FUTURE: if we don't obsolete octal escapes altogether,
; then three or four octal digits should be allowed.
;(test! "octal 777")
; 0o777 is 0x1ff
; 1 char, encoded as 2 bytes.
; (assert `(= (string-length "\777") 1))
; TODO test the string is two-bytes
; we don't have string-length-bytes function
; four octal digits yields two char and three bytes.
; (assert `(= (string-length "\3777") 2))
; TODO test the second char is '7'
(test! "hex escapes")
(test! "hex NUL")
; NUL character, a zero byte, yields a string, but empty
(assert `(= (string-length "\x0000") 0))
;(test! "short hex escape")
; TODO Can't be tested, aborts interpreter, parsing fails
; maybe wrapping it in a string-port
; require at least two hex digits
;(assert-error `(string-length "\x")
; "Error: Error reading string")
;(assert-error `(string-length "\x0")
; "Error: Error reading string")
(test! "2 digit hex escape, ASCII")
; yields A
(assert `(= (string-length "\x41") 1))
(test! "2 digit hex escape, non-ASCII > 127")
; FIXME, fails string length 0
; See scheme.c line 1957 *p++=c is pushing one byte
;
; Yields LATIN SMALL LETTER Y WITH DIAERESIS
; Yields one character of two UTF-8 bytes.
(assert `(= (string-length "\xff") 1))
; Uppercase \XFF also accepted
; yields LATIN SMALL LETTER Y WITH DIAERESIS
;(assert `(= (string-length "\XFF") 1))
(test! "3 digit hex escape x414 yields two characters")
; This is the current behavior.
; SF parses only two hex digits as part of the hex escape,
; and the third hex digit is parsed as itself.
; FUTURE parse a max of four bytes of hex, like say Racket
; yields A4
(assert `(= (string-length "\x414") 2))
; FUTURE: Now does not accept
;(test! "3 digit hex escape")
; yields one unnamed char, 3 bytes
;(assert `(= (string-length "\xfff") 1))
;(test! "4 digit hex escape")
; yields unnamed char, 3 bytes
;(assert `(= (string-length "\xffff") 1))
;(test! "5 digit hex escape")
; yields 2 chars, the unnamed char, 3 bytes,
; and the char LOWER CASE F, 1 byte
;(assert `(= (string-length "\xfffff") 2))
; Every four digit hex value is a valid codepoint
; meaning it will encode in UTF-8.
; Whether it displays a visible glyph depends on other factors.
(test! "consecutive escape sequences")
(test! "consecutive hex escapes")
; two A chars
(assert `(= (string-length "\x41\x41") 2))
; FIXME fails
;(test! "consecutive hex escapes")
; two CENT chars
;(assert `(= (string-length "\xa2\xa2") 2))
; FIXME fails
; (test! "consecutive octal escapes")
; two CENT chars
; (assert `(= (string-length "\242\242") 2))
(test! "consecutive escaped backslash and hex escape")
; yields 3 characters: BACKSLASH, A, BACKSLASH,
(assert `(= (string-length "\\\x41\\") 3))
; FIXME fails
;(test! "consecutive escaped backslash and hex escape")
; yields 3 characters: BACKSLASH, CENT, BACKSLASH,
;(assert `(= (string-length "\\\xa2\\") 3))

View File

@ -0,0 +1,212 @@
; Test cases for string ports of kind input
; See general discussion of string ports at string-port-output.scm
; You read objects from a string.
; A read has the side effect of advancing a cursor in the string.
; read-byte is discouraged on an input string-port.
; Complicated by fact that strings are UTF-8 encoded.
; read-char is a method also
; !!! The input port object does not own a string object.
; The "string" internally is a C pointer to a Scheme cell for a Scheme string.
; The port does not have a cell referring to the cell for the string.
; It does NOT survive garbage collection.
; Closing a port frees memory to C, but few cells to Scheme.
; Closing a port leaves the symbol defined until it goes out of scope,
; but the symbol no longer is bound to a port object
; i.e. operations on it fail.
; setup
; Some tests use new ports, not the setup one.
; Note initial contents is a sequence of alphabetic chars,
; which reads as one symbol object.
(define aStringPort (open-input-string "foo"))
; tests
(test! "open-input-string yields a port")
(assert `(port? ,aStringPort))
(test! "open-input-string yields a port of kind input")
(assert `(input-port? ,aStringPort))
(test! "open-input-string yields a port NOT of kind output")
(assert `(not (output-port? ,aStringPort)))
(test! "write always fails on an input string-port")
(assert-error `(write "bar" ,aStringPort)
"write: argument 2 must be: output port")
(test! "write-char always fails on an input string-port")
(assert-error `(write-char #\a ,aStringPort)
"write-char: argument 2 must be: output port")
(test! "write-byte always fails on an input string-port")
(assert-error `(write-byte (integer->byte 72) ,aStringPort)
"write-byte: argument 2 must be: output port")
(test! "get-output-string always fails on an input string-port")
(assert-error `(get-output-string ,aStringPort)
"get-output-string: argument 1 must be: output port")
; read
; refresh the port
(define aStringPort (open-input-string "foo"))
(test! "string read from input-string equals initial contents of port, one symbol")
; yields a symbol whose repr is "foo"
; ??? This seems to fail sometimes, possibly due to gc, see below?
(assert `(string=?
(symbol->string (read ,aStringPort))
"foo"))
(test! "next read from input-string equals EOF")
(assert `(eof-object? (read ,aStringPort)))
; Note now the port is empty and for testing we must make another
; port with unichar contents
(define aStringPort (open-input-string "λ"))
; FIXME issue #11040
; This is now returning EOF where it should return a unichar char as a symbol
(test! "read from input-string with unichar content equals that unichar as symbol")
; yields a symbol whose repr is "λ"
(assert `(string=?
(symbol->string (read ,aStringPort))
"λ"))
; port with escape sequence for NUL char
(define aStringPort (open-input-string "a\x00b"))
(test! "read from input-string with escape sequence for NUL is truncated")
; yields a symbol whose repr is "a"
(assert `(string=?
(symbol->string (read ,aStringPort))
"a"))
; read multiple objects
; TODO
; garbage collection
(define aStringPort (open-input-string "foo"))
; using aStringPort whose contents read as a symbol "foo"
(test! "input string-port with literal contents MAY NOT survive garbage collection")
; !!! We wrote "foo" but assert that "foo" is NOT THE CONTENTS
; This test corrupts the port.
; This test is of a random result and may fail.
; After gc, a C pointer of the port implementation
; is pointing to the garbage collected string,
; some memory whose contents are undefined.
; Usually a symbol is returned and it is not "foo".
; But it could still be "foo".
(assert `(not
(string=?
(symbol->string
(begin
(gc)
(read ,aStringPort)))
"foo")))
; read-char and read-byte
(define aStringPort (open-input-string "foo"))
(test! "read-char on input string-port, ASCII")
; read-char works, but discouraged from mixing with read
; since read parses a Scheme object, and the read char might
; be syntax.
(assert `(equal?
(read-char ,aStringPort)
#\f ))
(define aStringPort (open-input-string "λ"))
; FIXME fails for same reason as above
(test! "read-char on input string-port, unichar")
(assert `(equal?
(read-char ,aStringPort)
#\λ ))
; Example code for getting char from byte from read-byte
; (integer->char (byte->integer (read-byte port)))
; read-byte
;
; read-byte should not be mixed with read-char or read, without care.
(define aStringPort (open-input-string "foo"))
(test! "read-byte to EOF on input-string, ASCII chars")
; The first byte is the single byte UTF-8 encoding of f char,
; then two o chars, then EOF
(assert `(eof-object?
(begin
(read-byte ,aStringPort)
(read-byte ,aStringPort)
(read-byte ,aStringPort)
(read-byte ,aStringPort))))
(define aStringPort (open-input-string "λa"))
; FIXME fails for same reason as above
(test! "read-byte then read-char on input-string, two-byte UTF-8 encoded char")
; The first byte of the lambda char is 0xce 206, the next 0xbb 187, code point is 0x3bb
; Expect this leaves the port in condition for a subsequent read-char or read
(assert `(= (byte->integer (read-byte ,aStringPort))
206))
(assert `(= (byte->integer (read-byte ,aStringPort))
187))
(assert `(equal? (read-char ,aStringPort)
#\a))
; closing
(define aStringPort (open-input-string "foo"))
(test! "closing a port")
(assert `(close-port ,aStringPort))
(test! "a closed port cannot be read")
(assert-error `(read ,aStringPort)
"read: argument 1 must be: input port")

View File

@ -0,0 +1,243 @@
; Test cases for string ports of kind output
; A port has a bifurcated API:
; input API
; output API.
; Some ports support both.
; The output API has a write method, but no read method.
; Some ports support byte and char operations.
; A string port is-a port.
; !!! write and read methods take or return Scheme objects
; i.e. strings, symbols, atoms, etc.
; A string port is of kind: input or output.
; A string port does not have all the methods of the port API:
; kind output has write method, but not read.
; kind input has read method, but not write.
; A string output port stores its contents in memory (unlike device ports).
; A get-output-string returns contents previously written.
; A string port is practically infinite.
; A string port is like a string.
; A sequence of writes are like a sequence of appends to a string,
; except the things written are objects, not just strings.
; You can only get the entire string.
; A get does not have the side effect of advancing a cursor in the string.
; write-byte is discouraged on an output string-port.
; Complicated by fact that strings are UTF-8 encoded.
; !!! The port object does not own a string object.
; The "string" internally is in a UTF-8 encoded C allocated chunk
; of memory, but not in a Scheme cell for a Scheme string.
; It survives garbage collection.
; Closing a port frees memory to C, but few cells to Scheme.
; Closing a port leaves the symbol defined until it goes out of scope,
; but the symbol no longer is bound to a port object
; i.e. operations on it fail.
; setup
; Some tests use new ports, not the setup one.
; This port is unlimited, should grow
(define aStringPort (open-output-string))
; tests
(test! "open-output-string yields a port")
(assert `(port? ,aStringPort))
(test! "open-output-string yields a port of kind output")
(assert `(output-port? ,aStringPort))
(test! "open-output-string yields a port NOT of kind input")
(assert `(not (input-port? ,aStringPort)))
(test! "read method fails on an output string-port")
(assert-error `(read ,aStringPort)
"read: argument 1 must be: input port")
(test! "read-byte method fails on an output string-port")
(assert-error `(read-byte ,aStringPort)
"read-byte: argument 1 must be: input port")
(test! "string get from port equals string write to port")
; !!! with escaped double quote
(assert `(string=?
(let* ((aStringPort (open-output-string)))
(write "foo" aStringPort)
(get-output-string aStringPort))
"\"foo\""))
(test! "string get from port equals string repr of symbol written to port")
; !!! without escaped double quote
(assert `(string=?
(let* ((aStringPort (open-output-string)))
; !!! 'foo is-a symbol whose repr is three characters: foo
; write to a port writes the repr
(write 'foo aStringPort)
(get-output-string aStringPort))
(symbol->string 'foo)))
(test! "get-output-string called twice returns the same string")
; Can get the same string twice
(assert `(string=?
(begin
(write "foo" ,aStringPort)
(get-output-string ,aStringPort)
(get-output-string ,aStringPort))
"\"foo\""))
(test! "port contents survive garbage collection")
; using aStringPort whose contents are "foo"
(assert `(string=?
(begin
(gc)
(get-output-string ,aStringPort))
"\"foo\""))
; tests of the form (open-output-string <initial string>)
; Some Schemes have an optional argument a string that is the initial contents?
; Guile does not. Racket does not, but takes a name for the port. MIT does not.
;
; The initial string is always overwritten, and is just an allocation.
; Only the size of the initial string matters, not the contents.
; Also, see test9.scm, which tests this using a string whose scope is larger
; and so does not get garbage collected.
(define aLimitedStringPort (open-output-string "initial"))
(test! "initial contents string is just an allocation")
; !!! only 7 bytes are allocated.
; get-output-string returns empty string, not the initial contents.
(assert `(string=? (get-output-string ,aLimitedStringPort)
""))
(test! "writing to output string-port w initial contents may truncate")
; Only 7 chars are written, and a double quote char takes one
(assert `(string=?
(begin
(write "INITIALPLUS" ,aLimitedStringPort)
(get-output-string ,aLimitedStringPort))
"\"INITIA"))
;(test! "port contents survive garbage collection")
; using aStringPort whose contents are "INITIAL"
; TODO this may be crashing
;(assert `(string=?
; (begin
; TODO need to open a port and write to it
; (gc)
; (get-output-string ,aLimitedStringPort))
; "\"INITIAL\""))
; write bytes
; initial contents "foo"
(test! "write-byte on output-string, ASCII char")
(assert `(write-byte (integer->byte 72) ,aStringPort))
; write is effective when byte is an ASCII char, valid UTF-8 encoding
; Note that the yield is a repr of a string followed by a repr of a char.
(assert `(string=? (get-output-string ,aStringPort)
"\"foo\"H"))
; This test corrupts aStringPort.
; It tests what an author should NOT do: write a byte that is not UTF-8 encoding.
(test! "write-byte on output-string, non ASCII char")
(assert `(write-byte (integer->byte 172) ,aStringPort))
; write yields strange results when byte is not a proper UTF-8 encoding.
; Note that the yield is same as before, and doesn't show the written byte.
(assert `(string=? (get-output-string ,aStringPort)
"\"foo\"H"))
; closing
(test! "closing a port")
(assert `(close-port ,aStringPort))
(test! "a closed port cannot be get-output-string")
(assert-error `(get-output-string ,aStringPort)
"get-output-string: argument 1 must be: output port")
(test! "a closed port cannot be written")
(assert-error `(write 'foo ,aStringPort)
"write: argument 2 must be: output port ")
; closing does not affect prior gotten contents
(test! "closing output port does not affect prior gotten contents")
; setup
(define aStringPort (open-output-string))
(write "foo" aStringPort)
(define contents (get-output-string aStringPort))
(close-port aStringPort)
(gc)
(assert `(string=? ,contents
"\"foo\""))
; What is read equals the string written.
; Edge case: writing more than 256 characters in two tranches
; where second write crosses end boundary of 256 char buffer.
; issue #9495
; Failing
;(assert '(string=?
; (let* ((aStringPort (open-output-string)))
; (write (string->symbol (make-string 250 #\A)) aStringPort)
; (write (string->symbol (make-string 7 #\B)) aStringPort)
; (get-output-string aStringPort))
; (string-append
; (make-string 250 #\A)
; (make-string 7 #\B))))
; read/write are opposites
; !!! Note in this case lack of escaped quotes on what is read
; FIXME, this fails
(test! "read's of a get-output-string return what was write'd before")
; setup
(define aOutStringPort (open-output-string))
(write "foo" aOutStringPort)
(write "bar" aOutStringPort)
(define aInStringPort (open-input-string (get-output-string aOutStringPort)))
(close-port aOutStringPort)
(gc)
; aInStringPort is open having contents "\"foo\"\"bar\""
; test the original strings can be read consecutively
(assert `(string=? (read ,aInStringPort)
"foo"))
(assert `(string=? (read ,aInStringPort)
"bar"))

View File

@ -0,0 +1,108 @@
; test string ports with unichar
; WIP: requires changes to string escapes: four hex digits
; Not currently in the test suite
; This tests unichars in strings work nicely with string-ports.
; Algebraic tests combining two different areas of the code.
; The concern is that bytes and chars are counted correctly.
; Also tests hex escape sequences, which are required to express
; edge case unichars otherwise not possible to put in a text.
; The edge cases of unichar:
; - simple non-ASCII unichar character named LAMBDA
; - edge of UTF-8 U+FFFF
; - edge of NUL-terminated strings: ASCII character named NUL
; TODO testing of string-ports in string-port.scm could contain this
; input-string
(test! "string containing unichar as symbol parses")
; open-input-string yields string without double quotes
; read from that string is-a list containing a symbol
(assert `(list? (read (open-input-string "'(λ)"))))
; car of that list is-a symbol
(assert `(symbol? (car (read (open-input-string "'(λ)")))))
(test! "string containing unichar as character parses into symbol")
; \x3bb is the hex escape for the LAMBDA character
; but it is in the input string as a symbol
(assert `(symbol? (read (open-input-string "\x3bb"))))
(test! "string containing string embedding a unichar parses into string")
; string inside string via escaped double quotes
(assert `(string? (read (open-input-string "\"\x3bb\""))))
(test! "string containing string embedding an unprintable unichar parses into string")
; string inside string via escaped double quotes
(assert `(string? (read (open-input-string "\"\xFFFF\""))))
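; Sketch, not part of the original commit: checks the concern stated at the
; top of this file, that chars rather than bytes are counted.
; Assumes string-length counts unichars (LAMBDA is two bytes in UTF-8).
; Uses a literal λ instead of a hex escape, so it does not depend on the
; WIP changes to string escapes.
(test! "length of string read from port counts unichars, not bytes")
(assert `(= (string-length (read (open-input-string "\"λ\"")))
            1))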
; output-string
; open-output-string takes an optional "initial contents string"
(test! "get-output-string from output port equals string written to port")
(test! "simple LAMBDA")
; !!! with escaped double quote
(assert '(string=?
           (let* ((aStringPort (open-output-string)))
             (write "λ" aStringPort)
             (get-output-string aStringPort))
           "\"λ\""))
(test! "U+FFFF")
(assert '(string=?
           (let* ((aStringPort (open-output-string)))
             (write "\xFFFF" aStringPort)
             (get-output-string aStringPort))
           "\"\xFFFF\""))
(test! "writing embedded NUL to output port shortens the string")
; !!! NUL is written with four hex digits, \x0000.
; Written as \x00, the "af" of "after" would be consumed,
; since \x00af is itself a hex escape sequence.
; This probably does not leak memory: the internal C string
; is correctly managed, provided the code always uses gunichar functions
; to calculate the internal string length in unichars.
(assert '(string=?
           (let* ((aStringPort (open-output-string)))
             (write "before\x0000after" aStringPort)
             (get-output-string aStringPort))
           "\"before\""))
(test! "initial contents to a new output port, with unichar")
(assert '(string=?
           (let* ((aStringPort (open-output-string "λ")))
             (get-output-string aStringPort))
           "λ"))
; input-output-string
; open-input-output-string is non-standard Scheme; it is not even in R5RS.
; It is also poorly documented.
; Presumably it is meant to be a pipe, where read consumes
; and get-output-string returns the contents without consuming.
; !!! read is different from get-output-string:
; read forms a Scheme object.
; Here, the Scheme object is a symbol.
; FIXME fails, but is that because of the unichar, or does it also fail with ASCII?
(test! "string read from input-output port equals string written to port")
(assert '(string=?
           (let* ((aStringPort (open-input-output-string "foo")))
             (write "λ" aStringPort)
             (symbol->string (read aStringPort)))
           "\"fooλ\""))

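; round trip
; Sketch, not part of the original commit: combines the output-string and
; input-string cases above.
; Assumes write emits the repr of the string (with double quotes)
; and read parses that repr back into an equal string.
(test! "unichar string written to an output port reads back equal")
(assert '(string=?
           (let* ((aStringPort (open-output-string)))
             (write "λ" aStringPort)
             (read (open-input-string (get-output-string aStringPort))))
           "λ"))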
@ -1,56 +0,0 @@
; Test cases for string ports
; a string port is-a port (having read and write methods).
; a string port stores its contents in memory (unlike device ports).
; A read returns contents previously written.
; A string port is practically infinite.
; a string port is like a string
; a sequence of writes are like a sequence of appends to a string
; Note that each assert is in its own environment,
; so we can't define a global port outside????
; Why shouldn't this work?
; (define aStringPort (open-output-string))
; (assert `(port? aStringPort))
; open-output-string yields a port
(assert '(port? (open-output-string)))
; string read from port equals string written to port
; !!! with escaped double quote
(assert '(string=?
           (let* ((aStringPort (open-output-string)))
             (write "foo" aStringPort)
             (get-output-string aStringPort))
           "\"foo\""))
; string read from port equals string repr of symbol written to port
; !!! without escaped double quote
(assert '(string=?
           (let* ((aStringPort (open-output-string)))
             ; !!! 'foo is-a symbol whose repr is three characters: foo
             ; write to a port writes the repr
             (write 'foo aStringPort)
             (get-output-string aStringPort))
           (symbol->string 'foo)))
; What is read equals the string written.
; For edge case: writing more than 256 characters in two tranches
; where second write crosses end boundary of 256 char buffer.
; issue #9495
; Failing
;(assert '(string=?
;           (let* ((aStringPort (open-output-string)))
;             (write (string->symbol (make-string 250 #\A)) aStringPort)
;             (write (string->symbol (make-string 7 #\B)) aStringPort)
;             (get-output-string aStringPort))
;           (string-append
;             (make-string 250 #\A)
;             (make-string 7 #\B))))

@ -15,7 +15,11 @@
 (testing:load-test "atom2string.scm")
 (testing:load-test "integer2char.scm")
-(testing:load-test "string-port.scm")
+(testing:load-test "string-escape.scm")
+(testing:load-test "string-port-output.scm")
+(testing:load-test "string-port-input.scm")
+; WIP
+; (testing:load-test "string-port-unichar.scm")
 (testing:load-test "sharp-expr.scm")
 (testing:load-test "sharp-expr-char.scm")