There are several possible options, which are appropriate under different circumstances:
Ropes
Use the
The disadvantages are:
• Single character replacements are slow. Consequently STL algorithms are likely to be slow when updating ropes. (Insertions near the beginning take roughly the same amount of time as single character replacements, and much less time than corresponding insertions for the other string alternatives.)
• The
C strings
This is likely to be the most efficient way to represent a large collection of very short strings. It is by far the most space efficient alternative for small strings. For short strings, the C library functions in
• Operations such as concatenation and substring are much more expensive than for
• The user needs to be aware of sharing between string representations. If strings are assigned by copying pointers, an update to one string may affect another.
• C strings provide no help in storage management. This may be a major issue, although a garbage collector can help alleviate it.
vector<char>
If a string is treated primarily as an array of characters, with frequent in-place updates, it is reasonable to represent it as
Disadvantages are:
• Vector assignments are much more expensive than C string pointer assignments; the only way to share string representations is to pass pointers or references to vectors.
• Most operations on entire strings (
• A number of standard string operations (
• Conversion to C strings is currently slow, even for short strings. That may change in future implementations.
This package was a minimal adaptation of the freely available Modena strings package. It was intended as a stopgap. We do not intend to develop it further.
It shares some of the reference lifetime problems of other implementations that try to conform to the draft standard. Its exact semantics were never well-defined. Under rare conditions, it will have unexpected semantics for single-threaded applications. It fails on the example given above. We strongly discourage use for multi- threaded applications.
Rope Implementation Overview
The rope container type included in SGI's version of the STL is based loosely on the ropes in the Xerox Cedar environment or C 'cords', as described in Boehm, Atkinson, and Plass, 'Ropes: An Alternative to Strings',
A rope is represented as a pointer to _Rope_RopeRep structure, which represents a tree node. Every tree node corresponds to a piece of a rope. Although we refer to 'tree nodes', each such piece can be shared between different ropes, or can even be reused in the same rope if the corresponding substring is repeated. Thus ropes are really represented as directed acyclic graphs. Nonetheless, we will continue to refer to trees, since that is both the usual case, and more intuitive.
Each tree node contains a size field giving the length of the rope piece, a depth field specifying the depth (or height) of the tree rooted at the node, a boolean field indicating whether the subtree has been balanced, and a tag field indicating which of the four variants or subclasses of _Rope_RopeRep is used to represent the list. (The balanced bit is really of interest only for concatenation tree nodes, see below.)
It would have been possible to use virtual functions and/or RTTI to replace the use of the tag field. We chose not to pursue that route, since the tag field can be much smaller than a vtable pointer, and the tag based code is probably also faster in this case.
The 4 subclasses of _Rope_RopeRep are:
1. (_Rope_RopeLeaf) Leaves containing string characters. Short ropes are usually represented as a single such node. In the case of the standard character type, the actual array of characters is NULL-terminated to allow fast generation of an equivalent C string.
2. (_Rope_RopeConcatenation) Concatenation nodes. These have two children
3. (_Rope_RopeFunction) Function nodes. These contain a pointer to a function object that can be used to compute sections of the string. This facility makes it possible to manipulate a rope that is computed lazily as the pieces are needed. For example, it is possible to treat a file as a rope without actually reading in the entire file. Thus a text editor can represent even a 100 MB file being edited as a rope, updating it with standard rope operations, while still consuming only very small amount of memory.
4. (_Rope_RopeSubstring) Substring nodes. These contain a pointer to a base rope tree node, and a starting position within that rope. They denote a substring of the base rope. These are generated only to represent substrings of ropes that are expensive to compute explicitly. The base field never points to a concatenation tree node. If the substring operation is applied to either a very large leaf node (which can be built by converting a very long C string to a rope) or to a function node representing a long string, then it produces a substring node. Substring nodes also contain a pointer to a function object that performs the appropriate character extraction. They are a subclass of function nodes, and a number of operations treat them simply as function nodes. Many uses of ropes will never result in the generation of a substring node. They are however essential for applications that use function nodes to lazily evaluate strings.
Only concatenation nodes have nonzero depth fields. Depth fields are guaranteed to fit into a byte, since we impose a static maximum on rope depth.
The rope implementation can be compiled in two different ways. Normally __GC will not be defined. In this case each tree node will also contain a reference count field. This keeps track of the number of rope variables,