Efficient Text Editing Buffer

If a simple Array is used to hold lines of text, then whenever a line is edited the array must be sliced apart, the line modified, and the pieces appended back together. This works fine up to a point, beyond which the editor becomes noticeably unresponsive.

I scratched my head over this one for a while, thinking about various tree structures, skip lists, and so on. Then the answer came to me - make a data structure like a list zipper, but using arrays instead of lists. There are lots of packages implementing list zippers, but so far I have not seen one for arrays.

It will look something like this:

type alias ZippedArray a =
    { head : Array a
    , zip : a
    , zipAt : Int
    , tail : Array a
    , length : Int
    }

Most successive edits in a text editor will happen on the same line. With this structure, the current line can be the focus of the zipper, and can be modified directly without slicing/appending the array.

When editing moves to a different line, the zipper focus will be shifted. I believe Array.append is O(log n), so even for very large text files performance should be good enough when jumping around the file.
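To make that concrete, here is a hedged sketch of how refocusing might work (forward moves only; the function name focusAt is my own invention, and it assumes import Array exposing (Array)):

focusAt : Int -> ZippedArray a -> ZippedArray a
focusAt newAt za =
    if newAt <= za.zipAt then
        -- Backward moves omitted for brevity; they mirror this logic
        -- on the head array instead of the tail.
        za

    else
        -- The tail element at index i holds global line (zipAt + 1 + i).
        case Array.get (newAt - za.zipAt - 1) za.tail of
            Nothing ->
                za

            Just newFocus ->
                { head =
                    Array.append (Array.push za.zip za.head)
                        (Array.slice 0 (newAt - za.zipAt - 1) za.tail)
                , zip = newFocus
                , zipAt = newAt
                , tail = Array.slice (newAt - za.zipAt) (Array.length za.tail) za.tail
                , length = za.length
                }

Only slice and append touch the arrays, so the cost of a refocus is bounded by those operations rather than by rebuilding the whole buffer.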

Plan is to:

  • Flesh this idea out and implement a simple editor using it.
  • Check it performs well on large files (say 1M lines).
  • Look at how the buffer can be more flexible than just holding lines of Strings. Need support for color syntax highlighting.

See also https://en.wikipedia.org/wiki/Gap_buffer though not quite on point within a pure language.


Thanks, I had never heard of a gap buffer before, but that looks very close to what I am implementing; it's just that I have the 'gap' as a separate item forming the focus of the zipper.

There was also a link to the 'rope' structure, apparently used by some editors: https://en.wikipedia.org/wiki/Rope_(data_structure)

And this: https://en.wikipedia.org/wiki/Piece_table, which brings up the issue of implementing undo efficiently, something I had not considered yet.


I pushed a demo up here:

This is using my gap buffer implementation to hold the text. When the cursor is moved with up/down arrows or page up/down, the zipper focus is shifted to the line where the cursor is - so the underlying arrays are sliced and appended as you move around.

The demo is running 10K lines; locally I tried 100K and it was just as fast. At 1M it becomes very unresponsive - there is still some efficiency that could be squeezed out of the buffer algorithms, since my zipper refocussing is not optimal. I am also wondering where to draw the line: I would very rarely edit a 100K line file, let alone 1M.

Next is to implement some basic editing and better cursor behaviour.


I did a little work today to optimize the re-zipping of the buffer when the focus moves. Even on the 1M line file, the performance is not bad, just the occasional blip as the garbage collector runs.

Bear in mind that in a real editor I won’t move the focus with the cursor all the time, only when the user actually makes changes. Also edits on a line will only affect the focussed line, so no re-zipping will be needed in that case. Only when editing moves to a different line, or the line is broken or joined by inserting or removing a newline, will the entire buffer need to be re-zipped. So I think this will easily be fast enough, even for huge files.


Still a little work to do around getting the cursor to behave nicely when wrapping around line ends, but I have now added a basic editing capability to the demo. Delete and Backspace work as expected, and you can enter text.

This is all done through capturing onKeydown events. Input through a hidden text area will be the focus of my next spike.

Since I created the GapBuffer data model, it made sense to use it recursively to represent both the lines and columns of text. So I created a TextBuffer type alias like this:

type alias TextBuffer =
    GapBuffer String (GapBuffer Char Char)

The GapBuffer is modelled like this:

type alias GapBuffer a b =
    { head : Array a
    , zip :
        Maybe
            { val : b
            , at : Int
            , tail : Array a
            }
    , length : Int
    , toZip : a -> b
    , toArray : b -> a
    }

The purpose of the dual type variables a and b is to allow the contents to have different representations when focussed and when not focussed. So you can see in the TextBuffer that I am using String to hold non-focussed lines, but a GapBuffer Char Char to hold the focussed line. The toZip and toArray functions are inverses of each other and translate between the two representations as needed.
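For example (a sketch only - GapBuffer.fromList and GapBuffer.toList are hypothetical helpers, not functions shown above), the two conversion functions for TextBuffer might be:

lineToFocus : String -> GapBuffer Char Char
lineToFocus line =
    -- Both conversions are identity, since a and b are both Char here.
    GapBuffer.fromList identity identity (String.toList line)

focusToLine : GapBuffer Char Char -> String
focusToLine focusedLine =
    String.fromList (GapBuffer.toList focusedLine)

The outer buffer then uses lineToFocus as its toZip and focusToLine as its toArray.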

At the moment, I am using String for my TextBuffer lines. When I add syntax highlighting, I will move to some kind of tagged string representation, perhaps:

type alias TaggedString a = List (a, String)

Where the a type variable allows strings to be tagged and the tags can be interpreted in the view to apply coloring to the strings. Each line will be formed from a list of strings and tags.
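To illustrate (the tag names here are invented for the example), a highlighted line might then look like:

type Tag
    = Keyword
    | Plain

exampleLine : TaggedString Tag
exampleLine =
    [ ( Keyword, "type alias" ), ( Plain, " TextBuffer = ..." ) ]

The view can pattern match on the tag to choose a color for each string segment.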

Using a scheme like this, when swapping between the non-focussed and focussed line representations, I would likely stick with GapBuffer Char Char for the focussed representation. Then in the toArray function, I would parse (or re-lex) the characters to derive the TaggedString representation with the correct syntax highlighting.

One problem that I have not figured out yet is what to do when the lexer has state. In an editor you can enter a single double quote mark " and, if the language you are working with has strings, you might see all subsequent lines of text turn a different color - they were code, now they are strings. If your editor is a bit slow, you might even see this re-coloring ripple down the buffer. How do I handle this? What if the buffer is very large and the ripple takes a long time to run down it?

For now, I will just do a stateless lexer - maybe color words depending on whether they start with a consonant or a vowel, for example.
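Such a stateless lexer could be as simple as this sketch, which tags each word by its first letter (names are my own for the example):

type LetterTag
    = Vowel
    | Consonant

tagWord : String -> ( LetterTag, String )
tagWord word =
    case String.uncons word of
        Just ( first, _ ) ->
            if List.member (Char.toLower first) [ 'a', 'e', 'i', 'o', 'u' ] then
                ( Vowel, word )

            else
                ( Consonant, word )

        Nothing ->
            ( Consonant, word )

tagLine : String -> List ( LetterTag, String )
tagLine line =
    List.map tagWord (String.words line)

Because each word is tagged independently, an edit on one line can never change the highlighting of any other line.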

Cursor movement is now improved: Home/End work, lines wrap, a target column is maintained when moving around beyond the line end, and so on.

So I’ve been thinking about this and had a few suggestions on Slack too. I think what is needed is for each line to maintain a lexing context. For this string quoting example, the context might be:

type Context
    = Normal
    | Quoted

Each line should store the context at the start and end of the line. If an update is made to a line, and that update changes the context at the end of the line, then it will be necessary to ripple the effect of that down the buffer. So I will need a ripple function something like this:

type RippleOutcome
    = Continue
    | Paused
    | Done

ripple : (a -> a -> Int -> RippleOutcome) -> GapBuffer a b -> GapBuffer a b
ripple rippleFn buffer =
    ...

The user will supply the rippleFn implementation. It gets the previous line, the current line, and the index of the current line, and indicates whether the ripple should continue, pause, or is complete. So if a line starts Normal but the previous line ends Quoted, that line would need to be re-lexed (by passing it through buffer.toZip >> buffer.toArray). If that changes the line's end context, the ripple should Continue; if the end context does not change, the ripple is Done.

The line index is also passed to the function, and this can be compared with the currently visible region of the buffer. If the ripple has passed off the bottom of the screen, it can be Paused. If the user scrolls down, the ripple would need to be started up again.
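As a hedged sketch (the LexedLine fields and the lastVisibleLine parameter are my own names, not settled design), a rippleFn for this quoting example might look like:

type alias LexedLine =
    { start : Context
    , end : Context
    }

quotedRippleFn : Int -> LexedLine -> LexedLine -> Int -> RippleOutcome
quotedRippleFn lastVisibleLine previous current index =
    if current.start == previous.end then
        -- The contexts already agree, so the re-lexing has converged.
        Done

    else if index > lastVisibleLine then
        -- The ripple has passed off the bottom of the screen.
        Paused

    else
        -- Mismatch: this line must be re-lexed and the ripple goes on.
        Continue

Partially applying the visible region, as in ripple (quotedRippleFn lastVisible) buffer, yields a function matching the rippleFn signature above.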

Another alternative is that the user does not scroll down right away, but makes more edits above where the previous ripple has run to. In that case a new ripple must be started for those edits. That might be Done before it reaches where the previous ones got to, in which case the previous ones remain pending. But it could also overtake some of the previous ripples, in which case they should be promoted to Done and discarded as no longer relevant since newer information has made them redundant.

If a lexer is stateless, the rippleFn would always just return Done.

This suggests that the line model needs to be something like:

type alias Line tag ctx =
    { start : ctx
    , end : ctx
    , taggedStrings : List ( tag, String )
    }

So I jumped in and implemented the quoted/non-quoted lexer for syntax highlighting, rather than something stateless. This will help me test the pathological case, which is the user entering " on the first line, then Ctrl+End to jump to the end of the file.

The current implementation is using a 500 line file, and rippling the entire file on every change anyway. It's a bit slow even on such a small file, so the optimisations will definitely be needed:

Got all this rippling stuff working now. :boom:

You can test the pathological case by entering a single " on the first line and then Ctrl+End to jump to the end of the file. This must ripple down the entire file, so you can see it takes a little while. The demo is currently set to 10K lines, on which it is not too bad.

You can also PageDown a few times, enter a ", then jump back to the top and enter another ". Then jump to the end and you will see it is a lot quicker to get there, as the ripple from the start catches up with the second one, and the lines beyond that are already in the correct state since they were never processed. The rippling completes at that point, so only a small section of the file is processed.

There might still be a few bugs, but I am calling this spike a success, since I completed everything I set out to:

  • Flesh this idea out and implement a simple editor using it.
  • Check it performs well on large files (say 1M lines) - not good on the pathological case.
  • Look at how the buffer can be more flexible than just holding lines of Strings. Need support for color syntax highlighting.

===

If I were doing this in a multi-threaded language, I would do the ripple processing on a background thread, so the pathological case would not make the editor unresponsive. This is not an option in Elm.

I could work around this by doing the ripple processing in chunks, one chunk per message put through the update function - cooperative multi-tasking, in other words. I've done similar before and the results are never as good as you might like for keeping the UI responsive, but it is still an option.
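A sketch of that cooperative chunking in The Elm Architecture (everything here is hypothetical: the RippleChunk message, a Model with a buffer field, and a rippleChunk function that processes a bounded number of lines and reports whether it finished; requires import Task):

update : Msg -> Model -> ( Model, Cmd Msg )
update msg model =
    case msg of
        RippleChunk ->
            let
                ( steppedBuffer, finished ) =
                    rippleChunk 500 model.buffer
            in
            ( { model | buffer = steppedBuffer }
            , if finished then
                Cmd.none

              else
                -- Queue the next chunk as a fresh message, letting other
                -- events interleave between chunks.
                Task.perform (always RippleChunk) (Task.succeed ())
            )

The chunk size trades throughput against responsiveness: smaller chunks keep keystrokes snappy but make the ripple take longer overall.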

Another workaround is to have the editor always close any " by auto-inserting "", which should reduce the frequency with which this problem is hit.

===

As I worked on the implementation, things took their own direction and the code has come out a little differently to my initial sketch (doesn’t it always?).

I changed the toZip and toArray function names to toFocus and fromFocus, as this seemed better. The fromFocus function now also gets a first Maybe parameter, which is the gap buffer element one item before the focus being tidied away. The reason for this is so that any context from the previous line can be taken account of - if the previous line ended inside a quoted string, the next line's lexical context will also start out inside a quoted string:

type alias GapBuffer a b =
    { head : Array a
    , zip :
        Maybe
            { val : b
            , at : Int
            , tail : Array a
            }
    , length : Int
    , toFocus : a -> b
    , fromFocus : Maybe a -> b -> a
    }

With that change the ripple function can just be:

type alias TextBuffer tag ctx =
    { lines : GapBuffer (Line tag ctx) (GapBuffer Char Char)
    , ripples : Set Int
    }

rippleTo : Int -> TextBuffer tag ctx -> TextBuffer tag ctx
rippleTo to buffer =

Every change to the TextBuffer results in a ripple being created at the line on which the change was made. These ripples are held in a Set so they are automatically de-duplicated, and also Set is ordered, so they can be processed in order from earliest to latest.

Any ripple starting before the to parameter will be processed. If a ripple stops uncompleted at the to parameter, it is retained for further processing from that point, when that is needed. Any ripple that gets overtaken by an earlier one is discarded, as there is now no need to process it.

This allows incremental processing of all the ripples, so is nice and efficient.
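The bookkeeping on the ripples Set might look something like this sketch (helper names invented; requires import Set exposing (Set)):

startRipple : Int -> Set Int -> Set Int
startRipple atLine ripples =
    -- Set.insert de-duplicates automatically if a ripple is already
    -- pending at this line.
    Set.insert atLine ripples

discardOvertaken : Int -> Set Int -> Set Int
discardOvertaken reachedLine ripples =
    -- Any pending ripple at or before the line an active ripple has
    -- reached is redundant, since newer information has superseded it.
    Set.filter (\start -> start > reachedLine) ripples

Because Set Int is ordered, folding over it visits the pending ripples from earliest to latest, which is exactly the processing order rippleTo needs.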
