A Brief Glance at How Various Text Editors Manage Their Textual Data


Google> chinese proverb efficiency
Efficiency is doing things right; effectiveness is doing the right things.-Peter F. Drucker 
Not quite chinese, but sounds wise enough!



Summary

A review of how GNU Moe, Plan 9 Sam, Bill Joy's Ex/Vi, GNOME GtkTextView/GtkTextBuffer, GNU Emacs, and due to popular demand, Scintilla, manage their textual data. This article is subject to updates.

Foreword

This is a short review of how various text editors manage their textual data within a single buffer. Like many programmers, I have a fascination with text editors. We only want the best tools for the job - so what does the best mean? For some, they want editing to have high physical efficiency - using keyboard shortcuts and command keys, maybe the odd time the mouse comes in handy. Others want their editors to be virtually efficient - to make the most out of their resources (RAM, disk space) - or to be "small". There is a race though to have all of these - and the editors mentioned in this  article allow you to see what their authors decided to go with.

You get three choices, and may choose two:



Each editor section includes a copy of this diagram with the properties they have highlighted.

Note: This is in no way 100% accurate representation of these text editors. A text editor is a complex application, so there are many pros (and cons) that this review ignores or does not cover.

And with that, lets begin our journey.

GNU Moe

First public release: September 6th 2005


"GNU Moe is a powerful, 8-bit clean, console text editor for ISO-8859 and ASCII character encodings. It has a modeless, user-friendly interface, online help, multiple windows, unlimited undo/redo capability, unlimited line length, global search/replace (on all buffers at once), block operations, automatic indentation, word wrapping, file name completion, directory browser, duplicate removal from prompt histories, delimiter matching, text conversion from/to UTF-8, romanization, etc."

TL;DR: Almost like GNU Nano (on the surface).

From my observations, Moe is a small C++ console text editor. It is easy to use. The code base is easy to navigate, as it remains small and simple. It took me around 5-10 minutes to find insertion-related code. Like it says in the TL;DR, if Nano didn't exist, Moe would certainly be its replacement.

Moe sees its text as a list of lines with columns. It accesses the data as a 2D array: data[line][col].

The buffer is initialized as a vector with 10,000 lines. If a file is loaded with more than 10,000 lines, or surpasses this limit while editing, it will add 1/10th of the amount of max lines to the buffer.

This means Moe always uses (vector*(10000*lines)) when it's started. It isn't often I open files that are 10,000 lines, so this would be wasteful for my uses. This behavior only becomes useful after the 10,000 line mark, although a little wasteful. If say while typing we hit the 1,000,000 line mark, Moe will allocate space for 100,000 lines. This is not very resource efficient, unless you are someone who typically types an additional 100,000 lines after your millionth line. Besides that though, Moe has a simple interface, easy and memorable keyboard shortcuts, and small program size, which makes it suitable for embedded systems, but only if they are equipped with a generous amount of RAM.

Plan 9 Sam

First public release: 1980s

"Sam is an interactive multi-file text editor intended for bitmap displays. A textual command language supplements the mouse-driven, cut-and-paste interface to make complex or repetitive editing tasks easy to specify."

Sam is relatively small when compared to Vim and Emacs - in fact it is probably the second or most smallest editor mentioned in this review. It is ed's successor, but much more generic. Sam works on the character-level instead of the line-level. It is one of the first editors to separate its UI from the actual editor - Sam can be used on both the command-line and as a graphical text editor. Its command language makes it scriptable too.

Sam's data is linear.

Sam manages a buffer as a variable-sized block, starting at 8KiB.
It will expand itself by n bytes after the 8KiB point, or if n bytes is greater than 8KiB, however many times n fits into 8KiB blocks. The remainder of the bytes are put into a final 8KiB block.

This makes Sam a fairly efficient editor. From what I've read, it will cache and write blocks to disk when navigating through its data to preserve RAM. An issue that is only apparent is when you are working with huge files, but unlikely to occur. As a user, if you decide to open up a 20 GB file (bare with me), Sam will write all ((20GB/8KiB) - 8KiB) blocks to disk. Your disk may be 40GB large, and half of that is taken up by this monstrous 20GB file, that leaves us with 20GB, which lets say 10GB is taken by the operating system and other files. Now we have 10GB, which is not enough for Sam's caching. This is probably a rare case though, and is most likely safe to ignore. Overall I'd say Sam is fairly efficient at managing the growth of its buffer, but requires O(n) time to insert/delete text. It is not a very physically efficient either. Its command language definitely helps (and is the best part about Sam), but leaves for more to be desired. As mentioned earlier, its program size is fairly small too, even for being statically linked on Plan 9.

Bill Joy's Vi/Ex

First public release: 1976

"Vi (visual) is a display oriented text editor based on ex(1). Ex and vi run the same code; it is possible to get to the command mode of ex from within vi and vice-versa."

I was shocked to see what the original vi's code looked like. It was full of hacks and tried to be clever by managing the terminal (it handles 2 separate real terminals) and using it to read lines back to the buffer. It took me almost an hour to find the code that actually placed text into the buffer. On the bright side though...

...Vi is still small when compared to its bigger brother Vim. It is fast and efficient because of its design for 300 baud connections and terminals. Like Sam, Vi also has its own command language that can both be scripted and keyed, meaning the commands happen has you type.

As a descendant of Ed, Vi is also a line-based text editor.

Vi represents its buffer as an array of lines. It simply adds or deletes lines where it has to, which is very similar to how GNU Moe does it, as we discussed earlier.

When Vi is at or past the end of its buffer, it will request more lines. If the system supports pages, and the page isn't zero or a ridiculously high number, then Vi will allocate the page size divided by the size of a line in bytes, and resize the buffer.

If the page size is zero or a ridiculous number, then it will extend the buffer by 4KiB.

Otherwise if none of the above happens, Vi will extend the buffer by 1KiB worth of lines.

So Vi uses a similar method to Sam to grow its buffer. This is very efficient, but the consequence is Vi will become slow when working with huge files, because it has to traverse through a bigger linear array. It will have a hard time working with large lines. Luckily in the real world, large lines are not a common thing to encounter in source code files. Vi is mostly known for its powerful physical efficiency though, so people who use it rarely care for either of these.


GNOME GtkTextView/GtkTextBuffer

First public release: April 14th 1998

"GTK+ has an extremely powerful framework for multiline text editing. The primary objects involved in the process are GtkTextBuffer, which represents the text being edited, and GtkTextView, a widget which can display a GtkTextBuffer. Each buffer can be displayed by any number of views.
One of the important things to remember about text in GTK+ is that it's in the UTF-8 encoding."

Featureful and developed over the years, GtkTextView coupled with GtkTextBuffer makes extending functionality of this editor easy. A good example of extending these is GtkSourceView, which adds source code support. It is easy to hack and build an editor from scratch in a night (see: tyreese, my very simple editor). Unlike majority of editors, it supports UTF-8 and separates UI from the editor (actually in this case, it goes a step further and separates the UI from the editor and the editor from the buffer!). It is very well made. Essentially the definition of modern text editing.

GtkTextBuffer actually uses a GtkTextBTree. This is simply a binary tree data structure - or the proper term: rope data structure.

When text is inserted, GtkTextBTree creates new segments (nodes) and puts the new text into them.

When text is deleted, the segments are simply removed.

The only issue with this is that you have to build an editor out of these components. This makes the virtual efficiency very high, but consequently has terrible physical efficiency, and you have to invest the time into creating a reliable interface (most likely just a vi overlay). I'm surprised programmers haven't created overlays for Vi or Emacs for the GtkTextView. Definitely an opportunity for someone there. Another disadvantage is in order to use these components, you have to include the entire GTK+ library. This makes any GTK+ based editor larger than other editors (although rare, surprisingly!), but very portable. A GTK+ editor has a good chance of running on Windows, Mac, GNU/Linux, *BSD, and even in web browsers, thanks to the Broadway project.

GNU Emacs

First public release: March 20th 1985

"GNU Emacs is an extensible, customizable text editor—and more. At its core is an interpreter for Emacs Lisp, a dialect of the Lisp programming language with extensions to support text editing."

A better description would probably be "Emacs can be anything you want it to be".

I actually began to re-use Emacs again. Its extensibility is just unparalleled by anything. It also separates its UI from the editor part, allowing for different UIs and not being restricted to terminals. This is absolutely awesome for people coming from Vi. I'm one of those people. As soon as I installed Emacs 24, all I had to do to get Vi functionality was M-x list-packages, navigate to the Evil package, install, restart Emacs, and boom. I was right at home. This makes Emacs dangerously powerful. To be able to slap on any front end and get the same Emacs functionality too is just awesome.

Emacs uses a data structure called a gap buffer. It splits the text into three parts - the first part, the gap, and the second part.

The gap grows and shrinks depending on insert and deletions.

When the gap becomes too small, a new gap is created. This is very efficient for insertions and deletions, and O(n) for navigation. This makes it slightly worse than ropes, but better than the majority of solutions out there. In fact, because of its simplicity, editor creators opt for gap buffers because they make their code easier to maintain and to work with the data.

Emacs is both physically (if you for example install the Evil package) and virtually efficient, but at the price of bigger program size. In the age of computers with at least 100GB disks and at least 1GB RAM, this is not an issue. This is why I've opted for Emacs as my editor for the time being...

Scintilla

First public release: March 14th 1999

"SciTE is a SCIntilla based Text Editor. Originally built to demonstrate Scintilla, it has grown to be a generally useful editor with facilities for building and running programs. It is best used for jobs with simple configurations - I use it for building test and demonstration programs as well as SciTE and Scintilla, themselves."

Due to popular demand, I've added Scintilla to this page. Scintilla is the basis for many text editors on Windows. It is architecturally very similar to GTKTextView/GtkTextBuffer. Scintilla separates itself into 3 parts: Editor, Document, and CellBuffer. As the authors of Scintilla have noted, it is best suited for prototypes ("simple configurations").

Scintilla unfortunately opts for inferior textual data management. It is very similar to GNU Moe, using a list of lines. Unlike GNU Moe though, Scintilla is thread-safe and uses a reader-writer architecture.

Huge apology to Neil, I messed up. Here's his comment that clarifies what Scintilla really does:

"Author of Scintilla here. Scintilla does not use a list of lines. The text is stored in a gap buffer (the substance field in the code shown), like EMACS. Line start positions (added by the InsertLine method) are also stored in a gap buffer but with a 'step' which enables modifications in close proximity to affect few elements."

Scintilla parses a string until it sees a return carriage or new line control code (trying to compensate for UNIX and Windows line-endings). Once it does, it simply adds the line to the buffer.

This means Scintilla doesn't waste space while growing its buffer. It will always use the optimum amount of space for storing text in memory.

What is inefficient about this is if you have a giant line, Scintilla will slow right down, because it goes through every character in the block to determine where new lines are. So if you copy & paste from an external source, you could potentially see a slow down. As a code editor though, this should rarely be a problem. Like GtkTextView, Scintilla's Editor interface is lacking key binding overlays to allow users to use Vi or Emacs keybindings out of the box.

Conclusion

It was interesting to see how these editors each approached the textual data management problem. What amazes me even more is that this was only feasible because of the FOSS world we have. If these projects were not open source, I would have never done this study to improve my knowledge of how the tools I use every day work. Thank you FOSS enthusiasts.

If you'd like me to add an editor, email me or leave a comment here. I'd be glad to look at it for you :)

Keep being awesome!

Comments

  1. Awesome, interesting read. I was wondering if you could remark on the flywheel pattern? Is it used at all in modern editors or is it just a case of a bad example case? It doesnt seem it would match up well with these approaches.

    ReplyDelete
    Replies
    1. Did you mean Flyweight pattern (http://en.wikipedia.org/wiki/Flyweight_pattern) ?

      I'm not sure how this pattern could be applied to buffer management. As far as I know, for the graphical aspect, some editors probably use this (Emacs GUI, and GtkTextView). Other than that though I don't see where the pattern would be applicable.

      Delete
  2. How about IntelliJ?

    ReplyDelete
  3. Another suggestion, nedit, which is very very fast compared to most other GUI X11 editors.

    ReplyDelete
  4. This was great, any chance of something similar for how terminal emulators arrange their buffers, as some definitely seen slower than others when it comes to pasting for instance ?

    ReplyDelete
  5. Author of pyvim here. Thanks for the article!

    For pyvim, the code is extremely simple. We just have a Python string object representing the file content, and this seems to work really well up to 100,000 lines, thanks to Python's efficient string manipulations.

    ReplyDelete
  6. Author of Scintilla here. Scintilla does not use a list of lines. The text is stored in a gap buffer (the substance field in the code shown), like EMACS. Line start positions (added by the InsertLine method) are also stored in a gap buffer but with a 'step' which enables modifications in close proximity to affect few elements.

    ReplyDelete
    Replies
    1. Sorry for messing that up Neil. I'll add in your comment to the post. After another read of the code I used, I feel like it's a terrible representation of what happens. Do you know a better piece that would cover it better? Thank you!

      Delete
    2. Argh, the blog software ate my lengthy reply.

      The best piece of code to understand text storage is SplitVector<T>. 'Split Vector' is another term for 'Gap Buffer'.
      https://sourceforge.net/p/scintilla/code/ci/default/tree/src/SplitVector.h
      The GapTo and InsertFromArray methods are the most salient.

      Scintilla is mostly not thread-safe. There are methods for loading and saving on a background thread since these are commonly slow operations. Otherwise its up to the platforms. GTK+ hasn't supported multiple GUI threads for some years. On Windows, SendMessage can be used to call between threads but it has synchronization overhead so most calls to Scintilla go through a different mechanism that is not thread-safe with SendMessage only used for inter-thread operations.

      All editors need to be able to translate between character positions and line numbers. For example, after scrolling to line 100, the text starting from the start of line 100 should be displayed. This requires scanning (some of) the text for line end characters. Scintilla performs this eagerly by creating a line start position index upon insertion and then maintaining this as the text changes. Other editors perform line end scanning lazily or use other index structures. These have various consequences for memory use and speed but there is no obvious scheme that performs better in all cases.

      A deeper examination of editor data structures that includes the mechanisms used to translate from character positions to lines and columns to pixels and back again would be worthwhile along with ancillary data structures used for decorating the text with markers and syntax styling. Often these features are more responsible for performance than the basic text storage.

      The BasicInsertString code shown is an older and simpler version. The current version can optionally handle Unicode line ends (LS, PS, and NEL) and is much uglier ;-).
      https://sourceforge.net/p/scintilla/code/ci/default/tree/src/CellBuffer.cxx#l623

      This article covers data structures in early GUI editors:
      http://www.cs.cmu.edu/~wjh/papers/byte.html

      Delete
    3. '86 is early. Oh yeah, I'm getting old...

      Delete
  7. Very interesting. I love that you have outlined some of the basic data structures and algorithms used by some common and not so common editors. I hope there is a compendium which goes into some other ideas as to how this thing could/do work and what problems they solve (ie. the old 'never open up an sql dump file in anything that tries to syntax highlight' issue.)

    ReplyDelete
  8. How about 'atom'. It has gained a lot of popularity lately.

    ReplyDelete
  9. It would be interesting to add Atom and Brackets to the mix.

    ReplyDelete
  10. Replies
    1. Sublime underneath uses GTK, so it's reasonable to assume it uses the GTK text widgets.

      Delete
    2. Source? I thought sublime uses a home grown UI library, not GTK?

      Delete
    3. Open Sublime in a hex editor, there are gtk debug strings everywhere, along with gtktextview/edit strings too. It's as "home grown" as a fork can get really. Some searching should bring up more evidence.

      Delete
  11. Another request for details about Atom. Also, Visual Studio Code and Brackets, if you can get to it :D

    ReplyDelete
    Replies
    1. https://github.com/atom/text-buffer

      Delete
  12. Author of pyvim here. Thanks for the article!

    For pyvim, the code is extremely simple. We just have a Python string object representing the file content, and this seems to work really well up to 100,000 lines, thanks to Python's efficient string manipulations.

    ReplyDelete
  13. Another vote for Atom here. And also a vote for TextMate.

    ReplyDelete
    Replies
    1. I would love to do TextMate, but it's closed source. I could reverse engineer it but that would be a lot of work (it already is WITH sources). Atom is definitely coming. I plan on also comparing Firefox and Chrome's textarea implementations. Visual Studio Code and Brackets will also be in it. I'll probably write it in a few weeks from now, currently in the middle of writing another article that's pretty awesome :)

      Delete
    2. TextMate has actually been open source for while now! A few years ago, when the open-sourcing happened, I spent a little while looking at how it handled the text buffer. Sadly I can't remember a single thing about it now... :)

      https://github.com/textmate/textmate

      Delete
  14. The Tk toolkit's text widget makes for a good read. The Gtk text widget was based off of that. SWT's StyledText is an excellent implementation of an optimised gapped-buffer and makes for good study material.

    ReplyDelete
  15. Great article! Word 1.1a source code http://www.computerhistory.org/atchm/microsoft-word-for-windows-1-1a-source-code/ would make for interesting reading ...

    ReplyDelete

Post a Comment

Popular Posts