ECS performance

(2705 words)

Yes, premature optimization is the killer of all good programming projects, but it’s my project and therefore I can do what I want!

Firstly, we need some way to measure the performance of the system, because without a baseline it is almost impossible to know whether any performance improvements have actually worked. This is just a simple case of creating some benchmark tests.

Some interesting things to note here. The first run of any test is mostly pointless, as we would be more likely to be measuring cold caches and page faults along the code execution path than the actual speed of the code we are benchmarking. We also want to run things more than once, because any variation in what the system is doing, for example me typing on the keyboard in neovim, could impact the benchmark. So we run it a number of times and keep the iteration that runs the quickest. No, we don't care about averages in these sorts of tests; we just want to remove as many variables as possible, like being task switched.
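As a minimal sketch of that idea (the names benchmark, body, and iterations are mine, not the actual test code), the harness boils down to one discarded warm-up run followed by a keep-the-fastest loop:

```zig
const std = @import("std");

// Hypothetical benchmark harness: run the body once to warm the caches,
// then time a number of iterations and keep only the fastest one.
pub fn benchmark(comptime body: fn () void, iterations: usize) u64 {
    body(); // warm-up run, result discarded

    var best: u64 = std.math.maxInt(u64);
    var timer = std.time.Timer.start() catch unreachable;

    for (0..iterations) |_| {
        timer.reset();
        body();
        const elapsed = timer.read();
        if (elapsed < best) best = elapsed;
    }

    return best; // nanoseconds for the fastest iteration
}
```

Taking the minimum rather than the mean is exactly the "remove variables" choice described above: any iteration that got task switched or hit a cold cache simply loses.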

First test:

We have Position and Velocity structs that just use x and y fields that are f32, and a movement system that gets the position and velocity and updates them.

const Position = struct { x: f32, y: f32 };
const Velocity = struct { dx: f32, dy: f32 };
pub fn movement(iter: *const ng.SystemIterator) void {
    const dt = iter.delta_time;

    for (iter.entities) |entity| {
        if (entity.getPtr(Position)) |position| {
            if (entity.get(Velocity)) |velocity| {
                position.x += velocity.dx * dt;
                position.y += velocity.dy * dt;
            }
        }
    }
}

This takes 637 ns per entity to update things. That’s, err, rather slow. Of course it is, it is doing a couple of hash table lookups for each entity and also checking that the component is mutable according to the system definition.

However, before we do anything radical, let’s just try compiling the tests with ReleaseFast rather than Debug mode. Ok, a lot better, but still slow. 19.4 ns per entity.

What if we used a vector Vec2 instead of individual struct fields?

const Position = struct { pos: ng.Vec2 };
const Velocity = struct { vel: ng.Vec2 };

Because we want to multiply the delta time with the velocity in a single operation, we @splat the single f32 into a Vec2.

pub fn movement(iter: *const ng.SystemIterator) void {
    const dt: ng.Vec2 = @splat(iter.delta_time);

    for (iter.entities) |entity| {
        if (entity.getPtr(Position)) |position| {
            if (entity.get(Velocity)) |velocity| {
                position.pos += velocity.vel * dt;
            }
        }
    }
}

This has made no difference to the speed: the same 19.4 ns per entity. Why didn't this make things any better?

Looking at the code produced might be interesting. When using the explicit x, y fields we have:

6cd53:       c5 fb 10 04 c2          vmovsd (%rdx,%rax,8),%xmm0
                        position.x += velocity.dx * dt;
                        position.y += velocity.dy * dt;
6cd58:       c5 f8 59 85 30 ff ff    vmulps -0xd0(%rbp),%xmm0,%xmm0
6cd5f:       ff
6cd60:       48 8b 85 70 ff ff ff    mov    -0x90(%rbp),%rax
6cd67:       c4 c1 7b 10 4c c5 00    vmovsd 0x0(%r13,%rax,8),%xmm1
6cd6e:       c5 f0 58 c0             vaddps %xmm0,%xmm1,%xmm0
6cd72:       c4 c1 78 13 44 c5 00    vmovlps %xmm0,0x0(%r13,%rax,8)

When we use Vec2, we have:

6ba1b:       c5 fb 10 04 c8          vmovsd (%rax,%rcx,8),%xmm0
                        position.pos += velocity.vel * dt;
6ba20:       c5 f8 59 85 30 ff ff    vmulps -0xd0(%rbp),%xmm0,%xmm0
6ba27:       ff
6ba28:       48 8b 85 70 ff ff ff    mov    -0x90(%rbp),%rax
6ba2f:       c4 c1 7b 10 4c c5 00    vmovsd 0x0(%r13,%rax,8),%xmm1
6ba36:       c5 f8 58 c1             vaddps %xmm1,%xmm0,%xmm0
6ba3a:       c4 c1 78 13 44 c5 00    vmovlps %xmm0,0x0(%r13,%rax,8)

Basically, it's the same. Both are moving the velocity into the xmm0 register, multiplying it by the stack variable (dt), and then adding it to the position. Optimizing compilers are wonderful sometimes.

How can we improve this? Well, within the inner loop we are calling getPtr and get, and those are calling into an AutoArrayHashMap. The point of course is that we know our EntityIndex is just a u24 and can therefore be used as the 'hash' directly rather than going through some complex and time-consuming hash function.

const ComponentContext = struct {
    pub fn hash(_: ComponentContext, key: EntityIndex) u32 {
        return key;
    }
    pub fn eql(
        _: ComponentContext,
        a: EntityIndex,
        b: EntityIndex,
        _: usize,
    ) bool {
        return a == b;
    }
};

First, we define our own ComponentContext struct that we can pass into the ArrayHashMapUnmanaged function.

store: std.ArrayHashMapUnmanaged(
    EntityIndex,
    Component,
    ComponentContext,
    false,
) = .empty,

Then we can just use that ComponentContext, with the value false stating that we are not going to store the computed hashes as a separate block of memory, and will instead rely on the hash and eql functions defined above against the key table.

Ok, that improved things to about 13.5 ns. Not bad, but I'm sure we could do better. The next obvious step would be to move to archetype data storage, but I think that is a step too far at this point in the project. Yes, it would be fun to implement and probably faster, but it is not a five-minute job. If an optimization can't be done in five minutes and can't provide a 25% improvement then don't do it, yet. We've gone from 19.4 ns to 13.5 ns, and that's more than a 25% improvement.

The other little improvement that I've made is to the saving of the ECS database. I was playing around with things and all of a sudden the program started crashing. I quickly isolated it to the autosave system, which was calling the ecs save function. The basic problem was that I was creating an interned string store using one allocator and then resizing it using another one. This obviously caused some problems. To resolve this, I placed all the saving functionality into a single SaveMemory struct that uses a single allocator for both the save data, called memory, and the string intern store, called strings. This meant that the save function just became:

pub fn save(alloc: std.mem.Allocator) ![]u8 {
    var memory = SaveMemory.init(alloc);

    const head_offset = try memory.write_header("HEAD");
    try memory.write(u32, 0);

    inline for (std.meta.fields(ComponentFieldKind)) |field| {
        try memory.write_int(field.value);
        try memory.write_string(field.name);
    }

    try memory.write_header_length(head_offset);

    try save_component_info(&memory);
    try save_entity_info(&memory);
    try save_recycled(&memory);

    const strings_offset = try memory.save_strings();
    try memory.write_u32_at(head_offset + 8, strings_offset);

    return memory.toOwnedSlice();
}

This not only cleaned up the implementation a lot, it also kept all allocations for the save in a single allocator that had the lifetime of this function, at least up until the toOwnedSlice call.
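For reference, the shape of SaveMemory is roughly the following. This is a sketch based on the description above; the field names memory and strings come from the text, but the types and method bodies are my assumptions:

```zig
const std = @import("std");

// Hypothetical shape of the SaveMemory struct: a single allocator owns
// both the output buffer and the interned string store, so their
// lifetimes can never get out of sync.
const SaveMemory = struct {
    alloc: std.mem.Allocator,
    memory: std.ArrayListUnmanaged(u8) = .empty,
    strings: std.StringArrayHashMapUnmanaged(u32) = .empty,

    pub fn init(alloc: std.mem.Allocator) SaveMemory {
        return .{ .alloc = alloc };
    }

    // Append the raw bytes of any fixed-size value to the save buffer.
    pub fn write(self: *SaveMemory, comptime T: type, value: T) !void {
        try self.memory.appendSlice(self.alloc, std.mem.asBytes(&value));
    }

    // Hand ownership of the buffer to the caller; everything else is freed.
    pub fn toOwnedSlice(self: *SaveMemory) ![]u8 {
        self.strings.deinit(self.alloc);
        return self.memory.toOwnedSlice(self.alloc);
    }
};
```

Because every append and resize goes through self.alloc, the mixed-allocator crash described above becomes impossible by construction.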

Time to move onto something much more important… User Interface functionality. Last time we touched the UI system, we could display windows and do some simple layout of things inside that window, but we could not interact with those things. We couldn’t push buttons.

Now we can. Firstly, we have reorganized the UI objects into their own separate zig files. This means we have a Window.zig file, and a Button.zig file, etc. The functionality for each UI object can then be placed into these files. However, before we could get there, we had to walk the tree of UI objects to find which object to send events to. The following code was used:

var iter = self.children();
while (iter.next()) |child| {
    const obj = get(child) catch continue;
    if (@reduce(.And, last_mouse > obj.pos) and
        @reduce(.And, last_mouse < obj.pos + obj.size))
    {
        if (obj.process_event(child, event)) {
            return true;
        }
    }
}

This iterates through each child of an object, checking whether the last mouse position is within that child, and processing the event if so. This is done recursively until we find the deepest child in the tree that contains the mouse. If no child contains it, then either we are the deepest object in the tree or none of our children are under the mouse, and we call the process event function for the type of this object.

switch (self.data) {
    .window => |*win| return win.process_event(handle, event),
    .button => |*button| return button.process_event(handle, event),
    .vbox => |*vbox| return vbox.process_event(handle, event),
    .hbox => |*hbox| return hbox.process_event(handle, event),
    .text => |*text| return text.process_event(handle, event),
}

As we add more UI object types, this switch statement will become larger. Now, if these functions return true, then the event processing is stopped. In other words, the event has been consumed by the object. If not, then we return false, which lets the loop above continue so that the parent can process the event. Effectively this means that we can process events on the way down the tree, and if a child doesn't process an event the parents can on the way back up again. If nothing processes the event, then we give it back to the application; today that would be the map movement interaction.

The functionality for a button can then just be added. This was really simple.

if (ui.find_object(parent, ident)) |handle| {
    var object = ui.get(handle) catch return null;
    object.active = true;

    ui.move_child_last(parent, handle);
    ui.push_build_stack(handle);

    if (object.data.button.clicked > 0) {
        const clicked = object.data.button.clicked;
        object.data.button.clicked = 0;
        return clicked;
    } else {
        return null;
    }

In the begin_button function, we first check if an object with the same ident was registered in the last frame. If it was, then we can update that object. First we move that object to be the last child of the parent, keeping the order of the children consistent with the calling order. Then we push this object onto the build stack to allow children of this object to be created, such as the text object within the button.

We also get the number of times the button has been clicked since the last frame and return that, or null. Why not return a bool? Well, if you are running at a very low frame rate it is possible to miss button presses if you happen to click a button multiple times very quickly. Accumulating the number of clicks and returning that allows all of those clicks to be processed.

If the button was not found, then we have to create one.

} else {
    const handle = ui.new() catch |err| {
        log.err("button {}", .{err});
        return null;
    };

    const object = ui.get(handle) catch return null;

    object.* = .{
        .ident = ident,
        .active = true,
        .data = .{
            .button = .{
                .min_size = .{ options.width, options.height },
            },
        },
    };

    ui.add_child_last(parent, handle);
    ui.push_build_stack(handle);
}

How do we process events? Remember above where we walked the tree of objects and found the deepest child that the mouse was within. Given that, we don't really care about the position of the mouse, just that we were called.

pub fn process_event(self: *Button, handle: Handle, event: ng.Event) bool {
    switch (event) {
        .mouse_down => |ev| return self.process_mouse_down(handle, ev),
        .mouse_double_click => |ev| return self.process_mouse_down(handle, ev),
        .mouse_up => |ev| return self.process_mouse_up(handle, ev),
        else => {},
    }
    return false;
}

We only really care about mouse down, double click, and mouse up events. Everything else, at this time, is ignored by returning false.

fn process_mouse_down(self: *Button, handle: Handle, event: ng.MouseEvent) bool {
    if (event.button == .left) {
        self.clicked += 1;
        self.pressed = true;
        ui.captured_mouse = handle;
    }
    return true;
}

The process_mouse_down function just checks that we are pressing the .left button, and then increments the clicked value, remembers that the button is pressed, and captures the mouse.

fn process_mouse_up(self: *Button, _: Handle, event: ng.MouseEvent) bool {
    if (event.button == .left) {
        self.pressed = false;
        ui.captured_mouse = null;
    }
    return true;
}

The process_mouse_up function reverts this. What is the pressed value used for?

pub fn draw(self: Button, obj: *const Object) void {
    if (self.pressed) {
        ui.draw_rectangle(obj.pos, obj.size, self.pressed_color);
    } else {
        ui.draw_rectangle(obj.pos, obj.size, self.background_color);
    }
}

Well, if the button is pressed then we change the background color of the button. This does leave one question open. How do we not interact with the Text object that is inside the button?

pub fn process_event(_: *Text, _: Handle, _: ng.Event) bool {
    return false;
}

For the Text object, we never process events. The text is just a label. It cannot be clicked, it cannot be moved, it cannot be scrolled. It always returns false to any event.

To test this, I wrote a little bit of code that changed what was displayed in a debug window based on the number of button clicks and also based on time.

ng.ui_add_text(.{}, "{} frames", .{state.frame_counter});
ng.ui_add_text(.{}, "{} clicks", .{state.button_clicks});
// ng.ui.end_hbox ();

if (ng.ui_add_button(.{
    .text = "Button",
    .width = 100,
    .height = 40,
    .padding = .{ .top = 8, .left = 8, .right = 8, .bottom = 8 },
})) |count| {
    state.button_clicks += count;
}

if (state.button_clicks > 0 and state.button_clicks < 30) {
    ng.ui_add_text(.{}, "abcdefghijklmnopqrstuvwxyz", .{});
}
if (state.button_clicks > 10) ng.ui_add_text(.{}, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", .{});
if (state.button_clicks > 20) ng.ui_add_text(.{}, "0123456789", .{});

if (debug_state) {
    ng.ui_add_text(.{}, "State On {d:0.3}", .{debug_timer});
} else {
    ng.ui_add_text(.{}, "State Off {d:0.3}", .{debug_timer});
}

debug_timer -= state.dt;
if (debug_timer < 0) {
    debug_timer += 1;
    debug_state = !debug_state;
}

This displays the number of frames rendered and the number of button clicks. It then adds a button, with the imaginative label of "Button". If that returns a value, then we increment state.button_clicks by that amount, and based on that value we display some static text. We also have one of either a State On or a State Off message, toggled every second by a timer driven from the frame delta time.

This raised one little issue. Each time we add a new text item, and then later not add it, the UI object is kept around. This leaks memory, and we don’t want to leak memory. To resolve this, I changed the UI’s Handle to be a packed struct just like in the ECS system:

pub const Handle = packed struct(u32) {
    idx: HandleIndex,
    gen: HandleGeneration,
};
I also created generations and recycled arrays to manage these handles. This means that we can recycle the UI object handles just like we do with an ECS entity identifier. We can also delete them. Deletion in the UI is implicit: for each object, each frame, we keep a note of whether the object has been added again.

var iter = ui_walk(window);
while (iter.next()) |handle| {
    const obj = get(handle) catch return;
    if (obj.active) {
        obj.shown = obj.active;
        obj.active = false;
    } else {
        obj.shown = false;
    }
}

If it was not, then we find all of these objects in a new function called delete_not_shown_in. This iterates over all the children of an object and removes any that were not shown. This only applies within windows, not to the windows themselves, which means that the position of a window, when reopened, stays the same.

object = first_window;
while (object) |handle| {
    const obj = get(handle) catch return;
    object = obj.succ;

    delete_not_shown_in(handle);
}

fn delete_not_shown_in(phandle: Handle) void {
    const parent = get(phandle) catch return;
    var child: ?Handle = parent.first_child;
    while (child) |chandle| {
        const obj = get(chandle) catch return;
        child = obj.succ;

        if (obj.first_child) |_| {
            delete_not_shown_in(chandle);
        }

        if (obj.shown == false) {
            remove_child(phandle, chandle);
            delete(chandle);
        }
    }
}

The remove_child function is a very simple function that unlinks the child from the parent's linked list of children. We then call delete on the child handle, which does the obvious thing, especially if you remember the ECS blog.

used_objects.items[handle.idx] = false;
generations.items[handle.idx] +%= 1;
recycled.append(allocator, handle.idx) catch return;

Obviously, the new function was changed to pop an index from the recycled array and return that rather than always creating a new UI object. To help with this, we have a used_objects array that denotes whether an object is valid or not.

if (recycled.pop()) |index| {
    const gen = generations.items[index];
    used_objects.items[index] = true;
    return .{ .idx = index, .gen = gen };
}

To test all of this, I hooked up the tab key to a temporary variable called slow_frame_rate and called sleep(0.5) if it was true. This allowed me to test the multiple clicks, and the multiple things added or removed per frame with ease.
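Wired up, that test hook looks something like this. This is a hypothetical sketch: the event shapes and the ng.sleep call are assumptions modelled on the engine code shown elsewhere in this post.

```zig
// Hypothetical debugging hook: toggling slow_frame_rate with the tab key
// forces roughly a 2 fps frame rate, making multi-click accumulation and
// per-frame add/remove behaviour easy to exercise by hand.
var slow_frame_rate: bool = false;

fn process_debug_event(event: ng.Event) void {
    switch (event) {
        .key_down => |key| if (key.key == .tab) {
            slow_frame_rate = !slow_frame_rate;
        },
        else => {},
    }
}

fn end_of_frame() void {
    if (slow_frame_rate) {
        ng.sleep(0.5); // artificially stall the frame
    }
}
```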

We are now up to 13912 lines of code.

-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Zig                             29           1714           1519          13912
-------------------------------------------------------------------------------

Next week is going to be drawing some map stuff. We have nodes and links in the ECS system, and I think it is time to start drawing these and maybe even interact with them.

Fourth, Game, Ecs