Hi,
On Mon Nov 11, 2024 at 13:15:52 +0000, Richard Clark wrote:
Your explanation needs a lot more detail, as it raises more questions than it answers. I specifically did not use irq-based messaging because it does not provide the handshaking that I need. Sending a signal that a message is ready, without the ability to receive some sort of acknowledgement event in return, would force the sender into a painfully slow and inefficient polling loop. The ipc_call function is perfect for this purpose, as it not only provides the acknowledgement that the receiver has processed the message but can return a status as well. All event-driven, with no polling and no delays. The event-driven handshake has to exist so that the sender knows when it is safe to begin sending the next message... how does an irq do this? It is only a one-way signal. Your irq messaging example can only send one message and then has to poll shared memory to know when the receiver has gotten it. They all use the same underlying ipc functions, just with different kernel object types, so I don't understand why an ipc_call would be slow and an irq would be faster. In all cases, the return handshake is required to avoid polling.
I don't know the details of your mechanism; I just know that shared-memory communication can work with a shared buffer plus a notification in each direction. Virtio, for example, uses exactly this: notifications (IRQs) are sent both ways, so the notification coming back is the acknowledgement the sender waits for. It is a rather asynchronous model; other use cases might need other approaches. Of course, polling should not be used, except by a thread that sits alone on a core and is dedicated to exactly that.
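To make that a bit more concrete, below is a minimal sketch of what I mean, not a description of your mechanism and not the definitive way to do it. The ring layout and sizes are arbitrary, and notify_peer()/wait_for_peer() are placeholder names I made up for this mail; on L4Re they would map to triggering the peer's Irq capability and blocking on one's own attached Irq (check l4/sys/irq.h for the exact bindings in your tree). The point is that the receiver's notification back is the acknowledgement, so neither side ever polls.

/* Sketch only: single-producer/single-consumer ring in shared memory,
 * one notification object per direction.  notify_peer()/wait_for_peer()
 * are placeholders for triggering the peer's Irq and blocking on our own.
 */
#include <stdatomic.h>
#include <string.h>

enum { RING_SLOTS = 256, MSG_SIZE = 64 };

struct ring {
  _Atomic unsigned head;             /* next slot the producer fills    */
  _Atomic unsigned tail;             /* next slot the consumer reads    */
  char slot[RING_SLOTS][MSG_SIZE];   /* payload area in shared memory   */
};

/* Placeholders: signal the other side / block until it signals us. */
void notify_peer(void);
void wait_for_peer(void);

/* Producer: enqueue one message (len <= MSG_SIZE), then notify. */
void ring_send(struct ring *r, const void *msg, unsigned len)
{
  while (atomic_load(&r->head) - atomic_load(&r->tail) == RING_SLOTS)
    wait_for_peer();                 /* ring full: block until acked    */

  unsigned head = atomic_load(&r->head);
  memcpy(r->slot[head % RING_SLOTS], msg, len);
  atomic_store(&r->head, head + 1);
  notify_peer();                     /* tell the consumer there is work */
}

/* Consumer: block for a notification, drain everything pending, ack. */
void ring_recv_loop(struct ring *r, void (*handle)(const void *msg))
{
  for (;;) {
    while (atomic_load(&r->tail) == atomic_load(&r->head))
      wait_for_peer();               /* nothing pending: block, no polling */

    while (atomic_load(&r->tail) != atomic_load(&r->head)) {
      unsigned tail = atomic_load(&r->tail);
      handle(r->slot[tail % RING_SLOTS]);
      atomic_store(&r->tail, tail + 1);
    }
    notify_peer();                   /* ack: slots freed, producer may go on */
  }
}

A real implementation additionally has to guard against lost wakeups, and, like virtio, would suppress notifications the other side does not currently need.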
Your comment to not use malloc is extremely confusing. I've also seen your response that using a lot of small malloc/free calls will slow down the kernel. That just can't be correct. Malloc is one of the most used and abused calls in the entire C library. If it is not extremely fast and efficient, then something is seriously wrong with the underlying software. Please confirm that this is the case. Because if true, then I will have to allocate a few megabytes up front in a large buffer and port over my own malloc to point to it. Again, this just doesn't make sense. Can I not assign an individual heap to each process? The kernel should only hold a map to the large heap space, not each individual small buffer that gets malloc'ed. The kernel should not even be involved in a malloc at all.
Right, the kernel has no business with malloc and free (apart from the low-level job of providing memory pages to the process). malloc and free are a purely user-level implementation that works on a chunk of memory the process already owns. The malloc you get is the one from uClibc; it is as fast (or slow) as that implementation is, with nothing L4-specific about it.
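If you want a quick number for what malloc contributes on its own, something along these lines would do. It is only a rough sketch: the iteration count and the 50-byte size are arbitrary, and it assumes clock_gettime() is available in your libc (a TSC-based timer would work just as well).

/* Rough sketch: time malloc/free pairs in isolation to see how much of
 * the per-message budget they cost.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum { ITERATIONS = 100000, MSG_BYTES = 50 };

static double now_ns(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void)
{
  volatile char sink = 0;
  double start = now_ns();

  for (int i = 0; i < ITERATIONS; ++i) {
    char *p = malloc(MSG_BYTES);
    p[0] = (char)i;          /* touch the block so it is not optimized away */
    sink ^= p[0];
    free(p);
  }

  double per_pair = (now_ns() - start) / ITERATIONS;
  printf("malloc/free pair: ~%.0f ns (sink=%d)\n", per_pair, (int)sink);
  return 0;
}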
I do need to benchmark my message-passing exactly as is, with malloc and free, and signals and waits and locks and all. I am not interested in individual component performance, but need to know the performance when it is all put together in exactly the form that it will be used. If 3 or 4 messages per millisecond is real, then something needs to get redesigned and fixed. I can't use it at that speed.
Sure, you need the overall performance; however, to understand what is going on, looking into the individual phases can be a good thing. What do you do with signals, waits, and locks? Is your communication within one process or among multiple processes? Or a mix of both?
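What I mean by looking at the individual phases is roughly the following: wrap each step of the send path in a timestamp and accumulate per-phase totals, so that one run tells you whether the allocation, the condition-variable hand-off, or the IPC itself dominates. The do_alloc()/do_copy()/do_signal()/do_ipc() functions below are empty placeholders standing in for your real steps.

#include <stdio.h>
#include <time.h>

enum phase { PH_ALLOC, PH_COPY, PH_SIGNAL, PH_IPC, PH_COUNT };
static const char *phase_name[PH_COUNT] = { "alloc", "copy", "signal", "ipc" };
static double phase_ns[PH_COUNT];

static double now_ns(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec * 1e9 + ts.tv_nsec;
}

/* Charge the duration of one statement to a phase counter. */
#define TIMED(ph, stmt) \
  do { double t0_ = now_ns(); stmt; phase_ns[ph] += now_ns() - t0_; } while (0)

/* Placeholders for the real steps of the send path. */
static void do_alloc(void)  { }
static void do_copy(void)   { }
static void do_signal(void) { }
static void do_ipc(void)    { }

int main(void)
{
  enum { MESSAGES = 10000 };

  for (int i = 0; i < MESSAGES; ++i) {
    TIMED(PH_ALLOC,  do_alloc());
    TIMED(PH_COPY,   do_copy());
    TIMED(PH_SIGNAL, do_signal());
    TIMED(PH_IPC,    do_ipc());
  }

  for (int i = 0; i < PH_COUNT; ++i)
    printf("%-6s %8.1f ns/msg\n", phase_name[i], phase_ns[i] / MESSAGES);
  return 0;
}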
Our applications involve communications and message passing. They are servers that run forever, not little web applications. We need to process hundreds of messages per millisecond, not single digits. So this is a huge concern for me.
Understood.
I'll go break things up to find the slow parts, to test them one at a time, but your help in identifying more possible issues would be greatly appreciated.
Thanks, will do my best.
Adam
-----Original Message-----
From: Adam Lackorzynski adam@l4re.org
Sent: Monday, November 11, 2024 5:29 AM
To: Richard Clark richard.clark@Coheretechnology.us; l4-hackers@os.inf.tu-dresden.de
Subject: Re: Throughput questions....
Hi Richard,
for shared-memory-based communication I'd suggest using L4::Irqs instead of IPC messages, especially ipc-calls, which involve a full round trip. Please also do not use malloc within a benchmark (or benchmark malloc separately, to understand how the time is split between L4 operations and libc). On QEMU the results should be okay when running with KVM, less so without KVM.
I do not have a recommendation for an AMD-based laptop.
Cheers, Adam
On Thu Nov 07, 2024 at 13:36:06 +0000, Richard Clark wrote:
Dear L4Re experts,
We now have a couple projects in which we are going to be utilizing your OS, so I've been implementing and testing some of the basic functionality that we will need. Namely that would be message passing.... I've been using the Hello World QEMU example as my starting point and have created a number of processes that communicate via a pair of unidirectional channels with IPC and shared memory. One channel for messages coming in, one channel for messages going out. The sender does an IPC_CALL() when a message has been put into shared memory. The receiver completes an IPC_RECEIVE(), fetches the message, and then responds with the IPC_REPLY() to the original IPC_CALL(). It is all interrupt/event driven, no sleeping, no polling. It works. I've tested it for robustness and it behaves exactly as expected, with the exception of throughput.
I seem to be getting only 4000 messages per second. Or roughly 4 messages per millisecond. Now there are a couple malloc() and free() and condition_wait() and condition_signal()s going on as the events and messages get passed through the sender and receiver threads, but nothing (IMHO) that should slow things down too much. Messages are very small, like 50 bytes, as I'm really just trying to get a handle on basic overhead. So pretty much, yes, I'm beating the context-switching mechanisms to death...
My questions:
- Is this normal(ish) throughput for a single-core x86_64 QEMU system?
- Am I getting hit by a time-sliced scheduler issue, with most of my CPU being wasted?
- How do I switch to a different, non-time-sliced scheduler?
- Thoughts on what I could try to improve throughput?
And lastly... We are going to be signing up for training soon... do you have a recommendation for a big beefy AMD-based linux laptop?
Thanks!
Richard H. Clark