Indeed an interesting question, and one that comes up from time to time. Also note that there may be improvements for this in a later SU, as opposed to a build from 5+ years ago (349 was released in Jan 2014).
Overall, there are a lot of variables at play here, including the amount of network traffic being processed at the network adapter level of the system (or in OS-level processing) before it even gets up to the HMP level. Beyond that, if you are only using the xx_listen APIs, then transcoding will be done on the inbound and outbound streams, since the G711 is decoded to linear on the virtual TDM bus and encoded back to G711 on the way out.
You may be better off switching to a faster method, namely dev_portconnect with NATIVE mode for the IPM streams. That way no processing (i.e. decode/encode) is done, and the RTP packets that come in get shipped directly back out the other connected IPM channel.
Unfortunately I don't have exact metrics for the relay of data in a hairpin call. How exactly were you taking the measurement in this case? Was it from a Wireshark trace or some other means? I know people have typically tried to bounce a DTMF off the system, but that may involve delay as well when using RFC2833 as opposed to inband DTMF, due to the difference in processing.
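If you do go the trace route, here is a rough sketch of how the relay delay could be worked out from capture timestamps. It assumes the NATIVE case, where the payload goes out byte-for-byte as it came in, so inbound and outbound packets can be paired on payload. With transcoding in the path the bytes may differ and this pairing would not apply. The function name and the sample data are made up for illustration and are not from any HMP tool.

```python
from statistics import mean

def relay_delays(inbound, outbound):
    """Pair each inbound packet with the first unused, later outbound
    packet carrying the same payload; return the delays in milliseconds."""
    delays = []
    used = set()  # indexes of outbound packets already paired
    for t_in, payload in inbound:
        for i, (t_out, out_payload) in enumerate(outbound):
            if i in used or t_out < t_in or out_payload != payload:
                continue
            used.add(i)
            delays.append((t_out - t_in) * 1000.0)
            break
    return delays

# Hypothetical capture data: (capture time in seconds, RTP payload bytes).
inbound = [(0.000, b"\x01" * 160), (0.020, b"\x02" * 160), (0.040, b"\x03" * 160)]
outbound = [(0.004, b"\x01" * 160), (0.025, b"\x02" * 160), (0.043, b"\x03" * 160)]

delays = relay_delays(inbound, outbound)
print([round(d, 1) for d in delays], round(mean(delays), 1))
```

Keep in mind that repeated identical payloads (e.g. digital silence) can be paired with the wrong packet, so a stretch of the trace with real audio is the better thing to measure on.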
Regards,
Jeff