Apache crashes when there are multi-threaded requests on a web app server with CF2018 & Apache 2.4
Hi Team,
We encountered an interesting (rather annoying) bug with Apache & CF2018 on one of our Proof of Concept servers (CF2018 with Apache 2.4). The scenario I am going to describe here has worked well for years on our production servers, which do not use CF2018 or Apache 2.4.
A few reports are scheduled to run one after another, and Apache crashes with a segmentation fault before the next report is run. ColdFusion doesn’t crash, though. This is what the logs look like for Apache and the Apache connector. The issue occurs regardless of whether SELinux is enforcing or disabled.
/httpd/error.log
[Thu Feb 07 16:52:25.147757 2019] [core:notice] [pid 3063] AH00052: child pid 3436 exit signal Segmentation fault (11)
[Thu Feb 07 16:52:25.147840 2019] [core:notice] [pid 3063] AH00052: child pid 3445 exit signal Segmentation fault (11)
[Thu Feb 07 16:52:55.307720 2019] [core:notice] [pid 3063] AH00052: child pid 3444 exit signal Segmentation fault (11)
[Fri Feb 08 03:00:25.332926 2019] [core:notice] [pid 3063] AH00052: child pid 4118 exit signal Segmentation fault (11)
[Fri Feb 08 07:29:01.457578 2019] [core:notice] [pid 3063] AH00052: child pid 4219 exit signal Segmentation fault (11)
/wsconfig/1/mod_jk.log
[Thu Feb 07 16:52:24 2019] [3445:140618548705024] [warn] ajp_get_endpoint::jk_ajp_common.c (3705): Unable to get the free endpoint for worker cfusion from 1 slots
[Thu Feb 07 16:52:24 2019] [3436:140618548705024] [warn] ajp_get_endpoint::jk_ajp_common.c (3705): Unable to get the free endpoint for worker cfusion from 1 slots
[Thu Feb 07 16:52:54 2019] [3444:140618548705024] [warn] ajp_get_endpoint::jk_ajp_common.c (3705): Unable to get the free endpoint for worker cfusion from 1 slots
[Fri Feb 08 03:00:24 2019] [4118:140618548705024] [warn] ajp_get_endpoint::jk_ajp_common.c (3705): Unable to get the free endpoint for worker cfusion from 1 slots
[Fri Feb 08 07:29:00 2019] [4219:140618548705024] [warn] ajp_get_endpoint::jk_ajp_common.c (3705): Unable to get the free endpoint for worker cfusion from 1 slots
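For anyone wanting to check whether their own crashes line up the same way, here is a minimal Python sketch that matches the crashed child pids in error.log against the pids in the mod_jk.log warnings. The function names are hypothetical, and the regular expressions assume the two log formats look exactly like the excerpts above.

```python
import re
from collections import defaultdict

# Apache's AH00052 notice: captures the crashed child's pid and the signal name.
CRASH_RE = re.compile(r"AH00052: child pid (\d+) exit signal (.+?) \((\d+)\)")
# mod_jk line: "[timestamp] [pid:thread] [level] message".
JK_RE = re.compile(r"^\[[^\]]+\] \[(\d+):\d+\] \[(\w+)\] (.*)$")


def crashed_pids(error_log_lines):
    """Return {pid: signal name} for every child exit found in error.log lines."""
    result = {}
    for line in error_log_lines:
        m = CRASH_RE.search(line)
        if m:
            result[int(m.group(1))] = m.group(2)
    return result


def jk_warnings_by_pid(jk_log_lines):
    """Group mod_jk [warn] messages by the pid of the Apache child that logged them."""
    out = defaultdict(list)
    for line in jk_log_lines:
        m = JK_RE.match(line.strip())
        if m and m.group(2) == "warn":
            out[int(m.group(1))].append(m.group(3))
    return out


def correlate(error_log_lines, jk_log_lines):
    """For each crashed pid, list the mod_jk warnings that the same pid logged."""
    crashes = crashed_pids(error_log_lines)
    warnings = jk_warnings_by_pid(jk_log_lines)
    return {pid: warnings.get(pid, []) for pid in crashes}
```

Feeding it the lines from the two files (for example via `open(path).read().splitlines()`) gives a dictionary keyed by crashed pid. In the excerpts above, each crashed child logged an "Unable to get the free endpoint" warning about a second earlier.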
If a single report is scheduled to run, it runs without any issues.
It sounds very similar to this issue:
https://coldfusion.adobe.com/discussion/2549262/
I would be grateful if someone from Adobe or the Community could assist with this.
Thank You,
Annie
I am coming here in 2023 to confirm that this problem is unresolved in ColdFusion 2021. The solution described in this thread works. However, much of the discussion seems to assume that you have read material that is no longer available, because Adobe likes to delete and move useful content. So I’ll recap the issue and solution here.
The problem
On a new installation of ColdFusion 2021, on a Red Hat distribution (in my case AWS AMI2 Linux), the error log for Apache (/var/log/httpd_log by default) will throw frequent memory-related errors. Some examples of logged errors that indicate you have this problem are:
- child pid 26027 exit signal Segmentation fault (11)
- munmap_chunk(): invalid pointer
- child pid 25874 exit signal Aborted (6)
When errors like these appeared, an upstream proxy server would simultaneously report that the affected CF server had prematurely closed its connection. So this was definitely impacting clients.
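To get a rough idea of how often each of these signatures shows up, something like the following Python sketch could be run over the Apache error log. The category names and the function name are my own labels, not anything Apache or ColdFusion defines, and the match strings are taken from the three examples above.

```python
# Substrings taken from the example errors above; the keys are arbitrary labels.
SIGNATURES = {
    "segfault": "exit signal Segmentation fault (11)",
    "abort": "exit signal Aborted (6)",
    "munmap": "munmap_chunk(): invalid pointer",
}


def count_signatures(lines):
    """Count how many log lines contain each known crash signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, needle in SIGNATURES.items():
            if needle in line:
                counts[name] += 1
    return counts
```

Running it before and after changing the configuration gives a simple way to compare error rates over similar traffic.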
The resolution/workaround
On the default installation of Apache, using the configuration created by wsconfig, a file will be created at /etc/httpd/conf/workers.properties. Edit this file. Adjust this line:
heartbeat_interval=30
Instead it should be:
heartbeat_interval=0
Upon making that adjustment, and restarting httpd and CF2021, I was able to run in a production environment with no errors at all for a 12-hour period under full load. (Before making the adjustment, similar traffic levels resulted in errors sporadically, every 1-20 minutes.)
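If you need to apply this change on several servers, a small script may be less error-prone than hand editing. The following is a minimal Python sketch, not an Adobe-provided tool. The function name is hypothetical, and accepting an optional dotted prefix before the key (such as a worker name) is an assumption on my part, since the line quoted above shows the bare key. Lines that are commented out are left untouched.

```python
import re

# Matches an active heartbeat_interval line with a numeric value.
# The optional dotted prefix before the key is an assumption, not from the thread.
HEARTBEAT_RE = re.compile(r"^(\s*(?:[\w.]+\.)?heartbeat_interval\s*=\s*)\d+\s*$")


def disable_heartbeat(text):
    """Return (new_text, changed) with every active heartbeat_interval set to 0."""
    changed = False
    out = []
    for line in text.splitlines():
        m = HEARTBEAT_RE.match(line)
        new_line = m.group(1) + "0" if m else line
        if new_line != line:
            changed = True
        out.append(new_line)
    trailing = "\n" if text.endswith("\n") else ""
    return "\n".join(out) + trailing, changed
```

The function only transforms text, so you would read workers.properties, keep a backup copy, write the result back if `changed` is true, and then restart httpd and ColdFusion as described above.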
It’s amazing to me that this problem has not been addressed by Adobe directly. This is the second out-of-the-box broken function that I have found. I can’t wait to get rid of ColdFusion for this reason.
I hope this helps someone who comes looking for answers like I was, so you don’t have to hit so many dead ends.
snc, thanks for your summary of the workaround of setting that heartbeat_interval to zero. Though it was mentioned a few times in the discussion below, including by Annie originally in Feb 2021, it could be easy to lose that in the maze of comments.
In case you or anyone else may lament that early replies here (mostly from me) in early 2019 focused on checking on the state of CF updates, connector updates, and PMT updates (all of which turned out to be red herrings), well, please do note that back then (4 years ago) I was just trying to make sure it wasn’t about any of those things (since the problem was NOT happening widely enough to know the cause).
But it did become clear that it was just about the need (for those hitting this problem) to disable the heartbeat check. And I agree with you that it’s sad that 4 years later people should have to even FIND this discussion and know to MAKE that change. Adobe, why is this not yet fixed?
Then again, you lament that “lots of the discussion seems to make assumptions that you have read material that is no longer available”. There’s only one thing “no longer available”, and that’s the discussion forum thread mentioned in the original question by the OP, Annie. It IS odd that the link fails to work. I tried to find the thread back in 2022 when I responded to sparksx, and I tried again just now.
But sure, there are sometimes resources we may get pointed to that may disappear in time–or be old (and not obviously so). This is a challenge with most tech that’s been around so long. At least there’s new content always being created for CF, whether here on the portal, in the docs, on the Adobe CF forums, in blog posts from myself and others, or in places like the CFML Slack channel, the Facebook CF group, and more. I list many such resources (and I fight to keep the list updated) at cf411.com/cfcommhelp
Sparx, had you tried commenting out the heartbeat setting in workers.properties, or setting it to 0? See other comments below where folks confirmed that solved the problem for them. Note also that it became clear that it did NOT seem to matter whether one used the PMT or not (the heartbeat is something that feeds the connector monitoring aspect of the PMT).
Annie, I know you solved the problem by making that change, but for the sake of completeness, could you give some consideration to the comment I’d shared back on Feb 14 (after yours), asking if you had enabled the PMT (which is what the heartbeat is related to), and where I pointed out how there was an update to the PMT that was available?
Even if you may feel the issue no longer applies for you, your answers could help others facing and finding this problem here.
It would certainly be nice to hear from Adobe as to whether this issue was (or was to be) addressed by update 3. It would also be helpful to hear from anyone experiencing it, whether changing that heartbeat interval fixed it.
And if so (for anyone for whom that fixed it), I would be curious to know: is the instance in question being monitored by the CF2018 PMT (a new monitor, which can be set up anywhere, whether on the same server as CF or elsewhere that can reach it)? You can see if the instance is being monitored by visiting the new menu element in the CF 2018 Admin for the PMT. It lets you know if the instance is being monitored or not.
I am curious to know if the users experiencing this issue (where changing the heartbeat helps) either ARE or are NOT having the PMT monitor the instance in question, in case that’s significant.
Also, note that there was an update to the PMT, that came WITH CF2018 update 2–but it is not applied automatically. You must do it manually. I have a blog post here on doing that:
https://coldfusion.adobe.com/?p=4889
I’d be curious, as well, to know if those experiencing this issue maybe had NOT done that PMT update (as my understanding of the heartbeat interval is that it’s related to the monitoring of the web connection by CF for the sake of the PMT).
OK, thanks for the clarification, Annie. (I will assume by “no errors” in the update log that you mean not just that you “didn’t see any at the bottom”, but specifically that you had looked at the table near the top, tracking successes and errors, and it shows there were zero fatal or non-fatal errors.)
And I missed that you said that changing the heartbeat_interval DID help solve things, as a workaround. That’s interesting.
Hope someone at Adobe may come to a solution for you soon, as it could affect others. I’ve not heard this elsewhere yet, myself.
(There seem to be quite a few different issues with the latest update–affecting some though not all. But since it was an update across all 3 supported CF versions, they will likely have their hands full addressing such issues in the coming hours and days.)
Your assumption is right. There are 0 FatalErrors and 0 NonFatalErrors.
Someone from Adobe said that the bug is fixed in CF2018 update 2. But it’s still happening, and the workaround helps. There’s a bug raised with them. I hope there’s a solution for this.
Thanks for your prompt response.
Annie
Annie, did you upgrade the Apache connector for CF, as was indicated in this update’s tech note? It would be easy to miss.
And did you confirm that there are no errors in the update log created for you under cfusion/hf-updates, in the subfolder for this update?
These things may show this is not a “bug” in the update as it may seem.
Hi Charlie,
There were no errors in the update log, the Apache connector was upgraded, and the SELinux context was applied as usual to mod_jk.conf and workers.properties. Apache started as usual, and the application seems to be working fine. CFAdmin is accessible. After setting heartbeat_interval=0, Apache stopped crashing.
Annie
Thanks to the Adobe Support Team for the quick response over e-mail. Their solution was to update workers.properties with the following:
heartbeat_interval=0
Updating the workers.properties file solved the issue. However, it would be great if you could let us know when this bug will be fixed. It is related to the Performance Monitoring Tool. Looking for an update on this, please.
I was having the same issue with CentOS 7, CF2018 and Apache 2.4 on our new servers. This heartbeat_interval=0 fix seems to have done the job.
Most pages would serve fine, and those that didn’t would show as a 500 err_empty_response in the browser and be OK after a refresh.
It was far more likely to occur on pages that took longer to load, and in particular on one developer’s machine that is a bit slower (and due an upgrade).
Slower pages are now loading successfully, and I haven’t seen this issue on any machine where the fix has been implemented.
Our previous setup was CentOS 6, CF11, Apache 2.2, and we didn’t have this issue.
Same here on a new CF2018/Apache 2.4 installation, saw an increase in 502 errors from one server after switching from CF2016 and found the same errors in the logs. Changing heartbeat_interval to 0 seems to have worked, thanks for posting Annie!
This was without PMT enabled and without installing the PMT update. Other servers with the same configuration don’t appear to have the same problem.