Validating Elastic Beanstalk worker tier cron schedules
If you’ve worked with Elastic Beanstalk worker tiers, you may be familiar with the cron.yaml file, which sqsd uses to send jobs to your app. I recently deployed a new version of a Rails app to our worker tier environment, and the deploy failed, hard.
I’ve had bad deploys before, but one nice thing about Elastic Beanstalk is that the previous version of your app continues running if a deployment fails. Not this time though. The deployment failed and took everything down with it, including the previous (and should-be-still-running) version of the app.
This seemed very strange, so I tried re-deploying the previous version, which had been running just a few minutes prior, but that deployment failed too.
Now, I’m writing this based on my memory from a couple months ago, so bear with me here if the details aren’t 100%. I believe I checked systemctl status sqsd.service and was directed to run journalctl -xe. I think I tried starting sqsd manually using the command shown in the journalctl output (/opt/elasticbeanstalk/lib/ruby/bin/aws-sqsd start).
Regardless, at some point I saw this (or a similar) message:
Failed of parsing file 'cron.yaml', because: min out of range (ArgumentError)
I had just added a new job to the cron.yaml file in the latest deployment, but I couldn’t find anything wrong with it. The schedule value seemed fine and was identical to the schedule of another job, but set for a couple minutes later.
After more digging, I found that the problem was coming from inside sqsd, the worker-tier daemon that delivers messages from an SQS queue to the app. I found the sqsd code on the machine[1] and started adding logging output to find exactly where it was failing.
Turns out the failure was occurring when sqsd parses the cron schedules (in schedule_parser.rb) using the parse-cron gem. Here’s the relevant bit showing the part of the code that was failing and raising the error:
begin
  # ...
  schedule = task_def.fetch('schedule')
  parser = CronParser.new(schedule)
  parser.next(Time.now) # test by computing next scheduled time
  parser.last(Time.now) # test by computing last scheduled time
  # ...
rescue => e
  raise ScheduleFileError.new("Failed of parsing file '#{SCHEDULE_FILE_NAME}', because: #{e.message}")
end
After banging my head against the wall trying to figure out how this new job could be failing, I added a line of code to log the schedule variable before it was parsed. That’s when I found it wasn’t the new job with a bad schedule, but one that had been there for months, successfully parsing and deploying all along. What on earth…
Here’s the job that was failing. It should run every 15 minutes from 7:00am until 7:45pm Central[2].
- name: my-job
  url: /my-job
  schedule: '*/15 12-24 * * *'
If you work with cron schedules much, you’ll probably spot the error right away. I don’t, and didn’t. The original schedule (from years ago) was */15 12-23 * * *, but I decided I wanted it to run for an extra hour each day, so I increased the hours component from 12-23 to 12-24. Without realizing it at the time, I’d made an invalid cron schedule, as Crontab.guru shows: the hour field only accepts 0 through 23.
I hurriedly updated the schedule to its correct form, */15 0,12-23 * * *, and deployed the app. It worked.
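To make the range rule concrete, here is a minimal sketch in plain Ruby that lists the numbers in an hour field falling outside 0..23. The out_of_range_hours helper is hypothetical (it is not part of sqsd or parse-cron) and only handles the simple syntax used in this post: numbers, ranges, comma lists, steps, and *.

```ruby
# Hypothetical helper, not part of sqsd or parse-cron: returns the numbers in a
# cron hour field that fall outside the valid 0..23 range.
def out_of_range_hours(hour_field)
  hour_field
    .split(',')                            # "0,12-23" -> ["0", "12-23"]
    .map { |part| part.split('/').first }  # drop any step suffix such as "/2"
    .reject { |part| part == '*' }         # "*" is always in range
    .flat_map { |part| part.split('-').map { |n| Integer(n, 10) } }
    .reject { |hour| (0..23).cover?(hour) }
end

hour_field = '*/15 12-24 * * *'.split[1]   # the second field is the hour
p out_of_range_hours(hour_field)           # the original, invalid schedule
p out_of_range_hours('0,12-23')            # the corrected hour field
```

The first call reports 24 as the offending value; the second returns an empty array.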
But I wasn’t satisfied. I had to figure out why the bad schedule had worked all along until now. I started off by trying this in a local IRB session:
parser = CronParser.new('*/15 12-24 * * *')
#=> #<CronParser:... @source="*/15 12-24 * * *", @time_source=Time>
parser.next(Time.now)
#=> (a timestamp)
parser.last(Time.now)
#=> (a timestamp)
No failures… So the CronParser is allowing the invalid 12-24 hour field. More poking around… until it hit me. The (literal) variable here was Time.now.
I had done the (failed) deployment at night but was now trying to reproduce the problem during the day. I tried parser.next(10.hours.from_now), and there it was, the error I’d been searching for: min out of range.
Without having inspected the parse-cron gem’s code any further, my read is that it accepts an out-of-range value without complaint when the schedule is parsed, then fails only when computing a next or previous execution time that runs into that value.
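Since the failure depends on the time being probed, one way to reproduce it is to try several times of day rather than only Time.now. The sketch below is an illustration under assumptions: first_failing_time is a hypothetical helper, and fake_next is a stub standing in for parser.next so the example runs without the gem. Its 23:00 cutoff is made up and is not a claim about where parse-cron actually fails.

```ruby
# Hypothetical helper: call the block with each probe time and return the first
# time at which it raises ArgumentError, or nil if every probe succeeds.
def first_failing_time(start, step_seconds:, count:)
  count.times do |i|
    time = start + (i * step_seconds)
    begin
      yield time
    rescue ArgumentError
      return time
    end
  end
  nil
end

# Stand-in for `parser.next(time)`. This stub only raises for late-evening
# times, so a single daytime probe misses the failure. The cutoff is made up.
fake_next = ->(time) { raise ArgumentError, 'min out of range' if time.hour >= 23 }

midday = Time.utc(2020, 1, 1, 12, 0, 0)

# One probe at midday finds nothing...
p first_failing_time(midday, step_seconds: 3600, count: 1) { |t| fake_next.call(t) }

# ...but hourly probes across a full day do.
p first_failing_time(midday, step_seconds: 3600, count: 24) { |t| fake_next.call(t) }
```

With the real gem, the block would be something like { |t| CronParser.new(schedule).next(t) }.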
To prevent future failures like this, I now have a test that reads and parses the app’s cron.yaml file and validates its cron schedules using a different gem called crontab_syntax_checker.[3] Here’s a simplified version for minitest:
require 'test_helper'
require 'crontab_line'

class CronTest < ActiveSupport::TestCase
  cron_file = Rails.root.join('cron.yaml')

  YAML.load(cron_file.read)['cron'].each do |task|
    test "#{task['name']} has a valid schedule" do
      # this raises an error if the schedule is bad
      CrontabLine.create_by_entry("#{task['schedule']} echo")
    end
  end
end
The echo part is there because this gem expects a command to come after the schedule, so I’ve added a dummy command.
If you wanted to be really sure sqsd won’t choke on a cron schedule, you could write a test that uses the same parse-cron library that sqsd uses and try each schedule in your cron.yaml file for every minute of a two-year period (a leap year and a common year). That’s the approach I started with, but it means calling parser.next (and maybe parser.last) more than a million times for each schedule:
(366 days + 365 days) × 24 hours × 60 minutes = 1,052,640 times
To be sure, this is overkill. But it’s also exhaustive. You could have a test like this and only run it with a special command line flag. I think the above test is enough, but if you’re unhappy with how fast your test suite runs and would like to slow it down, here’s your code:
require 'test_helper'

class CronTest < ActiveSupport::TestCase
  cron_file = Rails.root.join('cron.yaml')

  YAML.load(cron_file.read)['cron'].each do |task|
    test "#{task['name']} has a valid schedule" do
      # guarantee one of the two years is a leap year
      base_time = Time.local(2020, 1, 1).utc.midnight

      0.upto((366 + 365) * 24 * 60) do |add_minutes|
        parser = CronParser.new(task['schedule'])

        Timecop.freeze(base_time + add_minutes.minutes) do
          # calling next and last can cause parser to raise
          # `ArgumentError: min out of range` if the cron
          # schedule syntax contains an out-of-range value
          parser.next
          parser.last
        end
      end
    end
  end
end
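The “special command line flag” idea could be sketched with an environment variable. This is only one possible approach; EXHAUSTIVE_CRON and exhaustive_cron_checks? are made-up names for illustration.

```ruby
# Hypothetical opt-in switch; EXHAUSTIVE_CRON is a made-up variable name.
def exhaustive_cron_checks?(env = ENV)
  env['EXHAUSTIVE_CRON'] == '1'
end

# Number of minutes the exhaustive test walks through for each schedule.
MINUTES_TO_CHECK = (366 + 365) * 24 * 60

if exhaustive_cron_checks?
  puts "checking #{MINUTES_TO_CHECK} minutes per schedule"
else
  puts 'skipping exhaustive cron check (set EXHAUSTIVE_CRON=1 to run it)'
end
```

Inside the minitest test body, the same predicate could guard the slow loop with something like skip unless exhaustive_cron_checks?, so the test only runs when the variable is set.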
[1] At the time of this writing, the code for aws-sqsd lives on worker tiers at /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.3/
[2] The instance uses UTC, and I’m on Central Time, which has a -0600 standard offset and -0500 daylight offset. Daylight Saving Time is accounted for in the app so that a “cron” hour of “12” means “7am CST” or “7am CDT” depending on whether DST is in effect.
[3] crontab_syntax_checker defines global methods, so it’s in our Gemfile as:
group :test do
  gem 'crontab_syntax_checker', require: false
end
Then in the test, I require a specific class (require 'crontab_line') which has everything we need and doesn’t pollute the global namespace.