Validating Elastic Beanstalk worker tier cron schedules

If you’ve worked with Elastic Beanstalk worker tiers, you may be familiar with the cron.yaml file, which sqsd uses to send jobs to your app. I recently deployed a new version of a Rails app to our worker tier environment, and the deploy failed hard.

I’ve had bad deploys before, but one nice thing about Elastic Beanstalk is that the previous version of your app continues running if a deployment fails. Not this time though. The deployment failed and took everything down with it, including the previous (and should-be-still-running) version of the app.

This seemed very strange, so I tried re-deploying the previous version, which had been running just a few minutes prior, but that deployment failed too.

Now, I’m writing this based on my memory from a couple months ago, so bear with me here if the details aren’t 100%. I believe I checked systemctl status sqsd.service and was directed to run journalctl -xe. I think I tried starting sqsd manually using the command shown in the journalctl output (/opt/elasticbeanstalk/lib/ruby/bin/aws-sqsd start).

Regardless, at some point I saw this (or a similar) message:

Failed of parsing file 'cron.yaml', because: min out of range (ArgumentError)

I had just added a new job to the cron.yaml file in the latest deployment, but I couldn’t find anything wrong with it. The schedule value seemed fine and was identical to the schedule of another job, but set for a couple minutes later.

After more digging, I found that the problem was coming from inside sqsd, the worker-tier daemon that delivers messages from an SQS queue to the app. I found the sqsd code on the machine¹ and started adding logging output to find exactly where it was failing.

Turns out the failure was occurring when sqsd parses the cron schedules (in schedule_parser.rb) using the parse-cron gem. Here’s the relevant code, which was failing and raising the error:

begin
  # ...
  schedule = task_def.fetch('schedule')
  parser = CronParser.new(schedule)
  parser.next(Time.now) # test by computing next scheduled time
  parser.last(Time.now) # test by computing last scheduled time
  # ...
rescue => e
  raise ScheduleFileError.new("Failed of parsing file '#{SCHEDULE_FILE_NAME}', because: #{e.message}")
end

After banging my head against the wall trying to figure out how this new job could be failing, I added a line of code to log the schedule variable before it was parsed. That’s when I found it wasn’t the new job with a bad schedule, but one that had been there for months, successfully parsing and deploying all along. What on earth…

Here’s the job that was failing. It should run every 15 minutes from 7:00am until 7:45pm Central².

- name: my-job
  url: /my-job
  schedule: '*/15 12-24 * * *'

If you work with cron schedules much, you’ll probably spot the error right away. I don’t, and didn’t. The original schedule (from years ago) was */15 12-23 * * *, but I decided I wanted it to run for an extra hour each day, so I increased the hours component from 12-23 to 12-24. Without realizing it at the time, I’d made an invalid cron schedule, as Crontab.guru shows.

I hurriedly updated the schedule to its correct form, */15 0,12-23 * * *, and deployed the app. It worked.

But I wasn’t satisfied. I had to figure out why the bad schedule had worked all along until now. I started off by trying this in a local IRB session:

parser = CronParser.new('*/15 12-24 * * *')
#=> #<CronParser:... @source="*/15 12-24 * * *", @time_source=Time>
parser.next(Time.now)
#=> (a timestamp)
parser.last(Time.now)
#=> (a timestamp)

No failures… So the CronParser is allowing the invalid 12-24 hour field. More poking around… until it hit me. The (literal) variable here was Time.now.

I had done the (failed) deployment at night but was now trying to reproduce the problem during the day. I tried parser.next(10.hours.from_now), and there it was, the error I’d been searching for: min out of range.

I haven’t inspected the parse-cron gem’s code any further, but it appears to accept an invalid cron schedule without complaint and then fail later, when it tries to compute the next or previous execution time, and only for some reference times.
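One plausible explanation for the time-of-day dependence, which is an assumption and not something confirmed from parse-cron’s source: Ruby’s own Time constructors accept hour 24 only when the minute and second are zero. If the gem builds a Time from the schedule’s raw hour and minute values, a */15 schedule with hour 24 would succeed for 24:00 and raise for 24:15, 24:30, and 24:45. The sketch below shows only the plain-Ruby behavior; the helper name buildable? is made up for illustration.

```ruby
# Plain Ruby, no gems. Hypothetical helper: can Ruby construct a Time
# for this hour/minute pair on an arbitrary date?
def buildable?(hour, minute)
  Time.utc(2020, 1, 1, hour, minute)
  true
rescue ArgumentError
  false
end

# Hour 24 is tolerated only as "24:00", which rolls over to the next day.
[0, 15, 30, 45].each do |minute|
  puts "24:#{format('%02d', minute)} buildable? #{buildable?(24, minute)}"
end
```

If that is what happens inside the gem, whether a given call blows up depends on whether the computed run time lands on 24:00 or on one of the later quarter hours, which would fit a failure that only shows up at certain times of day.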

To prevent future failures like this, I now have a test that reads and parses the app’s cron.yaml file and validates its cron schedules using a different gem called crontab_syntax_checker.³ Here’s a simplified version for minitest:

require 'test_helper'
require 'crontab_line'

class CronTest < ActiveSupport::TestCase
  cron_file = Rails.root.join('cron.yaml')
  YAML.load(cron_file.read)['cron'].each do |task|
    test "#{task['name']} has a valid schedule" do
      # this raises an error if the schedule is bad
      CrontabLine.create_by_entry("#{task['schedule']} echo")
    end
  end
end

The echo part is there because this gem expects a command to come after the schedule, so I’ve added a dummy command.
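For a sense of what such a check has to catch, here is a minimal dependency-free sketch of the range rule that 12-24 violates. It is not how crontab_syntax_checker or sqsd works; the constant and method names are hypothetical, and it only understands numeric values, *, lists, ranges, and steps (no month or weekday names).

```ruby
# Allowed numeric range for each of the five standard cron fields.
CRON_FIELD_RANGES = {
  'minute'       => 0..59,
  'hour'         => 0..23,
  'day of month' => 1..31,
  'month'        => 1..12,
  'day of week'  => 0..7 # many cron implementations accept 0 and 7 for Sunday
}.freeze

# Hypothetical helper: returns a list of problems found in the schedule,
# or an empty array when every numeric value is within its field's range.
def cron_range_errors(schedule)
  fields = schedule.split
  return ["expected 5 fields, got #{fields.size}"] unless fields.size == 5

  errors = []
  CRON_FIELD_RANGES.each_with_index do |(name, range), index|
    fields[index].split(',').each do |part|
      base, step = part.split('/', 2)
      # A step, when present, must be a positive integer.
      unless step.nil? || (/\A\d+\z/.match?(step) && step.to_i.positive?)
        errors << "#{name}: bad step in '#{part}'"
      end
      next if base == '*'

      unless /\A\d+(-\d+)?\z/.match?(base.to_s)
        errors << "#{name}: cannot parse '#{part}'"
        next
      end
      low, high = base.split('-').map(&:to_i)
      high ||= low # a single value is a range of one
      unless range.cover?(low) && range.cover?(high) && low <= high
        errors << "#{name}: '#{part}' is outside #{range}"
      end
    end
  end
  errors
end

p cron_range_errors('*/15 12-24 * * *')   # => ["hour: '12-24' is outside 0..23"]
p cron_range_errors('*/15 0,12-23 * * *') # => []
```

A static check like this looks only at the schedule text, so its result does not depend on what time the test happens to run.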

If you wanted to be really sure sqsd won’t choke on a cron schedule, you could write a test that uses the same parse-cron library that sqsd uses and try each schedule in your cron.yaml file for every minute of a two-year period (a leap year and a common year). That’s the approach I started with, but it means calling parser.next (and maybe parser.last) more than a million times for each schedule:

(366 days + 365 days) × 24 hours × 60 minutes = 1,052,640 times

To be sure, this is overkill. But it’s also exhaustive. You could have a test like this and only run it with a special command line flag. I think the above test is enough, but if you’re unhappy with how fast your test suite runs and would like to slow it down, here’s your code:

require 'test_helper'

class CronTest < ActiveSupport::TestCase
  cron_file = Rails.root.join('cron.yaml')
  YAML.load(cron_file.read)['cron'].each do |task|
    test "#{task['name']} has a valid schedule" do
      # start in a leap year (2020) so the two-year span includes Feb 29
      base_time = Time.utc(2020, 1, 1)
      # the parser holds no per-call state, so build it once per schedule
      parser = CronParser.new(task['schedule'])
      ((366 + 365) * 24 * 60).times do |add_minutes|
        Timecop.freeze(base_time + add_minutes.minutes) do
          # calling next and last can cause parser to raise
          # `ArgumentError: min out of range` if the cron
          # schedule syntax contains an out-of-range value
          parser.next
          parser.last
        end
      end
    end
  end
end

Notes

¹ At the time of this writing, the code for aws-sqsd lives on worker tiers at /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.3/
² The instance uses UTC, and I’m on Central Time, which has a -0600 standard offset and a -0500 daylight offset. Daylight Saving Time is accounted for in the app so that a “cron” hour of “12” means “7am CST” or “7am CDT” depending on whether DST is in effect.
³ crontab_syntax_checker defines global methods, so it’s in our Gemfile as:
```
group :test do
  gem 'crontab_syntax_checker', require: false
end
```
Then in the test, I require a specific class (require 'crontab_line') which has everything we need and doesn’t pollute the global namespace.