Skip to content

Request for feedback: Fixing a bug with SPDX validation #21

@nellshamrell

Description

@nellshamrell

I'm having some trouble with fixing a bug in https://github.com/clearlydefined/spdx

TLDR; We are having issues figuring out where parantheses go in SPDX expressions. This is causing validation errors on ClearlyDefined curation pull requests. This is because of some quirks in how we validate license expressions. Figuring out where parentheses go is hard.

NOTE: If you find yourself asking "What was this written this way?", I wasn't there when it was written and I have no doubt the previous developers were doing the best they could with the resources that they had. It's also mostly worked for over two years, which reflects very well on them and their work.

Problem

When ClearlyDefined validates "MPL-1.1 OR LGPL-2.1-only OR GPL-2.0-only", it says that it is an invalid SPDX string.

According to the SPDX documentation, using three "ors" is a valid license string to show a choice between three licenses.

ClearlyDefined should validate "MPL-1.1 OR LGPL-2.1-only OR GPL-2.0-only" as a valid license.

Background

When someone submits a curation to ClearlyDefined, the ClearlyDefined service validates the license(s) in the curation.

The ClearlyDefined service uses clearlydefined/spdx to validate license strings. It does this through calling SPDX.normalize on a submitted license string in a curation.

spdx.normalize

Currently, when we call SPDX.normalize, it says that "MPL-1.1 OR LGPL-2.1-only OR GPL-2.0-only" is not a valid license string.

Let's take a look at that function:

function normalize(expression) {
  if (!expression || !expression.trim()) return null
  return stringify(parse(expression))
}

And let's zoom in on this line:

return stringify(parse(expression))

First, we call the ``parse method``` on the expression.

The parse method takes the string and turns it into an abstract syntax tree. In the case of "MPL-1.1 OR LGPL-2.1-only OR GPL-2.0-only", it turns it into an AST that looks like this:

{
  left: { license: 'MPL-1.1'},
  conjunction: 'or',
  right: {
    left: { license: 'LGPL-2.1-only'},
    conjunction: 'or',
     right: { license: 'GPL-2.0-only'}
   }
}

Then, once we have our result from parse, normalize calls stringify on the result.

return stringify(parse(expression))

This appears to be the source of the error.

function stringify(obj) {
  if (obj.hasOwnProperty('noassertion') || obj.exception === NOASSERTION) return NOASSERTION
  if (obj.license) return `${obj.license}${obj.plus ? '+' : ''}${obj.exception ? ` WITH ${obj.exception}` : ''}`
  const left = obj.left.conjunction === 'or' ? `(${stringify(obj.left)})` : stringify(obj.left)
  const right = obj.right.conjunction === 'or' ? `(${stringify(obj.right)})` : stringify(obj.right)
  return `${left} ${obj.conjunction.toUpperCase()} ${right}`
}

This function takes the AST as a parameter:

{
  left: { license: 'MPL-1.1'},
  conjunction: 'or',
  right: {
    left: { license: 'LGPL-2.1-only'},
    conjunction: 'or',
     right: { license: 'GPL-2.0-only'}
   }
}

And returns an SPDX string. In the case above, the stringify function would return "MPL-1.1 OR (LGPL-2.1-only OR GPL-2.0-only)", rather than "MPL-1.1 OR LGPL-2.1-only OR GPL-2.0-only".

It does this by recursively traversing the AST, dividing things into right and left.

The first time the code parses the above AST, it would divide it like this

//LEFT

{ license: 'MPL-1.1' }

//RIGHT

{
  left: { license: 'LGPL-2.1-only' },
  conjunction: 'or',
  right: { license: 'GPL-2.0-only' }
}

When it hits this line of code:

const right = obj.right.conjunction === 'or' ? `(${stringify(obj.right)})` : stringify(obj.right)

It would see that the "Right" object has a conjunction of "or", so it would put parentheses around all of it. This is how we get "MPL-1.1 OR (LGPL-2.1-only OR GPL-2.0-only)".

This isn't something we want to always be the case, we can have OR statements that do not need parentheses around them.

Modifying spdx.normalize

It took a few iterations, but eventually this modification to the code seemed to get things mostly right.

function stringify(obj) {
  if (obj.hasOwnProperty('noassertion') || obj.exception === NOASSERTION) return NOASSERTION
  if (obj.license) return `${obj.license}${obj.plus ? '+' : ''}${obj.exception ? ` WITH ${obj.exception}` : ''}`
  const left = obj.left.conjunction === 'or' ? `(${stringify(obj.left)})` : stringify(obj.left)
  const right = obj.right.conjunction === 'or' && obj.left.conjunction !== undefined ? `(${stringify(obj.right)})` : stringify(obj.right)
  return `${left} ${obj.conjunction.toUpperCase()} ${right}`
}

However, this broke another one of the tests.

According to the test, this AST:

{
   left: { license: 'MIT' },
   conjunction: 'or',
   right: {
     left: { license: 'BSD-2-Clause' },
     conjunction: 'or',
     right: { left: { license: 'BSD-3-Clause' }, conjunction: 'and', right: { license: 'Unlicense' } }
   }
},

Should be parsed into "MIT OR (BSD-2-Clause OR BSD-3-Clause AND Unlicense)" (and in the original code it was). This led me to do some research into the rules of parentheses in SPDX license strings.

According to the SPDX documentation, there is a default order of precedence when there are not parentheses present in an SPDX string. (See "4) Order of Precedence and Parentheses" in the documentation).

So with this string "LGPL-2.1-only OR BSD-3-Clause AND MIT", the AND operator is applied before the OR operator. So the license represents a choice between "LGPL-2.1-only and the expression BSD-3-Clause" and "MIT", since the "and" is applied before the "Or".

According to the documentation, parentheses are used to express an order of precedence that is different from the default order. So "MIT AND (LGPL-2.1-or-later OR BSD-3-Clause)" states that the ORD operators should be applied before the AND operator, which is different from the default precedence.

I don't know that there is a way to know where parentheses go when translating an AST representing an SPDX expression into a string. It would depend on the intent of the person who originally wrote it.

How to fix?

There are a few options that I can think of:

  1. Alter the AST to track where parentheses go. This should be possible, though it would be a significant change.
  2. Do away with the AST and use a different method for parsing the license expression to see if it is valid. I imagine other projects are somehow comparing SPDX expressions and verifying that they are valid, it would be worthwhile to research how others handle this problem (we can't be the only one).
  3. Something I'm not thinking of.

I'm inclined to start with researching how other projects validate SPDX strings and see if there is a good example we can follow (or, perhaps, a different library we can use), then make the decision about what approach to take to fix this.

What do others think?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions