itemis ANALYZE’s PDF adapter identifies fragments of a PDF document as traceable artifacts. These artifacts can be identified either by their textual contents or by comments in the PDF document.
Open the ANALYZE configuration with the ANALYZE configuration editor, and add a new data access as described in section "Data accesses". Select PDF files as data access type.
Supported options:
Example:
resource "*.pdf"
This configuration specifies that ANALYZE should load and analyze all files residing in the workspace whose filename extension is .pdf.
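To illustrate which files a wildcard such as "*.pdf" selects, the following minimal Python sketch applies the same glob to a list of file names. The file names and the helper matching_files are hypothetical, and the sketch assumes that the pattern is matched case-sensitively against the file name only; how ANALYZE resolves the pattern internally is not stated here.

```python
from fnmatch import fnmatchcase

# Hypothetical workspace contents, for illustration only.
workspace_files = [
    "specs/requirements.pdf",
    "README.md",
    "design.pdf.bak",
    "manual.pdf",
]

def matching_files(paths, pattern="*.pdf"):
    """Return the paths whose file name (last path segment) matches the glob."""
    # Assumption: the glob is compared against the file name, case-sensitively.
    return [p for p in paths if fnmatchcase(p.rsplit("/", 1)[-1], pattern)]

print(matching_files(workspace_files))
```

Only the two entries ending in .pdf are selected; design.pdf.bak is skipped because the glob has to match the whole file name.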
Open the ANALYZE configuration with the ANALYZE configuration editor, and add a new artifact type as described in section "Artifact types". Select your previously-configured PDF files data access in the Data access drop-down list.
The PDF artifact type configuration supports the following keywords:
Example for analyzing comments:
analyze comments
Example for text parsing and artifact extraction:
locate text where pattern matches "(?sm)(?<id>\\[A:.*?])(?<txt>([^\\[])*)" {
    name group("id")
    identified by group("$1")
    map {
        attr to group("txt").substringBefore("HEADER") + group("txt").substringAfter("FOOTER")
    }
}
The pattern enables dotall and multiline mode ((?sm)) to find all occurrences of specific text elements. Such a text element starts with "[A:" (written as \\[A: in the pattern), followed by zero or more characters (.*?), and ends with a closing square bracket (]). The group txt contains the text up to the next opening square bracket (([^\\[])*).
Extracted artifacts are named according to the value of the captured group named id. Artifacts are identified by the value of the first matching group, here id. The custom attribute attr is mapped to the txt group.
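To see what the pattern captures, here is a minimal Python sketch that runs an equivalent regular expression over a sample text. It is not ANALYZE's implementation: the sample text and the function extract_artifacts are hypothetical, and the named groups are written as (?P<id>...) because Python's re module requires that syntax. Trimming the surrounding whitespace of txt is an addition made for readability.

```python
import re

# Same pattern as in the configuration, adapted to Python's named-group syntax.
PATTERN = re.compile(r"(?sm)(?P<id>\[A:.*?])(?P<txt>([^\[])*)")

# Hypothetical text as it might be extracted from a PDF page.
sample = (
    "Intro text\n"
    "[A:REQ-1] The system shall start.\n"
    "It shall also log.\n"
    "[A:REQ-2] The system shall stop.\n"
)

def extract_artifacts(text):
    """Return one dict per match: the id group as name, the txt group as attr."""
    return [
        # strip() is not part of the original mapping; it only tidies the output.
        {"name": m.group("id"), "attr": m.group("txt").strip()}
        for m in PATTERN.finditer(text)
    ]

for artifact in extract_artifacts(sample):
    print(artifact)
```

Each match ends where the next opening square bracket begins, so the text of REQ-1 spans two lines while "Intro text" belongs to no artifact.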
If a txt element spans more than one page, attr will include the page’s header and footer texts. To strip them off, you can use the substringBefore and substringAfter methods as shown. To use the example’s code snippet, replace HEADER with the header’s first characters and FOOTER with the footer’s last characters. These character sequences should be unique so that they match only in the header and in the footer, respectively, and nowhere else.
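The effect of combining substringBefore and substringAfter can be sketched in Python as follows. The helper names, the sample text, and the marker strings are hypothetical. The sketch assumes semantics like those of Apache Commons Lang's StringUtils (substringBefore returns the whole string and substringAfter returns an empty string when the separator is missing); this section does not state ANALYZE's actual behavior in that case. It also assumes that the HEADER marker precedes the FOOTER marker in the extracted text.

```python
def substring_before(text, sep):
    """Text before the first occurrence of sep; the whole text if sep is absent."""
    return text.partition(sep)[0]

def substring_after(text, sep):
    """Text after the first occurrence of sep; empty if sep is absent."""
    return text.partition(sep)[2]

def strip_page_break(txt, header_start, footer_end):
    """Drop everything from header_start through footer_end, keeping the rest."""
    return substring_before(txt, header_start) + substring_after(txt, footer_end)

# Hypothetical txt value interrupted by header and footer text at a page break.
sample = "The system shall ACME Spec v1.0 | Page 2 of 9 start quickly."

print(strip_page_break(sample, "ACME Spec", "of 9 "))
```

Under these assumptions, a txt value without any page break passes through unchanged, because the first helper returns the full text and the second returns an empty string.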