String processing for a latex file

stairs · September 10, 2020, 11:41am

Hi,

I would like to use a Scala script or program to process a LaTeX file in the following way:

Whenever it sees a \begin{itemize}, it replaces it with \begin{itemize}%itemize_[i]_[i+1], and it replaces \end{itemize} with \end{itemize}%itemize_[i]_[i-1],

where i is the current stack depth, which begins at 0.

Example:

\begin{itemize}
 \begin{itemize}
    \item xyz
 \end{itemize}
 \begin{itemize}
 ...

gets transformed to:

\begin{itemize}%itemize_0_1
 \begin{itemize}%itemize_1_2
    \item xyz
 \end{itemize}%itemize_2_1
 \begin{itemize}%itemize_1_2
...

(Not sure this is actually valid LaTeX, but it shows the idea.)

Is this a regex project or a parsing project? Would you read in the whole file as a string and do string searches? Is this doable, or will it take ages?

This would be just for itemizes.

Not sure where to start here.

Thanks.

crater2150 · September 10, 2020, 1:17pm

As long as you don’t have multiple \begin or \end{itemize} in a single line, this can be done in a single fold.

You can use scala.io.Source.fromFile to open the LaTeX file. The Source class provides a getLines method, which will give you an Iterator over all the lines. I think looking at it one line at a time is the best approach here.
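A minimal sketch of that reading step, assuming Scala 2.13 or later (for scala.util.Using) and a UTF-8 file. The object name ReadLines and the path parameter are placeholders for illustration, not part of the original answer:

```scala
import scala.io.Source
import scala.util.Using

object ReadLines {
  // Reads all lines of the file at `path` (a placeholder) into a List.
  // Using.resource closes the Source once the block has finished.
  def readLines(path: String): List[String] =
    Using.resource(Source.fromFile(path, "UTF-8")) { source =>
      source.getLines().toList // materialize before the source is closed
    }
}
```

Converting to a List gives up the laziness of the Iterator, which should be fine for a LaTeX file of ordinary size.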

As you need to keep some iteration state (the current nesting depth), a fold is a good fit. foldLeft is the right one here, as foldRight would process the lines in the wrong order and count nesting starting from the end of the file. The result should again be a list of lines, so we also need to pass a list through the fold to collect them. So our first argument to foldLeft is the tuple (List[String](), 0). To make the parts of that tuple more easily accessible in the function passed to foldLeft, we will use case syntax: {case ((lines, nesting), cur) => ... }. lines is our list for collecting the results, nesting is the current nesting depth starting at 0, and cur is the line we are looking at. The function always returns a tuple of (List[String], Int), with the current line added to the list and the updated nesting level.

You only need to match fixed strings and then append something to the end of the line, so a simple str.contains("\\begin{itemize}") is enough and is probably also the quickest. You then have three cases to consider: your line contains a begin, an end, or neither. In the first two cases, append your comment to the line and update the nesting; otherwise just pass the line through unchanged and keep the current nesting:

if (cur.contains("\\begin{itemize}"))
  (cur + s"%itemize_${nesting}_${nesting+1}" :: lines, nesting + 1)
else if (cur.contains("\\end{itemize}"))
  (cur + s"%itemize_${nesting}_${nesting-1}" :: lines, nesting - 1)
else (cur :: lines, nesting)

The result of the fold will be a tuple of the list of lines and the final nesting value. You can use the latter to check for errors: if it isn’t zero, the \begin and \end lines don’t match up. The list of lines will be in reverse order, because we always prepend to it, so just call reverse on it at the end.

The complete fold will look like this:

// inputLines: the lines to process, e.g. the Iterator[String] from getLines
inputLines.foldLeft((List[String](), 0)) { case ((lines, nesting), cur) =>
  if (cur.contains("\\begin{itemize}"))
    (s"$cur%itemize_${nesting}_${nesting + 1}" :: lines, nesting + 1)
  else if (cur.contains("\\end{itemize}"))
    (s"$cur%itemize_${nesting}_${nesting - 1}" :: lines, nesting - 1)
  else (cur :: lines, nesting)
}._1.reverse // keep only the list of lines and restore the input order
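
The error check mentioned above (a final nesting value other than zero) could be wired in roughly like this. It is only a sketch: the names ItemizeTagger and tag, and the choice of Either for reporting the error, are assumptions for illustration and not part of the original answer.

```scala
object ItemizeTagger {
  // Tags each \begin{itemize} / \end{itemize} line with the nesting depth
  // before and after it. Returns Left with a message if the final nesting
  // is not zero, otherwise Right with the tagged lines in input order.
  def tag(inputLines: Iterator[String]): Either[String, List[String]] = {
    val (reversed, finalNesting) =
      inputLines.foldLeft((List.empty[String], 0)) {
        case ((lines, nesting), cur) =>
          if (cur.contains("\\begin{itemize}"))
            (s"$cur%itemize_${nesting}_${nesting + 1}" :: lines, nesting + 1)
          else if (cur.contains("\\end{itemize}"))
            (s"$cur%itemize_${nesting}_${nesting - 1}" :: lines, nesting - 1)
          else (cur :: lines, nesting)
      }
    if (finalNesting == 0) Right(reversed.reverse)
    else Left(s"unbalanced itemize environments: final nesting is $finalNesting")
  }
}
```

This only checks the final value, so an \end{itemize} that appears before its \begin{itemize} can still go unnoticed if the counts happen to balance out by the end of the file.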

You can then proceed to write the result back to a file (keep in mind that getLines will remove any line breaks).
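
One possible way to write the lines back, sketched with java.nio.file.Files. The object name WriteLines, the path parameter, and the choice of "\n" as the line terminator are assumptions made for this example:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

object WriteLines {
  // Writes the lines to `path` (a placeholder), ending each one with "\n"
  // because getLines stripped the original line terminators.
  def writeLines(path: String, lines: List[String]): Unit = {
    val text = lines.map(_ + "\n").mkString
    Files.write(Paths.get(path), text.getBytes(StandardCharsets.UTF_8))
    () // discard the Path returned by Files.write
  }
}
```

Writing to a new file rather than overwriting the input is probably the safer choice while testing.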

If your input may contain multiple \begin or \end blocks on the same line, this will get more complicated. But in that case you would have to change the expected result anyway, as a % in TeX comments out the rest of the line.
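
For that more complicated case, one option (a swapped-in technique, not what the fold above does) is to run a regex over the whole text and put a line break after each inserted comment, so that the comment does not swallow whatever followed on the same line. The sketch below assumes this changed output layout is acceptable; the names MultiTagger and tagAll are made up for the example.

```scala
import scala.util.matching.Regex

object MultiTagger {
  // Group 1 is the \begin{itemize} or \end{itemize} command itself; any
  // trailing blanks and the line break after it are matched and replaced.
  private val Env: Regex = """(\\(?:begin|end)\{itemize\})[ \t]*\r?\n?""".r

  // Tags every itemize begin/end in `text`, even when several share a line.
  // Each tag is followed by "\n", which changes the layout of the output.
  def tagAll(text: String): String = {
    var nesting = 0 // mutable depth counter, updated match by match
    Env.replaceAllIn(text, m => {
      val command = m.group(1)
      val next =
        if (command.startsWith("\\begin")) nesting + 1 else nesting - 1
      val tagged = s"$command%itemize_${nesting}_$next\n"
      nesting = next
      Regex.quoteReplacement(tagged) // keep \ and $ literal in the result
    })
  }
}
```

Like the line-based version, this does a plain text match, so it would also tag a \begin{itemize} that sits inside a comment or a verbatim block.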

stairs · September 11, 2020, 10:30am

Many thanks for the thorough answer - works a treat!