[–]hypnomarten 4 points5 points  (8 children)

I found this in the Info documentation about the character classes: Note that the ‘[’ and ‘]’ characters that enclose the class name are part of the name, so a regular expression using these classes needs one more pair of brackets. For example, a regular expression matching a sequence of one or more letters and digits would be ‘[[:alnum:]]+’, not ‘[:alnum:]+’.

So, try [[:nonascii:]]

Seems to work for me.
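
For anyone reading along, here is a minimal sketch of the difference the extra brackets make. The helper name `my/first-nonascii-index` is made up for illustration and is not part of Emacs.

```elisp
;; Hypothetical helper: index of the first non-ASCII character in TEXT,
;; or nil if there is none.  The class name [:nonascii:] sits inside an
;; outer bracket expression, hence the doubled brackets.
(defun my/first-nonascii-index (text)
  (string-match-p "[[:nonascii:]]" text))

(my/first-nonascii-index "naïve")   ; => 2 (the ï)
(my/first-nonascii-index "plain")   ; => nil

;; Without the outer brackets the pattern is just a set of the literal
;; characters : n o a s c i, so it matches ordinary ASCII letters.
(string-match-p "[:nonascii:]" "plain")   ; non-nil, matches the "a"
```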

[–]Calm-Bass-4740[S] 0 points1 point  (7 children)

That did the trick. Thanks. Unfortunately, my strategy is not going to work. I wanted to use that to find characters that were octal codes in my buffer, but those are not considered nonascii characters. I think they may not be characters at all and therefore regular expressions may not be helpful. Does anyone have other thoughts?

[–]hypnomarten 2 points3 points  (1 child)

Glad I could help - well, so far...

You can search for octal codes by using isearch-repeat-forward (usually C-s) and then pressing C-q to quote the code you are looking for, like 344. Be aware that octal codes can look like (and actually be) normal characters. 344, for example, is the German ä. But a \222 looks like a \222, at least in my test buffer. Hope that helps.
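
The octal arithmetic can be checked from Lisp as well. This is only a sketch of the number-to-character relationship, evaluated in `M-:` or the scratch buffer:

```elisp
;; #o344 is octal notation for 228, which is the code point of ä (U+00E4).
(= #o344 ?ä)        ; => t
(string #o344)      ; => "ä"
(format "%o" ?ä)    ; => "344"

;; #o222 is 146 (hex 92), a C1 control code with no glyph of its own.
(= #o222 #x92)      ; => t
```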

[–]Calm-Bass-4740[S] 1 point2 points  (0 children)

Yes. Thank you. Once I find such codes I have to look them up in a hash table and replace the old ones.
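
A rough sketch of that find, look up, replace loop might look like the following. The table contents and the names `my/replacement-table` and `my/replace-nonascii-in-string` are invented for illustration, and ordinary characters stand in for the real codes; a literally visited buffer with raw bytes may need a different pattern.

```elisp
;; Hypothetical lookup table: character -> replacement string.
(defvar my/replacement-table
  (let ((table (make-hash-table :test 'eql)))
    (puthash ?ä "ae" table)
    (puthash ?ö "oe" table)
    table)
  "Illustrative mapping from a found character to its replacement.")

(defun my/replace-nonascii-in-string (text)
  "Return TEXT with each non-ASCII character found in the table replaced.
Characters that have no table entry are left untouched."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (while (re-search-forward "[[:nonascii:]]" nil t)
      ;; Point is just after the match, so `char-before' is the match.
      (let ((replacement (gethash (char-before) my/replacement-table)))
        (when replacement
          ;; FIXEDCASE and LITERAL are both t: insert the string as-is.
          (replace-match replacement t t))))
    (buffer-string)))

(my/replace-nonascii-in-string "schön wär")   ; => "schoen waer"
```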

[–]eli-zaretskii (GNU Emacs maintainer) 2 points3 points  (1 child)

Try [[:unibyte:]].

[–]Calm-Bass-4740[S] 0 points1 point  (0 children)

Thank you. [^[:unibyte:]] does seem to catch what I need.

[–]Calm-Bass-4740[S] 1 point2 points  (2 children)

I think I found one solution at https://lists.libreplanet.org/archive/html/help-gnu-emacs/2019-07/msg00107.html . I was not aware that you could search by hex codes.

(re-search-forward "[\x80-\xff]")

[–]mmaug (GNU Emacs `sql.el` maintainer) 2 points3 points  (1 child)

Regular expressions match code points, not just bytes. So the easiest way to detect anything outside US-ASCII (0 NUL to 127 DEL), including multi-byte code points, is to search with the regular expression [^[:ascii:]]

This will match all Unicode (European, Asian, African, historic, emoji, …) not included in the 1960s standard.
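
As a small sketch of the "code points, not bytes" point, the following counts matches in a temporary multibyte buffer. The function name `my/count-outside-ascii` is made up for this example.

```elisp
(defun my/count-outside-ascii (text)
  "Return the number of characters in TEXT that are outside US-ASCII."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (let ((count 0))
      ;; Each successful search consumes exactly one character.
      (while (re-search-forward "[^[:ascii:]]" nil t)
        (setq count (1+ count)))
      count)))

;; ä and 中 take 2 and 3 bytes in UTF-8, but each counts once.
(my/count-outside-ascii "aä中")    ; => 2
(my/count-outside-ascii "ascii")   ; => 0
```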

[–]Calm-Bass-4740[S] 0 points1 point  (0 children)

Ha, the 1960s is when the file format I'm working with was invented, though the encoding for non-ascii languages came later.

[–]mmaug (GNU Emacs `sql.el` maintainer) 0 points1 point  (3 children)

What is the encoding of the buffer? If the buffer is Unicode then octal bytes are not character code points. But if the file is actually a different encoding, making sure the buffer is using the same encoding could significantly alter your perception of the file contents.

Locating bytes displayed in octal is a separate problem, and more specifics are needed to definitively recommend a solution. File encoding and Unicode display are incredibly complex topics, far more difficult than the ASCII model implies.

[–]Calm-Bass-4740[S] 0 points1 point  (2 children)

The files will be visited literally and they are divided into "records". Each may be encoded differently from the last. Emacs doesn't understand the encoding, so I have to "translate" it and that is what I'm working on. The encoding is MARC8 and characters are 1 byte each, except for some Asian languages, where a character is 3 bytes. Since not all characters fit in 1 byte, the character sets change based on escape sequences (extended Latin, Cyrillic, Hebrew, Arabic...). It is challenging to understand. (Oh, and combining characters aren't in Unicode order either.) I'm not a programmer by trade so I'll be asking other questions as I go through the process. For now, I think the above information is enough to get me to the next step.

[–]mmaug (GNU Emacs `sql.el` maintainer) 0 points1 point  (1 child)

You are certainly in the deep end of the pool (with no floaties in sight). Without fully understanding what you need to accomplish, I fear I'd lead you astray, so I'll stick to purely technical issues and ignore my design instincts. (Which would be to write a C or Python program to convert the file to a UTF-8 encoding and go from there, but that ignores whatever the downstream needs might be.)

That said, I do know Eli Z has been seen in these parts, and they are as knowledgeable of Unicode and Emacs buffer representation as anyone since they were the author of much of it.

[–]Calm-Bass-4740[S] 0 points1 point  (0 children)

Eli has posted on Reddit that creating new encodings is not documented, and they are difficult to write. So, I'm not shooting for a full-fledged Emacs encoding, just a translation. That may or may not work.

[–]mmaug (GNU Emacs `sql.el` maintainer) 0 points1 point  (1 child)

(You've stirred up my ADHD — I won't rest for weeks…)

A quick DDG search and Wikipedia convince me that if your file is properly encoded, there'll be FOSS tools around for manipulating the data you have. The MARC-8 and MARC-21 file formats are well documented and have formal specifications. My guess is that you are not the first one to try to make sense of what they hold. Stand on the shoulders of others if you can…

[–]Calm-Bass-4740[S] 1 point2 points  (0 children)

I am working (with another person) on a MARC mode so end users (to the extent that end users may use Emacs) can edit the files. There are open source alternatives for the encoding/decoding part of the process, and I could make the user do that before editing the files with Emacs. Or I could have MARC mode only understand UTF-8 records. But at this point, I have chosen to try the harder road. I have probably done so because I am ignorant of the trials ahead.