Option --restrict-filenames should implicitly de-zalgo the text. #32629

@mk-pmb
Contributor

Checklist

  • I'm reporting a feature request
  • I've verified that I'm running youtube-dl version 2021.12.17
  • I've searched the bugtracker for similar feature requests including closed ones

Problem

I had to manually fix the output filename for a video whose title contains lots of character decorations (Zalgo):

$ ytdl-rg3 --format=18 --get-title -- i41M0no6ff4
Film Theory: There Is Only DAD DaD ͈͘DAd ̒̇͞d̻̈́̚aD dAD

$ ytdl-rg3 --format=18 --output='%(title)s' --get-filename -- i41M0no6ff4
Film_Theory_-_There_Is_Only_DAD_D_a_D_DA_d_d_a_D_d_A_D

(Why I need --format= to read the title can be left for another day.)

Proposed solution

Let's automatically dezalgo the input before restricting the filename. On PyPI I found zalgolib, but it may be easier to just regexp-replace the Unicode ranges mentioned there.

Fortunately, for this specific video I found a good enough work-around:

$ ytdl-rg3 --format=18 --get-title i41M0no6ff4 | tr -cd 'A-Za-z \n' | tr ' ' _
Film_Theory_There_Is_Only_DAD_DaD_DAd_daD_dAD
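The regexp idea from the proposed solution could look roughly like the sketch below. It assumes that the decorations all fall in the Unicode block "Combining Diacritical Marks" (U+0300 to U+036F); the ranges that zalgolib actually lists may differ, and `dezalgo` is a hypothetical helper name, not something from youtube-dl.

```python
import re

# Assumption: only the "Combining Diacritical Marks" block (U+0300..U+036F)
# needs to be removed. zalgolib's real range list may be broader.
COMBINING_MARKS = re.compile('[\u0300-\u036f]')


def dezalgo(text):
    """Hypothetical helper: drop combining diacritical marks from text."""
    return COMBINING_MARKS.sub('', text)


# 'D' and 'a' each carry two combining marks here.
decorated = 'D\u0348\u0358a\u0312\u0307D'
print(dezalgo(decorated))  # DaD
```

Unlike the `tr` work-around, this leaves punctuation and other non-letter characters alone, so the result would still need the usual --restrict-filenames handling afterwards.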

Activity

dirkf commented on Nov 8, 2023

Contributor

Maybe I should deploy this Unicode processing, assuming for the moment that the algorithm would effectively strip the decorations?

mk-pmb commented on Nov 8, 2023

Contributor, Author

Decoding confusable characters sounds like a neat feature, but I'm undecided whether it's worth the disk space for the character equivalence table. I feel that for --restrict-filenames it would be more in its spirit to just strip everything with codepoint >= 127. Might be overreach though, thus my idea of using the ranges from zalgolib.

dirkf commented on Nov 9, 2023

Contributor

Everything with codepoint >= 127 is stripped and replaced by _, which explains your example output. You're suggesting that the Unicode codes representing decorations be dropped instead, no?

My hope was that the confusable remapping would also strip the combining decoration codes, and then any resulting non-ASCII codes would be mapped to _ as now. I'll try this out when I get a chance.

dirkf commented on Nov 10, 2023

Contributor

Indeed, confusable processing doesn't make any difference.

But this apparently reasonable change (skip Control and Mark codes) in the replace_insane() local function does:

         if restricted and ord(char) > 127:
-            return '_'
+            return '' if unicodedata.category(char)[0] in 'CM' else '_'

Then the filename as above is Film_Theory_-_There_Is_Only_DAD_DaD_DAd_daD_dAD.
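For readers who want to try the effect of that one-line change in isolation, here is a minimal sketch. It reproduces only the branch shown in the diff; the real replace_insane() applies further rules (spaces, colons, and so on) that are not modelled here, and the names `replace_restricted` and `sanitize` are made up for this illustration.

```python
import unicodedata


def replace_restricted(char):
    """Hypothetical stand-in for the restricted non-ASCII branch only."""
    if ord(char) > 127:
        # Drop Other (C*) and Mark (M*) code points; map the rest to '_'.
        return '' if unicodedata.category(char)[0] in 'CM' else '_'
    return char


def sanitize(title):
    """Apply replace_restricted() to every character of title."""
    return ''.join(replace_restricted(c) for c in title)


print(sanitize('D\u0348\u0358Ad'))  # DAd   (combining marks dropped)
print(sanitize('snow\u2603'))       # snow_ (a Symbol still becomes '_')
```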

mk-pmb commented on Nov 12, 2023

Contributor, Author

I don't understand enough details to comment on the implementation, but the new output looks like it solves my problem. 👍

dirkf commented on Nov 12, 2023

Contributor

The logic uses the General Category assigned to each Unicode code point (https://en.wikipedia.org/wiki/Unicode#General_Category_property). The Major category (L, M, N, P, S, Z, C), indicating Letter, Mark, Number, Punctuation, Symbol, Separator, or Other, is the first character of the category name (all two-letter values). I made the executive decision that, for code points above 127, Marks (e.g. combining marks) and Others should be omitted rather than mapped to _ for --restrict-filenames.

Fortunately Python offers us the unicodedata library module to help with this, so it's a (less than) one-line change.
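As a quick illustration of the category lookup described above, the following sketch prints the two-letter General Category and its major category for a few sample code points. The helper name `major_category` is an assumption for this example; `unicodedata.category()` is the standard-library call.

```python
import unicodedata


def major_category(char):
    """Return the first letter of the General Category (hypothetical helper)."""
    return unicodedata.category(char)[0]


samples = [
    'A',        # Latin capital letter
    '\u0301',   # combining acute accent
    '\u200b',   # zero width space
    '\u2603',   # snowman symbol
]
for char in samples:
    print('U+%04X %s %s' % (
        ord(char), unicodedata.category(char), major_category(char)))
```

Only the samples whose major category is M or C would be dropped by the change; the others above 127 would still be mapped to _.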

mk-pmb commented on Nov 12, 2023

Contributor, Author

While you're at it anyway, would it be cumbersome to have an option to use a custom character instead of the underscore?

dirkf commented on Nov 13, 2023

Contributor

I'm inclined to think that it would be, unless you have a very compelling use case, especially since this option exists only to help file systems that don't understand Unicode (so there isn't a large choice of alternate characters).

There are utilities, depending on the platform, that could be used with --exec ... to rename downloaded files.

yt-dl could acquire some of the output template formatting enhancements from yt-dlp, which include a string replacement syntax. I would favour that over a custom option, but it involves a lot of code to replace/extend the built-in Python formatting syntax currently used.

mk-pmb commented on Nov 14, 2023

Contributor, Author

Yeah, OK, that indeed sounds too cumbersome. My benefit would have been to simplify the renaming tool I --exec, but I see now that the impact on your codebase here would be disproportionate.

added 3 commits that reference this issue on Nov 29, 2023
e07c1fd
d3fd177
9296c14

mk-pmb commented on Dec 4, 2023

Contributor, Author

Thanks! 👍

added a commit that references this issue on Jan 27, 2024
4274723

Issue #32629 · ytdl-org/youtube-dl