Closed
Description
Checklist
- I'm reporting a feature request
- I've verified that I'm running youtube-dl version 2021.12.17
- I've searched the bugtracker for similar feature requests including closed ones
Problem
I had to manually fix the output filename for a video whose title contains lots of character decorations (Zalgo):
$ ytdl-rg3 --format=18 --get-title -- i41M0no6ff4
Film Theory: There Is Only DAD DaD ͈͘DAd ̒̇͞d̻̈́̚aD dAD
$ ytdl-rg3 --format=18 --output='%(title)s' --get-filename -- i41M0no6ff4
Film_Theory_-_There_Is_Only_DAD_D_a_D_DA_d_d_a_D_d_A_D
(Why I need --format= to read the title can be left for another day.)
Proposed solution
Let's automatically dezalgo the input before restricting the filename. On PyPI I found zalgolib, but it may be easier to just regexp-replace the Unicode ranges mentioned there.
Fortunately, for this specific video I found a good enough work-around:
$ ytdl-rg3 --format=18 --get-title i41M0no6ff4 | tr -cd 'A-Za-z \n' | tr ' ' _
Film_Theory_There_Is_Only_DAD_DaD_DAd_daD_dAD
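The regexp idea from the proposed solution could look roughly like the sketch below. The ranges are an assumption: they are the Unicode combining-mark blocks (Combining Diacritical Marks and its Extended, Supplement, Symbols and Half Marks companions), not necessarily the exact list zalgolib uses, and `dezalgo` is a hypothetical helper name, not an existing youtube-dl function.

```python
import re

# Assumption: these five Unicode blocks cover the usual Zalgo decorations.
# They may not match the ranges listed by zalgolib.
COMBINING_MARKS = re.compile(
    '['
    '\u0300-\u036f'  # Combining Diacritical Marks
    '\u1ab0-\u1aff'  # Combining Diacritical Marks Extended
    '\u1dc0-\u1dff'  # Combining Diacritical Marks Supplement
    '\u20d0-\u20ff'  # Combining Diacritical Marks for Symbols
    '\ufe20-\ufe2f'  # Combining Half Marks
    ']'
)


def dezalgo(text):
    """Hypothetical helper: delete combining marks, keep everything else."""
    return COMBINING_MARKS.sub('', text)


# Title fragment with three stacked decorations on the first letter.
print(dezalgo('d\u033b\u0344\u031aaD dAD'))
```

Note that this also removes legitimate accents when the text is in decomposed form (a base letter followed by a combining accent), so it is a blunt tool.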
Activity
dirkf commented on Nov 8, 2023
Maybe I should deploy this Unicode processing, assuming for the moment that the algorithm would effectively strip the decorations?
mk-pmb commented on Nov 8, 2023
Decoding confusable characters sounds like a neat feature, but I'm undecided whether it's worth the disk space for the character equivalence table. I feel that for --restrict-filenames it would be more in its spirit to just strip everything with codepoint >= 127. Might be overreach though, thus my idea of using the ranges from zalgolib.
dirkf commented on Nov 9, 2023
Everything with codepoint >= 127 is stripped and replaced by _, which explains your example output. You're suggesting that the Unicode codes representing decorations be dropped instead, no?
My hope was that the confusable remapping would also strip the combining decoration codes, and then any resulting non-ASCII codes would be mapped to _ as now. I'll try this out when I get a chance.
dirkf commented on Nov 10, 2023
Indeed confusable processing doesn't make any difference.
But this apparently reasonable change (skip Control and Mark codes) in the replace_insane() local function does:
Then the filename as above is Film_Theory_-_There_Is_Only_DAD_DaD_DAd_daD_dAD.
mk-pmb commented on Nov 12, 2023
I don't understand enough details to comment on the implementation, but the new output looks like it solves my problem. 👍
dirkf commented on Nov 12, 2023
The logic uses the General Category assigned to each Unicode code point (https://en.wikipedia.org/wiki/Unicode#General_Category_property). The Major category (L, M, N, P, S, Z, C), indicating Letter, Mark, Number, Punctuation, Symbol, Separator, or Other, is the first character of the category name (all two-letter values). I made the executive decision that for code points above 127, Marks and Others (e.g., combining marks) should be omitted rather than mapped to _ for --restrict-filenames.
Fortunately Python offers us the unicodedata library module to help with this, so it's a (less than) one-line change.
mk-pmb commented on Nov 12, 2023
While you're at it anyway, would it be cumbersome to have an option to use a custom character instead of underline?
dirkf commented on Nov 13, 2023
I'm inclined to think that it would be, unless you have a very compelling use case, especially since this option exists only to help file systems that don't understand Unicode (so there isn't a large choice of alternate characters).
There are utilities, depending on the platform, that could be used with --exec ... to rename downloaded files.
yt-dl could acquire some of the output template formatting enhancements from yt-dlp, which includes a string replacement syntax. I would favour that over a custom option, but it involves a lot of code to replace/extend the built-in Python formatting syntax currently used.
mk-pmb commented on Nov 14, 2023
Yeah ok that indeed sounds too cumbersome. My benefit would have been to simplify the renaming tool I --exec, but I see now that the impact on your codebase here would be disproportional.
[utils] Make restricted filenames ignore characters in Unicode catego…
mk-pmb commented on Dec 4, 2023
Thanks! 👍
[utils] Make restricted filenames ignore characters in Unicode catego…
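For readers who want to see the General Category rule from the thread in isolation, here is a minimal sketch. It is not the actual replace_insane() code from youtube-dl: `restrict_char` and `restrict_title` are hypothetical names, and the ASCII handling (only whitespace becomes `_`) is a simplifying assumption. The one part taken from the discussion is the rule for code points above 127: drop Marks and Others, map everything else to `_`.

```python
import unicodedata


def restrict_char(char):
    """Hypothetical, simplified stand-in for one restricted-filename step."""
    if ord(char) <= 127:
        # Assumption for this sketch: ASCII passes through, whitespace -> '_'.
        return '_' if char.isspace() else char
    # The major category is the first letter of the two-letter
    # General Category, e.g. 'Mn' (nonspacing mark) -> 'M'.
    if unicodedata.category(char)[0] in 'CM':
        return ''  # Marks and Others (e.g. combining marks) are dropped
    return '_'  # any other non-ASCII character is replaced


def restrict_title(title):
    return ''.join(restrict_char(c) for c in title)


# Two combining marks sit between the space and 'DAd'; both are dropped.
print(restrict_title('DaD \u0348\u0358DAd'))
```

Compared with stripping fixed code point ranges, the category lookup needs no hand-maintained list, at the cost of also dropping format and control characters in category C.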