Batch-exporting WizNote notes to HTML and converting them to Markdown

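Exporting from WizNote

WizNote notes come in two formats:

One is regular HTML.
The other is Markdown. When exporting these notes to HTML, be sure to check the option to render Markdown. Otherwise the export is non-standard HTML: the file contains the escaped Markdown source, yet images are still emitted as HTML img tags, which is very tedious to process.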

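Batch processing

Batch operations are done with find and xargs. Points to note:

xargs needs the -S option (a BSD/macOS xargs option; GNU xargs does not have it), otherwise commands with overly long file names are not substituted.
File names must not contain certain special characters, such as runs of multiple spaces or the "$" sign.
To guard against failed conversions, back up before and after each step so that you can start over if something goes wrong.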
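As an alternative sketch, find can pass file names to a small shell loop as positional arguments instead of pasting them into the command string, which sidesteps both the -S limit and most special-character problems. This is not what the commands in this post use; the helper name for_each_html and the printf placeholder are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: run a step on every .html file under a directory without
# substituting the file name into the shell source text.
# for_each_html is a hypothetical helper name.
for_each_html() {
  local dir="$1"
  # "{} +" hands the file names to the inner sh as arguments ($1, $2, ...),
  # so "$", quotes and repeated spaces in names are never re-parsed.
  find "$dir" -type f -name "*.html" -exec sh -c '
    for f in "$@"; do
      printf "%s\n" "$f"   # placeholder: put the real per-file step here
    done
  ' sh {} +
}
```

Calling for_each_html . prints each matching path on its own line; swapping the printf for a real command turns it into a batch step.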

Encoding conversion

Files exported from WizNote are usually UTF-8, but some may be UTF-16.
The script below uses the file command to find the UTF-16 files, then converts them to UTF-8 with iconv.
On macOS, use file -bI to get the encoding; elsewhere use file -bi.

find . -name "*.html" | xargs -I% -S 10000 bash -c 'file -bI "%" | grep utf-16 >/dev/null && iconv -f utf-16 -t utf-8 "%" > "%.new" && mv -f "%.new" "%" && echo "convert to utf-8 %"'
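The same per-file logic can be sketched as a function that picks the right file flag for the current system. The helper name to_utf8 is hypothetical, and using uname to tell macOS apart is an assumption about how one might automate the -bI versus -bi choice.

```shell
#!/usr/bin/env bash
# Sketch: convert one file from UTF-16 to UTF-8 in place, only if needed.
# to_utf8 is a hypothetical helper name.
to_utf8() {
  local f="$1" flag enc
  # macOS file(1) uses -I for MIME output; most other systems use -i.
  if [ "$(uname)" = "Darwin" ]; then flag="-bI"; else flag="-bi"; fi
  enc=$(file "$flag" "$f")
  case "$enc" in
    *utf-16*)
      # Write to a temporary file first so a failed iconv leaves the original intact.
      iconv -f utf-16 -t utf-8 "$f" > "$f.new" && mv -f "$f.new" "$f" &&
        echo "convert to utf-8 $f"
      ;;
  esac
}
```

Files that are already UTF-8 fall through the case statement untouched, so the function is safe to run over every exported file.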

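Relative image paths

In some exported notes the relative image paths are wrong and carry an extra prefix, for example

<img src="file:///C:/Users/x/Desktop/Notes/文件夹//笔记_files/image.png"/>

Open the files in VS Code and use a regular-expression search to replace the following pattern with an empty string across all files

file:///C:/Users/x/Desktop/Notes/.*//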
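For a command-line equivalent of the editor replacement, a minimal sed sketch follows. The helper name strip_img_prefix is hypothetical, the prefix is the example path shown above, and the pattern uses [^"]* rather than .* (an assumption on my part) so that a match cannot run past the end of one src attribute.

```shell
#!/usr/bin/env bash
# Sketch: strip the absolute file:/// prefix from image paths read on stdin.
# strip_img_prefix is a hypothetical helper; adjust the prefix to your export path.
strip_img_prefix() {
  # [^"]* keeps each match inside a single src="..." attribute; "#" is the
  # sed delimiter because the pattern itself is full of slashes.
  sed -E 's#file:///C:/Users/x/Desktop/Notes/[^"]*//##g'
}
```

To rewrite a file, redirect through a temporary copy, for example strip_img_prefix < note.html > note.html.new && mv note.html.new note.html, which avoids the differences between GNU and BSD sed -i.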

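Converting to Markdown with pandoc

pandoc must be installed separately; on macOS, running brew install pandoc is enough. Note that pandoc only accepts UTF-8 input.

The command for converting a single HTML file to Markdown is shown below. The extension switches it uses are:

-native_divs removes redundant div elements from the HTML.
-native_spans removes redundant span elements from the HTML.
-raw_html strips raw HTML from the Markdown output (some attribute information, such as image dimensions, is lost).

pandoc -f html-native_divs-native_spans -t markdown+hard_line_breaks-raw_html 1.html -o 1.md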

# Batch-convert HTML to Markdown
find . -name "*.html" | xargs -I% -S 10000 bash -c 'S="%"; D="$S.md"; pandoc -f html-native_divs-native_spans -t gfm+hard_line_breaks-raw_html "$S" -o "$D"'

# Count the HTML files
find . -name "*.html" | wc -l

# Count the generated .md files
find . -name "*.md" | wc -l

# List files that were not processed or that failed
find . -name "*.html" | xargs -I% -S 10000 bash -c 'S="%"; D="$S.md"; [ -f "$D" ] || echo "$D not exists"'
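The check for missing outputs can also be sketched as a reusable function that prints only the sources still lacking a .html.md file. The name list_unconverted is hypothetical, and the function uses find -exec with a shell loop rather than the xargs form above, so it is a variation and not the exact command from this post.

```shell
#!/usr/bin/env bash
# Sketch: print every .html file that has no matching .html.md next to it.
# list_unconverted is a hypothetical helper name.
list_unconverted() {
  local dir="$1"
  find "$dir" -type f -name "*.html" -exec sh -c '
    for s in "$@"; do
      # pandoc output was written to "<source>.md", so test for exactly that.
      [ -f "$s.md" ] || printf "%s\n" "$s"
    done
  ' sh {} +
}
```

An empty result from list_unconverted . suggests every HTML file produced an output; any path it prints is a candidate for a rerun.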

Tidying up file names

When renaming, use mv -n so that a name collision cannot overwrite an existing file.

Sources named .md.html become .md
Sources named .html become .md

find . -name "*.html.md" | xargs -I% -S 10000 bash -c 'S="%"; D=$(echo "$S" | sed -e "s/\.html\.md\$/.md/" -e "s/\.md\.md\$/.md/"); mv -vn "$S" "$D"'
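The two renaming rules can be sketched with plain shell parameter expansion, which avoids regular-expression escaping altogether. The helper name md_name is hypothetical; it only computes the target name and leaves the actual move to mv -n.

```shell
#!/usr/bin/env bash
# Sketch: compute the final .md name for a pandoc output file.
# md_name is a hypothetical helper name.
md_name() {
  local s="$1"
  s="${s%.html.md}"   # drop the trailing ".html.md" added during conversion
  s="${s%.md}"        # drop a leftover ".md" from sources named *.md.html
  printf '%s.md\n' "$s"
}
```

A rename then becomes mv -vn "$f" "$(md_name "$f")" for each file f matching *.html.md.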

Deleting unused files

# Search for unused files
# -v inverts the match
# -i ignores case
find . -name "*" -type f | grep -v -i -e .html -e .htm -e .md -e .git -e .jpg -e .jpeg -e .png -e .gif -e .bmp

# Delete the unused files with xargs, or alternatively with find's -delete option
find . -name ".DS_Store" | xargs -I% rm "%"
find . -name "*.css" | xargs -I% rm "%"
find . -name "*.xml" | xargs -I% rm "%"
find . -name "*.ttf" | xargs -I% rm "%"
find . -name "*.woff" | xargs -I% rm "%"
find . -name "*.woff2" | xargs -I% rm "%"
find . -name "*.eot" | xargs -I% rm "%"
find . -name "*.js" | xargs -I% rm "%"
find . -name "wiz_abstract.html" | xargs -I% rm '%'
find . -name "wiz_full.html" | xargs -I% rm '%'
find . -name "wiz_mobile.html" | xargs -I% rm '%'

# Delete empty directories
find . -type d -empty -delete
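As a cautious variation on the deletions above, the same name patterns can be grouped into a single find that only lists matches, giving a dry run before anything is removed. The helper name list_junk is hypothetical; the patterns are the ones used in the commands above.

```shell
#!/usr/bin/env bash
# Sketch: list the export by-products in one pass (a dry run before deleting).
# list_junk is a hypothetical helper name.
list_junk() {
  local dir="$1"
  # The parentheses group the -o alternatives so -type f applies to all of them.
  find "$dir" -type f \( \
    -name ".DS_Store" -o -name "*.css" -o -name "*.xml" -o -name "*.js" \
    -o -name "*.ttf" -o -name "*.woff" -o -name "*.woff2" -o -name "*.eot" \
    -o -name "wiz_abstract.html" -o -name "wiz_full.html" -o -name "wiz_mobile.html" \
  \) -print
}
```

Once the list from list_junk . looks right, replacing -print with -delete in the function removes the same set of files.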