2016-10-18

Here's one such jolly feature, which many LaTeX users may find very useful.

Loading a text file with encoding auto-detection

Here's a LaTeX document containing a Polish poem. Most of you probably wouldn't be able to guess the file's character encoding if I hadn't left some hints. That's fine: it makes for a little challenge.

Let's use some (currently experimental) stringi functions to guess the file's encoding.

First of all, we should read the file as a raw vector (after all, every text file is just a sequence of bytes).
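
A minimal sketch of this step, assuming the stringi package is installed. The sample text and the temporary file are made up for illustration and merely stand in for the poem file:

```r
library(stringi)

# Hypothetical sample: a tiny LaTeX snippet with Polish diacritics,
# written to disk in ISO-8859-2 to stand in for the poem file.
txt <- "\\begin{verse}\nZa\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144.\n\\end{verse}\n"
path <- tempfile(fileext = ".tex")
writeBin(stri_encode(txt, "UTF-8", "ISO-8859-2", to_raw = TRUE)[[1]], path)

# Read the whole file as a raw vector; no encoding is assumed yet.
bytes <- stri_read_raw(path)
head(bytes, 15)
```

At this point `bytes` is just a sequence of byte values; nothing has been interpreted as characters.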

Let's try to detect the file's character encoding automatically.
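
As a rough sketch, detection can be tried with `stri_enc_detect()`, which accepts a raw vector. The Polish sample sentences below are an assumption made for illustration, and with so little text the ranking of candidates may well differ from what a full poem would give:

```r
library(stringi)

# Hypothetical sample: two Polish pangrams, converted to ISO-8859-2 bytes.
txt <- paste(
  "Za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144.",
  "Pchn\u0105\u0107 w t\u0119 \u0142\u00f3d\u017a je\u017ca lub o\u015bm skrzy\u0144 fig."
)
bytes <- stri_encode(txt, "UTF-8", "ISO-8859-2", to_raw = TRUE)[[1]]

# One result per input; each lists candidate encodings, the guessed
# language, and a confidence score, best candidates first.
guess <- stri_enc_detect(bytes)
guess[[1]]
```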

Encoding detection is, at best, an imprecise operation using statistics and heuristics. ICU indicates that we are most probably dealing with Polish text in ISO-8859-2 (a.k.a. latin2) here. What a coincidence: it's true.

Let's re-encode the file. Our target encoding will be UTF-8, as it is a "superset" of all 8-bit encodings in the sense that it can represent every character any of them can. We really love portable code:
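
One way to sketch the conversion is with `stri_encode()`, followed by `stri_split_lines1()` to obtain one string per line. The sample bytes are built in memory here and are purely illustrative:

```r
library(stringi)

# Hypothetical sample text, converted to ISO-8859-2 bytes to mimic the file.
txt <- "\\begin{verse}\nZa\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144.\n\\end{verse}"
bytes <- stri_encode(txt, "UTF-8", "ISO-8859-2", to_raw = TRUE)[[1]]

# Re-encode the raw bytes from ISO-8859-2 to UTF-8 ...
utf8 <- stri_encode(bytes, from = "ISO-8859-2", to = "UTF-8")

# ... and split the single long string into lines.
lines <- stri_split_lines1(utf8)
lines
```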

Of course, if we knew a priori that the file is in ISO-8859-2, we'd just call:
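
A sketch of the shortcut, assuming `stri_read_lines()` with an explicit `encoding` argument (the function is marked experimental, so its other arguments may differ between stringi versions). The temporary file is a made-up stand-in:

```r
library(stringi)

# Hypothetical ISO-8859-2 file standing in for the poem.
txt <- "Za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144.\nPchn\u0105\u0107 w t\u0119 \u0142\u00f3d\u017a.\n"
path <- tempfile(fileext = ".tex")
writeBin(stri_encode(txt, "UTF-8", "ISO-8859-2", to_raw = TRUE)[[1]], path)

# Read, decode, and split into lines in a single call.
lines <- stri_read_lines(path, encoding = "ISO-8859-2")
lines
```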

So far so good.

Word count

LaTeX word counting is quite a complicated task, and there are many possible approaches to it. Most often, they rely on running some external tools (which may be a bit inconvenient for some users). Personally, I've always been most satisfied with the output produced by the Kile LaTeX IDE for the KDE desktop environment.

As not everyone has Kile installed, I decided to grab Kile's algorithm (the power of open source!), made some not-too-invasive stringi-specific tweaks, and here we are:
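
In stringi this is exposed as `stri_stats_latex()`. A minimal sketch on a made-up document, with one string per line:

```r
library(stringi)

# Hypothetical LaTeX source, one element per line.
doc <- c(
  "\\documentclass{article}",
  "\\begin{document}",
  "Hello \\textbf{brave} new world.",
  "% a comment line",
  "\\end{document}"
)

# Named integer vector with counts of word characters, command and
# environment characters, whitespace, words, commands, and environments.
stats <- stri_stats_latex(doc)
stats
stats[["Words"]]
```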

Some other aggregates are also available (they are meaningful for any text file):
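
These come from `stri_stats_general()`. A small sketch on made-up input, again with one string per line and no embedded newlines:

```r
library(stringi)

# Hypothetical three-line text; the middle line is empty.
doc <- c("First line.", "", "Third line.")

# Counts of lines, non-empty lines, characters, and non-white characters.
stats <- stri_stats_general(doc)
stats
```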

Finally, here's the word count for my R programming book (in Polish). Importantly, each chapter is stored in a separate .tex file (there are 30 files), so "clicking out" the answer in Kile would be a bit problematic:
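
One possible way to aggregate over many files is sketched below. The directory, file names, and chapter contents are hypothetical (two tiny UTF-8 files rather than the 30 real chapters), so the numbers it produces say nothing about the book itself:

```r
library(stringi)

# Hypothetical "book": two small chapter files in a temporary directory.
book_dir <- tempfile("book")
dir.create(book_dir)
writeLines(c("\\chapter{One}", "Some words here."), file.path(book_dir, "ch01.tex"))
writeLines(c("\\chapter{Two}", "A few more words."), file.path(book_dir, "ch02.tex"))

# Compute LaTeX statistics per file: one column per chapter.
files <- list.files(book_dir, pattern = "\\.tex$", full.names = TRUE)
per_file <- sapply(files, function(f) {
  stri_stats_latex(stri_read_lines(f, encoding = "UTF-8"))
})

# Sum each statistic across all chapters.
total <- rowSums(per_file)
total[["Words"]]
```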

Notably, my publisher was satisfied with the above estimate.

Next time we'll take a look at ICU's very powerful transliteration services.

More information

For more information, check out the stringi package website and its online documentation.

For bug reports and feature requests, visit our GitHub profile.

Any comments and suggestions are warmly welcome.
