linux/fs/unicode/README.utf8data

The utf8data.h file in this directory is generated from the Unicode
Character Database for version 12.1.0 of the Unicode standard.

The full set of files can be found here:

  http://www.unicode.org/Public/12.1.0/ucd/

Note!

The URLs listed below are not stable.  That's because Unicode 12.1.0
has not been officially released yet; it is scheduled to be released
on May 8, 2019.  We are taking Unicode 12.1.0 a few weeks early because
it contains a new Japanese character which is required in order to
specify Japanese dates after May 1, 2019, when Crown Prince Naruhito
ascends to the Chrysanthemum Throne.  (Isn't internationalization fun?
The abdication of Emperor Akihito of Japan is requiring dozens of
software packages to be updated with only a month's notice.  :-)

We will update the URLs (and make any needed changes to the checksums)
after the final Unicode 12.1.0 is released.

Individual source links:

  https://www.unicode.org/Public/12.1.0/ucd/CaseFolding-12.1.0d2.txt
  https://www.unicode.org/Public/12.1.0/ucd/DerivedAge-12.1.0d3.txt
  https://www.unicode.org/Public/12.1.0/ucd/extracted/DerivedCombiningClass-12.1.0d2.txt
  https://www.unicode.org/Public/12.1.0/ucd/DerivedCoreProperties-12.1.0d2.txt
  https://www.unicode.org/Public/12.1.0/ucd/NormalizationCorrections-12.1.0d1.txt
  https://www.unicode.org/Public/12.1.0/ucd/NormalizationTest-12.1.0d3.txt
  https://www.unicode.org/Public/12.1.0/ucd/UnicodeData-12.1.0d2.txt
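
The draft links above carry a "-12.1.0dN" suffix, while the checksum
lists in this file use plain names such as CaseFolding.txt.  A minimal
sketch of that renaming, assuming each download is saved under the
unversioned name; the helper local_name() and its suffix pattern are
hypothetical and derived only from the seven links above:

```python
import posixpath
import re
from urllib.parse import urlparse

# Assumed draft suffix: "-<major>.<minor>.<update>d<N>" right before ".txt".
_DRAFT_SUFFIX = re.compile(r"-\d+\.\d+\.\d+d\d+(?=\.txt$)")


def local_name(url):
    """Return the unversioned file name for a UCD draft URL."""
    base = posixpath.basename(urlparse(url).path)
    # Names without a draft suffix are returned unchanged.
    return _DRAFT_SUFFIX.sub("", base)


if __name__ == "__main__":
    print(local_name(
        "https://www.unicode.org/Public/12.1.0/ucd/CaseFolding-12.1.0d2.txt"))
```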

md5sums (verify by running "md5sum -c README.utf8data"):

  900e76da1d822a160fd6b8c0b1d70094  CaseFolding.txt
  131256380bff4fea8ad4a851616f2f10  DerivedAge.txt
  e731a4089b30002144e107e3d6f8d1fa  DerivedCombiningClass.txt
  a47c9fbd7ff92a9b261ba9831e68778a  DerivedCoreProperties.txt
  fcab6dad15e440879d92f315978f93d3  NormalizationCorrections.txt
  f9ff1c55a60decf436100f791b44aa98  NormalizationTest.txt
  755f6af699f8c8d2d958da411f78f6c6  UnicodeData.txt

sha1sums (verify by running "sha1sum -c README.utf8data"):

  dc9245f6803c4ac99555c361f5052e0b13eb779b  CaseFolding.txt
  3281104f237184cdb5d869e86eb8573678ada7da  DerivedAge.txt
  2f5f995ccb96e0fa84b15151b35d5e2681535175  DerivedCombiningClass.txt
  5b8698a3fcd5018e1987f296b02e2c17e696415e  DerivedCoreProperties.txt
  cd83935fbc012345d8792d2c704f69497e753835  NormalizationCorrections.txt
  ea419aae505b337b0d99a83fa83fe58ddff7c19f  NormalizationTest.txt
  dc973c0fc93d6f09d9ab9f70d1c9f89c447f0526  UnicodeData.txt
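
Where md5sum and sha1sum are not available, the same check can be
approximated with a short script.  This is only a sketch, not part of
the kernel tree: read_sums() and verify() are hypothetical names, and
the line pattern assumes the layout of the two lists above (indent,
hex digest, whitespace, a *.txt name).

```python
import hashlib
import os
import re

# A checksum line: optional indent, 32 (md5) or 40 (sha1) hex digits,
# whitespace, then a *.txt file name.  Everything else is ignored.
_SUM_LINE = re.compile(r"^\s*([0-9a-f]{32}|[0-9a-f]{40})\s+(\S+\.txt)\s*$")


def read_sums(readme_text):
    """Collect (algorithm, digest, name) entries from README text."""
    entries = []
    for line in readme_text.splitlines():
        m = _SUM_LINE.match(line)
        if m:
            digest, name = m.groups()
            algo = "md5" if len(digest) == 32 else "sha1"
            entries.append((algo, digest, name))
    return entries


def verify(readme_text, directory="."):
    """Return the (algorithm, name) pairs whose digest does not match."""
    bad = []
    for algo, digest, name in read_sums(readme_text):
        path = os.path.join(directory, name)
        try:
            with open(path, "rb") as f:
                actual = hashlib.new(algo, f.read()).hexdigest()
        except OSError:
            actual = None  # a missing or unreadable file counts as a mismatch
        if actual != digest:
            bad.append((algo, name))
    return bad
```

Calling verify(open("README.utf8data").read()) from the directory that
holds the downloaded *.txt files should return an empty list when every
file matches.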


To update to a newer version of the Unicode standard, the latest
released version of the UCD can be found here:

  http://www.unicode.org/Public/UCD/latest/

To build the utf8data.h file, from a kernel tree that has been built,
cd to this directory (fs/unicode) and run this command:

	make C=../.. objdir=../.. utf8data.h.new

After sanity checking the newly generated utf8data.h.new file (the
version generated from the 12.1.0 UCD should be 4,109 lines long, and
have a total size of 324k) and/or comparing it with the older version
of utf8data.h, rename it to utf8data.h.
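
The line count and size check can be scripted.  The sketch below is an
illustration only: describe() and looks_sane() are hypothetical names,
"324k" is read as roughly 324 KiB, and the slack_kib tolerance is an
assumption rather than a documented limit.

```python
import os


def describe(path):
    """Return (line_count, size_in_bytes) for a generated header."""
    with open(path, "rb") as f:
        lines = sum(1 for _ in f)
    return lines, os.path.getsize(path)


def looks_sane(path, expected_lines=4109, expected_kib=324, slack_kib=4):
    """Compare a file against the figures quoted for the 12.1.0 UCD."""
    lines, size = describe(path)
    # The line count must match exactly; the size only approximately,
    # because the README gives it as a rounded figure.
    return lines == expected_lines and abs(size / 1024 - expected_kib) <= slack_kib
```

For a newer UCD version the expected figures will differ, so pass the
values recorded for that version instead of the defaults.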

If you are a kernel developer updating to a newer version of the
Unicode Character Database, please update this README.utf8data file
with the version of the UCD that was used and the md5sums and sha1sums
of the *.txt files before checking in the new versions of the
utf8data.h and README.utf8data files.

Commit history:

unicode: introduce UTF-8 character database

  The decomposition and casefolding of UTF-8 characters are described
  in a prefix tree in utf8data.h, which is generated from the Unicode
  Character Database (UCD), published by the Unicode Consortium, and
  should not be edited by hand.  The structures in utf8data.h are meant
  to be used for lookup operations by the unicode subsystem, when
  decoding a utf-8 string.

  mkutf8data.c is the source for a program that generates utf8data.h.
  It was written by Olaf Weber from SGI and originally proposed to be
  merged into Linux in 2014.  The original proposal performed the
  compatibility decomposition, NFKD, but the current version was
  modified by me to do canonical decomposition, NFD, as suggested by
  the community.

  The changes from the original submission are:

    * Rebase to mainline.
    * Fix out-of-tree-build.
    * Update makefile to build 11.0.0 ucd files.
    * drop references to xfs.
    * Convert NFKD to NFD.
    * Merge back robustness fixes from original patch.  Requested by
      Dave Chinner.

  The original submission is archived at:

    <https://linux-xfs.oss.sgi.narkive.com/Xx10wjVY/rfc-unicode-utf-8-support-for-xfs>

  The utf8data.h file can be regenerated using the instructions in
  fs/unicode/README.utf8data.

  - Notes on the update from 8.0.0 to 11.0:

    The structure of the ucd files and special cases have not
    experienced any changes between versions 8.0.0 and 11.0.0.  8.0.0
    saw the addition of Cherokee LC characters, which is an interesting
    case for case-folding.  The update is accompanied by new tests on
    the test_ucd module to catch specific cases.  No changes to
    mkutf8data script were required for the updates.

  Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
  Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  2019-04-25 20:38:44 +03:00

unicode: update unicode database unicode version 12.1.0

  Regenerate utf8data.h based on the latest UCD files and run tests
  against the latest version.

  Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
  Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  2019-04-25 20:59:17 +03:00