[File] recognize encoding for remote resource #37

@AcckiyGerman

Description

Original: datopian/datahub.io#105

const { File } = require('data.js')
// Load an ISO-8859-1 resource; the encoding is reported as 'utf-8':
> file = File.load('https://raw.githubusercontent.com/frictionlessdata/test-data/master/files/csv/encodings/iso8859.csv')
> file.encoding
'utf-8'

Acceptance criteria

  • File.load('https://raw.githubusercontent.com/frictionlessdata/test-data/master/files/csv/encodings/iso8859.csv').encoding == 'ISO-8859-1'
  • File.load('https://raw.githubusercontent.com/frictionlessdata/test-data/master/files/csv/encodings/western-macos-roman.csv').encoding == <macOS-roman-or-so>

Tasks

  • add tests
  • implement encoding detection

Analysis

We need to change this method:

class FileRemote extends File {
  ...
  get encoding() {
    return DEFAULT_ENCODING
  }
}

Analysis update

The encoding getter should:

  • connect to the remote resource
  • fetch a small portion of the raw data
  • try to detect the encoding from that sample
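
The steps above could be sketched roughly as follows. This is only an illustration, not the data.js implementation: readSample and guessEncoding are hypothetical names, and guessEncoding is a deliberately crude stand-in for a real charset detector (it only distinguishes "valid UTF-8" from "something else", and assumes ISO-8859-1 for the latter). Any Node.js Readable, such as an HTTP response body, could be passed as the stream.

```javascript
const { Readable } = require('node:stream');

// Hypothetical helper: collect at most maxBytes from the start of a stream.
async function readSample(stream, maxBytes = 4096) {
  const chunks = [];
  let total = 0;
  for await (const chunk of stream) {
    const buf = Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk);
    chunks.push(buf);
    total += buf.length;
    // Leaving the loop early destroys the stream, so no more data is pulled.
    if (total >= maxBytes) break;
  }
  return Buffer.concat(chunks).subarray(0, maxBytes);
}

// Placeholder detector (an assumption, NOT a real charset detector):
// report 'utf-8' if the sample decodes strictly as UTF-8, otherwise fall
// back to 'ISO-8859-1'. A real implementation would call a detection library.
function guessEncoding(sample) {
  try {
    // stream: true is meant to tolerate a multi-byte character that was
    // cut off at the end of the sample.
    new TextDecoder('utf-8', { fatal: true }).decode(sample, { stream: true });
    return 'utf-8';
  } catch (err) {
    return 'ISO-8859-1';
  }
}

// Local demonstration with an in-memory stream instead of a remote resource.
// The bytes spell "café, a" in ISO-8859-1 (0xe9 is "é").
const demo = Readable.from([Buffer.from([0x63, 0x61, 0x66, 0xe9, 0x2c, 0x20, 0x61])]);
readSample(demo).then((sample) => console.log(guessEncoding(sample)));
```

Because reading from a remote resource is asynchronous, a synchronous getter like the one above cannot simply await the sample; the detection would probably have to run when the file is loaded, or the API would have to return a promise.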

I tried to implement this approach using chardet.detectFileSync(), but it works only with local files: any argument is treated as a file name.

Possible solutions:

  • save part of the remote resource to a local temp file, then call chardet.detectFileSync(temp)
  • use another library that can detect the encoding from a remote stream
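
The first option might look something like the sketch below. detectViaTempFile is a hypothetical helper name, and the detector is passed in as a function so that the sketch does not depend on any particular library; with chardet installed, a wrapper around chardet.detectFileSync would be the natural argument. The sample is assumed to be a Buffer already fetched from the remote resource.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical helper: write the sample to a temp file, run a path-based
// detector on it, and always clean up afterwards.
function detectViaTempFile(sample, detectFileSync) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'encoding-sample-'));
  const tempPath = path.join(dir, 'sample.bin');
  try {
    fs.writeFileSync(tempPath, sample);
    return detectFileSync(tempPath);
  } finally {
    // Remove the temp directory even if the detector throws.
    fs.rmSync(dir, { recursive: true, force: true });
  }
}

// Possible usage, assuming chardet is installed (not run here):
// const chardet = require('chardet');
// const encoding = detectViaTempFile(sample, (p) => chardet.detectFileSync(p));
```

The temp file is an extra disk round trip for data that is already in memory. If the installed chardet version exposes a buffer-based detect() function, as its documentation suggests, passing the sample to it directly would avoid the temp file entirely.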
