Go Through a File as Fast as You Can

Some languages surprise our team with their IO performance. For example, we love Ruby, but Perl beats it on IO read speed most of the time, if not always. Surprising but not offensive. Of course, grep wins all battles because it doesn't store anything and it's brilliantly written.

Outside the bottom-feeding battles of dynamic languages, how would a language assumed to be "fast" behave? What if we wanted to do concurrent reads? Is there even an advantage? What would be a good design? What would be a fruitless one?

Also, this may serve as a nice interview quiz.

The Experiment

Maybe other experiments could be added ...

gen_100k.txt is a plain text file in UTF-8 with Unix line endings. It's 87MB, so the git clone might be slow. Sorry.
Go through the file as fast as you can and find all the phone numbers. This is not a test of i18n so the phone numbers will all look like this: (xxx) xxx-xxxx
Print the number of phone numbers found.
Verify your result by running tail gen_100k.txt and looking at the answer. When the file was generated, the answer was saved within the file itself.
Time it with time (see the Results section for how).
Add it to results.md in this repo.
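As a starting point, the counting step might be sketched like this in Ruby. This is a hypothetical baseline, not the repo's telephones.rb. It assumes each x in (xxx) xxx-xxxx is a digit, and the names PHONE and count_phones are made up for illustration.

```ruby
# Hypothetical baseline sketch; not the repo's telephones.rb.
# Assumes every "x" in (xxx) xxx-xxxx is a digit.
PHONE = /\(\d{3}\) \d{3}-\d{4}/

# Counts matches in any IO-like object, one line at a time,
# so the whole file never has to sit in memory at once.
def count_phones(io)
  total = 0
  io.each_line { |line| total += line.scan(PHONE).length }
  total
end

# Only runs when a file path is given on the command line.
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  File.open(ARGV[0], "r:UTF-8") { |f| puts count_phones(f) }
end
```

Reading line by line keeps memory flat on the 87MB file. Whether it is faster than slurping the whole file and scanning once is the kind of thing this experiment is meant to measure.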

I want these languages tested:

Perl
Ruby
Go

Later, I'd like these just for fun:

PHP - yes, php cli app, woooo
Scala
Python

Results

Redirect time's stderr output to results.md.

echo -e '### My Results!\n' >> results.md
{ time ruby telephones.rb gen_100k.txt; } 2>> results.md

Or substitute your script/program there. You should probably do at least three runs. We can do averages later.
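If you'd rather collect the three runs and the average in one go, a small helper along these lines could work. It is only a sketch: time_runs and average are hypothetical names, and it records wall-clock time only, whereas time also reports user and sys.

```ruby
# Hypothetical helper; records wall-clock seconds only.
# Runs the given block `runs` times and returns one duration per run.
def time_runs(runs = 3)
  Array.new(runs) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
end

# Arithmetic mean of a list of durations.
def average(durations)
  durations.sum / durations.length.to_f
end

# Example: ruby avg_time.rb ruby telephones.rb gen_100k.txt
# (avg_time.rb is an assumed file name for this sketch.)
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  times = time_runs(3) { system(*ARGV, out: File::NULL) }
  times.each_with_index { |t, i| puts format("run %d: %.3fs", i + 1, t) }
  puts format("average: %.3fs", average(times))
end
```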

The Challenge

IO speed is fixed (maybe). The real challenge is designing the program to get through the file quickly while handing lines off (async or otherwise) to a processing layer. At least, that's what I imagine. I'm not sure whether concurrent or parallel IO is even a thing here.
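One shape the "read here, process there" idea could take is a bounded queue between the reader and a few worker threads. The sketch below is an assumption about what such a design might look like, not something from this repo; count_phones_concurrently, PHONE_PATTERN, and the default sizes are all made up. In MRI the global VM lock keeps the regex work from running in parallel across threads, so this may well turn out to be one of the fruitless designs.

```ruby
# Hypothetical design sketch; assumes "x" in (xxx) xxx-xxxx is a digit.
PHONE_PATTERN = /\(\d{3}\) \d{3}-\d{4}/

# The calling thread reads batches of lines and pushes them onto a bounded
# queue; worker threads pop batches and count matches.
def count_phones_concurrently(io, workers: 4, batch_size: 10_000)
  # Bounded so the reader cannot run far ahead of the workers.
  queue = SizedQueue.new(workers * 2)

  counters = Array.new(workers) do
    Thread.new do
      count = 0
      # A nil batch is the stop signal for this worker.
      while (batch = queue.pop)
        batch.each { |line| count += line.scan(PHONE_PATTERN).length }
      end
      count
    end
  end

  io.each_slice(batch_size) { |batch| queue << batch }
  workers.times { queue << nil } # one stop signal per worker

  # Thread#value waits for the thread and returns its block's result.
  counters.sum(&:value)
end

if __FILE__ == $PROGRAM_NAME && ARGV[0]
  File.open(ARGV[0], "r:UTF-8") { |f| puts count_phones_concurrently(f) }
end
```

Batching matters here: pushing one line at a time would spend most of the run on queue synchronization rather than on reading or matching.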

Originally I was measuring with a 1.4MB file, but it produced sub-second runs in Ruby, which is too short to compare. A generator.rb file is included, but I think we should all work off a known file. We can change this later with the generator if we want (add more lines, etc.).

