diff --git a/.coveragerc b/.coveragerc deleted file mode 100644 index f5c8b701a79a8..0000000000000 --- a/.coveragerc +++ /dev/null @@ -1,29 +0,0 @@ -# .coveragerc to control coverage.py -[run] -branch = False -omit = */tests/* -plugins = Cython.Coverage - -[report] -# Regexes for lines to exclude from consideration -exclude_lines = - # Have to re-enable the standard pragma - pragma: no cover - - # Don't complain about missing debug-only code: - def __repr__ - if self\.debug - - # Don't complain if tests don't hit defensive assertion code: - raise AssertionError - raise NotImplementedError - - # Don't complain if non-runnable code isn't run: - if 0: - if __name__ == .__main__.: - -ignore_errors = False -show_missing = True - -[html] -directory = coverage_html_report diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md index 95729f845ff5c..21df1a3aacd59 100644 --- a/.github/CONTRIBUTING.md +++ b/.github/CONTRIBUTING.md @@ -1,24 +1,23 @@ -Contributing to pandas -====================== +# Contributing to pandas Whether you are a novice or experienced software developer, all contributions and suggestions are welcome! -Our main contribution docs can be found [here](https://github.com/pandas-dev/pandas/blob/master/doc/source/contributing.rst), but if you do not want to read it in its entirety, we will summarize the main ways in which you can contribute and point to relevant places in the docs for further information. +Our main contributing guide can be found [in this repo](https://github.com/pandas-dev/pandas/blob/master/doc/source/contributing.rst) or [on the website](https://pandas-docs.github.io/pandas-docs-travis/contributing.html). If you do not want to read it in its entirety, we will summarize the main ways in which you can contribute and point to relevant sections of that document for further information. + +## Getting Started -Getting Started ---------------- If you are looking to contribute to the *pandas* codebase, the best place to start is the [GitHub "issues" tab](https://github.com/pandas-dev/pandas/issues). This is also a great place for filing bug reports and making suggestions for ways in which we can improve the code and documentation. -If you have additional questions, feel free to ask them on the [mailing list](https://groups.google.com/forum/?fromgroups#!forum/pydata) or on [Gitter](https://gitter.im/pydata/pandas). Further information can also be found in our [Getting Started](https://github.com/pandas-dev/pandas/blob/master/doc/source/contributing.rst#where-to-start) section of our main contribution doc. +If you have additional questions, feel free to ask them on the [mailing list](https://groups.google.com/forum/?fromgroups#!forum/pydata) or on [Gitter](https://gitter.im/pydata/pandas). Further information can also be found in the "[Where to start?](https://github.com/pandas-dev/pandas/blob/master/doc/source/contributing.rst#where-to-start)" section. + +## Filing Issues + +If you notice a bug in the code or documentation, or have suggestions for how we can improve either, feel free to create an issue on the [GitHub "issues" tab](https://github.com/pandas-dev/pandas/issues) using [GitHub's "issue" form](https://github.com/pandas-dev/pandas/issues/new). The form contains some questions that will help us best address your issue. 
For more information regarding how to file issues against *pandas*, please refer to the "[Bug reports and enhancement requests](https://github.com/pandas-dev/pandas/blob/master/doc/source/contributing.rst#bug-reports-and-enhancement-requests)" section. -Filing Issues -------------- -If you notice a bug in the code or in docs or have suggestions for how we can improve either, feel free to create an issue on the [GitHub "issues" tab](https://github.com/pandas-dev/pandas/issues) using [GitHub's "issue" form](https://github.com/pandas-dev/pandas/issues/new). The form contains some questions that will help us best address your issue. For more information regarding how to file issues against *pandas*, please refer to the [Bug reports and enhancement requests](https://github.com/pandas-dev/pandas/blob/master/doc/source/contributing.rst#bug-reports-and-enhancement-requests) section of our main contribution doc. +## Contributing to the Codebase -Contributing to the Codebase ----------------------------- -The code is hosted on [GitHub](https://www.github.com/pandas-dev/pandas), so you will need to use [Git](http://git-scm.com/) to clone the project and make changes to the codebase. Once you have obtained a copy of the code, you should create a development environment that is separate from your existing Python environment so that you can make and test changes without compromising your own work environment. For more information, please refer to our [Working with the code](https://github.com/pandas-dev/pandas/blob/master/doc/source/contributing.rst#working-with-the-code) section of our main contribution docs. +The code is hosted on [GitHub](https://www.github.com/pandas-dev/pandas), so you will need to use [Git](http://git-scm.com/) to clone the project and make changes to the codebase. Once you have obtained a copy of the code, you should create a development environment that is separate from your existing Python environment so that you can make and test changes without compromising your own work environment. For more information, please refer to the "[Working with the code](https://github.com/pandas-dev/pandas/blob/master/doc/source/contributing.rst#working-with-the-code)" section. -Before submitting your changes for review, make sure to check that your changes do not break any tests. You can find more information about our test suites can be found [here](https://github.com/pandas-dev/pandas/blob/master/doc/source/contributing.rst#test-driven-development-code-writing). We also have guidelines regarding coding style that will be enforced during testing. Details about coding style can be found [here](https://github.com/pandas-dev/pandas/blob/master/doc/source/contributing.rst#code-standards). +Before submitting your changes for review, make sure to check that your changes do not break any tests. You can find more information about our test suites in the "[Test-driven development/code writing](https://github.com/pandas-dev/pandas/blob/master/doc/source/contributing.rst#test-driven-development-code-writing)" section. We also have guidelines regarding coding style that will be enforced during testing, which can be found in the "[Code standards](https://github.com/pandas-dev/pandas/blob/master/doc/source/contributing.rst#code-standards)" section. -Once your changes are ready to be submitted, make sure to push your changes to GitHub before creating a pull request. 
Details about how to do that can be found in the [Contributing your changes to pandas](https://github.com/pandas-dev/pandas/blob/master/doc/source/contributing.rst#contributing-your-changes-to-pandas) section of our main contribution docs. We will review your changes, and you will most likely be asked to make additional changes before it is finally ready to merge. However, once it's ready, we will merge it, and you will have successfully contributed to the codebase! +Once your changes are ready to be submitted, make sure to push your changes to GitHub before creating a pull request. Details about how to do that can be found in the "[Contributing your changes to pandas](https://github.com/pandas-dev/pandas/blob/master/doc/source/contributing.rst#contributing-your-changes-to-pandas)" section. We will review your changes, and you will most likely be asked to make additional changes before it is finally ready to merge. However, once it's ready, we will merge it, and you will have successfully contributed to the codebase! diff --git a/.gitignore b/.gitignore index 96b1f945870de..4598714db6c6a 100644 --- a/.gitignore +++ b/.gitignore @@ -62,6 +62,8 @@ dist coverage.xml coverage_html_report *.pytest_cache +# hypothesis test database +.hypothesis/ # OS generated files # ###################### @@ -99,6 +101,7 @@ asv_bench/pandas/ # Documentation generated files # ################################# doc/source/generated +doc/source/api/generated doc/source/_static doc/source/vbench doc/source/vbench.rst @@ -107,6 +110,5 @@ doc/build/html/index.html # Windows specific leftover: doc/tmp.sv doc/source/styled.xlsx -doc/source/templates/ env/ doc/source/savefig/ diff --git a/.pep8speaks.yml b/.pep8speaks.yml index fda26d87bf7f6..cbcb098c47125 100644 --- a/.pep8speaks.yml +++ b/.pep8speaks.yml @@ -3,10 +3,17 @@ scanner: diff_only: True # If True, errors caused by only the patch are shown +# Opened issue in pep8speaks, so we can directly use the config in setup.cfg +# (and avoid having to duplicate it here): +# https://github.com/OrkoHunter/pep8speaks/issues/95 + pycodestyle: max-line-length: 79 - ignore: # Errors and warnings to ignore + ignore: + - W503, # line break before binary operator + - W504, # line break after binary operator - E402, # module level import not at top of file - E731, # do not assign a lambda expression, use a def - - E741, # do not use variables named 'l', 'O', or 'I' - - W503 # line break before binary operator + - C406, # Unnecessary list literal - rewrite as a dict literal. + - C408, # Unnecessary dict call - rewrite as a literal. + - C409 # Unnecessary list passed to tuple() - rewrite as a tuple literal. 
diff --git a/.travis.yml b/.travis.yml index 2d2a0bc019c80..e478d71a5c350 100644 --- a/.travis.yml +++ b/.travis.yml @@ -23,70 +23,51 @@ env: git: # for cloning - depth: 1000 + depth: 2000 matrix: fast_finish: true exclude: # Exclude the default Python 3.5 build - python: 3.5 - include: - - os: osx - language: generic - env: - - JOB="3.5, OSX" ENV_FILE="ci/travis-35-osx.yaml" TEST_ARGS="--skip-slow --skip-network" + include: - dist: trusty env: - - JOB="3.7" ENV_FILE="ci/travis-37.yaml" TEST_ARGS="--skip-slow --skip-network" + - JOB="3.7" ENV_FILE="ci/deps/travis-37.yaml" PATTERN="(not slow and not network)" - dist: trusty env: - - JOB="2.7, locale, slow, old NumPy" ENV_FILE="ci/travis-27-locale.yaml" LOCALE_OVERRIDE="zh_CN.UTF-8" SLOW=true - addons: - apt: - packages: - - language-pack-zh-hans - - dist: trusty - env: - - JOB="2.7, lint" ENV_FILE="ci/travis-27.yaml" TEST_ARGS="--skip-slow" LINT=true + - JOB="2.7" ENV_FILE="ci/deps/travis-27.yaml" PATTERN="(not slow or (single and db))" addons: apt: packages: - python-gtk2 + - dist: trusty env: - - JOB="3.6, coverage" ENV_FILE="ci/travis-36.yaml" TEST_ARGS="--skip-slow --skip-network" PANDAS_TESTING_MODE="deprecate" COVERAGE=true - # In allow_failures + - JOB="3.6, locale" ENV_FILE="ci/deps/travis-36-locale.yaml" PATTERN="((not slow and not network) or (single and db))" LOCALE_OVERRIDE="zh_CN.UTF-8" + - dist: trusty env: - - JOB="3.6, slow" ENV_FILE="ci/travis-36-slow.yaml" SLOW=true + - JOB="3.6, coverage" ENV_FILE="ci/deps/travis-36.yaml" PATTERN="((not slow and not network) or (single and db))" PANDAS_TESTING_MODE="deprecate" COVERAGE=true + # In allow_failures - dist: trusty env: - - JOB="3.6, NumPy dev" ENV_FILE="ci/travis-36-numpydev.yaml" TEST_ARGS="--skip-slow --skip-network" PANDAS_TESTING_MODE="deprecate" - addons: - apt: - packages: - - xsel + - JOB="3.6, slow" ENV_FILE="ci/deps/travis-36-slow.yaml" PATTERN="slow" + # In allow_failures - dist: trusty env: - - JOB="3.6, doc" ENV_FILE="ci/travis-36-doc.yaml" DOC=true + - JOB="3.6, doc" ENV_FILE="ci/deps/travis-36-doc.yaml" DOC=true allow_failures: - dist: trusty env: - - JOB="3.6, slow" ENV_FILE="ci/travis-36-slow.yaml" SLOW=true - - dist: trusty - env: - - JOB="3.6, NumPy dev" ENV_FILE="ci/travis-36-numpydev.yaml" TEST_ARGS="--skip-slow --skip-network" PANDAS_TESTING_MODE="deprecate" - addons: - apt: - packages: - - xsel + - JOB="3.6, slow" ENV_FILE="ci/deps/travis-36-slow.yaml" PATTERN="slow" - dist: trusty env: - - JOB="3.6, doc" ENV_FILE="ci/travis-36-doc.yaml" DOC=true + - JOB="3.6, doc" ENV_FILE="ci/deps/travis-36-doc.yaml" DOC=true before_install: - echo "before_install" @@ -100,6 +81,12 @@ before_install: - uname -a - git --version - git tag + # Because travis runs on Google Cloud and has a /etc/boto.cfg, + # it breaks moto import, see: + # https://github.com/spulec/moto/issues/1771 + # https://github.com/boto/boto/issues/3741 + # This overrides travis and tells it to look nowhere. 
+ - export BOTO_CONFIG=/dev/null install: - echo "install start" @@ -115,24 +102,17 @@ before_script: script: - echo "script start" - - ci/run_build_docs.sh - - ci/script_single.sh - - ci/script_multi.sh - - ci/lint.sh - - echo "checking imports" - - source activate pandas && python ci/check_imports.py - - echo "script done" - -after_success: - - ci/upload_coverage.sh + - source activate pandas-dev + - ci/build_docs.sh + - ci/run_tests.sh after_script: - echo "after_script start" - - source activate pandas && pushd /tmp && python -c "import pandas; pandas.show_versions();" && popd - - if [ -e /tmp/single.xml ]; then - ci/print_skipped.py /tmp/single.xml; + - source activate pandas-dev && pushd /tmp && python -c "import pandas; pandas.show_versions();" && popd + - if [ -e test-data-single.xml ]; then + ci/print_skipped.py test-data-single.xml; fi - - if [ -e /tmp/multiple.xml ]; then - ci/print_skipped.py /tmp/multiple.xml; + - if [ -e test-data-multiple.xml ]; then + ci/print_skipped.py test-data-multiple.xml; fi - echo "after_script done" diff --git a/LICENSES/DATEUTIL_LICENSE b/LICENSES/DATEUTIL_LICENSE new file mode 100644 index 0000000000000..6053d35cfc60b --- /dev/null +++ b/LICENSES/DATEUTIL_LICENSE @@ -0,0 +1,54 @@ +Copyright 2017- Paul Ganssle +Copyright 2017- dateutil contributors (see AUTHORS file) + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. + +The above license applies to all contributions after 2017-12-01, as well as +all contributions that have been re-licensed (see AUTHORS file for the list of +contributors who have re-licensed their code). +-------------------------------------------------------------------------------- +dateutil - Extensions to the standard Python datetime module. + +Copyright (c) 2003-2011 - Gustavo Niemeyer +Copyright (c) 2012-2014 - Tomi Pieviläinen +Copyright (c) 2014-2016 - Yaron de Leeuw +Copyright (c) 2015- - Paul Ganssle +Copyright (c) 2015- - dateutil contributors (see AUTHORS file) + +All rights reserved. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are met: + + * Redistributions of source code must retain the above copyright notice, + this list of conditions and the following disclaimer. + * Redistributions in binary form must reproduce the above copyright notice, + this list of conditions and the following disclaimer in the documentation + and/or other materials provided with the distribution. + * Neither the name of the copyright holder nor the names of its + contributors may be used to endorse or promote products derived from + this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS +"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT +LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR +A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF +LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING +NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS +SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +The above BSD License Applies to all code, even that also covered by Apache 2.0. diff --git a/LICENSES/MUSL_LICENSE b/LICENSES/MUSL_LICENSE new file mode 100644 index 0000000000000..a8833d4bc4744 --- /dev/null +++ b/LICENSES/MUSL_LICENSE @@ -0,0 +1,132 @@ +musl as a whole is licensed under the following standard MIT license: + +---------------------------------------------------------------------- +Copyright © 2005-2014 Rich Felker, et al. + +Permission is hereby granted, free of charge, to any person obtaining +a copy of this software and associated documentation files (the +"Software"), to deal in the Software without restriction, including +without limitation the rights to use, copy, modify, merge, publish, +distribute, sublicense, and/or sell copies of the Software, and to +permit persons to whom the Software is furnished to do so, subject to +the following conditions: + +The above copyright notice and this permission notice shall be +included in all copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. +IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY +CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, +TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE +SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. +---------------------------------------------------------------------- + +Authors/contributors include: + +Anthony G. Basile +Arvid Picciani +Bobby Bingham +Boris Brezillon +Brent Cook +Chris Spiegel +Clément Vasseur +Emil Renner Berthing +Hiltjo Posthuma +Isaac Dunham +Jens Gustedt +Jeremy Huntwork +John Spencer +Justin Cormack +Luca Barbato +Luka Perkov +M Farkas-Dyck (Strake) +Michael Forney +Nicholas J. Kain +orc +Pascal Cuoq +Pierre Carrier +Rich Felker +Richard Pennington +sin +Solar Designer +Stefan Kristiansson +Szabolcs Nagy +Timo Teräs +Valentin Ochs +William Haddon + +Portions of this software are derived from third-party works licensed +under terms compatible with the above MIT license: + +The TRE regular expression implementation (src/regex/reg* and +src/regex/tre*) is Copyright © 2001-2008 Ville Laurikari and licensed +under a 2-clause BSD license (license text in the source files). The +included version has been heavily modified by Rich Felker in 2012, in +the interests of size, simplicity, and namespace cleanliness. + +Much of the math library code (src/math/* and src/complex/*) is +Copyright © 1993,2004 Sun Microsystems or +Copyright © 2003-2011 David Schultz or +Copyright © 2003-2009 Steven G. Kargl or +Copyright © 2003-2009 Bruce D. Evans or +Copyright © 2008 Stephen L. Moshier +and labelled as such in comments in the individual source files. All +have been licensed under extremely permissive terms. 
+ +The ARM memcpy code (src/string/armel/memcpy.s) is Copyright © 2008 +The Android Open Source Project and is licensed under a two-clause BSD +license. It was taken from Bionic libc, used on Android. + +The implementation of DES for crypt (src/misc/crypt_des.c) is +Copyright © 1994 David Burren. It is licensed under a BSD license. + +The implementation of blowfish crypt (src/misc/crypt_blowfish.c) was +originally written by Solar Designer and placed into the public +domain. The code also comes with a fallback permissive license for use +in jurisdictions that may not recognize the public domain. + +The smoothsort implementation (src/stdlib/qsort.c) is Copyright © 2011 +Valentin Ochs and is licensed under an MIT-style license. + +The BSD PRNG implementation (src/prng/random.c) and XSI search API +(src/search/*.c) functions are Copyright © 2011 Szabolcs Nagy and +licensed under following terms: "Permission to use, copy, modify, +and/or distribute this code for any purpose with or without fee is +hereby granted. There is no warranty." + +The x86_64 port was written by Nicholas J. Kain. Several files (crt) +were released into the public domain; others are licensed under the +standard MIT license terms at the top of this file. See individual +files for their copyright status. + +The mips and microblaze ports were originally written by Richard +Pennington for use in the ellcc project. The original code was adapted +by Rich Felker for build system and code conventions during upstream +integration. It is licensed under the standard MIT terms. + +The powerpc port was also originally written by Richard Pennington, +and later supplemented and integrated by John Spencer. It is licensed +under the standard MIT terms. + +All other files which have no copyright comments are original works +produced specifically for use as part of this library, written either +by Rich Felker, the main author of the library, or by one or more +contibutors listed above. Details on authorship of individual files +can be found in the git version control history of the project. The +omission of copyright and license comments in each file is in the +interest of source tree size. + +All public header files (include/* and arch/*/bits/*) should be +treated as Public Domain as they intentionally contain no content +which can be covered by copyright. Some source modules may fall in +this category as well. If you believe that a file is so trivial that +it should be in the Public Domain, please contact the authors and +request an explicit statement releasing it from copyright. + +The following files are trivial, believed not to be copyrightable in +the first place, and hereby explicitly released to the Public Domain: + +All public headers: include/*, arch/*/bits/* +Startup files: crt/* diff --git a/Makefile b/Makefile index 4a82566cf726e..d2bd067950fd0 100644 --- a/Makefile +++ b/Makefile @@ -13,7 +13,7 @@ build: clean_pyc python setup.py build_ext --inplace lint-diff: - git diff master --name-only -- "*.py" | grep "pandas" | xargs flake8 + git diff upstream/master --name-only -- "*.py" | xargs flake8 develop: build -python setup.py develop diff --git a/README.md b/README.md index 3c8fe57400099..ce22818705865 100644 --- a/README.md +++ b/README.md @@ -48,16 +48,8 @@ - - circleci build status - - - - - - - - appveyor build status + + Azure Pipelines build status @@ -89,7 +81,7 @@ -## What is it +## What is it? 
**pandas** is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both @@ -97,7 +89,7 @@ easy and intuitive. It aims to be the fundamental high-level building block for doing practical, **real world** data analysis in Python. Additionally, it has the broader goal of becoming **the most powerful and flexible open source data analysis / manipulation tool available in any language**. It is already well on -its way toward this goal. +its way towards this goal. ## Main Features Here are just a few of the things that pandas does well: @@ -171,7 +163,7 @@ pip install pandas ``` ## Dependencies -- [NumPy](https://www.numpy.org): 1.9.0 or higher +- [NumPy](https://www.numpy.org): 1.12.0 or higher - [python-dateutil](https://labix.org/python-dateutil): 2.5.0 or higher - [pytz](https://pythonhosted.org/pytz): 2011k or higher @@ -231,9 +223,9 @@ Most development discussion is taking place on github in this repo. Further, the All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome. -A detailed overview on how to contribute can be found in the **[contributing guide.](https://pandas.pydata.org/pandas-docs/stable/contributing.html)** +A detailed overview on how to contribute can be found in the **[contributing guide](https://pandas-docs.github.io/pandas-docs-travis/contributing.html)**. There is also an [overview](.github/CONTRIBUTING.md) on GitHub. -If you are simply looking to start working with the pandas codebase, navigate to the [GitHub “issues” tab](https://github.com/pandas-dev/pandas/issues) and start looking through interesting issues. There are a number of issues listed under [Docs](https://github.com/pandas-dev/pandas/issues?labels=Docs&sort=updated&state=open) and [good first issue](https://github.com/pandas-dev/pandas/issues?labels=good+first+issue&sort=updated&state=open) where you could start out. +If you are simply looking to start working with the pandas codebase, navigate to the [GitHub "issues" tab](https://github.com/pandas-dev/pandas/issues) and start looking through interesting issues. There are a number of issues listed under [Docs](https://github.com/pandas-dev/pandas/issues?labels=Docs&sort=updated&state=open) and [good first issue](https://github.com/pandas-dev/pandas/issues?labels=good+first+issue&sort=updated&state=open) where you could start out. You can also triage issues which may include reproducing bug reports, or asking for vital information such as version numbers or reproduction instructions. If you would like to start triaging issues, one easy way to get started is to [subscribe to pandas on CodeTriage](https://www.codetriage.com/pandas-dev/pandas). diff --git a/appveyor.yml b/appveyor.yml deleted file mode 100644 index c6199c1493f22..0000000000000 --- a/appveyor.yml +++ /dev/null @@ -1,91 +0,0 @@ -# With infos from -# http://tjelvarolsson.com/blog/how-to-continuously-test-your-python-code-on-windows-using-appveyor/ -# https://packaging.python.org/en/latest/appveyor/ -# https://github.com/rmcgibbo/python-appveyor-conda-example - -# Backslashes in quotes need to be escaped: \ -> "\\" - -matrix: - fast_finish: true # immediately finish build once one of the jobs fails. 
- -environment: - global: - # SDK v7.0 MSVC Express 2008's SetEnv.cmd script will fail if the - # /E:ON and /V:ON options are not enabled in the batch script interpreter - # See: http://stackoverflow.com/a/13751649/163740 - CMD_IN_ENV: "cmd /E:ON /V:ON /C .\\ci\\run_with_env.cmd" - clone_folder: C:\projects\pandas - PANDAS_TESTING_MODE: "deprecate" - - matrix: - - - CONDA_ROOT: "C:\\Miniconda3_64" - APPVEYOR_BUILD_WORKER_IMAGE: Visual Studio 2017 - PYTHON_VERSION: "3.6" - PYTHON_ARCH: "64" - CONDA_PY: "36" - CONDA_NPY: "113" - - - CONDA_ROOT: "C:\\Miniconda3_64" - APPVEYOR_BUILD_WORKER_IMAGE: Visual Studio 2015 - PYTHON_VERSION: "2.7" - PYTHON_ARCH: "64" - CONDA_PY: "27" - CONDA_NPY: "110" - -# We always use a 64-bit machine, but can build x86 distributions -# with the PYTHON_ARCH variable (which is used by CMD_IN_ENV). -platform: - - x64 - -# all our python builds have to happen in tests_script... -build: false - -install: - # cancel older builds for the same PR - - ps: if ($env:APPVEYOR_PULL_REQUEST_NUMBER -and $env:APPVEYOR_BUILD_NUMBER -ne ((Invoke-RestMethod ` - https://ci.appveyor.com/api/projects/$env:APPVEYOR_ACCOUNT_NAME/$env:APPVEYOR_PROJECT_SLUG/history?recordsNumber=50).builds | ` - Where-Object pullRequestId -eq $env:APPVEYOR_PULL_REQUEST_NUMBER)[0].buildNumber) { ` - throw "There are newer queued builds for this pull request, failing early." } - - # this installs the appropriate Miniconda (Py2/Py3, 32/64 bit) - # updates conda & installs: conda-build jinja2 anaconda-client - - powershell .\ci\install.ps1 - - SET PATH=%CONDA_ROOT%;%CONDA_ROOT%\Scripts;%PATH% - - echo "install" - - cd - - ls -ltr - - git tag --sort v:refname - - # this can conflict with git - - cmd: rmdir C:\cygwin /s /q - - # install our build environment - - cmd: conda config --set show_channel_urls true --set always_yes true --set changeps1 false - - cmd: conda update -q conda - - cmd: conda config --set ssl_verify false - - # add the pandas channel *before* defaults to have defaults take priority - - cmd: conda config --add channels conda-forge - - cmd: conda config --add channels pandas - - cmd: conda config --remove channels defaults - - cmd: conda config --add channels defaults - - # this is now the downloaded conda... 
- - cmd: conda info -a - - # create our env - - cmd: conda env create -q -n pandas --file=ci\appveyor-%CONDA_PY%.yaml - - cmd: activate pandas - - cmd: conda list -n pandas - # uninstall pandas if it's present - - cmd: conda remove pandas -y --force & exit 0 - - cmd: pip uninstall -y pandas & exit 0 - - # build em using the local source checkout in the correct windows env - - cmd: '%CMD_IN_ENV% python setup.py build_ext --inplace' - -test_script: - # tests - - cmd: activate pandas - - cmd: test.bat diff --git a/asv_bench/benchmarks/algorithms.py b/asv_bench/benchmarks/algorithms.py index cccd38ef11251..34fb161e5afcb 100644 --- a/asv_bench/benchmarks/algorithms.py +++ b/asv_bench/benchmarks/algorithms.py @@ -1,97 +1,94 @@ -import warnings from importlib import import_module import numpy as np + import pandas as pd from pandas.util import testing as tm + for imp in ['pandas.util', 'pandas.tools.hashing']: try: hashing = import_module(imp) break - except: + except (ImportError, TypeError, ValueError): pass -from .pandas_vb_common import setup # noqa - class Factorize(object): - goal_time = 0.2 + params = [[True, False], ['int', 'uint', 'float', 'string']] + param_names = ['sort', 'dtype'] - params = [True, False] - param_names = ['sort'] - - def setup(self, sort): + def setup(self, sort, dtype): N = 10**5 - self.int_idx = pd.Int64Index(np.arange(N).repeat(5)) - self.float_idx = pd.Float64Index(np.random.randn(N).repeat(5)) - self.string_idx = tm.makeStringIndex(N) - - def time_factorize_int(self, sort): - self.int_idx.factorize(sort=sort) - - def time_factorize_float(self, sort): - self.float_idx.factorize(sort=sort) + data = {'int': pd.Int64Index(np.arange(N).repeat(5)), + 'uint': pd.UInt64Index(np.arange(N).repeat(5)), + 'float': pd.Float64Index(np.random.randn(N).repeat(5)), + 'string': tm.makeStringIndex(N).repeat(5)} + self.idx = data[dtype] - def time_factorize_string(self, sort): - self.string_idx.factorize(sort=sort) + def time_factorize(self, sort, dtype): + self.idx.factorize(sort=sort) -class Duplicated(object): - - goal_time = 0.2 +class FactorizeUnique(object): - params = ['first', 'last', False] - param_names = ['keep'] + params = [[True, False], ['int', 'uint', 'float', 'string']] + param_names = ['sort', 'dtype'] - def setup(self, keep): + def setup(self, sort, dtype): N = 10**5 - self.int_idx = pd.Int64Index(np.arange(N).repeat(5)) - self.float_idx = pd.Float64Index(np.random.randn(N).repeat(5)) - self.string_idx = tm.makeStringIndex(N) - - def time_duplicated_int(self, keep): - self.int_idx.duplicated(keep=keep) + data = {'int': pd.Int64Index(np.arange(N)), + 'uint': pd.UInt64Index(np.arange(N)), + 'float': pd.Float64Index(np.arange(N)), + 'string': tm.makeStringIndex(N)} + self.idx = data[dtype] + assert self.idx.is_unique - def time_duplicated_float(self, keep): - self.float_idx.duplicated(keep=keep) + def time_factorize(self, sort, dtype): + self.idx.factorize(sort=sort) - def time_duplicated_string(self, keep): - self.string_idx.duplicated(keep=keep) +class Duplicated(object): -class DuplicatedUniqueIndex(object): - - goal_time = 0.2 + params = [['first', 'last', False], ['int', 'uint', 'float', 'string']] + param_names = ['keep', 'dtype'] - def setup(self): + def setup(self, keep, dtype): N = 10**5 - self.idx_int_dup = pd.Int64Index(np.arange(N * 5)) + data = {'int': pd.Int64Index(np.arange(N).repeat(5)), + 'uint': pd.UInt64Index(np.arange(N).repeat(5)), + 'float': pd.Float64Index(np.random.randn(N).repeat(5)), + 'string': tm.makeStringIndex(N).repeat(5)} + self.idx = 
data[dtype] # cache is_unique - self.idx_int_dup.is_unique + self.idx.is_unique - def time_duplicated_unique_int(self): - self.idx_int_dup.duplicated() + def time_duplicated(self, keep, dtype): + self.idx.duplicated(keep=keep) -class Match(object): +class DuplicatedUniqueIndex(object): - goal_time = 0.2 + params = ['int', 'uint', 'float', 'string'] + param_names = ['dtype'] - def setup(self): - self.uniques = tm.makeStringIndex(1000).values - self.all = self.uniques.repeat(10) + def setup(self, dtype): + N = 10**5 + data = {'int': pd.Int64Index(np.arange(N)), + 'uint': pd.UInt64Index(np.arange(N)), + 'float': pd.Float64Index(np.random.randn(N)), + 'string': tm.makeStringIndex(N)} + self.idx = data[dtype] + # cache is_unique + self.idx.is_unique - def time_match_string(self): - with warnings.catch_warnings(record=True): - pd.match(self.all, self.uniques) + def time_duplicated_unique(self, dtype): + self.idx.duplicated() class Hashing(object): - goal_time = 0.2 - def setup_cache(self): N = 10**5 @@ -126,3 +123,23 @@ def time_series_timedeltas(self, df): def time_series_dates(self, df): hashing.hash_pandas_object(df['dates']) + + +class Quantile(object): + params = [[0, 0.5, 1], + ['linear', 'nearest', 'lower', 'higher', 'midpoint'], + ['float', 'int', 'uint']] + param_names = ['quantile', 'interpolation', 'dtype'] + + def setup(self, quantile, interpolation, dtype): + N = 10**5 + data = {'int': np.arange(N), + 'uint': np.arange(N).astype(np.uint64), + 'float': np.random.randn(N)} + self.idx = pd.Series(data[dtype].repeat(5)) + + def time_quantile(self, quantile, interpolation, dtype): + self.idx.quantile(quantile, interpolation=interpolation) + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/attrs_caching.py b/asv_bench/benchmarks/attrs_caching.py index 48f0b7d71144c..d061755208c9e 100644 --- a/asv_bench/benchmarks/attrs_caching.py +++ b/asv_bench/benchmarks/attrs_caching.py @@ -5,13 +5,9 @@ except ImportError: from pandas.util.decorators import cache_readonly -from .pandas_vb_common import setup # noqa - class DataFrameAttributes(object): - goal_time = 0.2 - def setup(self): self.df = DataFrame(np.random.randn(10, 6)) self.cur_index = self.df.index @@ -25,8 +21,6 @@ def time_set_index(self): class CacheReadonly(object): - goal_time = 0.2 - def setup(self): class Foo: @@ -38,3 +32,6 @@ def prop(self): def time_cache_readonly(self): self.obj.prop + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/binary_ops.py b/asv_bench/benchmarks/binary_ops.py index cc8766e1fa39c..22b8ed80f3d07 100644 --- a/asv_bench/benchmarks/binary_ops.py +++ b/asv_bench/benchmarks/binary_ops.py @@ -6,13 +6,9 @@ except ImportError: import pandas.computation.expressions as expr -from .pandas_vb_common import setup # noqa - class Ops(object): - goal_time = 0.2 - params = [[True, False], ['default', 1]] param_names = ['use_numexpr', 'threads'] @@ -44,8 +40,6 @@ def teardown(self, use_numexpr, threads): class Ops2(object): - goal_time = 0.2 - def setup(self): N = 10**3 self.df = DataFrame(np.random.randn(N, N)) @@ -58,6 +52,8 @@ def setup(self): np.iinfo(np.int16).max, size=(N, N))) + self.s = Series(np.random.randn(N)) + # Division def time_frame_float_div(self): @@ -80,10 +76,19 @@ def time_frame_int_mod(self): def time_frame_float_mod(self): self.df % self.df2 + # Dot product -class Timeseries(object): + def time_frame_dot(self): + self.df.dot(self.df2) + + def time_series_dot(self): + self.s.dot(self.s) + + def time_frame_series_dot(self): + 
self.df.dot(self.s) - goal_time = 0.2 + +class Timeseries(object): params = [None, 'US/Eastern'] param_names = ['tz'] @@ -111,8 +116,6 @@ def time_timestamp_ops_diff_with_shift(self, tz): class AddOverflowScalar(object): - goal_time = 0.2 - params = [1, -1, 0] param_names = ['scalar'] @@ -126,8 +129,6 @@ def time_add_overflow_scalar(self, scalar): class AddOverflowArray(object): - goal_time = 0.2 - def setup(self): N = 10**6 self.arr = np.arange(N) @@ -149,3 +150,6 @@ def time_add_overflow_b_mask_nan(self): def time_add_overflow_both_arg_nan(self): checked_add_with_arr(self.arr, self.arr_mixed, arr_mask=self.arr_nan_1, b_mask=self.arr_nan_2) + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/categoricals.py b/asv_bench/benchmarks/categoricals.py index 2a7717378c280..e5dab0cb066aa 100644 --- a/asv_bench/benchmarks/categoricals.py +++ b/asv_bench/benchmarks/categoricals.py @@ -11,13 +11,9 @@ except ImportError: pass -from .pandas_vb_common import setup # noqa - class Concat(object): - goal_time = 0.2 - def setup(self): N = 10**5 self.s = pd.Series(list('aabbcd') * N).astype('category') @@ -34,8 +30,6 @@ def time_union(self): class Constructor(object): - goal_time = 0.2 - def setup(self): N = 10**5 self.categories = list('abcde') @@ -52,6 +46,8 @@ def setup(self): self.values_some_nan = list(np.tile(self.categories + [np.nan], N)) self.values_all_nan = [np.nan] * len(self.values) self.values_all_int8 = np.ones(N, 'int8') + self.categorical = pd.Categorical(self.values, self.categories) + self.series = pd.Series(self.categorical) def time_regular(self): pd.Categorical(self.values, self.categories) @@ -74,17 +70,22 @@ def time_all_nan(self): def time_from_codes_all_int8(self): pd.Categorical.from_codes(self.values_all_int8, self.categories) + def time_existing_categorical(self): + pd.Categorical(self.categorical) -class ValueCounts(object): + def time_existing_series(self): + pd.Categorical(self.series) - goal_time = 0.2 + +class ValueCounts(object): params = [True, False] param_names = ['dropna'] def setup(self, dropna): n = 5 * 10**5 - arr = ['s%04d' % i for i in np.random.randint(0, n // 10, size=n)] + arr = ['s{:04d}'.format(i) for i in np.random.randint(0, n // 10, + size=n)] self.ts = pd.Series(arr).astype('category') def time_value_counts(self, dropna): @@ -93,8 +94,6 @@ def time_value_counts(self, dropna): class Repr(object): - goal_time = 0.2 - def setup(self): self.sel = pd.Series(['s1234']).astype('category') @@ -104,20 +103,29 @@ def time_rendering(self): class SetCategories(object): - goal_time = 0.2 - def setup(self): n = 5 * 10**5 - arr = ['s%04d' % i for i in np.random.randint(0, n // 10, size=n)] + arr = ['s{:04d}'.format(i) for i in np.random.randint(0, n // 10, + size=n)] self.ts = pd.Series(arr).astype('category') def time_set_categories(self): self.ts.cat.set_categories(self.ts.cat.categories[::2]) -class Rank(object): +class RemoveCategories(object): - goal_time = 0.2 + def setup(self): + n = 5 * 10**5 + arr = ['s{:04d}'.format(i) for i in np.random.randint(0, n // 10, + size=n)] + self.ts = pd.Series(arr).astype('category') + + def time_remove_categories(self): + self.ts.cat.remove_categories(self.ts.cat.categories[::2]) + + +class Rank(object): def setup(self): N = 10**5 @@ -156,8 +164,6 @@ def time_rank_int_cat_ordered(self): class Isin(object): - goal_time = 0.2 - params = ['object', 'int64'] param_names = ['dtype'] @@ -167,7 +173,7 @@ def setup(self, dtype): sample_size = 100 arr = [i for i in np.random.randint(0, n // 10, size=n)] 
if dtype == 'object': - arr = ['s%04d' % i for i in arr] + arr = ['s{:04d}'.format(i) for i in arr] self.sample = np.random.choice(arr, sample_size) self.series = pd.Series(arr).astype('category') @@ -197,8 +203,6 @@ def time_categorical_series_is_monotonic_decreasing(self): class Contains(object): - goal_time = 0.2 - def setup(self): N = 10**5 self.ci = tm.makeCategoricalIndex(N) @@ -214,7 +218,6 @@ def time_categorical_contains(self): class CategoricalSlicing(object): - goal_time = 0.2 params = ['monotonic_incr', 'monotonic_decr', 'non_monotonic'] param_names = ['index'] @@ -245,3 +248,42 @@ def time_getitem_list(self, index): def time_getitem_bool_array(self, index): self.data[self.data == self.cat_scalar] + + +class Indexing(object): + + def setup(self): + N = 10**5 + self.index = pd.CategoricalIndex(range(N), range(N)) + self.series = pd.Series(range(N), index=self.index).sort_index() + self.category = self.index[500] + + def time_get_loc(self): + self.index.get_loc(self.category) + + def time_shape(self): + self.index.shape + + def time_shallow_copy(self): + self.index._shallow_copy() + + def time_align(self): + pd.DataFrame({'a': self.series, 'b': self.series[:500]}) + + def time_intersection(self): + self.index[:750].intersection(self.index[250:]) + + def time_unique(self): + self.index.unique() + + def time_reindex(self): + self.index.reindex(self.index[:500]) + + def time_reindex_missing(self): + self.index.reindex(['a', 'b', 'c', 'd']) + + def time_sort_values(self): + self.index.sort_values(ascending=False) + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/ctors.py b/asv_bench/benchmarks/ctors.py index 3f9016787aab4..9082b4186bfa4 100644 --- a/asv_bench/benchmarks/ctors.py +++ b/asv_bench/benchmarks/ctors.py @@ -2,38 +2,74 @@ import pandas.util.testing as tm from pandas import Series, Index, DatetimeIndex, Timestamp, MultiIndex -from .pandas_vb_common import setup # noqa +def no_change(arr): + return arr + + +def list_of_str(arr): + return list(arr.astype(str)) + + +def gen_of_str(arr): + return (x for x in arr.astype(str)) + + +def arr_dict(arr): + return dict(zip(range(len(arr)), arr)) + + +def list_of_tuples(arr): + return [(i, -i) for i in arr] + + +def gen_of_tuples(arr): + return ((i, -i) for i in arr) -class SeriesConstructors(object): - goal_time = 0.2 +def list_of_lists(arr): + return [[i, -i] for i in arr] - param_names = ["data_fmt", "with_index"] - params = [[lambda x: x, + +def list_of_tuples_with_none(arr): + return [(i, -i) for i in arr][:-1] + [None] + + +def list_of_lists_with_none(arr): + return [[i, -i] for i in arr][:-1] + [None] + + +class SeriesConstructors(object): + + param_names = ["data_fmt", "with_index", "dtype"] + params = [[no_change, list, - lambda arr: list(arr.astype(str)), - lambda arr: dict(zip(range(len(arr)), arr)), - lambda arr: [(i, -i) for i in arr], - lambda arr: [[i, -i] for i in arr], - lambda arr: ([(i, -i) for i in arr][:-1] + [None]), - lambda arr: ([[i, -i] for i in arr][:-1] + [None])], - [False, True]] - - def setup(self, data_fmt, with_index): + list_of_str, + gen_of_str, + arr_dict, + list_of_tuples, + gen_of_tuples, + list_of_lists, + list_of_tuples_with_none, + list_of_lists_with_none], + [False, True], + ['float', 'int']] + + def setup(self, data_fmt, with_index, dtype): N = 10**4 - arr = np.random.randn(N) + if dtype == 'float': + arr = np.random.randn(N) + else: + arr = np.arange(N) self.data = data_fmt(arr) self.index = np.arange(N) if with_index else None - def 
time_series_constructor(self, data_fmt, with_index): + def time_series_constructor(self, data_fmt, with_index, dtype): Series(self.data, index=self.index) class SeriesDtypesConstructors(object): - goal_time = 0.2 - def setup(self): N = 10**4 self.arr = np.random.randn(N, N) @@ -56,11 +92,12 @@ def time_dtindex_from_index_with_series(self): class MultiIndexConstructor(object): - goal_time = 0.2 - def setup(self): N = 10**4 self.iterables = [tm.makeStringIndex(N), range(20)] def time_multiindex_from_iterables(self): MultiIndex.from_product(self.iterables) + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/dtypes.py b/asv_bench/benchmarks/dtypes.py new file mode 100644 index 0000000000000..e59154cd99965 --- /dev/null +++ b/asv_bench/benchmarks/dtypes.py @@ -0,0 +1,39 @@ +from pandas.api.types import pandas_dtype + +import numpy as np +from .pandas_vb_common import ( + numeric_dtypes, datetime_dtypes, string_dtypes, extension_dtypes) + + +_numpy_dtypes = [np.dtype(dtype) + for dtype in (numeric_dtypes + + datetime_dtypes + + string_dtypes)] +_dtypes = _numpy_dtypes + extension_dtypes + + +class Dtypes(object): + params = (_dtypes + + list(map(lambda dt: dt.name, _dtypes))) + param_names = ['dtype'] + + def time_pandas_dtype(self, dtype): + pandas_dtype(dtype) + + +class DtypesInvalid(object): + param_names = ['dtype'] + params = ['scalar-string', 'scalar-int', 'list-string', 'array-string'] + data_dict = {'scalar-string': 'foo', + 'scalar-int': 1, + 'list-string': ['foo'] * 1000, + 'array-string': np.array(['foo'] * 1000)} + + def time_pandas_dtype_invalid(self, dtype): + try: + pandas_dtype(self.data_dict[dtype]) + except TypeError: + pass + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/eval.py b/asv_bench/benchmarks/eval.py index 8e581dcf22b4c..837478efbad64 100644 --- a/asv_bench/benchmarks/eval.py +++ b/asv_bench/benchmarks/eval.py @@ -5,13 +5,9 @@ except ImportError: import pandas.computation.expressions as expr -from .pandas_vb_common import setup # noqa - class Eval(object): - goal_time = 0.2 - params = [['numexpr', 'python'], [1, 'all']] param_names = ['engine', 'threads'] @@ -43,8 +39,6 @@ def teardown(self, engine, threads): class Query(object): - goal_time = 0.2 - def setup(self): N = 10**6 halfway = (N // 2) - 1 @@ -65,3 +59,6 @@ def time_query_datetime_column(self): def time_query_with_boolean_selection(self): self.df.query('(a >= @self.min_val) & (a <= @self.max_val)') + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/frame_ctor.py b/asv_bench/benchmarks/frame_ctor.py index 9def910df0bab..dfb6ab5b189b2 100644 --- a/asv_bench/benchmarks/frame_ctor.py +++ b/asv_bench/benchmarks/frame_ctor.py @@ -7,13 +7,9 @@ # For compatibility with older versions from pandas.core.datetools import * # noqa -from .pandas_vb_common import setup # noqa - class FromDicts(object): - goal_time = 0.2 - def setup(self): N, K = 5000, 50 self.index = tm.makeStringIndex(N) @@ -47,8 +43,6 @@ def time_nested_dict_int64(self): class FromSeries(object): - goal_time = 0.2 - def setup(self): mi = MultiIndex.from_product([range(100), range(100)]) self.s = Series(np.random.randn(10000), index=mi) @@ -59,7 +53,6 @@ def time_mi_series(self): class FromDictwithTimestamp(object): - goal_time = 0.2 params = [Nano(1), Hour(1)] param_names = ['offset'] @@ -76,7 +69,6 @@ def time_dict_with_timestamp_offsets(self, offset): class FromRecords(object): - goal_time = 0.2 params = [None, 1000] param_names = ['nrows'] 
@@ -91,11 +83,25 @@ def time_frame_from_records_generator(self, nrows): class FromNDArray(object): - goal_time = 0.2 - def setup(self): N = 100000 self.data = np.random.randn(N) def time_frame_from_ndarray(self): self.df = DataFrame(self.data) + + +class FromLists(object): + + goal_time = 0.2 + + def setup(self): + N = 1000 + M = 100 + self.data = [[j for j in range(M)] for i in range(N)] + + def time_frame_from_lists(self): + self.df = DataFrame(self.data) + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/frame_methods.py b/asv_bench/benchmarks/frame_methods.py index 1819cfa2725db..ba2e63c20d3f8 100644 --- a/asv_bench/benchmarks/frame_methods.py +++ b/asv_bench/benchmarks/frame_methods.py @@ -1,24 +1,19 @@ import string -import warnings import numpy as np -import pandas.util.testing as tm -from pandas import (DataFrame, Series, MultiIndex, date_range, period_range, - isnull, NaT) -from .pandas_vb_common import setup # noqa +from pandas import ( + DataFrame, MultiIndex, NaT, Series, date_range, isnull, period_range) +import pandas.util.testing as tm class GetNumericData(object): - goal_time = 0.2 - def setup(self): self.df = DataFrame(np.random.randn(10000, 25)) self.df['foo'] = 'bar' self.df['bar'] = 'baz' - with warnings.catch_warnings(record=True): - self.df = self.df.consolidate() + self.df = self.df._consolidate() def time_frame_get_numeric_data(self): self.df._get_numeric_data() @@ -26,8 +21,6 @@ def time_frame_get_numeric_data(self): class Lookup(object): - goal_time = 0.2 - def setup(self): self.df = DataFrame(np.random.randn(10000, 8), columns=list('abcdefgh')) @@ -48,8 +41,6 @@ def time_frame_fancy_lookup_all(self): class Reindex(object): - goal_time = 0.2 - def setup(self): N = 10**3 self.df = DataFrame(np.random.randn(N * 10, N)) @@ -70,16 +61,41 @@ def time_reindex_axis1(self): def time_reindex_both_axes(self): self.df.reindex(index=self.idx, columns=self.idx) - def time_reindex_both_axes_ix(self): - self.df.ix[self.idx, self.idx] - def time_reindex_upcast(self): self.df2.reindex(np.random.permutation(range(1200))) -class Iteration(object): +class Rename(object): - goal_time = 0.2 + def setup(self): + N = 10**3 + self.df = DataFrame(np.random.randn(N * 10, N)) + self.idx = np.arange(4 * N, 7 * N) + self.dict_idx = {k: k for k in self.idx} + self.df2 = DataFrame( + {c: {0: np.random.randint(0, 2, N).astype(np.bool_), + 1: np.random.randint(0, N, N).astype(np.int16), + 2: np.random.randint(0, N, N).astype(np.int32), + 3: np.random.randint(0, N, N).astype(np.int64)} + [np.random.randint(0, 4)] for c in range(N)}) + + def time_rename_single(self): + self.df.rename({0: 0}) + + def time_rename_axis0(self): + self.df.rename(self.dict_idx) + + def time_rename_axis1(self): + self.df.rename(columns=self.dict_idx) + + def time_rename_both_axes(self): + self.df.rename(index=self.dict_idx, columns=self.dict_idx) + + def time_dict_rename_both_axes(self): + self.df.rename(index=self.dict_idx, columns=self.dict_idx) + + +class Iteration(object): def setup(self): N = 1000 @@ -87,6 +103,7 @@ def setup(self): self.df2 = DataFrame(np.random.randn(N * 50, 10)) self.df3 = DataFrame(np.random.randn(N, 5 * N), columns=['C' + str(c) for c in range(N * 5)]) + self.df4 = DataFrame(np.random.randn(N * 1000, 10)) def time_iteritems(self): # (monitor no-copying behaviour) @@ -103,10 +120,70 @@ def time_iteritems_indexing(self): for col in self.df3: self.df3[col] + def time_itertuples_start(self): + self.df4.itertuples() + + def time_itertuples_read_first(self): + 
next(self.df4.itertuples()) + def time_itertuples(self): - for row in self.df2.itertuples(): + for row in self.df4.itertuples(): + pass + + def time_itertuples_to_list(self): + list(self.df4.itertuples()) + + def mem_itertuples_start(self): + return self.df4.itertuples() + + def peakmem_itertuples_start(self): + self.df4.itertuples() + + def mem_itertuples_read_first(self): + return next(self.df4.itertuples()) + + def peakmem_itertuples(self): + for row in self.df4.itertuples(): pass + def mem_itertuples_to_list(self): + return list(self.df4.itertuples()) + + def peakmem_itertuples_to_list(self): + list(self.df4.itertuples()) + + def time_itertuples_raw_start(self): + self.df4.itertuples(index=False, name=None) + + def time_itertuples_raw_read_first(self): + next(self.df4.itertuples(index=False, name=None)) + + def time_itertuples_raw_tuples(self): + for row in self.df4.itertuples(index=False, name=None): + pass + + def time_itertuples_raw_tuples_to_list(self): + list(self.df4.itertuples(index=False, name=None)) + + def mem_itertuples_raw_start(self): + return self.df4.itertuples(index=False, name=None) + + def peakmem_itertuples_raw_start(self): + self.df4.itertuples(index=False, name=None) + + def peakmem_itertuples_raw_read_first(self): + next(self.df4.itertuples(index=False, name=None)) + + def peakmem_itertuples_raw(self): + for row in self.df4.itertuples(index=False, name=None): + pass + + def mem_itertuples_raw_to_list(self): + return list(self.df4.itertuples(index=False, name=None)) + + def peakmem_itertuples_raw_to_list(self): + list(self.df4.itertuples(index=False, name=None)) + def time_iterrows(self): for row in self.df.iterrows(): pass @@ -114,8 +191,6 @@ def time_iterrows(self): class ToString(object): - goal_time = 0.2 - def setup(self): self.df = DataFrame(np.random.randn(100, 10)) @@ -125,8 +200,6 @@ def time_to_string_floats(self): class ToHTML(object): - goal_time = 0.2 - def setup(self): nrows = 500 self.df2 = DataFrame(np.random.randn(nrows, 10)) @@ -139,8 +212,6 @@ def time_to_html_mixed(self): class Repr(object): - goal_time = 0.2 - def setup(self): nrows = 10000 data = np.random.randn(nrows, 10) @@ -166,8 +237,6 @@ def time_frame_repr_wide(self): class MaskBool(object): - goal_time = 0.2 - def setup(self): data = np.random.randn(1000, 500) df = DataFrame(data) @@ -184,8 +253,6 @@ def time_frame_mask_floats(self): class Isnull(object): - goal_time = 0.2 - def setup(self): N = 10**3 self.df_no_null = DataFrame(np.random.randn(N, N)) @@ -218,7 +285,6 @@ def time_isnull_obj(self): class Fillna(object): - goal_time = 0.2 params = ([True, False], ['pad', 'bfill']) param_names = ['inplace', 'method'] @@ -233,7 +299,6 @@ def time_frame_fillna(self, inplace, method): class Dropna(object): - goal_time = 0.2 params = (['all', 'any'], [0, 1]) param_names = ['how', 'axis'] @@ -254,8 +319,6 @@ def time_dropna_axis_mixed_dtypes(self, how, axis): class Count(object): - goal_time = 0.2 - params = [0, 1] param_names = ['axis'] @@ -284,8 +347,6 @@ def time_count_level_mixed_dtypes_multi(self, axis): class Apply(object): - goal_time = 0.2 - def setup(self): self.df = DataFrame(np.random.randn(1000, 100)) @@ -314,8 +375,6 @@ def time_apply_ref_by_name(self): class Dtypes(object): - goal_time = 0.2 - def setup(self): self.df = DataFrame(np.random.randn(1000, 1000)) @@ -325,8 +384,6 @@ def time_frame_dtypes(self): class Equals(object): - goal_time = 0.2 - def setup(self): N = 10**3 self.float_df = DataFrame(np.random.randn(N, N)) @@ -363,7 +420,6 @@ def time_frame_object_unequal(self): 
class Interpolate(object): - goal_time = 0.2 params = [None, 'infer'] param_names = ['downcast'] @@ -389,7 +445,6 @@ def time_interpolate_some_good(self, downcast): class Shift(object): # frame shift speedup issue-5609 - goal_time = 0.2 params = [0, 1] param_names = ['axis'] @@ -411,8 +466,6 @@ def time_frame_nunique(self): class Duplicated(object): - goal_time = 0.2 - def setup(self): n = (1 << 20) t = date_range('2015-01-01', freq='S', periods=(n // 64)) @@ -431,7 +484,6 @@ def time_frame_duplicated_wide(self): class XS(object): - goal_time = 0.2 params = [0, 1] param_names = ['axis'] @@ -445,7 +497,6 @@ def time_frame_xs(self, axis): class SortValues(object): - goal_time = 0.2 params = [True, False] param_names = ['ascending'] @@ -458,8 +509,6 @@ def time_frame_sort_values(self, ascending): class SortIndexByColumns(object): - goal_time = 0.2 - def setup(self): N = 10000 K = 10 @@ -473,7 +522,6 @@ def time_frame_sort_values_by_columns(self): class Quantile(object): - goal_time = 0.2 params = [0, 1] param_names = ['axis'] @@ -486,8 +534,6 @@ def time_frame_quantile(self, axis): class GetDtypeCounts(object): # 2807 - goal_time = 0.2 - def setup(self): self.df = DataFrame(np.random.randn(10, 10000)) @@ -500,23 +546,27 @@ def time_info(self): class NSort(object): - goal_time = 0.2 params = ['first', 'last', 'all'] param_names = ['keep'] def setup(self, keep): - self.df = DataFrame(np.random.randn(1000, 3), columns=list('ABC')) + self.df = DataFrame(np.random.randn(100000, 3), + columns=list('ABC')) - def time_nlargest(self, keep): + def time_nlargest_one_column(self, keep): self.df.nlargest(100, 'A', keep=keep) - def time_nsmallest(self, keep): + def time_nlargest_two_columns(self, keep): + self.df.nlargest(100, ['A', 'B'], keep=keep) + + def time_nsmallest_one_column(self, keep): self.df.nsmallest(100, 'A', keep=keep) + def time_nsmallest_two_columns(self, keep): + self.df.nsmallest(100, ['A', 'B'], keep=keep) -class Describe(object): - goal_time = 0.2 +class Describe(object): def setup(self): self.df = DataFrame({ @@ -530,3 +580,6 @@ def time_series_describe(self): def time_dataframe_describe(self): self.df.describe() + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/gil.py b/asv_bench/benchmarks/gil.py index 21c1ccf46e1c4..6819a296c81df 100644 --- a/asv_bench/benchmarks/gil.py +++ b/asv_bench/benchmarks/gil.py @@ -23,12 +23,11 @@ def wrapper(fname): return fname return wrapper -from .pandas_vb_common import BaseIO, setup # noqa +from .pandas_vb_common import BaseIO class ParallelGroupbyMethods(object): - goal_time = 0.2 params = ([2, 4, 8], ['count', 'last', 'max', 'mean', 'min', 'prod', 'sum', 'var']) param_names = ['threads', 'method'] @@ -60,7 +59,6 @@ def time_loop(self, threads, method): class ParallelGroups(object): - goal_time = 0.2 params = [2, 4, 8] param_names = ['threads'] @@ -82,7 +80,6 @@ def time_get_groups(self, threads): class ParallelTake1D(object): - goal_time = 0.2 params = ['int64', 'float64'] param_names = ['dtype'] @@ -126,8 +123,6 @@ def time_kth_smallest(self): class ParallelDatetimeFields(object): - goal_time = 0.2 - def setup(self): if not have_real_test_parallel: raise NotImplementedError @@ -174,7 +169,6 @@ def run(period): class ParallelRolling(object): - goal_time = 0.2 params = ['median', 'mean', 'min', 'max', 'var', 'skew', 'kurt', 'std'] param_names = ['method'] @@ -273,3 +267,6 @@ def time_parallel(self, threads): def time_loop(self, threads): for i in range(threads): self.loop() + + +from .pandas_vb_common import 
setup # noqa: F401 diff --git a/asv_bench/benchmarks/groupby.py b/asv_bench/benchmarks/groupby.py index b51b41614bc49..59e43ee22afde 100644 --- a/asv_bench/benchmarks/groupby.py +++ b/asv_bench/benchmarks/groupby.py @@ -1,14 +1,14 @@ -import warnings -from string import ascii_letters -from itertools import product from functools import partial +from itertools import product +from string import ascii_letters +import warnings import numpy as np -from pandas import (DataFrame, Series, MultiIndex, date_range, period_range, - TimeGrouper, Categorical, Timestamp) -import pandas.util.testing as tm -from .pandas_vb_common import setup # noqa +from pandas import ( + Categorical, DataFrame, MultiIndex, Series, TimeGrouper, Timestamp, + date_range, period_range) +import pandas.util.testing as tm method_blacklist = { @@ -22,8 +22,6 @@ class ApplyDictReturn(object): - goal_time = 0.2 - def setup(self): self.labels = np.arange(1000).repeat(10) self.data = Series(np.random.randn(len(self.labels))) @@ -35,8 +33,6 @@ def time_groupby_apply_dict_return(self): class Apply(object): - goal_time = 0.2 - def setup_cache(self): N = 10**4 labels = np.random.randint(0, 2000, size=N) @@ -69,8 +65,6 @@ def time_copy_overhead_single_col(self, df): class Groups(object): - goal_time = 0.2 - param_names = ['key'] params = ['int64_small', 'int64_large', 'object_small', 'object_large'] @@ -95,7 +89,6 @@ def time_series_groups(self, data, key): class GroupManyLabels(object): - goal_time = 0.2 params = [1, 1000] param_names = ['ncols'] @@ -111,8 +104,6 @@ def time_sum(self, ncols): class Nth(object): - goal_time = 0.2 - param_names = ['dtype'] params = ['float32', 'float64', 'datetime', 'object'] @@ -151,8 +142,6 @@ def time_series_nth(self, dtype): class DateAttributes(object): - goal_time = 0.2 - def setup(self): rng = date_range('1/1/2000', '12/31/2005', freq='H') self.year, self.month, self.day = rng.year, rng.month, rng.day @@ -164,8 +153,6 @@ def time_len_groupby_object(self): class Int64(object): - goal_time = 0.2 - def setup(self): arr = np.random.randint(-1 << 12, 1 << 12, (1 << 17, 5)) i = np.random.choice(len(arr), len(arr) * 5) @@ -182,8 +169,6 @@ def time_overflow(self): class CountMultiDtype(object): - goal_time = 0.2 - def setup_cache(self): n = 10000 offsets = np.random.randint(n, size=n).astype('timedelta64[ns]') @@ -210,8 +195,6 @@ def time_multi_count(self, df): class CountMultiInt(object): - goal_time = 0.2 - def setup_cache(self): n = 10000 df = DataFrame({'key1': np.random.randint(0, 500, size=n), @@ -229,9 +212,7 @@ def time_multi_int_nunique(self, df): class AggFunctions(object): - goal_time = 0.2 - - def setup_cache(): + def setup_cache(self): N = 10**5 fac1 = np.array(['A', 'B', 'C'], dtype='O') fac2 = np.array(['one', 'two'], dtype='O') @@ -261,8 +242,6 @@ def time_different_python_functions_singlecol(self, df): class GroupStrings(object): - goal_time = 0.2 - def setup(self): n = 2 * 10**5 alpha = list(map(''.join, product(ascii_letters, repeat=4))) @@ -278,8 +257,6 @@ def time_multi_columns(self): class MultiColumn(object): - goal_time = 0.2 - def setup_cache(self): N = 10**5 key1 = np.tile(np.arange(100, dtype=object), 1000) @@ -307,8 +284,6 @@ def time_col_select_numpy_sum(self, df): class Size(object): - goal_time = 0.2 - def setup(self): n = 10**5 offsets = np.random.randint(n, size=n).astype('timedelta64[ns]') @@ -336,8 +311,6 @@ def time_category_size(self): class GroupByMethods(object): - goal_time = 0.2 - param_names = ['dtype', 'method', 'application'] params = [['int', 'float', 'object', 
'datetime'], ['all', 'any', 'bfill', 'count', 'cumcount', 'cummax', 'cummin', @@ -387,7 +360,6 @@ def time_dtype_as_field(self, dtype, method, application): class RankWithTies(object): # GH 21237 - goal_time = 0.2 param_names = ['dtype', 'tie_method'] params = [['float64', 'float32', 'int64', 'datetime64'], ['first', 'average', 'dense', 'min', 'max']] @@ -406,8 +378,6 @@ def time_rank_ties(self, dtype, tie_method): class Float32(object): # GH 13335 - goal_time = 0.2 - def setup(self): tmp1 = (np.random.random(10000) * 0.1).astype(np.float32) tmp2 = (np.random.random(10000) * 10.0).astype(np.float32) @@ -421,8 +391,6 @@ def time_sum(self): class Categories(object): - goal_time = 0.2 - def setup(self): N = 10**5 arr = np.random.random(N) @@ -459,7 +427,6 @@ def time_groupby_extra_cat_nosort(self): class Datelike(object): # GH 14338 - goal_time = 0.2 params = ['period_range', 'date_range', 'date_range_tz'] param_names = ['grouper'] @@ -477,8 +444,6 @@ def time_sum(self, grouper): class SumBools(object): # GH 2692 - goal_time = 0.2 - def setup(self): N = 500 self.df = DataFrame({'ii': range(N), @@ -490,7 +455,6 @@ def time_groupby_sum_booleans(self): class SumMultiLevel(object): # GH 9049 - goal_time = 0.2 timeout = 120.0 def setup(self): @@ -505,14 +469,12 @@ def time_groupby_sum_multiindex(self): class Transform(object): - goal_time = 0.2 - def setup(self): n1 = 400 n2 = 250 index = MultiIndex(levels=[np.arange(n1), tm.makeStringIndex(n2)], - labels=[np.repeat(range(n1), n2).tolist(), - list(range(n2)) * n1], + codes=[np.repeat(range(n1), n2).tolist(), + list(range(n2)) * n1], names=['lev1', 'lev2']) arr = np.random.randn(n1 * n2, 3) arr[::10000, 0] = np.nan @@ -553,8 +515,6 @@ def time_transform_multi_key4(self): class TransformBools(object): - goal_time = 0.2 - def setup(self): N = 120000 transition_points = np.sort(np.random.choice(np.arange(N), 1400)) @@ -569,8 +529,6 @@ def time_transform_mean(self): class TransformNaN(object): # GH 12737 - goal_time = 0.2 - def setup(self): self.df_nans = DataFrame({'key': np.repeat(np.arange(1000), 10), 'B': np.nan, @@ -579,3 +537,6 @@ def setup(self): def time_first(self): self.df_nans.groupby('key').transform('first') + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/index_object.py b/asv_bench/benchmarks/index_object.py index f1703e163917a..f76040921393f 100644 --- a/asv_bench/benchmarks/index_object.py +++ b/asv_bench/benchmarks/index_object.py @@ -3,12 +3,9 @@ from pandas import (Series, date_range, DatetimeIndex, Index, RangeIndex, Float64Index) -from .pandas_vb_common import setup # noqa - class SetOperations(object): - goal_time = 0.2 params = (['datetime', 'date_string', 'int', 'strings'], ['intersection', 'union', 'symmetric_difference']) param_names = ['dtype', 'method'] @@ -34,8 +31,6 @@ def time_operation(self, dtype, method): class SetDisjoint(object): - goal_time = 0.2 - def setup(self): N = 10**5 B = N + 20000 @@ -48,8 +43,6 @@ def time_datetime_difference_disjoint(self): class Datetime(object): - goal_time = 0.2 - def setup(self): self.dr = date_range('20000101', freq='D', periods=10000) @@ -86,8 +79,6 @@ def time_modulo(self, dtype): class Range(object): - goal_time = 0.2 - def setup(self): self.idx_inc = RangeIndex(start=0, stop=10**7, step=3) self.idx_dec = RangeIndex(start=10**7, stop=-1, step=-3) @@ -107,8 +98,6 @@ def time_min_trivial(self): class IndexAppend(object): - goal_time = 0.2 - def setup(self): N = 10000 @@ -138,7 +127,6 @@ def time_append_obj_list(self): class Indexing(object): - 
goal_time = 0.2 params = ['String', 'Float', 'Int'] param_names = ['dtype'] @@ -183,8 +171,6 @@ def time_get_loc_non_unique_sorted(self, dtype): class Float64IndexMethod(object): # GH 13166 - goal_time = 0.2 - def setup(self): N = 100000 a = np.arange(N) @@ -192,3 +178,6 @@ def setup(self): def time_get_loc(self): self.ind.get_loc(0) + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/indexing.py b/asv_bench/benchmarks/indexing.py index 739ad6a3d278b..57ba9cd80e55c 100644 --- a/asv_bench/benchmarks/indexing.py +++ b/asv_bench/benchmarks/indexing.py @@ -2,108 +2,119 @@ import numpy as np import pandas.util.testing as tm -from pandas import (Series, DataFrame, MultiIndex, Int64Index, Float64Index, +from pandas import (Series, DataFrame, Panel, MultiIndex, + Int64Index, UInt64Index, Float64Index, IntervalIndex, CategoricalIndex, IndexSlice, concat, date_range) -from .pandas_vb_common import setup, Panel # noqa class NumericSeriesIndexing(object): - goal_time = 0.2 - params = [Int64Index, Float64Index] - param = ['index'] + params = [ + (Int64Index, UInt64Index, Float64Index), + ('unique_monotonic_inc', 'nonunique_monotonic_inc'), + ] + param_names = ['index_dtype', 'index_structure'] - def setup(self, index): + def setup(self, index, index_structure): N = 10**6 - idx = index(range(N)) - self.data = Series(np.random.rand(N), index=idx) + indices = { + 'unique_monotonic_inc': index(range(N)), + 'nonunique_monotonic_inc': index( + list(range(55)) + [54] + list(range(55, N - 1))), + } + self.data = Series(np.random.rand(N), index=indices[index_structure]) self.array = np.arange(10000) self.array_list = self.array.tolist() - def time_getitem_scalar(self, index): + def time_getitem_scalar(self, index, index_structure): self.data[800000] - def time_getitem_slice(self, index): + def time_getitem_slice(self, index, index_structure): self.data[:800000] - def time_getitem_list_like(self, index): + def time_getitem_list_like(self, index, index_structure): self.data[[800000]] - def time_getitem_array(self, index): + def time_getitem_array(self, index, index_structure): self.data[self.array] - def time_getitem_lists(self, index): + def time_getitem_lists(self, index, index_structure): self.data[self.array_list] - def time_iloc_array(self, index): + def time_iloc_array(self, index, index_structure): self.data.iloc[self.array] - def time_iloc_list_like(self, index): + def time_iloc_list_like(self, index, index_structure): self.data.iloc[[800000]] - def time_iloc_scalar(self, index): + def time_iloc_scalar(self, index, index_structure): self.data.iloc[800000] - def time_iloc_slice(self, index): + def time_iloc_slice(self, index, index_structure): self.data.iloc[:800000] - def time_ix_array(self, index): + def time_ix_array(self, index, index_structure): self.data.ix[self.array] - def time_ix_list_like(self, index): + def time_ix_list_like(self, index, index_structure): self.data.ix[[800000]] - def time_ix_scalar(self, index): + def time_ix_scalar(self, index, index_structure): self.data.ix[800000] - def time_ix_slice(self, index): + def time_ix_slice(self, index, index_structure): self.data.ix[:800000] - def time_loc_array(self, index): + def time_loc_array(self, index, index_structure): self.data.loc[self.array] - def time_loc_list_like(self, index): + def time_loc_list_like(self, index, index_structure): self.data.loc[[800000]] - def time_loc_scalar(self, index): + def time_loc_scalar(self, index, index_structure): self.data.loc[800000] - def time_loc_slice(self, index): 
+ def time_loc_slice(self, index, index_structure): self.data.loc[:800000] class NonNumericSeriesIndexing(object): - goal_time = 0.2 - params = ['string', 'datetime'] - param_names = ['index'] + params = [ + ('string', 'datetime'), + ('unique_monotonic_inc', 'nonunique_monotonic_inc'), + ] + param_names = ['index_dtype', 'index_structure'] - def setup(self, index): - N = 10**5 + def setup(self, index, index_structure): + N = 10**6 indexes = {'string': tm.makeStringIndex(N), 'datetime': date_range('1900', periods=N, freq='s')} index = indexes[index] + if index_structure == 'nonunique_monotonic_inc': + index = index.insert(item=index[2], loc=2)[:-1] self.s = Series(np.random.rand(N), index=index) self.lbl = index[80000] - def time_getitem_label_slice(self, index): + def time_getitem_label_slice(self, index, index_structure): self.s[:self.lbl] - def time_getitem_pos_slice(self, index): + def time_getitem_pos_slice(self, index, index_structure): self.s[:80000] - def time_get_value(self, index): + def time_get_value(self, index, index_structure): with warnings.catch_warnings(record=True): self.s.get_value(self.lbl) - def time_getitem_scalar(self, index): + def time_getitem_scalar(self, index, index_structure): self.s[self.lbl] + def time_getitem_list_like(self, index, index_structure): + self.s[[self.lbl]] -class DataFrameStringIndexing(object): - goal_time = 0.2 +class DataFrameStringIndexing(object): def setup(self): index = tm.makeStringIndex(1000) @@ -137,8 +148,6 @@ def time_boolean_rows_object(self): class DataFrameNumericIndexing(object): - goal_time = 0.2 - def setup(self): self.idx_dupe = np.array(range(30)) * 99 self.df = DataFrame(np.random.randn(10000, 5)) @@ -163,7 +172,6 @@ def time_bool_indexer(self): class Take(object): - goal_time = 0.2 params = ['int', 'datetime'] param_names = ['index'] @@ -181,8 +189,6 @@ def time_take(self, index): class MultiIndexing(object): - goal_time = 0.2 - def setup(self): mi = MultiIndex.from_product([range(1000), range(1000)]) self.s = Series(np.random.randn(1000000), index=mi) @@ -211,8 +217,6 @@ def time_index_slice(self): class IntervalIndexing(object): - goal_time = 0.2 - def setup_cache(self): idx = IntervalIndex.from_breaks(np.arange(1000001)) monotonic = Series(np.arange(1000000), index=idx) @@ -233,7 +237,6 @@ def time_loc_list(self, monotonic): class CategoricalIndexIndexing(object): - goal_time = 0.2 params = ['monotonic_incr', 'monotonic_decr', 'non_monotonic'] param_names = ['index'] @@ -276,8 +279,6 @@ def time_get_indexer_list(self, index): class PanelIndexing(object): - goal_time = 0.2 - def setup(self): with warnings.catch_warnings(record=True): self.p = Panel(np.random.randn(100, 100, 100)) @@ -290,8 +291,6 @@ def time_subset(self): class MethodLookup(object): - goal_time = 0.2 - def setup_cache(self): s = Series() return s @@ -308,8 +307,6 @@ def time_lookup_loc(self, s): class GetItemSingleColumn(object): - goal_time = 0.2 - def setup(self): self.df_string_col = DataFrame(np.random.randn(3000, 1), columns=['A']) self.df_int_col = DataFrame(np.random.randn(3000, 1)) @@ -323,8 +320,6 @@ def time_frame_getitem_single_column_int(self): class AssignTimeseriesIndex(object): - goal_time = 0.2 - def setup(self): N = 100000 idx = date_range('1/1/2000', periods=N, freq='H') @@ -336,8 +331,6 @@ def time_frame_assign_timeseries_index(self): class InsertColumns(object): - goal_time = 0.2 - def setup(self): self.N = 10**3 self.df = DataFrame(index=range(self.N)) @@ -352,3 +345,6 @@ def time_assign_with_setitem(self): np.random.seed(1234) for i 
in range(100): self.df[i] = np.random.randn(self.N) + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/indexing_engines.py b/asv_bench/benchmarks/indexing_engines.py new file mode 100644 index 0000000000000..f3d063ee31bc8 --- /dev/null +++ b/asv_bench/benchmarks/indexing_engines.py @@ -0,0 +1,64 @@ +import numpy as np + +from pandas._libs import index as libindex + + +def _get_numeric_engines(): + engine_names = [ + ('Int64Engine', np.int64), ('Int32Engine', np.int32), + ('Int16Engine', np.int16), ('Int8Engine', np.int8), + ('UInt64Engine', np.uint64), ('UInt32Engine', np.uint32), + ('UInt16Engine', np.uint16), ('UInt8Engine', np.uint8), + ('Float64Engine', np.float64), ('Float32Engine', np.float32), + ] + return [(getattr(libindex, engine_name), dtype) + for engine_name, dtype in engine_names + if hasattr(libindex, engine_name)] + + +class NumericEngineIndexing(object): + + params = [_get_numeric_engines(), + ['monotonic_incr', 'monotonic_decr', 'non_monotonic'], + ] + param_names = ['engine_and_dtype', 'index_type'] + + def setup(self, engine_and_dtype, index_type): + engine, dtype = engine_and_dtype + N = 10**5 + values = list([1] * N + [2] * N + [3] * N) + arr = { + 'monotonic_incr': np.array(values, dtype=dtype), + 'monotonic_decr': np.array(list(reversed(values)), + dtype=dtype), + 'non_monotonic': np.array([1, 2, 3] * N, dtype=dtype), + }[index_type] + + self.data = engine(lambda: arr, len(arr)) + # code below avoids populating the mapping etc. while timing. + self.data.get_loc(2) + + def time_get_loc(self, engine_and_dtype, index_type): + self.data.get_loc(2) + + +class ObjectEngineIndexing(object): + + params = [('monotonic_incr', 'monotonic_decr', 'non_monotonic')] + param_names = ['index_type'] + + def setup(self, index_type): + N = 10**5 + values = list('a' * N + 'b' * N + 'c' * N) + arr = { + 'monotonic_incr': np.array(values, dtype=object), + 'monotonic_decr': np.array(list(reversed(values)), dtype=object), + 'non_monotonic': np.array(list('abc') * N, dtype=object), + }[index_type] + + self.data = libindex.ObjectEngine(lambda: arr, len(arr)) + # code below avoids populating the mapping etc. while timing.
+ self.data.get_loc('b') + + def time_get_loc(self, index_type): + self.data.get_loc('b') diff --git a/asv_bench/benchmarks/inference.py b/asv_bench/benchmarks/inference.py index 16d9e7cd73cbb..423bd02b93596 100644 --- a/asv_bench/benchmarks/inference.py +++ b/asv_bench/benchmarks/inference.py @@ -2,12 +2,11 @@ import pandas.util.testing as tm from pandas import DataFrame, Series, to_numeric -from .pandas_vb_common import numeric_dtypes, lib, setup # noqa +from .pandas_vb_common import numeric_dtypes, lib class NumericInferOps(object): # from GH 7332 - goal_time = 0.2 params = numeric_dtypes param_names = ['dtype'] @@ -34,8 +33,6 @@ def time_modulo(self, dtype): class DateInferOps(object): # from GH 7332 - goal_time = 0.2 - def setup_cache(self): N = 5 * 10**5 df = DataFrame({'datetime64': np.arange(N).astype('datetime64[ms]')}) @@ -54,7 +51,6 @@ def time_add_timedeltas(self, df): class ToNumeric(object): - goal_time = 0.2 params = ['ignore', 'coerce'] param_names = ['errors'] @@ -111,3 +107,6 @@ def setup_cache(self): def time_convert(self, data): lib.maybe_convert_numeric(data, set(), coerce_numeric=False) + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/io/csv.py b/asv_bench/benchmarks/io/csv.py index 0f5d07f9fac55..771f2795334e1 100644 --- a/asv_bench/benchmarks/io/csv.py +++ b/asv_bench/benchmarks/io/csv.py @@ -1,19 +1,16 @@ import random -import timeit import string import numpy as np import pandas.util.testing as tm from pandas import DataFrame, Categorical, date_range, read_csv -from pandas.compat import PY2 from pandas.compat import cStringIO as StringIO -from ..pandas_vb_common import setup, BaseIO # noqa +from ..pandas_vb_common import BaseIO class ToCSV(BaseIO): - goal_time = 0.2 fname = '__test__.csv' params = ['wide', 'long', 'mixed'] param_names = ['kind'] @@ -43,7 +40,6 @@ def time_frame(self, kind): class ToCSVDatetime(BaseIO): - goal_time = 0.2 fname = '__test__.csv' def setup(self): @@ -54,9 +50,15 @@ def time_frame_date_formatting(self): self.data.to_csv(self.fname, date_format='%Y%m%d') -class ReadCSVDInferDatetimeFormat(object): +class StringIORewind(object): + + def data(self, stringio_object): + stringio_object.seek(0) + return stringio_object + + +class ReadCSVDInferDatetimeFormat(StringIORewind): - goal_time = 0.2 params = ([True, False], ['custom', 'iso8601', 'ymd']) param_names = ['infer_datetime_format', 'format'] @@ -66,16 +68,17 @@ def setup(self, infer_datetime_format, format): 'iso8601': '%Y-%m-%d %H:%M:%S', 'ymd': '%Y%m%d'} dt_format = formats[format] - self.data = StringIO('\n'.join(rng.strftime(dt_format).tolist())) + self.StringIO_input = StringIO('\n'.join( + rng.strftime(dt_format).tolist())) def time_read_csv(self, infer_datetime_format, format): - read_csv(self.data, header=None, names=['foo'], parse_dates=['foo'], + read_csv(self.data(self.StringIO_input), + header=None, names=['foo'], parse_dates=['foo'], infer_datetime_format=infer_datetime_format) class ReadCSVSkipRows(BaseIO): - goal_time = 0.2 fname = '__test__.csv' params = [None, 10000] param_names = ['skiprows'] @@ -95,9 +98,7 @@ def time_skipprows(self, skiprows): read_csv(self.fname, skiprows=skiprows) -class ReadUint64Integers(object): - - goal_time = 0.2 +class ReadUint64Integers(StringIORewind): def setup(self): self.na_values = [2**63 + 500] @@ -108,19 +109,18 @@ def setup(self): self.data2 = StringIO('\n'.join(arr.astype(str).tolist())) def time_read_uint64(self): - read_csv(self.data1, header=None, names=['foo']) + 
read_csv(self.data(self.data1), header=None, names=['foo']) def time_read_uint64_neg_values(self): - read_csv(self.data2, header=None, names=['foo']) + read_csv(self.data(self.data2), header=None, names=['foo']) def time_read_uint64_na_values(self): - read_csv(self.data1, header=None, names=['foo'], + read_csv(self.data(self.data1), header=None, names=['foo'], na_values=self.na_values) class ReadCSVThousands(BaseIO): - goal_time = 0.2 fname = '__test__.csv' params = ([',', '|'], [None, ',']) param_names = ['sep', 'thousands'] @@ -140,21 +140,19 @@ def time_thousands(self, sep, thousands): read_csv(self.fname, sep=sep, thousands=thousands) -class ReadCSVComment(object): - - goal_time = 0.2 +class ReadCSVComment(StringIORewind): def setup(self): data = ['A,B,C'] + (['1,2,3 # comment'] * 100000) - self.s_data = StringIO('\n'.join(data)) + self.StringIO_input = StringIO('\n'.join(data)) def time_comment(self): - read_csv(self.s_data, comment='#', header=None, names=list('abc')) + read_csv(self.data(self.StringIO_input), comment='#', + header=None, names=list('abc')) -class ReadCSVFloatPrecision(object): +class ReadCSVFloatPrecision(StringIORewind): - goal_time = 0.2 params = ([',', ';'], ['.', '_'], [None, 'high', 'round_trip']) param_names = ['sep', 'decimal', 'float_precision'] @@ -164,20 +162,19 @@ def setup(self, sep, decimal, float_precision): rows = sep.join(['0{}'.format(decimal) + '{}'] * 3) + '\n' data = rows * 5 data = data.format(*floats) * 200 # 1000 x 3 strings csv - self.s_data = StringIO(data) + self.StringIO_input = StringIO(data) def time_read_csv(self, sep, decimal, float_precision): - read_csv(self.s_data, sep=sep, header=None, names=list('abc'), - float_precision=float_precision) + read_csv(self.data(self.StringIO_input), sep=sep, header=None, + names=list('abc'), float_precision=float_precision) def time_read_csv_python_engine(self, sep, decimal, float_precision): - read_csv(self.s_data, sep=sep, header=None, engine='python', - float_precision=None, names=list('abc')) + read_csv(self.data(self.StringIO_input), sep=sep, header=None, + engine='python', float_precision=None, names=list('abc')) class ReadCSVCategorical(BaseIO): - goal_time = 0.2 fname = '__test__.csv' def setup(self): @@ -193,9 +190,7 @@ def time_convert_direct(self): read_csv(self.fname, dtype='category') -class ReadCSVParseDates(object): - - goal_time = 0.2 +class ReadCSVParseDates(StringIORewind): def setup(self): data = """{},19:00:00,18:56:00,0.8100,2.8100,7.2000,0.0000,280.0000\n @@ -206,12 +201,17 @@ def setup(self): """ two_cols = ['KORD,19990127'] * 5 data = data.format(*two_cols) - self.s_data = StringIO(data) + self.StringIO_input = StringIO(data) def time_multiple_date(self): - read_csv(self.s_data, sep=',', header=None, - names=list(string.digits[:9]), parse_dates=[[1, 2], [1, 3]]) + read_csv(self.data(self.StringIO_input), sep=',', header=None, + names=list(string.digits[:9]), + parse_dates=[[1, 2], [1, 3]]) def time_baseline(self): - read_csv(self.s_data, sep=',', header=None, parse_dates=[1], + read_csv(self.data(self.StringIO_input), sep=',', header=None, + parse_dates=[1], names=list(string.digits[:9])) + + +from ..pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/io/excel.py b/asv_bench/benchmarks/io/excel.py index 58ab6bb8046c5..1bee864fbcf2d 100644 --- a/asv_bench/benchmarks/io/excel.py +++ b/asv_bench/benchmarks/io/excel.py @@ -3,12 +3,9 @@ from pandas.compat import BytesIO import pandas.util.testing as tm -from ..pandas_vb_common import BaseIO, setup # noqa - 
class Excel(object): - goal_time = 0.2 params = ['openpyxl', 'xlsxwriter', 'xlwt'] param_names = ['engine'] @@ -34,3 +31,6 @@ def time_write_excel(self, engine): writer_write = ExcelWriter(bio_write, engine=engine) self.df.to_excel(writer_write, sheet_name='Sheet1') writer_write.save() + + +from ..pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/io/hdf.py b/asv_bench/benchmarks/io/hdf.py index 4b6e1d69af92d..f08904ba70a5f 100644 --- a/asv_bench/benchmarks/io/hdf.py +++ b/asv_bench/benchmarks/io/hdf.py @@ -4,13 +4,11 @@ from pandas import DataFrame, Panel, date_range, HDFStore, read_hdf import pandas.util.testing as tm -from ..pandas_vb_common import BaseIO, setup # noqa +from ..pandas_vb_common import BaseIO class HDFStoreDataFrame(BaseIO): - goal_time = 0.2 - def setup(self): N = 25000 index = tm.makeStringIndex(N) @@ -103,8 +101,6 @@ def time_store_info(self): class HDFStorePanel(BaseIO): - goal_time = 0.2 - def setup(self): self.fname = '__test__.h5' with warnings.catch_warnings(record=True): @@ -130,7 +126,6 @@ def time_write_store_table_panel(self): class HDF(BaseIO): - goal_time = 0.2 params = ['table', 'fixed'] param_names = ['format'] @@ -149,3 +144,6 @@ def time_read_hdf(self, format): def time_write_hdf(self, format): self.df.to_hdf(self.fname, 'df', format=format) + + +from ..pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/io/json.py b/asv_bench/benchmarks/io/json.py index acfdd327c3b51..ec2ddc11b7c1d 100644 --- a/asv_bench/benchmarks/io/json.py +++ b/asv_bench/benchmarks/io/json.py @@ -2,12 +2,11 @@ import pandas.util.testing as tm from pandas import DataFrame, date_range, timedelta_range, concat, read_json -from ..pandas_vb_common import setup, BaseIO # noqa +from ..pandas_vb_common import BaseIO class ReadJSON(BaseIO): - goal_time = 0.2 fname = "__test__.json" params = (['split', 'index', 'records'], ['int', 'datetime']) param_names = ['orient', 'index'] @@ -27,7 +26,6 @@ def time_read_json(self, orient, index): class ReadJSONLines(BaseIO): - goal_time = 0.2 fname = "__test_lines__.json" params = ['int', 'datetime'] param_names = ['index'] @@ -58,7 +56,6 @@ def peakmem_read_json_lines_concat(self, index): class ToJSON(BaseIO): - goal_time = 0.2 fname = "__test__.json" params = ['split', 'columns', 'index'] param_names = ['orient'] @@ -125,3 +122,6 @@ def time_float_int_lines(self, orient): def time_float_int_str_lines(self, orient): self.df_int_float_str.to_json(self.fname, orient='records', lines=True) + + +from ..pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/io/msgpack.py b/asv_bench/benchmarks/io/msgpack.py index 8ccce01117ca4..dc2642d920fd0 100644 --- a/asv_bench/benchmarks/io/msgpack.py +++ b/asv_bench/benchmarks/io/msgpack.py @@ -2,13 +2,11 @@ from pandas import DataFrame, date_range, read_msgpack import pandas.util.testing as tm -from ..pandas_vb_common import BaseIO, setup # noqa +from ..pandas_vb_common import BaseIO class MSGPack(BaseIO): - goal_time = 0.2 - def setup(self): self.fname = '__test__.msg' N = 100000 @@ -24,3 +22,6 @@ def time_read_msgpack(self): def time_write_msgpack(self): self.df.to_msgpack(self.fname) + + +from ..pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/io/pickle.py b/asv_bench/benchmarks/io/pickle.py index 2ad0fcca6eb26..74a58bbb946aa 100644 --- a/asv_bench/benchmarks/io/pickle.py +++ b/asv_bench/benchmarks/io/pickle.py @@ -2,13 +2,11 @@ from pandas import DataFrame, date_range, read_pickle import pandas.util.testing as tm 
-from ..pandas_vb_common import BaseIO, setup # noqa +from ..pandas_vb_common import BaseIO class Pickle(BaseIO): - goal_time = 0.2 - def setup(self): self.fname = '__test__.pkl' N = 100000 @@ -24,3 +22,6 @@ def time_read_pickle(self): def time_write_pickle(self): self.df.to_pickle(self.fname) + + +from ..pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/io/sas.py b/asv_bench/benchmarks/io/sas.py index 526c524de7fff..2783f42cad895 100644 --- a/asv_bench/benchmarks/io/sas.py +++ b/asv_bench/benchmarks/io/sas.py @@ -5,7 +5,6 @@ class SAS(object): - goal_time = 0.2 params = ['sas7bdat', 'xport'] param_names = ['format'] diff --git a/asv_bench/benchmarks/io/sql.py b/asv_bench/benchmarks/io/sql.py index ef4e501e5f3b9..075d3bdda5ed9 100644 --- a/asv_bench/benchmarks/io/sql.py +++ b/asv_bench/benchmarks/io/sql.py @@ -5,12 +5,9 @@ from pandas import DataFrame, date_range, read_sql_query, read_sql_table from sqlalchemy import create_engine -from ..pandas_vb_common import setup # noqa - class SQL(object): - goal_time = 0.2 params = ['sqlalchemy', 'sqlite'] param_names = ['connection'] @@ -43,7 +40,6 @@ def time_read_sql_query(self, connection): class WriteSQLDtypes(object): - goal_time = 0.2 params = (['sqlalchemy', 'sqlite'], ['float', 'float_with_nan', 'string', 'bool', 'int', 'datetime']) param_names = ['connection', 'dtype'] @@ -77,8 +73,6 @@ def time_read_sql_query_select_column(self, connection, dtype): class ReadSQLTable(object): - goal_time = 0.2 - def setup(self): N = 10000 self.table_name = 'test' @@ -106,8 +100,6 @@ def time_read_sql_table_parse_dates(self): class ReadSQLTableDtypes(object): - goal_time = 0.2 - params = ['float', 'float_with_nan', 'string', 'bool', 'int', 'datetime'] param_names = ['dtype'] @@ -130,3 +122,6 @@ def setup(self, dtype): def time_read_sql_table_column(self, dtype): read_sql_table(self.table_name, self.con, columns=[dtype]) + + +from ..pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/io/stata.py b/asv_bench/benchmarks/io/stata.py index e0f5752ca930f..a7f854a853f50 100644 --- a/asv_bench/benchmarks/io/stata.py +++ b/asv_bench/benchmarks/io/stata.py @@ -2,12 +2,11 @@ from pandas import DataFrame, date_range, read_stata import pandas.util.testing as tm -from ..pandas_vb_common import BaseIO, setup # noqa +from ..pandas_vb_common import BaseIO class Stata(BaseIO): - goal_time = 0.2 params = ['tc', 'td', 'tm', 'tw', 'th', 'tq', 'ty'] param_names = ['convert_dates'] @@ -35,3 +34,6 @@ def time_read_stata(self, convert_dates): def time_write_stata(self, convert_dates): self.df.to_stata(self.fname, self.convert_dates) + + +from ..pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/join_merge.py b/asv_bench/benchmarks/join_merge.py index de0a3b33da147..6da8287a06d80 100644 --- a/asv_bench/benchmarks/join_merge.py +++ b/asv_bench/benchmarks/join_merge.py @@ -3,20 +3,17 @@ import numpy as np import pandas.util.testing as tm -from pandas import (DataFrame, Series, MultiIndex, date_range, concat, merge, - merge_asof) +from pandas import (DataFrame, Series, Panel, MultiIndex, + date_range, concat, merge, merge_asof) + try: from pandas import merge_ordered except ImportError: from pandas import ordered_merge as merge_ordered -from .pandas_vb_common import Panel, setup # noqa - class Append(object): - goal_time = 0.2 - def setup(self): self.df1 = DataFrame(np.random.randn(10000, 4), columns=['A', 'B', 'C', 'D']) @@ -26,11 +23,7 @@ def setup(self): self.mdf1['obj1'] = 'bar' self.mdf1['obj2'] = 'bar' 
self.mdf1['int1'] = 5 - try: - with warnings.catch_warnings(record=True): - self.mdf1.consolidate(inplace=True) - except: - pass + self.mdf1 = self.mdf1._consolidate() self.mdf2 = self.mdf1.copy() self.mdf2.index = self.df2.index @@ -43,7 +36,6 @@ def time_append_mixed(self): class Concat(object): - goal_time = 0.2 params = [0, 1] param_names = ['axis'] @@ -56,9 +48,10 @@ def setup(self, axis): index=date_range('20130101', periods=N, freq='s')) self.empty_left = [DataFrame(), df] self.empty_right = [df, DataFrame()] + self.mixed_ndims = [df, df.head(N // 2)] def time_concat_series(self, axis): - concat(self.series, axis=axis) + concat(self.series, axis=axis, sort=False) def time_concat_small_frames(self, axis): concat(self.small_frames, axis=axis) @@ -69,10 +62,12 @@ def time_concat_empty_right(self, axis): def time_concat_empty_left(self, axis): concat(self.empty_left, axis=axis) + def time_concat_mixed_ndims(self, axis): + concat(self.mixed_ndims, axis=axis) + class ConcatPanels(object): - goal_time = 0.2 params = ([0, 1, 2], [True, False]) param_names = ['axis', 'ignore_index'] @@ -98,7 +93,6 @@ def time_f_ordered(self, axis, ignore_index): class ConcatDataFrames(object): - goal_time = 0.2 params = ([0, 1], [True, False]) param_names = ['axis', 'ignore_index'] @@ -119,23 +113,22 @@ def time_f_ordered(self, axis, ignore_index): class Join(object): - goal_time = 0.2 params = [True, False] param_names = ['sort'] def setup(self, sort): level1 = tm.makeStringIndex(10).values level2 = tm.makeStringIndex(1000).values - label1 = np.arange(10).repeat(1000) - label2 = np.tile(np.arange(1000), 10) + codes1 = np.arange(10).repeat(1000) + codes2 = np.tile(np.arange(1000), 10) index2 = MultiIndex(levels=[level1, level2], - labels=[label1, label2]) + codes=[codes1, codes2]) self.df_multi = DataFrame(np.random.randn(len(index2), 4), index=index2, columns=['A', 'B', 'C', 'D']) - self.key1 = np.tile(level1.take(label1), 10) - self.key2 = np.tile(level2.take(label2), 10) + self.key1 = np.tile(level1.take(codes1), 10) + self.key2 = np.tile(level2.take(codes2), 10) self.df = DataFrame({'data1': np.random.randn(100000), 'data2': np.random.randn(100000), 'key1': self.key1, @@ -167,8 +160,6 @@ def time_join_dataframe_index_shuffle_key_bigger_sort(self, sort): class JoinIndex(object): - goal_time = 0.2 - def setup(self): N = 50000 self.left = DataFrame(np.random.randint(1, N / 500, (N, 2)), @@ -183,8 +174,6 @@ def time_left_outer_join_index(self): class JoinNonUnique(object): # outer join of non-unique # GH 6329 - goal_time = 0.2 - def setup(self): date_index = date_range('01-Jan-2013', '23-Jan-2013', freq='T') daily_dates = date_index.to_period('D').to_timestamp('S', 'S') @@ -201,7 +190,6 @@ def time_join_non_unique_equal(self): class Merge(object): - goal_time = 0.2 params = [True, False] param_names = ['sort'] @@ -236,7 +224,6 @@ def time_merge_dataframe_integer_key(self, sort): class I8Merge(object): - goal_time = 0.2 params = ['inner', 'outer', 'left', 'right'] param_names = ['how'] @@ -255,8 +242,6 @@ def time_i8merge(self, how): class MergeCategoricals(object): - goal_time = 0.2 - def setup(self): self.left_object = DataFrame( {'X': np.random.choice(range(0, 10), size=(10000,)), @@ -293,8 +278,10 @@ def time_merge_ordered(self): class MergeAsof(object): + params = [['backward', 'forward', 'nearest']] + param_names = ['direction'] - def setup(self): + def setup(self, direction): one_count = 200000 two_count = 1000000 @@ -326,26 +313,27 @@ def setup(self): self.df1e = df1[['time', 'key', 'key2', 'value1']] 
self.df2e = df2[['time', 'key', 'key2', 'value2']] - def time_on_int(self): - merge_asof(self.df1a, self.df2a, on='time') + def time_on_int(self, direction): + merge_asof(self.df1a, self.df2a, on='time', direction=direction) - def time_on_int32(self): - merge_asof(self.df1d, self.df2d, on='time32') + def time_on_int32(self, direction): + merge_asof(self.df1d, self.df2d, on='time32', direction=direction) - def time_by_object(self): - merge_asof(self.df1b, self.df2b, on='time', by='key') + def time_by_object(self, direction): + merge_asof(self.df1b, self.df2b, on='time', by='key', + direction=direction) - def time_by_int(self): - merge_asof(self.df1c, self.df2c, on='time', by='key2') + def time_by_int(self, direction): + merge_asof(self.df1c, self.df2c, on='time', by='key2', + direction=direction) - def time_multiby(self): - merge_asof(self.df1e, self.df2e, on='time', by=['key', 'key2']) + def time_multiby(self, direction): + merge_asof(self.df1e, self.df2e, on='time', by=['key', 'key2'], + direction=direction) class Align(object): - goal_time = 0.2 - def setup(self): size = 5 * 10**5 rng = np.arange(0, 10**13, 10**7) @@ -360,3 +348,6 @@ def time_series_align_int64_index(self): def time_series_align_left_monotonic(self): self.ts1.align(self.ts2, join='left') + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/multiindex_object.py b/asv_bench/benchmarks/multiindex_object.py index 0c92214795557..adc6730dcd946 100644 --- a/asv_bench/benchmarks/multiindex_object.py +++ b/asv_bench/benchmarks/multiindex_object.py @@ -4,13 +4,9 @@ import pandas.util.testing as tm from pandas import date_range, MultiIndex -from .pandas_vb_common import setup # noqa - class GetLoc(object): - goal_time = 0.2 - def setup(self): self.mi_large = MultiIndex.from_product( [np.arange(1000), np.arange(20), list(string.ascii_letters)], @@ -46,8 +42,6 @@ def time_small_get_loc_warm(self): class Duplicates(object): - goal_time = 0.2 - def setup(self): size = 65536 arrays = [np.random.randint(0, 8192, size), @@ -62,8 +56,6 @@ def time_remove_unused_levels(self): class Integer(object): - goal_time = 0.2 - def setup(self): self.mi_int = MultiIndex.from_product([np.arange(1000), np.arange(1000)], @@ -82,15 +74,13 @@ def time_is_monotonic(self): class Duplicated(object): - goal_time = 0.2 - def setup(self): n, k = 200, 5000 levels = [np.arange(n), tm.makeStringIndex(n).values, 1000 + np.arange(n)] - labels = [np.random.choice(n, (k * n)) for lev in levels] - self.mi = MultiIndex(levels=levels, labels=labels) + codes = [np.random.choice(n, (k * n)) for lev in levels] + self.mi = MultiIndex(levels=levels, codes=codes) def time_duplicated(self): self.mi.duplicated() @@ -98,8 +88,6 @@ def time_duplicated(self): class Sortlevel(object): - goal_time = 0.2 - def setup(self): n = 1182720 low, high = -4096, 4096 @@ -124,8 +112,6 @@ def time_sortlevel_one(self): class Values(object): - goal_time = 0.2 - def setup_cache(self): level1 = range(1000) @@ -138,3 +124,6 @@ def time_datetime_level_values_copy(self, mi): def time_datetime_level_values_sliced(self, mi): mi[:10].values + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/offset.py b/asv_bench/benchmarks/offset.py index e161b887ee86f..4570e73cccc71 100644 --- a/asv_bench/benchmarks/offset.py +++ b/asv_bench/benchmarks/offset.py @@ -34,8 +34,6 @@ class ApplyIndex(object): - goal_time = 0.2 - params = other_offsets param_names = ['offset'] @@ -49,8 +47,6 @@ def time_apply_index(self, offset): class OnOffset(object): - 
goal_time = 0.2 - params = offsets param_names = ['offset'] @@ -67,7 +63,6 @@ def time_on_offset(self, offset): class OffsetSeriesArithmetic(object): - goal_time = 0.2 params = offsets param_names = ['offset'] @@ -83,7 +78,6 @@ def time_add_offset(self, offset): class OffsetDatetimeIndexArithmetic(object): - goal_time = 0.2 params = offsets param_names = ['offset'] @@ -98,7 +92,6 @@ def time_add_offset(self, offset): class OffestDatetimeArithmetic(object): - goal_time = 0.2 params = offsets param_names = ['offset'] diff --git a/asv_bench/benchmarks/pandas_vb_common.py b/asv_bench/benchmarks/pandas_vb_common.py index e255cd94f265b..d479952cbfbf6 100644 --- a/asv_bench/benchmarks/pandas_vb_common.py +++ b/asv_bench/benchmarks/pandas_vb_common.py @@ -2,19 +2,31 @@ from importlib import import_module import numpy as np -from pandas import Panel +import pandas as pd # Compatibility import for lib for imp in ['pandas._libs.lib', 'pandas.lib']: try: lib = import_module(imp) break - except: + except (ImportError, TypeError, ValueError): pass numeric_dtypes = [np.int64, np.int32, np.uint32, np.uint64, np.float32, np.float64, np.int16, np.int8, np.uint16, np.uint8] datetime_dtypes = [np.datetime64, np.timedelta64] +string_dtypes = [np.object] +try: + extension_dtypes = [pd.Int8Dtype, pd.Int16Dtype, + pd.Int32Dtype, pd.Int64Dtype, + pd.UInt8Dtype, pd.UInt16Dtype, + pd.UInt32Dtype, pd.UInt64Dtype, + pd.CategoricalDtype, + pd.IntervalDtype, + pd.DatetimeTZDtype('ns', 'UTC'), + pd.PeriodDtype('D')] +except AttributeError: + extension_dtypes = [] def setup(*args, **kwargs): @@ -34,7 +46,7 @@ def remove(self, f): """Remove created files""" try: os.remove(f) - except: + except OSError: # On Windows, attempting to remove a file that is in use # causes an exception to be raised pass diff --git a/asv_bench/benchmarks/panel_ctor.py b/asv_bench/benchmarks/panel_ctor.py index ce946c76ed199..627705284481b 100644 --- a/asv_bench/benchmarks/panel_ctor.py +++ b/asv_bench/benchmarks/panel_ctor.py @@ -1,14 +1,10 @@ import warnings from datetime import datetime, timedelta -from pandas import DataFrame, DatetimeIndex, date_range - -from .pandas_vb_common import Panel, setup # noqa +from pandas import DataFrame, Panel, date_range class DifferentIndexes(object): - goal_time = 0.2 - def setup(self): self.data_frames = {} start = datetime(1990, 1, 1) @@ -26,12 +22,10 @@ def time_from_dict(self): class SameIndexes(object): - goal_time = 0.2 - def setup(self): - idx = DatetimeIndex(start=datetime(1990, 1, 1), - end=datetime(2012, 1, 1), - freq='D') + idx = date_range(start=datetime(1990, 1, 1), + end=datetime(2012, 1, 1), + freq='D') df = DataFrame({'a': 0, 'b': 1, 'c': 2}, index=idx) self.data_frames = dict(enumerate([df] * 100)) @@ -42,19 +36,20 @@ def time_from_dict(self): class TwoIndexes(object): - goal_time = 0.2 - def setup(self): start = datetime(1990, 1, 1) end = datetime(2012, 1, 1) df1 = DataFrame({'a': 0, 'b': 1, 'c': 2}, - index=DatetimeIndex(start=start, end=end, freq='D')) + index=date_range(start=start, end=end, freq='D')) end += timedelta(days=1) df2 = DataFrame({'a': 0, 'b': 1, 'c': 2}, - index=DatetimeIndex(start=start, end=end, freq='D')) + index=date_range(start=start, end=end, freq='D')) dfs = [df1] * 50 + [df2] * 50 self.data_frames = dict(enumerate(dfs)) def time_from_dict(self): with warnings.catch_warnings(record=True): Panel.from_dict(self.data_frames) + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/panel_methods.py b/asv_bench/benchmarks/panel_methods.py 
index a5b1a92e9cf67..a4c12c082236e 100644 --- a/asv_bench/benchmarks/panel_methods.py +++ b/asv_bench/benchmarks/panel_methods.py @@ -1,13 +1,11 @@ import warnings import numpy as np - -from .pandas_vb_common import Panel, setup # noqa +from pandas import Panel class PanelMethods(object): - goal_time = 0.2 params = ['items', 'major', 'minor'] param_names = ['axis'] @@ -22,3 +20,6 @@ def time_pct_change(self, axis): def time_shift(self, axis): with warnings.catch_warnings(record=True): self.panel.shift(1, axis=axis) + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/period.py b/asv_bench/benchmarks/period.py index c34f9a737473e..6d2c7156a0a3d 100644 --- a/asv_bench/benchmarks/period.py +++ b/asv_bench/benchmarks/period.py @@ -1,5 +1,6 @@ -from pandas import (DataFrame, Series, Period, PeriodIndex, date_range, - period_range) +from pandas import ( + DataFrame, Period, PeriodIndex, Series, date_range, period_range) +from pandas.tseries.frequencies import to_offset class PeriodProperties(object): @@ -35,27 +36,50 @@ def time_asfreq(self, freq): self.per.asfreq('A') -class PeriodIndexConstructor(object): +class PeriodConstructor(object): + params = [['D'], [True, False]] + param_names = ['freq', 'is_offset'] - goal_time = 0.2 + def setup(self, freq, is_offset): + if is_offset: + self.freq = to_offset(freq) + else: + self.freq = freq - params = ['D'] - param_names = ['freq'] + def time_period_constructor(self, freq, is_offset): + Period('2012-06-01', freq=freq) - def setup(self, freq): + +class PeriodIndexConstructor(object): + + params = [['D'], [True, False]] + param_names = ['freq', 'is_offset'] + + def setup(self, freq, is_offset): self.rng = date_range('1985', periods=1000) self.rng2 = date_range('1985', periods=1000).to_pydatetime() - - def time_from_date_range(self, freq): + self.ints = list(range(2000, 3000)) + self.daily_ints = date_range('1/1/2000', periods=1000, + freq=freq).strftime('%Y%m%d').map(int) + if is_offset: + self.freq = to_offset(freq) + else: + self.freq = freq + + def time_from_date_range(self, freq, is_offset): PeriodIndex(self.rng, freq=freq) - def time_from_pydatetime(self, freq): + def time_from_pydatetime(self, freq, is_offset): PeriodIndex(self.rng2, freq=freq) + def time_from_ints(self, freq, is_offset): + PeriodIndex(self.ints, freq=freq) -class DataFramePeriodColumn(object): + def time_from_ints_daily(self, freq, is_offset): + PeriodIndex(self.daily_ints, freq=freq) - goal_time = 0.2 + +class DataFramePeriodColumn(object): def setup(self): self.rng = period_range(start='1/1/1990', freq='S', periods=20000) @@ -72,8 +96,6 @@ def time_set_index(self): class Algorithms(object): - goal_time = 0.2 - params = ['index', 'series'] param_names = ['typ'] @@ -95,10 +117,8 @@ def time_value_counts(self, typ): class Indexing(object): - goal_time = 0.2 - def setup(self): - self.index = PeriodIndex(start='1985', periods=1000, freq='D') + self.index = period_range(start='1985', periods=1000, freq='D') self.series = Series(range(1000), index=self.index) self.period = self.index[500] @@ -119,3 +139,6 @@ def time_align(self): def time_intersection(self): self.index[:750].intersection(self.index[250:]) + + def time_unique(self): + self.index.unique() diff --git a/asv_bench/benchmarks/plotting.py b/asv_bench/benchmarks/plotting.py index 5b49112b0e07d..8a67af0bdabd1 100644 --- a/asv_bench/benchmarks/plotting.py +++ b/asv_bench/benchmarks/plotting.py @@ -7,27 +7,52 @@ import matplotlib matplotlib.use('Agg') -from .pandas_vb_common import setup # 
noqa +class SeriesPlotting(object): + params = [['line', 'bar', 'area', 'barh', 'hist', 'kde', 'pie']] + param_names = ['kind'] -class Plotting(object): + def setup(self, kind): + if kind in ['bar', 'barh', 'pie']: + n = 100 + elif kind in ['kde']: + n = 10000 + else: + n = 1000000 - goal_time = 0.2 + self.s = Series(np.random.randn(n)) + if kind in ['area', 'pie']: + self.s = self.s.abs() - def setup(self): - self.s = Series(np.random.randn(1000000)) - self.df = DataFrame({'col': self.s}) + def time_series_plot(self, kind): + self.s.plot(kind=kind) - def time_series_plot(self): - self.s.plot() - def time_frame_plot(self): - self.df.plot() +class FramePlotting(object): + params = [['line', 'bar', 'area', 'barh', 'hist', 'kde', 'pie', 'scatter', + 'hexbin']] + param_names = ['kind'] + def setup(self, kind): + if kind in ['bar', 'barh', 'pie']: + n = 100 + elif kind in ['kde', 'scatter', 'hexbin']: + n = 10000 + else: + n = 1000000 + + self.x = Series(np.random.randn(n)) + self.y = Series(np.random.randn(n)) + if kind in ['area', 'pie']: + self.x = self.x.abs() + self.y = self.y.abs() + self.df = DataFrame({'x': self.x, 'y': self.y}) + + def time_frame_plot(self, kind): + self.df.plot(x='x', y='y', kind=kind) -class TimeseriesPlotting(object): - goal_time = 0.2 +class TimeseriesPlotting(object): def setup(self): N = 2000 @@ -49,10 +74,11 @@ def time_plot_regular_compat(self): def time_plot_irregular(self): self.df2.plot() + def time_plot_table(self): + self.df.plot(table=True) -class Misc(object): - goal_time = 0.6 +class Misc(object): def setup(self): N = 500 @@ -62,3 +88,6 @@ def setup(self): def time_plot_andrews_curves(self): andrews_curves(self.df, "Name") + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/reindex.py b/asv_bench/benchmarks/reindex.py index 413427a16f40b..3080b34024a33 100644 --- a/asv_bench/benchmarks/reindex.py +++ b/asv_bench/benchmarks/reindex.py @@ -1,16 +1,14 @@ import numpy as np import pandas.util.testing as tm -from pandas import (DataFrame, Series, DatetimeIndex, MultiIndex, Index, - date_range) -from .pandas_vb_common import setup, lib # noqa +from pandas import (DataFrame, Series, MultiIndex, Index, date_range, + period_range) +from .pandas_vb_common import lib class Reindex(object): - goal_time = 0.2 - def setup(self): - rng = DatetimeIndex(start='1/1/1970', periods=10000, freq='1min') + rng = date_range(start='1/1/1970', periods=10000, freq='1min') self.df = DataFrame(np.random.rand(10000, 10), index=rng, columns=range(10)) self.df['foo'] = 'bar' @@ -37,22 +35,20 @@ def time_reindex_multiindex(self): class ReindexMethod(object): - goal_time = 0.2 - params = ['pad', 'backfill'] - param_names = ['method'] + params = [['pad', 'backfill'], [date_range, period_range]] + param_names = ['method', 'constructor'] - def setup(self, method): + def setup(self, method, constructor): N = 100000 - self.idx = date_range('1/1/2000', periods=N, freq='1min') + self.idx = constructor('1/1/2000', periods=N, freq='1min') self.ts = Series(np.random.randn(N), index=self.idx)[::2] - def time_reindex_method(self, method): + def time_reindex_method(self, method, constructor): self.ts.reindex(self.idx, method=method) class Fillna(object): - goal_time = 0.2 params = ['pad', 'backfill'] param_names = ['method'] @@ -72,14 +68,12 @@ def time_float_32(self, method): class LevelAlign(object): - goal_time = 0.2 - def setup(self): self.index = MultiIndex( levels=[np.arange(10), np.arange(100), np.arange(100)], - labels=[np.arange(10).repeat(10000), - 
np.tile(np.arange(100).repeat(100), 10), - np.tile(np.tile(np.arange(100), 100), 10)]) + codes=[np.arange(10).repeat(10000), + np.tile(np.arange(100).repeat(100), 10), + np.tile(np.tile(np.arange(100), 100), 10)]) self.df = DataFrame(np.random.randn(len(self.index), 4), index=self.index) self.df_level = DataFrame(np.random.randn(100, 4), @@ -94,7 +88,6 @@ def time_reindex_level(self): class DropDuplicates(object): - goal_time = 0.2 params = [True, False] param_names = ['inplace'] @@ -139,8 +132,6 @@ def time_frame_drop_dups_bool(self, inplace): class Align(object): # blog "pandas escaped the zoo" - goal_time = 0.2 - def setup(self): n = 50000 indices = tm.makeStringIndex(n) @@ -156,8 +147,6 @@ def time_align_series_irregular_string(self): class LibFastZip(object): - goal_time = 0.2 - def setup(self): N = 10000 K = 10 @@ -170,3 +159,6 @@ def setup(self): def time_lib_fast_zip(self): lib.fast_zip(self.col_array_list) + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/replace.py b/asv_bench/benchmarks/replace.py index 41208125e8f32..d8efaf99e2c4d 100644 --- a/asv_bench/benchmarks/replace.py +++ b/asv_bench/benchmarks/replace.py @@ -1,12 +1,9 @@ import numpy as np import pandas as pd -from .pandas_vb_common import setup # noqa - class FillNa(object): - goal_time = 0.2 params = [True, False] param_names = ['inplace'] @@ -26,7 +23,6 @@ def time_replace(self, inplace): class ReplaceDict(object): - goal_time = 0.2 params = [True, False] param_names = ['inplace'] @@ -42,7 +38,6 @@ def time_replace_series(self, inplace): class Convert(object): - goal_time = 0.5 params = (['DataFrame', 'Series'], ['Timestamp', 'Timedelta']) param_names = ['constructor', 'replace_data'] @@ -56,3 +51,6 @@ def setup(self, constructor, replace_data): def time_replace(self, constructor, replace_data): self.data.replace(self.to_replace) + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/reshape.py b/asv_bench/benchmarks/reshape.py index 07634811370c7..f6ee107ab618e 100644 --- a/asv_bench/benchmarks/reshape.py +++ b/asv_bench/benchmarks/reshape.py @@ -5,13 +5,9 @@ from pandas import DataFrame, MultiIndex, date_range, melt, wide_to_long import pandas as pd -from .pandas_vb_common import setup # noqa - class Melt(object): - goal_time = 0.2 - def setup(self): self.df = DataFrame(np.random.randn(10000, 3), columns=['A', 'B', 'C']) self.df['id1'] = np.random.randint(0, 10, 10000) @@ -23,8 +19,6 @@ def time_melt_dataframe(self): class Pivot(object): - goal_time = 0.2 - def setup(self): N = 10000 index = date_range('1/1/2000', periods=N, freq='h') @@ -39,8 +33,6 @@ def time_reshape_pivot_time_series(self): class SimpleReshape(object): - goal_time = 0.2 - def setup(self): arrays = [np.arange(100).repeat(100), np.roll(np.tile(np.arange(100), 100), 25)] @@ -57,30 +49,38 @@ def time_unstack(self): class Unstack(object): - goal_time = 0.2 + params = ['int', 'category'] - def setup(self): + def setup(self, dtype): m = 100 n = 1000 levels = np.arange(m) index = MultiIndex.from_product([levels] * 2) columns = np.arange(n) - values = np.arange(m * m * n).reshape(m * m, n) + if dtype == 'int': + values = np.arange(m * m * n).reshape(m * m, n) + else: + # the category branch is ~20x slower than int. So we + # cut down the size a bit. Now it's only ~3x slower. 
+ n = 50 + columns = columns[:n] + indices = np.random.randint(0, 52, size=(m * m, n)) + values = np.take(list(string.ascii_letters), indices) + values = [pd.Categorical(v) for v in values.T] + self.df = DataFrame(values, index, columns) self.df2 = self.df.iloc[:-1] - def time_full_product(self): + def time_full_product(self, dtype): self.df.unstack() - def time_without_last_row(self): + def time_without_last_row(self, dtype): self.df2.unstack() class SparseIndex(object): - goal_time = 0.2 - def setup(self): NUM_ROWS = 1000 self.df = DataFrame({'A': np.random.randint(50, size=NUM_ROWS), @@ -97,8 +97,6 @@ def time_unstack(self): class WideToLong(object): - goal_time = 0.2 - def setup(self): nyrs = 20 nidvars = 20 @@ -117,8 +115,6 @@ def time_wide_to_long_big(self): class PivotTable(object): - goal_time = 0.2 - def setup(self): N = 100000 fac1 = np.array(['A', 'B', 'C'], dtype='O') @@ -135,13 +131,43 @@ def setup(self): def time_pivot_table(self): self.df.pivot_table(index='key1', columns=['key2', 'key3']) + def time_pivot_table_agg(self): + self.df.pivot_table(index='key1', columns=['key2', 'key3'], + aggfunc=['sum', 'mean']) -class GetDummies(object): - goal_time = 0.2 + def time_pivot_table_margins(self): + self.df.pivot_table(index='key1', columns=['key2', 'key3'], + margins=True) + + +class Crosstab(object): + + def setup(self): + N = 100000 + fac1 = np.array(['A', 'B', 'C'], dtype='O') + fac2 = np.array(['one', 'two'], dtype='O') + self.ind1 = np.random.randint(0, 3, size=N) + self.ind2 = np.random.randint(0, 2, size=N) + self.vec1 = fac1.take(self.ind1) + self.vec2 = fac2.take(self.ind2) + + def time_crosstab(self): + pd.crosstab(self.vec1, self.vec2) + def time_crosstab_values(self): + pd.crosstab(self.vec1, self.vec2, values=self.ind1, aggfunc='sum') + + def time_crosstab_normalize(self): + pd.crosstab(self.vec1, self.vec2, normalize=True) + + def time_crosstab_normalize_margins(self): + pd.crosstab(self.vec1, self.vec2, normalize=True, margins=True) + + +class GetDummies(object): def setup(self): categories = list(string.ascii_letters[:12]) - s = pd.Series(np.random.choice(categories, size=1_000_000), + s = pd.Series(np.random.choice(categories, size=1000000), dtype=pd.api.types.CategoricalDtype(categories)) self.s = s @@ -150,3 +176,44 @@ def time_get_dummies_1d(self): def time_get_dummies_1d_sparse(self): pd.get_dummies(self.s, sparse=True) + + +class Cut(object): + params = [[4, 10, 1000]] + param_names = ['bins'] + + def setup(self, bins): + N = 10**5 + self.int_series = pd.Series(np.arange(N).repeat(5)) + self.float_series = pd.Series(np.random.randn(N).repeat(5)) + self.timedelta_series = pd.Series(np.random.randint(N, size=N), + dtype='timedelta64[ns]') + self.datetime_series = pd.Series(np.random.randint(N, size=N), + dtype='datetime64[ns]') + + def time_cut_int(self, bins): + pd.cut(self.int_series, bins) + + def time_cut_float(self, bins): + pd.cut(self.float_series, bins) + + def time_cut_timedelta(self, bins): + pd.cut(self.timedelta_series, bins) + + def time_cut_datetime(self, bins): + pd.cut(self.datetime_series, bins) + + def time_qcut_int(self, bins): + pd.qcut(self.int_series, bins) + + def time_qcut_float(self, bins): + pd.qcut(self.float_series, bins) + + def time_qcut_timedelta(self, bins): + pd.qcut(self.timedelta_series, bins) + + def time_qcut_datetime(self, bins): + pd.qcut(self.datetime_series, bins) + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/rolling.py b/asv_bench/benchmarks/rolling.py index 
e3bf551fa5f2b..659b6591fbd4b 100644 --- a/asv_bench/benchmarks/rolling.py +++ b/asv_bench/benchmarks/rolling.py @@ -1,8 +1,6 @@ import pandas as pd import numpy as np -from .pandas_vb_common import setup # noqa - class Methods(object): @@ -23,6 +21,42 @@ def time_rolling(self, constructor, window, dtype, method): getattr(self.roll, method)() +class ExpandingMethods(object): + + sample_time = 0.2 + params = (['DataFrame', 'Series'], + ['int', 'float'], + ['median', 'mean', 'max', 'min', 'std', 'count', 'skew', 'kurt', + 'sum']) + param_names = ['constructor', 'dtype', 'method'] + + def setup(self, constructor, dtype, method): + N = 10**5 + arr = (100 * np.random.random(N)).astype(dtype) + self.expanding = getattr(pd, constructor)(arr).expanding() + + def time_expanding(self, constructor, dtype, method): + getattr(self.expanding, method)() + + +class EWMMethods(object): + + sample_time = 0.2 + params = (['DataFrame', 'Series'], + [10, 1000], + ['int', 'float'], + ['mean', 'std']) + param_names = ['constructor', 'window', 'dtype', 'method'] + + def setup(self, constructor, window, dtype, method): + N = 10**5 + arr = (100 * np.random.random(N)).astype(dtype) + self.ewm = getattr(pd, constructor)(arr).ewm(halflife=window) + + def time_ewm(self, constructor, window, dtype, method): + getattr(self.ewm, method)() + + class VariableWindowMethods(Methods): sample_time = 0.2 params = (['DataFrame', 'Series'], @@ -77,3 +111,6 @@ def setup(self, constructor, window, dtype, percentile, interpolation): def time_quantile(self, constructor, window, dtype, percentile, interpolation): self.roll.quantile(percentile, interpolation=interpolation) + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/series_methods.py b/asv_bench/benchmarks/series_methods.py index a5ccf5c32b876..5b0981dc10a8a 100644 --- a/asv_bench/benchmarks/series_methods.py +++ b/asv_bench/benchmarks/series_methods.py @@ -4,12 +4,9 @@ import pandas.util.testing as tm from pandas import Series, date_range, NaT -from .pandas_vb_common import setup # noqa - class SeriesConstructor(object): - goal_time = 0.2 params = [None, 'dict'] param_names = ['data'] @@ -26,8 +23,7 @@ def time_constructor(self, data): class IsIn(object): - goal_time = 0.2 - params = ['int64', 'object'] + params = ['int64', 'uint64', 'object'] param_names = ['dtype'] def setup(self, dtype): @@ -38,9 +34,66 @@ def time_isin(self, dtypes): self.s.isin(self.values) +class IsInFloat64(object): + + def setup(self): + self.small = Series([1, 2], dtype=np.float64) + self.many_different_values = np.arange(10**6, dtype=np.float64) + self.few_different_values = np.zeros(10**7, dtype=np.float64) + self.only_nans_values = np.full(10**7, np.nan, dtype=np.float64) + + def time_isin_many_different(self): + # runtime is dominated by creation of the lookup-table + self.small.isin(self.many_different_values) + + def time_isin_few_different(self): + # runtime is dominated by creation of the lookup-table + self.small.isin(self.few_different_values) + + def time_isin_nan_values(self): + # runtime is dominated by creation of the lookup-table + self.small.isin(self.only_nans_values) + + +class IsInForObjects(object): + + def setup(self): + self.s_nans = Series(np.full(10**4, np.nan)).astype(np.object) + self.vals_nans = np.full(10**4, np.nan).astype(np.object) + self.s_short = Series(np.arange(2)).astype(np.object) + self.s_long = Series(np.arange(10**5)).astype(np.object) + self.vals_short = np.arange(2).astype(np.object) + self.vals_long =
np.arange(10**5).astype(np.object) + # because of nans floats are special: + self.s_long_floats = Series(np.arange(10**5, + dtype=np.float)).astype(np.object) + self.vals_long_floats = np.arange(10**5, + dtype=np.float).astype(np.object) + + def time_isin_nans(self): + # if nan-objects are different objects, + # this has the potential to trigger O(n^2) running time + self.s_nans.isin(self.vals_nans) + + def time_isin_short_series_long_values(self): + # running time dominated by the preprocessing + self.s_short.isin(self.vals_long) + + def time_isin_long_series_short_values(self): + # running time dominated by look-up + self.s_long.isin(self.vals_short) + + def time_isin_long_series_long_values(self): + # no dominating part + self.s_long.isin(self.vals_long) + + def time_isin_long_series_long_values_floats(self): + # no dominating part + self.s_long_floats.isin(self.vals_long_floats) + + class NSort(object): - goal_time = 0.2 params = ['first', 'last', 'all'] param_names = ['keep'] @@ -56,7 +109,6 @@ def time_nsmallest(self, keep): class Dropna(object): - goal_time = 0.2 params = ['int', 'datetime'] param_names = ['dtype'] @@ -74,7 +126,6 @@ def time_dropna(self, dtype): class Map(object): - goal_time = 0.2 params = ['dict', 'Series'] param_names = 'mapper' @@ -90,8 +141,6 @@ def time_map(self, mapper): class Clip(object): - goal_time = 0.2 - def setup(self): self.s = Series(np.random.randn(50)) @@ -101,8 +150,7 @@ def time_clip(self): class ValueCounts(object): - goal_time = 0.2 - params = ['int', 'float', 'object'] + params = ['int', 'uint', 'float', 'object'] param_names = ['dtype'] def setup(self, dtype): @@ -114,8 +162,6 @@ def time_value_counts(self, dtype): class Dir(object): - goal_time = 0.2 - def setup(self): self.s = Series(index=tm.makeStringIndex(10000)) @@ -125,8 +171,6 @@ def time_dir_strings(self): class SeriesGetattr(object): # https://github.com/pandas-dev/pandas/issues/19764 - goal_time = 0.2 - def setup(self): self.s = Series(1, index=date_range("2012-01-01", freq='s', @@ -134,3 +178,6 @@ def setup(self): def time_series_datetimeindex_repr(self): getattr(self.s, 'a', None) + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/sparse.py b/asv_bench/benchmarks/sparse.py index dcb7694abc2ad..64f87c1670170 100644 --- a/asv_bench/benchmarks/sparse.py +++ b/asv_bench/benchmarks/sparse.py @@ -5,8 +5,6 @@ from pandas import (SparseSeries, SparseDataFrame, SparseArray, Series, date_range, MultiIndex) -from .pandas_vb_common import setup # noqa - def make_array(size, dense_proportion, fill_value, dtype): dense_size = int(size * dense_proportion) @@ -18,8 +16,6 @@ def make_array(size, dense_proportion, fill_value, dtype): class SparseSeriesToFrame(object): - goal_time = 0.2 - def setup(self): K = 50 N = 50001 @@ -37,7 +33,6 @@ def time_series_to_frame(self): class SparseArrayConstructor(object): - goal_time = 0.2 params = ([0.1, 0.01], [0, np.nan], [np.int64, np.float64, np.object]) param_names = ['dense_proportion', 'fill_value', 'dtype'] @@ -52,8 +47,6 @@ def time_sparse_array(self, dense_proportion, fill_value, dtype): class SparseDataFrameConstructor(object): - goal_time = 0.2 - def setup(self): N = 1000 self.arr = np.arange(N) @@ -72,8 +65,6 @@ def time_from_dict(self): class FromCoo(object): - goal_time = 0.2 - def setup(self): self.matrix = scipy.sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])), @@ -85,8 +76,6 @@ def time_sparse_series_from_coo(self): class ToCoo(object): - goal_time = 0.2 - def setup(self): s = Series([np.nan] * 
10000) s[0] = 3.0 @@ -103,7 +92,6 @@ def time_sparse_series_to_coo(self): class Arithmetic(object): - goal_time = 0.2 params = ([0.1, 0.01], [0, np.nan]) param_names = ['dense_proportion', 'fill_value'] @@ -129,7 +117,6 @@ def time_divide(self, dense_proportion, fill_value): class ArithmeticBlock(object): - goal_time = 0.2 params = [np.nan, 0] param_names = ['fill_value'] @@ -160,3 +147,6 @@ def time_addition(self, fill_value): def time_division(self, fill_value): self.arr1 / self.arr2 + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/stat_ops.py b/asv_bench/benchmarks/stat_ops.py index c447c78d0d070..7fdc713f076ed 100644 --- a/asv_bench/benchmarks/stat_ops.py +++ b/asv_bench/benchmarks/stat_ops.py @@ -1,8 +1,6 @@ import numpy as np import pandas as pd -from .pandas_vb_common import setup # noqa - ops = ['mean', 'sum', 'median', 'std', 'skew', 'kurt', 'mad', 'prod', 'sem', 'var'] @@ -10,7 +8,6 @@ class FrameOps(object): - goal_time = 0.2 params = [ops, ['float', 'int'], [0, 1], [True, False]] param_names = ['op', 'dtype', 'axis', 'use_bottleneck'] @@ -18,7 +15,7 @@ def setup(self, op, dtype, axis, use_bottleneck): df = pd.DataFrame(np.random.randn(100000, 4)).astype(dtype) try: pd.options.compute.use_bottleneck = use_bottleneck - except: + except TypeError: from pandas.core import nanops nanops._USE_BOTTLENECK = use_bottleneck self.df_func = getattr(df, op) @@ -29,16 +26,15 @@ def time_op(self, op, dtype, axis, use_bottleneck): class FrameMultiIndexOps(object): - goal_time = 0.2 params = ([0, 1, [0, 1]], ops) param_names = ['level', 'op'] def setup(self, level, op): levels = [np.arange(10), np.arange(100), np.arange(100)] - labels = [np.arange(10).repeat(10000), - np.tile(np.arange(100).repeat(100), 10), - np.tile(np.tile(np.arange(100), 100), 10)] - index = pd.MultiIndex(levels=levels, labels=labels) + codes = [np.arange(10).repeat(10000), + np.tile(np.arange(100).repeat(100), 10), + np.tile(np.tile(np.arange(100), 100), 10)] + index = pd.MultiIndex(levels=levels, codes=codes) df = pd.DataFrame(np.random.randn(len(index), 4), index=index) self.df_func = getattr(df, op) @@ -48,7 +44,6 @@ def time_op(self, level, op): class SeriesOps(object): - goal_time = 0.2 params = [ops, ['float', 'int'], [True, False]] param_names = ['op', 'dtype', 'use_bottleneck'] @@ -56,7 +51,7 @@ def setup(self, op, dtype, use_bottleneck): s = pd.Series(np.random.randn(100000)).astype(dtype) try: pd.options.compute.use_bottleneck = use_bottleneck - except: + except TypeError: from pandas.core import nanops nanops._USE_BOTTLENECK = use_bottleneck self.s_func = getattr(s, op) @@ -67,16 +62,15 @@ def time_op(self, op, dtype, use_bottleneck): class SeriesMultiIndexOps(object): - goal_time = 0.2 params = ([0, 1, [0, 1]], ops) param_names = ['level', 'op'] def setup(self, level, op): levels = [np.arange(10), np.arange(100), np.arange(100)] - labels = [np.arange(10).repeat(10000), - np.tile(np.arange(100).repeat(100), 10), - np.tile(np.tile(np.arange(100), 100), 10)] - index = pd.MultiIndex(levels=levels, labels=labels) + codes = [np.arange(10).repeat(10000), + np.tile(np.arange(100).repeat(100), 10), + np.tile(np.tile(np.arange(100), 100), 10)] + index = pd.MultiIndex(levels=levels, codes=codes) s = pd.Series(np.random.randn(len(index)), index=index) self.s_func = getattr(s, op) @@ -86,7 +80,6 @@ def time_op(self, level, op): class Rank(object): - goal_time = 0.2 params = [['DataFrame', 'Series'], [True, False]] param_names = ['constructor', 'pct'] @@ -103,12 +96,49 @@ def 
time_average_old(self, constructor, pct): class Correlation(object): - goal_time = 0.2 - params = ['spearman', 'kendall', 'pearson'] - param_names = ['method'] + params = [['spearman', 'kendall', 'pearson'], [True, False]] + param_names = ['method', 'use_bottleneck'] - def setup(self, method): + def setup(self, method, use_bottleneck): + try: + pd.options.compute.use_bottleneck = use_bottleneck + except TypeError: + from pandas.core import nanops + nanops._USE_BOTTLENECK = use_bottleneck self.df = pd.DataFrame(np.random.randn(1000, 30)) + self.df2 = pd.DataFrame(np.random.randn(1000, 30)) + self.s = pd.Series(np.random.randn(1000)) + self.s2 = pd.Series(np.random.randn(1000)) - def time_corr(self, method): + def time_corr(self, method, use_bottleneck): self.df.corr(method=method) + + def time_corr_series(self, method, use_bottleneck): + self.s.corr(self.s2, method=method) + + def time_corrwith_cols(self, method, use_bottleneck): + self.df.corrwith(self.df2, method=method) + + def time_corrwith_rows(self, method, use_bottleneck): + self.df.corrwith(self.df2, axis=1, method=method) + + +class Covariance(object): + + params = [[True, False]] + param_names = ['use_bottleneck'] + + def setup(self, use_bottleneck): + try: + pd.options.compute.use_bottleneck = use_bottleneck + except TypeError: + from pandas.core import nanops + nanops._USE_BOTTLENECK = use_bottleneck + self.s = pd.Series(np.random.randn(100000)) + self.s2 = pd.Series(np.random.randn(100000)) + + def time_cov_series(self, use_bottleneck): + self.s.cov(self.s2) + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/strings.py b/asv_bench/benchmarks/strings.py index b203c8b0fa5c9..e9f2727f64e15 100644 --- a/asv_bench/benchmarks/strings.py +++ b/asv_bench/benchmarks/strings.py @@ -1,20 +1,15 @@ import warnings import numpy as np -from pandas import Series +from pandas import Series, DataFrame import pandas.util.testing as tm class Methods(object): - goal_time = 0.2 - def setup(self): self.s = Series(tm.makeStringIndex(10**5)) - def time_cat(self): - self.s.str.cat(sep=',') - def time_center(self): self.s.str.center(100) @@ -31,21 +26,42 @@ def time_extract(self): def time_findall(self): self.s.str.findall('[A-Z]+') + def time_find(self): + self.s.str.find('[A-Z]+') + + def time_rfind(self): + self.s.str.rfind('[A-Z]+') + def time_get(self): self.s.str.get(0) def time_len(self): self.s.str.len() + def time_join(self): + self.s.str.join(' ') + def time_match(self): self.s.str.match('A') + def time_normalize(self): + self.s.str.normalize('NFC') + def time_pad(self): self.s.str.pad(100, side='both') + def time_partition(self): + self.s.str.partition('A') + + def time_rpartition(self): + self.s.str.rpartition('A') + def time_replace(self): self.s.str.replace('A', '\x01\x01') + def time_translate(self): + self.s.str.translate({'A': '\x01\x01'}) + def time_slice(self): self.s.str.slice(5, 15, 2) @@ -70,10 +86,15 @@ def time_upper(self): def time_lower(self): self.s.str.lower() + def time_wrap(self): + self.s.str.wrap(10) + + def time_zfill(self): + self.s.str.zfill(10) + class Repeat(object): - goal_time = 0.2 params = ['int', 'array'] param_names = ['repeats'] @@ -87,9 +108,33 @@ def time_repeat(self, repeats): self.s.str.repeat(self.repeat) +class Cat(object): + + params = ([0, 3], [None, ','], [None, '-'], [0.0, 0.001, 0.15]) + param_names = ['other_cols', 'sep', 'na_rep', 'na_frac'] + + def setup(self, other_cols, sep, na_rep, na_frac): + N = 10 ** 5 + mask_gen = lambda: np.random.choice([True, False], 
N, + p=[1 - na_frac, na_frac]) + self.s = Series(tm.makeStringIndex(N)).where(mask_gen()) + if other_cols == 0: + # str.cat self-concatenates only for others=None + self.others = None + else: + self.others = DataFrame({i: tm.makeStringIndex(N).where(mask_gen()) + for i in range(other_cols)}) + + def time_cat(self, other_cols, sep, na_rep, na_frac): + # before the concatenation (one caller + other_cols columns), the total + # expected fraction of rows containing any NaN is: + # reduce(lambda t, _: t + (1 - t) * na_frac, range(other_cols + 1), 0) + # for other_cols=3 and na_frac=0.15, this works out to ~48% + self.s.str.cat(others=self.others, sep=sep, na_rep=na_rep) + + class Contains(object): - goal_time = 0.2 params = [True, False] param_names = ['regex'] @@ -102,7 +147,6 @@ def time_contains(self, regex): class Split(object): - goal_time = 0.2 params = [True, False] param_names = ['expand'] @@ -112,10 +156,11 @@ def setup(self, expand): def time_split(self, expand): self.s.str.split('--', expand=expand) + def time_rsplit(self, expand): + self.s.str.rsplit('--', expand=expand) -class Dummies(object): - goal_time = 0.2 +class Dummies(object): def setup(self): self.s = Series(tm.makeStringIndex(10**5)).str.join('|') @@ -126,8 +171,6 @@ def time_get_dummies(self): class Encode(object): - goal_time = 0.2 - def setup(self): self.ser = Series(tm.makeUnicodeIndex()) @@ -137,8 +180,6 @@ def time_encode_decode(self): class Slice(object): - goal_time = 0.2 - def setup(self): self.s = Series(['abcdefg', np.nan] * 500000) diff --git a/asv_bench/benchmarks/timedelta.py b/asv_bench/benchmarks/timedelta.py index 3fe75b3c34299..0cfbbd536bc8b 100644 --- a/asv_bench/benchmarks/timedelta.py +++ b/asv_bench/benchmarks/timedelta.py @@ -1,12 +1,12 @@ import datetime import numpy as np -from pandas import Series, timedelta_range, to_timedelta, Timestamp, Timedelta +from pandas import ( + DataFrame, Series, Timedelta, Timestamp, timedelta_range, to_timedelta) -class TimedeltaConstructor(object): - goal_time = 0.2 +class TimedeltaConstructor(object): def time_from_int(self): Timedelta(123456789) @@ -36,8 +36,6 @@ def time_from_missing(self): class ToTimedelta(object): - goal_time = 0.2 - def setup(self): self.ints = np.random.randint(0, 60, size=10000) self.str_days = [] @@ -58,7 +56,6 @@ def time_convert_string_seconds(self): class ToTimedeltaErrors(object): - goal_time = 0.2 params = ['coerce', 'ignore'] param_names = ['errors'] @@ -73,8 +70,6 @@ def time_convert(self, errors): class TimedeltaOps(object): - goal_time = 0.2 - def setup(self): self.td = to_timedelta(np.arange(1000000)) self.ts = Timestamp('2000') @@ -85,8 +80,6 @@ def time_add_td_ts(self): class TimedeltaProperties(object): - goal_time = 0.2 - def setup_cache(self): td = Timedelta(days=365, minutes=35, seconds=25, milliseconds=35) return td @@ -106,8 +99,6 @@ def time_timedelta_nanoseconds(self, td): class DatetimeAccessor(object): - goal_time = 0.2 - def setup_cache(self): N = 100000 series = Series(timedelta_range('1 days', periods=N, freq='h')) @@ -127,3 +118,36 @@ def time_timedelta_microseconds(self, series): def time_timedelta_nanoseconds(self, series): series.dt.nanoseconds + + +class TimedeltaIndexing(object): + + def setup(self): + self.index = timedelta_range(start='1985', periods=1000, freq='D') + self.index2 = timedelta_range(start='1986', periods=1000, freq='D') + self.series = Series(range(1000), index=self.index) + self.timedelta = self.index[500] + + def time_get_loc(self): + self.index.get_loc(self.timedelta) + + def 
time_shape(self): + self.index.shape + + def time_shallow_copy(self): + self.index._shallow_copy() + + def time_series_loc(self): + self.series.loc[self.timedelta] + + def time_align(self): + DataFrame({'a': self.series, 'b': self.series[:500]}) + + def time_intersection(self): + self.index.intersection(self.index2) + + def time_union(self): + self.index.union(self.index2) + + def time_unique(self): + self.index.unique() diff --git a/asv_bench/benchmarks/timeseries.py b/asv_bench/benchmarks/timeseries.py index eada401d2930b..6efd720d1acdd 100644 --- a/asv_bench/benchmarks/timeseries.py +++ b/asv_bench/benchmarks/timeseries.py @@ -1,6 +1,6 @@ -import warnings from datetime import timedelta +import dateutil import numpy as np from pandas import to_datetime, date_range, Series, DataFrame, period_range from pandas.tseries.frequencies import infer_freq @@ -9,13 +9,10 @@ except ImportError: from pandas.tseries.converter import DatetimeConverter -from .pandas_vb_common import setup # noqa - class DatetimeIndex(object): - goal_time = 0.2 - params = ['dst', 'repeated', 'tz_aware', 'tz_naive'] + params = ['dst', 'repeated', 'tz_aware', 'tz_local', 'tz_naive'] param_names = ['index_type'] def setup(self, index_type): @@ -29,6 +26,10 @@ def setup(self, index_type): periods=N, freq='s', tz='US/Eastern'), + 'tz_local': date_range(start='2000', + periods=N, + freq='s', + tz=dateutil.tz.tzlocal()), 'tz_naive': date_range(start='2000', periods=N, freq='s')} @@ -61,9 +62,10 @@ def time_to_pydatetime(self, index_type): class TzLocalize(object): - goal_time = 0.2 + params = [None, 'US/Eastern', 'UTC', dateutil.tz.tzutc()] + param_names = 'tz' - def setup(self): + def setup(self, tz): dst_rng = date_range(start='10/29/2000 1:00:00', end='10/29/2000 1:59:59', freq='S') self.index = date_range(start='10/29/2000', @@ -74,13 +76,12 @@ def setup(self): end='10/29/2000 3:00:00', freq='S')) - def time_infer_dst(self): - self.index.tz_localize('US/Eastern', ambiguous='infer') + def time_infer_dst(self, tz): + self.index.tz_localize(tz, ambiguous='infer') class ResetIndex(object): - goal_time = 0.2 params = [None, 'US/Eastern'] param_names = 'tz' @@ -94,7 +95,6 @@ def time_reest_datetimeindex(self, tz): class Factorize(object): - goal_time = 0.2 params = [None, 'Asia/Tokyo'] param_names = 'tz' @@ -109,7 +109,6 @@ def time_factorize(self, tz): class InferFreq(object): - goal_time = 0.2 params = [None, 'D', 'B'] param_names = ['freq'] @@ -126,8 +125,6 @@ def time_infer_freq(self, freq): class TimeDatetimeConverter(object): - goal_time = 0.2 - def setup(self): N = 100000 self.rng = date_range(start='1/1/2000', periods=N, freq='T') @@ -138,7 +135,6 @@ def time_convert(self): class Iteration(object): - goal_time = 0.2 params = [date_range, period_range] param_names = ['time_index'] @@ -159,7 +155,6 @@ def time_iter_preexit(self, time_index): class ResampleDataFrame(object): - goal_time = 0.2 params = ['max', 'mean', 'min'] param_names = ['method'] @@ -174,7 +169,6 @@ def time_method(self, method): class ResampleSeries(object): - goal_time = 0.2 params = (['period', 'datetime'], ['5min', '1D'], ['mean', 'ohlc']) param_names = ['index', 'freq', 'method'] @@ -195,8 +189,6 @@ def time_resample(self, index, freq, method): class ResampleDatetetime64(object): # GH 7754 - goal_time = 0.2 - def setup(self): rng3 = date_range(start='2000-01-01 00:00:00', end='2000-01-01 10:00:00', freq='555000U') @@ -208,7 +200,6 @@ def time_resample(self): class AsOf(object): - goal_time = 0.2 params = ['DataFrame', 'Series'] param_names = 
['constructor'] @@ -256,7 +247,6 @@ def time_asof_nan_single(self, constructor): class SortIndex(object): - goal_time = 0.2 params = [True, False] param_names = ['monotonic'] @@ -276,8 +266,6 @@ def time_get_slice(self, monotonic): class IrregularOps(object): - goal_time = 0.2 - def setup(self): N = 10**5 idx = date_range(start='1/1/2000', periods=N, freq='s') @@ -291,8 +279,6 @@ def time_add(self): class Lookup(object): - goal_time = 0.2 - def setup(self): N = 1500000 rng = date_range(start='1/1/2000', periods=N, freq='S') @@ -306,8 +292,6 @@ def time_lookup_and_cleanup(self): class ToDatetimeYYYYMMDD(object): - goal_time = 0.2 - def setup(self): rng = date_range(start='1/1/2000', periods=10000, freq='D') self.stringsD = Series(rng.strftime('%Y%m%d')) @@ -318,8 +302,6 @@ def time_format_YYYYMMDD(self): class ToDatetimeISO8601(object): - goal_time = 0.2 - def setup(self): rng = date_range(start='1/1/2000', periods=20000, freq='H') self.strings = rng.strftime('%Y-%m-%d %H:%M:%S').tolist() @@ -343,9 +325,33 @@ def time_iso8601_tz_spaceformat(self): to_datetime(self.strings_tz_space) -class ToDatetimeFormat(object): +class ToDatetimeNONISO8601(object): + + def setup(self): + N = 10000 + half = int(N / 2) + ts_string_1 = 'March 1, 2018 12:00:00+0400' + ts_string_2 = 'March 1, 2018 12:00:00+0500' + self.same_offset = [ts_string_1] * N + self.diff_offset = [ts_string_1] * half + [ts_string_2] * half + + def time_same_offset(self): + to_datetime(self.same_offset) + + def time_different_offset(self): + to_datetime(self.diff_offset) - goal_time = 0.2 + +class ToDatetimeFormatQuarters(object): + + def setup(self): + self.s = Series(['2Q2005', '2Q05', '2005Q1', '05Q1'] * 10000) + + def time_infer_quarter(self): + to_datetime(self.s) + + +class ToDatetimeFormat(object): def setup(self): self.s = Series(['19MAY11', '19MAY11:00:00:00'] * 100000) @@ -360,7 +366,6 @@ def time_no_exact(self): class ToDatetimeCache(object): - goal_time = 0.2 params = [True, False] param_names = ['cache'] @@ -389,12 +394,35 @@ def time_dup_string_tzoffset_dates(self, cache): class DatetimeAccessor(object): - def setup(self): + params = [None, 'US/Eastern', 'UTC', dateutil.tz.tzutc()] + param_names = 'tz' + + def setup(self, tz): N = 100000 - self.series = Series(date_range(start='1/1/2000', periods=N, freq='T')) + self.series = Series( + date_range(start='1/1/2000', periods=N, freq='T', tz=tz) + ) - def time_dt_accessor(self): + def time_dt_accessor(self, tz): self.series.dt - def time_dt_accessor_normalize(self): + def time_dt_accessor_normalize(self, tz): self.series.dt.normalize() + + def time_dt_accessor_month_name(self, tz): + self.series.dt.month_name() + + def time_dt_accessor_day_name(self, tz): + self.series.dt.day_name() + + def time_dt_accessor_time(self, tz): + self.series.dt.time + + def time_dt_accessor_date(self, tz): + self.series.dt.date + + def time_dt_accessor_year(self, tz): + self.series.dt.year + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/timestamp.py b/asv_bench/benchmarks/timestamp.py index c142a9b59fc43..b45ae22650e17 100644 --- a/asv_bench/benchmarks/timestamp.py +++ b/asv_bench/benchmarks/timestamp.py @@ -1,8 +1,10 @@ import datetime -from pandas import Timestamp +import dateutil import pytz +from pandas import Timestamp + class TimestampConstruction(object): @@ -29,9 +31,8 @@ def time_fromtimestamp(self): class TimestampProperties(object): - goal_time = 0.2 - - _tzs = [None, pytz.timezone('Europe/Amsterdam')] + _tzs = [None, 
pytz.timezone('Europe/Amsterdam'), pytz.UTC, + dateutil.tz.tzutc()] _freqs = [None, 'B'] params = [_tzs, _freqs] param_names = ['tz', 'freq'] @@ -46,7 +47,7 @@ def time_dayofweek(self, tz, freq): self.ts.dayofweek def time_weekday_name(self, tz, freq): - self.ts.weekday_name + self.ts.day_name def time_dayofyear(self, tz, freq): self.ts.dayofyear @@ -76,22 +77,24 @@ def time_is_quarter_end(self, tz, freq): self.ts.is_quarter_end def time_is_year_start(self, tz, freq): - self.ts.is_quarter_end + self.ts.is_year_start def time_is_year_end(self, tz, freq): - self.ts.is_quarter_end + self.ts.is_year_end def time_is_leap_year(self, tz, freq): - self.ts.is_quarter_end + self.ts.is_leap_year def time_microsecond(self, tz, freq): self.ts.microsecond + def time_month_name(self, tz, freq): + self.ts.month_name() -class TimestampOps(object): - goal_time = 0.2 - params = [None, 'US/Eastern'] +class TimestampOps(object): + params = [None, 'US/Eastern', pytz.UTC, + dateutil.tz.tzutc()] param_names = ['tz'] def setup(self, tz): @@ -106,10 +109,28 @@ def time_replace_None(self, tz): def time_to_pydatetime(self, tz): self.ts.to_pydatetime() + def time_normalize(self, tz): + self.ts.normalize() -class TimestampAcrossDst(object): - goal_time = 0.2 + def time_tz_convert(self, tz): + if self.ts.tz is not None: + self.ts.tz_convert(tz) + + def time_tz_localize(self, tz): + if self.ts.tz is None: + self.ts.tz_localize(tz) + + def time_to_julian_date(self, tz): + self.ts.to_julian_date() + def time_floor(self, tz): + self.ts.floor('5T') + + def time_ceil(self, tz): + self.ts.ceil('5T') + + +class TimestampAcrossDst(object): def setup(self): dt = datetime.datetime(2016, 3, 27, 1) self.tzinfo = pytz.timezone('CET').localize(dt, is_dst=False).tzinfo diff --git a/asv_bench/vbench_to_asv.py b/asv_bench/vbench_to_asv.py deleted file mode 100644 index b1179387e65d5..0000000000000 --- a/asv_bench/vbench_to_asv.py +++ /dev/null @@ -1,163 +0,0 @@ -import ast -import vbench -import os -import sys -import astor -import glob - - -def vbench_to_asv_source(bench, kinds=None): - tab = ' ' * 4 - if kinds is None: - kinds = ['time'] - - output = 'class {}(object):\n'.format(bench.name) - output += tab + 'goal_time = 0.2\n\n' - - if bench.setup: - indented_setup = [tab * 2 + '{}\n'.format(x) for x in bench.setup.splitlines()] - output += tab + 'def setup(self):\n' + ''.join(indented_setup) + '\n' - - for kind in kinds: - output += tab + 'def {}_{}(self):\n'.format(kind, bench.name) - for line in bench.code.splitlines(): - output += tab * 2 + line + '\n' - output += '\n\n' - - if bench.cleanup: - output += tab + 'def teardown(self):\n' + tab * 2 + bench.cleanup - - output += '\n\n' - return output - - -class AssignToSelf(ast.NodeTransformer): - def __init__(self): - super(AssignToSelf, self).__init__() - self.transforms = {} - self.imports = [] - - self.in_class_define = False - self.in_setup = False - - def visit_ClassDef(self, node): - self.transforms = {} - self.in_class_define = True - - functions_to_promote = [] - setup_func = None - - for class_func in ast.iter_child_nodes(node): - if isinstance(class_func, ast.FunctionDef): - if class_func.name == 'setup': - setup_func = class_func - for anon_func in ast.iter_child_nodes(class_func): - if isinstance(anon_func, ast.FunctionDef): - functions_to_promote.append(anon_func) - - if setup_func: - for func in functions_to_promote: - setup_func.body.remove(func) - func.args.args.insert(0, ast.Name(id='self', ctx=ast.Load())) - node.body.append(func) - self.transforms[func.name] = 
'self.' + func.name - - ast.fix_missing_locations(node) - - self.generic_visit(node) - - return node - - def visit_TryExcept(self, node): - if any(isinstance(x, (ast.Import, ast.ImportFrom)) for x in node.body): - self.imports.append(node) - else: - self.generic_visit(node) - return node - - def visit_Assign(self, node): - for target in node.targets: - if isinstance(target, ast.Name) and not isinstance(target.ctx, ast.Param) and not self.in_class_define: - self.transforms[target.id] = 'self.' + target.id - self.generic_visit(node) - - return node - - def visit_Name(self, node): - new_node = node - if node.id in self.transforms: - if not isinstance(node.ctx, ast.Param): - new_node = ast.Attribute(value=ast.Name(id='self', ctx=node.ctx), attr=node.id, ctx=node.ctx) - - self.generic_visit(node) - - return ast.copy_location(new_node, node) - - def visit_Import(self, node): - self.imports.append(node) - - def visit_ImportFrom(self, node): - self.imports.append(node) - - def visit_FunctionDef(self, node): - """Delete functions that are empty due to imports being moved""" - self.in_class_define = False - - self.generic_visit(node) - - if node.body: - return node - - -def translate_module(target_module): - g_vars = {} - l_vars = {} - exec('import ' + target_module) in g_vars - - print(target_module) - module = eval(target_module, g_vars) - - benchmarks = [] - for obj_str in dir(module): - obj = getattr(module, obj_str) - if isinstance(obj, vbench.benchmark.Benchmark): - benchmarks.append(obj) - - if not benchmarks: - return - - rewritten_output = '' - for bench in benchmarks: - rewritten_output += vbench_to_asv_source(bench) - - with open('rewrite.py', 'w') as f: - f.write(rewritten_output) - - ast_module = ast.parse(rewritten_output) - - transformer = AssignToSelf() - transformed_module = transformer.visit(ast_module) - - unique_imports = {astor.to_source(node): node for node in transformer.imports} - - transformed_module.body = unique_imports.values() + transformed_module.body - - transformed_source = astor.to_source(transformed_module) - - with open('benchmarks/{}.py'.format(target_module), 'w') as f: - f.write(transformed_source) - - -if __name__ == '__main__': - cwd = os.getcwd() - new_dir = os.path.join(os.path.dirname(__file__), '../vb_suite') - sys.path.insert(0, new_dir) - - for module in glob.glob(os.path.join(new_dir, '*.py')): - mod = os.path.basename(module) - if mod in ['make.py', 'measure_memory_consumption.py', 'perf_HEAD.py', 'run_suite.py', 'test_perf.py', 'generate_rst_files.py', 'test.py', 'suite.py']: - continue - print('') - print(mod) - - translate_module(mod.replace('.py', '')) diff --git a/azure-pipelines.yml b/azure-pipelines.yml new file mode 100644 index 0000000000000..f0567d76659b6 --- /dev/null +++ b/azure-pipelines.yml @@ -0,0 +1,119 @@ +# Adapted from https://github.com/numba/numba/blob/master/azure-pipelines.yml +jobs: +# Mac and Linux use the same template +- template: ci/azure/posix.yml + parameters: + name: macOS + vmImage: xcode9-macos10.13 +- template: ci/azure/posix.yml + parameters: + name: Linux + vmImage: ubuntu-16.04 + +# Windows Python 2.7 needs VC 9.0 installed, handled in the template +- template: ci/azure/windows.yml + parameters: + name: Windows + vmImage: vs2017-win2016 + +- job: 'Checks_and_doc' + pool: + vmImage: ubuntu-16.04 + timeoutInMinutes: 90 + steps: + - script: | + # XXX next command should avoid redefining the path in every step, but + # made the process crash as it couldn't find deactivate + #echo 
'##vso[task.prependpath]$HOME/miniconda3/bin' + echo '##vso[task.setvariable variable=CONDA_ENV]pandas-dev' + echo '##vso[task.setvariable variable=ENV_FILE]environment.yml' + echo '##vso[task.setvariable variable=AZURE]true' + displayName: 'Setting environment variables' + + # Do not require a conda environment + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + ci/code_checks.sh patterns + displayName: 'Looking for unwanted patterns' + condition: true + + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + sudo apt-get install -y libc6-dev-i386 + ci/incremental/install_miniconda.sh + ci/incremental/setup_conda_environment.sh + displayName: 'Set up environment' + condition: true + + # Do not require pandas + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + source activate pandas-dev + ci/code_checks.sh lint + displayName: 'Linting' + condition: true + + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + source activate pandas-dev + ci/code_checks.sh dependencies + displayName: 'Dependencies consistency' + condition: true + + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + source activate pandas-dev + ci/incremental/build.sh + displayName: 'Build' + condition: true + + # Require pandas + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + source activate pandas-dev + ci/code_checks.sh code + displayName: 'Checks on imported code' + condition: true + + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + source activate pandas-dev + ci/code_checks.sh doctests + displayName: 'Running doctests' + condition: true + + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + source activate pandas-dev + ci/code_checks.sh docstrings + displayName: 'Docstring validation' + condition: true + + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + source activate pandas-dev + pytest --capture=no --strict scripts + displayName: 'Testing docstring validation script' + condition: true + + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + source activate pandas-dev + git remote add upstream https://github.com/pandas-dev/pandas.git + git fetch upstream + if git diff upstream/master --name-only | grep -q "^asv_bench/"; then + cd asv_bench + asv machine --yes + ASV_OUTPUT="$(asv dev)" + if [[ $(echo "$ASV_OUTPUT" | grep "failed") ]]; then + echo "##vso[task.logissue type=error]Benchmarks run with errors" + echo "$ASV_OUTPUT" + exit 1 + else + echo "Benchmarks run without errors" + fi + else + echo "Benchmarks did not run, no changes detected" + fi + displayName: 'Running benchmarks' + condition: true diff --git a/ci/README.txt b/ci/README.txt deleted file mode 100644 index bb71dc25d6093..0000000000000 --- a/ci/README.txt +++ /dev/null @@ -1,17 +0,0 @@ -Travis is a ci service that's well-integrated with GitHub. -The following types of breakage should be detected -by Travis builds: - -1) Failing tests on any supported version of Python. -2) Pandas should install and the tests should run if no optional deps are installed. -That also means tests which rely on optional deps need to raise SkipTest() -if the dep is missing. -3) unicode related fails when running under exotic locales. - -We tried running the vbench suite for a while, but with varying load -on Travis machines, that wasn't useful. - -Travis currently (4/2013) has a 5-job concurrency limit. Exceeding it -basically doubles the total runtime for a commit through travis, and -since dep+pandas installation is already quite long, this should become -a hard limit on concurrent travis runs.
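The benchmark modules touched above all follow airspeed velocity's class-based convention: the per-class goal_time attribute is removed, parametrized cases are declared through params/param_names, anything that should not be timed happens in setup(), and several of the modules now import the shared setup fixture from pandas_vb_common once at the bottom of the file. The sketch below only illustrates that pattern; the class name, parameter values and data are hypothetical and are not part of this patch.

import numpy as np
from pandas import Series


class IsinBenchmarkSketch(object):
    # Illustrative benchmark class following the asv conventions used in
    # asv_bench/; not a class added by this patch.

    params = [10**3, 10**5]
    param_names = ['n']

    def setup(self, n):
        # setup() runs before each timed call and is excluded from the timing
        self.s = Series(np.random.randint(0, 1000, size=n))
        self.values = list(range(500))

    def time_isin(self, n):
        # asv measures the body of each time_* method for every parameter value
        self.s.isin(self.values)


# In the real benchmark modules the shared fixture import now sits at the end
# of the file, e.g.: from .pandas_vb_common import setup  # noqa: F401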
diff --git a/ci/azure/posix.yml b/ci/azure/posix.yml new file mode 100644 index 0000000000000..b9e0cd0b9258c --- /dev/null +++ b/ci/azure/posix.yml @@ -0,0 +1,100 @@ +parameters: + name: '' + vmImage: '' + +jobs: +- job: ${{ parameters.name }} + pool: + vmImage: ${{ parameters.vmImage }} + strategy: + matrix: + ${{ if eq(parameters.name, 'macOS') }}: + py35_np_120: + ENV_FILE: ci/deps/azure-macos-35.yaml + CONDA_PY: "35" + PATTERN: "not slow and not network" + + ${{ if eq(parameters.name, 'Linux') }}: + py27_np_120: + ENV_FILE: ci/deps/azure-27-compat.yaml + CONDA_PY: "27" + PATTERN: "not slow and not network" + + py27_locale_slow_old_np: + ENV_FILE: ci/deps/azure-27-locale.yaml + CONDA_PY: "27" + PATTERN: "slow" + LOCALE_OVERRIDE: "zh_CN.UTF-8" + EXTRA_APT: "language-pack-zh-hans" + + py36_locale_slow: + ENV_FILE: ci/deps/azure-36-locale_slow.yaml + CONDA_PY: "36" + PATTERN: "not slow and not network" + LOCALE_OVERRIDE: "it_IT.UTF-8" + + py37_locale: + ENV_FILE: ci/deps/azure-37-locale.yaml + CONDA_PY: "37" + PATTERN: "not slow and not network" + LOCALE_OVERRIDE: "zh_CN.UTF-8" + + py37_np_dev: + ENV_FILE: ci/deps/azure-37-numpydev.yaml + CONDA_PY: "37" + PATTERN: "not slow and not network" + TEST_ARGS: "-W error" + PANDAS_TESTING_MODE: "deprecate" + EXTRA_APT: "xsel" + + steps: + - script: | + if [ "$(uname)" == "Linux" ]; then sudo apt-get install -y libc6-dev-i386 $EXTRA_APT; fi + echo "Installing Miniconda" + ci/incremental/install_miniconda.sh + export PATH=$HOME/miniconda3/bin:$PATH + echo "Setting up Conda environment" + ci/incremental/setup_conda_environment.sh + displayName: 'Before Install' + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + source activate pandas-dev + ci/incremental/build.sh + displayName: 'Build' + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + source activate pandas-dev + ci/run_tests.sh + displayName: 'Test' + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + source activate pandas-dev && pushd /tmp && python -c "import pandas; pandas.show_versions();" && popd + - task: PublishTestResults@2 + inputs: + testResultsFiles: 'test-data-*.xml' + testRunTitle: ${{ format('{0}-$(CONDA_PY)', parameters.name) }} + - powershell: | + $junitXml = "test-data-single.xml" + $(Get-Content $junitXml | Out-String) -match 'failures="(.*?)"' + if ($matches[1] -eq 0) + { + Write-Host "No test failures in test-data-single" + } + else + { + # note that this will produce $LASTEXITCODE=1 + Write-Error "$($matches[1]) tests failed" + } + + $junitXmlMulti = "test-data-multiple.xml" + $(Get-Content $junitXmlMulti | Out-String) -match 'failures="(.*?)"' + if ($matches[1] -eq 0) + { + Write-Host "No test failures in test-data-multi" + } + else + { + # note that this will produce $LASTEXITCODE=1 + Write-Error "$($matches[1]) tests failed" + } + displayName: Check for test failures diff --git a/ci/azure/windows.yml b/ci/azure/windows.yml new file mode 100644 index 0000000000000..cece002024936 --- /dev/null +++ b/ci/azure/windows.yml @@ -0,0 +1,59 @@ +parameters: + name: '' + vmImage: '' + +jobs: +- job: ${{ parameters.name }} + pool: + vmImage: ${{ parameters.vmImage }} + strategy: + matrix: + py36_np14: + ENV_FILE: ci/deps/azure-windows-36.yaml + CONDA_PY: "36" + + py27_np121: + ENV_FILE: ci/deps/azure-windows-27.yaml + CONDA_PY: "27" + + steps: + - task: CondaEnvironment@1 + inputs: + updateConda: no + packageSpecs: '' + + - powershell: | + $wc = New-Object net.webclient + 
$wc.Downloadfile("https://download.microsoft.com/download/7/9/6/796EF2E4-801B-4FC4-AB28-B59FBF6D907B/VCForPython27.msi", "VCForPython27.msi") + Start-Process "VCForPython27.msi" /qn -Wait + displayName: 'Install VC 9.0 only for Python 2.7' + condition: eq(variables.CONDA_PY, '27') + + - script: | + ci\\incremental\\setup_conda_environment.cmd + displayName: 'Before Install' + - script: | + call activate pandas-dev + ci\\incremental\\build.cmd + displayName: 'Build' + - script: | + call activate pandas-dev + pytest -m "not slow and not network" --junitxml=test-data.xml pandas -n 2 -r sxX --strict --durations=10 %* + displayName: 'Test' + - task: PublishTestResults@2 + inputs: + testResultsFiles: 'test-data.xml' + testRunTitle: 'Windows-$(CONDA_PY)' + - powershell: | + $junitXml = "test-data.xml" + $(Get-Content $junitXml | Out-String) -match 'failures="(.*?)"' + if ($matches[1] -eq 0) + { + Write-Host "No test failures in test-data" + } + else + { + # note that this will produce $LASTEXITCODE=1 + Write-Error "$($matches[1]) tests failed" + } + displayName: Check for test failures diff --git a/ci/build_docs.sh b/ci/build_docs.sh index 90a666dc34ed7..bf22f0764144c 100755 --- a/ci/build_docs.sh +++ b/ci/build_docs.sh @@ -1,32 +1,19 @@ #!/bin/bash +set -e + if [ "${TRAVIS_OS_NAME}" != "linux" ]; then echo "not doing build_docs on non-linux" exit 0 fi -cd "$TRAVIS_BUILD_DIR" +cd "$TRAVIS_BUILD_DIR"/doc echo "inside $0" -git show --pretty="format:" --name-only HEAD~5.. --first-parent | grep -P "rst|txt|doc" - -# if [ "$?" != "0" ]; then -# echo "Skipping doc build, none were modified" -# # nope, skip docs build -# exit 0 -# fi - - if [ "$DOC" ]; then echo "Will build docs" - source activate pandas - - mv "$TRAVIS_BUILD_DIR"/doc /tmp - mv "$TRAVIS_BUILD_DIR/LICENSE" /tmp # included in the docs. - cd /tmp/doc - echo ############################### echo # Log file for the doc build # echo ############################### @@ -38,37 +25,32 @@ if [ "$DOC" ]; then echo # Create and send docs # echo ######################## - cd /tmp/doc/build/html - git config --global user.email "pandas-docs-bot@localhost.foo" - git config --global user.name "pandas-docs-bot" - - # create the repo - git init + echo "Only uploading docs when TRAVIS_PULL_REQUEST is 'false'" + echo "TRAVIS_PULL_REQUEST: ${TRAVIS_PULL_REQUEST}" - touch README - git add README - git commit -m "Initial commit" --allow-empty - git branch gh-pages - git checkout gh-pages - touch .nojekyll - git add --all . - git commit -m "Version" --allow-empty + if [ "${TRAVIS_PULL_REQUEST}" == "false" ]; then + cd build/html + git config --global user.email "pandas-docs-bot@localhost.foo" + git config --global user.name "pandas-docs-bot" - git remote remove origin - git remote add origin "https://${PANDAS_GH_TOKEN}@github.com/pandas-dev/pandas-docs-travis.git" - git fetch origin - git remote -v + # create the repo + git init - git push origin gh-pages -f + touch README + git add README + git commit -m "Initial commit" --allow-empty + git branch gh-pages + git checkout gh-pages + touch .nojekyll + git add --all . 
+ git commit -m "Version" --allow-empty - echo "Running doctests" - cd "$TRAVIS_BUILD_DIR" - pytest --doctest-modules \ - pandas/core/reshape/concat.py \ - pandas/core/reshape/pivot.py \ - pandas/core/reshape/reshape.py \ - pandas/core/reshape/tile.py + git remote add origin "https://${PANDAS_GH_TOKEN}@github.com/pandas-dev/pandas-docs-travis.git" + git fetch origin + git remote -v + git push origin gh-pages -f + fi fi exit 0 diff --git a/ci/check_imports.py b/ci/check_imports.py deleted file mode 100644 index 3f09290f8c375..0000000000000 --- a/ci/check_imports.py +++ /dev/null @@ -1,36 +0,0 @@ -""" -Check that certain modules are not loaded by `import pandas` -""" -import sys - -blacklist = { - 'bs4', - 'gcsfs', - 'html5lib', - 'ipython', - 'jinja2' - 'lxml', - 'numexpr', - 'openpyxl', - 'py', - 'pytest', - 's3fs', - 'scipy', - 'tables', - 'xlrd', - 'xlsxwriter', - 'xlwt', -} - - -def main(): - import pandas # noqa - - modules = set(x.split('.')[0] for x in sys.modules) - imported = modules & blacklist - if modules & blacklist: - sys.exit("Imported {}".format(imported)) - - -if __name__ == '__main__': - main() diff --git a/ci/circle-35-ascii.yaml b/ci/circle-35-ascii.yaml deleted file mode 100644 index 745678791458d..0000000000000 --- a/ci/circle-35-ascii.yaml +++ /dev/null @@ -1,13 +0,0 @@ -name: pandas -channels: - - defaults -dependencies: - - cython>=0.28.2 - - nomkl - - numpy - - python-dateutil - - python=3.5* - - pytz - # universal - - pytest - - pytest-xdist diff --git a/ci/code_checks.sh b/ci/code_checks.sh new file mode 100755 index 0000000000000..3e62a08975dad --- /dev/null +++ b/ci/code_checks.sh @@ -0,0 +1,257 @@ +#!/bin/bash +# +# Run checks related to code quality. +# +# This script is intended for both the CI and to check locally that code standards are +# respected. We are currently linting (PEP-8 and similar), looking for patterns of +# common mistakes (sphinx directives with missing blank lines, old style classes, +# unwanted imports...), we run doctests here (currently some files only), and we +# validate formatting error in docstrings. +# +# Usage: +# $ ./ci/code_checks.sh # run all checks +# $ ./ci/code_checks.sh lint # run linting only +# $ ./ci/code_checks.sh patterns # check for patterns that should not exist +# $ ./ci/code_checks.sh code # checks on imported code +# $ ./ci/code_checks.sh doctests # run doctests +# $ ./ci/code_checks.sh docstrings # validate docstring errors +# $ ./ci/code_checks.sh dependencies # check that dependencies are consistent + +[[ -z "$1" || "$1" == "lint" || "$1" == "patterns" || "$1" == "code" || "$1" == "doctests" || "$1" == "docstrings" || "$1" == "dependencies" ]] || \ + { echo "Unknown command $1. Usage: $0 [lint|patterns|code|doctests|docstrings|dependencies]"; exit 9999; } + +BASE_DIR="$(dirname $0)/.." +RET=0 +CHECK=$1 + +function invgrep { + # grep with inverse exist status and formatting for azure-pipelines + # + # This function works exactly as grep, but with opposite exit status: + # - 0 (success) when no patterns are found + # - 1 (fail) when the patterns are found + # + # This is useful for the CI, as we want to fail if one of the patterns + # that we want to avoid is found by grep. + if [[ "$AZURE" == "true" ]]; then + set -o pipefail + grep -n "$@" | awk -F ":" '{print "##vso[task.logissue type=error;sourcepath=" $1 ";linenumber=" $2 ";] Found unwanted pattern: " $3}' + else + grep "$@" + fi + return $((! 
$?)) +} + +if [[ "$AZURE" == "true" ]]; then + FLAKE8_FORMAT="##vso[task.logissue type=error;sourcepath=%(path)s;linenumber=%(row)s;columnnumber=%(col)s;code=%(code)s;]%(text)s" +else + FLAKE8_FORMAT="default" +fi + +### LINTING ### +if [[ -z "$CHECK" || "$CHECK" == "lint" ]]; then + + # `setup.cfg` contains the list of error codes that are being ignored in flake8 + + echo "flake8 --version" + flake8 --version + + # pandas/_libs/src is C code, so no need to search there. + MSG='Linting .py code' ; echo $MSG + flake8 --format="$FLAKE8_FORMAT" . + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Linting .pyx code' ; echo $MSG + flake8 --format="$FLAKE8_FORMAT" pandas --filename=*.pyx --select=E501,E302,E203,E111,E114,E221,E303,E128,E231,E126,E265,E305,E301,E127,E261,E271,E129,W291,E222,E241,E123,F403,C400,C401,C402,C403,C404,C405,C406,C407,C408,C409,C410,C411 + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Linting .pxd and .pxi.in' ; echo $MSG + flake8 --format="$FLAKE8_FORMAT" pandas/_libs --filename=*.pxi.in,*.pxd --select=E501,E302,E203,E111,E114,E221,E303,E231,E126,F403 + RET=$(($RET + $?)) ; echo $MSG "DONE" + + echo "flake8-rst --version" + flake8-rst --version + + MSG='Linting code-blocks in .rst documentation' ; echo $MSG + flake8-rst doc/source --filename=*.rst --format="$FLAKE8_FORMAT" + RET=$(($RET + $?)) ; echo $MSG "DONE" + + # Check that cython casting is of the form `obj` as opposed to ` obj`; + # it doesn't make a difference, but we want to be internally consistent. + # Note: this grep pattern is (intended to be) equivalent to the python + # regex r'(?])> ' + MSG='Linting .pyx code for spacing conventions in casting' ; echo $MSG + invgrep -r -E --include '*.pyx' --include '*.pxi.in' '[a-zA-Z0-9*]> ' pandas/_libs + RET=$(($RET + $?)) ; echo $MSG "DONE" + + # readability/casting: Warnings about C casting instead of C++ casting + # runtime/int: Warnings about using C number types instead of C++ ones + # build/include_subdir: Warnings about prefacing included header files with directory + + # We don't lint all C files because we don't want to lint any that are built + # from Cython files nor do we want to lint C files that we didn't modify for + # this particular codebase (e.g. src/headers, src/klib, src/msgpack). However, + # we can lint all header files since they aren't "generated" like C files are. 
+ MSG='Linting .c and .h' ; echo $MSG + cpplint --quiet --extensions=c,h --headers=h --recursive --filter=-readability/casting,-runtime/int,-build/include_subdir pandas/_libs/src/*.h pandas/_libs/src/parser pandas/_libs/ujson pandas/_libs/tslibs/src/datetime + RET=$(($RET + $?)) ; echo $MSG "DONE" + + echo "isort --version-number" + isort --version-number + + # Imports - Check formatting using isort see setup.cfg for settings + MSG='Check import format using isort ' ; echo $MSG + isort --recursive --check-only pandas asv_bench + RET=$(($RET + $?)) ; echo $MSG "DONE" + +fi + +### PATTERNS ### +if [[ -z "$CHECK" || "$CHECK" == "patterns" ]]; then + + # Check for imports from pandas.core.common instead of `import pandas.core.common as com` + MSG='Check for non-standard imports' ; echo $MSG + invgrep -R --include="*.py*" -E "from pandas.core.common import " pandas + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Check for pytest warns' ; echo $MSG + invgrep -r -E --include '*.py' 'pytest\.warns' pandas/tests/ + RET=$(($RET + $?)) ; echo $MSG "DONE" + + # Check for the following code in testing: `np.testing` and `np.array_equal` + MSG='Check for invalid testing' ; echo $MSG + invgrep -r -E --include '*.py' --exclude testing.py '(numpy|np)(\.testing|\.array_equal)' pandas/tests/ + RET=$(($RET + $?)) ; echo $MSG "DONE" + + # Check for the following code in the extension array base tests: `tm.assert_frame_equal` and `tm.assert_series_equal` + MSG='Check for invalid EA testing' ; echo $MSG + invgrep -r -E --include '*.py' --exclude base.py 'tm.assert_(series|frame)_equal' pandas/tests/extension/base + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Check for deprecated messages without sphinx directive' ; echo $MSG + invgrep -R --include="*.py" --include="*.pyx" -E "(DEPRECATED|DEPRECATE|Deprecated)(:|,|\.)" pandas + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Check for old-style classes' ; echo $MSG + invgrep -R --include="*.py" -E "class\s\S*[^)]:" pandas scripts + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Check for backticks incorrectly rendering because of missing spaces' ; echo $MSG + invgrep -R --include="*.rst" -E "[a-zA-Z0-9]\`\`?[a-zA-Z0-9]" doc/source/ + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Check for incorrect sphinx directives' ; echo $MSG + invgrep -R --include="*.py" --include="*.pyx" --include="*.rst" -E "\.\. 
(autosummary|contents|currentmodule|deprecated|function|image|important|include|ipython|literalinclude|math|module|note|raw|seealso|toctree|versionadded|versionchanged|warning):[^:]" ./pandas ./doc/source + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Check that the deprecated `assert_raises_regex` is not used (`pytest.raises(match=pattern)` should be used instead)' ; echo $MSG + invgrep -R --exclude=*.pyc --exclude=testing.py --exclude=test_util.py assert_raises_regex pandas + RET=$(($RET + $?)) ; echo $MSG "DONE" + + # Check for the following code in testing: `unittest.mock`, `mock.Mock()` or `mock.patch` + MSG='Check that unittest.mock is not used (pytest builtin monkeypatch fixture should be used instead)' ; echo $MSG + invgrep -r -E --include '*.py' '(unittest(\.| import )mock|mock\.Mock\(\)|mock\.patch)' pandas/tests/ + RET=$(($RET + $?)) ; echo $MSG "DONE" + + # Check that we use pytest.raises only as a context manager + # + # For any flake8-compliant code, the only way this regex gets + # matched is if there is no "with" statement preceding "pytest.raises" + MSG='Check for pytest.raises as context manager (a line starting with `pytest.raises` is invalid, needs a `with` to precede it)' ; echo $MSG + MSG='TODO: This check is currently skipped because so many files fail this. Please enable when all are corrected (xref gh-24332)' ; echo $MSG + # invgrep -R --include '*.py' -E '[[:space:]] pytest.raises' pandas/tests + # RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Check for wrong space after code-block directive and before colon (".. code-block ::" instead of ".. code-block::")' ; echo $MSG + invgrep -R --include="*.rst" ".. code-block ::" doc/source + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Check for wrong space after ipython directive and before colon (".. ipython ::" instead of ".. ipython::")' ; echo $MSG + invgrep -R --include="*.rst" ".. ipython ::" doc/source + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Check that no file in the repo contains trailing whitespaces' ; echo $MSG + set -o pipefail + if [[ "$AZURE" == "true" ]]; then + ! grep -n --exclude="*.svg" -RI "\s$" * | awk -F ":" '{print "##vso[task.logissue type=error;sourcepath=" $1 ";linenumber=" $2 ";] Trailing whitespaces found: " $3}' + else + ! grep -n --exclude="*.svg" -RI "\s$" * | awk -F ":" '{print $1 ":" $2 ":Trailing whitespaces found: " $3}' + fi + RET=$(($RET + $?)) ; echo $MSG "DONE" +fi + +### CODE ### +if [[ -z "$CHECK" || "$CHECK" == "code" ]]; then + + MSG='Check import.
No warnings, and blacklist some optional dependencies' ; echo $MSG + python -W error -c " +import sys +import pandas + +blacklist = {'bs4', 'gcsfs', 'html5lib', 'ipython', 'jinja2' 'hypothesis', + 'lxml', 'numexpr', 'openpyxl', 'py', 'pytest', 's3fs', 'scipy', + 'tables', 'xlrd', 'xlsxwriter', 'xlwt'} +mods = blacklist & set(m.split('.')[0] for m in sys.modules) +if mods: + sys.stderr.write('err: pandas should not import: {}\n'.format(', '.join(mods))) + sys.exit(len(mods)) + " + RET=$(($RET + $?)) ; echo $MSG "DONE" + +fi + +### DOCTESTS ### +if [[ -z "$CHECK" || "$CHECK" == "doctests" ]]; then + + MSG='Doctests frame.py' ; echo $MSG + pytest -q --doctest-modules pandas/core/frame.py \ + -k"-axes -combine -itertuples -join -pivot_table -query -reindex -reindex_axis -round" + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Doctests series.py' ; echo $MSG + pytest -q --doctest-modules pandas/core/series.py \ + -k"-nonzero -reindex -searchsorted -to_dict" + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Doctests generic.py' ; echo $MSG + pytest -q --doctest-modules pandas/core/generic.py \ + -k"-_set_axis_name -_xs -describe -droplevel -groupby -interpolate -pct_change -pipe -reindex -reindex_axis -to_json -transpose -values -xs -to_clipboard" + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Doctests top-level reshaping functions' ; echo $MSG + pytest -q --doctest-modules \ + pandas/core/reshape/concat.py \ + pandas/core/reshape/pivot.py \ + pandas/core/reshape/reshape.py \ + pandas/core/reshape/tile.py \ + -k"-crosstab -pivot_table -cut" + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Doctests interval classes' ; echo $MSG + pytest --doctest-modules -v \ + pandas/core/indexes/interval.py \ + pandas/core/arrays/interval.py \ + -k"-from_arrays -from_breaks -from_intervals -from_tuples -get_loc -set_closed -to_tuples -interval_range" + RET=$(($RET + $?)) ; echo $MSG "DONE" + +fi + +### DOCSTRINGS ### +if [[ -z "$CHECK" || "$CHECK" == "docstrings" ]]; then + + MSG='Validate docstrings (GL06, GL07, GL09, SS04, PR03, PR05, EX04)' ; echo $MSG + $BASE_DIR/scripts/validate_docstrings.py --format=azure --errors=GL06,GL07,GL09,SS04,PR03,PR05,EX04 + RET=$(($RET + $?)) ; echo $MSG "DONE" + +fi + +### DEPENDENCIES ### +if [[ -z "$CHECK" || "$CHECK" == "dependencies" ]]; then + + MSG='Check that requirements-dev.txt has been generated from environment.yml' ; echo $MSG + $BASE_DIR/scripts/generate_pip_deps_from_conda.py --compare --azure + RET=$(($RET + $?)) ; echo $MSG "DONE" + +fi + +exit $RET diff --git a/ci/circle-27-compat.yaml b/ci/deps/azure-27-compat.yaml similarity index 53% rename from ci/circle-27-compat.yaml rename to ci/deps/azure-27-compat.yaml index b5be569eb28a4..8899e22bdf6cf 100644 --- a/ci/circle-27-compat.yaml +++ b/ci/deps/azure-27-compat.yaml @@ -1,22 +1,20 @@ -name: pandas +name: pandas-dev channels: - defaults - conda-forge dependencies: - - bottleneck=1.0.0 + - bottleneck=1.2.0 - cython=0.28.2 - jinja2=2.8 - - numexpr=2.4.4 # we test that we correctly don't use an unsupported numexpr - - numpy=1.9.2 - - openpyxl - - psycopg2 - - pytables=3.2.2 + - numexpr=2.6.1 + - numpy=1.12.0 + - openpyxl=2.5.5 + - pytables=3.4.2 - python-dateutil=2.5.0 - python=2.7* - pytz=2013b - - scipy=0.14.0 - - sqlalchemy=0.7.8 - - xlrd=0.9.2 + - scipy=0.18.1 + - xlrd=1.0.0 - xlsxwriter=0.5.2 - xlwt=0.7.5 # universal @@ -25,4 +23,4 @@ dependencies: - pip: - html5lib==1.0b2 - beautifulsoup4==4.2.1 - - pymysql==0.6.0 + - hypothesis>=3.58.0 diff --git a/ci/travis-27-locale.yaml b/ci/deps/azure-27-locale.yaml 
similarity index 75% rename from ci/travis-27-locale.yaml rename to ci/deps/azure-27-locale.yaml index 78cbe8f59a8e0..0846ef5e8264e 100644 --- a/ci/travis-27-locale.yaml +++ b/ci/deps/azure-27-locale.yaml @@ -1,13 +1,13 @@ -name: pandas +name: pandas-dev channels: - defaults - conda-forge dependencies: - - bottleneck=1.0.0 + - bottleneck=1.2.0 - cython=0.28.2 - lxml - - matplotlib=1.4.3 - - numpy=1.9.2 + - matplotlib=2.0.0 + - numpy=1.12.0 - openpyxl=2.4.0 - python-dateutil - python-blosc @@ -16,12 +16,13 @@ dependencies: - pytz=2013b - scipy - sqlalchemy=0.8.1 - - xlrd=0.9.2 + - xlrd=1.0.0 - xlsxwriter=0.5.2 - xlwt=0.7.5 # universal - pytest - pytest-xdist + - hypothesis>=3.58.0 - pip: - html5lib==1.0b2 - beautifulsoup4==4.2.1 diff --git a/ci/circle-36-locale.yaml b/ci/deps/azure-36-locale_slow.yaml similarity index 85% rename from ci/circle-36-locale.yaml rename to ci/deps/azure-36-locale_slow.yaml index 091a5a637becd..c7d2334623501 100644 --- a/ci/circle-36-locale.yaml +++ b/ci/deps/azure-36-locale_slow.yaml @@ -1,10 +1,11 @@ -name: pandas +name: pandas-dev channels: - defaults - conda-forge dependencies: - beautifulsoup4 - cython>=0.28.2 + - gcsfs - html5lib - ipython - jinja2 @@ -14,15 +15,12 @@ dependencies: - numexpr - numpy - openpyxl - - psycopg2 - - pymysql - pytables - python-dateutil - python=3.6* - pytz - s3fs - scipy - - sqlalchemy - xarray - xlrd - xlsxwriter @@ -31,3 +29,5 @@ dependencies: - pytest - pytest-xdist - moto + - pip: + - hypothesis>=3.58.0 diff --git a/ci/deps/azure-37-locale.yaml b/ci/deps/azure-37-locale.yaml new file mode 100644 index 0000000000000..b5a05c49b8083 --- /dev/null +++ b/ci/deps/azure-37-locale.yaml @@ -0,0 +1,32 @@ +name: pandas-dev +channels: + - defaults + - conda-forge +dependencies: + - beautifulsoup4 + - cython>=0.28.2 + - html5lib + - ipython + - jinja2 + - lxml + - matplotlib + - nomkl + - numexpr + - numpy + - openpyxl + - pytables + - python-dateutil + - python=3.7* + - pytz + - s3fs + - scipy + - xarray + - xlrd + - xlsxwriter + - xlwt + # universal + - pytest + - pytest-xdist + - pip: + - hypothesis>=3.58.0 + - moto # latest moto in conda-forge fails with 3.7, move to conda dependencies when this is fixed diff --git a/ci/travis-36-numpydev.yaml b/ci/deps/azure-37-numpydev.yaml similarity index 84% rename from ci/travis-36-numpydev.yaml rename to ci/deps/azure-37-numpydev.yaml index 038c6537622dd..99ae228f25de3 100644 --- a/ci/travis-36-numpydev.yaml +++ b/ci/deps/azure-37-numpydev.yaml @@ -1,13 +1,14 @@ -name: pandas +name: pandas-dev channels: - defaults dependencies: - - python=3.6* + - python=3.7* - pytz - Cython>=0.28.2 # universal - pytest - pytest-xdist + - hypothesis>=3.58.0 - pip: - "git+git://github.com/dateutil/dateutil.git" - "-f https://7933911d6844c6c53a7d-47bd50c35cd79bd838daf386af554a83.ssl.cf2.rackcdn.com" diff --git a/ci/travis-35-osx.yaml b/ci/deps/azure-macos-35.yaml similarity index 73% rename from ci/travis-35-osx.yaml rename to ci/deps/azure-macos-35.yaml index fff7acc64d537..58abbabce3d86 100644 --- a/ci/travis-35-osx.yaml +++ b/ci/deps/azure-macos-35.yaml @@ -1,4 +1,4 @@ -name: pandas +name: pandas-dev channels: - defaults dependencies: @@ -8,11 +8,12 @@ dependencies: - html5lib - jinja2 - lxml - - matplotlib + - matplotlib=2.2.0 - nomkl - numexpr - - numpy=1.10.4 - - openpyxl + - numpy=1.12.0 + - openpyxl=2.5.5 + - pyarrow - pytables - python=3.5* - pytz @@ -25,3 +26,4 @@ dependencies: - pytest-xdist - pip: - python-dateutil==2.5.3 + - hypothesis>=3.58.0 diff --git a/ci/appveyor-27.yaml 
b/ci/deps/azure-windows-27.yaml similarity index 85% rename from ci/appveyor-27.yaml rename to ci/deps/azure-windows-27.yaml index 114dcfb0c6440..b1533b071fa74 100644 --- a/ci/appveyor-27.yaml +++ b/ci/deps/azure-windows-27.yaml @@ -1,4 +1,4 @@ -name: pandas +name: pandas-dev channels: - defaults - conda-forge @@ -10,7 +10,7 @@ dependencies: - html5lib - jinja2=2.8 - lxml - - matplotlib + - matplotlib=2.0.1 - numexpr - numpy=1.12* - openpyxl @@ -28,3 +28,4 @@ dependencies: - pytest - pytest-xdist - moto + - hypothesis>=3.58.0 diff --git a/ci/appveyor-36.yaml b/ci/deps/azure-windows-36.yaml similarity index 71% rename from ci/appveyor-36.yaml rename to ci/deps/azure-windows-36.yaml index 63e45d0544ad9..7b132a134c44e 100644 --- a/ci/appveyor-36.yaml +++ b/ci/deps/azure-windows-36.yaml @@ -1,23 +1,23 @@ -name: pandas +name: pandas-dev channels: - defaults - conda-forge dependencies: - blosc - bottleneck - - fastparquet - - feather-format + - boost-cpp<1.67 + - fastparquet>=0.2.1 - matplotlib - numexpr - numpy=1.14* - openpyxl + - parquet-cpp - pyarrow - pytables - python-dateutil - - python=3.6.* + - python=3.6.6 - pytz - scipy - - thrift=0.10* - xlrd - xlsxwriter - xlwt @@ -25,3 +25,4 @@ dependencies: - cython>=0.28.2 - pytest - pytest-xdist + - hypothesis>=3.58.0 diff --git a/ci/travis-27.yaml b/ci/deps/travis-27.yaml similarity index 77% rename from ci/travis-27.yaml rename to ci/deps/travis-27.yaml index 9cb20734dc63d..0f2194e71de31 100644 --- a/ci/travis-27.yaml +++ b/ci/deps/travis-27.yaml @@ -1,4 +1,4 @@ -name: pandas +name: pandas-dev channels: - defaults - conda-forge @@ -6,16 +6,14 @@ dependencies: - beautifulsoup4 - bottleneck - cython=0.28.2 - - fastparquet - - feather-format - - flake8=3.4.1 + - fastparquet>=0.2.1 - gcsfs - html5lib - ipython - jemalloc=4.5.0.post - jinja2=2.8 - lxml - - matplotlib + - matplotlib=2.2.2 - mock - nomkl - numexpr @@ -24,10 +22,11 @@ dependencies: - patsy - psycopg2 - py - - pyarrow=0.4.1 + - pyarrow=0.7.0 - PyCrypto - pymysql=0.6.3 - pytables + - blosc=1.14.3 - python-blosc - python-dateutil=2.5.0 - python=2.7* @@ -35,16 +34,16 @@ dependencies: - s3fs - scipy - sqlalchemy=0.9.6 - - xarray=0.8.0 - - xlrd=0.9.2 + - xarray=0.9.6 + - xlrd=1.0.0 - xlsxwriter=0.5.2 - xlwt=0.7.5 # universal - pytest - pytest-xdist - - moto + - moto==1.3.4 + - hypothesis>=3.58.0 - pip: - backports.lzma - - cpplint - pandas-gbq - pathlib diff --git a/ci/travis-36-doc.yaml b/ci/deps/travis-36-doc.yaml similarity index 83% rename from ci/travis-36-doc.yaml rename to ci/deps/travis-36-doc.yaml index 153a81197a6c7..26f3a17432ab2 100644 --- a/ci/travis-36-doc.yaml +++ b/ci/deps/travis-36-doc.yaml @@ -1,15 +1,15 @@ -name: pandas +name: pandas-dev channels: - defaults - conda-forge - - r dependencies: - beautifulsoup4 - bottleneck - cython>=0.28.2 - - fastparquet - - feather-format + - fastparquet>=0.2.1 + - gitpython - html5lib + - hypothesis>=3.58.0 - ipykernel - ipython - ipywidgets @@ -21,16 +21,16 @@ dependencies: - notebook - numexpr - numpy=1.13* + - numpydoc - openpyxl - pandoc + - pyarrow - pyqt - pytables - python-dateutil - python-snappy - python=3.6* - pytz - - r - - rpy2 - scipy - seaborn - sphinx diff --git a/ci/circle-36-locale_slow.yaml b/ci/deps/travis-36-locale.yaml similarity index 88% rename from ci/circle-36-locale_slow.yaml rename to ci/deps/travis-36-locale.yaml index 649f93f7aa427..2b38465c04512 100644 --- a/ci/circle-36-locale_slow.yaml +++ b/ci/deps/travis-36-locale.yaml @@ -1,11 +1,10 @@ -name: pandas +name: pandas-dev channels: - defaults - conda-forge 
dependencies: - beautifulsoup4 - cython>=0.28.2 - - gcsfs - html5lib - ipython - jinja2 @@ -32,3 +31,5 @@ dependencies: - pytest - pytest-xdist - moto + - pip: + - hypothesis>=3.58.0 diff --git a/ci/travis-36-slow.yaml b/ci/deps/travis-36-slow.yaml similarity index 90% rename from ci/travis-36-slow.yaml rename to ci/deps/travis-36-slow.yaml index f6738e3837186..a6ffdb95e5e7c 100644 --- a/ci/travis-36-slow.yaml +++ b/ci/deps/travis-36-slow.yaml @@ -1,4 +1,4 @@ -name: pandas +name: pandas-dev channels: - defaults - conda-forge @@ -28,3 +28,4 @@ dependencies: - pytest - pytest-xdist - moto + - hypothesis>=3.58.0 diff --git a/ci/travis-36.yaml b/ci/deps/travis-36.yaml similarity index 79% rename from ci/travis-36.yaml rename to ci/deps/travis-36.yaml index 7eceba76cab96..74db888d588f4 100644 --- a/ci/travis-36.yaml +++ b/ci/deps/travis-36.yaml @@ -1,35 +1,31 @@ -name: pandas +name: pandas-dev channels: - defaults - conda-forge dependencies: - beautifulsoup4 + - botocore>=1.11 - cython>=0.28.2 - dask - - fastparquet - - feather-format + - fastparquet>=0.2.1 - gcsfs - geopandas - html5lib - - ipython - - jinja2 - - lxml - matplotlib - nomkl - numexpr - numpy - openpyxl - psycopg2 - - pyarrow + - pyarrow=0.9.0 - pymysql - pytables - python-snappy - - python=3.6* + - python=3.6.6 - pytz - s3fs - scikit-learn - scipy - - seaborn - sqlalchemy - statsmodels - xarray @@ -40,9 +36,10 @@ dependencies: - pytest - pytest-xdist - pytest-cov - - moto + - hypothesis>=3.58.0 - pip: - brotlipy - coverage + - moto - pandas-datareader - python-dateutil diff --git a/ci/travis-37.yaml b/ci/deps/travis-37.yaml similarity index 63% rename from ci/travis-37.yaml rename to ci/deps/travis-37.yaml index 1dc2930bf7287..c503124d8cd26 100644 --- a/ci/travis-37.yaml +++ b/ci/deps/travis-37.yaml @@ -1,14 +1,20 @@ -name: pandas +name: pandas-dev channels: - defaults - conda-forge - c3i_test dependencies: - python=3.7 + - botocore>=1.11 - cython>=0.28.2 - numpy - python-dateutil - nomkl + - pyarrow - pytz - pytest - pytest-xdist + - hypothesis>=3.58.0 + - s3fs + - pip: + - moto diff --git a/ci/environment-dev.yaml b/ci/environment-dev.yaml deleted file mode 100644 index 797506547b773..0000000000000 --- a/ci/environment-dev.yaml +++ /dev/null @@ -1,16 +0,0 @@ -name: pandas-dev -channels: - - defaults - - conda-forge -dependencies: - - Cython>=0.28.2 - - NumPy - - flake8 - - moto - - pytest>=3.1 - - python-dateutil>=2.5.0 - - python=3 - - pytz - - setuptools>=24.2.0 - - sphinx - - sphinxcontrib-spelling diff --git a/ci/incremental/build.cmd b/ci/incremental/build.cmd new file mode 100644 index 0000000000000..2cce38c03f406 --- /dev/null +++ b/ci/incremental/build.cmd @@ -0,0 +1,9 @@ +@rem https://github.com/numba/numba/blob/master/buildscripts/incremental/build.cmd + +@rem Build numba extensions without silencing compile errors +python setup.py build_ext -q --inplace + +@rem Install pandas locally +python -m pip install -e . + +if %errorlevel% neq 0 exit /b %errorlevel% diff --git a/ci/incremental/build.sh b/ci/incremental/build.sh new file mode 100755 index 0000000000000..05648037935a3 --- /dev/null +++ b/ci/incremental/build.sh @@ -0,0 +1,16 @@ +#!/bin/bash + +# Make sure any error below is reported as such +set -v -e + +echo "[building extensions]" +python setup.py build_ext -q --inplace +python -m pip install -e . 
+ +echo +echo "[show environment]" +conda list + +echo +echo "[done]" +exit 0 diff --git a/ci/incremental/install_miniconda.sh b/ci/incremental/install_miniconda.sh new file mode 100755 index 0000000000000..a47dfdb324b34 --- /dev/null +++ b/ci/incremental/install_miniconda.sh @@ -0,0 +1,19 @@ +#!/bin/bash + +set -v -e + +# Install Miniconda +unamestr=`uname` +if [[ "$unamestr" == 'Linux' ]]; then + if [[ "$BITS32" == "yes" ]]; then + wget -q https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86.sh -O miniconda.sh + else + wget -q https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh + fi +elif [[ "$unamestr" == 'Darwin' ]]; then + wget -q https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -O miniconda.sh +else + echo Error +fi +chmod +x miniconda.sh +./miniconda.sh -b diff --git a/ci/incremental/setup_conda_environment.cmd b/ci/incremental/setup_conda_environment.cmd new file mode 100644 index 0000000000000..c104d78591384 --- /dev/null +++ b/ci/incremental/setup_conda_environment.cmd @@ -0,0 +1,21 @@ +@rem https://github.com/numba/numba/blob/master/buildscripts/incremental/setup_conda_environment.cmd +@rem The cmd /C hack circumvents a regression where conda installs a conda.bat +@rem script in non-root environments. +set CONDA_INSTALL=cmd /C conda install -q -y +set PIP_INSTALL=pip install -q + +@echo on + +@rem Deactivate any environment +call deactivate +@rem Display root environment (for debugging) +conda list +@rem Clean up any left-over from a previous build +conda remove --all -q -y -n pandas-dev +@rem Scipy, CFFI, jinja2 and IPython are optional dependencies, but exercised in the test suite +conda env create --file=ci\deps\azure-windows-%CONDA_PY%.yaml + +call activate pandas-dev +conda list + +if %errorlevel% neq 0 exit /b %errorlevel% diff --git a/ci/incremental/setup_conda_environment.sh b/ci/incremental/setup_conda_environment.sh new file mode 100755 index 0000000000000..f174c17a614d8 --- /dev/null +++ b/ci/incremental/setup_conda_environment.sh @@ -0,0 +1,52 @@ +#!/bin/bash + +set -v -e + +CONDA_INSTALL="conda install -q -y" +PIP_INSTALL="pip install -q" + + +# Deactivate any environment +source deactivate +# Display root environment (for debugging) +conda list +# Clean up any left-over from a previous build +# (note workaround for https://github.com/conda/conda/issues/2679: +# `conda env remove` issue) +conda remove --all -q -y -n pandas-dev + +echo +echo "[create env]" +time conda env create -q --file="${ENV_FILE}" || exit 1 + +set +v +source activate pandas-dev +set -v + +# remove any installed pandas package +# w/o removing anything else +echo +echo "[removing installed pandas]" +conda remove pandas -y --force || true +pip uninstall -y pandas || true + +echo +echo "[no installed pandas]" +conda list pandas + +if [ -n "$LOCALE_OVERRIDE" ]; then + sudo locale-gen "$LOCALE_OVERRIDE" +fi + +# # Install the compiler toolchain +# if [[ $(uname) == Linux ]]; then +# if [[ "$CONDA_SUBDIR" == "linux-32" || "$BITS32" == "yes" ]] ; then +# $CONDA_INSTALL gcc_linux-32 gxx_linux-32 +# else +# $CONDA_INSTALL gcc_linux-64 gxx_linux-64 +# fi +# elif [[ $(uname) == Darwin ]]; then +# $CONDA_INSTALL clang_osx-64 clangxx_osx-64 +# # Install llvm-openmp and intel-openmp on OSX too +# $CONDA_INSTALL llvm-openmp intel-openmp +# fi diff --git a/ci/install.ps1 b/ci/install.ps1 deleted file mode 100644 index 64ec7f81884cd..0000000000000 --- a/ci/install.ps1 +++ /dev/null @@ -1,92 +0,0 @@ -# Sample script to install Miniconda under 
Windows -# Authors: Olivier Grisel, Jonathan Helmus and Kyle Kastner, Robert McGibbon -# License: CC0 1.0 Universal: http://creativecommons.org/publicdomain/zero/1.0/ - -$MINICONDA_URL = "http://repo.continuum.io/miniconda/" - - -function DownloadMiniconda ($python_version, $platform_suffix) { - $webclient = New-Object System.Net.WebClient - $filename = "Miniconda3-latest-Windows-" + $platform_suffix + ".exe" - $url = $MINICONDA_URL + $filename - - $basedir = $pwd.Path + "\" - $filepath = $basedir + $filename - if (Test-Path $filename) { - Write-Host "Reusing" $filepath - return $filepath - } - - # Download and retry up to 3 times in case of network transient errors. - Write-Host "Downloading" $filename "from" $url - $retry_attempts = 2 - for($i=0; $i -lt $retry_attempts; $i++){ - try { - $webclient.DownloadFile($url, $filepath) - break - } - Catch [Exception]{ - Start-Sleep 1 - } - } - if (Test-Path $filepath) { - Write-Host "File saved at" $filepath - } else { - # Retry once to get the error message if any at the last try - $webclient.DownloadFile($url, $filepath) - } - return $filepath -} - - -function InstallMiniconda ($python_version, $architecture, $python_home) { - Write-Host "Installing Python" $python_version "for" $architecture "bit architecture to" $python_home - if (Test-Path $python_home) { - Write-Host $python_home "already exists, skipping." - return $false - } - if ($architecture -match "32") { - $platform_suffix = "x86" - } else { - $platform_suffix = "x86_64" - } - - $filepath = DownloadMiniconda $python_version $platform_suffix - Write-Host "Installing" $filepath "to" $python_home - $install_log = $python_home + ".log" - $args = "/S /D=$python_home" - Write-Host $filepath $args - Start-Process -FilePath $filepath -ArgumentList $args -Wait -Passthru - if (Test-Path $python_home) { - Write-Host "Python $python_version ($architecture) installation complete" - } else { - Write-Host "Failed to install Python in $python_home" - Get-Content -Path $install_log - Exit 1 - } -} - - -function InstallCondaPackages ($python_home, $spec) { - $conda_path = $python_home + "\Scripts\conda.exe" - $args = "install --yes " + $spec - Write-Host ("conda " + $args) - Start-Process -FilePath "$conda_path" -ArgumentList $args -Wait -Passthru -} - -function UpdateConda ($python_home) { - $conda_path = $python_home + "\Scripts\conda.exe" - Write-Host "Updating conda..." 
- $args = "update --yes conda" - Write-Host $conda_path $args - Start-Process -FilePath "$conda_path" -ArgumentList $args -Wait -Passthru -} - - -function main () { - InstallMiniconda "3.5" $env:PYTHON_ARCH $env:CONDA_ROOT - UpdateConda $env:CONDA_ROOT - InstallCondaPackages $env:CONDA_ROOT "conda-build jinja2 anaconda-client" -} - -main diff --git a/ci/install_circle.sh b/ci/install_circle.sh deleted file mode 100755 index 5ffff84c88488..0000000000000 --- a/ci/install_circle.sh +++ /dev/null @@ -1,80 +0,0 @@ -#!/usr/bin/env bash - -home_dir=$(pwd) -echo "[home_dir: $home_dir]" - -echo "[ls -ltr]" -ls -ltr - -echo "[Using clean Miniconda install]" -rm -rf "$MINICONDA_DIR" - -# install miniconda -wget http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -q -O miniconda.sh || exit 1 -bash miniconda.sh -b -p "$MINICONDA_DIR" || exit 1 - -export PATH="$MINICONDA_DIR/bin:$PATH" - -echo "[update conda]" -conda config --set ssl_verify false || exit 1 -conda config --set always_yes true --set changeps1 false || exit 1 -conda update -q conda - -# add the pandas channel to take priority -# to add extra packages -echo "[add channels]" -conda config --add channels pandas || exit 1 -conda config --remove channels defaults || exit 1 -conda config --add channels defaults || exit 1 - -# Useful for debugging any issues with conda -conda info -a || exit 1 - -# support env variables passed -export ENVS_FILE=".envs" - -# make sure that the .envs file exists. it is ok if it is empty -touch $ENVS_FILE - -# assume all command line arguments are environmental variables -for var in "$@" -do - echo "export $var" >> $ENVS_FILE -done - -echo "[environmental variable file]" -cat $ENVS_FILE -source $ENVS_FILE - -# edit the locale override if needed -if [ -n "$LOCALE_OVERRIDE" ]; then - echo "[Adding locale to the first line of pandas/__init__.py]" - rm -f pandas/__init__.pyc - sedc="3iimport locale\nlocale.setlocale(locale.LC_ALL, '$LOCALE_OVERRIDE')\n" - sed -i "$sedc" pandas/__init__.py - echo "[head -4 pandas/__init__.py]" - head -4 pandas/__init__.py - echo -fi - -# create envbuild deps -echo "[create env]" -time conda env create -q -n pandas --file="${ENV_FILE}" || exit 1 - -source activate pandas - -# remove any installed pandas package -# w/o removing anything else -echo -echo "[removing installed pandas]" -conda remove pandas -y --force -pip uninstall -y pandas - -# build but don't install -echo "[build em]" -time python setup.py build_ext --inplace || exit 1 - -echo -echo "[show environment]" - -conda list diff --git a/ci/install_db_circle.sh b/ci/install_db_circle.sh deleted file mode 100755 index a00f74f009f54..0000000000000 --- a/ci/install_db_circle.sh +++ /dev/null @@ -1,8 +0,0 @@ -#!/bin/bash - -echo "installing dbs" -mysql -e 'create database pandas_nosetest;' -psql -c 'create database pandas_nosetest;' -U postgres - -echo "done" -exit 0 diff --git a/ci/install_travis.sh b/ci/install_travis.sh index fd4a36f86db6c..d1a940f119228 100755 --- a/ci/install_travis.sh +++ b/ci/install_travis.sh @@ -80,9 +80,9 @@ echo echo "[create env]" # create our environment -time conda env create -q -n pandas --file="${ENV_FILE}" || exit 1 +time conda env create -q --file="${ENV_FILE}" || exit 1 -source activate pandas +source activate pandas-dev # remove any installed pandas package # w/o removing anything else diff --git a/ci/lint.sh b/ci/lint.sh deleted file mode 100755 index 9bcee55e1344c..0000000000000 --- a/ci/lint.sh +++ /dev/null @@ -1,189 +0,0 @@ -#!/bin/bash - -echo "inside $0" - -source activate 
pandas - -RET=0 - -if [ "$LINT" ]; then - - # pandas/_libs/src is C code, so no need to search there. - echo "Linting *.py" - flake8 pandas --filename=*.py --exclude pandas/_libs/src - if [ $? -ne "0" ]; then - RET=1 - fi - echo "Linting *.py DONE" - - echo "Linting setup.py" - flake8 setup.py - if [ $? -ne "0" ]; then - RET=1 - fi - echo "Linting setup.py DONE" - - echo "Linting asv_bench/benchmarks/" - flake8 asv_bench/benchmarks/ --exclude=asv_bench/benchmarks/*.py --ignore=F811 - if [ $? -ne "0" ]; then - RET=1 - fi - echo "Linting asv_bench/benchmarks/*.py DONE" - - echo "Linting scripts/*.py" - flake8 scripts --filename=*.py - if [ $? -ne "0" ]; then - RET=1 - fi - echo "Linting scripts/*.py DONE" - - echo "Linting doc scripts" - flake8 doc/make.py doc/source/conf.py - if [ $? -ne "0" ]; then - RET=1 - fi - echo "Linting doc scripts DONE" - - echo "Linting *.pyx" - flake8 pandas --filename=*.pyx --select=E501,E302,E203,E111,E114,E221,E303,E128,E231,E126,E265,E305,E301,E127,E261,E271,E129,W291,E222,E241,E123,F403 - if [ $? -ne "0" ]; then - RET=1 - fi - echo "Linting *.pyx DONE" - - echo "Linting *.pxi.in" - for path in 'src' - do - echo "linting -> pandas/$path" - flake8 pandas/$path --filename=*.pxi.in --select=E501,E302,E203,E111,E114,E221,E303,E231,E126,F403 - if [ $? -ne "0" ]; then - RET=1 - fi - done - echo "Linting *.pxi.in DONE" - - echo "Linting *.pxd" - for path in '_libs' - do - echo "linting -> pandas/$path" - flake8 pandas/$path --filename=*.pxd --select=E501,E302,E203,E111,E114,E221,E303,E231,E126,F403 - if [ $? -ne "0" ]; then - RET=1 - fi - done - echo "Linting *.pxd DONE" - - # readability/casting: Warnings about C casting instead of C++ casting - # runtime/int: Warnings about using C number types instead of C++ ones - # build/include_subdir: Warnings about prefacing included header files with directory - - # We don't lint all C files because we don't want to lint any that are built - # from Cython files nor do we want to lint C files that we didn't modify for - # this particular codebase (e.g. src/headers, src/klib, src/msgpack). However, - # we can lint all header files since they aren't "generated" like C files are. - echo "Linting *.c and *.h" - for path in '*.h' 'period_helper.c' 'datetime' 'parser' 'ujson' - do - echo "linting -> pandas/_libs/src/$path" - cpplint --quiet --extensions=c,h --headers=h --filter=-readability/casting,-runtime/int,-build/include_subdir --recursive pandas/_libs/src/$path - if [ $? -ne "0" ]; then - RET=1 - fi - done - echo "Linting *.c and *.h DONE" - - echo "Check for invalid testing" - - # Check for the following code in testing: - # - # np.testing - # np.array_equal - grep -r -E --include '*.py' --exclude testing.py '(numpy|np)(\.testing|\.array_equal)' pandas/tests/ - - if [ $? = "0" ]; then - RET=1 - fi - - # Check for pytest.warns - grep -r -E --include '*.py' 'pytest\.warns' pandas/tests/ - - if [ $? = "0" ]; then - RET=1 - fi - - # Check for the following code in the extension array base tests - # tm.assert_frame_equal - # tm.assert_series_equal - grep -r -E --include '*.py' --exclude base.py 'tm.assert_(series|frame)_equal' pandas/tests/extension/base - - if [ $? = "0" ]; then - RET=1 - fi - - echo "Check for invalid testing DONE" - - # Check for imports from pandas.core.common instead - # of `import pandas.core.common as com` - echo "Check for non-standard imports" - grep -R --include="*.py*" -E "from pandas.core.common import " pandas - if [ $? 
= "0" ]; then - RET=1 - fi - echo "Check for non-standard imports DONE" - - echo "Check for use of lists instead of generators in built-in Python functions" - - # Example: Avoid `any([i for i in some_iterator])` in favor of `any(i for i in some_iterator)` - # - # Check the following functions: - # any(), all(), sum(), max(), min(), list(), dict(), set(), frozenset(), tuple(), str.join() - grep -R --include="*.py*" -E "[^_](any|all|sum|max|min|list|dict|set|frozenset|tuple|join)\(\[.* for .* in .*\]\)" pandas - - if [ $? = "0" ]; then - RET=1 - fi - echo "Check for use of lists instead of generators in built-in Python functions DONE" - - echo "Check for incorrect sphinx directives" - SPHINX_DIRECTIVES=$(echo \ - "autosummary|contents|currentmodule|deprecated|function|image|"\ - "important|include|ipython|literalinclude|math|module|note|raw|"\ - "seealso|toctree|versionadded|versionchanged|warning" | tr -d "[:space:]") - for path in './pandas' './doc/source' - do - grep -R --include="*.py" --include="*.pyx" --include="*.rst" -E "\.\. ($SPHINX_DIRECTIVES):[^:]" $path - if [ $? = "0" ]; then - RET=1 - fi - done - echo "Check for incorrect sphinx directives DONE" - - echo "Check for deprecated messages without sphinx directive" - grep -R --include="*.py" --include="*.pyx" -E "(DEPRECATED|DEPRECATE|Deprecated)(:|,|\.)" pandas - - if [ $? = "0" ]; then - RET=1 - fi - echo "Check for deprecated messages without sphinx directive DONE" - - echo "Check for old-style classes" - grep -R --include="*.py" -E "class\s\S*[^)]:" pandas scripts - - if [ $? = "0" ]; then - RET=1 - fi - echo "Check for old-style classes DONE" - - echo "Check for backticks incorrectly rendering because of missing spaces" - grep -R --include="*.rst" -E "[a-zA-Z0-9]\`\`?[a-zA-Z0-9]" doc/source/ - - if [ $? 
= "0" ]; then - RET=1 - fi - echo "Check for backticks incorrectly rendering because of missing spaces DONE" - -else - echo "NOT Linting" -fi - -exit $RET diff --git a/ci/print_skipped.py b/ci/print_skipped.py index dd2180f6eeb19..67bc7b556cd43 100755 --- a/ci/print_skipped.py +++ b/ci/print_skipped.py @@ -10,7 +10,7 @@ def parse_results(filename): root = tree.getroot() skipped = [] - current_class = old_class = '' + current_class = '' i = 1 assert i - 1 == len(skipped) for el in root.findall('testcase'): @@ -24,7 +24,9 @@ def parse_results(filename): out = '' if old_class != current_class: ndigits = int(math.log(i, 10) + 1) - out += ('-' * (len(name + msg) + 4 + ndigits) + '\n') # 4 for : + space + # + space + + # 4 for : + space + # + space + out += ('-' * (len(name + msg) + 4 + ndigits) + '\n') out += '#{i} {name}: {msg}'.format(i=i, name=name, msg=msg) skipped.append(out) i += 1 diff --git a/ci/print_versions.py b/ci/print_versions.py deleted file mode 100755 index 8be795174d76d..0000000000000 --- a/ci/print_versions.py +++ /dev/null @@ -1,28 +0,0 @@ -#!/usr/bin/env python - - -def show_versions(as_json=False): - import imp - import os - fn = __file__ - this_dir = os.path.dirname(fn) - pandas_dir = os.path.abspath(os.path.join(this_dir, "..")) - sv_path = os.path.join(pandas_dir, 'pandas', 'util') - mod = imp.load_module( - 'pvmod', *imp.find_module('print_versions', [sv_path])) - return mod.show_versions(as_json) - - -if __name__ == '__main__': - # optparse is 2.6-safe - from optparse import OptionParser - parser = OptionParser() - parser.add_option("-j", "--json", metavar="FILE", nargs=1, - help="Save output as JSON into file, pass in '-' to output to stdout") - - (options, args) = parser.parse_args() - - if options.json == "-": - options.json = True - - show_versions(as_json=options.json) diff --git a/ci/requirements-optional-conda.txt b/ci/requirements-optional-conda.txt deleted file mode 100644 index 18aac30f04aea..0000000000000 --- a/ci/requirements-optional-conda.txt +++ /dev/null @@ -1,29 +0,0 @@ -beautifulsoup4>=4.2.1 -blosc -bottleneck -fastparquet -feather-format -gcsfs -html5lib -ipython>=5.6.0 -ipykernel -jinja2 -lxml -matplotlib -nbsphinx -numexpr -openpyxl -pyarrow -pymysql -pytables -pytest-cov -pytest-xdist -s3fs -scipy -seaborn -sqlalchemy -statsmodels -xarray -xlrd -xlsxwriter -xlwt diff --git a/ci/requirements-optional-pip.txt b/ci/requirements-optional-pip.txt deleted file mode 100644 index 28dafc43b09c0..0000000000000 --- a/ci/requirements-optional-pip.txt +++ /dev/null @@ -1,31 +0,0 @@ -# This file was autogenerated by scripts/convert_deps.py -# Do not modify directly -beautifulsoup4>=4.2.1 -blosc -bottleneck -fastparquet -feather-format -gcsfs -html5lib -ipython>=5.6.0 -ipykernel -jinja2 -lxml -matplotlib -nbsphinx -numexpr -openpyxl -pyarrow -pymysql -tables -pytest-cov -pytest-xdist -s3fs -scipy -seaborn -sqlalchemy -statsmodels -xarray -xlrd -xlsxwriter -xlwt \ No newline at end of file diff --git a/ci/requirements_dev.txt b/ci/requirements_dev.txt deleted file mode 100644 index 83ee30b52071d..0000000000000 --- a/ci/requirements_dev.txt +++ /dev/null @@ -1,12 +0,0 @@ -# This file was autogenerated by scripts/convert_deps.py -# Do not modify directly -Cython -NumPy -flake8 -moto -pytest>=3.1 -python-dateutil>=2.5.0 -pytz -setuptools>=24.2.0 -sphinx -sphinxcontrib-spelling \ No newline at end of file diff --git a/ci/run_build_docs.sh b/ci/run_build_docs.sh deleted file mode 100755 index 2909b9619552e..0000000000000 --- a/ci/run_build_docs.sh +++ /dev/null @@ 
-1,10 +0,0 @@ -#!/bin/bash - -echo "inside $0" - -"$TRAVIS_BUILD_DIR"/ci/build_docs.sh 2>&1 - -# wait until subprocesses finish (build_docs.sh) -wait - -exit 0 diff --git a/ci/run_circle.sh b/ci/run_circle.sh deleted file mode 100755 index 435985bd42148..0000000000000 --- a/ci/run_circle.sh +++ /dev/null @@ -1,9 +0,0 @@ -#!/usr/bin/env bash - -echo "[running tests]" -export PATH="$MINICONDA_DIR/bin:$PATH" - -source activate pandas - -echo "pytest --strict --junitxml=$CIRCLE_TEST_REPORTS/reports/junit.xml $@ pandas" -pytest --strict --junitxml=$CIRCLE_TEST_REPORTS/reports/junit.xml $@ pandas diff --git a/ci/run_tests.sh b/ci/run_tests.sh new file mode 100755 index 0000000000000..ee46da9f52eab --- /dev/null +++ b/ci/run_tests.sh @@ -0,0 +1,58 @@ +#!/bin/bash + +set -e + +if [ "$DOC" ]; then + echo "We are not running pytest as this is a doc-build" + exit 0 +fi + +# Workaround for pytest-xdist flaky collection order +# https://github.com/pytest-dev/pytest/issues/920 +# https://github.com/pytest-dev/pytest/issues/1075 +export PYTHONHASHSEED=$(python -c 'import random; print(random.randint(1, 4294967295))') + +if [ -n "$LOCALE_OVERRIDE" ]; then + export LC_ALL="$LOCALE_OVERRIDE" + export LANG="$LOCALE_OVERRIDE" + PANDAS_LOCALE=`python -c 'import pandas; pandas.get_option("display.encoding")'` + if [[ "$LOCALE_OVERIDE" != "$PANDAS_LOCALE" ]]; then + echo "pandas could not detect the locale. System locale: $LOCALE_OVERRIDE, pandas detected: $PANDAS_LOCALE" + # TODO Not really aborting the tests until https://github.com/pandas-dev/pandas/issues/23923 is fixed + # exit 1 + fi +fi +if [[ "not network" == *"$PATTERN"* ]]; then + export http_proxy=http://1.2.3.4 https_proxy=http://1.2.3.4; +fi + + +if [ -n "$PATTERN" ]; then + PATTERN=" and $PATTERN" +fi + +for TYPE in single multiple +do + if [ "$COVERAGE" ]; then + COVERAGE_FNAME="/tmp/coc-$TYPE.xml" + COVERAGE="-s --cov=pandas --cov-report=xml:$COVERAGE_FNAME" + fi + + TYPE_PATTERN=$TYPE + NUM_JOBS=1 + if [[ "$TYPE_PATTERN" == "multiple" ]]; then + TYPE_PATTERN="not single" + NUM_JOBS=2 + fi + + PYTEST_CMD="pytest -m \"$TYPE_PATTERN$PATTERN\" -n $NUM_JOBS -s --strict --durations=10 --junitxml=test-data-$TYPE.xml $TEST_ARGS $COVERAGE pandas" + echo $PYTEST_CMD + # if no tests are found (the case of "single and slow"), pytest exits with code 5, and would make the script fail, if not for the below code + sh -c "$PYTEST_CMD; ret=\$?; [ \$ret = 5 ] && exit 0 || exit \$ret" + + if [[ "$COVERAGE" && $? 
== 0 ]]; then + echo "uploading coverage for $TYPE tests" + echo "bash <(curl -s https://codecov.io/bash) -Z -c -F $TYPE -f $COVERAGE_FNAME" + bash <(curl -s https://codecov.io/bash) -Z -c -F $TYPE -f $COVERAGE_FNAME + fi +done diff --git a/ci/script_multi.sh b/ci/script_multi.sh deleted file mode 100755 index 2b2d4d5488b91..0000000000000 --- a/ci/script_multi.sh +++ /dev/null @@ -1,46 +0,0 @@ -#!/bin/bash -e - -echo "[script multi]" - -source activate pandas - -if [ -n "$LOCALE_OVERRIDE" ]; then - export LC_ALL="$LOCALE_OVERRIDE"; - echo "Setting LC_ALL to $LOCALE_OVERRIDE" - - pycmd='import pandas; print("pandas detected console encoding: %s" % pandas.get_option("display.encoding"))' - python -c "$pycmd" -fi - -# Enforce absent network during testing by faking a proxy -if echo "$TEST_ARGS" | grep -e --skip-network -q; then - export http_proxy=http://1.2.3.4 https_proxy=http://1.2.3.4; -fi - -# Workaround for pytest-xdist flaky collection order -# https://github.com/pytest-dev/pytest/issues/920 -# https://github.com/pytest-dev/pytest/issues/1075 -export PYTHONHASHSEED=$(python -c 'import random; print(random.randint(1, 4294967295))') -echo PYTHONHASHSEED=$PYTHONHASHSEED - -if [ "$DOC" ]; then - echo "We are not running pytest as this is a doc-build" - -elif [ "$COVERAGE" ]; then - echo pytest -s -n 2 -m "not single" --cov=pandas --cov-report xml:/tmp/cov-multiple.xml --junitxml=/tmp/multiple.xml --strict $TEST_ARGS pandas - pytest -s -n 2 -m "not single" --cov=pandas --cov-report xml:/tmp/cov-multiple.xml --junitxml=/tmp/multiple.xml --strict $TEST_ARGS pandas - -elif [ "$SLOW" ]; then - TEST_ARGS="--only-slow --skip-network" - echo pytest -r xX -m "not single and slow" -v --junitxml=/tmp/multiple.xml --strict $TEST_ARGS pandas - pytest -r xX -m "not single and slow" -v --junitxml=/tmp/multiple.xml --strict $TEST_ARGS pandas - -else - echo pytest -n 2 -r xX -m "not single" --junitxml=/tmp/multiple.xml --strict $TEST_ARGS pandas - pytest -n 2 -r xX -m "not single" --junitxml=/tmp/multiple.xml --strict $TEST_ARGS pandas # TODO: doctest - -fi - -RET="$?" - -exit "$RET" diff --git a/ci/script_single.sh b/ci/script_single.sh deleted file mode 100755 index 60e2fbb33ee5d..0000000000000 --- a/ci/script_single.sh +++ /dev/null @@ -1,39 +0,0 @@ -#!/bin/bash - -echo "[script_single]" - -source activate pandas - -if [ -n "$LOCALE_OVERRIDE" ]; then - export LC_ALL="$LOCALE_OVERRIDE"; - echo "Setting LC_ALL to $LOCALE_OVERRIDE" - - pycmd='import pandas; print("pandas detected console encoding: %s" % pandas.get_option("display.encoding"))' - python -c "$pycmd" -fi - -if [ "$SLOW" ]; then - TEST_ARGS="--only-slow --skip-network" -fi - -# Enforce absent network during testing by faking a proxy -if echo "$TEST_ARGS" | grep -e --skip-network -q; then - export http_proxy=http://1.2.3.4 https_proxy=http://1.2.3.4; -fi - -if [ "$DOC" ]; then - echo "We are not running pytest as this is a doc-build" - -elif [ "$COVERAGE" ]; then - echo pytest -s -m "single" -r xXs --strict --cov=pandas --cov-report xml:/tmp/cov-single.xml --junitxml=/tmp/single.xml $TEST_ARGS pandas - pytest -s -m "single" -r xXs --strict --cov=pandas --cov-report xml:/tmp/cov-single.xml --junitxml=/tmp/single.xml $TEST_ARGS pandas - -else - echo pytest -m "single" -r xXs --junitxml=/tmp/single.xml --strict $TEST_ARGS pandas - pytest -m "single" -r xXs --junitxml=/tmp/single.xml --strict $TEST_ARGS pandas # TODO: doctest - -fi - -RET="$?" 
- -exit "$RET" diff --git a/ci/show_circle.sh b/ci/show_circle.sh deleted file mode 100755 index bfaa65c1d84f2..0000000000000 --- a/ci/show_circle.sh +++ /dev/null @@ -1,8 +0,0 @@ -#!/usr/bin/env bash - -echo "[installed versions]" - -export PATH="$MINICONDA_DIR/bin:$PATH" -source activate pandas - -python -c "import pandas; pandas.show_versions();" diff --git a/ci/upload_coverage.sh b/ci/upload_coverage.sh deleted file mode 100755 index a7ef2fa908079..0000000000000 --- a/ci/upload_coverage.sh +++ /dev/null @@ -1,12 +0,0 @@ -#!/bin/bash - -if [ -z "$COVERAGE" ]; then - echo "coverage is not selected for this build" - exit 0 -fi - -source activate pandas - -echo "uploading coverage" -bash <(curl -s https://codecov.io/bash) -Z -c -F single -f /tmp/cov-single.xml -bash <(curl -s https://codecov.io/bash) -Z -c -F multiple -f /tmp/cov-multiple.xml diff --git a/circle.yml b/circle.yml deleted file mode 100644 index 66415defba6fe..0000000000000 --- a/circle.yml +++ /dev/null @@ -1,38 +0,0 @@ -machine: - environment: - # these are globally set - MINICONDA_DIR: /home/ubuntu/miniconda3 - - -database: - override: - - ./ci/install_db_circle.sh - - -checkout: - post: - # since circleci does a shallow fetch - # we need to populate our tags - - git fetch --depth=1000 - - -dependencies: - override: - - > - case $CIRCLE_NODE_INDEX in - 0) - sudo apt-get install language-pack-it && ./ci/install_circle.sh JOB="2.7_COMPAT" ENV_FILE="ci/circle-27-compat.yaml" LOCALE_OVERRIDE="it_IT.UTF-8" ;; - 1) - sudo apt-get install language-pack-zh-hans && ./ci/install_circle.sh JOB="3.6_LOCALE" ENV_FILE="ci/circle-36-locale.yaml" LOCALE_OVERRIDE="zh_CN.UTF-8" ;; - 2) - sudo apt-get install language-pack-zh-hans && ./ci/install_circle.sh JOB="3.6_LOCALE_SLOW" ENV_FILE="ci/circle-36-locale_slow.yaml" LOCALE_OVERRIDE="zh_CN.UTF-8" ;; - 3) - ./ci/install_circle.sh JOB="3.5_ASCII" ENV_FILE="ci/circle-35-ascii.yaml" LOCALE_OVERRIDE="C" ;; - esac - - ./ci/show_circle.sh - - -test: - override: - - case $CIRCLE_NODE_INDEX in 0) ./ci/run_circle.sh --skip-slow --skip-network ;; 1) ./ci/run_circle.sh --only-slow --skip-network ;; 2) ./ci/run_circle.sh --skip-slow --skip-network ;; 3) ./ci/run_circle.sh --skip-slow --skip-network ;; esac: - parallel: true diff --git a/conda.recipe/meta.yaml b/conda.recipe/meta.yaml index 2bc42c1bd2dec..f92090fecccf3 100644 --- a/conda.recipe/meta.yaml +++ b/conda.recipe/meta.yaml @@ -29,8 +29,11 @@ requirements: - pytz test: - imports: - - pandas + requires: + - pytest + commands: + - python -c "import pandas; pandas.test()" + about: home: http://pandas.pydata.org diff --git a/doc/README.rst b/doc/README.rst index 12950d323f5d3..5423e7419d03b 100644 --- a/doc/README.rst +++ b/doc/README.rst @@ -1,173 +1 @@ -.. _contributing.docs: - -Contributing to the documentation -================================= - -Whether you are someone who loves writing, teaching, or development, -contributing to the documentation is a huge value. If you don't see yourself -as a developer type, please don't stress and know that we want you to -contribute. You don't even have to be an expert on *pandas* to do so! -Something as simple as rewriting small passages for clarity -as you reference the docs is a simple but effective way to contribute. The -next person to read that passage will be in your debt! - -Actually, there are sections of the docs that are worse off by being written -by experts. 
If something in the docs doesn't make sense to you, updating the -relevant section after you figure it out is a simple way to ensure it will -help the next person. - -.. contents:: Table of contents: - :local: - - -About the pandas documentation ------------------------------- - -The documentation is written in **reStructuredText**, which is almost like writing -in plain English, and built using `Sphinx `__. The -Sphinx Documentation has an excellent `introduction to reST -`__. Review the Sphinx docs to perform more -complex changes to the documentation as well. - -Some other important things to know about the docs: - -- The pandas documentation consists of two parts: the docstrings in the code - itself and the docs in this folder ``pandas/doc/``. - - The docstrings provide a clear explanation of the usage of the individual - functions, while the documentation in this folder consists of tutorial-like - overviews per topic together with some other information (what's new, - installation, etc). - -- The docstrings follow the **Numpy Docstring Standard** which is used widely - in the Scientific Python community. This standard specifies the format of - the different sections of the docstring. See `this document - `_ - for a detailed explanation, or look at some of the existing functions to - extend it in a similar manner. - -- The tutorials make heavy use of the `ipython directive - `_ sphinx extension. - This directive lets you put code in the documentation which will be run - during the doc build. For example: - - :: - - .. ipython:: python - - x = 2 - x**3 - - will be rendered as - - :: - - In [1]: x = 2 - - In [2]: x**3 - Out[2]: 8 - - This means that almost all code examples in the docs are always run (and the - output saved) during the doc build. This way, they will always be up to date, - but it makes the doc building a bit more complex. - - -How to build the pandas documentation -------------------------------------- - -Requirements -^^^^^^^^^^^^ - -To build the pandas docs there are some extra requirements: you will need to -have ``sphinx`` and ``ipython`` installed. `numpydoc -`_ is used to parse the docstrings that -follow the Numpy Docstring Standard (see above), but you don't need to install -this because a local copy of ``numpydoc`` is included in the pandas source -code. `nbsphinx `_ is used to convert -Jupyter notebooks. You will need to install it if you intend to modify any of -the notebooks included in the documentation. - -Furthermore, it is recommended to have all `optional dependencies -`_ -installed. This is not needed, but be aware that you will see some error -messages. Because all the code in the documentation is executed during the doc -build, the examples using this optional dependencies will generate errors. -Run ``pd.show_versions()`` to get an overview of the installed version of all -dependencies. - -.. warning:: - - Sphinx version >= 1.2.2 or the older 1.1.3 is required. - -Building pandas -^^^^^^^^^^^^^^^ - -For a step-by-step overview on how to set up your environment, to work with -the pandas code and git, see `the developer pages -`_. -When you start to work on some docs, be sure to update your code to the latest -development version ('master'):: - - git fetch upstream - git rebase upstream/master - -Often it will be necessary to rebuild the C extension after updating:: - - python setup.py build_ext --inplace - -Building the documentation -^^^^^^^^^^^^^^^^^^^^^^^^^^ - -So how do you build the docs? 
Navigate to your local folder -``pandas/doc/`` directory in the console and run:: - - python make.py html - -And then you can find the html output in the folder ``pandas/doc/build/html/``. - -The first time it will take quite a while, because it has to run all the code -examples in the documentation and build all generated docstring pages. -In subsequent evocations, sphinx will try to only build the pages that have -been modified. - -If you want to do a full clean build, do:: - - python make.py clean - python make.py build - - -Starting with 0.13.1 you can tell ``make.py`` to compile only a single section -of the docs, greatly reducing the turn-around time for checking your changes. -You will be prompted to delete `.rst` files that aren't required, since the -last committed version can always be restored from git. - -:: - - #omit autosummary and API section - python make.py clean - python make.py --no-api - - # compile the docs with only a single - # section, that which is in indexing.rst - python make.py clean - python make.py --single indexing - -For comparison, a full doc build may take 10 minutes. a ``-no-api`` build -may take 3 minutes and a single section may take 15 seconds. - -Where to start? ---------------- - -There are a number of issues listed under `Docs -`_ -and `good first issue -`_ -where you could start out. - -Or maybe you have an idea of your own, by using pandas, looking for something -in the documentation and thinking 'this can be improved', let's do something -about that! - -Feel free to ask questions on `mailing list -`_ or submit an -issue on Github. +See `contributing.rst `_ in this repo. diff --git a/doc/cheatsheet/Pandas_Cheat_Sheet_JA.pdf b/doc/cheatsheet/Pandas_Cheat_Sheet_JA.pdf new file mode 100644 index 0000000000000..daa65a944e68a Binary files /dev/null and b/doc/cheatsheet/Pandas_Cheat_Sheet_JA.pdf differ diff --git a/doc/cheatsheet/Pandas_Cheat_Sheet_JA.pptx b/doc/cheatsheet/Pandas_Cheat_Sheet_JA.pptx new file mode 100644 index 0000000000000..6270a71e20ee8 Binary files /dev/null and b/doc/cheatsheet/Pandas_Cheat_Sheet_JA.pptx differ diff --git a/doc/make.py b/doc/make.py index d85747458148d..0b14a9dcd4c34 100755 --- a/doc/make.py +++ b/doc/make.py @@ -15,11 +15,9 @@ import sys import os import shutil -# import subprocess +import subprocess import argparse -from contextlib import contextmanager import webbrowser -import jinja2 DOC_PATH = os.path.dirname(os.path.abspath(__file__)) @@ -28,174 +26,68 @@ BUILD_DIRS = ['doctrees', 'html', 'latex', 'plots', '_static', '_templates'] -@contextmanager -def _maybe_exclude_notebooks(): - """Skip building the notebooks if pandoc is not installed. - - This assumes that nbsphinx is installed. - - Skip notebook conversion if: - 1. nbconvert isn't installed, or - 2. nbconvert is installed, but pandoc isn't - """ - # TODO move to exclude_pattern - base = os.path.dirname(__file__) - notebooks = [os.path.join(base, 'source', nb) - for nb in ['style.ipynb']] - contents = {} - - def _remove_notebooks(): - for nb in notebooks: - with open(nb, 'rt') as f: - contents[nb] = f.read() - os.remove(nb) - - try: - import nbconvert - except ImportError: - sys.stderr.write('Warning: nbconvert not installed. ' - 'Skipping notebooks.\n') - _remove_notebooks() - else: - try: - nbconvert.utils.pandoc.get_pandoc_version() - except nbconvert.utils.pandoc.PandocMissing: - sys.stderr.write('Warning: Pandoc is not installed. 
' - 'Skipping notebooks.\n') - _remove_notebooks() - - yield - - for nb, content in contents.items(): - with open(nb, 'wt') as f: - f.write(content) - - class DocBuilder: - """Class to wrap the different commands of this script. + """ + Class to wrap the different commands of this script. All public methods of this class can be called as parameters of the script. """ - def __init__(self, num_jobs=1, include_api=True, single_doc=None, - verbosity=0): + def __init__(self, num_jobs=0, include_api=True, single_doc=None, + verbosity=0, warnings_are_errors=False): self.num_jobs = num_jobs - self.include_api = include_api self.verbosity = verbosity - self.single_doc = None - self.single_doc_type = None - if single_doc is not None: - self._process_single_doc(single_doc) - self.exclude_patterns = self._exclude_patterns - - self._generate_index() - if self.single_doc_type == 'docstring': - self._run_os('sphinx-autogen', '-o', - 'source/generated_single', 'source/index.rst') - - @property - def _exclude_patterns(self): - """Docs source files that will be excluded from building.""" - # TODO move maybe_exclude_notebooks here - if self.single_doc is not None: - rst_files = [f for f in os.listdir(SOURCE_PATH) - if ((f.endswith('.rst') or f.endswith('.ipynb')) - and (f != 'index.rst') - and (f != '{0}.rst'.format(self.single_doc)))] - if self.single_doc_type != 'api': - rst_files += ['generated/*.rst'] - elif not self.include_api: - rst_files = ['api.rst', 'generated/*.rst'] - else: - rst_files = ['generated_single/*.rst'] - - exclude_patterns = ','.join( - '{!r}'.format(i) for i in ['**.ipynb_checkpoints'] + rst_files) - - return exclude_patterns + self.warnings_are_errors = warnings_are_errors + + if single_doc: + single_doc = self._process_single_doc(single_doc) + include_api = False + os.environ['SPHINX_PATTERN'] = single_doc + elif not include_api: + os.environ['SPHINX_PATTERN'] = '-api' + + self.single_doc_html = None + if single_doc and single_doc.endswith('.rst'): + self.single_doc_html = os.path.splitext(single_doc)[0] + '.html' + elif single_doc: + self.single_doc_html = 'api/generated/pandas.{}.html'.format( + single_doc) def _process_single_doc(self, single_doc): - """Extract self.single_doc (base name) and self.single_doc_type from - passed single_doc kwarg. + """ + Make sure the provided value for --single is a path to an existing + .rst/.ipynb file, or a pandas object that can be imported. + For example, categorial.rst or pandas.DataFrame.head. For the latter, + return the corresponding file path + (e.g. generated/pandas.DataFrame.head.rst). 
""" - self.include_api = False - - if single_doc == 'api.rst' or single_doc == 'api': - self.single_doc_type = 'api' - self.single_doc = 'api' - elif os.path.exists(os.path.join(SOURCE_PATH, single_doc)): - self.single_doc_type = 'rst' - self.single_doc = os.path.splitext(os.path.basename(single_doc))[0] - elif os.path.exists( - os.path.join(SOURCE_PATH, '{}.rst'.format(single_doc))): - self.single_doc_type = 'rst' - self.single_doc = single_doc - elif single_doc is not None: + base_name, extension = os.path.splitext(single_doc) + if extension in ('.rst', '.ipynb'): + if os.path.exists(os.path.join(SOURCE_PATH, single_doc)): + return single_doc + else: + raise FileNotFoundError('File {} not found'.format(single_doc)) + + elif single_doc.startswith('pandas.'): try: obj = pandas # noqa: F821 for name in single_doc.split('.'): obj = getattr(obj, name) except AttributeError: - raise ValueError('Single document not understood, it should ' - 'be a file in doc/source/*.rst (e.g. ' - '"contributing.rst" or a pandas function or ' - 'method (e.g. "pandas.DataFrame.head")') + raise ImportError('Could not import {}'.format(single_doc)) else: - self.single_doc_type = 'docstring' - if single_doc.startswith('pandas.'): - self.single_doc = single_doc[len('pandas.'):] - else: - self.single_doc = single_doc - - def _copy_generated_docstring(self): - """Copy existing generated (from api.rst) docstring page because - this is more correct in certain cases (where a custom autodoc - template is used). - - """ - fname = os.path.join(SOURCE_PATH, 'generated', - 'pandas.{}.rst'.format(self.single_doc)) - temp_dir = os.path.join(SOURCE_PATH, 'generated_single') - - try: - os.makedirs(temp_dir) - except OSError: - pass - - if os.path.exists(fname): - try: - # copying to make sure sphinx always thinks it is new - # and needs to be re-generated (to pick source code changes) - shutil.copy(fname, temp_dir) - except: # noqa - pass - - def _generate_index(self): - """Create index.rst file with the specified sections.""" - if self.single_doc_type == 'docstring': - self._copy_generated_docstring() - - with open(os.path.join(SOURCE_PATH, 'index.rst.template')) as f: - t = jinja2.Template(f.read()) - - with open(os.path.join(SOURCE_PATH, 'index.rst'), 'w') as f: - f.write(t.render(include_api=self.include_api, - single_doc=self.single_doc, - single_doc_type=self.single_doc_type)) - - @staticmethod - def _create_build_structure(): - """Create directories required to build documentation.""" - for dirname in BUILD_DIRS: - try: - os.makedirs(os.path.join(BUILD_PATH, dirname)) - except OSError: - pass + return single_doc[len('pandas.'):] + else: + raise ValueError(('--single={} not understood. Value should be a ' + 'valid path to a .rst or .ipynb file, or a ' + 'valid pandas object (e.g. categorical.rst or ' + 'pandas.DataFrame.head)').format(single_doc)) @staticmethod def _run_os(*args): - """Execute a command as a OS terminal. + """ + Execute a command as a OS terminal. Parameters ---------- @@ -206,13 +98,11 @@ def _run_os(*args): -------- >>> DocBuilder()._run_os('python', '--version') """ - # TODO check_call should be more safe, but it fails with - # exclude patterns, needs investigation - # subprocess.check_call(args, stderr=subprocess.STDOUT) - os.system(' '.join(args)) + subprocess.check_call(args, stdout=sys.stdout, stderr=sys.stderr) def _sphinx_build(self, kind): - """Call sphinx to build documentation. + """ + Call sphinx to build documentation. Attribute `num_jobs` from the class is used. 
@@ -224,51 +114,52 @@ def _sphinx_build(self, kind): -------- >>> DocBuilder(num_jobs=4)._sphinx_build('html') """ - if kind not in ('html', 'latex', 'spelling'): - raise ValueError('kind must be html, latex or ' - 'spelling, not {}'.format(kind)) - - self._run_os('sphinx-build', - '-j{}'.format(self.num_jobs), - '-b{}'.format(kind), - '-{}'.format( - 'v' * self.verbosity) if self.verbosity else '', - '-d{}'.format(os.path.join(BUILD_PATH, 'doctrees')), - '-Dexclude_patterns={}'.format(self.exclude_patterns), - SOURCE_PATH, - os.path.join(BUILD_PATH, kind)) - - def _open_browser(self): - base_url = os.path.join('file://', DOC_PATH, 'build', 'html') - if self.single_doc_type == 'docstring': - url = os.path.join( - base_url, - 'generated_single', 'pandas.{}.html'.format(self.single_doc)) - else: - url = os.path.join(base_url, '{}.html'.format(self.single_doc)) + if kind not in ('html', 'latex'): + raise ValueError('kind must be html or latex, ' + 'not {}'.format(kind)) + + self.clean() + + cmd = ['sphinx-build', '-b', kind] + if self.num_jobs: + cmd += ['-j', str(self.num_jobs)] + if self.warnings_are_errors: + cmd += ['-W', '--keep-going'] + if self.verbosity: + cmd.append('-{}'.format('v' * self.verbosity)) + cmd += ['-d', os.path.join(BUILD_PATH, 'doctrees'), + SOURCE_PATH, os.path.join(BUILD_PATH, kind)] + return subprocess.call(cmd) + + def _open_browser(self, single_doc_html): + """ + Open a browser tab showing single + """ + url = os.path.join('file://', DOC_PATH, 'build', 'html', + single_doc_html) webbrowser.open(url, new=2) def html(self): - """Build HTML documentation.""" - self._create_build_structure() - with _maybe_exclude_notebooks(): - self._sphinx_build('html') - zip_fname = os.path.join(BUILD_PATH, 'html', 'pandas.zip') - if os.path.exists(zip_fname): - os.remove(zip_fname) - - if self.single_doc is not None: - self._open_browser() - shutil.rmtree(os.path.join(SOURCE_PATH, 'generated_single'), - ignore_errors=True) + """ + Build HTML documentation. + """ + ret_code = self._sphinx_build('html') + zip_fname = os.path.join(BUILD_PATH, 'html', 'pandas.zip') + if os.path.exists(zip_fname): + os.remove(zip_fname) + + if self.single_doc_html is not None: + self._open_browser(self.single_doc_html) + return ret_code def latex(self, force=False): - """Build PDF documentation.""" - self._create_build_structure() + """ + Build PDF documentation. + """ if sys.platform == 'win32': sys.stderr.write('latex build has not been tested on windows\n') else: - self._sphinx_build('latex') + ret_code = self._sphinx_build('latex') os.chdir(os.path.join(BUILD_PATH, 'latex')) if force: for i in range(3): @@ -279,20 +170,27 @@ def latex(self, force=False): '"build/latex/pandas.pdf" for problems.') else: self._run_os('make') + return ret_code def latex_forced(self): - """Build PDF documentation with retries to find missing references.""" - self.latex(force=True) + """ + Build PDF documentation with retries to find missing references. + """ + return self.latex(force=True) @staticmethod def clean(): - """Clean documentation generated files.""" + """ + Clean documentation generated files. + """ shutil.rmtree(BUILD_PATH, ignore_errors=True) - shutil.rmtree(os.path.join(SOURCE_PATH, 'generated'), + shutil.rmtree(os.path.join(SOURCE_PATH, 'api', 'generated'), ignore_errors=True) def zip_html(self): - """Compress HTML documentation into a zip file.""" + """ + Compress HTML documentation into a zip file. 
+ """ zip_fname = os.path.join(BUILD_PATH, 'html', 'pandas.zip') if os.path.exists(zip_fname): os.remove(zip_fname) @@ -305,18 +203,6 @@ def zip_html(self): '-q', *fnames) - def spellcheck(self): - """Spell check the documentation.""" - self._sphinx_build('spelling') - output_location = os.path.join('build', 'spelling', 'output.txt') - with open(output_location) as output: - lines = output.readlines() - if lines: - raise SyntaxError( - 'Found misspelled words.' - ' Check pandas/doc/build/spelling/output.txt' - ' for more details.') - def main(): cmds = [method for method in dir(DocBuilder) if not method.startswith('_')] @@ -330,7 +216,7 @@ def main(): help='command to run: {}'.format(', '.join(cmds))) argparser.add_argument('--num-jobs', type=int, - default=1, + default=0, help='number of jobs used by sphinx-build') argparser.add_argument('--no-api', default=False, @@ -349,6 +235,9 @@ def main(): argparser.add_argument('-v', action='count', dest='verbosity', default=0, help=('increase verbosity (can be repeated), ' 'passed to the sphinx build command')) + argparser.add_argument('--warnings-are-errors', '-W', + action='store_true', + help='fail if warnings are raised') args = argparser.parse_args() if args.command not in cmds: @@ -368,8 +257,8 @@ def main(): os.environ['MPLBACKEND'] = 'module://matplotlib.backends.backend_agg' builder = DocBuilder(args.num_jobs, not args.no_api, args.single, - args.verbosity) - getattr(builder, args.command)() + args.verbosity, args.warnings_are_errors) + return getattr(builder, args.command)() if __name__ == '__main__': diff --git a/doc/source/10min.rst b/doc/source/10min.rst index fbbe94a72c71e..972b562cfebba 100644 --- a/doc/source/10min.rst +++ b/doc/source/10min.rst @@ -1,24 +1,6 @@ .. _10min: -.. currentmodule:: pandas - -.. ipython:: python - :suppress: - - import numpy as np - import pandas as pd - import os - np.random.seed(123456) - np.set_printoptions(precision=4, suppress=True) - import matplotlib - # matplotlib.style.use('default') - pd.options.display.max_rows = 15 - - #### portions of this were borrowed from the - #### Pandas cheatsheet - #### created during the PyData Workshop-Sprint 2012 - #### Hannah Chen, Henry Chow, Eric Cox, Robert Mauriello - +{{ header }} ******************** 10 Minutes to pandas @@ -31,9 +13,8 @@ Customarily, we import as follows: .. ipython:: python - import pandas as pd import numpy as np - import matplotlib.pyplot as plt + import pandas as pd Object Creation --------------- @@ -45,7 +26,7 @@ a default integer index: .. ipython:: python - s = pd.Series([1,3,5,np.nan,6,8]) + s = pd.Series([1, 3, 5, np.nan, 6, 8]) s Creating a :class:`DataFrame` by passing a NumPy array, with a datetime index @@ -55,22 +36,22 @@ and labeled columns: dates = pd.date_range('20130101', periods=6) dates - df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD')) + df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD')) df Creating a ``DataFrame`` by passing a dict of objects that can be converted to series-like. .. 
ipython:: python - df2 = pd.DataFrame({ 'A' : 1., - 'B' : pd.Timestamp('20130102'), - 'C' : pd.Series(1,index=list(range(4)),dtype='float32'), - 'D' : np.array([3] * 4,dtype='int32'), - 'E' : pd.Categorical(["test","train","test","train"]), - 'F' : 'foo' }) + df2 = pd.DataFrame({'A': 1., + 'B': pd.Timestamp('20130102'), + 'C': pd.Series(1, index=list(range(4)), dtype='float32'), + 'D': np.array([3] * 4, dtype='int32'), + 'E': pd.Categorical(["test", "train", "test", "train"]), + 'F': 'foo'}) df2 -The columns of the resulting ``DataFrame`` have different +The columns of the resulting ``DataFrame`` have different :ref:`dtypes `. .. ipython:: python @@ -84,7 +65,7 @@ will be completed: .. ipython:: @verbatim - In [1]: df2. + In [1]: df2. # noqa: E225, E999 df2.A df2.bool df2.abs df2.boxplot df2.add df2.C @@ -114,13 +95,40 @@ Here is how to view the top and bottom rows of the frame: df.head() df.tail(3) -Display the index, columns, and the underlying NumPy data: +Display the index, columns: .. ipython:: python df.index df.columns - df.values + +:meth:`DataFrame.to_numpy` gives a NumPy representation of the underlying data. +Note that his can be an expensive operation when your :class:`DataFrame` has +columns with different data types, which comes down to a fundamental difference +between pandas and NumPy: **NumPy arrays have one dtype for the entire array, +while pandas DataFrames have one dtype per column**. When you call +:meth:`DataFrame.to_numpy`, pandas will find the NumPy dtype that can hold *all* +of the dtypes in the DataFrame. This may end up being ``object``, which requires +casting every value to a Python object. + +For ``df``, our :class:`DataFrame` of all floating-point values, +:meth:`DataFrame.to_numpy` is fast and doesn't require copying data. + +.. ipython:: python + + df.to_numpy() + +For ``df2``, the :class:`DataFrame` with multiple dtypes, +:meth:`DataFrame.to_numpy` is relatively expensive. + +.. ipython:: python + + df2.to_numpy() + +.. note:: + + :meth:`DataFrame.to_numpy` does *not* include the index or column + labels in the output. :func:`~DataFrame.describe` shows a quick statistic summary of your data: @@ -190,31 +198,31 @@ Selecting on a multi-axis by label: .. ipython:: python - df.loc[:,['A','B']] + df.loc[:, ['A', 'B']] Showing label slicing, both endpoints are *included*: .. ipython:: python - df.loc['20130102':'20130104',['A','B']] + df.loc['20130102':'20130104', ['A', 'B']] Reduction in the dimensions of the returned object: .. ipython:: python - df.loc['20130102',['A','B']] + df.loc['20130102', ['A', 'B']] For getting a scalar value: .. ipython:: python - df.loc[dates[0],'A'] + df.loc[dates[0], 'A'] For getting fast access to a scalar (equivalent to the prior method): .. ipython:: python - df.at[dates[0],'A'] + df.at[dates[0], 'A'] Selection by Position ~~~~~~~~~~~~~~~~~~~~~ @@ -231,37 +239,37 @@ By integer slices, acting similar to numpy/python: .. ipython:: python - df.iloc[3:5,0:2] + df.iloc[3:5, 0:2] By lists of integer position locations, similar to the numpy/python style: .. ipython:: python - df.iloc[[1,2,4],[0,2]] + df.iloc[[1, 2, 4], [0, 2]] For slicing rows explicitly: .. ipython:: python - df.iloc[1:3,:] + df.iloc[1:3, :] For slicing columns explicitly: .. ipython:: python - df.iloc[:,1:3] + df.iloc[:, 1:3] For getting a value explicitly: .. ipython:: python - df.iloc[1,1] + df.iloc[1, 1] For getting fast access to a scalar (equivalent to the prior method): .. 
ipython:: python - df.iat[1,1] + df.iat[1, 1] Boolean Indexing ~~~~~~~~~~~~~~~~ @@ -283,9 +291,9 @@ Using the :func:`~Series.isin` method for filtering: .. ipython:: python df2 = df.copy() - df2['E'] = ['one', 'one','two','three','four','three'] + df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three'] df2 - df2[df2['E'].isin(['two','four'])] + df2[df2['E'].isin(['two', 'four'])] Setting ~~~~~~~ @@ -295,7 +303,7 @@ by the indexes. .. ipython:: python - s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6)) + s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6)) s1 df['F'] = s1 @@ -303,19 +311,19 @@ Setting values by label: .. ipython:: python - df.at[dates[0],'A'] = 0 + df.at[dates[0], 'A'] = 0 Setting values by position: .. ipython:: python - df.iat[0,1] = 0 + df.iat[0, 1] = 0 Setting by assigning with a NumPy array: .. ipython:: python - df.loc[:,'D'] = np.array([5] * len(df)) + df.loc[:, 'D'] = np.array([5] * len(df)) The result of the prior setting operations. @@ -345,7 +353,7 @@ returns a copy of the data. .. ipython:: python df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E']) - df1.loc[dates[0]:dates[1],'E'] = 1 + df1.loc[dates[0]:dates[1], 'E'] = 1 df1 To drop any rows that have missing data. @@ -394,7 +402,7 @@ In addition, pandas automatically broadcasts along the specified dimension. .. ipython:: python - s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2) + s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2) s df.sub(s, axis='index') @@ -487,12 +495,12 @@ Another example that can be given is: Append ~~~~~~ -Append rows to a dataframe. See the :ref:`Appending ` +Append rows to a dataframe. See the :ref:`Appending ` section. .. ipython:: python - df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D']) + df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D']) df s = df.iloc[3] df.append(s, ignore_index=True) @@ -512,27 +520,27 @@ See the :ref:`Grouping section `. .. ipython:: python - df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', - 'foo', 'bar', 'foo', 'foo'], - 'B' : ['one', 'one', 'two', 'three', - 'two', 'two', 'one', 'three'], - 'C' : np.random.randn(8), - 'D' : np.random.randn(8)}) + df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', + 'foo', 'bar', 'foo', 'foo'], + 'B': ['one', 'one', 'two', 'three', + 'two', 'two', 'one', 'three'], + 'C': np.random.randn(8), + 'D': np.random.randn(8)}) df -Grouping and then applying the :meth:`~DataFrame.sum` function to the resulting +Grouping and then applying the :meth:`~DataFrame.sum` function to the resulting groups. .. ipython:: python df.groupby('A').sum() -Grouping by multiple columns forms a hierarchical index, and again we can +Grouping by multiple columns forms a hierarchical index, and again we can apply the ``sum`` function. .. ipython:: python - df.groupby(['A','B']).sum() + df.groupby(['A', 'B']).sum() Reshaping --------- @@ -578,11 +586,11 @@ See the section on :ref:`Pivot Tables `. .. 
ipython:: python - df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3, - 'B' : ['A', 'B', 'C'] * 4, - 'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2, - 'D' : np.random.randn(12), - 'E' : np.random.randn(12)}) + df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3, + 'B': ['A', 'B', 'C'] * 4, + 'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2, + 'D': np.random.randn(12), + 'E': np.random.randn(12)}) df We can produce pivot tables from this data very easily: @@ -649,11 +657,12 @@ Categoricals ------------ pandas can include categorical data in a ``DataFrame``. For full docs, see the -:ref:`categorical introduction ` and the :ref:`API documentation `. +:ref:`categorical introduction ` and the :ref:`API documentation `. .. ipython:: python - df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']}) + df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], + "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']}) Convert the raw grades to a categorical data type. @@ -662,7 +671,7 @@ Convert the raw grades to a categorical data type. df["grade"] = df["raw_grade"].astype("category") df["grade"] -Rename the categories to more meaningful names (assigning to +Rename the categories to more meaningful names (assigning to ``Series.cat.categories`` is inplace!). .. ipython:: python @@ -674,7 +683,8 @@ Reorder the categories and simultaneously add the missing categories (methods un .. ipython:: python - df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"]) + df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", + "good", "very good"]) df["grade"] Sorting is per order in the categories, not lexical order. @@ -703,13 +713,14 @@ See the :ref:`Plotting ` docs. .. ipython:: python - ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000)) + ts = pd.Series(np.random.randn(1000), + index=pd.date_range('1/1/2000', periods=1000)) ts = ts.cumsum() @savefig series_plot_basic.png ts.plot() -On a DataFrame, the :meth:`~DataFrame.plot` method is a convenience to plot all +On a DataFrame, the :meth:`~DataFrame.plot` method is a convenience to plot all of the columns with labels: .. ipython:: python @@ -718,8 +729,10 @@ of the columns with labels: columns=['A', 'B', 'C', 'D']) df = df.cumsum() + plt.figure() + df.plot() @savefig frame_plot_basic.png - plt.figure(); df.plot(); plt.legend(loc='best') + plt.legend(loc='best') Getting Data In/Out ------------------- @@ -742,6 +755,7 @@ CSV .. ipython:: python :suppress: + import os os.remove('foo.csv') HDF5 @@ -753,13 +767,13 @@ Writing to a HDF5 Store. .. ipython:: python - df.to_hdf('foo.h5','df') + df.to_hdf('foo.h5', 'df') Reading from a HDF5 Store. .. ipython:: python - pd.read_hdf('foo.h5','df') + pd.read_hdf('foo.h5', 'df') .. ipython:: python :suppress: @@ -796,7 +810,7 @@ If you are attempting to perform an operation you might see an exception like: .. code-block:: python >>> if pd.Series([False, True, False]): - print("I was true") + ... print("I was true") Traceback ... ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all(). 
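A minimal sketch of the behaviour described by the new 10min.rst text above (DataFrame.to_numpy() picks a single NumPy dtype wide enough for every column); illustration only, with made-up column names, not part of the patch:

    import numpy as np
    import pandas as pd

    # homogeneous float frame: to_numpy() is cheap and keeps float64
    df = pd.DataFrame(np.random.randn(3, 2), columns=['A', 'B'])
    df.to_numpy().dtype         # dtype('float64')

    # mixed dtypes: NumPy needs one dtype that can hold every column,
    # so the result falls back to object and each value is boxed
    df2 = pd.DataFrame({'x': [1, 2],
                        'y': pd.Timestamp('20130102'),
                        'z': 'foo'})
    df2.to_numpy().dtype        # dtype('O')
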
diff --git a/doc/source/_static/banklist.html b/doc/source/_static/banklist.html
index cbcce5a2d49ff..cb07c332acbe7 100644
--- a/doc/source/_static/banklist.html
+++ b/doc/source/_static/banklist.html
[hunks condensed: this static HTML fixture (the FDIC "Failed Bank List" page) receives only whitespace cleanup here; every visible hunk strips trailing whitespace from the embedded JavaScript and the bank-list table markup, leaving the page content unchanged.]

    Failed Bank List

    CenterState Bank of Florida, N.A. January 27, 2012 September 12, 2012 - + American Eagle Savings Bank Boothwyn @@ -824,7 +824,7 @@

    Failed Bank List

    Capital Bank, N.A. January 20, 2012 January 25, 2013 - + The First State Bank Stockbridge @@ -833,7 +833,7 @@

    Failed Bank List

    Hamilton State Bank January 20, 2012 January 25, 2013 - + Central Florida State Bank Belleview @@ -842,7 +842,7 @@

    Failed Bank List

    CenterState Bank of Florida, N.A. January 20, 2012 January 25, 2013 - + Western National Bank Phoenix @@ -869,7 +869,7 @@

    Failed Bank List

    First NBC Bank November 18, 2011 August 13, 2012 - + Polk County Bank Johnston @@ -887,7 +887,7 @@

    Failed Bank List

    Century Bank of Georgia November 10, 2011 August 13, 2012 - + SunFirst Bank Saint George @@ -896,7 +896,7 @@

    Failed Bank List

    Cache Valley Bank November 4, 2011 November 16, 2012 - + Mid City Bank, Inc. Omaha @@ -905,7 +905,7 @@

    Failed Bank List

    Premier Bank November 4, 2011 August 15, 2012 - + All American Bank Des Plaines @@ -914,7 +914,7 @@

    Failed Bank List

    International Bank of Chicago October 28, 2011 August 15, 2012 - + Community Banks of Colorado Greenwood Village @@ -959,7 +959,7 @@

    Failed Bank List

    Blackhawk Bank & Trust October 14, 2011 August 15, 2012 - + First State Bank Cranford @@ -968,7 +968,7 @@

    Failed Bank List

    Northfield Bank October 14, 2011 November 8, 2012 - + Blue Ridge Savings Bank, Inc. Asheville @@ -977,7 +977,7 @@

    Failed Bank List

    Bank of North Carolina October 14, 2011 November 8, 2012 - + Piedmont Community Bank Gray @@ -986,7 +986,7 @@

    Failed Bank List

    State Bank and Trust Company October 14, 2011 January 22, 2013 - + Sun Security Bank Ellington @@ -1202,7 +1202,7 @@

    Failed Bank List

    Ameris Bank July 15, 2011 November 2, 2012 - + One Georgia Bank Atlanta @@ -1247,7 +1247,7 @@

    Failed Bank List

    First American Bank and Trust Company June 24, 2011 November 2, 2012 - + First Commercial Bank of Tampa Bay Tampa @@ -1256,7 +1256,7 @@

    Failed Bank List

    Stonegate Bank June 17, 2011 November 2, 2012 - + McIntosh State Bank Jackson @@ -1265,7 +1265,7 @@

    Failed Bank List

    Hamilton State Bank June 17, 2011 November 2, 2012 - + Atlantic Bank and Trust Charleston @@ -1274,7 +1274,7 @@

    Failed Bank List

    First Citizens Bank and Trust Company, Inc. June 3, 2011 October 31, 2012 - + First Heritage Bank Snohomish @@ -1283,7 +1283,7 @@

    Failed Bank List

    Columbia State Bank May 27, 2011 January 28, 2013 - + Summit Bank Burlington @@ -1292,7 +1292,7 @@

    Failed Bank List

    Columbia State Bank May 20, 2011 January 22, 2013 - + First Georgia Banking Company Franklin @@ -2030,7 +2030,7 @@

    Failed Bank List

    Westamerica Bank August 20, 2010 September 12, 2012 - + Los Padres Bank Solvang @@ -2624,7 +2624,7 @@

    Failed Bank List

    MB Financial Bank, N.A. April 23, 2010 August 23, 2012 - + Amcore Bank, National Association Rockford @@ -2768,7 +2768,7 @@

    Failed Bank List

    First Citizens Bank March 19, 2010 August 23, 2012 - + Bank of Hiawassee Hiawassee @@ -3480,7 +3480,7 @@

    Failed Bank List

    October 2, 2009 August 21, 2012 - + Warren Bank Warren MI @@ -3767,7 +3767,7 @@

    Failed Bank List

    Herring Bank July 31, 2009 August 20, 2012 - + Security Bank of Jones County Gray @@ -3848,7 +3848,7 @@

    Failed Bank List

    California Bank & Trust July 17, 2009 August 20, 2012 - + BankFirst Sioux Falls @@ -4811,7 +4811,7 @@

    Failed Bank List

    Bank of the Orient October 13, 2000 March 17, 2005 - + @@ -4854,7 +4854,7 @@

    Failed Bank List