
alan-turing-institute / CleverCSV, Hacker News





CleverCSV provides a drop-in replacement for the Python csv package with improved dialect detection for messy CSV files. It also provides a handy command line tool that can standardize a messy file or generate Python code to import it.

Useful links:

CleverCSV on Github
CleverCSV on PyPI
Demo of CleverCSV on Binder (interactive!)
Paper (PDF)
Paper (HTML)
Reproducible Research Repo
Blog post on messy CSV files (NEW!)

Introduction

  • CSV files are awesome! They are lightweight, easy to share, human-readable, version-controllable, and supported by many systems and tools!

  • CSV files are terrible! They can have many different formats, multiple tables, headers or no headers, escape characters, and there’s no support for recording metadata!
    CleverCSV is a Python package that aims to solve some of the pain points of CSV files, while maintaining many of the good things. The package automatically detects (with high accuracy) the format (dialect) of CSV files, thus making it easier to simply point to a CSV file and load it, without the need for human inspection. In the future, we hope to solve some of the other issues of CSV files too.

    CleverCSV is based on science. We investigated thousands of real-world CSV files to find a robust way to automatically detect the dialect of a file. This may seem like an easy problem, but to a computer a CSV file is simply a long string, and every dialect will give you some table. In CleverCSV we use a technique based on the patterns of the parsed file and the data type of the parsed cells. With our method we achieve a 97% accuracy for dialect detection, with a 21% improvement on non-standard (messy) CSV files.
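To make the point that "every dialect will give you some table" concrete, here is a toy sketch using only the standard library. It is not CleverCSV's actual algorithm: the sample text, the `row_length_counts` helper, and the `clean_fraction` helper are illustrative assumptions that only hint at what "row patterns" and "cell types" could mean.

```python
import csv
import io
import re
from collections import Counter

# Made-up sample: semicolon-delimited, with decimal commas in the numbers.
SAMPLE = "name;score\nada;9,5\nbob;7,25\n"

# Crude, hypothetical notion of a "clean" cell: a plain word or a number.
CLEAN_CELL = re.compile(r"[A-Za-z]+|\d+(?:[.,]\d+)?")


def parse(text, delimiter):
    """Parse text with the given delimiter; any delimiter yields some table."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


def row_length_counts(rows):
    """Count how often each row length occurs (a stand-in for row patterns)."""
    return Counter(len(row) for row in rows)


def clean_fraction(rows):
    """Fraction of cells that look like a plain word or a number."""
    cells = [cell for row in rows for cell in row]
    return sum(bool(CLEAN_CELL.fullmatch(cell)) for cell in cells) / len(cells)


for delimiter in (";", ","):
    rows = parse(SAMPLE, delimiter)
    print(repr(delimiter), dict(row_length_counts(rows)), clean_fraction(rows))
```

Both delimiters parse without error, but only the semicolon gives rows of equal length whose cells all look like words or numbers. A real detector has to weigh such signals across many candidate dialects, including quote and escape characters.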

    We think this kind of work can be very valuable for working data scientists and programmers and we hope that you find CleverCSV useful (if there's a problem, please open an issue!). Since the academic world counts citations, please cite CleverCSV if you use the package. Here’s a BibTeX entry you can use:

        @article{van2019wrangling,
                title = {Wrangling Messy {CSV} Files by Detecting Row and Type Patterns},
                author = {{van den Burg}, G. J. J. and Naz{\'a}bal, A. and Sutton, C.},
                journal = {Data Mining and Knowledge Discovery},
                year = {2019},
                volume = {33},
                number = {6},
                pages = {1799--1820},
                issn = {1573-756X},
                doi = {10.1007/s10618-019-00646-y},
        }

    And of course, if you like the package please spread the word! You can do this by Tweeting about it (#CleverCSV) or clicking the ⭐️ on GitHub!



    CleverCSV consists of a Python library and a command line tool called clevercsv.

    Library

    We designed CleverCSV to provide a drop-in replacement for the built-in CSV module, with some useful functionality added to it. Therefore, if you simply want to replace the builtin CSV module with CleverCSV, you can import CleverCSV in its place (for example, import clevercsv as csv) and use it as you would use the builtin.

    CleverCSV provides an improved version of the dialect sniffer in the CSV module, but it also adds some useful wrapper functions. These functions automatically detect the dialect and aim to make working with CSV files easier. We currently have the following helper functions:

  • detect_dialect: takes a path to a CSV file and returns the detected dialect.
  • read_csv: automatically detects the dialect and encoding of the file, and returns the data as a list of rows.
  • csv2df: detects the dialect and encoding of the file and then uses Pandas to read the CSV into a DataFrame.

    Of course, you can also use the traditional way of loading a CSV file, as in the Python CSV module:

        # importing this way makes it easy to port existing code to CleverCSV!
        import clevercsv as csv

        with open("data.csv", "r", newline="") as fp:
            # you can use verbose=True to see what CleverCSV does:
            dialect = csv.Sniffer().sniff(fp.read(), verbose=False)
            # rewind the file so the reader starts at the first row
            fp.seek(0)
            reader = csv.reader(fp, dialect)
            rows = list(reader)
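Because the interface mirrors the built-in module, the same sniff-then-read pattern can be tried end to end with the standard library alone. The sketch below is an illustration under assumptions: the temporary file and its contents are made up, and it imports the standard `csv` module so that it runs without extra packages. Following the text above, swapping the import for `import clevercsv as csv` should be the only change needed to use CleverCSV's detection instead.

```python
import csv  # assumption: replace with "import clevercsv as csv" to use CleverCSV
import os
import tempfile

# Write a small semicolon-separated file so the example is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", newline="", delete=False
) as tmp:
    tmp.write("a;b;c\n1;2;3\n4;5;6\n")
    path = tmp.name

try:
    with open(path, "r", newline="") as fp:
        # Detect the dialect from the contents, limited to a few candidates.
        dialect = csv.Sniffer().sniff(fp.read(), delimiters=";,|\t")
        fp.seek(0)  # rewind so the reader starts at the first row
        rows = list(csv.reader(fp, dialect))
finally:
    os.remove(path)  # clean up the temporary file

print(dialect.delimiter, rows)
```

On a file this tidy the standard sniffer is enough. The improved detection matters for files where quoting, escaping, or inconsistent rows make the delimiter ambiguous.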

    That's the basics! If you want more details, you can look at the code of the package, the test suite, or the API documentation.

    Command-Line Tool

    The clevercsv command line application has a number of handy features to make working with CSV files easier. For instance, it can be used to view a CSV file on the command line while automatically detecting the dialect. It can also generate Python code for importing data from a file with the correct dialect. The full help text is as follows:

        USAGE
          clevercsv [-h] [-v] [-V] <command> [<arg1>] ... [<argN>]

        ARGUMENTS
          <command>       The command to execute
          <arg>           The arguments of the command

        GLOBAL OPTIONS
          -h (--help)     Display this help message.
          -v (--verbose)  Enable verbose mode.
          -V (--version)  Display the application version.

        AVAILABLE COMMANDS
          code            Generate Python code for importing the CSV file.
          detect          Detect the dialect of a CSV file
          help            Display the manual of a command
          standardize     Convert a CSV file to one that conforms to RFC-4180.
          view            View the CSV file on the command line using TabView
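As a rough picture of what converting a file to RFC-4180 form involves, the sketch below re-reads text in a known dialect and writes it back with Python's default "excel" dialect (comma delimiter, double-quote quoting, CRLF line endings). This is not the implementation behind the standardize command: the `standardize` function and the sample text are hypothetical, and only the standard library is used.

```python
import csv
import io


def standardize(text, delimiter=",", quotechar="", escapechar=""):
    """Re-write CSV text in a known dialect as comma-separated, quoted CSV."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=delimiter,
        quotechar=quotechar or None,
        escapechar=escapechar or None,
        # Without a quote character, quoting has to be switched off entirely.
        quoting=csv.QUOTE_MINIMAL if quotechar else csv.QUOTE_NONE,
    )
    out = io.StringIO(newline="")
    # The "excel" dialect quotes fields that contain the delimiter and ends
    # each row with CRLF, which is the layout RFC 4180 describes.
    csv.writer(out, dialect="excel").writerows(reader)
    return out.getvalue()


# Made-up input that escapes an embedded comma with a backslash.
messy = "title,note\nHello\\, world,ok\n"
print(standardize(messy, delimiter=",", quotechar="", escapechar="\\"))
```

The escaped comma in the input ends up inside a quoted field in the output, so the result no longer depends on an escape character.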

    Each of the commands has further options (for instance, the code command can generate code for importing a Pandas DataFrame). Use clevercsv help <command> for more information. Below are some examples for each command:

        $ clevercsv code imdb.csv

        # Code generated with CleverCSV

        import clevercsv

        with open("imdb.csv", "r", newline="", encoding="utf-8") as fp:
            reader = clevercsv.reader(fp, delimiter=",", quotechar="", escapechar="\\")
            rows = list(reader)


    We also have a version that reads a Pandas dataframe:

        $ clevercsv code --pandas imdb.csv

        # Code generated with CleverCSV

        import clevercsv

        df = clevercsv.csv2df("imdb.csv", delimiter=",", quotechar="", escapechar="\\")


    Detection is useful when you only want to know the dialect.

        $ clevercsv detect imdb.csv
        Detected: SimpleDialect(',', '', '\\')


    The --plain flag gives the components of the dialect on separate lines, which makes combining it with grep easier.

        $ clevercsv detect --plain imdb.csv
        delimiter = ,
        quotechar =
        escapechar = \

    The view command allows you to view the file in the terminal. The dialect is of course detected using CleverCSV! Both this command and the standardize command support the --transpose flag, if you want to transpose the file before viewing or saving:

        $ clevercsv view --transpose imdb.csv

    Contributors



    If you want to encourage development of CleverCSV, the best thing to do now is to spread the word!

    If you encounter an issue in CleverCSV, please open an issue or submit a pull request. Don't hesitate, you're helping to make this project better! If GitHub's not your thing but you still want to contact us, you can send an email to gertjanvandenburg at gmail dot com instead.

    Notes

    License: MIT (see LICENSE file).

    Copyright (c) The Alan Turing Institute
