Difference between revisions of "Data Analysis"

From TeleCafeWiki
 +
{{RightTOC}}
 +
== Big Data ==
 +
* [http://datascience.stackexchange.com/questions/19/how-big-is-big-data How big is big data?] (StackExchange)
 +
: Good question; good answer.
 +
 +
* [http://www.analyticsvidhya.com/blog/2016/05/complete-tutorial-work-big-data-amazon-web-services-aws/ A Complete Tutorial to work on Big Data with Amazon Web Services (AWS)]
 +
: A step-by-step process the author used to connect to a 24GB AWS instance via a laptop. Examples for Python and R users.
 +
 
== Business Intelligence ==
 
 +
* [https://github.com/Quartz/bad-data-guide '''Quartz/bad-data-guide''' - The Quartz guide to bad data]
 +
: An exhaustive reference to problems seen in real-world data along with suggestions on how to resolve them.
 +
 +
=== [[wikipedia:Business intelligence|BI]] Tools ===
 
* [http://2012books.lardbucket.org/books/getting-the-most-out-of-information-systems-a-managers-guide-v1.1/s15-06-the-business-intelligence-tool.html The Business Intelligence Toolkit]
 
 
: A section from the Creative Commons book [http://2012books.lardbucket.org/books/getting-the-most-out-of-information-systems-a-managers-guide-v1.1/index.html Getting the Most Out of Information Systems: A Manager's Guide].
 
 +
 +
* [http://www.butleranalytics.com/10-mysql-reporting-tools/ 10+ MySQL Reporting Tools]
 +
: These MySQL reporting tools fall into two broad camps – business intelligence suites where reporting is a major component, and tools that are specifically aimed at reporting. Also many of them are free.
 +
 +
* [http://www.predictiveanalyticstoday.com/open-source-free-business-intelligence-solutions/ 33 Open Source and Free Business Intelligence Solutions]
 +
: '''Free Open Source Business Intelligence Solutions'''
 +
:: Pentaho Community Edition, OpenText Actuate Information Hub, Free Edition, ReportServer, JasperReports Business Intelligence, Jedox Base, SpagoBI, ART, Pentaho Reporting, JMagallanes, OpenReports, Seal Report, Openi, NextReports, RapidMiner, Mondrian, KNIME.
 +
: '''Free Cloud Business Intelligence Solutions'''
 +
:: Watson Analytics, SAP Lumira Cloud, Power BI, Microstrategy Analytics Express and Birst Express for NetSuite.
 +
: '''Free Proprietary Business Intelligence Solutions'''
 +
:: EspressReport Lite, SAP Lumira, QlikView Personal Edition, InetSoft, Qlik Sense Desktop, icCube, Tableau Public.
 +
: '''Open Source Commercial Business Intelligence Solutions'''
 +
:: Pentaho, Jaspersoft, Palo, Actuate Corporation, TACTIC.
 +
 +
== Data Analysis & Data Science Tools ==
 +
* [https://datamaps.co/ Datamaps.co] - Create and download data maps.
 +
: Datamaps.co is a free and simple platform for creating visualizations with data maps. It allows you to upload a CSV file with region data and fully customize your map's appearance. Your map chart can be saved as PNG or SVG. With datamaps.co, you can create a custom map of the World, USA, China, or Canada, with more coming.
 +
 +
* [http://vis.stanford.edu/wrangler/ Data Wrangler] (Stanford Visualization Group)
 +
: Wrangler allows interactive transformation of messy, real-world data into the data tables analysis tools expect. Export data for use in Excel, R, Tableau, Protovis, ...
 +
 +
* [http://htsql.org/ HTSQL—A Database Query Language]
 +
: HTSQL is designed for data analysts and other ''accidental programmers'' who have complex business inquiries to solve and need a productive tool to write and share database queries. HTSQL is ''free and open source'' software.
 +
 +
* [http://www.cc.gatech.edu/gvu/ii/jigsaw/ Jigsaw: Visual Analytics for Exploring and Understanding Document Collections]
 +
: Jigsaw is a visual analytics system that helps analysts and researchers better explore, analyze, and make sense of document collections.
 +
 +
* [http://openrefine.org/ OpenRefine] (Formerly [http://code.google.com/p/google-refine/ Google Refine].)
 +
: '''[[wikipedia:OpenRefine|OpenRefine]]''' is a standalone open source desktop application for data cleanup and transformation to other formats, the activity known as [[wikipedia:Data wrangling|data wrangling]]. It is similar to [[wikipedia:Spreadsheet|spreadsheet]] applications (and can work with spreadsheet file formats); however, it behaves more like a database.
 +
 +
=== Command Line ===
 +
* [http://datascienceatthecommandline.com/ Data Science at the Command Line]
 +
: This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You'll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data.
 +
 +
* [https://www.analyticsvidhya.com/blog/2016/08/tutorial-data-science-command-line-scikit-learn/ Tutorial – Data Science at Command Line with R & Python (Scikit Learn)]
 +
: Tutorial inspired by [http://jeroenjanssens.com/ Jeroen Janssens]' book ''[http://datascienceatthecommandline.com/ Data Science at the Command Line]''.
 +
 +
==== IDEs in the Browser | Cloud IDEs ====
 +
''Also See [[Data Analysis#Python_in_Your_Browser:_Tools|Python in Your Browser: Tools]] (below).''
 +
 +
* [https://c9.io/ Cloud9] - Your development environment, in the cloud.
 +
: Cloud9 IDE is an online integrated development environment, published as open source from version 3.0. It supports hundreds of programming languages, including PHP, Ruby, Perl, Python, JavaScript with Node.js, and Go. <!-- Recommended by Hassan in Python Throttle class. -->
 +
 +
* [https://www.nitrous.io Nitrous.io]
 +
: Code and Collaborate in the Cloud, for Free! No installation required. Create a new environment with our cloud IDE in seconds. <!-- Recommended by Hassan in Python Throttle class, then he realized (after conferring with Riley) that he actually meant to recommend Cloud 9. -->
 +
 +
; Further Reading
 +
* [http://www.hongkiat.com/blog/cloud-ide-developers/ Cloud IDEs For Web Developers – Best Of]
 +
: Compares "13 of the best Cloud IDEs," including Akshell, Cloud9, and Python Fiddle, to name just a few.
 +
 +
=== D3 ===
 +
* [http://d3js.org/ D3.js - Data-Driven Documents]
 +
: '''D3.js''' is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.
 +
 +
=== Excel ===
 +
; Text Analysis using Excel
 +
 +
* [http://jacobjwalker.effectiveeducation.org/blog/2012/11/18/excel-vba-macro-code-to-find-word-level-n-grams-in-a-text-entry/ Excel VBA Macro Code to Find Word-Level N-Grams in a Text Entry]
 +
: FEATURES EXAMPLE WORKBOOK: Analysis of Literature and Medicine Courses.xlsm
 +
:: The featured workbook looks interesting and appears quite comprehensive.
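For readers unfamiliar with the term, a word-level n-gram is a run of ''n'' consecutive words. The following is a minimal Python sketch of the idea, not a port of the linked VBA macro; the function name <tt>word_ngrams</tt> and the whitespace-based tokenization are assumptions made for illustration.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return every run of n consecutive words in text (hypothetical helper)."""
    words = text.split()  # naive tokenization: split on any whitespace
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


# Count how often each two-word phrase occurs in a short text entry.
bigram_counts = Counter(word_ngrams("to be or not to be", 2))
print(bigram_counts.most_common(1))
```

Punctuation and letter case are left untouched here; a real analysis would probably normalize both before counting.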
 +
 +
* [https://www.keithyap.com.au/text-analysis-using-excel/ Text Analysis using Excel]
 +
: A quick guide to the types of analysis you can run, ranging from easy to hard.
 +
:: EXAMPLE: Use this formula to count the number of words in a text string: <tt>=LEN(A1)-LEN(SUBSTITUTE(A1," ",""))+1</tt>
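The Excel formula counts the spaces in the cell and adds one. A rough Python equivalent is sketched below to show what that means in practice; both function names are hypothetical, and the second function is an alternative approach rather than anything from the linked guide.

```python
def excel_style_word_count(text):
    """Mirror the Excel formula: number of spaces plus one."""
    return len(text) - len(text.replace(" ", "")) + 1


def whitespace_word_count(text):
    """Alternative: count runs of non-whitespace characters."""
    return len(text.split())


print(excel_style_word_count("the quick brown fox"))  # 4
print(excel_style_word_count("a  b"))                 # 3 (double space counted twice)
print(whitespace_word_count("a  b"))                  # 2
```

Because the formula only counts spaces, it reports 1 for an empty cell and over-counts when words are separated by more than one space, so trimming the text first is probably a good idea.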
 +
 +
* [http://www.clearlyandsimply.com/clearly_and_simply/2015/03/the-implementation-of-word-clouds-with-excel.html The Implementation of Word Clouds with Excel]
 +
: Approach, algorithm, VBA code and performance optimization of the Word Cloud with Excel implementation.
 +
:: FEATURES: Split Text Tool (Microsoft Excel 2007-2013, 115K)
 +
:: NOTE: If the Split Text Tool downloads as a Zip file, change the file extension from .zip to .xlsm.
 +
 +
* [http://www.clearlyandsimply.com/clearly_and_simply/2015/02/word-clouds-with-microsoft-excel.html Create dynamic Word Clouds / Tag Clouds in Microsoft Excel]
 +
: FEATURES THE FOLLOWING DOWNLOADS:
 +
::- Word Clouds Bruce Springsteen (Microsoft Excel 2007-2013, 518.4K)
 +
::- Word Cloud Generic Template (Microsoft Excel 2007-2013, 78.7K)
 +
::- Word Cloud UDF (Microsoft Excel 2007-2013, 72.4K)
 +
: NOTE: If any of the tools downloads as a Zip file, change the file extension from .zip to .xlsm.
 +
 +
=== Python ===
 +
Python is a very powerful programming language used for many different applications. Over time, the huge community around this open source language has created quite a few tools to efficiently work with Python. In recent years, a number of tools have been built specifically for data science. As a result, analyzing data with Python has never been easier.<ref>[https://www.edx.org/course/introduction-python-data-science-microsoft-dat208x-2 Introduction to Python for Data Science] ([https://www.edx.org/ edX])</ref>
 +
 +
==== Python in Your Browser: Propaganda ====
 +
* [https://eev.ee/blog/2016/07/31/python-faq-why-should-i-use-python-3/ Python FAQ: Why should I use Python 3?] ([https://eev.ee/ fuzzy notepad])
 +
: Python 3 is great and you should use it. Here’s why...
 +
 +
* [http://www.wired.com/2015/05/running-python-browser-awesome-think/ Running Python in a Browser Is More Awesome Than You Think]
 +
: Shows simple code examples running on [https://trinket.io/python Trinket].
 +
 +
* [http://physicslogos.blogspot.com/2015/05/python-graphing-tutorials-in-trinket.html Python Graphing Tutorials in Trinket]
 +
: Trinket can be used to post interactive Python code which your students can run in the browser or copy and paste for the purpose of running locally on their computer.
 +
 +
==== Python in Your Browser: Tools ====
 +
* [https://www.pythonanywhere.com/ PythonAnywhere]
 +
: Host, run, and code Python in the cloud!
 +
 +
* [http://pythonfiddle.com/ Python Fiddle: Python Cloud IDE]
 +
: The Python IDE for the web. Play around with and modify live example code. Share or demonstrate solutions to problems.
 +
 +
* [https://repl.it/ repl.it: online REPL, Compiler & IDE]
 +
: Powerful and simple online compiler, IDE, interpreter, and REPL. Write, run, save, and share code from your browser!
 +
 +
* [http://www.skulpt.org Skulpt]: Python. Client side.
 +
: Skulpt is an entirely in-browser implementation of Python. No preprocessing, plugins, or server-side support required, just write Python and reload.
 +
 +
* [https://trinket.io/python Your Python Trinket]
 +
: Python in the browser. No installation required.
 +
 +
==== Python & Spreadsheets ====
 +
* [http://sheetsync.readthedocs.io/en/latest/ SheetSync]: Welcome to SheetSync!
 +
: A Python library to create, update and delete rows of data in a Google spreadsheet.
 +
 +
* [http://xlsxwriter.readthedocs.io XlsxWriter]: Creating Excel files with Python and XlsxWriter
 +
: XlsxWriter is a Python module for creating Excel XLSX files.
 +
 +
==== Python Tips, Tutorials & Tricks ====
 +
* [https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/ A Complete Tutorial to Learn Data Science with Python from Scratch]
 +
: Python was originally a general purpose language. Over the years, with strong community support, dedicated libraries for data analysis and predictive modeling emerged.
 +
* [http://www.automatingosint.com/blog/2016/07/dark-web-osint-with-python-and-onionscan-part-one/ Dark Web OSINT With Python and OnionScan: Part One]
 +
: Step-by-step instructions for setting up an environment to scan hidden services in the dark web!
 +
 +
==== Web Scraping With Python ====
 +
* [https://scrapy.org/ Scrapy]
 +
: Scrapy is an open source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.
 +
 +
=== R ===
 +
* [http://datascience.stackexchange.com/questions/2269/any-online-r-console Any Online R console?] (StackExchange)
 +
 +
==== RStudio ====
 +
* RStudio: [https://www.rstudio.com/ide/download/ Download RStudio]
 +
: Take control of your R code. RStudio is the premier integrated development environment for R. It is available in open source and commercial editions and runs on the desktop (Windows, Mac, and Linux) or over the web with RStudio Server.
 +
 +
==== R & Excel ====
 +
* [http://www.r-bloggers.com/a-million-ways-to-connect-r-and-excel/ A million ways to connect R and Excel]
 +
: In quantitative finance both R and Excel are the basic tools for any type of analysis. Whenever one has to use Excel in conjunction with R, there are many ways to approach the problem and many solutions. It depends on what you really want to do and the size of the dataset you’re dealing with.
 +
 +
* [https://cran.r-project.org/web/packages/openxlsx/index.html openxlsx: Read, Write and Edit XLSX Files]
 +
: Simplifies the creation of Excel .xlsx files by providing a high level interface to writing, styling and editing worksheets. Through the use of Rcpp, read/write times are comparable to the xlsx and XLConnect packages with the added benefit of removing the dependency on Java.
 +
 +
==== R & Google Sheets ====
 +
* [https://cran.r-project.org/web/packages/googlesheets/index.html googlesheets: Manage Google Spreadsheets from R]
 +
: Interact with Google Sheets from R.
 +
 +
* [https://github.com/jennybc/googlesheets Google Sheets R API]
 +
: Access and manage Google spreadsheets from R with googlesheets.
 +
 +
==== Shiny ====
 +
* [http://shiny.rstudio.com/ Shiny] - by [https://www.rstudio.com/ RStudio]
 +
: A web application framework for R.
 +
::- Turn your analyses into interactive web applications.
 +
::- No HTML, CSS, or JavaScript knowledge required.
 +
 +
== Learn Data Science ==
 +
''Also See: [[Computer Productivity Hacks#Educate_Yourself]]''
 +
{{cquote2|Academic credentials are important but not necessary for high-quality data science. The core aptitudes – curiosity, intellectual agility, statistical fluency, research stamina, scientific rigor, skeptical nature – that distinguish the best data scientists are widely distributed throughout the population.<br /><br />We’re likely to see more uncredentialed, inexperienced individuals try their hands at data science, bootstrapping their skills on the open-source ecosystem and using the diversity of modeling tools available. Just as data-science platforms and tools are proliferating through the magic of open source, big data’s data-scientist pool will as well.<br /><br />And there’s yet another trend that will alleviate any talent gap: the democratization of data science. [[wikipedia:Autodidacticism|Autodidacts]] – the self-taught, uncredentialed, data-passionate people – will come to play a significant role in many organizations’ data science initiatives.|James Kobielus|[http://www.ibmbigdatahub.com/blog/data-scientist-closing-talent-gap Data Scientist: Closing the Talent Gap]|##px|##px}}
 +
 +
* [https://www.dataquest.io/ Dataquest: Learn data science online.]
 +
: Learn applied skills for a career in data science. Work with data, build projects, and gain experience. (No longer free.)
 +
 +
* [https://github.com/dataquestio/data-science-blogs Dataquest.io's Curated List Of Data Science Blogs]
 +
: [http://www.analyticsvidhya.com/blog/ Analytics Vidhya] | [http://dataaspirant.com/ Dataaspirant] | [http://www.randalolson.com/blog/ Dr. Randal S. Olson] | [http://blog.dominodatalab.com/ Domino Data Lab's blog] | [http://ianozsvald.com/ Entrepreneurial Geekiness] | [http://blog.kaggle.com/ no free hunch] | [https://jakevdp.github.io/ Pythonic Perambulations] | [http://rayli.net/blog/ Rayli.Net] | [http://blog.yhathq.com/ yhat] | [http://mdbecker.github.io/ Beckerfuffle] | [http://carlshan.com/ Carl Shan] | [http://datamining.typepad.com/data_mining/ Data Mining: Text Mining, Visualization and Social Media] | [http://fastml.com/ FastML] | [http://www.chioka.in/ Garbled Notes] | [http://www.joyofdata.de/blog/ Joy Of Data] | [http://mlwave.com/ MLWave] | [http://trevorstephens.com/ Trevor Stephens] | [http://twiecki.github.io/ While My MCMC Gently Samples] | [http://zacstewart.com/ Zac Stewart] | [http://www.machinedlearnings.com/ Machined Learnings] | [http://deeplearning.net/feed/ Deep Learning] | [http://nuit-blanche.blogspot.com/ Nuit Blanche] | [http://hunch.net/ Machine Learning (Theory)] | [http://bickson.blogspot.com/ Large Scale Machine Learning] | [http://blog.revolutionanalytics.com/ Revolutions] | [http://www.kdnuggets.com/ KDnuggets] | [http://wellecks.wordpress.com/ Wellecks] | [http://www.pyimagesearch.com/ PyImageSearch] | [http://www.dataminingblog.com/ Data Mining Research] | [http://treycausey.com/ Trey Causey] | [https://timdettmers.wordpress.com/ Deep Learning] | [http://alexhwoods.com/ Machine Learning and Data Science] | [http://iamtrask.github.io/ i am trask] | [http://317070.github.io/ Jonas Degrave] | [http://yanirseroussi.com/ Yanir Seroussi] | [http://daoudclarke.github.io/ Life, Language, Learning] | [http://medriscoll.com/ M.E.Driscoll] | [https://peadarcoyle.wordpress.com/ Models are illuminating and wrong] | [https://medium.com/@chris_bour Christophe Bourguignat] | [https://medium.com/@samim samim] | [http://karpathy.github.io/ Andrej Karpathy blog] | [http://colah.github.io/archive.html colah's blog] | [http://sebastianraschka.com/articles.html Sebastian Raschka] | [http://blog.echen.me Edwin Chen] | [http://simplystatistics.org Simply Statistics] | [http://rinzewind.org/blog-en Will do stuff for stuff] | [http://www.computervisionblog.com/ Tombone's Computer Vision Blog] | [https://blogs.princeton.edu/imabandit/ I’m a bandit] | [http://101.datascience.community/ Data Science 101] | [http://www.r-bloggers.com/ R-bloggers] | [http://multithreaded.stitchfix.com/blog/ Stitch Fix Tech Blog] | [http://datagenetics.com/blog.html DataGenetics] | [https://www.dataquest.io/blog/ Dataquest Blog]
 +
 +
* [http://datasciencemasters.org/ The Open-Source Data Science Masters]
 +
: The open-source curriculum for learning Data Science. Foundational in both theory and technologies, the OSDSM breaks down the core competencies necessary to make data useful.
 +
: '''The Internet is Your Oyster''': With Coursera, ebooks, Stack Overflow, and GitHub -- all free and open -- how can you afford not to take advantage of an open source education?
 +
 +
* [https://quizlet.com/ Quizlet]—''Study Everywhere''! | Subject: [https://quizlet.com/subject/data-science/ Data Science]
 +
: Numerous "flash card"-style data science lessons.
 +
 +
== Hash, Random Text, Number & String Generators ==
 +
;[[wikipedia:Hash function|Hash]]
 +
* [https://www.md5hashgenerator.com/ MD5 Hash Generator] ([https://www.danstools.com/ Dan's Tools])
 +
* [http://www.unit-conversion.info/texttools/md5/ MD5 create hash online] ([http://www.unit-conversion.info/ unit-conversion.info])
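The online generators above compute an MD5 digest of whatever text you paste in. If you would rather not send text to a website, the same digest can be produced locally; the sketch below uses Python's standard <tt>hashlib</tt> module, and the helper name <tt>md5_hex</tt> and the UTF-8 default are assumptions made for illustration.

```python
import hashlib


def md5_hex(text, encoding="utf-8"):
    """Return the MD5 digest of text as 32 hexadecimal characters."""
    return hashlib.md5(text.encode(encoding)).hexdigest()


print(md5_hex("hello"))  # 5d41402abc4b2a76b9719d911017c592
```

MD5 is fine for checksums and de-duplication keys, but it is no longer considered safe for passwords or signatures.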
 +
 +
;Tools
 +
* [https://www.generatedata.com/ GenerateData.com]
 +
: Free, GNU-licensed, random custom data generator for testing software.
 +
 +
* [https://www.random.org/ RANDOM.ORG] - True Random Number Service
 +
: Generate random lists, numbers, text strings, sequences and more.
 +
 +
;Tutorials
 +
* [https://www.extendoffice.com/documents/excel/642-excel-generate-random-string.html How to generate random character strings in a range in Excel?]
 +
: Sometimes you may need to generate random strings, such as different passwords. This article tries to show you some tricks to generate different random strings in Excel.
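Outside Excel, the same task can be sketched in a few lines of Python. This is only an illustration, not the method from the linked article; the function name <tt>random_string</tt>, the default length, and the default alphabet are assumptions.

```python
import secrets
import string


def random_string(length=12, alphabet=string.ascii_letters + string.digits):
    """Build a random string by drawing each character from alphabet."""
    # secrets.choice draws from the operating system's secure random source,
    # which matters when the result is used as a password.
    return "".join(secrets.choice(alphabet) for _ in range(length))


print(random_string())              # e.g. a 12-character password
print(random_string(6, "ABCDEF"))   # restrict the characters if needed
```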
 +
 +
== Regular Expressions ==
 +
* [[wikipedia:Regular expression|Regular expression]] (Wikipedia)
 +
: In theoretical computer science and formal language theory, a regular expression (abbreviated <tt>regex</tt> or <tt>regexp</tt> and sometimes called a ''rational expression'') is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations.
 +
 +
=== Quick Tips ===
 +
* [http://stackoverflow.com/questions/1331815/regular-expression-to-match-cross-platform-newline-characters Regular Expression to match cross platform newline characters]
 +
 +
* [http://stackoverflow.com/questions/3219014/what-is-a-cross-platform-regex-for-removal-of-line-breaks What is a cross platform regex for removal of line breaks?]
 +
: "The regex I use when I want to be precise is '''<span style="color: green; text-decoration: none;"><tt>\r\n?|\n</tt></span>'''."
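As a quick illustration of the quoted pattern, the Python sketch below applies <tt>\r\n?|\n</tt> with the standard <tt>re</tt> module. The helper names are hypothetical; the pattern itself is the one from the Stack Overflow answer.

```python
import re

# \r\n? matches Windows (\r\n) and old Mac (\r) line endings; \n matches Unix.
NEWLINE = re.compile(r"\r\n?|\n")


def normalize_newlines(text):
    """Convert every line ending to a single \\n."""
    return NEWLINE.sub("\n", text)


def remove_line_breaks(text, replacement=" "):
    """Replace every line ending with replacement (a space by default)."""
    return NEWLINE.sub(replacement, text)


mixed = "one\r\ntwo\rthree\nfour"
print(NEWLINE.split(mixed))  # ['one', 'two', 'three', 'four']
```

Because <tt>\r\n?</tt> is tried first, a Windows line ending is consumed as one match rather than being counted as two separate breaks.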
 +
 +
=== RegEx Tutorials ===
 +
* [http://regexone.com RegexOne - Learn regular expressions with simple, interactive examples.]
 +
: Includes:
 +
::- [http://regexone.com/lesson/0 Interactive Tutorial] › Learn to use Regular Expressions
 +
::- [http://regexone.com/example/0 Practical Examples] › Practice your Regular Expressions
 +
::- [http://regexone.com/cheatsheet RegEx Cheatsheet] › Regular Expressions in PHP & More
 +
 +
* [http://www.regular-expressions.info/tutorial.html Regular Expressions Tutorial: Learn How to Use and Get The Most out of Regular Expressions]
 +
: Any non-trivial <tt>regex</tt> looks daunting to anybody not familiar with them. But with just a bit of experience, you will soon be able to craft your own regular expressions like you have never done anything else.
 +
 +
* [http://www.rexegg.com/regex-uses.html The Many Uses of Regex]
 +
: Regex is the gift that keeps giving. Once you learn it, you discover it comes in handy in many places where you hadn't planned to use it.
 +
 +
* [http://regex.learncodethehardway.org/book/ Learn Regex The Hard Way]
 +
: This is an in-progress book that quickly teaches you regular expressions.
 +
 +
=== RegEx Tools ===
 +
* [https://regex101.com Regex101: Online regex tester and debugger: JavaScript, Python, PHP, and PCRE]
 +
: Regex101 allows you to create, debug, test and have your expressions explained for PHP, PCRE, JavaScript and Python. The website also features a community where you can share useful expressions.
 +
 +
* [http://www.regexr.com/ RegExr: Learn, Build, & Test RegEx]
 +
: Regular expression tester with syntax highlighting, contextual help, video tutorial, reference, and searchable community patterns.
 +
 +
* [http://pythex.org/ Pythex: a Python regular expression editor]
 +
: Pythex is a real-time regular expression editor for Python, a quick way to test your regular expressions.
 +
 +
* [http://regexpal.com/ Regex Tester]
 +
: JavaScript regex tester. Highlights matches on the fly.
 +
 +
== Small Data ==
 +
[[wikipedia:Small data|Small Data]] is everything [[wikipedia:Big data|Big Data]] is not.
 +
;You Might Not Be Big Data If:
 +
- You were generated through human data entry. (Big Data came about in order to handle the exponential growth of machine-generated data, because we humans aren’t fast enough to outpace a good old relational database.)<br />
 +
- You are an operational database. For instance, CRM is never Big Data, and ERP is never Big Data.<br />
 +
- You fit just fine in a MySQL database. Even if you have to put a lot of RAM in it, it’s still not Big Data.<ref>[http://www.datanami.com/2015/05/20/forget-big-data-small-data-is-where-the-money-lies/ Forget Big Data–Small Data Is Where the Money Lies]</ref>
 +
 +
* [http://www.datanami.com/2015/05/20/forget-big-data-small-data-is-where-the-money-lies/ Forget Big Data–Small Data Is Where the Money Lies]
 +
: For the majority of small and medium businesses, Big Data is the technology of the future, not the reality they experience today. To be blunt, most SMBs don’t even have a handle on the Small Data they’re already creating and collecting themselves. (And if many enterprise organizations are being honest, neither do they.) According to Forrester Research, most companies are analyzing a mere 12% of their existing data. That leaves a whopping 88% of data that businesses are flat out ignoring. Can you imagine the potential of actually leveraging that existing data to derive data-driven business insights? Instead of chasing the Big Data dream, SMBs should consider picking up the dollars that are effectively lying on the floor, and invest first in leveraging their Small Data.
 +
 +
* [http://www.theguardian.com/news/datablog/2013/apr/25/forget-big-data-small-data-revolution Forget big data, small data is the real revolution]
 +
: Just as we now find it ludicrous to talk of "big software" – as if size in itself were a measure of value – we should, and will one day, find it equally odd to talk of "big data". Size in itself doesn't matter – what matters is having the data, of whatever size, that helps us solve a problem or address the question we have.
 +
 +
* [http://www.forbes.com/sites/mikekavis/2015/02/25/forget-big-data-small-data-is-driving-the-internet-of-things/ Forget Big Data -- Small Data Is Driving The Internet Of Things]
 +
: What is small data, you ask? Small data is a dataset that contains very specific attributes. Small data is used to determine current states and conditions or may be generated by analyzing larger data sets.
 +
 +
== SQL ==
 +
=== Cheat Sheets ===
 +
* [https://www.simple-talk.com/sql/sql-training/ssms-the-query-window-keyboard-shortcuts/ SQL Server Management Studio Query Window Keyboard Shortcuts]
 +
: Simple-Talk's free wallchart of the most important SSMS keyboard shortcuts aims to help find all those curiously forgettable key combinations within SQL Server Management Studio that unlock the hidden magic that is available for editing and executing queries.
 +
 +
=== Fiddle with SQL in the Browser ===
 +
* [http://www.compileonline.com/execute_sql_online.php codingground: EXECUTE SQL ONLINE]
 +
: I got this one to work. | [[User:Dave|Dave]] ([[User talk:Dave|talk]]) 18:01, 7 December 2016 (PST)
 +
 +
* [http://rextester.com/l/postgresql_online_compiler Rextester | compile postgresql online]
 +
: At Rextester you can compile not only PostgreSQL but also several other SQL flavors (SQL Server, MySQL, Oracle). Rextester also offers options for several other programming languages.
 +
 +
=== SQL & PowerShell ===
 +
* [https://cmatskas.com/execute-sql-query-with-powershell/ Execute SQL Query with PowerShell]
 +
: Scripting is very powerful. And for me, one of the best scripting languages is PowerShell (PoSH). Yes, PoSH takes a bit of getting used to, but once you pass the initial learning curve, you end up with a powerful tool in your hands.
 +
 +
* [http://stackoverflow.com/questions/8423541/how-do-you-run-a-sql-server-query-from-powershell How do you run a SQL Server query from PowerShell?]
 +
: Multiple examples; approaches.
 +
 +
* [https://www.simple-talk.com/content/file.ashx?file=7305 Practical PowerShell For SQL Server Developers and DBAs]
 +
: Download the latest version of this PowerShell™ wallchart and read the accompanying in-depth article from Simple-Talk.
 +
 +
;Also See'''<nowiki>:</nowiki>'''
 +
* [[Spreadsheet Tricks#SQL_.26_Excel|SQL & Excel]]
 +
 +
== Text Extraction ==
 +
=== PDF Conversion ===
 +
* [https://www.pdfyeah.com/ All-in-One Online PDF Solution]
 +
: Free online PDF converter that helps with your everyday PDF tasks, making it easy to compress, convert, join and edit PDF files online.
 +
:: [[User:Dave|Editor]]'s Note: Used this online service to join two PDFs.
 +
 +
[[File:capture2text_ocr_conceptual_illustration.png|thumb|alt=Capture2Text conceptual illustration.|[http://capture2text.sourceforge.net/ Capture2Text] conceptual illustration.|710px]]
 +
* [http://capture2text.sourceforge.net/ Capture2Text]
 +
: Capture2Text enables users to quickly OCR a portion of the screen using a keyboard shortcut. The resulting text will be saved to the clipboard by default.
 +
 +
* [https://sourceforge.net/projects/detexter/ Detexter]
 +
: Detexter is an app designed to extract text from PDF files.
 +
 +
* [http://www.pdfonline.com/pdf-to-word-converter/ PDF Online - Convert PDF to Word (Free!)] (A unit of [http://www.bcltechnologies.com/ BCL Technologies].)
 +
: [[User:Dave|I]] [http://www.pdfonline.com/pdf-to-word-converter/?fb_comment_id=10150571095090174_1789396217960207#f3c561ba7c434c8 used this service] to successfully convert the [http://www.neustar.biz/enterprise/docs/misc/domain-name-registry/us_localityregistrationagreementv1-0-2-2_march_rev.pdf .US Locality Domain Name Registration Terms and Conditions] form from PDF to [https://onedrive.live.com/redir?resid=18D3C89A0E2B3BB4!523&authkey=!AFiywWEHxCSWHbo&ithint=file%2cdocx Word format].
 +
 +
* [http://stackoverflow.com/questions/6187250/pdf-text-extraction PDF TEXT Extraction]
 +
: Lists several options.
 +
 +
* [https://github.com/coolwanglu/pdf2htmlEX pdf2htmlEX]
 +
: Convert PDF to HTML without losing text or format.
 +
 +
* [http://pdftotext.github.io/ pdftotext.org] (Now at [http://pdftotext.github.io/ pdftotext.github.io].)
 +
:: Note: A domain squatter has apparently taken over the original <tt>pdftotext.org</tt> domain, but the service is still available at http://pdftotext.github.io/.
 +
: pdftotext.org is the best online service for easily extracting text from your PDF files. Conversion from PDF to TXT is really fast thanks to our in-browser conversion architecture. Your PDF files are never uploaded to the Internet, so even private PDF files are safe to convert with this service. The conversion is done locally in your browser – you can even convert when you are offline! There is no need for any registration or sign-up, and the service will always be free to use.
 +
 +
* [https://sourceforge.net/projects/doctotext/ SILVERCODERS DocToText]
 +
: Extracts plain text from documents in all popular formats.
 +
 +
* [http://tabula.technology Tabula]
 +
: Tabula is a tool for liberating data tables locked inside PDF files. Tabula allows you to extract that data into a CSV or Microsoft Excel spreadsheet using a simple, easy-to-use interface. Tabula works on Mac, Windows and Linux.
 +
 +
;Also See: [[Office Productivity Hacks#Portable_Document_Format_.28PDF.29|Office Productivity Hacks#Portable Document Format (PDF)]]
 +
 +
=== Data Scrape ; Web Scrape ===
 +
* [https://docs.google.com/a/evolvnet.com/document/d/18Q2THQvYCG2_n6nKVsZRHlaPG9iJ9NvLezOOQbEuAJs/edit?hl=en Almost Scraping: Web Scraping for Non-Programmers]
 +
: Tools and tips compiled by journalists from PBS and Omaha World-Herald.
 +
 +
* [http://blog.scrapinghub.com/2014/01/18/open-source-at-scrapinghub/ Open Source at Scrapinghub]
 +
: Scrapinghub's list of open source scraping projects.
 +
 +
* [http://sitestalker.net/ Sitestalker]
 +
: Monitor website links with ease. Sitestalker supervises websites and notifies you when your desired content hits the web. Stop wasting your time constantly refreshing websites.
 +
: Sitestalker is great for:
 +
:: Finding jobs
 +
:: Searching for an apartment
 +
:: Getting the best bargains
 +
:: Clipping
 +
 +
* [http://www.notprovided.eu/six-tools-web-scraping-use-data-journalism-creating-insightful-content/ Six tools for web scraping – To use for data journalism & creating insightful content]
 +
: Tools for gathering data from public sources.
 +
 +
* [http://www.hongkiat.com/blog/web-scraping-tools/ 10 Web Scraping Tools to Extract Online Data]
 +
: Web Scraping tools are specifically developed for extracting information from websites. They are also known as web harvesting tools or web data extraction tools. These tools are useful for anyone trying to collect some form of data from the Internet. Web Scraping is the new data entry technique that doesn’t require repetitive typing or copy-pasting.
 +
 +
;R Scraper
 +
* [http://www.computerworld.com/article/2909560/business-intelligence/web-scraping-with-r-and-rvest-includes-video-code.html Web scraping with R and rvest (includes video & code)]
 +
: Watch (video) how easy it is to import data from a Web page into R.
 +
 +
=== Text Search ===
 +
* [http://geekdadaji.com/ geekDadaji - A SEARCH INITIATIVE]
 +
: Makes tools to search text content, including:
 +
# [https://sourceforge.net/projects/falcontextsearch/ FALCON - Text Search Java Project]: JSON based text search Java Project
 +
# [https://sourceforge.net/projects/hawksearch/ HAWK - PDF Text Search Java Project]: Taking initiative for Document Text Search
 +
 +
* [http://www.foolabs.com/xpdf/home.html Xpdf: A PDF Viewer for X]
 +
: Xpdf is an open source viewer for Portable Document Format (PDF) files.
 +
: Windows installer: [http://www.compgeom.com/~piyush/scripts/scripts.html Short Programs/Scripts] (Look for the xpdf3.exe / poppler.exe links in left sidebar.)
 +
 +
== Text Transformation ==
 +
* [https://www.transformy.io/#/ Transformy] - Change your data with one example.
 +
: Do you have a list of text strings whose format you want to modify? Copy and paste the list into the box, then provide an example of how you want each text string formatted. The hope is that Transformy will do the rest.
 +
 +
== Workflow ==
 +
=== [[wikipedia:Git (software)|Git]], [[wikipedia:GitHub|GitHub]] ===
 +
* [http://stackoverflow.com/questions/14679614/whats-the-best-practice-for-putting-multiple-projects-in-a-git-repository What's the best practice for putting multiple projects in a git repository?] (Stack Overflow)
 +
: ''Solution 1'': A single repository can contain multiple independent branches, called orphan branches. Orphan branches are completely separate from each other; they do not share histories.
 +
: ''Solution 2'': Avoid all the hassle of orphan branches. Create two independent repositories, and push them to the same remote. Just use different branch names for each repo.
 +
 +
[[wikipedia:Git (software)|#git]], [[wikipedia:GitHub|#github]]
 +
 +
== See Also ==
 +
* [[Career Portfolio Tips]]
 +
* [[Computer Productivity Hacks]]
 +
* [[Mobile Apps]]
 +
* [[Office Productivity Hacks]]
 +
* [[Social Networking Tips]]
 +
* [[Spreadsheet Tricks]]
 +
* [[.US Locality Domains]]
 +
* [[User:Dave/Admin Notes]]
 +
* [[User:Dave/CV Sandbox]]
 +
* [[User:Dave/Web Hack Notes (Non-Wiki)]]
 +
* [[User:Dave/Wikis In The Enterprise]]
 +
 +
== References ==
 +
{{Reflist}}

Latest revision as of 19:49, 30 July 2019

Big Data

Good question; good answer.
A step by step process the author used to connect to a 24GB AWS instance via a laptop. Examples for Python and R users.

Business Intelligence

An exhaustive reference to problems seen in real-world data along with suggestions on how to resolve them.

BI Tools

A section from the Creative Commons book Getting the Most Out of Information Systems: A Manager's Guide.
These MySQL reporting tools fall into two broad camps – business intelligence suites where reporting is a major component, and tools that are specifically aimed at reporting. Also many of them are free.
Free Open Source Business Intelligence Solutions
Pentaho Community Edition, OpenText Actuate Information Hub, Free Edition, ReportServer, JasperReports Business Intelligence, Jedox Base, SpagoBI, ART, Pentaho Reporting, JMagallanes, OpenReports, Seal Report, Openi, NextReports, RapidMiner, Mondrian, KNIME.
Free Cloud Business Intelligence Solutions
Watson Analytics, SAP Lumira Cloud, Power BI, Microstrategy Analytics Express and Birst Express for NetSuite.
Free Proprietary Business Intelligence Solutions
EspressReport Lite, SAP Lumira, QlikView Personal Edition, InetSoft, Qlik Sense Desktop, icCube, Tableau Public.
Open Source Commercial Business Intelligence Solutions
Pentaho, Jaspersoft, Palo, Actuate Corporation, TACTIC.

Data Analysis & Data Science Tools

Datamaps.co is a free and simple platform for creating visualizations with data maps. It allows you to upload a CSV file with region data and fully customize your map's appearance. Your map chart can be saved as PNG or SVG. With datamaps.co, you can create a custom map of the World, USA, China, or Canada, with more on the way.
Wrangler allows interactive transformation of messy, real-world data into the data tables analysis tools expect. Export data for use in Excel, R, Tableau, Protovis, ...
HTSQL is designed for data analysts and other accidental programmers who have complex business inquiries to solve and need a productive tool to write and share database queries. HTSQL is free and open source software.
Jigsaw is a visual analytics system to help analysts and researchers better explore, analyze, and make sense of document collections.
Open Refine is a standalone open source desktop application for data cleanup and transformation to other formats, the activity known as data wrangling. It is similar to spreadsheet applications (and can work with spreadsheet file formats), however, it behaves more like a database.

Command Line

This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You'll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data.
Tutorial inspired by Jeroen Janssens' book Data Science at the Command Line.

IDEs in the Browser | Cloud IDEs

Also See Python in Your Browser: Tools (below).

  • Cloud9 - Your development environment, in the cloud.
Cloud9 IDE is an online integrated development environment, published as open source from version 3.0. It supports hundreds of programming languages, including PHP, Ruby, Perl, Python, JavaScript with Node.js, and Go.
Code and Collaborate in the Cloud, for Free! No installation required. Create a new environment with our cloud IDE in seconds.
Further Reading
Compares "13 of the best Cloud IDEs," including the following (to name just a few): Akshell, Cloud9, Python Fiddle

D3

D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.

Excel

Text Analysis using Excel
FEATURES EXAMPLE WORKBOOK: Analysis of Literature and Medicine Courses.xlsm
The featured workbook looks interesting and appears quite comprehensive.
A quick guide to the types of analysis you can run, ranging from easy to hard.
EXAMPLE: Use this formula to count the number of words in a text string. =LEN(A1)-LEN(SUBSTITUTE(A1," ",""))+1
Approach, algorithm, VBA code and performance optimization of the Word Cloud with Excel implementation.
FEATURES: Split Text Tool (Microsoft Excel 2007-2013, 115K)
NOTE: If the Split Text Tool downloads as a Zip file, change the file extension from .zip to .xlsm.
FEATURES FOLLOWING DOWNLOADS:
- Word Clouds Bruce Springsteen (Microsoft Excel 2007-2013, 518.4K)
- Word Cloud Generic Template (Microsoft Excel 2007-2013, 78.7K)
- Word Cloud UDF (Microsoft Excel 2007-2013, 72.4K)
NOTE: If any of the tools downloads as a Zip file, change the file extension from .zip to .xlsm.
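For comparison, the word-count formula quoted above, =LEN(A1)-LEN(SUBSTITUTE(A1," ",""))+1, works by counting spaces and adding one, so it over-counts when a cell has doubled, leading, or trailing spaces, and it returns 1 for an empty cell. A minimal Python sketch of the same logic, next to a whitespace-tolerant variant; the function names are illustrative and do not come from the linked guide.

```python
def count_words_like_excel(text: str) -> int:
    """Mirror the spreadsheet formula: number of spaces plus one."""
    return len(text) - len(text.replace(" ", "")) + 1


def count_words(text: str) -> int:
    """Whitespace-tolerant count: str.split() collapses runs of whitespace."""
    return len(text.split())


if __name__ == "__main__":
    sample = "the quick  brown fox"  # note the doubled space
    print(count_words_like_excel(sample))
    print(count_words(sample))
```

In Excel itself, wrapping the cell reference in TRIM is the usual way to get the second behavior for ordinary spaces.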

Python

Python is a very powerful programming language used for many different applications. Over time, the huge community around this open source language has created quite a few tools to efficiently work with Python. In recent years, a number of tools have been built specifically for data science. As a result, analyzing data with Python has never been easier.[1]

Python in Your Browser: Propaganda
Python 3 is great and you should use it. Here’s why...
Shows simple code examples running on Trinket.
Trinket can be used to post interactive Python code which your students can run in the browser or copy and paste for the purpose of running locally on their computer.
Python in Your Browser: Tools
Host, run, and code Python in the cloud!
The Python IDE for the web. Play around with and modify live example code. Share or demonstrate solutions to problems.
Powerful and simple online compiler, IDE, interpreter, and REPL. Write, run, save, and share code from your browser!
Skulpt is an entirely in-browser implementation of Python. No preprocessing, plugins, or server-side support required, just write Python and reload.
Python in the browser. No installation required.

Python & Spreadsheets

A python library to create, update and delete rows of data in a google spreadsheet.
  • XlsxWriter: Creating Excel files with Python and XlsxWriter
XlsxWriter is a Python module for creating Excel XLSX files.

Python Tips, Tutorials & Tricks

Python was originally a general purpose language. Over the years, with strong community support, dedicated libraries for data analysis and predictive modeling emerged.
Step-by-step instructions for setting up an environment to scan hidden services in the dark web!

Web Scraping With Python

Scrapy is an open source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.

R

R Studio

Take control of your R code. RStudio is the premier integrated development environment for R. It is available in open source and commercial editions and runs on the desktop (Windows, Mac, and Linux) or over the web with RStudio Server. Download RStudio (for Windows, Mac, or Linux).

R & Excel

In quantitative finance both R and Excel are the basic tools for any type of analysis. Whenever one has to use Excel in conjunction with R, there are many ways to approach the problem and many solutions. It depends on what you really want to do and the size of the dataset you’re dealing with.
Simplifies the creation of Excel .xlsx files by providing a high level interface to writing, styling and editing worksheets. Through the use of Rcpp, read/write times are comparable to the xlsx and XLConnect packages with the added benefit of removing the dependency on Java.

R & Google Sheets

Interact with Google Sheets from R.
Access and manage Google spreadsheets from R with googlesheets.

Shiny

A web application framework for R.
- Turn your analyses into interactive web applications.
- No HTML, CSS, or JavaScript knowledge required.

Learn Data Science

Also See: Computer Productivity Hacks#Educate_Yourself

Academic credentials are important but not necessary for high-quality data science. The core aptitudes – curiosity, intellectual agility, statistical fluency, research stamina, scientific rigor, skeptical nature – that distinguish the best data scientists are widely distributed throughout the population.

We’re likely to see more uncredentialed, inexperienced individuals try their hands at data science, bootstrapping their skills on the open-source ecosystem and using the diversity of modeling tools available. Just as data-science platforms and tools are proliferating through the magic of open source, big data’s data-scientist pool will as well.

And there’s yet another trend that will alleviate any talent gap: the democratization of data science. Autodidacts – the self-taught, uncredentialed, data-passionate people – will come to play a significant role in many organizations’ data science initiatives.
 
Learn applied skills for a career in data science. Work with data, build projects, and gain experience. (No longer free.)
Analytics Vidhya | Dataaspirant | Dr. Randal S. Olson | Domino Data Lab's blog | Entrepreneurial Geekiness | no free hunch | Pythonic Perambulations | Rayli.Net | yhat | Beckerfuffle | Carl Shan | Data Mining: Text Mining, Visualization and Social Media | FastML | Garbled Notes | Joy Of Data | MLWave | Trevor Stephens | While My MCMC Gently Samples | Zac Stewart | Machined Learnings | Deep Learning | Nuit Blanche | Machine Learning (Theory) | Large Scale Machine Learning | Revolutions | KDnuggets | Wellecks | PyImageSearch | Data Mining Research | Trey Causey | Deep Learning | Machine Learning and Data Science | i am trask | Jonas Degrave | Yanir Seroussi | Life, Language, Learning | M.E.Driscoll | Models are illuminating and wrong | Christophe Bourguignat | samim | Andrej Karpathy blog | colah's blog | Sebastian Raschka | Edwin Chen | Simply Statistics | Will do stuff for stuff | Tombone's Computer Vision Blog | I’m a bandit | Data Science 101 | R-bloggers | Stitch Fix Tech Blog | DataGenetics | Dataquest Blog
The open-source curriculum for learning Data Science. Foundational in both theory and technologies, the OSDSM breaks down the core competencies necessary to make data useful.
The Internet is Your Oyster: With Coursera, ebooks, Stack Overflow, and GitHub -- all free and open -- how can you afford not to take advantage of an open source education?
Numerous "flash card"-style data science lessons.

Hash, Random Text, Number & String Generators

Hash
Tools
Free, GNU-licensed, random custom data generator for testing software.
Generate random lists, numbers, text strings, sequences and more.
Tutorials
Sometimes you may need to generate random strings, such as different passwords. This article tries to show you some tricks to generate different random strings in Excel.
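Outside Excel, the same task can be sketched in Python using only the standard library. The example below uses the secrets module, which is intended for security-sensitive values such as passwords. The function name random_string, the default length, and the default alphabet are assumptions chosen for illustration.

```python
import secrets
import string

# Letters and digits; extend with string.punctuation if symbols are wanted.
ALPHABET = string.ascii_letters + string.digits


def random_string(length: int = 12, alphabet: str = ALPHABET) -> str:
    """Return a random string of the given length drawn from alphabet."""
    return "".join(secrets.choice(alphabet) for _ in range(length))


if __name__ == "__main__":
    # Print a few candidate passwords; output differs on every run.
    for _ in range(3):
        print(random_string(16))
```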

Regular Expressions

In theoretical computer science and formal language theory, a regular expression (abbreviated regex or regexp and sometimes called a rational expression) is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations.

Quick Tips

"The regex I use when I want to be precise is \r\n?|\n."
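The quoted pattern matches a Windows line ending (\r\n), a bare carriage return (\r), or a Unix newline (\n). A minimal Python sketch of how it might be applied; the function names split_lines and normalize_newlines are illustrative and not part of the quoted tip.

```python
import re

# \r\n? matches CRLF or a lone CR; the alternative \n matches a lone LF.
NEWLINE = re.compile(r"\r\n?|\n")


def split_lines(text: str) -> list[str]:
    """Split text on any of the three common line-ending styles."""
    return NEWLINE.split(text)


def normalize_newlines(text: str) -> str:
    """Rewrite every line ending as a single LF."""
    return NEWLINE.sub("\n", text)


if __name__ == "__main__":
    mixed = "a\r\nb\rc\nd"
    print(split_lines(mixed))
    print(repr(normalize_newlines(mixed)))
```

Because \r\n? is tried before \n, a CRLF pair is consumed as one line ending rather than two.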

RegEx Tutorials

Includes:
- Interactive Tutorial › Learn to use Regular Expressions
- Practical Examples › Practice your Regular Expressions
- RegEx Cheatsheet › Regular Expressions in PHP & More
Any non-trivial regex looks daunting to anybody not familiar with them. But with just a bit of experience, you will soon be able to craft your own regular expressions like you have never done anything else.
Regex is the gift that keeps giving. Once you learn it, you discover it comes in handy in many places where you hadn't planned to use it.
This is an in-progress book that quickly teaches you regular expressions.

RegEx Tools

Regex101 allows you to create, debug, test and have your expressions explained for PHP, PCRE, JavaScript and Python. The website also features a community where you can share useful expressions.
Regular expression tester with syntax highlighting, contextual help, video tutorial, reference, and searchable community patterns.
Pythex is a real-time regular expression editor for Python, a quick way to test your regular expressions.
JavaScript regex tester. Highlights matches on the fly.

Small Data

Small Data is everything Big Data is not.

You Might Not Be Big Data If

- You were generated through human data entry. (Big Data came about in order to handle the exponential growth of machine-generated data, because we humans aren’t fast enough to outpace a good old relational database).
- You are an operational database. For instance, CRM is never Big Data, and ERP is never Big Data.
- You fit just fine in a MySQL database. Even if you have to put a lot of RAM in it, it’s still not Big Data.[2]

For the majority of small and medium businesses, Big Data is the technology of the future, not the reality they experience today. To be blunt, most SMBs don’t even have a handle on the Small Data they’re already creating and collecting themselves. (And if many enterprise organizations are being honest, neither do they.) According to Forrester Research, most companies are analyzing a mere 12% of their existing data. That leaves a whopping 88% of data that businesses are flat out ignoring. Can you imagine the potential of actually leveraging that existing data to derive data-driven business insights? Instead of chasing the Big Data dream, SMBs should consider picking up the dollars that are effectively lying on the floor, and invest first in leveraging their Small Data.
Just as we now find it ludicrous to talk of "big software" – as if size in itself were a measure of value – we should, and will one day, find it equally odd to talk of "big data". Size in itself doesn't matter – what matters is having the data, of whatever size, that helps us solve a problem or address the question we have.
What is small data, you ask? Small data is a dataset that contains very specific attributes. Small data is used to determine current states and conditions or may be generated by analyzing larger data sets.

SQL

Cheat Sheets

Simple-Talk's free wallchart of the most important SSMS keyboard shortcuts aims to help find all those curiously forgettable key combinations within SQL Server Management Studio that unlock the hidden magic that is available for editing and executing queries.

Fiddle with SQL in the Browser

I got this one to work. | Dave (talk) 18:01, 7 December 2016 (PST)
At Rextester you can run not only PostgreSQL but also several other SQL flavors (SQL Server, MySQL, Oracle). Rextester also offers options for several other programming languages.

SQL & PowerShell

Scripting is very powerful. And for me, one of the best scripting languages is PowerShell (PoSH). Yes, PoSH takes a bit of getting used to, but once you pass the initial learning curve, you end up with a powerful tool in your hands.
Multiple examples; approaches.
Download the latest version of this PowerShell™ wallchart and read the accompanying in-depth article from Simple-Talk.
Also See:

Text Extraction

PDF Conversion

Free online PDF converter that helps with your everyday PDF tasks, making it easy to compress, convert, join and edit PDF files online.
Editor's Note: Used this online service to join two PDFs.
Capture2Text conceptual illustration.
Capture2Text enables users to quickly OCR a portion of the screen using a keyboard shortcut. The resulting text will be saved to the clipboard by default.
Detexter is an app designed to extract text from PDF files.
I used this service to successfully convert the .US Locality Domain Name Registration Terms and Conditions form from PDF to Word format.
Lists several options.
Convert PDF to HTML without losing text or format.
Note: A domain squatter has apparently taken over the original pdftotext.org domain, but the service is still available here: http://pdftotext.github.io/
pdftotext.org is the best online service for easily extracting text from your PDF files. Conversion from PDF to TXT is really fast thanks to our in-browser conversion architecture. Your PDF files are never uploaded to the Internet, so even private PDF files are safe to convert with this service. The conversion is done locally in your browser – you can even convert when you are offline! There is no need for any registration or sign-up, and the service will always be free to use.
Extracts plain text from documents in all popular formats.
Tabula is a tool for liberating data tables locked inside PDF files. Tabula allows you to extract that data into a CSV or Microsoft Excel spreadsheet using a simple, easy-to-use interface. Tabula works on Mac, Windows and Linux.
Also See
Office Productivity Hacks#Portable Document Format (PDF)

Data Scrape ; Web Scrape

Tools and tips compiled by journalists from PBS and Omaha World-Herald.
Scrapinghub's list of open source scraping projects.
Monitor website links with ease. Sitestalker supervises websites and notifies you when your desired content hits the web. Stop wasting your time constantly refreshing websites.
Sitestalker is great for:
Finding jobs
Searching for an apartment
Getting the best bargains
Clipping
Tools for gathering data from public sources.
Web Scraping tools are specifically developed for extracting information from websites. They are also known as web harvesting tools or web data extraction tools. These tools are useful for anyone trying to collect some form of data from the Internet. Web Scraping is the new data entry technique that doesn’t require repetitive typing or copy-pasting.
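To make the idea concrete, here is a minimal sketch of the extraction step such tools automate, using only Python's standard html.parser module. It pulls link targets out of an HTML string. The class name LinkExtractor, the helper extract_links, and the sample markup are all made up for illustration; a real scraper would also fetch pages and respect each site's terms of use.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str) -> list[str]:
    """Return all link targets found in the given HTML text."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page used only for this sketch.
SAMPLE_HTML = (
    '<ul><li><a href="https://example.com/a">A</a></li>'
    '<li><a href="/b">B</a></li>'
    '<li><a name="x">no href</a></li></ul>'
)

if __name__ == "__main__":
    print(extract_links(SAMPLE_HTML))
```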
R Scraper
Watch (video) how easy it is to import data from a Web page into R.

Text Search

Makes tools to search text content, including:
  1. FALCON - Text Search Java Project: JSON based text search Java Project
  2. HAWK - PDF Text Search Java Project: Taking initiative for Document Text Search
Xpdf is an open source viewer for Portable Document Format (PDF) files.
Windows installer: Short Programs/Scripts (Look for the xpdf3.exe / poppler.exe links in left sidebar.)

Text Transformation

Do you have a list of text strings whose format you want to modify? Copy and paste the list into the box, then provide an example of how you want each text string formatted. The hope is that Transformy will do the rest.

Workflow

Git, GitHub

Solution 1: A single repository can contain multiple independent branches, called orphan branches. Orphan branches are completely separate from each other; they do not share histories.
Solution 2: Avoid all the hassle of orphan branches. Create two independent repositories, and push them to the same remote. Just use different branch names for each repo.

#git, #github

See Also

References