Big Data Development

Whatever your data requirements, our experts will choose the most suitable technology and apply state-of-the-art methods to gather what you need.

Data Mining & Analysis

  • Orange: A Python-based, open-source machine learning and data visualization toolbox, well suited to interactive data analysis workflows.
  • Weka 3: Java-based, open-source data mining software. Weka 3 includes tools for data pre-processing, classification, regression, clustering, association rules, and visualization.
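The classification workflows these toolboxes support can be sketched in plain Python. Below is a minimal 1-nearest-neighbour classifier; the toy dataset and labels are invented for illustration, and Orange or Weka provide far richer versions of the same idea.

```python
import math

def nearest_neighbour(train, labels, point):
    """Classify `point` by the label of its closest training example."""
    dists = [math.dist(x, point) for x in train]
    return labels[dists.index(min(dists))]

# Toy 2-D dataset: two loose clusters (illustrative values only).
train = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
labels = ["low", "low", "high", "high"]

print(nearest_neighbour(train, labels, (0.1, 0.3)))  # near the first cluster
print(nearest_neighbour(train, labels, (4.8, 5.0)))  # near the second cluster
```

In a real project the distance metric, the number of neighbours, and the feature preparation all matter; toolboxes like Orange expose those choices interactively.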

Web Crawler Tool

  • Scrapy is an open-source crawling framework for extracting data from websites. It is fast and simple, yet powerful and extensible. Written in Python, it can run on local PCs or cloud servers.
  • Pyspider is another powerful Python web crawler system. It supports a wide range of backend databases and lets you define rules such as task priority, retries, and periodic re-crawling.
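At their core, crawlers like Scrapy and Pyspider fetch pages and extract links to follow. That extraction step can be sketched with Python's standard library alone; the HTML below is a made-up sample page.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A made-up page; a real crawler would download this over HTTP.
page = '<html><body><a href="/about">About</a> <a href="https://example.com/docs">Docs</a></body></html>'

extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # ['/about', 'https://example.com/docs']
```

A full framework layers scheduling, retries, deduplication, and politeness (robots.txt, rate limits) on top of this step, which is exactly what Scrapy and Pyspider provide.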

Data Visualization

  • D3: A powerful JavaScript visualization library. Built on the widely implemented SVG, HTML5, and CSS standards, it allows considerable control over the final visual result.
  • ECharts: A free, powerful charting and visualization library that offers a simple way to add intuitive, interactive, and highly customizable charts to your commercial products. It is written in pure JavaScript and built on ZRender, a lightweight canvas library that provides 2D drawing for ECharts. ZRender shapes include: Arc, BezierCurve, Circle, Droplet, Ellipse, Heart, Isogon, Line, Polygon, Polyline, Rect, Ring, Rose, Sector, Star, and Trochoid.
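ECharts charts are driven by a declarative option object rather than imperative drawing calls. The shape of that object can be sketched as a Python dict serialized to JSON; the categories and values below are invented sample data.

```python
import json

# Minimal ECharts-style option for a bar chart (values are invented).
option = {
    "xAxis": {"type": "category", "data": ["Mon", "Tue", "Wed"]},
    "yAxis": {"type": "value"},
    "series": [{"type": "bar", "data": [120, 200, 150]}],
}

# Serialized, this is what a page would pass to ECharts' setOption().
option_json = json.dumps(option)
print(option_json)
```

Because the configuration is plain data, a backend can generate it server-side and hand it to the browser, which keeps chart logic out of the front-end code.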

Data Storage & Computing

  • We use Hadoop for reliable, scalable, distributed computing and storage of large datasets. On top of Hadoop we use Apache Pig, a platform for analyzing large data sets that couples a high-level language for expressing data analysis programs with infrastructure for evaluating those programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. We also use Apache Hive, a data warehouse project built on top of Apache Hadoop that provides data summarization, query, and analysis. Hive offers an SQL-like interface to query data stored in the various databases and file systems that integrate with Hadoop.
  • Apache Spark is a fast, in-memory data processing engine with elegant, expressive development APIs that let data workers efficiently execute streaming, machine learning, or SQL workloads requiring fast iterative access to datasets. With Spark running on Apache Hadoop YARN, developers can create applications that exploit Spark's power, derive insights, and enrich their data science workloads within a single, shared dataset in Hadoop.
  • Created by the Google Brain team, TensorFlow is an open-source library for numerical computation and large-scale machine learning. TensorFlow bundles together a slew of machine learning and deep learning (neural network) models and algorithms and makes them usable through a common set of abstractions. It uses Python to provide a convenient front-end API for building applications with the framework, while executing those applications in high-performance C++.
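The MapReduce model underlying Hadoop (and the map/reduce-style transformations Spark generalizes) can be sketched in plain Python: a map phase emits (word, 1) pairs, a shuffle groups them by key, and a reduce phase sums each group. The input lines below are invented sample text; a real job would read from HDFS and run the phases in parallel across a cluster.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Invented sample input standing in for files on HDFS.
lines = ["big data needs big tools", "data tools scale"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"], counts["data"], counts["tools"])  # 2 2 2
```

Because each map call touches one record and each reduce call touches one key's group, the framework can distribute both phases across many machines, which is what lets this simple pattern scale to very large datasets.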