|
4 | 4 | Quick Start |
5 | 5 | =========== |
6 | 6 |
|
| 7 | +.. facet:: |
| 8 | + :name: genre |
| 9 | + :values: reference |
| 10 | + |
| 11 | +.. meta:: |
| 12 | + :keywords: tutorial, introduction, setup, begin |
| 13 | + |
7 | 14 | This tutorial is intended as an introduction to working with |
8 | | -**PyMongoArrow**. The reader is assumed to be familiar with basic |
9 | | -`PyMongo <https://pymongo.readthedocs.io/en/stable/tutorial.html>`_ and |
10 | | -`MongoDB <https://docs.mongodb.com>`_ concepts. |
| 15 | +**{+driver-short+}**. It assumes that you are familiar with basic
| 16 | +`PyMongo <https://pymongo.readthedocs.io/en/stable/tutorial.html>`__ and |
| 17 | +`MongoDB <https://docs.mongodb.com>`__ concepts. |
11 | 18 |
|
12 | 19 | Prerequisites |
13 | 20 | ------------- |
14 | | -Before we start, make sure that you have the **PyMongoArrow** distribution |
15 | | -:doc:`installed <installation>`. In the Python shell, the following should |
16 | | -run without raising an exception:: |
17 | 21 |
|
18 | | - import pymongoarrow as pma |
| 22 | +Ensure that you have the {+driver-short+} distribution |
| 23 | +:ref:`installed <pymongo-arrow-install>`. In the Python shell, the following should |
| 24 | +run without raising an exception: |
| 25 | + |
| 26 | +.. code-block:: python |
| 27 | + |
| 28 | + >>> import pymongoarrow as pma |
19 | 29 |
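| | +If the import succeeds, you can also print the installed version. The
| | +following is a sketch; the conventional ``__version__`` attribute is assumed
| | +here:
| | +
| | +.. code-block:: python
| | +
| | +   >>> import pymongoarrow as pma
| | +   >>> print(pma.__version__)  # conventional version attribute; assumed here
| | +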
|
20 | 30 | This tutorial also assumes that a MongoDB instance is running on the |
21 | | -default host and port. Assuming you have `downloaded and installed |
22 | | -<https://docs.mongodb.com/manual/installation/>`_ MongoDB, you can start |
23 | | -it like so: |
| 31 | +default host and port. After you have `downloaded and installed |
| 32 | +<https://docs.mongodb.com/manual/installation/>`__ MongoDB, you can start |
| 33 | +it as shown in the following code example: |
24 | 34 |
|
25 | 35 | .. code-block:: bash |
26 | 36 |
|
27 | 37 | $ mongod |
28 | 38 |
|
29 | 39 | Extending PyMongo |
30 | 40 | ~~~~~~~~~~~~~~~~~ |
31 | | -The :mod:`pymongoarrow.monkey` module provides an interface to patch PyMongo, |
32 | | -in place, and add **PyMongoArrow**'s functionality directly to |
33 | | -:class:`~pymongo.collection.Collection` instances:: |
34 | 41 |
|
35 | | - from pymongoarrow.monkey import patch_all |
36 | | - patch_all() |
| 42 | +The ``pymongoarrow.monkey`` module provides an interface to patch PyMongo
| 43 | +in place and add {+driver-short+} functionality directly to
| 44 | +``Collection`` instances:
| 45 | + |
| 46 | +.. code-block:: python |
37 | 47 |
|
38 | | -After running :meth:`~pymongoarrow.monkey.patch_all`, new instances of |
39 | | -:class:`~pymongo.collection.Collection` will have PyMongoArrow's APIs, |
40 | | -e.g. :meth:`~pymongoarrow.api.find_pandas_all`. |
| 48 | + from pymongoarrow.monkey import patch_all |
| 49 | + patch_all() |
41 | 50 |
|
42 | | -.. note:: Users can also directly use any of **PyMongoArrow**'s APIs |
43 | | - by importing them from :mod:`pymongoarrow.api`. The only difference in |
44 | | - usage would be the need to manually pass the instance of |
45 | | - :class:`~pymongo.collection.Collection` on which the operation is to be |
46 | | - run as the first argument when directly using the API method. |
| 51 | +After you run the ``monkey.patch_all()`` function, new instances of
| 52 | +the ``Collection`` class contain the {+driver-short+} APIs,
| 53 | +such as the ``pymongoarrow.api.find_pandas_all()`` method.
| 54 | + |
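| | +For example, after patching, you can call the ``find_pandas_all()`` method
| | +directly on a collection. The following is a minimal sketch; the database and
| | +collection names are placeholders:
| | +
| | +.. code-block:: python
| | +
| | +   from pymongo import MongoClient
| | +   from pymongoarrow.monkey import patch_all
| | +
| | +   patch_all()
| | +   # Collections obtained after patch_all() expose the PyMongoArrow APIs
| | +   coll = MongoClient().db.data
| | +   df = coll.find_pandas_all({})  # load every document as a DataFrame
| | +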
| 55 | +.. note:: |
| 56 | + |
| 57 | +   You can also use any of the {+driver-short+} APIs
| 58 | +   by importing them from the ``pymongoarrow.api`` module. If you do,
| 59 | +   you must pass the ``Collection`` instance on which to run the operation
| 60 | +   as the first argument of each API method.
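| | +
| | +   For example, the following sketch calls ``find_pandas_all()`` as a plain
| | +   function imported from ``pymongoarrow.api``, passing the collection
| | +   explicitly; the database and collection names are placeholders:
| | +
| | +   .. code-block:: python
| | +
| | +      from pymongo import MongoClient
| | +      from pymongoarrow.api import find_pandas_all
| | +
| | +      coll = MongoClient().db.data
| | +      # Pass the Collection instance as the first argument
| | +      df = find_pandas_all(coll, {})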
47 | 61 |
|
48 | 62 | Test Data |
49 | 63 | ~~~~~~~~~ |
50 | | -Before we begin, we must first add some data to our cluster that we can |
51 | | -query. We can do so using **PyMongo**:: |
52 | | - |
53 | | - from datetime import datetime |
54 | | - from pymongo import MongoClient |
55 | | - client = MongoClient() |
56 | | - client.db.data.insert_many([ |
57 | | - {'_id': 1, 'amount': 21, 'last_updated': datetime(2020, 12, 10, 1, 3, 1), 'account': {'name': 'Customer1', 'account_number': 1}, 'txns': ['A']}, |
58 | | - {'_id': 2, 'amount': 16, 'last_updated': datetime(2020, 7, 23, 6, 7, 11), 'account': {'name': 'Customer2', 'account_number': 2}, 'txns': ['A', 'B']}, |
59 | | - {'_id': 3, 'amount': 3, 'last_updated': datetime(2021, 3, 10, 18, 43, 9), 'account': {'name': 'Customer3', 'account_number': 3}, 'txns': ['A', 'B', 'C']}, |
60 | | - {'_id': 4, 'amount': 0, 'last_updated': datetime(2021, 2, 25, 3, 50, 31), 'account': {'name': 'Customer4', 'account_number': 4}, 'txns': ['A', 'B', 'C', 'D']}]) |
| 64 | + |
| 65 | +The following code uses PyMongo to add sample data to your cluster: |
| 66 | + |
| 67 | +.. code-block:: python |
| 68 | + |
| 69 | + from datetime import datetime |
| 70 | + from pymongo import MongoClient |
| 71 | + client = MongoClient() |
| 72 | + client.db.data.insert_many([ |
| 73 | + {'_id': 1, 'amount': 21, 'last_updated': datetime(2020, 12, 10, 1, 3, 1), 'account': {'name': 'Customer1', 'account_number': 1}, 'txns': ['A']}, |
| 74 | + {'_id': 2, 'amount': 16, 'last_updated': datetime(2020, 7, 23, 6, 7, 11), 'account': {'name': 'Customer2', 'account_number': 2}, 'txns': ['A', 'B']}, |
| 75 | + {'_id': 3, 'amount': 3, 'last_updated': datetime(2021, 3, 10, 18, 43, 9), 'account': {'name': 'Customer3', 'account_number': 3}, 'txns': ['A', 'B', 'C']}, |
| 76 | + {'_id': 4, 'amount': 0, 'last_updated': datetime(2021, 2, 25, 3, 50, 31), 'account': {'name': 'Customer4', 'account_number': 4}, 'txns': ['A', 'B', 'C', 'D']}]) |
61 | 77 |
|
62 | 78 | Defining the Schema |
63 | 79 | ------------------- |
64 | | -**PyMongoArrow** relies upon a data schema to marshall |
65 | | -query result sets into tabular form. This schema can either be automatically inferred from the data, |
66 | | -or provided by the user. Users can define the schema by |
67 | | -instantiating :class:`pymongoarrow.api.Schema` using a mapping of field names |
68 | | -to type-specifiers, e.g.:: |
69 | 80 |
|
70 | | - from pymongoarrow.api import Schema |
71 | | - schema = Schema({'_id': int, 'amount': float, 'last_updated': datetime}) |
| 81 | +{+driver-short+} relies on a data schema to marshal
| 82 | +query result sets into tabular form. If you don't provide this schema, {+driver-short+}
| 83 | +infers one from the data. You can define the schema by
| 84 | +creating a ``Schema`` object that maps field names
| 85 | +to type-specifiers, as shown in the following example:
| 86 | + |
| 87 | +.. code-block:: python |
| 88 | + |
| | +   from datetime import datetime
| 89 | +   from pymongoarrow.api import Schema
| 90 | +   schema = Schema({'_id': int, 'amount': float, 'last_updated': datetime})
72 | 91 |
|
| 92 | +MongoDB uses embedded documents to represent nested data. {+driver-short+} offers |
| 93 | +first-class support for these documents: |
73 | 94 |
|
74 | | -PyMongoArrow offers first-class support for Nested data (embedded documents):: |
| 95 | +.. code-block:: python |
75 | 96 |
|
76 | | - schema = Schema({'_id': int, 'amount': float, 'account': { 'name': str, 'account_number': int}}) |
| 97 | + schema = Schema({'_id': int, 'amount': float, 'account': { 'name': str, 'account_number': int}}) |
77 | 98 |
|
78 | | -Lists (and nested lists) are also supported:: |
| 99 | +{+driver-short+} also supports lists and nested lists: |
79 | 100 |
|
80 | | - from pyarrow import list_, string |
81 | | - schema = Schema({'txns': list_(string())}) |
82 | | - polars_df = client.db.data.find_polars_all({'amount': {'$gt': 0}}, schema=schema) |
| 101 | +.. code-block:: python |
83 | 102 |
|
84 | | -There are multiple permissible type-identifiers for each supported BSON type. |
85 | | -For a full-list of data types and associated type-identifiers see |
86 | | -:doc:`data_types`. |
| 103 | + from pyarrow import list_, string |
| 104 | + schema = Schema({'txns': list_(string())}) |
| 105 | + polars_df = client.db.data.find_polars_all({'amount': {'$gt': 0}}, schema=schema) |
87 | 106 |
|
| 107 | +.. tip:: |
| 108 | + |
| 109 | + {+driver-short+} includes multiple permissible type-identifiers for each supported BSON |
| 110 | + type. For a full list of these data types and their associated type-identifiers, see |
| 111 | +   :ref:`pymongo-arrow-data-types`.
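| | +
| | +   For example, Python's ``int`` and PyArrow's ``int64()`` identify the same
| | +   64-bit integer type, so the following two schemas should be equivalent (a
| | +   sketch, assuming the standard PyArrow type factory):
| | +
| | +   .. code-block:: python
| | +
| | +      from pyarrow import int64
| | +      from pymongoarrow.api import Schema
| | +
| | +      schema1 = Schema({'_id': int})      # Python type-identifier
| | +      schema2 = Schema({'_id': int64()})  # equivalent PyArrow type-identifier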
88 | 112 |
|
89 | 113 | Find Operations |
90 | 114 | --------------- |
91 | | -We are now ready to query our data. Let's start by running a ``find`` |
92 | | -operation to load all records with a non-zero ``amount`` as a |
93 | | -:class:`pandas.DataFrame`:: |
94 | 115 |
|
95 | | - df = client.db.data.find_pandas_all({'amount': {'$gt': 0}}, schema=schema) |
| 116 | +The following code example shows how to load all records that have a non-zero |
| 117 | +value for the ``amount`` field as a ``pandas.DataFrame`` object: |
| 118 | + |
| 119 | +.. code-block:: python |
| 120 | + |
| 121 | + df = client.db.data.find_pandas_all({'amount': {'$gt': 0}}, schema=schema) |
| 122 | + |
| 123 | +You can also load the same result set as a ``pyarrow.Table`` instance: |
96 | 124 |
|
97 | | -We can also load the same result set as a :class:`pyarrow.Table` instance:: |
| 125 | +.. code-block:: python |
98 | 126 |
|
99 | | - arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema) |
| 127 | + arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema) |
100 | 128 |
|
101 | | -a :class:`polars.DataFrame`:: |
| 129 | +Or as a ``polars.DataFrame`` instance: |
102 | 130 |
|
103 | | - df = client.db.data.find_polars_all({'amount': {'$gt': 0}}, schema=schema) |
| 131 | +.. code-block:: python |
104 | 132 |
|
105 | | -or as **Numpy arrays**:: |
| 133 | + df = client.db.data.find_polars_all({'amount': {'$gt': 0}}, schema=schema) |
106 | 134 |
|
107 | | - ndarrays = client.db.data.find_numpy_all({'amount': {'$gt': 0}}, schema=schema) |
| 135 | +Or as NumPy arrays:
108 | 136 |
|
109 | | -In the NumPy case, the return value is a dictionary where the keys are field |
110 | | -names and values are corresponding :class:`numpy.ndarray` instances. |
| 137 | +.. code-block:: python |
| 138 | + |
| 139 | + ndarrays = client.db.data.find_numpy_all({'amount': {'$gt': 0}}, schema=schema) |
| 140 | + |
| 141 | +When you use the NumPy APIs, the return value is a dictionary where the keys are field
| 142 | +names and the values are the corresponding ``numpy.ndarray`` instances.
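| | +
| | +For example, you can index the dictionary by field name and then use ordinary
| | +NumPy operations (this sketch assumes the ``amount`` field from the preceding
| | +query):
| | +
| | +.. code-block:: python
| | +
| | +   amounts = ndarrays['amount']  # numpy.ndarray of amount values
| | +   print(amounts.sum())          # aggregate with ordinary NumPy operations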
111 | 143 |
|
112 | 144 | .. note:: |
113 | 145 |
|
114 | | - For all of the examples above, the schema can be omitted like so:: |
| 146 | +   In all of the preceding examples, you can omit the schema, as shown in the
| 147 | +   following example:
115 | 148 |
|
116 | | - arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}) |
| 149 | + .. code-block:: python |
117 | 150 |
|
118 | | - In this case, PyMongoArrow will try to automatically apply a schema based on |
119 | | - the data contained in the first batch. |
| 151 | + arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}) |
120 | 152 |
|
| 153 | + If you omit the schema, {+driver-short+} tries to automatically apply a schema based on |
| 154 | + the data contained in the first batch. |
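| | +
| | +   To see which schema was applied, you can inspect the result's metadata;
| | +   for example, a ``pyarrow.Table`` exposes it through its ``schema``
| | +   attribute:
| | +
| | +   .. code-block:: python
| | +
| | +      print(arrow_table.schema)  # field names and inferred Arrow types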
121 | 155 |
|
122 | 156 | Aggregate Operations |
123 | 157 | -------------------- |
124 | | -Running an ``aggregate`` operation is similar to ``find``, but it takes a sequence of operations to perform. |
125 | | -Here is a simple example of ``aggregate_pandas_all`` that outputs a new dataframe |
126 | | -in which all ``_id`` values are grouped together and their ``amount`` values summed:: |
127 | 158 |
|
128 | | - df = client.db.data.aggregate_pandas_all([{'$group': {'_id': None, 'total_amount': { '$sum': '$amount' }}}]) |
| 159 | +Running an aggregate operation is similar to running a find operation, but it takes a
| 160 | +sequence of aggregation stages to perform.
| 161 | + |
| 162 | +The following is a simple example of the ``aggregate_pandas_all()`` method. It outputs a
| 163 | +new dataframe in which all documents are grouped together and their ``amount`` values
| 164 | +are summed:
| 165 | + |
| 166 | +.. code-block:: python |
| 167 | + |
| 168 | + df = client.db.data.aggregate_pandas_all([{'$group': {'_id': None, 'total_amount': { '$sum': '$amount' }}}]) |
| 169 | + |
| 170 | +You can also run aggregate operations on embedded documents. |
| 171 | +The following example unwinds values in the nested ``txns`` field, counts the
| 172 | +occurrences of each value, sorts the results by count in descending order, and
| 173 | +returns them as a dictionary of NumPy ``ndarray`` objects:
129 | 174 |
|
130 | | -Nested data (embedded documents) are also supported. |
131 | | -In this more complex example, we unwind values in the nested ``txn`` field, count the number of each, |
132 | | -then return as a list of numpy ndarrays sorted in decreasing order:: |
| 175 | +.. code-block:: python |
133 | 176 |
|
134 | | - pipeline = [{'$unwind': '$txns'}, {'$group': {'_id': '$txns', 'count': {'$sum': 1}}}, {'$sort': {"count": -1}}] |
135 | | - ndarrays = client.db.data.aggregate_numpy_all(pipeline) |
| 177 | + pipeline = [{'$unwind': '$txns'}, {'$group': {'_id': '$txns', 'count': {'$sum': 1}}}, {'$sort': {"count": -1}}] |
| 178 | + ndarrays = client.db.data.aggregate_numpy_all(pipeline) |
136 | 179 |
|
137 | | -More information on aggregation pipelines can be found `here <https://www.mongodb.com/docs/manual/core/aggregation-pipeline/>`_. |
| 180 | +.. tip:: |
| 181 | + |
| 182 | + For more information about aggregation pipelines, see the |
| 183 | + :manual:`MongoDB Server documentation </core/aggregation-pipeline/>`. |
138 | 184 |
|
139 | 185 | Writing to MongoDB |
140 | 186 | ------------------ |
141 | | -All of these types, Arrow's :class:`~pyarrow.Table`, Pandas' |
142 | | -:class:`~pandas.DataFrame`, NumPy's :class:`~numpy.ndarray`, or :class:`~polars.DataFrame` can |
143 | | -be easily written to your MongoDB database using the :meth:`~pymongoarrow.api.write` function:: |
144 | 187 |
|
145 | | - from pymongoarrow.api import write |
146 | | - from pymongo import MongoClient |
147 | | - coll = MongoClient().db.my_collection |
148 | | - write(coll, df) |
149 | | - write(coll, arrow_table) |
150 | | - write(coll, ndarrays) |
| 188 | +You can use the ``pymongoarrow.api.write()`` function to write objects of the following types to MongoDB:
151 | 189 |
|
152 | | -(Keep in mind that NumPy arrays are specified as ``dict[str, ndarray]``.) |
| 190 | +- Arrow ``Table`` |
| 191 | +- Pandas ``DataFrame`` |
| 192 | +- NumPy ``ndarray`` |
| 193 | +- Polars ``DataFrame`` |
| 194 | + |
| 195 | +.. code-block:: python |
| 196 | + |
| 197 | + from pymongoarrow.api import write |
| 198 | + from pymongo import MongoClient |
| 199 | + coll = MongoClient().db.my_collection |
| 200 | + write(coll, df) |
| 201 | + write(coll, arrow_table) |
| 202 | + write(coll, ndarrays) |
| 203 | + |
| 204 | +.. note:: |
| 205 | + |
| 206 | + NumPy arrays are specified as ``dict[str, ndarray]``. |
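| | +
| | +   For example, the following sketch writes two fields (the names and values
| | +   are placeholders); all arrays must have the same length:
| | +
| | +   .. code-block:: python
| | +
| | +      import numpy as np
| | +      write(coll, {'amount': np.array([1.5, 2.5]),
| | +                   'account_number': np.array([5, 6])})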
153 | 207 |
|
154 | 208 | Writing to Other Formats |
155 | 209 | ------------------------ |
156 | | -Once result sets have been loaded, one can then write them to any format that the package supports. |
157 | 210 |
|
158 | | -For example, to write the table referenced by the variable ``arrow_table`` to a Parquet |
159 | | -file ``example.parquet``, run:: |
160 | | - |
161 | | - import pyarrow.parquet as pq |
162 | | - pq.write_table(arrow_table, 'example.parquet') |
| 211 | +Once result sets have been loaded, you can write them to any format that the package
| 212 | +supports.
163 | 213 |
|
164 | | -Pandas also supports writing :class:`~pandas.DataFrame` instances to a variety |
165 | | -of formats including CSV, and HDF. To write the data frame |
166 | | -referenced by the variable ``df`` to a CSV file ``out.csv``, for example, run:: |
| 214 | +For example, to write the table referenced by the variable ``arrow_table`` to a Parquet |
| 215 | +file named ``example.parquet``, run the following code: |
167 | 216 |
|
168 | | - df.to_csv('out.csv', index=False) |
| 217 | +.. code-block:: python |
| 218 | + |
| 219 | + import pyarrow.parquet as pq |
| 220 | + pq.write_table(arrow_table, 'example.parquet') |
169 | 221 |
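| | +To verify the file, you can read it back with ``pyarrow.parquet.read_table()``:
| | +
| | +.. code-block:: python
| | +
| | +   table = pq.read_table('example.parquet')
| | +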
|
170 | | -The Polars API is a mix of the two:: |
| 222 | +Pandas also supports writing ``DataFrame`` instances to a variety |
| 223 | +of formats, including CSV and HDF. To write the data frame |
| 224 | +referenced by the variable ``df`` to a CSV file named ``out.csv``, run the following |
| 225 | +code: |
171 | 226 |
|
| 227 | +.. code-block:: python |
| 228 | + |
| 229 | + df.to_csv('out.csv', index=False) |
172 | 230 |
|
173 | | - import polars as pl |
174 | | - df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]}) |
175 | | - df.write_parquet('example.parquet') |
| 231 | +The Polars API combines elements of the two preceding examples:
176 | 232 |
|
| 233 | +.. code-block:: python |
| 234 | + |
| 235 | + import polars as pl |
| 236 | + df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]}) |
| 237 | + df.write_parquet('example.parquet') |
177 | 238 |
|
178 | 239 | .. note:: |
179 | 240 |
|
180 | | - Nested data is supported for parquet read/write but is not well supported |
181 | | - by Arrow or Pandas for CSV read/write. |
| 241 | +   Nested data is supported for Parquet read and write operations, but is not well
| 242 | +   supported by Arrow or Pandas for CSV read and write operations.